[CUDA] Determining threads per block, number of blocks and grid

abhishekdey1985 · Post by **abhishekdey1985** » Mon Mar 15, 2010 3:12 pm

Hi everyone,

How do I determine the threads per block and the number of blocks on a CUDA device?

For example, I need to multiply two one-dimensional arrays A and B element by element and copy the result into array C.

Code:

int N = 10; //Array Containing Maximum of 10 elements
size_t size = N*sizeof(float);
...
cudaMalloc((void **)&a_d, size);
cudaMalloc((void **)&b_d, size);
cudaMalloc((void **)&c_d, size);
...
...
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

//How to determine no. of threads here???
int threadsPerBlock = ???
int noOfBlocks = ??

fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d);

cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost); //Destination (host) first, source (device) second
...
...

hybrid · Post by **hybrid** » Mon Mar 15, 2010 4:36 pm

Just query the CudaInformation struct. Its numbers give you the maximum number of blocks and so on, but since you are still on the host, you can set the desired number of threads and blocks in your kernel calls later on.
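
For reference, here is a minimal sketch of reading those limits directly through the CUDA runtime. It assumes the plain runtime call `cudaGetDeviceProperties` rather than the CudaInformation struct mentioned above, and it assumes device 0 is the card of interest. The helper `clampThreadsPerBlock` is a hypothetical name introduced only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: keep a requested block size within the device limit.
int clampThreadsPerBlock(int requested, int maxThreadsPerBlock)
{
    return requested < maxThreadsPerBlock ? requested : maxThreadsPerBlock;
}

int main()
{
    int device = 0;  // assumption: the first CUDA device is the one to query
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);

    // These are upper bounds only; the launch configuration is still your choice.
    int threads = clampThreadsPerBlock(2048, prop.maxThreadsPerBlock);
    printf("Requested 2048 threads per block, clamped to %d\n", threads);
    return 0;
}
```

The values printed are limits, not recommendations: any block size up to `maxThreadsPerBlock` is legal, and a multiple of the warp size is the usual starting point.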

abhishekdey1985 · Post by **abhishekdey1985** » Tue Mar 16, 2010 7:31 am

I meant: on what basis should the threads per block be decided? The same program can run with 2 blocks and 10 threads per block, or with 1 block and 20 threads per block.

Also, I get garbage values when accessing the second array passed to the kernel.

Code Snippet:

__global__ void fmultiply(float *A, float *B, float *C )
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	//if(idx<10)
		C[idx] = A[idx]*B[idx];
}

I need suggestions on accessing device memory.
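
One possible source of garbage here, though not necessarily the only one, is that any launch with more threads than array elements (20 threads for 10 elements, for instance) lets the extra threads read and write past the end of the arrays while the bounds check is commented out. Below is a minimal sketch of a guarded version. The names `fmultiplyGuarded` and `blocksFor` and the extra parameter `n` are assumptions added for illustration, and the block size of 4 is arbitrary, picked so that the last block is only partly used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of fmultiply that also receives the element count n.
__global__ void fmultiplyGuarded(const float *A, const float *B, float *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)  // threads past the end of the arrays do nothing
        C[idx] = A[idx] * B[idx];
}

// Number of blocks needed to cover n elements (integer division, rounded up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int N = 10;
    const size_t size = N * sizeof(float);
    float a_h[N], b_h[N], c_h[N];
    for (int i = 0; i < N; ++i) { a_h[i] = (float)i; b_h[i] = 2.0f; }

    float *a_d = NULL, *b_d = NULL, *c_d = NULL;
    cudaMalloc((void **)&a_d, size);
    cudaMalloc((void **)&b_d, size);
    cudaMalloc((void **)&c_d, size);
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 4;                         // arbitrary choice
    int noOfBlocks = blocksFor(N, threadsPerBlock);  // 3 blocks, 12 threads
    fmultiplyGuarded<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d, N);

    cudaError_t err = cudaGetLastError();  // reports launch errors, if any
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));

    // cudaMemcpy takes the destination first, then the source.
    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("c[%d] = %f\n", i, c_h[i]);

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

With the guard in place, the result no longer depends on whether the thread count divides the array length evenly, so 2 blocks of 10 threads and 1 block of 20 threads both stay within bounds.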

hybrid · Post by **hybrid** » Tue Mar 16, 2010 9:28 am

Well, the thread layout depends on your algorithm (how many threads you can keep busy), your graphics card (how many threads are possible, how many run per warp, what the memory access pattern is), and a general estimate of the overhead. We have found the optimal numbers very hard to predict, so maybe just run your kernel with varying figures and see which one is best.
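
The "try varying figures" approach can be sketched roughly as follows, timing the same kernel with several block sizes using CUDA events. Everything beyond the original kernel body is an assumption made for illustration: the names `fmultiplyGuarded`, `blocksFor` and `timeLaunch`, the array length of about one million elements (10 elements would be too few to measure), and the candidate block sizes 32 to 512.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical guarded variant of the fmultiply kernel from the thread.
__global__ void fmultiplyGuarded(const float *A, const float *B, float *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        C[idx] = A[idx] * B[idx];
}

// Number of blocks needed to cover n elements (rounded up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Time a single launch with the given block size, in milliseconds.
float timeLaunch(const float *a_d, const float *b_d, float *c_d,
                 int n, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    fmultiplyGuarded<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(a_d, b_d, c_d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int N = 1 << 20;  // assumption: large enough for measurable timings
    const size_t size = N * sizeof(float);

    float *a_d = NULL, *b_d = NULL, *c_d = NULL;
    cudaMalloc((void **)&a_d, size);
    cudaMalloc((void **)&b_d, size);
    cudaMalloc((void **)&c_d, size);
    cudaMemset(a_d, 0, size);  // contents do not matter for timing
    cudaMemset(b_d, 0, size);

    timeLaunch(a_d, b_d, c_d, N, 256);  // warm-up launch, result ignored

    for (int threads = 32; threads <= 512; threads *= 2) {
        float ms = timeLaunch(a_d, b_d, c_d, N, threads);
        printf("%3d threads/block, %5d blocks: %.4f ms\n",
               threads, blocksFor(N, threads), ms);
    }

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

A single timed launch per configuration is noisy, so averaging over several runs would give steadier numbers. The ranking can also differ from one card to another, which matches the unpredictability described above.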