[CUDA] Determining threads per block, no. of blocks and grid

Posted: Mon Mar 15, 2010 3:12 pm
by abhishekdey1985
Hi everyone,

How do I determine the threads per block and the number of blocks on a CUDA device?

For example: I need to multiply two one-dimensional arrays A and B element-wise and copy the result into array C.

Code:

int N = 10; // arrays contain a maximum of 10 elements
size_t size = N*sizeof(float);
...
cudaMalloc((void **)&a_d, size);
cudaMalloc((void **)&b_d, size);
cudaMalloc((void **)&c_d, size);
...
...
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

//How to determine no. of threads here???
int threadsPerBlock = ???
int noOfBlocks = ??

fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d);

cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost); // destination first: copy device result back to host
...
...

Posted: Mon Mar 15, 2010 4:36 pm
by hybrid
Just ask the CudaInformation struct. These numbers give you the maximum number of blocks etc., but since you're still on the host, you can set the desired number of threads in your kernel calls later on.
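
If you're on the plain CUDA runtime API rather than a wrapper, the query would go through cudaGetDeviceProperties. A minimal sketch:

Code:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	// Query the hardware limits of device 0 via the runtime API.
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
	printf("Warp size:             %d\n", prop.warpSize);
	printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
	printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
	return 0;
}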

Posted: Tue Mar 16, 2010 7:31 am
by abhishekdey1985
I meant: on what basis should the threads per block be decided? The same program can run with 2 blocks and 10 threads per block OR 1 block and 20 threads per block.

Also, I get garbage values when accessing the second array passed to the kernel.

Code Snippet:

Code:

__global__ void fmultiply(float *A, float *B, float *C)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	// Guard against out-of-bounds accesses: any thread whose index
	// falls beyond the 10 elements must not read or write.
	if (idx < 10)
		C[idx] = A[idx] * B[idx];
}
Need suggestions on accessing device memory.
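
For reference, with the bounds guard in place, the usual pattern is to fix the block size and round the block count up; excess threads are discarded by the guard. A sketch (the 256 below is an assumed common default, not a number from this thread):

Code:

int N = 10;
// A multiple of the warp size (32) is usual; 256 is a common default.
int threadsPerBlock = 256;
// Ceiling division: enough blocks so blocks * threadsPerBlock >= N.
int noOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d);

// Destination comes first in cudaMemcpy: host buffer <- device buffer.
cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);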

Posted: Tue Mar 16, 2010 9:28 am
by hybrid
Well, the thread layout depends on your algorithm (how many threads you can populate), your gfx card (how many threads are possible, how many run per warp, what the memory access pattern is), and a general overhead estimate. In our experience the optimal numbers follow hard-to-predict patterns; maybe just run with varying figures and see which one is best.
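
For example, a brute-force sweep over a few candidate block sizes, timed with CUDA events, could look like this (a sketch; it reuses N, a_d, b_d, c_d and the fmultiply kernel from the snippets above, and the candidate sizes are just assumptions):

Code:

// Time the kernel at several block sizes and keep the fastest.
int candidates[] = { 32, 64, 128, 256, 512 };
float bestMs = 1e30f;
int bestSize = candidates[0];

for (int i = 0; i < 5; ++i) {
	int tpb = candidates[i];
	int blocks = (N + tpb - 1) / tpb;

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	fmultiply<<<blocks, tpb>>>(a_d, b_d, c_d);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	if (ms < bestMs) { bestMs = ms; bestSize = tpb; }

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}
// bestSize now holds the fastest threads-per-block figure of the sweep.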