[CUDA]Determining threads per block, no. of blocks and Grid

abhishekdey1985
Posts: 102
Joined: Sat Jan 17, 2009 4:33 am
Location: Pune

Post by abhishekdey1985 »

Hi Everyone,

How do I determine the number of threads per block and the number of blocks to use on a CUDA device?

For example: I need to multiply two one-dimensional arrays A and B element by element and copy the result into an array C.

Code:

int N = 10; //Array Containing Maximum of 10 elements
size_t size = N*sizeof(float);
...
cudaMalloc((void **)&a_d, size);
cudaMalloc((void **)&b_d, size);
cudaMalloc((void **)&c_d, size);
...
...
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

//How to determine no. of threads here???
int threadsPerBlock = ???
int noOfBlocks = ??

fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d);

cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
...
...
I work on "The Best Real-Time 3D Engine"
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany

Post by hybrid »

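Just query the device information struct (in the CUDA runtime API that is cudaDeviceProp, filled in by cudaGetDeviceProperties). These numbers give you the maximum number of threads per block, blocks per grid, and so on. Since you're still on the host at that point, you can then set the number of threads you actually want in your kernel launches later on.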
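To make the suggestion above concrete, here is a minimal sketch of querying the device limits from the host. It assumes the CUDA runtime API and device 0; `fitsDeviceLimits` is a hypothetical helper introduced only for illustration, not something CUDA provides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): checks a 1D launch configuration
// against the limits reported for the device.
bool fitsDeviceLimits(const cudaDeviceProp &prop, int threadsPerBlock, int noOfBlocks)
{
    return threadsPerBlock > 0 && threadsPerBlock <= prop.maxThreadsPerBlock &&
           noOfBlocks > 0 && noOfBlocks <= prop.maxGridSize[0];
}

int main()
{
    const int device = 0;  // assumption: use the first CUDA device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Upper bounds only: these say what is allowed, not what is fastest.
    std::printf("Device:                %s\n", prop.name);
    std::printf("Warp size:             %d\n", prop.warpSize);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Max block dimensions:  %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("Max grid dimensions:   %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("Multiprocessors:       %d\n", prop.multiProcessorCount);

    // Example check for a configuration chosen on the host.
    std::printf("256 threads x 1 block allowed: %s\n",
                fitsDeviceLimits(prop, 256, 1) ? "yes" : "no");
    return 0;
}
```

The reported values are hard limits for a launch. Any configuration inside them is legal, so they narrow the choice without making it.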
abhishekdey1985
Posts: 102
Joined: Sat Jan 17, 2009 4:33 am
Location: Pune

Post by abhishekdey1985 »

I meant: on what basis should the number of threads per block be decided? The same program can run with 2 blocks of 10 threads each or with 1 block of 20 threads.

Also, I get garbage values when accessing the second array passed to the kernel.

Code Snippet:

Code:

__global__ void fmultiply(float *A, float *B, float *C )
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	//if(idx<10)
		C[idx] = A[idx]*B[idx];
}
Need suggestions on accessing device memory.
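One possible source of the garbage values: both configurations mentioned above (2 x 10 and 1 x 20) launch 20 threads for a 10-element array, so with the bounds check commented out half of the threads index past the end of the buffers. The sketch below is one way to guard against that. It assumes the element count is passed to the kernel as an extra parameter `n` (not in the original signature), picks 32 threads per block as an illustrative choice (one warp), and uses a hypothetical helper `blocksFor` to round the block count up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise multiply. n is an added parameter (an assumption of this
// sketch) so that surplus threads can skip out-of-range indices.
__global__ void fmultiply(const float *A, const float *B, float *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        C[idx] = A[idx] * B[idx];
}

// Hypothetical helper: smallest number of blocks that covers n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int N = 10;
    const size_t size = N * sizeof(float);
    float a_h[N], b_h[N], c_h[N];
    for (int i = 0; i < N; ++i) { a_h[i] = float(i); b_h[i] = 2.0f; }

    float *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    cudaMalloc((void **)&a_d, size);
    cudaMalloc((void **)&b_d, size);
    cudaMalloc((void **)&c_d, size);

    // cudaMemcpy takes the destination first, then the source.
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 32;  // illustrative choice: one warp
    const int noOfBlocks = blocksFor(N, threadsPerBlock);
    fmultiply<<<noOfBlocks, threadsPerBlock>>>(a_d, b_d, c_d, N);

    cudaError_t err = cudaGetLastError();  // reports launch failures
    if (err != cudaSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        std::printf("C[%d] = %f\n", i, c_h[i]);

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

With the guard in place, the launch may contain more threads than elements. That is the usual situation, because the block count is rounded up to a whole number of blocks.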
I work on "The Best Real-Time 3D Engine"
hybrid
Admin
Posts: 14143
Joined: Wed Apr 19, 2006 9:20 pm
Location: Oldenburg(Oldb), Germany

Post by hybrid »

Well, the thread layout depends on your algorithm (how many threads you can keep busy), your graphics card (how many threads are possible, how many run per warp, what the memory access pattern looks like), and a general estimate of the overhead. We have seen patterns in the optimal numbers that are very hard to predict, so maybe just run the kernel with varying figures and see which one is best.
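The "run it with varying figures" approach could look roughly like the sketch below. It is an illustration under assumptions, not a tuned benchmark: the array size, the repetition count, and the helpers `candidateBlockSizes` and `timeLaunch` are all made up for this example, and the kernel takes an extra `n` parameter for bounds checking. `cudaOccupancyMaxPotentialBlockSize` is only available in newer toolkits (CUDA 6.5 and later), so drop that call on older ones.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void fmultiply(const float *A, const float *B, float *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        C[idx] = A[idx] * B[idx];
}

// Hypothetical helper: block sizes to try, doubling from one warp up to
// the device maximum.
std::vector<int> candidateBlockSizes(int warpSize, int maxThreadsPerBlock)
{
    std::vector<int> sizes;
    for (int t = warpSize; t > 0 && t <= maxThreadsPerBlock; t *= 2)
        sizes.push_back(t);
    return sizes;
}

// Hypothetical helper: average time in milliseconds of one launch with the
// given block size, measured with CUDA events.
float timeLaunch(const float *a_d, const float *b_d, float *c_d, int n, int threadsPerBlock)
{
    const int reps = 20;  // assumption: enough repetitions to smooth noise
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fmultiply<<<blocks, threadsPerBlock>>>(a_d, b_d, c_d, n);  // warm-up
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        fmultiply<<<blocks, threadsPerBlock>>>(a_d, b_d, c_d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    const int n = 1 << 20;  // assumption: large enough for timing to mean something
    const size_t size = n * sizeof(float);
    std::vector<float> host(n, 1.5f);

    float *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    cudaMalloc((void **)&a_d, size);
    cudaMalloc((void **)&b_d, size);
    cudaMalloc((void **)&c_d, size);
    cudaMemcpy(a_d, host.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, host.data(), size, cudaMemcpyHostToDevice);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Starting point suggested by the runtime's occupancy calculator.
    int minGridSize = 0, suggested = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggested, fmultiply, 0, 0);
    std::printf("occupancy suggestion: %d threads per block\n", suggested);

    for (int tpb : candidateBlockSizes(prop.warpSize, prop.maxThreadsPerBlock))
        std::printf("%4d threads per block: %.4f ms\n",
                    tpb, timeLaunch(a_d, b_d, c_d, n, tpb));

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

Sweeping multiples of the warp size keeps every warp fully populated. For a 10-element array like the one in the question, the timings would be dominated by launch overhead, so a larger array is used here.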