Just ask the CudaInformation struct. These numbers would give you the maximal number of blocks etc., but since you're still on the host, you're able to set the desired number of processes in your calls later on.
I meant: on what basis should the threads per block be decided? The same program can run with 2 blocks and 10 threads per block, or with 1 block and 20 threads per block.
Also, I get garbage values when accessing the second array passed to the kernel.
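A minimal sketch of why both launch configurations can produce the same result, assuming the usual global-index idiom. The kernel, names, and sizes here are illustrative, not taken from your code; the comment on the second `cudaMemcpy` also points at one common cause of garbage values in a second array (forgetting to copy it to the device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: element-wise addition of two arrays.
// The global index makes the result independent of the
// <<<blocks, threads>>> split, as long as blocks * threads >= n.
__global__ void add(const int *a, const int *b, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

int main() {
    const int n = 20;
    int ha[n], hb[n], hout[n];
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }

    int *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(int));
    cudaMalloc(&db, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));

    // Copying BOTH input arrays to the device matters; omitting the
    // second cudaMemcpy is a classic source of garbage values in the
    // second array passed to a kernel.
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    add<<<2, 10>>>(da, db, dout, n);    // 2 blocks x 10 threads
    // add<<<1, 20>>>(da, db, dout, n); // same result: 1 block x 20 threads
    cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("%d ", hout[i]);
    printf("\n");

    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```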
Well, the thread layout depends on your algorithm (how many threads you can populate), your graphics card (how many threads are possible, how many run per warp, what the memory access pattern is), and a general overhead estimate. We have seen hard-to-predict patterns in the optimal numbers; maybe just run with varying figures and see which one is best.
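The "run with varying figures" suggestion can be sketched like this, using CUDA events to time each configuration. The kernel and problem size are placeholders; in practice you would sweep your own kernel and keep the fastest block size (typically a multiple of the 32-thread warp):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload; substitute your actual kernel here.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep common block sizes (multiples of the warp size)
    // and time each launch on this particular card.
    for (int threads = 32; threads <= 1024; threads *= 2) {
        int blocks = (n + threads - 1) / threads;  // ceil(n / threads)
        cudaEventRecord(start);
        work<<<blocks, threads>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.3f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

On newer toolkits, `cudaOccupancyMaxPotentialBlockSize` can also suggest a starting block size, but an empirical sweep like the above remains the most reliable check.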