How to Use Constant Memory in CUDA
The other day I wanted to store a lookup table in constant memory because of its decent-sized cache (global memory had no cache at all on the hardware I was using). I just could not get the syntax right. At first there was no speedup whatsoever, so the constant memory clearly wasn't being used correctly. Eventually I stumbled across a code sample that used constant memory the way I wanted to, and I got my code working. Here's how I did it:
Step 1: Global declaration (i.e. not inside a function) of the constant array
extern __constant__ int d_nLookup[1<<13];
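Step 2: Copy the data to constant memory from a host function
// Note the parentheses: `*` binds tighter than `<<`, so 1<<13 * sizeof(int) would not give 8192 * sizeof(int).
CUDA_SAFE_CALL(cudaMemcpyToSymbol(d_nLookup, h_nlookup, (1<<13) * sizeof(int), 0, cudaMemcpyHostToDevice));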
Step 3: Use array in __device__ or __global__ functions directly (no need to pass array reference as a parameter)
fTemp += d_idata[d_nLookup[i]]; // use only loop map
Because the array is declared at file scope, any __device__ or __global__ function in the same source file can reference it by name, with no pointer passed as a parameter. Unfortunately, the size of the array has to be fixed at compile time; I didn't try dynamic allocation, and as far as I know constant memory cannot be allocated at run time. It's important to note the use of cudaMemcpyToSymbol instead of cudaMemcpy; this was not emphasized anywhere and really had me stumped! Also, there is no need to cudaMalloc or cudaFree constant memory.
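The three steps above can be put together in one small program. This is only a minimal sketch, not my original code: the kernel name gatherSum, the LOOKUP_SIZE macro, the test data, and the CHECK_CUDA macro (a stand-in for the SDK's CUDA_SAFE_CALL) are all made up for illustration. It also assumes everything lives in a single .cu file, so the array is defined without extern.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define LOOKUP_SIZE (1 << 13)  // 8192 ints = 32 KB of constant memory

// Hypothetical error-checking macro, standing in for CUDA_SAFE_CALL.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Step 1: file-scope definition of the constant array (assumption: one .cu file, so no extern).
__constant__ int d_nLookup[LOOKUP_SIZE];

// Step 3: the kernel reads d_nLookup by name; it is not a parameter.
__global__ void gatherSum(const float* d_idata, float* d_odata, int nThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nThreads) return;

    float fTemp = 0.0f;
    // All threads read the same d_nLookup[i] in the same iteration.
    for (int i = 0; i < LOOKUP_SIZE; ++i)
        fTemp += d_idata[(d_nLookup[i] + tid) % LOOKUP_SIZE];
    d_odata[tid] = fTemp;
}

int main()
{
    const int nThreads = 256;
    static int   h_nLookup[LOOKUP_SIZE];
    static float h_idata[LOOKUP_SIZE];
    for (int i = 0; i < LOOKUP_SIZE; ++i) {
        h_nLookup[i] = (i * 3) % LOOKUP_SIZE;  // arbitrary index table
        h_idata[i]   = (float)(i % 7);         // small integers, so float sums stay exact
    }

    // Step 2: cudaMemcpyToSymbol, not cudaMemcpy; no cudaMalloc for the constant array.
    CHECK_CUDA(cudaMemcpyToSymbol(d_nLookup, h_nLookup,
                                  LOOKUP_SIZE * sizeof(int), 0,
                                  cudaMemcpyHostToDevice));

    float* d_idata = NULL;
    float* d_odata = NULL;
    CHECK_CUDA(cudaMalloc((void**)&d_idata, LOOKUP_SIZE * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void**)&d_odata, nThreads * sizeof(float)));
    CHECK_CUDA(cudaMemcpy(d_idata, h_idata, LOOKUP_SIZE * sizeof(float),
                          cudaMemcpyHostToDevice));

    gatherSum<<<1, nThreads>>>(d_idata, d_odata, nThreads);
    CHECK_CUDA(cudaGetLastError());

    float h_odata[nThreads];
    CHECK_CUDA(cudaMemcpy(h_odata, d_odata, sizeof(h_odata),
                          cudaMemcpyDeviceToHost));

    // Recompute on the host and count differences.
    int mismatches = 0;
    for (int t = 0; t < nThreads; ++t) {
        float ref = 0.0f;
        for (int i = 0; i < LOOKUP_SIZE; ++i)
            ref += h_idata[(h_nLookup[i] + t) % LOOKUP_SIZE];
        if (ref != h_odata[t]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    CHECK_CUDA(cudaFree(d_idata));
    CHECK_CUDA(cudaFree(d_odata));
    return mismatches != 0;
}
```

Compile it with something like nvcc on a single file. If the copy in Step 2 is done with plain cudaMemcpy instead, the lookup table never reaches the constant array, which matches the "no speedup, wrong results" symptoms I ran into.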
I suspect that constant memory is simply a wrapped-up version of 1D texturing. The constant array is read inside a loop, so in theory, with a plain hit-or-miss cache, the first access to each position of the array should be a miss (and so pay global memory latency). That is not what I observed, so I suspect the compiler detects the usage and fills the 8 KB cache on the first access.