As I mentioned, I was hard-coding the number of groups in the kernel to 64, yet somehow still getting the right answer.
Looking back, it shouldn't have worked, but it did.
To fix it, I added some constants:
const int COUNT = 20246528;
const int LOCAL_SIZE = 256;
const int NUM_GROUPS = COUNT / LOCAL_SIZE;
I then use NUM_GROUPS as the size of the array that I sum on the CPU, and in the kernel I replaced "groupSize" with get_num_groups(0).
Guess what! It worked.
int outVar[NUM_GROUPS];
//...
int main(int argc, char** argv)
{
//...
kernel.setArg(1, LOCAL_SIZE * sizeof(cl_int), localMem);
//...
err = queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(COUNT), cl::NDRange(LOCAL_SIZE));
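// Keep the pointer returned by the map call: the unmap call needs that pointer, not the destination array.
void* mapped = queue.enqueueMapBuffer(outBuf, CL_TRUE, CL_MAP_READ, 0, NUM_GROUPS * sizeof(int), nullptr, nullptr, &err);
memcpy(outVar, mapped, NUM_GROUPS * sizeof(int));
err = queue.enqueueUnmapMemObject(outBuf, mapped);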
err = queue.finish();
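For anyone following along, here is a rough CPU-only sketch of the two-stage sum the host code above relies on: each work-group contributes one partial sum, so the output needs exactly COUNT / LOCAL_SIZE entries, and the host adds those up. The helper names (numGroups, partialSums, finalSum) and the 64-bit accumulators are my own assumptions for illustration. This is a stand-in for the kernel, not the kernel itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: number of work-groups for a global size and a local size.
// Assumes count is an exact multiple of localSize, as it is in the post.
constexpr std::size_t numGroups(std::size_t count, std::size_t localSize) {
    return count / localSize;
}

// The constants from the post divide evenly: 20246528 / 256 = 79088 groups.
static_assert(20246528 % 256 == 0, "COUNT must be a multiple of LOCAL_SIZE");
static_assert(numGroups(20246528, 256) == 79088, "unexpected group count");

// CPU stand-in for the kernel: writes one partial sum per work-group.
std::vector<std::int64_t> partialSums(const std::vector<int>& data,
                                      std::size_t localSize) {
    assert(localSize > 0 && data.size() % localSize == 0);
    const std::size_t groups = numGroups(data.size(), localSize);
    std::vector<std::int64_t> out(groups, 0);
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t l = 0; l < localSize; ++l) {
            out[g] += data[g * localSize + l];  // element l of group g
        }
    }
    return out;
}

// Host-side final step: add up the per-group partial sums.
std::int64_t finalSum(const std::vector<std::int64_t>& partials) {
    return std::accumulate(partials.begin(), partials.end(), std::int64_t{0});
}
```

With 1024 ones and a local size of 256, partialSums returns 4 entries of 256 each, and finalSum gives 1024. If COUNT were ever not a multiple of LOCAL_SIZE, the integer division would drop the trailing elements, so the global size would need padding first.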