The very first thing you want to do is set up OpenCL, which means getting an SDK / library from the internet, building it and linking it to your project. There is a good tutorial series on how to do this as well as the other things I cover in this post. I will link it at the end.
The first thing you have to do in OpenCL is query the available platforms (there may be several), then from a platform you select your device. In my case the device is my GPU, but it can be a CPU. From that device you create a context. Next you want to write some code to run on the GPU - this is called the kernel. After that you create a program from your kernel source against your context. This is not confusing at all. Once you have your program you need to build it. You might have to specify which version of OpenCL you're using in your build arguments, e.g. "-cl-std=CL1.2"
But what if it doesn't build correctly because there's a syntax error (or something)? Well, program.build() returns a cl_int which serves as an error code. -11 is CL_BUILD_PROGRAM_FAILURE, the code for a failed build (-48, CL_INVALID_KERNEL, is the one for an actual invalid kernel object). You can get a list of these codes on the internet, but I'll attach a handy helper function so you won't need to. That's fair enough if you've got a simple kernel, but what if you've got something that's more than one line long? It would be nice to print out the build log to find out which bit it didn't like. Fortunately, there's a way to do that too - I'll include it in my helper file.
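To give a flavour of what the attached helper does, here is a minimal sketch of an error-name lookup. The numeric values are the standard ones from CL/cl.h; the function name and the selection of codes are mine, and I use plain int so the sketch compiles without the OpenCL headers.

```cpp
#include <string>

// Maps a handful of OpenCL error codes (numeric values from CL/cl.h)
// to their names. A real helper would cover the full list.
std::string clErrorName(int err)
{
    switch (err)
    {
    case 0:   return "CL_SUCCESS";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -48: return "CL_INVALID_KERNEL";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    default:  return "UNKNOWN_ERROR";
    }
}
```

With the C++ bindings, the build log itself comes from program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device) after a failed build.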
Context, Device, and Program turn up throughout your code - you can query all kinds of information from them, and they are passed into many of the functions you will need to call. Program is mostly used for creating and managing kernels.
Now you will want to generate some buffers to give to the device. There are loads of ways to generate buffers (well, about 3) but what you want to consider when doing so is what flags you're going to set - does the GPU only need to read the data or write it? Do you need a host pointer? Probably not. How much space do you need to give it? Does the buffer need to be populated?
Now that you have your buffers, you need a handle to the kernel function so that you can pass them as arguments. (Another point to consider here: do you want these on the heap or the stack, and passed by reference or by value?) You get the handle by creating a kernel object on the host side, passing the program and the function name to the constructor. Once you've got this object you can call kernel.setArg(0, val), where 0 is the index of the argument (1st arg is 0, 2nd is 1, etc.) and val is the buffer you created earlier. If you want to give each work group some local memory to work in, you set that here as well, with three parameters: the index, the size, and the value - a nullptr in my case, since local memory is only allocated from the host, not initialized. I used it to store the "local result", more on that later.
The next thing to do is create a command queue - this is created from a context and a device and is used to give instructions to the GPU. The first thing I put on the queue is a call to enqueueNDRangeKernel. Despite the name, the ranges here are not about memory: the global range is the total number of work items that will execute the kernel, and the local range is how many of those are grouped into each work group. The runtime derives the number of work groups by dividing the global size by the local size, so the global size must be divisible by the local size, and it is useful to have both divisible by 64. Each device has a max work group size; mine reports 4100, which is suspiciously close to 4096.
So there are these things called work groups, and there are these things called work items - these are not items of work to be done, they are workers. More work items per work group generally keeps the device busier, up to its limit. For some reason the maximum local (chunk) size my kernel would accept was 256, well below the 4100 the device reports. That probably deserves more investigation, but it's a job for another time.
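The arithmetic the runtime does with the two ranges is simple enough to write down. This is just a model of that division, with names of my own choosing; the real work happens inside enqueueNDRangeKernel.

```cpp
#include <cstddef>

// Sketch of what the runtime does with the two NDRange sizes: the number
// of work groups is the global size divided by the local size. OpenCL 1.x
// additionally requires globalSize to be an exact multiple of localSize.
std::size_t numWorkGroups(std::size_t globalSize, std::size_t localSize)
{
    return globalSize / localSize;
}
```

With a local size of 256, a global size of 16384 gives exactly the 64 work groups my kernel below assumes.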
BAM! The GPU is now doing stuff. Cool. Now we want to get some information out of it, so we need to add something else to the queue: something to read data from the buffer we gave it to write to. Again, there are several ways of doing this. I settled on enqueueMapBuffer() as it is supposedly faster than enqueueReadBuffer(). This returns a void pointer, which I memcpy'd into an array of integers. Why an array? I'll explain later. After that you need to unmap the buffer with a call to enqueueUnmapMemObject(). Finally, once you've added everything you want to the queue, you call queue.finish(), which waits (apparently) for all the operations on the queue to finish executing before the rest of the program carries on.
Then I add up the contents of my array and print it.
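That final host-side step is a one-liner with the standard library. groupResults here stands in for the array I memcpy'd the mapped buffer into.

```cpp
#include <numeric>
#include <vector>

// Sum the per-work-group results read back from the device.
int sumGroupResults(const std::vector<int>& groupResults)
{
    return std::accumulate(groupResults.begin(), groupResults.end(), 0);
}
```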
I am now done explaining what I do on the host, now to talk about the device.
In case you didn't know, the host I keep referring to is the CPU, the device could be a GPU, CPU or APU. As mentioned, it is a GPU in my case. Below is my kernel code:
__kernel void FindOddQty(__global int* data, __local int* local_result,
                         __global int* group_result, const int arraySize)
{
    int localSize = get_local_size(0);
    int numGroups = 64;   // hardcoded partition count - see below
    int localID = get_local_id(0);
    int groupID = get_group_id(0);

    // Each group handles a contiguous slice of the data; within the
    // slice, work items stride forward by the local size.
    int workAmount = arraySize / numGroups;
    int startPos = groupID * workAmount + localID;
    int endPos = (groupID + 1) * workAmount;
    if (endPos > arraySize)
        endPos = arraySize;

    int val = 0;
    for (int i = startPos; i < endPos; i += localSize)
    {
        val += data[i] & 1;   // 1 if the pixel value is odd, 0 if even
    }
    local_result[localID] = val;

    // Wait until every work item in the group has written its local result.
    barrier(CLK_LOCAL_MEM_FENCE);

    if (localID == 0)
    {
        int sum = 0;
        for (int i = 0; i < localSize; i++)
        {
            sum += local_result[i];
        }
        group_result[groupID] = sum;
    }
}
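To convince myself the partitioning logic above is right, here is a plain C++ model of the kernel that runs on the CPU: each simulated work item strides through its group's slice counting odd values, and then "work item 0" of each group reduces the local results. The names are mine and the loops stand in for the real parallelism; this is a model of the kernel, not OpenCL code.

```cpp
#include <vector>

// CPU model of FindOddQty: numGroups groups of localSize work items.
// Each work item counts odd values in its strided slice of the group's
// region, then the group's local results are reduced into groupResult.
std::vector<int> simulateFindOddQty(const std::vector<int>& data,
                                    int numGroups, int localSize)
{
    const int arraySize  = static_cast<int>(data.size());
    const int workAmount = arraySize / numGroups;
    std::vector<int> groupResult(numGroups, 0);

    for (int groupID = 0; groupID < numGroups; ++groupID)
    {
        std::vector<int> localResult(localSize, 0);
        for (int localID = 0; localID < localSize; ++localID)
        {
            int startPos = groupID * workAmount + localID;
            int endPos   = (groupID + 1) * workAmount;
            if (endPos > arraySize) endPos = arraySize;

            int val = 0;
            for (int i = startPos; i < endPos; i += localSize)
                val += data[i] & 1;            // 1 if odd, 0 if even
            localResult[localID] = val;
        }
        // This second loop plays the role of the barrier followed by the
        // work-item-0 reduction in the real kernel.
        for (int localID = 0; localID < localSize; ++localID)
            groupResult[groupID] += localResult[localID];
    }
    return groupResult;
}
```

Summing the returned vector gives the same answer as counting odd values in a single pass, which is exactly the check I do on the host.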
Parameter 0 is the data from the image. This is covered in a previous post, but to reiterate: it's a truncated array of integers where each item represents one pixel - the un-normalized R, G, B components of that pixel summed into a single value. I loaded the image using stb_image. The array is truncated so that its length is nicely divisible by 64.
Next is the local result - another array of integers, the one we only allocated the size for earlier, not the value. Each work item comes up with its own local result. These are then summed and stored in a group result, i.e. one result per work group. That's the third parameter, and it's what we read back on the host. The fourth parameter is the size of the array we're passing in - this is what I saw done in an example, and it seems safer than trying to work it out from get_global_size(0). NB the argument to these get calls is the dimension; we are using a 1-dimensional array, so it is always 0 (the first dimension) for us.
This leads me to my next point - every work item runs through this same code, each executing with its own IDs. Calls like get_local_id(0) and get_group_id(0) return a different result depending on which work item is asking (get_num_groups(0), by contrast, is the same everywhere). This means they all work on a different part of the image at the same time. The increment to val is thread safe because each work item has its own copy. What is nice is that we can then add those copies together across work items - you wouldn't be able to do that in OpenGL.
"val" is the number of odd numbers found by each work item, so we end up with counts at three levels: local (per work item), group, and global. After finding our local result, we want to make sure all the work items in the group have finished finding theirs before we add them up into the group result; for this we use a barrier. The sizes are related like so:
num_work_groups = global_size / local_size
This is pretty confusing, because the number of groups the GPU reports is not 64. My groups are different from its groups: the kernel hardcodes numGroups = 64 to partition the data, and I get 64 group results out of it, while the runtime's group count comes from dividing the global size by the local size.
You can print from the GPU in OpenCL if you want to find out your work group size, using printf() in the kernel. Pretty nifty IMO (but every work item is going to do it if you're not careful).
When we do add the local results together into a group result, only one work item per group (the one with local ID 0) does the adding.
Anyway, after we've got our array of group results (one entry per work group), we read them back to the host and sum them. To check the result, I did the same calculation on the CPU beforehand (as it is so quick). Here is the output from my program:
Max work group size is: 4100
Created platform and selected device
Built OpenCL kernel
Allocated buffers
First 20246528 pixels contain 10088883 odd numbers
Found kernel subroutine, passing arguments...
Starting clock
Kernel ranges successfully set - let's go!
Successfully read buffer from kernel
Queue finished successfully
Clock stopped
The program took 0.133066s to execute
10088883 odd numbers, 10157645 even numbers.
Press any key to continue . . .
Lastly, I will note that there is a bug in the program: a fast-fail is triggered by a stack-based buffer overrun, which occurs when I read the buffer. When stepping through at run-time, the error only fires when main returns. For now I have disabled the check for this issue.
At one time, allocating 10x more memory for the buffer made the error go away; now it occurs whether I do or not.
OpenCL Error Helper | Kernel | Source