Writing CUDA C extensions for R

Here’s a quick guide to writing CUDA C extensions for R to use a GPU for computation. I found another guide elsewhere, but it’s a bit roundabout. This guide uses a more straightforward method that might also be a little more complicated.

First, we’ll look at writing a C extension and see how that generalizes. Let’s say we’re writing a simple program that uses C to add two numbers. If we were just writing a C program, it would look something like this:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  #include #include   void add(double *a, double *b, double *c){ *c = *a + *b; Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c); }   int main(void){ double *a, *b, *c; size_t fbytes = sizeof(float); a = (double *) malloc(fbytes); b = (double *) malloc(fbytes); c = (double *) malloc(fbytes);   //code for user input omitted   add(a, b, c);   printf("\nc = %.4f\n", *c);   return 0; }

None of this should be new to anyone who’s programmed in C before. Now suppose we want to be able to call the function add from R. We need to change the function slightly to accommodate this:

 1 2 3 4 5 6 7 8 9 10 11 12  #include #include #include //needed to interface with R   void add(double *a, double *b, double *c){ *c = *a + *b;   //similar to printf, except prints to R console Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c);   }

There are a couple of twists here. First, we need to include the header R.h. This contains functions that allow R to interact with the C function, including Rprintf. Rprintf works almost exactly like printf except it prints directly to the R console. In addition to including the header, any function that we want to call from R has to satisfy two constraints: it must return type void and only have pointer arguments. Then to compile the program so that it’s callable from R, we would type into the terminal

 1  R CMD SHLIB foo.c

Where foo.c is the name of the file containing the code above. This command yields the following output in our terminal

 1 2 3  gcc -std=gnu99 -I/apps/lib64/R/include -I/usr/local/include -fpic -g -O2 -c foo.c -o foo.o gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o

These are the commands you would have entered if you wanted to compile the code manually. We’ll take a closer look at that later when we compile CUDA C code. For now though, just note that now in your working directory you have two new files: foo.o and foo.so. The latter is the file we’ll use to call add from R. To do this, we’ll need an R wrapper to call our C function transparently, e.g.:

 1 2 3 4 5 6 7 8 9 10  add <- function(a,b){ ##check to see if function is already loaded if(!is.loaded("add")) dyn.load("foo.so")   c <- 0 z <- .C("add",a=a,b=b,c=c) c <- z$c return(c) } There are a couple of elements here. First, the function dyn.load() loads the shared library file foo.so. is.loaded(), unsurprisingly, checks to see if its argument has already been loaded. Note that we load a file but check to see if a specific function has already been loaded. The next element is the .C function. There are other ways to call C functions from R, but .C is the easiest. The first argument of .C is the name of the C function you want to call in string form. The rest of the arguments are the arguments for the function you’re calling. .C returns a list containing the updated values of all of the arguments passed to the C function. R copies all of the arguments and passes them to C using .C – it does NOT simply update the value of c for us, so we have to copy it back from the list that .C returns. When we run this code, this is what we see:  1 2 3 4  > source("foo.r") > add(1,1) 1 + 1 = 2 [1] 2 This is all simple enough, but what about when we want to call a function that runs on the gpu? Assuming that we are trying to run the same function, there are a couple of tweaks. Here’s the source:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  #include #include #include __global__ void add(float *a, float *b, float *c){ *c = *b + *a; } extern "C" void gpuadd(float *a, float *b, float *c){ float *da, *db, *dc; cudaMalloc( (void**)&da, sizeof(float) ); cudaMalloc( (void**)&db, sizeof(float) ); cudaMalloc( (void**)&dc, sizeof(float) ); cudaMemcpy( da, a, sizeof(float), cudaMemcpyHostToDevice); cudaMemcpy( db, b, sizeof(float), cudaMemcpyHostToDevice); add<<<1,1>>>(da, db, dc); cudaMemcpy(c, dc, sizeof(float), cudaMemcpyDeviceToHost); cudaFree(da); cudaFree(db); cudaFree(dc); Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c); } There are two noteworthy things here. First, we still have to have a C function that allocates memory on the GPU and calls the kernel (GPU function) that we want the GPU to run – we can’t directly call the kernel from R. Second, we need to make sure that R knows the name of the C function we want to call using .C. We do this with the extern "C" command. This tells the compiler to treat gpuadd as a C function so that in the shared library file (i.e. gpufoo.so) it’s still called “gpuadd.” Otherwise, the compiler treats the function as a C++ function and changes how its name is stored without telling us what to call the function while using .C. Also there’s a quirk here – all of the variables are floats instead of doubles since most GPUs only support single precision floating point math. This will affect how we write our R wrapper to call gpuadd. If your GPU supports double precision floating point operations you can ignore that part, but keep in mind that most GPUs don’t if you want to distribute your code widely. We’ll assume that the code above is saved in gpufoo.cu. So that’s the code, how do we compile it? This time there isn’t a simple R command we can call from the terminal but the output from compiling a C file (R CMD SHLIB file.c) gives us clues about how to do it manually:  1 2 3  gcc -std=gnu99 -I/apps/lib64/R/include -I/usr/local/include -fpic -g -O2 -c foo.c -o foo.o gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o As I mentioned before, these are precisely the commands we would have used if we wanted to compile manually. I’ll go through them step by step. Starting with the first line, or the compiling step, gcc is the compiler we call on our source code. -std=gnu99 tells the compiler which standard of C to use. We’ll ignore this option. -I/... tells the compiler to look in the folder /... for any included headers. Depending on how your system has been set up, the particular folders where R automatically looks may be different from mine. Make a note of which folders are displayed here as you’ll need them later. These folders contain R.h among other headers. -fpic essentially tells the compiler to make the code suitable for a shared library – i.e. what we’ll be callying with dyn.load() from R. -g is just a debugging option while -O2 tells the compiler to optimize the code nearly as much as possible. Neither are essential for our purposes though both are useful. Finally, -c tells the compiler to only compile and assemble the code – i.e. don’t link it, foo.c is the name of the source file and -o foo.o tells the compiler what to name the output file. The next line is the linking step. Here, -shared tells the linker that the output will be a shared library while -L/... tells it where to find previously compiled libraries that this code may rely on. This is where the compiled version of R libraries may probably are at. Again use the path R outputs for you, not necessarily the path I have here. Finally, -o foo.so tells the linker what to name the output file and foo.o is the name of the object file that needs to be linked. So now we need to use these two statements to construct similar nvcc commands to compile our CUDA C code. The big reveal first, then the explanation:  1 2 3  nvcc -g -G -O2 -I/apps/lib64/R/include -I/usr/local/include -Xcompiler "-fpic" -c gpufoo.cu gpufoo.o nvcc -shared -L/usr/local/lib64 gpufoo.o -o gpufoo.so -g and -O2 do the same thing here as before with the caveat that they only apply to code that runs on the host (i.e. not on the GPU). That is to say -g generates debugging information for only the host code and -O2 optimizes only the host code. -G, on the other hand, generates debugging information for the device code, i.e. the code that runs on the GPU. So far, nothing different from before. The wrinkle is in this component: -Xcompiler "-fpic". This tells the compiler to pass on the arguments in quotes to the C compiler, i.e. to gcc. This argument is exactly the same as above, as is the rest of the arguments outside of the quotes. The link step is basically identical to before as well. In reality, there’s no need for some of the commands in both the compile and link step. An alternative would be  1 2 3  nvcc -g -G -O2 -I/apps/lib64/R/include -Xcompiler "-Wall -Wextra -fpic" -c gpufoo.cu gpufoo.o nvcc -shared -lm gpufoo.o -o gpufoo.so This version removes a path from both the compile and the link step because nothing in those folders is relevant to compiling the above program – at least on my system. In the compile step I added two arguments to be passed to the compiler: -Wall and -Wextra. These tell the compiler to show warnings for things in our code that commonly cause errors – very useful for preventing bugs. Finally in the link step I added the command -lm. In general, the command -lname links the library named “name.” In this case, it links the math library which we would be using if we had #include  in our source file. If, for example, we were using NVIDIA’s CUBLAS library we would need -lcublas here. Now in order to call this function from R, our wrapper needs to be slightly different:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  gpuadd <- function(a,b){ if(!is.loaded("gpuadd")) dyn.load("gpufoo.so") ##tell R to convert to single precision before copying to .C c <- single(1) mode(a) <- "single" mode(b) <- "single" z <- .C("gpuadd",a,b,c=c) c <- z$c   ##Change to standard numeric to avoid dangerous side effects attr(c,"Csingle") <- NULL return(c) }

The essential difference here is that .C‘s arguments need to be have the “Csingle” attribute so that it knows to copy them as floats instead of doubles. The same applies for integer arguments. Finally, this attribute needs to be removed before returning the result to avoid some dangerous side effects – which occurs right before the function returns its output.

Writing CUDA C extensions for R — 15 Comments

1. Great post!

I’m a CUDA noob and I was going crazy while trying to interface it
with R.
Using your instruction I didn’t have any problem. Thanks!

• Your welcome! I was mildly annoyed that no one else had written a post like this, so when I finally sorted things out I figured I should at least make the solution googleable.

2. Hi,

Thanks for your post, it’s really useful. I’m trying to replicate what you have done on Windows but I’m having problems with compiling the code. Basically I use the first line to create a .obj file:

The -“fpic” component doesn’t work on Windows, but this part runs fine (any idea what the windows equivalent of usr/local/include is?). I then use:

To attempt to create a .dll. However, this fails because of error LNK2019: Unresolved external symbol Rprintf… I’m trying to solve it myself but I’m quite new to this so if anything jumps out at you please let me know (any idea what windows equivalent of usr/local/lib64 is?).

Thanks again,

James

• usr/local/lib64 should be the directory where the header files R.h and Rmath.h are at on your computer. If you can find those files, you have found the proper directory.

On the other hand usr/local/include should be where generic C libraries stored on your computer. If you can find the file iostream.h, then you’ve probably found the proper directory.

That’s my guess though. I don’t have any experience running nvcc on a windows machine, so I’m not sure if there are some extra quirks that are causing you problems.

3. Pingback: R + C + CUDA =…

4. Hi,
Thanks for your post. I am able to run this code but i am modifying the code instead of launching 1 thread, i am launching N=40 threads but then also its showing only addition of first element not all of them.

• You need to do some pointer arithmetic in order to tell each thread to add a different portion of your array. I’d suggest taking a look at some of the examples in this introduction (pdf).

5. Yes, I have changed the code accordingly.
This is the code which i tried to run, But its only showing the addition of first element of an array.

#include
#include
#include
#include
#define N 40

__global__ void add(float *a, float *b, float *c){
int i= theraIdx.x+blockIdx.x*blockdim;
if(N<40)

c[i] = b[i] + a[i];
}

extern "C" void gpuadd(float *a, float *b, float *c){
float *da, *db, *dc;

cudaMalloc( (void**)&da, N*sizeof(float) );
cudaMalloc( (void**)&db, N*sizeof(float) );
cudaMalloc( (void**)&dc, N*sizeof(float) );

cudaMemcpy( da, a, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( db, b, N*sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(c, dc, N*sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(da);
cudaFree(db);
cudaFree(dc);

for(int i=0;i<N;i++)
{
Rprintf("%.0f + %.0f = %.0f\n", a[i], b[i], c[i]);
}
}

6. Ok, i got the vector addition of an array.. but now i want to store the addition result in R, but it is showing only the last element..

7. I am using your above mentioned code, can you suggest R code to store vector addition result in R.

• The line c <- single(1) needs to be changed to c <- single(n) where n is the number of elements of your vector/matrix. E.g. if you have a 10x20 matrix, n=200. For a vector (or single dimensional array), that's all you need to do. For multidimensional arrays, e.g. a matrix, you might want to convert the vector to the matrix or array type in R. Be careful that you identify to rows and columns correctly. Experiment a bit to see how this works.

• Sorry, I don’t have any experience using CUDA C on windows. My approach in that situation would be to dual boot and do everything on linux.