Exactly how volatile is bitcoin?

Recently Eli Dourado introduced the bitcoin volatility index or btcvol, a measure of bitcoin’s volatility. Eli estimates the volatility of the return to holding bitcoin at a given date by computing the standard deviation of log returns from the past 30 days. This is fine so far as it goes – as a quick and dirty measure – but I think we can do better. The ideal, as Eli admits, would be the implied volatility from the derivatives market. The problem with this measure is that the bitcoin derivatives market is very immature, though bitcoin derivatives do exist. Short of implied volatility, we can still use a formal method to estimate historical volatility and potentially predict future volatility. One fairly straightforward way of doing this is to estimate a stochastic volatility model. Below the jump, I’ll describe the model, show how to quickly fit it in R, and display some nifty graphs as a result of estimating the model. If you’re just looking for the estimated or predicted volatilities from the model, for the R script that I used to actually fit the model, or for information on the stochvol R package I used in that script, scroll to the end of the post.
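To give a flavor of what fitting such a model looks like, here’s a minimal sketch using the stochvol package, where price stands in for a numeric vector of daily closing prices – this is not the script from the post, just an illustration:

require(stochvol)

## price: a numeric vector of daily closing prices (placeholder for the real data)
ret <- diff(log(price))              # log returns
ret <- ret - mean(ret)               # demean, since the basic SV model assumes a zero-mean series
fit <- svsample(ret, draws = 20000, burnin = 2000)  # fit the stochastic volatility model by MCMC
summary(fit)                         # posterior summaries of the model parameters
plot(fit)                            # estimated volatilities plus some diagnostics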


Why infinite utility is like the Spanish inquisition

Very often in philosophical discussions involving decision theory or ethics, someone will attempt to show that some action has infinite expected utility and try to draw some conclusions from this proposition. Or at least confuse the hell out of everyone. The most common culprit is (some versions of) Pascal’s wager, but the St. Petersburg paradox is another. What I want to do is convince you to be extremely skeptical of any conclusions that depend on something or another having infinite expected utility. To do this, we’re going to have to learn some of the nitty-gritty math of decision theory – yes, this is partially just an excuse to talk about decision theory. There won’t be much math, but when there is the details will be important. (I wrote that last sentence before I wrote the rest and let’s just say I was a bit optimistic.)


Generating random variates from a symmetric “signed” square root distribution.

Suppose you have some continuous random variable $X>0$ that is easy to sample from. Suppose you’re interested in sampling from another real valued random variable $Y$ where $X=Y^2$. In other words, $Y=\pm\sqrt{X}$. This is the signed square root of $X$ – signed because $Y$ is real valued and so can take on negative values. If the distribution of $Y$ is symmetric around zero, it turns out all you have to do in order to draw from $Y$ is draw from $X$ then flip a fair coin to choose the sign to give $Y$. This basic problem has come up twice in my research lately, and unsurprisingly once I abstracted away from the details of the distributions I was working with I came up with an elegant little proof.
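Here’s the recipe as a few lines of R, using a gamma distribution for $X$ purely as an example:

n <- 10000
x <- rgamma(n, shape = 2, rate = 1)          # any easy-to-sample X > 0 works here
sgn <- sample(c(-1, 1), n, replace = TRUE)   # flip a fair coin for the sign
y <- sgn * sqrt(x)                           # draws from the signed square root of X
hist(y, breaks = 50)                         # symmetric around zero, as claimed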

I’m Almost Sure it’s Probably Converging… Wait, What?

Every first year graduate student in statistics or economics has to learn the basics of probability and statistics theory. Most of this is fairly understandable – once you grasp the basic concepts behind probability and distribution theory, everything else is applications and details for the most part. There is at least one notable exception to this – the relationship between the various notions of convergence used in probability theory.

Now everyone learns that convergence almost surely implies convergence in probability which, in turn, implies convergence in distribution. Everyone also learns that none of the converse implications hold in general, but I don’t think anyone comes away really grasping the difference between all of these concepts – they learn the “that” but not the “why.” I’m guilty of this too. We memorize which is stronger then go on to use them to talk about central limit theorems, laws of large numbers, and properties of estimators. This is probably fine since these first year courses typically focus on getting comfortable using the main concepts of probability theory rather than on deep understanding – that’s for later courses. But some people can’t wait for those courses, so I’ll try to explain the difference both intuitively and with some math.
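For reference, here are the three modes of convergence for a sequence of random variables $X_n$ and a limit $X$ – these are just the standard definitions, included for convenience:

\[ X_n \xrightarrow{a.s.} X \iff P\left(\lim_{n\to\infty} X_n = X\right) = 1 \]

\[ X_n \xrightarrow{p} X \iff \lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0 \text{ for every } \varepsilon > 0 \]

\[ X_n \xrightarrow{d} X \iff \lim_{n\to\infty} F_{X_n}(x) = F_X(x) \text{ at every point } x \text{ where } F_X \text{ is continuous} \]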


Using R for Experimental Economics

I’m giving a short talk to the experimental economics class I’m taking on how to use R to do the most common tests. This is a short tutorial to get R and RStudio up and running; I assume you’ve already worked through it before my talk. If you’re looking for my slides from the talk or the dataset I worked with, they’re at the bottom of this page along with some other useful resources.

Let’s get started. First, go to R’s website, which should look something like this:

On the left hand side, click on the CRAN link under “Download, Packages” next to the arrow in the image above. This will bring up a page of links to download R from, like the image below. Choose a location close to you so that the download is faster.

This should bring you to a page that looks like the image below. Now click on the link to download R for whatever operating system you’re running. For example, if your computer is running Windows, click on “Download R for Windows.”

This should bring up another page with a few more links, like the picture below. Click on “base.”

Finally we get to the page with the actual download link, like below. Click on “Download R Z for X”, where X is whatever operating system you have and Z is the latest version of R, i.e. some number like “2.15.1”. At the time I created this post, it was “Download R 2.15.1 for Windows” since I was installing R on a Windows machine.

This will download the executable to install R – a file named “R-2.15.1-win.exe” or something similar. Make sure you save it somewhere you can find it easily. When it finishes, run the file. Just follow the on-screen instructions that pop up. You shouldn’t have to change anything in order for R to properly install.

Now you’re all set to start using R… except the GUI that R comes with out of the box isn’t very good. RStudio is a free IDE that improves on the base R GUI substantially. Go here to download it. Download the version of RStudio that their website recommends for your machine and save it somewhere you can easily find. Once the download completes, open the file – it should be called “RStudio-0.96.316.exe” or something similar. From this point, just follow the on-screen instructions to complete the installation.

Now we’ll install a couple of useful R packages. First we’ll install “ggplot2”, a package for creating statistical graphics that drastically improves upon R’s base graphics.

In order to install this, open up RStudio and make sure you’re connected to the internet. Then type install.packages("ggplot2") into the R console and hit enter, as below.

R will output the following message, or something similar:

Installing package(s) into ‘C:/Users/Matt/R/win-library/2.15’
(as ‘lib’ is unspecified)
— Please select a CRAN mirror for use in this session —

Wait for a few seconds, then R will give you some options for which mirror to download from. Type the number for the mirror that is closest to you to download everything faster, then press enter. See the picture below.

R should be pretty busy for a few minutes while it downloads and installs several packages that ggplot2 depends on as well as ggplot2 itself. Once it finishes and you see the blue “>” in the bottom of the R console, the packages are installed.

There’s another package you need to install using basically the same process. agricolae is a package that has a bunch of methods for agricultural statistics, but more importantly it has some useful nonparametric tests. To install it, type install.packages("agricolae") into the R console and follow the same process as before.
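To recap, here are both install commands, along with the library() calls you’ll need at the start of each R session to actually load the packages:

install.packages("ggplot2")    # only needs to be done once
install.packages("agricolae")

library(ggplot2)               # load the packages in each new R session
library(agricolae)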

In addition, I’ve uploaded a dataset that we’ll be using during my presentation. Download it here. Save it as diam.csv somewhere you can easily find it.
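Once it’s saved, reading the dataset into R looks something like this (assuming diam.csv is in R’s working directory; otherwise give the full path):

diam <- read.csv("diam.csv")   # read the dataset into a data frame
head(diam)                     # take a quick look at the first few rows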

That’s all the software you’ll need to be up and running! Here are a bunch of useful resources, including the slides from my talk:

Slides from the presentation. (pdf) Probably more useful while you’re sitting at your computer than while I was talking.

diam.csv. The dataset I used for most examples in my presentation.

An R script containing every R command from my presentation. Open with any text editor, though opening it in R Studio is best.

RStudio tutorial. (pdf) Useful for getting your bearings in R while using RStudio. It covers the basics of computation in R, including some stuff I didn’t cover, such as dealing with vectors and matrices.

An R script containing an old presentation. This one has many more details about the basics in R as well as using ggplot2, plus some stuff about quickly using Bayesian methods to fit models. Note: enter install.packages("arm") into the R console to use the Bayesian stuff.

An R reference card. (pdf) Print it out and tape it on the wall next to your desk. Seriously. Do it now.

The ggplot2 website. This contains useful information for making complicated, informative and pretty graphics using the ggplot2 package.

Course website for the class I took to learn R. Some overlap, but there are many new things.

Knitr. This is a fantastic way to integrate your computations from R into a nice compiled LaTeX file, and it’s relatively painless with RStudio.

Introduction to Bayesian Statistics Videos

My advisor, Jarad Niemi, is posting lectures on Bayesian statistics on his YouTube channel while teaching Stat 544 – the master’s/Ph.D. level introduction to Bayesian statistics at Iowa State University. I’ve taken a look at a few of them and they’re pretty good. Most are short (~10 minute) explanations of a particular topic intended to supplement required readings. Some of them, like the extended Metropolis within Gibbs example, are full lecture length, i.e. about one hour. Since the course is ongoing this semester, you can expect a few more to be posted.

The lectures do assume a rather high level of understanding of probability theory. The statistics students in the class have seen at least chapters 1–5 of Casella and Berger in some detail. Other students in the class have similar backgrounds, though perhaps not quite as strong. Some knowledge of R would also be useful to understand the more computationally centered lectures. While the videos might not be useful for everyone, they’re probably a great supplement if you’re learning some of this material elsewhere.

Prior distributions for covariance matrices: the scaled inverse-Wishart prior

At some point in a variety of Bayesian applications, we end up having to specify a prior distribution on a covariance matrix. The canonical example is a hierarchical regression model. Suppose we have a response vector $y$ along with some covariate $x$ where we expect the simple linear regression model to provide a decent fit. In addition, suppose the data is organized into distinct groups such that each group possibly has its own regression line, though we expect the lines to be related. In order to model this situation, we start with the data level:

\[y_{ij} | \alpha_j, \beta_j \stackrel{ind}{\sim} N(\alpha_j + x_{ij}\beta_j, \sigma^2)\]

where $j=1,…,J$ indicates groups and $i=1,…,I_j$ indicates observations within groups. In other words, each group has a simple linear regression with a common error variance $\sigma^2$. We further model the regression coefficients as coming from a common distribution:

\[\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \stackrel{iid}{\sim} N\left(\begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix}, {\mathbf \Sigma}\right)\]

In order to complete the model and run a Bayesian analysis, we need to add priors for all remaining parameters, including the covariance matrix ${\mathbf \Sigma}$ for a group’s regression coefficients. A popular prior for ${\mathbf \Sigma}$ is the inverse-Wishart distribution, but there are some problems with using it in this role. A post over at dahtah has the details, but essentially the problem is that under the standard “noninformative” version of the inverse-Wishart prior, which makes the marginal distribution of the correlations uniform, large standard deviations are associated with large absolute correlations. This isn’t exactly noninformative, and in addition it can have adverse effects on inference in some models where we want shrinkage to occur.

The question, then, is what prior distribution should we use instead? A paper (pdf) by Barnard, McCulloch and Meng argues for using a separation strategy: write ${\mathbf \Sigma}$ as $\mathbf{\Delta\Omega\Delta}$ where $\mathbf{\Delta}=\mathrm{diag}(\mathbf{\delta})$ is a diagonal matrix of standard deviations and ${\mathbf \Omega}$ is a correlation matrix. Then model ${\mathbf \Delta}$ and ${\mathbf \Omega}$ separately. There are a number of ways to do this, but Barnard et al. suggest using independent lognormal priors on the standard deviations in $\mathbf{\delta}$:

\[\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(\delta_0,\sigma^2_\delta)\]

while using the following family of densities on $\mathbf{\Omega}$:

\[ f_k(\mathbf{\Omega}|v) \propto |\mathbf{\Omega}|^{-\frac{1}{2}(v+k+1)}\left(\prod_i\omega^{ii}\right)^{-\frac{v}{2}} = |\mathbf{\Omega}|^{\frac{1}{2}(v-1)(k-1)-1}\left(\prod_i|\mathbf{\Omega}_{ii}|\right)^{-\frac{v}{2}} \]

where $\omega^{ii}$ is the $i$’th diagonal element of $\mathbf{\Omega}^{-1}$, $\mathbf{\Omega}_{ii}$ is the principal submatrix of $\mathbf{\Omega}$ obtained by deleting its $i$’th row and column, $k$ is the dimension of $\mathbf{\Omega}$ and $v$ is a tuning parameter. This density is obtained by transforming an inverse-Wishart random matrix, specifically $IW(v, \mathbf{I})$, into a correlation matrix. It turns out that $v=k+1$ results in uniform marginal distributions on the correlations in $\mathbf{\Omega}$.
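In R that transformation takes only a couple of lines, for example with riwish from the MCMCpack package (the same function used in the code at the end of this post):

require(MCMCpack)  ## for riwish

k <- 2                                    # dimension of the correlation matrix
Omega <- cov2cor(riwish(k + 1, diag(k)))  # an IW(k+1, I) draw rescaled to a correlation matrix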

An alternative, also based on the separation strategy, comes from a paper (pdf) by O’Malley and Zaslavsky and is endorsed by Gelman – instead of modeling ${\mathbf \Omega}$ as a correlation matrix, only constrain it to be positive definite, so that ${\mathbf \Delta}$ and ${\mathbf \Omega}$ jointly determine the standard deviations while ${\mathbf \Omega}$ alone still determines the correlations. This strategy uses the same lognormal prior on $\mathbf{\Delta}$ but uses the inverse-Wishart distribution $IW(k+1, \mathbf{I})$ on $\mathbf{\Omega}$, which still induces marginally uniform correlations on the resulting covariance matrix. This strategy is attractive since the inverse-Wishart distribution is already in a number of statistical packages and typically results in a conditionally conjugate model for the covariance matrix, allowing for a relatively simple analysis. If $\mathbf{\Omega}$ is constrained to be a correlation matrix as in Barnard et al., sampling from the conditional posterior requires significantly more work. So the scaled inverse-Wishart is much easier to work with, but in principle it still allows for some dependence between the correlations and the variances in $\mathbf{\Sigma}$.

I’ve used the scaled inverse-Wishart before, so I was curious about how much of a difference it makes to use Barnard et al.’s prior. The impact on inference of using different priors will vary from model to model, but we can at least see how different priors encode the same prior information. Since $f$ is just the density of the correlation matrix resulting from an inverse-Wishart distribution on the covariance matrix, it’s relatively easy to sample from $f$ – just sample from an inverse-Wishart and transform to correlations. So I specified Barnard et al.’s separation strategy prior (SS) and the scaled inverse-Wishart prior (sIW) in a similar way, sampled from both, and then compared them to a sample from a standard inverse-Wishart (IW). In the first test, I assumed that the covariance matrix was $2\times2$ and that $\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(0,1)$ for both the SS and sIW priors. The only difference was in $\mathbf{\Omega}$: for the SS prior I assumed that $\mathbf{\Omega}\sim f(\mathbf{\Omega}|v=3)$ and for the sIW prior I assumed $\mathbf{\Omega}\sim IW(3, \mathbf{I})$. In other words, for both priors I assumed that the correlations were marginally uniform. For the IW prior, I assumed that $\mathbf{\Sigma}\sim IW(3, \mathbf{I})$ to once again ensure marginally uniform correlations as a point of comparison. Taking 10,000 samples from each prior, we can see this behavior in the following plot of the correlations:

Histograms of correlation by prior

As expected, the correlations look uniform under all three priors. With either the SS or sIW priors, it’s possible to change this so that they favor either high correlation (closer to $1$ or $-1$) or low correlation (closer to $0$) by manipulating $v$ – see Barnard et al. or O’Malley & Zaslavsky for details. Next, we take a look at the log standard deviations:

Histograms of log standard deviations by prior and variance component

The histograms of SS and sIW look pretty similar, the only difference being that sIW has slightly fatter tails and a somewhat higher variance. In other words, given the same parameter choices sIW yields a slightly less informative prior. This isn’t a problem – once we understand the behavior of sIW vs. SS, we can set the priors on $\mathbf{\Delta}$ differently to encode the same prior information on the standard deviations. The priors as specified don’t appear to be satisfactorily noninformative, if that were the prior information we were trying to capture, but that can easily be fixed by increasing the variance of the prior. The IW prior, on the other hand, allows only a very narrow range of values for the standard deviation, and that range can only be changed by also changing the prior on the correlations – one of its main drawbacks. Finally, we look at the dependence between the first variance component and the correlation coefficient in the next plot:

Scatterplot of correlation coefficient vs. first variance component by prior

As expected, there’s no dependence between the correlation and the variance in the SS prior – the prior assumes they are independent. The sIW prior, on the other hand, exhibits the same disturbing dependence as the IW prior, also documented in Simon Barthelmé’s post at dahtah. High variances are associated with more extreme correlations in both the sIW and IW priors – the sIW prior doesn’t seem to improve on the IW prior at all in this respect.

I changed the prior on $\mathrm{log}(\mathbf{\delta})$ to have mean $(10, 0)$ and covariance matrix $\mathrm{diag}(.2, .2)$ reflecting a situation where we have a fairly strong prior belief that the two standard deviations are different from each other. Without commentary, the plots tell essentially the same story:

Histograms of correlation by prior

Histograms of log standard deviations by prior and variance component

Scatterplot of correlation coefficient vs. first variance component by prior

It doesn’t look great for the scaled inverse-Wishart. This prior is capable of encoding a wider range of prior information than the standard inverse-Wishart by allowing the modeler to separately model the correlations and the variances. However, the modeler isn’t able to control the prior dependence between the variances and the correlations – the level of dependence appears to be the same as in the standard inverse-Wishart prior. I suspect that for most researchers the problems this causes aren’t so bad when weighed against the computational issues that arise when trying to simulate from a correlation matrix, but it’s hard to tell without actually fitting models in order to determine the effect on shrinkage.

Finally, here’s some R code to generate the plots in this post:

require(MCMCpack) ## for riwish
require(MASS)     ## for mvrnorm
require(ggplot2)
require(reshape)
 
## simulates a single sample from the sIW prior
sIW.sim <- function(k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    R <- riwish(df, M)
    S <- diag(exp(mvrnorm(1,m, diag(s))))
    SRS <- S%*%R%*%S
    return(SRS)
}
 
## simulates n samples from the sIW prior
sIW.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- sIW.sim(k, m, s, df, M)
        sds <- sqrt(diag(E))
        rho[i] <- cov2cor(E)[2,1]
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the SS prior
SS.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        sds <- exp(mvrnorm(1,m, diag(s)))
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the IW prior
IW.test <- function(n, k, df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        vars <- diag(E)
        sig1[i] <- sqrt(vars[1])
        sig2[i] <- sqrt(vars[2])
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
 
}
 
n <- 10000
k <- 2
m <- c(0,0)
s <- c(1,1)
df <- 3
M <- diag(2)
 
 
sIWsam.1 <- sIW.test(n, k, m, s, df, M)
SSsam.1 <- SS.test(n, k, m, s, df, M)
IWsam.1 <- IW.test(n, k, df, M)
 
sIWsam.1$dens <- "sIW"
SSsam.1$dens <- "SS"
IWsam.1$dens <- "IW"
 
data.1 <- rbind(sIWsam.1, SSsam.1, IWsam.1)
 
qplot(value, data=data.1[data.1$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.1[data.1$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.1.melt <- melt(data.1, id=c("parameter", "dens", "sam"))
data.1.cast <- cast(data.1.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.1.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")
 
n <- 10000
k <- 2
m <- c(10,0)
s <- c(.2,.2)
df <- 3
M <- diag(2)
 
sIWsam.2 <- sIW.test(n, k, m, s, df, M)
SSsam.2 <- SS.test(n, k, m, s, df, M)
IWsam.2 <- IW.test(n, k, df, M)
 
sIWsam.2$dens <- "sIW"
SSsam.2$dens <- "SS"
IWsam.2$dens <- "IW"
 
 
data.2 <- rbind(sIWsam.2, SSsam.2, IWsam.2)
 
qplot(value, data=data.2[data.2$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.2[data.2$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.2.melt <- melt(data.2, id=c("parameter", "dens", "sam"))
data.2.cast <- cast(data.2.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.2.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")

Setting up R for CFAR minicamp

I’m giving a short “unconference” session on doing basic statistics with R at a Center for Applied Rationality minicamp I’m attending next week, and I’ll need participants to show up with R and RStudio installed on their laptops. This is a short tutorial for participants to follow in order to get everything up and running.

First, go to R’s website, which should look something like this:

On the left hand side, click on the CRAN link under “Download, Packages” next to the arrow in the image above. This will bring up a page of links to download R from, like the image below. Choose a location close to you so that the download is faster.

This should bring you to a page that looks like the image below. Now click on the link to download R for whatever operating system you’re running. For example, if your computer is running Windows, click on “Download R for Windows.”

This should bring up another page with a few more links, like the picture below. Click on “base.”

Finally we get to the page with the actual download link, like below. Click on “Download R 2.15.1 for X” where X is whatever operating system you have.

This will download the executable to install R – a file named “R-2.15.1-win.exe” or something similar. Make sure you save it somewhere you can find it easily. When it finishes, run the file. Just follow the on-screen instructions that pop up. You shouldn’t have to change anything in order for R to properly install.

Now you’re all set to start using R… except the GUI that R comes with out of the box isn’t very good. RStudio is a free IDE that improves on the base R GUI substantially. Go here to download it. Download the version of RStudio that their website recommends for your machine and save it somewhere you can easily find. Once the download completes, open the file – it should be called “RStudio-0.96.316.exe” or something similar. From this point, just follow the on-screen instructions to complete the installation.

That’s all the software you’ll need for my presentation! If you have some time it might be useful to poke around Rstudio to get a general feel for it, but you certainly don’t need to in order to understand my presentation. There are a few tutorials to guide your poking scattered about the web, including this one (warning: pdf).

EDIT: Here are a few more resources that will be useful. First, you’ll need an additional R package called arm – this package allows you to quickly fit Bayesian linear and generalized linear models. In order to install this, open up RStudio and make sure you’re connected to the internet. Then type install.packages("arm") into the R console, as below.

R will output the following message, or something similar:

Installing package(s) into ‘C:/Users/Matt/R/win-library/2.15’
(as ‘lib’ is unspecified)
— Please select a CRAN mirror for use in this session —

Wait for a few seconds, then R will give you some options for which mirror to download from. Type the number for the mirror that is closest to you to download everything faster, then press enter. See the picture below.

R should be pretty busy for a few minutes while it downloads and installs several packages that arm depends on as well as arm itself. Once it finishes and you see the blue “>” in the bottom of the R console, the packages are installed.

Furthermore, you’ll need another R package called “ggplot2”. You can install it using the same command: install.packages("ggplot2").

In addition, I’ve uploaded a dataset that we’ll be using during my presentation. Download it here. Save it as diam.csv somewhere you can easily find it.

So to be 100% ready for my presentation, you need to have installed R, R Studio, and the arm and ggplot2 packages, as well as have downloaded the data file diam.csv.
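Putting that together, here’s roughly what you’d run in the R console once the downloads are done (the last line assumes diam.csv is in R’s working directory):

install.packages("arm")        # only needs to be done once
install.packages("ggplot2")

library(arm)                   # load the packages in each new R session
library(ggplot2)

diam <- read.csv("diam.csv")   # read the dataset into a data frame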

Finally, here are a few useful resources for before, during, and after my presentation:
An R script containing my entire presentation. Open with any text editor, though opening it in R Studio is best.
An R reference card. (pdf) Print it out and tape it on the wall next to your desk. Useful for after the presentation.
The ggplot2 website. This contains useful information for making complicated and informative graphics. Useful after the presentation.
Course website for the class I took to learn R. Some overlap, but there are many new things. Useful for after the presentation.

Writing CUDA C extensions for R

Here’s a quick guide to writing CUDA C extensions for R so you can use a GPU for computation. I found another guide elsewhere, but it’s a bit roundabout. This guide uses a more direct method, though it’s arguably a little more involved.

First, we’ll look at writing a C extension and see how that generalizes. Let’s say we’re writing a simple program that uses C to add two numbers. If we were just writing a C program, it would look something like this:

#include <stdio.h>
#include <stdlib.h>
 
void add(double *a, double *b, double *c){
  *c = *a + *b;
  Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c);
}
 
int main(void){
  double *a, *b, *c;
  size_t dbytes = sizeof(double);
  a = (double *) malloc(dbytes);
  b = (double *) malloc(dbytes);
  c = (double *) malloc(dbytes);
 
  //code for user input omitted
 
  add(a, b, c);
 
  printf("\nc = %.4f\n", *c);
 
  return 0;
}

None of this should be new to anyone who’s programmed in C before. Now suppose we want to be able to call the function add from R. We need to change the function slightly to accommodate this:

#include <stdio.h>
#include <stdlib.h>
#include <R.h> 
//needed to interface with R
 
void add(double *a, double *b, double *c){
  *c = *a + *b;
 
  //similar to printf, except prints to R console
  Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c); 
 
}

There are a couple of twists here. First, we need to include the header R.h. This contains functions that allow R to interact with the C function, including Rprintf. Rprintf works almost exactly like printf except that it prints directly to the R console. In addition to including the header, any function that we want to call from R has to satisfy two constraints: it must have return type void and take only pointer arguments. Then to compile the program so that it’s callable from R, we type the following into the terminal:

R CMD SHLIB foo.c

where foo.c is the name of the file containing the code above. This command yields the following output in our terminal:

gcc -std=gnu99 -I/apps/lib64/R/include  -I/usr/local/include 
    -fpic -g -O2 -c foo.c -o foo.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o

These are the commands you would have entered if you wanted to compile the code manually. We’ll take a closer look at that later when we compile CUDA C code. For now though, just note that now in your working directory you have two new files: foo.o and foo.so. The latter is the file we’ll use to call add from R. To do this, we’ll need an R wrapper to call our C function transparently, e.g.:

add <- function(a,b){
  ##check to see if function is already loaded
  if(!is.loaded("add"))
    dyn.load("foo.so")
 
  c <- 0
  z <- .C("add",a=a,b=b,c=c)
  c <- z$c
  return(c)
}

There are a couple of elements here. First, the function dyn.load() loads the shared library file foo.so. is.loaded(), unsurprisingly, checks to see if its argument has already been loaded. Note that we load a file but check to see if a specific function has already been loaded. The next element is the .C function. There are other ways to call C functions from R, but .C is the easiest. The first argument of .C is the name of the C function you want to call in string form. The rest of the arguments are the arguments for the function you’re calling. .C returns a list containing the updated values of all of the arguments passed to the C function. R copies all of the arguments and passes them to C using .C – it does NOT simply update the value of c for us, so we have to copy it back from the list that .C returns. When we run this code, this is what we see:

> source("foo.r")
> add(1,1)
1 + 1 = 2
[1] 2

This is all simple enough, but what about when we want to call a function that runs on the GPU? Assuming that we are trying to run the same function, there are a couple of tweaks. Here’s the source:

#include <stdio.h>
#include <stdlib.h>
#include <R.h>
 
__global__ void add(float *a, float *b, float *c){
  *c = *b + *a;
}
 
extern "C" void gpuadd(float *a, float *b, float *c){
  float *da, *db, *dc;
 
  cudaMalloc( (void**)&da, sizeof(float) );
  cudaMalloc( (void**)&db, sizeof(float) );
  cudaMalloc( (void**)&dc, sizeof(float) );
 
  cudaMemcpy( da, a, sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy( db, b, sizeof(float), cudaMemcpyHostToDevice);
 
  add<<<1,1>>>(da, db, dc);
 
  cudaMemcpy(c, dc, sizeof(float), cudaMemcpyDeviceToHost);
 
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);
 
  Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c);
}

There are two noteworthy things here. First, we still need a plain C function that allocates memory on the GPU and launches the kernel (the GPU function) – we can’t call the kernel directly from R. Second, we need to make sure that R knows the name of the C function we want to call with .C. We do this with the extern "C" declaration, which tells the compiler to treat gpuadd as a C function so that in the shared library file (i.e. gpufoo.so) it’s still called “gpuadd.” Otherwise the compiler treats it as a C++ function and mangles its name, so we wouldn’t know what name to pass to .C. There’s also a quirk here – all of the variables are floats instead of doubles, since most GPUs only support single precision floating point math. This will affect how we write our R wrapper to call gpuadd. If your GPU supports double precision floating point operations you can ignore that part, but keep in mind that most GPUs don’t if you want to distribute your code widely. We’ll assume that the code above is saved in gpufoo.cu.

So that’s the code – how do we compile it? This time there isn’t a simple R command we can call from the terminal, but the output from compiling a C file (R CMD SHLIB file.c) gives us clues about how to do it manually:

gcc -std=gnu99 -I/apps/lib64/R/include  -I/usr/local/include 
    -fpic -g -O2 -c foo.c -o foo.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o

As I mentioned before, these are precisely the commands we would have used if we wanted to compile manually. I’ll go through them step by step. Starting with the first line, or the compiling step, gcc is the compiler we call on our source code. -std=gnu99 tells the compiler which standard of C to use; we’ll ignore this option. -I/... tells the compiler to look in the folder /... for any included headers. Depending on how your system has been set up, the particular folders where R automatically looks may be different from mine. Make a note of which folders are displayed here as you’ll need them later. These folders contain R.h among other headers. -fpic essentially tells the compiler to make the code suitable for a shared library – i.e. what we’ll be calling with dyn.load() from R. -g is just a debugging option, while -O2 tells the compiler to optimize the code nearly as much as possible. Neither is essential for our purposes though both are useful. Finally, -c tells the compiler to only compile and assemble the code – i.e. don’t link it, foo.c is the name of the source file, and -o foo.o tells the compiler what to name the output file.

The next line is the linking step. Here, -shared tells the linker that the output will be a shared library, while -L/... tells it where to find previously compiled libraries that this code may rely on. This is probably where the compiled R libraries live on your system. Again, use the path R outputs for you, not necessarily the path I have here. Finally, -o foo.so tells the linker what to name the output file and foo.o is the name of the object file that needs to be linked.

So now we need to use these two statements to construct similar nvcc commands to compile our CUDA C code. The big reveal first, then the explanation:

nvcc -g -G -O2 -I/apps/lib64/R/include -I/usr/local/include 
     -Xcompiler "-fpic" -c gpufoo.cu gpufoo.o
nvcc -shared -L/usr/local/lib64 gpufoo.o -o gpufoo.so

-g and -O2 do the same thing here as before, with the caveat that they only apply to code that runs on the host (i.e. not on the GPU). That is to say, -g generates debugging information for only the host code and -O2 optimizes only the host code. -G, on the other hand, generates debugging information for the device code, i.e. the code that runs on the GPU. So far, nothing different from before. The wrinkle is in this component: -Xcompiler "-fpic". This tells nvcc to pass the arguments in quotes on to the C compiler, i.e. to gcc. That argument is exactly the same as above, as are the rest of the arguments outside the quotes. The link step is basically identical to before as well.

In reality, there’s no need for some of the commands in both the compile and link step. An alternative would be

nvcc -g -G -O2 -I/apps/lib64/R/include -Xcompiler 
     "-Wall -Wextra -fpic" -c gpufoo.cu gpufoo.o
nvcc -shared -lm gpufoo.o -o gpufoo.so

This version removes a path from both the compile and the link step because nothing in those folders is relevant to compiling the above program – at least on my system. In the compile step I added two arguments to be passed to the compiler: -Wall and -Wextra. These tell the compiler to show warnings for things in our code that commonly cause errors – very useful for preventing bugs. Finally, in the link step I added the option -lm. In general, -lname links the library named “name.” In this case it links the math library, which we would be using if we had #include <math.h> in our source file. If, for example, we were using NVIDIA’s CUBLAS library we would need -lcublas here.

Now in order to call this function from R, our wrapper needs to be slightly different:

gpuadd <- function(a,b){
  if(!is.loaded("gpuadd"))
    dyn.load("gpufoo.so")
 
  ##tell R to convert to single precision before copying to .C
  c <- single(1)
  mode(a) <- "single"
  mode(b) <- "single"
 
  z <- .C("gpuadd",a,b,c=c)
  c <- z$c
 
  ##Change to standard numeric to avoid dangerous side effects
  attr(c,"Csingle") <- NULL 
  return(c)
}

The essential difference here is that .C‘s arguments need to have the “Csingle” attribute so that it knows to copy them as floats instead of doubles. (Similarly, integer arguments need to be stored as integers, e.g. via as.integer, so that .C copies them as ints.) Finally, this attribute needs to be removed before returning the result to avoid some dangerous side effects – which is why it’s stripped right before the function returns its output.
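Assuming the wrapper above is saved in gpufoo.r and the compiled gpufoo.so sits in the same directory, calling it from R should look essentially like the CPU version from earlier:

> source("gpufoo.r")
> gpuadd(1, 1)
1 + 1 = 2
[1] 2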