Update: Code added below to simulate from the truncated Dirichlet distribution, along with an example.

Start with the beta distribution, which is the two-component special case of the Dirichlet. The density you’re after is

$$p(x) \propto x^{\alpha-1}(1-x)^{\beta-1}\mathbf{1}(a\leq x\leq b)$$

for $0\leq a < b \leq 1$ and $\alpha,\beta >0$. One approach, called rejection sampling, is to simulate a bunch of random draws from the unconstrained beta distribution and only keep those that fall between $a$ and $b$. This is often extremely inefficient, however, when the constraints force $x$ to be in an area of the parameter space where the density puts very little mass.
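To make the inefficiency concrete, here is a minimal R sketch of the rejection approach. The function name `rtruncbeta_reject` and the `batch` argument are hypothetical choices for this illustration, not part of any package; the function reports the acceptance rate alongside the draws.

```r
# Rejection sampler for a Beta(alpha, beta) truncated to [a, b].
# rtruncbeta_reject is a hypothetical helper written for this sketch.
rtruncbeta_reject <- function(n, alpha, beta, a, b, batch = 1000) {
  stopifnot(a >= 0, b <= 1, a < b, alpha > 0, beta > 0)
  kept <- numeric(0)
  proposed <- 0
  while (length(kept) < n) {
    draws <- rbeta(batch, alpha, beta)               # unconstrained proposals
    kept <- c(kept, draws[draws >= a & draws <= b])  # keep those inside [a, b]
    proposed <- proposed + batch
  }
  list(x = kept[seq_len(n)], acceptance = length(kept) / proposed)
}

set.seed(1)
easy <- rtruncbeta_reject(1000, 2, 3, 0.2, 0.6)   # plenty of mass in [0.2, 0.6]
hard <- rtruncbeta_reject(1000, 2, 20, 0.3, 0.6)  # very little mass in [0.3, 0.6]
c(easy = easy$acceptance, hard = hard$acceptance)
```

In the second call well under 1% of the proposals land in the interval, so almost all of the work is thrown away.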

Another approach is called CDF inversion. The CDF of this truncated beta distribution is

$$G(x) = \frac{F(x;\alpha,\beta) - F(a;\alpha,\beta)}{F(b;\alpha,\beta) - F(a;\alpha,\beta)}$$

where $F(x;\alpha,\beta)$ is the CDF of the unconstrained beta distribution — many statistical packages have routines for computing this CDF and its inverse, which is nontrivial. In order to use the CDF inversion technique we need only draw a single uniformly distributed random variable from 0 to 1, $u$, then set

$$x=G^{-1}(u) = F^{-1}\left(F(a;\alpha,\beta) + [F(b;\alpha,\beta) - F(a;\alpha,\beta)]u; \alpha,\beta\right).$$

This technique is only as good as the numerical routines we have for $F$ and $F^{-1}$, and for the beta distribution those routines aren’t accurate enough to rely on CDF inversion in all possible cases – you will run into numerical problems.
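A minimal sketch of CDF inversion in R, assuming the truncated CDF is $G(x) = [F(x) - F(a)]/[F(b) - F(a)]$ on $[a,b]$ and using base R's `pbeta` and `qbeta`. The function name `rtruncbeta_inv` is hypothetical. The last line illustrates one way the numerical problems show up.

```r
# CDF inversion for a Beta(alpha, beta) truncated to [a, b].
# Solves G(x) = u, i.e. F(x) = F(a) + u * (F(b) - F(a)).
rtruncbeta_inv <- function(n, alpha, beta, a, b) {
  stopifnot(a >= 0, b <= 1, a < b)
  Fa <- pbeta(a, alpha, beta)
  Fb <- pbeta(b, alpha, beta)
  u <- runif(n)
  qbeta(Fa + u * (Fb - Fa), alpha, beta)
}

set.seed(1)
x <- rtruncbeta_inv(1000, 2, 3, 0.2, 0.6)
range(x)

# Far in the upper tail the two CDF values are indistinguishable in double
# precision, so the interval [F(a), F(b)] collapses and inversion fails.
pbeta(0.99, 2, 60) == pbeta(0.999, 2, 60)
```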

Enter Damien and Walker – they use a relatively simple technique that only requires that you add a latent variable and give up exact sampling for an MCMC algorithm that converges in distribution to the truncated beta distribution we are after. The cost of moving from exact sampling to MCMC is small in my case since I want to draw from these truncated Dirichlets in the context of MCMC. But back to the beta distribution first. We add a latent variable $y$ and define the joint density of $x$ and $y$:

$$p(x,y) \propto x^{\alpha-1}\mathbf{1}\left(0\leq y \leq (1-x)^{\beta-1}, a \leq x \leq b\right)$$

Integrating out $y$ will yield the original density for $x$, $p(x)$. Now we construct the following Gibbs sampler consisting of 2 steps:

1. Draw $y$ from a uniform distribution on the set $\left(0,(1-x)^{\beta-1}\right)$.

2. Draw $x$ using the CDF inversion technique from

$$x^{\alpha-1}\mathbf{1}\left(a\leq x \leq \min\left(1-y^{\frac{1}{\beta-1}}, b\right)\right)$$

if $\beta > 1$ and

$$x^{\alpha-1}\mathbf{1}\left(\max\left(a, 1 - y^{\frac{1}{\beta-1}}\right)\leq x \leq b\right)$$

if $\beta < 1$.

The CDF inversion technique works well for densities proportional to $x^{\alpha-1}$ because we can write down the CDF in closed form. What if $\beta=1$? Then you could use the CDF inversion directly without introducing the latent variable.
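The two steps can be sketched in R as follows. A density proportional to $x^{\alpha-1}$ on $[l,h]$ has CDF $(x^\alpha - l^\alpha)/(h^\alpha - l^\alpha)$, so a draw is $x = (l^\alpha + u(h^\alpha - l^\alpha))^{1/\alpha}$ for uniform $u$. The function name `rtruncbeta_gibbs` and the default starting value are hypothetical choices for this sketch.

```r
# Gibbs sampler for Beta(alpha, beta) truncated to [a, b], beta != 1.
# rtruncbeta_gibbs is a hypothetical name used only in this sketch.
rtruncbeta_gibbs <- function(N, alpha, beta, a, b, xstart = (a + b) / 2) {
  stopifnot(beta != 1, a >= 0, b <= 1, a < b, xstart > a, xstart < b)
  x <- numeric(N)
  xcur <- xstart
  for (t in seq_len(N)) {
    # Step 1: y | x ~ Uniform(0, (1 - x)^(beta - 1))
    y <- runif(1, 0, (1 - xcur)^(beta - 1))
    edge <- 1 - y^(1 / (beta - 1))
    # Step 2: x | y has density propto x^(alpha - 1) on [lo, hi]
    if (beta > 1) {
      lo <- a
      hi <- min(edge, b)
    } else {
      lo <- max(a, edge)
      hi <- b
    }
    u <- runif(1)
    xcur <- (lo^alpha + u * (hi^alpha - lo^alpha))^(1 / alpha)
    x[t] <- xcur
  }
  x
}

set.seed(1)
draws <- rtruncbeta_gibbs(5000, alpha = 2, beta = 5, a = 0.3, b = 0.8)
mean(draws)  # compare against numerical integration of the truncated density
```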

If you run this algorithm for many iterations, the marginal distribution of the last draw will converge to the truncated beta distribution we want to draw from. More importantly, the sample mean of any function of $x$ will converge to the expectation of that function of $x$, i.e.

$$\sum_{i=1}^N\frac{h(x_i)}{N} \to \int_0^1h(x)p(x)dx$$

by the ergodic theorem, the Markov chain version of the law of large numbers.

This is great and all, but I really need to sample from a truncated Dirichlet distribution. A $k$ dimensional Dirichlet distribution has the density

$$p(x_1, x_2, \ldots, x_{k-1}) \propto x_1^{\alpha_1-1} x_2^{\alpha_2-1} \cdots x_{k-1}^{\alpha_{k-1}-1}(1-x_1-\cdots-x_{k-1})^{\alpha_k-1}$$

for $x_1, x_2,\ldots,x_{k-1} \geq 0$ with $x_1+\cdots+x_{k-1}\leq 1$, and $\alpha_1,\alpha_2,\ldots,\alpha_k > 0$. I’ll use the 4 dimensional case just to keep things relatively simple. The truncated version of this distribution I want to sample from is

$$p(x_1,x_2,x_3) \propto x_1^{\alpha_1-1}x_2^{\alpha_2-1}x_3^{\alpha_3-1}(1-x_1-x_2-x_3)^{\alpha_4-1}\mathbf{1}(x\in B)$$

where the set $B$ is a subset of $[0,1]^3$. The easiest way to apply Damien & Walker’s method is to add a latent variable $y$ with the joint distribution

$$p(x,y) \propto x_1^{\alpha_1-1}x_2^{\alpha_2-1}x_3^{\alpha_3-1}\mathbf{1}\left(0 \leq y \leq (1-x_1-x_2-x_3)^{\alpha_4-1}, x \in B \right).$$

Now the Gibbs sampler is straightforward:

1. Draw $y$ from a uniform distribution on $\left(0,(1-x_1-x_2-x_3)^{\alpha_4-1}\right)$.

2. For $i = 1, 2, 3$ use the CDF inversion technique to draw $x_i$ from the distribution

$$p(x_i|y, x_{-i}) \propto x_i^{\alpha_i - 1}\mathbf{1}\left(a_i(x_{-i}) \leq x_i \leq \min\left( 1 - y^{\frac{1}{\alpha_4-1}} - \sum_{j\neq i} x_j, b_i(x_{-i})\right)\right)$$

if $\alpha_4>1$ and

$$p(x_i|y, x_{-i}) \propto x_i^{\alpha_i - 1}\mathbf{1}\left(\max\left( 1 - y^{\frac{1}{\alpha_4-1}} - \sum_{j\neq i} x_j, a_i(x_{-i})\right) \leq x_i \leq b_i(x_{-i})\right)$$

if $\alpha_4< 1$ and, again, use CDF inversion directly if $\alpha_4=1$. Here I’m assuming that the set $B$ constrains each $x_i$ to an interval $(a_i(x_{-i}), b_i(x_{-i}))$ conditional on the other $x_j$’s. If that’s not the case, it may take a little more work to make sure the CDF inversion technique works correctly. In the simplest case where $B$ is a hypercube, $a_i(x_{-i})=a_i$ and $b_i(x_{-i})=b_i$.

Then once again the marginal distribution of $(x_1,x_2,x_3)$ will converge to the target truncated Dirichlet and the sample mean of functions of $(x_1,x_2,x_3)$ will converge to their expectations.

The only annoying thing about this algorithm is the number of Gibbs steps – each $x_i$ is drawn in a separate Gibbs step conditional on all of the others. If we could draw some of the $x_i$’s jointly, this would speed up the convergence of the algorithm. If the goal is just to sample from a particular truncated Dirichlet, this probably won’t be much of a problem – we can run the algorithm long enough for it not to matter. For large enough $k$ or in the context of a larger MCMC sampler, it might become very slow. Though very slow is perhaps still better than the alternatives.

Update: I’ve added some R code below for simulating from an arbitrary truncated Dirichlet distribution using the Gibbs sampler constructed in this post. Use rtruncdirichletiter to run a single iteration within the context of a larger Gibbs sampler, and rtruncdirichlet to run a Gibbs sampler for the truncated Dirichlet itself.

    ## produces a random draw from a density proportional to x^(a-1),
    ## truncated to the interval (lb, ub)
    ## a: exponent, density is propto x^(a-1) so the cdf is propto x^a
    ## lb: lower bound, < ub
    ## ub: upper bound, > lb
    ## uses inverse cdf method
    ## returns a single real number
    rtruncexp <- function(a, lb, ub){
      u <- runif(1, 0, 1)
      out <- (u*(ub^a - lb^a) + lb^a)^(1/a)
      return(out)
    }

    ## produces an iteration of a MCMC sampler for a truncated Dirichlet distribution
    ## xold: k dimensional vector, last draw of x_1,...,x_k
    ##       where x_{k+1} = 1 - x_1 - ... - x_k
    ## as: k+1 dimensional vector, dirichlet parameters
    ## ubfun: function of the form ubfun(x, i) for i=1,2,...,k
    ##        computes the upper bound of x_i given x_{-i}
    ##        returns a real number between 0 (exclusive) and 1 (inclusive)
    ## lbfun: function of the form lbfun(x, i) for i=1,2,...,k
    ##        computes the lower bound of x_i given x_{-i}
    ##        returns a real number between 0 (inclusive) and 1 (exclusive)
    ## returns a vector whose first element is a latent variable, y, used in
    ## the algorithm, and whose last k elements are x_1,...,x_k.
    rtruncdirichletiter <- function(xold, as, lbfun, ubfun){
      k <- length(xold)
      alast <- as[k+1]
      xnew <- xold
      if(alast > 1){
        ynew <- runif(1, 0, (1 - sum(xold))^(alast-1))
        for(i in 1:k){
          nb <- 1 - sum(xnew[-i]) - ynew^(1/(alast-1))
          ## as[i], not as[1]: each x_i uses its own dirichlet parameter
          xnew[i] <- rtruncexp(as[i], lbfun(xnew,i), min(ubfun(xnew, i), nb))
        }
      }
      if(alast < 1){
        ynew <- runif(1, 0, (1 - sum(xold))^(alast-1))
        for(i in 1:k){
          nb <- 1 - sum(xnew[-i]) - ynew^(1/(alast-1))
          xnew[i] <- rtruncexp(as[i], max(nb, lbfun(xnew,i)), ubfun(xnew, i))
        }
      }
      if(alast == 1){
        ynew <- 1
        for(i in 1:k){
          xnew[i] <- rtruncexp(as[i], lbfun(xnew,i), ubfun(xnew, i))
        }
      }
      return(c(ynew, xnew))
    }

    ## produces N iterations of an MCMC sampler for a truncated Dirichlet distribution
    ## N: number of iterations
    ## as: k+1 dimensional vector, dirichlet parameters
    ## lbfun, ubfun: bound functions, as in rtruncdirichletiter
    ## xstart: k dimensional vector, starting values of x_1,...,x_k
    ## returns a matrix whose first column is a latent variable, y, used in
    ## the algorithm, and whose last k+1 columns are x_1,...,x_k, x_{k+1}.
    rtruncdirichlet <- function(N, as, lbfun, ubfun, xstart){
      k <- length(xstart)
      out <- matrix(0, ncol=k+2, nrow=N)
      xold <- xstart
      colnames(out) <- c("y", paste0("x", 1:(k+1)))
      for(i in 1:N){
        olds <- rtruncdirichletiter(xold, as, lbfun, ubfun)
        xold <- olds[-1]
        out[i,] <- c(olds, 1 - sum(xold))
      }
      return(out)
    }

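This should work for any truncated Dirichlet distribution where you can express the region $x_i$ is constrained to as an interval whose endpoints are functions of $x_{-i}$, for $i=1,2,\ldots,k$. You have to supply the functions lbfun(x, i) and ubfun(x, i), which return the lower and upper bounds of the $i$'th element of $x$ given the full vector $x$. For example, if the constraint is $x_1 < x_2 < \cdots < x_k < x_{k+1}$ you can use the functions below. Because $x_{k+1} = 1 - x_1 - \cdots - x_k$ moves whenever any $x_i$ moves, the upper bound has to keep $x_k < x_{k+1}$ as well as $x_i < x_{i+1}$:

    ubfun <- function(x, i){
      k <- length(x)
      rest <- 1 - sum(x[-i])        ## mass shared by x_i and x_{k+1}
      if(i == k){
        return(rest/2)              ## x_k < x_{k+1} = rest - x_k
      }
      ## x_i < x_{i+1} and x_k < x_{k+1} = rest - x_i
      return(min(x[i+1], rest - x[k]))
    }
    lbfun <- function(x, i){
      xs <- c(0, x)                 ## x_1 is bounded below by 0
      return(xs[i])                 ## x_i is bounded below by x_{i-1}
    }

For example, suppose we have a $\mathcal{D}(10, 15, 28, 10)$ distribution truncated so that $x_1<x_2<x_3<x_4$ (recall that $x_4 = 1 - x_1 - x_2 - x_3$). The starting point has to satisfy the constraint, so we can start the chain at $x_1=.1$, $x_2=.2$, and $x_3=.3$ (which gives $x_4=.4$) and obtain a Markov chain with 10,000 iterations as follows:

    N <- 10000
    as <- c(10, 15, 28, 10)
    xstart <- c(.1, .2, .3)
    out <- rtruncdirichlet(N, as, lbfun, ubfun, xstart)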

And a quick look at the trace plots to check for convergence:

    library(MCMCpack) ## for traceplot and mcmc functions
    par(mfrow=c(3,2))
    traceplot(mcmc(out))

which yields the plots below -- at a glance, it appears to converge quickly and mix fairly well. At least for this particular distribution.


**Stochastic volatility**

A common phenomenon in economic time series is something called volatility clustering. Consider the price of your favorite stock. Often the price of this stock will hardly move at all for days at a time, then after some major announcement the stock jumps around a lot before settling back in to a more stable pattern. In other words, the size of the price change today tends to predict the size of the price change tomorrow, but the *sign* of today’s change doesn’t necessarily tell you much about the sign of tomorrow’s change. Stochastic volatility models are one of several classes of models which try to capture this sort of behavior.

Suppose we have some time series $\{y_t\}$. Then the stochastic volatility model assumes that each $y_t$ is drawn from a mean zero normal distribution with its own variance. Conditional on these variances, the model assumes that the draws are independent. In other words

$$
y_t|\sigma^2_{1:T} \stackrel{ind}{\sim} N(0,\sigma^2_t).
$$

Next the model assumes some sort of evolution process for $\sigma^2_t$. Since the $\sigma^2_t$’s are variances, this is called a volatility process. Theoretically there’s a ton of stuff you could use for the volatility process, but practically speaking something relatively simple would be nice. The stochastic volatility model uses one of the simplest processes we could choose here – a stationary AR(1) process on the *log* variances. Let $h_t = \log(\sigma^2_t)$. Then

$$
h_t|h_{t-1} \sim N(\mu + \phi(h_{t-1} - \mu), \sigma^2_h)
$$

where $\mu$ is the average value of the $h_t$ process, $\sigma^2_h$ is the conditional variance of $h_t$ given $h_{t-1}$, and $\phi$ is the first order autocorrelation of the $h_t$ process, i.e. the correlation between $h_t$ and $h_{t-1}$. This leaves us with three unknown parameters, $\mu$, $\phi$, and $\sigma^2_h$, and a latent process $\{h_t\}$, to estimate. In order to initialize the $h_t$’s, the model assumes that $h_0$ is drawn from the stationary distribution of the volatility process, i.e.

$$
h_0 \sim N\left(\mu, \frac{\sigma_h^2}{1-\phi^2}\right).
$$

With everything put together, the model reads

$$
\begin{aligned}
y_t|h_{1:T} &\stackrel{ind}{\sim} N\left(0,e^{h_t}\right)\\
h_t|h_{t-1} &\sim N\left(\mu + \phi(h_{t-1} - \mu), \sigma^2_h\right)\\
h_0 &\sim N\left(\mu, \frac{\sigma_h^2}{1-\phi^2}\right).
\end{aligned}
$$
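One way to get a feel for the model is to simulate from it. The R sketch below draws $h_0$ from the stationary distribution, runs the AR(1) recursion on the log variances, and then draws each $y_t$ with standard deviation $e^{h_t/2}$. The function name `simulate_sv` and the parameter values are hypothetical, chosen only for illustration; `sigma_h` is the standard deviation $\sqrt{\sigma^2_h}$.

```r
# Simulate n_obs observations from the stochastic volatility model.
# simulate_sv is a hypothetical helper; sigma_h is a standard deviation.
simulate_sv <- function(n_obs, mu, phi, sigma_h) {
  stopifnot(abs(phi) < 1, sigma_h > 0)
  h <- numeric(n_obs)
  h_prev <- rnorm(1, mu, sigma_h / sqrt(1 - phi^2))  # h_0 from the stationary law
  for (t in seq_len(n_obs)) {
    h[t] <- rnorm(1, mu + phi * (h_prev - mu), sigma_h)
    h_prev <- h[t]
  }
  y <- rnorm(n_obs, mean = 0, sd = exp(h / 2))       # variance of y_t is exp(h_t)
  list(y = y, h = h)
}

set.seed(1)
sim <- simulate_sv(500, mu = -9, phi = 0.95, sigma_h = 0.3)
plot(sim$y, type = "l")  # quiet stretches and noisy stretches cluster together
```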

Now it may seem like a major restriction that the stochastic volatility model assumes that the data is mean zero – prices aren’t mean zero after all and in fact might have a trend. But typically this model is fit on the demeaned log returns, i.e. let

$$
\tilde{y}_t = \log\left(\frac{p_t}{p_{t-1}}\right) = \log(p_t) - \log(p_{t-1})
$$

then

$$
y_t = \tilde{y}_t - \frac{1}{T}\sum_{s=1}^T\tilde{y}_s.
$$

This is usually enough to ensure that the time series is not only trendless but mean zero. If you want to also estimate the mean return, you can simply add that as a parameter in the model – specifically as the mean of the normal distribution from which the $y_t$’s are drawn.
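In R the transformation is a one-liner. This sketch takes the log return to be $\log(p_t) - \log(p_{t-1})$; the function name `demeaned_log_returns` and the prices are made up for illustration.

```r
# Demeaned log returns from a vector of prices (hypothetical helper).
demeaned_log_returns <- function(p) {
  stopifnot(length(p) >= 2, all(p > 0))
  r <- diff(log(p))  # log(p_t) - log(p_{t-1})
  r - mean(r)        # subtract the sample mean return
}

prices <- c(100, 105, 98, 110, 107)
y <- demeaned_log_returns(prices)
length(y)  # one fewer observation than there are prices
mean(y)    # numerically zero
```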

**Fitting the model**

We’ll fit the model using Bayesian methods. There are advantages and disadvantages to this approach, but one major advantage is how easy it is to compute estimates and credible intervals for any quantity we are interested in. In order to do so, we need a prior distribution for $(\mu,\phi,\sigma^2_h)$. Choosing good priors is part art and part science, and not necessarily easy. Here, I’ll defer to the experts on this particular model. Gregor Kastner and Sylvia Frühwirth-Schnatter have a great paper (pdf) explaining a computational method for fitting the model, and Kastner has written an R package (stochvol) that implements this method with an excellent tutorial (pdf).

The basic guidance that Kastner gives is that if you want a noninformative prior in the sense that a small change in the prior doesn’t change the results much, then there are good options for $\mu$ and $\sigma_h^2$ but not really for $\phi$. He suggests putting independent priors on each of the parameters, which is common in Bayesian applications. In particular he suggests a normal prior on $\mu$ with a mean of zero and a large ($\geq 100$) variance. For $\sigma_h^2$ Kastner suggests a fun prior – putting a mean zero normal distribution on $\pm \sqrt{\sigma_h^2}$, which is sort of like specifying the prior on the standard deviation but not really since the prior allows the parameter to be negative. This is equivalent to a $Gamma(1/2, 1/(2Q))$ prior directly on $\sigma_h^2$ where $Q$ is the prior variance of $\pm \sqrt{\sigma_h^2}$. This prior is chosen primarily because it makes computation fairly straightforward (see the Kastner & Frühwirth-Schnatter paper linked above).

Edit: Gregor mentions in the comments that the real motivation for the gamma prior is that it avoids some well known problems with the most computationally convenient prior – the inverse gamma prior. This (pdf) paper by Gelman explains the issue pretty well.
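For the record, here is a short derivation of the normal/gamma equivalence used for the variance prior. It assumes the gamma distribution is written in its shape/rate parameterization, which is the one under which the stated parameters work out.

$$
\begin{aligned}
z = \pm\sqrt{\sigma_h^2} \sim N(0, Q)
&\;\Longrightarrow\; \frac{z^2}{Q} \sim \chi^2_1 = Gamma\left(\tfrac{1}{2}, \tfrac{1}{2}\right)\\
&\;\Longrightarrow\; \sigma_h^2 = z^2 \sim Gamma\left(\tfrac{1}{2}, \tfrac{1}{2Q}\right),
\qquad E\left[\sigma_h^2\right] = \frac{1/2}{1/(2Q)} = Q.
\end{aligned}
$$

The second line uses the fact that multiplying a gamma random variable by $Q$ divides its rate by $Q$.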

In practice, the particular choice of variance for each of these normal priors doesn’t tend to impact inference much so long as their variances are both large enough, but the prior on the autocorrelation parameter $\phi$ is a different story. Since we want the volatility process to be stationary, this forces $-1 < \phi < 1$. A common prior on bounded parameters such as this is a transformed beta distribution. The usual beta distribution is a probability distribution on the unit interval, $(0,1)$. If you want a beta distribution for a parameter $y$ on the interval $(a,b)$ instead, then putting the usual beta distribution on $(y-a)/(b-a)$ will do the trick. So here, $(\phi + 1)/2$ will have a beta prior that depends on two hyperparameters that we must specify, $\alpha$ and $\beta$. It turns out that posterior inference for $\phi$ is highly sensitive to the choice of $\alpha$ and $\beta$, and in some cases the posterior distribution may be identical to the prior. Kastner cites another paper which suggests using $\alpha=20$ and $\beta=1.5$, which implies a prior mean of $0.86$ for $\phi$, a prior standard deviation of $0.11$, and that it’s extremely unlikely that $\phi < 0$. The default in Kastner’s R package sets $\alpha=5$ instead which makes the prior mean of $\phi$ only about $0.54$ with a much larger prior standard deviation of $0.31$. We’ll use these hyperparameters as a default option since a prior expected level of persistence of about $0.5$ seems roughly right to me (in the dark) and I assume that Kastner set it as the default for a reason. A serious analysis should put some careful thought into what these values should be though, and I’m certainly open to arguments.

We’ll set the prior variance of $\mu$, the mean of the volatility process, to 100, and the prior variance of $\pm\sqrt{\sigma_h^2}$ to 10, which are good default values according to Kastner. So to summarize the prior on $(\mu,\sigma_h^2,\phi)$, we’re assuming that they’re independent a priori and that

$$
\begin{aligned}
\mu &\sim N(0,100)\\
\pm\sqrt{\sigma_h^2} &\sim N(0,10)\\
(\phi + 1)/2 &\sim Beta(5, 1.5).
\end{aligned}
$$
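The prior moments of $\phi$ quoted above follow from the moments of the beta distribution and the fact that $\phi = 2B - 1$ when $B = (\phi+1)/2$. A small R sketch of the arithmetic (the function name `phi_prior_moments` is hypothetical):

```r
# Prior mean, sd, and P(phi < 0) when (phi + 1) / 2 ~ Beta(alpha, beta).
phi_prior_moments <- function(alpha, beta) {
  m <- alpha / (alpha + beta)                                   # mean of B
  v <- alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1))   # variance of B
  c(mean = 2 * m - 1,                      # E[phi] = 2 E[B] - 1
    sd = 2 * sqrt(v),                      # sd(phi) = 2 sd(B)
    p_negative = pbeta(0.5, alpha, beta))  # P(phi < 0) = P(B < 1/2)
}

round(phi_prior_moments(20, 1.5)[c("mean", "sd")], 2)  # 0.86 and 0.11
round(phi_prior_moments(5, 1.5)[c("mean", "sd")], 2)   # 0.54 and 0.31
phi_prior_moments(20, 1.5)["p_negative"]
phi_prior_moments(5, 1.5)["p_negative"]
```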

The actual fitting of the model requires an MCMC (Markov chain Monte Carlo) sampler in order to approximately sample from the joint posterior distribution of the parameters $(\mu,\sigma_h^2,\phi)$ and the latent volatility process $\{h_t\}$. Constructing a good MCMC sampler isn’t always easy, but luckily, Kastner’s R package, stochvol, implements the MCMC strategy that he and Frühwirth-Schnatter developed in the paper I linked above, so we don’t have to put any thought into it. If you’re into the nuts and bolts of MCMC, I recommend reading their paper – the strategy is pretty cutting edge and uses several ideas from different strands of the literature in order to solve several very commonly encountered problems when fitting these sorts of models.

**Results**

To fit the model, I used the stochvol R package to obtain 100,000 approximate draws from the posterior distribution after a burn in of 10,000 draws. This only took about a minute on my laptop. Here is an R script that contains all of the commands I used to load the data into R, fit the model, create the plots I’m about to show you, and a few other things.

First, let’s compare Eli’s btcvol, which uses a 30 day window to compute sample standard deviations, to the estimated daily standard deviations from the stochastic volatility model.

The red is btcvol while the black is an estimate of the daily volatility using the stochastic volatility model. Strictly speaking, we’re using the median of the posterior distribution of a given day’s volatility (i.e. standard deviation) as a point estimate of the true volatility that day. The posterior mean is another commonly used point estimate, but I prefer the median because it emphasizes that, short of using a realistic loss/utility function, we care about the whole posterior distribution and not just its mean. The grey lines combine to create a 90% credible interval for the daily volatilities. In other words, the posterior probability that a given day’s volatility is inside the interval created by the gray lines on that day is 0.9. This gives us some idea of how certain we are about the level of volatility on a given date which I think is a significant advantage of using this method to measure volatility.

The most obvious thing about this graph is that btcvol is much more persistent – when volatility is suddenly really high on Tuesday, btcvol assumes that volatility is also high for the next month or so. On the other hand, the estimates from the stochastic volatility model jump fairly freely so that today’s volatility can be drastically different from yesterday’s volatility. This is a direct consequence of how btcvol is calculated. If the log return of bitcoin is constant for a month after a massive spike, btcvol will tell us that volatility was high during that entire period since the spike is still in the 30 day window. The stochastic volatility model, on the other hand, will only tell us that volatility was high on the day of the spike and maybe for a couple of days after that. We can make the model estimates behave more like btcvol by adjusting the prior on $\phi$. Specifically, if we increase $\alpha$ the prior will put more weight on higher values of $\phi$, so the volatility process will consequently be more persistent both in the prior and the posterior. To some extent the data can overcome this prior, but how much weight to place on the data is a tough question, and one we probably want to answer if we want to pick a good prior for $\phi$.
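The window effect is easy to reproduce. The R sketch below assumes btcvol is a plain trailing 30 day sample standard deviation of log returns, which is how it is described in this post; the function name `rolling_sd` and the artificial return series are made up for illustration.

```r
# Trailing-window sample standard deviation (hypothetical helper).
rolling_sd <- function(x, window = 30) {
  stopifnot(length(x) >= window)
  out <- rep(NA_real_, length(x))
  for (t in window:length(x)) {
    out[t] <- sd(x[(t - window + 1):t])
  }
  out
}

# Flat returns with a single spike on day 41.
returns <- c(rep(0, 40), 1, rep(0, 40))
vol <- rolling_sd(returns)
which(vol > 0)  # days 41 through 70
```

A one day spike keeps the rolling estimate elevated for the full 30 days it stays inside the window, even though every other return is exactly zero.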

Another advantage of estimating volatility this way is that we can easily predict volatility as well using the posterior distribution of the $h_t$’s and the recursion that gives us $h_{t+1}$’s distribution conditional on $h_t$. The next plot zooms in on the last 2-3 months and adds about a month of predictions.

This zoomed in look gives us a better view of the differences between btcvol and the stochastic volatility estimates. Btcvol smooths out the volatility of Bitcoin’s returns pretty heavily while the stochastic volatility estimates are happy to let each day have drastically different volatility. The new portion of this plot is the dashed lines, which represent the prediction version of their corresponding solid colored line. In other words, the dashed black line is the posterior predicted median for daily volatility while the dashed gray lines create a 90% posterior prediction interval.
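The prediction step can be sketched in a few lines of R: for each posterior draw, push the last log variance forward through the AR(1) recursion and summarize the resulting volatilities $e^{h/2}$ by their quantiles. The function name `predict_vol` is hypothetical, and the "posterior draws" below are made-up stand-ins so the sketch runs on its own; they are not output from the fitted model.

```r
# Simulate volatility `steps` days ahead from vectors of posterior draws.
# h_last, mu, phi, sigma_h: one entry per posterior draw (sigma_h is an sd).
predict_vol <- function(h_last, mu, phi, sigma_h, steps) {
  n <- length(h_last)
  stopifnot(length(mu) == n, length(phi) == n, length(sigma_h) == n)
  h <- h_last
  out <- matrix(NA_real_, nrow = 3, ncol = steps,
                dimnames = list(c("5%", "50%", "95%"), NULL))
  for (s in seq_len(steps)) {
    h <- rnorm(n, mean = mu + phi * (h - mu), sd = sigma_h)  # h_{t+1} | h_t
    out[, s] <- quantile(exp(h / 2), probs = c(0.05, 0.5, 0.95))
  }
  out
}

set.seed(1)
n <- 2000
pred <- predict_vol(h_last  = rnorm(n, -7, 0.3),     # stand-in draws of h_T
                    mu      = rnorm(n, -8, 0.2),     # stand-in draws of mu
                    phi     = runif(n, 0.8, 0.95),   # stand-in draws of phi
                    sigma_h = runif(n, 0.2, 0.4),    # stand-in draws of sigma_h
                    steps   = 30)
pred[, c(1, 30)]  # predictive median and 90% interval, 1 and 30 days ahead
```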

If you want the data so you can fit this model yourself or perhaps the estimates or predictions from the model I fit, look no further:

The most recent bitcoin return data and btcvol estimate is here (csv), from Eli.

The volatility estimates along with the data used to fit the model are here (csv).

The volatility predictions from the model are here (csv).

The R script I wrote to fit the model and produce these plots (and do a few other things) is here (R – text file).

The tutorial on using the stochvol package by Gregor Kastner is here (pdf).

The paper in which Gregor Kastner and Sylvia Frühwirth-Schnatter describe the computational method used to fit the model in stochvol is here (pdf).


**What the hell is utility anyway?**

First, we have to start at the beginning – what’s utility? Why do we expect it? What? Ok, forget about utility for now, first we’re going to talk about preferences. Mathematically, we need two things to have preferences: a set of things to have preferences over, and something that tells us which of those things are preferred to other things. The first is called the choice set, which we’ll use $X$ to represent. The second is a preference relation, which we’ll use $\succ$ or $\succeq$ to represent. Suppose that $(X,\succeq)$ represents Bob’s preferences. If $x_1 \in X$ and $x_2 \in X$, i.e. they are options in the choice set, then $x_1 \succ x_2$ means that Bob prefers $x_1$ to $x_2$. On the other hand $x_1 \succeq x_2$ means that Bob is either indifferent between $x_1$ and $x_2$ – he wouldn’t care which of these two options you gave him – or he prefers $x_1$ to $x_2$. So $\succ$ is a bit like $>$ and $\succeq$ is a bit like $\geq$, but they don’t mean the same thing. For example, our choice set could be $X=\{0,1,2,3,\ldots\}$ where $x\in X$ is the number of times Bob gets hit in the head by a hammer – clearly $5 > 4$, but unless we think Bob has strange preferences, $4 \succ 5$. To simplify notation a bit, we also have the symbols $\sim$, $\prec$, and $\preceq$ which mean pretty much exactly what you think they mean, e.g. $x_1 \preceq x_2$ means that $x_2 \succeq x_1$, i.e. Bob either prefers $x_2$ to $x_1$ or is indifferent between them, and $x_1 \sim x_2$ means that both $x_1 \succeq x_2$ and $x_2 \succeq x_1$, i.e. that Bob is indifferent between $x_1$ and $x_2$.

That’s not all, we also need some constraints on the relation $\succeq$ in order for the preference relation to be “rational” – we need $\succeq$ to be complete and transitive. The relation $\succeq$ is complete if it ranks any two options in the choice set, i.e. if $x_1$ and $x_2$ are possible options, then either $x_1 \succeq x_2$, $x_2 \succeq x_1$, or both. This constraint basically says that there’s always an answer to the question “which one do you want Bob?” Bob doesn’t have to *know* the answer to this question with 100% certainty, but we often assume that he does. Bob is a pretty smart guy. I bet he even went to college. The transitivity constraint basically says that Bob’s preferences are consistent. Formally, it says that if $x_1 \succeq x_2$ and $x_2 \succeq x_3$ then $x_1 \succeq x_3$. This constraint is to prevent preference circles – e.g. where Bob would rather eat an apple than a banana, he’d rather eat a banana than an orange, and he’d rather eat an orange than an apple. So not only is Bob a pretty smart guy, but he’s also not insane. So he’s not the unabomber. That’s a good thing, right?
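For a finite choice set both conditions can be checked mechanically. In the R sketch below a preference relation is stored as a logical matrix where entry $(i,j)$ is TRUE when option $i$ is at least as good as option $j$. This encoding and the function names `is_complete` and `is_transitive` are assumptions made for the illustration.

```r
# R[i, j] == TRUE means "option i is at least as good as option j".
is_complete <- function(R) all(R | t(R))
is_transitive <- function(R) {
  n <- nrow(R)
  for (i in seq_len(n)) for (j in seq_len(n)) for (k in seq_len(n)) {
    if (R[i, j] && R[j, k] && !R[i, k]) return(FALSE)
  }
  TRUE
}

fruit <- c("apple", "banana", "orange")

# A sensible Bob: apple over banana over orange.
score <- c(3, 2, 1)
sane <- outer(score, score, ">=")
dimnames(sane) <- list(fruit, fruit)

# A preference circle: apple over banana, banana over orange, orange over apple.
circle <- diag(3) == 1
dimnames(circle) <- list(fruit, fruit)
circle["apple", "banana"] <- TRUE
circle["banana", "orange"] <- TRUE
circle["orange", "apple"] <- TRUE

c(is_complete(sane), is_transitive(sane))      # TRUE TRUE
c(is_complete(circle), is_transitive(circle))  # TRUE FALSE
```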

You may be wondering at this point whether these preferences can capture some notion of *how much* Bob prefers bananas to oranges. The answer is that they can’t. This isn’t to say that there isn’t a precise answer to this question; we’ve just abstracted away from those details. We’ll eventually be able to capture some of this notion of “how much more” something is preferred once we talk about uncertainty, but for now the theory has limits.

**No, really, I thought you were going to tell me about utility**

Now that we have the mathematical structure, we can talk about Bob’s preferences. We can throw a bunch of possible choices at him, and he can rank them for us. “An apple is better than an orange, which is at least as good as a banana, which is better than a grape…” But working directly with preferences is a big pain in the butt – not for Bob, but for us, talking about Bob. The problem is that preferences have this strange, unique structure that isn’t always easy to analyze. The solution is to come up with a way to represent Bob’s preferences using a mathematical structure that *is* easy to play around with. This is what a utility function is – a convenient representation of preferences, but not the preferences themselves. Mathematically, a utility function is a function (duh) $U:X\to \Re$ that takes an option from the choice set and gives a number to it. But not just any old number – the numbers have to agree with the underlying preference relation, i.e. $x_1\succeq x_2$ if and only if $U(x_1) \geq U(x_2)$. So instead of having to look at the preference relation, we can just look at this list of numbers associated with each option in the choice set. If the utility function is “nice”, e.g. continuous and differentiable, then we can do all sorts of fun mathy things with it.

Not every preference relation can be represented by a utility function though. I won’t go into detail, but if the preference relation is continuous in some sense that I’m not willing to define right now, a utility function exists. The classic example of a preference relation which doesn’t admit a utility function representation is lexicographic preferences. Suppose the choice set consists of the number of apples that Bob gets to consume and the number of bananas he gets to consume (allowing fractions of a fruit, so both are real-valued quantities), so a typical element would be something like $(x,y)$ where $x$ is the number of apples and $y$ is the number of bananas. We’ll assume that $(x_1, y_1)\succeq (x_2,y_2)$ any time $x_1 > x_2$ and also when $x_1 = x_2$ and $y_1 \geq y_2$. So basically if you offer Bob the choice between two fruit baskets filled with apples and bananas, the first thing he does is count the number of apples in the basket. If one basket has more apples than the other, he takes that one; if they’re tied, he takes the basket with more bananas. Bananas are nothing more than tiebreakers. We can’t represent this preference relation by assigning a single number to each option without somehow losing information. Two numbers would work, but one is not enough.
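Written as a comparison rule, Bob's lexicographic preferences look like this in R (the function name `lex_weakly_prefers` is hypothetical; each basket is a vector of apples then bananas):

```r
# TRUE when basket a = (apples, bananas) is at least as good as basket b.
lex_weakly_prefers <- function(a, b) {
  a[1] > b[1] || (a[1] == b[1] && a[2] >= b[2])
}

lex_weakly_prefers(c(3, 0), c(2, 100))  # TRUE: more apples wins outright
lex_weakly_prefers(c(2, 5), c(2, 7))    # FALSE: tied on apples, fewer bananas
```

The rule needs both coordinates of each basket, which is the sense in which two numbers work but one does not.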

A second thing to worry about is that when a utility function exists for some preference relation, it isn’t unique. Suppose the choice set is just two things – a banana and an apple, and Bob prefers the apple to the banana. We could assign $U(apple) = 1$ and $U(banana) = 0$, or $U(apple) = 1,000,000,000,000$ and $U(banana) = -\pi/2$, or anything else we wanted, as long as $U(apple) > U(banana)$. The numbers are meaningless apart from their order. In general if we have some utility function $U$ and a strictly increasing function $f$, then $f(U(x))$ is also a valid utility function, e.g. $\log(U(x))$ (when $U$ is positive) or $e^{U(x)}$ or $aU(x) + b$ when $a>0$ and $b$ is any number. This reinforces the idea that a utility function is a *representation* of preferences and not the preferences themselves – in any given situation we may pick a different utility function merely because it’s convenient to work with.
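A quick R check of this point, using made-up utility numbers: strictly increasing transformations leave every pairwise comparison unchanged, while a decreasing one does not. The helper name `same_ranking` is hypothetical.

```r
# Made-up utilities for three options.
U <- c(apple = 3, banana = 1, orange = 2)

# Two utility vectors agree if they rank every pair of options the same way.
same_ranking <- function(u1, u2) {
  all(outer(u1, u1, ">=") == outer(u2, u2, ">="))
}

same_ranking(U, log(U))       # TRUE
same_ranking(U, exp(U))       # TRUE
same_ranking(U, 5 * U - 100)  # TRUE
same_ranking(U, -U)           # FALSE: a decreasing transform flips the ranking
```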

**I still don’t know how you expect utility**

OK, to talk about expected utility, we have to take another detour. Up until now, we’ve assumed that Bob can directly choose outcomes. This makes sense when he’s picking between fruit baskets, but maybe not so much when Bob has to decide whether to keep or fold his hand in a game of Texas hold ’em. Yeah, Bob is a pretty cool guy. He plays poker all the time. We can easily modify our existing framework to deal with Bob’s gambling habit. Now we have some outcome space $O$ that denotes all of the possible things that can happen – i.e. does he win the hand or lose it? How much does he win or lose? But Bob can’t choose items directly from $O$ – he can’t choose how much he wins. Instead his options are whether to bet (and how much), call, or fold. Each option induces a probability distribution over $O$, over the amount of money he wins (or loses… and let’s be honest, Bob loses a lot) and he has to pick from among these options. So the solution is to treat Bob’s choice set as the set of all probability distributions over $O$, i.e. $X = \mathcal{L}(O)$. The script L stands for “lottery” and just indicates that we’re looking at the probabilities of the different elements (or subsets) of $O$ occurring. Now $x_1 \succ x_2$ means that the probability distribution over $O$ that $x_1$ represents is preferred to the distribution $x_2$ represents. E.g. if we’re talking about a coin flip where Bob wins $\$1$ if it lands heads and loses $\$1$ if it lands tails, $X=\mathcal{L}(O)$ represents the set of all possible probabilities of the coin landing on heads. So if $x_1 = .5$ and $x_2 = .4$, $x_1 \succ x_2$ since, we presume, Bob likes more money.

So if we suppose Bob has a preference relation $\succeq$ on the choice set $\mathcal{L}(O)$ we’re done right? Choice under uncertainty, got it. Well, not so fast. We’ve sneakily assumed that Bob only cares about the final probabilities of the outcomes of $O$ occurring and not how they’re constructed. For example suppose Bob is choosing whether to bet $\$1$ on the outcome of a coin flip, but he doesn’t know the probability that the coin comes up heads. We can represent this ignorance with a probability distribution over the probability that the coin lands heads – e.g. Bob may think that there’s a 50% chance that $P(HEADS) \geq .5$. What we’ve basically assumed is that we can average the coin’s probability over the distribution of coin-flip-probabilities and just look at the resulting marginal distribution over $O$. In other words, we can reduce $\mathcal{L}(\mathcal{L}(\cdots\mathcal{L}(\mathcal{L}(O))\cdots))$ to just $\mathcal{L}(O)$. This makes sense from the perspective of normative decision theory, but may not accurately describe human behavior, so whether or not this assumption is strong depends on the purpose you’re using decision theory for.
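The reduction is just an average. In the R sketch below Bob entertains three possible values for the coin's probability of heads and attaches a belief to each; all of the numbers are made up for illustration.

```r
# Possible values of P(heads) and the probability Bob assigns to each one.
p_heads <- c(0.3, 0.5, 0.9)
belief  <- c(0.25, 0.50, 0.25)
stopifnot(abs(sum(belief) - 1) < 1e-12)

# Averaging over Bob's beliefs collapses the lottery over lotteries
# into a single lottery over the outcomes heads and tails.
marginal_heads <- sum(belief * p_heads)
c(heads = marginal_heads, tails = 1 - marginal_heads)
```

Under the assumption described above, Bob treats this compound gamble exactly like a single coin whose probability of heads is `marginal_heads`.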

OK, OK, so we can define preferences over lotteries/gambles/probability distributions over the outcome set as long as we acknowledge some technicalities, and if the preferences are nice we get a utility function defined on these lotteries. Great, but we still aren’t expecting something. To get to expected utility, we need to acknowledge a huge computational problem we just introduced. Previously, our utility function took a single element from $X$, which could easily just be one number (i.e. consumption) or a couple of numbers (apples and oranges). Now that $X=\mathcal{L}(O)$, the utility function needs a bunch of numbers. Suppose $O$ contains $n$ elements, then our utility function depends on $n-1$ numbers because it takes into account a probability for each possible outcome. If $O$ is an infinite set, e.g. the possible amounts of money you can win in a bet, or the number of possible flipped coins landing heads in an infinite number of flips, then the utility function depends on an infinite number of numbers – suddenly it’s really hard to work with again.

A solution to this problem is to define another type of utility function, called a von Neumann – Morgenstern or VNM utility function. Under some much stronger conditions on the preference relation $\succeq$ over $X=\mathcal{L}(O)$, we can create a VNM utility function $u:O\to \Re$ where $U(x)=E_x[u(o)]$ for any probability distribution $x\in X=\mathcal{L}(O)$. OK, let’s unpack this. The VNM utility function $u$ does the same basic thing as utility functions always do, but this time to the outcome space instead of the choice space – it assigns numbers to outcomes. What do these numbers mean? Well if $x_1$ and $x_2$ are two probability distributions over the outcome space, then $x_1 \succeq x_2$ exactly when $E_{x_1}[u(o)] \geq E_{x_2}[u(o)]$. Ok, what the hell does $E_{x_1}[u(o)]$ mean? It means that you take the VNM utility function $u(o)$ and average it using the probability distribution $x_1$. So, for example if there are two possibilities, heads and tails, and $u(heads) = 1$ while $u(tails) = 0$, then if $x_1$ assigns probability $p$ to heads, $E_{x_1}[u(o)] = p\times 1 + (1 - p)\times 0 = p$. We call this the expected utility given the probability distribution $x_1$. This object is often (if you do it right) pretty easy to work with.
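The heads/tails calculation can be sketched in a few lines of Python. The function name `expected_utility` and the particular lotteries are illustrative assumptions, not part of any standard library.

```python
def expected_utility(lottery, u):
    """Average the VNM utility u over a lottery.

    `lottery` maps outcomes to probabilities; `u` maps outcomes to utilities.
    """
    return sum(prob * u[outcome] for outcome, prob in lottery.items())

u = {"heads": 1.0, "tails": 0.0}      # the VNM utility from the example
x1 = {"heads": 0.7, "tails": 0.3}     # hypothetical lottery with p = 0.7
x2 = {"heads": 0.4, "tails": 0.6}     # hypothetical lottery with p = 0.4

eu1 = expected_utility(x1, u)
eu2 = expected_utility(x2, u)
print(eu1, eu2, eu1 >= eu2)  # x1 is weakly preferred when eu1 >= eu2
```

With this $u$, the expected utility of a lottery is just its probability of heads, matching the calculation in the text.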

Like the original utility function, the VNM utility function isn’t unique either in the sense that you can always find another function that represents preferences equally well. If $u(o)$ is a VNM utility function representing the preference relation $\succeq$, then so is $a\times u(o) + b$ for $a > 0$ and $b$ any real number. This also helps motivate why VNM utility helps us capture some of the “how much more” intuition about preferences. If $u(o_1)$ is much larger than $u(o_2)$ but only slightly larger than $u(o_3)$, this relative difference of differences is preserved by picking a different VNM utility function, so we’re able to coherently talk about Bob liking $o_1$ way more than $o_2$ but only a little bit more than $o_3$ without going outside of the math.
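A short derivation shows why both claims hold. Write $v(o) = a\,u(o) + b$ with $a>0$ (the symbol $v$ is just a label for the transformed function). By linearity of expectation,

\[
E_x[v(o)] = a\,E_x[u(o)] + b,
\]

so $E_{x_1}[v(o)] \geq E_{x_2}[v(o)]$ exactly when $E_{x_1}[u(o)] \geq E_{x_2}[u(o)]$, and $v$ ranks lotteries the same way $u$ does. The “how much more” comparison survives because ratios of utility differences are unchanged:

\[
\frac{v(o_1) - v(o_2)}{v(o_1) - v(o_3)} = \frac{a\,[u(o_1) - u(o_2)]}{a\,[u(o_1) - u(o_3)]} = \frac{u(o_1) - u(o_2)}{u(o_1) - u(o_3)}.
\]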

**So, wait, infinite utility is like the Spanish inquisition?**

Yep. Nobody expects infinite utility. Infinite utility is simply not amongst our weaponry – ahem – it’s simply not a possibility in the math we’ve gone through so far. Bob’s utility function over lotteries on the space of possible outcomes is $U:\mathcal{L}(O)\to \Re$, and his VNM utility function is $u:O\to \Re$ such that if $x\in \mathcal{L}(O)$ then $U(x) = E_x[u(o)]$. There’s no room for infinity here – Bob’s utility function assigns a *number* to each lottery on $O$. Infinity is not a number, QED, right? Well, no, there’s more to this story worth mentioning. The requirement that $U$ only give real numbers to lotteries ends up imposing a pretty strong restriction on the VNM utility function $u$ – it has to be bounded. In other words there has to be some number $M$ such that $|u(o)|\leq M$ for all $o\in O$. You can relax this restriction if you only allow some of the lotteries in $\mathcal{L}(O)$, but fundamentally, whichever restriction you choose is there to prevent an infinite expected utility.

On the other hand, there’s no reason we can’t go back to the beginning and allow our utility function to assign each element of the choice set a number *or* positive/negative infinity, i.e. $U:X\to \Re\cup\{-\infty,\infty\}$. Now we can allow infinite expected utility if we like. But note what we’ve done – in order for our utility function to be faithfully representing Bob’s preferences, $U(x)=E_x[u(o)]=\infty$ means that $x$ is literally (literally literally) the best probability distribution over $O$ according to Bob’s preferences, and if there are two such $x$’s, they are tied for the best. And we can’t rule out ties either. The same thing holds for $-\infty$, except this time $x$ is (tied for) the worst. You can do this if you want, but you have to be careful with the infinities and not overinterpret them. Remember, they’re supposed to be representing preferences. If at any time you see infinite expected utility, what’s more likely? That you took all the proper precautions so that this infinity is faithfully representing the preferences of our good friend Bob? Or that you just waved your hands and turned Bob into a theist without his consent?

First, we’ll characterize the cdf of $Y$. Suppose the cdf of $X$ is $F_X(x)=P(X\leq x)$. Let $F_Y(y)$ denote the cdf of $Y$. If $y\geq 0$ then we have

$$F_Y(y)=P(Y\leq y) = P(Y\leq 0) + P(0\leq Y \leq y).$$

Then by the symmetry of the distribution of $Y$ we have

$$F_Y(y) = \frac{1}{2} + \frac{1}{2}P(-y\leq Y \leq y)=\frac{1 + P(Y^2\leq y^2)}{2} = \frac{1 + F_X(y^2)}{2}.$$

Similarly for $y<0$ we have
\[
\begin{aligned}
F_Y(y) &= P(Y\leq 0) - P(y\leq Y\leq 0) = \frac{1}{2} - \frac{1}{2}P(y\leq Y \leq -y)\\
&=\frac{1 - P(Y^2\leq y^2)}{2} = \frac{1 - F_X(y^2)}{2}.
\end{aligned}
\]
Thus the cdf of $Y$ is
\[
F_Y(y)=\frac{1 + \mathrm{sgn}(y)F_X(y^2)}{2},
\]
where
\[
\mathrm{sgn}(y) = \begin{cases} 1 &\mathrm{if}\ y>0\\
0 &\mathrm{if}\ y=0\\
-1 &\mathrm{if}\ y<0
\end{cases}.
\]
Now let’s suppose that $W$ is a binary random variable with a 50/50 shot of being either 1 or -1, independent of $X$. Our goal is to show that $P(W\sqrt{X}\leq y) = P(Y\leq y)$, i.e. that the two random variables have the same distribution function. This ends up being pretty similar to characterizing the cdf of $Y$. First suppose $y\geq 0$, then:
\[
\begin{aligned}
P(W\sqrt{X} \leq y) &= P(\sqrt{X} \leq y | W=1)P(W=1) + P(\sqrt{X} \geq -y|W=-1)P(W=-1) \\
& = P(X\leq y^2)\frac{1}{2} + \frac{1}{2} = \frac{1+F_X(y^2)}{2}.
\end{aligned}
\]
Independence of $X$ and $W$ is crucial to getting the second line. Now suppose $y<0$. The steps are similar:
\[
\begin{aligned}
P(W\sqrt{X} \leq y) &= P(\sqrt{X} \leq y | W=1)P(W=1) + P(\sqrt{X} \geq -y|W=-1)P(W=-1)\\
& = 0 + \frac{1}{2}(1-P(X \leq y^2)) = \frac{1-F_X(y^2)}{2}.
\end{aligned}
\]
So we can write the cdf of $W\sqrt{X}$ as
\[
P(W\sqrt{X}\leq y) = \frac{1 + \mathrm{sgn}(y)F_X(y^2)}{2},
\]
which is exactly the cdf of $Y$. Now we’re done! If you want to draw from $Y$, first draw from $X$ and take the square root, then flip a coin to choose whether to use positive or negative $\sqrt{X}$.
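As a sanity check, here is a minimal Python sketch of the recipe. The choice of $X$ is an assumption made purely for illustration: $X=Z^2$ with $Z$ standard normal, so that $Y$ should come out standard normal. The helper name `draw_symmetric` is likewise hypothetical.

```python
import math
import random

def draw_symmetric(draw_x, rng):
    """Draw Y = W * sqrt(X), with W = +1 or -1 equally likely, independent of X."""
    x = draw_x(rng)
    w = 1 if rng.random() < 0.5 else -1
    return w * math.sqrt(x)

def draw_chisq1(rng):
    """Illustrative X: the square of a standard normal draw."""
    return rng.gauss(0.0, 1.0) ** 2

rng = random.Random(0)
samples = [draw_symmetric(draw_chisq1, rng) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
frac_below_one = sum(s <= 1.0 for s in samples) / len(samples)
# For a standard normal Y we expect mean near 0, variance near 1,
# and P(Y <= 1) near 0.8413.
print(mean, var, frac_below_one)
```

Any sampler for $X$ can be passed in place of `draw_chisq1`, as long as it returns nonnegative draws.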

Now everyone learns that convergence almost surely implies convergence in probability which, in turn, implies convergence in distribution. Everyone also learns that none of the converse implications hold in general, but I don’t think anyone comes away really grasping the difference between all of these concepts – they learn the “that” but not the “why.” I’m guilty of this too. We memorize which is stronger then go on to use them to talk about central limit theorems, laws of large numbers, and properties of estimators. This is probably fine since these first year courses typically focus on getting comfortable using the main concepts of probability theory rather than on deep understanding – that’s for later courses. But some people can’t wait for those courses, so I’ll try to explain the difference both intuitively and with some math.

First, what are the convergence concepts? Let’s back up and start at the beginning with the most primitive notion of convergence. Suppose we have a sequence of points $a_1$, $a_2$, $a_3$, and so on. Then the sequence $\{a_n\}$ converges to the number $a_0$ as $n$ approaches $\infty$ if you can force $a_n$ to be as close to $a_0$ as you want by choosing $n$ large enough. More formally:

The sequence $\{a_n\}$ converges to $a_0$ as $n$ approaches $\infty$, denoted $a_n\to a_0$ as $n\to\infty$ or $\lim_{n\to\infty}a_n=a_0$, if for any $\epsilon > 0$ there is an integer $N$ such that if $n>N$, then $|a_n - a_0|<\epsilon$.

But this notion of convergence doesn’t quite get us to the notions used in probability theory. To do that, we need to be able to talk about convergence of functions. Why? Because a random variable is a function – but more on that later. What does it mean for a sequence of functions to converge to another function? Essentially it means that if you evaluate each function in the sequence at the same point, the value you get becomes closer and closer to the value of the function the sequence converges to evaluated at that point. To make this mathy:

Let $\{f_n(x)\}$ be a sequence of functions, $f_n:\mathcal{X}\to\Re$, and $f_0:\mathcal{X}\to\Re$. Then the sequence of functions $f_n$ converges to $f_0$ at $x$ if $\lim_{n\to\infty}f_n(x)=f_0(x)$. The sequence converges pointwise if $\lim_{n\to\infty}f_n(x)=f_0(x)$ for all $x$ in $\mathcal{X}$.

So basically $f_n$ converges to $f_0$ at $x$ if the number you get when you plug $x$ into $f_n(.)$ converges to the number you get when you plug $x$ into $f_0(.)$, and $f_n$ converges pointwise to $f_0$ if this happens no matter which $x$ in $\mathcal{X}$ you choose.
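A small Python sketch illustrates both definitions at once, using the example $f_n(x)=x^n$ on $[0,1]$ (an example chosen here for illustration). For each $x<1$ the sequence $f_n(x)$ converges to $0$, but the $N$ you need for a given $\epsilon$ depends on $x$, which is what “pointwise” allows. At $x=1$ the limit is $1$ instead.

```python
def f(n, x):
    """The n-th function in the illustrative sequence f_n(x) = x^n."""
    return x ** n

def steps_needed(x, eps):
    """Smallest n with |f_n(x) - 0| < eps, for 0 <= x < 1.

    Because x^n decreases in n, every later term is also within eps of 0.
    """
    n = 1
    while abs(f(n, x)) >= eps:
        n += 1
    return n

for x in (0.0, 0.5, 0.9, 0.99):
    print(x, steps_needed(x, 1e-3))

print(f(1000, 1.0))  # at x = 1 the sequence is constant, so the limit there is 1
```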

Ok, so enough with the review of calculus. Suppose we have a sequence of random variables $X_1$, $X_2$, $X_3$, and so on. In what ways can the sequence converge to another random variable $X_0$? The three key modes of convergence in decreasing order of strength are almost sure convergence, convergence in probability, and convergence in distribution. Let’s take them in reverse order since that’s probably increasing order of difficulty.

**Convergence in Distribution**

Intuitively, when we say that $X_n$ converges in distribution to $X_0$, we’re saying that the probability distribution of $X_n$ gets closer and closer to that of $X_0$ as $n$ becomes large. So you might think that, formally, this requires that the pdf of $X_n$ becomes closer and closer to the pdf of $X_0$ as $n$ becomes large. For technical reasons that have to do with measure theory, the pdf of $X_n$ doesn’t always have to exist when $X_n$ is a well defined random variable. To allow us to be able to talk about convergence with arbitrary random variables, we instead use the cdfs of $X_n$ and $X_0$. Recall the cdf of a random variable $X$ is defined as $F(x)=P(X\leq x)$. I.e. if you give it a number $x$, the cdf of $X$ will give you the probability that $X$ is less than or equal to that number. So what does it mean for $X_n$ to converge in distribution to $X_0$? Well, formally,

Let $\{X_n\}$ be a sequence of random variables with cdfs $\{F_n(.)\}$, and let $X_0$ be a random variable with cdf $F_0(.)$. $X_n$ *converges in distribution* to $X_0$ as $n\to\infty$, denoted $X_n\stackrel{d}{\to}X_0$, if $\lim_{n\to\infty}F_n(x)= F_0(x)$ for all $x\in\Re$ where $F_0(x)$ is continuous.

So basically $X_n$ converges in distribution to $X_0$ when the cdf of $X_n$ converges pointwise to the cdf of $X_0$, though pointwise convergence everywhere is a little stronger than necessary. We don’t need $F_n(x)$ to converge to $F_0(x)$ when $F_0$ isn’t continuous at $x$ because $F_0$ is right-continuous and has at most countably many jumps, so for arbitrarily small $\delta>0$ we can find a point $x+\delta$ where $F_0$ is continuous and $F_0(x+\delta)$ is as close to $F_0(x)$ as we like (look at the properties of the cdf to see why this is true). So we can safely ignore all points where $F_0$ is discontinuous yet still be able to approximate $F_0(x)$ at any point $x$ by evaluating $F_n$ at a nearby continuity point.
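A tiny example shows why the discontinuity points have to be excluded. Take $X_n$ to be the constant $1/n$ and $X_0$ the constant $0$ (both chosen here for illustration). Surely $X_n$ should converge in distribution to $X_0$, yet the cdfs disagree at $x=0$ for every $n$. The sketch below, with the hypothetical helper `cdf_point_mass`, checks this numerically.

```python
def cdf_point_mass(location):
    """cdf of a random variable that equals `location` with probability one."""
    return lambda x: 1.0 if x >= location else 0.0

F0 = cdf_point_mass(0.0)

def F(n):
    """cdf of X_n, the constant 1/n."""
    return cdf_point_mass(1.0 / n)

for n in (1, 10, 100, 1000):
    # F_0 jumps at x = 0, so x = 0 is excluded; x = 0.01 and x = -0.01 are not.
    print(n, F(n)(-0.01), F(n)(0.0), F(n)(0.01))

print(F0(-0.01), F0(0.0), F0(0.01))
```

At the continuity points $x=\pm 0.01$ the values of $F_n$ eventually match $F_0$, while at the jump $x=0$ we have $F_n(0)=0$ for all $n$ but $F_0(0)=1$.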

**Convergence in Probability**

Convergence in probability is a bit more complicated looking once you write down the math, but fundamentally it’s just trying to capture the situation where as $n$ becomes large, the random variables $X_n$ and $X_0$ take nearly the same value with high probability. Here’s the complicated looking math:

Let $\{X_n\}$ be a sequence of random variables and let $X_0$ be a random variable. $X_n$ *converges in probability* to $X_0$ as $n\to\infty$, denoted $X_n\stackrel{p}{\to}X_0$, if for all $\epsilon > 0$, $\lim_{n\to\infty}P(|X_n - X_0| \geq \epsilon)= 0$.

So let’s unpack this. Essentially, it says that as $n$ becomes large, the probability that $X_n$ and $X_0$ differ by at least some fixed (positive) amount goes to zero. In other words, $X_n$ gets arbitrarily close to $X_0$ with probability approaching one as $n$ becomes large.

**Almost Sure Convergence**

Finally, there’s almost sure convergence. But before we talk about that, we need to go back to the basics of probability theory. I said before that a random variable was actually a function. Well a function has to have a domain (a set of inputs) and a codomain (a set containing outputs), so what are the domain and codomain of a random variable? Recall that in order to talk about probability, we need a sample space, $\Omega$. $\Omega$ is the space of all possible outcomes (according to our probability model). In addition, we need a function $P$ that takes subsets of $\Omega$ as an input, called “events”, and gives a number between $0$ and $1$, the probability of that event. There are a number of details I’m leaving out here, but they aren’t important for the task at hand. So what is the domain of a random variable? The domain is the sample space, $\Omega$. What is the codomain? The real line. That is, the random variable $X_0$ is a function $X_0:\Omega\to\Re$. Given an element of the sample space, $\omega\in\Omega$, $X_0(\omega)$ is completely determined. All of the properties of $X_0$ are thus determined by $P$. For example the cdf of $X_0$ is $F_0(x)=P(\{\omega:X_0(\omega)\leq x\})$, i.e. it’s the probability of the set of $\omega$’s such that $X_0(\omega)\leq x$.

Now we can talk about almost sure convergence. Basically, $X_n$ converges to $X_0$ almost surely when the probability that $X_n$ converges to $X_0$ is $1$, thinking of both of them as functions. Formally:

Let $\{X_n\}$ be a sequence of random variables and let $X_0$ be a random variable. $X_n$ *converges almost surely* to $X_0$ as $n\to\infty$, denoted $X_n\to X_0 \ a.s.$, if $P(\{\omega:\lim_{n\to\infty}X_n(\omega)=X_0(\omega)\})=1$ or, equivalently, $P(\{\omega:\lim_{n\to\infty}X_n(\omega)\neq X_0(\omega)\})=0$.

In other words, the event that $X_n(\omega)$ doesn’t converge to $X_0(\omega)$ never happens (has probability 0) according to the probability distribution $P$.

**Convergence Relationships**

So now we can talk about the differences between these various notions of convergence. The difference between convergence in distribution and convergence in probability is pretty easy and I’ve basically already spelled it out – it’s the difference between the distribution of a random variable and the value of a random variable. Go back to flipping coins – suppose each random variable is just a coin that is either 1 (“heads”) or 0 (“tails”) with some probability. If $X_n$ converges to $X_0$ in distribution, then all that is happening is that the probability that $X_n$ is heads converges to the probability that $X_0$ is heads. But that doesn’t mean that $X_n$ and $X_0$ have to be the same or even close any more than two coins with the same probability of heads have to flip the same side up! If $X_n$ converges to $X_0$ in probability, on the other hand, then $X_n$ and $X_0$ are very likely to both be heads or both be tails when $n$ is large. This basic idea generalizes to all sorts of random variables. If $X_n$ converges to $X_0$ in distribution then as $n$ becomes large the probability that $X_n$ is in a region becomes closer and closer to the probability that $X_0$ is in that same region. But for any given realization of each of the random variables, there’s no reason why their values have to be anywhere near each other. On the other hand, if $X_n$ converges in probability to $X_0$, then with probability approaching one the value of $X_n$ gets arbitrarily close to the value of $X_0$ as $n$ becomes large.
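The coin story can be simulated directly. In this Python sketch (both constructions are illustrative assumptions), $X_0$ is a fair coin. Construction (a) makes $X_n$ an independent fair coin, so it has exactly the same distribution as $X_0$ for every $n$ but an unrelated value. Construction (b) copies $X_0$ and then flips the result with probability $1/n$, which keeps the coin fair but ties its value to $X_0$.

```python
import random

def mismatch_rates(n, n_draws, rng):
    """Estimate P(X_n != X_0) under constructions (a) and (b)."""
    mismatches_a = 0
    mismatches_b = 0
    for _ in range(n_draws):
        x0 = 1 if rng.random() < 0.5 else 0
        # (a) independent fair coin: same distribution as X_0, unrelated value
        xa = 1 if rng.random() < 0.5 else 0
        # (b) copy of X_0 that is flipped with probability 1/n
        xb = 1 - x0 if rng.random() < 1.0 / n else x0
        mismatches_a += xa != x0
        mismatches_b += xb != x0
    return mismatches_a / n_draws, mismatches_b / n_draws

rng = random.Random(42)
rate_a, rate_b = mismatch_rates(n=100, n_draws=100_000, rng=rng)
print(rate_a, rate_b)  # (a) stays near 1/2 for every n; (b) is near 1/n
```

Both constructions converge in distribution to $X_0$, but only (b) converges in probability.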

If that was the extent of the confusion surrounding convergence, this post wouldn’t be necessary. Dealing with the difference between almost sure convergence and convergence in probability is trickier though – we have to somehow deal with the fact that a random variable is a function from the sample space to the real line. The big spoiler is that the difference between these two modes of convergence is the difference between a function and its value. If that doesn’t instantly clear everything up for you, let me explain. Almost sure convergence requires that the sequence of functions $X_n(\omega)$ converges to the function $X_0(\omega)$, except perhaps on a set of $\omega$’s that has probability 0. Convergence in probability requires that the value of $X_n$ and the value of $X_0$ are arbitrarily close with a probability that approaches 1 as $n$ approaches $\infty$.

These facts should hopefully be clear from the definitions and explanations above, but something weird still appears to be going on. We all learned that almost sure convergence is stronger than convergence in probability, but viewed in this light it seems like they might be equivalent. If the values of two functions are (almost always) the same, then aren’t they (almost always) the same function? The best way to show that this is incorrect is with an example of a sequence of random variables that converges in probability to another random variable, but doesn’t converge almost surely. To do this we have to specify the underlying sample space and probability measure. Let the sample space be the unit interval, i.e. $\Omega=[0,1]$. Now to assign a probability measure $P$ to this set, we need to be able to give all “nice” subsets of $\Omega$ a probability. We’ll sidestep issues centered around what “nice” means and the basic properties any probability measure must satisfy. For the moment, let’s look at a certain type of simple subsets of $\Omega$: intervals. Suppose the interval $(a,b)$ is a subset of $\Omega$, i.e. that $0\leq a<b\leq 1$. We’ll take the probability of an interval to be its length, $P((a,b))=b-a$, so that single points have probability zero and including or excluding the endpoints doesn’t matter. Let $X_0$ be the coin that always lands “tails,” i.e. $X_0(\omega)=0$ for every $\omega$, and define the sequence of coins $X_1, X_2, \dots$ by

$X_1(\omega)=\begin{cases}1 & \mathrm{if}\ \omega\leq \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_2(\omega)=\begin{cases}1 & \mathrm{if}\ \omega\geq \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_3(\omega)=\begin{cases}1 & \mathrm{if}\ \omega\leq \frac{1}{3} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_4(\omega)=\begin{cases}1 & \mathrm{if}\ \frac{1}{3}\leq \omega\leq \frac{2}{3} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_5(\omega)=\begin{cases}1 & \mathrm{if}\ \omega\geq \frac{2}{3} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_6(\omega)=\begin{cases}1 & \mathrm{if}\ \omega\leq \frac{1}{4} \\ 0 & \mathrm{otherwise} \end{cases}$

$X_7(\omega)=\begin{cases}1 & \mathrm{if}\ \frac{1}{4}\leq \omega\leq \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{cases}$

and so on,

so that

$P(X_1 = 1)=\frac{1}{2}$

$P(X_2 = 1)=\frac{1}{2}$

$P(X_3 = 1)=\frac{1}{3}$

$P(X_4 = 1)=\frac{1}{3}$

$P(X_5 = 1)=\frac{1}{3}$

$P(X_6 = 1)=\frac{1}{4}$

$P(X_7 = 1)=\frac{1}{4}$

etc.

So basically we divide the sample space up into two equal sized intervals. The first coin is “heads” in the first interval, the second is “heads” in the second interval. Then we divide the sample space into three equal sized intervals. The third coin is “heads” in the first of these intervals, the fourth coin is “heads” in the second interval, and the fifth coin is “heads” in the third interval. Then we continue this process by dividing the sample space into fourths, fifths, sixths, etc. to obtain the full sequence of coins. Now this sequence of coins converges in probability to a coin that lands “tails” every time, i.e. to $X_0$ with $X_0(\omega)=0$ for every $\omega$. If you want $P(X_n\neq X_0)$ to be, say, less than $\delta$, you need only go far enough in the sequence so that the sample space is being divided into equal intervals that are shorter than $\delta$.

However, $X_n$ doesn’t converge almost surely to $X_0$. In fact, at no point $\omega$ in the sample space does $X_n$ converge to $X_0$! Pick any arbitrary $\omega$ between $0$ and $1$ (inclusive). If $X_n(\omega)$ converges to $X_0(\omega)$ at that point, then we should be able to find a sufficiently large $N$ so that $n>N$ means $X_n(\omega)=0$. Consider any point in the sequence. At this point we’ve implicitly divided the sample space into, say, $k$ equal-length intervals, each with width $\frac{1}{k}$ where $k$ is an integer. But because each interval includes its endpoints, $\omega$ must be in one of the intervals. So one of the $X_n$’s associated with these intervals must be equal to 1 at $\omega$. But this has to be true with any division of the sample space into equal-length intervals. So no matter where we are in the sequence, there are future random variables in the sequence such that $X_n(\omega)=1$. So at $\omega$, $X_n$ doesn’t converge to $X_0$ and since $\omega$ was arbitrarily chosen, at no point in the sample space does $X_n$ converge to $X_0$.
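The construction is easy to check by brute force. The Python sketch below rebuilds the sequence of coins with exact fractions (the helper names `interval_for` and `X` and the choice $\omega = 3/10$ are illustrative) and shows the two facts side by side: the interval where coin $n$ is heads keeps shrinking, yet a fixed $\omega$ keeps landing in one of those intervals.

```python
from fractions import Fraction

def interval_for(n):
    """Closed interval [lo, hi] on which coin n (n = 1, 2, ...) is heads."""
    pieces = 2        # the first block splits [0, 1] into halves
    index = n - 1     # position of coin n within its block, counted from zero
    while index >= pieces:
        index -= pieces
        pieces += 1   # the next block splits [0, 1] into one more piece
    return Fraction(index, pieces), Fraction(index + 1, pieces)

def X(n, omega):
    """Value of coin n at the sample point omega."""
    lo, hi = interval_for(n)
    return 1 if lo <= omega <= hi else 0

# P(X_n = 1) is the interval length, which shrinks along the sequence.
widths = [interval_for(n)[1] - interval_for(n)[0] for n in (1, 3, 6, 10)]
print(widths)

# At a fixed omega, heads keeps recurring no matter how far out we look.
omega = Fraction(3, 10)
heads = [n for n in range(1, 200) if X(n, omega) == 1]
print(heads)
```

The list `heads` picks up at least one new index from every block, which is the argument in the paragraph above in computational form.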

What’s going on here? Well, both modes of convergence require the subset of the sample space on which $X_n$ is “heads” to shrink as $n$ increases. Convergence in probability allows this set to wander around the sample space as $n$ increases so long as it’s shrinking. This mode of convergence only cares about the values the random variables spit out, not where in the sample space they spit out which values. But almost sure convergence imposes a stricter requirement – the set of $\omega$ where $X_n(\omega)$ and $X_0(\omega)$ differ must not only shrink, but stay put. If it’s allowed to drift around the sample space, $X_n(\omega)$ will never converge to $X_0(\omega)$ because there’s always some point in the future of the sequence where they differ.

This sort of reasoning leads to an equivalent way of stating almost sure convergence.

Let $X_n$ be a sequence of random variables and $X_0$ a random variable. $X_n$ *converges almost surely* to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\sup_{m>n}|X_m-X_0|\geq\epsilon)=0$.

$\sup$ stands for “supremum” and means “least upper bound.” Basically, the $\sup$ operator picks out the largest value of the sequence $|X_{n+1}-X_0|$, $|X_{n+2}-X_0|$, … If there is no maximum value – if, for example, the sequence is $\frac{n}{n+1}$ for $n=1,2,…$ – then the $\sup$ operator returns the smallest number that is at least as large as every number in the sequence or, if necessary, $\infty$. It’s a somewhat difficult technical exercise to prove that this notion of almost sure convergence and the other notion mentioned above are equivalent. In order to do so, you need the idea of an event that occurs “infinitely often” a.k.a. “limsup for sets.”

Notice the similarity to convergence in probability. Intuitively, convergence in probability says that the probability that $X_n$ and $X_0$ differ by at least a fixed positive number goes to zero. Stated like this, almost sure convergence says that the probability that the maximum difference between the rest of the sequence after $X_n$ and $X_0$ is at least a fixed positive number goes to zero. Parsing the precise difference between these two definitions is perhaps easier if we make the fact that random variables are functions explicit. To wit:

Let $X_n$ be a sequence of random variables and $X_0$ a random variable.

$X_n$ *converges in probability* to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\{\omega:|X_n(\omega)-X_0(\omega)|\geq\epsilon\})=0$.

$X_n$ *converges almost surely* to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\{\omega:\sup_{m>n}|X_m(\omega)-X_0(\omega)|\geq\epsilon\})=0$.

Both modes of convergence define a set of $\omega$’s for every $n$, and in both cases the probability of this set must go to zero as $n$ goes to infinity – i.e. the set must shrink. In the case of almost sure convergence the set must also never add new members. To see this, suppose we have an arbitrary $\epsilon>0$ and suppose at $\omega$, $\sup_{m>n}|X_m(\omega)-X_0(\omega)|<\epsilon$. In other words, for all $m>n$, $X_m$ and $X_0$ differ by less than $\epsilon$ at $\omega$. But then at $t>n$, all we’ve done is remove some of the members of the sequence. So for all $m>t$, again $X_m$ and $X_0$ differ by less than $\epsilon$ at $\omega$, and an $\omega$ that has left the set can never rejoin it. Convergence in probability, on the other hand, allows $\omega$’s to be both added and removed from the set at every $n$, so long as the *probability* of the set eventually goes to zero. This allows the set of $\omega$’s where $X_n$ and $X_0$ differ by at least $\epsilon$ to contain any given $\omega$ for infinitely many $n$ or “infinitely often,” as it did in the example above.
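Here is a rough numerical contrast between a set that wanders and a set that stays put, as a Python sketch. Everything in it is an illustrative assumption: `typewriter` rebuilds the wandering coins from the example above, `nested` is a second sequence that is heads only on $[0, 1/n]$, the limit is the always-tails coin, the supremum is truncated at a finite `horizon`, and the probability is approximated by a fraction of grid points.

```python
from fractions import Fraction

def typewriter(n, omega):
    """Wandering coins: heads on the n-th interval of the halves, thirds, ... scheme."""
    pieces, index = 2, n - 1
    while index >= pieces:
        index -= pieces
        pieces += 1
    return 1 if Fraction(index, pieces) <= omega <= Fraction(index + 1, pieces) else 0

def nested(n, omega):
    """Coins that stay put: heads only on the shrinking interval [0, 1/n]."""
    return 1 if omega <= Fraction(1, n) else 0

def tail_sup_fraction(seq, n, horizon, grid):
    """Fraction of grid points where some X_m with n < m <= horizon is heads.

    This is a finite-horizon, finite-grid stand-in for P(sup_{m>n} |X_m - X_0| >= eps)
    with X_0 identically zero and any eps in (0, 1].
    """
    bad = [w for w in grid if any(seq(m, w) == 1 for m in range(n + 1, horizon + 1))]
    return Fraction(len(bad), len(grid))

grid = [Fraction(i, 100) for i in range(1, 101)]
for n in (10, 50):
    print(n,
          tail_sup_fraction(typewriter, n, 300, grid),
          tail_sup_fraction(nested, n, 300, grid))
```

For the nested coins the tail set shrinks with $n$, while for the wandering coins it covers the whole grid at every $n$ tried, even though each individual coin is heads on a tiny interval.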

This doesn’t exhaustively explain the difference between almost sure convergence and convergence in probability. To get at it completely, you need to get your hands dirty with measure theory and prove that the two notions of almost sure convergence are equivalent, then play with some of the moving parts. Hopefully, though, there’s enough intuition here so that you have some idea why the two notions of convergence are different.


Let’s get started. First, go to R’s website, which should look something like this:

On the left hand side, click on the CRAN link under “Download, Packages” next to the arrow in the image above. This will bring up a page of links to download R from, like the image below. Choose a location close to you so that the download is faster.

This should bring you to a page that looks like the image below. Now click on the link to download R for whatever operating system you’re running. I.e. if your computer has Windows, click on “Download R for Windows.”

This should bring up another page with a few more links, like the picture below. Click on “base.”

Finally we get to the page with the actual download link, like below. Click on “Download R Z for X” where X is whatever operating system you have and Z is the latest version of R, i.e. some number like “2.15.1”. At the time I created this post, it was “Download R 2.15.1 for Windows” since I was installing R on a Windows machine.

This will download the executable to install R – a file named “R-2.15.1-win.exe” or something similar. Make sure you save it somewhere you can find it easily. When it finishes, run the file. Just follow the on-screen instructions that pop up. You shouldn’t have to change anything in order for R to properly install.

Now you’re all set to start using R… except the GUI that R comes with out of the box isn’t very good. RStudio is a free IDE that improves on the base R GUI substantially. Go here to download it. Download the version of RStudio that their website recommends for your machine and save it somewhere that you can easily find. Once this completes, open the file – it should be called “RStudio-0.96.316.exe” or something similar. From this point, just follow the on-screen instructions to complete the installation.

Now we’ll install a couple of useful packages that exist for R. First we’ll install the R package “ggplot2”. ggplot2 is a package for creating statistical graphics that drastically improves upon R’s base graphics.

In order to install this, open up R Studio and make sure you’re connected to the internet. Then type install.packages("ggplot2") into the R console and hit enter, as below.

R will output the following message, or something similar:

Installing package(s) into ‘C:/Users/Matt/R/win-library/2.15’

(as ‘lib’ is unspecified)

--- Please select a CRAN mirror for use in this session ---

Wait for a few seconds, then R will give you some options for which mirror to download from. Type the number for the mirror that is closest to you to download everything faster, then press enter. See the picture below.

R should be pretty busy for a few minutes while it downloads and installs several packages that ggplot2 depends on as well as ggplot2 itself. Once it finishes and you see the blue “>” in the bottom of the R console, the packages are installed.

There’s another package you need to install using basically the same process. agricolae is a package that has a bunch of methods for agricultural statistics, but more importantly it has some useful nonparametric tests. To install it, type install.packages("agricolae") into the R console and follow the same process as before.

In addition, I’ve uploaded a dataset that we’ll be using during my presentation. Download it here. Save it as diam.csv somewhere you can easily find it.

That’s all the software you’ll need to be up and running! Here are a bunch of useful resources, including the slides from my talk:

Slides from the presentation. (pdf) Probably more useful while you’re sitting at your computer than while I was talking.

diam.csv. The dataset I used for most examples in my presentation.

An R script containing every R command from my presentation. Open with any text editor, though opening it in R Studio is best.

R Studio tutorial. (pdf) Useful for getting your bearings in R while using R Studio. It covers the basics of computation in R, including some stuff I didn’t cover such as dealing with vectors and matrices.

An R script containing an old presentation. This one has many more details about the basics in R as well as using ggplot2, plus some stuff about quickly using Bayesian methods to fit models. Note: enter install.packages("arm") into the R console to use the Bayesian stuff.

An R reference card. (pdf) Print it out and tape it on the wall next to your desk. Seriously. Do it now.

The ggplot2 website. This contains useful information for making complicated, informative and pretty graphics using the ggplot2 package.

Course website for the class I took to learn R. Some overlap, but there are many new things.

Knitr. This is a fantastic way to integrate your computations from R into a nice compiled Latex file, and it’s relatively painless with R Studio.

The lectures do assume a rather high level understanding of probability theory. The statistics students in the class have seen at least chapters 1 – 5 of Casella and Berger in some detail. Other students in the class have similar backgrounds, though perhaps not quite as strong. Some knowledge of R would also be useful to understand the more computationally centered lectures. While the videos might not be useful for everyone, they’re probably a great supplement if you’re learning some of this material elsewhere.

\[y_{ij} | \alpha_j, \beta_j \stackrel{iid}{\sim} N(\alpha_j + x_{ij}\beta_j, \sigma^2)\]

where $j=1,…,J$ indicates groups and $i=1,…,I_j$ indicates observations within groups. In other words, each group has a simple linear regression with a common error variance $\sigma^2$. We further model the regression coefficients as coming from a common distribution:

\[\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \stackrel{iid}{\sim} N\left(\begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix}, {\mathbf \Sigma}\right)\]

In order to complete the model and run a Bayesian analysis, we need to add priors for all remaining parameters including the covariance matrix for a group’s regression coefficients ${\mathbf \Sigma}$. A popular prior for ${\mathbf \Sigma}$ is the inverse-Wishart distribution, but there are some problems with using it in this role. A post over at dahtah has the details, but essentially the problem is that under the standard “noninformative” version of the inverse-Wishart prior, which makes the marginal distribution of the correlations uniform, large standard deviations are associated with large absolute correlations. This isn’t exactly noninformative and in addition it can have adverse effects on inference in some models where we want shrinkage to occur.

The question, then, is what prior distribution should we use instead? A paper (pdf) by Barnard, McCulloch and Meng argues for using a separation strategy: write ${\mathbf \Sigma}$ as $\mathbf{\Delta\Omega\Delta}$ where $\mathbf{\Delta}=\mathrm{diag}(\mathbf{\delta})$ is a diagonal matrix of standard deviations and ${\mathbf \Omega}$ is a correlation matrix. Then model ${\mathbf \Delta}$ and ${\mathbf \Omega}$ separately. There are a number of ways to do this, but Barnard et al. suggest using independent lognormal priors on the standard deviations in $\mathbf{\delta}$:

\[\log(\delta_j)\stackrel{iid}{\sim}N(\delta_0,\sigma^2_\delta)\]

while using the following family of densities on $\mathbf{\Omega}$:

\[ f_k(\mathbf{\Omega}|v) \propto |\mathbf{\Omega}|^{-\frac{1}{2}(v+k+1)}\left(\prod_i\omega^{ii}\right)^{-\frac{v}{2}} = |\mathbf{\Omega}|^{\frac{1}{2}(v-1)(k-1)-1}\left(\prod_i|\mathbf{\Omega}_{ii}|\right)^{-\frac{v}{2}} \]

where $\omega^{ii}$ is the $i$’th diagonal element of $\mathbf{\Omega}^{-1}$, $\mathbf{\Omega}_{ii}$ is the $i$’th principal submatrix of $\mathbf{\Omega}$ (that is, $\mathbf{\Omega}$ with its $i$’th row and column deleted), $k$ is the dimension of $\mathbf{\Omega}$ and $v$ is a tuning parameter. This density is obtained by transforming an inverse-Wishart random matrix, specifically $IW(v, \mathbf{I})$, into a correlation matrix. It turns out that $v=k+1$ results in uniform marginal distributions on the correlations in $\mathbf{\Omega}$.
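The equality of the two forms of the density can be checked by hand. This sketch reads the first exponent as $-\frac{1}{2}(v+k+1)$ and takes $\mathbf{\Omega}_{ii}$ to be $\mathbf{\Omega}$ with its $i$’th row and column removed, so that the cofactor formula for the inverse gives $\omega^{ii} = |\mathbf{\Omega}_{ii}|/|\mathbf{\Omega}|$. Then

\[
\left(\prod_{i=1}^k\omega^{ii}\right)^{-\frac{v}{2}} = \left(\frac{\prod_{i=1}^k|\mathbf{\Omega}_{ii}|}{|\mathbf{\Omega}|^{k}}\right)^{-\frac{v}{2}} = |\mathbf{\Omega}|^{\frac{kv}{2}}\left(\prod_{i=1}^k|\mathbf{\Omega}_{ii}|\right)^{-\frac{v}{2}},
\]

and collecting the powers of $|\mathbf{\Omega}|$ gives

\[
-\frac{1}{2}(v+k+1) + \frac{kv}{2} = \frac{kv - v - k - 1}{2} = \frac{(v-1)(k-1)}{2} - 1,
\]

which is the exponent in the second form.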

An alternative strategy based on the separation strategy comes from a paper (pdf) by O’Malley and Zaslavsky and endorsed by Gelman – instead of modeling ${\mathbf \Omega}$ as a correlation matrix, only constrain it to be positive semi-definite so that ${\mathbf \Delta}$ and ${\mathbf \Omega}$ jointly determine the standard deviations, but ${\mathbf \Omega}$ still determines the correlations alone. This strategy uses the same lognormal prior on $\mathbf{\Delta}$ but uses the inverse-Wishart distribution $IW(k+1, \mathbf{I})$ on $\mathbf{\Omega}$ which still induces marginally uniform correlations on the resulting covariance matrix. This strategy is attractive since the inverse-Wishart distribution is already in a number of statistical packages and typically results in a conditionally conjugate model for the covariance matrix, allowing for a relatively simple analysis. If $\mathbf{\Omega}$ is constrained to be a correlation matrix as in Barnard et al., sampling from the conditional posterior requires significantly more work. So the scaled inverse-Wishart is much easier to work with, but theoretically it still allows for some dependence between the correlations and the variances in $\mathbf{\Sigma}$.
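Writing out the elements makes the last point concrete. With $\mathbf{\Sigma}=\mathbf{\Delta\Omega\Delta}$, and with $\omega_{ij}$, $\sigma_j$ and $\rho_{ij}$ denoting the elements of $\mathbf{\Omega}$, the standard deviations and the correlations implied by $\mathbf{\Sigma}$ (notation introduced here for illustration):

\[ \Sigma_{ij} = \delta_i\delta_j\omega_{ij}, \qquad \sigma_j = \sqrt{\Sigma_{jj}} = \delta_j\sqrt{\omega_{jj}}, \qquad \rho_{ij} = \frac{\Sigma_{ij}}{\sigma_i\sigma_j} = \frac{\omega_{ij}}{\sqrt{\omega_{ii}\omega_{jj}}} \]

The $\delta_j$ cancel out of the correlations, but the diagonal of $\mathbf{\Omega}$ enters both $\sigma_j$ and $\rho_{ij}$, which is where dependence between the two can creep in. When $\mathbf{\Omega}$ is a correlation matrix, $\omega_{jj}=1$ and $\sigma_j=\delta_j$. On the log scale, and using the prior independence of $\mathbf{\Delta}$ and $\mathbf{\Omega}$,

\[ \mathrm{log}(\sigma_j) = \mathrm{log}(\delta_j) + \tfrac{1}{2}\mathrm{log}(\omega_{jj}), \qquad \mathrm{Var}\left[\mathrm{log}(\sigma_j)\right] = \mathrm{Var}\left[\mathrm{log}(\delta_j)\right] + \tfrac{1}{4}\mathrm{Var}\left[\mathrm{log}(\omega_{jj})\right] \]

so under the scaled inverse-Wishart the log standard deviations should be at least as spread out as the lognormal prior on $\delta_j$ alone would suggest.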

I’ve used the scaled inverse-Wishart before, so I was curious about how much of a difference it makes to use Barnard et al.’s prior. Now the impact on inference of using different priors will vary from model to model, but we can at least see how different priors encode the same prior information. Since $f$ is just the density of the correlation matrix resulting from an inverse-Wishart distribution on the covariance matrix, it’s relatively easy to sample from $f$ – just sample from an inverse-Wishart and transform to correlations. So I specified Barnard’s separation strategy prior (SS) and the scaled inverse-Wishart prior (sIW) in a similar way and sampled from them both, and then compared them to a sample from a standard inverse-Wishart (IW). In the first test, I assumed that the covariance matrix was $2\times2$ and that $\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(0,1)$ for both the SS and sIW priors. The only difference was in $\mathbf{\Omega}$: for the SS prior I assumed that $\mathbf{\Omega}\sim f(\mathbf{\Omega}|v=3)$ and for the sIW prior, I assumed $\mathbf{\Omega}\sim IW(3, \mathbf{I})$. In other words, for both priors I assumed that the correlations were marginally uniform. For the IW prior, I assumed that $\mathbf{\Sigma}\sim IW(3, \mathbf{I})$ to once again ensure marginally uniform correlations as a point of comparison. Taking 10,000 samples from each prior, we can see this behavior in the following plot of the correlations:

As expected, for all priors the correlations look uniform. With either the SS or sIW priors, it’s possible to change this so that they favor either high correlation (closer to $1$ or $-1$) or low correlation (closer to $0$) through the manipulation of $v$ – see Barnard et al. or O’Malley & Zaslavsky for details. Next, we take a look at the log standard deviations:

The histograms of SS and sIW look pretty similar, the only difference being that sIW has slightly fatter tails and/or a higher variance. In other words, given the same parameter choices sIW yields a slightly less informative prior. This isn’t a problem – once we understand the behavior of sIW vs. SS, we can set the priors on $\mathbf{\Delta}$ differently to encode the same prior information on the standard deviations. The priors as specified don’t appear to be satisfactorily noninformative, if that is the prior information we were trying to capture, but that can easily be fixed by increasing the variance of the prior. The IW prior, on the other hand, has a very narrow range of values for the standard deviation that can only be changed by also changing the prior on the correlations – one of its main drawbacks. Finally, we look at the dependence between the first variance component and the correlation coefficient in the next plot:

As expected, there’s no dependence between the correlation and the variance in the SS prior – the prior assumes they are independent. The sIW prior, on the other hand, exhibits the same disturbing dependence as the IW prior, also documented in Simon Barthelmé’s post at dahtah. High variances are associated with more extreme correlations in both the sIW and IW priors – the sIW prior doesn’t seem to improve on the IW prior at all in this respect.

I changed the prior on $\mathrm{log}(\mathbf{\delta})$ to have mean $(10, 0)$ and covariance matrix $\mathrm{diag}(.2, .2)$ reflecting a situation where we have a fairly strong prior belief that the two standard deviations are different from each other. Without commentary, the plots tell essentially the same story:

It doesn’t look great for the scaled inverse-Wishart. This prior is capable of encoding a wider range of prior information than the standard inverse-Wishart by allowing the modeler to separately model the correlations and the variances. However, the modeler isn’t able to control the prior dependence between the variances and the correlations – the level of dependence appears to be the same as in the standard inverse-Wishart prior. I suspect that for most researchers the problems this causes aren’t so bad when weighed against the computational issues that arise when trying to simulate from a correlation matrix, but it’s hard to tell without actually fitting models in order to determine the effect on shrinkage.

Finally, here’s some R code to generate the plots in this post:

```r
require(MCMCpack)  ## riwish()
require(MASS)      ## mvrnorm()
require(ggplot2)
require(reshape)   ## melt() and cast()

## simulates a single sample from the sIW prior
sIW.sim <- function(k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
  R <- riwish(df, M)
  S <- diag(exp(mvrnorm(1,m, diag(s))))
  SRS <- S%*%R%*%S
  return(SRS)
}

## simulates n samples from the sIW prior
## (only the first two standard deviations and their correlation are kept)
sIW.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
  sig1 <- rep(0,n)
  sig2 <- rep(0,n)
  rho <- rep(0,n)
  for(i in 1:n){
    E <- sIW.sim(k, m, s, df, M)
    sds <- sqrt(diag(E))
    rho[i] <- cov2cor(E)[2,1]
    sig1[i] <- sds[1]
    sig2[i] <- sds[2]
  }
  return(data.frame(value=c(sig1, sig2, rho),
                    parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)),
                    sam=rep(c(1:n),3)))
}

## simulates n samples from the SS prior
SS.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
  sig1 <- rep(0,n)
  sig2 <- rep(0,n)
  rho <- rep(0,n)
  for(i in 1:n){
    E <- riwish(df, M)
    rho[i] <- cov2cor(E)[2,1]
    sds <- exp(mvrnorm(1,m, diag(s)))
    sig1[i] <- sds[1]
    sig2[i] <- sds[2]
  }
  return(data.frame(value=c(sig1, sig2, rho),
                    parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)),
                    sam=rep(c(1:n),3)))
}

## simulates n samples from the IW prior
IW.test <- function(n, k, df=k+1, M=diag(k)){
  sig1 <- rep(0,n)
  sig2 <- rep(0,n)
  rho <- rep(0,n)
  for(i in 1:n){
    E <- riwish(df, M)
    rho[i] <- cov2cor(E)[2,1]
    vars <- diag(E)
    sig1[i] <- sqrt(vars[1])
    sig2[i] <- sqrt(vars[2])
  }
  return(data.frame(value=c(sig1, sig2, rho),
                    parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)),
                    sam=rep(c(1:n),3)))
}

## first test: log standard deviations iid N(0,1)
n <- 10000
k <- 2
m <- c(0,0)
s <- c(1,1)
df <- 3
M <- diag(2)

sIWsam.1 <- sIW.test(n, k, m, s, df, M)
SSsam.1 <- SS.test(n, k, m, s, df, M)
IWsam.1 <- IW.test(n, k, df, M)

sIWsam.1$dens <- "sIW"
SSsam.1$dens <- "SS"
IWsam.1$dens <- "IW"

data.1 <- rbind(sIWsam.1, SSsam.1, IWsam.1)

qplot(value, data=data.1[data.1$parameter!="rho",], log="x",
      facets=dens~parameter, xlab="Log Standard Deviation")
qplot(value, data=data.1[data.1$parameter=="rho",],
      facets=dens~., xlab="Correlation")

data.1.melt <- melt(data.1, id=c("parameter", "dens", "sam"))
data.1.cast <- cast(data.1.melt, dens+sam~parameter)

qplot(sigma1^2, rho, data=data.1.cast, facets=.~dens,
      xlab="First variance component", ylab="Correlation", log="x")

## second test: log standard deviations with mean (10, 0) and variance .2
n <- 10000
k <- 2
m <- c(10,0)
s <- c(.2,.2)
df <- 3
M <- diag(2)

sIWsam.2 <- sIW.test(n, k, m, s, df, M)
SSsam.2 <- SS.test(n, k, m, s, df, M)
IWsam.2 <- IW.test(n, k, df, M)

sIWsam.2$dens <- "sIW"
SSsam.2$dens <- "SS"
IWsam.2$dens <- "IW"

data.2 <- rbind(sIWsam.2, SSsam.2, IWsam.2)

qplot(value, data=data.2[data.2$parameter!="rho",], log="x",
      facets=dens~parameter, xlab="Log Standard Deviation")
qplot(value, data=data.2[data.2$parameter=="rho",],
      facets=dens~., xlab="Correlation")

data.2.melt <- melt(data.2, id=c("parameter", "dens", "sam"))
data.2.cast <- cast(data.2.melt, dens+sam~parameter)

qplot(sigma1^2, rho, data=data.2.cast, facets=.~dens,
      xlab="First variance component", ylab="Correlation", log="x")
```

First, go to R’s website, which should look something like this:

On the left hand side, click on the CRAN link under “Download, Packages” next to the arrow in the image above. This will bring up a page of links to download R from, like the image below. Choose a location close to you so that the download is faster.

This should bring you to a page that looks like the image below. Now click on the link to download R for whatever operating system you’re running. I.e. if your computer has Windows, click on “Download R for Windows.”

This should bring up another page with a few more links, like the picture below. Click on “base.”

Finally we get to the page with the actual download link, like below. Click on “Download R 2.15.1 for X” where X is whatever operating system you have.

This will download the executable to install R – a file named “R-2.15.1-win.exe” or something similar. Make sure you save it somewhere you can find it easily. When it finishes, run the file. Just follow the on-screen instructions that pop up. You shouldn’t have to change anything in order for R to properly install.

Now you’re all set to start using R… except the GUI that R comes with out of the box isn’t very good. RStudio is a free IDE that improves on the base R GUI substantially. Go here to download it. Download the version of RStudio that their website recommends for your machine, and save it somewhere that you can easily find. Once the download completes, open the file – it should be called “RStudio-0.96.316.exe” or something similar. From this point, just follow the on-screen instructions to complete the installation.

That’s all the software you’ll need for my presentation! If you have some time it might be useful to poke around RStudio to get a general feel for it, but you certainly don’t need to in order to understand my presentation. There are a few tutorials to guide your poking scattered about the web, including this one (warning: pdf).

EDIT: Here are a few more resources that will be useful. First, you’ll need an additional R package called arm – this package allows you to quickly fit Bayesian linear and generalized linear models. In order to install this, open up RStudio and make sure you’re connected to the internet. Then type `install.packages("arm")` into the R console, as below.

R will output the following message, or something similar:

Installing package(s) into ‘C:/Users/Matt/R/win-library/2.15’

(as ‘lib’ is unspecified)

--- Please select a CRAN mirror for use in this session ---

Wait for a few seconds, then R will give you some options for which mirror to download from. Type the number for the mirror that is closest to you to download everything faster, then press enter. See the picture below.

R should be pretty busy for a few minutes while it downloads and installs several packages that arm depends on as well as arm itself. Once it finishes and you see the blue “>” in the bottom of the R console, the packages are installed.

Furthermore, you’ll need another R package called “ggplot2”. You can install this using the same command: `install.packages("ggplot2")`.

In addition, I’ve uploaded a dataset that we’ll be using during my presentation. Download it here. Save it as diam.csv somewhere you can easily find it.

So to be 100% ready for my presentation, you need to have installed R, RStudio, and the arm and ggplot2 packages, as well as have downloaded the data file diam.csv.

Finally, here are a few useful resources for before, during, and after my presentation:

An R script containing my entire presentation. Open with any text editor, though opening it in RStudio is best.

An R reference card. (pdf) Print it out and tape it on the wall next to your desk. Useful for after the presentation.

The ggplot2 website. This contains useful information for making complicated and informative graphics. Useful after the presentation.

Course website for the class I took to learn R. Some overlap, but there are many new things. Useful for after the presentation.

First, we’ll look at writing a C extension and see how that generalizes. Let’s say we’re writing a simple program that uses C to add two numbers. If we were just writing a C program, it would look something like this:

```c
#include <stdio.h>
#include <stdlib.h>

void add(double *a, double *b, double *c){
  *c = *a + *b;
  printf("%.0f + %.0f = %.0f\n", *a, *b, *c);
}

int main(void){
  double *a, *b, *c;
  size_t dbytes = sizeof(double);

  a = (double *) malloc(dbytes);
  b = (double *) malloc(dbytes);
  c = (double *) malloc(dbytes);

  //code for user input omitted

  add(a, b, c);
  printf("\nc = %.4f\n", *c);

  free(a);
  free(b);
  free(c);

  return 0;
}
```

None of this should be new to anyone who’s programmed in C before. Now suppose we want to be able to call the function `add` from R. We need to change the function slightly to accommodate this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <R.h> //needed to interface with R

void add(double *a, double *b, double *c){
  *c = *a + *b;
  //similar to printf, except prints to R console
  Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c);
}
```

There are a couple of twists here. First, we need to include the header `R.h`. This contains functions that allow R to interact with the C function, including `Rprintf`. `Rprintf` works almost exactly like `printf` except it prints directly to the R console. In addition to including the header, any function that we want to call from R has to satisfy two constraints: it must return type `void` and only have pointer arguments. Then to compile the program so that it’s callable from R, we would type into the terminal

```sh
R CMD SHLIB foo.c
```

where `foo.c` is the name of the file containing the code above. This command yields the following output in our terminal:

```sh
gcc -std=gnu99 -I/apps/lib64/R/include -I/usr/local/include -fpic -g -O2 -c foo.c -o foo.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o
```

These are the commands you would have entered if you wanted to compile the code manually. We’ll take a closer look at that later when we compile CUDA C code. For now though, just note that now in your working directory you have two new files: `foo.o` and `foo.so`. The latter is the file we’ll use to call `add` from R. To do this, we’ll need an R wrapper to call our C function transparently, e.g.:

```r
add <- function(a,b){
  ##check to see if function is already loaded
  if(!is.loaded("add"))
    dyn.load("foo.so")

  c <- 0
  z <- .C("add",a=a,b=b,c=c)
  c <- z$c
  return(c)
}
```

There are a couple of elements here. First, the function `dyn.load()` loads the shared library file `foo.so`. `is.loaded()`, unsurprisingly, checks to see if its argument has already been loaded. Note that we load a file but check to see if a *specific function* has already been loaded. The next element is the `.C` function. There are other ways to call C functions from R, but `.C` is the easiest. The first argument of `.C` is the name of the C function you want to call in string form. The rest of the arguments are the arguments for the function you’re calling. `.C` returns a list containing the updated values of all of the arguments passed to the C function. R copies all of the arguments and passes them to C using `.C` – it does NOT simply update the value of `c` for us, so we have to copy it back from the list that `.C` returns. When we run this code, this is what we see:

```
> source("foo.r")
> add(1,1)
1 + 1 = 2
[1] 2
```
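The same pattern extends from scalars to vectors. Below is a minimal sketch, not taken from the code above, of a `.C`-style function that adds two vectors elementwise. The name `vec_add` and the extra length argument `n` are assumptions made for illustration; the length is passed as a pointer because every argument of a `.C`-callable function has to be a pointer. `R.h` is left out so that the file also compiles as plain C.

```c
/* Elementwise c[i] = a[i] + b[i] for i = 0, ..., *n - 1.
   Follows the two .C constraints described above: the return type is
   void and every argument, including the length, is a pointer.
   vec_add and n are hypothetical names used only for this sketch. */
void vec_add(double *a, double *b, double *c, int *n){
  for(int i = 0; i < *n; i++){
    c[i] = a[i] + b[i];
  }
}
```

After compiling with `R CMD SHLIB` and loading with `dyn.load()`, this could presumably be called from R as `.C("vec_add", as.double(a), as.double(b), c=double(length(a)), as.integer(length(a)))$c`. As with `add`, the result has to be read back from the list that `.C` returns.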

This is all simple enough, but what about when we want to call a function that runs on the GPU? Assuming that we are trying to run the same function, there are a couple of tweaks. Here’s the source:

```c
#include <stdio.h>
#include <stdlib.h>
#include <R.h>

__global__ void add(float *a, float *b, float *c){
  *c = *b + *a;
}

extern "C" void gpuadd(float *a, float *b, float *c){
  float *da, *db, *dc;

  cudaMalloc( (void**)&da, sizeof(float) );
  cudaMalloc( (void**)&db, sizeof(float) );
  cudaMalloc( (void**)&dc, sizeof(float) );

  cudaMemcpy( da, a, sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy( db, b, sizeof(float), cudaMemcpyHostToDevice);

  add<<<1,1>>>(da, db, dc);

  cudaMemcpy(c, dc, sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);

  Rprintf("%.0f + %.0f = %.0f\n", *a, *b, *c);
}
```

There are two noteworthy things here. First, we still have to have a C function that allocates memory on the GPU and calls the kernel (GPU function) that we want the GPU to run – we can’t directly call the kernel from R. Second, we need to make sure that R knows the name of the C function we want to call using `.C`. We do this with the `extern "C"` declaration. This tells the compiler to treat `gpuadd` as a C function so that in the shared library file (i.e. `gpufoo.so`) it’s still called “gpuadd.” Otherwise, the compiler treats the function as a C++ function and changes how its name is stored without telling us what to call the function while using `.C`. Also there’s a quirk here – all of the variables are floats instead of doubles since most GPUs only support single precision floating point math. This will affect how we write our R wrapper to call `gpuadd`. If your GPU supports double precision floating point operations you can ignore that part, but keep in mind that most GPUs don’t if you want to distribute your code widely. We’ll assume that the code above is saved in `gpufoo.cu`.

So that’s the code, how do we compile it? This time there isn’t a simple R command we can call from the terminal, but the output from compiling a C file (`R CMD SHLIB file.c`) gives us clues about how to do it manually:

```sh
gcc -std=gnu99 -I/apps/lib64/R/include -I/usr/local/include -fpic -g -O2 -c foo.c -o foo.o
gcc -std=gnu99 -shared -L/usr/local/lib64 -o foo.so foo.o
```

As I mentioned before, these are precisely the commands we would have used if we wanted to compile manually. I’ll go through them step by step. Starting with the first line, or the compiling step, `gcc` is the compiler we call on our source code. `-std=gnu99` tells the compiler which standard of C to use. We’ll ignore this option. `-I/...` tells the compiler to look in the folder `/...` for any included headers. Depending on how your system has been set up, the particular folders where R automatically looks may be different from mine. Make a note of which folders are displayed here as you’ll need them later. These folders contain `R.h` among other headers. `-fpic` essentially tells the compiler to make the code suitable for a shared library – i.e. what we’ll be calling with `dyn.load()` from R. `-g` is just a debugging option while `-O2` tells the compiler to optimize the code nearly as much as possible. Neither is essential for our purposes though both are useful. Finally, `-c` tells the compiler to only compile and assemble the code – i.e. don’t link it, `foo.c` is the name of the source file and `-o foo.o` tells the compiler what to name the output file.

The next line is the linking step. Here, `-shared` tells the linker that the output will be a shared library while `-L/...` tells it where to find previously compiled libraries that this code may rely on. This is probably where the compiled versions of the R libraries are located. Again use the path R outputs for you, not necessarily the path I have here. Finally, `-o foo.so` tells the linker what to name the output file and `foo.o` is the name of the object file that needs to be linked.

So now we need to use these two statements to construct similar `nvcc` commands to compile our CUDA C code. The big reveal first, then the explanation:

```sh
nvcc -g -G -O2 -I/apps/lib64/R/include -I/usr/local/include -Xcompiler "-fpic" -c gpufoo.cu -o gpufoo.o
nvcc -shared -L/usr/local/lib64 gpufoo.o -o gpufoo.so
```

`-g` and `-O2` do the same thing here as before with the caveat that they only apply to code that runs on the host (i.e. not on the GPU). That is to say `-g` generates debugging information for only the host code and `-O2` optimizes only the host code. `-G`, on the other hand, generates debugging information for the *device* code, i.e. the code that runs on the GPU. So far, nothing different from before. The wrinkle is in this component: `-Xcompiler "-fpic"`. This tells the compiler to pass on the arguments in quotes to the C compiler, i.e. to `gcc`. This argument is exactly the same as above, as are the rest of the arguments outside of the quotes. The link step is basically identical to before as well.

In reality, there’s no need for some of the options in both the compile and link step. An alternative would be

```sh
nvcc -g -G -O2 -I/apps/lib64/R/include -Xcompiler "-Wall -Wextra -fpic" -c gpufoo.cu -o gpufoo.o
nvcc -shared -lm gpufoo.o -o gpufoo.so
```

This version removes a path from both the compile and the link step because nothing in those folders is relevant to compiling the above program – at least on my system. In the compile step I added two arguments to be passed to the compiler: `-Wall` and `-Wextra`. These tell the compiler to show warnings for things in our code that commonly cause errors – very useful for preventing bugs. Finally in the link step I added the option `-lm`. In general, the option `-lname` links the library named “name.” In this case, it links the math library which we would be using if we had `#include <math.h>` in our source file. If, for example, we were using NVIDIA’s CUBLAS library we would need `-lcublas` here.

Now in order to call this function from R, our wrapper needs to be slightly different:

```r
gpuadd <- function(a,b){
  if(!is.loaded("gpuadd"))
    dyn.load("gpufoo.so")

  ##tell R to convert to single precision before copying to .C
  c <- single(1)
  mode(a) <- "single"
  mode(b) <- "single"

  z <- .C("gpuadd",a,b,c=c)
  c <- z$c

  ##Change to standard numeric to avoid dangerous side effects
  attr(c,"Csingle") <- NULL
  return(c)
}
```

The essential difference here is that `.C`’s arguments need to have the “Csingle” attribute so that it knows to copy them as floats instead of doubles. The same applies for integer arguments. Finally, this attribute needs to be removed from the result to avoid some dangerous side effects, which the wrapper does right before it returns its output.