# I’m Almost Sure it’s Probably Converging… Wait, What?

Every first year graduate student in statistics or economics has to learn the basics of probability and statistics theory. Most of this is fairly understandable – once you grasp the basic concepts behind probability and distribution theory, everything else is applications and details for the most part. There is at least one notable exception to this – the relationship between the various notions of convergence used in probability theory.

Now everyone learns that convergence almost surely implies convergence in probability which, in turn, implies convergence in distribution. Everyone also learns that none of the converse implications hold in general, but I don’t think anyone comes away really grasping the difference between all of these concepts – they learn the “that” but not the “why.” I’m guilty of this too. We memorize which is stronger then go on to use them to talk about central limit theorems, laws of large numbers, and properties of estimators. This is probably fine since these first year courses typically focus on getting comfortable using the main concepts of probability theory rather than on deep understanding – that’s for later courses. But some people can’t wait for those courses, so I’ll try to explain the difference both intuitively and with some math.

First, what are the convergence concepts? Let’s back up and start at the beginning with the most primitive notion of convergence. Suppose we have a sequence of points $a_1$, $a_2$, $a_3$, and so on. Then the sequence $\{a_n\}$ converges to the number $a_0$ as $n$ approaches $\infty$ if you can force $a_n$ to be as close to $a_0$ as you want by choosing $n$ large enough. More formally:

The sequence $\{a_n\}$ converges to $a_0$ as $n$ approaches $\infty$, denoted $a_n\to a_0$ as $n\to\infty$ or $\lim_{n\to\infty}a_n=a_0$, if for any $\epsilon > 0$ there is an integer $N$ such that if $n>N$, then $|a_n – a_0|<\epsilon$.

But this notion of convergence doesn’t quite get us to the notions used in probability theory. To do that, we need to be able to talk about convergence of functions. Why? Because a random variable is a function – but more on that later. What does it mean for a sequence of functions to converge to another function? Essentially it means that if you evaluate each function in the sequence at the same point, the value you get becomes closer and closer to the value of the function the sequence converges to evaluated at that point. To make this mathy:

Let $\{f_n(x)\}$ be a sequence of functions, $f_n:\mathcal{X}\to\Re$, and $f_0:\mathcal{X}\to\Re$. Then the sequence of functions $f_n$ converges to $f_0$ at $x$ if $\lim_{n\to\infty}f_n(x)=f_0(x)$. The sequence converges pointwise if $\lim_{n\to\infty}f_n(x)=f_0(x)$ for all $x$ in $\mathcal{X}$.

So basically $f_n$ converges to $f_0$ at $x$ if the number you get when you plug $x$ into $f_n(.)$ converges to the number you get when you plug $x$ into $f_0(.)$, and $f_n$ converges pointwise to $f_0$ if this happens no matter which $x$ in $\mathcal{X}$ you choose.

Ok, so enough with the review of calculus. Suppose we have a sequence of random variables $X_1$, $X_2$, $X_3$, and so on. In what ways can the sequence converge to another random variable $X_0$? The three key modes of convergence in decreasing order of strength are almost sure convergence, convergence in probability, and convergence in distribution. Let’s take them in reverse order since that’s probably increasing order of difficulty.

Convergence in Distribution

Intuitively, when we say that $X_n$ converges in distribution to $X_0$, we’re say that the probability distribution of $X_n$ gets closer and closer to that of $X_0$ as $n$ becomes large. So you might think that, formally, this requires that the pdf of $X_n$ becomes closer and closer to the pdf of $X_0$ as $n$ becomes large. For technical reasons that have to do with measure theory, the pdf of $X_n$ doesn’t always have to exist when $X_n$ is a well defined random variable. To allow us to be able to talk about convergence with arbitrary random variables, we instead use the cdfs of $X_n$ and $X_0$. Recall the cdf of a random variable $X$ is defined as $F(x)=P(X\leq x)$. I.e. if you give it a number $x$, the cdf of $X$ will give you the probability that $X$ is less than that number. So what does it mean for $X_n$ to converge in distribution to $X_0$? Well, formally,

Let $\{X_n\}$ be a sequence of random variables with cdfs $\{F_n(.)\}$, and let $X_0$ be a random variable with cdf $F_0(.)$. $X_n$ converges in distribution to $X_0$ as $n\to\infty$, denoted $X_n\stackrel{d}{\to}X_0$, if $\lim_{n\to\infty}F_n(x)= F_0(x)$ for all $x\in\Re$ where $F_0(x)$ is continuous.

So basically $X_n$ converges in distribution to $X_0$ when the cdf of $X_n$ converges pointwise to the cdf of $X_0$, though this is a little stronger than necessary. We don’t need $F_n(x)$ to converge to $F_0(x)$ when $F_0(x)$ isn’t continuous at $x$ because for some sufficiently small $\delta>0$, $F_0(x)=F_0(x+\delta)$ and $F_0$ is continuous at $x+\delta$ (look at the properties of the cdf to see why this is true). So we can safely ignore all points where $F_0$ is discontinuous yet still be able to approximate $F_0(x)$ with $F_n(x)$ at any point $x$.

Convergence in Probability

Convergence in probability is a bit more complicated looking once you write down the math, but fundamentally it’s just trying to capture the situation where as $n$ becomes large, the random variables $X_n$ and $X_0$ have the same value. Here’s the complicated looking math:

Let $\{X_n\}$ be a sequence of random variables and let $X_0$ be a random variable. $X_n$ converges in probability to $X_0$ as $n\to\infty$, denoted $X_n\stackrel{p}{\to}X_0$, if for all $\epsilon > 0$, $\lim_{n\to\infty}P(|X_n – X_0| \geq \epsilon)= 0$.

So let’s unpack this. Essentially, it says that as $n$ becomes large, the probability that $X_n$ and $X_0$ differ by at least some fixed (positive) amount goes to zero. In other words, $X_n$ gets arbitrarily close to $X_0$ with probability approaching one as $n$ becomes large.

Almost Sure Convergence

Finally, there’s almost sure convergence. But before we talk about that, we need to go back to the basics of probability theory. I said before that a random variable was actually a function. Well a function has to have a domain (a set of inputs) and a codomain (a set containing outputs), so what are the domain and codomain of a random variable? Recall that in order to talk about probability, we need a sample space, $\Omega$. $\Omega$ is the space of all possible outcomes (according to our probability model). In addition, we need a function $P$ that takes subsets of $\Omega$ as an input, called “events”, and gives a number between $0$ and $1$, the probability of that event. There are a number of details I’m leaving out here, but they aren’t important for the task at hand. So what is the domain of a random variable? The domain is the sample space, $\Omega$. What is the codomain? The real line. That is, the random variable $X_0$ is a function $X_0:\Omega\to\Re$. Given an element of the sample space, $\omega\in\Omega$, $X_0(\omega)$ is completely determined. All of the properties of $X_0$ are thus determined by $P$. For example the cdf of $X_0$ is $F_0(x)=P(\{\omega:X_0(\omega)\leq x\})$, i.e. it’s the probability of the set of $\omega$’s such that $X_0(\omega)\leq x$.

Now we can talk about almost sure convergence. Basically, $X_n$ converges to $X_0$ almost surely when the probability that $X_n$ converges to $X_0$ is $1$, thinking of both of them as functions. Formally:

Let $\{X_n\}$ be a sequence of random variables and let $X_0$ be a random variable. $X_n$ converges almost surely to $X_0$ as $n\to\infty$, denoted $X_n\to X_0 \ a.s.$, if $P(\{\omega:\lim_{n\to\infty}X_n(\omega)=X_0(\omega)\})=1$ or, equivalently, $P(\{\omega:\lim_{n\to\infty}X_n(\omega)\neq X_0(\omega)\})=0$

In other words, the event that $X_n(\omega)$ doesn’t converge to $X_0(\omega)$ never happens (has probability 0) according to the probability distribution $P$.

Convergence Relationships

So now we can talk about the differences between these various notions of convergence. The difference between convergence in distribution and convergence in probability is pretty easy and I’ve basically already spelled it out – it’s the difference between a the distribution of a random variable and the value of a random variable. Go back to flipping coins – suppose each random variable is a just a coin that is either 1 (“heads”) or 0 (“tails”) with some probability. If $X_n$ converges to $X_0$ in distribution, then all that is happening is that the probability that $X_n$ is heads converges to the probability that $X_0$ is heads. But that doesn’t mean that $X_n$ and $X_0$ have to be the same or even close any more than two coins with the same probability of heads have to flip the same side up! If $X_n$ converges to $X_0$ in probability, on the other hand, then $X_n$ and $X_0$ are very likely to both be heads or tails when $n$ is large. This basic idea generalizes to all sorts of random variables. If $X_n$ converges to $X_0$ in distribution then as $n$ becomes large the probability that $X_n$ is in a region becomes closer and closer to the probability that $X_0$ is in that same region. But for any given realization of each of the random variables, there’s no reason why their values have to be anywhere near each other. On the other, if $X_n$ converges in probability to $X_0$, then the value of $X_n$ gets arbitrarily close to the value of $X_0$ as $n$ becomes large.

If that was the extent of the confusion surrounding convergence, this post wouldn’t be necessary. Dealing with the difference between almost sure convergence and convergence in probability is trickier though – we have to somehow deal with the fact that a random variable is a function from the sample space to the real line. The big spoiler is that the difference between these two modes of convergence is the difference between a function and its value. If that doesn’t instantly clear everything up for you, let me explain. Almost sure convergence requires that the sequence of functions $X_n(\omega)$ converges to the function $X_0(\omega)$, except perhaps on a set of $\omega$’s that has probability 0. Convergence in probability requires that the value of $X_n$ and the value of $X_0$ are arbitrarily close with a probability that approaches 1 as $n$ approaches $\infty$.

These facts should hopefully be clear from the definitions and explanations above, but something weird still appears to be going on. We all learned that almost sure convergence is stronger than convergence in probability, but viewed in this light it seems like they might be equivalent. If the values two functions are (almost always) the same, then aren’t they (almost always) the same function? The best way to show that this is incorrect is with an example of a sequence of random variables that converges in probability to another random variable, but doesn’t converge almost surely. To do this we have to specify the underlying sample space and probability measure. Let the sample space be the unit interval, i.e. $\Omega=[0,1]$. Now to assign a probability measure $P$ to this set, we need to be able to give all “nice” subsets of $\Omega$ a probability. We’ll sidestep issues centered around what “nice” means and the basic properties any probability measure must satisfy. For the moment, let’s look at a certain type of simple subsets of $\Omega$: intervals. Suppose the interval $(a,b)$ is a subset of $\Omega$. Then we’ll say that the probability of $(a,b)$ is the length of the interval $b-a$. We’ll say that the probability of a single point is $0$ and that the probability of any half-open interval or closed interval is the same as the probability of the associated open interval, i.e. $P((a,b))=P((a,b])=P([a,b))=P([a,b])$. So $P$ is the uniform distribution over the interval $[0,1]$.

Now we need to define the random variables. We’ll keep things easy and suppose than all of the $X_n$’s and $X_0$ are coin flips. Suppose that $X_0$ is a weighted coin that always lands on “tails,” i.e. $X_0(\omega)=0$ for all $\omega\in[0,1]$. Now suppose each of the $X_n$’s are also weighted coins that land on “heads” and “tails” with their own probabilities. We’ll specify these probabilities as follows:

$X_1(\omega)=\begin{cases}1 & if\ \omega\leq \frac{1}{2} \\ 0 & otherwise \end{cases}$
$X_2(\omega)=\begin{cases}1 & if\ \omega\geq \frac{1}{2} \\ 0 & otherwise \end{cases}$
$X_3(\omega)=\begin{cases}1 & if\ \omega\leq \frac{1}{3} \\ 0 & otherwise \end{cases}$
$X_4(\omega)=\begin{cases}1 & if\ \frac{1}{3}\leq \omega\leq \frac{2}{3} \\ 0 & otherwise \end{cases}$
$X_5(\omega)=\begin{cases}1 & if\ \omega\geq \frac{2}{3} \\ 0 & otherwise \end{cases}$
$X_6(\omega)=\begin{cases}1 & if\ \omega\leq \frac{1}{4} \\ 0 & otherwise \end{cases}$
$X_7(\omega)=\begin{cases}1 & if\ \frac{1}{4}\leq \omega\leq \frac{1}{2} \\ 0 & otherwise \end{cases}$
and so on

so that

$P(X_1 = 1)=\frac{1}{2}$
$P(X_2 = 1)=\frac{1}{2}$
$P(X_3 = 1)=\frac{1}{3}$
$P(X_4 = 1)=\frac{1}{3}$
$P(X_5 = 1)=\frac{1}{3}$
$P(X_6 = 1)=\frac{1}{4}$
$P(X_7 = 1)=\frac{1}{4}$
etc.

So basically we divide the sample space up into two equal sized intervals. The first coin is “heads” in the first interval, the second is “heads” in the second interval. Then we divide the sample space into three equal sized intervals. The third coin is “heads” in the first of these intervals, the fourth coin is “heads” in the second interval, and the fifth coin is “heads” in the third interval. Then we continue this process by dividing the sample space into fourths, fifths, sixths, etc. to obtain the full sequence of coins. Now this sequence of coins converges in probability to a coin that lands “tails” every time, i.e. to $X_0$. If you want $P(X_n\neq X_0)$ to be, say, less than $\delta$, you need only go far enough in the sequence so that the sample space is being divided into equal intervals that are no longer that $\delta$.

However, $X_n$ doesn’t converge almost surely to $X_0$. In fact, at no point $\omega$ in the sample space does $X_n$ converge to $X_0$! Pick any arbitrary $\omega$ between $0$ and $1$ (inclusive). If $X_n(\omega)$ converges to $X_0(\omega)$ at that point, then we should be able to find a sufficiently large $N$ so that $n>N$ means $X_n(\omega)=0$. Consider any point in the sequence. At this point we’ve implicitly divided the sample space into, say, $k$ equal-length intervals, each with width $\frac{1}{k}$ where $k$ is an integer. But because each interval includes its endpoints, $\omega$ must be in one of the intervals. So one of the $X_n$’s associated with these intervals must be equal to 1 at $\omega$. But this has to be true with any division of the sample space into equal-length intervals. So no matter where we are in the sequence, there are future random variables in the sequence such that $X_n(\omega)=1$. So at $\omega$, $X_n$ doesn’t converge to $X_0$ and since $\omega$ was arbitrarily chosen, at no point in the sample space does $X_n$ converge to $X_0$.

What’s going on here? Well, both modes of convergence require the subset of the sample space on which $X_n$ is “heads” to shrink as $n$ increases. Convergence in probability allows this set to wander around the sample space as $n$ increases so long as it’s shrinking. This mode of convergence only cares about the values the random variables spit out, not where in the sample space they spit out which values. But almost sure convergence imposes a stricter requirement – the set of $\omega$ where $X_n(\omega)$ and $X_0(\omega)$ differ must not only shrink, but stay put. If it’s allowed to drift around the sample space, $X_n(\omega)$ will never converge to $X_0(\omega)$ because there’s always some point in the future of the sequence where they differ.

This sort of reasoning leads to an equivalent way of stating almost sure convergence.

Let $X_n$ be a sequence of random variables and $X_0$ a random variable. $X_n$ converges almost surely to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\sup_{m>n}|X_m-X_0|\geq\epsilon)=0$

$\sup$ stands for “supremum” and means “least upper bound.” Basically, the $\sup$ operator picks out the largest value of the sequence $X_{n+1}$, $X_{n+2}$, … If there is no maximum value – if, for example, $X_n=\frac{n}{n+1}$ for all $n=1,2,…$ – then the $sup$ operator returns the smallest number that is larger than every number in the sequence or, if necessary, $\infty$. It’s a somewhat difficult technical exercise to prove that this notion of almost sure convergence and the other notion mentioned above are equivalent. In order to do so, you need the idea of an event that occurs “infinitely often” a.k.a. “limsup for sets.”

Notice the similarity to convergence in probability. Intuitively, convergence in probability says that the probability that $X_n$ and $X_0$ differ by at least a fixed positive number goes to zero. Stated like this, almost sure convergence says that the probability that the maximum difference between rest of sequence after $X_n$ and $X_0$ is at least a fixed positive number goes to zero. Parsing the precise difference between these two definitions is perhaps easier if we make the fact that random variables are functions explicit. To wit:

Let $X_n$ be a sequence of random variables and $X_0$ a random variable.

$X_n$ converges in probability to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\{\omega:|X_n(\omega)-X_0(\omega)|\geq\epsilon\})=0$

$X_n$ converges almost surely to $X_0$ if for all $\epsilon>0$, $\lim_{n\to\infty}P(\{\omega:\sup_{m>n}|X_m(\omega)-X_0(\omega)|\geq\epsilon\})=0$

Both modes of convergences define a set of $\omega$’s for every $n$, and in both cases the probability of this set must go to zero as $n$ goes to infinity – i.e. the set must shrink. In the case of almost sure convergence the set must also never add new members. To see this, suppose we have an arbitrary $\epsilon>0$ and suppose at $\omega$, $\sup_{m>n}|X_m(\omega)-X_0(\omega)|<\epsilon$. In other words, for all $m>n$, $X_m$ and $X_0$ differ by less than $\epsilon$ at $\omega$. But then at $t>n$, all we’ve done is removed some of the members of the sequence. So for all $m>t$, again $X_m$ and $X_0$ differ by less than $\epsilon$ at $\omega$. Convergence in probability, on the other hand, allows $\omega$’s to be both added and removed from the set at every $n$, so long as the probability of the set eventually goes to zero. This allows the set of $\omega$’s where $X_n$ and $X_0$ differ by at least $\epsilon$ to contain any given $\omega$ for infinitely many $n$ or “infinitely often,” as it did in the example above.

This doesn’t exhaustively explain the difference between almost sure convergence and convergence in probability. To get at it completely, you need to get your hands dirty with measure theory and prove that the two notions of almost sure convergence are equivalent, then play with some of the moving parts. Hopefully, though, there’s enough intuition here so that you have some idea why the two notions of convergence are different.

I should also like to add convergence in moments is important and extends the conversation. Convergence in moments (sometimes called $L^p$ convergence, where $p$ represents the moment) implies convergence in probability but not almost surely.
The modified Rademacher functions you use to distinguish between almost sure convergence and convergence in probability does a good job illustrating the differences between moment and almost sure convergence. That is for some $k$ and $c$ we always have that $E[X_k]=\frac{1}{c}$ where, as $k$ increases $E[|X_k|]=\frac{1}{c}\rightarrow 0$. It then follows that $X_k$ converges in $L^1$ however it’s clear that $X_k$ does not converge almost surely.