Prior distributions for covariance matrices: the scaled inverse-Wishart prior

At some point in a variety of Bayesian applications, we end up having to specify a prior distribution on a covariance matrix. The canonical example is a hierarchical regression model. Suppose we have a response vector $y$ along with some covariate $x$ where we expect the simple linear regression model to provide a decent fit. In addition, suppose the data is organized into distinct groups such that each group possibly has its own regression line, though we expect the lines to be related. In order to model this situation, we start first with the data level:

\[y_{ij} | \alpha_j, \beta_j \stackrel{iid}{\sim} N(\alpha_j + x_{ij}\beta_j, \sigma^2)\]

where $j=1,…,J$ indicates groups and $i=1,…,I_j$ indicates observations within groups. In other words, each group has a simple linear regression with a common error variance $\sigma^2$. We further model the regression coefficients as coming from a common distribution:

\[\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \stackrel{iid}{\sim} N\left(\begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix}, {\mathbf \Sigma}\right)\]
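As a quick illustration of the setup, here is a minimal R sketch that simulates fake data from this two-level model; the values of $J$, the group size, the hyperparameters and $\sigma$ are made up purely for illustration.

## simulate fake data from the hierarchical regression above
## (all numbers here are hypothetical, chosen only for illustration)
library(MASS)                                      # mvrnorm()
set.seed(1)
J     <- 8                                         # number of groups
I.j   <- 20                                        # observations per group
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)           # group-level covariance of (alpha_j, beta_j)
coefs <- mvrnorm(J, mu = c(2, 1), Sigma = Sigma)   # row j holds (alpha_j, beta_j)
x     <- matrix(rnorm(J * I.j), nrow = I.j, ncol = J)
y     <- sapply(1:J, function(j) coefs[j, 1] + coefs[j, 2] * x[, j] + rnorm(I.j, sd = 1))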

In order to complete the model and run a Bayesian analysis, we need to add priors for all remaining parameters, including the covariance matrix ${\mathbf \Sigma}$ of a group’s regression coefficients. A popular prior for ${\mathbf \Sigma}$ is the inverse-Wishart distribution, but there are some problems with using it in this role. A post over at dahtah has the details, but essentially the problem is that under the standard “noninformative” version of the inverse-Wishart prior, the one that makes the marginal distribution of the correlations uniform, large standard deviations are associated with large absolute correlations. That isn’t exactly noninformative, and it can have adverse effects on inference in models where we want shrinkage to occur.

The question, then, is what prior distribution we should use instead. A paper (pdf) by Barnard, McCulloch and Meng argues for a separation strategy: write ${\mathbf \Sigma}$ as $\mathbf{\Delta\Omega\Delta}$, where $\mathbf{\Delta}=\mathrm{diag}(\mathbf{\delta})$ is a diagonal matrix of standard deviations and ${\mathbf \Omega}$ is a correlation matrix, and then model ${\mathbf \Delta}$ and ${\mathbf \Omega}$ separately. There are a number of ways to do this, but Barnard et al. suggest independent lognormal priors on the standard deviations in $\mathbf{\delta}$:

\[\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(\delta_0,\sigma^2_\delta)\]

while using the following family of densities on $\mathbf{\Omega}$:

\[ f_k(\mathbf{\Omega}|v) \propto |\mathbf{\Omega}|^{-\frac{1}{2}(v+k+1)}\left(\prod_i\omega^{ii}\right)^{-\frac{v}{2}} = |\mathbf{\Omega}|^{\frac{1}{2}(v-1)(k-1)-1}\left(\prod_i|\mathbf{\Omega}_{ii}|\right)^{-\frac{v}{2}} \]

where $\omega^{ii}$ is the $i$th diagonal element of $\mathbf{\Omega}^{-1}$, $\mathbf{\Omega}_{ii}$ is the principal submatrix of $\mathbf{\Omega}$ obtained by deleting its $i$th row and column (the two forms agree because $\omega^{ii}=|\mathbf{\Omega}_{ii}|/|\mathbf{\Omega}|$), $k$ is the dimension of $\mathbf{\Omega}$, and $v$ is a tuning parameter. This density is obtained by transforming an inverse-Wishart random matrix, specifically an $IW(v, \mathbf{I})$ draw, into a correlation matrix. It turns out that $v=k+1$ results in uniform marginal distributions on the correlations in $\mathbf{\Omega}$.
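Since this density is just what you get by rescaling an $IW(v, \mathbf{I})$ draw to unit diagonal, a quick sanity check of the $v=k+1$ claim is easy to run; the sketch below assumes MCMCpack (for riwish, which the code at the end of the post also uses) and $k=2$.

## draw Sigma ~ IW(v, I) with v = k + 1 = 3 and rescale it to a correlation
## matrix; the off-diagonal correlation should look roughly uniform on (-1, 1)
library(MCMCpack)
rho <- replicate(10000, cov2cor(riwish(3, diag(2)))[2, 1])
hist(rho, breaks = 50)   # approximately flat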

An alternative based on the separation strategy comes from a paper (pdf) by O’Malley and Zaslavsky, and is endorsed by Gelman: instead of modeling ${\mathbf \Omega}$ as a correlation matrix, constrain it only to be positive definite, so that ${\mathbf \Delta}$ and ${\mathbf \Omega}$ jointly determine the standard deviations while ${\mathbf \Omega}$ alone still determines the correlations. This strategy uses the same lognormal prior on $\mathbf{\Delta}$ but places an inverse-Wishart distribution, $IW(k+1, \mathbf{I})$, on $\mathbf{\Omega}$, which still induces marginally uniform correlations on the resulting covariance matrix. This is attractive because the inverse-Wishart distribution is already available in a number of statistical packages and typically yields a conditionally conjugate model for the covariance matrix, allowing for a relatively simple analysis. If $\mathbf{\Omega}$ is constrained to be a correlation matrix as in Barnard et al., sampling from its conditional posterior requires significantly more work. So the scaled inverse-Wishart is much easier to work with, but in principle it still allows some dependence between the correlations and the variances in $\mathbf{\Sigma}$.
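To see concretely what “jointly determine” means here: if $\mathbf{\Sigma}=\mathbf{\Delta\Omega\Delta}$ with $\mathbf{\Omega}$ not restricted to have unit diagonal, then element-wise

\[ \Sigma_{ij}=\delta_i\delta_j\Omega_{ij}, \qquad \sqrt{\Sigma_{ii}}=\delta_i\sqrt{\Omega_{ii}}, \qquad \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}=\frac{\Omega_{ij}}{\sqrt{\Omega_{ii}\Omega_{jj}}} \]

so the standard deviations depend on both $\mathbf{\Delta}$ and the diagonal of $\mathbf{\Omega}$, while the correlations are functions of $\mathbf{\Omega}$ alone.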

I’ve used the scaled inverse-Wishart before, so I was curious how much of a difference it makes to use Barnard et al.’s prior. The impact on inference of using different priors will vary from model to model, but we can at least see how the different priors encode the same prior information. Since $f$ is just the density of the correlation matrix resulting from an inverse-Wishart distribution on the covariance matrix, it’s relatively easy to sample from $f$: just sample from an inverse-Wishart and transform to correlations. So I specified Barnard et al.’s separation strategy prior (SS) and the scaled inverse-Wishart prior (sIW) in a similar way, sampled from both, and then compared them to a sample from a standard inverse-Wishart (IW). In the first test, I assumed that the covariance matrix was $2\times2$ and that $\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(0,1)$ for both the SS and sIW priors. The only difference was in $\mathbf{\Omega}$: for the SS prior I assumed $\mathbf{\Omega}\sim f(\mathbf{\Omega}|v=3)$, and for the sIW prior I assumed $\mathbf{\Omega}\sim IW(3, \mathbf{I})$. In other words, for both priors I assumed that the correlations were marginally uniform. For the IW prior, I assumed $\mathbf{\Sigma}\sim IW(3, \mathbf{I})$, again ensuring marginally uniform correlations as a point of comparison. Taking 10,000 samples from each prior, we can see this behavior in the following plot of the correlations:

Histograms of correlation by prior

As expected, the correlations look uniform under all three priors. With either the SS or sIW prior, it’s possible to change this so that the prior favors either high correlations (closer to $1$ or $-1$) or low correlations (closer to $0$) by manipulating $v$; see Barnard et al. or O’Malley & Zaslavsky for details. Next, we take a look at the log standard deviations:

Histograms of log standard deviations by prior and variance component

The histograms for SS and sIW look pretty similar; the main difference is that sIW has slightly fatter tails and a somewhat higher variance. In other words, given the same parameter choices, sIW yields a slightly less informative prior. This isn’t a problem: once we understand the behavior of sIW vs. SS, we can set the priors on $\mathbf{\Delta}$ differently to encode the same prior information on the standard deviations. Neither prior as specified is satisfactorily noninformative, if that were the prior information we were trying to capture, but that is easily fixed by increasing the prior variance. The IW prior, on the other hand, allows only a very narrow range of values for the standard deviations, and that range can only be changed by also changing the prior on the correlations, which is one of its main drawbacks. Finally, we look at the dependence between the first variance component and the correlation coefficient in the next plot:

Scatterplot of correlation coefficient vs. first variance component by prior

As expected, there’s no dependence between the correlation and the variance in the SS prior – the prior assumes they are independent. The sIW prior, on the other hand, exhibits the same disturbing dependence as the IW prior, also documented in Simon Barthelmé’s post at dahtah. High variances are associated with more extreme correlations in both the sIW and IW priors – the sIW prior doesn’t seem to improve on the IW prior at all in this respect.

I then changed the prior on $\mathrm{log}(\mathbf{\delta})$ to have mean $(10, 0)$ and covariance matrix $\mathrm{diag}(.2, .2)$, reflecting a situation where we have a fairly strong prior belief that the two standard deviations are different from each other. Without further commentary, the plots tell essentially the same story:

Histograms of correlation by prior

Histograms of log standard deviations by prior and variance component

Scatterplot of correlation coefficient vs. first variance component by prior

It doesn’t look great for the scaled inverse-Wishart. This prior can encode a wider range of prior information than the standard inverse-Wishart by allowing the modeler to model the correlations and the variances separately. However, the modeler isn’t able to control the prior dependence between the variances and the correlations; the level of dependence appears to be the same as in the standard inverse-Wishart prior. I suspect that for most researchers the problems this causes aren’t so bad when weighed against the computational issues that arise when trying to simulate a correlation matrix, but it’s hard to tell without actually fitting models to determine the effect on shrinkage.

Finally, here’s some R code to generate the plots in this post:

require(MCMCpack)  ## riwish()
require(MASS)      ## mvrnorm()
require(ggplot2)
require(reshape)
 
## simulates a single sample from the sIW prior
sIW.sim <- function(k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    R <- riwish(df, M)
    S <- diag(exp(mvrnorm(1,m, diag(s))))
    SRS <- S%*%R%*%S
    return(SRS)
}
 
## simulates n samples from the sIW prior
sIW.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- sIW.sim(k, m, s, df, M)
        sds <- sqrt(diag(E))
        rho[i] <- cov2cor(E)[2,1]
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the SS prior
SS.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        sds <- exp(mvrnorm(1,m, diag(s)))
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the IW prior
IW.test <- function(n, k, df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        vars <- diag(E)
        sig1[i] <- sqrt(vars[1])
        sig2[i] <- sqrt(vars[2])
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## first test: marginally uniform correlations (df = k + 1 = 3) and
## log standard deviations that are iid N(0, 1)
n <- 10000
k <- 2
m <- c(0,0)
s <- c(1,1)
df <- 3
M <- diag(2)
 
 
sIWsam.1 <- sIW.test(n, k, m, s, df, M)
SSsam.1 <- SS.test(n, k, m, s, df, M)
IWsam.1 <- IW.test(n, k, df, M)
 
sIWsam.1$dens <- "sIW"
SSsam.1$dens <- "SS"
IWsam.1$dens <- "IW"
 
data.1 <- rbind(sIWsam.1, SSsam.1, IWsam.1)
 
qplot(value, data=data.1[data.1$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.1[data.1$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.1.melt <- melt(data.1, id=c("parameter", "dens", "sam"))
data.1.cast <- cast(data.1.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.1.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")
 
## second test: same correlation priors, but log(delta) now has
## mean (10, 0) and covariance diag(0.2, 0.2)
n <- 10000
k <- 2
m <- c(10,0)
s <- c(.2,.2)
df <- 3
M <- diag(2)
 
sIWsam.2 <- sIW.test(n, k, m, s, df, M)
SSsam.2 <- SS.test(n, k, m, s, df, M)
IWsam.2 <- IW.test(n, k, df, M)
 
sIWsam.2$dens <- "sIW"
SSsam.2$dens <- "SS"
IWsam.2$dens <- "IW"
 
 
data.2 <- rbind(sIWsam.2, SSsam.2, IWsam.2)
 
qplot(value, data=data.2[data.2$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.2[data.2$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.2.melt <- melt(data.2, id=c("parameter", "dens", "sam"))
data.2.cast <- cast(data.2.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.2.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")

Comments



  1. I am a newbie in the field and am trying to learn a little bit about all of these hierarchical Bayesian models.

    I have a basic question: what is the reason for using the inverse-Wishart distribution? Is it just because of conjugacy? If so, why not the Wishart itself, or why not just a Gaussian? They are conjugate as well.

    • The inverse-Wishart distribution is the conditionally conjugate prior for the covariance matrix of a normal distribution, and that’s the primary reason it’s used: computational convenience. The problem with most other priors is that, whatever their advantages, computation is difficult and slow enough to make them often not worthwhile.

      The Wishart distribution is not conditionally conjugate in the normal model. If you write down the conditional distribution of the covariance matrix given the data and any other model parameters, this is easy to see. The Gaussian (normal) distribution is also not conditionally conjugate, and it suffers from an additional problem: a covariance matrix has to be positive definite, which, among other things, restricts the diagonal elements to be positive. A normal distribution doesn’t satisfy this constraint, since each element of the matrix would take values on the entire real line. Additionally, you’d have to make sure the matrix was symmetric; this just means that your n-by-n matrix has n(n+1)/2 unique elements instead of n^2, but I’m not sure there’s a nice way to write down this version of the normal distribution.
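      As a sketch of that calculation, with data $y_1,\dots,y_n \stackrel{iid}{\sim} N_k(\mu, \Sigma)$ and the prior $\Sigma \sim IW(\nu_0, \mathbf{S}_0)$, parameterized so that $p(\Sigma)\propto|\Sigma|^{-\frac{1}{2}(\nu_0+k+1)}\exp\{-\frac{1}{2}\mathrm{tr}(\mathbf{S}_0\Sigma^{-1})\}$: the likelihood contributes $|\Sigma|^{-\frac{n}{2}}\exp\{-\frac{1}{2}\mathrm{tr}(\sum_i(y_i-\mu)(y_i-\mu)^T\Sigma^{-1})\}$, so

      \[\Sigma \mid \mu, y \sim IW\!\left(\nu_0 + n,\ \mathbf{S}_0 + \sum_{i=1}^n (y_i-\mu)(y_i-\mu)^T\right),\]

      which is inverse-Wishart again. Writing out the same product with a Wishart or element-wise Gaussian prior on $\Sigma$ doesn’t collapse into any standard distribution.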

      • Thanks for making it clear. I hope you agree with me that having a clear idea of the computational cost and implementation challenges of a model requires hands-on experience, which I do not have. For example, I am trying to implement a Dirichlet process, and the available code on the web is far more complicated than the formulas in the papers :).

        You have a nice blog – keep writing! ;)
