Prior distributions for covariance matrices: the scaled inverse-Wishart prior

At some point in a variety of Bayesian applications, we end up having to specify a prior distribution on a covariance matrix. The canonical example is a hierarchical regression model. Suppose we have a response vector $y$ along with some covariate $x$ where we expect a simple linear regression model to provide a decent fit. In addition, suppose the data are organized into distinct groups such that each group possibly has its own regression line, though we expect the lines to be related. To model this situation, we start with the data level:

\[y_{ij} | \alpha_j, \beta_j \stackrel{iid}{\sim} N(\alpha_j + x_{ij}\beta_j, \sigma^2)\]

where $j=1,…,J$ indicates groups and $i=1,…,I_j$ indicates observations within groups. In other words, each group has a simple linear regression with a common error variance $\sigma^2$. We further model the regression coefficients as coming from a common distribution:

\[\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \stackrel{iid}{\sim} N\left(\begin{pmatrix} \alpha_0 \\ \beta_0 \end{pmatrix}, {\mathbf \Sigma}\right)\]
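To make the setup concrete, here is a minimal R sketch that simulates data from this two-level model. The number of groups, the hyperparameter values, and the covariance matrix below are arbitrary choices for illustration, not values used anywhere else in this post.

## Simulate data from the hierarchical regression above (all numeric values are arbitrary illustrations)
library(MASS)                                  ## for mvrnorm()
set.seed(1)
J <- 10                                        ## number of groups
I.j <- 20                                      ## observations per group
alpha0 <- 1; beta0 <- 2                        ## population-level intercept and slope
Sigma <- matrix(c(1, 0.3, 0.3, 0.5), 2, 2)     ## covariance of (alpha_j, beta_j)
sigma <- 0.5                                   ## common error standard deviation
coefs <- mvrnorm(J, c(alpha0, beta0), Sigma)   ## one (alpha_j, beta_j) pair per group
group <- rep(1:J, each = I.j)
x <- rnorm(J * I.j)
y <- coefs[group, 1] + x * coefs[group, 2] + rnorm(J * I.j, 0, sigma)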

In order to complete the model and run a Bayesian analysis, we need to add priors for all remaining parameters, including the covariance matrix ${\mathbf \Sigma}$ of a group’s regression coefficients. A popular prior for ${\mathbf \Sigma}$ is the inverse-Wishart distribution, but there are some problems with using it in this role. A post over at dahtah has the details, but essentially the problem is that under the standard “noninformative” version of the inverse-Wishart prior, the one that makes the marginal distribution of the correlations uniform, large standard deviations are associated with large absolute correlations. This isn’t exactly noninformative, and it can have adverse effects on inference in models where we want shrinkage to occur.

The question, then, is what prior distribution should we use instead? A paper (pdf) by Barnard, McCulloch and Meng argues for using a separation strategy: write ${\mathbf \Sigma}$ as $\mathbf{\Delta\Omega\Delta}$, where $\mathbf{\Delta}=\mathrm{diag}(\mathbf{\delta})$ is a diagonal matrix of standard deviations and ${\mathbf \Omega}$ is a correlation matrix, then model ${\mathbf \Delta}$ and ${\mathbf \Omega}$ separately. There are a number of ways to do this, but Barnard et al. suggest using independent lognormal priors on the standard deviations in $\mathbf{\delta}$:

\[\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(\delta_0,\sigma^2_\delta)\]

while using the following family of densities on $\mathbf{\Omega}$:

\[ f_k(\mathbf{\Omega}|v) \propto |\mathbf{\Omega}|^{-\frac{1}{2}(v+k+1)}\left(\prod_i\omega^{ii}\right)^{-\frac{v}{2}} = |\mathbf{\Omega}|^{\frac{1}{2}(v-1)(k-1)-1}\left(\prod_i|\mathbf{\Omega}_{ii}|\right)^{-\frac{v}{2}} \]

where $\omega^{ii}$ is the $i$’th diagonal element of $\mathbf{\Omega}^{-1}$, $\mathbf{\Omega}_{ii}$ is the principal submatrix of $\mathbf{\Omega}$ obtained by deleting its $i$’th row and column, $k$ is the dimension of $\mathbf{\Omega}$, and $v$ is a tuning parameter. This density is obtained by transforming an inverse-Wishart random matrix, specifically $IW(v, \mathbf{I})$, into a correlation matrix. It turns out that $v=k+1$ results in uniform marginal distributions on the correlations in $\mathbf{\Omega}$.
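Here is a concrete sketch of a single draw of $\mathbf{\Sigma}$ under this separation strategy (a minimal illustration using riwish from the MCMCpack package; the hyperparameter values are arbitrary):

library(MCMCpack)                                ## for riwish()
## One draw from the separation strategy prior (sketch; hyperparameters are arbitrary)
k <- 2
Omega <- cov2cor(riwish(k + 1, diag(k)))         ## draw from f(Omega | v = k + 1): marginally uniform correlations
delta <- exp(rnorm(k, 0, 1))                     ## independent lognormal standard deviations
Sigma <- diag(delta) %*% Omega %*% diag(delta)   ## Sigma = Delta Omega Delta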

An alternative strategy, based on the separation strategy, comes from a paper (pdf) by O’Malley and Zaslavsky and is endorsed by Gelman – instead of modeling ${\mathbf \Omega}$ as a correlation matrix, only constrain it to be positive definite, so that ${\mathbf \Delta}$ and ${\mathbf \Omega}$ jointly determine the standard deviations while ${\mathbf \Omega}$ alone still determines the correlations. This strategy uses the same lognormal prior on $\mathbf{\Delta}$ but uses the inverse-Wishart distribution $IW(k+1, \mathbf{I})$ on $\mathbf{\Omega}$, which still induces marginally uniform correlations on the resulting covariance matrix. This strategy is attractive since the inverse-Wishart distribution is already in a number of statistical packages and typically results in a conditionally conjugate model for the covariance matrix, allowing for a relatively simple analysis. If $\mathbf{\Omega}$ is constrained to be a correlation matrix as in Barnard et al., sampling from the conditional posterior requires significantly more work. So the scaled inverse-Wishart is much easier to work with, but in theory it still allows for some dependence between the correlations and the variances in $\mathbf{\Sigma}$.

I’ve used the scaled inverse-Wishart before, so I was curious about how much of a difference it makes to use Barnard et al.’s prior. The impact on inference of using different priors will vary from model to model, but we can at least see how different priors encode the same prior information. Since $f$ is just the density of the correlation matrix resulting from an inverse-Wishart distribution on the covariance matrix, it’s relatively easy to sample from $f$ – just sample from an inverse-Wishart and transform to correlations. So I specified Barnard et al.’s separation strategy prior (SS) and the scaled inverse-Wishart prior (sIW) in a similar way, sampled from both, and then compared them to a sample from a standard inverse-Wishart (IW). In the first test, I assumed that the covariance matrix was $2\times2$ and that $\mathrm{log}(\delta_j)\stackrel{iid}{\sim}N(0,1)$ for both the SS and sIW priors. The only difference was in $\mathbf{\Omega}$: for the SS prior I assumed that $\mathbf{\Omega}\sim f(\mathbf{\Omega}|v=3)$, and for the sIW prior I assumed $\mathbf{\Omega}\sim IW(3, \mathbf{I})$. In other words, for both priors I assumed that the correlations were marginally uniform. For the IW prior, I assumed that $\mathbf{\Sigma}\sim IW(3, \mathbf{I})$ to once again ensure marginally uniform correlations as a point of comparison. Taking 10,000 samples from each prior, we can see this behavior in the following plot of the correlations:

[Figure: Histograms of correlation by prior]

As expected, the correlations look uniform under all three priors. With either the SS or sIW prior, it’s possible to change this so that they favor either high correlations (closer to $1$ or $-1$) or low correlations (closer to $0$) by manipulating $v$ – see Barnard et al. or O’Malley & Zaslavsky for details.
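As a quick sketch of that effect (a minimal illustration using riwish from MCMCpack; the values of $v$ below are arbitrary), we can look at the prior implied on the correlation of a $2\times2$ matrix for a few choices of $v$:

library(MCMCpack)
## Implied prior on the correlation for a few values of v (sketch; the v values are arbitrary)
corr.draws <- function(v, n = 10000) replicate(n, cov2cor(riwish(v, diag(2)))[2, 1])
par(mfrow = c(1, 3))
hist(corr.draws(2), main = "v = 2", xlab = "Correlation")    ## v < k + 1 favors correlations near -1 and 1
hist(corr.draws(3), main = "v = 3", xlab = "Correlation")    ## v = k + 1 gives marginally uniform correlations
hist(corr.draws(10), main = "v = 10", xlab = "Correlation")  ## v > k + 1 concentrates mass near 0

Next, we take a look at the log standard deviations: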

[Figure: Histograms of log standard deviations by prior and variance component]

The histograms for SS and sIW look pretty similar; the only difference is that sIW has slightly fatter tails and/or a higher variance. In other words, given the same parameter choices, sIW yields a slightly less informative prior. This isn’t a problem – once we understand the behavior of sIW vs. SS, we can set the priors on $\mathbf{\Delta}$ differently to encode the same prior information on the standard deviations. The priors as specified aren’t particularly noninformative, if that were the prior information we were trying to capture, but that can easily be fixed by increasing the variance of the lognormal prior. The IW prior, on the other hand, has a very narrow range of values for the standard deviation that can only be changed by also changing the prior on the correlations – one of its main drawbacks. Finally, we look at the dependence between the first variance component and the correlation coefficient in the next plot:

[Figure: Scatterplot of correlation coefficient vs. first variance component by prior]

As expected, there’s no dependence between the correlation and the variance in the SS prior – the prior assumes they are independent. The sIW prior, on the other hand, exhibits the same disturbing dependence as the IW prior, also documented in Simon Barthelmé’s post at dahtah. High variances are associated with more extreme correlations in both the sIW and IW priors – the sIW prior doesn’t seem to improve on the IW prior at all in this respect.

For a second test, I changed the prior on $\mathrm{log}(\mathbf{\delta})$ to have mean $(10, 0)$ and covariance matrix $\mathrm{diag}(.2, .2)$, reflecting a situation where we have a fairly strong prior belief that the two standard deviations are different from each other. Without commentary, the plots tell essentially the same story:

[Figure: Histograms of correlation by prior]

[Figure: Histograms of log standard deviations by prior and variance component]

[Figure: Scatterplot of correlation coefficient vs. first variance component by prior]

It doesn’t look great for the scaled inverse-Wishart. This prior is capable of encoding a wider range of prior information than the standard inverse-Wishart by allowing the modeler to separately model the correlations and the variances. However, the modeler isn’t able to control the prior dependence between the variances and the correlations – the level of dependence appears to be the same as in the standard inverse-Wishart prior. I suspect that for most researchers the problems this causes aren’t so bad when weighed against the computational issues that arise when trying to simulate from a correlation matrix, but it’s hard to tell without actually fitting models in order to determine the effect on shrinkage.

Finally, here’s some R code to generate the plots in this post:

require(MCMCpack)  ## for riwish()
require(MASS)      ## for mvrnorm()
require(ggplot2)
require(reshape)
 
## simulates a single sample from the sIW prior
sIW.sim <- function(k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    R <- riwish(df, M)
    S <- diag(exp(mvrnorm(1,m, diag(s))))
    SRS <- S%*%R%*%S
    return(SRS)
}
 
## simulates n samples from the sIW prior
sIW.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- sIW.sim(k, m, s, df, M)
        sds <- sqrt(diag(E))
        rho[i] <- cov2cor(E)[2,1]
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the SS prior
SS.test <- function(n, k, m=rep(0,k), s=rep(1,k), df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        sds <- exp(mvrnorm(1,m, diag(s)))
        sig1[i] <- sds[1]
        sig2[i] <- sds[2]
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
}
 
## simulates n samples from the IW prior
IW.test <- function(n, k, df=k+1, M=diag(k)){
    sig1 <- rep(0,n)
    sig2 <- rep(0,n)
    rho <- rep(0,n)
    for(i in 1:n){
        E <- riwish(df, M)
        rho[i] <- cov2cor(E)[2,1]
        vars <- diag(E)
        sig1[i] <- sqrt(vars[1])
        sig2[i] <- sqrt(vars[2])
    }
    return(data.frame(value=c(sig1, sig2, rho), parameter=c(rep("sigma1",n), rep("sigma2",n), rep("rho",n)), sam=rep(c(1:n),3)))
 
}
 
n <- 10000
k <- 2
m <- c(0,0)
s <- c(1,1)
df <- 3
M <- diag(2)
 
 
sIWsam.1 <- sIW.test(n, k, m, s, df, M)
SSsam.1 <- SS.test(n, k, m, s, df, M)
IWsam.1 <- IW.test(n, k, df, M)
 
sIWsam.1$dens <- "sIW"
SSsam.1$dens <- "SS"
IWsam.1$dens <- "IW"
 
data.1 <- rbind(sIWsam.1, SSsam.1, IWsam.1)
 
qplot(value, data=data.1[data.1$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.1[data.1$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.1.melt <- melt(data.1, id=c("parameter", "dens", "sam"))
data.1.cast <- cast(data.1.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.1.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")
 
n <- 10000
k <- 2
m <- c(10,0)
s <- c(.2,.2)
df <- 3
M <- diag(2)
 
sIWsam.2 <- sIW.test(n, k, m, s, df, M)
SSsam.2 <- SS.test(n, k, m, s, df, M)
IWsam.2 <- IW.test(n, k, df, M)
 
sIWsam.2$dens <- "sIW"
SSsam.2$dens <- "SS"
IWsam.2$dens <- "IW"
 
 
data.2 <- rbind(sIWsam.2, SSsam.2, IWsam.2)
 
qplot(value, data=data.2[data.2$parameter!="rho",], log="x", facets=dens~parameter, xlab="Log Standard Deviation")
 
qplot(value, data=data.2[data.2$parameter=="rho",], facets=dens~., xlab="Correlation")
 
data.2.melt <- melt(data.2, id=c("parameter", "dens", "sam"))
data.2.cast <- cast(data.2.melt, dens+sam~parameter)
 
qplot(sigma1^2, rho, data=data.2.cast, facets=.~dens, xlab="First variance component", ylab="Correlation", log="x")

Comments

  1. Pingback: Priors of convenience « dahtah

  2. Pingback: More on scaled-inverse Wishart and prior independence « Statistical Modeling, Causal Inference, and Social Science

  3. Pingback: Some helps for running and evaluating Bayesian regression models | House of Stones

  4. I am a newbie in the field and trying to learn a little bit about all these hierarchical Bayesian models.

    I have a basic question: what is the reason for using the inverse-Wishart distribution? Is it just because of conjugacy? If so, why not the Wishart itself, or why not just a Gaussian?! They are conjugate as well.

    • The inverse Wishart distribution is the conditionally conjugate prior for the covariance matrix of a normal distribution, and that’s the primary reason why it’s used – for computational convenience. The problem with most other priors is that despite whatever advantages they have, computation is difficult and slow enough to make them often not worthwhile.

      The Wishart distribution is not conditionally conjugate in the normal model. If you write down the conditional distribution of the covariance matrix given the data and any other model parameters, this is easy to see. The Gaussian (normal) distribution is also not conditionally conjugate, but it suffers from an additional problem – a covariance matrix has to be positive definite, which, among other things, restricts the diagonal elements to be positive. A normal distribution doesn’t satisfy these constraints – each element of the matrix would take on values on the entire real line. Additionally, you’d have to make sure the matrix was symmetric – this just means that your nxn matrix has n(n+1)/2 unique elements instead of n^2, but I’m not sure if there’s a nice way to write down this version of the normal distribution.

        Thanks for making it clear. I hope you agree with me that having a clear idea of the computational cost and implementation challenges of a model requires hands-on experience, which I do not have. For example, I am trying to implement a Dirichlet process, and the available code on the web is far more complicated than the formulas in the papers :).

        You have a nice blog – keep writing ! 😉

  5. I have been using an inverse Wishart prior on the covariance matrix for a bivariate normal hierarchical model. But I am surprised that even after using an identity matrix with 3 df, I don’t get a uniform distribution for the correlation coefficient, and the dependency between the variance and the correlation coefficient is different from what you have noticed. Here’s my code in WinBUGS:
    model{
      #likelihood
      for(j in 1:Nf){
        p1[j, 1:2] ~ dmnorm(gamma[1:2], T[1:2 ,1:2])
        for(i in 1:2){
          logit(p[j,i]) <- p1[j,i]
          Y[j,i] ~ dbin(p[j,i], n)
        }
      }
      #priors
      gamma[1:2] ~ dmnorm(mn[1:2], prec[1:2 ,1:2])
      expit[1] <- exp(gamma[1])/(1+exp(gamma[1]))
      expit[2] <- exp(gamma[2])/(1+exp(gamma[2]))
      T[1:2 ,1:2] ~ dwish(R[1:2 ,1:2], 2)
      sigma2[1:2, 1:2] <- inverse(T[,])
      sig <- log(sigma2[1,1])
      rho <- sigma2[1,2]/sqrt(sigma2[1,1]*sigma2[2,2])
    }

    # Data
    list(Nf = 20, mn = c(-1.59, -2.44), n = 60,
      prec = structure(.Data = c(.001, 0, 0, .001), .Dim = c(2, 2)),
      R = structure(.Data = c(1, 0, 0, 1), .Dim = c(2, 2)),
      Y = structure(.Data = c(
        27,8,  8,4,   2,2,   36,24, 5,4,
        10,1,  14,7,  12,8,  12,6,  22,11,
        11,7,  24,7,  13,6,  6,3,   2,1,
        10,1,  12,12, 11,8,  16,11, 2,0
      ), .Dim = c(20, 2)))

    • Are you sure you’re looking at the prior distribution? The correlation coefficient will be uniform in the prior, but the data will change that in the posterior.

      • Thanks for clarifying. You are correct, I was talking about the posterior distribution of the variance-covariance parameters. I was wondering whether you have noticed that, however the hyperparameters of the Wishart are defined, the posterior of the variance-covariance parameters puts most of its mass at the boundaries.

        • If I recall correctly, when the estimated variances are much larger than the prior means in the inverse Wishart prior, the estimated correlation will be set to approximately 1 or -1. I think this paper has some detail.

          Edit: After emailing Ignacio about it (first author on the linked paper) it seems I had misremembered. Can you describe more clearly what parameter is getting posterior mass near the boundary? And which boundary?

          • I wanted to show you the posterior distribution plots for var(p1), var(p2) cov(p1,p2) and correlation using the model above from my previous post. I am unable to paste the graphs here. The posterior estimates are concentrated around zero.

            • Variance estimates near zero are often a sign that MCMC isn’t working properly, which is often caused by… inverse gamma priors on variance parameters. Really, the inverse gamma/Wishart causes all sorts of problems. Since you’re using BUGS, try using the prior suggested in the top comment on this post – an inverse Wishart on the covariance matrix with a diagonal scale matrix, but with the diagonal entries further modeled as coming from a gamma distribution. This gives you implied half-t priors on the standard deviations, which is much, much better. Also, Ignacio noticed that the degrees of freedom in the code you posted isn’t set to the value that yields a uniform prior on the correlations. It should be dwish(R[1:2 ,1:2], 3), not dwish(R[1:2 ,1:2], 2). This change on its own might solve the MCMC problems – the closer that hyperparameter is to its boundary, the more ill-behaved MCMC will tend to be in my experience.

            P.S. I just noticed the commenting system is sometimes labeling my comments as coming from “Matt” and other times from “Admin” – we are one and the same!

  6. Thanks for the post. I definitely hadn’t thought of this before and was just using the scaled inverse Wishart not realizing that the priors might not have the properties I thought. Any idea how to implement the SS prior in BUGS/JAGS?

  7. Hi,
    I stumbled on your post while reading up on scaled Inverse Wisharts, and I was hoping you would have some insight:

    I have a problem where I am trying to encode as a (somewhat strong) prior that one of the covariances is larger than the other by about an order of magnitude.

    The issue is that the data might be rotated, so I can’t just set that directly in the IW as nu_0 = 1e3, Lambda ~ diag([10, 1])*1e3 (if I say x has a large variance, but it turns out y does, then the posterior is not so good).

    I’ve read through multiple blogs and papers (this one is a nice summary of the different methods: http://arxiv.org/pdf/1408.4050.pdf), but the focus seems to be on making sure the prior is uninformative, without actually discussing which allows more flexibility in shaping the prior.

    I have three questions:

    1) Any idea whether the sIW would be able to handle this? It seems that based on your comment where you say “I changed the prior on log(δ) to have mean (10,0) and covariance matrix diag(.2,.2) reflecting a situation where we have a fairly strong prior belief that the two standard deviations are different from each other” it should be able to, but I am not sure whether that would still handle the case where the data is rotated?

    2) Assuming the sIW would work, I’ve tried following the derivation in O’Malley and Zaslavsky to implement it inside of https://github.com/mattjj/pybasicbayes (I’m assuming different Matt?), but in equation (11) I’m not sure where the “σ” comes from! Do you have any other references for a derivation of this conditional posterior likelihood for δ?

    3) Since the conditional posterior for δi cannot be easily sampled from, and that a Metropolis-Hastings step is necessary, how is this done inside of a Gibbs sampler (as part of a larger MCMC model):

    One Gibbs step:
    sample µ | y, Σ=∆Φ∆
    sample Φ | y, µ
    sample δi from f(δi) using Metropolis-Hastings with t proposal distribution

    OR

    One Gibbs step:
    Iterate for n steps:
    sample µ | y, Σ=∆Φ∆
    sample Φ | y, µ
    sample δi from f(δi) using Metropolis-Hastings with t proposal distribution

    I am assuming it’s the latter, but if it is the latter, I don’t see the point of using sIW instead of the separation strategy prior? If I have to iterate for each Gibbs step, I might as well transform Φ to a correlation?

    Any help is much appreciated :)
