1 Introduction
Covariance estimation and selection is a fundamental problem in multivariate statistical inference and plays a crucial role in many data-analytic methods. In high-dimensional settings, where the number of variables is much larger than the number of samples, the sample covariance matrix (the traditional estimator of the population covariance matrix) can perform rather poorly; see (Bickel and Levina, 2008a,b; El Karoui, 2007) for example. To address the challenge posed by high-dimensionality, several promising methods have been proposed in the literature. In particular, methods inducing sparsity in the covariance matrix, its inverse, or the Cholesky factor of the inverse have proven to be very effective in applications. In this paper, we focus on imposing sparsity on the Cholesky factor of the inverse covariance (precision) matrix. These models are also referred to as Gaussian DAG models.
Consider a setting where we have $n$ i.i.d. observations from a $p$-variate normal distribution with mean vector $0$ and covariance matrix $\Sigma$. Let $\Omega = \Sigma^{-1} = LDL^\top$ be the unique modified Cholesky decomposition of the inverse covariance matrix, where $L$ is a lower triangular matrix with unit diagonal entries, and $D$ is a diagonal matrix with positive diagonal entries. A given sparsity pattern on $L$ corresponds to certain conditional independence relationships, which can be encoded in terms of a directed acyclic graph (DAG) $\mathscr{D}$ on the set of variables as follows: if variables $i$ and $j$ do not share an edge in $\mathscr{D}$, then $L_{ij} = 0$ (see Section 2 for more details). There are two major approaches in the literature for sparse estimation of $L$. The first approach is based on regularized likelihood/pseudo-likelihood methods using penalization; see (Huang, Liu, Pourahmadi, and Liu, 2006; Rutimann and Buhlmann, 2009; Shojaie and Michailidis, 2010; Rothman, Levina, and Zhu, 2010; Aragam, Amini, and Zhou, 2015; Yu and Bien, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017). Some of these frequentist approaches assume $L$ is banded, i.e., the elements of $L$ that are far from the diagonal are taken to be zero. The other methods put restrictions on the maximum number of nonzero entries in $L$.
On the Bayesian side, when the underlying graph is known, a literature exists exploring posterior convergence rates for Gaussian concentration graph models, which induce sparsity in the inverse covariance matrix; see (Banerjee and Ghosal, 2014, 2015; Xiang, Khare, and Ghosh, 2015; Lee and Lee, 2017) for example. Gaussian concentration graph models and the Gaussian DAG models studied in this paper intersect only at perfect DAG models, which are equivalent to decomposable concentration graph models. For general Gaussian DAG models, comparatively few works have tackled asymptotic consistency properties. Recently, Cao, Khare, and Ghosh (2019) establish both strong model selection consistency and posterior convergence rates for sparse Gaussian DAG models in a high-dimensional regime. In particular, the authors consider a hierarchical Gaussian DAG model with DAG-Wishart priors, introduced in (Ben-David, Li, Massam, and Rajaratnam, 2016), on the Cholesky parameter space, and independent Bernoulli
priors for each edge in the DAG (the so-called Erdos-Renyi prior). However, the sparsity assumptions on the true model required to establish consistency are rather restrictive. In addition, as a result of the extremely small value of the edge probability in the Bernoulli prior, the simulation studies always tend to favor sparser models under smaller values of the edge probability. Lee, Lee, and Lin (2018) also explore Cholesky factor selection consistency under the empirical sparse Cholesky (ESC) prior and the corresponding posteriors. Compared with (Cao, Khare, and Ghosh, 2019), Lee, Lee, and Lin (2018) establish strong model selection consistency for the posterior distribution under relaxed conditions on the dimensionality, the sparsity, and the lower bound of the nonzero elements in the Cholesky factor. It recently came to our attention that two more flexible alternatives to the Erdos-Renyi prior have been considered in the undirected graphical models literature: (a) the multiplicative prior (Tan, Jasra, De Iorio, and Ebbels, 2017), and (b) the beta-mixture prior (Carvalho and Scott, 2009). Both priors are more diverse than the Erdos-Renyi prior (the Erdos-Renyi prior can be obtained as a degenerate version of either), and have various attractive properties. For example, the multiplicative model prior can account for greater variability in the degree distribution than the Erdos-Renyi model, while the beta-mixture prior allows stronger control over the number of spurious edges and corrects for multiple hypothesis testing automatically. We provide the algebraic forms of these priors in Section 3 and Section 5 respectively, and refer the reader to (Carvalho and Scott, 2009; Tan, Jasra, De Iorio, and Ebbels, 2017) for a detailed discussion of their properties.
To the best of our knowledge, a rigorous investigation of high-dimensional posterior consistency properties under the multiplicative prior or the beta-mixture prior has not been undertaken for either undirected graphical models or Gaussian DAG models. Hence, our goal was to investigate whether high-dimensional consistency results could be established under these two more diverse and algebraically complex classes of prior distributions in the Gaussian DAG model setting. Another goal was to investigate whether these high-dimensional posterior consistency results can be obtained under much weaker conditions than (Cao, Khare, and Ghosh, 2019), in particular conditions similar to those in (Lee, Lee, and Lin, 2018). This was a challenging goal, particularly for the multiplicative model prior, as its prior mass function is not available in closed form (note that the mass functions for the Erdos-Renyi, ESC, and beta-mixture priors are available in closed form).
As the main contributions of this paper, we establish high-dimensional posterior consistency results for Gaussian DAG models with spike and slab priors on the Cholesky factor $L$, under both the multiplicative prior and the beta-mixture prior on the sparsity pattern of $L$ (Theorems 4.1 to 5.3), using assumptions similar to those in (Lee, Lee, and Lin, 2018) (where a different setting of ESC priors and posteriors is used). Also, through simulation studies, we demonstrate that the models studied in this paper can outperform existing state-of-the-art methods, including both penalized likelihood and Bayesian approaches, in different settings.
The rest of the paper is organized as follows. Section 2 provides background material on Gaussian DAG models and introduces the spike and slab prior on the Cholesky factor. In Section 3, we revisit the multiplicative prior, and present our hierarchical Bayesian model and the parameter class for the inverse covariance matrices. Model selection consistency results for the multiplicative prior and the beta-mixture prior are stated in Section 4 and Section 5 respectively, with proofs provided in Section 7. In Section 6, we use simulation experiments to illustrate the posterior ratio consistency results, and demonstrate the benefits of our Bayesian approach and computational procedures for Cholesky factor selection vis-a-vis existing Bayesian and penalized likelihood approaches. We end the paper with a discussion in Section 8.
2 Preliminaries
In this section, we provide the necessary background material on graph theory and Gaussian DAG models, and introduce our spike and slab prior on the Cholesky parameter.
2.1 Gaussian DAG Models
We consider the multivariate Gaussian model
$Y_1, \dots, Y_n \mid \Omega \overset{i.i.d.}{\sim} N_p(0, \Omega^{-1}),$  (1)
where $\Omega$ is a $p \times p$ inverse covariance matrix. Any positive definite matrix $\Omega$ can be uniquely decomposed as $\Omega = LDL^\top$, where $L$ is a lower triangular matrix with unit diagonal entries, and $D$ is a diagonal matrix with positive diagonal entries. This decomposition is known as the modified Cholesky decomposition of $\Omega$ (see for example Pourahmadi (2007)). In particular, model (1) can be interpreted as a Gaussian DAG model depending on the sparsity pattern of $L$.
A directed acyclic graph (DAG) $\mathscr{D} = (V, E)$ consists of a vertex set $V = \{1, \dots, p\}$ and an edge set $E$ such that there is no directed path starting and ending at the same vertex. As in (Ben-David, Li, Massam, and Rajaratnam, 2016; Cao, Khare, and Ghosh, 2019), we will, without loss of generality, assume a parent ordering, in which all edges are directed from larger vertices to smaller vertices. For several applications in genetics, finance, and climate science, a location- or time-based ordering of the variables is naturally available. For example, in genetic datasets, the variables can be genes or SNPs located contiguously on a chromosome, and their spatial location provides a natural ordering. More examples can be found in (Huang, Liu, Pourahmadi, and Liu, 2006; Shojaie and Michailidis, 2010; Yu and Bien, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017). The set of parents of a vertex $j$, denoted by $pa(j)$, is the collection of all vertices that are larger than $j$ and share an edge with $j$. Similarly, the set of children of $j$ is the collection of all vertices that are smaller than $j$ and share an edge with $j$.
A Gaussian DAG model over a given DAG $\mathscr{D}$, denoted by $\mathscr{N}_{\mathscr{D}}$, consists of all multivariate Gaussian distributions that obey the directed Markov property with respect to $\mathscr{D}$; in particular, each variable is conditionally independent of its non-descendants given its parents. Furthermore, it is well-known that if $\Omega = LDL^\top$ is the modified Cholesky decomposition of $\Omega$, then the corresponding Gaussian distribution belongs to $\mathscr{N}_{\mathscr{D}}$ if and only if $L_{ij} = 0$ whenever $i \notin pa(j)$. In other words, the structure of the DAG $\mathscr{D}$ is uniquely reflected in the sparsity pattern of the Cholesky factor $L$. In light of this, it is often more convenient to reparametrize the inverse covariance matrix $\Omega$ in terms of the Cholesky parameter $(L, D)$.
2.2 Notations
Consider the modified Cholesky decomposition $\Omega = LDL^\top$, where $L$ is a lower triangular matrix with unit diagonal entries and $D$ is a diagonal matrix with positive diagonal entries $d_1, \dots, d_p$. We propose imposing spike and slab priors on the strictly lower triangular entries of $L$ to recover the sparse structure of the Cholesky factor. To facilitate this purpose, we introduce latent binary variables $\gamma_{ij}$ for $1 \le j < i \le p$ to indicate whether $L_{ij}$ is active, i.e., $\gamma_{ij} = 1$ if $L_{ij} \ne 0$ and $\gamma_{ij} = 0$ otherwise. We can view the binary array $\gamma = (\gamma_{ij})$ as the indicator of the sparsity pattern of $L$. In other words, for each $1 \le j \le p - 1$, let $\gamma_j$, a subset of $\{j+1, \dots, p\}$, be the index set of all nonzero components in the $j$th column of $L$. Thus $\gamma$ explicitly gives the support of the Cholesky factor and the sparsity pattern of the underlying DAG. Denote by $|\gamma_j|$ the cardinality of the set $\gamma_j$ for $1 \le j \le p - 1$. Following the definition of $\gamma$, for any $p \times p$ matrix $A$, denote by $A_{\gamma_j}$ the subvector of the $j$th column of $A$ restricted to the indices in $\gamma_j$. In particular, $L_{\gamma_j}$ collects the nonzero off-diagonal entries of the $j$th column of $L$.
Next, we provide some additional required notation. For $x \in \mathbb{R}^p$, let $\|x\|_1$ and $\|x\|_2$ represent the standard $\ell_1$ and $\ell_2$ norms. For a $p \times p$ symmetric matrix $A$, let $\lambda_1(A) \le \lambda_2(A) \le \cdots \le \lambda_p(A)$ be the ordered eigenvalues of $A$, and denote the smallest and largest eigenvalues by $\lambda_{\min}(A) = \lambda_1(A)$ and $\lambda_{\max}(A) = \lambda_p(A)$. In particular, $\lambda_{\min}(A)\|x\|_2^2 \le x^\top A x \le \lambda_{\max}(A)\|x\|_2^2$ for all $x \in \mathbb{R}^p$.
2.3 Spike and Slab Prior on Cholesky Parameter
In this section, we specify our spike and slab prior on the Cholesky factor as follows.
$L_{ij} \mid \gamma_{ij} \overset{ind.}{\sim} \gamma_{ij}\, N(0, \tau^2) + (1 - \gamma_{ij})\, \delta_0, \qquad 1 \le j < i \le p,$  (2)
$d_j \overset{i.i.d.}{\sim} \text{Inverse-Gamma}(\alpha, \lambda), \qquad 1 \le j \le p,$  (3)
for some constants $\tau > 0$ and $\alpha, \lambda \ge 0$, where $\delta_0$ denotes a point mass at $0$. We refer to (2) and (3) as our spike and slab Cholesky (SSC) prior. Here $\gamma_{ij} = 1$ implies $L_{ij}$ being a "signal" (i.e., from the slab component), and $\gamma_{ij} = 0$ implies $L_{ij}$
being the "noise" (i.e., from the spike component). Note that, to obtain our desired asymptotic consistency results, appropriate conditions on these hyperparameters will be introduced in Section 4.1. Xu and Ghosh (2015) also impose this type of prior on regression coefficients. Further comparisons and discussion are provided in Remark 3.
Remark 1.
Note that in (3) we allow the hyperparameters of the inverse-gamma prior to be zero. In (Cao, Khare, and Ghosh, 2019), the DAG-Wishart prior with multiple shape parameters, introduced in (Ben-David, Li, Massam, and Rajaratnam, 2016), is placed on the Cholesky parameter. As indicated in Theorem 7.3 of (Ben-David, Li, Massam, and Rajaratnam, 2016), the DAG-Wishart distribution defined on the Cholesky parameter space given a DAG yields independent inverse-gamma distributions with strictly positive shape and scale parameters on the diagonal entries, and multivariate Gaussian distributions on the nonzero elements in each column of the Cholesky factor given the corresponding diagonal entry. Hence, for given DAG structures, there are both differences and connections between the DAG-Wishart prior and our spike and slab prior.
3 Model Specification
In this section, we revisit the multiplicative prior introduced in (Tan, Jasra, De Iorio, and Ebbels, 2017) over the space of graphs, and specify our hierarchical model.
3.1 Multiplicative Prior
In the context of Gaussian graphical models, Tan, Jasra, De Iorio, and Ebbels (2017) allow the probability of a link between nodes $i$ and $j$ to vary with $(i, j)$ by taking the edge probability to be $q_i q_j$, with $q_i \in (0, 1)$ for each node $i$. The authors further treat each $q_i$ as a random variable with a beta prior to adopt a fully Bayesian approach. They utilize Laplace approximations and, through simulation studies, show that the proposed multiplicative model (following the nomenclature in (Tan, Jasra, De Iorio, and Ebbels, 2017)) can encourage sparsity, or graphs that exhibit particular degree patterns, based on prior knowledge. Adapted to our framework, we consider the following multiplicative prior over the space of sparsity patterns for the Cholesky factor.
$\gamma_{ij} \mid q_i, q_j \overset{ind.}{\sim} \text{Bernoulli}(q_i q_j), \qquad 1 \le j < i \le p,$  (4)
$q_i \overset{i.i.d.}{\sim} \text{Beta}(a, b), \qquad 1 \le i \le p,$  (5)
where $a$ and $b$ are positive constants. Compared with the universal edge probability in an Erdos-Renyi prior, here we allow variation in the degree structure of each node through the different values of $q_i$. Note that under the multiplicative prior, the marginal posterior for $\gamma$ cannot be obtained in closed form, which leads to further challenges not only in the theoretical analysis, but also in the computational strategy. We will elaborate on this matter in Section 6.
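A minimal sketch of drawing a sparsity pattern from this prior, assuming the edge-probability form $q_i q_j$ of Tan et al.; the function name and hyperparameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multiplicative_prior(p, a, b, rng):
    """Draw a sparsity pattern gamma for the strictly lower triangle of L:
    q_i ~ Beta(a, b) per node, then gamma_ij ~ Bernoulli(q_i * q_j), i > j."""
    q = rng.beta(a, b, size=p)           # node-specific link propensities
    gamma = np.zeros((p, p), dtype=int)
    for i in range(1, p):
        for j in range(i):
            gamma[i, j] = rng.binomial(1, q[i] * q[j])
    return q, gamma

# Small b/(a+b) pushes the q_i down, hence a sparse pattern on average.
q, gamma = sample_multiplicative_prior(p=10, a=1.0, b=9.0, rng=rng)
```

Nodes with a large draw $q_i$ participate in many edges, so the induced degree distribution is more variable than under a single shared edge probability.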
3.2 Hierarchical Model Formulation
Let $Y_1, \dots, Y_n$ be independent and identically distributed $p$-variate Gaussian vectors with mean $0$ and true covariance matrix $\Sigma_0$, where $\Omega_0 = \Sigma_0^{-1} = L_0 D_0 L_0^\top$ is the modified Cholesky decomposition of the true precision matrix. Let $S = \frac{1}{n}\sum_{i=1}^{n} Y_i Y_i^\top$ denote the sample covariance matrix. The sparsity pattern of the true Cholesky factor $L_0$ is uniquely encoded in the true binary array, denoted $\gamma_0$. Similar to (Cao, Khare, and Ghosh, 2019), we denote by $d$ the maximum number of nonzero entries in any column of $L_0$, and by $s_n$ the smallest (in absolute value) nonzero off-diagonal entry of $L_0$. For sequences $a_n$ and $b_n$, $a_n = O(b_n)$ means $a_n \le C b_n$ for some constant $C > 0$ as $n \to \infty$, and $a_n = o(b_n)$ means $a_n / b_n \to 0$ as $n \to \infty$.
The class of spike and slab Cholesky distributions in Section 2 and the multiplicative priors in Section 3.1 can be used for Bayesian model selection of the Cholesky factor through the following hierarchical model,
(6)  
(7)  
(8)  
(9)  
(10) 
where $\text{Beta}(a, b)$ represents the beta distribution with shape parameters $a$ and $b$. The proposed hierarchical model has five hyperparameters: the scale parameter in model (7) controlling the variance of the slab part of the spike and slab prior on each $L_{ij}$, the shape parameter and scale parameter in model (8), and the two positive shape parameters of the beta distribution in model (10). Further restrictions on these hyperparameters to ensure the desired consistency will be specified in Section 4.1.2. The intuition behind this setup with latent variables is that elements of the Cholesky factor with zero or very small values will be identified with $\gamma_{ij} = 0$, while the active entries will be classified as $\gamma_{ij} = 1$. We use the posterior probabilities of all the latent variables to identify the active elements in $L$. In particular, the following lemma helps specify the upper bound for the marginal prior ratio and the marginal posterior ratio of any "non-true" model relative to the true model under the multiplicative prior. The proof is provided in Section 7.1.
Lemma 3.1.
Lemma 3.1 further enables the marginal posterior ratio to be bounded above by a product of decomposed prior terms as follows.
4 Model Selection Consistency
In this section, we explore the high-dimensional asymptotic properties of the Bayesian model selection approach for the Cholesky factor specified in Section 3.2. For this purpose, we work in a setting where the dimension $p = p_n$ of the data vectors and the hyperparameters vary with the sample size $n$. Assume that the data is generated from a true model specified as follows. Let $Y_1, \dots, Y_n$ be independent and identically distributed $p$-variate Gaussian vectors with mean $0$ and true covariance matrix $\Sigma_0$, where $\Omega_0 = \Sigma_0^{-1} = L_0 D_0 L_0^\top$ is the modified Cholesky decomposition of the true precision matrix. The sparsity pattern of the true Cholesky factor is reflected in $\gamma_0$. Recall from Section 3.2 that $d$ is the maximum number of nonzero entries in any column of $L_0$, and $s_n$ is the smallest (in absolute value) nonzero entry. In order to establish our asymptotic consistency results, we need the following mild assumptions, together with discussion and interpretation.
4.1 Assumptions
4.1.1 Assumptions on the True Parameter Class
Assumption 1.
There exists $\varepsilon_0 \le 1$, not depending on $n$, such that $\varepsilon_0 \le \lambda_{\min}(\Omega_0) \le \lambda_{\max}(\Omega_0) \le \varepsilon_0^{-1}$ for every $n \ge 1$.
This assumption ensures that the eigenvalues of the true precision matrix are bounded by fixed constants, a condition commonly used for establishing high-dimensional covariance asymptotic properties. See for example (Bickel and Levina, 2008a; El Karoui, 2008; Banerjee and Ghosal, 2014; Xiang, Khare, and Ghosh, 2015; Banerjee and Ghosal, 2015). Previous work (Cao, Khare, and Ghosh, 2019) relaxes this assumption by allowing the lower and upper bounds on the eigenvalues to depend on $n$ and $p$.
Assumption 2.
as .
This is a much weaker assumption for high-dimensional covariance asymptotics than in, for example, (Xiang, Khare, and Ghosh, 2015; Banerjee and Ghosal, 2014, 2015; Cao, Khare, and Ghosh, 2019). Here we essentially place a weaker restriction on the growth of the dimension relative to the sample size than in the previous literature.
Assumption 3.
as .
Recall that $s_n$ is the smallest (in absolute value) nonzero off-diagonal entry of $L_0$. Hence, this assumption, also known as the "beta-min" condition, provides a lower bound on the "slab" part of $L_0$ that is needed for establishing consistency. This type of condition has been used for exact support recovery in high-dimensional linear regression models as well as Gaussian DAG models. See for example (Lee, Lee, and Lin, 2018; Yang, Wainwright, and Jordan, 2016; Khare, Oh, Rahman, and Rajaratnam, 2017; Cao, Khare, and Ghosh, 2019; Yu and Bien, 2016).
Remark 2.
It is worth pointing out that our assumptions on the true Cholesky factor are weaker than those in (Lee, Lee, and Lin, 2018). In particular, Lee, Lee, and Lin (2018) introduce conditions A(2) and A(4) on the sparsity pattern of the true Cholesky factor, requiring the number of nonzero elements in each row as well as each column of $L_0$ to be smaller than some constant, while in this paper we allow the maximum number of nonzero entries in any column of $L_0$ to grow with $n$ at the rate permitted by Assumption 2.
4.1.2 Assumptions on the Prior Hyperparameters
Assumption 4.
for all satisfying , where .
This assumption essentially states that the prior on the space of possible models places zero mass on unrealistically large models (see similar assumptions in (Johnson and Rossell, 2012; Shin, Bhattacharya, and Johnson, 2018; Narisetty and He, 2014) in the context of regression). Assumption 4 is also more relaxed than Condition (P) in (Lee, Lee, and Lin, 2018). Note that this condition concerns the hyperparameter of the prior distribution on the latent variables only, and does not affect the true parameter space.
Assumption 5.
The hyperparameter in (9) satisfies and , as , for some constant .
This assumption provides the rate at which the variance of the slab prior is required to grow to guarantee desired model selection consistency. Similar conditions on the hyperparameter can be seen in (Narisetty and He, 2014; Shin, Bhattacharya, and Johnson, 2018; Johnson and Rossell, 2012).
Assumption 6.
This assumption provides the rate at which the shape parameter needs to grow to ensure the desired consistency. Previous literature with Erdos-Renyi priors instead puts restrictions on the rate of the edge probability. In particular, previous work (Cao, Khare, and Ghosh, 2019) assumes the edge probability decays with $p$ at an appropriate rate to penalize large models. Similar assumptions on the hyperparameters can also be found in (Yang, Wainwright, and Jordan, 2016; Narisetty and He, 2014) in the regression setting. In Section 6.2, we will see that the proposed model, which does not require specifying particular values for the edge probability, helps avoid potential computational limitations, such as simulation results always favoring the sparsest model.
For the rest of this paper, we leave out the subscript $n$ in the above quantities for notational convenience.
4.2 Posterior Ratio Consistency
We now state the main model selection consistency results. The proofs of all theorems are provided in Section 7.1 and Section 7.2. Our first result establishes what we refer to as "posterior ratio consistency" (following the terminology in (Cao, Khare, and Ghosh, 2019)). This notion of consistency implies that the true model will be the mode of the posterior distribution among all models, with probability tending to $1$ as $n \to \infty$.
Theorem 4.1.
Under Assumptions 1-6, the following holds:
4.3 Model Selection Consistency for Posterior Mode
If one is interested in a point estimate of $\gamma$ that reflects the sparsity pattern of $L_0$, the most apparent choice is the posterior mode, defined as
$\hat{\gamma} = \operatorname{argmax}_{\gamma}\, \pi(\gamma \mid Y_1, \dots, Y_n).$  (13)
From a frequentist point of view, it is natural to ask whether we have model selection consistency for the posterior mode. This follows immediately from the posterior ratio consistency established in Theorem 4.1, since the posterior mode maximizes the posterior probability over all models. Therefore, we have the following corollary.
Corollary 4.1.
Under Assumptions 1-6, the posterior mode is equal to the true model with probability tending to $1$, i.e.,
Remark 3.
In the context of linear regression, Xu and Ghosh (2015) tackle the Bayesian group lasso problem. In particular, the authors propose the following hierarchical Bayesian model:
In particular, they impose an independent spike and slab type prior on each factor (conditional on the variance parameter), and an inverse gamma prior on the variance. Each regression factor is explicitly present in the model with a fixed probability. In this setting, under an orthogonal design, the authors of (Xu and Ghosh, 2015) establish the oracle property and variable selection consistency at the group level for the median thresholding estimator of the regression coefficients. Note that, under the parent ordering, the off-diagonal entries in the $j$th column of $L$ can be interpreted as the linear regression coefficients from fitting the $j$th variable against all variables with label greater than $j$. Hence, there are similarities between (Xu and Ghosh, 2015) and this work with respect to both the model and the consistency results. However, despite these similarities, there are fundamental differences in the models and the corresponding analyses. Firstly, the number of groups (or factors) is considered fixed in (Xu and Ghosh, 2015), while we allow the number of predictors to grow at an exponential rate in $n$ in an ultra high-dimensional setting, which creates more theoretical challenges. Secondly, the 'design' matrices corresponding to the regression coefficients in each column of $L$, which can be represented as functions of the sample covariance matrix, are random and correlated with each other, while (Xu and Ghosh, 2015) only consider the orthogonal design, with no correlation introduced. Thirdly, the consistency result in (Xu and Ghosh, 2015) focuses on group-level selection only and is tailored for problems that require only group-level sparsity, while our model can induce sparsity in each individual element of $L$. The authors also propose a Bayesian hierarchical model, referred to as the Bayesian sparse group lasso, to enable shrinkage both at the group level and within groups. However, no consistency results are provided for that model.
Lastly, in our model, each coefficient is present independently with a multiplicative prior probability that incorporates sparsity information, which is not the case in (Xu and Ghosh, 2015), where each factor is present with a fixed probability. In particular, all the aspects discussed above lead to major differences and further challenges in analyzing the ratio of posterior probabilities.
4.4 Strong Model Selection Consistency
Next, we establish a stronger result (compared to Theorem 4.1), which implies that the posterior mass assigned to the true model converges to 1 in probability (under the true model). Following (Narisetty and He, 2014; Cao, Khare, and Ghosh, 2019), we refer to this notion of consistency as strong selection consistency.
Theorem 4.2.
Under Assumptions 1-6, the following holds:
Remark 4.
We would like to point out that our posterior ratio consistency and strong model selection consistency do not require any additional assumptions bounding the maximum number of edges. In particular, Cao, Khare, and Ghosh (2019) consider only DAGs whose total number of edges is suitably bounded. Under the assumptions of that work, the DAGs in the analysis exclude models where the Cholesky factor has relatively dense columns, while in our result each row can have as many nonzero entries as Assumption 4 permits. Hence, our strong model selection consistency result is more general than (Cao, Khare, and Ghosh, 2019; Lee, Lee, and Lin, 2018) in the sense that consistency holds for a larger class of DAGs.
5 Results for Betamixture Prior
Though the multiplicative prior allows variation among the indicator probabilities, the intractable marginal posterior remains problematic in practice. The authors of (Tan, Jasra, De Iorio, and Ebbels, 2017) address this issue via Laplace approximation. However, the associated computational cost becomes extensive as the dimension increases. To obtain the marginal posterior probabilities in closed form and for ease of computation, we consider the following beta-mixture prior over the space of $\gamma$, introduced in (Carvalho and Scott, 2009),
$\gamma_{ij} \mid q \overset{i.i.d.}{\sim} \text{Bernoulli}(q), \qquad 1 \le j < i \le p,$  (14)
$q \sim \text{Beta}(a, b),$  (15)
where $\text{Bernoulli}(q)$ denotes the Bernoulli distribution with success probability $q$, and $\text{Beta}(a, b)$ represents the beta distribution with shape parameters $a$ and $b$. We refer to (14) and (15) as the beta-mixture prior over the space of latent variables indicating the sparsity structure of the Cholesky factor.
Remark 5.
Cao, Khare, and Ghosh (2019) and Banerjee and Ghosal (2015) introduce an Erdos-Renyi type distribution as the prior over the space of DAGs, where each directed edge is present with probability $q$, independently of the other edges. In particular, they take the edge indicators $\gamma_{ij}$ to be independent, identically distributed Bernoulli($q$) random variables. Cao, Khare, and Ghosh (2019) establish DAG selection consistency under suitable assumptions, while Banerjee and Ghosal (2015) address estimation consistency and provide high-dimensional Laplace approximations for the marginal posterior probabilities of the graphs. In our framework, we extend the previous work by putting a beta distribution on the edge probability $q$. Beta-mixture priors have previously been placed on graphs for simulation purposes in (Carvalho and Scott, 2009), but their theoretical properties had yet to be investigated. A clear advantage of such an approach, as indicated in (Carvalho and Scott, 2009), is that treating the previously fixed tuning constant $q$ as a model parameter shrinks the graph size toward a data-determined value and allows strong control over the number of spurious edges. In order to obtain posterior consistency for $\gamma$, we need the following lemma, which specifies the closed form of the marginal posterior density of $\gamma$, with proof provided in Section 7.3.
Lemma 5.1.
The marginal posterior density under the betamixture prior satisfies
(16) 
in which , and The second equation follows from
In particular, these posterior probabilities can be used to select a model representing the sparsity pattern of $L_0$ by computing the posterior mode, i.e., the pattern that maximizes the posterior density. The convenient closed form for the marginal posterior in (16) also yields posterior ratio consistency under the following assumption, which is weaker than Assumption 6.
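The beta-Bernoulli marginalization underlying this closed form can be checked numerically. In the sketch below (function names are ours), integrating $q \sim \text{Beta}(a,b)$ out of $N$ i.i.d. Bernoulli($q$) indicators gives a marginal prior mass $B(a+k,\, b+N-k)/B(a,b)$ for a pattern with $k$ active indicators, and the masses of all $2^N$ patterns sum to one:

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_prior_mass(k, N, a, b):
    """Marginal prior mass of one pattern with k active indicators out of N,
    after integrating q ~ Beta(a, b) out of iid Bernoulli(q) indicators:
    pi(gamma) = B(a + k, b + N - k) / B(a, b)."""
    return log_beta(a + k, b + N - k) - log_beta(a, b)

# Sanity check: summing over all 2^N patterns, grouped by size k
# (there are C(N, k) patterns of each size), gives total mass one.
N, a, b = 3, 2.0, 5.0
total = sum(math.comb(N, k) * math.exp(log_prior_mass(k, N, a, b))
            for k in range(N + 1))
```

Working on the log scale, as above, is what one would use in practice, since the beta function terms underflow or overflow quickly as $N$ grows.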
Assumption 7.
Theorem 5.2.
Under Assumptions 1-5 and 7, the following holds under the beta-mixture prior:
The next theorem establishes strong selection consistency under the beta-mixture prior. See the proofs of Theorem 5.2 and Theorem 5.3 in Section 7.3.
Theorem 5.3.
Under Assumptions 1-6, the following holds for the beta-mixture prior:
Remark 6.
We would like to point out that posterior ratio consistency (Theorem 5.2) does not require the shape parameter of the beta distribution in (15) to grow; that requirement is only needed for strong selection consistency (Theorem 5.3). Similar restrictions on the hyperparameters have been considered for establishing consistency properties in the regression setup. See (Yang, Wainwright, and Jordan, 2016; Lee, Lee, and Lin, 2018; Cao, Khare, and Ghosh, 2018) for example.
The closed form for the marginal posterior probability in (16) is convenient for establishing consistency. However, in simulations, the beta term in (16) pertaining to the beta-mixture prior is often very large, and can sometimes blow up when the dimension is relatively large. In addition, under the beta-mixture prior, the probability $q$ is universal across all indicators, which may not be flexible and diverse enough. In the following section, we investigate and evaluate the simulation performance of both the multiplicative model and the beta-mixture model.
6 Simulation Studies
In this section, we demonstrate our main results through simulation studies. First, recall from Lemma 5.1 that the marginal posterior distribution for $\gamma$ under the beta-mixture prior is available analytically in closed form (up to a constant) in (16). Therefore, we can evaluate the model space directly with this naturally assigned "score", namely the posterior probability.
For the multiplicative prior, the $q_i$'s cannot be integrated out, so the closed form of the marginal distribution of $\gamma$ cannot be conveniently acquired. As indicated in (Tan, Jasra, De Iorio, and Ebbels, 2017), evaluating the marginal densities via Monte Carlo becomes more computationally intensive as the dimension increases. Therefore, the authors propose to estimate these quantities efficiently through Laplace approximation instead. Detailed functional and Hessian expressions can be found in the supplemental material of (Tan, Jasra, De Iorio, and Ebbels, 2017). Here we adopt the same Laplace approximation for estimating the marginal densities of $\gamma$. However, as we will see in Figure 3, though the multiplicative prior can potentially lead to better model selection performance, the additional procedure of evaluating each individual posterior probability can be quite time-consuming. In particular, the Newton-type algorithm used for obtaining the mode of the log-likelihood runs extremely slowly in higher dimensions.
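For intuition, a one-dimensional Laplace approximation replaces an intractable integral $\int e^{h(q)}\,dq$ by $e^{h(\hat q)}\sqrt{2\pi/(-h''(\hat q))}$ at the mode $\hat q$. The self-contained sketch below (not the exact integrand used by Tan et al.) applies it to a beta-type integrand, for which the exact answer is the beta function:

```python
import math

def laplace_log_integral(h, h2, q_hat):
    """Laplace approximation to log of integral of exp(h(q)) dq,
    expanded around the interior mode q_hat (where h'(q_hat) = 0)."""
    return h(q_hat) + 0.5 * math.log(2 * math.pi / -h2(q_hat))

# Example integrand: q^(a-1) (1-q)^(b-1) on (0, 1); exact integral B(a, b).
a, b = 50.0, 50.0
h = lambda q: (a - 1) * math.log(q) + (b - 1) * math.log(1 - q)
h2 = lambda q: -(a - 1) / q**2 - (b - 1) / (1 - q)**2   # second derivative
q_hat = (a - 1) / (a + b - 2)                           # mode of integrand

approx = laplace_log_integral(h, h2, q_hat)
exact = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
```

For this integrand the approximation is accurate to about one percent on the log scale, which illustrates why Laplace approximations are attractive when the exact marginal is unavailable, as under the multiplicative prior.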
6.1 Simulation I: Illustration of Posterior Ratio Consistency
In this section, we illustrate the consistency results in Theorem 4.1 and Theorem 5.2 using a simulation experiment. Our goal is to show that the log of the posterior ratio of any "non-true" model compared to the true model converges to negative infinity. To serve this purpose, we consider a range of values of $p$ and choose $n$ accordingly. Next, for each fixed $p$, a lower triangular matrix with unit diagonal entries and suitable off-diagonal entries is constructed. In particular, unlike previous work (Cao, Khare, and Ghosh, 2019), where the expected number of nonzero entries in each column of the Cholesky factor does not exceed 3, here we randomly chose 3% or 5% of the lower triangular entries of the Cholesky factor and set them to 0.5; the remaining entries were set to zero.
The purpose of this setting is to show that our consistency requires more relaxed sparsity assumptions on the true model than (Cao, Khare, and Ghosh, 2019). We refer to this matrix as $L_0$. The matrix $L_0$ also reflects the true underlying DAG structure encoded in $\gamma_0$. Next, we generate $n$ i.i.d. observations from the corresponding multivariate Gaussian distribution, and set the hyperparameters so that all the assumptions are satisfied. We then examine posterior ratio consistency in four different cases by computing the log posterior ratio of a "non-true" model $\gamma$ against $\gamma_0$, as follows.

Case : Model is a submodel of , and the total number of nonzero entries in is exactly half that of , i.e. .

Case : is a submodel of , and the total number of nonzero entries in is exactly twice that of , i.e. .

Case : is not necessarily a submodel of , but the total number of nonzero entries in is half the number of nonzero entries in .

Case : is not necessarily a submodel of , but the total number of nonzero entries in is twice the number of nonzero entries in .
The log posterior probability ratios for the various cases, under the two different sparsity settings and our two different priors, are provided in Figure 1. As expected, the log posterior probability ratio decreases to large negative numbers as becomes large in all four cases, in both sparsity settings, and under both sparsity priors, thereby providing a numerical illustration of Theorem 4.1.
We would like to point out that in (Cao, Khare, and Ghosh, 2019) the log posterior ratios are almost all positive when and the expected number of nonzero entries in each column of does not exceed 3, which indicates that the hierarchical model with the DAG-Wishart distribution and an Erdos-Renyi type prior over graphs performs well only in considerably higher-dimensional and much sparser settings. This points to a potential drawback of using the DAG-Wishart distribution coupled with an Erdos-Renyi type prior on the Cholesky space: in real applications, extremely high-dimensional and sparse data sets are not very common, whereas our spike and slab Cholesky prior with the beta-mixture or multiplicative prior is more adaptable in this respect.
6.2 Simulation II: Illustration of Model Selection
In this section, we perform a simulation experiment to illustrate the potential advantages of using our Bayesian model selection approach. We consider values of ranging from to , with . For each fixed , the Cholesky factor of the true concentration matrix, and the corresponding dataset, are generated by the same mechanism as in Section 6.1. Then, we perform model selection on the Cholesky factor using the four procedures outlined below.

Lasso-DAG with quantile-based tuning: We implement the Lasso-DAG approach of (Shojaie and Michailidis, 2010) by choosing penalty parameters (separate for each variable ) given by , where denotes the quantile of the standard normal distribution. This choice is justified in (Shojaie and Michailidis, 2010) based on asymptotic considerations.

ESC Metropolis-Hastings algorithm: We implement the Rao-Blackwellized Metropolis-Hastings algorithm for the ESC prior introduced in (Lee, Lee, and Lin, 2018) to explore the space of Cholesky factors. The hyperparameters and initial states are taken as suggested in (Lee et al., 2018). Each MCMC chain for each row of the Cholesky factor runs for 5000 iterations with a burn-in period of 2000. All active components in with inclusion probability larger than 0.5 are selected. We would like to point out that since the Metropolis-Hastings algorithm must be executed separately for each row of , the procedure can be extremely time-consuming, especially in higher dimensions.
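The final selection step of this procedure (discarding burn-in draws, averaging the binary inclusion indicators, and thresholding at 0.5) can be sketched as follows; the function name and array layout are our own.

```python
import numpy as np

def select_by_inclusion(gamma_samples, burn_in=2000, cut=0.5):
    """Given MCMC draws of binary inclusion indicators for one row of the
    Cholesky factor (shape: n_iterations x row_length), drop the burn-in,
    estimate posterior inclusion probabilities by averaging, and keep
    the entries whose probability exceeds `cut` (0.5, as in the text)."""
    kept = gamma_samples[burn_in:]
    probs = kept.mean(axis=0)          # Monte Carlo inclusion probabilities
    return probs > cut, probs
```

Running this once per row of the Cholesky factor is what makes the ESC approach expensive: each row requires its own full MCMC chain before this cheap final step.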

DAG-Wishart log-score path search: The hierarchical DAG-Wishart prior (Cao et al., 2019) also gives a closed form for the marginal posterior up to a constant. In particular,
where is the normalizing constant of the DAG-Wishart distribution and
with , where . Following the simulation procedure in previous work (Cao et al., 2019), we set the hyperparameters as and for , and generate candidate graphs by thresholding the modified Cholesky factor of (where is the sample covariance matrix) on a grid from 0.1 to 0.5 in steps of 0.0001, yielding a sequence of graphs. The log posterior probabilities are computed for all candidate graphs, and the graph with the highest probability is chosen. As discussed previously, we will see in Figure 2 that for the previous DAG-Wishart model we always end up choosing the sparsest estimator, since the graph obtained at the thresholding value 0.5 always has the highest log posterior score. Hence, although this choice guarantees model selection consistency, it causes the posterior to concentrate on very small models, and we are unable to detect the true model.
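The thresholding path search itself can be sketched as follows. The marginal log posterior is left as a user-supplied callback, since its closed form (given above up to a normalizing constant) is model-specific; the function and argument names are ours.

```python
import numpy as np

def threshold_path(chol_factor, log_post, grid=None):
    """Sketch of the log-score path search: threshold the modified
    Cholesky factor of the inverted sample covariance on a grid of
    cutoffs, collect the induced sparsity patterns (candidate DAGs),
    score each with the supplied marginal log posterior, and return
    the highest-scoring pattern."""
    if grid is None:
        grid = np.arange(0.1, 0.5001, 0.0001)  # grid from the text
    best, best_score = None, -np.inf
    seen = set()
    for t in grid:
        support = np.abs(chol_factor) > t       # candidate sparsity pattern
        key = support.tobytes()
        if key in seen:                         # many cutoffs induce the same graph
            continue
        seen.add(key)
        score = log_post(support)
        if score > best_score:
            best, best_score = support, score
    return best, best_score
```

Deduplicating patterns matters in practice: the 4001-point grid typically induces far fewer distinct graphs, so each marginal posterior is evaluated only once per graph.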

Spike and slab Cholesky with beta-mixture/multiplicative prior: For our Bayesian approach with the spike and slab Cholesky prior and the beta-mixture/multiplicative prior on the sparsity pattern of , we adopt a procedure similar to the DAG-Wishart log-score path search. We construct two candidate sets as follows.

All the Cholesky factors corresponding to the graphs on the solution paths for Lasso-DAG, CSCS, and DAG-Wishart are included in our Cholesky factor candidate set.

To broaden the search range, we also generate additional graphs by thresholding the modified Cholesky factor of (where is the sample covariance matrix) on a grid from 0.1 to 0.5 in steps of 0.0001, yielding a sequence of additional Cholesky factors, which we include in the candidate set. We then search around all the above candidates using the Shotgun Stochastic Search algorithm of (Shin et al., 2018) to generate further candidate Cholesky factors. In particular, the authors in (Shin et al., 2018) claim that the simplified algorithm can significantly reduce the runtime and improve model selection performance.
The log posterior probabilities are computed for all Cholesky factors in the candidate sets using (5.1), and the one with the highest probability is chosen. In Figure 2, we plot the log marginal posterior densities under the spike and slab Cholesky prior with the multiplicative/beta-mixture prior for all Cholesky factors at different thresholding values, compared with the marginal posteriors under the previous DAG-Wishart model. Unlike the DAG-Wishart distribution, which always favors the sparsest Cholesky factor corresponding to the largest thresholding value, the maximum log posterior score for our proposed models occurs in the middle of the curve, which leads to the significant improvement in model selection results shown in Table 1 and Table 2.

The model selection performance of these four methods is then compared using several measures of structure recovery: positive predictive value, true positive rate, and Matthews correlation coefficient (averaged over independent repetitions). Positive predictive value (PPV) is the proportion of true nonzero entries among all entries detected by the given procedure, and true positive rate (TPR) is the proportion of nonzero entries in the true model that are detected by the given procedure. PPV and TPR are defined as
PPV = TP / (TP + FP),   TPR = TP / (TP + FN).
The Matthews correlation coefficient (MCC) is commonly used to assess the performance of binary classification methods and is defined as
MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)),
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Note that MCC ranges from -1 to 1, with larger values corresponding to better fits (-1 and 1 represent the worst and best fits, respectively). Similarly, one would like the PPV and TPR values to be as close to 1 as possible. The results are provided in Table 1 and Table 2, corresponding to the two true sparsity levels. In Figure 4, we show a heatmap comparison between the true and estimated using our Bayesian spike and slab Cholesky approach under the two sparsity levels when .
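These three metrics are straightforward to compute from the true and estimated sparsity patterns; the following is a direct implementation of the standard definitions above (the function name and argument layout are ours).

```python
import numpy as np

def structure_metrics(true_support, est_support):
    """Compare estimated vs. true sparsity patterns (boolean arrays over
    the strictly lower-triangular entries) and return (PPV, TPR, MCC)."""
    t = np.asarray(true_support, dtype=bool).ravel()
    e = np.asarray(est_support, dtype=bool).ravel()
    tp = np.sum(t & e)      # true positives
    tn = np.sum(~t & ~e)    # true negatives
    fp = np.sum(~t & e)     # false positives
    fn = np.sum(t & ~e)     # false negatives
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = float(tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, tpr, mcc
```

Guarding against zero denominators matters here: a procedure that selects no entries at all (or an all-zero true model) would otherwise divide by zero.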
            | Lasso-DAG        | ESC              | DAGW             | SSCB             | SSCM
            | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC
 300   100  | 0.2  0.2  0.19   | 0.17 0.43 0.26   | 0.99 0.3  0.55   | 0.73 0.85 0.78   | 0.98 0.69 0.82
 600   200  | 0.15 0.18 0.16   | 0.15 0.52 0.27   | 0.99 0.31 0.55   | 0.69 0.92 0.79   | 0.89 0.82 0.85
 900   300  | 0.15 0.20 0.17   | 0.12 0.54 0.24   | 1    0.33 0.57   | 0.62 0.93 0.76   | 0.83 0.87 0.84
1200   400  | 0.11 0.17 0.14   | 0.08 0.52 0.21   | 1    0.33 0.58   | 0.61 0.94 0.76   | 0.78 0.90 0.84
1500   500  | 0.12 0.21 0.16   | 0.06 0.45 0.20   | 1    0.33 0.58   | 0.56 0.96 0.73   | 0.71 0.93 0.81
            | Lasso-DAG        | ESC              | DAGW             | SSCB             | SSCM
            | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC    | PPV  TPR  MCC
 300   100  | 0.19 0.1  0.13   | 0.14 0.33 0.19   | 0.99 0.3  0.54   | 0.66 0.81 0.73   | 0.99 0.43 0.65
 450   150  | 0.12 0.09 0.1    | 0.11 0.35 0.18   | 1    0.29 0.53   | 0.63 0.86 0.73   | 0.93 0.72 0.82
 600   200  | 0.12 0.09 0.1    | 0.10 0.38 0.18   | 1    0.3  0.55   | 0.57 0.89 0.71   | 0.87 0.80 0.83
 750   250  | 0.09 0.08 0.08   | 0.08 0.36 0.16   | 1    0.31 0.55   | 0.59 0.9  0.72   | 0.80 0.86 0.83
 900   300  | 0.11 0.09 0.09   | 0.05 0.31 0.13   | 0.99 0.31 0.55   | 0.56 0.92 0.72   | 0.77 0.87 0.82
It is clear that our hierarchical fully Bayesian approach with the beta-mixture prior and the multiplicative prior outperforms the penalized likelihood approaches and the Bayesian DAG-Wishart and ESC approaches on almost all measures. The PPV values for our Bayesian spike and slab Cholesky approach are all above , while those for the penalized likelihood approach and ESC are below . Though the PPV for the DAG-Wishart approach is almost 1, this is a consequence of the maximized log score occurring at the sparsest model: the precision (PPV) for the DAG-Wishart method is high because the resulting is extremely sparse and the few remaining nonzero entries are true elements of . The TPR values for the proposed approaches are almost all above , while those for the penalized likelihood approaches are all below . Under this measure, again as a result of its overly sparse final estimator, the DAG-Wishart Bayesian approach performs very poorly compared to the spike and slab approach with the beta-mixture/multiplicative prior. For the most comprehensive measure, MCC, our fully Bayesian approach outperforms the other three methods for all values of and both sparsity levels.
It is also instructive to compare the computational runtime of the different methods. In Figure 3, we plot the runtime comparison between our spike and slab Cholesky approach with the beta-mixture/multiplicative prior and ESC. Since the marginal posterior is available in closed form (up to a constant) for the SSC with the beta-mixture prior, the runtime for SSCB via thresholding coupled with stochastic search is significantly lower than that of the MCMC approach. The computational cost of ESC is extremely high in the sense that it requires not only additional runtime but also more memory (more than 30GB when ). On the other hand, for the multiplicative prior, though its model selection performance is almost the best among all competitors, the extra Laplace approximation step for calculating each posterior probability makes the computational burden substantial as increases.
Overall, this experiment illustrates that the proposed hierarchical fully Bayesian approach with our spike and slab Cholesky prior and the beta-mixture prior enables a broader yet computationally feasible model search, while our spike and slab Cholesky prior with the multiplicative prior, though more computationally expensive, can lead to a significantly greater improvement in model selection performance for estimating the sparsity pattern of the Cholesky factor and the underlying DAG.