Introduction and Motivation
Probabilistic models have progressed to the point where rich generative models can be built. They have been extended with neural networks, implicit densities, and scalable algorithms that bring Bayesian inference to very large datasets. However, most of these models focus on capturing statistical relationships rather than causal relationships. Causal models give us a sense of how manipulating the generative process could change the final results.
Genome-wide association studies (GWAS) are an example of a problem built on causal relationships. Specifically, GWAS aim to figure out how genetic factors cause disease in humans. The genetic factors here are single nucleotide polymorphisms (SNPs), and having a particular disease is treated as a trait, i.e., the outcome. To understand why a disease develops and how to cure it, the causal relationship between SNPs and diseases is of interest: first, predict which SNP or SNPs cause the disease; second, target the selected SNPs to cure the disease.
This paper deals with two questions. The first is how to build rich causal models that meet the specific needs of GWAS. In general, probabilistic causal models involve a function <math>f</math> and a noise term. For simplicity, <math>f</math> is usually assumed to be a linear model with Gaussian noise. However, evidence has shown that in GWAS it is necessary to accommodate nonlinearity and interactions between multiple genes in the models.
The second contribution of this paper is that it addresses the problem caused by latent confounders. Latent confounders are a problem when applying causal models because we can neither observe them nor know their underlying structure. The paper develops implicit causal models that can adjust for confounders.
A growing body of work on causal models focuses on causal discovery and typically makes strong assumptions, such as Gaussian processes on the noise variable or nonlinearities for the main function.
Implicit Causal Models
Implicit causal models are an extension of probabilistic causal models. Probabilistic causal models will be introduced first.
Probabilistic Causal Models
Probabilistic causal models represent each variable as a deterministic function of two parts: noise and other variables. Consider a global variable <math>\beta</math> and noise <math>\epsilon</math>. Each of <math>\beta</math> and <math>x</math> is a function of noise; <math>y</math> is a function of noise and <math>x</math>.
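The target is the causal mechanism <math>f_y</math>, so that the causal effect <math>p(y \mid \mathrm{do}(X = x), \beta)</math> can be calculated. Here <math>\mathrm{do}(X = x)</math> means that we force <math>X</math> to the value <math>x</math> under the fixed structure <math>\beta</math>. Following earlier work, it is assumed that <math>p(y \mid \mathrm{do}(x), \beta) = p(y \mid x, \beta)</math>.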
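Written out, the structure described above takes roughly the following form. This is a reconstruction in the notation of Tran and Blei (2017), not text recovered from this page: the noise source <math>s(\cdot)</math>, the mechanism names <math>f_\beta</math> and <math>f_x</math>, and the dependence of <math>x_n</math> on <math>\beta</math> are assumptions.

```latex
\begin{aligned}
\beta &= f_\beta(\epsilon_\beta), & \epsilon_\beta &\sim s(\cdot),\\
x_n &= f_x(\epsilon_{x,n}, \beta), & \epsilon_{x,n} &\sim s(\cdot),\\
y_n &= f_y(\epsilon_{y,n}, x_n, \beta), & \epsilon_{y,n} &\sim s(\cdot),\\
p(y \mid \mathrm{do}(X = x), \beta) &= p(y \mid x, \beta).
\end{aligned}
```

The last line is the identification assumption: once the shared structure <math>\beta</math> is fixed, intervening on <math>X</math> and conditioning on <math>X</math> give the same distribution over <math>y</math>.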
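An example of a probabilistic causal model is the additive noise model, in which the outcome is a function <math>f(\cdot)</math> of its inputs plus a noise term <math>\epsilon</math>.
<math>f(\cdot)</math> is usually a linear function, or spline functions for nonlinearities. <math>\epsilon</math> is assumed to be standard normal, so the induced density of <math>y</math> is normal as well. The posterior over the parameters <math>\theta</math> of <math>f</math> is then proportional to the likelihood times the prior <math>p(\theta)</math>, which is known. Variational inference or MCMC can then be applied to calculate the posterior distribution.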
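As a concrete instance, the sketch below simulates an additive noise model with a linear function and standard normal noise, then computes the posterior of the weights in closed form. The zero-mean Gaussian prior, the function names `simulate_additive_noise` and `gaussian_posterior`, and the toy dimensions are assumptions made for illustration; for richer choices of the function, the summary above points to variational inference or MCMC instead.

```python
import numpy as np

def simulate_additive_noise(x, theta, rng):
    """Draw y_n = x_n . theta + eps_n with eps_n ~ N(0, 1) (linear f, additive noise)."""
    return x @ theta + rng.standard_normal(x.shape[0])

def gaussian_posterior(x, y, prior_var=1.0):
    """Posterior N(mean, cov) of theta under the assumed prior N(0, prior_var * I)
    and unit noise variance (standard Bayesian linear regression)."""
    d = x.shape[1]
    precision = x.T @ x + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ x.T @ y
    return mean, cov

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 3))        # toy inputs, not real SNP data
theta_true = np.array([0.5, -1.0, 2.0])   # hypothetical true weights
y = simulate_additive_noise(x, theta_true, rng)
post_mean, post_cov = gaussian_posterior(x, y)
```

With a few thousand samples the posterior mean sits close to the weights used for simulation, and the posterior covariance shrinks roughly as one over the sample size.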
Implicit Causal Models
The difference between implicit causal models and probabilistic causal models lies in the noise variable. Instead of adding a noise term, implicit causal models feed the noise <math>\epsilon</math> directly into a neural network, which outputs <math>x</math>.
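To make the idea concrete, here is a minimal sketch of an implicit mechanism in which the noise is an input to a small fully connected network rather than a term added to its output. The layer sizes, the random (untrained) weights, and the helper names `init_mlp`, `mlp`, and `implicit_mechanism` are all hypothetical; this is not the authors' implementation.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Forward pass with ReLU on every layer except the last."""
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b

def implicit_mechanism(params, parents, rng):
    """Sample an output by passing [parents, noise] through the network."""
    eps = rng.standard_normal((parents.shape[0], 1))
    return mlp(params, np.concatenate([parents, eps], axis=1))[:, 0]

rng = np.random.default_rng(0)
parents = rng.standard_normal((5, 3))      # toy parent variables
params = init_mlp([4, 32, 32, 1], rng)     # 3 parents + 1 noise input
sample = implicit_mechanism(params, parents, rng)
```

Because the noise passes through the nonlinearity, the induced distribution of the output has no closed-form density in general, which is why the model is called implicit.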
The causal diagram has changed to:
They used fully connected neural networks with a fair number of hidden units to approximate each causal mechanism. Below is the formal description:
Implicit Causal Models with Latent Confounders
Previously, the global structure was assumed to be observed. Next, the unobserved scenario is considered.
Causal Inference with a Latent Confounder
As before, the interest is the causal effect <math>p(y \mid \mathrm{do}(x_m), x_{-m})</math>. Here, the SNPs other than <math>x_m</math> are also taken into account. However, the effect is confounded by the unobserved confounder <math>z_n</math>. As a result, the standard inference method cannot be used in this case.
The paper proposed a new method that includes the latent confounders, defined for each subject <math>n = 1, \ldots, N</math> and each SNP <math>m = 1, \ldots, M</math>.
The mechanism for the latent confounder <math>z_n</math> is assumed to be known. SNPs depend on the confounders, and the trait depends on all the SNPs as well as the confounders.
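The posterior of the mechanism parameters <math>\theta</math> needs to be calculated in order to estimate the mechanism <math>f_y</math> as well as the causal effect <math>p(y \mid \mathrm{do}(x_m), x_{-m})</math>, so that it can be explained how changes to each SNP <math>X_m</math> cause changes to the trait <math>Y</math>.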
Note that the latent structure is assumed known.
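The dependencies described in this subsection can be summarized by structural equations of the following shape. This is a reconstruction in the notation of Tran and Blei (2017), not text recovered from this page; the known confounder mechanism <math>s(\cdot)</math> and the noise subscripts are assumptions.

```latex
\begin{aligned}
z_n &= s(\epsilon_{z,n}),\\
x_{nm} &= f_x(\epsilon_{x,nm}, z_n),\\
y_n &= f_y(\epsilon_{y,n}, x_{n,1:M}, z_n).
\end{aligned}
```

Here <math>x_{n,1:M}</math> denotes all <math>M</math> SNPs of subject <math>n</math>, so the trait mechanism sees every SNP together with the confounder.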
Implicit Causal Model with a Latent Confounder
This section presents the algorithm and functions for implementing an implicit causal model for GWAS.
Generative Process of Confounders <math>z_n</math>
The distribution of the confounders is set to a standard normal, with <math>z_n \in \mathbb{R}^K</math>, where <math>K</math> is the dimension of <math>z_n</math>; <math>K</math> should be chosen so that the latent space is as close as possible to the true population structure.
Generative Process of SNPs <math>x_{nm}</math>
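Each SNP <math>x_{nm}</math> is coded as 0, 1, or 2, counting the copies of the minor allele.
The authors defined a <math>\mathrm{Binomial}(2, \pi_{nm})</math> distribution on <math>x_{nm}</math> and used logistic factor analysis to model the SNP matrix.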
A SNP matrix looks like this: [Figure: SNP matrix]
Since logistic factor analysis makes strong assumptions, the paper suggests using a neural network to relax them.
This renders the output a full <math>N \times M</math> matrix because of the per-SNP variables <math>w_m</math>, which act like principal components in PCA.
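The two ways of generating SNPs can be contrasted in a short sketch. It assumes, as in logistic factor analysis, that each SNP count is Binomial with two trials and a logit link on the inner product of a per-individual vector and a per-SNP vector; the neural variant replaces the inner product with a small network applied to the concatenated pair. The function names, the single hidden layer, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def snps_logistic_factor(z, w, rng):
    """x_nm ~ Binomial(2, sigmoid(z_n . w_m)); returns an (N, M) matrix of 0/1/2."""
    pi = sigmoid(z @ w.T)
    return rng.binomial(2, pi)

def snps_neural(z, w, hidden_w, out_w, rng):
    """Same likelihood, but the logit is a one-hidden-layer network of [z_n, w_m]."""
    n, m = z.shape[0], w.shape[0]
    # Row n * m + j of `pairs` holds the concatenation [z_n, w_j].
    pairs = np.concatenate([np.repeat(z, m, axis=0), np.tile(w, (n, 1))], axis=1)
    hidden = np.maximum(pairs @ hidden_w, 0.0)
    logits = (hidden @ out_w).reshape(n, m)
    return rng.binomial(2, sigmoid(logits))

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 2))            # confounders for N = 6 individuals, K = 2
w = rng.standard_normal((4, 2))            # per-SNP variables for M = 4 SNPs
x_lfa = snps_logistic_factor(z, w, rng)
hidden_w = rng.standard_normal((4, 8))     # input is 2K = 4 wide
out_w = rng.standard_normal((8, 1))
x_nn = snps_neural(z, w, hidden_w, out_w, rng)
```

Both functions return a full individuals-by-SNPs matrix; only the map from the latent pair to the allele probability differs.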
Generative Process of Traits <math>y_n</math>
Previously, each trait was modeled by a linear regression.
This also makes very strong assumptions about SNPs, interactions, and additive noise. It can likewise be replaced by a neural network that outputs only a scalar.
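A minimal sketch of the two trait models follows. The linear version adds Gaussian noise to a weighted sum of SNPs and confounders; the neural version feeds SNPs, confounders, and noise together into a network with a scalar output, so interactions and non-additive noise are possible. The exact form of the linear baseline, the function names, the single hidden layer, and the random weights are assumptions for illustration.

```python
import numpy as np

def trait_linear(x, z, beta_x, beta_z, rng):
    """Linear regression trait: additive effects of SNPs and confounders plus noise."""
    return x @ beta_x + z @ beta_z + rng.standard_normal(x.shape[0])

def trait_neural(x, z, w1, w2, rng):
    """Scalar-output network of [SNPs, confounders, noise]; noise enters as an input."""
    eps = rng.standard_normal((x.shape[0], 1))
    hidden = np.maximum(np.concatenate([x, z, eps], axis=1) @ w1, 0.0)
    return (hidden @ w2)[:, 0]

rng = np.random.default_rng(0)
x = rng.binomial(2, 0.3, size=(5, 10)).astype(float)   # toy SNP matrix, N = 5, M = 10
z = rng.standard_normal((5, 2))                        # toy confounders, K = 2
y_lin = trait_linear(x, z, rng.standard_normal(10), rng.standard_normal(2), rng)
w1 = rng.standard_normal((13, 16))                     # 10 SNPs + 2 confounders + 1 noise
w2 = rng.standard_normal((16, 1))
y_nn = trait_neural(x, z, w1, w2, rng)
```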
Likelihood-free Variational Inference
Calculating the posterior of <math>\theta</math> is the key to applying the implicit causal model with latent confounders.
This posterior can be reduced to a simpler form.
However, with implicit models, integrating over a nonlinear function is difficult. The authors therefore applied likelihood-free variational inference (LFVI). LFVI posits a family of distributions over the latent variables. Here the variables <math>w_m</math> and <math>z_n</math> are all assumed to be Normal.
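The Normal variational family mentioned above can be sketched as follows. This shows only the family itself, with reparameterized sampling and a log density; it does not implement the LFVI objective, and the parameter layout (a mean and a log standard deviation per coordinate) and function names are assumptions.

```python
import numpy as np

def init_normal_family(shape):
    """Fully factorized Normal with one mean and one log std per coordinate."""
    return {"mu": np.zeros(shape), "log_sigma": np.zeros(shape)}

def sample(q, rng):
    """Reparameterized draw: mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(q["mu"].shape)
    return q["mu"] + np.exp(q["log_sigma"]) * eps

def log_density(q, value):
    """Sum of per-coordinate Normal log densities at `value`."""
    sigma = np.exp(q["log_sigma"])
    standardized = (value - q["mu"]) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi) - q["log_sigma"] - 0.5 * standardized ** 2)

rng = np.random.default_rng(0)
q_z = init_normal_family((6, 2))   # one K = 2 vector per individual
z_draw = sample(q_z, rng)
```

Writing the draw as a deterministic function of the parameters and independent noise is what allows gradients to flow to the mean and scale during optimization.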
Empirical Study
The authors performed simulations with 100,000 SNPs and 940 to 5,000 individuals, across 100 replications of 11 settings. Four methods were compared:
- implicit causal model (ICM);
- PCA with linear regression (PCA);
- a linear mixed model (LMM);
- logistic factor analysis with inverse regression (GCAT).
The feedforward neural networks for traits and SNPs are fully connected, with two hidden layers using the ReLU activation function and batch normalization.
Simulation Study
Based on real genomic data, a true model is used to generate the SNPs and traits for each configuration. Four datasets are used in this simulation study:
1. HapMap [Balding-Nichols model]
2. 1000 Genomes Project (TGP) [PCA]
3a. Human Genome Diversity project (HGDP) [PCA]
3b. HGDP [Pritchard-Stephens-Donnelly model]
4. A latent spatial position of individuals for population structure [spatial]
The table shows the prediction accuracy. The accuracy is calculated as the number of true positives divided by the number of true positives plus false positives. True positives are the SNPs that are correctly identified as having a causal relation with the trait. In contrast, false positives are SNPs that are identified as having a causal relation with the trait when they do not. The closer the rate is to 1, the better the model, since false positives are wrong predictions.
The results show that the implicit causal model has the best performance among these four models in every situation. In particular, the other models tend to do poorly on PSD and Spatial when <math>a</math> is small, whereas the ICM achieved a significantly higher rate. The only method comparable to ICM is GCAT, on the simpler configurations.
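The accuracy measure described in this section, true positives over true positives plus false positives, is what is usually called precision. A minimal sketch of the computation is given below; the function name and the toy SNP indices are hypothetical, and returning 0.0 when nothing is selected is a convention chosen here, not one taken from the paper.

```python
def precision(selected, causal):
    """Fraction of selected SNPs that are truly causal: TP / (TP + FP)."""
    selected, causal = set(selected), set(causal)
    if not selected:
        return 0.0  # convention for an empty selection
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    return true_pos / (true_pos + false_pos)

# Four SNPs flagged by a method, of which two are among the causal ones.
rate = precision(selected=[1, 2, 3, 4], causal=[2, 4, 9])
```

Note that a causal SNP the method misses (index 9 here) does not lower this rate; only wrongly flagged SNPs do.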
Real-data Analysis
They also applied ICM to a real-world GWAS of the Northern Finland Birth Cohorts, which contains 324,160 SNPs and 5,027 individuals. Ten implicit causal models were fitted, one per trait, and two neural networks, each with two hidden layers, were used for the SNPs and the trait.
The numbers in the above table are the numbers of significant loci for each of the 10 traits. The numbers for the other methods, namely GCAT, LMM, PCA, and "uncorrected", are taken from other papers. By comparison, the ICM reached the level of the best previous model for each trait.
Conclusion
This paper introduced implicit causal models to account for complex nonlinear causal relationships, and applied the method to GWAS. The method can not only capture important interactions between genes within an individual and at the population level, but can also adjust for latent confounders by incorporating latent variables into the model.
In the simulation study, the authors showed that the implicit causal model could beat other methods by 15-45.3% on a variety of datasets with varying parameters.
The authors also believe this GWAS application is only the start of the usage of implicit causal models, which might also be used in physics or economics.
Critique
I think this paper is an interesting and novel work. Its main contribution is to connect statistical genetics with machine learning methodology. The method is technically sound and does indeed generalize techniques currently used in statistical genetics.
The neural network used in this paper is a very simple feedforward network with two hidden layers, but the idea of where to use the neural network is crucial and might be significant in GWAS.
The paper has limitations as well. The empirical example is too easy and far from a realistic situation. Although the simulation study showed some competitive results, the application to the Northern Finland Birth Cohort data did not demonstrate that the implicit causal model is better than previous methods such as GCAT or LMM.
Another limitation concerns linkage disequilibrium, as the authors also state. SNPs are not completely independent of each other; they are usually correlated when the alleles are at nearby loci. The authors did not consider this more complex case; they only considered the simplest case, in which all the SNPs are assumed independent.
Furthermore, a single SNP may not have enough power to explain the causal relationship. Recent papers indicate that causation of a trait may involve multiple SNPs. This could be future work as well.
References
Dustin Tran and David M Blei. Implicit causal models for genome-wide association studies. arXiv preprint arXiv:1710.10742, 2017.
Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Neural Information Processing Systems, 2009.
Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.
Minsun Song, Wei Hao, and John D Storey. Testing for genetic associations in arbitrarily structured populations. Nature Genetics, 47(5):550–554, 2015.
Dustin Tran, Rajesh Ranganath, and David M Blei. Hierarchical implicit models and likelihood-free variational inference. In Neural Information Processing Systems, 2017.