
Bayesian Model Comparison
[Figure: Bayes' theorem, a fundamental concept in Bayesian statistics, used to update the probability of a hypothesis as more evidence or information becomes available.]

Key concepts

  • Bayes factors: A ratio of marginal likelihoods that quantifies the evidence in favor of one model compared to another.
  • Marginal likelihood (evidence): The probability of the observed data given a model, integrated over all possible parameter values.
  • Posterior model probability: The probability that a model is true given the observed data and prior information.
  • Information criteria: Approximations to Bayes factors, e.g., BIC, AIC, DIC.
  • Predictive accuracy: How well a model predicts new or unseen data, often assessed through cross-validation or WAIC.
  • Model averaging: Combining predictions from multiple models, weighted by their posterior probabilities or predictive performance.

Bayesian model comparison is the comparison of statistical models, by how well they fit the data, using Bayesian statistics. It is used for diverse tasks such as variable selection in regression, determining the number of components in a mixture model, and choosing among parametric families. The goal of model comparison may be to select a single "best" model, or to improve estimation via model averaging, in which expectation values from different models are weighted by the models' posterior probabilities.

Common methods for Bayesian model comparison include:

  • Separate estimation: Comparing models through posterior predictive distributions, Bayes factors, and approximations like BIC and DIC.
  • Comparative estimation: Assessing the "distance" between posterior distributions using measures like Kullback-Leibler divergence.
  • Simultaneous estimation: Exploring the model space using techniques like RJMCMC or BDMCMC.

Setup


The Bayesian evidence, or marginal likelihood, of a model M is the average likelihood of observing the data D under the prior distribution of the model parameters θ:

  p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ

When comparing two models, M₁ and M₂, the Bayes factor is the ratio of their evidences:

  B₁₂ = p(D | M₁) / p(D | M₂)

A Bayes factor greater than 1 favors M₁, while a value less than 1 favors M₂. The magnitude of the Bayes factor reflects the strength of evidence, often interpreted using Jeffreys' scale.
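
The evidence and Bayes factor have closed forms in simple conjugate settings. As a minimal sketch (the coin data below are invented for illustration), compare a fair-coin model against a uniform-bias model:

```python
from math import comb

# Toy Bayes factor: did a coin that showed 14 heads in 20 flips come from
# M1 (fair coin, p = 0.5, no free parameters) or M2 (unknown bias,
# p ~ Uniform(0, 1))? Both evidences are available in closed form.
n, k = 20, 14

# Evidence under M1: the binomial likelihood evaluated at p = 0.5.
evidence_m1 = comb(n, k) * 0.5**n

# Evidence under M2: integrating C(n,k) p^k (1-p)^(n-k) over p in [0,1]
# (a Beta integral) gives exactly 1/(n+1).
evidence_m2 = 1.0 / (n + 1)

bayes_factor = evidence_m1 / evidence_m2   # B12 = p(D|M1) / p(D|M2)
print(f"B12 = {bayes_factor:.3f}")
```

Here B₁₂ ≈ 0.78 < 1: fourteen heads in twenty flips mildly favor the biased-coin model, but the evidence is far from decisive on Jeffreys' scale.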

Generally, the prior probability is chosen to quantify Occam's razor. A model with many free parameters will generally fit the data better, but it may overfit and perform poorly on new, unseen data. This can be quantified by choosing a prior distribution that decreases with model parameter count.

Bayesian complexity measures the effective number of parameters that the data can support, accounting for parameters that are left unconstrained by the data.[1]

Instead of choosing a single "best" model, Bayesian model averaging (BMA) combines predictions from multiple models, weighted by their posterior probabilities. This approach acknowledges uncertainty about the true model, incorporating it into the final inference.
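
A minimal sketch of model averaging, reusing the toy coin comparison (fair coin vs. uniform bias, invented data) with equal prior model probabilities:

```python
from math import comb

# Bayesian model averaging: predict the probability that the NEXT flip
# is heads after observing 14 heads in 20 flips, averaging M1 (fair coin)
# and M2 (p ~ Uniform(0,1)) weighted by posterior model probability.
n, k = 20, 14
evidence_m1 = comb(n, k) * 0.5**n   # closed-form evidence of M1
evidence_m2 = 1.0 / (n + 1)         # closed-form evidence of M2

# Equal model priors, so posterior weights are proportional to evidence.
w1 = evidence_m1 / (evidence_m1 + evidence_m2)
w2 = 1.0 - w1

pred_m1 = 0.5                   # M1 always predicts one half
pred_m2 = (k + 1) / (n + 2)     # posterior predictive under M2 (rule of succession)

bma_pred = w1 * pred_m1 + w2 * pred_m2
print(f"P(next flip = heads) = {bma_pred:.3f}")
```

The averaged prediction (≈0.60) sits between the two single-model predictions (0.50 and ≈0.68), reflecting uncertainty about which model is correct.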

Bayesian stacking, a more recent technique, weights models based on their out-of-sample predictive performance, using the entire dataset for model fitting. This method relaxes the assumption that the true model is within the set of candidate models.

Approximations


Calculating the Bayesian evidence involves a multi-dimensional integral that is often computationally demanding. Several approximation methods exist, including:

  • Laplace approximation: Approximates the integrand by a Gaussian centred on the posterior mode, simplifying the evidence integral.
  • Thermodynamic integration (simulated annealing): A numerical integration technique for complex likelihoods.
  • Nested sampling: Recasts the multi-dimensional integral into a simpler one-dimensional form.
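
As a sketch of the first method (toy numbers, one-parameter coin model as before): the Laplace approximation replaces the evidence integrand with a Gaussian centred at its mode, turning the integral into a closed form.

```python
from math import comb, pi, sqrt

# Laplace approximation to the evidence of a coin model with
# p ~ Uniform(0,1) after 14 heads in 20 flips; the exact answer
# is 1/(n+1), so the approximation error can be checked directly.
n, k = 20, 14

p_hat = k / n                       # mode of the integrand (prior density is 1)
f_hat = comb(n, k) * p_hat**k * (1 - p_hat)**(n - k)

# Curvature at the mode: -d^2/dp^2 log f = k/p^2 + (n-k)/(1-p)^2
curvature = k / p_hat**2 + (n - k) / (1 - p_hat)**2

# Gaussian integral: f(p_hat) * sqrt(2*pi / curvature)
evidence_laplace = f_hat * sqrt(2 * pi / curvature)
evidence_exact = 1.0 / (n + 1)
print(f"Laplace {evidence_laplace:.4f} vs exact {evidence_exact:.4f}")
```

Even with only 20 observations the approximation is within a few percent here; accuracy improves as the posterior becomes more nearly Gaussian.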

Information criteria


A family of approximations to the Bayes factor has been derived from information theory; these are collectively known as "information criteria". They rely on simplifying assumptions that may not be satisfied in practice.[2] The most popular ones are the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the deviance information criterion (DIC), and the widely applicable information criterion (WAIC).
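
A minimal sketch of how one such criterion is computed and used (toy coin data, invented for illustration), using the BIC, whose difference between models roughly tracks −2 log of the Bayes factor in large samples:

```python
from math import comb, log

# BIC = q*ln(n) - 2*ln(L_max), with q the number of fitted parameters.
# M1: fair coin (q = 0); M2: free bias p (q = 1, MLE p_hat = k/n).
n, k = 20, 14

loglik_m1 = log(comb(n, k) * 0.5**n)          # no parameters to fit
bic_m1 = 0 * log(n) - 2 * loglik_m1

p_hat = k / n
loglik_m2 = log(comb(n, k) * p_hat**k * (1 - p_hat)**(n - k))
bic_m2 = 1 * log(n) - 2 * loglik_m2           # ln(n) penalty for one parameter

print(f"BIC M1 = {bic_m1:.2f}, BIC M2 = {bic_m2:.2f}")   # lower is better
```

Despite the complexity penalty, M2 attains the (slightly) lower BIC here, in qualitative agreement with the exact Bayes factor for the same data.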

Predictive accuracy


Model evaluation focuses on a model's predictive capacity rather than its fit to the observed data. Techniques like cross-validation and leave-one-out cross-validation (LOO-CV) partition the data to assess a model's performance on unseen data, mitigating overfitting.

Pareto smoothed importance sampling LOO-CV (PSIS-LOO-CV) enhances computational efficiency and stability of LOO-CV, particularly for complex models.
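
A minimal sketch of exact LOO-CV on toy coin data (invented for illustration), scoring two models by the sum of log predictive densities for each held-out observation:

```python
from math import log

# Exact leave-one-out cross-validation: each flip is held out in turn
# and predicted from the remaining flips; higher total score is better.
flips = [1] * 14 + [0] * 6        # 14 heads, 6 tails
n = len(flips)

# M1: fair coin, predicts probability 0.5 for every held-out flip.
elpd_m1 = sum(log(0.5) for _ in flips)

# M2: uniform prior on the bias p; the posterior predictive for a
# held-out flip, given k heads among the remaining n-1 flips, is
# (k+1)/(n+1) for heads (Laplace's rule of succession).
elpd_m2 = 0.0
for x in flips:
    k_rest = sum(flips) - x
    p_heads = (k_rest + 1) / (n - 1 + 2)
    elpd_m2 += log(p_heads if x == 1 else 1 - p_heads)

print(f"elpd M1 = {elpd_m1:.2f}, elpd M2 = {elpd_m2:.2f}")   # higher is better
```

M2 scores higher out of sample here, consistent with the evidence-based comparison, even though it pays for an extra parameter.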

Separate estimation


Consider two models, M₁ and M₂. For prediction, a natural Bayesian approach compares models based on their posterior predictive distributions. Another approach involves comparing models using their posterior probabilities given the data. Using Bayes' rule, the choice between models can be made using the ratio

  p(M₁ | D) / p(M₂ | D) = [p(M₁) / p(M₂)] × [p(D | M₁) / p(D | M₂)]

The second term in this ratio, the ratio of marginal likelihoods, is the Bayes factor (BF). It is obtained by integrating over all parameter values, not by maximizing as in likelihood ratios. While theoretically attractive, Bayes factors can be difficult to calculate, especially for complex models, and are sensitive to prior choices.
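
The full ratio can be evaluated directly in the toy coin setting (invented numbers), showing how prior model odds and the Bayes factor combine into posterior model probabilities:

```python
from math import comb

# Posterior odds = prior odds × Bayes factor, illustrated with the
# beta-binomial coin example (14 heads in 20 flips, closed-form evidences).
n, k = 20, 14
evidence_m1 = comb(n, k) * 0.5**n   # M1: fair coin
evidence_m2 = 1.0 / (n + 1)         # M2: p ~ Uniform(0, 1)

prior_m1, prior_m2 = 0.8, 0.2       # suppose we strongly expect a fair coin
bf = evidence_m1 / evidence_m2      # Bayes factor B12

posterior_odds = (prior_m1 / prior_m2) * bf
post_m1 = posterior_odds / (1 + posterior_odds)
print(f"posterior P(M1 | D) = {post_m1:.3f}")
```

The data pull mildly toward M2 (B₁₂ < 1), but the strong prior keeps the fair-coin model the more probable one a posteriori.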

Approximations to the Bayes factor, such as BIC and DIC, provide computationally efficient alternatives. These criteria penalize models with greater complexity, favoring parsimonious models that adequately explain the data. However, these approximations rely on specific assumptions and may not be appropriate for all model types.

Other examples


Models can be compared by assessing the "distance" between their posterior (or posterior predictive) distributions. If the distance is small, the more parsimonious model might be preferred. Examples include the Kullback-Leibler divergence and entropy distance measures.
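
As a sketch, the KL divergence between two posteriors can be computed numerically; here between two hypothetical Beta posteriors for a proportion obtained under different priors (all numbers invented for the example):

```python
from math import lgamma, log, exp

def kl_beta(pa, pb, qa, qb, m=20_000):
    """KL(Beta(pa,pb) || Beta(qa,qb)) by midpoint-rule numerical integration."""
    # Log normalising constants of the two Beta densities.
    logc_p = lgamma(pa + pb) - lgamma(pa) - lgamma(pb)
    logc_q = lgamma(qa + qb) - lgamma(qa) - lgamma(qb)
    h = 1.0 / m
    total = 0.0
    for i in range(m):
        x = (i + 0.5) * h            # midpoint of the i-th subinterval
        log_p = logc_p + (pa - 1) * log(x) + (pb - 1) * log(1 - x)
        log_q = logc_q + (qa - 1) * log(x) + (qb - 1) * log(1 - x)
        total += exp(log_p) * (log_p - log_q) * h
    return total

# Posterior from a uniform prior (Beta(15,7)) vs. posterior from a
# sceptical prior concentrated near 1/2 (Beta(19,11)), same data.
kl = kl_beta(15, 7, 19, 11)
print(f"KL = {kl:.3f}")   # small positive value: the posteriors largely agree
```

A small divergence like this would suggest the two prior choices lead to similar inferences, so the more parsimonious specification might be preferred.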

MCMC methods


Markov chain Monte Carlo (MCMC) can be used to perform Bayesian model selection. The idea is to construct an MCMC chain on the space of candidate models, so that the chain visits models according to the model posterior distribution, or some other chosen target distribution.

Reversible jump MCMC (also called trans-dimensional MCMC)[3] allows "jumps" between models of different dimensions. Birth-and-death MCMC[4][5] is an alternative that models the time between jumps as a random variable, with model probabilities determined by the fraction of time the chain spends in each model.
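
Full reversible jump requires carefully constructed dimension-matching proposals. As a much simpler sketch of the underlying idea (toy coin models whose evidences are known in closed form), a Metropolis chain over the model index alone visits each model in proportion to its posterior probability:

```python
import random
from math import comb

# Toy sketch of MCMC over model space (NOT full reversible jump): a
# Metropolis chain on the model index, using the closed-form evidences
# of the coin example with equal model priors. The fraction of time
# spent in each model estimates its posterior probability.
random.seed(42)
n, k = 20, 14
evidence = {1: comb(n, k) * 0.5**n,   # M1: fair coin
            2: 1.0 / (n + 1)}         # M2: p ~ Uniform(0, 1)

current = 1
visits = {1: 0, 2: 0}
for _ in range(50_000):
    proposal = 2 if current == 1 else 1
    # Metropolis acceptance: ratio of (evidence × equal prior) values.
    if random.random() < min(1.0, evidence[proposal] / evidence[current]):
        current = proposal
    visits[current] += 1

post_m2 = visits[2] / 50_000
print(f"estimated P(M2 | D) = {post_m2:.3f}")   # analytic value is about 0.563
```

In realistic problems the within-model evidences are unknown, which is exactly why RJMCMC jumps between parameter spaces of different dimension instead of between precomputed evidences.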

Applications


Mixture models


Mixture models are widely used for data exhibiting heterogeneity. Several techniques exist for comparing mixture models. For instance, the DIC can be used when the mixture model is well defined. In other cases, alternative DIC estimators tailored for mixture models can be employed. Bayes factors, posterior predictive checks, and visual inspection of model fits also aid in selecting appropriate mixture models.
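
A minimal sketch of component-count selection (synthetic data, a hand-rolled EM, variances fixed at 1 for brevity, which handicaps the one-component model): fit one- and two-component Gaussian models and compare their BIC values.

```python
import random
from math import log, exp, pi, sqrt

# Choosing the number of mixture components with BIC on clearly
# bimodal synthetic data.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(5, 1) for _ in range(100)]
n = len(data)

def norm_pdf(x, mu):
    """Unit-variance Gaussian density."""
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)

# K = 1: single unit-variance Gaussian, MLE mean; 1 free parameter.
mu = sum(data) / n
loglik1 = sum(log(norm_pdf(x, mu)) for x in data)
bic1 = 1 * log(n) - 2 * loglik1

# K = 2: EM for the two means and the mixing weight (3 free parameters).
mu1, mu2, w = min(data), max(data), 0.5
for _ in range(50):
    # E-step: responsibility of component 2 for each point.
    resp = [w * norm_pdf(x, mu2) /
            ((1 - w) * norm_pdf(x, mu1) + w * norm_pdf(x, mu2))
            for x in data]
    # M-step: update weight and means.
    w = sum(resp) / n
    mu2 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
    mu1 = sum((1 - r) * x for r, x in zip(resp, data)) / (n - sum(resp))

loglik2 = sum(log((1 - w) * norm_pdf(x, mu1) + w * norm_pdf(x, mu2))
              for x in data)
bic2 = 3 * log(n) - 2 * loglik2

print(f"BIC K=1: {bic1:.1f}, BIC K=2: {bic2:.1f}")   # lower is better
```

On this well-separated data the two-component model wins by a wide margin; with overlapping components the comparison becomes delicate, which is where the mixture-specific criteria mentioned above matter.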

References


General references

  • Gelman, Andrew (2014). Bayesian data analysis. Chapman & Hall/CRC texts in statistical science (Third ed.). Boca Raton: CRC Press. ISBN 978-1-4398-4095-5.
  • Congdon, P. (2007). Bayesian Statistical Modelling. Wiley Series in Probability and Statistics. Wiley. ISBN 978-0-470-03593-1.
  • Robert, Christian P.; Casella, George (2004). "Monte Carlo Statistical Methods". Springer Texts in Statistics. New York, NY: Springer New York. doi:10.1007/978-1-4757-4145-2. ISBN 978-1-4419-1939-7. ISSN 1431-875X.
  • Kruschke, John K. (2015). "Model Comparison and Hierarchical Modeling". Doing Bayesian Data Analysis. Elsevier. pp. 265–296. doi:10.1016/b978-0-12-405888-0.00010-6. ISBN 978-0-12-405888-0.
  • K. P. Burnham and D. R. Anderson, Model Selection and Multi-model Inference: A Practical Information-theoretic Approach, 2nd edn (Springer, New York, 2002).
  • D. MacKay, Information theory, inference, and learning algorithms (Cambridge University Press, Cambridge, UK, 2003).
  • Aitkin, M. (1997). The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood (with discussion). Statistics and Computing 7, 253-272.
  • Celeux, G., Forbes, F., Robert, C.P. and Titterington, D.M. (2003). Deviance information criteria for missing data models. Cahiers du Ceremade 0325.
  • Congdon, P. (2001). Bayesian Statistical Modelling. Wiley, England.
  • Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (1995). Bayesian Data Analysis. Chapman and Hall, London.
  • George, E. and McCulloch, R. (1993). Variable selection via Gibbs sampling. J. American Statist. Association 88(423), 881-889.
  • Green, P. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika 82(4), 711-732.
  • Kass, R. and Raftery, A. (1995). Bayes factors. J. American Statist. Assoc. 90, 773-795.
  • Perez, J.M. and Berger, J. (2002). Expected posterior prior distributions for model selection. Biometrika 89, 491-512.
  • Richardson, S. and Green, P. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Royal Statist. Soc. Series B 59 731-792.
  • Spiegelhalter, D.J., Best, N.G., Carlin, B.P., van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Royal Statist. Society Series B 64(3), 583-639.

Footnotes

  1. ^ Spiegelhalter, David J.; Best, Nicola G.; Carlin, Bradley P.; Van Der Linde, Angelika (2002-10-01). "Bayesian Measures of Model Complexity and Fit". Journal of the Royal Statistical Society Series B: Statistical Methodology. 64 (4): 583–639. doi:10.1111/1467-9868.00353. ISSN 1369-7412.
  2. ^ Konishi, Sadanori; Kitagawa, Genshiro (2008). Information Criteria and Statistical Modeling. Springer Series in Statistics. New York, NY: Springer New York. doi:10.1007/978-0-387-71887-3. ISBN 978-0-387-71886-6.
  3. ^ Green, Peter J. (1995). "Reversible jump Markov chain Monte Carlo computation and Bayesian model determination". Biometrika. 82 (4): 711–732. doi:10.1093/biomet/82.4.711. ISSN 0006-3444.
  4. ^ Stephens, Matthew (2000). "Bayesian Analysis of Mixture Models with an Unknown Number of Components- An Alternative to Reversible Jump Methods". The Annals of Statistics. 28 (1): 40–74. doi:10.1214/aos/1016120364. ISSN 0090-5364. JSTOR 2673981.
  5. ^ Richardson, Sylvia.; Green, Peter J. (1997-11-01). "On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion)". Journal of the Royal Statistical Society Series B: Statistical Methodology. 59 (4): 731–792. doi:10.1111/1467-9868.00095. ISSN 1369-7412.