Predictive Posteriors – Alexandre Thiéry

A Bayesian model gives a distribution \(P_\theta\) for future data, indexed by a parameter \(\theta\) abd predictions are usually done through the posterior predictive. If \(\Pi_n\) is the posterior on \(\theta\), then the posterior predictive is \(P_{\Pi_n} = \int_\Theta P_\theta \, d\Pi_n(\theta)\) and it can be interpreted as a mixture of the model components \(P_\theta\) weighed by the posterior \(\Pi_n\). The standard Bayesian approach chooses the weights by scoring parameters with the likelihood, not by directly scoring the mixture predictive, which can be especially limiting when the model is misspecified!

For example, suppose that the data come from two subpopulations, but the model only accounts for one. The posterior will learn a compromise parameter, and the posterior predictive will be a mixture of one component. A predictive posterior approach, as we will describe below, can learn a mixture of two components and predict much better. This is achieved by scoring the mixture predictive directly, rather than scoring the individual components through the likelihood function. In other words, the predictive posteriors still mix the original model components \(P_\theta\), still using a distribution \(Q\) on \(\theta\), but they choose \(Q\) by scoring the resulting mixture predictive directly in terms of predictive performance. In contrast, Bayes chooses \(Q\) by scoring the individual components \(P_\theta\) through the likelihood function.

Let us setup some notation: the observations take values in a space \(\mathcal Z\), and we typically have \(n\) observations \(z_1,\ldots,z_n\in\mathcal Z\). The parameter \(\theta\) belongs to a parameter space \(\Theta\), and \(p_\theta(z)\) is the density of \(P_\theta\) with respect to a reference measure \(\mu\) on \(\mathcal Z\). The log-likelihood is \(\ell_n(\theta) = \sum_{i=1}^n \log p_\theta(z_i)\) and the model is:

\[ \mathcal M=\{P_\theta:\theta\in\Theta\}. \]

A mixing distribution \(Q\) is a probability distribution on \(\Theta\) and the mixture predictive with respect to \(Q\) is defined as

\[ P_Q = \int_\Theta P_\theta \, dQ(\theta). \tag{1}\]

The mixture \(P_Q\) is again a probability distribution on \(\mathcal Z\) and for an event \(A\subset\mathcal Z\) we have \(P_Q(A)=\int_\Theta P_\theta(A)\,dQ(\theta)\).

Average the score or score the average

Let \(Q_0\) be a prior or reference measure on \(\Theta\). The Kullback-Leibler term below is the regularizer. Bayesian and generalized Bayesian approaches can be written as

\[ Q_n^G \in \mathop{\mathrm{argmin}}_{Q\in\mathcal P(\Theta)} \left\{ -\int_\Theta \ell_n(\theta)\,dQ(\theta) + \lambda_n D_{\text{KL}}(Q \mid Q_0) \right\}. \tag{2}\]

Standard Bayes is the special case with \(\lambda_n=1\) and \(Q_0\) the prior, while generalized Bayes allows other losses and scaling of \(\lambda_n\) (Bissiri, Holmes, and Walker 2016). A predictive posterior instead solves

\[ Q_n^P \in \mathop{\mathrm{argmin}}_{Q\in\mathcal P(\Theta)} \left\{ D_n(P_n,P_Q) + \lambda_n D_{\text{KL}}(Q \mid Q_0) \right\} \tag{3}\]

where \(P_n=\frac1n\sum_{i=1}^n\delta_{z_i}\) is the empirical distribution of the data and \(D_n\) is a discrepancy between the predictive \(P_Q\) and the empirical distribution \(P_n\). The first term scores the mixture predictive directly, while the second term keeps \(Q\) close to a reference measure \(Q_0\). The discrepancy \(D_n(P_n,P)\) may be chosen as the negative log likelihood, i.e.

\[ D_n(P_n,P_Q) = -\sum_{i=1}^n \log \int_\Theta p_\theta(z_i)\,dQ(\theta), \]

but indeed it does not have to be. It can be any scoring rule and a few examples are given in the next section. The regularization term is important to avoid overfitting, especially when the model is misspecified and the mixture predictive can be very flexible. For a pointwise scoring rule \(S\), with smaller scores being better, the discrepancy can be written as

\[ D_n(P_n,P_Q) = \sum_{i=1}^n S(P_Q,z_i). \]

Note that the quantity \(\lambda_n\) can be chosen to depend on \(n\), and it can be tuned to optimize predictive performance, typically using a validation set or cross-validation.

Log score and mixture likelihood

With the log score and no KL penalty, Equation 3 becomes

\[ \mathop{\mathrm{argmin}}_Q \left\{ -\sum_{i=1}^n \log\int_\Theta p_\theta(z_i)\,dQ(\theta) \right\}. \tag{4}\]

This is the likelihood for a mixture model with unknown mixing distribution (Lindsay 1995). So in this case the predictive posterior is doing classical mixture estimation. If \(Q\) is restricted to a finite list of predictors, the same objective gives stacking or convex aggregation (Yao et al. 2018). Better prediction may simply mean that mixtures predict better than single components. The posterior-like part is the entropy penalty on \(Q\).

Conditional prediction

For supervised data, write \(x\in\mathcal X\) for covariates and \(y\in\mathcal Y\) for responses so that the observation is \(z=(x,y)\in\mathcal X\times\mathcal Y\). Replace \(P_\theta\) by a conditional distribution \(P_\theta(dy\mid x)\) on \(\mathcal Y\) and the mixture predictive is

\[ P_Q(dy\mid x) = \int_\Theta P_\theta(dy\mid x)\,dQ(\theta). \tag{5}\]

If one uses a proper scoring rule \(S\) as the discrepancy, the predictive posterior solves

\[ \sum_{i=1}^n S(P_Q(\cdot\mid x_i),y_i) + \lambda_nD_{\text{KL}}(Q\mid Q_0) \tag{6}\]

where as usual the first term scores the mixture predictive and the second term keeps \(Q\) close to a reference measure \(Q_0\). The log score is a special case, but other proper scoring rules can be used as well.

A latent-subpopulation example

Consider the Gaussian regression model \(P_\theta(dy\mid x)=\mathcal N(y;x^\top\theta,\sigma^2)\,dy\) where \(sigma^2\) is known and \(\theta\in\mathbb R^p\) is the regression coefficient. Now suppose the population has unobserved groups:

\[ Y\mid X=x \sim \sum_{k=1}^K w_k\,\mathcal N(y;x^\top\theta_k,\sigma^2). \tag{7}\]

Bayes under the single-slope model learns one compromise slope, which can have arbitrarily poor predictive performance. A predictive posterior can put mass near \(\theta_1,\ldots,\theta_K\), which gives a mixture predictive that can be much closer to the truth. If the groups are part of the scientific model, \(Q\) estimates population heterogeneity. If the same mixture is only approximating nonlinear regression, heteroscedastic noise, or missing covariates, \(Q\) is an ensemble representation.

A misspecified simulator example

The interpretation changes in mechanistic models. Consider a mechanistic simulator \(P_\theta(dy\mid x)\). For a deterministic simulator, this may be a point mass, or the simulator output plus observation noise. The parameter \(\theta\) may contain rate constants, forcing terms, initial-state corrections, or boundary conditions. If the simulator omits a source of randomness or a reaction channel, ordinary Bayes can learn one best wrong simulator. The mixture can add predictive spread, as is empirically observed in ensemble weather forecasts. It can widen intervals, add heavier tails, or create several plausible predictive modes. Note that this predictive spread is not a discovery of the missing mechanism. It is just a predictive fix for a misspecified model. In other words, there is nothing magical about the mixture predictive: it is just a mixture of the original model components, and the mixture weights are chosen to optimize predictive performance. A broad \(Q\) may mean physical heterogeneity, missing process noise, numerical ensemble spread, or regularization.

What does Q mean?

The objective identifies \(P_Q\), not \(Q\) and the map \(Q \mapsto P_Q\) needs not be injective since different mixing distributions can give the same predictive distribution. Two nearly identical predictive distributions can come from very different mixing distributions. The objective directly supports claims about \(P_Q\). On the other hand, claims about parameter uncertainty, latent heterogeneity, or scientific mechanism need more assumptions. They need identifiability of the mixture representation and a data-generating story in which \(Q\) has that meaning.

The regularization scale \(\lambda_n\) also matters. While a Bayes-scaled value keeps the posterior analogy closer, a validation-tuned value targets prediction. Large \(\lambda_n\) keeps \(Q\) close to \(Q_0\). Small \(\lambda_n\) chases predictive fit. In the second case, the spread of \(Q\) is partly a tuning effect.

Computation

The optimization is over distributions \(Q\) on parameter space, so computation starts by choosing a representation of \(Q\). A variational implementation chooses a parametric density \(q_\lambda\) on \(\Theta\) and optimizes \(\lambda\). A particle implementation uses

\[ Q_M=\frac1M\sum_{m=1}^M\delta_{\theta_m} \tag{8}\]

and optimizes the locations \(\theta_1,\ldots,\theta_M\) of the particles. This leads to a nonconvex optimizaition problem that is often attacked with stochastic gradient descent. For MMD, CRPS, or energy scores, the objective can often be estimated from simulations. They only require simulation from \(P_\theta\), not evaluation of \(p_\theta\).

Where this appears

Several recent papers study this score-the-mixture update. Predictive variational inference, prediction-centric uncertainty quantification, and predictively oriented posteriors all fit the score-the-mixture pattern (Lai, Linero, and Yao 2024; Shen et al. 2024; McLatchie et al. 2025). Generalized Bayes fits the average-loss pattern (Bissiri, Holmes, and Walker 2016). MMD posterior bootstrap and MMD regression are nearby, but they remain parameter-projection methods unless they score the mixture predictive itself.

References

Bissiri, P. G., C. C. Holmes, and S. G. Walker. 2016. “A General Framework for Updating Belief Distributions.” Journal of the Royal Statistical Society Series B: Statistical Methodology 78 (5). Oxford University Press (OUP): 1103–30. doi:10.1111/rssb.12158.

Lai, Jinlin, Antonio Linero, and Yuling Yao. 2024. “Predictive Variational Inference: Learn the Predictively Optimal Posterior Distribution.” arXiv preprint arXiv:2410.14843.

Lindsay, Bruce G. 1995. Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics; American Statistical Association. doi:10.1214/cbms/1462106013.

McLatchie, Yann, Badr-Eddine Cherief-Abdellatif, David T. Frazier, and Jeremias Knoblauch. 2025. “Predictively Oriented Posteriors.” arXiv preprint arXiv:2510.01915.

Shen, Zheyang, Jeremias Knoblauch, Sam Power, and Chris. J. Oates. 2024. “Prediction-Centric Uncertainty Quantification via MMD.” In International Conference on Artificial Intelligence and Statistics.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions (with Discussion).” Bayesian Analysis 13 (3). Institute of Mathematical Statistics: 917–1007. doi:10.1214/17-ba1091.