Bayesian Model Reduction

Introduction

Bayesian Model Reduction (BMR) is a generalisation of the Savage-Dickey ratio allowing an efficient comparison of a large number of generative models, all differing only in their specification of the prior distribution. In the context of our Dynamic Factor Analysis implementation (and its FA/PCA special cases), we use BMR to post-hoc prune parameters of the loading matrix \(\mathbf{H}\) and potentially the dynamics matrix \(\mathbf{F}\), removing redundant latent dimensions and providing an efficient approach to learning sparse representations.

BMR is applied within the M-step pipeline after posterior computation and PX rotation, using the ARD-augmented prior (where \(\mathbb{E}[\tau_k]\) from the previous iteration’s ARD posterior is already incorporated into the prior precision). See Bayesian Dynamic Factor Analysis for the full pipeline description.

This document provides the mathematical foundation for the BMR algorithm implemented in sppcax.

Principle of Bayesian Model Reduction

Bayesian model selections rests on evaluating and comparing model evidence of different generative models \(m_i\) given some data \(\mathbf{y}\). The model evidence is given by:

\[p(\mathbf{y}|m_i) = \int p(\mathbf{y}|\boldsymbol{\theta}, m_i) p(\boldsymbol{\theta}|m_i) d\boldsymbol{\theta}\]

where \(\boldsymbol{\theta}\) are the model parameters, \(p(\mathbf{y}|\boldsymbol{\theta}, m_i)\) is the likelihood, and \(p(\boldsymbol{\theta}|m_i)\) is the prior.

Instead of fitting each model to the data, we can use BMR and leverage the posterior distribution obtained from a full model, \(m_0\), to analytically derive the evidence for each reduced model, \(m_{i>0}\). This is only possible when the reduced models differ from the full model only in their prior specifications. Specifically, we call reduced model any model that has a prior of lower entropy compared to the prior of the full model. Importantly, this condition is satisfied for sparse Factor Analysis, as we are looking for a sparse representation of the loading matrix, that corresponds to a large number of zero valued elements of the matrix.

Mathematical Formulation

Consider a full model \(m_0\) with prior \(p(\boldsymbol{\theta}|m_F)\) and any reduced model \(m_{i>0}\) with prior \(p(\boldsymbol{\theta}|m_{i>0})\). If we know the posterior distribution of the full model \(p(\boldsymbol{\theta}|\mathbf{y}, m_0)\), we can compute the evidence for reduced models as:

\[p(\mathbf{y}|m_i) = p(\mathbf{y}|m_0) \int d\boldsymbol{\theta}\frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} p(\boldsymbol{\theta}|\mathbf{y}, m_0)\]

This allows us to compute the ratio of model evidences (Bayes factor) between the reduced and full models as:

\[\frac{p(\mathbf{y}|m_i)}{p(\mathbf{y}|m_0)} = \int \frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} p(\boldsymbol{\theta}|\mathbf{y}, m_0) d\boldsymbol{\theta}\]

Variational Approximation

For Bayesian Factor Analysis model (or PCA), we use a variational approximation \(q_0(\boldsymbol{\theta})\) to the true posterior \(p(\boldsymbol{\theta}|\mathbf{y}, m_0)\). Importantly, we assume the same functional form for the posterior and the prior (see Prior Distributions for details). The variational free energy is given by:

\[F(m_0) = \int d\boldsymbol{\theta} q_0(\boldsymbol{\theta}) \ln \frac{q_0(\boldsymbol{\theta})}{p(\mathbf{y}, \boldsymbol{\theta}|m_0)}\]

where we assume that the log-joint \(\ln p(\mathbf{y}, \boldsymbol{\theta}|m_0)\) is obtained during the e-step (see VBE-step:) as:

\[\ln p(\mathbf{y}, \boldsymbol{\theta}|m_0) \approx \ln p(\pmb{\theta}|m_0) + \int d \pmb{z} q(\pmb{z}) \ln \frac{q(\pmb{z})}{p(\pmb{y}|\pmb{\theta}, \pmb{z})}\]

For any reduced model, the variational free energy can be computed as:

\[\begin{split}F(m_i) &= F(m_0) - \ln \int d\boldsymbol{\theta} q_0(\boldsymbol{\theta}) \frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} \\ &= F(m_0) - \Delta F_i\end{split}\]

This formula allows us to compute the change in free energy (\(\Delta F\)) when switching from the full model to a reduced model, which is what we implemented in the compute_delta_f function.

Note that if we enumerate different models \(m_i\) in a way that makes to neghbouring models \(m_i\) and \(m_{i+1}\) such that they differ only in a single parameter, where \(m_{i+1}\) is a reduced version of \(m_i\) then we can express the variational free-energy of model \(m_i\) as a folowing sequence

\[\begin{split}F(m_{i-1}) &= F(m_{i}) + \Delta F_{i:i-1} \\ F(m_0) &= F(m_i) + \sum_{k=1}^i \Delta F_{k:k-1} \\ \Delta F_{k:k-1} &= \ln \int d \pmb{\theta} q_{k-1}(\theta) \frac{p(\pmb{\theta}|m_k)}{p(\pmb{\theta}|m_{k-1})}\end{split}\]

Computing \(\Delta F\)

In sppcax, BMR is applied to both the loading matrix \(\tilde{\mathbf{H}}\) (MVNIG posterior) and the dynamics matrix \(\tilde{\mathbf{F}}\) (MVN posterior). The prior precision used in both cases already contains \(\mathbb{E}[\tau_k]\) from the ARD posterior of the previous iteration, so no separate ARD terms appear in the \(\Delta F\) computation.

We define the following quantities for a given row with pruning mask \(\mathbf{g}\) (1 = active, 0 = pruned):

\[\begin{split}\mathbf{G} &= \text{diag}(\mathbf{g}) \\ \mathbf{L}_{\text{post}} &= \text{chol}(\mathbf{G} \boldsymbol{\Lambda}_{\text{post}} \mathbf{G} + \mathbf{I}_{\text{pruned}}) \\ \mathbf{L}_{\text{prior}} &= \text{chol}(\mathbf{G} \boldsymbol{\Lambda}_{\text{prior}} \mathbf{G} + \mathbf{I}_{\text{pruned}}) \\ \tilde{\boldsymbol{\mu}}_{\text{post}} &= \mathbf{L}_{\text{post}}^\top (\mathbf{g} \odot \boldsymbol{\mu}_{\text{post}}) \\ \tilde{\boldsymbol{\mu}}_{\text{prior}} &= \mathbf{L}_{\text{prior}}^\top (\mathbf{g} \odot \boldsymbol{\mu}_{\text{prior}})\end{split}\]

where \(\boldsymbol{\Lambda}_{\text{post}}\) and \(\boldsymbol{\Lambda}_{\text{prior}}\) are the posterior and prior precision matrices, \(\boldsymbol{\mu}_{\text{post}}\) and \(\boldsymbol{\mu}_{\text{prior}}\) are the posterior and prior means, and \(\mathbf{I}_{\text{pruned}} = \mathbf{I} - \mathbf{G}\) fills in the pruned dimensions to keep the Cholesky factorisation well-conditioned.

MVNIG Case (Loading Matrix)

The loading matrix has a joint MVNIG posterior (see Variational Inference):

\[q(\tilde{\mathbf{h}}_d, \psi_d) = \mathcal{N}(\tilde{\mathbf{h}}_d \mid \boldsymbol{\mu}_d, \psi_d^{-1} \boldsymbol{\Sigma}_d) \, \text{Gamma}(\psi_d \mid \alpha_d, \beta_d)\]

The change in variational free energy for pruning a set of elements in row \(d\) is:

\[\Delta F_d = \ln |\mathbf{L}_{\text{post}}| - \ln |\mathbf{L}_{\text{prior}}| + \alpha_d \ln \beta_d - \alpha_d \ln \left(\beta_d + \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{post}}^\top \tilde{\boldsymbol{\mu}}_{\text{post}} - \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{prior}}^\top \tilde{\boldsymbol{\mu}}_{\text{prior}} \right)\]

The \(\alpha_d \ln(\cdot)\) terms arise from integrating over the shared noise precision \(\psi_d\) under the Gamma posterior. When the noise is isotropic (PCA), \(\alpha_d\) and \(\beta_d\) are shared across all dimensions.

MVN Case (Dynamics Matrix)

The dynamics matrix has an MVN posterior without a noise precision parameter (since \(\mathbf{Q} = \mathbf{I}\) is fixed):

\[q(\tilde{\mathbf{f}}_k) = \mathcal{N}(\tilde{\mathbf{f}}_k \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\]

The change in variational free energy simplifies to:

\[\Delta F_k = \ln |\mathbf{L}_{\text{post}}| - \ln |\mathbf{L}_{\text{prior}}| - \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{post}}^\top \tilde{\boldsymbol{\mu}}_{\text{post}} + \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{prior}}^\top \tilde{\boldsymbol{\mu}}_{\text{prior}}\]

This is a direct quadratic form — the absence of the \(\alpha \ln(\beta + \cdots)\) terms reflects the fact that there is no noise variance to integrate over.

Gibbs Sampling with Indian Buffet Process Prior

To determine the final sparse structure of the loading matrix (and dynamics matrix), we use Gibbs sampling with a spike-and-slab prior on the sparsity matrix \(\boldsymbol{\Lambda}\). If \(\lambda_{dk} = 1\), the element retains its normal prior; if \(\lambda_{dk} = 0\), the prior becomes a delta function at zero, forcing the element to vanish.

Indian Buffet Process Prior

Each column \(k\) of the sparsity matrix has an independent inclusion probability drawn from a truncated Indian Buffet Process (IBP) prior:

\[\pi_k \sim \text{Beta}\!\left(\frac{\alpha_0}{K},\, 1\right)\]

where \(\alpha_0\) is a concentration parameter controlling the expected number of active features, and \(K\) is the total number of latent dimensions. This prior encourages sparse representations while allowing the data to determine which dimensions are needed.

Gibbs Sampling Procedure

The posterior probability of element \(\lambda_{dk} = 1\) is:

\[\begin{split}q(\lambda_{dk} = 1) &= \frac{p(\mathbf{y}|m_{i-1})\,\pi_k}{p(\mathbf{y}|m_{i-1})\,\pi_k + p(\mathbf{y}|m_i)\,(1-\pi_k)} \\ &= \sigma\!\left(\Delta F_{d,\, i:i-1} + \ln \frac{\pi_k}{1 - \pi_k}\right)\end{split}\]

where \(\sigma(\cdot)\) is the logistic function and \(\Delta F_{d,\, i:i-1}\) is the change in variational free energy between two models differing only in element \((d, k)\).

At each Gibbs iteration \(t\):

Sample sparsity columns: For \(k = 1, \ldots, K\):

\[\lambda_{dk}^{(t)} \sim \text{Bernoulli}\!\left(\sigma\!\left(\Delta F_{d,\, i:i-1} + \ln \frac{\pi_k}{1 - \pi_k}\right)\right) \cdot \text{mask}_{dk}\]

where \(\text{mask}_{dk}\) is the structural constraint from the model definition.
Update column-wise hyperparameters:

\[\begin{split}\alpha_k &= \frac{\alpha_0}{K} + \sum_d \lambda_{dk}^{(t)} \\ \beta_k &= 1 + \sum_d (1 - \lambda_{dk}^{(t)})\end{split}\]
Update concentration parameter \(\alpha_0\) via empirical Bayes:

\[\alpha_0 = \frac{-K^2}{\sum_k \left[\psi(\alpha_k) - \psi(\alpha_k + \beta_k)\right]}\]

where \(\psi(\cdot)\) is the digamma function.

Posterior Correction After Pruning

After the Gibbs sampler converges with final mask \(\boldsymbol{\Lambda}\), the posterior distributions are corrected to account for the pruned elements.

MVNIG (Emissions)

For the loading matrix posterior:

Pruned elements are set to zero in both the mean and natural parameters.
The Inverse Gamma \(\beta\) parameter is corrected:

\[\beta_d^{\text{new}} = \beta_d + \frac{1}{2}\left( -\boldsymbol{\mu}_{\text{pruned},d}^\top \boldsymbol{\Lambda}_{\text{post},d}\, \boldsymbol{\mu}_{\text{pruned},d} + \boldsymbol{\mu}_{\text{pruned},d}^{0\top} \boldsymbol{\Lambda}_{\text{prior},d}\, \boldsymbol{\mu}_{\text{pruned},d}^0 \right)\]

where \(\boldsymbol{\mu}_{\text{pruned},d}\) denotes the pruned elements of the posterior mean and \(\boldsymbol{\mu}_{\text{pruned},d}^0\) the pruned elements of the prior mean.
The Inverse Gamma scale is optimised:

\[\text{nat2}_0 = \min\!\left(\frac{\Delta\text{nat2}}{\alpha - 1},\, \frac{-1}{\alpha - 1}\right)\]

MVN (Dynamics)

For the dynamics matrix posterior, pruned elements are set to zero via the updated mask. No noise precision correction is needed since \(\mathbf{Q} = \mathbf{I}\) is fixed.

References

Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J., & Penny, W. (2007). Variational free energy and the Laplace approximation. Neuroimage, 34(1), 220-234.
Friston, K., & Penny, W. (2011). Post hoc Bayesian model selection. Neuroimage, 56(4), 2089-2099.
Friston, K., Parr, T., & Zeidman, P. (2018). Bayesian model reduction. arXiv preprint arXiv:1805.07092.
Penny, W., & Ridgway, G. (2013). Efficient posterior probability mapping using Savage-Dickey ratios. PloS one, 8(3), e59655.
Zeidman, P., Jafarian, A., Seghier, M. L., Litvak, V., Cagnan, H., Price, C. J., & Friston, K. J. (2019). A guide to group effective connectivity analysis, part 2: Second level analysis with PEB. Neuroimage, 200, 12-25.
Marković, D., Friston, K. and Kiebel, S. (2024). “Bayesian sparsification for deep neural networks with Bayesian model reduction.” IEEE Access, 12, 88231 - 88242.