============================ Bayesian Model Reduction ============================ Introduction ============ Bayesian Model Reduction (BMR) is a generalisation of the Savage-Dickey ratio allowing an efficient comparison of a large number of generative models, all differing only in their specification of the prior distribution. In the context of our Dynamic Factor Analysis implementation (and its FA/PCA special cases), we use BMR to post-hoc prune parameters of the loading matrix :math:`\mathbf{H}` and potentially the dynamics matrix :math:`\mathbf{F}`, removing redundant latent dimensions and providing an efficient approach to learning sparse representations. BMR is applied within the M-step pipeline **after** posterior computation and PX rotation, using the ARD-augmented prior (where :math:`\mathbb{E}[\tau_k]` from the previous iteration's ARD posterior is already incorporated into the prior precision). See :doc:`dynamic_factor_analysis` for the full pipeline description. This document provides the mathematical foundation for the BMR algorithm implemented in ``sppcax``. Principle of Bayesian Model Reduction ===================================== Bayesian model selections rests on evaluating and comparing model evidence of different generative models :math:`m_i` given some data :math:`\mathbf{y}`. The model evidence is given by: .. math:: p(\mathbf{y}|m_i) = \int p(\mathbf{y}|\boldsymbol{\theta}, m_i) p(\boldsymbol{\theta}|m_i) d\boldsymbol{\theta} where :math:`\boldsymbol{\theta}` are the model parameters, :math:`p(\mathbf{y}|\boldsymbol{\theta}, m_i)` is the likelihood, and :math:`p(\boldsymbol{\theta}|m_i)` is the prior. Instead of fitting each model to the data, we can use BMR and leverage the posterior distribution obtained from a full model, :math:`m_0`, to analytically derive the evidence for each reduced model, :math:`m_{i>0}`. This is only possible when the reduced models differ from the full model only in their prior specifications. Specifically, we call reduced model any model that has a prior of lower entropy compared to the prior of the full model. Importantly, this condition is satisfied for sparse Factor Analysis, as we are looking for a sparse representation of the loading matrix, that corresponds to a large number of zero valued elements of the matrix. Mathematical Formulation ======================== Consider a full model :math:`m_0` with prior :math:`p(\boldsymbol{\theta}|m_F)` and any reduced model :math:`m_{i>0}` with prior :math:`p(\boldsymbol{\theta}|m_{i>0})`. If we know the posterior distribution of the full model :math:`p(\boldsymbol{\theta}|\mathbf{y}, m_0)`, we can compute the evidence for reduced models as: .. math:: p(\mathbf{y}|m_i) = p(\mathbf{y}|m_0) \int d\boldsymbol{\theta}\frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} p(\boldsymbol{\theta}|\mathbf{y}, m_0) This allows us to compute the ratio of model evidences (Bayes factor) between the reduced and full models as: .. math:: \frac{p(\mathbf{y}|m_i)}{p(\mathbf{y}|m_0)} = \int \frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} p(\boldsymbol{\theta}|\mathbf{y}, m_0) d\boldsymbol{\theta} Variational Approximation ========================= For Bayesian Factor Analysis model (or PCA), we use a variational approximation :math:`q_0(\boldsymbol{\theta})` to the true posterior :math:`p(\boldsymbol{\theta}|\mathbf{y}, m_0)`. Importantly, we assume the same functional form for the posterior and the prior (see :ref:`sec-prior-dist` for details). The variational free energy is given by: .. math:: F(m_0) = \int d\boldsymbol{\theta} q_0(\boldsymbol{\theta}) \ln \frac{q_0(\boldsymbol{\theta})}{p(\mathbf{y}, \boldsymbol{\theta}|m_0)} where we assume that the log-joint :math:`\ln p(\mathbf{y}, \boldsymbol{\theta}|m_0)` is obtained during the e-step (see :ref:`e-step`) as: .. math:: \ln p(\mathbf{y}, \boldsymbol{\theta}|m_0) \approx \ln p(\pmb{\theta}|m_0) + \int d \pmb{z} q(\pmb{z}) \ln \frac{q(\pmb{z})}{p(\pmb{y}|\pmb{\theta}, \pmb{z})} For any reduced model, the variational free energy can be computed as: .. math:: F(m_i) &= F(m_0) - \ln \int d\boldsymbol{\theta} q_0(\boldsymbol{\theta}) \frac{p(\boldsymbol{\theta}|m_i)}{p(\boldsymbol{\theta}|m_0)} \\ &= F(m_0) - \Delta F_i This formula allows us to compute the change in free energy (:math:`\Delta F`) when switching from the full model to a reduced model, which is what we implemented in the ``compute_delta_f`` function. Note that if we enumerate different models :math:`m_i` in a way that makes to neghbouring models :math:`m_i` and :math:`m_{i+1}` such that they differ only in a single parameter, where :math:`m_{i+1}` is a reduced version of :math:`m_i` then we can express the variational free-energy of model :math:`m_i` as a folowing sequence .. math:: F(m_{i-1}) &= F(m_{i}) + \Delta F_{i:i-1} \\ F(m_0) &= F(m_i) + \sum_{k=1}^i \Delta F_{k:k-1} \\ \Delta F_{k:k-1} &= \ln \int d \pmb{\theta} q_{k-1}(\theta) \frac{p(\pmb{\theta}|m_k)}{p(\pmb{\theta}|m_{k-1})} Computing :math:`\Delta F` =========================== In ``sppcax``, BMR is applied to both the loading matrix :math:`\tilde{\mathbf{H}}` (MVNIG posterior) and the dynamics matrix :math:`\tilde{\mathbf{F}}` (MVN posterior). The prior precision used in both cases already contains :math:`\mathbb{E}[\tau_k]` from the ARD posterior of the previous iteration, so no separate ARD terms appear in the :math:`\Delta F` computation. We define the following quantities for a given row with pruning mask :math:`\mathbf{g}` (1 = active, 0 = pruned): .. math:: \mathbf{G} &= \text{diag}(\mathbf{g}) \\ \mathbf{L}_{\text{post}} &= \text{chol}(\mathbf{G} \boldsymbol{\Lambda}_{\text{post}} \mathbf{G} + \mathbf{I}_{\text{pruned}}) \\ \mathbf{L}_{\text{prior}} &= \text{chol}(\mathbf{G} \boldsymbol{\Lambda}_{\text{prior}} \mathbf{G} + \mathbf{I}_{\text{pruned}}) \\ \tilde{\boldsymbol{\mu}}_{\text{post}} &= \mathbf{L}_{\text{post}}^\top (\mathbf{g} \odot \boldsymbol{\mu}_{\text{post}}) \\ \tilde{\boldsymbol{\mu}}_{\text{prior}} &= \mathbf{L}_{\text{prior}}^\top (\mathbf{g} \odot \boldsymbol{\mu}_{\text{prior}}) where :math:`\boldsymbol{\Lambda}_{\text{post}}` and :math:`\boldsymbol{\Lambda}_{\text{prior}}` are the posterior and prior precision matrices, :math:`\boldsymbol{\mu}_{\text{post}}` and :math:`\boldsymbol{\mu}_{\text{prior}}` are the posterior and prior means, and :math:`\mathbf{I}_{\text{pruned}} = \mathbf{I} - \mathbf{G}` fills in the pruned dimensions to keep the Cholesky factorisation well-conditioned. MVNIG Case (Loading Matrix) ---------------------------- The loading matrix has a joint MVNIG posterior (see :ref:`sec-post-dist`): .. math:: q(\tilde{\mathbf{h}}_d, \psi_d) = \mathcal{N}(\tilde{\mathbf{h}}_d \mid \boldsymbol{\mu}_d, \psi_d^{-1} \boldsymbol{\Sigma}_d) \, \text{Gamma}(\psi_d \mid \alpha_d, \beta_d) The change in variational free energy for pruning a set of elements in row :math:`d` is: .. math:: \Delta F_d = \ln |\mathbf{L}_{\text{post}}| - \ln |\mathbf{L}_{\text{prior}}| + \alpha_d \ln \beta_d - \alpha_d \ln \left(\beta_d + \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{post}}^\top \tilde{\boldsymbol{\mu}}_{\text{post}} - \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{prior}}^\top \tilde{\boldsymbol{\mu}}_{\text{prior}} \right) The :math:`\alpha_d \ln(\cdot)` terms arise from integrating over the shared noise precision :math:`\psi_d` under the Gamma posterior. When the noise is isotropic (PCA), :math:`\alpha_d` and :math:`\beta_d` are shared across all dimensions. MVN Case (Dynamics Matrix) --------------------------- The dynamics matrix has an MVN posterior without a noise precision parameter (since :math:`\mathbf{Q} = \mathbf{I}` is fixed): .. math:: q(\tilde{\mathbf{f}}_k) = \mathcal{N}(\tilde{\mathbf{f}}_k \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) The change in variational free energy simplifies to: .. math:: \Delta F_k = \ln |\mathbf{L}_{\text{post}}| - \ln |\mathbf{L}_{\text{prior}}| - \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{post}}^\top \tilde{\boldsymbol{\mu}}_{\text{post}} + \tfrac{1}{2} \tilde{\boldsymbol{\mu}}_{\text{prior}}^\top \tilde{\boldsymbol{\mu}}_{\text{prior}} This is a direct quadratic form — the absence of the :math:`\alpha \ln(\beta + \cdots)` terms reflects the fact that there is no noise variance to integrate over. Gibbs Sampling with Indian Buffet Process Prior ================================================ To determine the final sparse structure of the loading matrix (and dynamics matrix), we use Gibbs sampling with a spike-and-slab prior on the sparsity matrix :math:`\boldsymbol{\Lambda}`. If :math:`\lambda_{dk} = 1`, the element retains its normal prior; if :math:`\lambda_{dk} = 0`, the prior becomes a delta function at zero, forcing the element to vanish. Indian Buffet Process Prior ---------------------------- Each column :math:`k` of the sparsity matrix has an independent inclusion probability drawn from a truncated Indian Buffet Process (IBP) prior: .. math:: \pi_k \sim \text{Beta}\!\left(\frac{\alpha_0}{K},\, 1\right) where :math:`\alpha_0` is a concentration parameter controlling the expected number of active features, and :math:`K` is the total number of latent dimensions. This prior encourages sparse representations while allowing the data to determine which dimensions are needed. Gibbs Sampling Procedure ------------------------- The posterior probability of element :math:`\lambda_{dk} = 1` is: .. math:: q(\lambda_{dk} = 1) &= \frac{p(\mathbf{y}|m_{i-1})\,\pi_k}{p(\mathbf{y}|m_{i-1})\,\pi_k + p(\mathbf{y}|m_i)\,(1-\pi_k)} \\ &= \sigma\!\left(\Delta F_{d,\, i:i-1} + \ln \frac{\pi_k}{1 - \pi_k}\right) where :math:`\sigma(\cdot)` is the logistic function and :math:`\Delta F_{d,\, i:i-1}` is the change in variational free energy between two models differing only in element :math:`(d, k)`. At each Gibbs iteration :math:`t`: 1. **Sample sparsity columns**: For :math:`k = 1, \ldots, K`: .. math:: \lambda_{dk}^{(t)} \sim \text{Bernoulli}\!\left(\sigma\!\left(\Delta F_{d,\, i:i-1} + \ln \frac{\pi_k}{1 - \pi_k}\right)\right) \cdot \text{mask}_{dk} where :math:`\text{mask}_{dk}` is the structural constraint from the model definition. 2. **Update column-wise hyperparameters**: .. math:: \alpha_k &= \frac{\alpha_0}{K} + \sum_d \lambda_{dk}^{(t)} \\ \beta_k &= 1 + \sum_d (1 - \lambda_{dk}^{(t)}) 3. **Update concentration parameter** :math:`\alpha_0` via empirical Bayes: .. math:: \alpha_0 = \frac{-K^2}{\sum_k \left[\psi(\alpha_k) - \psi(\alpha_k + \beta_k)\right]} where :math:`\psi(\cdot)` is the digamma function. Posterior Correction After Pruning =================================== After the Gibbs sampler converges with final mask :math:`\boldsymbol{\Lambda}`, the posterior distributions are corrected to account for the pruned elements. MVNIG (Emissions) ------------------ For the loading matrix posterior: 1. Pruned elements are set to zero in both the mean and natural parameters. 2. The Inverse Gamma :math:`\beta` parameter is corrected: .. math:: \beta_d^{\text{new}} = \beta_d + \frac{1}{2}\left( -\boldsymbol{\mu}_{\text{pruned},d}^\top \boldsymbol{\Lambda}_{\text{post},d}\, \boldsymbol{\mu}_{\text{pruned},d} + \boldsymbol{\mu}_{\text{pruned},d}^{0\top} \boldsymbol{\Lambda}_{\text{prior},d}\, \boldsymbol{\mu}_{\text{pruned},d}^0 \right) where :math:`\boldsymbol{\mu}_{\text{pruned},d}` denotes the pruned elements of the posterior mean and :math:`\boldsymbol{\mu}_{\text{pruned},d}^0` the pruned elements of the prior mean. 3. The Inverse Gamma scale is optimised: .. math:: \text{nat2}_0 = \min\!\left(\frac{\Delta\text{nat2}}{\alpha - 1},\, \frac{-1}{\alpha - 1}\right) MVN (Dynamics) --------------- For the dynamics matrix posterior, pruned elements are set to zero via the updated mask. No noise precision correction is needed since :math:`\mathbf{Q} = \mathbf{I}` is fixed. See Also ======== For practical examples demonstrating BMR in action: - :doc:`examples/test_px_em_fa` — FA with BMR for sparse loading matrix discovery - :doc:`examples/test_px_em_dfa` — DFA with BMR for sparse structure discovery References ========== 1. Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J., & Penny, W. (2007). Variational free energy and the Laplace approximation. Neuroimage, 34(1), 220-234. 2. Friston, K., & Penny, W. (2011). Post hoc Bayesian model selection. Neuroimage, 56(4), 2089-2099. 3. Friston, K., Parr, T., & Zeidman, P. (2018). Bayesian model reduction. arXiv preprint arXiv:1805.07092. 4. Penny, W., & Ridgway, G. (2013). Efficient posterior probability mapping using Savage-Dickey ratios. PloS one, 8(3), e59655. 5. Zeidman, P., Jafarian, A., Seghier, M. L., Litvak, V., Cagnan, H., Price, C. J., & Friston, K. J. (2019). A guide to group effective connectivity analysis, part 2: Second level analysis with PEB. Neuroimage, 200, 12-25. 6. Marković, D., Friston, K. and Kiebel, S. (2024). "Bayesian sparsification for deep neural networks with Bayesian model reduction." IEEE Access, 12, 88231 - 88242.