A Tutorial on Clustering Analysis for Bayesian Graphical Models using the R packages bgms and easybgm
Introduction
This is a short software tutorial accompanying the paper “A Stochastic Block Prior for Clustering in Graphical Models”. The paper introduces the Stochastic Block Model as a prior on the network structure of the ordinal Markov random field (cf., Marsman et al., 2025). In addition to offering an approach for incorporating the theoretical assumption of clustering directly into the statistical model, the method provides a principled way to infer the number of clusters in the network given the data, and to formally test hypotheses about clustering.
In this brief tutorial, the analysis is first illustrated using the R package bgms (Marsman et al., 2023), which contains the core functions needed to implement the method. The same analysis is then demonstrated using the user-friendly wrapper package easybgm (Huth et al., 2024). The package easybgm provides a more accessible interface for conducting the analysis, particularly for users without a strong background in R.
bgms
Currently, the version of bgms that includes the SBM prior is not yet available on CRAN. Therefore, we need to use the following code to install the stable GitHub version:
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("Bayesian-Graphical-Modelling-Lab/bgms")
Now, suppose the goal is to assess whether the network structure inferred from an empirical dataset exhibits clustering. This can be examined by estimating the most representative vector of node allocations given the data, which indicates the cluster assignment of each node and thus provides insight into potential clustering in the network. In addition, it is possible to test hypotheses about clustering using Bayes factors. All of this can be achieved using the bgm function in the bgms package.
The first step is to estimate the model using bgm, setting the edge_prior argument to "Stochastic-Block". Several additional arguments are relevant when using the stochastic block model (SBM) as a prior for the network structure. One is the concentration parameter of the Dirichlet hyperprior on the cluster allocation probabilities (dirichlet_alpha). This parameter influences the expected cluster sizes: higher values favor more balanced cluster sizes, while lower values lead to more uneven, possibly sparse, cluster structures. The default is dirichlet_alpha = 1. Another argument is the rate parameter of the truncated Poisson prior (lambda), which governs the prior distribution over the number of clusters. This parameter encodes the expected number of clusters a priori; the default value lambda = 1 places most prior mass on a single cluster (i.e., no clustering a priori).
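To build intuition for how lambda shapes the prior on the number of clusters, the following sketch computes the zero-truncated Poisson probabilities by hand. This is purely illustrative: the function name truncated_poisson is ours, and bgms handles the prior internally, truncating the support at the number of variables \(p\) (a negligible difference for small lambda).

```r
# Sketch: prior over the number of clusters B under a zero-truncated
# Poisson with rate lambda (illustrative; bgms truncates at the number
# of variables p, which barely matters for small lambda).
truncated_poisson <- function(b, lambda) {
  lambda^b / (factorial(b) * (exp(lambda) - 1))
}

b <- 1:5
round(truncated_poisson(b, lambda = 1), 3) # most prior mass on B = 1
round(truncated_poisson(b, lambda = 3), 3) # mass shifts toward more clusters
```

With lambda = 1, the prior mode is at a single cluster; increasing lambda shifts prior mass toward networks with more clusters, which is why this choice deserves attention in applied analyses.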
In the example code below, the default values for these hyperparameters are used, as well as the default settings for the remaining function arguments. However, it is strongly recommended that researchers carefully consider these values when conducting their own analyses. For more details, see, for example, Marsman et al. (2023). If the argument save = TRUE is used, refer to the subsection below for an additional step needed to retrieve the summarized results. In the code examples below, dataset is a placeholder for the actual dataset to be analysed. This should be an \(n \times p\) matrix or data frame, where \(n\) is the number of observations and \(p\) is the number of variables. The variables can be binary, ordinal, or a combination of both.
library(bgms) # load the bgms package

# run the model
fit <- bgm(dataset, edge_prior = "Stochastic-Block")
First, we check the estimated posterior cluster allocation vector. The output is a vector whose length equals the number of variables (columns) in the dataset. To examine it, we inspect the allocations vector in the fit object, where the results of the analysis are stored.
fit[["allocations"]]
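Cluster sizes can then be read off by tabulating this vector. The allocation values below are hypothetical stand-ins, not output from a real fit:

```r
# Hypothetical allocation vector for six nodes (stand-in for fit[["allocations"]])
allocations <- c(1, 1, 2, 2, 2, 1)

table(allocations) # number of nodes assigned to each cluster
```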
We can also examine the estimated posterior probabilities for all possible clusters. In the software, this ranges from 1 (i.e., no clustering) to the total number of variables in the dataset (i.e., each variable forms its own cluster).
fit[["components"]]
These posterior probabilities are particularly useful for calculating the cluster Bayes factors, which are defined as follows:
\[\text{BF}_{10} = \underbrace{\frac{p(\mathcal{H}_1\mid \text{data})}{p(\mathcal{H}_0 \mid \text{data})}}_{\text{Posterior Odds}} \bigg/ \underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}}_{\text{Prior Odds}}\]
Depending on the hypotheses, we consider two types of Bayes factors:
- The Bayes factor testing the hypothesis of clustering, i.e., \(\mathcal{H}_1: B > 1\), against the null hypothesis of no clustering \(\mathcal{H}_0: B = 1\). For this, we first need to calculate the posterior odds of the two hypotheses, which is simply the ratio of the posterior probabilities of the two hypotheses. For \(\mathcal{H}_1: B > 1\), we would need to sum the posterior probabilities of all possible numbers of clusters greater than 1.
H1 <- unname(fit[["components"]][-1, 2]) # posterior probabilities under H1
H0 <- unname(fit[["components"]][1, 2])  # posterior probability under H0

posterior_odds <- sum(H1) / H0
After calculating the posterior odds, we need to divide these by the prior odds in order to obtain the Bayes factor. For this Bayes factor, the prior odds are defined as:
\[ \frac{(\exp(\lambda) - 1 - \lambda)}{\lambda}.\] Since we leave the default value for the rate parameter \(\lambda\) of the Poisson distribution, we calculate the prior odds as follows:
lambda <- 1
prior_odds <- (exp(lambda) - 1 - lambda) / lambda
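As a brief sketch of where this expression comes from (ignoring the finite truncation at \(p\) clusters, which is negligible here): under the zero-truncated Poisson prior,

\[p(B = b) = \frac{\lambda^b}{b!\,(e^{\lambda} - 1)}, \quad b = 1, 2, \ldots,\]

so that

\[\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)} = \frac{p(B > 1)}{p(B = 1)} = \frac{1 - \dfrac{\lambda}{e^{\lambda} - 1}}{\dfrac{\lambda}{e^{\lambda} - 1}} = \frac{e^{\lambda} - 1 - \lambda}{\lambda}.\]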
Finally, the Bayes factor is given by:
BF_10 <- posterior_odds / prior_odds

BF_10     # Bayes factor in favor of H1
1 / BF_10 # Bayes factor in favor of H0
- The Bayes factor testing the hypothesis of a specific number of clusters, i.e., \(\mathcal{H}_1: B = b_1\), against \(\mathcal{H}_2: B = b_2\). Say we wish to test the null hypothesis of one cluster against the hypothesis that assumes there are two clusters (i.e., \(b_1 = 1\) and \(b_2 = 2\)). For this, we follow similar steps as above.
H1 <- unname(fit[["components"]][1, 2]) # posterior probability under H1 (i.e., B = 1)
H2 <- unname(fit[["components"]][2, 2]) # posterior probability under H2 (i.e., B = 2)

posterior_odds <- H1 / H2
The prior odds for this Bayes factor are given by: \[\lambda^{b_1 - b_2} \cdot \frac{b_2!}{b_1!}\]
prior_odds <- lambda^(1 - 2) * factorial(2) / factorial(1)
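This expression follows directly from the same truncated Poisson prior, since the normalizing constant cancels in the ratio:

\[\frac{p(B = b_1)}{p(B = b_2)} = \frac{\lambda^{b_1} / b_1!}{\lambda^{b_2} / b_2!} = \lambda^{b_1 - b_2} \cdot \frac{b_2!}{b_1!}.\]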
Finally, the Bayes factor is given by:
BF_12 <- posterior_odds / prior_odds

BF_12     # Bayes factor in favor of H1
1 / BF_12 # Bayes factor in favor of H2
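The manual steps for the first Bayes factor can be wrapped in a small helper. This is a sketch, not part of bgms: the function name cluster_bf_10 is ours, and it assumes components is a two-column matrix with the number of clusters in column 1 and the posterior probability in column 2, as used above.

```r
# Sketch (not part of bgms): Bayes factor for clustering (B > 1)
# versus no clustering (B = 1), given a two-column `components`
# matrix (cluster count, posterior probability) and the truncated
# Poisson rate `lambda`.
cluster_bf_10 <- function(components, lambda = 1) {
  post_H0 <- components[1, 2]       # P(B = 1 | data)
  post_H1 <- sum(components[-1, 2]) # P(B > 1 | data)
  prior_odds <- (exp(lambda) - 1 - lambda) / lambda
  (post_H1 / post_H0) / prior_odds
}

# Hypothetical posterior probabilities for B = 1, ..., 4 (not real output):
components <- cbind(1:4, c(0.10, 0.60, 0.25, 0.05))
cluster_bf_10(components)
```

A value above 1 indicates evidence in favor of clustering, relative to the prior odds implied by lambda.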
When save = TRUE
We would also like to note that if, for any reason, researchers wish to save the samples from the posterior distribution, then in order to obtain the summarized posterior clustering information, they need to use the function summarySBM as follows:
fit <- bgm(dataset,
           edge_prior = "Stochastic-Block",
           save = TRUE)

cluster_summarized <- summarySBM(fit)
easybgm
Everything that has been illustrated using bgms can be achieved more easily with easybgm. Currently, the version of easybgm that supports this analysis is available only on GitHub and can be installed as follows:
# Install easybgm (SBM branch) and all dependencies, including the GitHub version of 'bgms'
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("sekulovskin/easybgm",
                        ref = "SBM_in_easybgm", dependencies = TRUE)
Note for Windows users: If an error appears such as "cannot rename file ... reason 'Access is denied'", restart R or RStudio with administrative privileges (e.g., by right-clicking the icon and selecting "Run as administrator"), then re-run the installation commands.
We now simply perform the analysis as follows:
library(easybgm)

fit <- easybgm(dataset,
               type = "ordinal", # this analysis only works with binary and/or ordinal data
               edge_prior = "Stochastic-Block")
Now we do not need to perform any post-processing, dig into the objects containing the results of the analysis, or calculate the Bayes factors "by hand." We can simply use the summary function, and all the relevant output will be displayed.
summary(fit)
After the edge-specific overview (for more information, see Huth et al. (2024)), we see the cluster-specific overview, which includes the posterior probabilities for all possible numbers of clusters, the estimated node membership, as well as the Bayes factors testing the hypothesis of clustering and the null hypothesis of no clustering.
If you wish to calculate the second type of Bayes factor, following steps similar to those above, you can simply use the function clusterBayesfactor, setting the argument type = "simple" and specifying the values for \(b_1\) and \(b_2\).
clusterBayesfactor(fit, type = "simple", b1 = 1, b2 = 2)