A Tutorial on Clustering Analysis for Bayesian Graphical Models using the R packages bgms and easybgm
Introduction
This is a short software tutorial accompanying the paper “A Stochastic Block Prior for Clustering in Graphical Models”. The paper introduces the Stochastic Block Model as a prior on the network structure of the ordinal Markov random field graphical model (cf., Marsman et al., 2025). In addition to offering an approach for incorporating the theoretical assumption of clustering directly into the statistical model, the method provides a principled way to infer the number of clusters in the network given the data, and to formally test hypotheses about clustering.
In this brief tutorial, the analysis is first illustrated using the R package bgms (Marsman et al., 2023), which contains the core functions needed to implement the method. The same analysis is then demonstrated using the user-friendly wrapper package easybgm (Huth et al., 2024), which provides a more accessible interface for conducting the analysis, particularly for users without a strong background in R.
bgms
We first need to make sure that the latest CRAN version of bgms is installed.
install.packages("bgms")
library(bgms) # load the bgms package
Suppose the goal is to assess whether the network structure inferred from an empirical dataset exhibits clustering. This can be investigated by estimating the most representative vector of node allocations given the data, which reveals the cluster assignment of each node and thus provides insight into potential clustering in the network. In addition, the function returns the estimated posterior probabilities for different numbers of clusters, enabling hypothesis testing about clustering using Bayes factors. All of this functionality is available through the bgm function in the bgms package.
The first step is to estimate the model using bgm, setting the edge_prior argument to "Stochastic-Block". Several additional arguments are relevant when using the stochastic block model (SBM) as a prior for the network structure. One is the concentration parameter of the Dirichlet hyperprior on the cluster allocation probabilities (dirichlet_alpha). This parameter influences the expected cluster sizes: higher values favor more balanced cluster sizes, while lower values lead to more uneven, possibly sparse, cluster structures. The default is dirichlet_alpha = 1. Another argument is the rate parameter of the truncated Poisson prior (lambda), which governs the prior distribution over the number of clusters. This parameter encodes the number of clusters we expect before seeing the data; the default value lambda = 1 implies no clustering (i.e., the nodes in the network form one cluster).
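To build intuition for how lambda shapes the prior, the implied prior on the number of clusters can be computed directly. The short sketch below is illustrative only: it mimics a Poisson(lambda) prior truncated to between 1 and p clusters, and is not a function from bgms.

```r
# Illustrative sketch (not part of bgms): prior probability of B clusters
# under a Poisson(lambda) prior truncated to 1 <= B <= p
prior_num_clusters <- function(lambda, p) {
  b <- 1:p
  w <- lambda^b / factorial(b) # unnormalized truncated-Poisson weights
  w / sum(w)
}

# With the default lambda = 1, most prior mass sits on a single cluster
round(prior_num_clusters(lambda = 1, p = 10), 3)
```

Increasing lambda shifts prior mass toward larger numbers of clusters, which is why the choice of lambda deserves attention before the analysis.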
In the example code below, the default values for these hyperparameters are used, along with the default settings for the remaining prior hyperparameters of the ordinal Markov random field (OMRF) model and other function arguments. Nevertheless, it is strongly recommended that researchers carefully evaluate these values when conducting their analyses. For further details, see, for example, Marsman et al. (2023) and Sekulovski et al. (2024).
In the code examples, dataset is a placeholder for the actual dataset to be analyzed. This dataset should be an \(n \times p\) matrix or data frame, where \(n\) denotes the number of observations and \(p\) the number of variables. The variables may be binary, ordinal, or a combination of both.
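For readers who want to run the code end to end without an empirical dataset, a synthetic stand-in can be generated. The dimensions and variable names below are arbitrary choices for illustration:

```r
# Synthetic stand-in for `dataset` (illustration only): 200 observations
# of 5 four-category ordinal variables scored 0-3
set.seed(123)
n <- 200
p <- 5
dataset <- matrix(sample(0:3, n * p, replace = TRUE), nrow = n, ncol = p)
colnames(dataset) <- paste0("V", 1:p)
dim(dataset) # 200 rows, 5 columns
```

Because these values are independent noise, an analysis of this toy dataset should favor the no-clustering hypothesis.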
library(bgms) # load the bgms package

# run the model
fit <- bgm(dataset, edge_prior = "Stochastic-Block")
First, we check the estimated (posterior) cluster allocation vector. The output is a vector whose length equals the number of variables (columns) in the dataset. There are two options for summarizing it: the posterior mean vector and the posterior mode vector. To examine these, we can inspect the allocation vectors in the fit object, where the results of our analysis are stored.
cat("Posterior mean:\n")
print(fit$posterior_mean_allocations)
cat("\nPosterior mode:\n")
print(fit$posterior_mode_allocations)
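To see what the modal allocation vector tells us, consider a hypothetical allocation for a six-node network (made up for illustration, not model output). table() gives the size of each cluster, and the number of unique labels gives the implied number of clusters:

```r
# Hypothetical modal allocation vector (illustration only, not model output)
alloc <- c(1, 1, 2, 2, 2, 1)

table(alloc)          # cluster sizes: three nodes in cluster 1, three in cluster 2
length(unique(alloc)) # number of distinct clusters in the modal allocation
```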
We can also examine the estimated posterior probabilities for all possible numbers of clusters (blocks). In the software, this ranges from 1 (i.e., no clustering) to the total number of variables in the dataset (i.e., each variable forms its own cluster).
fit$posterior_num_blocks
These posterior probabilities are particularly useful for calculating the cluster Bayes factors, which are defined as follows:
\[\text{BF}_{10} = \underbrace{\frac{p(\mathcal{H}_1\mid \text{data})}{p(\mathcal{H}_0 \mid \text{data})}}_{\text{Posterior Odds}} \bigg/ \underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}}_{\text{Prior Odds}}\]
Depending on the hypotheses, we consider two types of Bayes factors:
- The Bayes factor testing the hypothesis of clustering, i.e., \(\mathcal{H}_1: B > 1\), against the null hypothesis of no clustering \(\mathcal{H}_0: B = 1\). For this, we first need to calculate the posterior odds of the two hypotheses, which is simply the ratio of the posterior probabilities of the two hypotheses. For \(\mathcal{H}_1: B > 1\), we would need to sum the posterior probabilities of all possible numbers of clusters greater than 1.
H1 <- unname(fit$posterior_num_blocks[-1, 1]) # posterior probabilities under H1
H0 <- unname(fit$posterior_num_blocks[1, 1])  # posterior probability under H0

posterior_odds <- sum(H1) / H0
After calculating the posterior odds, we need to divide these by the prior odds in order to obtain the Bayes factor. For this Bayes factor, the prior odds are defined as:
\[ \frac{(\exp(\lambda) - 1 - \lambda)}{\lambda}.\] We calculate the prior odds as follows:
lambda <- fit$arguments$lambda # extract the value of lambda used (here, the default)
prior_odds <- (exp(lambda) - 1 - lambda) / lambda
Finally, the Bayes factor is given by:
BF_10 <- posterior_odds / prior_odds

cat("Bayes factor in favor of H1:\n")
BF_10
cat("Bayes factor in favor of H0:\n")
1 / BF_10
- The Bayes factor testing the hypothesis of a specific number of clusters, i.e., \(\mathcal{H}_1: B = b_1\), against \(\mathcal{H}_2: B = b_2\). Say we wish to test the hypothesis of one cluster against the hypothesis that there are two clusters (i.e., \(b_1 = 1\) and \(b_2 = 2\)). For this, we follow similar steps as above.
H1 <- unname(fit$posterior_num_blocks[1, 1]) # posterior probability under H1 (i.e., B = 1)
H2 <- unname(fit$posterior_num_blocks[2, 1]) # posterior probability under H2 (i.e., B = 2)

posterior_odds <- H1 / H2
The prior odds for this Bayes factor are given by: \[\lambda^{b_1 - b_2} \cdot \frac{b_2!}{b_1!}\]
prior_odds <- lambda^(1 - 2) * factorial(2) / factorial(1)
Finally, the Bayes factor is given by:
BF_12 <- posterior_odds / prior_odds

cat("Bayes factor in favor of H1:\n")
BF_12
cat("Bayes factor in favor of H2:\n")
1 / BF_12
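The manual calculation above can be wrapped in a small helper. The function below is hypothetical (not part of bgms); it assumes the fit object stores the posterior probabilities in posterior_num_blocks and the rate parameter in arguments$lambda, as in the examples above, and it is demonstrated here on a made-up posterior rather than real output.

```r
# Hypothetical helper (not part of bgms): Bayes factor for B = b1 vs B = b2,
# assuming `fit` stores posterior probabilities as in the examples above
cluster_bf <- function(fit, b1, b2) {
  post <- fit$posterior_num_blocks
  lambda <- fit$arguments$lambda
  posterior_odds <- unname(post[b1, 1]) / unname(post[b2, 1])
  prior_odds <- lambda^(b1 - b2) * factorial(b2) / factorial(b1)
  posterior_odds / prior_odds
}

# Made-up posterior over B = 1, 2, 3 (for demonstration only)
fake_fit <- list(
  posterior_num_blocks = matrix(c(0.5, 0.3, 0.2), ncol = 1),
  arguments = list(lambda = 1)
)
cluster_bf(fake_fit, b1 = 1, b2 = 2)
```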
Finally, researchers can also use the co-clustering matrix, a symmetric matrix that contains, for each pair of variables, the proportion of posterior samples in which the two variables are assigned to the same cluster. This matrix can be plotted to visually assess the estimated number of clusters and to identify nodes that tend to switch between clusters.
install.packages("pheatmap") # if not already installed
library(pheatmap)
pheatmap(fit$posterior_mean_coclustering_matrix,
cluster_rows = TRUE,
cluster_cols = TRUE,
display_numbers = FALSE
)
easybgm
Everything that has been illustrated using bgms can be achieved more easily with easybgm. Currently, the version of easybgm that includes the SBM prior is not yet available on CRAN; the update will be released there shortly. In the meantime, if you encounter any errors after installing the packages as described below, please feel free to contact the author of this blog post or open an issue on GitHub. For now, the following code installs the stable GitHub version. On Windows, you must install Rtools before installing from GitHub, since the packages are compiled from source. On macOS and Linux, make sure the system’s build tools are installed (Xcode Command Line Tools on macOS, build-essential on Linux).
# Install easybgm (SBM branch)
# and all dependencies, including the GitHub version of 'bgms'
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("sekulovskin/easybgm",
                        ref = "SBM_in_easybgm", dependencies = TRUE)
Note for Windows users: If an error appears such as "cannot rename file ... reason 'Access is denied'"
, restart R or RStudio with administrative privileges (e.g., by right-clicking the icon and selecting “Run as administrator”), then re-run the installation commands.
We now simply execute the analysis as follows:
library(easybgm)

fit <- easybgm(dataset,
               type = "ordinal", # this analysis only works with binary and/or ordinal data
               edge_prior = "Stochastic-Block")
There is no need to perform any post-processing, manually inspect the result objects, or calculate Bayes factors “by hand.” Instead, the summary function can be used to display all relevant output directly.
summary(fit)
Following the edge-specific overview (for more details, see Huth et al. (2024)), the output provides a cluster-specific overview. This includes the posterior probabilities for all possible numbers of clusters, the estimated node memberships, and the Bayes factors comparing the hypothesis of clustering against the null hypothesis of no clustering.
In case you wish to calculate the second type of Bayes factor, similar to the steps above, you can simply use the function clusterBayesfactor, setting the argument type = "simple" and specifying the values for \(b_1\) and \(b_2\).
clusterBayesfactor(fit, type = "simple", b1 = 1, b2 = 2)