A Tutorial on Clustering Analysis for Bayesian Graphical Models using the R packages bgms and easybgm


Introduction

This is a short software tutorial accompanying the paper “A Stochastic Block Prior for Clustering in Graphical Models”. The paper introduces the Stochastic Block Model as a prior on the network structure of the ordinal Markov random field (cf., Marsman et al., 2025). In addition to offering an approach for incorporating the theoretical assumption of clustering directly into the statistical model, the method provides a principled way to infer the number of clusters in the network given the data, and to formally test hypotheses about clustering.

In this brief tutorial, the analysis is first illustrated using the R package bgms (Marsman et al., 2023), which contains the core functions needed to implement the method. The same analysis is then demonstrated using the user-friendly wrapper package easybgm (Huth et al., 2024). The package easybgm provides a more accessible interface for conducting the analysis, particularly for users without a strong background in R.

bgms

Currently, the version of bgms that includes the SBM prior is not yet available on CRAN. Therefore, we need to use the following code to install the stable GitHub version:

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("Bayesian-Graphical-Modelling-Lab/bgms")

Now, suppose the goal is to assess whether the network structure inferred from an empirical dataset exhibits clustering. This can be examined by estimating the most representative vector of node allocations given the data, which indicates the cluster assignment of each node and thus provides insight into potential clustering in the network. In addition, it is possible to test hypotheses about clustering using Bayes factors. All of this can be achieved using the bgm function in the bgms package.

The first step is to estimate the model using bgm, setting the edge_prior argument to "Stochastic-Block". Several additional arguments are relevant when using the stochastic block model (SBM) as a prior for the network structure. One is the concentration parameter of the Dirichlet hyperprior on the cluster allocation probabilities (dirichlet_alpha). This parameter influences the expected cluster sizes: higher values favor more balanced cluster sizes, while lower values lead to more uneven, possibly sparse, cluster structures. The default is dirichlet_alpha = 1. Another argument is the rate parameter of the truncated Poisson prior (denoted as lambda), which governs the prior distribution over the number of clusters. This parameter encodes the expected number of clusters a priori; the default value lambda = 1 places most prior mass on a single cluster, reflecting the prior assumption of no clustering.

In the example code below, the default values for these hyperparameters are used, as well as the default settings for the remaining function arguments. However, it is strongly recommended that researchers carefully consider these values when conducting their own analyses. For more details, see, for example, Marsman et al. (2023). If the argument save = TRUE is used, refer to the subsection below for an additional step needed to retrieve the summarized results. In the code examples below, dataset is a placeholder for the actual dataset to be analysed. This should be an \(n \times p\) matrix or data frame, where \(n\) is the number of observations and \(p\) is the number of variables. These variables can be binary, ordinal, or a combination of both.
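Since dataset is only a placeholder, the sketch below constructs an artificial stand-in so that the subsequent code is self-contained. The dimensions, score range, and variable names are arbitrary choices made purely for illustration and carry no substantive meaning.

```r
# Hypothetical stand-in for an empirical dataset: 200 observations on
# 6 ordinal variables scored 0-3 (purely illustrative, not real data).
set.seed(123)
n <- 200
p <- 6
dataset <- matrix(sample(0:3, n * p, replace = TRUE), nrow = n, ncol = p)
colnames(dataset) <- paste0("V", 1:p)
dim(dataset) # an n x p matrix, as expected by bgm()
```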

library(bgms) # load the bgms package 
# run the model
fit <- bgm(dataset, edge_prior = "Stochastic-Block")

First, we examine the estimated (posterior) cluster allocation vector. The output is a vector whose length equals the number of variables (columns) in the dataset. To access it, we extract the allocations element from the fit object, where the results of the analysis are stored.

fit[["allocations"]]

We can also examine the estimated posterior probabilities for all possible numbers of clusters. In the software, this ranges from 1 (i.e., no clustering) to the total number of variables in the dataset (i.e., each variable forms its own cluster).

fit[["components"]]

These posterior probabilities are particularly useful for calculating the cluster Bayes factors, which are defined as follows:

\[\text{BF}_{10} = \underbrace{\frac{p(\mathcal{H}_1\mid \text{data})}{p(\mathcal{H}_0 \mid \text{data})}}_{\text{Posterior Odds}} \bigg/ \underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}}_{\text{Prior Odds}}\]

Depending on the hypotheses, we consider two types of Bayes factors:

  • The Bayes factor testing the hypothesis of clustering, i.e., \(\mathcal{H}_1: B > 1\), against the null hypothesis of no clustering, \(\mathcal{H}_0: B = 1\). For this, we first calculate the posterior odds of the two hypotheses, which is simply the ratio of their posterior probabilities. For \(\mathcal{H}_1: B > 1\), we sum the posterior probabilities of all numbers of clusters greater than 1.

H1 <- unname(fit[["components"]][-1, 2]) # posterior probabilities under H1 (B > 1)
H0 <- unname(fit[["components"]][1, 2])  # posterior probability under H0 (B = 1)

posterior_odds <- sum(H1) / H0

After calculating the posterior odds, we need to divide these by the prior odds in order to obtain the Bayes factor. For this Bayes factor, the prior odds are defined as:

\[\frac{\exp(\lambda) - 1 - \lambda}{\lambda}.\]

Since we keep the default value of the Poisson rate parameter \(\lambda\), we calculate the prior odds as follows:

lambda <- 1
prior_odds <- (exp(lambda) - 1 - lambda) / lambda
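As a quick sanity check (not part of the bgms API), this closed-form expression can be verified numerically: under the truncated Poisson prior, \(p(B = k) \propto \lambda^k / k!\), and the normalizing constant cancels in the odds ratio.

```r
# Verify the closed-form prior odds p(B > 1) / p(B = 1) numerically.
lambda <- 1
k <- 1:170                              # cluster counts; truncation far in the tail
prior_mass <- lambda^k / factorial(k)   # unnormalized truncated Poisson masses
numeric_odds <- sum(prior_mass[-1]) / prior_mass[1] # p(B > 1) / p(B = 1)
closed_form <- (exp(lambda) - 1 - lambda) / lambda
all.equal(numeric_odds, closed_form) # TRUE
```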

Finally, the Bayes factor is given by:

BF_10 <- posterior_odds / prior_odds
BF_10  # Bayes factor in favor of H1
1/BF_10 # Bayes factor in favor of H0
  • The Bayes factor testing the hypothesis of a specific number of clusters, i.e., \(\mathcal{H}_1: B = b_1\), against \(\mathcal{H}_2: B = b_2\). Say we wish to test the hypothesis of one cluster against the hypothesis of two clusters (i.e., \(b_1 = 1\) and \(b_2 = 2\)). For this, we follow similar steps as above.

H1 <- unname(fit[["components"]][1, 2]) # posterior probability under H1 (B = 1)
H2 <- unname(fit[["components"]][2, 2]) # posterior probability under H2 (B = 2)
posterior_odds <- H1 / H2

The prior odds for this Bayes factor are given by: \[\lambda^{b_1 - b_2} \cdot \frac{b_2!}{b_1!}\]

prior_odds <- lambda^(1 - 2) * factorial(2) / factorial(1)
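Again as a sanity check, this expression is simply the ratio of two truncated Poisson masses, \(p(B = b_1) / p(B = b_2)\), with the normalizing constant cancelling:

```r
# Verify the closed-form prior odds for B = b1 versus B = b2.
lambda <- 1
b1 <- 1
b2 <- 2
mass <- function(k) lambda^k / factorial(k) # unnormalized truncated Poisson mass
numeric_odds <- mass(b1) / mass(b2)         # p(B = b1) / p(B = b2)
closed_form <- lambda^(b1 - b2) * factorial(b2) / factorial(b1)
all.equal(numeric_odds, closed_form) # TRUE
```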

Finally, the Bayes factor is given by:

BF_12 <- posterior_odds / prior_odds
BF_12  # Bayes factor in favor of H1
1/BF_12 # Bayes factor in favor of H2

When save = TRUE

Note that if researchers wish to save the samples from the posterior distribution (save = TRUE), an additional step is needed to obtain the summarized posterior clustering information: the function summarySBM must be applied to the fitted object, as follows:

fit <- bgm(dataset, 
           edge_prior = "Stochastic-Block", 
           save = TRUE)
cluster_summarized <- summarySBM(fit)

easybgm

Everything that has been illustrated using bgms can be more easily achieved using easybgm. Currently, the version of easybgm that supports this analysis is available only on GitHub and can be installed as follows:

# Install easybgm (SBM branch) and all dependencies including the GitHub version of 'bgms'
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("sekulovskin/easybgm", 
                        ref = "SBM_in_easybgm", dependencies = TRUE)

Note for Windows users: If an error appears such as "cannot rename file ... reason 'Access is denied'", restart R or RStudio with administrative privileges (e.g., by right-clicking the icon and selecting “Run as administrator”), then re-run the installation commands.

We now simply perform the analysis as follows:

library(easybgm)

fit <- easybgm(dataset, 
               type = "ordinal", # this analysis only works with binary and/or ordinal data
               edge_prior = "Stochastic-Block")

Now, we no longer need to perform any post-processing, dig into the objects containing the results, or calculate the Bayes factors “by hand.” We can simply use the summary function, and all the relevant output will be displayed.

summary(fit)

After the edge-specific overview (for more information, see Huth et al. (2024)), we see the cluster-specific overview, which includes the posterior probabilities for all possible numbers of clusters, the estimated node memberships, and the Bayes factor testing the hypothesis of clustering against the null hypothesis of no clustering.

If you wish to calculate the second type of Bayes factor, following the steps above, you can use the function clusterBayesfactor, setting the argument type = "simple" and specifying the values of \(b_1\) and \(b_2\).

clusterBayesfactor(fit, type = "simple", b1 = 1, b2 = 2)

References

Huth, K. B. S., Keetelaar, S., Sekulovski, N., Bergh, D. van den, & Marsman, M. (2024). Simplifying Bayesian analysis of graphical models for the social sciences with easybgm: A user-friendly R package. AIP Advances, e66366. https://doi.org/10.56296/aip00010
Marsman, M., Bergh, D. van den, & Haslbeck, J. M. B. (2025). Bayesian analysis of the ordinal Markov random field. Psychometrika. https://doi.org/10.1017/psy.2024.4
Marsman, M., Huth, K., Sekulovski, N., & Bergh, D. van den. (2023). bgms: Bayesian variable selection for networks of binary and/or ordinal variables. https://CRAN.R-project.org/package=bgms