Tumorigenesis is a stepwise process driven by a sequence of molecular changes that are described as pathways of cancer progression. Conjunctive Bayesian Networks (CBN) are probabilistic graphical models designed for the analysis and modeling of these pathways [1]. CBN models have evolved into different varieties such as CT-CBN [2], H-CBN [3], B-CBN [4], and R-CBN [5], each addressing different aspects of this task. However, the software corresponding to these methods is not well integrated because they are implemented in different languages with heterogeneous input and output formats. This necessitates a unifying platform that integrates these models and enables the standardization of input and output formats. Evam-tools [6] is an R package that takes the initial steps towards this end. However, it does not include the B-CBN model or the recently developed R-CBN algorithm, which focuses on robust inference of cancer progression pathways [5]. Importantly, the B-CBN and R-CBN algorithms for pathway quantification necessitate exhaustive consideration and weighting of all potential dependency structures (posets) within the mutational quartets. This requires reimplementation of the CBN models and adjustment of downstream pathway analysis and modeling functions. Therefore, here we introduce the CBN2Path R package that not only includes the original implementation of the CBN models (e.g., CT-CBN and H-CBN) in a unifying interface but also accommodates the necessary modifications to support robust CBN algorithms (e.g., B-CBN and R-CBN). Importantly, CBN2Path includes a collection of functions required to quantify predictability [7], analyze robustness [5], and visualize mutational pathways from pre-processed cross-sectional genomic data. It is important to note that the R-CBN method has great potential for wide application in future predictive models because of its unique ability to offer an optimal balance between robustness and predictability [5]. Thus, we anticipate that CBN2Path will be a commonly used package in the field, particularly by providing a platform to facilitate future applications of the R-CBN model.
The package can be installed as follows:
If you use the CBN2Path package, please cite the paper formally as follows:
Choi-Kim W and Hosseini SR. CBN2Path: an R/Bioconductor package for the analysis of cancer progression pathways using Conjunctive Bayesian Networks. F1000Research 2025, 14:834 (https://doi.org/10.12688/f1000research.168810.1)
The genotype matrix is a binary matrix in which each row represents a sample and each column represents a gene. The element (i,j) of the matrix is 1 if the i-th sample contains a nonsense or missense mutation in the j-th gene, and zero otherwise. In the example below, a genotype matrix is generated for 414 Bladder Urothelial Carcinoma samples in the “TCGA-BLCA” data. Note that we in this example we have used “TP53”, “ARID1A”, “KDM6A” and “PIK3CA” as our genes of interest to genotype the samples.
CBN2Path provides the R interface for the continuous-time CBN model (CT-CBN), which was originally implemented in C programming language [2].
CT-CBN needs the following three inputs:
numMutations: The number of
mutations to be considered.
poset: The partially ordered set
(poset), which is represented as a two-column matrix each row of which
indicates that mutation in the first column must occur before the
mutation in the second column.
genotypeMatrix: a binary matrix
with n rows and m columns. Each row
corresponds to a given genotype in the sample.
Below, you can see an example on how these inputs are prepared to be
used in the original implementation of the ctcbn model
(ctcbnSingle):
# The poset
dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2)
# The genotype matrix:
# gMat, which was generated using the generateTCGAMatrix function (see above).
# Preparing input of the ct-cbn/h-cbn methods
bc <- Spock$new(
poset = dag,
numMutations = 4,
genotypeMatrix = gMat
)
# Running the ct-cbn model
resultsC <- ctcbnSingle(bc)The directed acyclic graph representation of the input poset can be
visualized using the visualizeCBNModel
function as follows:
Furthermore, the Lambda parameters and the corresponding log likelihood can be obtained as:
The genotypeMatrixMutator function
generates a modified genotype matrix (gMatMut) by
subjecting the original genotype matrix (gMat) to
false-positive and false-negative error rates of 0.3
and 0.2, respectively:
and then you can rerun the ct-cbn model using the mutated genotype matrix as follows:
# The poset
dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2)
# Preparing the inputs of the ct-cbn method
bc <- Spock$new(
poset = dag,
numMutations = 4,
genotypeMatrix = gMatMut
)
# Running the ct-cbn model
resultsCMut <- ctcbnSingle(bc)Note that if the posets and genotype data are stored in the original
format needed in the C implementation of the CBN models, you can
preprocess those files using readPoset and
readPattern functions in the CBN2Path
package. You can see an example below, where the number of mutations is
10:
examplePath <- getExamples()[1]
bc <- Spock$new(
poset = readPoset(examplePath)$sets,
numMutations = readPoset(examplePath)$mutations,
genotypeMatrix = readPattern(examplePath)
)
resultsC2 <- ctcbnSingle(bc)Finally, you can obtain and store the results using a list of posets
(instead of a single poset) as the input of the model using the
ctcbn function. In the example below, we
will consider the set of all potential (transitively-closed) DAGs of
size four (219 unique posets) as our list of posets that we use as the
input of the ctcbn function:
posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path"))
bc <- Spock$new(
poset = posets,
numMutations = 4,
genotypeMatrix = gMat
)
resultsC3 <- ctcbn(bc, nCores = 3)You can see that the result is a list of 219 components each including the estimated parameters corresponding to one of the 219 posets in the input list. This strategy is utilized in the R-CBN and B-CBN models for quantifying the pathway probabilities.
You can obtain the log likelihood corresponding to each poset as follows:
Below, you can see that the maximum-likelihood poset according to the CT-CBN model in this example is the empty poset.
The input/output structure of the H-CBN model
(hcbnSingle and
hcbn) [3] is exactly the same as in the
CT-CBN model (ctcbnSingle and
ctcbn) described above.
# The poset
dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2)
# Preparing the inputs of the h-cbn method
bc <- Spock$new(
poset = dag,
numMutations = 4,
genotypeMatrix = gMat
)
# Running the h-cbn model
resultsH <- hcbnSingle(bc)The Lambda parameters and the corresponding log likelihood can be obtained as:
You can also rerun the h-cbn model using the mutated genotype matrix as follows:
# The poset
dag <- matrix(c(3, 3, 4, 4, 1, 2, 1, 2), 4, 2)
# Preparing the inputs of the h-cbn method
bc <- Spock$new(
poset = dag,
numMutations = 4,
genotypeMatrix = gMatMut
)
# Running the h-cbn model
resultsHMut <- hcbnSingle(bc)Again the poset and genotype files stored in the original formats can
be processed using readPoset and
readPattern functions in the CBN2Path
package.
examplePath <- getExamples()[1]
bc <- Spock$new(
poset = readPoset(examplePath)$sets,
numMutations = readPoset(examplePath)$mutations,
genotypeMatrix = readPattern(examplePath)
)
resultsH2 <- hcbnSingle(bc)Finally, you can obtain and store the results using a list of posets
(instead of a single poset) as the input of the model using the
hcbn function.
posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path"))
bc <- Spock$new(
poset = posets,
numMutations = 4,
genotypeMatrix = gMat
)
resultsH3 <- hcbn(bc, nCores = 3)You can obtain the log likelihood corresponding to each poset as follows:
Below, you can see that in this example, the maximum-likelihood poset based on H-CBN contains an edge in contrast to the CT-CBN model, which outputs an empty poset as the MLE poset:
One of the important feature of the CBN2Path package is its emphasis on mutational pathway analyses and modeling. In this section, we will work with a set of functions that enable quantification, analysis and visualization of the mutational pathways.
There are two approaches to quantify the pathway probabilities:
The first approach is to use the output of the ct-cbn or h-cbn
methods (namely the estimated lambda parameters and the inferred ML
poset) as input of the pathProbCBN
function. This method is generic as it can be used for every number of
mutations.
The second approach works only for mutational quartets
(n=4) and uses the genotypic data directly as the
input. In this setting, each CBN model has its own pathway
quantification functions:
pathProbQuartetCTCBN,
pathProbQuartetHCBN,
pathProbQuartetRCBN, and
pathProbQuartetBCBN.
As examples for the first approach, let’s use the
resultsC2 and resultsH2 that we obtained in
the previous section by learning respectively, ct-cbn and h-cbn models
on genotypic data with n=10 mutations. First, we need
to obtain the estimated lambda parameters and the inferred DAG and then
use them as the input of the pathProbCBN
function as follows:
lambdaC <- as.numeric(resultsC2$lambda)
lambdaH <- as.numeric(resultsH2$lambda)
dagC <- resultsC2$poset$sets
dagH <- resultsH2$poset$sets
probC <- pathProbCBN(dagC, lambdaC, 10)
probH <- pathProbCBN(dagH, lambdaH, 10)Note that in the above code, probabilities of 10!=3,628,800 pathways need to be calculated, which is extremely time-consuming, so we have not executed this chunk of the code.
Now, let’s try the second approach using both the gMat
and gMatMut genotype matrices. Note that in these
functions, the number of columns in the input genotype matrix must
always be four.
probC1 <- pathProbQuartetCTCBN(gMat)
probC2 <- pathProbQuartetCTCBN(gMatMut)
probH1 <- pathProbQuartetHCBN(gMat)
probH2 <- pathProbQuartetHCBN(gMatMut)You can visualize the 24 pathways and their associated probabilities
using visualizeProbabilities function. In
the figures below, you can compare the pathway probability distributions
(quantified using CT-CBN model) before and after errors:
You can see before adding errors (probC1), only four
pathways are feasible with non-zero probabilities, but in the presence
of error (probC2) all pathways are feasible and so the
probability is more uniformly distributed among the pathways.
We can now visualize the pathways with gene names and their
probability distributions (probC2) as follows:
Similarly, in the figures below, we can compare the pathway
probability distributions (quantified using H-CBN model) before
(probH1) and after errors (probH2):
We can see that under the H-CBN model, the pathway probabilities
before and after errors stay more similar than what we observed under
the CT-CBN model. We can formally quantify to what extent the two
probability distribution differ using the Jensen-Shannon Divergence
(JSD; implemented as
jensenShannonDivergence function).
jsdC <- jensenShannonDivergence(probC1, probC2)
jsdH <- jensenShannonDivergence(probH1, probH2)
jsdC
#> [1] 0.06213309
jsdH
#> [1] 0.2952957We can also quantify the predictability for a given pathway
probability distribution as described in [7] using the
predictability function. We can see that
the predictability after errors drops substantially under CT-CBN:
predC1 <- predictability(probC1, 4)
predC2 <- predictability(probC2, 4)
predC1
#> [1] 0.1059811
predC2
#> [1] 0.01232624
predC1 - predC2
#> [1] 0.09365483In contrast, under H-CBN, the predictability after errors, even gets slightly higher than the one obtained in the error-free setting:
predH1 <- predictability(probH1, 4)
predH2 <- predictability(probH2, 4)
predH1
#> [1] 0.4804701
predH2
#> [1] 0.4586789
predH1 - predH2
#> [1] 0.02179119Finally, we can compute the pathway compatibility scores both for
gMat and gMatMutgenotype matrices using
thepathwayCompatibilityQuartet function as follows:
The Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under CT-CBN or H-CBN can be quantified as follows:
rhoC1 <- cor(pathwayC1, probC1, method = "spearman")
rhoC2 <- cor(pathwayC2, probC2, method = "spearman")
rhoC1
#> [1] 0.9504348
rhoC2
#> [1] 0.9575631
rhoH1 <- cor(pathwayC1, probH1, method = "spearman")
rhoH2 <- cor(pathwayC2, probH2, method = "spearman")
rhoH1
#> [1] 0.5431513
rhoH2
#> [1] 0.5658366The R-CBN algorithm [5] for quantifying pathway probability
distributions is implemented in the
pathProbQuartetRCBN function, whose
input/output structure is similar to what we described above for the
CT-CBN and H-CBN. However, there are multiple functions, which are
called inside the pathProbQuartetRCBN
function, and so the user do not need to directly work with
(e.g. pathNormalization,
pathwayWeightingRCBN,
edgeMarginalized,
pathEdgeMapper, and
posetWeightingRCBN).
The pathProbQuartetRCBN function also
receives a four-column binary genotype matrix as the only input, and
outputs the corresponding pathway probability distribution. Let’s
quantify the pathway probability distributions before and after adding
errors:
In the figures below, you can compare the pathway probability distributions (quantified using R-CBN model) before and after errors:
Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as:
You can see that the JDS value under R-CBN in this example (0.05), is considerably smaller than those of CT-CBN (0.41) and H-CBN (0.31).
The predictability values can also be compared as follows:
predR1 <- predictability(probR1, 4)
predR2 <- predictability(probR2, 4)
predR1
#> [1] 0.364393
predR2
#> [1] 0.177655
predR1 - predR2
#> [1] 0.1867379Finally, the Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under R-CBN can be quantified as follows:
The workflow for the B-CBN algorithm [4], which is implemented in the
pathProbQuartetBCBN function, is also
similar to that of R-CBN.
The pathProbQuartetBCBN function also
receives a four-column binary genotype matrix as the only input, and
outputs the corresponding pathway probability distribution. Again, let’s
quantify the pathway probability distributions before and after the
errors:
In the figures below, you can compare the pathway probability distributions (quantified using B-CBN model) before and after errors:
Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as:
The predictability values can also be compared as follows:
predB1 <- predictability(probB1, 4)
predB2 <- predictability(probB2, 4)
predB1
predB2
predB1 - predB2Finally, the Spearman’s correlation coefficient between the pathway compatibility and the pathway probabilities quantified under B-CBN can be quantified as follows:
If we can establish a fitness landscape by directly assigning fitness
to all potential genotypes, we can employ evolutionary models to compute
the mutational pathway probabilities under the Strong-Selection
Weak-Mutation (SSWM) assumption [7], which is implemented in the
pathProbSSWM function.
In case of 4 mutations, we will have 16 genotypes, so a fitness
vector length of 16 is needed each element of which corresponds to a
given genotype that can be determined by the
generateMatrixGenotypes function. For
example:
fitnessVector <- c(0, 0.1, 0.2, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0, 0.6, 0.4, 0.3, 0.2, 1)
g <- generateMatrixGenotypes(4)The 7-th genotype in the g vector is “1 0 1 0” and its
corresponding fitness is fitnessVector[7] = 0.3.
The fitness landscape can be visualized as follows:
The pathway probability distribution under the SSWM-based model can be quantified as:
Moreover, the pathway probabilities can be visualized as:
and finally the associated predictability can be quantified as:
sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocParallel_1.45.0 TCGAbiolinks_2.39.0 CBN2Path_1.1.7
#> [4] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.3.0 gridExtra_2.3
#> [3] httr2_1.2.2 biomaRt_2.67.7
#> [5] rlang_1.2.0 magrittr_2.0.5
#> [7] matrixStats_1.5.0 compiler_4.5.3
#> [9] RSQLite_2.4.6 png_0.1-9
#> [11] vctrs_0.7.3 rvest_1.0.5
#> [13] stringr_1.6.0 pkgconfig_2.0.3
#> [15] crayon_1.5.3 fastmap_1.2.0
#> [17] dbplyr_2.5.2 XVector_0.51.0
#> [19] labeling_0.4.3 ggraph_2.2.2
#> [21] rmarkdown_2.31 tzdb_0.5.0
#> [23] purrr_1.2.2 bit_4.6.0
#> [25] xfun_0.57 cachem_1.1.0
#> [27] jsonlite_2.0.0 progress_1.2.3
#> [29] blob_1.3.0 DelayedArray_0.37.1
#> [31] tweenr_2.0.3 parallel_4.5.3
#> [33] prettyunits_1.2.0 R6_2.6.1
#> [35] bslib_0.10.0 stringi_1.8.7
#> [37] RColorBrewer_1.1-3 GenomicRanges_1.63.2
#> [39] jquerylib_0.1.4 Rcpp_1.1.1-1
#> [41] Seqinfo_1.1.0 SummarizedExperiment_1.41.1
#> [43] knitr_1.51 R.utils_2.13.0
#> [45] downloader_0.4.1 readr_2.2.0
#> [47] IRanges_2.45.0 Matrix_1.7-5
#> [49] igraph_2.2.3 tidyselect_1.2.1
#> [51] viridis_0.6.5 abind_1.4-8
#> [53] yaml_2.3.12 codetools_0.2-20
#> [55] curl_7.0.0 lattice_0.22-9
#> [57] tibble_3.3.1 plyr_1.8.9
#> [59] Biobase_2.71.0 withr_3.0.2
#> [61] KEGGREST_1.51.1 S7_0.2.1-1
#> [63] coda_0.19-4.1 evaluate_1.0.5
#> [65] polyclip_1.10-7 BiocFileCache_3.1.0
#> [67] xml2_1.5.2 Biostrings_2.79.5
#> [69] pillar_1.11.1 BiocManager_1.30.27
#> [71] filelock_1.0.3 MatrixGenerics_1.23.0
#> [73] stats4_4.5.3 generics_0.1.4
#> [75] vroom_1.7.1 S4Vectors_0.49.2
#> [77] hms_1.1.4 ggplot2_4.0.2
#> [79] scales_1.4.0 glue_1.8.1
#> [81] maketools_1.3.2 tools_4.5.3
#> [83] sys_3.4.3 data.table_1.18.2.1
#> [85] graphlayouts_1.2.3 buildtools_1.0.0
#> [87] XML_3.99-0.23 tidygraph_1.3.1
#> [89] cowplot_1.2.0 grid_4.5.3
#> [91] tidyr_1.3.2 AnnotationDbi_1.73.1
#> [93] patchwork_1.3.2 ggforce_0.5.0
#> [95] cli_3.6.6 rappdirs_0.3.4
#> [97] viridisLite_0.4.3 S4Arrays_1.11.1
#> [99] dplyr_1.2.1 gtable_0.3.6
#> [101] R.methodsS3_1.8.2 TCGAbiolinksGUI.data_1.31.0
#> [103] sass_0.4.10 digest_0.6.39
#> [105] BiocGenerics_0.57.1 SparseArray_1.11.13
#> [107] ggrepel_0.9.8 farver_2.1.2
#> [109] R.oo_1.27.1 memoise_2.0.1
#> [111] htmltools_0.5.9 lifecycle_1.0.5
#> [113] httr_1.4.8 bit64_4.6.0-1
#> [115] MASS_7.3-65[1] Beerenwinkel, et al. Conjunctive Bayesian Networks. Bernoulli, 13(4):893–909, November 2007. ISSN 1350-7265. doi: https://doi.org/10.3150/07-BEJ6133.
[2] Beerenwinkel and Sullivant. Markov models for accumulating mutations. Biometrika, 96 (3):645–661, September 2009. ISSN 0006-3444, 1464-3510. doi: https://doi.org/10.1093/biomet/asp023.
[3] Gerstung, et al. Quantifying cancer progression with conjunctive Bayesian networks. Bioinformatics (Oxford, England), 25(21):2809–2815, November 2009. doi: https://doi.org/10.1093/bioinformatics/btp505.
[4] Sakoparnig and Beerenwinkel. Efficient sampling for Bayesian inference of conjunctive Bayesian networks. Bioinformatics, 28(18):2318–2324, September 2012. ISSN 1367-4811, 1367-4803. doi: https://doi.org/10.1093/bioinformatics/bts433.
[5] Hosseini. Robust inference of cancer progression pathways using Conjunctive Bayesian Networks. BioRxiv, July 2025. doi: https://doi.org/10.1101/2025.07.15.663924.
[6] Diaz-Uriarte and Herrera-Nieto. EvAM-Tools: tools for evolutionary accumulation and cancer progression models. Bioinformatics, 38(24): 5457–5459, December 2022. ISSN 1367-4803, 1367-4811. doi: https://doi.org/10.1093/bioinformatics/btac710.
[7] Hosseini, et al. Estimating the predictability of cancer evolution. Bioinformatics, 35 (14):i389–i397, July 2019. ISSN 1367-4803, 1367-4811. doi: https://doi.org/10.1093/bioinformatics/btz332.