This vignette builds on the foundations established in
vignette("immLynx_vignette", package = "immLynx") to
explore advanced analytical workflows for T-cell receptor (TCR)
repertoire data. While the introductory vignette demonstrates each
analysis function individually, real-world studies often require
combining multiple methods to gain complementary views of repertoire
biology. For instance, sequence clustering identifies groups of
structurally related TCRs, while generation probability highlights
sequences that are unlikely to arise from random recombination and may
therefore reflect antigen-driven selection. Protein language model
embeddings offer yet another perspective by capturing learned
biochemical features that do not depend on predefined distance
metrics.
This vignette covers the following advanced topics:
We begin by loading the required packages and the example dataset.
The immLynx_example object is a
SingleCellExperiment containing scRNA-seq data with TCR
annotations from scRepertoire.
The MCL algorithm used by clusTCR has a key parameter,
inflation, that controls the granularity of cluster
assignments. Lower inflation values (e.g., 2.0) favor larger, more
inclusive clusters, while higher values (e.g., 3.0) produce tighter,
more specific groupings. Choosing the appropriate inflation parameter
depends on the biological question: broad clusters may capture groups of
TCRs with shared structural motifs, while tight clusters more closely
approximate antigen-specificity groups.
Here we run clusTCR with two different inflation parameters on the
same dataset and compare the resulting cluster counts. By using distinct
column_prefix values, both sets of assignments are stored
in the same SingleCellExperiment object for side-by-side
comparison.
sce_mcl_2 <- runClustTCR(immLynx_example,
inflation = 2.0,
column_prefix = "mcl_2")
sce_mcl_3 <- runClustTCR(sce_mcl_2,
inflation = 3.0,
column_prefix = "mcl_3")
cat("MCL (inflation=2):",
length(unique(na.omit(sce_mcl_3$mcl_2_TRB))),
"clusters\n")
cat("MCL (inflation=3):",
length(unique(na.omit(sce_mcl_3$mcl_3_TRB))),
"clusters\n")The ESM-2 family of protein language models ranges from 8 million to 15 billion parameters. Larger models generally produce embeddings that better capture structural and functional properties of protein sequences, but they require substantially more memory and computation time. For TCR CDR3 sequences, which are short (typically 10-20 amino acids), smaller models often perform well and may be preferable for large-scale analyses.
The following example compares embeddings from the 35M and 650M
parameter variants. Both sets of embeddings are stored as separate
dimensionality reductions in the SingleCellExperiment and
can be independently visualized with UMAP.
sce_small <- runEmbeddings(
immLynx_example,
model_name = "facebook/esm2_t12_35M_UR50D",
reduction_name = "esm_small"
)
sce_med <- runEmbeddings(
sce_small,
model_name = "facebook/esm2_t33_650M_UR50D",
reduction_name = "esm_medium"
)
sce_med <- scater::runUMAP(sce_med,
dimred = "esm_small",
name = "umap_small")
sce_med <- scater::runUMAP(sce_med,
dimred = "esm_medium",
name = "umap_medium")
p1 <- scater::plotReducedDim(sce_med,
dimred = "umap_small") +
ggtitle("ESM-2 35M")
p2 <- scater::plotReducedDim(sce_med,
dimred = "umap_medium") +
ggtitle("ESM-2 650M")
p1 + p2For paired alpha-beta TCR data, embedding both chains together
captures the joint receptor structure. The chains = "both"
option concatenates the alpha and beta CDR3 sequences before computing
embeddings, producing a single representation per cell that reflects the
full receptor identity. This is particularly useful when alpha chain
pairing contributes to antigen specificity.
immLynx is designed to complement scRepertoire, which defines clonotypes based on exact CDR3 sequence matching. By comparing clusTCR cluster assignments with scRepertoire’s clonotype definitions, we can assess how sequence-similarity-based clustering relates to strict clonotype identity. Ideally, each cluster should contain one or a few closely related clonotypes, and cells sharing the same clonotype should be assigned to the same cluster.
Generation probability (Pgen) from OLGA provides a window into selection pressures acting on the TCR repertoire. Sequences with low Pgen are rarely produced by V(D)J recombination; their presence in the observed repertoire suggests positive selection, potentially driven by antigen encounter. By comparing Pgen distributions across experimental conditions or sample types, we can identify conditions associated with enrichment of rare, potentially antigen-selected sequences.
sce_pgen <- runOLGA(immLynx_example,
chains = "TRB")
ggplot(as.data.frame(colData(sce_pgen)), aes(x = Type,
y = olga_pgen_log10_TRB)) +
geom_boxplot() +
labs(title = "TCR Generation Probability by Sample Type",
x = "Sample Type",
y = "log10(Pgen)")
low_pgen_cells <- which(colData(sce_pgen)$olga_pgen_log10_TRB < -15)
cat("Cells with Pgen < 10^-15:", length(low_pgen_cells), "\n")tcrdist3 distance matrices provide a continuous measure of TCR
similarity that can serve as input to a variety of unsupervised learning
methods. Here we demonstrate hierarchical clustering of the distance
matrix, which complements the MCL-based approach used by clusTCR. The
complete linkage method ensures that all members of a
cluster are within a maximum distance of each other, producing compact
groups. Note that because tcrdist3 deduplicates input sequences, the
distance matrix dimensions correspond to unique TCR sequences rather
than individual cells.
dist_results <- runTCRdist(immLynx_example, chains = "beta")
d <- as.dist(dist_results$distances$pw_beta)
hc <- hclust(d, method = "complete")
n_unique <- nrow(dist_results$distances$pw_beta)
clusters <- cutree(hc, k = min(50, n_unique))
cat("Distance matrix:", n_unique, "x", n_unique, "\n")
cat("Hierarchical clusters:", length(unique(clusters)), "\n")For datasets with thousands or tens of thousands of cells, computational resources may become a limiting factor. The following strategies can help manage memory and runtime:
chunk_size
parameter in runEmbeddings() controls how many sequences
are processed in each batch. Smaller chunks reduce peak memory usage at
the cost of slightly longer runtime.When HLA genotyping data is available, immLynx can test for
associations between metaclone assignments and specific HLA alleles
using runHLAassociation(). This analysis identifies
metaclones whose frequency differs significantly between HLA-positive
and HLA-negative individuals, suggesting potential HLA-restricted
antigen recognition. The function performs Fisher’s exact test for each
metaclone-HLA combination and returns adjusted p-values.
metaclones <- runMetaclonotypist(immLynx_example,
return_input = FALSE)
hla_data <- data.frame(
barcode = metaclones$barcode,
HLA_A_01_01 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE),
HLA_A_02_01 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE),
HLA_B_07_02 = sample(c(TRUE, FALSE),
nrow(metaclones),
replace = TRUE)
)
hla_results <- runHLAassociation(metaclones,
hla_data)
head(hla_results)All immLynx results can be exported to standard file formats for use
with external tools or for archiving. The
convertToTcrdist() function reformats TCR data into the
tabular format expected by standalone tcrdist3, facilitating
interoperability with Python-based analysis pipelines.
tcr_data <- extractTCRdata(immLynx_example,
chains = "both",
format = "wide")
tcrdist_format <- convertToTcrdist(tcr_data,
chains = "both")
write.csv(tcrdist_format, "tcr_for_tcrdist.csv", row.names = FALSE)
clusters <- data.frame(
barcode = colnames(sce_clust),
clustcr = sce_clust$clustcr_TRB
)
write.csv(clusters, "cluster_assignments.csv", row.names = FALSE)
embeddings <- reducedDim(sce_small, "esm_small")
write.csv(embeddings, "tcr_embeddings.csv")Data Quality: Always validate TCR data before
analysis using validateTCRdata(). This catches common
issues such as missing CDR3 sequences, non-standard amino acids, and
inconsistent chain annotations that can cause downstream errors or
misleading results.
Memory Management: For large datasets, use
smaller chunk sizes in runEmbeddings() and consider
subsampling strategies for distance-based methods.
Reproducibility: Set random seeds before running stochastic algorithms such as MCL clustering to ensure reproducible results.
Multiple Methods: Use multiple clustering or grouping methods and look for consensus. Sequences that are consistently co-clustered across methods are more likely to represent biologically meaningful groups.
Biological Validation: Computational clusters should be validated against biological evidence such as antigen specificity, phenotypic markers, or clinical outcomes whenever possible.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scater_1.39.4 ggplot2_4.0.3
#> [3] scran_1.39.2 scuttle_1.21.6
#> [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#> [7] Biobase_2.71.0 GenomicRanges_1.63.2
#> [9] Seqinfo_1.1.0 IRanges_2.45.0
#> [11] S4Vectors_0.49.2 BiocGenerics_0.57.1
#> [13] generics_0.1.4 MatrixGenerics_1.23.0
#> [15] matrixStats_1.5.0 immLynx_0.99.4
#> [17] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.3 vipor_0.4.7
#> [4] dplyr_1.2.1 farver_2.1.2 viridis_0.6.5
#> [7] filelock_1.0.3 S7_0.2.2 fastmap_1.2.0
#> [10] bluster_1.21.1 digest_0.6.39 rsvd_1.0.5
#> [13] lifecycle_1.0.5 cluster_2.1.8.2 statmod_1.5.1
#> [16] magrittr_2.0.5 compiler_4.6.0 rlang_1.2.0
#> [19] sass_0.4.10 tools_4.6.0 igraph_2.3.0
#> [22] yaml_2.3.12 knitr_1.51 S4Arrays_1.11.1
#> [25] dqrng_0.4.1 reticulate_1.46.0 DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.45.0
#> [31] withr_3.0.2 sys_3.4.3 grid_4.6.0
#> [34] beachmat_2.27.5 edgeR_4.9.9 scales_1.4.0
#> [37] cli_3.6.6 rmarkdown_2.31 metapod_1.19.2
#> [40] immApex_1.5.4 ggbeeswarm_0.7.3 cachem_1.1.0
#> [43] stringr_1.6.0 parallel_4.6.0 BiocManager_1.30.27
#> [46] XVector_0.51.0 basilisk_1.23.0 vctrs_0.7.3
#> [49] Matrix_1.7-5 jsonlite_2.0.0 dir.expiry_1.19.0
#> [52] BiocSingular_1.27.1 BiocNeighbors_2.5.4 ggrepel_0.9.8
#> [55] beeswarm_0.4.0 irlba_2.3.7 maketools_1.3.2
#> [58] locfit_1.5-9.12 limma_3.67.3 jquerylib_0.1.4
#> [61] glue_1.8.1 codetools_0.2-20 stringi_1.8.7
#> [64] gtable_0.3.6 ScaledMatrix_1.19.0 tibble_3.3.1
#> [67] pillar_1.11.1 htmltools_0.5.9 R6_2.6.1
#> [70] evaluate_1.0.5 lattice_0.22-9 png_0.1-9
#> [73] bslib_0.10.0 Rcpp_1.1.1-1 gridExtra_2.3
#> [76] SparseArray_1.11.13 xfun_0.57 buildtools_1.0.0
#> [79] pkgconfig_2.0.3