T-cell receptors (TCRs) are heterodimeric surface proteins that mediate adaptive immune recognition. Each TCR consists of an alpha and beta chain (or gamma and delta in a smaller subset of T cells), with the complementarity-determining region 3 (CDR3) serving as the primary determinant of antigen specificity. Advances in single-cell RNA sequencing (scRNA-seq) now enable simultaneous profiling of gene expression and TCR sequences at cellular resolution, producing rich datasets that benefit from computational analysis of repertoire diversity, clonal structure, and sequence similarity.
A growing ecosystem of Python-based tools has emerged for TCR repertoire analysis, including tcrdist3 for pairwise distance calculation, OLGA for generation probability estimation, soNNia for selection inference, clusTCR for sequence clustering, metaclonotypist for metaclone discovery, and the ESM-2 family of protein language models for learned sequence representations. While these tools are powerful individually, using them together requires managing multiple Python environments and translating data between frameworks—a barrier for researchers working primarily in R.
immLynx addresses this challenge by providing a
unified R interface to these Python tools through the basilisk
package, which manages isolated conda environments transparently. All
functions accept and return SingleCellExperiment objects,
ensuring seamless integration with the Bioconductor single-cell
ecosystem, including scRepertoire
for clonotype assignment and immApex
for TCR data extraction. The package supports the following analytical
capabilities:
This vignette demonstrates the core functionality of immLynx, walking
through data preparation, individual analysis functions, and a combined
workflow. For advanced use cases including method comparison and custom
embedding strategies, see the companion vignette
vignette("advanced_analysis", package = "immLynx").
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("immLynx")immLynx uses basilisk to manage Python dependencies automatically. The first time you call a function that invokes Python, basilisk will create an isolated conda environment containing all required packages (tcrdist3, olga, soNNia, clusTCR, metaclonotypist, and PyTorch with the ESM-2 model). This initial setup may take several minutes but only needs to occur once per installation. Subsequent calls reuse the existing environment with minimal overhead.
We begin by loading immLynx along with scran and
scater,
which provide utilities for single-cell analysis and visualization. The
immLynx_example dataset is a
SingleCellExperiment object containing scRNA-seq data from
multiple patients, with TCR repertoire information added via scRepertoire.
library(immLynx)
library(scran)
library(scater)
data("immLynx_example")
immLynx_example
#> class: SingleCellExperiment
#> dim: 2000 500
#> metadata(0):
#> assays(2): counts logcounts
#> rownames(2000): TRBV7-2 HSPA1B ... AC119396.1 BHLHE40
#> rowData names(0):
#> colnames(500): P17B_AAACCTGCAACACGCC-1 P17B_AAACGGGTCTCCAGGG-1 ...
#> P20L_TCAACGATCCCTCAGT-1 P20L_TCTGGAAGTGGAAAGA-1
#> colData names(16): orig.ident nCount_RNA ... ident clusters
#> reducedDimNames(1): UMAP
#> mainExpName: RNA
#> altExpNames(0):Before running downstream analyses, it is often useful to extract TCR
data into a data.frame format for inspection and quality control. The
extractTCRdata() function wraps
immApex::getIR() and supports both long format (one row per
chain) and wide format (one row per cell with columns for each chain).
The validateTCRdata() function checks for common issues
such as missing CDR3 sequences, non-standard amino acids, and
inconsistent chain annotations.
tcr_data <- extractTCRdata(immLynx_example,
chains = "TRB")
head(tcr_data)
#> cdr3_aa v d j c barcode chain
#> 1 CAISEQGKGELFF TRBV10-3 <NA> TRBJ2-2 TRBC2 P17B_AAACCTGCAACACGCC-1 TRB
#> 2 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AAACGGGTCTCCAGGG-1 TRB
#> 4 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_AACTTTCTCAGCGACC-1 TRB
#> 6 CASSSRQDSTDTQYF TRBV27 <NA> TRBJ2-3 TRBC2 P17B_AATCGGTCAAGCCCAC-1 TRB
#> 8 CASSVRRERANTGELFF TRBV9 <NA> TRBJ2-2 TRBC2 P17B_ACGAGCCTCATTTGGG-1 TRB
#> 9 CASSPRAGTPNTEAFF TRBV4-1 <NA> TRBJ1-1 TRBC1 P17B_ACGCCAGGTTCATGGT-1 TRB
tcr_wide <- extractTCRdata(immLynx_example,
chains = "both",
format = "wide")
validation <- validateTCRdata(tcr_data)
print(validation)
#> $valid
#> [1] TRUE
#>
#> $errors
#> character(0)
#>
#> $warnings
#> character(0)
#>
#> $summary
#> $summary$n_rows
#> [1] 306
#>
#> $summary$n_unique_barcodes
#> [1] 306
#>
#> $summary$n_unique_cdr3
#> [1] 275
#>
#> $summary$n_na_cdr3
#> [1] 0
#>
#> $summary$chains
#> [1] "TRB"The summarizeTCRrepertoire() function computes standard
repertoire statistics including clonotype frequencies, diversity indices
(Shannon entropy, Simpson index, clonality), CDR3 length distributions,
and V/J gene usage. These metrics provide a high-level overview of
repertoire structure before detailed analysis.
summary <- summarizeTCRrepertoire(immLynx_example,
chains = "TRB")
print(summary)
#> === TCR Repertoire Summary ===
#> Chain(s): TRB
#>
#> --- Basic Statistics ---
#> Total cells with TCR: 306
#> Unique clonotypes: 275
#> Clonotype ratio: 0.8987
#>
#> --- Diversity Metrics ---
#> Shannon entropy: 5.511
#> Clonality: 0.0188
#> Simpson index: 0.0057
#> Inverse Simpson: 174.69
#>
#> --- Top Clonotypes ---
#> cdr3_aa count proportion
#> CASSVRRERANTGELFF 14 0.045751634
#> CASSPTVAGEQFF 3 0.009803922
#> CASSQAPFSTSGELFF 3 0.009803922
#> CASSQDRTGLDYEQYF 3 0.009803922
#> CASSRLRTGALYEQYF 3 0.009803922
#>
#> --- CDR3 Length Distribution ---
#> Mean: 14.8
#> Median: 15
#> Range: 10 - 21clusTCR groups TCR sequences by CDR3 similarity using the Markov
Cluster (MCL) algorithm. The inflation parameter controls
cluster granularity: higher values produce more, smaller clusters, while
lower values yield fewer, larger clusters. Cluster assignments are
stored directly in the colData of the
SingleCellExperiment object, making them immediately
available for downstream visualization and statistical testing.
tcrdist3 computes pairwise distances between TCR sequences using a substitution-matrix-based metric that accounts for CDR3 amino acid sequence, V gene, and J gene identity. The resulting distance matrix can be used for hierarchical clustering, nearest-neighbor analysis, or as input to dimensionality reduction methods. Note that tcrdist3 deduplicates sequences internally, so the distance matrix dimensions correspond to the number of unique TCR sequences rather than the total number of cells.
OLGA (Optimized Likelihood estimate of immunoGlobulin Amino-acid
sequences) computes the generation probability (Pgen) of a TCR
sequence—the probability that V(D)J recombination produces that specific
CDR3 amino acid sequence. Low Pgen values indicate sequences that are
unlikely to arise by chance, which may suggest antigen-driven selection.
The runOLGA() function adds log10-transformed Pgen values
to the colData of the input object.
sce <- runOLGA(
immLynx_example,
chains = "TRB",
model = "humanTRB"
)
hist(log10(sce$olga_pgen_TRB), breaks = 50)The generateOLGA() function can also produce synthetic
TCR sequences sampled from the recombination model, which is useful for
constructing background distributions or benchmarking analyses.
Protein language models such as ESM-2 learn dense vector
representations of amino acid sequences from large-scale protein
databases. When applied to CDR3 sequences, these embeddings capture
biochemical and structural properties that complement sequence-based
distance metrics. The runEmbeddings() function generates
per-sequence embeddings and stores them as a dimensionality reduction in
the SingleCellExperiment object, enabling visualization
with standard tools such as UMAP or PCA. Multiple pooling strategies
(mean, cls, max) are available to aggregate per-residue representations
into a fixed-length vector.
Metaclonotypist identifies metaclones—groups of TCR clonotypes that
share enough sequence similarity to potentially recognize the same
antigen. This approach extends traditional exact-match clonotyping by
allowing controlled levels of CDR3 edit distance and tcrdist-based
similarity. The max_edits and max_dist
parameters control the stringency of metaclone grouping.
soNNia uses a neural network to model post-thymic selection by
comparing observed TCR sequences against a background distribution
generated by V(D)J recombination. The model estimates a selection factor
for each sequence, quantifying how much more (or less) likely a sequence
is to appear in the observed repertoire relative to the recombination
baseline. This requires a background sequence file, which can be
generated using generateOLGA().
The following example demonstrates a combined workflow that
integrates multiple immLynx analyses on a single dataset. By storing all
results in the same SingleCellExperiment object, different
analytical layers (clustering, generation probability, embeddings) can
be compared and visualized together.
library(immLynx)
library(scran)
library(scater)
library(ggplot2)
data("immLynx_example")
# Summarize the repertoire
summary <- summarizeTCRrepertoire(immLynx_example)
print(summary)
# Cluster TCRs by CDR3 similarity
immLynx_example <- runClustTCR(
immLynx_example,
chains = "TRB",
method = "mcl"
)
# Calculate generation probability
immLynx_example <- runOLGA(
immLynx_example,
chains = "TRB"
)
# Generate protein language model embeddings
immLynx_example <- runEmbeddings(
immLynx_example,
chains = "TRB"
)
# Visualize embeddings colored by cluster assignment
immLynx_example <- scater::runUMAP(
immLynx_example,
dimred = "tcr_esm",
name = "tcr_umap"
)
scater::plotReducedDim(
immLynx_example,
dimred = "tcr_umap",
colour_by = "clustcr_TRB"
)
# Visualize embeddings colored by generation probability
scater::plotReducedDim(
immLynx_example,
dimred = "tcr_umap",
colour_by = "olga_pgen_log10_TRB"
)sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scater_1.39.4 ggplot2_4.0.3
#> [3] scran_1.39.2 scuttle_1.21.6
#> [5] SingleCellExperiment_1.33.2 SummarizedExperiment_1.41.1
#> [7] Biobase_2.71.0 GenomicRanges_1.63.2
#> [9] Seqinfo_1.1.0 IRanges_2.45.0
#> [11] S4Vectors_0.49.2 BiocGenerics_0.57.1
#> [13] generics_0.1.4 MatrixGenerics_1.23.0
#> [15] matrixStats_1.5.0 immLynx_0.99.4
#> [17] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.3 vipor_0.4.7
#> [4] dplyr_1.2.1 farver_2.1.2 viridis_0.6.5
#> [7] filelock_1.0.3 S7_0.2.2 fastmap_1.2.0
#> [10] bluster_1.21.1 digest_0.6.39 rsvd_1.0.5
#> [13] lifecycle_1.0.5 cluster_2.1.8.2 statmod_1.5.1
#> [16] magrittr_2.0.5 compiler_4.6.0 rlang_1.2.0
#> [19] sass_0.4.10 tools_4.6.0 igraph_2.3.0
#> [22] yaml_2.3.12 knitr_1.51 S4Arrays_1.11.1
#> [25] dqrng_0.4.1 reticulate_1.46.0 DelayedArray_0.37.1
#> [28] RColorBrewer_1.1-3 abind_1.4-8 BiocParallel_1.45.0
#> [31] withr_3.0.2 sys_3.4.3 grid_4.6.0
#> [34] beachmat_2.27.5 edgeR_4.9.9 scales_1.4.0
#> [37] cli_3.6.6 rmarkdown_2.31 metapod_1.19.2
#> [40] immApex_1.5.4 ggbeeswarm_0.7.3 cachem_1.1.0
#> [43] stringr_1.6.0 parallel_4.6.0 BiocManager_1.30.27
#> [46] XVector_0.51.0 basilisk_1.23.0 vctrs_0.7.3
#> [49] Matrix_1.7-5 jsonlite_2.0.0 dir.expiry_1.19.0
#> [52] BiocSingular_1.27.1 BiocNeighbors_2.5.4 ggrepel_0.9.8
#> [55] beeswarm_0.4.0 irlba_2.3.7 maketools_1.3.2
#> [58] locfit_1.5-9.12 limma_3.67.3 jquerylib_0.1.4
#> [61] glue_1.8.1 codetools_0.2-20 stringi_1.8.7
#> [64] gtable_0.3.6 ScaledMatrix_1.19.0 tibble_3.3.1
#> [67] pillar_1.11.1 htmltools_0.5.9 R6_2.6.1
#> [70] evaluate_1.0.5 lattice_0.22-9 png_0.1-9
#> [73] bslib_0.10.0 Rcpp_1.1.1-1 gridExtra_2.3
#> [76] SparseArray_1.11.13 xfun_0.57 buildtools_1.0.0
#> [79] pkgconfig_2.0.3