LRDE implements a Hurdle Negative Binomial
(Hurdle-NB) generalized linear model (GLM) tailored for
long-read RNA-seq data. Compared to short-read technologies, long-read
sequencing often involves smaller sample sizes and
substantial expression dropout (i.e., excess zeros),
posing challenges for standard differential expression methods.
This package is designed to address these issues by combining a flexible two-part model with stabilized parameter estimation.
The core contribution of LRDE is the integration of a
hurdle model with empirical Bayes (EB) shrinkage,
enabling robust and stable inference across genes.
For each gene gene \(g\) and sample \(i\), the observed count \(y_{ig}\) is modeled using a two-component framework:
Zero component:
A Bernoulli distribution models the probability of observing a
structural zero (dropout).
Count component:
Conditional on being non-zero, counts follow a zero-truncated
Negative Binomial distribution, capturing overdispersed
expression levels.
This formulation explicitly separates biological absence from technical dropout, improving interpretability and model fit.
Long-read RNA-seq data are often noisy, particularly for lowly
expressed genes. To mitigate this, LRDE borrows strength
across genes using an empirical Bayes strategy:
Binning:
Genes are grouped into bins according to similar mean expression
levels.
Prior estimation:
Within each bin, empirical priors are estimated for:
Shrinkage:
Gene-specific (tag-wise) estimates are shrunk toward their bin-specific
priors, resulting in more stable dispersion estimates and improved
statistical power.
LRDE provides two complementary approaches for
differential expression analysis:
Likelihood Ratio Test
(hurdle_LRT())
Recommended for general hypothesis testing and comparison of nested
models.
Wald Test
(hurdle_Wald_Test())
A computationally efficient alternative for testing individual
coefficients.
The main object returned by LRDE is a list containing
the following components:
counts: normalized count matrixgroup: sample group labelssize.factors: normalization factorstagwise.disp: gene-wise dispersion estimateszero_prob_matrix: estimated zero probabilities for each
grouplrt_stats: likelihood ratio test statisticswald_stats: Wald test statisticsp.values: p-values for differential expressionThese components can be accessed directly using standard list
indexing. LRDE also supports input from SummarizedExperiment
objects, enabling integration with standard Bioconductor workflows.
This section demonstrates a minimal end-to-end workflow for
differential expression analysis using LRDE.
# Load the package
suppressPackageStartupMessages(library(LRDE))
set.seed(123)
# 1. Simulate a count matrix (Negative Binomial)
mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)
# Define sample groups
grp <- factor(c("A", "A", "A", "B", "B", "B"))
# 2. Prepare the data object
y <- prepareDGE(mat, grp)
# 3. Estimate size factors for normalization
y <- sizeFactorsEst(y)
# 4. Estimate tag-wise dispersions
y <- tagwiseEst(y)
# 5a. Differential expression testing: Wald test
y <- hurdle_Wald_Test(y)
# 5b. Differential expression testing: Likelihood Ratio Test
y <- hurdle_LRT(y)
# 6. Inspect results
head(y$lrt_stats) # LRT statistics per gene
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271
head(y$wald_stats) # Wald statistics per gene
#> [1] -1.71589930 0.94594439 0.20390895 -0.51176389 -2.26213529 0.07112874
head(y$p.values) # p-values
#> [1] 0.06309344 0.34278220 0.83849617 0.60876122 0.01159219 0.94331223
head(y$tagwise.disp) # estimated dispersions
#> [1] 0.02735738 0.04630069 0.03428276 0.44069007 0.05461926 0.03239385
str(y, max.level = 1) # Data structure visualization
#> List of 8
#> $ counts : int [1:50, 1:6] 6 8 7 10 8 6 8 13 3 10 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ samples :'data.frame': 6 obs. of 3 variables:
#> $ baseMean : num [1:50] 3.81 7.36 6.62 7.19 4.88 ...
#> $ tagwise.disp : num [1:50] 0.0274 0.0463 0.0343 0.4407 0.0546 ...
#> $ zero_prob_matrix: num [1:50, 1:2] 0.0196 0.0215 0.0215 0.0215 0.0215 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ wald_stats : num [1:50] -1.716 0.946 0.204 -0.512 -2.262 ...
#> $ p.values : num [1:50] 0.0631 0.3428 0.8385 0.6088 0.0116 ...
#> $ lrt_stats : num [1:50] 3.4541 0.9 0.0415 0.262 6.3722 ...LRDE returns gene-level statistics for differential
expression analysis, including test statistics and p-values.
Genes with small p-values provide evidence against the null hypothesis of no differential expression between groups.
Since thousands of genes are typically tested simultaneously, it is important to control for multiple comparisons. A common approach is to control the false discovery rate (FDR) using the Benjamini–Hochberg method:
padj <- p.adjust(y$p.values, method = "BH")
head(padj)
#> [1] 0.3943340 0.6801222 0.9691057 0.8226503 0.2898047 0.9691057Genes with adjusted p-values below a chosen threshold (e.g., FDR < 0.05) are typically considered significantly differentially expressed.
In addition to standard matrix and
data.frame, prepareDGE supports SummarizedExperiment
objects, allowing LRDE to integrate seamlessly into
existing Bioconductor workflows.
# 1. Create a SummarizedExperiment object
suppressPackageStartupMessages(library(SummarizedExperiment))
set.seed(123)
# Simulate data
counts_mat <- matrix(rnbinom(300, size = 5, mu = 5), nrow = 50)
colnames(counts_mat) <- paste0("Sample", 1:6)
group_info <- factor(c("A", "A", "A", "B", "B", "B"))
# Construct SE with colData
se <- SummarizedExperiment(
assays = list(counts = counts_mat),
colData = data.frame(condition = group_info)
)
# 2. Start the LRDE pipeline using the SE object
# Extract the condition directly from colData
y_se <- prepareDGE(se, group = colData(se)$condition)
# 3. Standard downstream workflow
y_se <- sizeFactorsEst(y_se)
y_se <- tagwiseEst(y_se)
y_se <- hurdle_LRT(y_se)
# View results
head(y_se$lrt_stats)
#> [1] 3.454142872 0.899998178 0.041541545 0.261982772 6.372198462 0.005056271#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.41.1 Biobase_2.71.0
#> [3] GenomicRanges_1.63.2 Seqinfo_1.1.0
#> [5] IRanges_2.45.0 S4Vectors_0.49.1
#> [7] BiocGenerics_0.57.0 generics_0.1.4
#> [9] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [11] LRDE_0.99.6 BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-5 jsonlite_2.0.0 compiler_4.5.3
#> [4] BiocManager_1.30.27 jquerylib_0.1.4 yaml_2.3.12
#> [7] fastmap_1.2.0 lattice_0.22-9 XVector_0.51.0
#> [10] R6_2.6.1 S4Arrays_1.11.1 knitr_1.51
#> [13] DelayedArray_0.37.1 maketools_1.3.2 bslib_0.10.0
#> [16] rlang_1.2.0 cachem_1.1.0 xfun_0.57
#> [19] sass_0.4.10 sys_3.4.3 SparseArray_1.11.13
#> [22] cli_3.6.6 digest_0.6.39 grid_4.5.3
#> [25] lifecycle_1.0.5 evaluate_1.0.5 buildtools_1.0.0
#> [28] abind_1.4-8 rmarkdown_2.31 tools_4.5.3
#> [31] htmltools_0.5.9