Abstract
NBSplice is an R package that uses isoform expression counts to evaluate differential splicing by means of generalized linear models at the gene level. The package allows the estimation of differences between isoforms’ relative expression to infer changes in alternative splicing patterns. This vignette explains the use of the package in a typical analysis workflow. NBSplice package version: 1.15.0
NBSplice
is an R package sitting on Bioconductor. You
could download the source file from the Bioconductor site (http://bioconductor.org/) and then use
R CMD install
. Another easy way to install it is by
doing:
## try http:// if https:// URLs are not supported
if (!require("BiocManager"))
install.packages("BiocManager")
::install("NBSplice", version="devel") BiocManager
Here we show the most basic steps for a differential splicing
analysis pipeline using the NBSplice
R package.
To start the analysis, an expression matrix at the isoform/transcript
level called isoCounts
, previously generated by upstream
processing tools like Kallisto (Bray et
al. 2016) or RSEM (Li and Dewey
2011), is required. You also need a table describing isoform-gene
relationships called geneIso
and a table of experiment
information called designMatrix
. In particular, this table
must contain a column storing the experiment factor which will be
evaluated. Here, we assume that this column is called
condition
.
As input, the NBSplice
package expects isoform
expression counts obtained from RNA-seq in the form of a matrix of
numerical values. In this matrix, the count value found in the
i-th row and the j-th column represents the expression
counts assigned to isoform i in experimental sample j.
The values in the matrix should be un-normalized. We suggest the use of
the effective counts typically reported by the mentioned quantification
tools. It is not necessary to round those because NBSplice
will do it internally. The expression matrix could be easily obtained by
the combination of quantification files for all samples. For instance,
if Kallisto was used to transcript quantification, the expression matrix
should be built combining the est_counts
columns of those
files as is shown in the code below.
<-c(paste(rep("C1R",4), 1:4,"/abundance.tsv", sep=""),
fileNamespaste(rep("C2R",4), 1:4,"/abundance.tsv", sep=""))
<-lapply(seq_along(fileNames), function(x){
myInfo<-read.delim(fileNames[x], header = TRUE)
quant<-quant[,"est_counts"]
expressionreturn(expression)
})# Adding the isoform IDS as row names
<-read.delim(fileNames[x], header = TRUE)[,"target_id"]
isoform_id<-(do.call(cbind, myInfo))
expressionMatrixrownames(expressionMatrix)<-isoform_id
# Adding the samples names as column names
colnames(expressionMatrix)<-c(paste(rep("C1R",4), 1:4, sep=""),
paste(rep("C2R",4), 1:4, sep=""))
If another quantification tool was used, you only need to change the
name of the quantification files and the name of the columns having
estimated counts and isoforms IDs. For instance, for RSEM
quantification, isoforms counts and IDS are in the
expected_count
and transcript_id
columns of
the Sample.isoforms.results
file, respectively.
As an example, we have included an isoCounts
matrix
which is a subset of one of the simulated matrix used to demonstrate
NBSplice
utility.
data(isoCounts, package="NBSplice")
head(isoCounts)
## C1R1 C1R2 C1R3 C1R4 C2R1 C2R2 C2R3 C2R4
## ENST00000228345 79 0 106 99 76 0 68 23
## ENST00000358495 567 212 162 715 70 176 255 243
## ENST00000397230 15 0 0 28 0 0 22 3
## ENST00000430095 0 6 0 77 0 0 27 59
## ENST00000461568 11 0 0 0 0 2 0 0
## ENST00000463750 10 31 126 60 88 0 0 31
dim(isoCounts)
## [1] 3144 8
As can be noted, the isoCounts
matrix represents an
experiment with 8
samples were 3122
isoforms
were quantified.
The geneIso
matrix must have two columns called
isoform_id
and gene_id
and as many rows as the
isoCounts
matrix. The i-th row of the
geneIso
matrix specifies from which gene was originated the
i-th isoform. This information will be used by
NBSplice
methods for model fitting and hypothesis tests.
This matrix could be easily obtained from the annotation file used for
isoform quantification. For instance, if RSEM was used, the matrix can
be built taking the first two columns (transcript_id
and
gene_id
) of the Sample.isoforms.results
file.
In the case of Kallisto, this information is not stored in the
quantification file. You could use the biomaRt
R package to
obtain gene IDs for the quantified isoforms or the
refGenome
package to extract those from the annotation
file.
As an example, we provided a geneIso
matrix associated
with the isoCounts
object previously loaded.
data(geneIso, package="NBSplice")
head(geneIso)
## gene_id isoform_id
## ENST00000228345 ENSG00000002016 ENST00000228345
## ENST00000358495 ENSG00000002016 ENST00000358495
## ENST00000397230 ENSG00000002016 ENST00000397230
## ENST00000430095 ENSG00000002016 ENST00000430095
## ENST00000461568 ENSG00000002016 ENST00000461568
## ENST00000463750 ENSG00000002016 ENST00000463750
dim(geneIso)
## [1] 3144 2
This matrix contains annotation information, so, you need to build it everytime that you change your reference transcriptome, downloading the respective information from the database site.
The designMatrix
table must contain experiment
information relevant for the differential splicing analysis. Its number
of columns is variable but it should have as rows as columns the
isoCounts
matrix has. Thus, in the
designMatrix
the i-th row specifies experimental
information associated to the i-th sample. It is worth to
mention that the order and names of isoCounts
columns must
be the same of the designMatrix
rows. Currently,
NBSplice
allows two-levels comparison of single-factor
experiments (multiple-factor experiment comparisons are in development).
In order to do this, it is necessary to specify the argument
colName
which indicates the name of the
designMatrix
column corresponding to that factor. We
provided the designMatrix
object associated to our
isoCounts
matrix which could be loaded doing
data(designMatrix, package="NBSplice")
. Alternatively, this
matrix could be created using the code below.
<-data.frame(sample=c(paste(rep("C1R", 4), 1:4, sep=""),
designMatrixpaste(rep("C2R", 4), 1:4, sep="")), condition=factor(c(
rep("Normal", 4), rep("Tumor", 4)), levels=c("Normal",
"Tumor")))
rownames(designMatrix)<-designMatrix[,"sample"]
designMatrix
## sample condition
## C1R1 C1R1 Normal
## C1R2 C1R2 Normal
## C1R3 C1R3 Normal
## C1R4 C1R4 Normal
## C2R1 C2R1 Tumor
## C2R2 C2R2 Tumor
## C2R3 C2R3 Tumor
## C2R4 C2R4 Tumor
In our case, we must define our colName
argument
indicating that our factor of interest is condition.
<-"condition"
colNamelevels(designMatrix[,colName])
## [1] "Normal" "Tumor"
As can be noted, our experiment involved two experimental conditions:
Normal and Tumor. Note that this column of the
designMatrix
is a factor
with two levels where
the first level, commonly taken as the reference, is
Normal.
NBSplice
incorporates a new class called
IsoDataSet
to store expression counts and experimental
data. An object of this class has the next slots: counts
,
geneCounts
, colData
, isoGeneRel
,
design
and lowExpIndex
. The first slot will
contain the isoCounts
matrix, whereas, the second one store
the expression counts at the gene level. To do this,
NBSplice
incorporates a function which takes every gene of
geneIso
and, for each sample, compute its total expression
as the sum of the counts of all its isoforms contained in
isoCounts
. The third and fifth slots correspond to
designMatrix
and geneIso
objects. The slot
called design
save the formula used to model fitting and
hypothesis tests, and is internally defined by NBSplice
based on the colName
argument. The last argument,
lowExpIndex
, stores the indexes of low-expressed isoforms.
The way to define those isoforms is explained in Identifying low-expressed
isoforms.
The typical way to build an IsoDataSet
object is using
its constructor which have the same name as the class. In addition to
the four arguments previously defined, it is possible to define the
BPPARAM
argument which must be a
BiocParallelParam
instance defining the parallel back-end
to be used (Morgan et al., n.d.).
library(NBSplice)
<-IsoDataSet(isoCounts, designMatrix, colName, geneIso) myIsoDataSet
NBSplice
also provides several methods for easy object
and slots inspections.
show(myIsoDataSet)
## IsoDataSet
## Isoform Counts:
## C1R1 C1R2 C1R3 C1R4 C2R1 C2R2 C2R3
## ENST00000228345 111.79303 0.0000 149.7884 141.06040 111.9745 0.00 100.9089
## ENST00000358495 802.36265 296.6549 228.9219 1018.76958 103.1344 258.33 378.4085
## ENST00000397230 21.22653 0.0000 0.0000 39.89587 0.0000 0.00 32.6470
## C2R4
## ENST00000228345 36.469152
## ENST00000358495 385.304517
## ENST00000397230 4.756846
##
## Gene Counts:
## C1R1 C1R2 C1R3 C1R4 C2R1 C2R2
## ENSG00000002016 1158.9683 481.3646 854.9243 1667.0775 495.0451 408.04405
## ENSG00000005059 182.5481 158.1227 72.0680 115.4131 139.9681 58.71137
## ENSG00000005889 1982.5575 1859.6906 1575.6043 1530.2917 1461.5618 1241.74555
## C2R3 C2R4
## ENSG00000002016 651.4561 692.9139
## ENSG00000005059 136.5238 144.2910
## ENSG00000005889 1167.8724 2177.0498
##
## Isoform-Gene relationship:
## gene_id isoform_id
## ENST00000228345 ENSG00000002016 ENST00000228345
## ENST00000358495 ENSG00000002016 ENST00000358495
## ENST00000397230 ENSG00000002016 ENST00000397230
##
## Design Matrix:
## sample condition
## C1R1 C1R1 Normal
## C1R2 C1R2 Normal
## C1R3 C1R3 Normal
##
## Formula:
## counts ~ condition + iso + condition:iso
## <environment: 0x7fb75051b530>
head(isoCounts(myIsoDataSet))
## C1R1 C1R2 C1R3 C1R4 C2R1 C2R2
## ENST00000228345 111.79303 0.000000 149.7884 141.06040 111.9745 0.000000
## ENST00000358495 802.36265 296.654936 228.9219 1018.76958 103.1344 258.330043
## ENST00000397230 21.22653 0.000000 0.0000 39.89587 0.0000 0.000000
## ENST00000430095 0.00000 8.395894 0.0000 109.71365 0.0000 0.000000
## ENST00000461568 15.56612 0.000000 0.0000 0.00000 0.0000 2.935569
## ENST00000463750 14.15102 43.378788 178.0503 85.49115 129.6547 0.000000
## C2R3 C2R4
## ENST00000228345 100.90892 36.469152
## ENST00000358495 378.40846 385.304517
## ENST00000397230 32.64700 4.756846
## ENST00000430095 40.06678 93.551303
## ENST00000461568 0.00000 0.000000
## ENST00000463750 0.00000 49.154074
Note that the isoCounts
method returns the
counts
slot but it seems to be different to the
IsoCounts
object previously loaded. This is because this
method has an additional logical parameter, CPM
with
default value TRUE
, useful to indicate if expression counts
should be returned in counts per million (CPM) scale.
Low expression transcripts have been widely recognized as a source of
noise in model fitting and parameter estimation of RNA-seq expression
counts. Typically, count thresholds are used to decide if a gene or an
isoform has low expression. However, NBSplice
assumes that
an isoform has a low expression if their absolute or relative expression
is lower than specifics thresholds. This assumption is due to the
package model infers differential splicing by means of changes in
relative expression of isoforms. Consequently, a transcript which has a
very low relative expression in both conditions could exhibit a small
change but enough to be erroneously detected as significant. The
buildLowExpIdx
method is responsible for the identification
of low-expressed isoforms. It takes an IsoDataSet
object
and two expression thresholds and creates a low-expression index and
stores this index in the lowExpIndex
object slot. The two
thresholds are called ratioThres
and
countsThres
. The first one of those is used to set a
minimum of relative expression that each isoform should achieve in all
experimental samples. The second one indicates a threshold in CPM for
mean expression across conditions. The method also needs the
specification of the colName
argument, in order to compute
mean expressions per factor level, and admits parallelization using the
BPPARAM
argument.
<-buildLowExpIdx(myIsoDataSet, colName, ratioThres = 0.01,
myIsoDataSetcountThres = 1)
Once an IsoDataSet
was built and its low-expressed
isoforms were identified, you could use the NBSplice
for
differential splicing evaluation using the NBTest
method.
It provides a user-friendly interface between model fitting and
hypothesis testing and the user by means a single code line. The theory
behind the NBSplice
model is explained in NBSplice strategy. Briefly,
NBTest
iterates along genes obtaining significance p-values
at gene and isoform level indicating the occurrence of differential
splicing and significant changes in isoform proportion, respectively.
Then, isoform and gene p-values are adjusted. Low-expressed isoforms are
ignored for model fitting but their mean relative expression per
condition are also computed. The method requires the specification of
the colName
, as well as the statistic assumed for
hypothesis test p-values distribution. As the colName
factor could have more than two levels, it is necessary the
specification of an additional argument defining the two levels to be
contrasted. By default, the contrast
argument takes the two
first levels of colName
. NBTest
returns an
NBSpliceRes
object which stores the obtained results.
<-NBTest(myIsoDataSet, colName, test="F") myDSResults
NBSpliceRes
is another new class provided by NBSplice to
facilitate the manipulation and exploration of the differential splicing
results obtained using the NBTest
method. An object of this
class has three slots called results
,
lowExpIndex
and contrast
. The first slot
stores differential splicing results for all isoforms, whereas the
second slot keeps the indexes of low-expressed isoforms and the third
one contains the evaluated contrast. Specific methods for slot accession
are also provided by NBSplice
head(lowExpIndex(myDSResults))
## [1] 1 3 4 5 6 7
contrast(myDSResults)
## [1] "Normal" "Tumor"
head(results(myDSResults))
## iso gene ratio_Normal ratio_Tumor odd
## 2 ENST00000358495 ENSG00000002016 0.54719747 0.49422479 -0.10181928
## 21 ENST00000394650 ENSG00000005059 0.69478391 0.78304282 0.11958650
## 26 ENST00000379177 ENSG00000005889 0.34888943 0.25494787 -0.31369597
## 27 ENST00000379188 ENSG00000005889 0.40700345 0.34146857 -0.17556603
## 35 ENST00000540034 ENSG00000005889 0.08976672 0.08393136 -0.06721502
## 37 ENST00000219069 ENSG00000006194 0.76318911 0.80921396 0.05855751
## stat pval genePval FDR geneFDR
## 2 0.14639753 0.7151892 0.7151892 0.9926409 0.9995640
## 21 1.05528398 0.3439129 0.3439129 0.7935389 0.8405744
## 26 0.55196653 0.4671017 0.8761851 0.9190379 0.9995640
## 27 0.17311930 0.6822731 0.8761851 0.9901351 0.9995640
## 35 0.02497649 0.8761851 0.8761851 0.9982574 0.9995640
## 37 0.06668665 0.8005993 0.7438810 0.9982574 0.9995640
The data.frame
containing differential splicing results
has the next columns:
iso
: Isoform name.gene
: Gene name.ratio_X
: Relative expression of the isoform in
experimental condition X, which is the first value specified
with the contrast
argument.ratio_Y
: Relative expression of the isoform in
experimental condition Y, which is the second value specified
with the contrast
argument.odd
: Ratio between isoform’s relative expression in
condition X and Y.stat
: Statistic of the hypothesis test performed to
decide if the change in isoform’s relative expression between X
and Y conditions was significant.pval
: P-value at the isoform level corresponding to the
statistic stat
.genePval
: P-value at the gene level.FDR
: Adjusted p-value at the isoform level.genePval
: Adjusted p-value at the gene level.In particular, the pval
or its adjusted value,
FDR
, is useful for deciding if an isoform experimented
significant change in its relative expression due to the change between
X and Y conditions, here called Normal and
Tumor. Whereas, the genePval
or its adjusted
value, geneFDR
, will help you to decide if a gene evidenced
significant changes in its alternative splicing pattern when conditions
X and Y were contrasted.
In the example illustrated above, non-reliable isoforms were
discarded by means of the filter
argument of the
results
methods which has TRUE
as its default
value. Non-reliable refers to both low expressed isoforms and those
cases where model estimations did not achieve convergence. To obtain the
full results
matrix only you need specifies that
filter
argument is FALSE
.
head(results(myDSResults, filter = FALSE))
## iso gene ratio_Normal ratio_Tumor odd stat
## 1 ENST00000228345 ENSG00000002016 0.089070273 0.108429887 NA NA
## 2 ENST00000358495 ENSG00000002016 0.547197468 0.494224790 -0.1018193 0.1463975
## 3 ENST00000397230 ENSG00000002016 0.010561661 0.014244721 NA NA
## 4 ENST00000430095 ENSG00000002016 0.020813457 0.049128715 NA NA
## 5 ENST00000461568 ENSG00000002016 0.003357753 0.001798561 NA NA
## 6 ENST00000463750 ENSG00000002016 0.090468201 0.083210744 NA NA
## pval genePval FDR geneFDR
## 1 NA 0.7151892 NA 0.999564
## 2 0.7151892 0.7151892 0.9926409 0.999564
## 3 NA 0.7151892 NA 0.999564
## 4 NA 0.7151892 NA 0.999564
## 5 NA 0.7151892 NA 0.999564
## 6 NA 0.7151892 NA 0.999564
Non-reliable isoforms could be easily detected in the obtained matrix
because they present NA
value in all parameters derived
from model fitting and hypothesis test. It is worth to note that
although non-reliable isoforms were not analyzed by statistical models,
they have been used to compute the total gene counts so in order to
avoid overestimation of relative expression of the well-expressed
isoforms.
The NBSpliceRes
class has several methods useful for
differential splicing results exploration.
NBSplice
assumes a gene is differentially spliced if its
p-value (raw or adjusted) is lower than a significant threshold (default
value = 0.05). The results table only for those differentially spliced
genes is easily extracted using the GetDSResults
method.
Whereas, the names of those genes is obtained by means of the
GetDSGenes
. Both methods expect at least one argument of
class NBSpliceRes
. In addition, they could receive a
logical parameter, adjusted
, indicating if p-values must be
adjusted (default) or not, and a numeric parameter,
p.value
, establishing the significance threshold (default
value = 0.05).
<-GetDSResults(myDSResults)
mySignificantReshead(mySignificantRes)
## iso gene ratio_Normal ratio_Tumor odd
## 378 ENST00000206514 ENSG00000092068 0.03696968 0.07501606 0.7076040
## 379 ENST00000316902 ENSG00000092068 0.51890400 0.17702007 -1.0754558
## 381 ENST00000339733 ENSG00000092068 0.03601217 0.05323293 0.3908204
## 383 ENST00000422941 ENSG00000092068 0.02884770 0.03681750 0.2439428
## 386 ENST00000524758 ENSG00000092068 0.03023622 0.04352228 0.3642326
## 388 ENST00000528186 ENSG00000092068 0.02675809 0.04615988 0.5452741
## stat pval genePval FDR geneFDR
## 378 6.0654084 0.0170055687 0.00281695 0.128238715 0.03830455
## 379 14.3400329 0.0003851829 0.00281695 0.009577521 0.03830455
## 381 1.8410376 0.1804745117 0.00281695 0.566677648 0.03830455
## 383 0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 386 1.5880384 0.2130252802 0.00281695 0.629128681 0.03830455
## 388 3.5529531 0.0648255060 0.00281695 0.305294101 0.03830455
dim(mySignificantRes)
## [1] 110 10
<-GetDSGenes(myDSResults)
myDSGeneshead(myDSGenes)
## [1] "ENSG00000092068" "ENSG00000103275" "ENSG00000103942" "ENSG00000119446"
## [5] "ENSG00000123983" "ENSG00000126883"
length(myDSGenes)
## [1] 33
As can be seen, in our example, NBSplice
detected 184
isoforms with a significant change of relative expression between
Normal and Tumor condition whereas 40 genes were
identified as differentially spliced.
The scatter plot of isoforms’ relative expression between the two
contrasted conditions is explored using the plotRatiosDisp
.
It returns a ggplot2
object which could be then easily
modified and customized.
plotRatiosDisp(myDSResults)
As can be noted, isoforms are colored according to if they came from
those genes detected as differentially spliced or not. To do this, the
adjusted
and p.value
parameters, previously
used to call the getDSResults
method, could be specified in
the plotRatiosDisp
calling.
The volcano plot is commonly used to explore the dispersion between
statistical significance from a statistical test with the magnitude of
the change. In the case of NBSplice
the isoform p-value and
the difference between isoform’s relative expression per contrasted
condition are explored using the plotVolcano
method. In
addition, isoforms are colored according to if they came from those
genes detected as differentially spliced or not.
plotVolcano(myDSResults)
NBSplice
also provides two methods to individually
explore differentially spliced genes: GetGeneResults
and
plotGeneResults
. The first one is useful for obtaining the
differential expression results of a single gene, specified using the
gene
argument. The second one is a graphical method to
generate a bar-plot illustrating the isoforms’ relative expression
values in the two contrasted conditions. In both cases, it is possible
filtering those non-reliable isoforms by means of the
filterLowExpIso
argument. In the case of the
plotGeneResults
method, non-significant isoforms could be
also filtered out setting the filterNotSignificant
argument
to TRUE
which is its default value. As an example, the
results obtained for the first analyzed gene are shown below.
## Select the first differentially spliced gene
<-GetDSGenes(myDSResults)[1]
geneGetGeneResults(myDSResults, gene)
## iso gene ratio_Normal ratio_Tumor odd
## 378 ENST00000206514 ENSG00000092068 0.03696968 0.07501606 0.7076040
## 379 ENST00000316902 ENSG00000092068 0.51890400 0.17702007 -1.0754558
## 381 ENST00000339733 ENSG00000092068 0.03601217 0.05323293 0.3908204
## 383 ENST00000422941 ENSG00000092068 0.02884770 0.03681750 0.2439428
## 386 ENST00000524758 ENSG00000092068 0.03023622 0.04352228 0.3642326
## 388 ENST00000528186 ENSG00000092068 0.02675809 0.04615988 0.5452741
## 390 ENST00000528860 ENSG00000092068 0.02884770 0.03681750 0.2439428
## 391 ENST00000529705 ENSG00000092068 0.02209550 0.06319893 1.0509135
## 392 ENST00000532568 ENSG00000092068 0.03410357 0.07346647 0.7674271
## stat pval genePval FDR geneFDR
## 378 6.0654084 0.0170055687 0.00281695 0.128238715 0.03830455
## 379 14.3400329 0.0003851829 0.00281695 0.009577521 0.03830455
## 381 1.8410376 0.1804745117 0.00281695 0.566677648 0.03830455
## 383 0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 386 1.5880384 0.2130252802 0.00281695 0.629128681 0.03830455
## 388 3.5529531 0.0648255060 0.00281695 0.305294101 0.03830455
## 390 0.7091551 0.4034386713 0.00281695 0.854605021 0.03830455
## 391 13.1933293 0.0006259889 0.00281695 0.014451319 0.03830455
## 392 7.1220185 0.0100334774 0.00281695 0.093240396 0.03830455
plotGeneResults(myDSResults, gene)
## Keeping non-reliable and non-significant isoforms
plotGeneResults(myDSResults, gene, filterLowExpIso = FALSE,
filterNotSignificant = FALSE)
NBSplice
, is a package developed for differential
splicing detection starting from isoform quantification data from
RNA-seq experiments. Our tool is based on generalized linear models
(GLMs) and uses them to infer significant differences in isoforms
relative expression values between two experimental conditions. In
addition, NBSplice
not only reveals what genes were under
DS but also reveals the identity of the gene isoforms affected by it as
well as their relative expression in each experimental condition.
NBSplice
is based on the use of GLMs to model isoform
expression counts. This approach has been widely used for several
differential expression tools like edgeR
(Robinson, McCarthy, and Smyth 2010) and
DESeq2
(Love, Huber, and Anders
2014) for differential gene expression and DEXSeq
(Anders, Reyes, and Huber 2012) for
differential exon usage analysis. These tools are based in negative
binomial (NB) models and, in general, they assume that the counts of the
i-th gene/exon in the k-th sequenced sample, denoted as \(y_{ik}\), are a random variable with a NB
distribution.
\[\begin{equation} y_{ik}\sim NB(\mu_{ik}=p_{ik}s_k, \phi_{i}) \end{equation}\]
This NB distribution is characterized by two parameters: the mean, \(\mu_{ik}\), and the dispersion, \(\phi_{i}\). Specifically, previously mentioned R packages assume that \(\mu_{ik}\) can be expressed as the product of the real proportion of mRNA fragments of the i-th gene in the k-th sample, \(p_{ik}\), with a factor, \(s_k\), related to the sequencing library size. The parameter \(\phi_{i}\) is directly related to the variance of \(y_{ik}\)
\[\begin{equation} V(y_{ik})=\mu_{ik}+\phi_i\mu_{ik} \label{EQ:var} \end{equation}\]
Commonly used methods fit GLMs with a logarithmic link and with the elements \(x_{kr}\) of the design matrix to infer differential expression
\[\begin{equation} log(p_{ik})=\sum\limits_rx_{kr}\beta_{ir} \label{EQ:mod} \end{equation}\]
In the most simple case, a treatment-control experiment, \(r\) be equal to \(1\) and the design matrix only will have one column where each cell indicates if sample \(k\) was or not treated.
NBSplice
takes the advantage of the previously described
strategy although it proposes an alternative approach. It assumes that
isoform expression counts, in CPMs, of the i-th isoform from the j-th
gene in the k-th sample, \(y_{ijk}\),
follow a NB distribution as
\[\begin{equation} y_{ijk}\sim NB(media=p_{ijk}\mu_{jk}, dispersion=\phi_{j}) \label{EQ:eq5} \end{equation}\]
where \(p_{ijk}\) represents the proportion of the mRNA fragments from the j-th gene (\(\mu_{jk}\)) belonging to the i-th isoform and the \(\phi_{j}\) is the dispersion of the j-th gene, in the k-th sample. In this context, for each isoform, it is possible to predict its mean relative expression or proportion by means of a GLM fit at the level gene according to
\[\begin{equation} log(p_{ijk})=x_{r.}\beta_{ij} \label{EQ:eq6} \end{equation}\]
where \(\beta_{ij}\) represents the
change in \(p_{ijk}\) due to the effect
described by the r-th column of the design matrix X
(\(x_{r.}\)). Thus,
NBSplice
proposes, unlike other tools such as
DEXSeq
or Limma
(Ritchie et al. 2015), the adjustment of one
model for each gene. In particular, this model must consider the
different gene isoforms as effects. In the most simple case, where only
one factor (cond) with two levels (\(l=1, ~2\)) is analyzed , the previous
equation is summarized at:
\[\begin{equation} ln(p_{ijl})=\mu_0+iso_{ij}+cond_l+ iso_{ij}:cond_l \label{EQ:eq7} \end{equation}\]
Once coefficient models are estimated, \(\hat{\beta}\), NBSplice
evaluates if the changes in isoforms proportions due to experimental
conditions are significant by means of a linear hypothesis test (Wald
test) based on the F-statistic defined below (Greene 2017).
\[\begin{equation} H_0:cond_l+ iso_{ij}:cond_l=0 \\ F=\frac{(H\hat{\beta}-q)'(H \hat{cov}(\hat{\beta})H')^{-1}(H\hat{\beta}-q)}{p} \end{equation}\]
Here, \(H\) is the contrast matrix, \(q\) is a single element matrix having the value of the right side of the test equation (\(q=0\)), \(\hat{cov}(\hat{\beta})\) is the covariance matrix of \(\hat{\beta}\) and \(p\) is equal to the number of rows of \(H\) (\(p=1\)). In particular, \(H\) is a full rank matrix, conformable with \(\hat{\beta}\) so, when they are multiplied, the linear combination of the left side of the test equation is defined. Thus, under the null hypothesis, the F statistics will have \(F\) distribution with \(1\) and \(n-K\) degrees of freedom. Specifically, \(n\) is the number of samples and \(K\) is the length of \(\beta\). When only one experimental factor with two levels is analyzed (treatment-control case), \(K\) will be equal to the double of the number of analyzed isoforms of gene \(j\), \(I_j\). The comparison of the F statistics against its distribution under null hypothesis provides a p-value for each isoform of gene \(j\).
Given the fact that DS affect the whole gene, once isoform hypothesis
tests were performed, NBSplice
uses the Simes test (Simes 1986) to combine isoform p-values
obtaining a gene p-value. Given a family of null hypothesis, \(H_{01}, ..., H_{0I_j}\), to test changes in
relative isoform expression of the \(I_j\) isoforms from gene \(j\), the Simes test evaluates only one
global null hypothesis
\[\begin{equation} H_0=\bigcap\limits_{p=1}^{I_j}H_{0i} \end{equation}\]
To do this, the isoform p-values are sorted \(p-value_{(1)j}< \dots < p-value_{(m)j}< \dots <p-value{(I)j}\) and then the Simes statistic, \(T_j\), is defined as
\[\begin{equation} T_j=\min \{ \frac{I_j p-value{(m)j}}{m}\} \label{EQ:eq10} \end{equation}\]
Under the null hypothesis, this statistic has uniform distribution, \(U(0,1)\), representing a p-value. The Simes test rejects \(H_0\) if \(T_j\) is lower or equal to the significance threshold \(\alpha\) (Simes 1986). Finally, isoform and gene p-values are corrected using the FDR strategy (Benjamini and Hochberg 1995).
NBSplice
uses commonly used methods and R packages for
classical GLM fit (McCullagh 1984). In
particular, the glm.nb
function of the MASS
package (Venables and Ripley 2002) is used
to model fitting. It was chosen because this function allows the
estimation of the dispersion parameter reporting a logical parameter
indicating if the model estimation converged or not. This parameter is
used by NBSplice
because if estimations do not converge,
thus inferences over the corresponding genes won’t be done. In addition,
the lht
function of the car
package (Fox and Weisberg 2011) was used for linear
hypothesis tests and the simes.test
method from
mppa
package (Rubin-Delanchy and
Heard 2014), for the Simes test.
sessionInfo()
## R version 4.2.0 Patched (2022-04-24 r82246)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] NBSplice_1.15.0
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.5 tidyselect_1.1.2 xfun_0.31
## [4] bslib_0.3.1 reshape2_1.4.4 purrr_0.3.4
## [7] lattice_0.20-45 carData_3.0-5 colorspace_2.0-3
## [10] vctrs_0.4.1 generics_0.1.2 htmltools_0.5.2
## [13] yaml_2.3.5 utf8_1.2.2 rlang_1.0.2
## [16] jquerylib_0.1.4 pillar_1.7.0 glue_1.6.2
## [19] DBI_1.1.2 BiocParallel_1.31.3 plyr_1.8.7
## [22] lifecycle_1.0.1 stringr_1.4.0 munsell_0.5.0
## [25] gtable_0.3.0 codetools_0.2-18 evaluate_0.15
## [28] labeling_0.4.2 knitr_1.39 fastmap_1.1.0
## [31] parallel_4.2.0 fansi_1.0.3 highr_0.9
## [34] Rcpp_1.0.8.3 edgeR_3.39.1 scales_1.2.0
## [37] BiocManager_1.30.17 limma_3.53.0 jsonlite_1.8.0
## [40] abind_1.4-5 farver_2.1.0 BiocStyle_2.25.0
## [43] ggplot2_3.3.6 digest_0.6.29 stringi_1.7.6
## [46] dplyr_1.0.9 grid_4.2.0 cli_3.3.0
## [49] tools_4.2.0 magrittr_2.0.3 sass_0.4.1
## [52] tibble_3.1.7 mppa_1.0 crayon_1.5.1
## [55] car_3.0-13 pkgconfig_2.0.3 MASS_7.3-57
## [58] ellipsis_0.3.2 assertthat_0.2.1 rmarkdown_2.14
## [61] R6_2.5.1 compiler_4.2.0