scater 1.6.3
This document provides details on the import of single-cell data into a
SingleCellExperiment
object for use with the scater
package (frequent use
case) and how to quantify single-cell gene expression from within an R session
using the popular RNA-seq quantification tools Salmon
and kallisto
.
tximport
Lior Pachter’s group at Berkeley has recently released a new software tool
called kallisto for rapid quantification of transcript abundance from RNA-seq
data. Similarly, Rob Patro at Stony Brook University and colleagues have
released Salmon, a similarly fast and accurate RNA-seq quantification tool.
In scater
, a couple of wrapper functions to kallisto
and Salmon
enable easy and
extremely fast quantification of transcript abundance from within R.
For simplicity, examples are shown below demonstrating the use of kallisto
,
but the same could be done with Salmon
using almost identical interfaces
(replacing runKallisto
with runSalmon
, readKallistoResults
with
readSalmonResults
and so on).
################################################################################
### Tests and Examples
# Example if in the kallisto/test directory
setwd("/home/davis/kallisto/test")
kallisto_log <- runKallisto("targets.txt", "transcripts.idx", single_end=FALSE,
output_prefix="output", verbose=TRUE, n_bootstrap_samples=10)
sce_test <- readKallistoResults(kallisto_log, read_h5=TRUE)
sce_test
An example using real data from a project investigating cell cycle. Note that
this analysis is not ‘live’ (the raw data are not included in the package), but
it demonstrates what can be done with scater
and kallisto
.
setwd("/home/davis/021_Cell_Cycle/data/fastq")
system("wc -l targets.txt")
ave_frag_len <- mean(c(855, 860, 810, 760, 600, 690, 770, 690))
kallisto_test <- runKallisto("targets.txt",
"Mus_musculus.GRCm38.rel79.cdna.all.ERCC.idx",
output_prefix="kallisto_output_Mmus", n_cores=12,
fragment_length=ave_frag_len, verbose=TRUE)
sce_kall_mmus <- readKallistoResults(kallisto_test, read_h5=TRUE)
sce_kall_mmus
sce_kall_mmus <- readKallistoResults(kallisto_test)
sce_kall_mmus <- getBMFeatureAnnos(sce_kall_mmus)
head(fData(sce_kall_mmus))
head(pData(sce_kall_mmus))
sce_kall_mmus[["start_time"]]
counts(sce_kall_mmus)[sample(nrow(sce_kall_mmus), size=15), 1:6]
## Summarise expression at the gene level
sce_kall_mmus_gene <- summariseExprsAcrossFeatures(
sce_kall_mmus, exprs_values="tpm", summarise_by="feature_id")
datatable(fData(sce_kall_mmus_gene))
sce_kall_mmus_gene <- getBMFeatureAnnos(
sce_kall_mmus_gene, filters="ensembl_gene_id",
attributes=c("ensembl_gene_id", "mgi_symbol", "chromosome_name",
"gene_biotype", "start_position", "end_position",
"percentage_gc_content", "description"),
feature_symbol="mgi_symbol", feature_id="ensembl_gene_id",
biomart="ensembl", dataset="mmusculus_gene_ensembl")
datatable(fData(sce_kall_mmus_gene))
## Add gene symbols to featureNames to make them more intuitive
new_feature_names <- featureNames(sce_kall_mmus_gene)
notna_mgi_symb <- !is.na(fData(sce_kall_mmus_gene)$mgi_symbol)
new_feature_names[notna_mgi_symb] <- fData(sce_kall_mmus_gene)$mgi_symbol[notna_mgi_symb]
notna_ens_gid <- !is.na(fData(sce_kall_mmus_gene)$ensembl_gene_id)
new_feature_names[notna_ens_gid] <- paste(new_feature_names[notna_ens_gid],
fData(sce_kall_mmus_gene)$ensembl_gene_id[notna_ens_gid], sep="_")
sum(duplicated(new_feature_names))
featureNames(sce_kall_mmus_gene) <- new_feature_names
head(featureNames(sce_kall_mmus_gene))
tail(featureNames(sce_kall_mmus_gene))
sum(duplicated(fData(sce_kall_mmus_gene)$mgi_symbol))