salmon-wrapper {scater}    R Documentation

Salmon wrapper functions

Description

After generating transcript/feature abundance results using Salmon for a batch of samples, read these abundance values into a SingleCellExperiment object.

runSalmon runs the abundance quantification tool Salmon on a set of FASTQ files. It requires Salmon (https://combine-lab.github.io/salmon/) to be installed and a Salmon transcript index to have been generated before the function is called. See the Salmon website for installation and basic usage instructions.

Usage

readSalmonResultsOneSample(directory)

readSalmonResults(Salmon_log = NULL, samples = NULL, directories = NULL,
  logExprsOffset = 1, verbose = TRUE)

runSalmon(targets_file, transcript_index, single_end = FALSE,
  output_prefix = "output", lib_type = "A", n_processes = 2,
  n_thread_per_process = 4, n_bootstrap_samples = 0, seqBias = TRUE,
  gcBias = TRUE, posBias = FALSE, allowOrphans = FALSE,
  advanced_opts = NULL, verbose = TRUE, dry_run = FALSE,
  salmon_cmd = "salmon")

Arguments

directory

character string giving the path to the directory containing the Salmon results for the sample.

Salmon_log

list, generated by runSalmon. If provided, then samples and directories arguments are ignored.

samples

character vector providing a set of sample names to use for the abundance results.

directories

character vector providing a set of directories containing Salmon abundance results to be read in.

logExprsOffset

numeric scalar, providing the offset used when doing log2-transformations of expression data to avoid trying to take logs of zero. Default offset value is 1.

verbose

logical, should function provide output about progress?

targets_file

character string giving the path to a tab-delimited text file with either 2 columns (single-end reads) or 3 columns (paired-end reads) that gives the sample names (first column) and FASTQ file names (column 2 and, if applicable, column 3). The file is assumed to have column headers, although these are not used.

transcript_index

character string giving the path to the Salmon index to be used for the feature abundance quantification.

single_end

logical, are single-end reads used, or paired-end reads?

output_prefix

character string giving the prefix for the output folder that will contain the Salmon results. The default is "output" and the sample name (column 1 of targets_file) is appended (preceded by an underscore).

lib_type

scalar, indicating RNA-seq library type. See Salmon documentation for details. Default is "A", for automatic detection.

n_processes

integer giving the number of processes to use for parallel Salmon jobs across samples. The package parallel is used. Default is 2 concurrent processes.

n_thread_per_process

integer giving the number of threads for Salmon to use per process (to parallelize Salmon for a given sample). Default is 4.

n_bootstrap_samples

integer giving the number of bootstrap samples that Salmon should use (default is 0). With bootstrap samples, uncertainty in abundance can be quantified.

seqBias

logical, should Salmon's option be used to model and correct abundances for sequence-specific bias? Default is TRUE.

gcBias

logical, should Salmon's option be used to model and correct abundances for GC content bias? Requires Salmon version 0.7.2 or higher. Default is TRUE.

posBias

logical, should Salmon's option be used to model and correct abundances for positional biases? Requires Salmon version 0.7.3 or higher. Default is FALSE.

allowOrphans

logical, should orphaned reads be considered valid hits when performing lightweight alignment? This option will increase sensitivity (allow more reads to map and more transcripts to be detected), but may decrease specificity as orphaned alignments are more likely to be spurious. Default is FALSE. For more details see Salmon documentation.

advanced_opts

character scalar supplying list of advanced option arguments to apply to each Salmon call. For details see Salmon documentation or type salmon quant --help-reads at the command line.

dry_run

logical, if TRUE then a list containing the Salmon commands that would be run and the output directories is returned. Can be used to read in results if Salmon is run outside an R session or to produce a script to run outside of an R session.

salmon_cmd

(optional) string giving full command to use to call Salmon, if simply typing "salmon" at the command line does not give the required version of Salmon or does not work. Default is simply "salmon". If used, this argument should give the full path to the desired Salmon binary.

Details

The directory is expected to contain results for just a single sample. Putting more than one sample's results in the directory will result in unpredictable behaviour with this function. The function looks for the files (with the default names given by Salmon) 'quant.sf', 'stats.tsv', 'libFormatCounts.txt' and the sub-directories 'logs' (which contains a log file) and 'libParams' (which contains a file detailing the fragment length distribution). If these files are missing, or if results files have different names, then this function will not find them.
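A quick pre-flight check of a results directory can be sketched as follows. This is only an illustration: check_salmon_dir is a hypothetical helper, not a scater function, and the temporary directory stands in for real Salmon output. The file and directory names are the ones listed above.

```r
# Hypothetical helper (not part of scater): report which of the default
# Salmon outputs named above are present in a single-sample directory.
check_salmon_dir <- function(directory) {
  expected <- c("quant.sf", "stats.tsv", "libFormatCounts.txt",
                "logs", "libParams")
  # file.exists() returns TRUE for both files and directories
  present <- file.exists(file.path(directory, expected))
  names(present) <- expected
  present
}

# Toy demonstration: a temporary directory standing in for real output,
# containing only 'quant.sf' and the 'logs' sub-directory
demo_dir <- file.path(tempdir(), "salmon_check_demo")
dir.create(file.path(demo_dir, "logs"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "quant.sf"))

status <- check_salmon_dir(demo_dir)
print(status)
```

Any entry reported as FALSE is a file or sub-directory that the reader function would not find under its default name.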

This function will work for Salmon v0.7.x and greater, as the name of one of the default output directories was changed from "aux" to "aux_info" in Salmon v0.7.

This function expects to find only one set of Salmon abundance results per directory; multiple abundance results in a given directory will be problematic.

A Salmon transcript index can be built from a FASTA file: salmon index [arguments] FASTA-file. See the Salmon documentation for further details. This simple wrapper does not give access to all nuances of Salmon usage. For finer-grained usage of Salmon please run it at the command line - results can still be read into R with readSalmonResults.
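To make the relationship between the wrapper arguments and the command line more concrete, the sketch below assembles a single salmon quant call as a string. It is an assumption-laden illustration, not scater's implementation: build_salmon_cmd and its argument names are hypothetical, only a handful of Salmon options are covered, and paths are not shell-quoted.

```r
# Hypothetical helper (not scater's implementation): assemble one
# 'salmon quant' command line for a single sample.
build_salmon_cmd <- function(index, out_dir, fastq, lib_type = "A",
                             n_threads = 4, n_bootstraps = 0,
                             salmon_cmd = "salmon") {
  reads <- if (length(fastq) == 2) {
    c("-1", fastq[1], "-2", fastq[2])   # paired-end mates
  } else {
    c("-r", fastq[1])                   # single-end reads
  }
  args <- c(salmon_cmd, "quant", "-i", index, "-l", lib_type, reads,
            "-o", out_dir, "--threads", n_threads)
  if (n_bootstraps > 0) {
    args <- c(args, "--numBootstraps", n_bootstraps)
  }
  # NOTE: no shell quoting is applied; paths with spaces would need it
  paste(args, collapse = " ")
}

cmd <- build_salmon_cmd("transcripts.idx", "output_sample1",
                        c("reads_1.fastq.gz", "reads_2.fastq.gz"),
                        n_bootstraps = 10)
print(cmd)
```

A string built this way could be pasted into a shell script when Salmon is to be run outside of R.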

Value

readSalmonResultsOneSample returns a list with two elements: (1) a data.frame, abundance, with columns 'target_id' (feature, transcript, gene etc), 'length' (feature length), 'est_counts' (estimated feature counts) and 'tpm' (transcripts per million); (2) a list, run_info, with metadata about the Salmon run that generated the results. This metadata includes the number of reads processed, the mapping percentage, the library type used for the RNA-sequencing (with details about the number of reads that did not match the given or inferred library type), details about the Salmon command used to generate the results, and so on.
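The shape of the abundance data.frame can be illustrated with base R alone. The sketch below writes a toy quant.sf (made-up values, using Salmon's standard column names) and reshapes it into the four columns named above. The column mapping shown here is an assumption for illustration and is not taken from scater's source.

```r
# Toy quant.sf with made-up values, written to a temporary directory
demo_dir <- file.path(tempdir(), "salmon_quant_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Name\tLength\tEffectiveLength\tTPM\tNumReads",
             "tx1\t1000\t850.0\t600000\t30",
             "tx2\t2000\t1850.0\t400000\t45"),
           file.path(demo_dir, "quant.sf"))

# Read the tab-delimited file; the first line holds the column headers
quant <- read.delim(file.path(demo_dir, "quant.sf"), stringsAsFactors = FALSE)

# Assumed mapping from Salmon's column names to those described above
abundance <- data.frame(target_id  = quant$Name,
                        length     = quant$Length,
                        est_counts = quant$NumReads,
                        tpm        = quant$TPM,
                        stringsAsFactors = FALSE)
print(abundance)
```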

readSalmonResults returns a SingleCellExperiment object.

runSalmon returns a list containing three elements for each sample for which feature abundance has been quantified: (1) salmon_call, the call used for Salmon; (2) salmon_log, the log generated by Salmon; and (3) output_dir, the directory in which the Salmon results can be found.
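As a rough sketch of how such a per-sample list might be used after a dry run, the code below pulls out the output directories and writes the commands to a shell script. The mock list is a placeholder built by hand: naming the list by sample and the truncated command strings are assumptions, not output produced by runSalmon.

```r
# Mock of the structure described above; all contents are placeholders
salmon_log <- list(
  sample1 = list(salmon_call = "salmon quant ...",
                 salmon_log  = character(0),
                 output_dir  = "output_sample1"),
  sample2 = list(salmon_call = "salmon quant ...",
                 salmon_log  = character(0),
                 output_dir  = "output_sample2")
)

# One output directory and one command per sample
dirs  <- vapply(salmon_log, function(x) x$output_dir, character(1))
calls <- vapply(salmon_log, function(x) x$salmon_call, character(1))

# Write the commands to a script that could be run outside of R
script <- file.path(tempdir(), "run_salmon.sh")
writeLines(c("#!/bin/sh", unname(calls)), script)

print(dirs)
```

The dirs and names(dirs) vectors have the form expected by the directories and samples arguments of readSalmonResults.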

Examples

## Not run: 
# If Salmon results are in the directory "output", then call:
readSalmonResultsOneSample("output")

## End(Not run)
## Not run: 
## Define output directories in a vector called here "Salmon_dirs"
## and sample names as "Salmon_samples"
sceset <- readSalmonResults(samples = Salmon_samples,
                            directories = Salmon_dirs)

## End(Not run)
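
The effect of logExprsOffset can be seen on a small matrix. The values below are invented for illustration, and applying the transformation to TPM values is an assumption made here; the sketch only shows why a positive offset keeps zero abundances finite on the log2 scale.

```r
# Toy abundance matrix (made-up values): rows are features, columns samples
tpm <- matrix(c(0, 10, 3,
                0,  1, 7),
              nrow = 3,
              dimnames = list(c("tx1", "tx2", "tx3"),
                              c("sample1", "sample2")))

logExprsOffset <- 1

# Adding the offset first means zeros map to log2(1) = 0 instead of -Inf
log_tpm <- log2(tpm + logExprsOffset)
print(log_tpm)
```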

## Not run: 
## If in Salmon's 'test' directory, then try these calls:
## Generate 'targets.txt' file:
write.table(data.frame(Sample="sample1", File1="reads_1.fastq.gz", File2="reads_2.fastq.gz"),
 file="targets.txt", quote=FALSE, row.names=FALSE, sep="\t")
Salmon_log <- runSalmon("targets.txt", "transcripts.idx", single_end=FALSE,
         output_prefix="output", verbose=TRUE, n_bootstrap_samples=10,
         dry_run = FALSE)

## End(Not run)
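
For single-end data the targets file has two columns rather than three. The following sketch writes such a file to a temporary location and reads it back. The sample and FASTQ file names are hypothetical, and the output directory names are derived from the rule described for output_prefix (prefix, underscore, sample name).

```r
# Hypothetical sample names and FASTQ files for a single-end experiment
targets <- data.frame(Sample = c("sample1", "sample2"),
                      File1  = c("sample1.fastq.gz", "sample2.fastq.gz"),
                      stringsAsFactors = FALSE)

# Tab-delimited, with a header row and without quotes or row names
targets_path <- file.path(tempdir(), "targets_single_end.txt")
write.table(targets, file = targets_path, quote = FALSE,
            row.names = FALSE, sep = "\t")

# Read the file back to confirm its layout
check <- read.delim(targets_path, stringsAsFactors = FALSE)

# Output folders implied by the default prefix "output"
out_dirs <- paste("output", check$Sample, sep = "_")
print(out_dirs)
```

A file like this would be passed as targets_file together with single_end = TRUE.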

[Package scater version 1.8.4 Index]