Inferring Somatic Signatures from Single Nucleotide Variant Calls
Table of Contents
- 1. Motivation: The Concept Behind Mutational Signatures
- 2. Methodology: From Mutations to Somatic Signatures
- 3. Workflow and Implementation: Analysis with the SomaticSignatures Package
- 4. Use case: Estimating Somatic Signatures from TCGA WES Studies
- 4.1. Data: Preproccessing of the TCGA WES Studies
- 4.2. Motifs: Extracting the Sequence Context of Somatic Variants
- 4.3. Decomposition: Inferring Somatic Signatures
- 4.4. Visualization: Exploration of Signatures and Samples
- 4.5. Extensions: Normalization of Sequence Motif Frequencies and Batch Effects
- 5. Alternatives: Inferring Somatic Signatures with Different Approaches
- 6. Frequently Asked Questions
- 7. References
- 8. Session Info
1 Motivation: The Concept Behind Mutational Signatures
Recent publications introduced the concept of identifying mutational signatures from cancer sequencing studies and linked them to potential mutation generation processes [6] [14] [11]. Conceptually, this relates somatically occurring single nucleotide variants (SNVs) to the surrounding sequence which will be referred to as mutational or somatic motifs in the following. Based on the frequency of the motifs occurring in multiple samples, these can be decomposed mathematically into so called mutational signatures. In case of the investigation of tumors, the term somatic signatures will be used here to distinguish them from germline mutations and their generating processes.
The SomaticSignatures
package provides an efficient and user-friendly
implementation for the extraction of somatic motifs based on a list of
somatically mutated genomic sites and the estimation of somatic signatures with
a number of statistical approaches.
2 Methodology: From Mutations to Somatic Signatures
The basic idea of somatic signatures is composed of two parts:
Firstly, each somatic mutation is described in relation of the sequence context
in which it occurs. As an example, consider a SNV, resulting in the alteration
from A
in the normal to G
in the tumor sample, that is embedded in the sequence
context TAC
. Thus, the somatic motif can be written as TAC>TGC
or T.C
A>G
. The frequency of these motifs across multiple samples is then represented
as a matrix \(M_{ij}\), where \(i\) counts over the motifs and \(j\) over the samples.
In a second step, the matrix \(M\) is numerically decomposed into two matrices \(W\) and \(H\)
$$M_{ij} = \sum_{k=1}^{R} W_{ik} H_{jk}$$
for a fixed number \(R\) of signatures. While \(W\) describes the composition of each signature in term of somatic motifs, \(H\) shows the contribution of the signature to the alterations present in each sample.
3 Workflow and Implementation: Analysis with the SomaticSignatures Package
The SomaticSignatures
package offers a framework for inferring signatures of
SNVs in a user-friendly and efficient manner for large-scale data sets. A tight
integration with standard data representations of the Bioconductor
project
[1] was a major design goal. Further, it extends
the selection of multivariate statistical methods for the matrix decomposition
and allows a simple visualization of the results.
For a typical workflow, a set of variant calls and the reference sequence are
needed. Ideally, the SNVs are represented as a VRanges
object with the
genomic location as well as reference and alternative allele defined. The
reference sequence can be, for example, a FaFile
object, representing an
indexed FASTA file, a BSgenome
object, or a GmapGenome
object.
Alternatively, we provide functions to extract the relevant information from
other sources of inputs. At the moment, this covers the MuTect
[12] variant caller and the h5vc package
[15] [9].
Generally, the individual steps of the analysis can be summarized as:
- The somatic motifs for each variant are retrieved from the reference sequence
with the
mutationContext
function and converted to a matrix representation with themutationContextMatrix
function. - Somatic signatures are estimated with a method of choice (currently
nmfSignatures
,pcaSignatures
, orkmeansSignatures
). - The somatic signatures and their representation in the samples are assessed with a set of accessor and plotting functions.
To decompose \(M\), the SomaticSignatures
package implements three methods:
- Non-negative matrix factorization (NMF)
- The NMF decomposes \(M\) with the constraint of positive components in \(W\) and \(H\) [4]. The method was used [6] for the identification of mutational signatures, and can be computationally expensive for large data sets.
- Principal component analysis (PCA)
- The PCA employs the eigenvalue decomposition and is therefore suitable for large data sets [2]. While this is related to the NMF, no constraint on the sign of the elements of \(W\) and \(H\) exists.
- K-means clustering
- In contrast to the NMF and PCA, the k-means clustering assigns each sample to a single somatic signature, such that \(H\) can only contain the value 0 and 1. This approach is suited for large data sets.
4 Use case: Estimating Somatic Signatures from TCGA WES Studies
In the following, the concept of somatic signatures and the steps for inferring
these from an actual biological data set are shown. For the example, somatic
variant calls from whole exome sequencing (WES) studies from The Cancer Genome
Atlas (TCGA) project will be used, which are part of the
SomaticCancerAlterations
package.
library(SomaticSignatures)
library(GenomicRanges) library(VariantAnnotation) library(ggplot2) library(stringr)
library(SomaticCancerAlterations) library(BSgenome.Hsapiens.UCSC.hg19)
4.1 Data: Preproccessing of the TCGA WES Studies
The SomaticCancerAlterations
package provides the somatic SNV calls for eight
WES studies, each investigating a different cancer type. The metadata
summarizes the biological and experimental settings of each study.
sca_metadata = scaMetadata() print(sca_metadata)
Cancer_Type Center NCBI_Build Sequence_Source Sequencing_Phase Sequencer Number_Samples gbm_tcga GBM broad.mit.edu 37 WXS Phase_I Illumina GAIIx 291 hnsc_tcga HNSC broad.mit.edu 37 Capture Phase_I Illumina GAIIx 319 kirc_tcga KIRC broad.mit.edu 37 Capture Phase_I Illumina GAIIx 297 luad_tcga LUAD broad.mit.edu 37 WXS Phase_I Illumina GAIIx 538 lusc_tcga LUSC broad.mit.edu 37 WXS Phase_I Illumina GAIIx 178 ov_tcga OV broad.mit.edu 37 WXS Phase_I Illumina GAIIx 142 skcm_tcga SKCM broad.mit.edu 37 Capture Phase_I Illumina GAIIx 266 thca_tcga THCA broad.mit.edu 37 WXS Phase_I Illumina GAIIx 406 Number_Patients Cancer_Name gbm_tcga 291 Glioblastoma multiforme hnsc_tcga 319 Head and Neck squamous cell carcinoma kirc_tcga 293 Kidney Chromophobe luad_tcga 519 Lung adenocarcinoma lusc_tcga 178 Lung squamous cell carcinoma ov_tcga 142 Ovarian serous cystadenocarcinoma skcm_tcga 264 Skin Cutaneous Melanoma thca_tcga 403 Thyroid carcinoma
In this example, all mutational calls of a study will be pooled together, in
order to find signatures related to a specific cancer type. The data of all
studies is loaded and merged into a single GRanges
object, with each entry
describing a somatic variant call. Further on, only SNVs located on the human
autosomes will be considered. For later analyzes, each variant is also
associated with the study it originated from.
sca_all = scaLoadDatasets() sca_merge = unlist(sca_all) short_names = str_split_fixed(rownames(sca_metadata), "_", 2)[, 1] names(sca_merge) = sca_merge$study = factor(rep(short_names, times = elementLengths(sca_all))) sca_merge = sca_merge[sca_merge$Variant_Type %in% "SNP"] sca_merge = keepSeqlevels(sca_merge, hsAutosomes())
To get a first impression of the data, we count the number of somatic variants per study and their predicted effects.
sort(table(sca_merge$study), decreasing = TRUE)
luad skcm hnsc lusc kirc gbm thca ov 208724 200589 67125 61485 24158 19938 6716 5872
sort(table(sca_merge$Variant_Classification), decreasing = TRUE)
Missense_Mutation Silent Nonsense_Mutation Splice_Site RNA 377800 163535 27299 13934 11285 Nonstop_Mutation Translation_Start_Site Intron IGR 3'UTR 441 270 33 5 3 5'Flank 5'UTR Frame_Shift_Del Frame_Shift_Ins In_Frame_Del 1 1 0 0 0 In_Frame_Ins 0
The starting point of the analysis is a VRanges
object which describes the
somatic variants in terms of their genomic locations as well as reference and
alternative alleles. For more details about this class and how to construct it,
please see the documentation of the VariantAnnotation
package
[5]. Since the genomic positions are given
in the NCBI notation and the references used later are in UCSC notation, the
functions ucsc
and ncbi
are used to easily switch between the two notations.
sca_vr = VRanges(seqnames(sca_merge), ranges(sca_merge), ref = sca_merge$Reference_Allele, alt = sca_merge$Tumor_Seq_Allele2, seqinfo = seqinfo(sca_merge)) sca_vr = ucsc(sca_vr) head(sca_vr, 3)
VRanges with 3 ranges and 0 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepthgbm chr1 [887446, 887446] + G A gbm chr1 [909247, 909247] + C T gbm chr1 [978952, 978952] + C T sampleNames softFilterMatrix gbm gbm gbm --- seqlengths: chr1 chr2 chr3 chr4 chr5 chr6 ... chr18 chr19 chr20 chr21 chr22 249250621 243199373 198022430 191154276 180915260 171115067 ... 78077248 59128983 63025520 48129895 51304566 hardFilters: NULL
4.2 Motifs: Extracting the Sequence Context of Somatic Variants
In a first step, the sequence motif for each variant is extracted based on the
reference sequence. Here, the BSgenomes
object for the human hg19 reference
is used. However, all objects with a defined getSeq
method can serve as the
reference, e.g. an indexed FASTA file. Additionally, we transform all motifs to
have a pyrimidine base (C
or T
) as a reference base
[14].
sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.UCSC.hg19, unify = TRUE)
sca_motifs$study = sca_merge$study head(sca_motifs, 3)
VRanges with 3 ranges and 3 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepthgbm chr1 [887446, 887446] + G A gbm chr1 [909247, 909247] + C T gbm chr1 [978952, 978952] + C T sampleNames softFilterMatrix | alteration context study | gbm | CT G.G gbm gbm | CT A.G gbm gbm | CT G.G gbm --- seqlengths: chr1 chr2 chr3 chr4 chr5 chr6 ... chr18 chr19 chr20 chr21 chr22 249250621 243199373 198022430 191154276 180915260 171115067 ... 78077248 59128983 63025520 48129895 51304566 hardFilters: NULL
To continue with the estimation of the somatic signatures, the matrix \(M\) of the
form {motifs × studies} is constructed. The normalize
argument specifies
that frequencies rather than the actual counts are returned.
sca_occurrence = mutationContextMatrix(sca_motifs, group = "study", normalize = TRUE) head(round(sca_occurrence, 4))
gbm hnsc kirc luad lusc ov skcm thca CA A.A 0.0083 0.0098 0.0126 0.0200 0.0165 0.0126 0.0014 0.0077 CA A.C 0.0093 0.0082 0.0121 0.0217 0.0156 0.0192 0.0009 0.0068 CA A.G 0.0026 0.0061 0.0046 0.0144 0.0121 0.0060 0.0004 0.0048 CA A.T 0.0057 0.0051 0.0070 0.0134 0.0100 0.0092 0.0007 0.0067 CA C.A 0.0075 0.0143 0.0215 0.0414 0.0390 0.0128 0.0060 0.0112 CA C.C 0.0075 0.0111 0.0138 0.0415 0.0275 0.0143 0.0018 0.0063
The observed occurrence of the motifs, also termed somatic spectrum, can be visualized across studies, which gives a first impression of the data. The distribution of the motifs clearly varies between the studies.
plotSamplesObserved(sca_motifs)
4.3 Decomposition: Inferring Somatic Signatures
The somatic signatures can be estimated with each of the statistical methods
implemented in the package. Here, we will use the NMF
and PCA
, and compare
the results. Prior to the estimation, the number \(R\) of signatures to obtain has to
be fixed; in this example, the data will be decomposed into 5 signatures.
n_sigs = 5 sigs_nmf = nmfSignatures(sca_occurrence, r = n_sigs) sigs_pca = pcaSignatures(sca_occurrence, r = n_sigs)
The results contains the decomposed matrices stored in a list and can be accessed using standard R accessor functions.
names(sigs_nmf)
[1] "w" "h" "v" "raw"
sapply(sigs_nmf, dim)
$w [1] 96 5 $h [1] 8 5 $v [1] 96 8 $raw [1] 96 8 5
head(sigs_nmf$w, 3)
S1 S2 S3 S4 S5 CA A.A 5.107e-06 2.375e-16 2.047e-13 1.279 0.52570 CA A.C 6.276e-02 2.220e-16 2.220e-16 1.320 0.47930 CA A.G 1.212e-14 1.962e-07 2.792e-03 1.012 0.06564
head(sigs_nmf$h, 3)
S1 S2 S3 S4 S5 gbm 0.037329 0.002054 0.002269 0.002543 0.007775 hnsc 0.008512 0.008109 0.010338 0.006018 0.004097 kirc 0.008999 0.003468 0.001181 0.004720 0.015132
4.4 Visualization: Exploration of Signatures and Samples
To explore the results for the TCGA data set, we will use the plotting
functions. All figures are generated with the ggplot2
package, and thus,
their properties and appearances can also be modified at a later stage.
Focusing on the results of the NMF first, the five somatic signatures (named S1 to S5) can be visualized either as a heatmap or as a barchart.
plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")
plotSignatures(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Barchart")
Each signature represents different properties of the somatic spectrum observed
in the data. While signature S1 is mainly characterized by selective C>T
alterations,
others as S4 and S5 show a broad distribution across the motifs.
In addition, the contribution of the signatures in each study can be represented with the same sets of plots. Signature S1 and S3 are strongly represented in the GBM and SKCM study, respectively. Other signatures show a weaker association with a single cancer type.
plotSampleMap(sigs_nmf)
plotSamples(sigs_nmf)
In the same way as before, the results of the PCA can be visualized. In contrast to the NMF, the signatures also contain negative values, indicating the depletion of a somatic motif.
plotSignatureMap(sigs_pca) + ggtitle("Somatic Signatures: PCA")
plotSignatures(sigs_pca)
plotSampleMap(sigs_pca)
plotSamples(sigs_pca)
Comparing the results of the two methods, we can see similar characteristics between the sets of signatures, for example S1 of the NMF and S2 of the PCA.
4.5 Extensions: Normalization of Sequence Motif Frequencies and Batch Effects
When investigating somatic signatures between samples from different studies,
corrections for technical confounding factors should be considered. In our use
case of the TCGA WES studies, this is of minor influence due to
similar sequencing technology and variant calling methods across the studies.
Approaches for the identification of so termed batch effects have been proposed
[3] [8] and could be adapted to the
setting of somatic signatures with existing implementations (the sva
and
leapp
packages). While this correction is not performed here, we exemplify
the usage by taking the different sequencing technologies of the studies into
account.
library(sva) library(stringr) df = as(sca_metadata, "data.frame") ## sample x covariable pheno = data.frame(s = unlist(df[, "Sequence_Source"]), c = unlist(df[, "Cancer_Type"])) rownames(pheno) = str_split_fixed(rownames(pheno), "_", 2)[, 1] mod = model.matrix(~s + c, data = pheno) mod0 = model.matrix(~c, data = pheno) sv = sva(sca_occurrence, mod, mod0, method = "irw")
If comparisons are performed across samples or studies with different capture
targets, for example by comparing whole exome with whole genome sequencing,
further corrections for the frequency of sequence motifs can be taken into
account. The kmerFrequency
function provides the basis for calculating the
occurrence of k-mers over a set of ranges of a reference sequence.
As an example, we compute the frequency of 3-mers for the human chromosome 1, based on a sample of 100'000 locations. Analogously, the k-mer occurrence across the human exome can be obtained easily.
k = 3 n = 1e+05 chrs = "chr1" chr1_ranges = as(seqinfo(BSgenome.Hsapiens.UCSC.hg19), "GRanges") chr1_ranges = keepSeqlevels(chr1_ranges, chrs) k3_chr1 = kmerFrequency(BSgenome.Hsapiens.UCSC.hg19, n, k, chr1_ranges) k3_chr1
seq AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG 0.03375 0.01361 0.01778 0.02239 0.01784 0.01029 0.00199 0.01511 0.02069 0.01273 0.01685 0.01493 0.01699 0.01152 0.01547 ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC 0.02121 0.01713 0.01346 0.01949 0.01601 0.01771 0.01254 0.00266 0.01761 0.00209 0.00209 0.00266 0.00231 0.01126 0.01655 CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA 0.01889 0.01805 0.01716 0.00873 0.01596 0.01170 0.01336 0.01116 0.00229 0.01294 0.01404 0.01165 0.01264 0.01113 0.01008 GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT 0.00860 0.01299 0.01304 0.01793 0.01020 0.01150 0.01772 0.01771 0.01483 0.00207 0.01922 0.01789 0.01367 0.01735 0.01777 TTA TTC TTG TTT NNN 0.01798 0.01763 0.01681 0.03180 0.09679
With the normalizeMotifs
function, the frequency of motifs can be adjusted.
Here, we will transform our results of the TCGA WES studies to have the same
motif distribution as of a whole-genome analysis. The kmers
dataset contains
the estimated frequency of 3-mers across the human genome and exome.
head(sca_occurrence)
gbm hnsc kirc luad lusc ov skcm thca CA A.A 0.008326 0.009818 0.012584 0.01998 0.01648 0.012602 0.0013660 0.007743 CA A.C 0.009279 0.008179 0.012128 0.02166 0.01558 0.019244 0.0009223 0.006849 CA A.G 0.002558 0.006078 0.004636 0.01440 0.01205 0.005960 0.0004387 0.004765 CA A.T 0.005668 0.005050 0.006954 0.01336 0.01002 0.009196 0.0006880 0.006700 CA C.A 0.007473 0.014257 0.021525 0.04136 0.03900 0.012772 0.0060472 0.011167 CA C.C 0.007523 0.011084 0.013784 0.04153 0.02754 0.014305 0.0018296 0.006254
data(kmers) norms = k3wg/k3we head(norms)
seq AAA AAC AAG AAT ACA ACC 1.4519 1.1023 1.0248 1.5076 1.1089 0.8911
sca_norm = normalizeMotifs(sca_occurrence, norms) head(sca_norm)
gbm hnsc kirc luad lusc ov skcm thca CA A.A 0.009233 0.010887 0.013954 0.022155 0.018270 0.013975 0.0015148 0.008586 CA A.C 0.008268 0.007288 0.010807 0.019296 0.013884 0.017147 0.0008218 0.006103 CA A.G 0.001456 0.003459 0.002638 0.008193 0.006858 0.003392 0.0002497 0.002711 CA A.T 0.006232 0.005554 0.007647 0.014694 0.011017 0.010113 0.0007565 0.007368 CA C.A 0.006350 0.012114 0.018289 0.035144 0.033139 0.010853 0.0051382 0.009489 CA C.C 0.005232 0.007707 0.009585 0.028882 0.019148 0.009948 0.0012723 0.004349
5 Alternatives: Inferring Somatic Signatures with Different Approaches
For the identification of somatic signatures, other methods and implementations exist. The original framework [6] [11] proposed for this is based on the NMF and available for the Matlab programming language [7]. In extension, a probabilistic approach based on Poisson processes has been proposed [13] and implemented [10].
6 Frequently Asked Questions
6.1 Getting help
We welcome emails with questions or suggestions about our software, and want to ensure that we eliminate issues if and when they appear. We have a few requests to optimize the process:
- All emails and follow-up questions should take place over the Bioconductor mailing list, which serves as a repository of information.
- The subject line should contain SomaticSignatures and a few words describing the problem. First search the Bioconductor mailing list, for past threads which might have answered your question.
- If you have a question about the behavior of a function, read the sections of
the manual page for this function by typing a question mark and the function
name, e.g.
?mutationContext
. Additionally, read through the vignette to understand the interplay between different functions of the package. We spend a lot of time documenting individual functions and the exact steps that the software is performing. - Include all of your R code related to the question you are asking.
- Include complete warning or error messages, and conclude your message with the
full output of
sessionInfo()
.
6.2 Installing the package
Before you want to install the SomaticSignatures
package, please ensure that
you have the latest version of R
and Bioconductor
installed. For details on
this, please have a look at the help packages for R and Bioconductor. Then you
can open R
and run the following commands which will install the latest
release version of SomaticSignatures
:
source("http://bioconductor.org/biocLite.R") biocLite("SomaticSignatures")
6.3 Working with VRanges
A central object in the workflow of SomaticSignatures
is the VRanges
class
which is part of the VariantAnnotation
package. It builds upon the commonly
used GRanges
class of the GenomicRanges
package. Essentially, each row
represents a variant in terms of its genomic location as well as its reference
and alternative allele.
library(VariantAnnotation)
There are multiple ways of converting its own variant calls into a VRanges
object. One can for example import them from a VCF
file with the readVcf
function or employ the readMutect
function for importing variant calls from
the MuTect
caller directly. Further, one can also construct it from any other
format in the form of:
vr = VRanges(seqnames = "chr1", ranges = IRanges(start = 1000, width = 1), ref = "A", alt = "C") vr
VRanges with 1 range and 0 metadata columns: seqnames ranges strand ref alt totalDepth refDepth altDepth[1] chr1 [1000, 1000] + A C sampleNames softFilterMatrix [1] --- seqlengths: chr1 NA hardFilters: NULL
7 References
[1] | Robert C. Gentleman, Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J. Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean YH Yang, and Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, September 2004. PMID: 15461798. [ DOI | http ] |
[2] | Wolfram Stacklies, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. pcaMethodsa bioconductor package providing PCA methods for incomplete data. Bioinformatics, 23(9):1164-1167, May 2007. PMID: 17344241. [ DOI | http ] |
[3] | Jeffrey T Leek and John D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet, 3(9):e161, September 2007. [ DOI | http ] |
[4] | Renaud Gaujoux and Cathal Seoighe. A flexible r package for nonnegative matrix factorization. BMC Bioinformatics, 11(1):367, July 2010. PMID: 20598126. [ DOI | http ] |
[5] | Valerie Obenchain, Martin Morgan, and Michael Lawrence. VariantAnnotation: annotation of genetic variants, 2011. [ .html ] |
[6] | Serena Nik-Zainal, Ludmil B. Alexandrov, David C. Wedge, Peter Van Loo, Christopher D. Greenman, Keiran Raine, David Jones, Jonathan Hinton, John Marshall, Lucy A. Stebbings, Andrew Menzies, Sancha Martin, Kenric Leung, Lina Chen, Catherine Leroy, Manasa Ramakrishna, Richard Rance, King Wai Lau, Laura J. Mudie, Ignacio Varela, David J. McBride, Graham R. Bignell, Susanna L. Cooke, Adam Shlien, John Gamble, Ian Whitmore, Mark Maddison, Patrick S. Tarpey, Helen R. Davies, Elli Papaemmanuil, Philip J. Stephens, Stuart McLaren, Adam P. Butler, Jon W. Teague, Göran Jönsson, Judy E. Garber, Daniel Silver, Penelope Miron, Aquila Fatima, Sandrine Boyault, Anita Langerød, Andrew Tutt, John W.M. Martens, Samuel A.J.R. Aparicio, Ake Borg, Anne Vincent Salomon, Gilles Thomas, Anne-Lise Børresen-Dale, Andrea L. Richardson, Michael S. Neuberger, P. Andrew Futreal, Peter J. Campbell, and Michael R. Stratton. Mutational processes molding the genomes of 21 breast cancers. Cell, 149(5):979-993, May 2012. [ DOI | http ] |
[7] | Ludmil Alexandrov. WTSI mutational signature framework, October 2012. [ http ] |
[8] | Yunting Sun, Nancy R. Zhang, and Art B. Owen. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. The Annals of Applied Statistics, 6(4):1664-1688, December 2012. Zentralblatt MATH identifier: 06141543; Mathematical Reviews number (MathSciNet): MR3058679. [ DOI | http ] |
[9] | Paul Theodor Pyl. h5vc: Managing alignment tallies using a hdf5 backend, 2013. [ .html ] |
[10] | Andrej Fischer. EMu: expectation-maximisation inference of mutational signatures, 2013. [ http ] |
[11] | Ludmil B. Alexandrov, Serena Nik-Zainal, David C. Wedge, Peter J. Campbell, and Michael R. Stratton. Deciphering signatures of mutational processes operative in human cancer. Cell Reports, 3(1):246-259, January 2013. [ DOI | http ] |
[12] | Kristian Cibulskis, Michael S. Lawrence, Scott L. Carter, Andrey Sivachenko, David Jaffe, Carrie Sougnez, Stacey Gabriel, Matthew Meyerson, Eric S. Lander, and Gad Getz. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, advance online publication, February 2013. [ DOI | .html ] |
[13] | Andrej Fischer, Christopher Jr Illingworth, Peter J Campbell, and Ville Mustonen. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome biology, 14(4):R39, April 2013. PMID: 23628380. [ DOI ] |
[14] | Ludmil B. Alexandrov, Serena Nik-Zainal, David C. Wedge, Samuel A. J. R. Aparicio, Sam Behjati, Andrew V. Biankin, Graham R. Bignell, Niccolò Bolli, Ake Borg, Anne-Lise Børresen-Dale, Sandrine Boyault, Birgit Burkhardt, Adam P. Butler, Carlos Caldas, Helen R. Davies, Christine Desmedt, Roland Eils, Jórunn Erla Eyfjörd, John A. Foekens, Mel Greaves, Fumie Hosoda, Barbara Hutter, Tomislav Ilicic, Sandrine Imbeaud, Marcin Imielinsk, Natalie Jäger, David T. W. Jones, David Jones, Stian Knappskog, Marcel Kool, Sunil R. Lakhani, Carlos López-Otín, Sancha Martin, Nikhil C. Munshi, Hiromi Nakamura, Paul A. Northcott, Marina Pajic, Elli Papaemmanuil, Angelo Paradiso, John V. Pearson, Xose S. Puente, Keiran Raine, Manasa Ramakrishna, Andrea L. Richardson, Julia Richter, Philip Rosenstiel, Matthias Schlesner, Ton N. Schumacher, Paul N. Span, Jon W. Teague, Yasushi Totoki, Andrew N. J. Tutt, Rafael Valdés-Mas, Marit M. van Buuren, Laura van t Veer, Anne Vincent-Salomon, Nicola Waddell, Lucy R. Yates, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, Icgc PedBrain, Jessica Zucman-Rossi, P. Andrew Futreal, Ultan McDermott, Peter Lichter, Matthew Meyerson, Sean M. Grimmond, Reiner Siebert, Elías Campo, Tatsuhiro Shibata, Stefan M. Pfister, Peter J. Campbell, and Michael R. Stratton. Signatures of mutational processes in human cancer. Nature, 500(7463):415-421, August 2013. [ DOI | .html ] |
[15] | Paul Theodor Pyl, Julian Gehring, Bernd Fischer, and Wolfgang Huber. h5vc: Scalable nucleotide tallies with HDF5. Bioinformatics, page btu026, January 2014. PMID: 24451629. [ DOI | http ] |
8 Session Info
R version 3.1.0 (2014-04-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] fastICA_1.2-0 BSgenome.Hsapiens.UCSC.hg19_1.3.99 BSgenome_1.32.0 [4] SomaticCancerAlterations_1.0.0 stringr_0.6.2 ggplot2_0.9.3.1 [7] VariantAnnotation_1.10.1 Rsamtools_1.16.0 Biostrings_2.32.0 [10] XVector_0.4.0 GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 [13] IRanges_1.22.6 SomaticSignatures_1.0.1 bigmemory_4.4.6 [16] BH_1.54.0-2 bigmemory.sri_0.1.2 Biobase_2.24.0 [19] BiocGenerics_0.10.0 knitr_1.5 loaded via a namespace (and not attached): [1] AnnotationDbi_1.26.0 BBmisc_1.6 BatchJobs_1.2 BiocParallel_0.6.0 [5] DBI_0.2-7 Formula_1.1-1 GenomicAlignments_1.0.1 GenomicFeatures_1.16.0 [9] Hmisc_3.14-4 MASS_7.3-33 NMF_0.20.5 RColorBrewer_1.0-5 [13] RCurl_1.95-4.1 RSQLite_0.11.4 Rcpp_0.11.1 XML_3.98-1.1 [17] biomaRt_2.20.0 biovizBase_1.12.1 bitops_1.0-6 brew_1.0-6 [21] cluster_1.15.2 codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0 [25] digest_0.6.4 doParallel_1.0.8 evaluate_0.5.5 exomeCopy_1.10.0 [29] fail_1.2 foreach_1.4.2 formatR_0.10 ggbio_1.12.3 [33] grid_3.1.0 gridBase_0.4-7 gridExtra_0.9.1 gtable_0.1.2 [37] gtools_3.4.0 highr_0.3 iterators_1.0.7 labeling_0.2 [41] lattice_0.20-29 latticeExtra_0.6-26 markdown_0.7 mime_0.1.1 [45] munsell_0.4.2 pcaMethods_1.54.0 pkgmaker_0.20 plyr_1.8.1 [49] proto_0.3-10 registry_0.2 reshape2_1.4 rngtools_1.2.4 [53] rtracklayer_1.24.0 scales_0.2.4 sendmailR_1.1-2 splines_3.1.0 [57] stats4_3.1.0 survival_2.37-7 tools_3.1.0 xtable_1.7-3 [61] zlibbioc_1.10.0