scoreInvHap 1.0.0
scoreInvHap infers haplotype inversion’s status of a set of samples using SNP data. This method computes a similarity score between the sample SNPs in your cohort and the reference haplotypes. Samples are classified into the haplotype having the highest score. There are other approaches to perform this task such us inveRsion, invClust or PFIDO that are based on multivariate clustering procedures. However, these approaches may have several limitation that are addressed by using scoreInvHap. These limitations include:
scoreInvHap overcomes these difficulties by using a set of reference genotypes that has been used to properly characterize the inversion of interest. Genomic information such as linkage desequillibrium (R2) and hetgerozygous sequence has been determined for each of the region of interest (ROI) and are used to create the score that discriminate the inversion status.
The package can be loaded by typing:
library(scoreInvHap)
Calling procedure requires three objects that characterize the inversion of interest:
scoreInvHap already contains these required objects of four inversions. Two of them are well known and characterized genomic inversions: 8p23 (ROIno.8.3) and 17q21.31 (ROIno.17.16). These inversions are annotated using the ROI numbers used in Sander’s paper. We have also included two inversions that are described in invFEST database: 7p11.2 (HsInv0286) and Xq13.2 (HsInv0396). These objects have been created using VCF phase 3 data of 1000 Genomes Project. The code used to generate them can be found in the /inst/scripts
folder of the package. This code can be modified to create the required files of any other inversion of interest. Required information of other inversions will be incorporated in the future when checked in our group or described in the literature.
As previously stated, the method uses the frequency of the SNP genotypes of each SNP located into the inversion region. This information is provided for he different haplotype populations. This information is encoded in an object called Refs
that can be loaded by typing:
data("Refs")
names(Refs)
## [1] "ROIno.8.3" "ROIno.17.16" "HsInv0286" "HsInv0396"
Each element of this list is a reference for one of the four available inversions. For instance, the reference of inversion ROIno.8.3 can be obtained by:
ref <- Refs$ROIno.8.3
class(ref)
## [1] "list"
ref[1:2]
## $rs141039449
##
## haplos CC CG GG
## NI/NI 0.9310345 0.06321839 0.005747126
## NI/I 0.8595745 0.12765957 0.012765957
## I/I 0.6914894 0.25531915 0.053191489
##
## $rs138092889
##
## haplos AA AG GG
## NI/NI 0.005747126 0.02298851 0.9712644
## NI/I 0.012765957 0.05531915 0.9319149
## I/I 0.042553191 0.19148936 0.7659574
This object is a list of matrices containing the frequency of each genotype (columns) in each inversion genotype (rows). Each component is named with the SNP id contained in the ROI. Notice that the alleles of the heteryzogous genotypes are alphabetically ordered.
The second object required is a vector containing the R2 between the inversion status and the SNPs genotypes. The SNPs with higher R2 will have more influence when computing the similarity score. We have formatted this object as a numeric vector, named with the SNPs ids. This information is provided in the object called SNPsR2
. This is a list that contains the R2 of the four available inversions. For instance, this information for the inversion ROIno.8.3 can be get by:
data("SNPsR2")
names(SNPsR2)
## [1] "ROIno.8.3" "ROIno.17.16" "HsInv0286" "HsInv0396"
R2s <- SNPsR2$ROIno.8.3
head(R2s)
## rs141039449 rs138092889 rs138217047 rs143682252 rs550257716
## 0.0378562621 0.0289266440 0.0264322431 0.0264322431 0.0005282751
## rs2955571
## 0.0004570336
The last required information is the heterozygous genotypes of the SNPs included in the references. This information is used to ensure that input SNPs have the same coded alleles than those used in the reference. This information can be retrieve from the object called hetRefs
that can be inspect by typing:
data("hetRefs")
names(hetRefs)
## [1] "ROIno.8.3" "ROIno.17.16" "HsInv0286" "HsInv0396"
hRefs <- hetRefs$ROIno.8.3
head(hRefs)
## rs141039449 rs138092889 rs138217047 rs143682252 rs550257716 rs2955571
## "CG" "AG" "AC" "AG" "CT" "AG"
In that case, hRefs
is a character vector contaning the heterozygous genotypes of the SNPs used as references in the ROI. It should be noticed that, in the heterozygous genotype, the alleles MUST BE ordered ALPHABETICALLY.
scoreInvHap deals with data either in a SNPMatrix
or as Bioconductor VCF
class. In the case of SNPMatrix
, a list with two elements is required:
We can load our data from a ped file or from plink binary format (.bed, .bim) to a SNPMatrix
using snpStats:
library(snpStats)
## From a bed
snps <- read.plink("example.bed")
## From a pedfile
snps <- read.pedfile("example.ped", snps = "example.map")
In both cases, snps is a list containing the elements genotypes and map. This object can be passed to scoreInvHap functions.
We can load a vcf file into a VCF
object using the VariantAnnotation. We have included a small vcf in scoreInvHap package to illustrate how to deal with this data. This file contains a subset of SNPs of 30 European individuals belonging to the 1000 Genomes project. All these SNPs are located at the region 7p11.2, the region annotated as HsInv0286 in the Sander’s paper. This vcf file contains imputed data. We can load the vcf with the following code:
library(VariantAnnotation)
vcf_file <- system.file("extdata", "example.vcf", package = "scoreInvHap")
vcf <- readVcf(vcf_file, "hg19")
vcf
## class: CollapsedVCF
## dim: 380 30
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 4 columns: AF, MAF, R2, ER2
## info(header(vcf)):
## Number Type Description
## AF 1 Float Estimated Alternate Allele Frequency
## MAF 1 Float Estimated Minor Allele Frequency
## R2 1 Float Estimated Imputation Accuracy
## ER2 1 Float Empirical (Leave-One-Out) R-square (available only fo...
## geno(vcf):
## SimpleList of length 3: GT, DS, GP
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]
## GP 3 Float Estimated Posterior Probabilities for Genotypes 0/0, ...
We can observe that the object vcf
contains 380 SNPs and 30 samples. Now we are ready to classify the samples with regard to the inversion HsInv0286 by using the function scoreInvHap
. This function requires four pieces of information:
SNPlist
),Refs
),SNPsR2
), andhetRefs
).The HsInv0286 inversion status of the 30 samples from 1000 genomes is obtained by:
res <- scoreInvHap(SNPlist = vcf, SNPsR2 = SNPsR2$HsInv0286,
hetRefs = hetRefs$HsInv0286, Refs = Refs$HsInv0286)
res
## scoreInvHapRes
## Samples: 30
## Genotypes' table:
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
## 2 0 4 5 5 0 4 3 4 3
## - Inversion genotypes' table:
## NN NI II
## 7 17 6
## - Inversion frequency: 48.33%
The results of scoreInvHap
are encapsulated in a object of class scoreInvHapes
. This object contains the classification of the samples and the simmilarity scores. We can obtain this data with the following getters:
# Get classification
head(classification(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## IbIb IbIb NaIb NbIb IaIa NbNb
## Levels: IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
# Get scores
head(scores(res))
## IaIa IaIb IbIb NaIa NaIb NaNa
## HG00096 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742
## HG00097 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742
## HG00099 0.6438462 0.7267291 0.7685422 0.7768338 0.9758600 0.7703418
## HG00100 0.3431981 0.4263283 0.4892990 0.2604141 0.4764038 0.3691768
## HG00101 0.9841178 0.7603894 0.7584714 0.7188997 0.6280646 0.7182172
## HG00102 0.2518848 0.1732271 0.3167020 0.1441663 0.1994766 0.3004813
## NaNb NbIa NbIb NbNb
## HG00096 0.4156252 0.3784514 0.5670885 0.3911247
## HG00097 0.4156252 0.3784514 0.5670885 0.3911247
## HG00099 0.5317241 0.3025523 0.5345369 0.2689791
## HG00100 0.6083141 0.6376241 0.6834476 0.5605652
## HG00101 0.3558992 0.4545927 0.4165003 0.3266407
## HG00102 0.5812806 0.5527802 0.5480628 0.7223922
We can retrieve other values that are useful to evaluate the quality of the classification. For each sample, we can obtain its highest similarity score and the difference between the highest similarity score and the second highest:
# Get max score
head(maxscores(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 0.9945596 0.9945596 0.9758600 0.6834476 0.9841178 0.7223922
# Get difference score
head(diffscores(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 0.2216032 0.2216032 0.1990263 0.0458235 0.2237283 0.1411116
A classification is good when the highest score is close to 1 and the other scores are small. This means that the samples SNPs are almost the same than in one of the reference haplotypes and that they are different to the other references. We use the difference between the highest score and the second highest score as a measure of how different is the highest score from the rest. We can have a visual evaluation of these quality parameters with the function plotScores
:
plotScores(res, pch = 16, main = "QC based on scores")
The horizontal line of the plot is a suggestion of the minimum difference between the highest and the second score that we accept. By default, this value is set to 0.1 but it can be changed with the parameter minDiff
. This default value equals to considering that the sample SNPs are at least 10% more similar to the top reference than to the other references. plotScores
relies on the base plot function, so we can pass additional parameters to customize the plot.
The other quality control estimate are based on the number of SNPs used in the computation. scoreInvHap allows having some missing measurements in the input data. However, this measurements are excluded from the computation of the scores. To reflect this issue, we have two measurements: the number of SNPs used in the computation and the proportion of non-missing measurements, or call rate:
# Get Number of scores used
head(numSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 307 307 307 307 307 307
# Get call rate
head(propSNPs(res))
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 1 1 1 1 1 1
The number of SNPs must always be taken into account to evaluate the performance of the computation. It is highly recommended to use, at least, 15 SNPs in the computation. We have also included the function plotCallRate
to plot the call rate of the samples:
plotCallRate(res, main = "Call Rate QC")
The vertical line is the minimum recommended call rate to consider a sample. By default, it is set to 0.9 but can be changed with the parameter callRate
. Again, plotCallRate
relies on the base plot function, so we can customize the plot.
The function classification
have two parameters that selects samples based on these two QC parameters. The argument minDiff
sets the minimum difference between the highest and the second highest score to consider a sample. The argument callRate
sets the minimum call rate of a sample to pass the QC. By default, both arguments are set to 0 so all the sample are included:
## No filtering
length(classification(res))
## [1] 30
## QC filtering
length(classification(res, minDiff = 0.1, callRate = 0.9))
## [1] 27
Finally, the function classification
has the argument inversion
that, when it is set to TRUE, the haplotype based classification is transformed to an inversion based classification. This is useful for inversions HsInv0286 and HsInv0396 that have more than one haplotype per inversion status:
## No filtering
table(classification(res))
##
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
## 2 0 4 5 5 0 4 3 4 3
## QC filtering
table(classification(res, inversion = TRUE))
##
## II NI NN
## 6 17 7
When SNPs data are imputed, we obtain three different types of results: the best-guess, the dosage and the posterior probabilities. By default, scoreInvHap
use the best-guess when computing the simmilarity scores. However, we can also use posterior probabilities setting the argument imputed
to TRUE:
res_imp <- scoreInvHap(SNPlist = vcf, SNPsR2 = SNPsR2$HsInv0286, hetRefs = hetRefs$HsInv0286, Refs = Refs$HsInv0286, imputed = TRUE)
res_imp
## scoreInvHapRes
## Samples: 30
## Genotypes' table:
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
## 2 0 4 5 5 0 4 3 4 3
## - Inversion genotypes' table:
## NN NI II
## 7 17 6
## - Inversion frequency: 48.33%
In this case, the samples were identically classified in both cases:
table(PostProbs = classification(res_imp),
BestGuess = classification(res))
## BestGuess
## PostProbs IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
## IaIa 2 0 0 0 0 0 0 0 0 0
## IaIb 0 0 0 0 0 0 0 0 0 0
## IbIb 0 0 4 0 0 0 0 0 0 0
## NaIa 0 0 0 5 0 0 0 0 0 0
## NaIb 0 0 0 0 5 0 0 0 0 0
## NaNa 0 0 0 0 0 0 0 0 0 0
## NaNb 0 0 0 0 0 0 4 0 0 0
## NbIa 0 0 0 0 0 0 0 3 0 0
## NbIb 0 0 0 0 0 0 0 0 4 0
## NbNb 0 0 0 0 0 0 0 0 0 3
There are two additional parameters of scoreInvHap
than can reduce computing time: R2
and BPPARAM
. R2
indicates the minimum R2 that a SNP should have with the inversion to be included in the score. The less number of SNPs the less time it takes. By default, all SNPs are included in the computation. On the other hand, BPPARAM
requires an instance of BiocParallelParam
, which allows to parallelize the computation of the score. You can find more information about this class in its help page (?bpparam
) and in the BiocParallel vignette.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] snpStats_1.28.0 Matrix_1.2-11
## [3] survival_2.41-3 VariantAnnotation_1.24.0
## [5] Rsamtools_1.30.0 Biostrings_2.46.0
## [7] XVector_0.18.0 SummarizedExperiment_1.8.0
## [9] DelayedArray_0.4.0 matrixStats_0.52.2
## [11] Biobase_2.38.0 GenomicRanges_1.30.0
## [13] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [15] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [17] scoreInvHap_1.0.0 BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] progress_1.1.2 splines_3.4.2
## [3] lattice_0.20-35 htmltools_0.3.6
## [5] rtracklayer_1.38.0 yaml_2.1.14
## [7] GenomicFeatures_1.30.0 blob_1.1.0
## [9] XML_3.98-1.9 rlang_0.1.2
## [11] DBI_0.7 BiocParallel_1.12.0
## [13] bit64_0.9-7 GenomeInfoDbData_0.99.1
## [15] stringr_1.2.0 zlibbioc_1.24.0
## [17] evaluate_0.10.1 memoise_1.1.0
## [19] knitr_1.17 biomaRt_2.34.0
## [21] AnnotationDbi_1.40.0 Rcpp_0.12.13
## [23] backports_1.1.1 BSgenome_1.46.0
## [25] bit_1.1-12 RMySQL_0.10.13
## [27] digest_0.6.12 stringi_1.1.5
## [29] bookdown_0.5 grid_3.4.2
## [31] rprojroot_1.2 tools_3.4.2
## [33] bitops_1.0-6 magrittr_1.5
## [35] RCurl_1.95-4.8 RSQLite_2.0
## [37] tibble_1.3.4 prettyunits_1.0.2
## [39] assertthat_0.2.0 rmarkdown_1.6
## [41] R6_2.2.2 GenomicAlignments_1.14.0
## [43] compiler_3.4.2