runGenphen {genphen} | R Documentation |
Given a set of genotypes such as single nucleotide polymorphisms (SNPs) or single amino acid polymorphisms (SAAPs) for a set of N individuals, and the corresponding N phenotypes, genphen computes severaly genotype-phenotypes association scores using Bayesian inference and statistical learning.
runGenphen(genotype, phenotype, phenotype.type, mcmc.chains, mcmc.iterations, mcmc.warmup, mcmc.cores, hdi.level, stat.learn.method, cv.iterations, with.rpa, rpa.iterations, rpa.rope)
genotype |
Character matrix/data frame or a vector, containing SNPs/SAAPs as columns or alternatively as DNAMultipleAlignment or AAMultipleAlignment Biostrings object. |
phenotype |
Numerical vector for continuous-phenotype analysis, numerical or character vector for dichotonous-phenotype analysis. |
phenotype.type |
'continuous' or 'dichotomous' based on phenotype type. |
mcmc.chains |
Number of MCMC chains used to test each association test. We recomend mcmc.chains >= 2. |
mcmc.iterations |
Length of MCMC chains (default = 1,000). |
mcmc.warmup |
Length of adaptive MCMC chains (default = 500). |
mcmc.cores |
Number of cores used for the MCMC (default = 1). The same parameter is for multicore execution of the statistical learning procedures. |
hdi.level |
Highest density interval (HDI) (default = 0.95). |
stat.learn.method |
Character parameter used to specify the statistical learning method used in the analysis. Currently two methods are available: random forest ('rf') and support vector machine ('svm'). For no statistical learning select 'none'. |
cv.iterations |
cross-validation iterations (default = 1,000). |
with.rpa |
with retrospective power analysis (RPA). |
rpa.iterations |
Number of simulation used to compute RPA scores. |
rpa.rope |
Region of practical equivalence (ROPE) for the RPA. |
Input:
genotype P genotypes of N individuals in the form of NxP character matrix/data frame or vector (if P = 1).
phenotype phenotypes of N individuals in the form of a N-sized vector. The type of the phenotype can either be continuous or dichotomous. Therefore, genphen has two analysis modes for each situation. The main difference between them is the design of the Bayesian hierarchical model which are used.
Goal: To quantify the association between each genotype and phenotype. With genphen, we provide several measures of association:
Classification accuracy (CA): it is a metric obtained with a statistical learning technique such as random forest (RF) or support vector machine (SVM), and measures the degree of accuracy with which one can classify (predict) the alleles of an SNP from the phenotype measurements. If there exists a strong association between a particular SNP and the phenotype, one should be able to build a statistical model which accurately classifies the two alleles of that SNP solely from the phenotype data (CA approx. 1). Otherwise, the classification accuracy of statistical model should be approximately similar to that of simple guessing (CA approx. 0.5). Promising SNPs thus have a high CA close to 1.
For each CA estimate, we also compute a corresponding Cohen's kappa statistic. With Cohen's kappa we compare the observed CA with the classification accuracy which is expected simply by chance (CA_exp). This is in particular useful when the genetic states of the genotype are not evenly represented, i.e. allele A of a given SNP may be represented in 80% of the individ-uals, while the other allele T may be represented in only 20% of the individuals. Such an uneven composition of the genotype can affect the classification analysis, potentially resulting in high CA simply because the classifier only predicts the dominant allele. Promising SNPs have high Cohen's kappa close to 1.
Effect size: for each SNP we compute the effect size, i.e. the size of the difference in phenotype observed between its alleles (amino acids in the case of SAAP). Depending on the type of the phenotype, we either compute the Cohen's d effect size for continuous phenotypes, or the absolute effect size for dichotomous phenotypes. We first use Byesian inference to infer the parameters of the distribution of the phenotype in each allele, and then plug the posterior distribution of these parameters into the corresponding equations for computing the effect size. In addition to the point estimates of the effect sizes, we also estimate the corresponding highest density intervals (HDIs).
Bhattacharyya coefficient (BC): the infered parameters needed to compute the effect size are used to perform posterior predictive checks and generate simulated distributions of the phenotype for each allele of a given SNP (amino acid states in the case of SAAP). Using the Bhattacharyya coefficient, we can measure the overlap between any two distributions, i.e. for no or low overlap (weak strength of association), BC is close to 0, and BC is close to 1 for a complete overlap (strong strength of association).
The association scores can be correlated, i.e. when CA is close to 1, the effect size is large and (BC = 0). This is not always the case, i.e. we can have a small but significant effect size, yet a perfect CA. This is an interesting information which could be lost if a single association metric was used. Moreover, RF and SVM allow us to capture non-linear associations.
General parameters:
site |
id of the site (e.g. position in the provided sequence alignment) |
mutation |
type of polymorphism (e.g. T->A) |
data |
number of data points for each allele (e.g. A:10, T:20) |
Association score parameters:
cohens.d or absolute.d |
Cohen's d effect size (continuous phenotype analysis) or absolute effect size (dichotomous phenotype analysis). point estimate |
cohens.d.L/cohens.d.H or absolute.d.L/absolute.d.H |
The highest density interval (HDI) of the estimated effect size. |
ca, ca.L, ca.H |
Classification accuracy (CA) estimate and its HDI |
kappa, kappa.L, kappa.H |
Cohen's kappa estimate and its HDI |
bc |
Bhattacharyya coefficient, degree of overlap between the posterior predicted distributions of the phenotype in the two alleles of a SNP (or two amino acid states of an SAAP. |
sd.d, sd.d.L, sd.d.H |
(only for continuous phenotypes) Difference in standard deviations its HDI |
MCMC convergence parameters:
s, g, n |
s=site, g=genotype, n=number of observations |
mu.rhat, sigma.rhat |
Potential scale reduction factor from the MCMC simulation for each parameter |
mu.ess, sigma.ess |
Effective sampling size from the MCMC simulation for each parameter |
divergence |
Indicator of occuring divergences during the MCMC simulation |
treedepth |
Indicator of treedepth exceptions during the MCMC simulation |
RPA parameters:
s, g, n |
s=site, g=genotype, n=number of observations |
rpa.count |
Number of RPA runs in which a significant effect was found |
Posterior predictive check parameters:
s, g, n |
s=site, g=genotype, n=number of observations |
predicted.mu.i, predicted.mu.j, real.mu.i, real.mu.j |
Predicted and real means in each group of a SNP |
Simo Kitanovski <simo.kitanovski@uni-due.de>
runDiagnostics, runPhyloBiasCheck
# I: Continuous phenotype analysis # genotype inputs: data(genotype.saap) # phenotype inputs: data(phenotype.saap) # run genphen continuous.analysis <- runGenphen(genotype = genotype.saap[, 1:3], phenotype = phenotype.saap, phenotype.type = "continuous", mcmc.chains = 2, mcmc.iterations = 2000, mcmc.warmup = 500, mcmc.cores = 2, hdi.level = 0.95, stat.learn.method = "rf", cv.iterations = 500, with.rpa = FALSE, rpa.iterations = 100, rpa.rope = 0) # II: Dichotomous phenotype analysis # genotype inputs: data(genotype.saap) # phenotype inputs: data(dichotomous.phenotype.saap) # run genphen dichotomous.analysis <- runGenphen(genotype = genotype.saap[, 1:3], phenotype = dichotomous.phenotype.saap, phenotype.type = "dichotomous", mcmc.chains = 2, mcmc.iterations = 2000, mcmc.warmup = 500, mcmc.cores = 2, hdi.level = 0.95, stat.learn.method = "rf", cv.iterations = 500, with.rpa = FALSE, rpa.iterations = 100, rpa.rope = 0)