runDiagnostics {genphen}    R Documentation

Data reduction procedure

Description

Running genphen on hundreds of thousands of predictors (e.g. SNPs) can be computationally costly. Because most SNPs have no or only weak association with the phenotype, genphen provides a lightweight diagnostic procedure that lets the user discard a large portion of the non-informative SNPs before running the main association analysis. The data reduction step proceeds as follows: 1) using random forests and their measures of variable importance, we obtain one importance value for each SNP. The spectrum of importances serves as a rough guide to the importance level at which non-informative SNPs become dominant; 2) the user then selects different points along the importance spectrum (e.g. ranks 1:5, 100:105, 1000:1005, ...) at which genphen is run using its standard procedure. The association scores produced by genphen can then be used to determine the importance rank beyond which the SNPs are no longer informative, and thereby achieve data reduction.

Usage

runDiagnostics(genotype, phenotype, phenotype.type, rf.importance.trees,
               with.anchor.points, mcmc.chains, mcmc.iterations, mcmc.warmup,
               mcmc.cores, hdi.level, anchor.points)

Arguments

genotype

Character matrix/data frame or vector containing SNPs/SAAPs as columns, or alternatively a DNAMultipleAlignment or AAMultipleAlignment Biostrings object.

phenotype

Numerical vector for continuous-phenotype analysis, numerical or character vector for dichotomous-phenotype analysis.

phenotype.type

'continuous' or 'dichotomous' based on phenotype type.

rf.importance.trees

Number of random forest trees to use for the variable importance analysis (default = 50,000).

with.anchor.points

Logical; whether to run the complete diagnostics procedure (TRUE) or only the random forest based importance estimation (FALSE).

mcmc.chains

Number of MCMC chains used for each association test. We recommend mcmc.chains >= 2.

mcmc.iterations

Length of MCMC chains (default = 1,000).

mcmc.warmup

Number of warmup (adaptation) iterations in each MCMC chain (default = 500).

mcmc.cores

Number of cores used for the MCMC (default = 1). The same parameter is used for multicore execution of the statistical learning procedures.

hdi.level

Level of the highest density interval (HDI) (default = 0.95).

anchor.points

Vector of ranks (based on the importance measure) at which genotypes are selected for the diagnostics.

Details

Procedure: 1) Run random forest on the complete genotype-phenotype data and infer the variable importance of each genotype. 2) Sort the genotypes by importance and sample a few genotypes at different points along the importance spectrum, performing for each sampled genotype the procedure explained in runGenphen. 3) Visualize the results, which can help the user determine whether a sensible data reduction is possible, i.e. whether to select only the X most important genotypes for the main analysis.
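
The ranking and anchor-point selection in step 2 can be sketched in base R. This is only an illustration under stated assumptions: the importance scores are simulated rather than produced by a random forest, and the object names (importance, ranked, anchors) are hypothetical and are not part of the genphen API.

```r
# Hypothetical importance scores for 2,000 sites (a simulated stand-in
# for the random forest output; not produced by genphen).
set.seed(1)
n.sites <- 2000
importance <- c(rexp(20, rate = 0.2), rexp(n.sites - 20, rate = 20))
names(importance) <- paste0("site_", seq_len(n.sites))

# Rank the sites so that rank 1 is the most important site.
ranked <- data.frame(
  site = names(importance),
  importance = unname(importance),
  importance.rank = rank(-importance, ties.method = "first"),
  stringsAsFactors = FALSE
)
ranked <- ranked[order(ranked$importance.rank), ]

# Anchor points: a few ranks sampled along the importance spectrum.
anchor.points <- c(1:5, 100:105, 1000:1005)
anchors <- ranked[ranked$importance.rank %in% anchor.points, ]
```

In the actual procedure only the genotypes at the anchor ranks would be passed on to the association analysis, so the cost of that analysis scales with the number of anchor points rather than with the total number of sites.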

Value

General parameters:

site

id of the site (e.g. position in the provided sequence alignment)

mutation

type of polymorphism (e.g. T->A)

data

number of data points for each allele (e.g. A:10, T:20)

Association score parameters:

cohens.d or absolute.d

Cohen's d effect size (continuous phenotype analysis) or absolute effect size (dichotomous phenotype analysis) point estimate

cohens.d.L/cohens.d.H or absolute.d.L/absolute.d.H

The highest density interval (HDI) of the estimated effect size

bc

Bhattacharyya coefficient, degree of overlap between the posterior predicted distributions of the phenotype in the two alleles of a SNP (or two amino acid states of an SAAP).

anchor.point

Indicator of selected anchor.point
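
To make the interval and overlap scores above more concrete, the following base R sketch computes a sample-based HDI and a histogram-based Bhattacharyya coefficient. The helper functions hdi and bc and all simulated values are hypothetical; they illustrate the general definitions and are not necessarily how genphen computes these quantities internally.

```r
# Sample-based HDI: the shortest interval containing `level` of the samples.
hdi <- function(samples, level = 0.95) {
  sorted <- sort(samples)
  n <- length(sorted)
  k <- ceiling(level * n)           # number of samples inside the interval
  starts <- seq_len(n - k + 1)      # candidate lower-bound indices
  widths <- sorted[starts + k - 1] - sorted[starts]
  i <- which.min(widths)            # narrowest candidate interval
  c(L = sorted[i], H = sorted[i + k - 1])
}

# Bhattacharyya coefficient approximated from histograms on shared bins:
# close to 1 for strongly overlapping samples, close to 0 for separated ones.
bc <- function(x, y, bins = 30) {
  breaks <- seq(min(x, y), max(x, y), length.out = bins + 1)
  p <- hist(x, breaks = breaks, plot = FALSE)$counts / length(x)
  q <- hist(y, breaks = breaks, plot = FALSE)$counts / length(y)
  sum(sqrt(p * q))
}

# Simulated stand-ins for posterior draws (assumed values, not genphen output).
set.seed(42)
d.samples <- rnorm(4000, mean = 1.2, sd = 0.3)  # effect size draws
allele.A  <- rnorm(4000, mean = 0, sd = 1)      # predicted phenotype, allele A
allele.T  <- rnorm(4000, mean = 3, sd = 1)      # predicted phenotype, allele T

d.hdi   <- hdi(d.samples, level = 0.95)
overlap <- bc(allele.A, allele.T)
```

An HDI that excludes zero together with a small overlap value points to an association, which is the pattern to look for when comparing anchor points along the importance spectrum.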

Ranked variable importance scores:

site

id of the site (e.g. position in the provided sequence alignment)

importance

magnitude of importance of the site

importance.rank

rank based on the importance

Author(s)

Simo Kitanovski <simo.kitanovski@uni-due.de>

See Also

runGenphen, runPhyloBiasCheck

Examples

# I: Continuous diagnostics
# genotype inputs:
data(genotype.saap)
# phenotype inputs:
data(phenotype.saap)

# run the diagnostics
continuous.diagnostics <- runDiagnostics(genotype = genotype.saap,
                                         phenotype = phenotype.saap,
                                         phenotype.type = "continuous",
                                         rf.importance.trees = 50000,
                                         with.anchor.points = TRUE,
                                         mcmc.chains = 2,
                                         mcmc.iterations = 1500,
                                         mcmc.warmup = 500,
                                         mcmc.cores = 2,
                                         hdi.level = 0.95,
                                         anchor.points = 1:10)

[Package genphen version 1.8.0 Index]