optimize.sd_selection {BioTIP}    R Documentation

optimization of sd selection

Description

The optimize.sd_selection function filters a multi-state dataset based on a cutoff value for the standard deviation per state, and optimizes the selection by repeating it over B rounds of sample subsetting (see B, percent and times). By default, a cutoff value of 0.01 is used. Recommended when each state contains more than 10 samples.

Usage

optimize.sd_selection(df, samplesL, B = 100, percent = 0.8,
  times = 0.8, cutoff = 0.01, method = c("other", "reference",
  "previous", "itself", "longitudinal reference"), control_df = NULL,
  control_samplesL = NULL)

Arguments

df

A numeric data frame. The rows and columns represent unique transcript IDs (geneID) and sample names, respectively.

samplesL

A list of n vectors, where n equals the number of states. Each vector gives the sample names in a state. Note that the sample names in these vectors have to be among the column names of the R object 'df'.
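
As an illustration of the requirement above, the following base R sketch builds a samplesL-style list and checks it against the column names of df. The toy data and the names gene1..gene4, s1..s6 and missing_samples are hypothetical and are not part of BioTIP.

```r
# Hypothetical toy data: 4 transcripts (rows) x 6 samples (columns).
df <- as.data.frame(matrix(1:24, nrow = 4,
                           dimnames = list(paste0("gene", 1:4),
                                           paste0("s", 1:6))))

# One character vector of sample names per state.
samplesL <- list(state1 = c("s1", "s2", "s3"),
                 state2 = c("s4", "s5", "s6"))

# Every listed sample name must be a column name of df.
missing_samples <- setdiff(unlist(samplesL), colnames(df))
stopifnot(length(missing_samples) == 0)
```

A non-empty missing_samples would point to sample names that are misspelled or absent from df.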

B

An integer indicating the number of times to run this optimization; default 100.

percent

A numeric value indicating the fraction of samples that will be selected in each round of simulation; default 0.8.

times

A numeric value indicating the fraction of the B rounds in which a transcript needs to be selected in order to be considered a stable signature; default 0.8.
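
How B, percent and times interact can be sketched in base R. This is only an illustration under stated assumptions, not BioTIP's implementation: the data are simulated for a single state, the per-round selection is a plain ranking by standard deviation, and n_top is a hypothetical stand-in for the number of transcripts picked per round.

```r
set.seed(42)
# Hypothetical data: 50 transcripts x 20 samples of a single state.
x <- matrix(rnorm(50 * 20), nrow = 50,
            dimnames = list(paste0("gene", 1:50), paste0("s", 1:20)))
# Make the first five transcripts clearly more variable than the rest.
x[1:5, ] <- x[1:5, ] * 10

B <- 100        # number of resampling rounds
percent <- 0.8  # fraction of samples drawn in each round
times <- 0.8    # fraction of rounds a transcript must be picked in
n_top <- 5      # assumption: transcripts picked per round

# Count how often each transcript is among the n_top highest-SD rows.
hits <- setNames(integer(nrow(x)), rownames(x))
for (b in seq_len(B)) {
  cols <- sample(colnames(x), size = round(percent * ncol(x)))
  sds <- apply(x[, cols, drop = FALSE], 1, sd)
  picked <- names(sort(sds, decreasing = TRUE))[seq_len(n_top)]
  hits[picked] <- hits[picked] + 1L
}

# Keep transcripts that were picked in at least times * B rounds.
stable <- names(hits)[hits >= times * B]
```

Raising times makes the stable set stricter, while raising B makes the selection frequencies less noisy at the cost of run time.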

cutoff

A positive numeric value; the default is 0.01. If cutoff < 1, the function automatically selects the top x% of transcripts (x = 100 * cutoff) using the chosen selection method (reference, other or previous stage), e.g. by default it will select the top 1% of the transcripts.
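
The reading of a cutoff below 1 as a top fraction can be sketched as follows. This is a simplified illustration and not BioTIP's code: it ranks by the plain standard deviation within one simulated state instead of comparing against a reference, other or previous state, and the rounding rule used for n_keep is an assumption.

```r
set.seed(1)
# Hypothetical data: 200 transcripts x 12 samples from one state.
x <- matrix(rnorm(200 * 12), nrow = 200,
            dimnames = list(paste0("gene", 1:200), paste0("s", 1:12)))

cutoff <- 0.01

# Per-transcript standard deviation across the samples.
sds <- apply(x, 1, sd)

# Assumption: keep round(cutoff * number of transcripts), at least one.
n_keep <- max(1, round(cutoff * nrow(x)))

# IDs of the n_keep transcripts with the highest standard deviation.
top_ids <- names(sort(sds, decreasing = TRUE))[seq_len(n_keep)]
```

With 200 transcripts and cutoff = 0.01, n_keep is 2, i.e. the top 1%.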

method

Selection of methods from reference, other, previous, itself, or longitudinal reference; the default is other. Partial matching is enabled. Some specific requirements for each option:

  • reference, the reference has to be the first.

  • previous, make sure samplesL is in the right order from benign to malignant.

  • itself, make sure the cutoff is smaller than 1.

  • longitudinal reference, make sure control_df and control_samplesL are not NULL. The number of rows of control_df has to be the same as that of df, and every transcript in df has to be in control_df as well.
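
Partial matching of the method argument behaves like base R's match.arg(). The sketch below uses a hypothetical stand-in function, resolve_method, that only resolves the argument; whether BioTIP calls match.arg() internally is an assumption based on the Usage signature.

```r
# Hypothetical stand-in that only resolves the `method` argument.
resolve_method <- function(method = c("other", "reference", "previous",
                                      "itself", "longitudinal reference")) {
  match.arg(method)
}

resolve_method()        # nothing supplied: the first choice, "other"
resolve_method("ref")   # unique prefix of "reference"
resolve_method("long")  # unique prefix of "longitudinal reference"
```

A value that matches none of the choices makes match.arg() stop with an error.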

control_df

A count matrix with unique loci as row names and the names of the control samples as column names; only used for method 'longitudinal reference'.

control_samplesL

A list of character vectors giving the names of the control samples, with the stages as list names; required for method 'longitudinal reference'.
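
Before calling the function with method 'longitudinal reference', the requirements on the control inputs can be checked up front. The following base R sketch uses hypothetical toy objects; these checks are a convenience written for illustration and are not part of BioTIP.

```r
# Hypothetical toy inputs with the same transcripts in both tables.
genes <- paste0("gene", 1:3)
df <- matrix(1:18, nrow = 3, dimnames = list(genes, paste0("s", 1:6)))
control_df <- matrix(1:12, nrow = 3,
                     dimnames = list(genes, paste0("c", 1:4)))
control_samplesL <- list(state1 = c("c1", "c2"),
                         state2 = c("c3", "c4"))

stopifnot(
  # Both control inputs must be supplied.
  !is.null(control_df), !is.null(control_samplesL),
  # Same number of rows, and every transcript of df is in control_df.
  nrow(control_df) == nrow(df),
  all(rownames(df) %in% rownames(control_df)),
  # Every control sample name is a column of control_df.
  all(unlist(control_samplesL) %in% colnames(control_df))
)
```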

Value

A list of data frames of filtered transcripts. The transcripts with the highest standard deviation are selected from df based on the assigned cutoff value, so each resulting data frame represents a subset of the raw input df.

Author(s)

Zhezhen Wang zhezhen@uchicago.edu

See Also

sd_selection

Examples


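library(BioTIP)
# Toy count matrix: 2 loci (rows) x 30 samples (columns)
counts = matrix(sample(1:100,30),2,30)
colnames(counts) = 1:30
row.names(counts) = paste0('loci',1:2)
# Sample annotation: 10 samples in each of three states
cli = cbind(1:30,rep(c('state1','state2','state3'),each = 10))
colnames(cli) = c('samples','group')
# One vector of sample names per state
samplesL <- split(cli[,1],f = cli[,'group'])
# A small B keeps the example fast
test_sd_selection <- optimize.sd_selection(counts, samplesL, B = 3, cutoff = 0.01)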

[Package BioTIP version 1.0.0 Index]