1 Introduction
2 Installation
3 Input Data
4 Step 1 - Data Processing
5 Step 2 - Split Data into Training and Test Subset
6 Step 3 - Data Normalization
7 Step 4a - Prognostic Index (PI) Score Calculation
8 Step 4b - Univariate Survival Significant Feature Selection
9 Step 5 - Prediction model development for survival probability of patients
10 Step 6 - Survival curves/plots for individual patient
11 Step 7 - Predicted mean and median survival time of individual patients
12 Step 8 - Nomogram based on Key features
13 SessionInfo
14 References
Appendix

1 Introduction

CPSM is an R package that provides a computational pipeline for predicting the survival probability of cancer patients. It encompasses several key steps, including data processing, splitting data into training and test subsets, data normalization, selecting significant features based on univariate survival analysis, generating LASSO PI scores, and developing predictive models for survival probability. Additionally, CPSM visualizes results through survival curves based on predicted probabilities and bar plots depicting the predicted mean and median survival times of patients.

2 Installation

To install this package, start R (version 4.4) and enter the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("CPSM")

3 Input Data

The example input data object, Example_TCGA_LGG_FPKM_data, is a SummarizedExperiment containing data for 184 LGG cancer samples. As the printed summary below shows, the expression assay stores genes as rows (2,005 in this example object) and samples as columns, with gene expression given as FPKM values, while colData holds 20 columns of sample-level information. These include 11 clinical and demographic features, 4 types of survival data (each with time and event information), and a sample column. The clinical and demographic features are Age, subtype, gender, race, ajcc_pathologic_tumor_stage, histological_type, histological_grade, treatment_outcome_first_course, radiation_treatment_adjuvant, sample_type, and type. The four types of survival data are Overall Survival (OS), Progression-Free Survival (PFS), Disease-Specific Survival (DSS), and Disease-Free Survival (DFS). For each type, one column records event occurrence (e.g., OS) and a matching .time column records survival time in days (e.g., OS.time).

library(CPSM)
library(SummarizedExperiment)
set.seed(7) # set seed
data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
Example_TCGA_LGG_FPKM_data
#> class: SummarizedExperiment 
#> dim: 2005 184 
#> metadata(0):
#> assays(1): expression
#> rownames(2005): A1BG A1CF ... BAZ1B BAZ2A
#> rowData names(1): gene
#> colnames(184): TCGA-TM-A7CA-01 TCGA-DU-A6S3-01 ... TCGA-E1-A7YM-01
#>   TCGA-DH-5143-01
#> colData names(20): Age subtype ... PFI.time sample

4 Step 1 - Data Processing

The data_process_f function converts OS time (in days) into months and removes samples for which OS or OS.time information is missing. In the example below, the input is a data frame with samples as rows, built by combining the colData of the example object with the transposed expression matrix. You also need to define col_num (the column number at which clinical, demographic, and survival information ends, e.g., 20) and surv_time (the name of the column that contains survival time information, e.g., OS.time). The result is assigned to an object of your choice (e.g., New_data).

data(Example_TCGA_LGG_FPKM_data, package = "CPSM")
# Sample-level columns from colData, dropping the last column
clin_df <- as.data.frame(colData(Example_TCGA_LGG_FPKM_data))
clin_df <- clin_df[, -ncol(clin_df)]
# Expression values transposed so that samples are rows
expr_df <- t(as.data.frame(assay(Example_TCGA_LGG_FPKM_data, "expression")))
combined_df <- cbind(clin_df, expr_df)
New_data <- data_process_f(combined_df, col_num = 20, surv_time = "OS.time")
str(New_data[1:10])
#> 'data.frame':    176 obs. of  10 variables:
#>  $ Age                           : num  44.9 60.3 57.9 45.7 70.7 ...
#>  $ subtype                       : chr  "PN" "PN" NA "PN" ...
#>  $ gender                        : chr  "Male" "Male" "Female" "Male" ...
#>  $ race                          : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
#>  $ ajcc_pathologic_tumor_stage   : logi  NA NA NA NA NA NA ...
#>  $ histological_type             : chr  "Astrocytoma" "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" ...
#>  $ histological_grade            : chr  "G2" "G2" "G3" "G3" ...
#>  $ treatment_outcome_first_course: chr  "Complete Remission/Response" "Stable Disease" NA NA ...
#>  $ radiation_treatment_adjuvant  : chr  "NO" "NO" NA "YES" ...
#>  $ sample_type                   : chr  "Primary" "Primary" "Primary" "Primary" ...

After data processing, the output object New_data is generated, which contains 176 samples. This indicates that the function has removed 8 samples where OS/OS.time information was missing. Moreover, a new 21st column, OS_month, is added to the data, containing OS time values in months.
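The idea behind this step can be illustrated with a minimal base R sketch on a toy data frame. This is not the implementation of data_process_f: the helper toy_process is hypothetical, the values are made up, and the conversion factor of 30.44 days per month (with rounding to whole months) is an assumption, since the exact factor used by the package is not stated here.

```r
# Toy clinical table; all values are made up for illustration only.
toy <- data.frame(
  OS      = c(1, 0, NA, 1),
  OS.time = c(365, 730, 100, NA),
  Age     = c(45, 60, 52, 38)
)

# Hypothetical helper, not the CPSM implementation.
# days_per_month = 30.44 is an assumption (average month length).
toy_process <- function(df, surv_time = "OS.time", surv_event = "OS",
                        days_per_month = 30.44) {
  # Keep only rows where both the event and the time are recorded.
  keep <- !is.na(df[[surv_time]]) & !is.na(df[[surv_event]])
  out <- df[keep, , drop = FALSE]
  # Add survival time in months as a new column.
  out$OS_month <- round(out[[surv_time]] / days_per_month)
  out
}

toy_processed <- toy_process(toy)
toy_processed
```

In this toy case two of the four rows are dropped because either OS or OS.time is missing, and the remaining rows gain an OS_month column.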

5 Step 2 - Split Data into Training and Test Subset

Before proceeding further, we need to split the data into training and test subsets for feature selection and model development. The output from the previous step, New_data, serves as the input for this process. You need to define the fraction of samples assigned to the training set; for example, fraction = 0.9 divides the data into 90% for training and 10% for testing. The function returns a list whose train_data and test_data elements can be stored under names of your choice (e.g., train_FPKM and test_FPKM).

data(New_data, package = "CPSM")
# Call the function
result <- tr_test_f(data = New_data, fraction = 0.9)
# Access the train and test data
train_FPKM <- result$train_data
str(train_FPKM[1:10])
#> 'data.frame':    158 obs. of  10 variables:
#>  $ Age                           : num  53.1 54.3 38.1 25.8 46.2 ...
#>  $ subtype                       : chr  "PN" "ME" "NE" "NE" ...
#>  $ gender                        : chr  "Female" "Male" "Female" "Female" ...
#>  $ race                          : chr  "WHITE" "WHITE" "WHITE" "NOT AVAILABLE" ...
#>  $ ajcc_pathologic_tumor_stage   : logi  NA NA NA NA NA NA ...
#>  $ histological_type             : chr  "Oligodendroglioma" "Oligodendroglioma" "Oligoastrocytoma" "Astrocytoma" ...
#>  $ histological_grade            : chr  "G3" "G3" "G3" "G2" ...
#>  $ treatment_outcome_first_course: chr  NA "Progressive Disease" NA "Complete Remission/Response" ...
#>  $ radiation_treatment_adjuvant  : chr  "YES" NA "NO" "NO" ...
#>  $ sample_type                   : chr  "Primary" "Primary" "Primary" "Primary" ...
test_FPKM <- result$test_data
str(test_FPKM[1:10])
#> 'data.frame':    18 obs. of  10 variables:
#>  $ Age                           : num  70.7 34.6 32.4 61 34.4 ...
#>  $ subtype                       : chr  "PN" NA "PN" "PN" ...
#>  $ gender                        : chr  "Male" "Female" "Male" "Male" ...
#>  $ race                          : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
#>  $ ajcc_pathologic_tumor_stage   : logi  NA NA NA NA NA NA ...
#>  $ histological_type             : chr  "Oligodendroglioma" "Astrocytoma" "Oligodendroglioma" "Oligodendroglioma" ...
#>  $ histological_grade            : chr  "G3" "G2" "G2" "G2" ...
#>  $ treatment_outcome_first_course: chr  "Stable Disease" "Stable Disease" "Partial Remission/Response" "Complete Remission/Response" ...
#>  $ radiation_treatment_adjuvant  : chr  NA NA "YES" "YES" ...
#>  $ sample_type                   : chr  "Recurrent" "Primary" "Primary" "Primary" ...

After the train-test split, two new output objects are generated: train_FPKM and test_FPKM. The train_FPKM object contains 158 samples, while test_FPKM contains 18 samples. This indicates that the tr_test_f function splits the data in a 90:10 ratio.
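A random split of this kind can be sketched in base R as follows. The helper toy_split is hypothetical and only illustrates the principle; it is not the code of tr_test_f, and rounding the training size down with floor() is an assumption (it is consistent with 176 samples giving 158 training and 18 test samples).

```r
set.seed(7)  # fixed seed so the split is reproducible

# Toy data: 20 samples with made-up values.
toy <- data.frame(id = sprintf("S%02d", 1:20), value = rnorm(20))

# Hypothetical helper, not the CPSM implementation.
toy_split <- function(df, fraction = 0.9) {
  n_train <- floor(fraction * nrow(df))
  # Draw the training rows at random; the rest form the test set.
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(
    train_data = df[train_idx, , drop = FALSE],
    test_data  = df[-train_idx, , drop = FALSE]
  )
}

parts <- toy_split(toy, fraction = 0.9)
nrow(parts$train_data)
nrow(parts$test_data)
```

Setting a seed before the split, as done with set.seed(7) earlier in this document, makes the assignment of samples to the two subsets reproducible.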

6 Step 3 - Data Normalization

To select features and develop ML models, the data must be normalized. Since the expression data is available as FPKM values, the train_test_normalization_f function first converts the FPKM values to a log scale using the formula log2(FPKM + 1) and then applies quantile normalization. The training data is used as the target matrix for the quantile normalization process. For this function, you need to provide the training and test datasets obtained from the previous step (Train/Test Split). You must also specify the column number where clinical information ends in the input datasets (e.g., 21). The function returns a list with four elements (Train_Clin, Test_Clin, Train_Norm_data, and Test_Norm_data), which are described after the code below.

# Step 3 - Data Normalization
# Normalize the training and test data sets
data(train_FPKM, package = "CPSM")
data(test_FPKM, package = "CPSM")
Result_N_data <- train_test_normalization_f(
  train_data = train_FPKM,
  test_data = test_FPKM,
  col_num = 21
)
# Access the Normalized train and test data
Train_Clin <- Result_N_data$Train_Clin
Test_Clin <- Result_N_data$Test_Clin
Train_Norm_data <- Result_N_data$Train_Norm_data
Test_Norm_data <- Result_N_data$Test_Norm_data
str(Train_Clin[1:10])
#> 'data.frame':    158 obs. of  10 variables:
#>  $ Age                           : num  30.8 36.5 38.9 65.1 32.3 ...
#>  $ subtype                       : chr  "PN" NA "NE" "CL" ...
#>  $ gender                        : chr  "Male" "Male" "Male" "Female" ...
#>  $ race                          : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
#>  $ ajcc_pathologic_tumor_stage   : logi  NA NA NA NA NA NA ...
#>  $ histological_type             : chr  "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#>  $ histological_grade            : chr  "G3" "G2" "G2" "G3" ...
#>  $ treatment_outcome_first_course: chr  NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#>  $ radiation_treatment_adjuvant  : chr  "YES" NA "NO" "YES" ...
#>  $ sample_type                   : chr  "Primary" "Primary" "Primary" "Primary" ...
str(Train_Norm_data[1:10])
#> 'data.frame':    158 obs. of  10 variables:
#>  $ Age                           : num  30.8 36.5 38.9 65.1 32.3 ...
#>  $ subtype                       : chr  "PN" NA "NE" "CL" ...
#>  $ gender                        : chr  "Male" "Male" "Male" "Female" ...
#>  $ race                          : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
#>  $ ajcc_pathologic_tumor_stage   : logi  NA NA NA NA NA NA ...
#>  $ histological_type             : chr  "Oligodendroglioma" "Oligoastrocytoma" "Oligoastrocytoma" "Astrocytoma" ...
#>  $ histological_grade            : chr  "G3" "G2" "G2" "G3" ...
#>  $ treatment_outcome_first_course: chr  NA "Complete Remission/Response" NA "Partial Remission/Response" ...
#>  $ radiation_treatment_adjuvant  : chr  "YES" NA "NO" "YES" ...
#>  $ sample_type                   : chr  "Primary" "Primary" "Primary" "Primary" ...

After running the function, four output objects are generated: Train_Clin (which contains only clinical features from the training data), Test_Clin (which contains only clinical features from the test data), Train_Norm_data (which includes clinical features and normalized gene expression values for the training samples), and Test_Norm_data (which includes clinical features and normalized gene expression values for the test samples).
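The two normalization steps can be sketched in base R on toy matrices. This is an illustration under stated assumptions, not the code of train_test_normalization_f: the helper to_target is hypothetical, the data are simulated, genes are placed in rows for convenience, and ties are broken by order of appearance, which is a simplification compared with dedicated quantile normalization routines.

```r
# Toy FPKM matrices (genes in rows, samples in columns); values are simulated.
set.seed(1)
train_fpkm <- matrix(rexp(5 * 4, rate = 0.1), nrow = 5,
                     dimnames = list(paste0("G", 1:5), paste0("tr", 1:4)))
test_fpkm  <- matrix(rexp(5 * 2, rate = 0.1), nrow = 5,
                     dimnames = list(paste0("G", 1:5), paste0("te", 1:2)))

# Step 1: log2(FPKM + 1) transform.
train_log <- log2(train_fpkm + 1)
test_log  <- log2(test_fpkm + 1)

# Step 2: target distribution = mean of the sorted training columns.
target <- unname(rowMeans(apply(train_log, 2, sort)))

# Step 3: map every column onto the target distribution by rank.
# Hypothetical helper; ties are broken by order of appearance.
to_target <- function(mat, target) {
  out <- apply(mat, 2, function(x) target[rank(x, ties.method = "first")])
  dimnames(out) <- dimnames(mat)
  out
}

train_norm <- to_target(train_log, target)
test_norm  <- to_target(test_log, target)
```

Because the target is derived from the training columns only, the test samples are brought onto the training distribution without the test data influencing it.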

7 Step 4a - Prognostic Index (PI) Score Calculation

To create a survival model, the next step is to calculate the Prognostic Index (PI) score. The PI score is based on the expression levels of features selected by the LASSO regression model and their corresponding beta coefficients. For example, suppose five features (G1, G2, G3, G4, G5) are selected by the LASSO method, and their associated coefficients are B1, B2, B3, B4, and B5, respectively. The PI score is then computed using the following formula:

PI score = G1 * B1 + G2 * B2 + G3 * B3 + G4 * B4 + G5 * B5
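As a worked illustration of this formula, the PI score of several samples can be computed at once as a matrix product. The expression values and beta coefficients below are made up for the example; they are not output of the LASSO model.

```r
# Toy expression matrix: 3 samples (rows) x 5 selected features (columns).
expr <- matrix(
  c(1.0, 2.0, 0.5, 3.0, 1.5,
    0.0, 1.0, 2.0, 1.0, 0.5,
    2.0, 0.5, 1.0, 0.0, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"), paste0("G", 1:5))
)

# Made-up beta coefficients B1..B5 for the five features.
beta <- c(G1 = 0.4, G2 = -0.2, G3 = 0.1, G4 = 0.3, G5 = -0.5)

# PI = G1*B1 + G2*B2 + G3*B3 + G4*B4 + G5*B5 for every sample.
PI <- drop(expr[, names(beta)] %*% beta)
PI
```

For sample P1 this gives 1.0 * 0.4 + 2.0 * (-0.2) + 0.5 * 0.1 + 3.0 * 0.3 + 1.5 * (-0.5) = 0.2.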

To perform this calculation, you need to provide the normalized training data object (Train_Norm_data) and test data object (Test_Norm_data) obtained from the previous step (train_test_normalization_f). Additionally, you must specify the column number (col_num) where clinical features end (e.g., 21), the number of folds (nfolds) for the LASSO regression method (e.g., 5), and the survival time (surv_time) and survival event (surv_event) columns in the data (e.g., OS_month and OS, respectively). The LASSO regression is implemented using the glmnet package. Finally, you assign the results, which include the selected LASSO features and their corresponding PI values, to output objects of your choice.

# Step 4a - LASSO PI score
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_PI <- Lasso_PI_scores_f(
  train_data = Train_Norm_data,
  test_data = Test_Norm_data,
  nfolds = 5,
  col_num = 21,
  surv_time = "OS_month",
  surv_event = "OS"
)
Train_Lasso_key_variables <- Result_PI$Train_Lasso_key_variables
Train_PI_data <- Result_PI$Train_PI_data
Test_PI_data <- Result_PI$Test_PI_data
str(Train_PI_data[1:10])
#> 'data.frame':    158 obs. of  10 variables:
#>  $ OS        : int  1 1 0 1 0 0 0 0 0 0 ...
#>  $ OS_month  : int  78 27 48 4 56 39 34 15 43 44 ...
#>  $ AADACL4   : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ABCA12    : num  0.212 0.297 0.053 0.358 0.064 0.032 0.09 0.051 0.1 0.014 ...
#>  $ ABCC3     : num  0.1 5.476 0.029 0.108 0.037 ...
#>  $ ABI1      : num  44.7 15.3 43.7 27.7 31.6 ...
#>  $ ABRA      : num  0 0.035 0 0 0.015 0.034 0.006 0 0.009 0.005 ...
#>  $ AC006059.2: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AC008676.3: num  0.032 0.034 0.032 0 0.014 0.04 0.011 0.01 0.017 0.042 ...
#>  $ AC008764.4: num  0.038 0 0 0 0 0 0.008 0 0 0.015 ...
str(Test_PI_data[1:10])
#> 'data.frame':    18 obs. of  10 variables:
#>  $ OS        : int  1 1 0 0 0 0 0 0 0 1 ...
#>  $ OS_month  : int  32 27 2 46 18 26 17 28 64 114 ...
#>  $ AADACL4   : num  0.011 0 0.016 0 0 0 0 0 0 0 ...
#>  $ ABCA12    : num  0.048 0.165 0.044 0.033 0.053 0.02 0.09 0.038 0.058 0.096 ...
#>  $ ABCC3     : num  0.679 1.358 0.184 0.321 0.024 ...
#>  $ ABI1      : num  39.4 19.9 31 27.8 56.4 ...
#>  $ ABRA      : num  0 0.001 0.029 0 0 0 0.026 0.004 0 0.014 ...
#>  $ AC006059.2: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AC008676.3: num  0.002 0.054 0.012 0.092 0.031 0.003 0.004 0.078 0.046 0 ...
#>  $ AC008764.4: num  0.036 0.009 0 0 0.015 0 0 0 0 0 ...
plot(Result_PI$cvfit)

The Lasso_PI_scores_f function generates the following output objects:
1. Train_Lasso_key_variables: A list of features selected by LASSO along with their beta coefficient values.
2. Train_Cox_Lasso_Regression_lambda_plot: The LASSO regression lambda plot (drawn in the code above with plot(Result_PI$cvfit)).
3. Train_PI_data: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the training samples.
4. Test_PI_data: This dataset contains the expression values of genes selected by LASSO along with the PI score in the last column for the test samples.

8 Step 4b - Univariate Survival Significant Feature Selection

In addition to the Prognostic Index (PI) score, the Univariate_sig_features_f function in the CPSM package selects significant features based on univariate Cox regression survival analysis. It identifies features with a p-value below 0.05, that is, features able to stratify patients into high-risk and low-risk survival groups. The stratification uses the median expression value of each feature as the cutoff. To use this function, you need to provide the normalized training (Train_Norm_data) and test (Test_Norm_data) dataset objects obtained from the previous step (train_test_normalization_f). You must also specify the column number (col_num) where the clinical features end (e.g., 21), as well as the names of the columns containing survival time (surv_time, e.g., OS_month) and survival event information (surv_event, e.g., OS). The returned list holds the significant features identified through univariate survival analysis together with the matching training and test datasets.

# Step 4b - Univariate Survival Significant Feature Selection
data(Train_Norm_data, package = "CPSM")
data(Test_Norm_data, package = "CPSM")
Result_Uni <- Univariate_sig_features_f(
  train_data = Train_Norm_data,
  test_data = Test_Norm_data,
  col_num = 21,
  surv_time = "OS_month",
  surv_event = "OS"
)
Univariate_Suv_Sig_G_L <- Result_Uni$Univariate_Survival_Significant_genes_List
Train_Uni_sig_data <- Result_Uni$Train_Uni_sig_data
Test_Uni_sig_data <- Result_Uni$Test_Uni_sig_data
Uni_Sur_Sig_clin_List <- Result_Uni$Univariate_Survival_Significant_clin_List
Train_Uni_sig_clin_data <- Result_Uni$Train_Uni_sig_clin_data
Test_Uni_sig_clin_data <- Result_Uni$Test_Uni_sig_clin_data
str(Univariate_Suv_Sig_G_L[1:10])
#>  chr [1:10] "A2ML1" "AADACL4" "AAMDC" "AAR2" "ABCA12" "ABCB4" "ABCB5" ...

The Univariate_sig_features_f function generates the following output objects:
1. Univariate_Suv_Sig_G_L: The univariate significant genes. The object printed above is a character vector of gene names; the univariate analysis itself computes a coefficient, hazard ratio (HR), p-value, and C-Index for each gene.
2. Train_Uni_sig_data: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the training samples.
3. Test_Uni_sig_data: This dataset contains the expression values of the significant genes selected by univariate survival analysis for the test samples.

The code above also extracts the corresponding objects for clinical features from the same result: Uni_Sur_Sig_clin_List, Train_Uni_sig_clin_data, and Test_Uni_sig_clin_data.
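The selection principle (median split followed by a univariate Cox model and a p-value filter) can be sketched with the survival package on simulated data. The helper uni_cox and the variables G_risk and G_null are hypothetical and exist only for this illustration; the sketch does not reproduce the internals of Univariate_sig_features_f.

```r
library(survival)

set.seed(7)
n <- 100
# Simulated data: "G_risk" shortens survival, "G_null" has no effect.
G_risk <- rnorm(n)
G_null <- rnorm(n)
time   <- rexp(n, rate = 0.05 * exp(1.5 * G_risk))
event  <- rep(1, n)  # no censoring, to keep the toy example simple
toy <- data.frame(OS_month = time, OS = event, G_risk, G_null)

# Hypothetical helper: dichotomise at the median, fit a univariate Cox model.
uni_cox <- function(df, feature, surv_time = "OS_month", surv_event = "OS") {
  high <- as.integer(df[[feature]] > median(df[[feature]]))
  fit <- coxph(Surv(df[[surv_time]], df[[surv_event]]) ~ high)
  s <- summary(fit)
  data.frame(
    feature = feature,
    beta    = s$coefficients[1, "coef"],
    HR      = s$coefficients[1, "exp(coef)"],
    p_value = s$coefficients[1, "Pr(>|z|)"],
    C_index = unname(s$concordance["C"])
  )
}

res <- do.call(rbind, lapply(c("G_risk", "G_null"), uni_cox, df = toy))
# Keep features whose high/low groups differ at p < 0.05.
sig <- res$feature[res$p_value < 0.05]
res
```

Each row of res summarizes one feature, and sig holds the names of the features that pass the 0.05 threshold.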

9 Step 5 - Prediction model development for survival probability of patients

After selecting significant features using LASSO or univariate survival analysis, the next step is to develop a machine learning (ML) prediction model to estimate the survival probability of patients. The MTLR_pred_model_f function in the CPSM package provides several options for building prediction models based on different feature sets. These options include:
- Model_type = 1: Model based on only clinical features
- Model_type = 2: Model based on PI score
- Model_type = 3: Model based on PI score + clinical features
- Model_type = 4: Model based on significant univariate features
- Model_type = 5: Model based on significant univariate features + clinical features

For this analysis, we are interested in developing a model based on the PI score (i.e., Model_type = 2). To use this function, the following inputs are required:
1. train_clin_data: Training data with only clinical features
2. test_clin_data: Test data with only clinical features
3. Model_type: Model type (e.g., 2 for a model based on PI score)
4. train_features_data: Training data with PI score
5. test_features_data: Test data with PI score
6. Clin_Feature_List: A list of features to be used for building the model (e.g., Key_PI_list)
7. surv_time: The name of the column containing survival time in months (e.g., OS_month)
8. surv_event: The name of the column containing survival event information (e.g., OS)

These inputs will allow the MTLR_pred_model_f function to generate a prediction model for the survival probability of patients based on the provided data.

9.1 Model for only Clinical features

data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Key_Clin_feature_list, package = "CPSM")
Result_Model_Type1 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 1,
  train_features_data = Train_Clin,
  test_features_data = Test_Clin,
  Clin_Feature_List = Key_Clin_feature_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type1$survCurves_data
mean_median_survival_tim_d <- Result_Model_Type1$mean_median_survival_time_data
survival_result_bas_on_MTLR <- Result_Model_Type1$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type1$Error_mat_for_Model

9.2 Model for PI

data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_PI_list, package = "CPSM")
Result_Model_Type2 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 2,
  train_features_data = Train_PI_data,
  test_features_data = Test_PI_data,
  Clin_Feature_List = Key_PI_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type2$survCurves_data
mean_median_surviv_tim_da <- Result_Model_Type2$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type2$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type2$Error_mat_for_Model

9.3 Model for Clinical features + PI

data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_PI_data, package = "CPSM")
data(Test_PI_data, package = "CPSM")
data(Key_Clin_features_with_PI_list, package = "CPSM")
Result_Model_Type3 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 3,
  train_features_data = Train_PI_data,
  test_features_data = Test_PI_data,
  Clin_Feature_List = Key_Clin_features_with_PI_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type3$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type3$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type3$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type3$Error_mat_for_Model

9.4 Model for Univariate + Clinical features

data(Train_Clin, package = "CPSM")
data(Test_Clin, package = "CPSM")
data(Train_Uni_sig_data, package = "CPSM")
data(Test_Uni_sig_data, package = "CPSM")
data(Key_univariate_features_with_Clin_list, package = "CPSM")
Result_Model_Type5 <- MTLR_pred_model_f(
  train_clin_data = Train_Clin,
  test_clin_data = Test_Clin,
  Model_type = 5,
  train_features_data = Train_Uni_sig_data,
  test_features_data = Test_Uni_sig_data,
  Clin_Feature_List = Key_univariate_features_with_Clin_list,
  surv_time = "OS_month",
  surv_event = "OS"
)
survCurves_data <- Result_Model_Type5$survCurves_data
mean_median_surv_tim_da <- Result_Model_Type5$mean_median_survival_time_data
survival_result_b_on_MTLR <- Result_Model_Type5$survival_result_based_on_MTLR
Error_mat_for_Model <- Result_Model_Type5$Error_mat_for_Model

After implementing the MTLR_pred_model_f function, the following outputs are generated:

Model_with_PI.RData: This object contains the trained model based on the input data.
survCurves_data: This object contains the predicted survival probabilities for each patient at various time points. This data can be used to plot survival curves for patients.
mean_median_survival_time_data: Object containing the predicted mean and median survival times for each patient in the test data. This data can be used to generate bar plots illustrating the predicted survival times.
Error_mat_for_Model: Object containing performance metrics of the model based on the training and test data. It includes the following key performance scores:
- C-Index = 0.81

These outputs allow you to evaluate the model’s performance and visualize survival probabilities and survival times for the training and test data.
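To make the C-Index concrete, the following base R sketch computes Harrell's concordance index on made-up data. The function c_index is a hypothetical helper written for this illustration; how CPSM computes its own C-Index is not shown in this document, so this should be read as a general definition rather than the package's method.

```r
# Hypothetical helper: Harrell's concordance index in base R.
# risk: higher value = higher predicted risk (shorter expected survival).
c_index <- function(time, event, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable if i had an observed event before j's time.
      if (event[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5  # ties count as half
        }
      }
    }
  }
  concordant / comparable
}

# Made-up survival times (months), event indicators, and risk scores.
time  <- c(5, 10, 15, 20, 25)
event <- c(1, 1, 0, 1, 0)
risk  <- c(0.9, 0.3, 0.2, 0.5, 0.1)
ci <- c_index(time, event, risk)
ci
```

Here 7 of the 8 comparable pairs are ordered correctly, giving a C-Index of 0.875. A value of 0.5 corresponds to random ordering and 1 to perfect ordering.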

10 Step 6 - Survival curves/plots for individual patient

To visualize the survival of patients, we use the surv_curve_plots_f function, which generates survival curve plots based on the survCurves_data obtained from the previous step (after running the MTLR_pred_model_f function). This function also provides the option to highlight the survival curve of a specific patient. The function requires two inputs:
1. Surv_curve_data: The data object containing predicted survival probabilities for all patients.
2. selected_sample: The ID of the specific patient (e.g., TCGA-TQ-A7RQ-01) whose survival curve you want to highlight.

# Create Survival curves/plots for individual patients
data(survCurves_data, package = "CPSM")
plots <- surv_curve_plots_f(
  Surv_curve_data = survCurves_data,
  selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots$all_patients_plot)

print(plots$highlighted_patient_plot)

After running the function, two output plots are generated:
1. Survival curves for all patients in the test data, displayed with different colors for each patient.
2. Survival curves for all patients (in black) with the selected patient highlighted in red.

These plots allow for easy visualization of individual patient survival in the context of the overall test data.
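A highlighted plot of this kind can be approximated in base R graphics. The sketch below uses made-up curves and assumes a layout with a time column followed by one column per patient; the actual layout of survCurves_data and the plotting code of surv_curve_plots_f may differ.

```r
# Toy survival curves: a time column plus one column per patient (made up).
curves <- data.frame(
  time = c(0, 12, 24, 36, 48),
  P1   = c(1, 0.95, 0.80, 0.60, 0.40),
  P2   = c(1, 0.85, 0.60, 0.35, 0.20),
  P3   = c(1, 0.90, 0.75, 0.55, 0.30)
)
selected <- "P2"  # hypothetical patient ID to highlight

patients <- setdiff(names(curves), "time")
# Selected patient in red with a thicker line, all others in black.
cols <- ifelse(patients == selected, "red", "black")
lwds <- ifelse(patients == selected, 3, 1)

matplot(curves$time, as.matrix(curves[, patients]), type = "l", lty = 1,
        col = cols, lwd = lwds,
        xlab = "Time (months)", ylab = "Survival probability")
legend("bottomleft", legend = selected, col = "red", lwd = 3, bty = "n")
```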

11 Step 7 - Predicted mean and median survival time of individual patients

To visualize the predicted survival times for patients, we use the mean_median_surv_barplot_f function, which generates bar plots for the mean and median survival times based on the data obtained from Step 5 after running the MTLR_pred_model_f function. This function also provides the option to highlight a specific patient on the bar plot. The function requires two inputs:
1. surv_mean_med_data: The data containing the predicted mean and median survival times for all patients.
2. selected_sample: The ID of the specific patient (e.g., TCGA-TQ-A7RQ-01) whose bar plot should be highlighted.

data(mean_median_survival_time_data, package = "CPSM")
plots_2 <- mean_median_surv_barplot_f(
  surv_mean_med_data =
    mean_median_survival_time_data,
  selected_sample = "TCGA-TQ-A7RQ-01"
)
# Print the plots
print(plots_2$mean_med_all_pat)

print(plots_2$highlighted_selected_pat)

After running the function, two output bar plots are generated:
1. Bar plot for all patients in the test data, where the red-colored bars represent the mean survival time, and the cyan/green-colored bars represent the median survival time.
2. Bar plot for all patients with a highlighted patient (indicated by a dashed black outline). This plot shows that the highlighted patient has predicted mean and median survival times of 81.58 and 75.50 months, respectively.

These plots provide a clear comparison of the predicted survival times for all patients and the highlighted individual patient.
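The relationship between a predicted survival curve and its mean and median survival time can be shown with a short base R sketch. The curve is made up, the median is taken as the time at which the survival probability reaches 0.5 (with linear interpolation), and the mean is the area under the curve up to the last time point; the way the underlying model derives these quantities may differ, for example in how it treats the tail of the curve.

```r
# Toy predicted survival curve for one patient (values are made up).
times <- c(0, 12, 24, 36, 48, 60)          # months
surv  <- c(1.0, 0.9, 0.7, 0.45, 0.3, 0.1)  # survival probability S(t)

# Median: time at which S(t) reaches 0.5, by linear interpolation.
median_surv <- approx(x = surv, y = times, xout = 0.5)$y

# Mean, restricted to the last time point: area under S(t) (trapezoid rule).
mean_surv <- sum(diff(times) * (head(surv, -1) + tail(surv, -1)) / 2)

c(mean = mean_surv, median = median_surv)
```

For this toy curve the median is 33.6 months and the restricted mean is 34.8 months.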

12 Step 8 - Nomogram based on Key features

The Nomogram_generate_f function in the CPSM package allows you to generate a nomogram plot based on user-defined clinical and other relevant features in the data. For example, we will generate a nomogram using six features: Age, Gender, Race, Histological Type, Sample Type, and PI score.

To create the nomogram, we need to provide the following inputs:
1. data: A dataset containing all the features, where samples are in the rows and features are in the columns (e.g., Train_Data_Nomogram_input).
2. Feature_List: A list of features (e.g., Age, Gender, etc.) that will be used to generate the nomogram (e.g., feature_list_for_Nomogram).
3. surv_time: The column name containing survival time in months (e.g., OS_month).
4. surv_event: The column name containing survival event information (e.g., OS).

data(Train_Data_Nomogram_input, package = "CPSM")
data(feature_list_for_Nomogram, package = "CPSM")
Result_Nomogram <- Nomogram_generate_f(
  data = Train_Data_Nomogram_input,
  Feature_List = feature_list_for_Nomogram,
  surv_time = "OS_month",
  surv_event = "OS"
)

C_index_mat <- Result_Nomogram$C_index_mat

After running the function, the output is a nomogram that predicts the risk (e.g., event risk such as death), as well as the 1-year, 3-year, 5-year, and 10-year survival probabilities for patients based on the selected features. The nomogram provides a visual representation to estimate the patient’s survival outcomes over multiple time points, helping clinicians make more informed decisions.

13 SessionInfo

As the last part of this document, we call the function sessionInfo(), which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it helps to track down what has happened if an R script stops working because functions have changed in a newer version of a package.

sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.38.0 Biobase_2.68.0             
#>  [3] GenomicRanges_1.60.0        GenomeInfoDb_1.44.0        
#>  [5] IRanges_2.42.0              S4Vectors_0.46.0           
#>  [7] BiocGenerics_0.54.0         generics_0.1.3             
#>  [9] MatrixGenerics_1.20.0       matrixStats_1.5.0          
#> [11] CPSM_1.0.0                  BiocStyle_2.36.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] rstudioapi_0.17.1       jsonlite_2.0.0          shape_1.4.6.1          
#>   [4] magrittr_2.0.3          TH.data_1.1-3           magick_2.8.6           
#>   [7] farver_2.1.2            rmarkdown_2.29          vctrs_0.6.5            
#>  [10] ROCR_1.0-11             base64enc_0.1-3         rstatix_0.7.2          
#>  [13] tinytex_0.57            htmltools_0.5.8.1       S4Arrays_1.8.0         
#>  [16] polspline_1.1.25        broom_1.0.8             SparseArray_1.8.0      
#>  [19] Formula_1.2-5           sass_0.4.10             parallelly_1.43.0      
#>  [22] bslib_0.9.0             htmlwidgets_1.6.4       plyr_1.8.9             
#>  [25] sandwich_3.1-1          zoo_1.8-14              cachem_1.1.0           
#>  [28] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
#>  [31] Matrix_1.7-3            R6_2.6.1                fastmap_1.2.0          
#>  [34] GenomeInfoDbData_1.2.14 future_1.40.0           digest_0.6.37          
#>  [37] numDeriv_2016.8-1.1     colorspace_2.1-1        Hmisc_5.2-3            
#>  [40] ggpubr_0.6.0            labeling_0.4.3          km.ci_0.5-6            
#>  [43] httr_1.4.7              abind_1.4-8             compiler_4.5.0         
#>  [46] withr_3.0.2             htmlTable_2.4.3         backports_1.5.0        
#>  [49] carData_3.0-5           ggsignif_0.6.4          MASS_7.3-65            
#>  [52] lava_1.8.1              quantreg_6.1            DelayedArray_0.34.0    
#>  [55] tools_4.5.0             foreign_0.8-90          future.apply_1.11.3    
#>  [58] nnet_7.3-20             glue_1.8.0              nlme_3.1-168           
#>  [61] grid_4.5.0              checkmate_2.3.2         cluster_2.1.8.1        
#>  [64] reshape2_1.4.4          gtable_0.3.6            KMsurv_0.1-5           
#>  [67] preprocessCore_1.70.0   tidyr_1.3.1             survminer_0.5.0        
#>  [70] data.table_1.17.0       car_3.1-3               XVector_0.48.0         
#>  [73] foreach_1.5.2           pillar_1.10.2           stringr_1.5.1          
#>  [76] splines_4.5.0           dplyr_1.1.4             lattice_0.22-7         
#>  [79] survival_3.8-3          SparseM_1.84-2          tidyselect_1.2.1       
#>  [82] rms_8.0-0               knitr_1.50              gridExtra_2.3          
#>  [85] bookdown_0.43           svglite_2.1.3           xfun_0.52              
#>  [88] stringi_1.8.7           UCSC.utils_1.4.0        yaml_2.3.10            
#>  [91] pec_2023.04.12          evaluate_1.0.3          codetools_0.2-20       
#>  [94] tibble_3.2.1            BiocManager_1.30.25     cli_3.6.4              
#>  [97] survivalROC_1.0.3.1     rpart_4.1.24            xtable_1.8-4           
#> [100] systemfonts_1.2.2       MTLR_0.2.1              munsell_0.5.1          
#> [103] jquerylib_0.1.4         survMisc_0.5.6          Rcpp_1.0.14            
#> [106] globals_0.16.3          parallel_4.5.0          MatrixModels_0.5-4     
#> [109] ggfortify_0.4.17        ggplot2_3.5.2           listenv_0.9.1          
#> [112] glmnet_4.1-8            mvtnorm_1.3-3           timereg_2.0.6          
#> [115] scales_1.3.0            prodlim_2024.06.25      purrr_1.0.4            
#> [118] crayon_1.5.3            rlang_1.1.6             multcomp_1.4-28

14 References

Appendix

Kuhn M (2008). “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software, 28(5), 1–26. doi:10.18637/jss.v028.i05, https://www.jstatsoft.org/index.php/jss/article/view/v028i05.
Bolstad B (2024). preprocessCore: A collection of pre-processing functions. R package version 1.66.0, https://github.com/bmbolstad/preprocessCore.
Horikoshi M, Tang Y (2018). ggfortify: Data Visualization Tools for Statistical Analysis Results. https://CRAN.R-project.org/package=ggfortify.
Therneau T (2024). A Package for Survival Analysis in R. R package version 3.7-0, https://CRAN.R-project.org/package=survival.
Therneau TM, Grambsch PM (2000). Modeling Survival Data: Extending the Cox Model. Springer, New York. ISBN 0-387-98784-3.
Kassambara, A., Kosinski, M., Biecek, P., & Scheipl, F. (2021). survminer: Drawing survival curves using ‘ggplot2’ (Version 0.4.9) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.survminer
Haider, H. (2019). MTLR: Survival Prediction with Multi-Task Logistic Regression (Version 0.2.1) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.MTLR
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://github.com/tidyverse/dplyr, https://dplyr.tidyverse.org.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.
Zhou, H., Cheng, X., Wang, S., Zou, Y., & Wang, H. (2022). SurvMetrics: Predictive Evaluation Metrics in Survival Analysis (Version 0.5.0) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.SurvMetrics
Simon N, Friedman J, Tibshirani R, Hastie T (2011). “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software, 39(5), 1–13. doi:10.18637/jss.v039.i05.
Gerds TA (2023). pec: Prediction Error Curves for Risk Prediction Models in Survival Analysis. R package version 2023.04.12, https://CRAN.R-project.org/package=pec.
Heagerty, P. J., & Saha-Chaudhuri, P. (2022). survivalROC: Time-Dependent ROC Curve Estimation from Censored Survival Data (Version 1.0.3.1) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.survivalROC
Harrell, F. E. Jr. (2024). rms: Regression Modeling Strategies (Version 6.8-1) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.rms
Sing T, Sander O, Beerenwinkel N, Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), 3940–3941. http://rocr.bioinf.mpi-sb.mpg.de.
Bates, D., Maechler, M., Jagan, M., Davis, T. A., Karypis, G., Riedy, J., Oehlschlägel, J., & R Core Team. (2024). Matrix: Sparse and Dense Matrix Classes and Methods (Version 1.7-0) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.Matrix
Harrell, F. E. Jr., & Dupont, C. (2024). Hmisc: Harrell Miscellaneous (Version 5.1-3) [R package]. CRAN. https://doi.org/10.32614/CRAN.package.Hmisc
Wickham H (2007). “Reshaping Data with the reshape Package.” Journal of Statistical Software, 21(12), 1–20. http://www.jstatsoft.org/v21/i12/.
Das P, Roychowdhury A, Das S, Roychoudhury S, Tripathy S (2020). “sigFeature: Novel Significant Feature Selection Method for Classification of Gene Expression Data Using Support Vector Machine and t Statistic.” Frontiers in Genetics, 11, 247. doi:10.3389/fgene.2020.00247, https://www.frontiersin.org/article/10.3389/fgene.2020.00247.

CPSM: Cancer patient survival model

August 12, 2024

Contents