We developed OncoNPC (Oncology NGS-based Primary cancer type Classifier), a molecular cancer type classifier trained on multicenter targeted panel sequencing data. OncoNPC utilizes somatic alterations including mutations (single nucleotide variants and indels), mutational signatures, copy number alterations, as well as patient age at the time of sequencing and sex to jointly predict cancer type.
We have created a short tutorial video to guide you through OncoNPC's GitHub-based visualization tool, which displays cancer type predictions and model explanations for user-provided inputs.
You can watch the tutorial video and access the visualization tool here.
We'll walk you through the installation of the OncoNPC pipeline for feature processing and inference.
- Setting up the Conda Environment
- Utilizing Public Tumor Sequencing Data from AACR GENIE
- Training and Validating the XGBoost-based OncoNPC Model
- Notebook Examples for Predicting Cancer Type and Visualizing Prediction Explanation
- Link to Manuscript
To ensure consistent execution of the code, we recommend using the conda environment specified in onconpc_conda.yml. This file contains all the necessary dependencies with their specific versions.
If you do not have Conda installed, please install Miniconda or Anaconda.
git clone https://github.com/itmoon7/onconpc.git
cd onconpc
Use the onconpc_conda.yml file to create a Conda environment.
conda env create -f onconpc_conda.yml
Once the environment is created, you can activate it using:
conda activate onconpc_conda_env
For further details on the software and package versions, please refer to the onconpc_conda.yml file.
OncoNPC was originally trained using data from multiple sources, including publicly available data from AACR Project GENIE, specifically from two cancer centers (MSK and VICC), as well as private institutional data from DFCI (Dana-Farber Cancer Institute). This repository provides the flexibility to process and train the OncoNPC model using solely the publicly available data from AACR Project GENIE.
For integrating AACR GENIE data with OncoNPC, you will need:
- data_mutations_extended.txt
- data_clinical_patient.txt
- data_clinical_sample.txt
- data_CNA.txt
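Before running the pipeline, it can help to confirm that all four files are present. The following is a minimal sketch, not part of the repository: the helper name find_missing_genie_files and the idea of keeping all files in one directory are assumptions made for illustration.

```python
from pathlib import Path

# File names as listed above. Keeping them together in a single
# directory is an assumption of this sketch.
REQUIRED_GENIE_FILES = (
    "data_mutations_extended.txt",
    "data_clinical_patient.txt",
    "data_clinical_sample.txt",
    "data_CNA.txt",
)


def find_missing_genie_files(genie_dir):
    """Return the required file names that are absent from genie_dir."""
    root = Path(genie_dir)
    return [name for name in REQUIRED_GENIE_FILES if not (root / name).is_file()]
```

Calling find_missing_genie_files with the path to your GENIE download returns an empty list when every required file is in place.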
- Accessing AACR Project GENIE Data: Begin by obtaining the AACR GENIE data as described in their Data Guide. For programmatically downloading the data using the Synapse Python client, refer to the following GitHub repository: aacr_projects_from_synapse. This repository contains scripts and instructions to facilitate the download process.
- Preparing Mutation Signature Features: The pre-processing code uses weight matrices from the COSMIC Sanger Signatures to generate continuous values for mutation signatures.
- Setting up Data Directories: In the load_genie_data() function of the process_features.py script, specify the directories for the AACR GENIE data files.
- Run the Pre-processing Script: To process the AACR GENIE data, use the command:
python process_features.py --config=genie --filename_suffix=[UNIQUE FILENAME SUFFIX]
This writes the processed feature and label dataframes to the /data/ directory, with the provided suffix appended to each filename.
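To make the mutation signature step above more concrete, here is a minimal sketch of one way a weight matrix can turn per-context mutation counts into continuous signature values: a weighted sum per signature. This is an illustration under stated assumptions, not the computation in process_features.py, which may differ. The function name, the signature names, and all numbers are made up for the example and are not COSMIC values.

```python
def project_onto_signatures(context_counts, signature_weights):
    """Weighted sum of per-context mutation counts for each signature."""
    values = {}
    for sig_name, weights in signature_weights.items():
        values[sig_name] = sum(
            weights.get(context, 0.0) * count
            for context, count in context_counts.items()
        )
    return values


# Toy inputs (hypothetical, for illustration only).
# Keys are trinucleotide mutation contexts for a single tumor sample.
context_counts = {"A[C>A]A": 2, "A[C>T]G": 5, "T[T>C]A": 1}
signature_weights = {
    "SIG_A": {"A[C>A]A": 0.5, "A[C>T]G": 0.0, "T[T>C]A": 0.5},
    "SIG_B": {"A[C>A]A": 0.0, "A[C>T]G": 1.0, "T[T>C]A": 0.0},
}

signature_values = project_onto_signatures(context_counts, signature_weights)
print(signature_values)
```

Each sample ends up with one continuous value per signature, which can then sit alongside the other features in the processed feature table.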
After processing features with process_features.py, you can train and validate the OncoNPC model. To do this, follow these steps:
- Specify Data Locations in train_evaluate_onconpc.py:
  - Open the train_evaluate_onconpc.py file.
  - Set the locations of the processed feature and label data. This can be done around lines 50-53. You will need to specify the file paths for both the features and labels. For example:
    # Tab separated feature data for CKPs
    feature_data_name = os.path.join(DATA_PATH, 'features_genie_')  # Replace with your feature file path
    # Tab separated label data for CKPs
    label_data_name = os.path.join(DATA_PATH, 'labels_genie_')  # Replace with your label file path
  - Ensure that DATA_PATH is correctly defined to point to the directory where your processed data files are located.
  - Replace 'features_genie_' and 'labels_genie_' with the names of your actual processed feature and label files.
- Training and Validation: Use the following command for model training and validation:
python train_evaluate_onconpc.py --config=genie --save_model_name=[UNIQUE FILENAME SUFFIX OF YOUR CHOICE] --k_fold=10
  - The script supports k-fold cross-validation for model assessment. Setting k_fold to a specific value (e.g., 10) enables this feature.
  - If k_fold is set to 0, the model trains on the entire dataset and is saved using save_model_name as a filename suffix.
- Outputs: The script outputs performance results and cancer type predictions for tumor samples. The trained model is saved with the specified filename suffix.
Training the OncoNPC model on AACR GENIE data alone may yield different performance or predictions than the original model, because the data sources differ from the original training set, which also included DFCI data.
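The two k_fold modes described above can be sketched as follows. This is a simplified illustration of the general technique, not the splitting logic in train_evaluate_onconpc.py: the function name k_fold_splits and the seed parameter are hypothetical, and whether the script shuffles or stratifies by cancer type is not stated here.

```python
import random


def k_fold_splits(n_samples, k_fold, seed=0):
    """Yield (train_idx, val_idx) pairs; k_fold=0 yields one full-data split."""
    indices = list(range(n_samples))
    if k_fold == 0:
        # Train on everything, with no held-out validation samples.
        yield indices, []
        return
    if not 2 <= k_fold <= n_samples:
        raise ValueError("k_fold must be 0 or between 2 and n_samples")
    # Shuffle once, then deal indices into k_fold roughly equal folds.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k_fold] for i in range(k_fold)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for train_idx, val_idx in k_fold_splits(n_samples=25, k_fold=10):
    print(len(train_idx), len(val_idx))
```

With k_fold=10, every sample is held out for validation exactly once across the ten splits; with k_fold=0, a single split covers the full dataset.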
The OncoNPC Prediction and Explanation for CUP Tumors notebook in this repository provides a practical application of the OncoNPC model. Key highlights include:
- Cancer Type Prediction for CUP Tumors: Demonstrates using the trained OncoNPC model to predict the primary cancer type of CUP tumors based on molecular data.
- Feature Visualization: Showcases how to visualize important features contributing to each cancer type prediction using SHAP (SHapley Additive exPlanations) values.
- Detailed Case Study: Presents a specific case to illustrate the model's cancer type predictions and how SHAP values offer insights.
This notebook is a resource for researchers and clinicians to understand the model's predictions and the key features influencing these predictions in cancer genomics.
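As a rough idea of what "important features" means in this context, the sketch below ranks features for one prediction by the absolute size of their SHAP values. It assumes the SHAP values have already been computed. The function name, the feature names, and the numbers are placeholders for illustration and do not come from the notebook or the model.

```python
def top_contributing_features(shap_values, top_k=3):
    """Return the top_k (feature, value) pairs ranked by absolute SHAP value."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_k]


# Placeholder SHAP values for a single tumor sample and one predicted class.
example_shap = {
    "feature_a": 0.42,
    "feature_b": -0.75,
    "feature_c": 0.05,
    "feature_d": -0.10,
}

top_features = top_contributing_features(example_shap, top_k=2)
print(top_features)
```

Positive values push the prediction toward the class in question and negative values push away from it, which is why the ranking uses the absolute value while the sign is kept in the output.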
The Direct Data Loading and OncoNPC Prediction notebook adds functionality for:
- Loading Raw Data: Utilizes raw cBioPortal-like or AACR GENIE public data, streamlining the process of data preparation.
- Automated Cancer Type Prediction: Integrates a function to automatically predict cancer types using the OncoNPC model.
- Visualization of Prediction Explanation: Visualizes the prediction explanation, offering clarity on how the model arrives at its conclusions for each tumor sample.
This notebook is designed to simplify the process for users who want to apply the OncoNPC model directly to raw datasets.
@article{moon2023machine,
title={Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary},
author={Moon, Intae and LoPiccolo, Jaclyn and Baca, Sylvan C and Sholl, Lynette M and Kehl, Kenneth L and Hassett, Michael J and Liu, David and Schrag, Deborah and Gusev, Alexander},
journal={Nature Medicine},
pages={1--11},
year={2023},
publisher={Nature Publishing Group US New York}
}