OncoNPC

We developed OncoNPC (Oncology NGS-based Primary cancer type Classifier), a molecular cancer type classifier trained on multicenter targeted panel sequencing data. OncoNPC uses somatic alterations, including mutations (single nucleotide variants and indels), mutational signatures, and copy number alterations, together with patient age at the time of sequencing and sex, to jointly predict cancer type.

Tutorial Video and Visualization Tool

We have created a short tutorial video that walks through OncoNPC's GitHub-based visualization tool, which displays cancer type predictions and model explanations for user-provided inputs.

You can watch the tutorial video and access the visualization tool here.

Setting up the Conda Environment

To ensure consistent execution of the code, we recommend using the conda environment specified in onconpc_conda.yml. This file contains all the necessary dependencies with their specific versions.

1. Install Conda

If you do not have Conda installed, please install Miniconda or Anaconda.

2. Clone the Repository

git clone https://github.com/itmoon7/onconpc.git
cd onconpc

3. Create Conda Environment

Use the onconpc_conda.yml file to create a Conda environment.

conda env create -f onconpc_conda.yml

4. Activate the Environment

Once the environment is created, you can activate it using:

conda activate onconpc_conda_env

For further details on the software and package versions, please refer to the onconpc_conda.yml file.
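
To confirm from inside Python that the intended environment is active, a minimal check along the following lines may help. It relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the helper name `in_onconpc_env` is hypothetical and not part of this repository.

```python
import os

EXPECTED_ENV = "onconpc_conda_env"  # name used in the activation command above


def in_onconpc_env(environ=None):
    """Return True if the active conda environment matches EXPECTED_ENV."""
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    return environ.get("CONDA_DEFAULT_ENV") == EXPECTED_ENV


if __name__ == "__main__":
    print("OncoNPC environment active:", in_onconpc_env())
```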

Utilizing Public Tumor Sequencing Data from AACR Project GENIE

Introduction to AACR Project GENIE Data

OncoNPC was originally trained using data from multiple sources, including publicly available data from AACR Project GENIE, specifically from two cancer centers (MSK and VICC), as well as private institutional data from DFCI (Dana-Farber Cancer Institute). This repository provides the flexibility to process and train the OncoNPC model using solely the publicly available data from AACR Project GENIE.

Required Data Files from AACR Project GENIE

For integrating AACR GENIE data with OncoNPC, you will need:

data_mutations_extended.txt
data_clinical_patient.txt
data_clinical_sample.txt
data_CNA.txt
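
Before running the pre-processing script, it can be useful to verify that all four files are present. The sketch below is one way to do that; `missing_genie_files` is a hypothetical helper for illustration, not a function from this repository, and it assumes the files sit directly in a single directory.

```python
from pathlib import Path

# File names listed above as required from AACR Project GENIE
REQUIRED_GENIE_FILES = (
    "data_mutations_extended.txt",
    "data_clinical_patient.txt",
    "data_clinical_sample.txt",
    "data_CNA.txt",
)


def missing_genie_files(genie_dir):
    """Return the required file names that are absent from genie_dir."""
    root = Path(genie_dir)
    return [name for name in REQUIRED_GENIE_FILES if not (root / name).is_file()]
```

An empty list means every required file was found; otherwise the list names what still needs to be downloaded.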

Integrating AACR Project GENIE Data with OncoNPC

Accessing AACR Project GENIE Data: Begin by obtaining the AACR GENIE data as described in their Data Guide. For programmatically downloading the data using the Synapse Python client, refer to the following GitHub repository: aacr_projects_from_synapse. This repository contains scripts and instructions to facilitate the download process.
Preparing Mutation Signature Features:
- The pre-processing code uses weight matrices from the COSMIC Sanger Signatures to generate continuous values for the mutation signature features.
Setting up Data Directories: In the load_genie_data() function of the process_features.py script, specify the directories containing the AACR GENIE data files.
Run the Pre-processing Script: To process the AACR GENIE data, use the command:
```
python process_features.py --config=genie --filename_suffix=[UNIQUE FILENAME SUFFIX]
```
This creates processed feature and label dataframes in the /data/ directory, with the provided suffix appended to their file names.
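
As a rough illustration of the mutation signature step described above, the sketch below projects per-sample mutation-context counts onto a signature weight matrix to obtain continuous signature values. It assumes a plain weighted sum over contexts; the actual computation in process_features.py may differ. The function name, context labels, signature names, and weights are toy placeholders, not COSMIC values.

```python
def signature_scores(context_counts, weights):
    """Project mutation-context counts onto signature weights.

    context_counts: dict mapping context label -> mutation count for one sample
    weights: dict mapping signature name -> {context label -> weight}
    Returns a dict mapping signature name -> continuous score.
    """
    return {
        sig: sum(w * context_counts.get(ctx, 0) for ctx, w in ctx_weights.items())
        for sig, ctx_weights in weights.items()
    }


# Toy placeholders (NOT real COSMIC weights): two signatures over three contexts.
toy_weights = {
    "SIG_A": {"A[C>A]A": 0.5, "A[C>T]G": 0.5, "T[T>C]T": 0.0},
    "SIG_B": {"A[C>A]A": 0.0, "A[C>T]G": 0.25, "T[T>C]T": 0.75},
}
toy_counts = {"A[C>A]A": 2, "A[C>T]G": 4, "T[T>C]T": 0}

scores = signature_scores(toy_counts, toy_weights)
```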

Training and Validating the XGBoost-based OncoNPC Model

After processing features with process_features.py, you can train and validate the OncoNPC model. To do this, follow these steps:

Specify Data Locations in train_evaluate_onconpc.py:
- Open the train_evaluate_onconpc.py file.
- Set the locations of the processed feature and label data. This can be done around lines 50-53. You will need to specify the file paths for both the features and labels. For example:
```
# Tab separated feature data for CKPs
feature_data_name = os.path.join(DATA_PATH, 'features_genie_')  # Replace with your feature file path
# Tab separated label data for CKPs
label_data_name = os.path.join(DATA_PATH, 'labels_genie_')  # Replace with your label file path
```
- Ensure that DATA_PATH is correctly defined to point to the directory where your processed data files are located.
- Replace 'features_genie_' and 'labels_genie_' with the names of your actual processed feature and label files.
Training and Validation: Use the following command for model training and validation:
```
python train_evaluate_onconpc.py --config=genie --save_model_name=[UNIQUE FILENAME SUFFIX OF YOUR CHOICE] --k_fold=10
```
- The script supports k-fold cross-validation for model assessment. Setting k_fold to a positive value (e.g., 10) enables it.
- If k_fold is set to 0, the model is trained on the entire dataset and saved with save_model_name as the filename suffix.
Outputs: The script outputs performance results and cancer type predictions for tumor samples. The trained model is saved with the specified filename suffix.
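
To make the k_fold semantics concrete, here is a minimal sketch of how k-fold splitting partitions sample indices, with k_fold=0 treated as "train on everything". The function `kfold_splits` is a hypothetical helper written for illustration; it does not reproduce the splitting logic in train_evaluate_onconpc.py, which may shuffle or stratify differently.

```python
import random


def kfold_splits(n_samples, k_fold, seed=0):
    """Yield (train_idx, val_idx) pairs of index lists.

    k_fold == 0 yields a single split with every sample in training and
    an empty validation set, mirroring "train on the entire dataset".
    """
    indices = list(range(n_samples))
    if k_fold == 0:
        yield indices, []
        return
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k_fold] for i in range(k_fold)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx
```

With k_fold=10, each sample appears in exactly one validation fold, so every tumor receives one held-out prediction.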

Note

Training OncoNPC on AACR GENIE data alone may yield performance or results that differ from those of the original model, whose training set also included private institutional data from DFCI.

Notebook Examples for Predicting Cancer Type and Visualizing Prediction Explanation

1. OncoNPC Model Application for CUP Tumors

The OncoNPC Prediction and Explanation for CUP Tumors notebook in this repository provides a practical application of the OncoNPC model. Key highlights include:

Cancer Type Prediction for CUP Tumors: Demonstrates using the trained OncoNPC model to predict the primary cancer type of CUP tumors based on molecular data.
Feature Visualization: Showcases how to visualize important features contributing to each cancer type prediction using SHAP (SHapley Additive exPlanations) values.
Detailed Case Study: Presents a specific case to illustrate the model's cancer type predictions and how SHAP values offer insights.

This notebook is a resource for researchers and clinicians to understand the model's predictions and the key features influencing these predictions in cancer genomics.
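
To show the kind of ranking that a SHAP-based explanation builds on, the sketch below orders features by the magnitude of their SHAP values for a single prediction. It is not taken from the notebook: `top_contributing_features` is a hypothetical helper, and the feature names and values are made up for illustration.

```python
def top_contributing_features(shap_values, n=3):
    """Return the n features with the largest absolute SHAP value, largest first."""
    ranked = sorted(shap_values.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Made-up SHAP values for one tumor sample and one predicted cancer type.
toy_shap = {
    "gene_X_mutation": 0.42,
    "signature_Y": -0.10,
    "gene_Z_cna": 0.05,
    "age": -0.31,
    "sex": 0.02,
}

top = top_contributing_features(toy_shap, n=3)
```

Positive values push the model output toward the cancer type in question and negative values push away from it, so the sign is kept in the result rather than discarded.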

2. Direct Data Loading and OncoNPC Prediction

The Direct Data Loading and OncoNPC Prediction notebook adds functionality for:

Loading Raw Data: Utilizes raw cBioPortal-like or AACR GENIE public data, streamlining data preparation.
Automated Cancer Type Prediction: Integrates a function to automatically predict cancer types using the OncoNPC model.
Visualization of Prediction Explanation: Visualizes the prediction explanation, clarifying how the model arrives at its prediction for each tumor sample.

This notebook is designed to simplify the process for users who want to apply the OncoNPC model directly to raw datasets.
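
For readers unfamiliar with the raw file layout, the following sketch reads a tab-separated cBioPortal-style table using only the standard library. It assumes that metadata lines start with `#` and that the first remaining line is the column header; the helper `read_cbio_table`, the column names, and the rows are illustrative and are not the notebook's loading code.

```python
import csv
import io


def read_cbio_table(handle):
    """Read a tab-separated cBioPortal-style table into a list of dicts.

    Lines starting with '#' are treated as metadata and skipped
    (an assumption about the file layout).
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


# Tiny in-memory stand-in for a clinical sample file (made-up rows).
example = io.StringIO(
    "#Sample Identifier\tCancer Type\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tTYPE_A\n"
    "S2\tTYPE_B\n"
)
rows = read_cbio_table(example)
```

For a file on disk, the same function can be given an open handle, for example `open(path, newline="")`.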

Link to Manuscript

Manuscript

Citation

@article{moon2023machine,
  title={Machine learning for genetics-based classification and treatment response prediction in cancer of unknown primary},
  author={Moon, Intae and LoPiccolo, Jaclyn and Baca, Sylvan C and Sholl, Lynette M and Kehl, Kenneth L and Hassett, Michael J and Liu, David and Schrag, Deborah and Gusev, Alexander},
  journal={Nature Medicine},
  pages={1--11},
  year={2023},
  publisher={Nature Publishing Group US New York}
}
