This repository contains the code and data for the paper Phylogenetic Tree Inference from Single-Cell RNA Sequencing. The code and datasets provided here enable users to replicate the experiments and figures presented in the paper, as well as to run SCITE-RNA on new data.
We implement a new method for reconstructing phylogenetic trees from single-cell RNA sequencing data. SCITE-RNA selects single-nucleotide variants (SNVs) and reconstructs a phylogenetic tree of the sequenced cells. It maximizes the likelihood of the inferred tree by alternating between the cell lineage and mutation tree spaces until the search has converged in both. This repository provides:
- Scripts to execute SCITE-RNA. The model is implemented in both C++ (src_cpp) and Python (src_python). For large numbers of cells and SNVs, the C++ code is recommended because it is significantly faster. The inferred trees should be comparable between the C++ and Python implementations.
- Summaries of the data used in the paper, available in the data/ directory, which contains all files needed to reproduce the figures.
- Visualization scripts to generate the plots presented in the paper.
SCITE-RNA
├── data/                        # Input data files and results
│   ├── input_data/              # Alternative and reference read counts, among other files
│   ├── results/                 # Inferred trees of the multiple myeloma dataset and figures
│   └── simulated_data/          # Summary statistics of simulated data and the respective inferred trees
├── generate_results_cpp/        # C++ scripts for tree inference
├── generate_results_python_r/   # Python and R scripts for simulating data, inferring trees, and visualization
├── src_cpp/                     # C++ source files for SCITE-RNA
├── src_python/                  # Python source files for SCITE-RNA
├── configs/                     # Model parameters (Python)
├── CMakeLists.txt               # Primary configuration file for CMake
└── README.md                    # Project overview and setup instructions
- numpy
- pandas
- matplotlib
- seaborn
- scipy
- numba
- math (part of the Python standard library; no separate installation needed)
- jupyter
- pyyaml
- graphviz
- CMake (>= 3.27)
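Before running anything, it can help to confirm that the requirements listed above are visible to the current Python environment. The following is a minimal sketch, not part of the repository: the helper name `check_requirements` is hypothetical, it assumes `jupyter` and `cmake` are used as command-line tools found on the PATH and that `graphviz` refers to the Python package, and it does not verify the CMake version.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above.
# Note: the pyyaml distribution is imported under the name "yaml".
PYTHON_MODULES = ["numpy", "pandas", "matplotlib", "seaborn",
                  "scipy", "numba", "yaml", "graphviz"]
# Command-line tools that are looked up on the PATH (assumption).
EXECUTABLES = ["jupyter", "cmake"]


def check_requirements(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return a dict mapping each requirement name to True if it was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when no executable of that name is on the PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name:<12} {'ok' if found else 'MISSING'}")
```

Anything reported as MISSING can then be installed with the package manager of your choice.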
To set up the SCITE-RNA project locally:
git clone https://github.com/cbg-ethz/SCITE-RNA.git
cd SCITE-RNA
To reproduce the figures quickly, you can use the files provided in data/.
Because the result files were numerous and large, we produced summary statistics using
generate_results_python_r/generate_summary_statistics.ipynb
Adjust model parameters in configs/config.yaml for Python, or in src_cpp/mutation_filter.h for C++.
To generate new simulated data, execute:
generate_results_python_r/comparison_data_generation.py
This script lets you set the number of cells, the number of SNVs, and the number of simulated clones. The same file can also be used for tree inference. Alternatively, run the C++ version, which performs tree inference only (not data generation):
generate_results_cpp/comparison_num_clones.cpp
To compare different strategies for switching between tree spaces, run:
generate_results_cpp/comparison_tree_spaces_switching.cpp
Otherwise, the model alternates by default between cell lineage and mutation tree optimization, starting from a random cell lineage tree.
All simulated results will be saved in data/simulated_data/.
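The default behaviour of alternating between cell lineage and mutation tree optimization can be pictured as a generic alternating search. The sketch below is only an illustration of that control flow under stated assumptions: `alternate_until_converged` and the two toy optimizers are hypothetical names, the toy objective stands in for the tree likelihood, and none of this reproduces the actual optimizers in src_cpp or src_python.

```python
def alternate_until_converged(state, optimize_a, optimize_b, tol=1e-9, max_rounds=100):
    """Alternate two optimizers until neither improves the score by more than tol.

    Each optimizer takes the current state and returns (candidate_state, score).
    In the analogy, optimize_a plays the role of the cell lineage tree space and
    optimize_b the role of the mutation tree space.
    """
    best = float("-inf")
    for _ in range(max_rounds):
        improved = False
        for optimize in (optimize_a, optimize_b):
            candidate, score = optimize(state)
            if score > best + tol:
                state, best = candidate, score
                improved = True
        if not improved:
            # Neither space yielded an improvement: treat as converged in both.
            break
    return state, best


# Toy stand-in for a log-likelihood, maximal at (3, -1).
def objective(point):
    x, y = point
    return -(x - 3) ** 2 - (y + 1) ** 2


def optimize_x(point):
    """Toy optimizer for the first 'space': sets x to its optimum, keeps y."""
    candidate = (3, point[1])
    return candidate, objective(candidate)


def optimize_y(point):
    """Toy optimizer for the second 'space': sets y to its optimum, keeps x."""
    candidate = (point[0], -1)
    return candidate, objective(candidate)


final_state, final_score = alternate_until_converged((0, 0), optimize_x, optimize_y)
print(final_state, final_score)
```

The loop stops as soon as a full round through both optimizers brings no improvement, which mirrors the idea of requiring convergence in both tree spaces rather than in only one.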
To run SCITE-RNA on the Multiple Myeloma dataset:
- Run mutation filtering:
  generate_results_python_r/MM.py
- Perform tree inference in C++ for faster computation:
  generate_results_cpp/MM.cpp
Results will be saved in data/results/mm34/.
To use SCITE-RNA on new data:
- Prepare reference and alternative allele count files in .txt format. Use the format provided in data/input_data/new_data as a reference, where columns represent cells and rows represent SNVs.
- Optionally set the number of bootstrap samples, then run SCITE-RNA tree inference with the following script:
  generate_results_cpp/run_sciterna.cpp
- The results are saved in data/results/new_data/.
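Before running inference on new data, it may be worth checking that the two count matrices are consistent with each other. The following is a minimal sketch under explicit assumptions: it treats the files as whitespace-delimited integer matrices without header or index columns, with one SNV per row and one cell per column, and the file names in the usage note are hypothetical. The example files in data/input_data/new_data remain the authoritative reference for the format, so adapt the parsing if they differ.

```python
from pathlib import Path


def read_count_matrix(path):
    """Read a whitespace-delimited integer matrix (rows = SNVs, columns = cells).

    Assumes no header row and no index column; blank lines are skipped.
    """
    rows = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            rows.append([int(value) for value in line.split()])
    return rows


def validate_counts(ref, alt):
    """Check that ref and alt have identical shape and no negative counts.

    Returns (n_snvs, n_cells) on success and raises ValueError otherwise.
    """
    if not ref or len(ref) != len(alt):
        raise ValueError("ref and alt must have the same, non-zero number of SNV rows")
    n_cells = len(ref[0])
    for label, matrix in (("ref", ref), ("alt", alt)):
        for i, row in enumerate(matrix):
            if len(row) != n_cells:
                raise ValueError(f"{label} row {i} has {len(row)} cells, expected {n_cells}")
            if any(value < 0 for value in row):
                raise ValueError(f"{label} row {i} contains a negative count")
    return len(ref), n_cells
```

A call such as `validate_counts(read_count_matrix("ref.txt"), read_count_matrix("alt.txt"))` would then return the number of SNVs and cells, or raise an error that points at the offending row.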
To reproduce the plots presented in the paper, follow the instructions below:
- By default, the plots are generated from the summary statistics generated with
  generate_results_python_r/comparison_data_generation.py
- Figure 3: Comparison of tree optimization strategies
  generate_results_python_r/comparison_tree_spaces_switching.py
- Figure 4: Comparison to SClineager and DENDRO, and a variable number of clones
- Figure A.2: Runtime comparison
  Optionally, rerun DENDRO and SClineager on the simulated data first:
  generate_results_python_r/comparison_clones_sclineager_dendro_sciterna.R
  To generate the figures, run:
  generate_results_python_r/comparison_num_clones.ipynb
- Figure 5: Representative tree for the multiple myeloma dataset
- Figure A.1: Gene expression analysis
  generate_results_python_r/bootstrap_results_mm.ipynb
All figures will be saved in data/results/figures/.