By Nick Semenkovich
An opinionated rework of the CelFiE project (original repository here). Note that this project simply reworks CelFiE to make it more Pythonic / reproducible and accept standard (.bed) inputs.
The goal of this code is to predict fractional tissue abundance from a mixed population of cells, using cell-free methylation data. You will likely need to build your own reference matrix (here called a "TIM" matrix, detailed below), though the original is available for reference.
This repo can be run entirely in Google Colab: celfie-simplified-demo.ipynb
Locally, you can run:
git clone https://github.com/semenko/celfie-simplified
cd celfie-simplified
pip3 install -r requirements.txt
python3 celfie-simplified.py --input data/sample-neutrophil.bed --reference_tims data/caggiano_TIM_matrix.bed --unknowns 0 --output sample-output/sample-neutrophil
CelFiE expects two input files: your sample's methylation data (as a .bed) and the tissue informative marker (TIM) matrix.
Your sample's data .bed should have columns 4 and 5 set to the # of methylated reads and # of total reads:
# chr start end Hepatocyte_meth Hepatocyte_depth
chr1 10 11 44 63
chr1 50 51 71 133
chr1 60 61 89 115
Note: Your sample .bed does not need a header. Without a header, samples will be named "sample1, sample2 …". If you provide a header, it must start with #, and your sample names must be formatted as "Tissue_name_meth" and "Tissue_name_depth". You can include more than one tissue (e.g. columns 6 and 7 can be tissue2_meth and tissue2_depth).
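As an illustration of the header rule above, here is a minimal sketch that checks a header line and extracts the sample names. The function name `sample_names_from_header` is hypothetical and is not part of celfie-simplified; it only mirrors the naming convention described in this README.

```python
def sample_names_from_header(header: str) -> list[str]:
    """Return sample names from a '#'-prefixed .bed header line.

    Hypothetical helper: assumes whitespace-separated columns, with
    chrom/start/end first, followed by <name>_meth / <name>_depth pairs.
    """
    if not header.startswith("#"):
        raise ValueError("header must start with '#'")
    columns = header.lstrip("#").split()[3:]  # skip chrom, start, end
    if len(columns) % 2 != 0:
        raise ValueError("expected paired _meth/_depth columns")
    names = []
    for meth_col, depth_col in zip(columns[0::2], columns[1::2]):
        if not meth_col.endswith("_meth") or not depth_col.endswith("_depth"):
            raise ValueError(f"bad column pair: {meth_col}, {depth_col}")
        name = meth_col[: -len("_meth")]
        if depth_col[: -len("_depth")] != name:
            raise ValueError(f"mismatched names: {meth_col}, {depth_col}")
        names.append(name)
    return names
```

For the example header above, this would return a single sample name, `Hepatocyte`.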
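Note: Many analyses generate % methylation values. You can convert these to absolute counts with the command below, which treats column 4 as a methylation fraction between 0 and 1 and column 5 as the total read depth (if your values run from 0 to 100, divide column 4 by 100 first):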
awk 'BEGIN{OFS="\t"}{ print $1, $2, $3, int($4 * $5 + 0.5), int($5) }'
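If you prefer Python, the same conversion can be sketched as follows. Like the awk command, it assumes column 4 is a methylation fraction between 0 and 1 and column 5 is the total depth. `fraction_to_counts` is a hypothetical helper name, not a function provided by this repo.

```python
def fraction_to_counts(line: str) -> str:
    """Convert one tab-separated BED line from (fraction, depth) to (meth reads, depth)."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[:3]
    fraction = float(fields[3])  # assumed to be in the range 0..1
    depth = float(fields[4])
    # int(x + 0.5) rounds half up for non-negative x, as in the awk command
    methylated = int(fraction * depth + 0.5)
    return "\t".join([chrom, start, end, str(methylated), str(int(depth))])
```

Applying it to every line of an input file and writing the results out gives a .bed in the format CelFiE expects.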
CelFiE expects a reference matrix of tissue informative markers (TIMs) in a .bed file with a header in the following format:
# chrom start end tissue1_meth tissue1_depth tissue2_meth tissue2_depth
chr1 10 11 25 29 53 105
chr1 50 51 85 99 72 285
chr1 60 61 92 117 12 33
Note: Here, a .bed header is required, as CelFiE needs to have valid tissue names for assignments and subsequent plotting.
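To make the layout concrete, here is a minimal sketch that reads a reference matrix in this format and computes per-tissue methylation fractions at each site. `read_tim_matrix` is a hypothetical name and this is not how celfie-simplified itself loads the file; it assumes whitespace-separated columns and paired `_meth` / `_depth` columns after the first three.

```python
import io


def read_tim_matrix(handle):
    """Return (tissues, rows) from a TIM matrix with a '#' header.

    rows maps (chrom, start, end) to a list of methylation fractions,
    one per tissue, in header order (NaN where depth is zero).
    """
    header = handle.readline()
    if not header.startswith("#"):
        raise ValueError("reference matrix needs a '#' header line")
    columns = header.lstrip("#").split()[3:]  # skip chrom, start, end
    tissues = [c[: -len("_meth")] for c in columns[0::2]]
    rows = {}
    for line in handle:
        if not line.strip():
            continue
        fields = line.split()
        site = (fields[0], int(fields[1]), int(fields[2]))
        counts = [float(x) for x in fields[3:]]
        rows[site] = [
            meth / depth if depth > 0 else float("nan")
            for meth, depth in zip(counts[0::2], counts[1::2])
        ]
    return tissues, rows


# Usage with the example matrix shown above
example = io.StringIO(
    "# chrom start end tissue1_meth tissue1_depth tissue2_meth tissue2_depth\n"
    "chr1 10 11 25 29 53 105\n"
    "chr1 50 51 85 99 72 285\n"
)
tissues, rows = read_tim_matrix(example)
```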
As a reference, you can use the TIM matrix from the original CelFiE paper, which is in data/caggiano_TIM_matrix.bed, but this matrix has some caveats (see below).
Output formatting is unchanged from the original CelFiE code. CelFiE outputs tissue estimates for each sample in your input, i.e. the proportion of each tissue in the reference making up the cfDNA sample. See celfie_demo/sample_output/1_tissue_proportions.txt for an example of this output.
tissue1 tissue2 .... unknown
sample1 0.05 0.08 .... 0.1
sample2 0.7 0.12 .... 0.2
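A minimal sketch for reading a table like this into Python is shown below. It assumes the file is tab-delimited, with tissue names in the header and sample names in the first column. The delimiter is an assumption, so check your actual output. `read_proportions` and `dominant_tissue` are hypothetical helper names.

```python
def read_proportions(text):
    """Parse a tissue-proportion table into {sample: {tissue: proportion}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    table = {}
    for ln in lines[1:]:
        sample, *values = ln.split("\t")
        # The header may or may not begin with an empty cell for the sample
        # column, so align tissue names to the values from the right.
        tissues = header[-len(values):]
        table[sample] = {t: float(v) for t, v in zip(tissues, values)}
    return table


def dominant_tissue(proportions):
    """Return the tissue (or 'unknown') with the largest estimated proportion."""
    return max(proportions, key=proportions.get)
```

For example, `dominant_tissue(read_proportions(text)["sample1"])` gives the component with the highest estimate for `sample1`.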
CelFiE also outputs the methylation proportions for each of the tissues plus however many unknowns were estimated. This output will look like this:
tissue1 tissue2 ... unknown
CpG1 0.99 1.0 ... 0.3
CpG2 0.45 0.88 ... 0.1
Sample code for processing both of these outputs can be seen in demo.ipynb.
The original TIM matrix, in `data/`, was trained on a combination of human hg38 data from ENCODE and Blueprint, as described in the original paper. It encompasses 19 human tissues, listed below.
However, this matrix is ***
TIMs are available at TIMs/sample_tims.txt for individual CpG TIMs, and TIMs/sample_tims_summed.txt for reads summed +/-250bp around a TIM. We recommend using TIMs/sample_tims_summed.txt for improved decomposition performance.
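The idea of summing reads around a TIM can be sketched as follows. This is an illustration, not the code used to build TIMs/sample_tims_summed.txt: the function name `sum_reads_around` is hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
def sum_reads_around(tim_pos, cpgs, window=250):
    """Sum methylated reads and depth over CpGs within +/- window bp of a TIM.

    cpgs is an iterable of (position, methylated_reads, depth) tuples for
    CpGs on the same chromosome as the TIM.
    """
    meth_total = 0
    depth_total = 0
    for pos, meth, depth in cpgs:
        if abs(pos - tim_pos) <= window:  # inclusive boundary (assumption)
            meth_total += meth
            depth_total += depth
    return meth_total, depth_total
```

Pooling neighboring CpGs in this way raises the effective read depth at each marker, which is one plausible reason the summed file decomposes better.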
The TIMs represent markers for the following tissues:
- dendritic cells
- endothelial cells
- eosinophils
- erythroblasts
- macrophages
- monocytes
- neutrophils
- placenta
- T-cells
- adipose
- brain
- fibroblasts
- heart left ventricle
- hepatocytes
- lung
- mammary gland
- megakaryocytes
- skeletal muscle myoblasts
- small intestine
Please note all data was converted to hg38 and all CpGs are reported as (Chrom, start, end), where the end position indicates the C in the CpG dinucleotide.
Code to find TIMs is located at TIMs/tim.py. This code takes as input a reference bedfile of all the tissues you would like to calculate TIMs for. See TIMs/sample_input.txt for an example.
The TIM code can be run as:
python tim.py <input file> <output file> <num of tim/tissue> <num of tissues> <depth filter> <nan filter>
The number of TIMs per tissue can be adjusted, but note that as the number of TIMs approaches the number of CpGs, each additional TIM becomes less informative for that tissue.
The depth filter restricts selection to CpGs whose median depth across all tissues is greater than a user-specified value. This ensures that low-coverage CpGs do not get selected as TIMs.
The NaN filter restricts selection to CpGs that have fewer than a user-specified number of missing values. This ensures a TIM isn't selected for a tissue merely because it is one of the few tissues with data at that location.
The number of TIMs per tissue can vary. We find that 100 is a good number. As the number of TIMs increases, their quality decreases, since we select the most informative CpGs per tissue first (in other words, the top 100 most informative CpGs for pancreas will, by definition, be "better" than the top 500).
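The two filters described above can be sketched for a single CpG as follows. This is a simplified illustration rather than the logic in TIMs/tim.py: `passes_filters` is a hypothetical name, and ignoring missing values when taking the median is an assumption.

```python
import math
import statistics


def passes_filters(depths, depth_filter, nan_filter):
    """Decide whether one CpG is eligible to be selected as a TIM.

    depths holds the read depth at this CpG for each tissue, with
    float('nan') marking tissues that have no data there.
    """
    observed = [d for d in depths if not math.isnan(d)]
    n_missing = len(depths) - len(observed)
    # NaN filter: keep only CpGs with fewer than nan_filter missing values
    if n_missing >= nan_filter:
        return False
    if not observed:
        return False
    # Depth filter: median depth across tissues must exceed the threshold
    # (assumption: the median is taken over tissues that have data)
    return statistics.median(observed) > depth_filter
```

For example, `passes_filters([20, 30, float("nan")], depth_filter=15, nan_filter=2)` keeps the CpG, while lowering `nan_filter` to 1 rejects it.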
Christa Caggiano, Barbara Celona, Fleur Garton, Joel Mefford, Brian Black, Catherine Lomen-Hoerth, Andrew Dahl, Noah Zaitlen, "Comprehensive cell type decomposition of circulating cell-free DNA with CelFiE", Nature Communications, May 2021 link