Skip to content

Latest commit

 

History

History
137 lines (101 loc) · 14.5 KB

faqs.md

File metadata and controls

137 lines (101 loc) · 14.5 KB

FAQs - Frequently asked questions

1. Are the SComatic parameters for scATAC-seq data the same as for scRNA-seq data?

No, they are different. ATAC-seq data is a DNA-based approach. Therefore, there are differences in the way of processing the bam files. These are the parameters that you should change in comparison to the scRNA-seq workflow:

  • Step 1: remove the --max_nM and --max_NH parameters, set the mapping quality filter to --min_MQ 30 .
  • Step 2: set the mapping quality filter to --min_mq 30
  • Step 4.2: Remove the --editing parameter, and --pon altered to point at the scATACseq PoN provided in this GitHub repo (or custom one)

2. How can we perform the variant annotation with the SComatic output?

In our manuscript, we have annotated all our variants using annovar. With a couple of command lines, we can adapt our variant calling output for this annotation tool. Using our example data, we should run these lines:

1. Preparing SComatic output for annovar

ANNOVAR=/path/to/annovar
hummandb=/path/to/annovar_humandb_hg38

grep -v '#' Example.calling.step2.pass.tsv |  tr '\t' '-' | awk -F'-' -v OFS='\t' '{print $1,$2,$3,$4,$5,$0}' > sample.variants.avinput

2. Annotate variants using the annovar-ready variant file

perl $ANNOVAR/table_annovar.pl sample.variants.avinput \
                $hummandb -buildver hg38 \
                -out sample.variants.annovar \
                -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a,gnomad_genome -operation g,r,f,f,f,f \
                -nastring . -csvout -polish --otherinfo

This step will generate a comma-separated file (.csv) called sample.variants.annovar.hg38_multianno.csv , which can be easily opened and processed in your more desired software (R, excel...). The output should look like this:

Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,cytoBand,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS,avsnp147,SIFT_score,SIFT_pred,Polyphen2_HDIV_score,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,FATHMM_score,FATHMM_pred,PROVEAN_score,PROVEAN_pred,VEST3_score,CADD_raw,CADD_phred,DANN_score,fathmm-MKL_coding_score,fathmm-MKL_coding_pred,MetaSVM_score,MetaSVM_pred,MetaLR_score,MetaLR_pred,integrated_fitCons_score,integrated_confidence_value,GERP++_RS,phyloP7way_vertebrate,phyloP20way_mammalian,phastCons7way_vertebrate,phastCons20way_mammalian,SiPhy_29way_logOdds,gnomAD_genome_ALL,gnomAD_genome_AFR,gnomAD_genome_AMR,gnomAD_genome_ASJ,gnomAD_genome_EAS,gnomAD_genome_FIN,gnomAD_genome_NFE,gnomAD_genome_OTH,Otherinfo
chr10,29559501,29559501,A,T,"intronic","SVIL",.,.,.,"10p11.23",.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"chr10-29559501-29559501-A-T-PASS-Epithelial_cells-ATTCA-TTGAT-1-12-5-4-2-0.3333-0.4-0.0-0.0001-2-2-0;38;1-0;16;1-.-PASS-DP|NC|CC|BC|BQ|BCf|BCr-NA-12|5|3:0:2:0:0:0|8:0:4:0:0:0|306:0:164:0:0:0|0:0:0:0:0:0|8:0:4:0:0:0-30|13|13:0:0:0:0:0|30:0:0:0:0:0|1189:0:0:0:0:0|0:0:0:0:0:0|30:0:0:0:0:0"
chr10,73413109,73413109,G,A,"intronic","ANXA7",.,.,.,"10q22.2",.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"chr10-73413109-73413109-G-A-PASS-Epithelial_cells-TGGAG-CATGG-1-12-8-4-3-0.3333-0.375-0.0-0.0-2-2-0;15;1-0;11;1-.-PASS-DP|NC|CC|BC|BQ|BCf|BCr-NA-12|8|3:0:0:5:0:0|4:0:0:8:0:0|160:0:0:302:0:0|0:0:0:0:0:0|4:0:0:8:0:0-7|6|0:0:0:6:0:0|0:0:0:7:0:0|0:0:0:270:0:0|0:0:0:0:0:0|0:0:0:7:0:0"
chr10,89413978,89413978,G,A,"upstream","IFIT5","dist=590",.,.,"10q23.31",.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"chr10-89413978-89413978-G-A-PASS-Epithelial_cells-GTATG-GTGTT-1-18-11-7-4-0.3889-0.3636-0.0-0.0-2-2-0;20;1-0;12;1-.-PASS-DP|NC|CC|BC|BQ|BCf|BCr-NA-18|11|4:0:0:7:0:0|7:0:0:11:0:0|270:0:0:417:0:0|0:0:0:0:0:0|7:0:0:11:0:0-9|5|0:0:0:5:0:0|0:0:0:9:0:0|0:0:0:339:0:0|0:0:0:0:0:0|0:0:0:9:0:0"
chr10,97332670,97332670,G,A,"UTR3","FRAT2","NM_012083:c.*1201C>T",.,.,"10q24.1",.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"chr10-97332670-97332670-G-A-PASS-Epithelial_cells-TTGGG-GTGGC-1-12-11-6-6-0.5-0.5455-0.0-0.0-2-2-0;12;1-0;11;1-.-PASS-DP|NC|CC|BC|BQ|BCf|BCr-6|6|0:0:0:6:0:0|0:0:0:6:0:0|0:0:0:237:0:0|0:0:0:0:0:0|0:0:0:6:0:0-12|11|6:0:0:5:0:0|6:0:0:6:0:0|229:0:0:225:0:0|0:0:0:0:0:0|6:0:0:6:0:0-NA"
chr10,126926205,126926205,G,A,"intronic","DOCK1",.,.,.,"10q26.2",.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,"chr10-126926205-126926205-G-A-PASS-Epithelial_cells-GATCC-GTGAC-1-35-17-15-6-0.4286-0.3529-0.0-0.0-2-2-0;35;1-0;18;1-.-PASS-DP|NC|CC|BC|BQ|BCf|BCr-NA-35|17|6:0:0:12:0:0|15:0:0:20:0:0|603:0:0:786:0:0|15:0:0:20:0:0|0:0:0:0:0:0-15|7|0:0:0:7:0:0|0:0:0:15:0:0|0:0:0:590:0:0|0:0:0:15:0:0|0:0:0:0:0:0"

As you will notice, the first few columns of the file have the annotation information provided by annovar (Gene, region, impact, Gnomad allele frequencies...). Importantly, the final column of the csv file (Otherinfo) shows all the info found in the original Example.calling.step2.pass.tsv, but separated by the symbol "-" . The column names of each one of these Otherinfo items are the same as the ones found in the Example.calling.step2.pass.tsv :

> grep '^#CHROM' Example.calling.step2.pass.tsv

#CHROM	Start	End	REF	ALT	FILTER	Cell_types	Up_context	Down_context	N_ALT	Dp	Nc	Bc	Cc	VAF	CCF	BCp	CCp	Cell_types_min_BC	Cell_types_min_CC	Rest_BC	Rest_CC	Fisher_p	Cell_type_Filter	INFO	Myeloids	Epithelial_cells	Stromal_cells

Of course, depending on the columns you want to keep for the annotation process, you might slightly change the command line in the step 1 described above, as well as the parameters provided in the annovar computation. The parameters provided here are the ones used in our manuscript.

3. Can SComatic work with other types of PoN files?

Yes, SComatic can deal with PoNs obtained by other tools.

The --pon parameter permits working with different types of formats/files depending on the availability and quantity of non-neoplastic samples. These are the main options:

  • Using the PoNs provided in this repository, which were computed using the data described in the main manuscript of SComatic.
  • Using your custom PoNs, which can be built using SComatic modules.
  • TSV file listing the mutations detected by running SComatic (output of Detection of somatic mutations: Step 4.2) on scRNA-seq data from e.g., matched non-neoplastic cells.
  • Custom PoN in VCF format (unzipped) generated by running a variant caller, such as GATK-HaplotypeCaller, on DNA sequencing (WES or WGS) data, such as a matched normal sample or a set of unmatched germline samples.
  • Using a PoN in VCF format (unzipped) generated by other tools like GATK

4. Can we use the calls from other callers to genotype unique cells using SComatic?

Yes, SComatic allows us to use calls obtained by other callers (in vcf format) to check the presence of each mutation at single-cell resolution. However, a minor edit is required to transform the vcf format into the SComatic format. Here is the required command line to be run in the vcf:

# Prepare SComatic input from an external vcf for the script SingleCellGenotype.py
VCF=/path/to/sample.vcf
grep -v '^#' $VCF | awk -F'\t' -v OFS='\t' '{print $1,$2,$2,$4,$5,".","Cell_type",".",".",".",".",".",".","."}' > Mutations.other_caller.tsv

SCOMATIC=/path/to/SComatic
cell_type_bam=/path/to/cell_type.bam
META=/path/to/cell_type_annotations.tsv
REF=/path/to/reference_genome.fasta
python $SCOMATIC/scripts/SingleCellGenotype/SingleCellGenotype.py --bam $cell_type_bam  \
    --infile Mutations.other_caller.tsv \
    --nprocs 16  \
    --meta $META   \
    --outfile Mutations.other_caller.single_cell_genotype.tsv  \
    --tmp_dir temp  \
    --ref $REF

5. How do different cell type labels (e.g. different levels of granularity) affect the SComatic performance?

SComatic can be run using different levels of granularity in terms of cell type annotations.

As illustrated in the schematic below, in the case of non-neoplastic samples the relatedness between cell types from a development perspective determines which types of mutations can be detected. Using very granular cell type annotations that consider e.g. two cell types from the same differentiation hierarchy as different cell types, will remove any mutation present in the progenitor common to both cell types (assuming that sufficient sampling of all cell types in the data set analysed is achieved). As a result, only mutations acquired after clonal diversification will be detected (marked in green in the schematic). In contrast, mutations acquired in progenitor/stem cells and during early development will be discounted based on the fact that they will be present in all descendant cells (marked in red in the schematic). It follows from the preceding that mutations acquired very early during development will likely be considered germline based on their presence in multiple cell types when calling mutations using diverse tissue types from the same individual. By contrast, using broader cell type annotations (e.g., epithelial, immune, etc) permits the detection of mutations accumulated over longer periods of time in e.g., adult stem cells up to the point of lineage diversification. Overall, determining the granularity of the cell type annotations depends on the biological question of interest. We note that SComatic can be run using cell type annotations of variable granularity easily, which should enhance the applicability of our algorithm to diverse research areas.

For the analysis of cancer samples, we note that the effect of cell type granularity is less relevant as compared to the analysis of non-neoplastic samples, in that somatic mutations are only expected to be detected in the cancer cells. Thus, unless cell state annotations are used for the cancer cells, mutations present in cancer cells would be readily detected when running SComatic if one cell type annotation is used for all cancer cells and non-neoplastic cell types are included to remove germline polymorphisms.

6. What does it happen if CellRanger does not properly trim all non-genomic sequences (adapters) from the reads?

This might be an issue and could increase the number of false positive calls in the final call set. If this is the case, you can use the parameters --n_trim NUMER_OF_BASES_TO_IGNORE in Step1 (SplitBamCellTypes.py) to ignore the first and last bases of each read. In addition, when running the CellRanger count tool, provide the –-chemistry parameter as precisely as possible. It will help to remove these non-desired regions.

7. How to interpret the SingleCellGenotype.py output?

This script provides base count and allele information for all cells with reads covering each variant site. By default, all cells (mutated and non-mutated) will be present in the output file (see --alt_flag parameter to change it). However, it is important to understand the meaning of each one of the columns of this output file to properly interpret these results. Here a short description:

Column Description
#CHROM Chromosome carrying the mutation
Start Start genomic coordinate
End End genomic coordinate
REF Reference allele
ALT_expected Alternative allele as described in the input file (--infile)
Cell_type_expected Cell types harbouring the mutation as described in the input file (--infile)
Num_cells_expected Number of expected cells carrying the mutation as described in the input file (--infile)
CB Unique cell barcode analysed
Cell_type_observed Cell type attributed to the analysed CB according to the input metadata file (--meta)
Base_observed Allele observed in this CB
Num_reads Number of reads carrying the Base_observed

Let's understand this table with an example. Looking at our SComatic example data, we will focus on the variant site chr10-29559501 and the SComatic/example_data/results/SingleCellAlleles/Epithelial_cells.single_cell_genotype.tsv file generated.

#CHROM	Start	End	REF	ALT_expected	Cell_type_expected	Num_cells_expected	CB	Cell_type_observed	Base_observed	Num_reads
chr10	29559501	29559501	A	T	Epithelial_cells	2	AGTCTTTGTGCATCTA	Epithelial_cells	A	5
chr10	29559501	29559501	A	T	Epithelial_cells	2	CCCTCCTAGGCTAGGT	Epithelial_cells	A	1
chr10	29559501	29559501	A	T	Epithelial_cells	2	GGGTCTGTCTTGAGGT	Epithelial_cells	T	2
chr10	29559501	29559501	A	T	Epithelial_cells	2	GTCCTCAAGGCTCATT	Epithelial_cells	T	2
chr10	29559501	29559501	A	T	Epithelial_cells	2	GAGTCCGAGGGTGTTG	Epithelial_cells	A	2

The columns ALT_expected, Cell_type_expected and Num_cells_expected correspond to the values observed in the --infile Example.calling.step2.pass.tsv, so they represent the calls at cell type resolution.

In contrast, the columns CB, Cell_type_observed, Base_observed and Num_reads correspond to the allele observations at unique cell resolution when interrogating the bam files.

Each CB can be presented in the output file in as many rows as different alleles are found per cell, although in most cases, we only observed one allele per cell (so one row per unique CB). In order to find the alleles harbouring the called mutation, we have to look for those rows (unique CBs) where ALT_expeced == Base_observed and Cell_type_expected == Cell_type_observed. In general terms, CBs not accomplishing these conditions can be understood as noise or non-mutated cells.

Although this script is designed to be run with the SComatic calls, it can also be run with an external vcf file. As in most of the vcf-based cases we do not know the Cell_type_expected, we cannot check the Cell_type_expected == Cell_type_observed condition. However, we still can check the ALT_expeced == Base_observed condition.