Skip to content

2. Read Depth Signal

Arijit Panda edited this page Jun 6, 2022 · 1 revision

Import read depth signal

Make sure that you have indexed alignment (SAM, BAM or CRAM) file.

Initialize your CNVpytor project by running:

> cnvpytor -root file.pytor -rd file.bam [-chrom name1 ...] [-T ref.fa.gz]

where:

  • file.pytor -- specifies output cnvpytor file (HDF5 file)
  • name1 ... -- specifies chromosome name(s).
  • file.bam -- specifies bam/sam/cram file name.
  • -T ref.fa.gz -- specifies reference genome file (only for cram file without reference genome).

Chromosome names must be specified the same way as they are described in the sam/bam/cram header, e.g., chrX or X. One can specify multiple chromosomes separated by space. If no chromosome is specified, read mapping is extracted for all chromosomes in the sam/bam file. Note that the pytor file is not being overwritten.

Examples:

> cnvpytor -root NA12878.pytor -chrom 1 2 3 -rd NA12878_ali.bam

for bam files with a header like this:

@HD VN:1.4 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430

or

> cnvpytor -root NA12878.pytor -chrom chr1 chr2 chr3 -rd NA12878_ali.bam

for bam files with a header like this:

@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430

After -rd step file file.pytor is created and read depth data binned to 100 base pair bins will be stored in pytor file.

Chromosome names and lengths are parsed from the input file header and used to detect reference genome.

Reference genome is important for GC correction and 1000 genome strict mask filtering. CNVpytor installation includes human genomes hg19 and hg38 with data files that provide information about GC content and strict mask. For other species or reference genomes you have to configure reference genome (see example).

To check is reference genome detected use:

> cnvpytor -root file.pytor -ls

CNVpytor will print out details about file.pytor including line that specify which reference genome is used and are there available GC and mask data:

Using reference genome: hg19 [ GC: yes, mask: yes ]

Command -ls is useful if you want to check content of pytor file but also date and version of cnvpytor that created it.

Predicting CNV regions

First we have to chose bin size. By cnvpytor design it have to be divisible by 100. Here we will use 10 kbp and 100 kbp bins.

To calculate read depth histograms, GC correction and statistics type:

> cnvpytor -root file.pytor -his 10000 100000

Next step is partitioning using mean-shift method:

> cnvpytor -root file.pytor -partition 10000 100000

Finally we can call CNV regions using commands:

> cnvpytor -root file.pytor -call 10000 > calls.10000.tsv
> cnvpytor -root file.pytor -call 100000 > calls.100000.tsv

Result is stored in tab separated files with following columns:

  • CNV type: "deletion" or "duplication",
  • CNV region (chr:start-end),
  • CNV size,
  • CNV level - read depth normalized to 1,
  • e-val1 -- e-value (p-value multiplied by genome size divided by bin size) calculated using t-test statistics between RD statistics in the region and global,
  • e-val2 -- e-value (p-value multiplied by genome size divided by bin size) from the probability of RD values within the region to be in the tails of a gaussian distribution of binned RD,
  • e-val3 -- same as e-val1 but for the middle of CNV,
  • e-val4 -- same as e-val2 but for the middle of CNV,
  • q0 -- fraction of reads mapped with q0 quality in call region,
  • pN -- fraction of reference genome gaps (Ns) in call region,
  • dG -- distance from closest large (>100bp) gap in reference genome.

Using viewer mode we can filter calls based on five parameters: CNV size, e-val1, q0, pN and dG:

> cnvpytor -root file.pytor [file2.pytor ...] -view 10000

cnvpytor> set Q0_range -1 0.5        # filter calls with more than half not uniquely mapped reads
cnvpytor> set p_range 0 0.0001       # filter non-confident calls 
cnvpytor> set p_N 0 0.5              # filter calls with more than 50% Ns in reference genome 
cnvpytor> set size_range 50000 inf   # filter calls smaller than 50kbp 
cnvpytor> set dG_range 100000 inf    # filter calls close to gaps in reference genome (<100kbp)
cnvpytor> print calls                # printing calls on screen (tsv format)
...
...
cnvpytor> set print_filename file.xlsx   # output filename (xlsx, tsv or vcf)
cnvpytor> set annotate               # turn on annotation (optional - takes a lot of time)
cnvpytor> print calls                # generate output file with filtered calls 
cnvpytor> quit

Upper bound for parameters size_range and dG_range can be inf (infinity).

If there are multiple samples (pytor files) there will be an additional column with sample name in tsv format, multiple sheets in Excel format, and multiple sample columns in vcf format.

Clone this wiki locally