Pairwise genome alignment pipeline
The following R script makes use of the CNEr package to pipeline the pairwise alignment of two genomes. The pipeline performs pairwise alignment of all specified chromosomes or scaffolds of target and query genomes using lastz. The remainder of the pipeline uses some binaries from UCSC kent tools for filtering and chaining alignments. The number of pairwise combinations grows quadratically, and the runtime becomes unruly with too many scaffolds or chromosomes. When working with draft genomes, often the most reasonable approach is to limit the input to the top 30-40 longest scaffolds. The same goes for reference genomes, for example limit the chrom.sizes file to contain only cannonical chromosomes, no chrUn*
.
Two primate genomes and 50 threads can take 3-5 hours, so make use of as many CPUs as your can find.
Install all dependencies in a fresh conda environment using the following command, assuming you have a working installation of conda. To install miniconda on an linux system follow these instructions.
conda env create -n lastz-test -f environment.yaml
Rscript lastz-pipe.R -h
Options:
-t CHARACTER, --target=CHARACTER
Target genome 2bit file (absolute path).
-q CHARACTER, --query=CHARACTER
Query genome 2bit file (absolute path).
-T CHARACTER, --targetSizes=CHARACTER
Target chrom.sizes file specifies chroms to include. Uses known chroms from twoBit file if not given.
-Q CHARACTER, --querySizes=CHARACTER
Query chrom.sizes file specifies chroms to include. Uses known chros from twoBit file if not given.
-d CHARACTER, --outdir=CHARACTER
Output directory (absolute path).
-p NUMERIC, --threads=NUMERIC
Number of threads to use for lastz
-h, --help
Show this help message and exit
The output file of primary importance has the form {target}.{query}.net.axt. This file can be ingested by our synteny breakpoint detection tool axtToSyn.
To create a useful visualization of this pairwise genome alignment, you can convert it to maf
format using the kent utility axtToMaf
as follows:
axtToMaf in.axt tSizes qSizes out.maf
Then create pariwise alignment dotplot using the tool last-dotplot.
last-dotplot [options] maf-or-tab-alignments dotplot.png
If the dotplot is too busy with smaller noisy alignments, you can filter the lower scoreing alignments out of the maf file with mafFiler
. I have found 10000 can be a noise-reducing threshold, but can vary based on genome assembly quality and completeness.
mafFiler -minScore=10000 in.maf out.filt.maf
To align the human genome vs the rhesus macaque genome the necessary files can be downloaded from UCSC:
Human 2bit genome: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
Human chromosome sizes: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
Rhesus 2bit genome: https://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/bigZips/rheMac10.2bit
Rhesus chromosome sizes: https://hgdownload.soe.ucsc.edu/goldenPath/rheMac10/bigZips/rheMac10.chrom.sizes
The command to run the pairwise alignement pipeline between human and rhesus would then be constructed as follows using 40 logical CPU cores, taking somewhere between 3 and 4 hours to run:
conda activate lastz-test
Rscript lastz-pipe.R -t hg38.2bit -T hg38.chrom.sizes -q rheMac10.2bit -Q rheMac10.chrom.sizes -p 40 -d hg38.rheMac10