"Alleleome" is a specialized package designed to explore and analyze natural sequence variations within the Open Reading Frames (ORFs) of alleles of core genes in a species' pan-genome, both at the amino acid and nucleotide levels. It identifies variants such as substitutions, insertions, and deletions through a series of steps:
- Initial QC/QA of sequences.
- Building consensus for each gene's allele set.
- Pairwise alignment of consensus sequences with individual alleles.
- Identification and generation of amino acid variant datasets.
- Analysis of synonymous and non-synonymous substitutions from codons and corresponding amino acid data.
The Alleleome workflow is specifically tailored to study the natural sequence variations in core genes of the pan-genome of a species, with an emphasis on variations at the amino acid and nucleotide level.
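The consensus-building and variant-identification steps listed above can be illustrated with a minimal sketch. It is not the package's actual implementation: the function names `consensus` and `call_variants` are hypothetical, the consensus is a simple per-column majority vote over already-aligned alleles, and variants are read off a gapped pairwise alignment in which `-` marks a gap.

```python
from collections import Counter


def consensus(aligned):
    """Return the most common residue per column of equal-length aligned sequences.

    Hypothetical helper: a plain majority vote, ties broken by first occurrence.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def call_variants(cons_aln, allele_aln):
    """List (position, kind, consensus_residue, allele_residue) differences.

    Both inputs are rows of one pairwise alignment ('-' is a gap). Positions
    are 1-based in the ungapped consensus; an insertion is reported at the
    last consensus position that precedes it.
    """
    variants = []
    pos = 0
    for c, a in zip(cons_aln, allele_aln):
        if c != "-":
            pos += 1
        if c == a:
            continue
        if c == "-":
            kind = "insertion"
        elif a == "-":
            kind = "deletion"
        else:
            kind = "substitution"
        variants.append((pos, kind, c, a))
    return variants
```

For example, `call_variants("MKTAY", "MKSAY")` reports a single substitution of T by S at position 3.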
This package is based on the Alleleome package created by Archana S. Harke (anpache/Core-Alleleome). For more detailed information, refer to their publication: Early Release on BioRxiv.
- Alleleome is tested and confirmed for Linux systems with the Conda package manager.
- Requires Python version 3.10 or higher.
- For optimal performance, especially when processing a large dataset, such as 1400 core genes and their respective alleles across 3400 strains, a high RAM capacity is strongly recommended.
- Clone the repository (ensure Git LFS is set up as described above).
- Navigate to the Alleleome directory:
  cd Alleleome
- Activate the virtual environment as instructed above.
- Install the package:
  pip install .
The package consists of 5 subprograms that should be run sequentially:
- prepare: Collect information on the genomes and genes/loci, and perform QC/QA.
- fasta: Use the collected information to create amino acid and nucleotide FASTA files for alignment. At this step, specify whether the core alleleome (default) or the pan-alleleome (--pan) should be analyzed.
- process: Run alignments using BLAST and MAFFT.
- analyze: Analyze and process the sequence alignments.
- preplot: Generate files required for plotting the results on PanKB.
The computationally intensive process and analyze subprograms can (and typically should) be run in parallel by specifying the number of available cores using -p <num>.
You can find the full usage and parameters of alleleome by using the --help option:
$ alleleome -h
usage: alleleome [-h] {prepare,fasta,process,analyze,preplot} ...
Alleleome - Explore and analyze natural sequence variations within the Open Reading Frames (ORFs) of alleles of genes in a species pan-genome.
positional arguments:
{prepare,fasta,process,analyze,preplot}
options:
-h, --help show this help message and exit
The different modes should be run in the order: prepare > fasta > process > analyze > preplot. Info on the inputs and outputs for every mode can be found by invoking that mode with the --help argument. E.g.:
$ alleleome prepare -h
usage: alleleome prepare [-h] --gp_binary GP_BINARY --gp_locustag GP_LOCUSTAG --summary SUMMARY --summary_v2 SUMMARY_V2 --gbk_folder GBK_FOLDER --all_locustag ALL_LOCUSTAG --all_genes ALL_GENES --sel_locustag SEL_LOCUSTAG --sel_genes SEL_GENES
options:
-h, --help show this help message and exit
--gp_binary GP_BINARY
Path to gene_presence_binary csv file.
--gp_locustag GP_LOCUSTAG
Path to gene_presence_locustag csv file.
--summary SUMMARY Path to df_pangene_summary.csv file created by roary.
--summary_v2 SUMMARY_V2
Path to the updated summary.
--gbk_folder GBK_FOLDER
Folder containing GenBank files.
--all_locustag ALL_LOCUSTAG
Path to all_locustags csv file.
--all_genes ALL_GENES
Path to all_genes csv file.
--sel_locustag SEL_LOCUSTAG
Path to sel_locustags csv file.
--sel_genes SEL_GENES
Path to sel_genes csv file.
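When the workflow is scripted, the prepare command line can be assembled programmatically. The sketch below uses only the flags shown in the help output above; the helper name `build_prepare_command` and every file path in the usage comment are hypothetical placeholders, and the command is only constructed here, not executed.

```python
# Flags required by `alleleome prepare`, in the order printed by its help text.
PREPARE_FLAGS = [
    "gp_binary", "gp_locustag", "summary", "summary_v2", "gbk_folder",
    "all_locustag", "all_genes", "sel_locustag", "sel_genes",
]


def build_prepare_command(paths):
    """Build the argument list for `alleleome prepare` from a flag-to-path mapping.

    Hypothetical helper: raises ValueError if any required flag is missing.
    """
    missing = [flag for flag in PREPARE_FLAGS if flag not in paths]
    if missing:
        raise ValueError(f"missing required arguments: {', '.join(missing)}")
    cmd = ["alleleome", "prepare"]
    for flag in PREPARE_FLAGS:
        cmd += [f"--{flag}", str(paths[flag])]
    return cmd


# Usage (placeholder paths, to be replaced with real files):
#   import subprocess
#   paths = {flag: f"data/{flag}.csv" for flag in PREPARE_FLAGS}
#   paths["gbk_folder"] = "data/gbk"
#   subprocess.run(build_prepare_command(paths), check=True)
```

Passing the arguments as a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.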
Alleleome introduces the concept of "ORF alleleome," encapsulating the gene alleles found across all strains of a species, thus providing a comprehensive view of genome-scale sequence variations. This analysis can be instrumental in understanding sequence diversity characteristics and natural selection processes across different species within a family. The study of the alleleome offers insights into the genetic basis of natural selection in a species.
Key features include:
- Analysis of sequence variants using the consensus sequence of ORFs.
- Identification of dominant amino acids and their variants at specific positions.
- Revealing natural sequence and structural variations compared with the consensus sequence and structural attributes.
- Identification of genome-scale synonymous and non-synonymous mutations through the analysis of codon changes and their corresponding amino acid changes.
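The last feature, classifying codon changes as synonymous or non-synonymous, can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: it assumes the consensus and allele coding sequences are ungapped, of equal length, in frame, and free of ambiguous bases, and it uses the standard genetic code. The function name `classify_codon_changes` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' is a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(consensus_cds, allele_cds):
    """Compare two in-frame coding sequences codon by codon.

    Returns (codon_number, ref_codon, alt_codon, ref_aa, alt_aa, kind) tuples,
    where kind is 'synonymous' or 'non-synonymous'. Codon numbers are 1-based.
    """
    if len(consensus_cds) != len(allele_cds) or len(consensus_cds) % 3:
        raise ValueError("sequences must have equal length, a multiple of 3")
    changes = []
    for i in range(0, len(consensus_cds), 3):
        ref = consensus_cds[i:i + 3].upper()
        alt = allele_cds[i:i + 3].upper()
        if ref == alt:
            continue
        ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
        kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
        changes.append((i // 3 + 1, ref, alt, ref_aa, alt_aa, kind))
    return changes
```

For instance, comparing `ATGAAAGGT` with `ATGAAGGAT` yields a synonymous change at codon 2 (AAA to AAG, both lysine) and a non-synonymous change at codon 3 (GGT to GAT, glycine to aspartate).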
- Python and Biopython.
- Integrated with the "BGCflow" workflow using Snakemake.
This project is licensed under the MIT License - see the LICENSE file for details.