
Deduplicating Genes

Sam Minot edited this page Dec 14, 2021 · 5 revisions

Gene Deduplication is Necessary

While gig-map can be used to align any set of genes against any set of genomes, it is commonly used to align the complete set of genes found across a group of genomes against all of those genomes. When performing this alignment process, it is very important that no genes are duplicated in the input. The gig-map code is designed with the expectation that every gene in the query set is present only once, and any overlapping genes will be filtered out from the final alignment.
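As a quick sanity check before alignment, exact duplicates in a query set can be spotted with a few lines of Python. This is only a minimal sketch and is not part of gig-map: the helper names `parse_fasta` and `exact_duplicates`, and the example records, are made up for illustration. It only catches sequences that are identical (ignoring case), not near-duplicates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Use the first whitespace-delimited token as the identifier
            fields = line[1:].split()
            header, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def exact_duplicates(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group identifiers by sequence and keep only groups with more than one member."""
    by_sequence = defaultdict(list)
    for name, seq in records:
        by_sequence[seq.upper()].append(name)
    return {seq: names for seq, names in by_sequence.items() if len(names) > 1}


# Hypothetical example input: gene1 and gene3 carry the same sequence
fasta_text = """>gene1
ATGGCC
TAA
>gene2
ATGAAATAA
>gene3
atggcctaa
"""

dups = exact_duplicates(parse_fasta(fasta_text.splitlines()))
for names in dups.values():
    print("identical sequences:", ", ".join(names))
```

If this check reports any groups, the query set contains duplicates and should be deduplicated before alignment.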

To make it easier to prepare a set of genes for alignment, we have included a utility for deduplication: selecting a subset of genes which have unique sequences, where uniqueness is judged against a specified threshold of alignment identity and coverage.
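To illustrate what thresholds of identity and coverage mean in practice, here is a toy sketch of greedy deduplication. It is not gig-map's implementation: real tools compute identity and coverage from proper sequence alignments, whereas this example assumes a simple ungapped, position-by-position comparison. The function names, default thresholds, and example sequences are all hypothetical.

```python
from typing import Dict


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no gaps allowed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def coverage(a: str, b: str) -> float:
    """Length of the shorter sequence as a fraction of the longer one."""
    longer = max(len(a), len(b))
    return min(len(a), len(b)) / longer if longer else 0.0


def deduplicate(genes: Dict[str, str],
                min_identity: float = 0.9,
                min_coverage: float = 0.9) -> Dict[str, str]:
    """Greedily keep a gene only if it is not redundant with one already kept."""
    kept: Dict[str, str] = {}
    # Visit longer sequences first so that they become the representatives
    for name, seq in sorted(genes.items(), key=lambda kv: len(kv[1]), reverse=True):
        redundant = any(
            ungapped_identity(seq, rep) >= min_identity
            and coverage(seq, rep) >= min_coverage
            for rep in kept.values()
        )
        if not redundant:
            kept[name] = seq
    return kept


# Hypothetical protein sequences for illustration
genes = {
    "geneA": "MKTAYIAKQRQISFVKSHFS",
    "geneA_copy": "MKTAYIAKQRQISFVKSHFS",     # identical to geneA
    "geneA_variant": "MKTAYIAKQRQISFVKSHFT",  # one substitution (95% identity)
    "geneB": "MSDNELLKQWAGRPVTLHNC",          # unrelated sequence
}

print(list(deduplicate(genes)))
print(list(deduplicate(genes, min_identity=0.99)))
```

Raising the identity threshold keeps more genes: at 0.99 the single-substitution variant is no longer treated as a duplicate of `geneA`.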

Downloading Genes from NCBI

The gig-map deduplication utility can be used to filter down any set of genes in FASTA format. However, it also provides an easy way for the user to deduplicate the genes present in a collection of genomes in the NCBI Genome Database. To identify a collection of genomes from this database, follow the instructions provided for the genome download utility.

Deduplicating Genes

To start, create or identify a folder which will be used to run this analysis. Next, download the two template files, deduplicate.params.json and deduplicate.sh, to help you set up the deduplication process.

The deduplicate.params.json file allows you to specify which set of genes will be used (by default, those listed in the CSV file downloaded from NCBI) and where the deduplicated genes will be placed. The deduplicate.sh file is a script which launches the appropriate utility within gig-map using the parameters specified in deduplicate.params.json.

To list the complete set of options available for the deduplication utility, run the following command:

bash deduplicate.sh --help

By default, these files are set up to download the genes listed in a file named prokaryotes.csv and save them to a folder named downloaded_genes/ within the working directory. Please modify any of the values in the deduplicate.params.json file as appropriate for your use case.

Once you are satisfied that the deduplicate.params.json file points to the right set of inputs and outputs, start the process by running:

bash deduplicate.sh

Deduplicating Custom Genes

The example files provided above are set up to download the genomes listed with the genome_tables parameter (which must point to a CSV file downloaded from NCBI). If you would instead like to deduplicate gene sequences which are present in a set of local files (in FASTA format), use the genes_fasta parameter in the deduplicate.params.json file in place of genome_tables.
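The switch from genome_tables to genes_fasta could be scripted along the following lines. This is a sketch under several assumptions that are not confirmed by this page: that deduplicate.params.json is a flat JSON object, that genes_fasta accepts a file path as a string, and that genome_tables should be removed when genes_fasta is set. The file name `my_genes.fasta` and both helper functions are hypothetical. Check the output of `bash deduplicate.sh --help` for the exact format each parameter expects.

```python
import json
from pathlib import Path


def use_local_fasta(params: dict, fasta_path: str) -> dict:
    """Return a copy of the parameters that points at a local FASTA file."""
    updated = dict(params)
    # Assumption: genome_tables and genes_fasta are alternatives, so drop the former
    updated.pop("genome_tables", None)
    updated["genes_fasta"] = fasta_path
    return updated


def rewrite_params_file(params_file: str, fasta_path: str) -> None:
    """Apply use_local_fasta() to a params file on disk, in place."""
    path = Path(params_file)
    params = json.loads(path.read_text())
    path.write_text(json.dumps(use_local_fasta(params, fasta_path), indent=4) + "\n")


# In-memory demonstration; "my_genes.fasta" is a hypothetical file name
example = {"genome_tables": "prokaryotes.csv"}
print(json.dumps(use_local_fasta(example, "my_genes.fasta"), indent=4))
```

To apply the change to the real file, one would call `rewrite_params_file("deduplicate.params.json", "my_genes.fasta")` from the analysis folder; all other parameters in the file are left untouched.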