Deduplicating Genes
While gig-map can be used to align any set of genes against any set of genomes,
it is commonly used to align the complete set of genes found across a group of
genomes against all of those genomes. When performing this alignment, it is
very important that no genes are duplicated in the input. The gig-map code is
designed with the expectation that every gene in the query set is present only
once, and any overlapping genes will be filtered out from the final alignment.
To make it easier to prepare a set of genes for alignment, we have included a utility for deduplication: selecting a subset of genes with unique sequences (based on a specified threshold of alignment identity and coverage).
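As a rough illustration of why deduplication matters, the sketch below counts records in a FASTA file whose sequence is an exact copy of an earlier record. It is not the gig-map utility and it does not apply an identity or coverage threshold; it only detects byte-for-byte duplicates (ignoring case and line wrapping). The function name `count_exact_duplicates` and the file name `genes.fasta` are hypothetical.

```shell
# Count FASTA records whose sequence exactly matches an earlier record.
# Multi-line sequences are joined and upper-cased before comparison.
# Assumption: this checks exact duplicates only, not near-identical genes.
count_exact_duplicates() {
    awk '
        # Called at each record boundary: tally the sequence just finished.
        function record_done() {
            if (seq != "") {
                if (seen[seq]++) dup++
            }
            seq = ""
        }
        /^>/ { record_done(); next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { record_done(); print dup + 0 }
    ' "$1"
}
```

For example, `count_exact_duplicates genes.fasta` prints the number of redundant records; any value above zero suggests the gene set should be deduplicated before alignment.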
The gig-map deduplication utility can be used to filter down any set of genes
in FASTA format. However, it also provides an easy way for the user to
deduplicate the genes present in a collection of genomes in the NCBI Genome
Database. To identify a collection of genomes from this database, follow the
instructions provided for the genome download utility.
To start, create or identify a folder which will be used to run this analysis. Next, download the two template files, deduplicate.params.json and deduplicate.sh, to help you set up the deduplication process.
The deduplicate.params.json file allows you to specify which set of genes (by
default, the CSV file downloaded from NCBI) will be used, and in what location
the deduplicated genes will be placed. The deduplicate.sh file is a script
which will launch the appropriate utility within gig-map using the parameters
specified in deduplicate.params.json.
To list the complete set of options available for the deduplication utility, run the following command:
bash deduplicate.sh --help
By default, these files are set up to download the genes listed in a file named
prokaryotes.csv and save them to a folder named downloaded_genes/ within the
working directory. Please modify any of the values in the
deduplicate.params.json file as appropriate for your use case.
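Before launching the run, it can help to confirm that the expected files are actually present in the working directory. The sketch below is only a convenience check, not part of gig-map: it assumes the default file names mentioned above (deduplicate.params.json, deduplicate.sh and prokaryotes.csv), and the function name `check_dedup_inputs` is hypothetical. If you have changed the input file name in deduplicate.params.json, adjust the list accordingly.

```shell
# Report any expected file that is missing from the given directory
# (defaults to the current directory). Returns non-zero if any is absent.
# Assumption: the default file names from the template setup are in use.
check_dedup_inputs() {
    dir="${1:-.}"
    status=0
    for f in deduplicate.params.json deduplicate.sh prokaryotes.csv; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running `check_dedup_inputs` with no arguments checks the current folder and prints one line to standard error for each file it cannot find.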
Once you are satisfied that the deduplicate.params.json file is pointing to the
right set of inputs and outputs, start the process by running:
bash deduplicate.sh
The example files provided above are set up to download the genomes listed with
the genome_tables parameter (which must point to a CSV file downloaded from
NCBI). If you would instead like to deduplicate gene sequences which are
present in a set of local files (in FASTA format), use the genes_fasta
parameter in the deduplicate.params.json file instead.
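When supplying local FASTA files, a quick inspection of the record names can catch problems early. The sketch below lists every record ID (the text after ">" up to the first whitespace) that appears more than once in a file. Whether gig-map requires unique record IDs is not stated here, so treat this as an optional sanity check under that assumption; the function name `list_repeated_ids` and the file name `local_genes.fasta` are hypothetical.

```shell
# Print each FASTA record ID that occurs more than once in the file.
# The ID is taken as the first whitespace-delimited token of the header,
# with the leading ">" removed. Each repeated ID is printed once.
list_repeated_ids() {
    awk '/^>/ { id = substr($1, 2); if (++n[id] == 2) print id }' "$1"
}
```

For example, `list_repeated_ids local_genes.fasta` prints nothing when every record name in the file is distinct.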