Deduplicating Genes
While gig-map can be used to align any set of genes against any set of genomes, it is commonly used to align the complete set of genes found across a group of genomes against all of those genomes. When performing this alignment, it is very important that no genes are duplicated in the input. The gig-map code is designed with the expectation that every gene in the query set is present only once, and any overlapping genes will be filtered out from the final alignment.
To make it easier to prepare a set of genes for alignment, we have included a utility for deduplication, which selects a subset of genes with unique sequences (based on a specified threshold of alignment identity and coverage).
To deduplicate a collection of genes, the input genes must be present within one or more gzip-compressed FASTA files in a single folder. If you want to combine files which are located in different folders, simply create symlinks for those files into a single folder.
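If the input files are spread across several folders, the symlinking step can be scripted. The following is a minimal sketch using only the Python standard library; the function name `link_fastas_into_folder` and the example paths are hypothetical and are not part of gig-map.

```python
from pathlib import Path


def link_fastas_into_folder(source_files, dest_dir):
    """Symlink each source file into dest_dir and return the new link paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    links = []
    for src in map(Path, source_files):
        link = dest / src.name
        # Files from different folders may share a name; refuse to overwrite.
        if link.exists() or link.is_symlink():
            raise FileExistsError(f"{link} already exists; rename the source file")
        # Use an absolute target so the link resolves from any working directory.
        link.symlink_to(src.resolve())
        links.append(link)
    return links


# Hypothetical usage:
# link_fastas_into_folder(
#     ["run1/genes_a.fasta.gz", "run2/genes_b.fasta.gz"], "all_genes"
# )
```

The sketch raises an error on a name clash rather than silently replacing a link, since two different gene files with the same filename would otherwise shadow one another in the combined folder.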
The deduplicate_genes tool accepts the following options:
- genes: Folder containing all genes to be analyzed. All files must be in amino acid FASTA format (gzip-compressed)
- cluster_similarity: Amino acid similarity used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
- cluster_coverage: Alignment coverage used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
- min_gene_length: Minimum amino acid length threshold used to filter genes [default: 50]
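
To get a rough sense of what the length filter and deduplication do to an input folder before running the tool, a simplified preview can be sketched in Python. This is not the gig-map implementation: it collapses only exactly identical sequences (comparable to a similarity and coverage of 1.0) and does not perform the similarity-based clustering described above. The function names are hypothetical, and the sketch assumes that genes shorter than min_gene_length are the ones dropped and that input files end in `.gz`.

```python
import gzip
from pathlib import Path


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def preview_exact_dedup(genes_dir, min_gene_length=50):
    """Map each unique sequence passing the length filter to its first header."""
    unique = {}
    for path in sorted(Path(genes_dir).glob("*.gz")):
        with gzip.open(path, "rt") as handle:
            for header, seq in read_fasta(handle):
                # Normalize case and drop a trailing stop-codon marker.
                seq = seq.rstrip("*").upper()
                # Assumption: sequences shorter than the threshold are removed.
                if len(seq) < min_gene_length:
                    continue
                # Keep only the first header seen for each distinct sequence.
                unique.setdefault(seq, header)
    return unique
```

Because only exact matches are collapsed here, the number of sequences returned is an upper bound on what clustering at a similarity below 1.0 would retain from the same length-filtered input.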
Other useful references may be: