Deduplicating Genes
While gig-map can be used to align any set of genes against any set of genomes, it is commonly used to align the complete set of genes found across a group of genomes against all of those genomes. When performing this alignment process, it is very important that no genes are duplicated in the input. The gig-map code is designed with the expectation that every gene in the query set is present only once, and any overlapping genes will be filtered out from the final alignment.
To make it easier to prepare a set of genes for alignment, we have included a utility for deduplication, which selects a subset of genes that have unique sequences (based on a specified threshold of alignment identity and coverage).
To deduplicate a collection of genes, the input genes must be present within one or more gzip-compressed FASTA files in a single folder. If you want to combine files which are located in different folders, simply create symlinks for those files into a single folder.
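The symlinking step described above could be scripted as in the following minimal sketch. It is not part of gig-map; the function name `link_into_folder` and the `*.gz` file pattern are assumptions made for illustration.

```python
import os
from pathlib import Path


def link_into_folder(source_dirs, dest_dir, pattern="*.gz"):
    """Symlink every file matching `pattern` from each source folder into dest_dir.

    Hypothetical helper, not a gig-map function. Returns the list of links created.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for src_dir in source_dirs:
        for src in sorted(Path(src_dir).glob(pattern)):
            target = dest / src.name
            if target.exists() or target.is_symlink():
                # Two source folders contain a file with the same name:
                # stop rather than silently overwrite one of them.
                raise FileExistsError(f"{target} already exists")
            # Link to the absolute path so the link stays valid from any working directory
            os.symlink(src.resolve(), target)
            linked.append(target)
    return linked


# Example call (folder names are placeholders):
# link_into_folder(["project_a/genes", "project_b/genes"], "all_genes")
```

Raising on a name collision is a deliberate choice here: two files with the same name in different folders would otherwise shadow each other in the combined folder.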
- `genes`: Folder containing all genes to be analyzed. All files must be in amino acid FASTA format (gzip-compressed)
- `cluster_similarity`: Amino acid similarity used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
- `cluster_coverage`: Alignment coverage used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
- `min_gene_length`: Minimum amino acid length threshold used to filter genes [default: 50]
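To illustrate what a minimum length threshold does to an input file, here is a rough sketch that reads a gzip-compressed amino acid FASTA file and keeps only sequences of at least `min_gene_length` residues. This is not the gig-map implementation; the helper names are hypothetical, and whether the utility treats the threshold as inclusive is an assumption.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_gene_length=50):
    """Keep records whose sequence has at least min_gene_length residues.

    Assumption: the threshold is inclusive (>=).
    """
    return [(h, s) for h, s in records if len(s) >= min_gene_length]
```

Running this over the input folder before deduplication gives a quick estimate of how many genes the length filter would remove.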
- `centroids.faa.gz`: Sequences of the deduplicated genes (FASTA)
- `centroids.annot.csv.gz`: Table with any annotations available for the deduplicated genes (columns: `gene_id` and `combined_name`)
- `centroids.membership.csv.gz`: Description of which genes were grouped together by the `mmseqs2` algorithm during deduplication
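As a small sketch of how the annotation table might be inspected afterwards, the snippet below loads `centroids.annot.csv.gz` into a dictionary using only the two columns listed above (`gene_id` and `combined_name`). The function name is hypothetical. The layout of `centroids.membership.csv.gz` is not described here, so it is left out.

```python
import csv
import gzip


def load_annotations(path):
    """Return a dict mapping gene_id to combined_name from a gzip-compressed CSV.

    Assumes the file has a header row containing the columns gene_id and combined_name.
    """
    with gzip.open(path, "rt", newline="") as handle:
        return {row["gene_id"]: row["combined_name"] for row in csv.DictReader(handle)}


# Example call (path is a placeholder):
# annotations = load_annotations("output/centroids.annot.csv.gz")
```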
Other useful references may be: