Deduplicating Genes

Sam Minot edited this page May 25, 2022 · 5 revisions

Gene Deduplication is Necessary

While gig-map can be used to align any set of genes against any set of genomes, it is commonly used to align the complete set of genes found across a group of genomes against all of those genomes. When performing this alignment process, it is very important that no genes are duplicated in the input. The gig-map code is designed with the expectation that every gene in the query set is present only once, and any overlapping genes will be filtered out from the final alignment.

To make it easier to prepare a set of genes for alignment, we have included a utility for deduplication: selecting a subset of genes whose sequences are unique at a specified threshold of alignment identity and coverage.

Deduplicating Genes

To deduplicate a collection of genes, the input genes must be present within one or more gzip-compressed FASTA files in a single folder. If you want to combine files which are located in different folders, simply create symlinks for those files into a single folder.
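As a minimal sketch of the symlink approach described above, the following Python helper links every gzip-compressed file from several source folders into one destination folder. The function name `link_fastas_into_folder` and the `.gz` suffix filter are assumptions made for illustration; they are not part of gig-map.

```python
import os
from pathlib import Path


def link_fastas_into_folder(source_dirs, dest_dir, suffix=".gz"):
    """Symlink every file ending in `suffix` from source_dirs into dest_dir.

    Hypothetical helper, not part of gig-map.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    linked = []
    for src_dir in source_dirs:
        for src in sorted(Path(src_dir).iterdir()):
            # Skip subfolders and anything that is not gzip-compressed
            if not src.is_file() or not src.name.endswith(suffix):
                continue
            target = dest / src.name
            # Refuse to overwrite: two inputs with the same file name would collide
            if target.exists() or target.is_symlink():
                raise FileExistsError(f"{target} already exists; rename one of the inputs")
            # Link to the absolute path so the symlink resolves from any working directory
            os.symlink(src.resolve(), target)
            linked.append(target)
    return linked
```

The resulting folder can then be passed as the `genes` input. Files that share a name across source folders raise an error rather than being silently replaced.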

Inputs and Outputs

Inputs

  • genes: Folder containing all genes to be analyzed. All files must be in amino acid FASTA format (gzip-compressed)
  • cluster_similarity: Amino acid similarity used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
  • cluster_coverage: Alignment coverage used for clustering (ranges from 0.0 to 1.0) [default: 0.9]
  • min_gene_length: Minimum amino acid length threshold used to filter genes [default: 50]
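To illustrate what a length threshold such as `min_gene_length` means for the input files, here is a short sketch that reads a gzip-compressed amino acid FASTA file and keeps only sequences at or above the threshold. This is not the gig-map implementation: the helper names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record in the file
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_gene_length=50):
    """Keep records whose sequence has at least min_gene_length residues.

    Assumption: the threshold is inclusive.
    """
    for header, seq in records:
        if len(seq) >= min_gene_length:
            yield header, seq
```

A quick check such as `sum(1 for _ in filter_by_length(read_fasta_gz(path)))` gives a rough idea of how many genes in a file would pass the default threshold.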

Outputs

  • centroids.faa.gz: Sequences of the deduplicated genes (FASTA)
  • centroids.annot.csv.gz: Table with any annotations available for the deduplicated genes (columns: gene_id and combined_name)
  • centroids.membership.csv.gz: Table describing which genes were grouped together by MMseqs2 during deduplication
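The annotation table can be inspected with the Python standard library alone. The sketch below assumes only what is stated above, namely that `centroids.annot.csv.gz` is a gzip-compressed CSV with `gene_id` and `combined_name` columns; the function name `load_annotations` is hypothetical.

```python
import csv
import gzip


def load_annotations(path):
    """Return a dict mapping gene_id to combined_name from the annotation table."""
    # newline="" lets the csv module handle line endings inside quoted fields
    with gzip.open(path, "rt", newline="") as handle:
        return {row["gene_id"]: row["combined_name"] for row in csv.DictReader(handle)}
```

For example, `load_annotations("centroids.annot.csv.gz")` returns a dictionary that can be used to look up the name of any deduplicated gene by its identifier.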

[Figure: deduplicate_genes]

Useful References

Other useful references may be: