
obtaining marker genes

Sina Majidian edited this page Nov 8, 2023 · 11 revisions

To run read2tree two things are required as input:

  1. The DNA sequencing reads as FASTQ file(s).
  2. A set of reference orthologous groups, i.e. marker genes.

We provide two sets of marker genes, for Mammalia and Bacteria, here. These can be used with the arguments --standalone_path marker_genes/ --dna_reference {bacteria/mammalia}_cnda.fa . However, to infer the species tree accurately, we recommend using the OMA browser to download a set of marker genes tailored to the clade of interest.

This page describes how to obtain such a tailored set using the OMA browser.
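As a rough sketch of how the provided sets might be plugged into a run, the snippet below assembles a read2tree command and prints it. Only `--standalone_path` and `--dna_reference` come from this page; `--tree`, `--reads` and `--output_path` are assumptions that should be confirmed with `read2tree --help`, and the read file names and output directory are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical inputs: adjust these to your own files.
clade="mammalia"              # or "bacteria"
marker_dir="marker_genes/"    # directory named in the text above
dna_ref="${clade}_cnda.fa"    # file name pattern quoted in the text above
reads_1="sample_1.fastq"      # hypothetical read files
reads_2="sample_2.fastq"
out_dir="output"              # hypothetical output directory

# --standalone_path and --dna_reference are the arguments given on this page.
# --tree, --reads and --output_path are assumptions; check `read2tree --help`.
cmd="read2tree --tree --standalone_path $marker_dir --dna_reference $dna_ref --reads $reads_1 $reads_2 --output_path $out_dir"

# Always show the command so it can be inspected before anything runs.
echo "$cmd"

# Execute only when the tool and all input files are actually present.
if command -v read2tree >/dev/null 2>&1 \
    && [ -f "$dna_ref" ] && [ -f "$reads_1" ] && [ -f "$reads_2" ]; then
    $cmd
fi
```

Printing the command first makes it easy to check the paths before committing to a long run.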

Step 1. Open the Export marker genes page of the OMA browser using this link.

 

This page can also be reached from the Download tab on the main page of the OMA browser.

 

 

Step 2. In the field "search by species name", type the name of a species or a clade, e.g. primates. Then click on the matching item, shown in blue.

 

 

Step 3. Click on an internal node or a leaf of the tree of life and select (all) species. You can also expand or collapse the nodes.

 

 

Step 4. Now, on the right side, set the value of Minimum fraction of covered species to 0.8 and Maximum nr of markers to 500 (for a small clade, 100 is probably enough). If computation is not a constraint, you can instead set Maximum nr of markers to -1 to export all possible OGs.

 

Then click Submit.
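To get a feel for the Minimum fraction of covered species setting, the small sketch below computes how many of the selected species an orthologous group would need to cover for a given fraction. It assumes the threshold is inclusive and rounds up to whole species; the OMA browser may apply the cutoff slightly differently, and `min_species` is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# min_species N FRACTION
# Prints the smallest whole number of species that is at least FRACTION * N.
# Assumption: a group passes when it covers >= FRACTION of the selected species.
min_species() {
    awk -v n="$1" -v f="$2" 'BEGIN {
        x = n * f
        c = int(x)
        # Round up, with a small tolerance for floating-point noise.
        if (x - c > 1e-9) c++
        print c
    }'
}

# Example: 25 selected species with a minimum fraction of 0.8.
min_species 25 0.8
```

Under these assumptions, 25 selected species and a fraction of 0.8 mean a group must cover at least 20 species.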

 

Step 5. Finally, after a wait of a few minutes to a few hours (depending on how large your species set is), a compressed file containing the FASTA files of the marker genes is ready to download.

Then combine the .fna files into a single DNA reference file:

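# Unpack the archive downloaded from the OMA browser
tar xvzf marker_genes_*.tgz
# Count the nucleotide FASTA files (one per marker gene)
ls marker_genes/*.fna | wc -l
# Concatenate them into a single DNA reference
cat marker_genes/*.fna > dna_ref.fa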
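A quick sanity check after concatenation is to compare the number of FASTA records in the individual .fna files with the number in dna_ref.fa. The sketch below does this by counting header lines (those starting with ">"). The function names `count_records` and `check_reference` are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Count FASTA records across the given files; each record starts with ">".
# Files are counted one by one, so a file without a trailing newline
# (which would fuse two lines when concatenated) shows up as a mismatch.
count_records() {
    total=0
    for f in "$@"; do
        n=$(grep -c '^>' "$f")
        total=$((total + n))
    done
    echo "$total"
}

# Compare the per-marker files in a directory with the combined reference.
# Succeeds only if the counts are equal and non-zero.
check_reference() {
    marker_dir=$1
    combined=$2
    n_parts=$(count_records "$marker_dir"/*.fna)
    n_combined=$(count_records "$combined")
    echo "per-marker records: $n_parts, combined records: $n_combined"
    [ "$n_parts" -gt 0 ] && [ "$n_parts" -eq "$n_combined" ]
}

# Run the check only when the files from the commands above are present.
if [ -d marker_genes ] && [ -f dna_ref.fa ]; then
    check_reference marker_genes dna_ref.fa \
        || echo "Record counts differ or are zero: re-check the cat step" >&2
fi
```

If the two counts differ, the combined file is likely incomplete or was built from a different set of files.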

Note: If you want to infer a tree for viruses, check this instruction. In summary, we recommend first downloading a set of proteomes and their cDNA from NCBI RefSeq for your clade of interest, then using [OMA standalone](https://omabrowser.org/standalone/) to infer the set of marker genes. For coronaviruses, you can use this link to download the marker genes, and download the cDNA FASTA file (1 MB) from here and unzip it. Once you have the marker genes as proteomes and cDNA, you can use the sequencing reads of your samples to infer the species tree with read2tree, using the arguments --standalone_path marker_genes --dna_reference viruses.cdna.fa .
