
Run FastOMA on your own grouping

Sina Majidian edited this page Jun 20, 2024 · 2 revisions

A user can provide their own initial grouping of proteins (rootHOGs) to be used with FastOMA. This can be done in two ways:

  1. Running the two processes hog_rest and collect_subhog in FastOMA.nf on the user's protein family in FASTA format.

  2. Providing a group mapping of proteins in OMAmer format.

For the first approach, note that each FASTA record should have the formatting >gene_name||species_name||unique_integerID, with the fields separated by a double pipe. See the example below.

>ANAPLA_R14405||ANAPLA||1134003114 ANAPLA_R14405
SPMFDGKVPHWHHYSCFWKRARIVSHTDIDGFPELRWEDQEKIKKAIETGGPGGGGDQEG
GGKAEKSLNDFAAEYAKSNRSTCKGCEQKIEK
>OREMEL_R06256||OREMEL||1323005702 OREMEL_R06256
MASKRHAVPPKQQDGKGKKVKRGEEDDVWSSTLAALKTAPKEKPPATIDGLCPLSSMPGA
QVYEDYDCTLNQTNISANNNKFYIIQLLEHDGAYSVW

Whatever comes after the space in the record ID does not matter.
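As a minimal sketch of the header layout used in the example records above, the following Python helpers build and parse such a record ID. The function names format_record_id and parse_record_id are hypothetical and are not part of FastOMA; the sketch only assumes the double-pipe separated layout shown in the example and that the text after the first space can be dropped.

```python
from typing import Tuple


def format_record_id(gene_name: str, species_name: str, unique_id: int) -> str:
    """Build a FASTA header line as >gene_name||species_name||unique_integerID."""
    for field in (gene_name, species_name):
        # A separator or whitespace inside a field would make the header ambiguous.
        if "||" in field or any(ch.isspace() for ch in field):
            raise ValueError(f"field contains '||' or whitespace: {field!r}")
    return f">{gene_name}||{species_name}||{int(unique_id)}"


def parse_record_id(header: str) -> Tuple[str, str, int]:
    """Split a header line back into (gene_name, species_name, unique_id)."""
    # Keep only the record ID: drop the leading '>' and anything after the first space.
    record_id = header.lstrip(">").split(maxsplit=1)[0]
    gene_name, species_name, unique_id = record_id.split("||")
    return gene_name, species_name, int(unique_id)
```

For instance, parse_record_id applied to the first header of the example returns the gene name, the species name and the integer ID as three separate values, ignoring the trailing description.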

In the manuscript, we described the InterProScan tool as an alternative to OMAmer+OMAdb.

For the second approach, you could write an adapter that produces, for each genome, a “.hogmap” file in TSV format with at least the columns qseqid, hogid, family_p, qseqlen, subfamily_medianseqlen.

The hogid column must be in the format “HOG:[A-Z][0-9]{7}..*”. Everything after the dot will be truncated, so only the root HOG ID is kept. family_p, qseqlen and subfamily_medianseqlen are used to identify the best isoform when there are several, but I don’t think you will use those (at least not in the beginning).
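To illustrate the truncation described above, here is a small Python sketch that reduces a hogid to its root HOG ID and checks the “HOG:” plus one uppercase letter plus seven digits prefix. The helper name root_hog_id is hypothetical, and this is not necessarily how FastOMA performs the truncation internally; it only mirrors the rule stated in the text.

```python
import re

# Root part of a hogid: "HOG:", one uppercase letter, seven digits.
ROOT_HOG_PATTERN = re.compile(r"HOG:[A-Z][0-9]{7}")


def root_hog_id(hogid: str) -> str:
    """Return the root HOG ID, i.e. the part of hogid before the first dot."""
    root = hogid.split(".", 1)[0]  # an ID without any dot is returned unchanged
    if ROOT_HOG_PATTERN.fullmatch(root) is None:
        raise ValueError(f"unexpected hogid format: {hogid!r}")
    return root
```

An adapter could run such a check on every hogid before writing it, so that malformed IDs are caught early rather than inside the pipeline.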

See example:

# qseqid hogid family_p qseqlen subfamily_medianseqlen
sp|P15943|APLP2_RAT	HOG:B0595810.1a	1.0	0.9889574448284	0.9876786945621626	766	733
tr|H2Q546|H2Q546_PANTR	HOG:B0595810.1a.1b	1.0	0.9986343792025645	0.9986343792025645	762	733
tr|E3L250|E3L250_PUCGT	HOG:B0811161	0.6180555555555556	0.08359449706661991	0.0161118229169709	145	438
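A minimal adapter for the second approach could be sketched in Python as below. It writes only the five columns named above, tab-separated, with a plain header row. The function name write_hogmap, the example rows and all their values are made up for illustration, and the exact header convention FastOMA expects (for example, whether a leading “#” is needed) is an assumption that should be checked against a .hogmap file produced by OMAmer.

```python
import csv
import io

# The minimum set of columns named in the text above.
HOGMAP_COLUMNS = ["qseqid", "hogid", "family_p", "qseqlen", "subfamily_medianseqlen"]


def write_hogmap(handle, rows):
    """Write rows (dicts keyed by HOGMAP_COLUMNS) as tab-separated text to handle."""
    writer = csv.DictWriter(
        handle, fieldnames=HOGMAP_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical rows: protein IDs, HOG IDs and numbers are placeholders, not real data.
example_rows = [
    {"qseqid": "geneA", "hogid": "HOG:A0000001", "family_p": 1.0,
     "qseqlen": 766, "subfamily_medianseqlen": 733},
    {"qseqid": "geneB", "hogid": "HOG:A0000002", "family_p": 1.0,
     "qseqlen": 145, "subfamily_medianseqlen": 438},
]

# Write into an in-memory buffer here; a real adapter would pass an open file instead.
buffer = io.StringIO()
write_hogmap(buffer, example_rows)
hogmap_text = buffer.getvalue()
```

To produce one file per genome, the same function can be called with a file opened for writing, e.g. open(path, "w", newline=""), where path is whatever “.hogmap” file name you choose for that genome.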

With that, you should in principle be able to use FastOMA with your own initial rootHOGs.

If you face any difficulties, feel free to contact us through a GitHub issue.