Phyla: Towards a Foundation Model for Phylogenetic Inference

What is Phyla?

Phyla is a protein language model designed to model both individual sequences and inter-sequence relationships. It leverages a hybrid state-space transformer architecture and is trained on two tasks: masked language modeling and phylogenetic tree reconstruction using sequence embeddings. Phyla enables rapid construction of phylogenetic trees of protein sequences, offering insights that differ from classical methods in potentially functionally significant ways.
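To make the idea of "tree reconstruction using sequence embeddings" concrete, here is a minimal sketch. It is not Phyla's method, which this README does not describe. It assumes each sequence has already been mapped to a vector, and it swaps in two standard, simple choices: Euclidean distance between vectors and UPGMA (average-linkage) clustering. The vectors and the function names `pairwise_distances` and `upgma_newick` are hypothetical.

```python
import math


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embedding vectors."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def upgma_newick(names, dist):
    """Cluster with UPGMA and return the topology as a Newick string."""
    # Active clusters: id -> (Newick label, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        label_a, size_a = clusters.pop(a)
        label_b, size_b = clusters.pop(b)
        del d[(a, b)]
        merged = {}
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps this equal to the mean leaf distance.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d.update(merged)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    label, _ = next(iter(clusters.values()))
    return label + ";"


if __name__ == "__main__":
    # Made-up 2-D vectors standing in for real sequence embeddings.
    names = ["A", "B", "C", "D"]
    vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
    print(upgma_newick(names, pairwise_distances(vectors)))
```

The output is a Newick string without branch lengths, which most tree viewers accept. Any distance-based tree method could take the place of UPGMA here.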

Disclaimer

We are excited to introduce Phyla-α, an early-stage version of our model that is still under active development. Future iterations will incorporate methodological improvements and additional training data as we continue refining the model. Please note that this work is ongoing, and updates will be released as progress is made.

What is in this repo?

This repo provides a way to run inference with the Phyla-α model for your application. After completing the steps below, you can give Phyla a FASTA file and quickly get a phylogenetic tree. We are also working on releasing the training code.

Shorthand	Name in code	Dataset	Description
Phyla-α	`phyla-alpha`	13,696 trees from OpenProteinSet	Alpha release of Phyla meant as a proof of concept of ongoing work.
Phyla-β	`phyla-beta`	Full OpenProteinSet	Version 2 of Phyla, set to release in April after some methodological improvements and longer training.

Getting started with Phyla

Step one: Install the environment

First, create an environment for mamba by following the instructions in its GitHub repository, including the causal-conv1d package. In my experience, installing on a machine with a GPU avoids some installation problems. Once you can run this import without errors:

from mamba_ssm import Mamba

then build the rest of the environment from the YAML file provided in the envs folder inside the phyla folder.
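Before moving on, it can help to confirm which packages are visible to the active environment. The sketch below uses only the standard library. The module list is an assumption: `mamba_ssm` comes from the import above, while `torch` and `causal_conv1d` are the import names assumed for its dependencies. The helper name `find_missing` is hypothetical.

```python
import importlib.util

# Assumed import names; adjust to match the YAML file in the envs folder.
REQUIRED_MODULES = ["torch", "causal_conv1d", "mamba_ssm"]


def find_missing(modules):
    """Return the modules from `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

This only checks that the modules can be located. It does not import them, so a build that is installed but broken (for example, compiled against a different CUDA version) will still pass; running the `from mamba_ssm import Mamba` line remains the real test.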

Step two: Pip install the phyla package

Run

 pip install -e .

from within this directory to install the Phyla package into your environment.

Step three: Run the Phyla test.

Run "run_phyla_test.py". If a tree is printed, everything is set up correctly!

Once that is done, replace the FASTA file in the run_phyla_test script with the FASTA file containing the protein sequences you want to build a tree for, and it will generate a tree.
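Before swapping in your own file, a quick sanity check of the FASTA input can save a failed run. The sketch below is a standalone helper and not part of the Phyla package; the function names `read_fasta` and `check_records` are hypothetical, and whether Phyla itself rejects empty sequences or duplicate headers is not stated in this README.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is None:
                raise ValueError("sequence data found before the first '>' header")
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen = set()
    for header, seq in records:
        if not seq:
            problems.append(f"empty sequence: {header}")
        if header in seen:
            problems.append(f"duplicate header: {header}")
        seen.add(header)
    return problems
```

For example, `check_records(read_fasta("my_proteins.fasta"))` returns an empty list when no issues are found (the file name is a placeholder).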

System Requirements and Scalability

This script has been tested on an NVIDIA H100 GPU and is expected to work on a 32 GB V100 as well. More GPU memory allows trees to be generated for a larger number of sequences. Reconstructing the tree of life with 3,084 sequences required running Phyla on CPUs with approximately 1 TB of memory. If you are interested in running Phyla on a CPU to handle more sequences, please raise an issue; this will help prioritize adding that functionality.
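Since GPU memory limits how many sequences fit, it may be useful to see what the current machine offers before a large run. The sketch below assumes PyTorch is installed in the environment and uses its standard CUDA queries; the function name `gpu_summary` is hypothetical, and no mapping from memory to a maximum number of sequences is implied.

```python
def gpu_summary():
    """Return the name and total memory (GiB) of GPU 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if not torch.cuda.is_available():
        return None  # no CUDA device is visible
    props = torch.cuda.get_device_properties(0)
    return {"name": props.name, "total_gib": props.total_memory / 1024**3}


if __name__ == "__main__":
    info = gpu_summary()
    if info is None:
        print("No CUDA GPU detected.")
    else:
        print(f"{info['name']}: {info['total_gib']:.1f} GiB total memory")
```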

mims-harvard/Phyla
