ChemBFN: Bayesian Flow Network for Chemistry


This repository contains the PyTorch implementation of the ChemBFN model.

Features

ChemBFN provides state-of-the-art functionality for

  • SMILES or SELFIES-based de novo molecule generation
  • Protein sequence de novo generation
  • Classifier-free guidance conditional generation (single or multi-objective optimisation)
  • Context-guided conditional generation (inpainting)
  • Outstanding out-of-distribution chemical space sampling
  • Molecular property and activity prediction finetuning
  • Reaction yield prediction finetuning

in an all-in-one-model style.

News

  • [17/12/2024] The second paper, on out-of-distribution generation, is available on arXiv.
  • [31/07/2024] The paper is available on arXiv.
  • [21/07/2024] The paper was submitted to arXiv.

Usage

You can find example scripts in the 📁example folder.

Pre-trained Model

You can find pretrained models on the releases page.
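
If the downloaded asset is a plain PyTorch checkpoint, it can be inspected directly. The sketch below assumes this and uses a hypothetical file name; adapt it to the file you actually downloaded (the 📁example scripts show the full loading workflow).

>>> import torch

>>> # hypothetical file name; replace it with the checkpoint downloaded from the releases page
>>> state_dict = torch.load("chembfn_pretrained.pt", map_location="cpu")
>>> list(state_dict)[:5]  # peek at the first few entries of the checkpoint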

Dataset Handling

We provide a Python class CSVData to handle data stored in CSV (or a similar format) whose header row identifies the entries. The following is a quickstart.

  1. Download your dataset file (e.g., ESOL from MoleculeNet) and split the file:
>>> from bayesianflow_for_chem.tool import split_data

>>> split_data("delaney-processed.csv", method="scaffold")  # writes split files, e.g., delaney-processed_train.csv
  2. Load the split data:
>>> from bayesianflow_for_chem.data import smiles2token, collate, CSVData

>>> dataset = CSVData("delaney-processed_train.csv")
>>> dataset[0]
{'Compound ID': ['Thiophene'], 
'ESOL predicted log solubility in mols per litre': ['-2.2319999999999998'], 
'Minimum Degree': ['2'], 
'Molecular Weight': ['84.14299999999999'], 
'Number of H-Bond Donors': ['0'], 
'Number of Rings': ['1'], 
'Number of Rotatable Bonds': ['0'], 
'Polar Surface Area': ['0.0'], 
'measured log solubility in mols per litre': ['-1.33'], 
'smiles': ['c1ccsc1']}
  3. Create a mapping function to tokenise the dataset and select values:
>>> import torch

>>> def encode(x):
...   smiles = x["smiles"][0]
...   value = [float(i) for i in x["measured log solubility in mols per litre"]]
...   return {"token": smiles2token(smiles), "value": torch.tensor(value)}

>>> dataset.map(encode)
>>> dataset[0]
{'token': tensor([  1, 151,  23, 151, 151, 154, 151,  23,   2]), 
'value': tensor([-1.3300])}
  4. Wrap the dataset in torch.utils.data.DataLoader:
>>> dataloader = torch.utils.data.DataLoader(dataset, 32, collate_fn=collate)
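
Each collated batch can then be fed to a training or finetuning loop. The sketch below only inspects one batch; the key names follow the dictionary returned by encode above, while the padded shapes are an assumption.

>>> batch = next(iter(dataloader))
>>> batch["token"].shape  # assumed: (32, length of the longest sequence in the batch)
>>> batch["value"].shape  # assumed: (32, 1)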

Cite This Work

@misc{2024chembfn,
      title={A Bayesian Flow Network Framework for Chemistry Tasks}, 
      author={Nianze Tao and Minori Abe},
      year={2024},
      eprint={2407.20294},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20294}, 
}

Out-of-distribution generation:

@misc{2024chembfn_ood,
      title={Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces}, 
      author={Nianze Tao},
      year={2024},
      eprint={2412.11439},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11439}, 
}