This repository contains the PyTorch implementation of the ChemBFN model.
ChemBFN provides the state-of-the-art functionalities of
- SMILES- or SELFIES-based de novo molecule generation
- De novo protein sequence generation
- Classifier-free guidance conditional generation (single- or multi-objective optimisation)
- Context-guided conditional generation (inpainting)
- Outstanding out-of-distribution chemical space sampling
- Molecular property and activity prediction finetuning
- Reaction yield prediction finetuning
in an all-in-one-model style.
- [17/12/2024] Our second paper, on out-of-distribution generation, is available on arxiv.org.
- [31/07/2024] Paper is available on arxiv.org.
- [21/07/2024] Paper was submitted to arXiv.
You can find example scripts in the 📁example folder.
You can find pretrained models on the release page.
We provide a Python class CSVData
to handle data stored in CSV or a similar format whose header row identifies the entries. The following is a quickstart.
- Download your dataset file (e.g., ESOL from MoleculeNet) and split the file:
>>> from bayesianflow_for_chem.tool import split_data
>>> split_data("delaney-processed.csv", method="scaffold")
- Load the split data:
>>> from bayesianflow_for_chem.data import smiles2token, collate, CSVData
>>> dataset = CSVData("delaney-processed_train.csv")
>>> dataset[0]
{'Compound ID': ['Thiophene'],
'ESOL predicted log solubility in mols per litre': ['-2.2319999999999998'],
'Minimum Degree': ['2'],
'Molecular Weight': ['84.14299999999999'],
'Number of H-Bond Donors': ['0'],
'Number of Rings': ['1'],
'Number of Rotatable Bonds': ['0'],
'Polar Surface Area': ['0.0'],
'measured log solubility in mols per litre': ['-1.33'],
'smiles': ['c1ccsc1']}
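Each item is thus a plain dict mapping every CSV header to a list of string values. The behaviour can be illustrated with a simplified standard-library sketch (the class name `SimpleCSVData` is ours; this is not the actual CSVData implementation):

```python
import csv
import io

class SimpleCSVData:
    """Toy stand-in for CSVData: exposes each CSV row as a dict
    mapping a header field to a list of string values."""

    def __init__(self, text):
        self.rows = list(csv.DictReader(io.StringIO(text)))
        self.fn = None

    def map(self, fn):
        # Register a mapping function applied on item access.
        self.fn = fn

    def __getitem__(self, idx):
        item = {k: [v] for k, v in self.rows[idx].items()}
        return self.fn(item) if self.fn else item

data = SimpleCSVData("smiles,solubility\nc1ccsc1,-1.33\nCCO,-0.77\n")
print(data[0])  # {'smiles': ['c1ccsc1'], 'solubility': ['-1.33']}
```

Values stay as strings at this stage; converting them to numbers is left to the mapping function registered via `map`, as shown in the next step.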
- Create a mapping function to tokenise the dataset and select values:
>>> import torch
>>> def encode(x):
... smiles = x["smiles"][0]
... value = [float(i) for i in x["measured log solubility in mols per litre"]]
... return {"token": smiles2token(smiles), "value": torch.tensor(value)}
>>> dataset.map(encode)
>>> dataset[0]
{'token': tensor([ 1, 151, 23, 151, 151, 154, 151, 23, 2]),
'value': tensor([-1.3300])}
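The token ids above come from the library's fixed vocabulary, with special start (1) and end (2) markers around the sequence. A toy sketch of the same idea, using an ad-hoc character vocabulary whose ids will not match the library's (`toy_smiles2token` is our own illustrative helper, not part of the package):

```python
def toy_smiles2token(smiles, vocab, start=1, end=2):
    # Map each character to its id and wrap with start/end markers.
    return [start] + [vocab[c] for c in smiles] + [end]

# Build an ad-hoc vocabulary; ids 0-2 are reserved for special tokens.
vocab = {c: i + 3 for i, c in enumerate(sorted(set("c1ccsc1")))}
print(toy_smiles2token("c1ccsc1", vocab))  # [1, 4, 3, 4, 4, 5, 4, 3, 2]
```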
- Wrap the dataset in a torch.utils.data.DataLoader:
>>> dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate)
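Since the tokenised SMILES have different lengths, the `collate` function must assemble them into a rectangular batch. A minimal pure-Python sketch of the padding idea (assuming right-padding with id 0, which may differ from the library's actual `collate`):

```python
def pad_collate(batch, pad_id=0):
    """Right-pad every token sequence in the batch to the longest length."""
    tokens = [item["token"] for item in batch]
    max_len = max(len(t) for t in tokens)
    padded = [t + [pad_id] * (max_len - len(t)) for t in tokens]
    values = [item["value"] for item in batch]
    return {"token": padded, "value": values}

batch = [
    {"token": [1, 151, 23, 2], "value": [-1.33]},
    {"token": [1, 151, 2], "value": [-0.77]},
]
out = pad_collate(batch)
# out["token"] == [[1, 151, 23, 2], [1, 151, 2, 0]]
```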
Main paper:
@misc{2024chembfn,
title={A Bayesian Flow Network Framework for Chemistry Tasks},
author={Nianze Tao and Minori Abe},
year={2024},
eprint={2407.20294},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.20294},
}
Out-of-distribution generation:
@misc{2024chembfn_ood,
title={Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces},
author={Nianze Tao},
year={2024},
eprint={2412.11439},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.11439},
}