The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports, and RDKit mol objects in the mol_list_train and mol_list_test lists:
pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
>>> array([4.93858815])
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.
- Descriptors
  - MolecularDescriptorTransformer
- Fingerprints
  - MorganFingerprintTransformer
  - MACCSKeysFingerprintTransformer
  - RDKitFingerprintTransformer
  - AtomPairFingerprintTransformer
  - TopologicalTorsionFingerprintTransformer
  - MHFingerprintTransformer
  - SECFingerprintTransformer
  - AvalonFingerprintTransformer
- Conversions
  - SmilesToMol
- Standardizer
  - Standardizer
- safeinference
  - SafeInferenceWrapper
  - set_safe_inference_mode
- Utilities
  - CheckSmilesSanitazion
Users can install the latest tagged release from PyPI with pip:
pip install scikit-mol
or from conda-forge
conda install -c conda-forge scikit-mol
The conda-forge package should be updated shortly after a new tagged release on PyPI.
Bleeding edge (latest development version from GitHub):
pip install git+https://github.com/EBjerrum/scikit-mol.git
There is a collection of notebooks in the notebooks directory that demonstrate different aspects and use cases:
- Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer
- Using parallel execution to speed up descriptor and fingerprint calculations
- Testing different fingerprints as part of the hyperparameter optimization
We also published a software note on ChemRxiv: https://doi.org/10.26434/chemrxiv-2023-fzqwd
Help wanted! Are you a PhD student who wants a "side-quest" to procrastinate on your thesis writing, or are you simply interested in computational chemistry, cheminformatics, QSAR modelling, Python programming, or open-source software? Do you want to learn more about machine learning with scikit-learn? Or do you use scikit-mol in your current work and would like to give a little back to the project and see it improved as well? With a little bit of help, this project can be improved much faster! Reach out to me (Esben) for a discussion about how we can proceed.
Currently we are working on fixing some deprecation warnings. It's not the most exciting work, but a little maintenance is important. Later on we need to go over the scikit-learn compatibility and adopt some of the newer features of scikit-learn's estimator classes. We're also brewing some feature enhancements and tests, such as new fingerprints and a more versatile standardizer.
There is more information about how to contribute to the project in CONTRIBUTING.
There are probably still bugs; please check the issues on GitHub and report any new ones there.
- Esben Jannik Bjerrum, @ebjerrum
- Carmen Esposito, @cespos
- Son Ha
- Oh-hyeon Choung
- Andreas Poehlmann, @ap--
- Ya Chen, @anya-chen
- Anton Siomchen, @asiomchen
- Rafał Bachorz, @rafalbachorz
- Adrien Chaton, @adrienchaton
- @VincentAlexanderScholz
- @RiesBen
- @enricogandini
- @mikemhenry
- @c-feldmann