Unable to execute #13

snehachem · 2024-11-29T12:01:18Z

IKK_refine.csv
I am trying to run this tool and build a model with my own data. The tutorial data works, but I am confused about how to handle my customized data set curated from the ChEMBL database. I want to build a random forest classification model. The steps I followed, as per the tutorial, are given below:

Imported the data using pandas.
Defined the property pIC50 (curated manually).
Standardized the SMILES.
Calculated fingerprints. (I tried a fingerprint other than Morgan but got an error about NaN values, so I used Morgan FP anyway.)
Dropped "Nop", kept "Wow".
Filled missing values.
I found that the number of molecules increased: I started with 300+ and now have 1K+.
The classification step returns an error.
I attached the necessary files in HTML format (HTML cannot be attached, so I changed the extension from .html to .csv; after changing it back, the file can be viewed).

What is the most straightforward way to build a random forest model that I can use with DrugEx?
QSAR_Pred_2.csv
Attached raw input data

martin-sicho · 2024-11-29T13:21:38Z

Hi, I looked at your file and it seems the main issue was that you did not change the task of the data set. You need to run:

dataset.makeClassification(target_property="pIC50", th=[6])

This will make the data suitable for a classification task.
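To illustrate what a single threshold does conceptually, here is a minimal plain-Python sketch. It is not the QSPRpred implementation: the function name `binarize` and the sample values are hypothetical, and it assumes that values strictly above the threshold count as active, so the library's own handling of the boundary value may differ.

```python
def binarize(values, threshold=6.0):
    """Label each pIC50 value as active (True) if it is strictly above the threshold."""
    return [v > threshold for v in values]


# Made-up pIC50 values, for illustration only
pic50 = [4.8, 6.0, 6.3, 7.9]
labels = binarize(pic50, threshold=6.0)
print(labels)
```

A classifier such as a random forest is then trained on these two classes instead of on the continuous pIC50 values.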

Here is an executable Python script I compiled from your notebook that does the training on your data and contains solutions to some of the errors (again had to change the extension):
test.txt. There are three rounds of training where the first two illustrate potential issues and then follows a resolution. Sorry I did not have much time to add more detailed explanations, but I think the line above is probably all you need. Not sure about the increase in data size. It did not seem to occur in my experiments, but I can still look into it if you give me an exact snippet that results in that.

Thanks for the interest in the framework and let us know how it went!

snehachem · 2024-11-30T02:43:34Z

Thanks, Martin, for your kind assistance. I will certainly update you. Today I am going to refine the work as per your suggestion and will share the details. Thanks again for your time.

snehachem · 2024-12-04T09:33:04Z

After running the following code, I was able to generate a classification model:
os.makedirs("/home/sneha/Documents/DrugEx/IKK epsilon_TBK1", exist_ok=True)

model = SklearnModel(
    base_dir="/home/sneha/Documents/DrugEx/IKK_RandomForestClassifier_TBK1",
    alg=RandomForestClassifier,
    name="Classification_Model"
)

CrossValAssessor("roc_auc")(model, dataset)

TestSetAssessor("roc_auc")(model, dataset)
model.fitDataset(dataset)
_ = model.save()

But the mean ROC AUC was only 0.68, so I think I need to use TopologicalFP. I modified the code as follows:
from qsprpred.data.descriptors.fingerprints import TopologicalFP
dataset.prepareDataset(
    data_filters=[RepeatsFilter(keep=False),
                  CategoryFilter(name="FakeProperty", values=["Wow"], keep=True)],
    # only keep compounds with FakeProperty="Wow"
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[TopologicalFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

But I encountered the error shown below:

ArgumentError Traceback (most recent call last)
Cell In[37], line 2
1 from qsprpred.data.descriptors.fingerprints import TopologicalFP
----> 2 dataset.prepareDataset(
3 data_filters=[RepeatsFilter(keep=False),
4 CategoryFilter(name="FakeProperty", values=["Wow"], keep=True)],
5 # only keep compounds with FakeProperty="Wow"
6 split=RandomSplit(test_fraction=0.2, dataset=dataset),
7 feature_calculators=[TopologicalFP(radius=3, nBits=2048)],
8 recalculate_features=True,
9 )
11 dataset.getDF().head()

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:875, in QSPRDataset.prepareDataset(self, smiles_standardizer, data_filters, split, feature_calculators, feature_filters, feature_standardizer, feature_fill_value, applicability_domain, drop_outliers, recalculate_features, shuffle, random_state)
873 # calculate features
874 if feature_calculators is not None:
--> 875 self.addFeatures(feature_calculators, recalculate=recalculate_features)
876 # apply data filters
877 if data_filters is not None:

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:802, in QSPRDataset.addFeatures(self, feature_calculators, recalculate)
789 def addFeatures(
790 self,
791 feature_calculators: list[DescriptorSet],
792 recalculate: bool = False,
793 ):
794 """Add features to the data set.
795
796 Args:
(...)
800 present in the data set. Defaults to False.
801 """
--> 802 self.addDescriptors(
803 feature_calculators, recalculate=recalculate, featurize=False
804 )
805 self.featurize()

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:525, in QSPRDataset.addDescriptors(self, descriptors, recalculate, featurize, *args, **kwargs)
501 def addDescriptors(
502 self,
503 descriptors: list[DescriptorSet],
(...)
507 **kwargs,
508 ):
509 """Add descriptors to the data set.
510
511 If descriptors are already present, they will be recalculated if recalculate
(...)
523 **kwargs: additional keyword arguments to pass to each descriptor set
524 """
--> 525 super().addDescriptors(descriptors, recalculate, *args, **kwargs)
526 self.featurize(update_splits=featurize)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:831, in MoleculeTable.addDescriptors(self, descriptors, recalculate, fail_on_invalid, *args, **kwargs)
829 for calculator in to_calculate:
830 df_descriptors = []
--> 831 for result in self.processMols(
832 calculator, proc_args=args, proc_kwargs=kwargs
833 ):
834 df_descriptors.append(result)
835 df_descriptors = pd.concat(df_descriptors, axis=0)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:633, in MoleculeTable.processMols(self, processor, proc_args, proc_kwargs, add_props, as_rdkit, chunk_size, n_jobs)
627 logger.debug(
628 f"Applying processor '{processor}' to '{self.name}' in serial."
629 )
630 for result in self.iterChunks(
631 include_props=add_props, as_dict=True, chunk_size=len(self)
632 ):
--> 633 yield self.runMolProcess(
634 result,
635 processor,
636 as_rdkit,
637 self.smilesCol,
638 *proc_args,
639 **proc_kwargs,
640 )

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:544, in MoleculeTable.runMolProcess(cls, props, func, add_rdkit, smiles_col, *args, **kwargs)
542 else:
543 mols = props[smiles_col]
--> 544 return func(mols, props, *args, **kwargs)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/descriptors/fingerprints.py:72, in Fingerprint.__call__(self, mols, props, *args, **kwargs)
52 def __call__(
53 self, mols: list[str | Mol], props: dict[str, list[Any]], *args, **kwargs
54 ) -> pd.DataFrame:
55 """Calculate binary fingerprints for the input molecules. Only the bits
56 specified by usedBits will be returned if more bits are calculated.
57
(...)
70 data frame of descriptor values of shape (n_mols, n_descriptors)
71 """
---> 72 values = self.getDescriptors(self.prepMols(mols), props, *args, **kwargs)
73 values = values[:, self.usedBits]
74 values = values.astype(self.dtype)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/descriptors/fingerprints.py:197, in TopologicalFP.getDescriptors(self, mols, props, *args, **kwargs)
195 ret = np.zeros((len(mols), len(self)))
196 for idx, mol in enumerate(mols):
--> 197 fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(
198 mol, nBits=self.nBits, **self.kwargs
199 )
200 np_fp = np.zeros(len(fp))
201 convertFP(fp, np_fp)

ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(Mol)
did not match C++ signature:
GetHashedTopologicalTorsionFingerprintAsBitVect(RDKit::ROMol mol, unsigned int nBits=2048, unsigned int targetSize=4, boost::python::api::object fromAtoms=0, boost::python::api::object ignoreAtoms=0, boost::python::api::object atomInvariants=0, unsigned int nBitsPerEntry=4, bool includeChirality=False)

Please help.
Sir, I come from an organic medicinal chemistry background and have limited experience with coding or Python programming. Is there a straightforward way to run this tool or DrugEx?

martin-sicho · 2024-12-04T09:48:44Z

Hi, this looks like something might have changed in how the rdkit function is supposed to be called. Can you let me know the version of rdkit that you have installed? This command should do it: pip freeze | grep rdkit. I will try to reproduce it.
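If running shell commands is inconvenient, for example inside a notebook, the same information can be read from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the packages are registered under the distribution names `rdkit` and `qsprpred`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Print the versions relevant to this issue (None means "not found")
for name in ("rdkit", "qsprpred"):
    print(name, installed_version(name))
```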

snehachem · 2024-12-04T12:52:01Z

(drugex) sneha@sneha:~$ pip freeze | grep rdkit
rdkit==2024.3.6

martin-sicho · 2024-12-10T08:33:31Z

Hi @snehachem, sorry for the delay. I have been fairly busy lately. I tried to run the workflow with topological fingerprints instead and it worked just fine. Can you confirm that you still get the error even when using this notebook? I did not find any issues with calculating those descriptors and my environment looks like this:

qsprpred==3.2.1
rdkit==2024.3.6

If you do not find a fix in this shared notebook, please share the notebook you are trying to execute together with the input files, and I will try to debug it.

martin-sicho · 2024-12-10T08:36:23Z

One thing I missed before, you are calling the fingerprint class like this: TopologicalFP(radius=3, nBits=2048), but these FPs do not have a radius parameter. So I think that is actually why it was failing. I totally missed that the first time...
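The traceback above shows the fingerprint class forwarding its stored keyword arguments to the RDKit function (`mol, nBits=self.nBits, **self.kwargs`), which would explain why the error only appears at calculation time. The following plain-Python sketch mimics that pattern. Both `hashed_torsion_fp` and `ToyTopologicalFP` are hypothetical stand-ins, not the RDKit or QSPRpred code, and plain Python raises a `TypeError` here where RDKit's Boost.Python binding raises an `ArgumentError`.

```python
def hashed_torsion_fp(mol, nBits=2048, targetSize=4, includeChirality=False):
    """Stand-in for the RDKit function: note that it has no `radius` parameter."""
    return [0] * nBits


class ToyTopologicalFP:
    """Hypothetical mock of a fingerprint class that forwards extra kwargs."""

    def __init__(self, nBits=2048, **kwargs):
        self.nBits = nBits
        self.kwargs = kwargs  # stored as-is, not validated here

    def get_descriptors(self, mol):
        # Unknown keywords only fail once they reach the underlying function
        return hashed_torsion_fp(mol, nBits=self.nBits, **self.kwargs)


fp_ok = ToyTopologicalFP(nBits=2048)
print(len(fp_ok.get_descriptors("CCO")))

fp_bad = ToyTopologicalFP(radius=3, nBits=2048)  # construction succeeds
try:
    fp_bad.get_descriptors("CCO")
except TypeError as err:
    print("fails at calculation time:", err)
```

In the original call, dropping `radius=3` and keeping only `nBits=2048` should therefore avoid the mismatch.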

snehachem · 2024-12-11T08:11:42Z

Thanks, I will update accordingly.

nayanmondal1337 · 2024-12-21T08:28:55Z

During my modeling, I faced similar issues and was able to solve them based on the suggestions above. I am interested in building regression models and have a few queries:

I used MaccsFP, MorganFP, and RDKit descriptors with various regression algorithms such as the RF regressor and the KN regressor, but R squared (shown in the array) was only 0.5 to 0.6. How can I improve the models? I followed the suggestions in the hyperparameter_optimization.ipynb file but saw no significant improvement. A tutorial notebook for regression, like the one for classification, would be helpful.
How can I plot predicted vs. actual pub_chem_mean_value with the R squared fit, residuals, SE, etc.? A tutorial on data plotting, like the one in the classification tutorial (ROC and so on), would be helpful.
Is it possible to use or call PyDescriptor (https://www.sciencedirect.com/science/article/abs/pii/S016974391730312X) during SMILES standardization, filtering, and modeling?
Thanks for your support.
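On the second query, the quantities behind a predicted vs. actual plot can be computed without any framework-specific code. The sketch below is a minimal standard-library example, not a QSPRpred feature: the function name `regression_summary` and the sample values are hypothetical, and it reports the root-mean-square error rather than a formal standard error.

```python
import math


def regression_summary(actual, predicted):
    """Compute residuals, R squared and RMSE for paired actual/predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                 # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "residuals": residuals,
    }


# Made-up values, for illustration only
actual = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
summary = regression_summary(actual, predicted)
print(summary["r2"], summary["rmse"])
```

The `predicted` and `residuals` lists can then be passed to any plotting library to draw a predicted vs. actual scatter plot and a residual plot.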
