Unable to execute #13

Open
snehachem opened this issue Nov 29, 2024 · 9 comments
Comments

@snehachem

snehachem commented Nov 29, 2024

IKK_refine.csv
I am trying to run this tool and build a model from my own data. With the tutorial data everything works, but I am getting confused with my custom data set curated from the ChEMBL database. I want to build a random forest classification model. Following the tutorial, I took these steps:

  1. Imported the data using pandas.
  2. Defined the property pIC50 (curated manually).
  3. Standardized the SMILES.
  4. Calculated fingerprints. (I tried a fingerprint other than Morgan but got an error about NaN values, so I used Morgan FP anyway.)
  5. Dropped "Nop" and kept "Wow".
  6. Filled missing values.
  7. The number of molecules increased: I started with 300+ and now have 1K+.
  8. Classification returns an error.
    I attached the necessary files in HTML format (HTML cannot be attached, so I renamed .html to .csv; after renaming it back, the file can be viewed).

What is the most straightforward way to build a random forest model that I can use with DrugEx?
QSAR_Pred_2.csv
Raw input data attached.

@martin-sicho
Contributor

Hi, I looked at your file and it seems the main issue was that you did not change the task of the data set. You need to run:

dataset.makeClassification(target_property="pIC50", th=[6])

This will make the data suitable for a classification task.
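
To illustrate what the threshold means, here is a minimal sketch in plain Python. It is not QSPRpred's implementation: the function name `binarize` and the sample values are hypothetical, and treating a value exactly equal to the threshold as inactive is an assumption made only for this example.

```python
def binarize(values, threshold=6.0):
    """Label each pIC50 value as active (True) or inactive (False).

    Assumption for this sketch: only values strictly above the
    threshold count as active.
    """
    return [value > threshold for value in values]


# Hypothetical pIC50 values, not taken from the attached data.
pic50_values = [4.8, 5.9, 6.0, 6.3, 7.5]
labels = binarize(pic50_values, threshold=6.0)
print(labels)
```

A classifier such as RandomForestClassifier needs discrete labels like these rather than continuous pIC50 values.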

Here is an executable Python script I compiled from your notebook that does the training on your data and contains solutions to some of the errors (again had to change the extension):
test.txt. There are three rounds of training where the first two illustrate potential issues and then follows a resolution. Sorry I did not have much time to add more detailed explanations, but I think the line above is probably all you need. Not sure about the increase in data size. It did not seem to occur in my experiments, but I can still look into it if you give me an exact snippet that results in that.
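
If it helps to narrow down the increase in data size, one option is to record the number of rows and the number of distinct SMILES after each preparation step. The sketch below uses plain Python lists as stand-ins for the data set; the helper name `summarize` and the sample records are hypothetical.

```python
from collections import Counter


def summarize(smiles_list):
    """Return the row count, the unique SMILES count, and repeated SMILES."""
    counts = Counter(smiles_list)
    duplicated = {smiles: n for smiles, n in counts.items() if n > 1}
    return {
        "rows": len(smiles_list),
        "unique": len(counts),
        "duplicated": duplicated,
    }


# Hypothetical snapshots taken before and after a preparation step.
before = ["CCO", "c1ccccc1", "CC(=O)O"]
after = ["CCO", "CCO", "c1ccccc1", "CC(=O)O", "CC(=O)O", "CC(=O)O"]

print(summarize(before))
print(summarize(after))
```

If the row count grows while the unique count stays the same, some step is repeating existing rows rather than adding new molecules.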

Thanks for the interest in the framework and let us know how it went!

@snehachem
Author

Thanks, Martin, for your kind assistance. Today I am going to refine the work as per your suggestion, and I will update you with the details. Thanks again for your time.

@snehachem
Author

snehachem commented Dec 4, 2024

After running the following code, I am able to generate a classification model:
os.makedirs("/home/sneha/Documents/DrugEx/IKK epsilon_TBK1", exist_ok=True)

model = SklearnModel(
    base_dir="/home/sneha/Documents/DrugEx/IKK_RandomForestClassifier_TBK1",
    alg=RandomForestClassifier,
    name="Classification_Model"
)

CrossValAssessor("roc_auc")(model, dataset)

TestSetAssessor("roc_auc")(model, dataset)
model.fitDataset(dataset)
_ = model.save()
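
For context on the "roc_auc" metric requested above: ROC AUC can be read as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The following is a minimal pure-Python sketch of that definition, not the code that QSPRpred or scikit-learn runs; `pairwise_auc` and the toy data are hypothetical.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = active, 0 = inactive.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.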

However, the mean ROC AUC was only 0.68, so I think I need to use TopologicalFP.
I modified the code as follows:
from qsprpred.data.descriptors.fingerprints import TopologicalFP
dataset.prepareDataset(
    data_filters=[
        RepeatsFilter(keep=False),
        CategoryFilter(name="FakeProperty", values=["Wow"], keep=True),
    ],
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[TopologicalFP(radius=3, nBits=2048)],
    recalculate_features=True,
)

dataset.getDF().head()

But I encountered the error shown below:

ArgumentError Traceback (most recent call last)
Cell In[37], line 2
1 from qsprpred.data.descriptors.fingerprints import TopologicalFP
----> 2 dataset.prepareDataset(
3 data_filters=[RepeatsFilter(keep=False),
4 CategoryFilter(name="FakeProperty", values=["Wow"], keep=True)],
5 # only keep compounds with FakeProperty="Wow"
6 split=RandomSplit(test_fraction=0.2, dataset=dataset),
7 feature_calculators=[TopologicalFP(radius=3, nBits=2048)],
8 recalculate_features=True,
9 )
11 dataset.getDF().head()

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:875, in QSPRDataset.prepareDataset(self, smiles_standardizer, data_filters, split, feature_calculators, feature_filters, feature_standardizer, feature_fill_value, applicability_domain, drop_outliers, recalculate_features, shuffle, random_state)
873 # calculate features
874 if feature_calculators is not None:
--> 875 self.addFeatures(feature_calculators, recalculate=recalculate_features)
876 # apply data filters
877 if data_filters is not None:

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:802, in QSPRDataset.addFeatures(self, feature_calculators, recalculate)
789 def addFeatures(
790 self,
791 feature_calculators: list[DescriptorSet],
792 recalculate: bool = False,
793 ):
794 """Add features to the data set.
795
796 Args:
(...)
800 present in the data set. Defaults to False.
801 """
--> 802 self.addDescriptors(
803 feature_calculators, recalculate=recalculate, featurize=False
804 )
805 self.featurize()

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/qspr.py:525, in QSPRDataset.addDescriptors(self, descriptors, recalculate, featurize, *args, **kwargs)
501 def addDescriptors(
502 self,
503 descriptors: list[DescriptorSet],
(...)
507 **kwargs,
508 ):
509 """Add descriptors to the data set.
510
511 If descriptors are already present, they will be recalculated if recalculate
(...)
523 **kwargs: additional keyword arguments to pass to each descriptor set
524 """
--> 525 super().addDescriptors(descriptors, recalculate, *args, **kwargs)
526 self.featurize(update_splits=featurize)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:831, in MoleculeTable.addDescriptors(self, descriptors, recalculate, fail_on_invalid, *args, **kwargs)
829 for calculator in to_calculate:
830 df_descriptors = []
--> 831 for result in self.processMols(
832 calculator, proc_args=args, proc_kwargs=kwargs
833 ):
834 df_descriptors.append(result)
835 df_descriptors = pd.concat(df_descriptors, axis=0)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:633, in MoleculeTable.processMols(self, processor, proc_args, proc_kwargs, add_props, as_rdkit, chunk_size, n_jobs)
627 logger.debug(
628 f"Applying processor '{processor}' to '{self.name}' in serial."
629 )
630 for result in self.iterChunks(
631 include_props=add_props, as_dict=True, chunk_size=len(self)
632 ):
--> 633 yield self.runMolProcess(
634 result,
635 processor,
636 as_rdkit,
637 self.smilesCol,
638 *proc_args,
639 **proc_kwargs,
640 )

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/tables/mol.py:544, in MoleculeTable.runMolProcess(cls, props, func, add_rdkit, smiles_col, *args, **kwargs)
542 else:
543 mols = props[smiles_col]
--> 544 return func(mols, props, *args, **kwargs)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/descriptors/fingerprints.py:72, in Fingerprint.__call__(self, mols, props, *args, **kwargs)
52 def __call__(
53 self, mols: list[str | Mol], props: dict[str, list[Any]], *args, **kwargs
54 ) -> pd.DataFrame:
55 """Calculate binary fingerprints for the input molecules. Only the bits
56 specified by usedBits will be returned if more bits are calculated.
57
(...)
70 data frame of descriptor values of shape (n_mols, n_descriptors)
71 """
---> 72 values = self.getDescriptors(self.prepMols(mols), props, *args, **kwargs)
73 values = values[:, self.usedBits]
74 values = values.astype(self.dtype)

File ~/miniconda3/envs/drugex/lib/python3.11/site-packages/qsprpred/data/descriptors/fingerprints.py:197, in TopologicalFP.getDescriptors(self, mols, props, *args, **kwargs)
195 ret = np.zeros((len(mols), len(self)))
196 for idx, mol in enumerate(mols):
--> 197 fp = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(
198 mol, nBits=self.nBits, **self.kwargs
199 )
200 np_fp = np.zeros(len(fp))
201 convertFP(fp, np_fp)

ArgumentError: Python argument types in
rdkit.Chem.rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(Mol)
did not match C++ signature:
GetHashedTopologicalTorsionFingerprintAsBitVect(RDKit::ROMol mol, unsigned int nBits=2048, unsigned int targetSize=4, boost::python::api::object fromAtoms=0, boost::python::api::object ignoreAtoms=0, boost::python::api::object atomInvariants=0, unsigned int nBitsPerEntry=4, bool includeChirality=False)

Please help.
Sir, I come from an organic medicinal chemistry background and have limited knowledge of coding or Python programming. Is there a straightforward way to run this tool or DrugEx?

@martin-sicho
Contributor

Hi, this looks like something might have changed in how the rdkit function is supposed to be called. Can you let me know the version of rdkit that you have installed? This command should do it: pip freeze | grep rdkit. I will try to reproduce it.
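
The same check can also be done from inside Python or a notebook cell. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package names are the ones that appear in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    result = {}
    for name in package_names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions(["rdkit", "qsprpred"]).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Run it in the same environment (here, the `drugex` conda environment) that the notebook uses, otherwise the versions reported may differ.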

@snehachem
Author

(drugex) sneha@sneha:~$ pip freeze | grep rdkit
rdkit==2024.3.6

@martin-sicho
Contributor

Hi @snehachem, sorry for the delay. I have been fairly busy lately. I tried to run the workflow with topological fingerprints instead and it worked just fine. Can you confirm that you still get the error even when using this notebook? I did not find any issues with calculating those descriptors and my environment looks like this:

qsprpred==3.2.1
rdkit==2024.3.6

If you do not find a fix in this shared notebook, please, share the notebook you are trying to execute with the input files as well and I will try to debug it.

@martin-sicho
Contributor

One thing I missed before, you are calling the fingerprint class like this: TopologicalFP(radius=3, nBits=2048), but these FPs do not have a radius parameter. So I think that is actually why it was failing. I totally missed that the first time...
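
The traceback above shows the fingerprint class forwarding its extra keyword arguments to the RDKit function (`nBits=self.nBits, **self.kwargs`), so an unsupported keyword such as `radius` only surfaces when the fingerprints are calculated. Here is a minimal pure-Python sketch of that pattern. `torsion_fp_stub`, `TorsionFPSketch`, and `try_fingerprint` are hypothetical stand-ins, not the real RDKit or QSPRpred code, and the stub accepts only a subset of the keywords listed in the error message.

```python
def torsion_fp_stub(mol, nBits=2048, targetSize=4, includeChirality=False):
    """Stand-in for the RDKit call; like the real one, it has no `radius`."""
    return [0] * nBits


class TorsionFPSketch:
    """Stand-in for a fingerprint class that stores extra keywords as-is."""

    def __init__(self, nBits=2048, **kwargs):
        self.nBits = nBits
        self.kwargs = kwargs  # stored without validation

    def get_descriptors(self, mols):
        # Extra keywords are forwarded here, so a bad one fails only now.
        return [
            torsion_fp_stub(mol, nBits=self.nBits, **self.kwargs)
            for mol in mols
        ]


def try_fingerprint(**params):
    """Return 'ok' or a short failure message for the given parameters."""
    try:
        TorsionFPSketch(**params).get_descriptors(["CCO"])
        return "ok"
    except TypeError as err:
        return f"failed: {err}"


print(try_fingerprint(nBits=2048))
print(try_fingerprint(radius=3, nBits=2048))
```

In this sketch, dropping `radius` from the call avoids the mismatch, which is consistent with the suggestion above.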

@snehachem
Author

Thanks, I will update accordingly.

@nayanmondal1337

nayanmondal1337 commented Dec 21, 2024

During my modeling I faced similar issues and was able to solve them based on the suggestions above. I am interested in building regression models and have a few queries:

  1. I used MaccsFP, MorganFP, and RDKit descriptors with various regression algorithms, such as random forest and k-nearest neighbors regressors, but R squared (shown in the array) was only 0.5 to 0.6. How can I improve the models? I followed the suggestions in the hyperparameter_optimization.ipynb file but saw no significant improvement. A tutorial notebook for regression, like the one for classification, would be helpful.
  2. How can I plot predicted vs. actual pub_chem_mean_value with R squared and the slope, along with residuals, SE, etc.? A tutorial on data plotting, like the one in the classification tutorial (ROC and so on), would be helpful.
  3. Is it possible to use or call PyDescriptor (https://www.sciencedirect.com/science/article/abs/pii/S016974391730312X) during SMILES standardization, filtering, and modeling?
    Thanks for your support.
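
For the second question, a minimal sketch of how predicted vs. actual values could be summarized and plotted outside of QSPRpred. This is not a QSPRpred API: `regression_summary`, `plot_predicted_vs_actual`, the output file name, and the sample numbers are hypothetical, and the plotting function assumes matplotlib is installed. The root-mean-square error is used here as a simple stand-in for a standard error of prediction.

```python
import math


def regression_summary(actual, predicted):
    """Compute R squared, RMSE, fit line, and residuals for paired values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_actual = sum(actual) / n
    mean_pred = sum(predicted) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Least-squares line: predicted = slope * actual + intercept
    sxy = sum(
        (a - mean_actual) * (p - mean_pred) for a, p in zip(actual, predicted)
    )
    slope = sxy / ss_tot
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "slope": slope,
        "intercept": mean_pred - slope * mean_actual,
        "residuals": residuals,
    }


def plot_predicted_vs_actual(actual, predicted, path="pred_vs_actual.png"):
    """Save a predicted-vs-actual panel and a residual panel to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    stats = regression_summary(actual, predicted)
    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(10, 4))
    ax_fit.scatter(actual, predicted)
    low, high = min(actual), max(actual)
    ax_fit.plot([low, high], [low, high], linestyle="--")  # ideal y = x line
    ax_fit.set_xlabel("actual")
    ax_fit.set_ylabel("predicted")
    ax_fit.set_title(f"R2 = {stats['r2']:.2f}, slope = {stats['slope']:.2f}")
    ax_res.scatter(predicted, stats["residuals"])
    ax_res.axhline(0.0, linestyle="--")
    ax_res.set_xlabel("predicted")
    ax_res.set_ylabel("residual (actual - predicted)")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


# Hypothetical values, not results from any model in this thread.
actual_values = [5.0, 6.0, 7.0, 8.0]
predicted_values = [5.5, 5.5, 7.5, 7.5]
summary = regression_summary(actual_values, predicted_values)
print(summary)
```

The inputs would be the experimental values and the model's test-set or cross-validation predictions for the same compounds, in the same order.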
