splito.lohi

splito.lohi.LoSplitter

`__init__`
__init__(
    threshold: float = 0.4,
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
)

A splitter that prepares data for training ML models for Lead Optimization or for guiding molecular generative models. These models must be sensitive to minor modifications of molecules, and this splitter constructs a test set that allows the evaluation of a model's ability to distinguish those modifications.

Parameters:

Name | Type | Description | Default
---|---|---|---
`threshold` | `float` | ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster. | `0.4`
`min_cluster_size` | `int` | The minimum number of molecules per cluster. | `5`
`max_clusters` | `int` | The maximum number of selected clusters. The remaining molecules go to the training set. This can be useful for limiting the test set so that more molecules end up in the training set. | `50`
`std_threshold` | `float` | The lower bound of the acceptable standard deviation for a cluster's values. It should be greater than the measurement noise. For ChEMBL-like data, set it to 0.60 for logKi and 0.70 for logIC50. Set it lower if you have a high-quality dataset. | `0.6`

For more information, see a tutorial in the docs and Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.
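
To make the `threshold`, `min_cluster_size`, and `std_threshold` parameters concrete, here is a minimal sketch in plain Python. It is not splito's implementation: fingerprints are represented as made-up sets of on-bit indices rather than real ECFP4 vectors, the helper names `tanimoto` and `cluster_is_acceptable` are hypothetical, and the use of the population standard deviation with an inclusive `>=` comparison is an assumption.

```python
from statistics import pstdev


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def cluster_is_acceptable(
    values: list[float],
    min_cluster_size: int = 5,
    std_threshold: float = 0.6,
) -> bool:
    """Hypothetical filter: enough molecules and enough spread in activity."""
    if len(values) < min_cluster_size:
        return False
    # Assumption: population standard deviation, inclusive lower bound.
    return pstdev(values) >= std_threshold


# Made-up on-bit indices standing in for 1024-bit ECFP4 fingerprints.
fp_parent = {3, 17, 42, 256, 901}
fp_analog = {3, 17, 42, 256, 640}  # 4 shared bits out of 6 distinct bits
fp_other = {5, 88, 300, 901}       # 1 shared bit out of 8 distinct bits

threshold = 0.4
analog_is_similar = tanimoto(fp_parent, fp_analog) > threshold  # 4/6 > 0.4
other_is_similar = tanimoto(fp_parent, fp_other) > threshold    # 1/8 < 0.4

# Made-up log-activity values for two candidate clusters.
flat_cluster = [6.0, 6.1, 5.9, 6.0, 6.1]    # spread well below 0.6
varied_cluster = [5.0, 6.0, 7.0, 6.5, 5.5]  # spread of about 0.71

print(analog_is_similar, other_is_similar)
print(cluster_is_acceptable(flat_cluster), cluster_is_acceptable(varied_cluster))
```

In this sketch only `varied_cluster` passes the filter. Lowering `std_threshold` would admit flatter clusters, which the parameter description recommends only for high-quality datasets.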

split

split(
    smiles: list[str], values: list[float], n_jobs: int = -1, verbose: int = 1
) -> tuple[list[int], list[list[int]]]

Split the dataset into a training set and test clusters.

Parameters:

Name | Type | Description | Default
---|---|---|---
`smiles` | `list[str]` | List of SMILES strings. | required
`values` | `list[float]` | List of the corresponding continuous activity values. | required
`verbose` | `int` | Set to 0 to turn off the progress bar. | `1`

Returns:

Name | Type | Description
---|---|---
`train_idx` | `list[int]` | List of train indices.
`clusters_idx` | `list[list[int]]` | List of lists of cluster indices.
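
As a rough illustration of how the returned indices might be consumed, the sketch below maps them back onto the input lists. The helper `unpack_lo_split` is hypothetical, and the index values are hard-coded placeholders rather than real output. In actual use they would come from something like `train_idx, clusters_idx = LoSplitter().split(smiles, values)` after `from splito.lohi import LoSplitter`.

```python
def unpack_lo_split(
    smiles: list[str],
    values: list[float],
    train_idx: list[int],
    clusters_idx: list[list[int]],
) -> tuple[list[tuple[str, float]], list[list[tuple[str, float]]]]:
    """Materialise the training data and one test subset per cluster."""
    train = [(smiles[i], values[i]) for i in train_idx]
    test_clusters = [
        [(smiles[i], values[i]) for i in cluster] for cluster in clusters_idx
    ]
    return train, test_clusters


# Made-up inputs and placeholder indices, for illustration only.
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "c1ccccc1O", "c1ccccc1N"]
values = [5.1, 6.3, 4.8, 7.0, 7.9, 6.2]
train_idx = [0, 3]               # placeholder, not real splitter output
clusters_idx = [[1, 2], [4, 5]]  # placeholder, not real splitter output

train, test_clusters = unpack_lo_split(smiles, values, train_idx, clusters_idx)

# Flatten the clusters when a single test index list is needed.
test_idx = [i for cluster in clusters_idx for i in cluster]

# A sanity check worth running: no index should be in both train and test.
overlap = set(train_idx) & set(test_idx)
print(len(train), [len(c) for c in test_clusters], overlap)
```

Keeping the clusters separate rather than flattening them makes it possible to score a model within each cluster, which fits the stated goal of evaluating sensitivity to minor modifications.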