GridPartitioning fails with odd number of samples to select #156

Closed
marco-2023 opened this issue Aug 10, 2023 · 7 comments

Comments

@marco-2023
Collaborator

@Ali-Tehrani I am going through the Jupyter Notebook tutorial, and the GridPartitioning methods fail when asked to select an odd number of samples. I traced the error to line 335 of the partition module, which calls the compute_diversity function (diversity module) to compute the diversity of each bin, passing the array of the bin's elements as the only argument. The root cause is in compute_diversity, which by default uses the hypersphere_overlap_of_subset method (line 281 of the diversity module). That method needs two arguments (the data of the subset and the total data), and compute_diversity cannot provide both.

Would one option be to change the diversity function that is passed as an argument to compute_diversity on line 335 of the partition module?
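The mismatch described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than DiverseSelector code: `overlap_stand_in`, `spread_stand_in`, `FUNC_DICT`, and `compute_diversity_sketch` are made-up names, and the metrics themselves are placeholders. Only the call shape (a dispatcher that forwards a single array to whichever function is selected) follows the description in this comment.

```python
import numpy as np


def overlap_stand_in(subset, full_data):
    """Hypothetical two-argument metric (stand-in for a subset-vs-full-data score)."""
    return float(np.linalg.norm(full_data.mean(axis=0) - subset.mean(axis=0)))


def spread_stand_in(subset):
    """Hypothetical one-argument metric that needs only the subset."""
    return float(np.var(subset))


# Hypothetical dispatch table; the names are illustrative only.
FUNC_DICT = {"overlap": overlap_stand_in, "spread": spread_stand_in}


def compute_diversity_sketch(features, div_type="overlap"):
    """Forward a single array to the selected metric, as the issue describes."""
    if div_type not in FUNC_DICT:
        raise ValueError(f"Diversity type {div_type} not supported.")
    return FUNC_DICT[div_type](features)


bin_points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

# The default metric expects two arrays but receives one, so Python raises TypeError.
try:
    compute_diversity_sketch(bin_points)
    default_failed = False
except TypeError as err:
    default_failed = True
    print(f"TypeError: {err}")

# Selecting a one-argument metric explicitly avoids the arity mismatch.
one_arg_value = compute_diversity_sketch(bin_points, div_type="spread")
```

In this sketch the failure comes only from the number of arguments, which is why selecting a one-argument metric at the call site is enough to avoid it. Whether that is the right choice for the real library is a separate design question.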

@FanwangM
Collaborator

For some history on this issue, see #134.

@marco-2023
Collaborator Author

Thanks, @FanwangM, I see now that it will be taken care of.

@FarnazH
Member

FarnazH commented Sep 9, 2023

@marco-2023, is this an issue? If so, can you please share a code snippet to show this failure?

@marco-2023
Collaborator Author

Yes, it is. The problem occurs whenever compute_diversity is called on line 335, not only for an odd number of samples. Below is an example that triggers the error.

from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import numpy as np
from DiverseSelector import GridPartitioning

# Generate synthetic data using make_blobs: 100 samples, 2 features, 1 cluster
coords, class_labels = make_blobs(n_samples=100, n_features=2, centers=1, random_state=42)

# Selecting 13 diverse data points from the first dataset (100 points uniformly distributed in one
# cluster). 
selector = GridPartitioning(2,"equisized_independent")
selected_ids1 = selector.select(coords, size=13)

The result is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 13
     10 # Selecting 13 diverse data points from the first dataset (100 points  uniformly  distributed in one
     11 # cluster). 
     12 selector = GridPartitioning(2,"equisized_independent")
---> 13 selected_ids1 = selector.select(coords, size=13)

File /mnt/Data/Work/Ayers/QC-Devs/DiverseSelector/DiverseSelector/methods/base.py:65, in SelectionBase.select(self, arr, size, labels)
     60     raise ValueError(
     61         f"Size of subset {size} cannot be larger than number of samples {len(arr)}."
     62     )
     64 if labels is None:
---> 65     return self.select_from_cluster(arr, size)
     67 # compute the number of samples (i.e. population or pop) in each cluster
     68 unique_labels = np.unique(labels)

File /mnt/Data/Work/Ayers/QC-Devs/DiverseSelector/DiverseSelector/methods/partition.py:335, in GridPartitioning.select_from_cluster(self, arr, num_selected, cluster_ids)
    333 diversity = []
    334 for bin_idx, bin_list in bins.items():
--> 335     diversity.append((compute_diversity(arr[bin_list]), bin_idx))
    336 diversity.sort(reverse=True)
    337 for _, bin_idx in diversity[:num_needed]:

File /mnt/Data/Work/Ayers/QC-Devs/DiverseSelector/DiverseSelector/diversity.py:77, in compute_diversity(features, div_type)
...
---> 77     return func_dict[div_type](features)
     78 else:
     79     raise ValueError(f"Diversity type {div_type} not supported.")

TypeError: hypersphere_overlap_of_subset() missing 1 required positional argument: 'x'

@FanwangM pointed out in #134 (comment) that this problem should be fixed by merging #138. That change sets the default method of compute_diversity to "entropy", which needs only the array of samples to compute the diversity and is therefore compatible with the way GridPartitioning calls the function.
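To show why a one-argument default is compatible with the loop in the traceback, here is a minimal sketch of that bin-ranking pattern. It assumes a hypothetical score, `mean_pairwise_distance`, which is not the library's "entropy" measure, and a hypothetical helper, `rank_bins`, with made-up bin labels. Only the structure (score each bin from its own points, sort in descending order, keep the top `num_needed`) mirrors the lines shown in the traceback.

```python
import numpy as np


def mean_pairwise_distance(points):
    """Hypothetical one-argument diversity score (not the library's "entropy")."""
    n = len(points)
    if n < 2:
        return 0.0
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Average over the n * (n - 1) off-diagonal entries; the diagonal is zero.
    return float(dists.sum() / (n * (n - 1)))


def rank_bins(arr, bins, num_needed, diversity_func=mean_pairwise_distance):
    """Return the labels of the `num_needed` bins with the highest score."""
    scored = [(diversity_func(arr[idx_list]), bin_idx) for bin_idx, idx_list in bins.items()]
    scored.sort(reverse=True)
    return [bin_idx for _, bin_idx in scored[:num_needed]]


# Toy data: five 2D points assigned to three hypothetical bins.
arr = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 1.0], [2.0, 2.0]])
bins = {"a": [0, 1], "b": [2, 3], "c": [4]}

top_bins = rank_bins(arr, bins, num_needed=2)
```

Because the score depends only on the points inside each bin, the call inside the loop needs no second argument, which is the property the "entropy" default is described as having.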

@FanwangM
Collaborator

FanwangM commented Sep 12, 2023

I will work on this issue later today by resolving the merge conflicts. Thanks for providing the detailed error information, which refreshed my memory of this problem. @marco-2023

@FanwangM
Collaborator

#138 is merged.

@FarnazH
Member

FarnazH commented Oct 27, 2023

This problem is fixed, so I will close this issue. Please reopen or comment if something is still wrong.

@FarnazH closed this as completed Oct 27, 2023