Additional/new approach to estimating class overlap #8

Open
lindsaydbrin opened this issue Mar 1, 2022 · 1 comment
@lindsaydbrin
Contributor

I'd like to add a new approach to estimating class overlap, but have questions about how best to implement this.

Currently, per-sample class overlap/probabilities are calculated (via compute_expectation_with_monte_carlo()) by:

  1. Tallying the class counts of the kNNs (into an array)
  2. Dividing the entire array by the Parzen-window volume

This is then added to expectation (to build the S-matrix), which is eventually normalized by the total mass of the row.

This approach considers the total area of the kNNs (via the Parzen-window normalization), but not the distances/similarities to each neighbor. One consequence of this approach (desired or not) is that the farthest neighbor can have a disproportionate impact on the measured overlap with another class. E.g., compare two cases where a sample in class A has all of its neighbors in class B: in (1) all neighbors are at a moderate distance, whereas in (2) all neighbors but one are very close and the last is very far. If I'm understanding correctly (and correct me if I'm wrong), (1) would appear to have greater overlap because of its smaller Parzen window, whereas one might prefer (2) to be measured as having greater overlap, because most of its neighbors are much closer and in theory it should be harder to determine a class boundary there.
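To make the concern concrete, here is a toy illustration (just a sketch; it assumes the window volume scales with the distance to the farthest neighbor, which is the intuition behind get_volume() rather than its exact implementation):

            import numpy as np

            # Case (1): all 5 neighbors are in class B at a moderate distance
            dists_1 = np.array([0.5, 0.5, 0.5, 0.5, 0.5])
            # Case (2): 4 neighbors are very close, 1 is very far
            dists_2 = np.array([0.1, 0.1, 0.1, 0.1, 2.0])

            def window_volume(dists):
                # Stand-in for get_volume(): proportional to the radius of the kNN ball
                return dists.max()

            # The class-B proportion is 1.0 in both cases; only the window differs
            print(1.0 / window_volume(dists_1))  # 2.0 -> case (1) looks like stronger overlap
            print(1.0 / window_volume(dists_2))  # 0.5 -> case (2) looks weaker, despite 4 very close neighbors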

For comparison, I'd like to try calculating per-sample class overlap/probabilities by the following (continuing the numbering from above):

  3. Summing the kNNs' cosine similarities per class (into an array), i.e., summing similarities rather than tallying counts

This would then be added to something similar to expectation, but divided by the per-class sample count rather than normalized by row mass, so that the relative similarity information is maintained and one can compare across rows.

I.e., the original code does this:

            probability = np.array(
                # kNN class proportions
                [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
            )
            # Get the Parzen-window probability
            probability_norm = probability / get_volume(data[indices_k])  # Parzen-window normalized

where probability is 1. above, and probability_norm is 2. above. This is then added to expectation, which is later normalized as expectation[class_ix] /= expectation[class_ix].sum().
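For reference, here is a condensed sketch of how those pieces fit together in the current flow (variable names are paraphrased from the surrounding code, not quoted verbatim):

            # Inside the Monte Carlo loop, for a sample drawn from class class_ix:
            probability = np.array(
                [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
            )
            probability_norm = probability / get_volume(data[indices_k])  # Parzen-window normalized
            expectation[class_ix] += probability_norm

            # After the loop, each row is normalized by its total mass:
            expectation[class_ix] /= expectation[class_ix].sum()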

My suggestion is to do this:

            sim_scores = np.array(
                [similarities_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
            )

where sim_scores is 3. above. This would then be added to expectation, which would be normalized as expectation[class_ix] /= len(class_indices[class_ix]).
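The corresponding sketch with the suggested change would look roughly like this (again paraphrased; it assumes similarities_k holds the cosine similarities of the k neighbors and class_indices[class_ix] holds the sample indices of class class_ix, as in my PR):

            # Inside the Monte Carlo loop, for a sample drawn from class class_ix:
            sim_scores = np.array(
                [similarities_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
            )
            expectation[class_ix] += sim_scores

            # After the loop, normalize by the class sample count rather than the row mass,
            # so rows remain comparable to one another:
            expectation[class_ix] /= len(class_indices[class_ix])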

Apologies for any confusion caused by variable names that come from my current PR rather than the original code! I have this working (seemingly correctly) within the function, although it breaks downstream code as described below.

I have a couple of implementation questions, basically:

  • Should this be an optional alternate calculation method?
  • If so, should this replace or add to the results already being passed by compute_expectation_with_monte_carlo()?
  • How should this be implemented in conjunction with the distance metric, given that I'm thinking about this in terms of cosine similarity rather than Euclidean distance? (Although I suppose it could work with distance, just with an inverted interpretation of high vs. low values...?)

Some thoughts:

  • With my current PR, the function would return, for each sample (by sample index), 1. and 2. above. With the new approach, you'd return 3., but I think you'd still want to return 1. above (kNN counts - why not) and I don't think it really matters if you return 2. or not. So 3. could replace 2., but it could also just be another array that gets returned (i.e. within the SimilarityArrays dataclass).
  • You'd want to return a new S-matrix, but I am pretty sure that just replacing the current one will break code downstream related to the actual spectral clustering metric. So it might make sense to return it in addition rather than as a replacement.
  • I'm not sure, but I think that with larger datasets, summing the similarities by class would increase computation time annoyingly. So probably this should be optional.
  • If you don't need to check/change the distance metric, this fits into the logic of compute_expectation_with_monte_carlo() really easily (one metric-agnostic option is sketched below). If you want it to only be done with cosine similarity, I don't know whether you'd want to throw an error when the wrong distance metric is selected, silently ignore it (which seems like a bad idea), or do something else?
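One way to keep the calculation metric-agnostic would be to convert whatever distances the estimator already computes into similarities before summing. A minimal sketch (the exp(-d) kernel is just an example, not something in the current code):

            import numpy as np

            def distances_to_similarities(distances_k: np.ndarray) -> np.ndarray:
                """Map kNN distances (cosine or Euclidean) to similarities in (0, 1]."""
                # Any monotonically decreasing map would do; exp(-d) is a common choice.
                return np.exp(-distances_k)

            # The per-class sums then work for any metric, e.g.:
            # sim_scores = np.array([distances_to_similarities(distances_k)[target_k == c].sum()
            #                        for c in range(n_class)])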
@Dref360
Owner

Dref360 commented Mar 3, 2022

Hello,

Yeah I think this is a great idea.

We always work with distances, so this should work with both cosine and Euclidean; we would just need to invert the distances into similarities in both cases.

I think we could add a new parameter to CumulativeGradientEstimator to select the algorithm. By default we would use the Parzen-Window.
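If it helps, here is a rough sketch of how such a flag could gate the two calculations inside compute_expectation_with_monte_carlo() (the parameter name and its values are placeholders, not existing arguments):

            # overlap_method would be forwarded from CumulativeGradientEstimator (name is hypothetical)
            if overlap_method == "parzen":
                contribution = probability / get_volume(data[indices_k])
            else:  # e.g. "similarity_sum"
                contribution = np.array(
                    [similarities_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
                )
            expectation[class_ix] += contribution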

  1. I think in your example, 3 would replace 2 when specified.
  2. The S-matrix should still work, as it is just a similarity matrix.
  3. We can run some benchmarks once we have the code running and decide.
  4. I would try to accommodate every distance metric supplied.

I hope this helps, and let's chat if there are any issues with the implementation.
