I'd like to add a new approach to estimating class overlap, but have questions about how best to implement this.
Currently, per-sample class overlap/probabilities are calculated (via compute_expectation_with_monte_carlo()) by:
1. Tallying the class counts of the kNNs (into an array)
2. Dividing the entire array by the Parzen-window volume
This is then added to expectation (to build the S-matrix), which is eventually normalized by total mass of the row.
This approach considers the total area of the kNNs (via the Parzen-window normalization), but not the distances/similarities to each neighbor. One consequence of this (desired or not) is that the farthest neighbor can have a disproportionate impact on the measured overlap with another class. E.g., compare two cases where a sample in class A has all of its neighbors in class B: in (1) all neighbors are at a moderate distance, whereas in (2) all neighbors but one are very close, and that one is very far. If I'm understanding correctly (and correct me if I'm wrong), (1) would appear to have greater overlap because of its smaller Parzen window, whereas one might prefer (2) to be measured as having the greater overlap, since most of its neighbors are much closer and in theory it should be harder to determine a class boundary there.
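To make the two cases concrete, here is a toy 1-D sketch; `toy_volume()` is only a stand-in for the neighbor spread, not the library's actual `get_volume()`:

```python
import numpy as np

def toy_volume(neighbors):
    # Stand-in for get_volume(): the 1-D spread of the kNNs.
    return neighbors.max() - neighbors.min()

# Sample from class A sits at 0; all five neighbors belong to class B.
case1 = np.array([0.50, 0.55, 0.60, 0.65, 0.70])  # (1) all at a moderate distance
case2 = np.array([0.05, 0.06, 0.07, 0.08, 2.00])  # (2) four very close, one very far

for name, nghbrs in [("case 1", case1), ("case 2", case2)]:
    probability = np.array([0.0, 1.0])  # kNN class proportions: all class B
    probability_norm = probability / toy_volume(nghbrs)
    print(name, probability_norm)
# case 1 -> [0.  5.  ]  (tight window, looks like strong overlap with B)
# case 2 -> [0.  0.51]  (one far neighbor inflates the window, overlap looks weak)
```

Under the current normalization, (1) looks like the stronger overlap, even though in (2) four neighbors sit practically on top of the sample.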
For comparison, I'd like to try to calculate per-sample class overlap/probabilities by the following:
3. Summing the kNNs' cosine similarities per class (into an array) (i.e., summing similarities rather than tallying counts)
This would then be added to something similar to expectation, but divided by the class's sample count rather than normalized by row mass, so that the relative similarity information would be maintained and one could compare across rows.
I.e., the original code does this:
    probability = np.array(
        # kNN class proportions
        [(target_k == nghbr_class).sum() / k_nearest for nghbr_class in range(n_class)]
    )
    # Get the Parzen-window probability
    probability_norm = probability / get_volume(data[indices_k])  # Parzen-window normalized
where probability is 1. above, and probability_norm is 2. above. This is then added to expectation, which is later normalized as expectation[class_ix] /= expectation[class_ix].sum().
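Putting 1., 2., and the row normalization together, my mental model of the current flow looks roughly like the following self-contained sketch (`toy_volume()` again stands in for `get_volume()`, and this is a paraphrase rather than the actual library code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_class, k_nearest = 2, 5
data = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
target = np.array([0] * 20 + [1] * 20)
class_indices = [np.flatnonzero(target == c) for c in range(n_class)]

nn = NearestNeighbors(n_neighbors=k_nearest + 1).fit(data)
_, all_indices = nn.kneighbors(data)  # column 0 is the sample itself

def toy_volume(neighbors):
    # Stand-in for get_volume(): bounding-box volume of the kNNs.
    return np.prod(neighbors.max(0) - neighbors.min(0))

expectation = np.zeros((n_class, n_class))
for class_ix in range(n_class):
    for sample_ix in class_indices[class_ix]:
        indices_k = all_indices[sample_ix, 1:]
        target_k = target[indices_k]
        # 1. kNN class proportions
        probability = np.array(
            [(target_k == c).sum() / k_nearest for c in range(n_class)]
        )
        # 2. Parzen-window normalization, accumulated into the row for this class
        expectation[class_ix] += probability / toy_volume(data[indices_k])
    # Row normalized by its total mass -> one row of the S-matrix
    expectation[class_ix] /= expectation[class_ix].sum()

print(expectation)
```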
My suggestion is to do this:
    sim_scores = np.array(
        [similarities_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
    )
where sim_scores is 3. above. This would then be added to expectation, which would be normalized as expectation[class_ix] /= len(class_indices[class_ix]).
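To illustrate why I think the class-size normalization keeps rows comparable while the row-mass normalization doesn't, here's a contrast on a fabricated 2-class expectation matrix (the numbers are made up):

```python
import numpy as np

# Accumulated per-sample scores before normalization (fabricated values).
# Row 0 = class A, row 1 = class B; columns are the classes being overlapped with.
expectation = np.array([[6.0, 2.0],
                        [0.5, 1.0]])
class_counts = np.array([4, 2])  # hypothetical samples per class

# Current: each row divided by its own mass -> only within-row proportions survive.
row_mass_norm = expectation / expectation.sum(axis=1, keepdims=True)

# Proposed: each row divided by its class size -> the absolute similarity level is
# kept, so a "weak" row stays visibly weak next to a "strong" one.
class_size_norm = expectation / class_counts[:, None]

print(row_mass_norm)    # [[0.75  0.25 ]
                        #  [0.333 0.667]]
print(class_size_norm)  # [[1.5   0.5  ]
                        #  [0.25  0.5  ]]
```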
Apologies for any confusion from variable names from my current PR rather than the original code! I have this working (seemingly correctly) within the function, although it breaks code downstream as described below.
I have a couple of implementation questions, basically:
Should this be an optional alternate calculation method?
If so, should this replace or add to the results already being passed by compute_expectation_with_monte_carlo()?
How should this be implemented in conjunction with the distance metric, given that I'm thinking about this in terms of cosine similarity rather than euclidean distance? (Although I suppose it could work with distance, just with an inverted interpretation of what a high/low value means; a rough sketch of that inversion follows below.)
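On that last question, one possible way to make the same "sum per class" idea work with euclidean distances would be to turn them into decaying weights first; this is only an illustration of the inverted interpretation, and the kernel choice here is arbitrary:

```python
import numpy as np

distances_k = np.array([0.1, 0.2, 0.2, 1.5, 2.0])  # toy kNN distances to one sample
target_k = np.array([1, 1, 0, 0, 0])               # toy kNN class labels
n_class = 2

weights_k = np.exp(-distances_k)  # close neighbors -> weight near 1, far -> near 0
sim_scores = np.array(
    [weights_k[target_k == nghbr_class].sum() for nghbr_class in range(n_class)]
)
print(sim_scores)  # [1.18 1.72]: two close class-1 neighbors outweigh three mostly-farther class-0 ones
```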
Some thoughts:
With my current PR, the function would return, for each sample (by sample index), 1. and 2. above. With the new approach, you'd return 3., but I think you'd still want to return 1. above (kNN counts, why not), and I don't think it really matters whether you return 2. or not. So 3. could replace 2., but it could also just be another array that gets returned (i.e., as an additional field within the SimilarityArrays dataclass; see the sketch after these thoughts).
You'd want to return a new S-matrix, but I am pretty sure that just replacing the current one will break code downstream related to the actual spectral clustering metric. So it might make sense to return it in addition rather than as a replacement.
I'm not sure, but I think that with larger datasets, summing the similarities by class would increase computation time annoyingly. So probably this should be optional.
If you don't need to check/change the distance metric, this fits into the logic of compute_expectation_with_monte_carlo() really easily. If you want it to only be done with cosine similarity, I don't know whether you'd want to throw an error if the wrong distance metric were selected, silently ignore the option (which seems like a bad idea), or something else. One possible shape for both of these points is sketched below.
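To make the "optional and additive" idea concrete, here is one shape it could take; the field and parameter names are placeholders of mine, not the actual names in the codebase:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

# Placeholder field names -- the point is only that 3. is added alongside 1. and 2.
# rather than replacing them, and that it stays None unless the option is enabled.
@dataclass
class SimilarityArrays:
    sample_probability: np.ndarray                  # 1. kNN class proportions
    sample_probability_norm: np.ndarray             # 2. Parzen-window normalized
    sample_similarity: Optional[np.ndarray] = None  # 3. summed cosine similarities (new, optional)

def check_similarity_option(distance: str, use_similarity: bool) -> None:
    # Guard for the new option; raising loudly seems safer than silently ignoring it.
    if use_similarity and distance != "cosine":
        raise ValueError("use_similarity=True currently assumes distance='cosine'")
```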