Opening this issue after a nice suggestion from @davnn.
Some clusterers (eg, scikit-learn's DBSCAN) only deliver labels for the training data and cannot immediately label new, unseen data. In that case one can use any ordinary classifier (eg, KNNClassifier from NearestNeighborModels.jl) to generate labels for the new data.
If the classifier is a probabilistic predictor, we can even get "fuzzy" labels (as GMMClusterer from BetaML already provides) - which could be useful even for clusterers that already generalise to new data.
Any design depends on firming up the API for clusterers: JuliaAI/MLJ.jl#852
One possible implementation (requiring MLJBase as a dependency) is to use a learning network (wrapped in a `fit` definition) to define the new model (see, eg, TransformedTargetModel). One advantage is that changes to the classifier's hyper-parameters would not trigger re-training of the base clusterer. (You could arrange that with a "hard-wired" implementation, but that would mean duplicating logic we already have, extra testing, etc.) See below for a proof-of-concept.
using MLJBase
using MLJModels
pure_clusterer = (@load DBSCAN pkg=ScikitLearn)()
classifier = (@load KNNClassifier)()
Xraw, yraw = make_blobs(1000, rng=123)
X, Xtest = partition(Xraw, 0.5)
_, ytest = partition(yraw, 0.5)
# the learning network (with training data at the source node):
Xs = source(X)

# this clusterer stores the training labels in its fitted_params:
mach1 = machine(pure_clusterer, Xs)
Θ = node(fitted_params, mach1)
y = node(θ -> θ.labels, Θ) # the training labels

# classifier will train using the training labels `y`:
mach2 = machine(classifier, Xs, y)
ŷ = predict(mach2, Xs) # returns probability distributions

# train the network:
fit!(ŷ)

# getting "probabilistic" labels for new data:
ŷ(Xtest);
# getting labels for new data:
y = mode.(ŷ(Xtest));
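# (an alternative sketch, not in the original: point predictions could also come
# from a `predict_mode` node, eg `ŷ_point = predict_mode(mach2, Xs)`, trained
# with `fit!(ŷ_point)` and then called as `ŷ_point(Xtest)`)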
# good agreement up to relabelling:
julia> zip(ytest, y) |> collect
(1, 3)
(2, 2)
(2, 2)
(1, 3)
(3, 1)
(1, 3)
(2, 2)
(1, 3)
(1, 3)
(2, 2)
(2, 2)
(3, 1)
(1, 3)
(3, 1)
(3, -1)
...
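For concreteness, here is a rough, untested sketch of how the network above might be wrapped in a `fit` definition to export it as a standalone model, using the surrogate-machine / `return!` export pattern from the MLJ manual. The type name `ClustererAsProbabilisticClassifier` and its fields are hypothetical, and the exact supertype and surrogate signature would depend on how the clusterer API in JuliaAI/MLJ.jl#852 is settled:

using MLJBase

# hypothetical composite model wrapping a clusterer and a classifier:
mutable struct ClustererAsProbabilisticClassifier <: ProbabilisticComposite
    clusterer
    classifier
end

function MLJBase.fit(model::ClustererAsProbabilisticClassifier, verbosity, X)
    Xs = source(X)

    # cluster the training data and extract the training labels:
    mach1 = machine(model.clusterer, Xs)
    Θ = node(fitted_params, mach1)
    y = node(θ -> θ.labels, Θ)

    # train the classifier on those labels and predict probabilistically:
    mach2 = machine(model.classifier, Xs, y)
    ŷ = predict(mach2, Xs)

    # export the network using a surrogate machine:
    mach = machine(Probabilistic(), Xs; predict=ŷ)
    return!(mach, model, verbosity)
end

With such a definition, `machine(ClustererAsProbabilisticClassifier(pure_clusterer, classifier), X)` could then be used like any other probabilistic classifier, and changing the classifier's hyper-parameters should leave the clusterer's training untouched - the advantage mentioned above.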
Thoughts anyone?
@juliohm @jbrea @OkonSamuel @alyst