Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Learning graph in mach #759

Closed
wants to merge 8 commits into from
Closed

Learning graph in mach #759

wants to merge 8 commits into from

Conversation

olivierlabayle
Copy link
Collaborator

@olivierlabayle olivierlabayle commented Apr 13, 2022

This is a proposal to extend the learning network API, this should be almost not breaking, it enables caching and other fit options (like acceleration) to be passed to submachines.

Related issues: #756

The proposed routine is to declare a learning_network(mach::Machine{<:Composite}, X, y; kwargs...) method instead of the current fit for a new composite model. This is examplified with the Stack but the Pipelines for instance are kept on the current API.

The only caveat is that learning networks defined that way can only be used inside machines.

If the approach looks encouraging to you and you want to push that further, my remaining TODOs are at least:

  • Take care of the update mechanism
  • Maybe add a warning/info to encourage users to use the new API? --> This will break some logging tests that I will need to manage.
  • Update/Improve some docstrings: return!, ... ?
  • I am still not understanding why there is a data field in the cache of a composite model (even when setting cache=false) and more generally the anonymization routine since the data is still available anyway. I think @ablaom, you mentionned we could remove that gymnastic in a previous chat. I would be happy to get some guidance and add this to the PR if this is not breaking either.

Please let me know if there are other things I haven't considered.

@olivierlabayle olivierlabayle marked this pull request as draft April 13, 2022 14:05
@olivierlabayle olivierlabayle requested a review from ablaom April 13, 2022 14:05
@olivierlabayle olivierlabayle linked an issue Apr 13, 2022 that may be closed by this pull request
@codecov-commenter
Copy link

codecov-commenter commented Apr 14, 2022

Codecov Report

Merging #759 (38a576f) into dev (95efa58) will increase coverage by 0.55%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #759      +/-   ##
==========================================
+ Coverage   85.41%   85.97%   +0.55%     
==========================================
  Files          36       36              
  Lines        3435     3486      +51     
==========================================
+ Hits         2934     2997      +63     
+ Misses        501      489      -12     
Impacted Files Coverage Δ
src/composition/learning_networks/machines.jl 93.07% <100.00%> (+1.11%) ⬆️
src/composition/models/methods.jl 96.66% <100.00%> (-3.34%) ⬇️
src/composition/models/stacking.jl 94.48% <100.00%> (-0.04%) ⬇️
src/machines.jl 87.50% <100.00%> (+0.83%) ⬆️
src/measures/measures.jl 70.29% <0.00%> (-0.58%) ⬇️
src/init.jl 100.00% <0.00%> (ø)
src/MLJBase.jl 60.00% <0.00%> (ø)
src/show.jl 38.27% <0.00%> (+1.23%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95efa58...38a576f. Read the comment docs.

@olivierlabayle
Copy link
Collaborator Author

@ablaom I think this is ready for review if you have some time to dedicate to it. As an additional demonstration of the benefits of this PR, you'll see on this branch a multithreaded Stack test. I haven't added that to the PR since I thought this was big enough but happy to take your comments.

@olivierlabayle olivierlabayle marked this pull request as ready for review April 14, 2022 10:16
@ablaom
Copy link
Member

ablaom commented Apr 19, 2022

@olivierlabayle I'd like to acknowledge this proposal which I haven't reviewed yet. It sounds interesting, but it may take me a couple weeks to look properly at it.

You have made raised a couple of related issues recently (caching, parallelism). Can you help me prioritise my review? Should I look at this one before the others? What are the implications of the current proposal on those others, if any. For example, is possible that implementation of this current proposal could mean the others could be resolved in a non-breaking way?

@olivierlabayle
Copy link
Collaborator Author

@ablaom Thank you for getting back to me.

Indeed there are many connecting topics that I am currently trying to address. The two most important ones are memory management and parallelism. This PR is an attempt to partialy address both with a small design change while minimally (one corner case explained below) breaking the API. This is definitely the place to start.

The general idea of the PR is to force the use of machines for a user to interact (understand fit, predict API) with Composite models. More specifically if one is ready to give up the use of fit(composite_model, verbosity, X, y) etc... and always resort to fit!(machine(composite_model, X, y))... then you can forward all fitting keyword arguments to the submachines and access the top machine's internals. This PR examplifies this with the Stack and the forwarding of caching to submachines. In this branch which is based on top of this PR, you can see the further benefit for parallelization with a forwarding of the acceleration parameter to composite models. You basically have multithreading of composite models for free (provided underlying models are threadsafe of course).

If you are happy with the PR, the last thing I would like to add for now, is the removal of anonymization process since you said it may be avoided. I would be happy to get your guidance on that and the important parts in the code since I didn't understand the use case in the first place.

I hope this helps.

@ablaom
Copy link
Member

ablaom commented May 3, 2022

@olivierlabayle Thanks for looking into this and for your patience. I have now reviewed this PR in some detail.

I have to say I really like the idea in this PR, but there are some drawbacks:

  • Composite models don't implement the full MLJ model interface (no fit(::Model). This has two important consequences. First, it's breaking. EnsembleModel(model=...) (MLJEnsembes.jl) and BinaryThresholdPredictor(model=...) (MLJModels.jl) call fit on the atomic model, and so won't work with, say, Stack, as re-imlemented here. Perhaps there are other places. In principle, these could be addressed with some disruption. Secondly, this a seismic shift conceptually. Currently, machines are a higher level abstraction for user-interaction and composite model building, but there is always the possibility of 3rd party packages buying into MLJ models at the lower level, which I think is nice. I believe GeoStats.jl actually does this. We often say "composite models in MLJ behave like any other model". Well, they wouldn't after this PR.

  • Rather than acceleration being a hyper-parameter of each composite model, which I would find the most natural, it is something apart from the model, and has a global nature: if I call fit!(composite_machine, acceleration=...) then that particular mode of acceleration is applied to training of the learning network for that composite, but also for any nested composite model. This might be convenient behaviour most of the time, but it is less powerful. For example, at the top level you may want distributed parallelism, while at some lower level you want multithreading or no parallelism. Also, I'm not sure nested Distributed parallelism "just works". Acceleration is already a hyperparameter of some ordinary models (eg, MLJFlux models) and in the model wrappers TunedModel, and EnsembleModel. It seems incongruent to have a different interface point for Composite models.

I've had a think about this, and have an alternative suggestion for dealing with acceleration which does not have these drawbacks. I realize this does not address #756, but could you take a look at #762 and let me know what you think?

@olivierlabayle
Copy link
Collaborator Author

@ablaom Thanks for taking the time to look into this!

  • That's definitely true. My original idea was that composition with learning networks is about composing machines not models. If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase. So, enforcing machines for composition may not be so bad, I appreciate however that this is a conceptual change even though I tried to make it as minimal as possible. By the way, I have tried to run the Stack with an EnsembleModel and it seems to run just fine unless I missed anything.
X, y = make_regression(500, 5)

models = (constant=DeterministicConstantRegressor(),
              decisiontree=DecisionTreeRegressor(),
              ridge_lambda=FooBarRegressor(;lambda=0.1),
              ensemble=EnsembleModel(FooBarRegressor(;lambda=1.)),
              ridge=FooBarRegressor(;lambda=0))

mystack = Stack(;metalearner=FooBarRegressor(),
                  resampling=CV(;nfolds=3),
                  models...)
mach = machine(mystack, X, y)
fit!(mach, verbosity=0)

The only caveat I see is if someone is actually using fit(mystack, verbosity, X, y) without a machine. That, is definitely broken.

  • I understand, I had not really though about that as a problem indeed. I think the question boils down to what needs to be inherited by submachines and what doesn't. Maybe the acceleration is in the later category and the hyperparameter you suggest is indeed more modular (I need to take a proper look at your proposal since I have seen you added a new field to the Machine, which was not what I originally thought was your intention). Another "in between" option capitalizing on both ideas, would be to make acceleration a reserved name and we could give priority to the model hyperparameter if implemented.

Here is a vague illustration to show the idea:

function fit!(mach::Machine; acceleration=CPU1(), kwargs...)
    glb_node = glb(mach.args...) # greatest lower bound node of arguments
    fit!(glb_node; kwargs...)
    acceleration = hasfield(mach.model, "acceleration") ? mach.model.acceleration : acceleration
    fit!(glb_node; acceleration=acceleration, kwargs...)
    fit_only!(mach; kwargs...)
end

One advantage of this PR's proposed implementation (and the "in between" one) is that a user implementing a learning network does not need to worry about parallelism and still benefit from it, which I think is the most common scenario.

@ablaom
Copy link
Member

ablaom commented May 4, 2022

By the way, I have tried to run the Stack with an EnsembleModel and it seems to run just fine unless I missed anything.

No I mean the other way around: EnsembleModel(model=mystack, ...).

check_signature(signature)
# Build composite Fitresult
fitresult = CompositeFitresult(signature)
fitresult.network_model_names = network_model_names(mach.model, fitresult)

# Fit all machines in the learning network
glb_node = glb(fitresult)
fit!(glb_node; kwargs...)
acceleration = hasfield(T, :acceleration) && getfield(mach.model, :acceleration) !== nothing ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead use hasproperty and getproperty. In general model struct fields and properties can be different but it is the properties that are "public".

@@ -147,7 +149,7 @@ report(mach).cv_report
```

"""
function Stack(;metalearner=nothing, resampling=CV(), measure=nothing, measures=measure, named_models...)
function Stack(;metalearner=nothing, resampling=CV(), measure=nothing, measures=measure, acceleration=nothing, named_models...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the acceleration default nothing instead of CPU1()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal was to:

  • By default propagate the acceleration specified in fit if the stack's acceleration is nothing
  • Take the stack's acceleration instead if specified.

@ablaom
Copy link
Member

ablaom commented May 9, 2022

If someone wants to use the learning network API they will have to use the machine interface anyway and depend on MLJBase.

I think you mean that if a user want's to implement a composite model using the learning network API, then they need to be familiar with machines. That is true. But presently I can use a Stack model without knowing anything about machines. That is, a user of composite models can view machines as an invisible implementation detail, if they want to.

@ablaom
Copy link
Member

ablaom commented May 9, 2022

@olivierlabayle I'm sorry, but after sitting on this a few days, I'm still reluctant to provide, in MLJBase.jl, a facility that encourages composite model definitions that do not meet the MLJ model API. I think this could come back to bite us. I'm also not enthusiastic about fixing the breaking behaviour I've pointed out. Another reason for my reluctance is that the MLJ model composition API is the now the most complex part of the MLJ code base. It is a worthy advancement on other multi-paradigm ML frameworks, but I feel it is reaching the limits of what can be sustainably maintained without a substantial re-design. Putting it another way, I'm afraid we (and I mean "me" predominantly) may be sliding into an "incremental hack".

I think we agree (but correct me if I am wrong) that for acceleration specifically, my suggestion at #762 is acceptable, but with the modification you suggested there and implement here in the latest commit here. However, as the present PR also includes the new "learning graph" code, I wonder if you would consider a separate PR for that?

The other use-case you mention for introducing the learning graph code is for propagating a fixed C (cache) value throughout the machines of a learning network. Now cache is a little different from acceleration as this has never been considered a model-specific parameter, but only a Machine one. There "is no" data caching at the level of a model. I wonder if a combination of these two suggestions already made at #756 would suffice, for now:

@olivierlabayle suggested:

  • Of course I guess there is the possibility of adding a hyperparameter to the composite model that can be transferred after to the machine definitions. This would probably solve my personnal issue (and would be a short term solution) however I don't think this is ideal in the long term because any user defined composite model will have to add this hyperparameter.
    (requires no action apart from implementing, at our leisure, this for MLJ composite models Pipeline, Stack, and so forth).

@ablaom suggested:

Another idea is to introduce a user-modifiable global constant to control the default value of C in the machine constructor. In fact, there is already such a method for acceleration and we are introducing such an option for check_scitype_level at #753.

@olivierlabayle
Copy link
Collaborator Author

@ablaom I understand your concern, I did not appreciate there was such an emphasis on the fit(model, verbosity, X, y) API and rather thought about it as an implementation detail used by the public machine API described in the documentation. I think indeed we are reaching the limitations of the current internal design: the composition API draws us towards the use of machines and the MLJModelInterface holds us back from it. In any case, since we haven't reached v1.0 yet, my humble advice would be to make any redesign choice before we are stuck with an unmaintanaible codebase.

What follows is mostly what I would consider to be a hack that is susceptible to much confusion from a user perspective.

That being said, I think indeed both acceleration and cache can be added as extra hyperparameters to composite models to tweak their fit implementation. I can try such a change in a separate PR. Can you elaborate on your alternative global variable suggestion, I am not sure I understood its link with #753 ?

@ablaom
Copy link
Member

ablaom commented May 9, 2022

Can you elaborate on your alternative global variable suggestion, I am not sure I understood its link with #753 ?

  • In MLJBase.__init__ add a new global constant - something like global DEFAULT_CACHE = Ref{Bool}(true)
  • In src/machines.jl add a function default_cache that you will export with two methods. One, default_cache() to return the value of DEFAULT_CACHE[], and a second default_cache(some_bool) to change it's state. You can mimic this
  • In the machine constructor change the default value of cache keyword to default_cache() && caches_by_default(model).

Note here that the existing model trait caches_by_default trait is badly named. Better would be never_caches_data and we use the negation. It's purpose is to ensure that machines associated with model wrappers (TunedModel, IteratedModel, etc) and composite models don't cache data, as there is no need. (That said, I've just noticed this line which I'm guessing is a bug, do you agree?) By the way, the wrappers just mentioned do have a cache hyper-parameter (as I'm suggesting Stack include) which is passed on to internally created machines.

@olivierlabayle
Copy link
Collaborator Author

Follow up PR: #767

@ablaom
Copy link
Member

ablaom commented Jan 17, 2023

Closing as old and the interface has substantially moved on.

@ablaom ablaom closed this Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control caching of composite models
3 participants