Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added custom semantic segmentation trainer tutorial #1897

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

calebrob6
Copy link
Member

This is a tutorial notebook that shows users how to override a custom semantic segmentation class for training on LandCoverAI.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 21, 2024
@calebrob6
Copy link
Member Author

@adamjstewart I think there is a bug (or at least some weird behavior going on) here

I can create a new class that extends SemanticSegmentationTask:

class CustomSemanticSegmentationTask(SemanticSegmentationTask):

    # any keywords we add here between *args and **kwargs will be found in self.hparams
    def __init__(self, *args, tmax=50, eta_min=1e-6, **kwargs) -> None:
        super().__init__(*args, **kwargs)  # pass args and kwargs to the parent class

Then instantiate and everything works great

task = CustomSemanticSegmentationTask(model="unet", tmax=100, ...)

However when I go to load from file:

task = CustomSemanticSegmentationTask.load_from_checkpoint("lightning_logs/version_3/checkpoints/epoch=0-step=117.ckpt")

I get an error:

TypeError: SemanticSegmentationTask.__init__() got an unexpected keyword argument 'ignore'

I can add del kwargs["ignore"] before super().... in the constructor of CustomSemanticSegmentationTask but this feels like a bad hack.

(@isaaccorley in case you've seen this)

@robmarkcole
Copy link
Contributor

robmarkcole commented Feb 23, 2024

Note that https://www.reviewnb.com/ is free for 'educational' use - would enable previewing the notebook

@calebrob6
Copy link
Member Author

@robmarkcole thanks for the review! Was that easy enough to do (vs reviewnb)?

On my side, I just have to run jupytext --sync custom_segmentation_trainer.ipynb whenever I change either file and commit both (which seems simple enough). We could have CI that makes sure that all py and ipynb are in sync.

@robmarkcole
Copy link
Contributor

Yep easy enough

@isaaccorley
Copy link
Collaborator

@adamjstewart I think there is a bug (or at least some weird behavior going on) here

I can create a new class that extends SemanticSegmentationTask:


class CustomSemanticSegmentationTask(SemanticSegmentationTask):



    # any keywords we add here between *args and **kwargs will be found in self.hparams

    def __init__(self, *args, tmax=50, eta_min=1e-6, **kwargs) -> None:

        super().__init__(*args, **kwargs)  # pass args and kwargs to the parent class

Then instantiate and everything works great


task = CustomSemanticSegmentationTask(model="unet", tmax=100, ...)

However when I go to load from file:


task = CustomSemanticSegmentationTask.load_from_checkpoint("lightning_logs/version_3/checkpoints/epoch=0-step=117.ckpt")

I get an error:


TypeError: SemanticSegmentationTask.__init__() got an unexpected keyword argument 'ignore'

I can add del kwargs["ignore"] before super().... in the constructor of CustomSemanticSegmentationTask but this feels like a bad hack.

(@isaaccorley in case you've seen this)

I don't see an ignore param in your custom class. Did you modify the class code you used after training that checkpoint? If so, one thing you can do is to just load the checkpoint, delete that param in the hparam dict and then save the checkpoint which would fix the error.

@calebrob6
Copy link
Member Author

The ignore param is stored in task.hparams because SemanticSegmentationTask passes it to BaseTask, -- https://github.com/microsoft/torchgeo/blob/main/torchgeo/trainers/segmentation.py#L98.

I want to be able to do something like this:

class CustomSemanticSegmentationTask(SemanticSegmentationTask):
    def __init__(self, *args, my_custom_arg, **kwargs) -> None:
        super().__init__(*args, **kwargs)

i.e. use the constructor from SemanticSegmentationTask and not have to copy paste the args and logic from SemanticSegmentationTask.

This works fine but, when I try to load a version of this class from checkpoint ignore is passed through to SemanticSegmentationTask as it is a kwarg saved in hparams.

@calebrob6
Copy link
Member Author

calebrob6 commented Feb 23, 2024

One workaround that I'm checking is just adding "ignore" to the list of args ignored in save_hyperparameters in BaseTask.

@isaaccorley
Copy link
Collaborator

One workaround that I'm checking is just adding "ignore" to the list of args ignored in save_hyperparameters in BaseTask.

I think this makes the most sense since it's not an actual hparam.

@github-actions github-actions bot added the trainers PyTorch Lightning trainers label Feb 23, 2024
@calebrob6
Copy link
Member Author

The way that I've come up with to check whether an .ipynb is in sync with the corresponding .py is diff <(jupytext --output - --to py custom_segmentation_trainer.ipynb) custom_segmentation_trainer.py. This should only output something like:

3a4
> #     formats: ipynb,py

@github-actions github-actions bot added the testing Continuous integration testing label Feb 23, 2024
@calebrob6
Copy link
Member Author

I don't know why notebook test is being cancelled (maybe because it is trying to run the LandCoverAI split script?).

@adamjstewart
Copy link
Collaborator

I don't know why notebook test is being cancelled (maybe because it is trying to run the LandCoverAI split script?).

Tests being canceled means the job either ran out of time, space, or memory. Here, my guess would be space. We want to use the smallest datasets possible, such as EuroSAT100. Might be worth creating a LandCoverAI100 or something like that.

@adamjstewart
Copy link
Collaborator

Haven't yet had time to evaluate jupytext to decide whether or not it's what we should use. @nilsleh what did you end up using for lightning-uq-box?

@adamjstewart adamjstewart added this to the 0.6.0 milestone Feb 25, 2024
@calebrob6
Copy link
Member Author

I really don't like the idea of making custom datasets/datamodules just to have pretty CI -- it is a large overhead for something that makes the tutorial less cool.

@adamjstewart
Copy link
Collaborator

And I really don't like having tutorials that take 30+ minutes to download a dataset and train a model for hundreds of epochs, or tutorials that can't be tested in CI because they involve more data than our runners can store. There's always a tradeoff. You can also find a smaller dataset instead of making your own.

@calebrob6
Copy link
Member Author

Luckily none of that happens here ;). LandCover.ai is 1.5GB (this will take 30+ minutes to download if your download speed is < 0.83 MB/s), training happens for 1 batch (and I can reduce the batch size to make this faster).

I'm saying that we shouldn't be catching bugs with overly sanitized examples -- if LandCoverAI breaks or Lightning training breaks then our other tests will break. If the example notebooks are catching bugs then we should ask ourselves why. Downloading LandCoverAI and running this now and per release doesn't seem to be a big burden.

@calebrob6
Copy link
Member Author

How about-- is it possible to change LandCoverAI datamodule to use the test data that we already have spent time creating for this notebook (then comments saying if you actually want to play with data, then do this other thing)?

@nilsleh
Copy link
Collaborator

nilsleh commented Feb 26, 2024

Haven't yet had time to evaluate jupytext to decide whether or not it's what we should use. @nilsleh what did you end up using for lightning-uq-box?

So I tried jupytext, but for my tutorials I couldn't get the jupytext scripts to execute and be displayed on the documentation. I went back to notebooks, and the notebooks are now run when the documentation builds. However, I don't need to download any data and model fitting is fast, since it's just toy problems.

@adamjstewart
Copy link
Collaborator

How about-- is it possible to change LandCoverAI datamodule to use the test data that we already have spent time creating for this notebook (then comments saying if you actually want to play with data, then do this other thing)?

This is also undesirable because the notebook by default will train and display predictions on random noise. I would much rather have a tiny dataset with real data. I'm happy to make this myself (maybe next week).

@calebrob6
Copy link
Member Author

I don't think it needs to display predictions -- as if we're only training for a batch for "making CI pretty" reasons then it will display noise regardless.

@adamjstewart
Copy link
Collaborator

We can train for multiple epochs in the tutorial, but use fast_dev_run in CI. The trainers tutorial does this with nbmake variable mocking. This also means we could use the synthetic dataset during CI and the real dataset during the tutorial. My only hope is that we don't train for longer than it takes to explain what we're doing in a live tutorial session. If we use this notebook in a tutorial and end up sitting there watching it awkwardly for an hour then it gets tedious.

@adamjstewart adamjstewart removed this from the 0.6.0 milestone Aug 21, 2024
isaaccorley
isaaccorley previously approved these changes Oct 1, 2024
Copy link
Collaborator

@isaaccorley isaaccorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

" 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20220902T090559_R050_T40XDH_20220902T181115',\n",
" 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20220718T084609_R107_T40XEJ_20220718T175008',\n",
" 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20220902T090559_R050_T40XDH_20220902T181115'\n",
" #'https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2B_MSIL2A_20220718T084609_R107_T40XEJ_20220718T175008',\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the plan to delete these commented out files or uncomment them?

" 'm_3807511_ne_18_060_20181104.tif'\n",
" #'m_3807511_se_18_060_20181104.tif',\n",
" #'m_3807512_nw_18_060_20180815.tif',\n",
" #'m_3807512_sw_18_060_20180815.tif',\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

"cell_type": "markdown",
"metadata": {},
"source": [
"flake8: noqa: E501\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this

docs/tutorials/custom_segmentation_trainer.ipynb Outdated Show resolved Hide resolved
docs/tutorials/custom_segmentation_trainer.ipynb Outdated Show resolved Hide resolved
docs/tutorials/custom_segmentation_trainer.ipynb Outdated Show resolved Hide resolved
}
],
"source": [
"# validate that the task's hyperparameters are as expected\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we transitioned from markdown to code comments. I would prefer markdown.

"outputs": [
{
"data": {
"text/plain": [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually strip outputs from the notebooks before uploading them so the outputs don't change every time someone submits a PR. I know you prefer to include the output, and I'm trying to find ways to make both of us happy. Ideally, I want to store only .py files and generate the .ipynb on-the-fly but haven't figured out a way to do that yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"# Note that you can also just call `trainer.test(task, dm)` if you've already trained\n",
"# the model in the current notebook session.\n",
"\n",
"task = CustomSemanticSegmentationTask.load_from_checkpoint(\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is like the whole purpose of #2317, but also completely unnecessary to demonstrate a custom trainer. If you remove this, we could backport this to 0.6.1 so it makes it into the stable docs immediately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in a rush to get this into the stable docs, and think it is a nice use case.

{
"data": {
"text/plain": [
"[{'test_loss': 17.85824203491211,\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output metrics are pretty bad. At least with our EuroSAT100 examples, the accuracies aren't horrible (much higher than random guessing). Maybe if we start with a pre-trained model (even one created solely for the purpose of this tutorial) we can get better results? Would also be nice to include example plots at the end to visualize how well the model works. Again, I know this is something you wish we didn't have to do and we could instead train for minutes/hours, but we want to keep the tutorial simple and easy to walkthrough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  • Integrating pre-trained models is outside the scope of this tutorial. If we get to a point where we post benchmark models for different datasets then we can revisit.
  • Performance on this task is close to meaningless -- it is semantic segmentation over 15 samples training for a single look on 70 samples. It may be the 15 random samples in the test set are all of the same class.
  • It'd be cute to maximize performance under the constraints of CPU only Github Action CI runners but I don't think it adds any value to the tutorial and eats up CI time.
  • It would be nice to have pretty plots, but again, you are very unlikely to get anything coherent from training for a single 70 sample epoch. Also, plotting should be covered by other tutorials.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is to show people how to build custom classes that extend our base trainer classes and why they would do that (to have different metrics, to change optimizer schedule, ....).

@robmarkcole robmarkcole mentioned this pull request Nov 27, 2024
25 tasks
@calebrob6
Copy link
Member Author

@adamjstewart is this good to go?

@adamjstewart
Copy link
Collaborator

Need to give this another detailed review, but there are a lot of lines of code that are commented out that we should clean up. Also, this PR depends on multiple new features that won't exist until 0.7.0/1.0.0, so we probably can't use it for AGU (unless we backport some of the changes without advertising them).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants