Usecase eurac (#219)
* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <[email protected]>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <[email protected]>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Docs dev (#135)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>

* Distributed strategy launcher (#137)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* Update README.md

* Distributed strategy launcher (#141)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

* Update README.md

* Cyclone tf dist (#130)

* get_strategy

* UPDATE distributed strategy

* change req file

* cyclone tf dist

* small bugs

* fix bug in train.py

* REFACTOR cyclones use case

* Activate pytest

* NEW TensorFlow trainer

* ADD user information

---------

Co-authored-by: ruettgers1 <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Interactive distrib ml (#139)

Add examples for distributed ml in interactive mode

* Interactive distrib ml (#140)

Update tutorial

* Disable documentation GH action

* Remove action

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: MarioRuettgers <[email protected]>

* Merge main (#142)

Bring changes on main into dev

* Virgo integration (#143)

* ADD Virgo data pipeline and some refactoring

* FIX typo

* UPDATE README

* ADD training

* ADD TrainingConfiguration

* ADD distributed training and refactor

* update readme

* UPDATE loggers and add tests

* Refactor

* FIX typo

* UPDATE use cases instructions

* ADD checkpointing and refactor.

* FIX linter

* FIX jscpd

* FIX jscpd

* Disable jscpd

* Refactor loggers

* ADD loggers to Virgo use case

* Update AUTHORS.md

* Update AUTHORS.md

* Docs dev (#144)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

* Remove unnecessary dependencies

* Add docstring

* adding latest changes from dev

* new content and changes

* Update index.rst

toctree revise

* adding pages for distributed ml tutorials

* new sphinx reqs to fix failing build

* Docs update:
- python code format fixed
- added brief explanation on ddp in new section

* requirements changed

* UPDATE requirements

* UPDATE requirements and itwinai.types

* ADD CMake and GCC installation

* UPDATE CMake and GCC installation

* UPDATE CMake and GCC installation

* ADD notebooks

* Disable notebooks section

* FIX TOC

* Saving local changes before pulling from remote

* saving updates before pull from origin

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* adding cyclones and virgo use cases pages

* FIX build errors

* Update TOC

* Update TOC

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>
Co-authored-by: Killian Verder <[email protected]>

* Update dev (#152)

* Dev - itwinai 0.0.2 (#138)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <[email protected]>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <[email protected]>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed to…
26 people committed Oct 16, 2024
1 parent 46e0a85 commit d0370df
Showing 8 changed files with 47 additions and 95 deletions.
8 changes: 5 additions & 3 deletions src/itwinai/cli.py
@@ -282,13 +282,16 @@ def exec_pipeline(
@app.command()
def mlflow_ui(
path: str = typer.Option("ml-logs/", help="Path to logs storage."),
port: int = typer.Option(
5000, help="Port on which the MLFlow UI is listening."
),
):
"""
Visualize Mlflow logs.
"""
import subprocess

subprocess.run(f"mlflow ui --backend-store-uri {path}".split())
subprocess.run(f"mlflow ui --backend-store-uri {path} --port {port}".split())


@app.command()
@@ -302,8 +305,7 @@ def mlflow_server(
"""
import subprocess

subprocess.run(
f"mlflow server --backend-store-uri {path} --port {port}".split())
subprocess.run(f"mlflow server --backend-store-uri {path} --port {port}".split())


@app.command()
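
Both commands in this hunk build their argument list by calling `.split()` on a formatted string, which would break a `path` that contains whitespace into several arguments. A minimal sketch of an alternative, assuming a hypothetical helper `build_mlflow_ui_cmd` that is not part of `itwinai` (only the `--backend-store-uri` and `--port` flags are taken from the diff):

```python
import shlex


def build_mlflow_ui_cmd(path="ml-logs/", port=5000):
    """Build the argv for `mlflow ui` as a list, so no string splitting is needed.

    Hypothetical helper for illustration; defaults mirror the CLI options above.
    """
    return ["mlflow", "ui", "--backend-store-uri", str(path), "--port", str(port)]


if __name__ == "__main__":
    cmd = build_mlflow_ui_cmd("my logs/", 5001)
    # shlex.join renders the list as a shell-quoted string, for display only.
    print(shlex.join(cmd))
    # To launch the UI, the list could be passed directly to subprocess.run(cmd).
```

Passing a list keeps each value as a single argument regardless of its content; whether this matters depends on the paths users are expected to supply.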
71 changes: 35 additions & 36 deletions src/itwinai/torch/trainer.py
@@ -298,7 +298,6 @@ def create_dataloaders(
# Dear user, this is a method you #
# may be interested to override! #
###################################

self.train_dataloader = self.strategy.create_dataloader(
dataset=train_dataset,
batch_size=self.config.batch_size,
@@ -373,7 +372,7 @@ def execute(

if self.logger:
self.logger.destroy_logger_context()
# self.strategy.clean_up()
self.strategy.clean_up()
return train_dataset, validation_dataset, test_dataset, self.model

def _set_epoch_dataloaders(self, epoch: int):
@@ -521,8 +520,10 @@ def train(self):
val_loss = self.validation_epoch(epoch)

# Checkpointing current best model
worker_val_losses = self.strategy.gather(val_loss, dst_rank=0)
if self.strategy.is_main_worker:
worker_val_losses = self.strategy.gather(
val_loss, dst_rank=0
)
if self.strategy.global_rank() == 0:
avg_loss = torch.mean(
torch.stack(worker_val_losses)
).detach().cpu()
@@ -627,53 +628,51 @@ def train_step(
)
return loss, metrics

def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
def validation_epoch(self, epoch: int) -> torch.Tensor:
"""Perform a complete sweep over the validation dataset, completing an
epoch of validation.
Args:
epoch (int): current epoch number, from 0 to ``self.epochs - 1``.
Returns:
Optional[Loss]: average validation loss for the current epoch if
self.validation_dataloader is not None
Loss: average validation loss for the current epoch.
"""
if self.validation_dataloader is None:
return

self.model.eval()
validation_losses = []
validation_metrics = []
for batch_idx, val_batch in enumerate(self.validation_dataloader):
loss, metrics = self.validation_step(
batch=val_batch,
batch_idx=batch_idx
)
validation_losses.append(loss)
validation_metrics.append(metrics)
if self.validation_dataloader is not None:
self.model.eval()
validation_losses = []
validation_metrics = []
for batch_idx, val_batch \
in enumerate(self.validation_dataloader):
loss, metrics = self.validation_step(
batch=val_batch,
batch_idx=batch_idx
)
validation_losses.append(loss)
validation_metrics.append(metrics)

# Important: update counter
self.validation_glob_step += 1
# Important: update counter
self.validation_glob_step += 1

# Aggregate and log losses
avg_loss = torch.mean(torch.stack(validation_losses))
self.log(
item=avg_loss.item(),
identifier='validation_loss_epoch',
kind='metric',
step=self.validation_glob_step,
)
# Aggregate and log metrics
avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
for m_name, m_val in avg_metrics.items():
# Aggregate and log losses
avg_loss = torch.mean(torch.stack(validation_losses))
self.log(
item=m_val,
identifier='validation_' + m_name + '_epoch',
item=avg_loss.item(),
identifier='validation_loss_epoch',
kind='metric',
step=self.validation_glob_step,
)
# Aggregate and log metrics
avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
for m_name, m_val in avg_metrics.items():
self.log(
item=m_val,
identifier='validation_' + m_name + '_epoch',
kind='metric',
step=self.validation_glob_step,
)

return avg_loss
return avg_loss

def validation_step(
self,
2 changes: 1 addition & 1 deletion use-cases/eurac/config.yaml
@@ -6,7 +6,7 @@ tmp_stats: /p/scratch/intertwin/datasets/eurac/stats

experiment: "drought use case lstm"
run_name: "alps_test"
epochs: 5
epochs: 2
random_seed: 1010
lr: 0.001
batch_size: 256
6 changes: 3 additions & 3 deletions use-cases/eurac/runall.sh
@@ -12,7 +12,7 @@ if [ -z "$NUM_GPUS" ]; then
NUM_GPUS=4
fi
if [ -z "$TIME" ]; then
TIME=0:40:00
TIME=0:20:00
fi
if [ -z "$DEBUG" ]; then
DEBUG=false
@@ -34,6 +34,6 @@ submit_job () {
}

echo "Running distributed training on $NUM_NODES nodes with $NUM_GPUS GPUs per node"
# submit_job "ddp"
# submit_job "deepspeed"
submit_job "ddp"
submit_job "deepspeed"
submit_job "horovod"
2 changes: 1 addition & 1 deletion use-cases/eurac/slurm.sh
@@ -100,7 +100,7 @@ if [ "$DIST_MODE" == "horovod" ] ; then
srun --cpu-bind=none \
--ntasks-per-node=$SLURM_GPUS_PER_NODE \
--cpus-per-task=$SLURM_CPUS_PER_GPU \
--ntasks=$(($SLURM_GPUS_PER_NODE * $SLURM_NNODES)) \
--ntasks=$SLURM_GPUS_PER_NODE \
$TRAINING_CMD
else # E.g. for 'deepspeed' or 'ddp'
srun --cpu-bind=none --ntasks-per-node=1 \
16 changes: 2 additions & 14 deletions use-cases/eurac/trainer.py
@@ -1,7 +1,7 @@
import os
from pathlib import Path
from timeit import default_timer as timer
from typing import Dict, Literal, Optional, Union, Any, Tuple
from typing import Dict, Literal, Optional, Union

import pandas as pd
import torch
@@ -13,7 +13,6 @@
from hython.trainer import ConvTrainer, RNNTrainer, RNNTrainParams
from ray import train
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import Dataset
from tqdm.auto import tqdm

from itwinai.loggers import EpochTimeTracker, Logger
@@ -26,7 +25,7 @@
)
from itwinai.torch.trainer import TorchTrainer
from itwinai.torch.type import Metric
from itwinai.components import profile_torch_trainer


class RNNDistributedTrainer(TorchTrainer):
"""Trainer class for RNN model using pytorch.
@@ -88,17 +87,6 @@ def __init__(
**kwargs,
)
self.save_parameters(**self.locals2params(locals()))
# self.execute = types.MethodType(profile_torch_trainer(self.execute), self)


@profile_torch_trainer
def execute(
self,
train_dataset: Dataset,
validation_dataset: Optional[Dataset] = None,
test_dataset: Optional[Dataset] = None
) -> Tuple[Dataset, Dataset, Dataset, Any]:
return super().execute(train_dataset, validation_dataset, test_dataset)

def create_model_loss_optimizer(self) -> None:
self.optimizer = optim.Adam(self.model.parameters(), lr=self.config.lr)
35 changes: 0 additions & 35 deletions use-cases/virgo/README.md
@@ -123,38 +123,3 @@ You may change CLI variables for `hpo.py` to change parameters,
such as the number of trials you want to run, to change the stopping criteria for the trials or to set a
different metric on which ray will evaluate trial results.
By default, trials monitor validation loss, and results are plotted once all trials are completed.

## Generating Synthetic Data for the Virgo Use Case

This project includes another SLURM job script, `synthetic_data_gen/data_generation.sh`, that allows
users to generate a synthetic dataset for the Virgo gravitational wave detector use case.
This step is typically not required unless you need to create new synthetic datasets.

The synthetic data is generated using a Python script, `file_gen.py`, which creates multiple files
containing simulated data. Each file is a pickled pandas dataframe containing `datapoints_per_file`
datapoints (defaults to 500), each one representing a set of time series for main and strain
detector channels.

If you need to generate a new dataset, you can run the SLURM script with the following command:

```bash
sbatch data_generation.sh
```

The script will generate multiple data files and store them in separate folders, which are
created in the `target_folder_name` directory.

The generated pickle files are organized in a set of nested folders to avoid creating too many
files in the same folder. To generate these folders and their files we use SLURM
[job arrays](https://slurm.schedmd.com/job_array.html).
Each SLURM array job will create its own folder and populate it with the synthetic data files.
The number of files created in each folder can be customized by setting the `NUM_FILES` environment
variable before submitting the job.
For example, to generate 50 files per array job, you can run:

```bash
export NUM_FILES=50
sbatch data_generation.sh
```

If you do not specify `NUM_FILES`, the script will default to creating 100 files per folder.
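
The folder-per-array-task layout and the `NUM_FILES` default described above can be sketched as follows. This is only an illustration under stated assumptions: the function `plan_output_files` and the `folder_<id>` / `file_<i>.pkl` naming scheme are hypothetical and are not taken from `file_gen.py` or `data_generation.sh`. `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets for each task of a job array.

```python
import os
from pathlib import Path


def plan_output_files(target_folder, env=None):
    """Return the file paths one array task would create (nothing is written)."""
    env = os.environ if env is None else env
    # NUM_FILES falls back to 100 when unset, matching the default described above.
    num_files = int(env.get("NUM_FILES", "100"))
    # Each task of a SLURM job array sees its own SLURM_ARRAY_TASK_ID.
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    # Hypothetical naming scheme: one folder per array task.
    folder = Path(target_folder) / f"folder_{task_id}"
    return [folder / f"file_{i}.pkl" for i in range(num_files)]


if __name__ == "__main__":
    fake_env = {"NUM_FILES": "50", "SLURM_ARRAY_TASK_ID": "3"}
    paths = plan_output_files("target_folder_name", fake_env)
    print(len(paths), paths[0])
```

Passing the environment as an argument keeps the sketch testable outside SLURM; the real script may read these values differently.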
2 changes: 0 additions & 2 deletions use-cases/virgo/synthetic_data_gen/file_gen.py
@@ -7,8 +7,6 @@

from ..src.dataset import generate_cut_image_dataset

from ..src.dataset import generate_cut_image_dataset


def generate_pkl_dataset(
folder_name='test_folder',