Usecase eurac (#219) · interTwin-eu/itwinai@d0370df

Usecase eurac (#219)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <[email protected]>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <[email protected]>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Docs dev (#135)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>

* Distributed strategy launcher (#137)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performances improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE  launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>

* Update README.md

* Distributed strategy launcher (#141)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performances improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE  launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

* Update README.md

* Cyclone tf dist (#130)

* get_stretegy

* UPDATE distributed strategy

* change req file

* cyclone tf dist

* small bugs

* fix bug in train.py

* REFACTOR cyclones use case

* Activate pytest

* NEW TensorFlow trainer

* ADD user information

---------

Co-authored-by: ruettgers1 <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Interactive distrib ml (#139)

Add examples for distributed ml in interactive mode

* Interactive distrib ml (#140)

Update tutorial

* Disable documentation GH action

* Remove action

---------

Co-authored-by: Mario Rüttgers <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: r-sarma <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: MarioRuettgers <[email protected]>

* Merge main (#142)

Bring changes on main into dev

* Virgo integration (#143)

* ADD Virgo data pipeline and some refactoring

* FIX typo

* UPDATE README

* ADD training

* ADD TrainingConfiguration

* ADD distributed training and refactor

* update readme

* UPDATE loggers and add tests

* Refactor

* FIX typo

* UPDATE use cases instructions

* ADD checkpointing and refactor.

* FIX linter

* FIX jscpd

* FIX jscpd

* Disable jscpd

* Refactor loggers

* ADD loggers to Virgo use case

* Update AUTHORS.md

* Update AUTHORS.md

* Docs dev (#144)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

* Remove unnecessary dependencies

* Add docstring

* adding latest changes from dev

* new content and changes

* Update index.rst

toctree revise

* adding pages for distributed ml tutorials

* new sphinx reqs to solve build failing

* Docs update:
- python code format fixed
- added brief explanation on ddp in new section

* requirements changed

* UPDATE requirements

* UPDATE requirements and itwinai.types

* ADD CMake and GCC installation

* UPDATE CMake and GCC installation

* UPDATE CMake and GCC installation

* ADD notebooks

* Disable notebooks section

* FIX TOC

* Saving local changes before pulling from remote

* saving updates before pull from origin

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* adding cyclones and virgo use cases pages

* FIX build errors

* Update TOC

* Update TOC

---------

Co-authored-by: KalliopiTsolaki <[email protected]>
Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: VerderK <[email protected]>
Co-authored-by: Killian Verder <[email protected]>

* Update dev (#152)

* Dev - itwinai 0.0.2 (#138)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD:Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <[email protected]>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <[email protected]>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <[email protected]>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <[email protected]>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <[email protected]>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <[email protected]>
Co-authored-by: Kalliopi Tsolaki <[email protected]>
Co-authored-by: orviz <[email protected]>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <[email protected]>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed to…

26 people committed Oct 16, 2024

1 parent 46e0a85 commit d0370df

src/itwinai/cli.py

            
@@ -282,13 +282,16 @@ def exec_pipeline(
     @app.command()
     def mlflow_ui(
         path: str = typer.Option("ml-logs/", help="Path to logs storage."),
+        port: int = typer.Option(
+            5000, help="Port on which the MLFlow UI is listening."
+        ),
     ):
         """
         Visualize Mlflow logs.
         """
         import subprocess
-        subprocess.run(f"mlflow ui --backend-store-uri {path}".split())
+        subprocess.run(f"mlflow ui --backend-store-uri {path} --port {port}".split())
     @app.command()
@@ -302,8 +305,7 @@ def mlflow_server(
         """
         import subprocess
-        subprocess.run(
-            f"mlflow server --backend-store-uri {path} --port {port}".split())
+        subprocess.run(f"mlflow server --backend-store-uri {path} --port {port}".split())
     @app.command()
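
Review note on the `mlflow_ui` change: the command is built by formatting a string and calling `.split()`, which would break a `path` containing whitespace into several arguments. A minimal sketch of an alternative, assuming only the `--backend-store-uri` and `--port` flags that appear in the diff; the helper name `build_mlflow_ui_cmd` is hypothetical and not part of itwinai:

```python
import shlex
from typing import List


def build_mlflow_ui_cmd(path: str, port: int) -> List[str]:
    """Build the argument list for `mlflow ui` without splitting a string.

    Hypothetical helper: each argument is its own list item, so a path
    that contains whitespace stays a single argument.
    """
    return ["mlflow", "ui", "--backend-store-uri", path, "--port", str(port)]


if __name__ == "__main__":
    cmd = build_mlflow_ui_cmd("ml logs/", 5000)
    # Print the shell-quoted form for inspection; nothing is executed here.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as is, in place of the split string.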

src/itwinai/torch/trainer.py

            
@@ -298,7 +298,6 @@ def create_dataloaders(
             # Dear user, this is a method you #
             # may be interested to override!  #
             ###################################
             self.train_dataloader = self.strategy.create_dataloader(
                 dataset=train_dataset,
                 batch_size=self.config.batch_size,

@@ -373,7 +372,7 @@ def execute(
             if self.logger:
                 self.logger.destroy_logger_context()
-            # self.strategy.clean_up()
+            self.strategy.clean_up()
             return train_dataset, validation_dataset, test_dataset, self.model
         def _set_epoch_dataloaders(self, epoch: int):

@@ -521,8 +520,10 @@ def train(self):
                     val_loss = self.validation_epoch(epoch)
                     # Checkpointing current best model
-                    worker_val_losses = self.strategy.gather(val_loss, dst_rank=0)
-                    if self.strategy.is_main_worker:
+                    worker_val_losses = self.strategy.gather(
+                        val_loss, dst_rank=0
+                    )
+                    if self.strategy.global_rank() == 0:
                         avg_loss = torch.mean(
                             torch.stack(worker_val_losses)
                         ).detach().cpu()

@@ -627,53 +628,51 @@ def train_step(
             )
             return loss, metrics
-        def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
+        def validation_epoch(self, epoch: int) -> torch.Tensor:
             """Perform a complete sweep over the validation dataset, completing an
             epoch of validation.
             Args:
                 epoch (int): current epoch number, from 0 to ``self.epochs - 1``.
             Returns:
-                Optional[Loss]: average validation loss for the current epoch if
-                    self.validation_dataloader is not None
+                Loss: average validation loss for the current epoch.
             """
-            if self.validation_dataloader is None:
-                return
-            self.model.eval()
-            validation_losses = []
-            validation_metrics = []
-            for batch_idx, val_batch in enumerate(self.validation_dataloader):
-                loss, metrics = self.validation_step(
-                    batch=val_batch,
-                    batch_idx=batch_idx
-                )
-                validation_losses.append(loss)
-                validation_metrics.append(metrics)
+            if self.validation_dataloader is not None:
+                self.model.eval()
+                validation_losses = []
+                validation_metrics = []
+                for batch_idx, val_batch \
+                        in enumerate(self.validation_dataloader):
+                    loss, metrics = self.validation_step(
+                        batch=val_batch,
+                        batch_idx=batch_idx
+                    )
+                    validation_losses.append(loss)
+                    validation_metrics.append(metrics)
-                # Important: update counter
-                self.validation_glob_step += 1
+                    # Important: update counter
+                    self.validation_glob_step += 1
-            # Aggregate and log losses
-            avg_loss = torch.mean(torch.stack(validation_losses))
-            self.log(
-                item=avg_loss.item(),
-                identifier='validation_loss_epoch',
-                kind='metric',
-                step=self.validation_glob_step,
-            )
-            # Aggregate and log metrics
-            avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
-            for m_name, m_val in avg_metrics.items():
+                # Aggregate and log losses
+                avg_loss = torch.mean(torch.stack(validation_losses))
                 self.log(
-                    item=m_val,
-                    identifier='validation_' + m_name + '_epoch',
+                    item=avg_loss.item(),
+                    identifier='validation_loss_epoch',
                     kind='metric',
                     step=self.validation_glob_step,
                 )
+                # Aggregate and log metrics
+                avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
+                for m_name, m_val in avg_metrics.items():
+                    self.log(
+                        item=m_val,
+                        identifier='validation_' + m_name + '_epoch',
+                        kind='metric',
+                        step=self.validation_glob_step,
+                    )
-            return avg_loss
+                return avg_loss
         def validation_step(
             self,
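
Review note on `validation_epoch`: in both versions of this hunk the method can yield `None` when `self.validation_dataloader` is `None` (the old code returns early, the new code falls through the `if` block), although the new annotation is `torch.Tensor`. The sketch below illustrates that control flow and a caller-side guard with plain floats instead of tensors; the function names and the `step` callback are hypothetical and only mirror the shape of the code in the diff:

```python
from statistics import mean
from typing import Callable, Iterable, List, Optional


def validation_epoch(
    batches: Optional[Iterable[float]],
    step: Callable[[float], float],
) -> Optional[float]:
    """Average the per-batch loss, or return None without validation data."""
    if batches is None:
        # Mirrors the early return of the removed code path.
        return None
    losses: List[float] = [step(batch) for batch in batches]
    if not losses:
        # An empty dataloader has no loss to average either.
        return None
    return mean(losses)


def is_new_best(val_loss: Optional[float], best: float) -> bool:
    """Caller-side guard: compare only when a validation loss exists."""
    return val_loss is not None and val_loss < best
```

A caller such as the checkpointing block in `train` would presumably need a guard of this kind before gathering and stacking the per-worker losses.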

use-cases/eurac/config.yaml

@@ -6,7 +6,7 @@ tmp_stats: /p/scratch/intertwin/datasets/eurac/stats
     experiment: "drought use case lstm"
     run_name: "alps_test"
-    epochs: 5
+    epochs: 2
     random_seed: 1010
     lr: 0.001
     batch_size: 256

use-cases/eurac/runall.sh

            
@@ -12,7 +12,7 @@ if [ -z "$NUM_GPUS" ]; then
     	NUM_GPUS=4
     fi
     if [ -z "$TIME" ]; then
-    	TIME=0:40:00
+    	TIME=0:20:00
     fi
     if [ -z "$DEBUG" ]; then
     	DEBUG=false
@@ -34,6 +34,6 @@ submit_job () {
     }
     echo "Running distributed training on $NUM_NODES nodes with $NUM_GPUS GPUs per node"
-    # submit_job "ddp"
-    # submit_job "deepspeed"
+    submit_job "ddp"
+    submit_job "deepspeed"
     submit_job "horovod"

use-cases/eurac/slurm.sh

            
@@ -100,7 +100,7 @@ if [ "$DIST_MODE" == "horovod" ] ; then
     	srun --cpu-bind=none \
     	--ntasks-per-node=$SLURM_GPUS_PER_NODE \
     	--cpus-per-task=$SLURM_CPUS_PER_GPU \
-    	--ntasks=$(($SLURM_GPUS_PER_NODE * $SLURM_NNODES)) \
+    	--ntasks=$SLURM_GPUS_PER_NODE \
     	$TRAINING_CMD
     else # E.g. for 'deepspeed' or 'ddp'
       srun --cpu-bind=none --ntasks-per-node=1 \
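
Review note on the Horovod branch: the removed line requested `$SLURM_GPUS_PER_NODE * $SLURM_NNODES` tasks, that is one task per GPU across the allocation, while the added line requests `$SLURM_GPUS_PER_NODE` tasks in total. A small sketch of the arithmetic, with hypothetical function names, suggests the two values coincide only for single-node jobs:

```python
def tasks_before(gpus_per_node: int, num_nodes: int) -> int:
    """Task count from the removed line: one task per GPU on every node."""
    return gpus_per_node * num_nodes


def tasks_after(gpus_per_node: int, num_nodes: int) -> int:
    """Task count from the added line: independent of the node count."""
    return gpus_per_node


if __name__ == "__main__":
    for nodes in (1, 2, 4):
        print(nodes, tasks_before(4, nodes), tasks_after(4, nodes))
```

With the default `NUM_GPUS=4` from `runall.sh`, a two-node job would go from 8 requested tasks to 4 under this reading.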

use-cases/eurac/trainer.py

            
@@ -1,7 +1,7 @@
     import os
     from pathlib import Path
     from timeit import default_timer as timer
-    from typing import Dict, Literal, Optional, Union, Any, Tuple
+    from typing import Dict, Literal, Optional, Union
     import pandas as pd
     import torch

@@ -13,7 +13,6 @@
     from hython.trainer import ConvTrainer, RNNTrainer, RNNTrainParams
     from ray import train
     from torch.optim.lr_scheduler import ReduceLROnPlateau
-    from torch.utils.data import Dataset
     from tqdm.auto import tqdm
     from itwinai.loggers import EpochTimeTracker, Logger

@@ -26,7 +25,7 @@
     )
     from itwinai.torch.trainer import TorchTrainer
     from itwinai.torch.type import Metric
-    from itwinai.components import profile_torch_trainer
     class RNNDistributedTrainer(TorchTrainer):
         """Trainer class for RNN model using pytorch.

@@ -88,17 +87,6 @@ def __init__(
                 **kwargs,
             )
             self.save_parameters(**self.locals2params(locals()))
-            # self.execute = types.MethodType(profile_torch_trainer(self.execute), self)
-        @profile_torch_trainer
-        def execute(
-            self,
-            train_dataset: Dataset,
-            validation_dataset: Optional[Dataset] = None,
-            test_dataset: Optional[Dataset] = None
-        ) -> Tuple[Dataset, Dataset, Dataset, Any]:
-            return super().execute(train_dataset, validation_dataset, test_dataset)
         def create_model_loss_optimizer(self) -> None:
             self.optimizer = optim.Adam(self.model.parameters(), lr=self.config.lr)
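
Review note on the removed `execute` override: it existed only to wrap the parent's `execute` in the `@profile_torch_trainer` decorator. The pattern can be sketched as follows, assuming a stand-in decorator that merely records wall-clock time; `profile_calls`, `BaseTrainer` and `ProfiledTrainer` are hypothetical names and do not reproduce what `profile_torch_trainer` actually does:

```python
import functools
import time
from typing import Any, Callable


def profile_calls(method: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical stand-in for a profiling decorator: times each call."""

    @functools.wraps(method)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return method(self, *args, **kwargs)
        finally:
            # Store the elapsed time on the instance for later inspection.
            self.last_execute_seconds = time.perf_counter() - start

    return wrapper


class BaseTrainer:
    def execute(self, train_dataset, validation_dataset=None, test_dataset=None):
        # Placeholder for the real training loop.
        return train_dataset, validation_dataset, test_dataset, "model"


class ProfiledTrainer(BaseTrainer):
    @profile_calls
    def execute(self, train_dataset, validation_dataset=None, test_dataset=None):
        # Delegate to the parent; the decorator only adds timing around it.
        return super().execute(train_dataset, validation_dataset, test_dataset)
```

Dropping the override, as this commit does, leaves the subclass with the inherited `execute` and no profiling wrapper.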

use-cases/virgo/README.md

     such as the number of trials you want to run, to change the stopping criteria for the trials or to set a
     different metric on which ray will evaluate trial results.
     By default, trials monitor validation loss, and results are plotted once all trials are completed.
-    ## Generating Synthetic Data for the Virgo Use Case
-    This project includes another SLURM job script, `synthetic_data_gen/data_generation.sh`, that allows
-    users to generate synthetic dataset for the Virgo gravitational wave detector use case.
-    This step is typically not required unless you need to create new synthetic datasets.
-    The synthetic data is generated using a Python script, `file_gen.py`, which creates multiple files
-    containing simulated data. Each file is a pickled pandas dataframe containing `datapoints_per_file`
-    datapoints (defaults to 500), each
-    one representing a set of time series for main and strain detector channels.
-    If you need to generate a new dataset, you can run the SLURM script with the following command:
-    ```bash
-    sbatch data_generation.sh
-    ```
-    The script will generate multiple data files and store them in separate folders, which are
-    created in the `target_folder_name` directory.
-    The generated pickle files are organized in a set of nested folders to avoid creating too many
-    files in the same folder. To generate such folders and its files we use SLURM
-    [job arrays](https://slurm.schedmd.com/job_array.html).
-    Each SLURM array job will create its own folder and populate it with the synthetic data files.
-    The number of files created in each folder can be customized by setting the `NUM_FILES` environment
-    variablebefore submitting the job.
-    For example, to generate 50 files per array job, you can run:
-    ```bash
-    export NUM_FILES=50
-    sbatch data_generation.sh
-    ```
-    If you do not specify `NUM_FILES`, the script will default to creating 100 files per folder.

use-cases/virgo/synthetic_data_gen/file_gen.py

@@ -7,8 +7,6 @@
     from ..src.dataset import generate_cut_image_dataset
-    from ..src.dataset import generate_cut_image_dataset
     def generate_pkl_dataset(
         folder_name='test_folder',
