From d0370dfd39e680bbf2de522cd935e5e9db4503f1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jarl=20S=C3=A6ther?=
 <60541573+jarlsondre@users.noreply.github.com>
Date: Tue, 15 Oct 2024 16:06:15 +0200
Subject: [PATCH] Usecase eurac (#219)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <neonikkus@gmail.com>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Docs dev (#135)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Distributed strategy launcher (#137)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* Update README.md

* Distributed strategy launcher (#141)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

* Update README.md

* Cyclone tf dist (#130)

* get_stretegy

* UPDATE distributed strategy

* change req file

* cyclone tf dist

* small bugs

* fix bug in train.py

* REFACTOR cyclones use case

* Activate pytest

* NEW TensorFlow trainer

* ADD user information

---------

Co-authored-by: ruettgers1 <ruettgers1@hdfmll01.hdfml>
Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>

* Interactive distrib ml (#139)

Add examples for distributed ml in interactive mode

* Interactive distrib ml (#140)

Update tutorial

* Disable documentation GH action

* Remove action

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com>

* Merge main (#142)

Bring changes on main into dev

* Virgo integration (#143)

* ADD Virgo data pipeline and some refactoring

* FIX typo

* UPDATE README

* ADD training

* ADD TrainingConfiguration

* ADD distributed training and refactor

* update readme

* UPDATE loggers and add tests

* Refactor

* FIX typo

* UPDATE use cases instructions

* ADD checkpointing and refactor.

* FIX linter

* FIX jscpd

* FIX jscpd

* Disable jscpd

* Refactor loggers

* ADD loggers to Virgo use case

* Update AUTHORS.md

* Update AUTHORS.md

* Docs dev (#144)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

* Remove unnecessary dependencies

* Add docstring

* adding latest changes from dev

* new content and changes

* Update index.rst

toctree revise

* adding pages for distributed ml tutorials

* new sphinx reqs to fix failing build

* Docs update:
- python code format fixed
- added brief explanation on ddp in new section

* requirements changed

* UPDATE requirements

* UPDATE requirements and itwinai.types

* ADD CMake and GCC installation

* UPDATE CMake and GCC installation

* UPDATE CMake and GCC installation

* ADD notebooks

* Disable notebooks section

* FIX TOC

* Saving local changes before pulling from remote

* saving updates before pull from origin

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* adding cyclones and virgo use cases pages

* FIX build errors

* Update TOC

* Update TOC

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>
Co-authored-by: Killian Verder <killian.verder@cern.ch>

* Update dev (#152)

* Dev - itwinai 0.0.2 (#138)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <neonikkus@gmail.com>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Docs dev (#135)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Distributed strategy launcher (#137)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* Update README.md

* Distributed strategy launcher (#141)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Fixes to TF new version errors

* Update distributed.py

* Update tfmirrored_slurm.sh

* Update train.py

* TF updates

* Add README

* Python venv (#136)

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD pypi deployment

* DISABLE push debug

* UPDATE pypi

* UPDATE classifiers

* Update pyproject.toml

* Update README.md

* Cyclone tf dist (#130)

* get_stretegy

* UPDATE distributed strategy

* change req file

* cyclone tf dist

* small bugs

* fix bug in train.py

* REFACTOR cyclones use case

* Activate pytest

* NEW TensorFlow trainer

* ADD user information

---------

Co-authored-by: ruettgers1 <ruettgers1@hdfmll01.hdfml>
Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>

* Interactive distrib ml (#139)

Add examples for distributed ml in interactive mode

* Interactive distrib ml (#140)

Update tutorial

* Disable documentation GH action

* Remove action

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com>

* Merge main (#142)

Bring changes on main into dev

* Virgo integration (#143)

* ADD Virgo data pipeline and some refactoring

* FIX typo

* UPDATE README

* ADD training

* ADD TrainingConfiguration

* ADD distributed training and refactor

* update readme

* UPDATE loggers and add tests

* Refactor

* FIX typo

* UPDATE use cases instructions

* ADD checkpointing and refactor.

* FIX linter

* FIX jscpd

* FIX jscpd

* Disable jscpd

* Refactor loggers

* ADD loggers to Virgo use case

* Update AUTHORS.md

* Update AUTHORS.md

* Docs dev (#144)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

* UPDATE requirements

* Remove unnecessary dependencies

* Add docstring

* adding latest changes from dev

* new content and changes

* Update index.rst

toctree revise

* adding pages for distributed ml tutorials

* new sphinx reqs to solve build failing

* Docs update:
- python code format fixed
- added brief explanation on ddp in new section

* requirements changed

* UPDATE requirements

* UPDATE requirements and itwinai.types

* ADD CMake and GCC installation

* UPDATE CMake and GCC installation

* UPDATE CMake and GCC installation

* ADD notebooks

* Disable notebooks section

* FIX TOC

* Saving local changes before pulling from remote

* saving updates before pull from origin

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* Update itwinai.torch.modules.rst

* adding cyclones and virgo use cases pages

* FIX build errors

* Update TOC

* Update TOC

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>
Co-authored-by: Killian Verder <killian.verder@cern.ch>

---------

Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com>
Co-authored-by: linxUser3574 <neonikkus@gmail.com>
Co-authored-by: orviz <orviz@ifca.unican.es>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>
Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com>
Co-authored-by: Killian Verder <killian.verder@cern.ch>

* Delete .github/workflows/pages.yml

* ADD quick install for users (#145)

* User install (#146)

* ADD quick install for users

* UPDATE installer

* fix framework selection

* UPDATE installer

* Update README.md

* Update README.md

* Improve docstring parsing and refactor (#147)

* UPDATE print patch and refactor

* Cleanup

* Cleanup

* Cleanup

* Cleanup

* FIX broken import

* UPDATE docs

* FIX docstring parsing

* Preserve ordering

* Update cli.py

* Update docs (#148)

* Update README.md

* ADD missing docstrings

* Bump actions/setup-python from 4 to 5 (#149)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update README.md

* Update README.md

* Update README.md

* updating doc pages (#150)

Co-authored-by: KalliopiTsolaki <ktsolaki@LAPTOP-4683QBL6>

* Update cyclones_doc.rst

* Bug fixes and addition of CERFACS use-case (#151)

* Update train.py

* Update generic_tf.sh

* Update pyproject.toml

* Update train.py

* Fix: head problems with MacOS

* Fixes for MacOS support

* Fix: Update basic_components.py

* Addition of cerfacs use-case

* Update README.md

* Update train.py

* Update cyclones_doc.rst

* Update startscript.sh

* Update pyproject.toml

* Update mnist.py

* Update mnist.py

* Update generic_tf.sh

* Update requirements.txt

* Update requirements.txt

* Docs changes (#153)

* updating doc pages

* testing if changing the GH edit url works

* adding repo link in toc

---------

Co-authored-by: KalliopiTsolaki <ktsolaki@LAPTOP-4683QBL6>

* Update pyproject.toml

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com>
Co-authored-by: linxUser3574 <neonikkus@gmail.com>
Co-authored-by: orviz <orviz@ifca.unican.es>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>
Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com>
Co-authored-by: Killian Verder <killian.verder@cern.ch>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: KalliopiTsolaki <51197740+KalliopiTsolaki@users.noreply.github.com>
Co-authored-by: KalliopiTsolaki <ktsolaki@LAPTOP-4683QBL6>

* Update sqaaas.yml

* added train to start integration

* update requirements.txt

* downsampling option

* fix plot

* ADD support for user-provided distributed samplers

* ADD first distributed draft with random distributed sampler

* UPDATE comments

* ADAPT train_val for distributed

* UPDATE docs

* UPDATE installation instructions

* prepare for distributed run

* enable distributed sampler

* fix

* prepare to run on JSC

* update train

* copied slurm.sh from virgo usecase

* update convlstm

* blacked

* update surrogate input to scratch

* Update dist-train.py

* Update dist-train.py

correct tqdm import error

* Update dist-train.py

in train_val, strategy.device changed to strategy.device()

* add distributed support

* add gather

* add logging

* clean up script

* update default batch-size

* refine variables to commandline args

* Update cli.py

* update torch dist final

* Add parameter run_name to mlflow logger

* prepared slurm script

* add data generator scripts and data loader

* add scaling tests

* add scaling tests

* add plots

* fix env path

* add hpo eurac

* correct start hpo cmd

* test distributed slurm

* fix hpo functionality

* working distributed version

* add instructions for HPO and scaling tests

* add hpo results visualization script

* update data path

* convlstm

* remove rnn_config from eurac tutorial

* conv

* Prov4ml integration (#192)

* ADD prov4ml logger

* UPDATE enum access fields

* UPDATE loggers documentation and first integration attempt

* ADD prov logger

* format kinds table

* MIGRATE to upstream prov4ml

* ADD docs build on JSC

* ADD RTD website

* UPDATE docs creation

* Refactor

* UPDATE logger

* Remove lightning callbacks and loggers

* ADD checkpoints

* UPDATE logger kind docs

* Update README.md

* ADD rank on loggers

* Update loggers.py

* Update loggers.py

* Update loggers.py

* Update loggers.py

* Update loggers.py

* FIX linter

* REFACTOR loggers

* Simplify prov4ml switch case

* UPDATE loggers

* FIX prov graph

* REFACTOR itwinai logging

* UPDATE SLURM jobscripts

* REFACTOR

* Update

* ADD prov experiments

* REFACTOR provenance logs and SLURM jobscripts

* REMOVE duplication

* FIX dataset name

* UPDATE README

* SKIP cyclones use case

* UPDATE version

* REMOVE redundant parameter

* CLEANUP

* ADD warning

* ADD warning

* UPDATE README

* FIX errors

* ADD docs

* UPDATE scripts

* UPDATE scripts

* fix gan bug and update docs (#193)

* Update index.rst

* Update index.rst

* Update index.rst

* Update pyproject.toml

* Update pyproject.toml

* Bump github/super-linter from 6 to 7 (#198)

Bumps [github/super-linter](https://github.com/github/super-linter) from 6 to 7.
- [Release notes](https://github.com/github/super-linter/releases)
- [Changelog](https://github.com/github/super-linter/blob/main/CHANGELOG.md)
- [Commits](https://github.com/github/super-linter/compare/v6...v7)

---
updated-dependencies:
- dependency-name: github/super-linter
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Itwinai container (#197)

* Backend (#59)

* WIP: Tensorflow MNIST use-case

* UPDATE: Tensorflow MNIST version

* ADD: Backend

* ADD: Use-case init

* FIX: Paths and downloading of the data

* FIX: Paths and downloading of the data

* ADD: Setup, Config update

* ADD: Setup, Config update

* UPDATE: File movement into itwinai

* FIX: Move utils from tensorflow to global folder

* FIX: Add setup into torch Executable

* ADD: MNIST Torch Use-case

* FIX: Formatting

* ADD: Lib

* ADD: Lib

* ADD: Tests, Fix Loggers

* Update README.md

* ADD: Tests

* ADD: MLCC

* ADD: Cyclones, Cyclones-pipe

* ADD: TensorflowTrainer

* UPDATE: Move TensorflowTrainer into Backend

* FIX: Dependencies

* ADD: Number of devices

* ADD: initial version of TorchTrainer

* update

* update

* ADD: distributed torch Trainer and decorator

* ADD: New version of torch distributed trainer and tests

* ADD: load torch dist trainer from config file

* ADD: multi-gpu pytorch trainer

* ADD: download on login node

* FIX: dataloaders in Trainer

* FIX: add dataloaders into trainer

* FIX: clear load and save state

* ADD: Loggers

* FIX: Log in a distributed environment

* TensorFlow backend (#63)

* UPDATE: Remove experimental distribution

* ADD: Mnist distributed

* ADD: Optional strategy

* UPDATE: Conditional distribution

* FIX: Dataloader for mnist

* FIX: Model cloning lambda function for distributed scope

* ADD: CycleGAN

* UPDATE: Types

* UPDATE: Types

* ADD: Local distr

* FIX: learning rates

* ADD: CycleGAN distributed

* FIX: Reduction

* FIX: Distribution

* ADD: tmp.py

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* FIX: Distribution

* UPDATE: Executors

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* FIX: Distributed Dataset

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Ray

* ADD: Initial VIRGO

* UPDATE: Optional distribution, tensorflow-gpu

* UPDATE: tensorflow-gpu dependency

* ADD: Unify branches

---------

Co-authored-by: User3574 <neonikkus@gmail.com>

* Refactor entire code base

* ADD: workflows folder

* FIX: refactor

* FIX: linting

* ADD: how to run use case doc

* ADD: workflows doc

* FIX: MD linter

* Pipe MNIST lightning (#86)

* ADD: lightning distributed + pipeline

* UPDATE: jscpd threshold

* UPDATE: super linter ignore use cases

* ADD: jscpd ignore loggers

* Functional tests for MNIST (#87)

* ADD: use case tests

* FIX: move use case models out of itwinai

* FIX: rearrange modules

* ADD: ConsoleLogger and LoggersCollection

* FIX: loggers filter

* FIX: add TF env creation

* UPDATE: test flag

* ADD: early pytest on slurm

* FIX: duplicated code in TF Trainer

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* 3dgan use case (#94)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#96)

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

* Update sqaaas.yml

* ADD: adaptive branch discovery for SQAaaS action

* Trigger only on main and dev branches

* ADD: double quote

* Trigger pytest only on main and dev PRs

* Torch mnist inference (#95)

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* Remove keras dependency

* 3dgan integration (#97)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* 3dgan integration (#98)

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* REMOVE: keras dependency

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

---------

Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* fixed distributed trainer in cyclones use case

* 3dgan integration (#118)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Unit test 4 dev (#113)

* Define a step for pytest execution

* Fix: use v1 of step action

* Print result of step composition

* Rename step

* Use step previous definition in the assessment

* Rename input: workflow -> steps

* Avoid caching by using 1.0.0

* Set container image

* Bump to v1

* Bump to sqaaas-assessment-action@v2

* Remove 'id' property

* Adapt inputs to v2

* Remove current branch

* Disable test_cyclones_train_tf

* ADD marker

* ADD skip memory heavy

* Disable for PRs

---------

Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>

* Distributed strategy launcher (#117)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* Distributed strategy launcher (#127)

Update ParseConfig

* Distributed strategy launcher (#128)

Remove experimental files

* Docs dev (#132)

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* fixed distributed trainer in cyclones use case

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* committing docs functionality for testing deployment

* adding documentation deployment relevant files

* updating readthedocs.yaml

* changing directory of requirements.txt

* updating reqs file

* committing changes and adding pages for tutorials

* adding installation instructions in docs

* adding latest changes to docs

* adding new pages for itwinai modules and other modifications

* modified src/itwinai/torch directory name to solve namespace conflict

* fixing tutorial sections

* fixes in pages appearance

* fixing rendering bugs

* fixing pages appearance bugs

* adding latest modifications

* Deleted duplicate folder after renaming src/itwinai/torch

* adding documentation.yml file for automatic updating on github pages

* modifying documentation.yml file

* updating reqs file to solve bug in deployment

* testing automated docs update

* updating getting started page

* fixing pages and adding new content

* bug fixes

* fixing content rendering

* latest fixes in rendering

* Add version feature to docs

* Update .readthedocs.yaml

* fixing display structure in getting started page

* new fixes similar to previous commit

* Update index.rst

* Update index.rst

Text re-edit index

* Update index.rst

change 1 word

* Update .readthedocs.yaml

* Update .readthedocs.yaml

* fixing getting started page

* Text review getting_started_with_itwinai.rst

* Update 3dgan_doc.rst

* Update getting_started_with_itwinai.rst

punctuation

* Fix torch naming problem

---------

Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Distributed strategy launcher (#131)

* ADD: distrib launcher mockup

* REFACTOR: cluster env, strategy and launcher

* ADD: Torch Elastic Launcher

* ADD: info on env vars

* ADD: distributed tooling and examples

* new folder

* UPDATE: distributed strategy setup

* generalized for DDP and DS

* add config file

* UPDATE: kwargs

* Update general_trainer.py

* Update general_startscript

* Update general_trainer.py

* UPDATE .gitignore

* Update distrib strategy

* UPDATE torch distributed strategy classes

* Updated docstrings

* Small fixes

* UPDATE docstrings

* ADD deepspeed config loader

* ADD first deepspeed tutorial draft

* UPDATE DDP Dp distrib strategy

* UPDATE horovod strategy

* UPDATE tutorial on torch distributed strategies

* UPDATE torch strategies tutorial

* Update createEnvJSC.sh

* Update hvd_slurm.sh

* Update README.md

* UPDATE distributed tutorial

* Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0

* Fixes to deepspeed startscript

* Update distributed.py

* Update trainer.py

* UPDATE tutorial

* ADD draft MNIST tutorial

* UPDATE DDP tutorial for MNIST

* FIX small details

* Update distributed.py

* Added TF tutorials

* Fixes to tutorials

* Add files via upload

* Update Makefile

* Update README.md

* UPDATE tutorials

* UPDATE documentation and improve explainability

* UPDATE SLURM scripts

* FIX local rank mismatch

* fixed distributed trainer in cyclones use case

* UPDATE launcher

* UPDATE linter

* UPDATE format

* FIX linter

* FIX linter

* Update workflow

* UPDATE workflow

* update

* Update workflow

* UPDATE super linter to v6

* UPDATE super linter to v6.3.0

* UPDATE super linter to slim

* Cleanup

* Update tfmirrored_slurm.sh

* Update tfmirrored_slurm.sh

* REMOVE workflows legacy

* DELETE cyclegan use case

* UPDATE dist training tutorials torch

* RENAME folders with torch

* DRAFT torch imagenet tutorial

* UPDATE configuration

* UPDATE imagenet tutorial

* DRAFT scaling test

* ADD scaling analysis report

* FIX deepspeed micro batchsize

* UPDATE data path

* UPDATE checkpoint to avoid race conditions

* UPDATE scalability report

* UPDATE dataset path

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* Update README.md

* Update README.md

* JUBE benchmarks

* Update createEnvJSC.sh

* Update createEnvJSCTF.sh

* ADD logy scale option

* Extract JUBE tutorial

* CLEANUP baselines

* Log epoch time in real-time

* FIX deepspeed dataloader for potential performance improvement

* UPDATE SC bash severity

* FIX deepspeed and horovod trainers

* FIX some code checks

* Unify redundant SLURM job scripts and configuration files

* CLEANUP unused configuration

* Reorg configurations

* Refactor configurations and add documentation

* Update README

* ADD report image

* Improve plot resolution

* UPDATE scaling test

* UPDATE launcher scripts

* FIX linter

* REMOVE jube tutorial

* Restore ConfigParser

* FIX type hinting

* ADD dev dependencies

* REMOVE experimental scripts

* UPDATE scaling report

* Add SLURM logs

* Refactor log scale

* Update scalability report

* Unify SLURM logs per job

* Update README.md

* Update README.md

* Update README.md

* ADD itwinai installation

* UPDATE torch distributed tutorial 0

* UPDATE torch distributed tutorials

* REMOVE imagenet tutorial

* ADD NonDistributedStrategy and create_dataloader method

* CLEANUP older classes

* Rename strategies

* Simplify structure

* ADD draft new torch trainer class

* UPDATED torch trainer draft

* UPDATE MNIST use case

* Integrate new trainer into MNIST use case

* UPDATE structure: remove unused files and refactor tests

* Tmp disable unused tests

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* Update action

* FIX failing inference

* Functional tests (#133)

* UPDATE tests

* FIX errors

* CLEANUP

* Remove unused workflow

---------

Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>

* 3dgan integration (#134)

* fixed distributed trainer in cyclones use case

* committing integration of 3dgan scripts

* ADD: Download dataset

* FIX: DDP distributed training with manual optimization

* ADD: log with MLFlow

* Sqaaas code (#88)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Sqaaas code (#89)

* Create sqaaas.yml

* Update sqaaas.yml

* Update sqaaas.yml

* Point to the current repo

* Remove unnecessary checkout step

* Rename step

* ADD: adaptive branch discovery for SQAaaS action

* Update sqaaas.yml

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD: draft predictor and saver

* ADD: stub for inference pipeline

* ADD: small docs

* UPDATE: inference pipeline components

* UPDATE: reorg

* ADD: image generation for inference

* update tag

* ADD: threshold

* ADD: draft inference

* ADD: draft inference wf

* ADD: working inference workflow

* ADD: 3D scatter plots

* ADD: Dockerfile + refactor

* ADD: .dockerignore

* Update .dockerignore

* ADD: skip download option

* ADD: cern pipeline.yaml

* UPDATE: dataset loading function

* UPDATE: dataset loading function

* UPDATE conf

* UPDATE refactor

* UPDATE refactor

* UPDATE training docs

* Update readme

* update README

* FIX typo

* Update README

* Update mkdir

* UPDATE data paths

* UPDATE Dockerfile

* UPDATE Dockerfiles

* UPDATE for Singularity execution

* FIX version mismatch

* UPDATE Singularity docs

* Named steps pipe (#100)

* ADD: dict steps pipe

* Relax dependency constraint

* UPDATE Singularity exec command

* UPDATE: Image version

* UPDATE: load components from pipeline

* ADD: docs

* Simplify 3DGAN model config

* ADD: mlflow autologging support for PL trainer

* UPDATE container info

* Refactor

* UPDATE dependencies

* FIX linter problem

* Simplified workflow configuration (#108)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* Simplified workflow configuration (#109)

* Add SQAaaS dynamic badge for dev branch (#104)

* Add SQAaaS dynamic badge

* Upgrade to sqaaas-assessment-action@v2

* Add draft example

* UPDATE credits field

* ADD docs

* REFACTOR components and pipeline code

* UPDATE docstring

* UPDATE mnist torch uc

* ADD config file parser draft

* ADD itwinaiCLI and ConfigParser

* ADD docs

* ADD pipeline parser and serializer plus tests

* UPDATE docs

* ADD adapter component and tests (incl parser)

* ADD splitter component, improve pipeline, tests

* UPDATE test

* REMOVE todos

* ADD component tests

* ADD serializer tests

* FIX linter

* ADD basic workflow tutorial

* ADD basic intermediate tutorial

* ADD advanced tutorial

* UPDATE advanced tutorial

* UPDATE use cases

* UPDATE save parameters

* FIX linter

* FIX cyclones use case workflow

* ADD slurm jobscript

* FIX merge error

* FIX components template

---------

Co-authored-by: orviz <orviz@ifca.unican.es>

* ADD integration tests

* FIX test

* FIX 3dgan inference test

* ADD GPU support and update tag

* FIX linter

* ADD override example

* UPDATE 3DGAN inference

* UPDATE inference execution tutorials

* UPDATE README

* UPDATE saver saving sparse tensors

* ADD interlink pods

* UPDATE pod name

* UPDATE annotations

* FIX README

* CLEANUP

* Merge

* update

* ADD tf cpu env

* Update Makefile

* FIX 3DGAN tests

* FIX data folder path

* ADD offloading of 3DGAN training

* ADAPT 3DGAN training for singularity execution

* UPDATE test and fix linter

---------

Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: orviz <orviz@ifca.unican.es>

* Move to python venv

* Update Makefile

* Add Horovod installation

* Update env

* FIX openmpi install

* Add TF explicit version

* UPDATE env creation

* REMOVE constraint on torch 2.0.*

* UPDATE installation

* FIX test

* REMOVE strict dependency on micromamba

* FIX docs and debugging states

* FIX cpu only installation

* FIX deepspeed cpu installation

* FIX tf env creation

* FIX makefile

* ADD torch and tensorflow Docker containers

* Working DDP

* REFACTOR torch container build scripts

* FIX MPI env var set

* Incomplete containers

* UPDATE Dockerfiles

* REFACTOR Dockerfiles

* Rename

* UPDATE containers files and tutorial

* CLEANUP old doc pages

* ADD containers tutorials

* ADD containers tutorials

* UPDATE deps

* UPDATE deps

* UPDATE deps

* UPDATE docs and tutorials

* CLEANUP duplicates

* Update tests and scripts

* ADD labels

* CLEANUP

* Add docs and fix deepspeed launcher

* UPDATE linter settings

* FIX slow unit test on 3DGAN train

* ADD 3dgan sample dataset

---------

Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com>
Co-authored-by: linxUser3574 <neonikkus@gmail.com>
Co-authored-by: orviz <orviz@ifca.unican.es>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>

* Update config.yaml

* Update run_docker.sh

* fixing linting errors

* update num_workers in eurac script

* run isort on eurac files

* Update distributed.py

* Update trainer.py

* Update trainer.py

* Added option for setting checkpointing frequency in virgo trainer, added newer hpo scripts

* Reproducibility features on train_hpo.py

* Added seeding to train function and added additional plotting functions

* Fixed deepspeed launcher for scaling test, added option to set checkpoint frequency in NoiseGeneratorTrainer

* Changed a comment regarding checkpointing frequency

* Uncommented parts of runall.sh

* config

* update mse metric

* add debug prints and horovod option (WIP)

* small cleanup

* match newest config file

* add support for ddp, deepspeed and horovod

* add functionality needed for scalability test

* Cleaned up old hpo files, added itwinai Pipeline to HPO script

* Added pipeline option to hpo python script

* Deleted deprecated file train.py

* Amended config path to be runnable from anywhere

* update slurm, runall and scaling test scripts + some cleanup

* add small bugfix in console logger

* make slurm.sh compatible with run.sh

* run isort

* fix linting errors

* move directory creation to slurm script and small eurac cleanup

* Added first version of eurac hpo integration

* add check for non-distributed strategy

* logging model

* Fixed path bugs, eurac hpo works now

* Unused imports

* remove redundant if statements and run black formatter

* Minimise hpo slurm script

* Updated ray scripts for eurac to be the same as in virgo, deleted unused old hpo files

* Added override for loggers field, so that the config.yaml does not have to be changed for hpo to work

* Merge eurac-usecase into virgo-hpo-playground

* isort

* changed line length to 95

* saving pre-trained model and exporting run and experiment to remote tracking server

* Fixed Virgo dataloading

* update trainer imports

* update config

* combine config files into one

* run isort on folder

* remove unused file

* update gather methods and fix stylistic changes

* Update README.md

* Update README.md

* cleanup temp code and small linting errors

* add stuff to config and more linting

* Fixed virgo dataloading (and cpu/gpu utilisation with hpo)

* Incorporated first pull requests

* code cleanup for PR

* Incorporated PR comments

* Spelling error in README

* Some more changes to README.

* Incorporated comments for data generation, added info in README

* Spelling errors

* Isort

* Update file_gen.py

* Update file_gen.py

* Update README.md

* Update README.md

* update readme

* remove run.sh

* fix typo in readme

* fix device issue with GANTrainer and run black formatter

* Incorporated PR comments

* Merged config files together

* Update trainer.py

* Update config.yaml

* Update config.py

* Update config.yaml

* Pr changes VirgoConfiguration

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com>
Co-authored-by: Matteo Bunino <48362942+matbun@users.noreply.github.com>
Co-authored-by: linxUser3574 <neonikkus@gmail.com>
Co-authored-by: Matteo Bunino <matteo.bunino@gmail.com>
Co-authored-by: orviz <orviz@ifca.unican.es>
Co-authored-by: Kalliopi Tsolaki <ktsolaki@lxplus778.cern.ch>
Co-authored-by: zoechbauer1 <zoechbauer1@hdfmll01.hdfml>
Co-authored-by: Mario Rüttgers <ruettgers1@hdfmll01.hdfml>
Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com>
Co-authored-by: r-sarma <r.sarma@fz-juelich.de>
Co-authored-by: KalliopiTsolaki <tsolaki.kal@gmail.com>
Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com>
Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com>
Co-authored-by: Killian Verder <killian.verder@cern.ch>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: KalliopiTsolaki <51197740+KalliopiTsolaki@users.noreply.github.com>
Co-authored-by: KalliopiTsolaki <ktsolaki@LAPTOP-4683QBL6>
Co-authored-by: iferrario <iferrario@eurac.edu>
Co-authored-by: iacopoff <iacopo.ff@gmail.com>
Co-authored-by: ferrario2 <ferrario2@hdfmll01.hdfml>
Co-authored-by: Iacopo <38247963+iacopoff@users.noreply.github.com>
Co-authored-by: MutegekiHenry <henrymutegeki117@gmail.com>
Co-authored-by: MutegekiHenry <henrymutegeki11@gmail.com>
Co-authored-by: Anna Lappe <A.Lappe@campus.lmu.de>
Co-authored-by: Henry Mutegeki <36065782+MutegekiHenry@users.noreply.github.com>
---
 src/itwinai/cli.py                            |  8 ++-
 src/itwinai/torch/trainer.py                  | 71 +++++++++----------
 use-cases/eurac/config.yaml                   |  2 +-
 use-cases/eurac/runall.sh                     |  6 +-
 use-cases/eurac/slurm.sh                      |  2 +-
 use-cases/eurac/trainer.py                    | 16 +----
 use-cases/virgo/README.md                     | 35 ---------
 .../virgo/synthetic_data_gen/file_gen.py      |  2 -
 8 files changed, 47 insertions(+), 95 deletions(-)

diff --git a/src/itwinai/cli.py b/src/itwinai/cli.py
index 6c33c39b..c6598694 100644
--- a/src/itwinai/cli.py
+++ b/src/itwinai/cli.py
@@ -282,13 +282,16 @@ def exec_pipeline(
 @app.command()
 def mlflow_ui(
     path: str = typer.Option("ml-logs/", help="Path to logs storage."),
+    port: int = typer.Option(
+        5000, help="Port on which the MLFlow UI is listening."
+    ),
 ):
     """
     Visualize Mlflow logs.
     """
     import subprocess
 
-    subprocess.run(f"mlflow ui --backend-store-uri {path}".split())
+    subprocess.run(f"mlflow ui --backend-store-uri {path} --port {port}".split())
 
 
 @app.command()
@@ -302,8 +305,7 @@ def mlflow_server(
     """
     import subprocess
 
-    subprocess.run(
-        f"mlflow server --backend-store-uri {path} --port {port}".split())
+    subprocess.run(f"mlflow server --backend-store-uri {path} --port {port}".split())
 
 
 @app.command()
diff --git a/src/itwinai/torch/trainer.py b/src/itwinai/torch/trainer.py
index 51a84845..121372bc 100644
--- a/src/itwinai/torch/trainer.py
+++ b/src/itwinai/torch/trainer.py
@@ -298,7 +298,6 @@ def create_dataloaders(
         # Dear user, this is a method you #
         # may be interested to override!  #
         ###################################
-
         self.train_dataloader = self.strategy.create_dataloader(
             dataset=train_dataset,
             batch_size=self.config.batch_size,
@@ -373,7 +372,7 @@ def execute(
 
         if self.logger:
             self.logger.destroy_logger_context()
-        # self.strategy.clean_up()
+        self.strategy.clean_up()
         return train_dataset, validation_dataset, test_dataset, self.model
 
     def _set_epoch_dataloaders(self, epoch: int):
@@ -521,8 +520,10 @@ def train(self):
                 val_loss = self.validation_epoch(epoch)
 
                 # Checkpointing current best model
-                worker_val_losses = self.strategy.gather(val_loss, dst_rank=0)
-                if self.strategy.is_main_worker:
+                worker_val_losses = self.strategy.gather(
+                    val_loss, dst_rank=0
+                )
+                if self.strategy.global_rank() == 0:
                     avg_loss = torch.mean(
                         torch.stack(worker_val_losses)
                     ).detach().cpu()
@@ -627,7 +628,7 @@ def train_step(
         )
         return loss, metrics
 
-    def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
+    def validation_epoch(self, epoch: int) -> torch.Tensor:
         """Perform a complete sweep over the validation dataset, completing an
         epoch of validation.
 
@@ -635,45 +636,43 @@ def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
             epoch (int): current epoch number, from 0 to ``self.epochs - 1``.
 
         Returns:
-            Optional[Loss]: average validation loss for the current epoch if 
-                self.validation_dataloader is not None
+            Loss: average validation loss for the current epoch.
         """
-        if self.validation_dataloader is None:
-            return
-
-        self.model.eval()
-        validation_losses = []
-        validation_metrics = []
-        for batch_idx, val_batch in enumerate(self.validation_dataloader):
-            loss, metrics = self.validation_step(
-                batch=val_batch,
-                batch_idx=batch_idx
-            )
-            validation_losses.append(loss)
-            validation_metrics.append(metrics)
+        if self.validation_dataloader is not None:
+            self.model.eval()
+            validation_losses = []
+            validation_metrics = []
+            for batch_idx, val_batch \
+                    in enumerate(self.validation_dataloader):
+                loss, metrics = self.validation_step(
+                    batch=val_batch,
+                    batch_idx=batch_idx
+                )
+                validation_losses.append(loss)
+                validation_metrics.append(metrics)
 
-            # Important: update counter
-            self.validation_glob_step += 1
+                # Important: update counter
+                self.validation_glob_step += 1
 
-        # Aggregate and log losses
-        avg_loss = torch.mean(torch.stack(validation_losses))
-        self.log(
-            item=avg_loss.item(),
-            identifier='validation_loss_epoch',
-            kind='metric',
-            step=self.validation_glob_step,
-        )
-        # Aggregate and log metrics
-        avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
-        for m_name, m_val in avg_metrics.items():
+            # Aggregate and log losses
+            avg_loss = torch.mean(torch.stack(validation_losses))
             self.log(
-                item=m_val,
-                identifier='validation_' + m_name + '_epoch',
+                item=avg_loss.item(),
+                identifier='validation_loss_epoch',
                 kind='metric',
                 step=self.validation_glob_step,
             )
+            # Aggregate and log metrics
+            avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
+            for m_name, m_val in avg_metrics.items():
+                self.log(
+                    item=m_val,
+                    identifier='validation_' + m_name + '_epoch',
+                    kind='metric',
+                    step=self.validation_glob_step,
+                )
 
-        return avg_loss
+            return avg_loss
 
     def validation_step(
         self,
diff --git a/use-cases/eurac/config.yaml b/use-cases/eurac/config.yaml
index 8912e898..64ee45f7 100644
--- a/use-cases/eurac/config.yaml
+++ b/use-cases/eurac/config.yaml
@@ -6,7 +6,7 @@ tmp_stats: /p/scratch/intertwin/datasets/eurac/stats
 
 experiment: "drought use case lstm"
 run_name: "alps_test"
-epochs: 5
+epochs: 2
 random_seed: 1010
 lr: 0.001
 batch_size: 256
diff --git a/use-cases/eurac/runall.sh b/use-cases/eurac/runall.sh
index 4bbea260..6169366e 100755
--- a/use-cases/eurac/runall.sh
+++ b/use-cases/eurac/runall.sh
@@ -12,7 +12,7 @@ if [ -z "$NUM_GPUS" ]; then
 	NUM_GPUS=4
 fi
 if [ -z "$TIME" ]; then 
-	TIME=0:40:00
+	TIME=0:20:00
 fi
 if [ -z "$DEBUG" ]; then 
 	DEBUG=false
@@ -34,6 +34,6 @@ submit_job () {
 }
 
 echo "Running distributed training on $NUM_NODES nodes with $NUM_GPUS GPUs per node"
-# submit_job "ddp"
-# submit_job "deepspeed"
+submit_job "ddp"
+submit_job "deepspeed"
 submit_job "horovod"
diff --git a/use-cases/eurac/slurm.sh b/use-cases/eurac/slurm.sh
index e907e54c..e1ec58b1 100644
--- a/use-cases/eurac/slurm.sh
+++ b/use-cases/eurac/slurm.sh
@@ -100,7 +100,7 @@ if [ "$DIST_MODE" == "horovod" ] ; then
 	srun --cpu-bind=none \
 	--ntasks-per-node=$SLURM_GPUS_PER_NODE \
 	--cpus-per-task=$SLURM_CPUS_PER_GPU \
-	--ntasks=$(($SLURM_GPUS_PER_NODE * $SLURM_NNODES)) \
+	--ntasks=$SLURM_GPUS_PER_NODE \
 	$TRAINING_CMD
 else # E.g. for 'deepspeed' or 'ddp'
   srun --cpu-bind=none --ntasks-per-node=1 \
diff --git a/use-cases/eurac/trainer.py b/use-cases/eurac/trainer.py
index 7ab3e255..53c50202 100644
--- a/use-cases/eurac/trainer.py
+++ b/use-cases/eurac/trainer.py
@@ -1,7 +1,7 @@
 import os
 from pathlib import Path
 from timeit import default_timer as timer
-from typing import Dict, Literal, Optional, Union, Any, Tuple
+from typing import Dict, Literal, Optional, Union
 
 import pandas as pd
 import torch
@@ -13,7 +13,6 @@
 from hython.trainer import ConvTrainer, RNNTrainer, RNNTrainParams
 from ray import train
 from torch.optim.lr_scheduler import ReduceLROnPlateau
-from torch.utils.data import Dataset
 from tqdm.auto import tqdm
 
 from itwinai.loggers import EpochTimeTracker, Logger
@@ -26,7 +25,7 @@
 )
 from itwinai.torch.trainer import TorchTrainer
 from itwinai.torch.type import Metric
-from itwinai.components import profile_torch_trainer
+
 
 class RNNDistributedTrainer(TorchTrainer):
     """Trainer class for RNN model using pytorch.
@@ -88,17 +87,6 @@ def __init__(
             **kwargs,
         )
         self.save_parameters(**self.locals2params(locals()))
-        # self.execute = types.MethodType(profile_torch_trainer(self.execute), self)
-
-
-    @profile_torch_trainer
-    def execute(
-        self, 
-        train_dataset: Dataset, 
-        validation_dataset: Optional[Dataset] = None, 
-        test_dataset: Optional[Dataset] = None
-    ) -> Tuple[Dataset, Dataset, Dataset, Any]:
-        return super().execute(train_dataset, validation_dataset, test_dataset)
 
     def create_model_loss_optimizer(self) -> None:
         self.optimizer = optim.Adam(self.model.parameters(), lr=self.config.lr)
diff --git a/use-cases/virgo/README.md b/use-cases/virgo/README.md
index ca58c40f..d8f42147 100644
--- a/use-cases/virgo/README.md
+++ b/use-cases/virgo/README.md
@@ -123,38 +123,3 @@ You may change CLI variables for `hpo.py` to change parameters,
 such as the number of trials you want to run, to change the stopping criteria for the trials or to set a
 different metric on which ray will evaluate trial results.
 By default, trials monitor validation loss, and results are plotted once all trials are completed.
-
-## Generating Synthetic Data for the Virgo Use Case
-
-This project includes another SLURM job script, `synthetic_data_gen/data_generation.sh`, that allows
-users to generate synthetic dataset for the Virgo gravitational wave detector use case.
-This step is typically not required unless you need to create new synthetic datasets.
-
-The synthetic data is generated using a Python script, `file_gen.py`, which creates multiple files
-containing simulated data. Each file is a pickled pandas dataframe containing `datapoints_per_file`
-datapoints (defaults to 500), each
-one representing a set of time series for main and strain detector channels. 
-
-If you need to generate a new dataset, you can run the SLURM script with the following command:
-
-```bash
-sbatch data_generation.sh
-```
-
-The script will generate multiple data files and store them in separate folders, which are
-created in the `target_folder_name` directory.
-
-The generated pickle files are organized in a set of nested folders to avoid creating too many
-files in the same folder. To generate such folders and its files we use SLURM 
-[job arrays](https://slurm.schedmd.com/job_array.html).
-Each SLURM array job will create its own folder and populate it with the synthetic data files.
-The number of files created in each folder can be customized by setting the `NUM_FILES` environment
-variablebefore submitting the job.
-For example, to generate 50 files per array job, you can run:
-
-```bash
-export NUM_FILES=50
-sbatch data_generation.sh
-```
-
-If you do not specify `NUM_FILES`, the script will default to creating 100 files per folder.
diff --git a/use-cases/virgo/synthetic_data_gen/file_gen.py b/use-cases/virgo/synthetic_data_gen/file_gen.py
index 859de126..f9cf5f4b 100644
--- a/use-cases/virgo/synthetic_data_gen/file_gen.py
+++ b/use-cases/virgo/synthetic_data_gen/file_gen.py
@@ -7,8 +7,6 @@
 
 from ..src.dataset import generate_cut_image_dataset
 
-from ..src.dataset import generate_cut_image_dataset
-
 
 def generate_pkl_dataset(
     folder_name='test_folder',