From d0370dfd39e680bbf2de522cd935e5e9db4503f1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jarl=20S=C3=A6ther?= <60541573+jarlsondre@users.noreply.github.com>
Date: Tue, 15 Oct 2024 16:06:15 +0200
Subject: [PATCH] Usecase eurac (#219)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Backend (#59) * WIP: Tensorflow MNIST use-case * UPDATE: Tensorflow MNIST version * ADD: Backend * ADD: Use-case init * FIX: Paths and downloading of the data * FIX: Paths and downloading of the data * ADD: Setup, Config update * ADD: Setup, Config update * UPDATE: File movement into itwinai * FIX: Move utils from tensorflow to global folder * FIX: Add setup into torch Executable * ADD: MNIST Torch Use-case * FIX: Formatting * ADD: Lib * ADD: Lib * ADD: Tests, Fix Loggers * Update README.md * ADD: Tests * ADD: MLCC * ADD: Cyclones, Cyclones-pipe * ADD: TensorflowTrainer * UPDATE: Move TensorflowTrainer into Backend * FIX: Dependencies * ADD: Number of devices * ADD: initial version of TorchTrainer * update * update * ADD: distributed torch Trainer and decorator * ADD: New version of torch distribtued trainer and tests * ADD: load torch dist trainer form config file * ADD: multi-gpu pytorch trainer * ADD: download on login node * FIX: dataloaders in Trainer * FIX: add dataloaders into trainer * FIX: clear load and save state * ADD: Loggers * FIX: Log in a distributed environment * TensorFlow backend (#63) * UPDATE: Remove experimental distribution * ADD: Mnist distributed * ADD: Optional strategy * UPDATE: Conditional distribution * FIX: Dataloader for mnist * FIX: Model cloning lambda function for distributed scope * ADD: CycleGAN * UPDATE: Types * UPDATE: Types * ADD: Local distr * FIX: learning rates * ADD: CycleGAN distributed * FIX: Reduction * FIX: Distribution * ADD: tmp.py * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * UPDATE: 
Executors * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD:Initial VIRGO * UPDATE: Optional distribution, tensorflow-gpu * UPDATE: tensorflow-gpu dependency * ADD: Unify branches --------- Co-authored-by: User3574 * Refacto entire code base * ADD: workflows folder * FIX: refactor * FIX: linting * ADD: how to run use case doc * ADD: workflows doc * FIX: MD linter * Pipe MNIST lightning (#86) * ADD: lightning distributed + pipeline * UPDATE: jscpd threshold * UPDATE: super linter ignore use cases * ADD: jscpd ignore loggers * Functional tests for MNIST (#87) * ADD: use case tests * FIX: move use case models out of itwinai * FIX: rearrange modules * ADD: ConsoleLogger and LoggersCollection * FIX: loggers filter * FIX: add TF env creation * UPDATE: test flag * ADD: early pytest on slurm * FIX: duplicated code in TF Trainer * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * 3dgan use case (#94) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove 
unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Sqaaas code (#96) * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml * Update sqaaas.yml * ADD: adaptive branch discovery for SQAaaS actin * Trigger only on main and dev branches * ADD: double quote * Trigger pytest only on main and dev PRs * Torch mnist inference (#95) * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * Remove keras dependency * 3dgan integration (#97) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: 
draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * 3dgan integration (#98) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor 
* UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * 
REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * fixed distributed trainer in cyclones use case * 3dgan integration (#118) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch 
* UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save 
parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * U[date Makefile * FIX 3DGAN tests * FIX data folder path --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Unit test 4 dev (#113) * Define a step for pytest execution * Fix: use v1 of step action * Print result of step composition * Rename step * Use step previous definition in the assessment * Rename input: workflow -> steps * Avoid caching by using 1.0.0 * Set container image * Bump to v1 * Bump to sqaaas-assessment-action@v2 * Remove 'id' property * Adapt inputs to v2 * Remove current branch * Disable test_cyclones_train_tf * ADD marker * ADD skip memory heavy * Disable for PRs --------- Co-authored-by: Matteo Bunino * Distributed strategy launcher (#117) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update 
hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * 
Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * Distributed strategy launcher (#127) Update ParseConfig * Distributed strategy launcher (#128) Remove experimental files * Docs dev (#132) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * 
modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Distributed strategy launcher (#131) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF 
tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update 
README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * 3dgan integration (#134) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: 
reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * 
Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * U[date Makefile * FIX 3DGAN tests * FIX data folder path * ADD offloading of 3DGAN training * ADAPT 3DGAN training for singularity execution * UPDATE test and fix linter --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Docs dev (#135) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages 
appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem * UPDATE requirements --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Distributed strategy launcher (#137) * ADD: distrib launcher mockup * 
REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * 
UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing 
inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Update distributed.py * Update tfmirrored_slurm.sh * Update train.py * TF updates * Add README * Python venv (#136) * Move to python venv * Update Makefile * Add Horovod installation * Update env * FIX openmpi install * Add TF explicit version * UPDATE env creation * REMOVE constraint on torch 2.0.* * UPDATE installation * FIX test * REMOVE strict dependency on micromamba * FIX docs and debugging states * FIX cpu only installation * FIX deepspeed cpu installation * FIX tf env creation * FIX makefile * ADD pypi deployment * DISABLE push debug * UPDATE pypi * UPDATE classifiers * Update pyproject.toml --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * Update README.md * Distributed strategy launcher (#141) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update 
distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev 
dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Update distributed.py * Update tfmirrored_slurm.sh * Update train.py * TF updates * Add README * Python venv (#136) * Move to python venv * Update Makefile * Add Horovod installation * Update env * FIX openmpi install * Add TF explicit version * UPDATE env creation * REMOVE constraint on torch 2.0.* * UPDATE installation * FIX test * REMOVE strict dependency on micromamba * FIX docs and debugging states * FIX cpu only installation * FIX deepspeed cpu installation * FIX tf env creation * FIX makefile * ADD pypi deployment * DISABLE push debug * UPDATE pypi * UPDATE classifiers * Update pyproject.toml * Update README.md * Cyclone tf dist (#130) * 
get_stretegy * UPDATE distributed strategy * change req file * cyclone tf dist * small bugs * fix bug in train.py * REFACTOR cyclones use case * Activate pytest * NEW TensorFlow trainer * ADD user information --------- Co-authored-by: ruettgers1 Co-authored-by: Matteo Bunino * Interactive distrib ml (#139) Add examples for distributed ml in interactive mode * Interactive distrib ml (#140) Update tutorial * Disable documentation GH action * Remove action --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com> * Merge main (#142) Bring changes on main into dev * Virgo integration (#143) * ADD Virgo data pipeline and some refactoring * FIX typo * UPDATE README * ADD training * ADD TrainingConfiguration * ADD distributed training and refactor * update readme * UPDATE loggers and add tests * Refactor * FIX typo * UPDATE use cases instructions * ADD checkpointing and refactor. 
* FIX linter * FIX jscpd * FIX jscpd * Disable jscpd * Refactor loggers * ADD loggers to Virgo use case * Update AUTHORS.md * Update AUTHORS.md * Docs dev (#144) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * 
Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem * UPDATE requirements * Remove unnecessary dependencies * Add docstring * adding latest changes from dev * new content and changes * Update index.rst toctree revise * adding pages for distributed ml tutorials * new shpinx reqs to solve build failing * Docs update: - python code format fixed - added brief explanation on ddp in new section * requirements changed * UPDATE requirements * UPDATE requirements and itwinai.types * ADD CMake and GCC installation * UPDATE CMake and GCC installation * UPDATE CMake and GCC installation * ADD notebooks * Disable notebooks section * FIX TOC * Saving local changes before pulling from remote * saving updates before pull from origin * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * adding cyclones and virgo use cases pages * FIX build errors * Update TOC * Update TOC --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> Co-authored-by: Killian Verder * Update dev (#152) * Dev - itwinai 0.0.2 (#138) * Backend (#59) * WIP: Tensorflow MNIST use-case * UPDATE: Tensorflow MNIST version * ADD: Backend * ADD: Use-case init * FIX: Paths and downloading of the data * FIX: Paths and downloading of the data * ADD: Setup, Config update * ADD: Setup, Config update * UPDATE: File movement into itwinai * FIX: Move utils from tensorflow to global folder * FIX: Add setup into torch Executable * ADD: MNIST Torch 
Use-case * FIX: Formatting * ADD: Lib * ADD: Lib * ADD: Tests, Fix Loggers * Update README.md * ADD: Tests * ADD: MLCC * ADD: Cyclones, Cyclones-pipe * ADD: TensorflowTrainer * UPDATE: Move TensorflowTrainer into Backend * FIX: Dependencies * ADD: Number of devices * ADD: initial version of TorchTrainer * update * update * ADD: distributed torch Trainer and decorator * ADD: New version of torch distribtued trainer and tests * ADD: load torch dist trainer form config file * ADD: multi-gpu pytorch trainer * ADD: download on login node * FIX: dataloaders in Trainer * FIX: add dataloaders into trainer * FIX: clear load and save state * ADD: Loggers * FIX: Log in a distributed environment * TensorFlow backend (#63) * UPDATE: Remove experimental distribution * ADD: Mnist distributed * ADD: Optional strategy * UPDATE: Conditional distribution * FIX: Dataloader for mnist * FIX: Model cloning lambda function for distributed scope * ADD: CycleGAN * UPDATE: Types * UPDATE: Types * ADD: Local distr * FIX: learning rates * ADD: CycleGAN distributed * FIX: Reduction * FIX: Distribution * ADD: tmp.py * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * UPDATE: Executors * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD:Initial VIRGO * UPDATE: Optional distribution, tensorflow-gpu * UPDATE: tensorflow-gpu dependency * ADD: Unify branches --------- Co-authored-by: User3574 * Refacto entire code base * ADD: workflows folder * FIX: refactor * FIX: linting * 
ADD: how to run use case doc * ADD: workflows doc * FIX: MD linter * Pipe MNIST lightning (#86) * ADD: lightning distributed + pipeline * UPDATE: jscpd threshold * UPDATE: super linter ignore use cases * ADD: jscpd ignore loggers * Functional tests for MNIST (#87) * ADD: use case tests * FIX: move use case models out of itwinai * FIX: rearrange modules * ADD: ConsoleLogger and LoggersCollection * FIX: loggers filter * FIX: add TF env creation * UPDATE: test flag * ADD: early pytest on slurm * FIX: duplicated code in TF Trainer * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * 3dgan use case (#94) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: 
keras dependency * ADD: skip download option --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Sqaaas code (#96) * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml * Update sqaaas.yml * ADD: adaptive branch discovery for SQAaaS actin * Trigger only on main and dev branches * ADD: double quote * Trigger pytest only on main and dev PRs * Torch mnist inference (#95) * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * Remove keras dependency * 3dgan integration (#97) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS 
dynamic badge * Upgrade to sqaaas-assessment-action@v2 * 3dgan integration (#98) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to 
sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * fixed distributed trainer in cyclones use case * 3dgan integration (#118) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training 
with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and 
ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * U[date Makefile * FIX 3DGAN tests * FIX data folder path --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Unit test 4 dev (#113) * Define a step for pytest 
execution * Fix: use v1 of step action * Print result of step composition * Rename step * Use step previous definition in the assessment * Rename input: workflow -> steps * Avoid caching by using 1.0.0 * Set container image * Bump to v1 * Bump to sqaaas-assessment-action@v2 * Remove 'id' property * Adapt inputs to v2 * Remove current branch * Disable test_cyclones_train_tf * ADD marker * ADD skip memory heavy * Disable for PRs --------- Co-authored-by: Matteo Bunino * Distributed strategy launcher (#117) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update 
workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * Distributed strategy launcher (#127) Update ParseConfig * Distributed strategy launcher (#128) Remove experimental files * Docs dev (#132) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding 
installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming 
problem --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Distributed strategy launcher (#131) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE 
configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update 
action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * 3dgan integration (#134) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * 
Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript 
* FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * Update Makefile * FIX 3DGAN tests * FIX data folder path * ADD offloading of 3DGAN training * ADAPT 3DGAN training for singularity execution * UPDATE test and fix linter --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Docs dev (#135) * committing docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * committing changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * committing docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * committing changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch
directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem * UPDATE requirements --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Distributed strategy launcher (#137) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update 
hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * 
Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Update distributed.py * Update tfmirrored_slurm.sh * Update train.py * TF updates * Add README * Python venv (#136) * Move to python venv * Update Makefile * Add Horovod installation * Update env * FIX openmpi install * Add TF explicit version * UPDATE env creation * REMOVE constraint on torch 2.0.* * UPDATE installation * FIX test * REMOVE strict dependency on micromamba * FIX docs and debugging states * FIX cpu only installation * FIX deepspeed cpu installation * FIX tf env creation * FIX makefile 
* ADD pypi deployment * DISABLE push debug * UPDATE pypi * UPDATE classifiers * Update pyproject.toml --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * Update README.md * Distributed strategy launcher (#141) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy 
* DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * 
Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Fixes to TF new version errors * Update distributed.py * Update tfmirrored_slurm.sh * Update train.py * TF updates * Add README * Python venv (#136) * Move to python venv * Update Makefile * Add Horovod installation * Update env * FIX openmpi install * Add TF explicit version * UPDATE env creation * REMOVE constraint on torch 2.0.* * UPDATE installation * FIX test * REMOVE strict dependency on micromamba * FIX docs and debugging states * FIX cpu only installation * FIX deepspeed cpu installation * FIX tf env creation * FIX makefile * ADD pypi deployment * DISABLE push debug * UPDATE pypi * UPDATE classifiers * Update pyproject.toml * Update README.md * Cyclone tf dist (#130) * get_stretegy * UPDATE distributed strategy * change req file * cycline tf dist * small bugs * fix bug in train.py * REFACTOR cyclones use case * Activate pytest * NEW TensorFlow trainer * ADD user information --------- Co-authored-by: ruettgers1 Co-authored-by: Matteo Bunino * Interactive distrib ml (#139) Add examples for distributed ml in interactive mode * Interactive distrib ml (#140) Update tutorial * Disable documentation GH action * Remove action --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com> * Merge main (#142) Bring changes on main into dev * Virgo 
integration (#143) * ADD Virgo data pipeline and some refactoring * FIX typo * UPDATE README * ADD training * ADD TrainingConfiguration * ADD distributed training and refactor * update readme * UPDATE loggers and add tests * Refactor * FIX typo * UPDATE use cases instructions * ADD checkpointing and refactor. * FIX linter * FIX jscpd * FIX jscpd * Disable jscpd * Refactor loggers * ADD loggers to Virgo use case * Update AUTHORS.md * Update AUTHORS.md * Docs dev (#144) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding 
documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem * UPDATE requirements * Remove unnecessary dependencies * Add docstring * adding latest changes from dev * new content and changes * Update index.rst toctree revise * adding pages for distributed ml tutorials * new shpinx reqs to solve build failing * Docs update: - python code format fixed - added brief explanation on ddp in new section * requirements changed * UPDATE requirements * UPDATE requirements and itwinai.types * ADD CMake and GCC installation * UPDATE CMake and GCC installation * UPDATE CMake and GCC installation * ADD notebooks * Disable notebooks section * FIX TOC * Saving local changes before pulling from remote * saving updates before pull from origin * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * Update itwinai.torch.modules.rst * adding cyclones and virgo use cases pages * FIX build errors * Update TOC * Update TOC --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> Co-authored-by: Killian Verder --------- Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com> Co-authored-by: linxUser3574 Co-authored-by: orviz 
Co-authored-by: Kalliopi Tsolaki Co-authored-by: zoechbauer1 Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: KalliopiTsolaki Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com> Co-authored-by: Killian Verder * Delete .github/workflows/pages.yml * ADD quick install for users (#145) * User install (#146) * ADD quick install for users * UPDATE installer * fix framework selection * UPDATE installer * Update README.md * Update README.md * Improve docstring parsing and refactor (#147) * UPDATE print patch and refactor * Cleanup * Cleanup * Cleanup * Cleanup * FIX broken import * UPDATE docs * FIX docstring parsing * Preserve ordering * Update cli.py * Update docs (#148) * Update README.md * ADD missing docstrings * Bump actions/setup-python from 4 to 5 (#149) Bumps [actions/setup-python](https://github.com/actions/setup-python) from 4 to 5. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/v4...v5) --- updated-dependencies: - dependency-name: actions/setup-python dependency-type: direct:production update-type: version-update:semver-major ...
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update README.md * Update README.md * Update README.md * updating doc pages (#150) Co-authored-by: KalliopiTsolaki * Update cyclones_doc.rst * Bug fixes and addition of CERFACS use-case (#151) * Update train.py * Update generic_tf.sh * Update pyproject.toml * Update train.py * Fix: head problems with MacOS * Fixes for MacOS support * Fix: Update basic_components.py * Addition of cerfacs use-case * Update README.md * Update train.py * Update cyclones_doc.rst * Update startscript.sh * Update pyproject.toml * Update mnist.py * Update mnist.py * Update generic_tf.sh * Update requirements.txt * Update requirements.txt * Docs changes (#153) * updating doc pages * testing if changing the GH edit url works * adding repo link in toc --------- Co-authored-by: KalliopiTsolaki * Update pyproject.toml --------- Signed-off-by: dependabot[bot] Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com> Co-authored-by: linxUser3574 Co-authored-by: orviz Co-authored-by: Kalliopi Tsolaki Co-authored-by: zoechbauer1 Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: KalliopiTsolaki Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com> Co-authored-by: Killian Verder Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: KalliopiTsolaki <51197740+KalliopiTsolaki@users.noreply.github.com> Co-authored-by: KalliopiTsolaki * Update sqaaas.yml * added train to start integration * update requirements.txt * downsampling option * fix plot * ADD support for user-provided distributed samplers * ADD first distributed draft with random distributed sampler * UPDATE comments * ADAPT train_val for distributed * UPDATE docs * UPDATE
installation instructions * prepare for distributed run * enable distributed sampler * fix * prepare to run on JSC * update train * copied slurm.sh from virgo usecase * update convlstm * blacked * update surrogate input to scratch * Update dist-train.py * Update dist-train.py correct tqdm import error * Update dist-train.py in train_val, strategy.device changed to strategy.device() * add distributed support * add gather * add logging * clean up script * update default batch-size * refine variables to commandline args * Update cli.py * update torch dist final * Add parameter run_name to mlflow logger * prepared slurm script * add data generator scripts and data loader * add scaling tests * add scaling tests * add plots * fix env path * add hpo eurac * correct start hpo cmd * test distributed slurm * fix hpo functionality * working distributed version * add instructions for HPO and scaling tests * add hpo results visualization script * update data path * convlstm * remove rnn_config from eurac tutorial * conv * Prov4ml integration (#192) * ADD prov4ml logger * UPDATE enum access fields * UPDATE loggers documentation and first integration attempt * ADD prov logger * format kinds table * MIGRATE to upstream prov4ml * ADD docs build on JSC * ADD RTD website * UPDATE docs creation * Refactor * UPDATE logger * Remove lightning callbacks and loggers * ADD checkpoints * UPDATE logger kind docs * Update README.md * ADD rank on loggers * Update loggers.py * Update loggers.py * Update loggers.py * Update loggers.py * Update loggers.py * FIX linter * REFACTOR loggers * Simplify prov4ml switch case * UPDATE loggers * FIX prov graph * REFACTOR itwinai logging * UPDATE SLURM jobscripts * REFACTOR * Update * ADD prov experiments * REFACTOR provenance logs and SLURM jobscripts * REMOVE duplication * FIX dataset name * UPDATE README * SKIP cyclones use case * UPDATE version * REMOVE redundant parameter * CLEANUP * ADD warning * ADD warning * UPDATE README * FIX errors * ADD docs *
UPDATE scripts * UPDATE scripts * fix gan bug and update docs (#193) * Update index.rst * Update index.rst * Update index.rst * Update pyproject.toml * Update pyproject.toml * Bump github/super-linter from 6 to 7 (#198) Bumps [github/super-linter](https://github.com/github/super-linter) from 6 to 7. - [Release notes](https://github.com/github/super-linter/releases) - [Changelog](https://github.com/github/super-linter/blob/main/CHANGELOG.md) - [Commits](https://github.com/github/super-linter/compare/v6...v7) --- updated-dependencies: - dependency-name: github/super-linter dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Itwinai container (#197) * Backend (#59) * WIP: Tensorflow MNIST use-case * UPDATE: Tensorflow MNIST version * ADD: Backend * ADD: Use-case init * FIX: Paths and downloading of the data * FIX: Paths and downloading of the data * ADD: Setup, Config update * ADD: Setup, Config update * UPDATE: File movement into itwinai * FIX: Move utils from tensorflow to global folder * FIX: Add setup into torch Executable * ADD: MNIST Torch Use-case * FIX: Formatting * ADD: Lib * ADD: Lib * ADD: Tests, Fix Loggers * Update README.md * ADD: Tests * ADD: MLCC * ADD: Cyclones, Cyclones-pipe * ADD: TensorflowTrainer * UPDATE: Move TensorflowTrainer into Backend * FIX: Dependencies * ADD: Number of devices * ADD: initial version of TorchTrainer * update * update * ADD: distributed torch Trainer and decorator * ADD: New version of torch distribtued trainer and tests * ADD: load torch dist trainer form config file * ADD: multi-gpu pytorch trainer * ADD: download on login node * FIX: dataloaders in Trainer * FIX: add dataloaders into trainer * FIX: clear load and save state * ADD: Loggers * FIX: Log in a distributed environment * TensorFlow backend (#63) * UPDATE: Remove experimental distribution * ADD: Mnist distributed * ADD: 
Optional strategy * UPDATE: Conditional distribution * FIX: Dataloader for mnist * FIX: Model cloning lambda function for distributed scope * ADD: CycleGAN * UPDATE: Types * UPDATE: Types * ADD: Local distr * FIX: learning rates * ADD: CycleGAN distributed * FIX: Reduction * FIX: Distribution * ADD: tmp.py * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * FIX: Distribution * UPDATE: Executors * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * FIX: Distributed Dataset * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD: Ray * ADD:Initial VIRGO * UPDATE: Optional distribution, tensorflow-gpu * UPDATE: tensorflow-gpu dependency * ADD: Unify branches --------- Co-authored-by: User3574 * Refacto entire code base * ADD: workflows folder * FIX: refactor * FIX: linting * ADD: how to run use case doc * ADD: workflows doc * FIX: MD linter * Pipe MNIST lightning (#86) * ADD: lightning distributed + pipeline * UPDATE: jscpd threshold * UPDATE: super linter ignore use cases * ADD: jscpd ignore loggers * Functional tests for MNIST (#87) * ADD: use case tests * FIX: move use case models out of itwinai * FIX: rearrange modules * ADD: ConsoleLogger and LoggersCollection * FIX: loggers filter * FIX: add TF env creation * UPDATE: test flag * ADD: early pytest on slurm * FIX: duplicated code in TF Trainer * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update 
sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * 3dgan use case (#94) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Sqaaas code (#96) * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml * Update sqaaas.yml * ADD: adaptive branch discovery for SQAaaS actin * Trigger only on main and dev branches * ADD: double quote * Trigger pytest only on main and dev PRs * Torch mnist inference (#95) * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * Remove keras dependency * 3dgan integration (#97) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * 
ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * 3dgan integration (#98) * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference 
pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * REMOVE: keras dependency * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge 
for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test --------- Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * fixed distributed trainer in cyclones use case * 3dgan integration (#118) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 
3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist 
torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * U[date Makefile * FIX 3DGAN tests * FIX data folder path --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Unit test 4 dev (#113) * Define a step for pytest execution * Fix: use v1 of step action * Print result of step composition * Rename step * Use step previous definition in the assessment * Rename input: workflow -> steps * Avoid caching by using 1.0.0 * Set container image * Bump to v1 * Bump to sqaaas-assessment-action@v2 * Remove 'id' property * Adapt inputs to v2 * Remove current branch * Disable test_cyclones_train_tf * ADD marker * ADD skip memory heavy * Disable for PRs --------- Co-authored-by: Matteo Bunino * Distributed strategy launcher (#117) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs 
* Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * 
Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * Distributed strategy launcher (#127) Update ParseConfig * Distributed strategy launcher (#128) Remove experimental files * Docs dev (#132) * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * fixed distributed trainer in cyclones use case * adding installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * commiting docs functionality for testing deployment * adding documentation deployment relevant files * updating readthedocs.yaml * changing directory of requirements.txt * updating reqs file * commiting changes and adding pages for tutorials * adding 
installation instructions in docs * adding latest changes to docs * adding new pages for itwinai modules and other modifications * modified src/itwinai/torch directory name to solve namespace conflict * fixing tutorial sections * fixes in pages appearance * fixing rendering bugs * fixing pages appearance bugs * adding latest modifications * Deleted duplicate folder after renaming src/itwinai/torch * adding documentation.yml file for automatic updating on github pages * modifying documentation.yml file * updating reqs file to solve bug in deployment * testing automated docs update * updating getting started page * fixing pages and adding new content * bug fixes * fixing content rendering * latest fixes in rendering * Add version feature to docs * Update .readthedocs.yaml * fixing display structure in getting started page * new fixes similar to previous commit * Update index.rst * Update index.rst Text re-edit index * Update index.rst change 1 word * Update .readthedocs.yaml * Update .readthedocs.yaml * fixing getting started page * Text review getting_started_with_itwinai.rst * Update 3dgan_doc.rst * Update getting_started_with_itwinai.rst punctuation * Fix torch naming problem --------- Co-authored-by: KalliopiTsolaki Co-authored-by: zoechbauer1 Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Distributed strategy launcher (#131) * ADD: distrib launcher mockup * REFACTOR: cluster env, strategy and launcher * ADD: Torch Elastic Launcher * ADD: info on env vars * ADD: distributed tooling and examples * new folder * UPDATE: distributed strategy setup * generalized for DDP and DS * add config file * UPDATE: kwargs * Update general_trainer.py * Update general_startscript * Update general_trainer.py * UPDATE .gitignore * Update distrib strategy * UPDATE torch distributed strategy classes * Updated docstrings * Small fixes * UPDATE docstrings * ADD deepespeed config loader * ADD first deepspeed tutorial draft * UPDATE DDP Dp distrib strategy * UPDATE 
horovod strategy * UPDATE tutorial on torch distributed strategies * UPDATE torch strategies tutorial * Update createEnvJSC.sh * Update hvd_slurm.sh * Update README.md * UPDATE distributed tutorial * Delete tutorials/distributed-ml/torch-ddp-deepspeed-horovod/0 * Fixes to deepspeed startscript * Update distributed.py * Update trainer.py * UPDATE tutorial * ADD draft MNIST tutorial * UPDATE DDP tutorial for MNIST * FIX small details * Update distributed.py * Added TF tutorials * Fixes to tutorials * Add files via upload * Update Makefile * Update README.md * UPDATE tutorials * UPDATE documentation and improve explainability * UPDATE SLURM scripts * FIX local rank mismatch * fixed distributed trainer in cyclones use case * UPDATE launcher * UPDATE linter * UPDATE format * FIX linter * FIX linter * Update workflow * UPDATE workflow * update * Update workflow * UPDATE super linter to v6 * UPDATE super linter to v6.3.0 * UPDATE super linter to slim * Cleanup * Update tfmirrored_slurm.sh * Update tfmirrored_slurm.sh * REMOVE workflows legacy * DELETE cyclegan use case * UPDATE dist training tutorials torch * RENAME folders with torch * DRAFT torch imagenet tutorial * UPDATE configuration * UPDATE imagenet tutorial * DRAFT scaling test * ADD scaling analysis report * FIX deepspeed micro batchsize * UPDATE data path * UPDATE checkpoint to avoid race conditions * UPDATE scalability report * UPDATE dataset path * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSC.sh * Update createEnvJSCTF.sh * Update README.md * Update README.md * JUBE benchmarks * Update createEnvJSC.sh * Update createEnvJSCTF.sh * ADD logy scale option * Extract JUBE tutorial * CLEANUP baselines * Log epoch time in real-time * FIX deepspeed dataloader for potential performances improvement * UPDATE SC bash severity * FIX deepspeed and horovod trainers * FIX some code checks * Unify redundant SLURM job scripts and configuration files * 
CLEANUP unused configuration * Reorg configurations * Refactor configurations and add documentation * Update README * ADD report image * Improve plot resolution * UPDATE scaling test * UPDATE launcher scripts * FIX linter * REMOVE jube tutorial * Restore ConfigParser * FIX type hinting * ADD dev dependencies * REMOVE experimental scripts * UPDATE scaling report * Add SLURM logs * Refactor log scale * Update scalability report * Unify SLURM logs per job * Update README.md * Update README.md * Update README.md * ADD itwinai installation * UPDATE torch distributed tutorial 0 * UPDATE torch distributed tutorials * REMOVE imagenet tutorial * ADD NonDistributedStrategy and create_dataloader method * CLEANUP older classes * Rename strategies * Simplify structure * ADD draft new torch trainer class * UPDATED torch trainer draft * UPDATE MNIST use case * INtegrate new trainer into MNIST use case * UPDATE structure: remove unused files and refactor tests * Tmp disable unused tests * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * Update action * FIX failing inference * Functiona tests (#133) * UPDATE tests * FIX errors * CLEANUP * Remove unused workflow --------- Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: zoechbauer1 * 3dgan integration (#134) * fixed distributed trainer in cyclones use case * commiting integration of 3dgan scripts * ADD: Download dataset * FIX: DDP distributed training with manual optimization * ADD: log with MLFlow * Sqaaas code (#88) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove 
unnecessary checkout step * Rename step --------- Co-authored-by: orviz * Sqaaas code (#89) * Create sqaaas.yml * Update sqaaas.yml * Update sqaaas.yml * Point to the current repo * Remove unnecessary checkout step * Rename step * ADD: adaptive branch discovery for SQAaaS action * Update sqaaas.yml --------- Co-authored-by: orviz * ADD: draft predictor and saver * ADD: stub for inference pipeline * ADD: small docs * UPDATE: inference pipeline components * UPDATE: reorg * ADD: image generation for inference * update tag * ADD: threshold * ADD: draft inference * ADD: draft inference wf * ADD: working inference workflow * ADD: 3D scatter plots * ADD: Dockerfile + refactor * ADD: .dockerignore * Update .dockerignore * ADD: skip download option * ADD: cern pipeline.yaml * UPDATE: dataset loading function * UPDATE: dataset loading function * UPDATE conf * UPDATE refactor * UPDATE refactor * UPDATE training docs * Update readme * update README * FIX typo * Update README * Update mkdir * UPDATE data paths * UPDATE Dockerfile * UPDATE Dockerfiles * UPDATE for Singularity execution * FIX version mismatch * UPDATE Singularity docs * Named steps pipe (#100) * ADD: dict steps pipe * Relax dependency constraint * UPDATE Singularity exec command * UPDATE: Image version * UPDATE: load components from pipeline * ADD: docs * Simplify 3DGAN model config * ADD: mlflow autologging support for PL trainer * UPDATE container info * Refactor * UPDATE dependencies * FIX linter problem * Simplified workflow configuration (#108) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, 
improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow --------- Co-authored-by: orviz * Simplified workflow configuration (#109) * Add SQAaaS dynamic badge for dev branch (#104) * Add SQAaaS dynamic badge * Upgrade to sqaaas-assessment-action@v2 * Add draft example * UPDATE credits field * ADD docs * REFACTOR components and pipeline code * UPDATE docstring * UPDATE mnist torch uc * ADD config file parser draft * ADD itwinaiCLI and ConfigParser * ADD docs * ADD pipeline parser and serializer plus tests * UPDATE docs * ADD adapter component and tests (incl parser) * ADD splitter component, improve pipeline, tests * UPDATE test * REMOVE todos * ADD component tests * ADD serializer tests * FIX linter * ADD basic workflow tutorial * ADD basic intermediate tutorial * ADD advanced tutorial * UPDATE advanced tutorial * UPDATE use cases * UPDATE save parameters * FIX linter * FIX cyclones use case workflow * ADD slurm jobscript * FIX merge error * FIX components template --------- Co-authored-by: orviz * ADD integration tests * FIX test * FIX 3dgan inference test * ADD GPU support and update tag * FIX linter * ADD override example * UPDATE 3DGAN inference * UPDATE inference execution tutorials * UPDATE README * UPDATE saver saving sparse tensors * ADD interlink pods * UPDATE pod name * UPDATE annotations * FIX README * CLEANUP * Merge * update * ADD tf cpu env * U[date Makefile * FIX 3DGAN tests * FIX data folder path * ADD offloading of 3DGAN training * ADAPT 3DGAN training for singularity execution * UPDATE test and fix linter --------- Co-authored-by: zoechbauer1 Co-authored-by: Kalliopi Tsolaki Co-authored-by: orviz * Move to python venv * Update Makefile * Add Horovod installation * Update env * FIX openmpi 
install * Add TF explicit version * UPDATE env creation * REMOVE constraint on torch 2.0.* * UPDATE installation * FIX test * REMOVE strict dependency on micromamba * FIX docs and debugging states * FIX cpu only installation * FIX deepspeed cpu installation * FIX tf env creation * FIX makefile * ADD torch and tensorflow Docker containers * Working DDP * REFACTOR torch container build scripts * FIX MPI env var set * Incomplete containers * UPDATE Dockerfiles * REFACTOR Dockerfiles * Rename * UPDATE containers files and tutorial * CLEANUP old doc pages * ADD containers tutorials * ADD containers tutorials * UPDATE deps * UPDATE deps * UPDATE deps * UPDATE docs and tutorials * CLEANUP duplicates * Update tests and scripts * ADD labels * CLEANUP * Add docs and fix deepspeed launcher * UPDATE linter settings * FIX slow unit test on 3DGAN train * ADD 3dgan sample dataset --------- Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com> Co-authored-by: linxUser3574 Co-authored-by: orviz Co-authored-by: Kalliopi Tsolaki Co-authored-by: zoechbauer1 Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: KalliopiTsolaki Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> * Update config.yaml * Update run_docker.sh * fixing linting errors * update num_workers in eurac script * run isort on eurac files * Update distributed.py * Update trainer.py * Update trainer.py * Added option for setting checkpointing frequency in virgo trainer, added newer hpo scripts * Reproducibility features on train_hpo.py * Added seeding to train function and added additional plotting functions * Fixed deepspeed launcher for scaling test, added option to set checkpoint frequency in NoiseGeneratorTrainer * Changed a comment regarding checkpointing frequency * Uncommented parts of runall.sh * config * update mse metric * add debug prints and horovod option (WIP) * small cleanup * 
match newest config file * add support for ddp, deepspeed and horovod * add functionality needed for scalability test * Cleaned up old hpo files, added itwinai Pipeline to HPO script * Added pipeline option to hpo python script * Deleted deprecated file train.py * Ammended config path to be runnable from anywhere * update slurm, runall and scaling test scripts + some cleanup * add small bugfix in console logger * make slurm.sh compatible with run.sh * run isort * fix linting errors * move directory creation to slurm script and small eurac cleanup * Added first version of eurac hpo integration * add check for non-distributed strategy * logging model * Fixed path bugs, eurac hpo works now * Unused imports * remove redundant if statements and run black formatter * Minimise hpo slurm script * Updated ray scripts for eurac to be the same as in virgo, deleted unused old hpo files * Added override for loggers field, so that the config.yaml does not have to be changed for hpo to work * Merge eurac-usecase into virgo-hpo-playground * isort * changed line length to 95 * saving pre-trained model and exporting run and experiment to remote tracking server * Fixed Virgo dataloading * update trainer imports * update config * combine config files into one * run isort on folder * remove unused file * update gather methods and fix stylistic changes * Update README.md * Update README.md * cleanup temp code and small linting errors * add stuff to config and more linting * Fixed virgo dataloading (and cpu/ gpu utilisation with hpo) * Incorporated first pull requests * code cleanup for PR * Incorporated PR comments * Spelling error in README * Some more changes to README. 
* Incorporated comments for data generation, added info in README * Spelling errors * Isort * Update file_gen.py * Update file_gen.py * Update README.md * Update README.md * update readme * remove run.sh * fix typo in readme * fix device issue with GANTrainer and run black formatter * Incorporated PR comments * Merged config files together * Update trainer.py * Update config.yaml * Update config.py * Update config.yaml * Pr changes VirgoConfiguration --------- Signed-off-by: dependabot[bot] Co-authored-by: Roman Machacek <69751521+User3574@users.noreply.github.com> Co-authored-by: Matteo Bunino <48362942+matbun@users.noreply.github.com> Co-authored-by: linxUser3574 Co-authored-by: Matteo Bunino Co-authored-by: orviz Co-authored-by: Kalliopi Tsolaki Co-authored-by: zoechbauer1 Co-authored-by: Mario Rüttgers Co-authored-by: r-sarma <126173968+r-sarma@users.noreply.github.com> Co-authored-by: r-sarma Co-authored-by: KalliopiTsolaki Co-authored-by: VerderK <167095399+VerderK@users.noreply.github.com> Co-authored-by: MarioRuettgers <127950124+MarioRuettgers@users.noreply.github.com> Co-authored-by: Killian Verder Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: KalliopiTsolaki <51197740+KalliopiTsolaki@users.noreply.github.com> Co-authored-by: KalliopiTsolaki Co-authored-by: iferrario Co-authored-by: iacopoff Co-authored-by: ferrario2 Co-authored-by: Iacopo <38247963+iacopoff@users.noreply.github.com> Co-authored-by: MutegekiHenry Co-authored-by: MutegekiHenry Co-authored-by: Anna Lappe Co-authored-by: Henry Mutegeki <36065782+MutegekiHenry@users.noreply.github.com> --- src/itwinai/cli.py | 8 ++- src/itwinai/torch/trainer.py | 71 +++++++++---------- use-cases/eurac/config.yaml | 2 +- use-cases/eurac/runall.sh | 6 +- use-cases/eurac/slurm.sh | 2 +- use-cases/eurac/trainer.py | 16 +---- use-cases/virgo/README.md | 35 --------- .../virgo/synthetic_data_gen/file_gen.py | 2 - 8 files changed, 47 insertions(+), 95 
deletions(-)

diff --git a/src/itwinai/cli.py b/src/itwinai/cli.py
index 6c33c39b..c6598694 100644
--- a/src/itwinai/cli.py
+++ b/src/itwinai/cli.py
@@ -282,13 +282,16 @@ def exec_pipeline(
 @app.command()
 def mlflow_ui(
     path: str = typer.Option("ml-logs/", help="Path to logs storage."),
+    port: int = typer.Option(
+        5000, help="Port on which the MLFlow UI is listening."
+    ),
 ):
     """
     Visualize Mlflow logs.
     """
     import subprocess
 
-    subprocess.run(f"mlflow ui --backend-store-uri {path}".split())
+    subprocess.run(f"mlflow ui --backend-store-uri {path} --port {port}".split())
 
 
 @app.command()
@@ -302,8 +305,7 @@ def mlflow_server(
     """
     import subprocess
 
-    subprocess.run(
-        f"mlflow server --backend-store-uri {path} --port {port}".split())
+    subprocess.run(f"mlflow server --backend-store-uri {path} --port {port}".split())
 
 
 @app.command()
diff --git a/src/itwinai/torch/trainer.py b/src/itwinai/torch/trainer.py
index 51a84845..121372bc 100644
--- a/src/itwinai/torch/trainer.py
+++ b/src/itwinai/torch/trainer.py
@@ -298,7 +298,6 @@ def create_dataloaders(
         # Dear user, this is a method you #
         # may be interested to override!  #
         ###################################
-
         self.train_dataloader = self.strategy.create_dataloader(
             dataset=train_dataset,
             batch_size=self.config.batch_size,
@@ -373,7 +372,7 @@ def execute(
 
         if self.logger:
             self.logger.destroy_logger_context()
-        # self.strategy.clean_up()
+        self.strategy.clean_up()
         return train_dataset, validation_dataset, test_dataset, self.model
 
     def _set_epoch_dataloaders(self, epoch: int):
@@ -521,8 +520,10 @@ def train(self):
             val_loss = self.validation_epoch(epoch)
 
             # Checkpointing current best model
-            worker_val_losses = self.strategy.gather(val_loss, dst_rank=0)
-            if self.strategy.is_main_worker:
+            worker_val_losses = self.strategy.gather(
+                val_loss, dst_rank=0
+            )
+            if self.strategy.global_rank() == 0:
                 avg_loss = torch.mean(
                     torch.stack(worker_val_losses)
                 ).detach().cpu()
@@ -627,7 +628,7 @@ def train_step(
             )
         return loss, metrics
 
-    def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
+    def validation_epoch(self, epoch: int) -> torch.Tensor:
         """Perform a complete sweep over the validation dataset, completing an
         epoch of validation.
 
@@ -635,45 +636,43 @@ def validation_epoch(self, epoch: int) -> Optional[torch.Tensor]:
             epoch (int): current epoch number, from 0 to ``self.epochs - 1``.
 
         Returns:
-            Optional[Loss]: average validation loss for the current epoch if
-                self.validation_dataloader is not None
+            Loss: average validation loss for the current epoch.
         """
-        if self.validation_dataloader is None:
-            return
-
-        self.model.eval()
-        validation_losses = []
-        validation_metrics = []
-        for batch_idx, val_batch in enumerate(self.validation_dataloader):
-            loss, metrics = self.validation_step(
-                batch=val_batch,
-                batch_idx=batch_idx
-            )
-            validation_losses.append(loss)
-            validation_metrics.append(metrics)
+        if self.validation_dataloader is not None:
+            self.model.eval()
+            validation_losses = []
+            validation_metrics = []
+            for batch_idx, val_batch \
+                    in enumerate(self.validation_dataloader):
+                loss, metrics = self.validation_step(
+                    batch=val_batch,
+                    batch_idx=batch_idx
+                )
+                validation_losses.append(loss)
+                validation_metrics.append(metrics)
 
-            # Important: update counter
-            self.validation_glob_step += 1
+                # Important: update counter
+                self.validation_glob_step += 1
 
-        # Aggregate and log losses
-        avg_loss = torch.mean(torch.stack(validation_losses))
-        self.log(
-            item=avg_loss.item(),
-            identifier='validation_loss_epoch',
-            kind='metric',
-            step=self.validation_glob_step,
-        )
-        # Aggregate and log metrics
-        avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
-        for m_name, m_val in avg_metrics.items():
+            # Aggregate and log losses
+            avg_loss = torch.mean(torch.stack(validation_losses))
             self.log(
-                item=m_val,
-                identifier='validation_' + m_name + '_epoch',
+                item=avg_loss.item(),
+                identifier='validation_loss_epoch',
                 kind='metric',
                 step=self.validation_glob_step,
             )
+            # Aggregate and log metrics
+            avg_metrics = pd.DataFrame(validation_metrics).mean().to_dict()
+            for m_name, m_val in avg_metrics.items():
+                self.log(
+                    item=m_val,
+                    identifier='validation_' + m_name + '_epoch',
+                    kind='metric',
+                    step=self.validation_glob_step,
+                )
 
-        return avg_loss
+            return avg_loss
 
     def validation_step(
         self,
diff --git a/use-cases/eurac/config.yaml b/use-cases/eurac/config.yaml
index 8912e898..64ee45f7 100644
--- a/use-cases/eurac/config.yaml
+++ b/use-cases/eurac/config.yaml
@@ -6,7 +6,7 @@ tmp_stats: /p/scratch/intertwin/datasets/eurac/stats
 
 experiment: "drought use case lstm"
 run_name: "alps_test"
-epochs: 5
+epochs: 2
 random_seed: 1010
 lr: 0.001
 batch_size: 256
diff --git a/use-cases/eurac/runall.sh b/use-cases/eurac/runall.sh
index 4bbea260..6169366e 100755
--- a/use-cases/eurac/runall.sh
+++ b/use-cases/eurac/runall.sh
@@ -12,7 +12,7 @@ if [ -z "$NUM_GPUS" ]; then
   NUM_GPUS=4
 fi
 if [ -z "$TIME" ]; then
-  TIME=0:40:00
+  TIME=0:20:00
 fi
 if [ -z "$DEBUG" ]; then
   DEBUG=false
@@ -34,6 +34,6 @@ submit_job () {
 }
 
 echo "Running distributed training on $NUM_NODES nodes with $NUM_GPUS GPUs per node"
-# submit_job "ddp"
-# submit_job "deepspeed"
+submit_job "ddp"
+submit_job "deepspeed"
 submit_job "horovod"
diff --git a/use-cases/eurac/slurm.sh b/use-cases/eurac/slurm.sh
index e907e54c..e1ec58b1 100644
--- a/use-cases/eurac/slurm.sh
+++ b/use-cases/eurac/slurm.sh
@@ -100,7 +100,7 @@ if [ "$DIST_MODE" == "horovod" ] ; then
   srun --cpu-bind=none \
     --ntasks-per-node=$SLURM_GPUS_PER_NODE \
     --cpus-per-task=$SLURM_CPUS_PER_GPU \
-    --ntasks=$(($SLURM_GPUS_PER_NODE * $SLURM_NNODES)) \
+    --ntasks=$SLURM_GPUS_PER_NODE \
     $TRAINING_CMD
 else # E.g. for 'deepspeed' or 'ddp'
   srun --cpu-bind=none --ntasks-per-node=1 \
diff --git a/use-cases/eurac/trainer.py b/use-cases/eurac/trainer.py
index 7ab3e255..53c50202 100644
--- a/use-cases/eurac/trainer.py
+++ b/use-cases/eurac/trainer.py
@@ -1,7 +1,7 @@
 import os
 from pathlib import Path
 from timeit import default_timer as timer
-from typing import Dict, Literal, Optional, Union, Any, Tuple
+from typing import Dict, Literal, Optional, Union
 
 import pandas as pd
 import torch
@@ -13,7 +13,6 @@
 from hython.trainer import ConvTrainer, RNNTrainer, RNNTrainParams
 from ray import train
 from torch.optim.lr_scheduler import ReduceLROnPlateau
-from torch.utils.data import Dataset
 from tqdm.auto import tqdm
 
 from itwinai.loggers import EpochTimeTracker, Logger
@@ -26,7 +25,7 @@
 )
 from itwinai.torch.trainer import TorchTrainer
 from itwinai.torch.type import Metric
-from itwinai.components import profile_torch_trainer
+
 
 class RNNDistributedTrainer(TorchTrainer):
     """Trainer class for RNN model using pytorch.
@@ -88,17 +87,6 @@ def __init__(
             **kwargs,
         )
         self.save_parameters(**self.locals2params(locals()))
-        # self.execute = types.MethodType(profile_torch_trainer(self.execute), self)
-
-
-    @profile_torch_trainer
-    def execute(
-        self,
-        train_dataset: Dataset,
-        validation_dataset: Optional[Dataset] = None,
-        test_dataset: Optional[Dataset] = None
-    ) -> Tuple[Dataset, Dataset, Dataset, Any]:
-        return super().execute(train_dataset, validation_dataset, test_dataset)
 
     def create_model_loss_optimizer(self) -> None:
         self.optimizer = optim.Adam(self.model.parameters(), lr=self.config.lr)
diff --git a/use-cases/virgo/README.md b/use-cases/virgo/README.md
index ca58c40f..d8f42147 100644
--- a/use-cases/virgo/README.md
+++ b/use-cases/virgo/README.md
@@ -123,38 +123,3 @@ You may change CLI variables for `hpo.py` to change parameters, such as the
 number of trials you want to run, to change the stopping criteria for the trials
 or to set a different metric on which ray will evaluate trial results. By default,
 trials monitor validation loss, and results are plotted once all trials are completed.
-
-## Generating Synthetic Data for the Virgo Use Case
-
-This project includes another SLURM job script, `synthetic_data_gen/data_generation.sh`, that allows
-users to generate synthetic dataset for the Virgo gravitational wave detector use case.
-This step is typically not required unless you need to create new synthetic datasets.
-
-The synthetic data is generated using a Python script, `file_gen.py`, which creates multiple files
-containing simulated data. Each file is a pickled pandas dataframe containing `datapoints_per_file`
-datapoints (defaults to 500), each
-one representing a set of time series for main and strain detector channels.
-
-If you need to generate a new dataset, you can run the SLURM script with the following command:
-
-```bash
-sbatch data_generation.sh
-```
-
-The script will generate multiple data files and store them in separate folders, which are
-created in the `target_folder_name` directory.
-
-The generated pickle files are organized in a set of nested folders to avoid creating too many
-files in the same folder. To generate such folders and its files we use SLURM
-[job arrays](https://slurm.schedmd.com/job_array.html).
-Each SLURM array job will create its own folder and populate it with the synthetic data files.
-The number of files created in each folder can be customized by setting the `NUM_FILES` environment
-variablebefore submitting the job.
-For example, to generate 50 files per array job, you can run:
-
-```bash
-export NUM_FILES=50
-sbatch data_generation.sh
-```
-
-If you do not specify `NUM_FILES`, the script will default to creating 100 files per folder.
diff --git a/use-cases/virgo/synthetic_data_gen/file_gen.py b/use-cases/virgo/synthetic_data_gen/file_gen.py
index 859de126..f9cf5f4b 100644
--- a/use-cases/virgo/synthetic_data_gen/file_gen.py
+++ b/use-cases/virgo/synthetic_data_gen/file_gen.py
@@ -7,8 +7,6 @@
 
 from ..src.dataset import generate_cut_image_dataset
 
-from ..src.dataset import generate_cut_image_dataset
-
 
 def generate_pkl_dataset(
     folder_name='test_folder',