Add from_dict to HFDatasetDataModule #11559

@akoumpa akoumpa commented Dec 11, 2024

What does this PR do?

In HF you can do:

from datasets import Dataset
my_data = {"a": [1, 2, 3]}
dataset = Dataset.from_dict(my_data)

Adding support for the following:

    data = {'text': "Below is an instruction that describes a task, paired with an input that "}

    datamodule = llm.HFDatasetDataModule.from_dict(
        {"text": [data['text'] for _ in range(101)]},
        split='train',
        global_batch_size=4,
        micro_batch_size=1,
    )
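
As a rough illustration of the alternate-constructor pattern used above (not the NeMo implementation), here is a plain-Python sketch. `TinyDataModule`, its `rows` and `options` attributes, and the equal-length check are hypothetical stand-ins. The diff in this PR only shows that the real `from_dict` is a static method taking `dataset_dict`, `split`, and `**kwargs` and importing `datasets.Dataset`; the rest of its body is not reproduced here.

```python
from typing import Any, Dict, List


class TinyDataModule:
    """Hypothetical stand-in for a data module; not the NeMo class."""

    def __init__(self, rows: List[Dict[str, Any]], split: str, **kwargs: Any) -> None:
        self.rows = rows
        self.split = split
        self.options = kwargs  # e.g. global_batch_size, micro_batch_size

    @staticmethod
    def from_dict(dataset_dict: Dict[str, List[Any]], split: str, **kwargs: Any) -> "TinyDataModule":
        # The input is column-oriented, so every column must have the same length.
        lengths = {len(col) for col in dataset_dict.values()}
        if len(lengths) > 1:
            raise ValueError(f"columns have different lengths: {sorted(lengths)}")
        n = lengths.pop() if lengths else 0
        # Turn the columns into a list of row dicts.
        rows = [{name: col[i] for name, col in dataset_dict.items()} for i in range(n)]
        return TinyDataModule(rows, split, **kwargs)


data = {"text": "Below is an instruction that describes a task, paired with an input that "}
dm = TinyDataModule.from_dict(
    {"text": [data["text"] for _ in range(101)]},
    split="train",
    global_batch_size=4,
    micro_batch_size=1,
)
print(len(dm.rows), dm.split, dm.options["global_batch_size"])
```

Keeping `split` as an explicit parameter and passing everything else through `**kwargs` mirrors the signature shown in the diff, so the in-memory path accepts the same keyword arguments as the regular constructor.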


GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.


Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/make_HFDatasetDataModule_arg_accept_path_or_dataset branch from fc74bee to 6672419 Compare December 11, 2024 22:41
akoumpa and others added 5 commits December 11, 2024 14:42

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.llm.gpt.data.hf_dataset
nemo/collections/llm/gpt/data/hf_dataset.py:174:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:181:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:207:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:233:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:237:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:241:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:244:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:247:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:250:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:253:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.17/10

Thank you for improving NeMo's documentation!

@akoumpa akoumpa marked this pull request as ready for review December 11, 2024 23:45
@@ -157,6 +170,13 @@ def __init__(
        self.use_mcore_sampler = use_mcore_sampler
        self.mcore_dataloader_type = mcore_dataloader_type

    @staticmethod
    def from_dict(dataset_dict, split, **kwargs):
        from datasets import Dataset
Can you move all datasets imports to the top level? Since there's already from datasets import load_dataset at the top level, I think it's better to move everything to the top.
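
For readers weighing the two import styles discussed in this comment, here is a minimal sketch of both. The helper names `build_top_level` and `build_lazy` and the `HAVE_DATASETS` flag are hypothetical and do not appear in the PR. The only library call assumed is `datasets.Dataset.from_dict`, which the PR description itself uses.

```python
# Top-level guarded import: resolved once, when this module is loaded.
try:
    from datasets import Dataset  # optional dependency
    HAVE_DATASETS = True
except ImportError:
    Dataset = None
    HAVE_DATASETS = False


def build_top_level(columns):
    """Uses the name bound at module import time."""
    if not HAVE_DATASETS:
        raise ImportError("the 'datasets' package is required")
    return Dataset.from_dict(columns)


def build_lazy(columns):
    """Defers the import until the function is actually called."""
    from datasets import Dataset as _Dataset  # raises ImportError here if missing

    return _Dataset.from_dict(columns)
```

The top-level form surfaces a missing dependency in one place and keeps all imports together, which is what the comment asks for. The lazy form delays the cost and the failure until the code path is used.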

@@ -130,16 +133,26 @@ def __init__(
    ) -> None:
        super().__init__()
        assert pad_token_id is not None

        logging.info(f"Loading HF dataset from {path}")
        from datasets import Dataset, DatasetDict
Can you move this to the top level? Same as the other comment.

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it up to you what to do next.

//cc @pablo-garay @ko3n1g

@ericharper ericharper left a comment

LGTM. Thanks!

Can move the imports back to the top in a follow up PR if needed.

@ericharper ericharper merged commit 05398c6 into main Dec 12, 2024
172 of 175 checks passed
@ericharper ericharper deleted the akoumparouli/make_HFDatasetDataModule_arg_accept_path_or_dataset branch December 12, 2024 03:54
akoumpa added a commit that referenced this pull request Dec 16, 2024
* Add from_dict method

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>
ananthsub pushed a commit to ananthsub/NeMo that referenced this pull request Dec 16, 2024
BoxiangW pushed a commit that referenced this pull request Dec 23, 2024
ko3n1g added a commit that referenced this pull request Jan 8, 2025
* Add fsdp2 strategy

Signed-off-by: Boxiang Wang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: BoxiangW <[email protected]>

* Add imports

Signed-off-by: Boxiang Wang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: BoxiangW <[email protected]>

* Add init import

Signed-off-by: Boxiang Wang <[email protected]>

* Apply isort and black reformatting

Signed-off-by: BoxiangW <[email protected]>

* Fix mixtral export for NeMo 2.0 (#11532)

* Initial commit

Signed-off-by: Piotr Kaminski <[email protected]>

* Apply isort and black reformatting

Signed-off-by: Laplasjan107 <[email protected]>

---------

Signed-off-by: Piotr Kaminski <[email protected]>
Signed-off-by: Laplasjan107 <[email protected]>
Co-authored-by: Piotr Kaminski <[email protected]>
Co-authored-by: Laplasjan107 <[email protected]>

* Make HFDatasetDataModule a datasets.load_dataset wrapper (#11500)

* Make HfDatasetDataModule a datasets.load_dataset wrapper

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add logging

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Update HFDatasetDataModule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor fixup

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor fixup #2

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* do not expand

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* doc

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* doc

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add synonym

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* Add train/val/test attributes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add test for hf-datamodule

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Import lazily to avoid breaking with older megatron versions

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* bot happy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* bot happy2

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add doc-strings and collate-fn arg

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* ci: Bump release workflow (#11544)

Signed-off-by: Oliver Koenig <[email protected]>

* ci: Use SHA for cut-off (#11545)

Signed-off-by: Oliver Koenig <[email protected]>

* link to mcore documentation (#11538)

Signed-off-by: ashors1 <[email protected]>

* ci: Adjust inputs for code-freeze workflow (#11550)

Signed-off-by: Oliver Koenig <[email protected]>

* ci: Bump release freeze (#11551)

Signed-off-by: Oliver Koenig <[email protected]>

* Ko3n1g/ci/commit sha for cutoff (#11553)

* ci: Remove token from checkout

Signed-off-by: Oliver Koenig <[email protected]>

* bump version

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>

* ci: Bump code-freeze workflow (#11554)

Signed-off-by: Oliver Koenig <[email protected]>

* ci: Bump code freeze workflow (#11557)

Signed-off-by: Oliver Koenig <[email protected]>

* Fix deploy conflicts in llm.api (#11367)

* Fix llm.deploy api

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

* Apply isort and black reformatting

Signed-off-by: hemildesai <[email protected]>

* PR feedback

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

---------

Signed-off-by: Hemil Desai <[email protected]>
Signed-off-by: hemildesai <[email protected]>
Co-authored-by: hemildesai <[email protected]>

* perf summary docs link (#11262)

Signed-off-by: Malay Nagda <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Add vlm nemo run scripts (#11394)

* update recipe

Signed-off-by: yaoyu-33 <[email protected]>

* fix mllama mock ds

Signed-off-by: yaoyu-33 <[email protected]>

* update to use attention bias

Signed-off-by: yaoyu-33 <[email protected]>

* remove example

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mock.py

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring language.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring language.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mllama/base.py

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring mllama/language.py

Signed-off-by: yaoyu-33 <[email protected]>

* bump mcore

Signed-off-by: Oliver Koenig <[email protected]>

* Add scripts for mllama

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* update script

Signed-off-by: yaoyu-33 <[email protected]>

* fix pylint

Signed-off-by: yaoyu-33 <[email protected]>

* revert Dockerfile.ci

Signed-off-by: Yu Yao <[email protected]>

* add scripts

Signed-off-by: yaoyu-33 <[email protected]>

* add vlm training test in ci

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix docstring issues

Signed-off-by: yaoyu-33 <[email protected]>

* update script match recipe

Signed-off-by: yaoyu-33 <[email protected]>

* update recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update mllama_train.py

Signed-off-by: Yu Yao <[email protected]>

* update mllama 90b recipe

Signed-off-by: yaoyu-33 <[email protected]>

* update to use tmp in ci tests

Signed-off-by: yaoyu-33 <[email protected]>

* update default llava config

Signed-off-by: yaoyu-33 <[email protected]>

* add nemo run scripts

Signed-off-by: yaoyu-33 <[email protected]>

* fix vpp issue

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix cicd

Signed-off-by: yaoyu-33 <[email protected]>

* fix cicd

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* remove duplicated script

Signed-off-by: yaoyu-33 <[email protected]>

* ci: Add HF cache

Signed-off-by: oliver könig <[email protected]>

* update to use SP in recipe

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* upgrade

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "upgrade"

This reverts commit f6ad2cd76abcdd9258cb53a25c788fd658189150.

* update neva api

Signed-off-by: yaoyu-33 <[email protected]>

* update neva api

Signed-off-by: yaoyu-33 <[email protected]>

* fix neva processing

Signed-off-by: yaoyu-33 <[email protected]>

* fix lint

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix data fields

Signed-off-by: yaoyu-33 <[email protected]>

* few fixes

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
Co-authored-by: Oliver Koenig <[email protected]>

* Add from_dict to HFDatasetDataModule (#11559)

* Add from_dict method

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test_load_from_dict

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Prevent llama3.1 from using Linear interpolation (#11548)

* prevent llama3.1 from using linear interpolation

* Apply isort and black reformatting

Signed-off-by: suiyoubi <[email protected]>

---------

Signed-off-by: suiyoubi <[email protected]>
Co-authored-by: suiyoubi <[email protected]>

* [TTS] Add audio and mel codec HF models to docs (#11526)

Signed-off-by: Ryan <[email protected]>

* Update for NEST release (#11537)

* update for nest release

Signed-off-by: stevehuang52 <[email protected]>

* make pylint happier

Signed-off-by: stevehuang52 <[email protected]>

* fix for lhotse dataloader

Signed-off-by: stevehuang52 <[email protected]>

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* minor refactor

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

---------

Signed-off-by: stevehuang52 <[email protected]>

* Merging SpeechLLM development branch (#11462)

* Port changes related to SFT text+speech dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes from Canary(nonLLM) code

Signed-off-by: Piotr Żelasko <[email protected]>

* Add joint text/audio dataloading capability to speechllm

Signed-off-by: Piotr Żelasko <[email protected]>

* include text-only into fprop of training and eval; TODO: text-only
predict

Signed-off-by: zhehuaichen <[email protected]>

* Actually working forward step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for source-target text file pair training for MT+speech

Signed-off-by: Piotr Żelasko <[email protected]>

* Include supervision text tokens in audio example's num tokens

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable conformer seq len NCCL sync

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary sampler fusion strategies support: mux/zip/round_robin/randomized_round_robin

Signed-off-by: Piotr Żelasko <[email protected]>

* Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together).

Signed-off-by: Piotr Żelasko <[email protected]>

* Add missing config

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert multimodal grad accum and fix mask padding issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Add modality weights support via cfg.model.modality_weights

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix for V2 dataloader shuffling CRITICAL

Signed-off-by: Piotr Żelasko <[email protected]>

* Restore multimodal grad accum

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix unit tests for multi-sampler configurations

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* nemo gemma to hf  conversion (#9629)

* adding script for gemma nemo to hf

Signed-off-by: Krishna Puvvada <[email protected]>

* adding verification for convert_gemma_nemo_to_hf

Signed-off-by: Krishna Puvvada <[email protected]>

* Apply isort and black reformatting

Signed-off-by: krishnacpuvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* support FSDP (thank Yifan for early trying) (#10062)

Note: as of now, this is still not fully working on the cluster. See above doc for details.
Signed-off-by: zhehuaichen <[email protected]>

* Fix unit tests after rebasing on recent main

Signed-off-by: Piotr Żelasko <[email protected]>

* support megatron_amp_O2 and tp (#10599)

* Port changes related to SFT text+speech dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes from Canary(nonLLM) code

Signed-off-by: Piotr Żelasko <[email protected]>

* Add joint text/audio dataloading capability to speechllm

Signed-off-by: Piotr Żelasko <[email protected]>

* include text-only into fprop of training and eval; TODO: text-only
predict

Signed-off-by: zhehuaichen <[email protected]>

* Actually working forward step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for source-target text file pair training for MT+speech

Signed-off-by: Piotr Żelasko <[email protected]>

* Include supervision text tokens in audio example's num tokens

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable conformer seq len NCCL sync

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary sampler fusion strategies support: mux/zip/round_robin/randomized_round_robin

Signed-off-by: Piotr Żelasko <[email protected]>

* Working V2 version of multimodal dataloading. Each modality gets its own batch settings that can be merged with zip sampler to enjoy max batch sizes for both modalities in a single training step. Each modality runs fwd+bwd in turn to save GPU memory (instead of running fwd separately and bwd together).

Signed-off-by: Piotr Żelasko <[email protected]>

* Add missing config

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert multimodal grad accum and fix mask padding issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Add modality weights support via cfg.model.modality_weights

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix for V2 dataloader shuffling CRITICAL

Signed-off-by: Piotr Żelasko <[email protected]>

* Restore multimodal grad accum

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix unit tests for multi-sampler configurations

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* nemo gemma to hf  conversion (#9629)

* adding script for gemma nemo to hf

Signed-off-by: Krishna Puvvada <[email protected]>

* adding verification for convert_gemma_nemo_to_hf

Signed-off-by: Krishna Puvvada <[email protected]>

* Apply isort and black reformatting

Signed-off-by: krishnacpuvvada <[email protected]>

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* support FSDP (thank Yifan for early trying)

Signed-off-by: zhehuaichen <[email protected]>

* debug TP deadlock

Signed-off-by: zhehuaichen <[email protected]>

* some fixes for fsdp and tp

/lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs2048_mbs16_ep200/error-1417621-0.out

/lustre/fsw/portfolios/llmservice/users/zhehuaic/results/canary-v0_speechllm/prompt_lhmerge5_p2b_tp_oci_FC-GPT_llama_canaryset_b6s4kf-sunolong_noCC_langtemp0.5_dsettemp0.5_lr1e-4wd1e-3_CosineAnnealing_warmup2500_minlr1e-6_gbs128_mbs16_ep200/error-1421103-3.out

Signed-off-by: zhehuaichen <[email protected]>

* nit fix
Signed-off-by: zhehuaichen <[email protected]>

* fix for llama3.1
Signed-off-by: zhehuaichen <[email protected]>

* for llama3.1
Signed-off-by: zhehuaichen <[email protected]>

* fix for inference
Signed-off-by: zhehuaichen <[email protected]>

* fix inference
Signed-off-by: zhehuaichen <[email protected]>

* fix grad accu
Signed-off-by: zhehuaichen <[email protected]>

* fix inference
Signed-off-by: zhehuaichen <[email protected]>

* initial impl to support megatron_amp_O2 in salm, bestow, salm-t5

Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>

* minor change in dataloader (#10601)

* Speechllm dataset basic unit test (#10631)

* Basic unit test for speechllm lhotse dataset

Signed-off-by: Piotr Żelasko <[email protected]>

* cleanup

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Unit test for existing speechllm dataset with llama2 prompt format (#10634)

Signed-off-by: Piotr Żelasko <[email protected]>

* [speechllm] Replace TextProcessing with PromptFormatter (#10639)

* [speechllm] Replace TextProcessing with PromptFormatter

Signed-off-by: Piotr Żelasko <[email protected]>

* Test for tokens_to_generate

Signed-off-by: Piotr Żelasko <[email protected]>

* Padding optimization for speechlm dataset

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Multimodal conversation format dataloading (#10683)

* Draft implementation of NeMo Multimodal Conversation format

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working data parsing and iteration

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working dataloading with tokenization + prompting

Signed-off-by: Piotr Żelasko <[email protected]>

* Collapse consecutive user turns into single turn

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* a few fixes for the new prompt template based dataloader and lora+distributed fused adam (#10701)

* Draft implementation of NeMo Multimodal Conversation format

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working data parsing and iteration

Signed-off-by: Piotr Żelasko <[email protected]>

* Fully working dataloading with tokenization + prompting

Signed-off-by: Piotr Żelasko <[email protected]>

* Collapse consecutive user turns into single turn

Signed-off-by: Piotr Żelasko <[email protected]>

* compatible with previous expts

Signed-off-by: zhehuaichen <[email protected]>

* support gemma

Signed-off-by: zhehuaichen <[email protected]>

* handle the case max_seq_length is smaller than input_id length

Signed-off-by: zhehuaichen <[email protected]>

* fix max seq case

Signed-off-by: zhehuaichen <[email protected]>

* fix lora ckpt storing and loading

Signed-off-by: zhehuaichen <[email protected]>

* temp fix for distributed fused adam

Signed-off-by: zhehuaichen <[email protected]>

* revert changes in nemo_adapters.py
Signed-off-by: zhehuaichen <[email protected]>

* Fix tokenize_with_prompt

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Co-authored-by: Piotr Żelasko <[email protected]>

* Mechanism to insert BOS/EOS at the beginning/end of dialog (#10923)

* Mechanism to insert BOS/EOS at the beginning/end of dialog

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix Gemma prompt formatter test

Signed-off-by: Piotr Żelasko <[email protected]>

* Add a test specifically for multiturn insertion of bos/eos

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Add options to override default map/iterable dataset style selection in lhotse dataloader

Signed-off-by: Piotr Żelasko <[email protected]>

* Feature/conversations tarred (#11086)

* Multimodal conversation tarring script

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix sharding logic

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix dir creation

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* EMMeTT support in SpeechLLM + tutorial for Lhotse Multimodal Dataloading (#10927)

* Preliminary support for oomptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* OOMptimizer for SpeechLLM

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial version of estimate token bins script

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial support for multimodal 2d bucketing

Signed-off-by: Piotr Żelasko <[email protected]>

* Extend to text-to-text oomptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Preliminary support for Llama2 prompt format in ast+mt

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for 1D estimate token bins

Signed-off-by: Piotr Żelasko <[email protected]>

* Support for 1D estimate token bins

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Minor tweaks

Signed-off-by: Piotr Żelasko <[email protected]>

* Add min/max tokens filter

Signed-off-by: Piotr Żelasko <[email protected]>

* Change to bisect_left for bucket idx selection

Signed-off-by: Piotr Żelasko <[email protected]>

* Add reconfigure_num_microbatches_calculator at the start of train epoch for modular models

Signed-off-by: Piotr Żelasko <[email protected]>

* Update lhotse multi-sampler config and make validation datasets finite

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial implementation of text+audio training for T5 modular models

Signed-off-by: Piotr Żelasko <[email protected]>

* megatron t5 nmt prompt formatter

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for MT+AST T5 oomptimizer and training

Signed-off-by: Piotr Żelasko <[email protected]>

* configs, fixes, token-per-token filtering

* Support text modality in predict_step

Signed-off-by: Piotr Żelasko <[email protected]>

* Support text data in val/test dl

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* fix infinite

Signed-off-by: Piotr Żelasko <[email protected]>

* prompt format fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes in audio supervision

Signed-off-by: Piotr Żelasko <[email protected]>

* remove superficial padding

Signed-off-by: Piotr Żelasko <[email protected]>

* test config and prompt context fetching fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* support text-only decoding for salm/bestow

Signed-off-by: Piotr Żelasko <[email protected]>

* Add unit tests for EMMETT / refactor prompt_format_fn

Signed-off-by: Piotr Żelasko <[email protected]>

* make t5nmt prompt formatter auto discoverable

Signed-off-by: Piotr Żelasko <[email protected]>

* include token count / tpt filtering in estimate_token_bins

Signed-off-by: Piotr Żelasko <[email protected]>

* fix max token filter

Signed-off-by: Piotr Żelasko <[email protected]>

* some fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* custom mixin for text adapters

Signed-off-by: Piotr Żelasko <[email protected]>

* Warmup in oomptimizer-speechlm

Signed-off-by: Piotr Żelasko <[email protected]>

* Move oomptimizer-speechllm to separate directory

Signed-off-by: Piotr Żelasko <[email protected]>

* Initial cleanup

Signed-off-by: Piotr Żelasko <[email protected]>

* Refactoring of prompt format fn and length measurement and filtering for data types; improved unit test coverage

Signed-off-by: Piotr Żelasko <[email protected]>

* Refactor sampler constraints / filters into sampling.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Tests and support for sampler length measurement of multimodal conversations

Signed-off-by: Piotr Żelasko <[email protected]>

* Update estimate_token_bins.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Move estimate_token_bins.py to speech_llm scripts

Signed-off-by: Piotr Żelasko <[email protected]>

* Minor tweaks

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for SpeechLLM dataset

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* Add missing emmett tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Add tutorial about multimodal lhotse dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Updated documentation for multimodal dataloading

Signed-off-by: Piotr Żelasko <[email protected]>

* Prompt Formatter tutorial

Signed-off-by: Piotr Żelasko <[email protected]>

* Review comments

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes for sampling filters None values

Signed-off-by: Piotr Żelasko <[email protected]>

* Changes requested by Steve: moving some args to main config namespace in multi config sampler

Signed-off-by: Piotr Żelasko <[email protected]>

* fix

Signed-off-by: Piotr Żelasko <[email protected]>

* Update default configs to the modified config schema

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix omegaconf use issue

Signed-off-by: Piotr Żelasko <[email protected]>

* Update the docs to the modified multi config format

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>

* Remove old TODO comments

Signed-off-by: Piotr Żelasko <[email protected]>

* Remove prompts/fn.py

Signed-off-by: Piotr Żelasko <[email protected]>

* Copyright notices

Signed-off-by: Piotr Żelasko <[email protected]>

* Make linter happy

Signed-off-by: Piotr Żelasko <[email protected]>

* Make linter happy

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix megatron test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix megatron test

Signed-off-by: Piotr Żelasko <[email protected]>

* Disable plugin for high entropy strings in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix CodeQL errors

Signed-off-by: Piotr Żelasko <[email protected]>

* fix unit tests

Signed-off-by: Piotr Żelasko <[email protected]>

* fix another unit test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix multimodal tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* fixes after merging canary2 pr to main

Signed-off-by: Piotr Żelasko <[email protected]>

* fix headers

Signed-off-by: Piotr Żelasko <[email protected]>

* fix canary integration test + formatting

Signed-off-by: Piotr Żelasko <[email protected]>

* Address reviews - add sync_max_audio_length flag for conformer encoder

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert change in secrets detector

Signed-off-by: Piotr Żelasko <[email protected]>

* Address code review

Signed-off-by: Piotr Żelasko <[email protected]>

* Address Steve's review

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: zhehuaichen <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Krishna Puvvada <[email protected]>
Signed-off-by: krishnacpuvvada <[email protected]>
Co-authored-by: zhehuaichen <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: krishnacpuvvada <[email protected]>
Co-authored-by: zhehuaichen <[email protected]>

* Sync validation metrics for ASRModel (#11533)

* Sync validation metrics for ASRModel

Signed-off-by: Piotr Żelasko <[email protected]>

* support sync for single-dataloader case

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* NeMo 2.0 In-framework deployment support (#11523)

* nemo 2 support

Signed-off-by: Onur Yilmaz <[email protected]>

* Remove unwanted params in DDP init in Megatron Parallel

Signed-off-by: Hemil Desai <[email protected]>

* nemo2 working with query

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* multigpu deployment with nemo2 works

Signed-off-by: Onur Yilmaz <[email protected]>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <[email protected]>

* add max output length

Signed-off-by: Onur Yilmaz <[email protected]>

* Remove prints

Signed-off-by: Onur Yilmaz <[email protected]>

* Fix merge conflicts

Signed-off-by: Onur Yilmaz <[email protected]>

* readded this file

Signed-off-by: Onur Yilmaz <[email protected]>

---------

Signed-off-by: Onur Yilmaz <[email protected]>
Signed-off-by: Hemil Desai <[email protected]>
Signed-off-by: oyilmaz-nvidia <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
Co-authored-by: oyilmaz-nvidia <[email protected]>

* Add SFT/PEFT HF tests (#11519)

* Add SFT/PEFT HF tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move hf examples to examples dir

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* bot

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use mini_squad

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use mini_squad

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* add 2gpu DDP

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use labels as passed by the user

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update samples/ tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rm unused imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add tests with subset split names, e.g. train[:100]

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* add --disable-ckpt

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use self-hosted-azure-gpus-1 for single-gpu test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add TRANSFORMERS_OFFLINE=1 to hf tests

Signed-off-by: Alexandros Koumparoulis <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* Fix typo: LocalNonpersitentObject -> LocalNonpersistentObject (#11546)

Signed-off-by: Ananth Subramaniam <[email protected]>

* Adding documentation for packed dataset preparation with context para… (#11564)

* adding documentation for packed dataset preparation with context parallel

Signed-off-by: Lifu Zhang <[email protected]>

* addressing Anna Shor's comment

Signed-off-by: Lifu Zhang <[email protected]>

---------

Signed-off-by: Lifu Zhang <[email protected]>

* have micro_batch_size and global_batch_size as class attributes in mock datamodule (#11563)

* Revert "Fix the names of two sets of weight and bias in mcore_to_nemo_mapping" (#11560)

* Revert "Fix the names of two sets of weight and bias in mcore_to_nemo_mapping (#9628)"

This reverts commit 6784db56a03f19f37bc4f37bdf87dabb3fc1acee.

* keep underscores

Signed-off-by: ashors1 <[email protected]>

---------

Signed-off-by: ashors1 <[email protected]>

* add huggingface-based tokenizer support for mixtral HF -> .nemo (#11572)

* add huggingface-based tokenizer support

Signed-off-by: dimapihtar <[email protected]>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>
Co-authored-by: dimapihtar <[email protected]>

* Github Actions tests for Llava Next and modify pretrain recipe to have language model path (#11424)

* modified pretrain recipe to have language_model_from_pretrained

* ci test for llava next

* fixed indent/lint issue in cicd yml file

* fix lint issues

* Apply isort and black reformatting

Signed-off-by: yashaswikarnati <[email protected]>

* Update .github/workflows/cicd-main.yml

Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>

* Update .github/workflows/cicd-main.yml

Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>

---------

Signed-off-by: yashaswikarnati <[email protected]>
Signed-off-by: Yashaswi Karnati <[email protected]>
Co-authored-by: yashaswikarnati <[email protected]>
Co-authored-by: oliver könig <[email protected]>

* Fix SingleDeviceStrategy support in Nsys callback (#11574)

* fix for SingleDeviceStrategy

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* mini refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* remove dialogue scripts and docs (#11577)

* remove deprecated scripts

Signed-off-by: dimapihtar <[email protected]>

* remove deprecated docs

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>

* add JitTransform (#11131)

* add JitTransform

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fixes

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add JiT CB test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove stale imports

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* cleanup

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add jit callback test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix param passing

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use sgd in test_nemo_jit_cb

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add thunder call

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* Use .compile method to avoid changing module structure

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* Use JitConfig

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* thunder setting

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* avoid reentry

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove optional

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rewrite

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* refactor & module_selector

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* NeMo 2.0 documentation upgrade (#11235)

* update attention

Signed-off-by: dimapihtar <[email protected]>

* update docs to NeMo 2.0

Signed-off-by: dimapihtar <[email protected]>

* update usage

Signed-off-by: dimapihtar <[email protected]>

* update parallelism

Signed-off-by: dimapihtar <[email protected]>

* update parallelism docs

Signed-off-by: dimapihtar <[email protected]>

* update parallelism docs

Signed-off-by: dimapihtar <[email protected]>

* fix style

Signed-off-by: dimapihtar <[email protected]>

* update to NeMo 2.0

Signed-off-by: dimapihtar <[email protected]>

* NeMo 2.0 update

Signed-off-by: dimapihtar <[email protected]>

* NeMo 2.0 update

Signed-off-by: dimapihtar <[email protected]>

* remove deprecated file

Signed-off-by: dimapihtar <[email protected]>

* update in respect to NeMo 2.0

Signed-off-by: dimapihtar <[email protected]>

* fix hyperlinks

Signed-off-by: dimapihtar <[email protected]>

* remove deprecated

Signed-off-by: dimapihtar <[email protected]>

* remove deprecated

Signed-off-by: dimapihtar <[email protected]>

* update documentation to NeMo 2.0

Signed-off-by: dimapihtar <[email protected]>

* fix typo

Signed-off-by: dimapihtar <[email protected]>

* fix punctuation

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>

* Remove auto-import of lhotse when importing nemo.collections.common.data (#11578)

* Remove auto-import of lhotse when importing nemo.collections.common.data

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix test import

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix example configs (#11571)

* Fix example configs

Signed-off-by: Boxiang Wang <[email protected]>

* Fix line length

Signed-off-by: Boxiang Wang <[email protected]>

---------

Signed-off-by: Boxiang Wang <[email protected]>

* fix (#11575)

Signed-off-by: Oliver Koenig <[email protected]>

* NIM supporting changes for nemo.export for NeMo 2.0 (#11488)

* Move torch_dtype_from_precision for independent export module

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>

* Remove unused imports

Signed-off-by: Jan Lasek <[email protected]>

* Fix too long lines

Signed-off-by: Jan Lasek <[email protected]>

* Apply isort and black reformatting

Signed-off-by: janekl <[email protected]>
Signed-off-by: Jan Lasek <[email protected]>

* Fix signature and default for megatron_amp_O2

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>
Signed-off-by: janekl <[email protected]>
Co-authored-by: Bobby Chen <[email protected]>
Co-authored-by: janekl <[email protected]>

* AED greedy confidence estimation (#11573)

* upload

Signed-off-by: Aleksandr Laptev <[email protected]>

* Apply isort and black reformatting

Signed-off-by: GNroy <[email protected]>

* set prompt confidence dtype at initialization

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Signed-off-by: GNroy <[email protected]>
Co-authored-by: GNroy <[email protected]>

* gemma fix (#11587)

* Update T5 DataModule regarding Pretrain/Finetune validate (#11584)

* update datamodule to have mbs/gbs

* update datamodule to have mbs/gbs

* Apply isort and black reformatting

Signed-off-by: huvunvidia <[email protected]>

---------

Signed-off-by: huvunvidia <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: huvunvidia <[email protected]>

* fix llama3 (#11580)

* Add Hf nemorun tests (#11566)

* minor fixes for recipe

Signed-off-by: HuiyingLi <[email protected]>

* add peft nemorun script

Signed-off-by: HuiyingLi <[email protected]>

* add sft script and data module

Signed-off-by: HuiyingLi <[email protected]>

* Apply isort and black reformatting

Signed-off-by: HuiyingLi <[email protected]>

* clean up

Signed-off-by: HuiyingLi <[email protected]>

* add disable ckpt and data config for tests

Signed-off-by: HuiyingLi <[email protected]>

* Apply isort and black reformatting

Signed-off-by: HuiyingLi <[email protected]>

* add tests to cicd yaml

Signed-off-by: HuiyingLi <[email protected]>

* cleanup

Signed-off-by: HuiyingLi <[email protected]>

---------

Signed-off-by: HuiyingLi <[email protected]>
Co-authored-by: HuiyingLi <[email protected]>

* [🤖]: Howdy folks, let's bump NeMo-Toolkit to `2.2.0rc0` ! (#11555)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Pass the number of experts to modelopt layer spec (#11607)

* Pass number of experts to modelopt layer spec

Signed-off-by: Jan Lasek <[email protected]>

* Fix too long lines

Signed-off-by: Jan Lasek <[email protected]>

---------

Signed-off-by: Jan Lasek <[email protected]>

* Adding changes to asr documentation (#11397)

Signed-off-by: Ssofja <[email protected]>

* Support Cosmos tokenizer TensorRT inference (#11472)

* Add cosmos TRT

* Add trt run script

* Apply isort and black reformatting

Signed-off-by: meatybobby <[email protected]>

* Clean code

* Fix CodeQL

---------

Signed-off-by: meatybobby <[email protected]>
Co-authored-by: meatybobby <[email protected]>

* Neva updates to latest mcore and some fixes (#11565)

* api updates and fixes

Signed-off-by: yaoyu-33 <[email protected]>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

* fix

Signed-off-by: yaoyu-33 <[email protected]>

* fix arg

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>

* add nemo2-sft-peft to readme (#11613)

Signed-off-by: Huiying Li <[email protected]>

* Set Minitron width pruning batch size 1 (#11603)

Signed-off-by: Keval Morabia <[email protected]>

* Disable CP for running Inference using megatron_gpt_eval (#11547)

* Disable CP for megatron_gpt_eval

* Apply isort and black reformatting

Signed-off-by: suiyoubi <[email protected]>

* Update examples/nlp/language_modeling/megatron_gpt_eval.py

Co-authored-by: Chen Cui <[email protected]>
Signed-off-by: Ao Tang <[email protected]>

---------

Signed-off-by: suiyoubi <[email protected]>
Signed-off-by: Ao Tang <[email protected]>
Co-authored-by: suiyoubi <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* ci: Add `no-fail-fast` mode (#11608)

Signed-off-by: Oliver Koenig <[email protected]>

* Chat dataset support (#11423)

* chat dataset support

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* add ci test

Signed-off-by: Chen Cui <[email protected]>

* address comment

Signed-off-by: Chen Cui <[email protected]>

* Apply isort and black reformatting

Signed-off-by: cuichenx <[email protected]>

* address comment

Signed-off-by: Chen Cui <[email protected]>

---------

Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: cuichenx <[email protected]>
Co-authored-by: cuichenx <[email protected]>

* Sortformer Diarizer 4spk v1 model PR Part 2: Unit-tests for Sortformer Diarizer. (#11336)

* Adding the first pr files models and dataset

Signed-off-by: taejinp <[email protected]>

* Tested all unit-test files

Signed-off-by: taejinp <[email protected]>

* Name changes on yaml files and train example

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Reflecting comments and removing unnecessary parts for this PR

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Adding docstrings to reflect the PR comments

Signed-off-by: taejinp <[email protected]>

* removed the unused find_first_nonzero

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Fixed all pylint issues

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Resolving pylint issues

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Removing unused variable in audio_to_diar_label.py

Signed-off-by: taejinp <[email protected]>

* Fixed docstrings in training script

Signed-off-by: taejinp <[email protected]>

* Line-too-long issue from Pylint fixed

Signed-off-by: taejinp <[email protected]>

* Adding get_subsegments_scriptable to prevent jit.script error

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Addressed Code-QL issues

Signed-off-by: taejinp <[email protected]>

* Resolved conflicts on bce_loss.py

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Adding all the diarization related unit-tests

Signed-off-by: taejinp <[email protected]>

* Moving speaker task related unit test files to speaker_tasks folder

Signed-off-by: taejinp <[email protected]>

* Fixed uninit variable issue in bce_loss.py spotted by codeQL

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Fixing code-QL issues

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Reflecting PR comments from weiqingw

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Line too long pylint issue resolved in e2e_diarize_speech.py

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Resolved unused variable issue in model test

Signed-off-by: taejinp <[email protected]>

* Reflecting the comment on Nov 21st, 2024.

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Unused variable import time

Signed-off-by: taejinp <[email protected]>

* Adding docstrings to score_labels() function in der.py

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Reflecting comments on YAML files and model file variable changes.

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Added get_subsegments_scriptable for legacy get_subsegment functions

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Resolved line too long pylint issues

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Added training and inference CI-tests

Signed-off-by: taejinp <[email protected]>

* Added the missing parse_func in preprocessing/collections.py

Signed-off-by: taejinp <[email protected]>

* Adding the missing parse_func in preprocessing/collections.py

Signed-off-by: taejinp <[email protected]>

* Fixed an indentation error

Signed-off-by: taejinp <[email protected]>

* Resolved multi_bin_acc and bce_loss issues

Signed-off-by: taejinp <[email protected]>

* Resolved line-too-long for msdd_models.py

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Code QL issues and fixed test errors

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* line too long in audio_to_diar_label.py

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* resolving CICD test issues

Signed-off-by: taejinp <[email protected]>

* Fixing codeQL issues

Signed-off-by: taejinp <[email protected]>

* Fixed pin memory False for inference

Signed-off-by: taejinp <[email protected]>

---------

Signed-off-by: taejinp <[email protected]>
Signed-off-by: tango4j <[email protected]>
Co-authored-by: tango4j <[email protected]>

* 2x more memory efficient Graph-based RNN-T (#11169)

* Optimized Graph-Transducer implementation

Signed-off-by: Vladimir Bataev <[email protected]>

---------

Signed-off-by: Vladimir Bataev <[email protected]>
Signed-off-by: artbataev <[email protected]>
Co-authored-by: artbataev <[email protected]>

* Use explicit subpaths in io for exporting a checkpoint (#11352)

* Fix llm.export_ckpt

Signed-off-by: Hemil Desai <[email protected]>

* fix

Signed-off-by: Hemil Desai <[email protected]>

---------

Signed-off-by: Hemil Desai <[email protected]>

* Remove triton requirement (#11627)

* Specify pytorch-triton instead of triton

Signed-off-by: Dong Hyuk Chang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove triton

Signed-off-by: Dong Hyuk Chang <[email protected]>

---------

Signed-off-by: Dong Hyuk Chang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* ci: Remove comment if no changes required anymore (#11624)

Signed-off-by: Oliver Koenig <[email protected]>

* Jit with peft (#11586)

* move JitTransform to the end

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add docstring & post-init

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add remove_extra_batch_keys and remove align_labels

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Run JitTransform on_train_epoch_start

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add --use-torch-jit option

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add docstrings

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pep8

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* NeMo-UX: add Hf's AutoModelForImageTextToText (#11321)

* init commit

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* wip

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* peft example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* move peft example to multimodal_llm

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* surface HFAutoModelForImageTextToText

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add hf vlm dataset

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move processor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* train_log -> train_loss

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* vlm.HFDatasetDataModule pass collate_fn as argument

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Update peft example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* typo

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Move example

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* remove unused

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Small change

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Fix loss calculation

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add extract_skipped_token_ids

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Use vlm.HFAutoModelForImageTextToText.extract_skipped_token_ids

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Update logits/labels handling

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add trust_remote_code to configure_processor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

* mini refactor

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add LLAMA_TOKENS

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update hf_dataset

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add lora_dtype for models with non-FP weights

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Add load_in_4bit option

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add default_dtype

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* add load_in_4bit to llm collection

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* rm import

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix asset path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move vlm test

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* move data offline

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use single gpu

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* pylint

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* drop align_labels

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* remove align_labels from llm too

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* use loss * mask instead of loss[mask == 1]

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix path

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* Apply isort and black reformatting

Signed-off-by: akoumpa <[email protected]>

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: akoumpa <[email protected]>

* ci: Bump release workflow (#11635)

Signed-off-by: Oliver Koenig <[email protected]>

* Add fix docstring for speech commands (#11638)

Signed-off-by: smajumdar <[email protected]>

* Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom (#11641)

Signed-off-by: Weiqing Wang <[email protected]>

* fixed config name in online augmentation tutorial (#11628)

Signed-off-by: Rauf <[email protected]>

* fix default nodes (#11632)

* add renormalize_blend_weights param (#11647)

Signed-off-by: dimapihtar <[email protected]>

* Sortformer Diarizer 4spk v1 model PR Part 3: Speaker Diarization Mixin (#11511)

* Adding diarization mixin for one click inference

Signed-off-by: taejinp <[email protected]>

* Apply isort and black reformatting

Signed-off-by: tango4j <[email protected]>

* Resolving CodeQL and Pylint

Signed-off-by: taejinp <[email protected]>

* Resolving CodeQL and Pylint - unsaved files …

3 participants