Add from_dict to HFDatasetDataModule #11559

akoumpa · 2024-12-11T22:40:17Z

What does this PR do ?

In HF you can do:

    from datasets import Dataset
    my_data = {"a": [1, 2, 3]}
    dataset = Dataset.from_dict(my_data)

Adding support for the following:

    data = {'text': "Below is an instruction that describes a task, paired with an input that "}

    datamodule = llm.HFDatasetDataModule.from_dict(
        {"text": [data['text'] for _ in range(101)]},
        split='train',
        global_batch_size=4,
        micro_batch_size=1,
    )
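
For readers unfamiliar with the input format: `Dataset.from_dict` takes a column-oriented mapping, where each key is a column name and each value is a list with one entry per row. The sketch below is plain Python and calls neither NeMo nor `datasets`; `check_columns` and `to_rows` are hypothetical helper names, used only to illustrate the shape of the dictionary passed to `from_dict` above. The equal-length check reflects the assumption that all columns of a table must have the same number of rows.

```python
def check_columns(columns):
    """Return the row count of a column-oriented dict; raise if column lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    # An empty dict has zero rows.
    return next(iter(lengths.values()), 0)


def to_rows(columns):
    """Convert {"col": [v0, v1, ...]} into [{"col": v0}, {"col": v1}, ...]."""
    num_rows = check_columns(columns)
    return [{name: values[i] for name, values in columns.items()} for i in range(num_rows)]


text = "Below is an instruction that describes a task, paired with an input that "
columns = {"text": [text for _ in range(101)]}  # same shape as the example above

num_rows = check_columns(columns)  # one "text" column with 101 entries
rows = to_rows(columns)            # row-oriented view of the same data
first_row = rows[0]
```

Here every row carries the same string, so the dictionary describes a single-column dataset with 101 identical rows.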



Signed-off-by: Alexandros Koumparoulis <[email protected]>

Signed-off-by: akoumpa <[email protected]>

github-actions · 2024-12-11T22:53:57Z

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.

Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.llm.gpt.data.hf_dataset
nemo/collections/llm/gpt/data/hf_dataset.py:174:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:181:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:207:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:233:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:237:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:241:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:244:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:247:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:250:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/llm/gpt/data/hf_dataset.py:253:4: C0116: Missing function or method docstring (missing-function-docstring)

-----------------------------------
Your code has been rated at 9.17/10

Thank you for improving NeMo's documentation!
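
Pylint raises C0116 for any function or method that has no docstring. A minimal sketch of the kind of change that addresses such a warning is shown below. `ExampleDataModule` is a hypothetical stand-in and not the NeMo class; the method name `from_dict` mirrors this PR, but the body is illustrative only.

```python
class ExampleDataModule:
    """Hypothetical stand-in for a data module; not the NeMo implementation."""

    def __init__(self, dataset, split=None):
        """Store the dataset and the optional split name."""
        self.dataset = dataset
        self.split = split

    @staticmethod
    def from_dict(dataset_dict, split, **kwargs):
        """Build a data module from a column-oriented dict.

        Args:
            dataset_dict: mapping of column name to a list of values.
            split: name of the split the data belongs to (e.g. "train").
            **kwargs: accepted to mirror the signature in the PR; ignored in this sketch.
        """
        return ExampleDataModule(dataset_dict, split=split)


module = ExampleDataModule.from_dict({"a": [1, 2, 3]}, split="train")
```

With a docstring on each method, the missing-function-docstring check has nothing left to report for this class.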

hemildesai · 2024-12-12T00:17:51Z

nemo/collections/llm/gpt/data/hf_dataset.py

@@ -157,6 +170,13 @@ def __init__(
        self.use_mcore_sampler = use_mcore_sampler
        self.mcore_dataloader_type = mcore_dataloader_type

+    @staticmethod
+    def from_dict(dataset_dict, split, **kwargs):
+        from datasets import Dataset


Can you move all datasets imports to the top level? Since there's already from datasets import load_dataset at the top level, I think it's better to move everything to the top.
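
The trade-off behind this request can be sketched with the standard library alone. In the snippet below, `json` stands in for `datasets` so that it runs anywhere; `load_top_level` and `load_lazy` are hypothetical names. A top-level import fails fast when the module is imported and keeps dependencies visible in one place, while a function-local import defers the import (and any `ImportError`) until the function is first called.

```python
import json  # top-level import: resolved once, when this module is imported


def load_top_level(payload):
    """Parse the payload with the module imported at the top of the file."""
    return json.loads(payload)


def load_lazy(payload):
    """Parse the payload with a function-local import.

    The import statement runs on every call, but after the first call it is
    served from the sys.modules cache.
    """
    import json as _json  # function-local import, the style used in the diff above

    return _json.loads(payload)


top = load_top_level('{"a": [1, 2, 3]}')
lazy = load_lazy('{"a": [1, 2, 3]}')
```

Both functions return the same result; the difference is only in when an import failure would surface.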

hemildesai · 2024-12-12T00:18:25Z

nemo/collections/llm/gpt/data/hf_dataset.py

@@ -130,16 +133,26 @@ def __init__(
    ) -> None:
        super().__init__()
        assert pad_token_id is not None
-
-        logging.info(f"Loading HF dataset from {path}")
+        from datasets import Dataset, DatasetDict


Can you move this to the top level? Same as the other comment.

github-actions · 2024-12-12T00:44:22Z

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it to you what to do next.

//cc @pablo-garay @ko3n1g

ericharper

LGTM. Thanks!

Can move the imports back to the top in a follow up PR if needed.

* Add from_dict method
* add test_load_from_dict
* fix
* fix
* add test_load_from_dict
* fix
* Apply isort and black reformatting

---------

Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: akoumpa <[email protected]>
Co-authored-by: akoumpa <[email protected]>

Desai <[email protected]> Signed-off-by: oyilmaz-nvidia <[email protected]> Co-authored-by: Hemil Desai <[email protected]> Co-authored-by: oyilmaz-nvidia <[email protected]> * Add SFT/PEFT HF tests (#11519) * Add SFT/PEFT HF tests Signed-off-by: Alexandros Koumparoulis <[email protected]> * move hf examples to examples dir Signed-off-by: Alexandros Koumparoulis <[email protected]> * bot Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * use mini_squad Signed-off-by: Alexandros Koumparoulis <[email protected]> * use mini_squad Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * add 2gpu DDP Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor Signed-off-by: Alexandros Koumparoulis <[email protected]> * use labels as passed by the user Signed-off-by: Alexandros Koumparoulis <[email protected]> * update samples/ tests Signed-off-by: Alexandros Koumparoulis <[email protected]> * rm unused imports Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add tests with subset split names, e.g. 
train[:100] Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * add --disable-ckpt Signed-off-by: Alexandros Koumparoulis <[email protected]> * use self-hosted-azure-gpus-1 for single-gpu test Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add TRANSFORMERS_OFFLINE=1 to hf tests Signed-off-by: Alexandros Koumparoulis <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> * Fix typo: LocalNonpersitentObject -> LocalNonpersistentObject (#11546) Signed-off-by: Ananth Subramaniam <[email protected]> * Adding documentation for packed dataset preparation with context para… (#11564) * adding documentation for packed dataset preparation with context parallel Signed-off-by: Lifu Zhang <[email protected]> * addressing Anna Shor's comment Signed-off-by: Lifu Zhang <[email protected]> --------- Signed-off-by: Lifu Zhang <[email protected]> * have micro_batch_size and global_batch_size as class attributes in mock datamodule (#11563) * Revert "Fix the names of two sets of weight and bias in mcore_to_nemo_mapping" (#11560) * Revert "Fix the names of two sets of weight and bias in mcore_to_nemo_mapping (#9628)" This reverts commit 6784db56a03f19f37bc4f37bdf87dabb3fc1acee. 
* keep underscores Signed-off-by: ashors1 <[email protected]> --------- Signed-off-by: ashors1 <[email protected]> * add huggingface-based tokenizer support for mixtral HF -> .nemo (#11572) * add huggingface-based tokenizer support Signed-off-by: dimapihtar <[email protected]> * Apply isort and black reformatting Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> Signed-off-by: dimapihtar <[email protected]> Co-authored-by: dimapihtar <[email protected]> * Github Actions tests for Llava Next and modify pretrain recipe to have language model path (#11424) * modified pretrain recipe to have language_model_from_pretrained * ci test for llava next * fixed indent/lint issue in cicd yml file * fix lint issues * Apply isort and black reformatting Signed-off-by: yashaswikarnati <[email protected]> * Update .github/workflows/cicd-main.yml Co-authored-by: oliver könig <[email protected]> Signed-off-by: Yashaswi Karnati <[email protected]> * Update .github/workflows/cicd-main.yml Co-authored-by: oliver könig <[email protected]> Signed-off-by: Yashaswi Karnati <[email protected]> --------- Signed-off-by: yashaswikarnati <[email protected]> Signed-off-by: Yashaswi Karnati <[email protected]> Co-authored-by: yashaswikarnati <[email protected]> Co-authored-by: oliver könig <[email protected]> * Fix SingleDeviceStrategy support in Nsys callback (#11574) * fix for SingleDeviceStrategy Signed-off-by: Alexandros Koumparoulis <[email protected]> * mini refactor Signed-off-by: Alexandros Koumparoulis <[email protected]> * typo Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> * remove dialogue scripts and docs (#11577) * remove 
deprecated scripts Signed-off-by: dimapihtar <[email protected]> * remove deprecated docs Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> * add JitTransform (#11131) * add JitTransform Signed-off-by: Alexandros Koumparoulis <[email protected]> * fixes Signed-off-by: Alexandros Koumparoulis <[email protected]> * add JiT CB test Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove stale imports Signed-off-by: Alexandros Koumparoulis <[email protected]> * typo Signed-off-by: Alexandros Koumparoulis <[email protected]> * cleanup Signed-off-by: Alexandros Koumparoulis <[email protected]> * add jit callback test Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix param passing Signed-off-by: Alexandros Koumparoulis <[email protected]> * use sgd in test_nemo_jit_cb Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * add thunder call Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * Use .compile method to avoid changing module structure Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * Use JitConfig Signed-off-by: Alexandros Koumparoulis <[email protected]> * thunder setting Signed-off-by: Alexandros Koumparoulis <[email protected]> * avoid reentry Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove optional Signed-off-by: Alexandros Koumparoulis <[email protected]> * rewrite Signed-off-by: Alexandros Koumparoulis <[email protected]> * refactor & module_selector Signed-off-by: Alexandros Koumparoulis 
<[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: akoumpa <[email protected]> * NeMo 2.0 documentation upgrade (#11235) * update attention Signed-off-by: dimapihtar <[email protected]> * update docs to NeMo 2.0 Signed-off-by: dimapihtar <[email protected]> * update usage Signed-off-by: dimapihtar <[email protected]> * update parallelism Signed-off-by: dimapihtar <[email protected]> * update parallelism docs Signed-off-by: dimapihtar <[email protected]> * update parallelism docs Signed-off-by: dimapihtar <[email protected]> * fix style Signed-off-by: dimapihtar <[email protected]> * update to NeMo 2.0 Signed-off-by: dimapihtar <[email protected]> * NeMo 2.0 update Signed-off-by: dimapihtar <[email protected]> * NeMo 2.0 update Signed-off-by: dimapihtar <[email protected]> * remove deprecated file Signed-off-by: dimapihtar <[email protected]> * update in respect to NeMo 2.0 Signed-off-by: dimapihtar <[email protected]> * fix hyperlinks Signed-off-by: dimapihtar <[email protected]> * remove deprecated Signed-off-by: dimapihtar <[email protected]> * remove deprecated Signed-off-by: dimapihtar <[email protected]> * update documentation to NeMo 2.0 Signed-off-by: dimapihtar <[email protected]> * fix typo Signed-off-by: dimapihtar <[email protected]> * fix punctuation Signed-off-by: dimapihtar <[email protected]> --------- Signed-off-by: dimapihtar <[email protected]> * Remove auto-import of lhotse when importing nemo.collections.common.data (#11578) * Remove auto-import of lhotse when importing nemo.collections.common.data Signed-off-by: Piotr Żelasko <[email protected]> * Fix test import Signed-off-by: Piotr Żelasko <[email protected]> --------- Signed-off-by: Piotr Żelasko 
<[email protected]> * Fix example configs (#11571) * Fix example configs Signed-off-by: Boxiang Wang <[email protected]> * Fix line length Signed-off-by: Boxiang Wang <[email protected]> --------- Signed-off-by: Boxiang Wang <[email protected]> * fix (#11575) Signed-off-by: Oliver Koenig <[email protected]> * NIM supporting changes for nemo.export for NeMo 2.0 (#11488) * Move torch_dtype_from_precision for independent export module Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> Signed-off-by: Jan Lasek <[email protected]> * Remove unused imports Signed-off-by: Jan Lasek <[email protected]> * Fix too long lines Signed-off-by: Jan Lasek <[email protected]> * Apply isort and black reformatting Signed-off-by: janekl <[email protected]> Signed-off-by: Jan Lasek <[email protected]> * Fix signature and default for megatron_amp_O2 Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> Signed-off-by: janekl <[email protected]> Co-authored-by: Bobby Chen <[email protected]> Co-authored-by: janekl <[email protected]> * AED greedy confidence estimation (#11573) * upload Signed-off-by: Aleksandr Laptev <[email protected]> * Apply isort and black reformatting Signed-off-by: GNroy <[email protected]> * set prompt confidence dtype at initialization Signed-off-by: Aleksandr Laptev <[email protected]> --------- Signed-off-by: Aleksandr Laptev <[email protected]> Signed-off-by: GNroy <[email protected]> Co-authored-by: GNroy <[email protected]> * gemma fix (#11587) * Update T5 DataModule regarding Pretrain/Finetune validate (#11584) * update datamodule to have mbs/gbs * update datamodule to have mbs/gbs * Apply isort and black reformatting Signed-off-by: huvunvidia <[email protected]> --------- Signed-off-by: huvunvidia <[email protected]> Co-authored-by: Huy Vu2 <[email protected]> Co-authored-by: huvunvidia <[email protected]> * fix llama3 (#11580) * Add Hf 
nemorun tests (#11566) * minor fixes for recipe Signed-off-by: HuiyingLi <[email protected]> * add peft nemorun script Signed-off-by: HuiyingLi <[email protected]> * add sft script and data module Signed-off-by: HuiyingLi <[email protected]> * Apply isort and black reformatting Signed-off-by: HuiyingLi <[email protected]> * clean up Signed-off-by: HuiyingLi <[email protected]> * add disable ckpt and data config for tests Signed-off-by: HuiyingLi <[email protected]> * Apply isort and black reformatting Signed-off-by: HuiyingLi <[email protected]> * add tests to cicd yaml Signed-off-by: HuiyingLi <[email protected]> * cleanup Signed-off-by: HuiyingLi <[email protected]> --------- Signed-off-by: HuiyingLi <[email protected]> Signed-off-by: HuiyingLi <[email protected]> Co-authored-by: HuiyingLi <[email protected]> * [🤖]: Howdy folks, let's bump NeMo-Toolkit to `2.2.0rc0` ! (#11555) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Pass the number of experts to modelopt layer spec (#11607) * Pass number of experts to modelopt layer spec Signed-off-by: Jan Lasek <[email protected]> * Fix too long lines Signed-off-by: Jan Lasek <[email protected]> --------- Signed-off-by: Jan Lasek <[email protected]> * Adding changes to asr documentation (#11397) Signed-off-by: Ssofja <[email protected]> * Support Cosmos tokenizer TensorRT inference (#11472) * Add cosmos TRT * Add trt run script * Apply isort and black reformatting Signed-off-by: meatybobby <[email protected]> * Clean code * Fix CodeQL --------- Signed-off-by: meatybobby <[email protected]> Co-authored-by: meatybobby <[email protected]> * Neva updates to latest mcore and some fixes (#11565) * api updates and fixes Signed-off-by: yaoyu-33 <[email protected]> * Apply isort and black reformatting Signed-off-by: yaoyu-33 <[email protected]> * fix Signed-off-by: yaoyu-33 <[email protected]> * fix arg Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 
<[email protected]> Signed-off-by: yaoyu-33 <[email protected]> Co-authored-by: yaoyu-33 <[email protected]> * add nemo2-sft-peft to readme (#11613) Signed-off-by: Huiying Li <[email protected]> * Set Minitron width pruning batch size 1 (#11603) Signed-off-by: Keval Morabia <[email protected]> * Disable CP for running Inference using megatron_gpt_eval (#11547) * Disable CP for megatron_gpt_eval * Apply isort and black reformatting Signed-off-by: suiyoubi <[email protected]> * Update examples/nlp/language_modeling/megatron_gpt_eval.py Co-authored-by: Chen Cui <[email protected]> Signed-off-by: Ao Tang <[email protected]> --------- Signed-off-by: suiyoubi <[email protected]> Signed-off-by: Ao Tang <[email protected]> Co-authored-by: suiyoubi <[email protected]> Co-authored-by: Chen Cui <[email protected]> * ci: Add `no-fail-fast` mode (#11608) Signed-off-by: Oliver Koenig <[email protected]> * Chat dataset support (#11423) * chat dataset support Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * add ci test Signed-off-by: Chen Cui <[email protected]> * address comment Signed-off-by: Chen Cui <[email protected]> * Apply isort and black reformatting Signed-off-by: cuichenx <[email protected]> * address comment Signed-off-by: Chen Cui <[email protected]> --------- Signed-off-by: Chen Cui <[email protected]> Signed-off-by: cuichenx <[email protected]> Co-authored-by: cuichenx <[email protected]> * Sortformer Diarizer 4spk v1 model PR Part 2: Unit-tests for Sortformer Diarizer. 
(#11336) * Adding the first pr files models and dataset Signed-off-by: taejinp <[email protected]> * Tested all unit-test files Signed-off-by: taejinp <[email protected]> * Name changes on yaml files and train example Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Reflecting comments and removing unnecessary parts for this PR Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Adding docstrings to reflect the PR comments Signed-off-by: taejinp <[email protected]> * removed the unused find_first_nonzero Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Fixed all pylint issues Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Resolving pylint issues Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Removing unused varialbe in audio_to_diar_label.py Signed-off-by: taejinp <[email protected]> * Fixed docstrings in training script Signed-off-by: taejinp <[email protected]> * Line-too-long issue from Pylint fixed Signed-off-by: taejinp <[email protected]> * Adding get_subsegments_scriptable to prevent jit.script error Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Addressed Code-QL issues Signed-off-by: taejinp <[email protected]> * Resolved conflicts on bce_loss.py Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Adding all the diarization reltated unit-tests Signed-off-by: taejinp <[email protected]> * Moving speaker task related unit test files to speaker_tasks folder Signed-off-by: taejinp <[email protected]> * Fixed uninit variable issue in 
bce_loss.py spotted by codeQL Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Fixing code-QL issues Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Reflecting PR comments from weiqingw Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Line too long pylint issue resolved in e2e_diarize_speech.py Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Resovled unused variable issue in model test Signed-off-by: taejinp <[email protected]> * Reflecting the comment on Nov 21st 2024. Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Unused variable import time Signed-off-by: taejinp <[email protected]> * Adding docstrings to score_labels() function in der.py Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Reflecting comments on YAML files and model file variable changes. 
Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Added get_subsegments_scriptable for legacy get_subsegment functions Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Resolved line too long pylint issues Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Added training and inference CI-tests Signed-off-by: taejinp <[email protected]> * Added the missing parse_func in preprocessing/collections.py Signed-off-by: taejinp <[email protected]> * Adding the missing parse_func in preprocessing/collections.py Signed-off-by: taejinp <[email protected]> * Fixed an indentation error Signed-off-by: taejinp <[email protected]> * Resolved multi_bin_acc and bce_loss issues Signed-off-by: taejinp <[email protected]> * Resolved line-too-long for msdd_models.py Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Code QL issues and fixed test errors Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * line too long in audio_to_diar_label.py Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * resolving CICD test issues Signed-off-by: taejinp <[email protected]> * Fixing codeQL issues Signed-off-by: taejinp <[email protected]> * Fixed pin memory False for inference Signed-off-by: taejinp <[email protected]> --------- Signed-off-by: taejinp <[email protected]> Signed-off-by: tango4j <[email protected]> Co-authored-by: tango4j <[email protected]> * 2x more memory efficient Graph-based RNN-T (#11169) * Optimized Graph-Transducer implementation Signed-off-by: Vladimir Bataev <[email protected]> --------- Signed-off-by: Vladimir Bataev <[email protected]> 
Signed-off-by: artbataev <[email protected]> Co-authored-by: artbataev <[email protected]> * Use explicit subpaths in io for exporting a checkpoint (#11352) * Fix llm.export_ckpt Signed-off-by: Hemil Desai <[email protected]> * fix Signed-off-by: Hemil Desai <[email protected]> --------- Signed-off-by: Hemil Desai <[email protected]> * Remove triton requirement (#11627) * Specify pytorch-triton instead of triton Signed-off-by: Dong Hyuk Chang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove triton Signed-off-by: Dong Hyuk Chang <[email protected]> --------- Signed-off-by: Dong Hyuk Chang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * ci: Remove comment if no changes required anymore (#11624) Signed-off-by: Oliver Koenig <[email protected]> * Jit with peft (#11586) * move jitransform at the end Signed-off-by: Alexandros Koumparoulis <[email protected]> * add docstring & post-init Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add remove_extra_batch_keys and remove align_labels Signed-off-by: Alexandros Koumparoulis <[email protected]> * Run JitTransform on_train_epoch_start Signed-off-by: Alexandros Koumparoulis <[email protected]> * add --use-torch-jit option Signed-off-by: Alexandros Koumparoulis <[email protected]> * add docstrings Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * pep8 Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Co-authored-by: akoumpa <[email protected]> * NeMo-UX: add Hf's AutoModelForImageTextToText (#11321) * init commit Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: 
Alexandros Koumparoulis <[email protected]> * wip Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * peft examp;le Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * move peft example to multimodal_llm Signed-off-by: Alexandros Koumparoulis <[email protected]> * surface HFAutoModelForImageTextToText Signed-off-by: Alexandros Koumparoulis <[email protected]> * add hf vlm dataset Signed-off-by: Alexandros Koumparoulis <[email protected]> * move processor Signed-off-by: Alexandros Koumparoulis <[email protected]> * train_log -> train_loss Signed-off-by: Alexandros Koumparoulis <[email protected]> * vlm.HFDatasetDataModule pass collate_fn as argument Signed-off-by: Alexandros Koumparoulis <[email protected]> * Update peft example Signed-off-by: Alexandros Koumparoulis <[email protected]> * typo Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove unused var Signed-off-by: Alexandros Koumparoulis <[email protected]> * Move example Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * remove unused Signed-off-by: Alexandros Koumparoulis <[email protected]> * Small change Signed-off-by: Alexandros Koumparoulis <[email protected]> * Fix loss calculation Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add extract_skipped_token_ids Signed-off-by: Alexandros Koumparoulis <[email protected]> * Use vlm.HFAutoModelForImageTextToText.extract_skipped_token_ids 
Signed-off-by: Alexandros Koumparoulis <[email protected]> * add test Signed-off-by: Alexandros Koumparoulis <[email protected]> * Update logits/labels handling Signed-off-by: Alexandros Koumparoulis <[email protected]> * add trust_remote_code to configure_processor Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa <[email protected]> * mini refactor Signed-off-by: Alexandros Koumparoulis <[email protected]> * add LLAMA_TOKENS Signed-off-by: Alexandros Koumparoulis <[email protected]> * update hf_dataset Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add lora_dtype for models with non-FP weights Signed-off-by: Alexandros Koumparoulis <[email protected]> * Add load_in_4bit option Signed-off-by: Alexandros Koumparoulis <[email protected]> * add default_dtype Signed-off-by: Alexandros Koumparoulis <[email protected]> * add load_in_4bit to llm collection Signed-off-by: Alexandros Koumparoulis <[email protected]> * rm import Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix asset path Signed-off-by: Alexandros Koumparoulis <[email protected]> * move vlm test Signed-off-by: Alexandros Koumparoulis <[email protected]> * move data offline Signed-off-by: Alexandros Koumparoulis <[email protected]> * use signel gpu Signed-off-by: Alexandros Koumparoulis <[email protected]> * pylint fix Signed-off-by: Alexandros Koumparoulis <[email protected]> * pylint Signed-off-by: Alexandros Koumparoulis <[email protected]> * pylint Signed-off-by: Alexandros Koumparoulis <[email protected]> * drop align_labels Signed-off-by: Alexandros Koumparoulis <[email protected]> * remove align_labels from llm too Signed-off-by: Alexandros Koumparoulis <[email protected]> * use loss * mask instead of loss[mask == 1] Signed-off-by: Alexandros Koumparoulis <[email protected]> * fix path Signed-off-by: Alexandros Koumparoulis <[email protected]> * Apply isort and black reformatting Signed-off-by: akoumpa 
<[email protected]> --------- Signed-off-by: Alexandros Koumparoulis <[email protected]> Signed-off-by: akoumpa <[email protected]> Signed-off-by: Alexandros Koumparoulis <[email protected]> Co-authored-by: akoumpa <[email protected]> * ci: Bump release workflow (#11635) Signed-off-by: Oliver Koenig <[email protected]> * Add fix docstring for speech commands (#11638) Signed-off-by: smajumdar <[email protected]> * Fixing Multi_Task_Adapters.ipynb by replacing canary2 with canary_custom (#11641) Signed-off-by: Weiqing Wang <[email protected]> * fixed config name in online augmentation tutorial (#11628) Signed-off-by: Rauf <[email protected]> * fix default nodes (#11632) * add renormalize_blend_weights param (#11647) Signed-off-by: dimapihtar <[email protected]> * Sortformer Diarizer 4spk v1 model PR Part 3: Speaker Diarization Mixin (#11511) * Adding diarization mixin for one click inference Signed-off-by: taejinp <[email protected]> * Apply isort and black reformatting Signed-off-by: tango4j <[email protected]> * Resolving CodeQL and Pylint Signed-off-by: taejinp <[email protected]> * Resolving CodeQL and Pylint - unsaved files …

akoumpa added 2 commits December 11, 2024 14:40

Add from_dict method

90a1536

Signed-off-by: Alexandros Koumparoulis <[email protected]>

add test_load_from_dict

6672419

Signed-off-by: Alexandros Koumparoulis <[email protected]>

akoumpa force-pushed the akoumparouli/make_HFDatasetDataModule_arg_accept_path_or_dataset branch from fc74bee to 6672419 on December 11, 2024 22:41

akoumpa and others added 5 commits December 11, 2024 14:42

fix

a10326e

Signed-off-by: Alexandros Koumparoulis <[email protected]>

fix

0644c57

Signed-off-by: Alexandros Koumparoulis <[email protected]>

add test_load_from_dict

8e82e57

Signed-off-by: Alexandros Koumparoulis <[email protected]>

fix

11f1202

Signed-off-by: Alexandros Koumparoulis <[email protected]>

Apply isort and black reformatting

329d737

Signed-off-by: akoumpa <[email protected]>

akoumpa added the Run CICD label Dec 11, 2024

akoumpa marked this pull request as ready for review December 11, 2024 23:45

akoumpa requested review from hemildesai and cuichenx December 11, 2024 23:46

hemildesai reviewed Dec 12, 2024

ericharper approved these changes Dec 12, 2024

ericharper merged commit 05398c6 into main Dec 12, 2024
172 of 175 checks passed

ericharper deleted the akoumparouli/make_HFDatasetDataModule_arg_accept_path_or_dataset branch December 12, 2024 03:54
