Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TPS-free 2D bucket estimation and filtering #11738

Merged
merged 20 commits into from
Jan 23, 2025

Conversation

pzelasko
Copy link
Collaborator

@pzelasko pzelasko commented Jan 2, 2025

What does this PR do ?

Improves the UX of estimating 2D buckets and preparing the training configuration. Key changes:

  • 2D bucket estimation auto-finds the right max-token-per-second setting for each bucket separately instead of assuming a global constant max-TPS. The per-bucket max-TPS is determined as 4*stddev(bucket_tps).
  • Training with 2D bucketing works without setting the max_tps filter.
  • The data that doesn't fit the 2D buckets (either 1st or 2nd dim) is filtered out automatically during sampling.
    • strict mode (default): first allocate the duration bucket, and then try to allocate to max_tokens sub-bucket. If an example doesn't fit, discard it.
    • lenient mode (model.train_ds.bucketing_2d_strict_mode=False): find any bucket that will fit a given example. That means token-per-second outliers are pushed to buckets with higher durations increasing the padding but reducing the amount of discarded data. Use at your own risk - for the setups I tested it so far with, it may cause training instability (likely due to inclusion of lower-quality data).
  • Fixes some issues with estimate_duration_bins_2d.py script.
  • For data with complex TPS distributions: max-TPS can be defined as a list with the same length as the list of buckets for fine-grained control. A best-guess setting is suggested by estimate_duration_bins_2d.py.
  • Renamed tarred_random_access to skip_missing_manifest_entries and adjusted the logic to reduce CPU memory usage and accelerate loading time.

Collection: ASR, TTS, SpeechLLM

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the common label Jan 2, 2025
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -15,9 +15,11 @@
import bisect
import logging
import math
from bisect import bisect_left, bisect_right

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'bisect_right' is not used.
Signed-off-by: Piotr Żelasko <[email protected]>
from lhotse.testing.dummies import DummyManifest, dummy_cut
from nemo.collections.common.data.lhotse.sampling import FixedBucketBatchSizeConstraint2D
from lhotse.testing.dummies import dummy_cut
from lhotse.testing.random import deterministic_rng

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'deterministic_rng' is not used.
filter_2d = BucketingFilter(constraint)
cut = make_cut(duration=2.0, num_tokens=20)
assert filter_2d(cut) == False
assert constraint.select_bucket(constraint.max_seq_len_buckets, cut) == None

Check notice

Code scanning / CodeQL

Testing equality to None Note test

Testing for None should use the 'is' operator.
@pzelasko pzelasko requested a review from nithinraok January 9, 2025 18:04
@pzelasko pzelasko marked this pull request as ready for review January 9, 2025 18:04
Signed-off-by: Piotr Żelasko <[email protected]>
Copy link
Collaborator

@nithinraok nithinraok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -698,28 +709,32 @@ def make_structured_with_schema_warnings(config: DictConfig | dict) -> DictConfi
if not isinstance(config, DictConfig):
config = DictConfig(config)

if config.get("tarred_random_access", False):
logging.warning(
"Option 'tarred_random_access' is deprecated and replaced with 'skip_missing_manifest_entries'.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be also add version from which this would be removed

nithinraok
nithinraok previously approved these changes Jan 10, 2025
pzelasko and others added 2 commits January 12, 2025 17:22
Signed-off-by: Piotr Żelasko <[email protected]>
@pzelasko pzelasko requested a review from nithinraok January 17, 2025 01:24
@github-actions github-actions bot added core Changes to NeMo Core NLP CI Multi Modal labels Jan 17, 2025
@pzelasko pzelasko force-pushed the 2d-bucketing-and-tps-improvements-2 branch from 49a1536 to a238da2 Compare January 17, 2025 01:27
@github-actions github-actions bot removed core Changes to NeMo Core NLP CI Multi Modal labels Jan 17, 2025
@pzelasko pzelasko enabled auto-merge (squash) January 17, 2025 01:28
Signed-off-by: Piotr Żelasko <[email protected]>
Copy link
Contributor

beep boop 🤖: 🚨 The following files must be fixed before merge!


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.common.data.lhotse.sampling
nemo/collections/common/data/lhotse/sampling.py:15:0: W0611: Unused import bisect (unused-import)
nemo/collections/common/data/lhotse/sampling.py:16:0: W0611: Unused import logging (unused-import)
nemo/collections/common/data/lhotse/sampling.py:18:0: W0611: Unused bisect_right imported from bisect (unused-import)

-----------------------------------
Your code has been rated at 9.87/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.

By applying these rules, we reduce the occurance of this message in future.

Thank you for improving NeMo's documentation!

Copy link
Contributor

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully

So it might be time to merge this PR or get some approvals

I'm just a bot so I'll leave it you what to do next.

//cc @pablo-garay @ko3n1g

Copy link
Collaborator

@nithinraok nithinraok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@pzelasko pzelasko merged commit 7e24313 into main Jan 23, 2025
207 of 215 checks passed
@pzelasko pzelasko deleted the 2d-bucketing-and-tps-improvements-2 branch January 23, 2025 19:24
parthmannan pushed a commit that referenced this pull request Jan 28, 2025
* TPS-free 2D bucket estimation and filtering

Signed-off-by: Piotr Żelasko <[email protected]>

* fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* extra unit test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes in 2D bin estimation script and OOMptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* Support list values for max_tps/max_tpt and bucketing

Signed-off-by: Piotr Żelasko <[email protected]>

* fix max_tps/max_tpt config parsing for list values

Signed-off-by: Piotr Żelasko <[email protected]>

* Transition: tarred_random_access -> skip_missing_manifest_entries

Signed-off-by: Piotr Żelasko <[email protected]>

* Improve warning messages

Signed-off-by: Piotr Żelasko <[email protected]>

* Support single unified BPE tokenizer for Canary2

Signed-off-by: Piotr Zelasko <[email protected]>

* bugfix +es to unified tokenizer support for Canary

Signed-off-by: Piotr Żelasko <[email protected]>

* CanaryBPETokenizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes to fetching pad id in AED models

Signed-off-by: Piotr Żelasko <[email protected]>

* Update docs

Signed-off-by: Piotr Żelasko <[email protected]>

* fix tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Remove max_tps and max_duration from OOMptimizer output

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Piotr Zelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Signed-off-by: Parth Mannan <[email protected]>
abhinavg4 pushed a commit that referenced this pull request Jan 30, 2025
* TPS-free 2D bucket estimation and filtering

Signed-off-by: Piotr Żelasko <[email protected]>

* fixes

Signed-off-by: Piotr Żelasko <[email protected]>

* extra unit test

Signed-off-by: Piotr Żelasko <[email protected]>

* Fixes in 2D bin estimation script and OOMptimizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Apply isort and black reformatting

Signed-off-by: pzelasko <[email protected]>

* Support list values for max_tps/max_tpt and bucketing

Signed-off-by: Piotr Żelasko <[email protected]>

* fix max_tps/max_tpt config parsing for list values

Signed-off-by: Piotr Żelasko <[email protected]>

* Transition: tarred_random_access -> skip_missing_manifest_entries

Signed-off-by: Piotr Żelasko <[email protected]>

* Improve warning messages

Signed-off-by: Piotr Żelasko <[email protected]>

* Support single unified BPE tokenizer for Canary2

Signed-off-by: Piotr Zelasko <[email protected]>

* bugfix +es to unified tokenizer support for Canary

Signed-off-by: Piotr Żelasko <[email protected]>

* CanaryBPETokenizer

Signed-off-by: Piotr Żelasko <[email protected]>

* Revert changes to fetching pad id in AED models

Signed-off-by: Piotr Żelasko <[email protected]>

* Update docs

Signed-off-by: Piotr Żelasko <[email protected]>

* fix tests

Signed-off-by: Piotr Żelasko <[email protected]>

* Remove max_tps and max_duration from OOMptimizer output

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
Signed-off-by: Piotr Zelasko <[email protected]>
Co-authored-by: pzelasko <[email protected]>
Signed-off-by: Abhinav Garg <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants