-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add slice_with_offset and dry_run Support for Tar Dataset Creation; New Script for Partial Conversion #10511
Conversation
Signed-off-by: Sasha Meister <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
Signed-off-by: ssh-meister <[email protected]>
Signed-off-by: Sasha Meister <[email protected]>
Signed-off-by: ssh-meister <[email protected]>
encoded_audio = BytesIO() | ||
if codec == "opus": | ||
kwargs = {"format": "ogg", "subtype": "opus"} | ||
if codec is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we could do a fast-path here: if codec=None and offset=0 and duration=None
, instead of reading+writing with soundfile (triggering encoding and decoding) just read the raw audio file bytes and write them to tar file directly. Will likely speed up data tarring by a non-trivial factor. Up to you.
def _write_to_tar( | ||
self, tar, audio_filepath: str, squashed_filename: str, duration: float = None, offset: float = 0 | ||
) -> None: | ||
if ((codec := self.config.force_codec) is None or audio_filepath.endswith(f".{codec}")) and not duration: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are so many if/else here now that I think we should have some unit test coverage for this script (or just the function).
), | ||
) | ||
parser.add_argument( | ||
"--dry_run", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dry_run
is not an accurate name since the script will create some output: the manifests without tar files. I suggest renaming it to --only_manifest
instead.
BTW dry_run
would also be a nice option, telling you very quickly how many shards with how much data per shard are going to be created without actually reading any audio or writing anything.
""" | ||
# Partial Tarred Audio Dataset Creator | ||
|
||
## Overview |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the purpose of this script. Why would you want to only tar specific shards? Can you add some comments / docs about the intended use cases?
Signed-off-by: ssh-meister <[email protected]>
Signed-off-by: Sasha Meister <[email protected]>
Signed-off-by: ssh-meister <[email protected]>
Signed-off-by: Sasha Meister <[email protected]>
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
This PR was closed because it has been inactive for 7 days since being marked as stale. |
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
This PR was closed because it has been inactive for 7 days since being marked as stale. |
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:
Mitigation guide:
By applying these rules, we reduce the occurance of this message in future. Thank you for improving NeMo's documentation! |
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days. |
This PR was closed because it has been inactive for 7 days since being marked as stale. |
What does this PR do ?
slice_with_offset
Feature:dry_run
Flag:partial_conversion_to_tarred_audio_dataset.py
:Changelog
--
create_shards
: Updated to support theslice_by_offset flag
for segment-based tar creation anddry_run
functionality.-- Added new functionality to handle sharded manifests when
dry_run
is enabled.--
partial_conversion_to_tarred_audio_dataset.py:
Processes dry-run manifests to generate individual shards.PR Type:
Additional Information