
Data setup: Librispeech downloading code raises exception #324

Closed
priyakasimbeg opened this issue Feb 22, 2023 · 3 comments
Assignees
Labels
🚀 Launch Blocker Issues that are blocking launch of benchmark

Comments

@priyakasimbeg
Contributor

From Mike's test:

None of the tar commands succeeds. This happens on lines 433-434 and on line 447, and I believe they fail for different reasons.
The first time (lines 433-434), the path to the tar file is incorrect. I believe the script is intended to be run from the algorithmic_efficiency root folder (the instructions say to call python3 datasets/dataset_setup.py …), but I used different tmp and data directories.
The second time (line 447), it fails because the wget call (line 445) hasn't completed yet.
To fix this, I suggest:
- passing cwd=tmp_librispeech_dir to the Popen constructor on line 434
- appending .communicate() to the wget call on line 445
- changing line 447 to subprocess.Popen(f'tar xzvf {tar_filename}', shell=True, cwd=tmp_librispeech_dir).communicate()
(I tested those changes locally and they solved those problems.)
After untarring the files, everything ends up in tmp_librispeech_dir/LibriSpeech, so line 450 should be changed to take data_dir=os.path.join(tmp_librispeech_dir, 'LibriSpeech').
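The two subprocess fixes above can be sketched as one helper; `untar_in_dir` is an illustrative name, not a function from the repo, but it shows the suggested pattern of passing cwd= so the relative tar path resolves correctly and calling .communicate() so extraction finishes before the next step runs:

```python
import subprocess


def untar_in_dir(tar_filename, work_dir):
    """Extract a tarball inside work_dir and block until extraction finishes.

    cwd=work_dir makes the relative tar_filename resolve against work_dir
    regardless of where python3 was launched; .communicate() waits for the
    child process, so later steps never race ahead of the extraction.
    """
    p = subprocess.Popen(f'tar xzvf {tar_filename}', shell=True, cwd=work_dir)
    p.communicate()
    return p.returncode
```

The same cwd=/.communicate() pair applies to the wget call on line 445, which has the analogous race with the tar command that follows it.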
There are also path-related problems in librispeech_tokenizer.py. Here the file spm_model.vocab gets copied to the algorithmic_efficiency directory (from which python3 was called). It then isn't found when librispeech_tokenizer.load_tokenizer() gets called from librispeech_preprocess.run().
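One way to avoid the working-directory dependence described above is to copy the trained tokenizer to an absolute path derived from the data directory instead of the process's CWD. This is a minimal sketch under that assumption; copy_model_to and its arguments are illustrative, not the repo's actual API:

```python
import os
import shutil


def copy_model_to(data_dir, model_src, model_name='spm_model.vocab'):
    """Copy a trained tokenizer file next to the data directory.

    Resolving the destination with os.path.abspath decouples the model's
    location from wherever python3 was invoked, so a later loader that is
    pointed at data_dir can always find the file.
    """
    abs_model_path = os.path.abspath(os.path.join(data_dir, model_name))
    shutil.copy(model_src, abs_model_path)
    return abs_model_path
```

The loader would then build the same absolute path from data_dir rather than assuming the model sits in the current working directory.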

@priyakasimbeg
Contributor Author

Open question on pr/400 from @sourabh2k15:

I ran the script and it wasn't able to find the tokenizer after training it. Where do we write the trained tokenizer to on disk?

@chandramouli-sastry
Contributor

Hi, sorry for the extremely late reply -- the trained tokenizer can be found in the path printed by this line:

logging.info('Copied %s to %s', model_fp.name + '.model', abs_model_path)

@priyakasimbeg priyakasimbeg added the 🚀 Launch Blocker Issues that are blocking launch of benchmark label Jul 18, 2023
@priyakasimbeg
Contributor Author

Fixed in #465 by @sourabh2k15
