
Data setup: Librispeech downloading code raises exception #324

Closed
priyakasimbeg opened this issue Feb 22, 2023 · 3 comments
Assignees
Labels
🚀 Launch Blocker Issues that are blocking launch of benchmark

Comments

@priyakasimbeg
Contributor

From Mike's test:

None of the tar commands succeeds. This happens on lines 433-434 and on line 447, and I believe they fail for different reasons.
The first time (lines 433-434), the path to the tar file is incorrect. I believe the script is intended to be run from the algorithmic_efficiency root folder (the instructions say to call python3 datasets/dataset_setup.py …), but I used different tmp and data directories.
The second time (line 447), it fails because the wget call (line 445) hasn't completed yet.
To fix this, I suggest:
- passing cwd=tmp_librispeech_dir to the Popen constructor on line 434
- appending .communicate() to the wget call on line 445
- changing line 447 to subprocess.Popen(f'tar xzvf {tar_filename}', shell=True, cwd=tmp_librispeech_dir).communicate()
(I tested those changes locally and they solved those problems.)
After untarring the files, everything ends up in tmp_librispeech_dir/LibriSpeech, so line 450 should be changed to take data_dir=os.path.join(tmp_librispeech_dir, 'LibriSpeech').
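The two subprocess fixes above can be sketched as one helper; `untar_in_dir` is an illustrative name, not a function from the repo, but it shows the suggested pattern of passing cwd= so the relative tar path resolves correctly and calling .communicate() so extraction finishes before the next step runs:

```python
import subprocess


def untar_in_dir(tar_filename, work_dir):
    """Extract a tarball inside work_dir and block until extraction finishes.

    cwd=work_dir makes the relative tar_filename resolve against work_dir
    regardless of where python3 was launched; .communicate() waits for the
    child process, so later steps never race ahead of the extraction.
    """
    p = subprocess.Popen(f'tar xzvf {tar_filename}', shell=True, cwd=work_dir)
    p.communicate()
    return p.returncode
```

The same cwd=/.communicate() pair applies to the wget call on line 445, which has the analogous race with the tar command that follows it.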
There are also path-related problems in librispeech_tokenizer.py. Here the file spm_model.vocab gets copied to the algorithmic_efficiency directory (from which python3 was called). It then isn't found when librispeech_tokenizer.load_tokenizer() gets called from librispeech_preprocess.run().
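One way to avoid the working-directory dependence described above is to copy the trained tokenizer to an absolute path derived from the data directory instead of the process's CWD. This is a minimal sketch under that assumption; copy_model_to and its arguments are illustrative, not the repo's actual API:

```python
import os
import shutil


def copy_model_to(data_dir, model_src, model_name='spm_model.vocab'):
    """Copy a trained tokenizer file next to the data directory.

    Resolving the destination with os.path.abspath decouples the model's
    location from wherever python3 was invoked, so a later loader that is
    pointed at data_dir can always find the file.
    """
    abs_model_path = os.path.abspath(os.path.join(data_dir, model_name))
    shutil.copy(model_src, abs_model_path)
    return abs_model_path
```

The loader would then build the same absolute path from data_dir rather than assuming the model sits in the current working directory.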

@priyakasimbeg
Contributor Author

Open question on pr/400 from @sourabh2k15:

I ran the script and it wasn't able to find the tokenizer after training it. Where do we write the trained tokenizer to on disk?

@chandramouli-sastry
Contributor

Hi, sorry for the extremely late reply -- the trained tokenizer can be found in the path printed by this line:

logging.info('Copied %s to %s', model_fp.name + '.model', abs_model_path)

@priyakasimbeg priyakasimbeg added the 🚀 Launch Blocker Issues that are blocking launch of benchmark label Jul 18, 2023
@priyakasimbeg
Contributor Author

Fixed in #465 by @sourabh2k15
