[Audio] MP3 resampling is incorrect when dataset's audio files have different sampling rates #3662
Comments
Thanks @lhoestq for finding the cause of the incorrect resampling. This issue affects all languages that have sound files with different sampling rates, such as Turkish and Luganda.
@cahya-wirawan - do you know how many languages have different sampling rates in Common Voice? I'm quite surprised to see this for multiple languages, actually.
@cahya-wirawan, I can reproduce the problem for Common Voice 7 for Turkish. Here is a script you can use:

    #!/usr/bin/env python3
    import sys
    from collections import Counter
    from io import BytesIO

    import torchaudio
    from datasets import Audio, load_dataset

    ds_name = str(sys.argv[1])
    lang = str(sys.argv[2])

    ds = load_dataset(ds_name, lang, split="train", use_auth_token=True)
    # Keep the raw MP3 bytes so each file's native sampling rate can be read.
    ds = ds.cast_column("audio", Audio(decode=False))

    all_sampling_rates = []

    def print_sampling_rate(x):
        # Decode the bytes directly and record the file's own sampling rate.
        _, sr = torchaudio.load(BytesIO(x["audio"]["bytes"]), format="mp3")
        all_sampling_rates.append(sr)

    ds.map(print_sampling_rate)
    # Number of files per sampling rate.
    print(Counter(all_sampling_rates))

It can be run with:

    python run.py mozilla-foundation/common_voice_7_0 tr

For CV 6.1 all samples seem to have the same audio sampling rate.
It actually shows that many more samples are in 32kHz format than in 48kHz, which is unexpected. Thanks a lot for flagging! Will contact Common Voice about this as well.
I only checked CV 7.0 for Turkish, Luganda and Indonesian. They all have audio files with different sampling rates, and all of them are affected by this issue. The percentages of incorrectly resampled files are as follows: Turkish 9.1%, Luganda 88.2% and Indonesian 64.1%.
Ok wow, thanks a lot for checking this - you've found a pretty big bug 😅 It seems like a lot more datasets are actually affected than I originally thought. We'll try to solve this as soon as possible and make an announcement tomorrow.
The Audio feature resampler for MP3 gets stuck on the first original frequency it encounters, which makes the decoding of subsequent files incorrect.
Here is code to reproduce the issue:
Let's first consider two audio files with different sampling rates 32000 and 16000:
Then we can see an issue here when decoding:
The value of orig_freq doesn't change no matter what file needs to be decoded.
cc @patrickvonplaten @anton-l @cahya-wirawan @albertvillanova
The issue seems to be here in Audio.decode_mp3:
datasets/src/datasets/features/audio.py, lines 176 to 180 in 4c417d5
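To illustrate the failure mode described above, here is a minimal, library-free sketch. It is not the code at the referenced lines of audio.py (those lines are not reproduced in this thread). The classes FakeResampler, StuckDecoder and PerRateDecoder are hypothetical names invented for this illustration, and the 48000 Hz target rate is an assumption. The sketch only shows how caching a single resampler pins orig_freq to the first file's rate, and one possible way around it: keeping one resampler per source rate.

```python
class FakeResampler:
    """Hypothetical stand-in for a real resampler; it only stores its rates."""

    def __init__(self, orig_freq, new_freq):
        self.orig_freq = orig_freq
        self.new_freq = new_freq

    def output_length(self, num_samples):
        # Number of samples a resampler with these rates would produce.
        return round(num_samples * self.new_freq / self.orig_freq)


class StuckDecoder:
    """Builds one resampler on first use and reuses it for every file."""

    def __init__(self, target_rate):
        self.target_rate = target_rate
        self._resampler = None

    def resampler_for(self, source_rate):
        if self._resampler is None:
            self._resampler = FakeResampler(source_rate, self.target_rate)
        # source_rate is ignored after the first call: orig_freq is stuck.
        return self._resampler


class PerRateDecoder:
    """Possible remedy: keep one resampler per source sampling rate."""

    def __init__(self, target_rate):
        self.target_rate = target_rate
        self._resamplers = {}

    def resampler_for(self, source_rate):
        if source_rate not in self._resamplers:
            self._resamplers[source_rate] = FakeResampler(source_rate, self.target_rate)
        return self._resamplers[source_rate]


# Two files with the sampling rates from the issue: 32000 Hz, then 16000 Hz.
file_rates = (32_000, 16_000)
stuck = StuckDecoder(target_rate=48_000)    # target rate is an assumption
fixed = PerRateDecoder(target_rate=48_000)

stuck_rates = [stuck.resampler_for(sr).orig_freq for sr in file_rates]
fixed_rates = [fixed.resampler_for(sr).orig_freq for sr in file_rates]
print(stuck_rates)  # [32000, 32000]
print(fixed_rates)  # [32000, 16000]

# One second of 16000 Hz audio should become 48000 samples at the target rate.
stuck_len = stuck.resampler_for(16_000).output_length(16_000)
fixed_len = fixed.resampler_for(16_000).output_length(16_000)
print(stuck_len, fixed_len)  # 24000 48000
```

In this sketch the stuck decoder treats the 16000 Hz file as if it were 32000 Hz, so one second of audio comes out as half a second at the target rate. That matches the symptom reported here, where orig_freq stays the same regardless of the file being decoded.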