[Audio] MP3 resampling is incorrect when dataset's audio files have different sampling rates #3662

Closed
lhoestq opened this issue Feb 1, 2022 · 6 comments · Fixed by #3665

@lhoestq
Member

lhoestq commented Feb 1, 2022

The Audio feature's MP3 resampler gets stuck with the first original sampling rate it encounters, which makes subsequent decoding incorrect.

Here is code to reproduce the issue:

Let's first consider two audio files with different sampling rates, 32000 and 16000:

# first download an mp3 file with sampling_rate=32000
!wget https://file-examples-com.github.io/uploads/2017/11/file_example_MP3_700KB.mp3

import torchaudio

audio_path = "file_example_MP3_700KB.mp3"
audio_path2 = audio_path.replace(".mp3", "_resampled.mp3")
# create a new file with sampling_rate=16000
resample = torchaudio.transforms.Resample(32000, 16000)
torchaudio.save(audio_path2, resample(torchaudio.load(audio_path)[0]), 16000)

Then we can see an issue here when decoding:

from datasets import Dataset, Audio

dataset = Dataset.from_dict({"audio": [audio_path, audio_path2]}).cast_column("audio", Audio(48000))
dataset[0]  # decoding the first audio file sets the resampler orig_freq to 32000
print(dataset.features["audio"]._resampler.orig_freq)
# 32000
print(dataset[0]["audio"]["array"].shape)  # here decoding is fine
# (1308096,)

dataset = Dataset.from_dict({"audio": [audio_path, audio_path2]}).cast_column("audio", Audio(48000))
dataset[1]  # decoding the second audio file sets the resampler orig_freq to 16000
print(dataset.features["audio"]._resampler.orig_freq)
# 16000
print(dataset[0]["audio"]["array"].shape)  # here decoding uses orig_freq=16000 instead of 32000
# (2616192,)

Once set, the value of orig_freq doesn't change, no matter which file needs to be decoded.

cc @patrickvonplaten @anton-l @cahya-wirawan @albertvillanova

The issue seems to be here in Audio.decode_mp3:

if self.sampling_rate and self.sampling_rate != sampling_rate:
    if not hasattr(self, "_resampler"):
        self._resampler = T.Resample(sampling_rate, self.sampling_rate)
    array = self._resampler(array)
    sampling_rate = self.sampling_rate
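
One possible way to address this, as a minimal sketch and not necessarily the actual patch: rebuild the cached resampler whenever the incoming file's sampling rate differs from the resampler's orig_freq. To keep the example runnable without torchaudio, FakeResample and Decoder below are hypothetical stand-ins (they are not part of datasets or torchaudio); only the caching condition matters here.

```python
class FakeResample:
    """Hypothetical stand-in for a resampler; it only mimics the change in length."""

    def __init__(self, orig_freq, new_freq):
        self.orig_freq = orig_freq
        self.new_freq = new_freq

    def __call__(self, samples):
        # Naive nearest-neighbour resampling, enough to show the output length.
        n_out = len(samples) * self.new_freq // self.orig_freq
        return [samples[i * self.orig_freq // self.new_freq] for i in range(n_out)]


class Decoder:
    """Hypothetical holder for the target sampling rate and the cached resampler."""

    def __init__(self, sampling_rate):
        self.sampling_rate = sampling_rate
        self._resampler = None

    def resample(self, array, sampling_rate):
        if self.sampling_rate and self.sampling_rate != sampling_rate:
            # Rebuild the cache when the source rate differs from the cached one.
            if self._resampler is None or self._resampler.orig_freq != sampling_rate:
                self._resampler = FakeResample(sampling_rate, self.sampling_rate)
            array = self._resampler(array)
            sampling_rate = self.sampling_rate
        return array, sampling_rate


decoder = Decoder(48000)
one_second_32k, _ = decoder.resample([0.0] * 32000, 32000)
one_second_16k, _ = decoder.resample([0.0] * 16000, 16000)
print(len(one_second_32k), len(one_second_16k))
```

With this condition, one second of audio at either source rate comes out as one second at the target rate, regardless of which file was decoded first.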

@cahya-wirawan
Contributor

Thanks @lhoestq for finding the cause of the incorrect resampling. This issue affects all languages that have sound files with different sampling rates, such as Turkish and Luganda.

@patrickvonplaten
Contributor

@cahya-wirawan - do you know how many languages have different sampling rates in Common Voice? I'm quite surprised to see this for multiple languages actually

@patrickvonplaten
Contributor

patrickvonplaten commented Feb 1, 2022

@cahya-wirawan, I can reproduce the problem for Common Voice 7 for Turkish. Here is a script you can use:

#!/usr/bin/env python3
from datasets import load_dataset
import torchaudio
from io import BytesIO
from datasets import Audio
from collections import Counter
import sys

ds_name = str(sys.argv[1])
lang = str(sys.argv[2])

ds = load_dataset(ds_name, lang, split="train", use_auth_token=True)
ds = ds.cast_column("audio", Audio(decode=False))

all_sampling_rates = []


def print_sampling_rate(x):
    # Decode the raw bytes only to read the file's native sampling rate.
    _, sr = torchaudio.load(BytesIO(x["audio"]["bytes"]), format="mp3")
    all_sampling_rates.append(sr)

ds.map(print_sampling_rate)


print(Counter(all_sampling_rates))

It can be run with:

python run.py mozilla-foundation/common_voice_7_0 tr

For CV 6.1, all samples seem to have the same sampling rate.

@patrickvonplaten
Contributor

It actually shows that many more samples are in 32kHz format than in 48kHz, which is unexpected. Thanks a lot for flagging! Will contact Common Voice about this as well.

@cahya-wirawan
Contributor

I only checked CV 7.0 for Turkish, Luganda and Indonesian. They all have audio files with different sampling rates, and all of them are affected by this issue. The percentages of incorrectly resampled files are as follows: Turkish 9.1%, Luganda 88.2% and Indonesian 64.1%.
I checked this using the original CV files. For each file, I took the original sampling rate and the length of the audio array, and compared them with the length of the audio array (whose sampling rate is always 48kHz) from the mozilla-foundation/common_voice_7_0 dataset. If the length of the audio array from the dataset is not equal to 48kHz / original sampling rate * length of the original audio array, then the file is affected.
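
The length check described above can be sketched roughly as follows. The function name is_affected and the tolerance parameter are hypothetical, added only for illustration (a real resampler may round the output length by a sample or so). The example numbers come from the shapes reported earlier in this thread: a decoded length of 1308096 at 48kHz from a 32kHz file implies an original length of 872064 samples.

```python
def is_affected(orig_len, orig_sr, decoded_len, target_sr=48000, tolerance=1):
    """Return True if the decoded length does not match the expected resampled length.

    orig_len / orig_sr: sample count and sampling rate of the original file.
    decoded_len: sample count of the array returned by the dataset.
    tolerance: allowed difference in samples (an assumption, to absorb rounding).
    """
    expected_len = target_sr / orig_sr * orig_len
    return abs(decoded_len - expected_len) > tolerance


# Correct decoding of the 32kHz file: 872064 * 48000 / 32000 = 1308096 samples.
print(is_affected(872064, 32000, 1308096))
# Decoding with a stale orig_freq of 16000 gives 2616192 samples, twice as many.
print(is_affected(872064, 32000, 2616192))
```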

@patrickvonplaten
Contributor

Ok wow, thanks a lot for checking this - you've found a pretty big bug 😅 It seems like a lot more datasets are actually affected than I originally thought. We'll try to solve this as soon as possible and make an announcement tomorrow.
