Regarding ASR testing #40

Open
Simplesss opened this issue Aug 28, 2024 · 3 comments

@Simplesss

Hello, thank you very much for your work. I would like to reproduce the ASR performance of the AnyGPT Base model on LibriSpeech test-clean. The paper reports a WER of 8.5, but my test result was 14.5 (using the command format speech | text | {speech file path}). Could the gap be caused by a prompt being randomly selected for the ASR task at each inference? If possible, could you share the code you used to calculate WER (I used 7 Composers from jiwer for the calculation), as well as the text output of the model's ASR? Looking forward to your reply.

@JunZhan2000
Collaborator

JunZhan2000 commented Sep 30, 2024

Hello, I think it's probably not an issue with the prompt; each prompt was seen many times during training.
I would like to confirm two things. First, are you using beam search as your decoding strategy? This strategy generally produces the best results. Second, the transcription results need some post-processing to standardize them, because the output format of the LLM is very different from the ground truth. This includes punctuation and contractions such as "you're", which should be "you are" in the ground truth.
I also use jiwer for calculating WER.
Regarding the test code, unfortunately it was lost during an environment migration, but I believe that if you use GPT to write some standardization code, you should be able to achieve the results mentioned in the paper. (I didn't handle all the standardization cases.)
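
Since the original test code was lost, the following is only a minimal sketch of the kind of standardization and scoring described above, not the authors' implementation. It uses plain Python in place of jiwer so that it is self-contained. The contraction table and the names `normalize`, `edit_distance`, and `corpus_wer` are hypothetical, chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny contraction table; extend as needed.
CONTRACTIONS = {"you're": "you are", "i'm": "i am", "don't": "do not"}


def normalize(text):
    """Lowercase, drop punctuation, and expand known contractions into words."""
    text = text.lower().replace("\u2019", "'")  # curly to straight apostrophe
    text = re.sub(r"[^\w\s']", " ", text)       # punctuation becomes a space
    words = []
    for token in text.split():
        token = token.strip("'")
        if token:
            words.extend(CONTRACTIONS.get(token, token).split())
    return words


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors over total reference words, as a percentage."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return 100.0 * errors / total


refs = ["YOU ARE RIGHT", "HE WENT HOME"]
hyps = ["You're right.", "He went to home."]
print(corpus_wer(refs, hyps))
```

Both the reference and the hypothesis go through the same `normalize` step, so differences in case, punctuation, and the listed contractions no longer count as errors. Errors are pooled over all utterances before dividing, which is how corpus-level WER is usually reported; averaging per-utterance WER instead can give a different number.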

@Changhao-Xiang

Hello, have you reproduced the results successfully? My reproduced performance on LibriSpeech test-clean is also a WER around 15 with the following configs:

{
    "do_sample": false,
    "max_new_tokens": 100,
    "min_new_tokens": 1,
    "repetition_penalty": 1.0,
    "num_beams": 5
}

@jingfanke

Hello, I wanted to check in and see if you've successfully reproduced the results.

My performance on the LibriSpeech test-clean dataset yielded a WER of approximately 14.3. For my setup, I configured the model generation with "num_beams": 5 and utilized the standardization code from

import re

# Assumption: clean() with these keyword arguments comes from the clean-text package.
from cleantext import clean

# Note: inverse_normalizer is not defined in the posted snippet. It is presumably an
# inverse text normalization object (for example NeMo's InverseNormalizer) created elsewhere.


def text_normalization(original_text):
    text = clean(
        original_text,
        fix_unicode=True,  # fix various unicode errors
        to_ascii=True,  # transliterate to closest ASCII representation
        lower=True,  # lowercase text
        no_line_breaks=False,  # fully strip line breaks as opposed to only normalizing them
        no_urls=False,  # replace all URLs with a special token
        no_emails=False,  # replace all email addresses with a special token
        no_phone_numbers=False,  # replace all phone numbers with a special token
        no_numbers=False,  # replace all numbers with a special token
        no_digits=False,  # replace all digits with a special token
        no_currency_symbols=False,  # replace all currency symbols with a special token
        no_punct=False,  # remove punctuations
        replace_with_punct="",  # instead of removing punctuations you may replace them
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUMBER>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en",  # set to 'de' for German special handling
    )
    text = inverse_normalizer.inverse_normalize(text, verbose=False)
    text = text.lower()
    # A dictionary of contractions and their expanded forms, including "didn't"
    contractions = {
        "i'm": "i am", "don't": "do not", "can't": "cannot", "it's": "it is",
        "isn't": "is not", "he's": "he is", "she's": "she is", "that's": "that is",
        "what's": "what is", "where's": "where is", "there's": "there is",
        "who's": "who is", "how's": "how is", "i've": "i have", "you've": "you have",
        "we've": "we have", "they've": "they have", "i'd": "i would", "you'd": "you would",
        "he'd": "he would", "she'd": "she would", "we'd": "we would", "they'd": "they would",
        "i'll": "i will", "you'll": "you will", "he'll": "he will", "she'll": "she will",
        "we'll": "we will", "they'll": "they will", "didn't": "did not"
    }
    # Manually handle contractions (plain substring replacement, not word-aware)
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    # Remaining rules are the same as previous implementation
    text = re.sub(r"\[.*?\]", "", text)  # drop bracketed spans
    text = re.sub(r"\(.*?\)", "", text)  # drop parenthesized spans
    fillers = ["hmm", "mm", "mhm", "mmm", "uh", "um"]
    filler_pattern = r'\b(?:' + '|'.join(fillers) + r')\b'
    text = re.sub(filler_pattern, "", text)
    text = re.sub(r"\s’", "’", text)
    text = re.sub(r"(?<=\d),(?=\d)", "", text)  # remove thousands separators
    text = re.sub(r"\.(?!\d)", "", text)  # remove periods that are not decimal points
    text = re.sub(r"[^\w\s.,%$]", "", text)
    text = re.sub(r"\s+", " ", text)  # collapse whitespace
    return text.strip()
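
One detail of the contraction step above is that `str.replace` matches substrings, so a key such as "it's" also fires inside a longer word like "kit's". The following is a small sketch of a word-boundary variant, offered as one possible refinement rather than a known fix for the WER gap. The three-entry table and the name `expand_contractions` are hypothetical.

```python
import re

# Small illustrative subset of the contraction table above.
contractions = {"it's": "it is", "he's": "he is", "didn't": "did not"}

# One alternation, longest keys first, anchored on word boundaries.
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(contractions, key=len, reverse=True))
    + r")\b"
)


def expand_contractions(text):
    """Expand only whole-word contractions, in a single pass over the text."""
    return pattern.sub(lambda m: contractions[m.group(0)], text)


sample = "the kit's lid didn't fit"
print(expand_contractions(sample))
print(sample.replace("it's", "it is"))
```

The first call leaves "kit's" untouched and expands "didn't", while the plain `replace` call rewrites "kit's" as "kit is". Doing the substitution in a single pass also means one expansion cannot be altered by a later rule.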
