
--uses_speaker_adaptation off, but alignment still variable depending on set size #809

Open
amo104 opened this issue May 21, 2024 · 3 comments


amo104 commented May 21, 2024

Hello,

I wanted some clarification on how MFA's alignment works. As I understand it, there's a two-step alignment process, where the second pass uses per-speaker features. I'm trying to standardize my process in MFA so I can see how I might use the log-likelihood information to find errors in transcriptions or in speech productions. (I won't be needing the actual TextGrids for anything.) While aligning with --uses_speaker_adaptation off, --single_speaker on, and setting the same seed every run, I'm finding that the log-likelihoods are still variable.

I ran a test on a wav set A, a wav set B, and a wav set C = A + B. It's unclear whether A/B or C is generally better or worse, but overall the log-likelihoods are somewhat variable when comparing the same file in one of the smaller sets to the same file in the larger set. When aligning C twice, the output log-likelihoods are also marginally different, but they are stable to about five digits, which leads me to think this problem is related to the size of the data set.

[Screenshot of log-likelihood results: left is the results of A and B, right is the results of C.]

Is there anything more I could be doing to get outputs that are as consistent as possible, regardless of data set size? Or is this variance fundamental to how MFA works? For my purposes, it's more important that I get the same result every time than that I get the most fitting alignment, so I'm okay with turning off as many features as necessary. Thank you!

Specs: MFA 3.0.7, Windows 10 Education, 7,800 WAV files of pseudo-English non-words, using the english_us_arpa acoustic model and a custom dictionary.
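
For reference, here is a minimal sketch of the kind of run-to-run comparison described above, assuming the per-file log-likelihoods from each run have been exported to CSV (MFA does not produce these exact files; the paths and column names are placeholders for however you are collecting the values):

```python
import csv

# Placeholder file names and columns: adjust to however you export
# per-file log-likelihoods from your runs.
RUN_AB = "loglikes_sets_A_B.csv"   # expected columns: file, log_likelihood
RUN_C = "loglikes_set_C.csv"

def load_loglikes(path):
    """Read a CSV of per-file log-likelihoods into {file_name: value}."""
    with open(path, newline="", encoding="utf8") as f:
        return {row["file"]: float(row["log_likelihood"]) for row in csv.DictReader(f)}

ab = load_loglikes(RUN_AB)
c = load_loglikes(RUN_C)

# Compare the same files across the two runs and report the largest deviations.
diffs = sorted(
    ((name, abs(ab[name] - c[name])) for name in ab.keys() & c.keys()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, delta in diffs[:20]:
    print(f"{name}\t{delta:.6f}")
```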

@mmcauliffe (Member)

There are a couple of sources of variability that could be playing a role. The first is that feature generation uses dithering, which can be turned off via --dither 0, but a larger source is likely that you have all of the wav files in the same directory, correct? CMVN is still calculated at the speaker level, so adding files changes the CMVN stats. If you put them in different directories, then it should calculate CMVN stats per file, which would then be consistent across datasets. If you already have them separated out, then dithering should be the only source of difference across runs.
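
A minimal sketch of the per-directory approach, assuming a flat corpus of WAV files with matching .lab or .TextGrid transcripts; it copies each file pair into its own subdirectory so that MFA's default directory-as-speaker parsing computes CMVN per file (the paths are placeholders):

```python
from pathlib import Path
import shutil

# Placeholder paths: point these at your flat corpus and a new output location.
SRC = Path("corpus_flat")
DST = Path("corpus_per_file_speakers")

for wav in sorted(SRC.glob("*.wav")):
    # One subdirectory per file name, so each file becomes its own "speaker"
    # and gets its own CMVN stats.
    speaker_dir = DST / wav.stem
    speaker_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(wav, speaker_dir / wav.name)
    # Copy whichever transcript exists alongside the wav.
    for ext in (".lab", ".TextGrid"):
        transcript = wav.with_suffix(ext)
        if transcript.exists():
            shutil.copy2(transcript, speaker_dir / transcript.name)
```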


amo104 commented May 21, 2024

Great, it seems like implementing both of those changes gives me the behavior I want, thank you! So for the future, instead of using --single_speaker, the goal would be to have MFA treat every file as a separate speaker (either by separating the files into directories or by passing a --speaker_characters value large enough that they all get split), correct?

@mmcauliffe (Member)

Either splitting into directories or passing --speaker_characters 500 (or something obscenely large) would ensure the whole file name is treated as the "speaker" (assuming the file names are unique). I'll think about adding a flag for parsing the corpus as though every file is a unique speaker, since both the directory and speaker_characters approaches are a bit roundabout and it shouldn't be too much work. The only thing to decide is whether to add a boolean flag like --no_speakers, or to deprecate the current --single_speaker flag and add a --speaker_mode flag with options like "auto"/"directory"/"default" vs "single" vs "none".
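
For completeness, a sketch of how the pieces discussed in this thread might be combined into one invocation, here wrapped in Python's subprocess. The positional argument order follows mfa align's documented CORPUS DICTIONARY ACOUSTIC_MODEL OUTPUT signature; the corpus, dictionary, and output paths are placeholders, and the option spellings are copied from this thread, so double-check them against `mfa align --help` for your MFA version:

```python
import subprocess

# Placeholder paths; option spellings follow this thread and should be
# verified against `mfa align --help` for your MFA version.
cmd = [
    "mfa", "align",
    "corpus_directory",         # flat corpus of wav + transcript files (placeholder)
    "custom_dictionary.dict",   # custom dictionary from the issue (placeholder path)
    "english_us_arpa",          # acoustic model named in the issue
    "aligned_output",           # output directory (placeholder)
    "--dither", "0",                      # turn off dithering in feature generation
    "--uses_speaker_adaptation", "off",   # skip the speaker-adapted second pass
    "--speaker_characters", "500",        # whole (unique) file name becomes the "speaker";
                                          # alternatively, put each file in its own directory
]
subprocess.run(cmd, check=True)
```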
