snfa
(Simple Neural Forced Aligner) is a phoneme-to-audio forced aligner built for embedded usage in python programs, with its only inference dependency being numpy
and python 3.7 or later.
- Tiny model size (2 MB)
- Numpy as the only dependency
- MFA comparable alignment quality
Note: You still need PyTorch
and some other libs if you want to do training.
pip install snfa
Download the pretrained cv_jp.bin
weights from release.
cv_jp.bin
is a weight file trained on Japanese Common Voice Corpus 14.0, 6/28/2023. The model weight is released into Public Domain
.
import snfa
aligner = snfa.Aligner("cv_jp.bin")
transcript = "k o N n i ch i w a".split(" ")
# you can also use `scipy` or `wavfile` as long as you normalize it to [-1,1]
x, _ = librosa.load("sample.wav", sr=aligner.sr)
segments, path, trellis, emission, labels = aligner(x, transcript)
print(segment)
I'll cover this part if it's needed by anyone. Please let me know by creating an issue if you need.
- Rust crate
- multi-language
- Storing
pau
index in binary model - Record and warn the user when score is too low
snfa
is released under ISC Licence
, as shown here.
The file snfa/stft.py
contains code adapted from librosa
which obeys ISC Licence
with different copyright claim. A copy of librosa
's licence can be found in librosa's repo.
The file snfa/viterbi.py
contains code adapted from torchaudio
which obeys BSD 2-Clause "Simplified" License
. A copy of torchaudio
's licence can be found in torchaudio's repo.
The neural network used in snfa
is basically a PyTorch implementation of CTC*
structure described in Evaluating Speech—Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis.