v0.5.0
This is a HUGE release with some great new features and fixes!
Highlights
- Timestamp logits filter by @jkrukowski
  - Significantly increases the number of timestamp tokens in a particular window, which helps a lot with segmentation
  - On by default, but can be disabled with the decoding option `withoutTimestamps: true`
- Language detection by @Abhinay1997
  - New function on the `TextDecoding` protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
  - Enabled by default for decoding options where `usePrefillPrompt: false`, `language: nil`, and the model is not English-only
- First token log prob thresholds fallback check by @jkrukowski
  - This feature is not in the original OpenAI implementation, but it helps reduce hallucinations quite a bit
  - Fallbacks due to the log prob threshold are often immediately identifiable from the first token, so this check reduces the number of forward passes needed before moving to a higher temperature
- Distil Whisper support
  - distil-large-v3 was recently released, massively speeding up predictions at minimal quality loss. We've converted and optimized four distil models to use in WhisperKit on CoreML, and they're really fast!
    - `distil-large-v3`
    - `distil-large-v3_594MB`
    - `distil-large-v3_turbo`
    - `distil-large-v3_turbo_600MB`
  - Note that these do not yet have word timestamp alignment heads, so they can't be used with `wordTimestamps: true`
  - They can be run via the CLI as well:

    swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
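The decoding options mentioned in the highlights above can be combined in code roughly as follows. This is a hedged sketch: the option names (`withoutTimestamps`, `usePrefillPrompt`, `language`, `wordTimestamps`) come from these notes, but the `DecodingOptions` initializer shape and the `WhisperKit`/`transcribe` calls are assumptions and may differ from the actual API.

```swift
import WhisperKit

// Assumed option names per these release notes; argument order may differ.
let options = DecodingOptions(
    language: nil,            // nil on a multilingual model enables language detection
    usePrefillPrompt: false,  // language detection requires prefill prompt off
    withoutTimestamps: false, // keep the new timestamp logits filter enabled
    wordTimestamps: false     // distil models don't support word timestamps yet
)

// Hypothetical pipeline setup and transcription call.
let pipe = try await WhisperKit(model: "distil-large-v3_turbo_600MB")
let result = try await pipe.transcribe(audioPath: "your_audio.wav", decodeOptions: options)
print(result?.text ?? "")
```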
⚠️ Experimental new stream mode
We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.
Recommended settings for the best performance with this iteration are:
- Max tokens per loop < 100
- Max fallback count < 2
- Prompt and cache prefill true
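The recommended settings above map onto decoding options roughly as sketched below. The field names `sampleLength`, `temperatureFallbackCount`, and `usePrefillCache` are assumptions about the `DecodingOptions` struct (only `usePrefillPrompt` is named elsewhere in these notes), so treat this as illustrative rather than the exact API.

```swift
import WhisperKit

// Hypothetical mapping of the recommended eager-streaming settings.
let eagerOptions = DecodingOptions(
    temperatureFallbackCount: 2, // max fallback count < 2 per loop
    sampleLength: 100,           // max tokens per decoding loop < 100
    usePrefillPrompt: true,      // prompt prefill on
    usePrefillCache: true        // cache prefill on
)
```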
Looking for feedback on:
- Token confirmation numbers that work well
- Model, device, and settings combinations that work well
(Demo video: RPReplay_Final1711775397.MP4)
What's Changed
- CLI Task Handling in #85
- Added TimestampRulesFilter implementation by @jkrukowski in #45
- Support distil whisper models in #88
- Language Detection by @Abhinay1997 in #78
- Tokenizer refactor, tests cleanup by @jkrukowski in #87
- First token logProb thresholding by @jkrukowski in #90
- [#93] Add missing settings to decoding options by @cgfarmer4 in #94
- "Eager" streaming mode via word timestamps in #95
New Contributors
- @Abhinay1997 made their first contribution in #78
Full Changelog: v0.4.1...v0.5.0