v0.5.0
This is a HUGE release with some great new features and fixes!
Highlights
- Timestamp logits filter by @jkrukowski
  - Significantly increases the number of timestamp tokens in a particular window, which helps a lot with segmentation
  - On by default, but can be disabled with the decoding option `withoutTimestamps: true`
- Language detection by @Abhinay1997
  - New function on the `TextDecoding` protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
  - Enabled by default for decoding options where `usePrefillPrompt: false`, `language: nil`, and the model is not English-only
- First token log prob thresholds fallback check by @jkrukowski
  - This feature is not in the original OpenAI implementation, but it helps reduce hallucinations quite a bit
  - Fallbacks due to the log prob threshold are often immediately identifiable from the first token, so this check reduces the number of forward passes needed before moving to a higher temperature
- Distil Whisper support
  - distil-large-v3 was recently released, massively speeding up predictions at minimal quality loss. We've converted and optimized four distil models to use in WhisperKit on CoreML, and they're really fast!
    - `distil-large-v3`
    - `distil-large-v3_594MB`
    - `distil-large-v3_turbo`
    - `distil-large-v3_turbo_600MB`
  - Note that these do not yet have word timestamp alignment heads, so they can't be used with `wordTimestamps: true`
  - They can be run via the CLI as well:

    swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
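The decoding options mentioned in the highlights above can be combined in code roughly as follows. This is a hedged sketch: the option names (`withoutTimestamps`, `usePrefillPrompt`, `language`, `wordTimestamps`) come from these notes, but the `DecodingOptions` initializer shape and the `WhisperKit`/`transcribe` calls are assumptions and may differ from the actual API.

```swift
import WhisperKit

// Assumed option names per these release notes; argument order may differ.
let options = DecodingOptions(
    language: nil,            // nil on a multilingual model enables language detection
    usePrefillPrompt: false,  // language detection requires prefill prompt off
    withoutTimestamps: false, // keep the new timestamp logits filter enabled
    wordTimestamps: false     // distil models don't support word timestamps yet
)

// Hypothetical pipeline setup and transcription call.
let pipe = try await WhisperKit(model: "distil-large-v3_turbo_600MB")
let result = try await pipe.transcribe(audioPath: "your_audio.wav", decodeOptions: options)
print(result?.text ?? "")
```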
⚠️ Experimental new stream mode
We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.
Recommended settings for the best performance with this iteration are:
- Max tokens per loop < 100
- Max fallback count < 2
- Prompt and cache prefill true
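The recommended settings above map onto decoding options roughly as sketched below. The field names `sampleLength`, `temperatureFallbackCount`, and `usePrefillCache` are assumptions about the `DecodingOptions` struct (only `usePrefillPrompt` is named elsewhere in these notes), so treat this as illustrative rather than the exact API.

```swift
import WhisperKit

// Hypothetical mapping of the recommended eager-streaming settings.
let eagerOptions = DecodingOptions(
    temperatureFallbackCount: 2, // max fallback count < 2 per loop
    sampleLength: 100,           // max tokens per decoding loop < 100
    usePrefillPrompt: true,      // prompt prefill on
    usePrefillCache: true        // cache prefill on
)
```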
Looking for feedback on:
- Token confirmation numbers that work well
- Model, device, and settings combinations that work well
(Demo video: RPReplay_Final1711775397.MP4)
What's Changed
- CLI Task Handling in #85
- Added TimestampRulesFilter implementation by @jkrukowski in #45
- Support distil whisper models in #88
- Language Detection by @Abhinay1997 in #78
- Tokenizer refactor, tests cleanup by @jkrukowski in #87
- First token logProb thresholding by @jkrukowski in #90
- [#93] Add missing settings to decoding options by @cgfarmer4 in #94
- "Eager" streaming mode via word timestamps in #95
New Contributors
- @Abhinay1997 made their first contribution in #78
Full Changelog: v0.4.1...v0.5.0