
v0.5.0

@ZachNagengast released this 30 Mar 08:38

This is a HUGE release with some great new features and fixes πŸ™Œ

Highlights

  • Timestamp logits filter by @jkrukowski
    • Significantly increases the number of timestamp tokens in a given window, which helps a lot with segmentation
    • This is on by default but can be disabled with the decoding option withoutTimestamps: true (see the options sketch after this list)
  • Language detection by @Abhinay1997
    • New function on the TextDecoding protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
    • Enabled by default when the decoding options have usePrefillPrompt: false and language: nil, and the model is not English-only (see the options sketch after this list)
  • First token log prob thresholds fallback check by @jkrukowski
    • This feature is not in the original OpenAI implementation, but it helps reduce hallucinations quite a bit.
    • Often, fallbacks due to the log prob threshold are immediately identifiable from the first token, so this reduces the number of forward passes needed before moving to a higher temperature (see the conceptual sketch after this list)
  • Distil-Whisper support
    • distil-large-v3 was recently released and massively speeds up predictions with minimal quality loss. We've converted and optimized 4 distil models to use in WhisperKit on CoreML, and they're really fast! (See the loading sketch after this list.)
    • distil-large-v3
    • distil-large-v3_594MB
    • distil-large-v3_turbo
    • distil-large-v3_turbo_600MB
    • Note that these do not yet have word timestamp alignment heads, so they can't be used with wordTimestamps: true
    • They can be run via the CLI as well:
      • swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
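The two new decoding behaviors above (timestamp filtering and language detection) are controlled from the same options value. Here is a minimal sketch, assuming the DecodingOptions field names and the transcribe call shape match this release's API:

```swift
import WhisperKit

// Minimal sketch of the new decoding options (field names assumed
// to match this release's DecodingOptions).
let options = DecodingOptions(
    language: nil,            // nil on a multilingual model lets language detection run
    usePrefillPrompt: false,  // prefill prompt must be off for automatic language detection
    withoutTimestamps: false  // the timestamp logits filter is on by default; true disables it
)

let pipe = try await WhisperKit(model: "large-v3")
let result = try await pipe.transcribe(audioPath: "your_audio.wav", decodeOptions: options)
print(result?.text ?? "")
```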
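To make the first-token check concrete, here is a purely conceptual sketch of the idea (illustrative only, not WhisperKit's actual code):

```swift
// Conceptual sketch only: the function and threshold names here are
// illustrative, not the library's API.
func shouldFallbackEarly(firstTokenLogProb: Float, threshold: Float = -1.5) -> Bool {
    // A very unlikely first token usually predicts a window that would fail
    // the full log prob threshold anyway, so retry at a higher temperature
    // now instead of decoding the rest of the window first.
    return firstTokenLogProb < threshold
}
```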
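The distil models can also be loaded from Swift. A minimal sketch, assuming the model name resolves the same way as the --model-prefix/--model pair in the CLI example above:

```swift
import WhisperKit

// Sketch: the model name here is an assumption based on the CLI example above.
let pipe = try await WhisperKit(model: "distil-large-v3_turbo_600MB")

// No word timestamp alignment heads yet, so keep wordTimestamps off.
let options = DecodingOptions(wordTimestamps: false)
let result = try await pipe.transcribe(audioPath: "your_audio.wav", decodeOptions: options)
print(result?.text ?? "")
```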

⚠️ Experimental new stream mode

We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight, or take a look at the code and let us know how it can be improved.

Recommended settings for the best performance in this iteration (see the sketch after this list):

  • Max tokens per loop < 100
  • Max fallback count < 2
  • Prompt and cache prefill true
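As a rough sketch of those settings in code, assuming they map to the DecodingOptions fields named below:

```swift
// Sketch of the recommended eager-streaming settings; the mapping of each
// UI setting to these DecodingOptions fields is an assumption.
let streamingOptions = DecodingOptions(
    temperatureFallbackCount: 1, // max fallback count < 2
    sampleLength: 99,            // max tokens per loop < 100
    usePrefillPrompt: true,      // prompt prefill enabled
    usePrefillCache: true        // cache prefill enabled
)
```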

Looking for feedback on:

  • Token confirmation numbers that work well
  • Model, device, and settings combinations that work well

What's Changed

New Contributors

Full Changelog: v0.4.1...v0.5.0