Fast alternative to text tokenization with SimpleTokenizer #755

We've been working on a re-implementation of the original OpenAI text tokenizer (`SimpleTokenizer`) in Rust, with bindings for Python, called instant-clip-tokenizer. In our benchmarks it is around 70x faster than the current Python implementation.

Are you interested in mentioning this library in your Readme as an alternative to the `SimpleTokenizer` included in this repository? If yes I'm happy to send in a PR!
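For readers who want to try it, a minimal usage sketch follows. It assumes only the `Tokenizer` class and `encode` method that appear in the benchmark comment further down this thread; the exact API may differ.

```python
# Minimal sketch of swapping instant-clip-tokenizer in for SimpleTokenizer.
# Assumes the Python bindings expose a Tokenizer class with an encode()
# method, as used in the benchmark comment below in this thread.
from instant_clip_tokenizer import Tokenizer

tokenizer = Tokenizer()
token_ids = tokenizer.encode("a photo of a cat")  # token ids for the input text
print(token_ids)
```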
Comments
I wonder how it compares with `CLIPTokenizerFast`? I've been using transformers' CLIP tokenizer as a replacement for `SimpleTokenizer`.
I just ran a short benchmark, on my machine it is 47x faster for encoding than the Rust-based `CLIPTokenizerFast`:

```python
from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

INPUT = "If yes I'm happy to send in a PR!"  # some random sentence :)

%timeit tokenizer_fast.encode(INPUT, add_special_tokens=False)
# -> 80.1 µs ± 626 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit tokenizer_instant.encode(INPUT)
# -> 1.7 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
Cool! Good to know about this faster tokenizer! Have you tried batch tokenization? However, in my use cases, tokenization is not a bottleneck when training CLIP-like models.
We mostly care about tokenization performance for single inputs (we use it for inference). Nevertheless, we also provide a batch-tokenization method.
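A sketch of how a batch comparison might look follows; since the method name was cut off in the comment above, `tokenize_batch` below is an assumed placeholder, not a confirmed API.

```python
# Hypothetical batch benchmark; tokenize_batch is an assumed placeholder
# because the actual method name was lost from the comment above.
from transformers import CLIPTokenizerFast
from instant_clip_tokenizer import Tokenizer as InstantTokenizer

tokenizer_fast = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch16")
tokenizer_instant = InstantTokenizer()

texts = ["a photo of a cat", "a photo of a dog"] * 16  # batch of 32 strings

%timeit tokenizer_fast(texts, add_special_tokens=False)
%timeit tokenizer_instant.tokenize_batch(texts)  # assumed API
```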