Add median token length as limit #47

stephantul · 2024-09-29T11:21:01Z

This PR adds a speed optimization:

For longer texts, tokenization can take up 90-95% of our time, because we tokenize the entire text. However, we usually only keep the first N (usually 512) tokens of a text, so it makes sense to tokenize only roughly the first N tokens' worth of text. We therefore truncate each text to N * median(token_lengths) characters before tokenizing. A test on Wikipedia shows that this doesn't lead to any unnecessary truncation, i.e., it never truncated a text to fewer than 512 tokens.
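As a rough illustration of the idea described above (not the code in this PR), the truncation could be sketched as follows. The function names, the `vocabulary` argument, and the assumption that token lengths are measured in characters are all hypothetical and chosen only for this example.

```python
from statistics import median
from typing import List, Sequence


def median_token_length(vocabulary: Sequence[str]) -> int:
    """Median length, in characters, of the tokens in a vocabulary (hypothetical helper)."""
    return int(median(len(token) for token in vocabulary))


def truncate_for_tokenization(
    texts: Sequence[str], vocabulary: Sequence[str], max_tokens: int = 512
) -> List[str]:
    """Cut each text to max_tokens * median token length characters.

    Assumption: a text of that many characters usually still yields at
    least max_tokens tokens, so the tokenizer sees much less input
    without losing tokens that would have been kept anyway.
    """
    char_limit = max_tokens * median_token_length(vocabulary)
    # Slicing is a no-op for texts that are already shorter than the limit.
    return [text[:char_limit] for text in texts]
```

The truncated texts would then be passed to the tokenizer as usual, with the normal N-token limit still applied afterwards. Because the character limit is only a heuristic, it is worth checking on real data that the limit is not too tight, as the Wikipedia test mentioned above does.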

codecov · 2024-09-29T11:21:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

| Files with missing lines | Coverage Δ | |
|---|---|---|
| model2vec/model.py | `95.60% <100.00%> (+0.20%)` | ⬆️ |

Pringled

Nice 🚤

Add median token length as limit

6cd995f

Pringled approved these changes Sep 29, 2024


stephantul merged commit 9a887a3 into main Sep 29, 2024
4 checks passed

stephantul deleted the fix_median_token_length branch September 29, 2024 11:35
