
Token indices sequence length #856

Open
saboor2632 opened this issue Dec 29, 2024 · 5 comments
Labels: bug (Something isn't working)

@saboor2632

I am facing this issue whenever I run my code to scrape the BBC or any other site.
Error:
Token indices sequence length is longer than the specified maximum sequence length for this model (5102 > 1024). Running this sequence through the model will result in indexing errors

It does not give me complete results.

@Qunlexie

Qunlexie commented Jan 2, 2025

I have the same issue as well, especially when working with Ollama Llama models.

@VinciGit00
Collaborator

For big websites you should use OpenAI.
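For reference, a minimal sketch of an OpenAI-backed graph config, assuming the SmartScraperGraph API shown in the project README; the exact model name, config keys, and the placement of model_tokens may differ between versions:

```python
# A minimal sketch, assuming the SmartScraperGraph API from the README;
# API key and model name are placeholders, not the project's recommended values.
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
        "model": "openai/gpt-4o-mini",     # any OpenAI chat model
        # "model_tokens": 128000,          # context-size override (assumed key, see discussion below)
    },
    "verbose": True,
}

graph = SmartScraperGraph(
    prompt="List the article titles on the page",
    source="https://www.bbc.com/news",
    config=graph_config,
)
print(graph.run())
```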

github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0-beta.4](v1.35.0-beta.3...v1.35.0-beta.4) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0](v1.34.2...v1.35.0) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
* ⛏️ enhanced contribution and precommit added ([fcbfe78](fcbfe78))
* add codequality workflow ([4380afb](4380afb))
* add timeout and retry_limit in loader_kwargs ([#865](#865) [#831](#831)) ([21147c4](21147c4))
* serper api search ([1c0141f](1c0141f))

### Bug Fixes

* browserbase integration ([752a885](752a885))
* local html handling ([2a15581](2a15581))

### CI

* **release:** 1.34.2-beta.1 [skip ci] ([f383e72](f383e72)), closes [#861](#861) [#861](#861)
* **release:** 1.34.2-beta.2 [skip ci] ([93fd9d2](93fd9d2))
* **release:** 1.34.3-beta.1 [skip ci] ([013a196](013a196)), closes [#861](#861) [#861](#861)
* **release:** 1.35.0-beta.1 [skip ci] ([c5630ce](c5630ce)), closes [#865](#865) [#831](#831)
* **release:** 1.35.0-beta.2 [skip ci] ([f21c586](f21c586))
* **release:** 1.35.0-beta.3 [skip ci] ([cb54d5b](cb54d5b))
* **release:** 1.35.0-beta.4 [skip ci] ([6e375f5](6e375f5)), closes [#810](#810) [#856](#856) [#853](#853)
@PeriniM
Collaborator

PeriniM commented Jan 6, 2025

@Qunlexie @saboor2632 there is indeed an issue with the method used to calculate chunks for Ollama models, in tokenizer_ollama.py. We are using get_num_tokens from the LangChain lib, but it always limits the model to 1024 tokens. Even if we change the number of tokens in the graph config via the model_tokens param, it can only be set to 1024. It would be better to calculate the number of chunks ourselves, for example using the OpenAI tokenizer from tiktoken, and approximate the number of tokens per chunk. WDYT?
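A minimal sketch of what "calculate the number of chunks ourselves" could look like, assuming tiktoken as the approximate tokenizer; the function and variable names here are illustrative, not the project's actual code, and for Ollama models the counts are only approximations since tiktoken implements OpenAI's tokenizers:

```python
# Sketch: count tokens with tiktoken and split scraped text into chunks that fit
# the configured context window, instead of relying on LangChain's get_num_tokens.
import tiktoken

def split_into_chunks(text: str, max_tokens: int, encoding_name: str = "cl100k_base") -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    token_ids = enc.encode(text)
    # Decode back to text in windows of at most max_tokens tokens.
    return [
        enc.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

# Example: honor a model_tokens setting of 8192 instead of the 1024 ceiling.
chunks = split_into_chunks("very long scraped page text ...", max_tokens=8192)
```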

@Qunlexie

Qunlexie commented Jan 10, 2025

> We are using get_num_tokens from the LangChain lib, but it always limits the model to 1024 tokens. Even if we change the number of tokens in the graph config via the model_tokens param, it can only be set to 1024.

Is this a bug that should be raised with LangChain? I believe the token limit is the cause of the issue where only part of the web page is retrieved rather than all of it.

> It would be better to calculate the number of chunks ourselves, for example using the OpenAI tokenizer from tiktoken, and approximate the number of tokens per chunk. WDYT?

How would this work? Do you have a practical example of how this would work with ScrapeGraphAI? I really believe that getting Ollama to work properly is key for open source.

Happy to get your thoughts

@Pal-dont-want-to-work

I am facing the same problem; hoping the author can fix it:
Token indices sequence length is longer than the specified maximum sequence length for this model (1548 > 1024). Running this sequence through the model will result in indexing errors
