
[Bug]: Embedding model's context length overflows while creating ObjectIndex from very wide tables (900+ columns with column comments) #16657

Open
cognetics-ai opened this issue Oct 23, 2024 · 2 comments
Labels
bug (Something isn't working) · triage (Issue needs to be triaged/prioritized)

Comments

@cognetics-ai

Bug Description

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

throws a 400 error from OpenAI's embedding endpoint because the serialized table schema exceeds the 8192-token limit. This happens with tables that are very wide and have detailed column descriptions.

Version

0.11.2

Steps to Reproduce

  1. Create a table with 900+ columns, each with a column description/comment
  2. obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex) (see the sketch below)

Follow the steps in:
https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo/#part-2-query-time-retrieval-of-tables-for-text-to-sql
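A minimal reproduction sketch following that guide; the connection string and table name are placeholders standing in for a database whose table has 900+ commented columns:

from sqlalchemy import create_engine
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

# Placeholder connection string and table name
engine = create_engine("postgresql://user:pass@localhost/db")
sql_database = SQLDatabase(engine, include_tables=["wide_table"])

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [SQLTableSchema(table_name="wide_table")]

# Fails here: the serialized schema sent to the embedding model exceeds 8192 tokens
obj_index = ObjectIndex.from_objects(
    table_schema_objs, table_node_mapping, VectorStoreIndex
)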

[Screenshot attached: error response, 2024-10-23 8:59 AM]

Relevant Logs/Tracebacks

No response

cognetics-ai added the bug and triage labels on Oct 23, 2024

dosubot bot commented Oct 23, 2024

Hey @cognetics-ai! 👋 I'm here to assist you with bugs, questions, and contributions. Let's tackle this issue together!

To address the issue of exceeding the token limit when creating an ObjectIndex from wide tables with many columns and detailed descriptions, you can consider the following strategies:

  1. Pre-split Data: Use a token splitter to divide the table schema into smaller, manageable chunks that fit within the token limit. LlamaIndex uses a simple token splitter that creates overlapping chunks, which you can customize to suit your needs [1].

  2. Simplify Descriptions: Reduce the verbosity of column descriptions. Truncate or summarize the descriptions to decrease the overall token count (a sketch follows this list).

  3. Selective Indexing: Consider indexing only the most critical columns or a subset of the table that is essential for your application. This can significantly reduce the token count.

  4. Quantized Models: Use quantized embedding models to handle large input data more efficiently. LlamaIndex supports this through the Optimum-Intel library, which provides smaller and faster models with minimal accuracy loss [2].

These approaches can help you manage the input size and avoid exceeding the token limits imposed by the OpenAI Embedding endpoint.
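A rough sketch of suggestions 1–2, assuming tiktoken is installed and reusing table_schema_objs and table_node_mapping from the reproduction above; the 6,000-token budget is an arbitrary example:

import tiktoken

# cl100k_base is the encoding used by OpenAI's current embedding models
enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_tokens(text: str, max_tokens: int = 6000) -> str:
    # Keep only the first max_tokens tokens of the description
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens]) if len(tokens) > max_tokens else text

for schema in table_schema_objs:
    if schema.context_str:
        schema.context_str = truncate_to_tokens(schema.context_str)

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

Note that SQLTableNodeMapping also serializes the live table schema itself, so if the column comments come from the database rather than from context_str, the text to trim would be that serialized schema instead.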

To continue talking to Dosu, mention @dosu.

@logan-markewich
Collaborator

I don't think there's going to be an easy solution to this one. Maybe an LLM should summarize the schema and use that for embeddings (I doubt every column is important here). Or some postprocessing step could remove columns before embedding.
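A rough sketch of how the summarize-then-embed idea could look, assuming the column descriptions are supplied via context_str; the model name, prompt, and loop here are illustrative only:

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # placeholder model choice

for schema in table_schema_objs:
    if schema.context_str:
        prompt = (
            "Summarize the following table description in under 200 words, "
            "keeping only the columns most useful for text-to-SQL retrieval:\n\n"
            + schema.context_str
        )
        # complete() returns a CompletionResponse; .text holds the summary
        schema.context_str = llm.complete(prompt).text

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)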
