[Bug]: Embedding model's context length overflows when creating an ObjectIndex from very wide tables (900+ columns with column comments)
#16657
Open
cognetics-ai opened this issue on Oct 23, 2024 · 2 comments
Hey @cognetics-ai! 👋 I'm here to assist you with bugs, questions, and contributions. Let's tackle this issue together!
To address the issue of exceeding the token limit when creating an ObjectIndex from wide tables with many columns and detailed descriptions, you can consider the following strategies:

1. **Pre-split data**: Use a token splitter to divide the table schema into smaller, manageable chunks that fit within the token limit. LlamaIndex uses a simple token splitter that creates overlapping chunks, which you can customize to suit your needs [1].
2. **Simplify descriptions**: Reduce the verbosity of column descriptions. Truncate or summarize the descriptions to decrease the overall token count.
3. **Selective indexing**: Index only the most critical columns, or a subset of the table that is essential for your application. This can significantly reduce the token count.
4. **Quantized models**: Use quantized embedding models to handle large input data more efficiently. LlamaIndex supports this through the Optimum-Intel library, which provides smaller and faster models with minimal accuracy loss [2].

These approaches can help you manage the input size and avoid exceeding the token limits imposed by the OpenAI Embedding endpoint.
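The "pre-split" and "simplify descriptions" ideas above can be sketched in plain Python. This is only an illustration, not LlamaIndex's actual splitter API: the ~4-characters-per-token ratio is a rough heuristic for OpenAI tokenizers, and `truncate_description` / `compact_schema` and their parameters are made-up names.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4


def truncate_description(desc: str, max_desc_tokens: int = 16) -> str:
    # Keep only roughly the first max_desc_tokens tokens of a column comment.
    budget_chars = max_desc_tokens * 4
    if len(desc) <= budget_chars:
        return desc
    return desc[:budget_chars].rstrip() + "..."


def compact_schema(columns: dict[str, str], token_limit: int = 8192) -> str:
    # Serialize {column_name: comment} pairs with truncated comments, and
    # stop adding columns once the serialized schema would exceed the limit.
    lines: list[str] = []
    for name, desc in columns.items():
        line = f"{name}: {truncate_description(desc)}"
        candidate = "\n".join(lines + [line])
        if estimate_tokens(candidate) > token_limit:
            break
        lines.append(line)
    return "\n".join(lines)
```

In practice you would use a real tokenizer (e.g. tiktoken) instead of the character heuristic, but the shape of the mitigation is the same: shrink each description, then cap the total.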
I don't think there's going to be an easy solution to this one. Maybe an LLM should summarize the schema and use that for embeddings (I doubt every column is important here). Or some postprocessing step could remove columns before embedding.
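The "remove columns before embeddings" idea could look something like the following sketch: a keyword-based filter that keeps only columns the application actually queries on. The function name, the keep-list approach, and the example schema are all invented for illustration; a real implementation might instead ask an LLM to pick the relevant columns.

```python
def prune_columns(schema: dict[str, str], keep_keywords: list[str]) -> dict[str, str]:
    # Keep only columns whose name or comment mentions one of the given
    # keywords; everything else is dropped before the schema is serialized
    # and sent to the embedding endpoint.
    kept: dict[str, str] = {}
    for name, comment in schema.items():
        text = f"{name} {comment}".lower()
        if any(kw in text for kw in keep_keywords):
            kept[name] = comment
    return kept


schema = {
    "order_id": "unique order identifier",
    "customer_email": "email of the purchasing customer",
    "internal_flag_17": "legacy ETL bookkeeping bit",
}
pruned = prune_columns(schema, ["order", "customer"])
print(sorted(pruned))  # ['customer_email', 'order_id']
```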
Bug Description

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

throws a 400 error against OpenAI's Embedding endpoint because the input exceeds 8192 tokens. This happens with tables that are very wide and have detailed column descriptions.

Version
0.11.2

Steps to Reproduce
Create a table with 900+ columns, each with a description, then run:

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

following the steps in:
https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo/#part-2-query-time-retrieval-of-tables-for-text-to-sql

Relevant Logs/Tracebacks
No response
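A back-of-the-envelope check shows why a 900-column table overflows the 8192-token embedding limit. The schema serialization below is hypothetical (made-up table and column names), and the ~4-characters-per-token ratio is only a rough heuristic for OpenAI tokenizers:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4


def build_schema_text(n_columns: int) -> str:
    # Hypothetical serialization: one line per column, comment included,
    # mimicking what a table-schema object might be flattened into.
    lines = [
        f"col_{i} VARCHAR -- detailed description of column {i}"
        for i in range(n_columns)
    ]
    return "Table: wide_table\n" + "\n".join(lines)


schema_text = build_schema_text(900)
print(estimate_tokens(schema_text) > 8192)  # True: well above the limit
```

Even with fairly short per-column comments, 900 columns put the serialized schema far past what a single embedding request can accept, so the whole object must be split, summarized, or pruned before embedding.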