
[Bug]: Embedding model's context length overflows while creating ObjectIndex from very wide tables (900+ columns with column comments) #16657

Open
cognetics-ai opened this issue Oct 23, 2024 · 2 comments
Labels
bug (Something isn't working) · triage (Issue needs to be triaged/prioritized)

Comments

@cognetics-ai

Bug Description

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

throws a 400 error from OpenAI's embedding endpoint because the serialized table schema exceeds the 8192-token limit. This happens with tables that are very wide and have detailed column descriptions.

Version

0.11.2

Steps to Reproduce

  1. Create a table with 900+ columns, each with a column description/comment
  2. obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex) (see the sketch below)

Follow the steps in:
https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo/#part-2-query-time-retrieval-of-tables-for-text-to-sql
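A minimal reproduction sketch following that guide; the connection string and table name are placeholders standing in for a database whose table has 900+ commented columns:

from sqlalchemy import create_engine
from llama_index.core import SQLDatabase, VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

# Placeholder connection string and table name
engine = create_engine("postgresql://user:pass@localhost/db")
sql_database = SQLDatabase(engine, include_tables=["wide_table"])

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [SQLTableSchema(table_name="wide_table")]

# Fails here: the serialized schema sent to the embedding model exceeds 8192 tokens
obj_index = ObjectIndex.from_objects(
    table_schema_objs, table_node_mapping, VectorStoreIndex
)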

[Screenshot attached: error response, 2024-10-23 8:59 AM]

Relevant Logs/Tracebacks

No response

cognetics-ai added the bug and triage labels on Oct 23, 2024

dosubot bot commented Oct 23, 2024

Hey @cognetics-ai! 👋 I'm here to assist you with bugs, questions, and contributions. Let's tackle this issue together!

To address the issue of exceeding the token limit when creating an ObjectIndex from wide tables with many columns and detailed descriptions, you can consider the following strategies:

  1. Pre-split Data: Use a token splitter to divide the table schema into smaller, manageable chunks that fit within the token limit. LlamaIndex uses a simple token splitter that creates overlapping chunks, which you can customize to suit your needs [1].

  2. Simplify Descriptions: Reduce the verbosity of column descriptions. Truncate or summarize the descriptions to decrease the overall token count (a sketch follows this list).

  3. Selective Indexing: Consider indexing only the most critical columns or a subset of the table that is essential for your application. This can significantly reduce the token count.

  4. Quantized Models: Use quantized embedding models to handle large input data more efficiently. LlamaIndex supports this through the Optimum-Intel library, which provides smaller and faster models with minimal accuracy loss [2].

These approaches can help you manage the input size and avoid exceeding the token limits imposed by the OpenAI Embedding endpoint.
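A rough sketch of suggestions 1–2, assuming tiktoken is installed and reusing table_schema_objs and table_node_mapping from the reproduction above; the 6,000-token budget is an arbitrary example:

import tiktoken

# cl100k_base is the encoding used by OpenAI's current embedding models
enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_tokens(text: str, max_tokens: int = 6000) -> str:
    # Keep only the first max_tokens tokens of the description
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens]) if len(tokens) > max_tokens else text

for schema in table_schema_objs:
    if schema.context_str:
        schema.context_str = truncate_to_tokens(schema.context_str)

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

Note that SQLTableNodeMapping also serializes the live table schema itself, so if the column comments come from the database rather than from context_str, the text to trim would be that serialized schema instead.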

To continue talking to Dosu, mention @dosu.

@logan-markewich
Collaborator

I don't think there's going to be an easy solution to this one. Maybe an LLM should summarize the schema and use that for embeddings (I doubt every column is important here). Or some postprocessing step could remove columns before embedding.
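A rough sketch of how the summarize-then-embed idea could look, assuming the column descriptions are supplied via context_str; the model name, prompt, and loop here are illustrative only:

from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")  # placeholder model choice

for schema in table_schema_objs:
    if schema.context_str:
        prompt = (
            "Summarize the following table description in under 200 words, "
            "keeping only the columns most useful for text-to-SQL retrieval:\n\n"
            + schema.context_str
        )
        # complete() returns a CompletionResponse; .text holds the summary
        schema.context_str = llm.complete(prompt).text

obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)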
