[Bug]: Query failure with Document Summary Index using Azure AI Search and Cosmos DB #17304

zhongshuai-cao · 2024-12-17T21:47:41Z

Bug Description

Thank you for taking the time to review this issue! Our team is actively developing around Azure Cloud, and we have found LlamaIndex to be a fantastic framework for building RAG-based AI applications.

When implementing the Document Summary Index with Azure AI Search and Cosmos DB as part of a fully Azure-based setup, we encounter an issue during querying. The index builds successfully, but querying fails with a KeyError regarding a missing mapping from document_id to node_id.

Here is what we observed:

Before moving to Azure: Querying works correctly in a local persistent storage context.

In Azure: When using Azure AI Search as the vector store, the index_store does not seem to be used. Instead, the document summary mapping is stored in the docstore with type='document_summary', while other entries have type='1' (TextNode).

If the docstore is queried to retrieve all documents, it fails due to the presence of document_summary entries that should ideally reside in the index_store.

This behavior seems specific to how Azure AI Search handles the vector store and storage for the Document Summary Index.
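One way to confirm the observation above is to group the stored entries by their type field. The following is only a minimal sketch: it assumes the docstore rows have already been fetched as plain dictionaries, and the `rows` sample data and the `partition_by_type` helper are hypothetical names, not part of LlamaIndex or the Azure SDK.

```python
from collections import defaultdict


def partition_by_type(rows):
    """Group row ids by their 'type' field (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        # Rows without a type field are collected under a placeholder key.
        groups[row.get("type", "<missing>")].append(row["id"])
    return dict(groups)


# Hypothetical rows; the type values mirror the ones reported above:
# '1' for TextNode entries and 'document_summary' for the summary mapping.
rows = [
    {"id": "node-1", "type": "1"},
    {"id": "node-2", "type": "1"},
    {"id": "summary-index", "type": "document_summary"},
]

groups = partition_by_type(rows)
print(groups)
```

Any ids listed under document_summary are the entries that would be expected in the index_store rather than the docstore.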

Additional Context

The issue does not occur when using a local persistent storage setup.

The behavior appears specific to Azure AI Search as the vector store.

The docstore contains entries of type document_summary, which should ideally reside in the index_store.

Version

0.12.5

Steps to Reproduce

Instantiate docstore, index_store, and vector_store using Azure services.

Index the documents.

Construct the Document Summary Index with the same parameters used for querying.

Run a query using query_engine.query(question).

Additionally, attempt to retrieve all documents from the docstore.

Code Example

# Imports assumed from the corresponding llama-index integration packages.
from llama_index.core import DocumentSummaryIndex, StorageContext
from llama_index.storage.docstore.azure import AzureDocumentStore
from llama_index.storage.index_store.azure import AzureIndexStore
from llama_index.storage.kvstore.azure.base import ServiceMode
from llama_index.vector_stores.azureaisearch import (
    AzureAISearchVectorStore,
    IndexManagement,
)

# Excerpt from a class method: the self.* attributes, the connection string,
# namespace, documents, and question are defined elsewhere.
docstore = AzureDocumentStore.from_connection_string(
    connection_string=cosmos_table_connection_string,
    namespace=namespace,
    service_mode=ServiceMode.STORAGE,
    partition_key=self.index_name,  # use index name as partition key
)

index_store = AzureIndexStore.from_connection_string(
    connection_string=cosmos_table_connection_string,
    namespace=namespace,
    service_mode=ServiceMode.STORAGE,
    partition_key=self.index_name,  # use index name as partition key
)

index_vector_store = AzureAISearchVectorStore(
    search_or_index_client=self.index_client,
    # filterable_metadata_field_keys=metadata_fields,
    index_name=self.index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=self.embeddings_dimension,
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
    # compression_type="binary"  # "scalar" or "binary"; compression is only supported for HNSW
)

# Pass the stores created above (variable names made consistent with their definitions).
storage_context = StorageContext.from_defaults(
    docstore=docstore,
    index_store=index_store,
    vector_store=index_vector_store,
)

index = DocumentSummaryIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    llm=self.llm,
    embed_model=self.embed_model,
)
query_engine = index.as_query_engine()

response = query_engine.query(question)

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "<user-path>/manual_test_document_summary_index_indexing.py", line 31, in <module>
    response = query_engine.query(question)
  File "<llama-index-path>/instrumentation/dispatcher.py", line 321, in wrapper
    result = func(*args, **kwargs)
  File "<llama-index-path>/base_query_engine.py", line 52, in query
    query_result = self._query(str_or_query_bundle)
  File "<llama-index-path>/instrumentation/dispatcher.py", line 321, in wrapper
    result = func(*args, **kwargs)
  File "<llama-index-path>/retriever_query_engine.py", line 178, in _query
    nodes = self.retrieve(query_bundle)
  File "<llama-index-path>/retriever_query_engine.py", line 133, in retrieve
    nodes = self._retriever.retrieve(query_bundle)
  File "<llama-index-path>/instrumentation/dispatcher.py", line 321, in wrapper
    result = func(*args, **kwargs)
  File "<llama-index-path>/base_retriever.py", line 245, in retrieve
    nodes = self._retrieve(query_bundle)
  File "<llama-index-path>/instrumentation/dispatcher.py", line 321, in wrapper
    result = func(*args, **kwargs)
  File "<llama-index-path>/document_summary/retrievers.py", line 185, in _retrieve
    node_ids = self._index_struct.summary_id_to_node_ids[summary_id]
KeyError: '358720ff-2d3b-4d1a-9f00-6eb137ff8ed0'

dosubot · 2024-12-17T21:49:25Z

The issue you're encountering seems to be related to how the DocumentSummaryIndex interacts with Azure AI Search and Cosmos DB in the LlamaIndex framework. Specifically, the problem arises from a missing mapping between document_id and node_id, which is crucial for querying.

Here are some points to consider for troubleshooting and resolving this issue:

Mapping Storage: The DocumentSummaryIndex keeps its mappings in the index struct: doc_id_to_summary_id maps each document ID to its summary ID, and summary_id_to_node_ids maps each summary ID to the node IDs it covers. The KeyError in the traceback is raised on summary_id_to_node_ids, so ensure that this mapping is correctly populated and accessible during both the indexing and querying phases [1].
Docstore and Index Store: It seems that the docstore is being used to store entries of type document_summary, which might be causing the issue. Ideally, these entries should be managed by the index_store. Verify that the index_store is correctly configured and used for storing document summaries.
Configuration Consistency: Double-check the configuration of your AzureDocumentStore, AzureIndexStore, and AzureAISearchVectorStore. Ensure that they are correctly set up to handle the storage and retrieval of document summaries and indices. Pay attention to parameters like partition_key and service_mode to ensure they align with your setup.
Debugging the KeyError: The KeyError indicates that a summary_id is missing from the mapping. You might want to add logging or debugging statements to track the creation and retrieval of these mappings. This can help identify where the mapping is failing.
Local vs. Azure Setup: Since the issue does not occur in a local setup, compare the configurations and data flow between your local and Azure environments. There might be differences in how data is stored or retrieved that could lead to this issue.

By focusing on these areas, you should be able to identify and resolve the problem with the missing mappings in your Azure-based setup.
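As a rough illustration of the debugging suggestion above, the check below lists which retrieved summary IDs have no entry in the mapping before the lookup fails. It is a sketch with plain dictionaries, not the library's retriever code: `find_missing_summary_ids` and the sample data are hypothetical, and in a real session the mapping would presumably be read from the index struct's summary_id_to_node_ids attribute (the name that appears in the traceback).

```python
def find_missing_summary_ids(retrieved_summary_ids, summary_id_to_node_ids):
    """Return the retrieved summary ids that have no entry in the mapping.

    Order is preserved so the result lines up with the retrieval ranking.
    """
    return [
        summary_id
        for summary_id in retrieved_summary_ids
        if summary_id not in summary_id_to_node_ids
    ]


# Hypothetical data: one summary id is known to the mapping, and the id from
# the traceback is not.
summary_id_to_node_ids = {"summary-a": ["node-1", "node-2"]}
retrieved = ["summary-a", "358720ff-2d3b-4d1a-9f00-6eb137ff8ed0"]

missing = find_missing_summary_ids(retrieved, summary_id_to_node_ids)
print(f"{len(missing)} of {len(retrieved)} retrieved summary ids are unmapped: {missing}")
```

If every retrieved ID turns out to be missing, that would suggest the index struct loaded at query time is not the one populated during indexing; if only some are missing, stale entries in the vector store are a more likely cause. Both readings are hypotheses to test, not confirmed diagnoses.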


zhongshuai-cao added the labels bug ("Something isn't working") and triage ("Issue needs to be triaged/prioritized") on Dec 17, 2024.
