
feat: Added references to answers! #103

Merged · 6 commits · Dec 2, 2024
Conversation

amindadgar (Member) commented on Dec 2, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new class PrepareAnswerSources for enhanced source URL formatting based on score thresholds.
    • Added a CustomSubQuestionQueryEngine to improve sub-question handling with custom querying logic.
    • Updated query handling to return additional context (references) alongside responses.
  • Bug Fixes

    • Enhanced error handling and logging for various query-related functions.
  • Tests

    • Added unit tests for the PrepareAnswerSources class to ensure functionality across multiple scenarios.
  • Refactor

    • Updated response structures across various query engines to include detailed information in return values.
    • Adjusted the ask subscriber method to include references in the response payload.
    • Modified connection handling in the MongoDB setup for improved logging and configuration.

coderabbitai bot (Contributor) commented on Dec 2, 2024

Walkthrough

The pull request modifies multiple files to improve query handling and response structures. Key changes include new classes and methods, updated method signatures and return types, and unit tests for the PrepareAnswerSources class. Control flow was adjusted in several places to support new data retrieval mechanisms and to return more comprehensive responses, particularly when querying and processing data sources.

Changes

  • routers/amqp.py: Updated ask method to retrieve response and references; introduced PrepareAnswerSources.
  • subquery.py: Replaced SubQuestionQueryEngine with CustomSubQuestionQueryEngine; updated query_multiple_source return type.
  • tests/unit/test_prepare_answer_sources.py: Added unit tests for PrepareAnswerSources, covering various scenarios and edge cases.
  • utils/mongo.py: Modified MongoDB connection URI to include ?directConnection=true; added logging for connections.
  • utils/query_engine/__init__.py: Added import for CustomSubQuestionQueryEngine.
  • utils/query_engine/dual_qdrant_retrieval_engine.py: Changed return types of several methods to Response.
  • utils/query_engine/level_based_platform_query_engine.py: Updated custom_query method to return Response instead of a string.
  • utils/query_engine/prepare_answer_sources.py: Introduced PrepareAnswerSources class with prepare_answer_sources method.
  • utils/query_engine/subquestion_engine.py: Added CustomSubQuestionQueryEngine class with custom querying logic.
  • worker/tasks.py: Updated ask_question_auto_search and query_data_sources to return additional references.

Poem

In the burrow deep, where the data flows,
New engines hum and the query grows.
With references bright and answers clear,
We hop with joy, for change is near!
So gather 'round, let the testing commence,
For every hop brings a new experience! 🐇✨

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (12)
utils/query_engine/subquestion_engine.py (2)

12-12: Remove duplicate import of 'Sequence'

The Sequence type is imported twice: at line 1 and again at line 12. Please remove the duplicate import at line 12 to avoid redundancy.

Apply this diff to remove the redundant import:

-from typing import List, Optional, Sequence, cast
+from typing import List, Optional, cast
🧰 Tools
🪛 Ruff (0.8.0)

12-12: Redefinition of unused Sequence from line 1

Remove definition: Sequence

(F811)


77-80: Unnecessary override of query method without modification

The query method is overridden without any additional logic—it simply calls the superclass method. If there is no custom behavior, consider removing this override to avoid unnecessary code.

Apply this diff to remove the redundant method:

-    def query(
-        self, str_or_query_bundle: str | QueryBundle
-    ) -> tuple[RESPONSE_TYPE, list[NodeWithScore]]:
-        return super().query(str_or_query_bundle)
utils/query_engine/dual_qdrant_retrieval_engine.py (1)

41-41: Add return type annotation to custom_query method

Since the custom_query method now returns a Response object, it's recommended to add a return type annotation for clarity and consistency.

Apply this diff to update the method signature:

-    def custom_query(self, query_str: str):
+    def custom_query(self, query_str: str) -> Response:
worker/tasks.py (1)

Line range hint 18-31: Update return type and docstring for ask_question_auto_search function

The function ask_question_auto_search now returns a dictionary that includes references, which is a list of NodeWithScore objects. Please update the return type annotation and the docstring to reflect this change, ensuring that the return type is accurate.

Apply this diff to update the function signature:

 def ask_question_auto_search(
     community_id: str,
     query: str,
- ) -> dict[str, str]:
+ ) -> dict[str, Any]:

And update the docstring accordingly.

utils/query_engine/prepare_answer_sources.py (3)

7-8: Consider adding validation for threshold value

The threshold parameter should be validated to ensure it's between 0 and 1, as it represents a confidence score.

 def __init__(self, threshold: float = 0.7) -> None:
+    if not 0 <= threshold <= 1:
+        raise ValueError("Threshold must be between 0 and 1")
     self.threshold = threshold
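Put together, the suggested validation might look like the following self-contained sketch. Note this is only an illustration: the real PrepareAnswerSources class has more behavior than shown here.

```python
class PrepareAnswerSources:
    """Minimal sketch of the constructor with the suggested range check."""

    def __init__(self, threshold: float = 0.7) -> None:
        # The threshold represents a confidence score, so reject values
        # outside [0, 1] early instead of silently filtering everything.
        if not 0 <= threshold <= 1:
            raise ValueError("Threshold must be between 0 and 1")
        self.threshold = threshold
```

Failing fast here surfaces configuration mistakes at construction time rather than as confusingly empty answer sources later.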

40-44: Consider deduplicating URLs

URLs from different nodes might point to the same source. Consider using a set to deduplicate URLs.

-            urls = [
+            urls = list(set(
                 node.metadata.get("url")
                 for node in tool_nodes.sources
                 if node.score >= self.threshold and node.metadata.get("url") is not None
-            ]
+            ))
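One caveat with the set() approach above: it discards the original ordering of the sources. An order-preserving alternative, sketched here with a hypothetical helper rather than the project's actual code:

```python
def dedupe_urls(urls: list[str]) -> list[str]:
    # dict.fromkeys drops duplicates while keeping first-seen order,
    # unlike set(), which would scramble the URLs.
    return list(dict.fromkeys(urls))
```

This matters if the URLs are meant to be presented to users in relevance order.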

52-57: Improve error message clarity

The error message could be more descriptive about what action the user should take.

             logging.error(
-                f"All node scores are below threshold. threshold: {self.threshold}"
-                ". returning empty string!"
+                f"No sources found with confidence score >= {self.threshold}. "
+                "Consider lowering the threshold if sources are expected."
             )
routers/amqp.py (1)

47-47: Consider separating response and references in the response model

Concatenating response and references with newlines makes it harder for clients to process them separately.

Consider updating the ResponseModel to handle references as a separate field:

-                response=ResponseModel(message=f"{response}\n\n{answer_reference}"),
+                response=ResponseModel(
+                    message=response,
+                    references=answer_reference
+                ),
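A minimal stdlib sketch of the separated-fields idea; ResponsePayload and its field names are illustrative stand-ins, not the project's actual ResponseModel:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponsePayload:
    # Keeping references in their own field lets clients render or drop
    # them without parsing them back out of the answer text.
    message: str
    references: Optional[str] = None
```

With this shape, a client that only wants the answer can ignore `references` entirely, and one that wants citations never has to split on newlines.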
tests/unit/test_prepare_answer_sources.py (2)

8-164: Consider adding more edge cases to strengthen test coverage.

The existing test cases are well-structured and comprehensive. Consider adding these additional test cases to improve coverage:

  1. Test with malformed URLs
  2. Test with extremely long URLs
  3. Test with special characters in URLs
  4. Test with duplicate URLs across different tools

165-212: Consider verifying error logging for edge cases.

While the edge cases for None/missing URLs are well-handled, consider adding assertions to verify that appropriate warnings or debug logs are generated when encountering these cases.

utils/query_engine/level_based_platform_query_engine.py (2)

Line range hint 236-350: Consider refactoring _prepare_context_str for better maintainability.

The context preparation logic is complex with multiple nested conditions. Consider:

  1. Extracting the summary node fetching logic into a separate method
  2. Creating helper methods for node grouping and context string preparation
  3. Adding more detailed error handling and logging for each path

Example refactoring:

  def _prepare_context_str(self, raw_nodes: list[NodeWithScore], summary_nodes: list[NodeWithScore] | None) -> str:
+     if not raw_nodes:
+         return ""
+     
+     if summary_nodes == []:
+         return self._prepare_context_without_summaries(raw_nodes)
+     
+     if summary_nodes is None:
+         return self._prepare_context_with_fetched_summaries(raw_nodes)
+         
+     return self._prepare_context_with_provided_summaries(raw_nodes, summary_nodes)
+
+ def _prepare_context_without_summaries(self, raw_nodes: list[NodeWithScore]) -> str:
+     logging.warning("Empty context_nodes. Cannot add summaries as context information!")
+     return self._utils_class.prepare_prompt_with_metadata_info(nodes=raw_nodes)

Line range hint 236-350: Enhance error handling and logging.

Consider improving the error handling and logging:

  1. Add structured logging with more context
  2. Handle potential exceptions in node retrieval
  3. Add debug logging for performance metrics

Example enhancement:

 def _prepare_context_str(self, raw_nodes: list[NodeWithScore], summary_nodes: list[NodeWithScore] | None) -> str:
+    logging.debug("Preparing context string with %d raw nodes and %s summary nodes",
+                  len(raw_nodes),
+                  len(summary_nodes) if summary_nodes is not None else "no")
     try:
         # ... existing logic ...
+    except Exception as e:
+        logging.error("Failed to prepare context string: %s", str(e))
+        raise
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1758e61 and c516959.

📒 Files selected for processing (10)
  • routers/amqp.py (2 hunks)
  • subquery.py (3 hunks)
  • tests/unit/test_prepare_answer_sources.py (1 hunks)
  • utils/mongo.py (1 hunks)
  • utils/query_engine/__init__.py (1 hunks)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (4 hunks)
  • utils/query_engine/level_based_platform_query_engine.py (2 hunks)
  • utils/query_engine/prepare_answer_sources.py (1 hunks)
  • utils/query_engine/subquestion_engine.py (1 hunks)
  • worker/tasks.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
utils/query_engine/subquestion_engine.py

12-12: Redefinition of unused Sequence from line 1

Remove definition: Sequence

(F811)

🔇 Additional comments (5)
utils/query_engine/__init__.py (1)

8-8: Import statement for CustomSubQuestionQueryEngine added correctly

The import statement correctly adds CustomSubQuestionQueryEngine from subquestion_engine.

utils/mongo.py (1)

43-43: Verify the necessity of directConnection=true in the MongoDB URI

Adding ?directConnection=true to the connection string may impact connection behavior, especially in clustered or replica set environments. Please verify that this parameter is necessary and that it won't cause issues with your MongoDB deployment.

subquery.py (1)

212-216: Verify the impact of switching to CustomSubQuestionQueryEngine

The switch from SubQuestionQueryEngine to CustomSubQuestionQueryEngine might affect query behavior.

✅ Verification successful

CustomSubQuestionQueryEngine is a safe extension of SubQuestionQueryEngine

The verification shows that CustomSubQuestionQueryEngine is a direct subclass of SubQuestionQueryEngine with minimal modifications. It maintains the same interface and initialization parameters, only overriding the _query method while preserving the core functionality. The implementation shows no breaking changes or behavioral differences that would impact existing query patterns.

  • The class properly inherits and calls the parent's __init__ with all parameters
  • The _query method maintains the same signature and return type
  • The usage pattern in subquery.py follows the standard from_defaults initialization
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any other usages of SubQuestionQueryEngine that might need updating
rg "SubQuestionQueryEngine" --type python

# Look for any configuration differences between the two engines
ast-grep --pattern 'class CustomSubQuestionQueryEngine {
  $$$
}'

Length of output: 140


Script:

#!/bin/bash
# Search for SubQuestionQueryEngine usage with ripgrep
rg "SubQuestionQueryEngine"

# Look for the CustomSubQuestionQueryEngine class definition
rg "class CustomSubQuestionQueryEngine" -A 10

# Search for any imports of these query engines
rg "from.*SubQuestionQueryEngine.*import"

Length of output: 1409


Script:

#!/bin/bash
# Get the full implementation of CustomSubQuestionQueryEngine to check for any behavioral changes
rg -A 50 "class CustomSubQuestionQueryEngine" utils/query_engine/subquestion_engine.py

# Check for any specific usage patterns or configurations
rg "CustomSubQuestionQueryEngine\.from_defaults" -A 5

Length of output: 2296

tests/unit/test_prepare_answer_sources.py (1)

1-6: LGTM! Well-structured test imports and organization.

The test file is well-organized with appropriate imports and clear separation of concerns.

utils/query_engine/level_based_platform_query_engine.py (1)

18-18: LGTM! Good enhancement to response structure.

The change to return a Response object with source nodes improves the API by providing structured access to both the response text and its sources.

Also applies to: 55-55

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)
worker/tasks.py (2)

Line range hint 21-24: Initialize references in the error handling path

The error handling path doesn't initialize the references variable, but it's used in the return statement. This will raise an UnboundLocalError.

Apply this fix:

     except Exception:
         response = "Sorry, We cannot process your question at the moment."
+        references = []
         logging.error(
             f"Errors raised while processing the question for community: {community_id}!"
         )
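The UnboundLocalError risk described above can be demonstrated with a minimal stand-in; the function name and messages here are hypothetical, not the project's actual code:

```python
def answer_with_references(ok: bool) -> tuple[str, list]:
    # Initialize references before the try block so the except path
    # cannot leave it unbound when the return statement runs.
    references: list = []
    try:
        if not ok:
            raise RuntimeError("query failed")
        response = "answer"
        references = ["source-url"]
    except RuntimeError:
        response = "Sorry, We cannot process your question at the moment."
    return response, references
```

Without the up-front initialization, the failure path would raise UnboundLocalError at the return statement instead of returning the fallback message.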

Line range hint 49-66: Update function docstring to reflect new return type

The docstring needs to be updated to document the new return type that includes references.

Apply this diff:

     Returns
     ---------
     response : str
         the LLM's response
+    references : list[NodeWithScore]
+        the source nodes that were used to generate the response
utils/query_engine/dual_qdrant_retrieval_engine.py (2)

177-184: LGTM! Consider adding debug logging

The implementation correctly returns both the response and source nodes. Consider adding debug logging to track the number of source nodes being returned.

Add logging before returning:

+    logging.debug(f"Returning response with {len(nodes)} source nodes")
     return Response(response=str(response), source_nodes=nodes)

Line range hint 186-215: LGTM! Consider adding debug logging for summary processing

The summary query processing is well-implemented with proper fallback to basic query when no dates are found.

Add logging for better observability:

     if not dates:
+        logging.debug("No dates found in summary nodes, falling back to basic query")
         return self._process_basic_query(query_str)
utils/query_engine/level_based_platform_query_engine.py (2)

55-55: LGTM! Consider enhancing the response with metadata

The implementation correctly returns both the response and source nodes.

Consider adding metadata about the number of nodes used:

-        return Response(response=str(response), source_nodes=similar_nodes)
+        return Response(
+            response=str(response),
+            source_nodes=similar_nodes,
+            metadata={"num_nodes_used": len(similar_nodes)}
+        )

Line range hint 249-252: Enhance warning message for empty context nodes

The warning message could be more informative by including the query context.

Improve the warning message:

-            logging.warning(
-                "Empty context_nodes. Cannot add summaries as context information!"
-            )
+            logging.warning(
+                "Empty context_nodes for query '%s'. Cannot add summaries as context information!",
+                query_str
+            )
utils/query_engine/subquestion_engine.py (1)

48-61: Consider extracting query execution logic

The query execution logic for both sync and async paths could be extracted into separate methods to improve readability and maintainability.

Consider refactoring like this:

def _execute_queries_async(self, sub_questions: list, colors: dict) -> List[Optional[SubQuestionAnswerPair]]:
    tasks = [
        self._aquery_subq(sub_q, color=colors[str(ind)])
        for ind, sub_q in enumerate(sub_questions)
    ]
    results = run_async_tasks(tasks)
    return cast(List[Optional[SubQuestionAnswerPair]], results)

def _execute_queries_sync(self, sub_questions: list, colors: dict) -> List[Optional[SubQuestionAnswerPair]]:
    return [
        self._query_subq(sub_q, color=colors[str(ind)])
        for ind, sub_q in enumerate(sub_questions)
    ]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between c516959 and f43fdd3.

📒 Files selected for processing (6)
  • tests/unit/test_prepare_answer_sources.py (1 hunks)
  • utils/query_engine/__init__.py (1 hunks)
  • utils/query_engine/dual_qdrant_retrieval_engine.py (4 hunks)
  • utils/query_engine/level_based_platform_query_engine.py (2 hunks)
  • utils/query_engine/subquestion_engine.py (1 hunks)
  • worker/tasks.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • utils/query_engine/__init__.py
  • tests/unit/test_prepare_answer_sources.py
🧰 Additional context used
🪛 Ruff (0.8.0)
utils/query_engine/subquestion_engine.py

2-2: Redefinition of unused Sequence from line 1

Remove definition: Sequence

(F811)

🔇 Additional comments (5)
worker/tasks.py (1)

31-31: Ensure references are serializable for client consumption

The references field contains NodeWithScore objects, which may not be directly serializable to JSON.

utils/query_engine/subquestion_engine.py (4)

16-34: LGTM! Well-structured class initialization

The constructor is well-implemented with proper type hints and parent class initialization.


78-81: LGTM! Clean implementation of query method

The method correctly delegates to the parent class implementation while maintaining proper type hints.


35-76: ⚠️ Potential issue

Fix return type mismatch

The method still returns qa_pairs_all which is List[Optional[SubQuestionAnswerPair]] instead of the declared list[NodeWithScore].

-            return response, qa_pairs_all
+            return response, nodes

35-37: Verify input validation and performance implications

Please ensure that:

  1. The query_bundle input is properly validated before processing
  2. The async implementation properly handles timeouts and error cases
✅ Verification successful

Let me gather more specific information about the query_bundle validation and error handling in the subquestion engine.


Let me try a different approach to find the implementation details.


Let me try one final time without the file type flag.


Based on the code snippets found, I can now provide a final response:

Input validation and async concerns are properly handled

The implementation shows:

  • QueryBundle is a typed parameter, ensuring type safety
  • The query processing is wrapped in a callback manager event system for proper execution tracking
  • Sub-questions generation is controlled and monitored
  • The method returns a well-defined tuple type with response and node scores

The code follows good practices with:

  • Strong type hints
  • Event-based monitoring
  • Structured error propagation through the callback system
  • Clear return type specification
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for input validation and error handling patterns
rg -A 5 "def.*query_bundle.*" 
rg -A 5 "raise.*Error"

# Look for timeout configurations in async operations
rg -A 5 "timeout|wait_for"

Length of output: 7216


Script:

#!/bin/bash
# Check QueryBundle class definition and validation
ast-grep --pattern 'class QueryBundle'

# Check subquestion engine's query method implementation
ast-grep --pattern 'def _query'

# Look for error handling in subquestion engine
rg -A 10 "class SubQuestionQueryEngine"

Length of output: 116


Script:

#!/bin/bash
# Search for query_bundle usage and validation patterns
rg -A 10 "query_bundle:" --type python

# Look for SubQuestionQueryEngine implementation
rg -A 15 "class SubQuestionQueryEngine" --type python

# Search for error handling in query methods
rg -A 10 "def _query.*QueryBundle" --type python

Length of output: 238


Script:

#!/bin/bash
# Search for query_bundle usage and validation patterns
rg -A 10 "query_bundle:"

# Look for SubQuestionQueryEngine implementation
rg -A 15 "class SubQuestionQueryEngine"

# Search for error handling in query methods
rg -A 10 "def _query.*QueryBundle"

Length of output: 1365

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
worker/tasks.py (1)

18-24: Consider making the threshold configurable

The threshold value of 0.7 in PrepareAnswerSources(threshold=0.7) is hardcoded. Consider making this configurable through environment variables or configuration settings for better flexibility.

-        answer_sources = PrepareAnswerSources(threshold=0.7).prepare_answer_sources(nodes=references)
+        answer_sources = PrepareAnswerSources(threshold=settings.ANSWER_SOURCES_THRESHOLD).prepare_answer_sources(nodes=references)

Also applies to: 33-33

utils/query_engine/subquestion_engine.py (2)

15-32: Add class and method docstrings

Please add docstrings to describe:

  • The purpose and behavior of the CustomSubQuestionQueryEngine class
  • The parameters and their purpose in the init method

77-80: Consider adding explicit return type annotation

While the return type is specified in the method signature, annotating the intermediate result of the super().query call makes the expected shape explicit at the call site.

-        return super().query(str_or_query_bundle)
+        result: tuple[RESPONSE_TYPE, list[NodeWithScore]] = super().query(str_or_query_bundle)
+        return result
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f43fdd3 and 80a5960.

📒 Files selected for processing (2)
  • utils/query_engine/subquestion_engine.py (1 hunks)
  • worker/tasks.py (4 hunks)
🔇 Additional comments (4)
worker/tasks.py (3)

5-5: LGTM!

The new imports are appropriate for handling the references functionality.

Also applies to: 8-8


52-52: LGTM!

The return type change and docstring updates accurately reflect the new functionality of returning references along with the response.

Also applies to: 67-68


74-74: LGTM!

The implementation properly handles and returns both the response and references from query_multiple_source.

Also applies to: 80-80

utils/query_engine/subquestion_engine.py (1)

75-75: ⚠️ Potential issue

Return value doesn't match declared type

The method declares tuple[RESPONSE_TYPE, list[NodeWithScore]] as return type but returns qa_pairs_all which is List[Optional[SubQuestionAnswerPair]]. This mismatch could cause type errors.

Fix the return statement:

-            return response, qa_pairs_all
+            return response, nodes

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
worker/tasks.py (1)

22-24: Consider making the threshold configurable

The threshold value of 0.7 is hardcoded. Consider making this configurable through environment variables or configuration settings for better flexibility.

-        answer_sources = PrepareAnswerSources(threshold=0.7).prepare_answer_sources(
+        answer_sources = PrepareAnswerSources(threshold=settings.ANSWER_SOURCES_THRESHOLD).prepare_answer_sources(
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 80a5960 and a485d32.

📒 Files selected for processing (1)
  • worker/tasks.py (5 hunks)
🔇 Additional comments (3)
worker/tasks.py (3)

5-5: LGTM: Import changes align with new functionality

The new imports for NodeWithScore and PrepareAnswerSources are appropriate for the added reference handling functionality.

Also applies to: 8-8


35-36: Ensure references are properly serialized

Previous review comments highlighted serialization concerns. Verify that PrepareAnswerSources.prepare_answer_sources() returns JSON-serializable data.

#!/bin/bash
# Check the return type of prepare_answer_sources
ast-grep --pattern 'def prepare_answer_sources($$$): $$$'

77-83: Consider memory management for large reference lists

While the function correctly returns both response and references, consider implementing pagination or limiting the number of references returned to prevent memory issues with large datasets.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
worker/tasks.py (1)

22-24: Consider making the similarity threshold configurable

The similarity threshold of 0.7 is hardcoded. Consider making this configurable through environment variables or configuration settings for better flexibility.

-        answer_sources = PrepareAnswerSources(threshold=0.7).prepare_answer_sources(
+        answer_sources = PrepareAnswerSources(
+            threshold=float(os.getenv('ANSWER_SIMILARITY_THRESHOLD', '0.7'))
+        ).prepare_answer_sources(
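The env-var approach in the diff above can be sketched as a small helper. ANSWER_SIMILARITY_THRESHOLD is a hypothetical variable name taken from the suggestion, not an existing project setting:

```python
import os


def answer_sources_threshold(default: float = 0.7) -> float:
    # Falls back to the hardcoded default when the (assumed) environment
    # variable ANSWER_SIMILARITY_THRESHOLD is not set.
    return float(os.getenv("ANSWER_SIMILARITY_THRESHOLD", str(default)))
```

Centralizing the lookup in one helper keeps the default in a single place if other call sites later need the same threshold.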
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a485d32 and d409442.

📒 Files selected for processing (1)
  • worker/tasks.py (5 hunks)
🔇 Additional comments (4)
worker/tasks.py (4)

5-5: LGTM: Import statements are appropriate

The new imports for NodeWithScore and PrepareAnswerSources are correctly added and necessary for the reference handling functionality.

Also applies to: 8-8


55-55: LGTM: Return type and documentation updates

The function signature and documentation are properly updated to reflect the new return value including references.

Also applies to: 70-71


77-83: Ensure NodeWithScore objects are properly serialized

The references returned from query_multiple_source contain NodeWithScore objects. Ensure these are properly serialized before being sent to clients.

Run the following script to verify the serialization handling:

#!/bin/bash
# Search for NodeWithScore serialization handling
rg -A 5 "NodeWithScore.*to_(json|dict)"

# Search for any serialization utilities that might handle these objects
rg -A 5 "serialize.*NodeWithScore"
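One way to address the serialization concern is to reduce each node to plain dicts before the payload leaves the worker. ScoredNode below is a stand-in for llama_index's NodeWithScore (which wraps a full node object), so this only illustrates the reduction step:

```python
import json
from dataclasses import dataclass, field


@dataclass
class ScoredNode:
    """Simplified stand-in carrying only the fields the payload needs."""
    score: float
    metadata: dict = field(default_factory=dict)


def serialize_references(nodes: list) -> str:
    # Reduce each node to a plain dict of primitives so the references
    # survive json.dumps; missing URLs serialize as null.
    return json.dumps(
        [{"score": n.score, "url": n.metadata.get("url")} for n in nodes]
    )
```

Keeping only primitives in the payload avoids clients depending on llama_index internals.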

26-27: 🛠️ Refactor suggestion

Enhance error handling for references

The error case sets answer_sources to None while the success case returns processed references. Consider using a consistent type (empty list) for error cases to maintain API consistency.

-        answer_sources = None
+        answer_sources = []

Also applies to: 36-36

amindadgar merged commit b0253a1 into main on Dec 2, 2024; 14 checks passed.

On Dec 10, 2024, amindadgar linked an issue that may be closed by this pull request: "feat: Add data sources in response".