This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

[NeuralChat] Enable RAG's table extraction and summary #1417

Open. xmx-521 wants to merge 13 commits into base: main

Contributor

@xmx-521 xmx-521 commented Mar 25, 2024

Type of Change

feature
API changed

Description

Enable RAG's table extraction functionality for PDF files
Enable RAG's table summary functionality, with three modes to choose from: [none, title, llm]

Expected Behavior & Potential Risk

Users can use RAG's table extraction and summary functionality to get a better RAG experience

How has this PR been tested?

Local test and pre-CI

Dependency Change?

add tesseract dependency
add poppler dependency
change the unstructured dependency to unstructured[all-docs]
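
A quick way to confirm that these dependencies are visible in a given environment is sketched below. It is not part of this PR: the helper name `check_table_extraction_deps` is hypothetical, and the sketch assumes Tesseract exposes a `tesseract` executable, Poppler exposes `pdftoppm`, and the Python package is importable as `unstructured`.

```python
import importlib.util
import shutil


def check_table_extraction_deps() -> dict:
    """Report whether each dependency appears to be available.

    Hypothetical helper for illustration only; it checks for the
    executables on PATH and for an importable Python package.
    """
    return {
        # Tesseract OCR command-line binary
        "tesseract": shutil.which("tesseract") is not None,
        # pdftoppm is one of the command-line tools shipped with Poppler
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
        # Only checks that the package can be found, not which extras are installed
        "unstructured": importlib.util.find_spec("unstructured") is not None,
    }


if __name__ == "__main__":
    for name, available in check_table_extraction_deps().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

The check only reports presence; it does not verify versions or that the `all-docs` extras were installed.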


github-actions bot commented Mar 25, 2024

⚡ Required checks status: All passing 🟢

Groups summary

🟢 Format Scan Tests workflow
| Check ID | Status | Error details |
|---|---|---|
| format-scan (pylint) | success | |
| format-scan (bandit) | success | |
| format-scan (cloc) | success | |
| format-scan (cpplint) | success | |

These checks are required after the changes to intel_extension_for_transformers/neural_chat/assets/docs/LLAMA2_page6.pdf, intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/ci/plugins/retrieval/test_parameters.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.

🟢 NeuralChat Unit Test
| Check ID | Status | Error details |
|---|---|---|
| neuralchat-unit-test-baseline | success | |
| neuralchat-unit-test-PR-test | success | |
| Generate-NeuralChat-Report | success | |

These checks are required after the changes to .github/workflows/script/unitTest/run_unit_test_neuralchat.sh, intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/ci/plugins/retrieval/test_parameters.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.

🟢 Chat Bot Test workflow
| Check ID | Status | Error details |
|---|---|---|
| call-inference-llama-2-7b-chat-hf / inference test | success | |
| call-inference-mpt-7b-chat / inference test | success | |

These checks are required after the changes to intel_extension_for_transformers/neural_chat/chatbot.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/context_utils.py, intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/parser/parser.py, intel_extension_for_transformers/neural_chat/prompts/prompt.py, intel_extension_for_transformers/neural_chat/tests/requirements.txt.


Thank you for your contribution! 💜

Note
This comment is automatically generated and will be updated every 180 seconds within the next 6 hours. If you have any other questions, contact VincyZhang or XuehaoSun for help.

@Liangyx2
Contributor

Please add installation and usage instructions for PDF table-to-text in intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md

return result

tables_result = []
def get_relation(table_coords, caption_coords, table_page_number, caption_page_number, threshold=100):

This comment was marked as resolved.
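
The `get_relation` signature in the snippet above takes table and caption coordinates, their page numbers, and a distance threshold. Its body is not shown in this thread, so the following is only a sketch of one plausible way to relate a caption to a table under those inputs. The function name `caption_belongs_to_table`, the `(x0, y0, x1, y1)` box layout, and the matching rule (same page, horizontal overlap, vertical gap within the threshold) are all assumptions, not the PR's implementation.

```python
from typing import Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left of the page.
Box = Tuple[float, float, float, float]


def caption_belongs_to_table(
    table_box: Box,
    caption_box: Box,
    table_page: int,
    caption_page: int,
    threshold: float = 100.0,
) -> bool:
    """Hypothetical rule for pairing a caption with a table."""
    # A caption on another page is never paired with the table.
    if table_page != caption_page:
        return False

    # Require the two boxes to overlap horizontally.
    x_overlap = min(table_box[2], caption_box[2]) - max(table_box[0], caption_box[0])
    if x_overlap <= 0:
        return False

    # Vertical gap between the boxes; zero if they overlap vertically.
    gap = max(table_box[1] - caption_box[3], caption_box[1] - table_box[3], 0.0)
    return gap <= threshold


if __name__ == "__main__":
    table = (50.0, 200.0, 400.0, 500.0)
    caption_below = (60.0, 510.0, 390.0, 530.0)
    print(caption_belongs_to_table(table, caption_below, 6, 6))
```

A larger threshold pairs captions that sit further from the table, at the cost of more false matches on dense pages.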

xmx-521 and others added 7 commits March 28, 2024 10:31
Signed-off-by: Manxin Xu <[email protected]>
Signed-off-by: Manxin Xu <[email protected]>
Signed-off-by: Manxin Xu <[email protected]>
Signed-off-by: Manxin Xu <[email protected]>
Signed-off-by: Manxin Xu <[email protected]>
@xmx-521 xmx-521 requested a review from XuhuiRen March 28, 2024 06:05
@@ -92,6 +92,7 @@ Below are the description for the available parameters in `agent_QA`,
| enable_rerank | bool | Whether to enable retrieval then rerank pipeline |True, False|
| reranker_model | str | The name of the reranker model from the Huggingface or a local path |-|
| top_n | int | The return number of the reranker model |-|
| table_strategy | str | The strategies to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm," the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast" |"fast", "hq", "llm"|
Collaborator

@XinyuYe-Intel XinyuYe-Intel Apr 11, 2024


From the code, it seems the "fast" table_strategy only returns None instead of table content. Is this somewhat unreasonable?

It appears the "hq" strategy uses the unstructured package to extract tables. I have also used this package and found that it actually performed worse than table-transformer.

Also, does the "llm" strategy return reliable table contents? From the code, it looks like it uses an LLM and a prompt to generate a summary of the tables in the document, but in my experience this approach sometimes generates results that deviate significantly from the table content.

Contributor Author


Thanks for the insightful comments. My thoughts on these issues are as follows:

> From the code, it seems the "fast" table_strategy only returns None instead of table content. Is this somewhat unreasonable?

By default, our program uses OCR to extract all text in a file, including table text; this was implemented in other PRs. This PR only aims to further enhance table understanding, so fast mode (which is also the default mode) returns no additional content.

> It appears the "hq" strategy uses the unstructured package to extract tables. I have also used this package and found that it actually performed worse than table-transformer.

At present, we do use unstructured to extract table information, and the extraction performance is quite satisfactory. We have not tried table-transformer, but it is indeed worth considering.

> Also, does the "llm" strategy return reliable table contents? From the code, it looks like it uses an LLM and a prompt to generate a summary of the tables in the document, but in my experience this approach sometimes generates results that deviate significantly from the table content.

Your understanding of what llm mode does is correct. It is true that the LLM's table summary is not completely reliable, but according to our experimental results, llm mode gives much better table QA performance overall.
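
The behavior of the three strategies as described in this thread can be sketched as a small dispatch function. This is an illustration, not the PR's code: `describe_table` and the `summarize` callback are hypothetical names, and the LLM call is replaced by whatever callable the caller passes in.

```python
from typing import Callable, Optional


def describe_table(
    table_text: str,
    strategy: str = "fast",
    summarize: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the extra table content for one table, per strategy.

    Hypothetical sketch of the behavior discussed above:
      - "fast": no extra table content (plain OCR text is handled elsewhere)
      - "hq":   the extracted table text itself
      - "llm":  a summary produced by the supplied `summarize` callable
    """
    if strategy == "fast":
        return None
    if strategy == "hq":
        return table_text
    if strategy == "llm":
        if summarize is None:
            raise ValueError('strategy "llm" requires a summarize callable')
        return summarize(table_text)
    raise ValueError(f"unknown table_strategy: {strategy!r}")


if __name__ == "__main__":
    # Stand-in for an LLM call, used only to make the example runnable.
    fake_llm = lambda text: f"Summary of a table with {len(text.splitlines())} rows"
    print(describe_table("model | score\nA | 0.7\nB | 0.8", "llm", summarize=fake_llm))
```

Keeping the summarizer as an injected callable makes it easy to compare "llm" output against the raw "hq" text when checking how far a summary drifts from the table.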

5 participants