Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
opendatalab · Nov 15, 2024 · 845a3ff
2 parents d0558ab + 6083e10
commit 845a3ff
Showing 212 changed files with 3,857 additions and 880,886 deletions.
diff --git a/.gitignore b/.gitignore
@@ -48,3 +48,6 @@ debug_utils/
 
 # sphinx docs
 _build/
+
+
+output/
diff --git a/README.md b/README.md
@@ -42,6 +42,7 @@
 </div>
 
 # Changelog
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
         "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
     },
     "table-config": {
-        "model": "tablemaster",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
         "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
         "max_time": 400
     }
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
 - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
 - Quick Deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
 >
 > Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
 > 
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
 # Acknowledgments
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

diff --git a/README_ja-JP.md b/README_ja-JP.md
@@ -1,3 +1,5 @@
+> [!Warning]
+> このドキュメントはすでに古くなっています。最新版のドキュメントを参照してください：[ENGLISH](README.md)。
 <div id="top">
 
 <p align="center">
@@ -18,9 +20,7 @@
 <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
 
 
-<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
-  <strong>NOTE：</strong> このドキュメントはすでに古くなっています。最新版のドキュメントを参照してください。
-</div>
+
 
 
 [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

diff --git a/README_zh-CN.md b/README_zh-CN.md
@@ -42,7 +42,7 @@
 </div>
 
 # 更新记录
-
+- 2024/11/15 0.9.3发布，为表格识别功能接入了[RapidTable](https://github.com/RapidAI/RapidTable),单表解析速度提升10倍以上，准确率更高，显存占用更低
 - 2024/11/06 0.9.2发布，为表格识别功能接入了[StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B)模型
 - 2024/10/31 0.9.0发布，这是我们进行了大量代码重构的全新版本，解决了众多问题，提升了性能，降低了硬件需求，并提供了更丰富的易用性：
   - 重构排序模块代码，使用 [layoutreader](https://github.com/ppaanngggg/layoutreader) 进行阅读顺序排序，确保在各种排版下都能实现极高准确率
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
         <td rowspan="2">GPU硬件支持列表</td>
         <td colspan="2">最低要求 8G+显存</td>
         <td colspan="2">3060ti/3070/4060<br>
-        8G显存可开启layout、公式识别和ocr加速</td>
+        8G显存可开启全部加速功能(表格仅限rapid_table)</td>
         <td rowspan="2">None</td>
     </tr>
     <tr>
         <td colspan="2">推荐配置 10G+显存</td>
         <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10G显存及以上可以同时开启layout、公式识别和ocr加速和表格识别加速<br>
+        10G显存及以上可开启全部加速功能<br>
         </td>
     </tr>
 </table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
         "enable": true  // 公式识别功能默认是开启的，如果需要关闭请修改此处的值为"false"
     },
     "table-config": {
-        "model": "tablemaster",  // 使用structEqTable请修改为"struct_eqtable"
+        "model": "rapid_table",  // 使用structEqTable请修改为"struct_eqtable"
         "enable": false, // 表格识别功能默认是关闭的，如果需要开启请修改此处的值为"true"
         "max_time": 400
     }
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
 - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
 - 使用Docker快速部署
 > [!IMPORTANT]
-> Docker 需设备gpu显存大于等于16GB，默认开启所有加速功能
+> Docker 需设备gpu显存大于等于8GB，默认开启所有加速功能
 > 
 > 运行本docker前可以通过以下命令检测自己的设备是否支持在docker上使用CUDA加速
 > 
@@ -431,6 +431,7 @@ TODO
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

diff --git a/demo/magic_pdf_parse_main.py b/demo/magic_pdf_parse_main.py
@@ -19,9 +19,10 @@ def json_md_dump(
         pdf_name,
         content_list,
         md_content,
+        orig_model_list,
 ):
     # 写入模型结果到 model.json
-    orig_model_list = copy.deepcopy(pipe.model_list)
+
     md_writer.write(
         content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
         path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
 
         pdf_bytes = open(pdf_path, "rb").read()  # 读取 pdf 文件的二进制数据
 
+        orig_model_list = []
+
         if model_json_path:
             # 读取已经被模型解析后的pdf文件的 json 原始数据，list 类型
             model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+            orig_model_list = copy.deepcopy(model_json)
         else:
             model_json = []
 
@@ -115,8 +119,9 @@ def pdf_parse_main(
         pipe.pipe_classify()
 
         # 如果没有传入模型数据，则使用内置模型解析
-        if not model_json:
+        if len(model_json) == 0:
             pipe.pipe_analyze()  # 解析
+            orig_model_list = copy.deepcopy(pipe.model_list)
 
         # 执行解析
         pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
         md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
 
         if is_json_md_dump:
-            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
 
         if is_draw_visualization_bbox:
             draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
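
Note on the change above: `orig_model_list` is now captured inside `pdf_parse_main`, either from the loaded model JSON or right after `pipe.pipe_analyze()`, and is passed into `json_md_dump` instead of being copied there. A plausible reading is that this keeps the dumped `model.json` independent of anything later stages do to `pipe.model_list`. The sketch below isolates that control flow. `FakePipe`, `snapshot_model_list`, and the sample model data are hypothetical stand-ins for illustration and are not part of the project.

```python
import copy


class FakePipe:
    """Hypothetical stand-in for the demo's pipe object (not the real class)."""

    def __init__(self):
        self.model_list = []

    def pipe_analyze(self):
        # Pretend the built-in model produced one page of results.
        self.model_list = [{"page_info": {"page_no": 0}, "layout_dets": []}]


def snapshot_model_list(pipe, loaded_model_json=None):
    """Mirror the diff's logic for deciding what ends up in orig_model_list."""
    orig_model_list = []

    if loaded_model_json is not None:
        # Model output was supplied from a file: snapshot it as-is.
        model_json = loaded_model_json
        orig_model_list = copy.deepcopy(model_json)
    else:
        model_json = []

    if len(model_json) == 0:
        # No model data passed in: run the built-in analysis, then snapshot.
        pipe.pipe_analyze()
        orig_model_list = copy.deepcopy(pipe.model_list)

    return orig_model_list


pipe = FakePipe()
snapshot = snapshot_model_list(pipe)
pipe.model_list[0]["layout_dets"].append({"mutated": True})  # later-stage change
print(snapshot[0]["layout_dets"])  # the snapshot is unaffected by the mutation
```

Because the snapshot is a deep copy, later changes to `pipe.model_list` do not show up in it.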

diff --git a/magic-pdf.template.json b/magic-pdf.template.json
@@ -15,7 +15,7 @@
         "enable": true
     },
     "table-config": {
-        "model": "tablemaster",
+        "model": "rapid_table",
         "enable": false,
         "max_time": 400
     },

diff --git a/magic_pdf/dict2md/ocr_mkcontent.py b/magic_pdf/dict2md/ocr_mkcontent.py
@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
                         # 如果是前一行带有-连字符，那么末尾不应该加空格
                         if __is_hyphen_at_line_end(content):
                             para_text += content[:-1]
-                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                             para_text += content
                         else:  # 西方文本语境下 content间需要空格分隔
                             para_text += f"{content} "
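
Note on the change above: a single-character span that is a digit now gets a trailing space, like the one-letter words `A`, `I`, `a`, `i`. The sketch below reproduces the three-way branch in isolation. `append_span` and `ends_with_hyphen` are hypothetical names; in particular, `ends_with_hyphen` is only an assumed approximation of the project's `__is_hyphen_at_line_end`, whose implementation is not shown in this diff.

```python
import re


def ends_with_hyphen(content: str) -> bool:
    # Assumed stand-in for __is_hyphen_at_line_end: a letter followed by '-'
    # at the very end of the span.
    return bool(re.search(r"[A-Za-z]-$", content))


def append_span(para_text: str, content: str) -> str:
    """Append one span to the paragraph text using the rule from the diff."""
    if ends_with_hyphen(content):
        # Hyphenated line break: drop the hyphen and add no space.
        return para_text + content[:-1]
    elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
        # Lone punctuation or symbol: attach without a trailing space.
        return para_text + content
    else:
        # Words, one-letter words, and (new in this change) single digits.
        return para_text + f"{content} "


print(repr(append_span("page ", "7")))   # digit keeps its separating space
print(repr(append_span("done", ",")))    # punctuation is attached directly
```

Before this change, a lone digit fell into the second branch and was glued to the following span.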

diff --git a/magic_pdf/libs/Constants.py b/magic_pdf/libs/Constants.py
@@ -50,4 +50,6 @@ class MODEL_NAME:
 
     YOLO_V8_MFD = "yolo_v8_mfd"
 
-    UniMerNet_v2_Small = "unimernet_small"
+    UniMerNet_v2_Small = "unimernet_small"
+
+    RAPID_TABLE = "rapid_table"
diff --git a/magic_pdf/libs/config_reader.py b/magic_pdf/libs/config_reader.py
@@ -92,7 +92,7 @@ def get_table_recog_config():
     table_config = config.get('table-config')
     if table_config is None:
         logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
-        return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
+        return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
     else:
         return table_config
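
Note on the change above: when `table-config` is missing from the config file, the default table model is now `rapid_table` rather than `tablemaster`, with recognition still disabled. The sketch below shows that fallback on a plain dict. `table_recog_config_from` is a hypothetical helper that takes the config as an argument; the real `get_table_recog_config` takes no arguments, obtains `config` elsewhere, and logs a warning, all of which are omitted here.

```python
import json

RAPID_TABLE = "rapid_table"  # mirrors MODEL_NAME.RAPID_TABLE added in Constants.py


def table_recog_config_from(config: dict) -> dict:
    """Return the 'table-config' entry, or the new default if it is absent."""
    table_config = config.get("table-config")
    if table_config is None:
        # Same default string the diff builds, parsed into a dict.
        return json.loads(
            f'{{"model": "{RAPID_TABLE}","enable": false, "max_time": 400}}'
        )
    return table_config


print(table_recog_config_from({}))
print(table_recog_config_from({"table-config": {"model": "struct_eqtable", "enable": True, "max_time": 400}}))
```

An explicit `table-config` entry is returned unchanged, so existing configs that name `tablemaster` or `struct_eqtable` are not rewritten by this fallback.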
 

diff --git a/magic_pdf/libs/draw_bbox.py b/magic_pdf/libs/draw_bbox.py
@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
             if block['type'] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block['blocks']:
                     if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-                        for line in sub_block['virtual_lines']:
-                            bbox = line['bbox']
-                            index = line['index']
-                            page_line_list.append({'index': index, 'bbox': bbox})
+                        if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
+                            for line in sub_block['virtual_lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
+                        else:
+                            for line in sub_block['lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
                     elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
                         for line in sub_block['lines']:
                             bbox = line['bbox']
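
Note on the `draw_line_sort_bbox` change above: for image and table bodies, the function now reads `virtual_lines` only when the list is non-empty and its first entry carries an `index`; otherwise it falls back to `lines`. The sketch below isolates that selection. `collect_body_lines` is a hypothetical helper, and it uses `.get("virtual_lines", [])` where the diff indexes the key directly, a small deviation made so the example tolerates a missing key.

```python
def collect_body_lines(sub_block: dict) -> list:
    """Pick the line source for an image/table body block, as in the diff."""
    virtual_lines = sub_block.get("virtual_lines", [])
    if len(virtual_lines) > 0 and virtual_lines[0].get("index", None) is not None:
        source = virtual_lines
    else:
        # virtual_lines is empty or unsorted: fall back to the real lines.
        source = sub_block["lines"]
    return [{"index": line["index"], "bbox": line["bbox"]} for line in source]


sorted_block = {
    "virtual_lines": [{"index": 2, "bbox": [0, 0, 10, 5]}],
    "lines": [{"index": 9, "bbox": [0, 0, 1, 1]}],
}
unsorted_block = {
    "virtual_lines": [{"bbox": [0, 0, 10, 5]}],  # no 'index' key
    "lines": [{"index": 4, "bbox": [1, 1, 2, 2]}],
}
print(collect_body_lines(sorted_block))
print(collect_body_lines(unsorted_block))
```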