Commit

Merge pull request #969 from opendatalab/release-0.9.3
Release 0.9.3
myhloli authored Nov 15, 2024
2 parents d0558ab + 6083e10 commit 845a3ff
Showing 212 changed files with 3,857 additions and 880,886 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -48,3 +48,6 @@ debug_utils/

# sphinx docs
_build/


output/
7 changes: 5 additions & 2 deletions README.md
@@ -42,6 +42,7 @@
</div>

# Changelog
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
- Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
"enable": true // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
},
"table-config": {
"model": "tablemaster", // When using structEqTable, please change to "struct_eqtable".
"model": "rapid_table", // When using structEqTable, please change to "struct_eqtable".
"enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
"max_time": 400
}
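The `table-config` block above can also be switched on programmatically. The following is a minimal sketch, assuming the configuration has already been loaded into a plain dictionary; the helper name `enable_table_recognition` and the `KNOWN_TABLE_MODELS` set are hypothetical and not part of the project, and whether `tablemaster` remains selectable after this release is an assumption based only on the model names that appear in this diff.

```python
import copy

# Model names that appear in this diff. Treating all three as valid
# choices is an assumption made for illustration.
KNOWN_TABLE_MODELS = {"rapid_table", "struct_eqtable", "tablemaster"}


def enable_table_recognition(config, model="rapid_table"):
    """Return a copy of `config` with table recognition turned on.

    Hypothetical helper: it only edits the dictionary and does not
    read or write any configuration file.
    """
    if model not in KNOWN_TABLE_MODELS:
        raise ValueError(f"unknown table model: {model!r}")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    table_config = updated.setdefault("table-config", {"max_time": 400})
    table_config["model"] = model
    table_config["enable"] = True
    return updated


if __name__ == "__main__":
    base = {"table-config": {"model": "rapid_table", "enable": False, "max_time": 400}}
    print(enable_table_recognition(base, model="struct_eqtable"))
```

Returning a copy keeps the loaded defaults intact, so the same base configuration can be reused for runs with and without table recognition.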
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
# Acknowledgments

- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
6 changes: 3 additions & 3 deletions README_ja-JP.md
@@ -1,3 +1,5 @@
> [!Warning]
> This document is already outdated. Please refer to the latest version of the documentation: [ENGLISH](README.md)
<div id="top">
<p align="center">
@@ -18,9 +20,7 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>


<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
<strong>NOTE:</strong> This document is already outdated. Please refer to the latest version of the documentation.
</div>



[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
11 changes: 6 additions & 5 deletions README_zh-CN.md
@@ -42,7 +42,7 @@
</div>

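# Changelog

- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition
- 2024/10/31 0.9.0 released. This is a new version with extensive code refactoring that resolves numerous issues, improves performance, lowers hardware requirements, and improves usability:
- Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring very high accuracy across various layouts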
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
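<td rowspan="2">GPU Hardware Support List</td>
<td colspan="2">Minimum requirement: 8 GB+ VRAM</td>
<td colspan="2">3060ti/3070/4060<br>
8 GB VRAM can enable layout, formula recognition, and OCR acceleration</td>
8 GB VRAM can enable all acceleration features (table recognition limited to rapid_table)</td>
<td rowspan="2">None</td>
</tr>
<tr>
<td colspan="2">Recommended configuration: 10 GB+ VRAM</td>
<td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
10 GB VRAM or more can enable layout, formula recognition, OCR acceleration, and table recognition acceleration at the same time<br>
10 GB VRAM or more can enable all acceleration features<br>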
</td>
</tr>
</table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
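"enable": true // Formula recognition is enabled by default. To disable it, change this value to "false"
},
"table-config": {
"model": "tablemaster", // To use structEqTable, change this to "struct_eqtable"
"model": "rapid_table", // To use structEqTable, change this to "struct_eqtable"
"enable": false, // Table recognition is disabled by default. To enable it, change this value to "true"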
"max_time": 400
}
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
- [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
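- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default
> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default
>
> Before running this Docker, you can use the following command to check whether your device supports CUDA acceleration on Docker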
>
@@ -431,6 +431,7 @@ TODO
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
11 changes: 8 additions & 3 deletions demo/magic_pdf_parse_main.py
@@ -19,9 +19,10 @@ def json_md_dump(
pdf_name,
content_list,
md_content,
orig_model_list,
):
# Write the model results to model.json
orig_model_list = copy.deepcopy(pipe.model_list)

md_writer.write(
content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(

pdf_bytes = open(pdf_path, "rb").read() # Read the PDF file as binary data

orig_model_list = []

if model_json_path:
# Read the raw JSON data of a PDF that the model has already parsed (a list)
model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
orig_model_list = copy.deepcopy(model_json)
else:
model_json = []

@@ -115,8 +119,9 @@ def pdf_parse_main(
pipe.pipe_classify()

# If no model data was passed in, parse with the built-in model
if not model_json:
if len(model_json) == 0:
pipe.pipe_analyze() # Analyze
orig_model_list = copy.deepcopy(pipe.model_list)

# Run parsing
pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")

if is_json_md_dump:
json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)

if is_draw_visualization_bbox:
draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
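The change above captures `orig_model_list` right after the model data becomes available rather than inside `json_md_dump`. A minimal sketch of why that ordering can matter, assuming (the diff does not state this) that later pipeline stages may modify `model_list` in place. `FakePipe` and `run` are hypothetical stand-ins for illustration and do not reproduce the real pipeline classes.

```python
import copy


class FakePipe:
    """Hypothetical stand-in for the pipeline object used in the demo."""

    def __init__(self, model_list):
        self.model_list = model_list

    def pipe_analyze(self):
        # Pretend the built-in model produced one page of detections.
        self.model_list = [{"page": 0, "layout_dets": [{"category_id": 1}]}]

    def pipe_parse(self):
        # Assumed behavior: parsing consumes the detections in place.
        for page in self.model_list:
            page["layout_dets"].clear()


def run(model_json):
    """Mirror the control flow of the patched demo with the stand-in pipe."""
    orig_model_list = []
    if model_json:
        # Snapshot caller-supplied model data before any stage touches it.
        orig_model_list = copy.deepcopy(model_json)
    pipe = FakePipe(model_json)
    if len(model_json) == 0:
        pipe.pipe_analyze()
        # Snapshot the freshly analyzed data before parsing.
        orig_model_list = copy.deepcopy(pipe.model_list)
    pipe.pipe_parse()
    return orig_model_list, pipe.model_list


if __name__ == "__main__":
    snapshot, after_parse = run([])
    print("snapshot:", snapshot)
    print("after parse:", after_parse)
```

Under that assumption, a snapshot taken only at dump time would reflect the post-parse state, whereas the early deep copy preserves the raw model output in both branches.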
2 changes: 1 addition & 1 deletion magic-pdf.template.json
@@ -15,7 +15,7 @@
"enable": true
},
"table-config": {
"model": "tablemaster",
"model": "rapid_table",
"enable": false,
"max_time": 400
},
2 changes: 1 addition & 1 deletion magic_pdf/dict2md/ocr_mkcontent.py
@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
# If the line ends with a hyphen, no trailing space should be added
if __is_hyphen_at_line_end(content):
para_text += content[:-1]
elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
para_text += content
else: # In Western-language text, content pieces need to be separated by spaces
para_text += f"{content} "
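The effect of the added `not content.isdigit()` check can be seen in isolation. This is a minimal sketch of the three branches shown in the hunk; `append_content` is a hypothetical wrapper, and `is_hyphen_at_line_end` is an assumed simplification, since the real `__is_hyphen_at_line_end` is not shown in this diff.

```python
def is_hyphen_at_line_end(text):
    """Assumed simplification: a letter followed by a trailing hyphen."""
    return len(text) > 1 and text.endswith("-") and text[-2].isalpha()


def append_content(para_text, content):
    """Append one piece of content using the rules from the hunk above."""
    if is_hyphen_at_line_end(content):
        # Drop the hyphen and add no space so the word joins the next piece.
        return para_text + content[:-1]
    if len(content) == 1 and content not in ["A", "I", "a", "i"] and not content.isdigit():
        # Lone characters other than the listed letters or a digit get no space.
        return para_text + content
    # Everything else, now including single digits, is followed by a space.
    return para_text + f"{content} "


if __name__ == "__main__":
    text = ""
    for piece in ["Figure", "7", "shows", "a", "re-", "sult"]:
        text = append_content(text, piece)
    print(repr(text))
```

Before the change, a lone digit such as `7` fell into the second branch and was glued to the following word; with the extra condition it is treated like an ordinary token.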
4 changes: 3 additions & 1 deletion magic_pdf/libs/Constants.py
@@ -50,4 +50,6 @@ class MODEL_NAME:

YOLO_V8_MFD = "yolo_v8_mfd"

UniMerNet_v2_Small = "unimernet_small"
UniMerNet_v2_Small = "unimernet_small"

RAPID_TABLE = "rapid_table"
2 changes: 1 addition & 1 deletion magic_pdf/libs/config_reader.py
@@ -92,7 +92,7 @@ def get_table_recog_config():
table_config = config.get('table-config')
if table_config is None:
logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
else:
return table_config
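The fallback above builds its default by parsing an f-string as JSON. As an illustration only, the same default can be written as a plain dictionary. This sketch assumes the configuration is passed in as an already-loaded dictionary; `RAPID_TABLE` is a local stand-in for `MODEL_NAME.RAPID_TABLE`, and `table_recog_config_from` is a hypothetical name, not the project's function.

```python
import json

# Local stand-in for MODEL_NAME.RAPID_TABLE from magic_pdf/libs/Constants.py.
RAPID_TABLE = "rapid_table"


def table_recog_config_from(config):
    """Return the 'table-config' section, or the post-0.9.3 default."""
    table_config = config.get("table-config")
    if table_config is None:
        # Same values as the JSON string in the hunk, written as a dict.
        return {"model": RAPID_TABLE, "enable": False, "max_time": 400}
    return table_config


if __name__ == "__main__":
    # The dictionary form and the f-string form parse to the same value.
    from_string = json.loads(
        f'{{"model": "{RAPID_TABLE}","enable": false, "max_time": 400}}'
    )
    print(table_recog_config_from({}) == from_string)
```

Either form yields the same result; the dictionary avoids escaping braces inside the f-string.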

14 changes: 10 additions & 4 deletions magic_pdf/libs/draw_bbox.py
@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
if block['type'] in [BlockType.Image, BlockType.Table]:
for sub_block in block['blocks']:
if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
for line in sub_block['virtual_lines']:
bbox = line['bbox']
index = line['index']
page_line_list.append({'index': index, 'bbox': bbox})
if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
for line in sub_block['virtual_lines']:
bbox = line['bbox']
index = line['index']
page_line_list.append({'index': index, 'bbox': bbox})
else:
for line in sub_block['lines']:
bbox = line['bbox']
index = line['index']
page_line_list.append({'index': index, 'bbox': bbox})
elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
for line in sub_block['lines']:
bbox = line['bbox']
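The new branch in this hunk falls back from `virtual_lines` to `lines` when the first virtual line carries no `index`. A minimal sketch of that selection as a standalone function, using plain dictionaries; `collect_body_lines` is a hypothetical name, and the use of `.get("virtual_lines", [])` is an assumption that tolerates a missing key, whereas the code in the diff indexes the key directly.

```python
def collect_body_lines(sub_block):
    """Pick the line source for an image or table body block.

    Uses 'virtual_lines' only when it is non-empty and its first entry
    has an 'index'; otherwise falls back to 'lines'.
    """
    virtual_lines = sub_block.get("virtual_lines", [])
    has_indexed_virtual_lines = (
        len(virtual_lines) > 0 and virtual_lines[0].get("index") is not None
    )
    source = virtual_lines if has_indexed_virtual_lines else sub_block["lines"]
    return [{"index": line["index"], "bbox": line["bbox"]} for line in source]


if __name__ == "__main__":
    block = {
        "virtual_lines": [{"bbox": [0, 0, 10, 5]}],  # no 'index' key
        "lines": [{"index": 9, "bbox": [1, 1, 2, 2]}],
    }
    print(collect_body_lines(block))
```

Checking only the first virtual line mirrors the condition in the diff; it assumes that either all virtual lines of a block are indexed or none are.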