VQA (Visual Question Answering) is the task of asking and answering questions about image content. DOC-VQA is one of the VQA tasks; it asks questions about the textual content of document images.
The DOC-VQA algorithm in PP-Structure is developed based on the PaddleNLP natural language processing algorithm library.
The main features are as follows:
- Integrates the LayoutXLM model and the PP-OCR prediction engine (a minimal model-loading sketch follows this feature list).
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task recognizes and classifies the text regions in an image; the RE task extracts relationships between them, such as matching question-answer pairs.
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.
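The underlying models come from PaddleNLP, while training and inference in this repository are driven by the scripts and YAML configs described below. The snippet here is a minimal sketch of loading a LayoutXLM tokenizer and a token-classification head (the setup used for SER) from PaddleNLP; the class names, the layoutxlm-base-uncased weight identifier, and the 7-class XFUND label space are assumptions about the PaddleNLP model zoo, not an excerpt from this repository.

```python
# A minimal sketch, assuming PaddleNLP exposes LayoutXLM under paddlenlp.transformers
# and provides "layoutxlm-base-uncased" pretrained weights. The actual SER/RE training
# in this repository is driven by tools/train.py and the YAML configs shown below.
from paddlenlp.transformers import LayoutXLMTokenizer, LayoutXLMForTokenClassification

tokenizer = LayoutXLMTokenizer.from_pretrained("layoutxlm-base-uncased")
# Assumption: 7 SER classes for XFUND (O plus B-/I- tags for HEADER, QUESTION, ANSWER).
model = LayoutXLMForTokenClassification.from_pretrained("layoutxlm-base-uncased", num_classes=7)

# Tokenize a text fragment; real SER inputs also include token bounding boxes and the image.
encoded = tokenizer("姓名：张三")
print(encoded["input_ids"])
```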
This project is an open-source implementation of LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding on Paddle 2.2, and includes fine-tuning code on the XFUND dataset.
We evaluate the algorithms on the Chinese subset of the XFUND dataset, and the performance is as follows:
| Model | Task | hmean | Model download address |
| --- | --- | --- | --- |
| LayoutXLM | SER | 0.9038 | link |
| LayoutXLM | RE | 0.7483 | link |
| LayoutLMv2 | SER | 0.8544 | link |
| LayoutLMv2 | RE | 0.6777 | link |
| LayoutLM | SER | 0.7731 | link |
Note: The test images are from the XFUND dataset.
Boxes with different colors in the figure represent different categories. The XFUND dataset has three categories: QUESTION, ANSWER, and HEADER:
- Dark purple: HEADER
- Light purple: QUESTION
- Army green: ANSWER
The corresponding category and OCR recognition result are also marked at the upper left of each OCR detection box.
The red box in the figure represents a question, the blue box represents an answer, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the upper left of each OCR detection box.
- (1) Install PaddlePaddle
python3 -m pip install --upgrade pip
# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple
# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple
For more requirements, please refer to the instructions in the Installation Documentation. A quick sanity check of the installation is sketched below.
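After installing, the following check confirms that PaddlePaddle imports correctly and that the GPU build can see CUDA; paddle.utils.run_check() is PaddlePaddle's built-in self-test.

```python
# Quick sanity check of the PaddlePaddle installation.
import paddle

print(paddle.__version__)                     # expect >= 2.2
print(paddle.device.is_compiled_with_cuda())  # True if the GPU build is installed
paddle.utils.run_check()                      # runs a small program to verify the install
```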
- (1) pip install PaddleOCR whl package quickly (prediction only)
python3 -m pip install paddleocr
- (2) Download VQA source code (prediction + training)
[Recommended] git clone https://github.com/PaddlePaddle/PaddleOCR
# If the clone fails due to network problems, you can also use the repository mirrored on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR
# Note: The Gitee mirror may not sync with this GitHub project in real time; there can be a delay of 3 to 5 days. Please prefer the recommended method.
- (3) Install the VQA requirements
python3 -m pip install -r ppstructure/vqa/requirements.txt
If you want to experience prediction directly, you can download the pretrained models we provide, skip the training process, and run prediction right away.
- Download the processed dataset
The download address of the processed XFUND Chinese dataset: https://paddleocr.bj.bcebos.com/dataset/XFUND.tar.
Download and unzip the dataset, then place it in the current directory.
wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
- Convert the dataset
If you need to train on other XFUND data, you can use the following command to convert the dataset; a sketch of the original annotation structure that the converter reads follows the command.
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
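For reference, the sketch below walks the original XFUND annotation JSON that the converter consumes. The field names (documents, img, document, box, text, label, linking) follow the published XFUND format; the normalized output written by trans_xfun_data.py is not reproduced here.

```python
# A minimal sketch of reading the original XFUND annotation JSON that
# trans_xfun_data.py converts. Field names follow the published XFUND format;
# the normalized output format is defined by the conversion script itself.
import json

with open("path/to/json_path", encoding="utf-8") as f:
    anno = json.load(f)

for doc in anno["documents"]:
    img = doc["img"]
    print(img["fname"], img["width"], img["height"])
    for region in doc["document"]:
        # Each text region carries an id, the text string, a bounding box,
        # a semantic label (header/question/answer/other) and linking pairs for RE.
        print(region["id"], region["label"], region["text"], region["box"], region["linking"])
```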
- Download the pretrained models
mkdir pretrain && cd pretrain
#download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
#download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
Before starting training, you need to modify the following four fields in the configuration file (a sketch of setting them programmatically follows this list):
- Train.dataset.data_dir: the directory where the training set images are stored
- Train.dataset.label_file_list: the path to the training set label file
- Eval.dataset.data_dir: the directory where the validation set images are stored
- Eval.dataset.label_file_list: the path to the validation set label file
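If you would rather set these fields programmatically than edit the YAML by hand, the sketch below does a plain YAML round trip. It assumes PyYAML is available and that the XFUND archive unpacks into XFUND/zh_train and XFUND/zh_val; the zh_val label path matches the evaluation command later in this document, while the zh_train paths and the output filename are assumptions to adjust for your own data.

```python
# A minimal sketch (assuming PyYAML) that points the four dataset fields of
# configs/vqa/ser/layoutxlm.yml at a locally unpacked XFUND Chinese dataset.
# The XFUND/zh_train paths are assumptions -- adjust them to your own layout.
import yaml

with open("configs/vqa/ser/layoutxlm.yml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

cfg["Train"]["dataset"]["data_dir"] = "XFUND/zh_train/image"
cfg["Train"]["dataset"]["label_file_list"] = ["XFUND/zh_train/xfun_normalize_train.json"]
cfg["Eval"]["dataset"]["data_dir"] = "XFUND/zh_val/image"
cfg["Eval"]["dataset"]["label_file_list"] = ["XFUND/zh_val/xfun_normalize_val.json"]

# Write to a new file so the shipped config stays untouched (hypothetical filename).
with open("configs/vqa/ser/layoutxlm_local.yml", "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
# Then train with: python3 tools/train.py -c configs/vqa/ser/layoutxlm_local.yml
```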
- Start training
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
Finally, precision, recall, hmean and other indicators will be printed; hmean is the harmonic mean of precision and recall (see the sketch below). The training log, the best model, and the model of the latest epoch will be saved in the ./output/ser_layoutxlm/ folder.
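The helper below just reproduces the hmean formula for a given precision/recall pair; the actual metrics printed during training and evaluation are computed by this repository's evaluation code.

```python
# hmean as printed in the logs is the harmonic mean (F1) of precision and recall.
def hmean(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: precision 0.91 and recall 0.90 give an hmean of about 0.9050.
print(f"{hmean(0.91, 0.90):.4f}")
```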
- Resume training
To resume training, assign the folder path of the previously trained model to the Architecture.Backbone.checkpoints field.
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
- Evaluate
Evaluation requires assigning the folder path of the model to be evaluated to the Architecture.Backbone.checkpoints field.
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
Finally, precision, recall, hmean and other indicators will be printed.
- Use the OCR engine + SER pipeline for prediction
Use the following command to run OCR engine + SER pipeline prediction, taking the pretrained SER model as an example:
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_42.jpg
Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the Global.save_res_path field of the config. The prediction result text file is named infer_results.txt.
- End-to-end evaluation of the OCR engine + SER prediction system
First use the tools/infer_vqa_token_ser.py script to run prediction on the dataset, then use the following command to evaluate.
export CUDA_VISIBLE_DEVICES=0
python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
- Start training
Before starting training, you need to modify the following four fields in the configuration file:
- Train.dataset.data_dir: the directory where the training set images are stored
- Train.dataset.label_file_list: the path to the training set label file
- Eval.dataset.data_dir: the directory where the validation set images are stored
- Eval.dataset.label_file_list: the path to the validation set label file
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
Finally, precision, recall, hmean and other indicators will be printed. The training log, the best model, and the model of the latest epoch will be saved in the ./output/re_layoutxlm/ folder.
- Resume training
To resume training, assign the folder path of the previously trained model to the Architecture.Backbone.checkpoints field.
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
- Evaluate
Evaluation requires assigning the folder path of the model to be evaluated to the Architecture.Backbone.checkpoints field.
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
Finally, precision, recall, hmean and other indicators will be printed.
- Use the OCR engine + SER + RE pipeline for prediction
Use the following command to run OCR engine + SER + RE pipeline prediction, taking the pretrained SER and RE models as an example:
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the Global.save_res_path field of the config. The prediction result text file is named infer_results.txt.
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.