This package provides an example of training PICK on the DocBank dataset.

Dataset details

Statistics

Split	Abstract	Author	Caption	Date	Equation	Figure	Footer	List	Paragraph	Reference	Section	Table	Title	Total
Train	25,387	25,909	106,723	6,391	161,140	90,429	38,482	44,927	398,086	44,813	180,774	19,638	21,688	400,000
	6.35%	6.48%	26.68%	1.60%	40.29%	22.61%	9.62%	11.23%	99.52%	11.20%	45.19%	4.91%	5.42%	100.00%
Dev	3,164	3,286	13,443	797	20,154	11,463	4,804	5,609	49,759	5,549	22,666	2,374	2,708	50,000
	6.33%	6.57%	26.89%	1.59%	40.31%	22.93%	9.61%	11.22%	99.52%	11.10%	45.33%	4.75%	5.42%	100.00%
Test	3,176	3,277	13,476	832	20,244	11,378	4,876	5,553	49,762	5,641	22,384	2,505	2,729	50,000
	6.35%	6.55%	26.95%	1.66%	40.49%	22.76%	9.75%	11.11%	99.52%	11.28%	44.77%	5.01%	5.46%	100.00%
Total	31,727	32,472	133,642	8,020	201,538	113,270	48,162	56,089	497,607	56,003	225,824	24,517	27,125	500,000
	6.35%	6.49%	26.73%	1.60%	40.31%	22.65%	9.63%	11.22%	99.52%	11.20%	45.16%	4.90%	5.43%	100.00%
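
As a quick sanity check on the table, each percentage appears to be the count in the row above it divided by that split's Total. The snippet below is only an illustration of that arithmetic; the helper name `share` is hypothetical and not part of this package.

```python
def share(count: int, total: int) -> str:
    """Format count/total as a percentage with two decimals."""
    return f"{100 * count / total:.2f}%"


# Values copied from the Train row of the statistics table.
TRAIN_TOTAL = 400_000

for name, count in [("Abstract", 25_387), ("Date", 6_391), ("Paragraph", 398_086)]:
    print(name, share(count, TRAIN_TOTAL))
```

The per-label counts add up to more than the Total, so the percentages are not expected to sum to 100%.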

Annotation

There are 12 labels: abstract, author, caption, equation, figure, footer, list, paragraph, reference, section, table, title. During conversion, blank pages are ignored, since they are probably annotation errors.
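
The blank-page rule could be sketched roughly as follows. This is not the package's actual conversion code: it assumes each annotation file holds one token per line with tab-separated fields and the label in the last field, and it treats a page as blank when no line carries a recognized label. The names `LABELS`, `page_labels` and `is_blank_page` are hypothetical.

```python
# Labels as listed in this README (assumed to be lowercase in the files).
LABELS = {
    "abstract", "author", "caption", "equation", "figure", "footer",
    "list", "paragraph", "reference", "section", "table", "title",
}


def page_labels(annotation_text: str) -> set:
    """Collect the recognized labels found in one page's annotation text."""
    found = set()
    for line in annotation_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # empty or malformed line, nothing to read
        label = fields[-1].strip().lower()
        if label in LABELS:
            found.add(label)
    return found


def is_blank_page(annotation_text: str) -> bool:
    """A page counts as blank here when it has no recognized label at all."""
    return not page_labels(annotation_text)
```

A converter built this way would simply skip any page for which `is_blank_page` returns `True`.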

Usage

1. Download the image zip files (~47GB) and the annotation zip file (~3GB) from DocBank. We assume you have saved those files to the folder ${BASE_DATA_DIR}. You will end up with 10 image zip files (DocBank_500K_ori_img.zip.0*) and one annotation zip file (DocBank_500K_txt.zip) in ${BASE_DATA_DIR};
2. Merge the image zip files into one file named DocBank_500K_ori_img.zip;
3. Run the scripts. They convert the annotation files and split the dataset. You will end up with train, dev and test folders under ${BASE_DATA_DIR}.
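
The merge step can be sketched in Python as below. It assumes the `DocBank_500K_ori_img.zip.0*` files are plain byte-wise pieces of a single archive, so that concatenating them in name order reproduces the original zip; the function name `merge_parts` is hypothetical and is not one of this package's scripts.

```python
import shutil
from pathlib import Path


def merge_parts(base_dir, merged_name="DocBank_500K_ori_img.zip"):
    """Concatenate <merged_name>.0* parts, in name order, into <merged_name>."""
    base = Path(base_dir)
    # Lexicographic order matches numeric order for zero-padded suffixes.
    parts = sorted(base.glob(merged_name + ".0*"))
    if not parts:
        raise FileNotFoundError(f"no {merged_name}.0* parts found in {base}")
    target = base / merged_name
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, low memory use
    return target
```

For example, `merge_parts(os.environ["BASE_DATA_DIR"])` would write the merged archive next to the parts. The parts are left in place, so roughly twice the archive size in free disk space is needed during the merge.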
