This implementation requires the input data in the following format:
- `train_samples_list.csv` file: multiple lines, each with the format `index,document_type,file_name`.
- `boxes_and_transcripts` folder: `file_name.tsv` files.
  - Every `file_name.tsv` file has multiple lines, each with the format `index,box_coordinates (clockwise 8 values), transcripts,box_entity_types`.
- `images` folder: `file_name.jpg` files.
- `entities` folder (optional): `file_name.txt` files.
  - Every `file_name.txt` file contains a JSON-format string giving the exact label value of every entity.
  - If `iob_tagging_type` is set to `box_level`, this folder is not used; the `box_entity_types` in the `file_name.tsv` files of the `boxes_and_transcripts` folder are used as the entity labels instead. Otherwise, this folder must be provided.
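
To make the per-line format `index,box_coordinates (clockwise 8 values), transcripts,box_entity_types` concrete, here is a minimal parsing sketch. It is not the implementation's own loader: the names `BoxLine` and `parse_box_line` are hypothetical, and it assumes that the first nine comma-separated fields are the index and the eight coordinates, that the last field is the entity type, and that any commas in between belong to the transcript.

```python
from typing import List, NamedTuple, Optional


class BoxLine(NamedTuple):
    """One parsed line of a file_name.tsv file (hypothetical helper type)."""
    index: int
    box: List[float]             # 8 values: four (x, y) corners, clockwise
    transcript: str
    entity_type: Optional[str]   # None when the line carries no label


def parse_box_line(line: str, has_entity_type: bool = True) -> BoxLine:
    """Split one line into index, box coordinates, transcript and label.

    Assumption: the transcript may itself contain commas, so it is rebuilt
    from every field between the 8 coordinates and the optional last field.
    """
    parts = line.rstrip("\n").split(",")
    min_fields = 11 if has_entity_type else 10
    if len(parts) < min_fields:
        raise ValueError(f"expected at least {min_fields} fields, got {len(parts)}")

    index = int(parts[0])
    box = [float(value) for value in parts[1:9]]

    if has_entity_type:
        transcript = ",".join(parts[9:-1])
        entity_type: Optional[str] = parts[-1].strip()
    else:
        transcript = ",".join(parts[9:])
        entity_type = None

    return BoxLine(index, box, transcript, entity_type)


# "total" is an illustrative label, not one defined by the source.
example = parse_box_line("1,10,10,50,10,50,30,10,30,TOTAL: 1,200.00,total")
print(example.transcript, example.entity_type)
```

Passing `has_entity_type=False` handles lines that end after the transcript, under the same field-order assumption.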
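Input without entity labels (the line format omits `box_entity_types`, and no `entities` folder is listed) uses a reduced layout:
- `boxes_and_transcripts` folder: `file_name.tsv` files.
  - Every `file_name.tsv` file has multiple lines, each with the format `index,box_coordinates (clockwise 8 values), transcripts`.
- `images` folder: `file_name.jpg` files.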
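
Because each `file_name.tsv` is paired with a `file_name.jpg`, it can be useful to check a dataset directory before running anything on it. The sketch below is one possible sanity check, not part of the implementation: `find_missing_images` is a hypothetical helper, and it assumes the two folders sit side by side under one root directory and that paired files share the same base name.

```python
from pathlib import Path
from typing import List


def find_missing_images(root: Path) -> List[str]:
    """Return base names of .tsv files that have no matching .jpg image.

    Assumed layout (hypothetical root):
        root/boxes_and_transcripts/file_name.tsv
        root/images/file_name.jpg
    """
    boxes_dir = root / "boxes_and_transcripts"
    images_dir = root / "images"

    missing = []
    for tsv_path in sorted(boxes_dir.glob("*.tsv")):
        image_path = images_dir / f"{tsv_path.stem}.jpg"
        if not image_path.is_file():
            missing.append(tsv_path.stem)
    return missing
```

An empty list means every transcript file has a matching image; any returned names identify the samples to fix or drop.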