HangulDB-Image

Korean handwriting dataset parsed from the HangulDB.

Samples

Images vary in width and height. To stay consistent with the original data, I intentionally preserve each image's original dimensions.

b0a1

b0a1/1.jpg b0a1/2.jpg b0a1/3.jpg b0a1/4.jpg b0a1/5.jpg b0a1/6.jpg b0a1/7.jpg

bad0

bad0/1.jpg bad0/2.jpg bad0/3.jpg bad0/4.jpg bad0/5.jpg bad0/6.jpg bad0/7.jpg

b8aa

b8aa/1.jpg b8aa/2.jpg b8aa/3.jpg b8aa/4.jpg b8aa/5.jpg b8aa/6.jpg b8aa/7.jpg
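The folder names above look like two-byte codes written in hexadecimal. The README does not say which encoding they use. Assuming they are KS X 1001 (EUC-KR) code points, which would be consistent with PE92's 2350 classes matching the 2350 Hangul syllables in that standard, a label could be mapped back to its character with a sketch like the following. `label_to_char` is a hypothetical helper name, not part of this repo.

```python
def label_to_char(label: str) -> str:
    """Decode a hex folder name such as 'b0a1' into a Hangul character.

    Assumption: labels are EUC-KR (KS X 1001) byte pairs written in hex.
    """
    return bytes.fromhex(label).decode("euc_kr")


print(label_to_char("b0a1"))  # prints 가, the first Hangul syllable in KS X 1001
```

If the assumption is wrong for this dataset, the decode step will either raise `UnicodeDecodeError` or return an unrelated character, so it is worth comparing a few decoded labels against the images.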

Datasets

This repo contains PE92, SERI95, and HanDB.

  • PE92 contains 2350 classes, each with 100 samples.
  • SERI95 contains 520 classes, each with 1000 samples.
  • HanDB merges SERI95 and PE92: the 520 classes shared with SERI95 have 1100 samples each, and the remaining 1830 classes (2350 - 520) have 100 samples each.
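As a quick arithmetic check, the total image counts implied by the list above can be worked out as follows. This is only a sketch based on the stated numbers; it assumes every class has exactly the stated number of samples and that HanDB is a plain union of the other two sets with nothing dropped.

```python
# Class counts and per-class sample counts as stated in the list above.
pe92_classes, pe92_per_class = 2350, 100
seri95_classes, seri95_per_class = 520, 1000

pe92_total = pe92_classes * pe92_per_class        # 235000 images
seri95_total = seri95_classes * seri95_per_class  # 520000 images

# Assumption: HanDB keeps every sample from both sets.
handb_total = pe92_total + seri95_total           # 755000 images

print(pe92_total, seri95_total, handb_total)
```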

Architecture

All three datasets share the same directory structure:

<dataset_name>/<label>/<sample_index>.jpg
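Given that layout, one way to walk a dataset is to treat each subfolder as a label and each jpg inside it as a sample. The sketch below uses only the standard library; `iter_samples` is a hypothetical helper, and it assumes every file stem is an integer sample index, as in the samples shown above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_samples(dataset_dir: str) -> Iterator[Tuple[str, Path]]:
    """Yield (label, image_path) for every <label>/<sample_index>.jpg file."""
    root = Path(dataset_dir)
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Sort by numeric index so that 10.jpg comes after 9.jpg.
        images = sorted(label_dir.glob("*.jpg"), key=lambda p: int(p.stem))
        for image_path in images:
            yield label_dir.name, image_path
```

For example, `for label, path in iter_samples("PE92"): ...` would visit every PE92 image in label order, assuming the dataset folder is named `PE92` and sits in the working directory.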

Warning

In PE92, the last few samples of each class include some mislabeled images.
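One possible way to work around this is to skip the highest-indexed samples of each PE92 class. The README does not say how many samples are affected, so `n_drop` below is a hypothetical parameter that you would need to choose by inspecting the images; `drop_trailing_samples` is likewise an illustrative name, not part of this repo.

```python
from pathlib import Path
from typing import List


def drop_trailing_samples(paths: List[Path], n_drop: int) -> List[Path]:
    """Return one class's sample paths without its n_drop highest indices.

    Assumption: file stems are integer sample indices (1.jpg, 2.jpg, ...).
    """
    ordered = sorted(paths, key=lambda p: int(p.stem))
    keep = max(len(ordered) - n_drop, 0)  # never go below zero samples
    return ordered[:keep]
```

This drops clean samples along with mislabeled ones, so it trades a little data for fewer label errors.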

Verification

parser.ipynb parses an hgu1 file into individual jpg files. You can run it to check that the original dataset is parsed correctly.