Lots of XSD validation errors #78

Open
kba opened this issue Jun 7, 2020 · 2 comments
Member

kba commented Jun 7, 2020

Found thanks to OCR-D/core#470:

<report valid="false">
  <error>assets/data/page_dewarp/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:22:49.558544' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/leptonica_samples/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:14:27.999250' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/leptonica_samples/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/leptonica_samples/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/column-samples/data/mets.xml: Line 39: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 16:44:18.171353' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 34: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 37: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 40: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 43: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 48: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 51: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 54: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 57: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 60: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 63: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 69: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/grenzboten-test/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-08-07 17:52:26.109166' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/grenzboten-test/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>FILE_0001_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000001' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>FILE_0001_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
  <error>FILE_0002_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000002' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>FILE_0002_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
</report>
<report valid="false">
  <error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 59: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0017' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0020' is not a valid value of the atomic type 'xs:ID'.</error>
</report>
<report valid="false">
  <error>assets/data/scribo-test/data/mets.xml: Line 33: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/communist_manifesto/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-03-24 22:16:26.006316' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/communist_manifesto/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/dfki-testdata/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-22 10:31:05.897472' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/dfki-testdata/data/mets.xml: Line 32: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
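
The report above repeats a handful of error kinds across many files. A minimal sketch for tallying them, assuming the `<error>` texts have already been extracted as plain strings and follow the message layout shown above; the names `classify` and `tally` are hypothetical and not part of OCR-D/core:

```python
import re
from collections import Counter

# Matches the message layout seen in the report above:
#   <file>: Line <n>: Element '<name>'[, attribute '<attr>']: <message>
ERROR_RE = re.compile(
    r"^(?P<file>.+?): Line (?P<line>\d+): "
    r"Element '(?P<element>[^']+)'"
    r"(?:, attribute '(?P<attribute>[^']+)')?: (?P<message>.*)$"
)


def classify(error_text):
    """Return (local element name, attribute or None, kind), or None if unparsable."""
    match = ERROR_RE.match(error_text.strip())
    if match is None:
        return None
    # Drop the '{namespace}' prefix from the element name.
    local_name = match.group("element").rsplit("}", 1)[-1]
    message = match.group("message")
    if "is required but missing" in message:
        kind = "missing attribute"
    elif "is not a valid value" in message:
        kind = "invalid value"
    elif "This element is not expected" in message:
        kind = "unexpected element"
    else:
        kind = "other"
    return local_name, match.group("attribute"), kind


def tally(error_texts):
    """Count errors per (element, attribute, kind), skipping unparsable lines."""
    counts = Counter()
    for text in error_texts:
        key = classify(text)
        if key is not None:
            counts[key] += 1
    return counts
```

Grouping by element and kind rather than by file makes it easier to see which problems can be fixed in bulk.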
kba assigned tboenig and kba Jun 7, 2020
Contributor

bertsky commented Jul 17, 2020

  • invalid CREATEDATE (not a valid xs:dateTime):
assets/data/page_dewarp/data/mets.xml
assets/data/leptonica_samples/data/mets.xml
assets/data/DIBCO11-machine_printed/data/mets.xml
assets/data/grenzboten-test/data/mets.xml
assets/data/communist_manifesto/data/mets.xml
assets/data/dfki-testdata/data/mets.xml
  • missing LOCTYPE on FLocat:
assets/data/page_dewarp/data/mets.xml
assets/data/leptonica_samples/data/mets.xml
assets/data/DIBCO11-machine_printed/data/mets.xml
  • affected by OCR-D/core#499 (merely needs re-ordering between dmdSec and structMap):
assets/data/grenzboten-test/data/mets.xml
assets/data/scribo-test/data/mets.xml
assets/data/communist_manifesto/data/mets.xml
  • affected by OCR-D/core#499 (merely needs re-ordering between fileSec and structMap):
assets/data/column-samples/data/mets.xml
assets/data/dfki-testdata/data/mets.xml
  • file ID re-used (repeated) as page ID (needs manual fix):
assets/data/kant_aufklaerung_1784-binarized/data/mets.xml
  • invalid CDATA (... – manual fix or remove):
assets/data/SBB0000F29300010000/data/OCR-D-GT-PAGE/FILE_0001_FULLTEXT.xml
assets/data/SBB0000F29300010000/data/OCR-D-GT-PAGE/FILE_0002_FULLTEXT.xml

But as of now there are even more errors:

  • empty reading order group (intentional?):
assets/data/gutachten/data/TEMP1/PAGE_TEMP1
  • wrongly formatted regionRef (intentional?):
assets/data/gutachten/data/TEMP2/PAGE_TEMP2_1.xml
assets/data/gutachten/data/TEMP2/PAGE_TEMP2_2.xml
  • pc:PcGts/@pcGtsId differs from mets:file/@ID (newly introduced, ambitious goal)

nearly everywhere...

Member Author

kba commented Jul 23, 2020

  • invalid CREATEDATE (not a valid xs:dateTime):
assets/data/page_dewarp/data/mets.xml
assets/data/leptonica_samples/data/mets.xml
assets/data/DIBCO11-machine_printed/data/mets.xml
assets/data/grenzboten-test/data/mets.xml
assets/data/communist_manifesto/data/mets.xml
assets/data/dfki-testdata/data/mets.xml

fixed (w/o regenerating)
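
For reference, `xs:dateTime` requires a `T` between date and time, while the rejected CREATEDATE values in the report use a space. A minimal sketch of converting such a value in place, without regenerating the METS; `to_xs_datetime` is a hypothetical helper, not the code used for the actual fix:

```python
from datetime import datetime


def to_xs_datetime(value):
    """Convert 'YYYY-MM-DD HH:MM:SS.ffffff' into the 'T'-separated xs:dateTime form.

    Parsing first (instead of a plain string replace) rejects values that
    are not timestamps at all by raising ValueError.
    """
    return datetime.fromisoformat(value.strip()).isoformat()
```

For example, `to_xs_datetime("2018-11-14 18:22:49.558544")` yields `2018-11-14T18:22:49.558544`.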

  • missing LOCTYPE on FLocat:
assets/data/page_dewarp/data/mets.xml
assets/data/leptonica_samples/data/mets.xml
assets/data/DIBCO11-machine_printed/data/mets.xml

fixed
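
A minimal sketch of how the missing LOCTYPE attributes could be added, using only the standard library. It assumes that local paths should get `LOCTYPE="OTHER"` with `OTHERLOCTYPE="FILE"` and that http(s) references should get `LOCTYPE="URL"`; the function name `add_missing_loctype` is hypothetical and this is not necessarily how the assets were fixed:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def add_missing_loctype(mets_xml):
    """Return (patched XML string, number of FLocat elements that were patched)."""
    # Keep the conventional prefixes when serializing.
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(mets_xml)
    fixed = 0
    for flocat in root.iter(f"{{{METS_NS}}}FLocat"):
        if "LOCTYPE" in flocat.attrib:
            continue  # already has the required attribute
        href = flocat.get(f"{{{XLINK_NS}}}href", "")
        if href.startswith(("http://", "https://")):
            flocat.set("LOCTYPE", "URL")
        else:
            # Assumption: anything else is a local file path.
            flocat.set("LOCTYPE", "OTHER")
            flocat.set("OTHERLOCTYPE", "FILE")
        fixed += 1
    return ET.tostring(root, encoding="unicode"), fixed
```

Existing LOCTYPE values are left untouched, so the function can be run repeatedly on the same file.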

  • affected by OCR-D/core#499 (merely needs re-ordering between dmdSec and structMap):
assets/data/grenzboten-test/data/mets.xml
assets/data/scribo-test/data/mets.xml
assets/data/communist_manifesto/data/mets.xml

fixed

  • affected by OCR-D/core#499 (merely needs re-ordering between fileSec and structMap):
assets/data/column-samples/data/mets.xml
assets/data/dfki-testdata/data/mets.xml

fixed
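
Both re-ordering cases come down to the same thing: the METS schema expects the children of `mets:mets` in the order metsHdr, dmdSec, amdSec, fileSec, structMap, structLink, behaviorSec. A minimal sketch of restoring that order with the standard library; `reorder_sections` is a hypothetical helper, not the fix from OCR-D/core#499:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Child order of mets:mets as required by the METS schema.
SECTION_ORDER = [
    "metsHdr", "dmdSec", "amdSec", "fileSec",
    "structMap", "structLink", "behaviorSec",
]


def reorder_sections(mets_xml):
    """Return the METS document with its top-level sections in schema order."""
    ET.register_namespace("mets", METS_NS)
    root = ET.fromstring(mets_xml)

    def rank(child):
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name in SECTION_ORDER:
            return SECTION_ORDER.index(local_name)
        return len(SECTION_ORDER)  # unknown elements go last

    # sorted() is stable, so repeated sections (e.g. several dmdSec)
    # keep their relative order.
    root[:] = sorted(root, key=rank)
    return ET.tostring(root, encoding="unicode")
```

Note that this only moves elements; whitespace between sections may end up slightly different from the input.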

  • file ID re-used (repeated) as page ID (needs manual fix):
assets/data/kant_aufklaerung_1784-binarized/data/mets.xml

fixed
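
`xs:ID` values must be unique within a document, which is presumably why the validator rejects `P_0017` and `P_0020` even though they are well-formed names. A minimal sketch for spotting such repeats, assuming only the un-namespaced `ID` attribute needs checking; `duplicate_ids` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text):
    """Return the sorted list of ID attribute values used more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        element.get("ID")
        for element in root.iter()
        if element.get("ID") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```

An empty result means every `ID` in the document is unique; deciding which of the clashing elements to rename still needs a manual look.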

  • invalid CDATA (... – manual fix or remove):
assets/data/SBB0000F29300010000/data/OCR-D-GT-PAGE/FILE_0001_FULLTEXT.xml
assets/data/SBB0000F29300010000/data/OCR-D-GT-PAGE/FILE_0002_FULLTEXT.xml

But as of now there are even more errors:

  • empty reading order group (intentional?):
assets/data/gutachten/data/TEMP1/PAGE_TEMP1

Yes, this is intentional, to test the reading order methods in the generateDS API.

  • wrongly formatted regionRef (intentional?):
assets/data/gutachten/data/TEMP2/PAGE_TEMP2_1.xml
assets/data/gutachten/data/TEMP2/PAGE_TEMP2_2.xml

same

  • pc:PcGts/@pcGtsId differs from mets:file/@ID (newly introduced, ambitious goal)

nearly everywhere...

Fixed manually where it wasn't too much effort. This will be a perfect use case for ocrd-sanitize, implementing https://github.com/mikegerber/sbb-useful-hacks/blob/master/mets-fixers/fix-page-pcgtsid-to-be-mets-file-id
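
A minimal sketch of the core step such a fixer would need, namely setting `pc:PcGts/@pcGtsId` to a given `mets:file/@ID`. This is not the linked script and not an ocrd-sanitize implementation; `set_pcgtsid` is a hypothetical helper, and looking up the right file ID in the METS is left out:

```python
import xml.etree.ElementTree as ET


def set_pcgtsid(page_xml, file_id):
    """Return (patched PAGE XML string, whether pcGtsId had to be changed)."""
    root = ET.fromstring(page_xml)
    # Serialize with the document's own PAGE namespace as the default
    # namespace, so elements do not get an artificial 'ns0:' prefix.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    changed = root.get("pcGtsId") != file_id
    root.set("pcGtsId", file_id)
    return ET.tostring(root, encoding="unicode"), changed
```

A side effect worth noting: a file ID such as `FILE_0001_FULLTEXT` is a valid `xs:ID`, whereas a value like `00000001` is not, because an `xs:ID` must not start with a digit.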

kba added a commit that referenced this issue Jul 23, 2020