ALTO reader: GT type (subject) from @VALUE ? #3

Open
bertsky opened this issue Sep 12, 2022 · 1 comment
Labels
question Further information is requested

Comments

@bertsky

bertsky commented Sep 12, 2022

    gt_els = [e for e in gt_type_el
              if e.getAttribute('ID') == "ulb_groundtruth_type"]
    if len(gt_els) == 1:
        value = gt_els[0].getAttribute('VALUE')

I wonder what ALTO version OtherTag/@VALUE conforms to. Is that a Transkribus or ULB extension, @M3ssman?
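For reference, a self-contained sketch of the lookup in question. The sample document and the `read_gt_type` helper are made up for illustration, and the fallback to `LABEL` rests on my reading that `LABEL` (not `VALUE`) is the attribute the ALTO schema defines on tag elements; treat that as an assumption to check against the schema.

```python
from xml.dom import minidom

# Hypothetical sample; the VALUE/LABEL contents are invented for illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="ulb_groundtruth_type" LABEL="gt_type" VALUE="newspaper"/>
  </Tags>
  <Layout/>
</alto>"""


def read_gt_type(xml_text, tag_id="ulb_groundtruth_type"):
    """Return the GT type from the OtherTag with the given ID, or None."""
    doc = minidom.parseString(xml_text)
    gt_type_el = doc.getElementsByTagName("OtherTag")
    gt_els = [e for e in gt_type_el if e.getAttribute("ID") == tag_id]
    if len(gt_els) != 1:
        return None
    el = gt_els[0]
    # getAttribute returns "" for a missing attribute, so `or` falls through.
    # Assumption: LABEL is the schema-defined attribute, VALUE the extension.
    return el.getAttribute("VALUE") or el.getAttribute("LABEL") or None


gt_type = read_gt_type(SAMPLE)
```

Reading `LABEL` as a fallback would keep files readable that avoid the non-standard attribute, without breaking the existing ones.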

Generally, IMO we do need to support this kind of information in the annotation files themselves (PAGE/ALTO), but should also consider the case where it enters as metadata (METS/MODS). For the latter, we have the https://github.com/ocr-d/gt-labelling schema, but that does not contain any definitions for subject/genre/content class yet. There is a classification schema for content items in ENMAP (§10 Annex 2), and a set of newspaper article types in DTABf, for example. Somewhat related, one could also consider the non-structural (i.e. metadata) types of the DFG Strukturdatenset, or the general set of text sorts in DTA and DWDS...

Anyway, back to the annotation schema in ALTO: Why OtherTag in the first place – shouldn't this kind of information be placed in LayoutTag by convention? On the PAGE side, it's always MetadataItem I suppose.
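To make the PAGE side concrete, here is a rough sketch of writing such a value into a `MetadataItem`. It assumes the PAGE 2019-07-15 namespace and the `type`/`name`/`value` attributes as I understand them from the PAGE schema; the item name `gt_type`, the value `newspaper` and the helper `add_gt_type` are hypothetical and not taken from any existing convention.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE content schema version 2019-07-15.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("pc", PAGE_NS)


def add_gt_type(metadata_el, gt_type, name="gt_type"):
    """Append a MetadataItem carrying the GT type to a Metadata element.

    The item name "gt_type" is a made-up placeholder.
    """
    return ET.SubElement(
        metadata_el,
        f"{{{PAGE_NS}}}MetadataItem",
        {"type": "other", "name": name, "value": gt_type},
    )


metadata = ET.Element(f"{{{PAGE_NS}}}Metadata")
item = add_gt_type(metadata, "newspaper")
xml_out = ET.tostring(metadata, encoding="unicode")
```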

Here I made a proposal to mirror the gt-labelling info from MODS into the MetadataItem in PAGE BTW.

@kba RFC

@M3ssman
Member

M3ssman commented Sep 12, 2022

Don't worry, this originates from my very first and superficial interpretation of how ALTO can express additional content information.

It has nothing to do with Transkribus, which drops this element anyway due to its limited transformation capabilities.

With version 2.1 (2014), according to the ALTO schema, they introduced annotations like LayoutTag, StructureTag, RoleTag, NamedEntityTag and OtherTag. Nowadays I guess they were intended to express neat relations, even from a single String element's TAGREFS, via the NER tag.
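A minimal sketch of how such TAGREFS relations could be resolved, assuming the ALTO v4 namespace and a whitespace-separated list of tag IDs in `TAGREFS`. The sample content and the `resolve_tagrefs` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v4 namespace; other versions use a different URI.
NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Hypothetical sample document.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <NamedEntityTag ID="ne1" TYPE="place" LABEL="Halle"/>
    <OtherTag ID="ot1" LABEL="example"/>
  </Tags>
  <Layout><Page ID="p1"><PrintSpace>
    <TextBlock ID="b1"><TextLine ID="l1">
      <String ID="s1" CONTENT="Halle" TAGREFS="ne1 ot1"/>
      <String ID="s2" CONTENT="und"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def resolve_tagrefs(xml_text):
    """Map each element ID that has TAGREFS to (tag element name, LABEL) pairs."""
    root = ET.fromstring(xml_text)
    tags = {t.get("ID"): t for t in root.findall("a:Tags/*", NS)}
    resolved = {}
    for el in root.iter():
        refs = el.get("TAGREFS")
        if not refs:
            continue
        resolved[el.get("ID")] = [
            # Strip the "{namespace}" prefix to get the local element name.
            (tags[r].tag.split("}")[-1], tags[r].get("LABEL"))
            for r in refs.split()
            if r in tags  # silently skip dangling references
        ]
    return resolved


relations = resolve_tagrefs(SAMPLE)
```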

If I were to do it again (... which is not planned), I'd go for the ComposedBlockType@TYPE attribute, which shall be a string expressing what sort of content the included sub-regions are made of: table, advertisement, ... (example values from the ALTO schema definition).

The type-stuff for Blocks (and Illustrations!) seems to have been part of the spec since the very beginning.
It's dated in the prelude back to 2004, even before Version 1.3 of ALTO was tagged.
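Reading it back would then look roughly like this. It is only a sketch: the namespace is assumed to be ALTO v4, and the sample blocks, their TYPE values and the `block_types` helper are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; TYPE is a free-form string in both elements.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page ID="p1"><PrintSpace>
    <ComposedBlock ID="cb1" TYPE="advertisement">
      <TextBlock ID="b1"/>
      <Illustration ID="il1" TYPE="photo"/>
    </ComposedBlock>
    <ComposedBlock ID="cb2">
      <TextBlock ID="b2"/>
    </ComposedBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def block_types(xml_text, names=("ComposedBlock", "Illustration")):
    """Collect ID -> TYPE for the named block elements that carry a TYPE."""
    root = ET.fromstring(xml_text)
    types = {}
    for el in root.iter():
        # Compare on the local name so any ALTO namespace version matches.
        if el.tag.split("}")[-1] in names and el.get("TYPE"):
            types[el.get("ID")] = el.get("TYPE")
    return types


types = block_types(SAMPLE)
```

Blocks without a TYPE (like `cb2` here) are simply left out, so a reader can tell "untyped" apart from an empty string.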

@M3ssman M3ssman added the question Further information is requested label Nov 8, 2022