Adding dataset LiDi 1.0 project #152

Open
Giorgiaagostini opened this issue Jul 4, 2024 · 2 comments

@Giorgiaagostini

Hello HTR-united team!

Please consider the following dataset description for inclusion in your directory.

Here is our dataset YAML file:

schema: https://htr-united.github.io/schema/2023-06-27/schema.json
title: LiDi1.0-project
url: https://github.com/Giorgiaagostini/LiDi1.0-project
authors:
 - name: Giorgia
   surname: Agostini
   orcid: 0009-0007-9887-5129
   roles:
     - transcriber
     - aligner
     - project-manager
     - quality-control
institutions: []
description: >-
 This repository contains all data relating to the LiDi 1.0 project, in
 particular the HTR ground truth for the 16th-century antiquarian Pirro
 Ligorio, used to create the public Transkribus model Ligorio 0.3 PyL.
project-name: LiDi 1.0
project-website: https://lidiws-limes.cfs.unipi.it
language:
 - ita
production-software: Transkribus
automatically-aligned: false
script:
 - iso: Latn
 - iso: Grek
script-type: only-manuscript
time:
 notBefore: '1568'
 notAfter: '1580'
hands:
 count: '1'
 precision: estimated
license:
 name: CC-BY-SA 4.0
 url: https://creativecommons.org/licenses/by-sa/4.0/
format: Alto-XML
sources:
 - reference: ''
   link: >-
     https://archiviodistatotorino.beniculturali.it/dbadd/visvol_bibl.php?uid=300146
volume:
 - metric: files
   count: 195
citation-file-link: >-
 https://github.com/Giorgiaagostini/LiDi1.0-project/blob/main/Data/Ground%20Truth/CITATION.cff
transcription-guidelines: >-
 - Normalisation of «V» to «U» except in Latin inscriptions;

 - Preservation of the diacritical marks and punctuation as used by the Author
 except for the part in Greek;

 - Where the use of capital and small caps is not distinguished, it is
 transcribed according to the grammatical rules of the Italian language;

 - Tagging of uncertain words with the «unclear» tag;

 - Tagging of illegible words with three dots (...) and the «unclear» tag;

 - Use of the angle dash, instead of the hyphen, to divide words into syllables
 at the end of a line.

 Moreover, because of problems displaying Unicode characters from the Ancient
 Symbols block, the Roman denarius sign (U+10196) and the Roman sestertius sign
 (U+10198) were transcribed with astronomical symbols that the author does not
 use:

 Roman denarius sign ➛ ♀ (U+2640 Female sign)

 Roman sestertius sign ➛ ☿ (U+263F Mercury)

 They are to be replaced with the correct signs during post-processing.
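
The post-processing replacement described at the end of the guidelines could look roughly like the sketch below. It is only an illustration: the function names `restore_roman_signs` and `restore_in_directory` are hypothetical and not part of the LiDi repository, and it assumes the ALTO files are UTF-8 encoded and that ♀ and ☿ never occur in the transcriptions with their ordinary meaning (as the guidelines state).

```python
from pathlib import Path

# Placeholder -> intended sign, as listed in the transcription guidelines.
PLACEHOLDERS = {
    "\u2640": "\U00010196",  # FEMALE SIGN -> ROMAN DENARIUS SIGN
    "\u263F": "\U00010198",  # MERCURY -> ROMAN SESTERTIUS SIGN
}
_TABLE = str.maketrans(PLACEHOLDERS)


def restore_roman_signs(text: str) -> str:
    """Replace every placeholder symbol with the Roman sign it stands for."""
    return text.translate(_TABLE)


def restore_in_directory(directory: str, pattern: str = "*.xml") -> int:
    """Rewrite matching files in place and return how many were changed.

    Assumption: files are UTF-8 text; files without placeholders are left untouched.
    """
    changed = 0
    for path in Path(directory).rglob(pattern):
        original = path.read_text(encoding="utf-8")
        restored = restore_roman_signs(original)
        if restored != original:
            path.write_text(restored, encoding="utf-8")
            changed += 1
    return changed
```

Because the replacement is a plain character mapping, it can be run on a copy of the ground truth first and the result compared with the originals before anything is overwritten.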
@alix-tz
Member

alix-tz commented Jul 4, 2024

Hello Giorgia,

Thank you very much for your contribution!

It looks like there are only the XML files in your repository, which is not enough to get a complete GT dataset. I see however that in "sources" you put the link to the image visualizer on the website of the Archivio di stato di Torino. I think it would be useful if you can add, in the README of your dataset repository, clear indications that the images are not included in the dataset but that they can be downloaded there (if they can be?). Basically anything to facilitate the reconstruction of the ground truth dataset.

From comparing the viewer and your data, I have the impression that you pre-processed the images to get single pages instead of double pages. This pre-processing step might be difficult to reproduce in a way that guarantees that the images and the XML files are correctly aligned. If my understanding is right, this is, in my opinion, reason enough to publish your pre-processed images along with the XML files (if the license on the images allows it).

What do you think? Is there anything that can be done in this regard?
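
As an illustration of what "facilitating the reconstruction" could involve, here is a minimal sketch that checks whether each ALTO file has its page image next to it. It rests on assumptions that have not been verified against this repository: that every ALTO file names its image in a `fileName` element (the ALTO schema provides `Description/sourceImageInformation/fileName` for this), and that the images sit in one flat directory. The function names `referenced_image` and `check_pairs` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List, Optional


def referenced_image(alto_path: Path) -> Optional[str]:
    """Return the first <fileName> value found in an ALTO file, if any."""
    root = ET.parse(alto_path).getroot()
    for element in root.iter():
        # Drop the "{namespace}" prefix so different ALTO versions are handled alike.
        if element.tag.rsplit("}", 1)[-1] == "fileName" and element.text:
            return element.text.strip()
    return None


def check_pairs(xml_dir: str, image_dir: str) -> Dict[str, List[str]]:
    """Sort ALTO files by whether the image they reference is present."""
    available = {p.name for p in Path(image_dir).iterdir() if p.is_file()}
    report: Dict[str, List[str]] = {"ok": [], "missing_image": [], "no_reference": []}
    for alto_path in sorted(Path(xml_dir).glob("*.xml")):
        image_name = referenced_image(alto_path)
        if image_name is None:
            report["no_reference"].append(alto_path.name)
        elif image_name in available:
            report["ok"].append(alto_path.name)
        else:
            report["missing_image"].append(alto_path.name)
    return report
```

A check like this only confirms that file names match. It cannot tell whether a re-cropped single page has the same pixel coordinates as the one used for transcription, which is why publishing the pre-processed images remains the safer option.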

@Giorgiaagostini
Author

Giorgiaagostini commented Jul 26, 2024 via email
