Workflow for creating structured TEI documents from Transkribus layout and text recognition
This project is based on page2tei by Dario Kampkaspar and extended with several waves of string replacements and XSLT transformations.
exports/
: Exports from Transkribus in the PAGE format. To process individual parts (groups of pages), the `mets.xml` file can be duplicated and reduced to the relevant pages.
guidelines/
: Documentation for the TEI format of certain phenomena
language_data/
: Plain wordlists in a number of languages. To enable more languages, add lists here and adapt `replace_hyphens(xml_data, language="any")` in `replacements.py`
out/
: Default output folder
page2tei/
: Project code from dariok/page2tei by Dario Kampkaspar
xsd/
: Validation for the documents in REMS
xslt/
: Stylesheets
bibliography.xsl
: Turn list entries into bibliography entries and mark monograph titles
checkpara.xsl
: Testing: Check for paragraph types in unexpected context
collect-blocks.xsl
: Apply standardized blocks for REMS documents
disconnect-style-and-type.xsl
: Merge tags that differ only in layout but keep layout information (REMS documents)
expand-hi.xsl
: Use `hi` and types instead of different elements for markup
id-to-div.xsl
: Add ID to REMS documents
indent.xsl
: Pretty print
join-paragraphs.xsl
: Join paragraphs across page breaks
move-footnotes.xsl
: Move footnotes from page end to the footnote mark
page-numbers.xsl
: Set page number as attribute in page breaks
postprocess-page2tei.xsl
: Processing of additional styles in the same way as page2tei
remove-lb.xsl
: Remove line breaks in insignificant positions
remove-pb.xsl
: Remove page breaks without numbers
remove-position-data.xsl
: Remove the element `facsimile` from the XML
simplify-hi.xsl
: Turn `hi` into elements without attributes to join them across line breaks
string-pack.xsl
: See page2tei
woelfflin-elements.xsl
: Replacement of TEI elements to match the flavour of the project
bibliography.py
: Cascade of operations for the bibliography in REMS
documents.py
: Cascade of operations for the main parts in REMS
gedanken.py
: Cascade of operations for Heinrich Wölfflin's «Gedanken zur Kunstgeschichte» (1941)
introduction.py
: Cascade of operations for the introduction of REMS, and any text without special markup
replacements.py
: Replacements using regular expressions as a catalog of functions
simplify.py
: Cascade of operations for a simple TEI document
transform.py
: XSLT transformations as a catalog of functions
workflow.md
: Step-by-step description of the workflows
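
The cascades listed above combine two kinds of steps: regular-expression replacements and XSLT transformations run with Saxon. The following is a minimal sketch of one step of each kind, not the project's actual code. The names `join_hyphenated` and `saxon_command` are hypothetical, the wordlist is assumed to be a set of lowercase words, and line breaks are assumed to be plain newlines rather than TEI elements. The `-s:`, `-xsl:` and `-o:` options are Saxon's standard command-line options.

```python
import re

# Matches a word split by a hyphen at a line end, e.g. "Kunst-\ngeschichte".
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def join_hyphenated(text, wordlist):
    """Join words hyphenated across a line break if the joined form is known.

    `wordlist` is assumed to be a set of lowercase words (hypothetical format).
    Unknown combinations are left untouched, since the hyphen may be genuine.
    """
    def repl(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in wordlist:
            return joined
        return match.group(0)

    return HYPHEN_BREAK.sub(repl, text)


def saxon_command(source, stylesheet, output, jar="saxon-he-10.5.jar"):
    """Build the argument list for one XSLT transformation with Saxon.

    The list can be passed to subprocess.run(..., check=True).
    """
    return [
        "java", "-jar", jar,
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
    ]
```

A cascade in this style would then be a sequence of such calls, each reading the output of the previous step, for example from the `temp` folder.
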
- Copy `saxon-he-10.5.jar` into the root of this repository
- Make sure the following folders exist in the project root: `exports`, `out`, `temp`
- Extract your export from Transkribus to `exports`
- Choose among the suggested cascades or add your own. `introduction.py` is the most generic one
- Run it like: `python3 introduction.py -i exports/your_export/mets.xml -o out/your_output.xml`
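
The setup steps above can be checked with a small helper before running a cascade. This is only a sketch under stated assumptions: `prepare_workspace` and `cascade_command` are hypothetical names that are not part of the repository, and the helper only creates the three folders and reports whether the Saxon jar is present.

```python
from pathlib import Path

REQUIRED_DIRS = ("exports", "out", "temp")
SAXON_JAR = "saxon-he-10.5.jar"


def prepare_workspace(root):
    """Create the required folders under `root` if they are missing.

    Returns True if the Saxon jar is present in `root`, False otherwise.
    """
    root = Path(root)
    for name in REQUIRED_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return (root / SAXON_JAR).is_file()


def cascade_command(script, mets, output):
    """Build the command line for a cascade script such as introduction.py."""
    return ["python3", script, "-i", str(mets), "-o", str(output)]
```

The returned list from `cascade_command` can be passed to `subprocess.run` once `prepare_workspace` reports that the jar is in place.
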