Common understanding about the workflow requirements #48

Closed
4 tasks done
krvoigt opened this issue Feb 14, 2022 · 3 comments

Comments

@krvoigt

krvoigt commented Feb 14, 2022

As a team, we need a summary of the past discussion and the resulting requirements so that we can discuss them together and generate our next steps from them.

@kba
Member

kba commented Feb 14, 2022

@cneud
Member

cneud commented Feb 17, 2022

summary of the past discussion

The goal of a workflow component for OCR-D goes back to phase I of the project. I'll try to summarize the high-level discussions so far.

A strength of OCR-D lies in its modularity and its flexibility in the use of Processors, with often more than one Processor being available for a given task (or step). Experience, tests and user expectations have shown that, due to the complex and very diverse nature of the documents to be processed, often only tailor-made workflows can guarantee optimal result quality.

The OCR-D functional model shows that OCR is always composed of a sequence of such steps, with the condition that the output (image, text or metadata) of a single Processor must also be made available for subsequent Processors to use (in theory, loops are also conceivable).
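
To make that condition concrete, here is a minimal sketch, assuming a workflow is just an ordered list of steps that each read one named output and write another. The `Step` class, the step names, the labels (`IMG`, `BIN`, ...) and the toy functions are all hypothetical illustrations and not part of any OCR-D API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One workflow step (hypothetical, not an OCR-D API)."""
    name: str
    reads: str                 # label of the output this step consumes
    writes: str                # label under which its own result is stored
    run: Callable[[str], str]  # toy stand-in for a Processor


def run_workflow(steps: List[Step], initial: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order; every result stays available to all later steps."""
    outputs = dict(initial)
    for step in steps:
        if step.reads not in outputs:
            raise KeyError(f"{step.name}: input '{step.reads}' not produced yet")
        outputs[step.writes] = step.run(outputs[step.reads])
    return outputs


steps = [
    Step("binarize", "IMG", "BIN", lambda data: data + " -> binarized"),
    Step("segment", "BIN", "SEG", lambda data: data + " -> segmented"),
    Step("recognize", "SEG", "OCR", lambda data: data + " -> recognized"),
]

results = run_workflow(steps, {"IMG": "page.tif"})
print(results["OCR"])
```

Because all intermediate results are kept under their labels, a later step could just as well read `BIN` instead of `SEG`; that is the "made available for subsequent Processors" part.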

In practice, workflows are often specific to an institution or project, composed of ad-hoc solutions or local scripts and lacking standardized interfaces, descriptions and documentation. This greatly hinders reuse, comparability and replicability.

So far we can summarize the following needs, or rather benefits, of a standardized workflow management component:

  • standardized workflows should support the transparent exchange of information between Processors on a global level
  • standardized workflows should ease the composition of sequences from a diverse range of Processors
  • standardized workflows should provide better means for replicability, evaluation and comparison
  • standardized workflows should help with the reuse of existing workflows by others
  • standardized workflows should capture provenance output that can aid in debugging and optimization

A bit further down the line, standardized workflows could also be used for discovery (i.e. finding an appropriate workflow for a set of documents), or to dynamically compose sequences of Processors merely based on a workflow description.

Other aspects that could greatly benefit from standardized workflows are scalability/parallelization (e.g. while some steps require more time to complete, other steps could potentially already be started and run in parallel to reduce the overall execution time) and error handling/robustness (i.e. the ability to overcome exceptions or crashes of individual Processors without stopping the overall workflow execution).
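
As a rough illustration of both points, the sketch below assumes that pages can be processed independently of each other, so that a slow or crashing step on one page neither blocks nor aborts the others. The function names and the simulated crash are made up for the example; real Processors would be separate programs, not Python callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Processor = Callable[[str], str]  # toy stand-in for a real Processor


def binarize(data: str) -> str:
    return data + ":binarized"


def flaky_segment(data: str) -> str:
    """Toy step that crashes on one page to simulate a Processor failure."""
    if data.startswith("page2"):
        raise RuntimeError("segmentation crashed")
    return data + ":segmented"


def process_page(page: str, steps: List[Processor]) -> Tuple[str, str, str]:
    """Run all steps for one page; report a failure instead of raising."""
    data = page
    for step in steps:
        try:
            data = step(data)
        except Exception as exc:  # isolate the failure to this page
            return page, "failed", f"{step.__name__}: {exc}"
    return page, "ok", data


def run_all(pages: List[str], steps: List[Processor],
            workers: int = 4) -> Dict[str, Tuple[str, str]]:
    """Process pages concurrently and collect a per-page status report."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: process_page(p, steps), pages))
    return {page: (status, detail) for page, status, detail in results}


report = run_all(["page1", "page2", "page3"], [binarize, flaky_segment])
for page, (status, detail) in report.items():
    print(page, status, detail)
```

The per-page report doubles as a very small piece of provenance: it records which step failed and why, while the remaining pages still complete.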

A first step towards standardized workflows was made with

Old WF repo: https://github.com/OCR-D/ocrd-workflows

which basically encapsulates workflows as a simple sequence of Processors in POSIX shell scripts. This has a few drawbacks, e.g. with regard to the validation of the sequence and the data exchange between Processors (see also the example and explanation in the wiki on workflows), or the capture of global provenance information.
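
To illustrate what "validation of the sequence" could mean, here is a small sketch that checks a declarative step list before anything is executed; a plain script would typically only notice such a problem once the failing command actually runs. The tuple layout, the labels and the two checks are assumptions made for the example, not the format of ocrd-workflows or any OCR-D specification.

```python
from typing import List, Set, Tuple

StepSpec = Tuple[str, str, str]  # (step name, input label, output label)


def validate_sequence(steps: List[StepSpec], available: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the sequence is consistent."""
    problems = []
    known = set(available)
    for name, reads, writes in steps:
        if reads not in known:
            problems.append(
                f"{name}: input '{reads}' is never produced before this step")
        if writes in known:
            problems.append(f"{name}: output '{writes}' already exists")
        known.add(writes)
    return problems


good = [("binarize", "IMG", "BIN"), ("segment", "BIN", "SEG")]
bad = [("segment", "BIN", "SEG"), ("binarize", "IMG", "BIN")]

print(validate_sequence(good, {"IMG"}))
print(validate_sequence(bad, {"IMG"}))
```

The second call reports that `segment` asks for `BIN` before any step has produced it, without a single Processor having been started.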

Old WF format: OCR-D/spec#171

tries to specify a standardized format for such workflows and also has a good discussion of the benefits/drawbacks of this approach, while also touching on alternatives like CWL. [Note: Taverna/SCUFL2 was initially also a candidate, but we eventually discarded it after it was retired from the Apache Incubator early in 2020]

WF Server: OCR-D/core#652

is an implementation of a workflow server based on OCR-D/spec#171 that already aims to overcome some of the problems in parallelization and error recovery that are well explained under

RFC: Preloading OCR-D Processors: https://hackmd.io/23-JzLp_Q96cb6T0ttoFIA

Last but not least, another relevant piece of software in this context is

Ocrd Controller: https://github.com/bertsky/ocrd_controller

which provides a network interface (via SSH) for executing Processors based on a local installation of ocrd_all, and which could be exposed to third-party software such as Kitodo.

@krvoigt
Author

krvoigt commented Feb 22, 2022

We need concrete requirements. There is a difference between workflow requirements and requirements of the processors.
We need to provide an example workflow and ask IMPL for agreement. Triet will take one OCR-D example and prepare a CWL for NextFlow.

@krvoigt krvoigt closed this as completed Apr 26, 2022
4 participants