Common understanding about the workflow requirements #48
The aim for a workflow component for OCR-D goes back to the initial phase I of the project. I'll try to summarize the high-level discussions so far.

A strength of OCR-D lies in its modularity and flexibility in the use of Processors, with often more than one Processor available for a given task (or step). Experience, testing and user expectations have shown that, due to the complex and very diverse nature of the documents to be processed, often only tailor-made workflows can guarantee optimal result quality. The OCR-D functional model shows that OCR is always composed of a sequence of such steps, with the condition that the output (image, text or metadata) of each Processor must also be made available to subsequent Processors (theoretically, loops are also conceivable). In practice, workflows are often specific to an institution or project, composed of ad-hoc solutions or local scripts, and lacking standardized interfaces, descriptions and documentation. This greatly hinders reuse, comparability and replicability.

So far we can summarize the following needs, or rather benefits, of a standardized workflow management component:
A bit further down the line, standardized workflows could also be used for discovery (i.e. finding an appropriate workflow for a set of documents), or to dynamically compose sequences of Processors based purely on a workflow description. Other aspects that could greatly benefit from standardized workflows are scalability/parallelization (e.g. while some steps take longer to complete, other steps could potentially already be started and run in parallel to reduce overall execution time) and error handling/robustness (i.e. the ability to recover from exceptions or crashes of individual Processors without stopping the overall workflow execution).

A first step towards standardized workflows was made with
which basically encapsulates workflows as a simple sequence of Processors in POSIX shell scripts. This has a few drawbacks, e.g. with regard to validating the sequence and the data exchange between Processors (see also the example and explanation in the wiki on workflows), or capturing global provenance information.
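To make the sequence-validation gap concrete, here is a minimal illustrative sketch (not part of any OCR-D tool; processor and file-group names follow OCR-D conventions but are hypothetical): before running, check that every Processor's input file group has actually been produced by an earlier step.

```python
# Illustrative sketch, not OCR-D code: statically validate that each
# step's input file group is produced by an earlier step or already
# present in the workspace.

def validate_sequence(steps, initial_groups):
    """steps: list of (processor, input_grp, output_grp) tuples."""
    available = set(initial_groups)
    errors = []
    for name, input_grp, output_grp in steps:
        if input_grp not in available:
            errors.append(f"{name}: input '{input_grp}' not yet produced")
        available.add(output_grp)
    return errors

# Hypothetical workflow with a wiring mistake in the last step:
workflow = [
    ("ocrd-binarize",  "OCR-D-IMG", "OCR-D-BIN"),
    ("ocrd-segment",   "OCR-D-BIN", "OCR-D-SEG"),
    ("ocrd-recognize", "OCR-D-OCR", "OCR-D-TXT"),  # no step produces OCR-D-OCR
]
print(validate_sequence(workflow, {"OCR-D-IMG"}))
```

A plain shell script only discovers such a mismatch at runtime, when the broken step is already executing; a standardized workflow description would allow this check up front.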
tries to specify a standardized format for such workflows and also has a good discussion of the benefits/drawbacks of this approach, while also touching on alternatives like CWL. [Note: while Taverna/SCUFL2 was initially also a candidate, its retirement as an Apache incubator project in early 2020 led us to eventually discard it.]
is an implementation of a workflow server based on OCR-D/spec#171 that already aims to overcome some of the problems in parallelization and error recovery that are well explained under
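The parallelization and error-recovery ideas can be sketched as follows (a purely illustrative toy scheduler, not the workflow server's actual implementation): steps run as soon as their prerequisites have succeeded, and a crashing Processor is recorded as failed without aborting steps that do not depend on it.

```python
# Illustrative sketch (hypothetical step names, not real Processors):
# run workflow steps once their dependencies are met, and record
# failures without aborting the remaining independent steps.
from concurrent.futures import ThreadPoolExecutor

def run_workflow(steps, deps):
    """steps: {name: callable}; deps: {name: set of prerequisite names}."""
    done, failed = set(), set()
    remaining = set(steps)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # A step is runnable when all its prerequisites succeeded.
            ready = {s for s in remaining if deps.get(s, set()) <= done}
            if not ready:
                failed |= remaining  # blocked by failed prerequisites
                break
            futures = {s: pool.submit(steps[s]) for s in ready}
            for s, fut in futures.items():
                try:
                    fut.result()
                    done.add(s)
                except Exception:
                    failed.add(s)  # keep going with unaffected steps
            remaining -= ready
    return done, failed

def ok(): pass
def boom(): raise RuntimeError("processor crashed")

steps = {"binarize": ok, "segment": ok, "recognize": boom, "export": ok}
deps = {"segment": {"binarize"}, "recognize": {"segment"},
        "export": {"recognize"}}
done, failed = run_workflow(steps, deps)
```

Here the crash of the hypothetical "recognize" step only takes down its dependent "export" step; everything upstream still completes, which is exactly the robustness a plain sequential shell script cannot offer.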
Last but not least, another relevant piece of software in this context is
which provides a network interface (via SSH) for executing Processors based on a local installation of ocrd_all, and which could be exposed to third-party software such as Kitodo.
We need concrete requirements. There is a difference between workflow requirements and the requirements of the Processors.
As a team, we need a summary of the past discussion and the resulting requirements so that we can discuss them together and generate our next steps from them.