-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New processor API #1240
base: master
Are you sure you want to change the base?
New processor API #1240
Conversation
- add method `process_workspace(workspace)` as a replacement for passing `workspace` in the constructor and then calling `process` (implemented by subclasses): implement in the superclass - loop over input files - delegate processing to new method `process_page_file()` if possible - otherwise fall back to old `process()` outside of loop - download input files when needed if `self.download` - add method `process_page_file()` as single-page processing procedure on OcrdFiles: implement in the superclass for the most frequent/default use-case of - (multi-) image/PAGE input files - (single) PAGE output files - delegate to new method `process_page_pcgts()` if available - add PAGE processing metadata - set PAGE PcGtsId - handle `make_file_id` and `workspace.add_file` - add method `process_page_pcgts()` as single-page processing function on OcrdPage: to be implemented only by subclasses - constructor: add kwarg `download_files` controlling `self.download` (see above)
- implement `process_page_pcgts` with behaviour for `copy_files=False` - override superclass `process_page_file` with behaviour for `copy_files=True` - remove old `process` implementation
…load_files=False` in tests (because they are not actually in the filesystem)
A couple of points worth discussing:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a great start.
src/ocrd/processor/base.py
Outdated
with pushd_popd(workspace.directory): | ||
self.workspace = workspace | ||
try: | ||
# FIXME: add page parallelization by running multiprocessing.Pool (#322) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope the Pool workers would be also controllable by an environment variable to ease resource distribution in HPC environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea, exactly.
… used in meta-data of resulting image
BTW, I consider it bad practice that we always load the ocrd-tool.json in the inheriting constructor. It should really be a class attribute. And assuming it is always placed in the same spot of the Python distribution (which is currently the case but not strictly required as our specs only require it in the repository root), one can even load it in the superclass (so nothing would have to be done in the inheriting code). |
…up (also, simplify)
…t if fallback process() raises anything itself
…PUT=OVERWRITE|SKIP or disjoint --page-id)
…ameters (necessary for required params)
…o get right with required parameters, and now covered by wrapped Processor.verify() already)
…OUTPUT==OVERWRITE
0fb234b
to
53b880f
Compare
This is a first attempt at implementing #322 – without the actual error handling or parallel processing (so more or less refactoring and deprecation). It also enables fixing #579 in the process.