Changed:
- Merge v2 master into new-processor-api
- PAGE API: Update to latest generateDS 2.44.1, bertsky#21
- 🔥 logging: increase default root (not `ocrd`) level from `INFO` to `WARNING`
- 🔥 `initLogging`: do not remove any previous handlers/levels, unless `force_reinit`
- 🔥 `disableLogging`: remove all handlers, reset all levels - instead of being selective
- 🔥 Processor: replace `weakref` with `__del__` to trigger `shutdown`
- 🔥 `OCRD_MAX_PARALLEL_PAGES>1`: log via `QueueHandler` in subprocess, `QueueListener` in main
- 🔥 `ocrd_utils.initLogging`: also add handler to root logger (as in file config), but disable message propagation to avoid duplication
- only import `ocrd_network` in `src/ocrd/decorators/__init__.py` once needed
- `Processor.process_page_file`: skip computing `process_page_pcgts` if output already exists, but `OCRD_EXISTING_OUTPUT!=OVERWRITE`
- 🔥 `OCRD_MAX_PARALLEL_PAGES>1`: switch from multithreading to multiprocessing, depend on `loky` instead of stdlib `concurrent.futures`
- `OCRD_PROCESSING_PAGE_TIMEOUT>0`: actually enforce timeout within worker
- `OCRD_MAX_MISSING_OUTPUTS>0`: abort early if too many failures already, prospectively
- `Processor.process_workspace`: split up into overridable sub-methods:
  - `process_workspace_submit_tasks` (iterate input file group and schedule page tasks)
  - `process_workspace_submit_page_task` (download input files and submit single page task)
  - `process_workspace_handle_tasks` (monitor page tasks and aggregate results)
  - `process_workspace_handle_page_task` (await single page task and handle errors)
- 🔥 `Processor` / `Workspace.add_file`: always `force` if `OCRD_EXISTING_OUTPUT==OVERWRITE`
- 🔥 `Processor.verify`: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
- 🔥 `Processor.verify`: check output fileGrps, too (must not exist unless `OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP` or disjoint `--page-id` range)
- lib.bash `input-files`: do not try to validate tasks here (now covered by `Processor.verify()`)
- `run_processor`: be robust if `ocrd_tool` is missing `steps`
- `PcGtsType.PageType.id` via `make_xml_id`: replace `/` with `_`
- 🔥 `ocrd_utils`, `ocrd_models`, `ocrd_modelfactory`, `ocrd_validators` and `ocrd_network` are not published as separate packages anymore, everything is contained in `ocrd` - you should adapt your `requirements.txt` accordingly
- 🔥 `Processor.parameter` is now a property (attribute always exists, but is `None` for non-processing contexts)
- 🔥 `Processor.parameter` is now a `frozendict` (contents immutable)
- 🔥 `Processor.parameter` is validated whenever it is set, instead of only in the constructor
- setting `Processor.parameter` will also trigger (`Processor.shutdown()` and) `Processor.setup()`
- `get_processor(... instance_caching=True)`: use `min(max_instances, OCRD_MAX_PROCESSOR_CACHE)`
- 🔥 `Processor.verify` always validates fileGrp cardinalities (because we have `ocrd-tool.json` defaults now)
- 🔥 `OcrdMets.add_agent` without positional arguments
- `ocrd bashlib input-files` now uses normal Processor decorator, and gets passed actual `ocrd-tool.json` and tool name from bashlib's `ocrd__wrap`
- 🔥 `OcrdPage` as proxy of `PcGtsType` instead of alias; also contains `etree` and `mapping` now
- 🔥 `page_from_file`: removed kwarg `with_tree` - use `OcrdPage.etree` and `OcrdPage.mapping` instead
- 🔥 `Processor.zip_input_files` now can throw `ocrd.NonUniqueInputFile` and `ocrd.MissingInputFile` (the latter only if `OCRD_MISSING_INPUT=ABORT`)
- 🔥 `Processor.zip_input_files` does not by default use `require_first` anymore (so the first file in any input file tuple per page can be `None` as well)
- 🔥 no more `Workspace.overwrite_mode`, merely delegate to `OCRD_EXISTING_OUTPUT=OVERWRITE`
- 🎨 improve on docs result for `ocrd_utils.config`
- 🔥 Deprecate `Processor.process`
- update spec to v3.25.0, which requires annotating fileGrp cardinality in `ocrd-tool.json`
- 🔥 Remove passing non-processing kwargs to `Processor` constructor, add as members (i.e. `show_help`, `dump_json`, `dump_module_dir`, `list_resources`, `show_resource`, `resolve_resource`)
- 🔥 Deprecate passing processing arg / kwargs to `Processor` constructor (i.e. `workspace`, `page_id`, `input_file_grp`, `output_file_grp`; now all set by `run_processor`)
- 🔥 Deprecate passing `ocrd-tool.json` metadata to `Processor` constructor
- `ocrd.processor`: Handle loading of bundled `ocrd-tool.json` generically
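
To picture the `Processor.parameter` changes listed above (property, immutable contents, validation on every assignment, implicit `shutdown`/`setup`), here is a minimal stdlib-only sketch. `SketchProcessor`, its `events` list and its `validate` method are hypothetical stand-ins rather than the `ocrd` API, and `types.MappingProxyType` is swapped in for the `frozendict` mentioned in the changelog.

```python
from types import MappingProxyType


class SketchProcessor:
    """Hypothetical stand-in for illustration; NOT the real ocrd.Processor."""

    def __init__(self):
        self._parameter = None  # attribute always exists, None outside processing
        self.events = []        # records lifecycle calls, for illustration only

    def setup(self):
        self.events.append("setup")

    def shutdown(self):
        self.events.append("shutdown")

    def validate(self, value):
        # Placeholder for schema validation against ocrd-tool.json (assumption).
        if not isinstance(value, dict):
            raise ValueError("parameter must be a mapping")

    @property
    def parameter(self):
        return self._parameter

    @parameter.setter
    def parameter(self, value):
        self.validate(value)          # validated on every assignment
        if self._parameter is not None:
            self.shutdown()           # tear down state built for the old value
        # Read-only view over a private copy, standing in for frozendict.
        self._parameter = MappingProxyType(dict(value))
        self.setup()


proc = SketchProcessor()
proc.parameter = {"level": "region"}
proc.parameter = {"level": "line"}
```

After the two assignments, `proc.events` holds one `setup`, then a `shutdown`/`setup` pair, and any attempt to assign into `proc.parameter[...]` raises `TypeError`. Processors that used to mutate their parameter dict in place would need to assign a new dict instead.
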
Fixed:
- `ocrd --help` output was broken for multiline config options, bertsky#25
- Call `initLogging` before instantiating processors in `ocrd_cli_wrap_processor`, bertsky#24, #1296
- PAGE API: Fully reversible mapping from/to XML element/generateDS instances, bertsky#21
- `initLogging`: only add root handler instead of multiple redundant handlers with `propagate=false`
- `setOverrideLogLevel`: override all currently active loggers' level
- `OcrdMets.get_physical_pages`: cover `return_divs` w/o `for_fileIds` and `for_pageIds`
- tests: ensure `ocrd_utils.config` gets reset whenever changing it globally
- `OcrdMetsServer.add_file`: pass on `force` kwarg
- `ocrd.cli.workspace`: consistently pass on `--mets-server-url` and `--backup`
- `ocrd.cli.validate "tasks"`: pass on `--mets-server-url`
- `ocrd.cli.bashlib "input-files"`: pass on `--mets-server-url`
- `lib.bash input-files`: pass on `--mets-server-url`, `--overwrite`, and parameters
- `lib.bash`: fix `errexit` handling
- `ocrd.cli.ocrd-tool "resolve-resource"`: forgot to actually print result
- `Processor.metadata_location`: `src` workaround respects namespace packages, qurator-spk/eynollah#134
- `Workspace.reload_mets`: handle ClientSideOcrdMets as well
- `disableLogging`: also re-instate root logger to Python defaults
- actually apply CLI `--log-filename`, and show in `--help`
- adapt to Pillow changes
- `ocrd workspace clone`: do pass on `--file-grp` (for download filtering)
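
The idea behind overriding the level of all currently active loggers, as in the `setOverrideLogLevel` fix above, can be sketched with the standard `logging` module alone. This is only an illustration of the technique, not the code in `ocrd_utils`; `override_all_levels` and the logger name `sketch.noisy` are hypothetical.

```python
import logging


def override_all_levels(level):
    """Set `level` on the root logger and on every logger created so far."""
    logging.getLogger().setLevel(level)
    for candidate in logging.root.manager.loggerDict.values():
        # loggerDict also holds PlaceHolder objects for intermediate names,
        # which have no level of their own and must be skipped.
        if isinstance(candidate, logging.Logger):
            candidate.setLevel(level)


# A logger that was explicitly configured to be verbose ...
noisy = logging.getLogger("sketch.noisy")
noisy.setLevel(logging.DEBUG)

# ... is still silenced, because the override touches every existing logger.
override_all_levels(logging.ERROR)
```

Setting the level only on the root logger would not be enough here, since `sketch.noisy` carries its own explicit level and would keep emitting debug messages.
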
Added:
- `ocrd-filter` processor to remove segments based on XPath expressions, bertsky#21
- XPath function `pc:pixelarea` for the number of pixels of the bounding box (or sum area on node sets), bertsky#21
- XPath function `pc:textequiv` for the first TextEquiv unicode string (or concatenated string on node sets), bertsky#21
- `OcrdPage`: new `PageType.get_ReadingOrderGroups()` to retrieve recursive RO as dict
- ocrd.cli.workspace `server`: add subcommands `reload` and `save`
- METS Server: export and delegate `physical_pages`
- processor CLI: delegate `--resolve-resource`, too
- `Processor.process_page_file` / `OcrdPageResultImage`: allow `None` besides `AlternativeImageType`
- `OcrdConfig.reset_defaults` to reset config variables to their defaults
- `Processor.max_workers`: class attribute to control per-page parallelism of this implementation
- `Processor.max_page_seconds`: class attribute to control per-page timeout of this implementation
- `OCRD_MAX_PARALLEL_PAGES` for whether and how many workers should process pages in parallel
- `OCRD_PROCESSING_PAGE_TIMEOUT` for whether and how long processors should wait for single pages
- `OCRD_MAX_MISSING_OUTPUTS` for maximum rate (fraction) of pages before making `OCRD_MISSING_OUTPUT=abort`
- `Processor.metadata_filename`: expose to make local path of `ocrd-tool.json` in Python distribution reusable+overridable
- `Processor.metadata_location`: expose to make absolute path of `ocrd-tool.json` reusable+overridable
- `Processor.metadata_rawdict`: expose to make in-memory contents of `ocrd-tool.json` reusable+overridable
- `Processor.metadata`: expose to make validated and default-expanded contents of `ocrd-tool.json` reusable+overridable
- `Processor.shutdown`: to shut down processor after processing, optional
- `Processor.max_instances`: class attribute to control instance caching of this implementation
- 👉 `OCRD_DOWNLOAD_INPUT` for whether input files should be downloaded before processing
- 👉 `OCRD_MISSING_INPUT` for how to handle missing input files (`SKIP` or `ABORT`)
- 👉 `OCRD_MISSING_OUTPUT` for how to handle processing failures (`SKIP` or `ABORT` or `COPY`); the latter behaves like ocrd-dummy for the failed page(s)
- 👉 `OCRD_EXISTING_OUTPUT` for how to handle existing output files (`SKIP` or `ABORT` or `OVERWRITE`)
- new CLI option `--debug` as short-hand for `ABORT` choices above
- `Processor.logger` set up by constructor already (for re-use by processor implementors)
- `default`-expand and validate `ocrd_tool.json` in `Processor` constructor, log invalidities
- handle JSON `deprecation` in `ocrd_tool.json` by reporting warnings
- `Processor.process_workspace`: process a complete workspace, with default implementation
- `Processor.process_page_file`: process an OcrdFile, with default implementation
- `Processor.process_page_pcgts`: process a single OcrdPage, produce a single OcrdPage, required to implement
- `Processor.verify`: handle fileGrp cardinality verification, with default implementation
- `Processor.setup`: to set up processor before processing, optional
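
How the failure policies above might interact (`OCRD_MISSING_OUTPUT` with `SKIP`, `ABORT` or `COPY`, plus the `OCRD_MAX_MISSING_OUTPUTS` fraction) can be sketched in plain Python. `run_pages`, its arguments and the `flaky` callback are hypothetical; the real processor reads these settings from the environment, and the exact threshold semantics used here (abort once failed pages divided by total pages exceeds the fraction) is an assumption.

```python
def run_pages(pages, process, missing_output="SKIP", max_missing=0.0):
    """Hypothetical page loop illustrating the policies; not the ocrd code.

    missing_output: "SKIP" (drop failed page), "ABORT" (re-raise at once),
                    "COPY" (pass the input through, like ocrd-dummy).
    max_missing:    allowed fraction of failed pages; 0 disables the check.
    """
    done, failed = [], []
    for page in pages:
        try:
            done.append((page, process(page)))
        except Exception as err:
            if missing_output == "ABORT":
                raise
            failed.append(page)
            if missing_output == "COPY":
                done.append((page, page))  # input becomes the output unchanged
            # Abort early once the failure rate exceeds the allowed fraction.
            if max_missing > 0 and len(failed) / len(pages) > max_missing:
                raise RuntimeError(f"too many failed pages: {failed}") from err
    return done, failed


def flaky(page):
    # Toy page processor that fails on exactly one page.
    if page.endswith("3"):
        raise ValueError("cannot process " + page)
    return page.upper()


pages = ["p1", "p2", "p3", "p4"]
done, failed = run_pages(pages, flaky, missing_output="COPY", max_missing=0.5)
```

With `COPY`, the failed page `p3` still appears in `done`, paired with its unmodified input, so downstream steps see a complete file group. Lowering `max_missing` below the observed failure rate turns the same run into an early abort.
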