v3.0.0

@kba released this 09 Jan 10:54

Changed:

  • Merge v2 master into new-processor-api
  • PAGE API: Update to latest generateDS 2.44.1, bertsky#21
  • 🔥 logging: increase default root (not ocrd) level from INFO to WARNING
  • 🔥 initLogging: do not remove any previous handlers/levels, unless force_reinit
  • 🔥 disableLogging: remove all handlers, reset all levels - instead of being selective
  • 🔥 Processor: replace weakref with __del__ to trigger shutdown
  • 🔥 OCRD_MAX_PARALLEL_PAGES>1: log via QueueHandler in subprocess, QueueListener in main
  • 🔥 ocrd_utils.initLogging: also add handler to root logger (as in file config),
    but disable message propagation to avoid duplication
  • only import ocrd_network in src/ocrd/decorators/__init__.py once needed
  • Processor.process_page_file: skip computing process_page_pcgts if output already exists
    and OCRD_EXISTING_OUTPUT!=OVERWRITE
  • 🔥 OCRD_MAX_PARALLEL_PAGES>1: switch from multithreading to multiprocessing, depend on
    loky instead of stdlib concurrent.futures
  • OCRD_PROCESSING_PAGE_TIMEOUT>0: actually enforce timeout within worker
  • OCRD_MAX_MISSING_OUTPUTS>0: abort prospectively, as soon as too many pages have already failed
  • Processor.process_workspace: split up into overridable sub-methods:
    • process_workspace_submit_tasks (iterate input file group and schedule page tasks)
    • process_workspace_submit_page_task (download input files and submit single page task)
    • process_workspace_handle_tasks (monitor page tasks and aggregate results)
    • process_workspace_handle_page_task (await single page task and handle errors)
  • 🔥 Processor / Workspace.add_file: always force if OCRD_EXISTING_OUTPUT==OVERWRITE
  • 🔥 Processor.verify: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
  • 🔥 Processor.verify: check output fileGrps, too
    (must not exist unless OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP or disjoint --page-id range)
  • lib.bash input-files: do not try to validate tasks here (now covered by Processor.verify())
  • run_processor: be robust if ocrd_tool is missing steps
  • PcGtsType.PageType.id via make_xml_id: replace / with _
  • 🔥 ocrd_utils, ocrd_models, ocrd_modelfactory, ocrd_validators and ocrd_network are not published
    as separate packages anymore, everything is contained in ocrd - you should adapt your requirements.txt accordingly
  • 🔥 Processor.parameter now a property (attribute always exists, but None for non-processing contexts)
  • 🔥 Processor.parameter is now a frozendict (contents immutable)
  • 🔥 Processor.parameter is validated whenever it is set, instead of only in the constructor
  • setting Processor.parameter will also trigger (Processor.shutdown() and) Processor.setup()
  • get_processor(... instance_caching=True): use min(max_instances, OCRD_MAX_PROCESSOR_CACHE)
  • 🔥 Processor.verify always validates fileGrp cardinalities (because we have ocrd-tool.json defaults now)
  • 🔥 OcrdMets.add_agent without positional arguments
  • ocrd bashlib input-files now uses normal Processor decorator, and gets passed actual ocrd-tool.json and tool name
    from bashlib's ocrd__wrap
  • 🔥 OcrdPage as proxy of PcGtsType instead of alias; also contains etree and mapping now
  • 🔥 page_from_file: removed kwarg with_tree - use OcrdPage.etree and OcrdPage.mapping instead
  • 🔥 Processor.zip_input_files now can throw ocrd.NonUniqueInputFile and ocrd.MissingInputFile
    (the latter only if OCRD_MISSING_INPUT=ABORT)
  • 🔥 Processor.zip_input_files does not by default use require_first anymore
    (so the first file in any input file tuple per page can be None as well)
  • 🔥 no more Workspace.overwrite_mode, merely delegate to OCRD_EXISTING_OUTPUT=OVERWRITE
  • 🎨 improve generated docs for ocrd_utils.config
  • 🔥 Deprecate Processor.process
  • update spec to v3.25.0, which requires annotating fileGrp cardinality in ocrd-tool.json
  • 🔥 Remove passing non-processing kwargs to Processor constructor, add as members
    (i.e. show_help, dump_json, dump_module_dir, list_resources, show_resource, resolve_resource)
  • 🔥 Deprecate passing processing arg / kwargs to Processor constructor
    (i.e. workspace, page_id, input_file_grp, output_file_grp; now all set by run_processor)
  • 🔥 Deprecate passing ocrd-tool.json metadata to Processor constructor
  • ocrd.processor: Handle loading of bundled ocrd-tool.json generically
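
The Processor.parameter entries above (property, immutable contents, validation on every assignment, shutdown/setup on reassignment) can be illustrated with a minimal standard-library sketch. Everything here is an assumption for illustration: SketchProcessor is a hypothetical stand-in and not the real ocrd.Processor, types.MappingProxyType is swapped in for frozendict, and the validation is a placeholder for the schema check against ocrd-tool.json.

```python
from types import MappingProxyType


class SketchProcessor:
    """Hypothetical stand-in for illustration, not the real ocrd.Processor."""

    def __init__(self):
        self._parameter = None  # attribute always exists; None outside processing
        self.events = []        # records lifecycle calls so they can be inspected

    def setup(self):
        self.events.append("setup")

    def shutdown(self):
        self.events.append("shutdown")

    def _validate(self, value):
        # Placeholder check; the real library validates against ocrd-tool.json.
        if not isinstance(value, dict):
            raise TypeError("parameter must be a dict")

    @property
    def parameter(self):
        return self._parameter

    @parameter.setter
    def parameter(self, value):
        self._validate(value)      # validated on every assignment
        if self._parameter is not None:
            self.shutdown()        # tear down state built from the old parameters
        # Read-only view over a private copy (stand-in for frozendict).
        self._parameter = MappingProxyType(dict(value))
        self.setup()               # rebuild state from the new parameters


proc = SketchProcessor()
proc.parameter = {"level": "line"}
try:
    proc.parameter["level"] = "word"  # contents are immutable
except TypeError:
    print("parameter contents cannot be modified in place")
```

To change a value under this scheme, a new mapping is assigned to the property as a whole, which is what triggers validation and re-setup.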

Fixed:

  • ocrd --help output was broken for multiline config options, bertsky#25
  • Call initLogging before instantiating processors in ocrd_cli_wrap_processor, bertsky#24, #1296
  • PAGE API: Fully reversible mapping from/to XML element/generateDS instances, bertsky#21
  • initLogging: only add root handler instead of multiple redundant handlers with propagate=false
  • setOverrideLogLevel: override all currently active loggers' level
  • OcrdMets.get_physical_pages: cover return_divs w/o for_fileIds and for_pageIds
  • tests: ensure ocrd_utils.config gets reset whenever changing it globally
  • OcrdMetsServer.add_file: pass on force kwarg
  • ocrd.cli.workspace: consistently pass on --mets-server-url and --backup
  • ocrd.cli.validate "tasks": pass on --mets-server-url
  • ocrd.cli.bashlib "input-files": pass on --mets-server-url
  • lib.bash input-files: pass on --mets-server-url, --overwrite, and parameters
  • lib.bash: fix errexit handling
  • ocrd.cli.ocrd-tool "resolve-resource": forgot to actually print result
  • Processor.metadata_location: src workaround respects namespace packages, qurator-spk/eynollah#134
  • Workspace.reload_mets: handle ClientSideOcrdMets as well
  • disableLogging: also re-instate root logger to Python defaults
  • actually apply CLI --log-filename, and show in --help
  • adapt to Pillow changes
  • ocrd workspace clone: do pass on --file-grp (for download filtering)

Added:

  • ocrd-filter processor to remove segments based on XPath expressions, bertsky#21
  • XPath function pc:pixelarea for the number of pixels of the bounding box (or sum area on node sets), bertsky#21
  • XPath function pc:textequiv for the first TextEquiv unicode string (or concatenated string on node sets), bertsky#21
  • OcrdPage: new PageType.get_ReadingOrderGroups() to retrieve recursive RO as dict
  • ocrd.cli.workspace server: add subcommands reload and save
  • METS Server: export and delegate physical_pages
  • processor CLI: delegate --resolve-resource, too
  • Processor.process_page_file / OcrdPageResultImage: allow None besides AlternativeImageType
  • OcrdConfig.reset_defaults to reset config variables to their defaults
  • Processor.max_workers: class attribute to control per-page parallelism of this implementation
  • Processor.max_page_seconds: class attribute to control per-page timeout of this implementation
  • OCRD_MAX_PARALLEL_PAGES for whether and how many workers should process pages in parallel
  • OCRD_PROCESSING_PAGE_TIMEOUT for whether and how long processors should wait for single pages
  • OCRD_MAX_MISSING_OUTPUTS for the maximum rate (fraction) of failed pages tolerated before aborting as with OCRD_MISSING_OUTPUT=ABORT
  • Processor.metadata_filename: expose to make local path of ocrd-tool.json in Python distribution reusable+overridable
  • Processor.metadata_location: expose to make absolute path of ocrd-tool.json reusable+overridable
  • Processor.metadata_rawdict: expose to make in-memory contents of ocrd-tool.json reusable+overridable
  • Processor.metadata: expose to make validated and default-expanded contents of ocrd-tool.json reusable+overridable
  • Processor.shutdown: to shut down processor after processing, optional
  • Processor.max_instances: class attribute to control instance caching of this implementation
  • 👉 OCRD_DOWNLOAD_INPUT for whether input files should be downloaded before processing
  • 👉 OCRD_MISSING_INPUT for how to handle missing input files (SKIP or ABORT)
  • 👉 OCRD_MISSING_OUTPUT for how to handle processing failures (SKIP or ABORT or COPY);
    the latter behaves like ocrd-dummy for the failed page(s)
  • 👉 OCRD_EXISTING_OUTPUT for how to handle existing output files (SKIP or ABORT or OVERWRITE)
  • new CLI option --debug as short-hand for ABORT choices above
  • Processor.logger set up by constructor already (for re-use by processor implementors)
  • default-expand and validate ocrd-tool.json in Processor constructor, log validation errors
  • handle JSON deprecation in ocrd-tool.json by reporting warnings
  • Processor.process_workspace: process a complete workspace, with default implementation
  • Processor.process_page_file: process an OcrdFile, with default implementation
  • Processor.process_page_pcgts: process a single OcrdPage, produce a single OcrdPage, required to implement
  • Processor.verify: handle fileGrp cardinality verification, with default implementation
  • Processor.setup: to set up processor before processing, optional
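
The three choices listed above for OCRD_EXISTING_OUTPUT (SKIP, ABORT, OVERWRITE) amount to a small decision per output file. The sketch below shows one way such a policy could be read from the environment. It is an assumption-laden illustration: existing_output_action is a hypothetical function, the fallback to ABORT when the variable is unset is an assumption of this sketch, and the exception type is chosen arbitrarily.

```python
import os


def existing_output_action(output_exists, env=None):
    """Return "process" or "skip" for one page, or raise if the policy is ABORT.

    Hypothetical helper for illustration; not part of the ocrd API.
    """
    env = os.environ if env is None else env
    # Assumption of this sketch: an unset variable is treated as ABORT.
    policy = env.get("OCRD_EXISTING_OUTPUT", "ABORT").upper()
    if policy not in ("SKIP", "ABORT", "OVERWRITE"):
        raise ValueError(f"unsupported OCRD_EXISTING_OUTPUT value: {policy}")
    if not output_exists:
        return "process"  # nothing to collide with, whatever the policy
    if policy == "OVERWRITE":
        return "process"  # recompute and replace the existing file
    if policy == "SKIP":
        return "skip"     # keep the existing file, do not recompute
    raise FileExistsError("output already exists and OCRD_EXISTING_OUTPUT=ABORT")


print(existing_output_action(True, {"OCRD_EXISTING_OUTPUT": "SKIP"}))
```

Passing the environment as an argument keeps the function easy to test without touching os.environ.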