Merge pull request #185 from alliomeria/1.3.0
Documentation Updates + New Section for Strawberry Runners
Showing 13 changed files with 312 additions and 26 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,8 +7,19 @@ | |
[METRO's Digital Services Team](https://metro.org/digital-services) facilitated many different internal training sessions throughout 2020-2022. If you and your team need access to any of these sessions that were recorded, please [contact us](mailto:[email protected]). Thank you! | ||
|
||
## 2023 | ||
- [Experimental IIIF Kitchen using Archipelago. Pino Navarro, Diego; Sherrick, Allison. (June 2023)](https://tinyurl.com/apiiif2023) | ||
- [Mapping an Engineer Through IIIF. Monger, Jenifer J.; McCarthy, Brenden; Pino Navarro, Diego; Sherrick, Allison. (June 2023)](https://tinyurl.com/2x9mshx5) | ||
|
||
- IIIF Search API and Dynamic/evolving Manifest Generation: Facing the Unknown. Diego Alberto Pino Navarro, Allison Sherrick. | ||
- [📺 Recording available](https://www.youtube.com/watch?v=z9yay9J16T0&t=1050s) | ||
|
||
- DLF Forum (November 2023) | ||
- [Working and Learning the IIIF Search API in Archipelago. Diego Pino Navarro, Allison Sherrick.](https://osf.io/p4m9s/) | ||
- [Working with Open-Schema JSON in Archipelago. Allison Sherrick, Diego Pino Navarro, Martha Tenney, Joanna DiPasquale, Corinne Chatnik.](https://osf.io/dx3fm/) | ||
- [Slaying the Migration Dragon: Approaches to Navigating an Open Source System Migration. Lisa McFall, Sarah Walden McGowan, Brenden McCarthy, Shay Foley.](https://osf.io/aymhd/) | ||
|
||
- IIIF Annual Conference (June 2023) | ||
- [Experimental IIIF Kitchen using Archipelago. Pino Navarro, Diego; Sherrick, Allison.](https://tinyurl.com/apiiif2023) | ||
- [Mapping an Engineer Through IIIF. Monger, Jenifer J.; McCarthy, Brenden; Pino Navarro, Diego; Sherrick, Allison.](https://tinyurl.com/2x9mshx5) | ||
|
||
- Into Archipelago Commons: Access, Innovation and Community in Modern Archives. Monger, Jenifer J.; McCarthy, Brenden; Corinne Chatnik. (June 2023) | ||
- Implementing Archipelago: An Innovative, Community Driven, Open-Source Repository. Corinne Chatnik, Union College; Martha Tenney, Barnard College. (June 2023) | ||
- [For the Love of Data and Ourselves: The Bumpy, Technical Road to Modern Archives. Monger, Jenifer J.; McCarthy, Brenden. (January/February 2023)](https://mydigitalpublication.com/publication/?m=30305&l=1) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
--- | ||
title: Strawberry Runners Post-Processing | ||
tags: | ||
- Strawberry Runners | ||
- HOCR | ||
- OCR | ||
- Post-processing | ||
- Background Processing | ||
--- | ||
|
||
# Strawberry Runners Post-Processing Configuration | ||
|
||
Archipelago's [Strawberry Runners (SBR)](https://github.com/esmero/strawberry_runners) module provides a set of post-processing capabilities for the JSON-based metadata, files and entities that comprise your Archipelago Digital Objects (ADOs). These post-processing actions are based on dispatched events, direct HTTP calls, and invoked webhooks from partner services (such as Min.io, AWS S3 or self-invoked). | ||
|
||
The default Archipelago SBR post-processor configurations include operations that: | ||
- perform page-based HOCR/OCR for image and PDF-based ADOs, send the output to the Search API, and use Natural Language Processing to extract entities from the output | ||
- extract text from pages within a Webarchives File and send the output to the Search API | ||
- convert WARC format Webarchives Files into WACZ format and attach the new WACZ file to the original source ADO to complement the WARC original | ||
|
||
SBR actions can be chained and nested to enable ordered operations, such as first extracting individual pages in an ordered sequence and then running HOCR/OCR across the individual pages. | ||
|
||
## Strawberry Runners Settings Overview | ||
|
||
You can access the Strawberry Runners Settings: | ||
|
||
- Through the `Manage` menu > `Configuration` > `Archipelago` > `Configure Strawberry Runners Post Processors` | ||
- Directly at `/admin/config/archipelago/strawberry_runners` | ||
|
||
On the Strawberry Runners Settings page, you will see the Archipelago default post processor configurations (unless modified). | ||
|
||
![Strawberry Runners Home](images/strawberryrunnershome.png) | ||
|
||
1. The `pager` action uses the 'Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO' plugin. | ||
2. Nested one level in, the `ocr` action uses the 'Post processor that Runs OCR/HORC against files' plugin. The `ocr` operations will be executed after the completion of the `pager` operations. | ||
3. The `wacz_page_extractor` action uses the 'Post processor that extracts/generates Indexed Page Content from WACZ files in an ADO' plugin. | ||
4. Nested one level in, the `webpage` action uses the 'Post processor that Indexes WACZ Frictionless data Search Index to Search API' plugin. The `webpage` operations will be executed after the completion of the `wacz_page_extractor` operations. | ||
5. The `warc_to_wacz` action uses the 'Post processor that uses a System Binary to process * files' plugin. | ||
|
||
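The nesting described in the list above can be pictured with a small Python sketch. This is only an illustration of the ordering idea, not the module's implementation: the `Processor` class, its `weight` field, and the `execution_order` helper are hypothetical names invented for this example, and only the five action labels come from the default configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Processor:
    """Hypothetical stand-in for one post-processor entry (not the module's real class)."""
    label: str
    weight: int = 0  # assumed to mirror the "order in the global chain" setting
    children: List["Processor"] = field(default_factory=list)


def execution_order(processors):
    """Return labels with each parent listed before its nested children."""
    ordered = []
    # sorted() is stable, so processors with equal weight keep their listed order.
    for proc in sorted(processors, key=lambda p: p.weight):
        ordered.append(proc.label)
        ordered.extend(execution_order(proc.children))
    return ordered


# The default actions, nested the way the settings page displays them.
defaults = [
    Processor("pager", children=[Processor("ocr")]),
    Processor("wacz_page_extractor", children=[Processor("webpage")]),
    Processor("warc_to_wacz"),
]

print(execution_order(defaults))
```

In this sketch a nested action such as `ocr` can never appear before its parent `pager`, which matches the behavior described for the default configuration.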
## Reviewing and Adjusting the default Post-Processors | ||
|
||
From the main Strawberry Runners Settings page, you can review and adjust the settings for the default Archipelago configurations by selecting `Edit` from the `Operations` menu. | ||
|
||
Please see the following guides for: | ||
|
||
- [Adjusting the `pager` and `ocr` operations](strawberryrunners_pager_ocr.md) | ||
- [Adjusting the `wacz_page_extractor` and `webpage` operations](strawberryrunners_webpage_text.md) | ||
- [Adjusting the `warc_to_wacz` operation](strawberryrunners_wacz_binary.md) | ||
|
||
## Triggering Post-Processing Actions Manually | ||
|
||
After making adjustments to Strawberry Runners Post-Processing configurations, you may want to trigger/re-trigger a particular action manually. | ||
|
||
You can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions` menu. | ||
|
||
### Additional Post Processor Operations | ||
|
||
Archipelago also includes the `Post processor that writes/reads Frictionless Data Packages` plugin. Please keep a lookout for future documentation related to using this plugin. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
--- | ||
title: Strawberry Runners Post-Processing | ||
tags: | ||
- Strawberry Runners | ||
- HOCR | ||
- OCR | ||
- Pager | ||
- Post-processing | ||
- Background Processing | ||
--- | ||
|
||
# Reviewing and adjusting the `pager` and `ocr` Post-Processor operations | ||
|
||
The `pager` and `ocr` Post-processor operations are likely the most important pair of Strawberry Runners in your Archipelago. | ||
|
||
As stated on the [Strawberry Runners overview page](docs/strawberryrunners.md), the `pager` action uses the 'Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO' plugin. Nested one level in, the `ocr` action uses the 'Post processor that Runs OCR/HORC against files' plugin. The `ocr` operations will be executed after the completion of the `pager` operations. | ||
|
||
Common changes you may wish to make for the `pager` and/or `ocr` operations include adding or removing particular types of Archipelago Digital Objects to apply these operations to. | ||
|
||
# Pager Settings | ||
|
||
To review or adjust the configurations for the `pager` operation, select `Edit` from the `Operations` menu. | ||
|
||
In the `pager` settings, you will see the following configuration options: | ||
|
||
![Strawberry Runners Pager](images/sbr_pager.png) | ||
|
||
1. Label: | ||
- Label for this Processor, which should be a unique machine-readable name | ||
- Can only contain lowercase letters, numbers, and underscores | ||
- We do not recommend changing this Label from the default `pager`. | ||
|
||
2. Strawberry Runner Post Processor Plugin: | ||
- The `Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO` should be selected. | ||
- We do not recommend changing this Plugin selection. | ||
|
||
3. Checkbox to mark this processor plugin as active | ||
- We recommend keeping this checked as `active` at all times, but you may wish to temporarily disable this if you are performing certain types of administrative review tasks such as running large test ingests where you plan on deleting the ADOs before a final ingest. | ||
- If you accidentally uncheck this and need to re-trigger the `pager` (and corresponding nested `ocr` action), you can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions` menu. | ||
|
||
4. ADO type(s) to limit this processor to: | ||
- A single ADO type or a comma delimited list of ADO types that qualify to be Processed. | ||
- Leave empty to apply to all ADOs. If you do not provide any specific ADO types here, the processor will be applied for all ADOs with the JSON keys selected in the next step. | ||
- Default ADO types specified are: 'Document,Book,Article' | ||
- You may wish to add to this list any additional document or multi-page ADO types that are custom to your Archipelago environment. | ||
|
||
5. The JSON key that contains the desired source files: | ||
- By default, the `as:image` and `as:document` keys are selected. | ||
- We do not recommend changing this selection. | ||
|
||
6. Mimetype(s) to limit this Processor to: | ||
- A single mimetype or a comma separated list of mimetypes that qualify to be Processed. | ||
- Leave empty to apply to any file. | ||
- Default mimetypes are: 'application/pdf,image/tiff,image/jpeg,image/jp2' | ||
|
||
7. Within the ADO's metadata, the JSON key that contains the language in ISO639-3 (3 letter) format to be used for OCR/NLP processing via Tesseract. | ||
- Default JSON key specified is: 'language_iso639_3' | ||
|
||
8. Please provide a default language in ISO639-3 (3 letter) format. If none is provided we will use 'eng'. | ||
- Default language specified is: 'eng' | ||
|
||
9. Timeout in seconds for this process. | ||
- If the process runs out of time it can still be processed again | ||
- Default selection is: 10 | ||
|
||
10. Order or execution in the global chain. | ||
- Default selection is: 0 | ||
|
||
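To make the filtering and language settings above more concrete, here is a minimal Python sketch of how comma delimited ADO type and mimetype lists, plus the language key with its `eng` fallback, could be evaluated. It assumes a simple exact-match comparison and a plain string language value; the function names `parse_list`, `qualifies`, and `ocr_language` are hypothetical and are not taken from the Strawberry Runners code.

```python
def parse_list(raw):
    """Split a comma delimited setting into a set of trimmed, non-empty values."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def qualifies(ado_type, mimetype, ado_types_setting, mimetypes_setting):
    """Return True if a file passes both filters; an empty setting matches everything."""
    allowed_types = parse_list(ado_types_setting)
    allowed_mimes = parse_list(mimetypes_setting)
    type_ok = not allowed_types or ado_type in allowed_types
    mime_ok = not allowed_mimes or mimetype in allowed_mimes
    return type_ok and mime_ok


def ocr_language(metadata, key="language_iso639_3", default="eng"):
    """Read the ISO639-3 code from the ADO metadata, falling back to the default."""
    # Assumption: the value is a plain string such as "spa".
    return metadata.get(key) or default


# Default settings as documented above.
ADO_TYPES = "Document,Book,Article"
MIMETYPES = "application/pdf,image/tiff,image/jpeg,image/jp2"

print(qualifies("Book", "application/pdf", ADO_TYPES, MIMETYPES))
print(qualifies("Photograph", "image/jpeg", ADO_TYPES, MIMETYPES))
print(ocr_language({"label": "Untitled"}))
```

Under these assumptions, adding a custom ADO type to the list is the only change needed for its files to qualify, provided their mimetypes are already covered.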
# OCR / HOCR Settings | ||
|
||
To review or adjust the configurations for the `ocr` operation, select `Edit` from the `Operations` menu. | ||
|
||
In the `ocr` settings, you will see several different configuration options. | ||
|
||
![Strawberry Runners OCR Part 1](images/sbr_ocr_part1.png) | ||
|
||
|
||
1. Label: | ||
- Label for this Processor, which should be a unique machine-readable name | ||
- Can only contain lowercase letters, numbers, and underscores | ||
- We do not recommend changing this Label from the default `ocr`. | ||
|
||
2. Strawberry Runner Post Processor Plugin: | ||
- The `Post processor that Runs OCR/HORC against files` should be selected. | ||
- We do not recommend changing this Plugin selection. | ||
|
||
3. Checkbox to mark this processor plugin as active | ||
- We recommend keeping this checked as `active` at all times, but you may wish to temporarily disable this if you are performing certain types of administrative review tasks such as running large test ingests where you plan on deleting the ADOs before a final ingest. | ||
- If you accidentally uncheck this and need to re-trigger the `ocr` action, you can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions` menu. | ||
|
||
4. The type of source data this processor works on: | ||
- Select from where the source file this processor needs is fetched. | ||
- Default selection of 'File entities referenced in the as:filetype JSON structure'. | ||
- You also have the option of selecting 'Full file paths passed by another processor', but we do not recommend using this option as it is less granular in its application. | ||
|
||
5. ADO type(s) to limit this processor to: | ||
- A single ADO type or a comma delimited list of ADO types that qualify to be Processed | ||
- Leave empty to apply to all ADOs. If you do not provide any specific ADO types here, the processor will be applied for all ADOs with the JSON keys selected in the next step. | ||
- Default ADO types specified are: 'Document,Book,Article' | ||
- You may wish to add to this list any additional document or multi-page ADO types that are custom to your Archipelago environment. | ||
|
||
6. The JSON key that contains the desired source files: | ||
- By default, the `as:image` and `as:document` keys are selected. | ||
- We do not recommend changing this selection. | ||
|
||
7. Mimetype(s) to limit this Processor to: | ||
- A single mimetype or a comma separated list of mimetypes that qualify to be Processed. | ||
- Leave empty to apply to any file. | ||
- Default mimetypes are: 'application/pdf,image/tiff,image/jpeg,image/jp2' | ||
|
||
!!! warning "Advanced OCR/HOCR Settings" | ||
|
||
We do not recommend making changes to the following settings unless you are the System Administrator. | ||
|
||
|
||
8. The system path to the ghostscript (gs) binary that will be executed by this processor. | ||
- A full system path to the gs binary present in the same environment your PHP runs | ||
- Default path specified is: '/usr/bin/gs' | ||
|
||
9. Any additional argument your executable binary requires. | ||
- Any arguments your ghostscript (gs) binary requires to run. Use %file as replacement for the file if the executable requires the filename to be passed under a specific argument. We recommend testing with -r150 (150dpi image extraction) for better performance, but -r300 can also be used if source images in a PDF are small. | ||
- Default argument specified is: -r150 %file | ||
|
||
10. The system path to the Tesseract binary that will be executed by this processor. | ||
- A full system path to the Tesseract binary present in the same environment your PHP runs | ||
- Default path specified is: '/usr/bin/tesseract' | ||
|
||
11. Within the ADO's metadata, the JSON key that contains the language in ISO639-3 (3 letter) format to be used for OCR/NLP processing via Tesseract. | ||
- Default JSON key specified is: 'language_iso639_3' | ||
|
||
![Strawberry Runners OCR Part 2](images/sbr_ocr_part2.png) | ||
|
||
12. Please provide a default language in ISO639-3 (3 letter) format. If none is provided we will use 'eng' | ||
- Default language specified is: 'eng' | ||
|
||
13. Any additional argument for your tesseract binary. | ||
- Any arguments your binary requires to run. Use %file as replacement for the file that is output by the GS binary. Use %language as replacement for the chosen language. | ||
- Default arguments specified are: '%file stdout -l %language hocr' | ||
|
||
14. The data/languages folder for Tesseract | ||
- Absolute path where the Languages are stored in the Server. This will be used in --tessdata-dir | ||
- Default path specified is: '/usr/share/tessdata' | ||
|
||
15. The system path to the pdfalto binary that will be executed by this processor. | ||
- A full system path to the pdfalto binary present in the same environment your PHP runs, e.g. /usr/local/bin/pdfalto | ||
- Default path specified is: '/usr/bin/pdfalto' | ||
|
||
16. Any additional argument for your pdfalto binary. | ||
- Any arguments your binary requires to run. Use %file as replacement for the file that is output by the pdfalto binary. | ||
- Default arguments specified are: '%file -q' | ||
|
||
17. The expected and desired output of this processor. | ||
- If the output is just data and "One or more Files" is selected all data will be dumped into a file and handled as such. | ||
- Default selection is: 'Data/Values that can be serialized to JSON' | ||
- An additional option is to select 'One or more Files', but this is not recommended for the default `ocr` operation since it will alter how the data is incorporated into the Search API (Solr index). | ||
|
||
18. Where and how the output will be used. | ||
- Default selection is: 'In a Search API Document using the Strawberryfield Flavor Data Source (e.g used for HOCR highlight)' | ||
- An additional option is to select 'As Input for another processor Plugin', which will only have an effect if another Processor is set up to consume this output. | ||
|
||
19. The queue to use for this processor. | ||
- The primary queue will be executed in realtime while the Secondary will be executed in the background | ||
- Default selection is for the 'Secondary queue in background' | ||
|
||
20. Checkbox to Use NLP (Natural Language Processing) to extract entities from Text | ||
- If checked, the full text will be processed for Natural Language entity extraction using Polyglot. | ||
- Default option is to have the option checked. | ||
|
||
21. The URL location of your NLP64 server. | ||
- Defaults to http://esmero-nlp:6400 | ||
|
||
22. Which method(NER) to use | ||
- The NER NLP method to use to extract Agents, Places and Sentiment. | ||
- Default selection: 'Polyglot (faster)' | ||
- Alternative selection: 'spaCy (more accurate)' | ||
|
||
23. Timeout in seconds for this process. | ||
- Default selection is: 900 | ||
- If the process runs out of time it can still be processed again. | ||
|
||
24. Order or execution in the global chain. | ||
- Default selection is: 0 | ||
|
||
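The `%file` and `%language` placeholders used in the ghostscript and Tesseract argument settings above can be illustrated with a short Python sketch. This is an assumption about how such a template could be expanded, not a description of how Strawberry Runners actually does it: `build_command` and the example file paths are hypothetical, while the binary paths and argument templates are the documented defaults.

```python
import shlex


def build_command(binary, arguments, file_path, language=None):
    """Expand %file and %language in an argument template and return an argv list."""
    argv = [binary]
    # Split the template first, then substitute, so that a file path
    # containing spaces still ends up as a single argument.
    for token in shlex.split(arguments):
        token = token.replace("%file", file_path)
        if language is not None:
            token = token.replace("%language", language)
        argv.append(token)
    return argv


# Hypothetical file paths; binaries and templates are the documented defaults.
gs_cmd = build_command("/usr/bin/gs", "-r150 %file", "/tmp/source.pdf")
tesseract_cmd = build_command(
    "/usr/bin/tesseract",
    "%file stdout -l %language hocr",
    "/tmp/page 1.tiff",
    language="eng",
)

print(gs_cmd)
print(tesseract_cmd)
```

The sketch leaves out the `--tessdata-dir` handling mentioned in the Tesseract languages folder setting. Returning a list rather than a single string means the command could be handed to a process runner without shell quoting.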
___ | ||
|
||
Thank you for reading! Please contact us on our [Archipelago Commons Google Group](https://groups.google.com/forum/#!forum/archipelago-commons) with any questions or feedback. | ||
|
||
Return to the main [Strawberry Runners](strawberryrunners.md) or the [Archipelago Documentation main page](index.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
--- | ||
title: Strawberry Runners Post-Processing | ||
tags: | ||
- Strawberry Runners | ||
- WACZ | ||
- WARC | ||
- Binary | ||
- Post-processing | ||
- Background Processing | ||
--- | ||
|
||
# Reviewing and adjusting the `warc_to_wacz` Post-Processor operation |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
--- | ||
title: Strawberry Runners Post-Processing | ||
tags: | ||
- Strawberry Runners | ||
- Fulltext Search | ||
- WACZ | ||
- Webpage | ||
- Post-processing | ||
- Background Processing | ||
--- | ||
|
||
# Reviewing and adjusting the `wacz_page_extractor` and `webpage` Post-Processor operations |