Merge pull request #185 from alliomeria/1.3.0
Documentation Updates + New Section for Strawberry Runners
DiegoPino authored Dec 21, 2023
2 parents cc76bbf + f59d3ff commit 1bb1832
Showing 13 changed files with 312 additions and 26 deletions.
2 changes: 2 additions & 0 deletions docs/find_and_replace.md
@@ -66,9 +66,11 @@ The default options available through the Action dropdown menu include:
- *`Publish Digital Object`
- *`Unpublish Digital Object`
- *`Change the author of content`
- `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item`
- *`Delete selected entities/translations`

_* denotes Action options that are also shared with the main `Content` Page Action Menu_
_You can read more about [Strawberry Runners Post-Processing Actions here](docs/strawberryrunners.md)_

![Find and Replace Actions](images/find_and_replace_actions.jpg)

Binary file modified docs/images/find_and_replace_actions.jpg
Binary file added docs/images/sbr_ocr_part1.png
Binary file added docs/images/sbr_ocr_part2.png
Binary file added docs/images/sbr_pager.png
Binary file added docs/images/strawberryrunnershome.png
9 changes: 4 additions & 5 deletions docs/inthewild.md
@@ -26,8 +26,7 @@ The Archipelagos listed below are supported by the [Digital Services Team at the

- [Frick Collection and Webrecorder Team Web Archives Collaboration](https://webarchive.archipelago.nyc)

- Hamilton College Library & IT Services (https://litsdigital.hamilton.edu/)
- Migration to Archipelago in final stages; Official launch of new site Autumn 2023
- [Hamilton College Library & IT Services](https://litsdigital.hamilton.edu/)

- [Olin College Library Phoenix Files](https://phoenixfiles.olin.edu)
- *Early adopter - live since Summer 2020
@@ -41,17 +40,17 @@ The Archipelagos listed below are supported by the [Digital Services Team at the
- [Union College Library](https://arches.union.edu)

- [Western Washington University](https://library.wwu.edu/)
- Migration to Archipelago began Summer 2022; Launch of new site ~Winter 2023
- Migration to Archipelago began Summer 2022; Launch of new site ~Winter 2023/24

## Neighborhood Archipelagos

From all around our beautiful shared world. 🏡 🏫 🏛️

- [Amherst College](https://acdc.amherst.edu)
- _Migration to Archipelago began Spring 2022_
- Migration to Archipelago began Spring 2022

- [Association Montessori Internationale](https://montessori-ami.org/)
- _Development of Archipelago environment began Summer 2022_
- Development of Archipelago environment began Summer 2022; Launch of new site Spring 2024

- [California Revealed](https://repository.californiarevealed.org/)

15 changes: 13 additions & 2 deletions docs/presentations_events.md
@@ -7,8 +7,19 @@
[METRO's Digital Services Team](https://metro.org/digital-services) facilitated many different internal training sessions throughout 2020-2022. If you and your team need access to any of these sessions that were recorded, please [contact us](mailto:[email protected]). Thank you!

## 2023
- [Experimental IIIF Kitchen using Archipelago. Pino Navarro, Diego; Sherrick, Allison. (June 2023)](https://tinyurl.com/apiiif2023)
- [Mapping an Engineer Through IIIF. Monger, Jenifer J.; McCarthy, Brenden; Pino Navarro, Diego; Sherrick, Allison. (June 2023)](https://tinyurl.com/2x9mshx5)

- IIIF Search API and Dynamic/evolving Manifest Generation: Facing the Unknown. Diego Alberto Pino Navarro, Allison Sherrick.
- [📺 Recording available](https://www.youtube.com/watch?v=z9yay9J16T0&t=1050s)

- DLF Forum (November 2023)
- [Working and Learning the IIIF Search API in Archipelago. Diego Pino Navarro, Allison Sherrick.](https://osf.io/p4m9s/)
- [Working with Open-Schema JSON in Archipelago. Allison Sherrick, Diego Pino Navarro, Martha Tenney, Joanna DiPasquale, Corinne Chatnik.](https://osf.io/dx3fm/)
- [Slaying the Migration Dragon: Approaches to Navigating an Open Source System Migration. Lisa McFall, Sarah Walden McGowan, Brenden McCarthy, Shay Foley.](https://osf.io/aymhd/)

- IIIF Annual Conference (June 2023)
- [Experimental IIIF Kitchen using Archipelago. Pino Navarro, Diego; Sherrick, Allison.](https://tinyurl.com/apiiif2023)
- [Mapping an Engineer Through IIIF. Monger, Jenifer J.; McCarthy, Brenden; Pino Navarro, Diego; Sherrick, Allison.](https://tinyurl.com/2x9mshx5)

- Into Archipelago Commons: Access, Innovation and Community in Modern Archives. Monger, Jenifer J.; McCarthy, Brenden; Corinne Chatnik. (June 2023)
- Implementing Archipelago: An Innovative, Community Driven, Open-Source Repository. Corinne Chatnik, Union College; Martha Tenney, Barnard College. (June 2023)
- [For the Love of Data and Ourselves: The Bumpy, Technical Road to Modern Archives. Monger, Jenifer J.; McCarthy, Brenden. (January/February 2023)](https://mydigitalpublication.com/publication/?m=30305&l=1)
58 changes: 58 additions & 0 deletions docs/strawberryrunners.md
@@ -0,0 +1,58 @@
---
title: Strawberry Runners Post-Processing
tags:
- Strawberry Runners
- HOCR
- OCR
- Post-processing
- Background Processing
---

# Strawberry Runners Post-Processing Configuration

Archipelago's [Strawberry Runners (SBR)](https://github.com/esmero/strawberry_runners) module provides a set of post-processing capabilities for the JSON-based metadata, files, and entities that comprise your Archipelago Digital Objects (ADOs). These post-processing actions are based on dispatched events, direct HTTP calls, and invoked webhooks from partner services (such as Min.io, AWS S3 or self-invoked).

The default Archipelago SBR post-processor configurations include operations that:
- perform page-based HOCR/OCR for image and pdf-based ADOs, send the output to the Search API, and use Natural Language Processing to extract entities from the output
- extract text from pages within a Webarchives File and send the output to the Search API
- convert WARC format Webarchives Files into WACZ format and attach the new WACZ file to the original source ADO to complement the WARC original

SBR actions can be chained and nested to enable ordered operations, such as first extracting individual pages in an ordered sequence and then running HOCR/OCR across those individual pages.
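
The idea of chained and nested actions can be illustrated with a minimal Python sketch. This is not the module's actual scheduling code (Strawberry Runners is a Drupal module); the `Processor` class, its `weight` field, and the `execution_order` function are hypothetical names used only to show how a nested action runs after its parent completes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Processor:
    """Hypothetical stand-in for one SBR post-processor configuration."""
    label: str
    weight: int = 0  # assumed to mirror the order setting in the global chain
    children: List["Processor"] = field(default_factory=list)


def execution_order(processors):
    """Return labels in run order: each parent first, then its nested children."""
    ordered = []
    # sorted() is stable, so processors with equal weight keep their listed order.
    for proc in sorted(processors, key=lambda p: p.weight):
        ordered.append(proc.label)
        ordered.extend(execution_order(proc.children))
    return ordered


# Assumed layout based on the default configurations described in this guide.
chain = [
    Processor("pager", children=[Processor("ocr")]),
    Processor("wacz_page_extractor", children=[Processor("webpage")]),
    Processor("warc_to_wacz"),
]
print(execution_order(chain))
```

In this sketch a nested action such as `ocr` is only reached after its parent `pager` has been appended, which mirrors the "executed after the completion of" relationship described for the default configurations.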

## Strawberry Runners Settings Overview

You can access the Strawberry Runners Settings:

- Through the `Manage` menu > `Configuration` > `Archipelago` > `Configure Strawberry Runners Post Processors`
- Directly at `/admin/config/archipelago/strawberry_runners`

On the Strawberry Runners Settings page, you will see the Archipelago default post processor configurations (unless modified).

![Strawberry Runners Home](images/strawberryrunnershome.png)

1. The `pager` action uses the 'Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO' plugin.
2. Nested one level in, the `ocr` action uses the 'Post processor that Runs OCR/HORC against files' plugin. The `ocr` operations will be executed after the completion of the `pager` operations.
3. The `wacz_page_extractor` action uses the 'Post processor that extracts/generates Indexed Page Content from WACZ files in an ADO' plugin.
4. Nested one level in, the `webpage` action uses the 'Post processor that Indexes WACZ Frictionless data Search Index to Search API' plugin. The `webpage` operations will be executed after the completion of the `wacz_page_extractor` operations.
5. The `warc_to_wacz` action uses the 'Post processor that uses a System Binary to process * files' operations.

## Reviewing and Adjusting the default Post-Processors

From the main Strawberry Runners Settings page, you can review and adjust the settings for the default Archipelago configurations by selecting `Edit` from the `Operations` menu.

Please see the following guides for:

- [Adjusting the `pager` and `ocr` operations](strawberryrunners_pager_ocr.md)
- [Adjusting the `wacz_page_extractor` and `webpage` operations](strawberryrunners_webpage_text.md)
- [Adjusting the `warc_to_wacz` operation](strawberryrunners_wacz_binary.md)

## Triggering Post-Processing Actions Manually

After making adjustments to Strawberry Runners Post-Processing configurations, you may want to trigger/re-trigger a particular action manually.

You can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select the `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions menu`.

### Additional Post Processor Operations

Archipelago also includes the `Post processor that writes/reads Frictionless Data Packages` plugin. Please keep a lookout for future documentation related to using this plugin.

187 changes: 187 additions & 0 deletions docs/strawberryrunners_pager_ocr.md
@@ -0,0 +1,187 @@
---
title: Strawberry Runners Post-Processing
tags:
- Strawberry Runners
- HOCR
- OCR
- Pager
- Post-processing
- Background Processing
---

# Reviewing and adjusting the `pager` and `ocr` Post-Processor operations

The `pager` and `ocr` Post-processor operations are likely the most important pair of Strawberry Runners in your Archipelago.

As stated on the [Strawberry Runners overview page](docs/strawberryrunners.md), the `pager` action uses the 'Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO' plugin. Nested one level in, the `ocr` action uses the 'Post processor that Runs OCR/HORC against files' plugin. The `ocr` operations will be executed after the completion of the `pager` operations.

Common changes you may wish to make for the `pager` and/or `ocr` operations include adding or removing particular types of Archipelago Digital Objects to apply these operations to.

# Pager Settings

To review or adjust the configurations for the `pager` operation, select `Edit` from the `Operations` menu.

In the `pager` settings, you will see the following configuration options:

![Strawberry Runners Pager](images/sbr_pager.png)

1. Label:
- Label for this Processor; which should be a unique machine-readable name
- Can only contain lowercase letters, numbers, and underscores
- We do not recommend changing this Label from the default `pager`.

2. Strawberry Runner Post Processor Plugin:
- The `Post processor that extracts/generates Ordered Sequences of files/pages/children using Files present in an ADO` should be selected.
- We do not recommend changing this Plugin selection.

3. Checkbox to mark this processor plugin as active
- We recommend keeping this checked as `active` at all times, but you may wish to temporarily disable this if you are performing certain types of administrative review tasks such as running large test ingests where you plan on deleting the ADOs before a final ingest.
- If you accidentally uncheck this and need to re-trigger the `pager` (and corresponding nested `ocr` action), you can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select the `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions menu`.

4. ADO type(s) to limit this processor to:
- A single ADO type or a comma delimited list of ado types that qualify to be Processed.
- Leave empty to apply to all ADOs. If you do not provide any specific ADO types here, the processor will be applied for all ADOs with the JSON keys selected in the next step.
- Default ADO types specified are: 'Document,Book,Article'
- You may wish to add additional document or multi-page ADO types to this list that are custom to your Archipelago environment.

5. The JSON key that contains the desired source files:
- By default, the `as:image` and `as:document` keys are selected.
- We do not recommend changing this selection.

6. Mimetypes(s) to limit this Processor to:
- A single mimetype or a comma separated list of mimetypes that qualify to be Processed.
- Leave empty to apply to any file.
- Default mimetypes are: 'application/pdf,image/tiff,image/jpeg,image/jp2'

7. Within the ADO's metadata, the JSON key that contains the language in ISO639-3 (3 letter) format to be used for OCR/NLP processing via Tesseract.
- Default JSON key specified is: 'language_iso639_3'

8. Please provide a default language in ISO639-3 (3 letter) format. If none is provided we will use 'eng'.
- Default language specified is: 'eng'

9. Timeout in seconds for this process.
- If the process runs out of time it can still be processed again
- Default selection is: 10

10. Order or execution in the global chain.
- Default selection is: 0
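
The filtering behavior described by the ADO type, mimetype, and language settings above can be sketched as follows. This is a simplified illustration in Python, not the module's implementation; the function names `parse_list`, `qualifies`, and `ocr_language` are hypothetical, and the assumption that the language value may be either a string or a list is ours.

```python
def parse_list(value):
    """Split a comma-delimited setting into a set; an empty setting gives an empty set."""
    return {item.strip() for item in value.split(",") if item.strip()}


def qualifies(ado_type, mimetype, ado_types_setting, mimetypes_setting):
    """Return True when a file passes both limits; an empty limit matches everything."""
    allowed_types = parse_list(ado_types_setting)
    allowed_mimes = parse_list(mimetypes_setting)
    type_ok = not allowed_types or ado_type in allowed_types
    mime_ok = not allowed_mimes or mimetype in allowed_mimes
    return type_ok and mime_ok


def ocr_language(metadata, key="language_iso639_3", default="eng"):
    """Read the ISO639-3 language from ADO metadata, falling back to the default."""
    value = metadata.get(key)
    if isinstance(value, list):  # assumption: the key may hold a list of codes
        value = value[0] if value else None
    return value or default


DEFAULT_TYPES = "Document,Book,Article"
DEFAULT_MIMES = "application/pdf,image/tiff,image/jpeg,image/jp2"

print(qualifies("Book", "application/pdf", DEFAULT_TYPES, DEFAULT_MIMES))
print(qualifies("Photograph", "image/jpeg", DEFAULT_TYPES, DEFAULT_MIMES))
print(ocr_language({"language_iso639_3": "spa"}))
```

Leaving either setting empty makes that check pass for every ADO or file, which is why the guide recommends keeping the defaults unless you have custom ADO types to add.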

# OCR / HOCR Settings

To review or adjust the configurations for the `ocr` operation, select `Edit` from the `Operations` menu.

In the `ocr` settings, you will see several different configuration options.

![Strawberry Runners OCR Part 1](images/sbr_ocr_part1.png)


1. Label:
- Label for this Processor; which should be a unique machine-readable name
- Can only contain lowercase letters, numbers, and underscores
- We do not recommend changing this Label from the default `ocr`.

2. Strawberry Runner Post Processor Plugin:
- The `Post processor that Runs OCR/HORC against files` should be selected.
- We do not recommend changing this Plugin selection.

3. Checkbox to mark this processor plugin as active
- We recommend keeping this checked as `active` at all times, but you may wish to temporarily disable this if you are performing certain types of administrative review tasks such as running large test ingests where you plan on deleting the ADOs before a final ingest.
- If you accidentally uncheck this and need to re-trigger the `pager` (and corresponding nested `ocr` action), you can use Archipelago's [Find and Replace](docs/find_and_replace.md) to first select a specific group of Digital Objects you wish to target for Post-Processing, then select the `Trigger Strawberrry Runners process/reprocess for Archipelago Digital Objects content item` from the [Find and Replace](docs/find_and_replace.md) `Actions menu`.

4. The type of source data this processor works on:
- Select from where the source file this processor needs is fetched.
- Default selection of 'File entities referenced in the as:filetype JSON structure'.
- You also have the option of selecting 'Full file paths passed by another processor', but we do not recommend using this option as it is less granular in its application.

5. ADO type(s) to limit this processor to:
- A single ADO type or a comma delimited list of ado types that qualify to be Processed
- Leave empty to apply to all ADOs. If you do not provide any specific ADO types here, the processor will be applied for all ADOs with the JSON keys selected in the next step.
- Default ADO types specified are: 'Document,Book,Article'
- You may wish to add additional document or multi-page ADO types to this list that are custom to your Archipelago environment.

6. The JSON key that contains the desired source files:
- By default, the `as:image` and `as:document` keys are selected.
- We do not recommend changing this selection.

7. Mimetypes(s) to limit this Processor to:
- A single mimetype or a comma separated list of mimetypes that qualify to be Processed.
- Leave empty to apply to any file.
- Default mimetypes are: 'application/pdf,image/tiff,image/jpeg,image/jp2'

!!! warning "Advanced OCR/HOCR Settings"

We do not recommend making changes to the following settings unless you are the System Administrator.


8. The system path to the ghostscript (gs) binary that will be executed by this processor.
- A full system path to the gs binary present in the same environment your PHP runs
- Default path specified is: '/usr/bin/gs'

9. Any additional argument your executable binary requires.
- Any arguments your ghostscript (gs) binary requires to run. Use %file as replacement for the file if the executable requires the filename to be passed under a specific argument. We recommend testing with -r150 (150dpi image extraction) for better performance but -r300 can be also used if source Images in a PDF are small
- Default argument specified is: -r150 %file

10. The system path to the Tesseract binary that will be executed by this processor.
- A full system path to the Tesseract binary present in the same environment your PHP runs
- Default path specified is: '/usr/bin/tesseract'

11. Within the ADO's metadata, the JSON key that contains the language in ISO639-3 (3 letter) format to be used for OCR/NLP processing via Tesseract.
- Default JSON key specified is: 'language_iso639_3'

![Strawberry Runners OCR Part 2](images/sbr_ocr_part2.png)

12. Please provide a default language in ISO639-3 (3 letter) format. If none is provided we will use 'eng'
- Default language specified is: 'eng'

13. Any additional argument for your tesseract binary.
- Any arguments your binary requires to run. Use %file as replacement for the file that is output by the GS binary. Use %language as replacement for the chosen language.
- Default arguments specified are: '%file stdout -l %language hocr'

14. The data/languages folder for Tesseract
- Absolute path where the Languages are stored in the Server. This will be used in --tessdata-dir
- Default path specified is: '/usr/share/tessdata'

15. The system path to the pdfalto binary that will be executed by this processor.
- A full system path to the pdfalto binary present in the same environment your PHP runs, e.g /usr/local/bin/pdfalto
- Default path specified is: '/usr/bin/pdfalto'

16. Any additional argument for your pdfalto binary.
- Any arguments your binary requires to run. Use %file as replacement for the file that is output by the pdfalto binary.
- Default arguments specified are: '%file -q

17. The expected and desired output of this processor.
- If the output is just data and "One or more Files" is selected all data will be dumped into a file and handled as such.
- Default selection is: 'Data/Values that can be serialized to JSON'
- An additional option is to select 'One or more Files', but this is not recommended for the default `ocr` operation since it will alter how the data is incorporated into the Search API (Solr index).

18. Where and how the output will be used.
- Default selection is: 'In a Search API Document using the Strawberryfield Flavor Data Source (e.g used for HOCR highlight)'
- Additional option to select 'As Input for another processor Plugin' --which will only have an effect if another Processor is setup to consume this output.

19. The queue to use for this processor.
- The primary queue will be executed in realtime while the Secondary will be executed in the background
- Default selection is for the 'Secondary queue in background'

20. Checkbox to Use NLP (Natural Language Processing) to extract entities from Text
- If checked Full text will be processed for Natural language Entity extraction using Polyglot.
- Default option is to have the option checked.

21. The URL location of your NLP64 server.
- Defaults to http://esmero-nlp:6400

22. Which method(NER) to use
- The NER NLP method to use to extract Agents, Places and Sentiment.
- Default selection: 'Polyglot (faster)'
- Alternative selection: 'spaCy (more accurate)'

23. Timeout in seconds for this process.
- If the process runs out of time it can still be processed again.
- Default selection is: 900

24. Order or execution in the global chain.
- Default selection is: 0
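
The `%file` and `%language` placeholders used in the binary argument settings above can be pictured with a short Python sketch. How the module performs this substitution internally is not described here, so `build_command` is a hypothetical helper that only illustrates the idea of expanding placeholders into a final argument list.

```python
import shlex


def build_command(binary, arguments, file_path, language=None):
    """Expand %file / %language placeholders and return an argv-style list."""
    argv = [binary]
    # Split the template first so a file path containing spaces stays one argument.
    for token in shlex.split(arguments):
        token = token.replace("%file", file_path)
        if language is not None:
            token = token.replace("%language", language)
        argv.append(token)
    return argv


# Default paths and argument templates as listed in the settings above.
gs_cmd = build_command("/usr/bin/gs", "-r150 %file", "/tmp/in.pdf")
tesseract_cmd = build_command(
    "/usr/bin/tesseract",
    "%file stdout -l %language hocr",
    "/tmp/page 1.png",  # hypothetical file path
    language="eng",
)
print(gs_cmd)
print(tesseract_cmd)
```

Building an argument list rather than a single shell string is a deliberate choice in this sketch: it avoids quoting problems when file names contain spaces.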

___

Thank you for reading! Please contact us on our [Archipelago Commons Google Group](https://groups.google.com/forum/#!forum/archipelago-commons) with any questions or feedback.

Return to the main [Strawberry Runners](strawberryrunners.md) page or the [Archipelago Documentation main page](index.md).
12 changes: 12 additions & 0 deletions docs/strawberryrunners_wacz_binary.md
@@ -0,0 +1,12 @@
---
title: Strawberry Runners Post-Processing
tags:
- Strawberry Runners
- WACZ
- WARC
- Binary
- Post-processing
- Background Processing
---

# Reviewing and adjusting the `warc_to_wacz` Post-Processor operation
12 changes: 12 additions & 0 deletions docs/strawberryrunners_webpage_text.md
@@ -0,0 +1,12 @@
---
title: Strawberry Runners Post-Processing
tags:
- Strawberry Runners
- Fulltext Search
- WACZ
- Webpage
- Post-processing
- Background Processing
---

# Reviewing and adjusting the `wacz_page_extractor` and `webpage` Post-Processor operations