Further AIP and SIP generation speed improvements #25

Open
nutjob4life opened this issue Apr 7, 2020 · 0 comments

The Issue

Issue #13 demonstrated how certain data (like the 1.3TiB insight_cameras) basically caused sipgen to not terminate. We've addressed that by using better algorithms and adding caching, but we can go steps further.

For example, sipgen still does some redundant XML parsing, and aipgen does some single-threaded hash generation that hits the Python GIL. In issue #13 we architected things to include a temporary sqlite3 database that could be shared by numerous processes (using the multiprocessing module, for example), and that database is ripe for further optimizations.
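
As a rough sketch of the "parse once, look up many times" idea, here is one way a temporary sqlite3 database could map lidvids to label files. The table name `labels` and the helpers `read_lidvid`, `build_index`, and `path_for` are made up for illustration, and the code assumes each label carries `logical_identifier` and `version_id` elements; it is not what sipgen actually does.

```python
import sqlite3
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit('}', 1)[-1]


def read_lidvid(xml_path):
    """Parse a label once and return 'lid::vid', or None if either part is missing.

    Assumption: the first logical_identifier and version_id elements in
    document order identify the product.
    """
    lid = vid = None
    for elem in ET.parse(xml_path).getroot().iter():
        name = local_name(elem.tag)
        if name == 'logical_identifier' and lid is None:
            lid = (elem.text or '').strip()
        elif name == 'version_id' and vid is None:
            vid = (elem.text or '').strip()
        if lid and vid:
            return f'{lid}::{vid}'
    return None


def build_index(con, xml_paths):
    """Store one (lidvid, path) row per label so later lookups skip XML parsing."""
    con.execute(
        'CREATE TABLE IF NOT EXISTS labels (lidvid TEXT PRIMARY KEY, path TEXT NOT NULL)'
    )
    rows = ((read_lidvid(p), str(p)) for p in xml_paths)
    with con:  # commit once for the whole batch
        con.executemany(
            'INSERT OR REPLACE INTO labels VALUES (?, ?)',
            (row for row in rows if row[0]),
        )


def path_for(con, lidvid):
    """Return the label path recorded for a lidvid, or None if it is unknown."""
    row = con.execute('SELECT path FROM labels WHERE lidvid = ?', (lidvid,)).fetchone()
    return row[0] if row else None
```

Further tables (checksums, file sizes, product references) could be filled during the same single pass over each file.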

Some Ideas

  • Additional use of sqlite3 in sipgen: process XML files just once and store the useful information in multiple tables
  • Multiprocessing: in sipgen use parallel processes and the shared sqlite3 database to accelerate processing
  • Producer-consumer: make multiprocessing workers consume XML and hash computations as they are done; in aipgen, for example, make one worker walk the directory tree for files to pass into a queue while multiple other workers snag files for MD5 digests.
  • Streaming: provide information as it becomes available so users have feedback that things are getting done instead of wondering if things are just hanging
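
A minimal sketch of the producer-consumer shape for the aipgen case: one producer walks the directory tree and feeds a queue while several workers pull paths off it and compute MD5 digests. To keep the example short and self-contained it uses threads and `queue.Queue` rather than the multiprocessing module; the same structure should carry over to `multiprocessing.Process` and `multiprocessing.Queue`. The function names (`walk_files`, `md5_worker`, `md5_tree`) are hypothetical and not part of aipgen.

```python
import hashlib
import os
import queue
import threading

_DONE = object()  # sentinel telling a worker to stop


def walk_files(root, work):
    """Producer: put every file path under root onto the work queue."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            work.put(os.path.join(dirpath, name))


def md5_worker(work, results):
    """Consumer: hash files from the work queue until the sentinel arrives."""
    while True:
        path = work.get()
        if path is _DONE:
            break
        digest = hashlib.md5()
        try:
            with open(path, 'rb') as f:
                # Read in 1 MiB chunks so huge files never sit in memory at once
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    digest.update(chunk)
        except OSError:
            results.put((path, None))  # unreadable file: record it without a digest
            continue
        results.put((path, digest.hexdigest()))


def md5_tree(root, workers=4):
    """Return {path: MD5 hex digest} for every file under root."""
    work, results = queue.Queue(maxsize=1024), queue.Queue()
    threads = [
        threading.Thread(target=md5_worker, args=(work, results))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    walk_files(root, work)  # the calling thread acts as the producer
    for _ in threads:
        work.put(_DONE)  # one sentinel per worker
    for t in threads:
        t.join()
    digests = {}
    while not results.empty():
        path, digest = results.get()
        digests[path] = digest
    return digests
```

Because results arrive on a queue as they are computed, a caller could also drain `results` while the workers are still running and log progress, which would cover the streaming idea. Whether threads or processes are faster here would need measuring on real data.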

Context

See issue #13 and the commits made against it.

@nutjob4life added the enhancement and triage-needed labels Apr 7, 2020
@nutjob4life added this to the PDS.14 (ends 2020-04-08) milestone Apr 7, 2020
nutjob4life added a commit that referenced this issue Apr 7, 2020
- Resolve #21 with a new driver program `aipsip` that generates both the AIP and uses it to make the SIP as well, leaving all in the current working directory (along with two—count 'em, *two*—PDS labels for the price of one!).
    - Updates the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
    - Refactors logging and command-line argument setup (also for #21).
- Unifies logging between `aipgen` and `sipgen` with the new `aipsip` so that there are `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" of output.
- Resolve #13 so that instead of billions of redundant XML parses and XPath lookups we use a local `sqlite3` database and LRU caching.
    - Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
    - Clear up logging messages so we can know what's calling what.
    - Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups
        - But see also #25 for other uses of that DB.
- Add standardized `--version` arguments for all three programs.

With these changes, `sipgen` on my Mac¹ can process a 272GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5TiB `insight_cameras` dataset in under 4 hours.

Footnotes:

- ¹2.4 GHz 8-core Intel Core i9, SSD
- ²2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive
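
For readers unfamiliar with the LRU caching mentioned in the commit message above, a minimal illustration using the standard library's `functools.lru_cache` follows. The function name `parse_label` and the cache size are assumptions made for the example, not the code in the commit.

```python
import functools
import xml.etree.ElementTree as ET


@functools.lru_cache(maxsize=128)
def parse_label(path):
    """Parse an XML label, reusing the tree if the same path was parsed recently."""
    return ET.parse(path).getroot()
```

Repeated calls with the same path return the same cached element, so callers should treat it as read-only; `parse_label.cache_info()` reports hits and misses if you want to check that the cache is doing anything.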
@jordanpadams removed this from the PDS.14 (ends 2020-04-08) milestone Apr 9, 2020
jordanpadams added a commit that referenced this issue Apr 11, 2020
* Resolutions for #13 and #21

* Improvements for usability and bug fixes for validate errors

* After running validate, there were a few minor fixes that needed to be implemented.
* Commented out / removed several CLI options for the time being until functionality is fully developed.
* Updated file naming to take into account bundle versioning separate from the AIP/SIP version
* Updated docs per new pds-deep-archive script which combines aipgen and sipgen.

Refs #21

Co-authored-by: Jordan Padams <[email protected]>
@github-project-automation github-project-automation bot moved this to Release Backlog in EN Portfolio Backlog Nov 20, 2023