Further AIP and SIP generation speed improvements #25
nutjob4life added a commit that referenced this issue Apr 7, 2020

- Resolve #21 with a new driver program `aipsip` that generates both the AIP and uses it to make the SIP as well, leaving all in the current working directory (along with two—count 'em, *two*—PDS labels for the price of one!).
- Updates the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
- Refactors logging and command-line argument setup (also for #21).
- Unifies logging between `aipgen` and `sipgen` with the new `aipsip` so that there are `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" of output.
- Resolve #13 so that instead of billions of redundant XML parsing and XPath lookups we use a local `sqlite3` database and LRU caching.
- Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
- Clear up logging messages so we can know what's calling what.
- Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups
  - But see also #25 for other uses of that DB.
- Add standardized `--version` arguments for all three programs.

With these changes, running `sipgen` on my Mac¹ can process a 272GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5TiB `insight_cameras` dataset in under 4 hours.

Footnotes:
- ¹2.4 GHz 8-core Intel Core i9, SSD
- ²2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive
jordanpadams added a commit that referenced this issue Apr 11, 2020

* Resolutions for #13 and #21
  - Resolve #21 with a new driver program `aipsip` that generates both the AIP and uses it to make the SIP as well, leaving all in the current working directory (along with two—count 'em, *two*—PDS labels for the price of one!).
  - Updates the Python `setuptools` metadata to generate the new `aipsip` (helps with #21).
  - Refactors logging and command-line argument setup (also for #21).
  - Unifies logging between `aipgen` and `sipgen` with the new `aipsip` so that there are `--debug` and `--quiet` options; without either you get a nominal amount of "hand-holding" of output.
  - Resolve #13 so that instead of billions of redundant XML parsing and XPath lookups we use a local `sqlite3` database and LRU caching.
  - Factor out XML parsing from `aipgen` and `sipgen` so we can apply caching.
  - Clear up logging messages so we can know what's calling what.
  - Create a temp DB in `sipgen` and populate it with mappings from lidvids to XML files for rapid lookups
    - But see also #25 for other uses of that DB.
  - Add standardized `--version` arguments for all three programs.

  With these changes, running `sipgen` on my Mac¹ can process a 272GiB `insight_cameras` export in 1:03. On `pdsimg-int1`², it handles the 1.5TiB `insight_cameras` dataset in under 4 hours.

  Footnotes:
  - ¹2.4 GHz 8-core Intel Core i9, SSD
  - ²2.3 GHz 8-core Intel Xeon Gold 6140, unknown drive
* Improvements for usability and bug fixes for validate errors
* After running validate, there were a few minor fixes that needed to be implemented.
* Commented out / removed several CLI options for the time being until functionality is fully developed.
* Updated file naming to take into account the bundle versioning separate from the AIP/SIP version
* Updated docs per new pds-deep-archive script which combines aipgen and sipgen.

Refs #21

Co-authored-by: Jordan Padams <[email protected]>
The Issue
Issue #13 demonstrated how certain data (like the 1.3TiB `insight_cameras`) basically caused `sipgen` to not terminate. We've addressed that by using better algorithms and adding caching, but we can go steps further.

For example, `sipgen` still does some redundant XML parsing and `aipgen` does some single-threaded hash generation that hits the Python GIL. In issue #13 we architected things to include a temporary `sqlite3` database that could be shared by numerous processes (using the `multiprocessing` module, for example) that's ripe for further optimizations.

Some Ideas
- `sqlite3` in `sipgen`: process XML files just once and store the useful information in multiple tables
- `sipgen`: use parallel processes and the `sqlite3` database to accelerate `aipgen`, for example, make one worker walk the directory tree for files to pass into a queue while multiple other workers snag files for MD5 digests.

Context
See issue #13 and the commits made against it.
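
As a rough illustration of the parallel-hashing idea under "Some Ideas", here is a minimal sketch. It is not the project's code: `md5_of_file`, `walk_files`, `hash_tree`, the `digests` table, and the `start_method` parameter are all hypothetical names made up for this example. It also approximates the "one walker, many hashers" layout with `multiprocessing.Pool.imap_unordered`, where the parent process walks the tree and feeds paths to the pool's internal task queue, rather than using a hand-built `Queue`.

```python
import hashlib
import multiprocessing
import os
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Return (path, hex MD5 digest), reading the file in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def walk_files(root):
    """Yield the path of every file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def hash_tree(root, db_path, workers=4, start_method=None):
    """Hash all files under root in worker processes; store results in SQLite.

    Assumption: a single table `digests(path, md5)` is enough for the sketch.
    Only the parent process writes to the database, so no locking is needed.
    """
    ctx = multiprocessing.get_context(start_method)  # None = platform default
    with ctx.Pool(processes=workers) as pool:
        # Open the connection after the workers exist so they never share it.
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS digests "
            "(path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
        )
        # Paths are produced lazily by the walker; digests arrive as each
        # worker finishes, in no particular order.
        results = pool.imap_unordered(md5_of_file, walk_files(root), chunksize=16)
        con.executemany("INSERT OR REPLACE INTO digests VALUES (?, ?)", results)
        con.commit()
    return con
```

When run as a script, a call such as `hash_tree("/path/to/bundle", "digests.db")` should sit under an `if __name__ == "__main__":` guard so that start methods which re-import the main module do not launch the pool again. Keeping all database writes in the parent is a deliberate simplification; having each worker write to the shared `sqlite3` file is possible but brings write-lock contention into play.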