Create a Python package to store package data? #3

namurphy · 2024-03-13T22:01:06Z

One possibility to simplify access to data files would be to include them in a Python package that could be made available on PyPI and conda-forge. The package could include functionality to open files with pandas, xarray, or h5py, which could then be imported into PlasmaPy.

Instead of needing to download the data files separately, they could be acquired via pip install plasmapy-data, and then accessed by PlasmaPy. We could potentially have plasmapy-data be a dependency of PlasmaPy. We could perhaps even allow installation without plasmapy-data via pip install plasmapy[lite] if the size of the data increases to ≳ 10 MB.
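As a rough illustration of how PlasmaPy might locate a file shipped inside such a package, here is a minimal sketch built on the standard library's `importlib.resources`. The module name `plasmapy_data` and the helper `data_path` are assumptions made for the example; no such package or function exists yet.

```python
from importlib.resources import as_file, files


def data_path(filename: str, package: str = "plasmapy_data"):
    """Return a context manager that yields a filesystem path to a packaged file.

    ``plasmapy_data`` is a hypothetical module name used only for illustration.
    """
    resource = files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(f"{filename!r} not found in package {package!r}")
    # as_file() also covers packages that are not unpacked on disk (e.g. zip imports)
    return as_file(resource)
```

The yielded path could then be handed to pandas, xarray, or h5py, for example `with data_path("some_file.h5") as path: ...`, where the file name is again a placeholder.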

So far, the sizes of the data files in this repository are well within what can reasonably be included in a Python package. PlasmaPy wheels are ∼9 MB and source distributions are ∼14 MB.

The main disadvantage of creating a package is that we would have an additional package to maintain, but there are tools like cruft that could simplify package maintenance. I don't expect the amount of maintenance for this package to be very large compared to the main PlasmaPy repo, though. We would want to make the release process simpler than for the main PlasmaPy repo (i.e., by avoiding changelogs).

We'd have to figure out what we'd want to do with data used in tests. If PlasmaPy moves to an src layout with a separate tests directory, then the test data could live in the tests directory.

An advantage of incorporating the data into a Python package is that it could be cached in GitHub Actions very straightforwardly.

I do not know if this is the best approach, so I'd also like to look into best practices and check with people in pyOpenSci about alternatives.

This will take quite a bit more discussion, so we should proceed with PlasmaPy/PlasmaPy#2570 (which we may need for especially large data sets).

@pheuer, @JaydenR2305 — I'm curious what your thoughts are on this!


pheuer · 2024-03-14T00:54:14Z

I think the answer to this depends a lot on whether the amount of data included remains small (say <10 MB), medium (~100 MB) or large (> 1 GB). Right now, the NIST_STAR file is the largest at ~2 MB: I don't see that growing much. The nascent nuclear datafile is ~100 kB, which I could see growing to 1+ MB. If we add more nuclear data, say scattering cross-sections for elastic collisions, that might be another 1-10 MB.

If the total data is <10 MB, it could probably be just incorporated into PlasmaPy directly. Any more than that starts to seem like a lot to me. For the features I can imagine right now, I think we'll get to around 5-10 MB. But, that makes me nervous that we'll likely exceed that limit when we think of new features!

I do think adding broadly applicable data (especially the nuclear data) is quite valuable, since this data is often quite difficult to access (ENDF is a pain!) but is critical to doing even basic calculations in plasma/fusion/nuclear science. I don't want to go crazy adding things, but I think a reasonable test is that if data could be required for an educational notebook or a homework calculation, it should be in scope.

An important question: @namurphy do you know the package limit size for PlasmaPy on PyPI? It seems to vary from package to package - I've seen lots of numbers? Some like 100-150 MB? Do we have room for 10-20 MB of data? If so, maybe for now we should just move these files into the main package (and keep #2570 in the wings for any future large files).

Separate package idea

Storing data in a separate package is an interesting idea - we could have a function in PlasmaPy that tries to import the relevant data from the package and prompts the user to install the package if it can't find the files.
Pros:

Relatively easy to implement
Easy to cache (as Nick mentioned).

Cons:

Hard to update (need to re-release the package). This probably isn't a good solution for data files that are changing with any regularity as we add features.
Seems like a slight abuse of the concept of a package? Is anyone else doing something like this?
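The "try to import, otherwise prompt the user" idea could look roughly like the sketch below. The helper name `require_data_package` and the default module name `plasmapy_data` are hypothetical; this is one possible shape, not an existing PlasmaPy API.

```python
import importlib


def require_data_package(name: str = "plasmapy_data"):
    """Import and return the data package, or raise an error with an install hint.

    ``plasmapy_data`` is a hypothetical module name used only for illustration.
    """
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # Re-raise untouched if a *different* module (e.g. a dependency) is missing
        if exc.name != name:
            raise
        raise ImportError(
            f"This feature needs the optional data package {name!r}. "
            f"Install it with: pip install {name.replace('_', '-')}"
        ) from exc
```

Functions that need data would call this helper at the point of use, so importing PlasmaPy itself would still work without the data package installed.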

File server idea

Another option would be to host these files on a public HTTP fileserver on the plasmapy.org domain, e.g. plasmapy.org/files. I'm not sure how web hosting for that domain is handled, but there may be an option to add a file server.
Pros:

Extensible to very large files

Cons:

Hard to cache values?
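On the caching concern: one common pattern is to download each file once into a local directory and reuse it afterwards. The sketch below uses only the standard library; the function name `fetch_cached` and the cache location are assumptions, and it deliberately ignores the question of how to detect that a cached file has gone stale.

```python
import urllib.request
from pathlib import Path


def fetch_cached(url: str, cache_dir: Path) -> Path:
    """Download ``url`` into ``cache_dir`` once; later calls reuse the local copy."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Write to a temporary name first so that an interrupted download
        # is never mistaken for a complete file.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            out.write(response.read())
        partial.replace(target)
    return target
```

In CI, the cache directory could be saved and restored between runs, which would address part of the caching concern, though updates to files on the server would still need a versioning or checksum scheme.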

namurphy · 2024-03-14T01:15:44Z

Thank you for the thoughtful reply!

> An important question: @namurphy do you know the package limit size for PlasmaPy on PyPI? It seems to vary from package to package - I've seen lots of numbers? Some like 100-150 MB? Do we have room for 10-20 MB of data? If so, maybe for now we should just move these files into the main package (and keep #2570 in the wings for any future large files).

This is a hard number to find! I saw 60 MB in one place from a few years ago. It's probably a better practice (good manners?) to keep package sizes ≲20 MB.

> * Hard to update (need to re-release the package). This probably isn't a good solution for data files that are changing with any regularity as we add features.

Doing a release is fairly straightforward nowadays with GitHub Actions. For PlasmaPy, the release itself takes only a few minutes, but there are a bunch of peripheral tasks.

> * Seems like a slight abuse of the concept of a package? Is anyone else doing something like this?

I won't tell PyPI if you don't! Good thing there won't be a public record of me saying that. 🙃

namurphy mentioned this issue Mar 14, 2024

Create a class to manage local and online resource files PlasmaPy/PlasmaPy#2570

Merged
