Create a Python package to store package data? #3

Open
namurphy opened this issue Mar 13, 2024 · 2 comments

Comments

@namurphy
Member

namurphy commented Mar 13, 2024

One possibility to simplify access to data files would be to include them in a Python package that could be made available on PyPI and conda-forge. The package could include functionality to open files with pandas, xarray, or h5py, which could then be imported into PlasmaPy.

Instead of needing to download the data files separately, they could be acquired via pip install plasmapy-data, and then accessed by PlasmaPy. We could potentially have plasmapy-data be a dependency of PlasmaPy. We could perhaps even allow installation without plasmapy-data via pip install plasmapy[lite] if the size of the data increases to ≳ 10 MB.
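As a rough sketch of how such a data package might expose its files, the standard library's `importlib.resources` can locate files bundled inside an installed package. The import name `plasmapy_data` and the helper `get_data_path` are assumptions made for illustration; neither exists yet.

```python
from importlib.resources import files


def get_data_path(filename, package="plasmapy_data"):
    """Return a handle to a data file bundled inside ``package``.

    ``plasmapy_data`` is a hypothetical import name used for illustration.
    """
    # files() imports the package and returns a Traversable for its root.
    resource = files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(f"{filename!r} is not bundled with {package!r}")
    return resource
```

The returned object supports `read_text()`, `read_bytes()`, and `open()`. When a reader such as pandas, xarray, or h5py needs a real filesystem path, `importlib.resources.as_file` can provide one.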

So far, the data files in this repository are well within the size that can reasonably be included in a Python package. PlasmaPy wheels are ∼9 MB and source distributions are ∼14 MB.

The main disadvantage of creating a package is that we would have an additional package to maintain, but there are tools like cruft that could simplify package maintenance. I don't expect the amount of maintenance for this package to be very large compared to the main PlasmaPy repo, though. We would want to make the release process simpler than for the main PlasmaPy repo (e.g., by avoiding changelogs).

We'd have to figure out what we'd want to do with data used in tests. If PlasmaPy moves to an src layout with a separate tests directory, then the test data could live in the tests directory.

An advantage of incorporating the data into a Python package is that it could be cached in GitHub Actions very straightforwardly.

I do not know if this is the best approach, so I'd also like to look into best practices and check with people in pyOpenSci about alternatives.

This will take quite a bit more discussion, so we should proceed with PlasmaPy/PlasmaPy#2570 (which we may need for especially large data sets).

@pheuer, @JaydenR2305 — I'm curious what your thoughts are on this!

@pheuer
Member

pheuer commented Mar 14, 2024

I think the answer to this depends a lot on whether the amount of data included remains small (say <10 MB), medium (~100 MB) or large (> 1 GB). Right now, the NIST_STAR file is the largest at ~2 MB: I don't see that growing much. The nascent nuclear datafile is ~100 kB, which I could see growing to 1+ MB. If we add more nuclear data, say scattering cross-sections for elastic collisions, that might be another 1-10 MB.

If the total data is <10 MB, it could probably be just incorporated into PlasmaPy directly. Any more than that starts to seem like a lot to me. For the features I can imagine right now, I think we'll get to around 5-10 MB. But, that makes me nervous that we'll likely exceed that limit when we think of new features!
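One way to keep an eye on that threshold would be a small check that totals the data directory and compares it with a budget. This is only a sketch: the function names and the idea of running it as a test or CI step are assumptions, and the 10 MB default simply mirrors the figure above.

```python
from pathlib import Path


def total_size_mb(directory):
    """Sum the sizes of all regular files under ``directory`` in MB (10**6 bytes)."""
    paths = Path(directory).rglob("*")
    return sum(p.stat().st_size for p in paths if p.is_file()) / 1e6


def within_budget(directory, limit_mb=10.0):
    """Return True while the directory stays under the size budget."""
    return total_size_mb(directory) < limit_mb
```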

I do think adding broadly applicable data (especially the nuclear data) is quite valuable, since this data is often quite difficult to access (ENDF is a pain!) but is critical to doing even basic calculations in plasma/fusion/nuclear science. I don't want to go crazy adding things, but I think a reasonable test is that if data could be required for an educational notebook or a homework calculation, it should be in scope.

An important question: @namurphy do you know the package limit size for PlasmaPy on PyPI? It seems to vary from package to package - I've seen lots of numbers? Something like 100-150 MB? Do we have room for 10-20 MB of data? If so, maybe for now we should just move these files into the main package (and keep #2570 in the wings for any future large files).

Separate package idea

Storing data in a separate package is an interesting idea - we could have a function in PlasmaPy that tries to import the relevant data from the package and prompts the user to install the package if it can't find the files.
Pros:

  • Relatively easy to implement
  • Easy to cache (as Nick mentioned).

Cons:

  • Hard to update (need to re-release the package). This probably isn't a good solution for data files that are changing with any regularity as we add features.
  • Seems like a slight abuse of the concept of a package? Is anyone else doing something like this?
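The import-and-prompt function described in this section could look roughly like the following. It is a minimal sketch, not an actual PlasmaPy API: the function name `load_data_package` and the import name `plasmapy_data` are hypothetical, and the install hint reuses the `plasmapy-data` name proposed earlier in this thread.

```python
import importlib


def load_data_package(name="plasmapy_data", pip_name="plasmapy-data"):
    """Import the optional data package, or explain how to install it.

    Both default names are hypothetical placeholders.
    """
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # Only rewrite the error when the data package itself is missing,
        # not when one of its own dependencies fails to import.
        if exc.name != name:
            raise
        raise ImportError(
            f"The optional data package {name!r} is not installed. "
            f"Try `pip install {pip_name}`."
        ) from exc
```

Raising `ImportError` with an install hint keeps PlasmaPy importable without the data package, while still giving a clear message at the point where the data is first needed.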

File server idea

Another option would be to host these files on a public HTTP fileserver on the plasmapy.org domain, e.g. plasmapy.org/files. I'm not sure how web hosting for that domain is handled, but there may be an option to add a file server.
Pros:

  • Extensible to very large files

Cons:

  • Hard to cache values?
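On the caching concern, a client-side cache is one possible mitigation: download each file once and reuse the local copy afterwards. The sketch below uses only the standard library. The function name `fetch_cached` and the cache directory are hypothetical, and it assumes files on the server never change under the same name.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_cached(url, cache_dir):
    """Download ``url`` into ``cache_dir`` once, then reuse the local copy."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last component of the URL.
    target = cache_dir / url.rsplit("/", 1)[-1]
    if not target.exists():
        # Write to a temporary name first so an interrupted transfer
        # never leaves a partial file under the final name.
        partial = target.with_name(target.name + ".part")
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
    return target
```

A real implementation would probably also want checksums or versioned file names so that stale or corrupted copies can be detected.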

@namurphy
Member Author

Thank you for the thoughtful reply!

> An important question: @namurphy do you know the package limit size for PlasmaPy on PyPI? It seems to vary from package to package - I've seen lots of numbers? Something like 100-150 MB? Do we have room for 10-20 MB of data? If so, maybe for now we should just move these files into the main package (and keep #2570 in the wings for any future large files).

This is a hard number to find! I saw 60 MB in one place from a few years ago. It's probably a better practice (good manners?) to keep package sizes ≲20 MB.

> • Hard to update (need to re-release the package). This probably isn't a good solution for data files that are changing with any regularity as we add features.

Doing a release is fairly straightforward nowadays with GitHub Actions. For PlasmaPy, the release itself takes only a few minutes, but there are a bunch of peripheral tasks.

> • Seems like a slight abuse of the concept of a package? Is anyone else doing something like this?

I won't tell PyPI if you don't! Good thing there won't be a public record of me saying that. 🙃
