Create a Python package to store package data? #3
Comments
I think the answer to this depends a lot on whether the amount of data included remains small (say <10 MB), medium (~100 MB), or large (>1 GB). Right now, the

If the total data is <10 MB, it could probably just be incorporated into PlasmaPy directly. Any more than that starts to seem like a lot to me. For the features I can imagine right now, I think we'll get to around 5-10 MB. But that makes me nervous that we'll likely exceed that limit when we think of new features!

I do think adding broadly applicable data (especially the nuclear data) is quite valuable, since this data is often quite difficult to access (ENDF is a pain!) but is critical to doing even basic calculations in plasma/fusion/nuclear science. I don't want to go crazy adding things, but I think a reasonable test is that if data could be required for an educational notebook or a homework calculation, it should be in scope.

An important question: @namurphy do you know the package size limit for PlasmaPy on PyPI? It seems to vary from package to package, and I've seen lots of numbers, some around 100-150 MB. Do we have room for 10-20 MB of data? If so, maybe for now we should just move these files into the main package (and keep #2570 in the wings for any future large files).

Separate package idea
Storing data in a separate package is an interesting idea. We could have a function in PlasmaPy that tries to import the relevant data from the package and prompts the user to install the package if it can't find the files.
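A minimal sketch of the "try to import, otherwise prompt to install" idea described above, under stated assumptions: the helper name `require_data_package`, the module name `plasmapy_data`, and the distribution name `plasmapy-data` are all hypothetical and are not part of PlasmaPy.

```python
import importlib
from types import ModuleType


def require_data_package(
    module_name: str = "plasmapy_data",  # hypothetical import name
    dist_name: str = "plasmapy-data",  # hypothetical name on PyPI
) -> ModuleType:
    """Import the optional data package or explain how to install it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise unchanged if some *other* module was missing, so that a
        # broken import inside the data package is not reported as
        # "package not installed".
        if exc.name != module_name:
            raise
        raise ImportError(
            f"This feature needs the optional data package {dist_name!r}. "
            f"Install it with: pip install {dist_name}"
        ) from exc
```

A function in PlasmaPy that needs the data could call this helper lazily, so that importing PlasmaPy itself never requires the data package to be present.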
Cons:
File server idea
Another option would be to host these files on a public HTTP fileserver on the
Thank you for the thoughtful reply!
This is a hard number to find! I saw 60 MB in one place from a few years ago. It's probably a better practice (good manners?) to keep package sizes ≲20 MB.
Doing a release is fairly straightforward nowadays with GitHub Actions. For PlasmaPy, the release itself takes only a few minutes, but there are a bunch of peripheral tasks.
I won't tell PyPI if you don't! Good thing there won't be a public record of me saying that. 🙃
One possibility to simplify access to data files would be to include them in a Python package that could be made available on PyPI and conda-forge. The package could include functionality to open files with `pandas`, `xarray`, or `h5py`, which could then be imported into PlasmaPy.

Instead of needing to download the data files separately, they could be acquired via `pip install plasmapy-data`, and then accessed by PlasmaPy. We could potentially have `plasmapy-data` be a dependency of PlasmaPy. We could perhaps even allow installation without `plasmapy-data` via `pip install plasmapy[lite]` if the size of the data increases to ≳ 10 MB.

So far, the sizes of data files in this repository are of a scope that is well within what can reasonably be included in a Python package. PlasmaPy wheels are ∼9 MB and source distributions are ∼14 MB.
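As a rough illustration of how files shipped inside an installed package could be located and read, here is a sketch based on the standard library's `importlib.resources` (Python 3.9+). The function name `read_packaged_text` and the example package and file names are assumptions made for illustration; nothing here reflects an existing `plasmapy-data` API.

```python
from importlib.resources import files


def read_packaged_text(package: str, filename: str, encoding: str = "utf-8") -> str:
    """Return the text of a data file bundled inside an installed package.

    ``package`` is an importable package name (e.g. a hypothetical
    ``plasmapy_data``) and ``filename`` is a path relative to that package.
    """
    resource = files(package).joinpath(filename)
    if not resource.is_file():
        raise FileNotFoundError(
            f"{filename!r} was not found inside the package {package!r}"
        )
    return resource.read_text(encoding=encoding)
```

The same lookup could hand a file to `pandas`, `xarray`, or `h5py` instead of reading it as text; for libraries that need a real filesystem path, `importlib.resources.as_file` can wrap the resource.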
The main disadvantage of creating a package is that we would have an additional package to maintain, but there are tools like `cruft` that could simplify package maintenance. I don't expect the amount of maintenance for this package to be very large compared to the main PlasmaPy repo, though. We would want to make the release process simpler than for the main PlasmaPy repo (i.e., by avoiding changelogs).

We'd have to figure out what we'd want to do with data used in tests. If PlasmaPy moves to an `src` layout with a separate `tests` directory, then the test data could live in the `tests` directory.

An advantage of incorporating the data into a Python package is that it could be cached in GitHub Actions very straightforwardly.
I do not know if this is the best approach, so I'd also like to look into best practices and check with people in pyOpenSci about alternatives.
This will take quite a bit more discussion, so we should proceed with PlasmaPy/PlasmaPy#2570 (which we may need for especially large data sets).
@pheuer, @JaydenR2305 — I'm curious what your thoughts are on this!