-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved remote cluster performance #259
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't judge the earthaccess code well, but it's cool to see that this doesn't appear to be a crazy hack 🙂
This is amazing! @jrbourbeau I'll take a look today. On the tests failing, I have some WIP to only run integration tests on main. They are not deterministic and some datasets require EULAs, also when you run them from a fork secrets cannot be shared. |
Co-authored-by: Matthew Rocklin <[email protected]>
We don't have a solution for this right now.
This looks ready to merge to us; @betolink did some local testing and things worked great. We added a comment about an issue with data from the CMR API that we currently don't know how to resolve: The first "data link" is not always the "right" data link. We're going to punt on the failing tests because Luis found the unit tests are passing, but issues with secrets are causing the integration tests to fail. Another PR for skipping integration tests on forks is coming soon :) @jrbourbeau as this PR still has [WIP] in the title, are you planning to do more work on it? If not we can merge it tomorrow! |
@jrbourbeau is on vacation this week. Last I spoke with him I recall him saying positive things about this PR though, and saying that mostly it was waiting review. I wouldn't be surprised if he just forgot to remove the WIP. |
Great! Thank you for your patience and contribution :) |
This PR adds an
fsspec
-compatibleEarthAccessFile
object that's a very light wrapper around the current https / S3 file objects being used. The only difference is that when a file is serialized (as happens when working on a remote cluster), if we find that we are running in region on AWS and the underlying file is an https file instead of an S3 file, we reopen the file using the existingeathaccess.open(...)
logic to get a S3 file.Anecdotally, I see a computation performance boost of 2x when running the notebook in #253 with the changes in this PR
Closes #253