Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add download utils to allow data download from prem and AWS s3 buckets #489

Closed
wants to merge 11 commits into from

Conversation

zoghbi-a
Copy link
Contributor

This PR adds two utility methods: http_download and aws_download. They have the same call signature and allow data download from on-prem addresses (through http urls) or from AWS s3 URI.

These methods enhance what is available with dal.Record.getdataset for on-prem data access. When a standard for serving cloud data using the VO is adopted in the future, these util methods may be absorbed into dal.Record.getdataset.

For relevant discussion on cloud data, see PR #369.
The current PR attempts to simplify #369.

@bsipocz
Copy link
Member

bsipocz commented Sep 25, 2023

This should do the trick for the CI and dependency handling: zoghbi-a#1

@codecov
Copy link

codecov bot commented Sep 25, 2023

Codecov Report

Merging #489 (ac47c96) into main (ad84a0e) will increase coverage by 0.01%.
The diff coverage is 80.95%.

@@            Coverage Diff             @@
##             main     #489      +/-   ##
==========================================
+ Coverage   80.07%   80.09%   +0.01%     
==========================================
  Files          52       53       +1     
  Lines        6059     6185     +126     
==========================================
+ Hits         4852     4954     +102     
- Misses       1207     1231      +24     
Files Coverage Δ
pyvo/utils/__init__.py 100.00% <100.00%> (ø)
pyvo/utils/download.py 80.80% <80.80%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

with pytest.warns(PyvoUserWarning):
http_download('http://example.com/data/basic.xml',
local_filepath='basic.xml', cache=True)
assert os.path.getsize('basic.xml') == 901
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose some content related test is needed instead of size, to make windows happy.

@zoghbi-a zoghbi-a marked this pull request as draft October 16, 2023 19:43
@bsipocz bsipocz added this to the v1.5 milestone Oct 16, 2023
@zoghbi-a zoghbi-a force-pushed the download-utils branch 2 times, most recently from 888476d to ef906c9 Compare October 16, 2023 21:52
@zoghbi-a zoghbi-a marked this pull request as ready for review October 18, 2023 02:38
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs review

Comment on lines 16 to 40
Example Usage
==============
```python
data_url = 'https://heasarc.gsfc.nasa.gov/FTP/chandra/data/byobsid/2/3052/primary/acisf03052N004_cntr_img2.jpg'
image_file = http_download(url=data_url)

s3_uri = 's3://nasa-heasarc/chandra/data/byobsid/2/3052/primary/acisf03052N004_cntr_img2.jpg'
image2_file = aws_download(uri=data_url)
```
or
```python
s3_key = 'chandra/data/byobsid/2/3052/primary/acisf03052N004_cntr_img2.jpg's
s3_bucket = 'nasa-heasarc'
image2_file = aws_download(bucket=s3_bucket, key=s3_key)
```

If the aws data requires authentication, a credential profile (e.g. `aws_user` profile in ``~/.aws/credentials``) can be passed
```python
image2_file = aws_download(bucket=s3_bucket, key=s3_key, aws_profile='aws_user')
```
A session (instance of ``boto3.session.Session``) can also be passed instead (see detials in `AWS session documentation`_).
```python
s3_session = boto3.session.Session(aws_access_key_id, aws_secret_access_key)
image2_file = aws_download(bucket=s3_bucket, key=s3_key, session=s3_session)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the other rst files to reformat these into rst rather than markdown, and add the >>>, too, so they can be doctested (also add the remote-data directives)

it obtained. These can be considered an advanced version of `~pyvo.dal.Record.getdataset` that can handle
data from standard on-prem servers as well as cloud data. For now only AWS is supported.

There two methods with the same call signature: `http_download` and `aws_download`. The first handles
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use full API links otherwise sphinx will complain (we don't have nitpicky turned on yet, but things like this will eventually cause CI failures)

```

.. _Amazon S3 storage: https://aws.amazon.com/s3/
.. _AWS session documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add api reference at the bottom of the page

Comment on lines 16 to 24
Example Usage
==============
.. code-block:: python

data_url = 'https://heasarc.gsfc.nasa.gov/FTP/chandra/data/byobsid/2/3052/primary/acisf03052N004_cntr_img2.jpg'
image_file = http_download(url=data_url)

s3_uri = 's3://nasa-heasarc/chandra/data/byobsid/2/3052/primary/acisf03052N004_cntr_img2.jpg'
image2_file = aws_download(uri=data_url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at the other rst file for examples of how to write rst code blocks (rather than markdown). Also add empty lines after the headings, and add >>> for the doctesting to pick these up, and the remote data directives, too.

image2_file = aws_download(bucket=s3_bucket, key=s3_key)


If the aws data requires authentication, a credential profile (e.g. `aws_user` profile in ``~/.aws/credentials``) can be passed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use single backticks for highlight, all of those will fail to link to an URL or sphinx reference. Either double backtick, or use full sphinx resolveable reference

@bsipocz
Copy link
Member

bsipocz commented Oct 18, 2023

tox -e build_docs will build the docs locally, it maybe useful for sorting out the various issues.

@zoghbi-a
Copy link
Contributor Author

I think that the remaining CI failure is not related to this PR.

@bsipocz bsipocz force-pushed the download-utils branch 3 times, most recently from cd8ad8c to d2aaa04 Compare October 18, 2023 19:46
@bsipocz bsipocz mentioned this pull request Dec 15, 2023
@bsipocz bsipocz removed this from the v1.5 milestone Dec 15, 2023
@zoghbi-a zoghbi-a marked this pull request as draft January 22, 2024 18:45
@tomdonaldson
Copy link
Contributor

There are some open questions for this, so we will close it for now and revisit the issues later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants