Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support setting proxy from environment variables for request session #823

Closed
wants to merge 4 commits into from

Conversation

Isotr0py
Copy link

@Isotr0py Isotr0py commented Sep 25, 2024

Fix #501

Motivation

  • The direct connection to Earthdata through our university's network is slow and unstable, which always causes corrupted downloads even the failed search.
  • We have a proxy to download data from data providers like Earthdata. However, setting proxy for earthaccess is a little bit tricky refer to the example code in Allow custom proxy settings with requests sessions #501.

Description:

  • This PR lets user initialize the request session with proxy, if they have proxy set in environment variables.
Pull Request (PR) draft checklist - click to expand
  • Please review our
    contributing documentation
    before getting started.
  • Ensure an issue exists representing the problem being solved in this PR.
  • Populate a descriptive title. For example, instead of "Updated README.md", use a
    title such as "Add testing details to the contributor section of the README".
    Example PRs: #763
  • Populate the body of the pull request with:
  • Update CHANGELOG.md with details about your change in a section titled
    ## Unreleased. If such a section does not exist, please create one. Follow
    Common Changelog for your additions.
    Example PRs: #763
  • Update the documentation and/or the README.md with details of changes to the
    earthaccess interface, if any. Consider new environment variables, function names,
    decorators, etc.

Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!

Pull Request (PR) merge checklist - click to expand

Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the @nsidc/earthaccess-support team in a comment and we
will help you out!

  • Add unit tests for any new features.
  • Apply formatting and linting autofixes. You can add a GitHub comment in this Pull
    Request containing "pre-commit.ci autofix" to automate this.
  • Ensure all automated PR checks (seen at the bottom of the "conversation" tab) pass.
  • Get at least one approving review.

📚 Documentation preview 📚: https://earthaccess--823.org.readthedocs.build/en/823/

@Isotr0py Isotr0py changed the title Set proxy from environment variables when initialize request session Support setting proxy from environment variables for request session Sep 25, 2024
Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, but we'll need to discuss whether or not this is an appropriate/safe way to address #501. As I've commented there, you might be able to address the issue by exporting the appropriate env var, as there are cases where they are automatically used, and thus no need to explicitly set them on the session. Specifically, requests uses urllib, which recognizes the proxy env vars.

For cases where the proxy env vars are not used, that's because the session's trust_env property is set to False, indicating that the env should not be trusted. Therefore, the code in this PR ignores the trust_env property, which is perhaps not the most secure approach to solving this.

@Isotr0py
Copy link
Author

Thanks for informing! I didn't notice the trust_env is set to False for safety before.

So how about adding a method like earthaccess.set_proxy() to expose proxy settings to user?

@chuckwondo
Copy link
Collaborator

Thanks for informing! I didn't notice the trust_env is set to False for safety before.

It's set to False in some cases, which is why I suggest trying to set the proxy env vars and seeing if that works. If it does, great. If not, then you've hit a case where trust_env is set to False.

So how about adding a method like earthaccess.set_proxy() to expose proxy settings to user?

The auth logic in this library has arguably become a bit "messy," so we'll need to make a thorough review of the code to determine a reasonable approach, which might very well be roughly as simple as your suggestion.

@Isotr0py
Copy link
Author

It's set to False in some cases, which is why I suggest trying to set the proxy env vars and seeing if that works. If it does, great. If not, then you've hit a case where trust_env is set to False.

I have double checked the proxy env vars, and it didn't work. I guess the case is that I'm using interactive strategy to login in, which will create bearer tokens and blocking env vars here:

def get_session(self, bearer_token: bool = True) -> requests.Session:
"""Returns a new request session instance.
Parameters:
bearer_token: whether to include bearer token
Returns:
class Session instance with Auth and bearer token headers
"""
session = SessionWithHeaderRedirection()
if bearer_token and self.token:
# This will avoid the use of the netrc after we are logged in
session.trust_env = False
session.headers.update(
{"Authorization": f'Bearer {self.token["access_token"]}'}
)
return session

@chuckwondo
Copy link
Collaborator

It's set to False in some cases, which is why I suggest trying to set the proxy env vars and seeing if that works. If it does, great. If not, then you've hit a case where trust_env is set to False.

I have double checked the proxy env vars, and it didn't work. I guess the case is that I'm using interactive strategy to login in, which will create bearer tokens and blocking env vars here:

Can you try creating a ~/.netrc file with your Earthdata Login creds?

@Isotr0py
Copy link
Author

Can you try creating a ~/.netrc file with your Earthdata Login creds?

I tried both netrc and environment strategies, and they all return session.trust_env=False with header contents bearer tokens. I wonder if there is an issue related to the auth logic.

My test script

import earthaccess

earthaccess.login(strategy="netrc")
session = earthaccess.get_requests_https_session()
print(session.headers)
print(session.trust_env)

Outputs

{'User-Agent': 'earthaccess v0.10.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer <My_token>'}
False

@chuckwondo
Copy link
Collaborator

Can you try creating a ~/.netrc file with your Earthdata Login creds?

I tried both netrc and environment strategies, and they all return session.trust_env=False with header contents bearer tokens. I wonder if there is an issue related to the auth logic.

Don't bother looking at that for the moment. That may be misleading. Instead, try to do the thing that currently isn't working for you. For example, if you have a script to do something with earthaccess that currently fails due to not making use of the proxy, and the script is named foo.py, run it like so to see if it works:

https_proxy=http://HOST:PORT python foo.py

where HOST and PORT are whatever is appropriate for your proxy.

I am able so run a local proxy on my laptop at http://localhost:8080 and I am able to login and download via earthaccess.

@Isotr0py
Copy link
Author

Isotr0py commented Sep 25, 2024

I have a simple download script:

import argparse
import earthaccess
import os


def download(date_start, date_end, shortname):
    earthaccess.login()
    results = earthaccess.search_data(
        short_name=shortname,
        cloud_hosted=True,
        temporal=(date_start, date_end),
    )
    session = earthaccess.get_requests_https_session()
    print("http_proxy:", os.environ["http_proxy"])
    print("https_proxy:", os.environ["https_proxy"])
    print(session.headers)
    print(session.trust_env)
    earthaccess.download(results, f"download/{shortname}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--shortname", type=str)
    parser.add_argument("--date-start", type=str)
    parser.add_argument("--date-end", type=str)
    args = parser.parse_args()
    download(args.date_start, args.date_end, args.shortname)

And I run with below command on Windows Powershell, however, the download has failed with corrupted files, while http_proxy and https_proxy did have been passed to env vars.

$env:HTTP_PROXY="http://127.0.0.1:7890"; $env:HTTPS_PROXY="http://127.0.0.1:7890"; python .\test_scripts.py --shortname GOCI_L2_OC --date-start 2015-01-01 --date-end 2015-01-02

Outputs

http_proxy: http://127.0.0.1:7890
https_proxy: http://127.0.0.1:7890
{'User-Agent': 'earthaccess v0.10.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer <my token>'}
False

QUEUEING TASKS | : 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 557.52it/s]
PROCESSING TASKS | :   0%|                                                                                                                                                                         | 0/16 [00:00<?, ?it/s]Error while downloading the file G2015001001640.L2_COMS_OC.nc
Traceback (most recent call last):
  File "F:\GIS\earthaccess\.venv\Lib\site-packages\urllib3\response.py", line 748, in _error_catcher
    yield
  File "F:\GIS\earthaccess\.venv\Lib\site-packages\urllib3\response.py", line 894, in _raw_read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
urllib3.exceptions.IncompleteRead: IncompleteRead(17812224 bytes read, 185151892 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "F:\GIS\earthaccess\earthaccess\store.py", line 646, in _download_file
    shutil.copyfileobj(r.raw, f, length=1024 * 1024)
  File "D:\miniconda3\Lib\shutil.py", line 197, in copyfileobj
    buf = fsrc_read(length)
          ^^^^^^^^^^^^^^^^^
  File "F:\GIS\earthaccess\.venv\Lib\site-packages\urllib3\response.py", line 949, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "F:\GIS\earthaccess\.venv\Lib\site-packages\urllib3\response.py", line 872, in _raw_read
    with self._error_catcher():
  File "D:\miniconda3\Lib\contextlib.py", line 155, in __exit__
    self.gen.throw(typ, value, traceback)
  File "F:\GIS\earthaccess\.venv\Lib\site-packages\urllib3\response.py", line 772, in _error_catcher
    raise ProtocolError(arg, e) from e
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(17812224 bytes read, 185151892 more expected)', IncompleteRead(17812224 bytes read, 185151892 more expected))

However, when I switched to the script provided in #501 (comment), it can download 16 complete files through the same proxy.

@chuckwondo
Copy link
Collaborator

Thank you. That's all helpful information.

@chuckwondo
Copy link
Collaborator

@Isotr0py, I believe this might suffice as a workaround for you for the moment, so please let me know if this works for you. I've tweaked your sample code from above.

The basic idea is that I'm setting the proxies on the session obtained from Auth.get_session (which is invoked on the call path from earthaccess.get_requests_https_session) and caching that session, such that every subsequent call to Auth.get_session will use that same session. This is necessary because otherwise, every call to Auth.get_session creates and returns a new session object.

This might be a way around modifying earthaccess code directly, so let me know if it does the trick.

import argparse
import os
from functools import cache
from typing import Callable
from typing_extensions import ParamSpec

import earthaccess
import requests


P = ParamSpec("P")


def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    def go(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        s = f(*args, **kwargs)
        s.proxies.update(
            {
                k: v
                for k in ("HTTP_PROXY", "HTTPS_PROXY")
                if (v := os.environ.get(k, os.environ.get(k.casefold())))
            }
        )

        return s

    return go


def download(date_start, date_end, shortname):
    earthaccess.login()
    results = earthaccess.search_data(
        short_name=shortname,
        cloud_hosted=True,
        temporal=(date_start, date_end),
    )
    auth: earthaccess.Auth = earthaccess.__store__.auth
    auth.get_session = cache(set_proxies(auth.get_session))
    session = earthaccess.get_requests_https_session()
    print("http_proxy:", os.environ.get("http_proxy"))
    print("https_proxy:", os.environ.get("https_proxy"))
    print(f"{session.proxies=}")
    print(session.headers)
    print(session.trust_env)
    earthaccess.download(results, f"download/{shortname}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--shortname", type=str)
    parser.add_argument("--date-start", type=str)
    parser.add_argument("--date-end", type=str)
    args = parser.parse_args()
    download(args.date_start, args.date_end, args.shortname)

@Isotr0py
Copy link
Author

Isotr0py commented Sep 26, 2024

@chuckwondo The above script didn't work, because the set_proxy is using incorrect keys (HTTP_PROXY and HTTPS_PROXY) in proxy dict.

In fact, this script worked well after I modified the set_proxies with the correct keys (http and https):

def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    def go(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        s = f(*args, **kwargs)
        # session use 'http' and `https` as proxies dict keys
        s.proxies.update(
            {
                k.split("_")[0].lower(): v
                for k in ("HTTP_PROXY", "HTTPS_PROXY")
                if (v := os.environ.get(k, os.environ.get(k.casefold())))
            }
        )

        return s

    return go

Outputs

http_proxy: http://127.0.0.1:7890
https_proxy: http://127.0.0.1:7890
session.proxies={'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}
{'User-Agent': 'earthaccess v0.10.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer <my token>'}
False
QUEUEING TASKS | : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 776.63it/s]
PROCESSING TASKS | : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:44<00:00, 13.09s/it]

Copy link
Collaborator

@chuckwondo chuckwondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Isotr0py, the example I provided was for you to use instead of this code change. I'm glad it works for you (with the small tweak you made to it).

We still do not want to make this particular PR change internally because, as I mentioned earlier, this completely ignores cases where the session's trust_env property is set to False. We want to respect that setting and instead find a better way to give the caller more control over the session settings themselves.

Please close this PR, as this is a larger issue that the earthaccess team needs to discuss and find a solution for.

@chuckwondo
Copy link
Collaborator

@chuckwondo The above script didn't work, because the set_proxy is using incorrect keys (HTTP_PROXY and HTTPS_PROXY) in proxy dict.

In fact, this script worked well after I modified the set_proxies with the correct keys (http and https):

def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    def go(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        s = f(*args, **kwargs)
        # session use 'http' and `https` as proxies dict keys
        s.proxies.update(
            {
                k.split("_")[0].lower(): v
                for k in ("HTTP_PROXY", "HTTPS_PROXY")
                if (v := os.environ.get(k, os.environ.get(k.casefold())))
            }
        )

        return s

    return go

Outputs

http_proxy: http://127.0.0.1:7890
https_proxy: http://127.0.0.1:7890
session.proxies={'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'}
{'User-Agent': 'earthaccess v0.10.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Authorization': 'Bearer <my token>'}
False
QUEUEING TASKS | : 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 776.63it/s]
PROCESSING TASKS | : 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:44<00:00, 13.09s/it]

Awesome! I'm glad my suggestion (with your minor tweak) worked for you!

Personally, I'd probably tweak it like so, but just a personal preference:

def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    def go(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        s = f(*args, **kwargs)
        s.proxies.update(
            {
                scheme: v
                for scheme in ("http", "https")
                if (
                    v := os.environ.get(
                        k := f"{scheme}_proxy", os.environ.get(k.upper())
                    )
                )
            }
        )

        return s

    return go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow custom proxy settings with requests sessions
2 participants