Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate regions from a known set #55

Open
iainelder opened this issue Dec 20, 2022 · 9 comments
Open

Validate regions from a known set #55

iainelder opened this issue Dec 20, 2022 · 9 comments

Comments

@iainelder
Copy link
Collaborator

iainelder commented Dec 20, 2022

Today I discovered my botocove run failed because I misspelled a region name: eu-noth-1. But I only discovered it after a long run across the whole org.

I'd like to discover silly mistakes like this as early as possible in the process.

We can use boto3's model to discover the set of known regions.

I think emitting a warning would be more appropriate than raising an exception here because the model may be out of date, i.e., the named region may be so new that the user's boto3 version doesn't yet include it in the model.

You would get the set of known regions from the model like this:

s = boto3.session.Session()
{r for p in s.get_available_partitions() for r in s.get_available_regions("ec2", p)}

Alternatively, you could get the set of known regions by called EC2's DescribeRegions API. There is even an example of this in the README. It would require extra permissions in the cove host account.

@connelldave
Copy link
Owner

Alternatively, you could get the set of known regions by called EC2's DescribeRegions API. There is even an example of this in the README. It would require extra permissions in the cove host account.

The history behind that comment in the readme was that I was playing with this idea but in the context of being able to say "just run in all regions" - the extra IAM complexity put me off that, so the readme line is kind of a throwaway "here's how to do that".

I think emitting a warning would be more appropriate than raising an exception here because the model may be out of date, i.e., the named region may be so new that the user's boto3 version doesn't yet include it in the model.

I agree with the sentiment (we're precluded from raising an exception because the model might be out of date) - but a warning is a bit of a dissapointing mechanism for highlighting failure for the problem you want to solve.

Some thinking aloud.. Is it fair to assume that the latest boto3 release should always have the latest regions in it's model? I wonder if we could raise an exception that suggests updating boto3, or setting an escape value? One to ponder on, but my first thought is that I'm never opposed to failing on malformed data, but the tradeoff is >current vers will start breaking for users that expect to dynamically run across all regions without updating boto3, and deployed software will break whenever AWS release a new region until boto3 is bumped in the software running botocove. Feels like that's unacceptable, but maybe there's a way?

@iainelder
Copy link
Collaborator Author

but a warning is a bit of a dissapointing mechanism for highlighting failure for the problem you want to solve.

As long as it's on stderr by default, it would be good enough for me. I'm usually watching botocove run in the terminal so I would see it and know to cancel before waiting any longer.

The current explicit in-flight feedback is none, so this would still be an improvement.

Boto3 itself doesn't actually force you to use the regions in its model, so I wouldn't expect botocove to enforce this either.

In that way I think it would help me to catch my fat fingers more quickly without getting in the way of anyone who needs to run in a new region ASAP.

@iainelder
Copy link
Collaborator Author

There is another motive for flagging this early: a misspelled region name slows botocove to a crawl.

I had blamed all sorts of things for this mysterious slowdown in the past, but this it the first time I have reproduced it.

See the real time output for the following spellings. 10s total for the correct spelling and 217 seconds total for the incorrect spelling.

(I'm not really sure what the other measurements mean, including timeit's output. The "real" time is the one I feel in front of the screen.)

Again, I don't think botocove can do too much to solve this because it depends on boto3's handling an maybe even my OS's network stack.

But if I'm staring at the terminal wondering why botocove has slowed to a crawl, a warning about the region name would give me a clue. :-)


Use bash time builtin and python timeit module to measure some requests to eu-north-1.

time \
poetry run python3 -m timeit \
--number 5 \
--setup \
'from boto3.session import Session' \
-- \
'try:' \
'    Session().client("sts", region_name="eu-north-1").get_caller_identity()' \
'except:' \
'    pass'
5 loops, best of 5: 355 msec per loop

real	0m10.595s
user	0m3.067s
sys	0m0.563s

Use bash time builtin and python timeit module to measure some requests to eu-noth-1.

time \
poetry run python3 -m timeit \
--number 5 \
--setup \
'from boto3.session import Session' \
-- \
'try:' \
'    Session().client("sts", region_name="eu-noth-1").get_caller_identity()' \
'except:' \
'    pass'
5 loops, best of 5: 7.59 sec per loop

real	3m27.398s
user	0m3.305s
sys	0m0.776s

@alencar
Copy link

alencar commented Jun 5, 2023

Would it be possible to use the decorator session within @cove() annotation to load the regions that are relevant for session account?

It is common-place to have regions disabled from the account configuration (opt-in regions) and using the Management Account (or any other) regions as the list often result in An error occurred (UnrecognizedClientException) when calling the ListTrails operation: The security token included in the request is invalid exceptions due to region being disabled/not opted-in in Account -> AWS Regions AND/OR due to Global STS Endpoint issued tokens being only valid on regions enabled by default unless explicitly changed by the user in IAM -> Security Token Service (STS) -> Global endpoint

Another side effect of not using the account's enabled regions is that, you can miss regions that are not enabled/opted-in in the account, .i.e. Management Account.

There are currently ~10 regions that requires opt-in
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_enable-regions.html?icmpid=docs_iam_console#id_credentials_region-endpoints

@iainelder
Copy link
Collaborator Author

@alencar, that's a good question. This issue is for validation of user-supplied region names. I think you are asking about something else, so so I've opened a new issue #74 to address it cleanly.

@iainelder
Copy link
Collaborator Author

Coming back to this now because I think the proposed solution in #74 will be more robust if we solve this.

Because there are so many sources of region names, and we can't expect that the client's boto3 version will have a complete list, then I think we should default to "soft" validation using warnings. In that way we can at least notify the client that the region name is unknown to boto3 and so botocove may not work as expected.

The warning module has ways to turn warnings into exceptions, for those clients who prefer "hard" validation. Maybe that could be set via a "region validation mode" parameter.

@connelldave
Copy link
Owner

I'm a little skeptical that this is a feature that adds enough value - there isn't a path to fail in a clearly deterministic way, or add strange constraints/special knowledge requirements. Users can solve this problem themselves if it is a problem they're facing and then get full understanding of the implementation choice without understanding botocove under the hood, I think?

Today I discovered my botocove run failed because I misspelled a region name: eu-noth-1. But I only discovered it after a long run across the whole org.

I'd like to discover silly mistakes like this as early as possible in the process.

re-reading the problem statement, perhaps we just need to ensure that this error is raised as soon as it's impactful? Is the slowness caused by a retry/timeout exception? I'm not particularly sure about directly implementing a regions_per_account function since that adds quite a lot of complexity and depth of knowledge in general: just failing for regions not enabled sounds reasonable, pythonic and having less edge cases to me.

@iainelder
Copy link
Collaborator Author

iainelder commented Jun 19, 2023

Rereading the Zen of Python:

If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.

So everything I've suggested before seems like a bad idea.

re-reading the problem statement, perhaps we just need to ensure that this error is raised as soon as it's impactful? Is the slowness caused by a retry/timeout exception?

From the timing tests above, boto3 waited about 7.5 seconds before raising an exception.

boto3 raises an EndpointConnectionError if the region name is invalid.

I wonder whether we could catch this exception in cove_thread and reraise it if the region is not in boto3's model.

@iainelder
Copy link
Collaborator Author

iainelder commented Jun 19, 2023

@connelldave , thanks to your feedback I tried another approach that I think gives good enough feedback and is easier to explain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants