Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debuggability feature for specific request with debug key #14

Open
wants to merge 1 commit into
base: task/check-dd-extensions
Choose a base branch
from

Conversation

erickhun
Copy link
Collaborator

@erickhun erickhun commented Dec 19, 2019

Adding a "debuggability" key to make it easier to engineers to see what happen with specific requests. Discussion here

Why?

Engineers will be able to add , and no matter the level of verbosity set (with LOG_LEVEL) in the services. The logs will appear in datadog logs with the value engineers arbitrarily give. For example:

with https://service.com?debug=A_RANDOM_STRING

Engineers will be able to see all logs across services if they search A_RANDOM_STRING in datadog logs , so they can isolate that given request, and remove noise.

This feature will need to be implemented in the BuffLog javacript library too to make sure to make the most of it.

Review

Up to have your eyes on this @esclapes @colinscape ? ( and cc @josem @karelm since I think you'll certainly make a use of that feature with the publish-api x) )

note: This is will be merge into #13

self::$isDebug = true;
// we set the lowest level of logs to see everything
self::$verbosityLevel = Logger::DEBUG;
self::$debugValue = $_GET[self::$debugKey];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably sanitize this

Copy link

@colinscape colinscape Dec 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may well be security aspects to consider too. Since many API endpoints will be publicly available, there's the potential for anyone to include this key in their request, which could have the effect of making us incur large logging costs! 😅

It's not a huge risk, but it does exist. Potentially something like the API Gateway could help here by only letting this parameter through if the request comes via the VPN (or I guess from 'inside' the cluster too).

As a proof of concept, I think this is fine as is, but we might want to keep this in mind that it is something of a half-open door if someone wants to act maliciously against us.

@@ -86,6 +92,18 @@ public static function __callStatic($methodName, $args)
}
}

private static function checkDebug()
{
if (is_array($_GET) && isset($_GET[self::$debugKey])) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we take into account only GET requests or also POST ones?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think POST too - particularly for API requests (eg creating an update) it would be useful to get the trace in those situations too as well as the 'read' requests.

Copy link

@esclapes esclapes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @erickhun, 👋

A bit late to the party here 😅, sharing my two cents here. I love this feature. I think it will be really useful. 😍

I was thinking how this would translate across services, i.e. when Publish API uses the Core Auth Svc to fulfill a request I would expect the debugger in the service to also be aware of this new key.

Thinking about this, I was considering our buffer-rpc calls that use HTTP query params for method and args. In this case, where do you see the new debug query parameter being included?:

  • inside args with the rest of method arguments,
  • or a top level, next to method and args

Wondering if a using an HTTP header as well (or instead of) a query parameter would make it more straight forward? A header feels less prone to confusion, but on the other hand it may be harder to set a custom header when using the browser.

Anyways, just wanted to share some thoughts around this. I'd love to read yours 😄

@karelm
Copy link

karelm commented May 11, 2020

Heya @erickhun, I think that Colin and Edu both raised questions around this that resonated with me, but I thought I'd ask if there was any intentions to move this forward, or whether this was just going to remain paused for the time being?

@erickhun
Copy link
Collaborator Author

erickhun commented May 12, 2020

Since it is showing interest, let's try to see if that worth pushing ! @colinscape @esclapes @karelm

When do we want that feature?

I think we can prfioritize that feature if we see the need of it. We left it in pause since we didn't really see anyone expressing they needed it. Is it something that will be useful now? Any example on how it will make things easier for you?

Security aspect:

I'm unsure about the best solution solve the risk of "attacker" using that key, and us ending up with tons of unwanted logs. 1 way would be to add a variable environment defining the "debug-ability key". Something such as LOG_DEBUG_KEY=my_secret_debug_key and that feature will be trigger by https://service.com?my_secret_debug_key=value. But we would end up with all services that will need that key to show be able to show the log?
The other way @colinscape mentioned, is a VPN check. The challenge I'm seeing is that we will need to specify the IP somewhere , and know how that IP is sent from service to services inside the cluster?

If anyone has a better idea :)

RPC calls: where will be this parameter ?

Thinking about this, I was considering our buffer-rpc calls that use HTTP query params for method and args. In this case, where do you see the new debug query parameter being included?:

  • inside args with the rest of method arguments,
  • or a top level, next to method and args

@esclapes I was thinking inside args with the rest of method arguments, so it consistent if that will be if we trigger that from a browser. I'm not too sure how RPC calls are made for other service, open to change that if that make more sense .

harder to set a custom header when using the browser.

As you said I think it's harder to set an HTTP header, me assuming that most of the case will be trigger in the browser?

@colinscape
Copy link

Realistically, I don't think the security aspect is too important - the chances of someone using that are pretty minimal, and if it does happen I'd say it should be pretty straightforward to add the secret like you mention @erickhun - so that to debug a single request you'd need to know the secret (and that would be relatively simple to change ideally with a single change in kubesecret and corresponding change in the DD indexing rules).

I terms of the usefulness, I'd defer to the product engineers. I can also see value in a mechanism like this (once we have fuller logging in place) to be able to see potential areas for improvement. For example, it might be enlightening to see how many database calls are made on a page load 😅 . Or how many external http requests are made.

@karelm
Copy link

karelm commented May 15, 2020

Thanks for this @erickhun I think that the secret key approach would be nice, the only small issue we have with publish is the RPC request in the first place... but i can imagine with the new graphql endpoint this could be really useful to trace how a single request fans out from the service.
I'm not too bothered about rpc events relative to other more useful features in future. If we can make those rpc events also behave the same way, then great, like Colin said it would be really interesting to see how many db call's are made just to render a page. This will hopefully lead to easy win's for us to improve performance overall for the website + show us our bottlenecks more clearly.

@esclapes
Copy link

esclapes commented May 15, 2020

Hey @erickhun @colinscape answering to some of the questions:

I don't think the security aspect is too important - the chances of someone using that are pretty minimal

I agree. 💯 I see the main issue on how to really trigger this properly, like @karelm shared:

I think that the secret key approach would be nice, the only small issue we have with publish is the RPC request in the first place

For instance, the first GET request we make is the one that we would trigger by using the browser's URL bar and, thus, the one that you could modify with this extra parameter. However, this request would only load the Single Page App (HTML with React app) while most of the "interesting" requests are triggered from the app as AJAX requests (i.e. RPC to getFeatures).

To get this extra parameter sent in the "interesting" requests, you would need the app (or the underlying http client) to check the URL parameters in the browser and decide if it needs to include it in every subsequent request. If this is a secret key, how does the single page app know it? It would probably need some pattern matching and then let the back-end double-check it.

@esclapes I was thinking inside args with the rest of method arguments, so it consistent if that will be if we trigger that from a browser.

I would approach this as two separate questions:

  • how we trigger the debug mode
  • how we transmit the debug mode between services

For transmitting, I think a header would be the most semantic.

Regarding triggering, I have a thought experiment for you (if you have the time 😏 )


I'd love for us to do a thought experiment and look at this from a different angle: Instead of making this an on-demand feature, we turn it around and make it our default: Instead of linking requests across services when a magic parameter is included, we could just do it all the time and use a convention that would work for our different layers

We already have several examples provided by our third party tools that we can use as inspiration:

  • DataDog offers a trace_id which has some magic baked in as the it is magically shared across services and builds our APM traces. However, traces are sampled and the client is not aware of the trace ID corresponding to their request, so it wouldn't fully solve this issue.
  • Cloudflare, on the other hand, generates a cf-request-id which is provided to our services as a header and it's returned to the client as a reponse header

The basic steps are to build our own would be:

  • Generate a x-buffer-request-id at the edge (api-gateway, rpc server, publish-api etc) and include it as a header to upstream services. We could easily use cf-request-id if available and only generate our own whenever we don't have it
  • Update our rpc-client and other libraries used for inter-service communication to take this header and relay it upstream in every request. Also update Bufflog to include this header.
  • Include this x-buffer-request-id as a response header so that we can just use it to search for our request in logs.

I think that the harder part is the second bullet (updating internal communication libraries), but reading through our options above it feels to me that we would need to do that anyways with the ad-hoc request identification and this always-on approach would have some extra benefits:

  • No need to handle secrets and magic parameter. Being always on means that as a developer you can just use the app as usual and be able to search for any request.
  • The x-buffer-request-id could be shown to customers on some specific errors to make it easier for them to report issues and for us to debug them. I have already seen advocates using c-xray shared by customer in their reports.

If I am understanding the original goal of this we would miss a key part of it, the always-on approach would not solve for increasing verbosity level in all services, to address this I would consider the following:

  • Increasing the verbosity may not always be necessary: in many cases the issues will be logged as errors or warnings, which are logged and indexed by default, so being able to search for a request could be enough.
  • We would still need a solution when more verbosity level is required. Maybe a magic parameter like proposed above would be the most appropriate but maybe a command via kubectl could also work to avoid it being exposed to the public. In any case, I think this would complement the always-on request tracing instead of being an alternative.

I am sorry for the long post 😅 and specially if this does not match the current direction. I just wanted to share this idea as I feel it could solve some of the use cases and I would personally find it very useful.

Thank you for reading and feel free to go ahead with the current approach 👍 🤗

@colinscape
Copy link

Amazing thoughts, @esclapes - thanks so much for taking the time to share them! 🤗

I do like the idea of customers being able to give some sort of trace id for us to use (even potentially create a tool for advocates to 'download' the full trace for the customer and save for later analysis in a bug report, for example).

Another possibility is that maybe production doesn't provide that ability to increase the log level on a per-request basis. Instead perhaps a different ('logging'?) environment has that built in and logs at a more verbose level by default. 🤔 We would then only make the request in the different environment when we want a deluge of logs for a particular request. 😀

@karelm karelm removed their request for review June 7, 2021 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants