
Time to first byte? #592

Open
lukehsiao opened this issue May 10, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@lukehsiao

Suppose we want to load-test an API which uses server-sent events (SSE). Is it possible to measure the time-to-first-byte using Goose?

@jeremyandrews
Member

Can you provide some examples of how you’re using SSE and what metrics you’d want to measure? What technologies are you using?

@lukehsiao
Author

lukehsiao commented May 12, 2024

Hmmm, I can try and get more specific if you need, but one example would be load testing an API like ChatGPT, which uses SSE so that you can start to see the response streaming back as it is generated, rather than simply staring at a blank page for a long time before the entire response is complete.

In these types of use cases, time-to-first-token (essentially time-to-first-byte) is the interesting metric, as that represents the latency between asking a query and when the user can begin to receive a response. This metric is often what dictates how responsive a streaming LLM API feels to a user.

The "Important Metrics for LLM Serving" section of https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices says more about this.

Our team uses four key metrics for LLM serving:

  1. Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.

So, the question then is: can Goose be used to load-test and measure time-to-first-byte as a proxy for time-to-first-token? Could I use Goose to try to reproduce some of the results in that Databricks blog post?

Does that help clarify?
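(To make the metric itself concrete, independent of whether Goose exposes it today: time-to-first-byte is just the elapsed time between issuing the request and the first byte of the response body arriving. A minimal sketch using only the Rust standard library, with a local TCP server standing in for a slow SSE endpoint — the 100 ms delay, the endpoint path, and the helper name are all illustrative, not Goose API:)

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

// Time-to-first-byte on an already-connected stream: elapsed time
// between sending the request and the first response byte arriving.
fn time_to_first_byte(stream: &mut TcpStream, request: &[u8]) -> std::io::Result<Duration> {
    let start = Instant::now();
    stream.write_all(request)?;
    let mut first = [0u8; 1];
    stream.read_exact(&mut first)?; // blocks until the first byte arrives
    Ok(start.elapsed())
}

fn main() -> std::io::Result<()> {
    // Stand-in "SSE server": waits 100 ms before its first byte, i.e.
    // a simulated time-to-first-token for a streaming LLM response.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut buf = [0u8; 512];
        let _ = conn.read(&mut buf); // consume the request
        thread::sleep(Duration::from_millis(100)); // simulated generation latency
        conn.write_all(b"data: first token\n\n").unwrap();
    });

    let mut stream = TcpStream::connect(addr)?;
    let ttfb = time_to_first_byte(&mut stream, b"GET /events HTTP/1.1\r\n\r\n")?;
    println!("time to first byte: {:?}", ttfb);
    assert!(ttfb >= Duration::from_millis(100));
    Ok(())
}
```

(A real load test would measure this per request and aggregate percentiles, which is the part that would need support inside Goose's request/metrics pipeline.)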

@jeremyandrews
Member

That’s very helpful, yes. I’ll find some time to test and see what can be done. I expect it will take some code changes/additions to be useful.

@jeremyandrews jeremyandrews added the enhancement New feature or request label Jul 28, 2024