Evaluate the cost of running tests #1350
etr2460 pushed a commit that referenced this issue on Mar 25, 2024:
It's often useful to know the token expenditure of running an eval, especially as the number of evals in this repo grows. Example [feature request](#1350); we also rely on this, e.g., [here](https://github.com/openai/evals/tree/main/evals/elsuite/bluff#token-estimates). Computing this manually is cumbersome, so this PR suggests simply logging the [usage](https://platform.openai.com/docs/api-reference/chat/object#chat/object-usage) receipts (for token usage) of each API call in `record.sampling`. This makes it easy to sum up the token cost of an eval given a logfile of the run.

Here is an example of a resulting `sampling` log line after this change (we add the `data.model` and `data.usage` fields, marked `# NEW`):

```json
{
  "run_id": "240103035835K2NWEEJC",
  "event_id": 1,
  "sample_id": "superficial-patterns.dev.8",
  "type": "sampling",
  "data": {
    "prompt": [
      {
        "role": "system",
        "content": "If the red key goes to the pink door, and the blue key goes to the green door, but you paint the green door to be the color pink, and the pink door to be the color red, and the red key yellow, based on the new colors of everything, which keys go to what doors?"
      }
    ],
    "sampled": [
      "Based on the new colors, the yellow key goes to the pink door (previously red), and the blue key goes to the red door (previously pink)."
    ],
    "model": "gpt-3.5-turbo-0613",  # NEW
    "usage": {                      # NEW
      "completion_tokens": 33,
      "prompt_tokens": 70,
      "total_tokens": 103
    }
  },
  "created_by": "",
  "created_at": "2024-01-03 03:58:37.466772+00:00"
}
```
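Summing the token cost from such a logfile is then straightforward. A minimal sketch (not part of the PR itself): read the run's JSONL logfile line by line, keep only `sampling` events, and accumulate their `data.usage` counters. Events without a usage receipt are skipped rather than treated as errors.

```python
import json

def total_token_usage(logfile_path):
    """Sum token usage across all sampling events in an evals run logfile.

    Assumes one JSON event per line (as in the example log line above).
    Non-sampling events and events without a usage receipt are skipped.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    with open(logfile_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") != "sampling":
                continue
            usage = event.get("data", {}).get("usage") or {}
            for key in totals:
                totals[key] += usage.get(key, 0)
    return totals
```

For the example event above, this would accumulate 70 prompt tokens, 33 completion tokens, and 103 total tokens.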
varad-newtuple pushed a commit that referenced this issue to varad-newtuple/openai_eval on Oct 4, 2024, with the same commit message as above.
Describe the feature or improvement you're requesting
In many production scenarios it is important to do cost-benefit analysis, so it would be great if the `oaieval` command could also return the total cost of running the test. Specifically, this would involve two parts: capturing token usage from the `CompletionFn`, and using it to calculate the cost of running that completion.

Additional context

No response
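Going from token counts to a dollar figure is a simple multiplication by per-token prices. A minimal sketch of what such a cost calculation could look like, assuming hypothetical per-1K-token prices (the numbers below are placeholders; real prices vary by model and change over time):

```python
# Hypothetical per-1K-token prices, keyed by model name. These are
# placeholder values for illustration, not authoritative pricing.
PRICE_PER_1K = {
    "gpt-3.5-turbo-0613": {"prompt": 0.0015, "completion": 0.002},
}

def estimate_cost(model, prompt_tokens, completion_tokens):
    """Rough dollar cost of one completion call from its usage receipt."""
    prices = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * prices["prompt"] + \
           (completion_tokens / 1000) * prices["completion"]
```

Summed over every sampling event in a run's logfile, this would give the total cost the request asks `oaieval` to report.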