get_test_results response_format #1280
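This PR adds a response_format argument to Client.get_test_results so callers can choose between the existing flattened pandas DataFrame (still the default) and a plain list of dictionaries, and it renames the project_id / project_name arguments to experiment_id / experiment_name while keeping the old names as deprecated keyword arguments. pandas is only imported when the "pandas" format is requested. A minimal usage sketch based on the diff below (the experiment name is illustrative, not taken from the PR):

    from langsmith import Client

    client = Client()

    # Default: results as a flattened pandas DataFrame (requires pandas)
    df = client.get_test_results(experiment_name="my-experiment")

    # New: results as a list of dictionaries, with no pandas dependency
    rows = client.get_test_results(
        experiment_name="my-experiment",
        response_format="list",
    )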

Open · wants to merge 1 commit into base: main
82 changes: 60 additions & 22 deletions python/langsmith/client.py
@@ -1,4 +1,4 @@
"""Client for interacting with the LangSmith API.

GitHub Actions / benchmark — Benchmark results (notice on line 1 in python/langsmith/client.py):

create_5_000_run_trees:                        Mean +- std dev: 696 ms +- 106 ms    (warning: unstable, std dev is 15% of the mean)
create_10_000_run_trees:                       Mean +- std dev: 1.44 sec +- 0.22 sec (warning: unstable, std dev is 15% of the mean)
create_20_000_run_trees:                       Mean +- std dev: 1.43 sec +- 0.17 sec (warning: unstable, std dev is 12% of the mean)
dumps_class_nested_py_branch_and_leaf_200x400: Mean +- std dev: 703 us +- 11 us
dumps_class_nested_py_leaf_50x100:             Mean +- std dev: 25.0 ms +- 0.3 ms
dumps_class_nested_py_leaf_100x200:            Mean +- std dev: 104 ms +- 3 ms
dumps_dataclass_nested_50x100:                 Mean +- std dev: 25.1 ms +- 0.2 ms
dumps_pydantic_nested_50x100:                  Mean +- std dev: 70.7 ms +- 17.8 ms  (warning: unstable, std dev is 25% of the mean)
dumps_pydanticv1_nested_50x100:                Mean +- std dev: 197 ms +- 3 ms

For the unstable results, pyperf suggests rerunning with more runs/values/loops, running 'python -m pyperf system tune' to reduce system jitter, and using pyperf stats, pyperf dump, and pyperf hist to analyze results (--quiet hides these warnings).

GitHub Actions / benchmark — Comparison against main (notice on line 1 in python/langsmith/client.py):

+-----------------------------------------------+----------+------------------------+
| Benchmark                                     | main     | changes                |
+===============================================+==========+========================+
| dumps_pydanticv1_nested_50x100                | 218 ms   | 197 ms: 1.11x faster   |
+-----------------------------------------------+----------+------------------------+
| create_5_000_run_trees                        | 707 ms   | 696 ms: 1.02x faster   |
+-----------------------------------------------+----------+------------------------+
| dumps_dataclass_nested_50x100                 | 25.3 ms  | 25.1 ms: 1.01x faster  |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_branch_and_leaf_200x400 | 700 us   | 703 us: 1.01x slower   |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_50x100             | 24.9 ms  | 25.0 ms: 1.01x slower  |
+-----------------------------------------------+----------+------------------------+
| dumps_class_nested_py_leaf_100x200            | 103 ms   | 104 ms: 1.01x slower   |
+-----------------------------------------------+----------+------------------------+
| create_10_000_run_trees                       | 1.39 sec | 1.44 sec: 1.04x slower |
+-----------------------------------------------+----------+------------------------+
| create_20_000_run_trees                       | 1.37 sec | 1.43 sec: 1.04x slower |
+-----------------------------------------------+----------+------------------------+
| dumps_pydantic_nested_50x100                  | 65.8 ms  | 70.7 ms: 1.07x slower  |
+-----------------------------------------------+----------+------------------------+
| Geometric mean                                | (ref)    | 1.01x slower           |
+-----------------------------------------------+----------+------------------------+

Use the client to customize API keys / workspace connections, SSL certs,
etc. for tracing.
@@ -2599,29 +2599,43 @@
def get_test_results(
self,
*,
project_id: Optional[ID_TYPE] = None,
project_name: Optional[str] = None,
) -> pd.DataFrame:
"""Read the record-level information from an experiment into a Pandas DF.
experiment_name: Optional[str] = None,
experiment_id: Optional[ID_TYPE] = None,
response_format: Literal["pandas", "list"] = "pandas",
**kwargs: Any,
) -> Union[pd.DataFrame, List[Dict[str, Any]]]:
"""Read the record-level information from an experiment.

Note: this will fetch whatever data exists in the DB. Results are not
immediately available in the DB upon evaluation run completion.

Returns:
--------
pd.DataFrame
A dataframe containing the test results.
Union[pd.DataFrame, List[Dict[str, Any]]]
A dataframe or list of dictionaries containing the test results.
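
Example:
An illustrative sketch of the two return shapes (the experiment name
and the flattened column names shown are hypothetical; actual names
depend on your run and example fields):

rows = client.get_test_results(
experiment_name="my-experiment", response_format="list"
)
# Each row is a dict with keys such as "example_id", "inputs",
# "outputs", "reference_outputs", and "execution_time".

df = client.get_test_results(experiment_name="my-experiment")
# DataFrame with dot-flattened columns, e.g. "inputs.question",
# "outputs.answer", "reference.answer".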
"""
if kwargs.get("project_id"):
warnings.warn(
f'Argument "project_id" is deprecated. Use experiment_id instead (client.get_test_results(experiment_id="{project_id}"))',
DeprecationWarning,
)
experiment_id = kwargs.pop("project_id")
elif kwargs.get("project_name"):
warnings.warn(
f'Argument "project_name" is deprecated. Use experiment_name instead (client.get_test_results(experiment_name="{project_name}"))',
DeprecationWarning,
)
experiment_name = kwargs.pop("project_name")
else:
raise ValueError("Must provide project_name or project_id")
warnings.warn(
"Function get_test_results is in beta.", UserWarning, stacklevel=2
)
from concurrent.futures import ThreadPoolExecutor, as_completed # type: ignore

import pandas as pd # type: ignore

runs = self.list_runs(
project_id=project_id,
project_name=project_name,
project_id=experiment_id,
project_name=experiment_name,
is_root=True,
select=[
"id",
@@ -2634,28 +2648,29 @@
"end_time",
],
)
results: list[dict] = []
results: List[Dict[str, Any]] = []
example_ids = []

def fetch_examples(batch):
examples = self.list_examples(example_ids=batch)
return [
{
"example_id": example.id,
**{f"reference.{k}": v for k, v in (example.outputs or {}).items()},
"inputs": example.inputs,
"reference_outputs": example.outputs or {},
}
for example in examples
]

batch_size = 50
cursor = 0

with ThreadPoolExecutor() as executor:
futures = []
for r in runs:
row = {
"example_id": r.reference_example_id,
**{f"input.{k}": v for k, v in r.inputs.items()},
**{f"outputs.{k}": v for k, v in (r.outputs or {}).items()},
"outputs": r.outputs or {},
"execution_time": (
(r.end_time - r.start_time).total_seconds()
if r.end_time
@@ -2676,25 +2691,48 @@
else:
logger.warning(f"Run {r.id} has no reference example ID.")
if len(example_ids) % batch_size == 0:
# Ensure not empty
if batch := example_ids[cursor : cursor + batch_size]:
futures.append(executor.submit(fetch_examples, batch))
cursor += batch_size
results.append(row)

# Handle any remaining examples
if example_ids[cursor:]:
futures.append(executor.submit(fetch_examples, example_ids[cursor:]))
result_df = pd.DataFrame(results).set_index("example_id")

example_outputs = [
output for future in as_completed(futures) for output in future.result()
]
if example_outputs:
example_df = pd.DataFrame(example_outputs).set_index("example_id")
result_df = example_df.merge(result_df, left_index=True, right_index=True)

# Flatten dict columns into dot syntax for easier access
return pd.json_normalize(result_df.to_dict(orient="records"))
if example_outputs:
example_dict = {item["example_id"]: item for item in example_outputs}
for result in results:
if result["example_id"] in example_dict:
result.update(example_dict[result["example_id"]])

if response_format == "list":
return results
elif response_format == "pandas":
try:
import pandas as pd
except ImportError:
raise ImportError(
"The 'pandas' library is required to use the 'pandas' response_format. "
"Please install it using 'pip install pandas'."
)
# Flatten the inputs/outputs/reference_outputs fields
for result in results:
inputs = result.pop("inputs", {})
outputs = result.pop("outputs", {})
reference_outputs = result.pop("reference_outputs", {})
for k, v in inputs.items():
result[f"inputs.{k}"] = v
for k, v in outputs.items():
result[f"outputs.{k}"] = v
for k, v in reference_outputs.items():
result[f"reference.{k}"] = v
return pd.json_normalize(pd.DataFrame(results).to_dict(orient="records"))
else:
raise ValueError("Invalid response_format. Must be 'list' or 'pandas'.")

def list_projects(
self,