Adds more thorough pytest example (#442)

Adds an example of using pytest for testing: it explains pytest and also works through a more realistic example. This example just shows how to use the tools, so to speak. It doesn't say much about what should be evaluated, but hopefully it gives people an idea of what to do and how. This example should be updated as we do more. Adds SDLC images and verbiage too.
DAGWorks-Inc · Dec 27, 2024 · e149616
1 parent cef8b92

Showing 16 changed files with 799 additions and 7 deletions.
diff --git a/docs/_static/burr_sdlc.png b/docs/_static/burr_sdlc.png
diff --git a/docs/concepts/index.rst b/docs/concepts/index.rst
@@ -10,6 +10,7 @@ Overview of the concepts -- read these to get a mental model for how Burr works.
     :maxdepth: 2
 
     overview
+    sdlc
     actions
     state
     state-machine

diff --git a/docs/concepts/sdlc.rst b/docs/concepts/sdlc.rst
@@ -0,0 +1,27 @@
+================================
+SDLC with LLMs
+================================
+If you're building an LLM-based application, you'll want to follow a slightly different software development lifecycle (SDLC)
+than you would for a traditional software project. Here's a rough outline of what that might look like:
+
+.. image:: ../_static/burr_sdlc.png
+   :alt: SDLC with LLMs
+   :align: center
+
+The two cycles that exist are:
+
+1. App Dev Loop.
+2. Test Driven Development Loop.
+
+and you will use one to feed into the other.
+
+Walking through the diagram, the SDLC looks like this:
+
+1. Write code with Burr.
+2. Use Burr's integrated observability, and trace all parts of your application.
+3. With the data collected, you can: (1) annotate what was captured and export it, or (2) create a pytest fixture with it.
+4. Create a data set from the annotated data or by running tests.
+5. Evaluate the data set.
+6. Analyze the results.
+7. Either adjust code or prompts, or ship the code.
+8. Iterate using one of the loops...
diff --git a/docs/examples/guardrails/creating_tests.rst b/docs/examples/guardrails/creating_tests.rst
@@ -14,6 +14,12 @@ words or phrases, or using LLMs to grade the output, etc. We aren't opinionated
 do this, but in any case, you'll need to write a test case to exercise things, and this
 is what we're showing you how to do here.
 
+Need to know more about pytest?
+-------------------------------
+For a more thorough pytest walkthrough and example, see the `pytest example <https://github.com/DAGWorks-Inc/burr/tree/main/examples/pytest>`_,
+which explains what pytest is, how to evaluate more than just a single assert statement, how to aggregate results, etc.
+
+
 Test Case Creation Example
 --------------------------
 Video walkthrough:
@@ -57,11 +63,11 @@ test case with the module name of your serialization logic.
 Note (2): you can pass in `--action-name` to override the action name in the test case. This is useful if you want
 to use the output of one action as the input to another action; there are corner cases where this is useful.
 
+
 Future Work
 -----------
 We see many more improvements here:
 
-1. Annotating data in the UI to make it easier to pull out.
-2. Automatically suggesting tests cases for you to add.
-3. Data export / integration with evaluation tools.
-4. etc. Please let us know what you need!
+1. Automatically suggesting test cases for you to add.
+2. Data export / integration with evaluation tools.
+3. etc. Please let us know what you need!
diff --git a/docs/examples/guardrails/index.rst b/docs/examples/guardrails/index.rst
@@ -1,6 +1,6 @@
-=============
-🚧 Guardrails
-=============
+=======================
+🚧 Guardrails / Tests
+=======================
 
 .. toctree::
     :maxdepth: 2

diff --git a/examples/pytest/README.md b/examples/pytest/README.md
@@ -0,0 +1,319 @@
+# An SDLC with Burr and pytest
+Here we show a quick example of a software development lifecycle (SDLC) with Burr and pytest.
+
+![Burr and pytest](burr_sdlc.png)
+
+While we don't cover everything in the diagram, in this example we specifically show how to do most of the TDD loop:
+
+1. Create a test case.
+2. Run the test case.
+3. Create a dataset.
+4. Show how you might construct evaluation logic to evaluate the output of your agent / augmented LLM / application.
+
+using Burr and pytest.
+
+# Using pytest to evaluate your agent / augmented LLM / application
+
+An agent / augmented LLM is a combination of LLM calls and logic. But how do we know if it's working? Well, we can test & evaluate it.
+
+From a high level we want to test & evaluate the "micro" i.e. the LLM calls & individual bits of logic,
+through to the "macro" i.e. the agent as a whole.
+
+But the challenge with LLM calls is that you might want to "assert" on various aspects of the
+output without failing on the first assertion failure, which is standard test framework behavior. So what are you to do?
+
+Well, we can use some `pytest` constructs to help us with this.
+
+## pytest Constructs
+To start let's recap pytest quickly, then move on to how we can evaluate multiple aspects without failing on the first assertion failure.
+
+### pytest basics
+We like pytest because we think it's simpler than the `unittest` module that Python comes with. To use it you need to install it first:
+
+```bash
+pip install pytest
+```
+
+Then to define a test it's just a function that starts with `test_`:
+
+```python
+# test_my_agent.py
+
+def test_my_agent():
+    assert my_agent("input1") == "output1"
+    assert my_agent("input2") == "output2"
+    # can have multiple asserts here - it'll fail on the first one and not run the rest
+```
+Yep - no classes to deal with. Just a function that starts with `test_`. Then to run it:
+
+```bash
+pytest test_my_agent.py
+```
+Boom, you're testing!
+
+### Parameterizing Tests
+We can also parameterize tests to run the same test with different inputs. This comes in handy as we build up
+data points to evaluate our agent or parts of our agent. Each input is then an individual test that can error. Here's an example:
+
+```python
+import pytest
+
+@pytest.mark.parametrize(
+    "input, expected_output",
+    [
+        ("input1", "output1"),
+        ("input2", "output2"),
+    ],
+    ids=["test1", "test2"] # these are the test names for the above inputs
+)
+def test_my_agent(input, expected_output):
+    actual_output = my_agent(input) # your code to call your agent or part of it here
+    # can include static measures / evaluations here
+    assert actual_output == expected_output
+    # assert some other property of the output...
+```
+What we've shown above will fail on the first assertion failure. But what if we want to evaluate all the outputs before making a pass / fail decision?
+
+### What kind of "asserts" do we want?
+
+We might want to evaluate the output in a number of ways:
+1. Exact match - the output is exactly as expected.
+2. Fuzzy match - the output is close to what we expect, e.g. does it contain the right words, is it "close" to the answer, etc.
+3. Human grade - the output is graded by a human as to how close it is to the expected output.
+4. LLM grade - the output is graded by an LLM as to how close it is to the expected output.
+5. Static measures - the output has some static measures that we want to evaluate, e.g. length, etc.
+
+It is rare that you rely solely on (1) with LLMs; you'll likely want to evaluate the output in a number of ways,
+e.g. that it is close to the expected output and that it contains the right words, and then make a pass / fail decision based on all of these evaluations.
+
+We will not dive deep into what evaluation logic you should use. If you want to dive deeper there, we
+suggest you start with [posts like this](https://hamel.dev/notes/llm/officehours/evalmultiturn.html) - you need to understand your data and outcomes to
+choose the right evaluation logic.
+
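As a rough illustration of combining several of the measures listed above, here is a minimal sketch using only the Python standard library. The function names `evaluate_output` and `passes`, the `required_words` argument, and the `0.8` similarity threshold are all hypothetical choices made for this example; they are not part of Burr or pytest, and `difflib` similarity is only one of many possible fuzzy-match measures.

```python
from difflib import SequenceMatcher


def evaluate_output(actual: str, expected: str, required_words: list[str]) -> dict:
    """Run several independent checks and return every result instead of asserting."""
    # Fuzzy match: character-level similarity ratio between 0.0 and 1.0.
    similarity = SequenceMatcher(None, actual.lower(), expected.lower()).ratio()
    return {
        "exact_match": actual == expected,
        "similarity": similarity,
        # Fuzzy match: does the output mention every word we require?
        "contains_required": all(w.lower() in actual.lower() for w in required_words),
        # Static measure: length of the output.
        "length": len(actual),
    }


def passes(scores: dict, min_similarity: float = 0.8) -> bool:
    """Combine the individual checks into one pass / fail decision (hypothetical rule)."""
    return scores["exact_match"] or (
        scores["contains_required"] and scores["similarity"] >= min_similarity
    )
```

Each value in the returned dictionary could be logged separately, with the overall decision made only after all checks have run. Which checks and thresholds are appropriate depends on your data.
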
+### Not failing on first assert failure / logging test results
+
+One limitation of pytest is that it fails on the first assertion failure. This is not ideal if you want to evaluate multiple aspects of the output before making a pass / fail decision.
+
+There are multiple ways one could solve this limitation, as pytest is very extensible. We will only go over one way here,
+which is to use the `pytest-harvest` plugin to log what our tests are doing. This allows us to capture the results of our tests in a structured way without
+breaking at the first assertion failure. This means we can mix and match, where appropriate, hard "assertions" (i.e. definitely fail) with
+softer ones where we want to evaluate all aspects before making an overall pass / fail decision. We walk through how to do this below using a few
+pytest constructs.
+
+`results_bag` is a fixture that we can log values to from our tests. This is useful if we don't want to fail on the first assert statement,
+and instead capture a lot more. This is not native to pytest, and is why we use the `pytest-harvest` plugin to achieve this.
+
+To use it, you just need to install `pytest-harvest` and then you can use the `results_bag` fixture in your tests:
+
+```python
+def test_my_agent(results_bag):
+    results_bag.input = "my_value"
+    results_bag.output = "my_output"
+    results_bag.expected_output = "my_expected_output"
+```
+
+We can then access the results in the `results_bag` from the `pytest-harvest` plugin via the `module_results_df` fixture that
+provides a pandas dataframe of the results:
+
+```python
+def test_print_results(module_results_df):
+    # place this function at the end of the module so that way it's run last.
+    print(module_results_df.columns) # this will include "input", "output", "expected_output"
+    print(module_results_df.head()) # this will show the first few rows of the results
+    # TODO: Add more evaluation logic here or log the results to a file, etc.
+    # assert some threshold of success, etc.
+```
+This enables us to get a dataframe of all the results from our tests, and then we can evaluate them as we see fit for our use case.
+E.g. we only pass tests if all the outputs are as expected, or we pass if 80% of the outputs are as expected, etc. You could
+also log this to a file, a database, etc. for further inspection and record keeping, or combine it with
+open source frameworks such as [mlflow](https://mlflow.org) and use their [evaluate functionality](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html).
+
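To make the "pass if 80% of the outputs are as expected" idea concrete, here is a minimal sketch of a threshold check. The helper names `pass_rate` and `meets_threshold` are hypothetical and not provided by pytest or `pytest-harvest`; they accept any iterable of booleans, so a dataframe column of logged outcomes could be passed in.

```python
from typing import Iterable


def pass_rate(outcomes: Iterable[bool]) -> float:
    """Return the fraction of truthy outcomes."""
    outcomes = [bool(o) for o in outcomes]
    if not outcomes:
        raise ValueError("no results were recorded")
    return sum(outcomes) / len(outcomes)


def meets_threshold(outcomes: Iterable[bool], threshold: float = 0.8) -> bool:
    """Decide overall pass / fail from the aggregate rather than from any single test."""
    return pass_rate(outcomes) >= threshold
```

Inside a function such as `test_print_results`, this might be used as `assert meets_threshold(module_results_df["success"].dropna())`, assuming your tests logged a `success` value to `results_bag`. Dropping missing values first matters because rows for tests that never set `success` would otherwise count as truthy.
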
+Note: we can also combine `results_bag` with `pytest.mark.parametrize` to run the same test with different inputs and expected outputs:
+
+```python
+import pytest
+
+@pytest.mark.parametrize(
+    "input, expected_output",
+    [
+        ("input1", "output1"),
+        ("input2", "output2"),
+    ],
+    ids=["test1", "test2"] # these are the test names for the above inputs
+)
+def test_my_agent(input, expected_output, results_bag):
+    results_bag.input = input
+    results_bag.expected_output = expected_output
+    results_bag.output = my_agent(input) # your code to call the agent here
+    # can include static measures / evaluations here
+    results_bag.success = results_bag.output == results_bag.expected_output
+```
+
+
+### Using Burr's pytest Hook
+With Burr you can curate test cases from real application runs. You can then use these test cases in your pytest suite.
+Burr has a hook that enables you to curate a file with the input state and expected output state for an entire run,
+or a single action.  See the [Burr test case creation documentation](https://burr.dagworks.io/examples/guardrails/creating_tests/) for more
+details on how. Here we show you how you can combine this with getting results:
+
+```python
+import pytest
+from our_agent_application import prompt_for_more
+
+from burr.core import state
+
+# the following is required to run file based tests
+from burr.testing import pytest_generate_tests  # noqa: F401
+
+@pytest.mark.file_name("prompt_for_more.json") # our fixture file with the expected inputs and outputs
+def test_an_agent_action(input_state, expected_state, results_bag):
+    """Function for testing an individual action of our agent."""
+    input_state = state.State.deserialize(input_state)
+    expected_state = state.State.deserialize(expected_state)
+    _, output_state = prompt_for_more(input_state)  # exercising an action of our agent
+
+    results_bag.input_state = input_state
+    results_bag.expected_state = expected_state
+    results_bag.output_state = output_state
+    results_bag.foo = "bar"
+    # TODO: choose appropriate way to evaluate the output
+    # e.g. exact match, fuzzy match, LLM grade, etc.
+    # this is exact match here on all values in state
+    exact_match = output_state == expected_state
+    # for output that varies, you can do something like this
+    # assert 'some value' in output_state["response"]["content"]
+    # or, have an LLM Grade things -- you need to create the llm_evaluator function:
+    # assert llm_evaluator("are these two equivalent responses. Respond with Y for yes, N for no",
+    # output_state["response"]["content"], expected_state["response"]["content"]) == "Y"
+    # store it in the results bag
+    results_bag.correct = exact_match
+
+    # place any asserts at the end of the test
+    assert exact_match
+```
+So if we want to test an entire agent, we can use the same approach, but instead rely on the input and output
+state being the entire state of the agent at the start and end of the run.
+
+```python
+import pytest
+from our_agent_application import agent_builder, agent_runner # some functions that build and run our agent
+
+from burr.core import state
+
+# the following is required to run file based tests
+from burr.testing import pytest_generate_tests  # noqa: F401
+
+@pytest.mark.file_name("e2e.json") # our fixture file with the expected inputs and outputs
+def test_an_agent_e2e(input_state, expected_state, results_bag):
+    """Function for testing an agent end-to-end."""
+    input_state = state.State.deserialize(input_state)
+    expected_state = state.State.deserialize(expected_state)
+    # exercise the agent
+    agent = agent_builder(input_state) # e.g. something like some_actions._build_application(...)
+    output_state = agent_runner(agent)
+
+    results_bag.input_state = input_state
+    results_bag.expected_state = expected_state
+    results_bag.output_state = output_state
+    results_bag.foo = "bar"
+    # TODO: choose appropriate way to evaluate the output
+    # e.g. exact match, fuzzy match, LLM grade, etc.
+    # this is exact match here on all values in state
+    exact_match = output_state == expected_state
+    # for output that varies, you can do something like this
+    # assert 'some value' in output_state["response"]["content"]
+    # or, have an LLM Grade things -- you need to create the llm_evaluator function:
+    # assert llm_evaluator("are these two equivalent responses. Respond with Y for yes, N for no",
+    # output_state["response"]["content"], expected_state["response"]["content"]) == "Y"
+    # store it in the results bag
+    results_bag.correct = exact_match
+
+    # place any asserts at the end of the test
+    assert exact_match
+
+```
+#### Using the Burr UI to observe test runs
+You can also use the Burr UI to observe the test runs. This can be useful to see the results of the tests in a more visual way.
+To do this, you'd instantiate the Burr Tracker and then run the tests as normal. A few notes on ergonomics:
+
+1. It's useful to use the test_name as the partition_key to easily find test runs in the Burr UI. You can also make the app_id match some test run ID, e.g. date-time, etc.
+2. You can turn on opentelemetry tracing to see the traces in the Burr UI as well.
+3. In general, this means that you should have a parameterizable application builder function that can take in a tracker and partition key.
+
+```python
+import pytest
+from our_agent_application import agent_builder, agent_runner # some functions that build and run our agent
+
+from burr.core import state
+
+# the following is required to run file based tests
+from burr.testing import pytest_generate_tests  # noqa: F401
+from burr.tracking import LocalTrackingClient
+
+@pytest.fixture
+def tracker():
+    """Fixture for creating a tracker to track runs to log to the Burr UI."""
+    tracker = LocalTrackingClient("pytest-runs")
+    # optionally turn on opentelemetry tracing
+    yield tracker
+
+
+@pytest.mark.file_name("e2e.json") # our fixture file with the expected inputs and outputs
+def test_an_agent_e2e_with_tracker(input_state, expected_state, results_bag, tracker, request):
+    """Function for testing an agent end-to-end using the tracker.
+
+    Fixtures used:
+     - results_bag: to log results -- comes from pytest-harvest
+     - tracker: to track runs -- comes from tracker() function above
+     - request: to get the test name -- comes from pytest
+    """
+    input_state = state.State.deserialize(input_state)
+    expected_state = state.State.deserialize(expected_state)
+
+    test_name = request.node.name
+    # exercise the agent
+    agent = agent_builder(input_state, partition_key=test_name, tracker=tracker) # e.g. something like some_actions._build_application(...)
+    output_state = agent_runner(agent)
+
+    results_bag.input_state = input_state
+    results_bag.expected_state = expected_state
+    results_bag.output_state = output_state
+    results_bag.foo = "bar"
+    # TODO: choose appropriate way to evaluate the output
+    # e.g. exact match, fuzzy match, LLM grade, etc.
+    # this is exact match here on all values in state
+    exact_match = output_state == expected_state
+    # for output that varies, you can do something like this
+    # assert 'some value' in output_state["response"]["content"]
+    # or, have an LLM Grade things -- you need to create the llm_evaluator function:
+    # assert llm_evaluator("are these two equivalent responses. Respond with Y for yes, N for no",
+    # output_state["response"]["content"], expected_state["response"]["content"]) == "Y"
+    # store it in the results bag
+    results_bag.correct = exact_match
+
+    # place any asserts at the end of the test
+    assert exact_match
+```
+
+# An example
+Here in this directory we have:
+
+ - `some_actions.py` - a file that defines an augmented LLM application (it's not a full agent) with some actions
+ - `test_some_actions.py` - a file that defines some tests for the actions in `some_actions.py`.
+
+You'll see that we use the `results_bag` fixture to log the results of our tests, and then we can access these results
+via the `module_results_df` fixture that provides a pandas dataframe of the results. This dataframe is then
+saved as a CSV for uploading to Google Sheets, etc. for further analysis. You will also see uses of `pytest.mark.parametrize`
+and Burr's pytest feature for parameterizing tests from a JSON file.
+
+To run the tests, you can run them with pytest:
+
+```bash
+pytest test_some_actions.py
+```
+
+After running the tests, you can see the results in a CSV file called `results.csv` in the same directory as the tests.
+
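For a quick look at which tests need attention before uploading the CSV anywhere, a small standard-library sketch like the following could help. The `correct` and `test_id` column names are assumptions for illustration; check the header of your own `results.csv` for the columns your tests actually logged.

```python
import csv


def failing_rows(csv_path: str, flag_column: str = "correct") -> list[dict]:
    """Return the rows whose flag column is anything other than 'True'."""
    with open(csv_path, newline="") as f:
        return [
            row
            for row in csv.DictReader(f)
            # Booleans are read back from CSV as text, so compare as strings.
            if (row.get(flag_column) or "").strip().lower() != "true"
        ]
```

For example, `[row["test_id"] for row in failing_rows("results.csv")]` would list the failing test names, assuming the CSV has a `test_id` column.
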
+You should also see the following in the Burr UI:
+
+<img src="burr_ui.png" alt="Burr UI" style="width:1000px;"/>
diff --git a/examples/pytest/burr_sdlc.png b/examples/pytest/burr_sdlc.png
diff --git a/examples/pytest/burr_ui.png b/examples/pytest/burr_ui.png