From 05ac53d7a008ab2f8cafd3beb852105544688bb3 Mon Sep 17 00:00:00 2001 From: Joe Vincent Date: Fri, 10 May 2024 08:12:44 -0700 Subject: [PATCH] Update index.html --- docs/index.html | 32 +++++++++++++++++++++++++++----- 1 file changed, 27 insertions(+), 5 deletions(-) diff --git a/docs/index.html b/docs/index.html index 62c4f8d6..d68208e0 100644 --- a/docs/index.html +++ b/docs/index.html @@ -115,7 +115,8 @@

How Generalizable Is My Behavior Cloning type="video/mp4">

- Aliquam vitae elit ullamcorper tellus egestas pellentesque. Ut lacus tellus, maximus vel lectus at, placerat pretium mi. Maecenas dignissim tincidunt vestibulum. Sed consequat hendrerit nisl ut maximus. + When using a small number of policy rollouts to evaluate robot performance, it is important to quantify our uncertainty in the performance estimate. + In our paper we show how to place worst-case confidence bounds on the distribution of robot performance while using the observed performance from policy rollouts as efficiently as possible.

@@ -140,6 +141,27 @@

Abstract

+ +
+
+
+
+

Evaluation in Simulation

+
+

+ We obtain upper confidence bounds on the cumulative distribution function (CDF) of the total reward obtained by diffusion policies in out-of-distribution robosuite environments. + An upper confidence bound on the CDF can be interpreted as the worst-case distribution of reward that is consistent with the observed policy rollouts. + Here we show representative policy rollouts for the Square environment, and plot the in-distribution CDF of reward and our upper confidence bound constructed from 40 out-of-distribution policy rollouts. + The confidence bounds we obtain quantify our uncertainty in the performance of the robot in a concrete and interpretable manner. +
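As a rough illustration only (this is not the bound construction used in the paper, which is designed to use the rollouts more efficiently), the classical Dvoretzky–Kiefer–Wolfowitz (DKW) inequality gives a simple distribution-free upper confidence band on a CDF from n policy rollouts; the function name and the reward values below are hypothetical:

```python
import math

def dkw_upper_cdf(rewards, delta=0.05):
    """Distribution-free upper confidence band on the CDF of reward,
    via the DKW inequality. Holds simultaneously over all reward values
    with probability at least 1 - delta.

    Returns the sorted sample points and the band evaluated at each one.
    """
    n = len(rewards)
    # DKW half-width: sqrt(ln(2/delta) / (2n))
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    xs = sorted(rewards)
    # Empirical CDF at each sorted sample, shifted up by eps, clipped at 1.
    band = [min((i + 1) / n + eps, 1.0) for i in range(n)]
    return xs, band

# Illustrative call with 40 made-up rollout rewards (matching the
# sample size in the text, but not the paper's data or method).
xs, band = dkw_upper_cdf([float(i) for i in range(40)])
```

The band can be read the same way as in the text: it is the worst-case reward distribution consistent with the observed rollouts at the chosen confidence level, though looser than the bounds constructed in the paper.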

+
+
+
+
+
+ + + @@ -147,11 +169,11 @@

Abstract

-

Hardware Evaluation

+

Evaluation in Hardware

We obtain lower confidence bounds on the success rate of a diffusion policy tested in two out-of-distribution environments. - The confidence bounds we obtain make the most efficient use of the 50 samples used to estimate the performance of the robot. + The confidence bounds we obtain make the most efficient use of the 50 policy rollouts used to estimate the performance of the robot. The confidence bounds we obtain quantify our uncertainty in the performance of the robot in a concrete and interpretable manner.
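For intuition on what a lower confidence bound on success rate means, a much cruder baseline than the paper's bounds is the Hoeffding lower confidence bound from n Bernoulli rollouts; the function name and the 40-of-50 counts below are illustrative assumptions, not results from the paper:

```python
import math

def hoeffding_lower_bound(successes, n, delta=0.05):
    """One-sided lower confidence bound on a success rate estimated
    from n Bernoulli trials, via Hoeffding's inequality. Holds with
    probability at least 1 - delta."""
    p_hat = successes / n
    # Hoeffding slack: sqrt(ln(1/delta) / (2n))
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(p_hat - slack, 0.0)

# Hypothetical example: 40 successes in 50 rollouts gives a 95%
# lower confidence bound of roughly 0.63 on the true success rate.
lb = hoeffding_lower_bound(40, 50)
```

Tighter constructions (e.g. exact binomial bounds) extract more from the same 50 rollouts, which is the kind of efficiency the text refers to.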

@@ -177,7 +199,7 @@

Comparing Policies

Here we apply our statistical bounds to the recent results from the RT-2 paper, where the authors compare their RT-2 policy to a VC-1 policy in three settings designed to test emergent capabilities in symbol understanding, reasoning, and human recognition. For each setting we find the 95% confidence intervals for policy success rate are disjoint, and we conclude with 95% confidence that RT-2 outperforms VC-1.

- Confidence intervals for policy success rates + Confidence intervals for policy success rates
@@ -193,7 +215,7 @@

Comparing Policies

-
+