---
title: "AdaptiveConformal: An `R` Package for Adaptive Conformal Inference"
subtitle: "AdaptiveConformal: An `R` Package for Adaptive Conformal Inference"
author:
- name: Herbert Susmann
corresponding: true
email: [email protected]
url: https://herbsusmann.com
orcid: 0000-0002-3540-8255
affiliations:
- name: CEREMADE (UMR 7534), Université Paris-Dauphine PSL, Place du Maréchal de Lattre de Tassigny, Paris, 75016, France
url: https://www.ceremade.dauphine.fr/
- name: Antoine Chambaz
email: [email protected]
url: https://helios2.mi.parisdescartes.fr/~chambaz/
orcid: 0000-0002-5592-6471
affiliations:
- name: Université Paris Cité, CNRS, MAP5, F-75006 Paris, France
department:
url: https://map5.mi.parisdescartes.fr/
- name: Julie Josse
email: [email protected]
url: http://juliejosse.com/
orcid: 0000-0001-9547-891X
affiliations:
- name: Inria PreMeDICaL team, Idesp, Université de Montpellier
url: https://team.inria.fr/premedical/
date: 07-18-2024
date-modified: last-modified
description: |
Conformal Inference (CI) is a popular approach for generating finite sample prediction intervals based on the output of any point prediction method when data are exchangeable. Adaptive Conformal Inference (ACI) algorithms extend CI to the case of sequentially observed data, such as time series, and exhibit strong theoretical guarantees without having to assume exchangeability of the observed data. The common thread that unites algorithms in the ACI family is that they adaptively adjust the width of the generated prediction intervals in response to the observed data. We provide a detailed description of five ACI algorithms and their theoretical guarantees, and test their performance in simulation studies. We then present a case study of producing prediction intervals for influenza incidence in the United States based on black-box point forecasts. Implementations of all the algorithms are released as an open-source `R` package, `AdaptiveConformal`, which also includes tools for visualizing and summarizing conformal prediction intervals.
abstract: >+
Conformal Inference (CI) is a popular approach for generating finite sample prediction intervals based on the output of any point prediction method when data are exchangeable. Adaptive Conformal Inference (ACI) algorithms extend CI to the case of sequentially observed data, such as time series, and exhibit strong theoretical guarantees without having to assume exchangeability of the observed data. The common thread that unites algorithms in the ACI family is that they adaptively adjust the width of the generated prediction intervals in response to the observed data. We provide a detailed description of five ACI algorithms and their theoretical guarantees, and test their performance in simulation studies. We then present a case study of producing prediction intervals for influenza incidence in the United States based on black-box point forecasts. Implementations of all the algorithms are released as an open-source `R` package, `AdaptiveConformal`, which also includes tools for visualizing and summarizing conformal prediction intervals.
keywords: [Conformal inference, Adaptive conformal inference, time series, R]
citation:
type: article-journal
container-title: "Computo"
doi: "10.57750/edan-5f53"
url: https://computo.sfds.asso.fr/template-computo-quarto
publisher: "Société Française de Statistique"
issn: "2824-7795"
bibliography: references.bib
github-user: computorg
logo: "true"
repo: "published-202407-susmann-adaptive-conformal"
draft: false
published: true
google-scholar: true
format:
computo-html:
code-fold: true
computo-pdf:
keep-tex: true
include-in-header:
- text: |
\usepackage{stmaryrd}
\usepackage{xfrac}
---
# Introduction
Conformal Inference (CI) is a family of methods for generating finite sample prediction intervals around point predictions when data are exchangeable [@vovk2005; @shafer2008conformal; @angelopoulos2022gentle]. The input point predictions can be derived from any prediction method, making CI a powerful tool for augmenting black-box prediction algorithms with prediction intervals. Classical CI methods are able to yield marginally valid intervals with only the assumption that the joint distribution of the data does not change based on the order of the observations (that is, they are exchangeable). However, in many real-world settings data are not exchangeable: for example, time series data usually cannot be assumed to be exchangeable due to temporal dependence. A recent line of research examines the problem of generating prediction intervals for observations that are observed online (that is, one at a time) and for which exchangeability is not assumed to hold [@gibbs2021adaptive; @zaffran2022agaci; @gibbs2022faci; @bhatnagar2023saocp]. The methods from this literature, which we refer to generally as _Adaptive Conformal Inference_ (ACI) algorithms, work by adaptively adjusting the width of the generated prediction intervals in response to the observed data.
Informally, suppose a sequence of outcomes $y_t \in \mathbb{R}$, $t = 1, \dots, T$ are observed one at a time. Before seeing each observation, we have at our disposal a point prediction $\hat{\mu}_t \in \mathbb{R}$ that can be generated by any method. Our goal is to find an algorithm for producing prediction intervals $[\ell_t, u_t]$, $\ell_t \leq u_t$ such that, in the long run, the observations $y_t$ fall within the corresponding prediction intervals roughly $\alpha \times 100\%$ of the time: that is, $\lim_{T \to \infty} \sfrac{1}{T} \sum_{t=1}^T \mathbb{I}\{ y_t \in [\ell_t, u_t] \} = \alpha$. The original ACI algorithm [@gibbs2021adaptive] is based on a simple idea: if the previous prediction interval at time $(t-1)$ did not cover the true observation, then the next prediction interval at time $t$ is made slightly wider. Conversely, if the previous prediction interval did include the observation, then the next prediction interval is made slightly narrower. It can be shown that this procedure yields prediction intervals that in the long run cover the true observations the desired proportion of the time.
The main tuning parameter of the original ACI algorithm is a learning rate that controls how fast prediction interval width changes. If the learning rate is too low, then the prediction intervals will not be able to adapt fast enough to shifts in the data generating distribution; if it is too large, then the intervals will oscillate widely. The critical dependence of the original ACI algorithm on proper choice of its learning rate spurred subsequent research into meta-algorithms that learn the correct learning rate (or an analogue thereof) in various ways, typically drawing on approaches from the online learning literature. In this paper, we present four such algorithms: Aggregated ACI [AgACI, @zaffran2022agaci], Dynamically-tuned Adaptive ACI [DtACI, @gibbs2022faci], Scale-Free Online Gradient Descent [SF-OGD, @bhatnagar2023saocp], and Strongly Adaptive Online Conformal Prediction [SAOCP, @bhatnagar2023saocp]. We note that the adaptation of conformal inference techniques is an active area of research and the algorithms we focus on in this work are not exhaustive; see among others @feldman2023achieving, @bastani2022practical, @xu2021enbpi, @xu2023spci, @angelopoulos2024online, @zhang2024discounted, and @gasparin2024conformal.
Our primary practical contribution is an implementation of each algorithm in an open source `R` package, `AdaptiveConformal`, which is available at [https://github.com/herbps10/AdaptiveConformal](https://github.com/herbps10/AdaptiveConformal). The package also includes routines for visualization and summary of the prediction intervals. We note that Python versions of several algorithms were also made available by @zaffran2022agaci and @bhatnagar2023saocp, but to our knowledge this is the first package implementing them in `R`. In addition, several `R` packages exist for conformal inference in other contexts, including `conformalInference` focusing on regression [@tibshirani2019ci], `conformalInference.fd`, with methods for functional responses [@diquigiovanni2022fd], and `cfcausal` for causal inference related functionals [@lei2020cfcausal]. Our second practical contribution is to compare the performance of the algorithms in simulation studies and in a case study generating prediction intervals for influenza incidence in the United States based on black-box point forecasts.
The rest of the paper unfolds as follows. In @sec-theory, we present a unified theoretical framework for analyzing the ACI algorithms based on the online learning paradigm. In @sec-algorithms we provide descriptions of each algorithm along with their known theoretical properties. In @sec-simulations we compare the performance of the algorithms in several simulation studies. @sec-case-study gives a case study based on forecasting influenza in the United States. Finally, @sec-discussion provides a discussion and ideas for future research in this rapidly expanding field.
```{r, warning=FALSE, message=FALSE, echo=FALSE}
library(tidyverse)
library(patchwork)
library(future)
library(furrr)
library(progressr)
knitr::opts_chunk$set(echo = TRUE)
source("helpers.R")
```
# Theoretical Framework {#sec-theory}
*Notation*: for any integer $N \geq 1$ let $\llbracket N \rrbracket := \{ 1, \dots, N \}$. Let $\mathbb{I}$ be the indicator function. Let $\nabla f$ denote the gradient (subgradient) of the differentiable (convex) function $f$.
We consider an online learning scenario in which we gain access to a sequence of observations $(y_t)_{t \geq 1}$ one at a time (see @cesabianchi2006games for a comprehensive account of online learning theory). Fix $\alpha \in (0, 1)$ to be the target empirical coverage of the prediction intervals. The goal is to output at time $t$ a prediction interval for the unseen observation $y_{t}$, with the prediction interval generated by an _interval construction function_ $\widehat{C}_{t}$. Formally, let $\widehat{C}_t$ be a function that takes as input a parameter $\theta_t \in \mathbb{R}$ and outputs a closed prediction interval $[\ell_t, u_t]$. The interval construction function must be nested: if $\theta^\prime > \theta$, then $\widehat{C}_t(\theta) \subseteq \widehat{C}_t(\theta^\prime)$. In words, larger values of $\theta$ imply wider prediction intervals. The interval constructor is indexed by $t$ to emphasize that it may use other information at each time point, such as a point prediction $\hat{\mu}_t \in \mathbb{R}$. We make no restrictions on how this external information is generated.
Define $r_t := \inf\{\theta \in \mathbb{R} : y_t \in \widehat{C}_t(\theta) \}$ to be the _radius_ at time $t$. The radius is the smallest possible $\theta$ such that the prediction interval covers the observation $y_t$. A key assumption for the theoretical analysis of several of the algorithms is that the radii are bounded:
**Assumption**: there exists a finite $D > 0$ such that $r_t < D$ for all $t$.
If the outcome space is bounded, then $D$ can be easily chosen to cover the entire space. Next, we describe two existing definitions of interval construction functions.
## Linear Intervals
A simple method for forming the prediction intervals is to use the parameter $\theta_t$ to directly define the width of the interval. Suppose that at each time $t$ we have access to a point prediction $\hat{\mu}_t \in \mathbb{R}$. Then we can form a symmetric prediction interval around the point estimate using
$$
\begin{aligned}
\theta \mapsto \widehat{C}_t(\theta) := [\hat{\mu}_t - \theta, \hat{\mu}_t + \theta].
\end{aligned}
$$
We refer to this as the _linear interval constructor_. Note that in this case, the radius is simply the absolute residual $r_t = |\hat{\mu}_t - y_t|$.
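As a minimal illustration, the linear constructor and its radius can be sketched in a few lines of `R` (the helper names are ours for exposition and are not part of the `AdaptiveConformal` API):

```r
# Linear interval constructor: theta is the half-width of a
# symmetric interval around the point prediction mu_hat.
linear_interval <- function(mu_hat, theta) {
  c(mu_hat - theta, mu_hat + theta)
}

# The radius is the smallest theta whose interval covers y,
# which here is simply the absolute residual.
linear_radius <- function(mu_hat, y) {
  abs(mu_hat - y)
}
```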
## Quantile Intervals
The original ACI paper proposed constructing intervals based on the previously observed residuals [@gibbs2021adaptive]. Let $S : \mathbb{R}^2 \to \mathbb{R}$ be a function called a _nonconformity score_. A popular choice of nonconformity score is the absolute residual: $(\mu, y) \mapsto S(\mu, y):= |\mu - y|$. Let $s_t := S(\hat{\mu}_t, y_t)$ be the nonconformity score of the $t$th-observation. The quantile interval construction function is then given by
$$
\begin{aligned}
\widehat{C}_t(\theta) := [\hat{\mu}_t - \mathrm{Quantile}(\theta, \{ s_1, \dots, s_{t-1} \}), \hat{\mu}_t + \mathrm{Quantile}(\theta, \{ s_1, \dots, s_{t-1} \})]
\end{aligned}
$$
where $\mathrm{Quantile}(\theta, A)$ denotes the empirical $\theta$-quantile of the elements in the set $A$. Note that $\widehat{C}_t$ is indeed nested in $\theta$ because the Quantile function is non-decreasing in $\theta$. We define $\mathrm{Quantile}(1, \{ s_1, \dots, s_{t-1} \}) := \max\{s_1, \dots, s_{t-1}\}$ rather than $\infty$ in order to avoid practical problems with trivial prediction intervals [@zaffran2022agaci]. With this constructor, we can always choose $D = 1$ to satisfy the radius boundedness assumption given above.
We focus on the above definition for the quantile interval construction function which is designed to be symmetric around the point prediction $\hat{\mu}_t$. However, we note it is possible to take a more general definition, such as
$$
\widehat{C}_t(\theta) := \{ y : S(\hat{\mu}_t, y) \leq \mathrm{Quantile}(\theta, \{ s_1, \dots, s_{t-1}\}) \}.
$$
Such an approach allows for prediction intervals that may not be centered on $\hat{\mu}_t$.
Our proposed `AdaptiveConformal` package takes the absolute residual as the default nonconformity score, although the user may also specify any custom nonconformity score by supplying it as an `R` function.
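To make this construction concrete, the quantile constructor with the absolute-residual score can be sketched as follows (the function name is ours; the package's internal implementation may differ):

```r
# Quantile interval constructor: the half-width is the empirical
# theta-quantile of the past nonconformity scores s_1, ..., s_{t-1}.
# Following the convention above, theta = 1 gives the maximum score
# rather than an infinite interval.
quantile_interval <- function(mu_hat, theta, scores) {
  theta <- min(max(theta, 0), 1)  # clamp theta to [0, 1]
  q <- unname(quantile(scores, probs = theta, type = 1))
  c(mu_hat - q, mu_hat + q)
}
```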
## Online Learning Framework
We now introduce a loss function that defines the quality of a prediction interval with respect to a realized observation. Define the _pinball loss_ $L^\alpha$ as
$$
(\theta, r) \mapsto L^\alpha(\theta, r) := \begin{cases}
(1 - \alpha)(\theta - r), & \theta \geq r \\
\alpha(r - \theta), & \theta < r.
\end{cases}
$$
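In `R`, the pinball loss is a direct transcription of this definition (a sketch; the argument names mirror the notation above):

```r
# Pinball (quantile) loss at level alpha: penalizes intervals that
# are too wide by (1 - alpha) and too narrow by alpha.
pinball_loss <- function(theta, r, alpha) {
  ifelse(theta >= r, (1 - alpha) * (theta - r), alpha * (r - theta))
}
```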
The way in which the algorithm gains access to the data and incurs losses is as follows:
- Sequentially, for $t = 1, \dots, T$:
- Predict radius $\theta_t$ and form prediction interval $\widehat{C}_t(\theta_t)$.
- Observe true outcome $y_t$ and calculate radius $r_t$.
- Record $\mathrm{err}_t := \mathbb{I}[y_t \not\in \widehat{C}_t(\theta_t)]$.
- Incur loss $L^\alpha(\theta_t, r_t)$.
This iterative procedure is at the core of the online learning framework within which the theoretical guarantees of the ACI algorithms are derived.
## Assessing ACI algorithms
There are two different perspectives we can take in measuring the quality of an ACI algorithm that generates a sequence $(\theta_t)_{t \in \llbracket T \rrbracket}$. First, we could look at how close the empirical coverage of the generated prediction intervals is to the desired coverage level $\alpha$. Formally, define the empirical coverage as the proportion of observations that fell within the corresponding prediction interval: $\mathrm{EmpCov}(T) := \frac{1}{T} \sum_{t=1}^T (1 - \mathrm{err}_t)$. The coverage error is then given by
$$
\begin{aligned}
\mathrm{CovErr}(T) := \mathrm{EmpCov}(T) - \alpha.
\end{aligned}
$$
The second perspective is to look at how well the algorithm controls the incurred pinball losses. Following the classical framework from the online learning literature, we define the _regret_ as the difference between the cumulative loss yielded by a sequence $(\theta_t)_{t \in \llbracket T \rrbracket}$ versus the cumulative loss of the best possible fixed choice:
$$
\begin{aligned}
\mathrm{Reg}(T) := \sum_{t=1}^T L^\alpha(\theta_t, r_t) - \inf_{\theta^* \in \mathbb{R}} \sum_{t=1}^T L^\alpha(\theta^*, r_t).
\end{aligned}
$$
In settings of distribution shift, it may not be appropriate to compare the cumulative loss of an algorithm to a fixed competitor. As such, stronger notions of regret have been defined. The _strongly adaptive regret_ is the largest regret over any subperiod of length $m \in \llbracket T \rrbracket$:
$$
\begin{aligned}
\mathrm{SAReg}(T, m) := \max_{[\tau, \tau + m - 1] \subseteq \llbracket T \rrbracket} \left( \sum_{t=\tau}^{\tau + m - 1} L^{\alpha}(\theta_t, r_t) - \inf_{\theta^* \in \mathbb{R}} \sum_{t=\tau}^{\tau + m - 1} L^\alpha(\theta^*, r_t) \right).
\end{aligned}
$$
Both ways of evaluating ACI methods are important because targeting only one or the other can lead to algorithms that yield prediction intervals that are not practically useful. As a simple pathological example of only targeting the coverage error, suppose we wish to generate $\alpha = 50\%$ prediction intervals. We could choose to alternate $\theta$ between 0 and $\infty$, such that $\mathrm{err}_t$ alternates between 0 and 1. The empirical coverage would then trivially converge to the desired level of 50\%. However, the same algorithm would yield infinite regret (see @bhatnagar2023saocp for a more in-depth example of a scenario in which coverage is optimal but the regret grows linearly). On the other hand, an algorithm that has arbitrarily small regret may not yield good empirical coverage. Suppose the observations and point predictions are constant: $y_t = 1$ and $\hat{\mu}_t = 0$ for all $t \geq 1$. Consider a simple class of algorithms that constantly outputs $\theta_t = \theta'$ for some $\theta' < 1$. With the linear interval construction function, the prediction intervals are then $\widehat{C}_t(\theta_t) = [-\theta', \theta']$. Each step incurs loss $\alpha(1 - \theta')$, while the best fixed choice $\theta^* = 1$ incurs zero loss, so the regret is given by $\mathrm{Reg}(T) = T\alpha(1-\theta')$, which approaches zero as $\theta'$ approaches 1. The empirical coverage is, however, always zero. In other words, the regret can be arbitrarily close to zero while at the same time the empirical coverage does not approach the desired level.
These simple examples illustrate that, unfortunately, bounds on the coverage error and bounds on the regret are not in general interchangeable. It is possible, however, to show equivalencies by either (1) making distributional assumptions on the data or (2) using additional information about how the algorithm produces the sequence $(\theta_t)_{t \in \llbracket T \rrbracket}$ [@bhatnagar2023saocp].
It may also be informative to summarize a set of prediction intervals in ways beyond their coverage error or their regret. A common metric for prediction intervals is the _mean interval width_:
$$
\begin{aligned}
\mathrm{MeanWidth}(T) := \frac{1}{T} \sum_{t=1}^T w_t,
\end{aligned}
$$
where $w_t := u_t - \ell_t$ is the interval width at time $t$.
Finally, we introduce a metric that is intended to capture pathological behavior that can arise with ACI algorithms where the prediction intervals oscillate between being extremely narrow and extremely wide. Define the _path length_ of prediction intervals generated by an ACI algorithm as
$$
\begin{aligned}
\mathrm{PathLength}(T) := \sum_{t=2}^T |w_t - w_{t-1}|.
\end{aligned}
$$
A high path length indicates that the prediction intervals were variable over time, and a low path length indicates the prediction intervals were stable.
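These summary metrics can all be computed directly from the interval bounds and the observations. A sketch (variable and function names are ours):

```r
# Summary metrics for a sequence of prediction intervals
# [lower[t], upper[t]] and observations y[t], t = 1, ..., T.
interval_metrics <- function(y, lower, upper, alpha) {
  covered <- (y >= lower) & (y <= upper)
  width <- upper - lower
  list(
    empirical_coverage = mean(covered),          # EmpCov(T)
    coverage_error     = mean(covered) - alpha,  # CovErr(T)
    mean_width         = mean(width),            # MeanWidth(T)
    path_length        = sum(abs(diff(width)))   # PathLength(T)
  )
}
```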
# Algorithms {#sec-algorithms}
| Algorithm | Tuning parameters | Original interval constructor | Theoretical guarantees | Citation |
|-----------|-------------------|-------------------------------|------------------------|----------|
| Adaptive Conformal Inference (ACI) | learning rate $\gamma$ | quantile | coverage error, regret | @gibbs2021adaptive |
| Aggregated Adaptive Conformal Inference (AgACI) | candidate learning rates $(\gamma_k)_{1 \leq k \leq K}$ | quantile | (none discussed) | @zaffran2022agaci |
| Dynamically-tuned Adaptive Conformal Inference (DtACI) | candidate learning rates $(\gamma_k)_{1 \leq k \leq K}$ | quantile | coverage error, strongly adaptive regret, dynamic regret | @gibbs2022faci |
| Scale-Free Online Gradient Descent (SF-OGD) | learning rate $\gamma$ or maximum radius $D$ | linear | coverage error, anytime regret | @bhatnagar2023saocp |
| Strongly Adaptive Online Conformal Prediction (SAOCP) | learning rate $\gamma$, lifetime multiplier $g$ | linear | coverage error, strongly adaptive regret | @bhatnagar2023saocp |
: Summary of ACI algorithms. Only the theoretical guarantees discussed in this paper are shown for each algorithm. {#tbl-aci}
As a simple running example to illustrate each algorithm, we simulate independently $T = 500$ values $y_1, \dots, y_T$ following
$$
\begin{aligned}
y_t &\sim N(0, \sigma_t^2), \quad t \in \llbracket T \rrbracket, \\
\sigma_t &= \begin{cases}
0.2, &t \leq 250, \\
0.1, &t > 250.
\end{cases}
\end{aligned}
$$
For demonstration purposes we assume we have access to unbiased predictions $\hat{\mu}_t = 0$ for all $t \in \llbracket T \rrbracket$. Throughout we set the target empirical coverage to $\alpha = 0.8$.
```{r example_setup, echo = FALSE}
library(AdaptiveConformal)
set.seed(15321)
N <- 5e2
running_example <- running_example_data(N)
y <- running_example$y
yhat <- running_example$yhat
```
## Adaptive Conformal Inference (ACI)
```pseudocode
#| label: algo-aci
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Adaptive Conformal Inference}
\begin{algorithmic}
\State \textbf{Input:} starting value $\theta_1$, learning rate $\gamma > 0$.
\For{$t = 1, 2, \dots, T$}
\State \textbf{Output:} prediction interval $\widehat{C}_t(\theta_t)$.
\State Observe $y_t$.
\State Evaluate $\mathrm{err}_t = \mathbb{I}[y_t \not\in \widehat{C}_t(\theta_t)]$.
\State Update $\theta_{t+1} = \theta_t + \gamma (\mathrm{err}_t - (1 - \alpha))$.
\EndFor
\end{algorithmic}
\end{algorithm}
```
The original ACI algorithm (@gibbs2021adaptive; @algo-aci) adaptively adjusts the width of the prediction intervals in response to the observations. The updating rule for the estimated radius can be derived as an online subgradient descent scheme. The subgradient of the pinball loss function with respect to $\theta$ is given by
$$
\begin{aligned}
\nabla L^\alpha(\theta, r) &= \begin{cases}
\{ -\alpha \}, &\theta < r, \\
\{ 1 - \alpha \}, & \theta > r, \\
[-\alpha, 1 - \alpha], &\theta = r
\end{cases}
\end{aligned}
$$
It follows that, for all $\theta_t \in \mathbb{R}$ and $r_t \in \mathbb{R}$,
$$
1 - \alpha - \mathrm{err}_t \in \nabla L^\alpha(\theta_t, r_t).
$$
This leads to the following update rule for $\theta$ based on subgradient descent:
$$
\begin{aligned}
\theta_{t+1} = \theta_{t} + \gamma (\mathrm{err}_t - (1 - \alpha)),
\end{aligned}
$$
where $\gamma > 0$ is a user-specified learning rate. For intuition, note that if $y_t$ fell outside of the prediction interval at time $t$ ($\mathrm{err}_t = 1$) then the next interval is widened ($\theta_{t+1} = \theta_t + \gamma \alpha$). On the contrary, if $y_t$ fell within the interval ($\mathrm{err}_t = 0$) then the next interval is shortened ($\theta_{t+1} = \theta_t - \gamma(1 - \alpha)$). The learning rate $\gamma$ controls how fast the width of the prediction intervals changes in response to the data.
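Combining this update rule with the linear interval constructor gives a compact reference implementation. The sketch below is for exposition only; in practice, the `aci()` function of the `AdaptiveConformal` package should be used:

```r
# Minimal ACI with the linear interval constructor.
# y: observations; mu_hat: point predictions; alpha: target coverage;
# gamma: learning rate; theta1: starting value.
aci_linear <- function(y, mu_hat, alpha, gamma, theta1 = 0) {
  n <- length(y)
  theta <- numeric(n + 1)
  theta[1] <- theta1
  lower <- upper <- err <- numeric(n)
  for (t in seq_len(n)) {
    lower[t] <- mu_hat[t] - theta[t]
    upper[t] <- mu_hat[t] + theta[t]
    err[t] <- as.numeric(y[t] < lower[t] | y[t] > upper[t])
    # Subgradient step: widen after a miss, narrow after a cover.
    theta[t + 1] <- theta[t] + gamma * (err[t] - (1 - alpha))
  }
  list(lower = lower, upper = upper, err = err, theta = theta)
}
```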
### Theoretical Guarantees
With the quantile interval constructor, the ACI algorithm has the following finite sample bound on the coverage error (@gibbs2021adaptive; Proposition 4.1). For all $\gamma > 0$ (so long as $\gamma$ does not depend on $T$),
$$
|\mathrm{CovErr}(T)| \leq \frac{\max\{\theta_1, 1 - \theta_1\} + \gamma}{\gamma T}.
$$
This result was originally shown for ACI with the quantile interval constructor, although it can also be extended to other interval constructors [@bhatnagar2023saocp; @feldman2023achieving]. More generally, the algorithm has the following coverage error bound in terms of the radius bound $D$ [@bhatnagar2023saocp]:
$$
|\mathrm{CovErr}(T)| \leq \frac{D + \gamma}{\gamma T}.
$$
In addition, standard results for online subgradient descent yield the following regret bound with the use of the linear interval constructor, assuming that the true radii are bounded by $D$:
$$
\mathrm{Reg}(T) \leq \mathcal{O}(D^2 / \gamma + \gamma T) \leq \mathcal{O}(D \sqrt{T}),
$$
where the second inequality follows if the optimal choice (with respect to long-term regret) of $\gamma = D/\sqrt{T}$ is used [@bhatnagar2023saocp]. Taken together, these theoretical results imply that while the coverage error is guaranteed to converge to zero for any choice of $\gamma$, achieving sublinear regret requires choosing $\gamma$ more carefully. This highlights the importance of both ways of assessing ACI algorithms: if we only focused on controlling the coverage error, we might not achieve optimal control of regret, leading to intervals that are not practically useful.
### Tuning Parameters
The main tuning parameter is therefore the learning rate $\gamma$. The theoretical bounds on the coverage error suggest setting a large $\gamma$ so that the coverage error decays quickly in $T$; in practice, however, setting $\gamma$ too large leads to intervals with large oscillations, as seen in @fig-aci. This is quantified by the path length, which increases significantly as $\gamma$ increases, even though the empirical coverage remains near the desired value of 80\%. On the other hand, setting $\gamma$ too small yields intervals that do not adapt fast enough to distribution shifts. Thus, choosing a good value for $\gamma$ is essential. However, the optimal choice $\gamma = D / \sqrt{T}$ cannot be used directly in practice unless the time series length $T$ is fixed in advance, or the so-called "doubling trick" is used to relax the need to know $T$ in advance (@cesabianchi2006games; Section 2.3).
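For intuition, the doubling trick restarts the algorithm over epochs whose lengths double, choosing $\gamma$ as if the horizon were the current epoch length. A schematic sketch of the resulting learning-rate schedule (our own, assuming a known radius bound $D$):

```r
# Doubling-trick learning rate: at time t, the current epoch has
# length 2^k with k = floor(log2(t)), and gamma is chosen as
# D / sqrt(epoch length), mimicking the fixed-horizon optimum.
doubling_gamma <- function(t, D) {
  epoch_length <- 2^floor(log2(t))
  D / sqrt(epoch_length)
}
```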
The theoretical results guaranteeing the performance of the ACI algorithm do not depend on the choice of starting value $\theta_1$, and thus in practice any value can be chosen. Indeed, the effect of the choice of $\theta_1$ decays over time as a function of the chosen learning rate. In practice, substantive prior information can be used to pick a reasonable starting value. By default, the `AdaptiveConformal` package sets $\theta_1 = \alpha$ when the quantile interval predictor is used, and $\theta_1 = 0$ otherwise, although in both cases the user can supply their own starting value. The behavior of the early prediction intervals in the examples (@fig-aci) is driven by the small number of residuals available, which makes the output of the quantile interval constructor sensitive to small changes in $\theta$. In practice, a warm-up period can be used before starting to produce prediction intervals so that the quantiles of the residuals are more stable.
```{r aci_example_plot, echo = FALSE}
#| label: fig-aci
#| fig-height: 4.5
#| fig-cap: "Example 80% prediction intervals from the ACI algorithm for different choices of learning rate $\\gamma$ and with $\\theta_1 = 0.8$. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
alpha <- 0.8
gamma_grid <- c(0.016, 0.032, 0.064, 0.128)
results <- lapply(gamma_grid, function(gamma) {
  aci(y, yhat, alpha = alpha, method = "ACI",
      parameters = list(interval_constructor = "conformal", theta0 = alpha, gamma = gamma))
})
coverage <- format_coverage(extract_metric(results, "coverage"))
path_length <- format_path_length(extract_metric(results, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(results[[i]], legend = FALSE, predictions = FALSE, cex = 0.5, main = bquote(gamma==.(results[[i]]$parameters$gamma)), ylim = c(-0.9, 0.8))
text(x = -10, y = -0.75, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -10, y = -0.9, labels = bquote(PathLength == .(path_length[[i]]) ), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
## Aggregated Adaptive Conformal Inference (AgACI)
```pseudocode
#| label: algo-agaci
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Aggregated Adaptive Conformal Inference}
\begin{algorithmic}
\State \textbf{Input:} candidate learning rates $(\gamma_k)_{1 \leq k \leq K }$, starting value $\theta_1$.
\State Initialize lower and upper BOA algorithms $\mathcal{B}^\ell := \texttt{BOA}(\alpha \leftarrow (1 - \alpha) / 2)$ and $\mathcal{B}^u := \texttt{BOA}(\alpha \leftarrow (1 - (1 - \alpha)/2))$.
\For{$k = 1, \dots, K$}
\State Initialize ACI $\mathcal{A}_k = \texttt{ACI}(\alpha \leftarrow \alpha, \gamma \leftarrow \gamma_k, \theta_1 \leftarrow \theta_1)$.
\EndFor
\For{$t = 1, 2, \dots, T$}
\For{$k = 1, \dots, K$}
\State Retrieve candidate prediction interval $[\ell^k_{t}, u^k_{t}]$ from $\mathcal{A}_k$.
\EndFor
\State Compute aggregated lower bound $\tilde{\ell}_t := \mathcal{B}^\ell((\ell^k_t : k \in \{ 1, \dots, K \}))$.
\State Compute aggregated upper bound $\tilde{u}_t := \mathcal{B}^u((u^k_t : k \in \{ 1, \dots, K \}))$.
\State \textbf{Output:} prediction interval $[\tilde{\ell}_t, \tilde{u}_t]$.
\State Observe $y_t$.
\For{$k = 1, \dots, K$}
\State Update $\mathcal{A}_k$ with observation $y_t$.
\EndFor
\State Update $\mathcal{B}^\ell$ with observed outcome $y_t$.
\State Update $\mathcal{B}^u$ with observed outcome $y_t$.
\EndFor
\end{algorithmic}
\end{algorithm}
```
The Aggregated ACI (AgACI; @algo-agaci) algorithm solves the problem of choosing a learning rate for ACI by running multiple copies of the algorithm with different learning rates, and then separately combining the lower and upper interval bounds using an online aggregation of experts algorithm [@zaffran2022agaci]. That is, one aggregation algorithm seeks to estimate the lower $(1-\alpha)/2$ quantile, and the other seeks to estimate the upper $1 - (1 - \alpha) / 2$ quantile. @zaffran2022agaci experimented with multiple online aggregation algorithms, and found that they yielded similar results. Thus, we follow their lead in using the Bernstein Online Aggregation (BOA) algorithm as implemented in the `opera` `R` package [@wintenberger2017boa; @opera2023]. BOA is an online algorithm that forms predictions for the lower (or upper) prediction interval bound as a weighted mean of the candidate ACI prediction interval lower (upper) bound, where the weights are determined by each candidate's past performance with respect to the quantile loss. As a consequence, the prediction intervals generated by AgACI are not necessarily symmetric around the point prediction, as the weights for the lower and upper bounds are separate.
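The aggregation idea can be sketched with a simplified exponentially weighted average under the pinball (quantile) loss. This is an illustration of the general principle only, not the full BOA algorithm from `opera`, and the helper names are hypothetical:

```r
# Pinball (quantile) loss at level tau: penalizes under- and over-prediction
# asymmetrically, so minimizing it targets the tau-quantile.
pinball <- function(pred, y, tau) {
  ifelse(y >= pred, tau * (y - pred), (1 - tau) * (pred - y))
}

# Simplified exponentially weighted aggregation of candidate lower bounds
# (a sketch of the idea only -- not the full BOA algorithm from `opera`).
# lower_history: T x K matrix of candidate lower bounds; y: observations.
aggregate_lower <- function(lower_history, y, tau, eta = 2) {
  cum_loss <- colSums(pinball(lower_history, y, tau))
  w <- exp(-eta * (cum_loss - min(cum_loss)))  # down-weight lossy candidates
  w / sum(w)  # weights for the next aggregated lower bound
}

# Two candidates over three steps: the first tracks y exactly.
lower_candidates <- cbind(c(1, 1, 1), c(0, 0, 0))
w <- aggregate_lower(lower_candidates, y = c(1, 1, 1), tau = 0.1)
```

The upper bounds are aggregated in the same way with `tau` set to the upper quantile level, which is why the final intervals need not be symmetric around the point prediction.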
### Theoretical Guarantees
AgACI departs from our main theoretical framework in that it does not yield a sequence $(\theta_t)_{t \in \llbracket T \rrbracket}$ whose elements yield prediction intervals via a set construction function $\widehat{C}_t$. Rather, the upper and lower interval bounds from a set of candidate ACI algorithms are aggregated separately. Thus, theoretical results such as regret bounds similar to those for the other algorithms are not available. It would be possible to establish regret bounds for the pinball loss applied separately to the lower and upper bounds of the prediction intervals; it is unclear, however, how to convert such regret bounds into a coverage bound.
### Tuning Parameters {#sec-agaci-tuning}
The main tuning parameter for AgACI is the set of candidate learning rates. Beyond requiring additional computation time, there is no drawback to having a large grid. As a default, `AdaptiveConformal` uses learning rates $\gamma \in \{ 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128 \}$. As a basic check, we can inspect the weights assigned to each of the learning rates: if large weights are given to the smallest (largest) learning rates in the grid, it is a sign that even smaller (or larger) learning rates may perform well. In addition, each of the candidate ACI algorithms requires a starting value, which can be set to any value as discussed in the ACI section. @fig-agaci illustrates AgACI applied to the running example with two learning rate grids. The first grid is $\gamma \in \{ 0.032, 0.064, 0.128, 0.256 \}$, and the second grid includes the additional values $\{ 0.002, 0.004, 0.008, 0.016 \}$. For the first grid, we can see that for the lower bound AgACI assigns high weight to the lowest learning rate ($\gamma = 0.032$). The second grid yields weights that are less concentrated on a single learning rate, and the output prediction intervals are smoother.
```{r agaci_example_plot, echo = FALSE}
#| label: fig-agaci
#| fig-height: 4.5
#| fig-cap: "Example 80% prediction intervals from the AgACI algorithm with starting values $\\theta_1 = 0.8$ and two different learning rate grids. In the left column, blue and red points are observations that fell inside and outside the prediction intervals, respectively."
grid1 <- 2^(seq(5, 8, 1)) / 1e3
grid2 <- 2^(seq(1, 8, 1)) / 1e3
results <- list(
aci(y, yhat, alpha = alpha, method = "AgACI", parameters = list(interval_constructor = "conformal", gamma_grid = grid1, theta0 = 0.8)),
aci(y, yhat, alpha = alpha, method = "AgACI", parameters = list(interval_constructor = "conformal", gamma_grid = grid2, theta0 = 0.8))
)
coverage <- format_coverage(extract_metric(results, "coverage"))
path_length <- format_path_length(extract_metric(results, "path_length"))
par(mfrow = c(2, 2), mar = c(4, 3, 2, 1))
for(i in 1:2) {
plot(results[[i]], legend = FALSE, predictions = FALSE, cex = 0.5, main = paste("Grid", i), ylim = c(-0.9, 0.8))
text(x = -10, y = -0.75, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -10, y = -0.9, labels = bquote(PathLength == .(path_length[[i]]) ), pos = 4)
plot_agaci_weights(results[[i]], main = "Final Aggregation Weights", legend = "topright", ylim = c(0, 1))
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
## Dynamically-tuned Adaptive Conformal Inference (DtACI)
```pseudocode
#| label: algo-faci
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Dynamically-tuned Adaptive Conformal Inference}
\begin{algorithmic}
\State \textbf{Input:} starting value $\theta_1$, candidate learning rates $(\gamma_k)_{1 \leq k \leq K }$, parameters $\sigma, \eta$.
\For{$k = 1, \dots, K$}
\State Initialize expert $\mathcal{A}_k = \texttt{ACI}(\alpha \leftarrow \alpha, \gamma \leftarrow \gamma_k, \theta_1 \leftarrow \theta_1)$.
\EndFor
\For{$t = 1, 2, \dots, T$}
\State Define $p_t^k := p_t^k / \sum_{i=1}^K p_t^i$, for all $1 \leq k \leq K$.
\State Set $\theta_t = \sum_{k=1}^K \theta_t^k p_t^k$.
\State \textbf{Output:} prediction interval $\widehat{C}_t(\theta_t)$.
\State Observe $y_t$ and compute $r_t$.
\State $\bar{w}_{t}^k \gets p_t^k \exp(-\eta L^\alpha(\theta_t^k, r_t))$, for all $1 \leq k \leq K$.
\State $\bar{W}_t \gets \sum_{i=1}^K \bar{w}_t^i$.
\State $p_{t+1}^k \gets (1 - \sigma) \bar{w}_t^k + \bar{W}_t \sigma / K$.
\State Set $\mathrm{err}_t := \mathbb{I}[y_t \not\in \widehat{C}_t(\theta_t)]$.
\For{$k = 1, \dots, K$}
\State Update ACI $\mathcal{A}_k$ with $y_t$ and obtain $\theta_{t+1}^k$.
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
```
The Dynamically-tuned Adaptive Conformal Inference (DtACI; @algo-faci) algorithm was developed by the authors of the original ACI algorithm in part to address the issue of how to choose the learning rate parameter $\gamma$. In this respect the goal of the algorithm is similar to that of AgACI, although it is achieved slightly differently. DtACI also aggregates predictions from multiple copies of ACI run with different learning rates, but differs in that it directly aggregates the estimated radii emitted from each algorithm based on their pinball loss [@gibbs2022faci] using an exponential reweighting scheme [@gradu2022adaptive]. As opposed to AgACI, this construction allows for more straightforward development of theoretical guarantees on the algorithm's performance, because the upper and lower bounds of the intervals are not aggregated separately.
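The exponential reweighting step can be sketched as follows (hypothetical helper `dtaci_reweight`, following the notation of the pseudocode above):

```r
# One weight-update step of DtACI's exponential reweighting (schematic sketch
# following the pseudocode listing; helper name is hypothetical).
# p: current normalized expert weights; losses: pinball loss of each expert.
dtaci_reweight <- function(p, losses, eta, sigma) {
  w_bar <- p * exp(-eta * losses)       # exponential reweighting by loss
  W_bar <- sum(w_bar)
  K <- length(p)
  p_next <- (1 - sigma) * w_bar + W_bar * sigma / K  # mix toward uniform
  p_next / sum(p_next)                  # normalize for the next round
}

# The expert with smaller loss gains weight; sigma keeps all weights positive.
p_next <- dtaci_reweight(p = c(0.5, 0.5), losses = c(0, 1),
                         eta = 1, sigma = 0.05)
```

The mixing parameter $\sigma$ prevents any expert's weight from collapsing to zero, so an expert that performs poorly during one regime can still recover quickly after a distribution shift.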
### Theoretical Guarantees
DtACI was originally proposed with the choice of the quantile interval constructor. DtACI has the following strongly-adaptive regret bound [@bhatnagar2023saocp]: for all $\eta > 0$ and subperiod lengths $m$,
$$
\begin{aligned}
\mathrm{SAReg}(T, m) \leq \widetilde{\mathcal{O}}(D^2 / \eta + \eta m).
\end{aligned}
$$
If $m$ is fixed a priori, then choosing $\eta = D/\sqrt{m}$ yields a strongly adaptive regret bound of order $\widetilde{\mathcal{O}}(D \sqrt{m})$ (for a single choice of $m$). Practically, this result implies that, if we know in advance the time length for which we would like to control the regret, it is possible to choose an optimal tuning parameter value; however, we cannot control the regret simultaneously for all possible time lengths.
To establish a bound on the coverage error, the authors investigated a slightly modified version of DtACI in which $\theta_t$ is chosen randomly from the candidates $\theta_t^k$ with weights given by $p_t^k$, instead of taking a weighted average. This is a common trick in the literature, as it facilitates theoretical analysis. In practice, the authors comment that the randomized and deterministic versions of DtACI lead to very similar results. The coverage error result also assumes that the hyperparameters can change over time: that is, we have $t$-specific $\eta_{t}$ and $\sigma_t$, rather than fixed $\eta$ and $\sigma$. The coverage error then has the following bound (@gibbs2022faci; Theorem 3.2), where $\gamma_{\min}$ and $\gamma_{\max}$ are the smallest and largest learning rates in the grid, respectively:
$$
|\mathrm{CovErr}(T)| \leq \frac{1 + 2\gamma_{\max}}{T \gamma_{\min}} + \frac{(1 + 2\gamma_{\max})^2}{\gamma_{\min}} \frac{1}{T}\sum_{t=1}^T \eta_t \exp(\eta_t(1 + 2\gamma_{\max})) + 2 \frac{1+\gamma_{\max}}{\gamma_{\min}} \frac{1}{T} \sum_{t=1}^T \sigma_t.
$$
Thus, if $\eta_t$ and $\sigma_t$ both converge to zero as $t \to \infty$, then the coverage error will also converge to zero. In addition, under mild distributional assumptions the authors provide a type of short-term coverage error bound for arbitrary time spans, for which we refer to [@gibbs2022faci].
We note one additional result established by @gibbs2022faci (their Theorem 3.1) on a slightly different dynamic regret bound in terms of the pinball loss, as it informs the choice of tuning parameters. Let $\gamma_{\mathrm{max}} = \max_{1 \leq k \leq K} \gamma_k$ be the largest learning rate in the grid and assume that $\gamma_1 < \gamma_2 < \cdots < \gamma_K$ with $\gamma_{k+1}/\gamma_k \leq 2$ for all $1 \leq k < K$. Then, for any interval $I = [r, s] \subseteq \llbracket T \rrbracket$ and any sequence $\theta_r^*, \dots, \theta_s^*$, under the assumption that $\gamma_k \geq \sqrt{1 + 1 / |I|}$,
$$
\begin{aligned}
\frac{1}{|I|} \sum_{t=r}^s \mathbb{E}[L^\alpha(\theta_t, r_t)] - \frac{1}{|I|} \sum_{t=r}^s L^\alpha(\theta_t, \theta_t^*) \leq& \frac{\log(k / \sigma) + 2\sigma|I|}{\eta |I|} + \frac{\eta}{|I|} \sum_{t=r}^s \mathbb{E}[L^\alpha(\theta_t, r_t)^2] \\
&+ 2\sqrt{3}(1 + \gamma_{\mathrm{max}})^2 \max\left\{ \sqrt{\frac{\sum_{t=r+1}^s |\theta_t^* - \theta_{t-1}^*| + 1}{|I|}}, \gamma_1 \right\},
\end{aligned}
$$
where the expectation is over the randomness in the randomized version of the algorithm. Here the time interval $I$ (with length $|I|$) is comparable to the time period length $m$ for the strongly adaptive regret, and can be chosen arbitrarily. This dynamic regret bound can be converted to a strongly adaptive regret bound by choosing $\theta^*_t$ to be constant.
### Tuning parameters
The recommended settings for the tuning parameters depend on choosing a time interval length $|I|$ over which we would like to control the pinball loss; $|I|$ can be chosen arbitrarily.
For the tuning parameter $\sigma$, the authors suggest the optimal choice (with respect to the dynamic regret) $\sigma = 1 / (2 |I|)$. Choosing $\eta$ is more difficult. The authors suggest the following choice for $\eta$, which they show is optimal if there is in fact no distribution shift:
$$
\begin{aligned}
\eta = \sqrt{\frac{3}{|I|}} \sqrt{\frac{\log(K \cdot |I|) + 2}{\alpha^2 (1 - \alpha)^3 + (1-\alpha)^2 \alpha^3 }}
\end{aligned}.
$$
Note that this choice is optimal only for the quantile interval constructor, for which $\theta_t$ is a quantile of previous nonconformity scores. As an alternative, the authors point out that $\eta$ can be learned in an online fashion using the update rule
$$
\begin{aligned}
\eta_t := \sqrt{\frac{\log(|I| K) + 2}{\sum_{s=t-|I|}^{t-1} L^\alpha(\theta_s, r_s)}}.
\end{aligned}
$$
Both ways of choosing $\eta$ led to very similar results in the original authors' empirical studies. In our proposed `AdaptiveConformal` package, the first approach is used when the quantile interval construction function is chosen, and the latter approach for the linear interval construction function.
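As a sketch, both choices of $\eta$ can be computed directly from the formulas above (hypothetical helpers; `K` is the number of candidate learning rates and `I` the chosen interval length):

```r
# Closed-form choice of eta from the formula above (sketch; hypothetical
# helper). K: number of candidate learning rates, I: chosen interval length.
eta_fixed <- function(K, I, alpha) {
  sqrt(3 / I) *
    sqrt((log(K * I) + 2) /
         (alpha^2 * (1 - alpha)^3 + (1 - alpha)^2 * alpha^3))
}

# Online alternative: learn eta_t from the pinball losses of the last I steps.
# recent_losses: vector of L^alpha(theta_s, r_s) for s = t - I, ..., t - 1.
eta_online <- function(recent_losses, K) {
  I <- length(recent_losses)
  sqrt((log(I * K) + 2) / sum(recent_losses))
}

# With K = 8 candidates, |I| = 100, and alpha = 0.8 the closed-form rule
# gives eta of about 3.19.
eta_hat <- eta_fixed(K = 8, I = 100, alpha = 0.8)
```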
@fig-dtaci-example illustrates DtACI with the quantile interval construction function and with the learning rate grid $\gamma \in \{ 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128 \}$. The tuning parameter $\eta$ was set to $0.001$, $1$, and $100$ to show how the algorithm responds to extreme choices of the parameter, and to $\eta \approx 3.19$ according to the optimal choice recommendation with $|I| = 100$ as described in the previous section. The results show that, in this simple example, high values of $\eta$ may lead to intervals that are too reactive to the data, as seen in the longer path length. The algorithm appears more robust to small choices of $\eta$.
```{r faci_example_plot, echo = FALSE}
#| label: fig-dtaci-example
#| fig-height: 4.5
#| fig-cap: "Example 80% prediction intervals generated by the DtACI algorithm with starting values $\\theta_1 = 0.8$ and with several values of the tuning parameter $\\eta$. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
gamma_grid <- c(0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128)
results <- list(
aci(y, yhat, alpha = alpha, method = "DtACI", parameters = list(gamma_grid = gamma_grid, interval_constructor = "conformal", eta = 0.001)),
aci(y, yhat, alpha = alpha, method = "DtACI", parameters = list(gamma_grid = gamma_grid, interval_constructor = "conformal", eta = 1)),
aci(y, yhat, alpha = alpha, method = "DtACI", parameters = list(gamma_grid = gamma_grid, interval_constructor = "conformal", eta = 100)),
aci(y, yhat, alpha = alpha, method = "DtACI", parameters = list(gamma_grid = gamma_grid, interval_constructor = "conformal"))
)
coverage <- format_coverage(extract_metric(results, "coverage"))
path_length <- format_path_length(extract_metric(results, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(results[[i]], legend = FALSE, predictions = FALSE, cex = 0.5, main = bquote(eta==.(results[[i]]$parameters$eta)), ylim = c(-0.9, 0.8))
text(x = -10, y = -0.75, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -10, y = -0.9, labels = bquote(PathLength == .(path_length[[i]])), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
## Scale-Free Online Gradient Descent (SF-OGD)
```pseudocode
#| label: algo-sfogd
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Scale-Free Online Gradient Descent}
\begin{algorithmic}
\State \textbf{Input:} starting value $\theta_1$, learning rate $\gamma > 0$.
\For{$t = 1, 2, \dots, T$}
\State \textbf{Output:} prediction interval $\widehat{C}_t(\theta_t)$.
\State Observe $y_t$ and compute $r_t$.
\State Update $\theta_{t+1} = \theta_t - \gamma \frac{\nabla L^\alpha(\theta_t, r_t)}{\sqrt{\sum_{i=1}^t} \| \nabla L^\alpha(\theta_i, r_i) \|_2^2}$.
\EndFor
\end{algorithmic}
\end{algorithm}
```
Scale-Free Online Gradient Descent (SF-OGD; @algo-sfogd) is a general algorithm for online learning proposed by @orabona2018sfogd. The algorithm updates $\theta_t$ with a gradient descent step where the learning rate adapts to the scale of the previously observed gradients. SF-OGD was first used in the context of ACI as a sub-algorithm for SAOCP (described in the next section). However, it was found to have good performance by itself [@bhatnagar2023saocp] in real-world tasks, so we have made it available in the package as a stand-alone algorithm.
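For a scalar parameter $\theta$, one SF-OGD step can be sketched as follows (hypothetical helper; assumes the linear interval constructor, where $\theta_t$ is the interval radius and $r_t$ the nonconformity score):

```r
# Schematic SF-OGD update for a scalar theta (hypothetical helper; assumes
# the linear interval constructor, where theta is the interval radius and
# r the observed nonconformity score).
sfogd_update <- function(theta, grads_sq_sum, r, alpha, gamma) {
  grad <- (r <= theta) - alpha           # pinball-loss gradient in theta
  grads_sq_sum <- grads_sq_sum + grad^2  # running sum of squared gradients
  theta <- theta - gamma * grad / sqrt(grads_sq_sum)
  list(theta = theta, grads_sq_sum = grads_sq_sum)
}

# A score above the current radius (a miss) pushes theta upward.
state <- sfogd_update(theta = 0.5, grads_sq_sum = 0,
                      r = 1, alpha = 0.8, gamma = 0.1)
```

Because the step is normalized by the cumulative gradient norm, early updates are large and later updates shrink automatically, which is what makes the algorithm "scale-free".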
### Theoretical Guarantees
The SF-OGD algorithm with linear interval constructor has the following regret bound, which is called an _anytime regret bound_ because it holds for all $t \in \llbracket T \rrbracket$ [@bhatnagar2023saocp]. For any $\gamma > 0$,
$$
\begin{aligned}
\mathrm{Reg}(t) \leq \mathcal{O}(D \sqrt{t}) \text{ for all } t \in \llbracket T \rrbracket.
\end{aligned}
$$
A bound for the coverage error has also been established (@bhatnagar2023saocp; Theorem 4.2). For any learning rate $\gamma = \Theta(D)$ (with $\gamma = D / \sqrt{3}$ optimal) and any starting value $\theta_1 \in [0, D]$, it holds for any $T > 1$ that
$$
\begin{aligned}
|\mathrm{CovErr}(T)| \leq \mathcal{O}\left( (1 - \alpha)^{-2} T^{-1/4} \log T \right).
\end{aligned}
$$
### Tuning parameters
The optimal choice of learning rate is $\gamma = D / \sqrt{3}$, where $D$ is the maximum possible radius. When $D$ is not known, it can be estimated by using an initial subset of the time series as a calibration set and taking $D$ to be the maximum absolute residual between the observations and the predictions [@bhatnagar2023saocp]. @fig-sf-ogd compares results for several choices of $\gamma$ to illustrate its effect. In this example, the prediction intervals are not reactive enough and do not achieve optimal coverage when $\gamma$ is small. As $\gamma$ increases, the coverage error becomes near optimal, although the path length grows.
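The calibration-based estimate of $D$ can be sketched as follows (hypothetical helper and toy data for illustration):

```r
# Estimating the maximum radius D from a calibration window (sketch of the
# approach described above; helper name and data are hypothetical).
estimate_D <- function(y, yhat, calib_idx) {
  max(abs(y[calib_idx] - yhat[calib_idx]))
}

y    <- c(1.0, 2.0, 3.0, 4.0)   # toy observations
yhat <- c(1.1, 1.8, 3.4, 4.0)   # toy point predictions
D <- estimate_D(y, yhat, calib_idx = 1:3)
gamma <- D / sqrt(3)            # plug into the optimal learning-rate rule
```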
```{r sfogd_example_plot, echo = FALSE}
#| label: fig-sf-ogd
#| fig-height: 4.5
#| fig-cap: "Example 80% prediction intervals generated by the SF-OGD algorithm with different values of the learning rate $\\gamma$. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
alpha <- 0.8
results <- list(
aci(y, yhat, alpha = alpha, method = "SF-OGD", parameters = list(gamma = 0.01)),
aci(y, yhat, alpha = alpha, method = "SF-OGD", parameters = list(gamma = 0.1)),
aci(y, yhat, alpha = alpha, method = "SF-OGD", parameters = list(gamma = 0.25)),
aci(y, yhat, alpha = alpha, method = "SF-OGD", parameters = list(gamma = 0.5))
)
coverage <- format_coverage(extract_metric(results, "coverage"))
path_length <- format_path_length(extract_metric(results, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(results[[i]], legend = FALSE, predictions = FALSE, cex = 0.5, main = bquote(gamma==.(results[[i]]$parameters$gamma)), ylim = c(-0.9, 0.8))
text(x = -10, y = -0.75, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -10, y = -0.9, labels = bquote(PathLength == .(path_length[[i]])), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
## Strongly Adaptive Online Conformal Prediction (SAOCP)
```pseudocode
#| label: algo-saocp
#| html-indent-size: "1.2em"
#| html-comment-delimiter: "//"
#| html-line-number: true
#| html-line-number-punc: ":"
#| html-no-end: false
#| pdf-placement: "htb!"
#| pdf-line-number: true
\begin{algorithm}
\caption{Strongly Adaptive Online Conformal Prediction}
\begin{algorithmic}
\State \textbf{Input:} initial value $\theta_0$, learning rate $\gamma > 0$.
\For{$t = 1, 2, \dots, T$}
\State Initialize expert $\mathcal{A}_t = \texttt{SF-OGD}(\alpha \leftarrow \alpha, \gamma \leftarrow \gamma, \theta_1 \leftarrow \theta_{t-1})$, set weight $p_t^t = 0$.
\State Compute active set $\mathrm{Active}(t) = \{ i \in \llbracket T \rrbracket : t - L(i) < i \leq t \}$ (see below for definition of $L(t)$).
\State Compute prior probability $\pi_i \propto i^{-2} (1 + \lfloor \log_2 i \rfloor )^{-1} \mathbb{I}[i \in \mathrm{Active}(t)]$.
\State Compute un-normalized probability $\hat{p}_i = \pi_i [p_{t,i}]_+$ for all $i \in \llbracket t \rrbracket$.
\State Normalize $p = \hat{p} / \| \hat{p} \|_1 \in \Delta^t$ if $\| \hat{p} \|_1 > 0$, else $p = \pi$.
\State Set $\theta_t = \sum_{i \in \mathrm{Active}(t)} p_i \theta_t^i$ (for $t \geq 2$), and $\theta_t = 0$ for $t = 1$.
\State \textbf{Output:} prediction set $\widehat{C}_t(\theta_t)$.
\State Observe $y_t$ and compute $r_t$.
\For{$i \in \mathrm{Active}(t)$}
\State Update expert $\mathcal{A}_i$ with $y_t$ and obtain $\theta_{t+1}^i$.
\State Compute $g_t^i = \begin{cases}
\frac{1}{D}\left(L^\alpha(\theta_t, r_t) - L^\alpha(\theta_t^i, r_t)\right) & p_t^i > 0 \\
\frac{1}{D}\left[L^\alpha(\theta_t, r_t) - L^\alpha(\theta_t^i, r_t))\right]_+ & p_t^i \leq 0 \\
\end{cases}$.
\State Update expert weight $p_{t+1}^i = \frac{1}{t - i + 1}\left( \sum_{j=i}^t g_j^i \right) \left(1 + \sum_{j=i}^t p_j^i g_j^i \right)$.
\EndFor
\EndFor
\end{algorithmic}
\end{algorithm}
```
The Strongly Adaptive Online Conformal Prediction (SAOCP; @algo-saocp) algorithm was proposed as an improvement over the extant ACI algorithms in that it features stronger theoretical guarantees. SAOCP works similarly to AgACI and DtACI in that it maintains a library of candidate online learning algorithms that generate prediction intervals, which are then aggregated using a meta-algorithm [@bhatnagar2023saocp]. The candidate algorithm was chosen to be SF-OGD, although any algorithm with anytime regret guarantees could be used. As opposed to AgACI and DtACI, in which each candidate has a different learning rate but is always able to contribute to the final prediction intervals, here each candidate has the same learning rate but only has positive weight over a specific interval of time. New candidate algorithms are continually spawned so that, if the distribution shifts rapidly, the newer candidates can react quickly and receive positive weight. Specifically, at each time point, a new expert is instantiated which is active over a finite "lifetime". Define the _lifetime_ of an expert instantiated at time $t$ as
$$
\begin{aligned}
L(t) := g \cdot \max_{n \in \mathbb{Z}_{\geq 0}} \left\{ 2^n : t \equiv 0 \mod 2^n \right\},
\end{aligned}
$$
where $g \in \mathbb{Z}^*$ is a _lifetime multiplier_ parameter. The active experts are weighted according to their empirical performance with respect to the pinball loss function. The authors show that this construction results in intervals that have strong regret guarantees. The form of the lifetime interval function $L(t)$ is due to the use of geometric covering intervals to partition the input time series, and other choices may be possible [@jun2017coinbetting].
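The lifetime function can be computed directly; the sketch below (hypothetical helper `lifetime`) returns $g$ times the largest power of two dividing $t$:

```r
# Expert lifetime L(t) = g * (largest power of two dividing t), as defined
# above (sketch; hypothetical helper, g is the lifetime multiplier).
lifetime <- function(t, g = 8) {
  n <- 0
  while (t %% 2^(n + 1) == 0) n <- n + 1  # count factors of two in t
  g * 2^n
}

# Experts started at times divisible by higher powers of two live longer:
sapply(1:8, lifetime)  # 8 16 8 32 8 16 8 64
```

This staggering of lifetimes is exactly the geometric covering construction mentioned above: at any time, active experts exist at a range of ages, so both recent and long-running experts are available for weighting.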
### Theoretical Guarantees
The theoretical results were established for SAOCP using the linear interval constructor. The following bound for the strongly adaptive regret holds for all subperiod lengths $m \in \llbracket T \rrbracket$ (@bhatnagar2023saocp; Proposition 4.1):
$$
\begin{aligned}
\mathrm{SAReg}(T, m) \leq 15 D \sqrt{m(\log T + 1)} \leq \tilde{\mathcal{O}}(D \sqrt m).
\end{aligned}
$$
It should be emphasized that this regret bound holds simultaneously across all $m$, as opposed to DtACI, where a similar bound holds only for a single $m$. A bound on the coverage error of SAOCP has also been established as:
$$
\begin{aligned}
|\mathrm{CovErr}(T)| \leq \mathcal{O}\left(\inf_\beta(T^{1/2 - \beta} + T^{\beta - 1} S_\beta(T))\right).
\end{aligned}
$$
where $S_{\beta}(T)$ is a technical measure of the smoothness of the cumulative gradients and expert weights for each of the candidate experts (@bhatnagar2023saocp; Theorem 4.3). For some intuition, $S_{\beta}$ can be expected to be small when the weights placed on each expert change quickly, as would be the case under abrupt distributional shifts.
### Tuning Parameters
The primary tuning parameter for SAOCP is the learning rate $\gamma$ of the SF-OGD sub-algorithms, which, as we saw in the previous section, has optimal choice $\gamma = D / \sqrt{3}$. Values of $D$ that are too small lead to intervals that adapt slowly, and values that are too large lead to jagged intervals. In their experiments, the authors select a value for $D$ by taking the maximum residual from a calibration set. The second tuning parameter is the lifetime multiplier $g$, which controls the lifetime of each of the experts. We follow the original paper in setting $g = 8$. @fig-saocp illustrates the SAOCP algorithm for choices of $D \in \{0.01, 0.1, 0.25, 0.5 \}$. Similarly to SF-OGD, the prediction intervals tend to undercover for small $D$, and achieve near-optimal coverage for larger $D$ at the expense of longer path lengths.
```{r saocp_example_plot, echo = FALSE, cache = TRUE}
#| label: fig-saocp
#| fig-height: 4.5
#| fig-cap: "Example 80% prediction intervals generated by the SAOCP algorithm with different values of the maximum radius parameter $D$. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
alpha <- 0.8
results <- list(
aci(y, yhat, alpha = alpha, method = "SAOCP", parameters = list(D = 0.01)),
aci(y, yhat, alpha = alpha, method = "SAOCP", parameters = list(D = 0.1)),
aci(y, yhat, alpha = alpha, method = "SAOCP", parameters = list(D = 0.25)),
aci(y, yhat, alpha = alpha, method = "SAOCP", parameters = list(D = 0.5))
)
coverage <- format_coverage(extract_metric(results, "coverage"))
path_length <- format_path_length(extract_metric(results, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(results[[i]], legend = FALSE, predictions = FALSE, cex = 0.5, main = bquote(D==.(results[[i]]$parameters$D)), ylim = c(-0.9, 0.8))
text(x = -10, y = -0.75, labels = bquote(EmpCov == .(coverage[[i]])), pos = 4)
text(x = -10, y = -0.9, labels = bquote(PathLength == .(path_length[[i]])), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
# `AdaptiveConformal` `R` package
The ACI algorithms described in the previous section have been implemented in the open-source and publicly available `R` package `AdaptiveConformal`, available at [https://github.com/herbps10/AdaptiveConformal](https://github.com/herbps10/AdaptiveConformal). In this section, we briefly introduce the main functionality of the package. Comprehensive documentation, including several example vignettes, is included with the package.
The `AdaptiveConformal` package can be installed using the `remotes` package:
```{r, eval = FALSE}
remotes::install_github("herbps10/AdaptiveConformal")
```
The ACI algorithms are accessed through the `aci` function, which takes as input a vector of observations ($y_t$) and a vector or matrix of predictions ($\hat{y}_t$). Using the data generating process from the running example to illustrate, we can fit the original ACI algorithm with learning rate $\gamma = 0.1$:
```{r}
set.seed(532)
data <- running_example_data(N = 5e2)
fit <- aci(data$y, data$yhat, alpha = 0.8, method = "ACI", parameters = list(gamma = 0.1))
```
The available parameters for each method can be found in the documentation for the `aci` method, accessible with the command `?aci`. The resulting conformal prediction intervals can then be plotted using the `plot` function:
```{r, fig.height = 4}
plot(fit)
```
The properties of the prediction intervals can also be examined using the `summary` function:
```{r}
summary(fit)
```
# Simulation Studies {#sec-simulations}
We present two empirical studies to compare the performance of the AgACI, DtACI, SF-OGD, and SAOCP algorithms applied to simple simulated datasets. The original ACI algorithm was not included as it is not clear how to set its learning rate $\gamma$, which can have a large effect on the resulting intervals. For both simulations we set the targeted empirical coverage to $\alpha = 0.8$, $\alpha = 0.9$, and $\alpha = 0.95$. For each algorithm, we chose the interval constructor that was used in its original presentation (see @tbl-aci).
## Time series with ARMA errors
In this simulation we reproduce the setup described in @zaffran2022agaci (itself based on that of @friedman1983). The time series values $y_t$ for $t \in \llbracket T \rrbracket$ ($T = 600$) are simulated according to
$$
\begin{aligned}
y_t = 10\sin(\pi X_{t,1}X_{t,2}) + 20(X_{t,3} - 0.5)^2 + 10X_{t,4} + 5 X_{t,5} + 0X_{t,6} + \epsilon_t,
\end{aligned}
$$
where $X_{t,i}$, $i = 1, \dots, 6$, $t \in \llbracket T \rrbracket$ are independently uniformly distributed on $[0, 1]$ and the noise terms $\epsilon_t$ are generated according to an ARMA(1, 1) process:
$$
\begin{aligned}
\epsilon_t &= \psi \epsilon_{t-1} + \xi_t + \theta \xi_{t-1}, \\
\xi_t &\sim N(0, \sigma^2).
\end{aligned}
$$
We set $\psi$ and $\theta$ jointly to each value in $\{ 0.1, 0.8, 0.9, 0.95, 0.99 \}$ to simulate time series with increasing temporal dependence. The innovation variance was set to $\sigma^2 = (1 - \psi^2) / (1 + 2\psi\theta + \theta^2)$ (to ensure that the process has the same stationary variance across settings). For each setting, 25 simulated datasets were generated.
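As a quick sanity check on this normalization (a sketch using the formula above, without any additional scale factor), the stationary variance of an ARMA(1,1) process with AR coefficient $\psi$, MA coefficient $\theta$, and innovation variance $\sigma^2$ is $\sigma^2(1 + 2\psi\theta + \theta^2)/(1 - \psi^2)$, so the chosen $\sigma^2$ yields unit variance for every setting:

```r
# Stationary variance of an ARMA(1,1) process (sketch; hypothetical helper).
arma11_var <- function(psi, theta, sigma2) {
  sigma2 * (1 + 2 * psi * theta + theta^2) / (1 - psi^2)
}

params <- c(0.1, 0.8, 0.9, 0.95, 0.99)           # psi = theta in each setting
sigma2 <- (1 - params^2) / (1 + 2 * params * params + params^2)
arma11_var(params, params, sigma2)  # all equal to 1
```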
To provide point predictions for the ACI algorithms, at each time $t \geq 200$ a random forest model was fitted to the previously observed data using the `ranger` `R` package [@wright2017ranger]. The estimated model was then used to predict the subsequent time point. The maximum radius $D$ was estimated as the maximum residual observed between time points $t=200$ and $t=249$. The ACI models were then executed starting at time point $t = 250$. All metrics are based on time points $t \geq 300$ to allow time for the ACI methods to initialize.
```{r simulation_study_one, cache = TRUE}
simulate <- function(seed, psi, xi, N = 1e3) {
set.seed(seed)
s <- 10
innov_scale <- sqrt(s * (1 - psi^2) / (1 + 2 * psi * xi + xi^2))
X <- matrix(runif(6 * N), ncol = 6, nrow = N)
colnames(X) <- paste0("X", 1:6)
epsilon <- arima.sim(n = N, model = list(ar = psi, ma = xi), sd = innov_scale)
mu <- 10 * sin(pi * X[,1] * X[,2]) + 20 * (X[,3] - 0.5)^2 + 10 * X[,4] + 5 * X[,5]
y <- mu + epsilon
as_tibble(X) %>% mutate(y = y)
}
estimate_model <- function(data, p = NULL) {
if(!is.null(p)) p()
preds <- numeric(nrow(data))
for(t in 200:nrow(data)) {
model <- ranger::ranger(y ~ X1 + X2 + X3 + X4 + X5 + X6, data = data[1:(t - 1),])
preds[t] <- predict(model, data = data[t, ])$predictions
}
preds
}
metrics <- function(fit) {
indices <- 300:length(fit$Y)
aci_metrics(fit, indices)
}
fit <- function(data, preds, method, alpha, p = NULL) {
if(!is.null(p)) p()
D <- max(abs(data$y - preds)[200:249])
gamma <- D / sqrt(3)
interval_constructor <- case_when(
method == "AgACI" ~ "conformal",
method == "DtACI" ~ "conformal",
method == "SF-OGD" ~ "linear",
method == "SAOCP" ~ "linear"
)
if(interval_constructor == "linear") {
gamma_grid <- seq(0.1, 1, 0.1)
} else {
gamma_grid <- c(0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128)
}
parameters <- list(
interval_constructor = interval_constructor,
D = D,
gamma = gamma,
gamma_grid = gamma_grid
)
aci(
data$y[250:nrow(data)],
preds[250:nrow(data)],
method = method,
alpha = alpha,
parameters = parameters
)
}
N_sims <- 100
simulation_data <- expand_grid(
index = 1:N_sims,
param = c(0.1, 0.8, 0.9, 0.95, 0.99),
N = 600
) %>%
mutate(psi = param, xi = param)
# For each simulated dataset, fit multiple ACI methods
simulation_study_setup <- expand_grid(
alpha = c(0.8, 0.9, 0.95),
method = c("AgACI", "SF-OGD", "SAOCP", "DtACI")
)
# run_simulation_study1 function is defined in helpers.R
simulation_study1 <- run_simulation_study1(
simulation_data,
simulation_study_setup,
estimate_model,
fit,
workers = 8
)
```
The coverage errors, mean interval widths, path lengths, and strongly adaptive regret (for $m = 20$) of each of the algorithms for $\alpha = 0.9$ are shown in @fig-simulation-one-results (results for $\alpha \in \{ 0.8, 0.95 \}$ were similar and are available in the appendix). All methods achieved close to nominal empirical coverage, although SAOCP tended to slightly undercover. The mean interval widths are similar across methods, although again SAOCP had slightly shorter intervals (as could be expected given its tendency to undercover). The strongly adaptive regret was similar for all methods. The path length of SAOCP was larger than that of any of the other methods. To investigate why, @fig-simulation-one-widths plots $w_t - w_{t-1}$, the difference in interval width between times $t-1$ and $t$, for each method in one of the simulations. The interval widths for AgACI and DtACI change slowly relative to those for SF-OGD and SAOCP, and the widths for SAOCP fluctuate more than those of the other methods, explaining its larger path length. The prediction intervals themselves for the same simulation are shown in @fig-simulation-one-example: although the path lengths are quite different, the resulting prediction intervals are quite similar.
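For reference, the path length metric discussed above can be sketched as the cumulative absolute change in interval width over time (our assumed definition; the exact convention is implemented in `aci_metrics`):

```r
# Sketch of the path length metric (assumed definition): the cumulative
# absolute change in the interval width w_t over time.
path_length <- function(lower, upper) {
  widths <- upper - lower
  sum(abs(diff(widths)))
}
# Oscillating widths accrue a larger path length than smoothly drifting
# widths, even when the average width is the same (2.2 in both cases):
smooth <- path_length(rep(0, 5), c(2.0, 2.1, 2.2, 2.3, 2.4))
jagged <- path_length(rep(0, 5), c(2.0, 2.4, 2.2, 2.4, 2.0))
smooth  # 0.4
jagged  # 1.2
```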
```{r simulation_one_plot, message=FALSE, warning=FALSE}
#| label: fig-simulation-one-results
#| fig-height: 9
#| fig-cap: "Coverage errors, mean interval widths, path lengths, and strongly adaptive regret (for $m = 20$) for the first simulation study with target coverage $\\alpha = 0.9$."
simulation_one_plot(simulation_study1$results %>% filter(alpha == 0.9))
```
```{r simulation_one_width_plot}
#| label: fig-simulation-one-widths
#| fig-height: 5
#| fig-cap: "Difference in successive interval widths ($w_t - w_{t-1}$) from an illustrative simulation from the first simulation study."
fits <- simulation_study1$example_fits
par(mfrow = c(2, 2), mar = c(3, 4, 2, 1))
for(i in 1:4) {
plot(
diff(fits$fit[[i]]$intervals[,2] - fits$fit[[i]]$intervals[,1]),
main = fits$method[[i]],
xlab = "T",
ylab = expression(w[t] - w[t - 1]))
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
```{r simulation_one_example_plots}
#| label: fig-simulation-one-example
#| fig-height: 6
#| fig-cap: "Example prediction intervals (target coverage $\\alpha = 0.9$) from the first simulation study for time points 300 to 350; metrics shown are for all time points $t \\geq 300$. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
fits <- simulation_study1$example_fits
coverage <- format_coverage(map_dbl(map(fits$fit, metrics), `[[`, "coverage"))
path_length <- format_path_length(map_dbl(map(fits$fit, metrics), `[[`, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(fits$fit[[i]], legend = FALSE, main = fits$method[[i]], predictions = FALSE, ylim = c(-20, 35), index = 50:100)
text(x = -0, y = -7.5, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -0, y = -17.5, labels = bquote(PathLength == .(path_length[[i]]) ), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
## Distribution shift
This simulation study features time series with distribution shifts. The setup is deliberately simple so as to probe the basic behavior of the methods in response to a distribution shift. As a baseline, we simulate time series of independent data with
$$
\begin{aligned}
y_t &\sim N(0, \sigma_t^2), \\
\sigma_t &= 0.2,
\end{aligned}
$$
for all $t \in \llbracket T \rrbracket$ ($T = 500$). In the second type of time series, the observations are still independent but their variance increases halfway through the time series:
$$
\begin{aligned}
y_t &\sim N(0, \sigma_t^2), \\
\sigma_t &= 0.2 + 0.5 \mathbb{I}[t > 250].
\end{aligned}
$$
In each case, the ACI algorithms are provided with the unbiased predictions $\hat{\mu}_t = 0$, $t \in \llbracket T \rrbracket$. Fifty simulated datasets were generated for each type of time series.
```{r simulation_two, cache = TRUE}
simulate <- function(seed, distribution_shift = 0, N = 1e3, sigma = 0.2) {
set.seed(seed)
mu <- rep(0, N)
shift <- 1:N > (N / 2)
yhat <- mu
y <- rnorm(n = length(mu), mean = mu, sd = sigma + ifelse(shift, distribution_shift, 0))
tibble(y = y, yhat = yhat)
}
metrics <- function(fit) {
N <- length(fit$Y)
indices <- which(1:N > 50)
aci_metrics(fit, indices)
}
fit <- function(data, method, alpha, p = NULL) {
if(!is.null(p)) p()
interval_constructor <- case_when(
method == "AgACI" ~ "conformal",
method == "DtACI" ~ "conformal",
method == "SF-OGD" ~ "linear",
method == "SAOCP" ~ "linear"
)
if(interval_constructor == "linear") {
D <- max(abs(data$y - data$yhat)[1:50])
} else {
D <- 1
}
gamma <- D / sqrt(3)
if(interval_constructor == "linear") {
gamma_grid <- seq(0.1, 2, 0.1)
} else {
gamma_grid <- c(0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128)
}
parameters <- list(
interval_constructor = interval_constructor,
D = D,
gamma = gamma,
gamma_grid = gamma_grid
)
aci(data$y, data$yhat, method = method, alpha = alpha, parameters = parameters)
}
N_sims <- 100
simulation_study_setup2 <- expand_grid(
index = 1:N_sims,
distribution_shift = c(0, 0.5),
alpha = c(0.8, 0.9, 0.95),
N = 500,
method = c("AgACI", "SF-OGD", "SAOCP", "DtACI")
) %>%
mutate(data = pmap(list(index, distribution_shift, N), simulate))
# run_simulation_study2 function is defined in helpers.R
simulation_study2 <- run_simulation_study2(simulation_study_setup2, fit, workers = 8)
```
The coverage error, mean interval width, mean path length, and strongly adaptive regret (for $m = 20$) of the algorithms are summarized in @fig-simulation-two-results (an alternative plot is included in the appendix as @fig-simulation-two-joint). In the absence of distribution shift, the coverage error of all the algorithms is near zero. In contrast, when there is a distribution shift, all of the algorithms except AgACI and DtACI undercover. SAOCP tends to have higher average path lengths than the other methods. In the distribution-shift setting, SF-OGD and SAOCP tended to have smaller strongly adaptive regret than the other methods. An illustrative example of prediction intervals generated by each method for one of the simulated time series with distribution shift is shown in @fig-simulation-two-example. The SAOCP prediction intervals in the example are more jagged than those of the other methods before the distribution shift, which illustrates why SAOCP tends to have longer path lengths.
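For concreteness, the coverage error reported in these summaries can be sketched as the empirical coverage minus the nominal level (our assumed sign convention; the exact definition is implemented in `aci_metrics`), so that negative values indicate undercoverage:

```r
# Sketch: empirical coverage minus the nominal level alpha (assumed
# convention); negative values indicate undercoverage.
coverage_error <- function(y, lower, upper, alpha) {
  mean(y >= lower & y <= upper) - alpha
}
coverage_error(
  y = c(0.1, -0.2, 1.5, 0.0),
  lower = rep(-1, 4), upper = rep(1, 4),
  alpha = 0.9
)  # 0.75 - 0.9 = -0.15
```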
```{r simulation_two_plot}
#| label: fig-simulation-two-results
#| fig-height: 6.5
#| fig-cap: "Coverage error, mean interval width, path length, and strongly adaptive regret ($m = 20$) for $\\alpha = 0.8, 0.9, 0.95$ and simulations with and without distributional shift."
simulation_two_plot(simulation_study2$results)
```
```{r simulation_two_example_plot}
#| label: fig-simulation-two-example
#| fig-height: 6
#| fig-cap: "Example prediction intervals (target coverage $\\alpha = 0.9$) from the second simulation study of time series with distributional shift, in which the shift occurs at time 250. Blue and red points are observations that fell inside and outside the prediction intervals, respectively."
fits <- simulation_study2$example_fits
coverage <- format_coverage(extract_metric(fits$fit, "coverage"))
path_length <- format_path_length(extract_metric(fits$fit, "path_length"))
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
for(i in 1:4) {
plot(fits$fit[[i]], legend = FALSE, main = fits$method[[i]], index = 51:500)
text(x = -10, y = -1.5, labels = bquote(EmpCov == .(coverage[[i]]) ), pos = 4)
text(x = -10, y = -2, labels = bquote(PathLength == .(path_length[[i]]) ), pos = 4)
}
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
```
# Case Study: Influenza Forecasting {#sec-case-study}
Influenza is a highly infectious disease estimated to infect approximately one billion individuals around the world each year [@krammer2018influenza]. Influenza incidence in temperate climates tends to follow a seasonal pattern, with the highest number of infections during what is commonly referred to as the *flu season* [@lofgren2007influenza]. Accurate forecasting of influenza is of significant interest to aid in public health planning and resource allocation. To investigate the accuracy of influenza forecasts, the US Centers for Disease Control and Prevention (CDC) initiated a challenge, referred to as FluSight, in which teams from multiple institutions submitted weekly forecasts of influenza incidence [@biggerstaff2016flusight]. @reich2019influenza evaluated the accuracy of the forecasts over seven flu seasons from 2010 to 2017. As a case study, we investigate the use of ACI algorithms to augment the FluSight forecasts with prediction intervals.
The FluSight challenge collected forecasts for multiple prediction targets. For this case study, we focus on national (US) one-week-ahead forecasts of weighted influenza-like illness (wILI), a population-weighted percentage of doctor visits at which patients presented with influenza-like symptoms [@biggerstaff2016flusight]. The FluSight dataset, which is publicly available, includes forecasts derived from 21 different forecasting models, spanning both mechanistic and statistical approaches [@flusight2020;@tushar2018flusightnetwork;@tushar2019flusight]. For our purposes, we treat the way the forecasts were produced as a black box.
Formally, let $y_{t}$, $t \in \llbracket T \rrbracket$, be the observed national wILI at time $t$, and let $\hat{\mu}_{j,t}$, $j \in \llbracket J \rrbracket$, be the one-week-ahead forecast of the wILI from model $j$ at time $t$. Two of the original 21 forecasting methods were excluded from this case study due to poor predictive performance (`Delphi_Uniform` and `CUBMA`). In addition, six methods produced identical forecasts, so we retained only one of them (`CU_EAKFC_SEIRS`) in the analysis and excluded the other five (`CU_EAKFC_SIRS`, `CU_EKF_SEIRS`, `CU_EKF_SIRS`, `CU_RHF_SEIRS`, `CU_RHF_SIRS`). The ACI methods were applied to the log-observations and log-predictions, where the log-transformation constrains the final prediction intervals to be positive. The first flu season (2010-2011) was used as a warm-up for each ACI method, and we report the empirical performance of the prediction intervals for the six subsequent seasons (2011-2012 through 2016-2017). The ACI algorithms target prediction intervals with coverage $\alpha = 0.8$, $\alpha = 0.9$, and $\alpha = 0.95$. As in the simulation studies, we used the interval constructor corresponding to the original presentation of each algorithm (see @tbl-aci).
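Because the intervals are constructed on the log scale, back-transforming the bounds with `exp` yields intervals that are strictly positive on the original wILI scale. A minimal sketch with hypothetical bounds (not values from the case study):

```r
# Hypothetical log-scale prediction interval for a wILI observation.
log_bounds <- c(lower = -0.5, upper = 0.8)
# Back-transforming with exp() guarantees strictly positive endpoints,
# as required for a percentage-scale quantity such as wILI.
wili_interval <- exp(log_bounds)
all(wili_interval > 0)  # TRUE
```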
```{r case_study, cache = TRUE}
# Paste together URL so it is not cut off in PDF
url <- paste0("https://raw.githubusercontent.com/FluSightNetwork/",
"cdc-flusight-ensemble/master/scores/point_ests.csv")
raw_data <- read_csv(url, show_col_types = FALSE)
fit <- function(data, method, alpha) {
first_season <- data$Season == "2010/2011"
D <- max(abs(data$obs_value - data$Value)[first_season])
interval_constructor <- case_when(
method == "AgACI" ~ "conformal",
method == "DtACI" ~ "conformal",
method == "SF-OGD" ~ "linear",
method == "SAOCP" ~ "linear"
)
gamma <- D / sqrt(3)
if(interval_constructor == "linear") {
gamma_grid <- seq(0.1, 1, 0.1)
} else {
gamma_grid <- c(0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128)
}
parameters <- list(
interval_constructor = interval_constructor,
D = D,
gamma = gamma,
gamma_grid = gamma_grid
)
aci(
Y = log(data$obs_value),
predictions = log(data$Value),
method = method,
parameters = parameters,
alpha = alpha
)
}