---
title: "In Search of Basu's Elephants"
author: "Merlise Clyde"
subtitle: |
  | Theory and Foundations of Statistics in the Era of Big Data
  | Conference Honoring Bahadur and Basu
institute: "Duke University"
format:
  revealjs:
    theme: [simple, custom.scss]
    slide-number: true
    incremental: true
    scrollable: false
    controls: true
    fragments: true
    preview-links: auto
    smaller: true
    logo: Dukelogo.jpg
    embed-resources: true
    html-math-method:
      method: mathjax
      url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
editor:
  markdown:
    wrap: 72
execute:
  echo: false
number-sections: false
pdf-separate-fragments: false
preview-links: true
---
```{r}
#| echo: false
library(BAS)
```
## Basu's Elephants
{{< include macros.qmd >}}
:::: {.columns}
::: {.column width="60%"}
Circus Owner plans to ship his 50 elephants
- needs an approximate total weight of the 50 elephants
- Sambo is a typical elephant of average weight
- decides to take 50 $\times$ Sambo's current weight as an estimate of the total weight of his herd
:::
::: {.column width="40%"}
![](img/circus-ship.jpeg){width="70%"}
:::
::::
## Sampling Design
:::: {.columns}
::: {.column width="60%"}
Circus Statistician is shocked!
- we need a sampling design to create an unbiased estimator!
- they compromised and came up with the following plan:
- sample Sambo with probability $99/100$
- each of the remaining 49 elephants with probability $1/4900$
:::
::: {.column width="40%"}
![](img/shocked-statistician.png){width="70%"}
:::
::::
## Unbiased Estimation & Horvitz-Thompson
:::: {.columns}
::: {.column width="70%"}
Everyone was happy when Sambo was selected!
- Circus Owner proposes using $50 \times$ Sambo's current weight
- Circus Statistician: we should use the Horvitz-Thompson Estimator (HT)
- unique hyper-admissible estimator in the class of all generalized polynomial unbiased estimators!
- the HT estimate is Sambo's weight $W$ divided by the probability Sambo was selected, i.e. $W \times 100/99$
- But what if the largest elephant, Jumbo, had been selected? asked the Circus Owner
- Well, said the sheepish statistician, HT would lead to $4900 \times$ Jumbo's weight!
- and thus the Circus Statistician lost their job (and became an instructor of statistics)!
:::
::: {.column width="30%"}
![](img/basu-elephant.png){width="70%"}
:::
::::
## Several years later...
:::: {.columns}
::: {.column width=50%}
![](img/model_agency.png)
:::
::: {.column width=50%}
Statistical Modelling Agency has an ad out for help with sampling models
- Circus statistician applies
- Agency: we want to implement Bayesian Model Averaging and need your help in sampling models
and estimation
- Former Circus Statistician: I know all about sampling! Tell me more!
- and being an instructor of statistics does not pay enough
:::
::::
## Canonical Regression Model
- Observe response vector $\Y$ with predictor variables $X_1 \dots
X_p$.
- Model for data under a specific model $\Mg$:
\begin{equation*}
\Y \mid \alpha, \bg, \phi, \Mg \sim \N(\one_n\alpha + \Xg \bg, \I_n/\phi)
\label{eq:linear.model}
\end{equation*}
- Models $\Mg$ encoded by $\g = (\gamma_1, \ldots \gamma_p)^T$ binary vector with
$\gamma_j = 1$ indicating that $X_j$ is included in model $\Mg$
where
\begin{align*}
\gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\
\gamma_j = 1 & \Leftrightarrow \beta_j \neq 0
\end{align*}
- $\Xg$: the $n \times \pg$ design matrix for model $\Mg$
- $\bg$: the $\pg$ vector of non-zero regression coefficients under $\Mg$
- intercept $\alpha$, precision $\phi$ common to all models
## Bayesian Model Averaging (BMA)
- place prior distributions on all unknowns $(\bg, \g, \alpha, \phi)$ and turn the Bayesian crank to get posterior distributions!
- key component is posterior distribution over models
$$
p(\Mg \mid \Y) = \frac {p(\Y \mid \Mg) p(\Mg)} {\sum_{\g' \in \Gamma} p(\Y \mid \M_{\g'})p(\M_{\g'})}
$$
- for **nice** priors, we can integrate out the parameters $\tg = (\bg, \alpha, \phi)$ to obtain the marginal likelihood of $\Mg$
$$
p(\Y \mid \Mg) = \int p(\Y \mid \tg, \Mg)p(\tg \mid \Mg) d\tg
$$
- posterior distribution of quantities $\Delta$ of interest under BMA
$$
\sum_{\g \in \Gamma} p(\Mg \mid \Y) p(\Delta \mid \Y, \Mg)
$$
- estimates of quantities such as $\E[\Y]$, the predictive distribution of $\Y^*$, or the $\gamma_j$ (marginal inclusion probabilities)
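
A minimal sketch of the normalization above (hypothetical `logmarg` and `logprior` vectors over an enumerable model space; log-sum-exp keeps the exponentiation numerically stable):

```{r}
#| echo: true
#| eval: false
# hypothetical log marginal likelihoods log p(Y | M_g) and log priors log p(M_g)
logpost.un <- logmarg + logprior                 # unnormalized log posterior
logC <- max(logpost.un) +
  log(sum(exp(logpost.un - max(logpost.un))))    # log-sum-exp normalizer
postprob <- exp(logpost.un - logC)               # p(M_g | Y), sums to 1
```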
## Implementation
The Former Circus Statistician asked - "What if you can't enumerate all of the models in $\Gamma$?"
- With $88$ predictor variables there are $2^{88} > 3 \times 10^{26}$
models!
- more than the number of stars ($10^{24}$) in the universe!
. . .
![](img/dont-panic.png){width=70%}
## Sampling
Well, then we just use a sample, said the agency statistician
- simple random sampling with or without replacement doesn't work very well (inefficient)
- Box used importance sampling in the 80's, but that didn't seem to scale
- non-random samples via Branch & Bound, but those are biased (and also do not scale)
- but then Bayesians discovered Markov chain Monte Carlo, so MCMC is now pretty standard for implementing BMA
## MCMC
- design a Markov Chain to transition through $\Gamma$ so that ultimately models are sampled proportional to their posterior probabilities
$$
p(\Mg \mid \Y ) \propto p(\Y \mid \Mg) p(\Mg)
$$
- propose a new model from $q(\g^* \mid \g)$
- accept moving to $\g^*$ with probability
$$
\textsf{MH} = \min\left(1, \frac{p(\Y \mid \Mg^*)\, p(\Mg^*)/q(\g^* \mid \g)}
{p(\Y \mid \Mg)\, p(\Mg)/q(\g \mid \g^*)}\right)
$$
- otherwise stay at model $\Mg$
- the normalizing constant is the same in the numerator and denominator so we just need
$p(\Y \mid \Mg) p(\Mg)$!
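
As a minimal sketch (not the `BAS` implementation), one MH step over model space might look like this; `log.post`, `q.sample`, and `q.logdens` are assumed helper functions:

```{r}
#| echo: true
#| eval: false
# one Metropolis-Hastings step over model space;
# log.post(g) returns log p(Y | M_g) + log p(M_g) (normalizing constant cancels)
mh.step <- function(g, log.post, q.sample, q.logdens) {
  g.star <- q.sample(g)                        # propose g* from q(g* | g)
  log.mh <- (log.post(g.star) - q.logdens(g.star, g)) -
            (log.post(g)      - q.logdens(g, g.star))
  if (log(runif(1)) < log.mh) g.star else g    # accept with prob min(1, MH)
}
```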
## Estimation in BMA
The Former Circus Statistician asks: "So how do you estimate the probabilities of models?"
- we just use the Monte Carlo frequencies of models or ergodic averages
\begin{align*}
\widehat{p(\Mg \mid \Y)} & = \frac{\sum_{t = 1}^T I(\M_t = \Mg)} {T} \\
& = \frac{n_{\g}\, I(\Mg \in S)} {\sum_{\g' \in S} n_{\g'}}
\end{align*}
- $T$ = # MCMC samples
- $S$ is the collection of unique sampled models
- $n_{\g}$ is the frequency of model $\Mg$ among the $T$ samples, so $\sum_{\g \in S} n_{\g} = T$
- $n = |S|$ is the number of unique sampled models
. . .
- asymptotically unbiased as $T \to \infty$
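
In code, the frequency estimate is just a table of sampled model labels (a sketch; `models` is a hypothetical length-$T$ vector of model identifiers):

```{r}
#| echo: true
#| eval: false
# Monte Carlo frequency estimates of posterior model probabilities
tab  <- table(models)        # n_g for each unique sampled model
phat <- tab / sum(tab)       # relative frequencies; sum(tab) = T
n    <- length(tab)          # number of unique sampled models, |S|
```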
## Shocked (II)
:::: {.columns}
::: {.column width="40%"}
![](img/shocked-statistician.png){width="70%"}
What! You just use the Monte Carlo frequencies! exclaimed the Former Circus Statistician
- I don't know much about Bayes, but that doesn't seem particularly Bayesian!
:::
::: {.column width="60%"}
- what happened to the (observed) marginal likelihoods $\times$ prior probabilities?
- since you are sampling from a finite population, can you use survey sampling estimates of
$$C = \sum_{\g \in \Gamma } p(\Y \mid \Mg) p(\Mg)$$
with the observed $p(\Y \mid \Mg) p(\Mg)$?
- Agency Statistician: We tried using
$$
\widehat{p(\Mg \mid \Y)} = \frac {p(\Y \mid \Mg) p(\Mg)} {\sum_{\g' \in S} p(\Y \mid \M_{\g'})p(\M_{\g'})}
$$
- it's Fisher Consistent, but biased in finite samples
- that's why we need you!
:::
::::
## Inverse Probability Weighting
Former Circus Statistician: Let's try the Hansen-Hurwitz estimator in PPS sampling
- Goal is to estimate $C= \sum_i^N p(\Y \mid \M_i) p(\M_i)$
- Let $\rho_i$ be the probability of selecting $\M_i$
- Hansen-Hurwitz or importance sampling estimate from $T$ draws ($n_i$ = frequency of $\M_i$) is
$$\hat{C} = \frac{1}{T}\sum_{i \in S} \frac{ n_i\, p(\Y \mid \M_i) p(\M_i)}{\rho_i}
$$
- If we have "perfect" samples from the posterior then
$$\rho_i = \frac{p(\Y \mid \M_i)p(\M_i)}{C}$$ and we recover $C$ exactly!
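
A sketch under these assumptions (`m` holds the $T$ sampled values of $p(\Y \mid \M_t)p(\M_t)$ and `rho` the corresponding selection probabilities; both names are illustrative):

```{r}
#| echo: true
#| eval: false
# Hansen-Hurwitz / importance sampling estimate of C from T draws
C.hat <- mean(m / rho)       # (1/T) * sum of m_t / rho_t
```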
## Self-Normalized Importance Sampling
Since $C$ is unknown, we can apply the ratio estimator; with $\rho_i = p(\Y \mid \M_i) p(\M_i)/C$,
$$
\hat{C} = \frac{\frac1 n \sum_i^n \frac{p(\Y \mid \M_i) p(\M_i)}{\rho_i}}{ \frac1 n \sum_i^n \frac{1}{\rho_i}} = \left[ \frac{1}{n} \sum_i^n \frac{1}{p(\Y \mid \M_i) p(\M_i)} \right]^{-1}
$$
. . .
But this is the "infamous" harmonic mean estimator, said the Agency Statistician! While unbiased, it is highly unstable!
. . .
The Former Circus Statistician (hoping to keep this gig):
- Wait! - I know that the Horvitz-Thompson (HT) estimator uses only the unique sampled values and not the frequencies
- HT dominates the Hansen-Hurwitz in terms of variance!
- but we can't use MCMC...
## Proposal Distribution
Former Circus Statistician: So tell me more about $q$ in MCMC
- ratio $\frac{p(\Y \mid \Mg^*) p(\Mg^*)}{q(\Mg^* \mid \Mg)}$ looks like importance sampling
- what are choices for $q(\Mg^* \mid \Mg)$?
- what about independent proposals $q(\Mg^*)$?
## Adaptive Independent Metropolis
Griffin et al (2021):
- Independent Metropolis Hastings
$$q(\g) = \prod \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}$$
- Product of **Independent** Bernoullis
- Use adaptive Independent Metropolis Hastings to learn $\pi_i$ from past samples plus Rao-Blackwellization
- optimal for target where $\gamma_j$ are independent _a posteriori_
- still uses Metropolis-Hastings Ratio to accept
- And they use Monte Carlo frequencies!
- but it does not work well for sampling with replacement and importance sampling in general
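
Drawing from this product-Bernoulli proposal and evaluating its log density is a one-liner in R (a sketch; `pi.hat` is the vector of adapted inclusion probabilities and `p` the number of predictors):

```{r}
#| echo: true
#| eval: false
# independent Bernoulli proposal for gamma and its log density log q(gamma*)
g.star <- rbinom(p, size = 1, prob = pi.hat)
log.q  <- sum(dbinom(g.star, size = 1, prob = pi.hat, log = TRUE))
```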
## Factor Target
- The joint posterior distribution of $\g$ (dropping $\Y$) may be factored:
$$p(\Mg \mid \Y) \equiv p(\g) = \prod_{j = 1}^p p(\gamma_j \mid \g_{<j})
$$
where $\g_{< j}\equiv \{\gamma_k\}$ for $k < j$ and $p(\gamma_1
\mid \g_{<1}) \equiv p(\gamma_1)$.
- As $\gamma_j$ are binary, re-express as
\begin{equation*}
p(\g) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j
\mid <j})}^{1-\gamma_j}
\end{equation*}
where $\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \g_{<j})$ and
$\rho_{1 \mid < 1} = \rho_1$, the marginal probability.
- Product of **Dependent** Bernoullis
## Global Adaptive MCMC Proposal
:::: {.columns}
::: {.column width="60%"}
Factor proposal
$$q(\g) = \prod_{j = 1}^p q(\gamma_j \mid \g_{<j}) = \prod_j \Ber(\hat{\rho}_{j \mid <j})
$$
- Note: $\Pr(\gamma_j = 1 \mid \g_{<j}) = \E[\gamma_j \mid \g_{<j}]$
- Fit a sequence of $p$ regressions $\gamma_j$ on
$\gamma_{<j}$
\begin{align*}
\gamma_1 & = \mu_1 + \epsilon_1 \\
\gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\
\gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 -
\mu_1) + \beta_{3 2} (\gamma_2 -
\mu_2) + \epsilon_3 \\
& \vdots \\
\gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 -
\mu_1) + \cdots + \beta_{p \, p-1} (\gamma_{p-1} -
\mu_{p-1}) + \epsilon_p
\end{align*}
:::
::: {.column width="30%"}
![](img/tree_v1.png){width="70%"}
:::
::::
## Compositional Regression
Approximate model $$\g \sim \N(\mub, \Sigmab_{\g})$$
- Wermuth (1980) compositional regression
$$ \G = \one_{T} \mub^T + (\G - \one_T \mub^T) \B + \eps
$$
- $\G$ is $T \times p$ matrix where row $t$ is $\g_t$
- $\mub$ is the $p$ dimensional vector of $\E[\g]$
- $\Sigmab_{\g} = \U^T \U$ where $\U$ is the upper triangular Cholesky factor of the ($p \times p$) covariance matrix of $\g$
- $\B^T = \I_p - \operatorname{diag}(\U)\, \U^{-T}$ (strictly lower triangular)
- $\B$ is a $p \times p$ upper triangular matrix with zeros on
the diagonal; the coefficients of the $j$th regression appear in column $j$ of $\B$ (row $j$ of $\B^T$)
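
A quick numerical check of the identity above, on a simulated $3 \times 3$ $\Sigmab$ (not the talk's data):

```{r}
#| echo: true
#| eval: false
# verify B^T = I - diag(U) U^{-T} reproduces sequential regression coefficients
Sigma <- matrix(c(1, .5, .3,
                  .5, 1, .4,
                  .3, .4, 1), 3, 3)
U  <- chol(Sigma)                               # Sigma = U^T U, U upper triangular
Bt <- diag(3) - diag(diag(U)) %*% solve(t(U))   # strictly lower triangular
Bt[2, 1]                                        # = Sigma[2, 1] / Sigma[1, 1] = 0.5
```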
## Estimators of $\B$ and $\mub$
- OLS is BLUE and consistent, but $\G$ may not be full rank
- apply Bayesian shrinkage with "priors" on $\mub$ (non-informative or normal) and $\Sigmab$ (inverse-Wishart)
- the pseudo-posterior mean of $\mub$ is the current estimate of the marginal inclusion probabilities, $\hat{\mub} = \bar{\g}$
- use pseudo-posterior mean for $\Sigmab$
- one Cholesky decomposition provides all
coefficients for the $p$ predictions for proposing $\g^*$
- constrain predicted values $\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)$
- generate $\g^*_j \mid \g^*_{< j} \sim \Ber(\hat{\rho}_{j \mid <j})$
- use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all)
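
Putting the pieces together, a minimal sketch of the proposal step under these assumptions (illustrative names; `G` is the $T \times p$ matrix of past sampled models, `delta` the truncation bound, and a simple ridge term stands in for the inverse-Wishart shrinkage):

```{r}
#| echo: true
#| eval: false
# sketch: propose g* via compositional regression on past samples G (T x p)
propose.gamma <- function(G, delta = 0.025) {
  p    <- ncol(G)
  mu   <- colMeans(G)                        # current inclusion probability estimates
  S    <- cov(G) + diag(1e-6, p)             # regularized covariance (shrinkage stand-in)
  U    <- chol(S)                            # one Cholesky gives all coefficients
  Bt   <- diag(p) - diag(diag(U)) %*% solve(t(U))  # B^T, strictly lower triangular
  g    <- numeric(p)
  logq <- 0
  for (j in 1:p) {                           # sequential dependent Bernoulli draws
    k   <- seq_len(j - 1)
    rho <- mu[j] + sum(Bt[j, k] * (g[k] - mu[k]))
    rho <- min(max(rho, delta), 1 - delta)   # constrain to (delta, 1 - delta)
    g[j]  <- rbinom(1, 1, rho)
    logq  <- logq + dbinom(g[j], 1, rho, log = TRUE)
  }
  list(gamma = g, logq = logq)               # proposal and its log density
}
```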
## Horvitz-Thompson
- use only the $n$ unique models in $S$
- inclusion probability that $\g_i$ is included in $S$ under IS (PPSWR) with $T$ draws
$$\pi_i = 1 - (1 - q(\g_i))^T$$
- HT estimate of the normalizing constant
$$\hat{C} = \sum_{i \in S} \frac{p(\Y \mid \M_i)p(\M_i)} {\pi_i}$$
- Ratio HT estimate of posterior probabilities
$$\widehat{p(\Mg \mid \Y)} = \frac{\sum_{i \in S} I(\M_i = \Mg)\, p(\Y \mid \M_i)p(\M_i)/\pi_{i}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i} =
\frac { p(\Y \mid \Mg)p(\Mg)/\pi_{\g}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i}$$
- Estimate of Marginal Inclusion Probabilities
$$\widehat{p(\gamma_j = 1 \mid \Y)} = \frac{\sum_{i \in S} I(\gamma_{ij} = 1)\, p(\Y \mid \M_i)p(\M_i)/\pi_{i}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i}
$$
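
A sketch of these estimators (hypothetical inputs: `m` and `logq` over the $n$ unique models in $S$, `Gamma.S` the $n \times p$ binary matrix of those models, `T.its` the number of IS draws):

```{r}
#| echo: true
#| eval: false
# Horvitz-Thompson estimates from the n unique sampled models
pi.incl  <- 1 - (1 - exp(logq))^T.its        # inclusion probabilities under PPSWR
w        <- m / pi.incl                      # HT weights m_i / pi_i
C.hat    <- sum(w)                           # HT estimate of C
postprob <- w / C.hat                        # ratio HT posterior probabilities
pip      <- crossprod(Gamma.S, w) / C.hat    # marginal inclusion probabilities
```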
## Simulation
::: {.nonincremental}
:::: {.columns}
::: {.column width="50%"}
- `tecator` data (Griffin et al (2021))
- a sample of $p = 20$ variables
- compare
- enumeration
- MCMC with add, delete, and swap moves
- Adaptive MCMC
- Importance Sampling with HT
- same settings `burnin.it`, `MCMC.it`, `thin`
:::
::: {.column width="50%"}
```{r}
load("sim_code/tecator-time.dat")
boxplot(time, main="CPU time", ylab = "Time")
```
:::
::::
:::
## MSE Comparison {.center}
:::: {.columns}
::: {.column width="50%"}
```{r}
#| echo: false
load("tecator-mse.dat")
colnames(mse.pip) <- c("HT", "AMCMC", "MCMC")
boxplot(10*sqrt(mse.pip), main="Marginal Inclusion Probabilities", ylab = "10 x RMSE")
```
:::
::: {.column width="50%"}
```{r}
#| echo: false
colnames(mse.pp) <- c("HT", "AMCMC", "MCMC")
boxplot(10*sqrt(mse.pp), main="Posterior Model Probabilities", ylab = "10 x RMSE")
```
:::
::::
## Basu's Estimator of Total
Basu (1971), Meeden and Ghosh (1983):
$$\hat{C} = \sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i)p(\M_i)}{\pi_i} \times \sum_{i \notin S} \pi_i$$
- Model-based: Let $m_i = p(\Y \mid \M_i) p(\M_i)$
\begin{align*}
m_i \mid \pi_i &\ind \N(\pi_i \beta, \sigma^2 \pi_i^2) \\
p(\beta, \sigma^2) & \propto 1/\sigma^2
\end{align*}
- posterior mean of $\beta$ is $\hat{\beta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}$ (the HT form)
- using the posterior predictive, and noting $\sum_{i \notin S} \pi_i \approx \sum_{i \in S} (1 - \pi_i)$ when the $\pi_i$ sum to $n$ over $\Gamma$,
\begin{align*} \hat{C} & = \sum_{i \in S} m_i + \hat{\beta} \sum_{i \notin S} \pi_i
= \sum_{i \in S} m_i + \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\sum_{i \in S} (1 - \pi_i)
\end{align*}
- accounts for probability on models not yet observed!
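
Continuing the sketch from the HT slide (`m` and `pi.incl` as before):

```{r}
#| echo: true
#| eval: false
# Basu-style estimate: observed mass plus predicted mass of unsampled models
beta.hat <- mean(m / pi.incl)                     # (1/n) sum of m_i / pi_i
C.basu   <- sum(m) + beta.hat * sum(1 - pi.incl)  # hat(C) with predictive term
```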
## Final Posterior Estimates
- estimate of the posterior probability of $\Mg \in S$
$$
\frac{p(\Y \mid \Mg) p(\Mg)}{\sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}
$$
- combined estimate for all models in $\Gamma \setminus S$
$$
\frac{\frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}
$$
- Uses renormalized marginal likelihoods of sampled models
- Marginal inclusion probabilities follow by summing the first estimate over the sampled models that include $X_j$
- What about $\E[\b \mid \Y]$, $\E[\X\b \mid \Y]$, $\E[\Y^* \mid \Y]$ or $p(\Delta \mid \Y)$?
## Continued Adaptation?
- can update the Cholesky factor with rank-1 updates as new models arrive
- how to combine IS with MH samples (weighting)?
- HT/Hajek: computational complexity if inclusion probabilities must be recomputed for all models as the proposal adapts (previous and future models)
- Basu (1971) suggests an approach for PPSWOR (adaptation?)
## Refinements
:::: {.columns}
::: {.column width="50%"}
- Need to avoid MCMC for pseudo Bayesian posteriors for
- learning proposal distribution in sample design for models
- estimation of posterior model probabilities in model-based approaches (i.e. sampling from the predictive distribution)
- estimation of general quantities under BMA?
- avoid infinite regret
:::
::: {.column width="40%"}
![](img/turtles_all_the_way_down.jpg)
:::
::::
## Summary
:::: {.columns}
::: {.column width="50%"}
- Adaptive Independent Metropolis proposal for models (use in more complex IS)
- Use the observed marginal likelihoods of the unique sampled models to estimate the posterior distribution
- Bayes estimates of MC output
- Basu's Elephants seem to be, well, holding up the Bayesian World
- and the Circus Statistician is still employed!
:::
::: {.column width="50%"}
![](img/The_hindoo_earth.png)
:::
::::
## Thank You! {.center}
:::: {.columns}
::: {.column width="50%"}
Code in development version of `BAS` on [https://github.com/merliseclyde/BAS](https://github.com/merliseclyde/BAS)
Talk available at [https://merliseclyde.github.io/AMC-talk/](https://merliseclyde.github.io/AMC-talk/)
:::
::: {.column width="50%"}
![](img/adobe-express-qr-code.png){width="50%"}
:::
::::
## References
- Meeden, G. and Ghosh, M. (1983). [https://www.jstor.org/stable/2240483](https://www.jstor.org/stable/2240483)