---
title: "In Search of Basu's Elephants"
author: "Merlise Clyde"
subtitle: |
  | Theory and Foundations of Statistics in the Era of Big Data
  | Conference Honoring Bahadur and Basu
institute: "Duke University"
format:
  revealjs:
    theme: [simple, custom.scss]
    slide-number: true
    incremental: true
    scrollable: false
    controls: true
    fragments: true
    preview-links: auto
    smaller: true
    logo: Dukelogo.jpg
    embed-resources: true
    html-math-method:
      method: mathjax
      url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
editor:
  markdown:
    wrap: 72
execute:
  echo: false
number-sections: false
pdf-separate-fragments: false
preview-links: true
---
```{r}
#| echo: false
library(BAS)
```
## Basu's Elephants
{{< include macros.qmd >}}
:::: {.columns}
::: {.column width="60%"}
Circus Owner plans to ship his 50 elephants
- needs an approximate total weight of the 50 elephants
- Sambo is a typical elephant of average weight
- decides to take 50 $\times$ Sambo's current weight as an estimate of the total weight of his herd
:::
::: {.column width="40%"}
![](img/circus-ship.jpeg){width="70%"}
:::
::::
## Sampling Design
:::: {.columns}
::: {.column width="60%"}
Circus Statistician is shocked!
- we need a sampling design to create an unbiased estimator!
- they compromised and came up with the following plan:
- sample Sambo with probability $99/100$
- each of the remaining 49 elephants with probability $1/4900$
:::
::: {.column width="40%"}
![](img/shocked-statistician.png){width="70%"}
:::
::::
## Unbiased Estimation & Horvitz-Thompson
:::: {.columns}
::: {.column width="70%"}
Everyone was happy when Sambo was selected!
- Circus Owner proposes using $50 \times$ Sambo's current weight
- Circus Statistician: we should use the Horvitz-Thompson Estimator (HT)
- unique hyper-admissible estimator in the class of all generalized polynomial unbiased estimators!
- the HT estimate is Sambo's weight $W$ divided by the probability Sambo was selected, i.e. $W \times 100/99$
- But what if the largest elephant, Jumbo, had been selected? asked the Circus Owner
- Well, said the sheepish statistician, HT would lead to $4900 \times$ Jumbo's weight!
- and thus the Circus Statistician lost their job (and became an instructor of statistics)!
:::
::: {.column width="30%"}
![](img/basu-elephant.png){width="70%"}
:::
::::
## Several years later...
:::: {.columns}
::: {.column width=50%}
![](img/model_agency.png)
:::
::: {.column width=50%}
Statistical Modelling Agency has an ad out for help with sampling models
- Circus statistician applies
- Agency: we want to implement Bayesian Model Averaging and need your help in sampling models
and estimation
- Former Circus Statistician: I know all about sampling! Tell me more!
- and being an instructor of statistics does not pay enough
:::
::::
## Canonical Regression Model
- Observe response vector $\Y$ with predictor variables $X_1 \dots
X_p$.
- Model for data under a specific model $\Mg$:
\begin{equation*}
\Y \mid \alpha, \bg, \phi, \Mg \sim \N(\one_n\alpha + \Xg \bg, \I_n/\phi)
\label{eq:linear.model}
\end{equation*}
- Models $\Mg$ encoded by $\g = (\gamma_1, \ldots \gamma_p)^T$ binary vector with
$\gamma_j = 1$ indicating that $X_j$ is included in model $\Mg$
where
\begin{align*}
\gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\
\gamma_j = 1 & \Leftrightarrow \beta_j \neq 0
\end{align*}
- $\Xg$: the $n \times \pg$ design matrix for model $\Mg$
- $\bg$: the $\pg$ vector of non-zero regression coefficients under $\Mg$
- intercept $\alpha$, precision $\phi$ common to all models
## Bayesian Model Averaging (BMA)
- place prior distributions on all unknowns $(\bg, \g, \alpha, \phi)$ and turn the Bayesian crank to get posterior distributions!
- key component is posterior distribution over models
$$
p(\Mg \mid \Y) = \frac {p(\Y \mid \Mg) p(\Mg)} {\sum_{\g' \in \Gamma} p(\Y \mid \M_{\g'})p(\M_{\g'})}
$$
- for **nice** priors, we can integrate out the parameters $\tg = (\bg, \alpha, \phi)$ to obtain the marginal likelihood of $\Mg$
$$
p(\Y \mid \Mg) = \int p(\Y \mid \tg, \Mg)p(\tg \mid \Mg) d\tg
$$
- posterior distribution of quantities $\Delta$ of interest under BMA
$$
\sum_{\g \in \Gamma} p(\Mg \mid \Y) p(\Delta \mid \Y, \Mg)
$$
- estimates of quantities such as $\E[\Y]$, the predictive distribution of $\Y^*$, or the $\gamma_j$ (marginal inclusion probabilities)
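
A minimal sketch of the normalization above (hypothetical `logmarg` and `logprior` vectors over an enumerable model space; log-sum-exp keeps the exponentiation numerically stable):

```{r}
#| echo: true
#| eval: false
# hypothetical log marginal likelihoods log p(Y | M_g) and log priors log p(M_g)
logpost.un <- logmarg + logprior                 # unnormalized log posterior
logC <- max(logpost.un) +
  log(sum(exp(logpost.un - max(logpost.un))))    # log-sum-exp normalizer
postprob <- exp(logpost.un - logC)               # p(M_g | Y), sums to 1
```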
## Implementation
The Former Circus Statistician asked - "What if you can't enumerate all of the models in $\Gamma$?"
- With $88$ predictor variables there are $2^{88} > 3 \times 10^{26}$
models!
- more than the number of stars ($10^{24}$) in the universe!
. . .
![](img/dont-panic.png){width=70%}
## Sampling
Well, then we just use a sample, said the agency statistician
- simple random sampling with or without replacement doesn't work very well (inefficient)
- Box used importance sampling in the 80's, but that didn't seem to scale
- non-random samples via Branch & Bound, but those are biased (and also do not scale)
- but then Bayesians discovered Markov chain Monte Carlo, so MCMC is now pretty standard for implementing BMA
## MCMC
- design a Markov Chain to transition through $\Gamma$ so that ultimately models are sampled proportional to their posterior probabilities
$$
p(\Mg \mid \Y ) \propto p(\Y \mid \Mg) p(\Mg)
$$
- propose a new model from $q(\g^* \mid \g)$
- accept moving to $\g^*$ with probability
$$
\textsf{MH} = \min\left(1, \frac{p(\Y \mid \Mg^*)\, p(\Mg^*)/q(\g^* \mid \g)}
{p(\Y \mid \Mg)\, p(\Mg)/q(\g \mid \g^*)}\right)
$$
- otherwise stay at model $\Mg$
- the normalizing constant is the same in the numerator and denominator so we just need
$p(\Y \mid \Mg) p(\Mg)$!
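
As a minimal sketch (not the `BAS` implementation), one MH step over model space might look like this; `log.post`, `q.sample`, and `q.logdens` are assumed helper functions:

```{r}
#| echo: true
#| eval: false
# one Metropolis-Hastings step over model space;
# log.post(g) returns log p(Y | M_g) + log p(M_g) (normalizing constant cancels)
mh.step <- function(g, log.post, q.sample, q.logdens) {
  g.star <- q.sample(g)                        # propose g* from q(g* | g)
  log.mh <- (log.post(g.star) - q.logdens(g.star, g)) -
            (log.post(g)      - q.logdens(g, g.star))
  if (log(runif(1)) < log.mh) g.star else g    # accept with prob min(1, MH)
}
```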
## Estimation in BMA
The Former Circus Statistician asks: "So how do you estimate the probabilities of models?"
- we just use the Monte Carlo frequencies of models or ergodic averages
\begin{align*}
\widehat{p(\Mg \mid \Y)} & = \frac{\sum_{t = 1}^T I(\M_t = \Mg)} {T} \\
& = \frac{n_{\g}\, I(\Mg \in S)} {\sum_{\g' \in S} n_{\g'}}
\end{align*}
- $T$ = # MCMC samples
- $S$ is the collection of unique sampled models
- $n_{\g}$ is the frequency of model $\Mg$ among the $T$ samples, so $\sum_{\g \in S} n_{\g} = T$
- $n = |S|$ is the number of unique sampled models
. . .
- asymptotically unbiased as $T \to \infty$
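
In code, the frequency estimate is just a table of sampled model labels (a sketch; `models` is a hypothetical length-$T$ vector of model identifiers):

```{r}
#| echo: true
#| eval: false
# Monte Carlo frequency estimates of posterior model probabilities
tab  <- table(models)        # n_g for each unique sampled model
phat <- tab / sum(tab)       # relative frequencies; sum(tab) = T
n    <- length(tab)          # number of unique sampled models, |S|
```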
## Shocked (II)
:::: {.columns}
::: {.column width="40%"}
![](img/shocked-statistician.png){width="70%"}
What! You just use the Monte Carlo frequencies! exclaimed the Former Circus Statistician
- I don't know much about Bayes, but that doesn't seem particularly Bayesian!
:::
::: {.column width="60%"}
- what happened to the (observed) marginal likelihoods $\times$ prior probabilities?
- since you are sampling from a finite population, can you use survey sampling estimates of
$$C = \sum_{\g \in \Gamma } p(\Y \mid \Mg) p(\Mg)$$
with the observed $p(\Y \mid \Mg) p(\Mg)$?
- Agency Statistician: We tried using
$$
\widehat{p(\Mg \mid \Y)} = \frac {p(\Y \mid \Mg) p(\Mg)} {\sum_{\g' \in S} p(\Y \mid \M_{\g'})p(\M_{\g'})}
$$
- it's Fisher Consistent, but biased in finite samples
- that's why we need you!
:::
::::
## Inverse Probability Weighting
Former Circus Statistician: Let's try the Hansen-Hurwitz estimator in PPS sampling
- Goal is to estimate $C= \sum_i^N p(\Y \mid \M_i) p(\M_i)$
- Let $\rho_i$ be the probability of selecting $\M_i$
- Hansen-Hurwitz or importance sampling estimate from $T$ draws ($n_i$ = frequency of $\M_i$) is
$$\hat{C} = \frac{1}{T}\sum_{i \in S} \frac{ n_i\, p(\Y \mid \M_i) p(\M_i)}{\rho_i}
$$
- If we have "perfect" samples from the posterior then
$$\rho_i = \frac{p(\Y \mid \M_i)p(\M_i)}{C}$$ and we recover $C$ exactly!
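
A sketch under these assumptions (`m` holds the $T$ sampled values of $p(\Y \mid \M_t)p(\M_t)$ and `rho` the corresponding selection probabilities; both names are illustrative):

```{r}
#| echo: true
#| eval: false
# Hansen-Hurwitz / importance sampling estimate of C from T draws
C.hat <- mean(m / rho)       # (1/T) * sum of m_t / rho_t
```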
## Self-Normalized Importance Sampling
Since $C$ is unknown, we can apply the ratio estimator; with $\rho_i = p(\Y \mid \M_i) p(\M_i)/C$,
$$
\hat{C} = \frac{\frac1 n \sum_i^n \frac{p(\Y \mid \M_i) p(\M_i)}{\rho_i}}{ \frac1 n \sum_i^n \frac{1}{\rho_i}} = \left[ \frac{1}{n} \sum_i^n \frac{1}{p(\Y \mid \M_i) p(\M_i)} \right]^{-1}
$$
. . .
But this is the "infamous" harmonic mean estimator, said the Agency Statistician! While unbiased, it is highly unstable!
. . .
The Former Circus Statistician (hoping to keep this gig):
- Wait! - I know that the Horvitz-Thompson (HT) estimator uses only the unique sampled values and not the frequencies
- HT dominates the Hansen-Hurwitz in terms of variance!
- but we can't use MCMC...
## Proposal Distribution
Former Circus Statistician: So tell me more about $q$ in MCMC
- ratio $\frac{p(\Y \mid \Mg^*) p(\Mg^*)}{q(\Mg^* \mid \Mg)}$ looks like importance sampling
- what are choices for $q(\Mg^* \mid \Mg)$?
- what about independent proposals $q(\Mg^*)$?
## Adaptive Independent Metropolis
Griffin et al (2021):
- Independent Metropolis Hastings
$$q(\g) = \prod \pi_i^{\gamma_i} (1 - \pi_i)^{1-\gamma_i}$$
- Product of **Independent** Bernoullis
- Use adaptive Independent Metropolis Hastings to learn $\pi_i$ from past samples plus Rao-Blackwellization
- optimal for target where $\gamma_j$ are independent _a posteriori_
- still uses Metropolis-Hastings Ratio to accept
- And they use Monte Carlo frequencies!
- but it does not work well for sampling with replacement and importance sampling in general
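
Drawing from this product-Bernoulli proposal and evaluating its log density is a one-liner in R (a sketch; `pi.hat` is the vector of adapted inclusion probabilities and `p` the number of predictors):

```{r}
#| echo: true
#| eval: false
# independent Bernoulli proposal for gamma and its log density log q(gamma*)
g.star <- rbinom(p, size = 1, prob = pi.hat)
log.q  <- sum(dbinom(g.star, size = 1, prob = pi.hat, log = TRUE))
```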
## Factor Target
- The joint posterior distribution of $\g$ (dropping $\Y$) may be factored:
$$p(\Mg \mid \Y) \equiv p(\g) = \prod_{j = 1}^p p(\gamma_j \mid \g_{<j})
$$
where $\g_{< j}\equiv \{\gamma_k\}$ for $k < j$ and $p(\gamma_1
\mid \g_{<1}) \equiv p(\gamma_1)$.
- As $\gamma_j$ are binary, re-express as
\begin{equation*}
p(\g) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j
\mid <j})}^{1-\gamma_j}
\end{equation*}
where $\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \g_{<j})$ and
$\rho_{1 \mid < 1} = \rho_1$, the marginal probability.
- Product of **Dependent** Bernoullis
## Global Adaptive MCMC Proposal
:::: {.columns}
::: {.column width="60%"}
Factor proposal
$$q(\g) = \prod_{j = 1}^p q(\gamma_j \mid \g_{<j}) = \prod_j \Ber(\hat{\rho}_{j \mid <j})
$$
- Note: $\Pr(\gamma_j = 1 \mid \g_{<j}) = \E[\gamma_j \mid \g_{<j}]$
- Fit a sequence of $p$ regressions $\gamma_j$ on
$\gamma_{<j}$
\begin{align*}
\gamma_1 & = \mu_1 + \epsilon_1 \\
\gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\
\gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 -
\mu_1) + \beta_{3 2} (\gamma_2 -
\mu_2) + \epsilon_3 \\
& \vdots \\
\gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 -
\mu_1) + \cdots + \beta_{p \, p-1} (\gamma_{p-1} -
\mu_{p-1}) + \epsilon_p
\end{align*}
:::
::: {.column width="30%"}
![](img/tree_v1.png){width="70%"}
:::
::::
## Compositional Regression
Approximate model $$\g \sim \N(\mub, \Sigmab_{\g})$$
- Wermuth (1980) compositional regression
$$ \G = \one_{T} \mub^T + (\G - \one_T \mub^T) \B + \eps
$$
- $\G$ is $T \times p$ matrix where row $t$ is $\g_t$
- $\mub$ is the $p$ dimensional vector of $\E[\g]$
- $\Sigmab_{\g} = \U^T \U$ where $\U$ is the upper triangular Cholesky factor of the ($p \times p$) covariance matrix of $\g$
- $\B^T = \I_p - \operatorname{diag}(\U)\, \U^{-T}$ (strictly lower triangular)
- $\B$ is a $p \times p$ upper triangular matrix with zeros on
the diagonal; the coefficients of the $j$th regression appear in column $j$ of $\B$ (row $j$ of $\B^T$)
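
A quick numerical check of the identity above, on a simulated $3 \times 3$ $\Sigmab$ (not the talk's data):

```{r}
#| echo: true
#| eval: false
# verify B^T = I - diag(U) U^{-T} reproduces sequential regression coefficients
Sigma <- matrix(c(1, .5, .3,
                  .5, 1, .4,
                  .3, .4, 1), 3, 3)
U  <- chol(Sigma)                               # Sigma = U^T U, U upper triangular
Bt <- diag(3) - diag(diag(U)) %*% solve(t(U))   # strictly lower triangular
Bt[2, 1]                                        # = Sigma[2, 1] / Sigma[1, 1] = 0.5
```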
## Estimators of $\B$ and $\mub$
- OLS is BLUE and consistent, but $\G$ may not be full rank
- apply Bayesian shrinkage with "priors" on $\mub$ (non-informative or normal) and $\Sigmab$ (inverse-Wishart)
- the pseudo-posterior mean of $\mub$ is the current estimate of the marginal inclusion probabilities, $\hat{\mub} = \bar{\g}$
- use pseudo-posterior mean for $\Sigmab$
- one Cholesky decomposition provides all
coefficients for the $p$ predictions for proposing $\g^*$
- constrain predicted values $\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)$
- generate $\g^*_j \mid \g^*_{< j} \sim \Ber(\hat{\rho}_{j \mid <j})$
- use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all)
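
Putting the pieces together, a minimal sketch of the proposal step under these assumptions (illustrative names; `G` is the $T \times p$ matrix of past sampled models, `delta` the truncation bound, and a simple ridge term stands in for the inverse-Wishart shrinkage):

```{r}
#| echo: true
#| eval: false
# sketch: propose g* via compositional regression on past samples G (T x p)
propose.gamma <- function(G, delta = 0.025) {
  p    <- ncol(G)
  mu   <- colMeans(G)                        # current inclusion probability estimates
  S    <- cov(G) + diag(1e-6, p)             # regularized covariance (shrinkage stand-in)
  U    <- chol(S)                            # one Cholesky gives all coefficients
  Bt   <- diag(p) - diag(diag(U)) %*% solve(t(U))  # B^T, strictly lower triangular
  g    <- numeric(p)
  logq <- 0
  for (j in 1:p) {                           # sequential dependent Bernoulli draws
    k   <- seq_len(j - 1)
    rho <- mu[j] + sum(Bt[j, k] * (g[k] - mu[k]))
    rho <- min(max(rho, delta), 1 - delta)   # constrain to (delta, 1 - delta)
    g[j]  <- rbinom(1, 1, rho)
    logq  <- logq + dbinom(g[j], 1, rho, log = TRUE)
  }
  list(gamma = g, logq = logq)               # proposal and its log density
}
```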
## Horvitz-Thompson
- use only the $n$ unique models in $S$
- inclusion probability that $\g_i$ is included in $S$ under IS (PPSWR) with $T$ draws
$$\pi_i = 1 - (1 - q(\g_i))^T$$
- HT estimate of the normalizing constant
$$\hat{C} = \sum_{i \in S} \frac{p(\Y \mid \M_i)p(\M_i)} {\pi_i}$$
- Ratio HT estimate of posterior probabilities
$$\widehat{p(\Mg \mid \Y)} = \frac{\sum_{i \in S} I(\M_i = \Mg)\, p(\Y \mid \M_i)p(\M_i)/\pi_{i}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i} =
\frac { p(\Y \mid \Mg)p(\Mg)/\pi_{\g}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i}$$
- Estimate of Marginal Inclusion Probabilities
$$\widehat{p(\gamma_j = 1 \mid \Y)} = \frac{\sum_{i \in S} I(\gamma_{ij} = 1)\, p(\Y \mid \M_i)p(\M_i)/\pi_{i}}
{\sum_{i \in S} p(\Y \mid \M_i)p(\M_i)/\pi_i}
$$
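
A sketch of these estimators (hypothetical inputs: `m` and `logq` over the $n$ unique models in $S$, `Gamma.S` the $n \times p$ binary matrix of those models, `T.its` the number of IS draws):

```{r}
#| echo: true
#| eval: false
# Horvitz-Thompson estimates from the n unique sampled models
pi.incl  <- 1 - (1 - exp(logq))^T.its        # inclusion probabilities under PPSWR
w        <- m / pi.incl                      # HT weights m_i / pi_i
C.hat    <- sum(w)                           # HT estimate of C
postprob <- w / C.hat                        # ratio HT posterior probabilities
pip      <- crossprod(Gamma.S, w) / C.hat    # marginal inclusion probabilities
```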
## Simulation
::: {.nonincremental}
:::: {.columns}
::: {.column width="50%"}
- `tecator` data (Griffin et al (2021))
- a sample of $p = 20$ variables
- compare
- enumeration
- MCMC with add, delete, and swap moves
- Adaptive MCMC
- Importance Sampling with HT
- same settings `burnin.it`, `MCMC.it`, `thin`
:::
::: {.column width="50%"}
```{r}
load("sim_code/tecator-time.dat")
boxplot(time, main="CPU time", ylab = "Time")
```
:::
::::
:::
## MSE Comparison {.center}
:::: {.columns}
::: {.column width="50%"}
```{r}
#| echo: false
load("tecator-mse.dat")
colnames(mse.pip) <- c("HT", "AMCMC", "MCMC")
boxplot(10*sqrt(mse.pip), main="Marginal Inclusion Probabilities", ylab = "10 x RMSE")
```
:::
::: {.column width="50%"}
```{r}
#| echo: false
colnames(mse.pp) <- c("HT", "AMCMC", "MCMC")
boxplot(10*sqrt(mse.pp), main="Posterior Model Probabilities", ylab = "10 x RMSE")
```
:::
::::
## Basu's Estimator of Total
Basu (1971), Meeden and Ghosh (1983):
$$\hat{C} = \sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i)p(\M_i)}{\pi_i} \times \sum_{i \notin S} \pi_i$$
- Model-based: Let $m_i = p(\Y \mid \M_i) p(\M_i)$
\begin{align*}
m_i \mid \pi_i &\ind \N(\pi_i \beta, \sigma^2 \pi_i^2) \\
p(\beta, \sigma^2) & \propto 1/\sigma^2
\end{align*}
- posterior mean of $\beta$ is $\hat{\beta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}$ (the HT form)
- using the posterior predictive, and noting $\sum_{i \notin S} \pi_i \approx \sum_{i \in S} (1 - \pi_i)$ when the $\pi_i$ sum to $n$ over $\Gamma$,
\begin{align*} \hat{C} & = \sum_{i \in S} m_i + \hat{\beta} \sum_{i \notin S} \pi_i
= \sum_{i \in S} m_i + \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\sum_{i \in S} (1 - \pi_i)
\end{align*}
- accounts for probability on models not yet observed!
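
Continuing the sketch from the HT slide (`m` and `pi.incl` as before):

```{r}
#| echo: true
#| eval: false
# Basu-style estimate: observed mass plus predicted mass of unsampled models
beta.hat <- mean(m / pi.incl)                     # (1/n) sum of m_i / pi_i
C.basu   <- sum(m) + beta.hat * sum(1 - pi.incl)  # hat(C) with predictive term
```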
## Final Posterior Estimates
- estimate of the posterior probability of $\Mg \in S$
$$
\frac{p(\Y \mid \Mg) p(\Mg)}{\sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}
$$
- combined estimate for all models in $\Gamma \setminus S$
$$
\frac{\frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\Y \mid \M_i) p(\M_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\Y \mid \M_i) p(\M_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}
$$
- Uses renormalized marginal likelihoods of sampled models
- Marginal inclusion probabilities follow by summing the first estimate over the sampled models that include $X_j$
- What about $\E[\b \mid \Y]$, $\E[\X\b \mid \Y]$, $\E[\Y^* \mid \Y]$ or $p(\Delta \mid \Y)$?
## Continued Adaptation?
- can update the Cholesky factor with rank-1 updates as new models arrive
- how to combine IS with MH samples (weighting)?
- HT/Hajek: computational complexity if inclusion probabilities must be recomputed for all models as the proposal adapts (previous and future models)
- Basu (1971) suggests an approach for PPSWOR (adaptation?)
## Refinements
:::: {.columns}
::: {.column width="50%"}
- Need to avoid MCMC for pseudo Bayesian posteriors for
- learning proposal distribution in sample design for models
- estimation of posterior model probabilities in model-based approaches (i.e. sampling from the predictive distribution)
- estimation of general quantities under BMA?
- avoid infinite regret
:::
::: {.column width="40%"}
![](img/turtles_all_the_way_down.jpg)
:::
::::
## Summary
:::: {.columns}
::: {.column width="50%"}
- Adaptive Independent Metropolis proposal for models (use in more complex IS)
- Use the observed marginal likelihoods of the unique sampled models to estimate the posterior distribution
- Bayes estimates of MC output
- Basu's Elephants seem to be, well, holding up the Bayesian World
- and the Circus Statistician is still employed!
:::
::: {.column width="50%"}
![](img/The_hindoo_earth.png)
:::
::::
## Thank You! {.center}
:::: {.columns}
::: {.column width="50%"}
Code in development version of `BAS` on [https://github.com/merliseclyde/BAS](https://github.com/merliseclyde/BAS)
Talk available at [https://merliseclyde.github.io/AMC-talk/](https://merliseclyde.github.io/AMC-talk/)
:::
::: {.column width="50%"}
![](img/adobe-express-qr-code.png){width="50%"}
:::
::::
## References
- Meeden, G. and Ghosh, M. (1983). [https://www.jstor.org/stable/2240483](https://www.jstor.org/stable/2240483)