-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathAIS_MarLik.tex.bak
1272 lines (1165 loc) · 61.8 KB
/
AIS_MarLik.tex.bak
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
% Use pdfLatex to compile
\documentclass[12pt]{article}
\usepackage[dvipsnames]{xcolor}
\usepackage[letterpaper,margin=1in]{geometry}
\usepackage{charter}
\usepackage{latexsym,subfig, amssymb, amsmath, amsfonts, graphicx, color,fancyvrb,natbib}
\usepackage{amsthm,wrapfig,url,bm,rotating,multirow}
\usepackage{url,setspace,epsfig}
\def\ci{\perp\!\!\!\perp}
\def\bibfontsize{\small}
\def\authorfmt#1{\textsc{#1}}
\def\dcuand{\&}
\def\MI{\textsf{MI}}
\def\PI{\textsf{PI}}
\def\KL{\textsf{KL}}
\def\ESS{\textsf{ESS }}
\def\Z{\textsf{Z}}
\def\BF{\textsf{BF}}
\renewcommand{\baselinestretch}{1.0}
\makeatletter
%\newcommand{\figcaption}{\def\@captype{figure}\caption}
\newcommand{\tabcaption}{\def\@captype{table}\caption}
\makeatother
\begin{document}
\vspace{-1in}
\title{Adaptive Annealed Importance Sampling Method for Calculating Marginal likelihood with Application to Bayesian Exo-planets Detection}
\author{Bin Liu, Merlise Clyde, Tom Loredo, Jim Berger}
\maketitle
\begin{abstract}
Motivated by an astrophysics problem, namely, extra-solar planet
searching, we propose an adaptive annealed importance sampling
method, which facilitates simulating draws from multi-modal joint
posterior distribution, and provides an effective and
easy-to-implement way for estimating marginal likelihood that is
required for Bayesian model comparison. This method is inspired by
the idea of 'annealing' as a way of searching isolated peaky modes
in the posterior and the fact that any continuous probability
density can be approximated well by a finite mixture model, provided
that it is affiliated with a sufficient number of components along
with correctly estimated parameters. The annealing strategy is
performed within a sequential importance sampling (SIS) framework,
in which a mixture function that consists of a batch of student's T
densities is gradually fit to approximate the posterior. We estimate
the parameters of each mixture component by an
Expectation-Maximization (EM) algorithm. We also propose online
approaches to tune the number of the mixture components, whose
invocation is guided by the effective sample size (ESS) when
performing IS. Both simulation studies and real data analysis
demonstrate the great efficiency of this method.
\end{abstract}
\section{Introduction}
In Bayesian statistics, accurately calculating the marginal
likelihood required in Bayesian model selection can be difficult.
The importance sampling method, for example, often breaks down in
practice because its accuracy depends on designing an importance
function that mimics the integrand, and building such a function can
be quite difficult even in lower dimensional settings.
In this paper, we are concerned with the Importance sampling (IS)
method for calculating marginal likelihoods. In particular, we
propose an adaptive method for constructing an importance function
that resembles the posterior. Section 2 outlines the scientific
issues that inspired this work, namely inferring the number of
planets around distant stars. The exoplanet application uses models
with $2+5p$ parameters where $p$ is the number of planets in the
model, so even a single planet model has a seven dimensional
parameter space to integrate over. Section 3 describes in detail how
importance sampling can be used to calculate marginal likelihoods
for model selection. Section 4 explains the method we developed. Our
method employs the annealing idea to deal with isolated peaky modes
in the posterior and the mixture modeling technique to encapsulate
the information on the posterior obtained so far into a mixture
model. It starts by defining a set of intermediate target
distributions between an easy-to-handle distribution to the
posterior. It then adapts a mixture model to fit these distributions
one by one. The resulting mixture model will finally mimic the
posterior. We detail the relationships between our method and the
existing methods in the literature at the end. Section 5 presents
two simulations, one in three dimensions and the other in seven, in
which the target functions were built to be difficult to integrate.
Section 6 reports real data analysis results and Section 7 discusses
the method and then concludes this paper.
\section {Bayesian Exoplanet Searches Based on RV technique}
In this section, we describe the radial velocity models and priors
used in this paper. We then formulate the problem of exoplanet
searches as a Bayesian model selection problem. See details in
\citep{bullard2009edc,Crooks_2007,ford2006improving}).
\subsection {Radial Velocity Model}
In a two-body system such as a star and planet, the pair rotate
together about a point lying somewhere on the line connecting their
centers of mass. If one of the bodies (the star) radiates light, the
frequency of this light measured by a distant observer will
cyclically vary with a period equal to the orbital period. This
Dopplar Effect is understood well enough that astronomers can
translate between frequency shift and the star's velocity toward or
away from Earth. The data collected by astronomers are
$y=\{t_i,v_i,\sigma_i\}_{i=1}^o$, where $t_i$ is the time of the
$i$th observation, $v_i$ is the observed velocity component at time
$t_i$ and $\sigma_i$ is the estimated error of the velocity
observation. Additionally, there is a source of (presumably)
Gaussian error from "stellar jitter", the random fluctuations in a
star's luminosity. The observed velocities are related to the
orbital parameters through
\begin{equation}\label{Velocity_Model}
v_i\thicksim \mathcal{N}\left(C+\bigtriangleup
V(t_i|\phi),\sigma_i^2+s^2\right),
\end{equation}
where $C$ is the constant center-of-mass velocity of the star
relative to Earth, $\bigtriangleup V(t_i|\phi)$ is a function to be
defined below, and $s^2$ is the 'stellar jitter' variance. The full
set of parameters $\{C,s^2,\phi\}$ will be denoted $\theta$.
Given the data set $y$, we consider a set of competing models
$\mathcal{M}_i$, $i\in\{0,1,\ldots\}$, where $\mathcal{M}_i$ has $i$
planet(s).
If a star does not host any orbiting planets, then the RV
measurements will be roughly constant over any period of time,
varying only due to the stellar jitter. Hence the function
$\bigtriangleup V(t_i|\phi)$ is zero for any value of $t$:
\begin{equation}\label{Velocity_0p_Model}
\bigtriangleup V(t|\phi)=0, \forall t\in \mathbb{R}.
\end{equation}
In this case, $\theta$ only has two elements: $C$ and $s^2$, and
Equation (\ref{Velocity_Model}) becomes
\begin{equation}\label{0p_Model}
v_i\thicksim \mathcal{N}\left(C,\sigma_i^2+s^2\right).
\end{equation}
Because of the orbital eccentricity and the fact that we do not
normally observe planetary systems side-on, the velocity shift vs.
time plot will not but a sinusoid, but will be a member of a more
complicated family of curves parameterized by
$\phi=(K,P,e,\omega,\mu_0)$, where $K\geq 0$ is the velocity
semi-amplitude, $P>0$ is the orbital period, $0\leq e\leq1$ denotes
eccentricity, $0\leq\omega\leq2\pi$ is argument of periastron and
$0\leq \mu_0\leq2\pi$ denotes the mean anomaly at time $t=0$. Given
these parameters, the velocity shift, due to the single planet, is
then just
\begin{equation}\label{Velocity_1p_Model}
\bigtriangleup V(t|\phi)=K[\cos(\omega+T(t))+e\cos(\omega)].
\end{equation}
Here $T(t)$ is the "true anomaly at time $t$" and is itself given by
\begin{equation}\label{true_anomaly}
T(t)=2\arctan\left[\tan(\frac{E(t)}{2})\sqrt{\frac{1+e}{1-e}}\right].
\end{equation}
where $E(t)$ is called the "eccentric anomaly at time $t$"; it's the
solution to the transcendental equation
\begin{equation}\label{transcendental_equation}
E(t)-e\sin(E(t))=\mbox{mod}\left(\frac{2\pi}{P}t+\mu_0,2\pi\right).
\end{equation}
If there are $j\geq2$ planets, the overall velocity shift
$\bigtriangleup V$ is simply the sum of the independent velocity
shifts of the individual planets:
\begin{equation}\label{Velocity_1p_Model}
\bigtriangleup V(t|\phi_1,\ldots,\phi_j)=\sum_{i=1}^j
K_i[\cos(\omega_i+T_i(t))+e_i\cos(\omega_i)].
\end{equation}
Note that these planets' mutual gravitational interactions are
negligible here.
Of course we do not know how many planets there are \emph{a priori}
- indeed, finding the number is a major aim. If we assume there are
at most $J$ planets, then there are a total $2+5j$ parameters,
$\theta_j=\{\mathcal{C},s^2,\phi_1,\ldots,\phi_j\}$ for each
possible number of planets $j\in\{1,\dots,J\}$.
\subsection{Priors}
We adopt the priors recommended by \cite{ford2006bms}, which is
based in part on mathematical convenience, and in part on their
approximate realism.
For $\mathcal{M}_0$ in (\ref {0p_Model}), the prior is set such that
the parameters are independent, and their marginal distributions are
given by:
\begin{equation}\label{prior_C}
p_C(C)=\left\{\begin{aligned}
\frac{1}{C_{max}-C_{min}} &~ \quad\quad\mbox{for} &~C_{min}\leq C\leq C_{max}\\
0\quad\quad &~ \quad\quad\mbox{otherwise} & \\
\end{aligned}\right.
\end{equation}
and
\begin{equation}\label{prior_sigma}
p_{s}(s)=\left\{\begin{aligned}
\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0+s} &~ \quad\quad\mbox{for} &~0<s\leq s_{max}\\
0\quad\quad &~ \quad\quad\mbox{otherwise} & \\
\end{aligned}\right.
\end{equation}
Finally the complete prior is given by:
\begin{equation}\label{prior_0p}
p_0(\theta_0)=p_0(C,s)=\frac{1}{C_{max}-C_{min}}\cdot\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0+s},
\end{equation}
where $C_{min}\leq C\leq C_{max}, 0<s \leq s_{max}.$
Before introducing prior setting for the parameter of
$\mathcal{M}_1$, we present the parameter transformations
recommended by \citep {bullard2009edc}:
\begin{itemize}
%\item Translate the data $t_i$ (essentially reparameterizing $\mu_0$)
%so that $t=0$ occurs in the middle of the observations$-$here we
%place it at the weighted mean of the observation times, with the
%weights being inversely proportional to the measurement errors. This
%reduces correlations between $\mu_0$ and $P$.
\item Use Poincar$\acute{\mbox{e}}$ variables $x\equiv e\cos\omega$ and
$y\equiv e\sin\omega$ instead of $e$ and $\omega$ to reduce (very
strong) correlations between $\mu_0$ and $\omega$. This is
particularly important for low eccentricity orbits.
\item Use $z\equiv (\omega+\mu_0) \mbox{mod}2\pi$ instead of $\mu_0$ to further reduce correlations between these two
parameters when $e\ll1$; for larger $e$, it still doesn't hurt.
\end{itemize}
These transformations aim to reduce the correlations between the
elements in the parameter space defined by
$(C,K,P,e,\omega,\mu_0,s^2)$, and make the posterior as Gaussian as
possible \citep {bullard2009edc}. It's also recommended to work in
$\dot{P}\equiv \log P$ rather than in $P$, in $\dot{K}\equiv \log K$
rather than in $K$, and in $\dot{s}\equiv \log s$ rather than in $s$
\citep{bullard2009edc}, to make the posterior be more Gaussian. In
that case, the model parameter becomes
$\theta=\{C,\dot{K},\dot{P},x,y,z,\dot{s}\}$. We adopt the above
transformed parameters, but note that the efficiency of the proposed
method will not require the posterior to be Gaussian or depend on
any parameter transformation.
In the transformed parameter space, the prior is set as follows:
\begin{equation}\label{prior_1p}
p_1(\theta_1)=\kappa_1\cdot\exp\dot{K}\cdot\frac{1}{1+\frac{\exp\dot{K}}{K_0}}\cdot\frac{1}{\sqrt{x^2+y^2}}\cdot\exp\dot{s}\cdot\frac{1}{1+\frac{\exp\dot{s}}{s_0}},
\end{equation}
where
\begin{equation}\label{kappa_in_prior_1p}
\kappa_1=\frac{1}{C_{max}-C_{min}}\cdot\frac{1}{\log\left(1+\frac{K_{max}}{K_0}\right)}\cdot\frac{1}{K_0}\cdot\frac{1}{\log\left(\frac{P_{max}}{P_{min}}\right)}\cdot\left(\frac{1}{2\pi}\right)^2\cdot\frac{1}{\log\left(1+\frac{s_{max}}{s_0}\right)}\cdot\frac{1}{s_0},
\end{equation}
and $C_{min}\leq C\leq C_{max}$, $\dot{K}\leq\log(K_{max})$,
$\log(P_{min})\leq\dot{P}\leq\log(P_{max})$, $x^2+y^2<1$, $0\leq
z\leq2\pi$, and $\dot{s}\leq\log(s_{max})$.
The involved constants are set based upon the physical realities
(e.g., an orbit with too small a period will result in the planet
getting consumed by the star) \citep{bullard2009edc}, as follows,
\begin{eqnarray}
P_{min} & \equiv & 1\mbox{day}\\
P_{max} & \equiv & 1000\mbox{years} \\
K_0 & \equiv & 1\mbox{ms}^{-1}\\
K_{max} & \equiv & 2128 \mbox{ms}^{-1} \\
C_{min} & \equiv & -K_{max}\\
C_{max} & \equiv & K_{max}\\
s_0 & \equiv & 1 \mbox{ms}^{-1}\\
s_{max} & \equiv & K_{max}
\end{eqnarray}
\subsection{Using Bayesian Method to Count Planets}
From an astronomer's point of view, the most important question is
how to schedule telescope time in order to maximize the probability
of doing something useful such as determining the total number of
planets in the system or detecting at least one planet. Let us
examine the first of these aims. Finding the number of planets in a
system is, from a statistics point of view, a model selection
problem. Bayesian model selection requires calculation of marginal
likelihoods of two or more models in order to calculate Bayes
Factors,
\begin{equation}
BF(\mathcal{M}_i,\mathcal{M}_j)=\frac{m(y|\mathcal{M}_i)}{m(y|\mathcal{M}_j)}.
\end{equation}
Here $m(y|\mathcal{M}_i)$ is the marginal likelihood of the model
with $i$ planets, which is the likelihood integrated over the prior:
\begin{equation}\label{marginal_lik}
m(y|\mathcal{M}_i=\int
L(y|\theta_i,\mathcal{M}_i)p(\theta_i|\mathcal{M}_i)d\theta_i.
\end{equation}
\section{Importance Sampling for Marginal Likelihood Estimation}
In many situations the integral in Equation (\ref{marginal_lik}) is
intractable. However, it can be calculated approximately using
importance sampling (IS). IS involves designing an easy-to-use
probability density called the importance function, $q(\theta)$.
Let $\Z=p(y|\mathcal{M})$, we have
\begin{equation}\label{IS_for_marginal_lik}
\Z=\mathbb{E}_q\left\{\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}\right\}
\end{equation}
Then, given a random sample from $q$, $Z$ can be unbiasedly and
consistently estimated by
\begin{equation}\label{IS_solution_marlik}
\hat{\Z}=N^{-1}\sum_{i=1}^N\frac{L(\theta^i|y)p(\theta^i|\mathcal{M})}{q(\theta^i)}=N^{-1}\sum_{i=1}^N
w^i, \mbox{where}\quad \theta^i\sim Q, i=1,\ldots,N
\end{equation}
Here we just let
$w^i=\frac{L(\theta^i|y)p(\theta^i|\mathcal{M})}{q(\theta^i)}$.
Therefore, for large values of $N$, $\hat{\Z}$ can be expected to be
a good approximation of $\Z$. The distribution of the error
$|\hat{\Z}-\Z|$ can be approximated via the central limit theorem
\begin{equation}
\mbox{Var}_q\left\{\hat{\Z}\right\}=\mathbb{E}_q\left\{\left(\hat{\Z}-\Z\right)^2\right\}=\frac{\sigma_q^2}{N}<\infty,
\end{equation}
when $\hat{\Z}$ has finite variance. Here
\begin{equation}
\sigma_q^2=\mbox{Var}_q\left\{\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}\right\}=\int_{R^d}\left(\frac{L(\theta|y)p(\theta|\mathcal{M})}{q(\theta)}-\Z\right)^2
q(\theta)d\theta=\Z^2\mathbb{E}_q\left\{\left(\frac{\pi(\theta)-q(\theta)}{q(\theta)}\right)^2\right\}.
\end{equation}
If the importance density is such that $\sigma_q^2<\infty$, then via
the central limit theorem
\begin{equation}
\frac{\hat{\Z}-\Z}{\sigma_q/\sqrt{N}}\rightarrow X
\end{equation}
as $N\rightarrow\infty$, where $X\sim \mathcal{N}(0,1)$. The above
results then implies that
$\hat{Z}-Z=\mathcal{O}(\frac{1}{\sqrt{N}})$.
One can construct confidence intervals to assess the quality of a
particular estimate of $\Z$ \citep{lefebvre2009path}. With a large
degree of certainty, the interval $\hat{\Z}\pm3\sigma_q/\sqrt{N}$
contains $\Z$ when $N$ is large and so the half length
$3\sigma_q/\sqrt{N}$ indicates how accurate $\hat{\Z}$ is. Since
$\sigma_q$ is unknown, a natural estimator of $\sigma_q^2$ is the
usual consistent variance estimator
\begin{equation}
\sigma_q^2\simeq\frac{1}{N}\sum\limits_{i=1}^N\left(w^i-\hat{\Z}\right)^2.
\end{equation}
The efficiency of importance sampling depends largely on how closely
the importance function mimics the shape of the target,
$L(y|\theta)p(\theta)$. A number of things can go wrong. If the
importance function has thinner tails than the target, the estimate
will tend to be unstable because the weights can be arbitrarily
large; on the other hand if the importance function has much fatter
tails than the target, too many of the $\theta_i$'s will be drawn
from the tails, resulting in a seemingly stable but biased estimate.
The best case is an importance function with slightly heavier tails
than the target. Of course asymptotically it should not matter which
importance function is used; for samples of practical size, though,
a poor choice can be disastrous.
These problems are only compounded in high-dimensional settings
where the target is often extremely complicated. Unless one is very
lucky indeed it is usually impossible to get accurate results in
real-life situations with dimension greater than 3.
\section{Annealed Adaptive IS Method for Constructing Importance Function}
In this section, we propose the adaptive mixture modeling approach,
which is used for constructing importance function that resembles
the posterior and then is used in (\ref{IS_solution_marlik}) for
calculating the marginal likelihood.
Let $\pi(\theta)$ denote the posterior distribution, which is
proportional to $L(y|\theta)p(\theta)$. Note that computing
$L(y|\theta)p(\theta)$ for any $\theta$ is feasible, but that we are
not able to directly sample from the distribution it defines, i.e.
$\pi(\theta)$. The aim is to find a distribution, $q(\theta)$, that
approximates $\pi(\theta)$, which we are also able to evaluate.
We first select an initial proposal distribution $q_0(\theta)$, then
define a series of intermediate distributions between $q_0(\theta)$
and $\pi(\theta)$, by
\begin{equation}
\pi_t(\theta)\propto
q_0(\theta)^{1-\lambda_t}\pi(\theta)^{\lambda_t}, \quad\quad
t=1,\ldots,T, \quad \mbox{and}\quad
0<\lambda_1<\lambda_2<\dots<\lambda_T=1.
\end{equation}
Finally we propose an iterative method to construct $q(\theta)$, as
shown in Algorithm 1.
\begin{center}
\textbf{Algorithm 1. Iterative Proposal Construction}
\end{center}
\begin{itemize}
\item Use $\pi_1(\theta)$ as the target distribution, with $q_0(\theta)$
as the proposal. Tune the parameters of $q_0(\theta)$ to obtain an
updated probability, denoted $q_1(\theta)$, that approximates
$\pi_1(\theta)$.
\item Similarly as in step 1, tune parameters of $q_1(\theta)$ to
obtain an updated proposal,$q_2(\theta)$, which is used to
approximate $\pi_2(\theta)$.
\item $\dots$.
\item Use $\pi_T(\theta)$ as the target distribution, tune parameters of
$q_{T-1}(\theta)$ to obtain $q_{T}(\theta)$, that approximates
$\pi_T(\theta)=\pi(\theta)$. Then $q_{T}(\theta)$ is actually the
importance function which we want.
\end{itemize}
We see that the annealing strategy is adopted in the above iterative
procedure. In what follows, we propose an adaptive method to handle
the task of updating $q_{t-1}(\theta)$ to $q_{t}(\theta)$ in
Algorithm 1.
\subsection{Multivariate Student's T-mixture Model}
Considering possible complexities in the structure of
$\pi_t(\theta)$, we employ a mixture model to define
$q_{t}(\theta)$. Note that any continuous probability density
function can be approximated by a finite mixture model, provided
that the mixture has a sufficient number of components along with
correctly estimated parameters
\citep{bishop2005neural,zeevi1997det}. In particular, we equip
$q_{t}(\theta)$ with the Student's T distribution as the mixture
component, making it easier to detect modes that are far apart
thanks to its fat tail structure. Let $\mathcal{S}_d(\mu,\Sigma,v)$
denote a $d$-variate Student's T distribution, where $\mu$ is the
center, $\Sigma$ the positive definite inner product matrix, and $v$
is the degrees of freedom. We just fix $v$ to be a constant, e.g., 5
in the following. So the mixture model we use is:
\begin{equation}\label{Definition_mixture}
q(\theta|\psi)=\sum\limits_{m=1}^M \alpha_{m}
\mathcal{S}_d(\theta|\mu_m,\Sigma_m)
\end{equation}
where
$\psi=\{M,\alpha_1,\ldots,\alpha_M,\mu_1,\ldots,\mu_M,\Sigma_1,\ldots,\Sigma_M\}$
is the mixture parameter. Here $\alpha_m$ is the probability mass of
the $m$th component. It satisfies that $\alpha_{m}\geq
0,\sum_{m=1}^M \alpha_{m}=1$. So the task of fitting a T-mixture
model to a probability density translates into seeking suitable
parameter values for $\psi$.
\subsection {Proposal Adaptation via IS plus EM}\label{sec:IS_EM}
Let's focus on iteration $t$ in Algorithm 1. We first simulate an
random sample, $\theta^n$, from $q_{t-1}(\theta)$, with the
importance weight calculated by
\begin{equation}
w^n=\frac{\frac{\pi_{t}(\theta^n)}{q_{t-1}(\theta^n)}}{\sum_{n=1}^N\frac{\pi_{t}(\theta^n)}{q_{t-1}(\theta^n)}},
\end{equation}
where $N$ denotes the sample size. The resulting weighted particle
set then can be seen as a particle approximation to the current
target distribution $\pi_t(\theta)$, i.e.
\begin{equation}\label{delta_mass_approx}
\pi_t(\theta)\simeq\sum\limits_{n=1}^Nw^i\delta(\theta^i),
\end{equation}
where $\delta(\cdot)$ is the delta mass function.
Here we adopt the Kullback-Leibler (KL) divergence to measure the
distance between any two distributions. Then the aim is to obtain a
proposal, $q_t(\theta)$, that minimizes the KL distance with respect
to $\pi_t(\theta)$, which is shown to be
\begin{equation}\label{KL}
\mathcal{D}[\pi_t(\theta)||q_t(\theta|\psi)]=\mathbb{E}_{\pi_t(\theta)}\left[\log\frac{\pi_t(\theta)}{\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)}\right].
\end{equation}
where $\mathbb{E}_f$ denotes expectation with respect to a function
$f$. It's clear that minimizing (\ref{KL}) in $(\alpha,\mu,\Sigma)$
is equivalent to maximizing
\begin{equation}\label{MLE}
\int\pi_t(\theta)\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta|\mu_m,\Sigma_m)\right).
\end{equation}
Substituting Equation (\ref{delta_mass_approx}) in the above, we
have
\begin{equation}\label{Max_log_lik}
\sum_{i=1}^N w^i\log\left(\sum\limits_{m=1}^M
\alpha_m\mathcal{S}_d(\theta^i|\mu_m,\Sigma_m)\right).
\end{equation}
At this stage, the task translates into a mixture-density parameter
estimation problem, that is, how to optimize the parameter values of
$\psi$ to make the resulting mixture density function fit the data
$\{\theta^n,w^n\}_{n=1}^N$ as well as possible.
Given data component $\theta^n$, the posterior probability of each
mixture component can be obtained according to the Bayes rule:
\begin{equation}\label{posterior_prob_component}
\rho_m(\theta^n)=\alpha_m
\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)/q(\theta^n),
\end{equation}
where $\alpha_m$ is regarded as the prior, and
$\mathcal{S}_d(\theta^n;\mu_m,\Sigma_m)$ is the associated
likelihood. The proportion mass of this component can then be
updated by
\begin{equation}\label{MLE_alpha}
\alpha_m=\sum\limits_{n=1}^N w^n \rho_m(\theta^n).
\end{equation}
And each pair of $\{\mu_m,\Sigma_m\}$ can be updated by the
following Equations:
\begin{equation}
\mu_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)\theta^n}{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)},
\end{equation}
\begin{equation}\label{MLE_Sigma}
\Sigma_m=\frac{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)u_m(\theta^n)C_n }{\sum\limits_{n=1}^N w^n
\rho_m(\theta^n)},
\end{equation}
where
\begin{equation}
u_m(\theta)=\frac{v+d}{v+(\theta-\mu_m)^T(\Sigma_m)^{-1}(\theta-\mu_m)}
\end{equation}
and $C_n=(\theta^n-\mu_m)(\theta^n-\mu_m)'$, where $'$ denotes
matrix transposition.
The operations from (\ref{posterior_prob_component}) to
(\ref{MLE_Sigma}) constitute one iteration of the EM algorithm,
which is used to do mixture density parameter estimation
\citep{peel2000rmm}. Let $q_t(\theta)$ be a T-mixture model, with
the updated parameter values $\{\alpha_m,\mu_m,\Sigma_m\}$, and it's
expected to be an approximation to $\pi_t(\theta)$. The efficiency
of $q_t(\theta)$, as an approximation of $\pi_t(\theta)$, can be
measured by the KL divergence using (\ref{Max_log_lik}), or the
effective sample size, which is defined to be
\begin{equation}\label{ESS}
\ESS=\frac{1}{\sum_{n=1}^N w_n^2}.
\end{equation}
Note that these importance weights are associated with the random
sample drawn from $q_t(\theta)$. The \ESS was originally proposed by
\cite{kong1994sequential} to measure the overall efficiency of an IS
algorithm. Heuristically it measures how many i.i.d samples are
equivalent to the $N$ weighted samples. The \ESS is related to the
coefficient of variation (CV) of the importance weights (see
\cite{kong1994sequential}), which finds applications in
\cite{ardia2008adaptive,cappe2008ais,cornebise2008adaptive}. It's
shown that \ESS is determined by the associated proposal and the
particle size, $N$, while, once $N$ is big enough, the ratio of \ESS
to $N$ should be stable. So in this paper, we use \ESS$/N$ as a
quantity to monitor the efficiency of $q_t(\theta)$. When \ESS$/N$
is big enough, let Algorithm 1 go to next iteration. If \ESS$/N$ is
smaller than a given threshold, we continue to update $q_t(\theta)$
until it satisfies the requirement, i.e. the resulting \ESS$/N$ is
big enough. In this way, we can make sure that the finally obtained
$q_t(\theta)$ is really a good approximation to $\pi_t(\theta)$.
Suppose that currently \ESS$/N$ is not big enough, we consider two
approaches that are used to update $q_t(\theta)$ again. First we
check whether the particle with the biggest importance weight,
denoted $\theta^{W}, W\in\{1,\ldots,N\}$, is located in the tail of
the proposal. If it is, which means $q_t(\theta^{W})$ is smaller
than the proposal densities of most other particles, we then select
to perform an operation called component-splitting (see Section
\ref{sec:split}), otherwise, we perform the IS plus EM procedure
again. The reason for doing this is that the IS plus EM procedure is
exactly a local search approach, because it fixes the number of
mixture components to be a constant and the EM itself is a local
optimization method, while the component-splitting procedure will
break the local model structure by splitting one mixture component
into two, thus facilitates to find the global or, at least a better
local optimum (see Section \ref{sec:split}).
\subsection {Online Approach to Tune the Number of the Mixture Components}
In the above IS plus EM procedure, the number of components $M$ is
imposed to be constant, while this assumption is of course not
reasonable. To deal with this issue, we propose a series of adaptive
mechanisms to adapt $M$ online, which include operations called
components-merging, components-splitting and components-deletion,
respectively. We'll provide respective conditions, on which each of
these procedures needs to be performed.
\subsubsection{Components Merging}
This step targets for merging mixture components that have much
mutual information among each other. To begin with, we first
introduce the concept, mutual information. Suppose that a set of
equally weighted data components $\{\theta^n\}_{n=1}^N$ are
available, whose distribution is modeled by a mixture density
function $q=\sum_{m=1}^M\alpha_mf_m$. Given any data component
$\theta^n$, the posterior probability of each mixture component
$f_m$ can be calculated by
\begin{equation}
\rho_m(\theta^n)=\alpha_m f_m(\theta^n)/q(\theta^n).
\end{equation}
If there are two mixture components $f_i$, $f_j$, for which it
satisfies that $\rho_i(\theta^n)=\rho_j(\theta^n)$ for any
$n\in\{1,\ldots,N\}$, then $f_i$ and $f_j$ are regarded to contain
completely overlapped information. Inspired by the components
merging criterion proposed in \cite{ueda2000sam,wang2004estimation},
we define the mutual information between $f_i$ and $f_j$ to be:
\begin{equation}\label{mutual_info}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)^T(\Upsilon_j-\bar{\Upsilon}_j)}{\|\Upsilon_i-\bar{\Upsilon}_i\|\cdot\|\Upsilon_j-\bar{\Upsilon}_j\|},
\end{equation}
where
$\Upsilon_m=[\rho_m(\theta^1),\ldots,\rho_m(\theta^N)]'$, $'$
denotes transposition of a matrix, $\|\cdot\|$ denotes the Euclidean
norm, and $\bar{\Upsilon}_m=\frac{1}{N}\sum_{n=1}^N
\rho_m(\theta^n)\textbf{1}_N$. Here $\textbf{1}_n$ denotes the
$n-$dimensional column vector with all elements being $1$.
Note that $\MI(i,j)\in[-1,1]$, and $\MI(i,j)=1$ iff $f_i$ and $f_j$
contain completely overlapped information.
In our algorithm framework, each data point $\theta^n$ is weighted
by $w^n$. Accordingly, $\Upsilon_m$ should be $\sum_{n=1}^N w^n
\rho_m(\theta^n)$, and the mutual information between $i$, $j$ turns
out to be:
\begin{equation}\label{Merge_criterion_correlation}
\MI(i,j)=\frac{(\Upsilon_i-\bar{\Upsilon}_i)'
D(w)(\Upsilon_j-\bar{\Upsilon}_j)}{\sqrt{\sum_{n=1}^N
w^n(\Upsilon_i(\theta^{n})-\bar{\Upsilon}_i)^2}\sqrt{\sum_{n=1}^N
w^n(\Upsilon_j(\theta^{n})-\bar{\Upsilon}_j)^2}}
\end{equation}
where
\begin{equation}
D(w)=diag([w^1,\ldots,w^N])
\end{equation}
We judge if two components, $i$ and $j$, need to be merged into one
by comparing $\MI(i,j)$ with a threshold $0<T_{merge}<1$. If
$\MI(j,k)>T_{merge}$, we then merge the $i$th component into $j$,
and set $M=M-1$; otherwise, we maintain the original mixture. The
merging operation is specified to be:
\begin{eqnarray}\label{Linear_combination_components}
\alpha_{j}&=&\alpha_{i}+\alpha_{j}\\
\mu_{j}&=&\frac{\alpha_{i}\mu_{i}+\alpha_{j}\mu_{j}}{\alpha_{j}}\\
\Sigma_{j}&=&\frac{\Sigma_{i}\mu_{i}+\Sigma_{j}\mu_{j}}{\alpha_{j}}
\end{eqnarray}
which is a linear combination of the two old components, and has
been used in \cite{wang2004estimation}. We traverse each pair of the
original components to do the merging judgements and perform the
above merging operation if the merging condition satisfies.
We execute the above merging operation when superfluous components
may exist, e.g., when $q_0(\theta)$ is initialized with too many
overlapped components, or too many new components appear at one
iteration of Algorithm 1 via component-splitting (see Section
\ref{sec:split}).
\subsubsection{Component-splitting}\label{sec:split}
As mentioned in Section \ref{sec:IS_EM}, it may happen that, after
performing the IS plus EM procedure, \ESS$/N$ was not big enough,
and the maximum weight particle was located in the tail of
$q_t(\theta)$. In that case, we perform the mechanism proposed here,
called component-splitting. This procedure is used to break the
local structure of $q_t(\theta)$ by splitting one component into
two, whereby the component number, $M$, will increase to $M+1$.
This Component-splitting approach is performed as follows. First,
when we simulate draws from $q_t(\theta)$ for evaluating the
efficiency of $q_t(\theta)$, we record the component origination of
each draw. We then see the parent component of the maximum weight
sample, $\theta^W$, denoted
$f_p=\mathcal{S}_d(\theta|\mu_p,\Sigma_p)$, and know that which
samples were originated from $f_p$. We encapsulate the samples that
come from $f_p$ to a data array,
$\mathcal{\theta}_{local}=\{\theta^{1},\ldots,\theta^{pN}\}$, then
split $f_p$ into two components$-$
$f_{c1}=\mathcal{S}_d(\theta|\mu_{c1},\Sigma_{c1})$ and
$f_{c2}=\mathcal{S}_d(\theta|\mu_{c2},\Sigma_{c2})$, and finally
replace $f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$.
Next we give operations to get
$\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$. We first initialize
$\mu_{c1}=\theta^W$, $\mu_{c2}=\mu_p$,
$\Sigma_{c1}=\Sigma_{c2}=\Sigma_p$, and then specify a local mixture
$q_{local}=\alpha_{c1}f_{c1}+\alpha_{c2}f_{c2}$, where
$\alpha_{c1}=\alpha_{c1}=0.5$, finally we perform a local IS plus EM
procedure to update
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$.
Such local IS plus EM procedure starts by evaluating the local
weight for each $\theta^{i}$ in $\mathcal{\theta}_{local}$ via
\begin{equation}
w^i=\frac{\frac{\pi_{t}(\theta^{i})}{q_{local}(\theta^{i})}}{\sum_{n=1}^{pN}\frac{\pi_{t}(\theta^n)}{q_{local}(\theta^n)}}.
\end{equation}
Substituting $\{\theta^i,w^i\}_{i=1}^{pN}$ and the local mixture
parameters,
$\alpha_{c1},\alpha_{c2},\mu_{c1},\Sigma_{c1},\mu_{c2},\Sigma_{c2}$
in the Equations from (\ref{MLE_alpha}) to (\ref{MLE_Sigma}), we
then can obtain the updated parameter values for the local mixture
$q_{local}$. Suppose that $f_p$ is assigned a proportion mass
$\alpha_p$ in $q_t(\theta)$, we assign $f_{c1}$ and $f_{c2}$
proportion masses $\alpha_{c1}=\alpha_p\alpha_{c1}$ and
$\alpha_p\alpha_{c2}$, respectively, for the updated $q_t(\theta)$,
in which $f_p$ is replaced by $f_{c1}$ and $f_{c2}$.
This component-splitting method makes use of the local behavior of
the target distribution around the maximum weight sample. It adds
mixture proposal probability mass to those areas of the parameter
space which is near the maximum weight sample, where there is
relatively little mixture proposal probability mass. As the local IS
plus EM procedure is actually a local mixture learning process, by
which the local structure around the maximum weight sample will be
modeled well by the updated local mixture, which then will elicit a
better global mixture representation of the whole target
distribution, which may lead to a satisfactory \ESS$/N$ value. Note
that the above component-splitting procedure may be repeated
provided that the condition of executing it satisfies.
\subsubsection{Component-Deletion} \label{sec:deletion}
As mentioned above, when simulating draws from $q_t(\theta)$, we can
record the component origination of each sample. For any mixture
component that generates no sample, we deem it has taken no effect
in modeling the current target distribution, thus we just delete it
in current mixture proposal. Its probability masses is then
redistributed strengthening the remaining components.
\subsection{Practical Issues to be Considered}
EM based mixture modeling approach provides a convenient and
flexible utility for estimating or approximating distributions,
while the mixture log-likelihood presents a well known problem,
i.e., degeneracy toward infinity
\citep{biernacki2003degeneracy,biernacki2007degeneracy} (actually EM
also suffers from the problem of slow convergence, so it may
converge to a local optimum \citep{biernacki2003degeneracy},
however, we don't need to worry about that, because on one side the
proposed component-splitting procedure facilitates to search the
global optimum, and, one the other side, the objective here is just
to find a good mixture importance function, which is not necessary
to be the global optimum). To prevent the mixture log-likelihood
from arising to infinity, we assign an inverse-wishart prior on the
covariance matrix of each mixture component, then perform EM to get
the \emph{maximum a posterior} (instead of maximum likelihood)
estimate of the covariance matrix.
In the component-splitting step, the parameter updating of $f_{c1}$
and $f_{c2}$ is based upon $pN$ particles originated from $f_p$. In
case of $pN$ being too small, e.g. smaller than a threshold
specified beforehand, denoted $N_s$, we need to generate another set
of $N_s-pN$ samples from $f_p$, then do the parameter updating for
$f_{local}$ based upon the $N_s$ particles. By doing this, we can
guarantee that we are using a sufficient number of random draws to
capture the local structure in the posterior, that will then be
represented by $f_{local}$.
In the above mentioned component-splitting step, when we replace
$f_p$ in $q_t(\theta)$ by $f_{c1}$ and $f_{c2}$, we assign the
proportion mass of $f_p$ to $f_{c1}$ and $f_{c2}$ proportionally to
their original weights, i.e. $\alpha_{c1}$ and $\alpha_{c2}$,
respectively. Note that if $\alpha_p$ is a very small value, e.g.,
$0.0001$, the new added mixture components, $f_{c1}$ and $f_{c2}$,
will only take an insignificant effect in the updated $q_t(\theta)$,
while $f_{c1}$ and $f_{c2}$ may contain new and important
information about the posterior. To guarantee that $f_{c1}$ and
$f_{c2}$ can play a role in $q_t(\theta)$, at lease to some extent,
we propose the following operations. We first specify a threshold
$\alpha_{threshold}$, which may be, e.g., 0.1 or 0.2. Then, if
$\alpha_p>\alpha_{threshold}$, we assign weights to $f_{c1}$ and
$f_{c2}$ as usual as shown in Section \ref{sec:split}. Otherwise, we
assign $\alpha_{threshold}\alpha_{c1}$ and
$\alpha_{threshold}\alpha_{c2}$ to $f_{c1}$ and $f_{c2}$,
respectively. So the sum of weights of $f_{c1}$ and $f_{c2}$ in
$q_t(\theta)$ will be $\alpha_{threshold}$. To make sure the sum of
the weights of the mixture components equals to 1, we reduce the
weights of the components, other than $f_{c1}$ and $f_{c2}$,
proportionally to their original weights.
\subsection{Relationships to Existing Methods}
As the proposed AAIS method utilizes the idea of 'annealing' as a
way of searching isolated modes, it has close relationships to a
series of MCMC methods, which are based on the "thermodynamic
integration" technique for computing marginal likelihoods, and are
discussed under the name of Bridge and Path sampling in
\cite{gelman1998simulating}, parallel tempering MCMC in
\cite{gregory2010bayesian,gregory2005bayesian} and Annealed
Importance Sampling (AIS) in \cite{neal2001ais}. The central idea of
these methods is to divide the ratio of two normalization constants,
which we can think of for our purposes as the ratio of marginal
likelihoods, $\Z_t$ and $\Z_0$, which are associated to
$\pi(\theta)$ and $q_0(\theta)$, respectively, into a series of
easier ones, parameterised by inverse temperature, $\lambda$. In
detail:
\begin{equation}
\frac{\Z_t}{\Z_0}=\frac{\Z_1}{\Z_0}\frac{\Z_2}{\Z_1}\dots\frac{\Z_t}{\Z_{t-1}}.
\end{equation}
Each of the ratios is estimated using samples from a MCMC method.
For example, to compute $\frac{\Z_t}{\Z_{t-1}}$, we should have
samples that are drawn from equilibrium from the distribution
defined by
$\pi_{t-1}(\theta)\propto\pi(\theta)^{\lambda_{t-1}}q_0(\theta)^{1-\lambda_{t-1}}$.
Therefore, these methods require us to devise a single Metropolis
proposal at each temperature, which should be a troublesome issue in
practice, since it is not prior known the target distribution's
structure at each temperature. In addition, the involvement of MCMC
results in annoying questions, e.g. how long should the 'burn-in'
period last, and how many samples are needed to get a reliable
estimate, which should be considered at each temperature.
Striking differently, the proposed AAIS is developed from the
perspective of mixture modeling, that is, how to adapt a mixture
model to resemble the posterior, while the similar temperature idea,
i.e., 'annealing', is also involved target for searching isolated
peaky modes. We only need to select an initial mixture proposal
$q_0$, the annealing temperatures $\lambda_t$, and up to three
thresholds, after which, the mixture adaptation process is
completely adaptive, and finally the marginal likelihood is simply
obtained by substituting the resulting mixture as the importance
function in (\ref{IS_for_marginal_lik}). Such property of
easy-to-implementation results from the adaptive techniques involved
in the algorithm, including the EM based parameter estimation for
each mixture component, and the proposed online approach used for
adjusting the cardinality of the mixture, i.e. the number of mixture
components.
This AAIS algorithm falls within a general algorithmic framework,
called Sequential Monte Carlo Sampler (SMCS), which is proposed in
\cite{del2007sequential,del2006sequential}. However, SMCS utilizes
the same "thermodynamic integration" technique for calculating
marginal likelihoods, so it inherits all drawbacks mentioned above.
Exactly, the proposed AAIS method can be seen as a special case of
SMCS, which adopts adaptive independent mixture proposals as the
transition kernels (see details in
\cite{del2007sequential,del2006sequential}), and the IS, instead of
MCMC, for particle generation.
Finally, AAIS has connections to a bunch of existing adaptive IS
methods in the literature, for example,
\cite{evans1991chaining,oh1993imf,cappe2008ais,cappe2004pmc,west1993mixture,ardia2008adaptive},
to name just a few. The most distinguishing point lies in that our
method adopts the annealing strategy, which is well recognized as a
powerful approach to search separated, especially peaky, modes in
the posterior. From the mixture modeling point of view, the proposed
mixture adaptation technique can be seen as an intermediate between
the completely non-parametric method in \cite{west1993mixture} and
the EM based semi-parametric methods, which is used in the adaptive
mixture importance sampling (AMIS) approach \citep{cappe2008ais}.
The AMIS approach employs a fixed number of mixture components,
while our method and the nonparametric method
\citep{west1993mixture} can let the data speak for themselves how
many components should be contained in the mixture.
\section{Simulation Tests}
\label{sec:simulation}
In this Section, we present two simulations, one in three dimensions
and the other in seven, in which the target functions were built to
be difficult to integrate, but their shapes can be inferred through
their mathematical expressions. The objective of the simulation
study is to evaluate the efficiency of the importance function
yielded by AAIS, via comparing its shape with that of the real
target, and comparing the calculated marginal likelihood with the
real answer.
\subsection{Example 1: Three Dimensional Flared
Helix}\label{sec:example1}
We used a cylindrical function with standard Gaussian cross-section
(in ($x$, $y$)) that has been deformed (in $z$) into a helix with
varying diameter. The helix performs three complete turns. The
target function is defined to be
\begin{equation}\label{helix_func}
p(x,y,z)=\delta(-30<z\leq30)\mathcal{N}([x,y]|\mu,\Sigma)
\end{equation}
where $\mu(1)=(z+35)\times\cos(\beta)$ and
$\mu(2)=(z+35)\times\sin(\beta)$, where $\beta=(z+30)\times\pi/10$.
Here $\Sigma=\mbox{diag}([1,1])$. Fig.\ref{fig:flared_helix_target}
shows the shape of this function. And the integration of this
function is shown to be
\begin{equation}
\int\int\int p(x,y,z)dxdydz=\int\int\int p(x,y|z)dxdy
p(z)dz=\int_{-30}^{30} 1 dz=60
\end{equation}
Treating (\ref{helix_func}) to be $L(y|\theta)p(\theta)$, the
marginal likelihood in this simulation case is then 60.
\begin{figure}[!htb]
\centerline{\includegraphics[width=3in,height=2.4in]{Fig/flared_helix_target.png}}
\caption{The simulated 3-D flared helix target distribution}
\label{fig:flared_helix_target}
\end{figure}
Before running the proposed Adaptive Annealed IS (AAIS) algorithm,
we select the initial proposal to be a T-mixture that is composed of
10 equally weighted components. The degree of freedom for each T
component is fixed to be 5. The centers of these components are
selected uniformly from a region restricted by
$x\in[-100,100],y\in[-100,100]$ and $z\in[-30,30]$. Then the
covariance matrix for every component is identical to
$\mbox{diag}[\sigma_x^2,\sigma_y^2,\sigma_z^2]$, where $\sigma_a$ is
the standard error of the argument $a$ in the components' centers
that has been just specified. The particle size $N=2000$,
$\lambda_t=0.1t$, and $t=1,\ldots,10$.
The adaptive mixture importance sampling (AMIS) approach
\cite{cappe2008ais} is involved for performance comparison. We
initialize it using the same setting as for the AAIS, and let it run
10 iterations to give the final mixture importance function.
Fig. \ref{fig:helix_scatter} shows the resulting posterior samples
(which are equally weighted via resampling) corresponding to AAIS
and AMIS, respectively. It's shown that the samples resulted from
AAIS are distributed in wider posterior area, which indicates that
the importance function given by AAIS captures more structures in
the posterior.
\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=3in,height=2.4in]{Fig/flared_helix_scatter.png}\includegraphics[width=3in,height=2.4in]{Fig/AIS_3d_felix_scatter.png}}
\end{tabular}
\caption{Three dimensional flared helix example. Left:posterior
samples produced by the proposed AAIS algorithm. Right:posterior
samples produced by the AMIS algorithm \citep{cappe2008ais}. In the
diagonal, the curves are kernel density estimates of the posterior
samples.} \label{fig:helix_scatter}
\end{figure}
A quantitative comparison between the importance functions obtained
by AAIS and AMIS is shown in Table \ref{comparison_3D}. Three
quantities are used for comparison, that are the \ESS, the KL
distance and the estimated marginal likelihood. Note that, we can
easily calculate the KL distance via a simple Monte Carlo for this
simulation case. As is shown in Table \ref{comparison_3D}, AAIS
gives correct answer to the marginal likelihood, closer KL distance
with respect to the real target, and bigger \ESS. So it further
demonstrates that the AAIS is much better than the AMIS in dealing
with this simulation case.
\begin{center}
\begin{tabular}{c||c|c|c}
& \ESS/N & $\KL$ & Marginal Likelihood (real answer:60) \\
\hline AMIS \citep{cappe2008ais} & 0.1525 & 6.7830 & $47.2\pm3.3$\\
\hline AAIS proposed here & 0.4459 & 0.1586 & $59.7\pm2.0$\\
\hline
\end{tabular}
\tabcaption{Three dimensional flared helix example. Quantitative
Performance comparison.}\label{comparison_3D}
\end{center}
\subsection{Example 2: Outer Product of Seven Univariate Densities}
To test the concern whether good results from a 3-D example imply
good results in higher dimensions, we used as a target function the
outer product of seven univariate distributions normalized to
integrate to 1. These seven distributions are:
\begin{enumerate}
\item $\frac{3}{5}\mathcal{G}(10+x|2,3)+\frac{2}{5}\mathcal{G}(10-x|2,5)$
\item $\frac{3}{4}sk\mathcal{N}(x|3,1,5)+\frac{1}{4}sk\mathcal{N}(x|-3,3,-6)$
\item $\mathcal{S}(x|0,9,4)$
\item $\frac{1}{2}\mathcal{B}(x+3|3,3)+\frac{1}{2}\mathcal{N}(x|0,1)$
\item $\frac{1}{2}\varepsilon(x|1)+\frac{1}{2}\varepsilon(-x|1)$
\item $sk\mathcal{N}(x|0,8,-3)$
\item $\frac{1}{8}\mathcal{N}(x|-10,0.1)+\frac{1}{4}\mathcal{N}(x|0,0.15)+\frac{5}{8}\mathcal{N}(x|7,0.2)$
\end{enumerate}
Here $\mathcal{G}(\cdot|\alpha,\beta)$ denotes the gamma
distribution, $\mathcal{N}(\cdot|\mu,\sigma)$ denotes the normal
distribution, $sk\mathcal{N}(\cdot|\mu,\sigma,\alpha)$ denotes the
skew-normal distribution, $\mathcal{S}(\cdot|\mu,s,df)$ denotes the
student-T distribution, $\mathcal{B}(\cdot|\alpha,\beta)$ denotes
the beta distribution, and $\varepsilon(\cdot|\lambda)$ denotes the
exponential distribution. Dimension 2 has two modes bracketing a
deep ravine, dimension 4 has one low, broad mode that overlaps a
second sharper mode, and dimension 7 has three distinct,
well-separated modes. Only dimension 5 is symmetric. There is a
range of tail behavior as well, from Gaussian to heavy-tailed.
In this case, we initialize the proposed AAIS algorithm with a
T-mixture that is composed of 50 equally weighted components. The
degree of freedom for each T component is still fixed to be 5. The
centers of these components for each dimension are selected
uniformly $[-10,10]$. The procedure to initialize the covariance
matrix for every mixture component is identical to that used for the
example shown in Section \ref{sec:example1}. We specify a particle
size $N=8000$, and annealing schedule $\lambda_t=0.1t$, with
$t=1,\ldots,10$.
The AMIS method \citep{cappe2008ais} is also involved for
performance comparison. The AMIS is initialized identically as for
AAIS, and will run 10 iterations to give the final result.
Fig.\ref{fig:7D_scatter} depicts the scatterplot for the resulting
posterior samples (which has been equally weighted by resampling).
It's shown that, the AMIS missed modes in the 2nd and 7th dimension,
while the proposed AAIS manages to capture all the modes in the
posterior, which should be benefited from the annealing idea being
involved.
\begin{figure}[!htb]
\begin{tabular}{c}
\centerline{\includegraphics[width=4in,height=3.2in]{Fig/seven_outer_product_scatter.png}\includegraphics[width=4in,height=3.2in]{Fig/AIS_7d_outter_scatter.png}}
\end{tabular}
\caption{Seven dimensional outer product example. Left:posterior
samples produced by the proposed AAIS algorithm. Right:posterior
samples produced by the AMIS algorithm \citep{cappe2008ais}. In the
diagonal, the curves are kernel density estimates of the posterior
samples.} \label{fig:7D_scatter}
\end{figure}
Similarly as for example 1, we give a quantitative comparison for
the involved algorithms in Table \ref{comparison_7d}. We see that
again the proposed AAIS algorithm leads to much better results for
every comparison criterion.
\begin{center}
\begin{tabular}{c||c|c|c}
& \ESS/$N$ & $\KL$ & Marginal likelihood: (true value:1) \\
\hline AMIS \citep{cappe2008ais}& 0.2454 & 10.8804 & $0.4675\pm0.0246$\\
\hline AAIS proposed here & 0.4948 & 0.4075 & $1.0011\pm0.0303$\\
\hline
\end{tabular}
\tabcaption{Seven dimensional outer product example. Quantitative
Performance comparison}\label{comparison_7d}
\end{center}
\section{Real Data Tests}
In this section, we perform the proposed AAIS method to deal with
two real RV data set.
\subsection{HD73526 Data reported in \cite{tinney2003four}}
This data set contains 18 data components, and were claimed to
support an orbit of $190.5\pm3.0$ days \citep{tinney2003four}.
\cite{gregory2005bayesian} did a Bayesian re-analysis on this data
set using a parallel tempering MCMC algorithm, and reported three
possible orbits, with periods $127.88_{-0.09}^{+0.37}$,
$190.4_{-2.1}^{+1.8}$ and $376.2_{-4.3}^{+1.4}$ days, respectively.
\cite{gregory2005bayesian} also discussed the possibility of having
an additional planet, and reported that the Bayes factor is less
than 1 when comparing $\mathcal{M}_2$ with the $\mathcal{M}_1$.
We use the proposed AAIS algorithm to deal with $\mathcal{M}_0$,
$\mathcal{M}_1$ and $\mathcal{M}_2$, respectively. The marginal
likelihood estimation result is summarized in Table
\ref{marginal_likelihood_data1}. These estimates are reliable
indicated by their corresponding \ESS$/N$. The resulting Bayes
factors are shown in Table \ref{Bayes_factor_data1}. As indicated by
the result, this data set support the one-planet hypothesis, which
coincides the conclusions made by