This notebook demonstrates how to implement user-defined probability functions in the Stan language. As an example, I use the generalized Pareto distribution (GPD) to model geomagnetic storm data from the World Data Center for Geomagnetism.
Load some libraries:
Read the data. This file has magnitudes of 373 geomagnetic storms which lasted longer than 48h with absolute magnitude larger than 100 in 1957-2014.
# file preview shows no header row
d <- read.csv("geomagnetic_tail_data.csv", header = FALSE)
# dst are the absolute magnitudes
colnames(d) <- "dst"
d <- d %>% mutate(dst = abs(dst)) %>% arrange(dst)
Compute the empirical complementary cumulative distribution function. Since dst is sorted in increasing order, the ith value gets ccdf value (n - i + 1)/n, the proportion of events at least that large.
n <- dim(d)[1]
d$ccdf <- seq(n,1,-1)/n
head(d)
The largest event in the data is the March 1989 geomagnetic storm, also known as the Quebec blackout event (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). Now we are interested in estimating the probability of observing an event of the same magnitude in the future, which would be helpful, for example, for insurance companies (for short term geomagnetic storm predictions there are very elaborate models that also use observations closer to the Sun). Extreme value theory says that many distributions have a tail which is well modelled with the generalized Pareto distribution, provided the cutoff point is far enough in the tail. For a more detailed model we could also take into account that geomagnetic storms are temporally correlated and tend to appear in bursts.
The Generalized Pareto distribution is defined as \[
p(y|u,\sigma,k)=
\begin{cases}
 \frac{1}{\sigma}\left(1+k\left(\frac{y-u}{\sigma}\right)\right)^{-\frac{1}{k}-1}, & k\neq 0 \\
 \frac{1}{\sigma}\exp\left(-\frac{y-u}{\sigma}\right), & k = 0,
\end{cases}
\] where \(u\) is a lower bound parameter, \(\sigma\) is a scale parameter, \(k\) is a shape parameter, and \(y\) is the data restricted to the range \((u, \infty)\) (see, e.g., https://en.wikipedia.org/wiki/Generalized_Pareto_distribution for the cdf and random number generation).
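For \(k \neq 0\), integrating this density gives the cdf and its inverse,
\[
F(y|u,\sigma,k) = 1-\left(1+k\left(\frac{y-u}{\sigma}\right)\right)^{-1/k},
\qquad
F^{-1}(p|u,\sigma,k) = u + \frac{\sigma}{k}\left((1-p)^{-k}-1\right),
\]
which is what the _cdf, _lcdf, _lccdf, and _rng implementations below rely on: a draw is \(F^{-1}(P)\) with \(P \sim \mathrm{U}(0,1)\), and since \(1-P\) is also uniform, gpareto_rng uses \(u + \sigma\,(U^{-k}-1)/k\).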
For probability distributions implemented in the Stan math library there are functions with suffixes _lpdf, _cdf, _lcdf, _lccdf, and _rng.
Unlike the functions implemented in the C++ Stan math library, user defined functions can have only one signature. In this example, I have chosen the function signatures so that the behavior is as close as possible to the most usual ways to call these functions. The main feature is that all these functions return a scalar real value (except that _rng would return a scalar integer for a discrete integer valued distribution).
For the Generalized Pareto distribution we implement gpareto_lpdf, gpareto_cdf, gpareto_lcdf, gpareto_lccdf, and gpareto_rng.
As we can define only one function signature for user defined functions, I have chosen to use the vector type for y and real types for the parameters, while the builtin functions can handle vectors, arrays and scalars (denoted by the generic type reals). For vector valued y, _lpdf, _lcdf, and _lccdf return the sum of the log values computed with each element of y. For vector valued y, _cdf returns the product of the values computed with each element of y. _rng returns a single random number generated from the generalized Pareto distribution.
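As a plain R sketch of that summing convention (gpd_lpdf1 and the example values here are hypothetical, not part of the notebook):

# plain R version of the k != 0 log density for a single observation
gpd_lpdf1 <- function(y, ymin, k, sigma) {
  -(1 + 1 / k) * log1p((y - ymin) * (k / sigma)) - log(sigma)
}
y <- c(120, 150, 300); ymin <- 100; k <- 0.3; sigma <- 40
# gpareto_lpdf(y | ymin, k, sigma) returns the sum of the elementwise values
sum(gpd_lpdf1(y, ymin, k, sigma))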
The whole code for the functions, the basic model, and the generated quantities for posterior predictive checking (ppc), leave-one-out cross-validation (loo), and prediction of rare events is shown below. I will next go through some practical details.
Stan code with user defined functions
writeLines(readLines("gpareto.stan"))
functions {
  real gpareto_lpdf(vector y, real ymin, real k, real sigma) {
    // generalised Pareto log pdf
    int N = rows(y);
    real inv_k = inv(k);
    if (k < 0 && max(y - ymin) / sigma > -inv_k) {
      reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma);
    }
    if (sigma <= 0) {
      reject("sigma<=0; found sigma =", sigma);
    }
    if (abs(k) > 1e-15) {
      return -(1 + inv_k) * sum(log1p((y - ymin) * (k / sigma)))
             - N * log(sigma);
    } else {
      return -sum(y - ymin) / sigma - N * log(sigma);
    } // limit k->0
  }
  real gpareto_cdf(vector y, real ymin, real k, real sigma) {
    // generalised Pareto cdf
    real inv_k = inv(k);
    if (k < 0 && max(y - ymin) / sigma > -inv_k) {
      reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma);
    }
    if (sigma <= 0) {
      reject("sigma<=0; found sigma =", sigma);
    }
    if (abs(k) > 1e-15) {
      return exp(sum(log1m_exp((-inv_k) * log1p((y - ymin) * (k / sigma)))));
    } else {
      return exp(sum(log1m_exp(-(y - ymin) / sigma)));
    } // limit k->0
  }
  real gpareto_lcdf(vector y, real ymin, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k < 0 && max(y - ymin) / sigma > -inv_k) {
      reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma);
    }
    if (sigma <= 0) {
      reject("sigma<=0; found sigma =", sigma);
    }
    if (abs(k) > 1e-15) {
      return sum(log1m_exp((-inv_k) * log1p((y - ymin) * (k / sigma))));
    } else {
      return sum(log1m_exp(-(y - ymin) / sigma));
    } // limit k->0
  }
  real gpareto_lccdf(vector y, real ymin, real k, real sigma) {
    // generalised Pareto log ccdf
    real inv_k = inv(k);
    if (k < 0 && max(y - ymin) / sigma > -inv_k) {
      reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma);
    }
    if (sigma <= 0) {
      reject("sigma<=0; found sigma =", sigma);
    }
    if (abs(k) > 1e-15) {
      return (-inv_k) * sum(log1p((y - ymin) * (k / sigma)));
    } else {
      return -sum(y - ymin) / sigma;
    } // limit k->0
  }
  real gpareto_rng(real ymin, real k, real sigma) {
    // generalised Pareto rng
    if (sigma <= 0) {
      reject("sigma<=0; found sigma =", sigma);
    }
    if (abs(k) > 1e-15) {
      return ymin + (uniform_rng(0, 1) ^ -k - 1) * sigma / k;
    } else {
      return ymin - sigma * log(uniform_rng(0, 1));
    } // limit k->0
  }
}
data {
  real ymin;
  int<lower=0> N;
  vector<lower=ymin>[N] y;
  int<lower=0> Nt;
  vector<lower=ymin>[Nt] yt;
}
transformed data {
  real ymax = max(y);
}
parameters {
  real<lower=0> sigma;
  real<lower=-sigma / (ymax - ymin)> k;
}
model {
  y ~ gpareto(ymin, k, sigma);
}
generated quantities {
  vector[N] log_lik;
  vector[N] yrep;
  vector[Nt] predccdf;
  for (n in 1 : N) {
    log_lik[n] = gpareto_lpdf(rep_vector(y[n], 1) | ymin, k, sigma);
    yrep[n] = gpareto_rng(ymin, k, sigma);
  }
  for (nt in 1 : Nt) {
    predccdf[nt] = exp(gpareto_lccdf(rep_vector(yt[nt], 1) | ymin, k, sigma));
  }
}
For each function we do basic argument checking. In addition to invalid values due to user errors, we may sometimes get invalid parameters due to the limited accuracy of the floating point representation of the values. For example, due to underflow, we may get sigma equal to 0 even if we have declared it as real<lower=0> sigma;. For these latter cases, it is useful to use the reject statement so that the corresponding MCMC proposal will be rejected and the sampling can continue.
if (k < 0 && max(y - ymin) / sigma > -inv_k) {
  reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma);
}
if (sigma <= 0) {
  reject("sigma<=0; found sigma =", sigma);
}
Stan documentation warns about the use of abs and conditional evaluation, as they may lead to a discontinuous energy or gradient. In the GPD we need to handle the special case of \(k \rightarrow 0\). The following will cause a discontinuity in theory, but the magnitude of the discontinuity is smaller than the accuracy of the floating point representation, and thus this should not cause any additional numerical instability.
if (abs(k) > 1e-15) {
  return -(1 + inv_k) * sum(log1p((y - ymin) * (k / sigma)))
         - N * log(sigma);
} else {
  return -sum(y - ymin) / sigma - N * log(sigma);
} // limit k->0
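The \(k \rightarrow 0\) branch is indeed the limit of the \(k \neq 0\) branch: with \(x = (y-u)/\sigma\),
\[
\lim_{k \to 0} -\left(\frac{1}{k}+1\right)\log(1+kx) = -x,
\]
so the log density tends to \(-\frac{y-u}{\sigma} - \log \sigma\), which is the exponential branch above.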
In the data block we need to define the minimum threshold ymin. We also define the test points yt.
data {
  real ymin;
  int<lower=0> N;
  vector<lower=ymin>[N] y;
  int<lower=0> Nt;
  vector<lower=ymin>[Nt] yt;
}
sigma has to be positive and k has a lower limit which depends on sigma and the maximum value of the data.
transformed data {
  real ymax;
  ymax = max(y);
}
parameters {
  real<lower=0> sigma;
  real<lower=-sigma / (ymax - ymin)> k;
}
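The lower bound on k comes from the support of the density: for \(k < 0\) we need \(1 + k\,\frac{y-u}{\sigma} > 0\) for every observation, and the constraint is tightest for the largest one, giving
\[
k > -\frac{\sigma}{y_{\max} - y_{\min}}.
\]
This is the same condition checked by the reject statements above.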
By defining gpareto_lpdf we can also use the common ~ notation in Stan to write familiar looking model code.
model {
  y ~ gpareto(ymin, k, sigma);
}
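This sampling statement is equivalent to incrementing the log density directly with target += gpareto_lpdf(y | ymin, k, sigma);.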
In the generated quantities we compute log_lik values for LOO and yrep values for posterior predictive checking. We had defined the first argument of gpareto_lpdf to be a vector. Now a single element of the vector, y[n], has type real, and as Stan has strong typing and we can't overload user defined functions, we need to cast y[n] to a vector by making a 1 element vector rep_vector(y[n],1). Alternatively we could write another function with a different name which could accept a scalar y[n], but here I wanted to demonstrate the minimal approach.
for (n in 1 : N) {
  log_lik[n] = gpareto_lpdf(rep_vector(y[n], 1) | ymin, k, sigma);
  yrep[n] = gpareto_rng(ymin, k, sigma);
}
Finally we compute predictive probabilities for observing events with certain magnitudes.
for (nt in 1 : Nt) {
  predccdf[nt] = exp(gpareto_lccdf(rep_vector(yt[nt], 1) | ymin, k, sigma));
}
Before running the Stan model, it's a good idea to check that we have not made mistakes in the implementation of the user defined functions. RStan makes it easy to call the user defined functions in R, and thus easy to make some checks.
Expose the user defined functions to R
expose_stan_functions("gpareto.stan")
Testing the user defined functions
# check that integral of exp(gpareto_lpdf) (from ymin to ymax) matches with gpareto_cdf
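These checks use a density wrapper gpareto_pdf around the exposed lpdf; a minimal sketch, where the wrapper body and the test parameter values are assumptions (values chosen to be consistent with the printed output below):

# assumed sketch: vectorized density from the exposed gpareto_lpdf
# (the exposed lpdf returns a sum of log densities, so evaluate per element)
gpareto_pdf <- function(y, ymin, k, sigma)
  exp(sapply(y, function(yi) gpareto_lpdf(yi, ymin = ymin, k = k, sigma = sigma)))
ymin <- 100; sigma <- 0.24; k <- 0.19  # assumed test values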
# the first set of tests are testing the self consistency
# check that exp(gpareto_lpdf)) integrates to 1
integrate(gpareto_pdf, lower = ymin, upper = Inf, ymin = ymin, k = k, sigma = sigma)
1 with absolute error < 2.6e-08
# check that integral of exp(gpareto_lpdf)) from ymin to yr matches gpareto_cdf
yr <- gpareto_rng(ymin, k, sigma)
all.equal(integrate(gpareto_pdf, lower = ymin, upper = yr, ymin = ymin, k = k, sigma = sigma)$value,
          gpareto_cdf(yr, ymin = ymin, k = k, sigma = sigma))
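The sample yrs used by gpdfit below is drawn from gpareto_rng; a minimal sketch, with the sample size assumed:

yrs <- replicate(1e5, gpareto_rng(ymin, k, sigma))  # large rng sample for parameter recovery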
# estimate sigma and k using method by Zhang & Stephens (2009)
# (you could also use Stan to estimate the parameters)
th <- loo::gpdfit(yrs-ymin)
sprintf('True sigma=%.2f, k=%.2f, estimated sigmahat=%.2f, khat=%.2f', sigma, k, th$sigma, th$k)
[1] "True sigma=0.24, k=0.19, estimated sigmahat=0.24, khat=0.19"
# test separately the special case k=0
k <- 0
# the first set of tests are testing the self consistency
# check that exp(gpareto_lpdf)) integrates to 1
integrate(gpareto_pdf, lower = ymin, upper = Inf, ymin = ymin, k = k, sigma = sigma)
1 with absolute error < 1.8e-06
# check that integral of exp(gpareto_lpdf)) from ymin to yr matches gpareto_cdf
yr <- gpareto_rng(ymin, k, sigma)
all.equal(integrate(gpareto_pdf, lower = ymin, upper = yr, ymin = ymin, k = k, sigma = sigma)$value,
          gpareto_cdf(yr, ymin = ymin, k = k, sigma = sigma))
# estimate sigma and k using method by Zhang & Stephens (2009)
# (you could also use Stan to estimate the parameters)
th <- loo::gpdfit(yrs-ymin)
sprintf('True sigma=%.2f, k=%.2f, estimated sigmahat=%.2f, khat=%.2f', sigma, k, th$sigma, th$k)
[1] "True sigma=0.24, k=0.00, estimated sigmahat=0.24, khat=0.00"
Next we use the defined Stan model to analyse the distribution of the largest geomagnetic storms.
Fit the Stan model
yt <- append(10^seq(2, 3, .01), 850)
ds <- list(ymin = 100, N = n, y = d$dst, Nt = length(yt), yt = yt)
fit_gpd <- stan(file = 'gpareto.stan', data = ds, refresh = 0,
                chains = 4, seed = 100)
Run the usual diagnostics (see Robust Statistical Workflow with RStan)

monitor(as.array(fit_gpd, pars=c("k","sigma")))
Inference for the input samples (4 chains: each with iter = 1000; warmup = 500):

         Q5  Q50  Q95 Mean  SD Rhat Bulk_ESS Tail_ESS
k       0.1  0.2  0.4  0.3 0.1 1.00      724      747
sigma  37.8 43.8 50.3 44.0 3.8 1.01      763     1066

For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (an ESS > 100
per chain is considered good), and Rhat is the potential scale reduction
factor on rank normalized split chains (at convergence, Rhat <= 1.05).

check_hmc_diagnostics(fit_gpd)

Divergences:
0 of 4000 iterations ended with a divergence.

Tree depth:
0 of 4000 iterations saturated the maximum tree depth of 10.

Energy:
E-BFMI indicated no pathological behavior.
The diagnostics do not reveal anything alarming.
Predictions
Plot the model fit with 90% posterior interval. The largest observed magnitude in the data corresponds to the Quebec blackout in 1989 (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). The vertical dashed line at 850 shows the estimated magnitude of the Carrington event in 1859 (https://en.wikipedia.org/wiki/Solar_storm_of_1859).
gpd_params <- rstan::extract(fit_gpd)
mu <- apply(t(gpd_params$predccdf), 1, quantile, c(0.05, 0.5, 0.95)) %>%
t() %>% data.frame(x = yt, .) %>% gather(pct, y, -x)
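This computes, for each test magnitude in yt, the 5%, 50%, and 95% posterior quantiles of predccdf, and reshapes them into long format for plotting.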
ggplot() +
geom_point(aes(dst, ccdf), data = d, size = 1, color = clrs[[5]]) +
geom_line(aes(x=c(850,850),y=c(1e-4,1)),linetype="dashed",color="red") +
  geom_line(aes(x, y, linetype = pct), data = mu, color = 'red') +
scale_linetype_manual(values = c(2,1,2)) +
coord_trans(x="log10", y="log10", limx=c(100,1000), limy=c(1e-4,1)) +
scale_y_continuous(breaks=c(1e-5,1e-4,1e-3,1e-2,1e-1,1), limits=c(1e-4,1)) +
scale_x_continuous(breaks=c(100,200,300,400,500,600,700,850,1000), limits=c(100,1000)) +
geom_text(aes(x = d$dst[n], y = d$ccdf[n]), label = "Quebec blackout", vjust="top", nudge_y=-0.0005) +
geom_text(aes(x = 820, y = 0.02), label = "Carrington event", angle=90) +
  labs(y = "Probability earth's field weakens at least this much", x= "Absolute dst") +
guides(linetype = F) +
theme_bw()
For additional model checking, plot 1) a kernel density estimate of log(dst) and posterior predictive replicates, 2) the max log magnitude (Quebec event) against a histogram of the maxima of the posterior predictive replicates, 3) the leave-one-out cross-validation probability integral transformation, and 4) khat values from PSIS-LOO. None of these diagnostics reveal problems in the model fit.
ppc1 <- ppc_dens_overlay(log(d$dst), log(gpd_params$yrep[1:50,])) + labs(x="log(absolute dst)")
ppc2 <- ppc_stat(log(d$dst), log(gpd_params$yrep), stat = "max") + labs(x="max(log(absolute dst))")
psiss <- psis(-gpd_params$log_lik)
clrs <- color_scheme_get("brightblue")
pkhats <- ggplot() + geom_point(aes(x=seq(1,n), y=psiss$diagnostics$pareto_k), color=clrs[[5]]) + labs(y="khat", x="") +
  geom_hline(yintercept=0, linetype="dashed") + ylim(-0.25,1) + theme_default()
ppc3 <- ppc_loo_pit_qq(log(d$dst), log(gpd_params$yrep), lw=psiss$log_weights)
grid.arrange(ppc1, ppc2, ppc3, pkhats, ncol=2)
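In PSIS, khat values below 0.5 indicate that the importance sampling estimate is reliable for each left-out observation, which the loo output below confirms for all observations here.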
We also compute the PSIS-LOO estimate, although this is not of much use without an alternative model to compare against.

(loo_gpd <- loo(fit_gpd))

Computed from 4000 by 373 log-likelihood matrix
         Estimate   SE
elpd_loo  -1874.7 23.7
p_loo         1.8  0.2
looic      3749.4 47.3
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
Here the analysis was simplified by assuming the geomagnetic storm events to be independent in time, although they appear in bursts. The Generalized Pareto distribution can be used for correlated data, but we should really take the correlation into account when predicting the magnitude of storms in the future. Based on the simplified independence assumption, though, and assuming the same number of events larger than 100 in the next 57 years, we can compute
round(mean(1-(1-gpd_params$predccdf[,78])^(373)),2)
[1] 0.79
round(mean(1-(1-gpd_params$predccdf[,102])^(373)),2)
[1] 0.4
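Here \(1-(1-p)^{373}\) is the probability that at least one of 373 independent events exceeds the given magnitude when each exceeds it with probability \(p\) (an element of predccdf), and the mean averages this over the posterior draws.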
That is, the probability that we would observe a Quebec blackout magnitude event in the next 57 years is about 80%, and the probability of observing a Carrington level event in the next 57 years is about 40%.
Now go read about the effects of these geomagnetic storms!
sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=fi_FI.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] loo_2.6.0           bayesplot_1.10.0    rstan_2.26.23
[4] StanHeaders_2.26.28 gridExtra_2.3       dplyr_1.1.3
[7] tidyr_1.3.0         ggplot2_3.4.3       extraDistr_1.9.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.11          lattice_0.20-45      QuickJSR_1.0.5
 [4] prettyunits_1.1.1    ps_1.7.5             digest_0.6.33
 [7] utf8_1.2.3           R6_2.5.1             plyr_1.8.8
[10] backports_1.4.1      stats4_4.2.2         evaluate_0.21
[13] highr_0.10           pillar_1.9.0         rlang_1.1.1
[16] rstudioapi_0.15.0    callr_3.7.3          jquerylib_0.1.4
[19] Matrix_1.5-1         checkmate_2.2.0      rmarkdown_2.24
[22] labeling_0.4.3       stringr_1.5.0        RcppEigen_0.3.3.9.3
[25] munsell_0.5.0        compiler_4.2.2       xfun_0.40
[28] pkgconfig_2.0.3      pkgbuild_1.4.2       rstantools_2.3.1.1
[31] htmltools_0.5.6      tidyselect_1.2.0     tensorA_0.36.2
[34] tibble_3.2.1         codetools_0.2-19     matrixStats_1.0.0
[37] fansi_1.0.4          crayon_1.5.2         withr_2.5.0
[40] distributional_0.3.2 BH_1.81.0-1          grid_4.2.2
[43] jsonlite_1.8.7       gtable_0.3.4         lifecycle_1.0.3
[46] posterior_1.4.1      magrittr_2.0.3       scales_1.2.1
[49] RcppParallel_5.1.7   cli_3.6.1            stringi_1.7.12
[52] cachem_1.0.8         farver_2.1.1         reshape2_1.4.4
[55] bslib_0.5.1          generics_0.1.3       vctrs_0.6.3
[58] tools_4.2.2          glue_1.6.2           purrr_1.0.2
[61] abind_1.4-5          processx_3.8.2       parallel_4.2.2
[64] fastmap_1.1.1        yaml_2.3.7           inline_0.3.19
[67] colorspace_2.1-0     knitr_1.43           sass_0.4.7
I thank Ben Goodrich, Ben Bales, and Bob Carpenter for useful comments on the draft of this notebook.