From 9c485be79a19457e9afab8cf2bbbb89d6d535fd6 Mon Sep 17 00:00:00 2001 From: micl Date: Thu, 21 Nov 2024 18:22:55 -0500 Subject: [PATCH 01/19] minor reorg of qreg --- linear_model_extensions.qmd | 50 ++++++++++++++++++++----------------- 1 file changed, 27 insertions(+), 23 deletions(-) diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index 2524424..013b04e 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -1324,28 +1324,13 @@ With 1000 observations, we see that having just 10 relatively extreme scores is There are a few approaches we could take here, with common approaches being dropping those observations, [winsorizing](https://en.wikipedia.org/wiki/Winsorizing) them, or doing some transformation. Throwing away data because you don't like the way it behaves is almost statistical abuse, and winsorization is just replacing those extreme values with numbers that you like a little bit better. Transformations done independently of the model rarely work for this purpose either. So let's try something else! -### The quantile loss function {#sec-quantile-loss} - -A better answer to this challenge might be to try a median-based approach instead. This is where a model like **quantile regression** becomes handy. Formally, the objective function for the model can be expressed as: - -:::{.content-visible when-format='html'} -![Quantile Loss Function](img/eq-quantile-reg.png){#eq-quantile-loss-function} -::: - -:::{.content-visible when-format='pdf'} -$$ -\text{Value} = \Sigma \left((\pmb{\tau} - 1)\sum_{\pmb{y_{i} < \hat{y}}}(y_{i} - \hat{y}) + \pmb{\tau} \sum_{\pmb{y_{i}\geq \hat{y}}}(y_{i} - \hat{y}) \right) -$${#eq-quantile-loss-function} -::: - - - -With quantile regression, we are given an extra parameter for the model: $\tau$. It's a number between 0 and 1 representing the desired quantile (e.g., 0.5 for the median). The objective function treats positive residuals differently than negative residuals. If the residual is positive, then we multiply it by $\tau$. If the residual is negative, then we multiply it by $\tau -1$. The increased penalty for over-predictions for ensures that the estimated quantile $\hat{y}$ is such that $\tau$ proportion of the data falls below it, balancing the total loss and providing a robust estimate of the desired quantile. ### A standard quantile regression {#sec-quantile-standard} +A better answer to this challenge might be to try a median-based approach instead. This is where a model like **quantile regression** becomes handy. Quantile regression is a type of regression that allows us to model the relationship between the features and the target at different quantiles of the target. For example, we can examine models at the 10th, 25th, 50th, 75th, and 90th percentiles of the target. This is very cool, as it allows us to model the relationship between the features and the target in a way that is robust to outliers and extreme scores. It's also a way to understand a type of nonlinearity that is not captured by a standard linear model, as the feature target relationship may change at different quantiles of the target. + To demonstrate this type of model, let's use our movie reviews data. Let's say that we are curious about the relationship between the `word_count` and `rating` to keep things simple. To make it even more straightforward, we will use the standardized (scaled) version of the feature. 
In our default approach, we will start with a median regression, in other words, a quantile associated with $\tau$ = .5[^leastabs]. [^leastabs]: This is equivalent to using the least absolute deviation objective. @@ -1411,17 +1396,18 @@ broom::tidy(model_median) |> gt() ``` -Fortunately, our interpretation of this result isn't all that different from a standard linear model -- the rating should decrease by `r round(coef(model_median)['word_count_sc'],2)` for every bump in a standard deviation for the number of words, which in this case is about `r round(sd(df_reviews$word_count))` words. However, this concerns the expected median rating, not the mean, as would be the case with standard linear regression. +Fortunately, our interpretation of this result isn't all that different from a standard linear model - the rating should decrease by `r round(coef(model_median)['word_count_sc'],2)` for every bump in a standard deviation for the number of words, which in this case is about `r round(sd(df_reviews$word_count))` words. However, this concerns the expected *median* rating, not the mean, as would be the case with standard linear regression. -Quantile regression is not a one-trick-pony though. Being able to compute a median regression is just the default. We can also model different quantiles of the same data. It gives us the ability to answer brand new questions -- does the relationship between word count and their ratings change at different quantiles of rating? Very cool! +Quantile regression is not a one-trick-pony though - being able to compute a median regression is just the default. As mentioned, we can also explore different quantiles, and this gives us the ability to answer brand new questions - does the relationship between word count and their ratings change at different quantiles of rating? Very cool! Let's now examine the trends within 5 different quantiles of the data - .1, .3 .5, .7, and .9[^morequants]. We aren't limited to just those quantiles though, and you can examine any of them that you might find interesting. Here is a plot of the results of these models. [^morequants]: The R function can take a vector of quantiles, while the Python function can only take a single quantile, so you would need to loop through the quantiles. -:::{.content-visible when-format='html'} + ```{r} #| echo: false +#| eval: false #| label: fig-quantile-lines #| fig-cap: Quantile Regression Lines tau_values = c(.1, .3, .5, .7, .9) @@ -1470,10 +1456,8 @@ ggplot() + ggsave('img/lm-extend-quantile_lines.svg', width = 8, height = 6) ``` -::: -:::{.content-visible when-format='pdf'} + ![Quantile Regression Lines](img/lm-extend-quantile_lines.svg){#fig-quantile-lines} -::: To interpret our visualization, we could start by saying that all of the model results suggest a negative relationship. The 10th and 90th quantiles seem weakest, while those in the middle show a notably stronger relationship. We can also see that the 90th percentile model is better able to capture those values that would otherwise be deemed as outliers using other standard techniques. The following table shows the estimated coefficients for each of the quantiles, and suggests that all word count relationships are statistically significant, since the confidence intervals do not include zero. 
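As the earlier footnote noted, the R function can fit all of these quantiles in a single call. Here is a sketch of what that looks like, assuming the `df_reviews` data and the [quantreg]{.pack} package used for the models above (in Python you would loop over separate fits, one per quantile):

```{r}
#| eval: false
# a sketch of fitting the same model at several quantiles at once
library(quantreg)

model_quantiles = rq(
    rating ~ word_count_sc,
    tau  = c(.1, .3, .5, .7, .9), # the quantiles of interest
    data = df_reviews
)

summary(model_quantiles) # coefficients and intervals for each quantile
```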
@@ -1483,6 +1467,7 @@ To interpret our visualization, we could start by saying that all of the model r #| echo: false #| label: tbl-quantile-model-output-multi-quants #| tbl-cap: Quantile Regression Model Results +tau_values = c(.1, .3, .5, .7, .9) rq( rating ~ word_count_sc, @@ -1515,6 +1500,25 @@ rq( \normalsize + + +### The quantile loss function {#sec-quantile-loss} + +Formally, the objective function for the quantile regression model can be expressed as: + + +:::{.content-visible when-format='html'} +![Quantile Loss Function](img/eq-quantile-reg.png){#eq-quantile-loss-function} +::: + +:::{.content-visible when-format='pdf'} +$$ +\text{Value} = \Sigma \left((\pmb{\tau} - 1)\sum_{\pmb{y_{i} < \hat{y}}}(y_{i} - \hat{y}) + \pmb{\tau} \sum_{\pmb{y_{i}\geq \hat{y}}}(y_{i} - \hat{y}) \right) +$${#eq-quantile-loss-function} +::: + +Compared to a standard linear regression, we are given an extra parameter for the model: $\tau$. It's a number between 0 and 1 representing the desired quantile (e.g., 0.5 for the median). The objective function treats positive residuals differently than negative residuals. If the residual is positive, then we multiply it by $\tau$. If the residual is negative, then we multiply it by $\tau -1$. The increased penalty for over-predictions for ensures that the estimated quantile $\hat{y}$ is such that $\tau$ proportion of the data falls below it, balancing the total loss and providing a robust estimate of the desired quantile. + ### Rolling our own Given how relatively simple the objective function is, let's demystify this model further by creating our own quantile regression model and see if we can get the same results. We'll start by creating a loss function that we can use to fit our model. From 488dae00dba49fea6e083f50c016394fc7a5b1b0 Mon Sep 17 00:00:00 2001 From: micl Date: Fri, 22 Nov 2024 18:33:34 -0500 Subject: [PATCH 02/19] split uncertainty to its own chap --- estimation.qmd | 13 +- uncertainty.qmd | 1298 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 1299 insertions(+), 12 deletions(-) create mode 100644 uncertainty.qmd diff --git a/estimation.qmd b/estimation.qmd index 916a4e0..f5d772d 100644 --- a/estimation.qmd +++ b/estimation.qmd @@ -11,18 +11,7 @@ filter = dplyr::filter ``` ```{python} -#| label: py-setup-Estimation -#| echo: false -import pandas as pd -import numpy as np - -np.set_printoptions(formatter={'float': lambda x: '{0:0.4f}'.format(x)}) -pd.set_option('display.float_format', lambda x: '%.3f' % x) -``` - - -```{python} -#| label: py-setup-data +#| label: py-setup-estimation #| echo: false import pandas as pd diff --git a/uncertainty.qmd b/uncertainty.qmd new file mode 100644 index 0000000..64c9682 --- /dev/null +++ b/uncertainty.qmd @@ -0,0 +1,1298 @@ +# Estimating Uncertainty {#sec-estim-uncertainty} + +![](img/chapter_gp_plots/gp_plot_3.svg){width=75%} + +```{r} +#| label: setup-estimation +#| include: false +#| +# not sure what's going on to need this +filter = dplyr::filter +``` + + +Our focus thus far has been on estimating the best parameters for a model. But we also want to know how certain we are about those estimates. There are different ways to estimate **uncertainty**, and understanding the uncertainty in our results helps us make better decisions from our model. We'll briefly cover a few approaches here, but realize we are but merely scratching the surface on these approaches. There are whole books, and even philosophies, dedicated to the topic of uncertainty estimation. 
+ + +## Key Ideas {#sec-unc-key-ideas} + +- There are multiple ways to estimate uncertainty in parameters or prediction. +- Many statistical models provide formulaic interval estimates for parameters and predictions, couched in a **frequentist** framework. +- **Monte Carlo** methods use a simulation approach to estimate uncertainty. +- **Bootstrap** methods use resampling to estimate uncertainty. +- **Bayesian** methods provide a different way to estimate uncertainty and an alternative philosophical spirit. +- **Conformal** prediction provides a way to estimate uncertainty in predictions where other methods falter. + +### Why this matters + +Understanding uncertainty is crucial for making decisions based on model results. It's difficult to make informed decisions if we don't know how certain we are about our estimates. This is especially important in high-stakes decisions, where the consequences of being wrong are severe. For example, in medical diagnosis, we want to be as certain as possible about the diagnosis before starting treatment. In finance, we want to be as certain as possible about the risk of an investment before making it. In all these cases, understanding uncertainty is key to making the best decision. + +### Helpful context + +If you are comfortable with standard linear models you should be okay here. This chapter does get a bit more technical than others, but the examples should prove straightforward. + + +## Data Setup {#sec-unc-data-setup} + +Data setup follows the estimation chapter for consistency (@sec-estim-data-setup). + +:::{.panel-tabset} + +##### R + +```{r} +#| label: r-happiness-data-setup +df_happiness = read_csv('https://tinyurl.com/worldhappiness2018') |> + drop_na() |> + rename(happiness = happiness_score) |> + select( + country, + happiness, + contains('_sc') + ) +``` + +##### Python + +```{python} +#| label: py-happiness-data-setup +import pandas as pd + +df_happiness = ( + pd.read_csv('https://tinyurl.com/worldhappiness2018') + .dropna() + .rename(columns = {'happiness_score': 'happiness'}) + .filter(regex = '_sc|country|happ') +) +``` + +::: + + +Nothing beyond base R is needed. For Python examples, the following are required. + +```{python} +#| label: py-imports-estimation +import numpy as np + +import statsmodels.api as sm +import statsmodels.formula.api as smf + +from sklearn.linear_model import LinearRegression +from sklearn.model_selection import train_test_split + +from scipy import stats +``` + + +## Standard Frequentist {#sec-estim-frequentist} + +We talked a bit about the frequentist approach in our discussion of confidence intervals (@sec-lm-interpretation-feature). There we described the process using the interval to *capture* the 'true' parameter value a certain percentage of the time. The key assumption is that the true parameter is fixed, and the interval is a random variable that will contain the true value with some percentage frequency. With this approach, if you were to repeat the experiment, i.e. data collection and analysis, many times, each interval would be slightly different. Although they would be different, any one of the intervals is as good or valid as the others. You also know that a certain percentage of them will contain the true value, and a (usually small) percentage will not. You will never know if a specific interval does actually capture the true value, because we don't know the true value in practice. + +This is a common approach in traditional statistical analysis, and so it's used in many modeling contexts. 
If no particular estimation approach is specified, the default is usually a frequentist one. The approach not only provides confidence intervals for the parameters, but we can also get them for predictions, which is typically also a goal.

Here is an example using our previous model to get interval estimates for predictions. We get both so-called 'confidence' and 'prediction' intervals. Both are confidence intervals in the frequentist sense, just for different purposes. The confidence interval is for the average prediction, while the prediction interval is for a future observation. The prediction interval is wider because it includes the uncertainty in the model parameters as well as the uncertainty in the prediction itself.

:::{.panel-tabset}

##### R

```{r}
#| label: r-frequentist
#| eval: true
#| results: hide

model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness)

confint(model)

predict(model, interval = 'confidence') # for an average prediction
predict(model, interval = 'prediction') # for a future observation (wider)
```

##### Python

```{python}
#| label: py-frequentist
#| eval: false

model = smf.ols(
    'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',
    data = df_happiness
).fit()

model.conf_int()

model.get_prediction().summary_frame() # both 'confidence' and 'prediction' intervals
```

:::


As expected, the prediction intervals are wider than the confidence intervals. The modeling functions provide these intervals by default, but we can also calculate them by hand by applying the standard formulas for the interval estimates. A sample of results is shown in the following table.
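For the confidence interval around the expected value, the standard formula is

$$
\hat{y} \pm t_{1-\alpha/2,\ \text{df}} \cdot \text{SE}(\hat{y})
$$

where $\text{SE}(\hat{y})$ comes from the covariance matrix of the estimated coefficients and df is the residual degrees of freedom. The prediction interval for a new observation widens this by adding the residual variance $\hat{\sigma}^2$:

$$
\hat{y} \pm t_{1-\alpha/2,\ \text{df}} \cdot \sqrt{\text{SE}(\hat{y})^2 + \hat{\sigma}^2}
$$

The code that follows carries out exactly these calculations.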
+ +:::{.panel-tabset} + +##### R +```{r} +#| eval: true +#| label: r-conf-pred-interval-by-hand +#| results: hide +X = model.matrix(model) + +# get the prediction +y_hat = X %*% coef(model) + +# get the standard error +se = sqrt(diag(X %*% vcov(model) %*% t(X))) + +# critical value for 95% confidence +cv = qt(0.975, df = model$df.residual) + +# get the confidence interval +tibble( + prediction = y_hat[,1], + lower = y_hat[,1] - cv * se, + upper = y_hat[,1] + cv * se +) |> + head() + +predict(model, interval = 'confidence') |> head() + +# get the prediction interval +se_pred = sqrt(se^2 + summary(model)$sigma^2) + +data.frame( + prediction = y_hat[,1], + lower = y_hat[,1] - cv * se_pred, + upper = y_hat[,1] + cv * se_pred +) |> + head() + +predict(model, interval = 'prediction') |> head() +``` + +##### Python + +```{python} +#| eval: false +#| label: py-conf-pred-interval-by-hand +X = model.model.exog + +# get the prediction +y_hat = X @ model.params + +# get the standard error +se = np.sqrt(np.diag(X @ model.cov_params() @ X.T)) + +# critical value for 95% confidence +cv = stats.t.ppf(0.975, model.df_resid) + +# get the confidence interval +pd.DataFrame({ + 'prediction': y_hat, + 'lower': y_hat - cv * se, + 'upper': y_hat + cv * se +}).head() + +model.get_prediction().summary_frame().head() + +# get the prediction interval +se_pred = np.sqrt(se**2 + model.mse_resid) + +pd.DataFrame({ + 'prediction': y_hat, + 'lower': y_hat - cv * se_pred, + 'upper': y_hat + cv * se_pred +}).head() + +model.get_prediction().summary_frame().head() +``` +::: + +```{r} +#| echo: false +#| label: tbl-conf-pred-interval-by-hand +#| tbl-cap: Confidence and Prediction Interval Estimates + +tibble( + type = 'Confidence', + prediction = y_hat[,1], + our_lwr = y_hat[,1] - cv * se, + our_upr = y_hat[,1] + cv * se, + lm_lwr = predict(model, interval = 'confidence')[, 'lwr'], + lm_upr = predict(model, interval = 'confidence')[, 'upr'], + ) |> + bind_rows( + tibble( + type = 'Prediction', + prediction = y_hat[,1], + our_lwr = y_hat[,1] - cv * se_pred, + our_upr = y_hat[,1] + cv * se_pred, + lm_lwr = predict(model, interval = 'prediction')[, 'lwr'], + lm_upr = predict(model, interval = 'prediction')[, 'upr'], + ) + ) |> + group_by(type) |> + slice(1:3) |> + # ungroup() |> + gt() |> + tab_options( + row_group.font.size = 10 + ) +``` + + +These interval estimates for parameters and predictions are actually not easy to get right for more complicated models beyond generalized linear models. Given this, one should be cautious when moving beyond standard linear models. The next two approaches we'll discuss are often used within the frequentist framework to estimate uncertainty in more complex models. + + +## Monte Carlo + +Monte Carlo methods derive their name from the famous casino in Monaco[^mcname]. The idea is to use random sampling to estimate a value. With statistical models, we can use Monte Carlo methods to estimate uncertainty in our model parameters and predictions. The general idea is as follows: + +1. **Estimate the model parameters** using the data and their range of possible values (e.g. based on a probability distribution). +2. **Simulate new data** from the model using the estimated parameters and assumed probability distributions for those parameters. +3. **Estimate the metrics of interest** using the simulated data. +4. **Repeat** many times. + + +[^mcname]: The name originates with Stanislav Ulam, who worked on the Manhattan Project and would actually come up with the idea from playing solitaire. 
He is also the one who inspired the name of the Bayesian probabilistic programming language Stan! + +The result is a distribution of the value of interest, be it a parameter, a prediction, or maybe an evaluation metric like RMSE. This distribution can then be used to provide a sense of uncertainty in the value, such as an interval estimate. We can use Monte Carlo methods to estimate the uncertainty in predictions for our happiness model as follows. + +:::{.panel-tabset} + +##### R + +```{r} +#| label: r-monte-carlo + +# we'll use the model from the previous section +model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) + +# number of simulations +mc_predictions = function( + model, + nsim = 2500, + seed = 42 +) { + set.seed(seed) + + params_est = coef(model) + params = mvtnorm::rmvnorm( + n = nsim, + mean = params_est, + sigma = vcov(model) + ) + + sigma = summary(model)$sigma + X = model.matrix(model) + + y_hat = X %*% t(params) + rnorm(n = nrow(X) * nsim, sd = sigma) + + pred_int = apply(y_hat, 1, quantile, probs = c(.025, .975)) + + return(pred_int) +} + +our_mc = mc_predictions(model) +``` + +##### Python + +```{python} +#| label: py-monte-carlo +#| eval: false + +# we'll use the model from the previous section +model = smf.ols( + 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc', + data = df_happiness +).fit() + +def mc_predictions(model, nsim=2500, seed=42): + np.random.seed(seed) + + params_est = model.params + params = np.random.multivariate_normal( + mean = params_est, + cov = model.cov_params(), + size = nsim + ) + + sigma = model.mse_resid**.5 + X = model.model.exog + + y_hat = X @ params.T + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) + + pred_int = np.quantile(y_hat, q = [.025, .975], axis = 1) + + return pred_int + +our_mc = mc_predictions(model) +``` + +::: + +Here are the results of the Monte Carlo simulation for the prediction intervals. They are pretty close to what we'd already have available from the model package used for linear regression. However, we can use this for other models where uncertainty estimates are not readily available, can help us estimate values and intervals in a general way. + +```{r} +#| echo: false +#| label: tbl-monte-carlo +#| tbl-cap: Monte Carlo Prediction Intervals +pred_int_lm = predict(model, interval = 'prediction') + +tibble( + observed_value = df_happiness$happiness, + prediction = fitted(model), + lower = our_mc[1, ], + upper = our_mc[2, ], + lower_lm = pred_int_lm[, 'lwr'], + upper_lm = pred_int_lm[, 'upr'] +) |> + head() |> + gt() |> + tab_footnote( + footnote = 'Results based on the R simulation' + ) +``` + +Monte Carlo simulation is a very popular approach in modeling, and a variant of it, markov chain monte carlo (MCMC), is the basis for Bayesian estimation, which we'll also talk about in more detail later. + + +## Bootstrap {#sec-estim-bootstrap} + +An extremely common method for estimating uncertainty is the **bootstrap**. The bootstrap is a method where we create new datasets by randomly sampling the original data with replacement. This means that each new dataset is the same size as the original, but some observations may be selected multiple times while others may not be selected at all. We then estimate our model with each data set, and each time, we can collect parameter estimates, predictions, or any other calculations we are interested in. Ultimately, we end up with a *distribution* of all the things we calculated. 
The nice thing about this we don't need to know the specific distribution (e.g., normal, or t-distribution) of the values we want to get uncertainty estimates for, we can just use the data we have to produce that distribution. And this is a key distinction from the Monte Carlo method just discussed. + +The results of bootstrapping give us a range of possible values, which is useful for inference[^infdef], as we can use the distribution to calculate interval estimates. The average parameter estimate is typically the same as whatever the underlying model used would produce, so not really useful for that in the context of simpler linear models. Even so, we can calculate derivatives of the parameters, like say a ratio or sum, or a model metric like R^2^, or a prediction. Some of these normally would not be estimated as part of the model, or maybe the model tool does not provide anything beyond the value itself. Yet the bootstrap provides a way to get at a measure of uncertainty for the values of interest, with fewer assumptions about how that distribution should take shape. + +The approach bootstrap very flexible, and it can potentially be used with any model whether in a statistical or machine learning context. Let's see this in action with the happiness model. We'll create a bootstrap function, then use it to estimate the uncertainty in the coefficients for the happiness model. + +[^infdef]: We're using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We're not exactly sure how that terminological muddling arose in ML, but be on the lookout for it. + +:::{.panel-tabset} + +##### R + +```{r} +#| label: r-bootstrap +#| results: hide +bootstrap = function(X, y, nboot = 100, seed = 123) { + + N = nrow(X) + p = ncol(X) + 1 # add one for intercept + + # initialize + beta = matrix(NA, p*nboot, nrow = nboot, ncol = p) + colnames(beta) = c('Intercept', colnames(X)) + mse = rep(NA, nboot) + + # set seed + set.seed(seed) + + for (i in 1:nboot) { + # sample with replacement + idx = sample(1:N, N, replace = TRUE) + Xi = X[idx,] + yi = y[idx] + + # estimate model + mod = lm(yi ~., data = Xi) + + # save results + beta[i, ] = coef(mod) + mse[i] = sum((mod$fitted - yi)^2) / N + } + + # given mean estimates, calculate MSE + y_hat = cbind(1, as.matrix(X)) %*% colMeans(beta) + final_mse = sum((y - y_hat)^2) / N + + output = list( + par = as_tibble(beta), + MSE = mse, + final_mse = final_mse + ) + + return(output) +} + +X = df_happiness |> + select(life_exp_sc:gdp_pc_sc) + +y = df_happiness$happiness + +our_boot = bootstrap( + X = X, + y = y, + nboot = 1000 +) +``` + +##### Python + +```{python} +#| label: py-bootstrap +#| eval: false +def bootstrap(X, y, nboot=100, seed=123): + # add a column of 1s for the intercept + X = np.c_[np.ones(X.shape[0]), X] + N = X.shape[0] + + # initialize + beta = np.empty((nboot, X.shape[1])) + + # beta = pd.DataFrame(beta, columns=['Intercept'] + list(cn)) + mse = np.empty(nboot) + + # set seed + np.random.seed(seed) + + for i in range(nboot): + # sample with replacement + idx = np.random.randint(0, N, N) + Xi = X[idx, :] + yi = y[idx] + + # estimate model + model = LinearRegression(fit_intercept=False) # from sklearn + mod = model.fit(Xi, yi) + + # save results + beta[i, :] = mod.coef_ + mse[i] = np.sum((mod.predict(Xi) - yi)**2) / N + + # given mean estimates, calculate MSE + y_hat = X @ beta.mean(axis=0) + final_mse = np.sum((y - y_hat)**2) / N + + output = { + 
'par': beta, + 'mse': mse, + 'final_mse': final_mse + } + + return output + +our_boot = bootstrap( + X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']], + y = df_happiness['happiness'], + nboot = 1000 +) +``` +::: + + +Here are the results of the interval estimates for the coefficients. Each parameter has the mean estimate, the lower and upper bounds of the 95% confidence interval, and the width of the interval. The bootstrap intervals are a bit wider than the OLS intervals, but for this model these should converge as the number of observations increases. + +\small +```{r} +#| echo: false +#| label: tbl-bootstrap +#| tbl-cap: Bootstrap Parameter Estimates + +tab_boot_summary = our_boot$par |> + as_tibble() |> + tidyr::pivot_longer(everything(), names_to = 'Parameter') |> + summarize( + mean = mean(value), + `Lower BS` = quantile(value, .025), + `Upper BS` = quantile(value, .975), + .by = Parameter + ) |> + mutate( + `Lower OLS` = confint(model)[, 1], + `Upper OLS` = confint(model)[, 2], + `Diff Width` = (`Upper BS` - `Lower BS`) - (`Upper OLS` - `Lower OLS`) + ) + +tab_boot_summary |> + gt(decimals = 2) |> + tab_footnote( + footnote = 'Width of bootstrap estimate minus width of OLS estimate', + locations = cells_column_labels(columns = `Diff Width`) + ) |> + tab_options( + footnotes.font.size = 10 + ) +``` +\normalsize + +Let's look more closely at the distributions for each coefficient. Standard statistical estimates assume a specific distribution like the normal. But the bootstrap method provides more flexibility, even though it often leans towards the assumed distribution. We can see these distributions aren't perfectly symmetrical like a normal distribution, but they suit our needs in that we can extract the lower and upper quantiles to create an interval estimate. + +:::{.content-visible when-format='html'} +```{r} +#| echo: false +#| label: fig-r-bootstrap +#| fig-cap: Bootstrap Distributions of Parameter Estimates + +our_boot$par |> + as_tibble() |> + pivot_longer(everything(), names_to = 'Parameter') |> + ggplot(aes(value)) + + geom_density(color = okabe_ito[2]) + + geom_point( + aes(x = `Lower BS`, group = Parameter), + y = 0, + color = okabe_ito[1], + size = 3, + alpha = 1, + data = tab_boot_summary + ) + + geom_point( + aes(x = `Upper BS`, group = Parameter), + y = 0, + color = okabe_ito[1], + size = 3, + alpha = 1, + data = tab_boot_summary + ) + + geom_segment( + aes( + x = `Lower BS`, + xend = `Upper BS`, + y = 0, + yend = 0, + group = Parameter + ), + color = okabe_ito[1], + linewidth = 1, + data = tab_boot_summary + ) + + facet_wrap(~Parameter, scales = 'free') + + labs(x = 'Parameter Value', y = 'Density') + +ggsave('img/estim-bootstrap.svg', width = 8, height = 6) +``` +::: + +:::{.content-visible when-format='pdf'} +![Bootstrap Distributions of Parameter Estimates](img/estim-bootstrap.svg){#fig-r-bootstrap} +::: + +As mentioned, the bootstrap is often used to provide uncertainty for unmodeled parameters, predictions, and other metrics. However, because we repeatedly run the model or some aspect of it over and over, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g. extreme values) or [in some data contexts](https://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails) (e.g. correlated observations) without extra considerations. 
Variants exist to help deal with some of these issues, and despite limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model. + + +## Bayesian {#sec-estim-bayes} + +The **Bayesian** approach to modeling is many things - a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about using Bayesian estimation is that it can handle model complexities that other approaches don't do well or at all. + +The basis of Bayesian estimation is the **likelihood**, the same as with maximum likelihood, and everything we did there applies here. So you need a good grasp of maximum likelihood to understand the Bayesian approach. However, the Bayesian approach is different because it also lets us use our knowledge about the parameters through **prior distributions**. For example, we may think that the coefficients for a linear model come from a normal distribution centered on zero with some variance. That would serve as our prior distribution for those parameters. + +The combination of a prior distribution with the likelihood results in the **posterior distribution**, which is a *distribution* of possible parameter values. It falls somewhere between the prior and the likelihood. With more data, it tends toward the likelihood result, and with less data, it tends toward what the prior would have suggested. The posterior distribution is what we ultimately use to make inferences about the parameters, and it can be used to estimate uncertainty in the same way as the bootstrap. + +![Prior, likelihood, posterior and distributions](img/prior2post_clean.png){#fig-bayesian-prior-posterior} + + +#### Example {.unnumbered} + +Let's do a simple example to show how this comes about. We'll use a binomial model where we have penalty kicks taken for a soccer player, and we want to estimate the probability of the player making a goal, which we'll call $\theta$. + +For our prior distribution, we'll use a [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) that has a mean of 0.5, suggesting that we think this person would have about a 50% chance of converting the kick on average. However, we will keep this prior fairly loose, with a range that spans most of the (0, 1) interval. For the likelihood, we'll use a binomial distribution. We also use this in our GLM chapter (@eq-binomial), which, as we noted earlier in this chapter, is akin to using the log loss (@sec-estim-logloss). We'll then calculate the posterior distribution for the probability of making a shot, given our prior and the evidence at hand, i.e., the data. + +Let's start with some data, and just like our other estimation approaches, we'll have some guesses for $\theta$ which represents the probability of making a goal. We'll use the prior distribution to represent our beliefs about those parameter values, assigning more weight to values around 0.5. We'll then calculate the likelihood of the data given the parameter, which will put more weight on values closer to the observed chance of scoring a goal. Finally, we calculate the posterior distribution. 
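That last step is just Bayes' theorem in action:

$$
p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})} \propto p(\text{data} \mid \theta)\, p(\theta)
$$

In the code that follows, `p_theta` is the prior, `p_data_given_theta` is the likelihood, `p_data` is the marginal probability of the data used for normalization, and `p_theta_given_data` is the posterior.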
+ +:::{.panel-tabset} + +##### R +```{r} +#| label: bayesian-demo-r +pk = c( + 'goal','goal','goal','miss','miss', + 'goal','goal','miss','goal','goal' +) + +# convert to numeric, arbitrarily picking goal=1, miss=0 + +N = length(pk) # sample size +n_goal = sum(pk == 'goal') # number of pk made +n_miss = sum(pk == 'miss') # number of those miss + +# grid of potential theta values +theta = seq( + from = 1 / (N + 1), + to = N / (N + 1), + length = 10 +) + +### prior distribution +# beta prior with mean = .5, but fairly diffuse +# examine the prior +# theta = rbeta(1000, 5, 5) +# hist(theta, main = 'Prior Distribution', xlab = 'Theta', col = 'lightblue') +p_theta = dbeta(theta, 5, 5) + +# Normalize so that values sum to 1 +p_theta = p_theta / sum(p_theta) + +# likelihood (binomial) +p_data_given_theta = choose(N, n_goal) * theta^n_goal * (1 - theta)^n_miss + +# posterior (combination of prior and likelihood) +# p_data is the marginal probability of the data used for normalization +p_data = sum(p_data_given_theta * p_theta) + +p_theta_given_data = p_data_given_theta*p_theta / p_data # Bayes theorem + +# final estimate +theta_est = sum(theta * p_theta_given_data) +theta_est +``` + +##### Python + +```{python} +#| label: bayesian-demo-py +from scipy.stats import beta + +pk = np.array([ + 'goal','goal','goal','miss','miss', + 'goal','goal','miss','goal','goal' +]) + +# convert to numeric, arbitrarily picking goal=1, miss=0 +N = len(pk) # sample size +n_goal = np.sum(pk == 'goal') # number of pk made +n_miss = np.sum(pk == 'miss') # number of those miss + +# grid of potential theta values +theta = np.linspace(1 / (N + 1), N / (N + 1), 10) + +### prior distribution +# beta prior with mean = .5, but fairly diffuse +# examine the prior +# theta = beta.rvs(5, 5, size = 1000) +# plt.hist(theta, bins = 20, color = 'lightblue') +p_theta = beta.pdf(theta, 5, 5) + +# Normalize so that values sum to 1 +p_theta = p_theta / np.sum(p_theta) + +# likelihood (binomial) +p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss + +# posterior (combination of prior and likelihood) +# p_data is the marginal probability of the data used for normalization +p_data = np.sum(p_data_given_theta * p_theta) + +p_theta_given_data = p_data_given_theta * p_theta / p_data # Bayes theorem + +# final estimate +theta_est = np.sum(theta * p_theta_given_data) +theta_est +``` +::: + +Here is the table that puts all this together. Our prior distribution is centered around a $\theta$ of 0.5 because we made it that way. The likelihood is centered closer to 0.7 because that's the observed chance of scoring a goal. The posterior distribution is a combination of the two. It gives no weight to smaller values, or to the max value. Our final estimate is `r theta_est`, which falls between the prior and likelihood values that have the most weight. With more evidence in the form of data, our estimate will shift more and more towards what the likelihood would suggest. This is a simple example, but it shows how the Bayesian approach works, and this conceptually holds for more complex parameter estimation as well. 
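As an aside, this particular prior-likelihood pairing is conjugate, meaning the posterior has a known closed form - here a beta distribution with parameters 5 + the number of goals and 5 + the number of misses. A quick R sketch to check the grid result against the exact posterior, assuming the objects created in the example above:

```{r}
#| eval: false
# exact posterior under the conjugate beta-binomial setup;
# n_goal and n_miss come from the example above
post_a = 5 + n_goal
post_b = 5 + n_miss

post_a / (post_a + post_b)              # exact posterior mean
qbeta(c(.025, .975), post_a, post_b)    # a 95% credible interval
```

For most real models no such closed form exists, which is why sampling methods like MCMC, discussed shortly, are the standard way to obtain the posterior.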
+ +```{r} +#| echo: false +#| label: tbl-bayesian-demo +#| tbl-cap: Bayesian Demo Results + +tibble( + theta = theta, + prior = p_theta, + like = p_data_given_theta, + post = p_theta_given_data +) |> + gt() |> + tab_style( + style = cell_text(weight = "bold"), + locations = list( + cells_body( + columns = prior, + rows = which.max(prior) + ), + cells_body( + columns = like, + rows = which.max(like) + ), + cells_body( + columns = post, + rows = which.max(post) + ) + ) + ) +``` + + +:::{.callout-note title='Priors as Regularization' collapse='true'} +In the context of penalized estimation and machine learning, the prior distribution can be thought of as a form of **regularization** (See @sec-estim-penalty and @sec-ml-regularization later). In this context, the prior shrinks the estimate, pulling the parameter estimates towards it, just like the penalty parameter does in the penalized estimation methods. In fact, many penalized methods can be thought of as a Bayesian approach with a specific prior distribution. An example would be ridge regression, which can be thought of as a Bayesian linear regression with a normal prior distribution for the coefficients. The variance of the prior is inversely related to the ridge penalty parameter. +::: + +#### Application {.unnumbered} + +Just like with the bootstrap which also provided distributions for the parameters, we can use the Bayesian approach to understand how certain we are about our estimates. We can look at any range of values in the posterior distribution to get what is often referred to as a **credible interval**, which is the Bayesian equivalent of a confidence interval[^confintinterp]. Here is an example of the posterior distribution for the parameters of our happiness model, along with 95% intervals[^brms]. + +[^brms]: We used the R package for [brms]{.pack} for these results. + +[^confintinterp]: Many people's default interpretation of a standard confidence interval is usually something like 'the range we expect the parameter to reside within'. Unfortunately, that's not quite right, though it is how you interpret the Bayesian interval. The frequentist confidence interval is a range that, if we were to repeat the experiment/data collection many times, contains the true parameter value a certain percentage of the time. For the Bayesian, the parameter is assumed to be random, and so the interval is that which we expect the parameter to fall within a certain percentage of the time. The Bayesian is probably a bit more intuitive for most, even if it's not the more widely used. 
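For reference, here is a sketch of how the underlying model was fit with [brms]{.pack}, using relatively tight normal priors on the coefficients. We don't evaluate it here, since sampling can take a little while:

```{r}
#| eval: false
# sketch of the brms model behind the following results
library(brms)

bayes_mod = brm(
    happiness ~ life_exp_sc + gdp_pc_sc + corrupt_sc,
    data  = df_happiness,
    prior = prior(normal(0, 1), class = 'b') # normal prior on the coefficients
)

summary(bayes_mod) # posterior summaries, credible intervals, and diagnostics
```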
+ +:::{.content-visible when-format='html'} +```{r} +#| echo: false +#| label: fig-r-bayesian-posterior +#| fig-cap: Posterior Distribution of Parameters + +# bayes_mod = brms::brm( +# happiness ~ life_exp_sc + gdp_pc_sc + corrupt_sc, +# data = df_happiness, +# prior = c( +# brms::prior(normal(0, 1), class = 'b') +# ), +# thin = 8, +# ) + +# save( +# bayes_mod, +# file = 'estimation/data/brms_happiness.RData' +# ) + +load('estimation/data/brms_happiness.RData') + +p_dat = bayes_mod |> + tidybayes::spread_draws(b_Intercept, b_life_exp_sc, b_gdp_pc_sc, b_corrupt_sc) |> + select(-.chain, -.draw) |> + pivot_longer(-.iteration, names_to = 'Parameter') |> + mutate(Parameter = str_remove(Parameter, 'b_')) + +p_intervals = summary(bayes_mod)$fixed |> + as_tibble(rownames = 'Parameter') |> + rename( + value = Estimate, + lower = `l-95% CI`, + upper = `u-95% CI` + ) + + +p_dat |> + mutate(Parameter = factor(Parameter, unique(Parameter))) |> + ggplot(aes(value)) + + geom_density(color = okabe_ito[2]) + + # add credible interval + geom_point( + aes(x = lower, group = Parameter), + y = 0, + color = okabe_ito[1], + size = 3, + alpha = 1, + data = p_intervals + ) + + geom_point( + aes(x = upper, group = Parameter), + y = 0, + color = okabe_ito[1], + size = 3, + alpha = 1, + data = p_intervals + ) + + geom_segment( + aes( + x = lower, + xend = upper, + y = 0, + yend = 0, + group = Parameter + ), + color = okabe_ito[1], + size = 1, + data = p_intervals + ) + + facet_wrap(~factor(Parameter), scales = 'free') + + labs( + x = 'Parameter Value', + y = 'Density', + caption = '95% Credible Interval shown underneath the density plot' + ) + +ggsave('img/estim-bayesian-posterior.svg', width = 8, height = 6) +``` +::: + +:::{.content-visible when-format='pdf'} +![Posterior Distribution of Parameters](img/estim-bayesian-posterior.svg){#fig-r-bayesian-posterior} +::: + + +With Bayesian estimation we also provide starting values for the algorithm, which is a form of Monte Carlo estimation[^mcmc], to get things going. We also typically specify a number of iterations, or times the model will run, as the **stopping rule**. Each iteration gives us a new guess for each parameter, which amounts to a random draw from the posterior distribution. With more iterations the model takes longer to run, but the length often reflects the complexity of the model. + +[^mcmc]: The most common method for the Bayesian approach is Markov Chain Monte Carlo (MCMC), which is a way to sample from the posterior distribution. There are many MCMC algorithms, many of which are a form of the now fairly old Metropolis-Hastings algorithm, which you can find a demo of at [Michael's doc](https://m-clark.github.io/models-by-example/metropolis-hastings.html). + +We also specify multiple **chains**, which do the same estimation procedure, but due to the random nature of the Bayesian approach and starting point, take different estimation paths[^dlchains]. We can then compare the chains to see if they are converging to the same result, which is a check on the model. If they are not converging, we may need to run the model longer, or it may indicate a problem with how we set up the model. + +Here's an example of the four chains for our happiness model for the life expectancy coefficient. The chains bounce around a bit from one iteration to the next, but on average, they're giving very similar results, so we know the model is working well. 
Nowadays, we have default statistics in the output that also provide this information, which makes it easier to quickly check convergence for many parameters. + +[^dlchains]: Some deep learning implementations will use multiple random starts for similar reasons. + +:::{.content-visible when-format='html'} +```{r} +#| echo: false +#| label: fig-r-bayesian-chains +#| fig-cap: Bayesian Chains for Life Expectancy Coefficient + +p_dat = bayes_mod |> + tidybayes::spread_draws(b_life_exp_sc) |> + select(-.draw) |> + pivot_longer(-c(.chain, .iteration), names_to = 'Parameter') |> + mutate( + Parameter = str_remove(Parameter, 'b_'), + .chain = factor(.chain) + ) + + +p_dat |> + ggplot(aes(.iteration, value)) + + geom_hline(aes(yintercept = mean(value)), color = 'darkred', linewidth = 1) + + geom_line(aes(color = .chain), alpha = .75) + + scale_color_manual(values = unname(okabe_ito)) + + labs( + x = 'Iteration', + y = 'Coefficient', + caption = 'Dark line is the mean value for the life expectancy coefficient.' + ) + +# for this plot it's just easier to make a separate plot for print +p_print = p_dat |> + ggplot(aes(.iteration, value)) + + geom_hline(aes(yintercept = mean(value)), color = 'black', linewidth = 1) + + geom_line(aes(color = .chain, linetype = .chain), linewidth = 1) + + annotate( + 'text', + x = 5, + y = 0.8, + label = 'Each line type represents a separate chain', + color = 'black', + hjust = 0, + size = 3 + ) + + scale_color_grey() + # INTENTIONALLY LEFT GRAY + labs( + x = 'Iteration', + y = 'Coefficient', + caption = 'Horizontal line is the mean value.' + ) + + theme(legend.position = 'none') + +ggsave('img/estim-bayesian-chains.svg', p_print, width = 8, height = 6) + +``` +::: + +:::{.content-visible when-format='pdf'} +![Bayesian Chains for Life Expectancy Coefficient](img/estim-bayesian-chains.svg) +::: + +When we are interested in making predictions, we can use the results to generate a distribution of possible predictions *for each observation*, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as **posterior predictive distribution**, which is explored in a non-bayesian context in @sec-knowing-model-vis. Here is a plot of several draws of predicted values against the true happiness scores. + +:::{.content-visible when-format='html'} +```{r} +#| echo: false +#| label: fig-r-bayesian-posterior-predictive +#| fig-cap: Posterior Predictive Distribution of Happiness Values +p_dat = brms::pp_check(bayes_mod)$data |> + mutate(source = ifelse(is_y, 'Observed', 'Predicted')) |> + select(rep_id, source, value) + +p_dat |> + ggplot(aes(value, color = source, group = rep_id)) + + stat_density( + aes(color = source, group = rep_id, linewidth = I(ifelse(source == 'Observed', 2, .25))), + position = 'identity', + geom = 'borderline', + # show.legend = FALSE + ) + + scale_color_manual(values = unname(okabe_ito[c(1,5)])) + + labs( + x = 'Happiness Value', + y = 'Density', + # caption = 'Observed values are in black' + ) + +p_print = p_dat |> + ggplot(aes(value, color = source, group = rep_id)) + + stat_density( + aes(color = source, group = rep_id, linewidth = I(ifelse(source == 'Observed', 2, .25))), + position = 'identity', + geom = 'borderline', + show.legend = FALSE + ) + + scale_color_manual(values = unname(okabe_ito[c(1,5)])) + + labs( + x = 'Happiness Value', + y = 'Density', + caption = 'Thin lines predicted values, thick line represents the observed target values.' 
+ ) + +ggsave('img/estim-bayesian-posterior-predictive.svg', p_print, width = 8, height = 6) +``` +::: + +:::{.content-visible when-format='pdf'} +![Posterior Predictive Distribution of Happiness Values](img/estim-bayesian-posterior-predictive.svg) +::: + + +With the Bayesian approach, every metric we calculate has a range of possible values, not just one. For example, if you have a classification model and want to know the accuracy, AUROC, or true positive rate of the model. Instead of a single number, you would now have access to a whole distribution of values for that metric. How? For each possible set of model parameters from the posterior distribution, we apply those values and model to data to make a prediction. We can then assign it to a class, and compare it to the actual class. This gives us a range of possible predictions and classes. We can then calculate metrics like accuracy or true positive rate for each possible prediction set. As an example, we did this for our happiness model with a numeric target to obtain the interval estimate for R-squared. Pretty neat! + + +```{r} +#| echo: false +#| label: tbl-r-bayesian-metrics +#| tbl-cap: Bayesian R^2^ + +bayes_r2 = performance::r2(bayes_mod) # not very friendly object returned + +tibble(r2 = bayes_r2$R2_Bayes, as_tibble(attr(bayes_r2, 'CI')$R2_Bayes)) |> + select(-CI) |> + rename( + `Bayes R2` = r2, + `Lower` = CI_low, + `Upper` = CI_high + ) |> + gt() |> + tab_footnote( + footnote = '95% Credible interval for R-squared', + # locations = cells_column_labels(columns = `Bayes R2`) + ) |> + tab_options( + footnotes.font.size = 10 + ) +``` + + + +:::{.callout-note title='Frequentist PP check' collapse='true'} +As we saw in @sec-knowing-model-vis, nothing is keeping you from doing 'predictive checks' with other estimation approaches, and it's a very good idea to do so. For example, with a GLM you can use Monte Carlo simulation or the Bootstrap to generate a distribution of predictions, and then compare that to the actual data. This is a good way to check the model's assumptions and see if it's doing what you think it's doing. It's more straightforward with the Bayesian approach, since many modeling packages will do it for you with little effort. +::: + + + +#### Additional Thoughts + +It turns out that any standard (frequentist) statistical model can be seen as a Bayesian one from a certain point of view[^obi]. Here are a couple. + +[^obi]: Cue [Obi Wan Kenobi](https://www.youtube.com/watch?v=pSOBeD1GC_Y). + +- GLM and related models estimated via maximum likelihood: Bayesian estimation with a flat/uniform prior on the parameters. +- Ridge Regression: Bayesian estimation with a normal prior on the coefficients, penalty parameter is related to the variance of the prior. +- Lasso Regression: Bayesian estimation with a Laplace prior on the coefficients, penalty parameter is related to the variance of the prior. +- Mixed Models: random effects are, as the name suggests, random, and so are estimated as a distribution of possible values, which is conceptually in line with the Bayesian approach. + + + +So, in many modeling contexts, you're actually doing a restrictive form of Bayesian estimation already. + +The Bayesian approach is very flexible, and can be used for many different types of models, and can be used to get at uncertainty in a model in ways that other approaches can't. 
It's not always the best approach though. Even when it's appropriate, the computational burden and the added diagnostic complexity can be substantial. Still, it's a good one to have in your toolbox[^rpybayes], and hopefully we've helped to demystify the Bayesian approach a bit here so that you feel more comfortable trying it out.

[^rpybayes]: R has excellent tools here for modeling and post-processing, like [brms]{.pack} and [tidybayes]{.pack}, and Python has [pymc3]{.pack}, [numpyro]{.pack}, and [arviz]{.pack}, which are also useful. Honestly, R has far more going on here, with many packages devoted to Bayesian estimation of specific types of models, but if you want to stick with Python, the situation has improved notably in recent years.


## Conformal Methods {#sec-estim-conformal}

Conformal approaches bring us back to the frequentist world, and they specifically concern prediction uncertainty. One of the primary strengths of the approach is that it is model agnostic and can, in principle, work with any model, from linear regression to deep learning. Like the bootstrap and Bayesian methods, conformal prediction makes us think in terms of distributions of possible values, but it focuses on the errors in prediction.

The idea is that we can estimate the uncertainty in our predictions by looking at the distribution of the prediction errors. Using the observed prediction error on a calibration set that was not used to train the model, we order those errors and find the quantile corresponding to the desired coverage/error rate[^errorcov]. When predicting on new data, we assume the predictions and corresponding errors come from a distribution similar to what we saw in the training/calibration process, without any particular assumption about the form of that distribution. We then use the estimated quantile to create upper and lower bounds for the prediction for a new observation.

While the implementation for various settings can get quite complicated, the conceptual approach is mostly straightforward. As an example, we can demonstrate the **split-conformal** procedure with the following steps.


1. **Split Data**: Split the dataset into training and calibration sets.
1. **Train Model**: Train the model using the training set.
1. **Calculate Scores**: Calculate conformity scores on the calibration set. These are the *absolute* residuals between the predicted and actual values on the calibration set.
1. **Quantile Calculation**: Determine the quantile of the conformity scores corresponding to the desired confidence level.
1. **Generate Intervals**: For new data points, use the trained model to make predictions, then add and subtract the quantile value obtained from the conformity scores to get the lower and upper bounds of the prediction intervals.


[^errorcov]: The error rate ($\alpha$) is the proportion of the data that would fall outside the prediction interval, while the coverage rate/interval is 1 - $\alpha$.

Let's now demonstrate the split-conformal method with our happiness model. We'll start by defining the split-conformal function. The function takes the training data and the target variable, along with an $\alpha$ value, which is the error rate we want to control, and a calibration split, which is the proportion of the data used for calibration. 
And finally, we designate new data for which we want to make predictions. + +:::{.panel-tabset} + +##### R + + +```{r} +#| label: split-conformal-r-function +split_conformal = function( + X, + y, + new_data, + alpha = .05, + calibration_split = .5 +) { + # Splitting the data into training and calibration sets + idx = sample(1:nrow(X), size = floor(nrow(X) / 2)) + + train_data = X |> slice(idx) + cal_data = X |> slice(-idx) + train_y = y[idx] + cal_y = y[-idx] + + N = nrow(train_data) + + # Train the base model + model = lm(train_y ~ ., data = train_data) + + # Calculate residuals on calibration set + cal_preds = predict(model, newdata = cal_data) + residuals = abs(cal_y - cal_preds) + + # Sort residuals and find the quantile corresponding to (1-alpha) + residuals = sort(residuals) + quantile = quantile(residuals, (1 - alpha) * (N / (N + 1))) + + # Make predictions on new data and calculate prediction intervals + preds = predict(model, newdata = new_data) + lower_bounds = preds - quantile + upper_bounds = preds + quantile + + # Return predictions and prediction intervals + return( + list( + cp_error = quantile, + preds = preds, + lower_bounds = lower_bounds, + upper_bounds = upper_bounds + ) + ) +} +``` + +##### Python + +```{python} +#| label: split-conformal-py-function +def split_conformal(X, y, new_data, alpha = .05, calibration_split = .5): + # Splitting the data into training and calibration sets + X_train, X_cal, y_train, y_cal = train_test_split( + X, + y, + test_size = calibration_split, + random_state = 123 + ) + + N = X_train.shape[0] + + # Train the base model + model = LinearRegression().fit(X_train, y_train) + + # Calculate residuals on calibration set + cal_preds = model.predict(X_cal) + residuals = np.abs(y_cal - cal_preds) + + # Sort residuals and find the quantile corresponding to (1-alpha) + residuals = np.sort(residuals) + + # The correction here is useful for small sample sizes + quantile = np.quantile(residuals, (1 - alpha) * (N / (N + 1))) + + # Make predictions on new data and calculate prediction intervals + preds = model.predict(new_data) + lower_bounds = preds - quantile + upper_bounds = preds + quantile + + # Return predictions and prediction intervals + return { + 'cp_error': quantile, + 'preds': preds, + 'lower_bounds': lower_bounds, + 'upper_bounds': upper_bounds + } +``` + +::: + + +With our functions in place, we can now use them to calculate the prediction intervals for the happiness model. The `cp_error` value gives us the quantile value that we use to generate the prediction intervals. Raw result is not shown, but the subsequent table shows the first few predictions and their corresponding prediction intervals. 
+ +:::{.panel-tabset} + +##### R + +```{r} +#| label: split-conformal-r-calculate +#| results: hide +# split data +set.seed(123) + +idx_train = sample(nrow(df_happiness), nrow(df_happiness) * .8) +idx_test = setdiff(1:nrow(df_happiness), idx_train) + +df_train = df_happiness |> + slice(idx_train) |> + select(happiness, life_exp_sc, gdp_pc_sc, corrupt_sc) + +y_train = df_happiness$happiness[idx_train] + +df_test = df_happiness |> + slice(idx_test) |> + select(life_exp_sc, gdp_pc_sc, corrupt_sc) + +y_test = df_happiness$happiness[idx_test] + +# apply the function +cp_error = split_conformal( + df_train |> select(-happiness), + y_train, + df_test, + alpha = .1 +) + +# cp_error[['cp_error']] + +tibble( + prediction = cp_error[['preds']], + lower_bounds = cp_error[['lower_bounds']], + upper_bounds = cp_error[['upper_bounds']] +) |> + head() +``` + +##### Python + +```{python} +#| label: split-conformal-py-calculate +#| results: hide +# split data +X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']] +y = df_happiness['happiness'] + +X_train, X_test, y_train, y_test = train_test_split( + df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']], + df_happiness['happiness'], + test_size = 0.5, + random_state = 123 +) + +our_cp_error = split_conformal( + X_train, + y_train, + X_test, + alpha = .1 +) + +# print(our_cp_error['cp_error']) + +pd.DataFrame({ + 'prediction': our_cp_error['preds'], + 'lower_bounds': our_cp_error['lower_bounds'], + 'upper_bounds': our_cp_error['upper_bounds'] +}).head() +``` + +::: + + + +```{r} +#| echo: false +#| label: tbl-split-conformal +#| tbl-cap: Split-Conformal Prediction Intervals + +tibble( + prediction = cp_error[['preds']], + lower_bounds = cp_error[['lower_bounds']], + upper_bounds = cp_error[['upper_bounds']] +) |> + head() |> + gt() |> + tab_footnote( + 'Result based on the R code' + ) |> + tab_options( + # table.width = "50%" + ) +``` + + +```{python} +#| eval: false +#| echo: false +#| label: MAPIE-comparison +from mapie.regression import MapieRegressor + +X = df_happiness[['life_exp_sc', 'gdp_pc_sc', 'corrupt_sc']] +y = df_happiness['happiness'] + +initial_fit = LinearRegression().fit(X, y) +model = MapieRegressor(initial_fit, method = 'naive') +y_pred, y_pis = model.fit(X, y).predict(X_test, alpha = 0.1) + +# take the first difference between upper and lower bounds, +# since it's constant for all predictions in this setting +cp_error = (y_pis[0, 1, 0] - y_pis[0, 0, 0]) / 2 +``` + +As a method of uncertainty estimation, conformal prediction is not without its challenges. It is computationally intensive for large datasets or complex models. There are multiple variants of conformal prediction, most of which attempt to alleviate a deficiency of simpler approaches. But they generally further increase the computational burden. + +Conformal prediction still relies on the assumptions about the data and the underlying model, and violations of these assumptions can lead to invalid prediction intervals. Furthermore, conformal prediction methods assume that the training and test data come from the same distribution, which may not always be the case in real-world applications due to distribution shifts or domain changes. In addition, validation sets must be viable splits of the data, which default splitting methods may not always provide. In general, for conformal prediction provides an alternative to other frequentist or Bayesian approaches that, under the right circumstances, may produce a better estimate of uncertainty, but does not come for free. 
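Whichever approach you use to get prediction intervals, a useful sanity check is the empirical coverage on held-out data, that is, the proportion of observed values that actually fall inside their intervals. A quick sketch using the objects from the R example above, where with $\alpha$ = .1 we would hope for roughly 90% coverage:

```{r}
#| eval: false
# proportion of test observations that fall within their prediction interval;
# cp_error and y_test come from the R example above
mean(
    y_test >= cp_error$lower_bounds &
    y_test <= cp_error$upper_bounds
)
```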
+
+
+
+## Wrapping Up {#sec-estim-wrap}
+
+Understanding uncertainty is key to assessing the quality of your model. It's not just about the point estimate, or getting a prediction, but also about how confident you are in that value. We've covered several avenues, from the basics of estimation to the more complex Bayesian and conformal methods. If the model provides a standard statistical solution, take it. Otherwise, the bootstrap is easy to understand and implement. Bayesian methods are more complex, but can provide more information about the uncertainty in your model. Conformal prediction is a good choice when you want to make predictions without strong assumptions about the underlying model, and may be the best option for many contexts.
+
+
+
+We hope you now have a better understanding of how to estimate uncertainty in your models, and how to use that information to make better decisions.
+
+
+### The common thread {#sec-estim-thread}
+
+No model is without uncertainty, so any of these techniques may be applicable to your work. The choice of method depends on the complexity of the model, the amount of data, and the resources you're willing to spend.
+
+
+
+
+### Choose your own adventure {#sec-estim-choose}
+
+This chapter colors all others that focus on specific modeling techniques. You can think about how you might implement uncertainty estimation for any of them.
+
+
+### Additional resources {#sec-estim-add-res}
+
+**Frequentist Approaches**:
+
+- Most statistical texts cover uncertainty estimation from the frequentist perspective. Pick one you like.
+- [Error Statistics](https://errorstatistics.com/about-2/) Deborah Mayo's blog and comments on other blogs have always provided a strong philosophical defense of frequentist statistics.
+
+
+**Monte Carlo**:
+
+- [Monte Carlo Methods](https://www.youtube.com/watch?v=OgO1gpXSUzU) John Guttag's MIT Course lecture on YouTube.
+
+**Bootstrap**:
+
+Classical treatments:
+
+- @efron_introduction_1994
+- @davison_bootstrap_1997
+
+A more fun demo:
+
+- [Bootstrapping Main Ideas](https://www.youtube.com/watch?v=sDv4f4s2SB8) @statquest_with_josh_starmer_bootstrapping_2021
+
+**Bayesian**:
+
+- Bayesian Data Analysis @gelman_bayesian_2013. For many, this is the Bayesian bible.
+- Statistical Rethinking @mcelreath_statistical_2020. A fantastic modeling book, Bayesian or otherwise.
+- [Choosing priors](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) + +**Conformal Prediction**: + +General: + +- A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification (@angelopoulos_gentle_2022); [Example Python Notebooks](https://github.com/aangelopoulos/conformal-prediction) + +R demos: + +- [Conformal inference for regression models](https://www.tidymodels.org/learn/models/conformal-regression/) +- [Conformal prediction](https://marginaleffects.com/vignettes/conformal.html) + +Python demos: + +- Introduction To Conformal Prediction With Python (@molnar_introduction_2024) +- [Mapie Docs](https://mapie.readthedocs.io/en/latest/) From dd9563db8ea0a90cb8a6c33f8934b1f8ca055ee0 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 24 Nov 2024 14:11:49 -0500 Subject: [PATCH 03/19] uncertainty related notebook, quarto update --- _quarto.yml | 1 + .../uncertainty.ipynb | 1462 +++++++++++++++++ .../r_chapter_notebooks/uncertainty.qmd | 303 ++++ 3 files changed, 1766 insertions(+) create mode 100644 chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb create mode 100644 chapter-notebooks/r_chapter_notebooks/uncertainty.qmd diff --git a/_quarto.yml b/_quarto.yml index 89f3db7..9f15877 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -25,6 +25,7 @@ book: - understanding_models.qmd - understanding_features.qmd - estimation.qmd + - uncertainty.qmd - generalized_linear_models.qmd - linear_model_extensions.qmd # - part: "Machine Learning" diff --git a/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb b/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb new file mode 100644 index 0000000..838ac89 --- /dev/null +++ b/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb @@ -0,0 +1,1462 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Uncertainty" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "\n", + "import statsmodels.api as sm\n", + "import statsmodels.formula.api as smf\n", + "\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "\n", + "from scipy import stats" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "df_happiness = (\n", + " pd.read_csv('https://tinyurl.com/worldhappiness2018')\n", + " .dropna()\n", + " .rename(columns = {'happiness_score': 'happiness'})\n", + " .filter(regex = '_sc|country|happ')\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Frequentist" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \\\n", + "0 3.987671 0.132758 3.724521 4.250820 2.736966 \n", + "1 5.496638 0.104065 5.290363 5.702914 4.256653 \n", + "2 5.676520 0.087470 5.503139 5.849901 4.441580 \n", + "3 5.406585 0.107000 5.194492 5.618678 4.165618 \n", + "4 6.966640 0.126756 6.715389 7.217892 5.718385 \n", + ".. ... ... ... ... ... \n", + "107 5.861256 0.077897 5.706850 6.015661 4.628837 \n", + "108 5.290368 0.147161 4.998669 5.582067 4.033347 \n", + "109 5.327998 0.083659 5.162170 5.493825 4.094096 \n", + "110 4.308105 0.101039 4.107828 4.508383 3.069103 \n", + "111 4.267325 0.098283 4.072511 4.462139 3.029194 \n", + "\n", + " obs_ci_upper \n", + "0 5.238375 \n", + "1 6.736624 \n", + "2 6.911459 \n", + "3 6.647552 \n", + "4 8.214896 \n", + ".. ... \n", + "107 7.093674 \n", + "108 6.547389 \n", + "109 6.561899 \n", + "110 5.547107 \n", + "111 5.505455 \n", + "\n", + "[112 rows x 6 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model = smf.ols(\n", + " 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',\n", + " data = df_happiness\n", + ").fit()\n", + "\n", + "model.conf_int()\n", + "\n", + "model.get_prediction().summary_frame() # both 'confidence' and 'prediction' intervals" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \\\n", + "0 3.987671 0.132758 3.724521 4.250820 2.736966 \n", + "1 5.496638 0.104065 5.290363 5.702914 4.256653 \n", + "2 5.676520 0.087470 5.503139 5.849901 4.441580 \n", + "3 5.406585 0.107000 5.194492 5.618678 4.165618 \n", + "4 6.966640 0.126756 6.715389 7.217892 5.718385 \n", + ".. ... ... ... ... ... \n", + "107 5.861256 0.077897 5.706850 6.015661 4.628837 \n", + "108 5.290368 0.147161 4.998669 5.582067 4.033347 \n", + "109 5.327998 0.083659 5.162170 5.493825 4.094096 \n", + "110 4.308105 0.101039 4.107828 4.508383 3.069103 \n", + "111 4.267325 0.098283 4.072511 4.462139 3.029194 \n", + "\n", + " obs_ci_upper \n", + "0 5.238375 \n", + "1 6.736624 \n", + "2 6.911459 \n", + "3 6.647552 \n", + "4 8.214896 \n", + ".. ... \n", + "107 7.093674 \n", + "108 6.547389 \n", + "109 6.561899 \n", + "110 5.547107 \n", + "111 5.505455 \n", + "\n", + "[112 rows x 6 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.get_prediction().summary_frame()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By hand approach" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " prediction lower upper\n", + "0 3.987671 3.724521 4.250820\n", + "1 5.496638 5.290363 5.702914\n", + "2 5.676520 5.503139 5.849901\n", + "3 5.406585 5.194492 5.618678\n", + "4 6.966640 6.715389 7.217892" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X = model.model.exog\n", + "\n", + "# get the prediction\n", + "y_hat = X @ model.params\n", + "\n", + "# get the standard error\n", + "se = np.sqrt(np.diag(X @ model.cov_params() @ X.T))\n", + "\n", + "# critical value for 95% confidence\n", + "cv = stats.t.ppf(0.975, model.df_resid)\n", + "\n", + "# get the interval\n", + "pd.DataFrame({\n", + " 'prediction': y_hat,\n", + " 'lower': y_hat - cv * se,\n", + " 'upper': y_hat + cv * se\n", + "}).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " prediction lower upper\n", + "0 3.987671 2.736966 5.238375\n", + "1 5.496638 4.256653 6.736624\n", + "2 5.676520 4.441580 6.911459\n", + "3 5.406585 4.165618 6.647552\n", + "4 6.966640 5.718385 8.214896" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# get the prediction interval\n", + "se_pred = np.sqrt(se**2 + model.mse_resid)\n", + "\n", + "pd.DataFrame({\n", + " 'prediction': y_hat,\n", + " 'lower': y_hat - cv * se_pred,\n", + " 'upper': y_hat + cv * se_pred\n", + "}).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \\\n", + "0 3.987671 0.132758 3.724521 4.250820 2.736966 \n", + "1 5.496638 0.104065 5.290363 5.702914 4.256653 \n", + "2 5.676520 0.087470 5.503139 5.849901 4.441580 \n", + "3 5.406585 0.107000 5.194492 5.618678 4.165618 \n", + "4 6.966640 0.126756 6.715389 7.217892 5.718385 \n", + "\n", + " obs_ci_upper \n", + "0 5.238375 \n", + "1 6.736624 \n", + "2 6.911459 \n", + "3 6.647552 \n", + "4 8.214896 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.get_prediction().summary_frame().head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Monte Carlo" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# we'll use the model from the previous section\n", + "model = smf.ols(\n", + " 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',\n", + " data = df_happiness\n", + ").fit()\n", + "\n", + "def mc_predictions(model, nsim=2500, seed=42):\n", + " np.random.seed(seed)\n", + "\n", + " params_est = model.params\n", + " params = np.random.multivariate_normal(\n", + " mean = params_est,\n", + " cov = model.cov_params(),\n", + " size = nsim\n", + " )\n", + "\n", + " sigma = model.mse_resid**.5\n", + " X = model.model.exog\n", + "\n", + " y_hat = X @ params.T + np.random.normal(scale = sigma, size = (X.shape[0], nsim))\n", + "\n", + " pred_int = np.quantile(y_hat, q = [.025, .975], axis = 1)\n", + "\n", + " return pred_int\n", + "\n", + "our_mc = mc_predictions(model)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " observed_value prediction simulated_lower simulated_upper \\\n", + "0 3.632 3.988 2.770 5.197 \n", + "1 4.586 5.497 4.278 6.759 \n", + "2 6.388 5.677 4.451 6.889 \n", + "3 4.321 5.407 4.218 6.666 \n", + "4 7.272 6.967 5.733 8.167 \n", + ".. ... ... ... ... \n", + "107 6.379 5.861 4.670 7.090 \n", + "108 6.096 5.290 4.057 6.547 \n", + "109 4.806 5.328 4.126 6.530 \n", + "110 4.377 4.308 3.097 5.531 \n", + "111 3.692 4.267 3.037 5.523 \n", + "\n", + " statsmodels_lower statsmodels_upper \n", + "0 2.737 5.238 \n", + "1 4.257 6.737 \n", + "2 4.442 6.911 \n", + "3 4.166 6.648 \n", + "4 5.718 8.215 \n", + ".. ... ... \n", + "107 4.629 7.094 \n", + "108 4.033 6.547 \n", + "109 4.094 6.562 \n", + "110 3.069 5.547 \n", + "111 3.029 5.505 \n", + "\n", + "[112 rows x 6 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Statsmodels Prediction Intervals\n", + "prediction_intervals = model.get_prediction().summary_frame()\n", + "statsmodels_lower = prediction_intervals['obs_ci_lower']\n", + "statsmodels_upper = prediction_intervals['obs_ci_upper']\n", + "\n", + "\n", + "pd.DataFrame({\n", + " 'observed_value': df_happiness['happiness'],\n", + " 'prediction': model.fittedvalues,\n", + " 'simulated_lower': our_mc[0],\n", + " 'simulated_upper': our_mc[1],\n", + " 'statsmodels_lower': statsmodels_lower,\n", + " 'statsmodels_upper': statsmodels_upper\n", + "}).round(3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bootstrap" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "def bootstrap(X, y, nboot=100, seed=123):\n", + " # add a column of 1s for the intercept\n", + " X = np.c_[np.ones(X.shape[0]), X]\n", + " N = X.shape[0]\n", + "\n", + " # initialize\n", + " beta = np.empty((nboot, X.shape[1]))\n", + " \n", + " # beta = pd.DataFrame(beta, columns=['Intercept'] + list(cn))\n", + " mse = np.empty(nboot) \n", + "\n", + " # set seed\n", + " np.random.seed(seed)\n", + "\n", + " for i in range(nboot):\n", + " # sample with replacement\n", + " idx = np.random.randint(0, N, N)\n", + " Xi = X[idx, :]\n", + " yi = y[idx]\n", + "\n", + " # estimate model\n", + " model = LinearRegression(fit_intercept=False)\n", + " mod = model.fit(Xi, yi)\n", + "\n", + " # save results\n", + " beta[i, :] = mod.coef_\n", + " mse[i] = np.sum((mod.predict(Xi) - yi)**2) / N\n", + "\n", + " # given mean estimates, calculate MSE\n", + " y_hat = X @ beta.mean(axis=0)\n", + " final_mse = np.sum((y - y_hat)**2) / N\n", + "\n", + " output = {\n", + " 'par': beta,\n", + " 'mse': mse,\n", + " 'final_mse': final_mse\n", + " }\n", + "\n", + " return output\n", + "\n", + "our_boot = bootstrap(\n", + " X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],\n", + " y = df_happiness['happiness'],\n", + " nboot = 1000\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 5.34092479, 0.27665819, -0.29543964, 0.19114177])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.percentile(our_boot['par'], 2.5, axis=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " param mean lwr upr\n", + "0 Intercept 5.451703 5.340925 5.572782\n", + "1 life_exp_sc 0.511917 0.276658 0.754842\n", + "2 corrupt_sc -0.106482 -0.295440 0.080125\n", + "3 gdp_pc_sc 0.459829 0.191142 0.776553" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.DataFrame({\n", + " 'param': ['Intercept', 'life_exp_sc', 'corrupt_sc', 'gdp_pc_sc'],\n", + " 'mean': our_boot['par'].mean(axis=0),\n", + " 'lwr': np.percentile(our_boot['par'], 2.5, axis=0),\n", + " 'upr': np.percentile(our_boot['par'], 97.5, axis=0)\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Bayesian" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/x6/4jhswqxj0sqf_gkgq6lw6l880000gn/T/ipykernel_2286/1473881323.py:27: DeprecationWarning: `np.math` is a deprecated alias for the standard library `math` module (Deprecated Numpy 1.25). Replace usages of `np.math` with `math`\n", + " p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss\n" + ] + }, + { + "data": { + "text/plain": [ + "0.599999996503221" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from scipy.stats import beta\n", + "\n", + "pk = np.array([\n", + " 'goal','goal','goal','miss','miss',\n", + " 'goal','goal','miss','goal','goal'\n", + "])\n", + "\n", + "# convert to numeric, arbitrarily picking goal=1, miss=0\n", + "N = len(pk) # sample size\n", + "n_goal = np.sum(pk == 'goal') # number of pk made\n", + "n_miss = np.sum(pk == 'miss') # number of those miss\n", + "\n", + "# grid of potential theta values\n", + "theta = np.linspace(1 / (N + 1), N / (N + 1), 10)\n", + "\n", + "### prior distribution\n", + "# beta prior with mean = .5, but fairly diffuse\n", + "# examine the prior\n", + "# theta = beta.rvs(5, 5, size = 1000)\n", + "# plt.hist(theta, bins = 20, color = 'lightblue')\n", + "p_theta = beta.pdf(theta, 5, 5)\n", + "\n", + "# Normalize so that values sum to 1\n", + "p_theta = p_theta / np.sum(p_theta)\n", + "\n", + "# likelihood (binomial)\n", + "p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss\n", + "\n", + "# posterior (combination of prior and likelihood)\n", + "# marginal probability of the data used for normalization\n", + "p_data = np.sum(p_data_given_theta * p_theta) \n", + "\n", + "p_theta_given_data = p_data_given_theta * p_theta / p_data # Bayes theorem\n", + "\n", + "# final estimate\n", + "theta_est = np.sum(theta * p_theta_given_data)\n", + "theta_est" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conformal" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "def split_conformal(X, y, new_data, alpha = .05, calibration_split = .5):\n", + " # Splitting the data into training and calibration sets\n", + " X_train, X_cal, y_train, y_cal = train_test_split(\n", + " X, \n", + " y,\n", + " test_size = calibration_split,\n", + " random_state = 123\n", + " )\n", + "\n", + " N = X_train.shape[0]\n", + "\n", + " # Train the base model\n", + " model = LinearRegression().fit(X_train, y_train)\n", + "\n", + " # Calculate residuals on calibration set\n", + " cal_preds = model.predict(X_cal)\n", + " residuals = np.abs(y_cal - cal_preds)\n", + "\n", + " # Sort residuals and find the quantile 
corresponding to (1-alpha)\n", + " residuals = np.sort(residuals)\n", + "\n", + " # The correction here is useful for small sample sizes\n", + " quantile = np.quantile(residuals, (1 - alpha) * (N / (N + 1)))\n", + "\n", + " # Make predictions on new data and calculate prediction intervals\n", + " preds = model.predict(new_data)\n", + " lower_bounds = preds - quantile\n", + " upper_bounds = preds + quantile\n", + "\n", + " # Return predictions and prediction intervals\n", + " return {\n", + " 'cp_error': quantile,\n", + " 'preds': preds,\n", + " 'lower_bounds': lower_bounds,\n", + " 'upper_bounds': upper_bounds\n", + " }\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.0366496877019815\n" + ] + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " preds lower_bounds upper_bounds\n", + "0 6.217809 5.181159 7.254459\n", + "1 4.269099 3.232450 5.305749\n", + "2 4.759600 3.722950 5.796249\n", + "3 5.227053 4.190403 6.263703\n", + "4 4.287107 3.250457 5.323756" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# split data\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']]\n", + "y = df_happiness['happiness']\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + " df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],\n", + " df_happiness['happiness'],\n", + " test_size = 0.5,\n", + " random_state = 123\n", + ")\n", + "\n", + "our_cp_error = split_conformal(\n", + " X_train,\n", + " y_train,\n", + " X_test,\n", + " alpha = .1\n", + ")\n", + "\n", + "print(our_cp_error['cp_error'])\n", + "\n", + "pd.DataFrame({\n", + " 'preds': our_cp_error['preds'],\n", + " 'lower_bounds': our_cp_error['lower_bounds'],\n", + " 'upper_bounds': our_cp_error['upper_bounds']\n", + "}).head()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "from mapie.regression import MapieRegressor\n", + "\n", + "model = MapieRegressor(LinearRegression(), method = 'base', random_state=123)\n", + "y_pred, y_pis = model.fit(X_train, y_train).predict(X_test, alpha = 0.1)\n", + "\n", + "# take the first difference between upper and lower bounds,\n", + "# since it's constant for all predictions in this setting\n", + "\n", + "cp_error = (y_pis[0, 1, 0] - y_pis[0, 0, 0]) / 2 " + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[5.39702408, 7.54753509],\n", + " [3.24052412, 5.39103513],\n", + " [3.65681605, 5.80732707],\n", + " [4.28406556, 6.43457658],\n", + " [3.24132564, 5.39183665]])" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_pis[:5].reshape(-1, 2)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1.0366496877019815, 1.075255506898503)" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "our_cp_error['cp_error'], cp_error" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "book-of-models", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd b/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd new file mode 100644 index 0000000..816a892 --- /dev/null +++ b/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd @@ -0,0 +1,303 @@ +--- +title: "Estimating Uncertainty" +format: html +--- + + +```{r} +library(tidyverse) +``` + + +```{r} +df_happiness = read_csv('https://tinyurl.com/worldhappiness2018') |> + drop_na() |> + rename(happiness = happiness_score) |> + select( + country, + happiness, + contains('_sc') + ) +``` + +## Standard Frequentist 
+ + +```{r} +model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) + +confint(model) + +predict(model, interval = 'confidence') # for an average prediction +predict(model, interval = 'prediction') # for a future observation (wider) +``` + + + +```{r} +X = model.matrix(model) + +# get the prediction +y_hat = X %*% coef(model) + +# get the standard error +se = sqrt(diag(X %*% vcov(model) %*% t(X))) + +# critical value for 95% confidence +cv = qt(0.975, df = model$df.residual) + +# get the confidence interval +tibble( + prediction = y_hat[,1], + lower = y_hat[,1] - cv * se, + upper = y_hat[,1] + cv * se +) |> + head() + +predict(model, interval = 'confidence') |> head() +``` + + +```{r} +# get the prediction interval +se_pred = sqrt(se^2 + summary(model)$sigma^2) + +data.frame( + prediction = y_hat[,1], + lower = y_hat[,1] - cv * se_pred, + upper = y_hat[,1] + cv * se_pred +) |> + head() + +predict(model, interval = 'prediction') |> head() +``` + + +## Monte Carlo + +```{r} +model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) + +# number of simulations +mc_predictions = function( + model, + nsim = 2500, + seed = 42 +) { + set.seed(seed) + + params_est = coef(model) + params = mvtnorm::rmvnorm( + n = nsim, + mean = params_est, + sigma = vcov(model) + ) + + sigma = summary(model)$sigma + X = model.matrix(model) + + y_hat = X %*% t(params) + rnorm(n = nrow(X) * nsim, sd = sigma) + + pred_int = apply(y_hat, 1, quantile, probs = c(.025, .975)) + + pred_int = tibble(lower = pred_int[1,], upper = pred_int[2,]) + return(pred_int) +} + +our_mc = mc_predictions(model) +our_mc +``` + + +## Bootstrap + + +```{r} +bootstrap = function(X, y, nboot = 100, seed = 123) { + + N = nrow(X) + p = ncol(X) + 1 # add one for intercept + + # initialize + beta = matrix(NA, p*nboot, nrow = nboot, ncol = p) + colnames(beta) = c('Intercept', colnames(X)) + mse = rep(NA, nboot) + + # set seed + set.seed(seed) + + for (i in 1:nboot) { + # sample with replacement + idx = sample(1:N, N, replace = TRUE) + Xi = X[idx,] + yi = y[idx] + + # estimate model + mod = lm(yi ~., data = Xi) + + # save results + beta[i, ] = coef(mod) + mse[i] = sum((mod$fitted - yi)^2) / N + } + + # given mean estimates, calculate MSE + y_hat = cbind(1, as.matrix(X)) %*% colMeans(beta) + final_mse = sum((y - y_hat)^2) / N + + output = list( + par = as_tibble(beta), + MSE = mse, + final_mse = final_mse + ) + + return(output) +} + +X = df_happiness |> + select(life_exp_sc:gdp_pc_sc) + +y = df_happiness$happiness + +our_boot = bootstrap( + X = X, + y = y, + nboot = 1000 +) + +our_boot +``` + +## Bayesian + +### Example + +```{r} +#| label: bayesian-demo-r +pk = c( + 'goal','goal','goal','miss','miss', + 'goal','goal','miss','goal','goal' +) + +# convert to numeric, arbitrarily picking goal=1, miss=0 + +N = length(pk) # sample size +n_goal = sum(pk == 'goal') # number of pk made +n_miss = sum(pk == 'miss') # number of those miss + +# grid of potential theta values +theta = seq( + from = 1 / (N + 1), + to = N / (N + 1), + length = 10 +) + +### prior distribution +# beta prior with mean = .5, but fairly diffuse +# examine the prior +# theta = rbeta(1000, 5, 5) +# hist(theta, main = 'Prior Distribution', xlab = 'Theta', col = 'lightblue') +p_theta = dbeta(theta, 5, 5) + +# Normalize so that values sum to 1 +p_theta = p_theta / sum(p_theta) + +# likelihood (binomial) +p_data_given_theta = choose(N, n_goal) * theta^n_goal * (1 - theta)^n_miss + +# posterior (combination of prior and likelihood) +# p_data 
is the marginal probability of the data used for normalization +p_data = sum(p_data_given_theta * p_theta) + +p_theta_given_data = p_data_given_theta*p_theta / p_data # Bayes theorem + +# final estimate +theta_est = sum(theta * p_theta_given_data) +theta_est +``` + +## Conformal Prediction + + +```{r} +split_conformal = function( + X, + y, + new_data, + alpha = .05, + calibration_split = .5 +) { + # Splitting the data into training and calibration sets + idx = sample(1:nrow(X), size = floor(nrow(X) / 2)) + + train_data = X |> slice(idx) + cal_data = X |> slice(-idx) + train_y = y[idx] + cal_y = y[-idx] + + N = nrow(train_data) + + # Train the base model + model = lm(train_y ~ ., data = train_data) + + # Calculate residuals on calibration set + cal_preds = predict(model, newdata = cal_data) + residuals = abs(cal_y - cal_preds) + + # Sort residuals and find the quantile corresponding to (1-alpha) + residuals = sort(residuals) + quantile = quantile(residuals, (1 - alpha) * (N / (N + 1))) + + # Make predictions on new data and calculate prediction intervals + preds = predict(model, newdata = new_data) + lower_bounds = preds - quantile + upper_bounds = preds + quantile + + # Return predictions and prediction intervals + return( + list( + cp_error = quantile, + preds = preds, + lower_bounds = lower_bounds, + upper_bounds = upper_bounds + ) + ) +} +``` + + +```{r} +# split data +set.seed(123) + +idx_train = sample(nrow(df_happiness), nrow(df_happiness) * .8) +idx_test = setdiff(1:nrow(df_happiness), idx_train) + +df_train = df_happiness |> + slice(idx_train) |> + select(happiness, life_exp_sc, gdp_pc_sc, corrupt_sc) + +y_train = df_happiness$happiness[idx_train] + +df_test = df_happiness |> + slice(idx_test) |> + select(life_exp_sc, gdp_pc_sc, corrupt_sc) + +y_test = df_happiness$happiness[idx_test] + +# apply the function +cp_error = split_conformal( + df_train |> select(-happiness), + y_train, + df_test, + alpha = .1 +) + +# cp_error[['cp_error']] + +tibble( + preds = cp_error[['preds']], + lower_bounds = cp_error[['lower_bounds']], + upper_bounds = cp_error[['upper_bounds']] +) |> + head() +``` \ No newline at end of file From 629742afdd059683471fc8229342f5a14231c7de Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 24 Nov 2024 14:12:05 -0500 Subject: [PATCH 04/19] misc content add --- conclusion.qmd | 2 +- generalized_linear_models.qmd | 15 +++++++++++---- index.qmd | 2 +- linear_model_extensions.qmd | 18 ++++++++++++++---- more_models.qmd | 2 +- 5 files changed, 28 insertions(+), 11 deletions(-) diff --git a/conclusion.qmd b/conclusion.qmd index b11024e..3b54e05 100644 --- a/conclusion.qmd +++ b/conclusion.qmd @@ -140,7 +140,7 @@ The deep learning landscape also includes models like deep graphical networks, a :::{.callout-note title="A List of Models" collapse='true'} -You can find a list of some specific models for each of these categories in the appendix (@sec-app-more-models). It is by no means an exhaustive list, but it should give you a good starting point for exploring additional models once you're finished here. +You can find a list of some specific models for each of these categories in the [appendix](https://m-clark.github.io/book-of-models/more_models.html) (web only). It is by no means an exhaustive list, but it should give you a good starting point for exploring additional models once you're finished here. 
::: diff --git a/generalized_linear_models.qmd b/generalized_linear_models.qmd index 9ac85d3..f0be37e 100644 --- a/generalized_linear_models.qmd +++ b/generalized_linear_models.qmd @@ -511,7 +511,7 @@ Many coming from a non-statistical background are not aware that their logistic ### Interpretation and visualization {#sec-glm-binomial-interpret} -If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in the log odds scale. Interpreting log odds is difficult at best, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from 'bad' to 'good', whereas a negative log odds would indicate that a decrease in the feature will decrease the log odds of moving from 'bad' to 'good'. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease. As we demonstrated, we can exponentiate them to get the odds ratio for additional interpretability. +If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in the log odds scale. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from 'bad' to 'good', whereas a negative log odds would indicate that a increase in the feature will decrease the log odds of moving from 'bad' to 'good'. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease. As we demonstrated, we can exponentiate them to get the odds ratio for additional interpretability. ```{r} @@ -520,6 +520,8 @@ If our modeling goal is not just producing predictions, we need to know what tho #| tbl-cap: Raw Coefficients and Odds Ratios for a Logistic Regression lo_result = parameters::model_parameters(model_logistic) or_result = exp(lo_result$Coefficient) +me_logistic = marginaleffects::avg_slopes(model_logistic) +me_logistic_wc = me_logistic$estimate[me_logistic$term == 'word_count'] lo_result |> as_tibble() |> @@ -531,7 +533,7 @@ lo_result |> The intercept provides a baseline odds of a 'good' review when word count is 0 and gender is 'female'. From there, we see that we've got a negative raw coefficient and odds ratio of `r round(or_result[2], 2)` for the word count variable. We have a positive raw coefficient and `r round(or_result[3], 2)` odds ratio for the male variable. This means that for every one unit increase in word count, the odds of a 'good' review decreases by about `r 100*round(1-or_result[2], 2)`%. Males are associated with an odds of a 'good' review that is `r 100*round(or_result[3] -1, 2)`% higher than females. -We feel it is much more intuitive to interpret things on the probability scale, so we'll get predicted probabilities for different values of the features. The way we do this is through the (inverse) link function, which will convert our log odds of the linear predictor to probabilities. 
We can then plot these probabilities to see how they change with the features. For the word count feature, we hold gender at the reference group ('female'), and for the gender feature, we hold word count at its mean. In addition we convert the probability to the percentage chance of a 'good' review. +We feel it is much more intuitive to interpret things on the probability scale, so we'll get predicted probabilities for different values of the features. The way we do this is through the (inverse) link function, which will convert our log odds of the linear predictor to probabilities. We can then look at specific predictions, calculate marginal effects, or plot these probabilities to see how they change with the features. For the word count feature in the following visualization, we hold gender at the reference group ('female'), and for the gender feature, we hold word count at its mean. In addition we convert the probability to the percentage chance of a 'good' review. :::{.content-visible when-format='html'} @@ -540,6 +542,8 @@ We feel it is much more intuitive to interpret things on the probability scale, #| label: fig-logistic-regression-word-count #| fig-cap: Model Predictions for Word Count Feature #| out-width: 100% + + p_prob = ggeffects::ggpredict(model_logistic, terms = c('word_count')) |> plot(color = okabe_ito['darkblue'], use_theme = FALSE) + annotate(geom = 'text', x = 20, y = plogis(1), label = 'Not so much!') + @@ -606,7 +610,7 @@ ggsave( ::: -In @fig-logistic-regression-word-count, we can see a clear negative relationship between the number of words in a review and the probability of being considered a 'good' movie. As we get over 20 words, the predicted probability of being a 'good' movie is less than .2. +In @fig-logistic-regression-word-count, we can see a clear negative relationship between the number of words in a review and the probability of being considered a 'good' movie. As we get over 20 words, the predicted probability of being a 'good' movie is less than .2. We also calculated the average marginal effect (@sec-avg-marginal-effects), or average slope, for word count. It suggests a `r round(me_logistic_wc, 2)` `r word_sign(me_logistic_wc, c('increase', 'decrease'))` in the probability of a 'good' rating for each additional word in the review (on average). We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference. @@ -865,6 +869,8 @@ Now let's check out the results more deeply. Like with logistic, we can exponent log_result = parameters::model_parameters(model_poisson) rr_result = exp(log_result$Coefficient) +me_poisson = marginaleffects::avg_slopes(model_poisson) +me_poisson_wc = me_poisson$estimate[me_poisson$term == 'word_count'] log_result |> as_tibble() |> @@ -875,7 +881,8 @@ log_result |> gt(decimals = 2) ``` -For this model, an increase in one review word leads to an expected log count `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` of ~`r round(log_result$Coefficient[2], 2)`. We can exponentiate this to get `r round(rr_result[2], 2)`, and this tells us that every added word in a review gets us a `r scales::label_percent()(round(rr_result[2], 2) - 1) ` `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` in the number of possessive pronouns. This is probably not a surprising result - wordier stuff has more types of words! 
A similar, though slightly smaller increase is seen for males relative to females, but, as with our previous model, this is not statistically significant. +For this model, an increase in one review word leads to an expected log count `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` of ~`r round(log_result$Coefficient[2], 2)`. We can exponentiate this to get `r round(rr_result[2], 2)`, and this tells us that every added word in a review gets us a `r scales::label_percent()(round(rr_result[2], 2) - 1) ` `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` in the number of possessive pronouns. This is probably not a surprising result - wordier stuff has more types of words! In addition, the average marginal effect (@sec-avg-marginal-effects), or average slope, for word count suggested a `r round(me_poisson_wc, 2)` `r word_sign(me_poisson_wc, c('increase', 'decrease'))` in the number of possessive pronouns per word on average. A similar, though slightly smaller, increase is seen for males relative to females, but, as with our previous model, this is not statistically significant. + But as usual, the visualization tells the better story. Here is the relationship for word count. Notice that the predictions are not discrete like the raw count, but continuous. This is because predictions here are the same as with our other models, and reflect the expected, or average, count that we'd predict with this data. diff --git a/index.qmd b/index.qmd index 32c7294..eb2e25c 100644 --- a/index.qmd +++ b/index.qmd @@ -29,7 +29,7 @@ For anyone reading this book, we especially hope you get a sense of the commonal You'll definitely want to have some familiarity with R or Python (both are used for examples), and some very basic knowledge of statistics will be helpful. We'll try to explain things as we go, but we won't be able to cover everything. If you're looking for a good introduction to R, we recommend [R for Data Science](https://r4ds.had.co.nz/) or the [Python for Data Analysis](https://wesmckinney.com/book/) book for Python. Beyond that, we'll try to provide the context you need so that you can be comfortable trying things out. -Also, if you happen to be reading this book in print, you can find the book in web form at https://m-clark.github.io/book-of-models. There you'll find all the code, figures, and other content that you can interact with more easily, as well as the most up-to-date content, fixes, etc. +Also, if you happen to be reading this book in print, you can find the book in web form at https://m-clark.github.io/book-of-models. There you'll find all the code, figures, and other content that you can interact with more easily, as well as the most up-to-date content, fixes, etc. The web version will be updated with some regularity and have additional content as well. ## Data & Code {#sec-preface-data} diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index 013b04e..0b65f01 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -767,7 +767,7 @@ ggsave("img/lm-extend-random_effects.svg", width = 8, height = 8) ![Estimated Random Effects](img/lm-extend-random_effects.svg){#fig-random-effects width=100%} -How do we interpret these *deviations*? For starters, they are deviations from the fixed effect for the intercept and decade trend coefficient. So here that means anything negative is an intercept or slope below the corresponding fixed effect value, and anything positive is above that value. 
If we want the specific effect for a country, we just add the random effect value to the fixed effect value, and we can refer to those as **random coefficients**. +How do we interpret these *deviations*? For starters, they are deviations from the fixed effect for the intercept and decade trend coefficient. So here that means anything negative is an intercept or slope below the corresponding fixed effect value, and anything positive is above that value. If we want the specific effect for a country, we just add its random effect to the fixed effect, and we can refer to those as **random coefficients**. For example, if we wanted to know the effects for the US, we would add its random effects to the population level fixed effects. This is exactly what we did in the previous section in our interpretation with the interaction model. However, you can typically get these from the package functionality directly. The result shows that the US starts at a very high happiness score, but actually has a negative trend over time[^lifeexpusa]. @@ -801,6 +801,13 @@ ranef_usa + model_ran_slope.fe_params ``` ::: + +:::{.callout-note title='Averages in Mixed Models' collapse='true'} +In the linear mixed effect model setting with a random intercept, the fixed effects can be seen as the (population) average effects, but this is not exactly what you are getting from the mixed model. To make the distinction clear, consider family groups and a gender effect for males vs. females. The linear regression and some other types of models (e.g. estimated via generalized estimating equations) would give you the average effect male-female difference across all families. The mixed model actually tells you the male-female difference as if they were in the same family (e.g. siblings). Again, in the simplest mixed model setting these are the same. Beyond that that, when we start dealing with random slopes and non-gaussian distributions, they are not. + +In general, if we set the random effect to 0 to get a prediction, that tells us what the prediction would be for a typical group, in this case, a typical country. Often we want to get something more like the average slope or prediction across countries that we would have with linear regression. This gets us back to the idea of the **marginal effect** we discussed earlier. While the mechanics are not straightforward for mixed models, the tool use generally takes no additional effort. +::: + Let's plot those random coefficients together to see how they relate to each other. ```{r} @@ -1331,9 +1338,8 @@ There are a few approaches we could take here, with common approaches being drop A better answer to this challenge might be to try a median-based approach instead. This is where a model like **quantile regression** becomes handy. Quantile regression is a type of regression that allows us to model the relationship between the features and the target at different quantiles of the target. For example, we can examine models at the 10th, 25th, 50th, 75th, and 90th percentiles of the target. This is very cool, as it allows us to model the relationship between the features and the target in a way that is robust to outliers and extreme scores. It's also a way to understand a type of nonlinearity that is not captured by a standard linear model, as the feature target relationship may change at different quantiles of the target. -To demonstrate this type of model, let's use our movie reviews data. 
Let's say that we are curious about the relationship between the `word_count` and `rating` to keep things simple. To make it even more straightforward, we will use the standardized (scaled) version of the feature. In our default approach, we will start with a median regression, in other words, a quantile associated with $\tau$ = .5[^leastabs]. +To demonstrate this type of model, let's use our movie reviews data. Let's say that we are curious about the relationship between the `word_count` and `rating` to keep things simple. To make it even more straightforward, we will use the standardized (scaled) version of the feature. In our default approach, we will start with a median regression. -[^leastabs]: This is equivalent to using the least absolute deviation objective. :::{.panel-tabset} @@ -1517,7 +1523,11 @@ $$ $${#eq-quantile-loss-function} ::: -Compared to a standard linear regression, we are given an extra parameter for the model: $\tau$. It's a number between 0 and 1 representing the desired quantile (e.g., 0.5 for the median). The objective function treats positive residuals differently than negative residuals. If the residual is positive, then we multiply it by $\tau$. If the residual is negative, then we multiply it by $\tau -1$. The increased penalty for over-predictions for ensures that the estimated quantile $\hat{y}$ is such that $\tau$ proportion of the data falls below it, balancing the total loss and providing a robust estimate of the desired quantile. +Compared to a standard linear regression, we are given an extra parameter for the model: $\tau$. It's a number between 0 and 1 representing the desired quantile (e.g., 0.5 for the median)[^leastabs]. The objective function treats positive residuals differently than negative residuals. If the residual is positive, then we multiply it by $\tau$. If the residual is negative, then we multiply it by $\tau -1$. The increased penalty for over-predictions for ensures that the estimated quantile $\hat{y}$ is such that $\tau$ proportion of the data falls below it, balancing the total loss and providing a robust estimate of the desired quantile. + + +[^leastabs]: This is equivalent to using the least absolute deviation objective. + ### Rolling our own diff --git a/more_models.qmd b/more_models.qmd index 457ec2b..7bcaf79 100644 --- a/more_models.qmd +++ b/more_models.qmd @@ -119,6 +119,6 @@ Convolutional neural networks as currently implemented can be seen going back to NLP and language processing more generally can be seen as evolving from matrix factorization and LDA, to neural network models such as word2vec and GloVe. In addition, the temporal nature of text suggested time-based models, including even more statistical models like hidden markov models back in the day. But in the neural network domain, we have standard Recurrent networks, then LSTMs, GRUs, Seq2Seq, and more that continued the theme. Now the field is dominated by attention-based transformers, of which BERT variants were popular early on, and OpenAI's GPT is among the most famous example of modern larger language models. But there are many others that have been developed in the last few years, offered from Meta, Google, Anthropic and others. You can see some recent [performance rankings](https://livebench.ai/#/?IF=a&Reasoning=a&Coding=a&Mathematics=a&Data+Analysis=a&Language=a), and note that there is not one model that is best at every task. 
-You'll also find deep learning approaches to some of the models in the ML section, such as recommender systems, clustering, graphs and more. Michael has surveyed some of the developments in deep learning for tabular data (@clark_deep_2022), and though he hasn't seen anything as of this writing to change the general conclusion there, he hopes to revisit the topic in earnest again in the future, so stay tuned. +You'll also find deep learning approaches to some of the models in the ML section, such as recommender systems, clustering, graphs and more. Recent efforts have attempted 'foundational' models to time series, such as Moirai. Michael has surveyed some of the developments in deep learning for tabular data (@clark_deep_2022), and though he hasn't seen anything as of this writing to change the general conclusion there, he hopes to revisit the topic in earnest again in the future, so stay tuned. ::: From 4e258d6047de4a5668c0a0f64b403d0ed9f9de40 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 24 Nov 2024 14:33:02 -0500 Subject: [PATCH 05/19] grammar check through lm --- linear_models.qmd | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/linear_models.qmd b/linear_models.qmd index e1431bf..e60a372 100644 --- a/linear_models.qmd +++ b/linear_models.qmd @@ -28,7 +28,7 @@ pd.set_option('display.float_format', lambda x: '%.3f' % x) ``` -Now it's time to dive in to some modeling! We'll start things off by covering the building block of all modeling, and a solid understanding here will provide you the basis for just about anything that comes after, no matter how complex it gets. The **linear model** is our starting point. At first glance, it may seem like a very simple model, and it is, relatively speaking. But it's also quite powerful and flexible, able to take in different types of inputs, handle nonlinear relationships, temporal and spatial relations, clustering, and more. Linear models have a long history, with even the formal and scientific idea behind correlation and linear regression being well over a century old[^corbypeirce]! And in that time, the linear model is far and away the most used model out there. But before we start talking about the *linear* model, we need to talk about what a **model** is in general. +Now it's time to dive into some modeling! We'll start things off by covering the building block of all modeling, and a solid understanding here will provide you the basis for just about anything that comes after, no matter how complex it gets. The **linear model** is our starting point. At first glance, it may seem like a very simple model, and it is, relatively speaking. But it's also quite powerful and flexible, able to take in different types of inputs, handle nonlinear relationships, temporal and spatial relations, clustering, and more. Linear models have a long history, with even the formal and scientific idea behind correlation and linear regression being well over a century old[^corbypeirce]! And in that time, the linear model is far and away the most used model out there. But before we start talking about the *linear* model, we need to talk about what a **model** is in general. [^corbypeirce]: Regression in general is typically attributed to [Galton](https://en.wikipedia.org/wiki/Francis_Galton), and correlation to Pearson, whose coefficient bearing his name is still the most widely used measure of association. 
Peirce & Bowditch were actually ahead of both [@rovine_peirce_2004], but [Bravais beat all of them](https://en.wikipedia.org/wiki/Auguste_Bravais). @@ -170,7 +170,7 @@ model_rsq = round(performance::r2(model_lr_rating)$R2, 2) For such a simple model, we certainly have a lot to unpack here! Don't worry, you'll eventually come to know what it all means. But it's nice to know how easy it is to get the results! For now we can just say that there's a *negative* relationship between the word count and the rating (the `r round(coef(model_lr_rating)[2], 3)`), which means that we expect lower ratings with longer reviews. The output also tells us that the value regarding the relationship is statistically significant (`P(>|t|)` value is < .05). -Getting more into the details, we'll start with the fact that the linear model posits a **linear combination** of the features. This is an important concept to understand, but really, a linear combination is just a sum of the features, each of which has been multiplied by some specific value. That value is often called a **coefficient**, or possibly **weight**, depending on the context, and will allow different features to have different contributions to the result. Those contributions reflect the amount and direction of feature-target relationship. The linear model is expressed as (math incoming!): +Getting more into the details, we'll start with the fact that the linear model posits a **linear combination** of the features. This is an important concept to understand, but really, a linear combination is just a sum of the features, each of which has been multiplied by some specific value. That value is often called a **coefficient**, or possibly **weight**, depending on the context, and will allow different features to have different contributions to the result. Those contributions reflect the amount and direction of the feature-target relationship. The linear model is expressed as (math incoming!): $$ y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n @@ -395,7 +395,7 @@ You'll often see predictions referred to as **fitted values**, but these imply w What predictions we can get depends on the type of model we are using. For the linear model we have at present, we can get predictions for the target, which is a **continuous variable**. Very commonly, we also can get predictions for a **categorical target**, such as whether the rating is 'good' or 'bad'. This simple breakdown pretty much covers everything, as we typically would be predicting a continuous numeric variable or a categorical variable, or more of them, like multiple continuous variables, or a target with multiple categories, or sequences of categories (e.g. words). -In our case, we can get predictions for the rating, which is a number between 1 and 5. Had our target been a binary good vs. bad rating, our predictions would still be numeric for most models, and usually expressed as a probability between 0 and 1, say, for the 'good' category, or in a initial form that is then transformed to a probability. For example, in the context of predicting a good rating, higher probabilities would mean we'd more likely predict the movie is good, and lower probabilities would mean we'd more likely predict the movie is bad. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We'll talk about how to get predictions for categorical targets later[^treepreds]. +In our case, we can get predictions for the rating, which is a number between 1 and 5. 
Had our target been a binary good vs. bad rating, our predictions would still be numeric for most models, and usually expressed as a probability between 0 and 1, say, for the 'good' category, or in an initial form that is then transformed to a probability. For example, in the context of predicting a good rating, higher probabilities would mean we'd more likely predict the movie is good, and lower probabilities would mean we'd more likely predict the movie is bad. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We'll talk about how to get predictions for categorical targets later[^treepreds]. [^treepreds]: Some models, such as the tree approaches outlined in @sec-ml-common-trees, can directly predict categorical targets, but we still like, and often\ prefer using a probability. @@ -494,7 +494,7 @@ The values reflect the `r word_sign(wc_coef, c('positive', 'negative'))` coeffic As we have seen, predictions are not perfect, and an essential part of the modeling endeavor is to better understand these errors and why they occur. In addition, error assessment is the fundamental way in which we assess a model's performance, and, by extension, compare that performance to other models. In general, prediction error is the difference between the actual value and the predicted value or some function of it. In statistical models, it is also often called the **residual**. We can look at these individually, or we can look at them in aggregate with a single metric. -Let's start with looking at the residuals visually. Often the modeling package you use will have this as a default plotting method when doing a standard linear regression, so it's wise to take advantage of it. We plot both the distribution of raw error scores and the cumulative distribution of absolute prediction error. Here we see a couple things. First, the distribution appears roughly normal, which is a good thing, since statistical linear regression assumes our error is normally distributed, and the prediction error serves as an estimate of that. Second, we see that the mean of the errors is zero, which is a property of linear regression, and the reason we look at other metrics besides a simple 'average error' when assessing model performance. We can also see that our average *absolute* error is around, `r round(mean(abs(resid(model_lr_rating))), 1)`, most of our predictions (>90%) are within ±1 star rating. +Let's start with looking at the residuals visually. Often the modeling package you use will have this as a default plotting method when doing a standard linear regression, so it's wise to take advantage of it. We plot both the distribution of raw error scores and the cumulative distribution of absolute prediction error. Here we see a couple things. First, the distribution appears roughly normal, which is a good thing, since statistical linear regression assumes our error is normally distributed, and the prediction error serves as an estimate of that. Second, we see that the mean of the errors is zero, which is a property of linear regression, and the reason we look at other metrics besides a simple 'average error' when assessing model performance. We can also see that our average *absolute* error is around `r round(mean(abs(resid(model_lr_rating))), 1)`, most of our predictions (>90%) are within ±1 star rating. 
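If you want these error summaries as numbers rather than reading them off the plot, a minimal sketch along these lines should work, reusing the `model_lr_rating` fit from earlier in the chapter (the exact values will depend on your data):

```{r}
#| eval: false
# residuals (prediction errors) from the fitted linear regression model
resids = resid(model_lr_rating)

mean(resids)            # ~0 by construction for least squares
mean(abs(resids))       # average absolute prediction error
mean(abs(resids) <= 1)  # proportion of predictions within 1 star
```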
:::{.content-visible when-format='html'} @@ -1018,7 +1018,7 @@ $$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \dots + \epsilon $$ -where $y$ is the target, $X$ is a 2-d matrix of features[^interceptones], where the rows are observations/instances and columns features, and $\beta$ is a vector of coefficients or weights corresponding to the number of columns in $X$. Matrix multiplication provides us an efficient way to get our expected value/prediction, and depicting the model in this way is a common practice that makes things more succint +where $y$ is the target, $X$ is a 2-d matrix of features[^interceptones], where the rows are observations/instances and columns features, and $\beta$ is a vector of coefficients or weights corresponding to the number of columns in $X$. Matrix multiplication provides us an efficient way to get our expected value/prediction, and depicting the model in this way is a common practice that makes it more succinct. [^interceptones]: In the first depiction without $\alpha$, there is an additional column at the beginning of the matrix that is all ones, which is a way to incorporate the intercept into the model. However, most models that use a matrix as input will not have the intercept column, as it's either not part of the model estimation, is automatically added behind the scenes, and may be estimated separately. @@ -1175,7 +1175,7 @@ mre_mae = round(performance::performance_mae(model_lr_rating_extra), 2) There is definitely more to unpack here than our simpler model, but it's important to note that it's just *more* stuff, not *different* stuff. The model-level components are the same in that we still see R^2^ etc., although they are all 'better' (higher R^2^, lower error) because we have a model that more accurately predicts the observed target. -Our coefficients have the same output, and though they are on different scales and we'd interpret them in the same way. Starting with word count, we see that it's still statistically significant, but it has been reduced just slightly from our previous model where it was the only feature (`r wc_coef` vs. `r mre_coef[['word_count']]`). Why? This suggests that word count has some non-zero correlation, sometimes called **collinearity**, with other features that are also explaining the target to some extent. Our linear model shows the effect of each feature *controlling for other features*, or, *holding other features constant*, or *adjusted for other features*[^controllingfor]. Conceptually this means that the effect of word count is the effect of word count *after* we've accounted for the other features in the model. In this case, an increase of a single word results in a `r round(mre_coef[['word_count']], 2)` `r word_sign(mre_coef[['word_count']], c('increase', 'drop'))`, even after adjusting for the effect of other features. Looking at another feature, the addition of a child to the home is associated with `r mre_coef[['children_in_home']]` `r word_sign(mre_coef[['children_in_home']], c('increase', 'drop'))` in rating, again, accounting for the other features. +Our coefficients have the same output, and though they are on different scales we'd interpret them in the same way. Starting with word count, we see that it's still statistically significant, but it has been reduced just slightly from our previous model where it was the only feature (`r wc_coef` vs. `r mre_coef[['word_count']]`). Why? 
This suggests that word count has some non-zero correlation, sometimes called **collinearity**, with other features that are also explaining the target to some extent. Our linear model shows the effect of each feature *controlling for other features*, or, *holding other features constant*, or *adjusted for other features*[^controllingfor]. Conceptually this means that the effect of word count is the effect of word count *after* we've accounted for the other features in the model. In this case, an increase of a single word results in a `r round(mre_coef[['word_count']], 2)` `r word_sign(mre_coef[['word_count']], c('increase', 'drop'))`, even after adjusting for the effect of other features. Looking at another feature, the addition of a child to the home is associated with `r mre_coef[['children_in_home']]` `r word_sign(mre_coef[['children_in_home']], c('increase', 'drop'))` in rating, again, accounting for the other features. [^controllingfor]: A lot of statisticians and causal modeling folks get very hung up on the terminology here, but we'll leave that to them as we'd like to get on with things. For our purposes, we'll just say that we're interested in the relationship of a feature with the target *after* we've accounted for the other features in the model. @@ -1333,7 +1333,7 @@ Recall also that an interpretation of the intercept is the expected value of the When we have a lot of categories, it's often not practical to look at the coefficients for each one, and even when there aren't that many, we often prefer to get a sense of the total effect of the feature. For standard linear models, we can break down the target variance explained by the model into the variance explained by each feature, and this is called the **ANOVA**, or analysis of variance[^typess]. It is not without its issues, but it's one way to get a sense of whether a categorical (or other) feature as a whole is statistically significant. -[^typess]: There are actually different types of ANOVA in this context, and different ways to calculate the variance values. One may notice the Python ANOVA result is different, even though the season coefficients and initial model is identical. R defaults with what is called Type II sums of squares, while the Python default uses Type I sums of squares. We won't bore you with the details of their differences, and the astute modeler will not come to different conclusions because of this sort of thing, and you now have enough detail to look it up. +[^typess]: There are actually different types of ANOVA in this context, and different ways to calculate the variance values. One may notice the Python ANOVA result is different, even though the season coefficients and initial model are identical. R defaults with what is called Type II sums of squares, while the Python default uses Type I sums of squares. We won't bore you with the details of their differences, and the astute modeler will not come to different conclusions because of this sort of thing, and you now have enough detail to look it up. :::{.panel-tabset} @@ -1375,7 +1375,7 @@ broom::tidy(anova(model_lr_cat)) |> sub_missing(missing_text = '') ``` -A primary reason to use ANOVA is to make these sort of summary claims of statistical significance. In this case, we can say that the relationship of season to rating is statistically significant. 
From the table, the DF (degrees of freedom) represents the number of categories minus 1, and the F-statistic is a measure of the (mean squared) variance explained by the feature relative to the (mean squared) variance not explained by the feature (F = mean square value divided by DF). The p-value is the probability of observing an F-statistic as extreme as the one observed, given that the null hypothesis is true. In this case, the null hypothesis is that the feature has no effect on the target. The p-value is less than 0.001, so we reject the null hypothesis and conclude that the observed feature-target relationship is statistically different from an assumption of no relationship. Note that nothing here is different from what we saw in our previous regression models, and we can run an `anova` function on those too[^anova_vs_t]. As a final note, it's good to be reminded that statistical significance is not the same as practical significance. Whether these group differences are meaningful is still left to be decided by the modeler given the context of the data. +A primary reason to use ANOVA is to make these sorts of summary claims of statistical significance. In this case, we can say that the relationship of season to rating is statistically significant. From the table, the DF (degrees of freedom) represents the number of categories minus 1, and the F-statistic is a measure of the (mean squared) variance explained by the feature relative to the (mean squared) variance not explained by the feature (F = mean square value divided by DF). The p-value is the probability of observing an F-statistic as extreme as the one observed, given that the null hypothesis is true. In this case, the null hypothesis is that the feature has no effect on the target. The p-value is less than 0.001, so we reject the null hypothesis and conclude that the observed feature-target relationship is statistically different from an assumption of no relationship. Note that nothing here is different from what we saw in our previous regression models, and we can run an `anova` function on those too[^anova_vs_t]. As a final note, it's good to be reminded that statistical significance is not the same as practical significance. Whether these group differences are meaningful is still left to be decided by the modeler given the context of the data. [^anova_vs_t]: For those interested, for those features with one degree of freedom, all else being equal the F statistic here would just be the square of the t-statistic for the coefficients, and the p-value would be the same. @@ -1501,7 +1501,7 @@ The take home message is that using models in more complex settings like machine Up to this point we've been using a continuous, numeric target. But what about a categorical target? For example, what if we just had a binary target of whether a movie was good or bad? We will dive much more into classification models in our upcoming chapters, but it turns out that we can still formulate it as a linear model. The main difference is that we use a transformation of our linear combination of features, using what is sometimes called a **link function**, and we'll need to use a different **objective function** rather than least squares, such as the binomial likelihood, to deal with the binary target. This also means we'll move away from R^2^ as a measure of model fit, and focus on something else, like accuracy. -Graphically we can see it in the following way, which when compared with our linear model (@fig-graph-lm), doesn't look much different. 
In what follows, we create our linear combination of features exactly as we did for the linear regression setting. Then we put it through the sigmoid function, which is a common link function for binary targets[^revlogit]. The result is a probability, which we can then use to classify the observation as good or bad based on a chosen threshold. For example, we might say that any instance associated with a probability greater than to 0.5 is classified as 'good', and less than that is classified as 'bad'. +Graphically we can see it in the following way, which when compared with our linear model (@fig-graph-lm), doesn't look much different. In what follows, we create our linear combination of features exactly as we did for the linear regression setting. Then we put it through the sigmoid function, which is a common link function for binary targets[^revlogit]. The result is a probability, which we can then use to classify the observation as good or bad based on a chosen threshold. For example, we might say that any instance associated with a probability greater than 0.5 is classified as 'good', and less than that is classified as 'bad'. [^revlogit]: The sigmoid function in this case is the inverse logistic function, and the resulting statistical model is called logistic regression. In other contexts the model would not be a logistic regression, but this is still a commonly used **activation function**. But others could potentially be used as well. For example, using a (cumulative) normal instead of logistic distribution to create a probability results in the so-called **probit model**, which is more common in econometrics and other fields. @@ -1580,7 +1580,7 @@ Now that you've got the basics, where do you want to go? - If you want to know more about how to understand linear and other models: @sec-knowing, @sec-knowing-feature - If you want a deeper dive into how we get the results from our model: @sec-estimation -- If you want to do some more modeling: @sec-glm, @sec-lm-extend or @sec-ml-core-concepts +- If you want to do some more modeling: @sec-glm, @sec-lm-extend, or @sec-ml-core-concepts - Got more data questions? @sec-data ### Additional resources {#sec-lm-resources} From 9de12c0751b6f2badc2d4472a60c5cdea53a8023 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 24 Nov 2024 15:31:26 -0500 Subject: [PATCH 06/19] gc through extensions --- estimation.qmd | 1057 +-------------------------------- generalized_linear_models.qmd | 6 +- linear_model_extensions.qmd | 6 +- uncertainty.qmd | 12 +- understanding_features.qmd | 6 +- understanding_models.qmd | 4 +- 6 files changed, 18 insertions(+), 1073 deletions(-) diff --git a/estimation.qmd b/estimation.qmd index f5d772d..1cbfcdd 100644 --- a/estimation.qmd +++ b/estimation.qmd @@ -2096,7 +2096,7 @@ our_sgd['par'], our_sgd['MSE'] Next we'll compare it to OLS estimates and our previous 'batch' gradient descent results. Even though SGD normally would not be used for such a small dataset, we at least get close[^closebutnosgd]! -[^closebutnosgd]: You'd get better results a couple of ways. The easiest is just to repeat the process a couple of times and average the results. This is a common approach in SGD. The initial shuffling that we did can help with convergence as well, and it would be done each repetition . When we're dealing with larger data and repeated runs/epochs, shuffling allows the samples/batches to be more representative of the entire data set. 
Also, we had to 'hand-tune' our learning rate and step size, which is not ideal, and normally we would use cross-validation to find the best values. +[^closebutnosgd]: You'd get better results in a couple of ways. The easiest is just to repeat the process a couple of times and average the results. This is a common approach in SGD. The initial shuffling that we did can help with convergence as well, and it would be done each repetition . When we're dealing with larger data and repeated runs/epochs, shuffling allows the samples/batches to be more representative of the entire data set. Also, we had to 'hand-tune' our learning rate and step size, which is not ideal, and normally we would use cross-validation to find the best values. ```{r} #| echo: false @@ -2265,1018 +2265,6 @@ map2_df( Before leaving our estimation discussion, we should mention there are other approaches one could use to estimate model parameters, including variations on least squares, the **method of moments**, **generalized estimating equations**, **robust** estimation, and more. We've focused on the most common ones generally, but it's good to be aware of others that might be more popular in some domains. -## Estimating Uncertainty {#sec-estim-other} - -Our focus thus far has been on estimating the best parameters for a model. But we also want to know how certain we are about those estimates. There are different ways to estimate **uncertainty**, and understanding the uncertainty in our results helps us make better decisions from our model. We'll briefly cover a few approaches here, but realize we are but merely scratching the surface on these approaches. There are whole books, and even philosophies, dedicated to the topic of uncertainty estimation. - - - -### Standard Frequentist {#sec-estim-frequentist} - -We talked a bit about the frequentist approach in our discussion of confidence intervals (@sec-lm-interpretation-feature). There we described the process using the interval to *capture* the 'true' parameter value a certain percentage of the time. The key assumption is that the true parameter is fixed, and the interval is a random variable that will contain the true value with some percentage frequency. With this approach, if you were to repeat the experiment, i.e. data collection and analysis, many times, each interval would be slightly different. Although they would be different, any one of the intervals is as good or valid as the others. You also know that a certain percentage of them will contain the true value, and a (usually small) percentage will not. You will never know if a specific interval does actually capture the true value, because we don't know the true value in practice. - -This is a common approach in traditional statistical analysis, and so it's used in many modeling contexts. If no particular estimation approach is specified, the default is usually a frequentist one. The approach not only provides confidence intervals for the parameters, but we can also get them for predictions, which is typically also a goal. - -Here is an example using our previous model to get interval estimates for predictions. Here we get so-called 'confidence' or 'prediction' intervals. Both are confidence intervals in the frequentist sense, just for different purposes. The confidence interval is for the average prediction, while the prediction interval is for a future observation. 
The prediction interval is always wider because it includes the uncertainty in the model parameters as well as the uncertainty in the prediction itself. - -:::{.panel-tabset} - -##### R - -```{r} -#| label: r-frequentist -#| eval: false - -model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) - -confint(model) - -predict(model, interval = 'confidence') # for an average prediction -predict(model, interval = 'prediction') # for a future observation (always wider) -``` - -##### Python - -```{python} -#| label: py-frequentist -#| eval: false - -model = smf.ols( - 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc', - data = df_happiness -).fit() - -model.conf_int() - -model.get_prediction().summary_frame() # both 'confidence' and 'prediction' intervals -``` - -::: - -These interval estimates for parameters and predictions are actually not easy to get right for more complicated models beyond generalized linear models. Given this, one should be cautious when moving beyond standard linear models. The next two approaches we'll discuss are often used within the frequentist framework to estimate uncertainty in more complex models. - - -### Monte Carlo - -Monte Carlo methods derive their name from the famous casino in Monaco[^mcname]. The idea is to use random sampling to estimate a value. With statistical models, we can use Monte Carlo methods to estimate uncertainty in our model parameters and predictions. The general idea is as follows: - -1. **Estimate the model parameters** using the data and their range of possible values (e.g. based on a probability distribution). -2. **Simulate new data** from the model using the estimated parameters and assumed probability distributions for those parameters. -3. **Estimate the metrics of interest** using the simulated data. -4. **Repeat** many times. - - -[^mcname]: The name originates with Stanislav Ulam, who worked on the Manhattan Project and would actually come up with the idea from playing solitaire. He is also the one who inspired the name of the Bayesian probabilistic programming language Stan! - -The result is a distribution of the value of interest, be it a parameter, a prediction, or maybe an evaluation metric like RMSE. This distribution can then be used to provide a sense of uncertainty in the value, such as an interval estimate. We can use Monte Carlo methods to estimate the uncertainty in predictions for our happiness model as follows. 
- -:::{.panel-tabset} - -##### R - -```{r} -#| label: r-monte-carlo - -# we'll use the model from the previous section -model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) - -# number of simulations -mc_predictions = function( - model, - nsim = 2500, - seed = 42 -) { - set.seed(seed) - - params_est = coef(model) - params = mvtnorm::rmvnorm( - n = nsim, - mean = params_est, - sigma = vcov(model) - ) - - sigma = summary(model)$sigma - X = model.matrix(model) - - y_hat = X %*% t(params) + rnorm(n = nrow(X) * nsim, sd = sigma) - - pred_int = apply(y_hat, 1, quantile, probs = c(.025, .975)) - - return(pred_int) -} - -our_mc = mc_predictions(model) -``` - -##### Python - -```{python} -#| label: py-monte-carlo -#| eval: false - -# we'll use the model from the previous section -model = smf.ols( - 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc', - data = df_happiness -).fit() - -def mc_predictions(model, nsim=2500, seed=42): - np.random.seed(seed) - - params_est = model.params - params = np.random.multivariate_normal( - mean = params_est, - cov = model.cov_params(), - size = nsim - ) - - sigma = model.mse_resid**.5 - X = model.model.exog - - y_hat = X @ params.T + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) - - pred_int = np.quantile(y_hat, q = [.025, .975], axis = 1) - - return pred_int - -our_mc = mc_predictions(model) -``` - -::: - -Here are the results of the Monte Carlo simulation for the prediction intervals. They are pretty close to what we'd already have available from the model package used for linear regression. However, we can use this for other models where uncertainty estimates are not readily available, can help us estimate values and intervals in a general way. - -```{r} -#| echo: false -#| label: tbl-monte-carlo -#| tbl-cap: Monte Carlo Prediction Intervals -pred_int_lm = predict(model, interval = 'prediction') - -tibble( - observed_value = df_happiness$happiness, - prediction = fitted(model), - lower = our_mc[1, ], - upper = our_mc[2, ], - lower_lm = pred_int_lm[, 'lwr'], - upper_lm = pred_int_lm[, 'upr'] -) |> - head() |> - gt() |> - tab_footnote( - footnote = 'Results based on the R simulation' - ) -``` - -Monte Carlo simulation is a very popular approach in modeling, and a variant of it, markov chain monte carlo (MCMC), is the basis for Bayesian estimation, which we'll also talk about in more detail later. - - -### Bootstrap {#sec-estim-bootstrap} - -An extremely common method for estimating uncertainty is the **bootstrap**. The bootstrap is a method where we create new datasets by randomly sampling the original data with replacement. This means that each new dataset is the same size as the original, but some observations may be selected multiple times while others may not be selected at all. We then estimate our model with each data set, and each time, we can collect parameter estimates, predictions, or any other calculations we are interested in. Ultimately, we end up with a *distribution* of all the things we calculated. The nice thing about this we don't need to know the specific distribution (e.g., normal, or t-distribution) of the values we want to get uncertainty estimates for, we can just use the data we have to produce that distribution. And this is a key distinction from the Monte Carlo method just discussed. - -The results of bootstrapping give us a range of possible values, which is useful for inference[^infdef], as we can use the distribution to calculate interval estimates. 
The average parameter estimate is typically the same as whatever the underlying model used would produce, so not really useful for that in the context of simpler linear models. Even so, we can calculate derivatives of the parameters, like say a ratio or sum, or a model metric like R^2^, or a prediction. Some of these normally would not be estimated as part of the model, or maybe the model tool does not provide anything beyond the value itself. Yet the bootstrap provides a way to get at a measure of uncertainty for the values of interest, with fewer assumptions about how that distribution should take shape. - -The approach bootstrap very flexible, and it can potentially be used with any model whether in a statistical or machine learning context. Let's see this in action with the happiness model. We'll create a bootstrap function, then use it to estimate the uncertainty in the coefficients for the happiness model. - -[^infdef]: We're using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We're not exactly sure how that terminological muddling arose in ML, but be on the lookout for it. - -:::{.panel-tabset} - -##### R - -```{r} -#| label: r-bootstrap -#| results: hide -bootstrap = function(X, y, nboot = 100, seed = 123) { - - N = nrow(X) - p = ncol(X) + 1 # add one for intercept - - # initialize - beta = matrix(NA, p*nboot, nrow = nboot, ncol = p) - colnames(beta) = c('Intercept', colnames(X)) - mse = rep(NA, nboot) - - # set seed - set.seed(seed) - - for (i in 1:nboot) { - # sample with replacement - idx = sample(1:N, N, replace = TRUE) - Xi = X[idx,] - yi = y[idx] - - # estimate model - mod = lm(yi ~., data = Xi) - - # save results - beta[i, ] = coef(mod) - mse[i] = sum((mod$fitted - yi)^2) / N - } - - # given mean estimates, calculate MSE - y_hat = cbind(1, as.matrix(X)) %*% colMeans(beta) - final_mse = sum((y - y_hat)^2) / N - - output = list( - par = as_tibble(beta), - MSE = mse, - final_mse = final_mse - ) - - return(output) -} - -X = df_happiness |> - select(life_exp_sc:gdp_pc_sc) - -y = df_happiness$happiness - -our_boot = bootstrap( - X = X, - y = y, - nboot = 1000 -) -``` - -##### Python - -```{python} -#| label: py-bootstrap -#| eval: false -def bootstrap(X, y, nboot=100, seed=123): - # add a column of 1s for the intercept - X = np.c_[np.ones(X.shape[0]), X] - N = X.shape[0] - - # initialize - beta = np.empty((nboot, X.shape[1])) - - # beta = pd.DataFrame(beta, columns=['Intercept'] + list(cn)) - mse = np.empty(nboot) - - # set seed - np.random.seed(seed) - - for i in range(nboot): - # sample with replacement - idx = np.random.randint(0, N, N) - Xi = X[idx, :] - yi = y[idx] - - # estimate model - model = LinearRegression(fit_intercept=False) # from sklearn - mod = model.fit(Xi, yi) - - # save results - beta[i, :] = mod.coef_ - mse[i] = np.sum((mod.predict(Xi) - yi)**2) / N - - # given mean estimates, calculate MSE - y_hat = X @ beta.mean(axis=0) - final_mse = np.sum((y - y_hat)**2) / N - - output = { - 'par': beta, - 'mse': mse, - 'final_mse': final_mse - } - - return output - -our_boot = bootstrap( - X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']], - y = df_happiness['happiness'], - nboot = 1000 -) -``` -::: - - -Here are the results of the interval estimates for the coefficients. Each parameter has the mean estimate, the lower and upper bounds of the 95% confidence interval, and the width of the interval. 
The bootstrap intervals are a bit wider than the OLS intervals, but for this model these should converge as the number of observations increases. - -\small -```{r} -#| echo: false -#| label: tbl-bootstrap -#| tbl-cap: Bootstrap Parameter Estimates - -tab_boot_summary = our_boot$par |> - as_tibble() |> - tidyr::pivot_longer(everything(), names_to = 'Parameter') |> - summarize( - mean = mean(value), - `Lower BS` = quantile(value, .025), - `Upper BS` = quantile(value, .975), - .by = Parameter - ) |> - mutate( - `Lower OLS` = confint(model_compare)[, 1], - `Upper OLS` = confint(model_compare)[, 2], - `Diff Width` = (`Upper BS` - `Lower BS`) - (`Upper OLS` - `Lower OLS`) - ) - -tab_boot_summary |> - gt(decimals = 2) |> - tab_footnote( - footnote = 'Width of bootstrap estimate minus width of OLS estimate', - locations = cells_column_labels(columns = `Diff Width`) - ) |> - tab_options( - footnotes.font.size = 10 - ) -``` -\normalsize - -Let's look more closely at the distributions for each coefficient. Standard statistical estimates assume a specific distribution like the normal. But the bootstrap method provides more flexibility, even though it often leans towards the assumed distribution. We can see these distributions aren't perfectly symmetrical like a normal distribution, but they suit our needs in that we can extract the lower and upper quantiles to create an interval estimate. - -:::{.content-visible when-format='html'} -```{r} -#| echo: false -#| label: fig-r-bootstrap -#| fig-cap: Bootstrap Distributions of Parameter Estimates - -our_boot$par |> - as_tibble() |> - pivot_longer(everything(), names_to = 'Parameter') |> - ggplot(aes(value)) + - geom_density(color = okabe_ito[2]) + - geom_point( - aes(x = `Lower BS`, group = Parameter), - y = 0, - color = okabe_ito[1], - size = 3, - alpha = 1, - data = tab_boot_summary - ) + - geom_point( - aes(x = `Upper BS`, group = Parameter), - y = 0, - color = okabe_ito[1], - size = 3, - alpha = 1, - data = tab_boot_summary - ) + - geom_segment( - aes( - x = `Lower BS`, - xend = `Upper BS`, - y = 0, - yend = 0, - group = Parameter - ), - color = okabe_ito[1], - linewidth = 1, - data = tab_boot_summary - ) + - facet_wrap(~Parameter, scales = 'free') + - labs(x = 'Parameter Value', y = 'Density') - -ggsave('img/estim-bootstrap.svg', width = 8, height = 6) -``` -::: - -:::{.content-visible when-format='pdf'} -![Bootstrap Distributions of Parameter Estimates](img/estim-bootstrap.svg){#fig-r-bootstrap} -::: - -As mentioned, the bootstrap is often used to provide uncertainty for unmodeled parameters, predictions, and other metrics. However, because we repeatedly run the model or some aspect of it over and over, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g. extreme values) or [in some data contexts](https://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails) (e.g. correlated observations) without extra considerations. Variants exist to help deal with some of these issues, and despite limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model. 
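As a quick illustration of that flexibility, once you have the bootstrap draws you can get an interval for quantities the model never reports directly. Here is a minimal sketch using the `our_boot` result from the R example above; the particular quantity, the difference between the life expectancy and GDP coefficients, is just for illustration:

```{r}
#| eval: false
# each row of our_boot$par is one bootstrap estimate of all coefficients,
# so any derived quantity gets its own bootstrap distribution
coef_diff = our_boot$par$life_exp_sc - our_boot$par$gdp_pc_sc

quantile(coef_diff, probs = c(.025, .975))  # 95% interval for the difference
```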

### Bayesian estimation {#sec-estim-bayes}

The **Bayesian** approach to modeling is many things - a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about using Bayesian estimation is that it can handle model complexities that other approaches don't handle well, or at all.

The basis of Bayesian estimation is the **likelihood**, the same as with maximum likelihood, and everything we did there applies here. So you need a good grasp of maximum likelihood to understand the Bayesian approach. However, the Bayesian approach is different because it also lets us use our knowledge about the parameters through **prior distributions**. For example, we may think that the coefficients for a linear model come from a normal distribution centered on zero with some variance. That would serve as our prior distribution for those parameters.

The combination of a prior distribution with the likelihood results in the **posterior distribution**, which is a *distribution* of possible parameter values. It falls somewhere between the prior and the likelihood. With more data, it tends toward the likelihood result, and with less data, it tends toward what the prior would have suggested. The posterior distribution is what we ultimately use to make inferences about the parameters, and it can be used to estimate uncertainty in the same way as the bootstrap.

![Prior, likelihood, and posterior distributions](img/prior2post_clean.png){#fig-bayesian-prior-posterior}


#### Example {.unnumbered}

Let's do a simple example to show how this comes about. We'll use a binomial model where we have penalty kicks taken by a soccer player, and we want to estimate the probability of the player making a goal, which we'll call $\theta$.

For our prior distribution, we'll use a [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) that has a mean of 0.5, suggesting that we think this person would have about a 50% chance of converting the kick on average. However, we will keep this prior fairly loose, with a range that spans most of the (0, 1) interval. For the likelihood, we'll use a binomial distribution. We also use this in our GLM chapter (@eq-binomial), which, as we noted earlier in this chapter, is akin to using the log loss (@sec-estim-logloss). We'll then calculate the posterior distribution for the probability of making a shot, given our prior and the evidence at hand, i.e., the data.

Let's start with some data, and just like our other estimation approaches, we'll have some guesses for $\theta$, which represents the probability of making a goal. We'll use the prior distribution to represent our beliefs about those parameter values, assigning more weight to values around 0.5. We'll then calculate the likelihood of the data given the parameter, which will put more weight on values closer to the observed chance of scoring a goal. Finally, we calculate the posterior distribution.
- -:::{.panel-tabset} - -##### R -```{r} -#| label: bayesian-demo-r -pk = c( - 'goal','goal','goal','miss','miss', - 'goal','goal','miss','goal','goal' -) - -# convert to numeric, arbitrarily picking goal=1, miss=0 - -N = length(pk) # sample size -n_goal = sum(pk == 'goal') # number of pk made -n_miss = sum(pk == 'miss') # number of those miss - -# grid of potential theta values -theta = seq( - from = 1 / (N + 1), - to = N / (N + 1), - length = 10 -) - -### prior distribution -# beta prior with mean = .5, but fairly diffuse -# examine the prior -# theta = rbeta(1000, 5, 5) -# hist(theta, main = 'Prior Distribution', xlab = 'Theta', col = 'lightblue') -p_theta = dbeta(theta, 5, 5) - -# Normalize so that values sum to 1 -p_theta = p_theta / sum(p_theta) - -# likelihood (binomial) -p_data_given_theta = choose(N, n_goal) * theta^n_goal * (1 - theta)^n_miss - -# posterior (combination of prior and likelihood) -# p_data is the marginal probability of the data used for normalization -p_data = sum(p_data_given_theta * p_theta) - -p_theta_given_data = p_data_given_theta*p_theta / p_data # Bayes theorem - -# final estimate -theta_est = sum(theta * p_theta_given_data) -theta_est -``` - -##### Python - -```{python} -#| label: bayesian-demo-py -from scipy.stats import beta - -pk = np.array([ - 'goal','goal','goal','miss','miss', - 'goal','goal','miss','goal','goal' -]) - -# convert to numeric, arbitrarily picking goal=1, miss=0 -N = len(pk) # sample size -n_goal = np.sum(pk == 'goal') # number of pk made -n_miss = np.sum(pk == 'miss') # number of those miss - -# grid of potential theta values -theta = np.linspace(1 / (N + 1), N / (N + 1), 10) - -### prior distribution -# beta prior with mean = .5, but fairly diffuse -# examine the prior -# theta = beta.rvs(5, 5, size = 1000) -# plt.hist(theta, bins = 20, color = 'lightblue') -p_theta = beta.pdf(theta, 5, 5) - -# Normalize so that values sum to 1 -p_theta = p_theta / np.sum(p_theta) - -# likelihood (binomial) -p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss - -# posterior (combination of prior and likelihood) -# p_data is the marginal probability of the data used for normalization -p_data = np.sum(p_data_given_theta * p_theta) - -p_theta_given_data = p_data_given_theta * p_theta / p_data # Bayes theorem - -# final estimate -theta_est = np.sum(theta * p_theta_given_data) -theta_est -``` -::: - -Here is the table that puts all this together. Our prior distribution is centered around a $\theta$ of 0.5 because we made it that way. The likelihood is centered closer to 0.7 because that's the observed chance of scoring a goal. The posterior distribution is a combination of the two. It gives no weight to smaller values, or to the max value. Our final estimate is `r theta_est`, which falls between the prior and likelihood values that have the most weight. With more evidence in the form of data, our estimate will shift more and more towards what the likelihood would suggest. This is a simple example, but it shows how the Bayesian approach works, and this conceptually holds for more complex parameter estimation as well. 
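To make that last point concrete, here is a minimal sketch that reuses the `theta` grid and `p_theta` prior from the R code above, but assumes a hypothetical larger sample of 100 kicks with 70 goals made:

```{r}
#| eval: false
# same grid calculation as before, just with (hypothetical) more data
N_new      = 100
n_goal_new = 70

p_data_given_theta_new = choose(N_new, n_goal_new) *
  theta^n_goal_new * (1 - theta)^(N_new - n_goal_new)

p_theta_given_data_new = p_data_given_theta_new * p_theta /
  sum(p_data_given_theta_new * p_theta)

sum(theta * p_theta_given_data_new)  # noticeably closer to the observed .7
```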
- -```{r} -#| echo: false -#| label: tbl-bayesian-demo -#| tbl-cap: Bayesian Demo Results - -tibble( - theta = theta, - prior = p_theta, - like = p_data_given_theta, - post = p_theta_given_data -) |> - gt() |> - tab_style( - style = cell_text(weight = "bold"), - locations = list( - cells_body( - columns = prior, - rows = which.max(prior) - ), - cells_body( - columns = like, - rows = which.max(like) - ), - cells_body( - columns = post, - rows = which.max(post) - ) - ) - ) -``` - - -:::{.callout-note title='Priors as Regularization' collapse='true'} -In the context of penalized estimation and machine learning, the prior distribution can be thought of as a form of **regularization** (See @sec-estim-penalty and @sec-ml-regularization later). In this context, the prior shrinks the estimate, pulling the parameter estimates towards it, just like the penalty parameter does in the penalized estimation methods. In fact, many penalized methods can be thought of as a Bayesian approach with a specific prior distribution. An example would be ridge regression, which can be thought of as a Bayesian linear regression with a normal prior distribution for the coefficients. The variance of the prior is inversely related to the ridge penalty parameter. -::: - -#### Application {.unnumbered} - -Just like with the bootstrap which also provided distributions for the parameters, we can use the Bayesian approach to understand how certain we are about our estimates. We can look at any range of values in the posterior distribution to get what is often referred to as a **credible interval**, which is the Bayesian equivalent of a confidence interval[^confintinterp]. Here is an example of the posterior distribution for the parameters of our happiness model, along with 95% intervals[^brms]. - -[^brms]: We used the R package for [brms]{.pack} for these results. - -[^confintinterp]: Many people's default interpretation of a standard confidence interval is usually something like 'the range we expect the parameter to reside within'. Unfortunately, that's not quite right, though it is how you interpret the Bayesian interval. The frequentist confidence interval is a range that, if we were to repeat the experiment/data collection many times, contains the true parameter value a certain percentage of the time. For the Bayesian, the parameter is assumed to be random, and so the interval is that which we expect the parameter to fall within a certain percentage of the time. The Bayesian is probably a bit more intuitive for most, even if it's not the more widely used. 
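If you want those intervals as numbers rather than reading them off a plot, one way to get them - a minimal sketch, assuming the `bayes_mod` brms fit used for these figures - is to take quantiles of the posterior draws directly:

```{r}
#| eval: false
# posterior draws from the fitted brms model; each 'b_' column is a coefficient
draws = brms::as_draws_df(bayes_mod)

quantile(draws$b_life_exp_sc, probs = c(.025, .975))  # 95% credible interval
```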
- -:::{.content-visible when-format='html'} -```{r} -#| echo: false -#| label: fig-r-bayesian-posterior -#| fig-cap: Posterior Distribution of Parameters - -# bayes_mod = brms::brm( -# happiness ~ life_exp_sc + gdp_pc_sc + corrupt_sc, -# data = df_happiness, -# prior = c( -# brms::prior(normal(0, 1), class = 'b') -# ), -# thin = 8, -# ) - -# save( -# bayes_mod, -# file = 'estimation/data/brms_happiness.RData' -# ) - -load('estimation/data/brms_happiness.RData') - -p_dat = bayes_mod |> - tidybayes::spread_draws(b_Intercept, b_life_exp_sc, b_gdp_pc_sc, b_corrupt_sc) |> - select(-.chain, -.draw) |> - pivot_longer(-.iteration, names_to = 'Parameter') |> - mutate(Parameter = str_remove(Parameter, 'b_')) - -p_intervals = summary(bayes_mod)$fixed |> - as_tibble(rownames = 'Parameter') |> - rename( - value = Estimate, - lower = `l-95% CI`, - upper = `u-95% CI` - ) - - -p_dat |> - mutate(Parameter = factor(Parameter, unique(Parameter))) |> - ggplot(aes(value)) + - geom_density(color = okabe_ito[2]) + - # add credible interval - geom_point( - aes(x = lower, group = Parameter), - y = 0, - color = okabe_ito[1], - size = 3, - alpha = 1, - data = p_intervals - ) + - geom_point( - aes(x = upper, group = Parameter), - y = 0, - color = okabe_ito[1], - size = 3, - alpha = 1, - data = p_intervals - ) + - geom_segment( - aes( - x = lower, - xend = upper, - y = 0, - yend = 0, - group = Parameter - ), - color = okabe_ito[1], - size = 1, - data = p_intervals - ) + - facet_wrap(~factor(Parameter), scales = 'free') + - labs( - x = 'Parameter Value', - y = 'Density', - caption = '95% Credible Interval shown underneath the density plot' - ) - -ggsave('img/estim-bayesian-posterior.svg', width = 8, height = 6) -``` -::: - -:::{.content-visible when-format='pdf'} -![Posterior Distribution of Parameters](img/estim-bayesian-posterior.svg){#fig-r-bayesian-posterior} -::: - - -With Bayesian estimation we also provide starting values for the algorithm, which is a form of Monte Carlo estimation[^mcmc], to get things going. We also typically specify a number of iterations, or times the model will run, as the **stopping rule**. Each iteration gives us a new guess for each parameter, which amounts to a random draw from the posterior distribution. With more iterations the model takes longer to run, but the length often reflects the complexity of the model. - -[^mcmc]: The most common method for the Bayesian approach is Markov Chain Monte Carlo (MCMC), which is a way to sample from the posterior distribution. There are many MCMC algorithms, many of which are a form of the now fairly old Metropolis-Hastings algorithm, which you can find a demo of at [Michael's doc](https://m-clark.github.io/models-by-example/metropolis-hastings.html). - -We also specify multiple **chains**, which do the same estimation procedure, but due to the random nature of the Bayesian approach and starting point, take different estimation paths[^dlchains]. We can then compare the chains to see if they are converging to the same result, which is a check on the model. If they are not converging, we may need to run the model longer, or it may indicate a problem with how we set up the model. - -Here's an example of the four chains for our happiness model for the life expectancy coefficient. The chains bounce around a bit from one iteration to the next, but on average, they're giving very similar results, so we know the model is working well. 
Nowadays, we have default statistics in the output that also provide this information, which makes it easier to quickly check convergence for many parameters. - -[^dlchains]: Some deep learning implementations will use multiple random starts for similar reasons. - -:::{.content-visible when-format='html'} -```{r} -#| echo: false -#| label: fig-r-bayesian-chains -#| fig-cap: Bayesian Chains for Life Expectancy Coefficient - -p_dat = bayes_mod |> - tidybayes::spread_draws(b_life_exp_sc) |> - select(-.draw) |> - pivot_longer(-c(.chain, .iteration), names_to = 'Parameter') |> - mutate( - Parameter = str_remove(Parameter, 'b_'), - .chain = factor(.chain) - ) - - -p_dat |> - ggplot(aes(.iteration, value)) + - geom_hline(aes(yintercept = mean(value)), color = 'darkred', linewidth = 1) + - geom_line(aes(color = .chain), alpha = .75) + - scale_color_manual(values = unname(okabe_ito)) + - labs( - x = 'Iteration', - y = 'Coefficient', - caption = 'Dark line is the mean value for the life expectancy coefficient.' - ) - -# for this plot it's just easier to make a separate plot for print -p_print = p_dat |> - ggplot(aes(.iteration, value)) + - geom_hline(aes(yintercept = mean(value)), color = 'black', linewidth = 1) + - geom_line(aes(color = .chain, linetype = .chain), linewidth = 1) + - annotate( - 'text', - x = 5, - y = 0.8, - label = 'Each line type represents a separate chain', - color = 'black', - hjust = 0, - size = 3 - ) + - scale_color_grey() + # INTENTIONALLY LEFT GRAY - labs( - x = 'Iteration', - y = 'Coefficient', - caption = 'Horizontal line is the mean value.' - ) + - theme(legend.position = 'none') - -ggsave('img/estim-bayesian-chains.svg', p_print, width = 8, height = 6) - -``` -::: - -:::{.content-visible when-format='pdf'} -![Bayesian Chains for Life Expectancy Coefficient](img/estim-bayesian-chains.svg) -::: - -When we are interested in making predictions, we can use the results to generate a distribution of possible predictions *for each observation*, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as **posterior predictive distribution**, which is explored in a non-bayesian context in @sec-knowing-model-vis. Here is a plot of several draws of predicted values against the true happiness scores. - -:::{.content-visible when-format='html'} -```{r} -#| echo: false -#| label: fig-r-bayesian-posterior-predictive -#| fig-cap: Posterior Predictive Distribution of Happiness Values -p_dat = brms::pp_check(bayes_mod)$data |> - mutate(source = ifelse(is_y, 'Observed', 'Predicted')) |> - select(rep_id, source, value) - -p_dat |> - ggplot(aes(value, color = source, group = rep_id)) + - stat_density( - aes(color = source, group = rep_id, linewidth = I(ifelse(source == 'Observed', 2, .25))), - position = 'identity', - geom = 'borderline', - # show.legend = FALSE - ) + - scale_color_manual(values = unname(okabe_ito[c(1,5)])) + - labs( - x = 'Happiness Value', - y = 'Density', - # caption = 'Observed values are in black' - ) - -p_print = p_dat |> - ggplot(aes(value, color = source, group = rep_id)) + - stat_density( - aes(color = source, group = rep_id, linewidth = I(ifelse(source == 'Observed', 2, .25))), - position = 'identity', - geom = 'borderline', - show.legend = FALSE - ) + - scale_color_manual(values = unname(okabe_ito[c(1,5)])) + - labs( - x = 'Happiness Value', - y = 'Density', - caption = 'Thin lines predicted values, thick line represents the observed target values.' 
- ) - -ggsave('img/estim-bayesian-posterior-predictive.svg', p_print, width = 8, height = 6) -``` -::: - -:::{.content-visible when-format='pdf'} -![Posterior Predictive Distribution of Happiness Values](img/estim-bayesian-posterior-predictive.svg) -::: - - -With the Bayesian approach, every metric we calculate has a range of possible values, not just one. For example, if you have a classification model and want to know the accuracy, AUROC, or true positive rate of the model. Instead of a single number, you would now have access to a whole distribution of values for that metric. How? For each possible set of model parameters from the posterior distribution, we apply those values and model to data to make a prediction. We can then assign it to a class, and compare it to the actual class. This gives us a range of possible predictions and classes. We can then calculate metrics like accuracy or true positive rate for each possible prediction set. As an example, we did this for our happiness model with a numeric target to obtain the interval estimate for R-squared. Pretty neat! - - -```{r} -#| echo: false -#| label: tbl-r-bayesian-metrics -#| tbl-cap: Bayesian R^2^ - -bayes_r2 = performance::r2(bayes_mod) # not very friendly object returned - -tibble(r2 = bayes_r2$R2_Bayes, as_tibble(attr(bayes_r2, 'CI')$R2_Bayes)) |> - select(-CI) |> - rename( - `Bayes R2` = r2, - `Lower` = CI_low, - `Upper` = CI_high - ) |> - gt() |> - tab_footnote( - footnote = '95% Credible interval for R-squared', - # locations = cells_column_labels(columns = `Bayes R2`) - ) |> - tab_options( - footnotes.font.size = 10 - ) -``` - - - -:::{.callout-note title='Frequentist PP check' collapse='true'} -As we saw in @sec-knowing-model-vis, nothing is keeping you from doing 'predictive checks' with other estimation approaches, and it's a very good idea to do so. For example, with a GLM you can use Monte Carlo simulation or the Bootstrap to generate a distribution of predictions, and then compare that to the actual data. This is a good way to check the model's assumptions and see if it's doing what you think it's doing. It's more straightforward with the Bayesian approach, since many modeling packages will do it for you with little effort. -::: - - - -#### Additional Thoughts - -It turns out that any standard (frequentist) statistical model can be seen as a Bayesian one from a certain point of view[^obi]. Here are a couple. - -[^obi]: Cue [Obi Wan Kenobi](https://www.youtube.com/watch?v=pSOBeD1GC_Y). - -- GLM and related models estimated via maximum likelihood: Bayesian estimation with a flat/uniform prior on the parameters. -- Ridge Regression: Bayesian estimation with a normal prior on the coefficients, penalty parameter is related to the variance of the prior. -- Lasso Regression: Bayesian estimation with a Laplace prior on the coefficients, penalty parameter is related to the variance of the prior. -- Mixed Models: random effects are, as the name suggests, random, and so are estimated as a distribution of possible values, which is conceptually in line with the Bayesian approach. - - - -So, in many modeling contexts, you're actually doing a restrictive form of Bayesian estimation already. - -The Bayesian approach is very flexible, and can be used for many different types of models, and can be used to get at uncertainty in a model in ways that other approaches can't. 
It's not always the best approach, even when appropriate, due to the computational burden and the added diagnostic complexity, but it's a good one to have in your toolbox[^rpybayes]. Hopefully we've helped to demystify the Bayesian approach a bit here, and that you feel more comfortable trying it out.

[^rpybayes]: R has excellent tools here for modeling and post-processing, like [brms]{.pack} and [tidybayes]{.pack}, and Python has [pymc3]{.pack}, [numpyro]{.pack}, and [arviz]{.pack}, which are also useful. Honestly, R has way more going on here, with many packages devoted to Bayesian estimation of specific models, but if you want to stick with Python, it's gotten a lot better recently.


### Conformal Methods {#sec-estim-conformal}

Conformal approaches bring us back to the frequentist world, and they specifically concern prediction uncertainty. One of the primary strengths of the approach is that it is model agnostic and theoretically can work for any model, from linear regression to deep learning. Like the bootstrap and Bayesian methods, conformal prediction makes us think in terms of distributions of possible values, but it focuses on residuals or errors in prediction.

It is based on the idea that we can estimate the uncertainty in our predictions by looking at the distribution of the predictions from the model, or more specifically, the prediction error. Using the observed prediction error on a calibration set that was not used to train the model, we can order those errors and find the quantile corresponding to the desired uncertainty coverage/error rate[^errorcov]. When predicting on new data, we assume the predictions and corresponding errors come from a similar distribution as what we've seen already in our training/calibration process. We do this with no particular assumption about that distribution. We then use the estimated quantile to create upper and lower bounds for a prediction for a new observation.

While the implementation for various settings can get quite complicated, the conceptual approach is mostly straightforward. As an example, we can demonstrate the **split-conformal** procedure with the following steps.


1. **Split Data**: Split the dataset into training and calibration sets.
1. **Train Model**: Train the model using the training set.
1. **Calculate Scores**: Calculate conformity scores on the calibration set. These are the *absolute* residuals between the predicted and actual values on the calibration set.
1. **Quantile Calculation**: Determine the quantile value of the conformity scores for the desired confidence level.
1. **Generate Intervals**: For new data points, use the trained model to make predictions, then adjust these predictions by adding and subtracting the quantile value obtained from the conformity scores to get the lower and upper bounds of the prediction intervals.


[^errorcov]: The error rate ($\alpha$) is the proportion of the data that would fall outside the prediction interval, while the coverage rate/interval is 1 - $\alpha$.

Let's now demonstrate the split-conformal method with our happiness model. We'll start by defining the split-conformal function. The function takes the training data and the target variable. It also takes an $\alpha$ value, which is the error rate we want to control, and a calibration split, which is the proportion of the data we use for calibration.
And finally, we designate new data for which we want to make predictions. - -:::{.panel-tabset} - -##### R - - -```{r} -#| label: split-conformal-r-function -split_conformal = function( - X, - y, - new_data, - alpha = .05, - calibration_split = .5 -) { - # Splitting the data into training and calibration sets - idx = sample(1:nrow(X), size = floor(nrow(X) / 2)) - - train_data = X |> slice(idx) - cal_data = X |> slice(-idx) - train_y = y[idx] - cal_y = y[-idx] - - N = nrow(train_data) - - # Train the base model - model = lm(train_y ~ ., data = train_data) - - # Calculate residuals on calibration set - cal_preds = predict(model, newdata = cal_data) - residuals = abs(cal_y - cal_preds) - - # Sort residuals and find the quantile corresponding to (1-alpha) - residuals = sort(residuals) - quantile = quantile(residuals, (1 - alpha) * (N / (N + 1))) - - # Make predictions on new data and calculate prediction intervals - preds = predict(model, newdata = new_data) - lower_bounds = preds - quantile - upper_bounds = preds + quantile - - # Return predictions and prediction intervals - return( - list( - cp_error = quantile, - preds = preds, - lower_bounds = lower_bounds, - upper_bounds = upper_bounds - ) - ) -} -``` - -##### Python - -```{python} -#| label: split-conformal-py-function -def split_conformal(X, y, new_data, alpha = .05, calibration_split = .5): - # Splitting the data into training and calibration sets - X_train, X_cal, y_train, y_cal = train_test_split( - X, - y, - test_size = calibration_split, - random_state = 123 - ) - - N = X_train.shape[0] - - # Train the base model - model = LinearRegression().fit(X_train, y_train) - - # Calculate residuals on calibration set - cal_preds = model.predict(X_cal) - residuals = np.abs(y_cal - cal_preds) - - # Sort residuals and find the quantile corresponding to (1-alpha) - residuals = np.sort(residuals) - - # The correction here is useful for small sample sizes - quantile = np.quantile(residuals, (1 - alpha) * (N / (N + 1))) - - # Make predictions on new data and calculate prediction intervals - preds = model.predict(new_data) - lower_bounds = preds - quantile - upper_bounds = preds + quantile - - # Return predictions and prediction intervals - return { - 'cp_error': quantile, - 'preds': preds, - 'lower_bounds': lower_bounds, - 'upper_bounds': upper_bounds - } -``` - -::: - - -With our functions in place, we can now use them to calculate the prediction intervals for the happiness model. The `cp_error` value gives us the quantile value that we use to generate the prediction intervals. The table below shows the first few predictions and their corresponding prediction intervals. 
- -:::{.panel-tabset} - -##### R - -```{r} -#| label: split-conformal-r-calculate -#| results: hide -# split data -set.seed(123) - -idx_train = sample(nrow(df_happiness), nrow(df_happiness) * .8) -idx_test = setdiff(1:nrow(df_happiness), idx_train) - -df_train = df_happiness |> - slice(idx_train) |> - select(happiness, life_exp_sc, gdp_pc_sc, corrupt_sc) - -y_train = df_happiness$happiness[idx_train] - -df_test = df_happiness |> - slice(idx_test) |> - select(life_exp_sc, gdp_pc_sc, corrupt_sc) - -y_test = df_happiness$happiness[idx_test] - -cp_error = split_conformal( - df_train |> select(-happiness), - y_train, - df_test, - alpha = .1 -) - -# cp_error[['cp_error']] - -tibble( - preds = cp_error[['preds']], - lower_bounds = cp_error[['lower_bounds']], - upper_bounds = cp_error[['upper_bounds']] -) |> - head() -``` - -##### Python - -```{python} -#| label: split-conformal-py-calculate -#| results: hide -# split data -X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']] -y = df_happiness['happiness'] - -X_train, X_test, y_train, y_test = train_test_split( - df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']], - df_happiness['happiness'], - test_size = 0.5, - random_state = 123 -) - -our_cp_error = split_conformal( - X_train, - y_train, - X_test, - alpha = .1 -) - -# print(our_cp_error['cp_error']) - -pd.DataFrame({ - 'preds': our_cp_error['preds'], - 'lower_bounds': our_cp_error['lower_bounds'], - 'upper_bounds': our_cp_error['upper_bounds'] -}).head() -``` - -::: - - - -```{python} -#| eval: false -#| echo: false -#| label: MAPIE-comparison -from mapie.regression import MapieRegressor - -X = df_happiness[['life_exp_sc', 'gdp_pc_sc', 'corrupt_sc']] -y = df_happiness['happiness'] - -initial_fit = LinearRegression().fit(X, y) -model = MapieRegressor(initial_fit, method = 'naive') -y_pred, y_pis = model.fit(X, y).predict(X_test, alpha = 0.1) - -# take the first difference between upper and lower bounds, -# since it's constant for all predictions in this setting -cp_error = (y_pis[0, 1, 0] - y_pis[0, 0, 0]) / 2 -``` - -As a method of uncertainty estimation, conformal prediction is not without its challenges. It is computationally intensive for large datasets or complex models. There are multiple variants of conformal prediction, most of which attempt to alleviate a deficiency of simpler approaches. But they generally further increase the computational burden. - -Conformal prediction still relies on the assumptions about the data and the underlying model, and violations of these assumptions can lead to invalid prediction intervals. Furthermore, conformal prediction methods assume that the training and test data come from the same distribution, which may not always be the case in real-world applications due to distribution shifts or domain changes. In addition, validation sets must be viable splits of the data, which default splitting methods may not always provide. In general, for conformal prediction provides an alternative to other frequentist or Bayesian approaches that, under the right circumstances, may produce a better estimate of uncertainty, but does not come for free. - ## Wrapping Up {#sec-estim-wrap} @@ -3311,49 +2299,6 @@ More demonstration of the simple AdaGrad algorithm used above: - @brownlee_gradient_2021 - @databricks_what_2019 -**Frequentist Approaches**: - -- Most statistical texts cover uncertainty estimation from the frequentist perspective. Pick one you like. 
-- [Error Statistics](https://errorstatistics.com/about-2/) Deborah Mayo's blog and comments on other blogs have always provided a strong philosophical defense of frequentist statistics. - - -**Monte Carlo**: - -- [Monte Carlo Methods](https://www.youtube.com/watch?v=OgO1gpXSUzU) John Guttag's MIT Course lecture on YouTube. - -**Bootstrap**: - -Classical treatments: - -- @efron_introduction_1994 -- @davison_bootstrap_1997 - -More fun demo: - -- [Bootstrapping Main Ideas](https://www.youtube.com/watch?v=sDv4f4s2SB8) @statquest_with_josh_starmer_bootstrapping_2021 - -**Bayesian**: - -- BDA @gelman_bayesian_2013 -- Statistical Rethinking @mcelreath_statistical_2020 -- [Choosing priors](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) - -**Conformal Prediction**: - -General: - -- A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification (@angelopoulos_gentle_2022); [Example Python Notebooks](https://github.com/aangelopoulos/conformal-prediction) - -R demos: - -- [Conformal inference for regression models](https://www.tidymodels.org/learn/models/conformal-regression/) -- [Conformal prediction](https://marginaleffects.com/vignettes/conformal.html) - -Python demos: - -- Introduction To Conformal Prediction With Python (@molnar_introduction_2024) -- [Mapie Docs](https://mapie.readthedocs.io/en/latest/) - ## Guided Exploration {#sec-estim-exercise} diff --git a/generalized_linear_models.qmd b/generalized_linear_models.qmd index f0be37e..1ed20df 100644 --- a/generalized_linear_models.qmd +++ b/generalized_linear_models.qmd @@ -76,7 +76,7 @@ $$ -We create the linear combination of our features, and then we employ a normal distribution that uses that combination as the mean, which will naturally vary for each sample of data. However, this may not be the best approach in many cases. Instead, we can use on some other distribution that potentially fits the data better. But often these other distributions don't have a direct link to our features, and that's where a **link function** comes in. +We create the linear combination of our features, and then we employ a normal distribution that uses that combination as the mean, which will naturally vary for each sample of data. However, this may not be the best approach in many cases. Instead, we can use some other distribution that potentially fits the data better. But often these other distributions don't have a direct link to our features, and that's where a **link function** comes in. Think of the link function as a bridge between our features and the distribution we want to use. It lets us use a linear combination of features to predict the mean or other parameters of the distribution. As an example, we can use a log to link the mean to the linear predictor, or conversely, exponentiate the linear predictor to get the mean. In this example, the log is the link function and we use its inverse to map the linear predictor back to the mean. @@ -499,7 +499,7 @@ model_logistic.summary() ::: -Now that we have some results, we can see that they aren't too dissimilar from the linear regression output we obtained before. But, let's examine them more closely in hte next section. +Now that we have some results, we can see that they aren't too dissimilar from the linear regression output we obtained before. But, let's examine them more closely in the next section. 
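Before moving on, here is a tiny numeric sketch of the link function idea from above, using made-up numbers rather than our actual model output. With the logit link, the inverse link takes the linear predictor (log odds) to a probability, and exponentiating a coefficient gives an odds ratio.

```{python}
#| eval: false
# A sketch of the logit link with hypothetical values -- not results from the model above.
import numpy as np

coefs = np.array([0.25, -0.9, 0.4])   # hypothetical intercept and coefficients on the log odds scale
x = np.array([1.0, 0.5, 2.0])         # a hypothetical observation (first entry pairs with the intercept)

eta = x @ coefs                       # linear predictor = log odds
p = 1 / (1 + np.exp(-eta))            # inverse logit link: log odds -> probability
odds_ratios = np.exp(coefs)           # exponentiated coefficients are odds ratios
```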
:::{.callout-note title='Binomial Regression' collapse='true'} @@ -511,7 +511,7 @@ Many coming from a non-statistical background are not aware that their logistic ### Interpretation and visualization {#sec-glm-binomial-interpret} -If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in the log odds scale. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from 'bad' to 'good', whereas a negative log odds would indicate that a increase in the feature will decrease the log odds of moving from 'bad' to 'good'. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease. As we demonstrated, we can exponentiate them to get the odds ratio for additional interpretability. +If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in the log odds scale. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from 'bad' to 'good', whereas a negative log odds would indicate that an increase in the feature will decrease the log odds of moving from 'bad' to 'good'. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease. As we demonstrated, we can exponentiate them to get the odds ratio for additional interpretability. ```{r} diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index 0b65f01..e5e0685 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -55,7 +55,7 @@ While these models are extensions of the linear model, they are not significantl Things can be quite complex in a typical model with multiple features, but just adding features may not be enough to capture the complexity of the relationships between features and target. Sometimes, we need to consider how features interact with each other to better understand how they correlate with the target. A common way to add complexity in linear models is through **interactions**. This is where we allow the effect of a feature to vary depending on the values of another feature, or even itself! -As a conceptual example, we can think about a movie's rating is different for movies from different genres. For example, maybe by default ratings are higher for kids movies, and lower for horror movies. But, genre and season might work together in some way to affect rating, e.g., action movies get higher ratings in summer. Or maybe having kids in the home might also interact with genre ratings by naturally resulting in higher ratings for kids movies. As a different example, we might also consider that the length of a movie might positively relate to rating, but plateau or even have a negative effect on rating after a certain point. 
In other words, it would have a **curvilinear** effect where really long movies aren't as highly rated as those of shorter length. +As a conceptual example, we can think about a movie's rating being different for movies from different genres. For example, maybe by default ratings are higher for kids movies, and lower for horror movies. But, genre and season might work together in some way to affect rating, e.g., action movies get higher ratings in summer. Or maybe having kids in the home might also interact with genre ratings by naturally resulting in higher ratings for kids movies. As a different example, we might also consider that the length of a movie might positively relate to rating, but plateau or even have a negative effect on rating after a certain point. In other words, it would have a **curvilinear** effect where really long movies aren't as highly rated as those of shorter length. All of these are types of interactions we can explore. Interactions allow us to incorporate nonlinear relationships into the model, and so greatly extend the linear model's capabilities - we basically get to use a linear model in a nonlinear way! @@ -527,9 +527,9 @@ A simple random intercept and slope is just the start. As an example, we can let ### Using a mixed model {#sec-mixed-models-using} -One of the key advantages of a mixed is that we can use it when the observations within a group are not independent. This is a very common situation in many fields, and it's a good idea to consider this when you have grouped data. As an example we'll use the happiness data for all available years, and we'll consider the country as a grouping variable. In this case, observations within a country are likely to be more similar to each other than to observations from other countries. Such **longitudinal data** is a classic example of when to use a mixed model. This is also a case where we wouldn't just throw in country as a feature like any other factor, since there are `r n_distinct(df_happiness_all$country)` countries in the data. We need an easier way to handle so many groups! +One of the key advantages of a mixed model is that we can use it when the observations within a group are not independent. This is a very common situation in many fields, and it's a good idea to consider this when you have grouped data. As an example we'll use the happiness data for all available years, and we'll consider the country as a grouping variable. In this case, observations within a country are likely to be more similar to each other than to observations from other countries. Such **longitudinal data** is a classic example of when to use a mixed model. This is also a case where we wouldn't just throw in 'country' as a feature like any other factor, since there are `r n_distinct(df_happiness_all$country)` countries in the data. We need an easier way to handle so many groups! -In general, to use mixed models we have to specify a random effect pertaining to the categorical feature of focus, but that's the primary difference from our previous approaches used for linear or generalized linear models. For our example, we'll look at a model with a random intercept for country, and one that adds a random coefficient for the yearly trend across countries. This means that we are allowing the intercepts and slopes to vary across countries, and the intercepts and slopes can correlate with one another. Furthermore, by recoding year to start at zero, the intercept will represent the happiness score at the start of the data.
In addition, to see a more reasonable effect, we also divide the yearly trend by 10, so the coefficient provides the change in happiness score per decade. +In general, to use mixed models we have to specify a random effect pertaining to the categorical feature of focus, but that's the primary difference from our previous approaches used for linear or generalized linear models. For our example, we'll look at a model with a random intercept for the country feature, and one that adds a random coefficient for the yearly trend across countries. This means that we are allowing the intercepts and slopes to vary across countries, and the intercepts and slopes can correlate with one another. Furthermore, by recoding year to start at zero, the intercept will represent the happiness score at the start of the data. In addition, to see a more reasonable effect, we also divide the yearly trend by 10, so the coefficient provides the change in happiness score per decade. :::{.panel-tabset} diff --git a/uncertainty.qmd b/uncertainty.qmd index 64c9682..4724b41 100644 --- a/uncertainty.qmd +++ b/uncertainty.qmd @@ -334,7 +334,7 @@ our_mc = mc_predictions(model) ::: -Here are the results of the Monte Carlo simulation for the prediction intervals. They are pretty close to what we'd already have available from the model package used for linear regression. However, we can use this for other models where uncertainty estimates are not readily available, can help us estimate values and intervals in a general way. +Here are the results of the Monte Carlo simulation for the prediction intervals. They are pretty close to what we'd already have available from the model package used for linear regression. However, we can use this for other models where uncertainty estimates are not readily available, providing a more general tool. ```{r} #| echo: false @@ -362,11 +362,11 @@ Monte Carlo simulation is a very popular approach in modeling, and a variant of ## Bootstrap {#sec-estim-bootstrap} -An extremely common method for estimating uncertainty is the **bootstrap**. The bootstrap is a method where we create new datasets by randomly sampling the original data with replacement. This means that each new dataset is the same size as the original, but some observations may be selected multiple times while others may not be selected at all. We then estimate our model with each data set, and each time, we can collect parameter estimates, predictions, or any other calculations we are interested in. Ultimately, we end up with a *distribution* of all the things we calculated. The nice thing about this we don't need to know the specific distribution (e.g., normal, or t-distribution) of the values we want to get uncertainty estimates for, we can just use the data we have to produce that distribution. And this is a key distinction from the Monte Carlo method just discussed. +An extremely common method for estimating uncertainty is the **bootstrap**. The bootstrap is a method where we create new datasets by randomly sampling the original data with replacement. This means that each new dataset is the same size as the original, but some observations may be selected multiple times while others may not be selected at all. We then estimate our model with each data set, and each time, we can collect parameter estimates, predictions, or any other calculations we are interested in. Ultimately, we end up with a *distribution* of all the things we calculated. 
The nice thing about this is that we don't need to know the specific distribution (e.g., normal, or t-distribution) of the values we want to get uncertainty estimates for, we can just use the data we have to produce that distribution. And this is a key distinction from the Monte Carlo method just discussed. The results of bootstrapping give us a range of possible values, which is useful for inference[^infdef], as we can use the distribution to calculate interval estimates. The average parameter estimate is typically the same as whatever the underlying model used would produce, so not really useful for that in the context of simpler linear models. Even so, we can calculate derivatives of the parameters, like say a ratio or sum, or a model metric like R^2^, or a prediction. Some of these normally would not be estimated as part of the model, or maybe the model tool does not provide anything beyond the value itself. Yet the bootstrap provides a way to get at a measure of uncertainty for the values of interest, with fewer assumptions about how that distribution should take shape. -The approach bootstrap very flexible, and it can potentially be used with any model whether in a statistical or machine learning context. Let's see this in action with the happiness model. We'll create a bootstrap function, then use it to estimate the uncertainty in the coefficients for the happiness model. +The approach is very flexible, and it can potentially be used with any model whether in a statistical or machine learning context. Let's see this in action with the happiness data. We'll create a bootstrap function, then use it to estimate the uncertainty in the coefficients for the model. [^infdef]: We're using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We're not exactly sure how that terminological muddling arose in ML, but be on the lookout for it. @@ -1227,7 +1227,7 @@ cp_error = (y_pis[0, 1, 0] - y_pis[0, 0, 0]) / 2 As a method of uncertainty estimation, conformal prediction is not without its challenges. It is computationally intensive for large datasets or complex models. There are multiple variants of conformal prediction, most of which attempt to alleviate a deficiency of simpler approaches. But they generally further increase the computational burden. -Conformal prediction still relies on the assumptions about the data and the underlying model, and violations of these assumptions can lead to invalid prediction intervals. Furthermore, conformal prediction methods assume that the training and test data come from the same distribution, which may not always be the case in real-world applications due to distribution shifts or domain changes. In addition, validation sets must be viable splits of the data, which default splitting methods may not always provide. In general, for conformal prediction provides an alternative to other frequentist or Bayesian approaches that, under the right circumstances, may produce a better estimate of uncertainty, but does not come for free. +Conformal prediction still relies on the assumptions about the data and the underlying model, and violations of these assumptions can lead to invalid prediction intervals. Furthermore, conformal prediction methods assume that the training and test data come from the same distribution, which may not always be the case in real-world applications due to distribution shifts or domain changes. 
In addition, validation sets must be viable splits of the data, which default splitting methods may not always provide. In general, conformal prediction provides an alternative to other frequentist or Bayesian approaches that, under the right circumstances, may produce a better estimate of uncertainty, but does not come for free. @@ -1242,7 +1242,7 @@ We hope you now have a better understanding of how to estimate uncertainty in yo ### The common thread {#sec-estim-thread} -No model is without uncertainty, so any of these techniques may be applicable to your work. The choice of method depends on the complexity of the model, the amount of data, and the resources you want to spend. +No model is without uncertainty, so any of these techniques may be applicable to your work. The choice of method depends largely on how you want to tackle the issue. @@ -1277,7 +1277,7 @@ A more fun demo: **Bayesian**: -- Bayesian Data Analysis @gelman_bayesian_2013. For many this is the Bayesian bible. +- Bayesian Data Analysis @gelman_bayesian_2013. For many, this is the Bayesian bible. - Statistical Rethinking @mcelreath_statistical_2020. A fantastic modeling book, Bayesian or otherwise. - [Choosing priors](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) diff --git a/understanding_features.qmd b/understanding_features.qmd index 897c97a..29d425f 100644 --- a/understanding_features.qmd +++ b/understanding_features.qmd @@ -24,7 +24,7 @@ We'd suggest having the linear model basics down pretty well, as much of what we ## Data Setup -We'll use the movie review data as our with our previous chapters. Later on, we'll also use the world happiness data set to explore some more advanced concepts. For the movie review data, we'll split the data into training and testing sets, and then fit a linear regression model and a logistic regression model to the training data. We'll then use the testing data to evaluate the models. +We'll use the movie review data as with our previous chapters. Later on, we'll also use the world happiness data set to explore some more advanced concepts. For the movie review data, we'll split the data into training and testing sets, and then fit a linear regression model and a logistic regression model to the training data. We'll then use the testing data to evaluate the models. :::{.panel-tabset} @@ -343,7 +343,7 @@ When word count is zero, i.e. its mean and everything else is at its mean/mode, Let's say we want to boil our understanding of the feature-target relationship to a single number. In this case, the coefficient is fine if we're dealing with an entirely linear model. In this classification case, the raw coefficient tells us what we need to know, but on the log odds scale, which is not very intuitive for most folks. We can understand the probability scale, but this means things get nonlinear. As an example, a .1 to .2 change in the probability is doubling it, while a .8 to .9 change is a 12.5% increase in the probability. But is there any way we can stick with probabilities and get a single value to understand the change in the probability of a good review as word count changes by 1 unit? -Yes! We can look at what's called the **average marginal effect** of word count. This is the average of the slope of the predicted probability of a good review as word count changes. This is a bit more complicated than just looking at the coefficient, but it's still intuitive, and moreso than a coefficient that regards odds. How do we get it? 
By a neat little trick where we predict the target with the feature at two values. We start with the observed value and then add or subtract a very small amount. Then we take the difference in the predictions for those two feature values. This results in the same thing as taking the derivative of the target with respect to the feature. +Yes! We can look at what's called the **average marginal effect** of word count. This is the average of the slope of the predicted probability of a good review as word count changes. This is a bit more complicated than just looking at the coefficient, but it's still intuitive, and more so than a coefficient that regards odds. How do we get it? By a neat little trick where we predict the target with the feature at two values. We start with the observed value and then add or subtract a very small amount. Then we take the difference in the predictions for those two feature values. This results in the same thing as taking the derivative of the target with respect to the feature. ```{r} #| echo: false @@ -436,7 +436,7 @@ np.mean(fudge_plus - fudge_minus) / fudge_factor ::: -Our results suggests we're getting about a `r round(mean(fudge_plus - fudge_minus) / fudge_factor, 1)` drop in the expected probability of a good review for a 1 standard deviation increase in word count. This is a bit more intuitive than the coefficient or odds ratio based on it, and we probably don't want to ignore that sort of change. It also doesn't take much to get with the right package, or even on our own. Another nice thing about this approach is that it can potentially be applied to any model, including ones that don't normally produce coefficients, like gradient boosting models or deep learning models. +Our result suggests we're getting about a `r round(mean(fudge_plus - fudge_minus) / fudge_factor, 1)` drop in the expected probability of a good review for a 1 standard deviation increase in word count on average. This is a bit more intuitive than the coefficient or odds ratio based on it, and we probably don't want to ignore that sort of change. It also doesn't take much to get with the right package, or even on our own. Another nice thing about this approach is that it can potentially be applied to any model, including ones that don't normally produce coefficients, like gradient boosting models or deep learning models. ### Marginal means diff --git a/understanding_models.qmd b/understanding_models.qmd index 5a68a62..b298885 100644 --- a/understanding_models.qmd +++ b/understanding_models.qmd @@ -259,7 +259,7 @@ df_train.to_csv("data/models/knowing_models/knowing_model_data_train.csv", index df_test.to_csv("data/models/knowing_models/knowing_model_data_test.csv", index = False) ``` -You'll notice that we created training data with 75% of our data and we will use the other 25% to test our model. This is an arbitrary but common split. With training data in hand, let's produce a model to predict review rating. We'll use the standardized (scaled `_sc`) versions of several features, and use the 'year' features starting at year 0, which represents the earlies year observed in our data[^whyzero]. Finally, we also include the genre of the movie as a categorical feature. +You'll notice that we created training data with 75% of our data and we will use the other 25% to test our model. This is an arbitrary but common split. With training data in hand, let's produce a model to predict review rating. 
We'll use the standardized (scaled `_sc`) versions of several features, and use the 'year' features starting at year 0, which represents the earliest year observed in our data[^whyzero]. Finally, we also include the genre of the movie as a categorical feature. [^whyzero]: See @sec-data-time-features for more on why we like to start year features at 0. @@ -989,7 +989,7 @@ without actually showing any text, or in html also, although the reference in pd Earlier when we obtained the predicted class, and subsequently all the metrics based on it, we used a predicted probability value of 0.5 as a cutoff for a 'good' vs. a 'bad' rating, and this is usually the default if we don't specify it explicitly. Assuming that this is the best for a given situation is actually a bold assumption on our part, and we should probably make sure that the cut-off value we choose is going to offer us the best result given the modeling context. -But what is the *best* result? That's going to depend on the situation. If we are predicting whether a patient has a disease, we might want to minimize false negatives, since if we miss the diagnosis, the patient could be in serious trouble. Meanwhile if we are predicting whether a transaction is fraudulent, we might want to minimize false positives, since if we flag a transaction as fraudulent when it isn't, we could be causing a lot of trouble for the customer, and add cost the company to deal with it. In other words, we might want to maximize the true positive or true negative rates, respectively. +But what is the *best* result? That's going to depend on the situation. If we are predicting whether a patient has a disease, we might want to minimize false negatives, since if we miss the diagnosis, the patient could be in serious trouble. Meanwhile if we are predicting whether a transaction is fraudulent, we might want to minimize false positives, since if we flag a transaction as fraudulent when it isn't, we could be causing a lot of trouble for the customer, and add cost to the company to deal with it. In other words, we might want to maximize the true positive or true negative rates, respectively. Whatever we decide, we ultimately are just shifting the metrics around relative to one another. As an easy example, if we were to classify all of our observations as 'good', we would have a sensitivity of 1 because all good ratings would be classified correctly. However, our positive predictive value would not be 1, and we'd have a specificity of 0. No matter which cutpoint we choose, we are going to have to make a tradeoff. From 4f5205c8143225239f5d677a8d04223c75d47a57 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 24 Nov 2024 18:24:02 -0500 Subject: [PATCH 07/19] gc through ML --- machine_learning.qmd | 10 ++++++---- ml_common_models.qmd | 4 ++-- ml_more.qmd | 8 ++++---- 3 files changed, 12 insertions(+), 10 deletions(-) diff --git a/machine_learning.qmd b/machine_learning.qmd index 652e155..5ac67b6 100644 --- a/machine_learning.qmd +++ b/machine_learning.qmd @@ -754,7 +754,7 @@ The take home point is this: our primary concern is generalization error. We can We now are very aware that a key aspect of the machine learning approach is having our model to work well with new data. One way to improve generalization is through the use of **regularization**, which is a general approach to penalize complexity in a model, and is typically used to prevent **overfitting**. Overfitting occurs when a model fits the training data very well, but does not generalize well to new data. 
This usually happens when the model is too complex and starts fitting to random noise in the training data. We can also have the opposite problem, where the model is too simple to capture the patterns in the data, and this is known as **underfitting**[^underfit]. -[^underfit]: Underfitting is a notable problem in many academic disciplines, where the models are often too simple to capture the complexity of the underlying process. Typically the model employed assume linear relationships without any interactions, and the true data generating process may be anything but. These models are chosen for their simplicity and interpretability, rather than how well they can explain the phenomenon in question. However, one could make the argument that 'understanding' an unrealistic result is not very useful either, and that the goal should be to understand the true process however we can, and not just choose a model that's convenient. +[^underfit]: Underfitting is a notable problem in many academic disciplines, where the models are often too simple to capture the complexity of the underlying process. Typically, the model employed assumes linear relationships without any interactions, and the true data generating process may be anything but. These models are chosen for their simplicity and interpretability, rather than how well they can explain the phenomenon in question. However, one could make the argument that 'understanding' an unrealistic result is not very useful either, and that the goal should be to understand the true process however we can, and not just choose a model that's convenient. In the following demonstration, the first plot shows results from a model that is probably too complex for the data setting. The curve is very wiggly as it tries as much of the data as possible, and is an example of overfitting. The second plot shows a straight line fit as we'd get from linear regression. It's too simple for the underlying feature-target relationship, and is an example of underfitting. The third plot shows a model that is a better fit to the data, and is an example of a model that is complex enough to capture the nonlinear aspect of the data, but not so complex that it capitalizes on a lot of noise. @@ -821,7 +821,7 @@ ggsave('img/ml-core-over-under-fit.svg', width = 8, height = 6) ::: -As a demonstration, let's examine generalization performance in this type of setting[^gamdat] with the following table that represents test set RMSE. We see that the overfit model does best on training data, but relatively very poorly on test- nearly a 20% increase in the RMSE value. The underfit model doesn't change as much in test performance because it was poor to begin with, and is the worst performer for both. Our 'better' model wasn't best on training, but was best on the test set. +As a demonstration, let's examine generalization performance in this type of setting[^gamdat] with the following table that represents the test set RMSE. We see that the overfit model does best on training data, but relatively very poorly on test- nearly a 20% increase in the RMSE value. The underfit model doesn't change as much in test performance because it was poor to begin with, and is the worst performer for both. Our 'better' model wasn't best on training, but was best on the test set. [^gamdat]: The data is based on a simulation (using `mgcv::gamSim`), with training sample of 200 and scale of 1, so the test data is just more simulated data points. 
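To make the contrast concrete, here is a small, self-contained sketch along the same lines. The data are simulated and the polynomial degrees are arbitrary choices for illustration; this is not the `gamSim` setup used for the table, and exact numbers will vary.

```{python}
#| eval: false
# A sketch of under/overfitting: compare training and test RMSE for models of varying flexibility.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(123)
x = rng.uniform(0, 3, 300)
y = np.sin(2 * x) + rng.normal(0, 0.5, 300)   # nonlinear truth plus noise
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

for name, degree in [('underfit', 1), ('better', 4), ('overfit', 15)]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    rmse_test  = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'{name}: train RMSE = {rmse_train:.2f}, test RMSE = {rmse_test:.2f}')
```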
@@ -1154,7 +1154,9 @@ From the five validation sets, we end up with five separate accuracy values, one There are different approaches we can take for cross-validation that we may need for different data scenarios. Here are some of the more common ones. - **Shuffled**: Shuffling prior to splitting can help avoid data ordering having undue effects. -- **Grouped/stratified**: In cases where we want to account for the grouping of the data, e.g. for data with a hierarchical structure. We may want groups to appear in training *or* test, but not both, as with grouped k-fold. Or we may want to ensure group proportions are consistent across training and test sets, as with stratified k-fold. +- **Grouped/stratified**: In cases where we want to account for the grouping of the data, e.g. for data with a hierarchical structure. + - Grouped: We may want groups to appear in training *or* test, but not both. This allows us to generalize to new groups. + - Stratified: We may want to ensure group proportions are consistent across training and test sets. This is especially useful in unbalanced target settings to ensure all class labels are present in training and test. - **Time-based**: for time series data, where we only want to assess error on future values - **Combinations**: For example, grouped and time-based @@ -1184,7 +1186,7 @@ It's generally always useful to use a stratified approach to cross-validation, e ## Tuning {#sec-ml-tuning} -One problem with the previous ridge logistic model we just used is that we set the penalty parameter to a fixed value. We can do better by searching over a range of values instead, and picking a 'best' value based on which model performs best. This is generally known as **hyperparameter tuning**, or simply **tuning**. We can do this with k-fold cross-validation to assess the error for each value of the penalty parameter values. We then select the value of the penalty parameter that gives the lowest average error. This is a form of model selection. +One problem with the previous ridge logistic model we just used is that we set the penalty parameter to a fixed value. We can do better by searching over a range of values instead, and picking a 'best' value based on which model performs best with a specific penalty value. This is generally known as **hyperparameter tuning**, or simply **tuning**. It is one aspect of machine learning that distinguishes it from traditional statistical modeling, where we usually don't have hyperparameters to consider. For this example with penalized regression, we can tune our model with k-fold cross-validation to assess the error for each proposed value of the penalty parameter. We then select the value of the penalty parameter for which the associated model gives the lowest average error. This is a form of model selection. Another potential point of concern is that we are using the same data to both select the model and assess its performance. This is a type of a more general phenomenon of **data leakage**, and may result in an overly optimistic assessment of performance. One solution is to do as we've discussed before, which is to split the data into three parts: training, validation, and test. We use the training set(s) to fit the model, assess their performance on the validation set(s), and select the best model. Then finally we use the test set to assess the best model's performance. So the validation approach is used to select the model, and the test set is used to assess that model's performance. 
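As a minimal sketch of that workflow, with made-up data and an arbitrary penalty grid rather than the model from the text: tune the penalty with k-fold cross-validation on the training data, then assess the selected model once on the held-out test set.

```{python}
#| eval: false
# A sketch of tuning a penalized (ridge) logistic regression via 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# In scikit-learn, C is the *inverse* of the penalty strength
grid = {'C': [0.01, 0.1, 1, 10, 100]}

tuned = GridSearchCV(
    LogisticRegression(penalty='l2', max_iter=1000),
    param_grid=grid,
    cv=5,                 # 5-fold cross-validation on the training data only
    scoring='accuracy'
)
tuned.fit(X_train, y_train)

tuned.best_params_            # the selected penalty value
tuned.score(X_test, y_test)   # final assessment on the untouched test set
```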
The following visualizations from the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html) illustrates the process. diff --git a/ml_common_models.qmd b/ml_common_models.qmd index a2a44c1..db15e82 100644 --- a/ml_common_models.qmd +++ b/ml_common_models.qmd @@ -205,7 +205,7 @@ ggsave("img/ml-beat_the_baseline.svg", width = 8, height = 6, dpi = 300) Before getting carried away with models, we should have a good reference point for performance - a **baseline model**. The baseline model should serve as a way to gauge how much better your model performs over one that is simpler, probably more computationally efficient, more interpretable, and is still *viable*. It could also be a model that is sufficiently complex to capture something about the data you are exploring, but not as complex as the models you're also interested in. -Take a classification model for example. In this case we might use a logistic regression as a baseline. It is a viable model to begin answering some questions, and get a sense of performance possibilities. But is often too simple to be adequately performant for many situations. We should be able to do better with more complex models, or if we can't, there is little justification for using them. +Take a classification model for example. In this case we might use a logistic regression as a baseline. It is a viable model to begin answering some questions, and get a sense of performance possibilities, but it is often too simple to be adequately performant for many situations. We should be able to do better with more complex models, or if we can't, there is little justification for using them. ### Why do we do this? {#sec-ml-common-baseline-why} @@ -452,7 +452,7 @@ The depth of each tree refers to how many levels we allow the model to branch ou It's also generally a good idea to take a random sample of features for each tree (or possibly even each branch), to also help reduce overfitting, but it's not obvious what proportion to take. The regularization parameters[^gbl1l2] are typically less important in practice, but can help reduce overfitting as in other modeling circumstances we've talked about. As with hyperparameters in other model settings, you'll use something like cross-validation to settle on final values. -[^gbl1l2]: For boosting models, the regularization parameters are basically penalties regarding to the weights of the leaves. For example, a smaller value would reduce the contribution of that leaf to the overall model, and so would help to reduce overfitting. +[^gbl1l2]: For boosting models, the regularization parameters are basically penalties on the weights of the leaves. For example, a smaller value would reduce the contribution of that leaf to the overall model, and so would help to reduce overfitting. ### Example with LightGBM {#sec-ml-common-trees-lightgbm} diff --git a/ml_more.qmd b/ml_more.qmd index f5f2950..3cbf3df 100644 --- a/ml_more.qmd +++ b/ml_more.qmd @@ -138,7 +138,7 @@ Consider the following setup for such a situation: An autoencoder in this case would be equivalent to principal components analysis. In the approach described, PCA perfectly reconstructs the original data when considering all components, and so the error would be zero. But that doesn't give us any dimension reduction, as we have as many nodes in the compression layer as we did inputs. So with PCA, we often only focus on a small number of components that capture the data variance by some arbitrary amount. 
The discarded nodes are actually still estimated though. -Neural networks are not bound to linear activation functions, the size of the inputs, or even a single layer. As such, they provide a much more flexible approach that can compress the data at a certain layer, but still have very good reconstruction error. Typical autoencoders would have multiple layers with notably more nodes than inputs, at least for some layers. They may ultimately compress to a bottleneck layer comprised of a fewer set of nodes, before expanding out again. An autoencoder is not as easily interpretable as typical factor analytic techniques, and we still have to sort out the architecture. However, it's a good example of how the same underlying approach can be used for different purposes. +Neural networks are not bound to linear activation functions, the size of the inputs, or even a single layer. As such, they provide a much more flexible approach that can compress the data at a certain layer, but still have very good reconstruction error. Typical autoencoders would have multiple layers with notably more nodes than inputs, at least for some layers. They may ultimately compress to a bottleneck layer consisting of a fewer set of nodes, before expanding out again. An autoencoder is not as easily interpretable as typical factor analytic techniques, and we still have to sort out the architecture. However, it's a good example of how the same underlying approach can be used for different purposes. ![Conceptual Diagram of an Autoencoder](img/autoencoder.png){#fig-autoencoder width=75%} @@ -278,7 +278,7 @@ Reinforcement learning has many applications, including robotics, games, and aut ## Working with Specialized Data Types {#sec-ml-more-non-tabular} -While our focus in this book is on tabular data due to its ubiquity, there are many other types of data that is used for machine learning and modeling in general. This data often starts in a special format or must be considered uniquely. You'll often hear this labeled as 'unstructured', but that's probably not the best conceptual way to think about it, as the data is still structured in some way, sometimes in a strict format (e.g. images). Here we'll briefly discuss some of the other types of data you'll potentially come across. +While our focus in this book is on tabular data due to its ubiquity, there are many other types of data used for machine learning and modeling in general. This data often starts in a special format or must be considered uniquely. You'll often hear this labeled as 'unstructured', but that's probably not the best conceptual way to think about it, as the data is still structured in some way, sometimes in a strict format (e.g. images). Here we'll briefly discuss some of the other types of data you'll potentially come across. ### Spatial {#sec-ml-more-spatial} @@ -319,7 +319,7 @@ Deep learning methods have proven very effective for analyzing audio data, and c ![Convolutional Neural Network [LeNet](https://alexlenail.me/NN-SVG/LeNet.html)](img/cnn.png){#fig-cnn width=66%} -Computer vision involves a range of models and techniques for analyzing and interpreting image-based data. It includes tasks like image classification (labeling an image), object detection (finding the location of objects are in an image), image segmentation (identifying the boundaries of objects in an image), and object tracking (following objects as they move over time). +Computer vision involves a range of models and techniques for analyzing and interpreting image-based data. 
It includes tasks like image classification (labeling an image), object detection (finding the location of objects in an image), image segmentation (identifying the boundaries of objects in an image), and object tracking (following objects as they move over time). Typically, your raw data is an image, which is represented as a matrix of pixel values. For example, each row of the matrix could be a grayscale value for a pixel, or it could be a 3-dimensional array of Red, Green, and Blue (RGB) values for each pixel. The modeling goal is to extract features from the image data that can be used for the task at hand. For example, you might extract features that relate to color, texture, and shape. You can then use these features to train a model to classify images or whatever your task may be. @@ -335,7 +335,7 @@ These days we generally don't have to start from scratch though, as there are pr One of the hottest areas of modeling development in recent times regards **natural language processing**, as evidenced by the runaway success of models like [ChatGPT](https://chat.openai.com/). Natural language processing (NLP) is a field of study that focuses on understanding human language, and along with computer vision, is a very visible subfield of artificial intelligence. NLP is used in a wide range of applications, including language translation, speech recognition, text classification, and more. NLP is behind some of the most exciting modeling applications today, with tools that continue to amaze with their capabilities to generate summaries of articles, answering questions, write code, and even [pass the bar exam with flying colors](https://www.abajournal.com/web/article/latest-version-of-chatgpt-aces-the-bar-exam-with-score-in-90th-percentile)! -Early efforts in this field were based on statistical models, and then variations on things like PCA, but it took a lot of [data pre-processing work](https://m-clark.github.io/text-analysis-with-R/intro.html) to get much from those approaches, and results could still be unsatisfactory. More recently, deep learning models became the standard application, and there is no looking back in that regard due to their success. Current state of the art models have been trained on massive amounts of data, [even much of the internet](https://commoncrawl.org/), and require a tremendous of computing power. Thankfully, you don't have to train such a model yourself to take advantage of the results. Now you can simply use a pretrained model like GPT-4 for your own tasks. In some cases, much of the trouble comes with just generating the best prompt to produce the desired results. However, the field and the models are evolving very rapidly, and, for those who don't have the resources of Google, Meta, or OpenAI, things are getting easier to implement all the time. In the meantime, feel free to just [play around with ChatGPT yourself](https://chat.openai.com/). +Early efforts in this field were based on statistical models, and then variations on things like PCA, but it took a lot of [data pre-processing work](https://m-clark.github.io/text-analysis-with-R/intro.html) to get much from those approaches, and results could still be unsatisfactory. More recently, deep learning models became the standard application, and there is no looking back in that regard due to their success. Current state of the art models have been trained on massive amounts of data, [even much of the internet](https://commoncrawl.org/), and require a tremendous amount of computing power. 
Thankfully, you don't have to train such a model yourself to take advantage of the results. Now you can simply use a pretrained model like GPT-4 for your own tasks. In some cases, much of the trouble comes with just generating the best prompt to produce the desired results. However, the field and the models are evolving very rapidly, and, for those who don't have the resources of Google, Meta, or OpenAI, things are getting easier to implement all the time. In the meantime, feel free to just [play around with ChatGPT yourself](https://chat.openai.com/). ## Pretrained Models & Transfer Learning {#sec-ml-more-pretrained} From 018478ba9bbe9cd666e82e7d3b3941d90b2bccfd Mon Sep 17 00:00:00 2001 From: micl Date: Wed, 27 Nov 2024 18:03:27 -0500 Subject: [PATCH 08/19] gc through dz --- causal.qmd | 6 +++--- danger_zone.qmd | 10 ++++------ data.qmd | 10 +++++----- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/causal.qmd b/causal.qmd index f854f47..8d4bcca 100644 --- a/causal.qmd +++ b/causal.qmd @@ -419,7 +419,7 @@ Analysis of variance, or **ANOVA**, allows the t-test to be extended to more tha If linear regression didn't suggest any notion of causality to you before, it shouldn't now either. The model is *identical* whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment, or there are more than two groups, or whether the treatment is continuous or categorical. -Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of an differences we see, and be more confident in a causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship on their own. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits]. +Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of any differences we see, and be more confident in a causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship on their own. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits]. 
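As a quick illustration of the earlier point that the model itself is identical with or without random assignment: a two-group comparison gives the same answer whether we run it as a t-test or as a linear regression with a binary group feature. The data below are simulated purely for the sake of the comparison.

```{python}
#| eval: false
# Simulated two-group data: the equal-variance t-test and a regression with a
# group indicator produce the same t statistic (and p-value) for the group effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)
group = rng.integers(0, 2, 200)
y = 0.5 * group + rng.normal(0, 1, 200)
df = pd.DataFrame({'y': y, 'group': group})

t_test = stats.ttest_ind(df.loc[df.group == 1, 'y'], df.loc[df.group == 0, 'y'], equal_var=True)
model  = smf.ols('y ~ group', data=df).fit()

t_test.statistic, model.tvalues['group']   # same value -- same model, different framing
```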
[^exprand]: Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce *control* in the sample settings. @@ -647,7 +647,7 @@ simulate_confounding(nreps = 500, n = 1000, true = 1) ::: -Results suggest that the coefficient for `X` is different in the two models. If we don't include the confounder, the feature's relationship with the target is biased upwardly. The nature of the bias depends on the relationship between the confounder and the treatment and target, but in this case it's pretty clear! +Results suggest that the coefficient for `X` is different in the two models. If we don't include the confounder, the feature's relationship with the target is biased upward. The nature of the bias ultimately depends on the relationship between the confounder and the treatment and target, but in this case it's pretty clear! ```{r} #| echo: false @@ -872,7 +872,7 @@ In that graph, our focal treatment, or 'exposure', is physical activity, and we One thing to note relative to the other graphical model depictions we've seen, is that the arrows directly flow to a target or set of targets, as opposed to just producing an 'output' that we then compare with the target. In graphical causal models, we're making clear the direction and focus of the causal relationships, i.e., the causal structure, as opposed to the model structure. Also, in graphical causal models, the effects for any given feature are adjusted for the other features in the model in a particular way, so that we can think about them in isolation, rather than as a collective set of features that are all influencing the target[^graphout]. -[^graphout]: If we were to model this in an overly simple fashion with linear regressions for any variable with an arrow to it, you could say physical activity and dietary habits would basically be the output of their respective models. It isn't that simple in practice though, such that we can just run separate regressions and feed in the results to the next one, thought that's how they used to do it back in the day. We have to take more care in how we adjust for all features in the model, as well as correctly account for the uncertainty if we do take a multi-stage approach. +[^graphout]: If we were to model this in an overly simple fashion with linear regressions for any variable with an arrow to it, you could say physical activity and dietary habits would basically be the output of their respective models. It isn't that simple in practice though, such that we can just run separate regressions and feed in the results to the next one, though that's how they used to do it back in the day. We have to take more care in how we adjust for all features in the model, as well as correctly account for the uncertainty if we do take a multi-stage approach. Structural equation models are widely employed in the social sciences and education, and are often used to model both observed and *latent* variables (@sec-data-latent), with either serving as features or targets[^sembias]. They are also used to model causal relationships, to the point that historically they were even called 'causal graphical models' or 'causal structural models'. SEMs are actually a special case of the graphical models just described, which are more common in non-social science disciplines. Compared to other graphical modeling techniques like DAGs, SEMs will typically have more assumptions, and these are often difficult to meet[^semass]. 
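Returning for a moment to the confounding simulation discussed above, here is a stripped-down sketch of the same idea. It is not the book's `simulate_confounding()` function, just the core of it with arbitrary simulation settings: omit the confounder and the coefficient for the feature is biased upward; adjust for it and the estimate lands near the true value.

```{python}
#| eval: false
# A minimal confounding sketch: true effect of x is 1.0, z confounds x and y.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(123)
n = 1000
z = rng.normal(size=n)                        # confounder
x = 0.8 * z + rng.normal(size=n)              # feature/treatment influenced by z
y = 1.0 * x + 0.8 * z + rng.normal(size=n)    # target; true effect of x is 1.0

X_without = sm.add_constant(np.column_stack([x]))     # model that ignores the confounder
X_with    = sm.add_constant(np.column_stack([x, z]))  # model that adjusts for it

sm.OLS(y, X_without).fit().params[1]   # biased upward
sm.OLS(y, X_with).fit().params[1]      # close to 1.0
```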
diff --git a/danger_zone.qmd b/danger_zone.qmd index 5ef70a2..afe5646 100644 --- a/danger_zone.qmd +++ b/danger_zone.qmd @@ -31,9 +31,7 @@ A related issue is **p-hacking**, which occurs when you try many different model ### Ignoring complexity {#sec-danger-complexity} -While techniques like standard linear/logistic regression and GLMs are valid and very useful, for many modeling contexts they may be too simply to capture the complexity of the data generating process. This can lead to underfitting, where the model is too simple to capture the underlying structure of the data. - -On the other side of the coin, many applications of statistical models ignore model assessment on a separate dataset, which can lead to overfitting. This makes generalization of statistical and other results more problematic. Those applications typically use a single model as well, and so may not be indicative of the best approach that could be taken. It'd be better to have a few models of varying complexity to explore. +While techniques like standard linear/logistic regression and GLMs are valid and very useful, for many modeling contexts they may be too simple to capture the complexity of the data generating process, a form of underfitting. On the other side of the coin, many applications of statistical models ignore model assessment on a separate dataset, which can lead to overfitting. This makes generalization of such results more problematic. Those applications typically use a single model as well, and so may not be indicative of the best approach that could be taken. It'd be better to have a few models of varying complexity to explore. ### Using outdated techniques {#sec-danger-datedtech} @@ -138,7 +136,7 @@ ggsave('img/danger-r2_adjr2_vis.svg', width = 8, height = 6) ``` -The following plot shows 250 simulations with a sample size of 100 and 40 completely meaningless features used in a linear regression. The R^2^ values would all suggest the model is somewhat useful, with an average of ~.4. The adjusted R^2 values average zero, which is correct, but they can only average that by being negative, which is a meaningless value. Many adjusted values still get into areas that would be viable for some domains. +The following plot shows 250 simulations with a sample size of 100 and 40 completely meaningless features used in a linear regression. The R^2^ values would all suggest the model is somewhat useful, with an average of ~.4. The adjusted R^2 values average zero, which is correct, but they can only average that by being negative, which is a meaningless value. Many of the adjusted values still get into areas that would be viable for some domains. ![The problem of R^2^](img/danger-r2_adjr2_vis.svg){#r2_adjr2_vis} @@ -254,7 +252,7 @@ A common issue in statistical and machine learning modeling is the **garden of f From traditional statistical models to deep learning, the more you know about the underlying modeling process, the more apt you are to tweak some aspect of the model to try and improve performance. When you start thinking about changing optimizer options, link/activation functions, learning rates, etc., you can easily get lost in the weeds. This would be okay if you knew ahead of time it would make a big difference. However, in many, or maybe even most cases, this sort of tweaking doesn't improve model results by much, or there are ways to not have to make the choice in the first place such as through hyperparameter tuning (@sec-ml-tuning). 
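Circling back to the R-squared simulation described a little earlier, here is a rough sketch of that setup in base R: pure-noise features, a sample size of 100, 40 features, and repeated fits. Exact averages will vary with the seed, but R-squared hovers well above zero while adjusted R-squared averages near zero.

```{r}
set.seed(1234)
sim_r2 = replicate(250, {
  X = matrix(rnorm(100 * 40), ncol = 40)  # 40 completely meaningless features
  y = rnorm(100)                          # target unrelated to the features
  fit = summary(lm(y ~ X))
  c(r2 = fit$r.squared, adj_r2 = fit$adj.r.squared)
})

rowMeans(sim_r2)  # r2 averages roughly .4, adj_r2 averages near zero
```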
More to the point, if this sort of 'by-hand' parameter tweaking does make a notable difference, that may suggest that you have a bigger problem with your model architecture or data. -For many tools, a lot of work has been done for you for some of the by folks who had a lot more time to work on these aspects of the model, and who will attempt to provide 'sensible defaults' which can work pretty well. There is still plenty we need to explore, and maybe a lot with more complex models such as boosting or deep learning. Even so, when you've appropriately tuned over the parameters that need it, you'll often find the results are not that different from what are otherwise notably different parameter settings. +For many tools, a lot of work has been done for you by folks who had a lot more time to work on these aspects of the model, and who will attempt to provide 'sensible defaults' which can work pretty well. There is still plenty we need to explore, and maybe a lot with more complex models such as boosting or deep learning. Even so, when you've appropriately tuned over the parameters that need it, you'll often find the results are not that different from what are otherwise notably different parameter settings. ### Everything is fine {#sec-danger-fine} @@ -266,7 +264,7 @@ There is a flip side to the previous point, and that is that many assume that th ### Just bootstrap it! {#sec-danger-bootstrap} -When it comes to uncertainty estimation, many common modeling tools leave that to the user, and when the developers are pressed on how to get uncertainty estimates, they often will suggest to just bootstrap the result. While the bootstrap is a powerful tool for inference, it isn't appropriate just because you decide to use it. The suggestion to use bootstrapping is often made in the context of a complex modeling situation where it would be very (prohibitively) computationally expensive, and in other cases the properties of the results are not well understood. Other methods of prediction inference, such as conformal prediction, may be better suited to the task. In general, if a package developer suggests you bootstrap because their package doesn't have any means of uncertainty estimation, you should be cautious. If it's the obvious option, it should be included in the package. +When it comes to uncertainty estimation, many common modeling tools leave that to the user, and when the developers are pressed on how to get uncertainty estimates, they will often suggest to just bootstrap the result. While the bootstrap is a powerful tool for inference, it isn't appropriate just because you decide to use it. The suggestion to use bootstrapping is often made in the context of a complex modeling situation where it would be very (prohibitively) computationally expensive, and in other cases the properties of the results are not well understood. Other methods of prediction inference, such as conformal prediction, may be better suited to the task. In general, if a package developer suggests you bootstrap because their package doesn't have any means of uncertainty estimation, you should be cautious. If it's the obvious option, it should be included in the package. While we're at it, another common suggestion is to use a quantile regression (@sec-lm-extend-quantile) approach to get prediction intervals.
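As a rough sketch of that quantile-based interval idea -- on simulated data, and assuming the quantreg package that is typically used for quantile regression in R -- one can fit models at a lower and an upper quantile and treat the paired predictions as interval bounds. This is only meant to illustrate the mechanics, not to endorse the approach for every model type.

```{r}
library(quantreg)  # assumed available

set.seed(123)
n = 500
x = rnorm(n)
y = 2 + 1.5 * x + rnorm(n, sd = 1 + 0.5 * abs(x))  # heteroscedastic noise

# an 80% 'prediction interval' from the 10th and 90th percentile fits
fit_lo = rq(y ~ x, tau = 0.1)
fit_hi = rq(y ~ x, tau = 0.9)

interval = data.frame(lower = predict(fit_lo), upper = predict(fit_hi))
mean(y >= interval$lower & y <= interval$upper)  # empirical coverage, roughly 0.8
```

Note that the in-sample coverage is close to the nominal level more or less by construction; how well this holds up out of sample is exactly the concern raised here.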
This is a valid option in some cases, but it's not clear how appropriate it is for complex models or for certain types of outcomes, and modeling tools for predicting quantiles are not typically available for a given model implementation. diff --git a/data.qmd b/data.qmd index c54ee8c..54fe8ec 100644 --- a/data.qmd +++ b/data.qmd @@ -266,7 +266,7 @@ When we encode categories for statistical analysis, we can summarize their impac #### Text embeddings {#sec-data-embeddings} -When it comes to other string representations like sentence and paragraphs, we can use other methods to represent them numerically. One important way to encode text is through an **embedding**. This is a way of representing the text as a vector of numbers, at which point the numeric embedding feature is used in the model like any other. The way to do this usually involves a model or a specific part of the model's architecture, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning, and natural language processing in particular. However, embeddings can also be used as a preprocessing step in *any* modeling situation. +When it comes to other string representations like sentences and paragraphs, we can use other methods to represent them numerically. One important way to encode text is through an **embedding**. This is a way of representing the text as a vector of numbers, at which point the numeric embedding feature is used in the model like any other. The way to do this usually involves a model or a specific part of the model's architecture, one that learns the best way to represent the text or categories numerically. This is commonly used in deep learning, and natural language processing in particular. However, embeddings can also be used as a preprocessing step in *any* modeling situation. To understand how embeddings work, consider a one-hot encoded matrix for a categorical variable. This matrix then connects to a hidden layer of a neural network. The weights learned for that layer are the embeddings for the categorical variable. While this isn't the exact method used (there are more efficient methods that don't require the actual matrix), the concept is the same. In addition, we normally don't even use whole words. Instead, we break the text into smaller units called **tokens**, like characters or subwords, and then use embeddings for those units. [Tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/5) is used in many of the most successful models for natural language processing, including those such as [ChatGPT](https://www.youtube.com/watch?v=zduSFxRajkE). @@ -450,7 +450,7 @@ The first way to deal with missing data is the simplest - **complete case analys ### Single value imputation {#sec-data-missing-single} -**Single value imputation** involves replace missing values with a single value, such as the mean, median, mode or some other typical value of the feature. As common as an approach as this is, it will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. 
collinearity), or cause you to miss out on interactions. +**Single value imputation** involves replacing missing values with a single value, such as the mean, median, mode or some other typical value of the feature. As common an approach as this is, it will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. collinearity), or cause you to miss out on interactions. Single value imputation makes perfect sense if you *know* that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don't have much missing data, it's unlikely this would have any real benefit over complete case analysis. One exception is imputing the feature allows you to use all the other complete feature samples that would otherwise be dropped. But then, you could just drop this less informative feature while keeping the others, as it will likely not be very useful in the model. @@ -532,7 +532,7 @@ These are not necessarily mutually exclusive. For example, it's probably a good Probability **calibration** is often a concern in classification problems. It is a bit more complex of an issue than just having class imbalance, but is often discussed in the same setting. Having calibrated probabilities refers to the situation where the predicted probabilities of the target match up well to the actual proportion of observed classes. For example, if a model predicts an average 0.5 probability of loan default for a certain segment of the samples, the actual proportion of defaults should be around 0.5. -One way to assess calibration is to use a **calibration curve**, which is a plot of the predicted probabilities vs. the observed proportions. We bin our predicted probabilities, say in to 5 or 10 equal bins. We then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line. +One way to assess calibration is to use a **calibration curve**, which is a plot of the predicted probabilities vs. the observed proportions. We bin our predicted probabilities, say, into 5 or 10 equal bins. We then calculate the average predicted probability and the average observed proportion of the target in each bin. If the model is well-calibrated, the points should fall along the 45-degree line. If the model is not well-calibrated, the points will fall above or below the line. In @fig-calibration-plot, one model seems to align well with the observed proportions based on the chosen bins. The other model (dashed line) is not so well calibrated, and is overshooting with its predictions. For example, that model's average prediction for the third bin predicts a ~0.5 probability of the outcome, while the actual proportion is around 0.2. @@ -613,7 +613,7 @@ The assessment of calibration in this manner also has a few issues that we haven [^testcalibbinsize]: Note that each bin will reflect the portion of the test set size in this situation. 
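For reference, here is a minimal sketch of the binning computation just described, in base R with made-up predicted probabilities and outcomes; plotting the two columns against each other with a 45-degree reference line gives the basic calibration curve.

```{r}
set.seed(123)
n = 1000
p_hat = runif(n)          # pretend these are model-predicted probabilities
y = rbinom(n, 1, p_hat)   # outcomes generated to be consistent with good calibration

bins = cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

calibration = data.frame(
  mean_predicted = tapply(p_hat, bins, mean),
  mean_observed  = tapply(y, bins, mean)
)

calibration  # well-calibrated here: the two columns track each other closely
```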
If you have a small test set, the observed proportions will be more variable, and the calibration plot will be more variable as well. -All this is to say that each point in a calibration plot, 'true' or predicted, has some uncertainty it, and the difference in those values is not formally tested in any way by a calibration curve plot. Their uncertainty, if it was actually measured, could even overlap while still being statistically different! So, if we're interested in a more rigorous statistical assessment, the differences between models and the 'best case scenario' would need additional steps to suss out. +All this is to say that each point in a calibration plot, 'true' or predicted, has some uncertainty with it, and the difference in those values is not formally tested in any way by a calibration curve plot. Their uncertainty, if it was actually measured, could even overlap while still being statistically different! So, if we're interested in a more rigorous statistical assessment, the differences between models and the 'best case scenario' would need additional steps to suss out. Some methods are available to calibrate probabilities if they are deemed miscalibrated, but they are not commonly implemented in practice, and often involve another model-based technique, with all of its own assumptions and limitations. It's also not exactly clear that forcing your probabilities to be on the line is helping solve the actual modeling goal in any way[^practicalprob]. But if you are interested, you can read more [here](https://scikit-learn.org/stable/modules/calibration.html). @@ -1092,7 +1092,7 @@ ggsave( We visited spatial data in a discussion on non-tabular data (@sec-ml-more-spatial), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city. You can use a **spatial regression** model, where the target is a function of location, among other features that may or may not be spatially oriented. Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (@sec-gam), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model (@sec-mixed-models-overview), where we include a random effect for county. -There are other traditional techniques to spatial regression, especially in the continuous spatial domain, such as using a **spatial lag**. In this case, we incorporate information about the neighborhood of a observation's location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from @walker_analyzing_2023). Techniques include CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and more. These models can be very effective, but are can be seen as a different form of random effects models very similar to those used for time-based settings. They can likewise can be seen as special cases of gaussian processes. So don't let the names fool you, you often will incorporate similar modeling techniques for both the time and spatial domains. +There are other traditional techniques to spatial regression, especially in the continuous spatial domain, such as using a **spatial lag**. 
In this case, we incorporate information about the neighborhood of an observation's location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from @walker_analyzing_2023). Techniques include CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and more. These models can be very effective, but can also be seen as a different form of random effects models very similar to those used for time-based settings. They can also be seen as special cases of gaussian process regression more generally. So don't let the names fool you, you often will incorporate similar modeling techniques for both the time and spatial domains. ## Multivariate Targets From 1a7d9f852c191e984aeba90fd9da61b7759d1fe8 Mon Sep 17 00:00:00 2001 From: micl Date: Wed, 27 Nov 2024 18:43:38 -0500 Subject: [PATCH 09/19] eg ie commas --- causal.qmd | 6 +++--- conclusion.qmd | 4 ++-- danger_zone.qmd | 12 ++++++------ data.qmd | 18 +++++++++--------- dataset_descriptions.qmd | 2 +- estimation.qmd | 6 +++--- generalized_linear_models.qmd | 4 ++-- linear_model_extensions.qmd | 10 +++++----- linear_models.qmd | 32 ++++++++++++++++---------------- machine_learning.qmd | 10 +++++----- matrix_operations.qmd | 2 +- ml_common_models.qmd | 6 +++--- ml_more.qmd | 10 +++++----- models.qmd | 2 +- more_models.qmd | 12 ++++++------ pyr.qmd | 2 +- uncertainty.qmd | 6 +++--- understanding_features.qmd | 10 +++++----- understanding_models.qmd | 2 +- 19 files changed, 78 insertions(+), 78 deletions(-) diff --git a/causal.qmd b/causal.qmd index 8d4bcca..69b06c2 100644 --- a/causal.qmd +++ b/causal.qmd @@ -423,7 +423,7 @@ Experimental design[^exprand] can give us more confidence in the causal explanat [^exprand]: Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce *control* in the sample settings. -[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g. interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many important experimental results are based on p-values with small data, and this is the part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science. +[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g., interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many important experimental results are based on p-values with small data, and this is the part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science. :::{.callout-note title='A/B Testing' collapse='true'} @@ -545,7 +545,7 @@ As we noted, random assignment or a formal experiment is not always possible or The COVID-19 pandemic provides an example of a natural experiment. The pandemic introduced sudden and widespread changes that were not influenced by individuals' prior characteristics or behaviors, such as lockdowns, remote work, and vaccination campaigns. 
The randomness in the timing and implementation of these changes allows researchers to compare outcomes before and after the policy implementation or pandemic, or between different regions with varying policies, to infer causal effects. -For instance, we could compare states or counties that had mask mandates to those that didn't at the same time or with similar characteristics. Or we might compare areas that had high vaccination rates to those nearby that didn't. But these still aren't true experiments. So we'd need to control for as many additional factors that might influence the results, e.g. population density, age, wealth and so on, and eventually we might still get a pretty good idea of the causal impact of these interventions. +For instance, we could compare states or counties that had mask mandates to those that didn't at the same time or with similar characteristics. Or we might compare areas that had high vaccination rates to those nearby that didn't. But these still aren't true experiments. So we'd need to control for as many additional factors that might influence the results, like population density, age, wealth and so on, and eventually we might still get a pretty good idea of the causal impact of these interventions. ## Causal Inference {#sec-causal-inference} @@ -903,7 +903,7 @@ Formal graphical models provide a much richer set of tools for controlling vario :::{.callout-note title='Causal Language' collapse='true'} -It's often been suggested that we keep certain phrasing (e.g. feature X has an *effect* on target Y) only for the causal model setting. But the model we use can only tell us that the data is consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we often use causal language whenever we think the relationship is or should be causal, and that's fine, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability. +It's often been suggested that we keep certain phrasing, for example, feature X has an *effect* on target Y, only for the causal model setting. But the model we use can only tell us that the data is consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we often use causal language whenever we think the relationship is or should be causal, and that's fine, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability. ::: ### Counterfactual thinking {#sec-causal-counterfactual} diff --git a/conclusion.qmd b/conclusion.qmd index 3b54e05..3c50e81 100644 --- a/conclusion.qmd +++ b/conclusion.qmd @@ -126,7 +126,7 @@ As a final consideration, there are 'multivariate' techniques like Principal Com In a purely machine learning context, you may find other models beyond those just mentioned in the statistical realm, though, as we have mentioned several times at this point, potentially any model can be used with machine learning. These models prioritize prediction, and would not usually produce standard statistical output like coefficients and uncertainty estimates by default. Examples include support vector machines, k-nearest neighbors regression, and other techniques. Most of these traditional 'machine learning models' have fallen out of favor due to their inflexibility with heterogeneous data types, and/or poor performance compared to more modern approaches. However, even then, their spirit may live on in modern approaches. 
-You'll also find models that focus on ranking, either with an outcome of ranks requiring a specific loss function (e.g. LambdaRank), or where ranking is used to simplify decision-making through post-estimation ranking of predictions (e.g., decile ranking, uplift modeling). In addition, you can find machine learning techniques extended to survival, ordinal, and other situations that are more common in the statistical realm. +You'll also find models that focus on ranking, either with an outcome of ranks requiring a specific loss function (e.g., LambdaRank), or where ranking is used to simplify decision-making through post-estimation ranking of predictions (e.g., decile ranking, uplift modeling). In addition, you can find machine learning techniques extended to survival, ordinal, and other situations that are more common in the statistical realm. Other areas of machine learning, like reinforcement learning, recommender systems, network analysis, and unsupervised learning techniques provide more that can be used in various scenarios. Plenty is left for you to explore here as well! @@ -135,7 +135,7 @@ Other areas of machine learning, like reinforcement learning, recommender system When it comes to deep learning, it seems there is a new model every day, and it's hard to keep up. In general, convolutional neural networks are still the go-to for many types of computer vision tasks, while transformers are commonly used for natural language processing, but both have been applied to the other domain with success. For tabular data you'll typically see some variant of Multilayer Perceptrons (MLPs), often with embeddings for categorical features. Some have attempted transformers and CNNs here as well, but results are mixed. -The deep learning landscape also includes models like deep graphical networks, and deep Q learning for reinforcement learning, specific models for image segmentation (e.g. SAM), recurrent neural networks variants for time-series data, and generative adversarial networks for a variety of tasks. Some specific techniques are falling out of favor as transformer-based architectures are being applied to seemingly everything. But the field is dynamic, and it remains to be seen which methods will prevail in the long run. +The deep learning landscape also includes models like deep graphical networks, and deep Q learning for reinforcement learning, specific models for image segmentation (e.g., SAM), recurrent neural networks variants for time-series data, and generative adversarial networks for a variety of tasks. Some specific techniques are falling out of favor as transformer-based architectures are being applied to seemingly everything. But the field is dynamic, and it remains to be seen which methods will prevail in the long run. diff --git a/danger_zone.qmd b/danger_zone.qmd index afe5646..9771048 100644 --- a/danger_zone.qmd +++ b/danger_zone.qmd @@ -42,12 +42,12 @@ If you wanted to go on a road trip, would you prefer a [1973 Ford Pinto](https:/ This is not specific to the statistical linear modeling realm, but there are many applications of statistical models that rely on outdated techniques, metrics, or other tools that solve problems that don't exist anymore. For example, using stepwise/best subset regression for feature selection is not really viable when more principled approaches like the lasso are available. 
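As a contrast with stepwise selection, here is a minimal lasso sketch on simulated data, assuming the glmnet package; the cross-validated penalty shrinks the coefficients of irrelevant features to exactly zero, rather than relying on a sequence of significance tests over subsets.

```{r}
library(glmnet)  # assumed available

set.seed(123)
n = 200
X = matrix(rnorm(n * 10), ncol = 10)
y = 1 + 0.5 * X[, 1] - 0.5 * X[, 2] + rnorm(n)  # only the first two features matter

cv_fit = cv.glmnet(X, y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cv_fit, s = 'lambda.1se')        # most irrelevant coefficients are exactly zero
```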
Likewise, we can't really think of a case where something like MANOVA/discriminant function analysis would provide the best answer to a data problem, or where a pseudo-R^2^ metric would help us understand a model better or make a decision about it. -Statistical analysis has been around a long time, and many of the techniques that have been developed are still valid, useful, and very powerful. But some reflect the limitations of the time in which they were developed. Others were an attempt to take something that was straightforward for simpler settings (e.g. linear regression) and apply to settings where it doesn't make sense (nonlinear, non-gaussian, etc.). Even when still valid, there may be better alternatives available now. +Statistical analysis has been around a long time, and many of the techniques that have been developed are still valid, useful, and very powerful. But some reflect the limitations of the time in which they were developed. Others were an attempt to take something that was straightforward for simpler settings (e.g., linear regression) and apply to settings where it doesn't make sense (nonlinear, non-gaussian, etc.). Even when still valid, there may be better alternatives available now. ### Simpler is not necessarily more interpretable {#sec-danger-interp} -Standard linear models are often used because of their interpretability, but in many of these modeling situations, interpretability can be difficult to obtain without using the same amount of effort one would for more complex models. Many statistical/linear models employ interactions, or nonlinear feature-target relationships (e.g. GLM/GAMs). If your goal is interpretability, these settings can be as difficult to interpret as features in a random forest. They still have the added benefit of more reliable uncertainty estimation. But you should not assume you will have a result as simple as a coefficient in a linear regression just because you didn't use a deep learning model. +Standard linear models are often used because of their interpretability, but in many of these modeling situations, interpretability can be difficult to obtain without using the same amount of effort one would for more complex models. Many statistical/linear models employ interactions, or nonlinear feature-target relationships (e.g., GLM/GAMs). If your goal is interpretability, these settings can be as difficult to interpret as features in a random forest. They still have the added benefit of more reliable uncertainty estimation. But you should not assume you will have a result as simple as a coefficient in a linear regression just because you didn't use a deep learning model. ### Model comparison {#sec-danger-compare} @@ -489,7 +489,7 @@ What's more, just because an importance metric may deem a feature as not importa As we have seen (@sec-knowing-feature-importance), the reality is that multiple valid measures of importance can come to different conclusions about the relative importance of a feature, even within the same model setting. One should be very cautious in how they interpret these. :::{.callout-note title='SHAP for Feature Importance' collapse='true'} -SHAP values are meant to assess *local*, i.e. observation level, feature contributions to a prediction. They are also used as *global* features of importance in many ML contexts, even though they are not meant to be used this way. 
Doing so can be misleading, and often average SHAP values will just reflect the distribution of the feature more than its importance, and could be notably inconsistent with other metrics even in simple settings. +SHAP values are meant to assess *local*, i.e., observation level, feature contributions to a prediction. They are also used as *global* features of importance in many ML contexts, even though they are not meant to be used this way. Doing so can be misleading, and often average SHAP values will just reflect the distribution of the feature more than its importance, and could be notably inconsistent with other metrics even in simple settings. ::: @@ -646,11 +646,11 @@ When it comes to data, plenty can go wrong before even starting with any modelin ### Transformations {#sec-danger-transform} -Many models will fail miserably without some sort of scaling or transformation of the data. A few techniques, like tree-based approaches, do not benefit, but practically all others do. At the very least, models will converge faster and possibly be more interpretable. You should not use transformations that would lose the expressivity of the data, because as we noted with binarization (@sec-danger-classification), some can do more harm than good. But you should always consider the need for transformations, and not just assume that the data is in a form that is ready for modeling. +Many models will fail miserably without some sort of scaling or transformation of the data. A few techniques, like tree-based approaches, do not benefit, but practically all others do. At the very least, models will converge faster and possibly be more interpretable. However, you should generally not use transformations that would lose the expressivity of the data, because as we noted with binarization (@sec-danger-classification), some can do more harm than good. But you should always consider the need for transformations, and not just assume that the data is in a form that is ready for modeling. ### Measurement error -**Measurement error** is a common issue in data collection, and it can lead to biased estimates and reduced power. Generally speaking, we mean the ability of data to measure what it's supposed to (or not). Measurement error can come from a variety of sources, and be difficult to assess, but it is important to try and understand how well your data reflects the constructs it is supposed to. If you can't correct for it, e.g., by finding better data, you should at least be aware of the issue and consider how they might affect your results. There is a saying about squeezing blood from a stone, or putting lipstick on a pig, or something like that, and it applies here. If your data is poor, your model won't save it. +**Measurement error** is a common issue in data collection, and it can lead to biased estimates and reduce our ability to detect meaningful feature-target relationships. Generally speaking, the reliability of a feature or target is its ability to measure what it's supposed to, while measurement error reflects its failure to do so. There is no perfectly measured variable, and measurement error can come from a variety of sources, and be difficult to assess. But it is important to try and understand how well your data reflects the constructs it is supposed to. If you can't correct for it, for example, by finding better data, you should at least be aware of the issue and consider how they might affect your results. 
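To make the bias point concrete, here is a small simulation sketch in base R with made-up values: adding measurement error to a feature attenuates its estimated coefficient toward zero, even though the underlying relationship hasn't changed.

```{r}
set.seed(123)
n = 1000
x_true = rnorm(n)
y = 1 * x_true + rnorm(n)                # true coefficient of 1
x_observed = x_true + rnorm(n, sd = 1)   # noisy measurement of x (reliability ~ .5)

coef(lm(y ~ x_true))['x_true']           # close to 1
coef(lm(y ~ x_observed))['x_observed']   # attenuated, roughly 0.5 at this noise level
```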
There is a saying about squeezing blood from a stone, or putting lipstick on a pig, or something like that, and it applies here. If your data is poor, your model won't save it. ### Simple imputation techniques {#sec-danger-impute} @@ -661,7 +661,7 @@ Imputation may be required when you have missing data, but it can be done in way One common practice in modeling is to drop or modify values considered as "outliers". However, extreme values in the target variable are often a natural part of the data. Assuming there is no actual error in recording them, often, a simple transformation can address the issue. If extremes persist after modeling, it indicates that the model is unable to capture the underlying data structure, rather than an inherent problem with the data itself. Additionally, even values that may not appear extreme can still have large residuals, so it's important not to solely focus on just the most extreme observed values. -In terms of features, extreme values can cause strange effects, but often they reflect a data problem (e.g. incorrect values), or can be resolved using the transformations you should already be considering (e.g. taking the log). In other cases, they don't really cause any modeling problems at all. And again, some techniques are fairly robust to feature extremes, like tree-based methods. +In terms of features, extreme values can cause strange effects, but often they reflect a data problem (e.g., incorrect values), or can be resolved using the transformations you should already be considering (e.g., taking the log). In other cases, they don't really cause any modeling problems at all. And again, some techniques are fairly robust to feature extremes, like tree-based methods. ### Big data isn't always as big as you think {#sec-danger-bigdata} diff --git a/data.qmd b/data.qmd index 54fe8ec..138e8ef 100644 --- a/data.qmd +++ b/data.qmd @@ -16,7 +16,7 @@ It's an inescapable fact that models need data to work. One of the dirty secrets - Data transformations can provide many modeling benefits. - Label and text-based data still needs a numeric representation, and this can be accomplished in a variety of ways. - The data type for the target may suggest a particular model, but does not necessitate one. -- The data *structure*, e.g. temporal, spatial, censored, etc., may suggest a particular modeling domain to use. +- The data *structure*, for example, temporal, spatial, censored, etc., may suggest a particular modeling domain to use. - Missing data can be handled in a variety of ways, and the simpler approaches are typically not great. - Class imbalance is a very common issue in classification problems, and there are a number of ways to deal with it. - Latent variables are everywhere! @@ -39,7 +39,7 @@ Transforming variables from one form to another provides several benefits in mod - Easier convergence - Helping with heteroscedasticity -For example, just **centering** features, i.e. subtracting their respective means, provides a more interpretable intercept that will fall within the actual range of the target variable in a standard linear regression. After centering, the intercept tells us what the value of the target variable is when the features are at their means (or reference value if categorical). Centering also puts the intercept within the expected range of the target, which often makes for easier parameter estimation. 
So even if easier interpretation isn't a major concern, variable transformations can help with convergence and speed up estimation, so can always be of benefit. +For example, just **centering** features, i.e., subtracting their respective means, provides a more interpretable intercept that will fall within the actual range of the target variable in a standard linear regression. After centering, the intercept tells us what the value of the target variable is when the features are at their means (or reference value if categorical). Centering also puts the intercept within the expected range of the target, which often makes for easier parameter estimation. So even if easier interpretation isn't a major concern, variable transformations can help with convergence and speed up estimation, so can always be of benefit. ### Numeric variables {#sec-data-numeric} @@ -103,7 +103,7 @@ tab |> select(-Benefits) |> gt() ::: -For example, it is very common to use **standardized** or **scaled** variables. Some also call this **normalizing**, as with **batch or layer normalization** in deep learning, but [this term can mean a lot of things](https://en.wikipedia.org/wiki/Normalization_(statistics)), so one should be clear in their communication. If $y$ and $x$ are both standardized, a one unit (i.e. one standard deviation) change in $x$ leads to a $\beta$ standard deviation change in $y$. So, if $\beta$ was .5, a standard deviation change in $x$ leads to a half standard deviation change in $y$. In general, there is nothing to lose by standardizing, so you should employ it often. +For example, it is very common to use **standardized** or **scaled** variables. Some also call this **normalizing**, as with **batch or layer normalization** in deep learning, but [this term can mean a lot of things](https://en.wikipedia.org/wiki/Normalization_(statistics)), so one should be clear in their communication. If $y$ and $x$ are both standardized, a one unit (i.e., one standard deviation) change in $x$ leads to a $\beta$ standard deviation change in $y$. So, if $\beta$ was .5, a standard deviation change in $x$ leads to a half standard deviation change in $y$. In general, there is nothing to lose by standardizing, so you should employ it often. Another common transformation, particularly in machine learning, is **min-max scaling**. This involves changing variables to range from a chosen minimum value to a chosen maximum value, and usually this means zero and one respectively. This transformation can make numeric and categorical indicators more comparable, or at least put them on the same scale for estimation purposes, and so can help with convergence and speed up estimation. The following demonstrates how we can employ such approaches. @@ -317,7 +317,7 @@ With Bayesian tools, it's common to use the **[categorical distribution](https:/ In the machine learning context, we can use a variety of models we'd use for binary classification. How the model is actually implemented will depend on the tool, but one of the more popular methods is to use **one-vs-all** or **one-vs-one** strategies, where you treat each class as the target in a binary classification problem. In the first case of one vs. all, you would have a model for each class that predicts whether an observation is in that class versus the other classes. In the second case, you would have a model for each pair of classes. You should generally be careful with either approach if interpretation is important, as it can make the feature effects very difficult to understand. 
As an example, we can't expect feature X to have the same effect on the target in a model for class A vs B, as it does in a model for class A vs. (B & C) or A & C. As such, it can be misleading when the models are conducted as if the categories are independent. -Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical model setting (e.g. a multinomial regression), you now have coefficients that regard *relative* effects for one class versus a reference group, and so they cannot tell you a *general* effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (@sec-knowing-feature). +Regardless of the context, interpretation is now spread across multiple target outputs, and so it can be difficult to understand the overall effect of a feature on the target. Even in the statistical model setting (e.g., a multinomial regression), you now have coefficients that regard *relative* effects for one class versus a reference group, and so they cannot tell you a *general* effect of a feature on the target. This is where tools like marginal effects and SHAP can be useful (@sec-knowing-feature). #### Multilabel targets {#sec-data-multilabel} @@ -450,7 +450,7 @@ The first way to deal with missing data is the simplest - **complete case analys ### Single value imputation {#sec-data-missing-single} -**Single value imputation** involves replacing missing values with a single value, such as the mean, median, mode or some other typical value of the feature. As common an approach as this is, it will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g. collinearity), or cause you to miss out on interactions. +**Single value imputation** involves replacing missing values with a single value, such as the mean, median, mode or some other typical value of the feature. As common an approach as this is, it will rarely help your model for a variety of reasons. Consider a numeric feature that is 50% missing, and for which you replace the missing with the mean. How good do you think that feature will be when at least half the values are identical? Whatever variance it normally would have and share with the target is probably reduced, and possibly dramatically. Furthermore, you've also attenuated correlations it has with the other features, which may mute other modeling issues that you would otherwise deal with in some way (e.g., collinearity), or cause you to miss out on interactions. Single value imputation makes perfect sense if you *know* that the missingness should be a specific value, like a count feature where missing means a count of zero. If you don't have much missing data, it's unlikely this would have any real benefit over complete case analysis. One exception is imputing the feature allows you to use all the other complete feature samples that would otherwise be dropped. 
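A quick illustration of the attenuation point above, using simulated data and base R only: mean-imputing a feature that is 50% missing (completely at random here) noticeably shrinks its correlation with the target.

```{r}
set.seed(123)
n = 1000
x = rnorm(n)
y = 0.5 * x + rnorm(n)

x_missing = x
x_missing[sample(n, n / 2)] = NA   # half the values go missing at random
x_imputed = ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

cor(x, y)          # original correlation
cor(x_imputed, y)  # attenuated after mean imputation
```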
But then, you could just drop this less informative feature while keeping the others, as it will likely not be very useful in the model. @@ -785,7 +785,7 @@ ggsave( ![Truncation](img/data-truncation.svg){width=100% #fig-truncation} ::: -You could truncate predictions after the fact, but this is a bit of a hack, and often results in lumpiness in the predictions at the boundary that is rarely realistic. Alternatively, Bayesian methods allow you to model the target as a distribution with truncated distributions, and so you can model the probability of the target being above or below some value. There are also models such as **hurdle models** that might prove useful where the truncation is theoretically motivated, e.g. a zero-inflated Poisson model for count data where the zero counts are due to a separate process than the non-zero counts. +You could truncate predictions after the fact, but this is a bit of a hack, and often results in lumpiness in the predictions at the boundary that is rarely realistic. Alternatively, Bayesian methods allow you to model the target as a distribution with truncated distributions, and so you can model the probability of the target being above or below some value. There are also models such as **hurdle models** that might prove useful where the truncation is theoretically motivated, for example, a zero-inflated Poisson model for count data where the zero counts are due to a separate process than the non-zero counts. ```{r} #| echo: false @@ -946,7 +946,7 @@ In marketing contexts, some perform **adstocking** with features. This approach If you have the year as a feature, you can use it as a numeric feature or as a categorical feature. If you treat it as numeric, you need to consider what a zero means. In a linear model, the intercept usually represents the outcome when all features are zero. But with a feature like year, a zero year isn't meaningful in most contexts. To solve this, you can shift the values so that the earliest time point, like the first year in your data, becomes zero. This way, the intercept in your model will represent the outcome for this first time point, which is more meaningful. The same goes if you are using months or days as a numeric feature. It doesn't really matter which year/month/day is zero, just that zero refers to one of the actual time points observed. Shifting your time feature in this manner can also help with convergence for some models. In addition, you may want to convert the feature to represent decades, or quarters, or some other time period, to help with interpretation. -Dates and/or times can be a bit trickier. Often you can just split dates out into year, month, day, etc., and proceed with those as features. In other cases you'd want to track the time period to assess possible seasonal effects. You can use something like a **cyclic** approach (e.g. [cyclic spline or sine/cosine transformation](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle. +Dates and/or times can be a bit trickier. 
Often you can just split dates out into year, month, day, etc., and proceed with those as features. In other cases you'd want to track the time period to assess possible seasonal effects. You can use something like a **cyclic** approach (e.g., [cyclic spline or sine/cosine transformation](https://developer.nvidia.com/blog/three-approaches-to-encoding-time-information-as-features-for-ml-models/)) to get at yearly or within-day seasonal effects. As mentioned, a fourier transform can also be used to decompose the time series into its component frequencies for use as model features. Time components like hours, minutes, and seconds can often be dealt with in similar ways, but you will more often deal with the periodicity in the data. For example, if you are looking at hourly data, you may want to consider the 24-hour cycle. :::{.callout-note title="Calendars are hard" collapse='true'} *Weeks are not universal*. Some start on Sunday, others Monday. Some data contexts only consider weekdays. [Some systems](https://en.wikipedia.org/wiki/ISO_week_date) may have 52 or 53 weeks in a year, and dates may not be in the same week from one year to the next, etc. So use caution when considering weeks as a feature. @@ -1092,7 +1092,7 @@ ggsave( We visited spatial data in a discussion on non-tabular data (@sec-ml-more-spatial), but here we want to talk about it from a modeling perspective, especially within the tabular domain. Say you have a target that is a function of location, such as the proportion of people voting a certain way in a county, or the number of crimes in a city. You can use a **spatial regression** model, where the target is a function of location, among other features that may or may not be spatially oriented. Two approaches already discussed may be applied in the case of having continuous spatial features, such as latitude and longitude, or discrete features like county. For the continuous case, we could use a GAM (@sec-gam), where we use a smooth interaction of latitude and longitude. For the discrete setting, we can use a mixed model (@sec-mixed-models-overview), where we include a random effect for county. -There are other traditional techniques to spatial regression, especially in the continuous spatial domain, such as using a **spatial lag**. In this case, we incorporate information about the neighborhood of an observation's location into the model (e.g. a weighted mean of neighboring values, as in the visualization above based on code from @walker_analyzing_2023). Techniques include CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and more. These models can be very effective, but can also be seen as a different form of random effects models very similar to those used for time-based settings. They can also be seen as special cases of gaussian process regression more generally. So don't let the names fool you, you often will incorporate similar modeling techniques for both the time and spatial domains. +There are other traditional techniques to spatial regression, especially in the continuous spatial domain, such as using a **spatial lag**. In this case, we incorporate information about the neighborhood of an observation's location into the model (e.g., a weighted mean of neighboring values, as in the visualization above based on code from @walker_analyzing_2023). Techniques include CAR (conditional autoregressive), SAR (spatial autoregressive), BYM, kriging, and more. 
These models can be very effective, but can also be seen as a different form of random effects models very similar to those used for time-based settings. They can also be seen as special cases of gaussian process regression more generally. So don't let the names fool you, you often will incorporate similar modeling techniques for both the time and spatial domains. ## Multivariate Targets @@ -1110,7 +1110,7 @@ In deep learning contexts, the multivariate setting is ubiquitous. For example, diff --git a/dataset_descriptions.qmd b/dataset_descriptions.qmd index 9143562..147124e 100644 --- a/dataset_descriptions.qmd +++ b/dataset_descriptions.qmd @@ -17,7 +17,7 @@ The movie reviews dataset was a fun way to use an LLM to create movie titles and - `genre`: The genre of the movie - `release_year`: The year the movie was released - `length_minutes`: The length of the movie in minutes -- `season`: The season the movie was released (e.g. Fall, Winter) +- `season`: The season the movie was released (e.g., Fall, Winter) - `total_reviews`: The total number of reviews for the movie - `rating`: The rating of the movie - `review_text`: The text of the review diff --git a/estimation.qmd b/estimation.qmd index 1cbfcdd..99161fa 100644 --- a/estimation.qmd +++ b/estimation.qmd @@ -214,7 +214,7 @@ $$ This prediction error tells us how far off our model prediction is from the observed target values, but it also gives us a way to compare models. How? With our measure of prediction error, we can calculate a total error for all observations/predictions (@sec-knowing-model-metrics), or similarly, the average error. If one model or parameter set has less total or average error, we can say it's a better model than one that has more (@sec-knowing-model-compare). Ideally we'd like to choose a model with the least possible error, but we'll see that this is not always possible[^neverbest]. -However, if we just take the average of our errors from a linear regression model, you'll see that it is roughly zero! This is by design for many common models, and is even made explicit in their mathematical depiction. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same[^absloss]. As we've done elsewhere, we'll use squared error here, and we'll calculate the mean of the squared errors for all our predictions, i.e. the **mean squared error (MSE)**. +However, if we just take the average of our errors from a linear regression model, you'll see that it is roughly zero! This is by design for many common models, and is even made explicit in their mathematical depiction. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same[^absloss]. As we've done elsewhere, we'll use squared error here, and we'll calculate the mean of the squared errors for all our predictions, i.e., the **mean squared error (MSE)**. [^absloss]: We don't have to do it this way, but it's the default in most scenarios. As an example, maybe for your situation overshooting is worse than undershooting, and so you might want to use an approach that would weight those errors more heavily. 
@@ -307,7 +307,7 @@ Now all of this is useful, and at least we can say one model is better than anot ## Ordinary Least Squares {#sec-estim-ols} -In a simple linear model, we often use the **Ordinary Least Squares (OLS)** method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values[^notamodel]. In other words, it finds the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, which is what we just did in our previous example. The sum of the squared errors is also called the **residual sum of squares** (RSS), as opposed to the 'total' sums of squares (i.e. the variance of the target), and the part explained by the model ('model' or 'explained' sums of squares). We can express this as follows, where $y_i$ is the observed value of the target for observation $i$, and $\hat{y_i}$ is the predicted value from the model. +In a simple linear model, we often use the **Ordinary Least Squares (OLS)** method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values[^notamodel]. In other words, it finds the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, which is what we just did in our previous example. The sum of the squared errors is also called the **residual sum of squares** (RSS), as opposed to the 'total' sums of squares (i.e., the variance of the target), and the part explained by the model ('model' or 'explained' sums of squares). We can express this as follows, where $y_i$ is the observed value of the target for observation $i$, and $\hat{y_i}$ is the predicted value from the model. [^notamodel]: Some disciplines seem to confuse models with estimation methods and link functions. It doesn't really make sense, nor is it informative, to call something an OLS model or a logit model. Many models are estimated using a least squares objective function, even deep learning, and different types of models use a logit link, from logistic regression, to beta regression, to activation functions used in deep learning. @@ -595,7 +595,7 @@ So, let's try it out! We start out with several inputs: - the objective function - the initial guess for the parameters to get things going - other related inputs to the objective function, such as the data -- options for the optimization process, e.g. algorithm, maximum number of iterations, etc. +- options for the optimization process, e.g., algorithm, maximum number of iterations, etc. With these inputs, we'll let the chosen optimization function do the rest of the work. We'll again compare our results to the standard functions to make sure we're on the right track. diff --git a/generalized_linear_models.qmd b/generalized_linear_models.qmd index 1ed20df..52ca22d 100644 --- a/generalized_linear_models.qmd +++ b/generalized_linear_models.qmd @@ -61,7 +61,7 @@ Generalized linear models are a broad class of models that extend the linear mod ## Distributions & Link Functions {#sec-glm-distributions} -Remember how linear regression models really enjoy the whole Gaussian, i.e. 'normal', distribution scene? We saw that the essential form of the linear model can be expressed as follows. 
With probabilistic models such as these, the formula is generally expressed as $y | X, \theta \sim ...$, where X is the matrix of features (data) and $\theta$ the parameters estimated by the model. We simplify this as $y^*$ here. +Remember how linear regression models really enjoy the whole Gaussian, i.e., 'normal', distribution scene? We saw that the essential form of the linear model can be expressed as follows. With probabilistic models such as these, the formula is generally expressed as $y | X, \theta \sim ...$, where X is the matrix of features (data) and $\theta$ the parameters estimated by the model. We simplify this as $y^*$ here. $$ y| X, \beta, \sigma \sim \textrm{N}(\mu, \sigma^2) @@ -96,7 +96,7 @@ If you know a distribution's 'canonical' link function, which is like the defaul :::{.callout-note title='Conditional reminder' collapse='true'} -One thing to note, when we switch distributions for GLMs, we're still concerning ourselves with the *conditional* distribution of the target variable given the features. The distribution of the target variable itself is not changing per se, even though its nature, e.g. as a binary variable, is suggesting to us to try something that would allow us to produce a binary outcome from the model. But just like we don't assume the target itself is normal in a linear regression, here we are assuming that the conditional distribution of the target given the features is the distribution we are specifying. +One thing to note, when we switch distributions for GLMs, we're still concerning ourselves with the *conditional* distribution of the target variable given the features. The distribution of the target variable itself is not changing per se, even though its nature, e.g., as a binary variable, is suggesting to us to try something that would allow us to produce a binary outcome from the model. But just like we don't assume the target itself is normal in a linear regression, here we are assuming that the conditional distribution of the target given the features is the distribution we are specifying. ::: diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index e5e0685..ecb3c39 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -494,9 +494,9 @@ What we've just seen might initially bring to mind an interaction effect, and th Before going too much further, the term *mixed model* is as vanilla as we can possibly make it, but you might have heard of different names such as *hierarchical linear models*, or *multilevel models*, or maybe *mixed-effects models*. Maybe you've even been exposed to ideas like *random effects* or *random slopes*. These are in fact all instances of what we're calling a *mixed model*. -What makes a model a *mixed* model? The mixed model is characterized by the idea that a model can have **fixed effects** and **random effects**. Fortunately, you've already encountered *fixed* effects -- those are the features that we have been using in all of our models so far! We are assuming a single true parameter (e.g. coefficient/weight) to estimate for each of those features, and that parameter is *fixed*. +What makes a model a *mixed* model? The mixed model is characterized by the idea that a model can have **fixed effects** and **random effects**. Fortunately, you've already encountered *fixed* effects -- those are the features that we have been using in all of our models so far! 
We are assuming a single true parameter (e.g., coefficient/weight) to estimate for each of those features, and that parameter is *fixed*. -In mixed models, a *random effect* instead comes from a specific distribution, and this is almost always a normal distribution. This random effect adds a unique source of variance in the target variable. This distribution of effects can be based on a grouping variable (such as genre), where we let those parameters, i.e. coefficients (or weights), vary across the groups, creating a distribution of values. +In mixed models, a *random effect* instead comes from a specific distribution, and this is almost always a normal distribution. This random effect adds a unique source of variance in the target variable. This distribution of effects can be based on a grouping variable (such as genre), where we let those parameters, i.e., coefficients (or weights), vary across the groups, creating a distribution of values. Let's take our initial example with movie length and genre. Formally, we might specify something like this: @@ -803,7 +803,7 @@ ranef_usa + model_ran_slope.fe_params :::{.callout-note title='Averages in Mixed Models' collapse='true'} -In the linear mixed effect model setting with a random intercept, the fixed effects can be seen as the (population) average effects, but this is not exactly what you are getting from the mixed model. To make the distinction clear, consider family groups and a gender effect for males vs. females. The linear regression and some other types of models (e.g. estimated via generalized estimating equations) would give you the average effect male-female difference across all families. The mixed model actually tells you the male-female difference as if they were in the same family (e.g. siblings). Again, in the simplest mixed model setting these are the same. Beyond that that, when we start dealing with random slopes and non-gaussian distributions, they are not. +In the linear mixed effect model setting with a random intercept, the fixed effects can be seen as the (population) average effects, but this is not exactly what you are getting from the mixed model. To make the distinction clear, consider family groups and a gender effect for males vs. females. The linear regression and some other types of models (e.g., estimated via generalized estimating equations) would give you the average male-female difference across all families. The mixed model actually tells you the male-female difference as if they were in the same family (e.g., siblings). Again, in the simplest mixed model setting these are the same. Beyond that, when we start dealing with random slopes and non-gaussian distributions, they are not. In general, if we set the random effect to 0 to get a prediction, that tells us what the prediction would be for a typical group, in this case, a typical country. Often we want to get something more like the average slope or prediction across countries that we would have with linear regression. This gets us back to the idea of the **marginal effect** we discussed earlier. While the mechanics are not straightforward for mixed models, the tool use generally takes no additional effort. ::: @@ -872,7 +872,7 @@ For a model with just one feature, we certainly had a lot to talk about! And thi - The random effect is akin to a latent variable of 'unspecified group causes'.
This is a very powerful idea that can be used in many different ways, but importantly, you might want to start thinking about how you can figure out what those 'unspecified' causes may be! - Group effects will almost always improve your model's predictive performance relative to not having them, especially if you weren't including those groups in your model because of how many groups there were. -[^indieobs]: Independence of observations is a key assumption in linear regression models, and when it's violated, the standard errors of the coefficients are biased, which can lead to incorrect inferences. Rather than hacking a model (so-called 'fixed effects' models) or 'correcting' the standard error (e.g. with some 'sandwich' or estimator), mixed models can account for this lack of independence through the model itself. +[^indieobs]: Independence of observations is a key assumption in linear regression models, and when it's violated, the standard errors of the coefficients are biased, which can lead to incorrect inferences. Rather than hacking a model (so-called 'fixed effects' models) or 'correcting' the standard error (e.g., with some 'sandwich' or similar estimator), mixed models can account for this lack of independence through the model itself. In short, mixed models are a fun way to incorporate additional interpretive color to your model, while also getting several additional benefits to help you understand your data! @@ -1369,7 +1369,7 @@ c(mean(ll - main[,'lwr']), mean(ul - main[,'upr'])) plot(ll, main[,'lwr']) plot(ul, main[,'upr']) -# does it statistically work for other models? e.g. the lgbm for the mean might not have the same feature space for the q 975 https://developer.ibm.com/articles/prediction-intervals-explained-a-lightgbm-tutorial/ +# does it statistically work for other models? e.g., the lgbm for the mean might not have the same feature space for the q 975 https://developer.ibm.com/articles/prediction-intervals-explained-a-lightgbm-tutorial/ # https://jmarkhou.com/lgbqr/ ``` diff --git a/linear_models.qmd index e60a372..5f93ca6 100644 --- a/linear_models.qmd +++ b/linear_models.qmd @@ -187,16 +187,16 @@ $$ x_1 + x_2 + ... + x_n $$ -In this equation, $x$ is the feature and n is the number identifier for the features, so $x_1$ is the first feature (e.g. word count), $x_2$ the second (e.g. movie release year), and so on. $x$ is an arbitrary designation - you could use any letter, symbol you want, or even better, would be the actual feature name. Now look at the linear model. +In this equation, $x$ is the feature and n is the number identifier for the features, so $x_1$ is the first feature (e.g., word count), $x_2$ the second (e.g., movie release year), and so on. $x$ is an arbitrary designation - you could use any letter or symbol you want, or, even better, the actual feature name. Now look at the linear model. $$ y = x_1 + x_2 + ... + x_n $$ -In this case, the function is *just a sum*, something so simple we do it all the time. In the linear model sense though, we're actually saying a bit more. Another way to understand that equation is that *y is a function of x*. We don't show any coefficients here, i.e. the *b*s in our initial equation (@eq-lm-basic), but technically it's as if each coefficient was equal to a value of 1. In other words, for this simple linear *model*, we're saying that each feature contributes in an identical fashion to the target. +In this case, the function is *just a sum*, something so simple we do it all the time.
In the linear model sense though, we're actually saying a bit more. Another way to understand that equation is that *y is a function of x*. We don't show any coefficients here, i.e., the *b*s in our initial equation (@eq-lm-basic), but technically it's as if each coefficient was equal to a value of 1. In other words, for this simple linear *model*, we're saying that each feature contributes in an identical fashion to the target. -In practice, features will never contribute in the same ways, because they correlate with the target differently, or are on different scales. So if we want to relate some features, $x_1$ and $x_2$, to target $y$, we probably would not assume that they both contribute in the same way. For instance, we might assign more weight to $x_1$ than $x_2$, for whatever reason. In the linear model, this is expressed by multiplying each feature by a different coefficient or weight. So the linear model's primary component is really just a sum of the features multiplied by their coefficients, i.e. a *weighted sum*. Each feature's contribution to explaining or accounting for the target is proportional to its coefficient. So if we have a feature $x_1$ and a coefficient $b_1$, then the contribution of $x_1$ to the target is $b_1\cdot x_1$. If we have a feature $x_2$ and a coefficient $b_2$, then the contribution of $x_2$ to the target is $b_2 \cdot x_2$. And so on. So the linear model is really just a sum of the features multiplied by their respective weights. +In practice, features will never contribute in the same ways, because they correlate with the target differently, or are on different scales. So if we want to relate some features, $x_1$ and $x_2$, to target $y$, we probably would not assume that they both contribute in the same way. For instance, we might assign more weight to $x_1$ than $x_2$, for whatever reason. In the linear model, this is expressed by multiplying each feature by a different coefficient or weight. So the linear model's primary component is really just a sum of the features multiplied by their coefficients, i.e., a *weighted sum*. Each feature's contribution to explaining or accounting for the target is proportional to its coefficient. So if we have a feature $x_1$ and a coefficient $b_1$, then the contribution of $x_1$ to the target is $b_1\cdot x_1$. If we have a feature $x_2$ and a coefficient $b_2$, then the contribution of $x_2$ to the target is $b_2 \cdot x_2$. And so on. So the linear model is really just a sum of the features multiplied by their respective weights. For our specific model, here is the mathematical representation: @@ -351,7 +351,7 @@ $$ $$ {#eq-lm-prediction} -What is $\hat{y}$? The 'hat' over the $y$ just means that it's a predicted, or 'expected', value of the model, i.e. the output. This distinguishes it from the target value we actually observe in the data. Our first equations that just used $y$ implicitly suggested that we would get a perfect rating value given the model, but that's not the case. We can only get an estimate. The $\hat{y}$ is also the linear predictor in our graphical version (@fig-graph-lm), which makes clear it is not the actual target, but a combination of the features that is related to the target. +What is $\hat{y}$? The 'hat' over the $y$ just means that it's a predicted, or 'expected', value of the model, i.e., the output. This distinguishes it from the target value we actually observe in the data. 
Our first equations that just used $y$ implicitly suggested that we would get a perfect rating value given the model, but that's not the case. We can only get an estimate. The $\hat{y}$ is also the linear predictor in our graphical version (@fig-graph-lm), which makes clear it is not the actual target, but a combination of the features that is related to the target. To make our first equation (@eq-lm-basic) accurately reflect the relationship between the target and our features, we need to add what is usually referred to as an **error term**, $\epsilon$, to account for the fact that our predictions will not be perfect[^perfect_prediction]. So the full linear (regression) model is: @@ -393,7 +393,7 @@ You'll often see predictions referred to as **fitted values**, but these imply w ### What kinds of predictions can we get? {#sec-lm-prediction-types} -What predictions we can get depends on the type of model we are using. For the linear model we have at present, we can get predictions for the target, which is a **continuous variable**. Very commonly, we also can get predictions for a **categorical target**, such as whether the rating is 'good' or 'bad'. This simple breakdown pretty much covers everything, as we typically would be predicting a continuous numeric variable or a categorical variable, or more of them, like multiple continuous variables, or a target with multiple categories, or sequences of categories (e.g. words). +What predictions we can get depends on the type of model we are using. For the linear model we have at present, we can get predictions for the target, which is a **continuous variable**. Very commonly, we also can get predictions for a **categorical target**, such as whether the rating is 'good' or 'bad'. This simple breakdown pretty much covers everything, as we typically would be predicting a continuous numeric variable or a categorical variable, or more of them, like multiple continuous variables, or a target with multiple categories, or sequences of categories (e.g., words). In our case, we can get predictions for the rating, which is a number between 1 and 5. Had our target been a binary good vs. bad rating, our predictions would still be numeric for most models, and usually expressed as a probability between 0 and 1, say, for the 'good' category, or in an initial form that is then transformed to a probability. For example, in the context of predicting a good rating, higher probabilities would mean we'd more likely predict the movie is good, and lower probabilities would mean we'd more likely predict the movie is bad. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We'll talk about how to get predictions for categorical targets later[^treepreds]. @@ -973,7 +973,7 @@ In those settings, statistical significance is often used as a proxy for importa If we are very interested in the coefficient or weight value specifically, it is better to focus on the range of possible values. This is provided by the confidence interval, along with the predictions that come about based on that coefficient's value, which will likewise have interval estimates. Like statistical significance, a confidence interval is also a 'loaded' description of a feature's relationship to the target, not without issues. However, we can use it in a very practical way as a range of possible values for that feature's weight, and more importantly, *think of possibilities rather than certainties*. 
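For the standard linear model, these interval estimates are readily available; a minimal sketch, assuming a fitted `lm` object named `fit` and a data frame of new observations `df_new` (hypothetical names, not objects defined in the text):

```{r}
#| eval: false
# interval estimates for coefficients and predictions; `fit` and `df_new` are hypothetical names
confint(fit, level = .95)                                 # range of plausible coefficient values

predict(fit, newdata = df_new, interval = 'confidence')   # uncertainty in the expected value
predict(fit, newdata = df_new, interval = 'prediction')   # wider; includes observation-level noise
```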
-Suffice it to say at this point, that how much one focuses on prediction versus explanation depends on the context and goals of the data endeavor. There are cases where predictive capability is of utmost importance, and we care less about explanatory details, but not to the point of ignoring it. For example, even with deep learning models for image classification, where the inputs are just RGB values from an image, we'd still like to know what the (notably complex) model is picking up on, otherwise we may be classifying images based on something like image backgrounds (e.g. outdoors vs. indoors) instead of the objects of actual interest (dogs vs. cats). In some business or other organizational settings, we are very, or even mostly, interested in the coefficients/weights, which might indicate how to allocate monetary resources in some fashion. But if those weights come from a model with no predictive power, placing much importance on them may be a fruitless endeavor. +Suffice it to say at this point, that how much one focuses on prediction versus explanation depends on the context and goals of the data endeavor. There are cases where predictive capability is of utmost importance, and we care less about explanatory details, but not to the point of ignoring it. For example, even with deep learning models for image classification, where the inputs are just RGB values from an image, we'd still like to know what the (notably complex) model is picking up on, otherwise we may be classifying images based on something like image backgrounds (e.g., outdoors vs. indoors) instead of the objects of actual interest (dogs vs. cats). In some business or other organizational settings, we are very, or even mostly, interested in the coefficients/weights, which might indicate how to allocate monetary resources in some fashion. But if those weights come from a model with no predictive power, placing much importance on them may be a fruitless endeavor. In the end we'll need to balance our efforts to suit the task at hand. Prediction and explanation are both fundamental to the modeling endeavor. We return to this topic again in the chapter on causal models (@sec-causal-prediction-explanation). @@ -1437,11 +1437,11 @@ The standard linear regression model we've come to know is no different, and it - That your model is not grossly misspecified (e.g., you've included the right features and not left out important ones) - The data that you're modeling reflects the population you want to make generalizations about -- The model is linear in the parameters (i.e. no $e^\beta$ or $\beta_1 \cdot beta_2 \cdot X$ type stuff) +- The model is linear in the parameters (i.e., no $e^\beta$ or $\beta_1 \cdot beta_2 \cdot X$ type stuff) - The features are not correlated with the error (prediction errors, unobserved causes) - Your data observations are independent of each other -- The prediction errors are homoscedastic (e.g. some predictions aren't associated with very large errors relative to others) -- Normality of the errors (i.e. your prediction errors). Another way to put it is that your target variable is normally distributed *conditional* on the features. +- The prediction errors are homoscedastic (e.g., some predictions aren't associated with very large errors relative to others) +- Normality of the errors (i.e., your prediction errors). Another way to put it is that your target variable is normally distributed *conditional* on the features. 
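Several of these can be eyeballed quickly with the default residual diagnostics; a minimal sketch, assuming a fitted `lm` object named `fit` (a hypothetical name):

```{r}
#| eval: false
# quick visual checks for a fitted lm object `fit` (hypothetical name)
par(mfrow = c(2, 2))
plot(fit)   # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
```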
Things a linear regression model does not assume: @@ -1472,7 +1472,7 @@ And finally: So basically, whether or not you meet the assumptions of your model doesn't actually say much about whether the model is great or terrible. For the linear regression model, if you do meet those assumptions, your coefficient estimates are unbiased[^unbiased], and in general, your statistical inferences are valid ones. If you don't meet the assumptions, there are alternative versions of the linear model you could use that would potentially address the issues. -For example, data that runs over a sequence of time (**time series** data) violates the independence assumption, since observations closer in time are more likely to be similar than those farther apart. Violation of this assumption will result in problems with the standard errors of the coefficients, and thus the p-values and confidence intervals. But we could use a **time series** or similar model instead to account for this. If normality is difficult to meet, you could assume a different data generating distribution. We'll discuss some of these approaches explicitly in later chapters (e.g. @sec-glm), but it's also important to note that not meeting the assumptions for the model may only mean you'll prefer a different type of linear or other model to use in order to meet them. +For example, data that runs over a sequence of time (**time series** data) violates the independence assumption, since observations closer in time are more likely to be similar than those farther apart. Violation of this assumption will result in problems with the standard errors of the coefficients, and thus the p-values and confidence intervals. But we could use a **time series** or similar model instead to account for this. If normality is difficult to meet, you could assume a different data generating distribution. We'll discuss some of these approaches explicitly in later chapters (e.g., @sec-glm), but it's also important to note that not meeting the assumptions for the model may only mean you'll prefer a different type of linear or other model to use in order to meet them. [^unbiased]: This means they are correct *on average*, not that they are the *true* value. And if they were biased, this refers to **statistical bias**, and has nothing to do with the moral or ethical implications of the data, or whether the features themselves are biased in measurement. Culturally biased data is a different problem than statistical/prediction bias or measurement error, though they are not mutually exclusive. Statistical bias can more readily be tested, while other types of bias are more difficult to assess. Even statistical unbiasedness is not necessarily a goal, as we will see later @sec-estim-penalty. @@ -1483,9 +1483,9 @@ Let's say you're running some XGBoost or a Deep Linear Model and getting outstan - You have enough data to make the model generalizable - Your data isn't biased (e.g., you don't have 90% of your data from one particular region when you want to talk about a much wider area) -- You adequately sampled the hyperparameter space (e.g. 
you didn't just use the defaults (@sec-danger-fine) or a small grid search) +- You adequately sampled the hyperparameter space (e.g., you didn't just use the defaults (@sec-danger-fine) or a small grid search) - Your observations are independent or at least [exchangeable](https://en.wikipedia.org/wiki/Exchangeable_random_variables#Exchangeability_and_the_i.i.d._statistical_model) and don't have data leakage (@sec-danger-ml-other), or you are explicitly modeling observation dependence -- That all the parameter settings you set are correct or at least viable (e.g. you let the model run for a long enough set of iterations, your batch size was adequate, you had enough hidden layers, etc.) +- That all the parameter settings you set are correct or at least viable (e.g., you let the model run for a long enough set of iterations, your batch size was adequate, you had enough hidden layers, etc.) And if you want to talk about specific feature contributions, you are assuming: @@ -1521,7 +1521,7 @@ g |> ![A Linear Model with Transformation Can Be a Logistic Regression](img/nn_logreg.svg){width=75% #fig-graph-logistic} -As soon as we move away from the standard linear model and use transformations of our linear predictor, simple coefficient interpretation becomes difficult, sometimes exceedingly so. We will explore more of these types of models and how to interpret them in later chapters (e.g. @sec-glm). +As soon as we move away from the standard linear model and use transformations of our linear predictor, simple coefficient interpretation becomes difficult, sometimes exceedingly so. We will explore more of these types of models and how to interpret them in later chapters (e.g., @sec-glm). @@ -1533,7 +1533,7 @@ Before we leave our humble linear model, let's look at some others. Here is a br Generalized Linear Models and related -- True GLM e.g. logistic, poisson +- True GLM (e.g., logistic, poisson) - Other distributions: beta regression, tweedie, t (so-called robust), truncated - Penalized regression: ridge, lasso, elastic net - Censored outcomes: Survival models, tobit @@ -1559,7 +1559,7 @@ Latent Linear Models - Mixture models - Structural Equation Modeling, Graphical models generally -All of these are explicitly linear models or can be framed as such, and may only require only a tweak or two from what you've already seen - e.g. a different distributional assumption, a different link function, penalizing the coefficients, etc. In other cases, we can bounce from one to another and even get similar results. For example we can reshape our multivariate outcome to be amenable to a mixed model approach and get the exact same results. We can potentially add a random effect to any model, and that random effect can be based on time, spatial or other considerations. Additionally, the same type of linear combination of features used in linear regression can be used in many types of models, even deep learning models! +All of these are explicitly linear models or can be framed as such, and may require only a tweak or two from what you've already seen - for example, a different distributional assumption, a different link function, penalizing the coefficients, etc. In other cases, we can bounce from one to another and even get similar results. For example, we can reshape our multivariate outcome to be amenable to a mixed model approach and get the exact same results. We can potentially add a random effect to any model, and that random effect can be based on time, spatial or other considerations.
Additionally, the same type of linear combination of features used in linear regression can be used in many types of models, even deep learning models! The important thing to know is that the linear model is a very flexible tool that expands easily, and allows you to model most of the types of outcomes we are interested in. As such, it's a very powerful approach to modeling. @@ -1615,4 +1615,4 @@ For this exercise let's switch to the world happiness data 2018 data. You can fi - Suggestion for features: GDP per capita, Social support, Healthy life expectancy - Summarize the model, and interpret the coefficients. What do you find? - Assess the model fit with RMSE and R^2^ -- Try to get a prediction of at least one new observation of interest, e.g. log GDP Per Capita of 10, life expectancy 70, social support 0.8, which would represent a decently well-off country. Contrast that prediction with a less well-off country, with values less than the median for each feature. What do you find? +- Try to get a prediction of at least one new observation of interest, e.g., log GDP Per Capita of 10, life expectancy 70, social support 0.8, which would represent a decently well-off country. Contrast that prediction with a less well-off country, with values less than the median for each feature. What do you find? diff --git a/machine_learning.qmd b/machine_learning.qmd index 5ac67b6..94b21a3 100644 --- a/machine_learning.qmd +++ b/machine_learning.qmd @@ -48,7 +48,7 @@ To dive into applying machine learning models, you really only need a decent gra ## Objective Functions {#sec-ml-objective} -We've implemented a variety of objective functions in other chapters, such as mean squared error for numeric targets and log loss for binary targets (@sec-estimation). The objective function is what we used to estimate model parameters, but it's not necessarily the same as the performance metric we ultimately use to select a model. For example, we may use log loss as the objective function, but then use accuracy as the performance metric. In that setting, the log loss provides a 'smooth' objective function to search the parameter space over, while accuracy is a straightforward and more interpretable metric for stakeholders. In this case, the objective function is used to optimize the model, while the performance metric is used to evaluate the model. In some cases, the objective function and performance metric are the same (e.g. (R)MSE), and even if not, they might have selected the same 'best' model, but this is not always the case. The following table shows some commonly used objective functions in machine learning for regression and classification tasks (@tbl-objective-functions). +We've implemented a variety of objective functions in other chapters, such as mean squared error for numeric targets and log loss for binary targets (@sec-estimation). The objective function is what we used to estimate model parameters, but it's not necessarily the same as the performance metric we ultimately use to select a model. For example, we may use log loss as the objective function, but then use accuracy as the performance metric. In that setting, the log loss provides a 'smooth' objective function to search the parameter space over, while accuracy is a straightforward and more interpretable metric for stakeholders. In this case, the objective function is used to optimize the model, while the performance metric is used to evaluate the model. 
In some cases, the objective function and performance metric are the same (e.g., (R)MSE), and even if not, they might have selected the same 'best' model, but this is not always the case. The following table shows some commonly used objective functions in machine learning for regression and classification tasks (@tbl-objective-functions). @@ -371,7 +371,7 @@ In the previous section you can compare our results on the tests vs. training se > This part gets into the weeds a bit. If you are not so inclined, skip to the summary of this section. -In the following discussion, you can think of a standard linear model scenario, e.g. with squared-error loss function, and a data set where we split some of the observations in a random fashion into a training set, for initial model fitting, and a test set, which will be kept separate and independent, and used to measure generalization performance. We note **training error** as the average loss over all the training sets we could create in this process of random splitting. The **test error** is the average prediction error obtained when a model fitted on the training data is used to make predictions on the test data. +In the following discussion, you can think of a standard linear model scenario, for example, with squared-error loss function, and a data set where we split some of the observations in a random fashion into a training set, for initial model fitting, and a test set, which will be kept separate and independent, and used to measure generalization performance. We note **training error** as the average loss over all the training sets we could create in this process of random splitting. The **test error** is the average prediction error obtained when a model fitted on the training data is used to make predictions on the test data. @@ -892,7 +892,7 @@ Regularization is used in many modeling scenarios. Here is a quick rundown of so - GAMs use penalized regression for estimation of the coefficients for the basis functions (typically with L2). This keeps the 'wiggly' part of the GAM from getting too wiggly, as in the overfit model in @fig-over-under. This shrinks the feature-target relationship toward a linear one. -- Similarly, the variance estimate of a random effect in mixed models, e.g. for the intercept or slope, is inversely related to an L2 penalty on the effect estimates for that group effect. The more penalization applied, the less random effect variance, and the more the random effect is shrunk toward the overall mean[^mixedpenalty]. +- Similarly, the variance estimate of a random effect in mixed models, e.g., for the intercept or slope, is inversely related to an L2 penalty on the effect estimates for that group effect. The more penalization applied, the less random effect variance, and the more the random effect is shrunk toward the overall mean[^mixedpenalty]. [^mixedpenalty]: One more reason to prefer a random effects approach over so-called fixed effects models, as the latter are not penalized at all, and thus are more prone to overfitting. @@ -1154,7 +1154,7 @@ From the five validation sets, we end up with five separate accuracy values, one There are different approaches we can take for cross-validation that we may need for different data scenarios. Here are some of the more common ones. - **Shuffled**: Shuffling prior to splitting can help avoid data ordering having undue effects. -- **Grouped/stratified**: In cases where we want to account for the grouping of the data, e.g. for data with a hierarchical structure. 
+- **Grouped/stratified**: In cases where we want to account for the grouping of the data, e.g., for data with a hierarchical structure. - Grouped: We may want groups to appear in training *or* test, but not both. This allows us to generalize to new groups. - Stratified: We may want to ensure group proportions are consistent across training and test sets. This is especially useful in unbalanced target settings to ensure all class labels are present in training and test. - **Time-based**: for time series data, where we only want to assess error on future values @@ -1646,7 +1646,7 @@ at = auto_tuner( tuner = tnr('random_search'), learner = pipeline, resampling = rsmp ('cv', folds = 5), - measure = msr('classif.???'), # change ??? e.g. try auc, recall, logloss + measure = msr('classif.???'), # change ??? e.g., try auc, recall, logloss term_evals = 10 ) diff --git a/matrix_operations.qmd b/matrix_operations.qmd index 9b5cdb2..29876c1 100644 --- a/matrix_operations.qmd +++ b/matrix_operations.qmd @@ -803,7 +803,7 @@ Matrix multiplication is not the same as **elementwise** multiplication. Element ## Division -Though addition, subtraction, and multiplication are all pretty straightforward, matrix division is not. In fact, there really isn't such a thing as matrix division, we just use matrix multiplication in a particular way. This is similar to how we can divide two numbers, e.g. $a/b$, but we can also multiply by the reciprocal, $a*(1/b)$. In matrix terms this would look something like: +Though addition, subtraction, and multiplication are all pretty straightforward, matrix division is not. In fact, there really isn't such a thing as matrix division, we just use matrix multiplication in a particular way. This is similar to how we can divide two numbers, for example, $a/b$, but we can also multiply by the reciprocal, $a*(1/b)$. In matrix terms this would look something like: $$ diff --git a/ml_common_models.qmd b/ml_common_models.qmd index db15e82..394f196 100644 --- a/ml_common_models.qmd +++ b/ml_common_models.qmd @@ -166,7 +166,7 @@ majority = pmax(prevalence, 1 - prevalence) In this data, roughly `r scales::percent(prevalence)` suffered from heart disease, so if we're interested in accuracy- we could get `r scales::percent(majority)` correct by just guessing the majority class of no disease. Hopefully we can do better than that! -One last thing, as we go along, performance metrics will vary depending on your setup (e.g. Python vs. R), package versions used, and other things. As such your results may not look exactly like these, and that's okay! Your results should still be similar, and the important thing is to understand the concepts and how to apply them to your own data. +One last thing, as we go along, performance metrics will vary depending on your setup (e.g., Python vs. R), package versions used, and other things. As such your results may not look exactly like these, and that's okay! Your results should still be similar, and the important thing is to understand the concepts and how to apply them to your own data. ## Beat the Baseline {#sec-ml-common-baseline} @@ -210,7 +210,7 @@ Take a classification model for example. In this case we might use a logistic re ### Why do we do this? {#sec-ml-common-baseline-why} -Having a baseline model can help you avoid wasting time and resources implementing more complex tools, and to avoid mistakenly thinking performance is better than expected. 
It is probably rare, but sometimes relationships for the chosen features and target are mostly or nearly linear and have little interaction. In this case, no amount of fancy modeling will make complex feature targets exist if they don't already. Furthermore, if our baseline is a more complex model that actually incorporates nonlinear relationships and interactions (e.g. a GAMM), you'll often find that the more complex models often don't significantly improve on it. As a last example, in time series settings, a *moving average* can often be a difficult baseline to beat, and so can be a good starting point. +Having a baseline model can help you avoid wasting time and resources implementing more complex tools, and to avoid mistakenly thinking performance is better than expected. It is probably rare, but sometimes relationships for the chosen features and target are mostly or nearly linear and have little interaction. In this case, no amount of fancy modeling will make complex feature-target relationships exist if they don't already. Furthermore, if our baseline is a more complex model that actually incorporates nonlinear relationships and interactions (e.g., a GAMM), you'll often find that more complex models don't significantly improve on it. As a last example, in time series settings, a *moving average* can often be a difficult baseline to beat, and so can be a good starting point. So in general, you may find that the initial baseline model is good enough for present purposes, and you can then move on to other problems to solve, like acquiring data that is more predictive. This is especially true if you are working in a situation with limited time and resources, but should be of mind generally. @@ -909,7 +909,7 @@ So why might we want to use a neural network for tabular data? The main reason i - Some tolerance to correlated inputs. - Batch processing and parallelization of many operations makes it very efficient for large datasets. - Can be used for even standard GLM approaches. -- Can be added as a component to other deep learning models (e.g. LLMs that are handling text input). +- Can be added as a component to other deep learning models (e.g., LLMs that are handling text input). **Weaknesses** diff --git a/ml_more.qmd index 3cbf3df..93d1da0 100644 --- a/ml_more.qmd +++ b/ml_more.qmd @@ -94,7 +94,7 @@ In both clustering rows and reducing columns, we're essentially reducing the dim -[^componentvar]: Ideally we'd capture all the variability, but that's not the end result, and some techniques or results may only capture a relatively small percentage. In our personality example, this could be because the questions don't adequately capture the underlying personality constructs (i.e. an issue of the reliability of instrument), or because personality is just not that simple and we'd need more dimensions. +[^componentvar]: Ideally we'd capture all the variability, but that's not the end result, and some techniques or results may only capture a relatively small percentage. In our personality example, this could be because the questions don't adequately capture the underlying personality constructs (i.e., an issue of the reliability of the instrument), or because personality is just not that simple and we'd need more dimensions. Now, imagine if we reduced the features to a single categorical variable, say, with two or three groups. Now you have cluster analysis!
You can discretize any continuous feature to a coarser set of categories, and this goes for latent variables as well as those we actually observe in our data. For example, if we do a factor analysis with one latent feature, we could either convert it to a probability of some class with an appropriate transformation, or just say that scores higher than some cutoff are in cluster A and the others are in cluster B. Indeed, there is a whole class of clustering models called **mixture models** that do just that - they estimate the latent probability of class membership. Many of these approaches are conceptually similar or even identical to the continuous method counterparts, and the primary difference is how we think about and interpret the results. @@ -256,9 +256,9 @@ In summary, there are many methods that fall under the umbrella of unsupervised :::{.callout-note title='Generative vs. Discriminative Models' collapse='true'} -Many unsupervised learning and many deep learning techniques involved in computer vision and natural language processing are often thought of as **generative** models. These attempt to model the underlying data generating process, i.e. the features, but possibly a target variable also. In contrast, most supervised learning models are often thought of as **discriminative** models, that try to model the conditional distribution of the target given the features only. +Many unsupervised learning and many deep learning techniques involved in computer vision and natural language processing are often thought of as **generative** models. These attempt to model the underlying data generating process, i.e., the features, but possibly a target variable also. In contrast, most supervised learning models are often thought of as **discriminative** models, that try to model the conditional distribution of the target given the features only. -These labels are a bit problematic though. Any probabilistic model can be used to generate data, even if it is only for the target, so calling a model generative isn't all that clarifying. And models that might be thought of as discriminative in a machine learning context might not be in others (e.g. [Bayesian](https://stats.stackexchange.com/questions/7455/the-connection-between-bayesian-statistics-and-generative-modeling)). +These labels are a bit problematic though. Any probabilistic model can be used to generate data, even if it is only for the target, so calling a model generative isn't all that clarifying. And models that might be thought of as discriminative in a machine learning context might not be in others (e.g., [Bayesian](https://stats.stackexchange.com/questions/7455/the-connection-between-bayesian-statistics-and-generative-modeling)). ::: @@ -278,7 +278,7 @@ Reinforcement learning has many applications, including robotics, games, and aut ## Working with Specialized Data Types {#sec-ml-more-non-tabular} -While our focus in this book is on tabular data due to its ubiquity, there are many other types of data used for machine learning and modeling in general. This data often starts in a special format or must be considered uniquely. You'll often hear this labeled as 'unstructured', but that's probably not the best conceptual way to think about it, as the data is still structured in some way, sometimes in a strict format (e.g. images). Here we'll briefly discuss some of the other types of data you'll potentially come across. 
+While our focus in this book is on tabular data due to its ubiquity, there are many other types of data used for machine learning and modeling in general. This data often starts in a special format or must be considered uniquely. You'll often hear this labeled as 'unstructured', but that's probably not the best conceptual way to think about it, as the data is still structured in some way, sometimes in a strict format (e.g., images). Here we'll briefly discuss some of the other types of data you'll potentially come across. ### Spatial {#sec-ml-more-spatial} @@ -293,7 +293,7 @@ While our focus in this book is on tabular data due to its ubiquity, there are m ![Spatial Data (code available from [Kyle Walker](https://github.com/walkerke/mb-immigrants))](img/ml-spatial-kw.png){width=75%} ::: -Spatial data, which includes geographic and similar information, can be quite complex. It often comes in specific formats (e.g. shapefiles), and may require specialized tools to work with it. Spatial specific features may include continuous variables like latitude and longitude, or tracking data from a device like a smartwatch. Other spatial features are more discrete, such as states or political regions within a country. +Spatial data, which includes geographic and similar information, can be quite complex. It often comes in specific formats (e.g., shapefiles), and may require specialized tools to work with it. Spatial specific features may include continuous variables like latitude and longitude, or tracking data from a device like a smartwatch. Other spatial features are more discrete, such as states or political regions within a country. We could use these spatial features as we would others in the tabular setting, but we often want to take into account the uniqueness of a particular region, or the correlation of spatial regions. Historically, most spatial data can be incorporated into approaches like mixed models or generalized additive models, but in certain applications, such as satellite imagery, deep learning models are more the norm, and the models often transition into image processing techniques. diff --git a/models.qmd b/models.qmd index 466f037..e577dbb 100644 --- a/models.qmd +++ b/models.qmd @@ -13,7 +13,7 @@ At its core, a model is just an **idea**. It's a way of thinking about the world ## What Goes into a Model? What Comes Out? {#sec-lm-in-a-model} -In the context of a model, how we specify the nature of the relationship between various entities depends on the context. In the interest of generality, we'll refer to the **target** as what we want to explain, and **features** as those aspects of the data we will use to explain it. Because people come at data from a variety of contexts, they often use different terminology to mean the same thing. The next table shows some of the common terms used to refer to features and targets. Note that they can be mixed and matched, e.g. someone might refer to covariates and a response, or inputs and a label. +In the context of a model, how we specify the nature of the relationship between various entities depends on the context. In the interest of generality, we'll refer to the **target** as what we want to explain, and **features** as those aspects of the data we will use to explain it. Because people come at data from a variety of contexts, they often use different terminology to mean the same thing. The next table shows some of the common terms used to refer to features and targets. 
Note that they can be mixed and matched, for example, someone might refer to covariates and a response, or inputs and a label. ```{r tbl-feat-target, cache=FALSE} #| echo: false diff --git a/more_models.qmd b/more_models.qmd index 7bcaf79..82b7be3 100644 --- a/more_models.qmd +++ b/more_models.qmd @@ -20,11 +20,11 @@ Many of these were listed in @sec-lm-more, but we provide a few more to think ab **Generalized Linear Models and related** -- True GLM e.g. gamma +- True GLM (e.g., gamma) - Other GLM type distributions: beta regression, tweedie, t (so-called 'robust'), truncated - Censored outcomes: Survival models, tobit, Heckman - Nonlinear regression -- Modeling other parameters (e.g. heteroscedastic models) +- Modeling other parameters (e.g., heteroscedastic models) - Modeling beyond the mean (quantile regression) - Mixed models - Generalized Additive Models @@ -35,7 +35,7 @@ Many of these were listed in @sec-lm-more, but we provide a few more to think ab - Gaussian process regression - Spatial models (CAR, SAR, etc.) -- Time series models (ARIMA and related, e.g. state space, Dynamic linear) +- Time series models (ARIMA and generalizations like state space models, Dynamic linear) - Factor analysis **Multivariate/multiclass/multipart** @@ -44,7 +44,7 @@ Many of these were listed in @sec-lm-more, but we provide a few more to think ab - Multinomial/Categorical/Ordinal regression (>2 classes) - MANOVA/Linear Discriminant Analysis (these are identical, and can handle multiple outputs or >=2 classes) - Zero (or some number) -inflated/hurdle/altered -- Mixture models and Cluster analysis (e.g. K-means, Hierarchical clustering) +- Mixture models and Cluster analysis (e.g., K-means, Hierarchical clustering) - Two-stage least squares, instrumental variables, simultaneous equations - SEM, simultaneous equations - PCA, Factor Analysis @@ -54,7 +54,7 @@ Many of these were listed in @sec-lm-more, but we provide a few more to think ab -All of these are explicitly linear models or can be framed as such, and compared to what you've already seen, typically only require a tweak or two from those - e.g. a different distribution, a different link function, penalizing the coefficients, etc. In other cases, we can bounce from one to another, or there is heavy overlap. For example we can reshape our multivariate regression or an IRT model to be amenable to a mixed model approach, and get the exact same results. We can potentially add a random effect to any model, and that random effect can be based on time, spatial or other considerations. The important thing to know is that the linear model is a very flexible tool that expands easily, and allows you to model most of the types of outcomes we are interested in. As such, it's a very powerful approach to modeling. +All of these are explicitly linear models or can be framed as such, and compared to what you've already seen, typically only require a tweak or two from those - for example, a different distribution, a different link function, penalizing the coefficients, etc. In other cases, we can bounce from one to another, or there is heavy overlap. For example we can reshape our multivariate regression or an IRT model to be amenable to a mixed model approach, and get the exact same results. We can potentially add a random effect to any model, and that random effect can be based on time, spatial or other considerations. 
The important thing to know is that the linear model is a very flexible tool that expands easily, and allows you to model most of the types of outcomes we are interested in. As such, it's a very powerful approach to modeling. ## Other Machine Learning Models {#sec-app-more-ml} @@ -117,7 +117,7 @@ We haven't delved into the world of deep learning as much as there hasn't yet be Convolutional neural networks as currently implemented can be seen going back to LeNet in the late 1990s, and took off several years later with AlexNet and VGG. ResNet (residual networks), Densenet, and YOLO are relatively more recent examples of CNNs, though even they have been around for several years at this point. Even so, several of these still serve as baseline models for image classification and object detection, either in practice or as a reference point for current model performance. In general, you'll have specific models for certain types of tasks, such as segmentation, object tracking, etc. -NLP and language processing more generally can be seen as evolving from matrix factorization and LDA, to neural network models such as word2vec and GloVe. In addition, the temporal nature of text suggested time-based models, including even more statistical models like hidden markov models back in the day. But in the neural network domain, we have standard Recurrent networks, then LSTMs, GRUs, Seq2Seq, and more that continued the theme. Now the field is dominated by attention-based transformers, of which BERT variants were popular early on, and OpenAI's GPT is among the most famous example of modern larger language models. But there are many others that have been developed in the last few years, offered from Meta, Google, Anthropic and others. You can see some recent [performance rankings](https://livebench.ai/#/?IF=a&Reasoning=a&Coding=a&Mathematics=a&Data+Analysis=a&Language=a), and note that there is not one model that is best at every task. +Modern NLP and language processing can be seen as starting from matrix factorization and LDA, and subsequently neural network models such as word2vec and GloVe. In addition, the temporal nature of text suggested time-based models, including even more statistical models like hidden Markov models back in the day. But in the neural network domain, we have standard recurrent neural networks, then LSTMs, GRUs, Seq2Seq, and more that continued the theme. Now the field is dominated by attention-based transformers, of which BERT and variants were popular early on, and OpenAI's GPT is among the most famous examples of modern large language models. But there are many others that have been developed in the last few years, offered by Meta, Google, Anthropic, and others. You can see some recent [performance rankings](https://livebench.ai/#/?IF=a&Reasoning=a&Coding=a&Mathematics=a&Data+Analysis=a&Language=a), and note that there is not one model that is best at every task. You'll also find deep learning approaches to some of the models in the ML section, such as recommender systems, clustering, graphs and more. Recent efforts have attempted 'foundational' models to time series, such as Moirai. Michael has surveyed some of the developments in deep learning for tabular data (@clark_deep_2022), and though he hasn't seen anything as of this writing to change the general conclusion there, he hopes to revisit the topic in earnest again in the future, so stay tuned.
::: diff --git a/pyr.qmd index 39670a4..52cfa1d 100644 --- a/pyr.qmd +++ b/pyr.qmd @@ -3,7 +3,7 @@ As we mentioned at the beginning (@sec-intro-lang), use the language you like as long as it gets the job done *well*. The vast majority of data scientists use Python, but R is very common in academia and statistical analysis in general. Both languages have their strengths and weaknesses, and it's worth getting a sense of what they are[^otherlang]. -[^otherlang]: There are other languages that can be used for data science, but they are not nearly as common. For example, Julia is a language that is very fast for some things and generally has a lot of potential for data science. But honestly it came too late to the game, and it's not as user-friendly as Python or R. Proprietary tools like Matlab, SAS, Stata, and similar had their day, and it has long since passed, and closed-source tools will never develop as rapidly as popular open source tools. Still other languages whose primary purpose is not data science may be useful for some tasks (e.g. Spark-ML), but they are not as well-suited for as Python or R. +[^otherlang]: There are other languages that can be used for data science, but they are not nearly as common. For example, Julia is a language that is very fast for some things and generally has a lot of potential for data science. But honestly it came too late to the game, and it's not as user-friendly as Python or R. Proprietary tools like Matlab, SAS, Stata, and similar had their day, but that day has long since passed, and closed-source tools will never develop as rapidly as popular open source tools. Still other languages whose primary purpose is not data science may be useful for some tasks (e.g., Spark-ML), but they are not as well-suited for data science as Python or R. ## Our Background {#sec-pyr-context} diff --git a/uncertainty.qmd index 4724b41..f27987e 100644 --- a/uncertainty.qmd +++ b/uncertainty.qmd @@ -87,7 +87,7 @@ from scipy import stats ## Standard Frequentist {#sec-estim-frequentist} -We talked a bit about the frequentist approach in our discussion of confidence intervals (@sec-lm-interpretation-feature). There we described the process using the interval to *capture* the 'true' parameter value a certain percentage of the time. The key assumption is that the true parameter is fixed, and the interval is a random variable that will contain the true value with some percentage frequency. With this approach, if you were to repeat the experiment, i.e. data collection and analysis, many times, each interval would be slightly different. Although they would be different, any one of the intervals is as good or valid as the others. You also know that a certain percentage of them will contain the true value, and a (usually small) percentage will not. You will never know if a specific interval does actually capture the true value, because we don't know the true value in practice. +We talked a bit about the frequentist approach in our discussion of confidence intervals (@sec-lm-interpretation-feature). There we described the process using the interval to *capture* the 'true' parameter value a certain percentage of the time. The key assumption is that the true parameter is fixed, and the interval is a random variable that will contain the true value with some percentage frequency. With this approach, if you were to repeat the experiment, i.e., data collection and analysis, many times, each interval would be slightly different.
Although they would be different, any one of the intervals is as good or valid as the others. You also know that a certain percentage of them will contain the true value, and a (usually small) percentage will not. You will never know if a specific interval does actually capture the true value, because we don't know the true value in practice. This is a common approach in traditional statistical analysis, and so it's used in many modeling contexts. If no particular estimation approach is specified, the default is usually a frequentist one. The approach not only provides confidence intervals for the parameters, but we can also get them for predictions, which is typically also a goal. @@ -250,7 +250,7 @@ These interval estimates for parameters and predictions are actually not easy to Monte Carlo methods derive their name from the famous casino in Monaco[^mcname]. The idea is to use random sampling to estimate a value. With statistical models, we can use Monte Carlo methods to estimate uncertainty in our model parameters and predictions. The general idea is as follows: -1. **Estimate the model parameters** using the data and their range of possible values (e.g. based on a probability distribution). +1. **Estimate the model parameters** using the data and their range of possible values (e.g., based on a probability distribution). 2. **Simulate new data** from the model using the estimated parameters and assumed probability distributions for those parameters. 3. **Estimate the metrics of interest** using the simulated data. 4. **Repeat** many times. @@ -570,7 +570,7 @@ ggsave('img/estim-bootstrap.svg', width = 8, height = 6) ![Bootstrap Distributions of Parameter Estimates](img/estim-bootstrap.svg){#fig-r-bootstrap} ::: -As mentioned, the bootstrap is often used to provide uncertainty for unmodeled parameters, predictions, and other metrics. However, because we repeatedly run the model or some aspect of it over and over, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g. extreme values) or [in some data contexts](https://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails) (e.g. correlated observations) without extra considerations. Variants exist to help deal with some of these issues, and despite limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model. +As mentioned, the bootstrap is often used to provide uncertainty for unmodeled parameters, predictions, and other metrics. However, because we repeatedly run the model or some aspect of it over and over, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g., extreme values) or [in some data contexts](https://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails) (e.g., correlated observations) without extra considerations. Variants exist to help deal with some of these issues, and despite limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model. 
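To make the resampling idea concrete, here is a minimal sketch of a percentile bootstrap for a single regression coefficient. It assumes the movie reviews data used elsewhere in the text (`df_reviews`, with `rating` and `word_count_sc`); the number of resamples and the simple percentile interval are arbitrary choices for illustration rather than a full treatment.

```{r}
#| eval: false
set.seed(42)

# resample rows with replacement, refit the model, keep the coefficient of interest
boot_coefs = replicate(1000, {
  idx = sample(nrow(df_reviews), replace = TRUE)
  coef(lm(rating ~ word_count_sc, data = df_reviews[idx, ]))['word_count_sc']
})

# a simple percentile interval for the word count effect
quantile(boot_coefs, c(.025, .975))
```

The same pattern extends to any statistic you can compute from a fitted model, which is what makes the bootstrap handy for quantities that don't come with a ready-made formula for their uncertainty.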
## Bayesian {#sec-estim-bayes} diff --git a/understanding_features.qmd b/understanding_features.qmd index 29d425f..996c5c7 100644 --- a/understanding_features.qmd +++ b/understanding_features.qmd @@ -155,7 +155,7 @@ Truly understanding feature contribution is a bit more complicated than just loo So what are we to do? What you need to know to get started looking at a feature's contribution includes the following: - feature range and variability -- feature distributions (e.g. skewness) +- feature distributions (e.g., skewness) - representative values of the feature - target range and variability - feature interactions and correlations @@ -336,7 +336,7 @@ avg_pred_typical.round(3), avg_pred_0.round(3) ``` ::: -When word count is zero, i.e. its mean and everything else is at its mean/mode, we'd predict a chance of a good review at about `r label_percent()(pred_0)`. As such, we interpret this as 'when everything is typical', we have a pretty good chance of getting a good review. The average prediction we'd get if we predicted every observation as if it were the mean word count is more like `r label_percent()(avg_pred_0)`, which is notably less. Which is correct? Both, or neither! They are telling us different things, either of which may be useful, or not. If it's doubtful that the feature values used in the calculation are realistic (e.g. everything at its mean at the same time, or an average word count when length of a movie is at its minimum), then they may both be misleading. You have to know your features and your target to know how best to use the information. +When word count is zero, i.e., its mean, and everything else is at its mean/mode, we'd predict a chance of a good review at about `r label_percent()(pred_0)`. As such, we interpret this as 'when everything is typical', we have a pretty good chance of getting a good review. The average prediction we'd get if we predicted every observation as if it were the mean word count is more like `r label_percent()(avg_pred_0)`, which is notably less. Which is correct? Both, or neither! They are telling us different things, either of which may be useful, or not. If it's doubtful that the feature values used in the calculation are realistic (e.g., everything at its mean at the same time, or an average word count when length of a movie is at its minimum), then they may both be misleading. You have to know your features and your target to know how best to use the information. ### Average marginal effects {#sec-avg-marginal-effects} @@ -657,7 +657,7 @@ It's very easy even with base package functions to see some very interesting thi ## SHAP Values {#sec-knowing-shap-values} -As we've suggested, most models are more complicated than can be explained by a simple coefficient, e.g. nonlinear effects in generalized additive models. Or, there may not even be feature-specific coefficients available, like gradient boosting models. Or, we may even have many parameters associated with a feature, as in deep learning. Such models typically won't come with statistical output like standard errors and confidence intervals either. But we'll still have some tricks up our sleeve to help us figure things out! +As we've suggested, most models are more complicated than can be explained by a simple coefficient, for example, nonlinear effects in generalized additive models. Or, there may not even be feature-specific coefficients available, like gradient boosting models. Or, we may even have many parameters associated with a feature, as in deep learning. 
Such models typically won't come with statistical output like standard errors and confidence intervals either. But we'll still have some tricks up our sleeve to help us figure things out!

A very common interpretation tool is called a **SHAP value**. SHAP stands for **SHapley Additive exPlanations**, and it provides a means to understand how much each feature contributes to a specific prediction. It's based on a concept from game theory called the Shapley value, which is a way to understand how much each player contributes to the outcome of a game. For our modeling context, SHAP values break down a prediction to show the impact of each feature. The reason we bring it up here is that it has a nice intuition in the linear model case, and seeing it in that context is a good way to get a sense of how it works. Furthermore, it builds on what we've been talking about with our various prediction approaches.
@@ -1702,7 +1702,7 @@ ggsave("img/knowing-feature-importance-2.svg", width = 8, height = 6)

:::

-All of the metrics shown have uncertainty in their estimate, and some packages make it easy to plot or extract. As an example one could bootstrap a metric, or use the permutations as a means to get at the uncertainty. However, the behavior and distribution of these metrics is not always well understood, and in some cases, the computation would often be notable (e.g. with SHAP). You could also look at the range of the ranks created by bootstrapping or permuting, and take the upper bound as the worst case for a given feature. Although this might possibly be conservative, the usual problem is that people are too optimistic about their feature importance result, so this might be a good thing.
+All of the metrics shown have uncertainty in their estimate, and some packages make it easy to plot or extract. As an example, one could bootstrap a metric, or use the permutations as a means to get at the uncertainty. However, the behavior and distribution of these metrics are not always well understood, and in some cases, the computation can be quite demanding (e.g., with SHAP). You could also look at the range of the ranks created by bootstrapping or permuting, and take the upper bound as the worst case for a given feature. Although this might be conservative, the usual problem is that people are too optimistic about their feature importance result, so this might be a good thing.

The take home message is that in the best of circumstances, there is no automatic way of saying one feature is more important than another. It's nice that we can use approaches like SHAP and permutation methods for more complicated models like boosting and deep learning models, but they're not perfect, and they still suffer from most of the same issues as the linear model. In the end, understanding a feature's role within a model is ultimately a matter of context, and highly dependent on what you're trying to do with the information.
@@ -1797,7 +1797,7 @@ Let's put our model understanding that we've garnered in this and the previous c
- Now create a simpler or more complex model for comparison. The simple model should just remove a feature, while the complex model should add features to your initial model.
- Interpret the first model as a whole, then compare it to the simpler/more complex model. Did anything change in how you assessed the features they have in common?
- Create a counterfactual prediction for an observation, either one from the observed data or one that is of interest to you.
-- Choose the 'best' model, and justify your reason for doing so, e.g. using a specific metric of your choosing. +- Choose the 'best' model, and justify your reason for doing so, e.g., using a specific metric of your choosing. - As a bonus, get feature importance metrics for your chosen model. If using R, we suggest using the [iml]{.pack} package ([example]()), and in Python, run the linear regression with [scikit-learn]{.pack} and use the permutation-based importance function. :::{.panel-tabset} diff --git a/understanding_models.qmd b/understanding_models.qmd index b298885..e074104 100644 --- a/understanding_models.qmd +++ b/understanding_models.qmd @@ -1456,7 +1456,7 @@ ggsave("img/knowing-model-pp-check-simple.svg", p_pp_simple, width = 8, height = ![Simple Predictive Check for a regression model](img/knowing-model-pp-check-simple.svg){#fig-model-pp-check-simple} -We may be getting ahead of ourselves to understand this completely yet, but it's worth knowing about ***posterior* predictive checks**, which are typically used with Bayesian models, but are not restricted to that case. A proper posterior predictive check is a bit more involved, but there are packages that make it straightforward[^ppcpack]. The basic idea is that we simulate the target based on the model parameter estimates and their uncertainty. And with that distribution of estimates (e.g. regression coefficients), we can then simulate random draws of predicted values. A step-by-step approach is as follows: +We may be getting ahead of ourselves to understand this completely yet, but it's worth knowing about ***posterior* predictive checks**, which are typically used with Bayesian models, but are not restricted to that case. A proper posterior predictive check is a bit more involved, but there are packages that make it straightforward[^ppcpack]. The basic idea is that we simulate the target based on the model parameter estimates and their uncertainty. And with that distribution of estimates (e.g., regression coefficients), we can then simulate random draws of predicted values. A step-by-step approach is as follows: 1. Simulate the parameters following the assumed distribution, e.g., a normal distribution for the regression coefficients. 2. For each simulated parameter set, make a model prediction. From ed997e6fd03b34aaea9ceb01ae5c2bda094dc61d Mon Sep 17 00:00:00 2001 From: micl Date: Sat, 30 Nov 2024 14:04:34 -0500 Subject: [PATCH 10/19] gc through end --- acknowledgments.qmd | 12 ++++-------- conclusion.qmd | 2 +- matrix_operations.qmd | 2 +- references.qmd | 2 +- 4 files changed, 7 insertions(+), 11 deletions(-) diff --git a/acknowledgments.qmd b/acknowledgments.qmd index 93d6c68..1db6203 100644 --- a/acknowledgments.qmd +++ b/acknowledgments.qmd @@ -6,15 +6,11 @@ This work was supported by no grants or grad students, and was done in the spare We could not have done this without the support of our families, who have been patient and understanding throughout the process. -In particular, Michael wishes to thank his wife Xilin, whose patience, humor, and support have been invaluable. He'd also like to thank Rich Herrington, who got him started with data science many years ago, and the folks at Strong Analytics/OneSix who have been supportive of his recent efforts, put up with his transition from academia to industry, and made him much better at everything he does in data science. 
And finally, he would like to thank the many mentors in the statistical, ML, and programming communities, who never knew they were such a large influence, but did much to expand his knowledge in many ways over the years. - -Seth wishes to thank his wife Megan, who has been supportive and understanding throughout the process, and his kids, who put up with with the time spent on this project. He'd also like to thank his colleagues and students at Notre Dame, who have supported teaching antics and research shenanigans. He hopes that this book is more proof that he knows what he's talking about, and not just a bunch of nonsense. - - -Together we'd like to thank those who helped with the book by providing comments and feedback, specifically: Malcolm Barrett, Isabella Gehment, Demetri Pananos, and Chelsea Parlett-Pellereti. Their insights and suggestions were invaluable in making this book better. We'd also like to thank others who took the time to take even just a quick glance to see what was going on and provide a brief comment here and there. We appreciate all the help we can get. - -Finally, we'd like to thank you, the reader, for taking the time to read this book. We hope you find it useful and informative, and that it helps you in your data science journey. If you have any feedback, please feel free to reach out to us. We'd love to hear from you. +In particular, Michael wishes to thank his wife Xilin, whose insights, humor, support, and especially patience, have been invaluable throughout the process. He'd also like to thank Rich Herrington, whose encouragement was the catalyst for his shift to data science many years ago. Michael also thanks the folks at Strong Analytics/OneSix who put up with his transition from academia to industry, and made him much better at everything he does in the realm of data science. And finally, he would like to thank the many mentors in the statistical, machine learning, deep learning, and programming communities, who never knew they were such a large influence, but did much to expand his knowledge in many ways over the years. +Seth is grateful for the support and understanding of his wife Megan, and his fantastic kids, who never cease to bring him joy. He'd also like to thank his colleagues and students at Notre Dame, who have supported teaching antics and research shenanigans. He hopes that this book provides at least some evidence that he knows what he's talking about. +Together we'd like to thank those who helped with the book by providing comments and feedback, specifically: Malcolm Barrett, Isabella Gehment, Demetri Pananos, and Chelsea Parlett-Pellereti. Their insights and suggestions were invaluable in making this book a lot better. We'd also like to thank others who took the time to take even just a quick glance to see what was going on and provide a brief comment here and there. Every contribution was appreciated. +Finally, we'd like to thank you, the reader, for taking the time to peruse this book. We hope you find it useful and informative, and that it helps you in your data science journey. If you have any feedback, please feel free to reach out to us. We'd love to hear from you. 
\ No newline at end of file
diff --git a/conclusion.qmd b/conclusion.qmd
index 3c50e81..39da858 100644
--- a/conclusion.qmd
+++ b/conclusion.qmd
@@ -53,7 +53,7 @@ The task can be thought of as the goal of our model, which might be defined as r
Various algorithms allow us to estimate the parameters of the model, typically in an iterative fashion, moving from one guess to a hopefully better one. We can think of general approaches, like maximum likelihood, Bayesian estimation, or stochastic gradient descent. Or we can focus on a specific implementation of these, such as penalized likelihood, Hamiltonian Monte Carlo, or backpropagation.

-So when we think about models, we start with an idea, but in the end it needs to be expressed in a form that suggests an architecture. That architecture specifies how we take in data and make outputs in the form of predictions, or something that can be transformed to them. With that in place, we need an algorithm search the parameter space of estimates of the model, and a way to evaluate how well the model is doing. While this is enough to produce results, it only gets us the bare minimum. There are many more things we have to do to help us interpret those results, understand the model's performance, and get a sense of its limitations.
+So when we think about models, we start with an idea, but in the end it needs to be expressed in a form that suggests an architecture. That architecture specifies how we take in data and make outputs in the form of predictions, or something that can be transformed to them. With that in place, we need an algorithm to search the parameter space of the model, and a way to evaluate how well the model is doing. While this is enough to produce results, it only gets us the bare minimum. There are many more things we have to do to help us interpret those results, understand the model's performance, and get a sense of its limitations.
diff --git a/matrix_operations.qmd b/matrix_operations.qmd
index 29876c1..21f8a3f 100644
--- a/matrix_operations.qmd
+++ b/matrix_operations.qmd
@@ -536,7 +536,7 @@ matrix_A - 3

## Transpose

-You might see a matrix denoted as $A^T$ or $A'$. The superscripted $T$ or $'$ for matrix **transpose**. If we transpose a matrix, all we are doing is flipping the rows and columns along the matrix's 'main diagonal'. A visual example is much easier:
+You might see a matrix denoted as $A^T$ or $A'$. The superscripted $T$ or $'$ stands for the matrix **transpose**. If we transpose a matrix, all we are doing is flipping the rows and columns along the matrix's 'main diagonal'. A visual example is much easier:

$$
\stackrel{\mbox{Matrix A}}{
diff --git a/references.qmd b/references.qmd
index 2ff35c7..b979747 100644
--- a/references.qmd
+++ b/references.qmd
@@ -8,7 +8,7 @@ nocite: |

# References

-These references tend to be more functional than academic, and hopefully will be more practically useful to you as well. If you prefer additional academic resources, you'll find some of those as we well, but you can also look at the references within many of these for deeper or more formal dives, or just search Google Scholar for any of the topics covered.
+These references tend to be more functional than academic, and hopefully will be more practically useful to you as well. If you prefer additional academic resources, you'll find some of those as well, but you can also look at the references within many of these for deeper or more formal dives, or just search Google Scholar for any of the topics covered.
::: {#refs} ::: From 3d2d401373f1002a66819dead13d4fb7ec76a256 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 17:30:48 -0500 Subject: [PATCH 11/19] move some discussion from conc to model chap --- conclusion.qmd | 98 ++++++++------------------------------------------ models.qmd | 88 +++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 95 insertions(+), 91 deletions(-) diff --git a/conclusion.qmd b/conclusion.qmd index 39da858..69010c6 100644 --- a/conclusion.qmd +++ b/conclusion.qmd @@ -1,4 +1,4 @@ -# Until Next Time... {#conclusion} +# Parting Thoughts {#conclusion} ![](img/chapter_gp_plots/gp_plot_12.svg){width=75%} @@ -10,7 +10,7 @@ As we wrap things up, let's revisit some of the key points we've covered in this text, and talk more about the modeling process in general. -## How to Think About Models (revisited) {#conc-models-think} +## How to Think About Models {#conc-models-think} When we first started our discussion of models in data science (@sec-models), we talked about how a model is a simplified representation of reality. They start as ideas based on our intuition or experience, and they can sometimes be very simple ones. But at some point we start to think of them more formally, as a step towards testing those ideas in the real world. For statistics, machine learning, and data science more generally, models are then put into mathematical equations that give us a common language to reference them by. This does not have to be complex though. As an example, most of the models you've seen so far can be expressed as follows: @@ -37,105 +37,35 @@ To aid our understanding beyond the math, we try to visually express models in a But even now these models are still at the idea stage, and we ultimately need to see how they work in the world, make predictions, and help us to make informed decisions. We've seen how to do this with linear models of various forms, and more unusual model implementations in the form of tree-based models, and even highly complex neural networks. These are the tools that allow us to take our ideas and turn them into something that can be used to make decisions, and that's the real power of using models in data science. -We can break our thinking about models into the following components: -**Model** -In data science, a model refers to a unique (mathematical) implementation we're using to answer our questions. It specifies the **architecture** of the model, which might be a simple linear component, a series of trees, or neural network. In addition, the model specifies the **functional form**, the $f()$ in our equation, that translates inputs to outputs, and the **parameters** required to make that transformation. In code, the model is implemented with functions such as `lm` in R, or in Python, an `XGBoostClassifier` or PyTorch `nn.Model` class. - - -**Task** - -The task can be thought of as the goal of our model, which might be defined as regression, classification, ranking, or next word prediction. It is closely tied to the objective or loss function, which is a measure of correspondence between the model output and the target we're trying to understand. The objective function provides the model a goal - minimize target-output discrepancy or maximize similarity. As an example, if our target is numeric and our task is 'regression', we can use mean squared error as an objective function, which provides a measure of the prediction-target discrepancy. 
- - -**Algorithm** - -Various algorithms allow us to estimate the parameters of the model, typically in an iterative fashion, moving from one guess to a hopefully better one. We can think of general approaches, like maximum likelihood, Bayesian estimation, or stochastic gradient descent. Or we can focus on a specific implementation of these, such as penalized likelihood, hamilton monte carlo, or backpropagation. - -So when we think about models, we start with an idea, but in the end it needs to be expressed in a form that suggests an architecture. That architecture specifies how we take in data and make outputs in the form of predictions, or something that can be transformed to them. With that in place, we need an algorithm to search the parameter space of estimates of the model, and a way to evaluate how well the model is doing. While this is enough to produce results, it only gets us the bare minimum. There are many more things we have to do to help us interpret those results, understand the model's performance, and get a sense of its limitations. - - - -## Key Steps in Modeling - -When it comes to modeling, there are a few key steps that you should always keep in mind. These are not necessarily exhaustive, but we feel they're a good way to think about how to approach modeling in data science. - -**Define the problem** - -Start by clearly defining the problem you want to solve. It is often easy to express in very general terms, but more challenging to precisely pin down the problem statement in a way that can actually help you solve it. What are you trying to predict? What data do you have to work with? What are the constraints on your data and model? What are the consequences of the results, whatever they may be? Why do you even care about any of this? These are all questions you should try to answer before diving into modeling. - - -**Know your data well** - -During our time consulting in industry and academia, we've seen many cases where the available data is simply not suited to answer the question at hand[^unsuitabledata]. This leads to wasted time, money, and other resources. You can't possibly answer your question if the data doesn't have the appropriate content to do so. - -[^unsuitabledata]: This is a common problem where data is often collected for one purpose and then used for another, as with general purpose surveys or administrative data. Sometimes it can be that the available data is simply not enough to say anything without a lot of uncertainty, as in the case of demographic data regarding minority groups for which there may be few instances of a particular sample. Deep learning approaches like zero/Few-shot learning isn't applicable here, because there isn't a model pre-trained on millions/billions of similar examples to transfer knowledge from. - -In addition, if your data is fraught with issues due to inadequate exploration, cleaning , or transformation, then you're going to have a hard time getting valuable results. It is very common to be dealing with data that has issues that even those who collected it are unaware of, so always be looking out for ways to improve it. - - -**Have multiple models at your disposal** - -Go into a modeling project with a couple models in mind that you think might be useful. This could even be as simple as increasing complexity within a single model approach - you don't have to get too fancy! You should have a few models that you're comfortable with and that you know how to use, and for which you know the strengths and weaknesses. 
Whenever possible, make time to explore more complex or less familiar approaches that you also think may be suitable to the problem. As we've discussed (@sec-knowing-model-compare), model comparison can help you have more confidence in the results of the model that's finally chosen. Just like in a lot of other situations, you don't want to 'put all your eggs in one basket', and you'll always have more to talk about and consider if you have multiple models to work with. - - -**Communicate your results** - -If you don't know the model and underlying data well enough to explain the results to others, you're not going to be able to use them effectively in the first place. Conversely, you also may know the technical side very well, but if you're unable to communicate the results in simpler terms that others can understand, you're going to have a hard time convincing others of the value of your work. Communication is an essential component of the modeling process, and it's something that you should be thinking about from the very beginning. - - - -## The Hard Part {#conc-models-hard} - - -Modeling is just one aspect of the data science process, and the hard part of that process is often not so much the model itself, but everything else that goes into it and what you do with it after. It can be difficult to come up with the original idea for a model, and even harder to get it to work in practice. - - -**The Data** - -Model performance is largely going to come from the quality of the data and how you've prepared it, from ensuring its integrity to feature engineering. Some models will usually work better than others in certain situations, but there are no guarantees, and often the practical difference in performance is minimal. But you can potentially improve performance by understanding your data better, and by understanding the limitations of your model. Having more domain knowledge can help reduce noise and irrelevant information that you might have otherwise retained, and can provide insights for feature engineering. Thorough data exploration can reveal bugs and issues to be fixed, and will help you understand the relationships between your features and your target. - - -**The Interpretation** - -Once you have a model, you need to understand what it's telling you. This can be as simple as looking at the coefficients of a linear regression, or as complex as trying to understand the output of a hidden layer in a neural network. Once you get past a linear regression though, you need to *expect* model interpretation to get hard. But whatever model you use, you need to be able to explain what the model is doing, and how you're ultimately coming to your conclusions. This can be difficult, and often requires a lot of work. Even if you've used a model often, it may still be difficult to understand in a new data environment. Model interpretation can take a lot of effort, but it's important to do what's necessary to trust your model results, and help others trust them as well. - - -**What You Do With It** - -Once you have the model and you (think you) understand it, you need to be able to use it effectively. If you've gone to this sort of trouble, you must have had a good reason for undertaking what can be a very difficult task. We use models to make business decisions, inform policy, understand the world around us, and to make our lives better. 
However, using a model effectively means understanding its limitations, as well as the practical, ethical, scientific, and other *consequences* of the decisions you make based on it. It's at this point that the true value of your model is realized. - -In the end, models are a tool to *help* you solve a problem. They do not solve the problem for you, and they do not absolve you of the responsibility of understanding the problem and the consequences of your decisions. - - -## More Models +## More Models {#conc-models-more} When choosing a model, there's a lot at your disposal, and we've only scratched the surface of what's out there. Here are a few more models that you may encounter in your data science journey: **Statistical Models** -In the statistical realm there are many more models that focus on different target distributions and types. For instance, we might use a beta distribution for targets between 0 and 1, ordinal logistic regression for ordinal targets, or survival models for time-to-event outcomes. Some models are field-specific, like two-stage least squares in econometrics. Most of these models are essentially linear models with slight modifications. +In the statistical realm there are many more models that focus on different target distributions and types. For instance, we might use a beta distribution for targets between 0 and 1, ordinal logistic regression for ordinal targets, or survival models for time-to-event outcomes. Some models are field-specific, like two-stage least squares in econometrics. Beyond these, specific implementations will be found for time series (ARIMA, state space models), spatial data (kriging, CAR), and other special target considerations. Most of these models are essentially linear models with slight modifications. Nonlinear models are another realm, which are a bit different from the nonlinear aspects of GLMs, GAMs, or deep learning. These models assume a specific (non-linear) functional form, and can be used to explore relationships that are not well captured by standard linear models. Examples range from something as simple as a polynomial regression or logistic growth model, to more complex biological and epidemiological models. These approaches are not as flexible as Generalized Additive Models (GAMs), or as predictive as neural networks, but they can potentially be useful in the right context. -As a final consideration, there are 'multivariate' techniques like Principal Component Analysis (PCA), factor analysis, and similar which are still pretty widely used. There are also cases where the primary target is multivariate in nature, meaning a standard regression with multiple outcomes. These are more common within some areas like economics and psychology. +In addition, there are 'multivariate' techniques like Principal Component Analysis (PCA), factor analysis, and similar which are still pretty widely used. There are also cases where the primary target is multivariate in nature, meaning a standard regression with multiple outcomes. These are more common within some areas like economics and psychology. **Machine Learning** -In a purely machine learning context, you may find other models beyond those just mentioned in the statistical realm, though, as we have mentioned several times at this point, potentially any model can be used with machine learning. These models prioritize prediction, and would not usually produce standard statistical output like coefficients and uncertainty estimates by default. 
Examples include support vector machines, k-nearest neighbors regression, and other techniques. Most of these traditional 'machine learning models' have fallen out of favor due to their inflexibility with heterogeneous data types, and/or poor performance compared to more modern approaches. However, even then, their spirit may live on in modern approaches. +In a purely machine learning context, you may find other models beyond those just mentioned in the statistical realm. However, as we have mentioned several times at this point, potentially any model can be used with machine learning, including statistical models. The machine learning context prioritizes prediction, and many models used would not usually produce standard statistical output like coefficients and uncertainty estimates by default. Examples include support vector machines, k-nearest neighbors regression, and other techniques. Most of these traditional 'machine learning models' have fallen out of favor due to their inflexibility with heterogeneous data types, and/or poor performance or efficiency compared to more modern approaches. However, even then, their spirit may live on in modern approaches. You'll also find models that focus on ranking, either with an outcome of ranks requiring a specific loss function (e.g., LambdaRank), or where ranking is used to simplify decision-making through post-estimation ranking of predictions (e.g., decile ranking, uplift modeling). In addition, you can find machine learning techniques extended to survival, ordinal, and other situations that are more common in the statistical realm. -Other areas of machine learning, like reinforcement learning, recommender systems, network analysis, and unsupervised learning techniques provide more that can be used in various scenarios. Plenty is left for you to explore here as well! +Other areas of machine learning, like reinforcement learning, recommender systems, network analysis, and unsupervised learning techniques, provide more options that might be useful. Plenty is left for you to explore here as well! **Deep Learning** -When it comes to deep learning, it seems there is a new model every day, and it's hard to keep up. In general, convolutional neural networks are still the go-to for many types of computer vision tasks, while transformers are commonly used for natural language processing, but both have been applied to the other domain with success. For tabular data you'll typically see some variant of Multilayer Perceptrons (MLPs), often with embeddings for categorical features. Some have attempted transformers and CNNs here as well, but results are mixed. +When it comes to deep learning, it seems there is a new model every day, and it's hard to keep up. In general, convolutional neural networks are still the go-to for many types of computer vision tasks, while transformers are commonly used for natural language processing, but both have been applied to the other domain with success. Many 'foundational' models have been developed that allow you to apply pre-trained models to your specific problem, and form the basis of modern AI. For tabular data as we've focused on here, you'll typically see some variant of Multilayer Perceptrons (MLPs), often with embeddings for categorical features. Some have attempted transformers and CNNs here as well, but results are mixed. 
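To give a sense of what 'an MLP with embeddings' amounts to in practice, here is a rough sketch using the [torch]{.pack} package in R. The single categorical feature, embedding dimension, and layer sizes are all arbitrary choices for illustration, not a recommended architecture.

```{r}
#| eval: false
library(torch)

# a sketch of an MLP that embeds one categorical feature (with n_cat levels)
# and concatenates it with n_num numeric features
tabular_mlp = nn_module(
  initialize = function(n_cat, n_num, emb_dim = 8, hidden = 32) {
    self$embed = nn_embedding(n_cat, emb_dim)
    self$fc1   = nn_linear(emb_dim + n_num, hidden)
    self$fc2   = nn_linear(hidden, 1)
  },
  forward = function(x_cat, x_num) {
    # look up embeddings, combine with numeric inputs, pass through the MLP
    x = torch_cat(list(self$embed(x_cat), x_num), dim = 2)
    x = nnf_relu(self$fc1(x))
    self$fc2(x)
  }
)

model = tabular_mlp(n_cat = 10, n_num = 5)
```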
-The deep learning landscape also includes models like deep graphical networks, and deep Q learning for reinforcement learning, specific models for image segmentation (e.g., SAM), recurrent neural networks variants for time-series data, and generative adversarial networks for a variety of tasks. Some specific techniques are falling out of favor as transformer-based architectures are being applied to seemingly everything. But the field is dynamic, and it remains to be seen which methods will prevail in the long run. +The deep learning landscape also includes models like deep graphical networks, and deep Q learning for reinforcement learning, specific models for image segmentation (e.g., SAM), recurrent neural network variants for time-series data, and generative adversarial networks for a variety of tasks. Some specific techniques are falling out of favor as transformer-based architectures are being applied to seemingly everything. But the field is dynamic, and it remains to be seen which methods will prevail in the long run. @@ -146,7 +76,7 @@ You can find a list of some specific models for each of these categories in the ## Families of Models {#conc-models-families} -Though there are many models out there, even if we restrict the discussion to tabular data, we can group them in a fairly simple way that would cover most of the standard problems you'll come across. +While there are many models out there, even if we restrict the discussion to tabular data, we can group them in a fairly simple way that would cover most of the standard problems you'll come across. **GLM and Related: Interpretable Insights** @@ -156,7 +86,7 @@ Here we have standard linear models with a focus on interpretability. Basically - Includes: GLM, survival, ordinal, time-series, other distributions (beta, tweedie) - Best for: small data situations (samples and features), a baseline model, a causal model, post-model analysis of the results from more complex models - Primary strength: ease of estimation, interpretability, uncertainty estimation -- Primary weakness: relatively poor prediction, may not capture natural data complexity without some work +- Primary weakness: relatively poor prediction, may not capture natural data complexity without additional work **Penalized Regression and Friends: Predictive Progress** @@ -188,7 +118,7 @@ In practice, just a handful of techniques from this text can provide a lot of mo - **Penalized Regression**: Lasso, ridge, GAMs, Mixed models and similar keep things linear while increasing predictive power and accommodating more features than their non-penalized counterparts. If you need to focus more on the explanatory and statistical side of things, you can use the standard GLM. - **Boosting/Tree-based Models**: At the time of this writing, boosting methods consistently deliver the best predictive performance for tabular data, and are quite computationally efficient relative to deep learning techniques. That's reason enough to know how to use them and keep them handy. -- **A Basic Deep Learning Model**: A 'simple' deep learning model that incorporates embeddings for categorical and text features is a very powerful tool[^isthisdeep]. In addition, it can be combined with other deep learning models applied to other types of data like images or text for a combined predictive effort. We're still working towards an implementation of deep learning that can handle any tabular data we throw at it, but we're not quite there yet. 
+- **A Basic Deep Learning Model**: A 'simple' deep learning model that incorporates embeddings for categorical and text features is a very powerful tool[^isthisdeep]. Additionally, using a deep learning approach can be integrated with other DL models that process different types of data, such as images or text, to enhance predictive performance. We're still working towards an implementation of deep learning that can handle any tabular data we throw at it, but we're not quite there yet. Besides the models, it's crucial to understand how to evaluate your models (cross-validation, metrics), how to interpret them (coefficients, SHAP, feature importance, uncertainty), and how to manage the data you're working with. While we've discussed many topics in the text, there's always more to learn, and more to practice. @@ -207,7 +137,7 @@ GAMs are a great way to handle nonlinear relationships, interactions, and add pe ## How to Choose? {#conc-models-how2choose} -People love to say that 'all models are wrong, but some are useful'[^box]. We prefer to think of this a bit differently. There is no (necessarily) wrong model to use to answer your question, and there's no guarantee that you would come to a different practical conclusion from using a simple correlation than you would from a complex neural network. But some models can be more useful depending on the context and the question you're asking. +So how should we choose a specific model for our data? People love to say that 'all models are wrong, but some are useful'[^box]. We prefer to think of this a bit differently. There is no (necessarily) wrong model to use to answer your question, and there's no guarantee that you would come to a different *practical* conclusion from using a simple correlation than you would from a complex neural network. But some models can be more useful depending on the context and the question you're asking. [^box]: George Box, a famous statistician, said this in 1976. @@ -221,7 +151,7 @@ And that's the main thing, you don't have to restrict yourself when it comes to ## Choose Your Own Adventure {#conc-models-adventure} -We've covered a lot of ground in this text, and we hope you've learned something new along the way. But there's so much more to learn, and so much more to do. We hope that you'll be able to take what you've learned here and apply it to your own work, and that you'll continue to learn and grow as a data scientist. +We've covered a lot of ground in this text, and we hope you've learned something new along the way. But there's so much more out there for you to continue to explore. We hope that you'll be able to take what you've learned here and apply it to your own work, and that you'll continue to learn and grow as a data scientist. So where do you go from here? The world of data science is vast - choose your own adventure! diff --git a/models.qmd b/models.qmd index e577dbb..81fc7fe 100644 --- a/models.qmd +++ b/models.qmd @@ -13,7 +13,7 @@ At its core, a model is just an **idea**. It's a way of thinking about the world ## What Goes into a Model? What Comes Out? {#sec-lm-in-a-model} -In the context of a model, how we specify the nature of the relationship between various entities depends on the context. In the interest of generality, we'll refer to the **target** as what we want to explain, and **features** as those aspects of the data we will use to explain it. Because people come at data from a variety of contexts, they often use different terminology to mean the same thing. 
The next table shows some of the common terms used to refer to features and targets. Note that they can be mixed and matched, for example, someone might refer to covariates and a response, or inputs and a label. +In the context of a model, how we specify the nature of the relationship between various entities depends on the context. In the interest of generality, we'll refer to the **target** as what we want to explain, and **features** as those aspects of the data we will use to explain it. Because people come at data from a variety of contexts, they often use different terminology to mean the same or similar things. The next table shows some of the common terms used to refer to features and targets. Note that they can be mixed and matched, for example, someone might refer to covariates and a response, or inputs and a label. ```{r tbl-feat-target, cache=FALSE} #| echo: false @@ -193,13 +193,14 @@ Models are expressed through a particular language, math, but don't let that wor ![A generic model](img/eq-model-basic.png){#eq-generic-model} -In words, this equation says we are trying to explain something $y$, as a function $f()$ of other things $X$, but there is typically some aspect we can't explain $u$ that is also at play. This is the basic form of a model used in data science, and it's essentially the same for linear regression, logistic regression, and even random forests and neural networks. +In words, this equation says we are trying to explain something $y$, as a function $f()$ of other things $X$. The output of our model is $f(X)$, but there is typically some aspect we can't explain $u$ that is also at play. This depiction is the basic form of a model used in data science, and it's essentially the same for linear regression, logistic regression, and even random forests and neural networks. + +But in simpler terms, we're just trying to understand everyday things, like how the amount of sleep relates to cognitive functioning, how the weather affects the number of people who visit a park, how much money to spend on advertising to increase sales, how to detect fraud, and so on. Any of these could form the basis of a model, as they stem from scientifically testable ideas, and they all express relationships between things we are interested in, possibly even with an implication of causal relations. -But in everyday terms, we're trying to understand everyday things, like how the amount of sleep relates to cognitive functioning, how the weather affects the number of people who visit a park, how much money to spend on advertising to increase sales, how to detect fraud, and so on. Any of these could form the basis of a model, as they stem from scientifically testable ideas, and they all express relationships between things we are interested in, possibly even with an implication of causal relations. ### Expressing models visually {#sec-models-expressing-visually} -Often it is useful to express models visually, as it can help us understand the relationships more easily. For example, we already showed how to express the relationship between a single feature and target in the previous @fig-corr-plot. A more formal way is with a graphical model, and the following is a generic representation of a linear model. +Often it is useful to express models visually, as it can help us understand the relationships more easily. For example, we already showed how to express the relationship between a single feature and target in @fig-corr-plot. 
A more formal way is with a graphical model, and the following is a generic representation of a **linear model**. ```{r} @@ -217,7 +218,7 @@ g |> ![A linear model](img/graphical-simple_model.svg){#fig-model-graph width=75%} -This makes clear there is an output from the model that is created from the inputs (X). The 'w' values are weights, which can be different for each input, and the output is the combination of these weighted inputs. As we'll see later, we'll want to find a way to create the best correspondence between the outputs of the model and the target, which is the essence of fitting a model. +This makes clear there is an output from the model that is created from the inputs (X). The 'w' values are weights, which can be different for each input, and the output is the combination of these weighted inputs. As we'll see later, we'll want to find a way to create the best correspondence between the outputs of the model and the target, which is the essence of **model fitting**. ### Expressing models in code {#sec-models-expressing-code} @@ -253,14 +254,87 @@ The first part with the `~` is the model formula, which is how math comes into p In practice, models are implemented in a variety of ways, and the previous code is just one way to express a model. For example, the linear model can be expressed in a variety of ways depending on the tool used, such as a simple linear regression, a penalized regression, or a mixed model. When we think of models as a specific implementation, we are thinking of something like `glm` or `lmer` in R, or `LinearRegression` or `XGBoostClassifier` in Python, or the architecture of a deep neural network. In our examples, we use functions where we will specify the formula that expresses the feature target relationships, or will specify the input features and target in some fashion, e.g., as separate data objects called `X` and `y`. Afterwards, or in conjunction with this specification, we will fit the model to the data, which is the process of finding the best way to map the feature inputs to the target. +## Components of Modeling {#sec-models-components} + +It might help to also think about models, or the process of modeling, as having different aspects or parts. We can break our thinking about models into the following components. + +**Task** + +The task can be thought of as the goal of our model, which might be defined as regression, classification, ranking, or next word prediction. It is closely tied to the objective (loss) function, which is a measure of correspondence between the model output and the target we're trying to understand. The objective function provides the model a goal - minimize target-output discrepancy or maximize similarity. As an example, if our target is numeric and our task is 'regression', we can use mean squared error as an objective function, which provides a measure of the prediction-target discrepancy. + +**Model** + +In data science, a model generally refers to a unique (mathematical) implementation we're using to answer our questions. It specifies the **architecture** of the model, and as we will see, this might be a simple linear component, a series of trees, or neural network. In addition, the model specifies the **functional form**, the $f()$ in our equation, that translates inputs to outputs, and the **parameters** required to make that transformation. In code, the model is implemented with functions such as `lm` in R, or in Python, an `XGBoostClassifier` or PyTorch `nn.Model` class. 
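As a minimal sketch of how these pieces show up in code, consider a hypothetical data frame `df` with a numeric target `y` and features `x1` and `x2`. The model below is a linear one, the coefficients are its parameters, and the mean squared error reflects the objective for a regression task.

```{r}
#| eval: false
# hypothetical data: a data frame `df` with numeric target y and features x1, x2
fit = lm(y ~ x1 + x2, data = df)   # the model: a linear architecture

coef(fit)                          # the parameters (weights) that were estimated
mean((df$y - fitted(fit))^2)       # the regression task's objective: mean squared error
```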
+
+**Algorithm**
+
+Various algorithms allow us to estimate the parameters of the model, typically in an iterative fashion, moving from one guess to a hopefully better one. We can think of general approaches, like maximum likelihood, Bayesian estimation, or stochastic gradient descent. Or we can focus on a specific implementation of these, such as penalized likelihood, Hamiltonian Monte Carlo, or backpropagation.
+
+So when we think about models, we start with an idea, but in the end it needs to be expressed in a form that suggests an architecture. That architecture specifies how we take in data and make outputs in the form of predictions, or something that can be transformed to them. With that in place, we need an algorithm to search the parameter space of the model, and a way to evaluate how well the model is doing. While this is enough to produce results, it only gets us the bare minimum.
+
+We will see demonstrations of all of these components throughout the book, and how they work together to produce results. Beyond these components, there are many more things we have to do to prepare the data for modeling, help us interpret those results, understand the model's performance, and get a sense of its limitations.
+

## Some Clarifications {#sec-models-clarify}

-You will sometimes see models referred to as a specific statistic, a particular aspect of the model, or an algorithm. This is often a source of confusion for those early on in their data science journey, because the terms don't really refer to what the model represents. For example, a t-test is a statistical result, not a model in and of itself. Similarly, some refer to 'logit model' or 'probit model', but these are *link functions* used in fitting what is in fact the same model, which we'll cover in detail later. A 'classifier' tells you the *task* of the model, but not what the model is. 'OLS' is an estimation technique used for many types of models, etc., not just a name for a linear regression model. Machine learning can potentially be used to fit *any* model, and not a specific collection of models.
+You will sometimes see models referred to as a specific statistic, a particular aspect of the model, or an algorithm. This is often a source of confusion for those early on in their data science journey, because the terms don't really refer to what the model represents. For example, a t-test is a statistical result, not a model in and of itself. Similarly, some refer to 'logit model' or 'probit model', but these are *link functions* used in fitting what is in fact the same model, which we'll cover in detail later. A 'classifier' tells you the *task* of the model, but not what the model is. 'OLS' is an estimation technique used for many types of models, etc., not just another name for linear regression. Machine learning can potentially be used to fit *any* model, and is not a specific collection of models.

All this is to say that it's good to be clear about the model, and to try to keep it distinguished from specific aspects or implementations of it. Sometimes the nomenclature can't help but get a little fuzzy, and that's okay. Again though, at the core of a model is the idea that specifies the relationship between the features and target.

+## Key Steps in Modeling
+
+When it comes to modeling, there are a few key steps that you should always keep in mind. These are not necessarily exhaustive, but we feel they're a good way to think about how to approach modeling in data science.
+
+**Define the problem**
+
+Start by clearly defining the problem you want to solve. It is often easy to express the problem in very general terms, but more challenging to precisely pin down the problem statement in a way that can actually help you solve it. What are you trying to predict? What data do you have to work with? What are the constraints on your data and model? What are the consequences of the results, whatever they may be? Why do you even care about any of this? These are all questions you should try to answer before diving into modeling.
+
+**Know your data well**
+
+During our time consulting in industry and academia, we've seen many cases where the available data is simply not suited to answer the question at hand[^unsuitabledata]. This leads to wasted time, money, and other resources. You can't possibly answer your question if the data doesn't have the appropriate content to do so.
+
+[^unsuitabledata]: This is a common problem where data is often collected for one purpose and then used for another, as with general purpose surveys or administrative data. Sometimes it can be that the available data is simply not enough to say anything without a lot of uncertainty, as in the case of demographic data regarding minority groups, for which there may be few instances of a particular sample. Deep learning approaches like zero- or few-shot learning aren't applicable here, because there isn't a model pre-trained on millions or billions of similar examples to transfer knowledge from.
+
+In addition, if your data is fraught with issues due to inadequate exploration, cleaning, or transformation, then you're going to have a hard time getting valuable results. It is very common to be dealing with data that has issues that even those who collected it are unaware of, so always be looking out for ways to improve it.
+
+**Have multiple models at your disposal**
+
+Go into a modeling project with a couple models in mind that you think might be useful. This could even be as simple as increasing complexity within a single model approach - you don't have to get too fancy! You should have a few models that you're comfortable with and that you know how to use, and for which you know the strengths and weaknesses. Whenever possible, make time to explore more complex or less familiar approaches that you also think may be suitable to the problem. As we'll demonstrate, model comparison can help you have more confidence in the results of the model that's finally chosen. Just like in a lot of other situations, you don't want to 'put all your eggs in one basket', and you'll always have more to talk about and consider if you have multiple models to work with.
+
+**Communicate your results**
+
+If you don't know the model and underlying data well enough to explain the results to others, you're not going to be able to use them effectively in the first place. Conversely, you also may know the technical side very well, but if you're unable to communicate the results in simpler terms that others can understand, you're going to have a hard time convincing others of the value of your work. Communication is an essential component of the modeling process, and it's something that you should be thinking about from the very beginning.
+
+
+## The Hard Part {#sec-models-hard}
+
+Modeling is just one aspect of the data science process, and the hard part of that process is often not so much the model itself, but everything else that goes into it and what you do with it after.
It can be difficult to come up with the original idea for a model, and even harder to get it to work in practice. + + +**The Data** + +Model performance is largely going to come from the quality of the data and how you've prepared it, from ensuring its integrity to feature engineering. Some models will usually work better than others in certain situations, but there are no guarantees, and often the practical difference in performance is minimal. But you can potentially improve performance by understanding your data better, and by understanding the limitations of your model. Having more domain knowledge can help reduce noise and irrelevant information that you might have otherwise retained, and can provide insights for feature engineering. Thorough data exploration can reveal bugs and issues to be fixed, and will help you understand the relationships between your features and your target. + + +**The Interpretation** + +Once you have a model, you need to understand what it's telling you. This can be as simple as looking at the coefficients of a linear regression, or as complex as trying to understand the output of a hidden layer in a neural network. Once you get past a linear regression though, you need to *expect* model interpretation to get hard. But whatever model you use, you need to be able to explain what the model is doing, and how you're ultimately coming to your conclusions. This can be difficult, and often requires a lot of work. Even if you've used a model often, it may still be difficult to understand in a new data environment. Model interpretation can take a lot of effort, but it's important to do what's necessary to trust your model results, and help others trust them as well. + + +**What You Do With It** + +Once you have the model and you (think you) understand it, you need to be able to use it effectively. If you've gone to this sort of trouble, you must have had a good reason for undertaking what can be a very difficult task. We use models to make business decisions, inform policy, understand the world around us, and to make our lives better. However, using a model effectively means understanding its limitations, as well as the practical, ethical, scientific, and other *consequences* of the decisions you make based on it. It's at this point that the true value of your model is realized. + +In the end, models are a tool to *help* you solve a problem. They do not solve the problem for you, and they do not absolve you of the responsibility of understanding the problem and the consequences of your decisions. + + + ## Getting Ready for More {#sec-models-ready-for-more} -The goal of this book is to help you understand models in a practical way that makes clear what we're trying to understand, but also how models produce those results we're so interested in. We'll be using a variety of models to help you understand the relationships between features and targets, and how to use models to make predictions, and how to interpret the results. We'll also show you how the models are estimated, how to evaluate them, and how to choose the right one for the job. We hope you'll come away with a better understanding of how models work, and how to use them in your own projects. So let's get started! +The goal of this book is to help you understand models in a practical way that makes clear the relationships we're trying to understand with them, and also how models produce those results we're so interested in. 
We'll be using a variety of models to help you understand the relationships between features and targets, and how to use models to make predictions, and how to interpret the results. We'll also show you how the models are estimated, how to evaluate them, and how to choose the right one for the job. We hope you'll come away with a better understanding of how models work, and how to use them in your own projects. So let's get started! From 50f9771414bd7015065655c4fd15ed4938321bf4 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 17:58:01 -0500 Subject: [PATCH 12/19] add exercise --- uncertainty.qmd | 58 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 57 insertions(+), 1 deletion(-) diff --git a/uncertainty.qmd b/uncertainty.qmd index f27987e..c75b019 100644 --- a/uncertainty.qmd +++ b/uncertainty.qmd @@ -1,6 +1,6 @@ # Estimating Uncertainty {#sec-estim-uncertainty} -![](img/chapter_gp_plots/gp_plot_3.svg){width=75%} +![](img/chapter_gp_plots/gp_plot_13.svg){width=75%} ```{r} #| label: setup-estimation @@ -1296,3 +1296,59 @@ Python demos: - Introduction To Conformal Prediction With Python (@molnar_introduction_2024) - [Mapie Docs](https://mapie.readthedocs.io/en/latest/) + + +## Guided Exploration + +We find that simulation is a great way to understand models, and the monte carlo approach to uncertainty definitely puts simulation at the forefront. The next chapter focuses on generalized linear models, so if you're not familiar with logistic regression, head there first. If you are familiar, see if you can apply the monte carlo approach to get predicted probabilities for a logistic regression model. You really only need to change two lines from our previous code. + +:::{.panel-tabset} + +##### R + +```{r} +#| label: mc-logistic-r +#| eval: false + +mc_predictions = function( + model, + nsim = 2500, + seed = 42 +) { + ... + # we aren't dealing with a normal distribution for this + # how should we change this line? + yhat = X %*% t(params) + rnorm(n = nrow(X) * nsim, sd = sigma) + + # how do we get probabilities from this? + ???? = ???? + + # proceed as before + pred_int = apply(y_hat, 1, quantile, probs = c(.025, .975)) + + return(pred_int) +} +``` + +##### Python + +```{python} +#| label: mc-logistic-py +#| eval: false + +def mc_predictions(model, nsim=2500, seed=42): + ... + # we aren't dealing with a normal distribution for this + # how should we change this line? + yhat = X @ params + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) + + # how do we get probabilities from this? + ???? = ???? 
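+    # hint (one option, not the only one): the inverse logit / sigmoid,
+    # e.g., 1 / (1 + np.exp(-yhat)), maps the linear predictor above
+    # onto the probability scale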
+ + # proceed as before + pred_int = np.quantile(yhat, q = [.025, .975], axis = 1) + + return pred_int + +``` +::: \ No newline at end of file From b0cfe265e6ca0ebe386b4e65fde356933679b9d8 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 18:00:44 -0500 Subject: [PATCH 13/19] add unc exercise to notebooks --- .../model_estimation_optimization.ipynb | 1076 ----------------- .../uncertainty.ipynb | 44 +- .../r_chapter_notebooks/uncertainty.qmd | 24 + 3 files changed, 60 insertions(+), 1084 deletions(-) diff --git a/chapter-notebooks/python_chapter_notebooks/model_estimation_optimization.ipynb b/chapter-notebooks/python_chapter_notebooks/model_estimation_optimization.ipynb index 830e9b4..4a5fedd 100644 --- a/chapter-notebooks/python_chapter_notebooks/model_estimation_optimization.ipynb +++ b/chapter-notebooks/python_chapter_notebooks/model_estimation_optimization.ipynb @@ -811,1082 +811,6 @@ "\n", "our_sgd['par'], our_sgd['MSE']" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Uncertainty" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Frequentist" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
meanmean_semean_ci_lowermean_ci_upperobs_ci_lowerobs_ci_upper
03.9876710.1327583.7245214.2508202.7369665.238375
15.4966380.1040655.2903635.7029144.2566536.736624
25.6765200.0874705.5031395.8499014.4415806.911459
35.4065850.1070005.1944925.6186784.1656186.647552
46.9666400.1267566.7153897.2178925.7183858.214896
.....................
1075.8612560.0778975.7068506.0156614.6288377.093674
1085.2903680.1471614.9986695.5820674.0333476.547389
1095.3279980.0836595.1621705.4938254.0940966.561899
1104.3081050.1010394.1078284.5083833.0691035.547107
1114.2673250.0982834.0725114.4621393.0291945.505455
\n", - "

112 rows × 6 columns

\n", - "
" - ], - "text/plain": [ - " mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \\\n", - "0 3.987671 0.132758 3.724521 4.250820 2.736966 \n", - "1 5.496638 0.104065 5.290363 5.702914 4.256653 \n", - "2 5.676520 0.087470 5.503139 5.849901 4.441580 \n", - "3 5.406585 0.107000 5.194492 5.618678 4.165618 \n", - "4 6.966640 0.126756 6.715389 7.217892 5.718385 \n", - ".. ... ... ... ... ... \n", - "107 5.861256 0.077897 5.706850 6.015661 4.628837 \n", - "108 5.290368 0.147161 4.998669 5.582067 4.033347 \n", - "109 5.327998 0.083659 5.162170 5.493825 4.094096 \n", - "110 4.308105 0.101039 4.107828 4.508383 3.069103 \n", - "111 4.267325 0.098283 4.072511 4.462139 3.029194 \n", - "\n", - " obs_ci_upper \n", - "0 5.238375 \n", - "1 6.736624 \n", - "2 6.911459 \n", - "3 6.647552 \n", - "4 8.214896 \n", - ".. ... \n", - "107 7.093674 \n", - "108 6.547389 \n", - "109 6.561899 \n", - "110 5.547107 \n", - "111 5.505455 \n", - "\n", - "[112 rows x 6 columns]" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model = smf.ols(\n", - " 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',\n", - " data = df_happiness\n", - ").fit()\n", - "\n", - "model.conf_int()\n", - "\n", - "model.get_prediction().summary_frame() # both 'confidence' and 'prediction' intervals" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
meanmean_semean_ci_lowermean_ci_upperobs_ci_lowerobs_ci_upper
03.9876710.1327583.7245214.2508202.7369665.238375
15.4966380.1040655.2903635.7029144.2566536.736624
25.6765200.0874705.5031395.8499014.4415806.911459
35.4065850.1070005.1944925.6186784.1656186.647552
46.9666400.1267566.7153897.2178925.7183858.214896
.....................
1075.8612560.0778975.7068506.0156614.6288377.093674
1085.2903680.1471614.9986695.5820674.0333476.547389
1095.3279980.0836595.1621705.4938254.0940966.561899
1104.3081050.1010394.1078284.5083833.0691035.547107
1114.2673250.0982834.0725114.4621393.0291945.505455
\n", - "

112 rows × 6 columns

\n", - "
" - ], - "text/plain": [ - " mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower \\\n", - "0 3.987671 0.132758 3.724521 4.250820 2.736966 \n", - "1 5.496638 0.104065 5.290363 5.702914 4.256653 \n", - "2 5.676520 0.087470 5.503139 5.849901 4.441580 \n", - "3 5.406585 0.107000 5.194492 5.618678 4.165618 \n", - "4 6.966640 0.126756 6.715389 7.217892 5.718385 \n", - ".. ... ... ... ... ... \n", - "107 5.861256 0.077897 5.706850 6.015661 4.628837 \n", - "108 5.290368 0.147161 4.998669 5.582067 4.033347 \n", - "109 5.327998 0.083659 5.162170 5.493825 4.094096 \n", - "110 4.308105 0.101039 4.107828 4.508383 3.069103 \n", - "111 4.267325 0.098283 4.072511 4.462139 3.029194 \n", - "\n", - " obs_ci_upper \n", - "0 5.238375 \n", - "1 6.736624 \n", - "2 6.911459 \n", - "3 6.647552 \n", - "4 8.214896 \n", - ".. ... \n", - "107 7.093674 \n", - "108 6.547389 \n", - "109 6.561899 \n", - "110 5.547107 \n", - "111 5.505455 \n", - "\n", - "[112 rows x 6 columns]" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model.get_prediction().summary_frame()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Monte Carlo" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [], - "source": [ - "# we'll use the model from the previous section\n", - "model = smf.ols(\n", - " 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',\n", - " data = df_happiness\n", - ").fit()\n", - "\n", - "def mc_predictions(model, nsim=2500, seed=42):\n", - " np.random.seed(seed)\n", - "\n", - " params_est = model.params\n", - " params = np.random.multivariate_normal(\n", - " mean = params_est,\n", - " cov = model.cov_params(),\n", - " size = nsim\n", - " )\n", - "\n", - " sigma = model.mse_resid**.5\n", - " X = model.model.exog\n", - "\n", - " y_hat = X @ params.T + np.random.normal(scale = sigma, size = (X.shape[0], nsim))\n", - "\n", - " pred_int = np.quantile(y_hat, q = [.025, .975], axis = 1)\n", - "\n", - " return pred_int\n", - "\n", - "our_mc = mc_predictions(model)" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
observed_valuepredictionsimulated_lowersimulated_upperstatsmodels_lowerstatsmodels_upper
03.6323.9882.7705.1972.7375.238
14.5865.4974.2786.7594.2576.737
26.3885.6774.4516.8894.4426.911
34.3215.4074.2186.6664.1666.648
47.2726.9675.7338.1675.7188.215
.....................
1076.3795.8614.6707.0904.6297.094
1086.0965.2904.0576.5474.0336.547
1094.8065.3284.1266.5304.0946.562
1104.3774.3083.0975.5313.0695.547
1113.6924.2673.0375.5233.0295.505
\n", - "

112 rows × 6 columns

\n", - "
" - ], - "text/plain": [ - " observed_value prediction simulated_lower simulated_upper \\\n", - "0 3.632 3.988 2.770 5.197 \n", - "1 4.586 5.497 4.278 6.759 \n", - "2 6.388 5.677 4.451 6.889 \n", - "3 4.321 5.407 4.218 6.666 \n", - "4 7.272 6.967 5.733 8.167 \n", - ".. ... ... ... ... \n", - "107 6.379 5.861 4.670 7.090 \n", - "108 6.096 5.290 4.057 6.547 \n", - "109 4.806 5.328 4.126 6.530 \n", - "110 4.377 4.308 3.097 5.531 \n", - "111 3.692 4.267 3.037 5.523 \n", - "\n", - " statsmodels_lower statsmodels_upper \n", - "0 2.737 5.238 \n", - "1 4.257 6.737 \n", - "2 4.442 6.911 \n", - "3 4.166 6.648 \n", - "4 5.718 8.215 \n", - ".. ... ... \n", - "107 4.629 7.094 \n", - "108 4.033 6.547 \n", - "109 4.094 6.562 \n", - "110 3.069 5.547 \n", - "111 3.029 5.505 \n", - "\n", - "[112 rows x 6 columns]" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Statsmodels Prediction Intervals\n", - "prediction_intervals = model.get_prediction().summary_frame()\n", - "statsmodels_lower = prediction_intervals['obs_ci_lower']\n", - "statsmodels_upper = prediction_intervals['obs_ci_upper']\n", - "\n", - "\n", - "pd.DataFrame({\n", - " 'observed_value': df_happiness['happiness'],\n", - " 'prediction': model.fittedvalues,\n", - " 'simulated_lower': our_mc[0],\n", - " 'simulated_upper': our_mc[1],\n", - " 'statsmodels_lower': statsmodels_lower,\n", - " 'statsmodels_upper': statsmodels_upper\n", - "}).round(3)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Bootstrap" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [], - "source": [ - "def bootstrap(X, y, nboot=100, seed=123):\n", - " # add a column of 1s for the intercept\n", - " X = np.c_[np.ones(X.shape[0]), X]\n", - " N = X.shape[0]\n", - "\n", - " # initialize\n", - " beta = np.empty((nboot, X.shape[1]))\n", - " \n", - " # beta = pd.DataFrame(beta, columns=['Intercept'] + list(cn))\n", - " mse = np.empty(nboot) \n", - "\n", - " # set seed\n", - " np.random.seed(seed)\n", - "\n", - " for i in range(nboot):\n", - " # sample with replacement\n", - " idx = np.random.randint(0, N, N)\n", - " Xi = X[idx, :]\n", - " yi = y[idx]\n", - "\n", - " # estimate model\n", - " model = LinearRegression(fit_intercept=False)\n", - " mod = model.fit(Xi, yi)\n", - "\n", - " # save results\n", - " beta[i, :] = mod.coef_\n", - " mse[i] = np.sum((mod.predict(Xi) - yi)**2) / N\n", - "\n", - " # given mean estimates, calculate MSE\n", - " y_hat = X @ beta.mean(axis=0)\n", - " final_mse = np.sum((y - y_hat)**2) / N\n", - "\n", - " output = {\n", - " 'par': beta,\n", - " 'mse': mse,\n", - " 'final_mse': final_mse\n", - " }\n", - "\n", - " return output\n", - "\n", - "our_boot = bootstrap(\n", - " X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],\n", - " y = df_happiness['happiness'],\n", - " nboot = 1000\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([ 5.34092479, 0.27665819, -0.29543964, 0.19114177])" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "np.percentile(our_boot['par'], 2.5, axis=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
parammeanlwrupr
0Intercept5.4517035.3409255.572782
1life_exp_sc0.5119170.2766580.754842
2corrupt_sc-0.106482-0.2954400.080125
3gdp_pc_sc0.4598290.1911420.776553
\n", - "
" - ], - "text/plain": [ - " param mean lwr upr\n", - "0 Intercept 5.451703 5.340925 5.572782\n", - "1 life_exp_sc 0.511917 0.276658 0.754842\n", - "2 corrupt_sc -0.106482 -0.295440 0.080125\n", - "3 gdp_pc_sc 0.459829 0.191142 0.776553" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.DataFrame({\n", - " 'param': ['Intercept', 'life_exp_sc', 'corrupt_sc', 'gdp_pc_sc'],\n", - " 'mean': our_boot['par'].mean(axis=0),\n", - " 'lwr': np.percentile(our_boot['par'], 2.5, axis=0),\n", - " 'upr': np.percentile(our_boot['par'], 97.5, axis=0)\n", - "})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Bayesian" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/x6/4jhswqxj0sqf_gkgq6lw6l880000gn/T/ipykernel_51608/1473881323.py:27: DeprecationWarning: `np.math` is a deprecated alias for the standard library `math` module (Deprecated Numpy 1.25). Replace usages of `np.math` with `math`\n", - " p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss\n" - ] - }, - { - "data": { - "text/plain": [ - "0.599999996503221" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from scipy.stats import beta\n", - "\n", - "pk = np.array([\n", - " 'goal','goal','goal','miss','miss',\n", - " 'goal','goal','miss','goal','goal'\n", - "])\n", - "\n", - "# convert to numeric, arbitrarily picking goal=1, miss=0\n", - "N = len(pk) # sample size\n", - "n_goal = np.sum(pk == 'goal') # number of pk made\n", - "n_miss = np.sum(pk == 'miss') # number of those miss\n", - "\n", - "# grid of potential theta values\n", - "theta = np.linspace(1 / (N + 1), N / (N + 1), 10)\n", - "\n", - "### prior distribution\n", - "# beta prior with mean = .5, but fairly diffuse\n", - "# examine the prior\n", - "# theta = beta.rvs(5, 5, size = 1000)\n", - "# plt.hist(theta, bins = 20, color = 'lightblue')\n", - "p_theta = beta.pdf(theta, 5, 5)\n", - "\n", - "# Normalize so that values sum to 1\n", - "p_theta = p_theta / np.sum(p_theta)\n", - "\n", - "# likelihood (binomial)\n", - "p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss\n", - "\n", - "# posterior (combination of prior and likelihood)\n", - "# marginal probability of the data used for normalization\n", - "p_data = np.sum(p_data_given_theta * p_theta) \n", - "\n", - "p_theta_given_data = p_data_given_theta * p_theta / p_data # Bayes theorem\n", - "\n", - "# final estimate\n", - "theta_est = np.sum(theta * p_theta_given_data)\n", - "theta_est" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Conformal" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "def split_conformal(X, y, new_data, alpha = .05, calibration_split = .5):\n", - " # Splitting the data into training and calibration sets\n", - " X_train, X_cal, y_train, y_cal = train_test_split(\n", - " X, \n", - " y,\n", - " test_size = calibration_split,\n", - " random_state = 123\n", - " )\n", - "\n", - " N = X_train.shape[0]\n", - "\n", - " # Train the base model\n", - " model = LinearRegression().fit(X_train, y_train)\n", - "\n", - " # Calculate residuals on calibration set\n", - " cal_preds = model.predict(X_cal)\n", - " residuals = np.abs(y_cal - cal_preds)\n", - "\n", - " # Sort residuals and find the quantile 
corresponding to (1-alpha)\n", - " residuals = np.sort(residuals)\n", - "\n", - " # The correction here is useful for small sample sizes\n", - " quantile = np.quantile(residuals, (1 - alpha) * (N / (N + 1)))\n", - "\n", - " # Make predictions on new data and calculate prediction intervals\n", - " preds = model.predict(new_data)\n", - " lower_bounds = preds - quantile\n", - " upper_bounds = preds + quantile\n", - "\n", - " # Return predictions and prediction intervals\n", - " return {\n", - " 'cp_error': quantile,\n", - " 'preds': preds,\n", - " 'lower_bounds': lower_bounds,\n", - " 'upper_bounds': upper_bounds\n", - " }\n" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1.1358174413022732\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
predslower_boundsupper_bounds
04.6695523.5337345.805369
14.6806753.5448585.816493
26.3251345.1893177.460952
33.4098762.2740584.545693
44.4484333.3126155.584250
\n", - "
" - ], - "text/plain": [ - " preds lower_bounds upper_bounds\n", - "0 4.669552 3.533734 5.805369\n", - "1 4.680675 3.544858 5.816493\n", - "2 6.325134 5.189317 7.460952\n", - "3 3.409876 2.274058 4.545693\n", - "4 4.448433 3.312615 5.584250" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# split data\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']]\n", - "y = df_happiness['happiness']\n", - "\n", - "X_train, X_test, y_train, y_test = train_test_split(\n", - " df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],\n", - " df_happiness['happiness'],\n", - " test_size = 0.5,\n", - " random_state = 1234\n", - ")\n", - "\n", - "our_cp_error = split_conformal(\n", - " X_train,\n", - " y_train,\n", - " X_test,\n", - " alpha = .1\n", - ")\n", - "\n", - "print(our_cp_error['cp_error'])\n", - "\n", - "pd.DataFrame({\n", - " 'preds': our_cp_error['preds'],\n", - " 'lower_bounds': our_cp_error['lower_bounds'],\n", - " 'upper_bounds': our_cp_error['upper_bounds']\n", - "}).head()" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [], - "source": [ - "from mapie.regression import MapieRegressor\n", - "\n", - "model = MapieRegressor(LinearRegression(), method = 'base', random_state=123)\n", - "y_pred, y_pis = model.fit(X_train, y_train).predict(X_test, alpha = 0.1)\n", - "\n", - "# take the first difference between upper and lower bounds,\n", - "# since it's constant for all predictions in this setting\n", - "\n", - "cp_error = (y_pis[0, 1, 0] - y_pis[0, 0, 0]) / 2 " - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[3.93113326, 6.15153942],\n", - " [3.93302836, 6.15343453],\n", - " [5.08887412, 7.30928028],\n", - " [2.7385576 , 4.95896377],\n", - " [3.73161544, 5.95202161]])" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "y_pis[:5].reshape(-1, 2)" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(1.1358174413022732, 1.1102030815534594)" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "our_cp_error['cp_error'], cp_error" - ] } ], "metadata": { diff --git a/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb b/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb index 838ac89..e41b5e9 100644 --- a/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb +++ b/chapter-notebooks/python_chapter_notebooks/uncertainty.ipynb @@ -1146,7 +1146,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "/var/folders/x6/4jhswqxj0sqf_gkgq6lw6l880000gn/T/ipykernel_2286/1473881323.py:27: DeprecationWarning: `np.math` is a deprecated alias for the standard library `math` module (Deprecated Numpy 1.25). Replace usages of `np.math` with `math`\n", + "/var/folders/x6/4jhswqxj0sqf_gkgq6lw6l880000gn/T/ipykernel_17791/1473881323.py:27: DeprecationWarning: `np.math` is a deprecated alias for the standard library `math` module (Deprecated Numpy 1.25). 
Replace usages of `np.math` with `math`\n", " p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss\n" ] }, @@ -1254,7 +1254,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 15, "metadata": {}, "outputs": [ { @@ -1334,7 +1334,7 @@ "4 4.287107 3.250457 5.323756" ] }, - "execution_count": 34, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } @@ -1371,7 +1371,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -1388,7 +1388,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -1401,7 +1401,7 @@ " [3.24132564, 5.39183665]])" ] }, - "execution_count": 36, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -1412,7 +1412,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 18, "metadata": {}, "outputs": [ { @@ -1421,7 +1421,7 @@ "(1.0366496877019815, 1.075255506898503)" ] }, - "execution_count": 37, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } @@ -1430,6 +1430,34 @@ "our_cp_error['cp_error'], cp_error" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Guided Exploration" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# def mc_predictions(model, nsim=2500, seed=42):\n", + "# ...\n", + "# # we aren't dealing with a normal distribution for this\n", + "# # how should we change this line?\n", + "# yhat = X @ params + np.random.normal(scale = sigma, size = (X.shape[0], nsim))\n", + "\n", + "# # how do we get probabilities from this?\n", + "# ???? = ????\n", + "\n", + "# # proceed as before\n", + "# pred_int = np.quantile(yhat, q = [.025, .975], axis = 1)\n", + " \n", + "# return pred_int" + ] + }, { "cell_type": "code", "execution_count": null, diff --git a/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd b/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd index 816a892..a2447c2 100644 --- a/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd +++ b/chapter-notebooks/r_chapter_notebooks/uncertainty.qmd @@ -300,4 +300,28 @@ tibble( upper_bounds = cp_error[['upper_bounds']] ) |> head() +``` + +## Guided Exploration + +```{r} +#| eval: false +mc_predictions = function( + model, + nsim = 2500, + seed = 42 +) { + ... + # we aren't dealing with a normal distribution for this + # how should we change this line? + yhat = X %*% t(params) + rnorm(n = nrow(X) * nsim, sd = sigma) + + # how do we get probabilities from this? + ???? = ???? 
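+    # hint (one option, not the only one): plogis() is the inverse logit /
+    # sigmoid, and maps the linear predictor above onto the probability scale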
+ + # proceed as before + pred_int = apply(y_hat, 1, quantile, probs = c(.025, .975)) + + return(pred_int) +} ``` \ No newline at end of file From 9124be4761ddac00fb58d94deef9e3d64b3640e2 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 18:01:31 -0500 Subject: [PATCH 14/19] misc edits --- functions/utils.r | 4 ++-- index.qmd | 4 ++-- introduction.qmd | 20 ++++++++++---------- machine_learning.qmd | 2 +- 4 files changed, 15 insertions(+), 15 deletions(-) diff --git a/functions/utils.r b/functions/utils.r index 50b897d..654fa9d 100644 --- a/functions/utils.r +++ b/functions/utils.r @@ -1,4 +1,4 @@ -sd <- function(x, na.rm = TRUE) { +sd = function(x, na.rm = TRUE) { sqrt(var(x, na.rm = na.rm)) } @@ -6,7 +6,7 @@ sd <- function(x, na.rm = TRUE) { # mean(x, na.rm = na.rm) # } -word_sign <- function(value, words) { +word_sign = function(value, words) { ifelse(sign(value) == 1, words[1], words[2]) } diff --git a/index.qmd b/index.qmd index eb2e25c..ffc7f59 100644 --- a/index.qmd +++ b/index.qmd @@ -40,7 +40,7 @@ All the data and code used in this book is available on the book's [GitHub repos ## About the Authors {#sec-preface-authors} -**Michael** is a senior machine learning scientist for [Strong Analytics](https://strong.io)[^onesix]. Prior to industry he honed his chops in academia, earning a PhD in Experimental Psychology before turning to data science full-time as a consultant. His models have been used in production across a variety of industries, and can be seen in dozens of publications across several disciplines. He has a passion for helping others learn difficult stuff, and has taught a variety of data science courses and workshops for people of all skill levels in many different contexts. +**Michael** is a senior machine learning scientist for [Strong Analytics](https://strong.io)[^onesix]. Prior to industry he honed his chops in academia, earning a PhD in Experimental Psychology before turning to data science full-time as a consultant. His models have been used in production across a variety of industries, and can be seen in dozens of publications across several academic disciplines. He has a passion for helping others learn difficult stuff, and has taught a variety of data science courses and workshops for people of all skill levels in many different contexts. He also maintains a [blog](https://m-clark.github.io) covering many aspects of statistical and machine learning modeling, and has several posts and long-form documents on a variety of data science topics there. He lives in Ann Arbor Michigan with his wife and his dog, where they all enjoy long walks around the neighborhood. During the course of writing this book, he became a father to Juni, and is now learning the joys of sleep deprivation. @@ -52,7 +52,7 @@ He also maintains a [blog](https://m-clark.github.io) covering many aspects of s **Seth** is the Academic Co-Director of the [Master of Science in Business Analytics (MSBA)](https://mendoza.nd.edu/graduate-programs/business-analytics-msba/) and Associate Teaching Professor at the University of Notre Dame for the IT, Analytics, and Operations Department. He likewise has a PhD in Applied Experimental Psychology and has been teaching and consulting in data science for over a decade. He is an excellent instructor, and teaches several data science courses at the undergraduate and graduate level. -Seth lives in South Bend, Indiana with his wife and three kids, and spends his free time lifting more weights than he should, playing guitar, and chopping wood. 
+Seth lives in the South Bend area of Indiana with his wife and three kids, and spends his free time lifting more weights than he should, playing guitar, and chopping wood. :::{.content-visible when-format='html'} ![Seth](img/seth.png){width=1in} diff --git a/introduction.qmd b/introduction.qmd index 44ad2aa..71517ab 100644 --- a/introduction.qmd +++ b/introduction.qmd @@ -19,26 +19,27 @@ Our approach is first and foremost a practical one - models are just tools to he ### What we hope you take away {#sec-intro-takeaway} + Here are a few things we hope you'll take away from this book: - A sense of the common thread that runs through the modeling landscape, from simple linear models to complex neural networks -- A small set of modeling tools that will be applicable to the vast majority of tabular data problems you'll encounter +- A small set of modeling tools that will nonetheless be applicable to many common data problems you'll encounter - Enough understanding to be able to confidently apply these tools to your own data -While we recommend working through the chapters in order if you're starting out, we hope that this book can serve as a "choose your own adventure" reference. Whether you want a surface-level understanding, a deeper dive, or just want to be able to understand what the analysts in your organization are talking about, we think you will find value in this book. +While we recommend working through the chapters in order if you're starting out, we hope that this book can serve as a "choose your own adventure" reference. Whether you want a surface-level understanding or a deeper dive, we think you will find value in this book. ### What you can expect {#sec-intro-expect} -For each topic that we cover in a chapter, you will generally see the same type of content structure. We start with an overview and provide some key ideas to keep in mind as we go through the chapter. We then demonstrate the model with data, code, results, and visualizations. To further demystify the modeling process, at various points we take time to show *how* a model comes about by estimating them by hand. We'll also provide some concluding thoughts, connections to other techniques and topics, and suggestions on what to explore next. Occasionally we'll also provide some exercises to try on your own. +For each topic that we cover in a chapter, you will generally see the same type of content structure. We start with an overview and provide some key ideas to keep in mind as we go through the chapter. We then demonstrate the model with data, code, results, and visualizations. To further demystify the modeling process, at various points we take time to show *how* a model or some aspect of it comes about by estimating them by hand. We'll also provide some concluding thoughts, connections to other techniques and topics, and suggestions on what to explore next. Occasionally we'll also provide suggestions for things to try on your own. Some topics may get a bit more into the weeds than you want, and that's okay! We hope that you can take away the big ideas and come back to the details when you're ready. Just having an awareness of what's possible is often the first step to understanding how to apply it to your own data. In general though, we'll touch a little bit on a lot of things, but hopefully not in an overwhelming way. ### What you can't expect {#sec-intro-expect-not} -This book will not teach you programming, but you really only need a basic understanding of R or Python. 
We also won't be teaching you basic statistics, so won't be delving into hypothesis testing or the intricacies of statistical theory. The text is more focused on applied modeling, prediction and performance than a normal stats book, and more focused on interpretation and uncertainty in the modeling process than a typical machine learning book. It's not an academic treatment of the topics, so when it comes to references, you'll be more likely to find a nice blog post or youtube video that clearly demonstrates a concept, rather than a dense academic paper. That said you should have a great idea of where to go and what to search to go further for deeper content. +This book will not teach you programming, but you really only need a basic understanding of R or Python. We also won't be teaching you basic statistics, so won't be delving into hypothesis testing or the intricacies of statistical theory. The text is more focused on applied modeling, prediction, and performance than a normal stats book, and more focused on interpretation and uncertainty in the modeling process than a typical machine learning book. It's not an academic treatment of the topics, so when it comes to references, you'll be more likely to find a nice blog post or youtube video that clearly demonstrates a concept, rather than a dense academic paper. That said you should have a great idea of where to go and what to search to go further for deeper content. @@ -46,9 +47,9 @@ This book will not teach you programming, but you really only need a basic under This book is intended for every type of *data dabbler*, no matter what part of the data world you call home. If you consider yourself a data scientist, a machine learning engineer, a business analyst, or a deep learning hobbyist, you already know that the best part of a good dive into data is the modeling. But whatever your data persuasion, models give us the possibility to answer questions, make predictions, and understand what we're interested in a little bit better. And no matter who you are, it isn't always easy to understand *how the models work*. Even when you do get a good grasp of a modeling approach, things can still get complicated, and there are a lot of details to keep track of. In other cases, maybe you just have other things going on in your life and have forgotten a few things. In that case, we find that it's always good to remind yourself of the basics! So if you're just interested in data and hoping to understand it a little better, then it's likely you'll find something useful. -Your humble authors have struggled mightily themselves throughout the course of their data science history, and still do! We often found it difficult to get a good grasp of statistical modeling and machine learning. It took us a lot of effort to learn how to use the tools, how to interpret the results, and possibly the most difficult, how to explain what we're doing to others! We've forgotten a lot, confused ourselves, and made some happy accidents in the process. That's okay! Our goal is to help you avoid some of those pitfalls, help you understand the basics of how models work, and get a sense of how most modeling endeavors have a lot of things in common. + @@ -75,13 +76,12 @@ Whether you enthusiastically pour over formulas and code, or prefer to skip over ::: -You've probably noticed most data science books, blogs, and courses choose R or Python. 
While many individuals often have a strong opinion towards teaching and using one over the other, we eschew dogmatic approaches and language flame wars. R and Python are both great languages for modeling, and both flawed in unique ways. Even if you specialize in one, it's good to have awareness of the other as they are the most popular languages for statistical modeling and machine learning, and both excel in at least some areas the other does not. We use both extensively in our own work for teaching, personal use, and production level code, and have found both are up to whatever task you have in mind.
+You've probably noticed most data science books, blogs, and courses choose R or Python. While many individuals often have a strong opinion towards teaching and using one over the other, we eschew dogmatic approaches and language flame wars. R and Python are both great languages for modeling, and both flawed in unique ways. Even if you specialize in one, it's good to have awareness of the other as they are the most popular languages for statistical modeling and machine learning, and both excel in at least some areas the other does not. We use both extensively in our own work for teaching, personal use, and production level code, and have found both to be useful for whatever task you have in mind.
 
-Throughout this book, we will be presenting demonstrations in both R and Python, and you can use both or take your pick, but we want to leave that choice up to you. Our goal is to use them as a tool to help understand some big model ideas. This book can be a resource for the R user who could use a little help translating their R knowledge to Python; we'd also like it to be a resource for the Python user who sees the value in R's statistical modeling abilities and more. You'll find that our coding style/presentation bends more toward legibility, clarity and consistency, which is not necessarily the same as a standard like PEP8 or the tidyverse style guide[^codestyle]. We hope that you can take the code we provide and make it your own, and that you can use it to help you understand the models we're discussing.
+Throughout this book, we will be presenting demonstrations in both R and Python, and you can use both or take your pick, but we want to leave that choice up to you. Our goal is to use them as a tool to help understand some big model ideas. This book can be a resource for the R user who could use a little help translating their R knowledge to Python. We'd also like it to be a resource for the Python user who sees the value in R's statistical modeling abilities and more. You'll find that our coding style/presentation bends more toward legibility, clarity and consistency, which is not necessarily the same as a standard like PEP8 or the tidyverse style guide[^codestyle]. We hope that you can take the code we provide and make it your own, and that you can use it to help you understand the models we're discussing.
 
 [^codestyle]: The commonly used coding styles for both R and Python aren't actually scientifically derived or tested, and only recently has research been conducted in this area (see @ivanova_comprehension_2020 for an example). The guidelines are generally good, but mostly reflect the preferences of the person(s) who wrote them. Our focus here is not on programming though.
 
 ## Moving Towards an Excellent Adventure {#sec-intro-adventure}
 
-Remember the point we made about "choosing your own adventure"? Modeling and programming in data science is an adventure, even if you never leave your desk!
Every situation calls for choices to be made and every choice you make will lead you down a different path. You will run into errors, dead ends, and you might even find that you've spent considerable time to conclude that nothing interesting is happening in your data. This, no doubt, is actually part of the fun, and all of those struggles will make your ultimate success that much sweeter. Like every adventure, things might not be immediately clear and you might find yourself in perilous situations! If you find that something isn't making sense upon your first read, that is okay. Both authors have spent considerable time mulling over models and foggy ideas during our assorted (mis)adventures - nobody should expect to master complex concepts on a single read through! In any arena where you strive to develop skills, distributed practice and repetition are essential. When concepts get tough, step away from the book, and come back with a fresh mind. We have great faith you will get where you want to go, and we're here to help you along the way! - +Remember the point we made about "choosing your own adventure"? Modeling and programming in data science is an adventure, even if you never leave your desk! Every situation calls for choices to be made, and every choice you make will lead you down a different path. You will run into errors, dead ends, and you might even find that you've spent considerable time to conclude that nothing interesting is happening in your data. This, no doubt, is actually part of the fun, and all of those struggles will make your ultimate success that much sweeter. Like every adventure, things might not be immediately clear, and you might find yourself in perilous situations! If you find that something isn't making sense upon your first read, that's fine! Your humble authors have spent considerable time mulling over models and foggy ideas during our assorted (mis)adventures, and nobody should expect to master complex concepts on a single read through! In any arena where you strive to develop skills, distributed practice and repetition are essential. When concepts get tough, step away from the book, and come back with a fresh mind. We have great faith you will get where you want to go, and we're here to help you along the way! diff --git a/machine_learning.qmd b/machine_learning.qmd index 94b21a3..cc2f82a 100644 --- a/machine_learning.qmd +++ b/machine_learning.qmd @@ -1085,7 +1085,7 @@ from sklearn.linear_model import LogisticRegressionCV X = df_reviews.filter(regex='_sc$') # grab the standardized features y = df_reviews['rating_good'] -# Cs is the (inverse) penalty parameter; +# Cs is the (inverse) penalty parameter model_logistic_l2 = LogisticRegressionCV( penalty='l2', # penalty type Cs=[1], # penalty parameter value From ac6b12099161cf1161f6e7afd8ecff3b300f407a Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 18:14:20 -0500 Subject: [PATCH 15/19] minor edit to readme --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 1ed9494..2f75853 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # book-of-models -This is the repository for a book tentatively titled: +This is the repository for a book ~~tentatively~~ titled: - Models Demystified - ~~- Models by Example~~ @@ -14,6 +14,6 @@ We've seen many people struggle with understanding how models work, just as we'v -This should be out on CRC press in 2024. We welcome any feedback in the meantime as it develops, so please feel free to create an issue. 
For contributions, please see the [contributing](CONTRIBUTING.md) page for more information if available, other. +This should be out on CRC press in 2024 or early 2025. We welcome any feedback in the meantime as it develops, so please feel free to create an issue. For contributions, please see the [contributing](CONTRIBUTING.md) page for more information if available, other. [Web Version of the book](https://m-clark.github.io/book-of-models/) From 45143eb13beac69e4013437825f0d2cd7d520b1a Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 19:20:51 -0500 Subject: [PATCH 16/19] chap title change --- ml_more.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ml_more.qmd b/ml_more.qmd index 93d1da0..8644ca3 100644 --- a/ml_more.qmd +++ b/ml_more.qmd @@ -1,4 +1,4 @@ -# More Machine Learning {#sec-ml-more} +# Extending Machine Learning {#sec-ml-more} ![](img/chapter_gp_plots/gp_plot_8.svg){width=75%} From e84fc2f878f0da04e5d2e941c7a5a8331a8538c7 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 19:35:00 -0500 Subject: [PATCH 17/19] fix crossrefs --- introduction.qmd | 2 +- linear_model_extensions.qmd | 2 +- linear_models.qmd | 2 +- uncertainty.qmd | 18 +++++++++--------- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/introduction.qmd b/introduction.qmd index 71517ab..531bac4 100644 --- a/introduction.qmd +++ b/introduction.qmd @@ -53,7 +53,7 @@ Whether you enthusiastically pour over formulas and code, or prefer to skip over -[^formulas]: We actually never had this ability. + diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index ecb3c39..c323797 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -868,7 +868,7 @@ For a model with just one feature, we certainly had a lot to talk about! And thi - Because of the way they are estimated, mixed models can account for lack of independence of observations[^indieobs], which is a common issue in many datasets. This is especially important when you have repeated measures, or when you have a hierarchical structure in your data, such as students within schools, or patients within hospitals. - Standard modeling approaches can estimate these difficult models very efficiently, even with thousands of groups and millions of observations. - The group effects are like a very simplified **embedding** (@sec-data-cat), where we have taken a categorical feature and turned it into a numeric one, like those shown in @fig-random-effects. This may help you understand other embedding techniques that are used in other places like deep learning. -- When you start to think about random effects and/or distributions for effects, you're already thinking like a Bayesian (@sec-estim-bayes), who is always thinking about the distributions for various effects. Mixed models are the perfect segue from standard linear model estimation to Bayesian estimation, where everything is random. +- When you start to think about random effects and/or distributions for effects, you're already thinking like a Bayesian (@sec-unc-bayes), who is always thinking about the distributions for various effects. Mixed models are the perfect segue from standard linear model estimation to Bayesian estimation, where everything is random. - The random effect is akin to a latent variable of 'unspecified group causes'. This is a very powerful idea that can be used in many different ways, but importantly, you might want to start thinking about how you can figure out what those 'unspecified' causes may be! 
- Group effects will almost always improve your model's predictive performance relative to not having them, especially if you weren't including those groups in your model because of how many groups there were. diff --git a/linear_models.qmd b/linear_models.qmd index 5f93ca6..8918369 100644 --- a/linear_models.qmd +++ b/linear_models.qmd @@ -705,7 +705,7 @@ ggsave('img/lm-prediction-intervals-plot.svg', width = 8, height = 6) ![Prediction & Confidence Intervals Compared](img/lm-prediction-intervals-plot.svg){width=100% #fig-prediction-intervals} -Once you move past simpler linear models and generalized linear models, obtaining uncertainty estimates for predictions is difficult, and tools to do so can be scarce. This is especially the case for models used in machine learning contexts, and relatively rare for deep learning approaches. In practice, you can use bootstrapping (@sec-estim-bootstrap) to get a sense of the uncertainty, but this is often not a good estimate in many data scenarios and can be computationally expensive. Bayesian approaches (@sec-estim-bayes) can also provide estimates of uncertainty, but likewise are computationally expensive, and require a good deal of expertise to implement for more complex settings. Quantile regression (@sec-lm-extend-quantile) can sometimes be appropriate to estimate predictions at different quantiles that can serve as a proxy for prediction intervals, but tools to do so for various models are uncommon. On the other hand, **conformal prediction** tools are becoming more popular, and can provide a more reliable estimate of prediction uncertainty for any type of model. Yet they too are computationally expensive for more accurate estimates, and good tools are only recently becoming available[^confpred]. +Once you move past simpler linear models and generalized linear models, obtaining uncertainty estimates for predictions is difficult, and tools to do so can be scarce. This is especially the case for models used in machine learning contexts, and relatively rare for deep learning approaches. In practice, you can use bootstrapping (@sec-unc-bootstrap) to get a sense of the uncertainty, but this is often not a good estimate in many data scenarios and can be computationally expensive. Bayesian approaches (@sec-unc-bayes) can also provide estimates of uncertainty, but likewise are computationally expensive, and require a good deal of expertise to implement for more complex settings. Quantile regression (@sec-lm-extend-quantile) can sometimes be appropriate to estimate predictions at different quantiles that can serve as a proxy for prediction intervals, but tools to do so for various models are uncommon. On the other hand, **conformal prediction** tools are becoming more popular, and can provide a more reliable estimate of prediction uncertainty for any type of model. Yet they too are computationally expensive for more accurate estimates, and good tools are only recently becoming available[^confpred]. [^confpred]: For a good intro to conformal prediction, see @angelopoulos_gentle_2022. The [mapie]{.pack} is a good tool for Python, and the [tidymodels family](https://www.tidymodels.org/learn/models/conformal-regression/) has recently added this functionality via the [probably]{.pack} package. Michael has a mostly complete blog post on conformal prediction that got lost in the shuffle, but will be available at some point as well. 
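+
+To give a rough sense of the conformal idea covered later (@sec-unc-conformal), here is a minimal split conformal sketch. It is only for illustration - the toy data, the model, and the 90% coverage level are arbitrary choices on our part, and the packages shown are just one way to go about it.
+
+```{python}
+#| eval: false
+import numpy as np
+from sklearn.linear_model import LinearRegression
+from sklearn.model_selection import train_test_split
+
+# toy data purely for illustration
+rng = np.random.default_rng(123)
+X = rng.normal(size = (500, 3))
+y = X @ np.array([.5, -.25, .1]) + rng.normal(scale = .5, size = 500)
+
+# fit on one half of the data, calibrate on the other half
+X_train, X_cal, y_train, y_cal = train_test_split(
+    X, y, test_size = .5, random_state = 123
+)
+
+model = LinearRegression().fit(X_train, y_train)
+
+# absolute prediction errors on the calibration set
+calib_error = np.abs(y_cal - model.predict(X_cal))
+
+# the error quantile that gives roughly 90% coverage
+q = np.quantile(calib_error, .9)
+
+# prediction interval for new observations: prediction +/- q
+preds = model.predict(X_cal[:5])
+lower, upper = preds - q, preds + q
+```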
diff --git a/uncertainty.qmd b/uncertainty.qmd index c75b019..4d91e4b 100644 --- a/uncertainty.qmd +++ b/uncertainty.qmd @@ -1,4 +1,4 @@ -# Estimating Uncertainty {#sec-estim-uncertainty} +# Estimating Uncertainty {#sec-unc-uncertainty} ![](img/chapter_gp_plots/gp_plot_13.svg){width=75%} @@ -85,7 +85,7 @@ from scipy import stats ``` -## Standard Frequentist {#sec-estim-frequentist} +## Standard Frequentist {#sec-unc-frequentist} We talked a bit about the frequentist approach in our discussion of confidence intervals (@sec-lm-interpretation-feature). There we described the process using the interval to *capture* the 'true' parameter value a certain percentage of the time. The key assumption is that the true parameter is fixed, and the interval is a random variable that will contain the true value with some percentage frequency. With this approach, if you were to repeat the experiment, i.e., data collection and analysis, many times, each interval would be slightly different. Although they would be different, any one of the intervals is as good or valid as the others. You also know that a certain percentage of them will contain the true value, and a (usually small) percentage will not. You will never know if a specific interval does actually capture the true value, because we don't know the true value in practice. @@ -360,7 +360,7 @@ tibble( Monte Carlo simulation is a very popular approach in modeling, and a variant of it, markov chain monte carlo (MCMC), is the basis for Bayesian estimation, which we'll also talk about in more detail later. -## Bootstrap {#sec-estim-bootstrap} +## Bootstrap {#sec-unc-bootstrap} An extremely common method for estimating uncertainty is the **bootstrap**. The bootstrap is a method where we create new datasets by randomly sampling the original data with replacement. This means that each new dataset is the same size as the original, but some observations may be selected multiple times while others may not be selected at all. We then estimate our model with each data set, and each time, we can collect parameter estimates, predictions, or any other calculations we are interested in. Ultimately, we end up with a *distribution* of all the things we calculated. The nice thing about this is that we don't need to know the specific distribution (e.g., normal, or t-distribution) of the values we want to get uncertainty estimates for, we can just use the data we have to produce that distribution. And this is a key distinction from the Monte Carlo method just discussed. @@ -573,7 +573,7 @@ ggsave('img/estim-bootstrap.svg', width = 8, height = 6) As mentioned, the bootstrap is often used to provide uncertainty for unmodeled parameters, predictions, and other metrics. However, because we repeatedly run the model or some aspect of it over and over, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g., extreme values) or [in some data contexts](https://stats.stackexchange.com/questions/9664/what-are-examples-where-a-naive-bootstrap-fails) (e.g., correlated observations) without extra considerations. Variants exist to help deal with some of these issues, and despite limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model. 
-## Bayesian {#sec-estim-bayes}
+## Bayesian {#sec-unc-bayes}

The **Bayesian** approach to modeling is many things - a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about using Bayesian estimation is that it can handle model complexities that other approaches don't do well or at all.

@@ -988,7 +988,7 @@ The Bayesian approach is very flexible, and can be used for many different types

[^rpybayes]: R has excellent tools here for modeling and post-processing, like [brms]{.pack} and [tidybayes]{.pack}, and Python has [pymc3]{.pack}, [numpyro]{.pack}, and [arviz]{.pack}, which are also useful. Honestly, R has way more going on here, with many packages devoted to Bayesian estimation of specific models, but if you want to stick with Python, it's gotten a lot better recently.

-## Conformal Methods {#sec-estim-conformal}
+## Conformal Methods {#sec-unc-conformal}

Conformal approaches bring us back to the frequentist world, and specifically concern prediction uncertainty. One of the primary strengths of the approach is that it is model agnostic and theoretically can work for any model, from linear regression to deep learning. Like the bootstrap and Bayesian methods, conformal prediction makes us think in terms of distributions of possible values, but it focuses on residuals or errors in prediction.

@@ -1231,7 +1231,7 @@ Conformal prediction still relies on the assumptions about the data and the unde

-## Wrapping Up {#sec-estim-wrap}
+## Wrapping Up {#sec-unc-wrap}

Understanding uncertainty is key to understanding the quality of your model. It's not just about the point estimate, or getting a prediction, but also about how confident you are in that value. We've covered several avenues, from the basics of estimation to the more complex Bayesian and conformal methods. If the model provides a standard statistical solution, take it. Otherwise, the bootstrap is easy to understand and implement. Bayesian methods are more complex, but can provide more information about the uncertainty in your model. Conformal prediction is a good choice when you want to make predictions without making strong assumptions about the underlying model, and may be the best option for many contexts.

@@ -1240,19 +1240,19 @@ Understanding uncertainty is key to understanding the quality of your model. It'

We hope you now have a better understanding of how to estimate uncertainty in your models, and how to use that information to make better decisions.

-### The common thread {#sec-estim-thread}
+### The common thread {#sec-unc-thread}

No model is without uncertainty, so any of these techniques may be applicable to your work. The choice of method depends largely on how you want to tackle the issue.

-### Choose your own adventure {#sec-estim-choose}
+### Choose your own adventure {#sec-unc-choose}

This chapter colors all others that focus on specific modeling techniques. You can think about how you might implement uncertainty estimation for any of them.
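Because the conformal discussion above emphasizes working from prediction errors, a rough split-conformal sketch may help: fit on one slice of the data, treat absolute residuals on a held-out calibration slice as conformity scores, and add the appropriate score quantile around new predictions. Everything below (data, split sizes, coverage level) is an assumption for illustration; in practice you would likely reach for tools like the [probably]{.pack} or [mapie]{.pack} packages mentioned elsewhere.

```r
set.seed(42)

# simulated data; any model with a predict method would work the same way
n = 500
d = data.frame(x = rnorm(n))
d$y = 1 + .5 * d$x + rnorm(n)

train = d[1:300, ]
calib = d[301:450, ]
test  = d[451:500, ]

fit = lm(y ~ x, data = train)

# conformity scores: absolute residuals on the held-out calibration set
scores = abs(calib$y - predict(fit, calib))

# score quantile for roughly 90% coverage (with a finite-sample correction)
alpha = .10
q_hat = quantile(
    scores,
    probs = ceiling((nrow(calib) + 1) * (1 - alpha)) / nrow(calib)
)

# prediction intervals: point prediction plus/minus the score quantile
preds = predict(fit, test)
lower = preds - q_hat
upper = preds + q_hat

# empirical coverage on the test set, which should land near 90%
mean(test$y >= lower & test$y <= upper)
```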
-### Additional resources {#sec-estim-add-res} +### Additional resources {#sec-unc-add-res} **Frequentist Approaches**: From 315b25b7d390d2732779e4c67684b9278a6b8f57 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 20:09:20 -0500 Subject: [PATCH 18/19] fix tab header issue --- linear_model_extensions.qmd | 5 +---- machine_learning.qmd | 31 +++++++------------------------ 2 files changed, 8 insertions(+), 28 deletions(-) diff --git a/linear_model_extensions.qmd b/linear_model_extensions.qmd index c323797..2c26127 100644 --- a/linear_model_extensions.qmd +++ b/linear_model_extensions.qmd @@ -266,15 +266,12 @@ me_children |> as_tibble() |> select(-s.value) |> gt() |> - tab_header( - title = "", - # subtitle = "Average coefficient for children in home averaged over genre" - ) |> fmt_number( columns = c(estimate), decimals = 3 ) ``` + \normalsize diff --git a/machine_learning.qmd b/machine_learning.qmd index cc2f82a..d15c494 100644 --- a/machine_learning.qmd +++ b/machine_learning.qmd @@ -78,13 +78,11 @@ objective_functions = tibble( objective_functions |> group_by(Task) |> - gt() |> - tab_header( - title = '', - subtitle = '' - ) + gt() ``` +\normalsize + \elandscape {{< pagebreak >}} @@ -218,7 +216,6 @@ We put them all together in the following table. Now we know how to get them, an #| echo: false #| label: tbl-r_metrics_show #| tbl-cap: Example Metrics for Linear and Logistic Regression Models -#| tibble( Model = rep(c('Linear Regression', 'Logistic Regression'), each = 3), @@ -226,11 +223,7 @@ tibble( Value = c(rmse_val, mae_val, r2_val, accuracy, precision, recall) ) |> group_by(Model) |> - gt() |> - tab_header( - title = '', - subtitle = '' - ) + gt() ``` @@ -343,11 +336,7 @@ tibble( prediction = c('Train', 'Test'), rmse = c(py$rmse_train, py$rmse_test) ) |> - gt(decimals = 3) |> - tab_header( - title = '', - subtitle = '' -) + gt(decimals = 3) ``` So there you have it, you just did some machine learning! And now we have a model that we can use to predict with any new data that comes along with ease. But as we'll soon see, there are limitations to doing things this simply. But conceptually this is an important idea, and one we will continue to return to. @@ -741,7 +730,7 @@ ggsave('img/ml-core-double_descent.svg', p_dd_main, width = 8, height = 6) ![Double Descent on the classic mtcars dataset](img/ml-core-double_descent.svg){#fig-double-descent} -On the left part of the visualization, we see that the test error dips as we get a better model. Our best test error is noted by the large gray dot. Eventually though, the test error rises as expected, even as training error gets better. Test error eventually hits a peak when the number of parameters equals the number of training observations. But then we keep going, and the test error starts to decrease again! By the end we have essentially perfect training prediction, and our test error is as good as it was with the simpler models. This is the double descent phenomenon with one of the simplest datasets around. Cool! +On the left part of the visualization, we see that the test error dips as we get a better model. Our best test error is noted by the large dot on the left. Eventually though, the test error rises as expected, even as training error gets better. Test error eventually hits a peak when the number of parameters equals the number of training observations. But then we keep going, and the test error starts to decrease again! 
By the end we have essentially perfect training prediction, and our test error is as good as it was with the simpler models (large dot on the right). This is the double descent phenomenon with one of the simplest datasets around. Cool! @@ -866,9 +855,6 @@ init |> ) |> select(Data, Over, Under, Better) |> gt() |> - tab_header( - title = '', - ) |> tab_style( style = list( # cell_fill(color = '#F9E3D6'), # will break latex/pdf @@ -1356,10 +1342,7 @@ acc_test = model_logistic_grid$predict(task, row_ids=splits$test)$score(measure # acc_train = acc_train, # acc_test = acc_test # ) |> -# gt() |> -# tab_header( -# title = '' -# ) +# gt() glue::glue( 'Best lambda: {best_param$lambda} From 5cfcc8047562a6b0b476877a59c3ff58d8c117f6 Mon Sep 17 00:00:00 2001 From: micl Date: Sun, 1 Dec 2024 20:51:01 -0500 Subject: [PATCH 19/19] fix some table header and margin issues --- estimation.qmd | 5 ++++- machine_learning.qmd | 13 ++++++++++--- ml_common_models.qmd | 11 +++++++---- uncertainty.qmd | 25 ++++++++++++++++++------- understanding_features.qmd | 11 ++++++++--- understanding_models.qmd | 4 ---- 6 files changed, 47 insertions(+), 22 deletions(-) diff --git a/estimation.qmd b/estimation.qmd index 99161fa..f2fcd31 100644 --- a/estimation.qmd +++ b/estimation.qmd @@ -555,7 +555,10 @@ c(coef(model_lr_happy_life), performance::performance_mse(model_lr_happy_life)) #| label: py-ols-lr #| results: hide -model_lr_happy_life = sm.OLS(df_happiness['happiness'], sm.add_constant(df_happiness['life_exp_sc'])).fit() +model_lr_happy_life = sm.OLS( + df_happiness['happiness'], + sm.add_constant(df_happiness['life_exp_sc']) +).fit() # not shown model_lr_happy_life.params, model_lr_happy_life.scale diff --git a/machine_learning.qmd b/machine_learning.qmd index d15c494..384dd96 100644 --- a/machine_learning.qmd +++ b/machine_learning.qmd @@ -1292,7 +1292,12 @@ X = df_reviews |> as.data.table() # Define task -task = TaskClassif$new('movie_reviews', X, target = 'rating_good', positive = 'good') +task = TaskClassif$new( + 'movie_reviews', + X, + target = 'rating_good', + positive = 'good' +) # split the dataset into training and test sets splits = partition(task, ratio = 0.75) @@ -1328,8 +1333,10 @@ model_logistic_grid$train(task, row_ids = splits$train) best_param = model_logistic_grid$model$learner$param_set$values # Use the best model to predict and get metrics -acc_train = model_logistic_grid$predict(task, row_ids=splits$train)$score(measure) -acc_test = model_logistic_grid$predict(task, row_ids=splits$test)$score(measure) +pred_train = model_logistic_grid$predict(task, row_ids = splits$train) +pred_test = model_logistic_grid$predict(task, row_ids = splits$test) +acc_train = pred_train$score(measure) +acc_test = pred_test$score(measure) ``` ```{r} diff --git a/ml_common_models.qmd b/ml_common_models.qmd index 394f196..c57342f 100644 --- a/ml_common_models.qmd +++ b/ml_common_models.qmd @@ -119,7 +119,9 @@ from lightgbm import LGBMClassifier from sklearn.neural_network import MLPClassifier # Metrics and more -from sklearn.model_selection import cross_validate, RandomizedSearchCV, train_test_split +from sklearn.model_selection import ( + cross_validate, RandomizedSearchCV, train_test_split +) from sklearn.metrics import accuracy_score from sklearn.inspection import PartialDependenceDisplay ``` @@ -147,10 +149,10 @@ majority = np.max([prevalence, 1 - prevalence]) #| label: setup-data-r library(tidyverse) -df_heart = read_csv("https://tinyurl.com/heartdiseaseprocessed") |> +df_heart = 
read_csv('https://tinyurl.com/heartdiseaseprocessed') |> mutate(across(where(is.character), as.factor)) -df_heart_num = read_csv("https://tinyurl.com/heartdiseaseprocessednumeric") +df_heart_num = read_csv('https://tinyurl.com/heartdiseaseprocessednumeric') # for use with for mlr3 X = df_heart_num |> @@ -1050,7 +1052,8 @@ model_boost_cv_tune = auto_tuner( ) model_boost_cv_tune$train(tsk_model_boost_cv_tune, row_ids = split$train) -model_boost_cv_tune$predict(tsk_model_boost_cv_tune, row_ids = split$test)$score(msr("classif.acc")) +test_preds = model_boost_cv_tune$predict(tsk_model_boost_cv_tune, row_ids = split$test) +test_preds$score(msr("classif.acc")) ``` ```{r} diff --git a/uncertainty.qmd b/uncertainty.qmd index 4d91e4b..a639168 100644 --- a/uncertainty.qmd +++ b/uncertainty.qmd @@ -102,7 +102,10 @@ Here is an example using our previous model to get interval estimates for predic #| eval: true #| results: hide -model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) +model = lm( + happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, + data = df_happiness +) confint(model) @@ -123,7 +126,8 @@ model = smf.ols( model.conf_int() -model.get_prediction().summary_frame() # both 'confidence' and 'prediction' intervals +# both 'confidence' and 'prediction' intervals +model.get_prediction().summary_frame() ``` ::: @@ -268,7 +272,10 @@ The result is a distribution of the value of interest, be it a parameter, a pred #| label: r-monte-carlo # we'll use the model from the previous section -model = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness) +model = lm( + happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, + data = df_happiness +) # number of simulations mc_predictions = function( @@ -323,7 +330,8 @@ def mc_predictions(model, nsim=2500, seed=42): sigma = model.mse_resid**.5 X = model.model.exog - y_hat = X @ params.T + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) + y_hat = X @ params.T + \ + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) pred_int = np.quantile(y_hat, q = [.025, .975], axis = 1) @@ -485,7 +493,7 @@ our_boot = bootstrap( Here are the results of the interval estimates for the coefficients. Each parameter has the mean estimate, the lower and upper bounds of the 95% confidence interval, and the width of the interval. The bootstrap intervals are a bit wider than the OLS intervals, but for this model these should converge as the number of observations increases. -\small +\footnotesize ```{r} #| echo: false #| label: tbl-bootstrap @@ -669,7 +677,9 @@ p_theta = beta.pdf(theta, 5, 5) p_theta = p_theta / np.sum(p_theta) # likelihood (binomial) -p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss +p_data_given_theta = np.math.comb(N, n_goal) * theta**n_goal * \ + (1 - theta)**n_miss + # posterior (combination of prior and likelihood) # p_data is the marginal probability of the data used for normalization @@ -1340,7 +1350,8 @@ def mc_predictions(model, nsim=2500, seed=42): ... # we aren't dealing with a normal distribution for this # how should we change this line? - yhat = X @ params + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) + yhat = X @ params + \ + np.random.normal(scale = sigma, size = (X.shape[0], nsim)) # how do we get probabilities from this? ???? = ???? 
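The hunk above reflows a grid-approximation posterior written in Python, with a beta prior and a binomial likelihood. For readers following along in R, here is a hedged translation of the same calculation; the grid and the counts `N` and `n_goal` are placeholder assumptions, since the chapter's actual values aren't visible in this diff.

```r
# grid of candidate probabilities
theta = seq(1e-10, 1 - 1e-10, length.out = 100)

# assumed data: 50 attempts with 30 'goals' (placeholder counts)
N = 50
n_goal = 30
n_miss = N - n_goal

# prior: Beta(5, 5), normalized over the grid
p_theta = dbeta(theta, 5, 5)
p_theta = p_theta / sum(p_theta)

# likelihood: binomial probability of the observed data at each theta
p_data_given_theta = choose(N, n_goal) * theta^n_goal * (1 - theta)^n_miss

# posterior: prior times likelihood, normalized by the marginal probability of the data
p_data = sum(p_data_given_theta * p_theta)
p_theta_given_data = p_data_given_theta * p_theta / p_data

# posterior mean of theta over the grid
sum(theta * p_theta_given_data)
```

The posterior here is just the normalized product of prior and likelihood over the grid, mirroring the `p_data` normalization described in the Python comments.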
diff --git a/understanding_features.qmd b/understanding_features.qmd index 996c5c7..bd6ac84 100644 --- a/understanding_features.qmd +++ b/understanding_features.qmd @@ -633,10 +633,12 @@ pd.DataFrame({ ::: +Here are the results in a clean table. + ```{r} #| echo: false #| label: tbl-counterfactual-happiness -#| tbl-cap: Predictions for Happiness Score for Russia and the US with Switched Freedom and GDP +#| tbl-cap: Counterfactual Predictions for Happiness Score tab_democracy = tibble( country = c("Russia", "United States"), base_predictions, @@ -842,7 +844,10 @@ shap_value_package = DALEX::predict_parts( rbind( shap_value_ours, - shap_value_package[c('age', 'release_year', 'length_minutes'), 'contribution'] + shap_value_package[ + c('age', 'release_year', 'length_minutes'), + 'contribution' + ] ) ``` @@ -894,13 +899,13 @@ pd.concat([shap_value_ours, shap_value_package]) ::: +The following table reveals that the results are identical. ```{r} #| echo: false #| label: tbl-shap-values-comparison #| tbl-cap: SHAP Value Comparison -# paste(x$label, x$variable, sep = ": "), mean, na.rm = TRUE) rbind( ours = shap_value_ours, dalex = shap_value_package[c('age', 'release_year', 'length_minutes'), 'contribution'] diff --git a/understanding_models.qmd b/understanding_models.qmd index e074104..fa7c1db 100644 --- a/understanding_models.qmd +++ b/understanding_models.qmd @@ -145,10 +145,6 @@ performance_metrics = tribble( performance_metrics |> group_by(Problem_Type) |> gt() |> - tab_header( - title = "", - subtitle = "" - ) |> tab_footnote( footnote = "Beta = 1 for F1", locations = cells_body(columns = vars(`Other Names/Notes`), rows = Metric == 'F1'),