
Commit

model intro reorg + misc (#90)
* glm updates, related lm edit, fix chap ref in danger zone

* Merge pull request #89 from m-clark/reorg-intro-lm-model

split models chap

* misc app and setup edits
m-clark authored Aug 18, 2024
2 parents a60fe8d + 123eab8 commit c427d94
Showing 23 changed files with 1,942 additions and 1,694 deletions.
39 changes: 0 additions & 39 deletions .Rprofile
@@ -229,46 +229,7 @@ ggsave = function(filename, width = 8, height = 6, ...) {
)
}

skimmer = function() {
skimr::skim_with(
# reordering/naming/slimming numeric output
numeric = skimr::sfl(
mean = ~ mean(., na.rm = TRUE),
sd = ~ sd(., na.rm = TRUE),
min = ~ min(., na.rm = TRUE),
med = ~ median(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE),
iqr = NULL,
hist = NULL,
p0 = NULL, # renamed
p25 = NULL,
p50 = NULL, # renamed
p75 = NULL,
p100 = NULL # renamed
),

character = skimr::sfl(
empty = \(x) skimr::n_empty(x) + skimr::n_whitespace(x), # replace default which is only n_empty
whitespace = NULL,
min = NULL, # these refer to nchar which I doubt anyone would know
max = NULL,
),
append = TRUE
)
}

summarize_data = function(data, types = 'all') {
init = skimmer()(data)
summ = skimr::partition(init)

if (!all(types == 'all')) {
summ = summ[tolower(names(summ)) %in% tolower(types)]
}

summ = purrr::map(summ, tibble::tibble)

return(summ)
}

options(digits = 4) # number of digits of precision for floating point output

5 changes: 4 additions & 1 deletion _quarto.yml
@@ -18,6 +18,7 @@ book:
chapters:
- index.qmd
- introduction.qmd
- models.qmd
- part: "Linear Models & More"
chapters:
- linear_models.qmd
@@ -39,10 +40,12 @@ book:
appendices:
- part: "Additional Topics"
chapters:
- dataset_descriptions.qmd # this and ref to separate section
- matrix_operations.qmd
- pyr.qmd
- more_models.qmd
- part: "References & Resources"
chapters:
- dataset_descriptions.qmd # this and ref to separate section
- references.qmd # this and ref to separate section maybe 'resources'
# - appendix_placeholder.qmd
search: true
20 changes: 10 additions & 10 deletions conclusion.qmd
@@ -11,9 +11,9 @@ As we wrap things up, let's revisit some of the key points we've covered in this

<!-- TODO: some of this might be better as an intro -->

## How to Think About Models {#conc-models-think}
## How to Think About Models (revisited) {#conc-models-think}

When we first started our discussion of models in data science, we talked about how a model is a simplified representation of reality. They start as ideas based on our intuition or experience, and they can sometimes be very simple ones. But at some point we start to think of them more formally, as a step towards testing those ideas in the real world. For statistics, machine learning, and data science more generally, models are then put into mathematical equations that give us a common language to reference them by. This does not have to be complex though. As an example, most of the models you've seen so far can be expressed as follows:
When we first started our discussion of models in data science (@sec-models), we talked about how a model is a simplified representation of reality. They start as ideas based on our intuition or experience, and they can sometimes be very simple ones. But at some point we start to think of them more formally, as a step towards testing those ideas in the real world. For statistics, machine learning, and data science more generally, models are then put into mathematical equations that give us a common language to reference them by. This does not have to be complex though. As an example, most of the models you've seen so far can be expressed as follows:

<!-- the correct display of annotation seems to only work for pdf (which will not do anything for the color); so here we add image see annotated_equations.qmd
TODO: this produces a figure cap instead of an equation cap; need to fix.
@@ -22,7 +22,7 @@ TODO: this produces a figure cap instead of an equation cap; need to fix.

In words, this equation says that the target variable $y$ is a function of the feature inputs $X$, along with anything else that we don't include in that set. This is the basic form of a model, and it's the same for linear regression, logistic regression, and even random forests and neural networks[^netfuncs].
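
As a plain-text sketch of that generic form (the book displays an annotated image version; the error-term notation below is our own shorthand, not necessarily the book's):

$$
y = f(X) + \epsilon
$$

Here $f$ is whatever functional form the model assumes, and $\epsilon$ stands in for the 'anything else' not captured by the features in $X$.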

[^netfuncs]: Neural networks are a bit different in that they can be thought of as a series of (typically nested) functions that are applied to the data, but they can still be expressed in this form. The functions are just more complex, and the parameters are estimated in a different way.
[^netfuncs]: Neural networks are a bit different in that they can be thought of as a series of (typically nested) functions that are applied to the data, but they can still be expressed in this form, e.g., $h(g(f(X)))$. The functions are more complex, and the parameters are estimated in a different way than typical tabular data models, but the basic idea is the same.

To aid our understanding beyond the math, we try to visually express models in the form of graphical models, or even in more complex ways with neural networks[^lda], as in the following images.

@@ -34,7 +34,7 @@ To aid our understanding beyond the math, we try to visually express models in t
![Plate Notation for Latent Dirichlet Allocation](img/plate_notation_LDA.png){width=66%}


![[Just part of the Nano GPT model](https://bbycroft.net/llm)](img/nano_gpt.png){width=66%}
![[Just part of the Nano GPT model](https://bbycroft.net/llm)](img/nano_gpt.png){width=50%}


But even now these models are still at the idea stage, and we ultimately need to see how they work in the world, make predictions, and help us to make important decisions. We've seen how to do this with linear models of various forms, and more unusual model implementations in the form of tree-based models, and even highly complex neural networks. These are the tools that allow us to take our ideas and turn them into something that can be used to make decisions, and that's the real power of using models in data science.
@@ -90,7 +90,7 @@ If you don't know the model and underlying data well enough to explain the resul

## More Models

When choosing a model, there's a lot at your disposal. The world of data science is vast, and we've only scratched the surface of what's out there. Here are a few more models that you may encounter in your data science journey:
When choosing a model, there's a lot at your disposal, and we've only scratched the surface of what's out there. Here are a few more models that you may encounter in your data science journey:

**Statistical Models**

@@ -103,7 +103,7 @@ As a final consideration, there are 'multivariate' techniques like Principal Com

**Machine Learning**

In a purely machine learning context, you may find other models beyond those just mentioned in the statistical realm. These models prioritize prediction and would not produce standard statistical output like coefficients and uncertainty estimates by default. Examples include support vector machines, k-nearest neighbors, and other techniques. Most of these traditional 'machine learning models' have fallen out of favor due to their inflexibility with heterogeneous data or poor performance compared to more modern approaches. However, even then, their spirit may live on in modern applications.
In a purely machine learning context, you may find other models beyond those just mentioned in the statistical realm, though, as we have mentioned several times at this point, potentially any model can be used with machine learning. These models prioritize prediction and would not produce standard statistical output like coefficients and uncertainty estimates by default. Examples include support vector machines, k-nearest neighbors, and other techniques. Most of these traditional 'machine learning models' have fallen out of favor due to their inflexibility with heterogeneous data, and/or poor performance compared to more modern approaches. However, even then, their spirit may live on in modern applications.

You'll also find models that focus on ranking, either with an outcome of ranks requiring a specific loss function (e.g. LambdaRank), or where ranking is used to simplify decision-making through post-estimation ranking of predictions (e.g., decile ranking, uplift modeling). In addition, you can find machine learning techniques extended to survival, ordinal, and other situations that are more common in the statistical realm.

@@ -112,7 +112,7 @@ Other areas of machine learning, like reinforcement learning, recommender system

**Deep Learning**

When it comes to deep learning, it seems there is a new model every day, and it's hard to keep up. In general,convolutional neural networks are the go-to for computer vision tasks, while transformers are commonly used for natural language processing, but both have been applied to the other domain with success. For tabular data you'll typically see some variant of Multilayer Perceptrons (MLPs), often with embeddings for categorical features. Some have attempted transformers and CNNs here as well, but results are mixed.
When it comes to deep learning, it seems there is a new model every day, and it's hard to keep up. In general, convolutional neural networks are the go-to for computer vision tasks, while transformers are commonly used for natural language processing, but both have been applied to the other domain with success. For tabular data you'll typically see some variant of Multilayer Perceptrons (MLPs), often with embeddings for categorical features. Some have attempted transformers and CNNs here as well, but results are mixed.

The deep learning landscape also includes models like deep graphical networks, and deep Q learning for reinforcement learning, specific models for image segmentation (e.g. SAM), recurrent neural networks and LSTM for time-series data, and generative adversarial networks for a variety of tasks. Some specific techniques are falling out of favor as transformer-based architectures are being applied to seemingly everything, but the field is dynamic, and it remains to be seen which methods will prevail in the long run.

@@ -167,9 +167,9 @@ The differences between the model families are not substantial, particularly bet
In practice, just a handful of techniques of the ones you've seen in this text can provide a lot of modeling power. Here's a simple toolbox that can cover a lot of the ground you'd need in a typical data science project:

- **Penalized Regression**: Lasso, ridge and similar models keep things linear while increasing predictive power and accommodating more features than their non-penalized counterparts.
- **Generalized Additive Models (GAM)**: These models simplify to GLM and mixed models if needed, handle nonlinear relationships and interactions, and use a penalized approach. They can also be extended to time-series and spatial data contexts with ease, making GAMs a very versatile option.
- **Boosting/Tree-based Models**: At the time of this writing, boosting methods consistently deliver the best predictive performance for tabular data, and are quite computationally efficient. That's reason enough to know how to use them.
- **A Basic Deep Learning Model**: A Multilayer Perceptron (MLP), and/or a similarly 'simple' deep learning model that incorporates embeddings for categorical and text features, is a very powerful tool[^isthisdeep]. In addition, it can be combined with other deep learning models applied to other types of data for a combined predictive effort. We're still working towards an implementation of deep learning that can handle any tabular data we throw at it, but we're not quite there yet.
- **Generalized Additive Models (GAM)**: These models simplify to GLM and mixed models if needed, handle nonlinear relationships and interactions, and use a penalized approach. GAMs can also be extended to time-series and spatial data contexts with ease, making them a very versatile option.
- **Boosting/Tree-based Models**: At the time of this writing, boosting methods consistently deliver the best predictive performance for tabular data, and are quite computationally efficient. That's reason enough to know how to use them and keep them handy.
- **A Basic Deep Learning Model**: A 'simple' deep learning model that incorporates embeddings for categorical and text features is a very powerful tool[^isthisdeep]. In addition, it can be combined with other deep learning models applied to other types of data for a combined predictive effort. We're still working towards an implementation of deep learning that can handle any tabular data we throw at it, but we're not quite there yet.
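
To make the toolbox above a little more concrete, here is a minimal, illustrative R sketch on simulated data. It assumes the glmnet, mgcv, and xgboost packages as stand-ins for the penalized regression, GAM, and boosting entries (other implementations work just as well, and interface details vary across package versions), and it skips the deep learning entry, which needs more setup.

```r
# Illustrative only; assumes glmnet, mgcv, and xgboost are installed.
set.seed(123)

n  = 1000
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
y  = 1 + .5 * x1 + sin(x2) + .25 * x3 + rnorm(n, sd = .5)

df = data.frame(y, x1, x2, x3)
X  = as.matrix(df[, c('x1', 'x2', 'x3')])

# Penalized regression: lasso with a cross-validated penalty
fit_lasso = glmnet::cv.glmnet(X, y, alpha = 1)

# GAM: smooth terms for potentially nonlinear feature effects
fit_gam = mgcv::gam(y ~ s(x1) + s(x2) + s(x3), data = df)

# Boosting: gradient boosted trees (classic xgboost interface;
# newer package versions may prefer a different entry point)
fit_boost = xgboost::xgboost(
  data      = X,
  label     = y,
  nrounds   = 200,
  max_depth = 3,
  eta       = 0.1,
  objective = 'reg:squarederror',
  verbose   = 0
)

# Quick in-sample comparison; in practice, use proper cross-validation
rmse = function(obs, pred) sqrt(mean((obs - pred)^2))
c(
  lasso = rmse(y, predict(fit_lasso, X, s = 'lambda.min')),
  gam   = rmse(y, predict(fit_gam, df)),
  boost = rmse(y, predict(fit_boost, X))
)
```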

Besides the models, it's crucial to understand how to evaluate your models (cross-validation, metrics), how to interpret them (coefficients, SHAP, feature importance, uncertainty), and how to manage the data you're working with. We've covered a lot of this in the text, but there's always more to learn, and more to practice.

