Commit
Merge pull request #108 from m-clark/dev
grammar check plus
m-clark authored Dec 2, 2024
2 parents 702d46f + 5cfcc80 commit 9e7395d
Showing 29 changed files with 3,510 additions and 2,454 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -1,6 +1,6 @@
# book-of-models

This is the repository for a book tentatively titled:
This is the repository for a book ~~tentatively~~ titled:

- Models Demystified
- ~~- Models by Example~~
@@ -14,6 +14,6 @@ We've seen many people struggle with understanding how models work, just as we'v



This should be out from CRC Press in 2024. We welcome any feedback in the meantime as it develops, so please feel free to create an issue. For contributions, please see the [contributing](CONTRIBUTING.md) page for more information.
This should be out from CRC Press in 2024 or early 2025. We welcome any feedback in the meantime as it develops, so please feel free to create an issue. For contributions, please see the [contributing](CONTRIBUTING.md) page for more information.

[Web Version of the book](https://m-clark.github.io/book-of-models/)
1 change: 1 addition & 0 deletions _quarto.yml
@@ -25,6 +25,7 @@ book:
- understanding_models.qmd
- understanding_features.qmd
- estimation.qmd
- uncertainty.qmd
- generalized_linear_models.qmd
- linear_model_extensions.qmd
# - part: "Machine Learning"
12 changes: 4 additions & 8 deletions acknowledgments.qmd
@@ -6,15 +6,11 @@ This work was supported by no grants or grad students, and was done in the spare

We could not have done this without the support of our families, who have been patient and understanding throughout the process.

In particular, Michael wishes to thank his wife Xilin, whose patience, humor, and support have been invaluable. He'd also like to thank Rich Herrington, who got him started with data science many years ago, and the folks at Strong Analytics/OneSix who have been supportive of his recent efforts, put up with his transition from academia to industry, and made him much better at everything he does in data science. And finally, he would like to thank the many mentors in the statistical, ML, and programming communities, who never knew they were such a large influence, but did much to expand his knowledge in many ways over the years.

Seth wishes to thank his wife Megan, who has been supportive and understanding throughout the process, and his kids, who put up with the time spent on this project. He'd also like to thank his colleagues and students at Notre Dame, who have supported teaching antics and research shenanigans. He hopes that this book is more proof that he knows what he's talking about, and not just a bunch of nonsense.


Together we'd like to thank those who helped with the book by providing comments and feedback, specifically: Malcolm Barrett, Isabella Gehment, Demetri Pananos, and Chelsea Parlett-Pellereti. Their insights and suggestions were invaluable in making this book better. We'd also like to thank others who took the time to take even just a quick glance to see what was going on and provide a brief comment here and there. We appreciate all the help we can get.

Finally, we'd like to thank you, the reader, for taking the time to read this book. We hope you find it useful and informative, and that it helps you in your data science journey. If you have any feedback, please feel free to reach out to us. We'd love to hear from you.
In particular, Michael wishes to thank his wife Xilin, whose insights, humor, support, and especially patience, have been invaluable throughout the process. He'd also like to thank Rich Herrington, whose encouragement was the catalyst for his shift to data science many years ago. Michael also thanks the folks at Strong Analytics/OneSix who put up with his transition from academia to industry, and made him much better at everything he does in the realm of data science. And finally, he would like to thank the many mentors in the statistical, machine learning, deep learning, and programming communities, who never knew they were such a large influence, but did much to expand his knowledge in many ways over the years.

Seth is grateful for the support and understanding of his wife Megan, and his fantastic kids, who never cease to bring him joy. He'd also like to thank his colleagues and students at Notre Dame, who have supported teaching antics and research shenanigans. He hopes that this book provides at least some evidence that he knows what he's talking about.


Together we'd like to thank those who helped with the book by providing comments and feedback, specifically: Malcolm Barrett, Isabella Gehment, Demetri Pananos, and Chelsea Parlett-Pellereti. Their insights and suggestions were invaluable in making this book a lot better. We'd also like to thank others who took the time to take even just a quick glance to see what was going on and provide a brief comment here and there. Every contribution was appreciated.

Finally, we'd like to thank you, the reader, for taking the time to peruse this book. We hope you find it useful and informative, and that it helps you in your data science journey. If you have any feedback, please feel free to reach out to us. We'd love to hear from you.
12 changes: 6 additions & 6 deletions causal.qmd
@@ -419,11 +419,11 @@ Analysis of variance, or **ANOVA**, allows the t-test to be extended to more tha

If linear regression didn't suggest any notion of causality to you before, it shouldn't now either. The model is *identical* whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment, or there are more than two groups, or whether the treatment is continuous or categorical.

Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of an differences we see, and be more confident in a causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship on their own. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits].
Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of any differences we see, and be more confident in a causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship on their own. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits].

[^exprand]: Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce *control* in the sample settings.

[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g. interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many important experimental results are based on p-values with small data, and this is the part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science.
[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g., interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many important experimental results are based on p-values with small data, and this is the part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science.
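The point above that the model is *identical* with or without random assignment can be seen in miniature: a two-group comparison (the t-test setting) is just a linear regression on a 0/1 group indicator, and the OLS slope is exactly the difference in group means. A quick sketch with simulated data (the data-generating process here is invented purely for illustration):

```python
# Sketch: a two-group comparison as a linear regression.
# The OLS slope on a 0/1 dummy equals the difference in group means,
# whether or not group membership was randomly assigned.
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)          # 0 = control, 1 = treatment
y = 1.0 + 0.5 * group + rng.normal(size=200)  # assumed data-generating process

# regression with intercept + group dummy
X = np.column_stack([np.ones_like(group), group])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

mean_diff = y[group == 1].mean() - y[group == 0].mean()
print(np.isclose(slope, mean_diff))  # True: slope == difference in means
```

The estimation is the same either way; only the design behind the data, and hence the causal story we can tell, differs.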


:::{.callout-note title='A/B Testing' collapse='true'}
@@ -545,7 +545,7 @@ As we noted, random assignment or a formal experiment is not always possible or

The COVID-19 pandemic provides an example of a natural experiment. The pandemic introduced sudden and widespread changes that were not influenced by individuals' prior characteristics or behaviors, such as lockdowns, remote work, and vaccination campaigns. The randomness in the timing and implementation of these changes allows researchers to compare outcomes before and after the policy implementation or pandemic, or between different regions with varying policies, to infer causal effects.

For instance, we could compare states or counties that had mask mandates to those that didn't at the same time or with similar characteristics. Or we might compare areas that had high vaccination rates to those nearby that didn't. But these still aren't true experiments. So we'd need to control for as many additional factors that might influence the results, e.g. population density, age, wealth and so on, and eventually we might still get a pretty good idea of the causal impact of these interventions.
For instance, we could compare states or counties that had mask mandates to those that didn't at the same time or with similar characteristics. Or we might compare areas that had high vaccination rates to those nearby that didn't. But these still aren't true experiments. So we'd need to control for as many additional factors that might influence the results, like population density, age, wealth and so on, and eventually we might still get a pretty good idea of the causal impact of these interventions.
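One simple way to formalize such before/after, policy/no-policy comparisons is a difference-in-differences style calculation: the policy region's change over time, minus the control region's change, nets out any shared trend. A minimal sketch on simulated data (the regions, the common trend, and the -2.0 'policy effect' are all invented for illustration, not taken from any COVID-19 study):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated outcome (e.g., a case rate) for two regions, before and after a
# policy. Both regions share a common +2 time trend; only the 'policy' region
# gets the assumed treatment effect of -2.0 after the policy starts.
control_pre  = rng.normal(10, 1, n)
control_post = rng.normal(12, 1, n)        # common trend only
policy_pre   = rng.normal(10, 1, n)
policy_post  = rng.normal(12 - 2.0, 1, n)  # common trend + policy effect

# Difference-in-differences: the policy region's change minus the control
# region's change removes the shared trend, leaving the policy effect.
did = (policy_post.mean() - policy_pre.mean()) \
    - (control_post.mean() - control_pre.mean())
print(did)  # ≈ -2.0
```

This only recovers the effect under the (strong) assumption that the two regions would have followed parallel trends absent the policy, which is exactly the kind of unverifiable assumption the surrounding text warns about.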


## Causal Inference {#sec-causal-inference}
@@ -647,7 +647,7 @@ simulate_confounding(nreps = 500, n = 1000, true = 1)

:::

Results suggest that the coefficient for `X` is different in the two models. If we don't include the confounder, the feature's relationship with the target is biased upwardly. The nature of the bias depends on the relationship between the confounder and the treatment and target, but in this case it's pretty clear!
Results suggest that the coefficient for `X` is different in the two models. If we don't include the confounder, the feature's relationship with the target is biased upward. The nature of the bias ultimately depends on the relationship between the confounder and the treatment and target, but in this case it's pretty clear!
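For readers following along in Python rather than R, here is a rough analogue of the `simulate_confounding()` simulation referenced above; the data-generating details (coefficients, noise scale) are our own assumptions, not the book's code:

```python
import numpy as np

def simulate_confounding(nreps=500, n=1000, true=1.0, seed=42):
    """Average the estimated X coefficient with and without the confounder Z."""
    rng = np.random.default_rng(seed)
    with_z, without_z = [], []
    for _ in range(nreps):
        z = rng.normal(size=n)                       # confounder
        x = 0.5 * z + rng.normal(size=n)             # treatment depends on z
        y = true * x + 0.5 * z + rng.normal(size=n)  # target depends on x and z

        # adjusted model: y ~ x + z
        X_adj = np.column_stack([np.ones(n), x, z])
        with_z.append(np.linalg.lstsq(X_adj, y, rcond=None)[0][1])

        # unadjusted model: y ~ x (omits the confounder)
        X_un = np.column_stack([np.ones(n), x])
        without_z.append(np.linalg.lstsq(X_un, y, rcond=None)[0][1])
    return np.mean(with_z), np.mean(without_z)

adjusted, unadjusted = simulate_confounding()
print(adjusted, unadjusted)  # adjusted ≈ 1.0 (the true value); unadjusted biased upward
```

With this setup the omitted-variable bias works out to 0.5 · cov(x, z)/var(x) = 0.2, so the unadjusted coefficient lands near 1.2 rather than the true 1.0.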

```{r}
#| echo: false
@@ -872,7 +872,7 @@ In that graph, our focal treatment, or 'exposure', is physical activity, and we

One thing to note relative to the other graphical model depictions we've seen, is that the arrows directly flow to a target or set of targets, as opposed to just producing an 'output' that we then compare with the target. In graphical causal models, we're making clear the direction and focus of the causal relationships, i.e., the causal structure, as opposed to the model structure. Also, in graphical causal models, the effects for any given feature are adjusted for the other features in the model in a particular way, so that we can think about them in isolation, rather than as a collective set of features that are all influencing the target[^graphout].

[^graphout]: If we were to model this in an overly simple fashion with linear regressions for any variable with an arrow to it, you could say physical activity and dietary habits would basically be the output of their respective models. It isn't that simple in practice though, such that we can just run separate regressions and feed in the results to the next one, thought that's how they used to do it back in the day. We have to take more care in how we adjust for all features in the model, as well as correctly account for the uncertainty if we do take a multi-stage approach.
[^graphout]: If we were to model this in an overly simple fashion with linear regressions for any variable with an arrow to it, you could say physical activity and dietary habits would basically be the output of their respective models. It isn't that simple in practice though, such that we can just run separate regressions and feed in the results to the next one, though that's how they used to do it back in the day. We have to take more care in how we adjust for all features in the model, as well as correctly account for the uncertainty if we do take a multi-stage approach.


Structural equation models are widely employed in the social sciences and education, and are often used to model both observed and *latent* variables (@sec-data-latent), with either serving as features or targets[^sembias]. They are also used to model causal relationships, to the point that historically they were even called 'causal graphical models' or 'causal structural models'. SEMs are actually a special case of the graphical models just described, which are more common in non-social science disciplines. Compared to other graphical modeling techniques like DAGs, SEMs will typically have more assumptions, and these are often difficult to meet[^semass].
@@ -903,7 +903,7 @@ Formal graphical models provide a much richer set of tools for controlling vario


:::{.callout-note title='Causal Language' collapse='true'}
It's often been suggested that we keep certain phrasing (e.g. feature X has an *effect* on target Y) only for the causal model setting. But the model we use can only tell us that the data is consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we often use causal language whenever we think the relationship is or should be causal, and that's fine, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability.
It's often been suggested that we keep certain phrasing, for example, that feature X has an *effect* on target Y, only for the causal model setting. But the model we use can only tell us that the data is consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we use causal phrasing whenever we think the relationship is or should be causal, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability.
:::

### Counterfactual thinking {#sec-causal-counterfactual}