diff --git a/_freeze/posts/24-best-practices-data-analyses/index/execute-results/html.json b/_freeze/posts/24-best-practices-data-analyses/index/execute-results/html.json index ccebece..4d6667b 100644 --- a/_freeze/posts/24-best-practices-data-analyses/index/execute-results/html.json +++ b/_freeze/posts/24-best-practices-data-analyses/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "71282c20abecaa44a90b0019e845c20b", + "hash": "45d43da22debdd0780275400d94dc8fa", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"24 - Best practices for data analyses\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"A noncomprehensive set of best practices for building data analyses\"\ncategories: [module 6, week 8, best practices]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/24-best-practices-data-analyses/index.qmd).*\n\n# Pre-lecture materials\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- [Sharing biological data: why, when, and how](https://febs.onlinelibrary.wiley.com/doi/10.1002/1873-3468.14067)\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\nBe able to state best practices for:\n\n- Considerations around building ethical data analyses\n- Sharing data\n- Creating data visualizations\n:::\n\n\n::: {.cell}\n\n:::\n\n\n# Best practices for data ethics\n\nIn philosophy departments, classes and modules centered around **data ethics** are widely discussed.\n\nThe ethical challenges around working with data are not fundamentally different from the ethical challenges philosophers have always faced.\n\nHowever, putting an ethical framework around building data analyses in practice is indeed new for most data scientists, and for many of us, we are woefully under-prepared to teach so far outside our comfort zone.\n\nThat being said, we can provide some thoughts on how to approach a data science problem using a philosophical lens.\n\n## Defining ethics\n\nWe start with a grounding in the definition of Ethics:\n\n**Ethics**, also called moral philosophy, has three main branches:\n\n1. [Applied ethics](https://www.oxfordbibliographies.com/view/document/obo-9780195396577/obo-9780195396577-0006.xml) \"is a branch of ethics devoted to the treatment of moral problems, practices, and policies in personal life, professions, technology, and government.\"\n2. [Ethical theory](https://academic.oup.com/edited-volume/35492/chapter-abstract/304418208?redirectedFrom=fulltext&login=false) \"is concerned with the articulation and the justification of the fundamental principles that govern the issues of how we should live and what we morally ought to do. Its most general concerns are providing an account of moral evaluation and, possibly, articulating a decision procedure to guide moral action.\"\n3. 
[Metaethics](https://plato.stanford.edu/entries/metaethics/) \"is the attempt to understand the metaphysical, epistemological, semantic, and psychological, presuppositions and commitments of moral thought, talk, and practice.\"\n\nWhile, unfortunately, there are myriad examples of **ethical data science problems** (see, for example, blog posts [bookclub](https://teachdatascience.com/bookclub/) and [data feminism](https://teachdatascience.com/datafem/)), here I aim to connect some of the broader data science ethics issues with the existing philosophical literature.\n\nNote, I am only scratching the surface and a deeper dive might involve education in related philosophical fields (epistemology, metaphysics, or philosophy of science), philosophical methodologies, and ethical schools of thought, but you can peruse all of these through, for example, a course or readings introducing the discipline of philosophy.\n\nBelow we provide some thoughts on how to approach a data science problem using a philosophical lens.\n\n## Case Study\n\nWe begin by considering a case study around ethical data analyses.\n\nMany **ethics case studies** provided in a classroom setting **describe algorithms built on data which are meant to predict outcomes**.\n\n::: callout-tip\n### Note\n\nLarge scale algorithmic decision making presents particular ethical predicaments because of both the scale of impact and the \"black-box\" sense of how the algorithm is generating predictions.\n:::\n\nConsider the well-known issue of using [facial recognition software](https://en.wikipedia.org/wiki/Facial_recognition_system) in policing.\n\nThere are many questions surrounding the policing issue:\n\n- What are the action options with respect to the outcome of the algorithm?\n- What are the good and bad aspects of each action and how are these to be weighed against each other?\n\n![](https://teachdatascience.com/philosophy/LAPD.png)\n\n\\[Source: [CNN](https://www.cnn.com/2019/09/12/tech/california-body-cam-facial-recognition-ban/index.html)\\]\n\n::: callout-tip\n### Important questions\n\nThe two main ethical concerns surrounding facial recognition software break down into\n\n- How the algorithms were developed?\n- How the algorithm is used?\n:::\n\nWhen thinking about the questions below, reflect on the good aspects and the bad aspects and how one might weight the good versus the bad.\n\n### Creating the algorithm\n\n- What data should be used to train the algorithm?\n - If the accuracy rates of the algorithm differ based on the demographics of the subgroups within the data, is more data and testing required?\n- Who and what criteria should be used to tune the algorithm?\n - Who should be involved in decisions on the tuning parameters of the algorithm?\n - Which optimization criteria should be used (e.g., accuracy? false positive rate? false negative rate?)\n- Issues of access:\n - Who should own or have control of the facial image data?\n - Do individuals have a right to keep their facial image private from being in databases?\n - Do individuals have a right to be notified that their facial image is in the data base? For example, if I ring someone's doorbell and my face is captured in a database, do I need to be told? \\[While traditional human subjects and IRB requirements necessitate consent to be included in any research project, in most cases it is legal to photograph a person without their consent.\\]\n - Should the data be accessible to researchers working to make the field more equitable? 
What if allowing accessibility thereby makes the data accessible to bad actors?\n\n### Using the algorithm\n\n- Issues of personal impact:\n - The software might make it easier to accurately associate an individual with a crime, but it might also make it easier to mistakenly associate an individual with a crime. How should the pro vs con be weighed against each other?\n - Do individuals have a right to know, correct, or delete personal information included in a database?\n- Issues of societal impact:\n - Is it permissible to use a facial recognition software which has been trained primarily on faces of European ancestry individual, given that this results in false positive and false negative rates that are not equally dispersed across racial lines?\n - While the software might make it easier to protect against criminal activity, it also makes it easier to undermine specific communities when their members are mistakenly identified with criminal activity. How should the pro vs con of different communities be weighed against each other?\n- Issues of money:\n - Is it permissible for a software company to profit from an algorithm while having no financial responsibility for its misuse or negative impacts?\n - Who should pay the court fees and missed work hours of those who were mistakenly accused of crimes?\n\nTo settle the questions above, we need to study various ethical theories, and it turns out that the different theories may lead us to different conclusions. As non-philosophers, we recognize that the suggested readings and ideas may come across as overwhelming. If you are overwhelmed, we suggest that you choose one ethical theory, think carefully about how it informs decision making, and help your students to connect the ethical framework to a data science case study.\n\n## Final thoughts\n\nThis is a challenging topic, but as you analyze data, ask yourself the following broad questions to help you with ethical considerations around the data analysis.\n\n::: callout-tip\n### Questions to ask yourself when analyzing data?\n\n1. Why are we producing this knowledge?\n2. For whom are we producing this knowledge?\n3. What communities do they serve?\n4. Which stakeholders need to be involved in making decisions in and around the data analysis?\n:::\n\n# Best practices for sharing data\n\nData sharing is an essential element of the scientific method, imperative to ensure transparency and reproducibility.\n\nDifferent areas of research collect fundamentally different types of data, such as tabular data, time series data, image data, or genomic data. These types of data differ greatly in size and require different approaches for sharing.\n\nIn this section, I outline broad best practices to make your data publicly accessible and usable, generally and for several specific kinds of data.\n\n## FAIR principles\n\nSharing data proves more useful when others can easily find and access, interpret, and reuse the data. To maximize the benefit of sharing your data, follow the [findable, accessible, interoperable, and reusable (FAIR)](https://www.go-fair.org/fair-principles/) guiding principles of data sharing, which optimize reuse of generated data.\n\n::: callout-tip\n### FAIR data sharing principles\n\n1. **Findable**. The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.\n2. 
**Accessible**. Once the user finds the required data, she/he needs to know how can they be accessed, possibly including authentication and authorization.\n3. **Interoperable**. The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.\n4. **Reusable**. The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.\n:::\n\n\n\n## Why share?\n\n1. **Benefits of sharing data to science and society**. Sharing data allows for transparency in scientific studies and allows one to fully understand what occurred in an analysis and reproduce the results. Without complete data, metadata, and information about resources used to generate the data, reproducing a study proves impossible.\n2. **Benefits of sharing data to individual researchers**. Sharing data increases the impact of a researcher's work and reputation for sound science. Awards for those with an excellent record of [data sharing](https://researchsymbionts.org/) or [data reuse](https://researchparasite.com/) can exemplify this reputation.\n\n\n\n\n\n### Addressing common concerns about data sharing\n\nDespite the clear benefits of sharing data, some researchers still have concerns about doing so.\n\n- **Novelty**. Some worry that sharing data may decrease the novelty of their work and their chance to publish in prominent journals. You can address this concern by sharing your data only after publication. You can also choose to preprint your manuscript when you decide to share your data. Furthermore, you only need to share the data and metadata required to reproduce your published study.\n- **Time spent on sharing data**. Some have concerns about the time it takes to organize and share data publicly. Many add 'data available upon request' to manuscripts instead of depositing the data in a public repository in hopes of getting the work out sooner. It does take time to organize data in preparation for sharing, but sharing data publicly may save you time. Sharing data in a public repository that guarantees archival persistence means that you will not have to worry about storing and backing up the data yourself.\n- **Human subject data**. Sharing of data on human subjects requires special ethical, legal, and privacy considerations. Existing recommendations largely aim to balance the privacy of human participants with the benefits of data sharing by de-identifying human participants and obtaining consent for sharing. Sharing human data poses a variety of challenges for analysis, transparency, reproducibility, interoperability, and access.\n\n::: callout-tip\n### Human data\n\nSometimes you cannot publicly post all human data, even after de-identification. We suggest three strategies for making these data maximally accessible.\n\n1. Deposit raw data files in a controlled-access repository. Controlled-access repositories allow only qualified researchers who apply to access the data.\n - Example: the database of Genotypes and Phenotypes (dbGaP) \n2. Even if you cannot make individual-level raw data available, you can make as much processed data available as possible. This may take the form of summary statistics such as means and standard deviations, rather than individual-level data.\n - Example: GWAS summary data such as the one available from \n3. 
You may want to generate simulated data distinct from the original data but statistically similar to it. Simulated data would allow others to reproduce your analysis without disclosing the original data or requiring the security controls needed for controlled access.\n:::\n\n## What data to share?\n\nDepending on the data type, you might be able to share the data itself, or a summarized version of it. Broadly thought, you want to share the following:\n\n1. The **data** itself, or a summarized version, or a simulated data similar to the original.\n2. Any **metadata** to describe the primary data and the resources used to generate it. Most disciplines have specific metadata standards to follow (e.g. [microarrays](http://fged.org/projects/minseqe/)).\n3. **Data dictionary**. These have crucial role in organizing your data, especially explaining the variables and their representation. Data dictionaries should provide short names for each variable, a longer text label for the variable, a definition for each variable, data type (such as floating-point number, integer, or string), measurement units, and expected minimum and maximum values. Data dictionaries can make explicit what future users would otherwise have to guess about the representation of data.\n - You have gotten used to seeing the *Tidy Tuesday* data dictionaries such as \n4. **Source code**. Ideally, readers should have all materials needed to completely reproduce the study described in a publication, not just data. These materials include source code, preprocessing, and analysis scripts. Guidelines for organization of computational project can help you arrange your data and scripts in a way that will make it easier for you and other to access and reuse them.\n - See for how we organize code in my team.\n - See also for how to deposit your GitHub repository on Zenodo and get a Digital Object Identifier (DOI) that others can cite. Example \n5. **Licensing**. Clear licensing information attached to your data avoids any questions of whether others may reuse it. Many data resources turn out not to be as reusable as the providers intended, due to lack of clarity in licensing or restrictive licensing choices.\n - See for more on how to choose a license\n\n::: callout-tip\n### How should you document your data?\n\nDocument your data in three ways:\n\n1. **With your manuscript**.\n2. **With description fields** in the metadata collected by repositories\n3. **With README files**. README files provide abbreviated information about a collection of files (e.g. explain organization, file locations, observations and variables present in each file, details on the experimental design, etc).\n:::\n\n# Best practices for data visualizations\n\n## Motiviation\n\n::: callout-tip\n### Quote from one of Roger Peng's heroes\n\n*\"The greatest value of a picture is when it forces us to notice what we never expected to see.\"* -John W. Tukey\n\n\n::: {.cell .fig-cap-location-top layout-align=\"center\"}\n::: {.cell-output-display}\n![](http://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg){fig-align='center'}\n:::\n:::\n\n:::\n\nMistakes, biases, systematic errors and unexpected variability are commonly found in data regardless of applications. Failure to discover these problems often leads to **flawed analyses and false discoveries**.\n\nAs an example, consider that measurement devices sometimes fail and not all summarization procedures, such as the `mean()` function in R, are designed to detect these. 
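\n\nAs a minimal sketch (the readings below are made up for illustration, not part of the lecture data), a single failed measurement slips straight through `mean()`:\n\n```r\n# Hypothetical temperature readings; the device failed on day 4 and logged -999\ntemps <- c(21.3, 22.1, 20.8, -999, 21.7)\n\nmean(temps) # -182.62, returned without any warning\nsummary(temps) # the impossible minimum makes the faulty reading easy to spot\n```\n\n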
Yet, these functions will still give you an answer.\n\nFurthermore, it may be hard or impossible to notice an error was made just from the reported summaries.\n\n**Data visualization is a powerful approach to detecting these problems**. We refer to this particular task as exploratory data analysis (EDA), coined by John Tukey.\n\nOn a more positive note, data visualization can also lead to discoveries which would otherwise be missed if we simply subject the data to a battery of statistical summaries or procedures.\n\nWhen analyzing data, we often **make use of exploratory plots to motivate the analyses** we choose.\n\nIn this section, we will discuss some types of plots to avoid, better ways to visualize data, some principles to create good plots, and ways to use `ggplot2` to create **expository** (intended to explain or describe something) graphs.\n\n::: callout-tip\n### Example\n\nThe following figure is from [Lippmann et al. 2006](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1665439/):\n\n![Nickel concentration and PM10 health effects (Blue points represent average county-level concentrations from 2000--2005 for 72 U.S. counties representing 69 communities).](../../images/lippman.png){width=\"70%\"}\n\nThe following figure is from [Dominici et al. 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2137127/), in response to the work by Lippmann et al. above.\n\n![Nickel concentration and PM10 health effects (with and without New York).](../../images/dominici_ehp.png){width=\"70%\"}\n\nElevated levels of Ni and V PM2.5 chemical components in New York are likely attributed to oil-fired power plants and emissions from ships burning oil, as noted by Lippmann et al. (2006).\n:::\n\n### Generating data visualizations\n\nIn order to determine the effectiveness or quality of a visualization, we need to first understand three things:\n\n::: callout-tip\n### Questions to ask yourself when building data visualizations\n\n1. What is the question we are trying to answer?\n2. Why are we building this visualization?\n3. For whom are we producing this data visualization for? Who is the intended audience to consume this visualization?\n:::\n\nNo plot (or any statistical tool, really) can be judged without knowing the answers to those questions. No plot or graphic exists in a vacuum. There is always context and other surrounding factors that play a role in determining a plot's effectiveness.\n\nConversely, **high-quality, well-made visualizations** usually allow one to properly deduce what question is being asked and who the audience is meant to be. A good visualization **tells a complete story in a single frame**.\n\n::: callout-tip\n## Broad steps for creating data visualizations\n\nThe act of visualizing data typically proceeds in two broad steps:\n\n1. Given the question and the audience, **what type of plot should I make?**\n2. Given the plot I intend to make, **how can I optimize it for clarity and effectiveness?**\n:::\n\n## Data viz principles\n\n### Developing plots\n\nInitially, one must decide what information should be presented. The following principles for developing analytic graphics come from Edward Tufte's book [*Beautiful Evidence*](https://www.edwardtufte.com/tufte/books_be).\n\n1. Show comparisons\n2. Show causality, mechanism, explanation\n3. Show multivariate data\n4. Integrate multiple modes of evidence\n5. Describe and document the evidence\n6. Content is king - good plots start with good questions\n\n### Optimizing plots\n\n1. 
Maximize the data/ink ratio -- if \"ink\" can be removed without reducing the information being communicated, then it should be removed.\n2. Maximize the range of perceptual conditions -- your audience's perceptual abilities may not be fully known, so it's best to allow for a wide range, to the extent possible (or knowable).\n3. Show variation in the **data**, not variation in the **design**.\n\nWhat's sub-optimal about this plot?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd <- airquality %>%\n mutate(Summer = ifelse(Month %in% c(7, 8, 9), 2, 3))\nwith(d, {\n plot(Temp, Ozone, col = unclass(Summer), pch = 19, frame.plot = FALSE)\n legend(\"topleft\",\n col = 2:3, pch = 19, bty = \"n\",\n legend = c(\"Summer\", \"Non-Summer\")\n )\n})\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nWhat's sub-optimal about this plot?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nairquality %>%\n mutate(Summer = ifelse(Month %in% c(7, 8, 9),\n \"Summer\", \"Non-Summer\"\n )) %>%\n ggplot(aes(Temp, Ozone)) +\n geom_point(aes(color = Summer), size = 2) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nSome of these principles are taken from Edward Tufte's *Visual Display of Quantitative Information*:\n\n## Plots to Avoid\n\nThis section is based on a talk by [Karl W. Broman](http://kbroman.org/) titled [\"How to Display Data Badly\"](https://www.biostat.wisc.edu/~kbroman/presentations/graphs_cmp2014.pdf), in which he described how the default plots offered by Microsoft Excel \"obscure your data and annoy your readers\" ([here](https://kbroman.org/talks.html) is a link to a collection of Karl Broman's talks).\n\n::: callout-tip\n### FYI\n\nKarl's lecture was inspired by the 1984 paper by H. Wainer: How to display data badly. American Statistician 38(2): 137--147.\n\nDr. Wainer was the first to elucidate the principles of the bad display of data.\n\nHowever, according to Karl Broman, \"The now widespread use of Microsoft Excel has resulted in remarkable advances in the field.\"\n\nHere we show examples of \"bad plots\" and how to improve them in R.\n:::\n\n::: callout-tip\n### Some general principles of *bad* plots\n\n- Display as little information as possible.\n- Obscure what you do show (with chart junk).\n- Use pseudo-3D and color gratuitously.\n- Make a pie chart (preferably in color and 3D).\n- Use a poorly chosen scale.\n- Ignore significant figures.\n:::\n\n## Examples\n\nHere are some examples of bad plots and suggestions on how to improve\n\n### Pie charts\n\nLet's say we are interested in the most commonly used browsers. Wikipedia has a [table](https://en.wikipedia.org/wiki/Usage_share_of_web_browsers) with the \"usage share of web browsers\" or the proportion of visitors to a group of web sites that use a particular web browser from July 2017.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbrowsers <- c(\n Chrome = 60, Safari = 14, UCBrowser = 7,\n Firefox = 5, Opera = 3, IE = 3, Noinfo = 8\n)\nbrowsers.df <- gather(\n data.frame(t(browsers)),\n \"browser\", \"proportion\"\n)\n```\n:::\n\n\nLet's say we want to report the results of the usage. 
The standard way of displaying these is with a pie chart:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npie(browsers, main = \"Browser Usage (July 2022)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nIf we look at the help file for `pie()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?pie\n```\n:::\n\n\nIt states:\n\n> \"Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.\"\n\nTo see this, look at the figure above and try to determine the percentages just from looking at the plot. Unless the percentages are close to 25%, 50% or 75%, this is not so easy. Simply showing the numbers is not only clear, but also saves on printing costs.\n\nHaving said that, see how we used a pie chart in a [2023 pre-print](https://doi.org/10.1101/2023.02.15.528722) . The published version is accessible from doi [10.1126/science.adh1938](http://dx.doi.org/10.1126/science.adh1938).\n\n#### Instead of pie charts, try bar plots\n\nIf you do want to plot them, then a barplot is appropriate. Here we use the `geom_bar()` function in `ggplot2`. Note, there are also horizontal lines at every multiple of 10, which helps the eye quickly make comparisons across:\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- browsers.df %>%\n ggplot(aes(\n x = reorder(browser, -proportion),\n y = proportion\n )) +\n geom_bar(stat = \"identity\")\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nNotice that we can now pretty easily determine the percentages by following a horizontal line to the x-axis.\n\n#### Polish your plots\n\nWhile this figure is already a big improvement over a pie chart, we can do even better. When you create figures, you want your figures to be self-sufficient, meaning someone looking at the plot can understand everything about it.\n\nSome possible critiques are:\n\n1. make the axes bigger\n2. make the labels bigger\n3. make the labels be full names (e.g. \"Browser\" and \"Proportion of users\", ideally with units when appropriate)\n4. add a title\n\nLet's explore how to do these things to make an even better figure.\n\nTo start, go to the help file for `theme()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ggplot2::theme\n```\n:::\n\n\nWe see there are arguments with text that control all the text sizes in the plot. If you scroll down, you see the text argument in the theme command requires class `element_text`. 
Let's try it out.\n\nTo change the x-axis and y-axis labels to be full names, use `xlab()` and `ylab()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- p + xlab(\"Browser\") +\n ylab(\"Proportion of Users\")\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nMaybe a title\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Browser Usage (July 2022)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\nNext, we can also use the `theme()` function in `ggplot2` to control the justifications and sizes of the axes, labels and titles.\n\nTo center the title\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Browser Usage (July 2022)\") +\n theme(plot.title = element_text(hjust = 0.5))\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nTo create bigger text/labels/titles:\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- p + ggtitle(\"Browser Usage (July 2022)\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15)\n )\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\n#### \"I don't like that theme\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_bw()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_dark()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_classic() # axis lines!\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggthemes::theme_base()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n### 3D barplots\n\nPlease, avoid a 3D version because it obfuscates the plot, making it more difficult to find the percentages by eye.\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig2b.png)\n\n### Donut plots\n\nEven worse than pie charts are donut plots.\n\n![](http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Donut-Chart.svg/360px-Donut-Chart.svg.png)\n\nThe reason is that by removing the center, we remove one of the visual cues for determining the different areas: the angles. **There is no reason to ever use a donut plot to display data**.\n\n::: callout-note\n### Question\n\nWhy are pie/donut charts [so common](https://blog.usejournal.com/why-humans-love-pie-charts-9cd346000bdc)?\n\n\n:::\n\n### Barplots as data summaries\n\nWhile barplots are useful for showing percentages, they are incorrectly used to display data from two groups being compared. Specifically, barplots are created with height equal to the group means; an antenna is added at the top to represent standard errors. This plot is simply showing two numbers per group and the plot adds nothing:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig1c.png)\n\n#### Instead of bar plots for summaries, try box plots\n\nIf the number of points is small enough, we might as well add them to the plot. When the number of points is too large for us to see them, just showing a boxplot is preferable.\n\nLet's recreate these barplots as boxplots and overlay the points. 
We will simulate similar data to demonstrate one way to improve the graphic above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\ndat <- data.frame(\n \"Treatment\" = rnorm(10, 30, sd = 4),\n \"Control\" = rnorm(10, 36, sd = 4)\n)\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\") +\n ggtitle(\"Response to drug treatment\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nNotice how much more we see here: the center, spread, range, and the points themselves. In the barplot, we only see the mean and the standard error (SE), and the SE has more to do with sample size than with the spread of the data.\n\nThis problem is magnified when our data has outliers or very large tails. For example, in the plot below, there appears to be very large and consistent differences between the two groups:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig3c.png)\n\nHowever, a quick look at the data demonstrates that this difference is mostly driven by just two points.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\ndat <- data.frame(\n \"Treatment\" = rgamma(10, 10, 1),\n \"Control\" = rgamma(10, 1, .01)\n)\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\n#### Use log scale if data includes outliers\n\nA version showing the data in the log-scale is much more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\") +\n scale_y_log10()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n### Barplots for paired data\n\nA common task in data analysis is the comparison of two groups. When the dataset is small and data are paired, such as the outcomes before and after a treatment, two-color barplots are unfortunately often used to display the results.\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig6r_e.png)\n\n#### Instead of paired bar plots, try scatter plots\n\nThere are better ways of showing these data to illustrate that there is an increase after treatment. One is to simply make a scatter plot, which shows that most points are above the identity line. Another alternative is to plot the differences against the before values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\nbefore <- runif(6, 5, 8)\nafter <- rnorm(6, before * 1.15, 2)\nli <- range(c(before, after))\nymx <- max(abs(after - before))\n\npar(mfrow = c(1, 2))\nplot(before, after,\n xlab = \"Before\", ylab = \"After\",\n ylim = li, xlim = li\n)\nabline(0, 1, lty = 2, col = 1)\n\nplot(before, after - before,\n xlab = \"Before\", ylim = c(-ymx, ymx),\n ylab = \"Change (After - Before)\", lwd = 2\n)\nabline(h = 0, lty = 2, col = 1)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\n#### or line plots\n\nLine plots are not a bad choice, although they can be harder to follow than the previous two. 
Boxplots show you the increase, but lose the paired information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- rep(c(0, 1), rep(6, 2))\npar(mfrow = c(1, 2))\nplot(z, c(before, after),\n xaxt = \"n\", ylab = \"Response\",\n xlab = \"\", xlim = c(-0.5, 1.5)\n)\naxis(side = 1, at = c(0, 1), c(\"Before\", \"After\"))\nsegments(rep(0, 6), before, rep(1, 6), after, col = 1)\n\nboxplot(before, after,\n names = c(\"Before\", \"After\"),\n ylab = \"Response\"\n)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-22-1.png){width=672}\n:::\n:::\n\n\n\n\nThe above plot was made using [`ggpubr::ggpaired()`](https://rpkgs.datanovia.com/ggpubr/reference/ggpaired.html). Note that the title of the package is:\n\n> ggpubr: ‘ggplot2’ Based Publication Ready Plots\n\n### Gratuitous 3D\n\nThe figure below shows three curves. Pseudo 3D is used, but it is not clear why. Maybe to separate the three curves? Notice how difficult it is to determine the values of the curves at any given point:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig8b.png)\n\nThis plot can be made better by simply using color to distinguish the three lines:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- read_csv(\"https://github.com/kbroman/Talk_Graphs/raw/master/R/fig8dat.csv\") %>%\n as_tibble(.name_repair = make.names)\n\np <- x %>%\n gather(\"drug\", \"proportion\", -log.dose) %>%\n ggplot(aes(\n x = log.dose, y = proportion,\n color = drug\n )) +\n geom_line()\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n:::\n\n\nThis plot demonstrates that using color is more than enough to distinguish the three lines.\n\nWe can make this plot better using the functions we learned above\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15)\n )\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-24-1.png){width=672}\n:::\n:::\n\n\n#### Legends\n\nWe can also move the legend inside the plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2\n3.5.0.\nℹ Please use the `legend.position.inside` argument of `theme()` instead.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-25-1.png){width=672}\n:::\n:::\n\n\nWe can also make the legend transparent\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransparent_legend <- theme(\n legend.background = element_rect(fill = \"transparent\"),\n legend.key = element_rect(\n fill = \"transparent\",\n color = \"transparent\"\n )\n)\n\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n ) +\n transparent_legend\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-26-1.png){width=672}\n:::\n:::\n\n\n### Too many significant digits\n\nBy default, statistical software like R returns many significant digits. This does not mean we should report them. 
Cutting and pasting directly from R is a bad idea since you might end up showing a table, such as the one below, comparing the heights of basketball players:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nheights <- cbind(\n rnorm(8, 73, 3), rnorm(8, 73, 3), rnorm(8, 80, 3),\n rnorm(8, 78, 3), rnorm(8, 78, 3)\n)\ncolnames(heights) <- c(\"SG\", \"PG\", \"C\", \"PF\", \"SF\")\nrownames(heights) <- paste(\"team\", 1:8)\nheights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n SG PG C PF SF\nteam 1 68.88065 73.07480 81.80948 76.60455 82.23521\nteam 2 70.05272 66.86024 74.64847 72.70140 78.55640\nteam 3 71.33653 73.63946 81.00483 78.56787 77.86893\nteam 4 73.36414 81.01021 81.68293 76.90146 77.35226\nteam 5 72.63738 69.31895 83.66281 81.17280 82.39133\nteam 6 68.99188 75.50274 79.36564 75.77514 78.68900\nteam 7 73.51017 74.59772 82.09829 73.95492 78.32287\nteam 8 73.46524 71.05953 77.88069 76.44808 73.86569\n```\n\n\n:::\n:::\n\n\nWe are reporting precision up to 0.00001 inches. Do you know of a tape measure with that much precision? This can be easily remedied:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(heights, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n SG PG C PF SF\nteam 1 68.9 73.1 81.8 76.6 82.2\nteam 2 70.1 66.9 74.6 72.7 78.6\nteam 3 71.3 73.6 81.0 78.6 77.9\nteam 4 73.4 81.0 81.7 76.9 77.4\nteam 5 72.6 69.3 83.7 81.2 82.4\nteam 6 69.0 75.5 79.4 75.8 78.7\nteam 7 73.5 74.6 82.1 74.0 78.3\nteam 8 73.5 71.1 77.9 76.4 73.9\n```\n\n\n:::\n:::\n\n\n### Minimal figure captions\n\nRecall the plot we had before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransparent_legend <- theme(\n legend.background = element_rect(fill = \"transparent\"),\n legend.key = element_rect(\n fill = \"transparent\",\n color = \"transparent\"\n )\n)\n\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n ) +\n xlab(\"dose (mg)\") +\n transparent_legend\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWhat type of caption would be good here?\n\nWhen creating figure captions, think about the following:\n\n1. Be specific\n\n> A plot of the proportion of patients who survived after three drug treatments.\n\n2. Label the caption\n\n> Figure 1. A plot of the proportion of patients who survived after three drug treatments.\n\n3. Tell a story\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments.\n\n4. Include units\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments (milligram).\n\n5. Explain aesthetics\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments (milligram). Three colors represent three drug treatments. 
Drug A results in largest survival proportion for the larger drug doses.\n\n## Final thoughts data viz\n\nIn general, you should follow these principles:\n\n- Create expository graphs to tell a story (figure and caption should be self-sufficient; it's the first thing people look at)\n\n - Be accurate and clear\n - Let the data speak\n - Make axes, labels and titles big\n - Make labels full names (ideally with units when appropriate)\n - Add informative legends; use space effectively\n\n- Show as much information as possible, taking care not to obscure the message\n\n- Science not sales: avoid unnecessary frills (especially gratuitous 3D)\n\n- In tables, every digit should be meaningful\n\n### Some further reading\n\n- N Cross (2011). Design Thinking: Understanding How Designers Think and Work. Bloomsbury Publishing.\n- J Tukey (1977). Exploratory Data Analysis.\n- ER Tufte (1983) The visual display of quantitative information. Graphics Press.\n- ER Tufte (1990) Envisioning information. Graphics Press.\n- ER Tufte (1997) Visual explanations. Graphics Press.\n- ER Tufte (2006) Beautiful Evidence. Graphics Press.\n- WS Cleveland (1993) Visualizing data. Hobart Press.\n- WS Cleveland (1994) The elements of graphing data. CRC Press.\n- A Gelman, C Pasarica, R Dodhia (2002) Let's practice what we preach: Turning tables into graphs. The American Statistician 56:121-130.\n- NB Robbins (2004) Creating more effective graphs. Wiley.\n- [Nature Methods columns](http://bang.clearscience.info/?p=546)\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-08-20\n pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.0.5 2022-11-15 [1] CRAN (R 4.4.0)\n bit64 4.0.5 2020-08-30 [1] CRAN (R 4.4.0)\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)\n curl 5.2.1 2024-03-01 [1] CRAN (R 4.4.0)\n digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n ggthemes 5.1.0 2024-02-10 [1] CRAN (R 4.4.0)\n glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n munsell 
0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)\n withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)\n xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"24 - Best practices for data analyses\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"A noncomprehensive set of best practices for building data analyses\"\ncategories: [module 6, week 8, best practices]\n---\n\n\n*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. 
Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing/commits/main/posts/24-best-practices-data-analyses/index.qmd).*\n\n# Pre-lecture materials\n\n### Acknowledgements\n\nMaterial for this lecture was borrowed and adopted from\n\n- \n- \n- [Sharing biological data: why, when, and how](https://febs.onlinelibrary.wiley.com/doi/10.1002/1873-3468.14067)\n- \n- \n\n# Learning objectives\n\n::: callout-note\n# Learning objectives\n\n**At the end of this lesson you will:**\n\nBe able to state best practices for:\n\n- Considerations around building ethical data analyses\n- Sharing data\n- Creating data visualizations\n:::\n\n\n::: {.cell}\n\n:::\n\n\n# Best practices for data ethics\n\nIn philosophy departments, classes and modules centered around **data ethics** are widely discussed.\n\nThe ethical challenges around working with data are not fundamentally different from the ethical challenges philosophers have always faced.\n\nHowever, putting an ethical framework around building data analyses in practice is indeed new for most data scientists, and for many of us, we are woefully under-prepared to teach so far outside our comfort zone.\n\nThat being said, we can provide some thoughts on how to approach a data science problem using a philosophical lens.\n\n## Defining ethics\n\nWe start with a grounding in the definition of Ethics:\n\n**Ethics**, also called moral philosophy, has three main branches:\n\n1. [Applied ethics](https://www.oxfordbibliographies.com/view/document/obo-9780195396577/obo-9780195396577-0006.xml) \"is a branch of ethics devoted to the treatment of moral problems, practices, and policies in personal life, professions, technology, and government.\"\n2. [Ethical theory](https://academic.oup.com/edited-volume/35492/chapter-abstract/304418208?redirectedFrom=fulltext&login=false) \"is concerned with the articulation and the justification of the fundamental principles that govern the issues of how we should live and what we morally ought to do. Its most general concerns are providing an account of moral evaluation and, possibly, articulating a decision procedure to guide moral action.\"\n3. 
[Metaethics](https://plato.stanford.edu/entries/metaethics/) \"is the attempt to understand the metaphysical, epistemological, semantic, and psychological, presuppositions and commitments of moral thought, talk, and practice.\"\n\nWhile, unfortunately, there are myriad examples of **ethical data science problems** (see, for example, blog posts [bookclub](https://teachdatascience.com/bookclub/) and [data feminism](https://teachdatascience.com/datafem/)), here I aim to connect some of the broader data science ethics issues with the existing philosophical literature.\n\nNote, I am only scratching the surface and a deeper dive might involve education in related philosophical fields (epistemology, metaphysics, or philosophy of science), philosophical methodologies, and ethical schools of thought, but you can peruse all of these through, for example, a course or readings introducing the discipline of philosophy.\n\nBelow we provide some thoughts on how to approach a data science problem using a philosophical lens.\n\n## Case Study\n\nWe begin by considering a case study around ethical data analyses.\n\nMany **ethics case studies** provided in a classroom setting **describe algorithms built on data which are meant to predict outcomes**.\n\n::: callout-tip\n### Note\n\nLarge scale algorithmic decision making presents particular ethical predicaments because of both the scale of impact and the \"black-box\" sense of how the algorithm is generating predictions.\n:::\n\nConsider the well-known issue of using [facial recognition software](https://en.wikipedia.org/wiki/Facial_recognition_system) in policing.\n\nThere are many questions surrounding the policing issue:\n\n- What are the action options with respect to the outcome of the algorithm?\n- What are the good and bad aspects of each action and how are these to be weighed against each other?\n\n![](https://teachdatascience.com/philosophy/LAPD.png)\n\n\\[Source: [CNN](https://www.cnn.com/2019/09/12/tech/california-body-cam-facial-recognition-ban/index.html)\\]\n\n::: callout-tip\n### Important questions\n\nThe two main ethical concerns surrounding facial recognition software break down into\n\n- How the algorithms were developed?\n- How the algorithm is used?\n:::\n\nWhen thinking about the questions below, reflect on the good aspects and the bad aspects and how one might weight the good versus the bad.\n\n### Creating the algorithm\n\n- What data should be used to train the algorithm?\n - If the accuracy rates of the algorithm differ based on the demographics of the subgroups within the data, is more data and testing required?\n- Who and what criteria should be used to tune the algorithm?\n - Who should be involved in decisions on the tuning parameters of the algorithm?\n - Which optimization criteria should be used (e.g., accuracy? false positive rate? false negative rate?)\n- Issues of access:\n - Who should own or have control of the facial image data?\n - Do individuals have a right to keep their facial image private from being in databases?\n - Do individuals have a right to be notified that their facial image is in the data base? For example, if I ring someone's doorbell and my face is captured in a database, do I need to be told? \\[While traditional human subjects and IRB requirements necessitate consent to be included in any research project, in most cases it is legal to photograph a person without their consent.\\]\n - Should the data be accessible to researchers working to make the field more equitable? 
What if allowing accessibility thereby makes the data accessible to bad actors?\n\n### Using the algorithm\n\n- Issues of personal impact:\n - The software might make it easier to accurately associate an individual with a crime, but it might also make it easier to mistakenly associate an individual with a crime. How should the pro vs con be weighed against each other?\n - Do individuals have a right to know, correct, or delete personal information included in a database?\n- Issues of societal impact:\n - Is it permissible to use a facial recognition software which has been trained primarily on faces of European ancestry individual, given that this results in false positive and false negative rates that are not equally dispersed across racial lines?\n - While the software might make it easier to protect against criminal activity, it also makes it easier to undermine specific communities when their members are mistakenly identified with criminal activity. How should the pro vs con of different communities be weighed against each other?\n- Issues of money:\n - Is it permissible for a software company to profit from an algorithm while having no financial responsibility for its misuse or negative impacts?\n - Who should pay the court fees and missed work hours of those who were mistakenly accused of crimes?\n\nTo settle the questions above, we need to study various ethical theories, and it turns out that the different theories may lead us to different conclusions. As non-philosophers, we recognize that the suggested readings and ideas may come across as overwhelming. If you are overwhelmed, we suggest that you choose one ethical theory, think carefully about how it informs decision making, and help your students to connect the ethical framework to a data science case study.\n\n## Final thoughts\n\nThis is a challenging topic, but as you analyze data, ask yourself the following broad questions to help you with ethical considerations around the data analysis.\n\n::: callout-tip\n### Questions to ask yourself when analyzing data?\n\n1. Why are we producing this knowledge?\n2. For whom are we producing this knowledge?\n3. What communities do they serve?\n4. Which stakeholders need to be involved in making decisions in and around the data analysis?\n:::\n\n
\n\n> For more on this line of thought read this book by @CCriadoPerez or other books 📚 https://t.co/ws3ctX3ijr pic.twitter.com/8IYmCpxeGx\n>\n> — 🇲🇽 Leonardo Collado-Torres (@lcolladotor) April 2, 2024\n\n
\n\n\n\n# Best practices for sharing data\n\nData sharing is an essential element of the scientific method, imperative to ensure transparency and reproducibility.\n\nDifferent areas of research collect fundamentally different types of data, such as tabular data, time series data, image data, or genomic data. These types of data differ greatly in size and require different approaches for sharing.\n\nIn this section, I outline broad best practices to make your data publicly accessible and usable, generally and for several specific kinds of data.\n\n## FAIR principles\n\nSharing data proves more useful when others can easily find and access, interpret, and reuse the data. To maximize the benefit of sharing your data, follow the [findable, accessible, interoperable, and reusable (FAIR)](https://www.go-fair.org/fair-principles/) guiding principles of data sharing, which optimize reuse of generated data.\n\n::: callout-tip\n### FAIR data sharing principles\n\n1. **Findable**. The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.\n2. **Accessible**. Once the user finds the required data, she/he needs to know how can they be accessed, possibly including authentication and authorization.\n3. **Interoperable**. The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.\n4. **Reusable**. The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.\n:::\n\n\n\n## Why share?\n\n1. **Benefits of sharing data to science and society**. Sharing data allows for transparency in scientific studies and allows one to fully understand what occurred in an analysis and reproduce the results. Without complete data, metadata, and information about resources used to generate the data, reproducing a study proves impossible.\n2. **Benefits of sharing data to individual researchers**. Sharing data increases the impact of a researcher's work and reputation for sound science. Awards for those with an excellent record of [data sharing](https://researchsymbionts.org/) or [data reuse](https://researchparasite.com/) can exemplify this reputation.\n\n\n\n\n\n### Addressing common concerns about data sharing\n\nDespite the clear benefits of sharing data, some researchers still have concerns about doing so.\n\n- **Novelty**. Some worry that sharing data may decrease the novelty of their work and their chance to publish in prominent journals. You can address this concern by sharing your data only after publication. You can also choose to preprint your manuscript when you decide to share your data. Furthermore, you only need to share the data and metadata required to reproduce your published study.\n- **Time spent on sharing data**. Some have concerns about the time it takes to organize and share data publicly. Many add 'data available upon request' to manuscripts instead of depositing the data in a public repository in hopes of getting the work out sooner. It does take time to organize data in preparation for sharing, but sharing data publicly may save you time. 
Sharing data in a public repository that guarantees archival persistence means that you will not have to worry about storing and backing up the data yourself.\n- **Human subject data**. Sharing data on human subjects requires special ethical, legal, and privacy considerations. Existing recommendations largely aim to balance the privacy of human participants against the benefits of data sharing by de-identifying participants and obtaining consent for sharing. Sharing human data poses a variety of challenges for analysis, transparency, reproducibility, interoperability, and access.\n\n::: callout-tip\n### Human data\n\nSometimes you cannot publicly post all human data, even after de-identification. We suggest three strategies for making these data maximally accessible.\n\n1. Deposit raw data files in a controlled-access repository. Controlled-access repositories grant access only to qualified researchers who apply for it.\n - Example: the database of Genotypes and Phenotypes (dbGaP) \n2. Even if you cannot make individual-level raw data available, you can make as much processed data available as possible. This may take the form of summary statistics, such as means and standard deviations, rather than individual-level data.\n - Example: GWAS summary data such as those available from \n3. You may want to generate simulated data distinct from the original data but statistically similar to it. Simulated data would allow others to reproduce your analysis without disclosing the original data or requiring the security controls needed for controlled access.\n:::\n\n## What data to share?\n\nDepending on the data type, you might be able to share the data itself or a summarized version of it. Broadly speaking, you want to share the following:\n\n1. The **data** itself, a summarized version, or simulated data similar to the original.\n2. Any **metadata** that describe the primary data and the resources used to generate it. Most disciplines have specific metadata standards to follow (e.g. [MINSEQE for sequencing experiments](http://fged.org/projects/minseqe/)).\n3. **Data dictionary**. A data dictionary plays a crucial role in organizing your data, especially in explaining the variables and their representation. It should provide a short name for each variable, a longer text label, a definition, the data type (such as floating-point number, integer, or string), the measurement units, and the expected minimum and maximum values. A data dictionary makes explicit what future users would otherwise have to guess about the representation of the data.\n - You have gotten used to seeing the *Tidy Tuesday* data dictionaries such as \n4. **Source code**. Ideally, readers should have all materials needed to completely reproduce the study described in a publication, not just the data. These materials include source code as well as preprocessing and analysis scripts. Guidelines for organizing computational projects can help you arrange your data and scripts in a way that makes it easier for you and others to access and reuse them.\n - See for how we organize code in my team.\n - See also for how to deposit your GitHub repository on Zenodo and get a Digital Object Identifier (DOI) that others can cite. Example \n5. **Licensing**. Clear licensing information attached to your data avoids any questions of whether others may reuse it. 
Many data resources turn out not to be as reusable as the providers intended, due to lack of clarity in licensing or restrictive licensing choices.\n - See for more on how to choose a license\n\n::: callout-tip\n### How should you document your data?\n\nDocument your data in three ways:\n\n1. **With your manuscript**.\n2. **With description fields** in the metadata collected by repositories\n3. **With README files**. README files provide abbreviated information about a collection of files (e.g. explain organization, file locations, observations and variables present in each file, details on the experimental design, etc).\n:::\n\n# Best git/GitHub practices when adapting code\n\nThe following material is from a [LIBD rstats club](https://research.libd.org/rstatsclub/) presentation I gave in 2023-09-15.\n\n\n\n*I've adapted some of the content from the [public notes](https://docs.google.com/document/d/1Hi5KN6aC0t6iF3jkXJ0r72NPW8X3sZeTOImiEFQB66M/edit?usp=sharing).*\n\n## Adapted from (file path) strategy\n\n- Examples: \n\nPros:\n\n- Simple, you likely know the file path to which script you are adapting.\n\nCons:\n\n- File paths change over time\n- Not everyone will have access to older files\n\n## Adapted from (GitHub URL) strategy\n\n- Examples: \n\nPros:\n\n- You likely know the URL to the script you are adapting\n\nCons:\n\n- URLs change over time: files get renamed, code gets moved around\n - It's best to use GitHub permalinks documented [here](https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files)\n- Not everyone will have access to private GitHub repos\n\n## Current recommended strategy\n\n1. Copy code\n\n- Ideally full script.\n- Note that if you are adapting a full GitHub repository it's then best to *fork it*.\n\n2. Prior to making any changes, version control your copy.\n\n- On the commit message provide the [GitHub permalink](https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files) to the original source\n- Use the [GitHub co-authored commit message syntax](https://github.blog/news-insights/product-news/commit-together-with-co-authors/) to provide credit to others\n - Enables downstream use of the *GitHub contributors graph* like at \n\n3. Make edits and version control as you see fit\n4. (highly recommended) Auto-style code using `styler` as shown here: \n\n- Save the auto-style changes on their own commit message to avoid mixing auto-style changes with actual changes you made. Otherwise you'll later have trouble distinguishing the two types of changes.\n\nPros:\n\n- Most future proof version.\n\n- Enables downstream use of the GitHub contributor graph instead of relying on pure memory.\n\nCons:\n\n- Technically it involves a few steps which might take a bit of time getting used to.\n\n# Best practices for data visualizations\n\n## Motivation\n\n::: callout-tip\n### Quote from one of Roger Peng's heroes\n\n*\"The greatest value of a picture is when it forces us to notice what we never expected to see.\"* -John W. Tukey\n\n\n::: {.cell .fig-cap-location-top layout-align=\"center\"}\n::: {.cell-output-display}\n![](http://upload.wikimedia.org/wikipedia/en/e/e9/John_Tukey.jpg){fig-align='center'}\n:::\n:::\n\n:::\n\nMistakes, biases, systematic errors and unexpected variability are commonly found in data regardless of applications. 
Failure to discover these problems often leads to **flawed analyses and false discoveries**.\n\nAs an example, consider that measurement devices sometimes fail and not all summarization procedures, such as the `mean()` function in R, are designed to detect these. Yet, these functions will still give you an answer.\n\nFurthermore, it may be hard or impossible to notice an error was made just from the reported summaries.\n\n**Data visualization is a powerful approach to detecting these problems**. We refer to this particular task as exploratory data analysis (EDA), coined by John Tukey.\n\nOn a more positive note, data visualization can also lead to discoveries which would otherwise be missed if we simply subject the data to a battery of statistical summaries or procedures.\n\nWhen analyzing data, we often **make use of exploratory plots to motivate the analyses** we choose.\n\nIn this section, we will discuss some types of plots to avoid, better ways to visualize data, some principles to create good plots, and ways to use `ggplot2` to create **expository** (intended to explain or describe something) graphs.\n\n::: callout-tip\n### Example\n\nThe following figure is from [Lippmann et al. 2006](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1665439/):\n\n![Nickel concentration and PM10 health effects (Blue points represent average county-level concentrations from 2000--2005 for 72 U.S. counties representing 69 communities).](../../images/lippman.png){width=\"70%\"}\n\nThe following figure is from [Dominici et al. 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2137127/), in response to the work by Lippmann et al. above.\n\n![Nickel concentration and PM10 health effects (with and without New York).](../../images/dominici_ehp.png){width=\"70%\"}\n\nElevated levels of Ni and V PM2.5 chemical components in New York are likely attributed to oil-fired power plants and emissions from ships burning oil, as noted by Lippmann et al. (2006).\n:::\n\n### Generating data visualizations\n\nIn order to determine the effectiveness or quality of a visualization, we need to first understand three things:\n\n::: callout-tip\n### Questions to ask yourself when building data visualizations\n\n1. What is the question we are trying to answer?\n2. Why are we building this visualization?\n3. For whom are we producing this data visualization for? Who is the intended audience to consume this visualization?\n:::\n\nNo plot (or any statistical tool, really) can be judged without knowing the answers to those questions. No plot or graphic exists in a vacuum. There is always context and other surrounding factors that play a role in determining a plot's effectiveness.\n\nConversely, **high-quality, well-made visualizations** usually allow one to properly deduce what question is being asked and who the audience is meant to be. A good visualization **tells a complete story in a single frame**.\n\n::: callout-tip\n## Broad steps for creating data visualizations\n\nThe act of visualizing data typically proceeds in two broad steps:\n\n1. Given the question and the audience, **what type of plot should I make?**\n2. Given the plot I intend to make, **how can I optimize it for clarity and effectiveness?**\n:::\n\nAgain I highly recommend checking the content made by [Christine Zhang](https://christineyzhang.com/)!\n\n## Data viz principles\n\n### Developing plots\n\nInitially, one must decide what information should be presented. 
The following principles for developing analytic graphics come from Edward Tufte's book [*Beautiful Evidence*](https://www.edwardtufte.com/tufte/books_be).\n\n1. Show comparisons\n2. Show causality, mechanism, explanation\n3. Show multivariate data\n4. Integrate multiple modes of evidence\n5. Describe and document the evidence\n6. Content is king - good plots start with good questions\n\n### Optimizing plots\n\n1. Maximize the data/ink ratio -- if \"ink\" can be removed without reducing the information being communicated, then it should be removed.\n2. Maximize the range of perceptual conditions -- your audience's perceptual abilities may not be fully known, so it's best to allow for a wide range, to the extent possible (or knowable).\n3. Show variation in the **data**, not variation in the **design**.\n\nWhat's sub-optimal about this plot?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nd <- airquality %>%\n mutate(Summer = ifelse(Month %in% c(7, 8, 9), 2, 3))\nwith(d, {\n plot(Temp, Ozone, col = unclass(Summer), pch = 19, frame.plot = FALSE)\n legend(\"topleft\",\n col = 2:3, pch = 19, bty = \"n\",\n legend = c(\"Summer\", \"Non-Summer\")\n )\n})\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nWhat's sub-optimal about this plot?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nairquality %>%\n mutate(Summer = ifelse(Month %in% c(7, 8, 9),\n \"Summer\", \"Non-Summer\"\n )) %>%\n ggplot(aes(Temp, Ozone)) +\n geom_point(aes(color = Summer), size = 2) +\n theme_minimal()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\nSome of these principles are taken from Edward Tufte's *Visual Display of Quantitative Information*:\n\n## Plots to Avoid\n\nThis section is based on a talk by [Karl W. Broman](http://kbroman.org/) titled [\"How to Display Data Badly\"](https://www.biostat.wisc.edu/~kbroman/presentations/graphs_cmp2014.pdf), in which he described how the default plots offered by Microsoft Excel \"obscure your data and annoy your readers\" ([here](https://kbroman.org/talks.html) is a link to a collection of Karl Broman's talks).\n\n::: callout-tip\n### FYI\n\nKarl's lecture was inspired by the 1984 paper by H. Wainer: How to display data badly. American Statistician 38(2): 137--147.\n\nDr. Wainer was the first to elucidate the principles of the bad display of data.\n\nHowever, according to Karl Broman, \"The now widespread use of Microsoft Excel has resulted in remarkable advances in the field.\"\n\nHere we show examples of \"bad plots\" and how to improve them in R.\n:::\n\n::: callout-tip\n### Some general principles of *bad* plots\n\n- Display as little information as possible.\n- Obscure what you do show (with chart junk).\n- Use pseudo-3D and color gratuitously.\n- Make a pie chart (preferably in color and 3D).\n- Use a poorly chosen scale.\n- Ignore significant figures.\n:::\n\n## Examples\n\nHere are some examples of bad plots and suggestions on how to improve\n\n### Pie charts\n\nLet's say we are interested in the most commonly used browsers. 
Wikipedia has a [table](https://en.wikipedia.org/wiki/Usage_share_of_web_browsers) with the \"usage share of web browsers\" or the proportion of visitors to a group of web sites that use a particular web browser from July 2017.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbrowsers <- c(\n Chrome = 60, Safari = 14, UCBrowser = 7,\n Firefox = 5, Opera = 3, IE = 3, Noinfo = 8\n)\nbrowsers.df <- gather(\n data.frame(t(browsers)),\n \"browser\", \"proportion\"\n)\n```\n:::\n\n\nLet's say we want to report the results of the usage. The standard way of displaying these is with a pie chart:\n\n\n::: {.cell}\n\n```{.r .cell-code}\npie(browsers, main = \"Browser Usage (July 2022)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-6-1.png){width=672}\n:::\n:::\n\n\nIf we look at the help file for `pie()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?pie\n```\n:::\n\n\nIt states:\n\n> \"Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.\"\n\nTo see this, look at the figure above and try to determine the percentages just from looking at the plot. Unless the percentages are close to 25%, 50% or 75%, this is not so easy. Simply showing the numbers is not only clear, but also saves on printing costs.\n\nHaving said that, see how we used a pie chart in a [2023 pre-print](https://doi.org/10.1101/2023.02.15.528722) . The published version is accessible from doi [10.1126/science.adh1938](http://dx.doi.org/10.1126/science.adh1938).\n\n#### Instead of pie charts, try bar plots\n\nIf you do want to plot them, then a barplot is appropriate. Here we use the `geom_bar()` function in `ggplot2`. Note, there are also horizontal lines at every multiple of 10, which helps the eye quickly make comparisons across:\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- browsers.df %>%\n ggplot(aes(\n x = reorder(browser, -proportion),\n y = proportion\n )) +\n geom_bar(stat = \"identity\")\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nNotice that we can now pretty easily determine the percentages by following a horizontal line to the x-axis.\n\n#### Polish your plots\n\nWhile this figure is already a big improvement over a pie chart, we can do even better. When you create figures, you want your figures to be self-sufficient, meaning someone looking at the plot can understand everything about it.\n\nSome possible critiques are:\n\n1. make the axes bigger\n2. make the labels bigger\n3. make the labels be full names (e.g. \"Browser\" and \"Proportion of users\", ideally with units when appropriate)\n4. add a title\n\nLet's explore how to do these things to make an even better figure.\n\nTo start, go to the help file for `theme()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?ggplot2::theme\n```\n:::\n\n\nWe see there are arguments with text that control all the text sizes in the plot. If you scroll down, you see the text argument in the theme command requires class `element_text`. 
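\n\nIn other words, each text-related `theme()` argument takes an `element_text()` object, whose fields (size, colour, face, angle, justification) you set as needed. Here is a minimal sketch reusing the `p` object from above:\n\n```r\n# A sketch: text-related theme() arguments take element_text() objects\n# (reuses the ggplot object `p` built above; not evaluated here)\np + theme(\n    text = element_text(size = 14), # all text elements at once\n    axis.title = element_text(face = \"bold\"), # axis titles only\n    axis.text.x = element_text(angle = 45, hjust = 1) # tilt the x-axis tick labels\n)\n```\n\nThe next few chunks build this up one piece at a time. 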
Let's try it out.\n\nTo change the x-axis and y-axis labels to be full names, use `xlab()` and `ylab()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- p + xlab(\"Browser\") +\n ylab(\"Proportion of Users\")\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nMaybe a title\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Browser Usage (July 2022)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-11-1.png){width=672}\n:::\n:::\n\n\nNext, we can also use the `theme()` function in `ggplot2` to control the justifications and sizes of the axes, labels and titles.\n\nTo center the title\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Browser Usage (July 2022)\") +\n theme(plot.title = element_text(hjust = 0.5))\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\nTo create bigger text/labels/titles:\n\n\n::: {.cell}\n\n```{.r .cell-code}\np <- p + ggtitle(\"Browser Usage (July 2022)\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15)\n )\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-13-1.png){width=672}\n:::\n:::\n\n\n#### \"I don't like that theme\"\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_bw()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-14-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_dark()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-15-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + theme_classic() # axis lines!\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggthemes::theme_base()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-17-1.png){width=672}\n:::\n:::\n\n\n### 3D barplots\n\nPlease, avoid a 3D version because it obfuscates the plot, making it more difficult to find the percentages by eye.\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig2b.png)\n\n### Donut plots\n\nEven worse than pie charts are donut plots.\n\n![](http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Donut-Chart.svg/360px-Donut-Chart.svg.png)\n\nThe reason is that by removing the center, we remove one of the visual cues for determining the different areas: the angles. **There is no reason to ever use a donut plot to display data**.\n\n::: callout-note\n### Question\n\nWhy are pie/donut charts [so common](https://blog.usejournal.com/why-humans-love-pie-charts-9cd346000bdc)?\n\n\n:::\n\n### Barplots as data summaries\n\nWhile barplots are useful for showing percentages, they are incorrectly used to display data from two groups being compared. Specifically, barplots are created with height equal to the group means; an antenna is added at the top to represent standard errors. This plot is simply showing two numbers per group and the plot adds nothing:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig1c.png)\n\n#### Instead of bar plots for summaries, try box plots\n\nIf the number of points is small enough, we might as well add them to the plot. When the number of points is too large for us to see them, just showing a boxplot is preferable.\n\nLet's recreate these barplots as boxplots and overlay the points. 
We will simulate similar data to demonstrate one way to improve the graphic above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\ndat <- data.frame(\n \"Treatment\" = rnorm(10, 30, sd = 4),\n \"Control\" = rnorm(10, 36, sd = 4)\n)\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\") +\n ggtitle(\"Response to drug treatment\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-18-1.png){width=672}\n:::\n:::\n\n\nNotice how much more we see here: the center, spread, range, and the points themselves. In the barplot, we only see the mean and the standard error (SE), and the SE has more to do with sample size than with the spread of the data.\n\nThis problem is magnified when our data has outliers or very large tails. For example, in the plot below, there appears to be very large and consistent differences between the two groups:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig3c.png)\n\nHowever, a quick look at the data demonstrates that this difference is mostly driven by just two points.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\ndat <- data.frame(\n \"Treatment\" = rgamma(10, 10, 1),\n \"Control\" = rgamma(10, 1, .01)\n)\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-19-1.png){width=672}\n:::\n:::\n\n\n#### Use log scale if data includes outliers\n\nA version showing the data in the log-scale is much more informative.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ngather(dat, \"type\", \"response\") %>%\n ggplot(aes(type, response)) +\n geom_boxplot() +\n geom_point(position = \"jitter\") +\n scale_y_log10()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n### Barplots for paired data\n\nA common task in data analysis is the comparison of two groups. When the dataset is small and data are paired, such as the outcomes before and after a treatment, two-color barplots are unfortunately often used to display the results.\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig6r_e.png)\n\n#### Instead of paired bar plots, try scatter plots\n\nThere are better ways of showing these data to illustrate that there is an increase after treatment. One is to simply make a scatter plot, which shows that most points are above the identity line. Another alternative is to plot the differences against the before values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1000)\nbefore <- runif(6, 5, 8)\nafter <- rnorm(6, before * 1.15, 2)\nli <- range(c(before, after))\nymx <- max(abs(after - before))\n\npar(mfrow = c(1, 2))\nplot(before, after,\n xlab = \"Before\", ylab = \"After\",\n ylim = li, xlim = li\n)\nabline(0, 1, lty = 2, col = 1)\n\nplot(before, after - before,\n xlab = \"Before\", ylim = c(-ymx, ymx),\n ylab = \"Change (After - Before)\", lwd = 2\n)\nabline(h = 0, lty = 2, col = 1)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-21-1.png){width=672}\n:::\n:::\n\n\n#### or line plots\n\nLine plots are not a bad choice, although they can be harder to follow than the previous two. 
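One convenient way to draw this kind of paired display is `ggpubr::ggpaired()`, which is mentioned again further below. Here is a hedged sketch using the simulated `before`/`after` vectors from the previous chunk (argument names as in the `ggpaired()` documentation):\n\n```r\n# A sketch: paired boxplot with one connecting line per subject\n# (uses the simulated `before`/`after` vectors from the chunk above; not evaluated here)\nlibrary(\"ggpubr\")\n\npaired_df <- data.frame(Before = before, After = after)\nggpaired(paired_df,\n    cond1 = \"Before\", cond2 = \"After\",\n    line.color = \"gray\", line.size = 0.4\n) +\n    ylab(\"Response\")\n```\n\n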
Boxplots show you the increase, but lose the paired information.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nz <- rep(c(0, 1), rep(6, 2))\npar(mfrow = c(1, 2))\nplot(z, c(before, after),\n xaxt = \"n\", ylab = \"Response\",\n xlab = \"\", xlim = c(-0.5, 1.5)\n)\naxis(side = 1, at = c(0, 1), c(\"Before\", \"After\"))\nsegments(rep(0, 6), before, rep(1, 6), after, col = 1)\n\nboxplot(before, after,\n names = c(\"Before\", \"After\"),\n ylab = \"Response\"\n)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-22-1.png){width=672}\n:::\n:::\n\n\n\n\nThe above plot was made using [`ggpubr::ggpaired()`](https://rpkgs.datanovia.com/ggpubr/reference/ggpaired.html). Note that the title of the package is:\n\n> ggpubr: ‘ggplot2’ Based Publication Ready Plots\n\n### Gratuitous 3D\n\nThe figure below shows three curves. Pseudo 3D is used, but it is not clear why. Maybe to separate the three curves? Notice how difficult it is to determine the values of the curves at any given point:\n\n![](https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig8b.png)\n\nThis plot can be made better by simply using color to distinguish the three lines:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx <- read_csv(\"https://github.com/kbroman/Talk_Graphs/raw/master/R/fig8dat.csv\") %>%\n as_tibble(.name_repair = make.names)\n\np <- x %>%\n gather(\"drug\", \"proportion\", -log.dose) %>%\n ggplot(aes(\n x = log.dose, y = proportion,\n color = drug\n )) +\n geom_line()\np\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-23-1.png){width=672}\n:::\n:::\n\n\nThis plot demonstrates that using color is more than enough to distinguish the three lines.\n\nWe can make this plot better using the functions we learned above\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15)\n )\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-24-1.png){width=672}\n:::\n:::\n\n\n#### Legends\n\nWe can also move the legend inside the plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2\n3.5.0.\nℹ Please use the `legend.position.inside` argument of `theme()` instead.\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-25-1.png){width=672}\n:::\n:::\n\n\nWe can also make the legend transparent\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransparent_legend <- theme(\n legend.background = element_rect(fill = \"transparent\"),\n legend.key = element_rect(\n fill = \"transparent\",\n color = \"transparent\"\n )\n)\n\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n ) +\n transparent_legend\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-26-1.png){width=672}\n:::\n:::\n\n\n### Too many significant digits\n\nBy default, statistical software like R returns many significant digits. This does not mean we should report them. 
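Base R gives you simple tools to keep this in check: `signif()` trims to a number of significant digits, `round()` trims to a number of decimal places, and the global `digits` option controls how many digits R prints by default. A quick sketch:\n\n```r\nx <- 123.456789\nsignif(x, 3) # 123: three significant digits\nround(x, 2) # 123.46: two decimal places\noptions(digits = 4)\nx # prints as 123.5 under the new session default\n```\n\n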
Cutting and pasting directly from R is a bad idea since you might end up showing a table, such as the one below, comparing the heights of basketball players:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nheights <- cbind(\n rnorm(8, 73, 3), rnorm(8, 73, 3), rnorm(8, 80, 3),\n rnorm(8, 78, 3), rnorm(8, 78, 3)\n)\ncolnames(heights) <- c(\"SG\", \"PG\", \"C\", \"PF\", \"SF\")\nrownames(heights) <- paste(\"team\", 1:8)\nheights\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n SG PG C PF SF\nteam 1 68.88065 73.07480 81.80948 76.60455 82.23521\nteam 2 70.05272 66.86024 74.64847 72.70140 78.55640\nteam 3 71.33653 73.63946 81.00483 78.56787 77.86893\nteam 4 73.36414 81.01021 81.68293 76.90146 77.35226\nteam 5 72.63738 69.31895 83.66281 81.17280 82.39133\nteam 6 68.99188 75.50274 79.36564 75.77514 78.68900\nteam 7 73.51017 74.59772 82.09829 73.95492 78.32287\nteam 8 73.46524 71.05953 77.88069 76.44808 73.86569\n```\n\n\n:::\n:::\n\n\nWe are reporting precision up to 0.00001 inches. Do you know of a tape measure with that much precision? This can be easily remedied:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nround(heights, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n SG PG C PF SF\nteam 1 68.9 73.1 81.8 76.6 82.2\nteam 2 70.1 66.9 74.6 72.7 78.6\nteam 3 71.3 73.6 81.0 78.6 77.9\nteam 4 73.4 81.0 81.7 76.9 77.4\nteam 5 72.6 69.3 83.7 81.2 82.4\nteam 6 69.0 75.5 79.4 75.8 78.7\nteam 7 73.5 74.6 82.1 74.0 78.3\nteam 8 73.5 71.1 77.9 76.4 73.9\n```\n\n\n:::\n:::\n\n\n### Minimal figure captions\n\nRecall the plot we had before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntransparent_legend <- theme(\n legend.background = element_rect(fill = \"transparent\"),\n legend.key = element_rect(\n fill = \"transparent\",\n color = \"transparent\"\n )\n)\n\np + ggtitle(\"Survival proportion\") +\n theme(\n plot.title = element_text(hjust = 0.5),\n text = element_text(size = 15),\n legend.position = c(0.2, 0.3)\n ) +\n xlab(\"dose (mg)\") +\n transparent_legend\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-29-1.png){width=672}\n:::\n:::\n\n\nWhat type of caption would be good here?\n\nWhen creating figure captions, think about the following:\n\n1. Be specific\n\n> A plot of the proportion of patients who survived after three drug treatments.\n\n2. Label the caption\n\n> Figure 1. A plot of the proportion of patients who survived after three drug treatments.\n\n3. Tell a story\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments.\n\n4. Include units\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments (milligram).\n\n5. Explain aesthetics\n\n> Figure 1. Drug treatment survival. A plot of the proportion of patients who survived after three drug treatments (milligram). Three colors represent three drug treatments. 
Drug A results in largest survival proportion for the larger drug doses.\n\n## Final thoughts data viz\n\nIn general, you should follow these principles:\n\n- Create expository graphs to tell a story (figure and caption should be self-sufficient; it's the first thing people look at)\n\n - Be accurate and clear\n - Let the data speak\n - Make axes, labels and titles big\n - Make labels full names (ideally with units when appropriate)\n - Add informative legends; use space effectively\n\n- Show as much information as possible, taking care not to obscure the message\n\n- Science not sales: avoid unnecessary frills (especially gratuitous 3D)\n\n- In tables, every digit should be meaningful\n\n### Some further reading\n\n- N Cross (2011). Design Thinking: Understanding How Designers Think and Work. Bloomsbury Publishing.\n- J Tukey (1977). Exploratory Data Analysis.\n- ER Tufte (1983) The visual display of quantitative information. Graphics Press.\n- ER Tufte (1990) Envisioning information. Graphics Press.\n- ER Tufte (1997) Visual explanations. Graphics Press.\n- ER Tufte (2006) Beautiful Evidence. Graphics Press.\n- WS Cleveland (1993) Visualizing data. Hobart Press.\n- WS Cleveland (1994) The elements of graphing data. CRC Press.\n- A Gelman, C Pasarica, R Dodhia (2002) Let's practice what we preach: Turning tables into graphs. The American Statistician 56:121-130.\n- NB Robbins (2004) Creating more effective graphs. Wiley.\n- [Nature Methods columns](http://bang.clearscience.info/?p=546)\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.4.1 (2024-06-14)\n os macOS Sonoma 14.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2024-10-14\n pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n bit 4.5.0 2024-09-20 [1] CRAN (R 4.4.1)\n bit64 4.5.2 2024-09-22 [1] CRAN (R 4.4.1)\n cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)\n colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)\n colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.0)\n crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.0)\n curl 5.2.3 2024-09-20 [1] CRAN (R 4.4.1)\n digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)\n dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)\n fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)\n farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)\n fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)\n ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n ggthemes 5.1.0 2024-02-10 [1] CRAN (R 4.4.0)\n glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)\n gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)\n htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)\n htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)\n jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)\n knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)\n lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)\n lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)\n 
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)\n munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)\n readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)\n rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.0)\n rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)\n scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)\n stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)\n stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)\n tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)\n timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)\n utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)\n vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)\n vroom 1.6.5 2023-12-05 [1] CRAN (R 4.4.0)\n withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.0)\n xfun 0.48 2024-10-03 [1] CRAN (R 4.4.1)\n yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/posts/24-best-practices-data-analyses/index.qmd b/posts/24-best-practices-data-analyses/index.qmd index 6d25c5f..dd064ab 100644 --- a/posts/24-best-practices-data-analyses/index.qmd +++ b/posts/24-best-practices-data-analyses/index.qmd @@ -148,6 +148,20 @@ This is a challenging topic, but as you analyze data, ask yourself the following 4. Which stakeholders need to be involved in making decisions in and around the data analysis? ::: + + + + # Best practices for sharing data Data sharing is an essential element of the scientific method, imperative to ensure transparency and reproducibility. @@ -230,9 +244,74 @@ Document your data in three ways: 3. **With README files**. README files provide abbreviated information about a collection of files (e.g. explain organization, file locations, observations and variables present in each file, details on the experimental design, etc). ::: +# Best git/GitHub practices when adapting code + +The following material is from a [LIBD rstats club](https://research.libd.org/rstatsclub/) presentation I gave in 2023-09-15. + + + +*I've adapted some of the content from the [public notes](https://docs.google.com/document/d/1Hi5KN6aC0t6iF3jkXJ0r72NPW8X3sZeTOImiEFQB66M/edit?usp=sharing).* + +## Adapted from (file path) strategy + +- Examples: + +Pros: + +- Simple, you likely know the file path to which script you are adapting. + +Cons: + +- File paths change over time +- Not everyone will have access to older files + +## Adapted from (GitHub URL) strategy + +- Examples: + +Pros: + +- You likely know the URL to the script you are adapting + +Cons: + +- URLs change over time: files get renamed, code gets moved around + - It's best to use GitHub permalinks documented [here](https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files) +- Not everyone will have access to private GitHub repos + +## Current recommended strategy + +1. Copy code + +- Ideally full script. +- Note that if you are adapting a full GitHub repository it's then best to *fork it*. + +2. 
Prior to making any changes, version control your copy. + +- On the commit message provide the [GitHub permalink](https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files) to the original source +- Use the [GitHub co-authored commit message syntax](https://github.blog/news-insights/product-news/commit-together-with-co-authors/) to provide credit to others + - Enables downstream use of the *GitHub contributors graph* like at + +3. Make edits and version control as you see fit +4. (highly recommended) Auto-style code using `styler` as shown here: + +- Save the auto-style changes on their own commit message to avoid mixing auto-style changes with actual changes you made. Otherwise you'll later have trouble distinguishing the two types of changes. + +Pros: + +- Most future proof version. + +- Enables downstream use of the GitHub contributor graph instead of relying on pure memory. + +Cons: + +- Technically it involves a few steps which might take a bit of time getting used to. + # Best practices for data visualizations -## Motiviation +## Motivation ::: callout-tip ### Quote from one of Roger Peng's heroes @@ -301,6 +380,8 @@ The act of visualizing data typically proceeds in two broad steps: 2. Given the plot I intend to make, **how can I optimize it for clarity and effectiveness?** ::: +Again I highly recommend checking the content made by [Christine Zhang](https://christineyzhang.com/)! + ## Data viz principles ### Developing plots