From e34a90fee88ae0a35ffd70cc9fec888e37590eef Mon Sep 17 00:00:00 2001
From: lcolladotor
Date: Thu, 14 Sep 2023 11:22:28 -0400
Subject: [PATCH] Try to reduce errors students are having by making the install.packages() commands conditional on whether you have the package installed or not.

---
 .../project-1/index/execute-results/html.json | 4 ++--
 projects/project-1/index.R | 14 +++++++++-----
 projects/project-1/index.qmd | 14 +++++++++-----
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/_freeze/projects/project-1/index/execute-results/html.json b/_freeze/projects/project-1/index/execute-results/html.json
index 130db7f..475ae49 100644
--- a/_freeze/projects/project-1/index/execute-results/html.json
+++ b/_freeze/projects/project-1/index/execute-results/html.json
@@ -1,7 +1,7 @@
 {
- "hash": "95efd36c9d72b6a2208f813a2e13d6b6",
+ "hash": "5b3e2f80ee8252d5759b6c120a7588c4",
 "result": {
- "markdown": "---\ntitle: \"Project 1\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Finding great chocolate bars!\"\ncategories: [project 1, projects]\n---\n\n\n*This project, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/projects/project-1/index.qmd).*\n\n# Background\n\n**Due date: Sept 16 at 11:59pm**\n\n### To submit your project\n\nPlease write up your project using R Markdown and `knitr`. Compile your document as an **HTML file** and submit your HTML file to the dropbox on Courseplus. 
Please **show all your code** for each of the answers to each part.\n\nTo get started, [watch this video on setting up your R Markdown document](https://www.stephaniehicks.com/jhustatcomputing2021/posts/2021-09-02-literate-programming/#create-and-knit-your-first-r-markdown-document).\n\n### Install `tidyverse`\n\nBefore attempting this assignment, you should first install the `tidyverse` package if you have not already. The `tidyverse` package is actually a collection of many packages that serves as a convenient way to install many packages without having to do them one by one. This can be done with the `install.packages()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\n```\n:::\n\n\nRunning this function will install a host of other packages so it make take a minute or two depending on how fast your computer is. Once you have installed it, you will want to load the package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n```\n:::\n\n\n### Data\n\nThat data for this part of the assignment comes from [TidyTuesday](https://www.tidytuesday.com), which is a weekly podcast and global [community activity](https://github.com/rfordatascience/tidytuesday) brought to you by the R4DS Online Learning Community. 
The goal of TidyTuesday is to help R learners learn in real-world contexts.\n\n![](https://github.com/rfordatascience/tidytuesday/raw/master/static/tt_logo.png){.preview-image}\n\n\\[**Source**: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/static/tt_logo.png)\\]\n\nIf we look at the [TidyTuesday github repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022#2022-data) from 2022, we see this dataset chocolate bar reviews.\n\nTo access the data, you need to install the `tidytuesdayR` R package and use the function `tt_load()` with the date of '2022-01-18' to load the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidytuesdayR\")\n```\n:::\n\n\nThis is how you can download the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntuesdata <- tidytuesdayR::tt_load(\"2022-01-18\")\nchocolate <- tuesdata$chocolate\n```\n:::\n\n\nHowever, if you use this code, you will hit an API limit after trying to compile the document a few times. Instead, I suggest you use the following code below. 
Here, I provide the code below for you to avoid re-downloading data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\nlibrary(tidyverse)\n\n# tests if a directory named \"data\" exists locally\nif (!dir.exists(here(\"data\"))) {\n dir.create(here(\"data\"))\n}\n\n# saves data only once (not each time you knit a R Markdown)\nif (!file.exists(here(\"data\", \"chocolate.RDS\"))) {\n url_csv <- \"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv\"\n chocolate <- readr::read_csv(url_csv)\n\n # save the file to RDS objects\n saveRDS(chocolate, file = here(\"data\", \"chocolate.RDS\"))\n}\n```\n:::\n\n\nHere we read in the `.RDS` dataset locally from our computing environment:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchocolate <- readRDS(here(\"data\", \"chocolate.RDS\"))\nas_tibble(chocolate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2,530 × 10\n ref company_manufacturer company_location review_date\n \n 1 2454 5150 U.S.A. 2019\n 2 2458 5150 U.S.A. 2019\n 3 2454 5150 U.S.A. 2019\n 4 2542 5150 U.S.A. 2021\n 5 2546 5150 U.S.A. 2021\n 6 2546 5150 U.S.A. 2021\n 7 2542 5150 U.S.A. 2021\n 8 797 A. Morin France 2012\n 9 797 A. Morin France 2012\n10 1011 A. 
Morin France 2013\n# ℹ 2,520 more rows\n# ℹ 6 more variables: country_of_bean_origin ,\n# specific_bean_origin_or_bar_name , cocoa_percent ,\n# ingredients , most_memorable_characteristics , rating \n```\n:::\n:::\n\n\nWe can take a glimpse at the data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(chocolate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 2,530\nColumns: 10\n$ ref 2454, 2458, 2454, 2542, 2546, 2546, 2…\n$ company_manufacturer \"5150\", \"5150\", \"5150\", \"5150\", \"5150…\n$ company_location \"U.S.A.\", \"U.S.A.\", \"U.S.A.\", \"U.S.A.…\n$ review_date 2019, 2019, 2019, 2021, 2021, 2021, 2…\n$ country_of_bean_origin \"Tanzania\", \"Dominican Republic\", \"Ma…\n$ specific_bean_origin_or_bar_name \"Kokoa Kamili, batch 1\", \"Zorzal, bat…\n$ cocoa_percent \"76%\", \"76%\", \"76%\", \"68%\", \"72%\", \"8…\n$ ingredients \"3- B,S,C\", \"3- B,S,C\", \"3- B,S,C\", \"…\n$ most_memorable_characteristics \"rich cocoa, fatty, bready\", \"cocoa, …\n$ rating 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…\n```\n:::\n:::\n\n\nHere is a data dictionary for what all the column names mean:\n\n- \n\n# Part 1: Explore data\n\nIn this part, use functions from `dplyr` and `ggplot2` to answer the following questions.\n\n1. Make a histogram of the `rating` scores to visualize the overall distribution of scores. Change the number of bins from the default to 10, 15, 20, and 25. Pick on the one that you think looks the best. Explain what the difference is when you change the number of bins and explain why you picked the one you did.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here and describe your answer afterwards\n```\n:::\n\n\nThe ratings are discrete values making the histogram look strange. When you make the bin size smaller, it aggregates the ratings together in larger groups removing that effect. I picked 15, but there really is no wrong answer. Just looking for an answer here.\n\n2. Consider the countries where the beans originated from. 
How many reviews come from each country of bean origin?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n3. What is average `rating` scores from reviews of chocolate bars that have Ecuador as `country_of_bean_origin` in this dataset? For this same set of reviews, also calculate (1) the total number of reviews and (2) the standard deviation of the `rating` scores. Your answer should be a new data frame with these three summary statistics in three columns. Label the name of these columns `mean`, `sd`, and `total`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n4. Which company (name) makes the best chocolate (or has the highest ratings on average) with beans from Ecuador?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n5. Calculate the average rating across all country of origins for beans. Which top 3 countries (for bean origin) have the highest ratings on average?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n6. Following up on the previous problem, now remove any countries of bean origins that have less than 10 chocolate bar reviews. Now, which top 3 countries have the highest ratings on average?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n7. For this last part, let's explore the relationship between percent chocolate and ratings.\n\nUse the functions in `dplyr`, `tidyr`, and `lubridate` to perform the following steps to the `chocolate` dataset:\n\n1. Identify the countries of bean origin with at least 50 reviews. Remove reviews from countries are not in this list.\n2. 
Using the variable describing the chocolate percentage for each review, create a new column that groups chocolate percentages into one of four groups: (i) \\<60%, (ii) \\>=60 to \\<70%, (iii) \\>=70 to \\<90%, and (iii) \\>=90% (**Hint** check out the `substr()` function in base R and the `case_when()` function from `dplyr` -- see example below).\n3. Using the new column described in #2, re-order the factor levels (if needed) to be starting with the smallest percentage group and increasing to the largest percentage group (**Hint** check out the `fct_relevel()` function from `forcats`).\n4. For each country, make a set of four side-by-side boxplots plotting the groups on the x-axis and the ratings on the y-axis. These plots should be faceted by country.\n\nOn average, which category of chocolate percentage is most highly rated? Do these countries mostly agree or are there disagreements?\n\n**Hint**: You may find the `case_when()` function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a `mutate()` call).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate some random numbers\ndat <- tibble(x = rnorm(100))\nslice(dat, 1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 1\n x\n \n1 0.481\n2 1.06 \n3 0.529\n```\n:::\n\n```{.r .cell-code}\n## Create a new column that indicates whether the value of 'x' is positive or negative\ndat %>%\n mutate(is_positive = case_when(\n x >= 0 ~ \"Yes\",\n x < 0 ~ \"No\"\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 100 × 2\n x is_positive\n \n 1 0.481 Yes \n 2 1.06 Yes \n 3 0.529 Yes \n 4 -0.221 No \n 5 -0.906 No \n 6 2.96 Yes \n 7 -0.0564 No \n 8 -0.931 No \n 9 -0.0624 No \n10 0.240 Yes \n# ℹ 90 more rows\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 2: Join two datasets together\n\nThe goal of this part of the assignment is to join two datasets together. 
`gapminder` is a [R package](https://cran.r-project.org/web/packages/gapminder/README.html) that contains an excerpt from the [Gapminder data](https://www.gapminder.org/data/).\n\n### Tasks\n\n1. Use this dataset it to create a new column called `continent` in our `chocolate` dataset that contains the continent name for each review where the country of bean origin is.\n2. Only keep reviews that have reviews from countries of bean origin with at least 10 reviews.\n3. Also, remove the country of bean origin named `\"Blend\"`.\n4. Make a set of violin plots with ratings on the y-axis and `continent`s on the x-axis.\n\n**Hint**:\n\n- Check to see if there are any `NA`s in the new column. If there are any `NA`s, add the continent name for each row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 3: Convert wide data into long data\n\nThe goal of this part of the assignment is to take a dataset that is either messy or simply not tidy and to make them tidy datasets. The objective is to gain some familiarity with the functions in the `dplyr`, `tidyr` packages. You may find it helpful to review the section on pivoting data from wide to long format and vice versa.\n\n### Tasks\n\nWe are going to create a set of features for us to plot over time. Use the functions in `dplyr` and `tidyr` to perform the following steps to the `chocolate` dataset:\n\n1. Create a new set of columns titled `beans`, `sugar`, `cocoa_butter`, `vanilla`, `letchin`, and `salt` that contain a 1 or 0 representing whether or not that review for the chocolate bar contained that ingredient (1) or not (0).\n2. Create a new set of columns titled `char_cocoa`, `char_sweet`, `char_nutty`, `char_creamy`, `char_roasty`, `char_earthy` that contain a 1 or 0 representing whether or not that the most memorable characteristic for the chocolate bar had that word (1) or not (0). 
For example, if the word \"sweet\" appears in the `most_memorable_characteristics`, then record a 1, otherwise a 0 for that review in the `char_sweet` column (**Hint**: check out `str_detect()` from the `stringr` package).\n3. For each year (i.e. `review_date`), calculate the mean value in each new column you created across all reviews for that year. (**Hint**: If all has gone well thus far, you should have a dataset with 16 rows and 13 columns).\n4. Convert this wide dataset into a long dataset with a new `feature` and `mean_score` column.\n\nIt should look something like this:\n\n``` \nreview_date feature mean_score\n \n2006 beans 0.967741935 \n2006 sugar 0.967741935 \n2006 cocoa_butter 0.903225806 \n2006 vanilla 0.693548387 \n2006 letchin 0.693548387 \n2006 salt 0.000000000 \n2006 char_cocoa 0.209677419 \n2006 char_sweet 0.161290323 \n2006 char_nutty 0.032258065 \n2006 char_creamy 0.241935484 \n```\n\n### Notes\n\n- You may need to use functions outside these packages to obtain this result.\n\n- Do not worry about the ordering of the rows or columns. Depending on whether you use `gather()` or `pivot_longer()`, the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 4: Data visualization\n\nIn this part of the project, we will continue to work with our now tidy song dataset from the previous part.\n\n### Tasks\n\nUse the functions in `ggplot2` package to make a scatter plot of the `mean_score`s (y-axis) over time (x-axis). One plot for each `mean_score`. For full credit, your plot should include:\n\n1. An overall title for the plot and a subtitle summarizing key trends that you found. Also include a caption in the figure with your name.\n2. Both the observed points for the `mean_score`, but also a smoothed non-linear pattern of the trend\n3. All plots should be shown in the one figure\n4. 
There should be an informative x-axis and y-axis label\n\nConsider playing around with the `theme()` function to make the figure shine, including playing with background colors, font, etc.\n\n### Notes\n\n- You may need to use functions outside these packages to obtain this result.\n\n- Don't worry about the ordering of the rows or columns. Depending on whether you use `gather()` or `pivot_longer()`, the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 5: Make the worst plot you can!\n\nThis sounds a bit crazy I know, but I want this to try and be FUN! Instead of trying to make a \"good\" plot, I want you to explore your creative side and make a really awful data visualization in every way. :)\n\n### Tasks\n\nUsing the `chocolate` dataset (or any of the modified versions you made throughout this assignment or anything else you wish you build upon it):\n\n1. Make the absolute worst plot that you can. You need to customize it in **at least 7 ways** to make it awful.\n2. In your document, write 1 - 2 sentences about each different customization you added (using bullets -- i.e. there should be at least 7 bullet points each with 1-2 sentences), and how it could be useful for you when you want to make an awesome data visualization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 6: Make my plot a better plot!\n\nThe goal is to take my sad looking plot and make it better! 
If you'd like an [example](https://twitter.com/drmowinckels/status/1392136510468763652), here is a tweet I came across of someone who gave a talk about how to zhoosh up your ggplots.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchocolate %>%\n ggplot(aes(\n x = as.factor(review_date),\n y = rating,\n fill = review_date\n )) +\n geom_violin()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n### Tasks\n\n1. You need to customize it in **at least 7 ways** to make it better.\n2. In your document, write 1 - 2 sentences about each different customization you added (using bullets -- i.e. there should be at least 7 bullet points each with 1-2 sentences), describing how you improved it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-09-13\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 
2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n",
+ "markdown": "---\ntitle: \"Project 1\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Finding great chocolate bars!\"\ncategories: [project 1, projects]\n---\n\n\n*This project, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/projects/project-1/index.qmd).*\n\n# Background\n\n**Due date: Sept 16 at 11:59pm**\n\n### To submit your project\n\nPlease write up your project using R Markdown and `knitr`. Compile your document as an **HTML file** and submit your HTML file to the dropbox on Courseplus. Please **show all your code** for each of the answers to each part.\n\nTo get started, [watch this video on setting up your R Markdown document](https://www.stephaniehicks.com/jhustatcomputing2021/posts/2021-09-02-literate-programming/#create-and-knit-your-first-r-markdown-document).\n\n### Install `tidyverse`\n\nBefore attempting this assignment, you should first install the `tidyverse` package if you have not already. The `tidyverse` package is actually a collection of many packages that serves as a convenient way to install many packages without having to do them one by one. 
This can be done with the `install.packages()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Install the tidyverse package if you don't have it\nif (!require(\"tidyverse\", quietly = TRUE))\n install.packages(\"tidyverse\")\n```\n:::\n\n\nRunning this function will install a host of other packages so it make take a minute or two depending on how fast your computer is. Once you have installed it, you will want to load the package.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"tidyverse\")\n```\n:::\n\n\n### Data\n\nThat data for this part of the assignment comes from [TidyTuesday](https://www.tidytuesday.com), which is a weekly podcast and global [community activity](https://github.com/rfordatascience/tidytuesday) brought to you by the R4DS Online Learning Community. The goal of TidyTuesday is to help R learners learn in real-world contexts.\n\n![](https://github.com/rfordatascience/tidytuesday/raw/master/static/tt_logo.png){.preview-image}\n\n\\[**Source**: [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/static/tt_logo.png)\\]\n\nIf we look at the [TidyTuesday github repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022#2022-data) from 2022, we see this dataset chocolate bar reviews.\n\nTo access the data, you need to install the `tidytuesdayR` R package and use the function `tt_load()` with the date of '2022-01-18' to load the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Install the tidytuesdayR package if you don't have it\nif (!require(\"tidytuesdayR\", quietly = TRUE))\n install.packages(\"tidytuesdayR\")\n```\n:::\n\n\nThis is how you can download the data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntuesdata <- tidytuesdayR::tt_load(\"2022-01-18\")\nchocolate <- tuesdata$chocolate\n```\n:::\n\n\nHowever, if you use this code, you will hit an API limit after trying to compile the document a few times. Instead, I suggest you use the following code below. 
Here, I provide the code below for you to avoid re-downloading data:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(\"here\")\nlibrary(\"tidyverse\")\n\n# tests if a directory named \"data\" exists locally\nif (!dir.exists(here(\"data\"))) {\n dir.create(here(\"data\"))\n}\n\n# saves data only once (not each time you knit a R Markdown)\nif (!file.exists(here(\"data\", \"chocolate.RDS\"))) {\n url_csv <- \"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv\"\n chocolate <- readr::read_csv(url_csv)\n\n # save the file to RDS objects\n saveRDS(chocolate, file = here(\"data\", \"chocolate.RDS\"))\n}\n```\n:::\n\n\nHere we read in the `.RDS` dataset locally from our computing environment:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchocolate <- readRDS(here(\"data\", \"chocolate.RDS\"))\nas_tibble(chocolate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2,530 × 10\n ref company_manufacturer company_location review_date\n \n 1 2454 5150 U.S.A. 2019\n 2 2458 5150 U.S.A. 2019\n 3 2454 5150 U.S.A. 2019\n 4 2542 5150 U.S.A. 2021\n 5 2546 5150 U.S.A. 2021\n 6 2546 5150 U.S.A. 2021\n 7 2542 5150 U.S.A. 2021\n 8 797 A. Morin France 2012\n 9 797 A. Morin France 2012\n10 1011 A. 
Morin France 2013\n# ℹ 2,520 more rows\n# ℹ 6 more variables: country_of_bean_origin ,\n# specific_bean_origin_or_bar_name , cocoa_percent ,\n# ingredients , most_memorable_characteristics , rating \n```\n:::\n:::\n\n\nWe can take a glimpse at the data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nglimpse(chocolate)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 2,530\nColumns: 10\n$ ref 2454, 2458, 2454, 2542, 2546, 2546, 2…\n$ company_manufacturer \"5150\", \"5150\", \"5150\", \"5150\", \"5150…\n$ company_location \"U.S.A.\", \"U.S.A.\", \"U.S.A.\", \"U.S.A.…\n$ review_date 2019, 2019, 2019, 2021, 2021, 2021, 2…\n$ country_of_bean_origin \"Tanzania\", \"Dominican Republic\", \"Ma…\n$ specific_bean_origin_or_bar_name \"Kokoa Kamili, batch 1\", \"Zorzal, bat…\n$ cocoa_percent \"76%\", \"76%\", \"76%\", \"68%\", \"72%\", \"8…\n$ ingredients \"3- B,S,C\", \"3- B,S,C\", \"3- B,S,C\", \"…\n$ most_memorable_characteristics \"rich cocoa, fatty, bready\", \"cocoa, …\n$ rating 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…\n```\n:::\n:::\n\n\nHere is a data dictionary for what all the column names mean:\n\n- \n\n# Part 1: Explore data\n\nIn this part, use functions from `dplyr` and `ggplot2` to answer the following questions.\n\n1. Make a histogram of the `rating` scores to visualize the overall distribution of scores. Change the number of bins from the default to 10, 15, 20, and 25. Pick on the one that you think looks the best. Explain what the difference is when you change the number of bins and explain why you picked the one you did.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here and describe your answer afterwards\n```\n:::\n\n\nThe ratings are discrete values making the histogram look strange. When you make the bin size smaller, it aggregates the ratings together in larger groups removing that effect. I picked 15, but there really is no wrong answer. Just looking for an answer here.\n\n2. Consider the countries where the beans originated from. 
How many reviews come from each country of bean origin?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n3. What is average `rating` scores from reviews of chocolate bars that have Ecuador as `country_of_bean_origin` in this dataset? For this same set of reviews, also calculate (1) the total number of reviews and (2) the standard deviation of the `rating` scores. Your answer should be a new data frame with these three summary statistics in three columns. Label the name of these columns `mean`, `sd`, and `total`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n4. Which company (name) makes the best chocolate (or has the highest ratings on average) with beans from Ecuador?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n5. Calculate the average rating across all country of origins for beans. Which top 3 countries (for bean origin) have the highest ratings on average?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n6. Following up on the previous problem, now remove any countries of bean origins that have less than 10 chocolate bar reviews. Now, which top 3 countries have the highest ratings on average?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n7. For this last part, let's explore the relationship between percent chocolate and ratings.\n\nUse the functions in `dplyr`, `tidyr`, and `lubridate` to perform the following steps to the `chocolate` dataset:\n\n1. Identify the countries of bean origin with at least 50 reviews. Remove reviews from countries are not in this list.\n2. 
Using the variable describing the chocolate percentage for each review, create a new column that groups chocolate percentages into one of four groups: (i) \\<60%, (ii) \\>=60 to \\<70%, (iii) \\>=70 to \\<90%, and (iii) \\>=90% (**Hint** check out the `substr()` function in base R and the `case_when()` function from `dplyr` -- see example below).\n3. Using the new column described in #2, re-order the factor levels (if needed) to be starting with the smallest percentage group and increasing to the largest percentage group (**Hint** check out the `fct_relevel()` function from `forcats`).\n4. For each country, make a set of four side-by-side boxplots plotting the groups on the x-axis and the ratings on the y-axis. These plots should be faceted by country.\n\nOn average, which category of chocolate percentage is most highly rated? Do these countries mostly agree or are there disagreements?\n\n**Hint**: You may find the `case_when()` function useful in this part, which can be used to map values from one variable to different values in a new variable (when used in a `mutate()` call).\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Generate some random numbers\ndat <- tibble(x = rnorm(100))\nslice(dat, 1:3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 × 1\n x\n \n1 -0.599\n2 1.49 \n3 -0.756\n```\n:::\n\n```{.r .cell-code}\n## Create a new column that indicates whether the value of 'x' is positive or negative\ndat %>%\n mutate(is_positive = case_when(\n x >= 0 ~ \"Yes\",\n x < 0 ~ \"No\"\n ))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 100 × 2\n x is_positive\n \n 1 -0.599 No \n 2 1.49 Yes \n 3 -0.756 No \n 4 1.94 Yes \n 5 1.48 Yes \n 6 0.367 Yes \n 7 0.121 Yes \n 8 -1.42 No \n 9 -0.149 No \n10 0.948 Yes \n# ℹ 90 more rows\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 2: Join two datasets together\n\nThe goal of this part of the assignment is to join two datasets together. 
`gapminder` is a [R package](https://cran.r-project.org/web/packages/gapminder/README.html) that contains an excerpt from the [Gapminder data](https://www.gapminder.org/data/).\n\n### Tasks\n\n1. Use this dataset it to create a new column called `continent` in our `chocolate` dataset that contains the continent name for each review where the country of bean origin is.\n2. Only keep reviews that have reviews from countries of bean origin with at least 10 reviews.\n3. Also, remove the country of bean origin named `\"Blend\"`.\n4. Make a set of violin plots with ratings on the y-axis and `continent`s on the x-axis.\n\n**Hint**:\n\n- Check to see if there are any `NA`s in the new column. If there are any `NA`s, add the continent name for each row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 3: Convert wide data into long data\n\nThe goal of this part of the assignment is to take a dataset that is either messy or simply not tidy and to make them tidy datasets. The objective is to gain some familiarity with the functions in the `dplyr`, `tidyr` packages. You may find it helpful to review the section on pivoting data from wide to long format and vice versa.\n\n### Tasks\n\nWe are going to create a set of features for us to plot over time. Use the functions in `dplyr` and `tidyr` to perform the following steps to the `chocolate` dataset:\n\n1. Create a new set of columns titled `beans`, `sugar`, `cocoa_butter`, `vanilla`, `letchin`, and `salt` that contain a 1 or 0 representing whether or not that review for the chocolate bar contained that ingredient (1) or not (0).\n2. Create a new set of columns titled `char_cocoa`, `char_sweet`, `char_nutty`, `char_creamy`, `char_roasty`, `char_earthy` that contain a 1 or 0 representing whether or not that the most memorable characteristic for the chocolate bar had that word (1) or not (0). 
For example, if the word \"sweet\" appears in the `most_memorable_characteristics`, then record a 1, otherwise a 0 for that review in the `char_sweet` column (**Hint**: check out `str_detect()` from the `stringr` package).\n3. For each year (i.e. `review_date`), calculate the mean value in each new column you created across all reviews for that year. (**Hint**: If all has gone well thus far, you should have a dataset with 16 rows and 13 columns).\n4. Convert this wide dataset into a long dataset with new `feature` and `mean_score` columns.\n\nIt should look something like this:\n\n``` \nreview_date feature mean_score\n \n2006 beans 0.967741935 \n2006 sugar 0.967741935 \n2006 cocoa_butter 0.903225806 \n2006 vanilla 0.693548387 \n2006 letchin 0.693548387 \n2006 salt 0.000000000 \n2006 char_cocoa 0.209677419 \n2006 char_sweet 0.161290323 \n2006 char_nutty 0.032258065 \n2006 char_creamy 0.241935484 \n```\n\n### Notes\n\n- You may need to use functions outside these packages to obtain this result.\n\n- Do not worry about the ordering of the rows or columns. Depending on whether you use `gather()` or `pivot_longer()`, the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 4: Data visualization\n\nIn this part of the project, we will continue to work with our now tidy chocolate dataset from the previous part.\n\n### Tasks\n\nUse the functions in the `ggplot2` package to make a scatter plot of the `mean_score`s (y-axis) over time (x-axis). One plot for each `mean_score`. For full credit, your plot should include:\n\n1. An overall title for the plot and a subtitle summarizing key trends that you found. Also include a caption in the figure with your name.\n2. Both the observed points for the `mean_score` and a smoothed non-linear pattern of the trend\n3. All plots should be shown in one figure\n4. 
There should be informative x-axis and y-axis labels\n\nConsider playing around with the `theme()` function to make the figure shine, including playing with background colors, font, etc.\n\n### Notes\n\n- You may need to use functions outside these packages to obtain this result.\n\n- Don't worry about the ordering of the rows or columns. Depending on whether you use `gather()` or `pivot_longer()`, the order of your output may differ from what is printed above. As long as the result is a tidy data set, that is sufficient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 5: Make the worst plot you can!\n\nThis sounds a bit crazy, I know, but I want this to be FUN! Instead of trying to make a \"good\" plot, I want you to explore your creative side and make a really awful data visualization in every way. :)\n\n### Tasks\n\nUsing the `chocolate` dataset (or any of the modified versions you made throughout this assignment or anything else you wish to build upon it):\n\n1. Make the absolute worst plot that you can. You need to customize it in **at least 7 ways** to make it awful.\n2. In your document, write 1 - 2 sentences about each different customization you added (using bullets -- i.e. there should be at least 7 bullet points each with 1-2 sentences), and how it could be useful for you when you want to make an awesome data visualization.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 6: Make my plot a better plot!\n\nThe goal is to take my sad-looking plot and make it better! 
If you'd like an [example](https://twitter.com/drmowinckels/status/1392136510468763652), here is a tweet I came across of someone who gave a talk about how to zhoosh up your ggplots.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchocolate %>%\n ggplot(aes(\n x = as.factor(review_date),\n y = rating,\n fill = review_date\n )) +\n geom_violin()\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-20-1.png){width=672}\n:::\n:::\n\n\n### Tasks\n\n1. You need to customize it in **at least 7 ways** to make it better.\n2. In your document, write 1 - 2 sentences about each different customization you added (using bullets -- i.e. there should be at least 7 bullet points each with 1-2 sentences), describing how you improved it.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5.2\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-09-14\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 
2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] 
/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [ "index_files" ], diff --git a/projects/project-1/index.R b/projects/project-1/index.R index 814cf15..72c7f78 100644 --- a/projects/project-1/index.R +++ b/projects/project-1/index.R @@ -1,13 +1,17 @@ -## install.packages("tidyverse") +## ## Install the tidyverse package if you don't have it +## if (!require("tidyverse", quietly = TRUE)) +## install.packages("tidyverse") -library(tidyverse) +library("tidyverse") -## install.packages("tidytuesdayR") +## ## Install the tidytuesdayR package if you don't have it +## if (!require("tidytuesdayR", quietly = TRUE)) +## install.packages("tidytuesdayR") #| eval: false @@ -16,8 +20,8 @@ library(tidyverse) #| message: false -library(here) -library(tidyverse) +library("here") +library("tidyverse") # tests if a directory named "data" exists locally if (!dir.exists(here("data"))) { diff --git a/projects/project-1/index.qmd b/projects/project-1/index.qmd index e3f15de..c75411f 100644 --- a/projects/project-1/index.qmd +++ b/projects/project-1/index.qmd @@ -31,13 +31,15 @@ To get started, [watch this video on setting up your R Markdown document](https: Before attempting this assignment, you should first install the `tidyverse` package if you have not already. The `tidyverse` package is actually a collection of many packages that serves as a convenient way to install many packages without having to do them one by one. This can be done with the `install.packages()` function. ```{r,eval=FALSE} -install.packages("tidyverse") +## Install the tidyverse package if you don't have it +if (!require("tidyverse", quietly = TRUE)) + install.packages("tidyverse") ``` Running this function will install a host of other packages so it make take a minute or two depending on how fast your computer is. 
Once you have installed it, you will want to load the package. ```{r, message=FALSE} -library(tidyverse) +library("tidyverse") ``` ### Data @@ -53,7 +55,9 @@ If we look at the [TidyTuesday github repo](https://github.com/rfordatascience/t To access the data, you need to install the `tidytuesdayR` R package and use the function `tt_load()` with the date of '2022-01-18' to load the data. ```{r,eval=FALSE} -install.packages("tidytuesdayR") +## Install the tidytuesdayR package if you don't have it +if (!require("tidytuesdayR", quietly = TRUE)) + install.packages("tidytuesdayR") ``` This is how you can download the data. @@ -68,8 +72,8 @@ However, if you use this code, you will hit an API limit after trying to compile ```{r} #| message: false -library(here) -library(tidyverse) +library("here") +library("tidyverse") # tests if a directory named "data" exists locally if (!dir.exists(here("data"))) {