From ae5c996b50b08826ddc8108bb1b76e293900438b Mon Sep 17 00:00:00 2001 From: lcolladotor Date: Mon, 25 Sep 2023 15:52:19 -0400 Subject: [PATCH] Add links to https://ggplot2.tidyverse.org/reference/geom_linerange.html and https://r-graphics.org/recipe-annotate-error-bar on the instructions for project 2 Co-authored-by: Natalia Sifnugel --- _freeze/projects/project-2/index/execute-results/html.json | 4 ++-- projects/project-2/index.qmd | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/_freeze/projects/project-2/index/execute-results/html.json b/_freeze/projects/project-2/index/execute-results/html.json index 612b63c..dd52c9f 100644 --- a/_freeze/projects/project-2/index/execute-results/html.json +++ b/_freeze/projects/project-2/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "ada4d2e310eab05602418f63c41c90d5", + "hash": "f6db99c91f2414218a62ea2a1be6dfb0", "result": { - "markdown": "---\ntitle: \"Project 2\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Exploring temperature and rainfall in Australia\"\ncategories: [project 2, projects]\n---\n\n\n*This project, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/projects/project-2/index.qmd).*\n\n# Background\n\n**Due date: October 1st at 11:59pm**\n\nThe goal of this assignment is to practice designing and writing functions along with practicing our tidyverse skills that we learned in our previous project. 
Writing functions involves thinking about how code should be divided up and what the interface/arguments should be. In addition, you need to think about what the function will return as output.\n\n### To submit your project\n\nPlease write up your project using R Markdown and processed with `knitr`. Compile your document as an **HTML file** and submit your HTML file to the dropbox on Courseplus. Please **show all your code** (i.e. make sure to set `echo = TRUE`) for each of the answers to each part.\n\n### Install packages\n\nBefore attempting this assignment, you should first install the following packages, if they are not already installed:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\ninstall.packages(\"tidytuesdayR\")\n```\n:::\n\n\n# Part 1: Fun with functions\n\nIn this part, we are going to practice creating functions.\n\n### Part 1A: Exponential transformation\n\nThe exponential of a number can be written as an infinite series expansion of the form $$\n\\exp(x) = 1 + x + \\frac{x^2}{2!} + \\frac{x^3}{3!} + \\cdots\n$$ Of course, we cannot compute an infinite series by the end of this term and so we must truncate it at a certain point in the series. The truncated sum of terms represents an approximation to the true exponential, but the approximation may be usable.\n\nWrite a function that computes the exponential of a number using the truncated series expansion. The function should take two arguments:\n\n- `x`: the number to be exponentiated\n\n- `k`: the number of terms to be used in the series expansion beyond the constant 1. The value of `k` is always $\\geq 1$.\n\nFor example, if $k = 1$, then the `Exp` function should return the number $1 + x$. 
If $k = 2$, then you should return the number $1 + x + x^2/2!$.\n\nInclude at least one example of output using your function.\n\n::: callout-note\n- You can assume that the input value `x` will always be a *single* number.\n\n- You can assume that the value `k` will always be an integer $\\geq 1$.\n\n- Do not use the `exp()` function in R.\n\n- The `factorial()` function can be used to compute factorials.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nExp <- function(x, k) {\n # Add your solution here\n}\n```\n:::\n\n\n### Part 1B: Sample mean and sample standard deviation\n\nNext, write two functions called `sample_mean()` and `sample_sd()` that takes as input a vector of data of length $N$ and calculates the sample average and sample standard deviation for the set of $N$ observations.\n\n$$\n\\bar{x} = \\frac{1}{N} \\sum_{i=1}^n x_i\n$$ $$\ns = \\sqrt{\\frac{1}{N-1} \\sum_{i=1}^N (x_i - \\overline{x})^2}\n$$ Include at least one example of output using your functions.\n\n::: callout-note\n- You can assume that the input value `x` will always be a *vector* of numbers of length *N*.\n\n- Do not use the `mean()` and `sd()` functions in R.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsample_mean <- function(x) {\n # Add your solution here\n}\n\nsample_sd <- function(x) {\n # Add your solution here\n}\n```\n:::\n\n\n### Part 1C: Confidence intervals\n\nNext, write a function called `calculate_CI()` that:\n\n1. There should be two inputs to the `calculate_CI()`. First, it should take as input a vector of data of length $N$. Second, the function should also have a `conf` ($=1-\\alpha$) argument that allows the confidence interval to be adapted for different $\\alpha$.\n\n2. Calculates a confidence interval (CI) (e.g. a 95% CI) for the estimate of the mean in the population. 
If you are not familiar with confidence intervals, it is an interval that contains the population parameter with probability $1-\\alpha$ taking on this form\n\n$$\n\\bar{x} \\pm t_{\\alpha/2, N-1} s_{\\bar{x}}\n$$\n\nwhere $t_{\\alpha/2, N-1}$ is the value needed to generate an area of $\\alpha / 2$ in each tail of the $t$-distribution with $N-1$ degrees of freedom and $s_{\\bar{x}} = \\frac{s}{\\sqrt{N}}$ is the standard error of the mean. For example, if we pick a 95% confidence interval and $N$=50, then you can calculate $t_{\\alpha/2, N-1}$ as\n\n\n::: {.cell}\n\n```{.r .cell-code}\nalpha <- 1 - 0.95\ndegrees_freedom <- 50 - 1\nt_score <- qt(p = alpha / 2, df = degrees_freedom, lower.tail = FALSE)\n```\n:::\n\n\n3. Returns a named vector of length 2, where the first value is the `lower_bound`, the second value is the `upper_bound`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncalculate_CI <- function(x, conf = 0.95) {\n # Add your solution here\n}\n```\n:::\n\n\nInclude example of output from your function showing the output when using two different levels of `conf`.\n\n::: callout-note\nIf you want to check if your function output matches an existing function in R, consider a vector $x$ of length $N$ and see if the following two code chunks match.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncalculate_CI(x, conf = 0.95)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndat <- data.frame(x = x)\nfit <- lm(x ~ 1, dat)\n\n# Calculate a 95% confidence interval\nconfint(fit, level = 0.95)\n```\n:::\n\n:::\n\n# Part 2: Wrangling data\n\nIn this part, we will practice our wrangling skills with the tidyverse that we learned about in module 1.\n\n### Data\n\nThe two datasets for this part of the assignment comes from [TidyTuesday](https://www.tidytuesday.com). 
Specifically, we will use the following data from January 2020, which I have provided for you below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntuesdata <- tidytuesdayR::tt_load(\"2020-01-07\")\nrainfall <- tuesdata$rainfall\ntemperature <- tuesdata$temperature\n```\n:::\n\n\nHowever, to avoid re-downloading data, we will check to see if those files already exist using an `if()` statement:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\nif (!file.exists(here(\"data\", \"tuesdata_rainfall.RDS\"))) {\n tuesdata <- tidytuesdayR::tt_load(\"2020-01-07\")\n rainfall <- tuesdata$rainfall\n temperature <- tuesdata$temperature\n\n # save the files to RDS objects\n saveRDS(tuesdata$rainfall, file = here(\"data\", \"tuesdata_rainfall.RDS\"))\n saveRDS(tuesdata$temperature, file = here(\"data\", \"tuesdata_temperature.RDS\"))\n}\n```\n:::\n\n\n::: callout-note\nThe above code will only run if it cannot find the path to the `tuesdata_rainfall.RDS` on your computer. Then, we can just read in these files every time we knit the R Markdown, instead of re-downloading them every time.\n:::\n\nLet's load the datasets\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrainfall <- readRDS(here(\"data\", \"tuesdata_rainfall.RDS\"))\ntemperature <- readRDS(here(\"data\", \"tuesdata_temperature.RDS\"))\n```\n:::\n\n\nNow we can look at the data with `glimpse()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nglimpse(rainfall)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 179,273\nColumns: 11\n$ station_code \"009151\", \"009151\", \"009151\", \"009151\", \"009151\", \"009151…\n$ city_name \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Pe…\n$ year 1967, 1967, 1967, 1967, 1967, 1967, 1967, 1967, 1967, 196…\n$ month \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01…\n$ day \"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10…\n$ rainfall NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ 
period NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ quality NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ lat -31.96, -31.96, -31.96, -31.96, -31.96, -31.96, -31.96, -…\n$ long 115.79, 115.79, 115.79, 115.79, 115.79, 115.79, 115.79, 1…\n$ station_name \"Subiaco Wastewater Treatment Plant\", \"Subiaco Wastewater…\n```\n:::\n\n```{.r .cell-code}\nglimpse(temperature)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 528,278\nColumns: 5\n$ city_name \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PER…\n$ date 1910-01-01, 1910-01-02, 1910-01-03, 1910-01-04, 1910-01-0…\n$ temperature 26.7, 27.0, 27.5, 24.0, 24.8, 24.4, 25.3, 28.0, 32.6, 35.9…\n$ temp_type \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"m…\n$ site_name \"PERTH AIRPORT\", \"PERTH AIRPORT\", \"PERTH AIRPORT\", \"PERTH …\n```\n:::\n:::\n\n\nIf we look at the [TidyTuesday github repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020#2020-data) from 2020, we see this dataset contains temperature and rainfall data from Australia.\n\nHere is a data dictionary for what all the column names mean:\n\n- \n\n### Tasks\n\nUsing the `rainfall` and `temperature` data, perform the following steps and create a new data frame called `df`:\n\n1. Start with `rainfall` dataset and drop any rows with NAs.\n2. Create a new column titled `date` that combines the columns `year`, `month`, `day` into one column separated by \"-\". (e.g. \"2020-01-01\"). This column should not be a character, but should be recognized as a date. (**Hint**: check out the `ymd()` function in `lubridate` R package). You will also want to add a column that just keeps the `year`.\n3. Using the `city_name` column, convert the city names (character strings) to all upper case.\n4. Join this wrangled rainfall dataset with the `temperature` dataset such that it includes only observations that are in both data frames. 
(**Hint**: there are two keys that you will need to join the two datasets together). (**Hint**: If all has gone well thus far, you should have a dataset with 83,964 rows and 13 columns).\n\n::: callout-note\n- You may need to use functions outside these packages to obtain this result, in particular you may find the functions `drop_na()` from `tidyr` and `str_to_upper()` function from `stringr` useful.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 3: Data visualization\n\nIn this part, we will practice our `ggplot2` plotting skills within the tidyverse starting with our wrangled `df` data from Part 2. For full credit in this part (and for all plots that you make), your plots should include:\n\n1. An overall title for the plot and a subtitle summarizing key trends that you found. Also include a caption in the figure.\n2. There should be an informative x-axis and y-axis label.\n\nConsider playing around with the `theme()` function to make the figure shine, including playing with background colors, font, etc.\n\n### Part 3A: Plotting temperature data over time\n\nUse the functions in `ggplot2` package to make a line plot of the max and min temperature (y-axis) over time (x-axis) for each city in our wrangled data from Part 2. You should only consider years 2014 and onwards. For full credit, your plot should include:\n\n1. For a given city, the min and max temperature should both appear on the plot, but they should be two different colors.\n2. 
Use a facet function to facet by `city_name` to show all cities in one figure.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n### Part 3B: Plotting rainfall over time\n\nHere we want to explore the distribution of rainfall (log scale) with histograms for a given city (indicated by the `city_name` column) for a given year (indicated by the `year` column) so we can make some exploratory plots of the data.\n\n::: callout-note\nYou are again using the wrangled data from Part 2.\n:::\n\nThe following code plots the data from one city (`city_name == \"PERTH\"`) in a given year (`year == 2000`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf %>%\n filter(city_name == \"PERTH\", year == 2000) %>%\n ggplot(aes(log(rainfall))) +\n geom_histogram()\n```\n:::\n\n\nWhile this code is useful, it only provides us information on one city in one year. We could cut and paste this code to look at other cities/years, but that can be error prone and just plain messy.\n\nThe aim here is to **design** and **implement** a function that can be re-used to visualize all of the data in this dataset.\n\n1. There are 2 aspects that may vary in the dataset: The **city_name** and the **year**. Note that not all combinations of `city_name` and `year` have measurements.\n\n2. Your function should take as input two arguments **city_name** and **year**.\n\n3. Given the input from the user, your function should return a **single** histogram for that input. Furthermore, the data should be **readable** on that plot so that it is in fact useful. It should be possible visualize the entire dataset with your function (through repeated calls to your function).\n\n4. If the user enters an input that does not exist in the dataset, your function should catch that and report an error (via the `stop()` function).\n\nFor this section,\n\n1. Write a short description of how you chose to design your function and why.\n\n2. 
Present the code for your function in the R markdown document.\n\n3. Include at least one example of output from your function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 4: Apply functions and plot\n\n### Part 4A: Tasks\n\nIn this part, we will apply the functions we wrote in Part 1 to our rainfall data starting with our wrangled `df` data from Part 2.\n\n1. First, filter for only years including 2014 and onwards.\n2. For a given city and for a given year, calculate the sample mean (using your function `sample_mean()`), the sample standard deviation (using your function `sample_sd()`), and a 95% confidence interval for the average rainfall (using your function `calculate_CI()`). Specifically, you should add two columns in this summarized dataset: a column titled `lower_bound` and a column titled `upper_bound` containing the lower and upper bounds for you CI that you calculated (using your function `calculate_CI()`).\n3. Call this summarized dataset `rain_df`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n### Part 4B: Tasks\n\nUsing the `rain_df`, plots the estimates of mean rainfall and the 95% confidence intervals on the same plot. There should be a separate faceted plot for each city. 
Think about using `ggplot()` with both `geom_point()` (and `geom_line()` to connect the points) for the means and `geom_errorbar()` for the lower and upper bounds of the confidence interval.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.5\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-09-12\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 
2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", + "markdown": "---\ntitle: \"Project 2\"\nauthor:\n - name: Leonardo Collado Torres\n url: http://lcolladotor.github.io/\n affiliations:\n - id: libd\n name: Lieber Institute for Brain Development\n url: https://libd.org/\n - id: jhsph\n name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics\n url: https://publichealth.jhu.edu/departments/biostatistics\ndescription: \"Exploring temperature and rainfall in Australia\"\ncategories: [project 2, projects]\n---\n\n\n*This project, as the rest of the course, is adapted from the version [Stephanie C. 
Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the [GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/projects/project-2/index.qmd).*\n\n# Background\n\n**Due date: October 1st at 11:59pm**\n\nThe goal of this assignment is to practice designing and writing functions along with practicing our tidyverse skills that we learned in our previous project. Writing functions involves thinking about how code should be divided up and what the interface/arguments should be. In addition, you need to think about what the function will return as output.\n\n### To submit your project\n\nPlease write up your project using R Markdown and processed with `knitr`. Compile your document as an **HTML file** and submit your HTML file to the dropbox on Courseplus. Please **show all your code** (i.e. make sure to set `echo = TRUE`) for each of the answers to each part.\n\n### Install packages\n\nBefore attempting this assignment, you should first install the following packages, if they are not already installed:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"tidyverse\")\ninstall.packages(\"tidytuesdayR\")\n```\n:::\n\n\n# Part 1: Fun with functions\n\nIn this part, we are going to practice creating functions.\n\n### Part 1A: Exponential transformation\n\nThe exponential of a number can be written as an infinite series expansion of the form $$\n\\exp(x) = 1 + x + \\frac{x^2}{2!} + \\frac{x^3}{3!} + \\cdots\n$$ Of course, we cannot compute an infinite series by the end of this term and so we must truncate it at a certain point in the series. The truncated sum of terms represents an approximation to the true exponential, but the approximation may be usable.\n\nWrite a function that computes the exponential of a number using the truncated series expansion. 
The function should take two arguments:\n\n- `x`: the number to be exponentiated\n\n- `k`: the number of terms to be used in the series expansion beyond the constant 1. The value of `k` is always $\\geq 1$.\n\nFor example, if $k = 1$, then the `Exp` function should return the number $1 + x$. If $k = 2$, then you should return the number $1 + x + x^2/2!$.\n\nInclude at least one example of output using your function.\n\n::: callout-note\n- You can assume that the input value `x` will always be a *single* number.\n\n- You can assume that the value `k` will always be an integer $\\geq 1$.\n\n- Do not use the `exp()` function in R.\n\n- The `factorial()` function can be used to compute factorials.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nExp <- function(x, k) {\n # Add your solution here\n}\n```\n:::\n\n\n### Part 1B: Sample mean and sample standard deviation\n\nNext, write two functions called `sample_mean()` and `sample_sd()` that takes as input a vector of data of length $N$ and calculates the sample average and sample standard deviation for the set of $N$ observations.\n\n$$\n\\bar{x} = \\frac{1}{N} \\sum_{i=1}^n x_i\n$$ $$\ns = \\sqrt{\\frac{1}{N-1} \\sum_{i=1}^N (x_i - \\overline{x})^2}\n$$ Include at least one example of output using your functions.\n\n::: callout-note\n- You can assume that the input value `x` will always be a *vector* of numbers of length *N*.\n\n- Do not use the `mean()` and `sd()` functions in R.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsample_mean <- function(x) {\n # Add your solution here\n}\n\nsample_sd <- function(x) {\n # Add your solution here\n}\n```\n:::\n\n\n### Part 1C: Confidence intervals\n\nNext, write a function called `calculate_CI()` that:\n\n1. There should be two inputs to the `calculate_CI()`. First, it should take as input a vector of data of length $N$. Second, the function should also have a `conf` ($=1-\\alpha$) argument that allows the confidence interval to be adapted for different $\\alpha$.\n\n2. 
Calculates a confidence interval (CI) (e.g. a 95% CI) for the estimate of the mean in the population. If you are not familiar with confidence intervals, it is an interval that contains the population parameter with probability $1-\\alpha$ taking on this form\n\n$$\n\\bar{x} \\pm t_{\\alpha/2, N-1} s_{\\bar{x}}\n$$\n\nwhere $t_{\\alpha/2, N-1}$ is the value needed to generate an area of $\\alpha / 2$ in each tail of the $t$-distribution with $N-1$ degrees of freedom and $s_{\\bar{x}} = \\frac{s}{\\sqrt{N}}$ is the standard error of the mean. For example, if we pick a 95% confidence interval and $N$=50, then you can calculate $t_{\\alpha/2, N-1}$ as\n\n\n::: {.cell}\n\n```{.r .cell-code}\nalpha <- 1 - 0.95\ndegrees_freedom <- 50 - 1\nt_score <- qt(p = alpha / 2, df = degrees_freedom, lower.tail = FALSE)\n```\n:::\n\n\n3. Returns a named vector of length 2, where the first value is the `lower_bound`, the second value is the `upper_bound`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncalculate_CI <- function(x, conf = 0.95) {\n # Add your solution here\n}\n```\n:::\n\n\nInclude example of output from your function showing the output when using two different levels of `conf`.\n\n::: callout-note\nIf you want to check if your function output matches an existing function in R, consider a vector $x$ of length $N$ and see if the following two code chunks match.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncalculate_CI(x, conf = 0.95)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndat <- data.frame(x = x)\nfit <- lm(x ~ 1, dat)\n\n# Calculate a 95% confidence interval\nconfint(fit, level = 0.95)\n```\n:::\n\n:::\n\n# Part 2: Wrangling data\n\nIn this part, we will practice our wrangling skills with the tidyverse that we learned about in module 1.\n\n### Data\n\nThe two datasets for this part of the assignment comes from [TidyTuesday](https://www.tidytuesday.com). 
Specifically, we will use the following data from January 2020, which I have provided for you below:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntuesdata <- tidytuesdayR::tt_load(\"2020-01-07\")\nrainfall <- tuesdata$rainfall\ntemperature <- tuesdata$temperature\n```\n:::\n\n\nHowever, to avoid re-downloading data, we will check to see if those files already exist using an `if()` statement:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(here)\nif (!file.exists(here(\"data\", \"tuesdata_rainfall.RDS\"))) {\n tuesdata <- tidytuesdayR::tt_load(\"2020-01-07\")\n rainfall <- tuesdata$rainfall\n temperature <- tuesdata$temperature\n\n # save the files to RDS objects\n saveRDS(tuesdata$rainfall, file = here(\"data\", \"tuesdata_rainfall.RDS\"))\n saveRDS(tuesdata$temperature, file = here(\"data\", \"tuesdata_temperature.RDS\"))\n}\n```\n:::\n\n\n::: callout-note\nThe above code will only run if it cannot find the path to the `tuesdata_rainfall.RDS` on your computer. Then, we can just read in these files every time we knit the R Markdown, instead of re-downloading them every time.\n:::\n\nLet's load the datasets\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrainfall <- readRDS(here(\"data\", \"tuesdata_rainfall.RDS\"))\ntemperature <- readRDS(here(\"data\", \"tuesdata_temperature.RDS\"))\n```\n:::\n\n\nNow we can look at the data with `glimpse()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\n\nglimpse(rainfall)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 179,273\nColumns: 11\n$ station_code \"009151\", \"009151\", \"009151\", \"009151\", \"009151\", \"009151…\n$ city_name \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Perth\", \"Pe…\n$ year 1967, 1967, 1967, 1967, 1967, 1967, 1967, 1967, 1967, 196…\n$ month \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01\", \"01…\n$ day \"01\", \"02\", \"03\", \"04\", \"05\", \"06\", \"07\", \"08\", \"09\", \"10…\n$ rainfall NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ 
period NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ quality NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…\n$ lat -31.96, -31.96, -31.96, -31.96, -31.96, -31.96, -31.96, -…\n$ long 115.79, 115.79, 115.79, 115.79, 115.79, 115.79, 115.79, 1…\n$ station_name \"Subiaco Wastewater Treatment Plant\", \"Subiaco Wastewater…\n```\n:::\n\n```{.r .cell-code}\nglimpse(temperature)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRows: 528,278\nColumns: 5\n$ city_name \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PERTH\", \"PER…\n$ date 1910-01-01, 1910-01-02, 1910-01-03, 1910-01-04, 1910-01-0…\n$ temperature 26.7, 27.0, 27.5, 24.0, 24.8, 24.4, 25.3, 28.0, 32.6, 35.9…\n$ temp_type \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"max\", \"m…\n$ site_name \"PERTH AIRPORT\", \"PERTH AIRPORT\", \"PERTH AIRPORT\", \"PERTH …\n```\n:::\n:::\n\n\nIf we look at the [TidyTuesday github repo](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020#2020-data) from 2020, we see this dataset contains temperature and rainfall data from Australia.\n\nHere is a data dictionary for what all the column names mean:\n\n- \n\n### Tasks\n\nUsing the `rainfall` and `temperature` data, perform the following steps and create a new data frame called `df`:\n\n1. Start with `rainfall` dataset and drop any rows with NAs.\n2. Create a new column titled `date` that combines the columns `year`, `month`, `day` into one column separated by \"-\". (e.g. \"2020-01-01\"). This column should not be a character, but should be recognized as a date. (**Hint**: check out the `ymd()` function in `lubridate` R package). You will also want to add a column that just keeps the `year`.\n3. Using the `city_name` column, convert the city names (character strings) to all upper case.\n4. Join this wrangled rainfall dataset with the `temperature` dataset such that it includes only observations that are in both data frames. 
(**Hint**: there are two keys that you will need to join the two datasets together). (**Hint**: If all has gone well thus far, you should have a dataset with 83,964 rows and 13 columns).\n\n::: callout-note\n- You may need to use functions outside these packages to obtain this result, in particular you may find the functions `drop_na()` from `tidyr` and `str_to_upper()` function from `stringr` useful.\n:::\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 3: Data visualization\n\nIn this part, we will practice our `ggplot2` plotting skills within the tidyverse starting with our wrangled `df` data from Part 2. For full credit in this part (and for all plots that you make), your plots should include:\n\n1. An overall title for the plot and a subtitle summarizing key trends that you found. Also include a caption in the figure.\n2. There should be an informative x-axis and y-axis label.\n\nConsider playing around with the `theme()` function to make the figure shine, including playing with background colors, font, etc.\n\n### Part 3A: Plotting temperature data over time\n\nUse the functions in `ggplot2` package to make a line plot of the max and min temperature (y-axis) over time (x-axis) for each city in our wrangled data from Part 2. You should only consider years 2014 and onwards. For full credit, your plot should include:\n\n1. For a given city, the min and max temperature should both appear on the plot, but they should be two different colors.\n2. 
Use a facet function to facet by `city_name` to show all cities in one figure.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n### Part 3B: Plotting rainfall over time\n\nHere we want to explore the distribution of rainfall (log scale) with histograms for a given city (indicated by the `city_name` column) for a given year (indicated by the `year` column) so we can make some exploratory plots of the data.\n\n::: callout-note\nYou are again using the wrangled data from Part 2.\n:::\n\nThe following code plots the data from one city (`city_name == \"PERTH\"`) in a given year (`year == 2000`).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf %>%\n filter(city_name == \"PERTH\", year == 2000) %>%\n ggplot(aes(log(rainfall))) +\n geom_histogram()\n```\n:::\n\n\nWhile this code is useful, it only provides us information on one city in one year. We could cut and paste this code to look at other cities/years, but that can be error prone and just plain messy.\n\nThe aim here is to **design** and **implement** a function that can be re-used to visualize all of the data in this dataset.\n\n1. There are 2 aspects that may vary in the dataset: The **city_name** and the **year**. Note that not all combinations of `city_name` and `year` have measurements.\n\n2. Your function should take as input two arguments **city_name** and **year**.\n\n3. Given the input from the user, your function should return a **single** histogram for that input. Furthermore, the data should be **readable** on that plot so that it is in fact useful. It should be possible visualize the entire dataset with your function (through repeated calls to your function).\n\n4. If the user enters an input that does not exist in the dataset, your function should catch that and report an error (via the `stop()` function).\n\nFor this section,\n\n1. Write a short description of how you chose to design your function and why.\n\n2. 
Present the code for your function in the R markdown document.\n\n3. Include at least one example of output from your function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# Part 4: Apply functions and plot\n\n### Part 4A: Tasks\n\nIn this part, we will apply the functions we wrote in Part 1 to our rainfall data starting with our wrangled `df` data from Part 2.\n\n1. First, filter for only years including 2014 and onwards.\n2. For a given city and for a given year, calculate the sample mean (using your function `sample_mean()`), the sample standard deviation (using your function `sample_sd()`), and a 95% confidence interval for the average rainfall (using your function `calculate_CI()`). Specifically, you should add two columns in this summarized dataset: a column titled `lower_bound` and a column titled `upper_bound` containing the lower and upper bounds for you CI that you calculated (using your function `calculate_CI()`).\n3. Call this summarized dataset `rain_df`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n### Part 4B: Tasks\n\nUsing the `rain_df`, plots the estimates of mean rainfall and the 95% confidence intervals on the same plot. There should be a separate faceted plot for each city. Think about using `ggplot()` with both `geom_point()` (and `geom_line()` to connect the points) for the means and `geom_errorbar()` for the lower and upper bounds of the confidence interval. 
Check <https://r-graphics.org/recipe-annotate-error-bar> and or the official documentation <https://ggplot2.tidyverse.org/reference/geom_linerange.html> for examples of how to use `geom_errorbar()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Add your solution here\n```\n:::\n\n\n# R session information\n\n\n::: {.cell}\n\n```{.r .cell-code}\noptions(width = 120)\nsessioninfo::session_info()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.1 (2023-06-16)\n os macOS Ventura 13.6\n system aarch64, darwin20\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/New_York\n date 2023-09-25\n pandoc 3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)\n colorout 1.2-2 2023-05-06 [1] Github (jalvesaq/colorout@79931fd)\n colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)\n digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)\n dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.0)\n evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)\n fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)\n fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)\n forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)\n generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)\n ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.0)\n glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)\n gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.0)\n here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)\n hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)\n htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.0)\n htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)\n jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)\n knitr 1.44 2023-09-11 [1] CRAN (R 4.3.0)\n lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)\n lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)\n magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)\n munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)\n pillar 1.9.0 2023-03-22 [1] CRAN 
(R 4.3.0)\n pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)\n purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)\n R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)\n readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)\n rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)\n rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)\n rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)\n rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)\n scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)\n sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)\n stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)\n stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)\n tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)\n tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)\n tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)\n tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)\n timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)\n tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)\n utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)\n vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)\n withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)\n xfun 0.40 2023-08-09 [1] CRAN (R 4.3.0)\n yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)\n\n [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library\n\n──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/projects/project-2/index.qmd b/projects/project-2/index.qmd index d7f98de..26c69b2 100644 --- a/projects/project-2/index.qmd +++ b/projects/project-2/index.qmd @@ -293,7 +293,7 @@ In this part, we will apply the functions we wrote in Part 1 to our rainfall dat ### Part 4B: Tasks -Using the `rain_df`, plots the estimates of mean rainfall and the 95% confidence intervals on the same plot. There should be a separate faceted plot for each city. Think about using `ggplot()` with both `geom_point()` (and `geom_line()` to connect the points) for the means and `geom_errorbar()` for the lower and upper bounds of the confidence interval. 
+Using the `rain_df`, plots the estimates of mean rainfall and the 95% confidence intervals on the same plot. There should be a separate faceted plot for each city. Think about using `ggplot()` with both `geom_point()` (and `geom_line()` to connect the points) for the means and `geom_errorbar()` for the lower and upper bounds of the confidence interval. Check <https://r-graphics.org/recipe-annotate-error-bar> and or the official documentation <https://ggplot2.tidyverse.org/reference/geom_linerange.html> for examples of how to use `geom_errorbar()`. ```{r} # Add your solution here