diff --git a/_freeze/content/lectures/13-cs01-analysis/execute-results/html.json b/_freeze/content/lectures/13-cs01-analysis/execute-results/html.json index fd9a281..3295f04 100644 --- a/_freeze/content/lectures/13-cs01-analysis/execute-results/html.json +++ b/_freeze/content/lectures/13-cs01-analysis/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "a8036478fb7d853228624bab66cdc6d0", + "hash": "7dd6d09087241c7f1d863d30e5715f66", "result": { - "markdown": "---\ntitle: \"13-cs01-analysis\"\nauthor: \"Professor Shannon Ellis\"\ndate: \"2023-11-14\"\n\nformat:\n html: \n output-file: 13-cs01-analysis.html\n embed-resources: true\n revealjs:\n output-file: 13-cs01-analysis-slides.html\n slide-number: true\n chalkboard: false \n preview-links: auto\n logo: images/cogs137-logo-hex.png\n css: slides.css\n footer: \n scrollable: true\n embed-resources: true\n execute:\n echo: true\n eval: true\n---\n\n::: {.cell}\n\n:::\n\n\n\n# CS01: Biomarkers of Recent Use (Analysis) {background-color=\"#92A86A\"}\n\n## Q&A {.smaller}\n\n> Q: How extensive does our extension component need to be?\\\n> A: A bit hard to answer in certain terms. We'll discuss some examples today to hopefully set expectaions well. To explain in writing here, the most typical extension is students using the data provided to ask and answer a question not directly presented in class. Thus, simply generating a visualization not presented in class would NOT be sufficient. At the other end, finding external data on the topic and analyzing that data, while certainly allowed, would far exceed expectations. In between those extremes is what we expect: add significantly to the analysis, beyond what was presented in class.\n\n\n## Course Announcements\n\nDue Dates:\n\n- **HW03** (MLR) due Mon 11/20\n- **Project Proposal** (it will be a Google Form) due 11/20\n- **CS01** Deadlines:\n - **Lab06** due Friday - cs01-focused\n - Report & \"General Communication\" due 11/27\n - survey about how working with group went - due 11/28\n\n. . .\n\nNotes:\n\nMidterm scores & Feedback posted\n\n- overall, did very well\n - avg: 13.85/15 (92%)\n - 6 perfect scores\n- answer key on course website\n\nI am behind on emails and Piazza posts.\n\n## Mid-course Survey Summary \n\n- N=73 (~75%)\n- Pacing workload (so far) about right\n- Course notes most helpful in the course overall\n- Also helpful: completing labs, doing homework,\n- Many are not checking labs against answer keys; most are not doing suggested readings\n- Of those that attend lecture, most find it helpful\n\n## Mid-course: Time Spent\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-1-1.png){width=3000}\n:::\n:::\n\n\n\n## Mid-course: What would you change?\n \n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n:::\n\n\n \n## Agenda\n\n- Debugging/Understanding Code Strategies\n- Sensitivity & Specificity\n- Cross-compound correlations\n- Extensions\n\n## Summary: Figuring out what's going on in code\n\nSuggestions (as discussed in class):\n\n:::incremental\n1. Look up documentation (i.e. `?...`) / Google the function\n2. Run it on different input; see how output changing\n3. Run the code line-by-line, understanding output at each step \n4. Ask ChatGPT\n:::\n\n# Question {background-color=\"#92A86A\"}\n\nWhich compound, in which matrix, and at what cutoff is the best biomarker of recent use?\n\n. . .\n\n::: callout-message\nEvery group will answer this question.\n:::\n\n# Data & Files {background-color=\"#92A86A\"}\n\n## Packages\n\nThree additional packages required for these notes:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(purrr)\nlibrary(rstatix)\nlibrary(cowplot)\n```\n:::\n\n\n\n## The Data\n\nReading in the data from the end of data wrangling notes: \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nload(\"data/compounds.RData\")\nload(\"data/timepoints.RData\")\nload(\"data/data_clean.RData\")\n```\n:::\n\n::: {.cell}\n\n:::\n\n\n\nAnd the functions...\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(\"src/cs01_functions.R\")\n```\n:::\n\n\n\n# Analysis {background-color=\"#92A86A\"}\n\n\n## Sensitivity & Specificity\n\n**Sensitivity** | the ability of a test to correctly identify patients with a disease/trait/condition. $$TP/(TP + FN)$$\n\n. . .\n\n**Specificity** | the ability of a test to correctly identify people without the disease/trait/condition. $$TN/(TN + FP)$$\n\n. . . \n\n[❓ For this analysis, do you care more about sensitivity? about specificity? equally about both?]{style=\"background-color: #ADD8E6\"}\n\n## What is a TP here? TN? FP? FN? \n\n**Post-smoking** (cutoff > 0)\n\n:::incremental\n- TP = THC group, value >= cutoff\n- FN = THC group, value < cutoff\n- FP = Placebo group, value >= cutoff\n- TN = Placebo group, value < cutoff\n:::\n\n. . .\n\n**Post-smoking** (cutoff == 0)\n\nCannot be a TP or FP if zero...\n\n- TP = THC group, value > cutoff),\n- FN = THC group, value <= cutoff),\n- FP = Placebo group, value > cutoff),\n- TN = Placebo group, value < cutoff)\n\n. . . 
\n\n**Pre-smoking** \n\nCannot be a TP or FN before consuming...\n\n- TP = 0\n- FN = 0\n- FP = value >= cutoff\n- TN = value < cutoff\n\n## ROC\n\nReceiver-Operator Characteristic (ROC) Curve: TPR (Sensitivity) vs FPR (1-Specificity)\n\n![](images/13/Roc_curve.svg.png)\n\nImage Credit: By cmglee, MartinThoma - Roc-draft-xkcd-style.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109730045\n\n\n\n\n## Calculating Sensitivity and Specificity\n\n:::panel-tabset\n\n### Calculate \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_calculations <- function(dataset, dataset_removedups, split, compound, \n start = start, stop = stop, tpt_use = tpt_use){\n ## remove NAs\n df <- dataset_removedups %>% \n dplyr::select(treatment, compound, timepoint_use) %>%\n rename(compound = 2) %>%\n filter(complete.cases(.))\n if(nrow(df)>0){\n if(stop <= 0){\n output <- df %>% \n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n }else{\n if(split == 0){\n output_pre <- df %>% \n filter(tpt_use == \"pre-smoking\") %>%\n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n \n output <- df %>% \n filter(tpt_use != \"pre-smoking\") %>%\n summarise(TP = sum(treatment != \"Placebo\" & compound > split),\n FN = sum(treatment != \"Placebo\" & compound <= split),\n FP = sum(treatment == \"Placebo\" & compound > split),\n TN = sum(treatment == \"Placebo\" & compound < split))\n \n output <- output + output_pre\n }else{\n ## calculate values if pre-smoking\n output_pre <- df %>% \n filter(tpt_use == \"pre-smoking\") %>%\n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n \n output <- df %>% \n filter(tpt_use != \"pre-smoking\") %>%\n summarise(TP = sum(treatment != \"Placebo\" & compound >= split),\n FN = sum(treatment != \"Placebo\" & compound < split),\n FP = sum(treatment == \"Placebo\" & compound >= split),\n TN = sum(treatment == \"Placebo\" & compound < split))\n \n output <- output + output_pre\n }\n }\n }\n # clean things up; make calculations on above values\n output <- output %>%\n mutate(detection_limit = split,\n compound = compound,\n time_start = start,\n time_stop = stop,\n time_window = tpt_use,\n NAs = nrow(dataset) - nrow(df),\n N = nrow(dataset_removedups),\n N_removed = nrow(dataset) - nrow(dataset_removedups),\n Sensitivity = (TP/(TP + FN)), \n Specificity = (TN /(TN + FP)),\n PPV = (TP/(TP+FP)),\n NPV = (TN/(TN + FN)),\n Efficiency = ((TP + TN)/(TP + TN + FP + FN))*100\n )\n \n return(output)\n}\n```\n:::\n\n\n\n### Run\n\n- determine what cutoff values to try\n- carry out above function on those cutoffs\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsens_spec <- function(dataset, compound, start, stop, tpt_use, \n lowest_value = 0.5, splits = NULL, ...){\n # if it's not all NAs...\n if(sum(is.na(dataset[,compound])) != nrow(dataset)){\n # specify what splits should be used for calculations\n if(is.null(splits)){\n limits <- dataset[is.finite(rowSums(dataset[,compound])),compound]\n ## define lower and upper limits\n lower = min(limits, na.rm=TRUE)\n upper = max(limits, na.rm=TRUE)\n ## determine splits to use for calculations\n tosplit = pull(limits[,1])[limits[,1]>0]\n ## only split if there are detectable limits:\n if(length(tosplit)>=1){\n splits = c(lowest_value, quantile(tosplit, probs=seq(0, 1, by = 0.01), na.rm=TRUE))\n splits = unique(splits)\n }else{\n splits = 0\n }\n }else{\n splits = splits\n }\n # filter to include timepoint of interest\n dataset <- dataset 
%>% \n filter(time_from_start > start & time_from_start <= stop & !is.na(timepoint_use))\n dataset_removedups <- dataset %>%\n filter(!is.na(timepoint_use)) %>% \n group_by(timepoint_use) %>% \n distinct(id, .keep_all = TRUE) %>% \n ungroup()\n\n ## create empty output variable which we'll fill in\n ## iterate through each possible dose and calculate\n output <- map_dfr(as.list(splits), ~make_calculations(dataset, \n dataset_removedups, \n split = .x,\n compound,\n start = start,\n stop = stop, \n tpt_use = tpt_use))\n }\n \n return(output)\n}\n```\n:::\n\n\n\n### Apply\n\nMap the above for each matrix...\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsens_spec_cpd <- function(dataset, cpd, timepoints, splits = NULL){\n args2 <- list(start = timepoints$start, \n stop = timepoints$stop, \n tpt_use = timepoints$timepoint)\n out <- args2 %>% \n pmap_dfr(sens_spec, dataset, compound = cpd, splits = splits)\n return(out)\n}\n```\n:::\n\n\n\n### Do it!\n\nThis takes a few minutes to run... (reminder: `cache=TRUE`)\n\n\n\n::: {.cell hash='13-cs01-analysis_cache/revealjs/unnamed-chunk-10_9dac3684d755322598037474bd630929'}\n\n```{.r .cell-code}\noutput_WB <- map_dfr(compounds_WB, \n ~sens_spec_cpd(dataset = WB, cpd = all_of(.x), \n timepoints = timepoints_WB)) %>% clean_gluc()\n\noutput_BR <- map_dfr(compounds_BR, \n ~sens_spec_cpd(dataset = BR, cpd = all_of(.x),\n timepoints = timepoints_BR)) %>% clean_gluc()\n\noutput_OF <- map_dfr(compounds_OF, \n ~sens_spec_cpd(dataset = OF, cpd = all_of(.x),\n timepoints = timepoints_OF)) %>% clean_gluc()\n```\n:::\n\n\n\n::: \n\n## ROC \n\n:::panel-tabset\n\n### Code\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nss_plot <- function(output, tpts=8, tissue){\n to_include = output %>%\n group_by(compound) %>% \n summarize(mean_detection = mean(detection_limit)) %>% \n filter(mean_detection > 0)\n \n output <- output %>% \n mutate(iszero = ifelse(time_start<0,TRUE,FALSE),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0)) %>%\n filter(compound %in% to_include$compound,\n time_window != \"pre-smoking\") %>%\n clean_gluc() %>% \n mutate(compound = fct_relevel(as.factor(compound), \"THC\"))\n \n output <- output %>% mutate(\n legend = paste0(time_window,' (N=', N,')'))\n \n blue_colors = c('#C2F8FF', '#A2DDED', '#86BEDC', '#6C9FCA', \n '#547EB9', '#3F5EA8', '#2D4096', '#1E2385',\n '#181173', '#180762', '#180051')\n values = c(blue_colors[1:tpts])\n \n print(ggplot(output, aes(x = detection_limit, y = Sensitivity, group = fct_inorder(legend))) + \n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound, scales = \"free_x\") +\n labs(x = 'Detection Limit',\n y = 'Sensitivity') +\n ylim(0,1) +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16), \n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12)) \n )\n print(\n ggplot(output, aes(x = detection_limit, y = Specificity, group = fct_inorder(legend))) + \n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound, scales = \"free_x\") +\n ylim(0,100) +\n labs(title = tissue,\n x = 'Detection Limit',\n y = 'Specificity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = 
element_text(size=16),\n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12))\n )\n print(\n ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +\n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound) +\n xlim(0, 100) +\n ylim(0, 100) +\n labs(title = tissue,\n x = '(100-Specificity)',\n y = 'Sensitivity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16),\n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12),\n axis.text = element_text(size=12))\n )\n}\n\nroc_plot <- function(output, tpts=8, tissue){\n to_include = output %>%\n group_by(compound) %>% \n summarize(mean_detection = mean(detection_limit)) %>% \n filter(mean_detection > 0)\n \n output <- output %>% \n mutate(iszero = ifelse(time_start<0,TRUE,FALSE),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0)) %>%\n filter(compound %in% to_include$compound,\n time_window != \"pre-smoking\") %>%\n clean_gluc() %>% \n mutate(compound = fct_relevel(as.factor(compound), \"THC\"))\n \n output <- output %>% mutate(\n legend = paste0(time_window,' (N=', N,')'))\n \n blue_colors = c('#C2F8FF', '#86BEDC', \n '#547EB9', '#2D4096',\n '#181173', '#180051')\n values = c(blue_colors[1:tpts])\n print(\n ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +\n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound) +\n xlim(0, 100) +\n ylim(0, 100) +\n labs(title = tissue,\n x = '(100-Specificity)',\n y = 'Sensitivity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16),\n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12),\n axis.text = element_text(size=12) )\n )\n}\n```\n:::\n\n\n\n### Calculate\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nss1_a <- ss_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = \"Blood\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-1.png){width=3000}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-2.png){width=3000}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-3.png){width=3000}\n:::\n\n```{.r .cell-code}\nss2_a <- ss_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = \"Oral Fluid\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-4.png){width=3000}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-5.png){width=3000}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-6.png){width=3000}\n:::\n\n```{.r .cell-code}\nss3_a <- roc_plot(output_BR, tpts = length(unique(output_BR$time_start)), tissue = \"Breath\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-12-7.png){width=3000}\n:::\n:::\n\n\n\n### Plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbottom_row <- plot_grid(ss2_a, ss3_a, labels = c('B', 'C'), label_size 
= 12, ncol = 2, rel_widths = c(0.66, .33))\nplot_grid(ss1_a, bottom_row, labels = c('A', ''), label_size = 12, ncol = 1)\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-13-1.png){width=3000}\n:::\n:::\n\n\n\n:::\n\n## Calculate: THC\n\nReminder: Currently, states have laws on the books from zero tolerance (detection of any level) to 5ng/mL\n\n:::panel-tabset\n\n### WB\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutoffs = c(0.5, 1, 2, 5, 10)\nWB_THC <- sens_spec_cpd(dataset = WB, cpd = 'thc',\n timepoints = timepoints_WB,\n splits = cutoffs) %>% clean_gluc()\n\nWB_THC\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 17\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 81 108 0.5 THC -400 0\n 2 0 0 61 128 1 THC -400 0\n 3 0 0 45 144 2 THC -400 0\n 4 0 0 10 179 5 THC -400 0\n 5 0 0 1 188 10 THC -400 0\n 6 124 2 28 33 0.5 THC 0 30\n 7 123 3 22 39 1 THC 0 30\n 8 119 7 15 46 2 THC 0 30\n 9 106 20 4 57 5 THC 0 30\n10 101 25 0 61 10 THC 0 30\n# ℹ 40 more rows\n# ℹ 9 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency \n```\n:::\n:::\n\n\n\n### OF\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nOF_THC <- sens_spec_cpd(dataset = OF, cpd = 'thc',\n timepoints = timepoints_OF,\n splits = cutoffs) %>% clean_gluc()\n\nOF_THC\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 40 × 17\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 35 157 0.5 THC -400 0\n 2 0 0 20 172 1 THC -400 0\n 3 0 0 9 183 2 THC -400 0\n 4 0 0 0 192 5 THC -400 0\n 5 0 0 0 192 10 THC -400 0\n 6 129 0 39 24 0.5 THC 0 30\n 7 129 0 30 33 1 THC 0 30\n 8 128 1 19 44 2 THC 0 30\n 9 128 1 3 60 5 THC 0 30\n10 125 4 1 62 10 THC 0 30\n# ℹ 30 more rows\n# ℹ 9 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency \n```\n:::\n:::\n\n\n\n### BR\n\nWhy is there no calculation for breath?\n\n\n:::\n\n## Cutoffs\n\n:::panel-tabset\n\n### Code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cutoffs <- function(dataset, timepoint_use_variable, tissue, labels = c(\"A\", \"B\"), vertline, cpd, x_labels){\n col_val = c(\"#D9D9D9\", \"#BDBDBD\", \"#969696\", \"#636363\", \"#252525\")\n lines = rep(\"solid\", 5)\n \n df_ss <- dataset %>% \n mutate(time_window = fct_relevel(as.factor(time_window), \n levels(timepoint_use_variable)),\n detection_limit = as.factor(detection_limit),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0),\n my_label = paste0(time_window, ' N=', N),\n my_label = gsub(\" \", \"\\n\", my_label),\n my_label = fct_relevel(as.factor(my_label), x_labels)) #%>% \n \n p1 <- df_ss %>% \n ggplot(aes(x = my_label, y = Sensitivity, \n colour = detection_limit)) + \n geom_line(size = 1.2, aes(group = detection_limit, \n linetype = detection_limit)) + \n geom_vline(xintercept=vertline, linetype = 'dotted') +\n geom_point(show.legend=FALSE) + \n ylim(0,100) +\n scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +\n scale_linetype_manual(values=lines) +\n scale_color_manual(values = col_val, name = \"Cutoff \\n (ng/mL)\",\n guide = guide_legend(override.aes = list(linetype = c(1),\n shape = rep(NA, length(lines))) )) +\n theme_classic() +\n theme( axis.title = element_text(size=16),\n axis.text = element_text(size=10),\n legend.position = c(0.08, 0.4),\n panel.grid = element_blank(),\n strip.background = element_blank()\n ) +\n guides(linetype = FALSE) +\n labs(x = \"Time Window\", \n y 
= \"Sensitivity\", \n title = paste0(tissue,\": \", cpd) )\n \n p2 <- df_ss %>% \n ggplot(aes(x = my_label, y = Specificity,\n group = detection_limit, \n colour = detection_limit, \n linetype = detection_limit)) + \n geom_line(size = 1.2) +\n geom_vline(xintercept=vertline, linetype = 'dotted') +\n geom_point() + \n ylim(0,100) +\n scale_color_manual(values = col_val) +\n scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +\n scale_linetype_manual(values = lines, \n guide = guide_legend(override.aes = list(linetype = \"solid\",\n shape = rep(NA, length(lines))) )) +\n theme_classic() +\n theme(axis.title = element_text(size=16),\n axis.text = element_text(size=10),\n legend.position = \"none\", \n panel.grid = element_blank(),\n strip.background = element_blank()) +\n labs(x = \"Time Window\", \n y = \"Specificity\",\n title = \"\" )\n \n title <- ggdraw() + \n draw_label(\n tissue,\n x = 0.05,\n hjust = 0\n )\n \n plot_row <- plot_grid(p1, p2, labels = labels, label_size = 12)\n \n plot_grid(\n title, plot_row,\n ncol = 1,\n # rel_heights values control vertical title margins\n rel_heights = c(0.1, 1)\n )\n \n return(list(plot_row, df_ss))\n\n}\n```\n:::\n\n\n\n### WB\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nblood_levels <- c(\"pre-smoking\\nN=189\", \"0-30\\nmin\\nN=187\", \"31-70\\nmin\\nN=165\",\n \"71-100\\nmin\\nN=157\", \"101-180\\nmin\\nN=168\", \"181-210\\nmin\\nN=103\",\n \"211-240\\nmin\\nN=127\", \"241-270\\nmin\\nN=137\", \"271-300\\nmin\\nN=120\",\n \"301+\\nmin\\nN=88\")\n\nplot_cutoffs(dataset=WB_THC, \n timepoint_use_variable=WB$timepoint_use, \n tissue=\"Blood\", \n vertline=levels(WB$timepoint_use)[5], \n cpd=\"THC\", \n x_labels=blood_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-17-1.png){width=3000}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 50 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 81 108 0.5 THC -400 0\n 2 0 0 61 128 1 THC -400 0\n 3 0 0 45 144 2 THC -400 0\n 4 0 0 10 179 5 THC -400 0\n 5 0 0 1 188 10 THC -400 0\n 6 124 2 28 33 0.5 THC 0 30\n 7 123 3 22 39 1 THC 0 30\n 8 119 7 15 46 2 THC 0 30\n 9 106 20 4 57 5 THC 0 30\n10 101 25 0 61 10 THC 0 30\n# ℹ 40 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n\n### OF\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nof_levels <- c(\"pre-smoking\\nN=192\", \"0-30\\nmin\\nN=192\", \"31-90\\nmin\\nN=117\",\n \"91-180\\nmin\\nN=99\", \"181-210\\nmin\\nN=102\", \"211-240\\nmin\\nN=83\",\n \"241-270\\nmin\\nN=90\", \"271+\\nmin\\nN=76\")\n\nplot_cutoffs(OF_THC, OF$timepoint_use, tissue = \"Oral Fluid\", labels = c(\"A\", \"B\"), vertline=levels(OF$timepoint_use)[4], cpd=\"THC\", x_labels=of_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-18-1.png){width=3000}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 40 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 35 157 0.5 THC -400 0\n 2 0 0 20 172 1 THC -400 0\n 3 0 0 9 183 2 THC -400 0\n 4 0 0 0 192 5 THC -400 0\n 5 0 0 0 192 10 THC -400 0\n 6 129 0 39 24 0.5 THC 0 30\n 7 129 0 30 33 1 THC 0 30\n 8 128 1 19 44 2 THC 0 30\n 9 128 1 3 60 5 THC 0 30\n10 125 4 1 62 10 THC 0 30\n# ℹ 30 more rows\n# ℹ 10 more variables: 
time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n\n:::\n\n\n## Calculate: CBN\nReminder: Currently, states have laws on the books from zero tolerance (detection of any level) to 5ng/mL\n\n:::panel-tabset\n\n### WB\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nWB_CBN = sens_spec_cpd(dataset = WB, cpd = 'cbn',\n timepoints = timepoints_WB,\n splits = cutoffs) %>% clean_gluc()\n\nblood_levels <- c(\"pre-smoking\\nN=189\", \"0-30\\nmin\\nN=187\", \"31-70\\nmin\\nN=165\",\n \"71-100\\nmin\\nN=157\", \"101-180\\nmin\\nN=168\", \"181-210\\nmin\\nN=103\",\n \"211-240\\nmin\\nN=127\", \"241-270\\nmin\\nN=137\", \"271-300\\nmin\\nN=120\",\n \"301+\\nmin\\nN=88\")\n\nplot_cutoffs(WB_CBN, WB$timepoint_use, tissue = \"Blood\", vertline=levels(WB$timepoint_use)[5], cpd=\"CBN\", x_labels=blood_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-19-1.png){width=3000}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 50 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 1 188 0.5 CBN -400 0\n 2 0 0 0 189 1 CBN -400 0\n 3 0 0 0 189 2 CBN -400 0\n 4 0 0 0 189 5 CBN -400 0\n 5 0 0 0 189 10 CBN -400 0\n 6 106 20 7 54 0.5 CBN 0 30\n 7 97 29 0 61 1 CBN 0 30\n 8 82 44 0 61 2 CBN 0 30\n 9 40 86 0 61 5 CBN 0 30\n10 9 117 0 61 10 CBN 0 30\n# ℹ 40 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n\n### OF\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nOF_CBN = sens_spec_cpd(dataset = OF, cpd = 'cbn',\n timepoints = timepoints_OF,\n splits = cutoffs) %>% clean_gluc()\n\nof_levels <- c(\"pre-smoking\\nN=192\", \"0-30\\nmin\\nN=192\", \"31-90\\nmin\\nN=117\",\n \"91-180\\nmin\\nN=99\", \"181-210\\nmin\\nN=102\", \"211-240\\nmin\\nN=83\",\n \"241-270\\nmin\\nN=90\", \"271+\\nmin\\nN=76\")\n\nplot_cutoffs(OF_CBN, OF$timepoint_use, tissue = \"Oral Fluid\", labels = c(\"A\", \"B\"), vertline=levels(OF$timepoint_use)[4], cpd=\"CBN\", x_labels=of_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-20-1.png){width=3000}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 40 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 5 187 0.5 CBN -400 0\n 2 0 0 1 191 1 CBN -400 0\n 3 0 0 1 191 2 CBN -400 0\n 4 0 0 1 191 5 CBN -400 0\n 5 0 0 0 192 10 CBN -400 0\n 6 127 2 41 22 0.5 CBN 0 30\n 7 125 4 32 31 1 CBN 0 30\n 8 122 7 18 45 2 CBN 0 30\n 9 116 13 7 56 5 CBN 0 30\n10 107 22 3 60 10 CBN 0 30\n# ℹ 30 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n\n:::\n\n\n## Compound Correlations\n\n:::panel-tabset\n\n### Code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nggplotRegression <- function (x, y, xlab, ylab, x_text, y_text, y_text2, title) {\n fit <- lm(y ~ x)\n if(max(fit$model[,1],na.rm=TRUE)!=0){\n ggplot(fit$model, aes_string(x = names(fit$model)[2], \n y = names(fit$model)[1])) + \n geom_point() +\n stat_smooth(method = \"lm\", col = \"#B73239\", size = 1.5, se = FALSE) +\n annotate(\"text\", x=x_text, y=y_text, \n label = paste(\"R^2 == \", format(signif(summary(fit)$adj.r.squared, 5), \n digits=2)),\n vjust=1, hjust=0, 
parse=TRUE,size=4.5) +\n labs(x = xlab, \n y = ylab, \n title = title ) +\n annotate(\"text\", x=x_text, y=y_text2, label = paste(\n \"y = \", format(signif(fit$coef[[2]], 5),digits=2),\n \"x + \",\n format(signif(fit$coef[[1]],5 ),digits=2),\n paste0(\"\\nN = \", length(x))),\n vjust=1, hjust=0, size=4.5) + \n theme_minimal(base_size=14) +\n theme(panel.grid = element_blank(),\n axis.line = element_line(size = 0.5, linetype = \"solid\",\n colour = \"black\"),\n legend.position=\"none\") \n } else{\n ggplot(fit$model, aes_string(x = names(fit$model)[2], \n y = names(fit$model)[1])) + \n geom_point() +\n scale_y_continuous(limits = c(0,3)) +\n stat_smooth(method = \"lm\", col = \"#B73239\", size = 1.5, se = FALSE) +\n annotate(\"text\", x=x_text, y=y_text, \n label = paste(\"R^2 == \", format(signif(summary(fit)$adj.r.squared, 5), digits=2)),vjust=1, hjust=1, parse=TRUE,size=4.5) +\n labs(x = xlab, \n y = ylab, \n title = title ) +\n annotate(\"text\", x=x_text, y=y_text2, label = paste(\n \"y = \", format(signif(fit$coef[[2]], 5),digits=2),\n \"x + \",\n format(signif(fit$coef[[1]],5 ),digits=2),\n paste0(\"\\nN = \", length(x))), vjust=1, hjust=1,size=4.5) + \n theme_minimal(base_size = 14) +\n theme(panel.grid = element_blank(),\n axis.line = element_line(size = 0.5, linetype = \"solid\",\n colour = \"black\"),\n legend.position=\"none\") \n \n \n }\n}\n```\n:::\n\n\n\n### Plot\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwb_reg <- ggplotRegression(WB$thc, WB$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text= 150, y_text = 7, y_text2 = 5, title = \"Blood\")\n\nof_reg <- ggplotRegression(OF$thc, OF$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text= 12500, y_text = 750, y_text2 = 500, title = \"Oral Fluid\")\n\nplot_grid(wb_reg, of_reg, labels = 'AUTO', label_size = 12, ncol = 2, scale = 1)\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-revealjs/unnamed-chunk-22-1.png){width=3000}\n:::\n:::\n\n\n\n:::\n\n## Possible Extensions\n\nOur current question asks for a *single* compound...and you'll need to decide that.\n\n. . . \n\n...but you could imagine a world where more than one compound or more than one matrix could be measured at the roadside.\n\n. . . \n\nSo:\n\n:::incremental\n- combination of the oral fluid and blood that would better predict recent use? (For example if an officer stopped a driver and got a high oral fluid, but could not get a blood sample for a couple of hours and got a relatively low result would this predict recent use better than blood (or OF) alone? \n- Is there a ratio of OF/blood that predicts recent use?\n- Machine learning model to determine optimal combination of measurements/cutoffs to detect recent use?\n:::\n\n. . . \n\nThings to keep in mind:\n\n- some matrices are easier to get at the roadside\n- time from use matters (trying to detect *recent* use)\n- we may not care equally about sensitivity and specificity\n\n\n## Recap {.smaller background-color=\"#92A86A\"}\n\n- Can you describe sensitivity? 
Specificity?\n- Can you explain how TP, TN, FP, and FN were calculated/defined *in this experiment*?\n- Can you describe the code used to carry out the calculations?\n- Can you interpret the results from these data?\n", + "markdown": "---\ntitle: \"13-cs01-analysis\"\nauthor: \"Professor Shannon Ellis\"\ndate: \"2023-11-14\"\n\nformat:\n html: \n output-file: 13-cs01-analysis.html\n embed-resources: true\n revealjs:\n output-file: 13-cs01-analysis-slides.html\n slide-number: true\n chalkboard: false \n preview-links: auto\n logo: images/cogs137-logo-hex.png\n css: slides.css\n footer: \n scrollable: true\n embed-resources: true\n execute:\n echo: true\n eval: true\n---\n\n::: {.cell}\n\n:::\n\n\n# CS01: Biomarkers of Recent Use (Analysis) {background-color=\"#92A86A\"}\n\n## Q&A {.smaller}\n\n> Q: How extensive does our extension component need to be?\\\n> A: A bit hard to answer in certain terms. We'll discuss some examples today to hopefully set expectaions well. To explain in writing here, the most typical extension is students using the data provided to ask and answer a question not directly presented in class. Thus, simply generating a visualization not presented in class would NOT be sufficient. At the other end, finding external data on the topic and analyzing that data, while certainly allowed, would far exceed expectations. In between those extremes is what we expect: add significantly to the analysis, beyond what was presented in class.\n\n\n## Course Announcements\n\nDue Dates:\n\n- **HW03** (MLR) due Mon 11/20\n- **Project Proposal** (it will be a Google Form) due 11/20\n- **CS01** Deadlines:\n - **Lab06** due Friday - cs01-focused\n - Report & \"General Communication\" due 11/27\n - survey about how working with group went - due 11/28\n\n. . .\n\nNotes:\n\nMidterm scores & Feedback posted\n\n- overall, did very well\n - avg: 13.85/15 (92%)\n - 6 perfect scores\n- answer key on course website\n\nI am behind on emails and Piazza posts.\n\n## Mid-course Survey Summary \n\n- N=73 (~75%)\n- Pacing workload (so far) about right\n- Course notes most helpful in the course overall\n- Also helpful: completing labs, doing homework,\n- Many are not checking labs against answer keys; most are not doing suggested readings\n- Of those that attend lecture, most find it helpful\n\n## Mid-course: Time Spent\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-1-1.png){width=2100}\n:::\n:::\n\n\n## Mid-course: What would you change?\n \n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n:::\n\n \n## Agenda\n\n- Debugging/Understanding Code Strategies\n- Sensitivity & Specificity\n- Cross-compound correlations\n- Extensions\n\n## Summary: Figuring out what's going on in code\n\nSuggestions (as discussed in class):\n\n:::incremental\n1. Look up documentation (i.e. `?...`) / Google the function\n2. Run it on different input; see how output changing\n3. Run the code line-by-line, understanding output at each step \n4. Ask ChatGPT\n:::\n\n# Question {background-color=\"#92A86A\"}\n\nWhich compound, in which matrix, and at what cutoff is the best biomarker of recent use?\n\n. . .\n\n::: callout-message\nEvery group will answer this question.\n:::\n\n# Data & Files {background-color=\"#92A86A\"}\n\n## Packages\n\nThree additional packages required for these notes:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(purrr)\nlibrary(rstatix)\nlibrary(cowplot)\n```\n:::\n\n\n## The Data\n\nReading in the data from the end of data wrangling notes: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nload(\"data/compounds.RData\")\nload(\"data/timepoints.RData\")\nload(\"data/data_clean.RData\")\n```\n:::\n\n::: {.cell}\n\n:::\n\n\nAnd the functions...\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsource(\"src/cs01_functions.R\")\n```\n:::\n\n\n# Analysis {background-color=\"#92A86A\"}\n\n\n## Sensitivity & Specificity\n\n**Sensitivity** | the ability of a test to correctly identify patients with a disease/trait/condition. $$TP/(TP + FN)$$\n\n. . .\n\n**Specificity** | the ability of a test to correctly identify people without the disease/trait/condition. $$TN/(TN + FP)$$\n\n. . . \n\n[❓ For this analysis, do you care more about sensitivity? about specificity? equally about both?]{style=\"background-color: #ADD8E6\"}\n\n## What is a TP here? TN? FP? FN? \n\n**Post-smoking** (cutoff > 0)\n\n:::incremental\n- TP = THC group, value >= cutoff\n- FN = THC group, value < cutoff\n- FP = Placebo group, value >= cutoff\n- TN = Placebo group, value < cutoff\n:::\n\n. . .\n\n**Post-smoking** (cutoff == 0)\n\nCannot be a TP or FP if zero...\n\n- TP = THC group, value > cutoff),\n- FN = THC group, value <= cutoff),\n- FP = Placebo group, value > cutoff),\n- TN = Placebo group, value < cutoff)\n\n. . . 
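\n\nTo make the post-smoking (cutoff > 0) rules above concrete, here is a minimal hypothetical sketch -- not the course's `make_calculations()` function; the `toy` data frame, its treatment labels, and the 2 ng/mL cutoff are invented for illustration:\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dplyr)\n\n## hypothetical observations: treatment group and one measured value (ng/mL)\ntoy <- tibble(\n  treatment = c(\"5.9% THC\", \"Placebo\", \"5.9% THC\", \"Placebo\"),\n  compound  = c(8.2, 0.4, 1.1, 3.0)\n)\ncutoff <- 2  ## illustrative cutoff (ng/mL)\n\n## apply the classification rules above, then the metric formulas\ntoy %>%\n  summarise(TP = sum(treatment != \"Placebo\" & compound >= cutoff),\n            FN = sum(treatment != \"Placebo\" & compound < cutoff),\n            FP = sum(treatment == \"Placebo\" & compound >= cutoff),\n            TN = sum(treatment == \"Placebo\" & compound < cutoff)) %>%\n  mutate(Sensitivity = TP/(TP + FN),\n         Specificity = TN/(TN + FP))\n```\n:::\n\nWith these four toy rows each confusion-matrix cell gets one observation, so both metrics come out to 0.5.\n\n. . . 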
\n\n**Pre-smoking** \n\nCannot be a TP or FN before consuming...\n\n- TP = 0\n- FN = 0\n- FP = value >= cutoff\n- TN = value < cutoff\n\n## ROC\n\nReceiver-Operator Characteristic (ROC) Curve: TPR (Sensitivity) vs FPR (1-Specificity)\n\n![](images/13/Roc_curve.svg.png)\n\nImage Credit: By cmglee, MartinThoma - Roc-draft-xkcd-style.svg, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=109730045\n\n\n\n\n## Calculating Sensitivity and Specificity\n\n:::panel-tabset\n\n### Calculate \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmake_calculations <- function(dataset, dataset_removedups, split, compound, \n start = start, stop = stop, tpt_use = tpt_use){\n ## remove NAs\n df <- dataset_removedups %>% \n dplyr::select(treatment, compound, timepoint_use) %>%\n rename(compound = 2) %>%\n filter(complete.cases(.))\n if(nrow(df)>0){\n if(stop <= 0){\n output <- df %>% \n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n }else{\n if(split == 0){\n output_pre <- df %>% \n filter(tpt_use == \"pre-smoking\") %>%\n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n \n output <- df %>% \n filter(tpt_use != \"pre-smoking\") %>%\n summarise(TP = sum(treatment != \"Placebo\" & compound > split),\n FN = sum(treatment != \"Placebo\" & compound <= split),\n FP = sum(treatment == \"Placebo\" & compound > split),\n TN = sum(treatment == \"Placebo\" & compound < split))\n \n output <- output + output_pre\n }else{\n ## calculate values if pre-smoking\n output_pre <- df %>% \n filter(tpt_use == \"pre-smoking\") %>%\n summarise(TP = 0,\n FN = 0,\n FP = sum(compound >= split),\n TN = sum(compound < split)) \n \n output <- df %>% \n filter(tpt_use != \"pre-smoking\") %>%\n summarise(TP = sum(treatment != \"Placebo\" & compound >= split),\n FN = sum(treatment != \"Placebo\" & compound < split),\n FP = sum(treatment == \"Placebo\" & compound >= split),\n TN = sum(treatment == \"Placebo\" & compound < split))\n \n output <- output + output_pre\n }\n }\n }\n # clean things up; make calculations on above values\n output <- output %>%\n mutate(detection_limit = split,\n compound = compound,\n time_start = start,\n time_stop = stop,\n time_window = tpt_use,\n NAs = nrow(dataset) - nrow(df),\n N = nrow(dataset_removedups),\n N_removed = nrow(dataset) - nrow(dataset_removedups),\n Sensitivity = (TP/(TP + FN)), \n Specificity = (TN /(TN + FP)),\n PPV = (TP/(TP+FP)),\n NPV = (TN/(TN + FN)),\n Efficiency = ((TP + TN)/(TP + TN + FP + FN))*100\n )\n \n return(output)\n}\n```\n:::\n\n\n### Run\n\n- determine what cutoff values to try\n- carry out above function on those cutoffs\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsens_spec <- function(dataset, compound, start, stop, tpt_use, \n lowest_value = 0.5, splits = NULL, ...){\n # if it's not all NAs...\n if(sum(is.na(dataset[,compound])) != nrow(dataset)){\n # specify what splits should be used for calculations\n if(is.null(splits)){\n limits <- dataset[is.finite(rowSums(dataset[,compound])),compound]\n ## define lower and upper limits\n lower = min(limits, na.rm=TRUE)\n upper = max(limits, na.rm=TRUE)\n ## determine splits to use for calculations\n tosplit = pull(limits[,1])[limits[,1]>0]\n ## only split if there are detectable limits:\n if(length(tosplit)>=1){\n splits = c(lowest_value, quantile(tosplit, probs=seq(0, 1, by = 0.01), na.rm=TRUE))\n splits = unique(splits)\n }else{\n splits = 0\n }\n }else{\n splits = splits\n }\n # filter to include timepoint of interest\n dataset <- dataset %>% \n 
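## window is (start, stop]; drop rows with a missing timepoint_use\n    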
filter(time_from_start > start & time_from_start <= stop & !is.na(timepoint_use))\n dataset_removedups <- dataset %>%\n filter(!is.na(timepoint_use)) %>% \n group_by(timepoint_use) %>% \n distinct(id, .keep_all = TRUE) %>% \n ungroup()\n\n ## create empty output variable which we'll fill in\n ## iterate through each possible dose and calculate\n output <- map_dfr(as.list(splits), ~make_calculations(dataset, \n dataset_removedups, \n split = .x,\n compound,\n start = start,\n stop = stop, \n tpt_use = tpt_use))\n }\n \n return(output)\n}\n```\n:::\n\n\n### Apply\n\nMap the above for each matrix...\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsens_spec_cpd <- function(dataset, cpd, timepoints, splits = NULL){\n args2 <- list(start = timepoints$start, \n stop = timepoints$stop, \n tpt_use = timepoints$timepoint)\n out <- args2 %>% \n pmap_dfr(sens_spec, dataset, compound = cpd, splits = splits)\n return(out)\n}\n```\n:::\n\n\n### Do it!\n\nThis takes a few minutes to run... (reminder: `cache=TRUE`)\n\n\n::: {.cell hash='13-cs01-analysis_cache/html/unnamed-chunk-10_e432cbdfc0d26929ce6a1bb9fd220e35'}\n\n```{.r .cell-code}\noutput_WB <- map_dfr(compounds_WB, \n ~sens_spec_cpd(dataset = WB, cpd = all_of(.x), \n timepoints = timepoints_WB)) %>% clean_gluc()\n\noutput_BR <- map_dfr(compounds_BR, \n ~sens_spec_cpd(dataset = BR, cpd = all_of(.x),\n timepoints = timepoints_BR)) %>% clean_gluc()\n\noutput_OF <- map_dfr(compounds_OF, \n ~sens_spec_cpd(dataset = OF, cpd = all_of(.x),\n timepoints = timepoints_OF)) %>% clean_gluc()\n```\n:::\n\n\n::: \n\n## ROC \n\n:::panel-tabset\n\n### Code\n\n\n::: {.cell}\n\n```{.r .cell-code}\nss_plot <- function(output, tpts=8, tissue){\n to_include = output %>%\n group_by(compound) %>% \n summarize(mean_detection = mean(detection_limit)) %>% \n filter(mean_detection > 0)\n \n output <- output %>% \n mutate(iszero = ifelse(time_start<0,TRUE,FALSE),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0)) %>%\n filter(compound %in% to_include$compound,\n time_window != \"pre-smoking\") %>%\n clean_gluc() %>% \n mutate(compound = fct_relevel(as.factor(compound), \"THC\"))\n \n output <- output %>% mutate(\n legend = paste0(time_window,' (N=', N,')'))\n \n blue_colors = c('#C2F8FF', '#A2DDED', '#86BEDC', '#6C9FCA', \n '#547EB9', '#3F5EA8', '#2D4096', '#1E2385',\n '#181173', '#180762', '#180051')\n values = c(blue_colors[1:tpts])\n \n print(ggplot(output, aes(x = detection_limit, y = Sensitivity, group = fct_inorder(legend))) + \n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound, scales = \"free_x\") +\n labs(x = 'Detection Limit',\n y = 'Sensitivity') +\n ylim(0,1) +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16), \n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12)) \n )\n print(\n ggplot(output, aes(x = detection_limit, y = Specificity, group = fct_inorder(legend))) + \n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound, scales = \"free_x\") +\n ylim(0,100) +\n labs(title = tissue,\n x = 'Detection Limit',\n y = 'Specificity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16),\n panel.grid = 
element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12))\n )\n print(\n ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +\n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound) +\n xlim(0, 100) +\n ylim(0, 100) +\n labs(title = tissue,\n x = '(100-Specificity)',\n y = 'Sensitivity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16),\n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12),\n axis.text = element_text(size=12))\n )\n}\n\nroc_plot <- function(output, tpts=8, tissue){\n to_include = output %>%\n group_by(compound) %>% \n summarize(mean_detection = mean(detection_limit)) %>% \n filter(mean_detection > 0)\n \n output <- output %>% \n mutate(iszero = ifelse(time_start<0,TRUE,FALSE),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0)) %>%\n filter(compound %in% to_include$compound,\n time_window != \"pre-smoking\") %>%\n clean_gluc() %>% \n mutate(compound = fct_relevel(as.factor(compound), \"THC\"))\n \n output <- output %>% mutate(\n legend = paste0(time_window,' (N=', N,')'))\n \n blue_colors = c('#C2F8FF', '#86BEDC', \n '#547EB9', '#2D4096',\n '#181173', '#180051')\n values = c(blue_colors[1:tpts])\n print(\n ggplot(output, aes(x=(100-Specificity), y = Sensitivity, group = fct_inorder(legend))) +\n geom_point(aes(color=fct_inorder(legend)), size = 0.9, show.legend = FALSE) +\n geom_path(aes(color=fct_inorder(legend)), size=1.2) + \n facet_grid(~compound) +\n xlim(0, 100) +\n ylim(0, 100) +\n labs(title = tissue,\n x = '(100-Specificity)',\n y = 'Sensitivity') +\n scale_color_manual(values = values, name = 'Time Window') +\n theme_classic(base_size = 12) + \n theme(axis.title = element_text(size=16),\n panel.grid = element_blank(),\n strip.background = element_blank(),\n strip.text.x = element_text(size = 12),\n axis.text = element_text(size=12) )\n )\n}\n```\n:::\n\n\n### Calculate\n\n\n::: {.cell}\n\n```{.r .cell-code}\nss1_a <- ss_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = \"Blood\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-1.png){width=2100}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-2.png){width=2100}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-3.png){width=2100}\n:::\n\n```{.r .cell-code}\nss2_a <- ss_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = \"Oral Fluid\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-4.png){width=2100}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-5.png){width=2100}\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-6.png){width=2100}\n:::\n\n```{.r .cell-code}\nss3_a <- roc_plot(output_BR, tpts = length(unique(output_BR$time_start)), tissue = \"Breath\")\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-12-7.png){width=2100}\n:::\n:::\n\n\n### Plot\n\n::: {.cell}\n\n```{.r .cell-code}\nbottom_row <- plot_grid(ss2_a, ss3_a, labels = c('B', 'C'), label_size = 12, ncol = 2, rel_widths = c(0.66, .33))\nplot_grid(ss1_a, bottom_row, 
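## top row (A) = blood; bottom row = oral fluid (B) and breath (C)\n          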
labels = c('A', ''), label_size = 12, ncol = 1)\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-13-1.png){width=2100}\n:::\n:::\n\n\n:::\n\n## Calculate: THC\n\nReminder: Currently, states have laws on the books from zero tolerance (detection of any level) to 5ng/mL\n\n:::panel-tabset\n\n### WB\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncutoffs = c(0.5, 1, 2, 5, 10)\nWB_THC <- sens_spec_cpd(dataset = WB, cpd = 'thc',\n timepoints = timepoints_WB,\n splits = cutoffs) %>% clean_gluc()\n\nWB_THC\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 × 17\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 81 108 0.5 THC -400 0\n 2 0 0 61 128 1 THC -400 0\n 3 0 0 45 144 2 THC -400 0\n 4 0 0 10 179 5 THC -400 0\n 5 0 0 1 188 10 THC -400 0\n 6 124 2 28 33 0.5 THC 0 30\n 7 123 3 22 39 1 THC 0 30\n 8 119 7 15 46 2 THC 0 30\n 9 106 20 4 57 5 THC 0 30\n10 101 25 0 61 10 THC 0 30\n# ℹ 40 more rows\n# ℹ 9 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency \n```\n:::\n:::\n\n\n### OF\n\n\n::: {.cell}\n\n```{.r .cell-code}\nOF_THC <- sens_spec_cpd(dataset = OF, cpd = 'thc',\n timepoints = timepoints_OF,\n splits = cutoffs) %>% clean_gluc()\n\nOF_THC\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 40 × 17\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 35 157 0.5 THC -400 0\n 2 0 0 20 172 1 THC -400 0\n 3 0 0 9 183 2 THC -400 0\n 4 0 0 0 192 5 THC -400 0\n 5 0 0 0 192 10 THC -400 0\n 6 129 0 39 24 0.5 THC 0 30\n 7 129 0 30 33 1 THC 0 30\n 8 128 1 19 44 2 THC 0 30\n 9 128 1 3 60 5 THC 0 30\n10 125 4 1 62 10 THC 0 30\n# ℹ 30 more rows\n# ℹ 9 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency \n```\n:::\n:::\n\n\n### BR\n\nWhy is there no calculation for breath?\n\n\n:::\n\n## Cutoffs\n\n:::panel-tabset\n\n### Code\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_cutoffs <- function(dataset, timepoint_use_variable, tissue, labels = c(\"A\", \"B\"), vertline, cpd, x_labels){\n col_val = c(\"#D9D9D9\", \"#BDBDBD\", \"#969696\", \"#636363\", \"#252525\")\n lines = rep(\"solid\", 5)\n \n df_ss <- dataset %>% \n mutate(time_window = fct_relevel(as.factor(time_window), \n levels(timepoint_use_variable)),\n detection_limit = as.factor(detection_limit),\n Sensitivity = round(Sensitivity*100,0),\n Specificity = round(Specificity*100,0),\n my_label = paste0(time_window, ' N=', N),\n my_label = gsub(\" \", \"\\n\", my_label),\n my_label = fct_relevel(as.factor(my_label), x_labels)) #%>% \n \n p1 <- df_ss %>% \n ggplot(aes(x = my_label, y = Sensitivity, \n colour = detection_limit)) + \n geom_line(size = 1.2, aes(group = detection_limit, \n linetype = detection_limit)) + \n geom_vline(xintercept=vertline, linetype = 'dotted') +\n geom_point(show.legend=FALSE) + \n ylim(0,100) +\n scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +\n scale_linetype_manual(values=lines) +\n scale_color_manual(values = col_val, name = \"Cutoff \\n (ng/mL)\",\n guide = guide_legend(override.aes = list(linetype = c(1),\n shape = rep(NA, length(lines))) )) +\n theme_classic() +\n theme( axis.title = element_text(size=16),\n axis.text = element_text(size=10),\n legend.position = c(0.08, 0.4),\n panel.grid = element_blank(),\n strip.background = element_blank()\n ) +\n guides(linetype = FALSE) +\n labs(x = \"Time Window\", \n y = \"Sensitivity\", \n title = paste0(tissue,\": \", cpd) )\n \n p2 <- df_ss %>% \n 
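## specificity panel: same time windows, one line per cutoff (legend suppressed)\n    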
ggplot(aes(x = my_label, y = Specificity,\n group = detection_limit, \n colour = detection_limit, \n linetype = detection_limit)) + \n geom_line(size = 1.2) +\n geom_vline(xintercept=vertline, linetype = 'dotted') +\n geom_point() + \n ylim(0,100) +\n scale_color_manual(values = col_val) +\n scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +\n scale_linetype_manual(values = lines, \n guide = guide_legend(override.aes = list(linetype = \"solid\",\n shape = rep(NA, length(lines))) )) +\n theme_classic() +\n theme(axis.title = element_text(size=16),\n axis.text = element_text(size=10),\n legend.position = \"none\", \n panel.grid = element_blank(),\n strip.background = element_blank()) +\n labs(x = \"Time Window\", \n y = \"Specificity\",\n title = \"\" )\n \n title <- ggdraw() + \n draw_label(\n tissue,\n x = 0.05,\n hjust = 0\n )\n \n plot_row <- plot_grid(p1, p2, labels = labels, label_size = 12)\n \n plot_grid(\n title, plot_row,\n ncol = 1,\n # rel_heights values control vertical title margins\n rel_heights = c(0.1, 1)\n )\n \n return(list(plot_row, df_ss))\n\n}\n```\n:::\n\n\n### WB\n\n\n::: {.cell}\n\n```{.r .cell-code}\nblood_levels <- c(\"pre-smoking\\nN=189\", \"0-30\\nmin\\nN=187\", \"31-70\\nmin\\nN=165\",\n \"71-100\\nmin\\nN=157\", \"101-180\\nmin\\nN=168\", \"181-210\\nmin\\nN=103\",\n \"211-240\\nmin\\nN=127\", \"241-270\\nmin\\nN=137\", \"271-300\\nmin\\nN=120\",\n \"301+\\nmin\\nN=88\")\n\nplot_cutoffs(dataset=WB_THC, \n timepoint_use_variable=WB$timepoint_use, \n tissue=\"Blood\", \n vertline=levels(WB$timepoint_use)[5], \n cpd=\"THC\", \n x_labels=blood_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-17-1.png){width=2100}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 50 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 81 108 0.5 THC -400 0\n 2 0 0 61 128 1 THC -400 0\n 3 0 0 45 144 2 THC -400 0\n 4 0 0 10 179 5 THC -400 0\n 5 0 0 1 188 10 THC -400 0\n 6 124 2 28 33 0.5 THC 0 30\n 7 123 3 22 39 1 THC 0 30\n 8 119 7 15 46 2 THC 0 30\n 9 106 20 4 57 5 THC 0 30\n10 101 25 0 61 10 THC 0 30\n# ℹ 40 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n### OF\n\n\n::: {.cell}\n\n```{.r .cell-code}\nof_levels <- c(\"pre-smoking\\nN=192\", \"0-30\\nmin\\nN=192\", \"31-90\\nmin\\nN=117\",\n \"91-180\\nmin\\nN=99\", \"181-210\\nmin\\nN=102\", \"211-240\\nmin\\nN=83\",\n \"241-270\\nmin\\nN=90\", \"271+\\nmin\\nN=76\")\n\nplot_cutoffs(OF_THC, OF$timepoint_use, tissue = \"Oral Fluid\", labels = c(\"A\", \"B\"), vertline=levels(OF$timepoint_use)[4], cpd=\"THC\", x_labels=of_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-18-1.png){width=2100}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 40 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 35 157 0.5 THC -400 0\n 2 0 0 20 172 1 THC -400 0\n 3 0 0 9 183 2 THC -400 0\n 4 0 0 0 192 5 THC -400 0\n 5 0 0 0 192 10 THC -400 0\n 6 129 0 39 24 0.5 THC 0 30\n 7 129 0 30 33 1 THC 0 30\n 8 128 1 19 44 2 THC 0 30\n 9 128 1 3 60 5 THC 0 30\n10 125 4 1 62 10 THC 0 30\n# ℹ 30 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , 
my_label \n```\n:::\n:::\n\n\n:::\n\n\n## Calculate: CBN\nReminder: Currently, states have laws on the books from zero tolerance (detection of any level) to 5ng/mL\n\n:::panel-tabset\n\n### WB\n\n\n::: {.cell}\n\n```{.r .cell-code}\nWB_CBN = sens_spec_cpd(dataset = WB, cpd = 'cbn',\n timepoints = timepoints_WB,\n splits = cutoffs) %>% clean_gluc()\n\nblood_levels <- c(\"pre-smoking\\nN=189\", \"0-30\\nmin\\nN=187\", \"31-70\\nmin\\nN=165\",\n \"71-100\\nmin\\nN=157\", \"101-180\\nmin\\nN=168\", \"181-210\\nmin\\nN=103\",\n \"211-240\\nmin\\nN=127\", \"241-270\\nmin\\nN=137\", \"271-300\\nmin\\nN=120\",\n \"301+\\nmin\\nN=88\")\n\nplot_cutoffs(WB_CBN, WB$timepoint_use, tissue = \"Blood\", vertline=levels(WB$timepoint_use)[5], cpd=\"CBN\", x_labels=blood_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-19-1.png){width=2100}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 50 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 1 188 0.5 CBN -400 0\n 2 0 0 0 189 1 CBN -400 0\n 3 0 0 0 189 2 CBN -400 0\n 4 0 0 0 189 5 CBN -400 0\n 5 0 0 0 189 10 CBN -400 0\n 6 106 20 7 54 0.5 CBN 0 30\n 7 97 29 0 61 1 CBN 0 30\n 8 82 44 0 61 2 CBN 0 30\n 9 40 86 0 61 5 CBN 0 30\n10 9 117 0 61 10 CBN 0 30\n# ℹ 40 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n### OF\n\n\n::: {.cell}\n\n```{.r .cell-code}\nOF_CBN = sens_spec_cpd(dataset = OF, cpd = 'cbn',\n timepoints = timepoints_OF,\n splits = cutoffs) %>% clean_gluc()\n\nof_levels <- c(\"pre-smoking\\nN=192\", \"0-30\\nmin\\nN=192\", \"31-90\\nmin\\nN=117\",\n \"91-180\\nmin\\nN=99\", \"181-210\\nmin\\nN=102\", \"211-240\\nmin\\nN=83\",\n \"241-270\\nmin\\nN=90\", \"271+\\nmin\\nN=76\")\n\nplot_cutoffs(OF_CBN, OF$timepoint_use, tissue = \"Oral Fluid\", labels = c(\"A\", \"B\"), vertline=levels(OF$timepoint_use)[4], cpd=\"CBN\", x_labels=of_levels)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n```\n:::\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-20-1.png){width=2100}\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\n[[2]]\n# A tibble: 40 × 18\n TP FN FP TN detection_limit compound time_start time_stop\n \n 1 0 0 5 187 0.5 CBN -400 0\n 2 0 0 1 191 1 CBN -400 0\n 3 0 0 1 191 2 CBN -400 0\n 4 0 0 1 191 5 CBN -400 0\n 5 0 0 0 192 10 CBN -400 0\n 6 127 2 41 22 0.5 CBN 0 30\n 7 125 4 32 31 1 CBN 0 30\n 8 122 7 18 45 2 CBN 0 30\n 9 116 13 7 56 5 CBN 0 30\n10 107 22 3 60 10 CBN 0 30\n# ℹ 30 more rows\n# ℹ 10 more variables: time_window , NAs , N , N_removed ,\n# Sensitivity , Specificity , PPV , NPV ,\n# Efficiency , my_label \n```\n:::\n:::\n\n\n:::\n\n\n## Compound Correlations\n\n:::panel-tabset\n\n### Code\n\n::: {.cell}\n\n```{.r .cell-code}\nggplotRegression <- function (x, y, xlab, ylab, x_text, y_text, y_text2, title) {\n fit <- lm(y ~ x)\n if(max(fit$model[,1],na.rm=TRUE)!=0){\n ggplot(fit$model, aes_string(x = names(fit$model)[2], \n y = names(fit$model)[1])) + \n geom_point() +\n stat_smooth(method = \"lm\", col = \"#B73239\", size = 1.5, se = FALSE) +\n annotate(\"text\", x=x_text, y=y_text, \n label = paste(\"R^2 == \", format(signif(summary(fit)$adj.r.squared, 5), \n digits=2)),\n vjust=1, hjust=0, parse=TRUE,size=4.5) +\n labs(x = xlab, \n y = ylab, \n title = title ) +\n annotate(\"text\", x=x_text, y=y_text2, label = paste(\n 
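## assemble the equation label: slope, intercept, and sample size\n        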
\"y = \", format(signif(fit$coef[[2]], 5),digits=2),\n \"x + \",\n format(signif(fit$coef[[1]],5 ),digits=2),\n paste0(\"\\nN = \", length(x))),\n vjust=1, hjust=0, size=4.5) + \n theme_minimal(base_size=14) +\n theme(panel.grid = element_blank(),\n axis.line = element_line(size = 0.5, linetype = \"solid\",\n colour = \"black\"),\n legend.position=\"none\") \n } else{\n ggplot(fit$model, aes_string(x = names(fit$model)[2], \n y = names(fit$model)[1])) + \n geom_point() +\n scale_y_continuous(limits = c(0,3)) +\n stat_smooth(method = \"lm\", col = \"#B73239\", size = 1.5, se = FALSE) +\n annotate(\"text\", x=x_text, y=y_text, \n label = paste(\"R^2 == \", format(signif(summary(fit)$adj.r.squared, 5), digits=2)),vjust=1, hjust=1, parse=TRUE,size=4.5) +\n labs(x = xlab, \n y = ylab, \n title = title ) +\n annotate(\"text\", x=x_text, y=y_text2, label = paste(\n \"y = \", format(signif(fit$coef[[2]], 5),digits=2),\n \"x + \",\n format(signif(fit$coef[[1]],5 ),digits=2),\n paste0(\"\\nN = \", length(x))), vjust=1, hjust=1,size=4.5) + \n theme_minimal(base_size = 14) +\n theme(panel.grid = element_blank(),\n axis.line = element_line(size = 0.5, linetype = \"solid\",\n colour = \"black\"),\n legend.position=\"none\") \n \n \n }\n}\n```\n:::\n\n\n### Plot\n\n::: {.cell}\n\n```{.r .cell-code}\nwb_reg <- ggplotRegression(WB$thc, WB$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text= 150, y_text = 7, y_text2 = 5, title = \"Blood\")\n\nof_reg <- ggplotRegression(OF$thc, OF$cbn, xlab = 'THC (ng/mL)', ylab = 'CBN (ng/mL)', x_text= 12500, y_text = 750, y_text2 = 500, title = \"Oral Fluid\")\n\nplot_grid(wb_reg, of_reg, labels = 'AUTO', label_size = 12, ncol = 2, scale = 1)\n```\n\n::: {.cell-output-display}\n![](13-cs01-analysis_files/figure-html/unnamed-chunk-22-1.png){width=2100}\n:::\n:::\n\n\n:::\n\n## Possible Extensions\n\nOur current question asks for a *single* compound...and you'll need to decide that.\n\n. . . \n\n...but you could imagine a world where more than one compound or more than one matrix could be measured at the roadside.\n\n. . . \n\nSo:\n\n:::incremental\n- combination of the oral fluid and blood that would better predict recent use? (For example if an officer stopped a driver and got a high oral fluid, but could not get a blood sample for a couple of hours and got a relatively low result would this predict recent use better than blood (or OF) alone? \n- Is there a ratio of OF/blood that predicts recent use?\n- Machine learning model to determine optimal combination of measurements/cutoffs to detect recent use?\n:::\n\n. . . \n\nThings to keep in mind:\n\n- some matrices are easier to get at the roadside\n- time from use matters (trying to detect *recent* use)\n- we may not care equally about sensitivity and specificity\n\n## cs01: what to do now?\n\n1. Communicate with your group!\n2. Discuss possible extensions\n3. Make a plan; figure out who's doing what; set deadlines\n4. 
Implement the plan!\n\n\n## What has to be done:\n\n:::incremental\n- Question | include in Rmd; add extension if applicable\n- Background | summarize and add to what was discussed in classed\n- Data\n - Describe data & variables\n - Data wrangling | likely copy + paste from notes; add explanation as you go\n- Analysis\n - EDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader\n - Analysis | likely borrowing *most/all* from class; interpretations/guiding reader/contextualizing is essential\n - Extension | must be completed\n- Conclusion | summarize\n- Proofread | ensure it makes sense from top to bottom\n- General Audience communication (submit on Canvas; 1 submission per group)\n:::\n\n## Collaborating on GitHub\n\n- Be sure to pull changes every time you sit down to work\n- Avoid working on the same part of the same file as another teammate OR work in separate files and combine at the end\n- push your changes once you're ready to add them to the group\n\n## Recap {.smaller background-color=\"#92A86A\"}\n\n- Can you describe sensitivity? Specificity?\n- Can you explain how TP, TN, FP, and FN were calculated/defined *in this experiment*?\n- Can you describe the code used to carry out the calculations?\n- Can you interpret the results from these data?\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" @@ -9,9 +9,6 @@ "includes": { "include-in-header": [ "\n\n\n\n\n\n\n\n\n" - ], - "include-after-body": [ - "\n\n\n" ] }, "engineDependencies": {}, diff --git a/content/lectures/13-cs01-analysis.qmd b/content/lectures/13-cs01-analysis.qmd index 9cc0a78..741c6c2 100644 --- a/content/lectures/13-cs01-analysis.qmd +++ b/content/lectures/13-cs01-analysis.qmd @@ -834,6 +834,36 @@ Things to keep in mind: - time from use matters (trying to detect *recent* use) - we may not care equally about sensitivity and specificity +## cs01: what to do now? + +1. Communicate with your group! +2. Discuss possible extensions +3. Make a plan; figure out who's doing what; set deadlines +4. Implement the plan! 
+
+
+## What has to be done:
+
+:::incremental
+- Question | include in Rmd; add extension if applicable
+- Background | summarize and add to what was discussed in class
+- Data
+    - Describe data & variables
+    - Data wrangling | likely copy + paste from notes; add explanation as you go
+- Analysis
+    - EDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader
+    - Analysis | likely borrowing *most/all* from class; interpretations/guiding reader/contextualizing is essential
+    - Extension | must be completed
+- Conclusion | summarize
+- Proofread | ensure it makes sense from top to bottom
+- General Audience communication (submit on Canvas; 1 submission per group)
+:::
+
+## Collaborating on GitHub
+
+- Be sure to pull changes every time you sit down to work
+- Avoid working on the same part of the same file as another teammate OR work in separate files and combine at the end
+- Push your changes once you're ready to add them to the group

 ## Recap {.smaller background-color="#92A86A"}

diff --git a/content/lectures/images/14/LastLecture.png b/content/lectures/images/14/LastLecture.png
deleted file mode 100644
index 416d2a9..0000000
Binary files a/content/lectures/images/14/LastLecture.png and /dev/null differ
diff --git a/content/lectures/images/14/ex2.png b/content/lectures/images/14/ex2.png
deleted file mode 100644
index f8cbf94..0000000
Binary files a/content/lectures/images/14/ex2.png and /dev/null differ
diff --git a/content/lectures/images/14/sebastianalgharaballi.png b/content/lectures/images/14/sebastianalgharaballi.png
deleted file mode 100644
index 19125b7..0000000
Binary files a/content/lectures/images/14/sebastianalgharaballi.png and /dev/null differ
diff --git a/content/lectures/images/14/shenovad.png b/content/lectures/images/14/shenovad.png
deleted file mode 100644
index 8071ae3..0000000
Binary files a/content/lectures/images/14/shenovad.png and /dev/null differ
diff --git a/docs/content/lectures/08-effective-communication-slides.html b/docs/content/lectures/08-effective-communication-slides.html
index be8f7b6..ec60768 100644
--- a/docs/content/lectures/08-effective-communication-slides.html
+++ b/docs/content/lectures/08-effective-communication-slides.html
@@ -556,37 +556,7 @@
 code span.wa { color: #5e5e5e; font-style: italic; }
-
+
+
+
@@ -4046,7 +4046,7 @@

Calculating Sensitivity and Specificity

This takes a few minutes to run… (reminder: cache=TRUE)
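In the source `.qmd`, that reminder corresponds to a chunk option: marking the slow chunk as cached saves its results and reuses them on re-render. A minimal sketch using knitr's `cache=TRUE` (Quarto's `#| cache: true` is equivalent; the chunk body here is illustrative):

```{r, cache=TRUE}
# slow computation lives in this chunk, e.g. the map_dfr() call shown below
```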

output_WB <- map_dfr(compounds_WB, 
                      ~sens_spec_cpd(dataset = WB, cpd = all_of(.x), 
                                     timepoints = timepoints_WB)) %>% clean_gluc()
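The same map extends to the other two matrices, and the returned counts can be sanity-checked against the definitions. A sketch, assuming `compounds_OF`/`compounds_BR` and `timepoints_OF`/`timepoints_BR` are set up analogously to the whole-blood objects (the resulting `output_OF` and `output_BR` are used below):

output_OF <- map_dfr(compounds_OF,   # assumes compounds_OF/timepoints_OF exist, as for WB
                     ~sens_spec_cpd(dataset = OF, cpd = all_of(.x),
                                    timepoints = timepoints_OF)) %>% clean_gluc()
output_BR <- map_dfr(compounds_BR,   # assumes compounds_BR/timepoints_BR exist, as for WB
                     ~sens_spec_cpd(dataset = BR, cpd = all_of(.x),
                                    timepoints = timepoints_BR)) %>% clean_gluc()

# sanity check: recompute sensitivity & specificity from the raw counts
# (up to any percentage scaling used by sens_spec_cpd)
output_WB %>%
  mutate(sens_check = TP / (TP + FN),
         spec_check = TN / (TN + FP)) %>%
  select(compound, detection_limit, Sensitivity, sens_check, Specificity, spec_check)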
@@ -4190,27 +4190,27 @@ 

ROC

ss1_a <- ss_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = "Blood")
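# ss_plot() is one of the course helpers sourced from src/cs01_functions.R:
# it plots sensitivity and specificity across the time windows for one tissue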
ss2_a <- ss_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = "Oral Fluid")
ss3_a <- roc_plot(output_BR, tpts = length(unique(output_BR$time_start)), tissue = "Breath")
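# roc_plot() (also from src/cs01_functions.R) summarizes the same counts as
# an ROC-style curve, here for the breath data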
@@ -4219,7 +4219,7 @@

ROC

bottom_row <- plot_grid(ss2_a, ss3_a, labels = c('B', 'C'), label_size = 12, ncol = 2, rel_widths = c(0.66, .33))
 plot_grid(ss1_a, bottom_row, labels = c('A', ''), label_size = 12, ncol = 1)
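To keep the assembled figure for the report, it can be written to disk. A sketch (file name and dimensions are illustrative):

fig_all <- plot_grid(ss1_a, bottom_row, labels = c('A', ''), label_size = 12, ncol = 1)
ggsave("sens-spec-all-matrices.png", fig_all, width = 10, height = 8)  # illustrative path/size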
@@ -4403,7 +4403,7 @@

Cutoffs

[[1]]

@@ -4439,7 +4439,7 @@ 

Cutoffs

[[1]]

@@ -4489,7 +4489,7 @@ 

Calculate: CBN

[[1]]

@@ -4529,7 +4529,7 @@ 

Calculate: CBN

[[1]]

@@ -4624,7 +4624,7 @@ 

Compound Correlations

plot_grid(wb_reg, of_reg, labels = 'AUTO', label_size = 12, ncol = 2, scale = 1)
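The annotations drawn by ggplotRegression() can be double-checked with a direct fit on the same columns; a sketch:

wb_fit <- lm(cbn ~ thc, data = WB)
summary(wb_fit)$adj.r.squared              # should match the R^2 label in panel A
coef(wb_fit)                               # intercept and slope from the y = mx + b label
cor(WB$thc, WB$cbn, use = "complete.obs")  # strength of the cross-compound correlation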
@@ -4656,6 +4656,46 @@

Possible Extensions

+
+

cs01: what to do now?

+  1. Communicate with your group!
+  2. Discuss possible extensions
+  3. Make a plan; figure out who’s doing what; set deadlines
+  4. Implement the plan!
  8. +
+
+
+

What has to be done:

+
+
+  • Question | include in Rmd; add extension if applicable
+  • Background | summarize and add to what was discussed in class
+  • Data
+    • Describe data & variables
+    • Data wrangling | likely copy + paste from notes; add explanation as you go
+  • Analysis
+    • EDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader
+    • Analysis | likely borrowing most/all from class; interpretations/guiding reader/contextualizing is essential
+    • Extension | must be completed
+  • Conclusion | summarize
+  • Proofread | ensure it makes sense from top to bottom
+  • General Audience communication (submit on Canvas; 1 submission per group)
+
+
+
+

Collaborating on GitHub

+
+  • Be sure to pull changes every time you sit down to work
+  • Avoid working on the same part of the same file as another teammate OR work in separate files and combine at the end
+  • Push your changes once you’re ready to add them to the group (one scripted option is sketched below)
+
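The same routine can be scripted from R if that is easier to remember; a minimal sketch using the gert package (an assumption; the plain git command line works just as well, and the file name and commit message are illustrative):

library(gert)                          # gert is an assumption, not required by the course
git_pull()                             # start of a session: grab teammates' changes
git_add("cs01-analysis.Rmd")           # stage the file you edited (illustrative name)
git_commit("Add CBN cutoff analysis")  # record your work (illustrative message)
git_push()                             # share it once you're ready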

Recap

    @@ -5693,44 +5733,6 @@

    Recap

@@ -5826,6 +5829,46 @@

Possible Extensions +
+

cs01: what to do now?

+
+  1. Communicate with your group!
+  2. Discuss possible extensions
+  3. Make a plan; figure out who’s doing what; set deadlines
+  4. Implement the plan!
+
+
+

What has to be done:

+
+
+  • Question | include in Rmd; add extension if applicable
+  • Background | summarize and add to what was discussed in class
+  • Data
+    • Describe data & variables
+    • Data wrangling | likely copy + paste from notes; add explanation as you go
+  • Analysis
+    • EDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader
+    • Analysis | likely borrowing most/all from class; interpretations/guiding reader/contextualizing is essential
+    • Extension | must be completed
+  • Conclusion | summarize
+  • Proofread | ensure it makes sense from top to bottom
+  • General Audience communication (submit on Canvas; 1 submission per group)
+
+
+
+

Collaborating on GitHub

+
+  • Be sure to pull changes every time you sit down to work
+  • Avoid working on the same part of the same file as another teammate OR work in separate files and combine at the end
+  • Push your changes once you’re ready to add them to the group
+

Recap

    diff --git a/docs/schedule.html b/docs/schedule.html index 917aae9..f46c9cf 100644 --- a/docs/schedule.html +++ b/docs/schedule.html @@ -496,10 +496,10 @@

@@ -651,7 +651,7 @@

@@ -675,7 +675,7 @@

@@ -693,19 +693,19 @@

    - + - + - + diff --git a/docs/search.json b/docs/search.json index 9736a84..fd81157 100644 --- a/docs/search.json +++ b/docs/search.json @@ -7403,6 +7403,321 @@ "href": "schedule.html", "title": "COGS 137", "section": "", - "text": "Week\nDate\nTitle\nType\n\n\n\n\n0\nTh Sep 28\nWelcome & Tooling\nLecture\n\n\n1\nTu Oct 3\nIntro to R\nLecture\n\n\n1\nTh Oct 5\nData Wrangling: dplyr\nLecture\n\n\n1\nFri Oct 6\nLab 01: Intro to R\nLab\n\n\n2\nTu Oct 10\nData Wrangling: tidyr\nLecture\n\n\n2\nTh Oct 12\nData Visualization: ggplot2 (day 1)\nLecture\n\n\n2\nFri Oct 13\nLab 02: Data Wrangling\nLab\n\n\n3\nMon Oct 16\nHW01 due (11:59 PM)\nHW\n\n\n3\nTu Oct 17\nData Visualization: ggplot2 (day 2)\nLecture\n\n\n3\nTh Oct 19\nData Analysis & Modeling\nLecture\n\n\n3\nFri Oct 20\nLab 03: Data Visualization\nLab\n\n\n4\nTu Oct 24\nLinear Models Review\nLecture\n\n\n4\nTh Oct 26\nEffective Communication\nLecture\n\n\n4\nFri Oct 27\nLab 04: Modeling\nLab\n\n\n5\nMon Oct 30\nHW02 due (11:59 PM)\nHW\n\n\n5\nTu Oct 31\nMultiple Linear Regression*\nLecture\n\n\n5\nTh Nov 2\nCase Study & Final Project Info\nLecture\n\n\n5\nFri Nov 3\nLab used for midterm review\nLab\n\n\n6\nMon Nov 6\nMIDTERM EXAM (due 11:59 PM) \nExam\n\n\n6\nTu Nov 7\nCase Study 01: THC Biomarkers (day 1)\nLecture\n\n\n6\nTh Nov 9\nCase Study 01: THC Biomarkers (day 2)\nLecture\n\n\n6\nFri Nov 10\nLab 05: Multiple Linear Regression\nLab\n\n\n7\nTu Nov 14\nCase Study 01: THC Biomarkers (day 3)\nLecture\n\n\n7\nTh Nov 16\nLogistic Regression\nLecture\n\n\n7\nFri Nov 17\nLab 06: CS01\nLab\n\n\n8\nMon Nov 20\nHW03 due (11:59 PM)\nHW\n\n\n8\nMon Nov 20\nFinal Project Proposal Due\nProject\n\n\n8\nTu Nov 21\nCase Study 02: Vaping Behaviors (day 1)\nLecture\n\n\n8\nTh Nov 23\nNo Class (Thanksgiving)\n--\n\n\n9\nMon Nov 27\nCS01 Due (11:59 PM)\nCase Study\n\n\n9\nTu Nov 28\nCase Study 02: Vaping Behaviors (day 2)\nLecture\n\n\n9\nTh Nov 30\nCase Study 02: Vaping Behaviors (day 3)\nLecture\n\n\n9\nFri Dec 1\nLab 07: Logistic Regression & CS02\nLab\n\n\n10\nMon Dec 4\nCS02 Due\nCase Study\n\n\n10\nTu Dec 5\nFinal Project Brainstorming\nLecture\n\n\n10\nTh Dec 7\nNext Steps\nLecture\n\n\n10\nFri Dec 8\nLab 08: Final Project\nLab\n\n\nFinals\nTu Dec 12\nFinal Project Due (11:59 PM)\nProject" + "text": "Week\nDate\nTitle\nType\n\n\n\n\n0\nTh Sep 28\nWelcome & Tooling\nLecture\n\n\n1\nTu Oct 3\nIntro to R\nLecture\n\n\n1\nTh Oct 5\nData Wrangling: dplyr\nLecture\n\n\n1\nFri Oct 6\nLab 01: Intro to R\nLab\n\n\n2\nTu Oct 10\nData Wrangling: tidyr\nLecture\n\n\n2\nTh Oct 12\nData Visualization: ggplot2 (day 1)\nLecture\n\n\n2\nFri Oct 13\nLab 02: Data Wrangling\nLab\n\n\n3\nMon Oct 16\nHW01 due (11:59 PM)\nHW\n\n\n3\nTu Oct 17\nData Visualization: ggplot2 (day 2)\nLecture\n\n\n3\nTh Oct 19\nData Analysis & Modeling\nLecture\n\n\n3\nFri Oct 20\nLab 03: Data Visualization\nLab\n\n\n4\nTu Oct 24\nLinear Models Review\nLecture\n\n\n4\nTh Oct 26\nEffective Communication\nLecture\n\n\n4\nFri Oct 27\nLab 04: Modeling\nLab\n\n\n5\nMon Oct 30\nHW02 due (11:59 PM)\nHW\n\n\n5\nTu Oct 31\nMultiple Linear Regression*\nLecture\n\n\n5\nTh Nov 2\nCase Study & Final Project Info\nLecture\n\n\n5\nFri Nov 3\nLab used for midterm review\nLab\n\n\n6\nMon Nov 6\nMIDTERM EXAM (due 11:59 PM) \nExam\n\n\n6\nTu Nov 7\nCase Study 01: THC Biomarkers (day 1)\nLecture\n\n\n6\nTh Nov 9\nCase Study 01: THC Biomarkers (day 2)\nLecture\n\n\n6\nFri Nov 10\nLab 05: Multiple Linear Regression\nLab\n\n\n7\nTu Nov 14\nCase Study 01: THC Biomarkers (day 3)\nLecture\n\n\n7\nTh Nov 
16\ntidymodels\nLecture\n\n\n7\nFri Nov 17\nLab 06: CS01\nLab\n\n\n8\nMon Nov 20\nHW03 due (11:59 PM)\nHW\n\n\n8\nMon Nov 20\nFinal Project Proposal Due\nProject\n\n\n8\nTu Nov 21\nCase Study 02: Air Pollution (day 1)\nLecture\n\n\n8\nTh Nov 23\nNo Class (Thanksgiving)\n--\n\n\n9\nMon Nov 27\nCS01 Due (11:59 PM)\nCase Study\n\n\n9\nTu Nov 28\nCase Study 02: Air Pollution (day 2)\nLecture\n\n\n9\nTh Nov 30\nCase Study 02: Air Pollution (day 3)\nLecture\n\n\n9\nFri Dec 1\nLab 07: CS02\nLab\n\n\n10\nMon Dec 4\nCS02 Due\nCase Study\n\n\n10\nTu Dec 5\nFinal Project Brainstorming\nLecture\n\n\n10\nTh Dec 7\nNext Steps\nLecture\n\n\n10\nFri Dec 8\nLab 08: Final Project\nLab\n\n\nFinals\nTu Dec 12\nFinal Project Due (11:59 PM)\nProject" + }, + { + "objectID": "content/lectures/13-cs01-analysis.html#cs01-what-to-do-now", + "href": "content/lectures/13-cs01-analysis.html#cs01-what-to-do-now", + "title": "13-cs01-analysis", + "section": "cs01: what to do now?", + "text": "cs01: what to do now?\n\nCommunicate with your group!\nDiscuss possible extensions\nMake a plan; figure out who’s doing what; set deadlines\nImplement the plan!" + }, + { + "objectID": "content/lectures/13-cs01-analysis.html#what-has-to-be-done", + "href": "content/lectures/13-cs01-analysis.html#what-has-to-be-done", + "title": "13-cs01-analysis", + "section": "What has to be done:", + "text": "What has to be done:\n\n\nQuestion | include in Rmd; add extension if applicable\nBackground | summarize and add to what was discussed in classed\nData\n\nDescribe data & variables\nData wrangling | likely copy + paste from notes; add explanation as you go\n\nAnalysis\n\nEDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader\nAnalysis | likely borrowing most/all from class; interpretations/guiding reader/contextualizing is essential\nExtension | must be completed\n\nConclusion | summarize\nProofread | ensure it makes sense from top to bottom\nGeneral Audience communication (submit on Canvas; 1 submission per group)" + }, + { + "objectID": "content/lectures/13-cs01-analysis.html#collaborating-on-github", + "href": "content/lectures/13-cs01-analysis.html#collaborating-on-github", + "title": "13-cs01-analysis", + "section": "Collaborating on GitHub", + "text": "Collaborating on GitHub\n\nBe sure to pull changes every time you sit down to work\nAvoid working on the same part of the same file as another teammate OR work in separate files and combine at the end\npush your changes once you’re ready to add them to the group" + }, + { + "objectID": "content/lectures/14-tidymodels.html", + "href": "content/lectures/14-tidymodels.html", + "title": "14-tidymodels", + "section": "", + "text": "Q: I had a question about the presentations for the final projects; since it is due during finals week, is it a live presentation in class or do we submit a video? If it is a live presentation, do we present during our designated final day/time on webreg?\nA: Video submission!\n\n\nQ: I also wanted to mention that the mid/pre-course extra credit surveys doesn’t reflect a change in grade on canvas. (For ex. if i put a 0 or 100 for E.C my grade stays the same).\nA: Correct - I add these in at the end. Canvas can do many things, but it doesn’t handle EC well (from what I can tell).\n\n\nQ: I’m overwhelmed/confused by “the code :’) it’s quite a bit to take in”\nA: Yes! It’s a lot! This is why we have group mates on the case study. 
I encourage everyone to sit with the code after class and then work through it together as you complete the case study!\n\n\nQ: For oral fluid you mentioned looking more into why there’s that big dip in specificity and that we should look more into that on Friday with eda but would that be slightly guided because I have no idea where to start with that.\nA: I would make some plots that specifically look at the data/numbers there to figure out what could be leading to that drop at that particular time window.\n\n\nQ: Why are specificity graphs so high?  A: Good question - this is generally b/c people who didn’t smoke have values very close to zero across compounds…so they will rarely be above the cutoff, making this very effective at identifying individuals who did not smoke\n\n\nQ: What is the dplyr::select notation, like is it a way to use select from dplyr without librarying first?\nA: Yes!\n\n\nQ: Also separate topic, but do we have information on impairment so we can account for that with recent use?  A: Great question - impairment is very hard to define here. We (the researchers) have data on self-reported high and what the police officers determined, but y’all don’t have that data. So, we’re using knowledge from other studies (see 11-cs01-data notes) to understand what we know on impoairment but only focusing on detecting recent use here.\n\n\nQ: I am unable to locate where to sign up for groups for the final project\nA: This form was just released (sorry for delay). link to survey\n\n\nQ: I think I need more time to digest how the code works together to produce the visuals that we saw.\nA: I agree. I think I could balance and give more time in class…but I will say this is an exercise I want groups to work through together!\n\n\n\n\nDue Dates:\n\nLab06 due Friday - cs01-focused\nHW03 (MLR) due Mon 11/20\nProject Proposal due Mon 11/20 link\nCS01 due 11/27:\n\nReport (submitted via GitHub)\n“General Communication” (submit on Canvas; one submission per group)\nsurvey about how working with group went - due 11/28\n\n\n\n\n\n\nmachine learning intro\n(re)introduce tidymodels\nworked example: ML in tidymodels\n\n\n\n\nThe package itself has some worked examples: https://www.tidymodels.org/start/models/\nThere’s a whole book (written by the developer of tidymodels) that covers the tidymodels package: https://www.tmwr.org/\nThis may be a helpful reference for this lecture, although, of course, the book goes into way more detail than we will in a single lecture.\n\n\n\n\n“Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret. The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. 
We are working towards higher-level APIs that can replicate and extend what the current model packages can do.” - Max Kuhn (tidymodels developer)\n\n\nBenefits:\n\nStandardized workflow/format/notation across different types of machine learning algorithms\nCan easily modify pre-processing, algorithm choice, and hyper-parameter tuning making optimization easy\n\n\n\n\n\nThe main packages (and their roles):\n ## Machine Learning: intro\nIn intro stats, you should have learned the central dogma of statistics: we sample from a population\n . . .\nThe data from the sample are used to make an inference about the population:\n . . .\nFor prediction, we have a similar sampling problem:\n . . .\nBut now we are trying to build a rule that can be used to predict a single observation’s value of some characteristic using characteristics of the other observations.\n ## ML: the goal\nThe goal is to:\nbuild a machine learning algorithm\n\nthat uses the features as input\n\n\nand predicts a outcome variable\n\n\nin the situation where we do not know the outcome variable.\n\n\n\n\nTypically, you use data where you have both the input and output data to train a machine learning algorithm.\nWhat you need:\n\nA data set to train from.\nAn algorithm or set of algorithms you can use to try values of \\(f\\).\nA distance metric \\(d\\) for measuring how close \\(Y\\) is to \\(\\hat{Y}\\).\nA definition of what a “good” distance is.\n\n\n\n\n\n\n\nHow these packages fit together for carrying out machine learning:\n\n\n\n\n ## Worked example: Hotel Stays\n\nGoal: predict which hotel stays included children and/or babies, based on the other characteristics of the stays such as which hotel the guests stay at, how much they pay, etc\n\nImage Source: https://www.tidymodels.org/start/case-study/\nCase Study: https://www.tidymodels.org/start/case-study/\n\n\n\n\n\nglmnet\n`rf``\n\n\n\n\n\n# install.packages(\"glmnet\")\n# install.packages(\"ranger\")\n# install.packages(\"vip\")\n\nlibrary(tidymodels)\n\n\n\n\nHotel Bookings Data from: Antonio, Almeida, and Nunes (2019)\n\nData Dictionary\n\n\n\nhotels <- \n read_csv(\"https://tidymodels.org/start/case-study/hotels.csv\") |>\n mutate(across(where(is.character), as.factor))\n\n\n\n\nglimpse(hotels)\n\nRows: 50,000\nColumns: 23\n$ hotel City_Hotel, City_Hotel, Resort_Hotel, R…\n$ lead_time 217, 2, 95, 143, 136, 67, 47, 56, 80, 6…\n$ stays_in_weekend_nights 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, 1, …\n$ stays_in_week_nights 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, 2, …\n$ adults 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, …\n$ children none, none, none, none, none, none, chi…\n$ meal BB, BB, BB, HB, HB, SC, BB, BB, BB, BB,…\n$ country DEU, PRT, GBR, ROU, PRT, GBR, ESP, ESP,…\n$ market_segment Offline_TA/TO, Direct, Online_TA, Onlin…\n$ distribution_channel TA/TO, Direct, TA/TO, TA/TO, Direct, TA…\n$ is_repeated_guest 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ previous_cancellations 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ previous_bookings_not_canceled 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ reserved_room_type A, D, A, A, F, A, C, B, D, A, A, D, A, …\n$ assigned_room_type A, K, A, A, F, A, C, A, D, A, D, D, A, …\n$ booking_changes 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ deposit_type No_Deposit, No_Deposit, No_Deposit, No_…\n$ days_in_waiting_list 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ customer_type Transient-Party, Transient, Transient, …\n$ average_daily_rate 80.75, 170.00, 8.00, 81.00, 157.60, 49.…\n$ required_car_parking_spaces none, none, none, none, none, none, 
non…\n$ total_of_special_requests 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, 0, …\n$ arrival_date 2016-09-01, 2017-08-25, 2016-11-19, 20…\n\n\n\n\nOutcome variable: children (two levels):\n\nhotels %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 4038 0.0808\n2 none 45962 0.919 \n\n\n\n\nNotes: - Data are imbalanced - There are methods for combating this issue usingrecipes(search for steps toupsampleordownsample) or other more specialized packages likethemis` - For demo purposes, we’ll model as-is\n\n\n\n\nTypically, data are split into a training and testing datasets\n\n\nTraining | data used to build & tune the model; model “learns” from these data\nTesting | data withheld from training to evaluate model performance (can it generalize?)\n\n\n\n\n\n\n\n\nReminder: children is pretty imbalanced so we’ll use a stratified random sample (to ensure similar proportion of chilren in training and testing)\n\nset.seed(123)\nsplits <- initial_split(hotels, strata = children)\nsplits\n\n\n<37500/12500/50000>\n\n\n\n\nhotel_other <- training(splits)\nhotel_test <- testing(splits)\n\n\n\n\n# training set proportions by children\nhotel_other %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 3027 0.0807\n2 none 34473 0.919 \n\n\n\n\n\n# test set proportions by children\nhotel_test %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 1011 0.0809\n2 none 11489 0.919 \n\n\n\n\n\n\n\nmultiple approaches (k-fold CV)\nhere: validation set\n\n Image Source: https://www.tidymodels.org/start/case-study/\n\n\n\n\nTraining: 30,000\nValidation: 7,500\nTesting: 12,500\n\n\n\nset.seed(234)\nval_set <- validation_split(hotel_other, \n strata = children, \n prop = c(0.8))\n\nval_set\n\n# Validation Set Split (0.8/0.2) using stratification \n# A tibble: 1 × 2\n splits id \n \n1 validation\n\n\n\n\n\n\nList ingredients | specify which variables we will be using as our outcome and predictors (assign roles)\n\n\n\n\nSpecify model (parsnip)\n\nlr_mod <- \n logistic_reg(penalty = tune(), mixture = 1) %>% \n set_engine(\"glmnet\")\n\n\nBuild recipe:\n\n\nstep_date() creates predictors for the year, month, and day of the week.\nstep_holiday() generates a set of indicator variables for specific holidays. Although we don’t know where these two hotels are located, we do know that the countries for origin for most stays are based in Europe.\nstep_rm() removes variables; here we’ll use it to remove the original date variable since we no longer want it in the model.\nstep_dummy() converts characters or factors (i.e., nominal variables) into one or more numeric binary model terms for the levels of the original data.\nstep_zv() removes indicator variables that only contain a single unique value (e.g. all zeros). 
This is important because, for penalized models, the predictors should be centered and scaled.\nstep_normalize() centers and scales numeric variables.\n\n\n\nholidays <- c(\"AllSouls\", \"AshWednesday\", \"ChristmasEve\", \"Easter\", \n \"ChristmasDay\", \"GoodFriday\", \"NewYearsDay\", \"PalmSunday\")\n\nlr_recipe <- \n recipe(children ~ ., data = hotel_other) %>% \n step_date(arrival_date) %>% \n step_holiday(arrival_date, holidays = holidays) %>% \n step_rm(arrival_date) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_predictors())\n\n\n\n\n\n\nlr_workflow <- \n workflow() %>% \n add_model(lr_mod) %>% \n add_recipe(lr_recipe)\n\n\n\n\n\nlr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))\n\n. . . Lowest penalty values:\n\nlr_reg_grid %>% top_n(-5)\n\n# A tibble: 5 × 1\n penalty\n \n1 0.0001 \n2 0.000127\n3 0.000161\n4 0.000204\n5 0.000259\n\n\nHighest penalty values:\n\nlr_reg_grid %>% top_n(5)\n\n# A tibble: 5 × 1\n penalty\n \n1 0.0386\n2 0.0489\n3 0.0621\n4 0.0788\n5 0.1 \n\n\n\n\n\n(Will take a few seconds to run)\n\nlr_res <- \n lr_workflow %>% \n tune_grid(val_set,\n grid = lr_reg_grid,\n control = control_grid(save_pred = TRUE),\n metrics = metric_set(roc_auc))\n\n\n\n\n\nlr_res %>% \n collect_metrics() %>% \n ggplot(aes(x = penalty, y = mean)) + \n geom_point() + \n geom_line() + \n ylab(\"Area under the ROC Curve\") +\n scale_x_log10(labels = scales::label_number())\n\n\n\n\n\n\nModel is generally better at lower penalty values\nMajority of predictors are important to the model\nROC AUC seems like a reasonable metric for assessing model performance\n\n\n\n\n\n\ntop_models <-\n lr_res %>% \n show_best(\"roc_auc\", n = 15) %>% \n arrange(penalty) \n\n\n\ntop_models\n\n# A tibble: 15 × 7\n penalty .metric .estimator mean n std_err .config \n \n 1 0.000127 roc_auc binary 0.872 1 NA Preprocessor1_Model02\n 2 0.000161 roc_auc binary 0.872 1 NA Preprocessor1_Model03\n 3 0.000204 roc_auc binary 0.873 1 NA Preprocessor1_Model04\n 4 0.000259 roc_auc binary 0.873 1 NA Preprocessor1_Model05\n 5 0.000329 roc_auc binary 0.874 1 NA Preprocessor1_Model06\n 6 0.000418 roc_auc binary 0.874 1 NA Preprocessor1_Model07\n 7 0.000530 roc_auc binary 0.875 1 NA Preprocessor1_Model08\n 8 0.000672 roc_auc binary 0.875 1 NA Preprocessor1_Model09\n 9 0.000853 roc_auc binary 0.876 1 NA Preprocessor1_Model10\n10 0.00108 roc_auc binary 0.876 1 NA Preprocessor1_Model11\n11 0.00137 roc_auc binary 0.876 1 NA Preprocessor1_Model12\n12 0.00174 roc_auc binary 0.876 1 NA Preprocessor1_Model13\n13 0.00221 roc_auc binary 0.876 1 NA Preprocessor1_Model14\n14 0.00281 roc_auc binary 0.875 1 NA Preprocessor1_Model15\n15 0.00356 roc_auc binary 0.873 1 NA Preprocessor1_Model16\n\n\n\n\n\nlikely want to pick a penalty before we see the decrease start (fewest predictors; equally-good performance) - 12th model\n\n\nlr_best <- \n lr_res %>% \n collect_metrics() %>% \n arrange(penalty) %>% \n slice(12)\n\n\n\n\n\n\nlr_auc <- \n lr_res %>% \n collect_predictions(parameters = lr_best) %>% \n roc_curve(children, .pred_children) %>% \n mutate(model = \"Logistic Regression\")\n\nautoplot(lr_auc)\n\n\n\n\n\nModel is performing reasonably well…but maybe another model/approach would improve accuracy?\n\n\nThis is where tidymodels shines\n\n\n\n\nDetect possible running in parallel:\n\ncores <- parallel::detectCores()\ncores\n\n[1] 10\n\n\n\n\nrf_mod <- \n rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% \n set_engine(\"ranger\", num.threads = cores) %>% \n 
set_mode(\"classification\")\n\nrf_mod\n\nRandom Forest Model Specification (classification)\n\nMain Arguments:\n mtry = tune()\n trees = 1000\n min_n = tune()\n\nEngine-Specific Arguments:\n num.threads = cores\n\nComputational engine: ranger \n\n\n\n\n\n\n\nrf_recipe <- \n recipe(children ~ ., data = hotel_other) %>% \n step_date(arrival_date) %>% \n step_holiday(arrival_date) %>% \n step_rm(arrival_date) \n\n\n\n\n\nrf_workflow <- \n workflow() %>% \n add_model(rf_mod) %>% \n add_recipe(rf_recipe)\n\n\n\n\n(will take longer to run -it’s trying out 25 different random forests, each w/ 1000 trees)\n\nset.seed(345)\nrf_res <- \n rf_workflow %>% \n tune_grid(val_set,\n grid = 25,\n control = control_grid(save_pred = TRUE),\n metrics = metric_set(roc_auc))\n\n\n\n\n\nrf_res %>% \n show_best(metric = \"roc_auc\")\n\n# A tibble: 5 × 8\n mtry min_n .metric .estimator mean n std_err .config \n \n1 8 7 roc_auc binary 0.926 1 NA Preprocessor1_Model13\n2 12 7 roc_auc binary 0.926 1 NA Preprocessor1_Model01\n3 13 4 roc_auc binary 0.925 1 NA Preprocessor1_Model05\n4 9 12 roc_auc binary 0.924 1 NA Preprocessor1_Model19\n5 6 18 roc_auc binary 0.924 1 NA Preprocessor1_Model24\n\n\n\n\n\n\nautoplot(rf_res)\n\n\n\n\n\n\n\n\nrf_best <- \n rf_res %>% \n select_best(metric = \"roc_auc\")\n\nrf_best\n\n# A tibble: 1 × 3\n mtry min_n .config \n \n1 8 7 Preprocessor1_Model13\n\n\n\n\n\n\nrf_res %>% \n collect_predictions()\n\n# A tibble: 187,500 × 8\n id .pred_children .pred_none .row mtry min_n children .config \n \n 1 validation 0.152 0.848 13 12 7 none Preprocessor…\n 2 validation 0.0302 0.970 20 12 7 none Preprocessor…\n 3 validation 0.513 0.487 22 12 7 children Preprocessor…\n 4 validation 0.0103 0.990 23 12 7 none Preprocessor…\n 5 validation 0.0111 0.989 31 12 7 none Preprocessor…\n 6 validation 0 1 38 12 7 none Preprocessor…\n 7 validation 0 1 39 12 7 none Preprocessor…\n 8 validation 0.00325 0.997 50 12 7 none Preprocessor…\n 9 validation 0.0241 0.976 54 12 7 none Preprocessor…\n10 validation 0.0441 0.956 57 12 7 children Preprocessor…\n# ℹ 187,490 more rows\n\n\n\nrf_auc <- \n rf_res %>% \n collect_predictions(parameters = rf_best) %>% \n roc_curve(children, .pred_children) %>% \n mutate(model = \"Random Forest\")\n\n\n\n\n\nbind_rows(rf_auc, lr_auc) %>% \n ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + \n geom_path(lwd = 1.5, alpha = 0.8) +\n geom_abline(lty = 3) + \n coord_equal() + \n scale_color_viridis_d(option = \"plasma\", end = .6)\n\n\n\n\n\n\n\n\nlast_rf_mod <- \n rand_forest(mtry = 8, min_n = 7, trees = 1000) %>% \n set_engine(\"ranger\", num.threads = cores, importance = \"impurity\") %>% # will allow us to see what was most important in the model\n set_mode(\"classification\")\n\n# the last workflow\nlast_rf_workflow <- \n rf_workflow %>% \n update_model(last_rf_mod)\n\n# the last fit\nset.seed(345)\nlast_rf_fit <- \n last_rf_workflow %>% \n last_fit(splits)\n\nlast_rf_fit\n\n# Resampling results\n# Manual resampling \n# A tibble: 1 × 6\n splits id .metrics .notes .predictions .workflow \n \n1 train/test sp… \n\n\n\n\n\n\nlast_rf_fit %>% \n collect_metrics()\n\n# A tibble: 2 × 4\n .metric .estimator .estimate .config \n \n1 accuracy binary 0.946 Preprocessor1_Model1\n2 roc_auc binary 0.923 Preprocessor1_Model1\n\n\n\n\n\n\nlast_rf_fit %>% \n extract_fit_parsnip() %>% \n vip::vip(num_features = 20)\n\n\n\n\n\n\nlast_rf_fit %>% \n collect_predictions() %>% \n roc_curve(children, .pred_children) %>% \n autoplot()" + }, + { + "objectID": 
"content/lectures/13-cs01-analysis-slides.html#cs01-what-to-do-now", + "href": "content/lectures/13-cs01-analysis-slides.html#cs01-what-to-do-now", + "title": "13-cs01-analysis", + "section": "cs01: what to do now?", + "text": "cs01: what to do now?\n\nCommunicate with your group!\nDiscuss possible extensions\nMake a plan; figure out who’s doing what; set deadlines\nImplement the plan!" + }, + { + "objectID": "content/lectures/13-cs01-analysis-slides.html#what-has-to-be-done", + "href": "content/lectures/13-cs01-analysis-slides.html#what-has-to-be-done", + "title": "13-cs01-analysis", + "section": "What has to be done:", + "text": "What has to be done:\n\n\nQuestion | include in Rmd; add extension if applicable\nBackground | summarize and add to what was discussed in classed\nData\n\nDescribe data & variables\nData wrangling | likely copy + paste from notes; add explanation as you go\n\nAnalysis\n\nEDA | likely borrowing parts from notes and adding more in; be sure to include interpretations of output & guide the reader\nAnalysis | likely borrowing most/all from class; interpretations/guiding reader/contextualizing is essential\nExtension | must be completed\n\nConclusion | summarize\nProofread | ensure it makes sense from top to bottom\nGeneral Audience communication (submit on Canvas; 1 submission per group)" + }, + { + "objectID": "content/lectures/13-cs01-analysis-slides.html#collaborating-on-github", + "href": "content/lectures/13-cs01-analysis-slides.html#collaborating-on-github", + "title": "13-cs01-analysis", + "section": "Collaborating on GitHub", + "text": "Collaborating on GitHub\n\nBe sure to pull changes every time you sit down to work\nAvoid working on the same part of the same file as another teammate OR work in separate files and combine at the end\npush your changes once you’re ready to add them to the group" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#qa", + "href": "content/lectures/14-tidymodels-slides.html#qa", + "title": "14-tidymodels", + "section": "Q&A", + "text": "Q&A\n\nQ: I had a question about the presentations for the final projects; since it is due during finals week, is it a live presentation in class or do we submit a video? If it is a live presentation, do we present during our designated final day/time on webreg?\nA: Video submission!\n\n\nQ: I also wanted to mention that the mid/pre-course extra credit surveys doesn’t reflect a change in grade on canvas. (For ex. if i put a 0 or 100 for E.C my grade stays the same).\nA: Correct - I add these in at the end. Canvas can do many things, but it doesn’t handle EC well (from what I can tell).\n\n\nQ: I’m overwhelmed/confused by “the code :’) it’s quite a bit to take in”\nA: Yes! It’s a lot! This is why we have group mates on the case study. I encourage everyone to sit with the code after class and then work through it together as you complete the case study!\n\n\nQ: For oral fluid you mentioned looking more into why there’s that big dip in specificity and that we should look more into that on Friday with eda but would that be slightly guided because I have no idea where to start with that.\nA: I would make some plots that specifically look at the data/numbers there to figure out what could be leading to that drop at that particular time window.\n\n\nQ: Why are specificity graphs so high?  
A: Good question - this is generally b/c people who didn’t smoke have values very close to zero across compounds…so they will rarely be above the cutoff, making this very effective at identifying individuals who did not smoke\n\n\nQ: What is the dplyr::select notation, like is it a way to use select from dplyr without librarying first?\nA: Yes!\n\n\nQ: Also separate topic, but do we have information on impairment so we can account for that with recent use?  A: Great question - impairment is very hard to define here. We (the researchers) have data on self-reported high and what the police officers determined, but y’all don’t have that data. So, we’re using knowledge from other studies (see 11-cs01-data notes) to understand what we know on impoairment but only focusing on detecting recent use here.\n\n\nQ: I am unable to locate where to sign up for groups for the final project\nA: This form was just released (sorry for delay). link to survey\n\n\nQ: I think I need more time to digest how the code works together to produce the visuals that we saw.\nA: I agree. I think I could balance and give more time in class…but I will say this is an exercise I want groups to work through together!" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#course-announcements", + "href": "content/lectures/14-tidymodels-slides.html#course-announcements", + "title": "14-tidymodels", + "section": "Course Announcements", + "text": "Course Announcements\nDue Dates:\n\nLab06 due Friday - cs01-focused\nHW03 (MLR) due Mon 11/20\nProject Proposal due Mon 11/20 link\nCS01 due 11/27:\n\nReport (submitted via GitHub)\n“General Communication” (submit on Canvas; one submission per group)\nsurvey about how working with group went - due 11/28" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#agenda", + "href": "content/lectures/14-tidymodels-slides.html#agenda", + "title": "14-tidymodels", + "section": "Agenda", + "text": "Agenda\n\nmachine learning intro\n(re)introduce tidymodels\nworked example: ML in tidymodels" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#suggested-reading", + "href": "content/lectures/14-tidymodels-slides.html#suggested-reading", + "title": "14-tidymodels", + "section": "Suggested Reading", + "text": "Suggested Reading\nThe package itself has some worked examples: https://www.tidymodels.org/start/models/\nThere’s a whole book (written by the developer of tidymodels) that covers the tidymodels package: https://www.tmwr.org/\nThis may be a helpful reference for this lecture, although, of course, the book goes into way more detail than we will in a single lecture." + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#tidymodels-philosophy", + "href": "content/lectures/14-tidymodels-slides.html#tidymodels-philosophy", + "title": "14-tidymodels", + "section": "tidymodels: philosophy", + "text": "tidymodels: philosophy\n\n“Other packages, such as caret and mlr, help to solve the R model API issue. These packages do a lot of other things too: pre-processing, model tuning, resampling, feature selection, ensembling, and so on. In the tidyverse, we strive to make our packages modular and parsnip is designed only to solve the interface issue. It is not designed to be a drop-in replacement for caret. The tidymodels package collection, which includes parsnip, has other packages for many of these tasks, and they are designed to work together. 
We are working towards higher-level APIs that can replicate and extend what the current model packages can do.” - Max Kuhn (tidymodels developer)\n\n\nBenefits:\n\nStandardized workflow/format/notation across different types of machine learning algorithms\nCan easily modify pre-processing, algorithm choice, and hyper-parameter tuning making optimization easy" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#tidymodels-ecosystem", + "href": "content/lectures/14-tidymodels-slides.html#tidymodels-ecosystem", + "title": "14-tidymodels", + "section": "tidymodels: ecosystem", + "text": "tidymodels: ecosystem\nThe main packages (and their roles):\n ## Machine Learning: intro\nIn intro stats, you should have learned the central dogma of statistics: we sample from a population\n . . .\nThe data from the sample are used to make an inference about the population:\n . . .\nFor prediction, we have a similar sampling problem:\n . . .\nBut now we are trying to build a rule that can be used to predict a single observation’s value of some characteristic using characteristics of the other observations.\n ## ML: the goal\nThe goal is to:\nbuild a machine learning algorithm\n\nthat uses the features as input\n\n\nand predicts a outcome variable\n\n\nin the situation where we do not know the outcome variable." + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#classic-ml", + "href": "content/lectures/14-tidymodels-slides.html#classic-ml", + "title": "14-tidymodels", + "section": "Classic ML", + "text": "Classic ML\nTypically, you use data where you have both the input and output data to train a machine learning algorithm.\nWhat you need:\n\nA data set to train from.\nAn algorithm or set of algorithms you can use to try values of \\(f\\).\nA distance metric \\(d\\) for measuring how close \\(Y\\) is to \\(\\hat{Y}\\).\nA definition of what a “good” distance is." 
+ }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#tidymodels-for-ml", + "href": "content/lectures/14-tidymodels-slides.html#tidymodels-for-ml", + "title": "14-tidymodels", + "section": "tidymodels for ML", + "text": "tidymodels for ML\nHow these packages fit together for carrying out machine learning:" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#tidymodels-steps", + "href": "content/lectures/14-tidymodels-slides.html#tidymodels-steps", + "title": "14-tidymodels", + "section": "tidymodels: steps", + "text": "tidymodels: steps\n ## Worked example: Hotel Stays\n\nGoal: predict which hotel stays included children and/or babies, based on the other characteristics of the stays such as which hotel the guests stay at, how much they pay, etc\n\nImage Source: https://www.tidymodels.org/start/case-study/\nCase Study: https://www.tidymodels.org/start/case-study/" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#model-intro", + "href": "content/lectures/14-tidymodels-slides.html#model-intro", + "title": "14-tidymodels", + "section": "Model Intro", + "text": "Model Intro\n\nglmnet\n`rf``" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#setup", + "href": "content/lectures/14-tidymodels-slides.html#setup", + "title": "14-tidymodels", + "section": "Setup", + "text": "Setup\n\n# install.packages(\"glmnet\")\n# install.packages(\"ranger\")\n# install.packages(\"vip\")\n\nlibrary(tidymodels)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#data", + "href": "content/lectures/14-tidymodels-slides.html#data", + "title": "14-tidymodels", + "section": "Data", + "text": "Data\nHotel Bookings Data from: Antonio, Almeida, and Nunes (2019)\n\nData Dictionary\n\n\n\nhotels <- \n read_csv(\"https://tidymodels.org/start/case-study/hotels.csv\") |>\n mutate(across(where(is.character), as.factor))\n\n\n\n\nglimpse(hotels)\n\nRows: 50,000\nColumns: 23\n$ hotel City_Hotel, City_Hotel, Resort_Hotel, R…\n$ lead_time 217, 2, 95, 143, 136, 67, 47, 56, 80, 6…\n$ stays_in_weekend_nights 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, 1, …\n$ stays_in_week_nights 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, 2, …\n$ adults 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, …\n$ children none, none, none, none, none, none, chi…\n$ meal BB, BB, BB, HB, HB, SC, BB, BB, BB, BB,…\n$ country DEU, PRT, GBR, ROU, PRT, GBR, ESP, ESP,…\n$ market_segment Offline_TA/TO, Direct, Online_TA, Onlin…\n$ distribution_channel TA/TO, Direct, TA/TO, TA/TO, Direct, TA…\n$ is_repeated_guest 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ previous_cancellations 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ previous_bookings_not_canceled 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ reserved_room_type A, D, A, A, F, A, C, B, D, A, A, D, A, …\n$ assigned_room_type A, K, A, A, F, A, C, A, D, A, D, D, A, …\n$ booking_changes 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ deposit_type No_Deposit, No_Deposit, No_Deposit, No_…\n$ days_in_waiting_list 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …\n$ customer_type Transient-Party, Transient, Transient, …\n$ average_daily_rate 80.75, 170.00, 8.00, 81.00, 157.60, 49.…\n$ required_car_parking_spaces none, none, none, none, none, none, non…\n$ total_of_special_requests 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, 0, …\n$ arrival_date 2016-09-01, 2017-08-25, 2016-11-19, 20…\n\n\n\n\nOutcome variable: children (two levels):\n\nhotels %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 4038 0.0808\n2 none 45962 0.919 
\n\n\n\n\nNotes: - Data are imbalanced - There are methods for combating this issue usingrecipes(search for steps toupsampleordownsample) or other more specialized packages likethemis` - For demo purposes, we’ll model as-is" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#data-splitting", + "href": "content/lectures/14-tidymodels-slides.html#data-splitting", + "title": "14-tidymodels", + "section": "Data Splitting", + "text": "Data Splitting\nTypically, data are split into a training and testing datasets\n\n\nTraining | data used to build & tune the model; model “learns” from these data\nTesting | data withheld from training to evaluate model performance (can it generalize?)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#data-splitting-hotels-rsample", + "href": "content/lectures/14-tidymodels-slides.html#data-splitting-hotels-rsample", + "title": "14-tidymodels", + "section": "Data Splitting: Hotels (rsample)", + "text": "Data Splitting: Hotels (rsample)\nReminder: children is pretty imbalanced so we’ll use a stratified random sample (to ensure similar proportion of chilren in training and testing)\n\nset.seed(123)\nsplits <- initial_split(hotels, strata = children)\nsplits\n\n\n<37500/12500/50000>\n\n\n\n\nhotel_other <- training(splits)\nhotel_test <- testing(splits)\n\n\n\n\n# training set proportions by children\nhotel_other %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 3027 0.0807\n2 none 34473 0.919 \n\n\n\n\n\n# test set proportions by children\nhotel_test %>% \n count(children) %>% \n mutate(prop = n/sum(n))\n\n# A tibble: 2 × 3\n children n prop\n \n1 children 1011 0.0809\n2 none 11489 0.919" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#re-sampling", + "href": "content/lectures/14-tidymodels-slides.html#re-sampling", + "title": "14-tidymodels", + "section": "Re-sampling", + "text": "Re-sampling\n\nmultiple approaches (k-fold CV)\nhere: validation set\n\n Image Source: https://www.tidymodels.org/start/case-study/" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#resampling-hotels", + "href": "content/lectures/14-tidymodels-slides.html#resampling-hotels", + "title": "14-tidymodels", + "section": "Resampling: Hotels", + "text": "Resampling: Hotels\n\nTraining: 30,000\nValidation: 7,500\nTesting: 12,500\n\n\n\nset.seed(234)\nval_set <- validation_split(hotel_other, \n strata = children, \n prop = c(0.8))\n\nval_set\n\n# Validation Set Split (0.8/0.2) using stratification \n# A tibble: 1 × 2\n splits id \n \n1 validation" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#build-a-model-recipes", + "href": "content/lectures/14-tidymodels-slides.html#build-a-model-recipes", + "title": "14-tidymodels", + "section": "Build a Model (`recipes``)", + "text": "Build a Model (`recipes``)\nList ingredients | specify which variables we will be using as our outcome and predictors (assign roles)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#model-building-hotels", + "href": "content/lectures/14-tidymodels-slides.html#model-building-hotels", + "title": "14-tidymodels", + "section": "Model building: Hotels", + "text": "Model building: Hotels\nSpecify model (parsnip)\n\nlr_mod <- \n logistic_reg(penalty = tune(), mixture = 1) %>% \n set_engine(\"glmnet\")\n\n\nBuild recipe:\n\n\nstep_date() creates predictors for the year, month, and day of the week.\nstep_holiday() generates a set of indicator variables for specific 
holidays. Although we don’t know where these two hotels are located, we do know that the countries for origin for most stays are based in Europe.\nstep_rm() removes variables; here we’ll use it to remove the original date variable since we no longer want it in the model.\nstep_dummy() converts characters or factors (i.e., nominal variables) into one or more numeric binary model terms for the levels of the original data.\nstep_zv() removes indicator variables that only contain a single unique value (e.g. all zeros). This is important because, for penalized models, the predictors should be centered and scaled.\nstep_normalize() centers and scales numeric variables.\n\n\n\nholidays <- c(\"AllSouls\", \"AshWednesday\", \"ChristmasEve\", \"Easter\", \n \"ChristmasDay\", \"GoodFriday\", \"NewYearsDay\", \"PalmSunday\")\n\nlr_recipe <- \n recipe(children ~ ., data = hotel_other) %>% \n step_date(arrival_date) %>% \n step_holiday(arrival_date, holidays = holidays) %>% \n step_rm(arrival_date) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_predictors())" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#create-workflow-fit-the-model", + "href": "content/lectures/14-tidymodels-slides.html#create-workflow-fit-the-model", + "title": "14-tidymodels", + "section": "Create Workflow: fit the model", + "text": "Create Workflow: fit the model\n\nlr_workflow <- \n workflow() %>% \n add_model(lr_mod) %>% \n add_recipe(lr_recipe)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#tuning-set-up", + "href": "content/lectures/14-tidymodels-slides.html#tuning-set-up", + "title": "14-tidymodels", + "section": "Tuning (set up)", + "text": "Tuning (set up)\n\nlr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))\n\n. . . 
Lowest penalty values:\n\nlr_reg_grid %>% top_n(-5)\n\n# A tibble: 5 × 1\n penalty\n \n1 0.0001 \n2 0.000127\n3 0.000161\n4 0.000204\n5 0.000259\n\n\nHighest penalty values:\n\nlr_reg_grid %>% top_n(5)\n\n# A tibble: 5 × 1\n penalty\n \n1 0.0386\n2 0.0489\n3 0.0621\n4 0.0788\n5 0.1" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#train-and-tune", + "href": "content/lectures/14-tidymodels-slides.html#train-and-tune", + "title": "14-tidymodels", + "section": "Train and Tune", + "text": "Train and Tune\n(Will take a few seconds to run)\n\nlr_res <- \n lr_workflow %>% \n tune_grid(val_set,\n grid = lr_reg_grid,\n control = control_grid(save_pred = TRUE),\n metrics = metric_set(roc_auc))" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#results-penalty", + "href": "content/lectures/14-tidymodels-slides.html#results-penalty", + "title": "14-tidymodels", + "section": "Results: penalty", + "text": "Results: penalty\n\nlr_res %>% \n collect_metrics() %>% \n ggplot(aes(x = penalty, y = mean)) + \n geom_point() + \n geom_line() + \n ylab(\"Area under the ROC Curve\") +\n scale_x_log10(labels = scales::label_number())\n\n\n\n\nModel is generally better at lower penalty values\nMajority of predictors are important to the model\nROC AUC seems like a reasonable metric for assessing model performance" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#best-models", + "href": "content/lectures/14-tidymodels-slides.html#best-models", + "title": "14-tidymodels", + "section": "Best models", + "text": "Best models\n\ntop_models <-\n lr_res %>% \n show_best(\"roc_auc\", n = 15) %>% \n arrange(penalty) \n\n\n\ntop_models\n\n# A tibble: 15 × 7\n penalty .metric .estimator mean n std_err .config \n \n 1 0.000127 roc_auc binary 0.872 1 NA Preprocessor1_Model02\n 2 0.000161 roc_auc binary 0.872 1 NA Preprocessor1_Model03\n 3 0.000204 roc_auc binary 0.873 1 NA Preprocessor1_Model04\n 4 0.000259 roc_auc binary 0.873 1 NA Preprocessor1_Model05\n 5 0.000329 roc_auc binary 0.874 1 NA Preprocessor1_Model06\n 6 0.000418 roc_auc binary 0.874 1 NA Preprocessor1_Model07\n 7 0.000530 roc_auc binary 0.875 1 NA Preprocessor1_Model08\n 8 0.000672 roc_auc binary 0.875 1 NA Preprocessor1_Model09\n 9 0.000853 roc_auc binary 0.876 1 NA Preprocessor1_Model10\n10 0.00108 roc_auc binary 0.876 1 NA Preprocessor1_Model11\n11 0.00137 roc_auc binary 0.876 1 NA Preprocessor1_Model12\n12 0.00174 roc_auc binary 0.876 1 NA Preprocessor1_Model13\n13 0.00221 roc_auc binary 0.876 1 NA Preprocessor1_Model14\n14 0.00281 roc_auc binary 0.875 1 NA Preprocessor1_Model15\n15 0.00356 roc_auc binary 0.873 1 NA Preprocessor1_Model16\n\n\n\n\n\nlikely want to pick a penalty before we see the decrease start (fewest predictors; equally-good performance) - 12th model\n\n\nlr_best <- \n lr_res %>% \n collect_metrics() %>% \n arrange(penalty) %>% \n slice(12)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#check-performance", + "href": "content/lectures/14-tidymodels-slides.html#check-performance", + "title": "14-tidymodels", + "section": "Check Performance", + "text": "Check Performance\n\nlr_auc <- \n lr_res %>% \n collect_predictions(parameters = lr_best) %>% \n roc_curve(children, .pred_children) %>% \n mutate(model = \"Logistic Regression\")\n\nautoplot(lr_auc)\n\n\n\nModel is performing reasonably well…but maybe another model/approach would improve accuracy?\n\n\nThis is where tidymodels shines" + }, + { + "objectID": 
"content/lectures/14-tidymodels-slides.html#random-forest-specify-model", + "href": "content/lectures/14-tidymodels-slides.html#random-forest-specify-model", + "title": "14-tidymodels", + "section": "Random Forest: specify Model", + "text": "Random Forest: specify Model\nDetect possible running in parallel:\n\ncores <- parallel::detectCores()\ncores\n\n[1] 10\n\n\n\n\nrf_mod <- \n rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% \n set_engine(\"ranger\", num.threads = cores) %>% \n set_mode(\"classification\")\n\nrf_mod\n\nRandom Forest Model Specification (classification)\n\nMain Arguments:\n mtry = tune()\n trees = 1000\n min_n = tune()\n\nEngine-Specific Arguments:\n num.threads = cores\n\nComputational engine: ranger" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#build-model-new-recipe", + "href": "content/lectures/14-tidymodels-slides.html#build-model-new-recipe", + "title": "14-tidymodels", + "section": "Build model: new `recipe``", + "text": "Build model: new `recipe``\n\nrf_recipe <- \n recipe(children ~ ., data = hotel_other) %>% \n step_date(arrival_date) %>% \n step_holiday(arrival_date) %>% \n step_rm(arrival_date)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#add-model-to-workflow", + "href": "content/lectures/14-tidymodels-slides.html#add-model-to-workflow", + "title": "14-tidymodels", + "section": "Add model to workflow", + "text": "Add model to workflow\n\nrf_workflow <- \n workflow() %>% \n add_model(rf_mod) %>% \n add_recipe(rf_recipe)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#fit-the-model", + "href": "content/lectures/14-tidymodels-slides.html#fit-the-model", + "title": "14-tidymodels", + "section": "Fit the model", + "text": "Fit the model\n(will take longer to run -it’s trying out 25 different random forests, each w/ 1000 trees)\n\nset.seed(345)\nrf_res <- \n rf_workflow %>% \n tune_grid(val_set,\n grid = 25,\n control = control_grid(save_pred = TRUE),\n metrics = metric_set(roc_auc))" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#results-table", + "href": "content/lectures/14-tidymodels-slides.html#results-table", + "title": "14-tidymodels", + "section": "Results (table)", + "text": "Results (table)\n\nrf_res %>% \n show_best(metric = \"roc_auc\")\n\n# A tibble: 5 × 8\n mtry min_n .metric .estimator mean n std_err .config \n \n1 8 7 roc_auc binary 0.926 1 NA Preprocessor1_Model13\n2 12 7 roc_auc binary 0.926 1 NA Preprocessor1_Model01\n3 13 4 roc_auc binary 0.925 1 NA Preprocessor1_Model05\n4 9 12 roc_auc binary 0.924 1 NA Preprocessor1_Model19\n5 6 18 roc_auc binary 0.924 1 NA Preprocessor1_Model24" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#results-viz", + "href": "content/lectures/14-tidymodels-slides.html#results-viz", + "title": "14-tidymodels", + "section": "Results (viz)", + "text": "Results (viz)\n\nautoplot(rf_res)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#select-best-model", + "href": "content/lectures/14-tidymodels-slides.html#select-best-model", + "title": "14-tidymodels", + "section": "Select (best) model", + "text": "Select (best) model\n\nrf_best <- \n rf_res %>% \n select_best(metric = \"roc_auc\")\n\nrf_best\n\n# A tibble: 1 × 3\n mtry min_n .config \n \n1 8 7 Preprocessor1_Model13" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#make-predictions", + "href": "content/lectures/14-tidymodels-slides.html#make-predictions", + "title": "14-tidymodels", + "section": "Make predictions", + 
"text": "Make predictions\n\nrf_res %>% \n collect_predictions()\n\n# A tibble: 187,500 × 8\n id .pred_children .pred_none .row mtry min_n children .config \n \n 1 validation 0.152 0.848 13 12 7 none Preprocessor…\n 2 validation 0.0302 0.970 20 12 7 none Preprocessor…\n 3 validation 0.513 0.487 22 12 7 children Preprocessor…\n 4 validation 0.0103 0.990 23 12 7 none Preprocessor…\n 5 validation 0.0111 0.989 31 12 7 none Preprocessor…\n 6 validation 0 1 38 12 7 none Preprocessor…\n 7 validation 0 1 39 12 7 none Preprocessor…\n 8 validation 0.00325 0.997 50 12 7 none Preprocessor…\n 9 validation 0.0241 0.976 54 12 7 none Preprocessor…\n10 validation 0.0441 0.956 57 12 7 children Preprocessor…\n# ℹ 187,490 more rows\n\n\n\nrf_auc <- \n rf_res %>% \n collect_predictions(parameters = rf_best) %>% \n roc_curve(children, .pred_children) %>% \n mutate(model = \"Random Forest\")" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#model-performance-comparison", + "href": "content/lectures/14-tidymodels-slides.html#model-performance-comparison", + "title": "14-tidymodels", + "section": "Model Performance Comparison", + "text": "Model Performance Comparison\n\nbind_rows(rf_auc, lr_auc) %>% \n ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + \n geom_path(lwd = 1.5, alpha = 0.8) +\n geom_abline(lty = 3) + \n coord_equal() + \n scale_color_viridis_d(option = \"plasma\", end = .6)" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#model-evaluation-testing-set", + "href": "content/lectures/14-tidymodels-slides.html#model-evaluation-testing-set", + "title": "14-tidymodels", + "section": "Model Evaluation: Testing Set", + "text": "Model Evaluation: Testing Set\n\nlast_rf_mod <- \n rand_forest(mtry = 8, min_n = 7, trees = 1000) %>% \n set_engine(\"ranger\", num.threads = cores, importance = \"impurity\") %>% # will allow us to see what was most important in the model\n set_mode(\"classification\")\n\n# the last workflow\nlast_rf_workflow <- \n rf_workflow %>% \n update_model(last_rf_mod)\n\n# the last fit\nset.seed(345)\nlast_rf_fit <- \n last_rf_workflow %>% \n last_fit(splits)\n\nlast_rf_fit\n\n# Resampling results\n# Manual resampling \n# A tibble: 1 × 6\n splits id .metrics .notes .predictions .workflow \n \n1 train/test sp… " + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#performance-in-test-set", + "href": "content/lectures/14-tidymodels-slides.html#performance-in-test-set", + "title": "14-tidymodels", + "section": "Performance in test set?", + "text": "Performance in test set?\n\nlast_rf_fit %>% \n collect_metrics()\n\n# A tibble: 2 × 4\n .metric .estimator .estimate .config \n \n1 accuracy binary 0.946 Preprocessor1_Model1\n2 roc_auc binary 0.923 Preprocessor1_Model1" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#important-features", + "href": "content/lectures/14-tidymodels-slides.html#important-features", + "title": "14-tidymodels", + "section": "Important features?", + "text": "Important features?\n\nlast_rf_fit %>% \n extract_fit_parsnip() %>% \n vip::vip(num_features = 20)\n\n\n\n\n\n\nlast_rf_fit %>% \n collect_predictions() %>% \n roc_curve(children, .pred_children) %>% \n autoplot()" + }, + { + "objectID": "content/lectures/14-tidymodels-slides.html#our-next-case-study", + "href": "content/lectures/14-tidymodels-slides.html#our-next-case-study", + "title": "14-tidymodels", + "section": "Our next Case study", + "text": "Our next Case study" + }, + { + "objectID": 
"content/lectures/14-tidymodels-slides.html#recap", + "href": "content/lectures/14-tidymodels-slides.html#recap", + "title": "14-tidymodels", + "section": "Recap", + "text": "Recap\n\n\n\nhttps://cogs137.github.io/website/" } ] \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 0169dc1..377b4e8 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -126,7 +126,7 @@ https://cogs137.github.io/website/content/lectures/09-mlr.html - 2023-11-14T21:09:31.488Z + 2023-11-15T22:08:10.592Z https://cogs137.github.io/website/content/lectures/00-welcome-slides.html @@ -146,11 +146,11 @@ https://cogs137.github.io/website/content/lectures/13-cs01-analysis-slides.html - 2023-11-14T21:13:18.349Z + 2023-11-15T22:08:36.202Z https://cogs137.github.io/website/content/lectures/13-cs01-analysis.html - 2023-11-14T21:13:19.643Z + 2023-11-15T22:08:37.435Z https://cogs137.github.io/website/content/lectures/10-projects-slides.html @@ -170,11 +170,11 @@ https://cogs137.github.io/website/content/lectures/08-effective-communication-slides.html - 2023-11-14T20:33:44.675Z + 2023-11-15T14:17:45.927Z https://cogs137.github.io/website/content/lectures/08-effective-communication.html - 2023-11-14T21:09:02.367Z + 2023-11-15T14:17:46.476Z https://cogs137.github.io/website/content/lectures/06-analysis-slides.html @@ -234,6 +234,14 @@ https://cogs137.github.io/website/schedule.html - 2023-11-14T20:34:33.536Z + 2023-11-15T22:08:12.505Z + + + https://cogs137.github.io/website/content/lectures/14-tidymodels.html + 2023-11-16T22:01:35.064Z + + + https://cogs137.github.io/website/content/lectures/14-tidymodels-slides.html + 2023-11-16T22:01:34.259Z diff --git a/schedule.qmd b/schedule.qmd index 9c44cae..0d046cf 100644 --- a/schedule.qmd +++ b/schedule.qmd @@ -1,5 +1,5 @@ | Week | Date | Title | Type | -|--------|----------------|---------------------------------------------------------------|----------------| +|------------|------------|------------------------------------|------------| | **0** | **Th Sep 28** | **Welcome & Tooling** | **Lecture** | | 1 | Tu Oct 3 | Intro to R | Lecture | | 1 | Th Oct 5 | Data Wrangling: `dplyr` | Lecture | @@ -23,16 +23,16 @@ | **6** | **Th Nov 9** | **Case Study 01: THC Biomarkers (day 2)** | **Lecture** | | **6** | **Fri Nov 10** | **Lab 05: Multiple Linear Regression** | **Lab** | | 7 | Tu Nov 14 | Case Study 01: THC Biomarkers (day 3) | Lecture | -| 7 | Th Nov 16 | Logistic Regression | Lecture | +| 7 | Th Nov 16 | `tidymodels` | Lecture | | 7 | Fri Nov 17 | Lab 06: CS01 | Lab | | **8** | **Mon Nov 20** | HW03 due (11:59 PM) | **HW** | | **8** | **Mon Nov 20** | Final Project Proposal Due | **Project** | -| **8** | **Tu Nov 21** | **Case Study 02: Vaping Behaviors (day 1)** | **Lecture** | +| **8** | **Tu Nov 21** | **Case Study 02: Air Pollution (day 1)** | **Lecture** | | **8** | **Th Nov 23** | **No Class (Thanksgiving)** | **\--** | | 9 | Mon Nov 27 | CS01 Due (11:59 PM) | Case Study | -| 9 | Tu Nov 28 | Case Study 02: Vaping Behaviors (day 2) | Lecture | -| 9 | Th Nov 30 | Case Study 02: Vaping Behaviors (day 3) | Lecture | -| 9 | Fri Dec 1 | Lab 07: Logistic Regression & CS02 | Lab | +| 9 | Tu Nov 28 | Case Study 02: Air Pollution (day 2) | Lecture | +| 9 | Th Nov 30 | Case Study 02: Air Pollution (day 3) | Lecture | +| 9 | Fri Dec 1 | Lab 07: CS02 | Lab | | **10** | **Mon Dec 4** | **CS02 Due** | **Case Study** | | **10** | **Tu Dec 5** | **Final Project Brainstorming** | **Lecture** | | **10** | **Th Dec 7** | **Next Steps** | **Lecture** |
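Pulling the pieces together: below is a minimal end-to-end sketch of the random-forest workflow added in the `14-tidymodels` slides above, assembled from the code shown in the new search-index entries. It assumes the hotel-booking objects created on earlier slides not included in this diff (`hotel_other`, `val_set`, `splits`) are in scope, and it swaps the slides' manual `update_model()` step for `finalize_workflow()`, which plugs the selected hyperparameters into the tuned workflow automatically.

```r
library(tidymodels)   # loads parsnip, recipes, workflows, tune, yardstick

# Use all available cores so ranger can grow trees in parallel
cores <- parallel::detectCores()

# Model spec: a classification random forest, with mtry and min_n left to tune
rf_mod <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger", num.threads = cores) %>%
  set_mode("classification")

# Recipe: derive date and holiday features from arrival_date, then drop the raw column
# (assumes `hotel_other` from the earlier data-splitting slides)
rf_recipe <-
  recipe(children ~ ., data = hotel_other) %>%
  step_date(arrival_date) %>%
  step_holiday(arrival_date) %>%
  step_rm(arrival_date)

# Bundle model and recipe into a single workflow
rf_workflow <-
  workflow() %>%
  add_model(rf_mod) %>%
  add_recipe(rf_recipe)

# Evaluate 25 candidate hyperparameter combinations on the validation set,
# saving predictions so ROC curves can be plotted afterwards
set.seed(345)
rf_res <-
  rf_workflow %>%
  tune_grid(val_set,
            grid = 25,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))

# Keep the combination with the best validation ROC AUC
rf_best <- select_best(rf_res, metric = "roc_auc")

# Finalize the workflow with those values, refit on the training data,
# and evaluate once on the held-out test set
set.seed(345)
last_rf_fit <-
  rf_workflow %>%
  finalize_workflow(rf_best) %>%
  last_fit(splits)

collect_metrics(last_rf_fit)
```

This is a consolidation sketch, not the slides' verbatim sequence: the slides additionally pass `importance = "impurity"` to the final engine so `vip::vip()` can rank features afterwards.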