Snippets {#snippets}
====================================
Reading External Data {#snippets-reading}
------------------------------------
### Reading from Excel {#snippets-reading-excel}
*Background*: Avoid Excel for the [reasons previously discussed](#data-containers-avoid). But if there isn't another good option, be protective. [`readxl::read_excel()`](https://readxl.tidyverse.org/reference/read_excel.html) allows you to specify column types, but not column order; the names of `col_types` are ignored by `readxl::read_excel()`. To defend against roaming columns (*e.g.*, files whose layout changed over time), `testit::assert()` that the order is what you expect.
See the readxl vignette, [Cell and Column Types](https://readxl.tidyverse.org/articles/cell-and-column-types.html), for more info.
*Last Modified*: 2019-12-12 by Will
```r
# ---- declare-globals ---------------------------------------------------------
config <- config::get()
# cat(sprintf(' `%s` = "text",\n', colnames(ds)), sep="") # 'text' by default --then change where appropriate.
col_types <- c(
`Med Rec Num` = "text",
`Admit Date` = "date",
`Tot Cash Pymt` = "numeric"
)
# ---- load-data ---------------------------------------------------------------
ds <- readxl::read_excel(
path = config$path_admission_charge,
col_types = col_types
# sheet = "dont-use-sheets-if-possible"
)
testit::assert(
"The order of column names must match the expected list.",
names(col_types) == colnames(ds)
)
# Alternatively, this provides more detailed error messages than `testit::assert()`
# testthat::expect_equal(
# colnames(d),
# names(col_types),
# label = "worksheet's column name (x)",
# expected.label = "col_types' name (y)"
# )
```
### Removing Trailing Comma from Header {#snippets-reading-trailing-comma}
*Background*: Occasionally a Meditech extract will have an extra comma at the end of the first line. For each subsequent line, [`readr::read_csv()`](https://readr.tidyverse.org/reference/read_delim.html) appropriately warns that a column is missing. This flood of warnings can mask real problems.
*Explanation*: This snippet (a) reads the csv as plain text, (b) removes the final comma, and (c) passes the plain text to `readr::read_csv()` to convert it into a data.frame.
*Instruction*: Modify `Dx50 Name` to the name of the final (real) column.
*Real Example*: [truong-pharmacist-transition-1](https://github.com/OuhscBbmc/truong-pharmacist-transition-1/blob/eec6d7eb8aaa9e3df52dafb826dbc53aaf515c63/manipulation/ellis/dx-ellis.R#L158-L162) (Accessible only to CDW members.)
*Last Modified*: 2019-12-12 by Will
```r
# The next two lines remove the trailing comma at the end of the 1st line.
raw_text <- readr::read_file(path_in)
raw_text <- sub("^(.+Dx50 Name),", "\\1", raw_text)
ds <- readr::read_csv(raw_text, col_types = col_types)
```
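Note that since readr 2.0, a literal string must be wrapped in `I()` so `readr::read_csv()` treats it as data rather than a file path. A minimal sketch of the same repair under that assumption (`path_in` and `col_types` are defined as in the surrounding ellis):

```r
# Read the file as one string, drop the trailing comma after the last
# real column ("Dx50 Name"), and parse the repaired text.  `I()` marks
# the string as literal data (required by readr >= 2.0).
raw_text <- readr::read_file(path_in)
raw_text <- sub("^(.+Dx50 Name),", "\\1", raw_text)
ds <- readr::read_csv(I(raw_text), col_types = col_types)
```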
### Reading Large Files with vroom {#snippets-reading-vroom}
*Background*: When incoming data files are too large to read comfortably with [readr](https://readr.tidyverse.org/), we use [vroom](https://vroom.r-lib.org/). The two packages are developed by the same group and [might be combined](https://github.com/tidyverse/tidyverse.org/pull/375#issuecomment-564781603) in the future.
*Explanation*: This snippet defines the `col_types` list with names to mimic [our approach](https://ouhscbbmc.github.io/data-science-practices-1/prototype-r.html#chunk-declare) of using readr. There are some small differences with our readr approach:
1. `col_types` is a [list](https://stat.ethz.ch/R-manual/R-devel/library/base/html/list.html) instead of a [`readr::cols_only`](https://readr.tidyverse.org/reference/cols.html) object.
1. The call to [`vroom::vroom()`](https://vroom.r-lib.org/reference/vroom.html) passes `col_names = names(col_types)` explicitly.
1. If the data file contains columns we don't need, we define them in `col_types` anyway; vroom needs to know the file structure if it's missing a header row.
*Real Example*: [akande-medically-complex-1](https://github.com/OuhscBbmc/akande-medically-complex-1/tree/main/manipulation/ohca) (Accessible only to CDW members.) These files did not have a header row of variable names; the first line of the file is the first data row.
*Last Modified*: 2020-08-21 by Will
```r
# ---- declare-globals ---------------------------------------------------------
config <- config::get()
col_types <- list(
sak = vroom::col_integer(), # "system-assigned key"
aid_category_id = vroom::col_character(),
age = vroom::col_integer(),
service_date_first = vroom::col_date("%m/%d/%Y"),
  service_date_last     = vroom::col_date("%m/%d/%Y"),
claim_type = vroom::col_character(),
provider_id = vroom::col_character(),
provider_lat = vroom::col_double(),
provider_long = vroom::col_double(),
provider_zip = vroom::col_character(),
cpt = vroom::col_integer(),
revenue_code = vroom::col_integer(),
icd_code = vroom::col_character(),
icd_sequence = vroom::col_integer(),
vocabulary_coarse_id = vroom::col_integer()
)
# ---- load-data ---------------------------------------------------------------
ds <- vroom::vroom(
file = config$path_ohca_patient,
delim = "\t",
col_names = names(col_types),
col_types = col_types
)
rm(col_types)
```
Row Operations {#snippets-row}
------------------------------------
We frequently need to find the mean or sum across columns (within a row).
### Finding the mean across many columns {#snippets-row-mean}
Here are several approaches for finding the mean across columns, without naming each column. Some remarks:
* `m1` & `m2` are sanity checks for this example.
`m1` would be clumsy if you have 10+ items.
`m2` is discouraged because it's brittle.
A change in the column order could alter the calculation.
We prefer to use `grep()` to specify a sequence of items.
* Especially for large datasets,
I’d lean towards `m3` if the items are reasonably complete and
`m4` if some participants are missing enough items that their summary score is fishy.
In the approaches below, `m4` and `m6` return the mean only if the participant completed 2 or more items.
* `dplyr::rowwise()` is convenient, but slow for large datasets.
* If you need a more complex function that’s too clumsy to include directly in a `mutate()` statement,
see how the calculation for `m6` is delegated to the external function, `f6`.
* The technique behind `nonmissing` is pretty cool,
because you can apply an arbitrary function on each cell before they’re summed/averaged.
* This is in contrast to `f6()`, which applies to an entire (row-wise) data.frame.
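As a sketch of the `grep()` approach mentioned above (the `item_` column names here are hypothetical; in practice `ds` comes from your own ellis):

```r
# Hypothetical dataset with a run of similarly-named item columns.
ds <- data.frame(item_1 = 1:2, item_2 = 3:4, other = 5:6)

# Select every column whose name matches the pattern, in their current order.
# `value = TRUE` returns the names themselves instead of their positions.
columns_to_average <- grep("^item_\\d+$", colnames(ds), value = TRUE)
columns_to_average   # "item_1" "item_2"
```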
```r
# Isolate the columns to average. Remember the `grep()` approach w/ `colnames()`
columns_to_average <- c("hp", "drat", "wt")
f6 <- function(x) {
# browser()
s <- sum(x, na.rm = TRUE)
n <- sum(!is.na(x))
dplyr::if_else(
2L <= n,
s / n,
NA_real_
)
}
mtcars |>
dplyr::mutate(
m1 = (hp + drat + wt) / 3,
m2 =
rowMeans(
dplyr::across(hp:wt), # All columns between hp & wt.
na.rm = TRUE
),
m3 =
rowMeans(
dplyr::across(!!columns_to_average),
na.rm = TRUE
),
s4 = # Finding the sum (used by m4)
rowSums(
dplyr::across(!!columns_to_average),
na.rm = TRUE
),
nonmissing =
rowSums(
dplyr::across(
!!columns_to_average,
.fns = \(x) { !is.na(x) }
)
),
m4 =
dplyr::if_else(
2 <= nonmissing,
s4 / nonmissing,
NA_real_
)
) |>
dplyr::rowwise() |> # Required for `m5`
dplyr::mutate(
m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),
) |>
dplyr::ungroup() |> # Clean up after rowwise()
dplyr::rowwise() |> # Required for `m6`
dplyr::mutate(
m6 = f6(dplyr::across(!!columns_to_average))
) |>
dplyr::ungroup() |> # Clean up after rowwise()
dplyr::select(
hp,
drat,
wt,
m1,
m2,
m3,
s4,
nonmissing,
m4,
m5,
m6,
)
```
Grooming {#snippets-grooming}
------------------------------------
### Correct for misinterpreted two-digit year {#snippets-grooming-two-year}
*Background*: Sometimes Meditech dates are specified like `1/6/54` instead of `1/6/1954`. `readr::read_csv()` has to decide whether the year should be 1954 or 2054. A human can use context to guess that a birth date is in the past (so 1954), but readr can't (so it guesses 2054). To avoid this and other problems, request dates in an [ISO-8601](https://www.explainxkcd.com/wiki/index.php/1179:_ISO_8601) format.
*Explanation*: Correct for this in a `dplyr::mutate()` clause: compare the date value against today. If the date is today or earlier, keep it; if it is in the future, subtract 100 years.
*Instruction*: For dates that should be in the future, such as loan payments, flip the direction of the comparison.
*Last Modified*: 2019-12-12 by Will
```r
ds |>
dplyr::mutate(
dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))
)
```
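A self-contained illustration with made-up dates (the second row stands in for a `1/6/54` birth date that readr misread as 2054):

```r
ds <- tibble::tibble(
  dob = as.Date(c("1954-01-06", "2054-01-06"))
)

ds |>
  dplyr::mutate(
    # Past dates are kept; the spurious future date loses a century.
    dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))
  )
# Both rows are now 1954-01-06.
```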
Identification {#snippets-identification}
------------------------------------
### Generating "tags" {#snippets-identification-tags}
*Background*: Sometimes you need to generate unique identification values for future people/clients/patients, as described in the [style guide](#style-number).
*Explanation*: This snippet creates a 5-row csv with random 7-character "tags" to send to the research team collecting patients.
*Instruction*: Set `pt_count`, `tag_length`, `path_out`, and execute. Add and rename the columns to be more appropriate for your domain (*e.g.*, change "patient tag" to "store tag").
*Last Modified*: 2019-12-30 by Will
```r
pt_count <- 5L # The number of rows in the dataset.
tag_length <- 7L # The number of characters in each tag.
path_out <- "data-private/derived/pt-pool.csv"
draw_tag <- function(tag_length = 4L, urn = c(0:9, letters)) {
  paste(sample(urn, size = tag_length, replace = TRUE), collapse = "")
}
}
ds_pt_pool <-
tibble::tibble(
pt_index = seq_len(pt_count),
pt_tag = vapply(rep(tag_length, pt_count), draw_tag, character(1)),
assigned = FALSE,
name_last = "--",
name_first = "--"
)
readr::write_csv(ds_pt_pool, path_out)
```
The resulting dataset will look like this, but with different randomly-generated tags.
```
# A tibble: 5 x 5
pt_index pt_tag assigned name_last name_first
<int> <chr> <lgl> <chr> <chr>
1 1 seikyfr FALSE -- --
2 2 voiix4l FALSE -- --
3 3 wosn4w2 FALSE -- --
4 4 jl0dg84 FALSE -- --
5 5 r5ei5ph FALSE -- --
```
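Because each tag is drawn independently, collisions are possible (though unlikely at these lengths). A quick defensive check, assuming `ds_pt_pool` from the snippet above:

```r
# Stop if any two patients drew the same tag; rerun the snippet if this fires.
testit::assert(
  "Every patient tag must be unique.",
  !any(duplicated(ds_pt_pool$pt_tag))
)
```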
Correspondence with Collaborators {#snippets-correspondence}
------------------------------------
### Excel files {#snippets-correspondence-excel}
Receiving and storing [Excel files should almost always be avoided](#data-containers-avoid) for the reasons explained in this letter.
We receive extracts as Excel files frequently, and keep the following request ready to email to anyone sending us Excel files. Adapt the bolded values, like "109.19", to your situation. If you are familiar with their tools, suggest an alternative way to save the file as a csv. Once presented with these Excel gotchas, almost everyone has an 'aha' moment and recognizes the problem. Unfortunately, not everyone has flexible software and can adapt easily.
---
[Start of the letter]
Sorry to be tedious, but could you please resend the extract as a [csv](https://en.wikipedia.org/wiki/Comma-separated_values) file? Please call me if you have questions.
Excel is being too helpful with some of the values, and essentially corrupting them. For example, values like **109.19** are interpreted as a number, not a character code (*e.g.*, see cell **L14**). Because of [limitations of finite precision](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), this becomes **109.18999999999999773**. We can't simply round it, because other values in this column cannot be cast to numbers, such as **V55.0**. Furthermore, the "E" in some codes is incorrectly interpreted as the exponent operator (*e.g.*, "4E5" is converted to 400,000).
Finally, values like [001.0](http://www.icd9data.com/2015/Volume1/001-139/001-009/001/001.0.htm) are being converted to a number and any leading or trailing zeros are dropped (so cells like "1" are not distinguishable from "001.0").
Unfortunately the problems exist in the Excel file itself. When we import the columns [as text](https://readxl.tidyverse.org/reference/read_excel.html), the values are already in their corrupted state.
Please compress/zip the csv if the file is too large to email. We've found that an Excel file is typically 5-10 times larger than a compressed csv.
As much as Excel interferes with our medical variables, we're lucky: it has messed with other branches of science much worse. Genomics researchers used it far too long [before they realized their mistakes](https://qz.com/768334/years-of-genomics-research-is-riddled-with-errors-thanks-to-a-bunch-of-botched-excel-spreadsheets/).
> What happened? By default, Excel and other popular spreadsheet applications convert some gene symbols to dates and numbers. For example, instead of writing out “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers have dubbed the gene MARCH1. Excel converts this into a date—03/01/2016, say—because that’s probably what the majority of spreadsheet users mean when they type it into a cell. Similarly, gene identifiers like “2310009E13” are converted to exponential numbers (2.31E+19). In both cases, the conversions strip out valuable information about the genes in question.
[End of the letter]
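The precision problem described in the letter can be reproduced directly in R, since R and Excel share the same IEEE-754 double representation:

```r
# 109.19 cannot be represented exactly as a double; printing enough
# digits exposes the stored value the letter complains about.
sprintf("%.17f", 109.19)   # "109.18999999999999773"

# Codes containing an "E" are misread as scientific notation when
# coerced to a number.
as.numeric("4E5")          # 400000
```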