0306_advanced-dplyr.Rmd

# Advanced **dplyr** functions

## `recode()`

`recode()` is useful for recoding categorical variables.

Unlike most of the other function in **dplyr**, `recode()` is backwards in it's syntax:

```
recode(.x, old = new)
```

Lets take a look at recoding different variables using the `psychTools::bfi` dataset:

In the dataset, our `gender` variable has values 1 and 2.

This is a little vague since we don't know what 1 or 2 is in respect to gender.

```{r}
dat_bfi <- psychTools::bfi |> 
  rownames_to_column(var = ".id")

dat_bfi |>
  mutate(
    gender = recode(gender, "1" = "man", "2" = "woman")
  ) |>
  select(.id, gender, education) |>
  head()
```

*Note that for numeric values on the left side of `=`,*
*you need to wrap them in "quotes" or `backticks`;*
*however, that's not necessary for character values*

We can also specify a `.default` value within our `recode()`.

For example, say we want to have just "HS or less" versus "more than HS"

```{r}
dat_bfi |>
  mutate(
    education = recode(education, "1" = "HS", "2" = "HS", .default = "More than HS")
  ) |>
  select(.id, gender, education) |>
  head()
```

Another neat feature of the `recode()` function is the `.missing` value.

If we would rather convert NA values to something more explicit,
we can specify that in the `.missing` argument.

```{r}
dat_bfi |>
  mutate(
    education = recode(
      education, 
      "1" = "HS", 
      "2" = "HS", 
      .default = "More than HS", 
      .missing = "(Unknown)"
    )
  ) |>
  select(.id, gender, education) |>
  head()
```

Or we can use `tidyr::replace_na()` 

```{r}
dat_bfi |>
  mutate(
    education = replace_na(education, replace = "(Unknown)")
  ) |>
  select(.id, gender, education) |>
  head()
```


## `across()`

The `across` function allows us to apply transformations across multiple columns

Say we wanted to look at the mean of each agreeable variable between gender groups:

```{r}
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      mean,
      na.rm = TRUE
    )
  )
```

If we want to put the function name `mean`, togther with all of its arguments, 
we can write it as an **anonymous function**:

```{r}
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      \(x) mean(x, na.rm = TRUE)
    )
  )
```

What if we wanted to include the standard deviation as well? We can pass a `list` of functions into `across()`

```{r}
dat_bfi |>
  group_by(gender) |>
  summarize(
    across(
      A1:A5,
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd = \(x) sd(x, na.rm = TRUE)
      )
    )
  )
```


## Complex `recoding` plus `across()`

Now sometimes with our scales we may encounter variables that are reverse scored.

```{r}
dat_bfi |>
  mutate(
    A1r = recode(
      A1, 
      "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6
    )
  ) |>
  select(A1, A1r) |> 
  head()

# or

dat_bfi |>
  mutate(A1r = max(A1, na.rm = TRUE) - A1 + min(A1, na.rm = TRUE)) |>
  select(A1, A1r) |> 
  head()
```

However, we can implement some more complex code that will reverse `recode()` in one fell swoop!

We start with either specifying our columns that need reverse coding or get it from a data dictionary:

```{r}
reversed <- c("A1", "C4", "C5", "E1", "E2", "O2", "O5")

# or

dict <- psychTools::bfi.dictionary |>
  as_tibble(rownames = "item")

reversed <- dict |>
  filter(Keying == -1) |>
  pull(item)
```

Putting it all together: 

```{r}
dat_bfi |>
  mutate(across(
    all_of(reversed),
    \(x) recode(x, "6" = 1, "5" = 2, "4" = 3, "3" = 4, "2" = 5, "1" = 6),
    .names = "{.col}r"
  )) |>
  head()
```

The `.names` argument tells how to name the new columns. 
If you omit `.names`, the columns will be modified in place. 
In `.names`, the `{.col}` bit means "the column name", 
and any text around that (here the letter `r`) is added to the name.


## `rowwise()`

`rowwise()` is a special `group_by()`. 
It tells R to treat each row of a data frame as its own group.

`rowwise()` is useful for computing summary scores across items for each person.
For example, to compute total scores for each person in the `dat_bfi` data:

```{r}
dat_bfi |>
  rowwise() |> 
  mutate(
    .id = .id,
    A_total = mean(c_across(A1:A5), na.rm = TRUE),
    C_total = mean(c_across(C1:C5), na.rm = TRUE),
    E_total = mean(c_across(E1:E5), na.rm = TRUE),
    N_total = mean(c_across(N1:N5), na.rm = TRUE),
    O_total = mean(c_across(O1:O5), na.rm = TRUE),
    .before = everything()
  ) |>
  head()
```

The `c_across()` function combines `c()` and `across()` into one.
It is like `c()` and creates a vector ala `c(1, 3, 5, 7)`, 
but you can use the same options for selecting column names as `select()`.

The `.before` argument says where to put the new columns you `mutate()`. 

`everything()` means "all the columns have I haven't named yet", 
so `.before = everything()` means put the new columns at the beginning of the data frame.