---
title: 'Apple Health Export Part II: Intra-Day Measures'
author: "John Goldin"
date: "3/3/2020"
output: 
  html_document:
    keep_md: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r libs, echo=FALSE, warning=FALSE}
library(tidyverse, quietly = TRUE)
library(lubridate, quietly = TRUE)
library(janitor, quietly = TRUE)
library(kableExtra, quietly = TRUE)
```
```{r include_functions, echo = FALSE}
path_saved_export <- "~/Dropbox/Programming/R_Stuff/john_vitals/Apple-Health-Data/"
path_to_healthexport1 <- "~/Documents/R_local_repos/applehealth1/R/"
source(paste0(path_to_healthexport1, "find_timezone.R"))
# source(paste0(path_to_healthexport1, "tripit_functions.R"))
```
## R Markdown

This post is Part II of a dive into the contents of the Apple Health Export.
[Part I](https://www.johngoldin.com/post/2020-02-15-apple-health-export1/) covered how to export the data from the Health app and import it into
R tables. It also described in detail how to adjust time stamps for daylight
saving time and travel outside the local time zone.

Based on Part I, we have your data in the data frame `health_df`. 
The Health app emphasizes daily summaries of your data such as resting heart
rate, total steps, and the items that are covered by the "rings" that you see
in the Activity app. We will look at some summaries in Part III, but in this
post the emphasis will be on the intra-day details that are recorded in Apple
HealthKit almost moment by moment. In the previous post I marvelled at how I had over
four million rows of data in `health_df`. What is all that stuff and what
is it good for?

I am going to restrict this post to the data generated by Watch OS
5.1.1 or later. Before OS 5, data was recorded much less frequently. When I first 
looked at the OS 5 data I noticed a few anomalies in the earliest versions.
Things seemed to settle down by 5.1.1, which in my case started in November 2018.
That's where I'll start with this analysis.
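
One caveat when filtering on a version number stored as text: a plain string comparison works character by character, so a two-digit major version such as "10.0" would sort before "5.1.1". Here is a minimal sketch of the difference, using made-up version strings rather than values from my export. Base R's `numeric_version()` compares each component as a number.

```r
# Toy version strings (made up for illustration, not taken from the export)
versions <- c("4.3", "5.1.1", "5.3.2", "6.1", "10.0")

# Plain string comparison: "10.0" is treated as smaller than "5.1.1"
as_string <- versions >= "5.1.1"

# numeric_version() compares major, minor, and patch as integers
as_version <- numeric_version(versions) >= "5.1.1"

data.frame(versions, as_string, as_version)
```

For the versions in my data (5.x and 6.x) the two approaches agree, so this only matters once the major version reaches two digits.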

```{r import_health_df, echo = FALSE}
  load(paste0(path_saved_export, "save_processed_export.RData"))
```
```{r intra_day_report, echo = FALSE, cache = TRUE}
intra <- health_df %>%
  filter(sourceVersion >= "5.1.1", str_detect(sourceName, "Watch")) %>% 
  arrange(type, utc_start) 
intra %>%
  janitor::tabyl(type) %>% arrange(desc(n)) %>% 
  janitor::adorn_totals("row") %>% # do column total after arrange
  kable(format.args = list(decimal.mark = ".", big.mark = ","),
        table.attr='class="myTable"', label = "basic_counts",
        caption = "Frequency of Watch Data by Type", format = "markdown", digits = 3)
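# Round the final rate; piping into round() would only round the denominator
# because %>% binds more tightly than /.
per_minute <- round(nrow(intra) / as.numeric(difftime(now(), as_datetime("2018-11-18 22:06:21"), units = "mins")), 1)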
  
```

This table shows the counts of items that I regard as the intra-day watch items.
That works out to about `r per_minute` rows per minute since November 2018. Wow!
That seems like a high density of
data. The top four types account for about 97% of the rows.
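
A figure like "the top four account for X%" can be read off a cumulative share column. The sketch below shows the arithmetic with invented counts. The type names echo the ones used later in this post, but the numbers are placeholders and not my actual totals.

```r
library(dplyr)

# Invented counts, for illustration only
toy_counts <- tibble(
  type = c("Active_Energy", "Basal_Energy", "Heart_Rate",
           "Walking_Watch", "Steps_Watch", "Climb_Watch"),
  n = c(500, 300, 100, 60, 30, 10)
)

shares <- toy_counts %>%
  arrange(desc(n)) %>%
  mutate(share = n / sum(n),                 # each type's share of all rows
         cumulative = cumsum(n) / sum(n))    # running share from the top down

top_four_share <- shares$cumulative[4]
```

With these placeholder counts, `top_four_share` is 0.96.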

The point of this post will be to look in detail at these items.
Spoiler alert: I doubt that the thousands of intra-day rows for
basal or active energy are of much use, and we'll see why. Heart
rate over the course of the day has more potential and we'll 
see some examples later in this post.

### Workouts 

I have explored how frequently data rows are added, especially
basal energy, active energy, heart rate, and distance.
Workouts have a huge effect on frequency of measurements. I
did some crude experiments. Almost every day I walk the same 3.1-mile
route in the woods. I compared a small sample of occasions doing
that route after declaring it to be a workout with a set of
occasions when I walked the same route but did not declare it as
a workout (and responded "dismiss" when my watch repeatedly
suggested it as a workout). I also tried sitting at my desk
while declaring that I was doing an indoor walking workout.

The effect is striking. Workouts produce many more measurements,
especially for active and basal energy, distance walked, and for heart rate.

```{r workout_experiments, echo = FALSE}
load(paste0(path_saved_export, "intra.RData"))
no_workout <- intra4 %>% 
  filter(local_date %in% as_date(c("2020-02-19", "2020-02-21", "2020-02-23", "2020-02-28")),
         ((local_start >= as_datetime("2020-02-19 15:02:00")) & (local_start <= as_datetime("2020-02-19 16:04:00"))) |
         ((local_start >= as_datetime("2020-02-21 13:15:00"))  & (local_start <= as_datetime("2020-02-21 14:20:00"))) |
         ((local_start >= as_datetime("2020-02-28 13:31:00"))  & (local_start <= as_datetime("2020-02-28 14:38:00"))) |
         ((local_start >= as_datetime("2020-02-23 14:15:00"))  & (local_start <= as_datetime("2020-02-23 15:20:00"))))
feb_workouts <- workout_df %>% filter(year(local_start) == 2020, month(local_start) == 2)
yes_workout <- intra4 %>% 
  filter(local_date %in% as_date(c("2020-02-14", "2020-02-16", "2020-02-05", "2020-02-09")),
         ((local_start >= as_datetime("2020-02-14 09:55:29")) & (local_start <= as_datetime("2020-02-14 11:03:06"))) |
         ((local_start >= as_datetime("2020-02-16 10:40:22"))  & (local_start <= as_datetime("2020-02-16 11:51:46"))) |
         ((local_start >= as_datetime("2020-02-05 14:23:00"))  & (local_start <= as_datetime("2020-02-05 15:30:00"))) |
         ((local_start >= as_datetime("2020-02-09 14:18:00"))  & (local_start <= as_datetime("2020-02-09 15:28:00"))))
workout_sit <- intra4 %>% 
  filter(local_date %in% as_date(c("2020-02-29")),
         ((local_start >= as_datetime("2020-02-29 07:51:00")) & (local_start <= as_datetime("2020-02-29 08:58:00"))))
library(hms) # library() fails loudly if hms is missing; require() only warns
in_office <- intra4 %>% 
  filter(local_date %in% as_date(c("2020-02-19", "2020-02-24", "2020-02-10", "2020-02-12")),
        start_time >= as_hms("10:40:00"), start_time <= as_hms("11:50:00"))
yes_workout$workout <- "Workout"
no_workout$workout <- "Not Workout"
in_office$workout <- "In Office"
workout_sit$workout <- "Workout Sit"
work <- bind_rows(yes_workout, no_workout, in_office, workout_sit) %>% 
  filter(type2 != "Exercise_Watch") %>% 
  mutate(type2 = factor(type2, levels = c("Active_Energy", "Basal_Energy", "Heart_Rate",
                                        "Steps_Watch", "Walking_Watch", "Climb_Watch")),
         activity = case_when(workout %in% c("Workout", "Not Workout") ~ "Walking",
                       workout %in% c("Workout Sit", "In Office") ~ "Sitting"),
         workout = factor(workout, levels = c("Workout", "Workout Sit", "Not Workout", "In Office"),
                          labels = c("Walking\nWorkout", "Sitting\nWorkout", "Walk, No\nWorkout", "Sitting No\nWorkout")))


raw_table <- work %>% 
  group_by(type2, workout, activity, local_date) %>% 
  summarise(total = sum(value), n = n(), interval = median(interval, na.rm = TRUE), span = median(span, na.rm = TRUE)) %>% 
  mutate(total = case_when(
    type2 == "Heart_Rate" ~ total/n,
    TRUE ~ total
  ))

for_plot <- raw_table %>% group_by(type2, workout, activity) %>% 
  summarise(amount = mean(total), obs = mean(n), interval =mean(interval, na.rm = TRUE), span = mean(span, na.rm = TRUE)) %>% 
  mutate(amount = round(amount), obs = round(obs))

#https://stackoverflow.com/questions/30179442/plotting-minor-breaks-on-a-log-scale-with-ggplot/33179099#33179099
log10_minor_break = function (n_intervals = 9, ...){
  function(x) {
    minx         = floor(min(log10(x), na.rm = TRUE)) - 1
    maxx         = ceiling(max(log10(x), na.rm = TRUE)) + 1
    n_major      = maxx - minx + 1
    major_breaks = seq(minx, maxx, by=1)
    minor_breaks = 
      rep(log10(seq(1, n_intervals, by=1)), times = n_major)+
      rep(major_breaks, each = n_intervals)
    return(10^(minor_breaks))
  }
}
pmedian <- ggplot(data = for_plot %>% filter(type2 != "Climb_Watch", type2 != "Steps_Watch"), aes(y = interval, x =fct_rev(workout) )) + 
  facet_wrap(~ type2) + geom_col(aes(fill = fct_rev(activity))) +
  scale_y_log10(breaks = c(6, 60, 600),minor_breaks = log10_minor_break(6)) +
  # scale_y_log10(breaks = c(5, 60, 6000),minor_breaks = minor_breaks_n(5)) + 
  ylab("median interval between observations (log scale, in seconds)") + xlab('"workout" and activity')+
  theme(legend.position="top") +
  guides(fill=guide_legend(title="Activity: ")) +
  geom_text(aes(fill = NULL, label = round(interval)), hjust = 1) + 
  coord_flip() +
  ggtitle("How Does Interval Between Observations\nRelate to Declared Workout and Amount of Activity")
```
```{r display_workout_table, echo = FALSE}
print(pmedian)
```

In this plot, "workout" means that I have told my watch that
I am doing a workout. Regardless of my physical activity, this
causes a large increase in how frequently data is recorded
on the watch (indicated by a shorter interval between
rows of data). If I am sitting working at my desk (the bottom
bar), active energy is recorded about once a minute (67 seconds
on the "Sitting No Workout" bar) and heart rate is recorded
about every three minutes. But if I have declared a workout,
active energy is recorded every 3 seconds and heart rate is
recorded every 5 or 6 seconds. If I'm actually walking without
declaring a workout, the interval between measurements is shorter
than when I am sitting, although not as short as when a workout is
explicitly declared. 
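
The `interval` plotted above is the gap between one row and the next row of the same type. As a rough sketch of how such a gap can be computed (the toy timestamps are made up, and this is not necessarily how the `interval` column in my saved data was built), one can sort within each type, subtract the previous start time with `lag()`, and take the median:

```r
library(dplyr)

# Toy timestamps (made up): heart rate every 5 seconds, active energy every 60 seconds
toy <- tibble(
  type2 = rep(c("Heart_Rate", "Active_Energy"), times = c(5, 4)),
  local_start = as.POSIXct("2020-02-14 10:00:00", tz = "UTC") +
    c(0, 5, 10, 15, 20, 0, 60, 120, 180)
)

gaps <- toy %>%
  group_by(type2) %>%
  arrange(local_start, .by_group = TRUE) %>%
  # seconds since the previous row of the same type; NA for the first row
  mutate(interval = as.numeric(difftime(local_start, lag(local_start), units = "secs"))) %>%
  summarise(n = n(), median_interval = median(interval, na.rm = TRUE))
```

The median is a sensible summary here because a single long pause (taking the watch off, for example) would otherwise dominate the mean.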

Although it's not shown on the plot, the walk involved
about 6,600 steps while a little over an hour sitting
at my desk produced an average of about 400 steps
(sometimes I pace while I work).
While walking, my heart rate was about 96 beats per minute,
compared with 66 when I was working at my desk.
The walk is a
moderately challenging hike with hilly terrain.
The increase in movement during the hike causes the
watch to record data more frequently, with shorter
intervals between measurements, although not as 
frequently as during a declared workout.

The effect of a declared workout on how frequently step count and
flights climbed are recorded is not as great, so I have
not included those on the plot.


The large effect of a declared workout on frequency of observations
explains how I have racked up so many rows in
my watch data. My exercise of choice is hiking and my idea of
a good vacation is a walking holiday. Last summer I did two
eight- or nine-day walks in England averaging about 12 miles 
per day. And I did quite a few 6 to 12 mile training walks
as preparation. I declared workouts for all those walks and
generated a gigantic amount of data. (Note that my watch battery
held up fine doing all-day workouts.) Even my daily walk of a
bit over one hour generates a lot of data over the course of a year.

In fact, the table below shows that only about 38%
of all the observations in my dataset occur outside
the context of a workout.

```{r workout_table}
health_df %>% 
  dplyr::mutate(workoutActivityType = forcats::fct_lump(workoutActivityType),
                  workoutActivityType = forcats::fct_explicit_na(workoutActivityType, na_level = "Not a Workout")) %>% 
  janitor::tabyl(workoutActivityType) %>% 
  janitor::adorn_totals("row") %>% janitor::adorn_pct_formatting(digits = 0) %>% 
  knitr::kable(format = "markdown", align = c("lrr"),
               caption = "Workouts and Volume of Rows",
               format.args = list(big.mark = ",")) 
```


Let's look at particular items.

Let's start with basal energy. This is a peculiar item. It's
not really a measurement. Basal energy is the number of calories
the body consumes to maintain its internal operations even when
absolutely no physical activity is involved. Most of what I know
about basal metabolism comes from Wikipedia.
The Apple Watch has no way to measure basal metabolism directly.
Surely it relies on a common estimate based on age, gender,
height, and weight. This estimate doesn't change during
the day. I actually put my weight into the Health
app every day (via the Lose It! app) so there's a slight
change from day to day. Each day I get a day older. But I
haven't changed my height or my gender.
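
To make "a common estimate" concrete, here is a sketch of one widely used formula, the Mifflin-St Jeor equation. I don't know which formula the watch actually uses, so treat this purely as an illustration of the kind of calculation involved. The function name and the inputs are made up for the example.

```r
# Mifflin-St Jeor estimate of resting energy in kcal per day.
# Assumption: one common published formula, not necessarily what the watch uses.
mifflin_st_jeor <- function(weight_kg, height_cm, age_years, sex = c("male", "female")) {
  sex <- match.arg(sex)
  offset <- if (sex == "male") 5 else -161
  10 * weight_kg + 6.25 * height_cm - 5 * age_years + offset
}

# Made-up inputs, for illustration only
kcal_per_day <- mifflin_st_jeor(weight_kg = 70, height_cm = 175, age_years = 30, sex = "male")

# Spread evenly over the 1,440 minutes in a day
kcal_per_minute <- kcal_per_day / (24 * 60)
```

With these inputs the estimate is 1,648.75 kcal per day, a little over 1 kcal per minute. In this formula a one kilogram change in weight moves the daily figure by only 10 kcal, which fits the idea that the estimate barely changes from day to day.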

Given that basal energy doesn't really change, it's a bit
odd that there are over 800,000 rows of data. The Watch OS
seems to mostly aim to have a row for basal energy for
each row of active energy.

[note to self, where do the extra rows of active energy come from?]

If we zoom in on a detailed subset of basal energy data, we
can see some patterns that should make us nervous about
how to interpret it.