explore_xml2_and_apple_health_export.Rmd

---
title: "Explore xml2 and Apple Health Export"
output: html_notebook
---

> xml_find_all(xx, ".//ExportDate")
{xml_nodeset (1)}
[1] <ExportDate value="2019-09-15 20:30:18 -0400"/>
> xml_find_all(xx, ".//ExportDate") %>% class()
[1] "xml_nodeset"
> n1 <- xml_find_all(xx, ".//ExportDate")
> n1 <- xml_find_first(xx, ".//ExportDate")
> n1
{xml_node}
<ExportDate value="2019-09-15 20:30:18 -0400">
> xml_path(n1)
[1] "/HealthData/ExportDate"
> xml_attr(n1)
Error in node_attr(x$node, name = attr, missing = default, nsMap = ns) : 
  argument "attr" is missing, with no default
> xml_path(n1, "value")
Error in xml_path(n1, "value") : unused argument ("value")
> xml_attr(n1, "value")
[1] "2019-09-15 20:30:18 -0400"


```{r read_xml}
require(xml2)
# read_xml takes 28 seconds on my MacBook Pro (and 29 seconds on first run on iMac!)
system.time(health_xml <- read_xml("~/Downloads/apple_health_export/export.xml"))

```


```{r}
# xml_find_first takes 22 seconds on my MacBook Pro
system.time(n1 <- xml_find_first(health_xml, "./ExportDate"))
system.time(n1 <- xml_find_first(health_xml, "./HKMetadataKeyTimeZone"))
HKMetadataKeyTimeZone
export_date <- xml_attr(n1, "value")

system.time(node_me <- xml_find_first(health_xml, ".//Me"))
dob <- xml_attr(node_me, "HKCharacteristicTypeIdentifierDateOfBirth")

system.time(yy <- xml_children(health_xml))  # very slow
xml_attr(yy[[7]], "type")
types <- unique(xml_attr(yy, "type"))  # very very slow

node_cycling <- xml_find_first(yy, ".//HKQuantityTypeIdentifierDistanceCycling")

```

> types
 [1] NA                                                 "HKQuantityTypeIdentifierHeight"                  
 [3] "HKQuantityTypeIdentifierBodyMass"                 "HKQuantityTypeIdentifierHeartRate"               
 [5] "HKQuantityTypeIdentifierBloodPressureSystolic"    "HKQuantityTypeIdentifierBloodPressureDiastolic"  
 [7] "HKQuantityTypeIdentifierStepCount"                "HKQuantityTypeIdentifierDistanceWalkingRunning"  
 [9] "HKQuantityTypeIdentifierBasalEnergyBurned"        "HKQuantityTypeIdentifierActiveEnergyBurned"      
[11] "HKQuantityTypeIdentifierFlightsClimbed"           "HKQuantityTypeIdentifierDietaryFatTotal"         
[13] "HKQuantityTypeIdentifierDietaryFatSaturated"      "HKQuantityTypeIdentifierDietaryCholesterol"      
[15] "HKQuantityTypeIdentifierDietarySodium"            "HKQuantityTypeIdentifierDietaryCarbohydrates"    
[17] "HKQuantityTypeIdentifierDietaryFiber"             "HKQuantityTypeIdentifierDietarySugar"            
[19] "HKQuantityTypeIdentifierDietaryEnergyConsumed"    "HKQuantityTypeIdentifierDietaryProtein"          
[21] "HKQuantityTypeIdentifierNumberOfTimesFallen"      "HKQuantityTypeIdentifierAppleExerciseTime"       
[23] "HKQuantityTypeIdentifierDietaryCaffeine"          "HKQuantityTypeIdentifierDistanceCycling"         
[25] "HKQuantityTypeIdentifierRestingHeartRate"         "HKQuantityTypeIdentifierVO2Max"                  
[27] "HKQuantityTypeIdentifierWalkingHeartRateAverage"  "HKCategoryTypeIdentifierSleepAnalysis"           
[29] "HKCategoryTypeIdentifierAppleStandHour"           "HKCategoryTypeIdentifierMindfulSession"          
[31] "HKCorrelationTypeIdentifierBloodPressure"         "HKCorrelationTypeIdentifierFood"                 
[33] "HKQuantityTypeIdentifierHeartRateVariabilitySDNN" "Patient"                                         
[35] "MedicationOrder"                                  "MedicationStatement"                             
[37] "DiagnosticReport"                                 "Observation"                                     
[39] "Condition"                                        "Procedure"                                       
[41] "Immunization"                                     "AllergyIntolerance"

    HKCategoryTypeIdentifierAppleStandHour           HKCategoryTypeIdentifierMindfulSession 
                                           16826                                               31 
           HKCategoryTypeIdentifierSleepAnalysis       HKQuantityTypeIdentifierActiveEnergyBurned 
                                            2794                                          1334739 
       HKQuantityTypeIdentifierAppleExerciseTime        HKQuantityTypeIdentifierBasalEnergyBurned 
                                           43531                                           871744 
  HKQuantityTypeIdentifierBloodPressureDiastolic    HKQuantityTypeIdentifierBloodPressureSystolic 
                                            1686                                             1686 
                HKQuantityTypeIdentifierBodyMass          HKQuantityTypeIdentifierDietaryCaffeine 
                                             122                                                1 
    HKQuantityTypeIdentifierDietaryCarbohydrates       HKQuantityTypeIdentifierDietaryCholesterol 
                                               1                                             1866 
   HKQuantityTypeIdentifierDietaryEnergyConsumed      HKQuantityTypeIdentifierDietaryFatSaturated 
                                            2140                                             2803 
         HKQuantityTypeIdentifierDietaryFatTotal             HKQuantityTypeIdentifierDietaryFiber 
                                            3228                                             1977 
          HKQuantityTypeIdentifierDietaryProtein            HKQuantityTypeIdentifierDietarySodium 
                                            2024                                             1975 
            HKQuantityTypeIdentifierDietarySugar          HKQuantityTypeIdentifierDistanceCycling 
                                            1945                                             1272 
  HKQuantityTypeIdentifierDistanceWalkingRunning           HKQuantityTypeIdentifierFlightsClimbed 
                                          728325                                            21099 
               HKQuantityTypeIdentifierHeartRate HKQuantityTypeIdentifierHeartRateVariabilitySDNN 
                                          594672                                             3365 
                  HKQuantityTypeIdentifierHeight      HKQuantityTypeIdentifierNumberOfTimesFallen 
                                               2                                                1 
        HKQuantityTypeIdentifierRestingHeartRate                HKQuantityTypeIdentifierStepCount 
                                             711                                           175573 
                  HKQuantityTypeIdentifierVO2Max  HKQuantityTypeIdentifierWalkingHeartRateAverage 
                                              86                                              633 

```{r}
#bp1 <- xml_attrs(yy, "HKCorrelationTypeIdentifierBloodPressure")
system.time(bp_systolic <- xml_attrs(yy, "HKQuantityTypeIdentifierBloodPressureSystolic"))
```

```{r}
test <- yy[1:10]
xml_find_first(test, "//Record")
```

```{r}
# from https://rdrr.io/github/deepankardatta/AppleHealthAnalysis/src/R/ah_import_xml.r
system.time(personal_data <- xml_find_all( health_xml , "//Me") %>% xml_attrs() %>% print())
```

```{r get_record_data}
  # this codes taken from deepankardatta
  # Extracts the health records, selects the 'Record' elements
  # And then transforms into a data frame using the 'purrr' library
  system.time(health_data <- xml2::xml_find_all( health_xml , "//Record") %>%
    purrr::map(xml2::xml_attrs) %>%
    purrr::map_df(as.list))
```

takes 212 seconds to use xml2 and purrr to extract Record data frame
and df <- XML:::xmlAttrsToDataFrame(xml["//Record"]) took 169 seconds. Faster, but I guess not by
a huge amount.

View-ed yy to see structure:
xml_attrs(yy[10000:10010][[1]]) gets the record data
xml_child(yy[10000:10010][[1]], 1) metadata
xml_child(yy[10000:10010][[1]], 1) attributes of the metadata

```{r}
system.time(daily_summary <- XML:::xmlAttrsToDataFrame(xml["//ActivitySummary"]) %>% as_tibble())
```

```{r}
names.XMLNode(xml)
```

```{r get_other_stuff}
# based on https://taraskaduk.com/2019/03/23/apple-health/
df_activity <- XML:::xmlAttrsToDataFrame(xml["//ActivitySummary"]) %>% as_tibble()
df_workout <-  XML:::xmlAttrsToDataFrame(xml["//Workout"], stringsAsFactors = FALSE) %>% as_tibble %>% 
  mutate(startDate = as_datetime(str_sub(startDate, 1, 19)),
         endDate = as_datetime(str_sub(endDate, 1, 19)),
         creationDate = as_datetime(str_sub(creationDate, 1, 19)))

  mutate(startDate = a_UTC_vector_to_clock_time(startDate))

df_resting_hr <- df %>% filter(type == "HKQuantityTypeIdentifierRestingHeartRate") %>% 
  mutate(startDate = a_UTC_vector_to_clock_time(startDate),
         endDate = a_UTC_vector_to_clock_time(endDate),
         startHour = hour(startDate), endHour = hour(endDate)) %>% 
  select(sourceVersion, creationDate, startDate, endDate, value, month, year, startHour, endHour)
dups <- df_resting_hr %>%  semi_join(df_resting_hr %>% count(endDate) %>% filter(n > 1) %>% select(endDate))
odd <- df_resting_hr %>%  filter(startHour > 2)


```

```{r}
df2 <- df_resting_hr %>% mutate(date = as_date(endDate))
p <- ggplot(data = df2, aes(y = value, x = date)) +
  geom_point()
```

let's join BP to the min hear rate
```{r join_to_bp}
df3 <- df2 %>% left_join(bp_group3 %>% ungroup() %>% 
  filter(type == "systolic") %>% 
  select(systolic = value,date), by = c("date"))
ggplot(data = df3 %>% filter(!is.na(systolic)), aes(y = value, x = date)) +
  geom_point() +
  scale_x_date(date_breaks = "3 month", date_minor_breaks = "1 month") +
  geom_smooth(span = 0.5)

ggplot(data = df3 %>% filter(!is.na(systolic)), aes(y = value, x = systolic)) +
  geom_point() + geom_smooth(span = 0.5)

```

```{r}
got_xml <- read_xml(got_chars_xml())
```

```{r}
df %>% count(type) %>% arrange(desc(n)) %>% kable()
```

here I will stash raw workout data as it appears in the export file, before and after change in time zone on November 3 2019:

> workout_df %>% arrange(desc(startDate)) %>% select(startDate, totalDistance, duration)
# A tibble: 434 x 3
   startDate                 totalDistance      duration         
   <chr>                     <chr>              <chr>            
 1 2019-10-23 16:23:21 -0400 0.5473660299954509 11.74872166514397
 2 2019-10-22 13:49:38 -0400 3.222100507149329  115.0385558168093
 3 2019-10-22 10:02:35 -0400 4.153307404893073  107.3226322154204
 4 2019-10-21 16:24:51 -0400 0.5657689312069328 12.01207630634308
 5 2019-10-20 14:42:07 -0400 2.289949799346769  56.60284678339958
 
 > workout_df %>% arrange(desc(startDate)) %>% select(startDate, totalDistance, duration)
# A tibble: 440 x 3
   startDate                 totalDistance      duration         
   <chr>                     <chr>              <chr>            
 7 2019-10-23 15:23:21 -0500 0.5473660299954509 11.74872166514397
 8 2019-10-22 12:49:38 -0500 3.222100507149329  115.0385558168093
 9 2019-10-22 09:02:35 -0500 4.153307404893073  107.3226322154204
10 2019-10-21 15:24:51 -0500 0.5657689312069328 12.01207630634308

```{r}
after_dst <- health_df %>% filter(type == "HKQuantityTypeIdentifierHeartRate", str_detect(startDate, "^2019-03-10"))
after_dst_nov <- health_df %>% filter(type == "HKQuantityTypeIdentifierHeartRate", str_detect(startDate, "^2018-11-04"))
after_dst_nov_burn <- health_df %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", str_detect(startDate, "^2018-11-04")) %>% select(creationDate, startDate, endDate,value) %>% arrange(startDate)
during_dst_nov_burn <- health_df_DST %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", str_detect(startDate, "^2018-11-04")) %>% select(creationDate, startDate, endDate, value) %>% arrange(startDate)
after_dst_nov_burn03 <- health_df %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", str_detect(startDate, "^2018-11-03")) %>% select(creationDate, startDate, endDate,value) %>% arrange(startDate)
during_dst_nov_burn03 <- health_df_DST %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", str_detect(startDate, "^2018-11-03")) %>% select(creationDate, startDate, endDate, value) %>% arrange(startDate)
HKQuantityTypeIdentifierActiveEnergyBurned
```

now look at adjusted times

```{r}
after_dst_nov_burn_adj <- health_df %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", as_date(start_date) == ymd("2018-11-04")) %>% select(creationDate, start_date, end_date,value) %>% arrange(start_date)
after_dst_nov_burn03_adj <- health_df %>% filter(type == "HKQuantityTypeIdentifierActiveEnergyBurned", as_date(start_date) == ymd("2018-11-03")) %>% select(creationDate, start_date, end_date,value) %>% arrange(start_date)

during_dst_nov_burn03 %>% filter(startDate > "2018-11-03 23:00")
```

The first seven values for November 4, 2018 as exported during standard time:
> during_dst_nov_burn %>% head(7)
# A tibble: 7 x 4
  creationDate              startDate                 endDate                   value
  <chr>                     <chr>                     <chr>                     <dbl>
1 2018-11-04 00:13:58 -0400 2018-11-04 00:08:03 -0400 2018-11-04 00:08:23 -0400 0.043
2 2018-11-04 00:13:58 -0400 2018-11-04 00:09:14 -0400 2018-11-04 00:09:24 -0400 0.154
3 2018-11-04 00:35:30 -0400 2018-11-04 00:16:54 -0400 2018-11-04 00:17:05 -0400 0.155
4 2018-11-04 00:35:30 -0400 2018-11-04 00:19:48 -0400 2018-11-04 00:20:19 -0400 0.2  
5 2018-11-04 00:55:02 -0400 2018-11-04 00:51:31 -0400 2018-11-04 00:52:33 -0400 0.238
6 2018-11-04 01:05:55 -0400 2018-11-04 00:54:25 -0400 2018-11-04 00:54:56 -0400 0.116
7 2018-11-04 01:13:44 -0400 2018-11-04 00:54:56 -0400 2018-11-04 00:55:27 -0400 0.194

The last seven values for November 3, 2018 as exported during daylight savings time:
> after_dst_nov_burn03 %>% tail(7)
# A tibble: 7 x 4
  creationDate              startDate                 endDate                   value
  <chr>                     <chr>                     <chr>                     <dbl>
1 2018-11-03 23:13:58 -0500 2018-11-03 23:08:03 -0500 2018-11-03 23:08:23 -0500 0.043
2 2018-11-03 23:13:58 -0500 2018-11-03 23:09:14 -0500 2018-11-03 23:09:24 -0500 0.154
3 2018-11-03 23:35:30 -0500 2018-11-03 23:16:54 -0500 2018-11-03 23:17:05 -0500 0.155
4 2018-11-03 23:35:30 -0500 2018-11-03 23:19:48 -0500 2018-11-03 23:20:19 -0500 0.2  
5 2018-11-03 23:55:02 -0500 2018-11-03 23:51:31 -0500 2018-11-03 23:52:33 -0500 0.238
6 2018-11-04 00:05:55 -0500 2018-11-03 23:54:25 -0500 2018-11-03 23:54:56 -0500 0.116
7 2018-11-04 00:13:44 -0500 2018-11-03 23:54:56 -0500 2018-11-03 23:55:27 -0500 0.194

Depending on when you do the export, these values will be included with Novermber 3rd or
November 4th. The second version is the time I experienced when the values
occurred during the last hour of November 3rd (the day before daylight savings ended).

In general, if one does the export during daylight savings, the datetime stamp
will be "correct" for values that occurred during daylight savings time, but one
hour off for values that occurred during standard time. And if one does the export
during standard time, the reverse will be true. The datetime stamp will
be "correct" for values recorded during standard time, but one hour off for
those recorded during daylight savings.

and now for activity:

```{r}
activity_df_DST %>% filter(st)
```