---
title: "Predicting drug mode of action: data preprocessing"
author: "Florian Huber, Leonard Dubois"
date: "`r Sys.Date()`"
output:
  html_document:
    fig_caption: yes
    fig_height: 6
    fig_width: 6
    number_sections: yes
    toc: yes
    toc_depth: 2
    toc_float: yes
editor_options: 
  chunk_output_type: console
---

# Setup, library loading

```{r, message = FALSE}
source("./setup.R")

# directory for automatically generated data:
outdir <- "./data/programmatic_output"
indir <- "/Volumes/typas/Florian/dbsetup_tables_new"
```


# Loading and preprocessing

The aim of this section is to generate a **"matrix_container"** which contains 
everything needed to perform a repeated instanced nested cross-validation. 
Different rows can be compared to understand how algorithms or other parameters 
that are not data-driven differ in performance. 


## Reading in and joining of tables

```{r, data_import, cache = TRUE}
(nichols_2011 <- read_delim(file.path(indir, "nichols_2011.csv"), delim = ";"))
(newsize <- read_delim(file.path(indir, "newsize.csv"), delim = ";"))
(shiver <- read_delim(file.path(indir, "shiver.csv"), delim = ";"))
(price <- read_delim(file.path(indir, "price.csv"), delim = ";"))
(genes <- read_delim(file.path(indir, "genes.csv"), delim = ";"))
(drugs_complete <- read_delim(file.path(indir, "drugs.csv"), delim = ";"))

my_datasets <- c("nichols_2011", "newsize", "shiver", "price")
all_studies <- mget(my_datasets) %>% 
  map(~mutate(.x, significant = qvalue <= 0.05))
```

Interestingly, the fraction of significant values varies greatly between the 
data sets:

```{r}
map(all_studies, ~ mean(.x$significant, na.rm = TRUE))
```
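
The fractions above rely on `mean()` of a logical vector, with `NA` q-values 
dropped rather than counted as non-significant. A minimal sketch of that 
behaviour, using made-up q-values (the `toy_*` objects are assumptions for 
illustration and not part of the data):

```{r}
# toy q-values, invented for illustration only
toy_qvalue <- c(0.01, 0.20, NA, 0.04, 0.50)
toy_significant <- toy_qvalue <= 0.05   # the NA q-value stays NA

# NA is ignored: 2 significant out of 4 non-missing values
(toy_fraction <- mean(toy_significant, na.rm = TRUE))
```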

Get MoA information. Load from Google Drive, which contains the latest 
annotation. 

```{r}
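# get mode of action information from Google Drive; if that fails, fall back 
# to the copy on the server
outpath <- file.path(outdir, "drug_moa_gdrive.csv")
tryCatch(
  {
    gfiles <- drive_find(pattern = "drug_moa")
    drive_download(file = gfiles[1, ], path = outpath, overwrite = TRUE)
  }, 
  error = function(e) {
    # assumption: the server copy has the same comma-separated layout as the 
    # Google Drive export, so the read_delim() call below can read it
    warning("Google Drive download failed, using server copy: ", 
      conditionMessage(e))
    file.copy(file.path(indir, "mode_of_action.csv"), outpath, overwrite = TRUE)
  }
)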

(mode_of_action <- read_delim(outpath, delim = ",", na = c("-", "")))
mode_of_action <- mode_of_action[, c("drug", "moa_broad", "moa_subgroup", 
  "dataset")]
colnames(mode_of_action) <- c("drugname_typaslab", "process_broad", 
  "process_subgroup", "dataset")
mode_of_action <- mode_of_action[complete.cases(mode_of_action), ]
write_csv(mode_of_action, file.path(outdir, "drug_moa_userfriendly.csv"))

stopifnot(all(!duplicated(mode_of_action$drugname_typaslab)))
```

Write out `drugs_complete` with MoA information.

```{r}
(drugs_complete <- left_join(drugs_complete, mode_of_action))
saveRDS(drugs_complete, file = file.path(outdir, "drugs_complete.rds"))
```


# Data cleanup, further preprocessing

## Data inspection

For each data set, find out which mutants (= features) have many NA values, 
and do the same for the drugs. The purpose is to exclude mutants that are very 
prone to giving NA values and conditions that do not interact with any mutant.  

```{r}
# record for each gene number of significant interactions, NA values
feature_stats <- map(all_studies, function(x) {
  group_by(x, gene_synonym) %>%
    summarise(n_signif = sum(significant, na.rm = TRUE), 
      n_NA_sscores = sum(is.na(sscore)), 
      unique_conds = length(unique(paste(drugname_typaslab, conc)))) %>%
    arrange(desc(n_NA_sscores))
})
# add names
feature_stats <- imap(feature_stats, ~{.x$dataset = .y; return(.x)})
feature_stats$nichols_2011
feature_stats$newsize

# and now the same for drugs: drug-dosage combinations that don't interact with 
# anything will be difficult to predict
drug_stats <- map(all_studies, function(x) {
  group_by(x, drugname_typaslab, conc) %>%
    summarise(n_signif = sum(significant, na.rm = TRUE), 
      n_NA_sscores = sum(is.na(sscore))) %>%
    arrange(n_signif)
})
drug_stats <- imap(drug_stats, ~{.x$dataset = .y; return(.x)})

drug_stats$nichols_2011
drug_stats$newsize
```

Number of NA values across all conditions per mutant. Why are there so many in 
the shiver data? Why is the distribution so different between newsize and 
nichols?

```{r}
feature_stats %>% 
  bind_rows() %>%
  ggplot(aes(x = dataset, y = n_NA_sscores)) +
  geom_jitter(shape = 1, height = 0) + 
  annotate("text", x = 1:4, y = 40, label = paste("N = ", c(95, 275, 26, 28))) + 
  labs(title = "Number of NA values per mutant", 
    y = "Number of NA values per mutant across all conditions")
```

Number of NA values across mutants per condition. 

```{r}
drug_stats %>%
  bind_rows() %>%
  ggplot(aes(x = dataset, y = n_NA_sscores)) + 
  geom_jitter(shape = 1, height = 0) + 
  labs(title = "Number of NA values per condition", 
    y = "Number of NA values per condition across all mutants")
```

There are some outliers in Newsize. They will be excluded. Let's see which 
drugs have at least one dosage with, say, more than 40 NA values. 

```{r}
drug_stats[c("newsize", "nichols_2011")] %>%
  bind_rows() %>%
  group_by(dataset, drugname_typaslab) %>%
  arrange(desc(n_NA_sscores)) %>%
  filter(any(n_NA_sscores > 40)) %>%
  arrange(dataset, drugname_typaslab) %>% 
  print(n = 50)
```

I would argue for excluding the outliers in Newsize because something clearly 
went wrong there. In the Nichols set there are some drugs that just tend to 
have more NA values across dosages, but an order of magnitude fewer than the 
Newsize drugs. 

Statistics of how many mutants have 0, 1, 2, etc. number of interactions.

```{r}
feature_stats %>%
  bind_rows() %>%
  group_by(dataset) %>%
  count(n_signif) %>%
  ggplot(aes(x = n_signif, y = n)) + 
  geom_bar(stat = "identity") + 
  facet_wrap( ~ dataset, scales = "free") + 
  labs(title = "How many mutants have how many significant interactions?", 
    x = "Number of significant interactions", y = "Number of mutants")
```

Do the same for drugs. It looks like price and shiver data sets behave very 
differently from nichols and newsize. 

```{r}
drug_stats %>%
  bind_rows() %>%
  group_by(dataset) %>%
  count(n_signif) %>%
  ggplot(aes(x = n_signif)) + 
  geom_freqpoly(binwidth = 10) + 
  facet_wrap( ~ dataset) + 
  labs(title = "How many conditions have how many significant interactions?", 
    x = "Number of significant interactions", y = "Number of conditions")
```


## QC: removing drugs/features

Inspecting the data suggests the following:

  * Only work with Nichols and Newsize because Price and Shiver seem to behave 
  quite differently. Moreover, they add few drugs that are not already present 
  in Nichols and Newsize. 
  
Therefore, for Nichols and Newsize:

  * Remove mutants (features) that have more than 10 NA values.
  * Remove mutants that have no significant interaction with any drug.
  * Remove conditions in Newsize that apparently did not work (> 500 NA values).
  * Remove conditions with fewer than 5 significant interactions.

```{r}
(feats_to_exclude_NAs <- 
  map(feature_stats[c("nichols_2011", "newsize")], ~filter(.x, n_NA_sscores > 10)))

(feats_no_IAs <- 
  map(feature_stats[c("nichols_2011", "newsize")], ~filter(.x, n_signif == 0)))

(conds_to_exclude <- 
  map(drug_stats[c("nichols_2011", "newsize")], ~filter(.x, n_signif < 5 | n_NA_sscores > 500)))
```

To remove features, run the following procedure: 

  1. Features with many NAs: if a feature fulfills this criterion in any data set, remove it everywhere.
  2. Features with no interactions: only remove them if they show no significant interaction in either data set. 
  3. Finally: only keep features that are in both Nichols and Newsize. 

To remove conditions: just exclude them in the respective dataset, no special 
procedure here. 

Features that don't interact with anything in either Nichols or Newsize.

```{r}
venn(list(nichols = feats_no_IAs$nichols_2011$gene_synonym, 
  newsize = feats_no_IAs$newsize$gene_synonym))
```

Write out Nichols and Newsize data without any changes.

```{r}
kept_studies <- all_studies[c("nichols_2011", "newsize")]
saveRDS(kept_studies, file.path(outdir, "kept_datasets_long_nochanges.rds"))
saveRDS(conds_to_exclude, file.path(outdir, "removed_conditions_quality_ctrl.rds"))
```

Remove conditions and features. Note that further down there is another round 
of condition removal because of low concentrations relative to the MIC. 

```{r}
# excluding conditions
kept_studies <- imap(kept_studies, function(data, name) {
  # remove conditions with few interactions/lots of NAs from respective dataset
  to_remove <- conds_to_exclude[[name]]
  data <- data[!(paste(data$drugname_typaslab, data$conc) %in% 
      paste(to_remove$drugname_typaslab, to_remove$conc)), ]
  
  # remove features with lots of NAs from all datasets
  to_remove <- bind_rows(feats_to_exclude_NAs[c("nichols_2011", "newsize")])$gene_synonym
  data <- data[!(data$gene_synonym %in% to_remove), ]
  
  # remove features never interacting with anything in any dataset
  to_remove <- reduce(map(feats_no_IAs[c("nichols_2011", "newsize")], "gene_synonym"), dplyr::intersect)
  data <- data[!(data$gene_synonym %in% to_remove), ]
  return(data)
})

# Only keep features that are both in nichols and newsize. 

# before:
venn(list(nichols = unique(kept_studies$nichols_2011$gene_synonym), 
  newsize = unique(kept_studies$newsize$gene_synonym)))

# after:
kept_studies <- map(kept_studies, function(data) {
  total_intersect <- reduce(map(kept_studies, "gene_synonym"), dplyr::intersect)
  data <- data[data$gene_synonym %in% total_intersect, ]
})

venn(list(nichols = unique(kept_studies$nichols_2011$gene_synonym), 
  newsize = unique(kept_studies$newsize$gene_synonym)))
```

Save Nichols and Newsize data to disk after removing features and conditions.

```{r}
saveRDS(kept_studies, file.path(outdir, 
  "kept_datasets_long_featsndrugs_removed.rds"))
```

## Correlation between Nichols and Newsize

How well do conditions correlate between the two datasets? 

```{r}
# for each condition that is the same: calculate correlation coefficient
(nic <- select(kept_studies$nichols_2011, gene_synonym, drugname_typaslab, 
  conc, sscore))
(new <- select(kept_studies$newsize, gene_synonym, drugname_typaslab, conc, 
  sscore))

cors <- 
  semi_join(nic, new, by = c("gene_synonym", "drugname_typaslab", "conc")) %>%
  left_join(new, by = c("gene_synonym", "drugname_typaslab", "conc"), suffix = 
      c(".nic", ".new"))

# e.g.: A22: 
filter(cors, drugname_typaslab == "A22", conc == 2) %>%
  ggplot(aes(x = sscore.nic, y = sscore.new)) + 
  geom_point() + 
  labs(title = "A22 (conc = 2), correlation of s-scores", x = "s-scores Nichols", 
    y = "s-scores Newsize")

# spearman corrs across all conditions:
group_by(cors, drugname_typaslab, conc) %>%
  summarise(spearman = cor(sscore.nic, sscore.new, method = "spearman", 
    use = "complete.obs")) %>%
  ggplot(aes(x = "", y = spearman)) + 
  geom_text_repel(aes(label = paste0(drugname_typaslab, "_", conc)), 
    size = 3) +
  labs(title = "Spearman correlation of conditions shown across all genes", 
    x = "", y = "Spearman's rho")
```

## Spread data, replace NA values

Convert to wide format. `NA` values are replaced by 0 for s-scores and `FALSE` 
for significance values (qvalues). Save the output. 

```{r}
# lose calcofluor due to NA concentration
kept_studies_wide <- unlist(
  map(kept_studies, function(data) {
    data <- filter(data, !is.na(conc))
    sscores <- select(data, gene_synonym, drugname_typaslab, conc, sscore) %>%
      spread(key = gene_synonym, value = sscore, fill = 0)
    qvals <- select(data, gene_synonym, drugname_typaslab, conc, significant) %>%
      spread(key = gene_synonym, value = significant, fill = FALSE)
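    # pass both checks as separate arguments: wrapped in `{}` only the last 
    # expression would be tested
    stopifnot(
      all(!is.na(sscores)), 
      all(!is.na(qvals))
    )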
    return(list(sscores = sscores, qvals = qvals))
  }), 
  recursive = FALSE)

saveRDS(kept_studies_wide, file.path(outdir, 
  "kept_datasets_wide_featsndrugs_removed_noNAs.rds"))
```
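
To illustrate what `fill` does during spreading, here is a small sketch on an 
invented long-format table (the `toy_*` names and values are assumptions for 
illustration). Both explicit `NA` values and combinations that are absent from 
the long table end up as 0:

```{r}
library(tidyr)

# toy long-format table (values invented); geneB has an explicit NA at conc 1 
# and was not measured at all at conc 2
toy_long <- data.frame(
  drugname_typaslab = c("drugX", "drugX", "drugX"), 
  conc = c(1, 1, 2), 
  gene_synonym = c("geneA", "geneB", "geneA"), 
  sscore = c(-2.5, NA, 1.0), 
  stringsAsFactors = FALSE
)

# fill = 0 replaces explicit NAs as well as missing combinations
(toy_wide <- spread(toy_long, key = gene_synonym, value = sscore, fill = 0))
```

Note that after this step a true s-score of 0 can no longer be told apart from 
a missing measurement.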


## Merging correlated features

Features that are highly correlated should be merged. 

```{r}
the_matrix <- kept_studies_wide$nichols_2011.sscores
feats <- select(the_matrix, -one_of("drugname_typaslab", "conc"))
feats_cor <- cor(feats, method = "spearman")

# quite a large and unwieldy plot - but still shows a few useful clusters:
if (!file.exists("./plots/Correlation_all_by_all.png")) {
  png("./plots/Correlation_all_by_all.png", width = 90, height = 90, 
    units = "cm", res = 500)
  corrplot::corrplot(feats_cor, method = "color", order = "hclust", 
    tl.cex = 0.1)
  dev.off()
}

# get all possible combinations and indicate if they are in the same complex:
if (!file.exists(file.path(outdir, "corpairs.rds"))) {
  corpairs <- suppressMessages(generate_corpairs())
  saveRDS(corpairs, file.path(outdir, "corpairs.rds"))
} else {
  corpairs <- readRDS(file.path(outdir, "corpairs.rds"))
}

head(corpairs)

ggplot(corpairs, aes(x = cor)) + 
  geom_freqpoly(binwidth = 0.05) + 
  geom_vline(xintercept = c(0.6, 0.7), linetype = "dotted") + 
  facet_wrap( ~ in_same_cmplx, scales = "free") + 
  labs(title = "Spearman correlations of all mutant pairs.
    Left: not part of the same protein complex
    Right: within the same protein complex", 
    x = "Spearman's rho")

# three selected examples (two above the cutoffs, one just below 0.6):
ggplot(feats, aes(x = NUOE, y = NUOF)) + 
  geom_point() + 
  labs(title = "S-scores across all conditions (Spearman = 0.84)")

ggplot(feats, aes(x = ATPB, y = ATPE)) + 
  geom_point() + 
  labs(title = "S-scores across all conditions (Spearman = 0.75)")

ggplot(feats, aes(x = FLIF, y = FLIQ)) + 
  geom_point() + 
  labs(title = "S-scores across all conditions (Spearman = 0.58)")
```

When choosing a cutoff of 0.6 and requiring that genes are in the same complex 
we get `r sum(corpairs$in_same_cmplx & corpairs$cor > 0.6)` pairs. 
A lot of the genes are part of the NADH:quinone oxidoreductase 
complex (nuoX genes). Other prominent complexes are:

  * the terminal oxidase complex (cyoX)
  * sulfate adenylyltransferase (cysD, cysN)
  * acrAB-tolC 
  * ATP synthase (atpX)
  * the Tol-Pal system (Pal, TolB, TolR, TolQ)
  * TatB-TatC (part of the twin arginine translocation (Tat) complex for the 
  export of folded proteins)
  * some flagellum-related genes (FlhA, FliF, FliQ)

```{r}
# View(corpairs[corpairs$in_same_cmplx & corpairs$cor > 0.6, ])
sum(corpairs$in_same_cmplx & corpairs$cor > 0.6)
sum(corpairs$cor > 0.6)
```

If we don't require them to be in the same complex we get 
`r sum(corpairs$cor > 0.6)` pairs with, among others:

* ribonucleotide biosynthesis (pyrX, purX)
* LPS biosynthesis: rfaX

For details check:

```{r}
# View(corpairs[corpairs$cor > 0.6, ])
sum(corpairs$cor > 0.6)
```

Anticorrelation is a lot rarer: only `r sum(corpairs$cor < -0.6)` pairs have a 
Spearman correlation of less than -0.6. 

Notably, Pur and Pyr genes are anticorrelated with RsxB (together with rseC 
turns off SoxR-mediated induction of SoxS), YrbA (= ibaG, defends against acid 
stress), and UbiE/F (ubiquinone biosynthesis). Moreover, GuaB (IMP 
dehydrogenase) is anticorrelated with LipA (lipoate biosynthesis) and Rph 
(an RNAse). 

Now merge correlated predictors. Use a slightly higher threshold (0.7) because 
otherwise too many features would be merged. Iterative averaging is not an 
option because `mean(c(mean(c(a, b)), c)) != mean(c(a, b, c))`. Instead, do 
correlation-based hierarchical clustering of all genes which have at least one 
correlation above the threshold. Then based on this result determine clusters 
and calculate centroids of the clusters. 
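
A minimal sketch of this reasoning on simulated data (all `toy_*` objects are 
invented for illustration; the real analysis uses `feats_cor` and `corpairs`). 
It first shows that pairwise averaging depends on the merge order, then 
clusters correlated features and computes one centroid per cluster:

```{r}
# iterative pairwise averaging does not give the overall mean ...
toy_a <- 1; toy_b <- 2; toy_c <- 6
mean(c(mean(c(toy_a, toy_b)), toy_c))   # 3.75
mean(c(toy_a, toy_b, toy_c))            # 3

# ... so cluster first, then average all members of a cluster at once
set.seed(1)
toy_base <- rnorm(20)
toy_feats <- data.frame(
  g1 = toy_base, 
  g2 = toy_base + rnorm(20, sd = 0.1),   # strongly correlated with g1
  g3 = rnorm(20)                         # unrelated feature
)
toy_cor <- cor(toy_feats, method = "spearman")
toy_h <- hclust(as.dist(1 - toy_cor))
(toy_clusters <- cutree(toy_h, k = 2))

# centroid = row-wise mean over the members of each cluster
toy_centroids <- sapply(split(names(toy_clusters), toy_clusters), 
  function(members) rowMeans(toy_feats[, members, drop = FALSE]))
dim(toy_centroids)
```

Because all members of a cluster are averaged in a single step, the result 
does not depend on the order in which features were joined in the dendrogram.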

Genes passing the threshold: 

```{r}
cor_thresh <- 0.7
(high_cor_genes <- unique(unlist(corpairs[corpairs$cor > cor_thresh, 
  c("featA", "featB")])))
```

So `r length(high_cor_genes)` are affected. 

Do hierarchical clustering. 

```{r}
feats_cor_high <- feats_cor[high_cor_genes, high_cor_genes]
# since we're only merging positively correlated features
feats_cor_high_d <- as.dist(1 - feats_cor_high)

(h <- hclust(feats_cor_high_d))
(d <- as.dendrogram(h))
# this is the order as in the dendrogram:
h$labels[order.dendrogram(d)] # order() does not dispatch - why?
# easier:
dend_labels <- labels(d)

nclust <- 23
(clusters <- cutree(h, k = nclust))

min_col <- plasma(2)[1]
max_col <- plasma(2)[2]

# before splitting into nclust clusters:
p <- Heatmap(feats_cor_high, 
  col = colorRamp2(breaks = c(-1, 1), colors = c(min_col, max_col)), 
  name = "Spearman's rho", 
  row_dend_side = "right", 
  column_names_side = "top", 
  row_names_gp = gpar(fontsize = 6), 
  column_names_gp = gpar(fontsize = 6), 
  cluster_columns = d, 
  cluster_rows = d, 
  row_order = dend_labels, 
  column_order = dend_labels)
p

pdf("./plots/Correlated_feats.pdf", width = 14, height = 14)
p
dev.off()

# simply running the command above and providing an argument to split will 
# mess up the order of the clusters
# therefore, need to turn off column clustering and manually pass the order 
# of the labels, which are contained in dend_labels
# but somehow I got it right and saved it wrong and now I don't know what to do 
# anymore - fix later, does not really matter
p <- Heatmap(feats_cor_high, 
  col = colorRamp2(breaks = c(-1, 1), colors = c(min_col, max_col)), 
  name = "Spearman's rho", 
  row_dend_side = "right", 
  column_names_side = "top", 
  row_names_gp = gpar(fontsize = 6), 
  column_names_gp = gpar(fontsize = 6), 
  cluster_columns = FALSE, 
  cluster_rows = d, 
  row_order = dend_labels, 
  column_order = rev(dend_labels), 
  split = nclust)
p

pdf("./plots/Correlated_feats_splitup.pdf", width = 14, height = 14)
p
dev.off()
```

[Link](./plots/Correlated_feats.pdf). 
[Link](./plots/Correlated_feats_splitup.pdf). 

Looks like 23 clusters is a good number to capture similar groups. 

```{r}
clusters[order(clusters)]
clusters <- tibble(clust_id = clusters, members = names(clusters)) %>%
  group_by(clust_id) %>%
  nest(.key = members) %>%
  mutate(members = map(members, flatten_chr))

clusters$clust_label <- sapply(clusters$members, function(x) {
  ifelse(length(x) < 3, 
    paste(x, collapse = "_"), 
    paste0(x[1], "_and_", length(x) - 1, "_more"))
})
clusters
saveRDS(object = clusters, file = file.path(outdir, "clusters.rds"))
```

Now merge the features by calculating the centroids in case of sscores. For 
qvalues set to `TRUE` if at least one of the members is `TRUE`. 

```{r}
# merge features in clusters frame using function f of merge_features(), 
# then remove features that have been merged 
tmp <- c("nichols_2011.sscores", "newsize.sscores")
kept_studies_wide[tmp] <- 
  map(kept_studies_wide[tmp], merge_features, cluster_fr = clusters, f = mean)

tmp <- c("nichols_2011.qvals", "newsize.qvals")
kept_studies_wide[tmp] <- 
  map(kept_studies_wide[tmp], merge_features, cluster_fr = clusters, f = any)
```
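
The two merge rules can be illustrated on a toy example. `toy_merge()` below is 
a hypothetical stand-in written for this sketch; the actual `merge_features()` 
is defined elsewhere and may work differently. It collapses the member columns 
of one cluster with a function `f` and drops the originals:

```{r}
# hypothetical stand-in for merge_features(), for illustration only
toy_merge <- function(m, members, label, f) {
  merged <- apply(m[, members, drop = FALSE], 1, f)   # row-wise summary
  m <- m[, setdiff(colnames(m), members), drop = FALSE]
  m[[label]] <- merged
  m
}

toy_ss <- data.frame(geneA = c(1, -2), geneB = c(3, 0), geneC = c(0.5, 0.5))
toy_qv <- data.frame(geneA = c(TRUE, FALSE), geneB = c(FALSE, FALSE), 
  geneC = c(FALSE, TRUE))

# s-scores: centroid of the cluster members
(toy_ss_merged <- toy_merge(toy_ss, c("geneA", "geneB"), "geneA_geneB", mean))
# significance: TRUE if at least one member is TRUE
(toy_qv_merged <- toy_merge(toy_qv, c("geneA", "geneB"), "geneA_geneB", any))
```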


Plot the correlation matrix again, let's check:

```{r}
the_matrix <- kept_studies_wide$nichols_2011.sscores
feats <- select(the_matrix, -one_of("drugname_typaslab", "conc"))
feats_cor <- cor(feats, method = "spearman")

# quite a large and unwieldy plot - but still shows a few useful clusters:
if (!file.exists("./plots/Correlation_all_by_all_aftermerge.png")) {
  png("./plots/Correlation_all_by_all_aftermerge.png", width = 90, height = 90, 
    units = "cm", res = 500)
  corrplot::corrplot(feats_cor, method = "color", order = "hclust", 
    tl.cex = 0.1)
  dev.off()
}
```

## Dosage selection

```{r}
matrices <- kept_studies_wide[c("nichols_2011.sscores", "newsize.sscores")]

if (!file.exists(file.path(outdir, "MICs.csv"))) {
  source("./parse_MIC_data.R")
} else {
  (mics <- read_delim("./data/programmatic_output/MICs.csv", delim = ";"))
}

# drugs that are not in the MIC table
map(matrices, ~ setdiff(.x$drugname_typaslab, mics$drugname_typaslab))
```

Build a table to check the dosages in the dataset. Express dosages as 
fractions of the MIC and indicate if E. coli is usually resistant to the drug. 

```{r}
dosg_check <- 
  map(matrices, function(.x) {
    .x <- left_join(select(.x, drugname_typaslab, conc), mics) %>% 
    mutate(frac_of_mic = conc / mic_curated) %>% 
    group_by(drugname_typaslab) %>% 
    mutate(n_conc = n()) %>%
    ungroup() %>%
    select(drugname_typaslab, conc, mic_curated, frac_of_mic, resistant, n_conc)
    return(.x)
  })

dosg_check$nichols_2011.sscores <- 
  left_join(dosg_check$nichols_2011.sscores, drug_stats$nichols_2011, 
    by = c("drugname_typaslab", "conc"))

dosg_check$newsize.sscores <- 
  left_join(dosg_check$newsize.sscores, drug_stats$newsize, 
    by = c("drugname_typaslab", "conc"))

# example output
head(dosg_check$nichols_2011.sscores)

saveRDS(object = dosg_check, file = "./data/programmatic_output/dosg_check.rds")
```

How often is the concentration (allegedly) higher than the MIC? 

```{r}
# how often is the actual concentration higher than the MIC?
map(dosg_check, function(.x) {
  group_by(.x, drugname_typaslab) %>%
  filter(any(frac_of_mic > 1))
})
```

We will exclude conditions that were measured at a concentration less than 5% 
of the MIC (keep everything that is NA). This means we lose the following drugs 
completely: 

```{r}
removal_by_dosg <- map(dosg_check, function(.x) {
  filter(.x, frac_of_mic < 0.05) %>%
    arrange(drugname_typaslab)
})

map(removal_by_dosg, ~ filter(group_by(.x, drugname_typaslab), n_conc == n()))
```
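
Conditions without a curated MIC are kept because a comparison with `NA` 
evaluates to `NA`, and `filter()` only returns rows where the condition is 
`TRUE`. A base R sketch of the same logic on an invented dosage table (the 
`toy_*` objects and values are assumptions for illustration):

```{r}
# toy dosage table (values invented); drugZ has no curated MIC
toy_dosg <- data.frame(
  drugname_typaslab = c("drugX", "drugX", "drugY", "drugZ"), 
  conc = c(0.1, 5, 0.2, 1), 
  mic_curated = c(10, 10, 8, NA), 
  stringsAsFactors = FALSE
)
toy_dosg$frac_of_mic <- toy_dosg$conc / toy_dosg$mic_curated

# flag conditions below 5% of the MIC; an NA fraction is never flagged
toy_flag <- !is.na(toy_dosg$frac_of_mic) & toy_dosg$frac_of_mic < 0.05
toy_remove <- toy_dosg[toy_flag, ]
toy_keep <- toy_dosg[!toy_flag, ]

# drugs lost completely: none of their concentrations is kept
(toy_lost <- setdiff(toy_remove$drugname_typaslab, toy_keep$drugname_typaslab))
```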

The following conditions will be removed: 

```{r}
print(removal_by_dosg$nichols_2011.sscores, n = 100)
print(removal_by_dosg$newsize.sscores, n = 100)

saveRDS(removal_by_dosg, file = file.path(outdir, "removal_by_dosg.rds"))

tmp <- c("nichols_2011.sscores", "nichols_2011.qvals")
kept_studies_wide[tmp] <- map(kept_studies_wide[tmp], function(.x) {
  to_remove <- paste(removal_by_dosg$nichols_2011.sscores$drugname_typaslab, 
    removal_by_dosg$nichols_2011.sscores$conc)
  .x <- .x[!(paste(.x$drugname_typaslab, .x$conc) %in% to_remove), ]
  return(.x)
})

tmp <- c("newsize.sscores", "newsize.qvals")
kept_studies_wide[tmp] <- map(kept_studies_wide[tmp], function(.x) {
  to_remove <- paste(removal_by_dosg$newsize.sscores$drugname_typaslab, 
    removal_by_dosg$newsize.sscores$conc)
  .x <- .x[!(paste(.x$drugname_typaslab, .x$conc) %in% to_remove), ]
  return(.x)
})

saveRDS(kept_studies_wide, file = file.path(outdir, 
  "kept_datasets_wide_featsndrugs_removed_noNAs_featsmerged.rds"))
```

Could be useful to save the respective matrices without features but with 
MoA information: 

```{r}
tmp <- kept_studies_wide
tmp$nichols_2011.qvals <- NULL
tmp$newsize.qvals <- NULL

tmp <- map(tmp, function(.x) {
  .x <- select(.x, drugname_typaslab, conc) %>%
    left_join(mode_of_action[, c("drugname_typaslab", "process_broad", "process_subgroup")]) %>% 
    arrange(drugname_typaslab, conc)
  return(.x)
})

saveRDS(tmp, file = file.path(outdir, "matrices_MoA_info.rds"))
```


# Splitting data into different sets

We will split our chemical genetics data into **five datasets**:

  * From Nichols: A **training set** for algorithm and setup selection, model tuning. Drugs here have one of the 4 main MoAs and chemical features are present. 
  * From both Nichols and Newsize: a **control set** containing drugs (actually: conditions) that belong to one of the 4 main MoAs but one of the following was the case: (i) chemical features were missing, (ii) the conditions were not, in fact, drugs (e.g. UV), (iii) drugs from Newsize that were already in the training set (since we can't use them as test set drugs). 
  * From Nichols: an **other MoA set** containing drugs that were not in the 4 main MoAs.
  * Newsize only: A **test set**: these are all drugs from Newsize belonging to one of the 4 main MoAs. 
  * Both Nichols and Newsize: An **unknown MoA set**: these are drugs from both Nichols and Newsize whose MoA is poorly understood in bacteria. 
  
Retrieve info on MoA, physicochemical features. 

```{r}
# function to add MoA information
add_moa_info <- function(m) {
  left_join(m, mode_of_action[, c("drugname_typaslab", "process_broad")]) %>% 
    select(drugname_typaslab, conc, process_broad, everything())
}

# reminder: original quality-controlled data = kept_studies_wide
# we don't need q values from this point on 
kept_studies_wide_moa <- 
  map(kept_studies_wide[c("nichols_2011.sscores", "newsize.sscores")], add_moa_info)

# table of physicochemical features per drug
chem_info <- select(drugs_complete, drugname_typaslab, SlogP:data_effective_rotor_count)
drugs_with_chemfeats <- chem_info$drugname_typaslab[!is.na(chem_info$SlogP)]
drugs_no_chemfeats <- chem_info$drugname_typaslab[is.na(chem_info$SlogP)]
stopifnot(nunique(chem_info$drugname_typaslab) == length(drugs_no_chemfeats) + 
    length(drugs_with_chemfeats))
```


```{r}
# training set
training_set <- kept_studies_wide_moa$nichols_2011.sscores %>% 
  filter(drugname_typaslab %in% drugs_with_chemfeats & 
      process_broad %in% main_moas)

# control set
control_set <- kept_studies_wide_moa
control_set$nichols_2011.sscores <- 
  filter(control_set$nichols_2011.sscores, process_broad %in% main_moas & 
      drugname_typaslab %in% drugs_no_chemfeats)
control_set$newsize.sscores <- 
  filter(control_set$newsize.sscores, process_broad %in% main_moas & 
      drugname_typaslab %in% training_set$drugname_typaslab)
control_set <- imap_dfr(control_set, function(.el, .name) {
  .el$origin <- str_extract(.name, pattern = "nichols|newsize")
  .el <- select(.el, drugname_typaslab, conc, process_broad, origin, everything())
})

# other MoAs 
other_moas <- map(kept_studies_wide_moa, ~ filter(.x, ! process_broad %in% moas))
other_moas <- imap_dfr(other_moas, function(.el, .name) {
  .el$origin <- str_extract(.name, pattern = "nichols|newsize")
  .el <- select(.el, drugname_typaslab, conc, process_broad, origin, everything())
})

# test set
test_set <- kept_studies_wide_moa$newsize.sscores %>% 
  filter(!drugname_typaslab %in% training_set$drugname_typaslab & 
      process_broad %in% main_moas)

# drugs with unknown MoA
unknown_moas <- imap_dfr(kept_studies_wide_moa, function(.el, .name) {
  .el$origin <- str_extract(.name, pattern = "nichols|newsize")
  .el <- select(.el, drugname_typaslab, conc, process_broad, origin, everything())
  .el <- filter(.el, process_broad == "unknown")
})

# for convenience
split_sets <- list(training_set = training_set, control_set = control_set, 
  other_moas = other_moas, test_set = test_set, unknown_moas = unknown_moas)

saveRDS(object = split_sets, file = "./data/programmatic_output/split_sets.rds")
```
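
The assignment rules can be summarised in a single `case_when()` on a toy 
table. Everything here is invented for illustration (drug names, MoA labels, 
the `has_chemfeats` flag), and for simplicity the sketch makes the five sets 
mutually exclusive, which may differ in detail from the filters above:

```{r}
library(dplyr)

# all names and MoA labels below are invented for illustration
toy_main_moas <- c("moa1", "moa2")
toy_conds <- tibble(
  drugname_typaslab = c("d1", "d2", "d3", "d4", "d1", "d5", "d6"), 
  origin = c(rep("nichols", 4), rep("newsize", 3)), 
  process_broad = c("moa1", "moa2", "other", "unknown", "moa1", "moa2", "unknown"), 
  has_chemfeats = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

# training drugs: Nichols, main MoA, chemical features available
toy_training_drugs <- with(toy_conds, drugname_typaslab[origin == "nichols" & 
    process_broad %in% toy_main_moas & has_chemfeats])

toy_conds <- toy_conds %>%
  mutate(set = case_when(
    process_broad == "unknown" ~ "unknown_moas", 
    !process_broad %in% toy_main_moas ~ "other_moas", 
    origin == "nichols" & has_chemfeats ~ "training_set", 
    origin == "nichols" ~ "control_set",                        # no chem. features
    drugname_typaslab %in% toy_training_drugs ~ "control_set",  # already trained on
    TRUE ~ "test_set"
  ))
toy_conds
```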


## Limiting the number of drug doses

Keep at most 3 doses per drug to mitigate imbalances. 
If there are more than 3 doses, take the 3 closest to the MIC (on a log2 scale). 
If the MIC is NA, keep the 3 highest doses.  

```{r}
dosg_check <- map(dosg_check, function(.x) {
  .x$frac_of_mic_log2 <- log2(.x$frac_of_mic)
  return(.x)})

# two arrange steps so that drugs with an NA mic value are ordered such that 
# the highest dosage will be selected
# keep_all_dosg: marks the (up to) 3 dosages per drug kept for the "all 
# dosages" table
# keep_one_dosg: marks the single dosage per drug that is closest to the MIC
dosg_tabs_top3 <- 
  map(dosg_check, function(.x) {
    group_by(.x, drugname_typaslab) %>%
    arrange(desc(conc), .by_group = TRUE) %>%
    arrange(abs(frac_of_mic_log2), .by_group = TRUE) %>%
    mutate(
      keep_all_dosg = ifelse(seq_len(n()) %in% c(1:3), TRUE, FALSE), 
      keep_one_dosg = ifelse(seq_len(n()) == 1, TRUE, FALSE)) %>%
    arrange(drugname_typaslab, conc, .by_group = TRUE) %>%
    select(-one_of("dataset")) %>%
    ungroup()
  })

# this must be TRUE now:
stopifnot(
  every(dosg_tabs_top3, function(.x) {
  nunique(.x$drugname_typaslab) == sum(.x$keep_one_dosg)
}))

# for later inspection if desired:
saveRDS(dosg_tabs_top3, file = file.path(outdir, "dosg_tabs_top3.rds"))
```
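
A toy example of the selection rule, with invented drugs and concentrations. 
As a simplification, this sketch uses one `arrange()` call with two sort keys 
instead of two consecutive `arrange()` steps; the `toy_*` names and the `mic` 
column are assumptions for illustration:

```{r}
library(dplyr)

# toy dosage table (values invented); drugY has no MIC
toy_tab <- tibble(
  drugname_typaslab = c(rep("drugX", 5), rep("drugY", 4)), 
  conc = c(0.5, 1, 2, 4, 16, 1, 2, 4, 8), 
  mic = c(rep(4, 5), rep(NA, 4))
) %>%
  mutate(dist_log2 = abs(log2(conc / mic)))

toy_top3 <- toy_tab %>%
  group_by(drugname_typaslab) %>%
  # first key: distance to the MIC; NA distances sort last and tie, so the 
  # second key (highest dose first) decides for drugs without an MIC
  arrange(dist_log2, desc(conc), .by_group = TRUE) %>%
  mutate(
    keep_all_dosg = row_number() <= 3, 
    keep_one_dosg = row_number() == 1) %>%
  ungroup()

filter(toy_top3, keep_all_dosg)
```

For `drugX` the doses 4, 2 and 16 are kept (log2 distances 0, 1 and 2, with 
the tie at distance 2 going to the higher dose), and for `drugY` the three 
highest doses 8, 4 and 2.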


```{r}
filter_dosg <- function(dataset, dosgtab, tokeep = c("all", "one")) {
  # "all" is actually 3 doses closest to MIC (or 3 highest ones)
  # "one" is the same but with one dosage
  tokeep <- match.arg(tokeep)
  stopifnot(is(dataset, "data.frame") & is(dosgtab, "data.frame"))
  
  if (tokeep == "all") {
    dosgtab <- dosgtab[dosgtab$keep_all_dosg, ]
  } else {
    dosgtab <- dosgtab[dosgtab$keep_one_dosg, ]
  }
  
  dataset <- semi_join(dataset, dosgtab, by = c("drugname_typaslab", "conc"))
  return(dataset)
}

m_all <- filter_dosg(training_set, dosgtab = dosg_tabs_top3$nichols_2011.sscores, 
  tokeep = "all")

m_one <- filter_dosg(training_set, dosgtab = dosg_tabs_top3$nichols_2011.sscores, 
  tokeep = "one")

# now versions with chemical features and then save
m_one_chem <- left_join(m_one, chem_info)
m_all_chem <- left_join(m_all, chem_info)

walk(c("m_all", "m_all_chem", "m_one", "m_one_chem"), function(x) {
  saveRDS(get(x), file.path(outdir, paste0(x, ".rds")))
})

# matrix to indicate which drug-dose-condition combinations are significant:
m_signif_all <- kept_studies_wide$nichols_2011.qvals %>% 
  left_join(mode_of_action[, c("drugname_typaslab", "process_broad")]) %>% 
  select(colnames(m_all)) %>% 
  semi_join(m_all, by = c("drugname_typaslab", "conc"))

saveRDS(m_signif_all, file.path(outdir, "m_signif_all.rds"))
```

Just a little plot on the observation counts. 

```{r}
group_by(m_all, drugname_typaslab) %>%
  slice(1) %>%
  ungroup() %>%
  count(process_broad) %>%
  ggplot(aes(x = process_broad, y = n)) + 
    geom_bar(aes(fill = process_broad), stat = "identity", width = 0.65) + 
    comparison_theme + 
    labs(x = "", y = "Count", title = "Unique drugs per MoA") + 
    scale_fill_manual("Mode of action", values = moa_cols, labels = moa_repl) + 
    scale_x_discrete(labels = moa_repl) + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
      legend.position = "None", panel.grid = element_blank())

ggsave("./plots/Ndrugs_per_MoA.pdf", width = 30, height = 45, units = "mm")

count(m_all, drugname_typaslab, process_broad) %>%
  ggplot(aes(x = factor(n))) + 
    geom_bar(aes(fill = process_broad), width = 0.65) + 
    comparison_theme + 
    labs(x = "N conc per drug", y = "N drugs", title = "Number of concentrations\nmeasured per drug") + 
    scale_fill_manual("Mode of action", values = moa_cols, labels = moa_repl) + 
    theme(panel.grid = element_blank())

ggsave("./plots/Nconc_per_drug.pdf", width = 55, height = 45, units = "mm")
```


```{r, include = FALSE, eval = FALSE}
## FOR PAPER
group_by(m_all, drugname_typaslab) %>%
  slice(1) %>%
  ungroup() %>%
  count(process_broad) %>%
  ggplot(aes(x = process_broad, y = n)) + 
    geom_bar(aes(fill = process_broad), stat = "identity", width = 0.65) + 
    comparison_theme + 
    labs(x = "", y = "Count", title = "Drugs per MoA") + 
    scale_fill_manual("Mode of action", values = moa_cols, labels = moa_repl) + 
    scale_x_discrete(labels = moa_repl) + 
    paper_theme + 
    theme(legend.position = "None")

ggsave("./plots/paper/Ndrugs_per_MoA.pdf", width = 50, height = 55, units = "mm")

count(m_all, drugname_typaslab, process_broad) %>%
  ggplot(aes(x = factor(n))) + 
    geom_bar(aes(fill = process_broad), width = 0.55) + 
    comparison_theme + 
    labs(x = "N concentrations per drug", y = "Number of drugs", 
      title = "Dosages per drug") + 
    scale_fill_manual("MoA", values = moa_cols, labels = moa_repl) + 
    paper_theme + 
    theme(legend.position = c(0.38, 0.78), legend.background = element_blank(), 
      legend.text = element_text(size = 7), legend.key.size = unit(0.25, "cm"))

ggsave("./plots/paper/Nconc_per_drug.pdf", width = 50, height = 55, units = "mm")


## FOR POSTER ------------------------------
group_by(m_all, drugname_typaslab) %>%
  slice(1) %>%
  ungroup() %>%
  count(process_broad) %>%
  ggplot(aes(x = process_broad, y = n)) + 
    geom_bar(aes(fill = process_broad), stat = "identity", width = 0.65) + 
    comparison_theme + 
    labs(x = "", y = "Count", title = "Drugs per MoA") + 
    scale_fill_manual("Mode of action", values = moa_cols, labels = moa_repl) + 
    scale_x_discrete(labels = moa_repl) + 
    poster_theme + 
    theme(legend.position = "none")

ggsave("./plots/POSTER_Ndrugs_per_MoA.pdf", width = 80, height = 100, units = "mm")

count(m_all, drugname_typaslab, process_broad) %>%
  ggplot(aes(x = factor(n))) + 
    geom_bar(aes(fill = process_broad), width = 0.55) + 
    comparison_theme + 
    labs(x = "N concentrations per drug", y = "Number of drugs", 
      title = "Dosages per drug") + 
    scale_fill_manual("MoA", values = moa_cols, labels = moa_repl) + 
    poster_theme + 
    theme(legend.position = c(0.4, 0.75), legend.background = element_blank(), 
      legend.text = element_text(size = 12))

ggsave("./plots/POSTER_Nconc_per_drug.pdf", width = 80, height = 100, units = "mm")
```


# Generating a "matrix container" for comparative model assessment

Each row of the matrix container holds everything needed to run a repeated, 
nested cross-validation on pre-generated resampling instances that are blocked 
by drug and stratified by mode of action. We will compare: 

  * Algorithms
  * Impact of two ways of selecting dosages
  * Influence of including chemical features (just out of curiosity, we do not plan to use them at all)

Notes on commits: 

  * Commit 716861c integrated the other data sets and re-included some drugs that had been wrongly excluded before (bicyclomycin, cecropin B, dibucaine, procaine). 
  * After commit 0704327, dosage selection was improved (changed to reflect biology) and some issues with the nested CV setup were corrected. 


## Making the container

```{r}
mc <- tibble(
  drug_dosages = c("all", "all", "one", "one"), 
  chemical_feats = c(FALSE, TRUE, FALSE, TRUE), 
  drug_feature_matrices = list(m_all, m_all_chem, m_one, m_one_chem)
)
```

### Adding CV instances

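Generate the resampling instances: 10 repeats of 8-fold cross-validation, 
blocked by drug (`drugname_typaslab`) and stratified by `process_broad`. The 
instances are cached as `.rds` files in `outdir` and are only regenerated if 
the file is missing, so delete the file to force new splits. 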

```{r}
# for single dosage:
task_one <- make_my_task(dfm = m_one, targetvar = "process_broad", 
  blockvar = "drugname_typaslab")

if (!file.exists(file.path(outdir, "cvinsts_one.rds"))) {
  cvinsts_one <- replicate(10, 
    make_cvinst_blocked_stratified(task_data_all_cols = task_one$data_complete, 
      mlr_task = task_one, folds = 8, strat_var = "process_broad"), 
    simplify = FALSE)
  
  saveRDS(cvinsts_one, file.path(outdir, "cvinsts_one.rds"))
} else {
  cvinsts_one <- readRDS(file.path(outdir, "cvinsts_one.rds"))
}
  

# for all dosages:
task_all <- make_my_task(dfm = m_all, targetvar = "process_broad", 
  blockvar = "drugname_typaslab")

if (!file.exists(file.path(outdir, "cvinsts_all.rds"))) {
  cvinsts_all <- replicate(10, 
    make_cvinst_blocked_stratified(task_data_all_cols = task_all$data_complete, 
      mlr_task = task_all, folds = 8, strat_var = "process_broad"), 
    simplify = FALSE)
  
  saveRDS(cvinsts_all, file.path(outdir, "cvinsts_all.rds"))
} else {
  cvinsts_all <- readRDS(file.path(outdir, "cvinsts_all.rds"))
}
```
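
The idea behind blocked, stratified folds can be sketched in base R. This is 
only an illustration under assumptions: `assign_blocked_stratified_folds()` and 
the toy data are hypothetical and are not the `make_cvinst_blocked_stratified()` 
function from `setup.R`, whose implementation is not shown here. The sketch 
assumes that every block (drug) belongs to exactly one stratum (MoA). 

```r
# Hypothetical helper for illustration only, NOT the function from setup.R.
# Assigns a fold to every row such that
#   (1) all rows of one block (drug) share the same fold (blocking), and
#   (2) blocks of each stratum (MoA) are spread evenly over folds (stratification).
assign_blocked_stratified_folds <- function(block, strat, folds) {
  # one row per block; assumes each block has exactly one stratum
  blocks <- unique(data.frame(block = block, strat = strat,
    stringsAsFactors = FALSE))
  stopifnot(!anyDuplicated(blocks$block))
  blocks$fold <- NA_integer_

  for (s in unique(blocks$strat)) {
    idx <- which(blocks$strat == s)
    # shuffle the blocks of this stratum ...
    idx <- idx[sample.int(length(idx))]
    # ... and deal them into folds, starting from a random fold order
    blocks$fold[idx] <- rep_len(sample.int(folds), length(idx))
  }

  # map the block-level fold back to the individual rows
  blocks$fold[match(block, blocks$block)]
}

# toy data: 12 drugs, 3 concentrations each, 3 modes of action
set.seed(1)
toy <- data.frame(
  drug = rep(paste0("drug", 1:12), each = 3),
  moa = rep(rep(c("cell_wall", "dna", "protein_synthesis"), each = 4), each = 3),
  stringsAsFactors = FALSE
)
toy$fold <- assign_blocked_stratified_folds(toy$drug, toy$moa, folds = 4)

# blocking: every drug sits in exactly one fold
folds_per_drug <- tapply(toy$fold, toy$drug, function(f) length(unique(f)))

# stratification: number of drugs per MoA and fold
drugs_per_cell <- table(unique(toy[c("drug", "moa", "fold")])[c("moa", "fold")])
drugs_per_cell
```

Blocking keeps different concentrations of the same drug out of both the 
training and the test set of a split, while stratification keeps the MoA 
proportions roughly comparable across folds. 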

Add them to the container: each row receives the list of instances that 
matches its dosage setting. 

```{r}
mc$resamp_instance <- ifelse(mc$drug_dosages == "all", list(cvinsts_all), 
  list(cvinsts_one))
```
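
Using `ifelse()` with list arguments is not an obvious idiom, so here is a 
minimal base R sketch of what it does. The objects `dosages_toy`, 
`inst_all_toy` and `inst_one_toy` are made-up stand-ins for the real column and 
instance lists. 

```r
# toy stand-ins (assumptions) for mc$drug_dosages, cvinsts_all and cvinsts_one
dosages_toy <- c("all", "all", "one", "one")
inst_all_toy <- list("all_rep1", "all_rep2")
inst_one_toy <- list("one_rep1", "one_rep2")

# list(...) wraps each instance list into a length-1 list; ifelse() recycles it
# to the length of the test and picks the matching element for every position
resamp_toy <- ifelse(dosages_toy == "all", list(inst_all_toy),
  list(inst_one_toy))

# the result is a list with one complete instance list per row
str(resamp_toy, max.level = 1)
```

Because the result is a plain list with one element per row, it can be 
assigned directly as a list-column. 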


### Adding model and hyperparameter specifications

Three learners are currently compared:

  * classif.randomForest
  * classif.glmnet
  * classif.xgboost

```{r}
mc <- bind_rows(rep(list(mc), 3))

mc$fitted_model <- 
  rep(c("classif.randomForest", "classif.glmnet", "classif.xgboost"), each = 4)

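# hyperparameter grids
# hard-coding discrete grids like this is not an ideal solution
# check https://jakob-r.de/mlrHyperopt/articles/working_with_parconfigs_and_paramsets.html 
# for a workaround
# note: ncol(m_all) also counts non-feature columns (e.g. drugname_typaslab, 
# process_broad), so sqrt_p only approximates sqrt(number of features)
sqrt_p <- floor(sqrt(ncol(m_all)))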

rf_grid <- makeParamSet(
  makeDiscreteParam("ntree", values = c(200, 500, 1000)), 
  makeDiscreteParam("mtry", values = c(sqrt_p, sqrt_p + 25, sqrt_p + 50, 
    sqrt_p + 75, sqrt_p + 100)), 
  makeDiscreteParam("fw.perc", values = c(0.25, 0.5, 1))
)

lasso_grid <- makeParamSet(
  makeDiscreteParam("s", values = seq(from = 0.01, to = 1, length.out = 100)), 
  makeDiscreteParam("fw.perc", values = seq(from = 0.01, to = 0.25, by = 0.02))
)

xgboost_grid <- makeParamSet(
  makeDiscreteParam("nrounds", values = c(1, 5, 10, 20, 30)), 
  makeDiscreteParam("eta", values = c(0.2, 0.3, 0.4, 0.5, 0.6)), 
  makeDiscreteParam("max_depth", values = c(3, 5, 7)), 
  makeDiscreteParam("fw.perc", values = c(0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 
    0.5, 1))
)

mc$hyperparam_grid <- 
  ifelse(mc$fitted_model == "classif.randomForest", 
    list(rf_grid), 
    ifelse(mc$fitted_model == "classif.glmnet", 
      list(lasso_grid), 
      list(xgboost_grid)))

# tuning measures (mlr measure objects), identical for every row
mc$tuning_measure <- list(list(mmce, kappa))

# then arrange nicely
mc <- select(mc, fitted_model, drug_dosages, chemical_feats, 
  drug_feature_matrices, resamp_instance, hyperparam_grid, tuning_measure) %>%
  arrange(fitted_model, drug_dosages, chemical_feats)
```
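
To get a feeling for the tuning cost, the size of each grid can be computed 
from the values above. This is a rough sketch: it assumes that every 
combination would be evaluated, which depends on tuning settings not shown 
here, and `sqrt_p_example` is a made-up placeholder because the real `sqrt_p` 
depends on `ncol(m_all)`. 

```r
# placeholder (assumption); only the number of mtry values matters here
sqrt_p_example <- 20

# values copied from the parameter sets defined above
grid_values <- list(
  classif.randomForest = list(
    ntree = c(200, 500, 1000),
    mtry = sqrt_p_example + c(0, 25, 50, 75, 100),
    fw.perc = c(0.25, 0.5, 1)
  ),
  classif.glmnet = list(
    s = seq(from = 0.01, to = 1, length.out = 100),
    fw.perc = seq(from = 0.01, to = 0.25, by = 0.02)
  ),
  classif.xgboost = list(
    nrounds = c(1, 5, 10, 20, 30),
    eta = c(0.2, 0.3, 0.4, 0.5, 0.6),
    max_depth = c(3, 5, 7),
    fw.perc = c(0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 1)
  )
)

# number of hyperparameter combinations per learner:
# product of the number of values of each parameter
n_combinations <- vapply(grid_values, function(g) prod(lengths(g)), numeric(1))
n_combinations
```

Under these assumptions the glmnet grid is by far the largest, followed by 
xgboost and random forest. 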

Save the container to `outdir`.

```{r}
saveRDS(mc, file = file.path(outdir, "mc.rds"))
```


# System and session info

```{r}
R.version
sessionInfo()
```