---
title: "Benchmark"
author: "Andrea Dalseno"
date: "2/19/2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading the data
On the [data download page](https://www.drivendata.org/competitions/66/flu-shot-learning/data/), we provide everything you need to get started:

- **Training Features**: These are the input variables that your model will use to predict the probability that people received H1N1 flu and seasonal flu vaccines. There are 35 feature columns in total, each a response to a survey question. These questions cover several different topics, such as whether people observed safe behavioral practices, their opinions about the diseases and the vaccines, and their demographics. Check out the [problem description](https://www.drivendata.org/competitions/66/flu-shot-learning/page/211/) page for more information.
- **Training Labels**: These are the labels corresponding to the observations in the training features. There are two target variables: `h1n1_vaccine` and `seasonal_vaccine`. Both are binary variables, with 1 indicating that a person received the respective flu vaccine and 0 indicating that a person did not. Note that this is what is known as a "multilabel" modeling task.
- **Test Features**: These are the features for the observations that you will use to generate the submission predictions after training a model. We don't give you the labels for these samples; it's up to you to generate them.
- **Submission Format**: This file serves as an example of how to format your submission. It contains the index and columns for our submission predictions. The two target variable columns are filled with 0.5 and 0.7 as an example. Your submission to the leaderboard must be in this exact format (with different prediction values) in order to be scored successfully!

Let's start by importing the libraries that we will need to load and explore the data.
```{r, message=FALSE}
library(tidyverse)
library(kableExtra)
library(rlang)
library(janitor)
```
Next, we can load the datasets and begin taking a look.
```{r}
features_df <- read.csv('training_set_features.csv', header=TRUE, row.names="respondent_id")
labels_df <- read.csv('training_set_labels.csv', header=TRUE, row.names="respondent_id")
```
```{r}
sprintf("features_df rows: %i, columns %i", nrow(features_df), ncol(features_df))
head(features_df)%>%
kbl() %>%
kable_material(c("striped", "hover"))
#knitr::kable(head(features_df), format = "markdown")
```
Each row is a person who was a survey respondent. The columns are the feature values corresponding to those people. We have 26,707 observations and 35 features.
```{r}
str(features_df)
```
Now let's look at the labels.
```{r}
sprintf("labels_df rows: %i, columns %i", nrow(labels_df), ncol(labels_df))
head(labels_df)%>%
kbl()%>%
kable_material(c("striped", "hover"))
```
We have the same 26,707 observations, and two target variables that we have labels for.
Let's double-check that the rows between the features and the labels match up. We don't want to train on the wrong labels. `stopifnot()` will error if the two vectors of row names don't match up.
```{r}
stopifnot(identical(rownames(features_df), rownames(labels_df)))
```
The assertion ran, and nothing happened. That's good: it means the rows line up. If the two sets of row names were not the same, we would get an error.
## Exploring the data
```{r}
library(ggplot2)
```
### Labels
Let's start by taking a look at our distribution of the two target variables.
```{r echo=TRUE}
p1 <- labels_df %>%
group_by(h1n1_vaccine) %>% summarise(total = n()) %>%
ggplot( aes(x=h1n1_vaccine, y=total)) +
geom_bar(stat='identity') +
ggtitle("Proportion of H1N1 Vaccine") +
xlab('h1n1 vaccine')+
coord_flip()
p2 <- labels_df %>%
group_by(seasonal_vaccine) %>% summarise(total = n()) %>%
ggplot( aes(x=seasonal_vaccine, y=total)) +
geom_bar(stat='identity') +
ggtitle("Proportion of Seasonal Vaccine") +
xlab('seasonal vaccine')+
coord_flip()
library(grid)
grid.newpage()
grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
```
It looks like roughly half of people received the seasonal flu vaccine, but only about 20% of people received the H1N1 flu vaccine. In terms of class balance, we say that the seasonal flu vaccine target has balanced classes, while the H1N1 flu vaccine target has moderately imbalanced classes.
Are the two target variables independent? Let's take a look.
```{r}
ftable(addmargins(prop.table(table(labels_df))))
```
```{r}
# For two binary variables, the Pearson correlation is the phi coefficient
cor(labels_df$h1n1_vaccine, y = labels_df$seasonal_vaccine, use = "everything",
    method = "pearson")
```
These two variables have a phi coefficient of 0.37, indicating a moderate positive correlation. We can see that in the cross-tabulation as well. Most people who got an H1N1 flu vaccine also got the seasonal flu vaccine. While a minority of people who got the seasonal vaccine got the H1N1 vaccine, they got the H1N1 vaccine at a higher rate than those who did not get the seasonal vaccine.
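As a cross-check, the phi coefficient can also be computed directly from the 2x2 contingency table. A minimal sketch (using the standard formula: the cross-product difference divided by the square root of the product of the four marginal totals):

```{r}
tab <- table(labels_df$h1n1_vaccine, labels_df$seasonal_vaccine)
# phi = (n11 * n22 - n12 * n21) / sqrt(n1. * n2. * n.1 * n.2)
(tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
```

This should agree with the `cor()` output above.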
## Features
Next, let's take a look at our features. From the problem description page, we know that the feature variables are almost all categorical: a mix of binary, ordinal, and nominal features. Let's pick a few and see how the rates of vaccination may differ across the levels of the feature variables.
First, let's combine our features and labels into one dataframe.
```{r}
joined_df <- transform(merge(features_df,labels_df,by='row.names',all=TRUE), row.names=Row.names, Row.names=NULL)
sprintf("joined_df rows: %i, columns %i", nrow(joined_df), ncol(joined_df))
head(joined_df)%>%
kbl()%>%
kable_material(c("striped", "hover"))
```
### Prototyping a Plot
Next, let's see how the features are correlated with the target variables. We'll start with trying to visualize if there is simple bivariate correlation. If a feature is correlated with the target, we'd expect there to be different patterns of vaccination as you vary the values of the feature.
Jumping straight to the final visualization is hard. We can instead pick one feature and one target and work our way up to a prototype, before applying it to more features and both targets. We'll use `h1n1_concern`, the level of concern the person showed about the H1N1 flu, and `h1n1_vaccine` as a target variable.
First, we'll get the count of observations for each combination of those two variables.
```{r}
counts <- joined_df %>%
select(h1n1_concern, h1n1_vaccine)%>%
group_by(h1n1_concern, h1n1_vaccine)%>%
summarise(n=n(), .groups = 'drop')%>%
na.omit() %>%
pivot_wider(names_from = h1n1_vaccine, values_from = n)%>%
rename('h1'='1', 'h0'='0')
counts
```
**ggplot prefers data in long format**, so for plotting we recompute the counts without pivoting wider:
```{r}
joined_df %>%
select(h1n1_concern, h1n1_vaccine)%>%
group_by(h1n1_concern, h1n1_vaccine)%>%
summarise(n=n())%>%
na.omit() %>%
mutate(h1n1_vaccine=as.factor(h1n1_vaccine), h1n1_concern=as.factor(h1n1_concern))%>%
ggplot(aes(x=reorder(h1n1_concern, desc(h1n1_concern)), y=n, fill=h1n1_vaccine))+
geom_bar(position="dodge",stat='identity' ) +
ggtitle("h1n1_vaccine")+
xlab('h1n1_vaccine')+
coord_flip()
```
Unfortunately, it's still hard to tell whether `h1n1_concern` levels show differences in someone's likelihood to get vaccinated. Since the two classes are imbalanced, we just see fewer vaccinated observations for every level of `h1n1_concern`. It swamps out any other trends that might exist.

Let's instead look at the **rate** of vaccination for each level of `h1n1_concern`.
```{r}
# Total observations for each level of h1n1_concern (sum only the count columns)
h1n1_concern_counts <- cbind(counts$h1n1_concern, rowSums(counts[, c("h0", "h1")]))
h1n1_concern_counts
```
```{r}
props <- counts %>%
adorn_totals(where='col')%>%
mutate(h0=h0/Total, h1=h1/Total)%>%
select(h1n1_concern,h0,h1)
props
```
```{r}
props %>%
pivot_longer(names_to='h1n1_vaccine', cols=!h1n1_concern, values_to = 'pp')%>%
ggplot(aes(x=h1n1_concern, y=pp, fill=h1n1_vaccine))+
geom_bar(position="dodge",stat='identity' ) +
ggtitle("")+
xlab('h1n1_concern')+
coord_flip()
```
Now we have a clearer picture of what's happening! In this plot, each pair of bars (one per class of `h1n1_vaccine`) adds up to 1.0. We can clearly see that even though most people don't get the H1N1 vaccine, they are more likely to if they have a higher level of concern. It looks like `h1n1_concern` will be a useful feature when we get to modeling.

Since every pair of bars adds up to 1.0 and we only have two bars per level, this is actually a good use case for a stacked bar chart, which makes the plot even easier to read.
```{r}
props %>%
pivot_longer(names_to='h1n1_vaccine', cols=!h1n1_concern, values_to = 'pp')%>%
ggplot(aes(x=h1n1_concern, y=pp, fill=h1n1_vaccine))+
geom_bar(position="stack",stat='identity' ) +
ggtitle("")+
xlab('h1n1_concern')+
coord_flip()
```
This is a more compact plot showing the same thing as before.
### Plotting more variables
Let's factor this code into a function so we can use it on more variables.
```{r}
vaccination_rate_plot <- function(column, target, df) {
  # Stacked bar chart of vaccination rate for `target` against `column`.
  #
  # Args:
  #   column (string): column name of feature variable
  #   target (string): column name of target variable
  #   df (data frame): data frame that contains columns `column` and `target`
  #
  # Returns a ggplot object.
  column <- sym(column)
  target <- ensym(target)
  counts <- df %>%
    select(!!column, !!target) %>%
    group_by(!!column, !!target) %>%
    summarise(n = n(), .groups = 'drop') %>%
    na.omit() %>%
    pivot_wider(names_from = !!target, values_from = n) %>%
    rename('h1' = '1', 'h0' = '0')
  props <- counts %>%
    adorn_totals(where = 'col') %>%
    mutate(h0 = h0 / Total, h1 = h1 / Total) %>%
    select(!!column, h0, h1)
  props %>%
    pivot_longer(names_to = as_string(target), cols = -1, values_to = 'pp') %>%
    ggplot(aes(x = !!column, y = pp, fill = !!target)) +
    geom_bar(position = "stack", stat = 'identity') +
    xlab(as_string(column)) +
    ylab('') +
    coord_flip()
}
```
Then, we'll loop through several columns and plot against both `h1n1_vaccine` and `seasonal_vaccine`.
```{r}
cols_to_plot = c(
'h1n1_concern',
'h1n1_knowledge',
'opinion_h1n1_vacc_effective',
'opinion_h1n1_risk',
'opinion_h1n1_sick_from_vacc',
'opinion_seas_vacc_effective',
'opinion_seas_risk',
'opinion_seas_sick_from_vacc',
'sex',
'age_group',
'race'
)
for (i in cols_to_plot){
tmp1 <- vaccination_rate_plot(i, 'h1n1_vaccine', joined_df)
tmp2 <- vaccination_rate_plot(i, 'seasonal_vaccine', joined_df)
grid.newpage()
grid.draw(rbind(ggplotGrob(tmp1), ggplotGrob(tmp2), size = "last"))
}
```
It looks like the knowledge and opinion questions have pretty strong signal for both target variables.

The demographic features have stronger correlation with `seasonal_vaccine`, but much less so for `h1n1_vaccine`. In particular, we interestingly see a strong correlation between `age_group` and `seasonal_vaccine` but not `h1n1_vaccine`. For seasonal flu this makes sense: people appear to act in line with the fact that [flu hits harder, with a higher risk of flu-related complications, as people age](https://www.cdc.gov/flu/highrisk/index.htm). It turns out, though, that H1N1 flu has an interesting relationship with age: [even though older people have a higher risk of complications, they were less likely to get infected!](https://www.cdc.gov/h1n1flu/surveillanceqa.htm#7) While we can't conclude anything about causality from this analysis, it seems the risk factors ended up being reflected in the vaccination rates.
## Building some models
Let's start working on training some models! We will be using logistic regression, a simple and fast linear model for classification problems. Logistic regression is a great model choice for a first-pass baseline model when starting out on a problem.

We will use R's built-in `glm()` function with a binomial family and logit link. (The original Python benchmark used scikit-learn's logistic regression implementation.)

Standard logistic regression only works with numeric input for features. Since this is a benchmark, we're going to build simple models using only the numeric columns of our dataset.

Categorical variables with non-numeric values take a little more preprocessing to prepare for many machine learning algorithms. We're not going to deal with them in this benchmark walkthrough, but there are many different ways to encode categorical variables into numeric values. Check out [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and [ordinal encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) to get started if you're not familiar; the sketch below shows what one-hot encoding looks like in base R.
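As a minimal, hypothetical illustration (not part of the benchmark pipeline), base R's `model.matrix()` can one-hot encode a character column such as `age_group`:

```{r}
# One indicator column per age_group level; "- 1" drops the intercept so
# every level gets its own column. Illustration only; rows with NA are dropped.
head(model.matrix(~ age_group - 1, data = features_df))
```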
```{r}
# Columns that are not character hold the numeric (binary/ordinal) responses
numeric_cols <- names(features_df)[!sapply(features_df, is.character)]
numeric_cols
```
## Feature Preprocessing

There are two important data preprocessing steps before jumping to the logistic regression:

- **Scaling**: Transform all features to be on the same scale. This matters when using regularization, which we discuss in the next section. Here we use `scales::rescale()`, which rescales each feature to the [0, 1] range. (The original Python benchmark instead used [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), i.e. Z-score scaling to zero mean and unit variance; either choice puts the features on a common scale.)
- **NA Imputation**: Logistic regression does not handle NA values. We will use median imputation, which fills each missing value with the median of its column.

The Python benchmark chains these steps into a scikit-learn pipeline, which is a best practice because the same fitted transformations can be reused on new data (such as our test data). Here we apply the transformations directly with dplyr; a pipeline-style sketch with the `recipes` package follows the chunk below.
```{r}
library(scales)
# Rescale each numeric column (features and labels) to [0, 1] ...
scaled_df <- joined_df %>%
  select(all_of(c(numeric_cols, c('h1n1_vaccine', 'seasonal_vaccine')))) %>%
  sapply(function(.) rescale(.))
# ... then impute any remaining NAs with the column median
scaled_df <- as.data.frame(scaled_df) %>%
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
```
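For comparison, here is a pipeline-style sketch of the same two steps using the `recipes` package. This is an assumption on our part, not something the rest of this walkthrough relies on, so the chunk is not evaluated:

```{r, eval=FALSE}
library(recipes)
# recipe() declares the steps; prep() "fits" them; bake() applies them
rec <- recipe(~ ., data = joined_df[, numeric_cols]) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill NAs first
  step_range(all_numeric_predictors())              # then rescale to [0, 1]
baked <- bake(prep(rec), new_data = NULL)
```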
Next, we're going to define our estimators.

The original benchmark used scikit-learn's [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with its default hyperparameters of L2 (a.k.a. Ridge) regularization and a `C` value (inverse regularization strength) of 1. [Regularization](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a) is useful because it reduces overfitting. R's `glm()` fits an unregularized logistic regression, which is fine for a baseline; a regularized alternative is sketched below. When building your own model, you may want to tune your hyperparameters, for example with [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) in Python or the `caret` package in R.

Because we have two labels to predict, we will simply train two models of the same type, one per target. (scikit-learn's [`MultiOutputClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) is a convenient shortcut for the same thing.)
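If you do want regularization in R, a minimal sketch with the `glmnet` package might look like the following. This assumes `glmnet` is installed; it is not used in the rest of this benchmark, so the chunk is not evaluated:

```{r, eval=FALSE}
library(glmnet)
# glmnet wants a numeric matrix of predictors and a 0/1 response
x <- as.matrix(scaled_df[, numeric_cols])
# alpha = 0 selects ridge (L2); cv.glmnet picks the penalty strength by CV
cv_fit <- cv.glmnet(x, scaled_df$h1n1_vaccine, family = "binomial", alpha = 0)
```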
## Estimator
```{r}
# Logistic regression via glm(); the actual fits happen in the next section, e.g.:
# model <- glm(h1n1_vaccine ~ . - seasonal_vaccine, family = binomial(link = 'logit'), data = train)
```
## Training and Evaluation
Finally, let's get ready to train and evaluate our model.

Let's split our available data into a training and evaluation set. (We're going to reserve "test set" to refer to the final predictions we upload to the platform.) We'll use 30% of our data for evaluation.

Recall that earlier in our exploratory analysis, the `h1n1_vaccine` label classes were moderately imbalanced. Sometimes this can lead to lopsided splits, which can lead to generalization problems with fitting and/or evaluating the model. We should have a large enough dataset that a randomly shuffled split keeps roughly the same proportions, but we can pass the label to `caTools::sample.split()`, which preserves the class ratio across the two subsets.
```{r}
library(caTools)
set.seed(101)
# Split on the h1n1_vaccine label so both subsets keep its class proportions
sample <- sample.split(scaled_df$h1n1_vaccine, SplitRatio = .70)
train <- subset(scaled_df, sample == TRUE)
test <- subset(scaled_df, sample == FALSE)
```
Now, let's train the model!
```{r}
# One logistic regression per target; exclude the other label from the predictors
model.h1 <- glm(h1n1_vaccine ~ . - seasonal_vaccine, family = binomial(link = 'logit'), data = train)
model.se <- glm(seasonal_vaccine ~ . - h1n1_vaccine, family = binomial(link = 'logit'), data = train)
```
Now we can make predictions on the held-out evaluation set:
```{r}
h1n1.probs = predict(model.h1, type='response', newdata=test)
se.probs = predict(model.se, type='response', newdata=test)
```
`predict(..., type = 'response')` gives us back, for each model, a vector with the predicted probability of class 1 for every observation: one vector for `h1n1_vaccine` and one for `seasonal_vaccine`. That's exactly the form the competition wants.

This competition uses [ROC AUC](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) as the metric. Let's plot ROC curves and take a look. (The Python benchmark notes that scikit-learn's [`plot_roc_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html) doesn't support multilabel; in R we can simply use the `pROC` package once per target.)
```{r}
library(pROC)
pROC_obj <- roc(test$h1n1_vaccine,h1n1.probs,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
plot(sens.ci, type="bars")
```
```{r}
auc(pROC_obj)
```
```{r}
pROC_obj <- roc(test$seasonal_vaccine,se.probs,
smoothed = TRUE,
# arguments for ci
ci=TRUE, ci.alpha=0.9, stratified=FALSE,
# arguments for plot
plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,
print.auc=TRUE, show.thres=TRUE)
sens.ci <- ci.se(pROC_obj)
plot(sens.ci, type="shape", col="lightblue")
plot(sens.ci, type="bars")
```
```{r}
auc(pROC_obj)
```
An AUC score of 0.5 is no better than random, and an AUC score of 1.0 is a perfect model. Both models look like they perform similarly. Our scores of around 0.83 are not great, but they're not bad either!

The competition metric is the average of these two AUC values. We can compute it directly by averaging the outputs of `pROC::auc()` for the two models. (In Python, scikit-learn's [`roc_auc_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) supports multilabel input and does this in one call.)
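A minimal sketch of that average, keeping one `roc` object per target instead of overwriting `pROC_obj`:

```{r}
roc.h1 <- roc(test$h1n1_vaccine, h1n1.probs)
roc.se <- roc(test$seasonal_vaccine, se.probs)
# Competition score: mean of the two per-target AUCs
mean(c(auc(roc.h1), auc(roc.se)))
```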
## Retrain Model on Full Dataset
Now that we have an idea of our performance, we'll want to retrain our model on the full dataset before generating our predictions on the test set.
```{r}
# Drop the other label so the `.` in each formula picks up only the features
scaled_df.h1 <- scaled_df%>%
  select(-seasonal_vaccine)
scaled_df.se <- scaled_df%>%
  select(-h1n1_vaccine)
```
```{r}
final_model.h1 <- glm(h1n1_vaccine~.,family=binomial(link='logit'),data=scaled_df.h1)
final_model.se <- glm(seasonal_vaccine~.,family=binomial(link='logit'),data=scaled_df.se)
```
```{r}
test_df <- read.csv('test_set_features.csv', header=TRUE, row.names="respondent_id")
```
```{r}
# Rescale and impute the test features the same way as the training data.
# Note: for simplicity this uses the test set's own ranges and medians; a
# stricter approach would reuse the values computed on the training data.
scaled_test <- test_df %>%
  select(all_of(numeric_cols)) %>%
  sapply(function(.) rescale(.))
scaled_test <- as.data.frame(scaled_test) %>%
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
```
```{r}
test_probas.h1 = predict(final_model.h1, type = 'response', newdata = scaled_test)
test_probas.se = predict(final_model.se, type = 'response', newdata = scaled_test)
```
We've just made our predictions on the test set. Again, for this competition we want the **probabilities**, not the binary label predictions, and that is what `predict(..., type = 'response')` returns.

As before, this gives us back two vectors: one for `h1n1_vaccine` and one for `seasonal_vaccine`, each holding the predicted probability of class 1 for every test observation.
Let's read in the submission format file so we can put our predictions into it.
```{r}
submission_df <- read.csv('submission_format.csv', header=TRUE, row.names="respondent_id")
```
```{r}
head(submission_df)%>%
kbl()%>%
kable_material(c("striped", "hover"))
```
We want to replace those 0.5s and 0.7s with our predictions. First, make sure the predictions line up with the submission rows. Then, we can drop in the appropriate columns from our predicted probabilities.
```{r}
# identical() won't work here since test_probas.h1 is a vector, not a data
# frame; just check that both have the same length
assertthat::are_equal(nrow(submission_df), length(test_probas.h1))
assertthat::are_equal(nrow(submission_df), length(test_probas.se))
```
```{r}
# Save predictions to submission data frame
submission_df[["h1n1_vaccine"]] = test_probas.h1
submission_df[["seasonal_vaccine"]] = test_probas.se
```
```{r}
head(submission_df)%>%
kbl()%>%
kable_material(c("striped", "hover"))
```
```{r}
# Turn the row names into a column and convert it to integer (otherwise the
# ids would be written out as strings)
library(data.table)
setDT(submission_df, keep.rownames = 'respondent_id')[]
submission_df <- submission_df %>%
mutate(respondent_id=as.integer(respondent_id))
```
```{r}
write.csv(submission_df,"submission.csv", row.names = FALSE)
```
```{bash}
head submission.csv
```
## Submit to the Leaderboard
We can then head over to the competition [submissions page](https://www.drivendata.org/competitions/66/flu-shot-learning/submissions/) to submit the predictions.
![](score.png)
Done!