# Exploratory Data Analysis {#eda}
## Read the Data into the Programming Language
The first thing to do is to read the data. Nowadays data is typically stored in text/CSV, JSON, or XML formats. Very large data sets are often stored as gzip files, which are compressed files similar to zip files. Most statistical software can read data directly from gzip files, which removes the need to decompress the file, saves storage space, and speeds up reading the data.
Formats like XML and JSON are useful for non-rectangular data but tend to produce larger files that are slower to read and process.
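Here is a minimal sketch of reading a plain CSV and a gzip-compressed CSV with readr; the file names are hypothetical placeholders:
```{r, eval=FALSE}
library(readr)

# read a plain CSV file (hypothetical file name)
sales <- read_csv("sales.csv")

# read_csv() can also read gzip-compressed files directly,
# so there is no need to decompress them first
sales_gz <- read_csv("sales.csv.gz")
```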
### Data Forms: Long format vs. Wide format
Data is typically found in one of two standard forms: long format and wide format.
As a general rule, data should always be stored in long format. Long format is the natural way data is fed into machine learning and statistical models, as well as into code for creating visualizations. The only case I can think of for using wide format is displaying the data for a person to view, since wide format is a more natural way for people to read the data.
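To illustrate the difference, here is a small made-up example of converting between the two forms with tidyr; the table and column names are invented for the sketch:
```{r, eval=FALSE}
library(tibble)
library(tidyr)
library(dplyr)

# a small wide-format table: one column per year
wide <- tibble(
  country = c("A", "B"),
  `2020`  = c(10, 20),
  `2021`  = c(12, 22)
)

# pivot to long format: one row per country-year observation
long <- wide %>%
  pivot_longer(cols = c(`2020`, `2021`), names_to = "year", values_to = "value")

# pivot back to wide format, e.g. for display in a report
long %>%
  pivot_wider(names_from = year, values_from = value)
```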
## Basic EDA Steps
For EDA I would recommend looking at the data in its entirety and then looking at each variable individually.
Often the data has too many variables to examine each one manually. In that case, choose the most important variables and perform manual EDA on those. We will talk about writing scripts to perform automated EDA much later.
Let me first describe how to look at individual variables.
1. Write the title with the variable name.
2. Find the number of unique values of the variable. Is it less than the total number of rows?
3. Print the unique values of the variable.
4. Make a table with the count of the values of the variable.
5. Find the number of missing values or NA's in the variable.
6. Visualize the count table, including the NA's in the visual.
7. Write down, at the end and before moving on to the next variable, any observations from the tables and visualizations that you think are important, interesting, or of note.
Then repeat steps 1 to 7 for the next variable.
Here is the list of items, in the order in which they should appear in the report, for each important variable.
### Step 1
1. Title
2. Number of unique values
3. List of unique values
4. Count table of the values of the variable, with percentages as well as counts.
5. Number of missing values in the variable (i.e. NA's), shown as both a count and a percentage in the table.
6. Visualization of the count table including the number of missing values in the variable.
7. Observations of note about the variable.
To do this in the EDA section of your report, you should first write the title with the variable name, for example: Looking at Variable Name in Raw Version.
Points 4 and 5 in step 1 should ideally appear in the same table, but you could also put point 4 in one table and point 5 in a separate table. More detail can be found below:
After that you should print the number of unique values and then print what the unique values are.
After that you should make a table with the count of the unique values in the variable. Following this, in the same table also show the number of missing values as counts and then the percentage of missingness (missing data) for each value. Lastly, create a visualization for this count table.
If the variable is continuous, the visualization should be a histogram. If the variable is numeric but not continuous, for example discrete like the number of houses in a district, then the visualization should be a bar chart (also called a bar graph or bar plot). Finally, if the variable is categorical, such as the participant's sex, which can be male, female, or other, the visualization should also be a bar chart.
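Putting steps 1 to 7 together, here is a rough sketch of what this could look like for one variable; the data frame `df` and the column `my_var` are placeholders for your own data:
```{r, eval=FALSE}
library(dplyr)
library(ggplot2)

# 2. number of unique values (compare with the number of rows)
n_distinct(df$my_var)
nrow(df)

# 3. the unique values themselves
unique(df$my_var)

# 4-5. count table including NA's, with percentages
count_table <- df %>%
  count(my_var) %>%                      # count() keeps NA as its own row
  mutate(percent = 100 * n / sum(n))
count_table

# 6. visualize the count table
# (use geom_histogram() on the raw variable instead if it is continuous)
ggplot(count_table, aes(x = my_var, y = n)) +
  geom_col()
```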
### Step 2
When this is done for the chosen variables, find the counts as well as the percentages of missing values in all the variables of the data set, and print those as a table.
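A minimal sketch of this overall missingness table, again using a placeholder data frame `df`:
```{r, eval=FALSE}
library(dplyr)
library(tidyr)

# count and percentage of missing values for every column
df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(percent_missing = 100 * n_missing / nrow(df))
```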
## Intro to the Tidyverse Package
After loading the data, it is necessary to ensure that our data is in tidy format. By this we simply mean that the data follows the three interrelated rules which make a data set tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
To achieve these three golden rules and more, we use the tidyverse package.
**Tidyverse** is a powerful collection of R packages for transforming and visualizing data. All packages of the tidyverse share an underlying philosophy and common APIs.
The core packages are:
* **ggplot2** which implements the grammar of graphics. You can use it
to visualize your data.
* **dplyr** is a grammar of data manipulation. You can use it to solve the
most common data manipulation challenges.
* **tidyr** helps you to create tidy data, or data where each variable is in a column, each observation is a row, and each value is a cell.
* **readr** is a fast and friendly way to read rectangular data.
* **purrr** enhances R’s functional programming (FP) toolkit by providing a
complete and consistent set of tools for working with functions and vectors.
* **tibble** is a modern re-imagining of the data frame.
* **stringr** provides a cohesive set of functions designed to make
working with strings as easy as possible.
* **forcats** provides a suite of useful tools that solve common problems
with factors.
**As the tidyverse provides a complete framework for data manipulation, analysis, and visualization, we will use it throughout this course.**
You can install the complete tidyverse with:
`install.packages("tidyverse")`
Then, load the core tidyverse and make it available in your current R
session by running:
```{r,warning=FALSE,message=FALSE}
library(tidyverse)
```
Now we see the tidyverse in action using the Iris data set.
## Loading Data in R
```{r}
# load the data into R
iris <- read.csv("C:/Users/Z Book/Downloads/archive/Iris.csv", stringsAsFactors = TRUE)
# convert the data frame to a tibble object
data <- as_tibble(iris)
```
After loading the data into R, we convert it to a tibble object so that we can apply *dplyr* operations on the data frame.
The first step after loading the data is to look at the head of the data. This shows us the first six rows of the data frame.
```{r}
head(data)
```
We see that the Iris data set has 6 variables. The data set contains the length and width of each flower's sepal and petal. It is also important to know the total number of rows in the data frame.
```{r}
nrow(data)
```
The iris data set has 150 rows and 6 columns.
Next we look at the summary statistics of the data. The summary function in R provides descriptive statistics of the data frame, including the minimum, maximum, median, mean, and quartiles.
```{r}
summary(data)
```
Note that the summary() function provided these statistics for all numeric variables. Because the Species column was read in as a factor, summary() reports the count of each species level instead.
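If we want just the counts of the categorical column, we can tabulate it directly:
```{r}
# counts of each species level
table(data$Species)
```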
Now we use dplyr to apply different functions to our data frame.
## dplyr
### Filter
filter() allows you to select a subset of rows in a data frame.
```{r}
# Select the iris data of species "Iris-setosa"
data %>%
  filter(Species == "Iris-setosa")
```
### Arrange
arrange() sorts the observations in a data set in ascending or descending order based on one of its variables.
```{r}
# Sort in ascending order of sepal length
data %>%
  arrange(SepalLengthCm)
```
Similarly we can arrange the data in descending order.
```{r}
# Sort in descending order of sepal length
data %>%
  arrange(desc(SepalLengthCm))
```
### Mutate
mutate() allows you to update or create new columns of a data frame.
```{r}
# Create a new variable "SepalLengthMm" which converts the sepal length to millimeters
sepal_length_converted <- data %>%
  mutate(SepalLengthMm = SepalLengthCm * 10)
# Show the converted column
sepal_length_converted$SepalLengthMm
```
There are other verbs provided by dplyr for data manipulation, such as:
**summarize()** allows you to turn many observations into a single data point.
**group_by()** allows you to summarize within groups instead of summarizing the entire data set.
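For example, a short sketch combining the two to compute the average sepal length per species:
```{r}
# average sepal length for each species
data %>%
  group_by(Species) %>%
  summarize(mean_sepal_length = mean(SepalLengthCm))
```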
We also make a simple scatterplot. Since the tidyverse includes the ggplot2 package, there is no need to load it separately.
```{r}
# Compare petal width and length
ggplot(data, aes(x = PetalLengthCm, y = PetalWidthCm, color = Species)) +
  geom_point()
```