---
title: "Crime by the Numbers: A Criminologist's Guide to R"
date: "`r Sys.Date()`"
author: "Jacob Kaplan"
bibliography: [book.bib]
biblio-style: apalike
link-citations: yes
colorlinks: yes
description: "A guide to learning R for the purpose of conducting quantitative research. This covers collecting and cleaning data, and visualizing it in graphs and maps."
url: "https://crimebythenumbers.com"
github-repo: "jacobkap/crimebythenumbers"
site: bookdown::bookdown_site
documentclass: krantz
monofont: "Source Code Pro"
monofontoptions: "Scale=0.7"
graphics: yes
---
```{r include=FALSE, cache=FALSE}
library(formatR)
knitr::opts_chunk$set(
  comment = "#",
  collapse = TRUE,
  fig.align = 'center',
  fig.width = 9,
  fig.asp = 0.618,
  fig.show = "hold",
  error = TRUE,
  fig.pos = "!H",
  out.extra = "",
  tidy = "styler",
  out.width = "100%",
  out.height = "45%"
)
options(tidygeocoder.quiet = TRUE)
options(tidygeocoder.verbose = FALSE)
options(readr.show_col_types = FALSE)
detachAllPackages <- function() {
  basic.packages <- c("package:stats",
                      "package:graphics",
                      "package:grDevices",
                      "package:utils",
                      "package:datasets",
                      "package:methods",
                      "package:base")
  package.list <- search()[ifelse(unlist(gregexpr("package:", search())) == 1, TRUE, FALSE)]
  package.list <- setdiff(package.list, basic.packages)
  if (length(package.list) > 0) for (package in package.list) detach(package, character.only = TRUE)
}
```
\pagenumbering{roman}
```{r setup, include=FALSE}
# bookdown::render_book("index.Rmd", "bookdown::pdf_book", output_file = "crimebythenumbers.pdf")
# bookdown::render_book("index.Rmd", "bookdown::gitbook")
# Make code repository
# rmarkdown_files <- list.files(pattern = ".Rmd$")
# rmarkdown_files <- rmarkdown_files[!rmarkdown_files %in% c("author.Rmd",
# "git.Rmd",
# "workflow.Rmd",
# "index.Rmd",
# "collaboration.Rmd",
# "r-markdown.Rmd",
# "_main.Rmd")]
# for (rmarkdown_file in rmarkdown_files) {
# rmarkdown_file_save <- gsub(".Rmd", ".R", rmarkdown_file)
# knitr::purl(rmarkdown_file, output = paste0("code_repository/just_code/", rmarkdown_file_save),
# documentation = 0)
# knitr::purl(rmarkdown_file, output = paste0("code_repository/code_and_text/", rmarkdown_file_save),
# documentation = 2)
# }
# Make package bibliography
# all_packages <- c()
# for (rmarkdown_file in rmarkdown_files) {
# temp <- readr::read_lines(rmarkdown_file)
# temp <- paste(temp, collapse = " ")
# temp <- strsplit(temp, " ")[[1]]
# packages <- temp[grep("library\\(.*\\)", temp)]
# packages <- gsub("library\\(|\\)|`|\\.", "", packages)
# packages <- packages[packages != ""]
# all_packages <- c(all_packages, packages)
# all_packages <- sort(all_packages)
# all_packages <- unique(all_packages)
# all_packages <- all_packages[all_packages != "knitr"]
# }
# all_packages
# knitr::write_bib(all_packages, file = "packages.bib", width = 60)
```
\frontmatter
# Preface {-}
This book introduces the programming language R and is meant for undergrads or graduate students studying criminology. R is a programming language that is well-suited to the type of work frequently done in criminology - taking messy data and turning it into useful information. While R is a useful tool for many fields of study, this book focuses on the skills criminologists should know and uses crime data for the example data sets.
If you would like to purchase a physical copy of this book, it is available from [Amazon](https://www.amazon.com/Criminologists-Guide-Crime-Numbers-Chapman/dp/1032244070/?_encoding=UTF8&pd_rd_w=Ia4RG&content-id=amzn1.sym.e4bd6ac6-9035-4a04-92a6-fc4ad60e09ad&pf_rd_p=e4bd6ac6-9035-4a04-92a6-fc4ad60e09ad&pf_rd_r=G1A0TFSVYNG3GP667X0F&pd_rd_wg=uZ6xj&pd_rd_r=82e214d8-720f-492b-b5b5-df9b6d52d95a&ref_=pd_gw_ci_mcx_mr_hp_atf_m).
For this book you should have the latest version of [R](https://cloud.r-project.org/) installed and be running it through [RStudio Desktop (the free version).](https://www.rstudio.com/products/rstudio/download/) We'll get into detail on what R and RStudio are soon, but please have them both installed to be able to follow along with each chapter. While you must install both, you only ever need to open RStudio. While R is the actual programming language, RStudio is a program that makes it a lot easier to interact with R than opening up the R application itself.^[This is formally known as an "integrated development environment" or an IDE.] I highly recommend following along with the code for each lesson and then trying to use the lessons learned on a data set that you are interested in.
## Why learn to program? {-}
With the exception of some more advanced techniques like scraping data from websites or from PDFs, nearly everything we do here can be done through Excel, software you're probably more familiar with. The basic steps for research projects are generally:
1. Open up a data set - which frequently comes as an Excel file!
2. Change some values - misspellings or too-specific categories for our purposes are very common in crime data
3. Delete some values - such as states you won't be studying
4. Make some graphs
5. Calculate some values - such as number of crimes per year
6. Sometimes do a statistical analysis depending on the type of project
7. Write up what you find
R can do all of this but why should you want (or have) to learn an entirely new skill just to do something you can already do? R is useful for two main reasons: scale and reproducibility.
### Scale {-}
If you do a one-off project in your career such as downloading some data and making a graph out of it, it makes sense to stick with software like Excel. The cost (in time and effort) of learning R is certainly not worth it for a single project (or even several) - even one perfectly suited for using R. R (and many programming languages more generally, such as Python) has its strength in doing something fairly simple many times. For example, it may be quicker to download one file yourself than it is to write the code in R to download that file. But when it comes to downloading hundreds of files, writing the R code very quickly becomes the better option than doing it by hand.
For most tasks you do in research when dealing with data, you will end up doing them many times (including doing the same task in future projects). So R offers the trade-off of spending time upfront by learning the code with the benefit of that code being able to do work at a large scale with little extra work from you. Please keep in mind this trade-off - you need to front-load the costs of learning R for the rewards of making your life easier when dealing with data - when feeling discouraged about the small returns you get early in learning R.
### Reproducibility {-}
The second major benefit of using R over something like Excel is that R is reproducible. Every action you take is written down. This is useful when collaborating with others (including your future self) as they can look at your code and follow along with what you did without you having to show them every click you made as you frequently would in Excel. Your collaborator can look at your code to help you figure out a bug in the code or add their own code to yours.
In the research context specifically, you want to have code to give to people to ensure that your research was done correctly and there aren't bugs in the code. Additionally, if you build a tool to, for example, interpret raw crime data from an agency and turn it into a map, being able to share the code so others can modify it for their own city saves these people a lot of time and effort.
While not required (yet) in criminology, some academic journals (such as in economics) even require that you submit your data and code if your paper is accepted. If criminology follows in this trend, or if you submit to journals that require code submissions, you'll need to be able to write code and not rely on software that doesn't track your steps (such as Excel and SPSS).
## What you will learn {-}
For many of the lessons we will be working through real research questions and working from start to finish as you would on your own project. This involves thinking about what you want to accomplish from the data you have and what steps you need to take to reach that goal. This involves more than just knowing what code to write - it includes figuring out what your data has, whether it can answer the question you're asking, and planning out (without writing any code yet) what you need to do when you start coding. For most lessons we'll be using actual crime data that is commonly used in research so you'll become acquainted with a number of important data sets.
### Skills {-}
There is a large range of skills in criminology research - far too large to cover in a single book. Here we will attempt to teach fundamental skills to build a solid foundation for future work. We'll be focusing on the following skills and trying to reinforce our skills with each lesson.
* Subsetting - Taking only certain rows or columns from a data set
* Graphing
* Regular expressions - Essentially R's "Find and Replace" function for text
* Getting data from websites (webscraping)
* Getting data from PDFs (PDF scraping)
* Mapping
* Writing documents through R
## What you won't learn {-}
This book is not a statistics book so we will not be covering any statistical techniques. Though some data sets we handle are fairly large, this book does not discuss how to deal with Big Data. While the lessons you learn in this book can apply to larger data sets, Big Data (which I tend to define loosely as data that are too large for my computer to handle) requires special skills that are outside the realm of this book. If you do intend to deal with huge data sets I recommend you look at the R package [data.table,](https://github.com/Rdatatable/data.table/wiki) which is an excellent resource for it. While we briefly cover mapping, this book will not cover working with geographic data in detail. For a comprehensive look at geographic data please see this [book.](https://geocompr.robinlovelace.net/) This book also will not cover any qualitative data or analysis. While qualitative research is an important part of criminology, this book only focuses on working with quantitative data. Some parts of this book may apply to dealing with qualitative data, such as PDF scraping and regular expressions, but the examples I use in those chapters still deal with quantitative data.
## Simple vs easy {-}
In the course of this book we will cover things that are very simple. For example, we'll take a data set (think of it like an Excel file) with crime for nearly every police agency in the United States and keep only data from Colorado for a small number of years. We'll then find out how many murders happened in Colorado each year. This is a fairly simple task - it can be expressed in two sentences. You'll find that most of what you do is simple like this - it is quick to talk about what you are doing and the concepts are not complicated. What it isn't is easy. To actually write the R code to do this takes knowing a number of interrelated concepts in R and several lines of code to implement each step.
While this distinction may seem minor, I think it is important for newer programmers to understand that what they are doing may be simple to talk about but hard to implement. When you learn something new in R, or are first introduced to the language, you may feel like you're bashing your head through a brick wall. That is normal. It is easy to feel like a bad programmer because something that can be articulated in 10 seconds may take hours to do. So during times when you are working with R try to keep in mind that even though a project may be simple to articulate, it may be hard to code and that there is often very little correlation between the two.
## How to read this book {-}
This book is written so a person who has no programming experience can start with this chapter and by the end of the book be able to do a data project from start to finish. Each chapter introduces a new skill and builds on the skills introduced in previous chapters. So if you skip ahead you may miss important skills taught in the chapters you didn't read. For someone who has no - or minimal - programming experience, I recommend reading each chapter in order. If you have more programming experience and just want to learn how to do a specific thing, feel free to skip directly to that chapter.
## Citing this book {-}
If this book was useful in your research, please cite it. To cite this book, please use the below citation:
Kaplan J (2022). *Crime by the Numbers: A Criminologist's Guide to R*. https://crimebythenumbers.com/.
BibTeX format:
```bibtex
@Manual{crimebythenumbers,
  title = {Crime by the Numbers: A Criminologist's Guide to R},
  author = {Jacob Kaplan},
  year = {2022},
  url = {https://crimebythenumbers.com/},
}
```
## How to contribute to this book {-}
If you have any questions, suggestions (such as a topic to cover), or find any issues, please make a post on the [Issues page](https://github.com/jacobkap/crimebythenumbers/issues) for this book on GitHub. On this page you can create a new issue (which is basically just a post on this forum) with a title and a longer description of your issue. You'll need a GitHub account to make a post. Posting there lets me track issues and respond to your message or alert you when the issue is closed (i.e. I've finished or denied the request). Issues are also public so you can see if someone has already posted something similar.
For more minor issues like typos or grammar mistakes, you can edit the book directly through its GitHub page. That'll make an update for me to accept, which will change the book to include your edit. To do that, click the edit button at the top of the site - the button is highlighted in the below figure. You will need to make a GitHub account to make edits. When you click on that button you'll be taken to a page that looks like a Word doc where you can make edits. Make any edits you want and then scroll to the bottom of the page. There you can write a short (please, no more than a sentence or two) description of what you've done and then submit the changes for me to review.
```{r, echo = FALSE}
knitr::include_graphics('images/edit_button.PNG')
```
Please only use the above two methods to contribute or make suggestions about the book. Don't email me. While it's a bit more work for you to do it this way, since you'll need to make a GitHub account if you don't already have one, it helps me. I wrote this book, in part, to help my career so having evidence that people read it and are contributing to it is important to me. It's a way to publicly measure the book's impact.
## Where to find data included in this book {-}
To download the data used in this book please see [here.](https://github.com/jacobkap/crimebythenumbers/tree/master/data) Each of the files used in this book is available to download at that link. At the top of every chapter that uses one of these files I'll say exactly which file(s) you need to download. The best way to use this book is to follow along by downloading the data and running the code that I include in each chapter.
## Where to find code included in this book {-}
If you're reading this book through its [website,](https://crimebythenumbers.com) you can easily copy the code by clicking on the "Copy to clipboard" option on the top right of every chunk of code. This button, shown in the image below, will copy all of the code in the chunk and you can then paste (through Control/Command+V) into R.
```{r, echo = FALSE}
knitr::include_graphics('images/copy_code.PNG')
```
I've also made each chapter available to download as an R file that has every line of code used in each chapter available to you to run. To download the files, please go to the book's GitHub page [here.](https://github.com/jacobkap/crimebythenumbers/tree/master/code_repository) I've saved each chapter twice - once where it only includes the code used (in the "just_code" folder) and once where it includes the code and all of the text in the chapter (in the "code_and_text" folder). So download whichever one you want to use. The code is identical in each.
<!--chapter:end:index.Rmd-->
# About the author {-}
**Jacob Kaplan** is a researcher at the Princeton School of Public and International Affairs. He holds a PhD from the University of Pennsylvania.
He is the author of several R packages that make it easier to work with data, including [fastDummies](https://jacobkap.github.io/fastDummies/) and [asciiSetupReader.](https://jacobkap.github.io/asciiSetupReader/) His [website](http://jacobdkaplan.com/) allows easy analysis of crime-related data, and he has released over a [dozen crime data sets](http://jacobdkaplan.com/data.html) that he has compiled, cleaned, and made available to the public. He is also the author of books on the two primary criminal justice data sets: the FBI's [Uniform Crime Reporting (UCR) Program Data](https://ucrbook.com/) and the FBI's [National Incident Based Reporting System (NIBRS)](https://nibrsbook.com/) data.
<!--chapter:end:author.Rmd-->
\mainmatter
# (PART) Introduction {-}
# A soup to nuts project example
Before we get into exactly how to use R, we'll go over a brief example of a kind of data project that you'd do in the real world. For this chapter we'll look at FBI homicide data that you can download [here.](https://github.com/jacobkap/crimebythenumbers/tree/master/data) The file is called "shr_1976_2020.rds".
## Big picture data example
Below is a large chunk of R code along with some comments about what the code does. The purpose of this example is to show that with relatively little code (excluding blank lines and comments, there are only 35 lines of R code here) you can go from opening a data set to making a graph that answers your research question. I don't expect you to understand any of this code as it is fairly complex and involves many different concepts in programming. So if the code is scary - and for many early programmers seeing a bunch of code that you don't understand is scary and overwhelming - feel free to ignore the code itself.
We'll cover each of these skills in turn throughout the book so that by the end of the book you should be able to come back and understand the code (and modify it to meet your own needs). The important thing is that you can see exactly what R can do (and this is only a tiny example of R's flexibility) and think about the process to get there (which we'll talk about below).
At the time of this writing, the FBI had just released 2020 crime data, which showed about a 30% increase in murders relative to 2019. This had led to an explosion of (in my opinion highly premature) explanations of why exactly murder went up so much in 2020. A common explanation is that it is largely driven by gun violence among gang members who are killing each other in a cyclical pattern of murders followed by retaliatory murders. For our coding example, we'll examine that claim by seeing if gang violence did indeed increase, and whether it increased more than other types of murders.
The end result is the graph below. It is, in my opinion, a fairly strong answer to our question. It shows the percent change in murders by the victim-offender relationship from 2019 to 2020. This is using FBI murder data, which technically does have a variable that says if the murder is gang related, but it's a very flawed variable (i.e. it vastly undercounts gang-related murders) so I prefer to use stranger and acquaintance murders as a rough proxy. And we now have an easy-to-read graph that shows that while stranger and acquaintance murders did indeed go up a lot, nearly all relationship groups experienced far more murders in 2020 than in 2019. This suggests that there was a broad increase in murder in 2020, and it was not driven merely by an increase in one or a few groups.
```{r, echo = FALSE}
knitr::include_graphics('images/shr_motivation_example.png')
```
This graph (though converted to a table) was included in an article I contributed to on the site [FiveThirtyEight](https://fivethirtyeight.com/features/murders-spiked-in-2020-how-will-that-change-the-politics-of-crime/) discussing the murder increase in 2020. So this is an actual work product that is used in a major media publication - and is something that you'll be able to do by the end of this book. For nearly all research you do you'll follow the same process as in this example: load data into R, clean it somehow, and create a graph or a table or do a regression on it. While this can range from very simple to very complex depending on your exact situation (and how clean the data is that you start with), all research projects are essentially the same.
Please look at the following large chunk of code. We'll next go through each of the different pieces of this code to start understanding how they work. Throughout the course of this book we'll cover these steps in more detail - as most research programming work follows the same process - so here we'll talk more abstractly about what each does. The goal is for you to understand the basic steps necessary for using R to do research, and to understand how R can do it, without having to understand what each line of code does just yet.
```{r}
library(dplyr) # Used to aggregate data
library(ggplot2) # Used to make the graph
library(crimeutils) # Used to capitalize words in a column
library(tidyr) # Used to reshape the data
# Load in the data
shr <- readRDS("data/shr_1976_2020.rds")
# See which agencies reported in 2019 and 2020
# An "ori" is a unique identifier code for agencies in FBI data
agencies_2019 <- shr$ori[shr$year == 2019]
agencies_2020 <- shr$ori[shr$year == 2020]
# Get which agencies reported in both years so we have an
# apples-to-apples comparison
agencies_in_both <- agencies_2019[agencies_2019 %in% agencies_2020]
# Keep just data from 2019 and 2020 and where the agency
# is one of the agencies chosen above. Also keep only murder and
# nonnegligent manslaughter (so excluding negligent manslaughter).
shr_2019_2020 <- shr[shr$year %in% 2019:2020, ]
shr_2019_2020 <- shr_2019_2020[shr_2019_2020$ori %in%
  agencies_in_both, ]
shr_2019_2020 <- shr_2019_2020[shr_2019_2020$homicide_type %in%
  "murder and nonnegligent manslaughter", ]
# Get the number of murders by victim-offender relationship in 2019 and 2020
# Then find the percent change in murders by this group from 2019 to 2020
# Sort data by smallest to largest percent change
shr_difference <-
  shr_2019_2020 %>%
  group_by(year) %>%
  count(victim_1_relation_to_offender_1) %>%
  spread(year, n) %>%
  mutate(difference = `2020` - `2019`,
         percent_change = difference / `2019` * 100,
         victim_1_relation_to_offender_1 =
           capitalize_words(victim_1_relation_to_offender_1)) %>%
  filter(`2019` >= 50) %>%
  arrange(percent_change)
# This is only for the graph. By default graphs order alphabetically
# but this makes sure it orders it based on the ordering we made above
# (smallest to largest percent change)
shr_difference$victim_1_relation_to_offender_1 <-
  factor(shr_difference$victim_1_relation_to_offender_1,
         levels = shr_difference$victim_1_relation_to_offender_1)
# Makes a barplot showing the percent change from 2019 to 2020 in number
# of murders by victim group. Labels the x-axis and the y-axis, shifts
# the graph so that relationship labels are on the y-axis for easy reading.
# And finally uses the "crim" theme that changes the colors in the graph to
# make it a little easier to see.
ggplot(shr_difference, aes(x = victim_1_relation_to_offender_1,
                           y = percent_change)) +
  geom_bar(stat = "identity") +
  ylab("% Change, 2020 Vs. 2019") +
  xlab("Victim Relative to Murderer") +
  coord_flip() +
  theme_crim()
```
## Little picture data example
We'll now look at each piece of the larger chunk of code above, and I'll explain what it does. There are five different steps that I take to create the graph from the data we use:
1. Load the packages we use
2. Load the data
3. Clean the data
4. Aggregate the data
5. Make the graph
### Loading packages
In R we'll often use code written by other people that contains tools we want to use in our own code. This code comes in packages, and a package is just a collection of other people's code. To use a package we need to tell R that we want to use it. A piece of code written for a specific purpose (e.g. making a graph, doing a very particular cleaning task) is called a function, and each package is a collection of functions. For this example, we're using packages that help us clean, aggregate, and graph data, so we load them here. The general convention is to load each of the packages you want to use at the top of your R file.
```{r}
library(dplyr)
library(ggplot2)
library(crimeutils)
library(tidyr)
```
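As a small illustration of what loading a package changes, here is a sketch using the `tools` package, which ships with R so nothing needs to be installed. Choosing `tools` and its `toTitleCase()` function is my assumption for illustration only; neither is part of this chapter's analysis.

```{r}
# Without loading the package, a function can be called as package::function()
toy_label_a <- tools::toTitleCase("victim relation to offender")

# After library(), the same function can be called by its name alone
library(tools)
toy_label_b <- toTitleCase("victim relation to offender")

# Both ways of calling the function give the same result
identical(toy_label_a, toy_label_b)
```

Loading packages at the top of the file means every function below can be called by name alone, which is why the chapter's code starts with its `library()` calls.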
### Loading data
Next we need to load in our data. The data we're using is a type of R data file called an .Rds file, so we load it using the function `readRDS()`, which is one of the functions built into R, so we don't actually need to use any package for it. For this example, we're using data from the FBI's Supplementary Homicide Reports, which is an annual data set that has relatively detailed information on most (but not all, as not all agencies report data) murders in the United States. This includes the relationship between the victim and the offender (technically the suspected offender) in the murder, which is what we'll look at. When we read the data into R we need to give it a name so R knows what it is called. We'll call this data "shr" since that is the normal abbreviation for the Supplementary Homicide Reports data. Normally in R we use lowercase letters when naming something, which is why we're calling it "shr" rather than "SHR."
Each row of data is actually a murder incident, and there can be up to 11 victims per murder incident. So we'll be undercounting murders as in this example we're only looking at the first victim in an incident. But, as it's an example, this is fine as I don't want it to be too complicated and including more than just the first victim would greatly complicate our code.
```{r}
shr <- readRDS("data/shr_1976_2020.rds")
```
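If you'd like to see reading an .Rds file in isolation, the sketch below saves a tiny made-up data set to a temporary file and reads it back in. The data values and the `toy_` names are hypothetical and exist only for this illustration; `saveRDS()`, `readRDS()`, and `tempfile()` are all built into R.

```{r}
# A tiny made-up data set (not real SHR data), used only for illustration
toy_homicides <- data.frame(
  ori = c("A1", "A1", "B2"),
  year = c(2019, 2020, 2020),
  stringsAsFactors = FALSE
)

# Save it to a temporary .rds file
toy_path <- tempfile(fileext = ".rds")
saveRDS(toy_homicides, toy_path)

# Read the file back in and give the result a name, just as we do with "shr"
toy_reloaded <- readRDS(toy_path)

# Check that what we read in matches what we saved
all.equal(toy_homicides, toy_reloaded)
```

The key point is the `<-` on the `readRDS()` line: without giving the data a name, R would read the file, print it, and then forget it.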
### Cleaning
One of the annoying quirks of dealing with FBI data is that different agencies report each year. So comparing different years has an issue because you'll be doing an apples-to-oranges comparison, as an agency may report one year but not another. So for this data the first thing we need to do is to make sure we're only looking at agencies that reported data in both years. The first few lines check which agencies reported in 2019 and which agencies reported in 2020. We do this by looking at which ORIs (in the "ori" column) are present in each year (as agencies that did not report won't be in the data). An ORI is the FBI term for a unique ID for that agency. Then we make a vector that has only the ORIs that are present in both years.
We then subset the data to only data from 2019 and 2020 and where the agency reported in both years. Subsetting essentially means that we only keep the rows of data that meet those conditions. Another quirk of this data is that it includes homicides that are not murder - namely, negligent manslaughter. So the final subsetting condition we use is that it only includes murder and nonnegligent manslaughter.
```{r}
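# ORIs of the agencies that reported in each year
agencies_2019 <- shr$ori[shr$year == 2019]
agencies_2020 <- shr$ori[shr$year == 2020]
# Keep only the ORIs that are present in both years
agencies_in_both <- agencies_2019[agencies_2019 %in% agencies_2020]
# Subset to 2019 and 2020, then to agencies that reported in both years
shr_2019_2020 <- shr[shr$year %in% 2019:2020,]
shr_2019_2020 <- shr_2019_2020[shr_2019_2020$ori %in% agencies_in_both,]
# Drop homicides that are not murder (negligent manslaughter)
shr_2019_2020 <- shr_2019_2020[shr_2019_2020$homicide_type %in%
                                 "murder and nonnegligent manslaughter",]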
```
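If the subsetting logic is hard to follow on the full data, it may help to see it on a handful of rows. The sketch below uses a made-up data set called `toy_shr` with three invented agencies ("A", "B", and "C"); only agency "A" appears in both years, so only its rows should survive.

```{r}
# Made-up data: agency "A" reports in both years, "B" only in 2019, "C" only in 2020
toy_shr <- data.frame(ori  = c("A", "A", "B", "C", "A"),
                      year = c(2019, 2020, 2019, 2020, 2020))

toy_2019 <- toy_shr$ori[toy_shr$year == 2019]
toy_2020 <- toy_shr$ori[toy_shr$year == 2020]

# Keep only the ORIs found in both years
toy_in_both <- toy_2019[toy_2019 %in% toy_2020]
toy_in_both

# Keep only the rows from agencies that reported in both years
toy_shr[toy_shr$ori %in% toy_in_both, ]
```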
### Aggregating
Now we have only the rows of data that we want. Each row of data is a single murder incident, so we want to aggregate that data to the year-level and see how many murders there were for each victim-offender relationship group. The following chunk of code does that and then finds the percent change from 2019 to 2020. Since we can have large percent changes due to low base rates, we then remove any rows where there were fewer than 50 murders of that victim-offender relationship type in 2019. Finally, we arrange the data from smallest to largest percent change. We'll print out the data just to show you what it looks like.
```{r}
shr_difference <-
  shr_2019_2020 %>%
  group_by(year) %>%
  count(victim_1_relation_to_offender_1) %>%
  spread(year, n) %>%
  mutate(difference = `2020` - `2019`,
         percent_change = difference / `2019` * 100,
         victim_1_relation_to_offender_1 =
           capitalize_words(victim_1_relation_to_offender_1)) %>%
  filter(`2019` >= 50) %>%
  arrange(percent_change)
shr_difference
```
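The percent change math, and the reason for dropping small groups, is easier to see with a few invented numbers. This sketch uses a made-up data set called `toy_counts` (the column names and counts are assumptions for illustration only) and does the same arithmetic without any packages.

```{r}
# Made-up counts for two relationship groups (not real SHR numbers)
toy_counts <- data.frame(relationship = c("Friend", "Stranger"),
                         murders_2019 = c(100, 4),
                         murders_2020 = c(120, 8))

# Percent change is the difference divided by the starting value, times 100
toy_counts$difference     <- toy_counts$murders_2020 - toy_counts$murders_2019
toy_counts$percent_change <- toy_counts$difference / toy_counts$murders_2019 * 100
toy_counts

# Drop groups with fewer than 50 murders in 2019
toy_counts[toy_counts$murders_2019 >= 50, ]
```

The "Stranger" group has a much larger percent change even though it only went up by four murders, which is the low base rate problem that the 50-murder cutoff avoids.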
### Graphing
Once we have our data cleaned and organized in the way we want, we are ready to graph it. By default when R graphs data it will organize it alphabetically. In our case we want it ordered by smallest to largest change in the number of murders between 2019 and 2020 by relationship type. So we first tell R to order it by the relationship type variable, which we've already sorted in the last section of code. Then we use the `ggplot()` function (which is covered extensively in Chapters \@ref(graphing-intro) and \@ref(ois-graphs)) to make our graph. In our code we include the data set we're using, which is the shr_difference data, and the columns we want to graph. Then we tell it we want to create a bar chart and what we want the x-axis and y-axis labels to be. Finally, we have two lines that just affect how the graph looks. All of this is covered in the two graphing chapters, but it takes only several lines of code to go from cleaned data to a beautiful - and informative - graphic.
```{r}
shr_difference$victim_1_relation_to_offender_1 <-
  factor(shr_difference$victim_1_relation_to_offender_1,
         levels = shr_difference$victim_1_relation_to_offender_1)
ggplot(shr_difference, aes(x = victim_1_relation_to_offender_1,
                           y = percent_change)) +
  geom_bar(stat = "identity") +
  ylab("% Change, 2020 Vs. 2019") +
  xlab("Victim Relative to Murderer") +
  coord_flip() +
  theme_crim()
```
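The `factor()` step at the top of that chunk is what controls the order of the bars. Here is a small sketch of what it does, using a made-up data set called `toy_change` (the relationship names and values are invented) that is already sorted from smallest to largest change.

```{r}
# Made-up values, already sorted from smallest to largest change
toy_change <- data.frame(relationship   = c("Wife", "Friend", "Acquaintance"),
                         percent_change = c(-5, 10, 30))

# Without setting levels, the categories are ordered alphabetically
levels(factor(toy_change$relationship))

# Setting levels keeps the order the rows are currently in
levels(factor(toy_change$relationship, levels = toy_change$relationship))
```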
## Reusing and modifying code
One of the main benefits of programming is that once you write code to do one thing, it's usually very easy to adapt it to do a similar thing. Below I've copied some of the code we used above and changed only one thing: instead of looking at the column "victim_1_relation_to_offender_1" we're now looking at the column "offender_1_weapon". That's all I did, everything else is identical. Now after about 30 seconds of copying and changing the column name, we have a graph that shows weapon usage changes from 2019 to 2020 instead of victim-offender relationship.
This is one of the key benefits of programming over something more click intensive like using Excel or SPSS.^[I'm aware that technically you can write SPSS code. However, every single person I know who has ever used SPSS does so by clicking buttons and is afraid of writing code.] There's certainly more upfront work than just clicking buttons, but once we have working code we can very quickly reuse it or modify it slightly.
```{r}
shr_difference <-
  shr_2019_2020 %>%
  group_by(year) %>%
  count(offender_1_weapon) %>%
  spread(year, n) %>%
  mutate(difference = `2020` - `2019`,
         percent_change = difference / `2019` * 100,
         offender_1_weapon = capitalize_words(offender_1_weapon)) %>%
  filter(`2019` >= 50) %>%
  arrange(percent_change)
shr_difference$offender_1_weapon <-
  factor(shr_difference$offender_1_weapon,
         levels = shr_difference$offender_1_weapon)
ggplot(shr_difference, aes(x = offender_1_weapon,
                           y = percent_change)) +
  geom_bar(stat = "identity") +
  ylab("% Change, 2020 Vs. 2019") +
  xlab("Offender Weapon") +
  coord_flip() +
  theme_crim()
```
<!--chapter:end:example-project.Rmd-->
```{r include=FALSE, cache=FALSE}
library(formatR)
knitr::opts_chunk$set(
comment = "#",
collapse = TRUE,
fig.align = 'center',
fig.width = 9,
fig.asp = 0.618,
fig.show = "hold",
error = TRUE,
fig.pos = "!H",
out.extra = "",
tidy = "styler",
out.width = "100%",
out.height= "45%"
)
options(tidygeocoder.quiet = TRUE)
options(tidygeocoder.verbose = FALSE)
options(readr.show_col_types = FALSE)
detachAllPackages <- function() {
basic.packages <- c("package:stats",
"package:graphics",
"package:grDevices",
"package:utils",
"package:datasets",
"package:methods",
"package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0) for (package in package.list) detach(package, character.only=TRUE)
}
```
# Introduction to R and RStudio {#intro-to-r}
In this chapter you'll learn to open a data file in R. That file is "ucr2017.rda," which you'll need to download from the data repository available [here.](https://github.com/jacobkap/crimebythenumbers/tree/master/data)
## Using RStudio
In this lesson we'll start by looking at RStudio then write some brief code to load in some crime data and start exploring it. This lesson will cover code that you won't understand completely yet. That is fine, we'll cover everything in more detail as the lessons progress.
RStudio is the interface we use to work with R. It has a number of features to make it easier for us to work with R. While not strictly necessary to use, most people who use R do so through RStudio. When using R you don't need to open up both R and RStudio on your computer. Just open RStudio and it'll internally use R. We'll spend some time right now looking at RStudio and the options you can change to make it easier to use (and to suit your personal preferences with appearance) as this will make all of the work that we do in this book easier.
When you open up RStudio you'll see four panels, each of which plays an important role in RStudio. Your RStudio may not look like the setup I have in the following image - that is fine, we'll learn how to change the appearance of RStudio soon.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_1.PNG')
```
At the top right of the image (and this may be in a different location on your RStudio) is the Console panel. Here you can write code, hit enter/return, and R will run that code. If you write `2+2` it will return (in this case that just means it will print an answer) 4. This is useful for doing something simple like using R as a calculator or quickly looking at data. In most cases during research this is where you'd do something that you don't care to keep. This is because when you restart R it won't save anything written in the Console. To do reproducible research or to be able to collaborate with others you need a way to keep the code you've written.
The way to keep the code you've written in a file that you can open later or share with someone else is by writing code in an R Script (if you're familiar with Stata, an R Script is just like a .do file). An R Script is essentially a text file (similar to a Word document) where you write code. To run code in an R Script just click on a line of code or highlight several lines and press Ctrl+Enter (Cmd+Return on a Mac) or click the "Run" button on the top right of the Source panel shown in the top left of the above image. You'll see the lines of code run in the Console and any output (if your code has an output) will be shown there too (a plot will be shown in a different panel, as we'll see soon).
For code that you don't want to run, called comments, start the line with a pound sign `#` and that line will not be run (it will still print in the console if you run it but it won't do anything). These comments should explain the code you wrote (if it's not otherwise obvious what the code does).
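As a quick sketch of what comments look like in practice, the chunk below has one line that is only a comment and one line of code with a comment after it. R ignores everything from the `#` to the end of the line.

```{r}
# This whole line is a comment, so R does not run it
2 + 2 # a comment can also come after code on the same line
```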
It is good practice to do all of your code writing in an R Script - even if you delete some lines of code later - as it eliminates the possibility of losing code or forgetting what you wrote. Having all the code in front of you in a text file also makes it easier to understand the flow of code from start to finish for a task - an issue we'll discuss more in later lessons.
While the Source and Console panels are the ones that are of most use, there are two other panels worth discussing. As these two panels let you interchange which tabs are available in them, we'll return to them shortly in the discussion of the options RStudio has to customize it.
### Opening an R Script
When you want to open up a new R Script you can click File on the very top left, then New File, then R Script. It will open up the script in a new tab inside of the Source panel. There are also a number of other file options available: R Presentation, which can make PowerPoints; R Markdown, which can make Word Documents or PDFs that incorporate R code used to make tables or graphs (and which we'll cover in Chapter \@ref(r-markdown)); and Shiny Web App, which can make websites using R. There is too much to cover for an introductory book such as this, but keep in mind the wide capabilities of R if you have another task to do. To open an R Script that is already saved to your computer, click "Open File..." and navigate to the file that you want to open.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_2.PNG')
```
### Setting the working directory
Many research projects incorporate data that someone else (such as the FBI or a local police agency) has put together. In these cases, we need to load the data into R to be able to use it. In a little bit we'll load a data set into R and start working on it, but let's take a step back now and think about how to even load data. First, we'll need to get the data onto our computer somehow, probably by downloading it from an agency's website. Let's be specific - we don't download it to our computer, we download it to a specific folder on our computer (usually defaulted to the Downloads folder on a Windows machine). So let's say you wanted to load a file called "data" into R. If you have a file called "data" in both your Desktop and your Downloads folder, R wouldn't know which one you wanted. And unless your data was in the folder R searches by default (which may not be where the file is downloaded by default), R won't know which file to load.
We need to tell R explicitly which folder has the data to load. We do this by setting the "Working Directory" (or the "Folders where I want you, R, to look for my data" in more simple terms). To set a working directory in R click the Session tab on the top menu, scroll to Set Working Directory, then click Choose Directory. This will open a window where you can navigate to the folder you want.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_3.PNG')
```
After clicking Open in that window you'll see a new line of code in the Console starting with `setwd()` and inside of the parentheses is the route your computer takes to get to the folder you selected. And now R knows which folder to look in for the data you want. It is good form to start your R Script with `setwd()` to make sure you can load the data. Copy the line of code that says `setwd()` (which stands for "set working directory"), including everything in the parentheses, to your R Script when you start working.
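Here is a minimal sketch of setting the working directory with code rather than the menu. Since I don't know what folders are on your computer, it uses `tempdir()`, a temporary folder that R creates for each session, as a stand-in for a real folder such as your Downloads folder. The name `old_directory` is just one I chose for this example.

```{r}
# Remember the current working directory so we can go back to it
old_directory <- getwd()

# tempdir() is only a stand-in for a real folder on your computer
setwd(tempdir())
getwd()

# Return to the original working directory
setwd(old_directory)
```

The `getwd()` function ("get working directory") prints the folder R is currently looking in, which is a handy check when R says it cannot find a file.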
### Changing RStudio
Your RStudio looks different than my RStudio because I changed a number of settings to suit my preferences. To do so yourself click the Tools tab on the top menu and then click Global Options.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_5.PNG')
```
This opens up a window with a number of different tabs to change how R behaves and how it looks.
#### General
Under Workspace in the General tab make sure to **uncheck** the "Restore .RData into workspace at startup" and to set "Save workspace to .RData on exit:" to **Never**. What this does is make sure that every time you open RStudio it starts fresh with no objects (essentially data loaded into R or made in R) from previous sessions. This may be annoying at times, especially when it comes to loading large files, but the benefits far outweigh the costs.
You want your code to run from start to finish without any errors. Something I've seen many students do is write some code in the Console (or in their R Script but out of order of how it should be run) to fix an issue with the data. This means their data is how it should be, but when the R session restarts (such as if the computer restarts) they won't be able to get back to that point. Making sure your code handles everything from start to finish is well worth it to avoid the headache of trying to remember what code you ran to fix the issue previously.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_6.PNG')
```
#### Code
The Code tab lets you specify how you want the code to be displayed. The important section for us is to make sure to check the "Soft-wrap R source files" check-box. If you write a very long line of code it gets too big to view all at once and you must scroll to the right to read it all. That can be annoying as you won't be able to see all the code at once. Setting "Soft-wrap" makes it so if a line is too long it will just be shown on multiple lines, which solves that issue. In practice it is best to avoid long lines of code as they are hard to read, but that isn't always possible.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_7.PNG')
```
##### Saving
Inside of the Code tab we also want to turn on an option to have RStudio automatically save the R script when we aren't using it. This is like how Google Docs automatically saves your document every second or so. While we should be saving our file often (using the little floppy disk icon near the top of RStudio), having RStudio automatically save adds a level of security as it prevents losing a lot of progress if we forget to save and RStudio crashes or we close it.
To set it to autosave, move to the Saving tab, and check the "Automatically save when editor loses focus" box. So if you click out of RStudio or stop typing, it will automatically save. You can also say how long to wait before saving with options ranging from 500 milliseconds to 10,000 milliseconds, which is the same as 0.5 seconds to 10 seconds.
```{r, echo = FALSE}
knitr::include_graphics('images/auto_save.PNG')
```
#### Appearance
The Appearance tab lets you change the background, color, and size of text. Change it to your preferences.
```{r, echo = FALSE}
knitr::include_graphics('images/rstudio_8.PNG')
```
#### Pane Layout
The final tab we'll look at is Pane Layout. This lets you move around the Source, Console, and the other two panels. There are a number of different tabs to select for the panels (unchecking one just moves it to the other panel, it doesn't remove it from RStudio), and we'll talk about three of them. The Environment tab shows every object you load into R or make in R. So if you load a file called "data" you can check the Environment tab. If it is there, you have loaded the file correctly.
As we'll discuss more in Section \@ref(functions-intro), the Help tab will open up to show you a help page for a function you want more information on (we'll also discuss exactly what a function is below, but for now just think of a function as a shortcut to using code that someone else wrote). The Plots tab will display any plot you make. It also keeps all plots you've made (until restarting RStudio) so you can scroll through the plots.
```{r, echo = FALSE, out.height="40%"}
knitr::include_graphics('images/rstudio_9.PNG')
```
### Helpful cheat sheets
RStudio also includes a number of links to helpful cheat sheets for a few important topics. To get to it click Help, then Cheatsheets, and click on whichever one you need.
```{r, echo = FALSE, out.height="40%"}
knitr::include_graphics('images/rstudio_4.PNG')
```
## Assigning variables {#assignment}
When we're using R for research the general process is to load data, change it somehow (such as deleting rows we don't want, aggregating from some small unit such as monthly crime to a higher unit such as yearly crime), and then analyze it. To do all this we need to be able to make sure each step we do actually changes the data. This seems simple but is actually a very common issue I've noticed when working with new R programmers - they run code on the data (e.g. deleting certain rows) but forget to save the change to that data.
Let's look at an example of this. First, we need to know how to create objects in R. I use "object" in a very vague sense to mean anything that is loaded into R and can be manipulated. To create something in R we assign "something" to an object name. This is a very technical sentence so let's look at an example and then step back and try to understand that sentence.
```{r}
a <- 1
```
Above I am creating the object "a" by assigning it the value of 1. In R terms, "a is assigned 1" or "a gets 1". In non-technical terms: a equals 1.
We can print out a to see if this is true.
```{r}
a
```
When we print out a, it returns 1 since that was what a was assigned to. We can assign a another value, and it will overwrite 1 with whatever value we choose.
```{r}
a <- 33
a
```
Now a is 33. Or a equals 33. Or a was assigned 33. Or a gets 33. Or we assigned 33 to a. There are a lot of ways to explain what we did here, which is quite frustrating and confusing to new R programmers. I use the terms "assignment" and "gets" only because that is the convention in R, but if it's easier for you to talk about something equaling something else (instead of being assigned to that value), please do so!
The `<-` is what does the assignment, or what makes the thing on the left equal to the thing on the right. You might be thinking that it'd be easier to simply use the equal sign instead of the `<-` - we are making things equal after all. And you'd be right. Using `=` does the exact same thing as `<-`.
```{r}
a = 13
a
```
We can use `=` instead of `<-` and get the same results (with very few exceptions and none that are relevant in this book). The reason that people use `<-` instead of `=` is largely a matter of convention. It's just the thing that R programmers do so new programmers tend to adopt it. If it's easier for you to use `=` instead of `<-`, feel free to do that.
In this book I'll use `<-` and talk about "assigning" values because that is the convention in R. And while that's not really a good reason to do anything, I think that it's important that new R programmers at least know what the proper conventions are and be able to speak the language (so to speak) of R programmers. This is also important when searching for more help on a topic as you need to know the right term to be able to ask for help (from other R programmers and from Google) easily.
So far we've just been assigning "a" a value, or overwriting that value with a new value. We can also assign something new to have the same value as a. Let's make the object "example_123_value.demonstration" get the value that a has - or in other words make "example_123_value.demonstration" be equal to a.
```{r}
example_123_value.demonstration <- a
example_123_value.demonstration
```
I use the name "example_123_value.demonstration" just as an example of what you can include in an object name - any letter (lowercase or uppercase), any number (the name just can't start with a number), and some punctuation (e.g. underscores and periods). Spaces are not allowed. In practice you'll want to call each object something specific so you know what it is, and ideally make the name as short as possible. For example, if you are using crime data from Houston you'll want to call it something like "houston_crime". The R convention is to only use lowercase characters and include only underscores as the punctuation, but you can name it whatever is most useful to you.
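As a small sketch of these naming rules, the following chunk makes three objects with valid names and shows, as comments, two names that would cause an error. The names and values are invented purely for illustration.

```{r}
# Valid names: letters, numbers, underscores, and periods
houston_crime <- 100
houston.crime.2 <- 200
HoustonCrime3 <- 300

# These would cause errors, so they are written as comments:
# 2houston_crime <- 100   (starts with a number)
# houston crime <- 100    (contains a space)
```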
As noted at the start of the section, a lot of new programmers will make a change to an object but forget to assign the result back into the object (or into a new object). This means that that object won't actually change. For example, let's say we want to multiply example_123_value.demonstration by 10.
If we do `example_123_value.demonstration * 10` then it'll print out the result in the console, but not actually change example_123_value.demonstration. What we need to do is assign that result of the multiplication back into example_123_value.demonstration. Lots of new programmers forget to assign the results back into the object, which understandably leads to lots of confusion since the object is now not what they expect it to be.
```{r}
example_123_value.demonstration <- example_123_value.demonstration * 10
example_123_value.demonstration
```
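To see the difference between printing a result and saving it, here is a short sketch with a new object that I'll call `toy_value` (a name made up for this example). The first multiplication only prints the answer, and the second one assigns the answer back into the object.

```{r}
toy_value <- 5

# This prints the result but does not change toy_value
toy_value * 10
toy_value

# Assigning the result back is what actually changes toy_value
toy_value <- toy_value * 10
toy_value
```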
I've been saying "object" a lot, without defining it. An object is a bit tricky to define, especially at this stage in the book. Throughout this book I'll be using object to describe something that has been assigned a value, such as "a" and "example_123_value.demonstration". This also includes outside data sets read into R, such as an Excel file loaded into R, and even a set of R code that has been assigned to an object (which is called a function). Each object that you have created or loaded yourself can be found in the Environment tab.
## What are functions (and packages)? {#functions-intro}
When programming to do research you'll often have to do the same thing multiple times. For example, many crime data sets are available as one file for each year of data. So if you are analyzing multiple years of data you'll need to clean each file separately - and in most cases that involves using the exact same code for every file. This also includes doing things that other people have done. For example, most research leads to at least one graph being made. Since making graphs is so common, many people have spent a long time writing code to make it easy to make publication-ready graphs. Instead of doing all that work ourselves we can just use code that other people have written and made available to us. While we could do this by copying code, the easiest way to reuse code is to use functions.
As noted in the previous section, a function is a bunch of code (it could range from a single line of code to hundreds of lines) that has been assigned to an object. We'll dive into this topic in detail in Chapter \@ref(functions) - including how to make your own functions - but using functions is such an important concept that we'll briefly introduce them here. Almost everything that you will do in R is through functions. For the most part that'll be using functions that other people have written that are available to use - and this includes functions that are built into R already and ones we have to download from other R programmers.
Let's look at the function `head()` as an example. This is a function that is already built into R which means we don't need to do anything to use it. For functions that are written by other R programmers we'll need to download those functions and tell R we want to use it - and we'll show how in a bit. The way to identify a function is through the parentheses after the function name (the naming convention is the same as for objects as discussed in the previous section. We want a short, descriptive name that explains what the function does). If we see a word followed by parentheses, we can be confident that we're looking at a function.
The `head()` function prints out the first 6 rows of every column of a data.frame (which is essentially an Excel sheet, and something we'll cover in more detail in Chapter \@ref(data-types)). `head()` is an extremely useful and common function in R, but just the name alone doesn't make it clear what it does or that we need to put a data object inside the parentheses.
If you are having trouble understanding what a function does or how to use it, you can ask R for help and it will open up a page explaining what the function does, what options it has, and examples of how to use it. To do so we write `help(function)` or `?function` in the console and it will open up that function's help page. For finding the help page of a function we do not include the parentheses part of the function: `help(head)` works while `help(head())` does not.
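The two ways of asking for help look like this in code. This chunk is set to not run when the book is made, since all it does is open a help page.

```{r, eval = FALSE}
# Both lines open the same help page for head()
help(head)
?head
```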
If we wrote `help(head)` to figure out what the `head()` function does, it will open up this page. Unfortunately, many help pages are not that useful. The following image shows the help page for `head()`, and it is not very friendly to a new R programmer. In cases where the help page is not useful, and you're looking at functions not covered in this book, I recommend looking online for help pages dedicated to that function or broader programming sites such as [Stack Overflow,](https://stackoverflow.com/) where people can ask questions about programming.
```{r, echo = FALSE}
knitr::include_graphics('images/help_page.PNG')
```
For `head()`, all we need to do is tell the function what data we're looking at. In programming terms, the input to the function (what we have to include in the parentheses) is the name of our data object. We'll look at the very commonly used data called `mtcars`. `mtcars` is one of a small number of data files that are already in R when you open it. These are included in R just as examples of data to use when testing our code or teaching people to use R. Just type `mtcars` into the console and it will print out data to the console; there's nothing you need to do to load the data into R. `mtcars` has info about a number of cars with each row being a type of car and each column being information about the car such as the miles per gallon it gets and how many gears it has.
We'll use the `head()` function to print out just the first 6 rows of the `mtcars` data.
```{r}
head(mtcars)
```
Now we have the first 6 rows of every column from the `mtcars` data. This is a fairly simple function and is useful for quickly looking at our data. Many functions are more complicated than `head()` and involve multiple inputs rather than just the single input we had here. Some functions, for example, let you choose how you want the function to operate, as it can do so in multiple ways. Even in `head()` there's an optional input to choose how many rows you want it to return, with the default being 6. Since we didn't choose anything, the function stuck to the default and returned only 6 rows.
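As a quick sketch of that optional input, which is called `n` in `head()`, here we ask for only the first 3 rows of `mtcars` instead of the default 6.

```{r}
# Ask for the first 3 rows instead of the default 6
head(mtcars, n = 3)
```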
Throughout this book we'll spend a lot of time introducing functions that other people have made and learning how to combine the functions together to be able to get our raw data (e.g. a CSV file downloaded from a police site) into a usable format for research (e.g. cleaned to include only the rows and columns we need to analyze and in the units we want). For functions that other people wrote, we need to tell R that we want to use these functions. We do so by having R download that person's package. A package is just the name for a collection of functions in an easily downloadable format. We can do all of the downloading through R, so we don't have to go searching for them. There are two ways to download a package in R: through writing R code or through a shortcut in RStudio.
Downloading a package through R code uses - like pretty much everything else in R - a function. This function is `install.packages()`, where we put the name of the package we want in the (). This name also has to be in quotes since it is an object that is not currently in R. Let's install the package "caesar", which is a simple package I made that creates a Caesar cipher from some text. We need to run the code `install.packages("caesar")` and be sure to spell "caesar" right and put it in quotes.
```{r, eval = FALSE}
install.packages("caesar")
```
The RStudio shortcut way is to go to the Packages tab and then click Install on the top left of this tab. This will open up a window as shown in the following image where you can enter the name of the package you want. Then click Install and RStudio will install it for you. Also in this tab is the Update button, which allows you to update packages that you have already installed. Since R programmers generally provide updates to their packages (usually bug fixes but occasionally new features and new functions), it's important to update your packages every several months or so.
```{r, echo = FALSE}
knitr::include_graphics('images/install_packages.PNG')
```
Once we have downloaded the package, we need to tell R that we want to use that package. There are thousands of R packages and you'll likely have hundreds downloaded before long (if a package relies on other packages to work it'll download those too. So even if you install a single package it may also install other packages necessary for the package you want). Some packages have functions with the same name (but they do different things) so using all packages at once will cause issues since we won't know which functions we're actually using. So we only want to use the packages we need for that task. We need a way to tell R that we want to use a package. We only need to do this once per session - that is, once before restarting RStudio. The way to do this is to use the function `library()`, where we put the package name in the parentheses. Since the package is something that has been installed to R, we don't need to have quotes around the name.
```{r}
library(caesar)
```
Now we can run the `caesar()` function and make a Caesar cipher for that text (it's just a coincidence that the function name is the same as the package name).
```{r}
caesar("example text")
```
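Since different packages can have functions with the same name, it can help to know that R also lets you say exactly which package a function comes from by writing the package name, two colons, and then the function name. The sketch below uses `median()` from the `stats` package, which comes with R, only because I can be sure you have it installed; the numbers are made up.

```{r}
# median() is in the stats package, which comes with R
stats::median(c(1, 5, 9))
```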
## Reading data into R
For many research projects you'll have data produced by some outside group (e.g. FBI, local police agencies) and you want to take that data and put it inside R to work on it. We call that reading data into R. R is capable of reading a number of different formats of data, which we will discuss in more detail in Chapter \@ref(reading-and-writing-data). Here, we will talk about the standard R data file only.
### Loading data {#loading-data-intro}
As we learned in Section \@ref(setting-the-working-directory), we need to set our working directory to the folder where the data is. For my own setup, R is already defaulted to the folder with this data so I do not need to set a working directory. For those following along on your own computer, make sure to set your working directory now.
The `load()` function lets us load data already in the R format. These files will end in the extension ".rda" or sometimes ".Rda" or ".RData". Since we are telling R to load a specific file, we need to have that file name in quotes and include the file extension ".rda". With .rda data, the object inside the .rda file already has a name so we don't need to assign a name to the data. With other forms of data such as .csv files, we will need to do that as we'll see in Chapter \@ref(reading-and-writing-data).
In this example (and elsewhere in this book when I load in data), I have all of the data in a folder called "data" in my working directory, which is why I have "data/" before the data name. You do not need this as you should have all of your data directly in your working directory.
```{r}
load("data/ucr2017.rda")
```
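To see that an .rda file really does carry the object's name with it, here is a minimal sketch that saves an object, removes it, and loads it back. The object name `example_scores` and the temporary file are made up for this illustration and are not part of the UCR data.

```{r}
# Hypothetical object and file, used only for this illustration
example_scores <- c(10, 20, 30)
example_file <- tempfile(fileext = ".rda")

# Write the object to an .rda file, then remove it from R
save(example_scores, file = example_file)
rm(example_scores)
exists("example_scores") # FALSE, the object is gone

# load() brings it back under its original name, with no `<-` needed
load(example_file)
example_scores
```

Notice that we never assigned the result of `load()` to anything. The object simply reappears with the name it had when it was saved.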
## First steps to exploring data
The object we loaded is called `ucr2017`. We'll explore this data more thoroughly in Chapter \@ref(explore), but for now let's use four simple (and important) functions to get a sense of what the data holds. To use each of these functions, we need to write the name of the data set (without quotes since we don't need quotes for an object already made in R) inside the ().
* `head()`
* `summary()`
* `plot()`
* `View()`
Note that the first three functions are lowercase while `View()` is capitalized. That is simply a quirk of how R's functions were named, as capitalization is not fully consistent across functions. R is case sensitive so using `view()` will not work.
The `head()` function prints the first 6 rows of each column of the data to the console. This is useful to get a quick glance at the data but has some important drawbacks. When using data with a large number of columns it can quickly become overwhelming because it prints so much. There may also be differences between the first 6 rows and the other rows. For example, if the rows are ordered chronologically (as is the case with most crime data), the first 6 rows will all come from one end of the time period, such as the most recent dates when the newest rows are listed first. If data collection methods or the quality of collection changed over time, these 6 rows won't be representative of the data.
```{r}
head(ucr2017)
```
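If 6 rows is too many or too few, `head()` takes an optional `n` argument, and `tail()` shows the last rows instead of the first. The sketch below uses `mtcars`, a small example data set that comes with R, rather than the UCR data.

```{r}
# mtcars is a small example data set that comes with R
head(mtcars, n = 3) # first 3 rows instead of the default 6
tail(mtcars, n = 3) # last 3 rows
nrow(mtcars)        # total number of rows, for context
```

Looking at both ends of the data is a quick, if rough, way to check whether the first rows look different from the rest.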
The `summary()` function gives a six-number summary of each numeric or Date column in the data. For other types of data, such as "character" types (which are just columns with words rather than numbers or dates), it'll say what type of data it is. We'll cover different types of data in Chapter \@ref(data-types).
The six values it returns for numeric and Date columns are
+ The minimum value
+ The value at the 1st quartile
+ The median value
+ The mean value
+ The value at the 3rd quartile
+ The max value
In cases where there are NAs, it will say how many NAs there are. An NA value is a missing value. Think of it like an empty cell in an Excel file. NA values will cause issues when doing math, such as finding the mean of a column, as R doesn't know how to handle an NA value in these situations, though `summary()` automatically excludes NAs when doing the math operations.
```{r}
summary(ucr2017)
```
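Here is a minimal sketch of how a single NA affects math. The values in `example_counts` are made up for this illustration; `na.rm = TRUE` is the standard argument that tells `mean()` to drop NAs first.

```{r}
# Hypothetical values with one missing
example_counts <- c(4, 8, NA, 12)

mean(example_counts)               # returns NA because of the missing value
mean(example_counts, na.rm = TRUE) # drops the NA first, returns 8
summary(example_counts)            # excludes the NA and reports how many there are
```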
The `plot()` function allows us to graph our data. For criminology research we generally want to make scatterplots to show the relationship between two numeric variables, time-series graphs to see how a variable (or variables) changes over time, or barplots comparing categorical variables. Here, we'll make a scatterplot showing the relationship between a city's number of murders and its number of aggravated assaults (assault with a weapon or that causes serious bodily injury).
To do so we must specify which column is displayed on the x-axis and which one is displayed on the y-axis. In Section \@ref(select-specific-columns) we'll talk explicitly about how to select specific columns from our data. For now, all you need to know is that to select a column you write the data set name followed by a dollar sign `$`, followed by the column name. Do not include any quotation marks or spaces (technically spaces can be included but they make the code a bit harder to read and are against conventional style when writing R code so we'll exclude them). Inside of `plot()` we say that "x = ucr2017\$actual_murder" so that column goes on the x-axis and "y = ucr2017\$actual_assault_aggravated" so aggravated assault goes on the y-axis. And that's all it takes to make a simple graph.
```{r}
plot(x = ucr2017$actual_murder, y = ucr2017$actual_assault_aggravated)
```
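The dollar sign syntax used inside `plot()` works on its own too. As a small sketch, here we pull a single column out of `mtcars` (an example data set that comes with R, used here so the code runs without the UCR file) and look at it directly. The name `mpg_column` is just for this illustration.

```{r}
# mtcars comes with R; mpg is one of its columns
mpg_column <- mtcars$mpg

head(mpg_column)   # first 6 values of that single column
length(mpg_column) # one value per row of mtcars
```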
Finally, `View()` opens the data set you put inside the parentheses in a spreadsheet-style viewer that looks much like an Excel file. This allows you to look at the data as if it were in Excel (though you can't edit the data here) and is a good way to start to understand the data.
```{r eval = FALSE}
View(ucr2017)
```
```{r, echo = FALSE}
knitr::include_graphics('images/view_example.PNG')
```
<!--chapter:end:intro-to-r.Rmd-->
```{r include=FALSE, cache=FALSE}
library(formatR)
knitr::opts_chunk$set(
comment = "#",
collapse = TRUE,
fig.align = 'center',
fig.width = 9,
fig.asp = 0.618,
fig.show = "hold",
error = TRUE,
fig.pos = "!H",
out.extra = "",
tidy = "styler",
out.width = "100%",
out.height= "45%"
)
options(tidygeocoder.quiet = TRUE)
options(tidygeocoder.verbose = FALSE)
options(readr.show_col_types = FALSE)
detachAllPackages <- function() {
basic.packages <- c("package:stats",
"package:graphics",
"package:grDevices",
"package:utils",
"package:datasets",
"package:methods",
"package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0) for (package in package.list) detach(package, character.only=TRUE)
}
```
# Data types and structures {#data-types}
## Data types {#section-data-types}
When you read a sentence like "two plus two" you know the answer is four. R doesn't know that. This is because R takes things very literally. It will read "two" as a word, not as a number. For R to understand numbers you need to specify that you're talking about numbers, and not just words. Let's look at an example, making two variables that each have the value of "2" (in quotes).
```{r}
a <- "2"
b <- "2"
```
We now have a and b that are equal to "2" (in quotes!). Let's try to add them.
```{r, error = TRUE}
a + b
```
We get an error that is a technical way of saying that we did math on something that isn't a number. That's because we made a and b get "2" with quotes around it, which R interpreted as a word, not as a number. If we change a and b to 2 (without quotes), then R will know that the 2 is a number, and will do math on it.
```{r}
a <- 2
b <- 2
a + b
```
This may seem like a pretty simple concept but is fundamental to how R works, and can trip up new and experienced programmers alike. R trusts you. It only knows what you tell it. If you tell it that something is a word (by including quotes), it will treat it as a word, even if it looks to you like a number. So we must be very precise about what code we write, as R won't (for the most part) fix our mistakes - though it will give us an error if we try to do something it doesn't like, like add two words.
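If a number does end up stored as a word, R can convert it, but only when the text actually looks like a number. The sketch below uses the base R function `as.numeric()`; the object name `text_two` is made up for this illustration.

```{r}
# A character value, not a number (hypothetical name)
text_two <- "2"

# Converting it to numeric first lets us do math on it
as.numeric(text_two) + 2 # returns 4

# R takes things literally: the word "two" cannot be converted,
# so we get NA along with a warning
as.numeric("two")
```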
## Numeric, character, and logical (boolean)
There are three main data types that are important to know for using R to do research: numeric, character, and logical.
A numeric type is a number, and this includes both integers like 2 and decimals like 2.5. You can tell something is numeric if it is a number and there are no quotes around it. 2 is a number, "2" is not. For real data this will likely be something like the age of an individual or the number of crimes in a city. We want it as numeric type because we can do math on numbers. For example, we can find the average age of victims of crimes, or the median number of crimes in a city each week. This won't work unless R knows that these values are numbers.
A character is just a word or a set of words. If it is in quotes it's a character. Other programming languages generally call this a string instead of a character, but they mean the same thing. Pretty much anything that you'd write in English class fits in here.
Finally, a logical data type is just a true or false value, though in R it must be written all in capital letters: TRUE or FALSE. This is also referred to as a Boolean value. Booleans or logical data are useful when comparing two things. For example, we can see if 2 is equal to 3.
```{r}
2 == 3
```
It's not, so R returned FALSE (the == just compares the thing on the left to the thing on the right). This is very useful when we want to keep only certain rows in our data. For example, if we had data on multiple years of crime and we only wanted to keep a single year (let's say 2020), we could tell R to keep only rows where the year equals 2020 - where it is TRUE that that row's year column is equal to what year we want. We'll cover this in great detail in Chapter \@ref(subsetting-intro).
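As a rough preview of that idea, here is a sketch comparing several years at once. The values in `example_years` are made up, and the `c()` function used to group them is explained later in this chapter.

```{r}
# Hypothetical years, grouped together with c()
example_years <- c(2018, 2019, 2020, 2020)

# The comparison returns one TRUE or FALSE per value
example_years == 2020

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(example_years == 2020) # returns 2
```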
While you could try to figure out what type of data something is just by looking at it, R has a number of functions to check for you. We'll look at a few general functions that tell you the type of data something is, and then ones that check if the data is a specific type.
First, the `is()` function tells you all of the types of data something is - and a value can actually have multiple types. While it can't be both, for example, numeric and character, it can have other data types that we'll look at in the next section. First, let's look at what `is()` returns (prints out to the console) for a few simple examples.
```{r}
is(2)
```
Checking what 2 is tells us that it is both a "numeric" type and a "vector" type.
```{r}
is("2")
```
Checking "2" (in quotes) gives us four different types of data for this value: "character", "vector", "data.frameRowLabels", and "SuperClassMethod". You can ignore the last two types; we are just interested in the fact that it is a "character" type and, like the type of 2, is a "vector".
```{r}
is(TRUE)
```
Finally, checking what TRUE is returns both "logical" and "vector". We expected logical since TRUE is a logical type. Again, we see that it is also a vector type. TRUE has to be both in capital letters and not be in quotes. If we write it in quotes then R will think it is a character, and if we have it lowercase and without quotes R will think that it is an object (such as something we make using `<-`) and not a Boolean.
```{r}
is("TRUE")
```
```{r, error = TRUE}
is(true)
```
All three of the values we checked say that they are a "vector" type. We'll cover vectors in the next section, but for now let's see one other function that tells us the type of data something is. If we use `class()` instead of `is()`, then for simple values like these we'll get just the first of the types that `is()` returns.
```{r}
class(2)
class("2")
class(TRUE)
```
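We can check that relationship directly for a simple value. In this sketch, the `[1]` after `is(2)` keeps only the first type that `is()` lists (this kind of selection is covered later in the book), and `identical()` is a base R function that tells us whether two things are exactly the same.

```{r}
# is() comes from the methods package, which is usually attached already;
# it is loaded here only so the example is self-contained
library(methods)

is(2)[1] # the first type that is() lists
class(2)
identical(is(2)[1], class(2)) # TRUE
```

This holds for the simple values used in this chapter. It is not a promise about every kind of object in R.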
In a lot of cases we'll want to check if some data is a specific type. For example, we might want to check that the year column of a data set is numeric, rather than say character. We do this with three functions, each of which checks that the data input (the data put in the parentheses of the function) is that type of data or not. These functions are: `is.numeric()`, `is.character()`, and `is.logical()`.
Running any of these functions will actually return a logical value, either TRUE or FALSE telling us if the value inputted is that type.
```{r}
is.numeric(2)
is.character("2")
is.character(2)
is.logical(TRUE)
```
So far we've just been checking the value of a single thing: a single number, a single character/string, or a single logical/Boolean value. In practice almost everything we do will be on a column of a data set. These functions still work in the exact same way. We input the column (using the `data$column` syntax discussed in Chapter \@ref(intro-to-r) to specify which data set we want and specifically which column in that data set) and the function will behave just like it did above. That's because each column can only be a single type of data; if the column is numeric, all values will be numeric; if the column is character, all values in that column are character; if the column is logical, every value in that column is also logical.
Let's use the UCR data from 2017 that was introduced in Chapter \@ref(intro-to-r). Remember that the data must be in your working directory to load it. And here I have "data/" before the data name because the data is in a folder called "data" in my working directory. For more on working directories, please see Section \@ref(setting-the-working-directory).
```{r}
load("data/ucr2017.rda")
```
We need to know the column names before using them, so we can use the `names()` function to get a list of all of the column names (the `colnames()` function does the same thing).
```{r}
names(ucr2017)
```
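For readers without the UCR file at hand, the same check can be sketched on `mtcars`, an example data set that comes with R. It also confirms that `names()` and `colnames()` give the same result for a data.frame.

```{r}
names(mtcars)

# For a data.frame, names() and colnames() return the same thing
identical(names(mtcars), colnames(mtcars)) # TRUE

# The number of names equals the number of columns
length(names(mtcars)) # returns 11
```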
Now we can check the types of some of the columns. Let's check the year column as an example. A year is a number so we may expect it to be numeric, but there's technically nothing stopping that data from being character type. It can't be logical type because then instead of a year value it'd just be TRUE or FALSE, which is certainly not what a year is.
```{r}
is(ucr2017$year)
```
And we can use `is.numeric()` as another way to see if this column is numeric.
```{r}
is.numeric(ucr2017$year)
```
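To see why this check is worth doing, here is a small made-up data set where the year was stored as character instead of numeric. The name `example_crimes` and its values are hypothetical, and `data.frame()` is used here only to build the illustration.

```{r}
# Hypothetical data where year was stored with quotes around it
example_crimes <- data.frame(
  year = c("2016", "2017"),
  murders = c(10, 12),
  stringsAsFactors = FALSE # keep year as character in older versions of R
)

is.numeric(example_crimes$year)    # FALSE, the quotes made it character
is.character(example_crimes$year)  # TRUE
is.numeric(example_crimes$murders) # TRUE
```

The year column looks like numbers when printed, which is exactly why checking its type is safer than judging by eye.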
## Data structures
We'll look in detail at two important data structures - vectors and data.frames - and then talk briefly about two other structures that are not that important in this book, but are nonetheless good to know about. So far we've just been looking at either a single value, such as `a <- 1`, or more complicated structures such as the ucr2017 data set, which is called a data.frame - R's version of an Excel file. Data structures each operate a little differently from each other so it's good to understand what they are and how they work. We'll cover much more of how they work in Chapter \@ref(subsetting-intro), which covers how to subset data - which is just how to keep only certain values (such as specific rows or columns) in the data.
### Vectors (collections of "things") {#vectors}
The first data structure we'll discuss is a vector. A vector is a collection of values of the same type (numeric, character, logical, Date) in a single object. When we made "a" in Chapter \@ref(intro-to-r), we assigned it only a single value, such as `a <- 1`. Usually we'll want to have a group of values - such as a set of years or a group of crime types - rather than just a single value. We can do this by using the same assignment method as `a <- 1` but put all of the values we want to assign to a into the function `c()` and separate each value by a comma. The `c()` function **c**ombines each value together into a single vector.
Now, technically a single value, such as an object called "a" that equals 1, is still a vector. In this case it'd be a vector of length 1, since there is only one value in it. But generally when we talk about vectors we mean ones with multiple elements.
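We can confirm this with the base R functions `length()` and `is.vector()`. The object name `single_value` is made up for this sketch so that it doesn't overwrite the object a.

```{r}
# A single value, under a hypothetical name
single_value <- 1

length(single_value)        # 1, a vector with one element
is.vector(single_value)     # TRUE
length(c(2015, 2016, 2017)) # 3, one for each value inside c()
```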
Here's an example of making the object a be a vector with three values: 1, 2, and 3 (in that order).
```{r}
a <- c(1, 2, 3)
```