Also see the R output. These notes are just a rough outline of what we discussed in class; some topics may be missing.
Every object in R has a type that answers the question, "What is it?" For
instance, the type of the number 5.1 is double, which stands for
double-precision and means it's a decimal number.
Types are useful for figuring out what kind of data is in a data set. Some
functions only work with specific types. For example, it doesn't make sense to
multiply the value "hello" by something. Understanding types can help you
detect bugs in your R code.
You can print the type of an object with the typeof() function:
# Logical and numeric values
typeof(TRUE)                # "logical"
typeof(5.0)                 # "double"
typeof(3+1i)                # "complex"
# Text
typeof("I CAN HAZ DATA?")   # "character"
# Function
typeof(cos)                 # "builtin"
Text values have the type character. Programmers usually call text values
strings, because they're a string (or sequence) of characters.
In R, the elements of a vector must all have the same type, and the type of a
vector is the same as the type of its elements. So the values 2.1 and
c(2.1, 2.4, 3.6) both have type double.
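You can check this rule yourself by comparing a single number with a vector of numbers. The sketch below also shows what happens if you try to mix types inside c(). That conversion isn't covered in these notes, so treat the last part as a side note rather than something we discussed.

```r
# A single number and a vector of numbers have the same type
typeof(2.1)                # "double"
typeof(c(2.1, 2.4, 3.6))   # "double"

# Mixing types in c() does not produce a mixed vector;
# R converts every element to one common type instead
mixed = c(2.1, "hello")
typeof(mixed)              # "character"
```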
You might want to store a bunch of values with different types in a single
variable. You can do this with a list. A list holds multiple elements, like a
vector, but each element can have any type. To create a list, use the list()
function:
x = list("GR", 8, 4, "U")
The type of a list is list. You can create complicated data structures by
putting lists inside of each other.
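Here's a small sketch of both ideas: a list whose elements keep their own types, and a list nested inside another list. The double-bracket syntax [[ ]] for pulling out one element isn't introduced in these notes, and the course list with its names and values is made up for illustration.

```r
# A list can hold elements of different types
x = list("GR", 8, 4, "U")
typeof(x)        # "list"
length(x)        # 4

# Each element keeps its own type ([[ ]] extracts a single element)
typeof(x[[1]])   # "character"
typeof(x[[2]])   # "double"

# Lists can be nested; this structure is hypothetical
course = list(code = "STS 98", units = 4, grades = list("A", "B+"))
typeof(course$grades)   # "list"
```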
Every value in R has one or more classes, which answer the question "What
does it do?" The number 5.1 has class numeric, which means that 5.1 acts
like a number.
Like types, classes are useful for figuring out what kind of data is in a data set, as well as for catching bugs.
You can check the class of an object with the class() function:
# Logical and numeric values
class(TRUE)              # "logical"
class(5.0)               # "numeric"
class(3+1i)              # "complex"
# Text
class("MOAR DATA PLZ")   # "character"
# Function
class(cos)               # "function"
Often an object's type and class are the same. Later on, we'll see some objects where the type and class are different.
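To make the comparison concrete, you can call typeof() and class() on the same value and put the answers side by side. This is just a quick check using values already seen above, not an exhaustive rule.

```r
# Type and class agree for text
typeof("hi")   # "character"
class("hi")    # "character"

# For a decimal number the names differ
typeof(5.1)    # "double"
class(5.1)     # "numeric"

# For the built-in function cos they differ as well
typeof(cos)    # "builtin"
class(cos)     # "function"
```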
In the previous lecture, we saw how to get around on the command line by changing the working directory. R also has a working directory, but the commands are different.
To view the working directory (compare to pwd):
getwd()
To list the files in the working directory (compare to ls):
list.files()
To change the working directory to the folder foo (compare to cd foo):
setwd("foo")
When you start working on a project, it's a good idea to change R's working directory to the folder that contains your data.
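The three commands can be tried together without touching your own files. In this sketch, R's temporary folder (from tempdir()) stands in for a project folder only because it's sure to exist. The variable name old is just a placeholder.

```r
# Remember the current working directory so we can return to it
old = getwd()

# Move to R's temporary folder (a stand-in for your data folder)
setwd(tempdir())
getwd()        # now prints the temporary folder's path
list.files()   # lists whatever files are in that folder

# Go back to where we started
setwd(old)
```

Saving the old directory first is a handy habit, since setwd() changes the directory for the rest of your R session.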
There are lots of different file formats for storing data.
Extensions help identify the format of a file. An extension is a dot followed
by a few letters (usually 3) at the end of a file's name. For example, a file
named foo.txt has the extension ".txt", which means it's a plain text file.
Some common extensions for data are:
.rds: R data (one data set)
.rda: R data (one or more data sets)
.csv: comma-separated values
.tsv: tab-separated values
The R data formats are binary, which means the data is stored in a way that only certain programs understand (in this case, R). If you open binary files in a text editor, you'll see a bunch of nonsense, because your editor has no idea how to make sense of the file.
On the other hand, the CSV and TSV formats are plain text. Columns are separated by commas or tabs, and each row is put on a separate line. If you open these files in your text editor, you'll see the data as if someone had typed it in.
To load data into R, you need to choose the right function for the job! Check the file format of the data to decide. A few of the R functions for loading data are:
load(): RDA files
readRDS(): RDS files
read.csv(): CSV files
read.delim(): TSV files
read.table(): other plain text tables
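To see how the reader has to match the format, here's a sketch that saves one tiny data set as both CSV and RDS and then loads each copy back. The data frame d and its values are made up. The writing functions write.csv() and saveRDS() aren't covered in these notes, and the files go to temporary paths so nothing in your project is touched.

```r
# A tiny made-up data set to save and reload
d = data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Temporary file paths, one per format
csv_path = tempfile(fileext = ".csv")
rds_path = tempfile(fileext = ".rds")

# Save the same data in both formats
write.csv(d, csv_path, row.names = FALSE)
saveRDS(d, rds_path)

# Pick the reader that matches each format
from_csv = read.csv(csv_path)
from_rds = readRDS(rds_path)

dim(from_csv)            # 3 2
all.equal(d, from_rds)   # TRUE
```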
The CSV format is very popular for online data, because you don't need any special programs to view it. We'll mainly use RDS and CSV in this class.
Let's use R to get to know the class. Load the file sts98.rds:
x = readRDS("sts98.rds")
This file has data on the year and major of every student enrolled in the class.
When you load data you've never seen before, you should start by trying to
figure out how it's organized. To peek at the top of a data set, use the
function head(). For instance,
head(x)
shows that the student data is a table with columns "Class", "Major", and
"Dept". You could've printed out the whole data set by typing its name (x),
but if there are a lot of rows, they'd go whizzing by as R prints them all.
It's easier and faster to peek using head(). If you'd rather see the bottom
of the data set instead, you can use tail():
tail(x)
Both functions accept a second argument that says how many rows to print out. So
head(x, 10)
would print the first 10 rows.
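If you don't have sts98.rds handy, you can try head() and tail() on any data frame. The students data frame below is a hypothetical stand-in with made-up columns, not the real class data.

```r
# Hypothetical stand-in data: 26 rows, one per letter
students = data.frame(id = 1:26, initial = LETTERS)

head(students)       # first 6 rows (the default)
tail(students)       # last 6 rows
head(students, 10)   # first 10 rows
tail(students, 3)    # last 3 rows
```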
You can check the number of rows in a data set with the nrow() function:
nrow(x)
Similarly, you can check the number of columns with ncol(). You can even
check both at the same time with the dim() function (short for "dimensions"):
dim(x)
Only tabular data has dimensions. If you call dim() on a vector,
dim(3:4)
you get NULL, which means there's no value. Still, vectors do have length,
and there's a function to check their length:
length(3:4)
So there are several different ways to check the "size" of an object in R.
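Here are the size functions side by side, again on a hypothetical stand-in data frame with made-up columns rather than the real class data.

```r
# Hypothetical stand-in data: 26 rows, 2 columns
students = data.frame(id = 1:26, initial = LETTERS)

nrow(students)     # 26
ncol(students)     # 2
dim(students)      # 26 2

# A plain vector has a length but no dimensions
dim(3:4)           # NULL
length(3:4)        # 2

# On a data frame, length() counts columns, not rows
length(students)   # 2
```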
The columns of a data frame can have names. You can get the names with the
colnames() function:
colnames(x)
Similarly, there's a rownames() function to get row names. Finally, there's a
names() function, which gets the name on each element of an object. If you
run
names(x)
you get back the column names. This is a hint that the "elements" of the data frame are its columns.
The type of the STS 98 data set is list:
typeof(x)
and the class is data.frame:
class(x)
In R, a data frame is just a table of data. Each row is one observation (in
this case, one student). Each column is a variable, something measured about
the students. Each column has its own type, and the column types might be
different, so under the hood R uses a list to store the data frame. The columns
are the elements of the list. This is also why names() gives us the column
names.
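You can see the "data frame is a list of columns" idea on a small example. The students data frame here is a hypothetical stand-in with made-up values, including a Units column that isn't in the real class data, and is.list() is a function these notes don't otherwise cover.

```r
# Hypothetical stand-in for the class data (values are made up)
students = data.frame(
  Class = factor(c("Senior", "Junior", "Senior")),
  Units = c(4, 4, 2)
)

typeof(students)     # "list"
class(students)      # "data.frame"
is.list(students)    # TRUE

# names() and colnames() agree because the list elements are the columns
names(students)      # "Class" "Units"
colnames(students)   # "Class" "Units"

# Each column keeps its own class
class(students$Class)   # "factor"
class(students$Units)   # "numeric"
```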
You can learn more about the columns of a data frame with the str() function
(this stands for "structure").
str(x)
This tells us there are 3 columns, all of which are "Factors". A factor is just data that falls into specific categories. For instance, a student's class can be "Freshman", "Sophomore", and so on. This is different from a numerical measurement, say height, where there aren't specific categories, just a range of values.
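To get a feel for factors, you can build one by hand. In this sketch the class levels are made up, and factor(), levels(), and nlevels() are functions these notes don't otherwise cover.

```r
# Made-up class levels for five students
year = factor(c("Senior", "Junior", "Senior", "Freshman", "Junior"))

class(year)     # "factor"
levels(year)    # "Freshman" "Junior" "Senior" (alphabetical by default)
nlevels(year)   # 3

# str() gives a compact description of the factor
str(year)
```

The levels are the specific categories the data can fall into, which is what sets a factor apart from a numerical measurement.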
We can use R to get some idea of what kinds of students are in the class. We could just print out the whole data frame and try to eyeball it, but that's missing the whole point of R! We can grab just the column with information about people's year:
x$Class
The $ tells R that we want to get a specific column by name. This is still
pretty messy to look at, though. Let's have R count the number of students in
each class level:
table(x$Class)
The table() function computes a table of counts. In this case, we can see
that most of the class is seniors and juniors.
We can do the same thing to get an idea of the majors:
table(x$Major)
It's a little hard to see which major has the most students. We can fix that by sorting the table:
sort(table(x$Major))
Now it's easy to see that most of the class is in Economics, with English the next most common major.
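Here's the same count-then-sort pattern on a small made-up vector of majors (not the real class data). By default sort() puts the smallest counts first. The decreasing = TRUE argument, which these notes don't cover, puts the largest first instead.

```r
# Made-up majors for six students
major = factor(c("Economics", "English", "Economics",
                 "Statistics", "Economics", "English"))

counts = table(major)
counts

sort(counts)                      # smallest count first (the default)
sort(counts, decreasing = TRUE)   # largest count first

# Name of the most common major
names(sort(counts, decreasing = TRUE))[1]   # "Economics"
```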
This kind of analysis is pretty common, so there's a shortcut. We can use the
summary() function to quickly get a summary of every column:
summary(x)
We can also use table() to cross-tabulate student class level and major:
table(x$Class, x$Major)
This shows a breakdown of class level by major.
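A cross-tabulation is easier to read on a tiny example. The students data frame below is a hypothetical stand-in with made-up values. Looking up one cell by its row and column names is an extra step that these notes don't cover.

```r
# Hypothetical stand-in for the class data (values are made up)
students = data.frame(
  Class = c("Senior", "Junior", "Senior", "Senior"),
  Major = c("Economics", "English", "Economics", "English")
)

# Rows are class levels, columns are majors
tab = table(students$Class, students$Major)
tab

dim(tab)                     # 2 2
tab["Senior", "Economics"]   # 2
```

Each cell counts the students with that combination of class level and major.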