---
theme: gaia
_class: lead
paginate: true
footer: Computational Thinking and Social Science | :copyright: Matti Nelimarkka | 2023 | Sage Publishing
marp: true
---
- Connect to the Internet via libraries and extract meaningful content from it for your work.
- Use libraries to collect data from hypertext documents.
- Read application programming interface (API) documentation.
- Collect data from online APIs that do not require authentication.
- Read and store data in .json format.
- Compare the .csv and .json forms of data storage.
- Increasingly, we repurpose existing data for our research practices.
- We now discuss how computers can help in the collection, storage, and manipulation of data.
- Remember: no quantity of data can rectify a dull or, worse, irrelevant research question.
Connecting to web resources with the `httr` library

```r
library(httr)

## collect the website example.com
response <- GET('http://www.example.com')
website_content <- content(response, 'text')
print(website_content)
```
```html
<!doctype html>
<html>
<head>
    <title>Example Domain</title>
    <meta charset="utf-8" />
</head>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
```
- `<p>` separates paragraphs
- `<a>` indicates a link
Knowing the semantic meaning of HTML tags, websites can be further analysed using other libraries. For R, use `rvest`, and in some more difficult cases, Selenium.
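As a minimal sketch of tag-based extraction with `rvest` (parsing the example.com markup shown above from a string, so no network access is assumed; `read_html()` also accepts a URL):

```r
library(rvest)

# Parse the HTML shown above from a string
page <- read_html('<html><body><div>
  <h1>Example Domain</h1>
  <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div></body></html>')

# Extract content by HTML tag: the text of <h1>, the target of <a>
heading <- html_text(html_element(page, 'h1'))
link <- html_attr(html_element(page, 'a'), 'href')
print(heading)
print(link)
```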
Web services define APIs: in essence, these are grammars that specify
- what data can be collected from the service
- how the data are to be requested properly
- the format in which the data are returned to the requester
Request:
https://data.police.uk/api/crimes-street/all-crime?lat=51.5073&lng=-0.171505
Response (partial):
```json
{"category": "anti-social-behaviour",
 "location_type": "Force",
 "location": {"latitude": "51.517535",
              "street": {"id": 1670905, "name": "On or near A4206"},
              "longitude": "-0.182180"},
 "context": "",
 "outcome_status": null,
 "persistent_id": "",
 "id": 104301433,
 "location_subtype": "",
 "month": "2022-08"}
```
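The same request can be made from R; a hedged sketch with `httr` and `jsonlite`, assuming a live network connection to data.police.uk:

```r
library(httr)
library(jsonlite)

# Street-level crimes near the given coordinates; no authentication required
response <- GET('https://data.police.uk/api/crimes-street/all-crime',
                query = list(lat = 51.5073, lng = -0.171505))

# The API answers with JSON; parse the response body into a data frame
crimes <- fromJSON(content(response, 'text'))
print(table(crimes$category))
```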
- Comma-separated values (CSV) compared to the dictionary-style data format JSON:
- CSV requires that all rows have the same number of columns
- JSON allows more complex structures, such as lists, to be used
```json
[
  {
    "id": 1,
    "text": "This post has no Likes",
    "likes": []
  },
  {
    "id": 2,
    "text": "This post has two Likes",
    "likes": [
      "John Smith",
      "Jane Smith"
    ]
  }
]
```
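To see the contrast in practice, a minimal sketch: forcing the nested `likes` lists above into CSV's rectangular shape means collapsing each list into a single string (the `'; '` separator is an assumption made for illustration):

```r
library(jsonlite)

# The same two posts as above, parsed without simplification
posts <- fromJSON('[{"id": 1, "text": "This post has no Likes", "likes": []},
                    {"id": 2, "text": "This post has two Likes",
                     "likes": ["John Smith", "Jane Smith"]}]',
                  simplifyVector = FALSE)

# Collapse each post's list of likes into one string so every row has the same columns
flat <- data.frame(
  id    = sapply(posts, function(p) p$id),
  text  = sapply(posts, function(p) p$text),
  likes = sapply(posts, function(p) paste(unlist(p$likes), collapse = '; '))
)
print(flat)
```

Note that the list structure is lost in the flattening: recovering the individual names would require splitting the string again.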
```r
library(jsonlite)

data <- read_json('data.json')
for (row in seq_along(data)) {
  print(paste(data[[row]]$id, data[[row]]$text))
  likes <- data[[row]]$likes
  for (user in likes) {
    print(paste(' Liked by', user))
  }
}
```
- Always archive the original ‘raw’ data before any cleaning or processing.
- Document each processing and wrangling step.
- Finally, rerun all processing steps from the archived raw data to ensure nothing has been forgotten.
- What kinds of steps are carried out in a Web-scraping process?
- Why are APIs preferable to Web scraping for data collection?
- How can scholars access and use data not found online?
- Why might the .json data format be preferred over .csv files and binary dumps?
- What should you remember when working with data?