---
title: "Day Three, part B<br> Be careful out there folks!"
author: |
    | Jon Atwell \& Christopher Skovron
    | Northwestern University
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  revealjs::revealjs_presentation:
    theme: simple
    highlight: haddock
    center: true
    transition: none
    #css: reveal.css
    self_contained: false
    reveal_plugins: ["notes"]
    lib_dir: "."
---
## Get giddy about the possibilities!
## count_problems $=G($ data_size$^\alpha), \alpha > 1$?
## Probably not, just different problems
- Pitfalls of Inference
- Ethical/legal considerations
- Practical considerations
## Salganik's Characteristics of Big Data
<img class="plain" src="https://www.bitbybitbook.com/images/bit-by-bit-cover-934a0a3f.jpg">
<br>
(PUP, 2018)
## Characteristics BD **can** have
<span class="fragment" data-fragment-index="1"> - **Huge N** </span><br>
<span class="fragment" data-fragment-index="2"> - Not created for science* </span><br>
<span class="fragment" data-fragment-index="3"> - Always-on </span><br>
<span class="fragment" data-fragment-index="4"> - Nonreactive </span><br>
<span class="fragment" data-fragment-index="5"> - Incomplete </span><br>
<span class="fragment" data-fragment-index="6"> - Inaccessible </span><br>
<span class="fragment" data-fragment-index="7"> - Nonrepresentative </span><br>
<span class="fragment" data-fragment-index="8"> - Drifting </span><br>
<span class="fragment" data-fragment-index="9"> - Algorithmically-confounded </span><br>
<span class="fragment" data-fragment-index="10"> - Dirty </span><br>
<span class="fragment" data-fragment-index="11"> - Sensitive </span><br>
## Big Data is indeed big, but is correlation enough?
<aside class="notes">Does BD mean the end of theory? This is some of the more philosophical stuff.</aside>
<span class="fragment" data-fragment-index="1"> - the whole sampling frame is still not the population, unless you are doing the science of X</span><br>
<span class="fragment" data-fragment-index="2"> - systematic bias instead of sampling bias</span><br>
<span class="fragment" data-fragment-index="3"> - P-hacking, [meet N-hacking: 6:06 mark](https://mediasite.kellogg.northwestern.edu/Mediasite/Play/70c3061a791c46efa349f89670d72bc61d?playFrom=10656&autoStart=true&popout=true)</span><br>
<span class="fragment" data-fragment-index="4"> - More about new types of data, or richness of context </span>
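
One way to see why a huge N does not settle the question: with enough observations, even a substantively trivial difference becomes "statistically significant". The sketch below is not from the talk. It assumes a plain two-sample z-test with equal group sizes and unit variance, and the function name `two_sample_z` is made up for illustration.

```python
import math


def two_sample_z(diff_in_sd, n_per_group):
    """z statistic and two-sided p-value for a difference in means,
    assuming equal group sizes and unit variance in both groups."""
    z = diff_in_sd / math.sqrt(2.0 / n_per_group)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p


# The same tiny effect (1% of a standard deviation) at two sample sizes.
for n in (100, 1_000_000):
    z, p = two_sample_z(0.01, n)
    print(f"n per group = {n:>9,}: z = {z:.2f}, p = {p:.3g}")
```

The effect size never changes between the two runs; only N does. Whether the difference matters is still a question for theory.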
## Not created for science
<span class="fragment" data-fragment-index="1"> - "digital exhaust"</span><br>
<span class="fragment" data-fragment-index="2"> - weak link between **construct** and **measures**</span>
## Always-on
<span class="fragment" data-fragment-index="1"> - Old: HS diploma to job at 28</span><br>
<span class="fragment" data-fragment-index="2"> - New: Location every 15 minutes.</span><br>
<span class="fragment" data-fragment-index="3"> - But: When is this informative? Are we answering different questions?</span>
## Nonreactive
<span class="fragment" data-fragment-index="1"> - Old: Generosity in lab studies (reactivity/Hawthorne effect)</span><br>
<span class="fragment" data-fragment-index="2"> - New: People don't know they are being observed, or observation is now normalized</span><br>
<span class="fragment" data-fragment-index="3"> - But: Have you been on Instagram?</span>
## Incomplete: If only I had:
<aside class="notes">Relates to: are we answering different types of questions?</aside>
- gender
- income
- race/ethnicity
- IQ
- party affiliation
- ...
## Inaccessible
<aside class="notes">Google knows most basic things about all of US pop. Can we see it plz?</aside>
- Private companies own it
- Governments collect it but don't share
## Nonrepresentative
- Who actually tweets anyway?
- But nonrandom sample can still be very useful!
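
A nonrandom sample becomes more useful when group shares in the target population are known. The sketch below shows simple post-stratification, which is one common adjustment and not necessarily what the talk has in mind. The function name, groups, and numbers are all made up for illustration.

```python
def poststratified_mean(sample, population_shares):
    """Weight each group's sample mean by its known population share.

    sample: list of (group, value) pairs from the nonrandom sample.
    population_shares: dict mapping group -> share of the target population.
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample members in groups: {sorted(missing)}")
    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )


# Made-up illustration: the sample is 80% "young", the population is 30% "young".
sample = [("young", 1)] * 80 + [("old", 0)] * 20
shares = {"young": 0.3, "old": 0.7}
naive = sum(value for _, value in sample) / len(sample)
adjusted = poststratified_mean(sample, shares)
```

The adjustment only helps if, within each group, the people who show up in the data resemble the people who do not. That assumption has to be argued, not computed.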
## Drifting
<aside class="notes">Pop, behavior: Facebook system - PageRank, predictive search</aside>
- population
- behavior
- system
## Algorithmically confounded
- unique experiences (N treatment groups)
- recommenders
- Matthew effect/increasing returns/preferential attachment
- action triggers
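
The Matthew effect is easy to reproduce in a toy model. The sketch below assumes a hypothetical recommender that shows items in proportion to their past clicks plus one. The items start out identical, so any inequality in the result comes from the feedback loop alone. Names and parameters are illustrative.

```python
import random


def simulate_attention(n_items=20, n_clicks=5000, seed=0):
    """Each click goes to an item with probability proportional to
    (clicks so far + 1), i.e. a simple preferential-attachment rule."""
    rng = random.Random(seed)
    clicks = [0] * n_items
    for _ in range(n_clicks):
        weights = [c + 1 for c in clicks]
        chosen = rng.choices(range(n_items), weights=weights, k=1)[0]
        clicks[chosen] += 1
    return clicks


clicks = simulate_attention()
top_share = max(clicks) / sum(clicks)  # share of all clicks won by the top item
```

Rerunning with different seeds usually crowns a different "winner", which is the confound: popularity here reflects the system's history, not item quality.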
## Bigger question: Is social life now algorithmically confounded?
Lack of treatment controls is one thing, but <br>
exposures compound to substantial change.<br>
<ul>
<li>those confounds create new reality</li>
<li>Facebook vote encouragement</li>
<li>LinkedIn recommendations</li>
</ul>
## Dirty
- Wait, Russians can use Twitter too?!
- Wait, computers can do automated tasks?!
## Sensitive
- More so than you might think: [Metadata](https://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/)
- Merging or cross-referencing data sets makes them especially so
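
To see why cross-referencing is risky, here is a minimal sketch of linking an "anonymized" data set to a public one on shared quasi-identifiers. Every record, field name, and the helper `link_records` is made up for illustration.

```python
def link_records(anonymized, public, keys):
    """Match rows from two data sets that agree on every quasi-identifier.

    Returns (public_row, anonymized_row) pairs, keeping only unique matches.
    """
    index = {}
    for row in anonymized:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for person in public:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate means re-identified
            matches.append((person, candidates[0]))
    return matches


# Entirely made-up records for illustration.
anonymized = [
    {"zip": "60201", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "60201", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "60208", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Person A", "zip": "60201", "birth_year": 1984, "sex": "F"},
    {"name": "Person B", "zip": "60208", "birth_year": 1990, "sex": "M"},
]
linked = link_records(anonymized, public, keys=("zip", "birth_year", "sex"))
```

Neither data set contains both a name and a diagnosis, yet the join attaches one to the other.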
# Ethical and legal considerations
## Ethical
- Loss of Privacy (See [Netflix Prize](https://www.cs.cornell.edu/~shmat/netflix-faq.html))
- Violation of trust (See Cambridge Analytica)
- Exploitation
- Lack of informed consent
## Legal Pitfalls
- Violation of User Agreements ([Sandvig v. Sessions](https://www.aclu.org/cases/sandvig-v-sessions-challenge-cfaa-prohibition-uncovering-racial-discrimination-online))
- Impersonation
- Deception
## Legal hangups
- Formal data use agreements can happen without scientific input.
- Can take a long time
- Can be expensive
## Paging IRB!
- behind the times, don't understand tech systems
- assumes you'll follow UA?
# Practical considerations
## You might have to read more widely
- Physicists are doing social science in _Science_ and _PNAS_, etc.<br>
<span class="fragment" data-fragment-index="1"><center><img class="plain" src="https://imgs.xkcd.com/comics/physicists.png"></center></span>
## You have to move quickly
- Small tweaks to websites can break scripts
- Data are more widely available, more people study them
## But you might have to move slowly too
- harvesting data can take time
- stay off server blacklists
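
One way to stay off blacklists is to enforce a minimum gap between requests. The sketch below is a minimal throttle, not a complete scraper. The class `Throttle`, the function `harvest`, and the caller-supplied `fetch` are illustrative names, and a suitable interval depends on the site's terms and `robots.txt`.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def harvest(urls, fetch, throttle):
    """Fetch each URL in turn, pausing between requests."""
    results = {}
    for url in urls:
        throttle.wait()
        results[url] = fetch(url)
    return results
```

Here `fetch` is any function that takes a URL and returns its content. The clock and sleep functions are injectable, so the pacing logic can be checked without waiting or touching a network.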
## Social science is moving toward lab or collaborative model
- You should work with others
- Important to share and read others' code
- Must be flexible
## It's important to start with observation
Platforms support actions that:<br>
<ul>
<li>go unused</li>
<li>get co-opted</li>
</ul>
## It's important to start with observation
Platforms can have unique:<br>
<ul>
<li>norms</li>
<li>terms</li>
<li>meanings for words</li>
<li>...?</li>
</ul>
## Scraping just ain't what it used to be
- Most webpages are dynamic and idiosyncratic
- Server-side ops/parameters are intentionally opaque
## You can ask for data from corporations and other orgs
- think of quid pro quo
- Use LinkedIn to find lower level people who might care
- Use email finders or guess addresses from the organization's domain
- repeat
## You can use MTurk (or its mirrors)!
- cheaply label training set
- cheaply validate results
- but I'm skeptical of survey results
# And yet, it's an exciting time to be doing social science!