# ---
# jupyter:
# jupytext:
# formats: ipynb,py:percent
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.6.0
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# ---
# %% [markdown]
# # <a name="top">WeRateDogs - Udacity Data Wrangling Project 03 </a>
# ---
# ## GATHER & ASSESS 3 datasets from 3 different sources:
# 1. [Gather/Assess Data #1](#gatherassess1) - Twitter archive, twitter-archive-enhanced.csv (local archive). format: CSV
# 2. [Gather/Assess Data #2](#gatherassess2) - Tweet image predictions - Download data from file_url utilizing requests library. format: TSV
# 3. [Gather/Assess Data #3](#gatherassess3) - Query Twitter API for additional data - tweet_json.txt (local archive of JSON returned by the API). format: TXT
#
# ## CLEAN (8) Quality Issues
# Also known as dirty data: mislabeled, corrupted, duplicated, or otherwise inconsistent content.
#
# ### twitter-archive-enhanced.csv quality issues:
#
# 1. [Quality #1](#q1) - columns 'timestamp' & 'retweeted_status_timestamp' are objects (strings) rather than datetimes. Convert to datetime.
#
# 2. [Quality #2](#q2) - twitterDF.name contains a lot of non-dog names, e.g. 'a'; Replace with np.NaN
#
# 3. [Quality #3](#q3) - doggo, floofer, pupper, & puppo columns use the string 'None'; replace with 0, and with 1 where the stage is present.
#
# 4. [Quality #4](#q4) - remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck
#
# 5. [Quality #5](#q5) - retweeted_status_id is of type float; change to object (string).
#
# 6. [Quality #6](#q6) - `in_reply_to_status_id` and `in_reply_to_user_id` are type float; convert to string.
#
# ### rt_tweets quality issues:
#
# 7. [Quality #7](#q7) - rename the id column to tweet_id for uniformity across dataframes
#
# 8. [Quality #8](#q8) - remove retweets
#
#
# ---
# ## CLEAN (2) Tidiness Issues
# Messy data has structural issues; in tidy data, each variable forms a column, each observation forms a row, & each observational unit forms a table.
#
# 1. [Tidy #1](#t1) - create a new dataframe containing only the columns needed
#
# 2. [Tidy #2](#t2) - merge all 3 datasets
#
# 3. Potential Tidy #3 - variables spread across columns --> doggo, floofer, pupper, puppo. One could create a single column, e.g. 'dog_type', specifying which stage, if any, is represented. The problem is there are numerous tweets where more than one 'dog type' is specified, and one can't arbitrarily choose which type to keep when several (doggo, floofer, etc.) apply. (A quick count of such multi-stage rows appears after the Quality #3 cleaning below.)
# ---
# ## Examples of assessments:
# ### Visuals
# 1. [Visual 1](#vis1) - Horizontal Bar Chart (WeRateDogs Dog Breeds represented (top 10))
# 2. [Visual 2](#vis2) - Horizontal Bar Chart (Top 15 Favorites (tweets), by probable name)
#
# ### Programmatic
# 1. [Programmatic 1](#prog1) - Percentages, Value Counts, etc.
# 2. [Programmatic 2](#prog2) - Grouping of the dataframe on the first predicted name for various means
#
# ### Saved new dataframe to file
# [Save new dataframe to file, twitter_archive_master.csv](#save1)
#
#
# [BACK TO TOP](#top)
# %% [markdown]
# ## Import Libraries
# %%
import pandas as pd
import numpy as np
import os
import requests
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
# %matplotlib inline
# %% [markdown]
# ## <a name="gatherassess1">Gather/Assess Data #1 - Twitter Archive Enhanced</a>
# %%
# Read data into dataframe
twitterDF_orig = pd.read_csv("data/twitter-archive-enhanced.csv")
# Make copy of dataframe
twitterDF = twitterDF_orig.copy()
# %%
twitterDF.head(5)
# %%
# review data columns in DF, are Dtypes appropriate, etc.
twitterDF.info()
# %%
# find all tweets where the retweeted_status_id is notnull
twitterDF[twitterDF.retweeted_status_id.notnull()]
# %% [markdown]
# [BACK TO TOP](#top)
# %%
# review names of pups
twitterDF.name.value_counts()
# %%
# review dogtionary names; interesting to see id# 200 has 2 values, doggo & floofer
twitterDF[twitterDF['floofer'] != 'None'].head(3)
# %%
# it appears the designations were pulled from the tweeted text, 'doggo' & 'floofer' in text below
twitterDF.loc[200,'text']
# %%
# Illustrating that pup designations are NOT singular; multiple stages can appear on one tweet
twitterDF[twitterDF['doggo'] != 'None'].sample(5)
# %% [markdown]
# ### Define
#
# <a name="q1"> Q1 - Convert dtype of timestamp columns</a>
# %% [markdown]
# ### Code
# %%
# Fixed 2 columns with incorrect datatypes, changed to datetime64
twitterDF.timestamp = pd.to_datetime(twitterDF.timestamp)
twitterDF.retweeted_status_timestamp = pd.to_datetime(twitterDF.retweeted_status_timestamp)
# %% [markdown]
# ### Test
# %%
twitterDF.info()
# %% [markdown]
# ### Define
#
# <a name="q2"> Q2 - dog names = 'a', replace with NaN </a>
# %% [markdown]
# ### Code
# %%
# replace pup names that equal 'a' with NaN
twitterDF.name = np.where(twitterDF.name == 'a', np.NaN, twitterDF.name)
# %% [markdown]
# ### Test
# %%
# check to ensure all 'a' names were removed
twitterDF[twitterDF.name == 'a']
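# %%
# Exploratory sketch (read-only; assumes real dog names are capitalized): 'a' is likely
# not the only extraction artifact, so list the remaining all-lowercase "names", which
# are probably articles/adverbs pulled from the tweet text ('an', 'the', 'very', ...)
twitterDF.name[twitterDF.name.str.islower().fillna(False)].value_counts().head(10)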
# %% [markdown]
# ### Define
# <a name="q3"> Q3 - doggo, floofer, pupper, & puppo use None; Replace with NaN, or 0, & 1 for present </a>
# %% [markdown]
# ### Code
# %%
# replace 'None' with 0
# replace 'doggo' with 1
twitterDF.doggo = np.where(twitterDF.doggo == 'None', 0, twitterDF.doggo)
twitterDF.doggo = np.where(twitterDF.doggo == 'doggo', 1, twitterDF.doggo)
# %%
# replace 'None' with 0
# replace 'floofer' with 1
twitterDF.floofer = np.where(twitterDF.floofer == 'None', 0, twitterDF.floofer)
twitterDF.floofer = np.where(twitterDF.floofer == 'floofer', 1, twitterDF.floofer)
# %%
# replace 'None' with 0
# replace 'pupper' with 1
twitterDF.pupper = np.where(twitterDF.pupper == 'None', 0, twitterDF.pupper)
twitterDF.pupper = np.where(twitterDF.pupper == 'pupper', 1, twitterDF.pupper)
# %%
# replace 'None' with 0
# replace 'puppo' with 1
twitterDF.puppo = np.where(twitterDF.puppo == 'None', 0, twitterDF.puppo)
twitterDF.puppo = np.where(twitterDF.puppo == 'puppo', 1, twitterDF.puppo)
# %% [markdown]
# ### Test
# %%
# check to ensure cleaning successful
twitterDF[twitterDF.puppo == 'None'].count()
# %%
# check to ensure cleaning successful
twitterDF.query("floofer == 1")
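# %%
# Sketch for the "Potential Tidy #3" note in the summary above: with the four stage
# columns now 0/1, rows claiming more than one stage are easy to count; a nonzero count
# is why collapsing them into a single 'dog_type' column would be ambiguous
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
(twitterDF[stage_cols].astype(int).sum(axis=1) > 1).sum()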
# %% [markdown]
# ### Define
# <a name="q4"> Q4 - remove URL from 'source' & replace with 4 categories: iphone, vine, twitter, tweetdeck </a>
# %%
# review names of sources
twitterDF.source.value_counts()
# %% [markdown]
# ### Code
# %%
twitterDF.head(2)
# %%
# function to categorize the source column; order matters, since the iPhone source
# string also contains 'Twitter'. Falls back to NaN for any unrecognized source.
def update_source(row):
    if 'iphone' in row:
        return 'iphone'
    elif 'vine' in row:
        return 'vine'
    elif 'Twitter' in row:
        return 'twitter web client'
    elif 'TweetDeck' in row:
        return 'TweetDeck'
    else:
        return np.NaN
# %%
# apply update_source to every source value, replacing the HTML anchor with a short category
twitterDF.source = twitterDF.source.apply(update_source)
# %% [markdown]
# ### Test
# %%
# check to ensure function replaced items as intended
twitterDF.sample(5)
# %% [markdown]
# ### Define
# <a name="q6">Q6 - `in_reply_to_status_id` and `in_reply_to_user_id` are type float. Convert to string</a>
# %%
# data exploration
# see sample of in_reply_to_status_id...
twitterDF[twitterDF.in_reply_to_status_id.notnull()]
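# %% [markdown]
# ### Code
# %%
# A minimal sketch for Q6 (assumes pandas >= 1.0 for nullable dtypes): cast the float
# reply-ID columns to nullable Int64 first, so the IDs don't render with a trailing
# '.0', then to string; NaNs survive as <NA>
for col in ['in_reply_to_status_id', 'in_reply_to_user_id']:
    twitterDF[col] = twitterDF[col].astype('Int64').astype('string')
# %% [markdown]
# ### Test
# %%
twitterDF[['in_reply_to_status_id', 'in_reply_to_user_id']].dtypes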
# %% [markdown]
# ### Define
#
# <a name="q8"> Q8 - remove retweets & delete columns </a>
# %%
twitterDF.sample(2)
# %% [markdown]
# ### Code
# %%
# Get indices of rows to drop, in this case, any row with a value in retweeted_status_id different from NaN.
drop_these = twitterDF[twitterDF['retweeted_status_id'].notnull()].index
twitterDF.drop(drop_these,inplace=True)
twitterDF.sample(3)
# %% [markdown]
# ### Test
# %%
# check if any 'notnull' entries exist in retweeted_status_id
twitterDF[twitterDF['retweeted_status_id'].notnull()]
# %%
# get rid of 3 empty columns representing the retweeted tweets
drop_cols = ['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp']
twitterDF.drop(drop_cols,axis=1,inplace=True)
# %%
# check to ensure cols dropped
twitterDF.info()
# %% [markdown]
# ## <a name="gather2">Gather Data #2 - Tweet image predictions</a>
# %%
# Download data from file_url using the requests library & save it under the data/ directory
file_url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
req = requests.get(file_url)
req.raise_for_status()  # fail loudly on a bad HTTP response
fname = os.path.basename(file_url)
with open("data/" + fname, 'wb') as f:
    f.write(req.content)
# %%
# data exploration
# Now read the downloaded file & view a sample
image_preds = pd.read_csv("data/image-predictions.tsv", sep="\t")
image_preds.sample(5)
# %%
# data exploration
image_preds.info()
# %% [markdown]
# [BACK TO TOP](#top)
# %% [markdown]
# ## <a name="gather3">Gather Data #3 - Query Twitter API for additional data</a>
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
#
# * retweet count
# * favorite count
# * any additional data found that's interesting
# * only tweets through Aug 1st, 2017 (the span covered by the image predictions)
# %%
# define keys & API info
# authenticate API using regenerated keys/tokens
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
# %%
tweet_ids = twitterDF.tweet_id.values
len(tweet_ids)
# %%
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
'''
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
'''
# %% [markdown]
# ### Start from here if data already obtained from Twitter
#
# [BACK TO TOP](#top)
# %%
# Read tweet JSON into dataframe using pandas.
# Each line of the file is a separate JSON object, so lines=True is required
# (without it, pandas raises "ValueError: Trailing data")
rt_tweets = pd.read_json("tweet_json.txt", lines=True)
rt_tweets.head(5)
# %%
# data exploration
rt_tweets.info()
# %%
# data exploration
# View retweeted tweets, first 5 of 163, these will be deleted
rt_tweets[rt_tweets.retweeted_status.notnull()].head(5)
# %%
# data exploration
rt_tweets.user
# %%
# data exploration
rt_tweets.columns
# %%
# data exploration
# inspect the extended entities data
rt_tweets.loc[0,'extended_entities']
# %%
# data exploration
# inspect the entities data
rt_tweets.loc[115,'entities']
# %%
# data exploration
rt_tweets.loc[130,'user']
# %%
# data exploration
rt_tweets.iloc[1:8,11:]
# %% [markdown]
# ## <a name="t1">Tidy #1 - create new dataframe of columns needed</a>
# %%
# add columns to this list to create a new DF containing only the columns we want
tweet_cols = ['created_at','id','full_text','display_text_range','retweet_count','favorite_count','user']
# %%
# create new DF with the columns defined above
rt_tweets_sub = rt_tweets.loc[:,tweet_cols]
rt_tweets_sub.head(10)
# %% [markdown]
# ## <a name="t2">Tidy #2 - Merge 3 datasets</a>
#
# 1. twitterDF
# 2. rt_tweets_sub
# 3. image_preds
# %%
# data exploration
twitterDF.info()
# %%
# data exploration
rt_tweets_sub.info()
# %%
image_preds.info()
# %% [markdown]
# ### Define
# ### <a name="q7">Quality 7 - rename id column for common data uniformity</a>
# %%
# this dataframe uses a different name for the shared key column; rename id --> tweet_id
rt_tweets_sub = rt_tweets_sub.rename(columns={"id":"tweet_id"})
rt_tweets_sub.head(5)
# %%
# MERGE 2 dataframes!
new_tweets_df = pd.merge(rt_tweets_sub, twitterDF, on='tweet_id')
new_tweets_df.head(3)
# %%
# data exploration
new_tweets_df.info()
# %%
# MERGE newly merged dataframe and image_preds to get new_tweets_df2
new_tweets_df2 = pd.merge(new_tweets_df, image_preds, on='tweet_id')
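# %%
# Sanity-check sketch: pd.merge defaults to an inner join on tweet_id, so the final
# frame can be no longer than the smallest input; compare row counts
print(len(rt_tweets_sub), len(twitterDF), len(image_preds), len(new_tweets_df2))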
# %% [markdown]
# ## <a name="save1">New Dataframe saved to file</a>
# %%
# write new dataframe to file; index=False keeps the pandas index out of the CSV
new_tweets_df2.to_csv("twitter_archive_master.csv", index=False)
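# %%
# Read-back sketch: confirm the saved CSV round-trips with the same shape
pd.read_csv("twitter_archive_master.csv").shape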
# %% [markdown]
# [BACK TO TOP](#top)
# %%
# data exploration
new_tweets_df2.head(5)
# %%
# data exploration
# how many names are blank (null); .sum() counts the True values in the boolean mask
new_tweets_df2.name.isnull().sum()
# %%
# data exploration
new_tweets_df2.loc[576,'expanded_urls']
# %%
# data exploration
new_tweets_df2.info()
# %%
# exploratory
# highest_accuracy = new_tweets_df2.query("p1_dog == true and ")
# %%
# count the number of times each name was predicted for a pup. The new series,
# count_by_name, is sorted by its index, i.e. alphabetically by default
count_by_name = new_tweets_df2.groupby('p1').size()
count_by_name
# %%
# see top 40 most predicted names
count_by_name.sort_values(ascending=False)[0:40]
# %%
# Investigate why 'seat_belt' is the 15th most predicted name for a dog picture: take all tweets
# whose p1 value equals 'seat_belt' and group by the 2nd predicted value
new_tweets_df2.query("p1 == 'seat_belt'").groupby('p2').size()
# %%
# create new series of the top 10 names used for pups
top10_names = count_by_name.sort_values(ascending=False).head(10)
top10_names
# %%
top10_names.index.values
# %%
top10_val_array = top10_names.values
top10_val_array
# %% [markdown]
# ## <a name="vis1"> Horizontal Bar Chart to visualize the top 10 breeds represented during the timeframe </a>
# %%
# Horizontal Bar Chart to visualize the top 10 breeds represented during the timeframe
# (no error bars: these are exact counts, not estimates)
plt.rcdefaults()
fig, ax = plt.subplots()
names = top10_names.index.values
y_pos = np.arange(len(names))
counts = top10_names.values
ax.barh(y_pos, counts, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Dog Breeds (predicted) Count')
ax.set_ylabel('Predicted Breeds')
ax.set_title('WeRateDogs Dog Breeds represented (top 10)')
plt.show()
# %% [markdown]
# [BACK TO TOP](#top)
# %%
# Data Exploration
new_tweets_df2.iloc[300:305,0:10]
# %%
# Data Exploration
new_tweets_df2.iloc[300:305,11:20]
# %% [markdown]
# ## <a name="prog1">Programmatic Assessment</a>
# %%
## Percentage of tweets where the dog was affectionately categorized
## Means of the doggo, floofer, pupper, & puppo columns show how often each designation was used
## e.g. a mean of 0.036 means 'doggo' was used to describe a pup 3.6% of the time
desig = ['doggo', 'floofer', 'pupper', 'puppo']
#new_tweets_df2.doggo.mean()
new_tweets_df2[desig].mean()
# %%
## Data Exploration
## Names most used: each index is a name, each value the number of times owners used it.
## There were a lot of missing values here
new_tweets_df2.name.value_counts()
# %% [markdown]
# [BACK TO TOP](#top)
# %%
top10_names_used = list(top10_names.index)
# %%
top10_names_used
# %% [markdown]
# [BACK TO TOP](#top)
#
# ### <a name="prog2">More Programmatic Assessment</a>
# %%
## Create grouping of dataframe on the first predicted name, p1, & obtain the mean of specific data points
# This provides the appropriate columns, but the resulting dataframe displays in p1
# alphabetical order, which is not statistically meaningful
name_by_avgs = new_tweets_df2.groupby("p1")[['p1_conf','rating_numerator','rating_denominator','favorite_count',
                                             'retweet_count']].mean()
# Alternatively, just pull the rows you want, top10_names, out of name_by_avgs; it's simply sorted alphabetically
#name_by_avgs = new_tweets_df2.groupby(new_tweets_df2[newtop10])[['p1_conf','rating_numerator','rating_denominator','doggo','floofer',
#                                                                 'pupper','puppo','favorite_count','retweet_count']].mean()
name_by_avgs.head(10)
# %%
# Get the highest average retweet count by predicted name
p1_retweets = name_by_avgs.retweet_count.sort_values(ascending=False)
p1_retweets.head(10)
# The results indicate that tweets with pictures predicted as an "Arabian_camel" averaged 17,424
# retweets. This insight says more about the neural network's results and its accuracy than about retweets
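# %%
# Sketch (an aside): restricting to rows the model classified as an actual dog breed
# (p1_dog == True) before grouping keeps non-dog labels like 'Arabian_camel' out of
# the retweet averages
dog_only_avgs = new_tweets_df2[new_tweets_df2.p1_dog].groupby('p1').retweet_count.mean()
dog_only_avgs.sort_values(ascending=False).head(10)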
# %%
# data exploration
#top10stats = name_by_avgs.loc[newtop10]
#top10stats.head(10)
# %%
#name_by_avgs.reset_index(inplace=True)
# %%
#name_by_avgs.rename(columns= {'p1':'probable_name', 'p1_conf':'probability'}, inplace=True)
# %%
# data exploration
#favorites_by_name = name_by_avgs.loc[:,['favorite_count']]
#favorites_by_name.
# %%
#top15_favorites = favorites_by_name.iloc[0:15,:]
#top15_favorites.t.sort_values(ascending=False)
# %% [markdown]
# ## <a name="vis2">Notable analysis from visual bar chart </a>
#
# ### None of the top 15 most-favorited 'dog' images were accurately identified as dogs
# %%
# create subset of average favorite_count by predicted name
favorites_by_name = name_by_avgs.loc[:,['favorite_count']]
favorites_by_name.sort_values(by=['favorite_count'], ascending=False, inplace=True)
# get top 15 of new subset to create visual from
top15_favorites = favorites_by_name.iloc[0:15,:]
group_names = top15_favorites.index
group_data = top15_favorites.favorite_count
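# %%
# Verification sketch for the claim above: each p1 label maps to a single p1_dog flag,
# so look up which of the top-15 labels the model considers actual dog breeds
image_preds.groupby('p1').p1_dog.first().loc[top15_favorites.index]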
# %%
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(group_names, group_data)
labels = ax.get_xticklabels()
plt.setp(labels, rotation=45, horizontalalignment='right')
ax.set(xlim=[-10000, 70000], xlabel='No. of favorited tweets', ylabel='Names (guessed by learning model)',
title='Top 15 Favorites (tweets), by probable name')
plt.show()
# %% [markdown]
# [BACK TO TOP](#top)
# %%
# predicted names whose average rating numerator is at least 10, highest first
name_by_avgs.query("rating_numerator >= 10").rating_numerator.sort_values(ascending=False)
# %%
# all predicted names ranked by average rating numerator
name_by_avgs.rating_numerator.sort_values(ascending=False)