Creating a Data Set
The original strategy was to use Twitter's Premium API to build a large dataset (i.e., millions of observations) from historical tweets, focusing on known instances/victims of online attack such as Diane Abbott or the banknote mob attack. We thought this was possible because this project was granted access to Twitter's Premium Sandbox API.
On closer inspection of the documentation, this is not going to be possible. It appears that a maximum of only 5,000 tweets per month can be retrieved from the Search Tweets: Full-archive endpoint (i.e., tweets since 2006): the Premium Sandbox API is limited to 50 requests per month to this endpoint, and each request returns at most 100 tweets. The equivalent 30-day endpoint (tweets from the last 30 days) allows 250 requests per month.
One way to overcome this would be to upgrade to the paid Premium tier. Returning 1.25 million tweets would cost approximately €1,700, which is certainly not feasible at this stage without funding.
In addition, tweets seem to vanish when either Twitter or the account holder flags them as abusive (good), but we then have no way of retrieving those tweets (bad).
Pagination: when making both data and count requests, it is likely that more data is available than can be returned in a single response. In that case the response will include a 'next' token as a root-level JSON attribute. Whenever a 'next' token is provided there is additional data to retrieve, so you will need to keep making API requests.
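A minimal sketch of that pagination loop against the sandbox full-archive endpoint, assuming an app-only bearer token and an environment label of `dev` (both placeholders for this project's own credentials):

```python
import os
import requests

# Placeholders: substitute this project's own bearer token and environment label.
BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]
ENDPOINT = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json"

def search_full_archive(query, max_requests=50):
    """Page through full-archive results, following the root-level 'next' token."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    payload = {"query": query, "maxResults": 100}  # sandbox maximum per request
    tweets = []
    for _ in range(max_requests):  # sandbox cap: 50 requests/month
        resp = requests.post(ENDPOINT, json=payload, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        tweets.extend(body.get("results", []))
        if "next" not in body:  # no 'next' token means no more pages
            break
        payload["next"] = body["next"]
    return tweets
```

Even exhausting the whole monthly quota in one run, this tops out at the 5,000 tweets noted above.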
The current strategy is to take a shotgun approach:
- The standard 7-day search Twitter API will be used to retrieve tweets matched by keyword or phrase (from compiled lists) and from known user accounts. The maximum rate limit is 450 requests per 15-minute window. The script will be automated to run continuously using something like cron (see the sketch after this list).
- Datasets available in the literature will be collected.
- Known harassment cases from the past will be scraped (but remember that many harassing tweets may have been removed).
- Bootstrapping: selecting some seed keywords to identify other keywords to use for collecting the data.
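A minimal sketch of the standard-search collector, assuming app-only authentication; the endpoint, the 100-tweet page size, and the 450 requests / 15 min limit come from the standard Search API, while the query itself is a placeholder:

```python
import time
import requests

BEARER_TOKEN = "..."  # placeholder: app-only bearer token
BASE = "https://api.twitter.com/1.1/search/tweets.json"

def collect(query, pages=10):
    """Collect recent tweets for one query from the standard 7-day search."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    url = f"{BASE}?q={requests.utils.quote(query)}&count=100&tweet_mode=extended"
    collected = []
    for _ in range(pages):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 429:  # rate limited: wait out the 15-minute window
            time.sleep(15 * 60)
            continue
        resp.raise_for_status()
        body = resp.json()
        collected.extend(body["statuses"])
        # the standard search paginates via search_metadata['next_results'],
        # which is a ready-made query string for the next page
        next_results = body.get("search_metadata", {}).get("next_results")
        if not next_results:
            break
        url = BASE + next_results
    return collected
```

A cron job re-running this every 15 minutes could pass the largest id seen so far as `since_id` to avoid re-collecting the same tweets.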
The documentation on the Twitter APIs is a bit of a minefield. The endpoints considered so far:
- Search for tweets containing word X
  - endpoint: search
  - pro: matches tweets by keyword or phrase from the compiled lists
  - con: the standard search only covers roughly the last 7 days
- Get a tweet by ID
  - endpoint: id
  - pro: most datasets on GitHub are stored as tweet IDs, so this is very useful for obtaining datasets from the literature (see the hydration sketch after this list)
  - con: tweets that have been removed or flagged cannot be retrieved this way (see above)
- Get a user's timeline
  - endpoint: user_timelines
  - pro:
  - con: only returns the user's own timeline and no further context (i.e., where they were mentioned, retweeted, etc.)
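A minimal sketch of hydrating a list of tweet IDs from a literature dataset, assuming app-only authentication; `statuses/lookup` accepts up to 100 IDs per request, and deleted or flagged tweets are simply absent from the response (the problem noted above):

```python
import requests

BEARER_TOKEN = "..."  # placeholder: app-only bearer token
LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(tweet_ids):
    """Resolve tweet IDs to full tweet objects, 100 at a time.
    Deleted/flagged tweets are silently missing from the result."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    tweets = []
    for i in range(0, len(tweet_ids), 100):
        batch = tweet_ids[i:i + 100]
        resp = requests.post(LOOKUP_URL, headers=headers,
                             data={"id": ",".join(map(str, batch)),
                                   "tweet_mode": "extended"})
        resp.raise_for_status()
        tweets.extend(resp.json())
    return tweets
```

Comparing the number of IDs sent with the number of tweets returned also gives a rough measure of how many tweets in a published dataset have since been removed.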
Fields of interest in the tweet object:
- created_at
- id
- text
- in_reply_to_status_id
- in_reply_to_user_id
- user.id
- user.location
- user.time_zone? → proxy for location, together with place?
- place / coordinates
- retweeted_status - representation of the original tweet that was retweeted
- reply_count
- retweet_count
- entities
- favorited
- retweeted
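A minimal sketch of flattening these fields out of a hydrated v1.1 tweet object into one row of the dataset; the field names come from the tweet object itself, while `flatten_tweet` is just an illustrative helper:

```python
def flatten_tweet(t):
    """Pull the fields of interest out of a v1.1 tweet object (a dict)."""
    user = t.get("user", {})
    rt = t.get("retweeted_status")  # present only on retweets
    return {
        "created_at": t.get("created_at"),
        "id": t.get("id"),
        "text": t.get("full_text", t.get("text")),  # full_text when tweet_mode=extended
        "in_reply_to_status_id": t.get("in_reply_to_status_id"),
        "in_reply_to_user_id": t.get("in_reply_to_user_id"),
        "user_id": user.get("id"),
        "user_location": user.get("location"),
        "user_time_zone": user.get("time_zone"),  # candidate proxy for location
        "place": t.get("place"),
        "coordinates": t.get("coordinates"),
        "retweeted_status_id": rt.get("id") if rt else None,
        "reply_count": t.get("reply_count"),  # only populated on premium/enterprise
        "retweet_count": t.get("retweet_count"),
        "entities": t.get("entities"),
        "favorited": t.get("favorited"),
        "retweeted": t.get("retweeted"),
    }
```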