
Creating a Data Set


Original Strategy

The original strategy was to use Twitter's Premium API to build a large dataset (i.e., millions of observations) from historical tweets, focusing on known instances/victims of online attack such as Diane Abbott or the banknote mob attack. We thought this was possible because the project had been granted access to Twitter's Premium Sandbox API.

On closer inspection of the documentation, this is no longer going to be possible. Only a maximum of 5,000 tweets per month can be retrieved from the Search Tweets: Full-archive endpoint (i.e., tweets since 2006), because the Premium Sandbox API is limited to 50 requests per month to this endpoint and each request returns at most 100 tweets. The limit is 250 requests per month for the endpoint that retrieves tweets from the last 30 days.

One way to overcome this is to upgrade to the paid Premium tier: returning 1.25 million tweets would cost approximately €1,700, which is certainly not feasible at this stage without funding.

In addition, tweets seem to vanish when either Twitter or the account holder flags them as abusive (good), but we then have no way of retrieving those tweets (bad).

Pagination: When making both data and count requests it is likely that there is more data than can be returned in a single response. When that is the case the response will include a 'next' token. The 'next' token is provided as a root-level JSON attribute. Whenever a 'next' token is provided, there is additional data to retrieve so you will need to keep making API requests.
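Below is a minimal sketch of that pagination loop, assuming a `requests`-based POST to the full-archive sandbox endpoint; the dev environment label, bearer token, and query are placeholders, and the request budget mirrors the sandbox limits described above.

```python
import requests

# Placeholders: substitute your own dev environment label and bearer token.
SEARCH_URL = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev_env_label.json"
HEADERS = {"Authorization": "Bearer BEARER_TOKEN"}

def collect_all(query, max_requests=50):
    """Follow the root-level 'next' token until it disappears or the
    monthly request budget (50 for the sandbox) is spent."""
    tweets, next_token = [], None
    for _ in range(max_requests):
        payload = {"query": query, "maxResults": 100}
        if next_token:
            payload["next"] = next_token
        response = requests.post(SEARCH_URL, headers=HEADERS, json=payload)
        response.raise_for_status()
        data = response.json()
        tweets.extend(data.get("results", []))
        next_token = data.get("next")
        if not next_token:  # no 'next' token means there are no more pages
            break
    return tweets
```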

Current Strategy

The current strategy is to take a shotgun approach.

  • The standard 7-day search Twitter API will be used to retrieve tweets matching keywords or phrases (from compiled lists) and tweets from known user accounts. The maximum rate limit is 450 requests per 15-minute window. The script will be automated to run continuously using something like cron (see the sketch after this list).
  • Datasets available in the literature will be collected.
  • Known past harassment cases will be scraped (but remember that many harassing tweets may have been removed).
  • Bootstrapping: using an initial set of keywords to identify further keywords for collecting data.
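
The following is a rough sketch of that keyword-collection script, assuming the tweepy wrapper around the standard 7-day search endpoint; the credentials and keyword list are placeholders, and a cron entry would invoke the script periodically.

```python
import tweepy

# Placeholder credentials for the project's Twitter app.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # sleep when the 15-min window is exhausted

KEYWORDS = ["example_keyword", "example phrase"]  # compiled keyword list goes here

def collect(keyword, since_id=None, limit=1000):
    """Pull up to `limit` recent tweets matching `keyword`, newest first."""
    cursor = tweepy.Cursor(api.search, q=keyword, count=100,
                           since_id=since_id, tweet_mode="extended")
    return [status._json for status in cursor.items(limit)]

if __name__ == "__main__":
    for kw in KEYWORDS:
        tweets = collect(kw)
        # persist `tweets` here (e.g. append to a JSON-lines file) for later labelling
```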

Gathering Data from Twitter

The documentation on the Twitter APIs is a bit of a minefield.

Useful endpoints so far

  1. Search for tweets containing word X

  2. Get tweet by ID

    • id
    • pro: most datasets on GitHub are stored as tweet IDs, so this is very useful for obtaining datasets from the literature (see the hydration sketch after this list)
    • con:
  3. Get a user's timeline

    • user_timelines
    • pro:
    • con: only returns the user's own timeline and no more information (i.e., not where they were mentioned, retweeted, etc.)
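
A brief sketch of "hydrating" a dataset stored as tweet IDs, assuming an authenticated tweepy `api` object like the one above; `statuses_lookup` accepts up to 100 IDs per call, and IDs of deleted or protected tweets are silently dropped.

```python
def hydrate(api, tweet_ids):
    """Resolve a list of tweet IDs back into full tweet objects, 100 at a time."""
    tweets = []
    for i in range(0, len(tweet_ids), 100):
        batch = tweet_ids[i:i + 100]
        tweets.extend(status._json for status in api.statuses_lookup(batch))
    return tweets
```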

Keys of interest

    created_at
    id
    text
    in_reply_to_status_id
    in_reply_to_user_id 
    user.id
    user.location
    user.time_zone? → proxy for location with place?
    place/coordinates
    retweeted_status - representation of original tweet that was retweeted
    reply_count
    retweet_count
    entities
    favorited
    retweeted
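
As an illustration, a small sketch of flattening a raw v1.1 tweet object into just these keys of interest; the function name and output field names are our own choices, not part of the API.

```python
def flatten(tweet):
    """Reduce a raw tweet dict to the keys of interest listed above."""
    user = tweet.get("user", {})
    return {
        "created_at": tweet.get("created_at"),
        "id": tweet.get("id"),
        "text": tweet.get("text") or tweet.get("full_text"),
        "in_reply_to_status_id": tweet.get("in_reply_to_status_id"),
        "in_reply_to_user_id": tweet.get("in_reply_to_user_id"),
        "user_id": user.get("id"),
        "user_location": user.get("location"),
        "user_time_zone": user.get("time_zone"),
        "place": tweet.get("place"),
        "coordinates": tweet.get("coordinates"),
        "is_retweet": "retweeted_status" in tweet,
        "reply_count": tweet.get("reply_count"),  # only populated on premium/enterprise payloads
        "retweet_count": tweet.get("retweet_count"),
        "entities": tweet.get("entities"),
        "favorited": tweet.get("favorited"),
        "retweeted": tweet.get("retweeted"),
    }
```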

Non-Twitter Data
