This repository has been archived by the owner on Nov 21, 2022. It is now read-only.

Creating a Data Set

malteserteresa edited this page Jun 16, 2019 · 16 revisions

Strategy

The original strategy was to use Twitter's Premium API to build a large dataset (i.e., millions of observations) from historical tweets, focussing on known victims and instances of online attacks, such as Diane Abbott or the bank note mob attack. We thought this was possible because this project was granted access to Twitter's Premium Sandbox API.

On closer inspection of the documentation, this is not going to be possible. It appears that a maximum of only 5000 tweets per month will be retrievable from the Search Tweets: Full-archive endpoint (i.e., tweets since 2006). This is because the Premium Sandbox API is limited to 50 requests per month to this endpoint, with each request returning a maximum of 100 tweets. The endpoint that retrieves tweets from the last 30 days is limited to 250 requests per month.
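The arithmetic behind these limits can be sketched as follows. This is a minimal illustration using only the figures quoted above; the function names are hypothetical, and applying the same 100-tweet cap to the 30-day endpoint is an assumption.

```python
# Per-request cap stated above for the full-archive endpoint.
MAX_TWEETS_PER_REQUEST = 100

# Requests allowed per month on the Premium Sandbox tier.
MONTHLY_REQUESTS = {
    "full_archive": 50,
    "30_day": 250,  # assumption: the same 100-tweet cap applies to this endpoint
}


def monthly_tweet_ceiling(requests_per_month, tweets_per_request=MAX_TWEETS_PER_REQUEST):
    """Upper bound on the number of tweets retrievable in one month."""
    return requests_per_month * tweets_per_request


def months_needed(target_tweets, requests_per_month):
    """Whole months needed to collect target_tweets at the monthly ceiling."""
    ceiling = monthly_tweet_ceiling(requests_per_month)
    return -(-target_tweets // ceiling)  # ceiling division


if __name__ == "__main__":
    for endpoint, requests in MONTHLY_REQUESTS.items():
        print(endpoint, monthly_tweet_ceiling(requests))
    print(months_needed(1_000_000, MONTHLY_REQUESTS["full_archive"]))
```

At 5000 tweets per month, a dataset of one million tweets from the full archive would take 200 months to collect, which is why the original plan was dropped.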

One way to overcome this is to upgrade from the Sandbox to the paid Premium API. Returning 1.25 million tweets would cost approximately €1700, which is certainly not feasible at this stage without funding.

In addition, tweets seem to vanish when either Twitter or the account holder flags them as abuse (good), which leaves us with no way of retrieving them (bad).

Current Strategy

The current strategy is to take a shotgun approach.

  • The standard 7-day search Twitter API will be used to retrieve tweets matched by keyword or phrase (from compiled lists) and tweets from known user accounts. The maximum rate limit is 450 requests per 15-minute window. The script will be automated to run continuously using something like cron.
  • Datasets available in the literature will be collected.
  • Known past harassment cases will be scraped (but remember that many harassing tweets may have been removed).
  • Bootstrapping - selecting some keywords to identify other keywords to use for collecting the data
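The bootstrapping idea in the last bullet could look something like the sketch below. It is one simple way to do it, not the project's actual method: it counts the words that co-occur with the seed keywords and returns the most frequent as candidates. The function name, the tokenisation rule and the restriction to single-word seeds are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a crude tokeniser that keeps lowercase letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def candidate_keywords(tweets, seed_keywords, stopwords=(), top_n=10):
    """Return the words that most often co-occur with any seed keyword.

    tweets is an iterable of tweet texts; seed_keywords are single words.
    """
    seeds = {keyword.lower() for keyword in seed_keywords}
    ignored = seeds | {word.lower() for word in stopwords}
    counts = Counter()
    for text in tweets:
        # Use a set so that each word is counted once per tweet.
        tokens = set(TOKEN_PATTERN.findall(text.lower()))
        if tokens & seeds:
            counts.update(tokens - ignored)
    return [word for word, _ in counts.most_common(top_n)]
```

The candidates would still need manual review before being added to the compiled keyword lists, since frequent co-occurring words are often harmless.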

Twitter Data Analysis

Unique Nature of Twitter

Users interact on the social networking platform Twitter with each other through follows, replies, retweets, likes, quotes, and mentions.

Below is taken from the Twitter documentation:

Follows

If you follow someone on Twitter, it means you are subscribing to their Tweets, their updates will appear in your Home timeline, and that person is able to send you Direct Messages.

Retweets

A Retweet is a re-posting of a Tweet. Twitter's Retweet feature helps you and others quickly share that Tweet with all of your followers. You can Retweet your own Tweets or Tweets from someone else. Sometimes people type "RT" at the beginning of a Tweet to indicate that they are re-posting someone else's content. This isn't an official Twitter command or feature, but signifies that they are quoting another person's Tweet.

Replies

A reply is a response to another person’s Tweet. When you reply to someone else, your Tweet will show the message Replying to... when viewed in your profile page timeline. Replies from people with protected Tweets will only be visible to their followers. If someone sends you a reply and you are not following them, the reply will not appear in your Home timeline. Instead, the reply will appear in your Notifications tab.

Mentions

A mention is a Tweet that contains another person’s username anywhere in the body of the Tweet. Visiting another account’s profile page on Twitter will not display Tweets that mention them.

Likes

Quotes

  • Hashtags:
  • URLs: the presence of a URL in a tweet usually indicates that the content is an index for a longer explanatory story pointed to by the URL
  • Emojis
  • Emoticons - the useful resource below discusses these in the context of sentiment analysis
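These textual features can be pulled out of the raw tweet text with regular expressions. The sketch below is illustrative only: the patterns are simplified assumptions (for example, the emoticon pattern covers just a few common forms), and the API's own entities field is likely to be more reliable where it is available.

```python
import re

# Simplified patterns, assumed for illustration.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w{1,15})")  # Twitter usernames are at most 15 characters
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def extract_features(text):
    """Extract hashtags, mentions, URLs and emoticons from tweet text."""
    urls = URL.findall(text)
    # Remove URLs first so that fragments such as "#section" inside a link
    # are not counted as hashtags.
    stripped = URL.sub(" ", text)
    return {
        "hashtags": HASHTAG.findall(stripped),
        "mentions": MENTION.findall(stripped),
        "urls": urls,
        "emoticons": EMOTICON.findall(stripped),
        # A leading "RT" is the informal re-posting convention described above.
        "is_manual_retweet": stripped.lstrip().upper().startswith("RT "),
    }
```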

Gathering Data from Twitter

The documentation on the Twitter APIs is a bit of a minefield, so I'll try to make things as clear as possible.

First off, there are lots of different endpoints, but the main ones are the Twitter Search API, the Twitter Streaming API, Decahose and Firehose.

  • Search API: find data through a search or username. Twitter’s Search API gives you access to tweets that have already been posted.
  • Streaming API: data as tweets happen, in near real-time.
  • Decahose: similar to Twitter’s Streaming API, but returns 10% of all matched tweets
  • Firehose: similar to Twitter’s Streaming API, but returns 100% of all matched tweets

Decahose and Firehose have different sampling algorithms to the other endpoints and attempt to return a more representative random sample.

Some of the non-hose APIs, such as search, also come in different levels of access (premium, enterprise, etc.), which limit the rate at which you can retrieve data. This does not apply to all non-hose APIs.
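Whatever the access level, a collection script has to stay inside its rate limit. Below is a minimal sketch of client-side pacing, using the 450 requests per 15-minute window quoted earlier for the standard search API. Here fetch_page is a hypothetical placeholder for whatever function actually calls the API; it is not part of any Twitter library.

```python
import time

WINDOW_SECONDS = 15 * 60
MAX_REQUESTS_PER_WINDOW = 450


def min_interval(max_requests=MAX_REQUESTS_PER_WINDOW, window_seconds=WINDOW_SECONDS):
    """Smallest gap between requests that keeps usage within the window."""
    return window_seconds / max_requests


def paced_calls(fetch_page, queries, interval=None, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_page(query) for each query, at most once per interval.

    fetch_page is a hypothetical callable supplied by the caller.
    sleep and clock are injectable so the pacing can be tested without waiting.
    """
    if interval is None:
        interval = min_interval()
    results = []
    last_call = None
    for query in queries:
        if last_call is not None:
            wait = interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch_page(query))
    return results
```

With these figures the script would issue at most one request every two seconds. Spacing requests evenly is a design choice; bursting 450 requests and then waiting out the window would also respect the limit.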

Sampling

TBC

Useful endpoints so far

  1. Search for tweets containing word X

  2. Get a tweet by id

    • id
    • pro: most datasets on GitHub are stored as tweet ids, so this is very useful for obtaining datasets from the literature
    • con:
  3. Get a user's timeline

    • user_timelines
    • pro:
    • con: returns only the user's own timeline and nothing more (i.e., not where they were mentioned, retweeted, etc.)
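Because published datasets usually ship only tweet ids, rebuilding them means looking the ids up in batches. The sketch below only prepares the batches and request parameters and makes no network call. The batch size of 100 and the comma-separated id parameter are assumptions based on the v1.1 statuses/lookup endpoint, and the helper names are hypothetical.

```python
def chunk_ids(tweet_ids, batch_size=100):
    """Yield de-duplicated tweet ids in batches of at most batch_size.

    Assumption: the lookup endpoint accepts up to 100 ids per request.
    """
    unique_ids = list(dict.fromkeys(tweet_ids))  # drop duplicates, keep order
    for start in range(0, len(unique_ids), batch_size):
        yield unique_ids[start:start + batch_size]


def lookup_params(id_batch):
    """Build the query parameters for one lookup request (assumed format)."""
    return {"id": ",".join(str(tweet_id) for tweet_id in id_batch)}
```

Tweets that have since been deleted or suspended will simply be missing from the response, so the rebuilt dataset is likely to be smaller than the published list of ids.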

Tweet Metadata

  • favorite_count:
  • in_reply_to_X: if not null, the tweet is a reply to another tweet
  • retweet_count: the number of times the tweet has been retweeted; the count is kept on the original tweet
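These fields make it possible to label how a tweet relates to other tweets. The sketch below assumes v1.1-style tweet dictionaries; the retweeted_status and is_quote_status keys and the order of the checks are assumptions for illustration.

```python
def interaction_type(tweet):
    """Classify a tweet dictionary as a retweet, quote, reply or original.

    Retweets are checked first because a retweet of a quote or a reply
    carries those markers as well.
    """
    if tweet.get("retweeted_status") is not None:
        return "retweet"
    if tweet.get("is_quote_status"):
        return "quote"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"
```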

User Metadata

  • friends_count: people that the user follows
  • followers_count: people following the user
  • statuses_count: a status is also a tweet. In this case this is the number of tweets the user has posted

Metadata of interest

    created_at - information on when the tweet was created
    id - tweet id
    text
    in_reply_to_status_id
    in_reply_to_user_id 
    user.id
    user.location
    user.time_zone? → proxy for location with place?
    place/coordinates
    retweeted_status - representation of original tweet that was retweeted
    reply_count
    retweet_count
    entities
    favorited
    retweeted
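A small helper can reduce each raw tweet to just these fields before storage. This is a sketch that assumes tweets arrive as nested dictionaries (parsed JSON); the helper names are hypothetical and the field list is a subset of the one above.

```python
FIELDS_OF_INTEREST = [
    "created_at",
    "id",
    "text",
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "user.id",
    "user.location",
    "reply_count",
    "retweet_count",
    "favorited",
    "retweeted",
]


def get_path(record, dotted_key, default=None):
    """Follow a dotted key such as "user.id" through nested dictionaries."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return default
        value = value[part]
    return value


def flatten_tweet(tweet, fields=FIELDS_OF_INTEREST):
    """Return a flat dictionary containing only the requested fields."""
    return {field: get_path(tweet, field) for field in fields}
```

Missing fields come back as None rather than raising an error, which keeps the rows uniform when some attributes are absent from a tweet.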

Useful resource

Unintended Bias

Unintended bias is the unintended but systematic difference in the performance of a classification model across different demographic groups.

This is a current issue with many models, such as the Perspective API, which classes "You are a feminist" as toxic. We can test for this skew using metrics built by the Jigsaw team, which are derived from ROC-AUC, Equality Gap and Mann-Whitney U metrics.
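As a rough illustration of the AUC-based metrics, the sketch below computes ROC-AUC from its Mann-Whitney U formulation and then restricts it to slices of the data. The subgroup and BPSN ("background positive, subgroup negative") slices are written from our understanding of Jigsaw's published definitions; treat them as a simplified sketch rather than a reference implementation.

```python
def auc(labels, scores):
    """ROC-AUC as the fraction of (toxic, non-toxic) pairs ranked correctly.

    Equivalent to the Mann-Whitney U statistic divided by n_pos * n_neg.
    Returns None if either class is missing.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def subgroup_auc(labels, scores, in_subgroup):
    """AUC restricted to examples that mention the identity subgroup."""
    rows = [(y, s) for y, s, g in zip(labels, scores, in_subgroup) if g]
    return auc([y for y, _ in rows], [s for _, s in rows])


def bpsn_auc(labels, scores, in_subgroup):
    """AUC on toxic background examples versus non-toxic subgroup examples.

    A low value means that harmless comments mentioning the subgroup are
    scored as more toxic than genuinely toxic comments elsewhere.
    """
    rows = [
        (y, s)
        for y, s, g in zip(labels, scores, in_subgroup)
        if (g and y == 0) or (not g and y == 1)
    ]
    return auc([y for y, _ in rows], [s for _, s in rows])
```

A model that scores a harmless sentence mentioning an identity term above truly toxic sentences can still have a good overall AUC, while its BPSN AUC for that subgroup drops towards zero.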

Non Twitter Data
