
Commit

- Possibility to save all tweets in a single file (.ji format);
- Save datetime of tweet as `datetime` field when using MongoDB Pipeline;
- Additional Index on MongoDB;
- Option to ignore or update existing data when using MongoDB Pipeline;
Roberto Correia committed Jan 29, 2018
1 parent 93423a1 commit 12fb608
Showing 14 changed files with 262 additions and 235 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
.idea/
venv/
Data/
pyenv/

74 changes: 12 additions & 62 deletions README.md
@@ -1,66 +1,16 @@
# Introduction #
`TweetScraper` can get tweets from [Twitter Search](https://twitter.com/search-home).
It is built on [Scrapy](http://scrapy.org/) without using [Twitter's APIs](https://dev.twitter.com/rest/public).
The crawled data is not as *clean* as the one obtained by the APIs, but the benefits are you can get rid of the API's rate limits and restrictions. Ideally, you can get all the data from Twitter Search.
This is a fork of [TweetScraper](https://github.com/jonbakerfish/TweetScraper), with some additional features, such as:

**WARNING:** please be polite and follow the [crawler's politeness policy](https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy).


# Installation #
It requires [Scrapy](http://scrapy.org/) and [PyMongo](https://api.mongodb.org/python/current/) (also install [MongoDB](https://www.mongodb.org/) if you want to save the data to a database). Setting up:

$ git clone https://github.com/jonbakerfish/TweetScraper.git
$ cd TweetScraper/
$ pip install -r requirements.txt #add '--user' if you are not root
$ scrapy list
$ #If the output is 'TweetScraper', then you are ready to go.

# Usage #
1. Change the `USER_AGENT` in `TweetScraper/settings.py` to identify who you are

USER_AGENT = 'your website/e-mail'

2. In the root folder of this project, run a command like:

scrapy crawl TweetScraper -a query=foo,#bar

where `query` is a list of keywords separated by commas (`,`). The query can be anything (a keyword, hashtag, etc.) you want to search for in [Twitter Search](https://twitter.com/search-home). `TweetScraper` will crawl the search results of the query and save the tweet content and user information. You can also use the following operators in each query (from [Twitter Search](https://twitter.com/search-home)):

| Operator | Finds tweets... |
| --- | --- |
| twitter search | containing both "twitter" and "search". This is the default operator. |
| **"** happy hour **"** | containing the exact phrase "happy hour". |
| love **OR** hate | containing either "love" or "hate" (or both). |
| beer **-** root | containing "beer" but not "root". |
| **#** haiku | containing the hashtag "haiku". |
| **from:** alexiskold | sent from person "alexiskold". |
| **to:** techcrunch | sent to person "techcrunch". |
| **@** mashable | referencing person "mashable". |
| "happy hour" **near:** "san francisco" | containing the exact phrase "happy hour" and sent near "san francisco". |
| **near:** NYC **within:** 15mi | sent within 15 miles of "NYC". |
| superhero **since:** 2010-12-27 | containing "superhero" and sent since date "2010-12-27" (year-month-day). |
| ftw **until:** 2010-12-27 | containing "ftw" and sent up to date "2010-12-27". |
| movie -scary **:)** | containing "movie", but not "scary", and with a positive attitude. |
| flight **:(** | containing "flight" and with a negative attitude. |
| traffic **?** | containing "traffic" and asking a question. |
| hilarious **filter:links** | containing "hilarious" and linking to URLs. |
| news **source:twitterfeed** | containing "news" and entered via TwitterFeed. |
- Possibility to save all tweets in a single file (`.ji` format);
- Save datetime of tweet as `datetime` field when using MongoDB Pipeline;
- Additional Index on MongoDB;
- Option to ignore or update existing data when using MongoDB Pipeline;

3. By default, the tweets will be saved to disk as JSON files in `./Data/tweet/`, and user data in `./Data/user/`. Change `SAVE_TWEET_PATH` and `SAVE_USER_PATH` in `TweetScraper/settings.py` if you want another location.
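
   A minimal sketch of the relevant settings, assuming plain string assignments in `TweetScraper/settings.py` (the paths are the defaults mentioned above):

       SAVE_TWEET_PATH = './Data/tweet/'  # one JSON file per tweet, named by tweet ID
       SAVE_USER_PATH = './Data/user/'    # one JSON file per user, named by user ID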
# Future features #
- Get extra info about the Twitter User (like total retweets, followers, etc);
- Get extra info about the Tweet (e.g. whether emoticons were used);
- Download media files (gifs, images, videos);

4. If you want to save the data to MongoDB, change the `ITEM_PIPELINES` in `TweetScraper/settings.py` from `TweetScraper.pipelines.SaveToFilePipeline` to `TweetScraper.pipelines.SaveToMongoPipeline`, as sketched below.
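
   For illustration, the switch might look like this in `TweetScraper/settings.py`; the dict form and priority value follow standard Scrapy conventions and are not copied from the repository:

       ITEM_PIPELINES = {
           # 'TweetScraper.pipelines.SaveToFilePipeline': 100,  # default: save each item to disk
           'TweetScraper.pipelines.SaveToMongoPipeline': 100,   # save items to MongoDB instead
       }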

### Other parameters
* `lang[DEFAULT='']` allows you to choose the language of the scraped tweets. This is not part of the query operators; it is a separate parameter in the search API URL.
* `top_tweet[DEFAULT=False]`, set to `True` if you want to query only top tweets instead of all of them.
* `crawl_user[DEFAULT=False]`, set to `True` if you also want to crawl the users who authored the tweets.

E.g.: `scrapy crawl TweetScraper -a query=foo -a crawl_user=True`
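
The parameters can be combined. For instance, a hypothetical run restricted to English top tweets that also crawls the tweet authors:

    scrapy crawl TweetScraper -a query=foo -a lang=en -a top_tweet=True -a crawl_user=True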


# Acknowledgement #
Keeping the crawler up to date requires continuous effort; we thank all the [contributors](https://github.com/jonbakerfish/TweetScraper/graphs/contributors) for their valuable work.


# License #
TweetScraper is released under the [GNU GENERAL PUBLIC LICENSE, Version 2](https://github.com/jonbakerfish/TweetScraper/blob/master/LICENSE)

Read the [TweetScraper official documentation](https://github.com/jonbakerfish/TweetScraper/blob/master/README.md) to learn how to install and use it.
36 changes: 0 additions & 36 deletions TweetScraper/items.py

This file was deleted.

2 changes: 2 additions & 0 deletions TweetScraper/items/__init__.py
@@ -0,0 +1,2 @@
from .tweet import Tweet
from .user import User
27 changes: 27 additions & 0 deletions TweetScraper/items/tweet.py
@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
from scrapy import Item, Field


class Tweet(Item):
ID = Field() # tweet id
url = Field() # tweet url
datetime = Field() # post time
text = Field() # text content
user_id = Field() # user id
usernameTweet = Field() # username of tweet

nbr_retweet = Field() # nbr of retweet
nbr_favorite = Field() # nbr of favorite
nbr_reply = Field() # nbr of reply

is_reply = Field() # boolean if the tweet is a reply or not
is_retweet = Field() # boolean if the tweet is just a retweet of another tweet

has_image = Field() # True/False, whether a tweet contains images
images = Field() # a list of image urls, empty if none

has_video = Field() # True/False, whether a tweet contains videos
videos = Field() # a list of video urls

has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
medias = Field() # a list of media
14 changes: 14 additions & 0 deletions TweetScraper/items/user.py
@@ -0,0 +1,14 @@
# -*- coding: utf-8 -*-
from scrapy import Item, Field


class User(Item):
ID = Field() # user id
name = Field() # user name
screen_name = Field() # user screen name
avatar = Field() # avatar url
location = Field() # city, country
nbr_tweets = Field() # nbr of tweets
nbr_following = Field() # nbr of following
nbr_followers = Field() # nbr of followers
nbr_likes = Field() # nbr of likes
99 changes: 0 additions & 99 deletions TweetScraper/pipelines.py

This file was deleted.

3 changes: 3 additions & 0 deletions TweetScraper/pipelines/__init__.py
@@ -0,0 +1,3 @@
from .save_to_file_pipeline import SaveToFilePipeline
from .save_to_mongo_pipeline import SaveToMongoPipeline
from .save_to_single_file_pipeline import SaveToSingleFilePipeline
46 changes: 46 additions & 0 deletions TweetScraper/pipelines/save_to_file_pipeline.py
@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
from scrapy.conf import settings
import logging
import os

from TweetScraper.items.tweet import Tweet
from TweetScraper.items.user import User
from TweetScraper.utils import mkdirs, save_to_file

logger = logging.getLogger(__name__)


class SaveToFilePipeline(object):
""" pipeline that save data to disk """

def __init__(self):
self.saveTweetPath = settings['SAVE_TWEET_PATH']
self.saveUserPath = settings['SAVE_USER_PATH']
mkdirs(self.saveTweetPath) # ensure the path exists
mkdirs(self.saveUserPath)

def process_item(self, item, spider):
if isinstance(item, Tweet):
save_path = os.path.join(self.saveTweetPath, item['ID'])
if os.path.isfile(save_path):
pass # simply skip existing items
# or you can rewrite the file, if you don't want to skip:
# self.save_to_file(item,savePath)
# logger.info("Update tweet:%s"%dbItem['url'])
else:
save_to_file(item, save_path)
logger.debug("Add tweet:%s" % item['url'])

elif isinstance(item, User):
save_path = os.path.join(self.saveUserPath, item['ID'])
if os.path.isfile(save_path):
pass # simply skip existing items
# or you can rewrite the file, if you don't want to skip:
# self.save_to_file(item,savePath)
# logger.info("Update user:%s"%dbItem['screen_name'])
else:
save_to_file(item, save_path)
logger.debug("Add user:%s" % item['screen_name'])

else:
logger.info("Item type is not recognized! type = %s" % type(item))
68 changes: 68 additions & 0 deletions TweetScraper/pipelines/save_to_mongo_pipeline.py
@@ -0,0 +1,68 @@
# -*- coding: utf-8 -*-
from datetime import datetime
from scrapy.conf import settings
import logging
import pymongo

from TweetScraper.items.tweet import Tweet
from TweetScraper.items.user import User

logger = logging.getLogger(__name__)


class SaveToMongoPipeline(object):
""" pipeline that save data to mongodb """

def __init__(self):
connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
self.updateItem = settings['MONGODB_UPDATE']

db = connection[settings['MONGODB_DB']]
self.tweetCollection = db[settings['MONGODB_TWEET_COLLECTION']]
self.userCollection = db[settings['MONGODB_USER_COLLECTION']]

self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
self.tweetCollection.ensure_index([('usernameTweet', pymongo.ASCENDING)])
self.tweetCollection.ensure_index([('datetime', pymongo.ASCENDING)])
self.tweetCollection.ensure_index([('user_id', pymongo.ASCENDING)])
self.userCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)

# convert field types (from string to int and datetime)
def convert_fields(self, item):
mongo_entity = dict(item)

# convert string datetime to true datetime
mongo_entity['datetime'] = datetime.strptime(mongo_entity['datetime'], "%Y-%m-%d %H:%M:%S")
mongo_entity['ID'] = int(mongo_entity['ID']) # convert id to a number
mongo_entity['user_id'] = int(mongo_entity['user_id']) # convert user_id to a number

return mongo_entity

def process_item(self, item, spider):
if isinstance(item, Tweet):
db_item = self.tweetCollection.find_one({'ID': item['ID']})
if db_item:
if self.updateItem:
mongo_entity = self.convert_fields(item)
db_item.update(mongo_entity)
self.tweetCollection.save(db_item)
logger.info("Update tweet: %s" % db_item['url'])

else:
mongo_entity = self.convert_fields(item)
self.tweetCollection.insert_one(mongo_entity)
logger.debug("Add tweet: %s" % item['url'])

elif isinstance(item, User):
db_item = self.userCollection.find_one({'ID': item['ID']})
if db_item:
if self.updateItem:
db_item.update(dict(item))
self.userCollection.save(db_item)
logger.info("Update user: %s" % db_item['screen_name'])
else:
self.userCollection.insert_one(dict(item))
logger.debug("Add user: %s" % item['screen_name'])

else:
logger.info("Item type is not recognized! type = %s" % type(item))
