
Commit

- Possibility to save all tweets in a single file (.ji format);
- Save datetime of tweet as `datetime` field when using MongoDB Pipeline;
- Additional Index on MongoDB;
- Option to ignore or update existing data when using MongoDB Pipeline;
Roberto Correia committed Jan 29, 2018
1 parent 93423a1 commit 12fb608
Showing 14 changed files with 262 additions and 235 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
.idea/
venv/
Data/
pyenv/

74 changes: 12 additions & 62 deletions README.md
@@ -1,66 +1,16 @@
# Introduction #
`TweetScraper` can get tweets from [Twitter Search](https://twitter.com/search-home).
It is built on [Scrapy](http://scrapy.org/) without using [Twitter's APIs](https://dev.twitter.com/rest/public).
The crawled data is not as *clean* as the one obtained by the APIs, but the benefits are you can get rid of the API's rate limits and restrictions. Ideally, you can get all the data from Twitter Search.
This is a fork of [TweetScraper](https://github.com/jonbakerfish/TweetScraper), with some additional features, such as:

**WARNING:** please be polite and follow the [crawler's politeness policy](https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy).


# Installation #
It requires [Scrapy](http://scrapy.org/) and [PyMongo](https://api.mongodb.org/python/current/) (also install [MongoDB](https://www.mongodb.org/) if you want to save the data to a database). Setting up:

$ git clone https://github.com/jonbakerfish/TweetScraper.git
$ cd TweetScraper/
$ pip install -r requirements.txt #add '--user' if you are not root
$ scrapy list
$ #If the output is 'TweetScraper', then you are ready to go.

# Usage #
1. Change the `USER_AGENT` in `TweetScraper/settings.py` to identify who you are

USER_AGENT = 'your website/e-mail'

2. In the root folder of this project, run a command like:

scrapy crawl TweetScraper -a query=foo,#bar

where `query` is a list of keywords separated by commas (`,`). The query can be anything (a keyword, hashtag, etc.) you want to search for in [Twitter Search](https://twitter.com/search-home). `TweetScraper` will crawl the search results of the query and save the tweet content and user information. You can also use the following operators in each query (from [Twitter Search](https://twitter.com/search-home)):

| Operator | Finds tweets... |
| --- | --- |
| twitter search | containing both "twitter" and "search". This is the default operator. |
| **"** happy hour **"** | containing the exact phrase "happy hour". |
| love **OR** hate | containing either "love" or "hate" (or both). |
| beer **-** root | containing "beer" but not "root". |
| **#** haiku | containing the hashtag "haiku". |
| **from:** alexiskold | sent from person "alexiskold". |
| **to:** techcrunch | sent to person "techcrunch". |
| **@** mashable | referencing person "mashable". |
| "happy hour" **near:** "san francisco" | containing the exact phrase "happy hour" and sent near "san francisco". |
| **near:** NYC **within:** 15mi | sent within 15 miles of "NYC". |
| superhero **since:** 2010-12-27 | containing "superhero" and sent since date "2010-12-27" (year-month-day). |
| ftw **until:** 2010-12-27 | containing "ftw" and sent up to date "2010-12-27". |
| movie -scary **:)** | containing "movie", but not "scary", and with a positive attitude. |
| flight **:(** | containing "flight" and with a negative attitude. |
| traffic **?** | containing "traffic" and asking a question. |
| hilarious **filter:links** | containing "hilarious" and linking to URLs. |
| news **source:twitterfeed** | containing "news" and entered via TwitterFeed. |
- Possibility to save all tweets in a single file (`.ji` format);
- Save datetime of tweet as `datetime` field when using MongoDB Pipeline;
- Additional Index on MongoDB;
- Option to ignore or update existing data when using MongoDB Pipeline;

3. By default, the tweets will be saved to disk as JSON files in `./Data/tweet/`, and user data in `./Data/user/`. Change `SAVE_TWEET_PATH` and `SAVE_USER_PATH` in `TweetScraper/settings.py` if you want another location.
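
   A minimal sketch of the relevant settings, assuming plain string assignments in `TweetScraper/settings.py` (the paths are the defaults mentioned above):

       SAVE_TWEET_PATH = './Data/tweet/'  # one JSON file per tweet, named by tweet ID
       SAVE_USER_PATH = './Data/user/'    # one JSON file per user, named by user ID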
# Future features #
- Get extra info about the Twitter User (like total retweets, followers, etc);
- Get extra info about the Tweet (e.g. whether emoticons were used);
- Download media files (gifs, images, videos);

4. If you want to save the data to MongoDB, change the `ITEM_PIPELINES` in `TweetScraper/settings.py` from `TweetScraper.pipelines.SaveToFilePipeline` to `TweetScraper.pipelines.SaveToMongoPipeline`, as sketched below.
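
   For illustration, the switch might look like this in `TweetScraper/settings.py`; the dict form and priority value follow standard Scrapy conventions and are not copied from the repository:

       ITEM_PIPELINES = {
           # 'TweetScraper.pipelines.SaveToFilePipeline': 100,  # default: save each item to disk
           'TweetScraper.pipelines.SaveToMongoPipeline': 100,   # save items to MongoDB instead
       }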

### Other parameters
* `lang[DEFAULT='']` allows you to choose the language of the scraped tweets. This is not part of the query operators; it is a separate parameter in the search API URL.
* `top_tweet[DEFAULT=False]`, set to `True` if you want to query only top tweets instead of all of them.
* `crawl_user[DEFAULT=False]`, set to `True` if you also want to crawl the users who authored the tweets.

E.g.: `scrapy crawl TweetScraper -a query=foo -a crawl_user=True`
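
The parameters can be combined. For instance, a hypothetical run restricted to English top tweets that also crawls the tweet authors:

    scrapy crawl TweetScraper -a query=foo -a lang=en -a top_tweet=True -a crawl_user=True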


# Acknowledgement #
Keeping the crawler up to date requires continuous effort; we thank all the [contributors](https://github.com/jonbakerfish/TweetScraper/graphs/contributors) for their valuable work.


# License #
TweetScraper is released under the [GNU GENERAL PUBLIC LICENSE, Version 2](https://github.com/jonbakerfish/TweetScraper/blob/master/LICENSE)

Read the [TweetScraper official documentation](https://github.com/jonbakerfish/TweetScraper/blob/master/README.md) to learn how to install and use it.
36 changes: 0 additions & 36 deletions TweetScraper/items.py

This file was deleted.

2 changes: 2 additions & 0 deletions TweetScraper/items/__init__.py
@@ -0,0 +1,2 @@
from .tweet import Tweet
from .user import User
27 changes: 27 additions & 0 deletions TweetScraper/items/tweet.py
@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
from scrapy import Item, Field


class Tweet(Item):
ID = Field() # tweet id
url = Field() # tweet url
datetime = Field() # post time
text = Field() # text content
user_id = Field() # user id
usernameTweet = Field() # username of tweet

nbr_retweet = Field() # nbr of retweet
nbr_favorite = Field() # nbr of favorite
nbr_reply = Field() # nbr of reply

is_reply = Field() # boolean if the tweet is a reply or not
is_retweet = Field() # boolean if the tweet is just a retweet of another tweet

has_image = Field() # True/False, whether a tweet contains images
images = Field() # a list of image urls, empty if none

has_video = Field() # True/False, whether a tweet contains videos
videos = Field() # a list of video urls

has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
medias = Field() # a list of media
14 changes: 14 additions & 0 deletions TweetScraper/items/user.py
@@ -0,0 +1,14 @@
# -*- coding: utf-8 -*-
from scrapy import Item, Field


class User(Item):
ID = Field() # user id
name = Field() # user name
screen_name = Field() # user screen name
avatar = Field() # avatar url
location = Field() # city, country
nbr_tweets = Field() # nbr of tweets
nbr_following = Field() # nbr of following
nbr_followers = Field() # nbr of followers
nbr_likes = Field() # nbr of likes
99 changes: 0 additions & 99 deletions TweetScraper/pipelines.py

This file was deleted.

3 changes: 3 additions & 0 deletions TweetScraper/pipelines/__init__.py
@@ -0,0 +1,3 @@
from .save_to_file_pipeline import SaveToFilePipeline
from .save_to_mongo_pipeline import SaveToMongoPipeline
from .save_to_single_file_pipeline import SaveToSingleFilePipeline
46 changes: 46 additions & 0 deletions TweetScraper/pipelines/save_to_file_pipeline.py
@@ -0,0 +1,46 @@
# -*- coding: utf-8 -*-
from scrapy.conf import settings
import logging
import os

from TweetScraper.items.tweet import Tweet
from TweetScraper.items.user import User
from TweetScraper.utils import mkdirs, save_to_file

logger = logging.getLogger(__name__)


class SaveToFilePipeline(object):
""" pipeline that save data to disk """

def __init__(self):
self.saveTweetPath = settings['SAVE_TWEET_PATH']
self.saveUserPath = settings['SAVE_USER_PATH']
mkdirs(self.saveTweetPath) # ensure the path exists
mkdirs(self.saveUserPath)

def process_item(self, item, spider):
if isinstance(item, Tweet):
save_path = os.path.join(self.saveTweetPath, item['ID'])
if os.path.isfile(save_path):
pass # simply skip existing items
# or you can rewrite the file, if you don't want to skip:
# self.save_to_file(item,savePath)
# logger.info("Update tweet:%s"%dbItem['url'])
else:
save_to_file(item, save_path)
logger.debug("Add tweet:%s" % item['url'])

elif isinstance(item, User):
save_path = os.path.join(self.saveUserPath, item['ID'])
if os.path.isfile(save_path):
pass # simply skip existing items
# or you can rewrite the file, if you don't want to skip:
# self.save_to_file(item,savePath)
# logger.info("Update user:%s"%dbItem['screen_name'])
else:
save_to_file(item, save_path)
logger.debug("Add user:%s" % item['screen_name'])

else:
logger.info("Item type is not recognized! type = %s" % type(item))
68 changes: 68 additions & 0 deletions TweetScraper/pipelines/save_to_mongo_pipeline.py
@@ -0,0 +1,68 @@
# -*- coding: utf-8 -*-
from datetime import datetime
from scrapy.conf import settings
import logging
import pymongo

from TweetScraper.items.tweet import Tweet
from TweetScraper.items.user import User

logger = logging.getLogger(__name__)


class SaveToMongoPipeline(object):
""" pipeline that save data to mongodb """

def __init__(self):
connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
self.updateItem = settings['MONGODB_UPDATE']

db = connection[settings['MONGODB_DB']]
self.tweetCollection = db[settings['MONGODB_TWEET_COLLECTION']]
self.userCollection = db[settings['MONGODB_USER_COLLECTION']]

self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
self.tweetCollection.ensure_index([('usernameTweet', pymongo.ASCENDING)])
self.tweetCollection.ensure_index([('datetime', pymongo.ASCENDING)])
self.tweetCollection.ensure_index([('user_id', pymongo.ASCENDING)])
self.userCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)

# convert field types (from string to int and datetime)
def convert_fields(self, item):
mongo_entity = dict(item)

# convert string datetime to true datetime
mongo_entity['datetime'] = datetime.strptime(mongo_entity['datetime'], "%Y-%m-%d %H:%M:%S")
mongo_entity['ID'] = int(mongo_entity['ID']) # convert id to a number
mongo_entity['user_id'] = int(mongo_entity['user_id']) # convert user_id to a number

return mongo_entity

def process_item(self, item, spider):
if isinstance(item, Tweet):
db_item = self.tweetCollection.find_one({'ID': item['ID']})
if db_item:
if self.updateItem:
mongo_entity = self.convert_fields(item)
db_item.update(mongo_entity)
self.tweetCollection.save(db_item)
logger.info("Update tweet: %s" % db_item['url'])

else:
mongo_entity = self.convert_fields(item)
self.tweetCollection.insert_one(mongo_entity)
logger.debug("Add tweet: %s" % item['url'])

elif isinstance(item, User):
db_item = self.userCollection.find_one({'ID': item['ID']})
if db_item:
if self.updateItem:
db_item.update(dict(item))
self.userCollection.save(db_item)
logger.info("Update user: %s" % db_item['screen_name'])
else:
self.userCollection.insert_one(dict(item))
logger.debug("Add user: %s" % item['screen_name'])

else:
logger.info("Item type is not recognized! type = %s" % type(item))
