diff --git a/.gitignore b/.gitignore
index 3da6951..80e52c4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,5 @@
+.idea/
+venv/
 Data/
 pyenv/
diff --git a/README.md b/README.md
index 9069174..fdef3d9 100644
--- a/README.md
+++ b/README.md
@@ -1,66 +1,16 @@
 # Introduction #
-`TweetScraper` can get tweets from [Twitter Search](https://twitter.com/search-home).
-It is built on [Scrapy](http://scrapy.org/) without using [Twitter's APIs](https://dev.twitter.com/rest/public).
-The crawled data is not as *clean* as the one obtained by the APIs, but the benefits are you can get rid of the API's rate limits and restrictions. Ideally, you can get all the data from Twitter Search.
+This is a fork of [TweetScraper](https://github.com/jonbakerfish/TweetScraper) with some additional features:
 
-**WARNING:** please be polite and follow the [crawler's politeness policy](https://en.wikipedia.org/wiki/Web_crawler#Politeness_policy).
-
-
-# Installation #
-It requires [Scrapy](http://scrapy.org/) and [PyMongo](https://api.mongodb.org/python/current/) (Also install [MongoDB](https://www.mongodb.org/) if you want to save the data to database). Setting up:
-
-    $ git clone https://github.com/jonbakerfish/TweetScraper.git
-    $ cd TweetScraper/
-    $ pip install -r requirements.txt #add '--user' if you are not root
-    $ scrapy list
-    $ #If the output is 'TweetScraper', then you are ready to go.
-
-# Usage #
-1. Change the `USER_AGENT` in `TweetScraper/settings.py` to identify who you are
-
-        USER_AGENT = 'your website/e-mail'
-
-2. In the root folder of this project, run command like:
-
-        scrapy crawl TweetScraper -a query=foo,#bar
-
-    where `query` is a list of keywords seperated by comma (`,`). The query can be any thing (keyword, hashtag, etc.) you want to search in [Twitter Search](https://twitter.com/search-home). `TweetScraper` will crawl the search results of the query and save the tweet content and user information. You can also use the following operators in each query (from [Twitter Search](https://twitter.com/search-home)):
-
-    | Operator | Finds tweets... |
-    | --- | --- |
-    | twitter search | containing both "twitter" and "search". This is the default operator. |
-    | **"** happy hour **"** | containing the exact phrase "happy hour". |
-    | love **OR** hate | containing either "love" or "hate" (or both). |
-    | beer **-** root | containing "beer" but not "root". |
-    | **#** haiku | containing the hashtag "haiku". |
-    | **from:** alexiskold | sent from person "alexiskold". |
-    | **to:** techcrunch | sent to person "techcrunch". |
-    | **@** mashable | referencing person "mashable". |
-    | "happy hour" **near:** "san francisco" | containing the exact phrase "happy hour" and sent near "san francisco". |
-    | **near:** NYC **within:** 15mi | sent within 15 miles of "NYC". |
-    | superhero **since:** 2010-12-27 | containing "superhero" and sent since date "2010-12-27" (year-month-day). |
-    | ftw **until:** 2010-12-27 | containing "ftw" and sent up to date "2010-12-27". |
-    | movie -scary **:)** | containing "movie", but not "scary", and with a positive attitude. |
-    | flight **:(** | containing "flight" and with a negative attitude. |
-    | traffic **?** | containing "traffic" and asking a question. |
-    | hilarious **filter:links** | containing "hilarious" and linking to URLs. |
-    | news **source:twitterfeed** | containing "news" and entered via TwitterFeed |
+- Option to save all tweets to a single `.ji` file (one JSON object per line);
+- The tweet's post time is stored as a real `datetime` field when using the MongoDB pipeline;
+- Additional indexes on the MongoDB collections;
+- Option to either skip or update existing records when using the MongoDB pipeline.
 
-3. The tweets will be saved to disk in `./Data/tweet/` in default settings and `./Data/user/` is for user data. The file format is JSON. Change the `SAVE_TWEET_PATH` and `SAVE_USER_PATH` in `TweetScraper/settings.py` if you want another location.
+# Future features #
+- Get extra info about the Twitter user (e.g. total retweets, followers);
+- Get extra info about the tweet (e.g. whether it uses emoticons);
+- Download media files (GIFs, images, videos).
 
-4. In you want to save the data to MongoDB, change the `ITEM_PIPELINES` in `TweetScraper/settings.py` from `TweetScraper.pipelines.SaveToFilePipeline` to `TweetScraper.pipelines.SaveToMongoPipeline`.
-
-### Other parameters
-* `lang[DEFAULT='']` allow to choose the language of tweet scrapped. This is not part of the query parameters, it is a different part in the search API URL
-* `top_tweet[DEFAULT=False]`, if you want to query only top_tweets or all of them
-* `crawl_user[DEFAULT=False]`, if you want to crawl users, author's of tweets in the same time
-
-E.g.: `scrapy crawl TweetScraper -a query=foo -a crawl_user=True`
-
-
-# Acknowledgement #
-Keeping the crawler up to date requires continuous efforts, we thank all the [contributors](https://github.com/jonbakerfish/TweetScraper/graphs/contributors) for their valuable work.
-
-
-# License #
-TweetScraper is released under the [GNU GENERAL PUBLIC LICENSE, Version 2](https://github.com/jonbakerfish/TweetScraper/blob/master/LICENSE)
+
+Read the [official TweetScraper documentation](https://github.com/jonbakerfish/TweetScraper/blob/master/README.md) to
+learn how to install and use this fork.
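For reviewers: a minimal sketch of how the new options in this fork are meant to be combined in `TweetScraper/settings.py`. The pipeline names, priorities and the `MONGODB_UPDATE` flag mirror the settings diff further down; pick a single pipeline and adjust the priorities to taste.

```python
# TweetScraper/settings.py (sketch, not the full file)

ITEM_PIPELINES = {
    # append every tweet/user to a single tweets.ji / users.ji file:
    # 'TweetScraper.pipelines.SaveToSingleFilePipeline': 100,

    # or write each tweet/user to its own JSON file (upstream behaviour):
    # 'TweetScraper.pipelines.SaveToFilePipeline': 200,

    # or store tweets and users in MongoDB:
    'TweetScraper.pipelines.SaveToMongoPipeline': 300,
}

MONGODB_UPDATE = True  # update records that already exist instead of skipping them
```

The spider itself is still launched the same way as upstream, e.g. `scrapy crawl TweetScraper -a query=foo -a crawl_user=True`.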
diff --git a/TweetScraper/items.py b/TweetScraper/items.py
deleted file mode 100644
index 05f2478..0000000
--- a/TweetScraper/items.py
+++ /dev/null
@@ -1,36 +0,0 @@
-# -*- coding: utf-8 -*-
-
-# Define here the models for your scraped items
-from scrapy import Item, Field
-
-
-class Tweet(Item):
-    ID = Field() # tweet id
-    url = Field() # tweet url
-    datetime = Field() # post time
-    text = Field() # text content
-    user_id = Field() # user id
-    usernameTweet = Field() # username of tweet
-
-    nbr_retweet = Field() # nbr of retweet
-    nbr_favorite = Field() # nbr of favorite
-    nbr_reply = Field() # nbr of reply
-
-    is_reply = Field() # boolean if the tweet is a reply or not
-    is_retweet = Field() # boolean if the tweet is just a retweet of another tweet
-
-    has_image = Field() # True/False, whether a tweet contains images
-    images = Field() # a list of image urls, empty if none
-
-    has_video = Field() # True/False, whether a tweet contains videos
-    videos = Field() # a list of video urls
-
-    has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
-    medias = Field() # a list of media
-
-
-class User(Item):
-    ID = Field() # user id
-    name = Field() # user name
-    screen_name = Field() # user screen name
-    avatar = Field() # avator url
diff --git a/TweetScraper/items/__init__.py b/TweetScraper/items/__init__.py
new file mode 100644
index 0000000..42d30da
--- /dev/null
+++ b/TweetScraper/items/__init__.py
@@ -0,0 +1,2 @@
+from .tweet import Tweet
+from .user import User
diff --git a/TweetScraper/items/tweet.py b/TweetScraper/items/tweet.py
new file mode 100644
index 0000000..e67546c
--- /dev/null
+++ b/TweetScraper/items/tweet.py
@@ -0,0 +1,27 @@
+# -*- coding: utf-8 -*-
+from scrapy import Item, Field
+
+
+class Tweet(Item):
+    ID = Field() # tweet id
+    url = Field() # tweet url
+    datetime = Field() # post time
+    text = Field() # text content
+    user_id = Field() # user id
+    usernameTweet = Field() # username of tweet
+
+    nbr_retweet = Field() # nbr of retweet
+    nbr_favorite = Field() # nbr of favorite
+    nbr_reply = Field() # nbr of reply
+
+    is_reply = Field() # boolean if the tweet is a reply or not
+    is_retweet = Field() # boolean if the tweet is just a retweet of another tweet
+
+    has_image = Field() # True/False, whether a tweet contains images
+    images = Field() # a list of image urls, empty if none
+
+    has_video = Field() # True/False, whether a tweet contains videos
+    videos = Field() # a list of video urls
+
+    has_media = Field() # True/False, whether a tweet contains media (e.g. summary)
+    medias = Field() # a list of media
diff --git a/TweetScraper/items/user.py b/TweetScraper/items/user.py
new file mode 100644
index 0000000..99df994
--- /dev/null
+++ b/TweetScraper/items/user.py
@@ -0,0 +1,14 @@
+# -*- coding: utf-8 -*-
+from scrapy import Item, Field
+
+
+class User(Item):
+    ID = Field() # user id
+    name = Field() # user name
+    screen_name = Field() # user screen name
+    avatar = Field() # avatar url
+    location = Field() # city, country
+    nbr_tweets = Field() # nbr of tweets
+    nbr_following = Field() # nbr of following
+    nbr_followers = Field() # nbr of followers
+    nbr_likes = Field() # nbr of likes
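Because `items.py` is now the `items/` package and its `__init__.py` re-exports both classes, existing imports keep working while the new pipeline files can import each item module directly. A quick sketch (the `print` lines are only illustrative):

```python
# Old-style import, still valid thanks to TweetScraper/items/__init__.py:
from TweetScraper.items import Tweet, User

# Equivalent per-module imports used by the new pipeline files:
# from TweetScraper.items.tweet import Tweet
# from TweetScraper.items.user import User

# Scrapy items expose their declared fields as a class-level dict:
print(Tweet.fields.keys())  # ID, url, datetime, text, user_id, ...
print(User.fields.keys())   # ID, name, screen_name, avatar, location, ...
```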
diff --git a/TweetScraper/pipelines.py b/TweetScraper/pipelines.py
deleted file mode 100644
index 2314177..0000000
--- a/TweetScraper/pipelines.py
+++ /dev/null
@@ -1,99 +0,0 @@
-# -*- coding: utf-8 -*-
-from scrapy.exceptions import DropItem
-from scrapy.conf import settings
-import logging
-import pymongo
-import json
-import os
-
-from TweetScraper.items import Tweet, User
-from TweetScraper.utils import mkdirs
-
-
-logger = logging.getLogger(__name__)
-
-class SaveToMongoPipeline(object):
-
-    ''' pipeline that save data to mongodb '''
-    def __init__(self):
-        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
-        db = connection[settings['MONGODB_DB']]
-        self.tweetCollection = db[settings['MONGODB_TWEET_COLLECTION']]
-        self.userCollection = db[settings['MONGODB_USER_COLLECTION']]
-        self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
-        self.userCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
-
-
-    def process_item(self, item, spider):
-        if isinstance(item, Tweet):
-            dbItem = self.tweetCollection.find_one({'ID': item['ID']})
-            if dbItem:
-                pass # simply skip existing items
-                ### or you can update the tweet, if you don't want to skip:
-                # dbItem.update(dict(item))
-                # self.tweetCollection.save(dbItem)
-                # logger.info("Update tweet:%s"%dbItem['url'])
-            else:
-                self.tweetCollection.insert_one(dict(item))
-                logger.debug("Add tweet:%s" %item['url'])
-
-        elif isinstance(item, User):
-            dbItem = self.userCollection.find_one({'ID': item['ID']})
-            if dbItem:
-                pass # simply skip existing items
-                ### or you can update the user, if you don't want to skip:
-                # dbItem.update(dict(item))
-                # self.userCollection.save(dbItem)
-                # logger.info("Update user:%s"%dbItem['screen_name'])
-            else:
-                self.userCollection.insert_one(dict(item))
-                logger.debug("Add user:%s" %item['screen_name'])
-
-        else:
-            logger.info("Item type is not recognized! type = %s" %type(item))
-
-
-
-class SaveToFilePipeline(object):
-    ''' pipeline that save data to disk '''
-    def __init__(self):
-        self.saveTweetPath = settings['SAVE_TWEET_PATH']
-        self.saveUserPath = settings['SAVE_USER_PATH']
-        mkdirs(self.saveTweetPath) # ensure the path exists
-        mkdirs(self.saveUserPath)
-
-
-    def process_item(self, item, spider):
-        if isinstance(item, Tweet):
-            savePath = os.path.join(self.saveTweetPath, item['ID'])
-            if os.path.isfile(savePath):
-                pass # simply skip existing items
-                ### or you can rewrite the file, if you don't want to skip:
-                # self.save_to_file(item,savePath)
-                # logger.info("Update tweet:%s"%dbItem['url'])
-            else:
-                self.save_to_file(item,savePath)
-                logger.debug("Add tweet:%s" %item['url'])
-
-        elif isinstance(item, User):
-            savePath = os.path.join(self.saveUserPath, item['ID'])
-            if os.path.isfile(savePath):
-                pass # simply skip existing items
-                ### or you can rewrite the file, if you don't want to skip:
-                # self.save_to_file(item,savePath)
-                # logger.info("Update user:%s"%dbItem['screen_name'])
-            else:
-                self.save_to_file(item, savePath)
-                logger.debug("Add user:%s" %item['screen_name'])
-
-        else:
-            logger.info("Item type is not recognized! type = %s" %type(item))
-
-
-    def save_to_file(self, item, fname):
-        ''' input:
-                item - a dict like object
-                fname - where to save
-        '''
-        with open(fname,'w') as f:
-            json.dump(dict(item), f)
diff --git a/TweetScraper/pipelines/__init__.py b/TweetScraper/pipelines/__init__.py
new file mode 100644
index 0000000..cc378aa
--- /dev/null
+++ b/TweetScraper/pipelines/__init__.py
@@ -0,0 +1,3 @@
+from .save_to_file_pipeline import SaveToFilePipeline
+from .save_to_mongo_pipeline import SaveToMongoPipeline
+from .save_to_single_file_pipeline import SaveToSingleFilePipeline
diff --git a/TweetScraper/pipelines/save_to_file_pipeline.py b/TweetScraper/pipelines/save_to_file_pipeline.py
new file mode 100644
index 0000000..a61ae22
--- /dev/null
+++ b/TweetScraper/pipelines/save_to_file_pipeline.py
@@ -0,0 +1,46 @@
+# -*- coding: utf-8 -*-
+from scrapy.conf import settings
+import logging
+import os
+
+from TweetScraper.items.tweet import Tweet
+from TweetScraper.items.user import User
+from TweetScraper.utils import mkdirs, save_to_file
+
+logger = logging.getLogger(__name__)
+
+
+class SaveToFilePipeline(object):
+    """ pipeline that saves each tweet/user to its own file on disk """
+
+    def __init__(self):
+        self.saveTweetPath = settings['SAVE_TWEET_PATH']
+        self.saveUserPath = settings['SAVE_USER_PATH']
+        mkdirs(self.saveTweetPath)  # ensure the path exists
+        mkdirs(self.saveUserPath)
+
+    def process_item(self, item, spider):
+        if isinstance(item, Tweet):
+            save_path = os.path.join(self.saveTweetPath, item['ID'])
+            if os.path.isfile(save_path):
+                pass  # simply skip existing items
+                # or rewrite the file, if you don't want to skip:
+                # save_to_file(item, save_path)
+                # logger.info("Update tweet:%s" % item['url'])
+            else:
+                save_to_file(item, save_path)
+                logger.debug("Add tweet:%s" % item['url'])
+
+        elif isinstance(item, User):
+            save_path = os.path.join(self.saveUserPath, item['ID'])
+            if os.path.isfile(save_path):
+                pass  # simply skip existing items
+                # or rewrite the file, if you don't want to skip:
+                # save_to_file(item, save_path)
+                # logger.info("Update user:%s" % item['screen_name'])
+            else:
+                save_to_file(item, save_path)
+                logger.debug("Add user:%s" % item['screen_name'])
+
+        else:
+            logger.info("Item type is not recognized! type = %s" % type(item))
diff --git a/TweetScraper/pipelines/save_to_mongo_pipeline.py b/TweetScraper/pipelines/save_to_mongo_pipeline.py
new file mode 100644
index 0000000..86c8630
--- /dev/null
+++ b/TweetScraper/pipelines/save_to_mongo_pipeline.py
@@ -0,0 +1,68 @@
+# -*- coding: utf-8 -*-
+from datetime import datetime
+from scrapy.conf import settings
+import logging
+import pymongo
+
+from TweetScraper.items.tweet import Tweet
+from TweetScraper.items.user import User
+
+logger = logging.getLogger(__name__)
+
+
+class SaveToMongoPipeline(object):
+    """ pipeline that saves data to MongoDB """
+
+    def __init__(self):
+        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
+        self.updateItem = settings['MONGODB_UPDATE']
+
+        db = connection[settings['MONGODB_DB']]
+        self.tweetCollection = db[settings['MONGODB_TWEET_COLLECTION']]
+        self.userCollection = db[settings['MONGODB_USER_COLLECTION']]
+
+        self.tweetCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
+        self.tweetCollection.ensure_index([('usernameTweet', pymongo.ASCENDING)])
+        self.tweetCollection.ensure_index([('datetime', pymongo.ASCENDING)])
+        self.tweetCollection.ensure_index([('user_id', pymongo.ASCENDING)])
+        self.userCollection.ensure_index([('ID', pymongo.ASCENDING)], unique=True, dropDups=True)
+
+    # convert field types (from string to int and datetime)
+    def convert_fields(self, item):
+        mongo_entity = dict(item)
+
+        # convert string datetime to true datetime
+        mongo_entity['datetime'] = datetime.strptime(mongo_entity['datetime'], "%Y-%m-%d %H:%M:%S")
+        mongo_entity['ID'] = int(mongo_entity['ID'])  # convert id to a number
+        mongo_entity['user_id'] = int(mongo_entity['user_id'])  # convert user_id to a number
+
+        return mongo_entity
+
+    def process_item(self, item, spider):
+        if isinstance(item, Tweet):
+            db_item = self.tweetCollection.find_one({'ID': int(item['ID'])})  # stored IDs are ints (see convert_fields)
+            if db_item:
+                if self.updateItem:
+                    mongo_entity = self.convert_fields(item)
+                    db_item.update(mongo_entity)
+                    self.tweetCollection.save(db_item)
+                    logger.info("Update tweet: %s" % db_item['url'])
+
+            else:
+                mongo_entity = self.convert_fields(item)
+                self.tweetCollection.insert_one(mongo_entity)
+                logger.debug("Add tweet: %s" % item['url'])
+
+        elif isinstance(item, User):
+            db_item = self.userCollection.find_one({'ID': item['ID']})
+            if db_item:
+                if self.updateItem:
+                    db_item.update(dict(item))
+                    self.userCollection.save(db_item)
+                    logger.info("Update user: %s" % db_item['screen_name'])
+            else:
+                self.userCollection.insert_one(dict(item))
+                logger.debug("Add user: %s" % item['screen_name'])
+
+        else:
+            logger.info("Item type is not recognized! type = %s" % type(item))
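To illustrate what the `datetime` conversion and the extra indexes buy you, here is a small query sketch against the data this pipeline writes. It assumes the default `MONGODB_*` settings from `settings.py` (server `127.0.0.1:27017`, database `TweetScraper`, collection `tweet`); the date in the filter is just an example value.

```python
from datetime import datetime

import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)
tweets = client["TweetScraper"]["tweet"]

# `datetime` is stored as a real datetime and indexed, so range queries
# can use the index instead of comparing strings.
recent = (
    tweets.find({"datetime": {"$gte": datetime(2018, 1, 1)}})
    .sort("datetime", pymongo.DESCENDING)
    .limit(10)
)

for tweet in recent:
    print(tweet["datetime"], tweet["usernameTweet"], tweet["text"])
```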
diff --git a/TweetScraper/pipelines/save_to_single_file_pipeline.py b/TweetScraper/pipelines/save_to_single_file_pipeline.py
new file mode 100644
index 0000000..d90482d
--- /dev/null
+++ b/TweetScraper/pipelines/save_to_single_file_pipeline.py
@@ -0,0 +1,45 @@
+# -*- coding: utf-8 -*-
+import json
+
+from scrapy.conf import settings
+import logging
+import os
+
+from TweetScraper.items.tweet import Tweet
+from TweetScraper.items.user import User
+from TweetScraper.utils import mkdirs
+
+logger = logging.getLogger(__name__)
+
+
+class SaveToSingleFilePipeline(object):
+    """ pipeline that appends all tweets/users to a single file per item type """
+
+    def __init__(self):
+        self.tweets_file = None
+        self.users_file = None
+
+        self.saveTweetPath = settings['SAVE_TWEET_PATH']
+        self.saveUserPath = settings['SAVE_USER_PATH']
+        mkdirs(self.saveTweetPath)  # ensure the path exists
+        mkdirs(self.saveUserPath)
+
+    def open_spider(self, spider):
+        self.tweets_file = open(os.path.join(self.saveTweetPath, "tweets.ji"), "w")
+        self.users_file = open(os.path.join(self.saveUserPath, "users.ji"), "w")
+
+    def close_spider(self, spider):
+        self.tweets_file.close()
+        self.users_file.close()
+
+    def process_item(self, item, spider):
+        if isinstance(item, Tweet):
+            self.tweets_file.write(json.dumps(dict(item)) + "\n")  # one JSON object per line
+            logger.debug("Add tweet:%s" % item['url'])
+
+        elif isinstance(item, User):
+            self.users_file.write(json.dumps(dict(item)) + "\n")  # one JSON object per line
+            logger.debug("Add user:%s" % item['screen_name'])
+
+        else:
+            logger.info("Item type is not recognized! type = %s" % type(item))
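Since the pipeline above writes one JSON object per line, the resulting `.ji` files can be read back line by line. A sketch, assuming the default `SAVE_TWEET_PATH` of `Data/tweet/`:

```python
import json

tweets = []
with open("Data/tweet/tweets.ji") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, if any
            tweets.append(json.loads(line))

print("loaded %d tweets" % len(tweets))
```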
diff --git a/TweetScraper/settings.py b/TweetScraper/settings.py
index b168481..2b3cadd 100644
--- a/TweetScraper/settings.py
+++ b/TweetScraper/settings.py
@@ -6,13 +6,14 @@
 # settings for spiders
 BOT_NAME = 'TweetScraper'
 LOG_LEVEL = 'INFO'
-DOWNLOAD_HANDLERS = {'s3': None,} # from http://stackoverflow.com/a/31233576/2297751, TODO
+DOWNLOAD_HANDLERS = {'s3': None, }  # from http://stackoverflow.com/a/31233576/2297751, TODO
 SPIDER_MODULES = ['TweetScraper.spiders']
 NEWSPIDER_MODULE = 'TweetScraper.spiders'
 
 ITEM_PIPELINES = {
-    'TweetScraper.pipelines.SaveToFilePipeline':100,
-    #'TweetScraper.pipelines.SaveToMongoPipeline':100, # replace `SaveToFilePipeline` with this to use MongoDB
+    # 'TweetScraper.pipelines.SaveToSingleFilePipeline': 100,
+    # 'TweetScraper.pipelines.SaveToFilePipeline': 200,
+    'TweetScraper.pipelines.SaveToMongoPipeline': 300,  # enabled by default in this fork; swap with a file pipeline above if preferred
 }
 
 # settings for where to save data on disk
@@ -22,8 +23,7 @@
 # settings for mongodb
 MONGODB_SERVER = "127.0.0.1"
 MONGODB_PORT = 27017
-MONGODB_DB = "TweetScraper" # database name to save the crawled data
-MONGODB_TWEET_COLLECTION = "tweet" # collection name to save tweets
-MONGODB_USER_COLLECTION = "user" # collection name to save users
-
-
+MONGODB_DB = "TweetScraper"  # database name to save the crawled data
+MONGODB_TWEET_COLLECTION = "tweet"  # collection name to save tweets
+MONGODB_USER_COLLECTION = "user"  # collection name to save users
+MONGODB_UPDATE = False  # update existing items instead of skipping them
diff --git a/TweetScraper/spiders/TweetCrawler.py b/TweetScraper/spiders/TweetCrawler.py
index c441ad5..319d8e8 100644
--- a/TweetScraper/spiders/TweetCrawler.py
+++ b/TweetScraper/spiders/TweetCrawler.py
@@ -1,12 +1,13 @@
-from scrapy.spiders import CrawlSpider, Rule
-from scrapy.selector import Selector
-from scrapy.conf import settings
-from scrapy import http
-from scrapy.shell import inspect_response # for debugging
-import re
 import json
-import time
 import logging
+
+from scrapy import http
+from scrapy.selector import Selector
+from scrapy.spiders import CrawlSpider
+
+from TweetScraper.items import Tweet
+from TweetScraper.items import User
+
 try:
     from urllib import quote # Python 2.X
 except ImportError:
@@ -14,8 +15,6 @@
 from datetime import datetime
 
-from TweetScraper.items import Tweet, User
-
 logger = logging.getLogger(__name__)
 
 
@@ -23,8 +22,8 @@ class TweetScraper(CrawlSpider):
     name = 'TweetScraper'
     allowed_domains = ['twitter.com']
 
-    def __init__(self, query='', lang='', crawl_user=False, top_tweet=False):
-
+    def __init__(self, query='', lang='', crawl_user=False, top_tweet=False, *args, **kwargs):
+        super().__init__(*args, **kwargs)
         self.query = query
         self.url = "https://twitter.com/i/search/timeline?l={}".format(lang)
 
@@ -32,7 +31,6 @@ def __init__(self, query='', lang='', crawl_user=False, top_tweet=False):
             self.url = self.url + "&f=tweets"
 
         self.url = self.url + "&q=%s&src=typed&max_position=%s"
-
         self.crawl_user = crawl_user
 
     def start_requests(self):
@@ -54,7 +52,7 @@ def parse_page(self, response):
 
     def parse_tweets_block(self, html_page):
         page = Selector(text=html_page)
-        ### for text only tweets
+        # for text only tweets
        items = page.xpath('//li[@data-item-type="tweet"]/div')
         for item in self.parse_tweet_item(items):
             yield item
@@ -64,14 +62,15 @@ def parse_tweet_item(self, items):
             try:
                 tweet = Tweet()
 
-                tweet['usernameTweet'] = item.xpath('.//span[@class="username u-dir u-textTruncate"]/b/text()').extract()[0]
+                tweet['usernameTweet'] = \
+                    item.xpath('.//span[@class="username u-dir u-textTruncate"]/b/text()').extract()[0]
 
-                ID = item.xpath('.//@data-tweet-id').extract()
-                if not ID:
+                tweet_id = item.xpath('.//@data-tweet-id').extract()
+                if not tweet_id:
                     continue
-                tweet['ID'] = ID[0]
+                tweet['ID'] = tweet_id[0]
 
-                ### get text content
+                # get text content
                 tweet['text'] = ' '.join(
                     item.xpath('.//div[@class="js-tweet-text-container"]/p//text()').extract()).replace(' # ', '#').replace(
@@ -80,7 +79,7 @@
                     # If there is not text, we ignore the tweet
                     continue
 
-                ### get meta data
+                # get meta data
                 tweet['url'] = item.xpath('.//@data-permalink-path').extract()[0]
 
                 nbr_retweet = item.css('span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount').xpath(
@@ -108,7 +107,7 @@
                     item.xpath('.//div[@class="stream-item-header"]/small[@class="time"]/a/span/@data-time').extract()[
                         0])).strftime('%Y-%m-%d %H:%M:%S')
 
-                ### get photo
+                # get photo
                 has_cards = item.xpath('.//@data-card-type').extract()
                 if has_cards and has_cards[0] == 'photo':
                     tweet['has_image'] = True
@@ -116,7 +115,7 @@
                 elif has_cards:
                     logger.debug('Not handle "data-card-type":\n%s' % item.xpath('.').extract()[0])
 
-                ### get animated_gif
+                # get animated_gif
                 has_cards = item.xpath('.//@data-card2-type').extract()
                 if has_cards:
                     if has_cards[0] == 'animated_gif':
@@ -151,7 +150,7 @@
                 yield tweet
 
                 if self.crawl_user:
-                    ### get user info
+                    # get user info
                     user = User()
                     user['ID'] = tweet['user_id']
                     user['name'] = item.xpath('.//@data-name').extract()[0]
@@ -159,12 +158,7 @@
                     user['avatar'] = \
                         item.xpath('.//div[@class="content"]/div[@class="stream-item-header"]/a/img/@src').extract()[0]
                     yield user
+
             except:
                 logger.error("Error tweet:\n%s" % item.xpath('.').extract()[0])
                 # raise
-
-    def extract_one(self, selector, xpath, default=None):
-        extracted = selector.xpath(xpath).extract()
-        if extracted:
-            return extracted[0]
-        return default
diff --git a/TweetScraper/utils.py b/TweetScraper/utils.py
index ffbd900..2ddae60 100644
--- a/TweetScraper/utils.py
+++ b/TweetScraper/utils.py
@@ -1,6 +1,17 @@
+import json
 import os
 
+
 def mkdirs(dirs):
-    ''' Create `dirs` if not exist. '''
+    """ Create `dirs` if not exist. """
     if not os.path.exists(dirs):
-        os.makedirs(dirs)
\ No newline at end of file
+        os.makedirs(dirs)
+
+
+def save_to_file(item, fname):
+    """ input:
+            item - a dict like object
+            fname - where to save
+    """
+    with open(fname, 'w') as f:
+        json.dump(dict(item), f)