-
Notifications
You must be signed in to change notification settings - Fork 87
Add topic mine support for the Pushshift verified twitter archive #747
base: master
Are you sure you want to change the base?
Conversation
@epenn would you be able to rebase this off the top of the current |
We have the OK to deploy this. Is the code ready to merge and release? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for comprehensible code comments Eric!
For the reference, could you also ELI5 what this PR is all about, i.e. what's "verified Twitter" archive and what is it that we're going to be doing here? I genuinely don't know :)
@@ -185,6 +184,7 @@ END; | |||
$$ | |||
LANGUAGE plpgsql; | |||
|
|||
>>>>>>> origin/master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from a merge.
-- 1 of 2. Import the output of 'apgdiff': | ||
-- | ||
|
||
select insert_platform_source_pair( 'twitter', 'pushshift' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this bit should be in the mediawords.sql
too.
PS_TWITTER_PAGE_SIZE = 10000 | ||
PS_TWITTER_SCROLL_TIMEOUT = '1m' | ||
PS_TWITTER_SCROLL_URL = 'https://twitter-es.pushshift.io/_search/scroll' | ||
PS_TWITTER_URL = 'https://twitter-es.pushshift.io/twitter_verified/_search?scroll=%s' % PS_TWITTER_SCROLL_TIMEOUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use f-strings here and elsewhere (if applicable)? I.e. f'https://twitter-es.pushshift.io/twitter_verified/_search?scroll={PS_TWITTER_SCROLL_TIMEOUT}'
} | ||
|
||
|
||
def _mock_elasticsearch_response(posts: dict, scroll_id) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and elsewhere, would you be able to use more precise typing annotations?
Here's out 30 seconds tutorial on those: https://github.com/mediacloud/backend/blob/master/doc/coding_guidelines.markdown#declare-function-parameter-and-return-value-types
One doesn't have to go too crazy with those, but an IDE would be really happy if it was hinted that posts
is a Dict[str, Union[int, dict, bool, str]]
(or something like that), and what's the type of scroll_id
anyway - is it an int
or a str
? Can it be empty? Same for return values.
def _mock_elasticsearch_hit(post: dict) -> dict: | ||
"""Mock an ElasticSearch hit for a Pushshift tweet.""" | ||
|
||
return { | ||
'_index': 'twitter_verified', | ||
'_type': '_doc', | ||
'_id': post['post_id'], | ||
'_routing': post['post_id'], | ||
'_score': 123.456, | ||
'_source': { | ||
'id': int(post['post_id']), | ||
'id_str': post['post_id'], | ||
'screen_name': post['author'], | ||
'text': post['content'], | ||
'created_at': post['publish_date'] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to move those _mock*
helpers to the test file which is doing the actual mocking?
Returns: None, referenced tweets are stored in tweet['quoted_status'] or | ||
tweet['retweeted_status'] as appropriate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this modifies the parameter tweets
in-place, right?
Could you make it return the modified tweets instead? Now it's a bit hard to figure out which specific method does add those quoted_status
/ retweeted_status
to the tweets
. We have a plenty of those C-style "pass a reference" methods that change up their arguments in subtle ways all over our codebase, and I can't say I like them too much :)
'post_id': tweet['id_str'], | ||
'data': tweet, | ||
'content': tweet['text'], | ||
'publish_date': publish_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you verify that publish_date
gets converted to America/New_York
timezone at some point, in the caller perhaps? Unfortunately topic_posts.publish_date
is a TIMESTAMP [WITHOUT A TIME ZONE]
in PostgreSQL, so we have to normalize all the dates that come in to EST / EEST (or maybe it's UTC for topic_posts
specifically? Either way, it has to point to the same moment of time somehow), and I don't know whether it's the caller that does that, or are you supposed to do it in the post ingest module such as this one.
Oh, and I have merged in |
Could you also have a look at the failing tests too? |
I can handle the first question on context and purpose. This is part of our effort to build cross-platform topics. Jason over at PushShift.io runs an archive of "verified" tweets that he ingests and maintains. This code allows us to import tweets from this archive via adding another it as a platform in a Topic. So it queries his API for matching tweets, extract any shared links from them, and adds those into the Topic to be processed (and saves the tweets too). So at a high level this lets us discover links being shared in tweets about a topic and saves attention metrics about them. |
Thanks Rahul! |
This adds support for pulling data from the Pushshift verified twitter archive. A couple things of note:
retweeted_status
andquoted_status
fields (when needed) since Pushshift optimizes for space by removing the payloads from quote tweets and retweets.