Define a news source, set your search method, and collect news articles or create news graphs.
Usage:
import os
from seldonite import sources, collect, graphs, run
aws_access_key = os.environ['AWS_ACCESS_KEY']
aws_secret_key = os.environ['AWS_SECRET_KEY']
source = sources.news.CommonCrawl(aws_access_key, aws_secret_key)
collector = collect.Collector(source) \
    .on_sites(['cbc.ca', 'bbc.com']) \
    .by_keywords(['afghanistan', 'withdrawal'])
graph = graphs.Graph(collector) \
    .build_tfidf_graph()
articles_df, words_df, edges_df = run.Runner(graph) \
    .to_pandas()
Please see the wiki for more detail on sources and methods.
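The usage example reads the AWS credentials directly from `os.environ`, which raises a bare `KeyError` if a variable is missing. As a minimal sketch, a small helper could report the missing variable more clearly. The name `require_env` is hypothetical and is not part of seldonite; it uses only the standard library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper (not part of seldonite): raises a RuntimeError
    with a readable message if the variable is unset or empty.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value


# Possible use in place of the direct os.environ lookups above:
# aws_access_key = require_env('AWS_ACCESS_KEY')
# aws_secret_key = require_env('AWS_SECRET_KEY')
```

Failing early this way keeps a credentials problem from surfacing later inside the Spark job.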
To install seldonite in editable mode, along with its dependencies, via conda:
conda env create -f ./environment.yml
This library relies on several third-party libraries; brief setup instructions follow.
To use NLP methods that require spaCy:
python -m spacy download en_core_web_sm
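spaCy models downloaded this way are installed as regular Python packages, so one way to confirm the download succeeded is to check whether the package can be found. The sketch below uses only the standard library; the helper name `model_installed` is hypothetical and not part of seldonite or spaCy.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper: spaCy models such as en_core_web_sm are
    installed as importable packages, so a missing spec suggests the
    download step has not been run in this environment.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if model_installed("en_core_web_sm"):
        print("en_core_web_sm is available")
    else:
        print("en_core_web_sm not found; run: python -m spacy download en_core_web_sm")
```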
To make Python dependencies available to Spark executors, use the dependency packaging script:
bash ./seldonite/spark/package_pyspark_deps.sh
We use pytest.
To run the tests, run this command from the top-level directory:
pytest
- Spark-based pipeline built on Sebastian Nagel's cc-pyspark library.
- Heuristics for news articles adapted from newsplease.
- Political news classifier taken from Political-News-Filter.