Python package for collecting and analyzing webpages
See here for extended examples of observatory
in use.
Initializes a project directory
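A minimal sketch of what project initialization could look like; the directory names here are placeholders, not observatory's actual layout:

```python
from pathlib import Path

def init_project(root="observatory_project"):
    """Create the project root and a few working subdirectories (names are placeholders)."""
    root = Path(root)
    for sub in ("searches", "crawls", "scrapes", "analysis"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```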
Searches Google for terms. Google Custom Search Engine credentials required.
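As a rough illustration, a query against the Google Custom Search JSON API could look like this (the API key and search engine ID are placeholders you supply from your own CSE credentials; the function name is illustrative):

```python
import requests

def google_search(term, api_key, cse_id, start=1):
    """Query the Google Custom Search JSON API for a single term."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": cse_id, "q": term, "start": start},
        timeout=30,
    )
    resp.raise_for_status()
    # Each item carries the result's title, link, and snippet.
    return resp.json().get("items", [])
```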
Compiles results from multiple Google searches.
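For example, the per-term result lists could be stacked into one pandas DataFrame; the compiling function here is a hypothetical illustration, with columns coming from the Custom Search result items:

```python
import pandas as pd

def compile_results(results_by_term):
    """Flatten {term: [result items]} into one DataFrame, recording the search term per row."""
    frames = [
        pd.DataFrame(items).assign(search_term=term)
        for term, items in results_by_term.items()
        if items
    ]
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```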
Extracts domain-level information from the urls returned by Google searches (e.g. 'google' in www.google.com).
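One way to do this is with the tldextract package, which splits a url into subdomain, domain, and suffix (an illustration, not necessarily the module's implementation):

```python
import tldextract

def extract_domain(url):
    """Return the registered-domain label, e.g. 'google' for www.google.com."""
    return tldextract.extract(url).domain

extract_domain("https://www.google.com/search?q=observatory")  # -> 'google'
```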
Initializes a Scrapy crawl on a set of domains. Returns a JSON file of urls found through the crawl.
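A minimal Scrapy spider of the kind such a crawl might use; the spider name and argument handling are illustrative, not the package's actual crawler:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DomainSpider(CrawlSpider):
    """Follow links within a set of allowed domains and record every url visited."""

    name = "domain_spider"  # illustrative; not the package's actual spider name
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def __init__(self, domains="", **kwargs):
        # domains can be passed on the command line, e.g. -a domains=example.org,example.com
        self.allowed_domains = [d.strip() for d in domains.split(",") if d.strip()]
        self.start_urls = [f"https://{d}" for d in self.allowed_domains]
        super().__init__(**kwargs)

    def parse_item(self, response):
        yield {"url": response.url}
```

Running such a spider with Scrapy's `-O output.json` feed option (as in the command noted below) writes the yielded items out as a JSON array of urls.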
Processes the JSON output of a crawl into a pandas DataFrame.
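If the crawl was exported as a JSON array of objects (Scrapy's default JSON feed), loading it is essentially a one-liner; the deduplication step below assumes each item carries a 'url' field:

```python
import pandas as pd

crawl_df = pd.read_json("output.json")                # one row per item yielded by the spider
crawl_df = crawl_df.drop_duplicates(subset="url")     # assumes a 'url' column in the output
```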
Not implemented yet. For now, run the crawl manually: `!scrapy crawl digcon_crawler -O output.json --nolog`
Merges Google searches and crawl results.
Gets historical versions of Twitter-searched urls using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
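One way to do this is the Internet Archive's availability API, which returns the capture closest to a given timestamp; a hedged sketch (the function name is illustrative):

```python
import requests

def closest_snapshot(url, timestamp):
    """Ask the Wayback Machine availability API for the capture closest to `timestamp`
    (a string like '20200315120000', i.e. roughly when the page was tweeted)."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url")  # None if no capture exists
```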
Uses the requests package to follow each url and retrieve its full address rather than a shortened redirect (e.g. bit.ly/12312). This helps in scraping.
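Resolving a shortened url with requests might look like this (HEAD first, falling back to GET, is one common pattern; the function name is illustrative):

```python
import requests

def resolve_url(url):
    """Follow redirects to find the final address behind a shortener like bit.ly."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        if resp.status_code >= 400:  # some servers reject HEAD requests
            resp = requests.get(url, allow_redirects=True, timeout=10)
        return resp.url
    except requests.RequestException:
        return url  # keep the original if the request fails
```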
Initializes files to scrape urls for their HTML.
Scrapes each page's HTML and stores the body text in a PostgreSQL database.
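A stripped-down version of such a scraping step, using requests, BeautifulSoup, and psycopg2; the table name and schema here are assumptions, not the package's actual schema:

```python
import psycopg2
import requests
from bs4 import BeautifulSoup

def scrape_to_db(urls, dsn):
    """Fetch each page, extract its body text, and insert it into a 'pages' table (assumed schema)."""
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
        for url in urls:
            try:
                html = requests.get(url, timeout=30).text
            except requests.RequestException:
                continue  # skip unreachable pages
            body = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
            cur.execute(
                "INSERT INTO pages (url, body) VALUES (%s, %s) ON CONFLICT (url) DO NOTHING",
                (url, body),
            )
    conn.close()
```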
A set of methods for searching the PostgreSQL database of site text, including filtering out empty results and counting specified search terms.
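Term counting could be done on the SQL side, for example; this sketch assumes the hypothetical 'pages' table above:

```python
import psycopg2

def count_term(dsn, term):
    """Count non-empty pages whose body text contains `term` (case-insensitive)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM pages WHERE body <> '' AND body ILIKE %s",
            (f"%{term}%",),
        )
        return cur.fetchone()[0]
```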
Produces a sample of pages for verifying counts of terms.
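For instance, assuming the site text has been pulled into a DataFrame (here called pages_df, a hypothetical name), a spot-check sample can be drawn with pandas:

```python
# Draw 50 random pages (fixed seed for reproducibility) to hand-verify the term counts.
sample = pages_df.sample(n=50, random_state=1)
sample.to_csv("verification_sample.csv", index=False)
```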
Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).
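A rough pandas version of that summary, assuming illustrative 'domain' and 'body' columns rather than the package's actual schema:

```python
import re
import pandas as pd

def term_frequencies(df, terms):
    """Mean occurrences of each term per page, summarized by domain."""
    counts = pd.DataFrame(
        {term: df["body"].str.count(re.escape(term), flags=re.IGNORECASE) for term in terms}
    )
    counts["domain"] = df["domain"]
    summary = counts.groupby("domain").mean()
    summary.plot(kind="bar")  # one bar group per organization (requires matplotlib)
    return summary
```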
Calculates and visualizes covariance metrics for specified search terms in the site text.
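Covariance between per-page term counts could be computed the same way; a small sketch under the same column assumptions:

```python
import re
import pandas as pd

def term_covariance(df, terms):
    """Pairwise covariance of per-page term counts (assumes a 'body' text column)."""
    counts = pd.DataFrame(
        {term: df["body"].str.count(re.escape(term), flags=re.IGNORECASE) for term in terms}
    )
    return counts.cov()
```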
Returns specific pages that contain two or more specified search terms.
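A sketch of that filter under the same assumptions:

```python
import re
import pandas as pd

def pages_with_terms(df, terms):
    """Return rows whose 'body' text contains every term in `terms` (case-insensitive)."""
    mask = pd.Series(True, index=df.index)
    for term in terms:
        mask &= df["body"].str.contains(re.escape(term), case=False, na=False)
    return df[mask]
```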
- Documentation :(
- Convert modules to methods of data classes
- Add crawl module