observatory

Python package for collecting and analyzing webpages

See here for extended examples of observatory in use.

Modules

`start_project`

Initializes a project directory

`search_google`

Searches Google for terms. Google Custom Search Engine credentials required.

`google_process`

Compiles results from multiple Google searches.

`get_domains`

Extracts domain-level information from the urls returned by Google searches (e.g. 'google' in www.google.com)

`initialize_crawl`

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of urls found through the crawl.

`crawl_process`

Processes the JSON output of a crawl into a pandas DataFrame.

`crawl`

Not implemented yet. `!scrapy crawl digcon_crawler -O output.json --nolog

`search_merge`

Merges Google searches and crawl results.

`get_versions`

~~Gets historical versions of Twitter-searched urls using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.~~
Uses the requests package to ping the url and get the "full" address rather than a redirect (e.g. bit.ly/12312). This helps in scraping.

`initialize_scrape`

Initializes files to scrape urls for their HTML.

`scrape`

Conducts the scrape of pages' HTML. Stores body text in a Postgresql database.

`query`

A set of methods for searching the Postgreql database of site text, including filtering empty results and counting specified search terms.

`ground_truth`

Produces a sample of pages for verifying counts of terms.

`analyze_orgs`

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).

`analzye_term_correlations`

Calculates and visualizes co-variance metrics for specified search terms in the site text.

`co_occurrence`

Returns specific pages using two or more specified search terms.

TBD

Documentation :(
Convert modules to methods of data classes
Add crawl module

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github/workflows		.github/workflows
dist		dist
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

observatory

Modules

`start_project`

`search_google`

`google_process`

`get_domains`

`initialize_crawl`

`crawl_process`

`crawl`

`search_merge`

`get_versions`

`initialize_scrape`

`scrape`

`query`

`ground_truth`

`analyze_orgs`

`analzye_term_correlations`

`co_occurrence`

TBD

About

Releases 3

Packages

Languages

License

ericnost/web-observatory

Folders and files

Latest commit

History

Repository files navigation

observatory

Modules

start_project

search_google

google_process

get_domains

initialize_crawl

crawl_process

crawl

search_merge

get_versions

initialize_scrape

scrape

query

ground_truth

analyze_orgs

analzye_term_correlations

co_occurrence

TBD

About

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

`start_project`

`search_google`

`google_process`

`get_domains`

`initialize_crawl`

`crawl_process`

`crawl`

`search_merge`

`get_versions`

`initialize_scrape`

`scrape`

`query`

`ground_truth`

`analyze_orgs`

`analzye_term_correlations`

`co_occurrence`

Packages