
Scrapy PoC for Plural Open Scrapers

Proof of Concept for using Scrapy to write and execute scrapers that obtain open civic data.

Try it

  • Install the required version of Python using pyenv
  • If necessary, pip install poetry
  • poetry install

You should now have the scrapy tool installed in your poetry environment.
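
To confirm the tool is available, you can list the spiders registered in this project (assuming you run this from the repo root, where the Scrapy project configuration lives):

poetry run scrapy list

This should print the available spider names, such as nv-bills.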

Warning: there is currently a problem with the project structure: a scraper that sits a couple of layers deep in the scrapers package fails to import from the core package. I can only run this with a modification to PYTHONPATH (which PyCharm adds by default, so running in PyCharm is fine). To set PYTHONPATH if you want to run this in a terminal:

  • Find the path to your Poetry environment folder: poetry env info
  • Set PYTHONPATH to include both the root folder of this repo and the site-packages folder of the poetry env: export PYTHONPATH='/home/jesse/repo/openstates/scrapy-test:/home/jesse/.cache/pypoetry/virtualenvs/scrapy-test-vH1KNbGC-py3.9/lib/python3.9/site-packages'
    • You will need to change the paths in the command above to match those on your machine (see the one-liner after this list).
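
A hypothetical one-liner that derives both paths, assuming a Python 3.9 environment and that you run it from the repo root:

export PYTHONPATH="$(pwd):$(poetry env info --path)/lib/python3.9/site-packages"

Adjust the python3.9 segment to match the interpreter version of your environment.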

Run a scout scrape

python -m scrapy.cmdline crawl nv-bills -a session=2023Special35 -O nv-bills-scout.json -s "ITEM_PIPELINES={}"

  • This command disables the DropStubsPipeline (ITEM_PIPELINES={}), which by default drops stub entities
  • Results are output to nv-bills-scout.json

Run a full scrape

python -m scrapy.cmdline crawl nv-bills -a session=2023Special35 -O nv-bills.json -s "DOWNLOADER_MIDDLEWARES={}"

  • This command disables the ScoutOnlyDownloaderMiddleware (DOWNLOADER_MIDDLEWARES={}), which by default ignores requests that are not marked is_scout in the meta property of the request.
  • Results are output to nv-bills.json
  • Please note that the scraper is not fully ported over yet, so there is still missing data.

Context

Prior art

John did a scrapy PoC in the (private) Plural Engineering Experiments repo.

Why Scrapy?

  • Very popular: easy to find developers who are familiar
  • Very mature: battle-tested layers of abstraction, flexibility to meet our goals
  • Reduce overall surface area of "in-house" code we need to maintain

Criteria

One way to think of success for this project: can it achieve most or all of the goals of the spatula project, without requiring much custom code?

  • Can we run a scout scrape that returns basic info on entities without making the extra requests required for retrieving full info on entities?
  • Can we run a normal scrape that does not output the partial/stub info generated by scout code?
  • Are there barriers involved in using necessary elements of openstates-core code here? For instance, we want to be able to easily port code and continue to use the core entity models with helper functions, etc.
  • Can the scraper output an equivalent directory of json files that can be compared 1:1 to an existing scraper?

Technical notes

Porting existing Open States scrapers

We have a repository of existing open data scrapers in openstates-scrapers. These form the baseline of quality and expected output for scrapers in this test repository.

Those scrapers rely on some shared code from a PyPI package called openstates, the code for which is found here. There is currently a barrier to adding that shared code to this repo (see Problems below), so some of that shared code is temporarily copied into this repo.

Some technical notes regarding porting code:

  • All the scrapers in the scrapers_next folder use the spatula scraper framework. A few of the ones in the scrapers folder do as well (see nv/bills.py). But most of the scrapers in the scrapers folder use an older framework called scrapelib.
  • There are often multiple requests needed to compile enough data to fully represent a Bill, so these sequences of requests and parsing can end up looking like long procedures (scrapelib) or nested abstractions where it's not clear how they are tied together (spatula). In scrapy, we should handle this by yielding Requests that pass along the partial entity using cb_kwargs (see the sketch after this list).
    • In spatula, you'll see a pattern where subsequent parsing functions read self.input to get at that partial data. In scrapy, passed-down partial data is available as a named kwarg, such as bill or bill_stub.
  • Fundamentally, the CSS and XPath selectors remain the same; just some of the syntax around them changes:
    • doc.xpath() or self.root.xpath() becomes response.xpath()
    • CSS("#title").match_one(self.root).text becomes response.css("#title::text").get()

Evaluating whether the port is a success

The scrapy-based scrapers need to perform at least as well as the equivalent scrapers in openstates-scrapers.

The most important expectation to meet is that the new scraper must be at least as information-complete and accurate as the old scraper. Is the output the same (or better)? See documentation on Open States scrapers.

Old Open States scrapers output a JSON file for each scraped item to a local directory, ./_data:

jesse@greenbookwork:~/repo/openstates/openstates-scrapers/scrapers/_data$ cd nv
jesse@greenbookwork:~/repo/openstates/openstates-scrapers/scrapers/_data/nv$ ls -alh
total 56K
drwxrwxr-x 2 jesse jesse 4.0K Nov  6 18:29 .
drwxrwxr-x 6 jesse jesse 4.0K Nov  5 19:36 ..
-rw-rw-r-- 1 jesse jesse 2.1K Nov  6 18:28 bill_9f6f717c-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 2.1K Nov  6 18:28 bill_a2526bec-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse  13K Nov  6 18:29 bill_a54dad84-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 1.9K Nov  6 18:29 bill_a8d58f30-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 2.1K Nov  6 18:29 bill_abc9288c-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse 3.8K Nov  6 18:28 jurisdiction_ocd-jurisdiction-country:us-state:nv-government.json
-rw-rw-r-- 1 jesse jesse  171 Nov  6 18:28 organization_9dac2f10-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse  187 Nov  6 18:28 organization_9dac2f11-7d04-11ee-aeef-01ae5adc5576.json
-rw-rw-r-- 1 jesse jesse  189 Nov  6 18:28 organization_9dac2f12-7d04-11ee-aeef-01ae5adc5576.json

The above output was created by running the following command from within the scrapers subdirectory:

poetry run python -m openstates.cli.update --scrape nv bills session=2023Special35

(Nevada session 2023Special35 is a nice example because it is quick: only 5 bills.)

We can compare this output to the nv-bills.json output mentioned above.
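
A rough, hypothetical way to do that comparison in Python, assuming both outputs expose an identifier field (the paths and keys below are illustrative):

import json
from pathlib import Path

# Index the old scraper's per-item JSON files by bill identifier.
old_bills = {}
for path in Path("_data/nv").glob("bill_*.json"):
    bill = json.loads(path.read_text())
    old_bills[bill["identifier"]] = bill

# The scrapy output (-O flag) is a single JSON array in one file.
new_bills = {
    b["identifier"]: b
    for b in json.loads(Path("nv-bills.json").read_text())
}

# Report identifiers present on one side but not the other.
print("missing from new:", sorted(old_bills.keys() - new_bills.keys()))
print("extra in new:", sorted(new_bills.keys() - old_bills.keys()))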

Other evaluation criteria:

  • A scraper for bills should accept a session argument containing a legislative session identifier (string). See example.
  • Comments that provide context for otherwise-opaque HTML selectors/traversal are helpful!
  • Long procedures should be broken into reasonably-sized functions. Often it makes sense to have a separate function for handling sub-entities, e.g. add_actions(), add_sponsors(), add_versions(), etc.
  • Requests that are required to get the BillStub level of info should have "is_scout": True set in the meta arg (see the example after this list). This allows us to run the scraper in "scout" mode: only running the minimum requests needed for basic info so we can frequently assess when new entities are posted (and avoid flooding the source with requests).
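
For instance, a scout-level request might be marked like this (a hypothetical sketch; the spider name and URL are illustrative):

import scrapy

class ExampleScoutSpider(scrapy.Spider):
    name = "example-scout"  # hypothetical spider name

    def start_requests(self):
        # Mark the listing request as scout-level so it survives scout mode.
        yield scrapy.Request(
            "https://example.com/bills",  # illustrative URL
            callback=self.parse,
            meta={"is_scout": True},
        )

    def parse(self, response):
        pass  # build stubs here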

Problems

  • The spatula library specifies an older version of the attrs package as a dependency. scrapy also has attrs as a dependency, and these versions conflict. Since openstates has spatula as a dependency, we currently cannot add openstates as a dependency to this project! As a quick workaround, I copied a bunch of library code out of the openstates-core repo and into the core package within this repo. This is a very temporary solution.
  • The scraper is not fully ported yet.

Useful concepts

  • Pass input properties from the command line to the scraper using the -a flag, e.g. -a session=2023Special35. This allows us to copy os-update behavior, where we can pass runtime parameters to the scraper.
  • Override scrapy settings at runtime with the -s flag. This lets us control which item pipelines and middlewares are enabled, so we can switch between scout and normal scrape behavior at runtime.
  • Item pipelines do things with items returned by scrapers. We use one to drop "stub" items when in normal scrape mode.
  • Downloader middleware allows us to change the behavior of a Request before it is made. We currently require the scraper to mark is_scout: True on the meta property of the Request, so that we can ignore non-scout requests when desired. (A sketch of both pieces follows this list.)
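
As a rough sketch of how these two pieces might look (the actual DropStubsPipeline and ScoutOnlyDownloaderMiddleware in this repo may differ; the is_stub check below is an assumption):

from scrapy.exceptions import DropItem, IgnoreRequest

class DropStubsPipeline:
    """Drop stub items so a normal scrape only emits full entities."""

    def process_item(self, item, spider):
        # Assumes stub items carry a flag identifying them; illustrative check.
        if item.get("is_stub"):
            raise DropItem("dropping stub item")
        return item

class ScoutOnlyDownloaderMiddleware:
    """Ignore any request not explicitly marked as scout-level."""

    def process_request(self, request, spider):
        if not request.meta.get("is_scout"):
            raise IgnoreRequest(f"skipping non-scout request: {request.url}")
        return None  # let marked requests proceed normally

Overriding ITEM_PIPELINES={} or DOWNLOADER_MIDDLEWARES={} with the -s flag, as in the commands above, disables the corresponding piece and flips the scraper between the two modes.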
