
VA: Scraper for new site #5051

Merged: 14 commits into main, Oct 31, 2024
Conversation

showerst (Contributor):

This gets the scraper working to pull the 2025 carryover bills.

@jessemortenson @NewAgeAirbender heads up they moved bulk data out of the sFtp site onto the web so we can probably ditch our paramiko dependency, and you can remove those credentials from your pipelines. 🥳 They also claim the bulk data is updated hourly now rather than daily.

There's some hacky stuff in here to hard-code 2025 (which is all that's in the bulk data anyway) -- I'm waiting on a key for their new API to rewrite the scraper to use it, as it provides better data.
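
Since the bulk data now lives on the web rather than the sFtp site, the paramiko-based fetch can presumably be replaced with a plain HTTP download plus Python's csv module. A minimal sketch of that idea; the URL and column names below are placeholders, not the real VA endpoints or schema:

```python
import csv
import io
import urllib.request

# Hypothetical bulk-data URL; the real VA endpoint may differ.
BULK_CSV_URL = "https://example.virginia.gov/bulkdata/bills.csv"


def fetch_bulk_csv(url: str = BULK_CSV_URL) -> str:
    """Download the bulk CSV over plain HTTPS (no SFTP/paramiko needed)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def parse_bulk_csv(text: str) -> list[dict]:
    """Parse CSV text into one dict per bill row."""
    return list(csv.DictReader(io.StringIO(text)))
```

With hourly updates claimed for the bulk data, an unauthenticated HTTP GET on a schedule would cover the same ground the SFTP credentials used to.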

jessemortenson (Contributor) left a comment:

Overall looks good to me; one question about bill versions.

@@ -427,13 +385,5 @@ def scrape(self, session=None):
media_type="text/html",
on_duplicate="ignore",
)
b.add_version_link(
jessemortenson (Contributor):
Do they not have PDFs versions available for 2025 bills? or is it no longer convenient to grab the PDF URLs based on this new data source?

showerst (Contributor, Author):

They don't include the PDF identifiers in the CSV data. I think we could scrape the JSON endpoints that power the JavaScript frontend (though they're using some bot detection) and pull them that way.

I'm hoping they'll be included in the API and that someone will approve my key request soon; if not, we'll have to go the endpoint route. If they don't approve my API key by next week I'll PR that. Obviously feel free to do it sooner, I wouldn't be offended =).

showerst (Contributor, Author) commented Oct 10, 2024:

FYI GitGuardian does not like that VA "WebAPIKey" that I pulled from one of my browser requests, but it does not appear to be an actual secret.

jessemortenson (Contributor):

Were you and Rylie in the habit of letting them merge everything in? Just want to make sure you're not waiting on me to merge (or I can keep doing that if that's the convention we've been doing)

showerst (Contributor, Author):

I've generally merged my own and they merged theirs. This one's waiting on some human QA on my side since it's a new scraper, but we'll wrap that up this afternoon and merge.

I think it's best if all the core members can merge at will, and just request review from each other if it's something that might majorly affect one of our orgs or major API users. I usually also courtesy tag y'all when something involves a large change to the output of a scraper, or requires a new credential of some sort.

If you see one of mine hanging feel free to comment and ask if it's ready, occasionally I file the PR then forget to come back and merge it after the tests pass.

jessemortenson (Contributor):

Cool, sounds good, I agree re: core members merging and requesting review based on judgement call.

showerst (Contributor, Author):

@jessemortenson this is working well. Once you have a VA API key you can merge this at your leisure; I don't want to break your infra. Set the VA_API_KEY env variable in your scraper environment first.
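
One way to make that env-variable requirement obvious at startup is a fail-fast check before the scraper runs. A minimal sketch; the function name and error message are illustrative, not code from this PR:

```python
import os


def require_api_key() -> str:
    """Fail fast if VA_API_KEY is missing from the scraper environment."""
    key = os.environ.get("VA_API_KEY")
    if not key:
        raise RuntimeError(
            "VA_API_KEY is not set; export it in the scraper "
            "environment before running the VA scraper"
        )
    return key
```

Failing early with a named variable beats letting the API client die later with an opaque 401.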

I kept the old scraper working; you can run it without a key via poetry run os-update va csv_bills, but it pulls the janky text versions instead of the PDF files, so I don't recommend using it. If the API ends up stable after a few months we should probably delete it.

I'll take a look at a votes scraper and maybe a non-hardcoded session listing function next week.

showerst changed the title from "VA: hotfixes for new site CSV files" to "VA: Scraper for new site" on Oct 16, 2024.
jessemortenson (Contributor):

FYI, I added representation of carried-over bills: when a bill action says it was carried over, the scraper now adds a related bill from the prior session. (VA finally got back to me with a key!)

jessemortenson merged commit 0b0171a into main on Oct 31, 2024 (1 check passed).