VA: Scraper for new site #5051
Conversation
Overall looks good to me; one question about bill versions.
@@ -427,13 +385,5 @@ def scrape(self, session=None):
    media_type="text/html",
    on_duplicate="ignore",
)
b.add_version_link(
Do they not have PDF versions available for 2025 bills? Or is it no longer convenient to grab the PDF URLs from this new data source?
They don't include the PDF identifiers in the CSV data. I think we could scrape the JSON endpoints that power the JavaScript frontend (though they're using some bot detection) and pull them that way.
I'm hoping they'll be included in the API, and that someone will approve my key request soon; if not, we'll have to go the endpoint route. If they don't approve my API key by next week I'll PR that. Obviously feel free to do it sooner, I wouldn't be offended =).
FYI GitGuardian does not like that VA "WebAPIKey" that I pulled from one of my browser requests, but it does not appear to be an actual secret.
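A minimal sketch of what the endpoint route might look like. The URL template and payload shape below are assumptions, not the documented LIS API, and `extract_pdf_urls` is a hypothetical helper:

```python
# Hypothetical sketch: pull PDF version links from the JSON endpoints that
# power the site's JavaScript frontend. The URL template and the payload
# shape ({"versions": [{"pdfUrl": ...}]}) are assumptions.
LEGISLATION_JSON = "https://lis.virginia.gov/api/legislation/{bill_id}"  # hypothetical

def extract_pdf_urls(payload):
    """Return PDF URLs from a bill payload shaped like {"versions": [{"pdfUrl": ...}]}."""
    return [v["pdfUrl"] for v in payload.get("versions", []) if v.get("pdfUrl")]

# With a browser-like session (to get past the bot detection), roughly:
# payload = session.get(LEGISLATION_JSON.format(bill_id="HB1")).json()
# for url in extract_pdf_urls(payload):
#     b.add_version_link(note="PDF", url=url, media_type="application/pdf")
```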
Were you and Rylie in the habit of letting them merge everything in? Just want to make sure you're not waiting on me to merge (or I can keep doing that, if that's the convention we've been following).
I've generally merged my own and they merged theirs. This one's waiting on some human QA on my side since it's a new scraper, but we'll wrap that up this afternoon and merge. I think it's best if all the core members can merge at will, and just request review from each other when something might majorly affect one of our orgs or major API users. I usually also courtesy-tag y'all when something involves a large change to a scraper's output, or requires a new credential of some sort. If you see one of mine hanging, feel free to comment and ask if it's ready; occasionally I file the PR and then forget to come back and merge it after the tests pass.
Cool, sounds good. I agree re: core members merging and requesting review based on judgement call.
@jessemortenson this is working well. Once you have a VA API key, you can merge this at your leisure; I don't want to break your infra. Set the VA_API_KEY env variable in your scraper environment first. I kept the old scraper working, so you can still use it without a key. I'll take a look at a votes scraper, and maybe a non-hardcoded session listing function, next week.
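A minimal sketch of the fail-fast env var read, assuming the VA_API_KEY name from the comment above (the helper name and error message are mine):

```python
import os

def get_va_api_key():
    """Read VA_API_KEY from the environment, failing fast if it's unset."""
    key = os.environ.get("VA_API_KEY")
    if not key:
        raise RuntimeError("Set the VA_API_KEY env variable in your scraper environment")
    return key
```

Failing at startup with a clear message beats a cryptic 401 halfway through a scrape.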
FYI I added representation of carried-over bills: when a bill action says it was carried over, the scraper adds a related bill from the prior session. (VA finally got back to me with a key!)
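The carryover detection might look roughly like this. The action-text pattern and session values are guesses at how VA phrases it, and the commented call assumes the Open States bill model's `add_related_bill` method:

```python
import re

# Guess at how VA phrases carryover in action text; the real scraper's
# pattern may differ.
CARRYOVER_RE = re.compile(r"[Cc]ontinued to \d{4}|[Cc]arried over")

def is_carryover_action(action_text):
    """True if an action description indicates the bill was carried over."""
    return bool(CARRYOVER_RE.search(action_text))

# Sketch of wiring it up (prior_session is hypothetical, e.g. "2024"):
# if is_carryover_action(action["description"]):
#     bill.add_related_bill(
#         identifier=bill.identifier,
#         legislative_session=prior_session,
#         relation_type="prior-session",
#     )
```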
This gets the scraper working to pull the 2025 carryover bills.
@jessemortenson @NewAgeAirbender heads up: they moved bulk data off the SFTP site onto the web, so we can probably ditch our paramiko dependency, and you can remove those credentials from your pipelines. 🥳 They also claim the bulk data is updated hourly now rather than daily.
There's some hacky stuff in here to hard-code 2025 (which is all that's in the bulk data anyway). I'm waiting on a key for their new API to rewrite the scraper to use it, as it provides better data.
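Dropping paramiko could look something like this: a plain stdlib HTTP fetch of the bulk CSV in place of the SFTP pull. The URL is a placeholder, not the real bulk-data location:

```python
import csv
import io
import urllib.request

# Placeholder: the real bulk-data web location isn't shown in this thread.
BULK_CSV_URL = "https://lis.virginia.gov/bulkdata/Bills.csv"

def parse_bulk_csv(text):
    """Parse bulk CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_bulk_rows(url=BULK_CSV_URL):
    # A plain HTTP GET replaces the old paramiko/SFTP download.
    with urllib.request.urlopen(url, timeout=60) as resp:
        return parse_bulk_csv(resp.read().decode("utf-8"))
```

Since the data is now claimed to update hourly, the scraper can simply re-fetch on each run with no credential handling at all.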