
VA: Scraper for new site #5051

Merged: 14 commits into main, Oct 31, 2024
Conversation

showerst (Contributor):

This gets the scraper working to pull the 2025 carryover bills.

@jessemortenson @NewAgeAirbender heads up they moved bulk data out of the sFtp site onto the web so we can probably ditch our paramiko dependency, and you can remove those credentials from your pipelines. 🥳 They also claim the bulk data is updated hourly now rather than daily.

There's some hacky stuff in here to hard-code 2025 (which is all that's in the bulk data anyway) -- I'm waiting on a key for their new API to rewrite the scraper to use it, as it provides better data.
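
Since the bulk data now lives on the web rather than the sFtp site, the paramiko-based fetch can presumably be replaced with a plain HTTP download plus Python's csv module. A minimal sketch of that idea; the URL and column names below are placeholders, not the real VA endpoints or schema:

```python
import csv
import io
import urllib.request

# Hypothetical bulk-data URL; the real VA endpoint may differ.
BULK_CSV_URL = "https://example.virginia.gov/bulkdata/bills.csv"


def fetch_bulk_csv(url: str = BULK_CSV_URL) -> str:
    """Download the bulk CSV over plain HTTPS (no SFTP/paramiko needed)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def parse_bulk_csv(text: str) -> list[dict]:
    """Parse CSV text into one dict per bill row."""
    return list(csv.DictReader(io.StringIO(text)))
```

With hourly updates claimed for the bulk data, an unauthenticated HTTP GET on a schedule would cover the same ground the SFTP credentials used to.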

jessemortenson (Contributor) left a comment:

Overall looks good to me; one question about bill versions.

@@ -427,13 +385,5 @@ def scrape(self, session=None):
media_type="text/html",
on_duplicate="ignore",
)
b.add_version_link(
jessemortenson (Contributor):
Do they not have PDFs versions available for 2025 bills? or is it no longer convenient to grab the PDF URLs based on this new data source?

showerst (Contributor, Author):

They don't include the PDF identifiers in the CSV data. I think we could scrape the JSON endpoints that power the JavaScript frontend (though they're using some bot detection) and pull them that way.

I'm hoping they'll be included in the API and that someone will approve my key request soon; if not, we'll have to go the endpoint route. If they don't approve my API key by next week I'll PR that. Obviously feel free to do it sooner, I wouldn't be offended =).

showerst (Contributor, Author) commented Oct 10, 2024:

FYI GitGuardian does not like that VA "WebAPIKey" that I pulled from one of my browser requests, but it does not appear to be an actual secret.

jessemortenson (Contributor):

Were you and Rylie in the habit of letting them merge everything in? Just want to make sure you're not waiting on me to merge (or I can keep doing that if that's the convention we've been doing)

showerst (Contributor, Author):

I've generally merged my own and they merged theirs. This one's waiting on some human QA on my side since it's a new scraper, but we'll wrap that up this afternoon and merge.

I think it's best if all the core members can merge at will, and just request review from each other if it's something that might majorly affect one of our orgs or major API users. I usually also courtesy tag y'all when something involves a large change to the output of a scraper, or requires a new credential of some sort.

If you see one of mine hanging feel free to comment and ask if it's ready, occasionally I file the PR then forget to come back and merge it after the tests pass.

jessemortenson (Contributor):

Cool, sounds good, I agree re: core members merging and requesting review based on judgement call.

showerst (Contributor, Author):

@jessemortenson this is working well. Once you have a VA API key you can merge this at your leisure; I don't want to break your infra. Set the VA_API_KEY env variable in your scraper environment first.
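
One way to make that env-variable requirement obvious at startup is a fail-fast check before the scraper runs. A minimal sketch; the function name and error message are illustrative, not code from this PR:

```python
import os


def require_api_key() -> str:
    """Fail fast if VA_API_KEY is missing from the scraper environment."""
    key = os.environ.get("VA_API_KEY")
    if not key:
        raise RuntimeError(
            "VA_API_KEY is not set; export it in the scraper "
            "environment before running the VA scraper"
        )
    return key
```

Failing early with a named variable beats letting the API client die later with an opaque 401.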

I kept the old scraper working; you can run it without a key via poetry run os-update va csv_bills, but it pulls the janky text versions instead of the PDF files, so I don't recommend using it. If the API ends up stable after a few months we should probably delete it.

I'll take a look at a votes scraper and maybe a non-hardcoded session listing function next week.

showerst changed the title from "VA: hotfixes for new site CSV files" to "VA: Scraper for new site" on Oct 16, 2024.
jessemortenson (Contributor):

FYI, I added representation of carried-over bills: when a bill action says it was carried over, the scraper now adds a related bill from the prior session. (VA finally got back to me with a key!)

jessemortenson merged commit 0b0171a into main on Oct 31, 2024 (1 check passed).