Create divisions data exports from XML #170
Comments
Sorry, I'm confused by this one. TWFY already has the MP/division/vote tables - is the request here not to get all the division/vote data from the existing parlparse XML directly into TWFY? Would that not be easier than basically duplicating entire datasets? May have totally misunderstood. I can't see anything updating membership in the PW code, may have missed it, so I assume those changes won't be being caught.
Ok, so what I actually want is the simplest way to get an up-to-date parquet file of divisions and votes that can be used for analysis queries elsewhere (and in general is cool as a bulk data API). As TheyWorkForYou already crunches the XML for divisions, we should just use that? What this might actually be is a database dump of a few tables from TWFY that can be reprocessed like I'm currently doing with the Public Whip. So rather than a split with both feeding from ParlParse, there's a bit of looped data exchange going on between the two.

```mermaid
graph TD
    HansardXML["Hansard XML"]
    ParlParse
    HumanAnnotator{{"Human input: <br/>- describe divisions <br/>- create policies <br/>- assign divisions to policies"}}
    PublicWhip["TheyWorkForYou Votes"]
    TheyWorkForYou["TheyWorkForYou"]
    HansardXML --> ParlParse
    ParlParse -->|"Speeches and divisions (XML)"| TheyWorkForYou
    TheyWorkForYou -->|"Voting data (parquet?)"| PublicWhip
    PublicWhip -->|"Divisions and alignment (JSON)"| TheyWorkForYou
    PublicWhip --> HumanAnnotator --> PublicWhip
```
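For illustration, here's a minimal sketch of the kind of analysis query a bulk parquet file would enable - the file name and column names (division_date, person_id, vote) are made up for the example, not the actual publicwhip-data schema.

```python
# Hedged sketch: an analysis query against a hypothetical divisions/votes parquet.
# File name and column names are assumptions, not the real schema.
import pandas as pd

votes = pd.read_parquet("votes.parquet")  # e.g. downloaded from the bulk data page

# Example: count "aye" votes per person during 2023
ayes_2023 = votes[
    (votes["vote"] == "aye") & (votes["division_date"].str.startswith("2023"))
]
print(ayes_2023.groupby("person_id").size().sort_values(ascending=False).head())
```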
mysociety/theyworkforyou#1765 will close this ticket.
Closed in favour of mysociety/theyworkforyou#1759
TheyWorkForYou Votes imports processed vote information via a database dump of the Public Whip reworked as parquet files (https://pages.mysociety.org/publicwhip-data/). This isn’t using the calculated tables, but the raw information Public Whip has extracted from the debate XML.
The key tables used are:
- pw_divisions
- pw_vote
- pw_mp
We want to be able to create these tables (or something similar) directly from parlparse. This cuts Public Whip out of the loop - and would also let us add votes for all the Parliaments TWFY covers but Public Whip does not.
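As a rough sketch of what building these tables directly from parlparse could look like - the element and attribute names here (division, divdate, divnumber, mpname) are assumptions about the debate XML layout, not the confirmed schema:

```python
# Hedged sketch: build division and vote rows from a parlparse-style debate XML
# file. Element/attribute names are assumptions and may differ from the real
# parlparse schema.
import xml.etree.ElementTree as ET
import pandas as pd

def extract_divisions(path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    tree = ET.parse(path)
    division_rows, vote_rows = [], []
    for division in tree.iter("division"):
        div_id = division.get("id")
        division_rows.append(
            {
                "division_id": div_id,
                "date": division.get("divdate"),
                "number": division.get("divnumber"),
            }
        )
        for mp in division.iter("mpname"):
            vote_rows.append(
                {
                    "division_id": div_id,
                    "member_id": mp.get("id"),
                    "vote": mp.get("vote"),
                }
            )
    return pd.DataFrame(division_rows), pd.DataFrame(vote_rows)
```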
I think this only needs to be pw_divisions and pw_vote. The schemas for these are at the link above.
pw_divisions includes division descriptions added in Public Whip - we will need to do a comparison between the new data source and the old data source to extract the ‘custom division name/description’, which can be re-added in twfy-votes (currently there is a big yaml for this).
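For that comparison step, something along these lines would do - column names (division_id, division_title) are placeholders rather than the actual pw_divisions schema:

```python
# Hedged sketch: find divisions where the old (Public Whip) description differs
# from the freshly generated one, so custom names can be re-added in twfy-votes.
# Column names are placeholders, not the confirmed pw_divisions schema.
import pandas as pd

old = pd.read_parquet("pw_divisions_old.parquet")  # dump with hand-edited descriptions
new = pd.read_parquet("pw_divisions_new.parquet")  # regenerated straight from the XML

merged = old.merge(new, on="division_id", suffixes=("_old", "_new"))
custom = merged[merged["division_title_old"] != merged["division_title_new"]]

# Keep just the overrides, ready to turn into the yaml twfy-votes currently uses
custom[["division_id", "division_title_old"]].to_csv("custom_descriptions.csv", index=False)
```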
There is duplication between pw_mp and a separate set of tables twfy-votes extracts from people.json (https://pages.mysociety.org/politician_data/) - so pw_mp should be able to be cut out without any work in parlparse.
Questions:
pw_vote contains a reference to a membership_id rather than a person_id (meaning votes can be easily mapped to current parties through a join) - how does this handle membership changes that are added after the fact (party changes etc.)? Is pw rebuilding a lot of this table, rather than just adding new votes, to catch this kind of thing?
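To illustrate the join in question - a minimal sketch of mapping votes to parties via membership_id; table and column names here are assumptions, not the confirmed schema:

```python
# Hedged sketch: mapping votes to parties via membership_id.
# If memberships are corrected after the fact (e.g. a party change), the vote
# rows themselves don't change, but this join only reflects the correction if
# the memberships table (or a rebuilt pw_vote) is regenerated.
# Column names are assumptions for illustration.
import pandas as pd

votes = pd.read_parquet("pw_vote.parquet")            # division_id, membership_id, vote
memberships = pd.read_parquet("memberships.parquet")  # membership_id, person_id, party

votes_with_party = votes.merge(memberships, on="membership_id", how="left")
print(votes_with_party.groupby(["division_id", "party"])["vote"].value_counts().head())
```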
It’s useful if this ends up as parquet files somewhere on the internet, because these can be directly referenced. This might add a few dependencies to parlparse - or we could export as XML, which one of the GitHub-based data repos reprocesses. My sense is the more stuff that is directly in parlparse the better.
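On the dependency point - the main thing a parquet export would pull in is an Arrow backend, since pandas delegates to pyarrow (or fastparquet). A minimal sketch, with made-up example data:

```python
# Hedged sketch: writing a parquet export. pandas.DataFrame.to_parquet needs
# pyarrow (or fastparquet) installed, which is the extra dependency in question.
import pandas as pd

divisions = pd.DataFrame(
    [{"division_id": "example-division-1", "date": "2024-01-10", "number": 42}]
)
divisions.to_parquet("pw_divisions.parquet", index=False)
```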
I have in twfy-votes moved towards using ‘chamber’ as a field/column name rather than ‘house’ (to better capture multiple parliaments). Could move this back a step and also do it here - but it’s also easy to change elsewhere if we want to be consistent within parlparse.
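If we do standardise on ‘chamber’ here too, it’s a trivial rename at export time - e.g. (made-up example data):

```python
import pandas as pd

# Hedged sketch: rename the column at export time if parlparse keeps "house"
# internally but the export standardises on "chamber".
df = pd.DataFrame({"house": ["commons"], "division_id": ["example-division-1"]})
df = df.rename(columns={"house": "chamber"})
```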