Slowness of data pipeline [META] #155

Closed
reginafcompton opened this issue Jan 23, 2019 · 4 comments

Comments

reginafcompton (Contributor) commented Jan 23, 2019

Related issues:


Preliminary observations

On January 22, we deployed NYC Councilmatic to the production server, which cleared out the downloads directory. As a result, the data pipeline cycle completed as expected: once every four hours.

Shortly thereafter, the process began to slow, with import_data executing once every one or two hours. It's not yet obvious where the cron backup occurs, nor why, though it certainly seems to be either the updating of the Solr index or the conversion of RTF.

@reginafcompton (Contributor, Author)

Looking at the logs

import_data
import_data downloads 1781 bills on 2019-01-22 at 9:35 am, after which update_index and convert_rtf do not run. That's very strange. import_data runs again on 2019-01-22 at 12:05 pm and downloads only 84 bills.

update_index
update_index runs on 2019-01-22 at 12:06:51 pm. It indexes 1598 NYC bills. The process ends at 12:08:52 pm. The next update occurs on 2019-01-22 at 13:37:14. It indexes 1607 NYC bills.

convert_rtf
convert_rtf runs on 2019-01-22 at 12:08:52. It finishes at 13:22:38 – so, here, the conversion script clearly causes the "backup" in the data pipeline.


Some questions:

  • Why does update_index update so many bills when the previous import downloaded only 84 bills?
  • What does the age argument do? Can it be more precise?
  • How does the conversion script know which bills to convert? Why does it get backed up, not once, but multiple times? (Look at the logs closely for this.)

@reginafcompton (Contributor, Author)

I found the source of the issue: import_data. Here's what's going on.

import_data determines which entities have been newly updated by querying the raw tables, i.e., the tables populated with data from the downloads directory. For example, in the case of bills, import_data executes the following queries:

Put data into change_bill table

INSERT INTO change_bill
SELECT raw.ocd_id
FROM raw_bill AS raw
JOIN councilmatic_core_bill AS dat
ON raw.ocd_id = dat.ocd_id
....where clause to handle null values....

Update existing entries in councilmatic_core_bill, if they are in the change_bill table

UPDATE councilmatic_core_bill SET
"ocd_id"=s."ocd_id", "ocd_created_at"=s."ocd_created_at", "ocd_updated_at"=s."ocd_updated_at", "description"=s."description", "identifier"=s."identifier", "bill_type"=s."bill_type", "classification"=s."classification", "source_url"=s."source_url", "source_note"=s."source_note", "from_organization_id"=s."from_organization_id", "full_text"=s."full_text", "ocr_full_text"=s."ocr_full_text", "html_text"=s."html_text", "abstract"=s."abstract", "last_action_date"=s."last_action_date", "legislative_session_id"=s."legislative_session_id", "slug"=s."slug", "updated_at"=s."updated_at"
        FROM (
                SELECT
                raw."ocd_id", raw."ocd_created_at", raw."ocd_updated_at", raw."description", raw."identifier", raw."bill_type", raw."classification", raw."source_url", raw."source_note", raw."from_organization_id", raw."full_text", raw."ocr_full_text", raw."html_text", raw."abstract", raw."last_action_date", raw."legislative_session_id", raw."slug", raw."updated_at"
              FROM raw_bill AS raw
              JOIN change_bill AS change
              ON raw.ocd_id = change.ocd_id
        ) AS s
        WHERE councilmatic_core_bill.ocd_id = s.ocd_id

I noticed this problem when I found that the number of bills updated far exceeded the number downloaded, e.g.:

----------------------------------------------------------------------
  2019-01-24 15:50:56 Downloaded 133 bills  
----------------------------------------------------------------------

 ...fun ASCII art...

2019-01-24 15:50:56 Importing bills ...               
2019-01-24 15:51:35 Inserted a total of 1967 raw bills

2019-01-24 15:51:43 Inserted 7143 actions

2019-01-24 15:51:49 Found 1915 changed bill
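
For intuition, this is roughly why the raw table holds so many more bills than the latest download when the downloads directory never gets cleared (a hypothetical sketch of the pattern, not the actual django-councilmatic code – the file layout and paths are assumptions):

import glob
import json

# Assumed pattern: import_data repopulates raw_bill from every file in the
# downloads directory, not just the files fetched by the current run.
def load_raw_bills(downloads_dir='downloads'):
    bills = []
    for path in glob.glob(f'{downloads_dir}/**/*.json', recursive=True):
        with open(path) as f:
            bills.append(json.load(f))
    return bills

raw_bills = load_raw_bills()
# After several runs without a cleanup, this returns ~1967 bills even though
# the most recent run downloaded only 133 – and the change_bill comparison
# then runs against all of them.
print(len(raw_bills))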

Marking bills with a fresh updated_at timestamp causes issues further down the data pipeline, mainly in update_index and convert_rtf. Both depend on the updated_at timestamp: so, if thousands of bills are marked as updated, the Solr update and the conversion script have to handle thousands of entries. That takes considerable time.
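
Both downstream steps key off that timestamp in roughly this way (a hedged sketch of the pattern; the four-hour window and the exact queryset are assumptions, not the real haystack/django-councilmatic code):

from datetime import timedelta

from django.utils import timezone

from councilmatic_core.models import Bill

# Assumed pattern: update_index (via its age argument) and convert_rtf both
# restrict their work to bills whose updated_at falls inside a recent window.
cutoff = timezone.now() - timedelta(hours=4)
recent_bills = Bill.objects.filter(updated_at__gte=cutoff)

# Because the change detection refreshes updated_at on ~1900 rows, this
# queryset returns thousands of bills even when only 84 were downloaded.
print(recent_bills.count())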

Proposed solution

I think we need to implement datamade/django-councilmatic#226. Specifically, I'd like to see the downloads directory get cleared at the end of every import. I'll put together a PR for this.
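
For reference, the cleanup I have in mind is roughly this (the directory name and hook point are assumptions; the actual change will go through the PR):

import os
import shutil

def clear_downloads(downloads_dir='downloads'):
    """Remove everything left in the downloads directory after an import."""
    for entry in os.listdir(downloads_dir):
        path = os.path.join(downloads_dir, entry)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

# Called at the end of import_data, so the next run's raw tables are built
# only from freshly downloaded files.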

reginafcompton commented Jan 30, 2019

The RTF conversion script typically runs within 15 minutes. However, sometimes it requires additional time.

This can happen if the import adds a considerable volume of bills – that's fine! – but it also seems to happen because the conversion script determines which bills need converting by finding the max ocd_updated_at timestamp, and comparing that to the bills' updated_at timestamp.

The ocd_updated_at timestamp occurs earlier than the updated_at timestamp in our Councilmatic database, so the RTF script potentially runs against bills that have already been converted (unnecessary and time-consuming). The simple fix? Do this: max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']
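
In other words, the change is just which column seeds the cutoff (a sketch of my reading of the current behavior versus the fix; not a copy of the actual script):

from django.db.models import Max

from councilmatic_core.models import Bill

# Current behavior (as I read it): the cutoff is the max ocd_updated_at, which
# lags updated_at, so already-converted bills can still look newer than the
# cutoff and get re-converted.
max_updated = Bill.objects.all().aggregate(Max('ocd_updated_at'))['ocd_updated_at__max']

# Proposed fix: seed the cutoff from updated_at, so the cutoff and the
# comparison use the same clock.
max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']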

@reginafcompton (Contributor, Author)

RTF conversion script updated via datamade/django-councilmatic#230. Ready to close!
