Slowness of data pipeline [META] #155

Closed
reginafcompton opened this issue Jan 23, 2019 · 4 comments

Comments

reginafcompton (Contributor) commented Jan 23, 2019

Related issues:


Preliminary observations

On January 22, we deployed NYC Councilmatic to the production server, which cleared out the downloads directory. As a result, the data pipeline cycle completed as expected: once every four hours.

Shortly thereafter, the process began to slow, with import_data executing once every one or two hours. It's not yet obvious where the cron backup occurs, nor why, though it certainly seems to be either the updating of the Solr index or the conversion of RTF.

@reginafcompton (Contributor, Author)

Looking at the logs

import_data
import_data downloads 1781 bills on 2019-01-22 at 9:35 am, after which update_index and convert_rtf do not run. That's very strange. import_data runs again on 2019-01-22 at 12:05 pm and downloads only 84 bills.

update_index
update_index runs on 2019-01-22 at 12:06:51 pm. It indexes 1598 NYC bills. The process ends at 12:08:52 pm. The next update occurs on 2019-01-22 at 13:37:14. It indexes 1607 NYC bills.

convert_rtf
convert_rtf runs on 2019-01-22 at 12:08:52. It finishes at 13:22:38 – so, here, the conversion script clearly causes the "backup" in the data pipeline.


Some questions:

  • Why does update_index update so many bills when the previous import downloaded only 84 bills?
  • What does the age argument do? Can it be more precise?
  • How does the conversion script know which bills to convert? Why does it get backed up, not once, but multiple times? (Look at the logs closely for this.)

@reginafcompton (Contributor, Author)

I found the source of the issue: import_data. Here's what's going on.

import_data determines which entities have been newly updated by querying the raw tables, i.e., the tables populated with data from the downloads directory. For example, in the case of bills, import_data executes the following queries:

Put data into change_bill table

INSERT INTO change_bill
SELECT raw.ocd_id
FROM raw_bill AS raw
JOIN councilmatic_core_bill AS dat
ON raw.ocd_id = dat.ocd_id
....where clause to handle null values....

Update existing entries in councilmatic_core_bill, if they are in the change_bill table

UPDATE councilmatic_core_bill SET
"ocd_id"=s."ocd_id", "ocd_created_at"=s."ocd_created_at", "ocd_updated_at"=s."ocd_updated_at", "description"=s."description", "identifier"=s."identifier", "bill_type"=s."bill_type", "classification"=s."classification", "source_url"=s."source_url", "source_note"=s."source_note", "from_organization_id"=s."from_organization_id", "full_text"=s."full_text", "ocr_full_text"=s."ocr_full_text", "html_text"=s."html_text", "abstract"=s."abstract", "last_action_date"=s."last_action_date", "legislative_session_id"=s."legislative_session_id", "slug"=s."slug", "updated_at"=s."updated_at"
        FROM (
                SELECT
                raw."ocd_id", raw."ocd_created_at", raw."ocd_updated_at", raw."description", raw."identifier", raw."bill_type", raw."classification", raw."source_url", raw."source_note", raw."from_organization_id", raw."full_text", raw."ocr_full_text", raw."html_text", raw."abstract", raw."last_action_date", raw."legislative_session_id", raw."slug", raw."updated_at"
              FROM raw_bill AS raw
              JOIN change_bill AS change
              ON raw.ocd_id = change.ocd_id
        ) AS s
        WHERE councilmatic_core_bill.ocd_id = s.ocd_id

I noticed this problem when I found that the number of bills updated far exceeded the number downloaded, e.g.:

----------------------------------------------------------------------
  2019-01-24 15:50:56 Downloaded 133 bills  
----------------------------------------------------------------------

 ...fun ASCII art...

2019-01-24 15:50:56 Importing bills ...               
2019-01-24 15:51:35 Inserted a total of 1967 raw bills

2019-01-24 15:51:43 Inserted 7143 actions

2019-01-24 15:51:49 Found 1915 changed bill
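
For intuition, this is roughly why the raw table holds so many more bills than the latest download when the downloads directory never gets cleared (a hypothetical sketch of the pattern, not the actual django-councilmatic code – the file layout and paths are assumptions):

import glob
import json

# Assumed pattern: import_data repopulates raw_bill from every file in the
# downloads directory, not just the files fetched by the current run.
def load_raw_bills(downloads_dir='downloads'):
    bills = []
    for path in glob.glob(f'{downloads_dir}/**/*.json', recursive=True):
        with open(path) as f:
            bills.append(json.load(f))
    return bills

raw_bills = load_raw_bills()
# After several runs without a cleanup, this returns ~1967 bills even though
# the most recent run downloaded only 133 – and the change_bill comparison
# then runs against all of them.
print(len(raw_bills))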

Marking bills with a fresh updated_at timestamp causes issues further down the data pipeline, mainly in update_index and convert_rtf. Both depend on the updated_at timestamp: so, if thousands of bills are marked as updated, the Solr update and the conversion script have to handle thousands of entries. That takes considerable time.
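
Both downstream steps key off that timestamp in roughly this way (a hedged sketch of the pattern; the four-hour window and the exact queryset are assumptions, not the real haystack/django-councilmatic code):

from datetime import timedelta

from django.utils import timezone

from councilmatic_core.models import Bill

# Assumed pattern: update_index (via its age argument) and convert_rtf both
# restrict their work to bills whose updated_at falls inside a recent window.
cutoff = timezone.now() - timedelta(hours=4)
recent_bills = Bill.objects.filter(updated_at__gte=cutoff)

# Because the change detection refreshes updated_at on ~1900 rows, this
# queryset returns thousands of bills even when only 84 were downloaded.
print(recent_bills.count())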

Proposed solution

I think we need to implement datamade/django-councilmatic#226. Specifically, I'd like to see the downloads directory get cleared at the end of every import. I'll put together a PR for this.
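
For reference, the cleanup I have in mind is roughly this (the directory name and hook point are assumptions; the actual change will go through the PR):

import os
import shutil

def clear_downloads(downloads_dir='downloads'):
    """Remove everything left in the downloads directory after an import."""
    for entry in os.listdir(downloads_dir):
        path = os.path.join(downloads_dir, entry)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)

# Called at the end of import_data, so the next run's raw tables are built
# only from freshly downloaded files.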

reginafcompton commented Jan 30, 2019

The RTF conversion script typically runs within 15 minutes. However, sometimes it requires additional time.

This can happen if the import adds a considerable volume of bills – that's fine! – but it also seems to happen because the conversion script determines which bills need converting by finding the max ocd_updated_at timestamp, and comparing that to the bills' updated_at timestamp.

The ocd_updated_at timestamp occurs earlier than the updated_at timestamp in our Councilmatic database, so the RTF script potentially runs against bills that have already been converted (unnecessary and time-consuming). The simple fix? Do this: max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']
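
In other words, the change is just which column seeds the cutoff (a sketch of my reading of the current behavior versus the fix; not a copy of the actual script):

from django.db.models import Max

from councilmatic_core.models import Bill

# Current behavior (as I read it): the cutoff is the max ocd_updated_at, which
# lags updated_at, so already-converted bills can still look newer than the
# cutoff and get re-converted.
max_updated = Bill.objects.all().aggregate(Max('ocd_updated_at'))['ocd_updated_at__max']

# Proposed fix: seed the cutoff from updated_at, so the cutoff and the
# comparison use the same clock.
max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']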

@reginafcompton (Contributor, Author)

RTF conversion script updated via datamade/django-councilmatic#230. Ready to close!
