Slowness of data pipeline [META] #155
Looking at the logs, the pipeline runs `import_data`, `update_index`, and `convert_rtf`. Some questions:
I found the source of the issue:

- Put data into `change_bill` table
- Update existing entries in

I noticed this problem when I found that the number of bills updated far exceeded the number downloaded, e.g.:

Marking bills with a fresh

**Proposed solution**

I think we need to implement datamade/django-councilmatic#226. Specifically, I'd like to see the `downloads` directory get cleared at the end of every import. I'll put together a PR for this.
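The proposed fix — clearing the `downloads` directory at the end of every import, so stale files can't be re-processed on the next cycle — could be sketched roughly like this. The function name, the directory path, and the call site are all hypothetical; the actual change belongs in datamade/django-councilmatic#226:

```python
import shutil
from pathlib import Path


def clear_downloads(downloads_dir):
    """Remove everything under the downloads directory after an import
    finishes, so the next pipeline run only sees freshly downloaded files.

    Sketch only: the real path and the place this gets called from
    (presumably the end of the import management command) are assumptions.
    """
    downloads = Path(downloads_dir)
    for entry in downloads.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)   # remove nested directories wholesale
        else:
            entry.unlink()         # remove individual downloaded files
```

Deleting the directory's *contents* rather than the directory itself keeps any permissions or mount configuration on `downloads/` intact between runs.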
The RTF conversion script typically runs within 15 minutes. However, it sometimes requires additional time. This can happen if the import adds a considerable volume of bills – that's fine! – but it also seems to happen because the conversion script determines which bills need converting by finding the max timestamp, and the OCD timestamp occurs earlier than the
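The failure mode hinted at above — selecting bills against a single max-timestamp watermark when a bill's OCD timestamp can predate that watermark — can be illustrated with a small sketch. The field and variable names here are made up for illustration; the real selection logic lives in the conversion management command:

```python
from datetime import datetime, timedelta


def bills_to_convert(bills, watermark):
    """Select bills whose OCD update timestamp is newer than the max
    timestamp recorded by the previous conversion run.

    If a bill's OCD timestamp was written *before* the watermark was
    taken (e.g. the import stamped it earlier in the cycle), the bill
    is silently skipped and never converted.
    """
    return [b for b in bills if b["ocd_updated_at"] > watermark]


watermark = datetime(2019, 2, 1, 12, 0)  # max timestamp from the last run
bills = [
    # Arrived after the watermark: picked up as expected.
    {"id": "new-bill", "ocd_updated_at": watermark + timedelta(minutes=5)},
    # OCD timestamp predates the watermark: missed by the comparison.
    {"id": "missed-bill", "ocd_updated_at": watermark - timedelta(minutes=5)},
]
selected = bills_to_convert(bills, watermark)
```

A more robust selection would track conversion state per bill (e.g. a "needs conversion" flag set by the import) rather than comparing against a single global timestamp.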
RTF conversion script updated via datamade/django-councilmatic#230. Ready to close!
Related issues:

- `update_index`: #140

**Preliminary observations**
On January 22, we deployed NYC Councilmatic to the production server, in turn clearing out the `downloads` directory. As a result, the data pipeline cycle completed as expected: once every four hours.

Shortly thereafter, the process began to slow, with `import_data` executing once every one or two hours. It's not yet obvious where the cron backup occurs, nor why, though it certainly seems to be in either the updating of the Solr index or the conversion of RTF.
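Wherever the backup turns out to be, one way to keep slow runs from stacking up behind the four-hour cron schedule is to guard the pipeline entry point with a non-blocking file lock, so an overlapping invocation exits immediately instead of queueing. This is a generic sketch, not the project's actual mechanism; the lock path and wrapper are invented for illustration:

```python
import fcntl


def run_exclusively(lock_path, job):
    """Run `job` only if no other invocation holds the lock file.

    Uses a non-blocking flock: if a previous pipeline run is still in
    progress, this returns False immediately rather than piling a new
    run up behind it. Lock path is hypothetical.
    """
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another run holds the lock; skip this cron invocation.
        lock_file.close()
        return False
    try:
        job()
        return True
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```

The same effect can be had from the crontab itself with `flock -n`; either way, skipped runs make the slowdown visible in the logs instead of silently compounding it.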