When attempting to validate my data package using the CLI command:
data validate datapackage.json
I first get a warning about a memory leak:
(node:21287) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 121 end listeners added. Use emitter.setMaxListeners() to increase limit
followed by a core dump an hour or more later. The data package I'm working with can be found on datahub. It currently consists of two tabular data resources. One (mines) contains ~30MB of CSV data; it triggers the memory leak warning but validates successfully in under a minute. The other (employment-production-quarterly) is ~160MB of CSV data. It also triggers the memory leak warning, then runs for many minutes at ~100-150% of a CPU while its memory footprint slowly and continuously grows (though only to ~10% of available memory), eventually resulting in the following error:
From within Python, using goodtables.validate() on the same data package, including all ~2 million records, validation completes successfully in about 10 minutes.
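A minimal sketch of that kind of call (assuming goodtables-py's top-level validate() and its datapackage preset detection; the report keys shown follow its JSON report format, to the best of my recollection):

    from goodtables import validate

    # Validate all tabular resources in the data package; goodtables
    # detects the 'datapackage' preset from the datapackage.json source.
    report = validate('datapackage.json')

    print(report['valid'])        # overall pass/fail
    print(report['table-count'])  # number of tabular resources checked
    for table in report['tables']:
        print(table['source'], table['row-count'], table['error-count'])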
I am running Ubuntu 18.04.1 on a Thinkpad T470S with two 2-thread cores and 24GB of RAM. The versions of node (v8.11.1) and npm (v6.4.1) that I'm using are the ones distributed with the current anaconda3 distribution (v5.2). The version of data is 0.9.5.
Core dump aside, it seems like the data validation could happen much faster somehow. Is it going through the data record by record, or working on vectorized columns?
@zaneselvans When running goodtables.validate(), are you setting row_limit= to a large enough number to scan the whole table? At least on my system, the default limit is 1000. Checking because I am suspicious of your speed results (2 million records in 10 minutes); I would have expected it to be much slower based on testing with my own data...
"warnings": ["Table table.csv inspection has reached 1000 row(s) limit"]
goodtables-py, at least, checks line by line. I agree this could probably be done much faster by working with vectorized columns.
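As a rough illustration of the difference (this is not how goodtables is implemented, and the file and column names here are hypothetical), a vectorized column check with pandas looks like:

    import pandas as pd

    df = pd.read_csv('employment-production-quarterly.csv')

    # Whole-column (vectorized) checks rather than a per-row loop:
    missing_ids = df['mine_id'].isna()           # required field is present
    bad_years = ~df['year'].between(1900, 2100)  # value is in a plausible range

    print(missing_ids.sum(), bad_years.sum())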
It's been a while! I'm not sure I remember whether I had the row limit set; initially, at least, I was trying to validate everything. In the PUDL project we now plan, in theory, to use goodtables programmatically, but it isn't yet able to test all the things we want to check structurally about the data, so we're only running it on a few thousand rows. The main structural validation we're doing happens by actually pulling all the data into an SQLite database.
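That SQLite step is roughly of this shape (a sketch with hypothetical file and table names, using pandas' to_sql rather than whatever PUDL actually does):

    import sqlite3
    import pandas as pd

    # Pulling the CSV through pandas into an SQLite table forces every
    # row to parse, which surfaces many structural problems as errors.
    conn = sqlite3.connect('pudl.sqlite')                     # hypothetical path
    df = pd.read_csv('employment-production-quarterly.csv')   # hypothetical filename
    df.to_sql('employment_production_quarterly', conn,
              if_exists='replace', index=False)
    conn.close()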