JS Out of memory #33
@EnnoMeijers Thanks, Enno. LD-wizard should never go out of memory, given that it processes the results in a streaming fashion. Can you share your configuration?
See the kb-nbt files in the tests branch, https://github.com/netwerk-digitaal-erfgoed/ld-workbench/tree/tests/static/kb-nbt. I downloaded the NBT dump file (see link above) to the static/kb-nbt/ directory and changed the config.yml to point to that local file instead of the SPARQL endpoint.
I misread your first post, sorry about that (and thanks to @philipperenzen for pointing that out). You’re querying a local file, which needs to be fully read into memory for SPARQL querying. Although the LD-wizard code streams through all the results, the in-memory SPARQL engine does not support streaming. As a result, the process goes out of memory, because the local file does not fit in memory.
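For context, here is a minimal sketch of what such a file-backed query looks like through Comunica’s JavaScript API. This is illustrative, not LD Workbench’s actual code; the `@comunica/query-sparql-file` package name, the file path, and the API details are assumptions based on Comunica v2-style usage and may differ per version.

```typescript
// Illustrative sketch only: querying a local Turtle file with Comunica.
// Package name and exact API are assumptions and may vary between versions.
import { QueryEngine } from '@comunica/query-sparql-file';

async function main(): Promise<void> {
  const engine = new QueryEngine();

  // The bindings come back as a stream, but to answer the query the engine
  // still reads the whole local file into memory first, so an 8 GB Turtle
  // dump exceeds the default Node.js heap, as described above.
  const bindingsStream = await engine.queryBindings(
    'SELECT * WHERE { ?s ?p ?o } LIMIT 1',
    { sources: ['static/kb-nbt/nbt_20211011.ttl'] },
  );

  bindingsStream.on('data', (binding) => console.log(binding.get('s')?.value));
  bindingsStream.on('error', (error) => console.error(error));
}

main().catch(console.error);
```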
Think about reporting this at the Comunica repo.
Look into raising the heap limit of the Node.js VM @ddeboer |
Reproduced; this is indeed due to Comunica’s in-memory SPARQL server. Serving the dump file:

```console
$ comunica-sparql-file-http nbt_20211011.ttl
Server running on http://localhost:3000/sparql
Server worker (11053) running on http://localhost:3000/sparql
[200] POST to /sparql
Requested media type: application/sparql-results+json
Received query query: select * { ?s ?p ?o} limit 1
Worker 11053 got assigned a new query (0).

<--- Last few GCs --->
[11053:0x140008000]    51459 ms: Mark-Compact 4008.5 (4139.1) -> 3995.9 (4141.8) MB, 1751.62 / 0.00 ms  (average mu = 0.098, current mu = 0.008) task; scavenge might not succeed
[11053:0x140008000]    54866 ms: Mark-Compact 4011.3 (4142.4) -> 3999.0 (4145.1) MB, 3394.33 / 0.00 ms  (average mu = 0.040, current mu = 0.004) task; scavenge might not succeed

<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x1044164c4 node::Abort() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 2: 0x104417818 node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 3: 0x104582a48 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 4: 0x1045829f8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 5: 0x10471541c v8::internal::Heap::CallGCPrologueCallbacks(v8::GCType, v8::GCCallbackFlags, v8::internal::GCTracer::Scope::ScopeId) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 6: 0x104714150 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 7: 0x104761f18 v8::internal::MinorGCJob::Task::RunInternal() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 8: 0x10447c4bc node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 9: 0x10447c1d0 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
10: 0x10743e734 uv__async_io [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
11: 0x10744e1c0 uv__io_poll [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
12: 0x10743ebc8 uv_run [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
13: 0x104341a40 node::SpinEventLoopInternal(node::Environment*) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
14: 0x10445ac90 node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
15: 0x10445a9e8 node::NodeMainInstance::Run() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
16: 0x1043de2ec node::Start(int, char**) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
17: 0x1899150e0 start [/usr/lib/dyld]
Worker 11053 died with SIGABRT. Starting new worker.
Server worker (11072) running on http://localhost:3000/sparql
```

The client query against that server fails accordingly:

```console
$ comunica-sparql http://localhost:3000/sparql 'select * { ?s ?p ?o} limit 1'
[fetch failed
```

Raising the Node.js heap limit and the HTTP timeout does not help; the query then times out instead:

```console
$ NODE_OPTIONS="--max-old-space-size=10000" comunica-sparql-file-http nbt_20211011.ttl --httpTimeout 100000
Server running on http://localhost:3000/sparql
Server worker (11300) running on http://localhost:3000/sparql
[200] POST to /sparql
Requested media type: application/sparql-results+json
Received query query: select * { ?s ?p ?o} limit 1
Worker 11300 got assigned a new query (0).
Worker 11300 timed out for query 0.
Shutting down worker 11300 with 1 open connections.
Worker 11300 died with 15. Starting new worker.
Server worker (11333) running on http://localhost:3000/sparql
```
I’m tempted to close this as out-of-scope for this project. Apparently Comunica cannot be expected to handle large data dumps such as this one (7.4 GB). This hangs for me:
The recommended course of action would probably be to spin up a local SPARQL endpoint instead of directly querying the large data dump. @EnnoMeijers What do you think?
Agreed, though it would be useful to update the documentation to clearly state the limits when running the workbench against local files. Is it possible to specify the advised maximum file size?
I think that it is cleaner to remove Comunica support from this project. Users can use whichever SPARQL engine they want, remote or local. We should not prioritize one SPARQL implementation over others (although Comunica is of course a very nice implementation).
@wouterbeek Agreed when we’re talking about users spinning up their own SPARQL endpoint, using Comunica or something else. However, providing a data dump to LD Workbench (instead of a local/remote SPARQL endpoint) is different: then it’s LD Workbench querying the file, and LD Workbench happens to use the Comunica client for executing the query. So the argument that we should not prioritize one SPARQL implementation over others does not apply then.
Is there a benefit to using a local file over using a local endpoint? The latter decouples through a standardized protocol, and thereby allows any SPARQL implementation to be plugged in. (For clarity: I want to avoid a tight coupling between LD Workbench and Comunica, similar to how VocBench and RDF4J have become too intertwined. Today, users of VocBench must use RDF4J and cannot use any other SPARQL implementation. This is something I specifically want to avoid.)
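To illustrate the point about the standardized protocol: any conforming endpoint can be queried with one and the same plain HTTP request, regardless of the engine behind it. A rough sketch follows, assuming Node.js ≥ 18 for the built-in fetch; the endpoint URL is just an example.

```typescript
// Sketch of a SPARQL Protocol request: the same code works against Comunica,
// Fuseki, Virtuoso, GraphDB, or any other conforming endpoint.
async function selectOne(endpoint: string): Promise<unknown> {
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/sparql-query',
      Accept: 'application/sparql-results+json',
    },
    body: 'SELECT * WHERE { ?s ?p ?o } LIMIT 1',
  });
  if (!response.ok) {
    throw new Error(`SPARQL endpoint returned ${response.status}`);
  }
  return response.json();
}

selectOne('http://localhost:3000/sparql').then(console.log).catch(console.error);
```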
Yes: it’s easier to set up, as you can just point LD Workbench to a local file (e.g. the dump file in this issue).

Note that internally LD Workbench queries over files too (not only SPARQL endpoints): each stage writes its output to a file.

I agree with the requirement to prevent tight coupling, but I feel there’s still a misunderstanding here:
@ddeboer I am not opposed to the existence of a simple package that includes LD Workbench and an open-source SPARQL engine, that allows the user to load an RDF file into the SPARQL engine and let LD Workbench use the corresponding SPARQL endpoint, all in one go and with one command. But the project currently suffers from a deep, and in my opinion incorrect, belief that a local file is somehow faster/better/more reliable than a local or remote SPARQL endpoint. I really want to challenge the incorrect belief that use of the SPARQL Protocol significantly affects performance. Or, if that belief turns out to be correct, I want to be able to report this as a fundamental flaw to W3C for them to fix in the next SPARQL standard. At the moment this project moves between the following two approaches:

1. Configure ETL pipelines against the SPARQL standard, so that any endpoint that speaks the SPARQL Protocol can be used.
2. Query local data dumps directly, relying on a SPARQL engine bundled with LD Workbench.
I believe in approach (1), but not in approach (2). I also believe that the evidence for choosing approach (2) does not exist, or at least has not been presented yet. In the end, it is not so important what I do or do not believe. But for the project it is better if a clear choice is made for either of these approaches, and if development is then specifically oriented to work within that clearly chosen approach.
Thanks for elaborating @wouterbeek.

Where do you see this belief that a local file is faster/better/more reliable, exactly? LD Workbench allows querying over local files but does not favour that over SPARQL endpoints.

It seems we’re all agreed on this approach: use the SPARQL standard for configuring ETL pipelines (glued together with some YAML configuration). Still, in practice there are problems with iterating over SPARQL endpoints (or databases in general), such as query execution time that grows with a larger OFFSET.
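For illustration, a sketch of the naive LIMIT/OFFSET iteration pattern that runs into this problem; the endpoint URL, page size, and query are made up for the example, and this is not LD Workbench’s actual iterator. Many stores re-evaluate the query and skip the first OFFSET solutions on every page, so later pages get progressively slower.

```typescript
// Hypothetical paginated iteration over a SPARQL endpoint with LIMIT/OFFSET.
// Each request forces the store to recompute and skip `offset` solutions,
// which is why execution time tends to grow with the offset.
async function* iterateBindings(endpoint: string, pageSize = 10_000) {
  for (let offset = 0; ; offset += pageSize) {
    const query =
      `SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s LIMIT ${pageSize} OFFSET ${offset}`;
    const response = await fetch(endpoint, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/sparql-query',
        Accept: 'application/sparql-results+json',
      },
      body: query,
    });
    const json = (await response.json()) as { results: { bindings: unknown[] } };
    if (json.results.bindings.length === 0) return; // no more pages
    yield* json.results.bindings;
  }
}
```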
Thanks, I think that this growing-OFFSET problem is the real issue that we want to solve.

Is that a correct summary of the underlying problem, @ddeboer @EnnoMeijers? Do you also have an LD Workbench configuration with which I can test this fundamental flaw? I will then create follow-up issues accordingly.
When trying to run LD Workbench against a local dump file (KB NBT dump, see the Datasetregister entry), LD Workbench crashes with a JavaScript heap out-of-memory error.

Used the KB-NBT config from the tests branch and pointed it to the downloaded dump file instead of the SPARQL endpoint. The dump file is in Turtle and has a size of 8 GB.