
JS Out of memory #33

Open
EnnoMeijers opened this issue Jan 8, 2024 · 15 comments
Labels
wontfix This will not be worked on

Comments

@EnnoMeijers

When trying to run the LDWorkbench against a local dump file (KB NBT dump, see Datasetregister entry) LDWorkbench crashes with the following error:

✔ validating pipeline
⠏ Loading results from Iterator
<--- Last few GCs --->

[8740:0x57579c0]    26245 ms: Scavenge 2025.0 (2078.8) -> 2020.3 (2080.3) MB, 5.08 / 0.00 ms  (average mu = 0.282, current mu = 0.187) task; 
[8740:0x57579c0]    26259 ms: Scavenge 2026.1 (2080.3) -> 2021.5 (2081.6) MB, 4.90 / 0.00 ms  (average mu = 0.282, current mu = 0.187) task; 
[8740:0x57579c0]    26629 ms: Scavenge 2027.6 (2081.7) -> 2022.6 (2098.5) MB, 357.61 / 0.00 ms  (average mu = 0.282, current mu = 0.187) task; 


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xc9e850 node::Abort() [node]
 2: 0xb720ff  [node]
 3: 0xec1a70 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
 4: 0xec1d57 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [node]
 5: 0x10d3dc5  [node]
 6: 0x10ebc48 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0x114481c v8::internal::MinorGCJob::Task::RunInternal() [node]
 8: 0xd17ed6  [node]
 9: 0xd1aa8f node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
10: 0x18827f3  [node]
11: 0x18971bb  [node]
12: 0x1883517 uv_run [node]
13: 0xbb5b83 node::SpinEventLoopInternal(node::Environment*) [node]
14: 0xced015  [node]
15: 0xced9dd node::NodeMainInstance::Run() [node]
16: 0xc588a7 node::Start(int, char**) [node]
17: 0x7fef8e0e5083 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
18: 0xbb2afe _start [node]
Aborted

Used the KB-NBT config from the tests branch and pointed it to the downloaded dump file instead of the SPARQL endpoint. The dump file is in Turtle and has a size of 8 GB.

@LaurensRietveld
Collaborator

@EnnoMeijers Thanks, Enno. LD Workbench should never go out of memory, given that it processes results in a streaming fashion. Can you share your configuration?

@EnnoMeijers
Author

See the kb-nbt files in the tests branch, https://github.com/netwerk-digitaal-erfgoed/ld-workbench/tree/tests/static/kb-nbt. I downloaded the NBT dump file (see link above) to the static/kb-nbt/ directory and changed the config.yml in the following way:

# Metadata for your pipeline:
name: KB NBT Pipeline
description: >
  This is an example pipeline. It uses files that are available in this repository 
  and SPARQL endpoints that should work.


# This is optional, by default it will be stored in the data directory of the pipeline using filename 'statements.nt'
destination: file://pipelines/data/kb-nbt-pipeline.nt

# The individual stages for your pipeline
stages:
  - name: "Stage 1"
    iterator:
      query: file://static/kb-nbt/iterator-stage-1.rq
      endpoint: file://static/kb-nbt/nbt_20211011.ttl
    generator: 
      - query: file://static/kb-nbt/generator-stage-1.rq

@LaurensRietveld
Collaborator

I misread your first post, sorry about that (and thanks to @philipperenzen for pointing that out).

You're querying a local file, and that file has to be read fully into memory before it can be queried with SPARQL. Although the LD Workbench code streams through all the results, the in-memory SPARQL engine does not stream over its source. As a result, the process goes out of memory, because the local file does not fit in memory.

@wouterbeek
Collaborator

Think about reporting this at the Comunica repo.

@wouterbeek
Collaborator

Look into raising the heap limit of the Node.js VM, @ddeboer.
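
(Purely illustrative: the heap limit can be raised via NODE_OPTIONS; the ld-workbench command name and the 8 GB limit below are assumptions, adjust them to the actual invocation.)

# Hypothetical invocation: give the Node.js process an ~8 GB old-space heap
NODE_OPTIONS="--max-old-space-size=8192" npx ld-workbench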

@ddeboer
Member

ddeboer commented Jan 22, 2024

Reproduced; this is indeed due to Comunica’s in-memory SPARQL server:

$ comunica-sparql-file-http nbt_20211011.ttl
$ comunica-sparql http://localhost:3000/sparql 'select * { ?s ?p ?o} limit 1'
[fetch failed

Server running on http://localhost:3000/sparql
Server worker (11053) running on http://localhost:3000/sparql
[200] POST to /sparql
      Requested media type: application/sparql-results+json
      Received query query: select * { ?s ?p ?o} limit 1
Worker 11053 got assigned a new query (0).




<--- Last few GCs --->

[11053:0x140008000]    51459 ms: Mark-Compact 4008.5 (4139.1) -> 3995.9 (4141.8) MB, 1751.62 / 0.00 ms  (average mu = 0.098, current mu = 0.008) task; scavenge might not succeed
[11053:0x140008000]    54866 ms: Mark-Compact 4011.3 (4142.4) -> 3999.0 (4145.1) MB, 3394.33 / 0.00 ms  (average mu = 0.040, current mu = 0.004) task; scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x1044164c4 node::Abort() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 2: 0x104417818 node::ModifyCodeGenerationFromStrings(v8::Local<v8::Context>, v8::Local<v8::Value>, bool) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 3: 0x104582a48 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 4: 0x1045829f8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 5: 0x10471541c v8::internal::Heap::CallGCPrologueCallbacks(v8::GCType, v8::GCCallbackFlags, v8::internal::GCTracer::Scope::ScopeId) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 6: 0x104714150 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 7: 0x104761f18 v8::internal::MinorGCJob::Task::RunInternal() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 8: 0x10447c4bc node::PerIsolatePlatformData::RunForegroundTask(std::__1::unique_ptr<v8::Task, std::__1::default_delete<v8::Task>>) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
 9: 0x10447c1d0 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
10: 0x10743e734 uv__async_io [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
11: 0x10744e1c0 uv__io_poll [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
12: 0x10743ebc8 uv_run [/opt/homebrew/Cellar/libuv/1.47.0/lib/libuv.1.dylib]
13: 0x104341a40 node::SpinEventLoopInternal(node::Environment*) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
14: 0x10445ac90 node::NodeMainInstance::Run(node::ExitCode*, node::Environment*) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
15: 0x10445a9e8 node::NodeMainInstance::Run() [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
16: 0x1043de2ec node::Start(int, char**) [/opt/homebrew/Cellar/node@20/20.10.0/bin/node]
17: 0x1899150e0 start [/usr/lib/dyld]
Worker 11053 died with SIGABRT. Starting new worker.
Server worker (11072) running on http://localhost:3000/sparql

nbt_20211011.ttl is 8 GB, so it makes sense more memory is required. Raising the JS heap memory doesn’t directly help:

$ NODE_OPTIONS="--max-old-space-size=10000" comunica-sparql-file-http nbt_20211011.ttl --httpTimeout 100000
Server running on http://localhost:3000/sparql

Server worker (11300) running on http://localhost:3000/sparql
[200] POST to /sparql
      Requested media type: application/sparql-results+json
      Received query query: select * { ?s ?p ?o} limit 1
Worker 11300 got assigned a new query (0).
Worker 11300 timed out for query 0.
Shutting down worker 11300 with 1 open connections.
Worker 11300 died with 15. Starting new worker.
Server worker (11333) running on http://localhost:3000/sparql

ddeboer added the wontfix label May 28, 2024
@ddeboer
Member

ddeboer commented May 28, 2024

I’m tempted to close this as out-of-scope for this project. Apparently Comunica cannot be expected to handle large data dumps such as this one (7.4 GB). This hangs for me:

NODE_OPTIONS="--max-old-space-size=20000" comunica-sparql-file nbt_20211011.ttl 'select * {?s ?p ?o} limit 1'

The recommended course of action would probably be to spin up a local SPARQL endpoint instead of directly querying the large data dump.
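
One way to do that, for illustration, is with Apache Jena Fuseki (the dataset name, database location and endpoint path below are assumptions):

# Load the dump into a local TDB2 database, then serve it over the SPARQL Protocol
tdb2.tdbloader --loc=nbt-db nbt_20211011.ttl
fuseki-server --loc=nbt-db /nbt
# config.yml can then point its endpoint at http://localhost:3030/nbt/sparql instead of the file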

@EnnoMeijers What do you think?

@EnnoMeijers
Author

Agreed, though it would be useful to update the documentation to clearly state the limits when running the workbench against local files. Is it possible to specify the advised maximum file size?

@wouterbeek
Collaborator

I think that it is cleaner to remove Comunica support from this project. Users can use whichever SPARQL engine they want, remote or local. We should not prioritize one SPARQL implementation over others (although Comunica is of course a very nice implementation).

@ddeboer
Member

ddeboer commented Jun 4, 2024

@wouterbeek Agreed when we’re talking about users spinning up their own SPARQL endpoint, using Comunica or something else.

However, providing a data dump to LD Workbench (instead of a local/remote SPARQL endpoint) is different: then it is LD Workbench querying the file, and LD Workbench happens to use the Comunica client for executing the query. So

Users can use whichever SPARQL engine they want

does not apply then.

@wouterbeek
Collaborator

Is there a benefit to using a local file over a local endpoint? The latter decouples through a standardized protocol and thereby allows any SPARQL implementation to be used.

(For clarity: I want to avoid a tight coupling between LD Workbench and Comunica, similar to how VocBench and RDF4J have become too intertwined. Today, users of VocBench must use RDF4J and cannot use any other SPARQL implementation. This is something I specifically want to avoid.)

@ddeboer
Member

ddeboer commented Jun 5, 2024

Is there a benefit to using a local file over using a local endpoint?

Yes: it’s easier to set up as you can just point LD Workbench to a local file (e.g. dump.nt) without having to install and run a SPARQL server.

Note that internally LD Workbench queries over files too (not only SPARQL endpoints): each stage writes output to a stageX.nt file, which can then be queried over by subsequent stages, if the pipeline happens to be configured that way.

I agree with the requirement to prevent tight coupling, but I feel there’s still a misunderstanding here:

  • LD Workbench uses the Comunica client to perform SPARQL queries; we could dependency-inject this but at the moment that’s not really worth the trouble;
  • if the user is setting up their own SPARQL endpoint, they’re of course free to choose any SPARQL server software (Comunica server, Fuseki, Triply etc.).

@wouterbeek
Collaborator

wouterbeek commented Jun 6, 2024

@ddeboer I am not opposed to the existence of a simple package that includes LD Workbench and an open source SPARQL engine, one that allows the user to load an RDF file into the SPARQL engine and lets LD Workbench use the corresponding SPARQL endpoint, all in one go and with one command.

But the project currently suffers from a deep -- and, in my opinion, incorrect -- belief that a local file is somehow faster/better/more reliable than a local or remote SPARQL endpoint. I really want to challenge the incorrect belief that use of the SPARQL Protocol significantly affects performance. Or, if that belief turns out to be correct, I want to be able to report this as a fundamental flaw to the W3C for them to fix in the next SPARQL standard.

At the moment this project moves between the following two approaches:

  1. Let's use the SPARQL standard as a decoupling between a lightweight ETL tool and an unspecified ecosystem of SPARQL endpoints, and
  2. Let's not use the SPARQL standard as a decoupling between a lightweight ETL tool and an unspecified ecosystem of SPARQL endpoints, because that does not work/scale, and let's instead integrate a specific approach for handling the backend locally, through coupling with a specific SPARQL engine and by treating local files differently.

I believe in approach (1), but not in approach (2). I also believe that the evidence for choosing approach (2) does not exist, or at least has not been presented yet.

In the end, it is not so important what I do or do not believe. But for the project it is better if a clear choice is made for one of these approaches, and if development is then specifically oriented to work within the approach that has been committed to.

@ddeboer
Member

ddeboer commented Jun 6, 2024

Thanks for elaborating @wouterbeek.

But the project currently suffers from a deep -- and, in my opinion, incorrect -- belief that a local file is somehow faster/better/more reliable than a local or remote SPARQL endpoint

Where do you see this, exactly? LD Workbench allows querying over local files but does not favour that over SPARQL endpoints.

Let's use the SPARQL standard

It seems we all agree on this approach: use the SPARQL standard for configuring ETL pipelines (glued together with some YAML configuration). Still, in practice there are problems with iterating over SPARQL endpoints (or databases in general), such as query execution time that grows with a larger OFFSET.
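
(To illustrate the OFFSET problem: paging through an endpoint the way an iterator does amounts to repeating the same query with an ever-growing offset. The endpoint URL, class IRI and page size below are placeholders, not the actual iterator query.)

time comunica-sparql http://localhost:3000/sparql \
  'SELECT ?this WHERE { ?this a <http://schema.org/Book> } LIMIT 100 OFFSET 0'
time comunica-sparql http://localhost:3000/sparql \
  'SELECT ?this WHERE { ?this a <http://schema.org/Book> } LIMIT 100 OFFSET 1000000'
# Many stores evaluate the second query by computing and then skipping the first
# 1,000,000 solutions, so later pages take progressively longer.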

@wouterbeek
Collaborator

wouterbeek commented Jun 8, 2024

Still, in practice there are problems with iterating over SPARQL endpoints (or databases in general), such as query execution time that grows with a larger OFFSET.

^ Thanks, I think that this is the real issue that we want to solve.

  • We encounter a fundamental flaw in (some?) contemporary implementations of the SPARQL 1.1 Query standard, where we are forced to issue multiple SPARQL requests with increasing offset values, and where execution time grows as the offset values become larger.
  • This fundamental flaw is either specific to one or more SPARQL 1.1 implementations, or is more fundamental and inherent in the SPARQL 1.1 standard. (We must investigate which of these is true.)
  • This fundamental flaw prevents us from using the SPARQL 1.1 Protocol for some use cases, such as larger ETL processes.
  • Because this fundamental flaw prevents us from using the SPARQL 1.1 Protocol, people are coming up with other, non-SPARQL solutions.
  • This causes a lot of time to be wasted in the community, with technological discussions and workarounds that do not produce sufficient value. In the meantime, linked data adoption stagnates, because the core linked data standard for querying data contains this fundamental flaw.

^ Is this a correct summary of the underlying problem, @ddeboer @EnnoMeijers? Do you also have an LD Workbench configuration with which I can test this fundamental flaw? I will then create follow-up issues accordingly.
