ES nodes OOM due to unbounded allocations in Lucene90DocValuesProducer #114941

Open
wyattberlinic opened this issue Oct 16, 2024 · 4 comments
Labels
:Analytics/Aggregations Aggregations :Analytics/Geo Indexing, search aggregations of geo points and shapes >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@wyattberlinic

Elasticsearch Version

8.15.1

Installed Plugins

None

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

When we issue a search where the following two conditions are true, ES OOMs:

  1. The index being searched contains documents with large GeoJSON fields.
  2. The query is an aggregation whose buckets are geo buckets built on the field from (1) (an illustrative request is sketched after this list).
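
For concreteness, here is a minimal sketch of the kind of request that satisfies both conditions. The index name (repro), field name (geometry), and aggregation type (geotile_grid) are assumptions made for illustration; the exact query in the repro repository may differ.

```python
# Minimal sketch only: index name, field name, and aggregation type are assumptions,
# not copied from the repro repository.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="repro",                    # hypothetical index containing large GeoJSON geometries
    size=0,
    aggregations={
        "grid": {
            "geotile_grid": {         # every grid cell becomes a bucket
                "field": "geometry",  # hypothetical geo_shape field from condition (1)
                "precision": 8,
            }
        }
    },
)
print(resp["aggregations"]["grid"]["buckets"][:3])
```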

In the referenced repro repository I do the following (a condensed sketch of these steps follows the list):

  1. Start ES in docker with 1GB of memory.
  2. Create a new index.
  3. Index some points.
  4. Index a document with a large geometry.
  5. Force merge (this seems to be necessary to get the doc-values metadata on disk into a state where the failure happens).
  6. Run an aggregation search against that index and geometry field.
  7. OOM.
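
A condensed sketch of those steps, under the same assumed index and field names as above; the authoritative version is the script in the linked es8-oom repository.

```python
# Condensed sketch of the repro steps; names and sizes are assumptions, the real
# script lives in the linked es8-oom repository.
import math
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 2. Create a new index with a geo_shape field.
es.indices.create(
    index="repro",
    mappings={"properties": {"geometry": {"type": "geo_shape"}}},
)

# 3. Index some small point documents.
for i in range(100):
    es.index(
        index="repro",
        id=str(i),
        document={"geometry": {"type": "point", "coordinates": [i % 90, i % 45]}},
    )

# 4. Index one document whose polygon has enough vertices that its GeoJSON
#    (and hence its binary doc value) is tens of megabytes.
n = 500_000
ring = [
    [10 * math.cos(2 * math.pi * k / n), 10 * math.sin(2 * math.pi * k / n)]
    for k in range(n)
]
ring.append(ring[0])  # close the ring
es.index(
    index="repro",
    id="big",
    document={"geometry": {"type": "polygon", "coordinates": [ring]}},
)

# 5. Force merge so the segment's doc-values metadata records the large maxLength.
es.indices.forcemerge(index="repro", max_num_segments=1)

# 6. Run the bucketing aggregation from the sketch above -> the node OOMs (step 7).
```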

When I open up the heap dump in VisualVM I see that most of the memory is taken by byte[]:
[screenshot: VisualVM heap dump dominated by byte[] instances]

If I dig in a bit further, I see that a large number of these TwoPhaseFilterMatchingDisiWrapper objects have been initialized.
[screenshot: many TwoPhaseFilterMatchingDisiWrapper instances in the heap dump]

In each of those we're initializing a 30 MB byte array, filled entirely with zeros, inside the ShapeDocValuesQuery via a Lucene90DocValuesProducer:
[screenshot: 30 MB zero-filled byte array allocated via Lucene90DocValuesProducer]

Notably, these byte arrays have 30,462,618 entries. That's the same number as val$entry.maxLength:
[screenshot: byte array length matching val$entry.maxLength]

What I think is happening, for each bucket in the aggregation:

  1. A ShapeDocValuesQuery is created.
  2. During scoring, in either getContainsWeight or getStandardWeight, a Lucene90DocValuesProducer is created via:
     https://github.com/elastic/elasticsearch/blob/v8.15.1/server/src/main/java/org/elasticsearch/lucene/spatial/ShapeDocValuesQuery.java#L124-L127
  3. That producer then allocates an array in one of these spots:
     https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L766-L766
     https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L807-L808

The size of that array is the maxLength of the doc-values entry, which is the size of the largest value that might be read out of the index via the producer.

If you have a large number of buckets multiplied by a large geometry, you OOM. There's no protection (e.g. raising an error if too much memory would be used) or safety mechanism (e.g. allocating memory just in time, a few buckets at a time). I would be satisfied to see either of these added. Even just a check that raises an error if more than some threshold of bytes would be allocated would be sufficient, because I'd then be able to find the problem data and re-index it in another form. As it stands, I can't easily find which indices have these large geometries.
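
To make the failure mode concrete, here is a back-of-the-envelope estimate using the maxLength observed in the heap dump; the bucket count and the guard at the end are hypothetical illustrations, not existing Elasticsearch behaviour.

```python
# Back-of-the-envelope estimate; max_length_bytes is from the heap dump above,
# the bucket count is an assumption, and the guard is purely hypothetical.
max_length_bytes = 30_462_618            # val$entry.maxLength seen in the heap dump
buckets = 100                            # assumed number of aggregation buckets
estimated = buckets * max_length_bytes   # one byte[maxLength] per bucket's producer
heap_bytes = 1 * 1024 ** 3               # the 1 GB heap the repro starts ES with

print(f"~{estimated / 2**20:.0f} MiB to allocate vs ~{heap_bytes / 2**20:.0f} MiB of heap")
# ~2905 MiB against a 1024 MiB heap, so even a modest bucket count exhausts memory.

# Sketch of the kind of pre-allocation check asked for above: fail the request
# instead of the node once the estimate crosses a threshold.
threshold = heap_bytes // 2
if estimated > threshold:
    raise MemoryError(f"aggregation would allocate ~{estimated:,} bytes of doc values")
```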

Steps to Reproduce

I have created a public repository with a repro here: https://github.com/peregrine-io/es8-oom/blob/main/README.md

To reproduce the error you will need Docker and Python installed, then follow these steps:
git clone https://github.com/peregrine-io/es8-oom.git
cd es8-oom/repro
docker-compose up es8
pip install -r requirements.txt
python es_oom_repro

Logs (if relevant)

This repro will take a few minutes to run as it indexes a very large document. After some time you will see an error from the script:
elastic_transport.ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')

The docker container will report an OOM:

repro-es8-1  | java.lang.OutOfMemoryError: Java heap space
repro-es8-1  | Dumping heap to data/java_pid67.hprof ...
repro-es8-1  | Terminating due to java.lang.OutOfMemoryError: Java heap space
repro-es8-1  |
repro-es8-1  | ERROR: Elasticsearch exited unexpectedly, with exit code 3
repro-es8-1 exited with code 3
@wyattberlinic wyattberlinic added >bug needs:triage Requires assignment of a team area label labels Oct 16, 2024
@iverase iverase added :Analytics/Geo Indexing, search aggregations of geo points and shapes :Analytics/Aggregations Aggregations and removed needs:triage Requires assignment of a team area label labels Oct 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 17, 2024
@iverase
Contributor

iverase commented Oct 17, 2024

I would have hoped that this change would have prevented the node from going OOM.

FWIW: the real fix to prevent this situation is here. It will avoid loading those binary doc values into memory.

@iverase
Contributor

iverase commented Oct 21, 2024

This OOM is also caused by this change, #98360, where we force the scorer to be executed using doc values. cc @kkrik-es

@iverase
Contributor

iverase commented Oct 21, 2024

This change should give some protection if the real memory circuit breaker is activated: #115181
