ES nodes OOM due to unbounded allocations in Lucene90DocValuesProducer #114941
Labels: :Analytics/Aggregations, :Analytics/Geo, >bug, Team:Analytics
Elasticsearch Version
8.15.1
Installed Plugins
None
Java Version
bundled
OS Version
Ubuntu 20.04.6 LTS
Problem Description
When we issue a search where the following two conditions are both true, ES OOMs: the search runs an aggregation that produces a large number of buckets, and the index contains very large geo_shape geometries.
In the referenced repro repository I index a single very large geometry and then run such an aggregation against it (see Steps to Reproduce below).
When I open the heap dump in VisualVM I see that most of the memory is taken by byte[] instances.
Digging in a bit further, I see that a large number of TwoPhaseFilterMatchingDisiWrapper objects have been initialized. Each of them holds a ~30 MB byte array that is all zeros, allocated inside the ShapeDocValuesQuery via a Lucene90DocValuesProducer. Notably, these byte arrays have 30,462,618 entries, which is the same number as val$entry.maxLength.
What I think is happening, for each bucket in the aggregation:
1. A ShapeDocValuesQuery is created.
2. During scoring, in either getContainsWeight or getStandardWeight, a Lucene90DocValuesProducer is created via https://github.com/elastic/elasticsearch/blob/v8.15.1/server/src/main/java/org/elasticsearch/lucene/spatial/ShapeDocValuesQuery.java#L124-L127
3. That producer then creates a byte array in one of these spots:
https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L766-L766
https://github.com/apache/lucene/blob/releases/lucene/9.11.1/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer.java#L807-L808
4. The size of that array is the entry's maxLength, i.e. the size of the largest value that might be read out of the index via the producer.
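To make the scale concrete, here is a back-of-the-envelope sketch (my own illustration, not Elasticsearch or Lucene code; the bucket count is an assumed example value, while maxLength is the value observed in the heap dump):

```java
// Rough illustration only: each bucket's ShapeDocValuesQuery ends up with its own
// doc-values reader, and each reader eagerly allocates a buffer of entry.maxLength bytes.
public class OomEstimate {
    public static void main(String[] args) {
        long maxLength = 30_462_618L; // observed val$entry.maxLength in the heap dump
        long buckets = 100;           // assumed bucket count, for illustration only
        long totalBytes = buckets * maxLength;
        System.out.printf("%d buckets * %,d bytes each = %.2f GiB allocated up front%n",
                buckets, maxLength, totalBytes / (1024.0 * 1024 * 1024));
    }
}
```

With only a hundred or so buckets that is already several GiB of zeroed buffers, before any document has actually been read.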
If you have (a large number of buckets) × (large geometries), you OOM. There is no protection (e.g. raising an error if too much memory would be used) or safety valve (e.g. allocating memory just in time, a few buckets at a time). I would be satisfied to see either of these added; a sketch of the former follows below. Even a check that raises an error when more than some threshold of bytes would be allocated would be sufficient, because I could then find the problem data and re-index it in another form. As it stands, I can't easily find which indices contain these large geometries.
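As a sketch of the kind of check I mean (purely illustrative; the class and method below are hypothetical and do not correspond to any existing Elasticsearch API), even a simple shared byte budget that fails the request instead of letting the JVM OOM would help:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard, not an Elasticsearch API: tracks how many bytes of doc-values
// buffers a single request has allocated and fails fast once a budget is exceeded.
final class DocValuesBufferBudget {
    private final long maxBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    DocValuesBufferBudget(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    byte[] allocate(int length) {
        long after = usedBytes.addAndGet(length);
        if (after > maxBytes) {
            usedBytes.addAndGet(-length); // roll back the reservation before failing
            throw new IllegalStateException("doc-values buffers would use " + after
                    + " bytes, which exceeds the budget of " + maxBytes + " bytes");
        }
        return new byte[length];
    }

    void release(int length) {
        usedBytes.addAndGet(-length);
    }
}
```

Elasticsearch already has request circuit breakers for similar situations, so wiring these allocations into an existing breaker would presumably be a more natural fix than a new class like the one above; the sketch is only meant to show how small the required check is.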
Steps to Reproduce
I have created a public repository with a repro here: https://github.com/peregrine-io/es8-oom/blob/main/README.md
To reproduce the error you will need Docker and Python installed, then follow these steps:
git clone https://github.com/peregrine-io/es8-oom.git
cd es8-oom/repro
docker-compose up es8
pip install -r requirements.txt
python es_oom_repro
Logs (if relevant)
This repro will take a few minutes to run as it indexes a very large document. After some time you will see an error from the script:
elastic_transport.ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: ProtocolError(('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')
The Docker container will report an OOM.