Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Got tons of error after performing a small insert. #363

Open
didip opened this issue Jul 23, 2020 · 24 comments
Open

Got tons of error after performing a small insert. #363

didip opened this issue Jul 23, 2020 · 24 comments

Comments

@didip
Copy link

didip commented Jul 23, 2020

Elassandra version: Latest

Plugins installed: []

JVM version (java -version): JDK 8

OS version (uname -a if on a Unix-like system): Ubuntu 18

Description of the problem including expected versus actual behavior:

After inserting a blob of data, the cluster is suddenly in a bad state and starts erroring out

at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

I don't have precise info yet which blob breaks the cluster. I am just curious if you have seen this error before.

Rolling restart fixes the cluster.

@vroyer
Copy link
Collaborator

vroyer commented Jul 23, 2020

So please attach information in the issue (log, version, etc..) !

@didip
Copy link
Author

didip commented Aug 8, 2020

A few more information. The beginning of that stack trace:

2020-08-07 23:47:20,907 WARN  [elasticsearch[10.0.15.0][http_server_worker][T#8]] Netty4HttpServerTransport.java:575 exceptionCaught caught exception while handling client http traffic, closing connection [id: 0x49c6de7e, L:/10.0.15.0:9200 - R:/10.0.12.124:31658]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

Another one:

2020-08-07 23:45:12,956 ERROR [Thread-66661] CassandraDaemon.java:241 uncaughtException Exception in thread Thread[Thread-66661,5,main]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)
	at java.util.Collections$UnmodifiableMap.entrySet(Collections.java:1483)

@didip
Copy link
Author

didip commented Aug 8, 2020

The 2 errors always happen back-to-back, elasticsearch first, then cassandra.

@vroyer
Copy link
Collaborator

vroyer commented Aug 8, 2020 via email

@didip
Copy link
Author

didip commented Aug 9, 2020

Sorry for the late reply, the thread stack size is already set to Xss1m

@vroyer
Copy link
Collaborator

vroyer commented Aug 9, 2020

Ok, could you provide more information to reproduce this issue and the full stacktrace from your logs ?

@didip
Copy link
Author

didip commented Aug 9, 2020 via email

@tengzhuofei
Copy link

    at java.lang.Thread.run(Thread.java:748)

2020-09-01 11:27:27,990 WARN [elasticsearch[172.16.2.30][http_server_worker][T#14]] org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:242) Unknown field [refresh] used in UpdateRequest which has no value and will not be accepted in future
2020-09-01 11:27:28,391 WARN [elasticsearch[172.16.2.30][masterService#updateTask][T#1]] org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:712) org.elasticsearch.cluster.service.MasterService$$Lambda$2393/2102567294@577db957
org.elasticsearch.ResourceAlreadyExistsException: index [log_icos_sit/IG7uGWGxRvudBgZoXwBslA] already exists
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validateIndexName(MetaDataCreateIndexService.java:153)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.validate(MetaDataCreateIndexService.java:795)
at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:307)
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:54)
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:704)
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:302)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:227)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:158)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-09-01 11:27:38,572 WARN [elasticsearch[172.16.2.30][http_server_worker][T#14]] org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:242) Unknown field [refresh] used in UpdateRequest which has no value and will not be accepted in future
2020-09-01 12:54:38,839 INFO [elasticsearch[172.16.2.30][generic][T#5]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [feature_1062466119_icos_sit][0] started
2020-09-01 12:55:22,880 INFO [elasticsearch[172.16.2.30][generic][T#11]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [twin_444743153_icos_sit][0] started
2020-09-01 12:55:24,885 INFO [elasticsearch[172.16.2.30][generic][T#8]] org.elassandra.shard.CassandraShardStateListener.afterIndexShardStarted(CassandraShardStateListener.java:69) shard [feature_444743153_icos_sit][0] started
2020-09-01 12:55:26,754 ERROR [GossipStage:1] org.apache.cassandra.service.CassandraDaemon$2.uncaughtException(CassandraDaemon.java:234) Exception in thread Thread[GossipStage:1,5,main]
java.lang.StackOverflowError: null
at java.util.HashMap.hash(HashMap.java:339)
at java.util.HashMap.containsKey(HashMap.java:596)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)
at java.util.Collections$UnmodifiableMap.containsKey(Collections.java:1454)

@vroyer
Copy link
Collaborator

vroyer commented Sep 2, 2020 via email

@tengzhuofei
Copy link

elassandra 6.8.4.2
helm chart deploy memory 26Gi JvmMaxHeap 13Gi JvmHeapNewSize 3Gi Xss 1m

@tengzhuofei
Copy link

image

@vroyer
Copy link
Collaborator

vroyer commented Sep 3, 2020 via email

@tengzhuofei
Copy link

tengzhuofei commented Sep 4, 2020 via email

@vroyer
Copy link
Collaborator

vroyer commented Sep 4, 2020 via email

@tengzhuofei
Copy link

tengzhuofei commented Sep 4, 2020 via email

@tengzhuofei
Copy link

tengzhuofei commented Sep 22, 2020 via email

@mgitman
Copy link

mgitman commented Nov 6, 2020

I want to chime in to report a similar occurrence to what didip and tengzhuofei had reported.

We're on Elassandra 6.8.4.7 running as a service on RHEL with Java 8 (openjdk version "1.8.0_265"). 32GB RAM. Our cluster has nine Elasticsearch-enabled nodes in its search datacenter. Interesting JVM settings:

-Xms16G 
-Xmx16G
-Xss1m

Here's how that DC looks in nodetool status (note the 16 tokens with vnodes):

Datacenter: us-west-2-search
============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens       Owns    Host ID                               Rack
UN  ...      14.14 GiB  16           ?       f307e7ce-542f-4bee-822d-fd942f8577e7  us-west-2a
UN  ...      16.76 GiB  16           ?       36508031-a73e-4818-b66b-6b2c2b64da05  us-west-2b
UN  ...       11.76 GiB  16           ?       a17061a5-532b-4613-966d-bf9171b70446  us-west-2b
UN  ...       13.2 GiB   16           ?       d89b9142-21e1-4dca-b0bf-6fdad21f361a  us-west-2c
UN  ...       15.98 GiB  16           ?       48798cb9-3515-4867-a5d0-70e4c27674f9  us-west-2c
UN  ...       18.95 GiB  16           ?       6d2a6fde-c6c3-4be4-90ea-4a182359d0fd  us-west-2c
UN  ...      14.61 GiB  16           ?       be147c81-1a23-4b58-b9dd-03993ef13e76  us-west-2a
UN  ...       19.93 GiB  16           ?       ca4c1231-62b2-4aa6-be0d-249fb3ef61cd  us-west-2b
UN  ...      19.66 GiB  16           ?       5d823437-cd50-4e94-951c-b3755bcca02a  us-west-2a

We noticed that all nine of our Elasticsearch endpoints became unresponsive. All nine exhibited a StackOverflowError with thousands upon thousands of lines of:

	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)

Just megabytes with that line in the system.log file.

Excerpt:

2020-11-04 11:55:11,568 ERROR [elasticsearch[11.11.11.11][http_server_worker][T#3]] ExceptionsHelper.java:281 lambda$maybeDieOnAnotherThread$2 fatal error
	at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$2(ExceptionsHelper.java:280)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:270)
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:176)
	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
	at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:856)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
	at java.lang.Thread.run(Thread.java:748)
2020-11-04 11:55:11,569 ERROR [Thread-61961] CassandraDaemon.java:241 uncaughtException Exception in thread Thread[Thread-61961,5,main]
java.lang.StackOverflowError: null
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)
	at java.util.Collections$UnmodifiableMap.get(Collections.java:1456)

Here's what else I noticed when the Elasticsearch endpoints were failing to respond to queries. With the HTTP API at port 9200, each index definition would show up, but not with the extra _search element in the URI. The Cassandra nodes would still show up as up (UN) in nodetool status.

Restarting the nodes made the immediate problem go away. Of course, the question is, how to prevent this.

In recent months we've upgraded to Elassandra 6.8.4.7 from 5.5.0.24. I don't recall seeing this error on 5.5.0.24. This particular keyspace was just upgraded a couple weeks ago to a 6.8.4.7 cluster, and we hadn't seen the problem on its old 5.5.0.24 cluster.

Also, I don't know whether this is an infinite loop or a sign of an incredibly nested call stack where it's legitimately one UnmodifiableMap.get after another.

My one colleague who has seen this before believes it has to do with the definition of a nested mapping for an index. Example:

                "foo_bar": {
                    "type": "nested",
                    "cql_collection": "singleton",
                    "cql_udt_name": "foo_bar",
                    "properties": {
                        "transaction_id": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "status": {
                            "type": "keyword",
                            "cql_collection": "singleton",
                            "index": "false"
                        },
                        "transaction_date": {
                            "type": "date",
                            "cql_collection": "singleton",
                            "index": "false"
                        }
                    }
                }

And if we changed this definition from nested to object, the problem would go away. I wish I could give a more definitive basis for this assertion of his.

@vroyer
Copy link
Collaborator

vroyer commented Nov 6, 2020 via email

@mgitman
Copy link

mgitman commented Nov 6, 2020

Vincent, thanks for getting back.

I'll take your word that there's a possibility this is not an infinite loop. We'll double the stack size and see what happens. (Even then, my anecdotal observation is that, even if this is not an infinite loop and just an incredibly deep stack, there's something exceedingly inefficient going on, whether on a write or a query.)

What's your take on the speculation that this has something to do with the "nested" mapping and that we should replace it with "object"? Beyond that, if you do have any insight as to when a field mapping should be defined as "nested" and when it should be defined as "object", I'd love to hear.

@tengzhuofei
Copy link

I adjusted Xss to 5m and still have this problem

@vroyer
Copy link
Collaborator

vroyer commented Nov 20, 2020 via email

@tengzhuofei
Copy link

tengzhuofei commented Nov 23, 2020 via email

@vroyer
Copy link
Collaborator

vroyer commented Nov 23, 2020 via email

@vroyer
Copy link
Collaborator

vroyer commented Jan 18, 2021

We found a possible cause for this issue, fixed in this commit.
Could you please upgrade to 6.8.4.13 and let us know if it fix your issue.
Thanks in advance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants