
Logstash fails to acquire lock on PQ lockfile (reentrant?) #10572

Open
yaauie opened this issue Mar 20, 2019 · 12 comments

@yaauie (Member) commented Mar 20, 2019

On Logstash versions 6.2.4 and 6.5.4, I have seen reports of the Logstash Persistent Queue failing to acquire a lock on a lockfile, with the message indicating that the lock is already held in the current JVM process.

[2019-03-19T11:04:26,987][ERROR][logstash.pipeline ] Logstash failed to create queue {:pipeline_id=>"[redacted]", "exception"=>"org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: [redacted]", "backtrace"=>["org/logstash/ackedqueue/Queue.java:258...

I do not presently have sufficient steps for consistent reproduction, but am working to chase those down.

In at least one case, the user was using multiple pipelines via pipelines.yml, the issue was occurring at startup, and the pipelines did not have any pipeline-specific queue settings.

@colinsurprenant (Contributor)

This might be related to #10715, which I will investigate.

@Aqualie commented Jan 15, 2020

Experiencing this problem with Logstash 7.5.1:

[2020-01-15T06:57:30,950][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /opt/elastic/logstash/queue/pfsense/.lock
	at org.logstash.ackedqueue.Queue.open(Queue.java:263) ~[logstash-core.jar:?]
	at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:101) ~[logstash-core.jar:?]
	at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:42) ~[logstash-core.jar:?]
	at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:39) ~[logstash-core.jar:?]
	at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:169) [logstash-core.jar:?]
	at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:813) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.6.0.jar:?]
	at opt.elastic.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/opt/elastic/logstash/logstash-core/lib/logstash/java_pipeline.rb:24) [jruby-complete-9.2.6.0.jar:?]
	at opt.elastic.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0(/opt/elastic/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:36) [jruby-complete-9.2.6.0.jar:?]
	at opt.elastic.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:325) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:136) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:77) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.runtime.Block.call(Block.java:124) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.RubyProc.call(RubyProc.java:286) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.RubyProc.call(RubyProc.java:270) [jruby-complete-9.2.6.0.jar:?]
	at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:105) [jruby-complete-9.2.6.0.jar:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
[2020-01-15T06:57:30,952][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:pfsense, :exception=>"Java::JavaLang::IllegalStateException", :message=>"org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /opt/elastic/logstash/queue/pfsense/.lock", :backtrace=>["org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:172)", "org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen)", "org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:813)", "org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/opt/elastic/logstash/logstash-core/lib/logstash/java_pipeline.rb:24)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0(/opt/elastic/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:36)", "opt.elastic.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:325)", "org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:77)", "org.jruby.runtime.Block.call(Block.java:124)", "org.jruby.RubyProc.call(RubyProc.java:286)", "org.jruby.RubyProc.call(RubyProc.java:270)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:105)", "java.base/java.lang.Thread.run(Thread.java:835)"]}
[2020-01-15T06:57:30,977][INFO ][logstash.javapipeline    ] Pipeline started {"pipeline.id"=>".monitoring-logstash"}
[2020-01-15T06:57:31,100][FATAL][logstash.runner          ] An unexpected error occurred! {:error=>#<LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<vmware>`>, :backtrace=>["org/logstash/execution/ConvergeResultExt.java:109:in `create'", "org/logstash/execution/ConvergeResultExt.java:37:in `add'", "/opt/elastic/logstash/logstash-core/lib/logstash/agent.rb:338:in `block in converge_state'"]}
[2020-01-15T06:57:31,243][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

I tried stopping the Logstash process, deleting the .lock file manually, and starting Logstash again, but that did not help; I am still seeing the error message.
Running bin/pqcheck against the directory:

bin/pqcheck queue/pfsense
Java HotSpot(TM) 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Checking queue dir: queue/pfsense
checkpoint.head, fully-acked: YES, page.465 size: 67108864
pageNum=465, firstUnackedPageNum=465, firstUnackedSeqNum=65672857, minSeqNum=65601460, elementCount=71397, isFullyAcked=yes

My queue/beats pipeline had failed for whatever reason roughly 11 hours earlier and started queueing everything to disk; Elasticsearch cluster health is GREEN. Still looking into that pipeline issue.

@colinsurprenant (Contributor)

@Aqualie this is likely not a PQ issue per se, but rather the result of a pipeline exception/error that does not correctly close the PQ (and thus does not release the lock); upon restarting the failed pipeline, it complains about failing to acquire the lock.

Can you check the logs from before this problem to see if you can trace any errors that might help explain why the lock was not released? (Look for pipeline reload failures, pipeline errors/exceptions, or any warning/error logs for that matter.)

Also, I have seen some cases where this problem occurred because of permission problems on the PQ data files. Can you look for IO/file/permission errors? Also double-check that the file permissions in the PQ data directory are OK (a quick ownership check is sketched below).
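
As a minimal sketch of such a check (assuming Logstash runs as the logstash user and the queue lives under the default path.data; the path here is hypothetical), the following lists any queue files not owned by that user:

find /var/lib/logstash/queue -not -user logstash -ls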

@colinsurprenant (Contributor)

There is a related issue specific to the permission problem: #10715.

@Harsimran1702

Facing this error on 7.6.2:

[2020-05-26T21:57:29,449][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by another program on lock path: /app/queue/main/.lock
        at org.logstash.ackedqueue.Queue.open(Queue.java:263) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:100) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:42) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:39) ~[logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:176) [logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:831) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.java_pipeline.RUBY$method$initialize$0(/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:28) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:107) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:140) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:386) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:184) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyClass.newInstance(RubyClass.java:894) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrOneOrNBlock.call(JavaMethod.java:349) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:375) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:174) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:316) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:143) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:185) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.BlockBody.yield(BlockBody.java:116) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.Block.yield(Block.java:170) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyArray.each(RubyArray.java:1814) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:555) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:332) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:86) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:93) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:537) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:361) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:86) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:73) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb:67) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0$__VARARGS__(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuper(IRRuntimeHelpers.java:1169) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuperSplatArgs(IRRuntimeHelpers.java:1156) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InstanceSuperInvokeSite.invoke(InstanceSuperInvokeSite.java:39) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.runner.RUBY$method$run$0(/usr/share/logstash/logstash-core/lib/logstash/runner.rb:242) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.logstash_minus_core.lib.logstash.runner.RUBY$method$run$0$__VARARGS__(/usr/share/logstash/logstash-core/lib/logstash/runner.rb) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.clamp_minus_0_dot_6_dot_5.lib.clamp.command.RUBY$method$run$0(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/clamp-0.6.5/lib/clamp/command.rb:132) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:84) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:183) [jruby-complete-9.2.9.0.jar:?]
        at usr.share.logstash.lib.bootstrap.environment.RUBY$script(/usr/share/logstash/lib/bootstrap/environment.rb:73) [jruby-complete-9.2.9.0.jar:?]
        at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) [?:?]
        at org.jruby.ir.Compiler$1.load(Compiler.java:89) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runScript(Ruby.java:857) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runNormally(Ruby.java:780) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runNormally(Ruby.java:798) [jruby-complete-9.2.9.0.jar:?]
        at org.jruby.Ruby.runFromMain(Ruby.java:610) [jruby-complete-9.2.9.0.jar:?]
        at org.logstash.Logstash.run(Logstash.java:113) [logstash-core.jar:?]
        at org.logstash.Logstash.main(Logstash.java:47) [logstash-core.jar:?]
[2020-05-26T21:57:29,492][FATAL][logstash.runner          ] The given configuration is invalid. Reason: org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by another program on lock path: /app/queue/main/.lock
[2020-05-26T21:57:29,497][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

This is observed when running a configuration check while the Logstash process is already running:

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash --config.test_and_exit
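
A configuration check started this way shares the running instance's data directory, and therefore its PQ lock path, which matches the "Lock held by another program" message above. One possible workaround, sketched here with a hypothetical scratch directory, is to point the check at a separate --path.data so it does not contend for the running instance's lock:

sudo -u logstash /usr/share/logstash/bin/logstash --path.settings /etc/logstash --path.data /tmp/logstash-config-check --config.test_and_exit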

@rahuldeyfb33 commented Sep 8, 2021

Does somebody have a solution to this error? I am getting the same one:
The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /usr/share/logstash/queue/main/.lock

I am running dockerized Logstash, version 7.4.2.

@yaauie (Member, Author) commented Sep 8, 2021

There are two ways that the PQ can fail to acquire a lock:

  1. A single Logstash is trying to obtain a lock that it already holds; the error message will include:

    cause: Lock held by this virtual machine

    These cases indicate that a single Logstash process acquired the lock on the queue, encountered some error and failed to fully release the lock on the queue, and is now attempting to re-acquire the lock.

    Most cases that I have seen stem from insufficient disk capacity when the pipeline is first starting (logs contain Not enough free disk space available to allocate persisted queue with further details). I have seen other reports that permissions issues in the queue directory can cause a partially-acquired lock to enter this state.

    In any case, it is only after encountering the first issue and making a second attempt to acquire the lock that we fail to do so. Internally, Logstash is in a state where it cannot recover and must be restarted.

    These cases can be resolved by shutting down the Logstash process, resolving the causing issue (such as ensuring that there is sufficient free disk space on the volume holding the queue, or fixing permissions), and starting Logstash up again. The other logs prior to this message will be instrumental in determining the root cause.

  2. Multiple Logstash processes are trying to obtain a lock on a single queue; the error message will include:

    cause: Lock held by another program

    These cases are caused by one Logstash process attempting to acquire a lock on the queue when another Logstash process already holds it. Only one Logstash instance can safely access any particular queue at any given time.

    This case can be solved by stopping the other Logstash process before starting a new Logstash process directed at the same queue. (The JDK locking behavior behind both error messages is sketched below.)
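
For illustration only, a minimal sketch, not Logstash's actual locking code: the two error messages map onto the behavior of java.nio file locks, where FileChannel.tryLock() throws OverlappingFileLockException when the same JVM already holds a lock on the file, and returns null when another process holds it. The lock path below is hypothetical.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockProbe {
    public static void main(String[] args) throws IOException {
        Path lockPath = Path.of("/tmp/example-queue/.lock"); // hypothetical lock path
        try (FileChannel channel = FileChannel.open(lockPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                // OS-level lock held by a different process:
                // corresponds to "Lock held by another program on lock path: ..."
                System.out.println("lock held by another program");
            } else {
                // acquired; a clean shutdown would release it via lock.release()
                System.out.println("lock acquired");
            }
        } catch (OverlappingFileLockException e) {
            // this JVM already holds (or never released) a lock on the file:
            // corresponds to "Lock held by this virtual machine on lock path: ..."
            System.out.println("lock held by this virtual machine");
        }
    }
}

This also suggests why deleting the .lock file does not help in the "held by this virtual machine" case (as reported above): the stale lock lives in the JVM's in-process state attached to the still-open channel, not on disk, so only restarting the process clears it.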

@rahuldeyfb33
@yaauie Thanks for your reply. To elaborate further: I am running dockerized Logstash to ingest logs from different sources, parse/filter them, and then store them in Elasticsearch. The Docker container runs in a Linux virtual machine. Recently all of the Logstash pipelines have been failing continuously with this error:

bidos-logstash_1 | warning: thread "Converge PipelineAction::Create<webscada_stream.conf>" terminated with exception (report_on_exception is true):
bidos-logstash_1 | LogStash::Error: Don't know how to handle Java::JavaLang::IllegalStateException for PipelineAction::Create<webscada_stream.conf>
bidos-logstash_1 |   create at org/logstash/execution/ConvergeResultExt.java:109
bidos-logstash_1 |   add at org/logstash/execution/ConvergeResultExt.java:37
bidos-logstash_1 |   converge_state at /usr/share/logstash/logstash-core/lib/logstash/agent.rb:339
bidos-logstash_1 | [2021-09-02T12:11:01,107][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
bidos-logstash_1 | org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /usr/share/logstash/data/queue/webscada_stream.conf/.lock

"usr/share/logstash/data/queue" is not a directory present in the virtual machine. It is only present inside the docker container. So we cannot change the permissions or remove files from this directory.

The Docker image layers look like this:

groupadd --gid 1000 logstash && adduser --uid 1000 --gid 1000 --home-dir /usr/share/logstash --no-create-home logstash

curl -Lo - http://localhost:8000/logstash-7.4.2.tar.gz | tar zxf - -C /usr/share && mv /usr/share/logstash-7.4.2 /usr/share/logstash && chown --recursive logstash:logstash /usr/share/logstash/ && chown -R logstash:root /usr/share/logstash && chmod -R g=u /usr/share/logstash && find /usr/share/logstash -type d -exec chmod g+s {} \; && ln -s /usr/share/logstash /opt/logstash

WORKDIR /usr/share/logstash

I have deleted all the Docker images, containers, and volumes using sudo docker system prune --all and then pulled the Logstash image again. After rebuilding the images and starting the containers, the lock error still persists.

I read in a thread that this might happen due to an orphaned process running in the virtual machine, so I rebooted the virtual machine to get rid of any orphaned processes, but the error still persists. Can somebody explain what is going wrong here? We have been trying to solve this error for the last 10 days without success.

@rahuldeyfb33

More details: I am running multiple pipelines, which are defined in pipelines.yml, and the queue type is defined as "persisted" (a hypothetical configuration along these lines is sketched below).
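
For context, a hypothetical pipelines.yml matching this description (pipeline IDs and paths are illustrative, loosely following the lock path quoted above):

- pipeline.id: webscada_stream.conf
  path.config: "/usr/share/logstash/pipeline/webscada_stream.conf"
  queue.type: persisted
- pipeline.id: another_pipeline
  path.config: "/usr/share/logstash/pipeline/another_pipeline.conf"
  queue.type: persisted

With queue.type: persisted set per pipeline (or globally in logstash.yml), each pipeline gets its own queue directory, and thus its own .lock file, under path.data/queue/<pipeline.id>.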

@sasikiranvaddi commented May 21, 2024

We observed a LockException; please find more details in the topic linked below [1].
Is there any fix available, or work in progress, for the issue mentioned?

@breml (Contributor) commented Jul 1, 2024

@yaauie For some time now we have been observing that, every so often (roughly once a week), our Logstash processing comes to a complete stop and we have to restart our Logstash instances. We have tried to make sense of the logs, but so far we have not found the root cause of the issue. That said, in some cases we observed the following log entries around the time the blocking happened:

[2024-07-01T07:58:01,454][ERROR][org.logstash.execution.AbstractPipelineExt] Logstash failed to create queue.
org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine on lock path: /apps/elk/logstash/queue/process_log_formats_k8s/.lock
        at org.logstash.ackedqueue.Queue.open(Queue.java:176) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyAckedQueueExt.open(JRubyAckedQueueExt.java:127) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.ext.JRubyWrappedAckedQueueExt.initialize(JRubyWrappedAckedQueueExt.java:64) ~[logstash-core.jar:?]
        at org.logstash.ackedqueue.QueueFactoryExt.create(QueueFactoryExt.java:84) ~[logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt.openQueue(AbstractPipelineExt.java:257) [logstash-core.jar:?]
        at org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$openQueue.call(AbstractPipelineExt$INVOKER$i$0$0$openQueue.gen) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:456) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:195) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:350) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:128) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:115) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90) [jruby.jar:?]
        at org.jruby.RubyClass.newInstance(RubyClass.java:931) [jruby.jar:?]
        at org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90) [jruby.jar:?]
        at org.jruby.ir.instructions.CallBase.interpret(CallBase.java:548) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:367) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:88) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225) [jruby.jar:?]
        at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:228) [jruby.jar:?]
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:291) [jruby.jar:?]
        at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:328) [jruby.jar:?]
        at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66) [jruby.jar:?]
        at org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116) [jruby.jar:?]
        at org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136) [jruby.jar:?]
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:66) [jruby.jar:?]
        at org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58) [jruby.jar:?]
        at org.jruby.runtime.Block.call(Block.java:144) [jruby.jar:?]
        at org.jruby.RubyProc.call(RubyProc.java:352) [jruby.jar:?]
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:111) [jruby.jar:?]
        at java.lang.Thread.run(Thread.java:840) [?:?]

I looked into these cases a little more and found the following log entries, which seem to mark the start of the locking issue:

[2024-07-01T07:57:21,365][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:process_log_formats_k8s, :exception=>"KeyError", :message=>"key not found", :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:325:in `raise_fetch_no_key'", "/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:186:in `fetch'", "/usr/share/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:202:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:85:in `block in fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in `block in fetch_or_store'", "org/jruby/ext/thread/Mutex.java:171:in `synchronize'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in `register?'", "org/logstash/execution/AbstractPipelineExt.java:540:in `initialize_flow_metrics'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in `start'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in `block in execute'", "/usr/share/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:386:in `block in converge_state'"]}
[2024-06-14T06:59:47,501][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:process_log_formats_csconnector, :exception=>"LocalJumpError", :message=>"yield called out of block", :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:84:in `block in fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in `block in fetch_or_store'", "org/jruby/ext/thread/Mutex.java:171:in `synchronize'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in `fetch_or_store'", "/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in `register?'", "org/logstash/execution/AbstractPipelineExt.java:540:in `initialize_flow_metrics'", "/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in `start'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in `block in execute'", "/usr/share/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in `create_pipeline'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in`execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:386:in `block in converge_state'"]}
[2023-11-21T01:56:18,955][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:heartbeat, :exception=>"Java::JavaLang::NullPointerException", :message=>"Cannot invoke \"org.logstash.instrument.metrics.Metric.getValue()\" because \"this.numeratorMetric\" is null", :backtrace=>["org.logstash.instrument.metrics.BaseFlowMetric.doCapture(BaseFlowMetric.java:82)", "org.logstash.instrument.metrics.BaseFlowMetric.<init>(BaseFlowMetric.java:66)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:70)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:80)", "org.logstash.instrument.metrics.ExtendedFlowMetric.<init>(ExtendedFlowMetric.java:62)", "org.logstash.instrument.metrics.FlowMetric.create(FlowMetric.java:39)", "org.logstash.execution.AbstractPipelineExt.createFlowMetric(AbstractPipelineExt.java:553)", "org.logstash.execution.AbstractPipelineExt.initializePluginThroughputFlowMetric(AbstractPipelineExt.java:650)", "org.logstash.execution.AbstractPipelineExt.initializePluginFlowMetrics(AbstractPipelineExt.java:631)", "org.logstash.execution.AbstractPipelineExt.initializeFlowMetrics(AbstractPipelineExt.java:540)", "org.logstash.execution.AbstractPipelineExt$INVOKER$i$0$0$initializeFlowMetrics.call(AbstractPipelineExt$INVOKER$i$0$0$initializeFlowMetrics.gen)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:351)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:144)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:345)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:80)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:164)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:151)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:210)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:351)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:144)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:345)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)", "org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:76)", "org.jruby.runtime.Block.yieldSpecific(Block.java:157)", "org.jruby.ir.runtime.IRRuntimeHelpers.yieldSpecific(IRRuntimeHelpers.java:498)", "org.jruby.ir.instructions.YieldInstr.interpret(YieldInstr.java:76)", "org.jruby.ir.interpreter.StartupInterpreterEngine.processOtherOp(StartupInterpreterEngine.java:176)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:104)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:128)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:115)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:85)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:94)", "org.jruby.ir.instructions.CallBase.interpret(CallBase.java:546)", 
"org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:361)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:92)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225)", "org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:226)", "org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:393)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:206)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:325)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)", "org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:66)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)", "org.jruby.runtime.Block.call(Block.java:143)", "org.jruby.RubyProc.call(RubyProc.java:309)", "org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:107)", "java.base/java.lang.Thread.run(Thread.java:833)"]}

It does not always affect the same pipeline, but the common denominator is that there was some sort of issue while initially creating the pipeline (LogStash::PipelineAction::Create). It also looks like the reason the pipeline creation fails is not always the same.

In my case it is always Lock held by this virtual machine, and the reason has never been a full disk.

As of now we use Logstash 8.12.1 on Linux (rpm installation, no containers). That said, we had already observed the issue with Logstash 8.7.x, and we performed the update to 8.12.1 in the hope that it would resolve the issue; unfortunately, it did not.

Our Logstash configuration makes use of multiple pipelines as well as pipeline-to-pipeline communication. In total there are 14 pipelines, and most of them are part of a bigger processing system where we use the distributor as well as the collector pattern.

I hope the information provided helps in debugging this issue.

@breml (Contributor) commented Jul 1, 2024

I made an additional observation regarding the log entries mentioned in my previous post.

When Logstash is in the error state described above, where a pipeline cannot be created because the lock is not available (Logstash failed to create queue. org.logstash.LockException: The queue failed to obtain exclusive access, cause: Lock held by this virtual machine), the recreation of the pipeline is retried roughly 40 times per minute. What I have now observed is that a new instance of the "plain" codec is created on every failed retry. I can see this because I have a cronjob that saves the result of http://localhost:9600/_node/stats/pipelines?pretty to disk every minute; there I can see that the array in .pipelines.<pipeline_name>.plugins.codecs grows by 40 entries per minute while Logstash is in this failure state. Once the number of codecs reaches somewhere around 4200, Logstash stops processing completely.

I created the cronjob to get better insight into what is happening before Logstash stops processing events (a minimal sketch of it follows).
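
A minimal sketch of such a cronjob (the output path is hypothetical; the endpoint is the one quoted above, and % must be escaped in crontab):

* * * * * curl -s 'http://localhost:9600/_node/stats/pipelines?pretty' > /var/tmp/ls-pipeline-stats.$(date +\%M).json

The per-pipeline codec counts can then be read from a saved snapshot with, e.g., jq '.pipelines | map_values(.plugins.codecs | length)' /var/tmp/ls-pipeline-stats.07.json.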
