Agents do not survive Jenkins restart #691
Comments
Thanks for including this. We have some longer-running pipelines ourselves, and we are migrating Jenkins into our cluster. I was wondering about this just today.
@613andred Sounds like a great proposal. Is there anything we can do to help out?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this issue is still affecting you, just comment with any updates and we'll keep it open. Thank you for your contributions. |
Unfortunately I have been very busy as of late, so I have not been able to put together a PR. Ideally I would like to get input from the maintainers before I go ahead.
@613andred this would be a great addition; we suffer from this issue as well, and getting remote agents to reconnect with new tokens is less than a great strategy. Any input from the maintainers would go a long way here.
I like this proposal, in particular the concept of performing the restore before Jenkins starts. If you want to try making a PR to address this, that would be great; at the moment the priority is to fix the current code and release a new version.
I do not understand this part: are we talking about Kubernetes jobs or VM/EC2/static Jenkins nodes?
Coming back to this issue, I believe we should do something about it: apart from Kubernetes or SSH agents, it is not possible to create any permanent inbound agent, because the inbound agent secrets are regenerated after a restart (this is also what prevents the Kubernetes agents from reconnecting).
After investigating, from what I understood, when you create a Jenkins node the node secret is stored in the
I tried a couple of things; the first was to add a persistent layer mounted in
These are the options that come to mind to address this issue (in my personal order of preference; other opinions and options are welcome):
1. Find a way (Groovy/API script) to write to the DefaultConfidentialStore. The pro is that it follows the spirit of this project (no persistent layer at all, everything as code); the con is that it rescues only the permanent non-SSH nodes, not the Kubernetes job agents (which is fine by me: the k8s job agents should live and die with the master, imho).
2. This issue's proposal: add an init container to restore the backup before the Jenkins Java process starts, and add to the backup script the
3. Add a persistent volume (optional) for
Any opinions, thoughts, or criticism? Thanks
What exactly did you try? When you refer to agents, which kind of agents are we talking about?
@brokenpip3 I am referring to the k8s pods that spin up dynamically when a job is run. I added a sleep command to a job and ran it, which brought up a k8s pod as a slave agent, then killed the master pod. Once the master came back up, the Jenkins slave pod that was spun up by the job went into the NotReady state indefinitely, trying to connect to the master. I had to delete the pod manually. I tried to mount a PV at the /var/lib/jenkins/secrets path via a PVC, but the slave pod was still unable to connect back to the master once the master came back up.
OK, but did you do this by scaling down the operator, saving the Jenkins pod as code, modifying it, and deleting/applying it again?
restart jenkins
@brokenpip3 Yes, I modified the Jenkins master and recreated it. I tested it the way you mentioned. It seems the value returned by jenkins.model.Jenkins.getInstance().getComputer("").getJnlpMac() is removed after the Jenkins master restarts. That is why the slave pod is unable to communicate with the master.
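For context on why that value can change, here is a minimal Python sketch; it is a model, not Jenkins code. As far as I understand, Jenkins derives each inbound agent's secret as an HMAC-SHA256 of the agent name, keyed by a value kept in the controller's secrets store, so a regenerated key yields a different secret for the same agent name. The function name `agent_secret` and the sample keys are made up for illustration.

```python
import hashlib
import hmac


def agent_secret(controller_key: bytes, agent_name: str) -> str:
    """Model of a per-agent secret: hex HMAC-SHA256 of the agent name.

    Assumption: this mirrors how the controller derives the value an
    inbound agent must present; the real key lives in the secrets store.
    """
    return hmac.new(controller_key, agent_name.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    # Hypothetical keys standing in for the stored key before and after
    # a restart that lost the secrets directory.
    key_before_restart = b"\x01" * 32
    key_after_restart = b"\x02" * 32

    before = agent_secret(key_before_restart, "agent-1")
    after = agent_secret(key_after_restart, "agent-1")

    # Same agent name, different key: the agent's old secret no longer matches.
    print("secret still valid:", hmac.compare_digest(before, after))
```

In this model the secret survives a restart only if the key does, which is why the secrets directory matters for agents that already hold a secret.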
So the issue is different in your tests: as you can see, the second Groovy query failed because the node does not exist in the Jenkins configuration, not because the persistent layer does not have the right secret.
@brokenpip3 Thanks for looking into this! I understand your point about the lifecycle of the slave pod being tied to the master. That made sense. But the issue is that once the master comes back up, the existing slave pods go into the NotReady state indefinitely. It would be great if the operator checked for orphan pods and terminated them once the master comes back up. Otherwise it becomes a manual step.
Yep, that makes sense 100%.
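One way such an orphan check could look, as a rough Python sketch and not operator code: given pod objects in the shape of the Kubernetes Pod JSON, pick agent pods that were created before the master's current start time and are not Ready. The label `jenkins=slave`, the function names, and the idea of comparing against the master start time are all assumptions for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a Kubernetes-style UTC timestamp such as 2022-01-01T11:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_ready(pod):
    """Return True if the pod reports a Ready condition with status "True"."""
    for condition in pod.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    return False


def find_orphan_agent_pods(pods, master_started_at, agent_label=("jenkins", "slave")):
    """Return names of agent pods that predate the master start and are not Ready.

    agent_label is a hypothetical (key, value) pair identifying agent pods.
    """
    key, value = agent_label
    orphans = []
    for pod in pods:
        metadata = pod.get("metadata", {})
        if metadata.get("labels", {}).get(key) != value:
            continue  # not an agent pod
        created = parse_timestamp(metadata["creationTimestamp"])
        if created < master_started_at and not is_ready(pod):
            orphans.append(metadata["name"])
    return orphans
```

The function only selects candidates; deleting them (and deciding how long to wait before doing so) would be a separate step.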
Describe the bug
Agents are currently unable to reconnect to Jenkins if it is restarted; this is problematic for any long-running pipeline.
This functionality is critical to our use of Jenkins.
This issue has previously been raised in #542.
There are several problems causing this issue:
To Reproduce
Create a long-running pipeline, e.g. a pipeline with a step sh 'sleep 300'.
Restart Jenkins by deleting the pod.
Additional information
Agent logs:
Jenkins logs:
Workaround
This workaround is not great as it breaks the principles of the Jenkins operator.
/var/lib/jenkins/nodes
/var/lib/jenkins/secrets
/var/lib/jenkins/jobs
Proposed fix
nodes and secrets state to backup
If you accept the proposal, I am able to provide a PR to resolve this issue.
I believe this solution would also address #607 and #679
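To make the idea concrete, here is a minimal Python sketch of backing up and restoring only the nodes and secrets directories. It is an illustration under assumptions, not the operator's backup script: the function names, the archive format, and running the restore before Jenkins starts (for example from an init container) are all hypothetical.

```python
import tarfile
from pathlib import Path

# Directories under the Jenkins home that hold agent definitions and secrets.
STATE_DIRS = ("nodes", "secrets")


def backup_state(jenkins_home, archive):
    """Write the nodes and secrets directories into a gzip-compressed tar archive."""
    home = Path(jenkins_home)
    with tarfile.open(archive, "w:gz") as tar:
        for name in STATE_DIRS:
            directory = home / name
            if directory.is_dir():
                # Store paths relative to the Jenkins home so the restore is relocatable.
                tar.add(directory, arcname=name)


def restore_state(archive, jenkins_home):
    """Unpack a previously written archive into the Jenkins home.

    Intended to run before the Jenkins process starts, so that Jenkins
    finds the existing nodes and secrets instead of generating new ones.
    """
    home = Path(jenkins_home)
    home.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(home)
```

A call such as `backup_state("/var/lib/jenkins", "/backup/state.tar.gz")` followed later by `restore_state("/backup/state.tar.gz", "/var/lib/jenkins")` shows the intended order; the backup path here is made up.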