Restore getting stuck for 10 min at different intervals due to "Failed to discover group" and "I0308: client-side throttling" #7516
Comments
Thanks for reporting this issue!
@mayankagg9722 This is the metrics-server APIService in my test environment:

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    components.gke.io/component-name: metrics-server
    components.gke.io/component-version: 0.6.3-gke.1
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
```
We are getting a partially failed restore, as there are some errors too.
@blackpiglet, yes, we understand as well that this occurs when the API service is not configured properly. But our main question is why the restore is impacted, i.e., why we see a 10-minute delay in processing restore resources in Velero. We basically wanted to understand where exactly Velero sleeps for 10 minutes, which is causing this issue. We have already tried to go through the code pieces below to understand the 10-minute sleep and the throttling, but couldn't figure out the root cause; it would be helpful if you could help us. Code:
This is critical for us, and we want to understand it so we can apply a custom patch to avoid these issues during restore.
IMO, the 10-minute timeout is not set for the service discovery. (See lines 75 to 96 in 6c0cb4b.)
Thanks @blackpiglet for sharing this. I think you rightly pointed this out: the "tigera-system" namespace is mentioned in the includedNamespaces list in the restore payload, which might have actually caused this sleep and halt in the restore. I could see the total in the logs for this cluster. Sample logs: log link in Velero.
Also, I could see the error during restore for fetching it (log link). I have also added up the 10-minute sleeps at the different intervals that I observed during the restore, and the sum also came to around the total restore duration. Restore code:
PR #7424 needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet. EnsureNamespaceExistsAndIsReady problem: during restore, for every resource within a namespace, we call the check that waits for the namespace to exist (polling for up to 10 minutes). For instance, if we have 100 resources in a namespace that itself has been in a Terminating state for a very long time, the restore flow gets stuck for each resource in that namespace, which greatly increases the total restore time.
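For illustration, here is a minimal Go sketch of the pattern being described. This is not Velero's actual code; all names (ensureNamespaceReady, restoreItems) are hypothetical. It shows how a per-item namespace-readiness poll with a 10-minute timeout can stall a restore loop when the namespace is stuck in Terminating:

```go
// Sketch only: NOT Velero's code. Illustrates how a per-item namespace
// readiness poll with a 10-minute timeout can stall a restore loop when the
// namespace is stuck in Terminating. All names here are hypothetical.
package restoresketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

type item struct{ Namespace, Name string }

// ensureNamespaceReady polls until the namespace exists and is not in the
// Terminating phase, or until the timeout expires.
func ensureNamespaceReady(ctx context.Context, c kubernetes.Interface, ns string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			got, err := c.CoreV1().Namespaces().Get(ctx, ns, metav1.GetOptions{})
			if err != nil {
				return false, nil // not found yet; keep polling
			}
			return got.Status.Phase != corev1.NamespaceTerminating, nil
		})
}

// restoreItems shows the problem: the wait runs once per item, so a namespace
// stuck in Terminating can cost up to `timeout` for every item inside it.
func restoreItems(ctx context.Context, c kubernetes.Interface, items []item) {
	for _, it := range items {
		if err := ensureNamespaceReady(ctx, c, it.Namespace, 10*time.Minute); err != nil {
			fmt.Printf("skipping %s/%s: %v\n", it.Namespace, it.Name, err)
			continue
		}
		// ... restore the item here ...
	}
}
```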
I understand that it is inconvenient for the user to wait a long time to find out that the backup cannot proceed, but I think the current behavior is correct for this scenario. What do you think? Any suggestions?
Instead of waiting for the target namespace to be ready for each and every resource individually, we could perhaps iterate over all the needed namespaces once up front and then restore each resource. We could also decrease the polling time for checking the namespace state, which would help the overall restore duration if many such namespaces exist.
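A minimal sketch of that idea, in the same hypothetical package as the earlier sketch and reusing its ensureNamespaceReady helper: check each namespace at most once, cache the result, and skip the per-item wait.

```go
// Sketch of the proposal, not Velero code: check each required namespace at
// most once, cache the result, and skip the per-item wait entirely.
func restoreItemsWithNamespaceCache(ctx context.Context, c kubernetes.Interface, items []item, timeout time.Duration) {
	ready := map[string]bool{} // namespace name -> readiness, computed at most once

	for _, it := range items {
		ok, seen := ready[it.Namespace]
		if !seen {
			ok = ensureNamespaceReady(ctx, c, it.Namespace, timeout) == nil
			ready[it.Namespace] = ok
		}
		if !ok {
			fmt.Printf("skipping %s/%s: namespace not ready\n", it.Namespace, it.Name)
			continue
		}
		// ... restore the item here ...
	}
}
```

With this shape, a namespace stuck in Terminating costs the timeout once rather than once per item, and lowering the poll interval or timeout becomes an independent knob.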
The proposal is reasonable, but Velero doesn't understand resource dependencies. (velero/pkg/cmd/server/server.go, line 600 in 3c704ba.)
As a result, Velero doesn't know whether the restored resource's dependent namespace already exists.
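For context, Velero orders the resources it restores by a priority list (the referenced server.go line appears to point at that list). The sketch below uses an illustrative, hypothetical ordering, not Velero's actual list, to show why a kind-level ordering alone doesn't express per-item namespace dependencies:

```go
// Sketch of priority-ordered restore (illustrative, NOT Velero's actual list
// or code): resource kinds appearing earlier are restored first, but the
// ordering says nothing about whether a given item's namespace is ready yet.
package prioritysketch

import "sort"

var illustrativePriorities = []string{
	"customresourcedefinitions",
	"namespaces",
	"persistentvolumes",
	"persistentvolumeclaims",
	"secrets",
	"configmaps",
	"serviceaccounts",
}

// priorityIndex returns the position of a resource kind in the list above;
// unknown kinds sort last.
func priorityIndex(resource string) int {
	for i, r := range illustrativePriorities {
		if r == resource {
			return i
		}
	}
	return len(illustrativePriorities)
}

// sortByPriority orders resource kinds by the illustrative priority list.
func sortByPriority(resources []string) {
	sort.SliceStable(resources, func(i, j int) bool {
		return priorityIndex(resources[i]) < priorityIndex(resources[j])
	})
}
```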
@blackpiglet, can we decrease the polling timeout for terminating namespaces, which is currently 10 minutes?
Making it configurable seems feasible.
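A hedged sketch of what "configurable" could look like as a server flag. The flag name and default below are assumptions for illustration only, not a statement about Velero's CLI; check `velero server --help` for what your version actually exposes.

```go
// Sketch only: a duration flag for the terminating-namespace wait.
// The flag name "terminating-resource-timeout" and the 10m default are
// assumptions for illustration, not a description of Velero's real flags.
package flagsketch

import (
	"time"

	"github.com/spf13/pflag"
)

// bindRestoreTimeoutFlag registers the hypothetical timeout flag on a flag set.
func bindRestoreTimeoutFlag(fs *pflag.FlagSet, timeout *time.Duration) {
	fs.DurationVar(timeout, "terminating-resource-timeout", 10*time.Minute,
		"how long to wait for a terminating namespace or resource during restore before giving up")
}
```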
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity.
What steps did you take and what happened:
Triggered a restore, and the restore operation got stuck for 10 minutes at different intervals while in progress, which caused it to take longer than expected to complete.
What did you expect to happen:
The restore flow should not be halted due to the "Failed to discover group" and "client-side throttling" errors.
Failures observed while Velero was stuck:
I0308 15:14:27.441940 1 request.go:690] Waited for 1.039794137s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.eastus2.azmk8s.io:443/apis/image-assurance.operator.tigera.io/v1?timeout=32s :: {}
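For background, the "client-side throttling" message comes from client-go's default rate limiter (QPS 5, burst 10 on rest.Config). The sketch below only illustrates that mechanism by raising the limits on a plain client-go config; whether and how Velero exposes equivalent knobs (e.g. server flags) should be checked against the docs for your version.

```go
// Sketch only: shows where the "Waited for ... due to client-side throttling"
// message originates (client-go's QPS/Burst rate limiter) and how the limits
// can be raised on a rest.Config. Values are illustrative.
package throttlesketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClientWithHigherLimits() (kubernetes.Interface, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	cfg.QPS = 100   // client-go default is 5
	cfg.Burst = 100 // client-go default is 10
	return kubernetes.NewForConfig(cfg)
}
```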
The following information will help us better understand what's going on:
We have observed the following logs in the cluster from Velero:
The restore proceeds properly in the cluster until it has restored 244 items.
Time:
2024-03-08 14:49:30.6996640
Log:
Restored 244 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"default","namespace":"tigera-dex","progress":"","resource":"serviceaccounts"}
Then we observe failures, and the restore operation gets stuck for 10 minutes.
The next restore log line in Velero came after 10 minutes:
Time:
2024-03-08 15:09:30.8773140
Log:
Restored 245 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"progress":"","resource":"serviceaccounts","name":"default","namespace":"twistlock"}
We have observed similar behavior at various intervals, for example the following.
The restore proceeded properly until:
Time:
2024-03-08 15:09:43.178358
Log:
Restored 382 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"tigera-pull-secret","namespace":"tigera-prometheus","progress":"","resource":"secrets"}
--- sample failure logs from the period during the restore when Velero was stuck.
The next restore log line in Velero came after 10 minutes:
Time:
2024-03-08 15:29:43.2419230
Log:
Restored 383 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"resource":"secrets","name":"akeyless-customer-fragment","namespace":"akeyless","progress":""}
Additional Information captured:
Code that could have impacted this:
Discovery helper goroutine:
https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/server/server.go#L520
Restore calling discovery refresh:
velero/pkg/restore/restore.go, line 564 in 79e9e31
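To make the "Failed to discover group" part concrete, here is a minimal sketch (not Velero's code) of a periodic discovery refresh loop. When an aggregated APIService such as a metrics.k8s.io or tigera.io entry has no healthy backend, each refresh has to probe it and the refresh returns an error, which is the kind of place where such messages surface.

```go
// Sketch only: a periodic API discovery refresh loop built from client-go
// primitives, illustrating why a broken aggregated APIService makes every
// refresh report a discovery error.
package discoverysketch

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/discovery"
)

func refreshLoop(dc discovery.DiscoveryInterface, stop <-chan struct{}) {
	wait.Until(func() {
		if _, _, err := dc.ServerGroupsAndResources(); err != nil {
			// Partial discovery failures are still returned as an error here,
			// even when most groups were discovered successfully.
			log.Printf("discovery refresh: %v", err)
		}
	}, 5*time.Minute, stop)
}
```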
Environment:
Velero version (use velero version): v1.11.0
velero-plugin-for-csi: v0.5.1
Kubernetes version: AKS 1.27.9
Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202402.07.0
Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.