
Restore getting stuck for 10 min at different intervals due to "Failed to discover group" and "I0308: client-side throttling" #7516

Closed
mayankagg9722 opened this issue Mar 13, 2024 · 14 comments
@mayankagg9722
Contributor

mayankagg9722 commented Mar 13, 2024

What steps did you take and what happened:
Triggered a restore, and the restore operation got stuck for 10 minutes at a time at different intervals while in progress, which made it take much longer than expected to complete.

What did you expect to happen:
The restore flow should not stall because of the "Failed to discover group" and "client-side throttling" errors.

Failures observed while Velero was stuck:

  • Failed to discover group: external.metrics.k8s.io/v1beta1
  • Various GET requests being throttled in the cluster (see the sketch just below this list). Sample: I0308 15:14:27.441940 1 request.go:690] Waited for 1.039794137s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.eastus2.azmk8s.io:443/apis/image-assurance.operator.tigera.io/v1?timeout=32s :: {}
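
For context: the "client-side throttling" waits come from client-go's own per-client rate limiter (the QPS and Burst fields on rest.Config), not from API Priority and Fairness on the server. Below is a minimal sketch of where those knobs live; the values and kubeconfig path are illustrative and are not Velero's actual settings.

package main

import (
    "fmt"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Illustrative kubeconfig path; adjust for the environment.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }

    // client-go enforces a client-side rate limit per rest.Config. When more
    // requests queue up than QPS/Burst allow, client-go's request.go logs
    // "Waited for ... due to client-side throttling" and delays the call.
    config.QPS = 100   // hypothetical value
    config.Burst = 200 // hypothetical value

    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    fmt.Printf("client ready (%T), QPS=%v Burst=%v\n", clientset, config.QPS, config.Burst)
}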

The following information will help us better understand what's going on:

We observed the following Velero logs in the cluster:

The restore was progressing normally until it had restored 244 items.
Time: 2024-03-08 14:49:30.6996640
Log: Restored 244 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"default","namespace":"tigera-dex","progress":"","resource":"serviceaccounts"}

Then we started observing failures, and the restore operation stalled for 10 minutes.

I0308 14:53:24.851065 1 request.go:690] Waited for 1.042889102s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/maps.k8s.elastic.co/v1alpha1?timeout=32s :: {}

I0308 14:53:34.897298 1 request.go:690] Waited for 3.996218636s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/metrics.k8s.io/v1beta1?timeout=32s :: {}

Failed to discover group: external.metrics.k8s.io/v1beta1 :: {"error.message":"the server is currently unable to handle the request"}

The next restore progress log from Velero came 10 minutes later.
Time: 2024-03-08 15:09:30.8773140
Log: Restored 245 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"progress":"","resource":"serviceaccounts","name":"default","namespace":"twistlock"}

We observed the same pattern at various intervals, for example:

The restore progressed normally until:
Time: 2024-03-08 15:09:43.178358
Log: Restored 382 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"name":"tigera-pull-secret","namespace":"tigera-prometheus","progress":"","resource":"secrets"}

--- sample failure logs from the window in which the restore was stuck:

I0308 15:14:27.441940 1 request.go:690] Waited for 1.039794137s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/image-assurance.operator.tigera.io/v1?timeout=32s :: {}

I0308 15:14:37.489143 1 request.go:690] Waited for 3.997041158s due to client-side throttling, not priority and fairness, request: GET:https://<akscluster>.privatelink.eastus2.azmk8s.io:443/apis/azmon.app.monitoring/v1?timeout=32s :: {}

Failed to discover group: external.metrics.k8s.io/v1beta1 :: {"error.message":"the server is currently unable to handle the request"}

The next restore progress log from Velero came 10 minutes later.

Time: 2024-03-08 15:29:43.2419230
Log: Restored 383 items out of an estimated total of 1885 (estimate will change throughout the restore) :: {"resource":"secrets","name":"akeyless-customer-fragment","namespace":"akeyless","progress":""}

Additional Information captured:

Code that could be involved in this:

Discovery helper goroutine:
https://github.com/vmware-tanzu/velero/blob/main/pkg/cmd/server/server.go#L520

Restore calling discovery refresh:

if createdOrUpdatedCRDs {
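
For reference, our rough understanding of the pattern behind the linked code: the server refreshes API discovery on a fixed interval in a goroutine, and that refresh is where the "Failed to discover group" errors surface when an aggregated API is unavailable. The sketch below is only an illustration of that pattern; the interface, names, and interval are assumptions, not a copy of Velero's code.

package main

import (
    "context"
    "log"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

// DiscoveryHelper is a stand-in for Velero's discovery helper, which exposes
// a Refresh method that re-resolves the cluster's API groups and resources.
type DiscoveryHelper interface {
    Refresh() error
}

// runDiscoveryRefresh refreshes discovery every interval until ctx is done.
// When an aggregated API service (e.g. external.metrics.k8s.io/v1beta1) is
// not being served, the refresh reports "Failed to discover group" errors.
func runDiscoveryRefresh(ctx context.Context, helper DiscoveryHelper, interval time.Duration) {
    go wait.Until(func() {
        if err := helper.Refresh(); err != nil {
            log.Printf("error refreshing discovery: %v", err)
        }
    }, interval, ctx.Done())
}

// fakeHelper lets the sketch run standalone.
type fakeHelper struct{}

func (fakeHelper) Refresh() error { return nil }

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    runDiscoveryRefresh(ctx, fakeHelper{}, time.Second)
    <-ctx.Done()
}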

Environment:
Velero version (use velero version): v1.11.0
velero-plugin-for-csi: v0.5.1
Kubernetes version: AKS 1.27.9
Cloud provider or hardware configuration: Azure
OS (e.g. from /etc/os-release): AKSUbuntu-2204gen2containerd-202402.07.0

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@danfengliu
Contributor

Thanks for reporting this issue!
I see your point that it took a very long time to get a refresh result. We will investigate and propose a solution for the throttling scenario. In addition, did your restore end up Failed / PartiallyFailed due to the throttling error?

@blackpiglet
Contributor

@mayankagg9722
This is related to your k8s environment's metrics server.
The metrics server is registered with the kube-apiserver as an aggregated API service.

Below is the metrics-server APIService from my test environment.
If the pod behind the service is not ready, the kube-apiserver cannot serve the corresponding API.
Please ensure the metrics-server workload and its related resources are ready before running the restore.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    components.gke.io/component-name: metrics-server
    components.gke.io/component-version: 0.6.3-gke.1
    components.gke.io/layer: addon
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
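
If it helps, you can check which aggregated API groups are failing discovery before triggering the restore by asking the discovery client directly. This is a small illustrative sketch (the kubeconfig path is a placeholder; this code is not part of Velero):

package main

import (
    "fmt"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Placeholder kubeconfig path; adjust for your environment.
    config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
    if err != nil {
        panic(err)
    }

    dc, err := discovery.NewDiscoveryClientForConfig(config)
    if err != nil {
        panic(err)
    }

    // ServerGroupsAndResources returns a partial result plus an aggregated
    // error when some groups (e.g. external.metrics.k8s.io/v1beta1 backed by
    // an unavailable aggregated service) cannot be discovered.
    _, _, err = dc.ServerGroupsAndResources()
    if gdErr, ok := err.(*discovery.ErrGroupDiscoveryFailed); ok {
        for gv, groupErr := range gdErr.Groups {
            fmt.Printf("failed to discover group %s: %v\n", gv, groupErr)
        }
    } else if err != nil {
        panic(err)
    }
}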

@mayankagg9722
Contributor Author

Thanks for reporting this issue! I see your point that it took a very long time to get a refresh result. We will investigate and propose a solution for the throttling scenario. In addition, did your restore end up Failed / PartiallyFailed due to the throttling error?

We are getting a PartiallyFailed restore, as there are some errors too.
Sample: Velero restore error: error getting namespace tigera-system: timed out waiting for the condition :: {}

@mayankagg9722
Contributor Author

@mayankagg9722 This is related to your k8s environment's metrics server. The metrics server is registered with the kube-apiserver as an aggregated API service.

If the pod behind the service is not ready, the kube-apiserver cannot serve the corresponding API. Please ensure the metrics-server workload and its related resources are ready before running the restore.


@blackpiglet, yes, we also understand that this occurs when the API service is not configured properly.

But our main question is why the restore is impacted, i.e. why we see a 10-minute delay in processing restore resources in Velero.

1. We want to understand where exactly Velero sleeps for 10 minutes, since that is what is causing this issue.
2. How many retries are currently configured for such failures (i.e. group discovery), and what is the root cause of the throttling in this cluster?
3. Can we change the Velero discovery goroutine (which runs every 5 minutes) to avoid throttling in such cases?

We have already tried to go through the code pieces below to understand the 10-minute sleep and the throttling, but we couldn't figure out the root cause; it would be helpful if you could point us in the right direction.

Code:

This is critical for us, and we want to understand it and apply a custom patch to avoid these issues during restore.

@blackpiglet
Contributor

IMO, the 10-minute timeout does not come from service discovery.
It is the namespace restore-wait timeout, defaultResourceTerminatingTimeout.

err := wait.PollUntilContextTimeout(context.Background(), time.Second, timeout, true, func(ctx context.Context) (bool, error) {
    clusterNS, err := client.Get(ctx, namespace.Name, metav1.GetOptions{})
    if apierrors.IsNotFound(err) {
        // Namespace isn't in cluster, we're good to create.
        return true, nil
    }
    if err != nil {
        // Return the err and exit the loop.
        return true, err
    }
    if clusterNS != nil && (clusterNS.GetDeletionTimestamp() != nil || clusterNS.Status.Phase == corev1api.NamespaceTerminating) {
        // Marked for deletion, keep waiting
        return false, nil
    }
    // clusterNS found, is not nil, and not marked for deletion, therefore we shouldn't create it.
    ready = true
    return true, nil
})

@mayankagg9722
Contributor Author

IMO, the 10-minute timeout does not come from service discovery. It is the namespace restore-wait timeout, defaultResourceTerminatingTimeout.


Thanks @blackpiglet for sharing this. I think you pointed this out rightly: the "tigera-system" namespace is in the restore payload's included-namespaces list, which might actually be what caused the sleep and the stall during the restore.

I can see in the logs for this cluster that a total of 10 resource types came up for restore in the "tigera-system" namespace.

Sample Logs: Log link in velero

1. Resource 'services' will be restored into namespace 'tigera-system' :: {}
2. Resource 'serviceaccounts' will be restored into namespace 'tigera-system' :: {}
3. Resource 'secrets' will be restored into namespace 'tigera-system' :: {}
... and so on for the remaining resource types ...
9. Resource 'deployments.apps' will be restored into namespace 'tigera-system' :: {}
10. Resource 'configmaps' will be restored into namespace 'tigera-system' :: {}

Also, I can see errors during the restore while fetching the "tigera-system" namespace.
To note: this error appeared 17 times, which works out to a total of 170 minutes of delay spent polling for the namespace to reach the ready state (via the PollUntilContextTimeout used in the code).

Log Link
Error Log: Velero restore error: error getting namespace tigera-system: timed out waiting for the condition :: {}

I have also added up the 10-minute stalls I observed at different intervals during the restore, and they come to a total of around 160 minutes. Note: the delays are multiples of 10 minutes, corresponding to the number of resource types that came up for restore.

[screenshot: breakdown of the observed 10-minute stalls during the restore]

Restore code:
EnsureNamespaceExistsAndIsReady is called in Velero's restore.go in two places, which caused the 10-minute delays when restoring those 10 different resource types in the cluster:

  1. processSelectedResource
  2. restoreItem

@mayankagg9722
Contributor Author

mayankagg9722 commented Mar 27, 2024

PR #7424 needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.

EnsureNamespaceExistsAndIsReady

Problem: During restore, for every resource within a namespace, we call the check that waits for the namespace to exist (polling for up to 10 minutes). For instance, if we have 100 resources in a namespace that stays in the Terminating state for a long time, the restore flow gets stuck and the overall restore time increases, because the wait is repeated for each resource in the same namespace.

@blackpiglet
Contributor

PR #7424 needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.

EnsureNamespaceExistsAndIsReady

Problem: During restore, for every resource within a namespace, we call the check that waits for the namespace to exist (polling for up to 10 minutes). For instance, if we have 100 resources in a namespace that stays in the Terminating state for a long time, the restore flow gets stuck and the overall restore time increases, because the wait is repeated for each resource in the same namespace.

I understand it is inconvenient for the user to wait a long time only to find out that the restore cannot proceed, but I think the current behavior is correct for this scenario.
Velero cannot control when the terminating namespace is deleted, so it seems reasonable to keep waiting for it.
A possible improvement is making the log here more meaningful, so it tells the user the restore is hanging because of the terminating namespace.

What do you think? Any suggestions?

@mayankagg9722
Contributor Author

PR #7424 needs to take care of this scenario as well. Tagging @kaovilai @blackpiglet.
EnsureNamespaceExistsAndIsReady
Problem: During restore, for every resource within a namespace, we call the check that waits for the namespace to exist (polling for up to 10 minutes). For instance, if we have 100 resources in a namespace that stays in the Terminating state for a long time, the restore flow gets stuck and the overall restore time increases, because the wait is repeated for each resource in the same namespace.

I understand it is inconvenient for the user to wait a long time only to find out that the restore cannot proceed, but I think the current behavior is correct for this scenario. Velero cannot control when the terminating namespace is deleted, so it seems reasonable to keep waiting for it. A possible improvement is making the log here more meaningful, so it tells the user the restore is hanging because of the terminating namespace.

What do you think? Any suggestions?

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate over all the needed namespaces once up front and then restore each resource. Also, we could decrease the polling timeout for checking the namespace state, which would help the overall restore duration when many such namespaces exist.
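
A rough sketch of what I mean by checking each namespace only once; the types and helper names here are illustrative, not Velero's actual code:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// namespaceReadyFn stands in for a check like EnsureNamespaceExistsAndIsReady:
// it waits (up to timeout) for a namespace to exist and not be terminating.
type namespaceReadyFn func(ctx context.Context, namespace string, timeout time.Duration) (bool, error)

// readinessCache remembers the outcome per namespace so the expensive wait
// runs at most once per namespace instead of once per restored resource.
type readinessCache struct {
    mu      sync.Mutex
    results map[string]bool
    check   namespaceReadyFn
    timeout time.Duration
}

func newReadinessCache(check namespaceReadyFn, timeout time.Duration) *readinessCache {
    return &readinessCache{results: map[string]bool{}, check: check, timeout: timeout}
}

// Ready returns the cached result if the namespace was already checked,
// otherwise performs the wait once and caches the outcome.
func (c *readinessCache) Ready(ctx context.Context, namespace string) (bool, error) {
    c.mu.Lock()
    if ready, ok := c.results[namespace]; ok {
        c.mu.Unlock()
        return ready, nil
    }
    c.mu.Unlock()

    ready, err := c.check(ctx, namespace, c.timeout)
    if err != nil {
        return false, err
    }

    c.mu.Lock()
    c.results[namespace] = ready
    c.mu.Unlock()
    return ready, nil
}

func main() {
    // Fake check standing in for the real polling wait.
    cache := newReadinessCache(func(ctx context.Context, ns string, timeout time.Duration) (bool, error) {
        time.Sleep(100 * time.Millisecond)
        return true, nil
    }, 10*time.Minute)

    // Only the first call pays the wait; later calls for the same namespace
    // hit the cache, so 100 resources no longer mean 100 waits.
    for i := 0; i < 3; i++ {
        ready, _ := cache.Ready(context.Background(), "tigera-system")
        fmt.Println("tigera-system ready:", ready)
    }
}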

@blackpiglet
Contributor

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate over all the needed namespaces once up front and then restore each resource. Also, we could decrease the polling timeout for checking the namespace state, which would help the overall restore duration when many such namespaces exist.

The proposal is reasonable, but Velero doesn't understand resource dependencies.
During restoration, Velero creates resources according to a default priority list.

var defaultRestorePriorities = restore.Priorities{

As a result, Velero doesn't know whether a restored resource's dependent namespace already exists.
There has been some discussion about how to make Velero aware of resource dependencies, but that is still not in the planned releases.
#7199

@mayankagg9722
Contributor Author

Instead of waiting for the target namespace to be ready for each and every resource individually, we should perhaps iterate over all the needed namespaces once up front and then restore each resource. Also, we could decrease the polling timeout for checking the namespace state, which would help the overall restore duration when many such namespaces exist.

The proposal is reasonable, but Velero doesn't understand resource dependencies. During restoration, Velero creates resources according to a default priority list.

var defaultRestorePriorities = restore.Priorities{

As a result, Velero doesn't know whether a restored resource's dependent namespace already exists. There has been some discussion about how to make Velero aware of resource dependencies, but that is still not in the planned releases. #7199

@blackpiglet, can we reduce the polling timeout for terminating namespaces, which is currently 10 minutes?

@blackpiglet
Contributor

Making it configurable seems feasible.
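
Something like the sketch below is what I have in mind: surface the namespace-wait timeout as a configurable duration instead of a hard-coded value. The flag name, default, and wiring here are purely illustrative, not an actual patch:

package main

import (
    "flag"
    "fmt"
    "time"
)

func main() {
    // Hypothetical flag: expose the terminating-namespace wait as a
    // configurable duration instead of a hard-coded 10 minutes.
    resourceTerminatingTimeout := flag.Duration(
        "terminating-resource-timeout",
        10*time.Minute,
        "how long to wait for a terminating namespace during restore before giving up",
    )
    flag.Parse()

    // The parsed value would be passed down as the timeout argument of the
    // PollUntilContextTimeout call shown earlier in the thread.
    fmt.Printf("namespace wait timeout: %s\n", *resourceTerminatingTimeout)
}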


github-actions bot commented Jun 8, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

github-actions bot added the staled label Jun 8, 2024

This issue was closed because it has been stalled for 14 days with no activity.

github-actions bot closed this as not planned Jun 25, 2024