
enable-gang-scheduling not work #1264

Closed
chenwenjun-github opened this issue Mar 28, 2021 · 14 comments

chenwenjun-github commented Mar 28, 2021

What happened:
I run kubeflow/tf-operator with --enable-gang-scheduling=true, and I have installed Volcano.
When I submit two identical TFJobs at the same time and the namespace ResourceQuota is only large enough for one of them, both TFJobs still start to occupy resources and cause a deadlock.

I also tried submitting a single TFJob that requests more resources than the namespace ResourceQuota allows:
[screenshot]
Although the PodGroup shows that resources are not sufficient for all pods of the TFJob, several pods are still created and remain in Pending status.

What you expected to happen:
When resources are not sufficient for all pods of the TFJob, none of its pods should be created, because a Pending pod still counts against the namespace ResourceQuota.

How to reproduce it (as minimally and precisely as possible):
As described above.
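
For illustration, a minimal sketch (in Go, using the Kubernetes API types) of the kind of namespace quota involved; the names and numbers here are made up, not the actual quota from this cluster:

package gang

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: a namespace quota sized so that one TFJob fits, but two
// identical TFJobs submitted together cannot both be admitted.
var exampleQuota = &corev1.ResourceQuota{
	ObjectMeta: metav1.ObjectMeta{Name: "tfjob-quota", Namespace: "training"},
	Spec: corev1.ResourceQuotaSpec{
		Hard: corev1.ResourceList{
			corev1.ResourceLimitsCPU:    resource.MustParse("8"),
			corev1.ResourceLimitsMemory: resource.MustParse("16Gi"),
		},
	},
}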

Anything else we need to know?:
The tf-operator version is also recent.
I use the default queue and have many namespaces. Does this have an impact?

Environment:

  • Volcano Version: 1.2.0
  • Kubernetes version (use kubectl version): 1.13
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@gaocegege

Are you using volcano?

@chenwenjun-github

Are you using volcano?

Yes, I am.

@gaocegege

OK, I will have a look ASAP.

/cc @Thor-wl @shinytang6


Thor-wl commented Mar 29, 2021

OK, I will have a look ASAP.

/cc @Thor-wl @shinytang6

Yes, we've noticed this issue and are analyzing it.

@chenwenjun-github

Some additional information:

  1. The tf-operator version is tag 1.0.1-rc.5, and the container parameters I set are as follows:
    [screenshot]
    The Volcano version is tag v1.2.0.
  2. I use the default queue that Volcano creates by default, and I have many namespaces. Do I need to create a separate queue for each namespace?

@chenwenjun-github

[screenshot]
I think this code has a problem: when EnableGangScheduling is true, it should first check whether the PodGroup status is Running before calling ReconcilePods.
I'm not sure about this idea. What do you think, @Thor-wl?
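
A rough sketch of the guard I have in mind (this is not the actual tf-operator code; the type and helper below are hypothetical stand-ins, and the PodGroup phase constant comes from Volcano's scheduling v1beta1 API, whose import path may differ by version):

package gang

import (
	"fmt"

	v1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// gangGuard is a hypothetical stand-in for the operator's controller config.
type gangGuard struct {
	enableGangScheduling bool
}

// shouldReconcilePods returns true only when gang scheduling is disabled or
// the job's PodGroup has been admitted (phase Running), so that the operator
// does not create Pending pods that already count against the ResourceQuota.
func (g gangGuard) shouldReconcilePods(pg *v1beta1.PodGroup) (bool, error) {
	if !g.enableGangScheduling {
		return true, nil
	}
	if pg == nil {
		return false, fmt.Errorf("podgroup not created yet")
	}
	if pg.Status.Phase != v1beta1.PodGroupRunning {
		// Gang condition not satisfied yet: skip pod creation for now.
		return false, nil
	}
	return true, nil
}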


Thor-wl commented Mar 29, 2021

[screenshot]
I think this code has a problem: when EnableGangScheduling is true, it should first check whether the PodGroup status is Running before calling ReconcilePods.
I'm not sure about this idea. What do you think, @Thor-wl?

Sorry, I haven't reviewed that code yet. Let me take a look.


shinytang6 commented Mar 29, 2021

Some additional information:

  1. The tf-operator version is tag 1.0.1-rc.5, and the container parameters I set are as follows:
    [screenshot]
    The Volcano version is tag v1.2.0.
  2. I use the default queue that Volcano creates by default, and I have many namespaces. Do I need to create a separate queue for each namespace?

If you only use the default queue and the proportion plugin is enabled, all PodGroups will be enqueued regardless of their minResources declaration, because the queue's capability is empty; so it is possible that some pods end up Pending in the actual allocate action.

// pkg/scheduler/plugins/proportion/proportion.go#L254
ssn.AddJobEnqueueableFn(pp.Name(), func(obj interface{}) int {
	job := obj.(*api.JobInfo)
	queueID := job.Queue
	attr := pp.queueOpts[queueID]
	queue := ssn.Queues[queueID]

	// If no capability is set, always enqueue the job.
	if len(queue.Queue.Spec.Capability) == 0 {
		klog.V(4).Infof("Capability of queue <%s> was not set, allow job <%s/%s> to Inqueue.",
			queue.Name, job.Namespace, job.Name)
		return util.Permit
	}

	if job.PodGroup.Spec.MinResources == nil {
		return util.Permit
	}
	minReq := job.GetMinResources()
	// The queue resource quota limit has not reached
	inqueue := minReq.Add(attr.allocated).Add(attr.inqueue).LessEqual(api.NewResource(queue.Queue.Spec.Capability))
	if inqueue {
		attr.inqueue.Add(job.GetMinResources())
		return util.Permit
	}
	return util.Reject
})

P.S.: even if your queue declares a capability, the PodGroup created by tf-operator is currently still problematic because its minResources field is nil; as you can see in the code above, such a job will always be enqueued.

Similar issue: kubeflow/common#112
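
To make the two short-circuits in the snippet above concrete, here is a sketch (illustrative names and values; the volcano.sh/apis import path is an assumption and may differ by release) of a Queue that declares a capability and a PodGroup whose minResources is populated — both are needed before the enqueue check can ever return Reject:

package gang

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	v1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// A queue with an explicit capability: with this set, the enqueueable check
// above no longer short-circuits to Permit on the empty-capability branch.
var exampleQueue = &v1beta1.Queue{
	ObjectMeta: metav1.ObjectMeta{Name: "training"},
	Spec: v1beta1.QueueSpec{
		Weight: 1,
		Capability: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("8"),
			corev1.ResourceMemory: resource.MustParse("16Gi"),
		},
	},
}

// A PodGroup whose MinResources is populated: without this, the nil check
// above also short-circuits to Permit and the job is always enqueued.
var examplePodGroup = &v1beta1.PodGroup{
	ObjectMeta: metav1.ObjectMeta{Name: "tfjob-example", Namespace: "training"},
	Spec: v1beta1.PodGroupSpec{
		Queue:     "training",
		MinMember: 3,
		MinResources: &corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("6"),
			corev1.ResourceMemory: resource.MustParse("12Gi"),
		},
	},
}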


chenwenjun-github commented Mar 29, 2021

@shinytang6 Thank you. When testing gang scheduling, I also see that the PodGroup status is always Inqueue, even when the namespace ResourceQuota is not sufficient for the TFJob.
But some pods become Pending. One possible reason, I think, is that tf-operator's pod handling and the Volcano scheduler overlap.
What I mean is that tf-operator shouldn't create the pods when enable-gang-scheduling=true; the pods should be created by Volcano's allocate module.
I say this because I see this log in the tf-operator pod when enable-gang-scheduling=true:
[screenshot]

@shinytang6

@shinytang6 Thank you. When testing gang scheduling, I also see that the PodGroup status is always Inqueue, even when the namespace ResourceQuota is not sufficient for the TFJob.
But some pods become Pending. One possible reason, I think, is that tf-operator's pod handling and the Volcano scheduler overlap.
What I mean is that tf-operator shouldn't create the pods when enable-gang-scheduling=true; the pods should be created by Volcano's allocate module.

Hmm, I don't think so. The tasks scheduled by Volcano are generated from pods whose schedulerName is volcano. What allocate does is select an appropriate node and bind the pod to it, rather than creating the pod.
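
As a sketch of what that looks like from the scheduler's side (names are illustrative; the group-name annotation key is my assumption of how the pod is associated with its PodGroup and may differ by Volcano version):

package gang

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a pod as tf-operator creates it: the operator creates the Pod
// object; Volcano only picks it up (via schedulerName) and later binds it to
// a node in the allocate action.
var exampleWorkerPod = &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "tfjob-example-worker-0",
		Namespace: "training",
		Annotations: map[string]string{
			// Assumed annotation linking the pod to its PodGroup.
			"scheduling.k8s.io/group-name": "tfjob-example",
		},
	},
	Spec: corev1.PodSpec{
		SchedulerName: "volcano", // hand scheduling to Volcano instead of the default scheduler
		Containers: []corev1.Container{{
			Name:  "tensorflow",
			Image: "tensorflow/tensorflow:2.4.1", // illustrative image
		}},
	},
}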

@chenwenjun-github

chenwenjun-github commented Mar 30, 2021

Hmm, I don't think so. The tasks scheduled by Volcano are generated from pods whose schedulerName is volcano. What allocate does is select an appropriate node and bind the pod to it, rather than creating the pod.

@shinytang6 You are right. Volcano doesn't create the pods; the pods and the PodGroup are created by tf-operator and scheduled by Volcano.

And I think I understand my problem now.
My TFJob runs in a Kubernetes namespace for which I created a limited ResourceQuota.
Although a Pending pod has not been bound to a node, it still occupies the namespace ResourceQuota.
This causes a deadlock when I submit two identical TFJobs at the same time and the ResourceQuota is only large enough for one of them: some pods of each TFJob can't be created.
Volcano's allocate checks whether the pod count >= minMember; otherwise it doesn't allocate, so the pods that have already been created stay Pending (see the sketch after the screenshots below).
[screenshot]
[screenshot]
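
A simplified sketch of that gang check (not the actual Volcano source; it just mirrors the condition described above with a stand-in job type):

package gang

// jobView is a simplified stand-in for Volcano's scheduler-side job info.
type jobView struct {
	MinAvailable int32 // minMember from the PodGroup
	ReadyTaskNum int32 // tasks that are bound, running, or already allocated
}

// ready mirrors the gang condition described above: allocate only proceeds to
// bind pods once at least minMember tasks can run; otherwise the pods the
// operator already created stay Pending while still holding ResourceQuota.
func (j jobView) ready() bool {
	return j.ReadyTaskNum >= j.MinAvailable
}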


Thor-wl commented Mar 30, 2021

@shinytang6 and @jiangkaihua discussed this issue last night, and they are working on it.
/cc @shinytang6 @jiangkaihua

@google-oss-robot

@Thor-wl: GitHub didn't allow me to assign the following users: jiangkaihua.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

@shinytang6 and @jiangkaihua discussed this issue last night, and they are working on it.
/assign @shinytang6 @jiangkaihua

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


stale bot commented Jun 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Jul 8, 2021