
enable-gang-scheduling not work #1264

Closed
chenwenjun-github opened this issue Mar 28, 2021 · 14 comments

chenwenjun-github commented Mar 28, 2021

What happened:
I run kubeflow/tf-operator with --enable-gang-scheduling=true, and I have installed Volcano.
When I submit two identical TFJobs at the same time and the namespace ResourceQuota is only large enough for one of them, both TFJobs still start to occupy resources and cause a deadlock.

I also tried submitting a single TFJob that requests more resources than the namespace ResourceQuota allows:
[screenshot]
Although the PodGroup shows that resources are not sufficient for all pods of the TFJob, several pods are still created and remain in Pending status.

What you expected to happen:
When resources are not sufficient for all pods of the TFJob, none of its pods should be created, because a Pending pod still counts against the namespace ResourceQuota.

How to reproduce it (as minimally and precisely as possible):
As described above.
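
For illustration, a minimal sketch (in Go, using the Kubernetes API types) of the kind of namespace quota involved; the names and numbers here are made up, not the actual quota from this cluster:

package gang

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: a namespace quota sized so that one TFJob fits, but two
// identical TFJobs submitted together cannot both be admitted.
var exampleQuota = &corev1.ResourceQuota{
	ObjectMeta: metav1.ObjectMeta{Name: "tfjob-quota", Namespace: "training"},
	Spec: corev1.ResourceQuotaSpec{
		Hard: corev1.ResourceList{
			corev1.ResourceLimitsCPU:    resource.MustParse("8"),
			corev1.ResourceLimitsMemory: resource.MustParse("16Gi"),
		},
	},
}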

Anything else we need to know?:
The tf-operator version is also recent.
I use the default queue and have many namespaces. Does this have an impact?

Environment:

  • Volcano Version: 1.2.0
  • Kubernetes version (use kubectl version): 1.13
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@gaocegege

Are you using volcano?

@chenwenjun-github

Are you using volcano?

Yes, I am.

@gaocegege

OK, I will have a look ASAP.

/cc @Thor-wl @shinytang6


Thor-wl commented Mar 29, 2021

OK, I will have a look ASAP.

/cc @Thor-wl @shinytang6

Yes, we've noticed this issue and are analyzing it.

@chenwenjun-github

Some additional information:

  1. The tf-operator version is tag 1.0.1-rc.5, and the container parameters I set are as follows:
    [screenshot]
    The Volcano version is tag v1.2.0.
  2. I use the default queue that Volcano creates by default, and I have many namespaces. Do I need to create a separate queue for each namespace?

@chenwenjun-github

[screenshot]
I think this code has a problem: when EnableGangScheduling is true, it should first check whether the PodGroup status is Running before calling ReconcilePods.
I'm not sure about this idea. What do you think, @Thor-wl?
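
A rough sketch of the guard I have in mind (this is not the actual tf-operator code; the type and helper below are hypothetical stand-ins, and the PodGroup phase constant comes from Volcano's scheduling v1beta1 API, whose import path may differ by version):

package gang

import (
	"fmt"

	v1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// gangGuard is a hypothetical stand-in for the operator's controller config.
type gangGuard struct {
	enableGangScheduling bool
}

// shouldReconcilePods returns true only when gang scheduling is disabled or
// the job's PodGroup has been admitted (phase Running), so that the operator
// does not create Pending pods that already count against the ResourceQuota.
func (g gangGuard) shouldReconcilePods(pg *v1beta1.PodGroup) (bool, error) {
	if !g.enableGangScheduling {
		return true, nil
	}
	if pg == nil {
		return false, fmt.Errorf("podgroup not created yet")
	}
	if pg.Status.Phase != v1beta1.PodGroupRunning {
		// Gang condition not satisfied yet: skip pod creation for now.
		return false, nil
	}
	return true, nil
}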


Thor-wl commented Mar 29, 2021

[screenshot]
I think this code has a problem: when EnableGangScheduling is true, it should first check whether the PodGroup status is Running before calling ReconcilePods.
I'm not sure about this idea. What do you think, @Thor-wl?

Sorry, I haven't reviewed that code yet. Let me take a look.


shinytang6 commented Mar 29, 2021

Some additional information:

  1. The tf-operator version is tag 1.0.1-rc.5, and the container parameters I set are as follows:
    [screenshot]
    The Volcano version is tag v1.2.0.
  2. I use the default queue that Volcano creates by default, and I have many namespaces. Do I need to create a separate queue for each namespace?

If you only use the default queue and the proportion plugin is enabled, all PodGroups will be enqueued regardless of their minResources declaration, because the queue's capability is empty; so it is possible that some pods end up Pending in the actual allocate action.

// pkg/scheduler/plugins/proportion/proportion.go#L254
ssn.AddJobEnqueueableFn(pp.Name(), func(obj interface{}) int {
	job := obj.(*api.JobInfo)
	queueID := job.Queue
	attr := pp.queueOpts[queueID]
	queue := ssn.Queues[queueID]

	// If no capability is set, always enqueue the job.
	if len(queue.Queue.Spec.Capability) == 0 {
		klog.V(4).Infof("Capability of queue <%s> was not set, allow job <%s/%s> to Inqueue.",
			queue.Name, job.Namespace, job.Name)
		return util.Permit
	}

	if job.PodGroup.Spec.MinResources == nil {
		return util.Permit
	}
	minReq := job.GetMinResources()
	// The queue resource quota limit has not reached
	inqueue := minReq.Add(attr.allocated).Add(attr.inqueue).LessEqual(api.NewResource(queue.Queue.Spec.Capability))
	if inqueue {
		attr.inqueue.Add(job.GetMinResources())
		return util.Permit
	}
	return util.Reject
})

P.S.: even if your queue declares a capability, the PodGroup created by tf-operator is currently still problematic because its minResources field is nil; as you can see in the code above, such a job will always be enqueued.

Similar issue: kubeflow/common#112
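
To make the two short-circuits in the snippet above concrete, here is a sketch (illustrative names and values; the volcano.sh/apis import path is an assumption and may differ by release) of a Queue that declares a capability and a PodGroup whose minResources is populated — both are needed before the enqueue check can ever return Reject:

package gang

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	v1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// A queue with an explicit capability: with this set, the enqueueable check
// above no longer short-circuits to Permit on the empty-capability branch.
var exampleQueue = &v1beta1.Queue{
	ObjectMeta: metav1.ObjectMeta{Name: "training"},
	Spec: v1beta1.QueueSpec{
		Weight: 1,
		Capability: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("8"),
			corev1.ResourceMemory: resource.MustParse("16Gi"),
		},
	},
}

// A PodGroup whose MinResources is populated: without this, the nil check
// above also short-circuits to Permit and the job is always enqueued.
var examplePodGroup = &v1beta1.PodGroup{
	ObjectMeta: metav1.ObjectMeta{Name: "tfjob-example", Namespace: "training"},
	Spec: v1beta1.PodGroupSpec{
		Queue:     "training",
		MinMember: 3,
		MinResources: &corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("6"),
			corev1.ResourceMemory: resource.MustParse("12Gi"),
		},
	},
}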


chenwenjun-github commented Mar 29, 2021

@shinytang6 Thank you. When testing gang scheduling, I also see that the PodGroup status is always Inqueue, even when the namespace ResourceQuota is not sufficient for the TFJob.
But some pods become Pending. One possible reason, I think, is that tf-operator's pod handling and the Volcano scheduler overlap.
What I mean is that tf-operator shouldn't create the pods when enable-gang-scheduling=true; the pods should be created by Volcano's allocate module.
I say this because I see this log in the tf-operator pod when enable-gang-scheduling=true:
[screenshot]

@shinytang6

@shinytang6 Thank you. When testing gang scheduling, I also see that the PodGroup status is always Inqueue, even when the namespace ResourceQuota is not sufficient for the TFJob.
But some pods become Pending. One possible reason, I think, is that tf-operator's pod handling and the Volcano scheduler overlap.
What I mean is that tf-operator shouldn't create the pods when enable-gang-scheduling=true; the pods should be created by Volcano's allocate module.

Hmm, I don't think so. The tasks scheduled by Volcano are generated from pods whose schedulerName is volcano. What allocate does is select an appropriate node and bind the pod to it, rather than creating the pod.
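
As a sketch of what that looks like from the scheduler's side (names are illustrative; the group-name annotation key is my assumption of how the pod is associated with its PodGroup and may differ by Volcano version):

package gang

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch of a pod as tf-operator creates it: the operator creates the Pod
// object; Volcano only picks it up (via schedulerName) and later binds it to
// a node in the allocate action.
var exampleWorkerPod = &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "tfjob-example-worker-0",
		Namespace: "training",
		Annotations: map[string]string{
			// Assumed annotation linking the pod to its PodGroup.
			"scheduling.k8s.io/group-name": "tfjob-example",
		},
	},
	Spec: corev1.PodSpec{
		SchedulerName: "volcano", // hand scheduling to Volcano instead of the default scheduler
		Containers: []corev1.Container{{
			Name:  "tensorflow",
			Image: "tensorflow/tensorflow:2.4.1", // illustrative image
		}},
	},
}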

@chenwenjun-github

chenwenjun-github commented Mar 30, 2021

Hmm, I don't think so. The tasks scheduled by Volcano are generated from pods whose schedulerName is volcano. What allocate does is select an appropriate node and bind the pod to it, rather than creating the pod.

@shinytang6 You are right. Volcano doesn't create the pods; the pods and the PodGroup are created by tf-operator and scheduled by Volcano.

And I think I understand my problem now.
My TFJob runs in a Kubernetes namespace for which I created a limited ResourceQuota.
Although a Pending pod has not been bound to a node, it still occupies the namespace ResourceQuota.
This causes a deadlock when I submit two identical TFJobs at the same time and the ResourceQuota is only large enough for one of them: some pods of each TFJob can't be created.
Volcano's allocate checks whether the pod count >= minMember; otherwise it doesn't allocate, so the pods that have already been created stay Pending (see the sketch after the screenshots below).
[screenshot]
[screenshot]
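
A simplified sketch of that gang check (not the actual Volcano source; it just mirrors the condition described above with a stand-in job type):

package gang

// jobView is a simplified stand-in for Volcano's scheduler-side job info.
type jobView struct {
	MinAvailable int32 // minMember from the PodGroup
	ReadyTaskNum int32 // tasks that are bound, running, or already allocated
}

// ready mirrors the gang condition described above: allocate only proceeds to
// bind pods once at least minMember tasks can run; otherwise the pods the
// operator already created stay Pending while still holding ResourceQuota.
func (j jobView) ready() bool {
	return j.ReadyTaskNum >= j.MinAvailable
}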


Thor-wl commented Mar 30, 2021

@shinytang6 and @jiangkaihua discussed this issue last night, and they are working on it.
/cc @shinytang6 @jiangkaihua

@google-oss-robot

@Thor-wl: GitHub didn't allow me to assign the following users: jiangkaihua.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

@shinytang6 and @jiangkaihua discussed this issue last night, and they are working on it.
/assign @shinytang6 @jiangkaihua

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


stale bot commented Jun 29, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Jul 8, 2021