-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-list: support "hostlist" constraint to allow jobs to be filtered by nodes #5656
job-list: support "hostlist" constraint to allow jobs to be filtered by nodes #5656
Conversation
2ea42eb
to
cb9e7b5
Compare
I wonder if we should limit the size of hostlist allowed in a constraint query for
Just add more |
Good catch! Hmmm not sure what the limit shouild be ... 1024 should be more than enough?
Yeah, b/c one could make a semi-infinite sized one. Although I'm unsure how to limit the size. Let me create an issue for that. |
1024 might be a bit small on a cluster with 10K nodes. Something dynamic might work if job-list can get the maximum expected instance size. However, this doesn't prevent DoS from submitting a complex query (e.g. |
Hmmm, I guess it wouldn't be too hard to keep a running track of largest node count seen so far.
Yeah, I brought up in #5669, perhaps need some hard caps just to avoid DoS attempts. |
Was looking into this and going back on pros on cons of various approaches:
the problems with some of the above
Sooo, I'm not sure.. I'm beginning to wonder if a heuristic might be best. Like number of brokers OR max job count, whatever is bigger? then max is a multiple of it? That's a lot of work for this. This also makes me wonder if we should hard code some max instead. Edit: hmmm here's a compromise idea, number of brokers or some-defined N, whichever is larger? That way there's always some decent minimum that is allowed, like 1024 or something. |
cc950de
to
8596055
Compare
re-pushed using the following idea to limit hostlist constraint sizes, the limit is instance size (ie number of brokers) or 1024, whatever is bigger. the 1024 minimum is to give the constraint some decent minimum, in the event the size of the cluster has shrunk over time. good idea? bad idea? |
This seems reasonable to me. Another idea I had would be for job-list to keep a single hostlist (really a set) of all hosts encountered for all jobs. When a hostlist constraint is encountered, the job-list module could take the intersection of this set and the "all hosts" set to limit the constraint hostlist to only those hosts which could possibly be matched. A similar process could be used to optimize matching of ranks as well. The drawback is extra work when processing every job to maintain the set of possible hosts, which may not end up being worth it. 🤔 |
In the current implementation this would work, but it'd immediately become a problem if we have older job data stored in a database, which we wouldn't scan upon a instance restart. So I was trying to avoid doing any optimization based on what we've seen so far. As mentioned above, I guess this could be solved with a side part of the database of extra stuffs (like "max hosts" or "hosts we've seen so far"). Hmmm, let me think about this more. Maybe doing that in the DB would be inevitable for optimization purposes. Or as you say ... maybe it's not worth the energy to do this optimization. |
Just curious, how are you going to match on hosts with older jobs in a database. Maybe the optimization won't be needed in that case if the hosts are indexed or something. (I haven't seen an implementation, so difficult to reason about). I assume a good database will have already handled the query optimization and "DoS" problem... |
eeada31
to
7b1164e
Compare
re-pushed, updating for conflicts, now based on #5681 |
1010788
to
c80c4c6
Compare
c80c4c6
to
27f8b5a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a quick first pass, found some places where error.text
is not set on error and thus garbage would be returned in the error response.
src/modules/job-list/match.c
Outdated
if (inc_check_comparison (c->mctx, comparisons, errp) < 0) | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The count of comparisons is incremented here and also for each host. Doesn't that mean one host comparison is counted twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my logic was there was a comparison simply to see if we should bother to check hosts (i.e. if the job didn't run, we don't check anything, and that should count for at least 1 comparison). But now that you mention it, perhaps that is not the most sensible way to think about it. I'll remove the first check.
match_hostlist, | ||
wrap_hostlist_destroy, | ||
errp))) | ||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errp->text
not set here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is ok, errp is set in the call to list_constraint_new()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, but looks like it is missing in the EINVAL
path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nevermind, I think I was looking at the wrong function in the diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I think I was looking at the wrong function in the diff.
/* Create a single hostlist if user specifies multiple nodes or | ||
* RFC29 hostlist range */ | ||
if (!(hl = hostlist_create ())) | ||
goto error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errp->text
not set here.
} | ||
if (!zlistx_add_end (c->values, hl)) { | ||
hostlist_destroy (hl); | ||
goto error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errp->text
not set here.
@@ -743,6 +831,28 @@ struct match_ctx *match_ctx_create (flux_t *h) | |||
goto error; | |||
} | |||
|
|||
if (flux_get_size (mctx->h, &mctx->max_hostlist) < 0) | |||
goto error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errp->text
not set here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you mean I forgot a flux_log in this case, but correct!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, heh. Yeah I must have been blindly checking goto error
without errprintf()
🤦
19ff734
to
eb1fcf5
Compare
re-pushed with tweaks per comments above |
eb1fcf5
to
e1fef43
Compare
rebased now that #5681 is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this one several times as part of development of #5711 and LGTM!
@Mergifyio rebase |
Problem: In the near future it'd be convenient to do calculations on the job nodelist, but it often needs to be in a hostlist data structure for processing. We'd like to avoid converting the nodelist to hostlist struct over and over again. Add a field into the job_data struct to hold a cached version of the nodelist in a hostlist struct.
Problem: It would be convenient to filter jobs based on the nodes they ran on. Add a constraint operator "hostlist" to filter on nodes within the job nodelist. Multiple nodes can be specified. Hostlists represented in RFC29 format are acceptable for input to the constraint. Fixes flux-framework#4186
Problem: There are no unit tests for the new 'hostlist' constraint operator. Add tests in match and state_match tests.
Problem: In t2260-job-list.t, some test jobs are submitted using "flux job submit". This makes it inconvenient to set job requirements on those jobs, such as specific nodes the jobs should run on. Solution: Convert all uses of "flux job submit" to use "flux submit".
Problem: In the near future we would like to filter jobs based on nodes in a job's nodelist. This would be an issue in the current test setup because test brokers all run under the same host and it is unknown which nodes jobs actually ran on. Solution: Setup fake hostnames for the test brokers in t2260-job-list.t and when necessary, run test jobs on specific hosts.
Problem: There is no testing / coverage for the new hostlist constraint in the job-list module. Add some tests to t2260-job-list.t.
✅ Branch has been successfully rebased |
e1fef43
to
d381bf3
Compare
oops, missed that this had been approved last week. rebased and setting MWP. thanks |
…_filter_nodes job-list: support "hostlist" constraint to allow jobs to be filtered by nodes
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5656 +/- ##
==========================================
- Coverage 83.30% 83.29% -0.01%
==========================================
Files 519 519
Lines 83680 83734 +54
==========================================
+ Hits 69707 69747 +40
- Misses 13973 13987 +14
|
TSIA.
Maybe the only interesting side note is that until we solve #5367 (which should probably follow this one up) we don't yet have a friendly user interface for this. The tests send constraints objects straight to the
job-list
module.