Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-40810: Implement an allocateNodes "auto" #22

Merged
merged 10 commits into from
Nov 29, 2023
Merged

DM-40810: Implement an allocateNodes "auto" #22

merged 10 commits into from
Nov 29, 2023

Conversation

daues
Copy link
Contributor

@daues daues commented Oct 26, 2023

No description provided.

@yalsayyad yalsayyad changed the title Tickets/dm 40810 DM-40810: Implement an allocateNodes "auto" Oct 30, 2023
Copy link

@MichelleGower MichelleGower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments. Approved for merge.

@@ -61,6 +61,13 @@ def parseArgs(self, basename):

parser = argparse.ArgumentParser(prog=basename)
parser.add_argument("platform", help="node allocation platform")
parser.add_argument(
"-u",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-u is used in a lot of commands for the username. I don't know if this will cause confusion or worse if a username is ever needed for allocateNodes. I don't know that there's a better single character. I would be ok with just --auto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I just remove the "-u" and it will be happy with only one such argument? I'll try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be work; now just --auto.


print("Detected this number of preexisting glidein jobs: %s " % strResult)
print("Detected this number of preexisting glidein jobs: %s " % strResult)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this output would be helpful in the automated case as well.

print(f"full_constraint {full_constraint}")
condorq_data = condor_q(
constraint=full_constraint,
schedds={schedd_name: scheddref},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only detecting jobs on a single schedd? So one would have to run allocateNodes on multiple machines but jobs submitted on either can use nodes from either allocateNodes?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the case for this version of the software, please add it to the documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is detecting jobs for the single schedd on the current machine. So working on/across sdfrome001/2 simultaneously, yes each needs an allocateNodes. They do not interfere at the first level, all necessary glideins will be submitted, but then jobs could run on either. Interference does exist but it is small (all Idle jobs at a given instant are accounted for.) The addition of Tags / Requirements would be the solution.
I'll comment in the docs about the single schedd.

if verbose:
print("\t\tNeed to Add More:")
print(f"\t\tRequestMemory is {thisMemory} ")
print(f"\t\tRatio to 4 GB is {thisRatio} ")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the "4" to "{ratioMemCore}"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

else:
print("Length Zero")
print(len(condorq_data))
except Exception:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doing such a broad catch of exceptions? And then do nothing (not even printing the exception for later debugging).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added something like
except Exception as exc:
raise type(exc)("Problem querying condor schedd for jobs") from None

# Check Slurm queue Running glideins
jobname = "".join(["glide_", auser])
existingGlideinsRunning = 0
batcmd = "".join(["squeue --noheader --states=R --name=", jobname, " | wc -l"])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again question about the join vs f-string. Assume same question for all the rest of the joins where there isn't an unknown number of values in a list being turned into a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted to f-string.

)

maxIdleGlideins = maxNumberOfGlideins - existingGlideinsRunning
maxSubmitGlideins = maxIdleGlideins - existingGlideinsIdle

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to read code here a couple times. maxSubmitGlideins confused me when set to that difference. I would move the non-maxSubmitGlideins print statements above here moving these two differences closer to the if statement below. Then put a comment about capping new glideins based upon a limit of total idle glideins in queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will leave these as is right now, as this is the "working like a glidein factory" part, and it is likely worth another ticket to put the whole procedure into "memory"
and think through the best names for all of these variables, if the current ones are not totally clear. (It's fairly complicated.)

existingGlideinsRunning = 0
batcmd = "".join(["squeue --noheader --states=R --name=", jobname, " | wc -l"])
print("The squeue command is: %s " % batcmd)
resultR = subprocess.check_output(batcmd, shell=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the command fails, I think this will throw an exception. Just wondering why these aren't caught like the ones in the condor section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add

try:
subprocess.check_output(...)
except subprocess.CalledProcessError as e:
print(e.output)

"RequestCpus",
"JobUniverse",
"RequestMemory",
]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a comment about the pieces of the constraint? Something like "Limit query to compute jobs for this user that are idle" or put comments with JobStatus and JobUniverse (It took me a couple minutes to figure out why needed JobUniverse).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments on the classads to be returned, and the constraint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and it can be emphasized that there are classads to be returned (the first list), and then the constraint (a second list.) All of the classads there do not have to be returned but they seemed useful to have available for debugging.


Returns
-------
number : `str`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think copy-paste error in the return type.

@daues daues merged commit e3d61e8 into main Nov 29, 2023
3 checks passed
@daues daues deleted the tickets/DM-40810 branch November 29, 2023 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants