update: readme for WIP changes to fluence #42
Closed
Problem: our current fluence plugin is somewhat behind the current kubernetes-sigs/scheduler-plugins, resulting in the errors shown here: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run0#example-scheduling.
Today I felt brave (or just stupid) and decided to take another look. Our main issue was getting the openshift-psap plugins fluence branch up to date with > 100 commits from upstream kubernetes-sigs/scheduler-plugins. I first attempted a proper rebase, knowing this is typically preferred. But after about an hour (once it got into the vendor folder) it became clear this was not a good approach; it seemed that someone who had worked on the entire history would need to devote days to it. I searched for "how to rebase with 100s of commit changes" and the suggestion was largely to do an old-school merge, meaning that we would still need to resolve conflicts, but only once. I kept a record of the conflicts and some notes here: https://gist.github.com/vsoch/3c2b6d69607cab68de057ccbd003adeb. The strategy I decided on today was the following:
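For reference, a sketch of what this kind of old-school merge typically looks like (the remote name and exact branch names here are assumptions, not the precise commands used):

```shell
# Add the upstream repo as a second remote (the name "upstream" is a
# convention; adjust to your setup) and fetch its history.
git remote add upstream https://github.com/kubernetes-sigs/scheduler-plugins.git
git fetch upstream

# Merge upstream's master into the fluence branch. Unlike a rebase,
# all conflicts are resolved in this one merge commit.
git checkout fluence
git merge upstream/master

# After resolving conflicts:
git add -A
git commit
```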
This resulted in a lot of errors that led me to the issues discussed below. I was able to fix all the issues and get both the unit and integration tests working. See the notes in the next section for more details.
Notes
Changes to packages
For some reason this merge left many changed files in the openshift repo that were not in upstream. It actually looks like this openshift update undid changes that persisted (and still exist) in the kubernetes-sigs master branch. The primary changes were to associated packages that didn't seem relevant to fluence, and to functions in the coscheduling / capacityscheduling packages, which now return two values instead of one (with one always nil). Since the fluence branch didn't directly edit them, I assume these are old changes, and I updated them.
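As a sketch of that signature change (using minimal stand-in types here, not the real kubernetes-sigs/scheduler-plugins framework API):

```go
package main

import "fmt"

// Minimal stand-ins for the scheduler framework types; these are
// assumptions for illustration, not the actual framework definitions.
type Status struct{ msg string }
type PreFilterResult struct{ NodeNames []string }

// Old-style plugin hook: a single return value, nil meaning success.
func preFilterOld() *Status {
	return nil
}

// New-style hook: two return values, where the first (the result)
// is often just nil, as seen in the merged changes.
func preFilterNew() (*PreFilterResult, *Status) {
	return nil, nil
}

func main() {
	status := preFilterOld()
	result, status2 := preFilterNew()
	fmt.Println(status == nil, result == nil, status2 == nil)
}
```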
The other change that was wrong was simply that fluence.go was (for some reason) renamed back to kubeflux.go. I'm pretty sure I did this rename originally, so likely I just did something wrong. The commit history looks OK, so I just renamed it again in a commit. Another set of eyes would be good here.
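One way to redo the rename and sanity-check that the commit history is preserved across it (the file paths here are hypothetical; the git flags are standard):

```shell
# Rename the file back (path is an assumption for illustration).
git mv kubeflux.go fluence.go
git commit -m "Rename kubeflux.go back to fluence.go"

# --follow traces the file's history across the rename, so commits
# made under the old name should still show up.
git log --follow --oneline -- fluence.go
```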
Testing
I went through a rebuild of the fluence image from flux-k8s with this updated repository, and the error reported above seems to be gone: fluence is running again, at least superficially!
So that is good. This pull request is primarily to facilitate discussion/planning, and is a WIP with some small tweaks (primarily to the docs/README) that will likely be updated when we decide on next steps (and a future strategy, see the next section).
Next Steps
These are my suggested next steps:
Since I have functioning fluence images, I am unblocked to set up testing on Google Cloud (this was the original failure). I will try that again this evening, because I'm excited (and hoping that it works). I will update this PR with what I learn.
@cmisale @milroy this is primarily for your FYI. The main work is here: researchapps/scheduler-plugins#1
I tried my best to have clear, succinct / modular commits (per flux standard), but I'm still fairly bad at that.
I think there is a lot to pick up / learn about how custom scheduling plugins work, and I'm just at the tip of the iceberg. If I did this all wrong / it ultimately isn't right I apologize for the noise! It definitely was fun... I think I was looking for something complex to stick my head in today (cue ostrich visual) and it definitely delivered! 🙌
Update: I was able to deploy fluence to Google Cloud, deploy the same sample pods, and confirm that fluence schedules them. I installed the MPI Operator and tried a test run of a previous lammps container. Likely the container was built for a different architecture (intended for AWS or similar) than the Google node and my system, because I got "Illegal instruction (core dumped)" both when testing with the MPI Operator and when pulling the container and just running `lmp` on my local machine (notes at the bottom of the README here: https://github.com/converged-computing/operator-experiments/tree/main/google/scheduler/run1). I'm not particularly worried about this because I suspect we will build newer containers and also use the Flux Operator. I'll post an update in the appropriate chats about next steps.