[WIP] Auto import images for embedded registry #10973

Draft
wants to merge 12 commits into base: master

Conversation

vitorsavian
Member

@vitorsavian vitorsavian commented Oct 2, 2024

Proposed Changes

  • ADR
  • Auto import images from .txt files and tar files

Types of Changes

  • New Feature

Verification

  • run a k3s server
k3s server
  • create a .txt file and add a docker.io image path to the .txt file
docker.io/library/mysql:latest
  • move the file to /var/lib/rancher/k3s/agent/images
cp example.txt /var/lib/rancher/k3s/agent/images
  • then check that the image was imported into the containerd image store
k3s ctr images list | grep "mysql"
  • you can also edit a file inside /var/lib/rancher/k3s/agent/images
nano /var/lib/rancher/k3s/agent/images/example.txt
  • then you can add
docker.io/library/redis:latest
  • the controller will see the change and then import the redis image
k3s ctr images list | grep "redis"

Testing

Linked Issues

User-Facing Change

Users can now auto-import images into containerd by simply dropping the image file into the agent/images folder while k3s is running

Further Comments

@vitorsavian vitorsavian requested a review from a team as a code owner October 2, 2024 15:46
## Context

Since the embedded registry feature was introduced, users have raised questions about having to manually import images, especially in edge environments.
As a result, there is a need for a folder to be created where every image placed in it will be watched by a controller (a child process that will run when the embedded registry is created) for changes or new images; these new or changed images will be added to the node registry, meaning that other nodes will have access to the image.
Contributor

It'll be much better if we could reuse the existing folder for images instead of creating a new one

Member

Could possibly be convinced to just change the behavior of the existing agent/images folder.

Member Author

@vitorsavian vitorsavian Oct 3, 2024

I was thinking of creating the controller for agent/images: the controller would import the existing images as the default behavior, and separately add any other images placed there later.

Member

@brandond brandond left a comment

Users can now add images in the embedded registry auto import folder while k3s is running

To avoid confusion, remove any references to importing things into the embedded registry - the embedded registry does not have a discrete image store to import into. We import images into the containerd image store, and images in the containerd image store can be used by Kubernetes pods without needing to be pulled from a remote registry, and they can also be shared between nodes by the embedded registry.
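
For context, importing into the containerd image store via the containerd Go client looks roughly like the sketch below; importTarball is a hypothetical helper for illustration, not code from this PR.

import (
    "context"
    "os"

    "github.com/containerd/containerd"
    "github.com/sirupsen/logrus"
)

// Rough sketch of a tarball import into the containerd image store. importTarball is
// hypothetical and assumes ctx already carries a containerd namespace (e.g. k8s.io).
func importTarball(ctx context.Context, client *containerd.Client, path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    // Import reads an image tarball and records its images in the containerd image store,
    // where pods can use them without pulling from a remote registry.
    imgs, err := client.Import(ctx, f)
    if err != nil {
        return err
    }
    for _, img := range imgs {
        logrus.Infof("Imported image %s from %s", img.Name, path)
    }
    return nil
}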


Since the embedded registry feature was introduced, users have raised questions about having to manually import images, especially in edge environments.

As a result, there is a need for a folder that can handle this action, where every image placed there will be watched by a controller (a child process that will run when the embedded registry is created) for changes or new images; these new images or changes will be added to the containerd node registry, meaning that other nodes will have access to the image.
Member

@brandond brandond Oct 3, 2024

  1. Controllers are not necessarily child processes, they are usually a goroutine within the main k3s supervisor process. Can we remove the expectation that this will be handled by a dedicated child process?
  2. Instead of "containerd node registry", refer to it as the "containerd image store" as that is the correct name for where images are imported into.
  3. Remove any mention of sharing access with other nodes. That is a function of the embedded registry and not related to the auto-import functionality discussed here.


As a result, there is a need for a folder that can handle this action, where every image placed there will be watched by a controller (a child process that will run when the embedded registry is created) for changes or new images; these new images or changes will be added to the containerd node registry, meaning that other nodes will have access to the image.

This folder could be the agent/images folder itself, but that would require a change in how images were previously loaded; for example, the new controller could handle the initial load of the k3s images.
Member

Propose something specific. If we do decide to periodically re-scan the existing images folder, how would we do that? A periodic scan would be easiest, but inotify and other APIs for realtime change detection are available. Processing image tarballs can be an IO-intensive operation; how should we determine whether the whole file should be imported without having to reprocess the entire contents - compare mtime+size perhaps?
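
As a rough sketch of the mtime+size idea: shouldImport is a hypothetical helper, and stateFileInfos mirrors the map this PR already uses to track file state.

import "os"

// shouldImport reports whether a file needs to be (re-)imported. It compares the current
// size and modification time against what was recorded at the last import, so unchanged
// tarballs are not reprocessed on every scan.
func shouldImport(stateFileInfos map[string]os.FileInfo, path string) (bool, error) {
    info, err := os.Stat(path)
    if err != nil {
        return false, err
    }
    last, seen := stateFileInfos[path]
    if !seen || last.Size() != info.Size() || !last.ModTime().Equal(info.ModTime()) {
        stateFileInfos[path] = info
        return true, nil
    }
    return false, nil
}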


codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 21.27660% with 74 lines in your changes missing coverage. Please review.

Project coverage is 43.66%. Comparing base (7552203) to head (d605c59).

Files with missing lines Patch % Lines
pkg/agent/containerd/containerd.go 21.27% 63 Missing and 11 partials ⚠️

❗ There is a different number of reports uploaded between BASE (7552203) and HEAD (d605c59).

HEAD has 1 upload less than BASE
Flag BASE (7552203) HEAD (d605c59)
e2etests 7 6
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10973      +/-   ##
==========================================
- Coverage   49.93%   43.66%   -6.28%     
==========================================
  Files         178      178              
  Lines       14816    14907      +91     
==========================================
- Hits         7399     6509     -890     
- Misses       6069     7182    +1113     
+ Partials     1348     1216     -132     
Flag Coverage Δ
e2etests 36.09% <21.27%> (-9.95%) ⬇️
inttests 19.70% <0.00%> (-17.17%) ⬇️
unittests 13.46% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.


Signed-off-by: Vitor Savian <[email protected]>
Signed-off-by: Vitor Savian <[email protected]>
}

switch event.Op {
case fsnotify.Write:
Member

I would probably recommend against doing the work of importing images directly in the fsnotify event loop. Depending on how the file is written, you may get multiple events for the same file in short sequence. I would probably just wire the notify events up to a keyed work queue, to ensure that the fsnotify event queue is not blocked, and to merge multiple events concerning the same file.

You can take the service change code as a reference:

// runWorker dequeues Service changes from the work queue
// We run a lightweight work queue to handle service updates. We don't need the full overhead
// of a wrangler service controller and shared informer cache, but we do want to run changes
// through a keyed queue to reduce thrashing when pods are updated. Much of this is cribbed from
// https://github.com/rancher/lasso/blob/release/v2.5/pkg/controller/controller.go#L173-L215
func (k *k3s) runWorker() {
    for k.processNextWorkItem() {
    }
}

// processNextWorkItem does work for a single item in the queue,
// returning a boolean that indicates if the queue should continue
// to be serviced.
func (k *k3s) processNextWorkItem() bool {
    obj, shutdown := k.workqueue.Get()
    if shutdown {
        return false
    }
    if err := k.processSingleItem(obj); err != nil && !apierrors.IsConflict(err) {
        logrus.Errorf("%s: %v", controllerName, err)
    }
    return true
}

// processSingleItem processes a single item from the work queue,
// requeueing it if the handler fails.
func (k *k3s) processSingleItem(obj interface{}) error {
    var (
        key string
        ok  bool
    )
    defer k.workqueue.Done(obj)
    if key, ok = obj.(string); !ok {
        logrus.Errorf("expected string in workqueue but got %#v", obj)
        k.workqueue.Forget(obj)
        return nil
    }
    keyParts := strings.SplitN(key, "/", 2)
    if err := k.updateStatus(keyParts[0], keyParts[1]); err != nil {
        k.workqueue.AddRateLimited(key)
        return fmt.Errorf("error updating LoadBalancer Status for %s: %v, requeueing", key, err)
    }
    k.workqueue.Forget(obj)
    return nil
}

  • workqueue workqueue.RateLimitingInterface
  • k.workqueue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
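
For illustration, a rough sketch of wiring the fsnotify events into such a keyed queue follows; watchAndImport and importFile are hypothetical names, not code from this PR.

import (
    "context"

    "github.com/fsnotify/fsnotify"
    "k8s.io/client-go/util/workqueue"
)

// watchAndImport forwards fsnotify events to a rate-limited work queue keyed by file path,
// so the event loop never blocks on IO and a burst of Create/Write events for the same
// tarball collapses into a single import.
func watchAndImport(ctx context.Context, watcher *fsnotify.Watcher, importFile func(context.Context, string) error) {
    queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

    // Producer: no IO here, just enqueue the path as the work key.
    go func() {
        defer queue.ShutDown()
        for {
            select {
            case event, ok := <-watcher.Events:
                if !ok {
                    return
                }
                if event.Op&(fsnotify.Create|fsnotify.Write) != 0 {
                    queue.Add(event.Name)
                }
            case <-ctx.Done():
                return
            }
        }
    }()

    // Consumer: one import at a time, requeued with backoff on failure.
    for {
        key, shutdown := queue.Get()
        if shutdown {
            return
        }
        if err := importFile(ctx, key.(string)); err != nil {
            queue.AddRateLimited(key)
        } else {
            queue.Forget(key)
        }
        queue.Done(key)
    }
}

Keying the queue on the file path means repeated events for the same tarball are merged into a single pending import.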

Comment on lines 158 to 168
// At startup all leases from k3s are cleared; we no longer use leases to lock content
if err := clearLeases(ctx, client); err != nil {
logrus.Errorf("Error while clearing leases: %s", err.Error())
return
}

// Clear the pinned labels on all images previously pinned by k3s
if err := clearLabels(ctx, client); err != nil {
logrus.Errorf("Error while clearing labels: %s", err.Error())
return
}
Member

This is already done in PreloadImages before go Watcher(ctx, cfg) is called, don't do it again here. Doing so would un-label all the images that we just finished importing.

Comment on lines 115 to 117
// Watcher is a controller that watch the agent/images folder
// to ensure that every new file is added to the watcher state
func Watcher(ctx context.Context, cfg *config.Node) {
Member

this doesn't appear to be called outside this package, why does it need to be exported?

Suggested change
// Watcher is a controller that watch the agent/images folder
// to ensure that every new file is added to the watcher state
func Watcher(ctx context.Context, cfg *config.Node) {
// watcher is a controller that uses fsnotify to watch the agent/images folder
// and import any tarballs placed in that directory whenever they are created or modified.
func watcher(ctx context.Context, cfg *config.Node) {

// get the file info to add to the state map
fileInfo, err := dirEntry.Info()
if err != nil {
logrus.Errorf("Error while getting the info from file: %s", err.Error())
Member

Convention is to use %v for errors, and just pass in the error itself.

Suggested change
logrus.Errorf("Error while getting the info from file: %s", err.Error())
logrus.Errorf("Error while getting the info from file: %v", err)

case fsnotify.Write:
newStateFile, err := os.Stat(event.Name)
if err != nil {
logrus.Errorf("Error encountered while getting file %s info for event write: %s", event.Name, err.Error())
Member

Suggested change
logrus.Errorf("Error encountered while getting file %s info for event write: %s", event.Name, err.Error())
logrus.Errorf("Error encountered while getting file %s info for event write: %v", event.Name, err)

case fsnotify.Create:
info, err := os.Stat(event.Name)
if err != nil {
logrus.Errorf("Error encountered while getting file %s info for event Create: %s", event.Name, err.Error())
Member

Suggested change
logrus.Errorf("Error encountered while getting file %s info for event Create: %s", event.Name, err.Error())
logrus.Errorf("Error encountered while getting file %s info for event Create: %v", event.Name, err)

if !ok {
return
}
logrus.Errorf("error in watcher controller: %s", err.Error())
Member

%v

delete(stateFileInfos, event.Name)
logrus.Infof("Removed file from the watcher controller: %s", event.Name)
case fsnotify.Remove:
delete(stateFileInfos, event.Name)
Member

Do we want to un-label the images that came from this file when it is deleted? We'd probably need to use another label to track what file (or files, in case it is in multiple) an image came from to do this.
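
For illustration, a minimal sketch of the label idea; the label key and labelImageSource are hypothetical, and nothing like this exists in the PR yet.

import "github.com/containerd/containerd/images"

// Hypothetical label recording which tarball an image was imported from.
const labelSourceFile = "io.cattle.k3s.source-file"

// labelImageSource tags an image with the path of the file it came from, so a cleanup
// pass on file removal could list images whose label matches the deleted path and
// un-pin or un-label just those images.
func labelImageSource(img images.Image, sourceFile string) images.Image {
    if img.Labels == nil {
        img.Labels = map[string]string{}
    }
    img.Labels[labelSourceFile] = sourceFile
    return img
}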

delete(stateFileInfos, event.Name)
logrus.Infof("Removed file from the watcher controller: %s", event.Name)
}
case err, ok := <-watcher.Errors:
Member

What happens if there is a not-ok error here? Do we just return from the goroutine and stop watching forever, until the next time K3s is restarted? Should this be a fatal error so that k3s is restarted, or should we restart the watch again?
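
For illustration, one way to avoid stopping forever would be a retry wrapper along these lines; runWatcherWithRetry and watchImages are hypothetical names for a version of the watcher that returns an error instead of just logging, and the config import path is assumed.

import (
    "context"
    "time"

    "github.com/sirupsen/logrus"

    "github.com/k3s-io/k3s/pkg/daemons/config" // assumed import path for config.Node
)

// runWatcherWithRetry restarts the image watcher whenever it fails, so a transient error
// does not leave K3s without a watch until the next restart.
func runWatcherWithRetry(ctx context.Context, cfg *config.Node, watchImages func(context.Context, *config.Node) error) {
    for {
        if err := watchImages(ctx, cfg); err != nil {
            logrus.Errorf("Failed to watch %s for image imports: %v", cfg.Images, err)
        }
        select {
        case <-ctx.Done():
            return
        case <-time.After(15 * time.Second):
        }
    }
}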

Comment on lines 120 to 139
logrus.Errorf("Error to create a watcher: %s", err.Error())
return
}

// Add agent/images path to the watcher.
err = watcher.Add(cfg.Images)
if err != nil {
logrus.Errorf("Error when creating the watcher controller: %s", err.Error())
return
}

client, err := Client(cfg.Containerd.Address)
if err != nil {
logrus.Errorf("Error to create containerd client: %s", err.Error())
return
}

criConn, err := cri.Connection(ctx, cfg.Containerd.Address)
if err != nil {
logrus.Errorf("Error to create CRI connection: %s", err.Error())
Member

Error handling here is to just print a message and return, leaving K3s with no watch on the images. Should we have this instead return an error, and wrap it in a retry loop so that it is restarted if it fails?

Also note that the idiomatic error message should be in the rough form of:

logrus.Errorf("Failed to SOMETHING with %s: %v", thing.Name, err)

"Error to" is not proper grammar, and we already know it's an error because we are using Errorf which prints level=error as part of the log entry.

Signed-off-by: Vitor Savian <[email protected]>
Signed-off-by: Vitor Savian <[email protected]>
@vitorsavian vitorsavian changed the title [ADR][WIP] Auto import images for embedded registry [WIP] Auto import images for embedded registry Oct 18, 2024
@vitorsavian vitorsavian marked this pull request as draft October 18, 2024 14:42
@brandond
Member

brandond commented Oct 21, 2024

We probably shouldn't do it in this PR, but it would be nice if whatever we used for watching image file changes was generic, and could be reused by the deploy controller - right now it just re-scans everything once a minute:

// start calls listFiles at regular intervals to trigger application of manifests that have changed on disk.
func (w *watcher) start(ctx context.Context, client kubernetes.Interface) {
    w.recorder = pkgutil.BuildControllerEventRecorder(client, ControllerName, metav1.NamespaceSystem)
    force := true
    for {
        if err := w.listFiles(force); err == nil {
            force = false
        } else {
            logrus.Errorf("Failed to process config: %v", err)
        }
        select {
        case <-ctx.Done():
            return
        case <-time.After(15 * time.Second):
        }
    }
}
