Misleading default value in --etcd-snapshot-dir opt descriptions (server, etcd-snapshot) #11570

Open
majabojarska opened this issue Jan 11, 2025 · 2 comments

Comments

@majabojarska
Contributor

Environmental Info:
K3s Version:

k3s version v1.31.4+k3s1 (a562d090)
go version go1.22.9

Node(s) CPU architecture, OS, and Version:

Linux [...] 6.10.2-rt14-arch1-3-rt #1 SMP PREEMPT_RT Sat, 14 Dec 2024 12:07:28 +0000 x86_64 GNU/Linux

Cluster Configuration:

Single node cluster (server+agent) - in my opinion not that relevant within the context of this report.

Describe the bug:

The default --etcd-snapshot-dir value shown in the --help output of the K3s server and etcd-snapshot CLI commands does not match the effective path created and used at runtime.

Take a look at the --help output for both commands:

k3s server --help | grep '\-dir'
   --data-dir value, -d value                 (data) Folder to hold state default /var/lib/rancher/k3s or ${HOME}/.rancher/k3s if not root [$K3S_DATA_DIR]
   --etcd-snapshot-dir value                  (db) Directory to save db snapshots. (default: ${data-dir}/db/snapshots)
# -------- SNIP --------

k3s etcd-snapshot --help | grep '\-dir'
   --data-dir value, -d value                                   (data) Folder to hold state default /var/lib/rancher/k3s or ${HOME}/.rancher/k3s if not root [$K3S_DATA_DIR]
   --dir value, --etcd-snapshot-dir value                       (db) Directory to save etcd on-demand snapshot. (default: ${data-dir}/db/snapshots)

Assuming --etcd-snapshot-dir is not provided, the effective path is actually ${data-dir}/server/db/snapshots, instead of ${data-dir}/db/snapshots (note the missing server path segment).

Steps To Reproduce:

To preface this section: I initially observed this issue on a different system running NixOS. There, the K3s service was installed and managed via the k3s nixpkg, which adds a layer of abstraction between the K3s distributables and the end user. To rule out potential configuration skew, I reproduced it on Arch via the AUR, whose install process I understand better.

  1. Install v1.31.4+k3s1

  2. Add a minimal etcd snapshot configuration to the k3s server invocation in the systemd service. Just enough to enable etcd (instead of SQLite) and get a snapshot created quickly, without flooding the storage:

    • /usr/bin/k3s server --cluster-init --etcd-snapshot-schedule-cron="* * * * *" --etcd-snapshot-retention=1
    • See the "additional context" section below for the full systemd unit file.
  3. Ensure service is enabled and reload daemons to get the above configuration running:

    sudo systemctl enable --now k3s.service
    sudo systemctl daemon-reload
    sudo systemctl status k3s.service
    
    ● k3s.service - Lightweight Kubernetes
         Loaded: loaded (/usr/lib/systemd/system/k3s.service; enabled; preset: disabled)
         Active: active (running) 
    # -------- SNIP --------
  4. Manually trigger an etcd snapshot via k3s etcd-snapshot

    sudo k3s etcd-snapshot save
    INFO[0000] Snapshot on-demand-machine-1736613048 saved.
  5. Wait a minute for the next scheduled etcd snapshot (the next cron tick).

Note

The service is running as root, and therefore the effective default --data-dir directory is /var/lib/rancher/k3s.

Expected behavior:

The snapshots are saved under /var/lib/rancher/k3s/db/snapshots, since this is what k3s server --help told me.

Actual behavior:

The snapshots are saved under /var/lib/rancher/k3s/server/db/snapshots:

[root@machine k3s]# pwd
/var/lib/rancher/k3s
[root@machine k3s]# ls
agent  data  server # No db dir here
[root@machine k3s]# cd server/
[root@machine server]# ls
agent-token  cred  db  etc  kine.sock  manifests  node-token  static  tls  token
[root@machine server]# cd db/snapshots/
[root@machine snapshots]# ls
etcd-snapshot-machine-1736614203  on-demand-machine-1736613048

Additional context / logs:

I've looked in the sources, and the effective data dir appears to be resolved here, on server startup:

k3s/pkg/server/server.go

Lines 466 to 486 in a562d09

func setupDataDirAndChdir(config *config.Control) error {
	var (
		err error
	)
	config.DataDir, err = ResolveDataDir(config.DataDir)
	if err != nil {
		return err
	}
	dataDir := config.DataDir
	if err := os.MkdirAll(dataDir, 0700); err != nil {
		return errors.Wrapf(err, "can not mkdir %s", dataDir)
	}
	if err := os.Chdir(dataDir); err != nil {
		return errors.Wrapf(err, "can not chdir %s", dataDir)
	}
	return nil

k3s/pkg/server/server.go

Lines 40 to 42 in a562d09

func ResolveDataDir(dataDir string) (string, error) {
	dataDir, err := datadir.Resolve(dataDir)
	return filepath.Join(dataDir, "server"), err

The above applies to both the server and etcd-snapshot commands. To my understanding, a snapshot triggered manually via etcd-snapshot save sends a POST /db/snapshot request to the server, which then computes the snapshot write path from the DataDir config value resolved on startup.

Full systemd unit file

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target

[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s server --cluster-init --etcd-snapshot-schedule-cron="* * * * *" --etcd-snapshot-retention=1
KillMode=process
Delegate=yes
# Having non-zero Limits causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
# See https://github.com/k3s-io/k3s/commit/b4335630b78b5cf927e79724067803a6c0d7c04f
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
@majabojarska
Contributor Author

Suffixing --data-dir with server for server-originating artifacts totally makes sense, and imo it's the CLI help that needs to be updated. I'll submit a PR.

@brandond
Member

Merged to master, will backport in February cycle.
