Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a diagnostic refresh_age metric and corresponding container healthcheck #50

Merged
3 commits merged into from
Jun 1, 2024

Conversation

michaelosthege
Copy link
Contributor

Hi @sfudeus, thanks for this great exporter!

Every now and then my container appears to hang, producing such gaps until I restart it:

image

I assume that some internal loop crashes, but the logs don't indicate anything.

To notice when things to out of order, I introduced a new metric:

# HELP homematic_refresh_age Seconds since the last successful refresh.
# TYPE homematic_refresh_age gauge
homematic_refresh_age{ccu="192.168.178.30"} 4.156044244766235

In combination with a healthcheck script and docker-autoheal this makes it easy to automatically restart the container.

image

exporter.py Show resolved Hide resolved
@sfudeus
Copy link
Owner

sfudeus commented May 19, 2024

Hi @michaelosthege, thanks for the contribution. This looks good besides the one comment I had. Ultimately, instead of needing such a healthcheck long-term, I'd prefer finding and preventing an issue which would block the loop infinitely. But at least this helps finding such occasions.

@michaelosthege
Copy link
Contributor Author

I'd prefer finding and preventing an issue which would block the loop infinitely. But at least this helps finding such occasions.

Totally agree. I tried to debug, but found it quite hard.
Mainly two things prevented me from proper debugging:

  • The DEBUG mode is too verbose to be useful for debugging rare events
  • The generate_metrics is quite complex, with lots of branches and a lots of indentation.

As a start I would recommend to apply ruff and refactor some things into functions.
Let me know if I shall open a PR!

@sfudeus
Copy link
Owner

sfudeus commented May 19, 2024

Mainly two things prevented me from proper debugging:

  • The DEBUG mode is too verbose to be useful for debugging rare events
  • The generate_metrics is quite complex, with lots of branches and a lots of indentation.

As a start I would recommend to apply ruff and refactor some things into functions. Let me know if I shall open a PR!

Feel free of you find the time for it. The exporter has grown over time and could need some rework.
Please try to break the refactoring down into reasonable chunks to not have one complex big refactoring which is hard to review.

@sfudeus
Copy link
Owner

sfudeus commented May 21, 2024

@michaelosthege Can you rebase on the latest state of master? I recently refactored the workflow definitions and the image build job fails because of that mismatch.

@michaelosthege
Copy link
Contributor Author

Done!

@sfudeus sfudeus mentioned this pull request May 21, 2024
@sfudeus
Copy link
Owner

sfudeus commented May 21, 2024

Workflow security setting prevent running from a remote repo, so I cloned it into a local branch on PR #58.
Image preview available at docker.io/sfudeus/homematic_exporter:preview-58

healthcheck.sh Outdated
maxage=${1:-60}

# Get the age of the last successful metrics refresh in seconds
age=$(curl -s http://localhost:9040/metrics | grep 'homematic_refresh_age{' | cut -d ' ' -f2 | cut -d '.' -f1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no curl in the docker image yet - and the scripts lacks an execute bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, I should have checked the healthcheck logs!

I didn't need to change the file permissions though. It's now doing 0-exit on my deployment..

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your healthcheck command included the bash-invocation and only gave the script as argument, that is why you didn't nee the bit. But it would make sense to do so to be able to call it natively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please feel free to suggest a change or make a commit. I don't know what exactly you'd like to see here

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 5f580ed, it is just about the permissions bits on the healthcheck.sh. I can add that later myself as well, I cannot push to your branch.

The most recent docker build contains that already and looks good so far in my cluster. I did not set a liveness check using the healthcheck yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Linux file permissions at the git level? I didn't know these were copied into the docker image.

@sfudeus sfudeus closed this pull request by merging all changes into sfudeus:master in 3eda0b9 Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants