
DEVOPS-7803 restore log stream #2972

Open · wants to merge 1 commit into master

Conversation

@DCkQ6 commented Jul 3, 2024

Change Overview

Unfortunately, there is a hardcoded timeout in the kubelet server. If we want to support long-lasting phases, we need to work around it.

This PR introduces re-establishing the log stream connection while the pod is still running.
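
For illustration, below is a minimal sketch of the reconnect-and-resume idea using client-go. The type and helper names (restartingLogReader, open, podIsRunning) are assumptions made for this sketch, not the exact code in this PR; the actual change lives in pkg/kube/pod_controller.go and tracks the time of the last successful read so the stream can be re-opened from that point.

// Sketch only: illustrative names and structure, not the exact implementation in this PR.
package kube

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type restartingLogReader struct {
	ctx          context.Context
	cli          kubernetes.Interface
	namespace    string
	podName      string
	container    string
	reader       io.ReadCloser
	lastReadTime metav1.Time
}

// open (re)establishes the log stream, resuming from lastReadTime when it is set.
func (r *restartingLogReader) open() error {
	opts := &corev1.PodLogOptions{Container: r.container, Follow: true}
	if !r.lastReadTime.IsZero() {
		opts.SinceTime = &r.lastReadTime // resume from the last successful read
	}
	stream, err := r.cli.CoreV1().Pods(r.namespace).GetLogs(r.podName, opts).Stream(r.ctx)
	if err != nil {
		return err
	}
	r.reader = stream
	return nil
}

// Read reads from the current stream and, if the kubelet closed it while the
// pod is still running, re-opens the stream so the caller can keep reading.
func (r *restartingLogReader) Read(p []byte) (int, error) {
	n, err := r.reader.Read(p)
	if n > 0 {
		r.lastReadTime = metav1.Now() // remember when data was last seen
	}
	if err == io.EOF && r.podIsRunning() {
		r.reader.Close()
		if openErr := r.open(); openErr != nil {
			return n, openErr
		}
		if n > 0 {
			return n, nil
		}
		return r.Read(p) // continue on the re-opened stream
	}
	return n, err
}

func (r *restartingLogReader) podIsRunning() bool {
	pod, err := r.cli.CoreV1().Pods(r.namespace).Get(r.ctx, r.podName, metav1.GetOptions{})
	return err == nil && pod.Status.Phase == corev1.PodRunning
}

Because PodLogOptions.SinceTime has one-second granularity, the last line read before a reconnect can be returned again on the new stream; this matches the duplicated-last-line behavior discussed in the review below.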

Pull request type

Please check the type of change your PR introduces:

  • 🚧 Work in Progress
  • 🌈 Refactoring (no functional changes, no api changes)
  • 🐹 Trivial/Minor
  • 🐛 Bugfix
  • 🌻 Feature
  • 🗺️ Documentation
  • 🤖 Test

Issues

Test Plan

I created a phase executing the KubeTask function with a long-running command (sleep), and verified that the logs were read until the end of its execution and that the output was gathered correctly.

  • 💪 Manual
  • ⚡ Unit test
  • 💚 E2E

Re-establish connection while the pod is still running
@DCkQ6 mentioned this pull request Jul 3, 2024

@hairyhum (Contributor) left a comment:

This approach is better, but can we have some automated test to prove that we don't lose logs?


func (s *restoreLogStreamReader) Read(p []byte) (n int, err error) {
	n, err = s.reader.Read(p)
	defer func() { s.lastReadTime = metav1.Now() }()

@hairyhum (Contributor) commented:

This looks like we take the timestamp after the stream closes. My main concern in #2903 (comment) was that if a log line arrived right when the stream closed, we might lose some records.
Did you have a way to reproduce the failure so we can have at least empirical proof that this approach works?

@DCkQ6 (Author) replied:

Yes, there is a way to reproduce it, and I did. Unfortunately, each test takes 4 hours, since I simply run an action from a blueprint that uses a long sleep and an echo.

In manual tests, this approach always resulted in a duplicated last line of the log, regardless of how frequently the log was written. The last line was repeated even when it was the only line written in the previous 4 hours.

@hairyhum (Contributor) replied:

Right. Should we then add some deduplication for that case (if we can)?

@DCkQ6 (Author) replied:

Deduplication would add considerable overhead for what we are trying to achieve. The problem manifests only after 4 hours, and a single duplicated line for tasks that run longer than 4 hours does not, in my opinion, justify adding deduplication logic. That logic would require keeping the last read line in memory and comparing it against the stream after reconnecting. Additionally, we should not assume that our users' logs are newline-delimited, which would complicate even this simple comparison.

Therefore, in my opinion, we should accept the single duplicated line that occurs on these rare occasions.
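
For reference, below is a rough sketch of what such line-based deduplication could look like. dedupFirstLine is a hypothetical helper, not code from this PR, and it assumes newline-delimited logs, which, as noted above, we cannot rely on in general.

package logdedup

import (
	"bufio"
	"fmt"
	"io"
)

// dedupFirstLine copies lines from the resumed log stream to out, skipping
// the very first line when it is identical to the last line seen before the
// reconnect. Hypothetical sketch; assumes newline-delimited logs.
func dedupFirstLine(prevLastLine string, resumed io.Reader, out io.Writer) error {
	sc := bufio.NewScanner(resumed)
	first := true
	for sc.Scan() {
		line := sc.Text()
		if first {
			first = false
			if line == prevLastLine {
				continue // drop the line re-sent because SinceTime is inclusive
			}
		}
		if _, err := fmt.Fprintln(out, line); err != nil {
			return err
		}
	}
	return sc.Err()
}

Even this minimal version must keep the previous last line in memory and depends on stable line boundaries, which is exactly the overhead and assumption the comment above argues against.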

pkg/kube/pod_controller.go (resolved review thread)
@DCkQ6 requested a review from hairyhum on August 21, 2024, 09:50
@hairyhum requested a review from e-sumin and removed the request for hairyhum on August 21, 2024, 17:54

Successfully merging this pull request may close these issues.

[BUG] Problem with long running phase