DEVOPS-7803 restore log stream #2972
base: master
Conversation
Re-establish connection while the pod is still running
This approach is better, but can we have some automated test to prove that we don't lose logs?
func (s *restoreLogStreamReader) Read(p []byte) (n int, err error) {
	n, err = s.reader.Read(p)
	defer func() { s.lastReadTime = metav1.Now() }()
This looks like we take the timestamp after the stream close. My main concern in #2903 (comment) was that if log records arrived around the time the stream closed, we might lose some of them.
Did you have a way to reproduce the failure so we can have at least empirical proof that this approach works?
Yes, there is a way to reproduce it, and I did. Unfortunately, each test takes 4 hours, as I just run an action from a blueprint that uses a long sleep and echo.
In manual tests this approach always resulted in a duplicated last line of the log, independent of how frequently the log was written. The last line was repeated even when it was the only line produced in the whole 4 hours.
Right. Should we then add some deduplication for that case (if we can)?
Deduplication would be significant overhead for what we are trying to achieve. The problem only manifests after 4 hours, and a single duplicated line in tasks that run for more than 4 hours does not, in my opinion, justify adding deduplication logic. That logic would require storing the last read line in memory and comparing it against the first line received after reconnecting. Additionally, we should not assume that our users' logs are line-delimited, which would complicate even such a simple comparison.
Therefore, in my opinion, we should accept this single duplicated line that occurs on rare occasions.
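For illustration only, the kind of bookkeeping such deduplication would require might look roughly like the sketch below. Everything here is hypothetical (none of these names exist in the PR), and it assumes newline-delimited logs, which, as noted above, we cannot rely on:

```go
package logstream

import (
	"bufio"
	"io"
)

// dropDuplicateFirstLine wraps a freshly reconnected log stream and skips its
// first line when it is byte-for-byte identical to the last line read before
// the reconnect. This only illustrates the extra state and comparison the
// comment above refers to.
func dropDuplicateFirstLine(reconnected io.Reader, lastLine string) io.Reader {
	pr, pw := io.Pipe()
	go func() {
		defer pw.Close()
		sc := bufio.NewScanner(reconnected)
		first := true
		for sc.Scan() {
			line := sc.Text()
			if first {
				first = false
				if line == lastLine {
					// Skip the line that was re-delivered after reconnecting.
					continue
				}
			}
			if _, err := pw.Write([]byte(line + "\n")); err != nil {
				return
			}
		}
	}()
	return pr
}
```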
Change Overview
Unfortunately, there is a hardcoded timeout in the kubelet server. If we want to support long-lasting phases, we need to work around it.
This PR introduces re-establishing the log stream connection while the pod is still running.
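A minimal sketch of the idea follows. This is not the exact code in this PR; the helper names (`reopen`, `podStillRunning`) and the struct fields are illustrative assumptions. When `Read` hits EOF while the pod is still running, the stream is reopened with `SinceTime` set to the last read timestamp:

```go
package logstream

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type restoreLogStreamReader struct {
	ctx          context.Context
	cli          kubernetes.Interface
	namespace    string
	podName      string
	container    string
	reader       io.ReadCloser
	lastReadTime metav1.Time
}

func (s *restoreLogStreamReader) Read(p []byte) (n int, err error) {
	n, err = s.reader.Read(p)
	if n > 0 {
		s.lastReadTime = metav1.Now()
	}
	// The kubelet closes long-lived log streams; if that happens while the
	// pod is still running, reopen the stream and keep reading.
	if err == io.EOF && s.podStillRunning() {
		if rerr := s.reopen(); rerr != nil {
			return n, rerr
		}
		return n, nil
	}
	return n, err
}

func (s *restoreLogStreamReader) reopen() error {
	_ = s.reader.Close()
	// SinceTime has second granularity, so the line read around lastReadTime
	// can be delivered again, which matches the duplicated last line discussed
	// in the review conversation.
	req := s.cli.CoreV1().Pods(s.namespace).GetLogs(s.podName, &corev1.PodLogOptions{
		Container: s.container,
		Follow:    true,
		SinceTime: &s.lastReadTime,
	})
	r, err := req.Stream(s.ctx)
	if err != nil {
		return err
	}
	s.reader = r
	return nil
}

func (s *restoreLogStreamReader) podStillRunning() bool {
	pod, err := s.cli.CoreV1().Pods(s.namespace).Get(s.ctx, s.podName, metav1.GetOptions{})
	return err == nil && pod.Status.Phase == corev1.PodRunning
}
```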
Pull request type
Please check the type of change your PR introduces:
Issues
Test Plan
I created a phase executing the KubeTask function with a long-lasting command (sleep). I verified that the logs were read until the end of its execution and that the output was gathered properly.