
668 joblogs #43

Closed
wants to merge 19 commits into from
Conversation

vlerkin
Collaborator

@vlerkin vlerkin commented Dec 18, 2024

Added logic to remove files:

  1. After a file is successfully uploaded to the S3-like storage, the corresponding file in the volume is removed (see the sketch below);
  2. When the code receives events for other existing jobs on the cluster, it checks whether their log files are already present in the S3-like storage. If not, it attempts to collect the logs, provided the stdout is still available. If the file does exist, the code now also checks whether it is still present in the volume and removes it, since it is already stored in the S3-like storage.

I think this functionality addresses your concerns.
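
To make the first point concrete, here is a minimal sketch of the upload-then-remove flow. The `object_storage.upload_file` helper and the function name are hypothetical, used only for illustration rather than the actual KubernetesJobLogHandler API:

```python
import os
import logging

logger = logging.getLogger(__name__)

def upload_and_cleanup(object_storage, log_path, object_key):
    # Upload a finished job's log file to the S3-like storage and,
    # only if the upload succeeds, remove the local copy from the volume.
    try:
        object_storage.upload_file(log_path, object_key)
    except Exception:
        logger.exception("Upload of %s failed; keeping it in the volume", log_path)
        return
    if os.path.isfile(log_path):
        os.remove(log_path)
        logger.debug("Removed %s from the volume after upload", log_path)
```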

vlerkin and others added 19 commits November 26, 2024 11:30
…onfigure logs in a separate file logging_config.py that is imported into __main__.py to trigger log configuration before other modules inherit it; add debug logs to the watcher to better monitor each step
…e modules; implement logging to be configured globally but keep separate logger for each module; add level for logging to be configurable in the config file
…ct with all available logging levels to validate user input from the config file
… to reset the connection periodically; add lock for events like subscribe, unsubscribe and event dispatching
…check for logfiles of the jobs that are still on the cluster in Completed or Failed state
…ve random component from log file names and keep job_name as one
…od_name to indicate which stdout (logs) to read from
…n the joblogs section of the config file, in the KubernetesJobLogHandler init throw a value error if logs_dir is not provided in the config
… on a check when old jobs are present on the cluster and the code checks that the files of those jobs already exist in the S3 storage, added a file removal if it is still in the volume
@vlerkin vlerkin requested a review from wvengen December 18, 2024 10:12
@wvengen
Member

wvengen commented Dec 19, 2024

Thanks! Can you please rebase this on the main branch?

Regarding your second point, I don't fully understand this. Does it mean that if a job is found running or finished, and a log also exists on container storage, it will do nothing? When finished, I think this makes sense, but when running, this is actually an error condition, I would think. Or can you think of cases where a file uploaded to container storage exists and a job is still found running?

(Also, I was thinking that this will generate a lot of checking with container storage to see if files exist. That is fine. A later improvement could be to add metadata to the job indicating that its logs have been stored (potentially with the full container storage path, so it can be easily located, e.g. by #12). But that's for later.)
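
As a rough illustration of that later improvement, one could annotate the Job with the storage path using the kubernetes Python client; the annotation key and function name here are hypothetical, not something this PR defines:

```python
from kubernetes import client, config

def mark_logs_stored(job_name: str, namespace: str, storage_path: str):
    # Record on the Job itself where its logs ended up, so later lookups
    # (e.g. by #12) could skip the existence check against the S3-like storage.
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    batch = client.BatchV1Api()
    body = {"metadata": {"annotations": {"scrapyd-k8s.org/logs-stored": storage_path}}}
    batch.patch_namespaced_job(name=job_name, namespace=namespace, body=body)
```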

@vlerkin
Collaborator Author

vlerkin commented Dec 19, 2024

The second point I outlined is about Completed or Failed jobs that still exist on the cluster. While jobs are actively running, their logs are being collected, and they are uploaded once the jobs finish. But if for some reason the managing pod dies while its replacement is still being created, the jobs will finish running without it, and all the logs they produce need to be collected by the new managing pod. That pod will append the logs from the moment of the disruption, upload the file, and delete it from the associated volume. Alternatively, if the logs are complete (the jobs are not deleted) but the upload was interrupted by the unexpected pod death, then the new pod, when it receives an event showing that there are still jobs on the cluster in Completed/Failed state, checks whether a corresponding log file exists in the S3-like storage; if not, it checks whether the file is in the volume, and if so, uploads it to the S3-like storage and deletes it from the volume.

Code lines: 257-301 (handle_events method) in the log_handler_k8s.py file.
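
For reference, a simplified sketch of that recovery flow; the method and attribute names are illustrative, not the actual handle_events code:

```python
import os

def handle_finished_job_event(self, job_name, pod_name):
    # Hypothetical helpers; the real logic lives in handle_events
    # (log_handler_k8s.py, roughly lines 257-301).
    object_key = f"{job_name}.log"
    local_path = os.path.join(self.logs_dir, object_key)

    if not self.object_storage.object_exists(object_key):
        # Logs were never (fully) uploaded: append whatever stdout is
        # still readable, then upload and remove the volume copy.
        self.append_pod_logs(pod_name, local_path)
        self.upload_and_cleanup(local_path, object_key)
    elif os.path.isfile(local_path):
        # Logs are already in the S3-like storage; only the leftover
        # copy in the volume needs to be removed.
        os.remove(local_path)
```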

@vlerkin vlerkin closed this Dec 19, 2024