668 joblogs #43
Conversation
…onfigure logs in a separate file logging_config.py that is imported into __main__.py to trigger log configuration before the other modules inherit it; add debug logs to the watcher to better monitor each step
…e modules; implement logging so it is configured globally but keep a separate logger for each module; make the logging level configurable in the config file
…ct with all available logging levels to validate user input from the config file
…api.py; keep logging_config file
… to reset the connection periodically; add a lock for events like subscribe, unsubscribe and event dispatching
…check for logfiles of the jobs that are still on the cluster in Completed or Failed state
…ve random component from log file names and keep job_name as the only one
… directory for logs
…od_name to indicate which stdout (logs) to read from
…read it from there if provided
…n the joblogs section of the config file; in the KubernetesJobLogHandler init, raise a ValueError if logs_dir is not provided in the config
… on a check when old jobs are present on the cluster: the code checks whether the log files of those jobs already exist in the S3 storage; added removal of a file if it is still in the volume
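As a rough illustration of the logging setup described in the commits above, a minimal sketch of what a `logging_config.py` module could look like is below; the `setup_logging` name, the `LOG_LEVELS` mapping and the default level are assumptions for illustration, not the actual implementation.

```python
# logging_config.py -- illustrative sketch only; function and constant names are assumptions
import logging

# All available logging levels, used to validate user input from the config file
LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}

DEFAULT_LOG_LEVEL = "INFO"


def setup_logging(level_name: str = DEFAULT_LOG_LEVEL) -> None:
    """Configure logging globally, once, before other modules create their loggers."""
    level = LOG_LEVELS.get(level_name.upper())
    if level is None:
        raise ValueError(
            f"Invalid log level {level_name!r}; expected one of {', '.join(LOG_LEVELS)}"
        )
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s [%(levelname)s]: %(message)s",
    )
```

Each module then keeps its own logger via `logging.getLogger(__name__)`, and `__main__.py` calls `setup_logging(...)` with the level read from the config file before the other modules start logging, so they all inherit the global configuration.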
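Similarly, the lock around watcher subscriptions and event dispatching mentioned in the commits above could be sketched roughly as follows; the class shape and method names are assumptions, not the actual watcher code.

```python
# Illustrative sketch: serialize subscribe/unsubscribe/dispatch with a lock.
# Class and method names are assumptions, not the actual watcher implementation.
import threading


class ResourceWatcher:
    def __init__(self):
        self._subscribers = []
        self._lock = threading.Lock()

    def subscribe(self, callback):
        with self._lock:
            if callback not in self._subscribers:
                self._subscribers.append(callback)

    def unsubscribe(self, callback):
        with self._lock:
            if callback in self._subscribers:
                self._subscribers.remove(callback)

    def dispatch(self, event):
        # Copy under the lock so callbacks run without holding it
        with self._lock:
            subscribers = list(self._subscribers)
        for callback in subscribers:
            callback(event)
```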
Thanks! Can you please rebase this on the …

Regarding your second point, I don't fully understand this. Does it mean that if a job is found running or finished, and a log also exists on container storage, it will do nothing? When finished, I think this makes sense, but when running, this is actually an error condition, I would think. Or can you think of cases where a file uploaded to container storage exists, and a job is still found running? (Also, I was thinking that this will generate a lot of checking with container storage to see if files exist. That is fine. A later improvement could be to add metadata to the job indicating that its logs have been stored (potentially with the full container storage path, so it can be easily located e.g. by #12). But that's for later.)
The second point I outlined is about Completed or Failed jobs that still exist on the cluster. If they are actively running, their logs are still being collected and, when finished, uploaded. But if for some reason the managing pod died and its replacement is still being created, the jobs will finish running without it, and all the logs they produce need to be collected by the new managing pod. It will append the logs from the moment it was disrupted, upload the file and delete it from the associated volume. Alternatively, if the logs are complete, the jobs have not been deleted, and the upload was interrupted by the unexpected pod death, then the new pod, when it receives an event that there are still jobs on the cluster in Completed/Failed state, checks whether a corresponding log file exists in the S3-like storage; if not, it checks whether it is in the volume, and if so, uploads it to the S3-like storage and deletes it from the volume. Code lines: 257-301 (handle_events method) in the log_handler_k8s.py file.
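A rough sketch of that recovery path is below; the class and helper names (`LogRecovery`, `object_exists`, `upload`, the path layout) are hypothetical stand-ins for illustration, not the actual KubernetesJobLogHandler API referenced above.

```python
# Illustrative sketch of the recovery path for Completed/Failed jobs whose logs
# were not fully uploaded before the managing pod died. All helper names here
# are hypothetical, not the actual handle_events implementation.
import os


class LogRecovery:
    def __init__(self, logs_dir, storage):
        self.logs_dir = logs_dir  # volume directory holding collected log files
        self.storage = storage    # S3-like storage client (assumed interface)

    def local_log_path(self, job_name):
        return os.path.join(self.logs_dir, f"{job_name}.log")

    def handle_finished_job(self, job_name, pod_name):
        """Called when an event reports a job in Completed or Failed state."""
        if self.storage.object_exists(job_name):
            return  # logs already uploaded, nothing to recover
        local_path = self.local_log_path(job_name)
        if os.path.exists(local_path):
            # The previous pod died before or during the upload: push the file
            # that is still sitting in the volume, then remove it.
            self.storage.upload(local_path, job_name)
            os.remove(local_path)
        else:
            # No log collected yet: append the pod's remaining stdout first,
            # then upload and clean up as in the normal flow.
            ...
```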
Added logic to remove files:
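(The actual snippet is not reproduced here; as a hedged illustration with the same hypothetical storage interface as above, the removal amounts to something like the following.)

```python
# Hedged sketch: if the log already exists in the S3-like storage but a copy is
# still sitting in the volume, delete the local copy. Helper names are assumptions.
import os


def remove_uploaded_log(storage, logs_dir, job_name):
    local_path = os.path.join(logs_dir, f"{job_name}.log")
    if storage.object_exists(job_name) and os.path.exists(local_path):
        os.remove(local_path)
```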
I think this functionality addresses your concerns.