
668 joblogs #43

Closed
wants to merge 19 commits into from
Conversation

vlerkin
Collaborator

@vlerkin vlerkin commented Dec 18, 2024

Added logic to remove files:

  1. After a file is successfully uploaded to the S3-like storage, the corresponding file in the volume is removed (see the sketch below);
  2. When the code receives events for other existing jobs on the cluster, it checks whether their log files are already present in the S3-like storage. If not, it attempts to collect the logs, provided the stdout is still available. If the file does exist, the code now also checks whether it is still present in the volume and removes it, since it is already stored in the S3-like storage.

I think this functionality addresses your concerns.
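
To make the first point concrete, here is a minimal sketch of the upload-then-remove flow. The `object_storage.upload_file` helper and the function name are hypothetical, used only for illustration rather than the actual KubernetesJobLogHandler API:

```python
import os
import logging

logger = logging.getLogger(__name__)

def upload_and_cleanup(object_storage, log_path, object_key):
    # Upload a finished job's log file to the S3-like storage and,
    # only if the upload succeeds, remove the local copy from the volume.
    try:
        object_storage.upload_file(log_path, object_key)
    except Exception:
        logger.exception("Upload of %s failed; keeping it in the volume", log_path)
        return
    if os.path.isfile(log_path):
        os.remove(log_path)
        logger.debug("Removed %s from the volume after upload", log_path)
```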

vlerkin and others added 19 commits November 26, 2024 11:30
…onfigure logs in a separate file logging_config.py that is imported into __main__.py to trigger log configuration before other modules inherit it; add debug logs to the watcher to better monitor each step
…e modules; implement logging to be configured globally but keep separate logger for each module; add level for logging to be configurable in the config file
…ct with all available logging levels to validate user input from the config file
… to reset the connection periodically; add lock for events like subscribe, unsubscribe and event dispatching
…check for logfiles of the jobs that are still on the cluster in Completed or Failed state
…ve random component from log file names and keep job_name as one
…od_name to indicate which stdout (logs) to read from
…n the joblogs section of the config file, in the KubernetesJobLogHandler init throw a value error if logs_dir is not provided in the config
… on a check when old jobs are present on the cluster and the code checks that the files of those jobs already exist in the S3 storage, added a file removal if it is still in the volume
@vlerkin vlerkin requested a review from wvengen December 18, 2024 10:12
@wvengen
Member

wvengen commented Dec 19, 2024

Thanks! Can you please rebase this on the main branch?

Regarding your second point, I don't fully understand this. Does it mean that if a job is found running or finished, and a log also exists on container storage, it will do nothing? When finished, I think this makes sense, but when running, this is actually an error condition, I would think. Or can you think of cases where a file uploaded to container storage exists and a job is still found running?

(Also, I was thinking that this will generate a lot of checking with container storage to see if files exist. That is fine. A later improvement could be to add metadata to the job indicating that its logs have been stored (potentially with the full container storage path, so it can be easily located, e.g. by #12). But that's for later.)
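
As a rough illustration of that later improvement, one could annotate the Job with the storage path using the kubernetes Python client; the annotation key and function name here are hypothetical, not something this PR defines:

```python
from kubernetes import client, config

def mark_logs_stored(job_name: str, namespace: str, storage_path: str):
    # Record on the Job itself where its logs ended up, so later lookups
    # (e.g. by #12) could skip the existence check against the S3-like storage.
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    batch = client.BatchV1Api()
    body = {"metadata": {"annotations": {"scrapyd-k8s.org/logs-stored": storage_path}}}
    batch.patch_namespaced_job(name=job_name, namespace=namespace, body=body)
```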

@vlerkin
Collaborator Author

vlerkin commented Dec 19, 2024

The second point I outlined is about Completed or Failed jobs that still exist on the cluster. While jobs are actively running, their logs are being collected, and they are uploaded once the jobs finish. But if for some reason the managing pod dies while its replacement is still being created, the jobs will finish running without it, and all the logs they produce need to be collected by the new managing pod. That pod will append the logs from the moment of the disruption, upload the file, and delete it from the associated volume. Alternatively, if the logs are complete (the jobs are not deleted) but the upload was interrupted by the unexpected pod death, then the new pod, when it receives an event showing that there are still jobs on the cluster in Completed/Failed state, checks whether a corresponding log file exists in the S3-like storage; if not, it checks whether the file is in the volume, and if so, uploads it to the S3-like storage and deletes it from the volume.

Code lines: 257-301 (handle_events method) in the log_handler_k8s.py file.
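
For reference, a simplified sketch of that recovery flow; the method and attribute names are illustrative, not the actual handle_events code:

```python
import os

def handle_finished_job_event(self, job_name, pod_name):
    # Hypothetical helpers; the real logic lives in handle_events
    # (log_handler_k8s.py, roughly lines 257-301).
    object_key = f"{job_name}.log"
    local_path = os.path.join(self.logs_dir, object_key)

    if not self.object_storage.object_exists(object_key):
        # Logs were never (fully) uploaded: append whatever stdout is
        # still readable, then upload and remove the volume copy.
        self.append_pod_logs(pod_name, local_path)
        self.upload_and_cleanup(local_path, object_key)
    elif os.path.isfile(local_path):
        # Logs are already in the S3-like storage; only the leftover
        # copy in the volume needs to be removed.
        os.remove(local_path)
```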

@vlerkin vlerkin closed this Dec 19, 2024