Rebooting GitLab may trigger ClamAV alarm #6114

Open

dsotirho-ucsc opened this issue Apr 1, 2024 · 17 comments
Assignees
Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team
spike:2 [process] Spike estimate of two points

Comments

@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 1, 2024

The azul-clamscan-<deployment> alarm is triggered if a clamscan succeeded log message is not produced within an 18 hour period. Since the ClamAV scan runs twice daily and takes many hours to complete, a reboot of the GitLab instance (due to an update, backup, or testing) can cancel an ongoing scan, so that no scan completes successfully within 18 hours of the last completed one.

The recommended solution is to increase the alarm's period to 24 hours.

Note: 24 hours is the maximum allowed time period for an alarm with one evaluation period. (from: Common features of CloudWatch alarms)

The number of evaluation periods for an alarm multiplied by the length of each evaluation period can't exceed one day.
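
A minimal sketch of how such an alarm could be expressed with the AWS CLI, not the project's actual configuration: <deployment>, <account>, the metric namespace, metric name, and statistic are placeholder assumptions; only the 24-hour period (86400 s), the threshold of 1, and the FILL() metric expression are taken from this issue.

# Hypothetical sketch: <deployment>, <account>, namespace, metric name and
# statistic are placeholders; period, threshold and FILL() come from this issue.
aws cloudwatch put-metric-alarm \
    --alarm-name "azul-clamscan-<deployment>.alarm" \
    --comparison-operator LessThanThreshold \
    --threshold 1 \
    --evaluation-periods 1 \
    --alarm-actions "arn:aws:sns:us-east-1:<account>:azul-monitoring-<deployment>" \
    --ok-actions "arn:aws:sns:us-east-1:<account>:azul-monitoring-<deployment>" \
    --metrics '[
        {
            "Id": "log_count_raw",
            "ReturnData": false,
            "MetricStat": {
                "Metric": {
                    "Namespace": "LogMetrics",
                    "MetricName": "azul-clamscan-<deployment>"
                },
                "Period": 86400,
                "Stat": "Sum"
            }
        },
        {
            "Id": "filled",
            "Expression": "FILL(log_count_raw, 0)",
            "ReturnData": true
        }
    ]'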

@dsotirho-ucsc dsotirho-ucsc added the orange label Apr 1, 2024
@dsotirho-ucsc
Contributor Author

Assignee to provide symptoms and solution in description.

@dsotirho-ucsc dsotirho-ucsc added the bug, debt, infra, noise and - [priority] Medium labels Apr 2, 2024
@dsotirho-ucsc dsotirho-ucsc self-assigned this Apr 2, 2024
@hannes-ucsc
Member

hannes-ucsc commented Apr 15, 2024

For the demo, reboot a GL instance on the day before the demo while a scan is ongoing (prepare proof). Show that the alarm did not go off.

I don't think we need an elaborate demo. We discovered that the attempted fixes from the first two PRs (#6155 and #6315) weren't effective before we even got to the demo. IOW, we will likely make the same discovery about PR #6374 organically, during normal operations.

@dsotirho-ucsc
Contributor Author

@hannes-ucsc: "Rebooting the instance still results in a false alarm, for example:"

https://groups.google.com/a/ucsc.edu/g/azul-group/c/zkmzRVv_rec/m/WbNWhMsKBQAJ

You are receiving this email because your Amazon CloudWatch Alarm "azul-clamscan-dev.alarm" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Saturday 27 April, 2024 09:48:13 UTC".

View this alarm in the AWS Management Console:
https://us-east-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=us-east-1#alarmsV2:alarm/azul-clamscan-dev.alarm

Alarm Details:

  • Name: azul-clamscan-dev.alarm
  • Description:
  • State Change: OK -> ALARM
  • Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).
  • Timestamp: Saturday 27 April, 2024 09:48:13 UTC
  • AWS Account: 122796619775
  • Alarm Arn: arn:aws:cloudwatch:us-east-1:122796619775:alarm:azul-clamscan-dev.alarm

Threshold:

  • The alarm is in the ALARM state when the metric is LessThanThreshold 1.0 for at least 1 of the last 1 period(s) of 86400 seconds.

Monitored Metrics:

  • MetricExpression: FILL(log_count_raw, 0)
  • MetricLabel: No Label

State Change Actions:

  • OK: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • ALARM: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • INSUFFICIENT_DATA:

@dsotirho-ucsc
Contributor Author

Assignee to consider increasing the frequency of the cronjob to */18 hours

@dsotirho-ucsc
Contributor Author

There is a contradiction in the above comment: */18 would not be an increase. Assignee to formalize plan.

@dsotirho-ucsc
Contributor Author

The alarm fires when a successful clamscan message wasn't logged within the last 24 hours. On average, a successful scan takes anywhere from 10 to 14 hours (quicker on anvildev and prod, slower on anvilprod and dev).

Currently clamscan is set up to run twice a day. This causes the alarm to fire if the scan following a reboot takes longer than the scan that completed just prior to the reboot.

For this example, assume an 11-hour scan starting at 00:00 and 12:00 (the current twice-daily schedule):

day 1   start   00:00
day 1   end     11:00
day 1   start   12:00
day 1   reboot  13:00
day 2   start   00:00
day 2   end     11:05 (alarm fired at 11:01)

Since systemd timers won't start a service that is still running from its last activation by a timer, I propose setting the clamscan timer to run 6 times a day, i.e. every 4 hours (*-*-* */4:00:00); the expression can be verified with systemd-analyze, as shown after the examples below.

Example: an 11-hour scan, with the timer firing every 4 hours (00, 04, 08, 12, 16, 20); an asterisk marks a logged successful completion:

day 1   start   00:00
day 1   end     11:00 *
day 1   start   12:00
day 1   reboot  13:00
day 1   start   16:00
day 2   end     03:00 *

11:00 (day 1) to 03:00 (day 2) = 16 hours

Example: a 14-hour scan, with the timer firing every 4 hours (00, 04, 08, 12, 16, 20):

day 1   start   00:00
day 1   end     14:00 *
day 1   start   16:00
day 1   reboot  17:00
day 1   start   20:00
day 2   end     10:00 *

14:00 (day 1) to 10:00 (day 2) = 20 hours
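
For reference, a proposed OnCalendar expression can be sanity-checked with systemd-analyze, which prints its normalized form and the next time the timer would elapse:

# prints the normalized form of the calendar expression and its next elapse time
systemd-analyze calendar "*-*-* */4:00:00"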

@dsotirho-ucsc
Contributor Author

dsotirho-ucsc commented May 31, 2024

@hannes-ucsc: "Let's just start the unit every hour. If the scan takes less than an hour, it's actually desirable to start it on the next full hour. Extra care to be taken to ensure that the scans aren't running in parallel or overlap."
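
A minimal sketch of what an hourly schedule with overlap protection could look like; the unit names, file paths, and the flock(1) guard are illustrative assumptions, not the actual configuration deployed on the GitLab instances:

# /etc/systemd/system/clamscan.timer (hypothetical path and unit name)
[Unit]
Description=Start the ClamAV scan every hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/clamscan.service (hypothetical path and unit name)
[Unit]
Description=ClamAV scan of the GitLab instance

[Service]
Type=oneshot
# A timer never activates a unit that is still running, and the flock(1)
# wrapper additionally prevents overlap with a manually started scan.
ExecStart=/usr/bin/flock -n /run/clamscan.lock /usr/bin/clamscan -ri /

Because the service is Type=oneshot, it stays active for the entire duration of the scan, so the hourly timer cannot start a second instance while a scan is still running; the flock wrapper only guards against a scan started by hand.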

@hannes-ucsc
Member

Note my edits to the demo instructions.

@hannes-ucsc hannes-ucsc added and removed the no demo label Jul 31, 2024
@dsotirho-ucsc dsotirho-ucsc reopened this Oct 21, 2024
@dsotirho-ucsc dsotirho-ucsc removed the no demo label Oct 21, 2024
@dsotirho-ucsc
Contributor Author

The resolution was incomplete: recently, a scan took 26 hours to complete following a reboot of GitLab dev.

CloudWatch logs:

[Screenshot of CloudWatch logs, taken 2024-10-21 at 12:16 PM]

@dsotirho-ucsc
Contributor Author

dsotirho-ucsc commented Oct 22, 2024

Spike to compare hierarchical folder sizes on the two lower GitLab instances using a treemap-based GUI.

@dsotirho-ucsc dsotirho-ucsc added the spike:3 and spike:2 labels and removed the spike:1 and spike:3 labels Oct 22, 2024
@achave11-ucsc
Member

achave11-ucsc commented Oct 24, 2024

Unfortunately, it isn't possible to generate a hierarchical folder-size comparison of a given instance without proprietary software such as TreeMap, which can connect to the instance, perform the scan, and generate the desired map. Most other options with a TreeMap-style GUI are similar in that they 1) only run as desktop apps, 2) are unable to generate a report from/to a specified file, and/or 3) are proprietary.

The viable open-source choice was ncdu, which came up multiple times in the search results for treemap-based GUIs. It provides a text-based interface that can be used to manually traverse the produced report.

dev_ncdu.json
anvildev_ncdu.json

To display the generated reports, install ncdu (via brew on macOS), then run:

ncdu -f dev_ncdu.json  # Or anvildev_ncdu.json

@achave11-ucsc
Member

Originally, this was run without sudo privileges, causing the inaccuracy observed during PL, where the mnt directory didn't account for the majority of the space usage.

The file download links in the previous comment have been updated to point to the ncdu reports that were generated with sudo privileges.
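
For reference, a report like the ones attached above could be generated on an instance roughly as follows; the output filename is just an example:

# Hypothetical invocation: scans the root filesystem with elevated privileges
# and exports the results to a file that can be browsed later with `ncdu -f`.
sudo ncdu -o dev_ncdu.json /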
