Rebooting GitLab may trigger ClamAV alarm #6114

Open

dsotirho-ucsc opened this issue Apr 1, 2024 · 17 comments
Assignees
Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team
spike:2 [process] Spike estimate of two points

Comments

@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 1, 2024

The azul-clamscan-<deployment> alarm is triggered if a clamscan succeeded log message is not produced within an 18 hour period. Since the ClamAV scan runs twice daily and takes many hours to complete, a reboot of the GitLab instance (due to an update, backup, or testing) can cancel an ongoing scan, so that no scan completes successfully within 18 hours of the last completed one.

The recommended solution is to increase the alarm's period to 24 hours.

Note: 24 hours is the maximum allowed time period for an alarm with one evaluation period. (from: Common features of CloudWatch alarms)

The number of evaluation periods for an alarm multiplied by the length of each evaluation period can't exceed one day.
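
A minimal sketch of how such an alarm could be expressed with the AWS CLI, not the project's actual configuration: <deployment>, <account>, the metric namespace, metric name, and statistic are placeholder assumptions; only the 24-hour period (86400 s), the threshold of 1, and the FILL() metric expression are taken from this issue.

# Hypothetical sketch: <deployment>, <account>, namespace, metric name and
# statistic are placeholders; period, threshold and FILL() come from this issue.
aws cloudwatch put-metric-alarm \
    --alarm-name "azul-clamscan-<deployment>.alarm" \
    --comparison-operator LessThanThreshold \
    --threshold 1 \
    --evaluation-periods 1 \
    --alarm-actions "arn:aws:sns:us-east-1:<account>:azul-monitoring-<deployment>" \
    --ok-actions "arn:aws:sns:us-east-1:<account>:azul-monitoring-<deployment>" \
    --metrics '[
        {
            "Id": "log_count_raw",
            "ReturnData": false,
            "MetricStat": {
                "Metric": {
                    "Namespace": "LogMetrics",
                    "MetricName": "azul-clamscan-<deployment>"
                },
                "Period": 86400,
                "Stat": "Sum"
            }
        },
        {
            "Id": "filled",
            "Expression": "FILL(log_count_raw, 0)",
            "ReturnData": true
        }
    ]'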

@dsotirho-ucsc dsotirho-ucsc added the orange label Apr 1, 2024
@dsotirho-ucsc
Contributor Author

Assignee to provide symptoms and solution in description.

@dsotirho-ucsc dsotirho-ucsc added the bug, debt, infra, noise and - [priority] Medium labels Apr 2, 2024
@dsotirho-ucsc dsotirho-ucsc self-assigned this Apr 2, 2024
@hannes-ucsc
Member

hannes-ucsc commented Apr 15, 2024

For the demo, reboot a GL instance on the day before the demo while a scan is ongoing (prepare proof). Show that the alarm did not go off.

I don't think we need an elaborate demo. We discovered that the attempted fixes from the first two PRs (#6155 and #6315) weren't effective before we even got to the demo. IOW, we will likely make the same discovery about PR #6374 organically, during normal operations.

@dsotirho-ucsc
Contributor Author

@hannes-ucsc: "Rebooting the instance still results in a false alarm, for example:"

https://groups.google.com/a/ucsc.edu/g/azul-group/c/zkmzRVv_rec/m/WbNWhMsKBQAJ

You are receiving this email because your Amazon CloudWatch Alarm "azul-clamscan-dev.alarm" in the US East (N. Virginia) region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Saturday 27 April, 2024 09:48:13 UTC".

View this alarm in the AWS Management Console:
https://us-east-1.console.aws.amazon.com/cloudwatch/deeplink.js?region=us-east-1#alarmsV2:alarm/azul-clamscan-dev.alarm

Alarm Details:

  • Name: azul-clamscan-dev.alarm
  • Description:
  • State Change: OK -> ALARM
  • Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (26/04/24 09:48:00)] was less than the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).
  • Timestamp: Saturday 27 April, 2024 09:48:13 UTC
  • AWS Account: 122796619775
  • Alarm Arn: arn:aws:cloudwatch:us-east-1:122796619775:alarm:azul-clamscan-dev.alarm

Threshold:

  • The alarm is in the ALARM state when the metric is LessThanThreshold 1.0 for at least 1 of the last 1 period(s) of 86400 seconds.

Monitored Metrics:

  • MetricExpression: FILL(log_count_raw, 0)
  • MetricLabel: No Label

State Change Actions:

  • OK: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • ALARM: [arn:aws:sns:us-east-1:122796619775:azul-monitoring-dev]
  • INSUFFICIENT_DATA:

@dsotirho-ucsc
Contributor Author

Assignee to consider increasing the frequency of the cronjob to */18 hours

@dsotirho-ucsc
Contributor Author

There is a contradiction in the above comment: */18 would not be an increase. Assignee to formalize plan.

@dsotirho-ucsc
Contributor Author

The alarm fires when a successful clamscan message wasn't logged within the last 24 hours. On average, a successful scan takes anywhere from 10 to 14 hours (quicker on anvildev and prod, slower on anvilprod and dev).

Currently clamscan is set up to run twice a day. This causes the alarm to fire if the scan following a reboot takes longer than the scan that completed just prior to the reboot.

For this example, assume an 11-hour scan starting at 00:00 and 12:00 (the current twice-daily schedule):

day 1   start   00:00
day 1   end     11:00
day 1   start   12:00
day 1   reboot  13:00
day 2   start   00:00
day 2   end     11:05 (alarm fired at 11:01)

Since systemd timers won't start a service that is still running from its last activation by a timer, I propose setting the clamscan timer to run 6 times a day, i.e. every 4 hours (*-*-* */4:00:00); the expression can be verified with systemd-analyze, as shown after the examples below.

Example: an 11-hour scan, with the timer firing every 4 hours (00, 04, 08, 12, 16, 20); an asterisk marks a logged successful completion:

day 1   start   00:00
day 1   end     11:00 *
day 1   start   12:00
day 1   reboot  13:00
day 1   start   16:00
day 2   end     03:00 *

11:00 (day 1) to 03:00 (day 2) = 16 hours

Example: a 14-hour scan, with the timer firing every 4 hours (00, 04, 08, 12, 16, 20):

day 1   start   00:00
day 1   end     14:00 *
day 1   start   16:00
day 1   reboot  17:00
day 1   start   20:00
day 2   end     10:00 *

14:00 (day 1) to 10:00 (day 2) = 20 hours
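
For reference, a proposed OnCalendar expression can be sanity-checked with systemd-analyze, which prints its normalized form and the next time the timer would elapse:

# prints the normalized form of the calendar expression and its next elapse time
systemd-analyze calendar "*-*-* */4:00:00"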

@dsotirho-ucsc
Contributor Author

dsotirho-ucsc commented May 31, 2024

@hannes-ucsc: "Let's just start the unit every hour. If the scan takes less than an hour, it's actually desirable to start it on the next full hour. Extra care to be taken to ensure that the scans aren't running in parallel or overlap."
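
A minimal sketch of what an hourly schedule with overlap protection could look like; the unit names, file paths, and the flock(1) guard are illustrative assumptions, not the actual configuration deployed on the GitLab instances:

# /etc/systemd/system/clamscan.timer (hypothetical path and unit name)
[Unit]
Description=Start the ClamAV scan every hour

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/clamscan.service (hypothetical path and unit name)
[Unit]
Description=ClamAV scan of the GitLab instance

[Service]
Type=oneshot
# A timer never activates a unit that is still running, and the flock(1)
# wrapper additionally prevents overlap with a manually started scan.
ExecStart=/usr/bin/flock -n /run/clamscan.lock /usr/bin/clamscan -ri /

Because the service is Type=oneshot, it stays active for the entire duration of the scan, so the hourly timer cannot start a second instance while a scan is still running; the flock wrapper only guards against a scan started by hand.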

@hannes-ucsc
Member

Note my edits to the demo instructions.

@hannes-ucsc hannes-ucsc added and removed the no demo label Jul 31, 2024
@dsotirho-ucsc dsotirho-ucsc reopened this Oct 21, 2024
@dsotirho-ucsc dsotirho-ucsc removed the no demo label Oct 21, 2024
@dsotirho-ucsc
Contributor Author

The resolution was incomplete: recently, a scan took 26 hours to complete following a reboot of GitLab dev.

CloudWatch logs:

[Screenshot of CloudWatch logs, taken 2024-10-21 at 12:16 PM]

@dsotirho-ucsc
Contributor Author

dsotirho-ucsc commented Oct 22, 2024

Spike to compare hierarchical folder sizes on the two lower GitLab instances using a treemap-based GUI.

@dsotirho-ucsc dsotirho-ucsc added the spike:3 and spike:2 labels and removed the spike:1 and spike:3 labels Oct 22, 2024
@achave11-ucsc
Member

achave11-ucsc commented Oct 24, 2024

Unfortunately, it isn't possible to generate a hierarchical folder-size comparison of a given instance without proprietary software such as TreeMap, which can connect to the instance, perform the scan, and generate the desired map. Most other options with a TreeMap-style GUI are similar in that they 1) only run as desktop apps, 2) are unable to generate a report from/to a specified file, and/or 3) are proprietary.

The viable open-source choice was ncdu, which came up multiple times in the search results for treemap-based GUIs. It provides a text-based interface that can be used to manually traverse the produced report.

dev_ncdu.json
anvildev_ncdu.json

To display the generated reports, install ncdu (via brew on macOS), then run:

ncdu -f dev_ncdu.json  # Or anvildev_ncdu.json

@achave11-ucsc
Member

Originally, this was run without sudo privileges, causing the inaccuracy observed during PL, where the mnt directory didn't account for the majority of the space usage.

The file download links in the previous comment have been updated to point to the ncdu reports that were generated with sudo privileges.
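
For reference, a report like the ones attached above could be generated on an instance roughly as follows; the output filename is just an example:

# Hypothetical invocation: scans the root filesystem with elevated privileges
# and exports the results to a file that can be browsed later with `ncdu -f`.
sudo ncdu -o dev_ncdu.json /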
