#956 Flow control fixes #957

Open
wants to merge 5 commits into master
Conversation

bucko909
Contributor

See #956.

This is not a complete fix for the aggregator, but it does fix the cache, relay, and rewriter daemons.

I've tested the cache daemon manually and verified that it only outputs "MetricCache is nearly full" and "MetricCache below watermark" with a sufficiently high cache size. With a small cache size, it will still drop datapoints ("MetricCache is full") with these settings.

The "fix MAX_QUEUE_SIZE to match config file default" commit may be controversial. I'm happy to do whatever with this to get this merged; it just felt wrong when I found it.

@bucko909
Contributor Author

Push just resets to preferred email.

@deniszh
Member

deniszh commented Aug 24, 2024

@bucko909 : could you please check the failed tests, pretty please?
Thanks!

@deniszh
Member

deniszh commented Aug 24, 2024

And a rebase or conflict fix is needed too

It's not really clear what to do here. If someone has altered their
config file to remove this line, this patch will alter the behaviour of
their instance. However, the example config default has been 10000 since
the first commit, and that's probably the more sensible default value. It
might nonetheless be safer to just document the discrepancy.
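
To make this concrete: the in-code default only takes effect when the setting is absent from the config file, so bumping it only changes behaviour for installs that deleted the line. A rough sketch of that lookup, with hypothetical names (this is not carbon's actual conf code):

# Hypothetical illustration; names are made up for clarity.
IN_CODE_DEFAULTS = {'MAX_QUEUE_SIZE': 10000}  # the value this commit aligns with the example config

def effective_setting(parsed_config, name):
    # A value present in the user's carbon.conf always wins; the in-code
    # default is only consulted when the line has been removed, which is
    # exactly the case whose behaviour this commit changes.
    if name in parsed_config:
        return parsed_config[name]
    return IN_CODE_DEFAULTS[name]
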
@bucko909
Contributor Author

Rebased to fix conflicts. The only meaningful change is the conflict resolution, which was in the middle of these changes and appears as context in the diff:

$ (FROM=4e105a0; NOW=d7926bd; diff -u5  <(git diff -U10 $(git merge-base $FROM origin/master)..$FROM) <(git diff -U10 $(git merge-base $NOW origin/master)..$NOW))
--- /dev/fd/63  2024-08-29 15:20:54.847098593 +0100
+++ /dev/fd/62  2024-08-29 15:20:54.851098553 +0100
@@ -131,14 +131,14 @@
  # If enabled this setting is used to timeout metric client connection if no
  # metrics have been sent in specified time in seconds
  #METRIC_CLIENT_IDLE_TIMEOUT = None
  
 diff --git a/lib/carbon/cache.py b/lib/carbon/cache.py
-index cf8b7d1..a7a0c7e 100644
+index 72ccf6c..0304143 100644
 --- a/lib/carbon/cache.py
 +++ b/lib/carbon/cache.py
-@@ -200,20 +200,27 @@ class _MetricCache(defaultdict):
+@@ -201,20 +201,27 @@ class _MetricCache(defaultdict):
              in self.items()]
  
    @property
    def watermarks(self):
      return [(metric, min(datapoints.keys()), max(datapoints.keys()))
@@ -162,11 +162,11 @@
    def _check_available_space(self):
      if state.cacheTooFull and self.size < settings.CACHE_SIZE_LOW_WATERMARK:
        log.msg("MetricCache below watermark: self.size=%d" % self.size)
        events.cacheSpaceAvailable()
  
-@@ -244,22 +251,26 @@ class _MetricCache(defaultdict):
+@@ -245,22 +252,26 @@ class _MetricCache(defaultdict):
  
      return sorted(datapoint_index.items(), key=by_timestamp)
  
    def store(self, metric, datapoint):
      timestamp, value = datapoint
@@ -180,20 +180,20 @@
          else:
 +          if self.is_nearly_full:
 +            # This will disable reading when flow control is enabled
 +            log.msg("MetricCache is nearly full: self.size=%d" % self.size)
 +            events.cacheFull()
+           if not self[metric]:
+             self.new_metrics.append(metric)
            self.size += 1
            self[metric][timestamp] = value
            if self.strategy:
              self.strategy.store(metric)
        else:
          # Updating a duplicate does not increase the cache size
          self[metric][timestamp] = value
  
- 
- _Cache = None
 diff --git a/lib/carbon/client.py b/lib/carbon/client.py
 index d9ba5a9..6bed7b7 100644
 --- a/lib/carbon/client.py
 +++ b/lib/carbon/client.py
 @@ -28,20 +28,24 @@ try:

I've used a smaller overage for this setting, as the cache is a much
slower process than the relay, so hopefully its size is much higher.
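
As a rough illustration of the overage idea (a sketch only; the exact factor and setting names in the patch may differ): the limit tolerates a bounded overshoot so the "stop sending" signal has time to take effect before anything is dropped.

# Illustrative only. Flow control kicks in at the nominal limit; drops
# only start beyond the overage.
def should_pause(current_size, hard_max):
    return current_size >= hard_max

def should_drop(current_size, hard_max, overage_fraction):
    # e.g. 0.25 for the relay instance queues ("25% over their hard max");
    # the cache uses a smaller fraction since it drains much more slowly.
    return current_size >= hard_max * (1 + overage_fraction)
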
@bucko909
Contributor Author

I think this should fix the tests -- apologies; I just ran pytest but it looks like it didn't find the other two classes in that file.

The fixes are squashed into "[#956] allow instances to go 25% over their hard max".

tox with no parameters also runs a benchmark. The benchmark does not appear to terminate in any reasonable time when performing "Benchmarking unique metric MetricCache drain..." on 1M datapoints. I tested it on master and the same is true, so I've not broken this.

@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 37.93103% with 18 lines in your changes missing coverage. Please review.

Project coverage is 50.46%. Comparing base (fdc56f6) to head (dbf2925).
Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
lib/carbon/service.py 0.00% 9 Missing ⚠️
lib/carbon/client.py 16.66% 4 Missing and 1 partial ⚠️
lib/carbon/conf.py 0.00% 3 Missing ⚠️
lib/carbon/events.py 50.00% 0 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #957      +/-   ##
==========================================
- Coverage   50.63%   50.46%   -0.18%     
==========================================
  Files          36       36              
  Lines        3446     3474      +28     
  Branches      535      534       -1     
==========================================
+ Hits         1745     1753       +8     
- Misses       1574     1592      +18     
- Partials      127      129       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.
