
Add distribution metrics #39

Open
bkjg wants to merge 91 commits into project-bkjg from distribution-metrics
Conversation


@bkjg bkjg commented Jul 27, 2020

This pull request will add distribution metrics (the distribution_t data structure) to collectd.

The functionality is described in the following document:

https://docs.google.com/document/d/1ccsg5ffUfqt9-mBDGTymRn8X-9Wk1CuGYeMlRxmxiok/edit#heading=h.5irk4csrpu0y

bkjg added 24 commits July 27, 2020 11:21
This commit adds the distribution header file and functions for
distribution management.
This commit adds the skeleton of the distribution functions.
This commit will add the implementation of the distribution_new_linear
function, which creates a new distribution metric whose bucket sizes
follow a linear relationship. This function depends on
bucket_new_linear, which will be added in a future commit.
This commit will add the implementation of the bucket_new_linear
function, which creates an array of type bucket_t and initializes the
maximum boundaries using a linear relationship.

This commit also fixes several small bugs and typos.
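
For illustration only, a minimal sketch of what such a constructor could look like, assuming the bucket_t layout quoted later in this review (max_boundary, counter) and an open-ended last bucket; this is not the exact code from the patch:

#include <errno.h>
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  double max_boundary;
  uint64_t counter;
} bucket_t;

/* Sketch: boundaries grow linearly (size, 2*size, ...); the last bucket is
 * open-ended, so its maximum boundary is INFINITY. */
static bucket_t *bucket_new_linear(size_t num_buckets, double size) {
  if (num_buckets == 0 || size <= 0) {
    errno = EINVAL;
    return NULL;
  }
  bucket_t *buckets = calloc(num_buckets, sizeof(*buckets));
  if (buckets == NULL)
    return NULL;
  for (size_t i = 0; i + 1 < num_buckets; ++i)
    buckets[i].max_boundary = (double)(i + 1) * size;
  buckets[num_buckets - 1].max_boundary = INFINITY;
  return buckets;
}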
This commit removes one argument from the distribution_new_exponential
function. To calculate the bucket sizes, the factor and the number of
buckets are enough; we don't need an initial size because we always
start calculating the sizes from zero.
This commit adds the implementation of the distribution_new_exponential
function. This function depends on bucket_new_exponential, which will
be added in future commits. The implementation takes the changed
arguments of this particular function into account.
This commit will add the implementation of the bucket_new_exponential
function. This function creates an array of type bucket_t and
initializes the maximum boundaries using an exponential relationship.
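
Analogously, a hedged sketch of the exponential variant (reusing the includes and the bucket_t typedef from the sketch above; the exact boundary formula and argument checks in the patch may differ):

/* Sketch: boundaries grow geometrically (factor*base, factor*base^2, ...). */
static bucket_t *bucket_new_exponential(size_t num_buckets, double base,
                                        double factor) {
  if (num_buckets == 0 || base <= 1.0 || factor <= 0) {
    errno = EINVAL;
    return NULL;
  }
  bucket_t *buckets = calloc(num_buckets, sizeof(*buckets));
  if (buckets == NULL)
    return NULL;
  double multiplier = base; /* running product; see the pow(3) discussion below */
  for (size_t i = 0; i + 1 < num_buckets; ++i) {
    buckets[i].max_boundary = factor * multiplier;
    multiplier *= base;
  }
  buckets[num_buckets - 1].max_boundary = INFINITY;
  return buckets;
}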
This commit will add the implementation of the distribution_new_custom
function, which creates a distribution metrics structure using custom
bucket sizes given by the user. The implementation depends on the
bucket_new_custom function, which will be implemented and added in
future commits.
This commit will add the implementation of the distribution_destroy
function. This function does the cleanup and frees all the memory:
in particular the pointer to the buckets inside the distribution_t
structure and the pointer to the distribution_t structure itself.
This commit will add the implementation of the bucket_new_custom
function. This function creates the bucket_t structure and initializes
it using the array of consecutive bucket sizes given by the user
(except for the size of the last bucket). The last bucket has a maximum
boundary of infinity, so its size is also infinite.
This commit will fix bugs, i.e. compilation errors.
This commit will add the implementation of the distribution_update
function. This function uses the bucket_update function, which
increments the counter of the appropriate bucket for the given gauge.
This commit will add the implementation of the distribution_percentile
function. This function looks for the maximum boundary of the bucket
in which the percentage given by the user falls. It uses the
find_percentile function, which finds the percentile using a binary
search algorithm.
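
A rough sketch of the binary-search idea described above, assuming (as the update code quoted later in this review suggests) that every bucket's counter is cumulative, i.e. counts all gauges below that bucket's maximum boundary; the actual find_percentile in the patch may differ in its details:

/* Sketch: returns the max_boundary of the first bucket whose cumulative
 * counter reaches the requested quantity. Reuses bucket_t from the sketch
 * above and assumes num_buckets > 0 and quantity <= total count. */
static double find_percentile(const bucket_t *buckets, size_t num_buckets,
                              uint64_t quantity) {
  size_t lo = 0, hi = num_buckets - 1;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (buckets[mid].counter < quantity)
      lo = mid + 1; /* too few gauges below this boundary, search higher */
    else
      hi = mid;     /* this bucket (or an earlier one) already suffices */
  }
  return buckets[lo].max_boundary;
}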
This commit will add the implementation of the distribution_average
function. It takes the sum of all gauges and divides it by the number
of all recorded updates.
This commit will make the distribution_update function thread safe by
surrounding the main logic with a mutex.
This commit will add the implementation of the distribution_clone
function. This function makes a copy of the distribution_t structure
and, thanks to the mutex, is thread safe.
This commit will add a patch that changes the return type of
distribution_clone from distribution_t to distribution_t*.
This commit will add the functionality of checking for null pointers.
When e.g. calloc doesn't succeed, it returns a null pointer.
This commit will add copying of the distribution at the beginning of
the functions. Thanks to that, the code is thread safe, because cloning
a distribution involves locking the mutex.
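
A sketch of that cloning approach, assuming a distribution_t with the fields that appear elsewhere in this review (buckets, num_buckets, sum_gauges, mutex) and the bucket_t typedef from the first sketch; the struct layout is an assumption, not the exact patch:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  bucket_t *buckets;
  size_t num_buckets;
  double sum_gauges;
  pthread_mutex_t mutex;
} distribution_t;

distribution_t *distribution_clone(distribution_t *d) {
  if (d == NULL)
    return NULL;
  distribution_t *copy = calloc(1, sizeof(*copy));
  if (copy == NULL)
    return NULL;
  pthread_mutex_lock(&d->mutex); /* block concurrent updates while copying */
  copy->num_buckets = d->num_buckets;
  copy->sum_gauges = d->sum_gauges;
  copy->buckets = calloc(d->num_buckets, sizeof(*copy->buckets));
  if (copy->buckets != NULL)
    memcpy(copy->buckets, d->buckets, d->num_buckets * sizeof(*copy->buckets));
  pthread_mutex_unlock(&d->mutex);
  if (copy->buckets == NULL) {
    free(copy);
    return NULL;
  }
  pthread_mutex_init(&copy->mutex, NULL); /* the copy gets its own, fresh mutex */
  return copy;
}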
Owner
@octo octo left a comment

This looks great, Barbara!

return -1;
}

distribution_t *distribution = distribution_clone(d);
Owner

I think I understand your earlier comment now: yes, here it is much more efficient to lock d and call find_percent while holding the mutex.

Author

Okay, thank you. I changed it to locking the mutex instead of copying the whole data structure.

src/daemon/distribution.c (resolved)

void distribution_destroy(distribution_t *d) {
if (d == NULL) {
errno = EINVAL;
Owner

I recommend against setting errno here. free(3) also doesn't set errno if called with a NULL pointer.

Author

Thanks! Fixed

/* function for updating the buckets */
void distribution_update(distribution_t *d, double gauge);

/* function for getting the percentile */
Owner

I really appreciate that all the functions in the header file are commented, that's awesome :)

I recommend that the comments discuss the function's behavior in corner cases. For example, what constraints exist for the arguments, such as «"percent" has to be between 0 (exclusive) and 100 (inclusive)». Also, how are error conditions signaled? Will the function return zero on failure? Or -1? Or NaN (Not a Number)? Is errno set?
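
For instance, a header comment in that spirit might read (illustrative only; the wording, return values and errno behaviour shown here are assumptions, not the text that ended up in distribution.h):

/* distribution_percentile returns the smallest bucket boundary below which
 * "percent" percent of all recorded gauges fall.
 * "percent" has to be between 0 (exclusive) and 100 (inclusive).
 * On error (d is NULL, percent is out of range, or no updates have been
 * recorded yet) the function returns NaN. */
double distribution_percentile(distribution_t *d, double percent);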

Author

Thank you! I extended the comments with descriptions of the values returned in case of an error.

return -1;
}

distribution_t *distribution = distribution_clone(d);
Owner

Likewise, no need to clone the distribution for this, just lock the distribution when accessing sum_gauges and the bucket counter.

Author

Fixed

double ptr = 0.0;

for (size_t i = 0; i < num_buckets - 1; ++i) {
ptr += custom_buckets_sizes[i];
Owner

I think it would be more intuitive to pass bucket boundaries instead of bucket sizes. In other words, I think:

double boundaries[] = {1, 2, 5, 10, 20, 50, 100};

is easier to read than:

double sizes[] = {1, 1, 3, 5, 10, 30, 50};

Author

Thanks! Changed

bkjg added 5 commits July 28, 2020 09:32
This commit will add changes that will allow locking and unlocking
the mutex instead of copying a distribution_t data structure.
This change will cause the program to run more efficiently and be
a bit faster.
This commit will change the logic of distribution_new_custom.
So far this function took custom bucket sizes as an argument.
However, it is more readable to pass bucket boundaries instead of
sizes, so this function now takes the bucket boundaries as a
parameter.
This commit will extend the descriptions in the comments in the
distribution.h file. Earlier the comments only described what a
given function does. Now the comments also describe the value
returned if there is an error.
This commit will change the values returned from the
distribution_average and distribution_percentile functions. Earlier
these functions returned -1.0 in the case of an error. However,
comparing double values is a little tricky in C, so we prefer to
return NaN instead of any particular value.
@bkjg bkjg force-pushed the distribution-metrics branch from 857881f to 6d4afb2 on August 4, 2020 07:26
@emargalit emargalit left a comment

Great work Barbara! ^_^

run_benchmark.sh Outdated
@@ -0,0 +1,6 @@
#!/bin/bash

for i in {1..200}


Another option would be to write the step directly into the for loop like this:

for i in {20..4000..20}

Author

Interesting suggestion, I have never heard about this third argument before. Thanks!
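
For reference, a tiny standalone example of that form (brace expansion with a step needs bash 4 or newer and is expanded before the loop runs):

#!/bin/bash
# {20..4000..20} expands to "20 40 60 ... 4000".
for i in {20..4000..20}; do
  echo "$i"
done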

return NULL;
}

pthread_mutex_lock(&d->mutex);


Are you using a mutex for the case when the user wants to return the bucket counters of the distribution_t itself and not use the clone? Would the mutex still be necessary when returning the counters of a distribution_t clone?

Author

I use the mutex to return the bucket counters of the distribution_t itself rather than of a clone, because copying is extremely costly; holding the mutex and returning the counters of the original distribution_t is just faster.

The mutex has to be held by the clone function, so if I cloned the distribution_t structure, I could not also hold the mutex in the counters getter function.


Thank you for the explanation!

pthread_mutex_t mutex;
};

double *distribution_get_buckets_boundaries(distribution_t *d) {


Just out of curiosity, what was your reason for creating getter functions to return the bucket boundaries and counters instead of directly returning the buckets themselves?

return d->sum_gauges;
}

bool distribution_check_equal(distribution_t *d1, distribution_t *d2) {


Which use case would you need this function for?


for (size_t i = 0; i < num_buckets - 1; ++i) {
buckets[i].max_boundary = factor * multiplier;
multiplier *= base;


Maybe the code would be more readable if the pow function was used here.


I would also recommend using the pow function. As we found out, it's much more precise than multiplying many times.

Author
@bkjg bkjg Aug 13, 2020

I didn't use the pow function to avoid calculating the boundary of every bucket in logarithmic time, which could end up with num_buckets times log(num_buckets) time complexity for the distribution_new_exponential function. The current solution, which reuses the previous boundary, has linear time complexity, so it's a bit faster.
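
For concreteness, the two variants being discussed look roughly like this (a fragment in the context of the quoted loop; it assumes multiplier is initialised to 1.0 beforehand, which may not match the patch exactly):

/* Variant 1: running product, as in the quoted code. */
double multiplier = 1.0;
for (size_t i = 0; i < num_buckets - 1; ++i) {
  buckets[i].max_boundary = factor * multiplier;
  multiplier *= base;
}

/* Variant 2: closed form with pow(3), as the reviewers suggest.
 * Needs <math.h> and linking with -lm. */
for (size_t i = 0; i < num_buckets - 1; ++i)
  buckets[i].max_boundary = factor * pow(base, (double)i);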

}

d->num_buckets = num_buckets;
pthread_mutex_init(&d->mutex, NULL);
@emargalit emargalit Aug 10, 2020

What happens on line 228?

Author

I initialise the mutex. When you create it, you should initialise it, to be sure that it will work correctly.

https://linux.die.net/man/3/pthread_mutex_init


pthread_mutex_lock(&d->mutex);

bucket_update(d->buckets, d->num_buckets, gauge);


I like the idea of exporting the bucket update to a separate function, great!


for (int i = 0; i < max_size; ++i) {
for (int j = 0; j < 9; ++j) {
updates[i * 9 + j] = (rand() / (double)RAND_MAX) + (rand() % (int)1e6);


What is the idea behind calculating the index of the update elements like this? Do you get better randomized numbers like this?

Author

The idea is that I wanted to get a double number, i.e. one with digits after the decimal point. That's why I first take a random number from [0, 1) and then a random integer from [0, 1e6). After summing them I get a decimal number.

}

printf("%d,%lf,%lf,%lf\n", num_buckets, (double)*elapsed_time_update / 1e9,
(double)*elapsed_time_percentile / 1e9,


Maybe the elapsed time could be calculated directly in seconds above, so this conversion would not be needed here anymore.

Author

I wanted to calculate it in nanoseconds first to have the maximum precision I can, because operations on doubles lose precision quickly. That's why I use uint64_t and nanoseconds and only convert to seconds at the end.
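
A sketch of that timing approach (assumed, not the exact benchmark code): accumulate the elapsed time in nanoseconds as uint64_t and convert to seconds only when printing.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* POSIX monotonic clock, read as a 64-bit nanosecond counter. */
static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

int main(void) {
  uint64_t start = now_ns();
  /* ... code under test ... */
  uint64_t elapsed_ns = now_ns() - start;
  printf("%lf\n", (double)elapsed_ns / 1e9); /* seconds, computed only at the end */
  return 0;
}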

@Lana243 Lana243 left a comment

Great job, Barbara👍👍👍

Comment on lines +284 to +289
int idx = (int)num_buckets - 1;

while (idx >= 0 && buckets[idx].max_boundary > gauge) {
buckets[idx].counter++;
idx--;
}

What do you think of this:

Suggested change
int idx = (int)num_buckets - 1;
while (idx >= 0 && buckets[idx].max_boundary > gauge) {
buckets[idx].counter++;
idx--;
}
for (size_t i = num_buckets - 1; i >= 0 && buckets[i].max_boundary > gauge) {
buckets[i].counter++;
}

As we consider that distribution is not empty there won't be any issues with num_buckets - 1

Author

Unfortunately, if the gauge is smaller than buckets[0].max_boundary, then the proposed for loop will last infinitely. The reason is that i is of type size_t, which is unsigned and therefore never negative, so the condition i >= 0 is always true.
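
For illustration, one way to walk the buckets backwards with an unsigned index and without the wrap-around problem (a sketch in the context of the quoted code, not a change made in this PR):

/* i starts at num_buckets and is decremented before each access, so the loop
 * stops cleanly when i would reach 0 and never evaluates i >= 0 on size_t. */
for (size_t i = num_buckets; i-- > 0 && buckets[i].max_boundary > gauge;) {
  buckets[i].counter++;
}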

Comment on lines +31 to +34
typedef struct {
double max_boundary;
uint64_t counter;
} bucket_t;

Could you please add a comment about what the buckets look like? What are the minimal boundaries? Are the minimal and maximal boundaries inclusive or exclusive?

Author

Right, I will add that. Thanks!


Comment on lines 180 to 197
if (num_boundaries > 0) {
if (custom_buckets_boundaries[0] <= 0 ||
custom_buckets_boundaries[0] == INFINITY) {
free(buckets);
errno = EINVAL;
return NULL;
}

buckets[0].max_boundary = custom_buckets_boundaries[0];

for (size_t i = 1; i < num_boundaries; ++i) {
if (custom_buckets_boundaries[i] <= 0 ||
custom_buckets_boundaries[i] == INFINITY ||
custom_buckets_boundaries[i - 1] >= custom_buckets_boundaries[i]) {
free(buckets);
errno = EINVAL;
return NULL;
}

I'd prefer to check the boundaries first and only then allocate memory for the buckets, because memory allocation is a very expensive operation and I'd like to avoid it when unnecessary.

Author

Great idea, I will change it! Thanks!
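
A sketch of that reordering, with validation first and allocation only afterwards (names follow the quoted code; it assumes num_buckets is num_boundaries + 1, the extra last bucket being the open-ended one, which the final version in the PR may handle differently):

/* Validate all user-supplied boundaries before any allocation. */
for (size_t i = 0; i < num_boundaries; ++i) {
  if (custom_buckets_boundaries[i] <= 0 ||
      custom_buckets_boundaries[i] == INFINITY ||
      (i > 0 && custom_buckets_boundaries[i - 1] >= custom_buckets_boundaries[i])) {
    errno = EINVAL;
    return NULL;
  }
}

bucket_t *buckets = calloc(num_boundaries + 1, sizeof(*buckets));
if (buckets == NULL)
  return NULL;
for (size_t i = 0; i < num_boundaries; ++i)
  buckets[i].max_boundary = custom_buckets_boundaries[i];
buckets[num_boundaries].max_boundary = INFINITY;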

Comment on lines 235 to 238
if (num_buckets == 0 || base <= 0 || factor <= 0) {
errno = EINVAL;
return NULL;
}

I'd argue that base should be greater than 1. Otherwise, the buckets will become smaller and smaller

Author

You're right! I wanted to add this condition, but then I forgot about it. Fixed!

.factor = 4.64,
.want_get = array_new_exponential(26, 1.01, 4.64),
}};


I'd add a corner case when 0 < base < 1. And when base = 1

Author

I will add that, thank you!

bkjg added 2 commits August 12, 2020 09:09

This commit will add small fixes in the benchmark and in the functions
for the distribution_t structure. In the benchmark we now make sure
that the compiler won't remove some lines because of an unused return
value, and this commit introduces checks that the pointers to the
time-measuring variables are not null and that the distribution_t
structures are not null. Also, in the function for initializing the
exponential distribution, this commit introduces a fix that ensures
the base is greater than 1.
This commit will add the functionality to measure the time taken by
particular types of distributions to complete the most important
operations of the distribution metrics: updating and calculating a
percentile.
Comment on lines +336 to +337
uint64_t quantity = (uint64_t)(
(percent / 100.0) * (double)d->buckets[d->num_buckets - 1].counter);

BTW, what would you return if the distribution was created but has no updates yet? I'd argue that you should return NAN.

Author

Fixed, thanks!
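
A sketch of the guard described here, assuming (as in the quoted lines) that the last bucket's counter holds the total number of updates, and ignoring the surrounding locking for brevity:

/* An empty distribution has no meaningful percentile, so report NAN. */
if (d->buckets[d->num_buckets - 1].counter == 0)
  return NAN;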
