-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distribution metrics #39
base: project-bkjg
Are you sure you want to change the base?
Conversation
This commit add distsribution header file and functions for distribution management.
This commit add skeleton of distribution functions.
This commit will add the implementation of distribution_new_linear function, which creates new distribution metrics using the linear relationship of bucket's sizes. This functions depends on bucket_new_linear which will be added in the future commits.
This commit will add implementation of bucket_new_linear function that creates an array of type bucket_t and initialize maximum boundaries using the linear relationship. This commit also fix the several small bugs and typos.
This commit removes one argument from distribution_new_exponential function. To calculate sizes of buckets, only factor and number of buckets are enough, we don't need initial size, we always start calculating sizes from zero.
This commit adds the implementation of distribution_new_exponential function. This function depends on bucket_new_exponential which will be added in future commits. The implementation takes into account the changed arguments of this particular function.
This commit will add the implementation of bucket_new_exponential function. This function creates an array of bucket_t type and initialize maximum boundaries using the exponential relationship.
This commit will add the implementation of distribution_new_custom function, which creates distribution metrics structure using the custom buckets sizes given by the user. The implementation depends on bucket_new_custom function, which will be implemented and added in future commits.
This commit will add the implementation of distribution_destroy function. This function do clean up and free all the memory: especially pointer to the buckets inside distribution_t structure and pointer to the distribution_t structure itself.
This commit will add the implementation of bucket_new_custom function. This function creates the bucket_t structure and initialize it using the array with consecutive buckets sizes given by the user (except the size of the last bucket). Last bucket has maximum boundary equal to infinity, so size of this bucket is also an infinity.
This commit will add the implementation of distribution_update function. This function uses bucket_update function which increments the counter of the adequate bucket for the given gauge.
This commit will add the implementation of distribution_percentile function. This function is looking for the maximum boundary of bucket in which should be found the percentage given by the user. This function uses find_percentile function which finds percentile using binary search algorithm.
This commit will add the implementation of distribution_average function. It uses the sum of all gauges and divides it by the number of all requested updates.
This commit will make function distribution_update thread safe by surrounding the main logic with mutex.
This commit will add the implementation of distribution_clone function. This function does copy of the distribution_t structure and thanks to mutex, is thread safe.
This commit will add patch that changes the type of distribution_clone from distribution_t to distribution_t*.
This commit will add the functionality of checking null pointers. When i.e. calloc didn't success then it return null pointer.
This commit will add copying the distribution at the beginning of the functions. Thanks that our code is thread safe because cloning distribution contains locking the mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great Barbara!
src/daemon/distribution.c
Outdated
return -1; | ||
} | ||
|
||
distribution_t *distribution = distribution_clone(d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand your earlier comment now: yes, here it is much more efficient to lock d
and call find_percent
while holding the mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thank you. I changed to the locking mutex instead of copying whole data structure
src/daemon/distribution.c
Outdated
|
||
void distribution_destroy(distribution_t *d) { | ||
if (d == NULL) { | ||
errno = EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend against setting errno
here. free(3) also doesn't set errno
if called with a NULL
pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed
src/daemon/distribution.h
Outdated
/* function for updating the buckets */ | ||
void distribution_update(distribution_t *d, double gauge); | ||
|
||
/* function for getting the percentile */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate that all the functions in the header file are commented, that's awesome :)
I recommend that the comments discuss the function's behavior in corner cases. For example, what constraints exist for the arguments, such as «"percent" has to be between 0 (exclusive) and 100 (inclusive)». Also, how are error conditions signaled? Will the function return zero on failure? Or -1? Or NaN (Not a Number)? Is errno
set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I extended the comments of descriptions of values returned in case of any error
src/daemon/distribution.c
Outdated
return -1; | ||
} | ||
|
||
distribution_t *distribution = distribution_clone(d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, no need to clone the distribution for this, just lock the distribution when accessing sum_gauges
and the bucket counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
src/daemon/distribution.c
Outdated
double ptr = 0.0; | ||
|
||
for (size_t i = 0; i < num_buckets - 1; ++i) { | ||
ptr += custom_buckets_sizes[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more intuitive to pass bucket boundaries instead of bucket sizes. In other words, I think:
double boundaries[] = {1, 2, 5, 10, 20, 50, 100};
is easier to read than:
double sizes[] = {1, 1, 3, 5, 10, 30, 50};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Changed
This commit will add changes that will allow locking and unlocking the mutex instead of copying a distribution_t data structure. This change will cause that the program will run more efficiently and will be a bit faster.
This commit will change the logic of distribution_new_custom. So far this function was getting custom sizes of buckets as an argument. However it would be more readable if instead having sizes, we would have the boundaries of buckets. That's why now this function get boundaries of buckets as a parameter.
This commit will extend the descriptions in the comments in distribution.h file. Earlier comments only containted the description what given function do. Now comments contain also the description of returned value if there will be any error.
This commit will change the values returned from distribution_average and distribution_percentile function. Earlier these functions in the case of any error returned -1.0. However comparation of the double values is a little tricky in C, so we prefer to return NaN instead of any particular value.
857881f
to
6d4afb2
Compare
…k_equal functions
This commit will add bash script that runs distribution_benchmark for numbers of buckets in multiplies of 20 from 20 to 4000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Barbara! ^_^
run_benchmark.sh
Outdated
@@ -0,0 +1,6 @@ | |||
#!/bin/bash | |||
|
|||
for i in {1..200} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to write the step directly into the for loop like this:
for i in {20..4000..20}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting suggestion, I have never heard about this third argument before. Thanks!
return NULL; | ||
} | ||
|
||
pthread_mutex_lock(&d->mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you using a mutex for the case when the user wants to return the bucket counters of the distribution_t itself and not use the clone? Would the mutex still be neccessary when returning the counters of a distribution_t clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use the mutex to return the bucket's counters of the distribution_t itself and not use the clone, because copying is extremely costly and holding mutex and returning counters of the original distribution_t is just faster.
Mutux have to be hold by the clone function, so if I would clone distribution_t structure, then I could not have the mutex in the counters getter function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation!
pthread_mutex_t mutex; | ||
}; | ||
|
||
double *distribution_get_buckets_boundaries(distribution_t *d) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiosity, what was your reason to choose to create getter functions to return the bucket boundaries and counters and not directly return the buckets themselves?
return d->sum_gauges; | ||
} | ||
|
||
bool distribution_check_equal(distribution_t *d1, distribution_t *d2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which use case would you need this function for?
|
||
for (size_t i = 0; i < num_buckets - 1; ++i) { | ||
buckets[i].max_boundary = factor * multiplier; | ||
multiplier *= base; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the code would be more readable if the pow function was used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also recommend to use pow function. As we found out, it's much more precise than multiplication many times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't use the pow function to avoid calculating boundary for every bucket in logarithm time, which could end with num_buckets multiply by log(num_buckets) time complexity for distribution_new_exponential function. And current solution, which uses previous boundary, has linear time complexity, so it's a bit faster.
} | ||
|
||
d->num_buckets = num_buckets; | ||
pthread_mutex_init(&d->mutex, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens on line 228?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initialise the mutex. When you create it, you should initialise it, to be sure that it will work correctly.
|
||
pthread_mutex_lock(&d->mutex); | ||
|
||
bucket_update(d->buckets, d->num_buckets, gauge); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of exporting the bucket update to a separate function, great!
|
||
for (int i = 0; i < max_size; ++i) { | ||
for (int j = 0; j < 9; ++j) { | ||
updates[i * 9 + j] = (rand() / (double)RAND_MAX) + (rand() % (int)1e6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the idea behind calculating the index of the update elements like this? Do you get better randomized numbers like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that I wanted to get double number, i.e. with numbers after decimal point. That's why at the beginning I got the random number from [0, 1) and then random number from [0, 1e6). After summing them I received the decimal number
src/daemon/distribution_benchmark.c
Outdated
} | ||
|
||
printf("%d,%lf,%lf,%lf\n", num_buckets, (double)*elapsed_time_update / 1e9, | ||
(double)*elapsed_time_percentile / 1e9, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the elapsed time could be calculated directly in seconds above so this conversion here would not be needed here anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to calculate it at the beginning in nanoseconds to have the maximum precision I can, because operations on double lose precision fast. That's why I use uint64_t and nanoseconds and calculate seconds at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job, Barbara👍👍👍
int idx = (int)num_buckets - 1; | ||
|
||
while (idx >= 0 && buckets[idx].max_boundary > gauge) { | ||
buckets[idx].counter++; | ||
idx--; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To your mind:
int idx = (int)num_buckets - 1; | |
while (idx >= 0 && buckets[idx].max_boundary > gauge) { | |
buckets[idx].counter++; | |
idx--; | |
} | |
for (size_t i = num_buckets - 1; i >= 0 && buckets[i].max_boundary > gauge) { | |
buckets[i].counter++; | |
} |
As we consider that distribution is not empty there won't be any issues with num_buckets - 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, if the gauge is smaller than buckets[0].max_boundary, then proposed for loop with last infinitely. Reason for it is that i is the type of size_t which is always greater than zero, so the condition i >= 0 is always true.
typedef struct { | ||
double max_boundary; | ||
uint64_t counter; | ||
} bucket_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a comment about how does the buckets look like. What are the minimal boundaries? Are minimal and maximal boundaries inclusive or exclusive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I will add that. Thanks!
|
||
for (size_t i = 0; i < num_buckets - 1; ++i) { | ||
buckets[i].max_boundary = factor * multiplier; | ||
multiplier *= base; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also recommend to use pow function. As we found out, it's much more precise than multiplication many times
src/daemon/distribution.c
Outdated
if (num_boundaries > 0) { | ||
if (custom_buckets_boundaries[0] <= 0 || | ||
custom_buckets_boundaries[0] == INFINITY) { | ||
free(buckets); | ||
errno = EINVAL; | ||
return NULL; | ||
} | ||
|
||
buckets[0].max_boundary = custom_buckets_boundaries[0]; | ||
|
||
for (size_t i = 1; i < num_boundaries; ++i) { | ||
if (custom_buckets_boundaries[i] <= 0 || | ||
custom_buckets_boundaries[i] == INFINITY || | ||
custom_buckets_boundaries[i - 1] >= custom_buckets_boundaries[i]) { | ||
free(buckets); | ||
errno = EINVAL; | ||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to check boundaries first and then to allocate memory for the buckets because memory allocation is very expensive operation and I'd like to avoid doing it when unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, I will change it! Thanks!
src/daemon/distribution.c
Outdated
if (num_buckets == 0 || base <= 0 || factor <= 0) { | ||
errno = EINVAL; | ||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue that base should be greater than 1. Otherwise, the buckets will become smaller and smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! I wanted to add this condition, but then I forgot about it. Fixed!
.factor = 4.64, | ||
.want_get = array_new_exponential(26, 1.01, 4.64), | ||
}}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a corner case when 0 < base < 1. And when base = 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will that, thank you!
…ucture This commit will add little fixes in the benchmark and in the functions for the distribution_t structure. In benchmark now we are sure that compiler won't delete some lines because of the unused return value and now this commit introduced checking pointers to the time calculating variables if there are not equal to null and also if distribution_t structures are not null. Also in distribution_t structure in function for initializing the exponential distribution this commit introduce the fix that ensure us that base is greater than 1.
This commit will add the functionality to measure time taken by particular types of distributions to complete the most important functionalities of distribution metrics: update and calculating percentile.
uint64_t quantity = (uint64_t)( | ||
(percent / 100.0) * (double)d->buckets[d->num_buckets - 1].counter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, what would you return if distribution was created but has no updates yet? I'd argue that you should return NAN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
This pull request will add distribution metrics (distribution_t data structure) to the collectd.
The description of functionalities is in the following link:
https://docs.google.com/document/d/1ccsg5ffUfqt9-mBDGTymRn8X-9Wk1CuGYeMlRxmxiok/edit#heading=h.5irk4csrpu0y