
Add distribution metrics #39

Open
bkjg wants to merge 91 commits into project-bkjg from distribution-metrics
Conversation


@bkjg bkjg commented Jul 27, 2020

This pull request will add distribution metrics (the distribution_t data structure) to collectd.

The functionality is described in the following document:

https://docs.google.com/document/d/1ccsg5ffUfqt9-mBDGTymRn8X-9Wk1CuGYeMlRxmxiok/edit#heading=h.5irk4csrpu0y

bkjg added 24 commits July 27, 2020 11:21
This commit adds the distribution header file and functions for
distribution management.
This commit adds the skeleton of the distribution functions.
This commit will add the implementation of the distribution_new_linear
function, which creates a new distribution metric whose bucket sizes
follow a linear relationship. This function depends on
bucket_new_linear, which will be added in a future commit.
This commit will add the implementation of the bucket_new_linear
function, which creates an array of type bucket_t and initializes the
maximum boundaries using a linear relationship.

This commit also fixes several small bugs and typos.
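
For illustration only, a minimal sketch of what such a constructor could look like, assuming the bucket_t layout quoted later in this review (max_boundary, counter) and an open-ended last bucket; this is not the exact code from the patch:

#include <errno.h>
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  double max_boundary;
  uint64_t counter;
} bucket_t;

/* Sketch: boundaries grow linearly (size, 2*size, ...); the last bucket is
 * open-ended, so its maximum boundary is INFINITY. */
static bucket_t *bucket_new_linear(size_t num_buckets, double size) {
  if (num_buckets == 0 || size <= 0) {
    errno = EINVAL;
    return NULL;
  }
  bucket_t *buckets = calloc(num_buckets, sizeof(*buckets));
  if (buckets == NULL)
    return NULL;
  for (size_t i = 0; i + 1 < num_buckets; ++i)
    buckets[i].max_boundary = (double)(i + 1) * size;
  buckets[num_buckets - 1].max_boundary = INFINITY;
  return buckets;
}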
This commit removes one argument from the distribution_new_exponential
function. To calculate the bucket sizes, the factor and the number of
buckets are enough; we don't need an initial size because we always
start calculating the sizes from zero.
This commit adds the implementation of the distribution_new_exponential
function. This function depends on bucket_new_exponential, which will
be added in future commits. The implementation takes the changed
arguments of this particular function into account.
This commit will add the implementation of the bucket_new_exponential
function. This function creates an array of type bucket_t and
initializes the maximum boundaries using an exponential relationship.
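
Analogously, a hedged sketch of the exponential variant (reusing the includes and the bucket_t typedef from the sketch above; the exact boundary formula and argument checks in the patch may differ):

/* Sketch: boundaries grow geometrically (factor*base, factor*base^2, ...). */
static bucket_t *bucket_new_exponential(size_t num_buckets, double base,
                                        double factor) {
  if (num_buckets == 0 || base <= 1.0 || factor <= 0) {
    errno = EINVAL;
    return NULL;
  }
  bucket_t *buckets = calloc(num_buckets, sizeof(*buckets));
  if (buckets == NULL)
    return NULL;
  double multiplier = base; /* running product; see the pow(3) discussion below */
  for (size_t i = 0; i + 1 < num_buckets; ++i) {
    buckets[i].max_boundary = factor * multiplier;
    multiplier *= base;
  }
  buckets[num_buckets - 1].max_boundary = INFINITY;
  return buckets;
}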
This commit will add the implementation of the distribution_new_custom
function, which creates a distribution metrics structure using custom
bucket sizes given by the user. The implementation depends on the
bucket_new_custom function, which will be implemented and added in
future commits.
This commit will add the implementation of the distribution_destroy
function. This function does the cleanup and frees all the memory:
in particular the pointer to the buckets inside the distribution_t
structure and the pointer to the distribution_t structure itself.
This commit will add the implementation of the bucket_new_custom
function. This function creates the bucket_t structure and initializes
it using the array of consecutive bucket sizes given by the user
(except for the size of the last bucket). The last bucket has a maximum
boundary of infinity, so its size is also infinite.
This commit will fix bugs, i.e. compilation errors.
This commit will add the implementation of the distribution_update
function. This function uses the bucket_update function, which
increments the counter of the appropriate bucket for the given gauge.
This commit will add the implementation of the distribution_percentile
function. This function looks for the maximum boundary of the bucket
in which the percentage given by the user falls. It uses the
find_percentile function, which finds the percentile using a binary
search algorithm.
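
A rough sketch of the binary-search idea described above, assuming (as the update code quoted later in this review suggests) that every bucket's counter is cumulative, i.e. counts all gauges below that bucket's maximum boundary; the actual find_percentile in the patch may differ in its details:

/* Sketch: returns the max_boundary of the first bucket whose cumulative
 * counter reaches the requested quantity. Reuses bucket_t from the sketch
 * above and assumes num_buckets > 0 and quantity <= total count. */
static double find_percentile(const bucket_t *buckets, size_t num_buckets,
                              uint64_t quantity) {
  size_t lo = 0, hi = num_buckets - 1;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (buckets[mid].counter < quantity)
      lo = mid + 1; /* too few gauges below this boundary, search higher */
    else
      hi = mid;     /* this bucket (or an earlier one) already suffices */
  }
  return buckets[lo].max_boundary;
}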
This commit will add the implementation of the distribution_average
function. It takes the sum of all gauges and divides it by the number
of all recorded updates.
This commit will make the distribution_update function thread safe by
surrounding the main logic with a mutex.
This commit will add the implementation of the distribution_clone
function. This function makes a copy of the distribution_t structure
and, thanks to the mutex, is thread safe.
This commit will add a patch that changes the return type of
distribution_clone from distribution_t to distribution_t*.
This commit will add the functionality of checking for null pointers.
When e.g. calloc doesn't succeed, it returns a null pointer.
This commit will add copying of the distribution at the beginning of
the functions. Thanks to that, the code is thread safe, because cloning
a distribution involves locking the mutex.
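
A sketch of that cloning approach, assuming a distribution_t with the fields that appear elsewhere in this review (buckets, num_buckets, sum_gauges, mutex) and the bucket_t typedef from the first sketch; the struct layout is an assumption, not the exact patch:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  bucket_t *buckets;
  size_t num_buckets;
  double sum_gauges;
  pthread_mutex_t mutex;
} distribution_t;

distribution_t *distribution_clone(distribution_t *d) {
  if (d == NULL)
    return NULL;
  distribution_t *copy = calloc(1, sizeof(*copy));
  if (copy == NULL)
    return NULL;
  pthread_mutex_lock(&d->mutex); /* block concurrent updates while copying */
  copy->num_buckets = d->num_buckets;
  copy->sum_gauges = d->sum_gauges;
  copy->buckets = calloc(d->num_buckets, sizeof(*copy->buckets));
  if (copy->buckets != NULL)
    memcpy(copy->buckets, d->buckets, d->num_buckets * sizeof(*copy->buckets));
  pthread_mutex_unlock(&d->mutex);
  if (copy->buckets == NULL) {
    free(copy);
    return NULL;
  }
  pthread_mutex_init(&copy->mutex, NULL); /* the copy gets its own, fresh mutex */
  return copy;
}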
Owner
@octo octo left a comment

This looks great, Barbara!

return -1;
}

distribution_t *distribution = distribution_clone(d);
Owner

I think I understand your earlier comment now: yes, here it is much more efficient to lock d and call find_percent while holding the mutex.

Author

Okay, thank you. I changed it to locking the mutex instead of copying the whole data structure.

src/daemon/distribution.c (resolved)

void distribution_destroy(distribution_t *d) {
if (d == NULL) {
errno = EINVAL;
Owner

I recommend against setting errno here. free(3) also doesn't set errno if called with a NULL pointer.

Author

Thanks! Fixed

/* function for updating the buckets */
void distribution_update(distribution_t *d, double gauge);

/* function for getting the percentile */
Owner

I really appreciate that all the functions in the header file are commented, that's awesome :)

I recommend that the comments discuss the function's behavior in corner cases. For example, what constraints exist for the arguments, such as «"percent" has to be between 0 (exclusive) and 100 (inclusive)». Also, how are error conditions signaled? Will the function return zero on failure? Or -1? Or NaN (Not a Number)? Is errno set?
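
For instance, a header comment in that spirit might read (illustrative only; the wording, return values and errno behaviour shown here are assumptions, not the text that ended up in distribution.h):

/* distribution_percentile returns the smallest bucket boundary below which
 * "percent" percent of all recorded gauges fall.
 * "percent" has to be between 0 (exclusive) and 100 (inclusive).
 * On error (d is NULL, percent is out of range, or no updates have been
 * recorded yet) the function returns NaN. */
double distribution_percentile(distribution_t *d, double percent);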

Author

Thank you! I extended the comments with descriptions of the values returned in case of an error.

return -1;
}

distribution_t *distribution = distribution_clone(d);
Owner

Likewise, no need to clone the distribution for this, just lock the distribution when accessing sum_gauges and the bucket counter.

Author

Fixed

double ptr = 0.0;

for (size_t i = 0; i < num_buckets - 1; ++i) {
ptr += custom_buckets_sizes[i];
Owner

I think it would be more intuitive to pass bucket boundaries instead of bucket sizes. In other words, I think:

double boundaries[] = {1, 2, 5, 10, 20, 50, 100};

is easier to read than:

double sizes[] = {1, 1, 3, 5, 10, 30, 50};

Author

Thanks! Changed

bkjg added 5 commits July 28, 2020 09:32
This commit will add changes that will allow locking and unlocking
the mutex instead of copying a distribution_t data structure.
This change will cause the program to run more efficiently and be
a bit faster.
This commit will change the logic of distribution_new_custom.
So far this function took custom bucket sizes as an argument.
However, it is more readable to pass bucket boundaries instead of
sizes, so this function now takes the bucket boundaries as a
parameter.
This commit will extend the descriptions in the comments in the
distribution.h file. Earlier the comments only described what a
given function does. Now the comments also describe the value
returned if there is an error.
This commit will change the values returned from the
distribution_average and distribution_percentile functions. Earlier
these functions returned -1.0 in the case of an error. However,
comparing double values is a little tricky in C, so we prefer to
return NaN instead of any particular value.
@bkjg bkjg force-pushed the distribution-metrics branch from 857881f to 6d4afb2 on August 4, 2020 07:26
@emargalit emargalit left a comment

Great work Barbara! ^_^

run_benchmark.sh Outdated
@@ -0,0 +1,6 @@
#!/bin/bash

for i in {1..200}


Another option would be to write the step directly into the for loop like this:

for i in {20..4000..20}

Author

Interesting suggestion, I have never heard about this third argument before. Thanks!
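
For reference, a tiny standalone example of that form (brace expansion with a step needs bash 4 or newer and is expanded before the loop runs):

#!/bin/bash
# {20..4000..20} expands to "20 40 60 ... 4000".
for i in {20..4000..20}; do
  echo "$i"
done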

return NULL;
}

pthread_mutex_lock(&d->mutex);


Are you using a mutex for the case when the user wants to return the bucket counters of the distribution_t itself and not use the clone? Would the mutex still be necessary when returning the counters of a distribution_t clone?

Author

I use the mutex to return the bucket counters of the distribution_t itself rather than of a clone, because copying is extremely costly; holding the mutex and returning the counters of the original distribution_t is just faster.

The mutex has to be held by the clone function, so if I cloned the distribution_t structure, I could not also hold the mutex in the counters getter function.


Thank you for the explanation!

pthread_mutex_t mutex;
};

double *distribution_get_buckets_boundaries(distribution_t *d) {


Just out of curiosity, what was your reason for creating getter functions to return the bucket boundaries and counters instead of directly returning the buckets themselves?

return d->sum_gauges;
}

bool distribution_check_equal(distribution_t *d1, distribution_t *d2) {


Which use case would you need this function for?


for (size_t i = 0; i < num_buckets - 1; ++i) {
buckets[i].max_boundary = factor * multiplier;
multiplier *= base;


Maybe the code would be more readable if the pow function was used here.


I would also recommend using the pow function. As we found out, it's much more precise than multiplying many times.

Author
@bkjg bkjg Aug 13, 2020

I didn't use the pow function to avoid calculating the boundary of every bucket in logarithmic time, which could end up with num_buckets times log(num_buckets) time complexity for the distribution_new_exponential function. The current solution, which reuses the previous boundary, has linear time complexity, so it's a bit faster.
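
For concreteness, the two variants being discussed look roughly like this (a fragment in the context of the quoted loop; it assumes multiplier is initialised to 1.0 beforehand, which may not match the patch exactly):

/* Variant 1: running product, as in the quoted code. */
double multiplier = 1.0;
for (size_t i = 0; i < num_buckets - 1; ++i) {
  buckets[i].max_boundary = factor * multiplier;
  multiplier *= base;
}

/* Variant 2: closed form with pow(3), as the reviewers suggest.
 * Needs <math.h> and linking with -lm. */
for (size_t i = 0; i < num_buckets - 1; ++i)
  buckets[i].max_boundary = factor * pow(base, (double)i);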

}

d->num_buckets = num_buckets;
pthread_mutex_init(&d->mutex, NULL);
@emargalit emargalit Aug 10, 2020

What happens on line 228?

Author

I initialise the mutex. When you create it, you should initialise it, to be sure that it will work correctly.

https://linux.die.net/man/3/pthread_mutex_init


pthread_mutex_lock(&d->mutex);

bucket_update(d->buckets, d->num_buckets, gauge);


I like the idea of exporting the bucket update to a separate function, great!


for (int i = 0; i < max_size; ++i) {
for (int j = 0; j < 9; ++j) {
updates[i * 9 + j] = (rand() / (double)RAND_MAX) + (rand() % (int)1e6);


What is the idea behind calculating the index of the update elements like this? Do you get better randomized numbers like this?

Author

The idea is that I wanted to get a double number, i.e. one with digits after the decimal point. That's why I first take a random number from [0, 1) and then a random integer from [0, 1e6). After summing them I get a decimal number.

}

printf("%d,%lf,%lf,%lf\n", num_buckets, (double)*elapsed_time_update / 1e9,
(double)*elapsed_time_percentile / 1e9,


Maybe the elapsed time could be calculated directly in seconds above, so this conversion would not be needed here anymore.

Author

I wanted to calculate it in nanoseconds first to have the maximum precision I can, because operations on doubles lose precision quickly. That's why I use uint64_t and nanoseconds and only convert to seconds at the end.
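
A sketch of that timing approach (assumed, not the exact benchmark code): accumulate the elapsed time in nanoseconds as uint64_t and convert to seconds only when printing.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* POSIX monotonic clock, read as a 64-bit nanosecond counter. */
static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

int main(void) {
  uint64_t start = now_ns();
  /* ... code under test ... */
  uint64_t elapsed_ns = now_ns() - start;
  printf("%lf\n", (double)elapsed_ns / 1e9); /* seconds, computed only at the end */
  return 0;
}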

@Lana243 Lana243 left a comment

Great job, Barbara👍👍👍

Comment on lines +284 to +289
int idx = (int)num_buckets - 1;

while (idx >= 0 && buckets[idx].max_boundary > gauge) {
buckets[idx].counter++;
idx--;
}

What do you think of this:

Suggested change
int idx = (int)num_buckets - 1;
while (idx >= 0 && buckets[idx].max_boundary > gauge) {
buckets[idx].counter++;
idx--;
}
for (size_t i = num_buckets - 1; i >= 0 && buckets[i].max_boundary > gauge) {
buckets[i].counter++;
}

As we consider that distribution is not empty there won't be any issues with num_buckets - 1

Author

Unfortunately, if the gauge is smaller than buckets[0].max_boundary, then the proposed for loop will last infinitely. The reason is that i is of type size_t, which is unsigned and therefore never negative, so the condition i >= 0 is always true.
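
For illustration, one way to walk the buckets backwards with an unsigned index and without the wrap-around problem (a sketch in the context of the quoted code, not a change made in this PR):

/* i starts at num_buckets and is decremented before each access, so the loop
 * stops cleanly when i would reach 0 and never evaluates i >= 0 on size_t. */
for (size_t i = num_buckets; i-- > 0 && buckets[i].max_boundary > gauge;) {
  buckets[i].counter++;
}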

Comment on lines +31 to +34
typedef struct {
double max_boundary;
uint64_t counter;
} bucket_t;

Could you please add a comment about what the buckets look like? What are the minimal boundaries? Are the minimal and maximal boundaries inclusive or exclusive?

Author

Right, I will add that. Thanks!


Comment on lines 180 to 197
if (num_boundaries > 0) {
if (custom_buckets_boundaries[0] <= 0 ||
custom_buckets_boundaries[0] == INFINITY) {
free(buckets);
errno = EINVAL;
return NULL;
}

buckets[0].max_boundary = custom_buckets_boundaries[0];

for (size_t i = 1; i < num_boundaries; ++i) {
if (custom_buckets_boundaries[i] <= 0 ||
custom_buckets_boundaries[i] == INFINITY ||
custom_buckets_boundaries[i - 1] >= custom_buckets_boundaries[i]) {
free(buckets);
errno = EINVAL;
return NULL;
}

I'd prefer to check the boundaries first and only then allocate memory for the buckets, because memory allocation is a very expensive operation and I'd like to avoid it when unnecessary.

Author

Great idea, I will change it! Thanks!
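
A sketch of that reordering, with validation first and allocation only afterwards (names follow the quoted code; it assumes num_buckets is num_boundaries + 1, the extra last bucket being the open-ended one, which the final version in the PR may handle differently):

/* Validate all user-supplied boundaries before any allocation. */
for (size_t i = 0; i < num_boundaries; ++i) {
  if (custom_buckets_boundaries[i] <= 0 ||
      custom_buckets_boundaries[i] == INFINITY ||
      (i > 0 && custom_buckets_boundaries[i - 1] >= custom_buckets_boundaries[i])) {
    errno = EINVAL;
    return NULL;
  }
}

bucket_t *buckets = calloc(num_boundaries + 1, sizeof(*buckets));
if (buckets == NULL)
  return NULL;
for (size_t i = 0; i < num_boundaries; ++i)
  buckets[i].max_boundary = custom_buckets_boundaries[i];
buckets[num_boundaries].max_boundary = INFINITY;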

Comment on lines 235 to 238
if (num_buckets == 0 || base <= 0 || factor <= 0) {
errno = EINVAL;
return NULL;
}

I'd argue that base should be greater than 1. Otherwise, the buckets will become smaller and smaller

Author

You're right! I wanted to add this condition, but then I forgot about it. Fixed!

.factor = 4.64,
.want_get = array_new_exponential(26, 1.01, 4.64),
}};


I'd add a corner case when 0 < base < 1. And when base = 1

Author

I will add that, thank you!

bkjg added 2 commits August 12, 2020 09:09

This commit will add small fixes in the benchmark and in the functions
for the distribution_t structure. In the benchmark we now make sure
that the compiler won't remove some lines because of an unused return
value, and this commit introduces checks that the pointers to the
time-measuring variables are not null and that the distribution_t
structures are not null. Also, in the function for initializing the
exponential distribution, this commit introduces a fix that ensures
the base is greater than 1.
This commit will add the functionality to measure the time taken by
particular types of distributions to complete the most important
operations of the distribution metrics: updating and calculating a
percentile.
Comment on lines +336 to +337
uint64_t quantity = (uint64_t)(
(percent / 100.0) * (double)d->buckets[d->num_buckets - 1].counter);

BTW, what would you return if the distribution was created but has no updates yet? I'd argue that you should return NAN.

Author

Fixed, thanks!
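
A sketch of the guard described here, assuming (as in the quoted lines) that the last bucket's counter holds the total number of updates, and ignoring the surrounding locking for brevity:

/* An empty distribution has no meaningful percentile, so report NAN. */
if (d->buckets[d->num_buckets - 1].counter == 0)
  return NAN;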
