Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collector/filesystem: Handle Statfs_t overflows #2965

Closed
wants to merge 3 commits into from

Conversation

rexagod
Copy link
Contributor

@rexagod rexagod commented Mar 19, 2024

Handle cases where, owing to multiplying two uint64 integers and typecasting it to float64, the overall precision is lost when the values concerned exceed the floatMantissa64 (1 << 53) before or after the operation (which is well within the acceptable uint64 range).

Fixes: #1672

@rexagod rexagod force-pushed the 1672 branch 5 times, most recently from a18b677 to 125c2e4 Compare March 30, 2024 23:01
Handle cases where, owing to multiplying two `uint64` integers and
typecasting it to `float64`, the overall precision is lost when the
values concerned exceed the `floatMantissa64` (1 << 53) before or after
the operation (which is well within the acceptable `uint64` range).

Fixes: prometheus#1672

Signed-off-by: Pranshu Srivastava <[email protected]>
Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dswarbrick
Copy link
Contributor

If you are going to make use of math/big, wouldn't it make more sense to handle the number as big.Int (since they are originally integer types), and only convert to float64 at the end to hand off to the channel? I don't really see the point in prematurely casting them to big.Float, since int(a) * int(b) will by definition produce an int.

@rexagod
Copy link
Contributor Author

rexagod commented May 20, 2024

Feel free to correct me if I'm missing something, but I believe we are casting them right before returning? Also, except Bsize, all Statfs_t fields being used here are uint types (not int).

Additionally, int(a) * int(b) is also subject to under-or-over-flow, so casting needs to be done before the operation happens, otherwise the int(c) produced as a result of implicit casting may not be the expected value.

@dswarbrick
Copy link
Contributor

dswarbrick commented May 20, 2024

@rexagod Yes, I'm aware that int(a) * int(b) (in the computer programming sense) can under/overflow. I meant more in the mathematical sense - a whole number multiplied by a whole number will always produce a whole number result.

Since big.Int can handle integers of arbitrary size, it is safe to multiply them without fear of under/overflow. Hence you do not need to prematurely cast to big.Float as you do here:

size, _ := new(big.Float).SetUint64(buf.Blocks).Mul(new(big.Float).SetUint64(buf.Blocks), new(big.Float).SetInt64(int64(buf.Bsize))).Float64()

This can be simplified to:

size, _ := new(big.Int).Mul(new(big.Int).SetUint64(buf.Blocks), new(big.Int).SetInt64(buf.Bsize)).Float64()

Keep things simple during the calculation, and only cast to float64 at the end for Prometheus.

@rexagod
Copy link
Contributor Author

rexagod commented May 21, 2024

I see your point. Even though we explicitly .Float64() in both of the cases, the latter ensures that precision isn't lost owing to the premature type-casting to float64 (since big.Int isn't subject to rounding errors).

@dswarbrick
Copy link
Contributor

dswarbrick commented May 21, 2024

Ah, I see that unix.Statfs_t.Bsize is either int32 or int64, depending on the host arch - so you will at least need to explicitly cast that to int64 for SetInt64() to be happy.

This brings me to my second point though - why does this PR only address 64-bit archs, i.e. bits.UintSize == 64? If this can overflow on 64-bit, surely it's just as likely (if not more likely) to overflow on 32-bit?

TBH, I'm not entirely convinced that the original issue #1672 was caused by an overflow. The method float64(buf.Blocks) * float64(buf.Bsize), despite some premature rounding when buf.Blocks exceeds the float64 mantissa, is not going to result in orders-of-magnitude different values when the available / free bytes are calculated in the same manner.

The original issue stated:

The size in bytes is 879510155264 which is accurate but the avail bytes is so much larger the scale makes size in bytes look near zero.

We only get a Grafana screenshot in the issue. It would have been a lot more helpful to have the raw metrics, so we could perhaps establish some relationship between the values, i.e. verify whether some value was in the ballpark of a wraparound. Even more helpful would have been the output of stat -f /tmp, as this would be as close as possible to the raw counters that the Go unix.Fstatfs was returning.

The fact remains that even a version as old as the one mentioned in the issue (v1.0.0-rc.0) used the float64(buf.Bavail) * float64(buf.Bsize) method, and to overflow this, it would require such a colossally large filesystem that I don't think the human race currently has the technology to achieve it. In fact, even if buf.Bsize was 2^63-1 and buf.Bavail was 2^64-1 (i.e., their maximum possible values according to unix.Statfs_t), the product would only 9.22e+18 - a long way short of MaxFloat64 (approx. 1.7e+308).

Since I cannot see how node_exporter could have produced a value approaching 8e+22 as shown in the Grafana screenshot, my gut feeling is that the promql must have included some additional multiplier, or perhaps the panel series unit was set incorrectly, resulting in Grafana performing multiplication by an additional factor.

Using big.Int to do the multiplication is of course going to preserve precision during the multiplication and only round (potentially) at the final float64 cast. But if that method is adopted, I think it should be used regardless of the host's Uintsize.

if err != nil {
labels.deviceError = err.Error()
level.Debug(c.logger).Log("msg", "Error on statfs() system call", "rootfs", rootfsFilePath(labels.mountPoint), "err", err)
// Handle for under/over-flow for cases where: "node_filesystem_{free,avail}_bytes reporting values are larger than node_filesystem_size_bytes".
Copy link
Contributor

@dswarbrick dswarbrick May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording of this comment suggests that statfs could return a result with free or available blocks greater than total blocks, which is a logical impossibility. Presumably what you're trying to convey is that filesystems with suitably large total blocks and a suitably large block size can overflow a uint64 when multiplied together, resulting in wraparound, and the appearance that free / available bytes are greater than total bytes (assuming that they have not also caused overflow / wraparound in their calculation).

@rexagod
Copy link
Contributor Author

rexagod commented May 21, 2024

I gave this patch another thought, and while we could get something in to address the possible, but not probable over-or-under-flow wraparound that could have caused this, or atleast what was my only conclusion from the data provided in the original issue, however unlikely, since we do see the raw metrics in the panel exhibiting values that are not at all expected, I'm convinced we should defer merging this in favor of being provided the requested information or some reproducibility that would help assess the exact cause of this issue with more confidence.

cc @treydock

@rexagod rexagod closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

node_filesystem_{free,avail}_bytes reporting values larger than node_filesystem_size_bytes
4 participants