collector/filesystem: Handle `Statfs_t` overflows #2965

rexagod · 2024-03-19T20:18:59Z

Handle cases where, owing to multiplying two uint64 integers and typecasting it to float64, the overall precision is lost when the values concerned exceed the floatMantissa64 (1 << 53) before or after the operation (which is well within the acceptable uint64 range).

Fixes: #1672

Handle cases where, owing to multiplying two `uint64` integers and typecasting it to `float64`, the overall precision is lost when the values concerned exceed the `floatMantissa64` (1 << 53) before or after the operation (which is well within the acceptable `uint64` range). Fixes: prometheus#1672 Signed-off-by: Pranshu Srivastava <[email protected]>

collector/filesystem_linux.go

Signed-off-by: Pranshu Srivastava <[email protected]>

discordianfish

LGTM

dswarbrick · 2024-05-16T18:12:48Z

If you are going to make use of math/big, wouldn't it make more sense to handle the number as big.Int (since they are originally integer types), and only convert to float64 at the end to hand off to the channel? I don't really see the point in prematurely casting them to big.Float, since int(a) * int(b) will by definition produce an int.

rexagod · 2024-05-20T10:09:40Z

Feel free to correct me if I'm missing something, but I believe we are casting them right before returning? Also, except Bsize, all Statfs_t fields being used here are uint types (not int).

Additionally, int(a) * int(b) is also subject to under-or-over-flow, so casting needs to be done before the operation happens, otherwise the int(c) produced as a result of implicit casting may not be the expected value.

dswarbrick · 2024-05-20T20:14:09Z

@rexagod Yes, I'm aware that int(a) * int(b) (in the computer programming sense) can under/overflow. I meant more in the mathematical sense - a whole number multiplied by a whole number will always produce a whole number result.

Since big.Int can handle integers of arbitrary size, it is safe to multiply them without fear of under/overflow. Hence you do not need to prematurely cast to big.Float as you do here:

size, _ := new(big.Float).SetUint64(buf.Blocks).Mul(new(big.Float).SetUint64(buf.Blocks), new(big.Float).SetInt64(int64(buf.Bsize))).Float64()

This can be simplified to:

size, _ := new(big.Int).Mul(new(big.Int).SetUint64(buf.Blocks), new(big.Int).SetInt64(buf.Bsize)).Float64()

Keep things simple during the calculation, and only cast to float64 at the end for Prometheus.

rexagod · 2024-05-21T01:20:51Z

I see your point. Even though we explicitly .Float64() in both of the cases, the latter ensures that precision isn't lost owing to the premature type-casting to float64 (since big.Int isn't subject to rounding errors).

Signed-off-by: Pranshu Srivastava <[email protected]>

dswarbrick · 2024-05-21T01:59:15Z

Ah, I see that unix.Statfs_t.Bsize is either int32 or int64, depending on the host arch - so you will at least need to explicitly cast that to int64 for SetInt64() to be happy.

This brings me to my second point though - why does this PR only address 64-bit archs, i.e. bits.UintSize == 64? If this can overflow on 64-bit, surely it's just as likely (if not more likely) to overflow on 32-bit?

TBH, I'm not entirely convinced that the original issue #1672 was caused by an overflow. The method float64(buf.Blocks) * float64(buf.Bsize), despite some premature rounding when buf.Blocks exceeds the float64 mantissa, is not going to result in orders-of-magnitude different values when the available / free bytes are calculated in the same manner.

The original issue stated:

The size in bytes is 879510155264 which is accurate but the avail bytes is so much larger the scale makes size in bytes look near zero.

We only get a Grafana screenshot in the issue. It would have been a lot more helpful to have the raw metrics, so we could perhaps establish some relationship between the values, i.e. verify whether some value was in the ballpark of a wraparound. Even more helpful would have been the output of stat -f /tmp, as this would be as close as possible to the raw counters that the Go unix.Fstatfs was returning.

The fact remains that even a version as old as the one mentioned in the issue (v1.0.0-rc.0) used the float64(buf.Bavail) * float64(buf.Bsize) method, and to overflow this, it would require such a colossally large filesystem that I don't think the human race currently has the technology to achieve it. In fact, even if buf.Bsize was 2^63-1 and buf.Bavail was 2^64-1 (i.e., their maximum possible values according to unix.Statfs_t), the product would only 9.22e+18 - a long way short of MaxFloat64 (approx. 1.7e+308).

Since I cannot see how node_exporter could have produced a value approaching 8e+22 as shown in the Grafana screenshot, my gut feeling is that the promql must have included some additional multiplier, or perhaps the panel series unit was set incorrectly, resulting in Grafana performing multiplication by an additional factor.

Using big.Int to do the multiplication is of course going to preserve precision during the multiplication and only round (potentially) at the final float64 cast. But if that method is adopted, I think it should be used regardless of the host's Uintsize.

dswarbrick · 2024-05-21T02:52:24Z

collector/filesystem_linux.go

-	if err != nil {
-		labels.deviceError = err.Error()
-		level.Debug(c.logger).Log("msg", "Error on statfs() system call", "rootfs", rootfsFilePath(labels.mountPoint), "err", err)
+	// Handle for under/over-flow for cases where: "node_filesystem_{free,avail}_bytes reporting values are larger than node_filesystem_size_bytes".


The wording of this comment suggests that statfs could return a result with free or available blocks greater than total blocks, which is a logical impossibility. Presumably what you're trying to convey is that filesystems with suitably large total blocks and a suitably large block size can overflow a uint64 when multiplied together, resulting in wraparound, and the appearance that free / available bytes are greater than total bytes (assuming that they have not also caused overflow / wraparound in their calculation).

rexagod · 2024-05-21T07:11:08Z

I gave this patch another thought, and while we could get something in to address the possible, but not probable over-or-under-flow wraparound that could have caused this, or atleast what was my only conclusion from the data provided in the original issue, however unlikely, since we do see the raw metrics in the panel exhibiting values that are not at all expected, I'm convinced we should defer merging this in favor of being provided the requested information or some reproducibility that would help assess the exact cause of this issue with more confidence.

cc @treydock

rexagod force-pushed the 1672 branch 5 times, most recently from a18b677 to 125c2e4 Compare March 30, 2024 23:01

rexagod force-pushed the 1672 branch from 125c2e4 to cb5b0f9 Compare March 30, 2024 23:02

discordianfish reviewed Apr 10, 2024

View reviewed changes

collector/filesystem_linux.go Show resolved Hide resolved

fixup! collector/filesystem: Handle Statfs_t overflows

692c955

Signed-off-by: Pranshu Srivastava <[email protected]>

rexagod force-pushed the 1672 branch from 6765f62 to 692c955 Compare May 12, 2024 17:56

discordianfish approved these changes May 13, 2024

View reviewed changes

discordianfish assigned SuperQ May 13, 2024

fixup! fixup! collector/filesystem: Handle Statfs_t overflows

1c5456a

Signed-off-by: Pranshu Srivastava <[email protected]>

rexagod force-pushed the 1672 branch from 38773a1 to 1c5456a Compare May 21, 2024 01:21

dswarbrick reviewed May 21, 2024

View reviewed changes

rexagod closed this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

collector/filesystem: Handle `Statfs_t` overflows #2965

collector/filesystem: Handle `Statfs_t` overflows #2965

rexagod commented Mar 19, 2024

discordianfish left a comment

dswarbrick commented May 16, 2024

rexagod commented May 20, 2024

dswarbrick commented May 20, 2024 •

edited

Loading

rexagod commented May 21, 2024

dswarbrick commented May 21, 2024 •

edited

Loading

dswarbrick May 21, 2024 •

edited

Loading

rexagod commented May 21, 2024

collector/filesystem: Handle Statfs_t overflows #2965

collector/filesystem: Handle Statfs_t overflows #2965

Conversation

rexagod commented Mar 19, 2024

discordianfish left a comment

Choose a reason for hiding this comment

dswarbrick commented May 16, 2024

rexagod commented May 20, 2024

dswarbrick commented May 20, 2024 • edited Loading

rexagod commented May 21, 2024

dswarbrick commented May 21, 2024 • edited Loading

dswarbrick May 21, 2024 • edited Loading

Choose a reason for hiding this comment

rexagod commented May 21, 2024

collector/filesystem: Handle `Statfs_t` overflows #2965

collector/filesystem: Handle `Statfs_t` overflows #2965

dswarbrick commented May 20, 2024 •

edited

Loading

dswarbrick commented May 21, 2024 •

edited

Loading

dswarbrick May 21, 2024 •

edited

Loading