
Optimize toByteString and toASCIIBytes #80

Open
lykahb wants to merge 6 commits into master
Conversation

@lykahb commented Dec 2, 2023

This PR leverages the bytestring fixed-length builders to simplify and speed up the conversions. The reimplementation of toASCIIBytes is higher-level and safer.

I bumped the bytestring dependency lower bound to the version that introduced Data.ByteString.Builder.Prim. That version was released in 2012, so there is plenty of backwards compatibility.
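To make the approach concrete, here is a minimal sketch of rendering the 8-4-4-4-12 layout with fixed-size primitives. It is illustrative rather than the exact PR code: dashed, uuidPrim, and renderUUID are made-up names, and the input is assumed to be the four words that Data.UUID.Types.toWords produces.

import           Data.Bits                    (shiftR)
import qualified Data.ByteString              as B
import qualified Data.ByteString.Builder      as BB
import qualified Data.ByteString.Builder.Prim as BBP
import           Data.ByteString.Builder.Prim ((>$<), (>*<))
import qualified Data.ByteString.Lazy         as BL
import           Data.Word                    (Word16, Word32)

-- Append a '-' after a fixed-size encoding.
dashed :: BBP.FixedPrim a -> BBP.FixedPrim a
dashed f = (\x -> (x, '-')) >$< (f >*< BBP.char8)

-- xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, 36 bytes in total.
uuidPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32)))))
uuidPrim =
        dashed BBP.word32HexFixed
    >*< dashed BBP.word16HexFixed
    >*< dashed BBP.word16HexFixed
    >*< dashed BBP.word16HexFixed
    >*< BBP.word16HexFixed
    >*< BBP.word32HexFixed

-- Regroup the UUID's four words into the nested tuple above.
renderUUID :: (Word32, Word32, Word32, Word32) -> B.ByteString
renderUUID (w0, w1, w2, w3) =
    BL.toStrict $ BB.toLazyByteString $ BBP.primFixed uuidPrim
        (w0, (hi16 w1, (lo16 w1, (hi16 w2, (lo16 w2, w3)))))
  where
    hi16, lo16 :: Word32 -> Word16
    hi16 w = fromIntegral (w `shiftR` 16)
    lo16 w = fromIntegral w

With toWords from Data.UUID.Types, toByteString then becomes a one-liner along the lines of renderUUID . toWords.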

Here are benchmarks on a MacBook M1 Max:

Before

benchmarking conversion/toASCIIBytes
time                 34.23 ns   (34.16 ns .. 34.35 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.21 ns   (34.18 ns .. 34.29 ns)
std dev              169.2 ps   (40.23 ps .. 290.2 ps)

benchmarking conversion/toByteString
time                 74.16 ns   (73.96 ns .. 74.43 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 74.07 ns   (73.98 ns .. 74.25 ns)
std dev              398.7 ps   (176.1 ps .. 649.3 ps)

After

benchmarking conversion/toASCIIBytes
time                 28.61 ns   (28.59 ns .. 28.63 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.62 ns   (28.60 ns .. 28.65 ns)
std dev              56.85 ps   (26.42 ps .. 117.3 ps)

benchmarking conversion/toByteString
time                 20.87 ns   (20.81 ns .. 20.96 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 20.94 ns   (20.89 ns .. 21.01 ns)
std dev              200.6 ps   (144.6 ps .. 259.8 ps)

@lykahb (Author) commented Mar 18, 2024

@phadej This PR is waiting for a review.

@phadej (Collaborator) left a comment

I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.

I don't see a point in complicating it.

@lykahb
Copy link
Author

lykahb commented Mar 19, 2024

It is hard to tell what makes the high-level implementation faster. The old toByteString packed the characters from a list, which looks suboptimal. But the old toASCIIBytes does only low-level operations and looks quite simple; as you said, it is hard to imagine anything being faster.

Yet the benchmarks show that the high-level implementation is faster. It may unlock some GHC optimizations, or bytestring may have other non-trivial optimizations of its own. Comparing only the functions without extra overhead (the old toASCIIBytes against toByteString in this PR), the new code is about a third faster.

If uuid delegates the low-level work to bytestring, it will benefit from any future performance improvements there. It should also be easier to maintain: this diff mostly removes code and lets the library focus on the UUID format rather than on low-level optimization.

There is also more room for speedup: toASCIIBytes can be just as fast as toByteString if we remove the copy (haskell/bytestring#666). For better compatibility, I avoided using the internal module Data.ByteString.Builder.Prim.Internal (runF) in this PR.

@clyring commented Mar 19, 2024

If I had to guess, the reason the FixedPrim implementation in this patch is faster is probably to do with BBP.word32HexFixed being a branchless two-nibbles-at-a-time lookup instead of the branchy one-nibble-at-a-time calculation in the baseline.
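For illustration, here is a rough sketch of the two strategies. The names nibbleHex and byteHexPairs are made up, and bytestring's actual implementation appears later in this thread:

import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- Branchy, one nibble at a time: a comparison per output digit.
nibbleHex :: Word8 -> Word8
nibbleHex n
  | n < 10    = 48 + n  -- '0'..'9'
  | otherwise = 87 + n  -- 'a'..'f'

-- Branchless, two nibbles at a time: the encoder indexes a
-- precomputed 256-entry table of ASCII digit pairs like this one and
-- performs a single unaligned 16-bit store per input byte.
byteHexPairs :: [(Word8, Word8)]
byteHexPairs =
  [ (nibbleHex (b `shiftR` 4), nibbleHex (b .&. 0x0f))
  | b <- [minBound .. maxBound] ]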

@lykahb (Author) commented Mar 19, 2024

@clyring clarified that the module Data.ByteString.Builder.Prim.Internal is stable. I rewrote the string conversion without the Builder and got a further improvement:

benchmarking conversion/toASCIIBytes
time                 21.56 ns   (21.18 ns .. 22.30 ns)
                     0.995 R²   (0.988 R² .. 1.000 R²)
mean                 21.36 ns   (21.20 ns .. 21.95 ns)
std dev              968.8 ps   (101.2 ps .. 2.048 ns)


benchmarking conversion/toByteString
time                 17.85 ns   (17.82 ns .. 17.89 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.87 ns   (17.84 ns .. 17.95 ns)
std dev              166.1 ps   (89.32 ps .. 297.0 ps)
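The shape of that change, roughly: poke the fixed-size encoding straight into a freshly allocated buffer instead of going through the Builder machinery. This sketch reuses uuidPrim from the sketch in the PR description and is illustrative, not the exact diff; size and runF come from Data.ByteString.Builder.Prim.Internal.

import           Data.Bits                             (shiftR)
import qualified Data.ByteString                       as B
import qualified Data.ByteString.Builder.Prim.Internal as BBPI
import qualified Data.ByteString.Internal              as BI
import           Data.Word                             (Word16, Word32)

-- Write the 36 bytes directly, skipping chunk bookkeeping entirely.
renderUUID' :: (Word32, Word32, Word32, Word32) -> B.ByteString
renderUUID' (w0, w1, w2, w3) =
    BI.unsafeCreate (BBPI.size uuidPrim) $  -- size uuidPrim == 36
        BBPI.runF uuidPrim
            (w0, (hi16 w1, (lo16 w1, (hi16 w2, (lo16 w2, w3)))))
  where
    hi16, lo16 :: Word32 -> Word16
    hi16 w = fromIntegral (w `shiftR` 16)
    lo16 w = fromIntegral w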

@lykahb (Author) commented Mar 20, 2024

With the recent update, this branch makes toASCIIBytes 1.6x faster (34.23 ns → 21.56 ns) and toByteString 4.15x faster (74.16 ns → 17.85 ns).

@phadej Would you give this another look? This is a significant improvement, and it reduces the amount of code to maintain in uuid. High-performance low-level code seems to be a better fit for core libraries of the Haskell ecosystem like bytestring.

@iand675 commented Mar 20, 2024

I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.

@phadej I appreciate that you'd like to keep the code uncomplicated, but the benchmarks indicate a solid improvement, so I don't understand the resistance.

@phadej (Collaborator) commented Mar 20, 2024

@iand675 the original patch was more complicated. And my review comment made lykahb improve on it.

Your comment is not fair.

@iand675 commented Mar 20, 2024

It's true that @lykahb made a further improvement, but I don't think it's unfair to point out that he had already provided benchmarks showing a solid performance increase before you said you didn't want to "complicate the code".

where
(w0, w1, w2, w3) = toWords uuid
wordFixedPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32)))))
wordFixedPrim = BBP.word32HexFixed BBP.>*<
@phadej (Collaborator) Mar 20, 2024

Note to self: the word<N>HexFixed encoders call into a C function in bytestring:

char* _hs_bytestring_uint_hex (unsigned int x, char* buf) {
    // write hex representation in reverse order
    char c, *ptr = buf, *next_free;
    do {
        *ptr++ = digits[x & 0xf];
        x >>= 4;
    } while ( x );
    // invert written digits
    next_free = ptr--;
    while(buf < ptr) {
        c      = *ptr;
        *ptr-- = *buf;
        *buf++ = c;
    }
    return next_free;
};

Fascinating that the loop is faster (?) than an unrolled version. Maybe GCC does magic. For "random" UUIDs, there shouldn't be a win in short-circuiting the loop.

@phadej (Collaborator)

No, that's wrong, the Fixed variants just split the number into halves:

https://hackage.haskell.org/package/bytestring-0.12.1.0/docs/src/Data.ByteString.Builder.Prim.ASCII.html#word32HexFixed

-- | Encode a 'Word8' using 2 nibbles (hexadecimal digits).
{-# INLINE word8HexFixed #-}
word8HexFixed :: FixedPrim Word8
word8HexFixed = fixedPrim 2 $ \x op -> do
  enc <- encode8_as_16h lowerTable x
  unalignedWriteU16 enc op

-- | Encode a 'Word16' using 4 nibbles.
{-# INLINE word16HexFixed #-}
word16HexFixed :: FixedPrim Word16
word16HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 8, fromIntegral x))
      >$< pairF word8HexFixed word8HexFixed

-- | Encode a 'Word32' using 8 nibbles.
{-# INLINE word32HexFixed #-}
word32HexFixed :: FixedPrim Word32
word32HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 16, fromIntegral x))
      >$< pairF word16HexFixed word16HexFixed

-- | Encode a 'Word64' using 16 nibbles.
{-# INLINE word64HexFixed #-}
word64HexFixed :: FixedPrim Word64
word64HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 32, fromIntegral x))
      >$< pairF word32HexFixed word32HexFixed

@phadej (Collaborator)
So that's what #80 (comment) was saying

@phadej (Collaborator) commented Mar 20, 2024

@iand675

"complicate the code"

If we don't understand why code works as it does, does it work?

In my opinion it doesn't in the long term. I'm the one who is maintaining this, so I'm making the judgment calls.

This PR has a lot of good in it, but it's not perfect; I'll take it from here.

Note to self: benchmark on x86_64

@lykahb (Author) commented Mar 21, 2024

For the sake of clarity, is your objection to the use of Data.ByteString.Builder.Prim in general, or do you perceive anything else here as overly complicated? The only other thing I can think of is converting a Builder to a ByteString with toLazyByteStringWith, and that is gone now, following clyring's comment.
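For reference, that Builder-to-ByteString step looked roughly like this; the sizing parameters are illustrative, and toLazyByteStringWith and untrimmedStrategy come from Data.ByteString.Builder.Extra:

import qualified Data.ByteString               as B
import qualified Data.ByteString.Builder       as BB
import qualified Data.ByteString.Builder.Extra as BBE
import qualified Data.ByteString.Lazy          as BL

-- Run the Builder with a first chunk already sized for the 36-byte
-- output, then flatten the single-chunk lazy result to a strict one.
builderToStrict :: BB.Builder -> B.ByteString
builderToStrict =
    BL.toStrict
  . BBE.toLazyByteStringWith (BBE.untrimmedStrategy 36 36) BL.empty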

@prikhi (Contributor) commented Mar 21, 2024

Note to self: benchmark on x86_64

Is there a trick to reducing these variance values?

On an i9-13900K:

$  cabal bench --project-file cabal.bench.project uuid-types-benchmark --ghc-options='-O2 -threaded' --benchmark-options='-m pattern toASCIIBytes toByteString'
# BEFORE
benchmarking conversion/toASCIIBytes
time                 23.57 ns   (23.45 ns .. 23.68 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 23.55 ns   (23.40 ns .. 23.91 ns)
std dev              732.5 ps   (376.0 ps .. 1.369 ns)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 49.83 ns   (49.29 ns .. 50.46 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 50.18 ns   (49.82 ns .. 50.54 ns)
std dev              1.218 ns   (1.078 ns .. 1.415 ns)
variance introduced by outliers: 37% (moderately inflated)


# AFTER
benchmarking conversion/toASCIIBytes
time                 13.18 ns   (13.09 ns .. 13.28 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 13.19 ns   (13.09 ns .. 13.33 ns)
std dev              404.7 ps   (295.9 ps .. 574.5 ps)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 13.03 ns   (12.97 ns .. 13.09 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.01 ns   (12.95 ns .. 13.06 ns)
std dev              184.5 ps   (126.4 ps .. 291.3 ps)
variance introduced by outliers: 18% (moderately inflated)
