
Optimize toByteString and toASCIIBytes #80

Open
lykahb wants to merge 6 commits into master
Conversation

@lykahb commented Dec 2, 2023

This PR leverages the bytestring fixed-length builders to simplify and speed up the conversions. The reimplementation of toASCIIBytes is higher-level and safer.

I bumped the bytestring dependency lower bound to the version that introduced Data.ByteString.Builder.Prim. That version was released in 2012, so there is plenty of backwards compatibility.
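To make the approach concrete, here is a minimal sketch of rendering the 8-4-4-4-12 layout with fixed-size primitives. It is illustrative rather than the exact PR code: dashed, uuidPrim, and renderUUID are made-up names, and the input is assumed to be the four words that Data.UUID.Types.toWords produces.

import           Data.Bits                    (shiftR)
import qualified Data.ByteString              as B
import qualified Data.ByteString.Builder      as BB
import qualified Data.ByteString.Builder.Prim as BBP
import           Data.ByteString.Builder.Prim ((>$<), (>*<))
import qualified Data.ByteString.Lazy         as BL
import           Data.Word                    (Word16, Word32)

-- Append a '-' after a fixed-size encoding.
dashed :: BBP.FixedPrim a -> BBP.FixedPrim a
dashed f = (\x -> (x, '-')) >$< (f >*< BBP.char8)

-- xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, 36 bytes in total.
uuidPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32)))))
uuidPrim =
        dashed BBP.word32HexFixed
    >*< dashed BBP.word16HexFixed
    >*< dashed BBP.word16HexFixed
    >*< dashed BBP.word16HexFixed
    >*< BBP.word16HexFixed
    >*< BBP.word32HexFixed

-- Regroup the UUID's four words into the nested tuple above.
renderUUID :: (Word32, Word32, Word32, Word32) -> B.ByteString
renderUUID (w0, w1, w2, w3) =
    BL.toStrict $ BB.toLazyByteString $ BBP.primFixed uuidPrim
        (w0, (hi16 w1, (lo16 w1, (hi16 w2, (lo16 w2, w3)))))
  where
    hi16, lo16 :: Word32 -> Word16
    hi16 w = fromIntegral (w `shiftR` 16)
    lo16 w = fromIntegral w

With toWords from Data.UUID.Types, toByteString then becomes a one-liner along the lines of renderUUID . toWords.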

Here are benchmarks on a MacBook M1 Max:

Before

benchmarking conversion/toASCIIBytes
time                 34.23 ns   (34.16 ns .. 34.35 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.21 ns   (34.18 ns .. 34.29 ns)
std dev              169.2 ps   (40.23 ps .. 290.2 ps)

benchmarking conversion/toByteString
time                 74.16 ns   (73.96 ns .. 74.43 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 74.07 ns   (73.98 ns .. 74.25 ns)
std dev              398.7 ps   (176.1 ps .. 649.3 ps)

After

benchmarking conversion/toASCIIBytes
time                 28.61 ns   (28.59 ns .. 28.63 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.62 ns   (28.60 ns .. 28.65 ns)
std dev              56.85 ps   (26.42 ps .. 117.3 ps)

benchmarking conversion/toByteString
time                 20.87 ns   (20.81 ns .. 20.96 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 20.94 ns   (20.89 ns .. 21.01 ns)
std dev              200.6 ps   (144.6 ps .. 259.8 ps)

@lykahb (Author) commented Mar 18, 2024

@phadej This PR is waiting for a review.

@phadej (Collaborator) left a comment

I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.

I don't see a point in complicating it.

@lykahb
Copy link
Author

lykahb commented Mar 19, 2024

It is hard to tell what makes the high-level implementation faster. The old toByteString packed the characters from a list, which looks suboptimal. But the old toASCIIBytes does only low-level operations and looks quite simple; as you said, it is hard to imagine anything being faster.

Yet the benchmarks show that the high-level implementation is faster. It may unlock some GHC optimizations, or bytestring may have other non-trivial optimizations of its own. Comparing only the functions without extra overhead (the old toASCIIBytes against toByteString in this PR), the new code is about a third faster.

If uuid delegates the low-level work to bytestring, it will benefit from any future performance improvements there. It should also be easier to maintain: this diff mostly removes code and lets the library focus on the UUID format rather than on low-level optimization.

There is also more room for speedup: toASCIIBytes can be just as fast as toByteString if we remove the copy (haskell/bytestring#666). For better compatibility, I avoided using the internal module Data.ByteString.Builder.Prim.Internal (runF) in this PR.

@clyring commented Mar 19, 2024

If I had to guess, the reason the FixedPrim implementation in this patch is faster is probably to do with BBP.word32HexFixed being a branchless two-nibbles-at-a-time lookup instead of the branchy one-nibble-at-a-time calculation in the baseline.
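For illustration, here is a rough sketch of the two strategies. The names nibbleHex and byteHexPairs are made up, and bytestring's actual implementation appears later in this thread:

import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- Branchy, one nibble at a time: a comparison per output digit.
nibbleHex :: Word8 -> Word8
nibbleHex n
  | n < 10    = 48 + n  -- '0'..'9'
  | otherwise = 87 + n  -- 'a'..'f'

-- Branchless, two nibbles at a time: the encoder indexes a
-- precomputed 256-entry table of ASCII digit pairs like this one and
-- performs a single unaligned 16-bit store per input byte.
byteHexPairs :: [(Word8, Word8)]
byteHexPairs =
  [ (nibbleHex (b `shiftR` 4), nibbleHex (b .&. 0x0f))
  | b <- [minBound .. maxBound] ]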

@lykahb (Author) commented Mar 19, 2024

@clyring clarified that the module Data.ByteString.Builder.Prim.Internal is stable. I rewrote the string conversion without the Builder and got a further improvement:

benchmarking conversion/toASCIIBytes
time                 21.56 ns   (21.18 ns .. 22.30 ns)
                     0.995 R²   (0.988 R² .. 1.000 R²)
mean                 21.36 ns   (21.20 ns .. 21.95 ns)
std dev              968.8 ps   (101.2 ps .. 2.048 ns)


benchmarking conversion/toByteString
time                 17.85 ns   (17.82 ns .. 17.89 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.87 ns   (17.84 ns .. 17.95 ns)
std dev              166.1 ps   (89.32 ps .. 297.0 ps)
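The shape of that change, roughly: poke the fixed-size encoding straight into a freshly allocated buffer instead of going through the Builder machinery. This sketch reuses uuidPrim from the sketch in the PR description and is illustrative, not the exact diff; size and runF come from Data.ByteString.Builder.Prim.Internal.

import           Data.Bits                             (shiftR)
import qualified Data.ByteString                       as B
import qualified Data.ByteString.Builder.Prim.Internal as BBPI
import qualified Data.ByteString.Internal              as BI
import           Data.Word                             (Word16, Word32)

-- Write the 36 bytes directly, skipping chunk bookkeeping entirely.
renderUUID' :: (Word32, Word32, Word32, Word32) -> B.ByteString
renderUUID' (w0, w1, w2, w3) =
    BI.unsafeCreate (BBPI.size uuidPrim) $  -- size uuidPrim == 36
        BBPI.runF uuidPrim
            (w0, (hi16 w1, (lo16 w1, (hi16 w2, (lo16 w2, w3)))))
  where
    hi16, lo16 :: Word32 -> Word16
    hi16 w = fromIntegral (w `shiftR` 16)
    lo16 w = fromIntegral w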

@lykahb (Author) commented Mar 20, 2024

With the recent update, this branch makes toASCIIBytes 1.6x faster (34.23 ns → 21.56 ns) and toByteString 4.15x faster (74.16 ns → 17.85 ns).

@phadej Would you give this another look? This is a significant improvement, and it reduces the amount of code to maintain in uuid. High-performance low-level code seems to be a better fit for core libraries of the Haskell ecosystem like bytestring.

@iand675 commented Mar 20, 2024

I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.

@phadej I appreciate that you'd like to keep the code uncomplicated, but the benchmarks indicate a solid improvement, so I don't understand the resistance.

@phadej (Collaborator) commented Mar 20, 2024

@iand675 the original patch was more complicated. And my review comment made lykahb improve on it.

Your comment is not fair.

@iand675 commented Mar 20, 2024

It's true that @lykahb made a further improvement, but I don't think it's unfair to point out that he had already provided benchmarks showing a solid performance increase before you said you didn't want to "complicate the code".

where
(w0, w1, w2, w3) = toWords uuid
wordFixedPrim :: BBP.FixedPrim (Word32, (Word16, (Word16, (Word16, (Word16, Word32)))))
wordFixedPrim = BBP.word32HexFixed BBP.>*<
@phadej (Collaborator) Mar 20, 2024

Note to self: the word<N>HexFixed encoders call into a C function in bytestring:

char* _hs_bytestring_uint_hex (unsigned int x, char* buf) {
    // write hex representation in reverse order
    char c, *ptr = buf, *next_free;
    do {
        *ptr++ = digits[x & 0xf];
        x >>= 4;
    } while ( x );
    // invert written digits
    next_free = ptr--;
    while(buf < ptr) {
        c      = *ptr;
        *ptr-- = *buf;
        *buf++ = c;
    }
    return next_free;
};

Fascinating that the loop is faster (?) than an unrolled version. Maybe GCC does magic. For "random" UUIDs, there shouldn't be a win in short-circuiting the loop.

@phadej (Collaborator)

No, that's wrong, the Fixed variants just split the number into halves:

https://hackage.haskell.org/package/bytestring-0.12.1.0/docs/src/Data.ByteString.Builder.Prim.ASCII.html#word32HexFixed

-- | Encode a 'Word8' using 2 nibbles (hexadecimal digits).
{-# INLINE word8HexFixed #-}
word8HexFixed :: FixedPrim Word8
word8HexFixed = fixedPrim 2 $ \x op -> do
  enc <- encode8_as_16h lowerTable x
  unalignedWriteU16 enc op

-- | Encode a 'Word16' using 4 nibbles.
{-# INLINE word16HexFixed #-}
word16HexFixed :: FixedPrim Word16
word16HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 8, fromIntegral x))
      >$< pairF word8HexFixed word8HexFixed

-- | Encode a 'Word32' using 8 nibbles.
{-# INLINE word32HexFixed #-}
word32HexFixed :: FixedPrim Word32
word32HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 16, fromIntegral x))
      >$< pairF word16HexFixed word16HexFixed

-- | Encode a 'Word64' using 16 nibbles.
{-# INLINE word64HexFixed #-}
word64HexFixed :: FixedPrim Word64
word64HexFixed =
    (\x -> (fromIntegral $ x `shiftR` 32, fromIntegral x))
      >$< pairF word32HexFixed word32HexFixed

@phadej (Collaborator)
So that's what #80 (comment) was saying

@phadej (Collaborator) commented Mar 20, 2024

@iand675

"complicate the code"

If we don't understand why code works as it does, does it work?

In my opinion it doesn't in the long term. I'm the one who is maintaining this, so I'm making the judgment calls.

This PR has a lot of good in it, but it's not perfect; I'll take it from here.

Note to self: benchmark on x86_64

@lykahb (Author) commented Mar 21, 2024

For the sake of clarity, is your objection to the use of Data.ByteString.Builder.Prim in general, or do you perceive anything else here as overly complicated? The only other thing I can think of is converting a Builder to a ByteString with toLazyByteStringWith, and that is gone now, following clyring's comment.
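For reference, that Builder-to-ByteString step looked roughly like this; the sizing parameters are illustrative, and toLazyByteStringWith and untrimmedStrategy come from Data.ByteString.Builder.Extra:

import qualified Data.ByteString               as B
import qualified Data.ByteString.Builder       as BB
import qualified Data.ByteString.Builder.Extra as BBE
import qualified Data.ByteString.Lazy          as BL

-- Run the Builder with a first chunk already sized for the 36-byte
-- output, then flatten the single-chunk lazy result to a strict one.
builderToStrict :: BB.Builder -> B.ByteString
builderToStrict =
    BL.toStrict
  . BBE.toLazyByteStringWith (BBE.untrimmedStrategy 36 36) BL.empty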

@prikhi (Contributor) commented Mar 21, 2024

Note to self: benchmark on x86_64

Is there a trick to reducing these variance values?

On an i9-13900K:

$  cabal bench --project-file cabal.bench.project uuid-types-benchmark --ghc-options='-O2 -threaded' --benchmark-options='-m pattern toASCIIBytes toByteString'
# BEFORE
benchmarking conversion/toASCIIBytes
time                 23.57 ns   (23.45 ns .. 23.68 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 23.55 ns   (23.40 ns .. 23.91 ns)
std dev              732.5 ps   (376.0 ps .. 1.369 ns)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 49.83 ns   (49.29 ns .. 50.46 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 50.18 ns   (49.82 ns .. 50.54 ns)
std dev              1.218 ns   (1.078 ns .. 1.415 ns)
variance introduced by outliers: 37% (moderately inflated)


# AFTER
benchmarking conversion/toASCIIBytes
time                 13.18 ns   (13.09 ns .. 13.28 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 13.19 ns   (13.09 ns .. 13.33 ns)
std dev              404.7 ps   (295.9 ps .. 574.5 ps)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 13.03 ns   (12.97 ns .. 13.09 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.01 ns   (12.95 ns .. 13.06 ns)
std dev              184.5 ps   (126.4 ps .. 291.3 ps)
variance introduced by outliers: 18% (moderately inflated)
