-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient Character Representations #164
Comments
I'll just note that in the dynamic character rework that should be merged into
I think it would be great to follow your idea and make either of these:
Whichever of the above empirically appears to be more performant. With the "small" unboxed byte variant of the dynamic character, we can replace using the memoized TCM with a reference to looking up costs and medians in a completely pre-computed TCM.
I think the performance gain we can get in the general case (where we don't special case the "discrete metric" or "L1-norm metric") will be an additional boon from this approach. Additionally, there might be a use case for unboxed dynamic characters for alphabets that fit into a
I think that the unboxed versions of the string alignment will be more performant than using the boxed We would still need to use the memoized TCM for storing the costs and medians, but there might be some benefit of the types being @Boarders , do you think these ideas are reasonable and in line with what the new character type is capable of? |
@recursion-ninja : Those ideas sound reasonable to me and fit perfectly into this proposal. I hadn't considered the impact on the TCM but that makes perfect sense. I'm unsure what underlying representation the various |
I forgot about the time it takes to generate the hash! I imagine that for |
Since I had the code already available I ran The relevant bits of code are: -- Note these do not keep the costs as that was awkward to do and I wanted a quick benchmark
fitch' :: (Bits b) => b -> b -> b
fitch' l r
| popCount (l .&. r) > 0 = l .&. r
| otherwise = l .|. r
fitchByteString' :: ByteString -> ByteString -> ByteString
fitchByteString' l r =
ByteString.pack (ByteString.zipWith (fitch') l r)
fitchUnbox'
:: Unboxed.Vector Word8 -> Unboxed.Vector Word8 -> Unboxed.Vector Word8
fitchUnbox' = Unboxed.zipWith fitch' The results are as follows:
Bytestring also presents a challenge with keeping the cost so in some sense it is fortunate the performance is 10x worse. |
Yikes! That's way worse than I expected. Do you think It looks like if we use the supplied primitives from |
The problem
Currently our character data is represented as follows:
Our bit vectors and bit matrices are based around @recursion-ninja's bv-little library and so these are represented as follows:
Here the arbitrary precision natural number is represented in Haskell in accordance with the gmp library. This works extremely well for arbitrary alphabet sizes encoded as arbitrary length bit vectors but is space and time inefficient for some typical inputs where we know the precise alphabet size and it is small. For example, if we have DNA characters then these can be stored in a single
Word8
. This then has the added benefit that a block of such characters can be represented by an unboxed vector.In order to measure this I took an input of the form:
and wrote a fitch style operation (identical to the one we use on the Haskell side of our code):
which I then zipped over blocks of bitvectors encoded with the two different inputs (the benchmarks are here: https://github.com/Boarders/Char-Benchmarks/blob/master/Char_Bench.hs). I used a boxed vector of natural numbers as a stand-in for our current arbitrary length bit vectors and unboxed vectors of both Word8 and Word64 for possible new encodings.
If I take
n = 100
(that is to say, a character block of length 100) then the results are as follows:Taking
n = 100000
gives the following:Note here that, from the perspective of these operations, boxed vectors of arbitrary precision nats do not scale linearly whereas it looks like that is the case (in this range) for unboxed vectors (this is probably due to whether the input can be stored in L1/L2 cache and various other mysteries of modern hardware).
This suggests there is a lot to be gained by encoding shorter alphabets in fixed width types.
Proposal
I think we should take the following steps:
This will require the character blocks can easily be extensible with new types which the design in the
new-graph-representation
branch allows for. We can also make life easier by having lenses onto the various components of static and dynamic characters from a character block. That should make it seem as though not much has changed from the perspective of a high level user of the character sequence type.For the continuous character I don't think there are many gains to be had and these benchmarks don't measure anything about those types but it is perhaps worth changing the tuple to a record of
ExtendedReal
s which is strict in each field as this is usually an easy performance gain.Using fixed width types opens up the possibility that a character block can use an unboxed vector. We can do that for all of the aforementioned characters and also we can write an
Unbox
instance for the continuous character type (which will, probably in turn, require such an instance to be written for an ExtendedReal`). This should offer good performance gains.Above I have listed types
Word128
andWord256
. Most modern CPUs have single instructions for performing operations on 128 or 256 wide registers. For example I wrote (and tested) the following C code:This is elaborate as a result of the intel intrinsics, but if you squint it is just the usual fitch-style operation. When I have the time I am going test, via ffi, what the performance is like for this when called from Haskell. I should note that such SIMD instructions will eventually be added to the native code generator of Haskell and as such if we were to use any ffi code it would be (hopefully) short-lived. We will need to experiment with using this via ffi in Haskell to see whether it is actually worthwhile (I haven't done this yet as this function expects 16 aligned inputs and so some finessing needs to take place on the Haskell side beyond the ffi finessing).
Downsides
This leads to multiple different character representations which will lead to both more code with similar (but subtlety different) operations. We will probably have to write boxed and unboxed implementations for several operations (for example a fitch-style operation). This therefore will lead to more code and probably a large-ish re-factor (though i don't envision it being too painful). There are probably others I haven't thought about.
(Other) Upsides
Beyond those mentioned already in terms of performance, I think this type of project will lead to packed characters being much easier to add to PCG. On top of this, if the SIMD 128 bit operations have good performance and we can figure out the right sort of bit masking (also using intel intrinsics) then this can lead to potentially 16x single core parallelism on characters that fit in a single byte (though that doesn't take into account the masking cost). I think it is well worth exploring this possibility.
The text was updated successfully, but these errors were encountered: