
Implementation of Daira reduction #98

Draft: unzvfu wants to merge 12 commits into master

Conversation

@unzvfu (Collaborator) commented Jan 21, 2021

This branch contains an implementation of Daira Hopwood's reduction algorithm for the Tweedle{dee,dum} curves. More precisely, the reduction algorithm is implemented but only enabled for Tweedledee, while Tweedledum retains the original Montgomery reduction algorithm. This allows an easy benchmark comparison between the two reduction algorithms.

Unfortunately I wasn't able to get Daira's algorithm to run faster than Montgomery's, even though counting multiplications and additions, and staring at the assembly output, suggests that it should be possible. Since this code does not represent an improvement, I'm leaving this PR as a draft and do not recommend merging it at this stage. I intend to revisit it on occasion as ideas for improvement occur to me; suggestions for improvement are welcome in the comments below.

Current micro-benchmark timings as of 21 Jan 2021 (in nanoseconds):

| CPU | Intel Core i5-3230M @ 2.6GHz | Intel Core i7-7700HQ @ 2.8GHz |
|---|---|---|
| Monty mulmod | 38.8 | 22.7 |
| Daira mulmod | 44.4 | 27.7 |
| Monty sqrmod | 28.7 | 15.9 |
| Daira sqrmod | 34.1 | 20.3 |

Notes:

  • This branch uses better compiler flags than the ones in master; specifically, I set lto = "fat" and codegen-units = 1 in the toml file (source). A sketch of these settings appears after these notes.
    • The same source suggested building with RUSTFLAGS='-C target-cpu=native'. This generated code using mulx instead of plain mul (on the i7; the i5 doesn't have bmi2 or adx), but the timings ended up exactly the same.
  • Scaling governor set to "performance" in all benchmarks
  • rustc 1.51.0-nightly (d98d2f57d 2021-01-18)
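
For reference, the profile settings mentioned in the first note would look roughly like this in Cargo.toml (the exact section placement in this repository's toml file is an assumption):

```toml
# Sketch only: release-profile settings as described in the notes above.
[profile.release]
lto = "fat"          # fat link-time optimization across all crates
codegen-units = 1    # single codegen unit, maximizing optimization scope
```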

@unzvfu self-assigned this Jan 21, 2021
@dlubarov (Contributor) commented

Your implementation looks pretty solid; I tried a few trivial optimizations but they didn't have much effect. It seems like the adds and subs are taking longer than I would have expected: closer to 1 add/cycle than the theoretical throughput of 4 adds/cycle. Maybe we should look into add/sub/mul to see how much room for improvement there is, and if we end up with significant improvements there, we could come back to see if this method can become the fastest.

I started poking around the generated asm; will report back if I find anything interesting.

@dlubarov
Copy link
Contributor

The generated asm for daira_multiply is pretty long, and I haven't gotten too deep into it yet, but a few notes so far from looking at smaller pieces:

The compiler seems to handle add_no_overflow nicely, with one add and two adcs, rather than ORing carry bits as our Rust code does.
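
For reference, a minimal sketch of the OR-the-carries pattern meant here (a hypothetical 3-limb version; the real add_no_overflow may differ):

```rust
// Hypothetical 3-limb addition in the style described above: each limb
// addition yields a carry bit that is folded into the next limb, and the
// carry bits from the two middle additions are ORed together. A good
// compiler can lower this to one `add` followed by two `adc`s.
fn add_no_overflow_sketch(a: [u64; 3], b: [u64; 3]) -> [u64; 3] {
    let (r0, c0) = a[0].overflowing_add(b[0]);
    let (r1, c1a) = a[1].overflowing_add(b[1]);
    let (r1, c1b) = r1.overflowing_add(c0 as u64);
    // The name promises no overflow out of the top limb, so the final
    // carry out of limb 2 is assumed to be zero.
    let r2 = a[2].wrapping_add(b[2]).wrapping_add((c1a | c1b) as u64);
    [r0, r1, r2]
}
```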

This might be obvious, but mul_6_6 seems to be doing a lot of stack access, since there are too many intermediate results to fit in registers. It might be worth looking at some minor variations of grade school multiplication to see if we can reduce the stack usage.

(If we did write an asm implementation later, maybe we could find creative ways to avoid memory access, e.g. by abusing XMM registers for storage. Not sure if it's a good or bad idea...)

Also, it seems like the compiler isn't smart enough to use mulx in mul_6_6. (Or maybe it's being conservative, since pre-Haswell chips don't have it?) It seems like it instead uses adcs with an operand of zero to accumulate carry bits.

We could maybe use _mulx_u64 to avoid needing our own asm code. It seems like the intrinsic's implementation just does a Rust u64 multiplication though, rather than an actual mulx? (Or maybe I'm looking at the wrong code; we could try it and see.)
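
For reference, this is roughly how the intrinsic would be used (a sketch; `widening_mul` is a hypothetical wrapper, and whether an actual mulx is emitted depends on bmi2 being available):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_mulx_u64;

// Hypothetical wrapper: 64x64 -> 128-bit multiply via the intrinsic,
// returning (low, high) halves. Whether this lowers to a real `mulx`
// depends on target features (e.g. RUSTFLAGS='-C target-feature=+bmi2').
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let mut hi = 0u64;
    let lo = _mulx_u64(a, b, &mut hi);
    (lo, hi)
}
```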

@unzvfu (Collaborator, Author) commented Jan 26, 2021

Thanks heaps for your comments Daniel. I'll have a look at the stack access in mul_6_6 (and friends) to see if there's a way to reduce it.

On the subject of mulx, the compiler does generate good mulx-based code when RUSTFLAGS='-C target-cpu=native' is set. Somewhat surprisingly I didn't get any significant performance improvement when I did that on the Core i5. I will check whether it helped at all on the Core i7, just in case there is a difference.

@dlubarov (Contributor) commented Jan 27, 2021

Ah, you're right, I see mulx now with that flag. I'm still sort of suspicious of those adcs with a zero operand though, like this excerpt from mul_4_4:

```asm
mulx    rsi, r13, r13
add     r13, r9
adc     rdi, 0
```

I didn't study the asm too much, but I'm guessing that it's tracking the high bits of acc128[i]. acc128[i] is basically a sum of (u64 halves of) limb products with significance i. Since there might be more than two limb products being summed up, I figured the CPU's carry chain wouldn't help (since it tracks only a single bit), so I just used a u128 to store the column sum including the carry digit. I'm not confident that that's the right approach though.
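
For concreteness, a minimal sketch of that column-sum idea (a hypothetical 4-limb version, not the actual mul_4_4):

```rust
// Hypothetical column-wise (product-scanning) 4x4-limb multiply: each
// 128-bit limb product is split into u64 halves, and acc128[k] sums
// every half of significance k in a u128. The column sums cannot
// overflow (at most 8 u64 values per column), so a single
// carry-propagation pass at the end suffices.
fn mul_4_4_columns(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut acc128 = [0u128; 8];
    for i in 0..4 {
        for j in 0..4 {
            let p = (a[i] as u128) * (b[j] as u128);
            acc128[i + j] += p as u64 as u128;   // low half
            acc128[i + j + 1] += p >> 64;        // high half
        }
    }
    // Final carry propagation across the columns.
    let mut res = [0u64; 8];
    let mut carry: u128 = 0;
    for k in 0..8 {
        let t = acc128[k] + carry;
        res[k] = t as u64;
        carry = t >> 64;
    }
    res
}
```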

Perhaps we should try "row-wise" multiplication, as described here. Seems like that should make better use of carry chains, since only one bit is carried at a time.

OTOH, my understanding is that a sequence of adcs would be limited to one per cycle, since each one depends on the previous one's carry bit, so we might be sacrificing some parallelism in order to reduce additions. Ideally the compiler would use adcx/adox (e.g. one for odd rows, one for even rows) to break the dependency chain, but I'm not sure if it's clever enough.
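
For comparison with the column-wise sketch above, here is a minimal row-wise (operand-scanning) version (again a hypothetical 4-limb multiply, with names of my own invention):

```rust
// Hypothetical row-wise (operand-scanning) 4x4-limb multiply: each row
// adds a[i] * b into the running result, carrying one limb at a time.
// This access pattern maps naturally onto add/adc chains (or adcx/adox
// pairs, if the compiler cooperates).
fn mul_4_4_rows(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut res = [0u64; 8];
    for i in 0..4 {
        let mut carry = 0u64;
        for j in 0..4 {
            // a[i]*b[j] + res[i+j] + carry fits exactly in a u128.
            let t = (a[i] as u128) * (b[j] as u128)
                + (res[i + j] as u128)
                + (carry as u128);
            res[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        res[i + 4] = carry;
    }
    res
}
```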

@unzvfu (Collaborator, Author) commented Jan 28, 2021

> On the subject of mulx, the compiler does generate good mulx-based code when RUSTFLAGS='-C target-cpu=native' is set. Somewhat surprisingly I didn't get any significant performance improvement when I did that on the Core i5. I will check whether it helped at all on the Core i7, just in case there is a difference.

I've checked, and compiling with target-cpu=native doesn't change the timings on the Core i7 either, unfortunately. I've updated the PR description to note this.

@unzvfu (Collaborator, Author) commented Feb 3, 2021

So, new developments:

  1. Out of interest, I went ahead and implemented the mulx/ad[oc]x-based multiplication in assembly (as described in that Intel paper), and the times only dropped by a little, about 3ns on the i7. (NB: the implementation still has bugs, but the performance should be representative.) This made me realise...
  2. Something I should have measured earlier was the proportion of time spent in the multiplication vs reduction phases of Daira's algorithm. It turns out the reduction phase really is the problem: it takes about 17.5ns of the 27.7ns for mulmod, i.e. 63%. The assembly implementation dropped the multiplication time from about 10.2ns to about 7ns, which is a respectable improvement on its own, but not very helpful while the reduction dominates the runtime.
  3. So with a bit more investigation, Daira reduction may still win the day. I don't think we want to rely on an assembly version of it, but it might be worth spending more time with the compiler's assembly output to understand why the reduction is taking so long. An implementation that uses/generates ad[oc]x instructions could improve things considerably; a sketch of one possible route follows this list.
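
One possible route is the stable carry intrinsics (the function name and limb count below are hypothetical, and there is no guarantee LLVM actually emits adcx/adox from them rather than plain adc):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_addcarry_u64;

// Hypothetical 4-limb addition written as an explicit carry chain.
// Each _addcarry_u64 call takes and returns the carry flag, so the
// compiler sees exactly the dependency structure of an adc chain.
#[cfg(target_arch = "x86_64")]
fn add_4_4_carry_chain(a: [u64; 4], b: [u64; 4]) -> ([u64; 4], bool) {
    let mut r = [0u64; 4];
    unsafe {
        let mut c = _addcarry_u64(0, a[0], b[0], &mut r[0]);
        c = _addcarry_u64(c, a[1], b[1], &mut r[1]);
        c = _addcarry_u64(c, a[2], b[2], &mut r[2]);
        c = _addcarry_u64(c, a[3], b[3], &mut r[3]);
        (r, c != 0)
    }
}
```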

@unzvfu (Collaborator, Author) commented Feb 14, 2021

I did a slightly more detailed breakdown of the proportion of time that each stage of Daira's reduction is currently taking. Total time for a reduction is ~16.5ns; here are the line-by-line costs in nanoseconds (and proportions) for future reference (timings on the Core i7):

```rust
let (x0, x1, x2) = rebase_8(x);           // 2.5  (15%)
// s = C * x1
let s = mul_4_2(x1, Self::C);             // 5.3  (32%)
// t = C^2 * x2 + x0
let t = if x2 == 0 {                      // 1.4  (8%)
    x0
} else {
    add_no_overflow(Self::C_SQR_TBL[(x2 - 1) as usize], x0)
};

// xp = kM - s + t
let xp = add_6_4(sub(Self::K_M, s), t);   // 2.4  (14%)

// xp = (xp0, xp1)
let (xp0, xp1) = rebase_6(xp);            // 1.2  (7%)
// u = C * xp1
let u = mul_2_2(Self::C, xp1);            // 2.1  (13%)

// return M - u + xp0                     // 1.9  (11%)
let res = add_no_overflow(sub(Self::ORDER, u), xp0);
```

(Note: 2.5 + 5.3 + 1.4 + 2.4 + 1.2 + 2.1 + 1.9 = 16.8 != 16.5, so there is a very slight inconsistency in how I arrived at these numbers; probably not important though.) The implementations of mul_4_2 and mul_2_2 used here are based on the mulx/ad[oc]x assembly, so they are basically as fast as possible. The additions and subtractions do seem slower than expected, but they total only about 33% of the reduction runtime, so even if they were free the overall improvement would still be modest.
