
Implementation of Daira reduction #98

Draft: unzvfu wants to merge 12 commits into master

Conversation

@unzvfu (Collaborator) commented Jan 21, 2021

This branch contains an implementation of Daira Hopwood's reduction algorithm for the Tweedle{dee,dum} curves. More precisely, the reduction algorithm is implemented but only enabled for Tweedledee, while Tweedledum retains the original Montgomery reduction algorithm. This allows an easy benchmark comparison between the two reduction algorithms.

Unfortunately I wasn't able to get Daira's algorithm to run faster than Montgomery's, even though counting multiplications and additions, and staring at the assembly output, suggests that it should be possible. Since this code does not represent an improvement, I'm leaving this PR as a draft and do not recommend merging it at this stage. I intend to revisit it on occasion as ideas for improvement occur to me; suggestions for improvement are welcome in the comments below.

Current micro-benchmark timings as of 21 Jan 2021 (in nanoseconds):

| CPU | Intel Core i5-3230M @ 2.6GHz | Intel Core i7-7700HQ @ 2.8GHz |
|---|---|---|
| Monty mulmod | 38.8 | 22.7 |
| Daira mulmod | 44.4 | 27.7 |
| Monty sqrmod | 28.7 | 15.9 |
| Daira sqrmod | 34.1 | 20.3 |

Notes:

  • This branch uses better compiler flags than the ones in master; specifically, I set lto = "fat" and codegen-units = 1 in the toml file (source). A sketch of these settings appears after these notes.
    • The same source suggested building with RUSTFLAGS='-C target-cpu=native'. This generated code using mulx instead of plain mul (on the i7; the i5 doesn't have bmi2 or adx), but the timings ended up exactly the same.
  • Scaling governor set to "performance" in all benchmarks
  • rustc 1.51.0-nightly (d98d2f57d 2021-01-18)
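
For reference, the profile settings mentioned in the first note would look roughly like this in Cargo.toml (the exact section placement in this repository's toml file is an assumption):

```toml
# Sketch only: release-profile settings as described in the notes above.
[profile.release]
lto = "fat"          # fat link-time optimization across all crates
codegen-units = 1    # single codegen unit, maximizing optimization scope
```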

@unzvfu self-assigned this Jan 21, 2021
@dlubarov (Contributor) commented

Your implementation looks pretty solid; I tried a few trivial optimizations but they didn't have much effect. It seems like the adds and subs are taking longer than I would have expected: closer to 1 add/cycle than the theoretical throughput of 4 adds/cycle. Maybe we should look into add/sub/mul to see how much room for improvement there is, and if we end up with significant improvements there, we could come back to see if this method can become the fastest.

I started poking around the generated asm; will report back if I find anything interesting.

@dlubarov
Copy link
Contributor

The generated asm for daira_multiply is pretty long, and I haven't gotten too deep into it yet, but a few notes so far from looking at smaller pieces:

The compiler seems to handle add_no_overflow nicely, with one add and two adcs, rather than ORing carry bits as our Rust code does.
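
For reference, a minimal sketch of the OR-the-carries pattern meant here (a hypothetical 3-limb version; the real add_no_overflow may differ):

```rust
// Hypothetical 3-limb addition in the style described above: each limb
// addition yields a carry bit that is folded into the next limb, and the
// carry bits from the two middle additions are ORed together. A good
// compiler can lower this to one `add` followed by two `adc`s.
fn add_no_overflow_sketch(a: [u64; 3], b: [u64; 3]) -> [u64; 3] {
    let (r0, c0) = a[0].overflowing_add(b[0]);
    let (r1, c1a) = a[1].overflowing_add(b[1]);
    let (r1, c1b) = r1.overflowing_add(c0 as u64);
    // The name promises no overflow out of the top limb, so the final
    // carry out of limb 2 is assumed to be zero.
    let r2 = a[2].wrapping_add(b[2]).wrapping_add((c1a | c1b) as u64);
    [r0, r1, r2]
}
```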

This might be obvious, but mul_6_6 seems to be doing a lot of stack access, since there are too many intermediate results to fit in registers. It might be worth looking at some minor variations of grade school multiplication to see if we can reduce the stack usage.

(If we did write an asm implementation later, maybe we could find creative ways to avoid memory access, e.g. by abusing XMM registers for storage. Not sure if it's a good or bad idea...)

Also, it seems like the compiler isn't smart enough to use mulx in mul_6_6. (Or maybe it's being conservative, since pre-Haswell chips don't have it?) It seems like it instead uses adcs with an operand of zero to accumulate carry bits.

We could maybe use _mulx_u64 to avoid needing our own asm code. It seems like the intrinsic's implementation just does a Rust u64 multiplication though, rather than an actual mulx? (Or maybe I'm looking at the wrong code; we could try it and see.)
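
For reference, this is roughly how the intrinsic would be used (a sketch; `widening_mul` is a hypothetical wrapper, and whether an actual mulx is emitted depends on bmi2 being available):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_mulx_u64;

// Hypothetical wrapper: 64x64 -> 128-bit multiply via the intrinsic,
// returning (low, high) halves. Whether this lowers to a real `mulx`
// depends on target features (e.g. RUSTFLAGS='-C target-feature=+bmi2').
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let mut hi = 0u64;
    let lo = _mulx_u64(a, b, &mut hi);
    (lo, hi)
}
```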

@unzvfu (Collaborator, Author) commented Jan 26, 2021

Thanks heaps for your comments Daniel. I'll have a look at the stack access in mul_6_6 (and friends) to see if there's a way to reduce it.

On the subject of mulx, the compiler does generate good mulx-based code when RUSTFLAGS='-C target-cpu=native' is set. Somewhat surprisingly I didn't get any significant performance improvement when I did that on the Core i5. I will check whether it helped at all on the Core i7, just in case there is a difference.

@dlubarov (Contributor) commented Jan 27, 2021

Ah, you're right, I see mulx now with that flag. I'm still sort of suspicious of those adcs with a zero operand though, like this excerpt from mul_4_4:

```asm
mulx    rsi, r13, r13
add     r13, r9
adc     rdi, 0
```

I didn't study the asm too much, but I'm guessing that it's tracking the high bits of acc128[i]. acc128[i] is basically a sum of (u64 halves of) limb products with significance i. Since there might be more than two limb products being summed up, I figured the CPU's carry chain wouldn't help (since it tracks only a single bit), so I just used a u128 to store the column sum including the carry digit. I'm not confident that that's the right approach though.
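
For concreteness, a minimal sketch of that column-sum idea (a hypothetical 4-limb version, not the actual mul_4_4):

```rust
// Hypothetical column-wise (product-scanning) 4x4-limb multiply: each
// 128-bit limb product is split into u64 halves, and acc128[k] sums
// every half of significance k in a u128. The column sums cannot
// overflow (at most 8 u64 values per column), so a single
// carry-propagation pass at the end suffices.
fn mul_4_4_columns(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut acc128 = [0u128; 8];
    for i in 0..4 {
        for j in 0..4 {
            let p = (a[i] as u128) * (b[j] as u128);
            acc128[i + j] += p as u64 as u128;   // low half
            acc128[i + j + 1] += p >> 64;        // high half
        }
    }
    // Final carry propagation across the columns.
    let mut res = [0u64; 8];
    let mut carry: u128 = 0;
    for k in 0..8 {
        let t = acc128[k] + carry;
        res[k] = t as u64;
        carry = t >> 64;
    }
    res
}
```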

Perhaps we should try "row-wise" multiplication, as described here. Seems like that should make better use of carry chains, since only one bit is carried at a time.

OTOH, my understanding is that a sequence of adcs would be limited to one per cycle, since each one depends on the previous one's carry bit, so we might be sacrificing some parallelism in order to reduce additions. Ideally the compiler would use adcx/adox (e.g. one for odd rows, one for even rows) to break the dependency chain, but I'm not sure if it's clever enough.
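
For comparison with the column-wise sketch above, here is a minimal row-wise (operand-scanning) version (again a hypothetical 4-limb multiply, with names of my own invention):

```rust
// Hypothetical row-wise (operand-scanning) 4x4-limb multiply: each row
// adds a[i] * b into the running result, carrying one limb at a time.
// This access pattern maps naturally onto add/adc chains (or adcx/adox
// pairs, if the compiler cooperates).
fn mul_4_4_rows(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut res = [0u64; 8];
    for i in 0..4 {
        let mut carry = 0u64;
        for j in 0..4 {
            // a[i]*b[j] + res[i+j] + carry fits exactly in a u128.
            let t = (a[i] as u128) * (b[j] as u128)
                + (res[i + j] as u128)
                + (carry as u128);
            res[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        res[i + 4] = carry;
    }
    res
}
```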

@unzvfu (Collaborator, Author) commented Jan 28, 2021

> On the subject of mulx, the compiler does generate good mulx-based code when RUSTFLAGS='-C target-cpu=native' is set. Somewhat surprisingly I didn't get any significant performance improvement when I did that on the Core i5. I will check whether it helped at all on the Core i7, just in case there is a difference.

I've checked, and compiling with target-cpu=native doesn't change the timings on the Core i7 either, unfortunately. I've updated the PR description to note this.

@unzvfu (Collaborator, Author) commented Feb 3, 2021

So, new developments:

  1. Out of interest, I went ahead and implemented the mulx/ad[oc]x-based multiplication in assembly (as described in that Intel paper), and the times only dropped by a little, about 3ns on the i7. (NB: the implementation still has bugs, but the performance should be representative.) This made me realise...
  2. Something I should have measured earlier was the proportion of time spent in the multiplication vs reduction phases of Daira's algorithm. It turns out the reduction phase really is the problem: it takes about 17.5ns of the 27.7ns for mulmod, i.e. 63%. The assembly implementation dropped the multiplication time from about 10.2ns to about 7ns, which is a respectable improvement on its own, but not very helpful while the reduction dominates the runtime.
  3. So with a bit more investigation, Daira reduction may still win the day. I don't think we want to rely on an assembly version of it, but it might be worth spending more time with the compiler's assembly output to understand why the reduction is taking so long. An implementation that uses/generates ad[oc]x instructions could improve things considerably; a sketch of one possible route follows this list.
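
One possible route is the stable carry intrinsics (the function name and limb count below are hypothetical, and there is no guarantee LLVM actually emits adcx/adox from them rather than plain adc):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_addcarry_u64;

// Hypothetical 4-limb addition written as an explicit carry chain.
// Each _addcarry_u64 call takes and returns the carry flag, so the
// compiler sees exactly the dependency structure of an adc chain.
#[cfg(target_arch = "x86_64")]
fn add_4_4_carry_chain(a: [u64; 4], b: [u64; 4]) -> ([u64; 4], bool) {
    let mut r = [0u64; 4];
    unsafe {
        let mut c = _addcarry_u64(0, a[0], b[0], &mut r[0]);
        c = _addcarry_u64(c, a[1], b[1], &mut r[1]);
        c = _addcarry_u64(c, a[2], b[2], &mut r[2]);
        c = _addcarry_u64(c, a[3], b[3], &mut r[3]);
        (r, c != 0)
    }
}
```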

@unzvfu (Collaborator, Author) commented Feb 14, 2021

I did a slightly more detailed breakdown of the proportion of time that each stage of Daira's reduction is currently taking. Total time for a reduction is ~16.5ns; here are the line-by-line costs in nanoseconds (and proportions) for future reference (timings on the Core i7):

```rust
let (x0, x1, x2) = rebase_8(x);           // 2.5  (15%)
// s = C * x1
let s = mul_4_2(x1, Self::C);             // 5.3  (32%)
// t = C^2 * x2 + x0
let t = if x2 == 0 {                      // 1.4  (8%)
    x0
} else {
    add_no_overflow(Self::C_SQR_TBL[(x2 - 1) as usize], x0)
};

// xp = kM - s + t
let xp = add_6_4(sub(Self::K_M, s), t);   // 2.4  (14%)

// xp = (xp0, xp1)
let (xp0, xp1) = rebase_6(xp);            // 1.2  (7%)
// u = C * xp1
let u = mul_2_2(Self::C, xp1);            // 2.1  (13%)

// return M - u + xp0                     // 1.9  (11%)
let res = add_no_overflow(sub(Self::ORDER, u), xp0);
```

(Note: 2.5 + 5.3 + 1.4 + 2.4 + 1.2 + 2.1 + 1.9 = 16.8 != 16.5, so there is a very slight inconsistency in how I arrived at these numbers; probably not important though.) The implementations of mul_4_2 and mul_2_2 used here are based on the mulx/ad[oc]x assembly, so they are basically as fast as possible. The additions and subtractions do seem slower than expected, but they total only about 33% of the reduction runtime, so even if they were free the overall improvement would still be modest.
