Lesson 8: Loop Optimization #353

sampsyo · 2023-08-21T20:21:49Z

sampsyo
Aug 21, 2023
Maintainer

keikun555 · 2023-10-10T00:48:49Z

keikun555
Oct 10, 2023

Summarize what you did.

I chose to work in bril. I first made a loop library which detects whether a function CFG is reducible and finds backedges in the CFG as well as their corresponding natural loops. I then coded up loop invariant code motion (LICM) of SSA-form bril using the loop library. There were edge cases not talked about in class that I needed to account for, which I will talk about in the last section.

Explain how you know your implementation works—how did you test it? Which test inputs did you use? Do you have any quantitative results to report?

Turnt

I used this turnt configuration to run on all bril files within the bril repo.
The output of the turnt run is stored here and is generated with turnt --save -j $(find . -iname "*.bril"); (turnt -j $(find . -iname "*.bril" | sort) | tee turnt.out).
I have eight environments within this turnt configuration
- The natural-loop environment runs the Bril Loop Library which will generate a JSON file of whether each function has a reducible CFG and their natural loops.
- The labeled environment generates a bril program with a label for each basic block. This was used for debugging convenience.
- The licm and licm-roundtrip environments generates the LICM-optimized bril in text form, the first in SSA-form and the second in standard bril. This is also for debugging purposes.
- The four *-eval environments are used for differential analysis. They evaluate the original program, the SSA-form program, the LICM-optimized SSA-form program, and LICM-optimized standard form program and makes sure that the output is identical. This gives us confidence that my LICM implementation does not break the original program.
The turnt file passes on all benchmarks

❯ rg "^not ok.*benchmark" turnt.out
❯

The following is the programs that the turnt file failed on. These include the usual suspects of overridden arguments, type inference, and linking.

❯ rg "^not ok" turnt.out | cut -d' ' -f5 | uniq | xargs rgrep -c -E "CMD:|RETURN:|OUT:"
./bril-llvm/linkedlist.bril:0
./bril-llvm/point.bril:0
./examples/test/df/cond-args.bril:0
./examples/test/dom/while.bril:0
./examples/test/lvn/clobber.bril:1
./examples/test/lvn/clobber-fold.bril:1
./examples/test/lvn/commute.bril:0
./examples/test/lvn/divide-by-zero.bril:0
./examples/test/lvn/fold-comparisons.bril:0
./examples/test/lvn/logical-operators.bril:0
./examples/test/lvn/redundant-dce.bril:1
./examples/test/lvn/rename-fold.bril:1
./examples/test/ssa/if-orig.bril:0
./examples/test/ssa/if-ssa.bril:0
./examples/test/tdce/combo.bril:0
./examples/test/tdce/double-pass.bril:0
./examples/test/tdce/reassign-dkp.bril:0
./examples/test/to_ssa/argwrite.bril:0
./examples/test/to_ssa/if.bril:0
./examples/test/to_ssa/if-ssa.bril:0
./examples/test/to_ssa/while.bril:0
./test/check/argtype.bril:0
./test/check/badcall.bril:0
./test/check/badid.bril:0
./test/check/badmem.bril:0
./test/check/char.bril:0
./test/check/extra.bril:0
./test/check/labels.bril:0
./test/check/mem.bril:1
./test/check/missarg.bril:0
./test/check/missdest.bril:0
./test/check/printres.bril:0
./test/check/ptr.bril:0
./test/check/speculate.bril:0
./test/check/ssa.bril:0
./test/check/undef.bril:0
./test/interp-error/char-error/badconversion.bril:0
./test/interp-error/core-error/call-nonvoid-return-nothing-error.bril:0
./test/interp-error/core-error/call-return-nothing-error.bril:0
./test/interp-error/core-error/call-return-wrong-type.bril:0
./test/interp-error/core-error/call-void-return-error.bril:0
./test/interp-error/core-error/call-wrong-argument-types.bril:0
./test/interp-error/core-error/call-wrong-arity.bril:0
./test/interp-error/core-error/call-wrong-declared-type.bril:0
./test/interp-error/core-error/divide_by_zero.bril:0
./test/interp-error/core-error/duplicate_function.bril:0
./test/interp-error/core-error/duplicate_main.bril:0
./test/interp-error/core-error/undefined-func.bril:0
./test/interp-error/mem-error/double_free.bril:0
./test/interp-error/mem-error/free_offset.bril:0
./test/interp-error/mem-error/leak.bril:0
./test/interp-error/mem-error/out_of_bounds_2.bril:0
./test/interp-error/mem-error/out_of_bounds.bril:0
./test/interp-error/mem-error/uninit_read.bril:0
./test/interp-error/mem-error/wrong_write.bril:0
./test/interp-error/spec-error/spec-call.bril:0
./test/interp-error/spec-error/spec-double-commit.bril:0
./test/interp-error/spec-error/spec-nonspec-abort.bril:0
./test/interp-error/spec-error/spec-return.bril:0
./test/interp-error/spec-error/spec-return-implicit.bril:0
./test/interp-error/ssa-error/ssa-mismatch.bril:0
./test/interp-error/ssa-error/ssa-nolabel.bril:0
./test/interp/spec/spec-nested.bril:0
./test/linking/diamond.bril:0
./test/linking/link_ops.bril:0
./test/linking/nested.bril:0
./test/linking/recursive.bril:0
./test/print/eight-queens.bril:0
./type-infer/tests/fail-infer/arith_ops.bril:1
./type-infer/tests/fail-infer/assign_label.bril:0
./type-infer/tests/fail-infer/br.bril:1
./type-infer/tests/fail-infer/comp_ops.bril:1
./type-infer/tests/fail-infer/control_ops.bril:1
./type-infer/tests/fail-infer/div.bril:1
./type-infer/tests/fail-infer/idchain.bril:0
./type-infer/tests/fail-infer/jmp.bril:1
./type-infer/tests/fail-infer/logic_ops.bril:1
./type-infer/tests/fail-infer/many_functions.bril:1
./type-infer/tests/fail-infer/tricky-jump.bril:1
./type-infer/tests/fail-typecheck/add.bril:2
./type-infer/tests/fail-typecheck/arith_ops.bril:1
./type-infer/tests/fail-typecheck/br.bril:1
./type-infer/tests/fail-typecheck/comp_ops.bril:1
./type-infer/tests/fail-typecheck/control_ops.bril:1
./type-infer/tests/fail-typecheck/div.bril:1
./type-infer/tests/fail-typecheck/idchain.bril:1
./type-infer/tests/fail-typecheck/jmp.bril:1
./type-infer/tests/fail-typecheck/logic_ops.bril:1
./type-infer/tests/fail-typecheck/many_functions.bril:1
./type-infer/tests/fail-typecheck/nop.bril:1
./type-infer/tests/fail-typecheck/ret.bril:1
./type-infer/tests/fail-typecheck/tiny.bril:1
./type-infer/tests/fail-typecheck/tricky-jump.bril:1
./type-infer/tests/infer/addarg.bril:0
./type-infer/tests/infer/arith_ops.bril:0
./type-infer/tests/infer/comp_ops.bril:0
./type-infer/tests/infer/control_ops.bril:0
./type-infer/tests/infer/idchain.bril:0
./type-infer/tests/infer/logic_ops.bril:0
./type-infer/tests/infer/many_functions.bril:0
./type-infer/tests/infer/tricky-jump.bril:0
./type-infer/tests/parse/add.bril:0
./type-infer/tests/parse/partial.bril:0
./type-infer/tests/print/add.bril:0
./type-infer/tests/print/partial.bril:0
./type-infer/tests/typecheck/arith_ops.bril:0
./type-infer/tests/typecheck/comp_ops.bril:0
./type-infer/tests/typecheck/control_ops.bril:0
./type-infer/tests/typecheck/idchain.bril:0
./type-infer/tests/typecheck/logic_ops.bril:0
./type-infer/tests/typecheck/many_functions.bril:0
./type-infer/tests/typecheck/tricky-jump.bril:0
❯

Brench

To measure the optimization performance, I used this brench configuration on all benchmark bril files to find the dynamic instances used to run each form of the bril program.
I have three baselines to compare my optimization with and two versions of LICM optimized programs:
- baseline which is the original program,
- ssa which runs ssa-form bril without the LICM optimization,
- ssa_roundtrip which runs standard form bril generated from ssa-form bril, again without the LICM optimization,
- ssa_licm which runs ssa-form bril with the LICM optimization, and
- licm_roundtrip which runs standard form bril generated from ssa-bril with the LICM optimization.
By itself it was hard to discern if the optimization reduced the total number of dynamic instructions executed, so I created a Brench Output Transposer which would transpose the csv to the following form

❯ python3 csvt.py < benchmarks.csv | head
benchmark,baseline,ssa,ssa_roundtrip,ssa_licm,ssa_licm_roundtrip
sieve,3482,5558,5758,5558,5758
vsmul,86036,110622,110628,110622,110628
quicksort,264,453,519,453,519
max-subarray,193,332,334,332,334
mat-mul,1990407,3374313,3374331,3246913,3246931
fib,121,203,207,203,207
eight-queens,1006454,1866985,1938643,1822291,1893949
primitive-root,11029,15248,15428,15248,15428
two-sum,98,180,200,180,200
❯

I then used awk to generate more csv files to measure the optimization performance.
- I compared the ssa-form bril and ssa-form bril with LICM in this CSV file generated with the command echo "benchmark,ssa-licm"; python3 csvt.py < benchmarks.csv | tail -n+2 | awk 'BEGIN { FS=OFS="," } {print $1, $3-$5}' | tee ssa_minus_licm.csv. Higher the better.
- I compared the ssa-roundtrip bril and ssa-roundtrip bril with LICM in this CSV file using the command echo "benchmark,ssa_roundtrip-licm_roundtrip"; python3 csvt.py < benchmarks.csv | tail -n+2 | awk 'BEGIN { FS=OFS="," } {print $1, $4-$6}' | tee ssa_roundtrip_minus_licm_roundtrip.csv. Again, the higher is better.
- I compared the original program with ssa-form with LICM in this CSV file and with ssa-roundtrip with LICM in this CSV file. As we may expect, the higher is better.

As suspected, LICM does optimize programs in SSA-form as well as in their roundtrip forms.

❯ cat ssa_minus_licm.csv | rg -v ",0$"
mat-mul,127400
eight-queens,44694
riemann,7
euler,18
cordic,-1
mandelbrot,2006
sqrt,16
pow,1
mat-inv,20
cholesky,184
relative-primes,19
mod_inv,58
quadratic,84
pascals-row,9
pythagorean_triple,123
euclid,19
sum-sq-diff,200
up-arrow,18
check-primes,967
❯ cat ssa_roundtrip_minus_licm_roundtrip.csv | rg -v ",0$"
mat-mul,127400
eight-queens,44694
riemann,7
euler,18
cordic,-1
mandelbrot,2006
sqrt,16
pow,1
mat-inv,20
cholesky,184
relative-primes,19
mod_inv,58
quadratic,84
pascals-row,9
pythagorean_triple,123
euclid,19
sum-sq-diff,200
up-arrow,18
check-primes,967
❯

One interesting observation is benchmarks/float/cordic.bril which actually increased the number of instructions executed. I believe this is because the optimization moved an instruction outside of a loop that was never executed.

The overhead of execution of SSA-form bril overweighed the original program execution as shown in both baseline_minus_*.csv CSV files.

What was the hardest part of the task? How did you solve this problem?

Unreachable blocks: "A dominates B iff all paths from the entry to B include A." So every block dominates unreachable blocks. Also, backedges are "edges from A to B where B dominates A." If there is an edge (A, B) in the CFG where A is an unreachable block, by definition this edge (A, B) would be a backedge. The natural loop in this case would just be nodes that can reach A, with B added to it.benchmarks/core/relative-primes.bril had a case of unreachable blocks which I suspected (unverified) allowed some unreachable loop invariant instructions to be sent to the preheader which would be executed. So I added a guard against unreachable blocks here which helped me pass the test for this bril program.
Phi nodes can error: Phi instructions are similar to the "divide by zero" instructions which error "when two labels have not yet executed, or when the instruction does not contain an entry for the second-most-recently-executed label." Adding phi instructions to the stateful instructions list in this function fixed the issue.
Phi nodes with all labels within the loop are dominated by a block in the loop: There's a peculiar edge case with phi nodes and the use domination criteria for instruction movement. Given a loop, we could have a phi instruction outside the loop which only has labels to blocks within the loop. The phi instruction would run before the loop ran, which would be a no-op. Within the fixed loop we could be assigning values to the variable within the phi instruction argument and a user of the variable also within the same loop. The LICM algorithm I had had moved the variable user outside the loop after the phi instruction and kept the variable definition within the loop because the definition did not dominate the phi instruction. However, in reality we do dominate the phi instruction's assignment because the phi instruction only has labels from within the loop, so if the phi instruction executed without error, that means the loop must have run. The main function in benchmarks/float/mandelbrot.bril had this problem and so I made sure the user phi instructions had labels outside the loop in this check to fix the issue.

0 replies

alifarahbakhsh · 2023-10-17T04:38:00Z

alifarahbakhsh
Oct 17, 2023

Repo

This repo contains the code for loop-invariant code motion. The experiments cover the benchmarks from the core and float folders in the canonical Bril repository.
The experiments are conducted via brench. Note that you should probably change the benchmark directory in the .toml files.
The measured metric is the number of dyn_insts, which is automatically collected by brench.

I did not cover the long and mem benchmarks because they either take a long time and thus output timeout, or throw memory management errors because
I have not considered pointer instructions in my code.
Note that some of the core and float benchmarks also result in a timeout, but I have manually checked them and the results are correct.

I have reported the statistics of the relative and absolute improvements (over the benchmarks) for the two experiments, where relative is equal to (before - after) / before and absolute is equal to before - after.
Since the programs are all in Bril, there is no randomness involved, and therefore for each benchmark a one-shot measurement is enough.
For core the mean relative and absolute improvements are 0.02 and 1457, with standard deviations of 0.03 and 6797, respectively.
For float the mean relative and absolute improvements are 0.01 and 390, with standard deviations of 0.01 and 1016, respectively.

An interesting point was seeing a degradation of performance for one of the mem benchmarks, even though the output was correct. I did not investigate the reason due to a lack of time.

The challenges that I faced were mainly oriented around solving stupid bugs, and the algorithm itself was fairly smooth. The only ambiguity was in identifying the body of a natural loop, for which I used some notes from the Cornell 4120 edition of compilers.

0 replies

matth2k · 2023-10-17T07:30:28Z

matth2k
Oct 17, 2023

Summary
- LICM in LLVM
Implementation Details
- I implemented a pass that hoists loop-invariant code to the preheader of a loop while using the newish LLVM pass manager.
- My pass does not detect all invariant memory operations. I do, however, use AliasAnalysis to hoist memory operations which are guaranteed to be alias free.
- I call my pass with opt with the most bare bones pipeline required to utilize my LICM pass. My pass pipeline is cgscc(inline),mem2reg,dse,function(loop(loop-unswitch)).
  - inline allows me to find loop-invariant function executions.
  - If I did not run the mem2reg pass, all the loop-invariant code was hidden behind memory operations.
  - dse (dead store elimination) improves the chances that my loop body will be alias-free.
- You can see the rest of my tool flow in optCode.sh. I used LLVM version 15.0.7.

Evaluation

Evaluating my LICM pass for performance was tricky, because many simple applications are bound by memory operations which are not loop-invariant. Consequently, Amdahl's law would suggest that it is going to be hard for me to notice any speedups without LICM on loads and stores. I wanted to try to implement loop unswitching as well to see more performance gains, but I did not have enough time.
I evaluated numerous micro-benchmarks (invariant func call, hello world loop, ...), but focused mainly on 5 large benchmarks (bzip2, gcc 3.5, oggenc, TSVC, STMR).

Benchmark	LICM Instructions Moved	Lines of LLVM	Testcase	Speedup
bzip2	3502	31,086	Roundtrip de/compress video	1.0
gcc 3.5	26041	19,252,451	n/a	n/a
oggenc	3125	65,081	Compress wav file	1.0
TSVC	161	20,336	Check output hash is correct	1.02
STMR	0	1996	Check output is correct	1.0
funcCall	9	84	Check output num is correct	1.0

As you can see from the chart above, LICM does not give raw performance gains without tackling loads and stores. The majority of moved instructions are getelementptr's. I'm not sure how these GEPS are getting lowered, but they may not even create add instructions because x86 instructions have their own offset field. Hence, the only instructions saved on are things like sub, add, sext, and mul.

The only benchmark I saw a statistically significant speedup on is TSVC (Test Suite for Vectorizing Compilers). This might make sense, because these benchmarks are not as memory heavy comparatively speaking. Here, I could see some icmp and fcmp operations being moved which was a good sign.

Overall, this assignment is leaving me very confused, because I don't have good intuition on how some of these LLVM instructions are lowered to x86. I even made sure to use -O0 on the LLVM static compiler. I don't know how I could be removing so many instructions and not see a clear speedup. My future work would be to look at more auto-vectorization testcases to look for more clues.

Anything Hard or Interesting?
- It was hard to use the new LLVM pass manager as most of the documentation online was for the old one. I spent a long time trying to register my loop pass as hook into PipelineParsing using the LoopPassManager.
- It was hard to find test cases that have a speedup with LICM on non-memory operations. I'm not sure I achieved my goal, but in any case this page was a really good resouce for finding large single-file programs.
- I was looking into MemorySSA to find invariant memory instructions, but I was too limited on time. It looks like the people working on LLVM are trying to model the memory dependence analysis problem differently.
References Used

0 replies

stephenverderame · 2023-10-18T01:42:45Z

stephenverderame
Oct 18, 2023

For this assignment, we (Stephen and Benny) implemented induction variable elimination in llvm
Link to repo

Implementation

There were certain parts of the assignment that were much easier using llvm and certain parts that were a lot more challenging. Those are covered in the challenges section.
Most of the implementation was straightforward using what we had discussed in class. There were a few things we accounted for such as allowing for subtraction instead of just addition. We also had to think carefully about what expression of the form j+=1 look like in SSA and do some traversals over the Phi nodes. Additionally, we ran into a few interesting corner cases. For example: in one of our benchmarks there was an induction variable in both an inner and outer loop, something along the lines of:

for (int i = 0; i < LEN2; i++) {
    for (int j = 1; j < LEN2; j++) {
        printf("%d %d\n", j - 1, k - 1);
        ++k;
    }
    ++k;
}

We also had to consider cases of nested loops, particularly cases where doing strength reduction in an inner loop creates more opportunities for strength reduction in the outer loop.

Using LLVM (and being in SSA form) made it extremely simple to check that operands were loop invariant and to easily move instructions around. Probably the most useful aspect of working in LLVM was the LoopManager which identified all the natural loops for us and the SimplifyLoop pass that gave each loop a pre-header and a single latch (backedge). We had spent considerable time trying to achieve this on our own before learning this function already existed.

Testing

We ran our implementation against a number of LLVM TSVC tests/benchmarks. We also handwrote about 30 or so tests to hit corner cases. We grabbed our performance evaluation benchmarks from the LLVM TSVC Benchmarks and the LLVM Single Source Benchmarks. From these, we chose Linpack, the Induction Variable Test Suite, and the Restructured loops benchmarks.

The following metrics were collected from running two trials, each trial performing 20,000 repetitions for each benchmark except for Linpack, which performed 400. We used an Ubuntu 22.04.03 LTS machine (Linux Kernel Version 6.2.0-33) with an Intel i7-13620H and 16 Gb memory. For the baseline, we performed (in this exact order) mem2reg, sroa, early-cse, loop-simplify, instcombine, and adce passes. For the optimized version, we performed this sequence, plus our pass, followed by instcombine and adce again. No other optimization passes were performed besides these.

Overall we found our optimization... had no significant impact. We measured an average speedup of 1.0053x +/- 0.0121 across all benchmarks with a standard deviation of 0.05008 and median of 1.0017x.

Challenges

The most significant challenge was simply getting our pass in a state where we could run it. If you run it at the beginning of the opt pipeline then mem2reg hadn’t ran yet and everything would be memory accesses that we couldn’t optimize. If we ran it at the beginning of the optimization pipeline, then LLVM’s internal induction variable elimination would have already run. Instead, we needed to have clang generate LLVM, and then optimize this manually with opt, specifying the exact optimizations to run. Another hiccup with this is that Clang without optimizations annotates functions with optnone, disabling optimizations. So we had to enable Clang optimizations, and then pass a flag to LLVM disabling them again. This ended up becoming the following lengthy command to compile a test program and run it

clang++-17 -O1 -mllvm -disable-llvm-optzns {filename} -emit-llvm -S -o /dev/stdout \
| opt-17 -passes="mem2reg,sroa,early-cse<memssa>,loop-simplify,instcombine,adce" \
| opt-17 -load-pass-plugin ../../build/ive/InductionVariableElimination.so \
    -passes="ive,adce,instcombine" \
| llc-17 -o /dev/stdout \
| clang++-17 -x assembler -no-pie - && ./a.out

Another challenge is that one of the best opportunities to optimize for induction variables is array accesses, but LLVM uses its getelementptr for this which turns into simple additions and multiplications, so we could not actually get any speedup boost from those. While x86 addressing modes probably allow these to be just as fast as if strength reduction was performed, we are unsure of (and quite speculative of) the performance on, say a RISC-V backend.

Another challenge is we realized that when we can replace conditions with derived variables is not exactly straightforward. We realized it only works when the factor is positive. I had written a pretty robust interval analysis from the last assignment, however, we ended up just using LLVM's LazyValueInfo (which isn't as powerful) because some of our test cases revealed some non-termination behavior in the interval analysis which I admittedly didn't feel like fixing right now.

0 replies

bcarlet · 2023-10-18T02:25:18Z

bcarlet
Oct 18, 2023

Summary

Details

I implemented loop-invariant code motion for standard (non-SSA) Bril. I followed the steps outlined in lecture: detect natural loops, insert preheaders, detect loop-invariant instructions, and move instructions to the preheader when it can be done safely. I used the weaker version of the requirement that the to-be-moved instruction must dominate all loop exits, which performed significantly better than the stronger requirement.

Testing

To test correctness and performance impact, I used brench to run the optimization on all the benchmarks in the Bril repository. I collected dynamic instruction counts for the baseline and optimized benchmarks and then computed a speedup for each benchmark. Summary statistics for these speedups are as follows:

suite	min	max	mean	stdev
core	0.97126	1.3605	1.0596	0.095769
float	1.0	1.1418	1.0345	0.051494
mem	0.98	1.0319	1.0034	0.012216
mixed	1.0	1.0576	1.0284	0.040764

Across all the benchmarks, 26 saw some speedup, 41 saw no change, and 2 saw a slowdown. Of the two slowdowns, one was caused by additional jumps added when inserting the preheader, and the other I'm not sure about.

Difficulties

Using Bril made this relatively straightforward, and I didn't run into any particularly tricky edge cases this time around (a pleasant surprise). That said, there are still a lot of things to check in order to safely move instructions around, so there were still bugs to work out.

0 replies

SanjitBasker · 2023-10-18T03:15:00Z

SanjitBasker
Oct 18, 2023

I worked with @obhalerao on this project.

Summary

Repo Link

Implementation Details

We implemented an LLVM pass to perform loop-invariant code motion. Since I've implemented this particular optimization on a different IR in CS 4120, we decided to take advantage of LLVM's makeLoopInvariant function. We learned how to use opt and llc to run our passes and get better control over which passes are run. This was important because we need mem2reg to run before our pass, to ensure that the variables are stored in temporaries rather than memory locations. This also lets us perform a fair comparison to a baseline where we run the exact same set of passes (except for the pass we wrote).

Testing

We first ran our code on embench, which we supported by writing a custom Python script to emulate the C compiler. It took a really long time to get working because we didn't realize that we had to pass --relocation-model=pic to llc in addition to passing -fPIE to clang. We found that our speeds were identical but our binaries were differently sized, so we figured that our optimization was running but just not causing a measurable speedup. To investigate this further, we wrote a very contrived C program in which our optimization hoists a multiplication out of the loop, and we were able to observe a speedup on this:

$ source compile_baseline.sh ../llvm_benchmarks/less_dumb_licm.c
$ source compile_licm.sh ../llvm_benchmarks/less_dumb_licm.c
$ time echo "1 2" | ./optimized 
1000000000

real    0m3.412s
user    0m3.410s
sys     0m0.001s
$ time echo "1 2" | ./unoptimized 
1000000000

real    0m3.693s
user    0m3.687s
sys     0m0.007s

In order to force the compiler to use a multiplication instruction, we used scanf to read one of the arguments from stdin--this probably means that I/O contributes very heavily to the runtime. Nevertheless, the speedup is consistent so we consider this strong evidence that our optimization can create observable speedup.

Difficulties

The main difficulty we encountered was running the LLVM pass.

0 replies

MelindaFang-code · 2023-10-18T03:16:12Z

MelindaFang-code
Oct 18, 2023

Summary

LICM implementation under lesson8 folder in Skeleton.c
@collinzrj and I worked together on implementing LICM using LLVM.

We uses a ModuleAnalysisManager for our pass. Then we iterate through the functions inside the module and obtain the natural loops for each function. While this is really easy to use this has posed certain constraint to our optimizer: during testing we find out that our algorithm has problem doing optimization if a function A is called within a loop in function B
our loop pass mainly has 2 functions. One to identify the loop invariants and one to make sure it is safe to move the instructions to the preheader.
To identify loop invariant instructions, we first tried the "isLoopVariant" library function but it turns out to be too conservative and marks nearly all instructions not invariant. Hence we tried to implement the algorithm taught in class, but implementing the reaching definition is really complicated in llvm, for instance there is so many types of instructions in llvm and llvm uses pointer type instructions extensively.
In the safeToHoist function, we use the isSafeToSpeculativelyExecute library function to test whether the instruction has side effects. Then we use the dominator tree to check whether the instruction dominates all loop exit or the definition dominates all of its uses

Test

we have 3 simple tests to test the capability (loop.c, foo.c, and test.c) In loop.c our function successfully identifies the only loop invariant (we assign variable to constant in a loop repetitively). In test.c we tests whether our algorithm could identify common nonvariant instructions like an instruction doesn't dominate loop exit or load from memory. In foo.c exposes the limitation of only obtaining loop information inside function call

Hardest Part

we encountered a lot of problems due to the recent upgrade of the llvm library, making a lot of the manager classes legacy and hard to use. Meanwhile, we also face problem using the latest library functions. Also many documentations are for the legacy library functions so pretty hard to learn.

we first tried to use the latest loopmanager to obtain the natural loops, but didn't succeed, so we have to take a step back and use the legacy one, which works in a constraint way.
we also manually writes helper functions to deal with edge cases with pointer instructions that could be treated as invariants.
Some llvm functions doesn't do the same thing as its name indicates (e.g. isLoopInvariant) so we have to read through the original code to understand what it does

Handle Load And Store Instruction

We found that load and store instructions can't be simply handled by the algorithm given in the course notes. However, llvm heavily relies on load and store. As a result, we extended the algorithm to handle load and store cases. We copied the pseudocode from the course notes, and prefix our changes with ">", our code also has comments explain the algo.

iterate to convergence:
    > for every pointer in the loop:
        > mark it as InvariantPointer iff, for all values written to the pointer, either:
            > the value is never written to or only wirtten by an InvariantPointer
            > the value is never written to or only written by an InvariantInstruction
    for every instruction in the loop:
        mark it as LI iff, for all arguments x:
            > if x is a LoadInst:
                > it only loads from value outside the loop or invariant pointer/value
            > if x is a StoreInst:
                > it is only stored by value outside the loop or invariant pointer/value
            else:
                all reaching defintions of x are outside of the loop, or
                there is exactly one definition, and it is already marked as
                    loop invariant

0 replies

zachary-kent · 2023-10-18T03:29:27Z

zachary-kent
Oct 18, 2023

Summary

@rcplane, @Enochen, and @zachary-kent collaborated on this project.

LICM pass

We implemented loop invariant code motion as an LLVM pass.

Implementation

We wrote a Loop pass in LLVM and connected it to a FunctionToLoopPassAdaptor to create a Function pass. This was necessary because the built in mem2reg pass, which helps setup the IR in SSA form is a function pass. Therefore if we want to chain our pass after mem2reg via opt we must also make it a function pass.
We initially had written code to generate a loop preheader, although we later found out that process was automatically performed by LLVM via LoopSimplify. Anyways, our custom implementation basically created a new basic block pointing to the existing loop header, and redirected non-backedges previously to the loop header into the preheader. We also had to shift and modify phi nodes in order to preserve SSA semantics.
For the actual code motion, we used the good ol' loop till convergence pattern. In each loop iteration, we search through all instructions in the loop and check if they are loop invariant. To avoid modifying the IR while iterating, we collect a set of loop invariant instructions as we move along. This not only allows us to hoist these instructions at the end of iteration, but also allows for faster convergence via loop-invariant operand detection. One step to checking whether an instruction is loop invariant is that its operands are all loop invariant, so being able to identify LI instructions ahead of hoisting leads to potentially less loop iterations.
To check whether an instruction is loop invariant, we essentially need to check that the instruction
1. is not effectful
  - we use LLVM helpers such as isSafeToSpeculativelyExecute and mayReadOrWriteMemory to make sure the instruction does not contain any side effects
2. does not have loop variant operands (as mentioned above)
  - we can trivially walk through the instructions operands and check that they are not in the loop or they're just values. This is made possible by being in SSA since the operands are instructions if not straight up values, so we can see where they're from.

Testing

We ended up using the TSVC benchmark suite to test the correctness and effectiveness of our optimizations. TSVC includes 151 loop-heavy benchmarks with their reference outputs in a single source file. We first compile tsvc.c using clang without any optimizations to produce LLVM IR, and then use opt to apply mem2reg and our LICM pass to this unoptimized IR. Then, we compile the optimized IR using clang to an executable, and run the benchmark suite.
Our outputs matched the reference output for every benchmark, giving us reasonable confidence that our optimization was correct.
We were unable to compute the dynamic instruction count of each benchmark directly, and instead had to rely on the (noisy) measure of wall-clock time. To attempt to reduce the variance in our measurements, we ran two trials. However, each run of the benchmark suite took ~30 minutes to complete, likely due to the relatively unoptimized IR we were producing. As such, we were not able to conduct as many trials as we had hoped.
From our two trials, we found that LICM generally improves performance for benchmarks where it actually modifies the IR. However, we found that this is not universally true. For example, in benchmark s2275, we found that execution time nearly increases by 20%. We hypothesize that this could be for several reasons:
- Register Pressure: Hoisting instructions to a loop preheader can increase their live ranges, increasing register pressure and making register allocation more difficult. In the worst case, this could result in an additional variable being spilled to the stack. In example s2275, an instruction is hoisted from the body of an doubly-nested inner loop to the preheader of its enclosing loop, potentially supporting its hypothesis. If we had more time, we would inspect the assembly to confirm these suspicions.
- Simple noise: We ran the benchmarks locally, and these processes could be descheduled by the OS. Ideally, we would be able to run these benchmarks hundreds of times in a controlled environment.
- We began a Docker deployment on an eight core private cloud instance running Ubuntu 22.04 to capture more of this variance but running all the unoptimized benchmarks would not complete in time. Partial results are here.
For all benchmarks modified by LICM, we found an average 1.96% wall-clock speedup. However, the variance of these estimate is quite high, and we would need to conduct many more runs to be confident it is meaningful. Further, we found that LICM only modified 43/151 benchmarks; the others were untouched. Below, we depict the average speedup from LICM over 2 runs for these 43 benchmarks.

Difficulties

We had trouble getting clang to run our function pass. It turned out that setting -O0 to disable other optimizations had led clang to put the optnone flag on all our functions. That led to the pass not running despite loading it up. We had to supply the -Xclang -disable-O0-optnone flags to get it to not do this.
However, we also wanted the mem2reg pass to run. After more trouble with clang, we opted (haha) to use opt instead to apply our passes, using clang only for generating the initial unoptimized IR.

0 replies

vivianyyd · 2023-10-18T03:39:20Z

vivianyyd
Oct 18, 2023

Will and I worked on lesson 8 together.
Finding natural loops and LI code
Loop invariant code motion

Summary

We implemented methods to find natural loops and loop invariant code in Bril, and perform loop-invariant code motion (LICM).
Benchmarks to come soon! (We will update this post when we have them done)

Implementation details

Using our dominator analysis, we found all the back-edges in the CFG. Next, we performed a DFS traversal from the
start of this back-edge until the header was reached. This made up our loop.
We then created a new data flow analysis (though it did not use our previous dataflow framework, since it was slightly
simpler) to keep track of loop-invariant instructions (this relied on the CFG being in SSA form)
Finally, we implemented LICM, which operates on the CFG (assuming it is in SSA form). This creates a new pre-header
node and moves all loop-invariant instructions into this pre-header, and updates phi nodes accordingly.

What was the hardest part of the task? How did you solve this problem?

The nice part about having a correct SSA was that we could do the optimization in SSA form which made it much easier
to reason about reaching defs. However, this was still tricky since we had issues with the phi nodes in the header node.
If the header contains a phi node referring to a mix of labels inside and outside of the loop, we need to lift part of
the phi node to the pre-header. This is because phi nodes only refer to the label immediately preceding the current
block, and all entries from the loop now come from the pre-header, eliminating the phi node's ability to distinguish
between different entries to the loop. Our fix involved adding a new phi node to the pre-header which pertains only to
labels outside of the loop, and modifying the phi node in the header to refer to only labels inside the loop, as well as
the new node
We also needed to change all references to the header label by nodes outside of the loop to be references to the
pre-header.

1 reply

willwng Oct 20, 2023

The benchmarks took a little longer than expected: we weren't seeing that much of a speedup due to the conversion to and from SSA resulting in a lot of additional overhead (averaging around 58.5% addition dynamic instructions). To mitigate this, we realized that we had already implemented live variable analysis, so we added a much stronger dead-code elimination pass that takes advantage of this.

For these benchmark, we will be comparing our optimized program against SSA -> out of SSA, since SSA generates a lot of extra instructions (and we don't have a copy propagation implementation to create dead code from extraneous copies).

The optimizations we made for the benchmarks were: to SSA -> LICM -> from SSA -> Local variable numbering -> Dead code removal (average of 22.6% speedup). The following shows a subset of the benchmarks

NgaiJustin · 2023-10-18T03:45:41Z

NgaiJustin
Oct 18, 2023

Summary

LLVM LICM Implementation (link)

Details

I implemented loop invariant code motion (LICM) in LLVM. The pass is implemented in licm/LICM.cpp. The pass moves loop invariant code out of the loop and into the preheader. I started off with the pseudocode outlined in lecture. Finding reaching defintions was simple since with LLVM in SSA by default, I just needed to find the defintions of the variables. However, dealing with memory operations were really difficult (more described below).

Testing/Evaluation

To convince myself that the pass was indeed hoisting the loop invariant code out of the natural loop I wrote a small test example and ran it through the pass. The test example and the corresponding LLVM is shown below. The loop invariant code is j = 2 and the associated loop invariant lines should be hoisted out of the loop. The output of the program is x = 6 which is correct.

#include <stdio.h>

#define N 3

int main() {
  int x = 0;

  for (int i = 0; i < N; i++) {
    int j = 2;  // loop invariant
    x = x + j;
  }

  printf("x = %d\n", x);

  return 0;
}

define i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  store i32 0, ptr %1, align 4
  store i32 0, ptr %2, align 4
  store i32 0, ptr %3, align 4
  br label %5

5:                                      ; preds = %12, %0
  %6 = load i32, ptr %3, align 4
  %7 = icmp slt i32 %6, 3
  br i1 %7, label %8, label %15

8:                                      ; preds = %5 (loop body)
  store i32 2, ptr %4, align 4          ; loop invariant (j = 2)
  %9 = load i32, ptr %2, align 4
  %10 = load i32, ptr %4, align 4
  %11 = add nsw i32 %9, %10
  store i32 %11, ptr %2, align 4
  br label %12

12:                                     ; preds = %8
  %13 = load i32, ptr %3, align 4
  %14 = add nsw i32 %13, 1
  store i32 %14, ptr %3, align 4
  br label %5, !llvm.loop !5

15:                                     ; preds = %5
  %16 = load i32, ptr %2, align 4
  %17 = call i32 (ptr, ...) @printf(ptr noundef @.str, i32 noundef %16)
  ret i32 0
}

I opted to test on the TSVC benchmark suite to measure the performance improvement of my optimizations. I had a lot of trouble getting this benchmark suite to run, but after reading Stephen, Sanjit and Matt's post above, I was able to get something running. However, I was only able to measure the wallclock time of the pre and post optimized programs. I was not able to measure the number of instructions executed. Factoring into the slight variance with startup, I found that the speed was the same, this was a bit disheartening, perhaps with more time and after I have recovered I will look into this more.

Difficulties

I spent some time configuring the new LLVM pass manager. I was rather confused since a lot of the resources I came across online referenced the legacy loop analysis llvm frameworks. This awesome resource I found (though in Chinese) explained the new llvm pass manager API really well.
I had some trouble handling the memory operations when implementing the LICM. Only loads and stores are considered loop invariant. I had to check if the operands of the instruction were either a load or store. If they were, I would check if the operands of the load/store were constants. If they were, I would consider the instruction loop invariant. However, I ran into a problem when I was testing my pass on the basic example where it simply won't identify anything in the loop as loop invariant. I decided to fall back on some of the helper functions in LoopUtil to abstract the functionality of checking if an instruction is loop invariant. This was a bit of a hack, but it worked.

0 replies

jdroob · 2023-10-18T03:59:23Z

jdroob
Oct 18, 2023

Summary

@20ashah , @AliceSzzze , and I implemented a loop-invariant code motion pass. This pass hoists loop-invariant instructions (which are determined to be safe to remove from the loop) to the loop preheader.

llvm-licm

Implementation Details

Our pass maintains an unordered set called loopInvariants, which stores LLVM values that are loop-invariant. For each instruction, I, we check to see if the instruction can be hoisted (moved outside of a loop) safely. It checks the following conditions:

It uses the isSafeToSpeculativelyExecute function to ensure that it's safe to speculatively execute the instruction, e.g. not a call / no side effects.
It checks if the instruction may read or write memory, which could make it unsafe to hoist.
It checks if all operands of the instruction are loop-invariant or present in the loopInvariants set. If so, it marks the instruction as loop-invariant and adds it to the loopInvariants set.

The run function is the main function of our pass. The run function iterates through the blocks of a loop and its instructions. For each instruction, it checks if it's loop-invariant and safe to hoist. If it is, the instruction is moved before the loop's preheader (i.e., outside the loop). The loop is processed repeatedly until no more hoisting can be done (controlled by the changing flag).

Evaluation

We ran our optimization on a few small programs with loops and verified that the loop-invariant instructions are hoisted out of the loop and the optimized program works as before. For example, in the following program the instruction %5 is hoisted out of the loop

define void @squareit(i32 noundef %0) #0 {
%2 = mul nsw i32 %0, %0
ret void
}

define i32 @main() #0 {
br label %1

1: ; preds = %7, %0
%2 = phi i32 [ 0, %0 ], [ %8, %7 ]
%3 = icmp slt i32 %2, 10
br i1 %3, label %4, label %9

4: ; preds = %1
%5 = mul nsw i32 9, 5
call void @squareit(i32 noundef %5)
%6 = call i32 (ptr, ...) @printf(ptr noundef @.str, i32 noundef %5)
br label %7

7: ; preds = %4
%8 = add nsw i32 %2, 1
br label %1, !llvm.loop !5

9: ; preds = %1
ret i32 0

}
—------------
…

; Function Attrs: noinline nounwind ssp uwtable
define i32 @main() #0 {
%1 = mul nsw i32 9, 5
br label %2

2: ; preds = %7, %0
%3 = phi i32 [ 0, %0 ], [ %8, %7 ]
%4 = icmp slt i32 %3, 10
br i1 %4, label %5, label %9

5: ; preds = %2
call void @squareit(i32 noundef %1)
%6 = call i32 (ptr, ...) @printf(ptr noundef @.str, i32 noundef %1)
br label %7

7: ; preds = %5
%8 = add nsw i32 %3, 1
br label %2, !llvm.loop !5

9: ; preds = %2
ret i32 0
}

We tried and are still trying to run the llvm test suite, but ran into a lot of build issues and none of us really knows how cmake works :(

Anything Hard or Interesting?

Getting acclimated to the LLVM syntax in general was challenging. As another post mentioned, much of the documentation online was based on the legacy Pass Manager so getting the new version to function properly was a challenge.

Handling side-effects was another obstacle we ran into. There were many loads and stores in the unoptimized LLVM IR. Since these have side-effects, they cannot be hoisted from the loop. To resolve this issue, we ran a mem2reg pass prior to the LICM pass which allowed more instructions to be designated as loop-invariant.

0 replies

yxd97 · 2023-10-18T04:03:31Z

yxd97
Oct 18, 2023

Summary

I implemented the loop-invariant code motion (LICM) pass for this task (code).
I worked on LLVM to process C programs.
I used the embench benchmark suite to evaluate my LICM pass.
- I modified the build scripts to support a pipeline of build commands, so that I can manually apply optimizations passes one by one. The link points my forked repo.

Implementation

I first used FunctionAnalysisManager.getResult<LoopAnalysis>() to obtain all natural loops in a function, then get the dominator tree of that function by FunctionAnalysisManager.getResult<DominatorTreeAnalysis>().
For all the loops, I wen through every instruction to see if all its operands are loop-invariant by calling Loop.isLoopInvariant() on all its operands.
If the instruction is loop-invariant, I further checked if it is safe to be removed. If this instruction (value) dominates all its uses, and does not have side effects (write to memory, throw exceptions, or call a subroutine), and is not a terminator (the last instruction of a basic block), it will be hoisted to the preheader of the loop.
- A special case is that, if one use of the instruction is a phi node, and there is only one definition comining into this phi node, we can treat it as ``dominated''. This will occur when a variable is declared outside of the loop, and assigned to in the loop:

int x;
for (...) {
  x = <some loop-invariant code>;
}

To move the instruction, I used the Instruction.moveBefore() API to insert is before the terminator of the preheader.

Testing

My LICM pass will be applied to the output of mem2reg, which converts the memory SSA IR into traditional SSA IR with phi nodes. That is because my pass will treat memory operations as instructions with side effects and hence skip hoisting them.
The test cases include several hand-written programs and some benchmarks from PolyBenchC. The reason I chose PolyBench is that all its benchmarks can be compiled standalong with trivial modifications, and there exists lots of loops in a benchmark suite for polyhedral analysis.
I first compared the baseline IR and the optimized IR to see if my pass actually did something. It changed the IR for all programs except loop-variant.c and polybench-nussinov.c, which I verified that there is no loop-invariant code to be hoisted.
Then I executed both the baseline and optimized programs to verify their outputs. My pass did not break any of these testing programs.

Evaluation

I evaluated my pass on the embench benchmark suite. There are three setups:
- baseline: the executable built by clang -O0 with the mem2reg pass.
- clang-licm: the executable built by clang -O0 with mem2reg and licm passes. Here the LICM pass is from clang.
- yd-licm: the executable built by clang -O0 with mem2reg and my own LICM passes.
Surprisingly, there is no difference in benchmark results across all three setups:

benchmark	baseline	clang-licm	yd-licm
aha-mont64	4004000.00	4004000.00	4004000.00
crc32	4010000.00	4010000.00	4010000.00
cubic	3931000.00	3931000.00	3931000.00
edn	4010000.00	4010000.00	4010000.00
huffbench	4120000.00	4120000.00	4120000.00
matmult-int	3985000.00	3985000.00	3985000.00
md5sum	4002000.00	4002000.00	4002000.00
minver	3998000.00	3998000.00	3998000.00
nbody	2808000.00	2808000.00	2808000.00
nettle-aes	4026000.00	4026000.00	4026000.00
nettle-sha256	3997000.00	3997000.00	3997000.00
nsichneu	4001000.00	4001000.00	4001000.00
picojpeg	4030000.00	4030000.00	4030000.00
primecount	3834000.00	3834000.00	3834000.00
qrduino	4253000.00	4253000.00	4253000.00
sglib-combined	3981000.00	3981000.00	3981000.00
slre	4010000.00	4010000.00	4010000.00
st	4080000.00	4080000.00	4080000.00
statemate	4001000.00	4001000.00	4001000.00
tarfind	4033000.00	4033000.00	4033000.00
ud	3999000.00	3999000.00	3999000.00
wikisort	2779000.00	2779000.00	2779000.00

I suspect the reason is, in a normal program, the hot spot code is usually loop-variant. If not, why do we need a loop there? I inspected into the optimized IR of one of the test programs, polybench-gemm.c, and observed that the only hoisted instructions are int-to-double conversions of a constant:

for (i = 0; i < ni; i++)
  for (j = 0; j < nk; j++)
    A[i][j] = (double)(i * (j + 1) % nk) / nk;  // here nk will be impcilitly converted to double

Challenges

The difficult part is to navigate the LLVM documents and fild the correct utilities functions to use. It require some trial-and-error, but once you find the correct ones, the pass will become fairly robust. However, I avoided using the markLoopInvariant function to let me practice the save-to-hoist conditions learned in class. Finding a good benchmark suite is also challenging, since it is really difficult to find a normal program that will run much faster just with LICM optimizaion.

0 replies

ryanwmao · 2023-10-19T00:40:16Z

ryanwmao
Oct 19, 2023

@xalbt and I worked together on lesson 8.

Summary

licm

Implementation

We used the LoopPass class to implement an LICM pass in LLVM
We initially struggled a bit in trying to use a LoopPass, but we referenced documentation found here as well as the skeleton code available on the class repo
We use the isLoopInvariant() function and moveBefore() in our implementation.

Difficulties

Again, we found navigating docs to be very difficult. There was a lot of function calls and interactions with the llvm code that we weren't 100% sure of
We were on a bit of a time crunch and we couldn't give this task as much attention as we hoped to
We also struggled a bit to run the actual pass, since there was a lot of differing documentation online (and also the different pass manager versions etc.). We ultimately settled on what we have right now

Testing

We weren't able to rigorously test our code as much as we would have liked. We ran the code on a few small examples to verify that it works, and then tried to test it on the TSVC benchmark suite.
Ultimately, we found no significant improvement in wall clock runtime between code with and without our optimization. We think that this is due to us not hoisting memory operations, which probably constitute the majority of the runtime.

0 replies

evanmwilliams · 2023-10-21T23:48:20Z

evanmwilliams
Oct 21, 2023

I worked with @emwangs and @he-andy for this assignment.

Summary

Find the link to the code here

Implementation

We attempted to implement this assignment in LLVM. We ran into many issues. First, getting LLVM to build on a loop pass was pretty difficult. I think that loop passes are either deprecated or there isn't a ton of support for them, so we had to work a bit hard to get anything to run.

Once we got it building, the code wouldn't run for a bit because LLVM could not find our pass. This was also weird, because we registered it just like we registered any other pass. Eventually we got this to work by forcing a directive into our run script. We also made a run script run.sh to make life a bit easier.

Then we tried to actually implement LICM. I believe it works for super trivial test cases, but the analysis is far too conservative. Upon further inspection, it appears that the function isLoopInvariant() provided by LLVM only returns true if the instruction isn't inside a loop...

So overall, we're still working on this one. I believe that to do this properly we might want to find the loops ourselves that way we can also identify the pre-headers and modify instructions as needed, but this will take a lot more time, so I plan to keep working on this over the weekend.

Difficulties

Everything I mentioned above! (Building, registering the pass, running the pass, testing, etc).

Testing

We have not been able to fully test this one yet, as we were quite busy this week. We know that it doesn't work very well though, and will keep working as the weekend goes on :)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lesson 8: Loop Optimization #353

{{title}}

Replies: 14 comments 1 reply

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Lesson 8: Loop Optimization #353

sampsyo Aug 21, 2023 Maintainer

Replies: 14 comments · 1 reply

Summarize what you did.

Explain how you know your implementation works—how did you test it? Which test inputs did you use? Do you have any quantitative results to report?

Turnt

Brench

What was the hardest part of the task? How did you solve this problem?

Implementation

Testing

Challenges

Summary

Details

Testing

Difficulties

Summary

Implementation Details

Testing

Difficulties

Summary

Test

Hardest Part

Handle Load And Store Instruction

Summary

Implementation

Testing

Difficulties

Summary

Details

Testing/Evaluation

Difficulties

Summary

Implementation

Testing

Evaluation

Challenges

Summary

Implementation

Difficulties

Testing

Summary

Implementation

Difficulties

Testing

sampsyo
Aug 21, 2023
Maintainer

Replies: 14 comments 1 reply