Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
perf: try compiling with
-Ofast
instead of -O3
Some risk involved, but 🤷. From `clang` manpage: -O0, -O1, -O2, -O3, -Ofast, -Os, -Oz, -Og, -O, -O4 Specify which optimization level to use: -O0 Means "no optimization": this level compiles the fastest and generates the most debuggable code. -O1 Somewhere between -O0 and -O2. -O2 Moderate level of optimization which enables most optimizations. -O3 Like -O2, except that it enables optimizations that take longer to perform or that may generate larger code (in an attempt to make the program run faster). -Ofast Enables all the optimizations from -O3 along with other aggressive optimizations that may violate strict compliance with language standards. -Os Like -O2 with extra optimizations to reduce code size. From `gcc` manpage (which I don't actually have installed on this machine, so grabbing a snippet from online instead): -O -O1 Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. -O turns on the following optimization flags: -fauto-inc-dec -fbranch-count-reg -fcombine-stack-adjustments -fcompare-elim -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fforward-propagate -fguess-branch-probability -fif-conversion -fif-conversion2 -finline-functions-called-once -fipa-profile -fipa-pure-const -fipa-reference -fipa-reference-addressable -fmerge-constants -fmove-loop-invariants -fomit-frame-pointer -freorder-blocks -fshrink-wrap -fshrink-wrap-separate -fsplit-wide-types -fssa-backprop -fssa-phiopt -ftree-bit-ccp -ftree-ccp -ftree-ch -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-forwprop -ftree-fre -ftree-phiprop -ftree-pta -ftree-scev-cprop -ftree-sink -ftree-slsr -ftree-sra -ftree-ter -funit-at-a-time -O2 Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code. -O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags: -falign-functions -falign-jumps -falign-labels -falign-loops -fcaller-saves -fcode-hoisting -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks -fdevirtualize -fdevirtualize-speculatively -fexpensive-optimizations -fgcse -fgcse-lm -fhoist-adjacent-loads -finline-small-functions -findirect-inlining -fipa-bit-cp -fipa-cp -fipa-icf -fipa-ra -fipa-sra -fipa-vrp -fisolate-erroneous-paths-dereference -flra-remat -foptimize-sibling-calls -foptimize-strlen -fpartial-inlining -fpeephole2 -freorder-blocks-algorithm=stc -freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -fschedule-insns -fschedule-insns2 -fsched-interblock -fsched-spec -fstore-merging -fstrict-aliasing -fthread-jumps -ftree-builtin-call-dce -ftree-pre -ftree-switch-conversion -ftree-tail-merge -ftree-vrp Please note the warning under -fgcse about invoking -O2 on programs that use computed gotos. -O3 Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on the following optimization flags: -fgcse-after-reload -finline-functions -fipa-cp-clone -floop-interchange -floop-unroll-and-jam -fpeel-loops -fpredictive-commoning -fsplit-paths -ftree-loop-distribute-patterns -ftree-loop-distribution -ftree-loop-vectorize -ftree-partial-pre -ftree-slp-vectorize -funswitch-loops -fvect-cost-model -fversion-loops-for-strides -O0 Reduce compilation time and make debugging produce the expected results. This is the default. -Os Optimize for size. -Os enables all -O2 optimizations except those that often increase code size: -falign-functions -falign-jumps -falign-labels -falign-loops -fprefetch-loop-arrays -freorder-blocks-algorithm=stc It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size. -Ofast Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. Result: Summary of cpu time and (wall time): best avg sd +/- p (best) (avg) (sd) +/- p pathological 0.20632 0.21660 0.06898 [+0.0%] (0.20632) (0.21660) (0.06898) [+0.0%] command-t 0.16686 0.17429 0.05953 [+0.6%] (0.16686) (0.17430) (0.05954) [+0.6%] chromium (subset) 1.32075 1.33852 0.03378 [-0.2%] (0.27715) (0.28639) (0.01831) [-2.0%] 0.0005 chromium (whole) 1.10602 1.11082 0.01093 [-0.6%] 0.0005 (0.12251) (0.12594) (0.01112) [+0.1%] big (400k) 1.66209 1.66965 0.02413 [-0.6%] 0.0005 (0.17984) (0.18305) (0.00927) [-0.3%] total 4.47450 4.50988 0.11781 [-0.4%] 0.0005 (0.96369) (0.98629) (0.10814) [-0.5%] 0.005 Those are the Lua benchmarks, of course. Ruby unaffected (and it is using `-Os` anyway.
- Loading branch information