Replies: 17 comments 16 replies
-
Critique: It would be nice if the paper discussed how to pick appropriate statistical measures for comparing two systems' performance (although this might be out of scope). Jules Jacobs argues in a Twitter thread that speedups ought to be averaged using the harmonic mean as opposed to the (more common) geometric mean. Measurement bias might affect different statistical measures to varying degrees, and it would be interesting to see whether certain measures are less susceptible to noise / measurement bias (a tiny numeric illustration of how much the choice of mean matters is at the end of this comment).
Question: The paper mentions that benchmark suites should consist of "diverse" workloads (section 8.1). How should we evaluate the diversity of a workload? (And what makes a workload sufficiently diverse?)
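The illustration mentioned above: a few made-up per-benchmark speedups, summarized three ways. The numbers are hypothetical; the point is only that the choice of mean changes the headline figure, and that each mean weights outliers (and therefore any bias in them) differently.

```python
# Made-up O3-over-O2 speedups for four benchmarks, summarized three ways.
import math

speedups = [0.8, 1.1, 1.3, 2.5]

arithmetic = sum(speedups) / len(speedups)
geometric = math.prod(speedups) ** (1 / len(speedups))
harmonic = len(speedups) / sum(1 / s for s in speedups)

print(f"arithmetic={arithmetic:.3f}  geometric={geometric:.3f}  harmonic={harmonic:.3f}")
# Roughly 1.425, 1.300, and 1.202: same data, three different headlines.
# The harmonic mean is dragged down most by the slowest benchmark, so bias
# that hits the slow outlier shifts it more than it shifts the geometric mean.
```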
-
Critique: Questions: (3) The author states that knowing the internal details of a microprocessor will help (i) understand the performance of a micro-kernel and (ii) fully exploit the microprocessor to obtain peak performance. What internal details of a microprocessor does a performance analyst reasonably need to know? (The paper mentioned alignment preferences.)
-
Critique: With regard to the proposed experimental-setup randomization approach, I am still skeptical that this will remain feasible as more knobs get added to systems. To achieve adequate randomization the authors used 484 setups (Section 7.1.2) -- and this was just to enumerate setups that vary in link order and environment size. We could imagine the space of random experimental setups growing infeasibly large for systems researchers to enumerate (a rough sketch of what a single randomized trial involves, and hence where the cost comes from, is at the end of this comment).
Discussion Questions:
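The sketch mentioned above: a minimal, hypothetical harness (the object files, binary name, and trial count are placeholders, not the paper's artifact) that randomizes only the two knobs the paper studies. Even this toy version needs a relink per trial, and every additional knob multiplies the setup space.

```python
# Hypothetical harness: randomize link order and UNIX environment size
# for each timing trial of one benchmark. All file names are placeholders.
import os
import random
import subprocess
import time

OBJECTS = ["main.o", "kernel.o", "util.o"]  # placeholder object files

def run_one_trial() -> float:
    # Knob 1: link order -- relink the benchmark with the .o files shuffled.
    objs = OBJECTS[:]
    random.shuffle(objs)
    subprocess.run(["cc", "-O2", "-o", "bench", *objs], check=True)

    # Knob 2: environment size -- pad the environment by a random number of
    # bytes, which shifts where the stack (and everything on it) starts.
    env = dict(os.environ)
    env["PADDING"] = "x" * random.randrange(0, 4096)

    start = time.perf_counter()
    subprocess.run(["./bench"], env=env, check=True)
    return time.perf_counter() - start

times = [run_one_trial() for _ in range(30)]  # 30 trials of ONE benchmark
print(f"min={min(times):.3f}s  max={max(times):.3f}s")
# Multiply this by every benchmark, every machine, and every new knob
# (ASLR, heap layout, CPU frequency, ...) and the cost adds up fast.
```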
-
I'm feeling strong parallels to the replication crisis happening in other areas of science, where tons of studies (especially in psychology and, Wikipedia claims, also medicine) are failing to replicate as researchers find either flawed methodology or other issues with the papers. It seems that this paper is driving home a similar point about empirical evidence in computer science, which I don't like because it makes me sad.

My critique, if I can call it that, is that I don't quite believe the authors when they say that we can learn from other fields of science about ways to mitigate these biases. After all, given how dire the replication crisis has gotten in other fields, and how many fields it has spread to, it doesn't exactly inspire confidence that the other fields have it figured out either. And furthermore, if every single innocuous thing has the potential to completely skew the result of a study, then how can it ever be humanly possible to ensure that we've randomized all of them? It seems possible, if not likely, that there will always be some tiny, insignificant factor that no one thought of, which is itself enough to cause a fast optimization to appear slower. Unless we can somehow exhaustively enumerate every single aspect of an environment that could skew an experiment (and how could that list ever be truly exhaustive?), the results here seem to show that even missing a single aspect dooms us all. So, what is even the point? I'm just not sure I'm convinced that randomizing some factors is an improvement if missing one factor alone changes the entire experiment.

(I also find it kind of funny how the paper keeps coming back to the analogy of polling an election by sampling a large swath of the electorate across different demographics, especially right after multiple cycles of fairly abysmal political polling precisely because pollsters are having trouble figuring out which demographics to poll across.)

Discussion Question
-
I appreciated the argument around measurement bias, but I'm almost more interested in the results which seem to indicate that memory layout can have a greater impact on performance than the compiler optimizations being measured. Admittedly, it's a bit hard to know what to make of this, because performance is always discussed in terms of O3 speedup over O2, so it's not clear what is speeding up or slowing down in absolute terms (a small numeric illustration of this ambiguity is at the end of this comment). However, besides making it hard to report measurements, it seems bad that a user might not observe any benefit from an optimization because, e.g., they had environment variables that were a bad length. After all, it is actually a good thing if someone's program runs faster because their environment variables are a "good" length, even though this can be framed as measurement bias. Of course, optimizing memory layout is a hard problem; Figure 5 is a nice illustration of how there is no one-size-fits-all configuration across different machines, and the best configuration may depend on internal microprocessor details which, as the paper points out, may not be public. But I'm definitely curious about what has been tried in this space.

Discussion question: One solution the paper focuses on is using more diverse workloads and statistical methods to factor out bias. When trying to measure the performance of an optimization that is targeted towards certain types of workloads, e.g. memory-intensive ones, how should we approach striking a balance between workload diversity and measuring performance on the types of workloads that the optimization is specifically targeting?
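The illustration mentioned above, with made-up execution times: the same reported ratio can come either from O3 genuinely getting faster or from the O2 baseline being penalized by an unlucky layout.

```python
# Made-up execution times (seconds) under two memory layouts. The reported
# "O3 speedup over O2" is identical, but the absolute behavior is not.
layout_a = {"O2": 10.0, "O3": 8.0}   # O3 genuinely faster
layout_b = {"O2": 15.0, "O3": 12.0}  # whole baseline penalized by the layout

for name, t in [("layout A", layout_a), ("layout B", layout_b)]:
    print(f"{name}: speedup = {t['O2'] / t['O3']:.2f}")
# Both print 1.25, yet under layout B the "optimized" binary (12.0 s) is
# slower in absolute terms than the unoptimized one under layout A (10.0 s).
```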
-
Critique: I found this paper to be very instructive and applicable to my own research involving systems. I find it very helpful that the authors did a walkthrough with an example of using two techniques to mitigate measurement bias and establish confidence in conclusions.

While reading the paper I thought that not all systems are subject to measurement bias, and that it depends on the particular metric being evaluated. However, upon further thought, I think I’m wrong. To elaborate, I thought that measurement bias is an issue in the O3 optimization speedup example because they’ve chosen execution-time speedup as their evaluated metric. This metric condenses and aggregates a lot of factors involving the experimental setup to produce one number (the speedup), which is why measurement bias is a serious concern in this example. However, I thought that if we’re using a simulator that simulates only a single simple subsystem (e.g. a cache) and evaluating a discrete, simple metric (e.g. hit rate) to evaluate a caching mechanism, then we don’t need to worry about measurement bias, as the tightly controlled, transparent, simple nature of the experimental setup doesn’t allow for emergent, unexplainable behavior. However, even in this controlled, simple setting we can have measurement bias, since the cache’s behavior depends on factors such as the cache size, cache line size, replacement policy, etc. (a toy illustration is at the end of this comment). Furthermore, we still need diverse benchmarks to stress-test, as some workloads may favor the caching mechanism. To conclude, measurement bias lurks everywhere, even in simple systems.

Discussion Question: Could there be an evaluation of a practical system that is somehow not subject to measurement bias, due to its nature or the proposition it is evaluating? For example, in the limit, a proposition simple enough, such as “the cache outputs data if it is present in cache”, would not have any measurement bias; one workload would be just as good as n workloads, and varying experimental setups would not change the results, so evaluating one workload with one experimental setup would be enough. Evaluating more would increase confidence in the proposition, but there is no measurement bias.
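The toy illustration mentioned above: a deliberately tiny direct-mapped cache model with made-up parameters and a made-up access trace. Even the "simple, discrete" metric of hit rate swings with a configuration knob (here, line size) that the experimenter might not think to vary.

```python
# Toy direct-mapped cache: the "simple" metric of hit rate still depends
# heavily on configuration knobs, even with an identical access trace.
def hit_rate(addresses, cache_bytes=1024, line_bytes=64):
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one tag per direct-mapped line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        idx = block % num_lines
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block          # miss: fill the line
    return hits / len(addresses)

trace = [i * 48 for i in range(2000)]  # made-up strided access pattern

for line in (32, 64, 128):
    print(f"line size {line:>3} B -> hit rate {hit_rate(trace, line_bytes=line):.2f}")
# Prints roughly 0.00, 0.25, and 0.62 -- same workload, same mechanism,
# very different conclusions depending on a single setup parameter.
```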
-
I'm interested in the reasons behind some of the performance fluctuations reported in the paper. In section 4.2.2, the authors attribute a possible cause of measurement bias from UNIX environment size to an event called
I was wondering about the specifics of these "conservative heuristics." I was under the impression that most architectures will speculatively try to reorder loads before stores, since loads are often more critical to performance. When the addresses are resolved later in the pipeline, the hardware checks whether there was a mis-speculation, meaning that a load that should have blocked on a store was incorrectly ordered before that store. If it finds a mis-speculation, it replays the instructions. Here, I thought that detecting load-store overlaps literally just meant checking the addresses -- I'm not sure what the authors mean by conservative heuristics, or why alignment would have an effect on them. From my understanding, a heuristic shouldn't be necessary to begin with? I've also heard of architectures that will try to predict which loads depend on which stores, similar to how a branch predictor predicts branch correlations/outcomes. Still, I don't think this is what the authors mean. I'm interested if other people have insights or thoughts about this.

Now a bit bigger picture: I think the paper presents a dilemma that doesn't really have a clear solution. One challenge, which other people have also mentioned, is knowing which parts of an experimental setup might cause measurement bias. The paper focuses on two, but I'm sure there are numerous others that I can't think of off the top of my head. This is partially obscured by implementation details that hardware vendors don't disclose, which highlights a tension between the ISA and the lower-level implementation details: on the one hand, we like the ISA as the lowest level of programming abstraction for reasons like code portability and security. On the other hand, we might want to peek below the ISA abstraction for performance boosts -- or, in the case of this paper, to figure out causes of measurement bias. It seems to me like finding all these sources of measurement bias is at odds with the programming abstractions we're used to, and I'm not sure what the solution is for identifying these sources and the precise reasoning behind them.

Discussion question: The specific causes of measurement bias discussed in the paper seem specific to CPUs (as opposed to more application-specific architectures), since they deal with code/data layout. If we care about benchmarking code on something even a little more specialized (like a GPU, for example), what kinds of sources of measurement bias might arise? How can we be sure that we've exhausted all sources of measurement bias in a given experiment?
-
Critique: The paper really emphasizes benchmark diversity as a cure for measurement bias, but they don't really back up that claim. I think it makes intuitive sense in a Law of Large Numbers sort of way -- if the "diverse" benchmarks are independently distributed with respect to the measurement bias, then I think you would expect them to give a normal distribution around the "true" value (a toy simulation of this intuition is at the end of this comment). But I think there are two issues with that claim.

The first is that they run exactly this experiment, they don't get a normal distribution, and they conclude that SPECINT just isn't diverse enough of a benchmark. They don't try any other sets of benchmarks, or even discuss what about SPECINT makes it not diverse enough or how it could be improved. I don't think they ever provide evidence that there is such a thing as a benchmark suite that is diverse enough to average out measurement biases.

The second issue is that they presuppose a ground-truth value. In the case of O3 vs. O2 (and in many software evaluation contexts) I don't think there actually is one. The promise of O3 optimizations is not "this will make all C programs X% faster", it is "this will make some C programs between X% and Y% faster, depending on their contents". So I think the idea that you could develop a set of benchmarks so "diverse" that they would have an average speedup clustered around the "true" speedup of O3 is just malformed. There is no "true" O3 speedup.

Question: The milc benchmark seems to do some numerical analysis type stuff related to quantum field theory:
It is also pretty reliably un-measurement-biased in the paper's various experiments. Why do you think it is so unbiased?
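The toy simulation mentioned above, with entirely made-up numbers: it just formalizes the independence assumption behind the Law-of-Large-Numbers intuition and shows where it breaks down.

```python
# Toy simulation: per-benchmark bias that is independent across benchmarks
# shrinks when averaged over a suite; bias that is shared does not.
import random
import statistics

random.seed(0)
TRUE_SPEEDUP = 1.10
N_BENCH = 20

def suite_mean(shared_fraction):
    # Each "experimental setup" draws one shared bias term plus an
    # independent per-benchmark term; shared_fraction mixes the two.
    shared = random.gauss(0, 0.05)
    measured = [
        TRUE_SPEEDUP
        + shared_fraction * shared
        + (1 - shared_fraction) * random.gauss(0, 0.05)
        for _ in range(N_BENCH)
    ]
    return statistics.mean(measured)

for label, frac in [("independent bias", 0.0), ("shared bias", 1.0)]:
    means = [suite_mean(frac) for _ in range(1000)]
    print(f"{label}: spread of suite means = {statistics.stdev(means):.4f}")
# The independent case clusters tightly around 1.10; the shared case is as
# noisy as a single benchmark -- "diversity" only helps if the biases really
# are independent across benchmarks.
```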
-
Critique: The paper did well showing how measurement bias is extremely prevalent, but I find it hard to see how it could ever be removed in a sustainable manner. A future in which every experiment requires 400+ experimental setups seems like it would need some sort of computing cluster to help, and I don't like how that would raise the barrier to entry for doing "more correct" experiments. I think it might be worth exploring what a tolerable level of measurement bias would be, so that we don't need so many setups. Also, I wonder how valid some of this paper's counterarguments are. For example, in section 6.2 they address the argument that large speedups can, in a sense, overpower measurement bias, but the paper counters with data from section 4 showing how some instances of measurement bias can be very large too. This seems a bit of an overgeneralization to me, especially if a given large speedup is also theoretically sound. But I suppose it's not a bad idea to consider the possibility of bias despite huge results.

Questions: This paper mentioned some solutions like more cooperation from hardware vendors. Just as RISC-V has been nice for open-source hardware, I wonder if a completely open-source CPU could become a good standard for benchmarking in the future? Also, I wonder if the prevalence of measurement bias should affect the optimizations we choose to make. Should we be optimizing compilers to work with modern hardware, or should we be doing optimizations that work on theoretically ideal hardware (infinite cache lines, for example) and let hardware designers optimize to fit the compilers? I'm thinking that, in the case of the paper, an experiment done on a simulation of ideal hardware could reduce issues caused by memory alignment, and then there would be fewer factors to consider, like linking order.
-
Critique. I would have appreciated a causal analysis of the original question posed: to what degree do O3 optimizations improve performance? A proper causal analysis of this question would intervene minimally on the exact O3 optimizations applied. In particular, I was expecting an attempt to isolate (if possible) the optimization pass(es) responsible for performance wins and losses in benchmarks, with perhaps some discussion of how to experimentally reach causal conclusions while working around the compiler's phase-ordering problem.

Question. What would be a good way to measure compiler performance wins while working around the phase-ordering problem? I was thinking that one could ablate various combinations of O3 optimization passes. Depending on how many passes there are, it might be feasible to enumerate all possible combinations of O3 optimization passes for the interesting benchmark suites (e.g. those that straddle the 1.0 line).
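A rough sketch of what such an ablation could look like. The benchmark file name is a placeholder, and the flag list is illustrative and GCC-version-dependent (it is not the paper's setup; `gcc -O3 -Q --help=optimizers` shows what a given compiler actually enables beyond -O2).

```python
# Time a benchmark at -O2 plus each subset of a few flags that -O3 adds.
# Flag list is illustrative; bench.c is a placeholder source file.
import itertools
import subprocess
import time

O3_EXTRA_FLAGS = [
    "-fgcse-after-reload",
    "-fipa-cp-clone",
    "-fpredictive-commoning",
    "-funswitch-loops",
]

def time_with(flags):
    subprocess.run(["gcc", "-O2", *flags, "-o", "bench", "bench.c"], check=True)
    start = time.perf_counter()
    subprocess.run(["./bench"], check=True)
    return time.perf_counter() - start

baseline = time_with([])
for r in range(1, len(O3_EXTRA_FLAGS) + 1):
    for combo in itertools.combinations(O3_EXTRA_FLAGS, r):
        speedup = baseline / time_with(list(combo))
        print(f"{speedup:5.2f}x  {' '.join(combo)}")
# The number of subsets grows as 2^n, so in practice one would ablate a
# handful of passes at a time -- and, per the paper, repeat each point
# across randomized setups to separate the pass from the memory layout.
```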
-
Critique: Questions:
-
Critique: As I was reading the paper, I was coming up with possible critiques which were covered by later sections; for example, their assertion that link order and UNIX environment size caused the variation in results (rather than being merely correlated with it) in sections 4.1-4.2 gave me pause, but this was addressed more explicitly later in section 7.2. So I would say they did a good job anticipating possible critiques and covering their bases. I also noticed that in sections 4.1-4.2 they focused largely on results from the perlbench test, which we can see on their violin plots had the largest variation and thus is the most in line with their argument. This made me think that perhaps the specific issue they're observing (benchmark variation with varied link order and UNIX environment size) only holds strongly for some benchmarks, which could simply be avoided. However, they only examined these two underlying causes; the other benchmarks, which looked to perform with much less variation on the violin plots, may have great amounts of variation for other underlying causes, which may not be easily detectable without the benchmark evaluation method described in 7.1.1 (demonstrated in Figure 6).

Discussion Questions: The paper acknowledges the difficulty of implementing the proposed techniques for managing measurement bias, especially the demands of experimental setup randomization; they present some possible strategies in section 8 to reduce the presence of measurement bias in a more manageable way. Which of these suggestions do you think is the most feasible and most helpful in reducing the problem? Can you think of any other ways that they missed to scale down their findings into more feasible strategies?
-
Critique: Overall, I think this paper does well to point readers in a direction where something completely unobvious can be the reason for easily misinterpreted results. However, the authors also state that diverse benchmarks aren’t always the solution, even though that seems to be how measurement bias is mitigated in several other fields. I agree that the notion of measurement bias is good to point out, but I don’t think this is “too” realistic for research purposes. The paper uses a small code snippet to highlight the issue of the size of environment variables, but in the real world there are so many more factors, not to mention the size of codebases (1000x larger samples), that magnify the effects of measurement bias. I was also surprised that there was no significant trend when using a diverse set of benchmarks: as shown in Figure 2 in section 4, a wide variety of distributions appear across different benchmarks. This makes me question how feasible it really is to find a truly unbiased result, especially today, when there are hundreds of benchmarks (and some research uses just the top benchmarks to show performance improvements).

Question: When should one start considering the “diversity” of benchmarks rather than what is commonly used in an area of research? For example, in the field of computer vision / machine learning, people often use different GPUs (or single/multiple GPUs) depending on the precisions of the models used. Thus, how important is it really to consider using a diverse set of benchmarks when researchers often tend to pick the one that shows the most performance gains?
-
Critique: To that effect, I thought that the problem presentation and statement were well done. Where I felt the paper was more vacuous was in the later sections (particularly 8), and more specifically in its discussion of techniques to better analyze the potential variables at play in experimentation. In 8.3, the paper discusses having more manufacturers release details on the hardware they use – "Sun Microsystems has already taken the lead by releasing all details of one of their microprocessors (the OpenSparc project); we hope other manufacturers will follow." I question whether it is systematic and conclusive enough to rely on data that may or may not be supplied when developing a solution.

Question:
-
Critique: The overall discussion on UNIX environment size was certainly illuminating and drove home their point on the prevalence of measurement bias. I'm most interested in Figure 1(b): in particular, I wish they had said more about the data point at the top! Stack alignment was a good explanation for the 33% speedups, but surely something more is going on for the 300% one?

Discussion Question: The methods the paper proposes for avoiding measurement bias require knowing the sources of the bias:
This prior knowledge of bias seems like a luxury most papers don't have:
I doubt negligence led these authors to fail to address measurement bias; more likely, they couldn't think of where it could come from. As such, I wonder whether there are techniques to detect or decrease the impact of measurement bias without knowledge of its source?
-
Critique: I do not think some of the methods the author proposed (coming up with a really diverse test suite and causal analysis) are quite feasible to implement in daily practice. Experimental randomization may also become impractical if our systems get too complicated.

Question:
-
Critique: First, I found it incredibly interesting that, in their experiment on O3's effect, the results appear to depend more on link order and environment size than on the optimization itself (section 4). To me, this seems to indicate that the speedup from compiler optimizations can be outweighed by the hardware preferences mentioned in 4.3. It seems to me that if the compiler has knowledge of the hardware then it should make the decisions that allow for optimal performance, and yet 4.3 outlines that, at least for the Intel compiler, this is not the case. Given the seemingly outsize effect that memory layout had on performance, I'm curious whether there has been other work in this area (and also why the Intel compiler appears not to exploit its knowledge of the hardware for performance).

Discussion Question: In section 7.1, the authors propose using a "diverse" set of benchmarks as one possible solution for measurement bias in experiments. However, the authors don't go into much detail about what makes a set diverse and why the set they used is not diverse (they point out that their set isn't diverse, but I'm curious as to what makes it that way). So my question is: what are the requirements for a diverse set of benchmarks, and how would one go about creating one?
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!