Replies: 17 comments 16 replies
-
Critique: It would be nice if the paper discussed how to pick appropriate statistical measures for comparing two systems' performance (although this might be out of scope). Jules Jacobs argues in a Twitter thread that speedups ought to be averaged using the harmonic mean as opposed to the (more common) geometric mean. Measurement bias might affect different statistical measures to varying degrees, and it would be interesting to see whether certain measures are less susceptible to noise / measurement bias (a tiny numeric illustration of how much the choice of mean matters is at the end of this comment).
Question: The paper mentions that benchmark suites should consist of "diverse" workloads (section 8.1). How should we evaluate the diversity of a workload? (And what makes a workload sufficiently diverse?)
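The illustration mentioned above: a few made-up per-benchmark speedups, summarized three ways. The numbers are hypothetical; the point is only that the choice of mean changes the headline figure, and that each mean weights outliers (and therefore any bias in them) differently.

```python
# Made-up O3-over-O2 speedups for four benchmarks, summarized three ways.
import math

speedups = [0.8, 1.1, 1.3, 2.5]

arithmetic = sum(speedups) / len(speedups)
geometric = math.prod(speedups) ** (1 / len(speedups))
harmonic = len(speedups) / sum(1 / s for s in speedups)

print(f"arithmetic={arithmetic:.3f}  geometric={geometric:.3f}  harmonic={harmonic:.3f}")
# Roughly 1.425, 1.300, and 1.202: same data, three different headlines.
# The harmonic mean is dragged down most by the slowest benchmark, so bias
# that hits the slow outlier shifts it more than it shifts the geometric mean.
```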
-
Critique: Questions: (3) The author states that knowing the internal details of a microprocessor will help (i) understand the performance of a micro-kernel and (ii) fully exploit the microprocessor to obtain peak performance. What internal details of a microprocessor does a performance analyst reasonably need to know? (The paper mentioned alignment preferences.)
-
Critique: With regard to the proposed experimental-setup randomization approach, I am still skeptical that this will remain feasible as more knobs get added to systems. To achieve adequate randomization the authors used 484 setups (Section 7.1.2) -- and this was just to enumerate setups that vary in link order and environment size. We could imagine the space of random experimental setups growing infeasibly large for systems researchers to enumerate (a rough sketch of what a single randomized trial involves, and hence where the cost comes from, is at the end of this comment).
Discussion Questions:
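The sketch mentioned above: a minimal, hypothetical harness (the object files, binary name, and trial count are placeholders, not the paper's artifact) that randomizes only the two knobs the paper studies. Even this toy version needs a relink per trial, and every additional knob multiplies the setup space.

```python
# Hypothetical harness: randomize link order and UNIX environment size
# for each timing trial of one benchmark. All file names are placeholders.
import os
import random
import subprocess
import time

OBJECTS = ["main.o", "kernel.o", "util.o"]  # placeholder object files

def run_one_trial() -> float:
    # Knob 1: link order -- relink the benchmark with the .o files shuffled.
    objs = OBJECTS[:]
    random.shuffle(objs)
    subprocess.run(["cc", "-O2", "-o", "bench", *objs], check=True)

    # Knob 2: environment size -- pad the environment by a random number of
    # bytes, which shifts where the stack (and everything on it) starts.
    env = dict(os.environ)
    env["PADDING"] = "x" * random.randrange(0, 4096)

    start = time.perf_counter()
    subprocess.run(["./bench"], env=env, check=True)
    return time.perf_counter() - start

times = [run_one_trial() for _ in range(30)]  # 30 trials of ONE benchmark
print(f"min={min(times):.3f}s  max={max(times):.3f}s")
# Multiply this by every benchmark, every machine, and every new knob
# (ASLR, heap layout, CPU frequency, ...) and the cost adds up fast.
```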
-
I'm feeling strong parallels to the replication crisis happening in other areas of science, where tons of studies (especially in psychology and, Wikipedia claims, also medicine) are failing to replicate as researchers find either flawed methodology or other issues with the papers. It seems that this paper is driving home a similar point about empirical evidence in computer science, which I don't like because it makes me sad.

My critique, if I can call it that, is that I don't quite believe the authors when they say that we can learn from other fields of science about ways to mitigate these biases. After all, given how dire the replication crisis has gotten in other fields, and how many fields it has spread to, it doesn't exactly inspire confidence that the other fields have it figured out either. And furthermore, if every single innocuous thing has the potential to completely skew the result of a study, then how can it ever be humanly possible to ensure that we've randomized all of them? It seems possible, if not likely, that there will always be some tiny, insignificant factor that no one thought of, which is itself enough to cause a fast optimization to appear slower. Unless we can somehow exhaustively enumerate every single aspect of an environment that could skew an experiment (and how could that list ever be truly exhaustive?), the results here seem to show that even missing a single aspect dooms us all. So, what is even the point? I'm just not sure I'm convinced that randomizing some factors is an improvement if missing one factor alone changes the entire experiment.

(I also find it kind of funny how the paper keeps coming back to the analogy of polling an election by sampling a large swath of the electorate across different demographics, especially right after multiple cycles of fairly abysmal political polling precisely because pollsters are having trouble figuring out which demographics to poll across.)

Discussion Question
-
I appreciated the argument around measurement bias, but I'm almost more interested in the results which seem to indicate that memory layout can have a greater impact on performance than the compiler optimizations being measured. Admittedly, it's a bit hard to know what to make of this, because performance is always discussed in terms of O3 speedup over O2, so it's not clear what is speeding up or slowing down in absolute terms (a small numeric illustration of this ambiguity is at the end of this comment). However, besides making it hard to report measurements, it seems bad that a user might not observe any benefit from an optimization because, e.g., they had environment variables that were a bad length. After all, it is actually a good thing if someone's program runs faster because their environment variables are a "good" length, even though this can be framed as measurement bias. Of course, optimizing memory layout is a hard problem; Figure 5 is a nice illustration of how there is no one-size-fits-all configuration across different machines, and the best configuration may depend on internal microprocessor details which, as the paper points out, may not be public. But I'm definitely curious about what has been tried in this space.

Discussion question: One solution the paper focuses on is using more diverse workloads and statistical methods to factor out bias. When trying to measure the performance of an optimization that is targeted towards certain types of workloads, e.g. memory-intensive ones, how should we approach striking a balance between workload diversity and measuring performance on the types of workloads that the optimization is specifically targeting?
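The illustration mentioned above, with made-up execution times: the same reported ratio can come either from O3 genuinely getting faster or from the O2 baseline being penalized by an unlucky layout.

```python
# Made-up execution times (seconds) under two memory layouts. The reported
# "O3 speedup over O2" is identical, but the absolute behavior is not.
layout_a = {"O2": 10.0, "O3": 8.0}   # O3 genuinely faster
layout_b = {"O2": 15.0, "O3": 12.0}  # whole baseline penalized by the layout

for name, t in [("layout A", layout_a), ("layout B", layout_b)]:
    print(f"{name}: speedup = {t['O2'] / t['O3']:.2f}")
# Both print 1.25, yet under layout B the "optimized" binary (12.0 s) is
# slower in absolute terms than the unoptimized one under layout A (10.0 s).
```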
-
Critique: I found this paper to be very instructive and applicable to my own research involving systems. I find it very helpful that the authors did a walkthrough with an example of using two techniques to mitigate measurement bias and establish confidence in conclusions.

While reading the paper I thought that not all systems are subject to measurement bias, and that it depends on the particular metric being evaluated. However, upon further thought, I think I’m wrong. To elaborate, I thought that measurement bias is an issue in the O3 optimization speedup example because they’ve chosen execution-time speedup as their evaluated metric. This metric condenses and aggregates a lot of factors involving the experimental setup to produce one number (the speedup), which is why measurement bias is a serious concern in this example. However, I thought that if we’re using a simulator that simulates only a single simple subsystem (e.g. a cache) and evaluating a discrete, simple metric (e.g. hit rate) to evaluate a caching mechanism, then we don’t need to worry about measurement bias, as the tightly controlled, transparent, simple nature of the experimental setup doesn’t allow for emergent, unexplainable behavior. However, even in this controlled, simple setting we can have measurement bias, since the cache’s behavior depends on factors such as the cache size, cache line size, replacement policy, etc. (a toy illustration is at the end of this comment). Furthermore, we still need diverse benchmarks to stress-test, as some workloads may favor the caching mechanism. To conclude, measurement bias lurks everywhere, even in simple systems.

Discussion Question: Could there be an evaluation of a practical system that is somehow not subject to measurement bias, due to its nature or the proposition it is evaluating? For example, in the limit, a proposition simple enough, such as “the cache outputs data if it is present in cache”, would not have any measurement bias; one workload would be just as good as n workloads, and varying experimental setups would not change the results, so evaluating one workload with one experimental setup would be enough. Evaluating more would increase confidence in the proposition, but there is no measurement bias.
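The toy illustration mentioned above: a deliberately tiny direct-mapped cache model with made-up parameters and a made-up access trace. Even the "simple, discrete" metric of hit rate swings with a configuration knob (here, line size) that the experimenter might not think to vary.

```python
# Toy direct-mapped cache: the "simple" metric of hit rate still depends
# heavily on configuration knobs, even with an identical access trace.
def hit_rate(addresses, cache_bytes=1024, line_bytes=64):
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # one tag per direct-mapped line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes
        idx = block % num_lines
        if tags[idx] == block:
            hits += 1
        else:
            tags[idx] = block          # miss: fill the line
    return hits / len(addresses)

trace = [i * 48 for i in range(2000)]  # made-up strided access pattern

for line in (32, 64, 128):
    print(f"line size {line:>3} B -> hit rate {hit_rate(trace, line_bytes=line):.2f}")
# Prints roughly 0.00, 0.25, and 0.62 -- same workload, same mechanism,
# very different conclusions depending on a single setup parameter.
```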
-
I'm interested in the reasons behind some of the performance fluctuations reported in the paper. In section 4.2.2, the authors attribute a possible cause of measurement bias from UNIX environment size to an event called
I was wondering about the specifics of these "conservative heuristics." I was under the impression that most architectures will speculatively try to reorder loads before stores, since loads are often more critical to performance. When the addresses are resolved later in the pipeline, the hardware checks whether there was a mis-speculation, meaning that a load that should have blocked on a store was incorrectly ordered before that store. If it finds a mis-speculation, it replays the instructions. Here, I thought that detecting load-store overlaps literally just meant checking the addresses -- I'm not sure what the authors mean by conservative heuristics, or why alignment would have an effect on them. From my understanding, a heuristic shouldn't be necessary to begin with? I've also heard of architectures that will try to predict which loads depend on which stores, similar to how a branch predictor predicts branch correlations/outcomes. Still, I don't think this is what the authors mean. I'm interested if other people have insights or thoughts about this.

Now a bit bigger picture: I think the paper presents a dilemma that doesn't really have a clear solution. One challenge, which other people have also mentioned, is knowing which parts of an experimental setup might cause measurement bias. The paper focuses on two, but I'm sure there are numerous others that I can't think of off the top of my head. This is partially obscured by implementation details that hardware vendors don't disclose, which highlights a tension between the ISA and the lower-level implementation details: on the one hand, we like the ISA as the lowest level of programming abstraction for reasons like code portability and security. On the other hand, we might want to peek below the ISA abstraction for performance boosts -- or, in the case of this paper, to figure out causes of measurement bias. It seems to me like finding all these sources of measurement bias is at odds with the programming abstractions we're used to, and I'm not sure what the solution is for identifying these sources and the precise reasoning behind them.

Discussion question: The specific causes of measurement bias discussed in the paper seem specific to CPUs (as opposed to more application-specific architectures), since they deal with code/data layout. If we care about benchmarking code on something even a little more specialized (like a GPU, for example), what kinds of sources of measurement bias might arise? How can we be sure that we've exhausted all sources of measurement bias in a given experiment?
-
Critique: The paper really emphasizes benchmark diversity as a cure for measurement bias, but they don't really back up that claim. I think it makes intuitive sense in a Law of Large Numbers sort of way -- if the "diverse" benchmarks are independently distributed with respect to the measurement bias, then I think you would expect them to give a normal distribution around the "true" value (a toy simulation of this intuition is at the end of this comment). But I think there are two issues with that claim.

The first is that they run exactly this experiment, they don't get a normal distribution, and they conclude that SPECINT just isn't diverse enough of a benchmark. They don't try any other sets of benchmarks, or even discuss what about SPECINT makes it not diverse enough or how it could be improved. I don't think they ever provide evidence that there is such a thing as a benchmark suite that is diverse enough to average out measurement biases.

The second issue is that they presuppose a ground-truth value. In the case of O3 vs. O2 (and in many software evaluation contexts) I don't think there actually is one. The promise of O3 optimizations is not "this will make all C programs X% faster", it is "this will make some C programs between X% and Y% faster, depending on their contents". So I think the idea that you could develop a set of benchmarks so "diverse" that they would have an average speedup clustered around the "true" speedup of O3 is just malformed. There is no "true" O3 speedup.

Question: The milc benchmark seems to do some numerical analysis type stuff related to quantum field theory:
It is also pretty reliably un-measurement-biased in the paper's various experiments. Why do you think it is so unbiased?
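The toy simulation mentioned above, with entirely made-up numbers: it just formalizes the independence assumption behind the Law-of-Large-Numbers intuition and shows where it breaks down.

```python
# Toy simulation: per-benchmark bias that is independent across benchmarks
# shrinks when averaged over a suite; bias that is shared does not.
import random
import statistics

random.seed(0)
TRUE_SPEEDUP = 1.10
N_BENCH = 20

def suite_mean(shared_fraction):
    # Each "experimental setup" draws one shared bias term plus an
    # independent per-benchmark term; shared_fraction mixes the two.
    shared = random.gauss(0, 0.05)
    measured = [
        TRUE_SPEEDUP
        + shared_fraction * shared
        + (1 - shared_fraction) * random.gauss(0, 0.05)
        for _ in range(N_BENCH)
    ]
    return statistics.mean(measured)

for label, frac in [("independent bias", 0.0), ("shared bias", 1.0)]:
    means = [suite_mean(frac) for _ in range(1000)]
    print(f"{label}: spread of suite means = {statistics.stdev(means):.4f}")
# The independent case clusters tightly around 1.10; the shared case is as
# noisy as a single benchmark -- "diversity" only helps if the biases really
# are independent across benchmarks.
```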
-
Critique: The paper did well showing how measurement bias is extremely prevalent, but I find it hard to see how it could ever be removed in a sustainable manner. A future in which every experiment requires 400+ experimental setups seems like it would need some sort of computing cluster to help, and I don't like how that would raise the barrier to entry for doing "more correct" experiments. I think it might be worth exploring what a tolerable level of measurement bias would be, so that we don't need so many setups. Also, I wonder how valid some of this paper's counterarguments are. For example, in section 6.2 they address the argument that large speedups can, in a sense, overpower measurement bias, but the paper counters with data from section 4 showing how some instances of measurement bias can be very large too. This seems a bit of an overgeneralization to me, especially if a given large speedup is also theoretically sound. But I suppose it's not a bad idea to consider the possibility of bias despite huge results.

Questions: This paper mentioned some solutions like more cooperation from hardware vendors. Just as RISC-V has been nice for open-source hardware, I wonder if a completely open-source CPU could become a good standard for benchmarking in the future? Also, I wonder if the prevalence of measurement bias should affect the optimizations we choose to make. Should we be optimizing compilers to work with modern hardware, or should we be doing optimizations that work on theoretically ideal hardware (infinite cache lines, for example) and let hardware designers optimize to fit the compilers? I'm thinking that, in the case of the paper, an experiment done on a simulation of ideal hardware could reduce issues caused by memory alignment, and then there would be fewer factors to consider, like linking order.
-
Critique. I would have appreciated a causal analysis of the original question posed: to what degree do O3 optimizations improve performance? A proper causal analysis of this question would intervene minimally on the exact O3 optimizations applied. In particular, I was expecting an attempt to isolate (if possible) the optimization pass(es) responsible for performance wins and losses in benchmarks, with perhaps some discussion of how to experimentally reach causal conclusions while working around the compiler's phase-ordering problem.

Question. What would be a good way to measure compiler performance wins while working around the phase-ordering problem? I was thinking that one could ablate various combinations of O3 optimization passes. Depending on how many passes there are, it might be feasible to enumerate all possible combinations of O3 optimization passes for the interesting benchmark suites (e.g. those that straddle the 1.0 line).
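A rough sketch of what such an ablation could look like. The benchmark file name is a placeholder, and the flag list is illustrative and GCC-version-dependent (it is not the paper's setup; `gcc -O3 -Q --help=optimizers` shows what a given compiler actually enables beyond -O2).

```python
# Time a benchmark at -O2 plus each subset of a few flags that -O3 adds.
# Flag list is illustrative; bench.c is a placeholder source file.
import itertools
import subprocess
import time

O3_EXTRA_FLAGS = [
    "-fgcse-after-reload",
    "-fipa-cp-clone",
    "-fpredictive-commoning",
    "-funswitch-loops",
]

def time_with(flags):
    subprocess.run(["gcc", "-O2", *flags, "-o", "bench", "bench.c"], check=True)
    start = time.perf_counter()
    subprocess.run(["./bench"], check=True)
    return time.perf_counter() - start

baseline = time_with([])
for r in range(1, len(O3_EXTRA_FLAGS) + 1):
    for combo in itertools.combinations(O3_EXTRA_FLAGS, r):
        speedup = baseline / time_with(list(combo))
        print(f"{speedup:5.2f}x  {' '.join(combo)}")
# The number of subsets grows as 2^n, so in practice one would ablate a
# handful of passes at a time -- and, per the paper, repeat each point
# across randomized setups to separate the pass from the memory layout.
```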
-
Critique: Questions:
-
Critique: As I was reading the paper, I was coming up with possible critiques which were covered by later sections; for example, their assertion that link order and UNIX environment size caused the variation in results (rather than being merely correlated with it) in sections 4.1-4.2 gave me pause, but this was addressed more explicitly later in section 7.2. So I would say they did a good job anticipating possible critiques and covering their bases. I also noticed that in sections 4.1-4.2 they focused largely on results from the perlbench test, which we can see on their violin plots had the largest variation and thus is the most in line with their argument. This made me think that perhaps the specific issue they're observing (benchmark variation with varied link order and UNIX environment size) only holds strongly for some benchmarks, which could simply be avoided. However, they only examined these two underlying causes; the other benchmarks, which looked to perform with much less variation on the violin plots, may have great amounts of variation for other underlying causes, which may not be easily detectable without the benchmark evaluation method described in 7.1.1 (demonstrated in Figure 6).

Discussion Questions: The paper acknowledges the difficulty of implementing the proposed techniques for managing measurement bias, especially the demands of experimental setup randomization; they present some possible strategies in section 8 to reduce the presence of measurement bias in a more manageable way. Which of these suggestions do you think is the most feasible and most helpful in reducing the problem? Can you think of any other ways that they missed to scale down their findings into more feasible strategies?
-
Critique: Overall, I think this paper does well to point readers in a direction where something completely unobvious can be the reason for easily misinterpreted results. However, the authors also state that diverse benchmarks aren’t always the solution, even though that seems to be how measurement bias is mitigated in several other fields. I agree that the notion of measurement bias is good to point out, but I don’t think this is “too” realistic for research purposes. The paper uses a small code snippet to highlight the issue of the size of environment variables, but in the real world there are so many more factors, not to mention the size of codebases (1000x larger samples), that magnify the effects of measurement bias. I was also surprised that there was no significant trend when using a diverse set of benchmarks: as shown in Figure 2 in section 4, a wide variety of distributions appear across different benchmarks. This makes me question how feasible it really is to find a truly unbiased result, especially today, when there are hundreds of benchmarks (and some research uses just the top benchmarks to show performance improvements).

Question: When should one start considering the “diversity” of benchmarks rather than what is commonly used in an area of research? For example, in the field of computer vision / machine learning, people often use different GPUs (or single/multiple GPUs) depending on the precisions of the models used. Thus, how important is it really to consider using a diverse set of benchmarks when researchers often tend to pick the one that shows the most performance gains?
-
Critique: To that effect, I thought that the problem presentation and statement were well done. Where I felt the paper was more vacuous was in the later sections (particularly 8), and more specifically in its discussion of techniques to better analyze the potential variables at play in experimentation. In 8.3, the paper discusses having more manufacturers release details on the hardware they use – "Sun Microsystems has already taken the lead by releasing all details of one of their microprocessors (the OpenSparc project); we hope other manufacturers will follow." I question whether it is systematic and conclusive enough to rely on data that may or may not be supplied when developing a solution.

Question:
-
Critique: The overall discussion on UNIX environment size was certainly illuminating and drove home their point on the prevalence of measurement bias. I'm most interested in Figure 1(b): in particular, I wish they had said more about the data point at the top! Stack alignment was a good explanation for the 33% speedups, but surely something more is going on for the 300% one?

Discussion Question: The methods the paper proposes for avoiding measurement bias require knowing the sources of the bias:
This prior knowledge of bias seems like a luxury most papers don't have:
I doubt negligence led these authors to fail to address measurement bias; more likely, they couldn't think of where it could come from. As such, I wonder whether there are techniques to detect or decrease the impact of measurement bias without knowledge of its source?
-
Critique: I do not think some of the methods the author proposed (coming up with a really diverse test suite and causal analysis) are quite feasible to implement in daily practice. Experimental randomization may also become impractical if our systems get too complicated.

Question:
-
Critique: First, I found it incredibly interesting that, in their experiment on O3's effect, the results appear to depend more on link order and environment size than on the optimization itself (section 4). To me, this seems to indicate that the speedup from compiler optimizations can be outweighed by the hardware preferences mentioned in 4.3. It seems to me that if the compiler has knowledge of the hardware then it should make the decisions that allow for optimal performance, and yet 4.3 outlines that, at least for the Intel compiler, this is not the case. Given the seemingly outsize effect that memory layout had on performance, I'm curious whether there has been other work in this area (and also why the Intel compiler appears not to exploit its knowledge of the hardware for performance).

Discussion Question: In section 7.1, the authors propose using a "diverse" set of benchmarks as one possible solution for measurement bias in experiments. However, the authors don't go into much detail about what makes a set diverse and why the set they used is not diverse (they point out that their set isn't diverse, but I'm curious as to what makes it that way). So my question is: what are the requirements for a diverse set of benchmarks, and how would one go about creating one?
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!