diff --git a/content/blog/2023-12-11-hlc/HlcArch.svg b/content/blog/2023-12-11-hlc/HlcArch.svg new file mode 100644 index 000000000..b26f40852 --- /dev/null +++ b/content/blog/2023-12-11-hlc/HlcArch.svg @@ -0,0 +1,4 @@ + + + +
High-Level AST
High-Level AST
HIR CFG
HIR CFG
Caiman Assembly
Caiman Assembly
Analyses
Analyses
Flatten, Basic Blocks, Typed Operators, References
Flatten, Basic Blocks, Typed O...
CPS, Explicit Types
CPS, Explicit Types
Text is not SVG - cannot display
\ No newline at end of file diff --git a/content/blog/2023-12-11-hlc/chart.svg b/content/blog/2023-12-11-hlc/chart.svg new file mode 100644 index 000000000..d9ebb3494 --- /dev/null +++ b/content/blog/2023-12-11-hlc/chart.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/blog/2023-12-11-hlc/hlc_1.svg b/content/blog/2023-12-11-hlc/hlc_1.svg new file mode 100644 index 000000000..f1549825c --- /dev/null +++ b/content/blog/2023-12-11-hlc/hlc_1.svg @@ -0,0 +1,97 @@ + + +%0 + + + +foo + +foo + + + +foo4 + +foo4 + + + +foo->foo4 + + + + + +foo5 + +foo5 + + + +foo->foo5 + + + + + +foo2 + +foo2 + + + +foo4->foo2 + + + + + +foo6 + +foo6 + + + +foo2->foo6 + + + + + +foo7 + +foo7 + + + +foo2->foo7 + + + + + +foo3 + +foo3 + + + +foo6->foo3 + + + + + +foo7->foo3 + + + + + +foo5->foo2 + + + + + \ No newline at end of file diff --git a/content/blog/2023-12-11-hlc/index.md b/content/blog/2023-12-11-hlc/index.md new file mode 100644 index 000000000..f465e69de --- /dev/null +++ b/content/blog/2023-12-11-hlc/index.md @@ -0,0 +1,538 @@ ++++ +title = "Heterogeneous Programming with the Caiman Front End" +[[extra.authors]] +name = "Stephen Verderame" +link = "https://stephenverderame.github.io/" +[extra] +bio = """Stephen is a senior/MEng student interested in compilers, +programming languages and architecture. +""" ++++ + +# Background + +Caiman is an IR designed for +[heterogeneous programming](https://en.wikipedia.org/wiki/Heterogeneous_computing), +with the current target being CPU/GPU computing. +The goal of Caiman is to more easily explore the +immense design space developers have access to once they introduce multiple +devices to run code on. + +As an example, consider the following function: + +```Rust +fn select_sum(a: Buffer, b: Buffer, c: Buffer) -> f32 { + let val = sum(a); + if val > 0.0 { + sum(b) + } else { + sum(c) + } +} +``` + +If we introduce the GPU as a device we can execute code on, then another way +to write the same thing would be: + +```Rust +fn select_sum1(a: Buffer, b: Buffer, c: Buffer) -> f32 { + let val = gpu_sum(a); + if val > 0.0 { + gpu_sum(b) + } else { + gpu_sum(c) + } +} +``` + +But a GPU can potentially execute multiple sums in parallel, so instead of +forcing a synchronization with a dependency between the first `sum` and +the other two, we can write this equivalently as: + +```Rust +fn select_sum2(a: Buffer, b: Buffer, c: Buffer) -> f32 { + let cond = gpu_sum(a); + let b = gpu_sum(b); + let c = gpu_sum(c); + if val > 0.0 { + b + } else { + c + } +} +``` + +This isn't too bad, but you can imagine that a more complicated function, `foo`, might +want to call `select_sum` multiple times. For each call to `select_sum`, +the developer of `foo` can choose which `select_sum` implementation to use. +But then each permutation of `select_sum` implementations used in `foo` +characterizes a different `foo` implementation. Now suppose a developer +of `bar` calls `foo` multiple times. They already have so many choices +for invoking `foo` once, and every permutation of these choices +for the number of calls `bar` makes can characterize +a different implementation of `bar`! This quickly becomes a massive +combinatorial explosion. + +So what Caiman provides is a way to abstract where data is stored, +what device computations are invoked on, and when synchronizations are made +between devices. This allows a user to write specifications that are agnostic to +these details and not subject to the combinatorial explosion. + +For example, the value specification describes what computations are performed, +and doesn't care where or when they are performed. In the value language, +`select_sum` can be specified with something like this: + +``` +val select_sum(a: array, b: array, c: array) -> f32 { + sum_b = sum(b) + sum_c = sum(c) + returns sum_b if sum(a) > 0.0 else sum_c +} +``` + +`sum` can be either a CPU or GPU sum and it can be executed in parallel or +sequentially. All the different versions we saw earlier are captured in this +one specification. + +A concrete `select_sum` implementation (aka schedule) would implement this +spec, along with a `spatial` spec that specifies how and where data is allocated +and freed and a `timeline` spec that specifies the synchronizations between devices. + +Explication (program synthesis) will be used to be able to derive much of +the implementation from the specs so that a user can write a few different specs +and Caiman can generate a plethora of implementations that adhere to them. + +# Introduction + +For my project, I've been developing a front-end language for Caiman. +The front-end lowers the high-level Caiman programs into Caiman Assembly, +our mid-level IR. Like LLVM, the assembly does have a textual format that +can be hand-written. However, it's absolutely ungodly (a simple program to return +the smaller of two constants is nearly 70 lines of code). The main cause +for this is that the assembly is in +[continuation-passing style](https://matt.might.net/articles/by-example-continuation-passing-style/) +(CPS), which makes executing commands on different devices easier. + +CPS can best be described as a way of writing programs where control flow +is made explicit through function calls and all function calls never return. +Instead of a function returning to its caller, a caller will pass a +*continuation* to the callee that the callee will invoke when it's done. +The continuation can be thought of as "the rest of the program". In this style, +all functions effectively become single basic blocks, which we call *funclets*. + +As an example, consider the following program: + +```Rust +fn min(a: i32, b: i32) -> i32 { + if a < b { + return a; + } else { + return b; + } +} +``` + +In CPS, this would look something like the following: + +```Rust +fn min(a: i32, b: i32, continuation: Box i32>) -> i32 { + continuation(if a < b { a } else { b }) +} +``` + +Another point of complexity is the machinery in the type system that is +used to match pieces of the implementation with pieces of the specifications. +There is also a part of a variable's type that is used for linear resource management. +Because of this, Caiman types have a base type (such as `i32`), and then a type +relative to each of the three specification languages. +Except for the base type, the type of something can also change +throughout a program or basic block. Each specification language tracks +a particular dimension of the implementation via +[linear types](https://wiki.c2.com/?LinearTypes). Linear +types apply to resources that must be used up in their entirety. For example, +in a linear typing system, a function such as `foo(a, b)` must use `a` exactly +once and it must use `b` exactly once. + +So the goal for this project was specifically to develop the part of the +front-end that would lower the imperative, human-writeable source language +into CPS. Furthermore, the user should only have to specify types in situations +that we can't infer them. + +Because the specification languages in the source don't have control +flow, lowering them was pretty standard. Therefore, we'll focus on lowering +the scheduling language (implementations) for this post. + +# Design + +The way I achieved this was by introducing a new high-level IR (HIR) that would +be the target of the front-end AST. + +![Architecture](HlcArch.svg) + +The HIR I designed is somewhat similar to BRIL, however, its canonical +representation is a CFG. The HIR has typical arithmetic and comparison operations, +loads, stores, conditional branches, variable declarations and function calls. +Unlike the AST, it's flattened (no nested expressions or statements) +and divided up into basic blocks, but unlike Caiman Assembly +it has traditional (non-CPS) control flow, and the inputs and outputs of +funclets aren't explicit. One thing to note is that HIR function calls are considered +basic block terminators, since in CPS a function call must be the last +statement in a function. Furthermore, CFGs are canonicalized so that they have +a single entry and a single exit block. + +Once the AST is in CFG form, we perform a few local transformations as "preprocessing". +Namely, we convert all operators to their typed forms +(so `add` becomes `_add_i64_i64` for adding two `i64`s) and convert all +mutations of variables to loads and stores of a reference. + +One could imagine employing the progressive lowering ideas of CompCert and MLIR +to separate the HIR into two HIRs, one before these transformations and +a second representation afterward. I didn't go this route because it would +involve decent code duplication for the definitions of the two IRs. + +# Implementation + +As an example, consider the following (toy) example in the high-level front-end: + +``` +// value specification +val main() -> i64 { + b :- false + c :- true + d :- false + one :- 1 + two :- 2 + r :- c if b else d + z :- two if r else one + returns z +} + +// implementation +fn foo() -> i64 @ node(main.z)-usable + impls main, time, space + // adheres to three specifications: + // * main (value) + // * time (timeline) (not shown) + // * space (spatial) (not shown) +{ + let b: bool @ node(main.b) = false; + var r: bool @ none(main); + if @ node(main.r) b { + let c: bool @ node(main.c) = true; + r = c; + } else { + let d: bool @ node(main.d) = false; + r = d; + } + @in { r: [node(main.r)-usable, none(space)-save] }; + + var v: i64 @ none(main); + if @ node(main.z) r { + let two: i64 @ node(main.two) = 2; + v = two; + } else { + let one: i64 @ node(main.one) = 1; + v = one; + } + @in { v: [node(main.z)-usable, none(space)-save] }; + return v; +} +``` + +It's still not pretty on account of the current need to match pieces of +the implementation back to the spec almost everywhere (denoted with the `@`) +and keep expressions unnested. However, the schedule in this example has the following +structure: + + + + +### CFG Generation + +During CFG generation, one piece of information we need to keep around is what +the continuation of each basic block is. In the above example, the +continuation of `foo` is `foo2` and the continuation of `foo2` is `foo3`. +`foo4`, `foo5`, `foo6`, and `foo7` are at the maximum depth in this example, so +when they're done, control is going to pass back to their parent's continuation. + +To give this information to each basic block, CFGs are generated +breadth-first instead of depth-first +(as indicated by the naming of the basic blocks in the diagram). +That way, the continuation of any block is going to be either the block +with the next ID or its parent's continuation. The "parent continuation" of the +root level blocks is just `None`. + +Within each basic block of the CFG, we also perform a few local transformations. +In particular, we use source-level type information to convert all operators +to their typed versions. In Caiman, all basic operators must be implemented as +external function calls on a particular target device. The default behavior +is for operators to invoke functions on the local (aka host) device which will +be defined in a standard library. The local/host device is the device that is +running the Caiman program itself. The type information attached to the +operators is used to determine the function to call. For example `true == false` +will call `_eq_bool_bool(true, false)` while `1 == 2` will call +`_eq_i64_i64(1, 2)`. + +We also need to convert the uses of variables to loads and stores +of references. This means we might need to introduce load instructions that turn +something like `return v` into `_v0 = load v; return _v0`. Right now, this +dereference transformation is applied locally, but in the future, this might become +a global transformation to reduce the amount of loads being inserted. + +### Funclet Arguments and Returns + +Once lowered into HIR, the next step is identifying the arguments and return +values of each funclet. The way I chose to do this is via a live variables analysis. +Any variable that is live into a basic block must necessarily be a funclet +argument for that block. Consider this part of the example shown earlier: + +``` +// foo +var r: bool @ none(main); +if @ node(main.r) b { + // foo4 + let c: bool @ node(main.c) = true; + r = c; +} else { + // foo5 + let d: bool @ node(main.d) = false; + r = d; +} +``` + +`r` is a variable defined outside the blocks of `foo4` and `foo5`. Since `r` +is live into `foo4` and `foo5`, it must be an argument to both. +Now, at compile-time, it's not necessarily known whether `foo4` or `foo5` will +be invoked. Therefore, the arguments of both blocks must be the same. In the +assembly, the terminator of the `foo` block looks something like this: + +``` +schedule-select %b [%foo4, %foo5] + [value node($val.%r), timeline none($time.%none), spatial none($space.%none)] + (%r_loc) %foo2; +``` + +where `(%r_loc)` is the argument to the selected block and `%foo2` is the +continuation. + +Therefore, the input arguments must be the set of all variables that are +live-out of all predecessor blocks. For the first block of a function, then +the inputs will be the same as the function inputs instead. + +What about the output values? It might seem logical to think that the output +values are simply the live-out set of a given basic block (as I initially did), +but consider the following pseudocode example: + +``` +// block1 +let b = // ... +let cond = // ... +var v; +if cond { + // block3 + v = b + 1; +} else { + // block4 + v = b * 2; +} +// block2 +// use v +``` + +`b` and `v` are live out of `block1`, however, only `v` is live into `block2`. +If we think about what's actually happening here in CPS, we might have the following +pseudocode + +``` +fn block1() { + %b = // ... + %v = // ... + select %b [%block3, %block4] (%b, %v) %block2 +} + +fn block3(%b, %v, %continuation) { + %v = %b + 1; + %continuation(%v); +} + +fn block4(%b, %v, %continuation) { + %v = %b * 2; + %continuation(%v); +} + +fn block2(%v) { + // use v +} +``` + +Here we can see that the actual return value of `%block1` is just +`%v`. More generally, a block's return values are the inputs to +its continuation. For the last block in a function, then the +outputs will be the outputs of the function instead. + +Luckily, we already know what the continuation of every block is because +we determined this information when we generated the basic blocks from the +source program. + +### Funclet Argument and Return Types + +Now that we determined what the funclet arguments and return values are, it +should be easy to determine what a funclet's input and output types are, right? +Well, not so much. Without diving into the details too much, the problem is that +while the datatype of a variable won't change, the part of the type +that is mapping a variable back to a specification and the part of the type +that is tracking a linear resource *can* change. You can think of +these parts of the type as being relative to the current funclet. + +The explicator (synthesizer) can determine all the types *within* a basic block. +However, it's not guaranteed to solve the input and output types across a +funclet boundary. In fact, nothing is. The general case is an undecidable problem. + +In the earlier schedule, we can see an example of how the types are changing: + +``` +var r: bool @ none(main)-dead; +if @ node(main.r) b { + let c: bool @ node(main.c) = true; + r = c; +} else { + let d: bool @ node(main.d) = false; + r = d; +} +@in { r: [node(main.r)-usable, none(space)-save] }; +``` + +`r` starts as being an empty reference that doesn't map to anything in a +specification (`none`) and can't be used (`dead`). During the assignments +to `r` in either branch, `r` takes on the value type of what was +assigned to it. +In the true branch, `r` maps to `main.c` in the value specification, +and in the false branch it maps to `main.d`. In both branches, `r` also becomes +`usable`. Finally, at the join point of the two +branches, `r` maps to `main.r`, which is a node in the value spec +defined as the output of the conditional. + +While the explicator is currently a WIP, for now, I decided to reduce what +needs to be specified by the user using a dataflow analysis. The above example +is actually the reduced level of specification. Annotating every type fully +would look like: + +``` +var r: bool @ [none(main)-dead, none(space)-save, none(time)-usable]; +if @ node(main.r) b { + let c: bool @ + [node(main.c)-usable, none(space)-usable, none(time)-usable] + = true; + r @ [node(main.c)-usable, none(space)-save, none(time)-usable] = c; +} else { + let d: bool @ + [node(main.d)-usable, none(space)-usable, none(time)-usable] + = false; + r @ [node(main.d)-usable, none(space)-save, none(time)-usable] = d; +} +@in { r: [node(main.r)-usable, none(space)-save, none(time)-usable] }; +``` + +We can cut down the required type annotations by knowing some information about +the types of different syntactic constructs. For example, uninitialized +variables are `none()-dead` in the value dimension, `none()-save` in the +spatial dimension and `none()-usable` in the timeline dimension. Constants +are `usable` in all three. Assignments to variables change the type in +the value dimension. + +These heuristics aren't perfect, and one example where they break down is +the meet point between the two branches of the `if`. In one branch, `r` +maps to `main.c` and in another, `r` maps to `main.d`. We need the user to +resolve the conflict with a type annotation. +In this example, `r` maps to `main.r` after the meet because +this corresponds to the following value specification: + +``` +val main() -> i64 { + // ... + r :- c if b else d +} +``` + +Future work will allow further simplification, and should require only +the following annotations: + +``` +var r; +if b { + r = true; +} else { + r = false; +} +@in { r: [node(main.r)-usable] }; +``` + +It should be noted that while a user is more than welcome to write every +implementation by hand, the goal is for them to only write bits and pieces of +it. A user can leave holes in the implementation to be filled in by the +explicator. The entire schedule itself could be one big hole. The inferences +made by the front-end are intended to aid the user in things they want +to specify themselves and aid the explicator by taking advantage of +high-level information that isn't available when explication occurs. + +### Challenges + +The most difficult conceptual task was identifying areas where my assumptions +were incorrect, especially as they related to the type system. For example, +the initial design of my parser did not include any way to specify type +annotations for funclet inputs and outputs because I thought there was enough +structure to deduce them in all cases. Since I'm building up the language as +I'm going, I'm sure there are cases I haven't considered simply +because I haven't implemented anything that could generate them. + +Another challenge is simply the nature of building a project on top of an +incomplete (and occasionally buggy) base. Some of my tests have revealed bugs +in the later stages of the Caiman compiler. Furthermore, since I haven't +gotten around to implementing front-end semantic analyses, whenever something +fails I always have to ask: +1. Is the test wrong? +2. Is the front-end wrong? +3. Is a later stage of Caiman wrong? + +None of these questions are particularly trivial to answer, which has led to +"fun" debugging escapades. + +## Evaluation + +Overall, I would say the front-end so far is a success. I roughly +completed what I planned to accomplish at this point which was +end-to-end compilation for the built-in operators, conditional branching, +internal functions, and AST flattening. I did not get to AST flattening +in the transformation from AST to HIR, but this is my next step and it should +be pretty straightforward. The way I am currently thinking of doing this +will also introduce the infrastructure necessary to reduce the number +of required type annotations to the level I described as the +further simplification enabled by future work. + +Every pre-existing assembly benchmark that doesn't involve user-supplied +external functions or device synchronizations has been implemented in the +front-end at a significant reduction in lines of code +(excluding whitespace, as measured with the `C` language settings of +[cloc](https://cloc.sourceforge.net/) 1.90). 5/9 of the assembly +benchmarks meet this description. These limitations are imposed simply because +I haven't implemented these features in the front-end yet. +The front-end versions of each benchmark were translated pretty literally, +with the assembly version being very close to what the front-end version lowers to. + +![LoC](chart.svg) + +Compilation and benchmark run time also barely changed. Actually, the front-end +compiler is *faster* than the assembly compiler, which is likely becuase they use +different parsers. I ran 5 trials on the `trivial` benchmark, which was +chosen because +it has the smallest difference in complexity between the front-end and +assembly implementations. The front-end compilation time was an average of +26.0ms with a standard deviation of 4.3ms and the assembly compilation +time was an average of 49.8ms with a standard deviation of 4.6ms. + +Overall, the improvements of the front-end over the assembly so far are +quite significant. There is still much work to be done, but the progress so far +has been promising. + +