Replies: 5 comments 1 reply

---

This looks super interesting and I like the idea of nanopasses. Could you explain the benchmark in more detail? How does it search the nodes? What "real world" use case does it model? I hope to have more time tomorrow to look into this in detail.

---

Is there any prior art on nanopasses combined with query architectures? I could see it working, but it could cause a lot of nested calling if we have a query that takes 5-10 passes to resolve.

As for flat trees, this seems quite similar to data-oriented programming, in that it exploits cache locality. The question then becomes whether the passes have a consistent order in which they traverse the trees. The library appears to assume pre-order, which may not be true for every pass.

In my experience, it's in the interest of most compilers to move to a block-based SSA/three-address-code intermediate representation, simply because you remove the arbitrary nesting that you can get with trees. It also makes optimizations like dead code elimination and constant folding a lot easier. Perhaps this is a way to gain performance as well, since block-based IRs probably involve less pointer chasing (see the sketch below).
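As a minimal sketch of what such a block-based, three-address-style IR could look like (all names here are illustrative, not an actual Rome/rowan API):

```rust
// Illustrative block-based, three-address-style IR. In SSA fashion each
// instruction defines at most one value; a value is just the index of the
// instruction that produced it, so nothing is nested and passes like
// constant folding become linear scans over `instrs` instead of tree walks.
type Value = usize;

enum Instr {
    Const(i64),        // define a constant value
    Add(Value, Value), // sum of two previously defined values
    Return(Value),     // block terminator: return a value
}

struct Block {
    instrs: Vec<Instr>,
}

struct Function {
    blocks: Vec<Block>,
}
```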

---

As I said before, I'm excited that we start exploring new representations for the AST. Another interesting read is about the Zig compiler's rework of its memory layout, which seems related.

What's unclear to me right now is the scope of what you're proposing. Are you proposing to use this new AST for linters only, only for additional passes right after the parser (scope analysis, rewriting the tree), or to replace rowan completely? Is this a structure we could use for transforms as well that heavily mutate the tree?

Also, what are the advantages and disadvantages of this new approach? We'll probably lose caching. How important do we find caching? It would help if you could extend your explanation to cover the use cases and advantages/disadvantages of the current and new approach.

Having a way to perform a cheap second pass after parsing would be useful to implement many of JavaScript's semantic checks or to rewrite the AST, for example, rewriting the left-hand side of an assignment to an … That raises the question of how performant tree modifications are, or whether this will require re-creating the whole tree every time we change a single node.

Only giving pre-order traversal isn't sufficient for the formatter. The formatter oftentimes needs to look at the predecessor token to properly format comments or to determine the number of lines to insert between two statements. And this happens in the critical path: improving the complexity of retrieving the predecessor token matters.

Have you also looked into alternatives? For example, I believe our rowan implementation is more complicated than it must be, and it often allocates …

---

Doing some quick research on cache hit/miss for trivia:

- Trivia: 22,849 of 29,574 = 77%
- Trivia: 335,245 of 505,603 = 66%

---

Expanding on this research, I tried to create a pipeline of stages. The first obvious implementation is:

```rust
trait PipelineStage {
    fn handle(&mut self, node: &NodeOrToken<SyntaxNode, SyntaxToken>);
}

struct DynPipeline {
    stages: Vec<Box<dyn PipelineStage>>,
}
```

If we run this on our 200/5k/jquery test suite with just one pass counting how many functions there are in each file, we get:
On a "normal" file it takes less than 1ms to iterate through all nodes and check its kind against If we increase to 10 passes we see an increase in the time spent. I honestly thought the increase would be linear, but it wasn't. I would not be surprised in a more complex scenario, with different passes, that it will approach linear increase.
In the "normal" file we are already spending +1ms just navigating the tree. Not ideal. There is one "easy" improvement we can do. We can exchange the dynamic dispatch for static dispatch. Rust does not have variadics generics like C++ (https://en.cppreference.com/w/cpp/language/parameter_pack). So I will need to cheat. struct Pipeline<TStage>
where
TStage: PipelineStage,
{
stage: TStage,
}
impl<TStage> Pipeline<TStage>
where
TStage: PipelineStage,
{
pub fn run(&mut self, result: Parse<JsAnyRoot>) {
let root = result.syntax();
let all: Vec<_> = root.descendants_with_tokens().collect();
for t in all.iter() {
self.stage.handle(t);
}
}
} This allows me to have zero dynamic disptach. let mut pipeline: Pipeline<CountFunctionsStage> = Pipeline::new();
pipeline.run(result); With the problem of being limited with just one stage. I can break this limitation doing: #[derive(Default)]
struct LinkedListStages<TCurrent, TNext>
where
TCurrent: PipelineStage,
TNext: PipelineStage,
{
current: TCurrent,
next: TNext,
}
impl<TCurrent, TNext> PipelineStage for LinkedListStages<TCurrent, TNext>
where
TCurrent: PipelineStage,
TNext: PipelineStage,
{
fn handle(&mut self, node: &NodeOrToken<SyntaxNode, SyntaxToken>) {
self.current.handle(node);
self.next.handle(node);
}
} I can sugar the creation of the pipeline with type Stages2<A, B> = LinkedListStages<A, B>;
type Stages3<A, B, C> = Stages2<A, Stages2<B, C>>;
type Stages4<A, B, C, D> = Stages3<A, B, Stages2<C, D>>;
type Stages5<A, B, C, D, E> = Stages4<A, B, C, Stages2<D, E>>; or with macros. This allows me to do: let mut pipeline: Pipeline<
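For example, a declarative macro could generate the nested type directly. This is an untested sketch, not code from the POC:

```rust
// Hypothetical: expand `stages!(A, B, C)` into
// `LinkedListStages<A, LinkedListStages<B, C>>` by recursing on the list.
macro_rules! stages {
    ($last:ty) => { $last };
    ($head:ty, $($tail:ty),+ $(,)?) => {
        LinkedListStages<$head, stages!($($tail),+)>
    };
}

// Usage: type Stages3<A, B, C> = stages!(A, B, C);
```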
Either way, this allows me to do:

```rust
let mut pipeline: Pipeline<
    Stages10<
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
        CountFunctionsStage,
    >,
> = Pipeline::new();
pipeline.run(result);
```

The advantage is that, with everything static, the optimizer shows its value and we get:
This puts us under 1ms again for the "normal" file, or five times faster than using dynamic dispatch. I still consider this approach of navigating all nodes searching for the ones you want the "worst case scenario". We must support it, and this approach seems to be the best we can do at the moment.

---

We will soon start doing semantic analysis and using our CST/AST for more complex processing. This raises the question of how we are going to architect this. One possibility is using nanopasses:

- Generics in Small Doses: Nanopass Compilation with Haskell
  https://offog.org/publications/fita200811-generics.pdf
- LLVM for Grad Students
  https://www.cs.cornell.edu/~asampson/blog/llvm.html
- and others...
To assess the feasibility of using nanopasses I did a quick check: how long does it take to traverse all nodes of a small file (200 lines), a big file (5000 lines), and huge files (jquery)? I also implemented a new way to build and traverse our nodes for comparison, called "flat tree" below.
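The timing harness for such a check could be as simple as this (a sketch, not the actual benchmark code from the POC):

```rust
use std::time::Instant;

// Time `passes` full pre-order traversals over a parsed tree and report
// how many elements were visited in total.
fn bench_traversal(root: &SyntaxNode, passes: usize) {
    let start = Instant::now();
    let mut visited = 0usize;
    for _ in 0..passes {
        visited += root.descendants_with_tokens().count();
    }
    println!("visited {} elements in {:?}", visited, start.elapsed());
}
```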
For more details on the flat tree: https://github.com/mamcx/tree-flat
Traversing is much faster because it is pretty much iterating a `Vec`. At 100 nanopasses, both solutions are pretty similar if we cache all nodes. This is very interesting because we have a nice solution already at our door. But...
Using "flat trees" allows me to more easily choose "columnar" modelling and have a
Vec<JsSyntaxKind>
, which means we can find certain kinds using SIMD... and that makes all the difference.Nothing changed when parsing.
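To make the "columnar" idea concrete, here is a minimal sketch (illustrative only; tree-flat's real API differs): kinds live in their own contiguous array, so finding a kind is a tight linear scan that the compiler can auto-vectorize.

```rust
// Columnar layout: one contiguous array per field, stored in pre-order.
// `kinds[i]` is the kind of node i; `parents[i]` is its parent's index.
struct FlatTree {
    kinds: Vec<u16>, // stand-in for JsSyntaxKind's raw representation
    parents: Vec<u32>,
}

impl FlatTree {
    // Find every node of a given kind: a branch-light loop over one
    // array, which the optimizer can turn into SIMD compares.
    fn find_kind(&self, target: u16) -> Vec<u32> {
        self.kinds
            .iter()
            .enumerate()
            .filter(|&(_, &k)| k == target)
            .map(|(i, _)| i as u32)
            .collect()
    }
}
```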
The table above shows that we can find all nodes of a certain kind a hundred times and still be two or three times faster than doing it only once with our current code. This is almost 300x faster.

In the most complex case, jquery, we are traversing and comparing all nodes a hundred times, and we are still faster than just parsing today. This means it seems possible to have a pipeline of nanopasses with negligible costs for traversing our parsed code.

It is also possible that we can improve parsing even more by moving from an "Events + TreeSink" model to one building the flat tree directly.
To play with this code: https://github.com/rome/tools/tree/poc/flat-trees

My suggestion is that we continue this POC. One open question is `rome_rowan`: do we still need it? Opinions?