Implement AMBIG_MULTIPLE for llvm, Rust, go, and the vmops dump #486

Merged: katef merged 22 commits into main from kate/more-multi on Aug 16, 2024

@katef (Owner) commented Aug 7, 2024

This PR:

  • Implements AMBIG_MULTIPLE for various languages (go, rust, llvm especially)
  • Adds (struct fsm_hooks).comment()
  • Strips all comments from generated code in the absence of (struct fsm_options).comments
  • Quietly fixes a couple of unremarkable bugs
  • Does a bit more refactoring around the output routines

Here's how the generated AMBIG_MULTIPLE code looks for the following example:

aria; ./build/bin/re -k str -zabupl $lang 'x' 'x?'

For Go:

package fsm_fsm

var ret0 []uint = []uint{1}
var ret1 []uint = []uint{0, 1}

func fsm_Match(data string) (bool, []uint) {
	var idx = ^uint(0)

	if idx++; idx >= uint(len(data)) {
		return true, ret0
	}

	if data[idx] != 'x' {
		return false, nil
	}

	if idx++; idx >= uint(len(data)) {
		return true, ret1
	}

	{
		return false, nil
	}

}

For Rust:

aria; ./build/bin/re -k str -zabupl rust 'x' 'x?'

pub fn fsm_main(input: &str) -> Option<&'static [u32]> {
    use Label::*;
    static RET0: [u32; 1] = [1];
    static RET1: [u32; 2] = [0, 1];

    let mut bytes = input.bytes();

    pub enum Label {
        Ls,
    }

    let l = Ls;

    loop {
        match l {
            Ls => { // e.g. ""
                let c = match bytes.next() {
                    None => return Some(&RET0) /* "x?" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                let _c = match bytes.next() {
                    None => return Some(&RET1) /* "x", "x?" */,
                    Some(_c) => _c,
                };
                return None
            }
        }
    }
}
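
A caller might consume the `Option<&'static [u32]>` result along these lines. This is only a minimal sketch: `fsm_main` here is a hand-written stand-in that returns the same results as the generated function above for 'x' and 'x?', and `PATTERNS` and `matched_patterns` are hypothetical caller-side names, not something libfsm emits. It assumes each endid is the index of the pattern on the `re` command line, which is what the `/* "x", "x?" */` comments in the generated code suggest.

```rust
/// Patterns in command-line order; the index is assumed to be the endid.
/// (Hypothetical table kept by the caller, not generated by libfsm.)
const PATTERNS: [&str; 2] = ["x", "x?"];

/// Hand-written stand-in with the same signature and results as the
/// generated `fsm_main` for the patterns 'x' and 'x?'.
pub fn fsm_main(input: &str) -> Option<&'static [u32]> {
    static RET0: [u32; 1] = [1];
    static RET1: [u32; 2] = [0, 1];
    match input.as_bytes() {
        [] => Some(&RET0[..]),     // "" is matched by "x?" only
        [b'x'] => Some(&RET1[..]), // "x" is matched by both patterns
        _ => None,
    }
}

/// Translate the matched endids back into the caller's pattern strings.
/// An empty vector means the input did not match at all.
pub fn matched_patterns(input: &str) -> Vec<&'static str> {
    match fsm_main(input) {
        Some(ids) => ids.iter().map(|&id| PATTERNS[id as usize]).collect(),
        None => Vec::new(),
    }
}
```

The meaning of each id stays with the caller: the generated code only reports which ids matched.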

And (my current favourite) for LLVM, with many thanks to @mcy for the guidance:

; generated
%rt = type { ptr, i64 }
@fsm.r0 = internal unnamed_addr constant [1 x i32] [i32 1] ; "x?"
@fsm.r1 = internal unnamed_addr constant [2 x i32] [i32 0, i32 1] ; "x", "x?"
@fsm.r = internal unnamed_addr constant [3 x %rt] [
	 %rt { ptr bitcast ([1 x i32]* @fsm.r0 to ptr), i64 1 },
	 %rt { ptr bitcast ([2 x i32]* @fsm.r1 to ptr), i64 2 },
	 %rt { ptr poison, i64 -1 } ; fail
	]
define dso_local %rt @fsm.main(ptr nocapture noundef readonly %s) local_unnamed_addr hot nosync nounwind norecurse willreturn #0 {
	%n = alloca i32
	store i32 0, ptr %n
	br label %l0
stop:
	%i = phi i64
	 [0, %ret0],
	 [1, %ret1],
	 [2, %fail]
	%p = getelementptr inbounds [3 x %rt], [3 x %rt]* @fsm.r, i64 0, i64 %i
	%ret = load %rt, ptr %p
	ret %rt %ret
fail:
	br label %stop
ret0:
	br label %stop
ret1:
	br label %stop
l0:
	; e.g. ""
	%n0 = load i32, ptr %n
	%p0 = getelementptr inbounds i8, ptr %s, i32 %n0
	%c0 = load i8, ptr %p0
	%r0 = icmp eq i8 %c0, 0 ; EOT
	br i1 %r0, label %t0, label %f0
f0:
	%n.new0 = add i32 1, %n0
	store i32 %n.new0, ptr %n
	br label %l1
t0:
	br label %ret0
l1:
	; e.g. ""
	%r1 = icmp ne i8 %c0, 120 ; 'x'
	br i1 %r1, label %fail, label %l2
l2:
	; e.g. "x"
	%n1 = load i32, ptr %n
	%p1 = getelementptr inbounds i8, ptr %s, i32 %n1
	%c1 = load i8, ptr %p1
	%r2 = icmp eq i8 %c1, 0 ; EOT
	br i1 %r2, label %t2, label %f2
f2:
	%n.new1 = add i32 1, %n1
	store i32 %n.new1, ptr %n
	br label %fail
t2:
	br label %ret1
}

And the vmops structures also now carry the endids for all ambig modes except for AMBIG_NONE:

aria; ./build/bin/re -k str -zabupl vmops_c x 'x?'
#include <stdint.h>

#ifndef fsm_LIBFSM_VMOPS_H
#include "fsm_vmops.h"
#endif /* fsm_LIBFSM_VMOPS_H */

struct fsm_ret fsm_Ret[] = {
	{ (const unsigned []) { 1 }, 1 },
	{ (const unsigned []) { 0, 1 }, 2 },
};
const size_t fsm_Ret_count = sizeof fsm_Ret / sizeof *fsm_Ret;

struct fsm_op fsm_Ops[] = {
	{fsm_opEOF, 0, fsm_actionRET, 1, 0},
	{fsm_opNE, 'x', fsm_actionRET, 0, 0},
	{fsm_opEOF, 0, fsm_actionRET, 1, 1},
	{fsm_opALWAYS, '\x00', fsm_actionRET, 0, 0},

katef added 21 commits July 27, 2024 17:09
Now the VM IR ops point to retlist entries, rather than carrying their own endid sets. This means we can use the same indexing to de-dup them for every VM-based format.
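
The de-duplication described here can be pictured roughly as follows. This is an illustrative sketch only: `RetList`, `intern`, `get` and `count` are hypothetical names and do not mirror libfsm's actual C data structures. It assumes each endid set arrives in a canonical (sorted) order, so that equal sets compare equal.

```rust
/// Hypothetical retlist: each distinct endid set is stored once, and VM ops
/// refer to it by index instead of carrying their own copy.
#[derive(Default)]
pub struct RetList {
    sets: Vec<Vec<u32>>,
}

impl RetList {
    /// Return the index of `ids`, appending it only if no equal set exists.
    /// `ids` is assumed to be sorted already.
    pub fn intern(&mut self, ids: &[u32]) -> usize {
        if let Some(i) = self.sets.iter().position(|s| s.as_slice() == ids) {
            return i;
        }
        self.sets.push(ids.to_vec());
        self.sets.len() - 1
    }

    /// Look up the endid set that an op's index refers to.
    pub fn get(&self, index: usize) -> &[u32] {
        &self.sets[index]
    }

    /// Number of distinct endid sets stored.
    pub fn count(&self) -> usize {
        self.sets.len()
    }
}
```

Because every output format sees the same indices, names such as `ret0` or `RET0` can be derived from the index alone.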
The generated code looks like this:
```go
package fsm_fsm

var ret0 []uint = []uint{1}
var ret1 []uint = []uint{2}
var ret2 []uint = []uint{1, 2}
var ret3 []uint = []uint{0, 1, 2}

func fsm_Match(data string) (bool, []uint) {
	var idx = ^uint(0)

	if idx++; idx >= uint(len(data)) {
		return true, ret0
	}

	if data[idx] == 'a' {
		goto l3
	}

	if data[idx] == 'b' {
		goto l2
	}

	if data[idx] != 'c' {
		return false, nil
	}

l0: // e.g. "c"
	if idx++; idx >= uint(len(data)) {
		return true, ret2
	}

	if data[idx] <= '`' {
		return false, nil
	}

	if data[idx] <= 'b' {
		goto l1
	}

	if data[idx] == 'c' {
		goto l0
	}

	{
		return false, nil
	}

l1: // e.g. "aa"
	if idx++; idx >= uint(len(data)) {
		return true, ret1
	}

	if data[idx] <= '`' {
		return false, nil
	}

	if data[idx] <= 'c' {
		goto l1
	}

	{
		return false, nil
	}

l2: // e.g. "b"
	if idx++; idx >= uint(len(data)) {
		return true, ret3
	}

	if data[idx] == 'a' {
		goto l1
	}

	if data[idx] == 'b' {
		goto l2
	}

	if data[idx] == 'c' {
		goto l1
	}

	{
		return false, nil
	}

l3: // e.g. "a"
	if idx++; idx >= uint(len(data)) {
		return true, ret1
	}

	if data[idx] == 'a' {
		goto l1
	}

	if data[idx] == 'b' {
		goto l2
	}

	if data[idx] == 'c' {
		goto l1
	}

	{
		return false, nil
	}

}
```
I've handled AMBIG_NONE here, but I haven't distinguished between the other ambig modes. Every mode other than AMBIG_NONE is presented as an array of ids, even when there is only a single element. That's because I don't see any reason to give these modes specialised APIs for the current use-cases of this generated code, which is supposed to be a direct representation of our VM opcodes.
It's a bit rough and isn't what I want to end up with, but I wanted to commit this as a waypoint.
This means the phi instruction now only carries an array index.

Many thanks to @mcy for advice and patient help here.
This allows callers to default the codegen for accepting states (in particular outputting the values for endids) independently of commenting caller-specific meanings for the IDs.
Originally I'd intended this as a demonstration of how various applications can handle ambiguities differently. But now we have library support for AMBIG_MULTIPLE, I think this is just confusing.
The idea here is just to trim down %rt to:
```
%rt = type { ptr, i64 }
```

where we clearly don't need a uint64_t's count of unique ids.

I've purposefully not done the same for single-id interfaces. I don't want to mix success/failure with an id *value*, because the values are opaque (i.e. the meaning of an id value is the responsibility of the caller). Whereas here for AMBIG_MULTIPLE I'm mixing success/failure with the count, not with the id values.

Suggested by @mcy, thank you.
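
To illustrate mixing success/failure with the count rather than with the id values, a caller-side view of `%rt` might look like the sketch below. The `Rt` struct, its field names and the `endids` helper are assumptions made for illustration. The only detail taken from the generated IR is that the failure entry carries a count of -1 and a pointer that must not be read. Treating every negative count as failure is a choice made for this sketch.

```rust
/// Mirror of `%rt = type { ptr, i64 }`. The field names and the Rust-side
/// layout are assumptions made for this sketch.
#[repr(C)]
pub struct Rt {
    pub ids: *const u32,
    pub count: i64,
}

/// Decode an `Rt` into an optional slice of endids.
///
/// # Safety
/// If `count` is positive, `ids` must point to `count` valid `u32` values
/// that live for `'a`.
pub unsafe fn endids<'a>(rt: &Rt) -> Option<&'a [u32]> {
    if rt.count < 0 {
        // Failure: `ids` may be poison, so it is never dereferenced.
        return None;
    }
    if rt.count == 0 {
        let empty: &[u32] = &[];
        return Some(empty);
    }
    Some(unsafe { std::slice::from_raw_parts(rt.ids, rt.count as usize) })
}
```

The id values themselves never take part in the success check, so they stay opaque to the generated code.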
The generated code looks like this:
```rust
pub fn fsm_main(input: &str) -> Option<&'static [u32]> {
    use Label::*;
    static RET0: [u32; 1] = [1];
    static RET1: [u32; 1] = [2];
    static RET2: [u32; 3] = [0, 1, 2];

    let mut bytes = input.bytes();

    pub enum Label {
        Ls, L0,
    }

    let mut l = Ls;

    loop {
        match l {
            Ls => { // e.g. ""
                let c = match bytes.next() {
                    None => return Some(&RET0) /* "x?" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                let c = match bytes.next() {
                    None => return Some(&RET2) /* "x", "x?", "x+" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                l = L0; continue;
            }

            L0 => { // e.g. "xx"
                let c = match bytes.next() {
                    None => return Some(&RET1) /* "x+" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                l = L0; continue;
            }
        }
    }
}
```
This gives better control over whitespace and punctuation between the hooks. For example, we can output "<accept>, <comment>\n" with a comma between them, which sits more nicely with single-line comments. Previously these had to be "<accept> <comment>,\n".
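
The ordering can be sketched as below, with the printer rather than the hooks owning the punctuation. `Hooks` and `print_accept` are hypothetical stand-ins written for illustration. They are not the `struct fsm_hooks` C interface, whose signatures are not shown in this PR.

```rust
/// Hypothetical pair of caller-supplied hooks. `comment` is optional, in
/// the spirit of stripping comments when they are not requested.
pub struct Hooks<'a> {
    pub accept: &'a dyn Fn(&[u32]) -> String,
    pub comment: Option<&'a dyn Fn(&[u32]) -> String>,
}

/// Emit "<accept>, <comment>\n", or "<accept>,\n" when there is no comment
/// hook. The comma and spacing are written here, between the two hooks.
pub fn print_accept(hooks: &Hooks<'_>, ids: &[u32]) -> String {
    let mut line = (hooks.accept)(ids);
    line.push(',');
    if let Some(comment) = hooks.comment {
        line.push(' ');
        line.push_str(&comment(ids));
    }
    line.push('\n');
    line
}
```

Keeping the separator out of both hooks means a single-line comment can end the line without swallowing the comma.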
@silentbicycle (Collaborator) left a comment

This makes sense to me overall.

I'm not familiar with Go or LLVM IR syntax, but I think I get the gist.

I will be working on integrating early endid matching into codegen soon, so I will spend a lot of time with the C interfaces, but any changes necessary for that can go in later PRs.

* but simply not yet implemented, where fsm_print() will print a message
* to stderr and exit.
*
* The code generation for the typical case of matching input require the FSM
Collaborator

Minor typo: "require" should be "requires" (the subject is singular)

Spotted by Scott.
@tfreiberg-fastly left a comment

that last commit looks absolutely correct to me. thankfully the previous commits were already approved

@katef katef merged commit 0a36f1b into main Aug 16, 2024
346 checks passed
@katef katef deleted the kate/more-multi branch August 16, 2024 10:28