Implement AMBIG_MULTIPLE for llvm, Rust, go, and the vmops dump #486
Now the VM IR ops point to retlist entries, rather than carrying their own endid sets. This means we can use the same indexing to de-dup them for every VM-based format.
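One way to picture the de-duplication is as interning: each distinct set of end ids is stored once, and every op carries only an index into that list. The sketch below is a minimal illustration under that assumption. The names `retList`, `intern` and `key` are hypothetical and are not taken from libfsm, and it assumes each id set arrives sorted and without duplicates.

```go
package main

import (
	"fmt"
	"strings"
)

// retList stores each distinct set of end ids exactly once.
// Hypothetical type for illustration; not libfsm's actual structure.
type retList struct {
	sets  [][]uint       // unique end id sets, in first-seen order
	index map[string]int // canonical key -> position in sets
}

func newRetList() *retList {
	return &retList{index: make(map[string]int)}
}

// key builds a canonical string for a set of ids.
// Assumption: ids are already sorted and contain no duplicates.
func key(ids []uint) string {
	var b strings.Builder
	for _, id := range ids {
		fmt.Fprintf(&b, "%d,", id)
	}
	return b.String()
}

// intern returns the index of ids in the list, appending a copy
// of the set only if an identical set has not been seen before.
func (r *retList) intern(ids []uint) int {
	k := key(ids)
	if i, ok := r.index[k]; ok {
		return i
	}
	i := len(r.sets)
	r.sets = append(r.sets, append([]uint(nil), ids...))
	r.index[k] = i
	return i
}
```

With this shape, two ops that accept the same set of ids get the same index, so a code generator can emit one shared array per entry regardless of the output language.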
The generated code looks like this:

```go
package fsm_fsm

var ret0 []uint = []uint{1}
var ret1 []uint = []uint{2}
var ret2 []uint = []uint{1, 2}
var ret3 []uint = []uint{0, 1, 2}

func fsm_Match(data string) (bool, []uint) {
	var idx = ^uint(0)

	if idx++; idx >= uint(len(data)) { return true, ret0 }
	if data[idx] == 'a' { goto l3 }
	if data[idx] == 'b' { goto l2 }
	if data[idx] != 'c' { return false, nil }
l0: // e.g. "c"
	if idx++; idx >= uint(len(data)) { return true, ret2 }
	if data[idx] <= '`' { return false, nil }
	if data[idx] <= 'b' { goto l1 }
	if data[idx] == 'c' { goto l0 }
	{ return false, nil }
l1: // e.g. "aa"
	if idx++; idx >= uint(len(data)) { return true, ret1 }
	if data[idx] <= '`' { return false, nil }
	if data[idx] <= 'c' { goto l1 }
	{ return false, nil }
l2: // e.g. "b"
	if idx++; idx >= uint(len(data)) { return true, ret3 }
	if data[idx] == 'a' { goto l1 }
	if data[idx] == 'b' { goto l2 }
	if data[idx] == 'c' { goto l1 }
	{ return false, nil }
l3: // e.g. "a"
	if idx++; idx >= uint(len(data)) { return true, ret1 }
	if data[idx] == 'a' { goto l1 }
	if data[idx] == 'b' { goto l2 }
	if data[idx] == 'c' { goto l1 }
	{ return false, nil }
}
```
I've handled AMBIG_NONE here, but I haven't distinguished the other ambig modes. Other than AMBIG_NONE, the other modes are all presented as an array of ids, even if it's just a single element. That's because I don't see any reason to give these specialised APIs for the current use-cases for this generated code, which is supposed to be a direct representation of our VM opcodes.
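As a rough sketch of what this array-of-ids shape means for a caller: the generated Go code above returns `(bool, []uint)`, and it is up to the caller to give the ids meaning. The helper below is hypothetical (the names `matcher` and `labels` are mine, not part of the generated code). It takes any function with the same signature as `fsm_Match` and maps each returned id to a caller-defined name.

```go
package main

// matcher has the same signature as the generated fsm_Match.
type matcher func(data string) (bool, []uint)

// labels runs m on input and translates every matched end id into a
// caller-defined name. The second result is false when the input
// does not match at all. Ids with no entry in names become "?".
func labels(m matcher, names map[uint]string, input string) ([]string, bool) {
	ok, ids := m(input)
	if !ok {
		return nil, false
	}
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		name, found := names[id]
		if !found {
			name = "?"
		}
		out = append(out, name)
	}
	return out, true
}
```

A caller would pass the generated function directly, for example `labels(fsm_Match, names, "b")`. The caller never needs a separate code path for the single-id case, since one id is just a slice of length one.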
It's a bit rough and isn't what I want to end up with, but I wanted to commit this as a waypoint.
This means the phi instruction now only carries an array index. Many thanks to @mcy for advice and patient help here.
…e `poison`. Thanks to @mcy for this.
This allows callers to default the codegen for accepting states (in particular outputting the values for endids) independently of commenting caller-specific meanings for the IDs.
Originally I'd intended this as a demonstration of how various applications can handle ambiguities differently. But now we have library support for AMBIG_MULTIPLE, I think this is just confusing.
The idea here is just to trim down %rt to:

```
%rt = type { ptr, i64 }
```

where we clearly don't need a uint64_t's count of unique ids.

I've purposefully not done the same for single-id interfaces. I don't want to mix success/failure with an id *value*, because the values are opaque (i.e. the meaning of an id value is the responsibility of the caller). Whereas here for AMBIG_MULTIPLE I'm mixing success/failure with the count, not with the id values.

Suggested by @mcy, thank you.
The generated code looks like this:

```rust
pub fn fsm_main(input: &str) -> Option<&'static [u32]> {
    use Label::*;
    static RET0: [u32; 1] = [1];
    static RET1: [u32; 1] = [2];
    static RET2: [u32; 3] = [0, 1, 2];
    let mut bytes = input.bytes();

    pub enum Label {
        Ls, L0,
    }

    let mut l = Ls;

    loop {
        match l {
            Ls => { // e.g. ""
                let c = match bytes.next() {
                    None => return Some(&RET0) /* "x?" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                let c = match bytes.next() {
                    None => return Some(&RET2) /* "x", "x?", "x+" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                l = L0; continue;
            }
            L0 => { // e.g. "xx"
                let c = match bytes.next() {
                    None => return Some(&RET1) /* "x+" */,
                    Some(c) => c,
                };
                if c != b'x' { return None }
                l = L0; continue;
            }
        }
    }
}
```
This gives better control over whitespace and punctuation between the hooks. For example we can output "<accept>, <comment>\n" with a comma between, and that sits more nicely for single-line comments. Previously these had to be "<accept> <comment>,\n".
This makes sense to me overall.
I'm not familiar with Go or LLVM IR syntax, but I think I get the gist.
I will be working on integrating early endid matching into codegen soon, so I will spend a lot of time with the C interfaces, but any changes necessary for that can go in later PRs.
include/fsm/print.h
Outdated
* but simply not yet implemented, where fsm_print() will print a message
* to stderr and exit.
*
* The code generation for the typical case of matching input require the FSM
Minor typo: "requires" (singular)
that last commit looks absolutely correct to me. thankfully the previous commits were already approved
This PR:
- `AMBIG_MULTIPLE` for various languages (go, rust, llvm especially)
- `(struct fsm_hooks).comment()`
- `(struct fsm_options).comments`
Here's how the generated `AMBIG_MULTIPLE` code looks for the following example:

for go:
Rust:
and (my current favourite) for llvm, with many thanks to @mcy for the guidance:
And the vmops structures also now carry the endids for all ambig modes except for `AMBIG_NONE`: