Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endleaf rejiggle, .fsm syntax for endids, refactor endid get/set api, various related bugfixes #479

Merged
merged 28 commits into from
Jun 18, 2024

Conversation

katef
Copy link
Owner

@katef katef commented Jun 16, 2024

I did quite a few things here, all focused around the endleaf callbacks,
and by extension the endid mechanism. I fixed a few small bugs, and got distracted trying to tidy up a bit.

Anyway here's what we get:

  • .fsm syntax support for endids
  • we now have tests for .fsm syntax (gasp)
  • Add missing printing for endids for various formats (notably dot and .fsm)
  • Various work on ways to simplify the endids API, especially:
    • Removed &nwritten for fsm_endid_get(), we don't need partial gets
    • Removed the tri-state enums for fsm_endid_get/set()
    • We don't need to distinguish a state absent endids from a present but empty set of endids
  • Add missing calls to endleaf() callbacks for various formats
  • Reintroduced fsm_exec handling in re(1), and reworked related argv[] handling. This is neccessary bugfixes only, in particular I want to address Behavior of -G length inconsistent, moreover returns error code. #478 and Unclear how to do multiple transformations in one invocation #477 separately
  • Bugfixes around the api format codegen, that was quite out of date.

The .fsm syntax for endids looks like this:

; ./build/bin/re -r pcre -bpl fsm 'ab?c' 'abc*'
0  ->  1 "a"; # e.g. "a"
1  ->  2 "b"; # e.g. "ab"
1  ->  3 "c"; # e.g. "ac"
2  ->  4 "c"; # e.g. "abc"
4  ->  5 "c"; # e.g. "abcc"
5  ->  5 "c"; # e.g. "abcc"

start: 0;
end:   2 = [0], 3 = [0], 4 = [0, 1], 5 = [0];

and renders out like:
image

katef added 28 commits June 10, 2024 05:44
These are optional.

I'd expected this to be per state at file scope, but I don't like that the syntax would then allow attempting to attach an id to a non-accepting state. Since we have a block for `end:` already, adding the end ids here means they can't be mixed up.

This looks like:
```
; ./build/bin/re -r pcre -bpl fsm 'ab?c' 'abc*'
0  ->  1 "a"; # e.g. "a"
1  ->  2 "b"; # e.g. "ab"
1  ->  3 "c"; # e.g. "ac"
2  ->  4 "c"; # e.g. "abc"
4  ->  5 "c"; # e.g. "abcc"
5  ->  5 "c"; # e.g. "abcc"

start: 0;
end:   2 = [0], 3 = [0], 4 = [0, 1], 5 = [0];
```
Two things here:

Firstly I've reworked all this stuff such that we handle filenames, and then any remaining arguments are used to match with fsm_exec.

This is what I'd originally intended, but I think it got lost at some point, probably when introducing operators with arity 2.

Secondly matching text with fsm_exec now runs regarless of the operation, not just on OP_IDENTITY. I see no reason to limit that.
The file format is complex enough that I want to cover these explicitly now, rather than relying on coverage through other tests.
This should've been done when updating to fsm_state_t.
This is never actually reached, because fsm_collate() doesn't handle end IDs.
I don't see any reason callers of fsm_endid_get() would want to fetch into a fixed-size buffer of fewer elements than the number of endids an end state has. In all cases the caller can fsm_endid_count() to find out.
This keeps the generated code C89-compliant.
If these were neccessary for caller-supplied generated code from the endleaf callbacks, the actual way I'm intending a caller to pass information there, is to output with `opt.fragment` set.

Then there's no need to gi via `void *` for any caller state, that can all be exposed as appropriately typed variables in scope for the generated code (or wrapped in a similar function as the non-fragment code, and exposed as appropriately-typed arguments to that function).

The existing `void *opaque` here was nothing to do with this, it was actually for the `.getc()` callback. I've renamed it to keep it from getting mixed up.
I'm not removing this for any profound reason, just that fewer kinds of structs help me better see what's happening.
Now this is always equivalent to the count passed in, when the return status is 1. And the return status is always 1 when the count is enough. In all situations we know the count is enough.
@katef katef requested a review from silentbicycle June 16, 2024 22:35
Copy link
Collaborator

@silentbicycle silentbicycle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a better suggestion for how to pass a handle to the caller besides void *opaque that isn't significantly more complicated in a multi-threaded context?

src/libfsm/print/c.c Show resolved Hide resolved
include/fsm/fsm.h Show resolved Hide resolved
@katef katef merged commit 0402f39 into main Jun 18, 2024
322 checks passed
@katef katef deleted the kate/endleaf-rejiggle branch June 18, 2024 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants