Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode #2

Merged
merged 29 commits into from
Aug 12, 2024
Merged

Unicode #2

merged 29 commits into from
Aug 12, 2024

Conversation

fjebaker
Copy link
Owner

@fjebaker fjebaker commented Aug 10, 2024

Following up from #1 (thanks @alberic89) will use this PR to track the changes to get unicode support into fuzzig.

Todo:

  • API needs reworking to remove @panic (breaking)
  • More rigorous test cases for unicode

alberic89 and others added 3 commits August 10, 2024 10:29
Add a new implementation for Unicode.
Add a library to correctly handle Unicode strings.
Making non-breaking changes to ASCII implementation.
src/root.zig Outdated
Comment on lines 579 to 581
fn eqlFunc(a: *const UnicodeOptions, h: u21, n: u21) bool {
const gcd = GenCatData.init(a.allocator) catch @panic("Memory error");
defer gcd.deinit();
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The eqlFunc is called in the inner loop of the alignment solver, so it should ideally have a buffered allocator. As it currently is this will hurt performance with the alloc overhead.

src/root.zig Outdated
Comment on lines 110 to 114
const TypeOfCaracter = switch (Impl) {
AsciiOptions => u8,
UnicodeOptions => u21,
else => unreachable,
};
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Algorithm generic struct already has an ElType argument, so we can use that instead of having to maintain a switch. In theory I still want downstream users to be able to modify the behaviour of the fuzzy finder without having to modify the source.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, there are two types of solutions:

  1. The user should give a standard string ([]const u8) to the fuzzy finder, choose the right algo (Ascii or Unicode) depending on the “type” of the text. In this case, ElType will always be u8, and TypeOfCaracter depends on the algo. The algo will make by himself all the conversion.
  2. Let the user make the conversions, and the algo will take the type of the strings (ElType). The problem is that the user will need a lib to make the conversion, for being able to use fuzzig. And fuzzig will also need a lib (surely the same) to process correctly the data.

The solution for me is the 1, to replace ElType by u8 where it is needed, and use it instead to provide the type that is waited by the UnicodeOption (or Ascii).
In this case, TypeOfCaracter is no longer needed (so we can delete the switch).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done it in alberic89@073d10a

src/root.zig Outdated
Comment on lines 490 to 491
/// Don't forget the allocator !!!
allocator: Allocator = undefined,
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pass allocators in from the algorithm struct instead of having the options hold on to them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in commit alberic89@3bab90b
But many side effects. Maybe they will be attenuated with the futures changes (less need of an allocator).

src/root.zig Outdated
Comment on lines 597 to 627
fn scoreFunc(
a: *const UnicodeOptions,
comptime scores: UnicodeScores,
h: u21,
n: u21,
) ?i32 {
if (!a.eqlFunc(h, n)) return null;

if (a.case_penalize and (h != n)) {
return scores.score_match + a.penalty_case_mistmatch;
}
return scores.score_match;
}

fn bonusFunc(
self: *const UnicodeOptions,
comptime scores: UnicodeScores,
h: u21,
n: u21,
) i32 {
const p = CharacterType.fromUnicode(h, self.allocator);
const c = CharacterType.fromUnicode(n, self.allocator);

return switch (p.roleNextTo(c)) {
.Head => scores.bonus_head,
.Camel => scores.bonus_camel,
.Break => scores.bonus_break,
.Tail => scores.bonus_tail,
};
}
};
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are both essentically identical to the Ascii functions which suggests we should pull them out so we only maintain the implementation once.

src/utils.zig Outdated
Comment on lines 35 to 38
pub fn fromUnicode(c: u21, allocator: std.mem.Allocator) CharacterType {
const cd = CaseData.init(allocator) catch @panic("Memory error");
defer cd.deinit();
const gcd = GenCatData.init(allocator) catch @panic("Memory error");
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this feels like we should be able to do it with a stack allocated buffer instead of a std.mem.Allocator to avoid alloc overheads.

src/root.zig Outdated
Comment on lines 198 to 202
const haystack_normal = self.impl.convertString(haystack);
defer self.allocator.free(haystack_normal);

const needle_normal = self.impl.convertString(needle);
defer self.allocator.free(needle_normal);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use ArenaAllocator for the lifetime of a single solve? Then implementations that don't need to allocate don't have to pay the price of a dupe to satisfy the free? Or else let the implementation manage the lifetimes?

@alberic89
Copy link
Contributor

alberic89 commented Aug 10, 2024

I will rework on all the change you have suggested.

Meanwhile, I have made other changes to remove the need to know by advance the length of the haystack and needle. You can see them here: alberic89@bb35c1b

I don't know how I can pull my changes here. Do you want me to make a pull request for each change ?

Pass the context and callback functions to the algorithm instead of
defining its type by them. This lets the algorithm be owned by the
implementation, which also provides the public API, allowing
transformations, such as the unicode u8 -> u21, to happen in the guard.
@fjebaker
Copy link
Owner Author

@alberic89 I made a couple of API changes. Main thing now is that the Algorithm is owned by the implementation instead of the other way round, which lets the implementation provide the public interface and do things like transforming the input from u8 to u21. It also means the implementations can now access the allocator from the algorithm in their methods too, and provide their own inits and deinits.

Sorry about very likely having introduced a merge conflict for you. If you could rebase your work onto this branch and open PRs into this branch, that would be best.

There's another problem that occured to me; changing from u8 to u21 during the input stage means the traceback can be very off. We'd either need to recalculate the offsets into the u8 array, or insist that the unicode input is already u21, as I can't think of another way round that.

@fjebaker
Copy link
Owner Author

fjebaker commented Aug 10, 2024

I've also had a look at alberic89@bb35c1b ; in principle these changes are fine, but the problem is now that the you do the cleanup and the allocation for each call to score. One of the reasons I had the preallocation was to avoid having to do memory things during score, so you can init a solver and score 100 different fuzzy finds without having to touch memory. What I'd prefer to do here is keep the current interface, but provide a sscoreResize or just a resize method to let the user make the decision as to whether the buffers are the right size or not, and whether they have the performance budget to check each score's setup or not.

Does that make sense?

@fjebaker
Copy link
Owner Author

I've added a benchmark target to both the main branch and this one, which you can use to test how it impacts the Ascii fuzzy finder with zig build benchmark. The benchmark's are naive, but a good coarse estimate.

@alberic89
Copy link
Contributor

I've also had a look at alberic89@bb35c1b ; in principle these changes are fine, but the problem is now that the you do the cleanup and the allocation for each call to score. One of the reasons I had the preallocation was to avoid having to do memory things during score, so you can init a solver and score 100 different fuzzy finds without having to touch memory. What I'd prefer to do here is keep the current interface, but provide a sscoreResize or just a resize method to let the user make the decision as to whether the buffers are the right size or not, and whether they have the performance budget to check each score''s setup or not.

Does that make sense?

Yes, this solution is better. My solution was just “make it working”, but for performance yours is better. I will look on it.

@fjebaker
Copy link
Owner Author

Sounds good, and thanks for taking a look at this. I'm not married to any particular solution though, so if you find a better solution, please go for it! I am happy to merge breaking changes provided they are worth it 😎

Unicode support only enabled with `-Dunicode` to avoid labouring users
with unnecessary dependencies if they only need ASCII. Move all unicode
related functions to a seperate `unicode.zig` which is conditionally
included at compile time.
Removed `resizeIfNeeded`, since `realloc` effectively that for us.
Exposed the `resize` functions to the public `Ascii` and `Unicode`
interfaces to make them available to users.
@fjebaker fjebaker marked this pull request as ready for review August 12, 2024 17:29
@fjebaker fjebaker merged commit 26261a6 into main Aug 12, 2024
1 check passed
@fjebaker fjebaker deleted the unicode branch August 12, 2024 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants