Unicode #2
Add a new implementation for Unicode. Add a library to correctly handle Unicode strings. Make non-breaking changes to the ASCII implementation.
Add Unicode support
src/root.zig
Outdated
fn eqlFunc(a: *const UnicodeOptions, h: u21, n: u21) bool {
    const gcd = GenCatData.init(a.allocator) catch @panic("Memory error");
    defer gcd.deinit();
The `eqlFunc` is called in the inner loop of the alignment solver, so it should ideally have a buffered allocator. As it stands, the allocation overhead on every call will hurt performance.
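One way to take the allocation out of the inner loop is to build the lookup data once and let the options borrow it. The following is only a sketch: `FoldTable` is a hypothetical stand-in for `GenCatData` (it folds ASCII case only), and `Options` is a trimmed-down placeholder, not the real `UnicodeOptions`.

```zig
const std = @import("std");

/// Hypothetical stand-in for a lookup table that must be allocated before use.
const FoldTable = struct {
    allocator: std.mem.Allocator,
    lower: []u8,

    fn init(allocator: std.mem.Allocator) !FoldTable {
        const lower = try allocator.alloc(u8, 128);
        // Precompute the lowercase form of every ASCII code point.
        for (lower, 0..) |*slot, i| slot.* = std.ascii.toLower(@as(u8, @intCast(i)));
        return FoldTable{ .allocator = allocator, .lower = lower };
    }

    fn deinit(self: FoldTable) void {
        self.allocator.free(self.lower);
    }

    fn fold(self: FoldTable, c: u21) u21 {
        return if (c < self.lower.len) self.lower[c] else c;
    }
};

/// The options borrow the table; building it is done once by the caller.
const Options = struct {
    table: *const FoldTable,

    /// No allocation here, so it is cheap to call from the inner loop.
    fn eqlFunc(self: *const Options, h: u21, n: u21) bool {
        return self.table.fold(h) == self.table.fold(n);
    }
};
```

With this shape the table is initialised once per finder rather than once per comparison.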
src/root.zig
Outdated
const TypeOfCaracter = switch (Impl) {
    AsciiOptions => u8,
    UnicodeOptions => u21,
    else => unreachable,
};
The `Algorithm` generic struct already has an `ElType` argument, so we can use that instead of having to maintain a switch. In theory I still want downstream users to be able to modify the behaviour of the fuzzy finder without having to modify the source.
For me, there are two types of solutions:
- The user gives a standard string (`[]const u8`) to the fuzzy finder and chooses the right algorithm (Ascii or Unicode) depending on the “type” of the text. In this case, `ElType` will always be `u8`, and `TypeOfCaracter` depends on the algorithm. The algorithm does all the conversion itself.
- The user does the conversion, and the algorithm takes the type of the strings (`ElType`). The problem is that the user will need a library to do the conversion in order to use fuzzig, and fuzzig will also need a library (surely the same one) to process the data correctly.
I prefer solution 1: replace `ElType` with `u8` where it is needed, and use it instead to provide the type that is expected by `UnicodeOptions` (or Ascii).
In this case, `TypeOfCaracter` is no longer needed (so we can delete the switch).
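As a rough illustration of solution 1, the conversion from `[]const u8` to code points could live inside the implementation. This sketch uses `std.unicode.Utf8View` from the standard library; the helper name `decodeUtf8` is an assumption for illustration and is not part of fuzzig.

```zig
const std = @import("std");

/// Hypothetical helper: decode a UTF-8 string into a slice of code points.
/// The caller owns the returned slice and frees it with the same allocator.
fn decodeUtf8(allocator: std.mem.Allocator, s: []const u8) ![]u21 {
    // Fails with error.InvalidUtf8 before anything is allocated.
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();

    // Upper bound: a string never holds more code points than bytes.
    const out = try allocator.alloc(u21, s.len);
    errdefer allocator.free(out);

    var count: usize = 0;
    while (it.nextCodepoint()) |cp| : (count += 1) out[count] = cp;

    // Shrink to the number of code points actually decoded.
    return allocator.realloc(out, count);
}
```

The user keeps passing plain byte strings, and only the Unicode implementation pays for the decode step.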
I have done it in alberic89@073d10a
src/root.zig
Outdated
/// Don't forget the allocator !!!
allocator: Allocator = undefined,
Let's pass allocators in from the algorithm struct instead of having the options hold on to them.
Removed in commit alberic89@3bab90b, but with many side effects. They may be reduced by the future changes (less need for an allocator).
src/root.zig
Outdated
    fn scoreFunc(
        a: *const UnicodeOptions,
        comptime scores: UnicodeScores,
        h: u21,
        n: u21,
    ) ?i32 {
        if (!a.eqlFunc(h, n)) return null;

        if (a.case_penalize and (h != n)) {
            return scores.score_match + a.penalty_case_mistmatch;
        }
        return scores.score_match;
    }

    fn bonusFunc(
        self: *const UnicodeOptions,
        comptime scores: UnicodeScores,
        h: u21,
        n: u21,
    ) i32 {
        const p = CharacterType.fromUnicode(h, self.allocator);
        const c = CharacterType.fromUnicode(n, self.allocator);

        return switch (p.roleNextTo(c)) {
            .Head => scores.bonus_head,
            .Camel => scores.bonus_camel,
            .Break => scores.bonus_break,
            .Tail => scores.bonus_tail,
        };
    }
};
These are both essentially identical to the Ascii functions, which suggests we should pull them out so we only maintain the implementation once.
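A possible shape for the shared version is a free function generic over the character type. This is a sketch only: `scoreShared`, `AsciiOpts` and `UnicodeOpts` are hypothetical names, the default penalty is illustrative, and the Unicode `fold` is an ASCII-only placeholder for real case folding.

```zig
const std = @import("std");

/// Shared scoring logic. `opts` is a pointer to any type that exposes
/// `eqlFunc`, `case_penalize` and `penalty_case_mistmatch`.
fn scoreShared(comptime Char: type, opts: anytype, score_match: i32, h: Char, n: Char) ?i32 {
    if (!opts.eqlFunc(h, n)) return null;
    if (opts.case_penalize and h != n) return score_match + opts.penalty_case_mistmatch;
    return score_match;
}

/// Trimmed-down stand-in for the ASCII options.
const AsciiOpts = struct {
    case_penalize: bool = true,
    penalty_case_mistmatch: i32 = -2,

    fn eqlFunc(_: *const AsciiOpts, h: u8, n: u8) bool {
        return std.ascii.toLower(h) == std.ascii.toLower(n);
    }
};

/// Trimmed-down stand-in for the Unicode options.
const UnicodeOpts = struct {
    case_penalize: bool = true,
    penalty_case_mistmatch: i32 = -2,

    fn eqlFunc(_: *const UnicodeOpts, h: u21, n: u21) bool {
        return fold(h) == fold(n);
    }

    /// Placeholder: folds ASCII only, leaves every other code point unchanged.
    fn fold(c: u21) u21 {
        return if (c < 128) std.ascii.toLower(@as(u8, @intCast(c))) else c;
    }
};
```

Each implementation would then forward its `scoreFunc` to `scoreShared` with its own character type, so the penalty logic lives in one place.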
src/utils.zig
Outdated
pub fn fromUnicode(c: u21, allocator: std.mem.Allocator) CharacterType {
    const cd = CaseData.init(allocator) catch @panic("Memory error");
    defer cd.deinit();
    const gcd = GenCatData.init(allocator) catch @panic("Memory error");
Again, this feels like we should be able to do it with a stack allocated buffer instead of a `std.mem.Allocator` to avoid alloc overheads.
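For reference, the standard library's `std.heap.FixedBufferAllocator` serves allocations out of a caller-supplied buffer, which can live on the stack. The sketch below is not tied to `CaseData` or `GenCatData`, whose memory needs are not shown in this thread; `eqlIgnoreCase` and the 256-byte buffer size are assumptions made for illustration.

```zig
const std = @import("std");

/// Case-insensitive ASCII comparison that needs scratch copies of both inputs.
/// The scratch memory comes from a stack buffer, so the heap is never touched.
fn eqlIgnoreCase(a: []const u8, b: []const u8) !bool {
    var buf: [256]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);
    const allocator = fba.allocator();

    // Both copies are carved out of `buf`; nothing needs freeing because
    // the buffer disappears when the function returns.
    const la = try std.ascii.allocLowerString(allocator, a);
    const lb = try std.ascii.allocLowerString(allocator, b);
    return std.mem.eql(u8, la, lb);
}
```

The trade-off is a hard size limit: inputs that do not fit in the buffer fail with `error.OutOfMemory`, so the buffer size has to be chosen with the worst case in mind.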
src/root.zig
Outdated
const haystack_normal = self.impl.convertString(haystack);
defer self.allocator.free(haystack_normal);

const needle_normal = self.impl.convertString(needle);
defer self.allocator.free(needle_normal);
Maybe use `ArenaAllocator` for the lifetime of a single solve? Then implementations that don't need to allocate don't have to pay the price of a `dupe` to satisfy the `free`? Or else let the implementation manage the lifetimes?
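A minimal sketch of the arena idea, assuming a hypothetical `Solver` that wraps a backing allocator. The lowercase copies and the substring check only stand in for `convertString` and the real scoring.

```zig
const std = @import("std");

/// Hypothetical solver: every allocation made during one `solve` call
/// goes through an arena that is released in a single step at the end.
const Solver = struct {
    backing: std.mem.Allocator,

    fn solve(self: Solver, haystack: []const u8, needle: []const u8) !bool {
        var arena = std.heap.ArenaAllocator.init(self.backing);
        defer arena.deinit(); // frees everything allocated below
        const scratch = arena.allocator();

        // Stand-ins for the converted strings; no per-slice free is needed.
        const h = try std.ascii.allocLowerString(scratch, haystack);
        const n = try std.ascii.allocLowerString(scratch, needle);
        return std.mem.indexOf(u8, h, n) != null;
    }
};
```

An implementation that does not need to convert anything can return its input slice unchanged, since nothing will try to free it individually.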
I will rework all the changes you have suggested. Meanwhile, I have made other changes to remove the need to know the length of the haystack and needle in advance. You can see them here: alberic89@bb35c1b. I don't know how I can pull my changes here. Do you want me to make a pull request for each change?
Pass the context and callback functions to the algorithm instead of defining its type by them. This lets the algorithm be owned by the implementation, which also provides the public API, allowing transformations, such as the unicode u8 -> u21, to happen in the guard.
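To illustrate the direction of this change, here is a small sketch in which the algorithm is generic over the element type only and receives the context and callback per call, while the implementation owns the public API. All names (`Algorithm`, `isSubsequence`, `Ascii`, `matches`) are hypothetical, and the subsequence check merely stands in for the real scoring loop.

```zig
const std = @import("std");

/// Hypothetical algorithm: generic over the element type only. The context
/// and the equality callback are passed per call, not baked into the type.
fn Algorithm(comptime Char: type) type {
    return struct {
        /// Stand-in for the scoring loop: is `needle` a subsequence of `haystack`?
        fn isSubsequence(
            comptime Ctx: type,
            ctx: Ctx,
            comptime eql: fn (Ctx, Char, Char) bool,
            haystack: []const Char,
            needle: []const Char,
        ) bool {
            var i: usize = 0;
            for (haystack) |h| {
                if (i < needle.len and eql(ctx, h, needle[i])) i += 1;
            }
            return i == needle.len;
        }
    };
}

/// Hypothetical implementation: owns the algorithm and the public interface.
const Ascii = struct {
    case_sensitive: bool = false,

    const Alg = Algorithm(u8);

    fn eql(self: Ascii, h: u8, n: u8) bool {
        if (self.case_sensitive) return h == n;
        return std.ascii.toLower(h) == std.ascii.toLower(n);
    }

    /// Public entry point; any input transformation would happen here.
    pub fn matches(self: Ascii, haystack: []const u8, needle: []const u8) bool {
        return Alg.isSubsequence(Ascii, self, eql, haystack, needle);
    }
};
```

A Unicode implementation would follow the same pattern with `Algorithm(u21)` and would decode its `[]const u8` input inside `matches` before calling the algorithm.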
@alberic89 I made a couple of API changes. The main thing now is that the Algorithm is owned by the implementation instead of the other way round, which lets the implementation provide the public interface and do things like transforming the input from […]
Sorry about very likely having introduced a merge conflict for you. If you could rebase your work onto this branch and open PRs into this branch, that would be best.
There's another problem that occurred to me; changing from […]
I've also had a look at alberic89@bb35c1b; in principle these changes are fine, but the problem is now that you do the cleanup and the allocation for each call to […]
Does that make sense?
I've added a benchmark target to both the main branch and this one, which you can use to test how it impacts the […]
Yes, this solution is better. My solution was just “make it work”, but yours is better for performance. I will look into it.
Sounds good, and thanks for taking a look at this. I'm not married to any particular solution though, so if you find a better solution, please go for it! I am happy to merge breaking changes provided they are worth it 😎
Re-allocate Matrix and buffers in scoreImpl at each call depending on the length of the strings.
Use the allocator provided to the algorithm instead, which has many side effects.
Now, the user SHOULD use []const u8 strings.
adjust the rebase (not in valid state)
Not working for now
Currently working
Matrix resize
Unicode support is only enabled with `-Dunicode` to avoid burdening users with unnecessary dependencies if they only need ASCII. Move all Unicode-related functions to a separate `unicode.zig`, which is conditionally included at compile time.
Removed `resizeIfNeeded`, since `realloc` effectively does that for us. Exposed the `resize` functions to the public `Ascii` and `Unicode` interfaces to make them available to users.
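As a sketch of why `realloc` can replace a hand-written `resizeIfNeeded`: `std.mem.Allocator.realloc` grows or shrinks a slice and keeps the existing contents up to the smaller of the two lengths. The `Buffer` type below is hypothetical and much simpler than the matrix and buffers in fuzzig.

```zig
const std = @import("std");

/// Hypothetical resizable buffer backed by a single slice.
const Buffer = struct {
    allocator: std.mem.Allocator,
    items: []i32,

    fn init(allocator: std.mem.Allocator, len: usize) !Buffer {
        return Buffer{ .allocator = allocator, .items = try allocator.alloc(i32, len) };
    }

    fn deinit(self: *Buffer) void {
        self.allocator.free(self.items);
    }

    /// Grow or shrink in one call; the allocator decides whether the block
    /// can be resized in place or has to be moved.
    fn resize(self: *Buffer, new_len: usize) !void {
        self.items = try self.allocator.realloc(self.items, new_len);
    }
};
```

On failure `realloc` leaves the old slice valid, so `self.items` is only overwritten when the resize succeeds.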
Following up from #1 (thanks @alberic89) will use this PR to track the changes to get unicode support into fuzzig.
Todo:
@panic
(breaking)