Unicode #2

fjebaker · 2024-08-10T09:56:19Z

Following up from #1 (thanks @alberic89), I will use this PR to track the changes needed to get Unicode support into fuzzig.

Todo:

- API needs reworking to remove @panic (breaking)
- More rigorous test cases for Unicode

fjebaker · 2024-08-10T10:08:37Z

src/root.zig

+    fn eqlFunc(a: *const UnicodeOptions, h: u21, n: u21) bool {
+        const gcd = GenCatData.init(a.allocator) catch @panic("Memory error");
+        defer gcd.deinit();


The eqlFunc is called in the inner loop of the alignment solver, so ideally it should use a buffered allocator. As written, initialising GenCatData on every call will hurt performance through allocation overhead.
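One way to keep allocation out of the inner loop is to build the lookup data once and let eqlFunc only read it. The sketch below is a rough illustration, not the PR's code: `LookupTable` and `Options` are hypothetical stand-ins (the real GenCatData API is not reproduced here), and the table only folds ASCII to stay self-contained.

```zig
const std = @import("std");

/// Hypothetical stand-in for an allocated Unicode data table such as
/// GenCatData. It is NOT the real library type.
const LookupTable = struct {
    allocator: std.mem.Allocator,
    lower: []u21,

    fn init(allocator: std.mem.Allocator) !LookupTable {
        // Only covers ASCII; a real table would cover all of Unicode.
        const lower = try allocator.alloc(u21, 128);
        for (lower, 0..) |*slot, i| {
            const c: u8 = @intCast(i);
            slot.* = std.ascii.toLower(c);
        }
        return .{ .allocator = allocator, .lower = lower };
    }

    fn deinit(self: *const LookupTable) void {
        self.allocator.free(self.lower);
    }

    /// Case-folds `c` if it is in the table, otherwise returns it unchanged.
    fn fold(self: *const LookupTable, c: u21) u21 {
        return if (c < self.lower.len) self.lower[c] else c;
    }
};

/// Hypothetical options struct that owns the table for its whole lifetime.
const Options = struct {
    table: LookupTable,

    fn init(allocator: std.mem.Allocator) !Options {
        return .{ .table = try LookupTable.init(allocator) };
    }

    fn deinit(self: *const Options) void {
        self.table.deinit();
    }

    /// Safe to call in the inner loop: reads the table, never allocates.
    fn eqlFunc(self: *const Options, h: u21, n: u21) bool {
        return self.table.fold(h) == self.table.fold(n);
    }
};
```

The allocation cost is paid once in `init`, so each comparison is a pair of slice lookups.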

fjebaker · 2024-08-10T10:09:59Z

src/root.zig

+        const TypeOfCaracter = switch (Impl) {
+            AsciiOptions => u8,
+            UnicodeOptions => u21,
+            else => unreachable,
+        };


The Algorithm generic struct already has an ElType argument, so we can use that instead of having to maintain a switch. In theory I still want downstream users to be able to modify the behaviour of the fuzzy finder without having to modify the source.

For me, there are two types of solutions:

1. The user gives a standard string ([]const u8) to the fuzzy finder and chooses the right algo (Ascii or Unicode) depending on the “type” of the text. In this case, ElType will always be u8, and TypeOfCaracter depends on the algo. The algo performs all the conversions itself.

2. Let the user make the conversions, and the algo takes the element type of the strings (ElType). The problem is that the user will need a library to do the conversion before being able to use fuzzig, and fuzzig will also need a library (surely the same one) to process the data correctly.

For me the solution is option 1: replace ElType with u8 where it is needed, and use ElType instead to provide the type expected by UnicodeOptions (or Ascii).
In this case, TypeOfCaracter is no longer needed (so we can delete the switch).

I have done it in alberic89@073d10a

fjebaker · 2024-08-10T10:10:53Z

src/root.zig

+    /// Don't forget the allocator !!!
+    allocator: Allocator = undefined,


Let's pass allocators in from the algorithm struct instead of having the options hold on to them.

Removed in commit alberic89@3bab90b, but with many side effects. Maybe they will be reduced by the future changes (less need for an allocator).

fjebaker · 2024-08-10T10:11:38Z

src/root.zig

+    fn scoreFunc(
+        a: *const UnicodeOptions,
+        comptime scores: UnicodeScores,
+        h: u21,
+        n: u21,
+    ) ?i32 {
+        if (!a.eqlFunc(h, n)) return null;
+
+        if (a.case_penalize and (h != n)) {
+            return scores.score_match + a.penalty_case_mistmatch;
+        }
+        return scores.score_match;
+    }
+
+    fn bonusFunc(
+        self: *const UnicodeOptions,
+        comptime scores: UnicodeScores,
+        h: u21,
+        n: u21,
+    ) i32 {
+        const p = CharacterType.fromUnicode(h, self.allocator);
+        const c = CharacterType.fromUnicode(n, self.allocator);
+
+        return switch (p.roleNextTo(c)) {
+            .Head => scores.bonus_head,
+            .Camel => scores.bonus_camel,
+            .Break => scores.bonus_break,
+            .Tail => scores.bonus_tail,
+        };
+    }
+};


These are both essentially identical to the Ascii functions, which suggests we should pull them out so we only maintain the implementation once.
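A rough sketch of what pulling the shared logic out could look like, using a comptime-generic helper parameterised on the implementation and element type. All names here (`SharedFuncs`, `AsciiLike`, `UnicodeLike`, `Scores`) and the score values are hypothetical and chosen only for illustration. For brevity the penalty lives on the score table rather than on the options.

```zig
const std = @import("std");

/// Hypothetical score table; the values are arbitrary illustration values.
const Scores = struct {
    score_match: i32 = 16,
    penalty_case_mismatch: i32 = -2,
};

/// Builds the shared scoring helper for any `Impl` that provides
/// `eqlFunc(*const Impl, T, T) bool` and a `case_penalize: bool` field.
fn SharedFuncs(comptime Impl: type, comptime T: type) type {
    return struct {
        fn scoreFunc(impl: *const Impl, comptime scores: Scores, h: T, n: T) ?i32 {
            if (!impl.eqlFunc(h, n)) return null;
            if (impl.case_penalize and h != n) {
                return scores.score_match + scores.penalty_case_mismatch;
            }
            return scores.score_match;
        }
    };
}

/// Hypothetical ASCII implementation: case-insensitive byte comparison.
const AsciiLike = struct {
    case_penalize: bool = true,

    fn eqlFunc(_: *const AsciiLike, h: u8, n: u8) bool {
        return std.ascii.toLower(h) == std.ascii.toLower(n);
    }
};

/// Hypothetical Unicode implementation. Folding is ASCII-only here to
/// keep the sketch free of external dependencies.
const UnicodeLike = struct {
    case_penalize: bool = true,

    fn eqlFunc(_: *const UnicodeLike, h: u21, n: u21) bool {
        return fold(h) == fold(n);
    }

    fn fold(c: u21) u21 {
        return if (c < 128) std.ascii.toLower(@intCast(c)) else c;
    }
};
```

Each implementation then only has to supply its own eqlFunc, and the scoring rule is written once.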

fjebaker · 2024-08-10T10:12:43Z

src/utils.zig

+    pub fn fromUnicode(c: u21, allocator: std.mem.Allocator) CharacterType {
+        const cd = CaseData.init(allocator) catch @panic("Memory error");
+        defer cd.deinit();
+        const gcd = GenCatData.init(allocator) catch @panic("Memory error");


Again, this feels like we should be able to do it with a stack allocated buffer instead of a std.mem.Allocator to avoid alloc overheads.
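As a minimal sketch of the stack-buffer idea, `std.heap.FixedBufferAllocator` hands out memory from a fixed array, so no heap allocation takes place. The function `countCaseChanges` and the 256-byte buffer size are assumptions made for this example; whether the Unicode data tables would fit in such a buffer is not something this sketch answers.

```zig
const std = @import("std");

/// Counts bytes that change under ASCII lowercasing, using scratch space
/// carved from a stack buffer. Returns error.OutOfMemory when `input`
/// does not fit in the buffer. (Hypothetical helper for illustration.)
fn countCaseChanges(input: []const u8) !usize {
    var buf: [256]u8 = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&buf);
    const allocator = fba.allocator();

    // No free needed: the memory goes away with the stack frame.
    const lowered = try allocator.alloc(u8, input.len);
    for (input, lowered) |c, *out| out.* = std.ascii.toLower(c);

    var changes: usize = 0;
    for (input, lowered) |a, b| {
        if (a != b) changes += 1;
    }
    return changes;
}
```

The trade-off is a hard upper bound on the input size, which the caller has to handle as an error.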

fjebaker · 2024-08-10T10:14:50Z

src/root.zig

+            const haystack_normal = self.impl.convertString(haystack);
+            defer self.allocator.free(haystack_normal);
+
+            const needle_normal = self.impl.convertString(needle);
+            defer self.allocator.free(needle_normal);


Maybe use ArenaAllocator for the lifetime of a single solve? Then implementations that don't need to allocate don't have to pay the price of a dupe to satisfy the free? Or else let the implementation manage the lifetimes?
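The arena option could look roughly like the following, where every conversion made during one solve is released by a single `arena.deinit()`. `convertString` and `solve` here are hypothetical free functions written for this example, and the matching loop is a placeholder subsequence count, not the real alignment solver.

```zig
const std = @import("std");

/// Hypothetical conversion step: decodes UTF-8 into code points.
fn convertString(allocator: std.mem.Allocator, s: []const u8) ![]u21 {
    const view = try std.unicode.Utf8View.init(s);
    const out = try allocator.alloc(u21, try std.unicode.utf8CountCodepoints(s));
    var it = view.iterator();
    var i: usize = 0;
    while (it.nextCodepoint()) |cp| : (i += 1) out[i] = cp;
    return out;
}

/// Returns how many leading code points of `needle` occur, in order,
/// within `haystack`. Placeholder for the real solver.
fn solve(backing: std.mem.Allocator, haystack: []const u8, needle: []const u8) !usize {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // frees both conversions at once
    const a = arena.allocator();

    const h = try convertString(a, haystack);
    const n = try convertString(a, needle);

    var matched: usize = 0;
    for (h) |c| {
        if (matched < n.len and c == n[matched]) matched += 1;
    }
    return matched;
}
```

An implementation that does not need to convert simply never touches the arena, so it pays nothing beyond the arena's own setup.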

alberic89 · 2024-08-10T10:36:23Z

I will rework all the changes you have suggested.

Meanwhile, I have made other changes to remove the need to know the length of the haystack and needle in advance. You can see them here: alberic89@bb35c1b

I don't know how to get my changes into this PR. Do you want me to make a pull request for each change?

Pass the context and callback functions to the algorithm instead of defining its type by them. This lets the algorithm be owned by the implementation, which also provides the public API, allowing transformations, such as the unicode u8 -> u21, to happen in the guard.

fjebaker · 2024-08-10T15:31:41Z

@alberic89 I made a couple of API changes. Main thing now is that the Algorithm is owned by the implementation instead of the other way round, which lets the implementation provide the public interface and do things like transforming the input from u8 to u21. It also means the implementations can now access the allocator from the algorithm in their methods too, and provide their own inits and deinits.

Sorry about very likely having introduced a merge conflict for you. If you could rebase your work onto this branch and open PRs into this branch, that would be best.

There's another problem that occurred to me: changing from u8 to u21 during the input stage means the traceback can be very off. We'd either need to recalculate the offsets into the u8 array, or insist that the Unicode input is already u21, as I can't think of another way round that.
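For the first option, recalculating offsets, one possible approach is to record the byte offset at which each code point starts while decoding, so a traceback index into the u21 array can be mapped back to the u8 input. This is only a sketch; `codepointOffsets` is a hypothetical helper and not part of fuzzig.

```zig
const std = @import("std");

/// Returns a slice where element i is the byte offset in `utf8` at which
/// code point i starts. The caller owns the returned slice.
fn codepointOffsets(allocator: std.mem.Allocator, utf8: []const u8) ![]usize {
    const view = try std.unicode.Utf8View.init(utf8);
    const offsets = try allocator.alloc(usize, try std.unicode.utf8CountCodepoints(utf8));

    var it = view.iterator();
    var i: usize = 0;
    while (true) : (i += 1) {
        const start = it.i; // byte position before decoding the next code point
        if (it.nextCodepointSlice() == null) break;
        offsets[i] = start;
    }
    return offsets;
}
```

A match reported at code point index k then corresponds to byte `offsets[k]` in the original string, at the cost of one extra usize per code point.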

fjebaker · 2024-08-10T15:35:05Z

I've also had a look at alberic89@bb35c1b; in principle these changes are fine, but the problem is that you now do the cleanup and the allocation on each call to score. One of the reasons I had the preallocation was to avoid having to do memory things during score, so you can init a solver and score 100 different fuzzy finds without having to touch memory. What I'd prefer to do here is keep the current interface, but provide a sscoreResize or just a resize method to let the user make the decision as to whether the buffers are the right size or not, and whether they have the performance budget to check each score's setup or not.

Does that make sense?
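To make the proposal concrete, here is a rough sketch of the shape such an interface could take: `score` never allocates, `resize` is an explicit user decision, and a convenience wrapper checks the size on each call. The `Solver` type, its method names, and the placeholder scoring (counting alphabetic bytes) are all hypothetical, not the final API.

```zig
const std = @import("std");

/// Hypothetical solver holding a preallocated scratch buffer.
const Solver = struct {
    allocator: std.mem.Allocator,
    buffer: []i32,

    fn init(allocator: std.mem.Allocator, max_len: usize) !Solver {
        return .{ .allocator = allocator, .buffer = try allocator.alloc(i32, max_len) };
    }

    fn deinit(self: *Solver) void {
        self.allocator.free(self.buffer);
    }

    /// Explicit resize: the only place where memory is touched after init.
    fn resize(self: *Solver, max_len: usize) !void {
        self.buffer = try self.allocator.realloc(self.buffer, max_len);
    }

    /// Hot path: never allocates. Returns null if the input does not fit.
    fn score(self: *Solver, haystack: []const u8) ?i32 {
        if (haystack.len > self.buffer.len) return null;
        var total: i32 = 0;
        for (haystack, self.buffer[0..haystack.len]) |c, *cell| {
            // Placeholder scoring: one point per alphabetic byte.
            cell.* = if (std.ascii.isAlphabetic(c)) 1 else 0;
            total += cell.*;
        }
        return total;
    }

    /// Convenience path: pays for a size check (and maybe a realloc) per call.
    fn scoreResize(self: *Solver, haystack: []const u8) !?i32 {
        if (haystack.len > self.buffer.len) try self.resize(haystack.len);
        return self.score(haystack);
    }
};
```

Users with a tight performance budget size the buffers once and call `score`; everyone else can use the checking variant.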

fjebaker · 2024-08-10T15:36:36Z

I've added a benchmark target to both the main branch and this one, which you can use to test how it impacts the Ascii fuzzy finder with zig build benchmark. The benchmarks are naive, but give a good coarse estimate.

alberic89 · 2024-08-10T15:38:06Z

> I've also had a look at alberic89@bb35c1b; in principle these changes are fine, but the problem is that you now do the cleanup and the allocation on each call to score. One of the reasons I had the preallocation was to avoid having to do memory things during score, so you can init a solver and score 100 different fuzzy finds without having to touch memory. What I'd prefer to do here is keep the current interface, but provide a sscoreResize or just a resize method to let the user make the decision as to whether the buffers are the right size or not, and whether they have the performance budget to check each score's setup or not.
>
> Does that make sense?

Yes, this solution is better. My solution was just “make it work”, but yours is better for performance. I will look into it.

fjebaker · 2024-08-10T15:41:18Z

Sounds good, and thanks for taking a look at this. I'm not married to any particular solution though, so if you find a better solution, please go for it! I am happy to merge breaking changes provided they are worth it 😎

alberic89 and others added 3 commits August 10, 2024 10:29

Create .gitignore

c73a3b1

Add working Unicode support

fc5e83d

Add a new implementation for Unicode. Add a library to correctly handle Unicode strings. Making non-breaking changes to ASCII implementation.

Merge pull request #1 from alberic89/unicode-implementation

33b305d

Add Unicode support

fjebaker added 5 commits August 10, 2024 16:27

feat: added simple benchmark suite

6ea5cce

chore: cleanup build.zig.zon

e88ba67

chore: rename fields to be consistent + function tables comptime

692ee88

chore: rename implementations

5d77462

alberic89 and others added 15 commits August 10, 2024 17:42

No longer need to know the max size by advance

5250b26

Re-allocate Matrix and buffers in scoreImpl at each call depending of the length of the strings.

Remove allocator field in implementations

4ae5d78

Use instead the allocator provided to the algorithm, so many many many side effects.

Remove the TypeOfCaracter switch

67e3c14

Now, the user SHOULD use []const u8 strings.

make rebase adjustements

2fe86fc

adjust the rebase (not in valid state)

WIP: make adjustement

0c7594d

Not working for now

feat: init category and case unicode data only once

4e998d4

WIP

068c73e

Memory leaks

Solve unicode tool init problem

4db079d

Currently working

Merge remote-tracking branch 'upstream/unicode' into unicode

cbb68f1

fix: convertString of Unicode can fail

b1f437a

Resize Matrix on need

abac757

Let the user choice if he want to resize the matrix and buffer

94f8f87

fix: add errdefer in Unicode init

17903ad

fix: various style and optimization improvements

4aebb97

Merge pull request #5 from alberic89/matrix-resize

ac1d1cd

Matrix resize

fjebaker added 4 commits August 12, 2024 18:22

feat: cleanup unicode support

ad33235

Unicode support only enabled with `-Dunicode` to avoid burdening users with unnecessary dependencies if they only need ASCII. Move all Unicode-related functions to a separate `unicode.zig` which is conditionally included at compile time.

feat: pre-allocated buffer resizing

967df94

Removed `resizeIfNeeded`, since `realloc` effectively does that for us. Exposed the `resize` functions to the public `Ascii` and `Unicode` interfaces to make them available to users.

feat: expose maximum haystack / needle query functions

2215837

Merge branch 'main' into unicode

ed86b30

fjebaker marked this pull request as ready for review August 12, 2024 17:29

ci: add unicode to ci test

3b435d5

fjebaker merged commit 26261a6 into main Aug 12, 2024
1 check passed

fjebaker deleted the unicode branch August 12, 2024 17:29

chore: bump version 0.1.0

30a5f62
