ICU-13219 add -u-dx support to BreakIterator #2702

FrankYFTang · 2023-11-15T01:37:30Z

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-13219
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

FrankYFTang · 2023-11-15T01:37:59Z

@srl295 @eggrobin could you look at the unit test and see whether it fits your understanding of DX?

srl295 · 2023-11-15T06:56:28Z

Suggestion: change the title to -u-dx

I will have to look at the test cases a bit more but it seems like it could work.

Did you see my pr #2676 which has a test case from a minority language?

FrankYFTang · 2023-11-15T19:25:14Z

The tricky part will not be the break behavior within a script, but at the boundary with other characters, or between two scripts that are both listed in -u-dx. For example, say we have -u-dx-thai-laoo and a run of text in Thai script, Lao script, and numbers without any spaces. Would there be any break in that run of text? Should it break at the boundary with the numbers, or at the spot between the Lao script and the Thai script? Or nowhere at all?
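To make the question concrete, the sketch below lists the offsets where the script changes in a Thai + Lao + digits run. It is only an illustration under stated assumptions: it uses the JDK's Character.UnicodeScript (the Script property, not the Script_Extensions property that the [:scx=...:] patterns in this PR use), the class and method names are hypothetical, and it says nothing about what ICU should do at those offsets.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of ICU: finds offsets where the script changes. */
class ScriptRunSketch {
    static List<Integer> scriptBoundaries(String text) {
        List<Integer> boundaries = new ArrayList<>();
        UnicodeScript previous = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript current = UnicodeScript.of(cp);
            // Record the offset whenever the script differs from the previous code point.
            if (previous != null && current != previous) {
                boundaries.add(i);
            }
            previous = current;
            i += Character.charCount(cp);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Three Thai letters, three Lao letters, then ASCII digits, with no spaces.
        String text = "\u0E44\u0E17\u0E22" + "\u0EA5\u0EB2\u0EA7" + "123";
        System.out.println(scriptBoundaries(text));
    }
}
```

The two offsets reported for this input (Thai to Lao, and Lao to digits) are exactly the candidate positions the question is about.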

jira-pull-request-webhook · 2023-11-16T01:51:31Z

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/rbbi_cache.cpp is different
icu4c/source/common/rbbi.cpp is different
icu4c/source/common/unicode/rbbi.h is different
icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIterator.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIteratorFactory.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is now changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is now changed in the branch

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2023-11-16T02:25:03Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2023-11-16T02:31:27Z

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/unicode/brkiter.h is now changed in the branch

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2023-11-16T03:15:13Z

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi_cache.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

~ Your Friendly Jira-GitHub PR Checker Bot

eggrobin

The behaviour matches my understanding of the definition.

FrankYFTang · 2023-11-20T19:05:36Z

Did you see my pr #2676 which has a test case from a minority language?

I tried your proposed diff of icu4c/source/test/testdata/rbbitst.txt, and below is the error I got in my PR. Is your expectation "correct"?

=== Handling test: rbbi/RBBITest/TestExtended: ===
   rbbi {
      RBBITest {
         TestExtended {
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
------------------------------------------------ 4
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 5
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
         isBoundary(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 4, 5
         following(1) incorrect. File line,col= 1538,  10
                 Expected, Actual= 4, 5
         following(2) incorrect. File line,col= 1538,  11
                 Expected, Actual= 4, 5
         following(3) incorrect. File line,col= 1538,  12
                 Expected, Actual= 4, 5
         following(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= 10, 5
         preceding(10) incorrect. File line,col= 1538,  20
                 Expected, Actual= 4, 5
         preceding(9) incorrect. File line,col= 1538,  19
                 Expected, Actual= 4, 5
         preceding(8) incorrect. File line,col= 1538,  18
                 Expected, Actual= 4, 5
         preceding(7) incorrect. File line,col= 1538,  17
                 Expected, Actual= 4, 5
         preceding(6) incorrect. File line,col= 1538,  16
                 Expected, Actual= 4, 5
         preceding(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= 4, 0
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
------------------------------------------------ 12
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 13
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
         isBoundary(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(1) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(2) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(3) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(4) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(5) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(6) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(7) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(8) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(9) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(10) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(11) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= 26, 13
         preceding(28) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(27) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(26) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(25) incorrect. File line,col= 1538,  19
                 Expected, Actual= 12, 13
         preceding(24) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(23) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(22) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(21) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(20) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(19) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(18) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(17) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(16) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(15) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(14) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
      
         } ERRORS (52) in TestExtended (38ms) 
      
   
      } ERRORS (52) in RBBITest (38ms) 
   

   } ERRORS (52) in rbbi (38ms) 


--------------------------------------
Errors in total: 52.
            TestExtended
         RBBITest
      rbbi
   
--------------------------------------

FrankYFTang · 2023-11-20T19:16:37Z

Your diff actually adds only one test case

<data>•โอํน• อะไป •จู่วาม •โล่น•</data>

for line break, but I think it is incorrect.
Why should the line break happen before the space?
It should be

<data>•โอํน •อะไป •จู่วาม •โล่น•</data>

instead, right?

FrankYFTang · 2023-11-20T19:28:43Z

Looking at https://github.com/unicode-org/icu/pull/2676/files#diff-b177067bbc1df57fc40ae7629a81e8df960899b9088555b010680a1c500943e2
Also, the line

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
#<data>•โอํน• •อะไป• •จู่วาม• •โล่น• •เปี่ยร• •อะลู่วาง• •แมะ,• •ปาย• •อัน• •แบ็จ• •อะโจํน• •ซา• •เมาะ.• •อัน• •ฮะบืน• •ตะ• •เวี่ยะ• •ตะ• •งี่ยาน,• •อัน• •ฮะบืน• •อีว• •อะปายฮ.•</data>

should be

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
<data>•โอํน •อะไป •จู่วาม •โล่น •เปี่ยร •อะลู่วาง •แมะ, •ปาย •อัน •แบ็จ •อะโจํน •ซา •เมาะ. •อัน •ฮะบืน •ตะ •เวี่ยะ •ตะ •งี่ยาน, •อัน •ฮะบืน •อีว •อะปายฮ.•</data>

There is no reason to have a line break before the space. A line break should only happen after the SPACE, not before the SPACE, right?
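As a rough illustration of the point about spaces, here is a minimal sketch that lists line break opportunities in a plain ASCII string. It uses the JDK's java.text.BreakIterator rather than ICU, an assumption made only so that the example runs without ICU4J on the classpath; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: collects line break boundaries reported by the JDK iterator. */
class LineBreakAfterSpaceSketch {
    static List<Integer> lineBoundaries(String text) {
        BreakIterator bi = BreakIterator.getLineInstance(Locale.ROOT);
        bi.setText(text);
        List<Integer> out = new ArrayList<>();
        // Walk every boundary from the start of the text to the end.
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // The space sits at offset 2, so offset 3 is the position just after it.
        System.out.println(lineBoundaries("ab cd"));
    }
}
```

The opportunity is reported at the offset after the space and not at the offset before it, which is the shape the corrected test data above expects.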

jira-pull-request-webhook · 2023-11-20T21:15:53Z

Notice: the branch changed across the force-push!

icu4c/source/test/testdata/rbbitst.txt is now changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is now changed in the branch

~ Your Friendly Jira-GitHub PR Checker Bot

FrankYFTang · 2023-11-20T21:19:44Z

@srl295 I copied your test change over but modified it. Please read my modified version in this PR and see whether you agree with it. The changes are:

For line break, there should be no line break before the SPACE.
For word break, the status should be 200, not 0.
For word break, we should break before . and , if we treat the Thai as AL.
Not using dx=zyyy. That part of the spec is very bad. I filed bug https://unicode-org.atlassian.net/browse/CLDR-17247 for that. I do not think we should implement that behavior. It is clearly a spec bug from my point of view.

jira-pull-request-webhook · 2023-11-21T19:42:08Z

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4c/source/test/testdata/rbbitst.txt is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is different

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2023-11-21T21:23:29Z

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2023-11-21T21:29:43Z

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java

+                    // Ask the language object if there are any breaks. It will add them to the cache and
+                    // leave the text pointer on the other side of its range, ready to search for the next one.
+                    if (lbe != null) {
+                        foundBreakCount += lbe.findBreaks(fText, rangeStart, rangeEnd, fBreaks, fPhraseBreaking);


I haven't looked in detail at this, but it appears that this wouldn't catch the case where a character before rangeEnd should be excluded.

so... how is that behavior specified in UTS 35 + UAX 29 + UAX 14?
Could we have a test case for that?

Here's what I mean. The meaning of dx-xxxx is that none of the xxxx characters will be processed by the break iterators.

So say that 't' stands for Thai, and . stands for other characters.

ttttt......ttttttttttt......ttttttt.......

With dx-thai, break iterators must only act on the dots (non-thai)

With your code, the iterator would skip over the first ttttt and start at the first non-Thai (the first dot)

However, I see nothing in the code that would prevent the iterator from continuing at least part-way into the second group of ttttttttttt.

So... notice this part is inside the function
populateDictionary(int startPos, int endPos,...)
If you have the text "ttttt......ttttttttttt......ttttttt......."
then the first "ttttt" is at index 0-5, the second "ttttttttttt" at index 11-22, and the third "ttttttt" at index 28-35.
The upper caller will call this function three times:

first with populateDictionary(0, 5...)

second with populateDictionary(11, 22...)

third with populateDictionary(28, 35...)

The code puts the breaks between 0-5 into fBreaks on the first call, the breaks between 11-22 on the second call, and the breaks between 28-35 on the third call,
and the upper caller will figure out how to break the "..." (non-Thai) runs.

With my change, when we hit any t, if (excludedFromDictionaryBreak(c)) will return true and therefore just advance the iterator to 5 and return out of the loop.

Mark, I added the following to the unit test to show that it does work:

bi->setText(UnicodeString(u"aaอออaaaaaอออ aaaa"));

For line break, the only breaks are 1) at the beginning, 2) between " " and "a", and 3) at the end of the text.
For word break, the only breaks are 1) at the beginning, 2) between "อ" and " ", 3) between " " and "a", and 4) at the end of the text.
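The three-call description above can be modelled with a small run finder. This is a simplified sketch of the description in this comment, not of the actual populateDictionary code: the class name, the method name, and the use of a predicate to stand in for "is a dictionary character" are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical model of how a caller might locate dictionary runs in a text. */
class DictionaryRunSketch {
    /** Returns [start, end) offsets of maximal runs whose characters satisfy the predicate. */
    static List<int[]> dictionaryRuns(String text, IntPredicate isDictionaryChar) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!isDictionaryChar.test(text.charAt(i))) {
                i++; // not a dictionary character, keep scanning
                continue;
            }
            int start = i;
            while (i < text.length() && isDictionaryChar.test(text.charAt(i))) {
                i++;
            }
            runs.add(new int[] {start, i});
        }
        return runs;
    }

    public static void main(String[] args) {
        // 't' stands for Thai and '.' for other characters, as in the example text above.
        String text = "t".repeat(5) + ".".repeat(6) + "t".repeat(11)
                + ".".repeat(6) + "t".repeat(7) + ".".repeat(7);
        for (int[] run : dictionaryRuns(text, c -> c == 't')) {
            System.out.println("populateDictionary(" + run[0] + ", " + run[1] + ", ...)");
        }
    }
}
```

If every 't' were excluded by dx, the predicate would never match and no run would be handed to the dictionary engine, which is the behavior this comment argues for.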

jira-pull-request-webhook · 2023-11-21T21:31:17Z

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/rbbi.cpp is different

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2023-11-23T00:34:31Z

Let me try to be clearer.

Suppose that

The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
The dx values need to cause the characters B and C to be skipped, but have no effect on characters A and D.

AAABBBCCCDDD

What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBBI rules will act on BBBCCCDDD.

From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.

What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.

FrankYFTang · 2023-11-23T00:51:28Z

I see. OK, you are right, that is not clear. I need to call excludedFromDictionaryBreak to adjust rangeStart and rangeEnd before passing them to findBreaks. The range passed to findBreaks may need to be changed to different values that exclude these characters.
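One possible shape for that adjustment, sketched under assumptions: the helper below splits a candidate dictionary range into sub-ranges that contain no excluded character, so that only those sub-ranges would be handed to the dictionary engine. The class and method names are hypothetical, and the predicate merely stands in for excludedFromDictionaryBreak; this is not the code in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch: removes dx-excluded characters from a dictionary range. */
class ExclusionRangeSketch {
    /** Splits [rangeStart, rangeEnd) into sub-ranges that contain no excluded character. */
    static List<int[]> withoutExcluded(String text, int rangeStart, int rangeEnd,
                                       IntPredicate excluded) {
        List<int[]> kept = new ArrayList<>();
        int start = -1; // -1 means we are not currently inside a kept sub-range
        for (int i = rangeStart; i < rangeEnd; i++) {
            boolean skip = excluded.test(text.charAt(i));
            if (!skip && start < 0) {
                start = i;                       // a kept sub-range begins here
            } else if (skip && start >= 0) {
                kept.add(new int[] {start, i});  // the kept sub-range ends before i
                start = -1;
            }
        }
        if (start >= 0) {
            kept.add(new int[] {start, rangeEnd});
        }
        return kept;
    }

    public static void main(String[] args) {
        // The dictionary engine accepts A and B, so its candidate range is [0, 6).
        // The dx value excludes B and C.
        String text = "AAABBBCCCDDD";
        for (int[] r : withoutExcluded(text, 0, 6, c -> c == 'B' || c == 'C')) {
            System.out.println("findBreaks range: [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

For the AAABBBCCCDDD example this leaves only the AAA range for the dictionary engine, matching what the review comment says should happen.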

srl295 · 2023-11-23T01:11:10Z

@FrankYFTang my apologies, I have not found spare time to review this recently. I will loop in @mhosken in case he's able to review.

It seems it's going in a good direction… certainly feel free to amend my test as needed (it was meant as an example, not as prescriptive) and close the other PR…

macchiati · 2023-11-30T17:55:56Z

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java

+                throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
+            }
+            String script = dxs.substring(i*5, i*5+4);
+            // Special handling of zyyy


Change this. Right after the length check, see if the entire dxValues value equals (case insensitive) "-zyyy". If so, return UnicodeSet.ALL_CODE_POINTS (everything, might want a static constant).

Otherwise, there is no special zyyy handling

Needs a test case also.

But we need to take care of cases like "en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy" too, right?

Yes; the CLDR ticket clarifying dx was accepted for CLDR v44.1 (you are a watcher)

My point is that "if the entire dxValues value equals (case insensitive) "-zyyy"" is not good enough, because we may have
"en-u-dx-thai-zyyy" or "en-u-dx-thai-hani-zyyy", where the type is "thai-hani-zyyy", not just "zyyy".

Oh. I just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files; that makes sense. Sorry. Ignore my previous comments.

macchiati · 2023-11-30T18:00:34Z

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java

+        // For example, if the locale is "en-u-dx-abc-defgh", dxs is "abc-defgh"
+        // and builder.toString() return "[[:scx=abc-:][:scx=efgh:]]" and causes
+        // UnicodeSet constructor to throw IllegalArgumentException
+        return new UnicodeSet(builder.toString());


Freeze the UnicodeSet: add .freeze(). That makes it immutable and faster.
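The parsing discussed in these review threads can be approximated without ICU4J. The sketch below is an assumption-laden illustration, not the PR's code: the class and method names are hypothetical, the append and closing-bracket steps are inferred from the "thai-arab" to "[[:scx=thai:][:scx=arab:]]" description in the diff, and the whole-value zyyy check compares against "zyyy" without a leading hyphen because the diff comment shows dxs without one.

```java
/** Hypothetical sketch of turning a dx value into a UnicodeSet pattern string. */
class DxPatternSketch {
    /** True when the entire dx value is "zyyy" (case-insensitive), per the review suggestion. */
    static boolean excludesEverything(String dxs) {
        return dxs != null && dxs.equalsIgnoreCase("zyyy");
    }

    /** Builds "[[:scx=thai:][:scx=arab:]]" from "thai-arab"; rejects malformed values. */
    static String toScxPattern(String dxs) {
        if (dxs == null) {
            return null;
        }
        // Valid values are 4-letter codes joined by '-', so the length is 4, 9, 14, ...
        if (dxs.length() % 5 != 4) {
            throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
        }
        StringBuilder builder = new StringBuilder("[");
        int items = 1 + dxs.length() / 5;
        for (int i = 0; i < items; i++) {
            if (i > 0 && dxs.charAt(i * 5 - 1) != '-') {
                throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
            }
            builder.append("[:scx=").append(dxs, i * 5, i * 5 + 4).append(":]");
        }
        return builder.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(toScxPattern("thai-arab"));
        System.out.println(excludesEverything("thai-zyyy"));
    }
}
```

In ICU4J the resulting string would presumably be passed to new UnicodeSet(pattern) and then frozen with .freeze(), as the review comment suggests; with the hyphen check in place, a value such as "abc-defgh" is rejected before any pattern is built.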

macchiati · 2023-12-06T22:58:56Z

Np, your feedback is great for catching problems!

…

On Wed, Dec 6, 2023 at 1:16 PM Frank Yung-Fong Tang wrote:

In icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java:

+            String dxs) {
+        if (dxs == null) {
+            return null;
+        }
+        if (dxs.length() % 5 != 4) {
+            throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
+        }
+        // Change from "thai" to "[[:scx=thai:]]" or "thai-arab" to "[[:scx=thai:][:scx=arab:]]"
+        StringBuilder builder = new StringBuilder("[");
+        int items = 1 + (dxs.length() / 5);
+        for (int i = 0; i < items; i++) {
+            if (i > 0 && dxs.charAt(i*5-1) != '-') {
+                throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
+            }
+            String script = dxs.substring(i*5, i*5+4);
+            // Special handling of zyyy

oh. Just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files that make sense. Sorry. Ignore my previous comments.

FrankYFTang · 2023-12-12T01:25:28Z

Please ignore my update. I am still working on this PR. It is not ready for review. After Mark pointed out some issues, I found that my design was wrong and needs more intensive rework.

jira-pull-request-webhook · 2023-12-12T01:25:47Z

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4c/source/test/intltest/rbbitst.cpp is different
icu4c/source/test/intltest/rbbitst.h is different
icu4j/main/core/src/main/java/com/ibm/icu/impl/breakiter/CjkBreakEngine.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is different

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati

Fixes the problems I noted

mhosken · 2024-04-15T10:36:28Z

I'm getting user complaints again on this. Can we action this? Some fix for disabling dictionary breaking has been requested since, well, I can't find out since when, because I can't get to the old bug tracker, but it came into the latest tracker in 2019.

Perhaps the best is the enemy of the good here? The only people that I know of who are affected by this are those who use minority languages, insert ZWSP for word breaks, and deal with correctly tagged text. Do we have to refine this fix for the non-use cases as well before we can fix the actual use case?

I'm sorry that my frustration is showing. But we seem to be more concerned about people who do the wrong thing than those who do the right thing (and tag correctly, by some definition). The really correct solution is that if the text is not tagged with the language of the dictionary, then no dictionary breaking should occur. I realise that that is just too much for most people and so we have special tagging. But can we please get something out for these users who are able to tag correctly?

Please shipit already.

srl295 · 2024-04-15T14:53:37Z

@FrankYFTang is this going to be merged for 75?

FrankYFTang · 2024-04-15T22:13:24Z

No, the issue is more complicated than what my PR handles.

srl295 · 2024-04-16T00:04:12Z

Do you have more detail?

FrankYFTang · 2024-04-16T22:51:03Z

It requires more detailed analysis and testing of different cases than what I put into this PR. I missed some complicated combinations.

srl295 · 2024-08-09T02:18:53Z

Frank,
It's opt-in. Could we consider moving forward with this even if it needs additional work? It shouldn't be a problem for users who don't set -u-dx.

FrankYFTang · 2024-08-12T18:35:56Z

sorry, I need to pick up the work again.

ICU-13219 Fix ICU-13219 add -u-dx- support to BreakIterator ICU-13219 Fix ICU-13219 update

jira-pull-request-webhook · 2024-10-21T23:05:24Z

Notice: the branch changed across the force-push!

icu4c/source/test/intltest/rbbitst.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

~ Your Friendly Jira-GitHub PR Checker Bot

FrankYFTang changed the title ~~ICU-13219 add DX support to BreakIterator~~ ICU-13219 add -u-dx support to BreakIterator on Nov 15, 2023

FrankYFTang force-pushed the ICU-13219-DX branch from 1266364 to 862aefd on November 16, 2023 01:51

FrankYFTang force-pushed the ICU-13219-DX branch from 862aefd to 7466a67 on November 16, 2023 02:24

FrankYFTang force-pushed the ICU-13219-DX branch from 7466a67 to b032026 on November 16, 2023 02:31

FrankYFTang force-pushed the ICU-13219-DX branch from b032026 to a7b3d95 on November 16, 2023 03:15

FrankYFTang requested review from srl295, eggrobin and aheninger on November 16, 2023 22:52

eggrobin previously approved these changes on Nov 20, 2023

FrankYFTang dismissed eggrobin’s stale review via 3cb8d52 on November 20, 2023 21:15

FrankYFTang force-pushed the ICU-13219-DX branch from a7b3d95 to 3cb8d52 on November 20, 2023 21:15

FrankYFTang requested a review from eggrobin on November 20, 2023 21:40

eggrobin requested changes on Nov 21, 2023

icu4c/source/common/rbbi.cpp (outdated)

FrankYFTang force-pushed the ICU-13219-DX branch from 3cb8d52 to f64fe10 on November 21, 2023 19:42

FrankYFTang requested a review from eggrobin on November 21, 2023 19:42

FrankYFTang force-pushed the ICU-13219-DX branch from f64fe10 to 8f8766a on November 21, 2023 21:23

macchiati requested changes on Nov 21, 2023

FrankYFTang force-pushed the ICU-13219-DX branch from 8f8766a to 9cf2b52 on November 21, 2023 21:31

FrankYFTang requested a review from macchiati on November 22, 2023 00:38

markusicu assigned macchiati on Nov 30, 2023

macchiati requested changes on Nov 30, 2023

FrankYFTang added the "incomplete" label (Needs work; do not approve/merge as is.) on Dec 12, 2023

FrankYFTang force-pushed the ICU-13219-DX branch from 6868938 to 2dfa13c on December 12, 2023 01:25

macchiati previously approved these changes on Dec 12, 2023

FrankYFTang dismissed macchiati’s stale review via 334bd42 on October 21, 2024 22:55

ICU-13219 add -u-dx- support to BreakIterator

ba1260f

FrankYFTang force-pushed the ICU-13219-DX branch from 334bd42 to ba1260f on October 21, 2024 23:05
