Result difference between uniseg.GraphemeClusterCount and Android android.icu.text.BreakIterator #60

Open
munpf opened this issue Dec 3, 2024 · 4 comments

@munpf

munpf commented Dec 3, 2024

Hello!

For the text below, the two libraries return different results. I'm not sure which one is correct. Could you help confirm whether the uniseg result is expected?

Thank you!


Text:
क्‍ष
Its Unicode code points are:
"\u0915\u094d\u200d\u0937"

uniseg.GraphemeClusterCount Result:
2

android.icu.text.BreakIterator Result:
1
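
To double-check that the escaped string above really contains the four code points listed, a small standard-library sketch like the following can help. The helper name `codePoints` is made up for illustration and is not part of uniseg or ICU.

```go
package main

import (
	"fmt"
	"strings"
)

// codePoints returns the U+XXXX notation of every code point in s,
// separated by spaces. (Hypothetical helper, for illustration only.)
func codePoints(s string) string {
	parts := make([]string, 0, len(s))
	for _, r := range s {
		parts = append(parts, fmt.Sprintf("U+%04X", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	text := "\u0915\u094d\u200d\u0937"
	fmt.Println(codePoints(text)) // prints "U+0915 U+094D U+200D U+0937"
}
```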

@shogo82148
Contributor

What version of uniseg and android.icu do you use?

@rivo
Owner

rivo commented Dec 3, 2024

So the codepoints you posted are assigned the following grapheme break properties according to this table:

  1. \u0915 = Any
  2. \u094d = Extend
  3. \u200d = Zero-Width Joiner
  4. \u0937 = Any

The way I see it, the following rules apply:

  • GB9 "Do not break before extending characters or ZWJ.": This applies to the sequence (1) (2) (3).
  • GB999 "Otherwise, break everywhere.": This applies to the sequence (3) (4).

So basically, \u0937 is its own character according to those rules and must be counted separately. Unless I'm missing something, these are two separate characters, so the count is 2. Incidentally, in Chrome, where I'm writing this comment, I can select each character individually, too. So at least Chrome seems to agree.
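
The reasoning above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the property lookup `propertyOf` is hand-written for the four code points in question (it is not the Unicode property table and not uniseg's implementation), and only GB9 and GB999 are modelled. Rules outside these two, including any added in other Unicode versions, are not covered, so the result only reflects the argument as written here.

```go
package main

import "fmt"

// prop is a simplified grapheme break property, limited to the values
// that appear in the list above.
type prop int

const (
	propAny prop = iota
	propExtend
	propZWJ
)

// propertyOf is a hand-written lookup for the code points in this issue.
// Assumption for illustration: everything else is treated as "Any".
func propertyOf(r rune) prop {
	switch r {
	case 0x094D:
		return propExtend
	case 0x200D:
		return propZWJ
	default:
		return propAny
	}
}

// countClusters applies only GB9 (do not break before Extend or ZWJ)
// and GB999 (otherwise, break everywhere).
func countClusters(s string) int {
	count := 0
	for _, r := range s {
		p := propertyOf(r)
		if count > 0 && (p == propExtend || p == propZWJ) {
			continue // GB9: stays in the current cluster
		}
		count++ // start of text, or GB999: a new cluster begins
	}
	return count
}

func main() {
	fmt.Println(countClusters("\u0915\u094d\u200d\u0937")) // prints 2
}
```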

I have no experience with android.icu so I can't comment on its functionality.

@munpf
Author

munpf commented Dec 4, 2024

> What version of uniseg and android.icu do you use?

The uniseg version is 0.4.7.
android.icu has no separate version. It is a built-in library, copied from com.ibm.icu.text.

@munpf
Author

munpf commented Dec 9, 2024

@shogo82148 I searched and found that Android may use the official Java implementation of ICU. Is there any difference between the uniseg implementation and standard ICU?

  • ICU4J and Android both use RuleBasedBreakIterator to iterate over the text.

In addition, I found that Android seems to treat multiple consecutive "\n" characters as one grapheme cluster, while uniseg treats them as multiple clusters.
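
For reference, a minimal sketch of how the UAX #29 rules GB3 (no break between CR and LF) and GB4/GB5 (break after and before controls, CR, and LF) would count such input. It assumes the string contains only "\r" and "\n"; the function name `countNewlineClusters` is made up for illustration, and this does not reproduce either library's actual code.

```go
package main

import "fmt"

// countNewlineClusters counts grapheme clusters in a string that is
// assumed to consist only of CR ('\r') and LF ('\n') characters.
// GB3 keeps a CR LF pair together; GB4/GB5 break everywhere else.
func countNewlineClusters(s string) int {
	count := 0
	var prev rune
	for _, r := range s {
		if prev == '\r' && r == '\n' {
			prev = r
			continue // GB3: LF joins the preceding CR
		}
		count++ // GB4/GB5: every other CR or LF is its own cluster
		prev = r
	}
	return count
}

func main() {
	fmt.Println(countNewlineClusters("\n\n\n")) // prints 3
	fmt.Println(countNewlineClusters("\r\n"))   // prints 1
}
```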
