can we coalesce quotation mark CE lists into single CEs? #927

Open
markusicu opened this issue Aug 26, 2024 · 4 comments
@markusicu
Member

markusicu commented Aug 26, 2024

I remember that I added some logic for the CLDR version of the default sort order to coalesce some adjacent CEs, absorbing ignorable CEs into their main CEs. Look for how that works for things like sharp s, or generally look for existing differences between allkeys_CLDR.txt and allkeys_DUCET.txt, and see whether we can turn this

0027  ; [*0337.0020.0002] # APOSTROPHE
FF07  ; [*0337.0020.0003] # FULLWIDTH APOSTROPHE
2018  ; [*0337.0020.0004][.0000.011E.0004] # LEFT SINGLE QUOTATION MARK
2019  ; [*0337.0020.0004][.0000.011F.0004] # RIGHT SINGLE QUOTATION MARK

into something like this

0027  ; [*0337.0020.0002] # APOSTROPHE
FF07  ; [*0337.0020.0003] # FULLWIDTH APOSTROPHE
2018  ; [*0337.0021.0002] # LEFT SINGLE QUOTATION MARK
2019  ; [*0337.0022.0002] # RIGHT SINGLE QUOTATION MARK
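One way to read the transformation proposed above, as a minimal Python sketch. This is not the logic in the Unicode Tools; the helper names (`parse_line`, `is_mergeable`, `coalesce`, `format_ces`) are hypothetical. It assumes that the merged secondary weight is the rank of the CE list's secondary sequence among all sequences sharing the same primary, counted up from the lowest secondary, and that the tertiary weight resets to the common 0002 as in the example. It does not check for collisions with other secondary weights or for distinct tertiary sequences.

```python
import re
from collections import defaultdict

CE_RE = re.compile(r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")
COMMON_TERTIARY = 0x0002  # assumption: merged CEs get the common tertiary

LINES = [
    "0027  ; [*0337.0020.0002] # APOSTROPHE",
    "FF07  ; [*0337.0020.0003] # FULLWIDTH APOSTROPHE",
    "2018  ; [*0337.0020.0004][.0000.011E.0004] # LEFT SINGLE QUOTATION MARK",
    "2019  ; [*0337.0020.0004][.0000.011F.0004] # RIGHT SINGLE QUOTATION MARK",
]

def parse_line(line):
    """Parse one allkeys-style line into (code point, list of CEs)."""
    code, rest = line.split(";", 1)
    ce_text = rest.split("#", 1)[0]
    ces = [(flag, int(p, 16), int(s, 16), int(t, 16))
           for flag, p, s, t in CE_RE.findall(ce_text)]
    return code.strip(), ces

def is_mergeable(ces):
    """True for a primary CE followed only by CEs with a zero primary."""
    return (len(ces) > 0 and ces[0][1] != 0
            and all(p == 0 and s != 0 for _, p, s, _ in ces[1:]))

def coalesce(mappings):
    """Fold trailing secondary CEs into the leading primary CE."""
    # Collect the distinct secondary sequences seen for each primary.
    sequences = defaultdict(set)
    for ces in mappings.values():
        if is_mergeable(ces):
            sequences[ces[0][1]].add(tuple(ce[2] for ce in ces))
    result = {}
    for code, ces in mappings.items():
        if not is_mergeable(ces) or len(ces) == 1:
            result[code] = ces  # nothing to merge
            continue
        flag, primary = ces[0][0], ces[0][1]
        ranked = sorted(sequences[primary])
        rank = ranked.index(tuple(ce[2] for ce in ces))
        result[code] = [(flag, primary, ranked[0][0] + rank, COMMON_TERTIARY)]
    return result

def format_ces(ces):
    return "".join(f"[{flag}{p:04X}.{s:04X}.{t:04X}]" for flag, p, s, t in ces)

merged = coalesce(dict(parse_line(line) for line in LINES))
for code, ces in merged.items():
    print(code, ";", format_ces(ces))
```

Ranking whole secondary sequences keeps the relative order of the single-CE and multi-CE mappings, since a sequence that is a prefix of another sorts first.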

I found the coalescing code, and I had misremembered where I put it. It's in MappingsForFractionalUCA.java, in modifyMappings(), under the comment "// Check and merge secondary CEs".

It does not modify the "UCA" mappings. It only modifies intermediate mappings that turn into FractionalUCA.txt mappings. I verified that allkeys_CLDR.txt and allkeys_DUCET.txt have the same number of non-initial ignorable CEs. And FractionalUCA.txt shows the merged byte-based CEs:

0027; [09 6E, 05, 05]	# Zyyy Po	[0337.0020.0002]	* APOSTROPHE
FF07; [09 6E, 05, 20]	# Zyyy Po	[0337.0020.0003]	* FULLWIDTH APOSTROPHE
2018; [09 6E, 70, 05]	# Zyyy Pi	[0337.0020.0004][0000.011E.0004]	* LEFT SINGLE QUOTATION MARK
2019; [09 6E, 73, 05]	# Zyyy Pf	[0337.0020.0004][0000.011F.0004]	* RIGHT SINGLE QUOTATION MARK
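As a quick sanity check on the four FractionalUCA.txt lines above, the following Python sketch compares the order implied by the merged byte-based CEs with the order implied by the original CE lists in the comments. The helper names are hypothetical, the parser only handles mappings with a single fractional CE, and the UCA key assumes non-ignorable handling with nonzero weights gathered level by level.

```python
import re

UCA_CE = re.compile(r"\[[.*]?([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")

FRACTIONAL_LINES = [
    "0027; [09 6E, 05, 05]\t# Zyyy Po\t[0337.0020.0002]\t* APOSTROPHE",
    "FF07; [09 6E, 05, 20]\t# Zyyy Po\t[0337.0020.0003]\t* FULLWIDTH APOSTROPHE",
    "2018; [09 6E, 70, 05]\t# Zyyy Pi\t[0337.0020.0004][0000.011E.0004]"
    "\t* LEFT SINGLE QUOTATION MARK",
    "2019; [09 6E, 73, 05]\t# Zyyy Pf\t[0337.0020.0004][0000.011F.0004]"
    "\t* RIGHT SINGLE QUOTATION MARK",
]

def parse_fractional_line(line):
    """Return (code point, fractional CE as three byte strings, UCA CEs)."""
    mapping, comment = line.split("#", 1)
    code, ce_text = mapping.split(";", 1)
    inner = ce_text.strip()[1:-1]  # drop the surrounding brackets
    fractional = tuple(bytes.fromhex(part.strip()) for part in inner.split(","))
    uca = [tuple(int(w, 16) for w in match) for match in UCA_CE.findall(comment)]
    return code.strip(), fractional, uca

def uca_key(ces):
    """Non-ignorable sort key: nonzero weights, one tuple per level."""
    return tuple(tuple(ce[level] for ce in ces if ce[level]) for level in range(3))

entries = [parse_fractional_line(line) for line in FRACTIONAL_LINES]
shuffled = entries[::-1]  # reverse first so that sorting does real work

by_uca = [code for code, _, _ in sorted(shuffled, key=lambda e: uca_key(e[2]))]
by_fractional = [code for code, _, _ in sorted(shuffled, key=lambda e: e[1])]
print(by_uca)
print(by_fractional)
```

For these four mappings both orders agree, which is consistent with the merged fractional CEs preserving the order of the two-CE originals.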

The code includes comments about the modified mappings not being well-formed. It should be possible to make them well-formed, since the resulting FractionalUCA mappings are well-formed.

If we wanted to, we could then try to move this logic up one or two levels:

  1. up into the "UCA" object and its mappings, and thus visible in allkeys_CLDR.txt and allkeys_DUCET.txt
  2. further up into the C sifter code

Either way, the FractionalUCA generator would need to be adjusted for working with non-ignorable CEs having non-default secondary weights.

markusicu added the uca label Aug 26, 2024
@macchiati
Member

Looks reasonable. From what you wrote here, it looks like there aren't any characters in the second case between FF07 and 2018. Is that still true with your change?

@markusicu
Member Author

This is the case according to the allkeys_CLDR.txt file which is in sorted order.
I have to remind myself what the code looks like that I thought would do this, and see what's different from a case like sharp s.
Anyway, this is just a drive-by thought that I wanted to jot down. The real work for today is #926 :-)

@macchiati
Member

macchiati commented Aug 27, 2024 via email

@markusicu
Member Author

> I thought it was sorted by shifted values ... not a real sort.

The real UCA allkeys.txt is sorted with something like alternate=shifted (not sure if that's completely true, and I think it might sort with strength=tertiary, dropping the shifted primaries, making ignorable characters come out in a somewhat random order).

The allkeys_CLDR.txt and allkeys_DUCET.txt that the Unicode Tools generate are sorted with alternate=non-ignorable.
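To illustrate the difference between the two settings, here is a minimal Python sketch that builds tertiary-strength sort keys for the four quotation-mark mappings from the issue description. It assumes the UTS #10 "shifted" behavior, where variable CEs and the ignorable CEs that follow them contribute nothing at levels 1 to 3, and it omits the fourth level entirely. The function names are hypothetical.

```python
import re

CE_RE = re.compile(r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]")

ALLKEYS = {
    "0027": "[*0337.0020.0002]",
    "FF07": "[*0337.0020.0003]",
    "2018": "[*0337.0020.0004][.0000.011E.0004]",
    "2019": "[*0337.0020.0004][.0000.011F.0004]",
}

def parse_ces(text):
    """Return CEs as (is_variable, primary, secondary, tertiary)."""
    return [(flag == "*", int(p, 16), int(s, 16), int(t, 16))
            for flag, p, s, t in CE_RE.findall(text)]

def sort_key(ces, alternate):
    """Tertiary-strength key; alternate is 'shifted' or 'non-ignorable'."""
    levels = ([], [], [])
    after_variable = False
    for is_variable, p, s, t in ces:
        if alternate == "shifted":
            if is_variable:
                after_variable = True
                continue  # variable CE: dropped at levels 1 to 3
            if p == 0 and after_variable:
                continue  # ignorable CE following a variable CE: dropped too
            after_variable = False
        for level, weight in zip(levels, (p, s, t)):
            if weight:
                level.append(weight)
    return tuple(tuple(level) for level in levels)

keys_shifted = {c: sort_key(parse_ces(t), "shifted") for c, t in ALLKEYS.items()}
keys_non_ignorable = {c: sort_key(parse_ces(t), "non-ignorable")
                      for c, t in ALLKEYS.items()}

# Reverse the input first so that sorting does real work.
order_non_ignorable = sorted(reversed(list(ALLKEYS)), key=keys_non_ignorable.get)
print(order_non_ignorable)
print(set(keys_shifted.values()))
```

Under these assumptions all four characters get the same (empty) key with shifted handling at tertiary strength, so their relative order in a file sorted that way is arbitrary, while non-ignorable handling gives each one a distinct key.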


FYI: I found the coalescing code, and I amended the issue description above a few minutes ago.
