Incorrect results for string functions on args with higher-plane characters #719

Open
beyackle2 opened this issue Oct 3, 2022 · 3 comments


@beyackle2

Some functions improperly treat strings that contain higher-plane Unicode characters (i.e., those with code points at U+10000 and higher, including most emoji) as if each such character were two separate characters, each of which is unintelligible on its own.

Split("🦓🦊🐺𐊀","")

Table({Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"},{Value:"�"})
(Should be Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"}))

Left("xyz🅰🅱🅲", 6)

"xyz🅰�" (note the ending character which is apparently half of the 🅱 emoji; this should just evaluate to the same string as was passed in)

Similarly, Right("🅰🅱🅲def", 6) returns "�🅲def", and Mid("🅰🅱🅲def", 2, 4) returns "�🅱�".

I believe something is going awry with how PowerFx handles characters wider than 16 bits: strings aren't being kept in a consistent Unicode transformation format, which leads to these errors.
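To illustrate the suspected cause, here is a minimal JavaScript sketch that contrasts indexing by UTF-16 code unit with indexing by code point. The helper names (`splitByUnit`, `leftByCodePoint`, and so on) are hypothetical and are not PowerFx internals; they only mimic the reported behavior and what I'd expect instead.

```javascript
// Hypothetical helpers (not PowerFx code): contrast code-unit indexing
// with code-point indexing on a UTF-16 string.

// Code-unit versions: these cut surrogate pairs in half.
const splitByUnit = (s) => s.split("");
const leftByUnit = (s, n) => s.slice(0, n);

// Code-point versions: Array.from uses the string iterator, which
// yields whole code points, so surrogate pairs stay together.
const splitByCodePoint = (s) => Array.from(s);
const leftByCodePoint = (s, n) => Array.from(s).slice(0, n).join("");
// start is 1-based, mirroring Mid(text, start, count).
const midByCodePoint = (s, start, count) =>
  Array.from(s).slice(start - 1, start - 1 + count).join("");

const animals = "🦓🦊🐺𐊀";
console.log(splitByUnit(animals).length);      // 8 lone surrogates
console.log(splitByCodePoint(animals).length); // 4 whole characters

const letters = "xyz🅰🅱🅲";
console.log(leftByUnit(letters, 6).length);           // 6 code units, ending in half of 🅱
console.log(leftByCodePoint(letters, 6) === letters); // true
```

The code-point versions are enough for the examples in this report, but they would still break apart sequences that are built from several code points.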

@marclundgren

Most (all?) emojis are 2 characters rather than one, according to JavaScript.

'🌻'.length // 2

@beyackle2
Author

In a UTF-16 string, which is what JS uses internally, it does take two 16-bit "characters" (code units) to make a single code point from a higher plane. It can be even more than that: an emoji like 👩🏾‍💻, which is formed from a base character, a skin-tone modifier, a zero-width joiner (ZWJ), and another emoji, is seven 16-bit "characters" wide (the base, skin tone, and following emoji count as 2 each, and the ZWJ adds 1), yet it consists of 4 real code points and displays as a single unit. I'm arguing that PowerFx should use the intuitive representation here: both 🌻 and 👩🏾‍💻 should be treated as 1 character each, both to conform to what a user would expect functions like Split or Left to do and to prevent characters from being improperly split.
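As a rough check of those counts, the sketch below measures a string at three levels: UTF-16 code units, code points, and user-perceived characters (grapheme clusters). The helper names are hypothetical; `Intl.Segmenter` is the standard ECMAScript internationalization API and is assumed to be available in the runtime (for example, a recent Node.js or browser).

```javascript
// Hypothetical helpers: count a string at three different levels.
const countUnits = (s) => s.length;                   // UTF-16 code units
const countCodePoints = (s) => Array.from(s).length;  // Unicode code points
const countGraphemes = (s) => {
  // Assumes Intl.Segmenter exists in this runtime.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(s)).length;     // user-perceived characters
};

const sunflower = "\u{1F33B}"; // 🌻
// Woman + medium-dark skin tone + zero-width joiner + laptop
const technologist = "\u{1F469}\u{1F3FE}\u200D\u{1F4BB}"; // 👩🏾‍💻

console.log(countUnits(sunflower), countCodePoints(sunflower), countGraphemes(sunflower));
// 2 1 1
console.log(countUnits(technologist), countCodePoints(technologist), countGraphemes(technologist));
// 7 4 1
```

Only the grapheme count matches the "1 character each" behavior argued for here; counting by code point would still treat 👩🏾‍💻 as 4 characters.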

@MikeStall
Contributor

MikeStall commented Oct 14, 2022

A workaround is to split on a different character, such as a comma in between items:

Split("🦓,🦊,🐺,𐊀",",")

Returns:
Table({Value:"🦓"},{Value:"🦊"},{Value:"🐺"},{Value:"𐊀"})
