
Unicode Compatibility #2

Open
Varriount opened this issue Oct 3, 2013 · 4 comments

@Varriount

To the best of my knowledge, all string types used throughout the library are standard ASCII character types. Since this is a library based on string manipulation, we should probably figure out a way to use UTF-8 (Unicode) or UCS (wide-character) strings instead.

According to this site, using UTF-8 for Unicode strings would probably be the best solution, as it is already compatible with ASCII encodings and is more memory-efficient than wide characters.

@arrbee
Owner

arrbee commented Oct 3, 2013

My plan regarding Unicode is that the core library should actually be treating data as raw bytes, not ASCII characters. When it finds spans of differing data, those represent spans of bytes. The core diff algorithm shouldn't have to know about UTF-8 or UTF-16 (or any of the UCS stuff, etc.). Once the byte differences are calculated, a second pass through the generated diff should be able to adjust the diff boundaries to account for the format of the data.

A simple example

Say you had two UTF-16 texts:

0053 0061 006D 0070 006C 0065 == "Sample"
0053 006F 006D 0065 0020 0054 0065 0078 0074 == "Some Text"

The raw byte diff might look like:

0: common: 00 53 00
1: delete: 61
2: insert: 6F
3: common: 00 6D 00
4: delete: 70 00 6C
5: insert: 65 00 20 00 54
6: common: 00 65
7: insert: 00 78 00 74

We would then apply a "UTF-16 filter" to the diff which knows how to adjust diff spans to 16-bit boundaries. The first common span (i.e. line 0) has an odd number of bytes, so the filter would remove one byte from the end of the common span and insert it into both of the following deleted / inserted spans (i.e. 1 and 2). Likewise, span 3 ends misaligned, so we would also trim a byte off its end.

The general algorithm would be to walk the list of spans looking for ranges that end misaligned with a UTF-16 boundary. When we find a misaligned span ending, we either push a byte off the end of the span if it is common data, or we pull a byte from the following common span if it is inserted or deleted data. In this case, the end result after the UTF-16 filter would be:

0: common: 00 53
1: delete: 00 61
2: insert: 00 6F
3: common: 00 6D
4: delete: 00 70 00 6C
5: insert: 00 65 00 20 00 54
6: common: 00 65
7: insert: 00 78 00 74

The only real UTF-16 error case would be if a raw text does not contain an even number of bytes.
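
In code, the walk could look something like the sketch below. The span_t type, its fixed inline buffers, and the array-of-spans layout are all inventions for this example (the real dmp_diff representation will differ), and it assumes the delete/insert spans between two common spans become misaligned together, which holds when both inputs are valid even-length UTF-16:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { SPAN_COMMON, SPAN_DELETE, SPAN_INSERT } span_op;

typedef struct {
    span_op op;
    uint8_t bytes[16];  /* inline storage, assumed large enough here */
    size_t  len;
} span_t;

/* Prepend byte b to a span (capacity checks omitted in this sketch). */
static void push_front(span_t *s, uint8_t b)
{
    memmove(s->bytes + 1, s->bytes, s->len);
    s->bytes[0] = b;
    s->len++;
}

static void utf16_realign(span_t *sp, size_t n)
{
    size_t off_old = 0, off_new = 0;  /* byte offsets into old/new text */

    for (size_t i = 0; i < n; i++) {
        span_t *s = &sp[i];
        if (s->op == SPAN_COMMON) {
            /* Common span ends mid code unit: push its last byte into
               every delete/insert span that follows (spans 1 and 2 in
               the example above). */
            if (((off_old + s->len) & 1) && s->len > 0) {
                uint8_t b = s->bytes[--s->len];
                for (size_t j = i + 1; j < n && sp[j].op != SPAN_COMMON; j++)
                    push_front(&sp[j], b);
            }
            off_old += s->len;
            off_new += s->len;
        } else {
            size_t *off = (s->op == SPAN_DELETE) ? &off_old : &off_new;
            /* Delete/insert span ends mid code unit: pull the first byte
               of the next common span; each misaligned span in the run
               copies the same byte and the last one consumes it. */
            if ((*off + s->len) & 1) {
                size_t j = i + 1;
                while (j < n && sp[j].op != SPAN_COMMON) j++;
                if (j < n && sp[j].len > 0) {
                    s->bytes[s->len++] = sp[j].bytes[0];
                    if (j == i + 1)
                        memmove(sp[j].bytes, sp[j].bytes + 1, --sp[j].len);
                }
            }
            *off += s->len;
        }
    }
}

int main(void)
{
    /* The raw byte diff of "Sample" vs "Some Text" from above. */
    span_t d[8] = {
        { SPAN_COMMON, { 0x00, 0x53, 0x00 }, 3 },
        { SPAN_DELETE, { 0x61 }, 1 },
        { SPAN_INSERT, { 0x6F }, 1 },
        { SPAN_COMMON, { 0x00, 0x6D, 0x00 }, 3 },
        { SPAN_DELETE, { 0x70, 0x00, 0x6C }, 3 },
        { SPAN_INSERT, { 0x65, 0x00, 0x20, 0x00, 0x54 }, 5 },
        { SPAN_COMMON, { 0x00, 0x65 }, 2 },
        { SPAN_INSERT, { 0x00, 0x78, 0x00, 0x74 }, 4 },
    };
    utf16_realign(d, 8);
    for (size_t i = 0; i < 8; i++) {  /* prints the filtered diff above */
        printf("%zu:", i);
        for (size_t k = 0; k < d[i].len; k++)
            printf(" %02X", d[i].bytes[k]);
        printf("\n");
    }
    return 0;
}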

The UTF-8 filter is somewhat more complicated to write because you have to examine the actual bytes to determine if a span boundary is also at a code point boundary. Fortunately the design of the UTF-8 encoding makes it pretty easy to walk backwards through a byte stream to find the start of the code point. Just as with the proposed UTF-16 filter, if a span does not start or end on a code point boundary, we shave bytes off the preceding or following common span to make all spans line up with valid code points. The UTF-8 filter might be implemented with the extra feature of detecting invalid UTF-8 sequences, but I would rather avoid that complexity and leave it to the caller.
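
For illustration, the backward walk can be as small as this (function name and signature invented for the sketch; invalid UTF-8 handling is left to the caller, per the above):

#include <stddef.h>
#include <stdint.h>

/* UTF-8 continuation bytes always look like 10xxxxxx, so stepping
   backwards until we leave the 0x80..0xBF range lands on the lead byte
   of the enclosing code point. E.g. "é" is C3 A9; a span boundary that
   falls on the A9 byte snaps back to the C3 byte. */
static size_t utf8_codepoint_start(const uint8_t *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}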

I think this plan should work because I believe there are no places where the code assumes ASCII data or relies on NUL-terminated strings or anything like that. If there is anything like that, we should treat it as a bug, I think, and make the core code clean for arbitrary byte sequences.

I think we would have two APIs to transform an existing byte-oriented diff that look like:

extern int dmp_diff_transform_as_utf8(dmp_diff *diff);
extern int dmp_diff_transform_as_utf16(dmp_diff *diff);
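
Hypothetical calling pattern (the header name and the 0-on-success return convention are assumptions; nothing here is settled yet):

#include "dmp.h"  /* assumed header name */

/* `diff` is a byte-oriented diff produced by the existing API; after a
   successful call its spans would fall on UTF-16 code unit boundaries. */
int realign_for_utf16(dmp_diff *diff)
{
    if (dmp_diff_transform_as_utf16(diff) != 0)
        return -1;  /* e.g. an input text had an odd number of bytes */
    return 0;
}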

Future thoughts

I've written this pretty much off the top of my head based on how I've been thinking about this recently. I suspect I was influenced by how the upstream library does it, but I don't really recall. We should probably study the upstream APIs for this a bit.

We could add an easy-to-use API that checks for a UTF BOM at the beginning of the text and automatically applies the appropriate UTF transform. I'm thinking something like int dmp_diff_transform_utf(dmp_diff *diff); that looks for a BOM and applies the right filter. However, if there is no BOM then the correct behavior is going to depend on the use case. It is probably best to leave this off for now, or at most just offer a BOM detection utility function.
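
A detection utility could be as simple as this sketch (names invented here); note that it only reports what a BOM claims, and the absence of a BOM tells you nothing:

#include <stddef.h>
#include <stdint.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE } bom_kind;

/* The three relevant byte-order marks: EF BB BF (UTF-8), FE FF
   (UTF-16 big-endian), FF FE (UTF-16 little-endian). */
static bom_kind detect_bom(const uint8_t *text, size_t len)
{
    if (len >= 3 && text[0] == 0xEF && text[1] == 0xBB && text[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && text[0] == 0xFE && text[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && text[0] == 0xFF && text[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}

The dispatching transform would then just map the result to the matching filter, with BOM_NONE returning an error or acting as a no-op.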

It would be nice to have an extension of the core library that brought in ICU and then used charset detection so there was less dependency on the user doing the right thing or the presence of a BOM. However, I do not want the core library to be dependent on ICU and I think directly supporting basic UTF-8 and UTF-16 character filters without ICU is reasonable.

@Varriount
Author

Ah, that clears things up. Thanks for the wonderful explanation - sorry for not understanding in the first place.

@arrbee
Owner

arrbee commented Oct 3, 2013

No problems! It's not like I've ever really written this all down before. Good to write it out.

@arrbee
Owner

arrbee commented Oct 4, 2013

Just to document it here, one additional bit of complexity with a UTF-8 filter may be dealing correctly with decomposed Unicode characters and combining diacritical marks. In those cases, it may not be sufficient to merely walk the diff boundaries to the nearest code point. Fortunately, it appears that the combining diacritics are constrained to specific ranges of UTF-8 data, so it should be possible to extend span boundaries to include those possible trailing bytes as well. A sketch of that boundary extension follows.
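
As a sketch (names invented; only the primary Combining Diacritical Marks block U+0300..U+036F, i.e. CC 80..CD AF in UTF-8, is checked, and the other combining ranges would need the same treatment):

#include <stddef.h>
#include <stdint.h>

/* Nonzero if p points at a two-byte UTF-8 sequence encoding a code
   point in the Combining Diacritical Marks block (U+0300..U+036F). */
static int is_combining_mark(const uint8_t *p)
{
    if ((p[0] & 0xE0) != 0xC0)  /* not a two-byte lead byte */
        return 0;
    uint32_t cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Extend a span's end offset past trailing combining marks so a base
   character and its diacritics stay in one span. */
static size_t extend_past_combining(const uint8_t *buf, size_t end, size_t len)
{
    while (end + 1 < len && is_combining_mark(buf + end))
        end += 2;  /* marks in this block are two bytes each */
    return end;
}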
