
Unicode Compatibility #2

Open
Varriount opened this issue Oct 3, 2013 · 4 comments

@Varriount

To the best of my knowledge, all string types used throughout the library are standard ASCII character types. Since this is a library based on string manipulation, we should probably figure out a way to use UTF-8 (Unicode) or UCS (wide-character) strings instead.

According to this site, using UTF-8 for Unicode strings would probably be the best solution, as it is already compatible with ASCII encodings and is more memory-efficient than wide characters.

@arrbee
Owner

arrbee commented Oct 3, 2013

My plan regarding Unicode is that the core library should actually be treating data as raw bytes, not ASCII characters. When it finds spans of differing data, those represent spans of bytes. The core diff algorithm shouldn't have to know about UTF-8 or UTF-16 (or any of the UCS stuff, etc.). Once the byte differences are calculated, a second pass through the generated diff should be able to adjust the diff boundaries to account for the format of the data.

A simple example

Say you had two UTF-16 texts:

0053 0061 006D 0070 006C 0065 == "Sample"
0053 006F 006D 0065 0020 0054 0065 0078 0074 == "Some Text"

The raw byte diff might look like:

0: common: 00 53 00
1: delete: 61
2: insert: 6F
3: common: 00 6D 00
4: delete: 70 00 6C
5: insert: 65 00 20 00 54
6: common: 00 65
7: insert: 00 78 00 74

We would then apply a "UTF-16 filter" to the diff which knows how to adjust diff spans to 16-bit boundaries. The first common span (i.e. line 0) has an odd number of bytes, so the filter would remove one byte from the end of the common span and insert it into both of the following deleted / inserted spans (i.e. 1 and 2). Likewise, span 3 ends misaligned, so we would also trim a byte off its end.

The general algorithm would be to walk the list of spans looking for ranges that end misaligned with a UTF-16 boundary. When we find a misaligned span ending, we either push a byte off the end of the span if it is common data, or we pull a byte from the following common span if it is inserted or deleted data. In this case, the end result after the UTF-16 filter would be:

0: common: 00 53
1: delete: 00 61
2: insert: 00 6F
3: common: 00 6D
4: delete: 00 70 00 6C
5: insert: 00 65 00 20 00 54
6: common: 00 65
7: insert: 00 78 00 74

The only real UTF-16 error case would be if a raw text does not contain an even number of bytes.
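
In code, the walk could look something like the sketch below. The span_t type, its fixed inline buffers, and the array-of-spans layout are all inventions for this example (the real dmp_diff representation will differ), and it assumes the delete/insert spans between two common spans become misaligned together, which holds when both inputs are valid even-length UTF-16:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { SPAN_COMMON, SPAN_DELETE, SPAN_INSERT } span_op;

typedef struct {
    span_op op;
    uint8_t bytes[16];  /* inline storage, assumed large enough here */
    size_t  len;
} span_t;

/* Prepend byte b to a span (capacity checks omitted in this sketch). */
static void push_front(span_t *s, uint8_t b)
{
    memmove(s->bytes + 1, s->bytes, s->len);
    s->bytes[0] = b;
    s->len++;
}

static void utf16_realign(span_t *sp, size_t n)
{
    size_t off_old = 0, off_new = 0;  /* byte offsets into old/new text */

    for (size_t i = 0; i < n; i++) {
        span_t *s = &sp[i];
        if (s->op == SPAN_COMMON) {
            /* Common span ends mid code unit: push its last byte into
               every delete/insert span that follows (spans 1 and 2 in
               the example above). */
            if (((off_old + s->len) & 1) && s->len > 0) {
                uint8_t b = s->bytes[--s->len];
                for (size_t j = i + 1; j < n && sp[j].op != SPAN_COMMON; j++)
                    push_front(&sp[j], b);
            }
            off_old += s->len;
            off_new += s->len;
        } else {
            size_t *off = (s->op == SPAN_DELETE) ? &off_old : &off_new;
            /* Delete/insert span ends mid code unit: pull the first byte
               of the next common span; each misaligned span in the run
               copies the same byte and the last one consumes it. */
            if ((*off + s->len) & 1) {
                size_t j = i + 1;
                while (j < n && sp[j].op != SPAN_COMMON) j++;
                if (j < n && sp[j].len > 0) {
                    s->bytes[s->len++] = sp[j].bytes[0];
                    if (j == i + 1)
                        memmove(sp[j].bytes, sp[j].bytes + 1, --sp[j].len);
                }
            }
            *off += s->len;
        }
    }
}

int main(void)
{
    /* The raw byte diff of "Sample" vs "Some Text" from above. */
    span_t d[8] = {
        { SPAN_COMMON, { 0x00, 0x53, 0x00 }, 3 },
        { SPAN_DELETE, { 0x61 }, 1 },
        { SPAN_INSERT, { 0x6F }, 1 },
        { SPAN_COMMON, { 0x00, 0x6D, 0x00 }, 3 },
        { SPAN_DELETE, { 0x70, 0x00, 0x6C }, 3 },
        { SPAN_INSERT, { 0x65, 0x00, 0x20, 0x00, 0x54 }, 5 },
        { SPAN_COMMON, { 0x00, 0x65 }, 2 },
        { SPAN_INSERT, { 0x00, 0x78, 0x00, 0x74 }, 4 },
    };
    utf16_realign(d, 8);
    for (size_t i = 0; i < 8; i++) {  /* prints the filtered diff above */
        printf("%zu:", i);
        for (size_t k = 0; k < d[i].len; k++)
            printf(" %02X", d[i].bytes[k]);
        printf("\n");
    }
    return 0;
}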

The UTF-8 filter is somewhat more complicated to write because you have to examine the actual bytes to determine if a span boundary is also at a code point boundary. Fortunately the design of the UTF-8 encoding makes it pretty easy to walk backwards through a byte stream to find the start of the code point. Just as with the proposed UTF-16 filter, if a span does not start or end on a code point boundary, we shave bytes off the preceding or following common span to make all spans line up with valid code points. The UTF-8 filter might be implemented with the extra feature of detecting invalid UTF-8 sequences, but I would rather avoid that complexity and leave it to the caller.
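
For illustration, the backward walk can be as small as this (function name and signature invented for the sketch; invalid UTF-8 handling is left to the caller, per the above):

#include <stddef.h>
#include <stdint.h>

/* UTF-8 continuation bytes always look like 10xxxxxx, so stepping
   backwards until we leave the 0x80..0xBF range lands on the lead byte
   of the enclosing code point. E.g. "é" is C3 A9; a span boundary that
   falls on the A9 byte snaps back to the C3 byte. */
static size_t utf8_codepoint_start(const uint8_t *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}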

I think this plan should work because I believe there are no places where the code assumes ASCII data or relies on NUL-terminated strings or anything like that. If there is anything like that, we should treat it as a bug, I think, and make the core code clean for arbitrary byte sequences.

I think we would have two APIs to transform an existing byte-oriented diff that look like:

extern int dmp_diff_transform_as_utf8(dmp_diff *diff);
extern int dmp_diff_transform_as_utf16(dmp_diff *diff);
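
Hypothetical calling pattern (the header name and the 0-on-success return convention are assumptions; nothing here is settled yet):

#include "dmp.h"  /* assumed header name */

/* `diff` is a byte-oriented diff produced by the existing API; after a
   successful call its spans would fall on UTF-16 code unit boundaries. */
int realign_for_utf16(dmp_diff *diff)
{
    if (dmp_diff_transform_as_utf16(diff) != 0)
        return -1;  /* e.g. an input text had an odd number of bytes */
    return 0;
}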

Future thoughts

I've written this pretty much off the top of my head based on how I've been thinking about this recently. I suspect I was influenced by how the upstream library does it, but I don't really recall. We should probably study the upstream APIs for this a bit.

We could add an easy-to-use API that checks for a UTF BOM at the beginning of the text and automatically applies the appropriate UTF transform. I'm thinking something like int dmp_diff_transform_utf(dmp_diff *diff); that looks for a BOM and applies the right filter. However, if there is no BOM then the correct behavior is going to depend on the use case. It is probably best to leave this off for now, or at most just offer a BOM detection utility function.
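
A detection utility could be as simple as this sketch (names invented here); note that it only reports what a BOM claims, and the absence of a BOM tells you nothing:

#include <stddef.h>
#include <stdint.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE } bom_kind;

/* The three relevant byte-order marks: EF BB BF (UTF-8), FE FF
   (UTF-16 big-endian), FF FE (UTF-16 little-endian). */
static bom_kind detect_bom(const uint8_t *text, size_t len)
{
    if (len >= 3 && text[0] == 0xEF && text[1] == 0xBB && text[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && text[0] == 0xFE && text[1] == 0xFF)
        return BOM_UTF16_BE;
    if (len >= 2 && text[0] == 0xFF && text[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}

The dispatching transform would then just map the result to the matching filter, with BOM_NONE returning an error or acting as a no-op.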

It would be nice to have an extension of the core library that brought in ICU and then used charset detection so there was less dependency on the user doing the right thing or the presence of a BOM. However, I do not want the core library to be dependent on ICU and I think directly supporting basic UTF-8 and UTF-16 character filters without ICU is reasonable.

@Varriount
Author

Ah, that clears things up. Thanks for the wonderful explanation - sorry for not understanding in the first place.

@arrbee
Owner

arrbee commented Oct 3, 2013

No problems! It's not like I've ever really written this all down before. Good to write it out.

@arrbee
Owner

arrbee commented Oct 4, 2013

Just to document it here, one additional bit of complexity with a UTF-8 filter may be dealing correctly with decomposed Unicode characters and combining diacritical marks. In those cases, it may not be sufficient to merely walk the diff boundaries to the nearest code point. Fortunately, it appears that the combining diacritics are constrained to specific ranges of UTF-8 data, so it should be possible to extend span boundaries to include those possible trailing bytes as well. A sketch of that boundary extension follows.
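
As a sketch (names invented; only the primary Combining Diacritical Marks block U+0300..U+036F, i.e. CC 80..CD AF in UTF-8, is checked, and the other combining ranges would need the same treatment):

#include <stddef.h>
#include <stdint.h>

/* Nonzero if p points at a two-byte UTF-8 sequence encoding a code
   point in the Combining Diacritical Marks block (U+0300..U+036F). */
static int is_combining_mark(const uint8_t *p)
{
    if ((p[0] & 0xE0) != 0xC0)  /* not a two-byte lead byte */
        return 0;
    uint32_t cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Extend a span's end offset past trailing combining marks so a base
   character and its diacritics stay in one span. */
static size_t extend_past_combining(const uint8_t *buf, size_t end, size_t len)
{
    while (end + 1 < len && is_combining_mark(buf + end))
        end += 2;  /* marks in this block are two bytes each */
    return end;
}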
