LibWeb: Implement and use "isomorphic decoding" #1893

Gingeh · 2024-10-21T05:34:22Z

To isomorphic decode a byte sequence input, return a string whose code point length is equal to input’s length and whose code points have the same values as the values of input’s bytes, in the same order.

This is essentially spec-speak for "Decode as ISO-8859-1 / Latin-1", but existing code interpreted it as "Transparently interpret as UTF-8" which is only correct for ASCII characters (< 0x80).

This change fixes a crash in http://wpt.live/mimesniff/mime-types/charset-parameter.window.html when parsing a non-ASCII header value. The test now fully passes when these changes are combined with #1879.
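
For illustration, the decode step quoted above could be sketched in standalone C++ roughly as follows. This is not the LibWeb code: `std::string` stands in for a UTF-8 string type, `std::vector<std::uint8_t>` stands in for a byte sequence, and the free function name is only borrowed from the spec algorithm.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of "isomorphic decode": every input byte becomes the code point with
// the same value, which is then written out as UTF-8.
// Assumption of this sketch: std::string holds UTF-8 text.
std::string isomorphic_decode(std::vector<std::uint8_t> const& input)
{
    std::string output;
    output.reserve(input.size());
    for (std::uint8_t byte : input) {
        if (byte < 0x80) {
            // ASCII is identical in ISO-8859-1 and UTF-8.
            output.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF need two UTF-8 bytes: 110xxxxx 10xxxxxx.
            output.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            output.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return output;
}
```

The ASCII branch is why the old "transparently interpret as UTF-8" behaviour only went wrong for bytes at or above 0x80.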

Gingeh · 2024-10-21T07:52:19Z

Marked as a draft while I figure out what's going on with https://wpt.live/mimesniff/mime-types/parsing.any.html

Gingeh · 2024-10-21T23:44:15Z

Testing for each change:

Content-Type header parsing:
- http://wpt.live/mimesniff/mime-types/charset-parameter.window.html
  - Used to crash on non-ASCII characters, now fully passes
- https://wpt.live/mimesniff/mime-types/parsing.any.html
  - 258 new failures!
  - These tests both rely on and contradict the outdated File API spec (see w3c/FileAPI#43)
  - The regressions all occur because we end up returning mojibake (e.g. ÿ becomes Ã¿) from Blob.type. The spec (and Chrome/Firefox) says we should return an empty string for non-ASCII MIME types, but the tests disagree and expect the original string to be returned (with the correct encoding).
  - I guess it's time to intentionally fail some wpt subtests.
Refresh header usage:
- `Deno.serve((_req) => new Response("", {headers: {"Refresh": "1; https://example.org/ÿ/"}}));`
- Used to redirect to https://example.org/%EF%BF%BD/; now redirects to https://example.org/%C3%BF/ (the same as accessing the URL directly)
Base64 data: urls:
- No real difference between previous and current behaviour because non-ASCII characters are never valid in base64
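
The two redirect targets in the Refresh test differ only in which code point the header byte 0xFF turns into before the URL is serialized. As a rough sketch (hypothetical helper names, not LibWeb code, with `std::string` holding UTF-8): `%EF%BF%BD` is the percent-encoded UTF-8 form of U+FFFD, the replacement character that presumably results from reading a lone 0xFF as UTF-8, while `%C3%BF` is the percent-encoded UTF-8 form of U+00FF, which is what isomorphic decoding produces.

```cpp
#include <string>

// Append the UTF-8 encoding of a code point up to U+FFFF (enough for this sketch).
void append_utf8(std::string& out, char32_t code_point)
{
    if (code_point < 0x80) {
        out.push_back(static_cast<char>(code_point));
    } else if (code_point < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
        out.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
        out.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
    }
}

// Percent-encode every non-ASCII byte of a UTF-8 string.
// Simplified: a real URL serializer also escapes some ASCII characters.
std::string percent_encode_non_ascii(std::string const& utf8)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));
            continue;
        }
        out.push_back('%');
        out.push_back(hex[byte >> 4]);
        out.push_back(hex[byte & 0x0F]);
    }
    return out;
}

// Hypothetical helper: how a single code point would show up in the URL path.
std::string path_segment_for(char32_t code_point)
{
    std::string utf8;
    append_utf8(utf8, code_point);
    return percent_encode_non_ascii(utf8);
}
```

With these helpers, `path_segment_for(0xFFFD)` gives `%EF%BF%BD` and `path_segment_for(0x00FF)` gives `%C3%BF`, matching the two URLs above.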

Gingeh · 2024-10-22T00:02:03Z

Un-drafting this because I'm happy with my testing.
Our Blob.type implementation needs to be changed to return an empty string if the MIME type contains non-ASCII characters. I'm not sure if that should be done now or in a follow-up PR.

Gingeh · 2024-10-22T04:07:21Z

Updates:

New Infra::isomorphic_encode and Infra::isomorphic_decode helper functions to reduce code duplication (and because the Latin-1 encoder isn't actually isomorphic).
Changed Header::from_string_pair to isomorphic encode its input strings because headers are expected to be ISO-8859-1 encoded.
- This solves all test regressions in https://wpt.live/mimesniff/mime-types/parsing.any.html and passes an entirely new subtest.
Renamed commit and PR to LibWeb: Implement and use "isomorphic decoding" because this is not using the Latin-1 en/decoder anymore.
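
A minimal standalone sketch of the encode direction, assuming `std::string` holds UTF-8 on the way in and raw ISO-8859-1 bytes on the way out. The name mirrors the helper mentioned above, but the signature and the `std::optional` error handling are assumptions of this sketch, not the actual `Infra::isomorphic_encode`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Sketch of "isomorphic encode": UTF-8 in, one byte per code point out.
// Returns std::nullopt if the input is not valid UTF-8 or contains a code
// point above U+00FF (the spec algorithm is only defined for such strings).
std::optional<std::string> isomorphic_encode(std::string_view utf8)
{
    std::string bytes;
    bytes.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        auto lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            bytes.push_back(static_cast<char>(lead));
            continue;
        }
        // U+0080..U+00FF are exactly the two-byte sequences led by 0xC2 or 0xC3.
        if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size()) {
            auto trail = static_cast<unsigned char>(utf8[i + 1]);
            if ((trail & 0xC0) == 0x80) {
                bytes.push_back(static_cast<char>(((lead & 0x03) << 6) | (trail & 0x3F)));
                ++i;
                continue;
            }
        }
        return std::nullopt;
    }
    return bytes;
}
```

Here "ÿ" (UTF-8 bytes C3 BF) encodes to the single byte FF, while "€" is rejected because U+20AC does not fit in one byte.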

trflynn89 · 2024-10-23T11:52:10Z

Userland/Libraries/LibWeb/Infra/Strings.cpp

+    StringBuilder builder(input.size());
+    for (u8 code_point : input) {
+        builder.append_code_point(code_point);
+    }
+    return builder.to_string_without_validation();


I don't think this is the right thing to do - String is by definition UTF-8, but here we can be throwing any encoding into the string. The to_string_without_validation isn't a path to avoid that invariant, rather it's a performance optimization for when we already know the data going into String is UTF-8 (as validation isn't free).

We use ByteString to represent strings of arbitrary encodings.

Gingeh · 2024-10-23

Unless I'm misunderstanding what append_code_point does, that's exactly what I'm doing. This function takes ISO-8859-1 encoded bytes and converts them to the corresponding UTF-8 encoding.

See also: Latin1Decoder::process and Decoder::to_utf8

trflynn89 · 2024-10-29

Ah I see, I think I was misunderstanding what isomorphic decoding means. But yes, this looks right. Thanks :)

F3n67u · 2024-12-06T17:39:12Z

> Changed Header::from_string_pair to isomorphic encode its input strings because headers are expected to be ISO-8859-1 encoded.

@Gingeh There are still several places that create Infrastructure::Header using aggregate initialization, which may leave the Infrastructure::Header not ISO-8859-1 encoded. Should we change them to use Infrastructure::Header::from_string_pair instead? Or should we forbid aggregate initialization of Infrastructure::Header to avoid this kind of bug?

Also, many of the places that use an Infrastructure::Header's value or name still assume it is UTF-8 encoded and don't use "isomorphic decoding". I found a crash caused by this in #2814, but there are more places like that.

Gingeh · 2024-12-06T23:38:33Z

@F3n67u

> There are still several places that create Infrastructure::Header using aggregate initialization, which may leave the Infrastructure::Header not ISO-8859-1 encoded. Should we change them to use Infrastructure::Header::from_string_pair instead? Or should we forbid aggregate initialization of Infrastructure::Header to avoid this kind of bug?

The safest option would be to disallow aggregate initialization and exclusively use Header::from_string_pair. There could be a slight optimization by skipping it when the input is either already encoded or statically known to be pure ASCII, but I'm not sure if that's worth the risk.

Maybe the aggregate initialization could be forbidden in favour of a Header::from_latin1_pair constructor which makes the expected encoding more explicit?

> Also, many of the places that use an Infrastructure::Header's value or name still assume it is UTF-8 encoded and don't use "isomorphic decoding". I found a crash caused by this in #2814, but there are more places like that.

I'm not surprised; a lot of code still assumes headers are ASCII. These cases will need to be found and fixed over time.

EDIT:
Another thing to be careful of is that isomorphic encoding a value that has already been encoded will result in garbage data. I've just found several cases of this in Fetching.cpp.

F3n67u · 2024-12-07T00:42:59Z

> The safest option would be to disallow aggregate initialization and exclusively use Header::from_string_pair. There could be a slight optimization by skipping it when the input is either already encoded or statically known to be pure ASCII, but I'm not sure if that's worth the risk.
> Maybe the aggregate initialization could be forbidden in favour of a Header::from_latin1_pair constructor which makes the expected encoding more explicit?

I agree that we should disallow aggregate initialization in favor of Header::from_string_pair and Header::from_latin1_pair!

> EDIT:
> Another thing to be careful of is that isomorphic encoding a value that has already been encoded will result in garbage data. I've just found several cases of this in Fetching.cpp.

Yes! It is causing https://wpt.live/fetch/api/basic/request-headers-nonascii.any.html to fail. Header::from_latin1_pair would be handy for handling those cases.
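
To make the idea discussed above concrete, here is a rough standalone sketch of a header type that cannot be aggregate-initialized and forces callers to say which encoding their input is in. Everything in it is hypothetical: the class is a stand-in for Infrastructure::Header, `from_latin1_pair` is only the name proposed in this thread, and this `from_string_pair` returns `std::nullopt` on bad input, whereas the real helper reportedly produces garbage data instead.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Strict isomorphic encode (sketch): UTF-8 in, one byte per code point out.
std::optional<std::string> isomorphic_encode(std::string_view utf8)
{
    std::string bytes;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        auto lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            bytes.push_back(static_cast<char>(lead));
        } else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size()
            && (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80) {
            auto trail = static_cast<unsigned char>(utf8[++i]);
            bytes.push_back(static_cast<char>(((lead & 0x03) << 6) | (trail & 0x3F)));
        } else {
            return std::nullopt; // Not UTF-8, or a code point above U+00FF.
        }
    }
    return bytes;
}

// Hypothetical stand-in for Infrastructure::Header. The private constructor
// rules out aggregate initialization, so every header goes through a factory.
class Header {
public:
    // Input is UTF-8 text; it is isomorphic encoded before being stored.
    static std::optional<Header> from_string_pair(std::string_view name, std::string_view value)
    {
        auto encoded_name = isomorphic_encode(name);
        auto encoded_value = isomorphic_encode(value);
        if (!encoded_name || !encoded_value)
            return std::nullopt;
        return Header(std::move(*encoded_name), std::move(*encoded_value));
    }

    // Input is already ISO-8859-1 bytes (e.g. response headers); stored untouched.
    static Header from_latin1_pair(std::string name, std::string value)
    {
        return Header(std::move(name), std::move(value));
    }

    std::string const& name() const { return m_name; }
    std::string const& value() const { return m_value; }

private:
    Header(std::string name, std::string value)
        : m_name(std::move(name))
        , m_value(std::move(value))
    {
    }

    std::string m_name;
    std::string m_value;
};
```

Feeding a stored value back into `from_string_pair` is the double-encoding mistake described above: the lone byte FF is not valid UTF-8, so this sketch rejects it, while `from_latin1_pair` accepts it unchanged.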

This patch ensures that a Headers object's associated header list is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`, and correctly decoded using `Infra::isomorphic_decode`. Follow-up of #1893

Isomorphic encoding a value that has already been encoded will result in garbage data. `response_headers` is already ISO-8859-1 (Latin-1) encoded, so we cannot use `from_string_pair`, as it triggers ISO-8859-1 encoding again. Follow-up of #1893

This patch ensures that a Headers object's associated header list is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`, and correctly decoded using `Infra::isomorphic_decode`. Follow-up of LadybirdBrowser/ladybird#1893 (cherry picked from commit 824e91ffdb568e28ec1640d28cf4f45d2fe4a0ae)

Gingeh force-pushed the isomorphic-decode branch from b835430 to a0464f3 Compare October 21, 2024 05:37

Gingeh marked this pull request as draft October 21, 2024 06:35

Gingeh marked this pull request as ready for review October 21, 2024 23:59

Gingeh force-pushed the isomorphic-decode branch 2 times, most recently from 220a547 to 93f690d Compare October 22, 2024 04:00

Gingeh changed the title ~~LibWeb: Use ISO-8859-1 decoder for "isomorphic decode"~~ LibWeb: Implement and use "isomorphic decoding" Oct 22, 2024

Gingeh force-pushed the isomorphic-decode branch 2 times, most recently from ba9110c to 9bc08a5 Compare October 22, 2024 05:53

Gingeh mentioned this pull request Oct 22, 2024

LibRequests+LibWeb+RequestServer: Propagate HTTP reason phrase #1876

Merged

Gingeh force-pushed the isomorphic-decode branch from 9bc08a5 to 73924aa Compare October 22, 2024 22:30

trflynn89 reviewed Oct 23, 2024

Gingeh requested a review from trflynn89 October 23, 2024 21:09

Gingeh force-pushed the isomorphic-decode branch 2 times, most recently from f8b1363 to 392b534 Compare October 26, 2024 00:20

LibWeb: Implement and use "isomorphic decoding"

24cb823

Gingeh force-pushed the isomorphic-decode branch from 392b534 to 24cb823 Compare October 29, 2024 09:15

trflynn89 merged commit ebb8342 into LadybirdBrowser:master Oct 29, 2024
6 checks passed

Gingeh deleted the isomorphic-decode branch October 29, 2024 22:43

Gingeh added a commit to Gingeh/ladybird that referenced this pull request Dec 6, 2024

LibWeb: Add test for non-ascii content-type headers

a676e2d

This couldn't be added in LadybirdBrowser#1893

tcl3 pushed a commit that referenced this pull request Dec 6, 2024

LibWeb: Add test for non-ascii content-type headers

e4512d8

This couldn't be added in #1893

F3n67u mentioned this pull request Dec 7, 2024

LibWeb: Ensure Headers API can handle non-ascii characters #2814

Merged

F3n67u mentioned this pull request Dec 12, 2024

LibWeb: Avoid re-encoding response headers #2890

Merged
