Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LibWeb: Implement and use "isomorphic decoding" #1893

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

Gingeh
Copy link
Contributor

@Gingeh Gingeh commented Oct 21, 2024

To isomorphic decode a byte sequence input, return a string whose code point length is equal to input’s length and whose code points have the same values as the values of input’s bytes, in the same order.

This is essentially spec-speak for "Decode as ISO-8859-1 / Latin-1", but existing code interpreted it as "Transparently interpret as UTF-8" which is only correct for ASCII characters (< 0x80).

This change fixes a crash in http://wpt.live/mimesniff/mime-types/charset-parameter.window.html when parsing a non-ascii header value. The test now fully passes when combining these changes with #1879.

@Gingeh Gingeh force-pushed the isomorphic-decode branch from b835430 to a0464f3 Compare October 21, 2024 05:37
@Gingeh Gingeh marked this pull request as draft October 21, 2024 06:35
@Gingeh
Copy link
Contributor Author

Gingeh commented Oct 21, 2024

Marked as a draft while I figure out what's going on with https://wpt.live/mimesniff/mime-types/parsing.any.html

@Gingeh
Copy link
Contributor Author

Gingeh commented Oct 21, 2024

Testing for each change:

  • Content-Type header parsing:
  • Refresh header usage:
    • Deno.serve((_req) => new Response("", {headers: {"Refresh": "1; https://example.org/ÿ/"}}));
    • Used to redirect to https://example.org/%EF%BF%BD/ now redirects to https://example.org/%C3%BF/ (same as accessing url directly)
  • Base64 data: urls:
    • No real difference between previous and current behaviour because non-ascii chars are never valid in base64

@Gingeh Gingeh marked this pull request as ready for review October 21, 2024 23:59
@Gingeh
Copy link
Contributor Author

Gingeh commented Oct 22, 2024

Un-drafting this because I'm happy with my testing.
Our Blob.type implementation needs to be changed to return an empty string if the mimetype contains non-ascii characters, I'm not sure if that should be done now or in a follow-up pr.

@Gingeh Gingeh force-pushed the isomorphic-decode branch 2 times, most recently from 220a547 to 93f690d Compare October 22, 2024 04:00
@Gingeh Gingeh changed the title LibWeb: Use ISO-8859-1 decoder for "isomorphic decode" LibWeb: Implement and use "isomorphic decoding" Oct 22, 2024
@Gingeh
Copy link
Contributor Author

Gingeh commented Oct 22, 2024

Updates:

  • New Infra::isomorphic_encode and Infra::isomorphic_decode helper functions to reduce code duplication (and because the Latin-1 encoder isn't actually isomorphic).
  • Changed Header::from_string_pair to isomorphic encode its input strings because headers are expected to be ISO-8859-1 encoded.
  • Renamed commit and PR to LibWeb: Implement and use "isomorphic decoding" because this is not using the Latin-1 en/decoder anymore.

Comment on lines +163 to +167
StringBuilder builder(input.size());
for (u8 code_point : input) {
builder.append_code_point(code_point);
}
return builder.to_string_without_validation();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right thing to do - String is by definition UTF-8, but here we can be throwing any encoding into the string. The to_string_without_validation isn't a path to avoid that invariant, rather it's a performance optimization for when we already know the data going into String is UTF-8 (as validation isn't free).

We use ByteString to represent strings of arbitrary encodings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding what append_code_point does that's exactly what I'm doing. This function takes ISO-8859-1 encoded bytes and converts them to the corresponding UTF-8 encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: Latin1Decoder::process and Decoder::to_utf8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I think I was misunderstanding what isomorphic decoding means. But yes, this looks right. Thanks :)

@Gingeh Gingeh requested a review from trflynn89 October 23, 2024 21:09
@Gingeh Gingeh force-pushed the isomorphic-decode branch 2 times, most recently from f8b1363 to 392b534 Compare October 26, 2024 00:20
@Gingeh Gingeh force-pushed the isomorphic-decode branch from 392b534 to 24cb823 Compare October 29, 2024 09:15
@trflynn89 trflynn89 merged commit ebb8342 into LadybirdBrowser:master Oct 29, 2024
6 checks passed
@Gingeh Gingeh deleted the isomorphic-decode branch October 29, 2024 22:43
Gingeh added a commit to Gingeh/ladybird that referenced this pull request Dec 6, 2024
tcl3 pushed a commit that referenced this pull request Dec 6, 2024
@F3n67u
Copy link
Contributor

F3n67u commented Dec 6, 2024

Changed Header::from_string_pair to isomorphic encode its input strings because headers are expected to be ISO-8859-1 encoded.

@Gingeh There are still several places creating Infrastructure::Header using aggregate initialization, which may cause Infrastructure::Header is not ISO-8859-1 encoded. Do we need to change them to using Infrastructure::Header::from_string_pair instead? Or do we need to forbidden aggregate initialization of Infrastructure::Header` to avoid this kind of bug?

Also, when using Infrastructure::Header's value or name, many of them still assume they are utf-8 encoded and didn't use "isomorphic decoding". I found a crash that was caused by this in #2814, but there are more places like that.

@Gingeh
Copy link
Contributor Author

Gingeh commented Dec 6, 2024

@F3n67u

There are still several places creating Infrastructure::Header using aggregate initialization, which may cause Infrastructure::Header is not ISO-8859-1 encoded. Do we need to change them to using Infrastructure::Header::from_string_pair instead? Or do we need to forbidden aggregate initialization of Infrastructure::Header` to avoid this kind of bug?

The safest option would be to disallow aggregate initialization and exclusively use Header::from_string_pair. There could be a slight optimization by skipping it when the input is either already encoded or statically known to be pure ASCII, but I'm not sure if that's worth the risk.

Maybe the aggregate initialization could be forbidden in favour of a Header::from_latin1_pair constructor which makes the expected encoding more explicit?

Also, when using Infrastructure::Header's value or name, many of them still assume they are utf-8 encoded and didn't use "isomorphic decoding". I found a crash that was caused by this in #2814, but there are more places like that.

I'm not surprised, a lot of code still assumes headers are ASCII. These cases will need to be found and fixed over time.

EDIT:
Another thing to be careful of is the fact that isomorphic encoding a value that has already been encoded will result in garbage data, I've just found several cases of this in Fetching.cpp

@F3n67u
Copy link
Contributor

F3n67u commented Dec 7, 2024

The safest option would be to disallow aggregate initialization and exclusively use Header::from_string_pair. There could be a slight optimization by skipping it when the input is either already encoded or statically known to be pure ASCII, but I'm not sure if that's worth the risk.
Maybe the aggregate initialization could be forbidden in favour of a Header::from_latin1_pair constructor which makes the expected encoding more explicit?

I agreed that we should disallow aggregate initialization to in favor of Header::from_string_pair and Header::from_latin1_pair!

EDIT:
Another thing to be careful of is the fact that isomorphic encoding a value that has already been encoded will result in garbage data, I've just found several cases of this in Fetching.cpp

Yes! It is causing https://wpt.live/fetch/api/basic/request-headers-nonascii.any.html failed. Header::from_latin1_pair would be handy when handling those cases.

F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 7, 2024
This patch ensure Headers object's associated header list
is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`,
and correctly decoded using `Infra::isomorphic_decode`.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 7, 2024
This patch ensure Headers object's associated header list
is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`,
and correctly decoded using `Infra::isomorphic_decode`.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 9, 2024
This patch ensure Headers object's associated header list
is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`,
and correctly decoded using `Infra::isomorphic_decode`.

Follow-up of LadybirdBrowser#1893
ADKaster pushed a commit that referenced this pull request Dec 11, 2024
This patch ensure Headers object's associated header list
is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`,
and correctly decoded using `Infra::isomorphic_decode`.

Follow-up of #1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 12, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1, we cannot use `from_string_pair`, as it calls
`isomorphic_decode`.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 12, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1 encoding.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 12, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1/latin1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1/latin1 encoding.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 13, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1/latin1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1/latin1 encoding.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 13, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1/latin1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1/latin1 encoding.

Follow-up of LadybirdBrowser#1893
F3n67u added a commit to F3n67u/ladybird that referenced this pull request Dec 13, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1/latin1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1/latin1 encoding.

Follow-up of LadybirdBrowser#1893
gmta pushed a commit that referenced this pull request Dec 17, 2024
isomorphic encoding a value that has already been encoded will
result in garbage data. `response_headers` is already encoded in
ISO-8859-1/latin1, we cannot use `from_string_pair`, as it triggers
ISO-8859-1/latin1 encoding.

Follow-up of #1893
nico pushed a commit to nico/serenity that referenced this pull request Dec 21, 2024
This patch ensure Headers object's associated header list
is ISO-8859-1 encoded when set using `Infra::isomorphic_encode`,
and correctly decoded using `Infra::isomorphic_decode`.

Follow-up of LadybirdBrowser/ladybird#1893

(cherry picked from commit 824e91ffdb568e28ec1640d28cf4f45d2fe4a0ae)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants