Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-128136: Possibility to customize hardcoded xml declaration #128095

Closed
wants to merge 12 commits into from

Conversation

pprossi
Copy link

@pprossi pprossi commented Dec 19, 2024

Added possibility to customize hardcoded xml declaration

As rfc and dtd allows single- and doublequotes for the declaration tag.
Manual under https://docs.python.org/3/library/xml.etree.elementtree.html shows doublequotes but that output will never be possible with the hardcoded part in code.

without changing the default behaviour to pass tests.

Added possibility to customize hardcoded xml declaration
fixed lint error triling spaces
unchanged default declaration to pass tests
Copy link

cpython-cla-bot bot commented Dec 19, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Dec 19, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Dec 19, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@rruuaanng
Copy link
Contributor

Please add a title tag, e.g.

gh-<issue_id>: title body

And you also need to add a NEWS to describe your fix. We can use the blurb(pip install blurb), or online site(https://blurb-it.herokuapp.com/).

@@ -679,6 +679,7 @@ def iterfind(self, path, namespaces=None):
def write(self, file_or_filename,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should add a unit test to test it's behavior. Verify it whether the requirements are met.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some unit tests but there is'nt an issu to add an issue title.

and i think the change has little impact on Python users.

test cases for custom xml declaration
@bedevere-app
Copy link

bedevere-app bot commented Dec 20, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@picnixz
Copy link
Contributor

picnixz commented Dec 20, 2024

Added possibility to customize hardcoded xml declaration

Features should be announced using What's New entry and a NEWS entry (a "blurb"). Hence, this requires an issue and possibly a discussion for that feature. In particular, adding a new parameter may break existing code (and also, XML is a C extension modules, so you should also port those changes to the C interface, if any).

fix: unknowen r in code :)
@bedevere-app
Copy link

bedevere-app bot commented Dec 20, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Dec 20, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@pprossi pprossi changed the title Possibility to customize hardcoded xml declaration gh-128136: Possibility to customize hardcoded xml declaration Dec 20, 2024
@picnixz picnixz marked this pull request as draft December 20, 2024 19:41
@picnixz
Copy link
Contributor

picnixz commented Dec 20, 2024

Marking it as a draft until the CI is green (this is to indicate that the work is currently ongoing).

@picnixz
Copy link
Contributor

picnixz commented Dec 20, 2024

Note: the NEWS entry shouldn't contain newlines IIRC. A more verbose explanation should be put in Doc/whatsnew/3.14.rst. Documentation should be added accordingly (namely somewhere in Doc/library/xml.rst or something like that).

@pprossi pprossi marked this pull request as ready for review December 20, 2024 21:35
Copy link
Contributor

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO only the version should be allowed to be modified. In addition the XML declaration should be a valid one according to XML specs so this is probably a bit more tricky. For the default one, we should instead use None and internally hardcode the one that will be the default format string. Otherwise, it should be left to the caller the responsibility for formatting it correctly and we only say that it's used through .format(version=version, encoding=encoding).

@@ -0,0 +1,8 @@
New parameter added to the write function that allows you to customize the xml declaration, which was previously hard-coded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned, the NEWS entry must be shorter (one sentence in general).
Examples should be put in the documentation instead. In addition, a What's New entry must be created for new features.

Lib/xml/etree/ElementTree.py Outdated Show resolved Hide resolved
@pprossi
Copy link
Author

pprossi commented Dec 21, 2024

IMO only the version should be allowed to be modified. In addition the XML declaration should be a valid one according to XML specs so this is probably a bit more tricky. For the default one, we should instead use None and internally hardcode the one that will be the default format string. Otherwise, it should be left to the caller the responsibility for formatting it correctly and we only say that it's used through .format(version=version, encoding=encoding).

I don't think there is a "standard declaration" in this sense.

https://www.w3.org/TR/xml/

The following examples are provided here.

<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8" ?>
<?xml version="1.0" standalone='yes'?>
<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>
<?xml version='1.0'?>

The corresponding rule in the DTD is:

VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )

The VersionNum differ at most between 1.0 and 1.1.
The standalone parameter is not taken into account at all in the current code, but could easily be implemented here.

From my experience so far, this is only necessary because there are programs which, for whatever reason, can only cope with a certain variant of the declaration.
Probably because the string <?xml version="1.0" encoding="UTF-8" ?> is also hard-coded there.
This shows the disadvantages of hard-coded variants.

This variant represents a non-breaking change and makes it easiest for the caller to choose the declaration variant that is necessary for him.

move new parameter to end of argument list to non positional parameters.
@picnixz
Copy link
Contributor

picnixz commented Dec 21, 2024

This variant represents a non-breakable change.

It does. For instance:

file_or_filename = ...
encoding = ...
xml_declaration = ...
default_namespace = ...
method = ...

# before: "default_namespace" is mapped to the parameter "default_namespace" (OK)
x.write(file_or_filename, encoding, xml_declaration, default_namespace, method)
# after: "default_namespace" is mapped to the parameter "xml_declaration_definition" (not OK)
x.write(file_or_filename, encoding, xml_declaration, default_namespace, method)

I don't think there is a "standard declaration" in this sense.

What I meant by standard is that we don't want to break existing code that relies on this specific form. So we should have =None to keep the existing behaviour. While features may introduce breaking changes, we prefer not to force old code to update. Note that the standard declaration is still

<?xml field=... field=... field=... ?>

so we should only be able to specify the different fields IMO.

makes it easiest for the caller to choose the declaration variant that is necessary for him.

Which is exactly the reason why this should documented (not at the code level but at the level of https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write).


To be clear: I don't oppose the addition of this feature but I'd like some additional confirmations (btw, I don't think we need a C implementation since this part appears to be purely in Python):

  • Let's assume that we hardcode the declaration differently, for instance <xml encoding="HAHA">. The encoding is wrong but the rest of the output is XML-compliant. What should we do? should we reject the declaration (this will be VERY hard to come up with a simple and maintainable solution) or do we just ignore consistency issues?
  • If the version is 1.1, does it mean we need to change how the output is written? AFAICT, we only use XML v1.0 and AFAIR, XML 1.1 is rarely used (see https://stackoverflow.com/questions/6260975/should-i-learn-xml-1-0-or-xml-1-1). So I'm not very confident we should have that freedom. Instead, I suggest we add the different fields allowed by the XML specs (namely a keyword-only "standalone" parameter).

@pprossi
Copy link
Author

pprossi commented Dec 21, 2024

It does. For instance:

file_or_filename = ...
encoding = ...
xml_declaration = ...
default_namespace = ...
method = ...

# before: "default_namespace" is mapped to the parameter "default_namespace" (OK)
x.write(file_or_filename, encoding, xml_declaration, default_namespace, method)
# after: "default_namespace" is mapped to the parameter "xml_declaration_definition" (not OK)
x.write(file_or_filename, encoding, xml_declaration, default_namespace, method)

This should no longer be necessary with the last change.

d8a2300

<?xml field=... field=... field=... ?>

so we should only be able to specify the different fields IMO.

This is not primarily about the parameters themselves.
It is about quotes and other parameters like standalone, actually just this one :)

Which is exactly the reason why this should documented (not at the code level but at the level of https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree.write).

Yes, it should be documented anyway :)
But even now the documentation does not correspond to what can be achieved. e.g.

Let’s say we want to add one to each country’s rank, and add an updated attribute to the rank element:

for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
    rank.set('updated', 'yes')
tree.write('output.xml')

Our XML now looks like this:

<?xml version="1.0"?>
<data>
   <country name="Liechtenstein">
       <rank updated="yes">2</rank>
       <year>2008</year>
       <gdppc>141100</gdppc>
       <neighbor name="Austria" direction="E"/>
       <neighbor name="Switzerland" direction="W"/>
   </country>
   <country name="Singapore">
       <rank updated="yes">5</rank>
       <year>2011</year>
       <gdppc>59900</gdppc>
       <neighbor name="Malaysia" direction="N"/>
   </country>
   <country name="Panama">
       <rank updated="yes">69</rank>
       <year>2011</year>
       <gdppc>13600</gdppc>
       <neighbor name="Costa Rica" direction="W"/>
       <neighbor name="Colombia" direction="E"/>
   </country>
</data>

Thats wrong, what we actually get is:

<data>
   <country name="Liechtenstein">
       <rank updated="yes">2</rank>
       <year>2008</year>
       <gdppc>141100</gdppc>
       <neighbor name="Austria" direction="E"/>
       <neighbor name="Switzerland" direction="W"/>
   </country>
   <country name="Singapore">
       <rank updated="yes">5</rank>
       <year>2011</year>
       <gdppc>59900</gdppc>
       <neighbor name="Malaysia" direction="N"/>
   </country>
   <country name="Panama">
       <rank updated="yes">69</rank>
       <year>2011</year>
       <gdppc>13600</gdppc>
       <neighbor name="Costa Rica" direction="W"/>
       <neighbor name="Colombia" direction="E"/>
   </country>
</data>

And if we add the declaration with xml_declaration=True we get a mix of single quotes and double quotes.

<?xml version='1.0' encoding='us-ascii'?>
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
* Let's assume that we hardcode the declaration differently, for instance `<xml encoding="HAHA">`. The encoding is wrong but the rest of the output is XML-compliant. What should we do? should we reject the declaration (this will be **VERY** hard to come up with a simple and _maintainable_ solution) or do we just ignore consistency issues?

Shouldn't we leave it up to the developer? If he gives an incorrect coding like "HAHA" then he should know why he is doing this.
In this case I would ignore it because this is not about validating XML code.

* If the version is 1.1, does it mean we need to change how the output is written? AFAICT, we only use XML v1.0 and AFAIR, XML 1.1 is rarely used (see https://stackoverflow.com/questions/6260975/should-i-learn-xml-1-0-or-xml-1-1). So I'm not very confident we should have that freedom. Instead, I suggest we add the different fields allowed by the XML specs (namely a keyword-only "standalone" parameter).

It is correct that version 1.1 is rarely used, but should we limit the possibility of using it for this reason? I would expect something like that from closed source software, but not from open source.

If I understand correctly, in your opinion a format function should be created here that sets the version and encoding options like standalone. And also controls the output via single quotes or double quotes.

That sounds feasible and an alternative approach, but would greatly increase the need to adapt the current code and the associated unit tests.

Another nice variant would be to remember the original declaration tag and continue to use it, but this would only work for parsed documents.

Perhaps in the future someone with in-depth Python knowledge will be found who has the time to do this.

@picnixz
Copy link
Contributor

picnixz commented Dec 21, 2024

It is correct that version 1.1 is rarely used, but should we limit the possibility of using it for this reason?

If it were a 3rd-party library, no but here we are talking about the standard library which sometimes makes design choices and restrict features on purpose. The reason is maintenance cost vs usability vs use-cases. If we don't have a lot of use-cases, we usually don't include a feature in the standard library and this is best left to pypi packages.

Shouldn't we leave it up to the developer? If he gives an incorrect coding like "HAHA" then he should know why he is doing this.

In this case, we should still document this.

Thats wrong, what we actually get is:

A separate PR (no need for an issue) for updating the docs would be appreciated!

If I understand correctly, in your opinion a format function should be created here that sets the version and encoding options like standalone. And also controls the output via single quotes or double quotes.

Yes, what I meant is that we only allow version, encoding and standalone to be specified but not the format itself. The format and field names are standard but the field values may be customized.

That sounds feasible and an alternative approach, but would greatly increase the need to adapt the current code and the associated unit tests.

How come? existing tests should not be changed. For instance, if we don't specify standalone (namely standalone=False), then we don't need to include it in the output and no changes for the tests are needed. The encoding can already be specified and the version as well. I would note that we only change the XML declaration if the encoding is different from some expected encodings so we need to take that into account.


It is about quotes and other parameters like standalone, actually just this one :)

If this is only for standalone, then we should just be able to specify it as a boolean. If the boolean is true, then we would just add it to the header, otherwise, we keep the current behaviour. Have I understood correctly?

@pprossi
Copy link
Author

pprossi commented Dec 21, 2024

OK, that now exceeds my willingness to invest any more time in this topic.

I hope that it will be possible at some point to achieve an adequate result in the standard library and in the meantime I will continue to use a manual write function in which I write my declaration and then return the string of the element tree.
This also gives me a valid document that meets my needs.

I wish you blessed holidays and a good transition into the new year.

@picnixz
Copy link
Contributor

picnixz commented Dec 22, 2024

OK, that now exceeds my willingness to invest any more time in this topic.

I understand that the process can be tiring but I'm afraid we need to be conservative when changing things or improving things for a standard library.

I hope that it will be possible at some point to achieve an adequate result in the standard library

In general, a good indication of whether something is worth adding to the standard library is how many use cases there would be (at least, feature-wise).

I wish you blessed holidays and a good transition into the new year.

You too!


Considering this, I will close this PR and the dedicated issue as "not planned" for now. Ping me if you want to re-open it and continue the discussion. Note that you can also build up support on Discourse: https://discuss.python.org/c/ideas/6.

@picnixz picnixz closed this Dec 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants