Add "decimal" linter; percent-encodability during DL URI conversion #15
What would the action of such a linter be when passed the data for the component that it is tagged against? Note that a linter function only sees the component data, not the AI nor other components. I think that we need to consider issues related to encoding/decoding of AI values using a separate (non-linter) framework. This belief is primarily driven by behaviours that might be expected for percent-encoded AIs...

To my present understanding, the raw value of an AI is whatever the user puts in it, conforming to CSET 82 or whatever format specification is in effect. So, irrespective of whether some AI is "percent-encoding enabled" or not, an AI value containing "

If it were any other way, I think implementors would get a shock, as I suspect many field sizes are fixed based on the format specification, and they would have trouble fitting

Currently a DL URI containing such a data value

There's a school of thought that it shouldn't be this way for percent-encoding-enabled AI values, which I appreciate because double escaping is wasteful. However, to achieve this in a way that does not introduce ambiguities based on whether the AI is percent-encoding enabled or not, we would have to formally tag (or subtype, or whatever) the special AIs with an encoder/decoder pair and describe exactly when and where a raw value (i.e.

Such treatment means that the interpretation of a percent-encoded value encoded in a DL URI parameter varies according to whether the AI is percent-encoding enabled or not, which adds bloat to lightweight processes such as the barcode reader shim when extracting element strings from URIs, requiring that they also carry a table of percent-encoding-enabled AIs. So it's not clear that we want to add such complexity to save a few characters, and doing so would introduce a new concept (encoder/decoder pairs) to the GS1 system that would touch many standards.
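The current behaviour can be illustrated in a few lines (Python; the raw value is a made-up example): a raw AI value that already carries a CSET 82 percent escape is simply escaped again on the way into a DL URI, and the reverse conversion restores it exactly, with no table of percent-encoding-enabled AIs required.

```python
from urllib.parse import quote, unquote

# Hypothetical raw AI value that already contains a CSET 82 percent
# escape ("%40" standing for "@", which is outside CSET 82).
raw_value = "ABC%40XYZ"

# Placing the raw value into a DL URI escapes the "%" itself
# (double escaping)...
uri_component = quote(raw_value, safe="")   # "ABC%2540XYZ"

# ...and the reverse conversion recovers the raw value unchanged.
round_tripped = unquote(uri_component)

assert round_tripped == raw_value
```

The cost is a few extra characters in the URI; the benefit is that the conversion never needs to know which AIs are percent-encoding enabled.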
Aside: The encoding/decoding processes are nuanced (not merely lift-and-shift), even for percent encoding, and require a full decode/encode round trip between percent-encoded AI values and DL URIs and vice versa. In percent-encoded AI values the encoding is used to escape non-CSET 82 characters, which is different from its use in URIs, where it escapes URI-unsafe characters. E.g. "
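To make the aside concrete, here is a small sketch (Python; the character set is written out for illustration) showing that the two escaping regimes target different character sets, which is why conversion is a full decode/re-encode rather than a lift-and-shift:

```python
from urllib.parse import quote, unquote

# CSET 82 per the GS1 General Specifications, written out here for
# illustration.
CSET82 = set("!\"%&'()*+,-./0123456789:;<=>?"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz")

# "&" is a legal CSET 82 character, so it appears raw in an element
# string, yet it must be escaped inside a DL URI query component.
assert "&" in CSET82
assert quote("&", safe="") == "%26"

# "@" is outside CSET 82, so it is escaped ("%40") in a pcenc AI value,
# yet it is harmless as data in many URI contexts; the escaping goes
# the other way round.
assert "@" not in CSET82
assert unquote("%40") == "@"
```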
From the linter perspective, "decimal" would do nothing; the content is just a string of numbers, so the regular number validation would work just fine. As you suggest, though, I'm looking at it from an encoding/decoding perspective that can be aware of individual AIs.

The framework I'm developing uses the Syntax Dictionary to handle as many generic cases as possible in a way that minimizes the work required by the user. For example, an element string that contains AIs 01, 10, and 17 would be decoded into an array containing, in the same order, a string (with the character set and check digit validated), a string (with the character set validated), and a date/time object (with the individual fields validated and the year adjusted according to the 50-year window rule). The dictionary-based decoding fails for the "decimal" AIs above because it sees them as whole numbers, when in fact the values that should be extracted are

I want to take as much of the requirement to learn the intricacies of the AI system away from the user as I can. So, when constructing an element string, if the user says "add 'LENGTH (m)' of 123.45 to the element string", the encoder should automatically generate AI 3112 based on a) the title and b) the position of the decimal.

The percent-encoding question you pose is an interesting one, and I think it's worth taking up in the Web Technology SMG. Here are my thoughts on the matter:
Because of the above, there should be no double percent-encoding when encoding these AIs in a GS1 Digital Link URI. If a specification is defined with "pcenc", the linter should accept the string unconditionally when the content comes from a URI, because it will have already been decoded (the same way lot "a/b" would have been decoded from "3%47b" prior to going through the linter). Tagging @mgh128 and @philarcher for input.
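As a sketch of the encoder behaviour described above ("add 'LENGTH (m)' of 123.45" becoming AI 3112 with value 012345): the mini-dictionary and helper name here are hypothetical, with only the 311n / N6 format taken from the AI definitions.

```python
from decimal import Decimal

# Hypothetical mini-dictionary: AI "stem" keyed by data title. The real
# Syntax Dictionary carries far more detail; this is illustrative only.
DECIMAL_AIS = {"LENGTH (m)": "311"}  # 311n, value format N6

def encode_decimal_ai(title: str, value: str) -> tuple[str, str]:
    """Derive the full AI and fixed-width digits from a title and value."""
    stem = DECIMAL_AIS[title]
    d = Decimal(value)
    decimals = -d.as_tuple().exponent          # digits after the point
    digits = str(d).replace(".", "").zfill(6)  # N6 format, zero padded
    return stem + str(decimals), digits

print(encode_decimal_ai("LENGTH (m)", "123.45"))
```

The fourth digit of the AI is simply the count of decimal places, which is all the encoder needs beyond the title-to-stem lookup.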
I'll try to give it some thought, but I don't think the goals of decomposing AIs into components for the purpose of validation and for the purpose of semantics/presentation are necessarily well aligned. So NAK on this particular suggestion for the time being, but I too would like to see this, and the broader issue of encoding/decoding from raw values to "presentation formats", comprehensively tackled. Whether the solution involves a common "semantic decomposition of the AI" step (modification of the current component scheme to serve both purposes) or the use of separate component specifications for validation vs formatting, I'm not sure yet.

I agree that we should take this and the wider issue to the Web Tech SMG. I'll try to work up a white paper for discussion, but it won't happen this month and I want to solicit requirements from the other open-source stakeholders in the Syntax Dictionary prior to growing the scope.

Thanks also for your helpful thoughts on percent encoding. I'll try to lay out the issue of generic encoding/decoding in the white paper. I don't want to get too focused on one particular encoding method; rather, it would be helpful to review what is required to build a framework for tackling the general issue. Happy to hear what others think, of course. But it may be a few weeks before I can devote the required attention.
Thanks. I'll happily contribute to the white paper.
Hi Kevin, Terry, Phil,

What Kevin is proposing sounds very similar to work on semantics of GS1 Application Identifiers, which we started in the work on GS1 Digital Link but which is incomplete and is more broadly applicable, irrespective of whether element string or GS1 Digital Link URI syntax is used. In addition to chapter 9 of https://ref.gs1.org/standards/digital-link/, these two draft documents may be helpful:

https://docs.google.com/document/d/1htR_74P0-SGKQoCvtW_l5CWmGCbmp09S-MC0DwdDHD4/edit?usp=drivesdk
https://docs.google.com/document/d/1XpvaD7H_KbSPPU6ISmFlSOrwimYGRRgtRkEQnh5NnyA/edit?usp=drivesdk

I think that rather than overloading / bloating the Barcode Syntax Resource dictionary with details of semantics, we should instead extend the GS1 ATA dataset ( https://ref.gs1.org/ai/GS1_Application_Identifiers.jsonld ) because that is less constrained by memory size and more easily ingestible, being JSON/JSON-LD, without requiring a dedicated parser for a particular compact text format.

There's also an incomplete prototype toolkit for semantics of GS1 AIs within https://gs1.github.io/GS1DigitalLinkCompressionPrototype/ or https://github.com/gs1/GS1DigitalLinkCompressionPrototype

Of course, all of this should be done within GSMP, even if some of us further develop ideas and prototypes that later feed into a formal reviewed standardisation effort on this topic.

Best wishes,
Mark
Whether a given application standard declares that DL URI syntax can or can't be used is, I think, a second-order issue best left to the developers of those documents. Our task is, I think, to provide the solid groundwork. Looking at the very early work being done on the 'Identification Standard' that is targeted for the Jan '26 GenSpecs (perhaps with some bits appearing in Jan '25), there is no mention there of semantics. Nor do I think they'll appear, as I had thought some time ago. @PetaDing-GS1 might be able to comment at a future date.

What we're talking about here is machine-readable reference files and some core open-source libraries that use them. I feel more comfortable defining the reference files first and thinking about software functions second. The BSR is focused on the AIDC world, in which URI syntax is effectively alien, so converting between the AIDC and online worlds is an important function.

I'd hope we can avoid double percent encoding. It's only needed in URI syntax, so it really only comes up when generating or parsing a DL URI, no? If the value to be encoded in AIDC-land is

To cut to the chase, I agree with @mgh128 that we should complete the work of defining the semantics of each AI and make sure that's part of the ATA. I had thought this should probably be done by the ID SMG, but in reality it probably needs to be done primarily in the Web Tech SMG with a strong liaison with the ID SMG, with an end goal to include all the semantic definitions - which could well include the kind of thing @KDean-Dolphin began this thread with, presentation detail and so on. SHACL supports regular expressions, so you could go all the way from 3103 to a property

We need to have a clear idea of the status of the ATA; AFAIK, that hasn't been formalized yet. But that wouldn't stop us completing the work to define the semantics and, I hope, creating tools that convert between a bunch of AIs and their values and a block of JSON-LD (and vice versa, of course).
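A toy sketch of the AIs-to-JSON-LD direction mentioned above (Python; the property and unit-code names are assumptions for illustration, not a published mapping):

```python
import json

# Illustrative mapping from AI stem to a vocabulary property and unit
# code. The names below are assumptions, not a standardised mapping.
SEMANTICS = {"310": ("gs1:netWeight", "KGM")}  # 310n: net weight, kg

def ai_to_jsonld(ai: str, value: str) -> dict:
    """Turn one decimal AI-value pair into a JSON-LD-style fragment."""
    prop, unit = SEMANTICS[ai[:3]]
    decimals = int(ai[3])               # last AI digit = decimal places
    number = int(value) / 10 ** decimals
    return {prop: {"@value": f"{number:.{decimals}f}",
                   "gs1:unitCode": unit}}

print(json.dumps(ai_to_jsonld("3103", "000500")))
```

A real tool would of course emit an `@context` and handle the full AI table; the point is that the decimal semantics are mechanically derivable from the AI itself.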
Correct. The BSR performs no interpretation of raw AI values during data processing. They are effectively opaque during the data conversion operations because no process has thus far required them to be interpreted as anything other than the raw byte values that they are. Avoiding double escaping during conversion to/from a GS1 DL URI would be the first process that introduces a requirement to consider the raw AI value to be a representation of some more fundamental value.
Consider the AI element string:

In a world where we want to avoid double escaping of percent-encoded AI values within DL URIs, we have no choice but to deal with the concept of an intermediate "interpretation" of AIs' raw value content, i.e. the platonic meaning outside of any particular encoding scheme (or UTF-8, which is used as the universal proxy for such non-worldly purposes in many systems). So the breakdown of the above element string is as follows:
In the above, (4300) has successfully avoided double escaping (decode to an interpreted value, then re-encode for the URI; equivalent to "lift and shift"); (99) isn't double-escaped because it wasn't percent-escaped in the first instance. It shows that in a world where the GS1 system assigns meaning to raw AI values, and where it is that meaning (rather than the raw AI value) that is respected during format conversions, we have to treat the conversion process for AIs differently depending upon whether or not their raw value has a percent-encodable representation. Concretely, when a reader operating the GS1 DL URI to element string shim encounters the following URI:
Were it not to take the percent-encodability of the AIs into account then it would arrive at the following element string:
Which, I note, is not where we started: (4300) has been corrupted from "ABC%40XYZ" to "ABC@XYZ" via the round trip to a GS1 DL URI. To avoid such issues, the shim would need to recognise that (4300) is a percent-encodable AI and handle it specially. Dynamic tables mean that it is no longer a static algorithm suitable for embedding.
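The corruption described above reproduces in a few lines (Python; values taken from the example):

```python
from urllib.parse import quote, unquote

# Raw pcenc AI (4300) value: "@" is outside CSET 82, so the raw value
# carries the escape "%40" (interpreted value "ABC@XYZ").
raw_4300 = "ABC%40XYZ"

# Double-escape-avoiding DL URI writer: decode the raw value to its
# interpreted form, then URI-encode once.
uri_param = quote(unquote(raw_4300), safe="")   # "ABC%40XYZ"

# Naive shim: treats every parameter identically, percent-decoding it
# back to what it believes is the raw AI value.
shim_raw = unquote(uri_param)                   # "ABC@XYZ"

assert shim_raw != raw_4300   # (4300) corrupted by the round trip
```

A non-pcenc AI like (99) survives the same pipeline untouched, which is exactly why the shim would need an AI table to tell the two cases apart.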
I'm struggling to understand the problem with percent encoding.

Let me start with the "regular" AIs, i.e., those that don't have "pcenc" as a specification decorator. In non-DLURI representations (HRI and barcode), the serial number

When putting serial number

AI 4300 and its ilk are different. The domain of the ship-to company name is pretty much any UTF-8 character, and we run into a different set of rules. GS1 barcodes don't support characters outside the CSET 82 character set, so such strings have to be encoded in some way, and percent encoding is a logical choice (with special handling for the tilde '~'). If you're using a generic library, it wastes some space by encoding characters that don't need to be encoded for a barcode, but that's a reasonable tradeoff. That same representation would go into the HRI format because we don't really expect the element string to be processed by humans. The non-HRI representation, however, would be in full UTF-8. Although the GenSpecs is silent on the subject, it would be nonsensical for the percent-encoded version to be printed alongside "SHIP TO COMP" on the label. We see this already with things like the expiry date, which can be printed in non-HRI as a full date.

Putting AI 4300 into a DLURI requires percent encoding just like any other AI, and we don't care about the tilde in that case. What we don't need to do is percent-encode it as if it's going into a barcode and then percent-encode it again to put it into the DLURI.

This does mean, though, that any validator needs to be aware of the representation that it's validating. For any AI that doesn't have "pcenc" as a specification decorator, the validator won't care. For AI 4300 etc., the validator can apply the validation to a barcode data stream or HRI text and accept it verbatim for a DLURI. This can leave a gap, though, as it would be possible to generate an attribute that, if percent-encoded, would exceed the length limitation.
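The length gap in the last paragraph can be shown concretely (Python; the 35-character limit for AI 4300 is an assumption to be checked against the GenSpecs):

```python
from urllib.parse import quote

MAX_LEN = 35   # assumed limit for AI 4300; verify against the GenSpecs

# An 8-character company name: fine when validated against the decoded
# (interpreted) form...
name = "株式会社グリーン"
assert len(name) <= MAX_LEN

# ...but each character becomes three percent-encoded UTF-8 octets
# (nine characters), so the raw (pcenc) AI value blows past the limit.
encoded = quote(name, safe="")
assert len(encoded) > MAX_LEN
```

So a validator that only ever sees the decoded form can pass a value whose encoded form is unrepresentable in a barcode.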
A better solution might be something like this:
To Terry's point:
I think there is still a static algorithm. Let's assume that the input to the algorithm is a set of AI-value pairs. For all except the "pcenc" AIs, constructing the barcode data stream or HRI is exactly the same as before; if an AI is percent-encoded, then there's an additional step before adding the value to the barcode data stream or HRI. It's much more a documentation issue and yes, there are legacy applications to deal with, but I think that any legacy application that's going to get into Scan4Transport or patient demographics will have more to deal with than "Oh, by the way, there are some new rules...". In short:
Numerous AIs are numeric with a decimal point implied by the last digit of the AI.
Automatic encoding and decoding would be aided by adding a "decimal" linter type, i.e.,:
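The exact syntax proposed is elided above; as an illustration of the decoding such a "decimal" annotation could drive (the table and helper name below are hypothetical):

```python
# Hypothetical table: AI stems whose fourth digit gives the implied
# decimal-point position (value format N6 for these examples).
DECIMAL_STEMS = {"310", "311", "312"}   # net weight, length, width, ...

def decode_decimal(ai: str, digits: str) -> float:
    """Shift in the implied decimal point given by the AI's last digit."""
    assert ai[:3] in DECIMAL_STEMS and digits.isdigit()
    return int(digits) / 10 ** int(ai[3])

assert decode_decimal("3112", "012345") == 123.45
assert decode_decimal("3103", "000500") == 0.5
```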