Drop support for optional data values starting with "x_" #318
These have been cleaned up.
We use this for all CVEs where we have more fine-grained CWE data, so I do find it useful. I would probably also argue there are other cases where some data can be included in `x_` fields.
These are no longer used because …
Awesome, thanks for the response @mprpic, and I appreciate the cleanup. The request to drop "x_*" fields is not based on historical precedent at all. There may have been decisions made and discussions about the value and purpose of these optional fields. I'm just pointing out that, in practice, there is now one custom field from one CNA (plus the x_generator field), and the cost of writing an exception into parsers may not be worth the global benefit of letting anyone stick anything into structured data.

I am speaking from my experience parsing and using this data. Optional fields that can be applied inconsistently across records open parsers up to potential errors and even outright failure modes. While I have been forced to create an exception in my own code for this one source of JSON data, it would be nice (higher-quality, cleaner, more standardized data) if every JSON parser did not have to carry an exception that regex-checks field names for this data.
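To make the kind of parser exception being described concrete, here is a minimal Python sketch (an illustration, not the commenter's actual code) that recursively drops any `x_`-prefixed keys before a record enters the rest of a pipeline:

```python
def strip_x_fields(obj):
    """Recursively remove dict keys that start with 'x_'."""
    if isinstance(obj, dict):
        return {k: strip_x_fields(v) for k, v in obj.items()
                if not k.startswith("x_")}
    if isinstance(obj, list):
        return [strip_x_fields(item) for item in obj]
    return obj

# Hypothetical record shape for demonstration only.
record = {
    "cveMetadata": {"cveId": "CVE-0000-0000"},  # placeholder ID
    "x_legacyV4Record": {"data_version": "4.0"},
}
clean = strip_x_fields(record)
assert "x_legacyV4Record" not in clean
```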
Our CWE data could also be moved into a note of some sort or somewhere else, so I guess I'd be fine with dropping the `x_` fields.

Though, just out of curiosity, how are you parsing this data such that these extra fields would cause issues? In my mind, whatever you get from the record that validates against the schema is correct, and you get to choose whichever fields you want to make use of. You can completely ignore any attributes that start with `x_`.
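The "ignore anything starting with `x_`" contract can be illustrated with a toy JSON Schema (a simplified stand-in, not the actual CVE 5.x schema): `patternProperties` admits `x_`-prefixed extension keys with any value, while `additionalProperties: false` still rejects other unknown keys.

```python
from jsonschema import validate, ValidationError

toy_schema = {
    "type": "object",
    "properties": {"cveId": {"type": "string"}},
    "patternProperties": {"^x_": {}},  # any x_-prefixed key, any value
    "additionalProperties": False,
}

# Passes: x_generator matches the ^x_ pattern.
validate({"cveId": "CVE-0000-0000", "x_generator": {"engine": "demo"}}, toy_schema)

# Fails: 'surprise' matches neither properties nor patternProperties.
try:
    validate({"cveId": "CVE-0000-0000", "surprise": True}, toy_schema)
except ValidationError as err:
    print(err.message)
```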
I am trying to streamline and standardize all of my JSON intake, and I'm building out multiple layers/steps for each data-ingestion task. Meaning I have a single JSON-parsing layer that infers the data schema from the JSON data (since many JSON sources do not create schemas and/or maintain them properly). The challenge here is that JSON stores everything as a single stand-alone record, while for data analysis and knowledge extraction I need the data in a columnar format, which means extracting structured, normalized data and then storing it in parquet files. If one CVE record adds an "x_foobar" field, the parser would add an "x_foobar" column for every CVE record and leave it blank for all but one. Moving to a columnar format is very common for data analysis.

The main challenge I have with the "x_*" records is that any one of the few hundred CNAs can "legally" dump whatever structure they want, and I have to deal with it or code up an exception for this one particular source of data (which it seems I'll need to do for other data sources anyway). A great example of this is the decision to include a full copy of the v4 record in x_legacyV4Record: the first few passes I did on a generic JSON parser took twice as long to run (embarrassingly, my first parser took over 2 hours on the full collection of CVE data; excluding x_legacyV4Record it was 1h20m, and now it's down to under 10 minutes). That x_legacyV4Record section added several dozen (unnecessary) parquet files and generated several data-consistency errors (since many v4 records fail schema validation).

After saying all of that, my local solution isn't really what this issue is about. I'll have to work exceptions for weird data into my process anyway, and this suggested change to the 5.2 schema isn't going to affect my own implementation or solve a problem for me personally, because I'm going to code around it either way. Allowing optional data (perhaps complex data structures) with federated data collection presents a challenge to anyone trying to automatically parse the data. The fewer "surprises" we can keep out of the data, the better. Combined with the fact that nobody is taking any real advantage of these fields, this means the CVE program is supporting complexity with little (or no) reward. Hence I suggest we remove support for something that presents risks to consumers/parsers with very little benefit.
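A toy demonstration of the column-explosion problem described above (the field names are hypothetical): once records are flattened for a columnar store, a single record carrying `x_foobar` forces a mostly-empty column onto every other record.

```python
import pandas as pd

records = [
    {"cveId": "CVE-0000-0001", "title": "a"},
    {"cveId": "CVE-0000-0002", "title": "b", "x_foobar": {"vendor_only": 1}},
]

# json_normalize flattens nested objects into dotted column names.
df = pd.json_normalize(records)
print(df.columns.tolist())
# ['cveId', 'title', 'x_foobar.vendor_only'] -- the x_ column now exists
# for ALL rows, NaN everywhere except the one record that supplied it.

df.to_parquet("cves.parquet")  # requires pyarrow or fastparquet
```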
+1, your reasoning makes sense! Will you be sharing the final parquet files somewhere, btw? It would be neat to define some common queries on top of those files that could be run through DuckDB à la https://www.nikolasgoebel.com/2024/05/28/duckdb-doesnt-need-data.html :-)
My first pass at parsing CVE JSON into parquet files was in R and is rather slow, but those parquet files are in this repo: https://github.com/jayjacobs/visualcve/tree/main/cache/cveparse. The latest pass is in Python and a lot more robust (and much faster). That repo is the backend for the GitHub Pages site: https://jayjacobs.github.io/visualcve/

I had intended to refresh that repo fairly often, but so far that's maybe weekly if I'm lucky. :)
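For anyone wanting to try the DuckDB idea against that repo, a sketch along these lines should work, assuming the `httpfs` extension and a raw-file URL (the parquet file name below is a guess; substitute a real one from the `cache/cveparse` directory):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")  # needed for reading parquet over HTTP(S)
con.execute("LOAD httpfs;")

base = "https://raw.githubusercontent.com/jayjacobs/visualcve/main/cache/cveparse"
# 'cve.parquet' is a placeholder file name, not confirmed to exist.
print(con.execute(
    f"SELECT COUNT(*) FROM read_parquet('{base}/cve.parquet')"
).fetchall())
```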
Reference to #175 as well |
Having optional fields in structured data adds complexity for parsing at scale, AND these fields are not being used in any practical way right now.
There are two fields in use that are generated by the upconvert process; they exist either for legacy reasons or specifically for the CNA, not for the data or any end users.
There is one field added by what I assume are CVE JSON generators, and we should consider adding it as a standard field in the schema: x_generator.
And then there are three fields that may be isolated to Red Hat:
There are no other uses of any optional "x_" instances.
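For reference, a claim like this can be re-checked against a local snapshot of the records with a short script (the `cvelistV5/cves` path is an assumption about where a checkout of the published records lives):

```python
import json
from collections import Counter
from pathlib import Path

def walk_keys(obj, counts):
    """Recursively tally every 'x_'-prefixed key in a JSON structure."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k.startswith("x_"):
                counts[k] += 1
            walk_keys(v, counts)
    elif isinstance(obj, list):
        for item in obj:
            walk_keys(item, counts)

counts = Counter()
for path in Path("cvelistV5/cves").rglob("*.json"):  # assumed checkout path
    walk_keys(json.loads(path.read_text()), counts)

print(counts.most_common())
```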