Data Normalization and OSIM work #63

davaya · 2024-11-06T20:02:27Z

The MITRE white paper "Data Normalization Challenges and Mitigations in SBOM Processing" highlights technical challenges in automating the production of SBOMs, including:

... interoperability between different SBOM standards, handling missing
information, imprecise definitions of SBOM elements, multiple formats for SBOM
elements (e.g., component name, version), and difficulties with ingesting/parsing data in
producing SBOM elements.

As an example, Section 4.3.3 Version observes: "There are many variations in how product versions are named, identified, and cited", including Numbers, Dates, Code Names, Version Indicators, and Git hashes.

The NTIA framing document suggests formats and sources for obtaining content and observes that a common approach is to create a set of canonical names/representations, but with respect to version it says:

As there is a wide range of versioning schemes in use, recording what is provided from the supplier accurately is the primary goal. Semantic versioning is preferred. Git hashes are also acceptable.
As a minimum expectation, declare the version string as provided by the supplier.

An information model cannot do much about bad, missing, or inconsistent input data, but it can classify the data it does find, canonicalize what it can, and flag data that cannot be classified. For example, a Version type could be defined as a Choice among known formats; a value matching the SemVer option would be classified as a SemVer and broken out into major, minor, and patch components. And instead of following the common practice of giving up and declaring an SBOM Version to be a "String", the model would flag examples like "four score and seven years ago, our fathers ...", which is a valid string but not any recognizable Version format.
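To make the Choice idea concrete, here is a minimal Python sketch, not a proposed OSIM model. The type names (SemVer, GitHash, DateVersion, Unclassified), the function classify_version, and the three regular expressions are all hypothetical and deliberately simplified; a real model would need more formats and a stricter SemVer grammar.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union

# Simplified patterns (assumptions for illustration, not normative grammars).
SEMVER_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?$"
)
DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
GIT_HASH_RE = re.compile(r"^[0-9a-f]{7,40}$")


@dataclass(frozen=True)
class SemVer:
    major: int
    minor: int
    patch: int
    prerelease: Optional[str] = None
    build: Optional[str] = None


@dataclass(frozen=True)
class DateVersion:
    value: date


@dataclass(frozen=True)
class GitHash:
    value: str


@dataclass(frozen=True)
class Unclassified:
    raw: str  # kept verbatim so the supplier's string is never lost


# The "Choice": a version is exactly one of these alternatives.
Version = Union[SemVer, DateVersion, GitHash, Unclassified]


def classify_version(raw: str) -> Version:
    """Classify a supplier-provided version string, or flag it as Unclassified."""
    text = raw.strip()

    m = SEMVER_RE.match(text)
    if m:
        return SemVer(int(m[1]), int(m[2]), int(m[3]), m[4], m[5])

    m = DATE_RE.match(text)
    if m:
        try:
            return DateVersion(date(int(m[1]), int(m[2]), int(m[3])))
        except ValueError:
            pass  # looks like a date but is not a valid calendar day

    if GIT_HASH_RE.match(text.lower()):
        return GitHash(text.lower())

    return Unclassified(raw)
```

With these assumptions, classify_version("1.4.2") yields a SemVer broken out into its components, while the "four score and seven years ago" string comes back as Unclassified with the original text preserved, so it can be flagged rather than silently accepted.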

Questions:

Should the OSIM TC attempt to tackle the SBOM Data Normalization Challenge?
If so, who will do the work, and who are the stakeholders?

sparrell · 2024-11-06T20:19:57Z

I'm not sure I'd put it at the top of the list, but I do think we should address it. I do think we should 80/20 and not worry about the edge cases that would put us down too many rabbit holes. I don't think we need a global versioning system - we just need to extract the version information to make it useful within the ecosystem it works in. The objective is unambiguous identification of a piece of software - including version. For at least 80% of the most-needed apps, a given piece of software uses whatever its ecosystem uses and it only needs to be internally consistent. So linux can use linux_6.5, linux_6.6; and Phoenix LiveView can use SemVer and it doesn't matter that they are using something different. So I think your list of common formats is a good one, with a generic text string as a catch-all. And we probably should make it extensible. We don't even need to figure it out automagically - it could be 'hardcoded' for each ecosystem.
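As a rough illustration of the 'hardcoded per ecosystem' idea, here is a small Python sketch. Everything in it is an assumption made for the example: the SCHEMES registry, the ecosystem keys "linux" and "hex", the parser functions, and the shape of the returned record are hypothetical, not part of any OSIM or SBOM specification.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Parsed = Optional[Tuple[int, ...]]


def parse_linux(version: str) -> Parsed:
    """Parse strings of the form linux_<major>.<minor>, e.g. linux_6.5."""
    m = re.fullmatch(r"linux_(\d+)\.(\d+)", version)
    return tuple(int(g) for g in m.groups()) if m else None


def parse_semver_core(version: str) -> Parsed:
    """Parse only the major.minor.patch core of a SemVer string."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    return tuple(int(g) for g in m.groups()) if m else None


# Hypothetical registry: one hardcoded scheme per ecosystem, extensible by
# adding entries. Unknown ecosystems fall back to the raw text string.
SCHEMES: Dict[str, Callable[[str], Parsed]] = {
    "linux": parse_linux,
    "hex": parse_semver_core,  # assumed key for Elixir packages
}


def identify(ecosystem: str, name: str, version: str) -> dict:
    """Return an identification record; 'parsed' is None if nothing matched."""
    parser = SCHEMES.get(ecosystem)
    parsed = parser(version) if parser else None
    return {
        "ecosystem": ecosystem,
        "name": name,
        "version": version,  # supplier string always kept as provided
        "parsed": parsed,
    }
```

Versions are only compared within one ecosystem, so identify("linux", "linux", "linux_6.5") and identify("hex", "phoenix_live_view", "1.0.0") can use different schemes without conflict, and anything unrecognized still carries its original string as the catch-all.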
