Data Normalization and OSIM work #63

Open
davaya opened this issue Nov 6, 2024 · 1 comment

davaya commented Nov 6, 2024

The MITRE white paper "Data Normalization Challenges and Mitigations in SBOM Processing" highlights technical challenges in automating the production of SBOMs, including:

... interoperability between different SBOM standards, handling missing
information, imprecise definitions of SBOM elements, multiple formats for SBOM
elements (e.g., component name, version), and difficulties with ingesting/parsing data in
producing SBOM elements.

As an example, Section 4.3.3 (Version) observes: "There are many variations in how product versions are named, identified, and cited", including Numbers, Dates, Code Names, Version Indicators, and Git hashes.

The NTIA framing document suggests formats and sources for obtaining content and observes that a common approach is to create a set of canonical names/representations, but with respect to version it says:

As there is a wide range of versioning schemes in use, recording what is provided from the supplier accurately is the primary goal. Semantic versioning is preferred. Git hashes are also acceptable.
As a minimum expectation, declare the version string as provided by the supplier.

An information model cannot do much about bad, missing, or inconsistent input data, but it can classify the data it does find, canonicalize what it can, and flag what cannot be classified. For example, a Version type could be defined as a Choice among known formats: a value matching the SemVer option would be classified as SemVer and broken out into major, minor, and patch components. In contrast to the common practice of giving up and declaring an SBOM Version to be a "String", such a model would flag examples like "four score and seven years ago, our fathers ...", which is a valid string but not any recognizable Version format.
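
To make the idea concrete, here is a minimal sketch of such a classifier in Python. It is only an illustration, not a proposed OSIM design: the `Version` class, the `classify_version` function, the `kind` labels, and the choice of three formats are all assumptions made for this example, and the SemVer pattern is a simplification of the full specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Simplified patterns (assumptions for this sketch, not normative definitions).
SEMVER = re.compile(
    r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?"
)
GIT_HASH = re.compile(r"[0-9a-f]{40}")    # full SHA-1 only; short hashes are ambiguous
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # checks the shape, not the calendar


@dataclass
class Version:
    """One 'Choice' among known formats, plus the supplier's original string."""
    kind: str                     # "semver", "git_hash", "date", or "unclassified"
    raw: str                      # always keep what the supplier provided
    major: Optional[int] = None   # populated only for kind == "semver"
    minor: Optional[int] = None
    patch: Optional[int] = None


def classify_version(raw: str) -> Version:
    """Classify a version string, or flag it as unclassified."""
    text = raw.strip()
    m = SEMVER.fullmatch(text)
    if m:
        return Version("semver", raw, int(m.group(1)), int(m.group(2)), int(m.group(3)))
    if GIT_HASH.fullmatch(text):
        return Version("git_hash", raw)
    if ISO_DATE.fullmatch(text):
        return Version("date", raw)
    # A valid string, but not a recognizable version format: flag it.
    return Version("unclassified", raw)
```

With this sketch, `classify_version("1.2.3")` is classified as `semver` with its three components broken out, while the "four score and seven years ago" string comes back as `unclassified` with the raw text preserved, which still satisfies the NTIA minimum of declaring the version string as provided by the supplier.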

Questions:

  • Should the OSIM TC attempt to tackle the SBOM Data Normalization Challenge?
  • If so, who will do the work, and who are the stakeholders?

sparrell commented Nov 6, 2024

I'm not sure I'd put it at the top of the list, but I do think we should address it. I think we should take an 80/20 approach and not worry about the edge cases that would take us down too many rabbit holes. I don't think we need a global versioning system - we just need to extract the version information to make it useful within the ecosystem it works in. The objective is unambiguous identification of a piece of software - including version. For at least 80% of the most-needed apps, a given piece of software uses whatever its ecosystem uses, and it only needs to be internally consistent. So Linux can use linux_6.5, linux_6.6, and Phoenix LiveView can use SemVer, and it doesn't matter that they use something different. So I think your list of common formats is a good one, with a generic text string as a catch-all. And we probably should make it extensible. We don't even need to figure it out automagically - it could be 'hardcoded' for each ecosystem.
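
One way the "hardcoded per ecosystem, extensible, text catch-all" approach could look is sketched below in Python. Everything here is hypothetical and for illustration only: the ecosystem keys (`"linux"`, `"hex"`), the parser functions, the `register` and `identify` helpers, and the output dictionary layout are assumptions, not an agreed format.

```python
import re
from typing import Callable, Dict

Parser = Callable[[str], dict]


def parse_semver(raw: str) -> dict:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build metadata)."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", raw)
    if not m:
        raise ValueError(f"not SemVer: {raw!r}")
    return {"format": "semver", "major": int(m.group(1)),
            "minor": int(m.group(2)), "patch": int(m.group(3))}


def parse_linux(raw: str) -> dict:
    """Parse the 'linux_6.5' style used in the example above."""
    m = re.fullmatch(r"linux_(\d+)\.(\d+)", raw)
    if not m:
        raise ValueError(f"not a linux_X.Y version: {raw!r}")
    return {"format": "linux", "major": int(m.group(1)), "minor": int(m.group(2))}


def parse_text(raw: str) -> dict:
    """Catch-all: keep the supplier's string unchanged."""
    return {"format": "text", "value": raw}


# Hardcoded per ecosystem; the keys are hypothetical labels for this sketch.
ECOSYSTEM_PARSERS: Dict[str, Parser] = {
    "linux": parse_linux,
    "hex": parse_semver,
}


def register(ecosystem: str, parser: Parser) -> None:
    """Extension point: add or override the parser for an ecosystem."""
    ECOSYSTEM_PARSERS[ecosystem] = parser


def identify(ecosystem: str, name: str, raw_version: str) -> dict:
    """Identify software within its own ecosystem, falling back to plain text."""
    parser = ECOSYSTEM_PARSERS.get(ecosystem, parse_text)
    try:
        version = parser(raw_version)
    except ValueError:
        version = parse_text(raw_version)
    return {"ecosystem": ecosystem, "name": name, "version": version}
```

No automatic format detection is attempted: the caller states the ecosystem, and anything that does not fit that ecosystem's scheme, or that comes from an unregistered ecosystem, is kept as a text string rather than rejected.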
