You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
TLDR; We are having issues figuring out where parantheses go in SPDX expressions. This is causing validation errors on ClearlyDefined curation pull requests. This is because of some quirks in how we validate license expressions. Figuring out where parentheses go is hard.
NOTE: If you find yourself asking "What was this written this way?", I wasn't there when it was written and I have no doubt the previous developers were doing the best they could with the resources that they had. It's also mostly worked for over two years, which reflects very well on them and their work.
Problem
When ClearlyDefined validates "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it says that it is an invalid SPDX string.
According to the SPDX documentation, using three "ors" is a valid license string to show a choice between three licenses.
ClearlyDefined should validate "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" as a valid license.
Background
When someone submits a curation to ClearlyDefined, the ClearlyDefined service validates the license(s) in the curation.
The ClearlyDefined service uses clearlydefined/spdx to validate license strings. It does this through calling SPDX.normalize on a submitted license string in a curation.
spdx.normalize
Currently, when we call SPDX.normalize, it says that "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" is not a valid license string.
First, we call the ``parse method``` on the expression.
The parse method takes the string and turns it into an abstract syntax tree. In the case of "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it turns it into an AST that looks like this:
And returns an SPDX string. In the case above, the stringify function would return "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)", rather than "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only".
It does this by recursively traversing the AST, dividing things into right and left.
The first time the code parses the above AST, it would divide it like this
It would see that the "Right" object has a conjunction of "or", so it would put parentheses around all of it. This is how we get "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)".
This isn't something we want to always be the case, we can have OR statements that do not need parentheses around them.
Modifying spdx.normalize
It took a few iterations, but eventually this modification to the code seemed to get things mostly right.
Should be parsed into "MIT OR (BSD-2-Clause OR BSD-3-Clause AND Unlicense)" (and in the original code it was). This led me to do some research into the rules of parentheses in SPDX license strings.
According to the SPDX documentation, there is a default order of precedence when there are not parentheses present in an SPDX string. (See "4) Order of Precedence and Parentheses" in the documentation).
So with this string "LGPL-2.1-only OR BSD-3-Clause AND MIT", the AND operator is applied before the OR operator. So the license represents a choice between "LGPL-2.1-only and the expression BSD-3-Clause" and "MIT", since the "and" is applied before the "Or".
According to the documentation, parentheses are used to express an order of precedence that is different from the default order. So "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)" states that the ORD operators should be applied before the AND operator, which is different from the default precedence.
I don't know that there is a way to know where parentheses go when translating an AST representing an SPDX expression into a string. It would depend on the intent of the person who originally wrote it.
How to fix?
There are a few options that I can think of:
Alter the AST to track where parentheses go. This should be possible, though it would be a significant change.
Do away with the AST and use a different method for parsing the license expression to see if it is valid. I imagine other projects are somehow comparing SPDX expressions and verifying that they are valid, it would be worthwhile to research how others handle this problem (we can't be the only one).
Something I'm not thinking of.
I'm inclined to start with researching how other projects validate SPDX strings and see if there is a good example we can follow (or, perhaps, a different library we can use), then make the decision about what approach to take to fix this.
What do others think?
The text was updated successfully, but these errors were encountered:
I appreciate your kind words re previous authors as I believe that was mostly my code moved from another repo... I do remember losing much of my remaining hair while working this out. IIRC, the core problem was less about validation and more about normalization to facilitate simple comparison. By normalizing everything into a canonical form, both in terms of parentheses and order, we can do simple compares to see if, for example, ISC AND MIT has already been seen as MIT AND ISC (simple case to illustrate).
I'm uncertain as to the actual error here. MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only) appears to be logically equivalent to MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only -- the addition of the parentheses does not change the logic and the input expression is successfully normalized. Normalized forms are, by definition, not necessarily the same as the original a straight comparison from input to normalized should not be used. In theory, normalization should throw an error if the input could not be normalized (ie., was not valid) but that's not happening here.
Could it be that the problem is in the code that calls normalize?
Having said that, it may be worth while biting the bullet and correctly following the precedence ordering as per the spec but beware that this is used in other places, for example, when collecting up all the discovered licenses and reducing them to just the unique equivalents.
I'm having some trouble with fixing a bug in https://github.com/clearlydefined/spdx
TLDR; We are having issues figuring out where parantheses go in SPDX expressions. This is causing validation errors on ClearlyDefined curation pull requests. This is because of some quirks in how we validate license expressions. Figuring out where parentheses go is hard.
NOTE: If you find yourself asking "What was this written this way?", I wasn't there when it was written and I have no doubt the previous developers were doing the best they could with the resources that they had. It's also mostly worked for over two years, which reflects very well on them and their work.
Problem
When ClearlyDefined validates "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it says that it is an invalid SPDX string.
According to the SPDX documentation, using three "ors" is a valid license string to show a choice between three licenses.
ClearlyDefined should validate "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" as a valid license.
Background
When someone submits a curation to ClearlyDefined, the ClearlyDefined service validates the license(s) in the curation.
The ClearlyDefined service uses clearlydefined/spdx to validate license strings. It does this through calling
SPDX.normalize
on a submitted license string in a curation.spdx.normalize
Currently, when we call SPDX.normalize, it says that "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only" is not a valid license string.
Let's take a look at that function:
And let's zoom in on this line:
First, we call the ``parse method``` on the expression.
The
parse
method takes the string and turns it into an abstract syntax tree. In the case of "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only", it turns it into an AST that looks like this:Then, once we have our result from
parse
,normalize
callsstringify
on the result.This appears to be the source of the error.
This function takes the AST as a parameter:
And returns an SPDX string. In the case above, the stringify function would return "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)", rather than "MPL-1.1 OR LGPL-2.1-only OR GPL-2.0-only".
It does this by recursively traversing the AST, dividing things into right and left.
The first time the code parses the above AST, it would divide it like this
When it hits this line of code:
It would see that the "Right" object has a conjunction of "or", so it would put parentheses around all of it. This is how we get "MPL-1.1 OR (LGPL-2.1-only OR GPL-2.0-only)".
This isn't something we want to always be the case, we can have OR statements that do not need parentheses around them.
Modifying spdx.normalize
It took a few iterations, but eventually this modification to the code seemed to get things mostly right.
However, this broke another one of the tests.
According to the test, this AST:
Should be parsed into "MIT OR (BSD-2-Clause OR BSD-3-Clause AND Unlicense)" (and in the original code it was). This led me to do some research into the rules of parentheses in SPDX license strings.
According to the SPDX documentation, there is a default order of precedence when there are not parentheses present in an SPDX string. (See "4) Order of Precedence and Parentheses" in the documentation).
So with this string "LGPL-2.1-only OR BSD-3-Clause AND MIT", the AND operator is applied before the OR operator. So the license represents a choice between "LGPL-2.1-only and the expression BSD-3-Clause" and "MIT", since the "and" is applied before the "Or".
According to the documentation, parentheses are used to express an order of precedence that is different from the default order. So "MIT AND (LGPL-2.1-or-later OR BSD-3-Clause)" states that the ORD operators should be applied before the AND operator, which is different from the default precedence.
I don't know that there is a way to know where parentheses go when translating an AST representing an SPDX expression into a string. It would depend on the intent of the person who originally wrote it.
How to fix?
There are a few options that I can think of:
I'm inclined to start with researching how other projects validate SPDX strings and see if there is a good example we can follow (or, perhaps, a different library we can use), then make the decision about what approach to take to fix this.
What do others think?
The text was updated successfully, but these errors were encountered: