v32.0.0 #3406

AyanSinhaMahapatra · 2023-05-22T22:04:24Z

v32 of ScanCode is all about improved license detections!

We have more licenses and rules, and major updates to how matches are post-processed into license detections.
We also have major improvements in package license detections and unknown references, along with top-level detection
summaries for licenses and reference data for the detected licenses. There are also a couple of API changes due to
model changes in license data.

See also https://github.com/nexB/scancode.io/ for a complete, customizable SCA solution using ScanCode and
https://github.com/nexB/scancode-workbench/releases for visualizing data generated by ScanCode Toolkit.

Important API changes:

This is a major release with major API and output format changes and significant
feature updates.

In particular the output format has changed for the licenses and packages, and
also for some of the command line options.

The output format version is now 3.0.0.

See https://github.com/nexB/scancode-toolkit/milestone/15 for more details on this release.

Package detection:

Update GemfileLockParser to track the gem which the Gemfile.lock is for,
which we assign to the new GemfileLockParser.primary_gem field. Update
GemfileLockHandler.parse() to handle the case where there is a primary gem
detected from a Gemfile.lock. If there is a primary gem, a single Package
is created and the detected gem data within the Gemfile.lock are assigned as
dependencies. If there is no primary gem, then all of the dependencies are
collected into a Package with no name and yielded.

See Repeated package and dependency results when scanning extracted rubygem #3072

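The two Gemfile.lock cases described above can be sketched roughly as follows. This is only an illustration with plain dicts, not the actual GemfileLockHandler code: the function name assemble_gemfile_lock_package and the dict keys are hypothetical.

```python
from typing import Optional


def assemble_gemfile_lock_package(primary_gem: Optional[dict], gems: list) -> dict:
    """
    Return one package mapping built from the gems found in a Gemfile.lock.
    Hypothetical helper: the name and the dict keys are illustrative only.
    """
    if primary_gem:
        # A primary gem was found: it becomes the package and every
        # other detected gem is attached to it as a dependency.
        name = primary_gem["name"]
        version = primary_gem.get("version")
        dependencies = [gem for gem in gems if gem["name"] != name]
    else:
        # No primary gem: collect all gems under a single package with no name.
        name = None
        version = None
        dependencies = list(gems)
    return {"type": "gem", "name": name, "version": version, "dependencies": dependencies}


gems = [{"name": "rake", "version": "13.0.6"}, {"name": "rspec", "version": "3.12.0"}]
with_primary = assemble_gemfile_lock_package({"name": "mygem", "version": "1.0"}, gems)
without_primary = assemble_gemfile_lock_package(None, gems)
```

In both cases a single package is produced; only the presence of a name and version differs.
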
Fix issue where dependencies were not reported when scanning an extracted
Python project by modifying BaseExtractedPythonLayout.assemble() to favor
using package data from a PKG-INFO file from an egg-info directory. Package
data from a PKG-INFO file from an egg-info directory contains the dependency
information collected from the requirements.txt file alongside PKG-INFO.

See No dependency results when scanning celery-5.2.7.tar.gz #3083

Fix issue where we were returning an incorrect purl package type for cocoapods:
pods was being returned as the purl type, but it should be cocoapods instead.
https://github.com/package-url/purl-spec/blob/master/PURL-TYPES.rst#cocoapods

See Incorrect purl type for cocoapods #3081

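Scan results produced before this fix may still carry the old purl type. A minimal sketch of normalizing such values, assuming purls are handled as plain strings; the helper name fix_cocoapods_purl and the sample package names are hypothetical:

```python
def fix_cocoapods_purl(purl: str) -> str:
    """Rewrite a purl that uses the old "pods" type to the "cocoapods" type."""
    old_prefix = "pkg:pods/"
    if purl.startswith(old_prefix):
        return "pkg:cocoapods/" + purl[len(old_prefix):]
    # Any other purl type is returned unchanged.
    return purl


fixed = fix_cocoapods_purl("pkg:pods/AFNetworking@4.0.1")
unchanged = fix_cocoapods_purl("pkg:gem/rake@13.0.6")
```
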
Code for parsing a Maven POM, npm package.json, FreeBSD manifest and haxelib
JSON has been separated into two functions: one that creates a PackageData
object from the parsed Resource, and another that calls the previous function
and yields the PackageData. This was done so that we can use the package
manifest data parsing code outside of the scancode-toolkit context in other
libraries.

The PackageData model now includes a holder field, which is populated with
holder data extracted from the copyright field if copyright data is present;
otherwise it remains empty.

See Add Copyright Holder information for Packages #3290

DatafileHandlers now have a classmethod named get_top_level_resources(),
which is intended to yield the top-level Resources of a Package codebase,
relative to a Package manifest file. maven.MavenPomXmlHandler is the first
DatafileHandler that implements this method.

License detection:

The SPDX license list has been updated to the latest v3.20.

This is a major update to license detection where we now combine one or more
license matches in a larger license detection. This approach improves the
accuracy of license detection and removes a large number of false positive
or ambiguous license detections. For details, see
RFC: a plan for false positive license detection #2878

There is a new license_detections codebase level attribute with all the
unique license detections in the whole scan, both in resources and packages.
Each entry has the three attributes also present in package/resource level license
detections: license_expression, identifier and detection_log
(present only if the --license-diagnostics option is enabled), with
one additional attribute:
- count: Number of times this unique license detection was encountered
  in the codebase.

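As a rough illustration of how such a list of unique detections with a count relates to per-file detections, the sketch below groups file-level detections by identifier. It is not ScanCode's implementation: the function name, the sample identifiers and the sample paths are made up, it only looks at files (not packages), and the attribute names simply follow the description above.

```python
from collections import Counter


def summarize_detections(files: list) -> list:
    """Group file-level license detections by identifier and count them."""
    counts = Counter()
    expressions = {}
    for file_data in files:
        for detection in file_data.get("license_detections", []):
            identifier = detection["identifier"]
            counts[identifier] += 1
            expressions[identifier] = detection["license_expression"]
    return [
        {"identifier": ident, "license_expression": expressions[ident], "count": n}
        for ident, n in counts.items()
    ]


# Made-up sample data: two files share one detection, a third has another.
files = [
    {"path": "a.c", "license_detections": [{"identifier": "mit-0001", "license_expression": "mit"}]},
    {"path": "b.c", "license_detections": [{"identifier": "mit-0001", "license_expression": "mit"}]},
    {"path": "c.c", "license_detections": [{"identifier": "apache-2.0-0002", "license_expression": "apache-2.0"}]},
]
unique_detections = summarize_detections(files)
```
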
The data structure of the JSON output has changed for licenses at file level:
- The licenses attribute is deleted.
- A new license_detections attribute contains license detections in that file.
  Each detection object has three attributes: license_expression, identifier
  and matches. matches is a list of license matches and is roughly
  the same as licenses in the previous version, with additional structure
  changes detailed below. identifier is the detected license expression with a
  UUID generated from the content of matches, such that it is unique for
  unique detections. There is also a detection_log attribute with
  diagnostics information if the --license-diagnostics option is enabled.
- A new attribute license_clues contains license matches with the
  same data structure as the matches attribute in license_detections.
  This contains license matches that are mere clues and were not considered
  to be a proper conclusive license detection.
- The license_expressions list of license expressions is deleted and
  replaced by a single detected_license_expression expression.
  Similarly spdx_license_expressions was removed and replaced by
  detected_license_expression_spdx.
- See the license updates documentation at
  https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-resource
  for examples and details.

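For code that consumes the JSON output, reading the new file-level attributes might look like the sketch below. The attribute names come from the list above; the function name, the sample file entry and its identifier value are hypothetical, and the fields inside each match are mostly elided.

```python
def summarize_file_licenses(file_data: dict) -> dict:
    """Collect the new file-level license attributes from one file entry."""
    detections = file_data.get("license_detections", [])
    clues = file_data.get("license_clues", [])
    return {
        "expression": file_data.get("detected_license_expression"),
        "spdx_expression": file_data.get("detected_license_expression_spdx"),
        "detection_ids": [d["identifier"] for d in detections],
        "match_count": sum(len(d["matches"]) for d in detections),
        "clue_count": len(clues),
    }


# Made-up file entry shaped after the description above.
sample_file = {
    "path": "src/example.c",
    "detected_license_expression": "mit",
    "detected_license_expression_spdx": "MIT",
    "license_detections": [
        {
            "license_expression": "mit",
            "identifier": "mit-00000000-0000-0000-0000-000000000001",
            "matches": [{"license_expression": "mit"}],  # other match fields elided
        }
    ],
    "license_clues": [],
}
summary = summarize_file_licenses(sample_file)
```

Code written against the previous format that iterated over licenses or license_expressions would need to switch to these attributes.
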
The data structure of license attributes in package_data and the codebase
level packages has been updated accordingly:
- There is a new license_detections attribute for the primary, top-level
  declared licenses of a package and an other_license_detections attribute
  for the other secondary detections.
- The license_expression is replaced by the declared_license_expression
  and other_license_expression attributes with their SPDX counterparts
  declared_license_expression_spdx and other_license_expression_spdx.
  These expressions are parallel to the detections.
- The declared_license attribute is renamed extracted_license_statement
  and is now a YAML-encoded string, which can be parsed to recreate the
  original extracted license statement. Previously this was a nested
  structure of Python lists, dicts and strings; now it is always a YAML string.
- See the license updates documentation at
  https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#change-in-license-data-format-package
  for examples and details.

The license matches structure has changed: we used to report one match for each
license key of a matched license expression. We now report instead one
single match for each matched license expression, and list the license keys
as a licenses attribute. This avoids data duplication.
Inside each match, we list the match and matched rule attributes directly,
avoiding nesting. See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#licensematch-result-data
for examples and details.

There are new codebase level attributes with --license-references to report
reference license metadata and texts once for each license matched across the
scan: license_references and license_rule_references, which list the unique
detected licenses and license rules. This reference data is also removed from
license matches at all levels, i.e. from codebase, package and resource level
license detections and resource level license clues, irrespective of whether
this CLI option is used, i.e. also by default with --license.
See the license updates documentation at
https://scancode-toolkit.readthedocs.io/en/latest/explanations/license-detection-reference.html#comparision-before-after-license-references
for examples and details.

We replaced the scancode --reindex-licenses command line option with a
new separate command named scancode-reindex-licenses.
- The --reindex-licenses-for-all-languages CLI option is also moved to
  the scancode-reindex-licenses command as an option --all-languages.
- We can now detect licenses using custom license texts and license rules
  stored in a directory or packaged as a plugin for consistent reuse and deployment.
- There is an --additional-directory option with the scancode-reindex-licenses
  command to add the licenses from a directory.
- There is also an --only-builtin option to use only builtin licenses,
  ignoring any additional license plugins.
- See Add support for "extra", e.g. private or local licenses #480 for more details.

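Scripts that called the removed options need to invoke the new command instead. A small sketch of that mapping, building argument lists suitable for subprocess.run(); the helper name new_reindex_command and the directory path are hypothetical, while the command and option names are the ones listed above:

```python
from typing import Optional

# Removed scancode options and the equivalent new invocation.
LEGACY_TO_NEW = {
    "--reindex-licenses": ["scancode-reindex-licenses"],
    "--reindex-licenses-for-all-languages": ["scancode-reindex-licenses", "--all-languages"],
}


def new_reindex_command(legacy_option: str, additional_directory: Optional[str] = None) -> list:
    """Return the argument list replacing a removed scancode reindex option."""
    if legacy_option not in LEGACY_TO_NEW:
        raise ValueError(f"not a legacy reindex option: {legacy_option}")
    command = list(LEGACY_TO_NEW[legacy_option])
    if additional_directory:
        # Also index custom licenses and rules stored in a directory.
        command.extend(["--additional-directory", additional_directory])
    return command


basic = new_reindex_command("--reindex-licenses")
all_languages = new_reindex_command("--reindex-licenses-for-all-languages")
with_custom = new_reindex_command("--reindex-licenses", "/path/to/custom-licenses")
# To actually run one of these: subprocess.run(basic, check=True)
```
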
We combined the license data file and text file of each license in a single
file with a .LICENSE extension. The .yml data file is now included at the
top of each .LICENSE file as "YAML frontmatter". The same applies to license
rules and their .RULE and .yml files. This halves the number of data files
from about 60,000 to 30,000. Git line history is preserved for the combined
text + yml files.
- See ScanCode contains too many data files #3049
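To illustrate the combined layout, the sketch below splits a file into its frontmatter and its text. It assumes the frontmatter is delimited by lines containing only "---", which is the usual convention for YAML frontmatter; the helper name, the sample content and its key and name fields are illustrative and not taken from an actual .LICENSE file. The frontmatter is returned as a raw string that a YAML parser could then load.

```python
def split_frontmatter(content: str) -> tuple:
    """Return (frontmatter, body) for text starting with '---' delimited frontmatter."""
    lines = content.splitlines()
    if not lines or lines[0].strip() != "---":
        # No frontmatter at all: everything is body text.
        return "", content
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return frontmatter, body
    # No closing delimiter found: treat everything as body text.
    return "", content


sample = "---\nkey: mit\nname: MIT License\n---\n\nPermission is hereby granted...\n"
frontmatter, body = split_frontmatter(sample)
```
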
There is a new console script scancode-license-data to export
license data in JSON, YAML and HTML, with indexes and a static website for use
in the licensedb web site. This becomes the API to get ScanCode license
data.

See Add a command line option to dump the license data #2738

The deprecated --is-license-text option has been removed.
This is now built-in with the --license-text and --info options
and exposed with the percentage_of_license_text attribute.

The license dump() has been modified to add an extra space at empty
newlines for license files which also have multiple indentation levels,
as this was generating invalid YAML output files when --license-text
or --license-references was enabled.

See scancode license scan produces invalid yaml #3219

A bug has been fixed in the --unknown-licenses option, where
we would crash when using this option without the --matched-text
option. This now works correctly and is also better tested.

See crash when using --unknown-licenses #3343

What's Changed

Add support for external licenses in scans by @KevinJi22 in #2979
Separate Package parsing functions by @JonoYang in #3135
Update docs for deprecated and other options #3126 by @AyanSinhaMahapatra in #3127
Add license dump option by @AyanSinhaMahapatra in #3100
Combine license matches in new LicenseDetection by @AyanSinhaMahapatra in #2961
Fix issue 3155 by running scancode-reindex-licenses subcommand instead of using --reindex-licenses flag by @abhi-kr-2100 in #3159
Detect wurfl commercial license by @pombredanne in #3163
Do not use packaging.LegacyVersion #3171 #3177 by @pombredanne in #3180
More License Detection changes by @AyanSinhaMahapatra in #3154
docs(fix): how to install Py. 3.8 on recent Ubuntu by @camillem in #3146
Add links to basic options in docs by @AyanSinhaMahapatra in #3142
install.rst: spelling by @vargenau in #3184
Release 32.0.0rc1 prep by @AyanSinhaMahapatra in #3150
Remove deprecated images from CI and release-script by @AyanSinhaMahapatra in #3099
Fix unhashable type error in cyclonedx #3016 by @AyanSinhaMahapatra in #3189
Update license db generation by @AyanSinhaMahapatra in #3197
Remove license text from index.json of licenseDB by @AyanSinhaMahapatra in #3201
Support python 3.11 by @AyanSinhaMahapatra in #3199
Properly assign boolean to is_resolved #3152 by @JonoYang in #3153
Vendor attrs to avoid unpickle issues #3179 #3192 by @pombredanne in #3193
Remove trailing T in date by @pombredanne in #3203
Restore help.html from nexB/scancode-licensedb#23 by @AyanSinhaMahapatra in #3202
adapt code to new spdx-tools release by @meretp in #3173
Add nuget nuspec dependencies by @pombredanne in #3206
Fix release scripts by @AyanSinhaMahapatra in #3208
Fix attrs version in requirements by @AyanSinhaMahapatra in #3209
Work around heisen-failures in CI by @pombredanne in #3207
Add HERE Proprietary rule for pom.xml files by @bennati in #3212
Add required phrase to JSR rule by @bennati in #3218
Fix choking license detection post-processing #3245 by @AyanSinhaMahapatra in #3247
Build app archives for all python versions by @AyanSinhaMahapatra in #3232
Bump version to v32.0.0rc2 by @AyanSinhaMahapatra in #3262
Add new and improve existing licenses by @AyanSinhaMahapatra in #3271
Improve License Detection reporting by @AyanSinhaMahapatra in #3286
Release v32.0.0rc3 prep by @AyanSinhaMahapatra in #3291
Fix #3250: Invalid SPDX with empty file: no SHA1 by @vargenau in #3279
Add docs, changelog and authors in CONTRIBUTION and fix typos and errors by @shricodev in #3204
Silence pyicu warning by @AyanSinhaMahapatra in #3280
Fix licenses in HTML output by @AyanSinhaMahapatra in #3275
Fix misc license detection related bugs by @AyanSinhaMahapatra in #3299
Add copyright holder field to PackageData model by @keshav-space in #3302
Merge latest skeleton into scancode by @AyanSinhaMahapatra in #3305
New licenses and license rules by @AyanSinhaMahapatra in #3309
Update documentation for v32 by @AyanSinhaMahapatra in #3292
Get valid yaml output by @AyanSinhaMahapatra in #3220
Fix-up the category of the 'ms-cla' license by @fviernau in #3318
Release prep V32.0.0rc4 by @AyanSinhaMahapatra in #3336
Update release script to remove ubuntu18 by @AyanSinhaMahapatra in #3337
Update doc to reference attrib in AbcTK by @chinyeungli in #3252
Add new proprietary license detection rule by @ninad365 in #3234
Only trigger license rule with Freetype by @pombredanne in #3227
Fix unknown license detection by @AyanSinhaMahapatra in #3345
Fix typo #3363 by @JonoYang in #3364
Port v31.2.5 hotfix by @pombredanne in #3351
Add get_top_level_resources() to DatafileHandler class by @JonoYang in #3315
3396 update get license detections and expression by @JonoYang in #3397
Bump commoncode version to 31.0.2 by @JonoYang in #3399
Do not set version to empty string in npm_api_url #3393 by @JonoYang in #3398
Format extracted_license_statement as YAML by @AyanSinhaMahapatra in #3402
Release prep v32 by @AyanSinhaMahapatra in #3405

New Contributors

@abhi-kr-2100 made their first contribution in Fix issue 3155 by running scancode-reindex-licenses subcommand instead of using --reindex-licenses flag #3159
@camillem made their first contribution in docs(fix): how to install Py. 3.8 on recent Ubuntu #3146
@vargenau made their first contribution in install.rst: spelling #3184
@meretp made their first contribution in adapt code to new spdx-tools release #3173
@bennati made their first contribution in Add HERE Proprietary rule for pom.xml files #3212
@shricodev made their first contribution in Add docs, changelog and authors in CONTRIBUTION and fix typos and errors #3204
@keshav-space made their first contribution in Add copyright holder field to PackageData model #3302
@ninad365 made their first contribution in Add new proprietary license detection rule #3234

Full Changelog: v31.2.4...v32.0.0

This discussion was created from the release v32.0.0.
