Identify and classify non-standard open source licenses in GitHub repos #25

Open
irynastr opened this issue Aug 17, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

irynastr commented Aug 17, 2020

The goal is to identify the license used by GitHub repos whose license GitHub does not classify automatically. As stated in the GitHub API documentation (https://developer.github.com/v3/licenses/), GitHub uses the open source Ruby gem Licensee (https://github.com/licensee/licensee) to identify the license type. The Licensee documentation describes the process as follows:

Licensee automates the process of reading LICENSE files and compares their contents to known licenses using several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:

  1. If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
  2. If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.
  3. If we still can't match the license, we use a fancy math thing called the Sørensen–Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 95% similar to the MIT license, with the remaining 5% likely representing legally insignificant changes to the license text.
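
To make the three steps above concrete, here is a minimal Python sketch. It is not Licensee's actual implementation: the function names (`normalize`, `is_copyright_only`, `dice_coefficient`), the copyright regex, and the choice to compare sets of words are all assumptions made for illustration.

```python
import re

# Assumption: a copyright notice is any line starting with "Copyright".
# Real license files are messier than this pattern allows for.
COPYRIGHT_RE = re.compile(r"^\s*copyright\b.*$", re.IGNORECASE | re.MULTILINE)


def normalize(text: str) -> str:
    """Drop copyright lines, lowercase, and collapse all whitespace."""
    without_copyright = COPYRIGHT_RE.sub("", text)
    return " ".join(without_copyright.lower().split())


def is_copyright_only(text: str) -> bool:
    """Step 1: the file has a copyright notice and nothing more."""
    return bool(text.strip()) and normalize(text) == ""


def is_exact_match(text: str, known_license: str) -> bool:
    """Step 2: direct string comparison after normalization."""
    return normalize(text) == normalize(known_license)


def dice_coefficient(text: str, known_license: str) -> float:
    """Step 3: Sørensen–Dice coefficient over the two word sets.

    Defined as 2 * |A ∩ B| / (|A| + |B|), ranging from 0.0 to 1.0.
    """
    words_a = set(normalize(text).split())
    words_b = set(normalize(known_license).split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))
```

With this sketch, a file containing only `Copyright (c) 2015 Ben Balter` is caught by `is_copyright_only`, and a standard license text with a different copyright line in front of it is still caught by `is_exact_match`.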

We published a blog post analyzing open source license usage on GitHub (https://solutionshub.epam.com/blog/post/examining-open-source-license-usage), but the analysis includes a lot of repos that have a "custom" license. Our manual review of a selection of these shows that many organizations use licenses that are minor modifications of standard licenses, yet GitHub does not identify them automatically. If this could be improved, we would be able to better analyze the popularity of open source repos among commercial organizations as well as across all of GitHub.

The goal is to identify such licenses with a high level of probability, e.g. content of the LICENSE file is 90+% the same as the standard Apache license text.

Our suggestion would be to do this for the top 3-6 license types, which based on our analysis are Apache 2.0, BSD 3-clause, MIT, GPL 2.0, GPL 3.0, LGPL 2.1, EPL 1.0.
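
A rough sketch of what such a threshold-based classifier could look like is below. Everything in it is hypothetical: the `classify` function, the word-set similarity, and the default threshold of 0.9 (taken from the "90+%" figure above). The template strings in the usage note are toy placeholders, not real license texts.

```python
from typing import Dict, Optional, Tuple


def similarity(a: str, b: str) -> float:
    """Sørensen–Dice coefficient over lowercase word sets (an assumption)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def classify(
    license_text: str,
    templates: Dict[str, str],
    threshold: float = 0.9,
) -> Optional[Tuple[str, float]]:
    """Return (license name, score) for the best-matching template,
    or None when no template reaches the threshold."""
    best: Optional[Tuple[str, float]] = None
    for name, template in templates.items():
        score = similarity(license_text, template)
        if best is None or score > best[1]:
            best = (name, score)
    if best is None or best[1] < threshold:
        return None
    return best
```

In practice `templates` would map a license name such as `"MIT"` or `"Apache-2.0"` to the full standard text of that license. A file combining several licenses, like the mono example below, would most likely fall under the threshold and come back as `None`.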

Some examples:
https://github.com/MicrosoftDocs/microsoft-365-docs/blob/public/LICENSE - Creative Commons license
https://github.com/dotnet/runtime/blob/master/LICENSE.TXT - MIT license
https://github.com/IBM-Cloud/webapp-with-cos-and-cdn/blob/master/License.txt, https://github.com/IBM-Cloud/serverless-followupapp-ios/blob/master/License.txt - Apache 2.0
https://github.com/strongloop/loopback.io/blob/gh-pages/LICENSE, https://github.com/strongloop/loopback-next/blob/master/LICENSE - MIT license
https://github.com/mono/mono/blob/master/LICENSE - mix of licenses, so won't be possible to identify a single license type

irynastr added the "enhancement" (New feature or request) label Aug 25, 2020
patrickstephens1 changed the title from "Identify 'other' licenses on GitHub" to "Identify and classify non-standard open source licenses in GitHub repos" Aug 25, 2020
patrickstephens2 commented

I suggest posting this proposal as an issue in https://github.com/licensee/licensee. The problem is not OSCI-specific, and I think OSCI is probably not the place to solve it.
