The goal is to identify the license used by GitHub repos that GitHub does not classify automatically. As stated in the GitHub API documentation (https://developer.github.com/v3/licenses/), GitHub uses the open source Ruby gem Licensee (https://github.com/licensee/licensee) to identify the license type.
Licensee automates the process of reading LICENSE files and compares their contents to known licenses using several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:
If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.
If we still can't match the license, we use a fancy math thing called the Sørensen–Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 95% similar to the MIT license, with the remaining 5% likely representing legally insignificant changes to the license text.
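The three steps above can be sketched roughly as follows. This is a minimal illustration, not Licensee's actual implementation: the function names, the word-set tokenization, the copyright-line pattern, and the 0.9 threshold are all assumptions made for the example, and the reference texts must be supplied by the caller.

```python
import re

# Assumed pattern for a copyright notice line, e.g. "Copyright (c) 2015 Ben Balter".
COPYRIGHT_LINE = re.compile(
    r"^[ \t]*copyright\s+(\(c\)|©|\d{4}).*$", re.IGNORECASE | re.MULTILINE
)


def strip_copyright(text: str) -> str:
    """Remove copyright notice lines, lowercase, and collapse whitespace."""
    without_notice = COPYRIGHT_LINE.sub("", text)
    return " ".join(without_notice.lower().split())


def dice_similarity(a: str, b: str) -> float:
    """Sørensen-Dice coefficient computed over the two texts' sets of words."""
    words_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not words_a and not words_b:
        return 1.0
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))


def match_license(license_file: str, known: dict[str, str], threshold: float = 0.9):
    """Return (matcher, license_name, similarity), trying the steps in order."""
    body = strip_copyright(license_file)

    # Step 1: nothing but a copyright notice, so treat the project as unlicensed.
    if license_file.strip() and not body:
        return ("copyright", None, 0.0)

    # Step 2: exact match after stripping whitespace and copyright notice.
    for name, text in known.items():
        if strip_copyright(text) == body:
            return ("exact", name, 1.0)

    # Step 3: closest known license by Dice similarity, if it clears the threshold.
    best_name, best_score = None, 0.0
    for name, text in known.items():
        score = dice_similarity(body, strip_copyright(text))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return ("dice", best_name, best_score)
    return ("none", None, best_score)
```

Calling `match_license(text, {"MIT": mit_text, ...})` returns which step matched, so a caller can tell an exact match apart from a near match that only cleared the similarity threshold.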
We published a blog post with an analysis of open source licenses used on GitHub (https://solutionshub.epam.com/blog/post/examining-open-source-license-usage), but it includes many repos that have a "custom" license. Our manual analysis of a selection of these shows that many organizations use licenses that are minor modifications of standard licenses, yet GitHub does not automatically identify their license. If this could be improved, we would be able to make a better analysis of the popularity of open source repos among commercial organizations as well as across all of GitHub.
The goal is to identify such licenses with a high level of probability, e.g. when the content of the LICENSE file is at least 90% the same as the standard Apache license text.
Our suggestion would be to do this for the top 3-6 license types, which based on our analysis are Apache 2.0, BSD 3-clause, MIT, GPL 2.0, GPL 3.0, LGPL 2.1, and EPL 1.0.
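As a rough sketch of how this could feed back into the analysis, the snippet below tallies a batch of LICENSE texts against a small set of reference texts and counts anything under the threshold as "custom". Everything here is hypothetical: the function names, the word-set Dice measure, and the 90% default are illustrative choices, and the reference texts for the license types listed above would have to be loaded by the caller.

```python
import re
from collections import Counter


def _words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice_percent(candidate: str, reference: str) -> float:
    """Sørensen-Dice similarity of two texts' word sets, as a percentage."""
    a, b = _words(candidate), _words(reference)
    if not a and not b:
        return 100.0
    return 100.0 * 2 * len(a & b) / (len(a) + len(b))


def tally_licenses(
    repo_license_texts: dict[str, str],
    references: dict[str, str],
    min_percent: float = 90.0,
) -> Counter:
    """Count repos per license type; repos below the threshold count as 'custom'."""
    counts: Counter = Counter()
    for _repo, text in repo_license_texts.items():
        best_name, best_score = "custom", 0.0
        for name, reference in references.items():
            score = dice_percent(text, reference)
            # Keep the highest-scoring reference that clears the threshold.
            if score >= min_percent and score > best_score:
                best_name, best_score = name, score
        counts[best_name] += 1
    return counts
```

Lowering or raising `min_percent` shows how sensitive the "custom" bucket is to the chosen cut-off, which may help in picking a threshold before proposing one upstream.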
patrickstephens1 changed the title from "Identify 'other' licenses on GitHub" to "Identify and classify non-standard open source licenses in GitHub repos" on Aug 25, 2020
I suggest posting this proposal as an issue in https://github.com/licensee/licensee. I think the problem is not OSCI-specific, and OSCI is probably not the place to solve it.
Some examples:
https://github.com/MicrosoftDocs/microsoft-365-docs/blob/public/LICENSE - Creative Commons license
https://github.com/dotnet/runtime/blob/master/LICENSE.TXT - MIT license
https://github.com/IBM-Cloud/webapp-with-cos-and-cdn/blob/master/License.txt, https://github.com/IBM-Cloud/serverless-followupapp-ios/blob/master/License.txt - Apache 2.0
https://github.com/strongloop/loopback.io/blob/gh-pages/LICENSE, https://github.com/strongloop/loopback-next/blob/master/LICENSE - MIT license
https://github.com/mono/mono/blob/master/LICENSE - a mix of licenses, so it won't be possible to identify a single license type