From 045f4d6747c1ed3abdff3c7e6fbcc52005d238db Mon Sep 17 00:00:00 2001 From: Hero2323 Date: Thu, 5 Sep 2024 20:13:50 +0200 Subject: [PATCH] chore(report): AI Powered License Detection Reports for weeks 8-15 --- .../license-detection/updates/2024-05-30.md | 2 +- .../license-detection/updates/2024-06-06.md | 2 +- .../license-detection/updates/2024-06-13.md | 2 +- .../license-detection/updates/2024-06-20.md | 2 +- .../license-detection/updates/2024-06-27.md | 2 +- .../license-detection/updates/2024-07-04.md | 2 +- .../license-detection/updates/2024-07-11.md | 4 +- .../license-detection/updates/2024-07-18.md | 130 ++++++++ .../license-detection/updates/2024-07-25.md | 22 ++ .../license-detection/updates/2024-08-01.md | 265 +++++++++++++++ .../license-detection/updates/2024-08-08.md | 53 +++ .../license-detection/updates/2024-08-15.md | 40 +++ .../license-detection/updates/2024-08-22.md | 305 ++++++++++++++++++ .../license-detection/updates/2024-08-29.md | 39 +++ .../license-detection/updates/2024-09-05.md | 39 +++ 15 files changed, 901 insertions(+), 8 deletions(-) create mode 100644 docs/2024/license-detection/updates/2024-07-18.md create mode 100644 docs/2024/license-detection/updates/2024-07-25.md create mode 100644 docs/2024/license-detection/updates/2024-08-01.md create mode 100644 docs/2024/license-detection/updates/2024-08-08.md create mode 100644 docs/2024/license-detection/updates/2024-08-15.md create mode 100644 docs/2024/license-detection/updates/2024-08-22.md create mode 100644 docs/2024/license-detection/updates/2024-08-29.md create mode 100644 docs/2024/license-detection/updates/2024-09-05.md diff --git a/docs/2024/license-detection/updates/2024-05-30.md b/docs/2024/license-detection/updates/2024-05-30.md index 661212d28..62833440a 100644 --- a/docs/2024/license-detection/updates/2024-05-30.md +++ b/docs/2024/license-detection/updates/2024-05-30.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 1 diff --git 
a/docs/2024/license-detection/updates/2024-06-06.md b/docs/2024/license-detection/updates/2024-06-06.md index 8248e6e36..cf95be3dc 100644 --- a/docs/2024/license-detection/updates/2024-06-06.md +++ b/docs/2024/license-detection/updates/2024-06-06.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 2 diff --git a/docs/2024/license-detection/updates/2024-06-13.md b/docs/2024/license-detection/updates/2024-06-13.md index 7be28aeee..6bd100c43 100644 --- a/docs/2024/license-detection/updates/2024-06-13.md +++ b/docs/2024/license-detection/updates/2024-06-13.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 3 diff --git a/docs/2024/license-detection/updates/2024-06-20.md b/docs/2024/license-detection/updates/2024-06-20.md index 36e314e5c..b4723e99c 100644 --- a/docs/2024/license-detection/updates/2024-06-20.md +++ b/docs/2024/license-detection/updates/2024-06-20.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 4 diff --git a/docs/2024/license-detection/updates/2024-06-27.md b/docs/2024/license-detection/updates/2024-06-27.md index 116e537b2..4d0911820 100644 --- a/docs/2024/license-detection/updates/2024-06-27.md +++ b/docs/2024/license-detection/updates/2024-06-27.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 5 diff --git a/docs/2024/license-detection/updates/2024-07-04.md b/docs/2024/license-detection/updates/2024-07-04.md index ac0889f85..e54f841c1 100644 --- a/docs/2024/license-detection/updates/2024-07-04.md +++ b/docs/2024/license-detection/updates/2024-07-04.md @@ -5,7 +5,7 @@ author: Abdelrahman Jamal # Meeting 6 diff --git a/docs/2024/license-detection/updates/2024-07-11.md b/docs/2024/license-detection/updates/2024-07-11.md index c5bb3561c..c16aee11a 100644 --- a/docs/2024/license-detection/updates/2024-07-11.md +++ b/docs/2024/license-detection/updates/2024-07-11.md @@ -5,12 +5,12 @@ author: Abdelrahman Jamal # Meeting 7 -*(July 4,2024)* +*(July 11,2024)* ## Attendees: - [Kaushlendra Pratap](https://github.com/Kaushl2208) diff --git 
a/docs/2024/license-detection/updates/2024-07-18.md b/docs/2024/license-detection/updates/2024-07-18.md new file mode 100644 index 000000000..2ea38a374 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-18.md @@ -0,0 +1,130 @@ +--- +title: Week 8 +author: Abdelrahman Jamal +--- + + +# Meeting 8 + +*(July 18,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) +- [Anupam Ghosh](https://github.com/ag4ums) +- Katharina Ettinger + +## Discussion: +- Improvements on semantic search (somehow) +- test nomos +Discussed several interesting ideas including license compatibility and obligations and so on + + +### Evaluation of Semantic Search on Nomos Test Dataset + +- **Dataset Overview:** + The Nomos test dataset consists of 2054 rows, but I evaluated the algorithm using only 1000 rows to gather initial results. + +- **Initial Results:** + 1. **Accuracy:** 65.1% + 2. **Coverage:** 58.2% + + **Challenges:** + - My parser isn't perfect, and there are cases where the algorithm matches minor variations of licenses (e.g., AFL-1.2 instead of AFL-1.1), which is counted as incorrect. + - The real accuracy is likely higher, in the range of 70-75%, considering these minor variations. + +- **Next Steps:** + 1. Refine the parser to better handle license versions and variations. + 2. Reevaluate accuracy with a more comprehensive dataset to improve these metrics. + +### Work on License Obligations + +- **Introduction to License Obligations:** + License obligations refer to the specific legal requirements imposed by open-source licenses to ensure compliance. I began working on this aspect to explore its potential in the project. 
+ + **Key Concepts:** + OSADL (Open Source Automation Development Lab) has developed the "Open Source License Obligations Checklists" project, which helps organizations comply with open-source licenses by: + 1. **Encoding Obligations:** Defining what actions are required or prohibited by different licenses. + 2. **Creating Checklists:** Structured lists of obligations for various licenses. + 3. **Evaluating Compatibility:** Assessing how different licenses interact and whether they can be used together. + +- **Progress:** + I started by using OSADL’s checklist as a framework for obligations. Here’s a [link](https://osadl.org/Access-to-raw-data.oss-compliance-raw-data-access.0.html) to OSADL's obligations for various licenses. + + **Example Obligation for the Academic Free License v2.0 (AFL-2.0):** + ``` + USE CASE Source code delivery OR Binary delivery + YOU MUST Reference License text + YOU MUST Search License acceptance + ATTRIBUTE Reasonable + IF Software modification + YOU MUST Forward Copyright notices + YOU MUST Forward Patent notice + YOU MUST Forward Trademark notice + YOU MUST Forward License notice + YOU MUST Provide Modification notice + YOU MUST NOT Promote + IF Modified work Under Original license + EITHER + YOU MUST Include Source code Of Modified work + ATTRIBUTE Machine-readable + OR + YOU MUST Provide Delayed source code delivery Of Modified work + ATTRIBUTE Machine-readable + ATTRIBUTE Via Internet + ATTRIBUTE No profit + ATTRIBUTE Duration As long as distributed + YOU MUST Reference Source code + ATTRIBUTE Below Copyright notices + PATENT HINTS Yes + ``` +### Experimenting with License Obligation Conversion via LLM + +- **Objective:** +I attempted to evaluate whether an LLM could generate license obligations similar to OSADL’s checklist through prompt engineering. + +- **Initial Results:** +Using prompt engineering, I tested multiple LLMs to generate obligations directly from license texts. However, the results were inconsistent due to: +1. 
**Ambiguity in Interpretation:** OSADL's obligations are highly detailed and rely on specific legal interpretations. These are difficult for an LLM to replicate without detailed context. +2. **LLM Limitations:** While the LLM could generate obligations, they were not structured or detailed enough to meet the OSADL standard. + + +- **Challenges and Next Steps:** +1. The LLM produced obligations, but they did not reach the required precision, mainly because the LLM and OSADL interpret the license clauses differently. +2. Moving forward, I plan to refine the prompt further and explore additional ways to guide the LLM towards creating obligations more aligned with legal standards. + +## Conclusions and Next Steps + +- **Further Work on Obligations:** +I will continue working on converting licenses into obligations, refining the prompt-engineering process and exploring different approaches to improve the LLM’s performance. + +- **Other Tasks:** +1. Begin recognizing licenses from notice files, extending the current capability of license identification. +2. Continue refining the parser and matching algorithm to better handle edge cases and improve overall accuracy and speed. + +## Additional Discussions and Potential Advancements + +In this meeting, several potential directions were discussed regarding obligations and other license-related applications: + +1. **Validate Obligation Rule Against License Text**: + * Provide an obligation rule (e.g., "YOU MUST Provide Modification report") to the LLM. + * Ask the LLM to check if this rule exists within the license text and provide an explanation. + +2. **Potential Advancements** + + 1. **Create Human-Readable Obligations from Machine-Readable Obligations**: + * Convert OSADL's structured, machine-readable obligations into a more user-friendly format. + + 2.
**Check License Compatibility Using Obligations or License Text**: + * Explore using obligations or the raw license text to determine if different licenses are compatible with each other. + + 3. **Convert License Text to Obligations Using LLM**: + * Continue refining the prompt engineering process to improve the LLM's ability to generate obligations directly from license text. + + These potential advancements offer various avenues for further exploration and development within the project. \ No newline at end of file diff --git a/docs/2024/license-detection/updates/2024-07-25.md b/docs/2024/license-detection/updates/2024-07-25.md new file mode 100644 index 000000000..a1ceac721 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-07-25.md @@ -0,0 +1,22 @@ +--- +title: Week 9 +author: Abdelrahman Jamal +--- + + +# Meeting 9 + +*(July 25,2024)* + +## Attendees: +None + +## Discussion: +No meeting was held this week due to my travel commitments with family. + + + diff --git a/docs/2024/license-detection/updates/2024-08-01.md b/docs/2024/license-detection/updates/2024-08-01.md new file mode 100644 index 000000000..87c9bf579 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-08-01.md @@ -0,0 +1,265 @@ +--- +title: Week 10 +author: Abdelrahman Jamal +--- + + +# Meeting 10 + +*(August 1,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) +- [Ayush Bhardwaj](https://github.com/hastagAB) +- [Avinal Kumar](https://github.com/avinal) + +## Discussion: +- Continued work on license obligations and compatibility +- Example analyses using LLM for license obligations conversion +- Evaluating accuracy of LLM-based approaches for legal obligations + +### Experimenting with License Obligation Conversion via LLM + +- **Objective:** + I focused on converting license texts to obligations using LLMs through a structured prompt. 
The goal was to extract obligations similar to OSADL's structured format. + +- **Prompt Used:** + ``` + Convert the Following License Text to License Obligations License Text: {text} + ``` +- **Results:** +Several examples were tested, and while the LLM was able to generate obligations in most cases, they were not always fully aligned with the structured approach required by OSADL. + +**Example 1: Universal Permissive License (UPL), Version 1.0:** +**Obligations from LLM:** + ``` + Include Copyright notices + Include License text OR Reference To https://unlicense.org/ + ``` +**Actual Obligations:** + ``` + USE CASE Source code delivery OR Binary delivery + YOU MUST Provide Copyright notices + EITHER + YOU MUST Provide License text + OR + YOU MUST Reference License text + ``` + +**Example 2: Bitstream Vera Font License:** +**Obligations from LLM:** + ``` + Include Copyright notices + Include Trademark notice + Include License text + Rename Font if Modified AND Font Name Contains "Bitstream" OR "Vera" + Do Not Sell Font By Itself + Do Not Promote + ``` +**Actual Obligations:** + ``` + USE CASE Font delivery + YOU MUST Provide Copyright notices + ATTRIBUTE In Font File + YOU MUST Provide Trademark notice + ATTRIBUTE In Font File + YOU MUST Provide License text + ATTRIBUTE In Font File + IF Font Modification + YOU MUST NOT Use "Bitstream" OR "Vera" In Font Name + YOU MUST NOT Sell Font + ATTRIBUTE Stand-alone + YOU MUST NOT Promote + ``` + +**Example 3: Attribution Assurance License:** +**Obligations from LLM:** + ``` + Source code delivery + Provide License text + Binary delivery + Provide License text + Display Attribution notice + Name + Professional identification + URL + No Trademark endorsement + ``` +**Actual Obligations:** +None as this license doesn't have a corresponding OSADL obligation + +**Challenges:** +The LLM struggles with complex license structures and nuances. For example, legal experts are often required to interpret certain clauses accurately. 
With legal supervision, LLMs can help as an initial draft tool but are not a replacement for expert evaluation. + +### Evaluation of Clause Accuracy in License Obligations + +- **Objective:** +I developed a prompt to check if specific clauses from obligations match the license text. + +- **Prompt Used:** + ``` + [Task] + Evaluate the accuracy of each clause within the provided set of obligations against the given open-source license text. Determine if each clause is valid (supported by the license), invalid (contradicts the license), or partially valid (partially accurate or open to interpretation). + + [Instructions] + 1. Carefully analyze the open-source license text. + 2. Examine each clause within the provided obligations. + 3. Compare each clause to the relevant sections of the license text. + 4. Categorize each clause as valid, invalid, or partially valid. + 5. Provide a clear explanation for each assessment, citing specific license text sections and any interpretations. + 6. Present your analysis in the following list format only, without any additional text or commentary: + Clause: [Clause text] + Result: [valid/invalid/partially valid] + Explanation: [Your detailed explanation, citing specific license text sections and any relevant interpretations] + 7. Be Concise and only follow the output format provided in Instruction 6 without any introductions or conclusions + + [Additional Notes] + 1. if a clause contains multiple parts, assess each part separately and then as a whole. + 2. Be concise yet thorough in your explanations. + 3. Use clear, unambiguous language. + 4. Maintain a professional, objective tone. + + [License Text] + {license_text} + + [Corresponding Obligations] + {obligations} + ``` + +- **Results:** +The LLM was able to categorize clauses as valid, invalid, or partially valid with ~93% accuracy. False negatives were primarily due to minor ambiguities or interpretation issues. 
After manual review, the accuracy is estimated to be closer to 95-96%. + + +**Example: W3C Software License** +**License Text:** + ``` + This work is being provided by the copyright holders under the following license. + + License + By obtaining and/or copying this work, you (the licensee) agree that you have read, understood, and will comply with the following terms and conditions. + + Permission to copy, modify, and distribute this work, with or without modification, for any purpose and without fee or royalty is hereby granted, provided that you include the following on ALL copies of the work or portions thereof, including modifications: + + • The full text of this NOTICE in a location viewable to users of the redistributed or derivative work. + • Any pre-existing intellectual property disclaimers, notices, or terms and conditions. If none exist, the W3C Software and Document Short Notice should be included. + • Notice of any changes or modifications, through a copyright statement on the new code or document such as "This software or document includes material copied from or derived from [title and URI of the W3C document]. Copyright (c) [YEAR] W3C® (MIT, ERCIM, Keio, Beihang)." + + Disclaimers + THIS WORK IS PROVIDED "AS IS," AND COPYRIGHT HOLDERS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE OR DOCUMENT WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. + + COPYRIGHT HOLDERS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THE SOFTWARE OR DOCUMENT. + + The name and trademarks of copyright holders may NOT be used in advertising or publicity pertaining to the work without specific, written prior permission. Title to copyright in this work will at all times remain with copyright holders. 
+ ``` + +**Obligations:** + ``` + USE CASE Source code delivery OR Binary delivery + YOU MUST Provide License text + ATTRIBUTE Viewable + YOU MUST Provide Legal notices + IF NOT Legal notices + YOU MUST Provide Copyright notice "W3C Software and Document Short Notice https://www.w3.org/Consortium/Legal/2015/copyright-software-short-notice.html" + IF Software modification + YOU MUST Provide Copyright notice AND Modification notice + YOU MUST NOT Promote + ``` + +**LLM Response:** + ``` + Clause: YOU MUST Provide License text + Result: Valid + Explanation: The license text explicitly states that the licensee agrees to comply with the terms and conditions, which includes providing the full text of the NOTICE (license text) on ALL copies of the work or portions thereof, including modifications. (Section: Permission to copy, modify, and distribute this work) + + Clause: ATTRIBUTE Viewable + Result: Partially Valid + Explanation: The license text does not explicitly mention the requirement to attribute the work as "Viewable". However, it does require the inclusion of the full text of the NOTICE (license text) on ALL copies of the work or portions thereof, including modifications. This implies that the attribution should be viewable, but the exact wording is not specified. (Section: Permission to copy, modify, and distribute this work) + + Clause: YOU MUST Provide Legal notices + Result: Valid + Explanation: The license text requires the inclusion of any pre-existing intellectual property disclaimers, notices, or terms and conditions on ALL copies of the work or portions thereof, including modifications. This implies that the licensee must provide legal notices. (Section: Permission to copy, modify, and distribute this work) + + Clause: IF NOT Legal notices + Result: Valid + Explanation: The license text requires the inclusion of the W3C Software and Document Short Notice if no pre-existing intellectual property disclaimers, notices, or terms and conditions exist. 
This implies that the licensee must provide a legal notice if none exist. (Section: Permission to copy, modify, and distribute this work) + + Clause: IF Software modification + Result: Valid + Explanation: The license text requires the inclusion of the full text of the NOTICE (license text) on ALL copies of the work or portions thereof, including modifications. Additionally, it requires the inclusion of a copyright notice and modification notice if the software is modified. This implies that the licensee must provide a copyright notice and modification notice if the software is modified. (Section: Permission to copy, modify, and distribute this work) + + Clause: YOU MUST NOT Promote + Result: Valid + Explanation: The license text explicitly states that the name and trademarks of copyright holders may NOT be used in advertising or publicity pertaining to the work without specific, written prior permission. This implies that the licensee must not promote the work without permission. (Section: Disclaimers) + ``` + +**Challenges:** +Some clauses were marked as partially valid due to wording differences or interpretation differences, which could be refined with further work. + +### Work on License Compatibility Analysis + +- **Objective:** +I experimented with checking license compatibility using both license texts and obligations. The analysis was done by comparing attributes like attribution requirements, copyleft conditions, and modification rights. + +- **Prompt Used:** + ``` + License Obligation Analysis + + Analyze the obligations imposed on licensees by the two license texts provided below. Consider the following key aspects: + + 1. **Attribution:** + * Do the licenses require attribution of the original author(s)? If so, what specific information must be included, and in what format? + * Are there any restrictions on how the attribution can be presented (e.g., size, placement)? + + 2. 
**Copyleft/ShareAlike:** + * Do the licenses include copyleft or ShareAlike provisions? If so, what obligations do these impose on the distribution of derivative works? + * Are there any exceptions or exemptions to the copyleft/ShareAlike requirements? + + 3. **Modification and Distribution:** + * Do the licenses allow modification of the software? If so, are there any conditions or restrictions on how the modified software can be distributed? + * Are there any requirements for the disclosure of source code or changes made to the software? + + 4. **Commercial Use:** + * Do the licenses explicitly permit or restrict commercial use of the software? + * If commercial use is permitted, are there any specific obligations or limitations associated with it? + + 5. **Patent Grants:** + * Do the licenses include any patent grants or licenses? If so, what rights do they grant, and under what conditions? + + 6. **Liability and Warranty Disclaimers:** + * Do the licenses include disclaimers of liability or warranty? If so, what are the specific terms and limitations of these disclaimers? + + Based on your analysis, provide a comprehensive summary of the obligations imposed by each license. Highlight any potential conflicts or ambiguities between the obligations of the two licenses. + + Overall Verdict: + + Based on the analysis above, provide an overall verdict on the relationship between the obligations of the two licenses. Are the obligations generally compatible, or are there significant conflicts that would make it difficult or impossible to comply with both licenses simultaneously? Briefly explain the reasoning behind your verdict. + + License Text A: + [Insert the full text of License A here] + + License Text B: + [Insert the full text of License B here] + ``` + +- **Results:** +The LLM produced somewhat accurate results but often misinterpreted certain compatibility conflicts. While it was able to detect simple compatibility issues, it struggled with more nuanced cases. 
+ +**Challenges:** +The LLM occasionally reported conflicts where none existed and missed actual conflicts in more complex cases. Further refinement of the prompt and evaluation approach is needed. + +### Next Steps and Conclusion + +- **Refinement of License Identification Process:** +I plan to shift focus away from full license identification using LLMs. Instead, the goal will be to have the LLM extract license-identifying parts from comments, with further analysis done using semantic search or other matching algorithms to match the text to known licenses. + +- **PR to Atarashi Repository:** + I will make a pull request (PR) to the [Fossology Atarashi](https://github.com/fossology/atarashi) repository with the semantic search functionality, but I will spend less time on it going forward. + + +Overall, the project continues to show promise, with LLMs proving capable of initial obligation drafting, but further work is needed to ensure precision and reduce interpretation ambiguities. + diff --git a/docs/2024/license-detection/updates/2024-08-08.md b/docs/2024/license-detection/updates/2024-08-08.md new file mode 100644 index 000000000..8a841ff0c --- /dev/null +++ b/docs/2024/license-detection/updates/2024-08-08.md @@ -0,0 +1,53 @@ +--- +title: Week 11 +author: Abdelrahman Jamal +--- + + +# Meeting 11 + +*(August 8,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Ayush Bhardwaj](https://github.com/hastagAB) + +## Discussion: + +### Semantic Search Code Cleanup and Refinement + +- Evaluation and Refinement of Changes: + + 1. Approach: Began cleaning up the semantic search code before pushing it to the Atarashi repository. This included reviewing all modifications made throughout the project, from using cosine similarity and sentence transformers to employing fuzzywuzzy and Levenshtein distance methods. + + 2.
Challenges: Due to the significant number of changes introduced, evaluating which modifications to retain took considerable time, especially in relation to how file comments and license text are grouped and chunked. + +### Atarashi Repository Exploration + +- Cloning and Local Build: + + 1. Approach: Cloned the Atarashi repository and started investigating areas for contribution, with a focus on understanding the package structure and functionality. + + 2. Issues: Encountered some local build issues but managed to develop a close representation of the code I intended to run for this week. + +- Next Steps: Continue exploring how to fix the local build issues and further refine my contributions to the repository. + +### License-Relevant Text Detection + +- Task Design and LLM Experimentation: + + 1. Objective: Designed a task prompt specifically for detecting license-relevant text within code files, without identifying which specific license the text belongs to. + + 2. Experimentation: Started experimenting with LLMs for this task and observed positive initial results, showing promise for future improvements in detecting license-specific sections. + +## Conclusions and Next Steps + +- Push Code to Atarashi: Finalize the clean-up and push the refined semantic search algorithm to the repository. + +- Fix Local Build Issues: Resolve any remaining build issues to contribute effectively to the Atarashi repository. + +- Continue LLM Experimentation: Improve the prompt and refine the LLM experiments for detecting license-relevant text. diff --git a/docs/2024/license-detection/updates/2024-08-15.md b/docs/2024/license-detection/updates/2024-08-15.md new file mode 100644 index 000000000..0d4890e00 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-08-15.md @@ -0,0 +1,40 @@ +--- +title: Week 12 +author: Abdelrahman Jamal +--- + + +# Meeting 12 + +*(August 15,2024)* + +## Attendees: +- No meeting held due to a national holiday. 
+ +## Discussion: + +### Atarashi Integration and Build Process + +- Build Errors and Code Integration: + + 1. Approach: Fixed the errors with the Atarashi build process, successfully integrating the semantic search code into the codebase. + + 2. Evaluation: Began evaluating the integrated semantic search agent within Atarashi. Some issues were identified where the agent missed certain cases, and efforts were made to refine these areas to improve performance. + +### LLM Experimentation for License-Relevant Text Detection + +- LLM Selection and Testing: + + 1. Objective: Continued experimenting with various LLMs to determine which was best suited for detecting license-relevant text in code files. + + 2. Results: Tested Mistral 7b, Gemma 2 9b, and Llama 3 8b. All models performed well in general, but in my experience Gemma 2 9b was the best performer. While no formal metrics were used, Gemma 2 9b appeared to handle nuanced cases better than the other two models. As it is the most recent release of the three, this was not unexpected, but all models were largely able to detect the license-relevant text without significant issues. + +## Conclusions and Next Steps + +- Refine Atarashi Agent: Continue to work on refining the semantic search agent in Atarashi to address the missed cases and improve overall accuracy. + +- LLM Selection: Focus on further experimenting with Gemma 2 9b for more comprehensive testing and continue exploring ways to quantify its performance in comparison to the other models.
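
The Week 12 notes say that no formal metrics were used when comparing the models. As a rough sketch only, and not the method used in this project, one possible way to quantify license-text extraction would be a token-overlap score between what an LLM extracts and a hand-labelled reference snippet. The function names `tokenize` and `overlap_scores` and the sample strings are hypothetical and chosen purely for illustration.

```python
from collections import Counter
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_scores(extracted: str, reference: str) -> dict[str, float]:
    """Token-level precision, recall and F1 of an extracted snippet.

    Tokens are compared as multisets, so word order is ignored.
    This is an assumed metric, not one taken from the reports.
    """
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    # Multiset intersection: tokens present in both, with minimum counts.
    common = sum((ext & ref).values())
    precision = common / sum(ext.values()) if ext else 0.0
    recall = common / sum(ref.values()) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if common else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative strings only; a real evaluation would use labelled samples.
    reference = "SPDX-License-Identifier: GPL-2.0"
    extracted = "Copyright 2018 SPDX-License-Identifier: GPL-2.0"
    print(overlap_scores(extracted, reference))
```

Averaging such scores per model over the same set of files would give a simple basis for comparison. Because the score ignores word order, it would probably need to be paired with a stricter measure if exact span boundaries matter.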
diff --git a/docs/2024/license-detection/updates/2024-08-22.md b/docs/2024/license-detection/updates/2024-08-22.md new file mode 100644 index 000000000..6b74150e2 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-08-22.md @@ -0,0 +1,305 @@ +--- +title: Week 13 +author: Abdelrahman Jamal +--- + + +# Meeting 13 + +*(August 22,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Ayush Bhardwaj](https://github.com/hastagAB) + +## Discussion: + +### Atarashi Integration and PR + +- Code Ready for PR: + + 1. Progress: Presented the latest results to the mentors. The Atarashi integration progressed well, and I created a [PR](https://github.com/fossology/atarashi/pull/103) after awaiting feedback from Kaushlendra. + + 2. Code Execution: The code allows for semantic search on a file by using the following command: + ``` + atarashi -a SemanticSearch /path/to/file.c + ``` + - Example output: + ``` + { + "file": "/home/jimbo/Desktop/GSoC24/atarashi/atarashi/license/licenseDownloader.py", + "results": [ + { + "description": "", + "shortname": "GPL-2.0-only", + "sim_score": 91.0, + "sim_type": "SemanticSearch-LVD" + } + ] + } + ``` + + 3. Example Comments in the File: + ``` + Copyright 2018 Aman Jain (amanjain5221@gmail.com) + + SPDX-License-Identifier: GPL-2.0 + + This program is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License + version 2 as published by the Free Software Foundation. + ``` + +### LLM License Text Identification + +- Evaluation on Nomos Test Dataset: + + 1. Approach: I evaluated Gemma 2 9b on approximately 200 samples from the Nomos Test dataset provided by Shaheem. + - [Evaluation Spreadsheet](https://docs.google.com/spreadsheets/d/1zLjSyUyMCiAdoqnawbXqjpG209xRfvbCpTJRi4WlbDk/edit?usp=sharing) + + 2. 
Results: The LLM was generally effective at extracting continuous license text sections, but had issues with edge cases where the license was interspersed with other content (such as HTML tags or commentary). + + 3. Interesting Snippets: + - **Example 1**: License text intermixed with HTML, where the LLM partially extracted the license correctly but missed sections due to the interruptions caused by tags. + + **File Comments:** + ``` + + + + Copyright and Licensing Information for ACE, TAO, CIAO, DAnCE, and CoSMIC + + + +
+ +

Copyright and Licensing Information for ACE(TM), TAO(TM), CIAO(TM), DAnCE(TM), and + CoSMIC(TM)

+ + ACE(TM), TAO(TM), CIAO(TM), DAnCE>(TM), + and + CoSMIC(TM) (henceforth referred to as "DOC software") + are copyrighted by Douglas C. Schmidt + and his research + group at Washington + University, University of California, + Irvine, and Vanderbilt + University, Copyright (c) 1993-2009, all rights reserved. Since + DOC software is open-source, freely available software, you are free + to use, modify, copy, and distribute--perpetually and irrevocably--the + DOC software source code and object code produced from the source, as + well as copy and distribute modified versions of this software. You + must, however, include this copyright statement along with any code + built using DOC software that you release. No copyright statement + needs to be provided if you just ship binary executables of your + software products.

+ + You can use DOC software in commercial and/or binary software releases + and are under no obligation to redistribute any of your source code + that is built using DOC software. Note, however, that you may not + misappropriate the DOC software code, such as copyrighting it yourself + or claiming authorship of the DOC software code, in a way that will + prevent DOC software from being distributed freely using an + open-source development model. You needn't inform anyone that you're + using DOC software in your software, though we encourage you to let us know so we can promote + your project in the DOC software + success stories.

+ + The ACE, TAO, CIAO, DAnCE, + and CoSMIC web + sites are maintained by the DOC Group at the Institute for Software + Integrated Systems (ISIS) and the Center for + Distributed Object Computing of Washington University, St. Louis + for the development of open-source software as part of the open-source + software community. + + Submissions are provided by the submitter ``as is'' with no warranties + whatsoever, including any warranty of merchantability, noninfringement + of third party intellectual property, or fitness for any particular + purpose. In no event shall the submitter be liable for any direct, + indirect, special, exemplary, punitive, or consequential damages, + including without limitation, lost profits, even if advised of the + possibility of such damages. Likewise, DOC software is provided as is + with no warranties of any kind, including the warranties of design, + merchantability, and fitness for a particular purpose, + noninfringement, or arising from a course of dealing, usage or trade + practice. Washington University, UC Irvine, Vanderbilt University, + their employees, and students shall have no liability with respect to + the infringement of copyrights, trade secrets or any patents by DOC + software or any part thereof. Moreover, in no event will Washington + University, UC Irvine, or Vanderbilt University, their employees, or + students be liable for any lost revenue or profits or other special, + indirect and consequential damages.

+ + DOC software is provided with no support and without any obligation on + the part of Washington University, UC Irvine, Vanderbilt University, + their employees, or students to assist in its use, correction, + modification, or enhancement. A number + of companies around the world provide commercial support for DOC + software, however. + + DOC software is Y2K-compliant, as long as the underlying OS platform + is Y2K-compliant. Likewise, DOC software is compliant with the new US + daylight savings rule passed by Congress as "The Energy Policy Act of + 2005," which established new daylight savings times (DST) rules for + the United States that expand DST as of March 2007. Since DOC + software obtains time/date and calendaring information from operating + systems users will not be affected by the new DST rules as long as + they upgrade their operating systems accordingly.

+ + The names ACE(TM), + TAO(TM), CIAO(TM), DAnCE(TM), CoSMIC(TM), Washington University, UC Irvine, and + Vanderbilt University, may not be used to endorse or promote products + or services derived from this source without express written + permission from Washington University, UC Irvine, or Vanderbilt + University. This license grants no permission to call products or + services derived from this source ACE(TM), TAO(TM), CIAO(TM), DAnCE(TM), or CoSMIC(TM), nor does it grant permission for the name + Washington University, UC Irvine, or Vanderbilt University to appear + in their names.

+ + If you have any suggestions, additions, comments, or questions, please + let me know.

+ + Douglas C. Schmidt

+ +


+ Back to the ACE home page. + + + + + ``` + + **LLM Response**: + ``` + Copyright (c) 1993-2009, all rights reserved. Since + DOC software is open-source, freely available software, you are free + to use, modify, copy, and distribute--perpetually and irrevocably--the + DOC software source code and object code produced from the source, as + well as copy and distribute modified versions of this software. You + must, however, include this copyright statement along with any code + built using DOC software that you release. No copyright statement + needs to be provided if you just ship binary executables of your + software products. + + You can use DOC software in commercial and/or binary software releases + and are under no obligation to redistribute any of your source code + that is built using DOC software. Note, however, that you may not + misappropriate the DOC software code, such as copyrighting it yourself + or claiming authorship of the DOC software code, in a way that will + prevent DOC software from being distributed freely using an + open-source development model. You needn't inform anyone that you're + using DOC software in your software, though we encourage you to let us know so we can promote + your project in the DOC software + success stories. + ``` + + - **Example 2**: The LLM extracted the license correctly even though the license information was embedded in a package description. This was deemed correct despite being less obvious. + + **File Comments**: + ``` + # rpm-specific + set(CPACK_RPM_PACKAGE_SUMMARY "${CPACK_PACKAGE_NAME}") + set(CPACK_RPM_PACKAGE_DESCRIPTION "${PACKAGE_LONG_DESCRIPTION}") + set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0") + ``` + + **LLM Response**: + ``` + set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0") + ``` + + - **Example 3**: The LLM stopped extracting once it encountered commentary between license statements, which might indicate a limitation in detecting when a license resumes after an interruption. 
+ + **File Comments:** + ``` + Distribution and modification of the the URW Garamond fonts are subject to the + Aladdin Free Public License + + Copyright (c) 2000 URW++ Design & Development + + The font GaramondNo8 Italic has been modified, as compared with the version + from the original distribution site ; + credits to Ralf Stubner for fixing the broken quotedblleft character. + + + The TeX font metrics, virtual fonts and font definition files for URW Garamond, + i.e., the files in the archive ugm.zip, may be distributed and/or modified + under the conditions of the LaTeX Project Public License, either version 1.3 of + this license or (at your option) any later version. The latest version of this + license is in http://www.latex-project.org/lppl.txt and version 1.3 or later is + part of all distributions of LaTeX version 2003/12/01 or later. + + Copyright (c) 2005 Walter Schmidt + + + Various aspects of the font were subsequently improved and (contagiously) + released under the Aladdin Free Public License. + + Copyright (c) 2009 Gael Varoquaux + Copyright (c) 2010 Rogerio Theodoro de Brito + Copyright (c) 2010 Khaled Hosny + + + Finally, these fonts were converted by Gaetan Bisson to TrueType format using + FontForge, for use in an X environment. No additional restriction applies. + + Files provided "as is", with no warranties. + + ``` + + **LLM Response:** + ``` + Distribution and modification of the the URW Garamond fonts are subject to the + Aladdin Free Public License + + Copyright (c) 2000 URW++ Design & Development + ``` + + 4. Observations: The LLM handles straightforward, continuous license text effectively, but struggles when the license text is broken up by other information or interspersed with unrelated comments. Overall, this can be handled with additional prompt engineering. + +## Conclusions and Next Steps + + +- Documentation and Code Cleanup: Focus on documenting the entire project and cleaning up the code as the GSoC final evaluation approaches. 
+ + diff --git a/docs/2024/license-detection/updates/2024-08-29.md b/docs/2024/license-detection/updates/2024-08-29.md new file mode 100644 index 000000000..bb2df1af3 --- /dev/null +++ b/docs/2024/license-detection/updates/2024-08-29.md @@ -0,0 +1,39 @@ +--- +title: Week 14 +author: Abdelrahman Jamal +--- + + +# Meeting 14 + +*(August 29,2024)* + +## Attendees: +- [Ayush Bhardwaj](https://github.com/hastagAB) + +## Discussion: +### New Repository for Final Code and Documentation + +- Repository Cleanup: + + 1. Progress: Working on creating a new, clean repository to push all the final documented code. This repository will include all my findings from the GSoC project, without the clutter from the numerous experimental versions in the previous repository. + + 2. Final Report: The GSoC final report will be written in this new repository, ensuring a structured and well-documented conclusion to the project. + +### Weekly Updates for Fossology GSoC Wiki + +- Fossology Wiki PR: + + 1. Objective: Finishing up my weekly updates to submit as a PR to the Fossology GSoC Wiki, documenting all progress made during the project. + +## Conclusions and Next Steps + +- Finalize the new clean repository: Continue working on the final report and pushing the documented code to the new repository. + +- Submit PR to Fossology Wiki: Prepare and submit the weekly updates as a PR to the Fossology GSoC Wiki. + + diff --git a/docs/2024/license-detection/updates/2024-09-05.md b/docs/2024/license-detection/updates/2024-09-05.md new file mode 100644 index 000000000..5e94f6efe --- /dev/null +++ b/docs/2024/license-detection/updates/2024-09-05.md @@ -0,0 +1,39 @@ +--- +title: Week 15 +author: Abdelrahman Jamal +--- + + +# Meeting 15 + +*(September 5,2024)* + +## Attendees: +- [Kaushlendra Pratap](https://github.com/Kaushl2208) +- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd) + +## Discussion: + +### Final Evaluation and Project Submission + +- GSoC Final Evaluation: + + 1. 
Deadline: The final evaluation for GSoC is due on September 9th. + + 2. Progress: I have completed updating my GSoC wiki locally and submitted a PR, finalizing the documentation of the project. + +- [GitHub Repository](https://github.com/Hero2323/GSoC-24): + + 1. Minor Changes: Currently finishing a few minor adjustments to the new GitHub repository, ensuring everything is clean and ready for the final submission. + +## Conclusions and Next Steps + +- Submit Final Evaluation: Submit the final GSoC evaluation by the September 9th deadline. + +- Finalize GitHub Repository: Complete and push the final changes to the repository before the project deadline. + +