
Feature: Add SARIF Support for URL Checker #91

Open · wants to merge 2 commits into master
Conversation

dwertent

Motivation

This PR introduces SARIF support to the urlchecker-python tool. SARIF is a standardized format for static analysis results, which improves integration with various tools and workflows, enabling better reporting and issue tracking in CI/CD pipelines.

How to Use

The new feature allows users to save URL-checking results in SARIF format by specifying the output format during the command execution:

$ urlchecker check --save results.sarif --format sarif .

The default behavior remains unaffected, with CSV being the default output format. Users can still generate CSV reports without any changes to their current workflow.
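For illustration, the resulting `results.sarif` might have roughly the following shape, sketched here by building the JSON in Python. The rule id, message text, file path, and line number are made-up values for this example, not the PR's actual output:

```python
import json

# Hypothetical SARIF 2.1.0 document for one broken URL; all values are
# illustrative assumptions, not the PR's exact output.
sarif_log = {
    "version": "2.1.0",
    "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
    "runs": [
        {
            "tool": {"driver": {"name": "urlchecker", "rules": []}},
            "results": [
                {
                    "ruleId": "broken-url",
                    "level": "error",
                    "message": {"text": "URL https://example.com/missing did not respond"},
                    "locations": [
                        {
                            "physicalLocation": {
                                "artifactLocation": {"uri": "docs/README.md"},
                                "region": {"startLine": 12},
                            }
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(sarif_log, indent=2))
```

The `locations` entry is what lets SARIF consumers (e.g., GitHub code scanning) annotate the exact file and line where the broken URL appears.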

Under the Hood

  • New Method: Added save_results_as_sarif to the UrlChecker class, generating SARIF logs.
  • Line Number Detection: Enhanced URL detection includes line numbers where URLs are found.
  • Flexible Output: Introduced save_results_format parameter to specify the desired output format (csv or sarif).
  • Backward Compatibility: Default output remains in CSV format to ensure current behavior is not affected.
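The `save_results_as_sarif` method described above could be sketched as follows. The function signature and the `failed_urls` shape (a mapping from file path to `(url, line_number)` tuples) are assumptions made for this sketch, not the PR's actual implementation:

```python
import json

def save_results_as_sarif(failed_urls, save_path):
    """Write broken-URL results as a SARIF 2.1.0 log.

    Sketch only: `failed_urls` maps a file path to a list of
    (url, line_number) tuples; names and shapes are assumptions.
    """
    results = []
    for file_name, urls in failed_urls.items():
        for url, line_number in urls:
            results.append(
                {
                    "ruleId": "broken-url",
                    "level": "error",
                    "message": {"text": f"Broken URL: {url}"},
                    "locations": [
                        {
                            "physicalLocation": {
                                "artifactLocation": {"uri": file_name},
                                "region": {"startLine": line_number},
                            }
                        }
                    ],
                }
            )
    sarif_log = {
        "version": "2.1.0",
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "runs": [{"tool": {"driver": {"name": "urlchecker"}}, "results": results}],
    }
    with open(save_path, "w") as handle:
        json.dump(sarif_log, handle, indent=2)

# Example usage with fabricated data:
save_results_as_sarif({"README.md": [("https://example.com/404", 3)]}, "results.sarif")
```

Keeping the per-format logic in a dedicated method like this leaves the existing CSV path untouched, which is how backward compatibility is preserved.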

Unit Tests

Included unit tests to verify the SARIF output functionality, ensuring the new format is correctly generated and validated.
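A test of this kind might look roughly like the following. This is an illustrative shape only (with a stand-in `make_minimal_sarif` helper), not the PR's actual test code:

```python
import json

def make_minimal_sarif(results):
    """Hypothetical helper standing in for the checker's SARIF writer."""
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": "urlchecker"}}, "results": results}],
    }

def test_sarif_is_valid_json_and_shape():
    # Build a minimal log, round-trip it through JSON, and check the
    # fields a SARIF consumer would rely on.
    log = make_minimal_sarif(
        [
            {
                "ruleId": "broken-url",
                "level": "error",
                "message": {"text": "Broken URL: https://example.com/404"},
            }
        ]
    )
    parsed = json.loads(json.dumps(log))
    assert parsed["version"] == "2.1.0"
    assert parsed["runs"][0]["results"][0]["level"] == "error"

test_sarif_is_valid_json_and_shape()
```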

@SuperKogito (Member)

This looks pretty neat, and I like the idea too. @vsoch what do you think?
At first look the code seems okay, but please make sure it passes the formatting, type checking, and unit tests.

@vsoch (Collaborator) commented May 29, 2024

My quick read is that SARIF is for static analysis tools relevant to security (e.g., code scanning), and I'm not sure a broken link falls under that scope. The RFC linked, https://www.rfc-editor.org/rfc/rfc3986, defines URI syntax; it has nothing to do with a URL like this, which is not invalid, just down.

There are arguably many ways you could convert this data, and I'd argue if you wanted this specific format, it would be better to be an external script or step run after extraction.

@dwertent it would be helpful to walk me through the use cases for adding this. On a first pass, I'm not convinced yet.

@dwertent (Author)

Hi @vsoch,

Thank you for your feedback. I understand your concerns regarding the use of SARIF for reporting broken links and the relevance of the RFC link I provided. Here are some points to clarify my intention and the use case for adding SARIF:

  1. Relevance of SARIF: While SARIF is indeed primarily designed for static analysis tools and security-related results, it can also be used for reporting other types of analysis, including broken link detection. SARIF provides a standardized format that can be consumed by a variety of tools and services, enhancing interoperability and ease of integration into CI/CD pipelines.

  2. Use Case:

    • Consistency and Standardization: Using SARIF allows for a consistent and standardized way to report broken links, which can be easily integrated with tools that support SARIF, such as GitHub's code scanning.
    • Tool Integration: Many modern development environments and CI/CD tools are starting to adopt SARIF for various types of analysis reports. By adopting SARIF, urlchecker-python can easily integrate with these environments and provide a more seamless user experience.
    • Enhanced Reporting: SARIF offers rich reporting features, including detailed result messages, file and region annotations, and support for multiple runs and tools. This can provide more insightful and actionable feedback to users.
  3. RFC Link Clarification: The RFC link provided was meant to illustrate the kind of structured format and standards SARIF aligns with. It wasn't meant to indicate a direct relation to invalid URLs but rather to show the structured nature of such formats.

  4. Alternative Solutions: I understand the suggestion to implement this as an external script or step. While this is a viable option, integrating SARIF directly into the tool would streamline the process for users, reducing the need for additional steps and external scripts. It would make the tool more versatile and easier to use out of the box. Additionally, for SARIF to work effectively, we need the exact line of the broken link, which is not currently provided in the CSV file. This data is lost in the current implementation, and integrating SARIF directly ensures that this crucial information is captured. Otherwise, we would need to modify the CSV format to include line numbers, which could complicate the current workflow.
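The line-number capture mentioned in point 4 could be sketched like this. `find_url_lines` is a hypothetical helper written for illustration, not code from the PR:

```python
import os
import tempfile

def find_url_lines(file_path, url):
    """Return the 1-based line numbers in `file_path` where `url` appears.

    Hypothetical helper illustrating line-number capture; the PR's actual
    detection logic may differ.
    """
    matches = []
    with open(file_path, encoding="utf-8", errors="ignore") as handle:
        for number, line in enumerate(handle, start=1):
            if url in line:
                matches.append(number)
    return matches

# Demo on a throwaway file: the URL sits on line 2.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as tmp:
    tmp.write("intro text\nsee https://example.com/404 here\n")
    demo_path = tmp.name

print(find_url_lines(demo_path, "https://example.com/404"))  # [2]
os.remove(demo_path)
```

Capturing this at extraction time is what makes the SARIF `region.startLine` field possible without changing the CSV format.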

I hope this clarifies the motivation and potential benefits of integrating SARIF for broken link reporting. I'd be happy to provide further details or discuss any other concerns you might have.
