Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON lines aggregate results #126

Closed
wants to merge 4 commits into from
Closed

JSON lines aggregate results #126

wants to merge 4 commits into from

Conversation

edoardottt
Copy link
Owner

@edoardottt edoardottt commented Jun 22, 2023

This PR closes #115.

@ocervell what do u think?

This is a comparison test with the one shown in the issue:

{
  "url": "http://testphp.vulnweb.com/",
  "method": "GET",
  "status_code": 200,
  "words": 388,
  "lines": 110,
  "content_type": "text/html",
  "matches": {
    "infos": {
      "Email address": [
        "[email protected]"
      ],
      "HTML comment": [
        "<!-- InstanceEndEditable -->",
        "<!-- here goes headers headers -->",
        "<!-- end masthead -->",
        "<!-- begin content -->",
        "<!--end content -->",
        "<!--end navbar -->",
        "<!-- InstanceEnd -->"
      ]
    }
  }
}

@edoardottt edoardottt added enhancement New feature or request Go labels Jun 22, 2023
@edoardottt edoardottt self-assigned this Jun 22, 2023
@@ -122,7 +122,7 @@ func TestJSONOutput(t *testing.T) {
filetype: filetype,
errors: errors,
infos: infos,
want: `{"url":"http://test.com.pdf?id=5","method":"GET","status_code":200,"words":1,"lines":1,"content_type":"application/pdf","content_length":128,"matches":{"filetype":{"extension":"pdf","severity":7},"parameters":[{"name":"id","attacks":[]}],"errors":[{"name":"MySQL error","match":"it is a MySQL error happening"}],"infos":[{"name":"info1","match":"its my pleasure to inform you on this great day"}],"secrets":[{"name":"mysecret","match":"it's a random day for my secret regex to be found"}]}}`, //nolint:lll
want: `{"url":"http://test.com.pdf?id=5","method":"GET","status_code":200,"words":1,"lines":1,"content_type":"application/pdf","content_length":128,"matches":{"filetype":{"extension":"pdf","severity":7},"parameters":[{"name":"id","attacks":[]}],"errors":{"MySQL error":["it is a MySQL error happening"]},"infos":{"info1":["its my pleasure to inform you on this great day"]},"secrets":{"mysecret":["it's a random day for my secret regex to be found"]}}}`, //nolint:lll
Copy link
Contributor

@ocervell ocervell Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo this new format is a bit harder to parse than the previous one, for instance with jq you would need to know the error name in advance, or make 2 different calls to parse it. It also prevents adding more data later, which relates to my other comment about MatcherResult struct.
That said, the general idea of combining multiple secrets of the same type is good.

Maybe something like:

"secrets":[
  {
     "type":"Regex",
     "results": [
        {"match": "it's a random day for my secret regex to be found", "source": "body", "location": "line 42"}
     ]
   }
]

would be better, while still doing some grouping

type MatcherResult struct {
Name string `json:"name"`
Match string `json:"match"`
Errors map[string][]string `json:"errors,omitempty"`
Copy link
Contributor

@ocervell ocervell Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep the previous MatcherResult struct for the output, so that we can evolve it later. Maybe there will be more keys than 'Name' and 'Match' later on, and we could add them here. For instance we could add details for each match (which line it was matched at, or the location it was found at (HTML body, URL, parameters,...).

@edoardottt
Copy link
Owner Author

Thanks for the advice :)) appreciated!!

@ocervell
Copy link
Contributor

ocervell commented Oct 12, 2023

Adding another two cents from my usage of cariddi recently:

I came across huge matches like:

[
  {
    "name":"PHP error",
    "match":"PHP error"
  },
  {
    "name":"MySQL error",
    "match":"warning_forbid_default_priv"<MORE THAN 20000 LINES HERE>"
  }
]

which completely destroy my terminal 😄

So we might think about either:

  • truncate the output a bit when matching a regex and maybe add a CLI flag / env variable to control the truncate character limit
  • improving regexes such that it doesn't match too much but only up to a few lines before / after (maybe up to the next newline but not sure how it would work for e.g Python tracebacks) -> i'm sure we can find a way to do better ;) Probably adding regex matching tests would help
  • adding which regex matched to the output - for instance MySQL error is comprised of multiple regexes and it would help to know which one of them matched

We could end up with a JSON format like:

[
  {
     "name": "MySQL Error",
     "results": [
        {
           "type": "Regex",
           "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "regex": "(?i)Warning.*?mysqli?", "location": "line 42", "source": "body"}
        }
     ]
   }
]

Additionally, regexes have their limits - ideally we want to see one step further and create some kind of pattern-recognition algorithms, or using even using ML for this kind of tasks. It could be a good evolution for cariddi ;) The type key would be useful in that case to differenciate the matches from regex matches:

[
  {
    "type": "PatternFinder",
    "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "matcher": "error-finder", "version": "2.0.1"}
  },
  {
    "type": "ML",
    "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}
  }
]

There is also room to improve the findings by filtering which ones are found important or not, for instance:

  • an HTML comment containing "TODO / DO THIS LATER / PASSWORD / etc..." is important
  • an HTML comment containing a software version is important
  • an HTML comment like "<!-- InstanceEndEditable -->" is not important
  • an email starting with licensing@<domain> or sales@<domain> is very common and not very sensitive
  • an error / exception with an actual traceback is very sensitive
    etc...

Those "rules" could be first hardcoded by us on a case-by-case and then learned by ML as well at some point, and a severity field could be set for each finding.

There might be a need to create separate issues for some of those points since it's not directly linked to the JSON lines aggregation. Feel free to copy-paste some of my comments there.

@edoardottt
Copy link
Owner Author

Hi @ocervell .

I've thought a lil bit before commenting on this. Imo the best thing to do is this:

  • Close this issue JSON lines aggregate results #115 and close this PR too since the current implementation is better than the proposed one.
  • Open a discussion with your last comment so that me and you (and the gh community) can leave thoughts there since it's a better place to discuss about features

@edoardottt edoardottt closed this Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Go
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON lines aggregate results
2 participants