JSON lines aggregate results #126

edoardottt · 2023-06-22T08:03:36Z

This PR closes #115.

@ocervell what do u think?

This is a comparison test with the one shown in the issue:

{
  "url": "http://testphp.vulnweb.com/",
  "method": "GET",
  "status_code": 200,
  "words": 388,
  "lines": 110,
  "content_type": "text/html",
  "matches": {
    "infos": {
      "Email address": [
        "[email protected]"
      ],
      "HTML comment": [
        "<!-- InstanceEndEditable -->",
        "<!-- here goes headers headers -->",
        "<!-- end masthead -->",
        "<!-- begin content -->",
        "<!--end content -->",
        "<!--end navbar -->",
        "<!-- InstanceEnd -->"
      ]
    }
  }
}

ocervell · 2023-07-25T16:59:33Z

pkg/output/jsonl_test.go

@@ -122,7 +122,7 @@ func TestJSONOutput(t *testing.T) {
 			filetype:   filetype,
 			errors:     errors,
 			infos:      infos,
-			want:       `{"url":"http://test.com.pdf?id=5","method":"GET","status_code":200,"words":1,"lines":1,"content_type":"application/pdf","content_length":128,"matches":{"filetype":{"extension":"pdf","severity":7},"parameters":[{"name":"id","attacks":[]}],"errors":[{"name":"MySQL error","match":"it is a MySQL error happening"}],"infos":[{"name":"info1","match":"its my pleasure to inform you on this great day"}],"secrets":[{"name":"mysecret","match":"it's a random day for my secret regex to be found"}]}}`, //nolint:lll
+			want:       `{"url":"http://test.com.pdf?id=5","method":"GET","status_code":200,"words":1,"lines":1,"content_type":"application/pdf","content_length":128,"matches":{"filetype":{"extension":"pdf","severity":7},"parameters":[{"name":"id","attacks":[]}],"errors":{"MySQL error":["it is a MySQL error happening"]},"infos":{"info1":["its my pleasure to inform you on this great day"]},"secrets":{"mysecret":["it's a random day for my secret regex to be found"]}}}`, //nolint:lll


imo this new format is a bit harder to parse than the previous one, for instance with jq you would need to know the error name in advance, or make 2 different calls to parse it. It also prevents adding more data later, which relates to my other comment about MatcherResult struct.
That said, the general idea of combining multiple secrets of the same type is good.

Maybe something like:

"secrets":[ { "type":"Regex", "results": [ {"match": "it's a random day for my secret regex to be found", "source": "body", "location": "line 42"} ] } ]

would be better, while still doing some grouping

ocervell · 2023-07-25T17:08:37Z

pkg/output/jsonl.go

-type MatcherResult struct {
-	Name  string `json:"name"`
-	Match string `json:"match"`
+	Errors     map[string][]string `json:"errors,omitempty"`


I think we should keep the previous MatcherResult struct for the output, so that we can evolve it later. Maybe there will be more keys than 'Name' and 'Match' later on, and we could add them here. For instance we could add details for each match (which line it was matched at, or the location it was found at (HTML body, URL, parameters,...).

edoardottt · 2023-08-07T06:55:48Z

Thanks for the advice :)) appreciated!!

ocervell · 2023-10-12T15:29:33Z

Adding another two cents from my usage of cariddi recently:

I came across huge matches like:

[
  {
    "name":"PHP error",
    "match":"PHP error"
  },
  {
    "name":"MySQL error",
    "match":"warning_forbid_default_priv"<MORE THAN 20000 LINES HERE>"
  }
]

which completely destroy my terminal 😄

So we might think about either:

truncate the output a bit when matching a regex and maybe add a CLI flag / env variable to control the truncate character limit
improving regexes such that it doesn't match too much but only up to a few lines before / after (maybe up to the next newline but not sure how it would work for e.g Python tracebacks) -> i'm sure we can find a way to do better ;) Probably adding regex matching tests would help
adding which regex matched to the output - for instance MySQL error is comprised of multiple regexes and it would help to know which one of them matched

We could end up with a JSON format like:

[
  {
     "name": "MySQL Error",
     "results": [
        {
           "type": "Regex",
           "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "regex": "(?i)Warning.*?mysqli?", "location": "line 42", "source": "body"}
        }
     ]
   }
]

Additionally, regexes have their limits - ideally we want to see one step further and create some kind of pattern-recognition algorithms, or using even using ML for this kind of tasks. It could be a good evolution for cariddi ;) The type key would be useful in that case to differenciate the matches from regex matches:

[
  {
    "type": "PatternFinder",
    "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "matcher": "error-finder", "version": "2.0.1"}
  },
  {
    "type": "ML",
    "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}
  }
]

There is also room to improve the findings by filtering which ones are found important or not, for instance:

an HTML comment containing "TODO / DO THIS LATER / PASSWORD / etc..." is important
an HTML comment containing a software version is important
an HTML comment like "" is not important
an email starting with licensing@<domain> or sales@<domain> is very common and not very sensitive
an error / exception with an actual traceback is very sensitive
etc...

Those "rules" could be first hardcoded by us on a case-by-case and then learned by ML as well at some point, and a severity field could be set for each finding.

There might be a need to create separate issues for some of those points since it's not directly linked to the JSON lines aggregation. Feel free to copy-paste some of my comments there.

edoardottt · 2023-10-16T09:20:32Z

Hi @ocervell .

I've thought a lil bit before commenting on this. Imo the best thing to do is this:

Close this issue JSON lines aggregate results #115 and close this PR too since the current implementation is better than the proposed one.
Open a discussion with your last comment so that me and you (and the gh community) can leave thoughts there since it's a better place to discuss about features

edoardottt added 2 commits June 22, 2023 10:02

JSON lines aggregate results #115

a7e56fd

JSON lines aggregate results #115

098a9c1

edoardottt added enhancement New feature or request Go labels Jun 22, 2023

edoardottt self-assigned this Jun 22, 2023

ocervell reviewed Jul 25, 2023

View reviewed changes

edoardottt added 2 commits August 24, 2023 12:08

mod update

e79ecdc

update deps

b2f2033

edoardottt closed this Oct 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON lines aggregate results #126

JSON lines aggregate results #126

edoardottt commented Jun 22, 2023 •

edited

Loading

ocervell Jul 25, 2023 •

edited

Loading

ocervell Jul 25, 2023 •

edited

Loading

edoardottt commented Aug 7, 2023

ocervell commented Oct 12, 2023 •

edited

Loading

edoardottt commented Oct 16, 2023

JSON lines aggregate results #126

JSON lines aggregate results #126

Conversation

edoardottt commented Jun 22, 2023 • edited Loading

ocervell Jul 25, 2023 • edited Loading

Choose a reason for hiding this comment

ocervell Jul 25, 2023 • edited Loading

Choose a reason for hiding this comment

edoardottt commented Aug 7, 2023

ocervell commented Oct 12, 2023 • edited Loading

edoardottt commented Oct 16, 2023

edoardottt commented Jun 22, 2023 •

edited

Loading

ocervell Jul 25, 2023 •

edited

Loading

ocervell Jul 25, 2023 •

edited

Loading

ocervell commented Oct 12, 2023 •

edited

Loading