- Can you expand a bit on: what is the intuition that complex functions won't have interesting features?
- In this thread, let's collect examples of complex functions that we'd want to ignore.
-
I am creating this discussion to gather insights on potential heuristics we can implement to identify and exclude complex functions from feature extraction. In this context, a complex function refers to a subroutine with challenging characteristics that can lead capa to spend time extracting features that would not be useful for rule matching.
The main concern here is the computational cost associated with implementing these heuristics. Ideally, the detection mechanisms should not introduce significant overhead and should be straightforward to calculate while minimizing false positives.
To match complex functions, we can use different metrics that pinpoint complexity: basic blocks, control flow graphs, instruction sequences, function size, call frequency, etc. The following are some heuristics that I think can be helpful:
Number of basic blocks: complex functions typically have more basic blocks than regular functions. A function with many basic blocks has multiple paths of execution and potentially more complex control flow.
Large basic blocks: if a function contains basic blocks that are significantly larger than the average (i.e., more than 7 instructions <-- just an estimate), it indicates the presence of complex straight-line code. We can compute this heuristic as $\frac{\#\ instructions}{\#\ basic\ blocks}$. This should be fairly easy to compute.
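The two block-based heuristics above can be sketched together. This is a minimal illustration, not capa's actual API: the basic-block representation (a list of instruction lists) and both thresholds are assumptions.

```python
# Hypothetical sketch: flag functions with many or unusually large basic blocks.
# The block representation and the cutoff values are assumptions for illustration.

MAX_BLOCKS = 64         # assumed cutoff for "too many" basic blocks
MAX_AVG_BLOCK_SIZE = 7  # from the estimate above: > 7 instructions per block

def is_complex_by_blocks(blocks):
    """blocks: list of basic blocks, each given as a list of instructions.

    Returns True if the function has more blocks than MAX_BLOCKS, or if the
    average block size (#instructions / #basic blocks) exceeds the threshold.
    """
    if not blocks:
        return False
    num_instructions = sum(len(bb) for bb in blocks)
    avg_block_size = num_instructions / len(blocks)
    return len(blocks) > MAX_BLOCKS or avg_block_size > MAX_AVG_BLOCK_SIZE
```

Both quantities fall out of a single pass over the function, so the check adds negligible overhead on top of the disassembly capa already performs.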
Frequently called functions: these functions often indicate the presence of statically linked routines, API hashing, or string decryption algorithms. These functions are typically invoked by other parts of the program to resolve addresses/strings dynamically. However, I think relying solely on this heuristic can overlook functions that contain valuable features. For example, a hash-resolving function is commonly called but would likely contain a numeric constant (key) that is used to resolve the API. To mitigate this, we can couple this with another complementary heuristic.
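A sketch of the call-frequency side of this heuristic, assuming we already have a list of (caller, callee) edges from the program's call graph; the threshold value is an assumption:

```python
from collections import Counter

def frequently_called(call_edges, threshold=20):
    """call_edges: iterable of (caller_addr, callee_addr) pairs.

    Returns the set of callee addresses that are invoked from at least
    `threshold` call sites (threshold chosen arbitrarily for illustration).
    """
    counts = Counter(callee for _, callee in call_edges)
    return {callee for callee, n in counts.items() if n >= threshold}
```

As noted above, anything this flags (e.g., a hash-resolving routine) should still be cross-checked against a complementary signal before its features are discarded.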
Cyclomatic complexity: a metric used to measure the complexity of a function. It is calculated by counting the number of linearly independent paths through the code. In other words, it measures the number of decision points or branching statements (e.g., if statements, loops) in the code. The formula for calculating the cyclomatic complexity of a function is $complexity = \#edges - \#blocks + 2$. Higher scores suggest that the function has more branching paths. See this Ghidra script for a practical example.
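The formula translates directly into code once we have the function's CFG. A minimal sketch, assuming the CFG is given as an edge list plus a node count:

```python
def cyclomatic_complexity(edges, num_blocks):
    """edges: list of (src, dst) CFG edges; num_blocks: number of basic blocks.

    Computes M = #edges - #blocks + 2 for a single connected function.
    """
    return len(edges) - num_blocks + 2
```

For example, a single if/else "diamond" (entry, then, else, merge) has 4 blocks and 4 edges, giving M = 4 - 4 + 2 = 2, i.e., two independent paths through the function.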
Exclude functions only called from library code: without any optimization (example) and in the worst case, this becomes computationally expensive since we have to walk up the entire function call graph to decide if a function is only called from library code.
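To make the cost concrete, here is an unoptimized sketch of that upward walk. The data structures (a caller map and a set of known library functions) are assumptions; a real implementation would memoize results across queries rather than re-walking the graph per function.

```python
def only_called_from_library(func, callers_of, is_library, _seen=None):
    """Walk up the call graph to decide if `func` is reachable only
    from library code.

    callers_of: dict mapping function -> set of its direct callers.
    is_library: set of functions already identified as library code.

    Returns False as soon as any path reaches non-library code with no
    further callers; True if every path terminates in library code.
    """
    if _seen is None:
        _seen = set()
    if func in _seen:          # cycle: this path cannot disprove the claim
        return True
    _seen.add(func)

    callers = callers_of.get(func, set())
    if not callers:
        # No callers at all: an entry point or unreferenced function,
        # so it is not "only called from library code".
        return False

    for caller in callers:
        if caller in is_library:
            continue           # this path ends in library code
        # Non-library caller: acceptable only if it is itself
        # reachable solely from library code.
        if not only_called_from_library(caller, callers_of, is_library, _seen):
            return False
    return True
```

The worst case visits every ancestor in the call graph, which is exactly the expense noted above.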
References: [1], [2], [3]