- Can you expand a bit on: what is the intuition that complex functions won't have interesting features?
- In this thread, let's collect examples of complex functions that we'd want to ignore.
-
I am creating this discussion to gather insights on potential heuristics we can implement to identify and exclude complex functions from feature extraction. In this context, a complex function refers to a subroutine with challenging characteristics that can lead capa to spend time extracting features that would not be useful for rule matching.
The main concern here is the computational cost associated with implementing these heuristics. Ideally, the detection mechanisms should not introduce significant overhead and should be straightforward to calculate while minimizing false positives.
To match complex functions, we can use different metrics that pinpoint complexity: basic blocks, control flow graphs, instruction sequences, function size, call frequency, etc. The following are some heuristics that I think can be helpful:
Number of basic blocks: complex functions typically have more basic blocks than regular functions. A function with many basic blocks has multiple paths of execution and potentially more complex control flow.
Large basic blocks: if a function contains basic blocks that are significantly larger than the average (i.e., more than 7 instructions <-- just an estimate), it indicates the presence of complex straight-line code. We can compute this heuristic as $\frac{\#\ instructions}{\#\ basic\ blocks}$. This should be fairly easy to compute.
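The two block-based heuristics above can be sketched together. This is a minimal illustration, not capa's actual API: the basic-block representation (a list of instruction lists) and both thresholds are assumptions.

```python
# Hypothetical sketch: flag functions with many or unusually large basic blocks.
# The block representation and the cutoff values are assumptions for illustration.

MAX_BLOCKS = 64         # assumed cutoff for "too many" basic blocks
MAX_AVG_BLOCK_SIZE = 7  # from the estimate above: > 7 instructions per block

def is_complex_by_blocks(blocks):
    """blocks: list of basic blocks, each given as a list of instructions.

    Returns True if the function has more blocks than MAX_BLOCKS, or if the
    average block size (#instructions / #basic blocks) exceeds the threshold.
    """
    if not blocks:
        return False
    num_instructions = sum(len(bb) for bb in blocks)
    avg_block_size = num_instructions / len(blocks)
    return len(blocks) > MAX_BLOCKS or avg_block_size > MAX_AVG_BLOCK_SIZE
```

Both quantities fall out of a single pass over the function, so the check adds negligible overhead on top of the disassembly capa already performs.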
Frequently called functions: these functions often indicate the presence of statically linked routines, API hashing, or string decryption algorithms. These functions are typically invoked by other parts of the program to resolve addresses/strings dynamically. However, I think relying solely on this heuristic can overlook functions that contain valuable features. For example, a hash-resolving function is commonly called but would likely contain a numeric constant (key) that is used to resolve the API. To mitigate this, we can couple this with another complementary heuristic.
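A sketch of the call-frequency side of this heuristic, assuming we already have a list of (caller, callee) edges from the program's call graph; the threshold value is an assumption:

```python
from collections import Counter

def frequently_called(call_edges, threshold=20):
    """call_edges: iterable of (caller_addr, callee_addr) pairs.

    Returns the set of callee addresses that are invoked from at least
    `threshold` call sites (threshold chosen arbitrarily for illustration).
    """
    counts = Counter(callee for _, callee in call_edges)
    return {callee for callee, n in counts.items() if n >= threshold}
```

As noted above, anything this flags (e.g., a hash-resolving routine) should still be cross-checked against a complementary signal before its features are discarded.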
Cyclomatic complexity: a metric used to measure the complexity of a function. It is calculated by counting the number of linearly independent paths through the code. In other words, it measures the number of decision points or branching statements (e.g., if statements, loops) in the code. The formula for calculating the cyclomatic complexity of a function is $complexity = \#edges - \#blocks + 2$. Higher scores suggest that the function has more branching paths. See this Ghidra script for a practical example.
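The formula translates directly into code once we have the function's CFG. A minimal sketch, assuming the CFG is given as an edge list plus a node count:

```python
def cyclomatic_complexity(edges, num_blocks):
    """edges: list of (src, dst) CFG edges; num_blocks: number of basic blocks.

    Computes M = #edges - #blocks + 2 for a single connected function.
    """
    return len(edges) - num_blocks + 2
```

For example, a single if/else "diamond" (entry, then, else, merge) has 4 blocks and 4 edges, giving M = 4 - 4 + 2 = 2, i.e., two independent paths through the function.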
Exclude functions only called from library code: without any optimization (example) and in the worst case, this becomes computationally expensive since we have to walk up the entire function call graph to decide if a function is only called from library code.
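To make the cost concrete, here is an unoptimized sketch of that upward walk. The data structures (a caller map and a set of known library functions) are assumptions; a real implementation would memoize results across queries rather than re-walking the graph per function.

```python
def only_called_from_library(func, callers_of, is_library, _seen=None):
    """Walk up the call graph to decide if `func` is reachable only
    from library code.

    callers_of: dict mapping function -> set of its direct callers.
    is_library: set of functions already identified as library code.

    Returns False as soon as any path reaches non-library code with no
    further callers; True if every path terminates in library code.
    """
    if _seen is None:
        _seen = set()
    if func in _seen:          # cycle: this path cannot disprove the claim
        return True
    _seen.add(func)

    callers = callers_of.get(func, set())
    if not callers:
        # No callers at all: an entry point or unreferenced function,
        # so it is not "only called from library code".
        return False

    for caller in callers:
        if caller in is_library:
            continue           # this path ends in library code
        # Non-library caller: acceptable only if it is itself
        # reachable solely from library code.
        if not only_called_from_library(caller, callers_of, is_library, _seen):
            return False
    return True
```

The worst case visits every ancestor in the call graph, which is exactly the expense noted above.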
References: [1], [2], [3]