dynamic scope design #1517

williballenthin · 2023-06-05T10:17:12Z

williballenthin
Jun 5, 2023
Maintainer

following the discussion in #1516, let's sketch out the various designs for dynamic scope, including how it looks to invoke capa.exe, what rules look like, and how the results might be rendered.

considerations include:

What dynamic features are there?
- Same as static?
- Write example rules
Invocation?
- capa.exe malware.exe trace.json
- --do-dynamic-analysis=true submits the file to the CAPE API and fetches results when ready, and/or fetches existing results
Is dynamic a subscope of file?
- if so, then we can mix static analysis with dynamic analysis
- Willi’s intuition is that this is more complex
- Static scope implicit right now, how about dynamic?
Support existing rules? We have 800 existing rules - it would be best if we can continue to use these without modification (or small modification) in the dynamic scope. Is this possible?

ghost · 2023-06-05T10:21:39Z

ghost
Jun 5, 2023

static and dynamic scopes can be mixed

in summary: capa can do both static and dynamic analysis and reason about the results together. that is, it can express things like: "when I see this in the file and that in the API trace, then the sample must be able to do XYZ".

discussion: we should find motivating examples of these rules; otherwise, the additional complexity of this approach may not be worth it.

invocation of capa.exe (ideas):

$ capa.exe /path/to/suspicious.exe                                           # no dynamic analysis done here
$ capa.exe /path/to/suspicious.exe  /path/to/suspicious.trace.json           # existing trace file on disk
$ capa.exe /path/to/suspicious.exe  http://cape.com/traces/suspicious.exe    # fetch trace from CAPE API
$ capa.exe /path/to/suspicious.exe  --submit-to-cape=true                    # submit to CAPE and fetch results

not all of these are required, though they may be useful/convenient for our users. the local trace file on disk is an easy place to start.

rule examples

these rule examples should show both static and dynamic features in the same rule, and be used in a way that cannot be done when static and dynamic are separate.

1 reply

yelhamer Jun 5, 2023
Collaborator

Regarding the capa.exe invocation, the idea I have in mind is to store sandbox-relevant information in a json/yaml configuration file that exists in capa's home directory. This way, users can add as many sandboxes as they want in the future (provided that they are supported by either us or by the end user); then, they can specify for example which set of supported sandboxes they want to be used to extract the features. I am thinking of the following call syntax:

$ capa.exe /path/to/suspicious.exe --static              # run only static analysis with the default backend
$ capa.exe /path/to/suspicious.exe --static -b binja     # run only static analysis for the binary ninja backend
$ capa.exe /path/to/suspicious.exe --dynamic             # do dynamic analysis with cape, all sandboxes (or preferred ones as indicated in the configuration file)
$ capa.exe /path/to/suspicious.exe --dynamic cape, vmray # do dynamic analysis for cape and vmray, i.e. use both extractors
$ capa.exe /path/to/suspicious.exe                       # this implies both static and dynamic

these parameters (static and dynamic) would eventually determine which extractors get returned by the get_extractor() function.
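A minimal sketch of how these flags could be parsed into a list of requested analyses, which is the information a function like get_extractor() would need. The parser, the `requested_analyses` helper, and `DEFAULT_SANDBOXES` are hypothetical (in the proposal the preferred sandboxes would come from the configuration file); only the flag names come from the examples above. argparse splits on whitespace, so the sandboxes are passed as `cape vmray` rather than `cape, vmray`.

```python
import argparse

# Assumption: stands in for the preferred sandboxes read from the config file.
DEFAULT_SANDBOXES = ["cape"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="capa.exe")
    parser.add_argument("sample")
    parser.add_argument("--static", action="store_true")
    parser.add_argument("-b", "--backend", default="vivisect")
    # nargs="*": `--dynamic` alone means "use the defaults",
    # `--dynamic cape vmray` names the sandboxes explicitly.
    parser.add_argument("--dynamic", nargs="*", default=None)
    return parser


def requested_analyses(argv):
    """Return a list of (flavor, backend_or_sandbox) pairs."""
    args = build_parser().parse_args(argv)
    static = args.static
    dynamic = args.dynamic is not None
    if not static and not dynamic:
        # neither flag given: this implies both static and dynamic
        static = dynamic = True

    plan = []
    if static:
        plan.append(("static", args.backend))
    if dynamic:
        for sandbox in args.dynamic or DEFAULT_SANDBOXES:
            plan.append(("dynamic", sandbox))
    return plan
```

For example, `requested_analyses(["suspicious.exe", "--dynamic", "cape", "vmray"])` asks for one dynamic extractor per named sandbox and no static analysis.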

ghost · 2023-06-05T10:21:57Z

ghost
Jun 5, 2023

static and dynamic scopes are separate and cannot be mixed

in summary: static and dynamic analysis cannot be mixed. capa detects on startup which mode it's in and only does that thing.

discussion: this is probably much easier to implement; however, do we miss out on crucial functionality here?

invocation of capa.exe:

$ capa.exe /path/to/suspicious.exe
$ capa.exe /path/to/suspicious.trace.json

but note that these cannot be mixed.
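One way the "detect on startup" step could look, as a rough sketch. It assumes that executables are recognized by their magic bytes and that a trace is any file that parses as JSON; the `detect_flavor` name and these heuristics are illustrative, not something capa does today.

```python
import json
from pathlib import Path


def detect_flavor(path) -> str:
    """Guess whether the input is a program (static) or a sandbox trace (dynamic)."""
    data = Path(path).read_bytes()

    # PE files start with "MZ", ELF files with "\x7fELF".
    if data[:2] == b"MZ" or data[:4] == b"\x7fELF":
        return "static"

    try:
        json.loads(data)
    except ValueError:
        # covers both invalid JSON and undecodable bytes
        raise ValueError(f"unrecognized input: {path}") from None
    return "dynamic"
```

Since only one input is accepted per run, the result selects exactly one mode and the two are never mixed.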

rule examples:

static rule (existing format):

rule:
  meta:
    name: hash data with CRC32
    namespace: data-manipulation/checksum/crc32
    scope: function
    mbc:
      - Data::Checksum::CRC32 [C0032.001]
  features:
    - or:
      - and:
        - mnemonic: shr
        - or:
          - number: 0xEDB88320
          - bytes: 00 00 00 00 96 30 07 77 2C 61 0E EE BA 51 09 99 19 C4 6D 07 8F F4 6A 70 35 A5 63 E9 A3 95 64 9E = crc32_tab
        - number: 8
        - characteristic: nzxor
      - and:
        - number: 0x8320
        - number: 0xEDB8
        - characteristic: nzxor
      - api: RtlComputeCrc32

dynamic rule (proposed format):

rule:
  meta:
    name: hash data with CRC32
    namespace: data-manipulation/checksum/crc32
    scope: dynamic
    mbc:
      - Data::Checksum::CRC32 [C0032.001]
  features:
    - or:
      - api: RtlComputeCrc32

note that we update scope: dynamic and remove branches of the static rule that deal with disassembly features. i don't think any sandbox will give us a useful instruction trace (and, if so, it would be massive). but, we keep around API features that can also be recognized by the dynamic feature extractor.

question: can we somehow do this automatically, without changing scope: dynamic? how does this work? what are the semantics? how do we handle things like scope: function or scope: basic block?

another example:

rule:
  meta:
    name: reference HTTP User-Agent string
    namespace: communication/http
    scope: function
  features:
    - or:
      - substring: "Mozilla/5.0"
      - substring: "like Gecko"
      - api: urlmon.ObtainUserAgentString
      - property/read: System.Net.HttpWebRequest::UserAgent

rule:
  meta:
    name: reference HTTP User-Agent string
    namespace: communication/http
    scope: dynamic
  features:
    - or:
      - substring: "Mozilla/5.0"
      - substring: "like Gecko"
      - api: urlmon.ObtainUserAgentString
      - property/read: System.Net.HttpWebRequest::UserAgent

in this example, all the features can probably be extracted by a dynamic trace. the translation from scope: function to scope: dynamic seems like it could be automatic.

i imagine that substring: Mozilla/5.0 means that this substring is present within a string provided as an argument to any API function captured by the dynamic trace. the API name is not restricted, and could be anything, including SetUserAgent, free, memcpy, etc.
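To make that interpretation concrete, here is a small sketch. The trace layout (a list of calls, each with an `api` name and an `arguments` mapping) is an assumption for illustration and not the format of any particular sandbox.

```python
def iter_string_arguments(trace):
    """Yield (api_name, value) for every string argument in the trace."""
    for call in trace:
        for value in call.get("arguments", {}).values():
            if isinstance(value, str):
                yield call["api"], value


def match_substring(trace, needle):
    """Return the names of the API calls that received `needle` in a string argument."""
    return [api for api, value in iter_string_arguments(trace) if needle in value]


# Hypothetical trace: the API name is not restricted in any way.
trace = [
    {"api": "memcpy", "arguments": {"src": "Mozilla/5.0 (Windows NT 10.0)", "n": 29}},
    {"api": "CreateFileW", "arguments": {"path": "C:\\windows\\temp\\a.txt"}},
]
```

With this trace, `match_substring(trace, "Mozilla/5.0")` reports the `memcpy` call, even though it is not an HTTP-related API.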

5 replies

yelhamer Jun 5, 2023
Collaborator

I thought some more about mixing dynamic and static features the way I initially proposed, and I agree that it would definitely be a lot harder to implement. Most notably, we would have to come up with some guidelines on how to handle rules that mix up dynamic and static features at lower scopes:

should we ignore those rules when doing a purely static or purely dynamic analysis?
should we do some more elaborate parsing: if the rule's statement does an and between a static feature and a dynamic one, then filter out that rule; however, if it's an or, evaluate that dynamic feature to False?
other guidelines that need to be developed?

coming up with these guidelines would require some effort, and so would implementing them.

note however that applying such rule-filtering guidelines might be simpler, easier, and worthwhile to do if it is done at "higher" scopes, and if the rule is fairly simple such as:

features:
  - or:
    - match: anti-analysis/packer/static-rule-for-a-packer.yml
    - match: anti-analysis/packer/dynamic-rule-for-a-packer.yml

In summary, for the initial case I believe that it would make dealing with rules much more difficult, for both maintainers and rule authors. I can imagine someone working from the assumption that a sample does not have a specific capability that they described in a rule, only to find out that the rule was filtered out by capa during "mixed" (as in dynamic+static) analysis.

Mixing rules at that level would most certainly allow for great expressibility, but I think that it should be skipped for now given that the need for it isn't really that clear.

So, what I suggest is to move forward with finding a place for the dynamic scope while being agnostic of the previously discussed dynamic and static close feature mangling, all while keeping in mind that it might be a thing that would be worth implementing in the future — perhaps if the need for it is clearly expressed or some new sandbox comes along with interesting features.

Ideas that come to mind initially are to have it as a scope that is parallel to the file scope, which would make integrating it an easy task. I will try to organize my thoughts some more on this and share them soon.

Thoughts? @williballenthin @mr-tz

yelhamer Jun 5, 2023
Collaborator

One thing I forgot to mention is the following:

Mixing up static and dynamic features at lower scopes would be extremely interesting in the case of named api arguments, since that is something that has been hindering static analysis (according to the Radboud Master's thesis) thus far. However, we should be able to get around that by creating separate dynamic rules that do the required named api argument matching, and then reference those when needed using a match feature. So something like this?

rule:
  meta:
    ...
  features:
    ...
    - and:
      - or:
        - string: ...  # some known C2 strings
        ...
      - match: dynamic-send-tcp-data-via-wfp-api    # less false-positives this way
    ...

This might be somewhat off-topic: Correct me if I am wrong, but I believe that this would work because the aforementioned rule would be looking for a MatchedResult feature for that other dynamic rule that has been added into the ruleset, and would need to be evaluated by the dynamic extractor; as opposed to the case of explicitly stating the dynamic features to be used, in which case the rule's statement would be a bunch of static and dynamic features mixed at closer levels.

yelhamer Jun 5, 2023
Collaborator

Regarding the substring feature, there are several places we could get strings from using cape. It could be a string that exists in the file, or captured network traffic, accessed file/registry key, written content to a file handle, etc.

See this sample for example: https://capesandbox.com/analysis/394144/

williballenthin Jun 5, 2023
Maintainer Author

while being agnostic of the previously discussed dynamic and static close feature mangling, all while keeping in mind that it might be a thing that would be worth implementing in the future — perhaps if the need for it is clearly expressed or some new sandbox comes along with interesting features.

I think this is a really good point. we can move ahead and explore the dynamic scope separately, keeping open the opportunity for merging those features with static scope if we can figure out an elegant design. i like that this allows for short term progress while keeping options open.

williballenthin Jun 5, 2023
Maintainer Author

I believe that this would work because the aforementioned rule would be looking for a MatchedResult feature for that other dynamic rule that has been added into the ruleset, and would need to be evaluated by the dynamic extractor

i think this is correct. this might be a feasible bound for how static and dynamic can be mixed, though i think it still suffers from complexity around how scopes work. touching on an earlier point, maybe we could enable mixing static and dynamic scopes only at the highest level and/or only via match statements.

mr-tz · 2023-06-05T10:37:02Z

mr-tz
Jun 5, 2023
Maintainer

Dynamic features

Static features to support

api
number
string
global features (os, arch, format)?

Static features that are out of (dynamic) scope :)

static characteristics
namespace
class
property
offset
mnemonic
static basic block features
static function features
static file features
bytes?
operand?

Additional dynamic-only features

sequence (api calls in order)
tracking of handles and other objects (to tie API calls together)
return value

New features

call arguments - see call scope function call arguments syntax #921, viv: x86: extract function call arguments #926, new feature: function call arguments #771

3 replies

ghost Jun 5, 2023

sequence (api calls in order)

tracking of handles and other objects (to tie API calls together)

these are really neat, though we should be realistic about when to implement them (possibly after other easier features are ready). especially the latter is probably challenging to express clearly, so we'll have to have in-depth discussions about the rule syntax. still, i'm excited for this!

williballenthin Jun 5, 2023
Maintainer Author

how about additional dynamic features like:

dns resolution: "google.com"
network connection: "8.8.8.8"
file created: path: "C:\windows\temp\a.txt"
file created: name: "a.txt"
registry key created: substring: "CurrentVersion/Run"

seems useful, and most sandboxes will provide this sort of summary information. but how quickly does this converge with existing vocabularies like STIX, etc? and, how useful is the logic that capa enables, i.e., using AND/OR with the above? i'm afraid we'd just see lists of filenames OR'd together.

mr-tz Jun 5, 2023
Maintainer

sequence and object tracking are found in https://github.com/0x534a/master-thesis/blob/main/MA_Thesis_Signature-Based_Detection_Behavioural_Malware_Features_Public.pdf and have details on the implementation there.

ghost · 2023-06-05T10:54:22Z

ghost
Jun 5, 2023

thread scope is a subscope of dynamic scope

in summary: introduce scope: thread that is a subscope of scope: dynamic so that we can express logic that must be found within a single thread, not across the entire dynamic analysis trace. we can do this by considering the events for each TID separately, before merging them into the global dynamic scope.

this might enable us to better express how different parts of a program (threads) do different things, like enumerate files, serve HTTP responses, etc. This maps nicely to how a programmer structures their code. in particular, it better supports other proposals like "sequence of API calls", since we want to see these in the context of a thread, and not interspersed with API calls from other threads.

however, in practice, I wonder if this would ever actually be used. unless, scope: thread is the default recommended dynamic scope as a good practice. so, alternatively, would scope: dynamic ever be used directly? maybe to merge functionality from across threads?

example rules:

proposed:

rule:
  meta:
    name: attach user process memory
    namespace: host-interaction/process/inject
    scope: thread
  features:
    - and:
      - api: ntoskrnl.KeStackAttachProcess
      - api: ntoskrnl.KeUnstackDetachProcess

even more experimental:

rule:
  meta:
    name: attach user process memory
    namespace: host-interaction/process/inject
    scope: thread
  features:
    - and:
      - sequence:  # this doesn't exist today
        - api: ntoskrnl.KeStackAttachProcess
        - api: ntoskrnl.KeUnstackDetachProcess
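A rough sketch of what thread scope and the experimental sequence feature could mean operationally: group the trace events by TID, then look for the APIs in order within each thread. The event layout (`tid`, `api`) and the helper names are assumptions for illustration.

```python
from collections import defaultdict


def group_by_thread(events):
    """Split a flat, time-ordered event list into per-thread API call lists."""
    threads = defaultdict(list)
    for event in events:
        threads[event["tid"]].append(event["api"])
    return dict(threads)


def contains_sequence(calls, sequence):
    """True if `sequence` occurs in `calls` in order (other calls may intervene)."""
    remaining = iter(calls)
    # `api in remaining` advances the iterator, so order is enforced.
    return all(api in remaining for api in sequence)


def threads_matching_sequence(events, sequence):
    """Return the TIDs whose own calls contain `sequence` in order."""
    return [
        tid
        for tid, calls in group_by_thread(events).items()
        if contains_sequence(calls, sequence)
    ]


ATTACH_DETACH = ["ntoskrnl.KeStackAttachProcess", "ntoskrnl.KeUnstackDetachProcess"]
```

Evaluating per thread means an attach in one thread followed by a detach in another does not match, while the same two events would match if the whole trace were treated as one scope.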

1 reply

ghost Jun 6, 2023

furthermore, if we had the call stack available at each event (seems unlikely), then we could also introduce a "call frame" scope that groups features from all events with the same caller/return address.

mr-tz · 2023-06-05T11:45:59Z

mr-tz
Jun 5, 2023
Maintainer

Reuse existing rules

While trying to write the first example rules and exploring our current set I noticed the following:

ideally, we can reuse all existing rules
how to handle/interpret existing scopes/features
- function scope
  - tie together based on call locations / return addresses?
  - enhance with static analysis results, if available, i.e. function boundaries -> pro for mixing static and dynamic analysis
  - ignore because function scope limits current rule set (not all features have to show up in one function, e.g. due to wrapper functions)
- basic block scope
  - in most cases (for api and number/string features) this expresses API call and its arguments
    - -> change rule parsing?
      - use call scope in rule and interpret as basic block scope for now?
      - when dynamic analysis: parse basic block into call scope
ignore unsupported (low-level) features
- mnemonic, offset, operand

yelhamer · 2023-06-06T00:37:49Z

yelhamer
Jun 6, 2023
Collaborator

brainstorming some ideas here...

why we're using dynamic scope:

Thinking about the best possible design for this scope made me rethink whether it's the right approach; so, I am re-sharing the main motives for this design choice for both documentation purposes, as well as figuring out if the motives are indeed valid.

The main reason that motivates this design choice for me is the fact that it requires minimal changes to the current capa code design, as well as the rule syntax. Most of the other alternatives that come to mind involve making changes to the rule syntax in a way that might break people's custom rules, which is something that's undesirable.

Other reasons include: making it easier to integrate any future subscopes/parent-scopes into this logic, choosing whether to do dynamic and/or static analysis should be easier this way (filtering can be done by the pre-existing scope-filtering mechanisms).

The two main proposed ways to integrate this dynamic scope are the following:

1. dynamic scope as a subscope of the file scope:

pros

more expressive rules: mixing static and dynamic features.

cons

high complexity for code maintainers as well as rule authors/maintainers.
this hierarchy isn't really intuitive from the perspective of a rule author.

2. dynamic scope as a parent or an equal of the file scope:

I suggest either:

having the dynamic scope as a parent of the file scope: this would make it possible to use all of the existing rules as is, with the exception of rules that contain non-supported features (such as instruction scope ones); file scope features would be supported since cape offers that information (or we can use file feature extractors from within the dynamic extractor), as well as function ones, which would mean that we could evaluate rules that are expressed in terms of only those two scopes. One thing with this approach is that we would either need to introduce a new static scope, or make that new static scope something that's implicit: if dynamic is not specified, assume static.
having the dynamic scope as a "lateral" (neither a subscope nor a parent scope) scope to the file one: following this design .

For both of these approaches, we would end up using the same features (as opposed to the first approach wherein that might not be the case necessarily). Therefore, settling for this approach would allow us to get to working on the dynamic extractor right away, while we discuss the remaining rule-writing side of things...

pros

high re-usability of the available features: features such as string for example would have 2 meanings depending on the extractor.
high re-usability of the current rule set: current rules would be essentially describing the same capabilities both in a static and dynamic manner, and it's up to the user to choose if extracting those capabilities is better done statically or dynamically; so, rule authors can write new rules that make use of purely dynamic concepts such as api/file sequencing or a thread scope for example, all while not making the old rules redundant.
makes more sense to rule authors, at least compared to the aforementioned alternative of having the dynamic scope as a subscope of the file one.
less complex to implement.

cons

static and dynamic features cannot be mixed at lower scopes, however, most users will likely be opting for dynamic analysis just to get around packing/obfuscation anyways.

implementation

Naively, one straightforward way to implement this idea would be to just generate the RuleSet as capa does currently, and then pass that to the dynamic extractor, and all rules containing non-dynamic features would just end up being filtered out since those non-dynamic features were not detected. I believe this would have the benefit of supporting the variety between different extractors and the features they offer, since some might not extract api traces, while others might provide some useful instruction information in the future...

Thoughts?

3 replies

mr-tz Jun 6, 2023
Maintainer

Currently, I think (and hope) 2. dynamic scope as a parent or an equal of the file scope would work - given all the pros.

To borrow your syntax, this makes sense to me:

Static ⊂ FILE ⊂ FUNCTION ⊂ BASIC BLOCK ⊂ INSTRUCTION
Dynamic ⊂ FILE ⊂ [FUNCTION ⊂ BASIC BLOCK ⊂ INSTRUCTION]*
* with select/supported features from these for dynamic extractors

Assuming we fully use the same rules for static and dynamic, a crucial part will be how to handle:

static-only features with the dynamic extractors
dynamic-only features with the static extractors

Can we prune just the non-supported features or must rule authors specify what to do? My current hope is that this could all be done as part of the rule parsing - depending on the selected extractor.

yelhamer Jun 6, 2023
Collaborator

Initially I was thinking to essentially use the following guideline: "if a rule has a static-only feature then it's a static-only rule, if it has a dynamic-only feature then it's a dynamic-only rule, and if it has just ambiguous features then it's both a static and dynamic rule".

as for determining whether a rule is static-only, dynamic-only, or both, I was thinking we could either:

match against the entire RuleSet and -- to my belief -- the non-appropriate rules would just not be matched against. this includes cases such as:

  - and:
    - incompatible-feature: ...
    - shared-feature: ...

which would evaluate to False; while this rule would evaluate to True (if the shared-feature is present):

  - or:
    - incompatible-feature: ...
    - shared-feature: ...

add some logic for parsing the RuleSet which would make things more efficient: I feel like the previous approach might slow down things, and that we'd need to add some code that makes things more efficient -- i.e. keep a list of static-only, dynamic-only, or shared features and construct the RuleSet according to the type of run (dynamic or static)
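Both options can be sketched in a few lines. The nested dict/tuple representation of a rule statement is made up for illustration (it is not capa's internal representation), and the static-only and dynamic-only sets only contain features named as such elsewhere in this thread.

```python
# Assumption: small illustrative subsets, taken from this discussion.
STATIC_ONLY = {"mnemonic", "offset", "operand"}
DYNAMIC_ONLY = {"sequence"}


def evaluate(node, features):
    """Option 1: match as usual; a feature the extractor never emits is simply False."""
    if isinstance(node, dict):
        op, children = next(iter(node.items()))
        results = [evaluate(child, features) for child in children]
        if op == "and":
            return all(results)
        if op == "or":
            return any(results)
        raise ValueError(f"unsupported statement: {op}")
    return node in features  # leaf: a (kind, value) tuple


def feature_kinds(node):
    """Collect the feature kinds used anywhere in a rule statement."""
    if isinstance(node, dict):
        _, children = next(iter(node.items()))
        kinds = set()
        for child in children:
            kinds |= feature_kinds(child)
        return kinds
    return {node[0]}


def classify(node):
    """Option 2: label the rule up front so the RuleSet can be built per run type."""
    kinds = feature_kinds(node)
    if kinds & STATIC_ONLY:
        return "static"  # assumption: static-only wins if both kinds appear
    if kinds & DYNAMIC_ONLY:
        return "dynamic"
    return "both"
```

With features from a dynamic extractor such as `{("api", "CreateFile")}`, an `and` that includes `("mnemonic", "shr")` evaluates to False, while the equivalent `or` still matches on the shared feature.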

mr-tz Jun 6, 2023
Maintainer

Good points. I think we could add some logic to trim out rules fairly easily.

ghost · 2023-06-06T10:17:26Z

ghost
Jun 6, 2023

I wonder if we are mixing concepts and terminology here around "static vs dynamic" and "scopes". Are these definitely the same thing? Does it make sense to consider them separately? I'll propose here to separate the concepts.

First, some definitions:

"scope" is used today to declare how to group features that may match together. For example, we can use "function scope" to collect all the features for a function and match rules against that collection. This way, rules can target small parts of a program that reflect the units of the malware author's design. We have these scopes today (all in a static context) and most of these scopes "collect upwards" into their parent scope:
- instruction
- basic block
- function
- file
- global - features available at all other scopes, such as file format and OS
"static" analysis is what capa does today by inspecting the input file and reasoning about its disassembly.
"dynamic" analysis is what we want capa to do, by inspecting runtime traces collected by sandboxes

We are trying to add dynamic analysis support to capa. It's desirable to do it in a way that we can reuse the existing static analysis rules, because then we have less work to do. Also, if rules work for both static and dynamic analysis, rule authors can get a better value in writing each rule. To be clear, while this is very nice to have, we should be wary of taking shortcuts that save us a little time now and cost us a lot of time later, so we should be open to updating all the rules if we absolutely have to.

In many of the discussions so far, we've talked about adding a "dynamic scope". I think this stems from the idea that we need a place in the rule format to mark if a rule works during static analysis, dynamic analysis, or both, and the rule.meta.scope field was an easy place to extend. So we said, "if you use the term 'dynamic' in the scope field, then it will work in the dynamic analysis context" or something similar. We have talked about how "dynamic scope" would be something that nests under "file scope" or beside it and approximates the collection of features available in a runtime trace.

I think this might be a mistake and we should separate scopes from analysis ... context/flavor/mechanism ("static" or "dynamic"). Let's call this "analysis flavor" for now, and it includes "static analysis flavor" and "dynamic analysis flavor".

Scopes should describe how features are collected together and matched. Not all features are available at all scopes. For example, section names are not available at instruction scope, naturally.

Analysis flavors should also describe when a rule can be applied: during static analysis, dynamic analysis, or both. Some features will only work in one flavor; for example, instruction mnemonics will not work in dynamic analysis flavor, because a full instruction trace is not expected to be available. Likewise, an ordered sequence of API calls is not available in the static analysis flavor (today, though maybe it's a good idea for future research). Some features will work in both flavors; for example, an API call can be extracted at both the disassembly and sandbox trace levels.

But scopes and analysis flavors should not be the same thing. A rule must be evaluated with a scope, and I think that scope can depend on which analysis flavor is in play, but the scope is not the same thing as the analysis flavor.

In the most explicit world, we might have two rules for creating a file:

flavor: static

rule:
  meta:
    name: create file (static)
    flavor: static
    scope: function
  features:
    - or:
      - api: CreateFile
      - api: fopen

flavor: dynamic

rule:
  meta:
    name: create file (dynamic)
    flavor: dynamic
    scope: thread
  features:
    - or:
      - api: CreateFile
      - api: fopen

But obviously there's a lot of repetition, so perhaps we could do something like:

rule:
  meta:
    name: create file
    scope:
      - static: function
      - dynamic: thread
  features:
    - or:
      - api: CreateFile
      - api: fopen

Which is pretty nice. We might also support things like scope.static: none or unsupported to indicate the rule cannot be used in that flavor.

Now, because we want to support a bunch of existing rules, we could provide some built-in shortcuts and logic, such as:

if scope: function then assume scope.dynamic: thread unless an invalid feature is present, in which case prune, if possible.

Then, the following would be equivalent:

implicit dynamic flavor:

rule:
  meta:
    name: create file
    scope: function
  features:
    - or:
      - api: CreateFile
      - api: fopen

explicit dynamic flavor:

rule:
  meta:
    name: create file
    scope:
      - static: function
      - dynamic: thread
  features:
    - or:
      - api: CreateFile
      - api: fopen

We could either build this logic into capa, or do a one-time automated update to the rules using the translations. In the former, rule authors don't have to learn anything new. In the latter, rules are more explicit and dynamic analysis flavor is more of a first-class citizen.
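Either way, the translation itself is small. A sketch, assuming rule metadata is available as a plain dict; `explicit_scopes` and the mapping table are hypothetical names, and only the function-to-thread shortcut is taken from the proposal above. Pruning of invalid features is not modeled here.

```python
# Assumption: the only shortcut proposed so far is function -> thread.
IMPLICIT_DYNAMIC_SCOPE = {"function": "thread"}


def explicit_scopes(meta):
    """Expand a rule's `scope` field into the explicit {"static": ..., "dynamic": ...} form."""
    scope = meta["scope"]
    if isinstance(scope, str):
        # implicit form: an existing rule that only names its static scope.
        # None means no dynamic translation has been defined for that scope.
        return {"static": scope, "dynamic": IMPLICIT_DYNAMIC_SCOPE.get(scope)}

    # explicit form: a list of single-key mappings, as in the example above
    merged = {}
    for entry in scope:
        merged.update(entry)
    return merged
```

Running this once over the rule files corresponds to the one-time automated update; calling it while parsing rules corresponds to building the logic into capa.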

With all this in mind, I'd propose that there are two flavors: static and dynamic. And when analyzing with a flavor, there's a scope in play, which include:

static flavor: FILE ⊂ FUNCTION ⊂ BASIC BLOCK ⊂ INSTRUCTION
dynamic flavor: TRACE ⊂ THREAD ⊂ CALL FRAME (to be researched and determined)

We'll need to create a multi-dimensional table that describes which features are available in each flavor and scope, perhaps like:

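| static flavor | file scope | function scope | basic block scope | instruction scope |
|---|---|---|---|---|
| mnemonic | | ✓ | ✓ | ✓ |
| characteristic(loop) | ✓ | ✓ | | |
| api | ✓ | ✓ | ✓ | ✓ |
| string | ✓ | ✓ | ✓ | ✓ |

| dynamic flavor | trace scope | thread scope |
|---|---|---|
| mnemonic | | |
| characteristic(loop) | | |
| api | ✓ | ✓ |
| string | ✓ | ✓ |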
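In code, such a table could be a nested mapping that the rule parser consults. The sketch below copies the rows of the example tables as posted; the `AVAILABLE` name and the helper functions are hypothetical.

```python
ALL_STATIC = {"file", "function", "basic block", "instruction"}

# flavor -> feature -> scopes at which the feature is available
AVAILABLE = {
    "static": {
        "mnemonic": {"function", "basic block", "instruction"},
        "characteristic(loop)": {"file", "function"},
        "api": set(ALL_STATIC),
        "string": set(ALL_STATIC),
    },
    "dynamic": {
        "mnemonic": set(),
        "characteristic(loop)": set(),
        "api": {"trace", "thread"},
        "string": {"trace", "thread"},
    },
}


def is_available(flavor, scope, feature):
    """True if `feature` can be extracted at `scope` in the given flavor."""
    return scope in AVAILABLE.get(flavor, {}).get(feature, set())


def flavors_supporting(feature):
    """Return the flavors in which `feature` is available at some scope."""
    return sorted(
        flavor for flavor, table in AVAILABLE.items() if table.get(feature)
    )
```

A rule whose features all appear in `flavors_supporting(...)` for both flavors could be applied in either context; one using `mnemonic` would be limited to static analysis.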

This proposal doesn't initially address if rules can match across flavors, but my intuition is that this isn't supported without additional research and design.

1 reply

yelhamer Jun 6, 2023
Collaborator

This sounds really good to me!
I was initially wary of modifying the rule syntax (the meta part) because I felt like that would force people to rewrite their rules, however, the way you propose to handle this sounds pretty good to me.

xusheng6 · 2023-06-08T05:12:22Z

xusheng6
Jun 8, 2023

I am really excited to hear that capa is adding the capacity to work with dynamic traces! I think dealing with API traces generated from a sandbox is a great start since it will provide more info than static analysis.

I think it would also be possible to run the feature extractor at instruction level on an execution trace recorded by tools like Windbg TTD/Reven/RR/Undo, etc. The immediate benefit is to see some behavior that we cannot see if the relevant code is encrypted. For example, if the sample uses a crypto function, but the code is encrypted, capa would not be able to see it directly. Besides, we can also see the concrete register/memory value at any time, which makes it possible to not only detect certain operations, but also obtain the data it actually operated on.

There would also be some challenges. For example, the trace can be quite long even for a few seconds of execution. Also, the boundary of function is not as clear when it comes to an execution trace. However, from my own experience of working with execution traces, despite the whole trace being very long, the unique instructions/basic blocks are pretty manageable.

1 reply

williballenthin Jun 12, 2023
Maintainer Author

i love this idea. perhaps we can collaborate on a dynamic extractor that relies on BN and its debugger API to extract the trace and reason about it?

yelhamer · 2023-06-08T12:56:01Z

yelhamer
Jun 8, 2023
Collaborator

I am sharing some more thoughts I had when drafting a dynamic extractor.

First of all, I thought I'd re-summarize what our design-goals are:

be able to reuse old rules.
make it possible to create ambiguous rules as well as non-ambiguous rules, which would make it possible to describe capabilities both statically and dynamically, and use a rule in both contexts if possible.
try our best to make the integration of dynamic analysis as flexible as possible, since support for static features/scopes (such as the instruction scope, as @xusheng6 suggested) might be added later; furthermore, sandboxes vary in terms of the features they support, for example, CAPE and cuckoo offer detailed api traces that any.run doesn't, while any.run gives us the ability to easily determine which process/thread tampered with a file/registry or made a certain connection.

Having agreed that these are the main goals, here's what I think would work best for feature extraction:

1. Scopes:

First of all, I believe that the scoping should be the following:

Static analysis: File ⊂ Function ⊂ Basic Block ⊂ Instruction
Dynamic analysis: Process ⊂ File ⊂ Thread ⊂ Function

This hierarchy introduces 2 new scopes: Process and Thread. This choice was mainly inspired by the analysis traces I saw thus far:

Process scope: pretty much all sandboxes offer process-specific features such as: "this process injected into this other process", "this process dropped this file", "this process dynamically loaded this DLL", etc. This is in addition to the standard File-scope features which each process possesses such as: imports/exports table, section information, etc. Example from the CAPE trace:
Thread scope: this scope would include all of the api calls made within that thread, which should make it easier to implement stuff like api call sequencing and so on.

2. RuleSet construction:

According to the hierarchy described above, I believe that it should be possible to use the same RuleSet for both static and dynamic extractors (comment?), with out-of-context rules being ignored by the extractor (static extractors would ignore dynamic-only rules). Additionally, however, we would probably implement some mechanism to optimize this by filtering out out-of-context rules (by maintaining a list of supported features for each extractor for example, i.e. any.run doesn't support api traces).

Furthermore, following the design that has been proposed thus far should make it easier to add new features to extractors as they are introduced. It should also allow for sandbox-agnostic rules, so that we can support newer sandbox functionality as it appears (support any.run when it introduces api traces, or CAPE when it adds origin pid/tid data to each file).

3. Implementation:

The way I am currently doing things is similar to static extractors: extract all processes from a trace; for each process, extract all file features and threads; for each thread, extract all thread features (haven't determined these yet) and function features. Function features include all sub-function features (such as api calls) as well as calls-to and calls-from features.
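The nesting described above can be sketched as a generator over a toy trace. The dictionary layout (`processes`, `file_features`, `threads`, `calls`) is an assumption made for this example and is not the CAPE report format; thread features are left out since they are still undetermined.

```python
def extract_features(trace):
    """Yield (scope, address, feature) tuples by walking process -> thread -> call."""
    for process in trace["processes"]:
        pid = process["pid"]
        # file features belong to the file backing this process
        for feature in process.get("file_features", []):
            yield ("file", (pid,), feature)
        for thread in process.get("threads", []):
            tid = thread["tid"]
            # api calls are emitted as function-level features of the thread
            for api in thread.get("calls", []):
                yield ("function", (pid, tid), f"api({api})")


# hypothetical trace, for illustration only
toy_trace = {
    "processes": [
        {
            "pid": 100,
            "file_features": ["import(kernel32.CreateFileW)"],
            "threads": [{"tid": 1, "calls": ["CreateFileW", "WriteFile"]}],
        }
    ]
}

features = list(extract_features(toy_trace))
```

Carrying the `(pid, tid)` pair as the address keeps each feature traceable back to the process and thread it came from.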

4 replies

williballenthin Jun 12, 2023
Maintainer Author

Dynamic analysis: Process ⊂ File ⊂ Thread ⊂ Function

i think there should also be a scope available at the "whole trace" level, that is, a union of all the process events. this way we can do things like "if there are three instances of injection across five processes then it's XYZ". i'm not sure what the name should be, perhaps "detonation", "recording", or something?
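a rough sketch of what that union could look like, assuming per-process features have already been extracted. the `trace_features` helper and the feature strings are made up for this example and don't correspond to anything in capa today.

```python
from collections import Counter


def trace_features(by_process):
    """Merge per-process feature lists into one trace-wide view."""
    counts = Counter()  # feature -> total occurrences across the trace
    processes = {}      # feature -> set of pids in which it occurred
    for pid, features in by_process.items():
        for feature in features:
            counts[feature] += 1
            processes.setdefault(feature, set()).add(pid)
    return counts, processes


# hypothetical per-process results: five processes, three injections in total
by_process = {
    1: ["injection"],
    2: [],
    3: ["injection", "injection"],
    4: ["drop file"],
    5: [],
}

counts, processes = trace_features(by_process)

# "three instances of injection across five processes"
matched = counts["injection"] >= 3 and len(by_process) == 5
```

keeping the set of pids next to the counts lets a rule ask either "how many times" or "in how many processes".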

williballenthin Jun 12, 2023
Maintainer Author

I believe that it should be possible to use the same RuleSet for both static and dynamic extractors

i think so. the RuleSet indexes rules by scope and collects rules with their dependencies. since the two analysis flavors, static and dynamic, don't overlap their scope names, i think these indices will remain valid, and it's up to the rule engine to apply the logic to the rules as appropriate.

that being said, there are going to be a bunch of changes needed in capa-core to introduce the second analysis flavor. i would strongly recommend starting with just trace feature extraction and getting tests working for that. then we can incrementally layer on the additional engine updates until the entire system flows together. otherwise, you'll be juggling a PR with many thousands of lines of changes and risking merge conflicts.

williballenthin Jun 12, 2023
Maintainer Author

sandboxes vary in terms of the features they support, for example, CAPE and cuckoo offer detailed api traces that any.run doesn't, while any.run gives us the ability to easily determine which process/thread tampered with a file/registry or made a certain connection.

this is true, and we should keep this in mind. though, i'd love to give these sandbox maintainers a motivation to include the high-resolution information that we need, so we should start with the sandboxes that have the best features we need now.

williballenthin Jun 12, 2023
Maintainer Author

pretty much all sandboxes offer process-specific features such as: ..., etc. This is in addition to the standard File-scope features which each possesses such as: imports/exports table, section information, etc.

interesting. we should discuss how to best represent this information, especially in a way that ports across all the different sandboxes.

in the meantime, using API calls and their parameters as the lowest common denominator will be most consistent with how capa is designed today. i also suspect that we can derive the sandbox-specific summaries from the API traces, and maybe even do a better job in the long run. stretch goal should be that all sandboxes use capa to extract their summary metadata ;-)
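as a small illustration of deriving a summary from an API trace: the sketch below assumes each call is recorded as an `(api name, arguments)` pair with arguments keyed by parameter name. that record shape and the `summarize` helper are assumptions for this example, not the format of any particular sandbox.

```python
def summarize(calls):
    """Derive a coarse behavior summary from (api, args) call records."""
    summary = {"loaded_dlls": [], "deleted_files": []}
    for api, args in calls:
        if api in ("LoadLibraryA", "LoadLibraryW"):
            # "this process dynamically loaded this DLL"
            summary["loaded_dlls"].append(args["lpLibFileName"])
        elif api in ("DeleteFileA", "DeleteFileW"):
            summary["deleted_files"].append(args["lpFileName"])
    return summary


# hypothetical call records, for illustration only
calls = [
    ("LoadLibraryW", {"lpLibFileName": "wininet.dll"}),
    ("GetTickCount", {}),
    ("DeleteFileW", {"lpFileName": "C:\\temp\\a.tmp"}),
]

summary = summarize(calls)
```

calls that don't map to a summary entry are simply skipped, so the same function works on traces of any level of detail.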
