binary ninja: optimize feature extraction #2402
0953cc3b77ed2974b09e3a00708f88de931d681e2d0cb64afbaf714610beabe6 (100KB or so) takes a huge amount of time to load into Binary Ninja. Maybe there's an infinite loop somewhere.
To run capa against 321338196a46b600ea330fc5d98d0699, it takes 2:48. But :36 is spent just in [profiler screenshot omitted]. We can also see that [profiler screenshot omitted].
edit: maybe we can cache the results of fetching the llil/mlil to save some time. It's still surprising that it takes 3x longer to fetch the llil than to do the complete analysis. Maybe it's Python serialization overhead?
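The caching idea above could look something like the sketch below. This is a hypothetical illustration, not capa's actual code: `fetch_llil` stands in for whatever expensive call crosses the Python/native boundary (in the real Binary Ninja API that would be an access like `func.llil`), and the cache is keyed by function start address.

```python
class ILCache:
    """Memoize an expensive IL fetch, keyed by function start address."""

    def __init__(self, fetch):
        self._fetch = fetch   # the expensive lookup to wrap
        self._cache = {}      # addr -> IL object
        self.misses = 0       # how many times we actually hit the API

    def get(self, addr):
        # only cross the (presumed slow) boundary on the first request
        if addr not in self._cache:
            self.misses += 1
            self._cache[addr] = self._fetch(addr)
        return self._cache[addr]


# usage with a stand-in fetch function (placeholder for the real binja call):
def slow_fetch(addr):
    return f"llil@{hex(addr)}"

cache = ILCache(slow_fetch)
cache.get(0x401000)
cache.get(0x401000)   # second call is served from the cache
print(cache.misses)   # -> 1
```

This only helps if the same function's IL is fetched more than once per run; if each function is visited exactly once, the cost must be attacked on the API side instead.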
I opened the file in the binja GUI and the analysis only took 4.3 seconds: [screenshot omitted]
My machine is probably faster than the CI box used by GitHub, but it's still quite surprising to see such a huge difference.
@xusheng6 on my test rig it took maybe 13s to load the binary, then much longer (minutes) to extract the features. So accessing the LLIL/MLIL is taking integer multiples of the total load time 😕 Maybe the 3s vs 13s gap comes from only having about two cores available in the test environment.
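To separate load time from feature-extraction time as described above, a small timing harness is enough. This is a generic sketch: the phases and lambdas are placeholders, and in a real run they would wrap the actual Binary Ninja load and the per-function IL accesses.

```python
import time

def timed(label, fn):
    """Measure wall-clock time of one phase and report it."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# hypothetical phases; real code would time binaryninja's load/analysis
# and then a loop over each function's llil/mlil separately
bv = timed("load + analysis", lambda: "bv-placeholder")
features = timed("IL access", lambda: sum(range(10**6)))
```

Timing each phase independently makes it obvious whether the minutes are spent in analysis itself or in the Python-side IL traversal.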
Thanks for letting me know about it. It seems either I wrote the backend in a bad way, or the Python wrapping adds significant overhead to it.
The profiler didn't expose any invocation counts, so I'm not yet sure whether we're calling the API far too many times or the API itself is slow. Given that it affects both LLIL and MLIL, I suspect the latter. But in the few minutes I looked at the bindings, it didn't seem like all that much was happening on the Python side.
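The missing invocation counts can be recovered with the standard library's `cProfile`/`pstats`, which record `ncalls` per function and so distinguish "called too many times" from "each call is slow". A minimal, self-contained sketch (the workload is a stand-in, not capa's extractor):

```python
import cProfile
import io
import pstats

def api_call(x):
    # stand-in for a Binary Ninja API hit that crosses into native code
    return x * x

def extract_features(n):
    # stand-in for the feature-extraction loop
    return [api_call(i) for i in range(n)]

prof = cProfile.Profile()
prof.enable()
extract_features(1000)
prof.disable()

stats = pstats.Stats(prof, stream=io.StringIO())
# stats.stats maps (file, line, name) -> (cc, ncalls, tottime, cumtime, callers)
ncalls = {key[2]: value[1] for key, value in stats.stats.items()}
print(ncalls["api_call"])  # -> 1000
```

If `ncalls` for the IL accessors is proportional to the number of instructions rather than functions, the fix is fewer boundary crossings; if the counts are modest, the per-call cost itself is the problem.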
During some initial profiling, I'm finding that the Binary Ninja backend is substantially slower than vivisect or IDA. This thread will enumerate all the things we discover. It might include: bugs in Binary Ninja, things we're doing wrong, workarounds, etc.
Given how good Binary Ninja's code analysis is, we'd really like to be able to use it widely. So, let's prepare the code for this.