Sandeep Dasgupta edited this page Oct 12, 2017 · 2 revisions

Preliminary Work

We have been using McSema, a publicly available tool that converts x86/x86-64 binaries to functional LLVM IR. One limitation of the IR recovered by McSema is that it lacks key high-level information such as variables and types. Moreover, McSema models the runtime process stack as a single large flat array that is shared by all procedures. This inhibits many non-trivial optimizations on stack variables and accesses, because accesses made by different procedures may alias.

One direction of work in this reporting period was improving the LLVM IR extracted from binaries by implementing two IR transformation passes: (1) distinguishing individual stack frames; and (2) identifying local variables in each stack frame. Stack frame reconstruction (1) is important in its own right and is also a prerequisite for (2): symbol promotion performed directly on the global array would be very conservative, since an indirect write made by a different procedure may prevent promotion in the current procedure. We presented this work at the 2016 LLVM Developers' Meeting. We also tested our implementation on the LLVM test suite, and the majority of test cases pass with the expected output. Most of the failing test cases are due to unsupported x86-64 vector instructions. We have added support for some of these vector instructions in McSema and successfully upstreamed it to the McSema code repository.
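
The dependency between the two passes can be illustrated with a toy model. The sketch below is not McSema's implementation: the `Access` record and the `promotable_slots` function are hypothetical names introduced for illustration. It assumes that every stack access has been summarized as either a known frame offset or an unknown (indirect) address, and that no slot address escapes the procedure that owns it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Access:
    func: str               # procedure performing the access
    offset: Optional[int]   # known stack offset, or None for an indirect access
    is_write: bool


def promotable_slots(accesses, func, per_frame):
    """Return the offsets in `func` that a conservative analysis may promote.

    per_frame=False models one flat stack array shared by all procedures:
    an indirect write anywhere may alias any slot.
    per_frame=True models separate stack frames: only indirect writes made
    inside `func` itself can alias its slots (assumption: no escaping addresses).
    """
    slots = {a.offset for a in accesses
             if a.func == func and a.offset is not None}
    for a in accesses:
        if a.offset is None and a.is_write:
            if not per_frame or a.func == func:
                return set()   # a may-alias write blocks every slot
    return slots


accesses = [
    Access("main", -8, True),
    Access("main", -16, False),
    Access("helper", None, True),   # indirect write in a different procedure
]

print(sorted(promotable_slots(accesses, "main", per_frame=False)))  # []
print(sorted(promotable_slots(accesses, "main", per_frame=True)))   # [-16, -8]
```

With the flat array, the single indirect write in `helper` blocks promotion of every slot in `main`; once frames are separated, the same write no longer affects `main`.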

In the current reporting year, we have also put considerable effort into an empirical study of recovered high-level attributes (such as variable offsets, types, and function prototype information) based on their usefulness for specific clients (such as pointer analysis, call graph construction, or symbolic execution, e.g. KLEE). We expect that this study will show which kinds of attributes are important to extract from binaries in order to support a desired analysis or optimization capability. To our knowledge, such a study has never been conducted before, even though there has been a plethora of binary analysis systems that aim to extract various kinds of attributes.

To enable a systematic study that is independent of the capabilities of any particular binary analysis system or algorithm, we use debug information to extract precise information about the attributes of interest, such as variables and their types. We can then vary the precision or the subset of this information that is used, in order to study the impact of both less precise and more precise information on client analyses. This direction of work is still ongoing, and we have accomplished the following this year:

  1. Implemented a stack analysis on IDA's internal data structures to expose the stack variables.

  2. Implemented a DWARF reader which reads the type information of stack variables from debug information (for the sake of the empirical evaluation).

  3. Added a patch to the McSema code that consumes the above information and generates IR with all the stack variables promoted with proper type information.
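
The three steps above can be sketched as a small pipeline that joins stack-variable offsets with type information and emits typed `alloca` instructions as LLVM IR text. This is only an illustration under stated assumptions: the input dictionaries, the `LLVM_TYPES` table, and `emit_allocas` are hypothetical and do not reflect the actual IDA, DWARF reader, or McSema interfaces. Passing less type information shows how precision can be varied for the empirical study.

```python
# Hypothetical inputs: offsets as a stack analysis might report them, and
# C type names as a DWARF reader might report them, both keyed by frame offset.
stack_vars = {-8: "count", -16: "total", -24: "buf"}
dwarf_types = {-8: "int", -16: "double"}   # no debug record for `buf`

# Assumed mapping from C type names to LLVM types and alignments on x86-64.
LLVM_TYPES = {"int": ("i32", 4), "double": ("double", 8), "char": ("i8", 1)}


def emit_allocas(stack_vars, dwarf_types, fallback_size=8):
    """Emit one alloca per stack variable, typed when type info is available."""
    lines = []
    for offset in sorted(stack_vars, reverse=True):
        name = stack_vars[offset]
        c_type = dwarf_types.get(offset)
        if c_type in LLVM_TYPES:
            ty, align = LLVM_TYPES[c_type]
        else:
            # No usable type information: fall back to an untyped byte array.
            ty, align = f"[{fallback_size} x i8]", 1
        lines.append(f"%{name} = alloca {ty}, align {align}")
    return lines


# Precise variant: use every type the debug information provides.
for line in emit_allocas(stack_vars, dwarf_types):
    print(line)

# Less precise variant: pretend no type information was recovered.
for line in emit_allocas(stack_vars, {}):
    print(line)
```

A client analysis can then be run on both variants to measure how much the recovered types matter for that client.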

In addition to these steps, we have also improved the ability of McSema to recover programs. As a temporary crutch, we added support for translating unknown instructions into inline assembly. We have also started work on leveraging the x86-64 semantics generated automatically by the Strata project, in order to support a larger class of instructions without needing to specify their semantics manually.
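
The inline-assembly fallback can be pictured as a dispatch step in the translator. The sketch below is a simplified illustration, not McSema's actual mechanism: the `SEMANTICS` table and the `translate` function are hypothetical, the assembly text is assumed to contain no quotes or backslashes, and the operand constraints and clobber lists that a real translation would need are omitted.

```python
# Hypothetical semantics table: mnemonic -> function returning LLVM IR lines.
SEMANTICS = {
    "nop": lambda operands: [],   # a no-op needs no IR
}


def translate(mnemonic, operands, asm_text):
    """Translate one instruction, falling back to inline assembly if unknown."""
    handler = SEMANTICS.get(mnemonic)
    if handler is not None:
        return handler(operands)
    # Unknown instruction: wrap the original assembly in an LLVM inline asm
    # call. In an LLVM asm template, a literal '$' must be written as '$$'.
    escaped = asm_text.replace("$", "$$")
    return [f'call void asm sideeffect "{escaped}", ""()']


print(translate("nop", [], "nop"))
print(translate("vpaddd", ["%ymm1", "%ymm2", "%ymm0"],
                "vpaddd %ymm1, %ymm2, %ymm0"))
```

Instructions with an entry in the table are lowered to ordinary IR, while everything else survives as opaque inline assembly until proper semantics become available.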