Skip to content

Latest commit

 

History

History
221 lines (170 loc) · 9.51 KB

NEW_FEATURES.rst

File metadata and controls

221 lines (170 loc) · 9.51 KB

Introduction

The original versions of this code up until the time I started were pretty awesome. You can get a sense of this by running it. For the most part it was remarkably fast, and a single module with few dependencies.

Here I will largely give what are the major improvements over old code.

This also serves to outline a little bit about what is in this code.

See also How does this code work?.

Old Cool Features

Before getting to the new stuff, I'll describe cool things that was there before.

I particularly liked the ability to show the assembly, grammar reduction rules as they occurred, and the resulting parse tree. It is neat that you could follow the process and steps that deparser takes, and in this not only see the result how the bytecode corresponds to the resulting source. Compare this with other Python decompilers.

And of course also neat was that this used a grammar and table-driven approach to decompile.

Expanding decompilation to multiple Python Versions

Aside from pycdc, most of the Python decompilers handle a small number of Python versions, if they supported more than one. And even when more than one version is supported if you have to be running the Python version that the bytecode was compiled for.

There main reason that you have to be running the Python bytecode interpreter as the one you want to decompile largely stems from the fact that Python's dis module is often what is used and that has this limitation.

pycdc doesn't suffer this problem because it is written in C++, not Python. Hartmut Goebel's code had provisions for multiple Python versions running from an interpreter different from the one that was running the decompiler. That however used compiled code in the process was tied a bit to the Python C headers for a particular version.

You need to not only to account for different "marshal" and "unmarshal" routines for the different Python versions, but also, as the Python versions extend, you need a different code type as well.

Enter xdis

To handle all of these problems, I split off the marshal loading portion and disassembly routines into a separate module, xdis. This also allows older Pythons to have access to features found in newer Pythons, such as parsing the bytecode, a uniform stream of bytes, into a list of structured bytecode instructions.

Python 2.7's dis module doesn't has provide a instruction abstraction. Therefore in uncompyle2 and other earlier decompilers you see code with magic numbers like 4 in:

if end > jump_back+4 and code[end] in (JF, JA):
    if code[jump_back+4] in (JA, JF):
        if self.get_target(jump_back+4) == self.get_target(end):
            self.fixed_jumps[pos] = jump_back+4
            end = jump_back+4
elif target < pos:
    self.fixed_jumps[pos] = jump_back+4
    end = jump_back+4

and in other code -1 and 3 in:

if self.get_target(jmp) != start_else:
    end_else = self.get_target(jmp)
if self.code[jmp] == JF:
    self.fixed_jumps[jmp] = -1
self.structs.append({'type':  'except',
               'start': i,
               'end':   jmp})
i = jmp + 3

All of that offset arithmetic is trying to find the next instruction offset or the previous offset. Using a list of instructions you simply take the offset field of the previous or next instruction.

The above code appears in the uncompyle2 "Scanner" class in service of trying to figure out control flow. Note also that there isn't a single comment in there about what specifically it is trying to do, the logic or that would lead one to be confident that this is correct, let alone assumptions that are needed for this to be true.

While this might largely work for Python 2.7, and uncompyle2 does get control flow wrong sometimes, it is impossible to adapt code for other versions of Python.

In addition adding an instruction structure, xdis adds various flags and features that assist in working with instructions. In the example above this replaces code like ... in (JF, JA) which is some sort of unconditional jump instruction.

Although not needed in the decompiler, xdis also has nicer instruction print format. It can show you the bytes as well as the interpreted instructions. It will interpret flag bits and packed structures in operands so you don't have to. It can even do a limited form of inspection at previous instructions to give a more complete description of an operand. For example on LOAD_ATTR which loads the attribute of a variable, often the variable name can be found as the previous instruction. When that is the case the disassembler can include that in the disassembly display for the LOAD_ATTR operand.

Python Grammar Isolation

If you want to support multiple versions of Python in a manageable way you really need to provide different grammars for the different versions, in a grammar-based system. None of the published versions of this decompiler did this.

If you look at the changes in this code, right now there are no grammar changes needed between 1.0 to 1.3. (Some of this may be wrong though since we haven't extensively tested these earliest Python versions

For Python 1.4 which is based off of the grammar for 1.5 though there are number of changes, about 6 grammar rules. Later versions of though we start to see larger upheaval and at certain places, especially those where new opcodes are introduced, especially those that change the way calls or exceptions get handled, we have major upheaval in the grammar. It is not just that some rules get added, but we also need to remove some grammar rules as well.

I have been largely managing this as incremental differences between versions. However in the future I am leaning more towards totally separate grammars. A well constructed grammar doesn't need to be that large.

When starting out a new version, we can just copy the grammar from the prior version. Within a Python version though, I am breaking these into composable pieces. In particular the grammar for handling what can appear as the body of a lambda, is a subset of the full Python language. The language allowed in an eval is also a subset of the full Python language, as are what can appear in the various compilation modes like "single" versus "exec".

Another nice natural self-contain grammar section is what can appear in list comprehensions and generators. The bodies of these are generally represented in a self-contained code block.

Often in decompilation you may be interested not just in decompiling the entire code but you may be interested in only focusing on a specific part of the code. And if there is a problem in decompiling the entire piece of code, having these smaller breaking points can be of assistance.

Other Modularity

Above we have mentioned the need for separate grammars or to isolate these per versions. But there are other major pieces that make up this decompiler. In particular there is a scanner and the source code generation part.

Even though differences in version that occur in disassembly are handled by xdis, we still have to do conversion of that to a token stream for parsing. So the scanners are again broken out per version with various OO mechanisms for reusing code. The same is true for source code generation.

Expanding decompiler availability to multiple Python Versions

Above we mention decompiling multiple versions of bytecode from a single Python interpreter. We talk about having the decompiler runnable from multiple versions of Python, independent of the set of bytecode that the decompiler supports.

There are slight advantages in having a decompiler that runs the same version as the code you are decompiling. The most obvious one is that it makes it easy to test to see whether the decompilation correct because you can run the decompiled code. Python comes with a suite of Python programs that check themselves and that aspects of Python are implemented correctly. These also make excellent programs to check whether a program has decompiled correctly.

Aside from this, debugging can be easier as well. To assist understanding bytecode and single stepping it see x-python and the debugger for it trepan-xpy.

Handling Language Drift

Given the desirability of having this code running on logs of Python versions, how can we get this done?

The solution used here is to have several git branches of the code. Right now there are 3 branches. Each branch handles works across 3 or so different releases of Python. In particular one branch handles Python 2.4 to 2.7 Another handles Python 3.3 to 3.5, and the master branch handles 3.6 to 3.10. (Again note that the 3.9 and 3.10 decompilers do not decompile Python 3.9 or 3.10, but they do handle bytecode for all earlier versions.)

Cool features of the Parser

  • reduction rule checking
  • numbering tokens
  • showing a stack of completions

Cool features Semantic Analysis