Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow processing on large Python 3.4 bytecode with lots of nesting #318

Open
platomav opened this issue May 31, 2020 · 4 comments
Open

Slow processing on large Python 3.4 bytecode with lots of nesting #318

platomav opened this issue May 31, 2020 · 4 comments
Labels
Control Flow Problem has to do with bad control-flow detection Python 3.x Volunteer wanted Volunteer wanted to fix if a bug or to implement if a new feature.

Comments

@platomav
Copy link

Hello,

I'm trying to recover some old source code for archiving purposes from bytecode which was created by Python 3.4.3 (3310) years ago.

I'm using Windows 10 1809 x64, Python 3.4.3 and latest uncompyle6 40a653c.

The issue looks like some sort of infinite loop because uncompyle6 processing never ends.

Capture

I've narrowed down the issue to commit f008b8f (Mar 20, 2018) or newer versions. Meaning, the last "working" commit was 2e81ee5 (Mar 20, 2018). I say "working" because the output code has obvious issues with if-else (maybe more that I missed). Probably why you added the f008b8f changes but it causes the aforementioned problem in my case.

To perfectly reproduce the bug, I have included both the .py and .pyc of another old version for which I have the source code for, compiled via Python 3.4.3.

MEA_v1.1.0_PV.zip

@rocky
Copy link
Owner

rocky commented Jun 1, 2020

I looked at the bytecode and the source code.

When I ran the decompiler on it, I saw that it was indeed making progress. However, for the way this decompiler works at present and the particular structure of this program, it may be a very long time before the program completes.

The source code you sent has an elif block that spans 12K lines with numerous nested if blocks. Until we have better control-flow analysis in place (see below), the decompiler has to consider the ends of each if as the end of the elif that starts at line 71 both in the sample source code and bytecode. As we build up more if's and nested if's, and just simple statements this process takes longer and is nonlinear in the number of reduction-rule possibilities.

From my side, although this is a bug, it is not very interesting. And there is no shortage of bugs in this project, just a shortage of people fixing them. This bug has probably appeared already (maybe several times) in numerous guises.

So what can be done?

Short of what you've already done, I don't know that there is an easy-to-find and simple fix here that is going to reliably decompile your program while not breaking other programs.

If it were really necessary to have an accuate decompilation, it could be done. But as I mentioned, it would be a bit tedious. This kind of chore is typically something you do yourself or pay someone else to do. The other tools I've written I believe can help out here if you decide to undertake doing more.

We can leave this bug report open — there may be other people who take interest in this, or would be guided by the information here.

But to set expectations, it might be several years before this is addressed, if ever. To address the control-flow problems, I started in the decompyle3 project. But even there I realize that I may have to make another start and major refactor to get control-flow detection done efficiently which is needed to address this kind of bug.

To date, I've not had the resources (help or funding) to do any of this. I find that I increasingly am losing ground in each Python release, and there are a lot of other things which are more challenging, fun and interesting that I'd rather be doing. If the community can support here, though, I would be happy to help.

@rocky rocky added Control Flow Problem has to do with bad control-flow detection Python 3.x labels Jun 1, 2020
@platomav
Copy link
Author

platomav commented Jun 1, 2020

First of all, thank you very much for your lengthy reply.

The source code you sent has an elif block that spans 12K lines with numerous nested if blocks.

Small typo there, it's 1K but I agree completely. Back then my code was trash because I was still learning Python so no wonder that unoptimized mess is causing less common issues to appear in this impressive project.

From my side, although this is a bug, it is not very interesting.

No problem there, we all need to pick our software development battles.

But as I mentioned, it would be a bit tedious.

Yes I understand and no, it's not really worth it to go through any trouble or 3rd party payments to get the original code. I tried this mostly for sentimental reasons because I was stupid enough back then to overwrite the code whenever a new build was released, effectively losing early source code. But it's not important for current development or anything so no problem.

I find that I increasingly am losing ground in each Python release, and there are a lot of other things which are more challenging, fun and interesting that I'd rather be doing.

I know the exact feeling. Your project plays catch up with Python, mine with Intel. Both one man undertakings for the most part.

When I ran the decompiler on it, I saw that it was indeed making progress.

Quick final question: I know this is CPU thread dependent but out of curiosity, from the little progress you saw, are we talking about hours or days? (edit: had to end it after 6 hours and 9GB of RAM so...yeah...don't try it) 😅

@rocky
Copy link
Owner

rocky commented Jun 3, 2020

Small typo there, it's 1K

Yes, you are correct. A decimal point didn't get in there 1.2K (but the source lines are about 1.3K), but if we are talking about instructions, the largest offset seems to be 21944 (a little under 22K) which means that there are over 7K of instructions assuming the larger instruction size of 3 byte per instruction.

You can see the progress, by looking at the instruction number using -g.

For example:

...
L. 21: 142     testexpr ::= testtrue (61)
^^^^^ line number                    ^^^^ instruction number
L. 21: 142-164 ifstmt ::= testexpr \e__ifstmts_jump (61)
Reduce ifstmt invalid by check
L. 22: 167     expr ::= LOAD_NAME (62)

Another thing you can do is extract prortions that you know from large compond blocks (no incomplete "if" or "if/else" blocks) and put that into an assembly, create custom bytecode and decompile the smaller chunks.

This is a lot of work though and as I said probably a bit tedious.

And finally as you mentioned, change the grammar or go back to whatever version and use that and then hand verify the results.

@platomav
Copy link
Author

platomav commented Jun 3, 2020

Thank you for the suggestions rocky. Indeed lots of instructions due to bad old code and yes this is tedious so for now I'll just collect the bytecode from the executables I can find here and there and try again in the future. If you want to leave this issue open just in case someone wants to address it in the future, no problem for me. :)

@rocky rocky added the Volunteer wanted Volunteer wanted to fix if a bug or to implement if a new feature. label May 5, 2021
@rocky rocky changed the title Infinite processing at Python 3.4 bytecode Slow processing on large Python 3.4 bytecode with lots of nesting Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Control Flow Problem has to do with bad control-flow detection Python 3.x Volunteer wanted Volunteer wanted to fix if a bug or to implement if a new feature.
Projects
None yet
Development

No branches or pull requests

2 participants