
A Python cross-version decompiler - eatonphil
https://github.com/rocky/python-uncompyle6
======
jcranmer
Just from taking a quick look at their description of how it works, this
appears to essentially be a glorified pattern-matching approach to
decompilation. This is generally the worst approach to take to the issue,
since it's extremely brittle and trivial optimization or obfuscation will
complicate your approach.

A modern decompiler framework tends to take a very different approach. For
native executables, the hardest problems are actually disassembly and higher-
level type recovery. Bytecode is easier in that the type information is
essentially retained and code is very nicely and obviously differentiated from
data; no need to worry about jump tables or constant pools.

Control flow for a decompiler is recovered via some form of control flow
structuring; the problem is essentially a graph grammar parsing problem, but
this is not an easy problem to solve. In practice, it's the existence of goto-
like edges (gotos, breaks, and continues) that really screws things up. It's
usually easy to tell what the structure should look like without these edges,
but identifying which edges fall into this category turns out to be
challenging in practice. Especially if you start trying to undo jump
threading.

The other interesting issue is working out what the decompiled variables are
and should have been. SSA is actually a pretty clean way to separate out
different variables that get assigned to the same register (this
identification problem exists even in bytecode, as local variable slots do
get reused whenever possible), but doing the PHI elimination and finding out
where you've got bitcasted type magic going on (more of an issue for native
code) can be a challenge.

~~~
XMPPwocky
Do note that CPython bytecode has explicit break, continue, and "for loop"
instructions, and (ofc) the language itself has no goto- so you can
unambiguously recover high-level structure from a CFG far more easily than you
might with compiled C.
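
A minimal sketch for inspecting these loop instructions with the standard-library `dis` module. The function `first_negative` is a made-up example, and opcode names differ between CPython versions (newer releases compile break and continue to plain jumps rather than dedicated opcodes), so treat the printed list as version-specific.

```python
import dis


def first_negative(values):
    # Hypothetical example: a for/else loop containing continue and break.
    for v in values:
        if v >= 0:
            continue
        break
    else:
        v = None
    return v


# List the opcode names the compiler emitted for the function body.
opnames = [ins.opname for ins in dis.get_instructions(first_negative)]
print(opnames)
```

The iteration step shows up as a dedicated `FOR_ITER` instruction, which is the kind of explicit marker a decompiler can key on.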

~~~
rockybernstein
Yes and no. If you are sure that you start out with something generated by
Python, then assuming the Python compiler is faithful, of course you can
always produce source code that doesn't have gotos in it. If instead someone
crafts custom bytecode, like I did here:
[http://rocky.github.io/pycon2018-light.co/#/](http://rocky.github.io/pycon2018-light.co/#/)
it might not have a high-level Python translation.

When there is a translation, is it always unambiguous? No. The same bytecode
could get decompiled as "a and b" or as "if a: b".
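
To see how close the two forms are on a given interpreter, here is a small sketch using the standard-library `dis` module. The helper `opnames` is a made-up name, and whether the two listings come out identical depends on the CPython version, so no particular result is claimed here.

```python
import dis


def opnames(source):
    """Compile a snippet at module level and list its opcode names."""
    code = compile(source, "<snippet>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]


as_expression = opnames("a and b")
as_statement = opnames("if a:\n    b")

print(as_expression)
print(as_statement)
print("identical:", as_expression == as_statement)
```

Both snippets only evaluate `b` when `a` is truthy, so each listing contains a conditional jump; any remaining difference is in the stack bookkeeping around it.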

This happens less often in Python, especially in earlier versions of Python,
because its code generation is very much template driven and there is very
little in the way of the kinds of compiler optimization that would otherwise
make decompiling control flow harder.

Using the example given above, I believe earlier versions of Python would
always enclose the "then" statement/expression with "push_block", "pop_block"
pairs. Even though in certain situations (no new variables are introduced in
the "then" block) this isn't needed. But the fact that those two additional
instructions are there can then distinguish between those two different
constructs.

It is sort of like matching ancestry based on the largely "junk" DNA strands
that exist in a person.

Finally, I should note that Python's control structures with "else"
clauses, like "for/else", "while/else", and "try/else", make control flow
detection much harder. And this is the major weakness of uncompyle6.

That's why I started
[https://github.com/rocky/python-control-flow](https://github.com/rocky/python-control-flow)
to try to address this. In contrast to the current uncompyle6 code, which has
a lot of hacky control flow detection, this creates a control flow graph and
computes dominators and reverse dominators to better distinguish these
high-level structures.
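
As a rough illustration of the idea (not the actual python-control-flow code), dominators can be computed with the textbook iterative data-flow algorithm, and reverse dominators by running the same routine on the reversed graph. The function names and the toy if/else graph below are assumptions made up for this sketch, and every node is assumed reachable from the entry.

```python
def dominators(cfg, entry):
    """Return {node: set of nodes dominating it} for cfg = {node: [successors]}."""
    nodes = set(cfg)
    for succs in cfg.values():
        nodes.update(succs)

    # Build the predecessor map from the successor lists.
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse_dominators(cfg, exit_node):
    """Dominators of the reversed graph, computed from the exit node."""
    reversed_cfg = {}
    for n, succs in cfg.items():
        reversed_cfg.setdefault(n, [])
        for s in succs:
            reversed_cfg.setdefault(s, []).append(n)
    return dominators(reversed_cfg, exit_node)


# Hypothetical CFG for: if cond: <then> else: <else>; <join>
cfg = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
dom = dominators(cfg, "entry")
pdom = reverse_dominators(cfg, "join")
print(dom["join"], pdom["entry"])
```

In this toy graph the join block is dominated only by itself and the entry, and the entry is reverse-dominated by the join, which is the shape that marks where the two branches of an if/else come back together.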

~~~
XMPPwocky
Definitely- though, arguably trying to decompile something which was never
Python into Python is a bit of a fool's errand!

Didn't think of for/else and friends, yeah, that's tricky.

Definitely using doms/idoms is the right approach.

------
biofunsf
The README mentions that some Dropbox source code is included, though I don't
see it in the repo?

    We include Dropbox's Python 2.5 bytecode

Dropbox is pretty well known for building their desktop clients in Python. I
recall they're stuck on Python 2 and use some obfuscation and opcode remapping
with a custom interpreter so it's a little hard to decompile but still
possible. Would love to see what this decompiler can generate for them though.

~~~
loeg
> I recall they're stuck on Python 2 and use some obfuscation and opcode
> remapping with a custom interpreter so it's a little hard to decompile but
> still possible.

[https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one...](https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever/)

Currently (custom) Python 3.5, I guess.

~~~
gizmo385
Am I missing the details in their blog post, or do they not detail the
customizations that they've made to Python 3.5 to suit their purposes?

~~~
loeg
I also don't see any details of their Python 3.x customization in that blog
post.

------
rwmj
Could this be good for converting Python 3 code back to Python 2 (to target
CentOS 7 specifically)? The final code doesn't need to look nice, it just
needs to run.

~~~
aynawn
That sounds much more difficult than installing python 3 on centos 7. Why
can't you do that instead?

~~~
rwmj
Because that would mean all my customers have to find and install Python 3.
Most of them won't allow EPEL packages to be installed because they're not
supported, and even if they did there's a huge barrier to asking people to add
an extra repository.

------
loeg
Does anyone use this and have a good example they've round-tripped through the
CPython compiler and this decompiler? I would assume variable names are lost,
like in most other bytecode-compiled languages.

~~~
dbrueck
Actually, variable names should largely /not/ be lost - Python is insanely
dynamic and needs variable/function/etc. names to remain.

At most you might lose the names of local variables inside a function (i.e.
variables that, due to internal scope, have a limited lifespan), but member
variables, global variables, function names, class names, method names, etc.
would all need to be kept in the bytecode.
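
A quick way to check which names survive compilation is to look at the code object directly. This is a minimal sketch; `total_length` is a made-up function, while `co_name`, `co_varnames`, and `co_names` are standard code object attributes.

```python
def total_length(items):
    # Hypothetical example: one argument, one local, one global lookup.
    count = len(items)
    return count


code = total_length.__code__
print(code.co_name)      # the function's own name
print(code.co_varnames)  # argument and local variable names
print(code.co_names)     # global/attribute names referenced in the body
```

Since these tuples are stored in the compiled code object, a decompiler can read the original identifiers back instead of inventing placeholders.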

~~~
ben-schaaf
Considering `globals` and `locals` exist I would be very surprised if _any_
name got lost. Python is so dynamic you can actually mutate the bytecode of a
function at runtime[0] (Lisp-esque), so the compiler can't actually make any
assumptions or optimizations based on your source code.

[0] [https://pypi.org/project/byteplay/](https://pypi.org/project/byteplay/)

~~~
loeg
Ah, that's unfortunate. I sort of assumed the compiler would be able to
optimize stack variables and only do the slow thing if `locals` was actually
invoked.
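
For what it's worth, in the CPython versions I'm familiar with, function locals are accessed by slot index (the `*_FAST` opcode family), while their names are still kept in `co_varnames`, which is what `locals()` reports. A minimal sketch to check this on a given interpreter follows; `scaled` is a made-up function and the exact opcode names vary by version.

```python
import dis


def scaled(x):
    # Hypothetical example with one argument and one local variable.
    y = x * 2
    return y


# Collect the slot-based local access instructions (names contain "FAST").
fast_ops = [ins.opname for ins in dis.get_instructions(scaled) if "FAST" in ins.opname]
print(fast_ops)
print(scaled.__code__.co_varnames)
```

So the two observations are compatible: local access can be index-based and cheap while the names remain available to a decompiler.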

