How to recover lost Python source code if it's still resident in-memory (gist.github.com)
231 points by ingve on March 11, 2017 | hide | past | favorite | 45 comments



If it's a single file, can't you just copy it back via the deleted file handle in procfs?

https://www.linux.com/news/bring-back-deleted-files-lsof
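For anyone who hasn't seen the trick, a self-contained sketch of recovering a deleted-but-still-open file through procfs (the paths here are invented for the demo):

```shell
# Recreate the scenario: a process holds an open fd on a file that has
# since been deleted; /proc/<pid>/fd hands the contents back.
echo "print('my lost script')" > /tmp/lost_script.py
exec 3< /tmp/lost_script.py      # keep a file descriptor open on it
rm /tmp/lost_script.py           # the file is now deleted...
# lsof would list it as '(deleted)'; recover it through procfs:
cp "/proc/$$/fd/3" /tmp/recovered_script.py
cat /tmp/recovered_script.py     # print('my lost script')
```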


The code was deleted, not necessarily the file. But that's a nice tool to have as well.


This is worth a read just for the explanation of how to get an interactive python shell connected to a running script.


To save everyone a click: https://pypi.python.org/pypi/pyrasite/

Wish I'd known about that years ago just for the ability to grab a call stack from a running program.


For those wondering how it works, it's basically just a wrapper around gdb which calls the internal C exec implementation in the Python interpreter (which then executes some other script file which you specify). Pretty cool.


The method described is for recovering and decompiling bytecode into equivalent Python code, which is not exactly the same thing as source code, since it won't contain things like comments and unexported object names.

edit: made more accurate
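A quick way to see what a decompiler actually has to work with (and why comments can't come back):

```python
# Decompilers reconstruct source from the code object; comments never
# survive compilation, while literals and names do.
import dis

src = "x = 1  # this comment vanishes at compile time\ny = x + 2\n"
code = compile(src, "<example>", "exec")
listing = dis.Bytecode(code).dis()   # textual disassembly
assert "comment" not in listing      # nothing left to recover
assert 2 in code.co_consts           # constants, by contrast, survive
```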


Of course, Python is a little better than most languages for things like 'comments' - a lot of them aren't really comments but docstrings, which are part of the object itself rather than true comments, and so presumably can be restored.


Not if you use `python -OO`, which will optimize the docstrings out.
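Easy to check from a script, if anyone's curious:

```python
# Compare f.__doc__ under a normal interpreter vs. -OO, which strips
# docstrings at compile time.
import subprocess, sys

prog = "def f():\n    'my docstring'\nprint(f.__doc__)"
normal = subprocess.run([sys.executable, "-c", prog],
                        capture_output=True, text=True).stdout.strip()
optimized = subprocess.run([sys.executable, "-OO", "-c", prog],
                           capture_output=True, text=True).stdout.strip()
print(normal)     # my docstring
print(optimized)  # None
```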


Indeed, but it just gave me an idea for a weird use case we've been discussing at work :)


If it ever makes sense to do this it means you aren't committing frequently enough.


Yep.

Amusingly, Simon and I were working on the same codebase during Eventbrite's quarterly hackathon when this happened. He was working solo on one feature while three other developers and I were working on another. Because we were collaborating, we had to commit frequently simply to stay in sync. Because he was working alone, he was able to avoid committing and got into a situation where figuring out the process in this gist was necessary.

Despite this distraction and 1/4 the personpower, Simon arguably still wrote something cooler than we did.


Yes, in a perfect world...

However, if you're reverse engineering a system you've inherited, or one where source code isn't readily available, then this decompilation makes sense.


Thanks for the reminder.


This is certainly an interesting technique; however, in this case surely git reflog would have been easier and wouldn't have lost code comments etc.?


He presumably ran `git checkout` on a file with uncommitted changes.


Awesome use of pyrasite, great job! --luke (pyrasite author)


Interesting. One of the comments asks why he didn't copy the .py from the docker container, but it doesn't get answered. Too bad because I was curious also.


Depending on the author's dev practices, the script might have been in a volume so that editing it on the host would allow the container to receive the changes as they get saved. While I wouldn't normally suggest this approach, it can be handy at times.


Mounting a host directory in a container is quite a popular approach. Almost every development tutorial I have read using Docker mentions it.


Yup that was exactly the problem: I had the code in a folder that was mounted to the container as a volume.


I had to do the exact same thing for my .bash_profile: I had accidentally deleted it (with tons of customizations) and didn't have quick access to my backups, but I did have the deleted .bash_profile loaded into memory in some open terminal sessions.

Fortunately, you can just write:

  declare
to get all your aliases, functions, etc from the current shell session.
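A sketch of what that recovery looks like; the function and variable names here are invented for the demo (aliases need `alias`, function bodies need `declare -f`):

```shell
# Run in bash (declare is a bash builtin) and dump recoverable state.
bash -c '
my_helper() { echo "still here"; }   # stand-in for a lost customization
MY_SETTING="keep me"
declare -f            # prints every function definition, body included
declare -p MY_SETTING # prints the variable with its current value
alias                 # prints any aliases (none in this demo)
' > /tmp/shell_state.txt
grep 'still here' /tmp/shell_state.txt
```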


Pardon my ignorance but why is this a big deal? Debuggers can attach to running processes and pyrasite can inject code into running processes so for a language like python, it makes sense that you would be able to fetch either the resident bytecode or the code itself. What am I missing?


You're not missing anything. But a bunch of us didn't figure it out before and we're just now learning about it. So we're excited, we're learning something new!

(Please, please, don't be condescending to people who are learning something new.)


There's a first time for everything you know...

https://news.ycombinator.com/item?id=13848028


It's some other people's day to be part of the lucky 10,000.

https://xkcd.com/1053/


Would this work on Dropbox, which is mostly written in Python?


They only ship encrypted pyc files, along with a one-off Python interpreter that has the key inside it. There are additional tricks as well, like scrambling the opcode table, disabling built-in functionality you could use to dump in-memory structures, etc.

Of course, since the key is in the binary, you can hack at it: https://www.usenix.org/conference/woot13/workshop-program/pr...


Yep, and they change things around every once in a while too. I've RE'd Dropbox several times using several different techniques. I just checked my old tarball containing a script which downloads the Dropbox binary, downloads the Python interpreter, builds the opcode table database and then decompiles everything:

  gvb@santarago:/tmp/lookinsidethebox$ ./run.sh
  fetched all dependencies..lets try decompiling
  no saved opcode mapping found; try to generate it
  pass one: automatically generate opcode mapping
  108 opcodes
  pass two: decrypt files, patch bytecode and decompile
  1928/1928
  successfully decrypted and decompiled: 1727 files
  error while decrypting: 0 files
  error while decompiling: 196 files
  opcode misses: 7 total 0x6c (108) [#9],  0x2c (44) [#14],  0x8d (141) [#15],  0x2e (46) [#1],  0x2d (45) [#14],  0x30 (48) [#5],  0x71 (113) [#11783],  
A starting point to do this yourself is: https://github.com/rumpeltux/dropboxdec. After unmarshalling the new pyc files, the seed read in via the rng() function is (in newer Dropbox installations) passed through a Mersenne twister, from which 4 DWORD values are read; these are then used to construct the key for the Tiny Encryption Algorithm (TEA) cipher.

After that you get the binary blob back, which you can now unmarshal. But you still need to figure out the opcode mapping. For that I used a trick first done publicly (to the best of my knowledge) by the author of PyREtic (Rich Smith), released at Black Hat 2010: he just compares the stdlib pyc files byte by byte with the stdlib included within Dropbox (after decrypting those pyc files). That should yield a mapping of opcodes.

Then pass everything through uncompyle2 and you've got pretty readable source code back. Some files will refuse to decompile, but that just means hand-editing / fine-tuning the last bits of your opcode table.
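For the curious, the byte-by-byte comparison trick can be sketched in a few lines. The permutation here is made up for the demo - the real scrambled table obviously differs, and real bytecode needs care around EXTENDED_ARG:

```python
# Toy PyREtic-style opcode recovery: compile known source normally,
# "scramble" it the way a patched interpreter would, then pair opcode
# bytes at identical offsets to rebuild the mapping.
PERM = {op: (op * 7 + 3) % 256 for op in range(256)}  # invented scramble

def scramble(co_code: bytes) -> bytes:
    # CPython >= 3.6 bytecode is (opcode, arg) byte pairs; only the
    # opcode bytes (even offsets) get permuted.
    return bytes(PERM[b] if i % 2 == 0 else b for i, b in enumerate(co_code))

def recover_mapping(reference: bytes, scrambled: bytes) -> dict:
    # Byte-by-byte comparison of two compiles of the same source.
    return {scrambled[i]: reference[i] for i in range(0, len(reference), 2)}

ref = compile("a = 1\nb = a + 2\n", "<stdlib>", "exec").co_code
mapping = recover_mapping(ref, scramble(ref))
# every recovered pair is consistent with the (nominally unknown) table
assert all(PERM[real] == scr for scr, real in mapping.items())
```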

EDIT: follow-up on the parent comment; the encryption keys are not in the interpreter. The interpreter is patched to not expose co_code and more (to make this memory dumping more difficult; injecting a shared object is a different technique that I used too). It's also patched to use the different opcode mapping and its own unmarshalling of pyc files upon loading them. However, the key for each pyc file is derived from data strictly in those files themselves. It's pretty clear when you load up the binary in IDA Pro and compare the unmarshalling code with a standard Python interpreter's code.


I thought they would have added more obfuscation after that paper was published. Since the opcode generation and parsing is reset per pyc file, I expected to see stuff like:

- A rotating opcode table that changes every X opcodes

- Multiple opcodes that reference the same operation, selected randomly at generation time.


Now I understand why they hired Guido back then.


IIRC, they used to randomize bytecode a bit, to avoid decompilation with off-the-shelf tools.


I'm pretty sure this would be a violation of the EULA.


Anybody know if this is possible in Ruby? Just being able to get a live irb from a process ID would be pretty awesome.


"sudo gdb -p [process id]" will attach to the running process. If you attach that to a Ruby process, you can do:

   call rb_eval_string("puts 'hi from the Ruby process'")
(The latter will output to the Ruby script's STDOUT, so if you can't see the stdout of the Ruby interpreter it won't appear to produce anything - test it by just starting a ruby interpreter and finding its pid.)

You can use that to (try to) load code. For example try loading pry and setting up a pry-remote connection. You can also use any other part of the MRI extension API - gdb can do quite complicated calls.

Note that any uncaught error may easily cause the Ruby script you're connected to to terminate, and there are plenty of opportunities to cause problems: when you connect gdb to the process, the Ruby interpreter will be paused at an arbitrary point, which may leave it in an inconsistent state.

To wrap the above up into anything reasonable, you'd want to first ensure the interpreter is in a decent state by setting a breakpoint somewhere "sane" in the interpreter, continuing execution, and then executing some very carefully crafted code to let you inject what you need without triggering errors that mess things up for the script, or changing variables etc. that the script depends on.

It probably could be wrapped up into something quite nice combined with pry/pry-remote, though.


Thanks, I will probably make a wrapper gem around that (e.g. maybe just a `call rb_eval_string("binding.pry")`).

In my case I get "No symbol "rb_backtrace" in current context.", I will look deeper into it.

Also, there is a great gist about Ruby GDB debugging: https://gist.github.com/mmullis/6211061.


While not exactly the same it's still quite useful. https://github.com/Mon-Ouie/pry-remote


Thanks, I guess I could just have some special key that triggers pry-remote (E.g trap ctrl-D or something)


I did this yesterday. However, IntelliJ had my back: its Local History feature has saved my bacon on more than one occasion.

One time I was able to go back weeks to fetch some code I'd long since deleted and that was nowhere to be found in git.


I have this in my .vimrc which serves a similar purpose:

    " persist undo history to file / across sessions
    set undofile
    set undodir=$HOME/.vimundo
    " max out history length
    set history=9999
    set undoleves=9999999999
    set undoreload=10000

(Don't forget to mkdir "$HOME/.vimundo".)

Combine that with https://github.com/mbbill/undotree and you can easily walk the entire history of every file you edit.


That undotree plugin looks sweet! Do I understand correctly that if I do N undos, then fat-finger an edit instead of a yank, I can still get back to where I was before the N undos? This would solve my only big gripe with vim's undo.

(Well, the other big gripe is that when I'm forced to use a program with crippled undo, e.g. any Microsoft product ever, I'm frustrated up the wall. The worst-in-class award goes to Word with Endnote, where many Endnote commands wipe your entire undo history. How people can use that on a permanent basis is beyond me.)


The equivalent emacs code is here: https://www.emacswiki.org/emacs/BackupDirectory


Is it actually `undoleves` or is that a typo?


That should be `undolevels`, yes. Unfortunately I can't edit it now.


Interesting! I wonder if...

  gdb -p `pidof python`
  gcore foo.core
  strings foo.core | grep -a -B200 -A200 knowntext

.. would also work.


I don't think python keeps the text file in memory, since it compiles everything to bytecode on startup.
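Right - easy to demonstrate: once the .py file is gone, the bytecode still runs but the interpreter has no copy of the text to give back (the filenames here are temp files created for the demo):

```python
# The interpreter keeps code objects, not source text; even inspect
# just re-reads the .py from disk, so a deleted file means no source.
import inspect, os, tempfile

src = "def f():\n    return 'needle'\n"
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as fp:
    fp.write(src)
    path = fp.name

ns = {}
exec(compile(src, path, "exec"), ns)
os.unlink(path)                  # delete the "source file"
assert ns["f"]() == "needle"     # the bytecode still runs fine...
try:
    inspect.getsource(ns["f"])
    source_available = True
except OSError:                  # ...but the text itself is gone
    source_available = False
assert not source_available
```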

