

Why Python Pickle is Insecure - mnemonik
http://nadiana.com/python-pickle-insecure

======
tptacek
Ruby's Marshal library is not quite as blatantly insecure as pickle (it won't
do any string interpolation on load), but you shouldn't trust any of these
facilities: you're essentially passing data to a very weird variant of eval().

But _[edit, should have said this to begin with]_ pickle isn't an interchange
format. It's not supposed to be secure. Python already offers a myriad of good
interchange formats. Interchange isn't pickle's job, and if you use it for
that, you've made a serious design error.

Ruby unfortunately blurs the line here by using Marshal as an interchange
format in some cases. None of those cases are made insecure by Marshal itself
(they all allow code execution by design), but the usage does create a
confusing precedent.

You're better off with ASN.1/BER than you are with Pickle or Marshal as a file
or protocol format; that's how inappropriate Pickle is to the task.

~~~
viraptor
> not quite as blatantly insecure as pickle (it won't do any string
> interpolation on load)

Are you saying that pickle works via string interpolation (or that this
problem is possible because of interpolation)? That's incorrect...

~~~
tptacek
No; Marshal and pickle are very different (and I confused things by talking in
Ruby terms and referring to Python). Ruby Marshal isn't a virtual machine.
Pickle is more like Flash or Postscript than RTF, which is what Marshal is
like.

------
mlLK
Regardless of how insecure pickle is, it is the perfect module to learn how
easy _basic_ data-persistence can be in Python through serialization. I don't
get articles like this. Maybe it's because I'm still learning a lot, but I
don't see the value of posting a conclusion that the module's manual or
documentation already states explicitly.

Regardless of my opinion, pickle is a great module to get you going in Python
and even better for scripting and storing basic data-sets on your local
machine.

~~~
rg123
The article was prompted by some assertions on StackOverflow that pickle was
secure or "secure enough", etc. See
<http://stackoverflow.com/questions/1389738/how-to-save-data-with-python/1389792#1389792>

------
rw
Isn't making pickle completely secure equivalent to solving the halting
problem?

~~~
rcoder
No.

Proving termination for programs written in models less powerful than a Turing
machine does not require you to solve the halting problem. Programs using
primitive recursion, finite automata, and regular expressions, for example,
can all be proved to terminate, and can express a number of useful
computations.

The problem is that the pickle module is far too permissive. In particular,
the REDUCE operation invokes a Python callable with an argument tuple on the
pickle stack, which means that 'pickles' are at least as powerful (in the
Turing-general, halting problem sense) as Python.
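
A minimal sketch of what that means in practice, using only the standard library. The `Payload` class is made up for illustration, and the expression passed to `eval` is deliberately harmless; an attacker would put anything they like there.

```python
import pickle
import pickletools


class Payload:
    """Hypothetical class; __reduce__ tells pickle how to rebuild it."""

    def __reduce__(self):
        # (callable, args): the unpickler will call eval("6 * 7") on load.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# The serialized stream contains a REDUCE opcode.
opcodes = [op.name for op, arg, pos in pickletools.genops(data)]
has_reduce = "REDUCE" in opcodes

# Loading runs the callable; we get its return value back, not a Payload.
result = pickle.loads(data)
print(has_reduce, result)
```

Note that `pickle.loads()` never needs the `Payload` class to exist on the loading side: the stream only names the callable and its arguments.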

~~~
tptacek
It sure sounds like you said "No, <stuff> <stuff> <stuff>, but yes."

~~~
calcnerd256
No, they are saying that Pickle as it is currently designed is too powerful.
Depending on the goals of Pickle, it may be possible to reduce its power by
redesigning it to do no more than its goals.
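
For what it's worth, the standard library already lets you cut the power down somewhat by overriding `Unpickler.find_class`. A rough sketch, assuming an allow-list of a few builtins (the `SAFE_BUILTINS` set, `restricted_loads`, and `Payload` are names invented here):

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these builtins may be referenced by a pickle.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    def __reduce__(self):
        return (eval, ("6 * 7",))  # harmless stand-in for a malicious call


plain = restricted_loads(pickle.dumps({"a": [1, 2, 3]}))

try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(plain, blocked)
```

This narrows what a pickle may reference; it does not turn pickle into a safe interchange format, and the allow-list has to be chosen with care.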

~~~
tptacek
And I'm saying "pickle isn't really designed to solve the problem they're
saying it's bad at solving".

------
yason
So what's news? Unpickling is safe only if you are very sure the pickled data
was created by yourself, with the same version of Python. It has always been
like this.

Typically you pickle when you wish to offload some objects to persistent
storage. It's similarly typical to compute the message digest of the pickle
data with some salt padding and store that in a place that you consider
relatively safe with respect to your security needs. Then you don't need to worry
about unpickling malicious data.

~~~
tptacek
I don't know what "the digest of pickle data with some salt padding" means
(when you say "salt", you trip the "talking about crypto using words only lay
programmers use" sensor), but it sure doesn't sound crypto-safe. There are a
number of easy-to-make errors with digest schemes that allow attackers to make
constrained modifications to documents without breaking the digest.

Long story short, _don't_ do this; instead, PGP/GPG encrypt and sign the
pickled file. GPG is strictly better across all axes than hand-hacking your
own protection scheme.

~~~
yason
From what I've read, salt is a commonly used term in cryptology, and refers to
certain schemes for key derivation for hashing/encrypting. Salt can either be
public (to make brute-force attacks infeasible) or private (for better
security).

Almost always when pickling we're only interested in one aspect of security
that is integrity: what we put in is what we get out. Encryption isn't needed
and signing doesn't really offer much more with regard to this case. Instead,
cross-checking against a message digest to make it hard to modify pickles (or
any runtime data offloaded to disk) seems to be almost idiomatic. YMMV.

Not that I wouldn't want to write a fancy GPG-based persistent storage, but
generally it would be overkill. And overkill, in my experience, is good at
blinding developers to other threats. YMMV.

There are many good, proven message digest algorithms that are useful for
implementing a simple salted hashing scheme. In practice, an application
developer must eventually take algorithms for granted. We consider MD5
demonstrably weak but SHA, especially variations with longer digests, we
consider strong enough for most purposes. So given the assumption that we can
trust SHA, I'm sure you're familiar with something like the following:

- take the pickle output from pickle.dumps()

- create some random data and use it as salt

- run the pickle output + salt through hashlib.sha512() or whichever you
prefer to obtain the message digest

- store the pickle output somewhere, even in public

- store the message digest somewhere, even in public

- store the salt some place safe

- recompute and verify before calling pickle.loads()
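
The steps above could be sketched roughly as follows, with one substitution stated plainly: instead of hashing "pickle output + salt" by hand, this uses the standard library's `hmac` module with the secret random value as the key, since HMAC is the usual construction for a keyed digest. The function names `sign` and `verified_loads` are made up for the example.

```python
import hashlib
import hmac
import pickle
import secrets


def sign(data: bytes, key: bytes) -> bytes:
    # Keyed SHA-512 digest over the pickle bytes.
    return hmac.new(key, data, hashlib.sha512).digest()


def verified_loads(data: bytes, tag: bytes, key: bytes):
    expected = sign(data, key)
    # Constant-time comparison; refuse to unpickle on mismatch.
    if not hmac.compare_digest(expected, tag):
        raise ValueError("pickle data failed integrity check")
    return pickle.loads(data)


key = secrets.token_bytes(32)  # the secret; keep this some place safe
data = pickle.dumps({"user": "alice", "score": 10})
tag = sign(data, key)          # may be stored alongside the data

restored = verified_loads(data, tag, key)
print(restored)
```

As with the list above, everything hinges on the key staying secret: anyone who can read it can produce a valid digest for a malicious pickle.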

You still have to have trust in some storage that you consider safe. Computing
the salt dynamically from a set of fixed and/or runtime values could be done
by anyone, and is merely security by obscurity. However, exactly the same
applies to GPG: you would have to store the private and public keys somewhere
safe, and go from there.

And finally, as for me, I'd probably create more security holes in
implementing the GPG integration than just sticking with the standard Python
hashlib.

------
rarrrrrr
For a secure alternative to pickle (with the same API so a drop in
replacement): <http://home.gna.org/oomadness/en/cerealizer/index.html>

------
micktwomey
Nice, worth it alone to find out about pickletools (something I've needed on
many occasions but never thought to look for).
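
For anyone else who hadn't seen it, a quick sketch of `pickletools.dis`, which prints the opcodes in a pickle without loading it (the sample dictionary is arbitrary):

```python
import io
import pickle
import pickletools

data = pickle.dumps({"name": "spam", "count": 3})

# Disassemble the stream into a readable opcode listing.
# Nothing in the pickle is executed by this.
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()
print(listing)
```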

------
cool-RR
Makes me fear that data corruption in my pickles will crash the system. Maybe
someone should implement an interface to 'pickle' that maintains a hash of the
string.
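
Something like this, as a rough sketch (the helper names `dumps_checked` and `loads_checked` are invented here). It prepends a SHA-256 digest to the pickle bytes and checks it before loading. This only catches accidental corruption; anyone who can modify the file on purpose can recompute the digest too.

```python
import hashlib
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def dumps_checked(obj) -> bytes:
    payload = pickle.dumps(obj)
    # Layout: 32-byte digest followed by the pickle bytes.
    return hashlib.sha256(payload).digest() + payload


def loads_checked(blob: bytes):
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("pickle data is corrupted")
    return pickle.loads(payload)


blob = dumps_checked([1, 2, 3])
print(loads_checked(blob))
```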

~~~
yangyang
Why are they any more likely to be corrupted than any other file, including
your Python sources or .pyc files (which use another serialisation scheme,
marshal)?

~~~
pyre
I'm guessing because it's possible for the data to be corrupted in just the
right way so as to construct some system-crashing (or critical-data
corrupting) system() or eval() call. Though that's a pretty extreme paranoia.

As you state, the python files themselves could also be corrupted in such a
manner and then run in the interpreter, or a compiled program could get
corrupted in just the right way to execute 'rm -rf /' though it's not likely.

------
euroclydon
What's up with YAML? Doesn't Google use it extensively? Is it a better
alternative than JSON even if you are sending serialized data from the
JavaScript in the browser to the web server and back?

~~~
rcoder
YAML is strictly a superset of JSON in its expressiveness, since it allows
tagging of serialized objects with a type name. JSON reduces everything to
maps, arrays, and scalars, so any type information has to be encoded in (or
inferred from) the structures themselves. I also find it to be a bit easier to
read, and so prefer it for configuration files or console dumps of data
structures.
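
To illustrate the JSON side of that: a small sketch of encoding type information in the structure itself, using only Python's standard `json` module. The `__type__` key is a convention invented for this example, not anything JSON defines.

```python
import datetime
import json

TYPE_KEY = "__type__"  # hypothetical tagging convention


def encode(obj):
    # Called by json.dumps for objects it cannot serialize natively.
    if isinstance(obj, datetime.date):
        return {TYPE_KEY: "date", "value": obj.isoformat()}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def decode(d):
    # Called by json.loads for every decoded JSON object (dict).
    if d.get(TYPE_KEY) == "date":
        return datetime.date.fromisoformat(d["value"])
    return d


text = json.dumps({"released": datetime.date(2009, 9, 9)}, default=encode)
restored = json.loads(text, object_hook=decode)
print(text, restored)
```

Both ends have to agree on the convention out of band, which is the bookkeeping that YAML's type tags would otherwise carry in the format.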

However, YAML isn't quite as well-supported in the standard libraries of
various programming languages, most notably JavaScript. Which one to use
depends largely on how heavily you expect browsers to consume your service
output.

