
Pickle’s Nine Flaws - gilad
https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html
======
Kednicma
There _is_ a way to read pickles without running them, but it is Python-only
and still requires one to know how pickles work. The module `pickletools` can
be used to disassemble pickles to bytecode, just like `dis` for normal Python
objects. Honestly, though, I wouldn't say that this invalidates the point
about unreadability, but just hammers in exactly how unreadable they really
are.
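
For anyone who wants to see what that looks like, here is a minimal sketch using only the standard library. The sample dictionary is made up for illustration; `pickletools.dis` and `pickletools.genops` only read the byte stream and never execute it.

```python
import io
import pickle
import pickletools

# A small, harmless object to serialize (illustrative data only).
payload = pickle.dumps({"name": "example", "values": [1, 2, 3]}, protocol=2)

# Disassemble the pickle into a text listing without unpickling it.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()
print(listing)

# genops() yields (opcode, argument, position) tuples, again without
# running anything, which is handy for programmatic inspection.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(opcode_names)
```

The listing is a stack-machine trace, so reading it still assumes you know what the opcodes do.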

------
carapace
To me it's interesting that _pickle_ can be thought of as recording some of
the implicit assumptions GvR made about the expected use of Python semantics.

Formally serialization/deserialization is very crunchy and precise. (And I
remember how stoked I was to find out that Python included an implementation!)
In practice, things get messy and we break the implicit assumptions.

Is it a flaw of the _pickle_ module? Or are our designs too clever?

Patient: "It hurts when I do this."

Doctor: "Don't do that."

;-)

------
nurettin
* Insecure: If you are unpickling insecure code, you have other problems. Deserializers should not be used as a protection against hacking.

* Old pickles look like old code: Again, convert your object into json and serialize that to your database. Oh no, you are missing an attribute. Pickle should not be used so you don't have to employ a release engineer.

* Implicit: No software works everywhere with defaults. So use copyreg.

* Over-serializes: USE copyreg.

* __init__ isn’t called: USE COPYREG.

* Python only: what's this for, then? [http://www.picklingtools.com/](http://www.picklingtools.com/)

* Unreadable: Great feature.

* Appears to pickle code: Another great feature.

* Slow: check again, it has been 8 years. I can't find any faster method.
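
A rough sketch of what the copyreg suggestion above could look like. The `Connection` class and its attributes are hypothetical, invented for this example; `copyreg.pickle` is the real standard-library call. The reducer decides which attributes get serialized and makes unpickling go through the constructor, so `__init__` runs again.

```python
import copyreg
import pickle


class Connection:
    """Hypothetical class: two settings plus scratch state."""

    def __init__(self, host, port=5432):
        self.host = host
        self.port = port
        self.cache = {}  # derived state that should not be persisted


def reduce_connection(conn):
    # Return (callable, args). Unpickling calls Connection(host, port),
    # so __init__ runs and only these two values are stored.
    return Connection, (conn.host, conn.port)


# Register the reducer in the global dispatch table used by pickle.
copyreg.pickle(Connection, reduce_connection)

original = Connection("db.example.invalid", 6543)
original.cache["rows"] = [1, 2, 3]

restored = pickle.loads(pickle.dumps(original))
print(restored.host, restored.port, restored.cache)
```

The cache contents are dropped on purpose here; whether that is the right trade-off depends on the class.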

~~~
price
> * Insecure: If you are unpickling insecure code, you have other problems.
> Deserializers should not be used as a protection against hacking.

There are lots of good use cases for deserializing untrusted data -- it's what
you do in almost any client-server situation. So the fact you can't do this
with pickle really is an important limitation.

~~~
manicdee
In what client server situation does it make sense to use pickles over
JSON/YAML?

~~~
hyperpape
None, except when you're taking a huge shortcut. Which is why you want to be
super cautious about using Pickle, or Java serialization, or any serialization
solution that deserializes arbitrary objects. Once your deserialization isn't
explicit about what objects you accept, you have to be super careful about the
provenance of that data.
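
One way to make deserialization explicit about what it accepts is to override `Unpickler.find_class`, following the pattern shown in the Python documentation. This is a sketch, not a security guarantee: the allow-list below is an arbitrary choice for illustration, and a restricted unpickler narrows the attack surface rather than making untrusted pickles safe.

```python
import builtins
import io
import pickle

# Illustrative allow-list; a real one would be chosen per application.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only resolve globals that are explicitly allowed.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and allowed builtins load normally.
print(restricted_loads(pickle.dumps([1, 2, range(15)])))

# A pickle that references builtins.eval is rejected before it can be used.
try:
    restricted_loads(pickle.dumps(eval))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```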

------
hyperpape
I'm skeptical of the point about over-serialization. In my opinion, throwing
an exception on an unserializable attribute is a good default. If an object is
using a file, it will more often than not be unusable when deserialized
without the file.

This is one of the few things Java gets right about its built-in
serialization: if you have an object that can't be serialized, anything using
that object has to declare it as transient, meaning it won't be serialized or
deserialized. Hopefully you'll think about whether the result makes sense
before using the keyword.

If you don't mark an unserializable field transient, you'll get an exception
at runtime. It's not enforced by the compiler, which would be ideal, but
linters will warn you.
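
For comparison, a rough Python analogue of this, as a sketch with made-up class names: pickle already raises on some unserializable attributes (a lock here, standing in for a file handle), and `__getstate__`/`__setstate__` play the role of `transient` by leaving the field out and rebuilding it on load.

```python
import pickle
import threading


class Counter:
    """Hypothetical class holding an attribute pickle cannot serialize."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()


class TransientCounter(Counter):
    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_lock"]  # roughly what 'transient' does in Java
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # rebuilt, not restored


# Without the opt-out, pickling fails loudly at runtime.
try:
    pickle.dumps(Counter())
    plain_failed = False
except TypeError as exc:
    plain_failed = True
    print("refused:", exc)

counter = TransientCounter()
counter.count = 7
restored = pickle.loads(pickle.dumps(counter))
print(restored.count)
```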

------
moreati
Hawking my own (incomplete) contribution to Pickle security/analysis
[https://github.com/moreati/pickle-fuzz#rehabilitating-python...](https://github.com/moreati/pickle-fuzz#rehabilitating-pythons-pickle-module)

------
ChrisSD
This seems to seriously misunderstand the point of pickle. It's not for data
interchange. It's for e.g. caching objects or debugging. That's it.

The fact it keeps "old code" is a feature. The object is exactly as it was at
the time it was saved.

~~~
ryanisnan
I mean, the most popular async message queue for python broadly supports
pickle as its serialization format. I don't think this problem is exclusive to
this blog post.

~~~
ChrisSD
I should have been more precise. I meant interchange between different
programs/platforms/etc. Not internal messages.

------
lordnacho
I think these flaws are fairly minor, at least you seem to be nudged towards
use cases where you're not overly reliant on pickle for complex work.

If readability is an issue there's a JSON version that's quite useful.

Other than that, most of the other concerns are addressable. If security
matters, perhaps use an encryption lib around the pickle, rather than ask for it
to be built into it? As for speed, you're already using python and chances are
you're not constantly pickling and unpickling?

~~~
price
Encryption won't do you any good - encrypted messages can be forged. But let's
say you meant a signature or other form of authentication.

That still does you no good if you're, say, a server getting data from a
client. Very few servers want to allow clients to execute arbitrary code
inside them.

That still leaves some situations where it can be used - but it's a major
limitation on the scope of those situations.
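
To make the signature idea concrete, here is a minimal sketch with the standard `hmac` module. The key and the tag-then-payload layout are assumptions for illustration. It only proves the pickle came from someone holding the key, so it fits the case where both ends are trusted, not the client-to-server case described above.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; a real one would come from configuration.
SECRET_KEY = b"replace-with-a-real-secret"
TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes


def sign_and_dump(obj):
    """Pickle obj and prepend an HMAC-SHA256 tag over the pickle bytes."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def verify_and_load(blob):
    """Check the tag first; unpickle only if it matches."""
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(payload)


blob = sign_and_dump({"job": 42})
print(verify_and_load(blob))

# Flipping a single bit in the payload makes verification fail.
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    verify_and_load(tampered)
except ValueError as exc:
    print(exc)
```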

------
stared
I am surprised when people use pickle NOT as a last resort.

For numeric data, H5 is nice. For configs, JSON is pretty much a standard. For
Python code... well, nothing beats Python code.

------
edejong
Pickle's greatest flaw is the complete lack of forward and backward
compatibility. Compatibility is not guaranteed when upgrading any of the
dependencies. Dependencies have to stay the same across releases, which halts
forward progress in the development process.

~~~
nedbat
Can you elaborate? What dependencies? Pickle is in the standard library.

~~~
price
Perhaps they're referring to your application's dependencies, in a situation
where you're pickling instances of those dependencies' types. Then this is an
example of "old pickles look like old code".

~~~
johncearls
I have this problem a lot with pandas DataFrames. I know I could engineer
around it, but most of the time I just want to distribute some number
crunching to a bunch of docker images quickly and a database is overkill for a
one off analysis. Works fine until an image updates pandas. Not taking issue
with pandas or pickle, but it's an issue of time trade-off. JSON is okay, but
object conversion/NaN/None/inf can be a bear.
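
A small illustration of the NaN/None/inf pain with the standard `json` module (the record is made-up data): by default Python writes `NaN` and `Infinity` tokens that are not valid JSON, and strict mode refuses them outright.

```python
import json
import math

# Illustrative record containing the awkward values.
record = {"mean": float("nan"), "upper": float("inf"), "missing": None}

# The default (allow_nan=True) emits NaN and Infinity, which strict
# JSON parsers in other languages may reject.
text = json.dumps(record)
print(text)

# Python's own loader accepts those tokens, so the round trip works
# only as long as both ends are Python.
roundtrip = json.loads(text)

# With allow_nan=False the encoder raises instead of emitting them.
try:
    json.dumps(record, allow_nan=False)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict mode:", exc)
```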

------
sradman
Data serialization is hard and the artifacts are much longer lived than
executable code and even our API interfaces. YAML, JSON, XML are all flawed.
There are many competing binary serialization frameworks. Beware. Dar be
Dragons in Durable Data.

------
cmwelsh
I find it interesting that everyone so far has suggested JSON as a pickle
alternative. Depending on why you are serializing and deserializing the data,
a lot of times the true replacement for pickle is a full-fledged database.

~~~
curiousgal
Right now I'm working on a large Monte Carlo project where I need to export
the result of every simulation. Inserting into a database takes longer than
just pickling the result.

------
varbhat
It must only be used for small programs and it serves the purpose well.

~~~
analog31
I use it for transferring stuff from a data collection program to a Jupyter
notebook. It works quite well for that. Converting big numpy arrays to text
and back again would be cumbersome.

~~~
hansvm
If all you're transferring is numpy arrays, they natively support an
efficient, cross-platform, forward-compatible binary format that might work
better for you.

~~~
analog31
Good point, I'll look into that. Usually it's a mixture of things like numpy
arrays and meta-data. I find it's easy to save a lot of stuff along with the
data, that I might regret not having later on.

------
forgotmypw17
Similar to PHP's serialize() and unserialize()

~~~
brazzy
Also very, very similar to Java's serialization mechanism.

~~~
dnautics
Basically every language with a vm with boxed types has to have this. Erlang
has term_to_binary, which has some crazy superpowers, like I can serialize a
lambda, put it on a pigeon, and have it run on an airgapped machine (assuming
any module referenced in the lambda has an equivalently named version in the
airgapped VM). Of course you can see how this could also be a security problem
if you're not careful.

On the other hand, this is part of how Erlang distributed systems (which are
crazy easy) can communicate with extremely simple semantics, and the security
model for Erlang distribution is very explicitly "trusted only; locking down
the cluster as a single security domain is YOUR responsibility".

