
Don't Pickle Your Data - benfrederickson
http://www.benfrederickson.com/2014/02/12/dont-pickle-your-data.html
======
falcolas
JSON, while a fantastic lightweight data format, is insufficient for storing
even moderately complex objects. It cannot properly differentiate between
tuples and lists, it can only accept strings for object keys, and it only
stores unicode strings.

You also have to write custom (de)serializers if you want to store datetime
objects (my personal pet peeve), any of the special python containers...
basically any time you want anything other than an object, array, string or
number.
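
A minimal sketch of those round-trip losses, written for Python 3; the `record` dictionary is made up for illustration and is not from the article:

```python
import json
import pickle
from datetime import datetime

# Hypothetical sample: a tuple value and a non-string key.
record = {"point": (1, 2), 1: "integer key"}

json_copy = json.loads(json.dumps(record))
pickle_copy = pickle.loads(pickle.dumps(record))

print(json_copy)    # the tuple comes back as a list, the key 1 as the string "1"
print(pickle_copy)  # equal to the original, tuple and int key included

# datetime objects are rejected unless a custom serializer is supplied.
try:
    json.dumps({"when": datetime(2014, 2, 12)})
    datetime_ok = True
except TypeError:
    datetime_ok = False
print(datetime_ok)
```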

Unless I'm dealing with untrusted data sources, or need to interoperate with
other languages, I will keep using Pickle.

~~~
drdaeman
Heh. Pickle is not really sufficient for some edge cases either. Luckily,
there's Dill that can take almost perfect snapshots of the whole interpreter:
[https://pypi.python.org/pypi/dill](https://pypi.python.org/pypi/dill)

~~~
dman
Also, Dill's developer Michael McKerns is super responsive and helpful.
Hopefully someday pickle will be replaced / augmented with dill in the
standard distribution.

------
johnrob
"Use pickle with caution" would be better advice. There are plenty of cases
where your data is under control and pickle is a huge time saver. It takes
work to translate most data structures to/from json.

------
vbit
Your benchmarks don't mean much because you're giving different data to
different packers.

cPickle gets a list of objects but json gets a list of dictionaries? The cost
of converting the objects into the dictionaries is conveniently excluded from
the json benchmark.

You should try serializing the same list of dictionaries with both, use the
highest pickle protocol, and repost the results.
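
A rough sketch of what a like-for-like comparison could look like. This is not the article's benchmark: the `items` data and the `time_pack` helper are assumptions made up here, and it targets Python 3, where `pickle` already uses the C implementation that Python 2 exposed as `cPickle`.

```python
import json
import pickle
import timeit

# Hypothetical sample data: plain dictionaries, so every packer sees the same input.
items = [{"id": i, "name": "item %d" % i, "tags": ["a", "b"]} for i in range(10000)]

def time_pack(dumps, repeat=5):
    """Return the best wall time out of `repeat` single calls to dumps(items)."""
    return min(timeit.repeat(lambda: dumps(items), number=1, repeat=repeat))

results = {
    "json": time_pack(json.dumps),
    "pickle (protocol 0)": time_pack(lambda obj: pickle.dumps(obj, protocol=0)),
    "pickle (highest)": time_pack(
        lambda obj: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    ),
}

for name, seconds in results.items():
    print("%-20s %.4f s" % (name, seconds))
```

The numbers depend on the machine, the Python version, and the shape of the data, so treat them as a local measurement only.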

------
jlebar
This benchmark doesn't specify a pickle protocol, which forces Python to use a
big, inefficient format.
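
For illustration, the protocol is chosen with the `protocol` argument of `pickle.dumps`. Python 2, which the benchmark uses, defaults to the ASCII protocol 0, while Python 3 defaults to a binary protocol. A small sketch with made-up data:

```python
import pickle

# Hypothetical sample data, not the article's.
data = [{"id": i, "name": "item %d" % i} for i in range(1000)]

# Serialize the same data with the ASCII protocol and the newest binary one.
sizes = {
    protocol: len(pickle.dumps(data, protocol=protocol))
    for protocol in (0, pickle.HIGHEST_PROTOCOL)
}

for protocol, size in sizes.items():
    print("protocol %d: %d bytes" % (protocol, size))
```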

I filed a bug:
[https://github.com/benfred/bens-blog-code/issues/1](https://github.com/benfred/bens-blog-code/issues/1)

~~~
falcolas
I also went and tested it.

Baseline JSON:

    
    
        --------------------------------------------------------------------------------
        JSON
        packTime 0.612724065781 s -  163205.601975 items/s
        unpackTime 0.782174110413 s -  127848.772631 items/s
        size 174.26637
    

Baseline cPickle, ascii protocol:

    
    
        --------------------------------------------------------------------------------
        cPickle
        packTime 2.41442704201 s -  41417.6938297 items/s
        unpackTime 0.875658035278 s -  114199.831408 items/s
        size 286.26637
    

cPickle, highest protocol:

    
    
        --------------------------------------------------------------------------------
        cPickleHP
        packTime 1.02942800522 s -  97141.3245929 items/s
        unpackTime 0.583297967911 s -  171438.965162 items/s
        size 198.26637
    

For giggles, I evened the playing field and let cPickle at the same data
structure as JSON, ascii protocol and highest protocol:

    
    
        --------------------------------------------------------------------------------
        cPickleJsonData
        packTime 0.642832040787 s -  155561.629874 items/s
        unpackTime 0.478959083557 s -  208786.101847 items/s
        size 205.53356
    
    
        --------------------------------------------------------------------------------
        cPickleHPjsonData
        packTime 0.285845041275 s -  349839.897708 items/s
        unpackTime 0.340456962585 s -  293722.881273 items/s
        size 175.26637
    

My conclusions? Serializing dictionaries is easier than serializing objects,
and using a non-ascii protocol offers some obvious benefits. JSON is not
obviously better than cPickle when comparing apples to apples.

It's also worth noting that comparing the json module to the pure python
Pickle module isn't a fair fight either; json is at least partially written in
C.

~~~
paulgb
I wonder how ujson (pure-C) compares?
[https://pypi.python.org/pypi/ujson](https://pypi.python.org/pypi/ujson)

------
thu
At the end of the post he says that only Python can parse Pickle. This is not
really true. You can write a Pickle parsing library in any language (though
some special uses might not be possible).
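
To see why a parser in another language is feasible: a pickle is a stream of opcodes for a small stack machine, and the standard library's `pickletools` module can list them. A short sketch with a made-up payload:

```python
import pickle
import pickletools

# Hypothetical payload, pickled with protocol 2.
payload = pickle.dumps({"answer": 42}, protocol=2)

# genops yields (opcode, argument, position) for every instruction in the stream.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(payload)]
print(opcode_names)

# dis prints a human-readable disassembly of the same stream.
pickletools.dis(payload)
```

The hard part for a foreign implementation is rebuilding arbitrary class instances, which needs the Python classes themselves; plain containers, strings, and numbers only need the opcodes.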

For instance I have implemented enough of it[0] to use in a drop-in
replacement for Graphite[1].

[0]: [https://github.com/noteed/python-pickle](https://github.com/noteed/python-pickle)

[1]: [http://graphite.wikidot.com/](http://graphite.wikidot.com/)

------
silveira

      >>> import json
      >>> json.dumps(set())
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
          return _default_encoder.encode(obj)
        File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
          chunks = self.iterencode(o, _one_shot=True)
        File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
          return _iterencode(o, 0)
        File "/usr/lib/python2.7/json/encoder.py", line 178, in default
          raise TypeError(repr(o) + " is not JSON serializable")
      TypeError: set([]) is not JSON serializable
      >>> from decimal import Decimal
      >>> json.dumps(Decimal(1))
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
          return _default_encoder.encode(obj)
        File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
          chunks = self.iterencode(o, _one_shot=True)
        File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
          return _iterencode(o, 0)
        File "/usr/lib/python2.7/json/encoder.py", line 178, in default
          raise TypeError(repr(o) + " is not JSON serializable")
      TypeError: Decimal('1') is not JSON serializable
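
For reference, `json.dumps` accepts a `default` callable for exactly these cases. A minimal sketch for Python 3; `encode_extra` is a hypothetical helper, and mapping sets to sorted lists and decimals to strings is one choice among several:

```python
import json
from decimal import Decimal

def encode_extra(obj):
    """Fallback for types the json module rejects (hypothetical helper)."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)   # a set becomes a sorted JSON array
    if isinstance(obj, Decimal):
        return str(obj)      # a string keeps the exact decimal digits
    raise TypeError("%r is not JSON serializable" % (obj,))

text = json.dumps({"ids": {3, 1, 2}, "price": Decimal("1.10")}, default=encode_extra)
print(text)
```

The conversion is one-way: `json.loads` returns a list and a string here, so the set and the `Decimal` have to be rebuilt by hand. `sorted` also assumes the set's elements can be ordered.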

~~~
JasonFruit
What, precisely, is your point? The article points out JSON's limitations.

------
jboynyc
I like y_serial, a library to store compressed Python objects in a sqlite
database.

[http://yserial.sf.net/](http://yserial.sf.net/)

------
ris
> Given the downsides though, its worth writing the little bit of code
> necessary to convert your objects to a JSON-able form if your code is ever
> going to be used by people other than yourself.

Disagree strongly. You have _no idea_ how complex my graphs of (pretty damn
complex) python objects are and how much I value being able to change their
organization without having to rewrite the serialization code every time I do
so.

Don't fear pickle. Just don't expect it to be secure or amazingly fast.

------
Sami_Lehtinen
Pickle seems to be five times faster than standard json. If I store a lot of
data and want it to be compact, I'll use xz on top.
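
A sketch of the pickle-plus-xz combination, assuming Python 3, where the standard `lzma` module writes the .xz format by default. The sample data is made up:

```python
import lzma
import pickle

# Hypothetical sample data.
data = [{"id": i, "name": "item %d" % i} for i in range(10000)]

raw = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
packed = lzma.compress(raw)  # .xz container format by default

# Reading it back reverses the two steps.
restored = pickle.loads(lzma.decompress(packed))
print(len(raw), len(packed))
```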

------
tgb
In case anyone doesn't know, Python has had a json library since version 2.6.

~~~
kennywinker
That's not the issue. The issue is that python objects are not serializable
into json without writing a custom serializer. Rather than telling devs not to
use pickle, why not build a json serializer that can handle simple objects and
try to get it adopted into the python standard library?
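
For simple objects, something close to this can already be done with the `default` hook of `json.dumps`. A minimal sketch; the `Point` class and the `to_jsonable` helper are hypothetical:

```python
import json

class Point(object):
    """Hypothetical simple object with plain attributes."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

def to_jsonable(obj):
    """Fallback for json.dumps: use the instance's attribute dictionary."""
    try:
        return vars(obj)
    except TypeError:
        # vars() fails for objects without a __dict__, e.g. sets.
        raise TypeError("%r is not JSON serializable" % (obj,))

text = json.dumps({"origin": Point(0, 0), "path": [Point(1, 2)]}, default=to_jsonable)
print(text)
```

The class name is not recorded, so decoding yields plain dictionaries instead of `Point` instances.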

------
staticfish
I thought pickle output was just JSON? Shows what I know.

~~~
herge
If it was JSON, how would you pickle dictionaries with tuples as keys?

~~~
habitue
The same way you would in JavaScript: turn it into a string first, e.g.
"(1, 2)". It's not that serializing Python objects to JSON is impossible, just
that there is no straightforward implementation that everyone agrees is
correct.

For a fully general Python <-> JSON serializer/deserializer, you'd need a
bunch of extra annotations etc. that would make it look very weird compared
with what most people expect when they think of JSON.
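
As a rough illustration of what such annotations might look like, here is one hypothetical tagging scheme. The `__tuple__` and `__dict__` markers and the `tag`/`untag` helpers are invented for this sketch, not an agreed convention:

```python
import json

def tag(obj):
    """Rewrite tuples and dicts into tagged, JSON-safe structures."""
    if isinstance(obj, tuple):
        return {"__tuple__": [tag(item) for item in obj]}
    if isinstance(obj, list):
        return [tag(item) for item in obj]
    if isinstance(obj, dict):
        # Store items as pairs so keys can be any taggable value, not just strings.
        return {"__dict__": [[tag(k), tag(v)] for k, v in obj.items()]}
    return obj

def untag(obj):
    """Inverse of tag(), applied to the result of json.loads."""
    if isinstance(obj, list):
        return [untag(item) for item in obj]
    if isinstance(obj, dict):
        if "__tuple__" in obj:
            return tuple(untag(item) for item in obj["__tuple__"])
        return {untag(k): untag(v) for k, v in obj["__dict__"]}
    return obj

original = {(1, 2): "a", "plain": [(3, 4), 5]}
text = json.dumps(tag(original))
restored = untag(json.loads(text))
print(text)
```

The output is valid JSON and round-trips, but every dictionary is now wrapped in a marker object, which is the kind of weirdness described above.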

