Don't Pickle Your Data (benfrederickson.com)
59 points by behnamoh on Aug 11, 2022 | 76 comments


From the conclusion of the article:

> Pickle on the other hand is slow, insecure, and can be only parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects

ie, a bunch of drawbacks that don't really matter at all for the average home-made Python script, plus the "minor" advantage of being able to pickle literally anything and have it "just work".

None of the other options out there let you build a foolproof "save button" in 3 lines of code.
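For illustration, this is the whole "save button" with pickle (the app_state dict here is just a stand-in for whatever object graph a script holds):

    import pickle

    app_state = {"notes": ["draft"], "settings": {"theme": "dark"}}  # any picklable object graph

    with open("state.pkl", "wb") as f:   # "save"
        pickle.dump(app_state, f)

    with open("state.pkl", "rb") as f:   # "load"
        app_state = pickle.load(f)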


I'm sure that most Python developers who have worked with pickle for more than 3 lines of code can confirm that pickle does not, in fact, just work.


The real question is why doesn't Python have something like a class decorator `@json.interchangeable` that you can apply to a class -- maybe dataclasses only? -- to have JSON (de)serialization be only three lines of code (or less).
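For flat dataclasses, a hypothetical decorator like that is only a few lines on top of the stdlib (names made up here; nested dataclasses and custom types would need more work):

    import dataclasses, json

    def json_interchangeable(cls):
        # hypothetical decorator: bolt to_json/from_json onto a flat dataclass
        cls.to_json = lambda self: json.dumps(dataclasses.asdict(self))
        cls.from_json = classmethod(lambda c, s: c(**json.loads(s)))
        return cls

    @json_interchangeable
    @dataclasses.dataclass
    class Point:
        x: int
        y: int

    p = Point.from_json(Point(1, 2).to_json())   # Point(x=1, y=2)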


There is dataclasses-json which does this

https://pypi.org/project/dataclasses-json/


I implemented a shoddy version of this once (before accepting that pickles are worth it); it only took like 100 LOC for serialization and deserialization.


The problem is that "the average home-made Python script" frequently ends up turning into a critical production system.


true of any language


They forgot another major problem: You can only reliably unpickle data using the same (or same-enough) code that pickled it. If your class definitions have changed or moved around, unpickling can break.
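A tiny, self-contained way to see that failure mode (run as a script; deleting the class stands in for a refactor that renamed or moved it):

    import pickle

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    blob = pickle.dumps(Point(1, 2))

    # simulate a refactor that renamed or moved the class
    del globals()["Point"]

    pickle.loads(blob)   # AttributeError: Can't get attribute 'Point' on <module '__main__'>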


Reminds me of C's dumping structs to disk via memcpy


Pickle is far more flexible than that, though, since it's all symbol-based introspection. Adding new attributes etc doesn't break things.


I ran into a bug in production unpickling some builtins (dicts or sets or something) that were pickled in 2.2 and unpickled in 2.4 (or 2.4 -> 2.6, it's fuzzy).

Between those two versions, the exposed 'dunder' methods of whichever builtin changed, and this resulted in unpickled dicts being empty, IIRC.


If you’re hydrating objects from your JSON you hit the same thing so it’s not as much of a downside as you might think.

In fact this is what the benchmark does.


Should be (2014).

More interestingly, as much as numpy and everybody advises against it, I believe that pickling data into a zstd stream is one of the fastest ways of storing sets of large matrices.

The 'recommended' alternatives include numpy.save (uncompressed, which is bad when lz4 is faster than memcpy and you're saving to disk), numpy.savez (uncompressed zip files, even worse), numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's worst formats, and also using zlib), etc. I wish it wasn't the case, but it certainly seems like a good argument for pickle.
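For reference, the pickle-into-a-zstd-stream approach is only a few lines, assuming the third-party zstandard package (a sketch, not a benchmark):

    import pickle
    import numpy as np
    import zstandard as zstd   # pip install zstandard

    arr = np.random.rand(1000, 1000)

    # save: pickle the array, then compress the bytes with zstd
    with open("matrix.pkl.zst", "wb") as f:
        f.write(zstd.ZstdCompressor(level=3).compress(
            pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)))

    # load: decompress, then unpickle
    with open("matrix.pkl.zst", "rb") as f:
        restored = pickle.loads(zstd.ZstdDecompressor().decompress(f.read()))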


even though all the metadata is weird and overengineered, i would probably still use hdf5 as it provides for interop with other numerical computing environments (matlab, julia).

also hdf5 is at least securable. pickle streams are not designed for that. it's good to be able to send your data to others.

fwiw. matlab .mat files are hdf5 at their core.

i should also note that json is pretty bad for numerical data. the specification says nothing about how much precision to retain and printf/scanf is ridiculously slow for storing floats.


hdf5 is extremely slow, however; pickle+zstd is faster and results in smaller files.


I kind of feel like you are doing something wrong with HDF5, since for my use cases it's the fastest solution by far.


hdf5+zstd would likely be comparable.

good luck loading those pickle files 5y from now.


hdf5+zstd is not a thing (or at least not a thing that's interoperable or usable 5y from now). I just wish there was a good off-the-shelf solution, this stuff is not difficult.


yeah, just pointing out that there's nothing inherently wrong with hdf5 and that the gains you speak of are likely just from the use of modern compression standards.

maybe there's room for a simplified standard... or maybe just the addition of better compression to hdf5. (although they move slow for very good reason)


Why? I have successfully loaded pkl files that are much older than that.


> Should be (2014).

I was wondering why it didn’t mention Apache Arrow.


Why is mmaping out of the running?


No compression, and it's slower than just reading a file with ‘read’ if you plan to consume the whole thing.


Last time I checked (i.e. performed several benchmarks upon), parquet with Zstd was about the best way to store compressed data for really fast and small files.

Zstd is quite good, and is now (iirc) in the linux kernel.

People may have some issue with parquet being column based, which can make inserts a little slower for example, but for a large, mostly-set database it is a very good choice. A tsv.zst file could be another way to go as well. But like others, I really wish hdf5 had some of these compression features and wasn't so dang slow.
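For anyone curious, zstd-compressed parquet via pandas is roughly this (a sketch, assuming pyarrow is installed as the parquet engine):

    import numpy as np
    import pandas as pd   # pip install pandas pyarrow

    df = pd.DataFrame({"id": np.arange(1_000_000), "value": np.ones(1_000_000)})
    df.to_parquet("data.parquet", compression="zstd")   # zstd-compressed parquet columns
    df2 = pd.read_parquet("data.parquet")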


linux 5.15.0-25-generic on ubuntu 22.04 shows

    $lsmod | grep zstd  
    zstd_compress         229376  1 btrfs


> Pickle is slow

... Python is slow. But "slow" means "plenty fast" nowadays and the development speed advantage is immense.

> unpickling malicious data can cause security issues

Why would I do that?

I can't read the linked page because it seems to be down/the link is broken, so I don't know whether this includes user data that is present before pickling and then turns out to be an issue after pickling. Then I would worry, otherwise ... yeah, I'm not gonna unpickle random data.

> Just use JSON

How do I effortlessly restore objects including their methods from JSON?


> How do I effortlessly restore objects including their methods from JSON?

The recommendation from the title is usually made instead of something like "deserializing executable data is harmful". That is exactly the one question where the answer is "don't".

It's not exactly the unpickling process that is the problem. It's how you established that the data isn't malicious. It is very hard to use pickle without creating some local privilege escalation possibilities. And at the end of the process, you usually don't get any capability that replicating the code on both sides of the communication channel wouldn't give you.

(The problem isn't specific to Python either. There was a time when that kind of functionality was very hyped in both industry and academia. For example, Java also got something similar that they had to retract. The famous GNU Hurd OS (the one that would never finish) was supposed to do that at the system level.)


  Do you, Programmer,
  take this Object to be part of the persistent state of your application,
  to have and to hold,
  through maintenance and iterations,
  for past and future versions,
  as long as the application shall live?
Arturo Bejar, as quoted[1] in Mark Miller’s “Safe serialization under mutual suspicion”, which describes what it takes to make reasonable and compatible serialization restoring “all you can do is to send a message” objects.

(The Smalltalk school actually spent quite a bit of time on the upgrade problem, see e.g. Fuel[2] and its references, but it was after the industry took the object orientation shiny and ran away with it, so that work seems to be little-known outside it.)

[1] http://www.erights.org/data/serial/jhu-paper/upgrade.html

[2] http://wiki.squeak.org/squeak/6221


The Mozart/Oz people came up with pickle, I think.


One thing that's not mentioned is that pickled data is effectively fossilized once you've pickled it. If you want to change the layout of a class and have objects unpickle correctly, it can be an ordeal, as objects are unpickled by their class name, and you need both the original class and the new class to correctly unpickle and migrate.

If you instead selectively pick what you want to serialize about your data and keep the representations separate, you can change the internal model easily without having a huge impact on the serialized model.
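Something like this is usually enough (a sketch of that separation, not a framework):

    import json

    class Order:
        def __init__(self, items, total):
            self.items = items
            self.total = total
            self._cache = None          # internal detail, never serialized

        def to_dict(self):
            # explicit, versioned external representation
            return {"version": 1, "items": self.items, "total": self.total}

        @classmethod
        def from_dict(cls, d):
            # the internal layout can change freely as long as this mapping is updated
            return cls(d["items"], d["total"])

    blob = json.dumps(Order(["book"], 9.99).to_dict())
    order = Order.from_dict(json.loads(blob))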


The benchmark is bad, because after you load JSON you can't really use it directly. To use it you must check that lists are really lists, that objects are really objects and have the keys you think they should have, and so on.

The alternative is using something like typedload (which I wrote) or pydantic in addition to json.load, to avoid cluttering the code with the countless and error-prone checks one must do to use untrusted JSON.

In the end dealing with untrusted json directly is terrible.
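To make that concrete, a sketch with typedload (typedload.load is its main entry point; pydantic would look similar):

    import json
    from dataclasses import dataclass
    from typing import List

    import typedload   # pip install typedload

    @dataclass
    class User:
        name: str
        scores: List[int]

    raw = json.loads('{"name": "alice", "scores": [1, 2, 3]}')

    # validates shape and types in one call, raising on mismatch,
    # instead of scattering isinstance/key checks through the code
    user = typedload.load(raw, User)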


if you are dealing with untrusted data, pickle is not an option at all; it lacks security.


Isn't pickle just as bad if not worse? There's no guarantee that the unpickled object has the right shape either, AND it can arbitrarily execute code while unpickling.


Indeed, but if you say json is better and faster, you have to add all the costs, not just the parsing. Otherwise while json will not execute code, it will crash your program.

And while in python it's ok, in C++ you can still execute code in that way :)


> But "slow" means "plenty fast" nowadays

Not in my experience. "Slow" means "it seems fast enough now and I'm sure we'll have time to rewrite it in a fast language once it's grown to a monster that processes 1000 times the data it does now... right?".

> Why would I do that?

Because you are using someone else's code and make the fairly reasonable assumption that deserialising data doesn't cause arbitrary code execution... But of course it's all your fault because you didn't read their code to see that it's using Pickle!

> How do I effortlessly restore objects including their methods from JSON?

You don't. You shouldn't.


> You don't. You shouldn't.

There's a good reason the functionality exists: It's a breeze for quickly persisting some state without worrying about anything else. It's not pretty, it's not clean, but it just works (which is arguably a use case Python is very popular for).

A sibling comment pointed out that pickled data makes it annoying to deal with eventual schema changes. For sure. But so do other things that go hand in hand with quick-and-dirty approaches.

"Probably don't use pickle in production" is something I could get behind, but that would, of course, not be such an inflammatory title.


>> unpickling malicious data can cause security issues

> Why would I do that?

If you pickle data from an untrusted source, say a web form submission and then later unpickle it. See https://cwe.mitre.org/data/definitions/502.html
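The textbook demonstration of why that is dangerous fits in a few lines (the payload here is harmless, but it could be anything):

    import os
    import pickle

    class Evil:
        def __reduce__(self):
            # whatever callable and arguments are returned here run at unpickle time
            return (os.system, ("echo pwned",))

    payload = pickle.dumps(Evil())
    pickle.loads(payload)   # executes `echo pwned` -- arbitrary code execution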


> If you pickle data from an untrusted source . . . and then later unpickle it

That is not exactly right. The risk is when you unpickle data that was pickled by someone else or that was tampered with after you pickled it.


Look closer at the CWE and the linked examples: an attacker can construct an illegitimate, serialized object, like an auth token or session ID, that instantiates one of Python's subprocesses to execute arbitrary commands.


That quote supports my statement. Notice that the serialized object is the thing that was constructed by the attacker, not some user data that you serialized yourself.


No, the input was not serialized; it was carefully crafted so that when it gets serialized and deserialized, it triggers the malicious payload.



There's also the much faster cPickle. It may just be fast enough for your needs. If it isn't, then you start exploring other options.


The regular pickle module uses the C implementation ("cPickle") transparently, so it hasn't been worth mentioning separately since Python 3.x.

The article is 8 years old, so it kind of misses this detail.


That's good to know. I haven't worked professionally with Python for a couple of years now, and back then we were maintaining a Python 2 system.


That was included in the benchmarks.


Instantiating a new object of the class with the JSON as arguments is one way.

I’ve built a bunch of these systems; keeping your data separate solves a lot of future problems.
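For flat objects that first approach really is a one-liner (assuming the JSON keys line up with the constructor arguments):

    import json

    class User:
        def __init__(self, name, email):
            self.name = name
            self.email = email

    user = User(**json.loads('{"name": "alice", "email": "a@example.com"}'))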


Don't Assume Things About Others' Use Cases.

In cases where I'm doing some sort of interactive or exploratory data analysis with structures of complex Python objects and want to stash a copy of what I'm working with in case the next thing I do screws them up or, who knows, I lose power - being able to quickly pickle something and have an amount of confidence I'll be able to get it back in a sensible state is very useful.

I've also used it for debug dumps in experimental software so I have a chance of reproducing odd cases it comes across.


I made a simple library for just such a purpose if you're interested. You can wrap a whole module (like requests or pandas) and cache every function/coroutine result to disk. https://github.com/hmusgrave/ememo

I mainly use it for web scraping to be polite while I figure out the remote API, but I'm sure somebody could have another use.


Who is out there using pickle because they think it is a good idea? We use it because it is easy, built into the language, and handles datetime by default!


Good thing JSON is in the standard library now too.


People really don’t get what pickle does apparently. The on disk format (could be JSON) is irrelevant.

I guess you could say JSON is pickle but restricted to only primitive types. The hard part is deciding what a to_json and from_json should be for an arbitrary Python object. That pair of methods is the pickle part.

“just write custom serializers for all objects you want to store to and load from disk and then invent a tagging system so you can keep track of what classes they were” is a surprisingly valid solution but it’s still different.


Obviously one of those cases where everyone's mileage varies, but the default JSON serialization has covered 99% of everything I've ever serialized/deserialized in Python without needing to write a to_json/from_json by hand. Needing those for 1% of cases seems fine. But also maybe I lean into duck typing a whole lot more than you do at the serialization boundary and don't need a lot of specific class types coming from JSON as long as the data is all in the same shape. You may think that's "primitives obsession", and I may think that a giant class hierarchy of custom objects is a bit non-Pythonic. We're both right, which is why, again, your mileage may vary here.


It just means that you usually deal with data that has a natural tree structure. The moment it becomes a graph, hands-off JSON serialization is broken - but Pickle isn't.
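Easy to see directly:

    import json
    import pickle

    a = {"name": "a"}
    b = {"name": "b", "peer": a}
    a["peer"] = b                     # cycle: a -> b -> a

    pickle.loads(pickle.dumps(a))     # fine: pickle tracks object identity
    json.dumps(a)                     # ValueError: Circular reference detected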


At a possibly serialized boundary sure. That doesn't mean I don't work with graphs, though, just that I guess I don't expect graphs to be entirely reified in memory at all times if I'm going to duck type walk through them. We all have different habits, of course.


What are some alternatives to pickling which can handle cyclic references?

I've looked into ORMs but these are invasive in terms of needing to annotate your classes and fields.


There are several approaches to references in JSON. A common Python library I found via StackOverflow mentions is: https://pypi.org/project/jsonref/ (it supports automatic dereferencing at load time, but dumping references is still a slight challenge).


If you make your own C extensions then you certainly have to write code to be able to pickle your classes.

I did it once, don’t remember why, and it wasn’t that hard but I can imagine it would quickly get out of hand if you were changing class structures on a regular basis.

If you were just wrapping some library with simple C++ classes or something it also probably wouldn’t be that hard to automatically generate the pickling code.
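On the pure-Python side of such a wrapper, __getstate__/__setstate__ usually covers it (a sketch; the object() here stands in for an unpicklable C-level handle):

    import pickle

    class Wrapper:
        def __init__(self, config):
            self.config = config
            self._handle = object()      # stand-in for an unpicklable C-level resource

        def __getstate__(self):
            # drop the raw handle; keep only what is needed to rebuild it
            state = self.__dict__.copy()
            del state["_handle"]
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self._handle = object()      # rebuild the C-level resource from config

    w = pickle.loads(pickle.dumps(Wrapper({"mode": "fast"})))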


JSON really is a terrible serialization format. Even JavaScript can't safely deserialize JSON without silent data corruption. I've had to stringify numbers because of JavaScript, and there were no errors. Perhaps that's the fault of JavaScript, but I find the lack of encoding the numerical storage type to be a bug rather than a feature.


would love to see an example of the data corruption you're talking about


Sounds like they've been bitten by IEEE 754 floating point problems. JS only supports encoding numbers that are representable in 64-bit ("double precision") IEEE 754. Most JSON standards make the same assumption and define JSON numbers to match. (There's no lack of an "encoding" standard there; it just inherits JS's, which is double-precision IEEE 754.) Some JSON implementations in some languages don't follow this particular bit of JSON standardization and instead try to output numbers outside of the range representable by IEEE 754, but that's arguably much more an "implementation error" than an error in the standard.

This most commonly comes up when dealing with int64/"long" numbers towards the top or bottom of that range (given that the floating point layout needs space for the exponent).

There is no JSON standard for numbers outside of the range of double precision IEEE 754 floating point other than "just stringify it", even now that JS has a BigInt type that supports a much larger range. But "just stringify it" mostly works well enough.


The JSON "Number" standard is arbitrary precision decimal[1], though it does mention that implementations MAY limit the parsed value to fit within the allowable range of an IEEE 754 double-precision binary floating point value. JSON "Number"s can't encode all JS numbers, since they can't encode NANs and infinities.

[1] https://www.ecma-international.org/wp-content/uploads/ECMA-4... section 8.


The "dual" standard RFC 8259 [1] (both are normative standards under their respective bodies, ECMA and IETF) is also a useful comparison here. It's wording is a bit stronger than ECMA's, though not by much. ("Good interoperability" is its specific call out.)

It's also interesting that the proposed JSON 5 (standalone) specification [2] doesn't seem to address it at all (but does add back in the other IEEE 754 numbers that ECMA 404 and RFC 8259 exclude from JSON; +/-Infinity and +/-NaN). It both maintains that its numbers are "arbitrary precision" but also requires these few IEEE 754 features, which may be even more confusing than either ECMA 404 or RFC 8259.

[1] https://datatracker.ietf.org/doc/html/rfc8259#section-6

[2] https://spec.json5.org/#numbers


One example that's bitten me is that working with large integers is fraught with peril. If you can't be sure that your integer values can be exactly represented in an IEEE 754 double precision float and you might be exchanging data with a JavaScript implementation, mysterious truncations start to happen. If you've ever seen a JSON API and wondered why some integer values are encoded as strings rather than a native JSON number, that's why.
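The cutoff is 2**53; a quick way to see the truncation (shown in Python, but it's the same conversion JavaScript's JSON.parse does implicitly):

    big = 2**53 + 1            # 9007199254740993
    print(int(float(big)))     # 9007199254740992 -- silently off by one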


JavaScript will parse JSON numbers into its Number type, and numbers in JSON aren't limited in their precision or size.



The author is mostly correct, except about one thing: pickle can be read from Golang code.

I wrote a library for this years ago: https://github.com/hydrogen18/stalecucumber


I would rather use YAML than JSON, if only for the schema tags that let you customize the processor to load data into a custom data structure automatically. Saves time and lets you represent complex data structures in a minimal way. Use ruamel.yaml rather than PyYAML.
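Roughly like this, if I remember the ruamel.yaml API right (register_class is what drives the custom tags):

    import io
    from ruamel.yaml import YAML   # pip install ruamel.yaml

    class Point:
        yaml_tag = "!Point"
        def __init__(self, x, y):
            self.x, self.y = x, y

    yaml = YAML()
    yaml.register_class(Point)     # round-trips Point instances via the !Point tag

    buf = io.StringIO()
    yaml.dump(Point(1, 2), buf)
    restored = yaml.load(buf.getvalue())   # back to a Point instance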


Much the same can be leveled against Java's serialized objects. The OWASP top 10 from 2017 even had "Insecure Deserialization" at #8. The 2021 update[1] changes it to "Software and Data Integrity Failures", still at #8. It's CWE-502: Deserialization of Untrusted Data[2], where Python and Java are specifically mentioned.

1 https://owasp.org/www-project-top-ten/

2 https://cwe.mitre.org/data/definitions/502.html


Does anyone actually use the Java object serialization API for code written this side of 2010? Feels like a vestigial feature that hasn't made sense for a long time, to the point where we'd already discarded another bad serialization format (XML) before looking at JSON or YAML.


Anyone who uses Hibernate uses Java object serialization.


It is worth noting that this article was published in 2014. The accuracy of its comparisons may have changed in the past 8.5 years.


(2014)


I found unpickling a lot slower than json loading.


But then you have to check that the "list" is really a list, that the objects do have the keys, that the strings are strings.

This should be factored into the cost, and it wasn't in the benchmark.


I've found it to be much faster, with large amounts of data, like numpy arrays. And, some things aren't possible to convert to JSON, without writing a bunch of code to do the serialization/deserialization, which often makes things slow again.



