Don't Pickle Your Data (benfrederickson.com)
59 points by behnamoh on Aug 11, 2022 | 76 comments


From the conclusion of the article:

> Pickle on the other hand is slow, insecure, and can be only parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects

ie, a bunch of drawbacks that don't really matter at all for the average home-made Python script, plus the "minor" advantage of being able to pickle literally anything and have it "just work".

None of the other options out there let you build a foolproof "save button" in 3 lines of code.
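For illustration, this is the whole "save button" with pickle (the app_state dict here is just a stand-in for whatever object graph a script holds):

    import pickle

    app_state = {"notes": ["draft"], "settings": {"theme": "dark"}}  # any picklable object graph

    with open("state.pkl", "wb") as f:   # "save"
        pickle.dump(app_state, f)

    with open("state.pkl", "rb") as f:   # "load"
        app_state = pickle.load(f)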


I'm sure that most Python developers who have worked with pickle for more than 3 lines of code can confirm that pickle does not, in fact, just work.


The real question is why doesn't Python have something like a class decorator `@json.interchangeable` that you can apply to a class -- maybe dataclasses only? -- to have JSON (de)serialization be only three lines of code (or less).
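For flat dataclasses, a hypothetical decorator like that is only a few lines on top of the stdlib (names made up here; nested dataclasses and custom types would need more work):

    import dataclasses, json

    def json_interchangeable(cls):
        # hypothetical decorator: bolt to_json/from_json onto a flat dataclass
        cls.to_json = lambda self: json.dumps(dataclasses.asdict(self))
        cls.from_json = classmethod(lambda c, s: c(**json.loads(s)))
        return cls

    @json_interchangeable
    @dataclasses.dataclass
    class Point:
        x: int
        y: int

    p = Point.from_json(Point(1, 2).to_json())   # Point(x=1, y=2)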


There is dataclasses-json which does this

https://pypi.org/project/dataclasses-json/


I implemented a shoddy version of this once (before accepting that pickles are worth it); it only took like 100 LOC for serialization and deserialization.


The problem is that "the average home-made Python script" frequently ends up turning into a critical production system.


true of any language


They forgot another major problem: You can only reliably unpickle data using the same (or same-enough) code that pickled it. If your class definitions have changed or moved around, unpickling can break.
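A tiny, self-contained way to see that failure mode (run as a script; deleting the class stands in for a refactor that renamed or moved it):

    import pickle

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    blob = pickle.dumps(Point(1, 2))

    # simulate a refactor that renamed or moved the class
    del globals()["Point"]

    pickle.loads(blob)   # AttributeError: Can't get attribute 'Point' on <module '__main__'>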


Reminds me of C's dumping structs to disk via memcpy


Pickle is far more flexible than that, though, since it's all symbol-based introspection. Adding new attributes etc doesn't break things.


I ran into a bug in production unpickling some builtins (dicts or sets or something) that were pickled in 2.2 and unpickled in 2.4 (or 2.4 -> 2.6, it's fuzzy).

Between those two versions, the exposed 'dunder' methods of whichever builtin changed, and this resulted in unpickled dicts being empty, IIRC.


If you’re hydrating objects from your JSON you hit the same thing so it’s not as much of a downside as you might think.

In fact this is what the benchmark does.


Should be (2014).

More interestingly, as much as numpy and everybody advises against it, I believe that pickling data into a zstd stream is one of the fastest ways of storing sets of large matrices.

The 'recommended' alternatives include numpy.save (uncompressed, which is bad when lz4 is faster than memcpy and you're saving to disk), numpy.savez (uncompressed zip files, even worse), numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's worst formats, and also using zlib), etc. I wish it wasn't the case, but it certainly seems like a good argument for pickle.
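For reference, the pickle-into-a-zstd-stream approach is only a few lines, assuming the third-party zstandard package (a sketch, not a benchmark):

    import pickle
    import numpy as np
    import zstandard as zstd   # pip install zstandard

    arr = np.random.rand(1000, 1000)

    # save: pickle the array, then compress the bytes with zstd
    with open("matrix.pkl.zst", "wb") as f:
        f.write(zstd.ZstdCompressor(level=3).compress(
            pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)))

    # load: decompress, then unpickle
    with open("matrix.pkl.zst", "rb") as f:
        restored = pickle.loads(zstd.ZstdDecompressor().decompress(f.read()))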


even though all the metadata is weird and overengineered, i would probably still use hdf5 as it provides for interop with other numerical computing environments (matlab, julia).

also hdf5 is at least securable. pickle streams are not designed for that. it's good to be able to send your data to others.

fwiw. matlab .mat files are hdf5 at their core.

i should also note that json is pretty bad for numerical data. the specification says nothing about how much precision to retain and printf/scanf is ridiculously slow for storing floats.


hdf5 is extremely slow, however; pickle+zstd is faster and results in smaller files.


I kind of feel like you are doing something wrong with HDF5, since for my use cases it's the fastest solution by far.


hdf5+zstd would likely be comparable.

good luck loading those pickle files 5y from now.


hdf5+zstd is not a thing (or at least not a thing that's interoperable or usable 5y from now). I just wish there was a good off-the-shelf solution, this stuff is not difficult.


yeah, just pointing out that there's nothing inherently wrong with hdf5 and that the gains you speak of are likely just from the use of modern compression standards.

maybe there's room for a simplified standard... or maybe just the addition of better compression to hdf5. (although they move slow for very good reason)


Why? I have successfully loaded pkl files that are much older than that.


> Should be (2014).

I was wondering why it didn’t mention Apache Arrow.


Why is mmaping out of the running?


No compression, and it's slower than just reading a file with ‘read’ if you plan to consume the whole thing.


Last time I checked (i.e. performed several benchmarks upon), parquet with Zstd was about the best way to store compressed data for really fast and small files.

Zstd is quite good, and is now (iirc) in the linux kernel.

People may have some issue with parquet being column based, which can make inserts a little slower for example, but for a large, mostly-set database it is a very good choice. A tsv.zst file could be another way to go as well. But like others, I really wish hdf5 had some of these compression features and wasn't so dang slow.
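For anyone curious, zstd-compressed parquet via pandas is roughly this (a sketch, assuming pyarrow is installed as the parquet engine):

    import numpy as np
    import pandas as pd   # pip install pandas pyarrow

    df = pd.DataFrame({"id": np.arange(1_000_000), "value": np.ones(1_000_000)})
    df.to_parquet("data.parquet", compression="zstd")   # zstd-compressed parquet columns
    df2 = pd.read_parquet("data.parquet")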


linux 5.15.0-25-generic on ubuntu 22.04 shows

    $lsmod | grep zstd  
    zstd_compress         229376  1 btrfs


> Pickle is slow

... Python is slow. But "slow" means "plenty fast" nowadays and the development speed advantage is immense.

> unpickling malicious data can cause security issues

Why would I do that?

I can't read the linked page because it seems to be down/the link is broken, so I don't know whether this includes user data that is present before pickling and then turns out to be an issue after pickling. Then I would worry, otherwise ... yeah, I'm not gonna unpickle random data.

> Just use JSON

How do I effortlessly restore objects including their methods from JSON?


> How do I effortlessly restore objects including their methods from JSON?

The recommendation from the title is usually made instead of something like "deserializing executable data is harmful". That is exactly the one question where the answer is "don't".

It's not exactly the unpickling process that is the problem. It's how you established that the data isn't malicious. It is very hard to use pickle without creating some local privilege escalation possibilities. And at the end of the process, you usually don't get any capability that replicating the code on both sides of the communication channel wouldn't give you.

(The problem isn't specific to Python either. There was a time when that kind of functionality was very hyped in both industry and academia. For example, Java also got something similar that they had to retract. The famous GNU Hurd OS (the one that would never finish) was supposed to do that at the system level.)


  Do you, Programmer,
  take this Object to be part of the persistent state of your application,
  to have and to hold,
  through maintenance and iterations,
  for past and future versions,
  as long as the application shall live?
Arturo Bejar, as quoted[1] in Mark Miller’s “Safe serialization under mutual suspicion”, which describes what it takes to make reasonable and compatible serialization restoring “all you can do is to send a message” objects.

(The Smalltalk school actually spent quite a bit of time on the upgrade problem, see e.g. Fuel[2] and its references, but it was after the industry took the object orientation shiny and ran away with it, so that work seems to be little-known outside it.)

[1] http://www.erights.org/data/serial/jhu-paper/upgrade.html

[2] http://wiki.squeak.org/squeak/6221


The Mozart/Oz people came up with pickle, I think.


One thing that's not mentioned is that pickled data is effectively fossilized once you've pickled it. If you want to change the layout of a class and have objects unpickle correctly, it can be an ordeal, as objects are unpickled by their class name, and you need both the original class and the new class to correctly unpickle and migrate.

If you instead selectively pick what you want to serialize about your data and keep the representations separate, you can change the internal model easily without having a huge impact on the serialized model.
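Something like this is usually enough (a sketch of that separation, not a framework):

    import json

    class Order:
        def __init__(self, items, total):
            self.items = items
            self.total = total
            self._cache = None          # internal detail, never serialized

        def to_dict(self):
            # explicit, versioned external representation
            return {"version": 1, "items": self.items, "total": self.total}

        @classmethod
        def from_dict(cls, d):
            # the internal layout can change freely as long as this mapping is updated
            return cls(d["items"], d["total"])

    blob = json.dumps(Order(["book"], 9.99).to_dict())
    order = Order.from_dict(json.loads(blob))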


The benchmark is bad, because after you load JSON you can't really use it directly. To use it you must check that lists are really lists, that objects are really objects and have the keys you think they should have, and so on.

The alternative is using something like typedload (which I wrote) or pydantic in addition to json.load, to avoid cluttering the code with the countless and error-prone checks one must do to use untrusted JSON.

In the end dealing with untrusted json directly is terrible.
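To make that concrete, a sketch with typedload (typedload.load is its main entry point; pydantic would look similar):

    import json
    from dataclasses import dataclass
    from typing import List

    import typedload   # pip install typedload

    @dataclass
    class User:
        name: str
        scores: List[int]

    raw = json.loads('{"name": "alice", "scores": [1, 2, 3]}')

    # validates shape and types in one call, raising on mismatch,
    # instead of scattering isinstance/key checks through the code
    user = typedload.load(raw, User)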


if you are dealing with untrusted data, pickle is not an option at all; it lacks security.


Isn't pickle just as bad if not worse? There's no guarantee that the unpickled object has the right shape either, AND it can arbitrarily execute code while unpickling.


Indeed, but if you say json is better and faster, you have to add all the costs, not just the parsing. Otherwise while json will not execute code, it will crash your program.

And while in python it's ok, in C++ you can still execute code in that way :)


> But "slow" means "plenty fast" nowadays

Not in my experience. "Slow" means "it seems fast enough now and I'm sure we'll have time to rewrite it in a fast language once it's grown to a monster that processes 1000 times the data it does now... right?".

> Why would I do that?

Because you are using someone else's code and make the fairly reasonable assumption that deserialising data doesn't cause arbitrary code execution... But of course it's all your fault because you didn't read their code to see that it's using Pickle!

> How do I effortlessly restore objects including their methods from JSON?

You don't. You shouldn't.


> You don't. You shouldn't.

There's a good reason the functionality exists: It's a breeze for quickly persisting some state without worrying about anything else. It's not pretty, it's not clean, but it just works (which is arguably a use case Python is very popular for).

A sibling comment pointed out that pickled data makes it annoying to deal with eventual schema changes. For sure. But so do other things that go hand in hand with quick-and-dirty approaches.

"Probably don't use pickle in production" is something I could get behind, but that would, of course, not be such an inflammatory title.


>> unpickling malicious data can cause security issues

> Why would I do that?

If you pickle data from an untrusted source, say a web form submission and then later unpickle it. See https://cwe.mitre.org/data/definitions/502.html
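The textbook demonstration of why that is dangerous fits in a few lines (the payload here is harmless, but it could be anything):

    import os
    import pickle

    class Evil:
        def __reduce__(self):
            # whatever callable and arguments are returned here run at unpickle time
            return (os.system, ("echo pwned",))

    payload = pickle.dumps(Evil())
    pickle.loads(payload)   # executes `echo pwned` -- arbitrary code execution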


> If you pickle data from an untrusted source . . . and then later unpickle it

That is not exactly right. The risk is when you unpickle data that was pickled by someone else or that was tampered with after you pickled it.


Look closer at the CWE and the linked examples: an attacker can construct an illegitimate, serialized object, like an auth token or session ID, that instantiates one of Python's subprocesses to execute arbitrary commands.


That quote supports my statement. Notice that the serialized object is the thing that was constructed by the attacker, not some user data that you serialized yourself.


No, the input was not serialized; it was carefully crafted so that when it gets serialized and deserialized, it triggers the malicious payload.



There's also the much faster cPickle. It may just be fast enough for your needs. If it isn't, then you start exploring other options.


The regular pickle module uses the C implementation ("cPickle") transparently, so it hasn't been worth mentioning separately since Python 3.x.

The article is 8 years old, so it kind of misses this detail.


That's good to know. I haven't worked professionally with Python for a couple of years now, and back then we were maintaining a Python 2 system.


That was included in the benchmarks.


Instantiating a new object of the class with the JSON as arguments is one way.

I’ve built a bunch of these systems; keeping your data separate solves a lot of future problems.
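For flat objects that first approach really is a one-liner (assuming the JSON keys line up with the constructor arguments):

    import json

    class User:
        def __init__(self, name, email):
            self.name = name
            self.email = email

    user = User(**json.loads('{"name": "alice", "email": "a@example.com"}'))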


Don't Assume Things About Others' Use Cases.

In cases where I'm doing some sort of interactive or exploratory data analysis with structures of complex Python objects and want to stash a copy of what I'm working with in case the next thing I do screws them up or, who knows, I lose power - being able to quickly pickle something and have an amount of confidence I'll be able to get it back in a sensible state is very useful.

I've also used it for debug dumps in experimental software so I have a chance of reproducing odd cases it comes across.


I made a simple library for just such a purpose if you're interested. You can wrap a whole module (like requests or pandas) and cache every function/coroutine result to disk. https://github.com/hmusgrave/ememo

I mainly use it for web scraping to be polite while I figure out the remote API, but I'm sure somebody could have another use.


Who is out there using pickle because they think it is a good idea? We use it because it is easy, built into the language, and handles datetime by default!


Good thing JSON is in the standard library now too.


People really don’t get what pickle does apparently. The on disk format (could be JSON) is irrelevant.

I guess you could say JSON is pickle but restricted to only primitive types. The hard part is deciding what a to_json and from_json should be for an arbitrary Python object. That pair of methods is the pickle part.

“just write custom serializers for all objects you want to store to and load from disk and then invent a tagging system so you can keep track of what classes they were” is a surprisingly valid solution but it’s still different.


Obviously one of those cases where everyone's mileage varies, but the default JSON serialization has covered 99% of everything I've ever serialized/deserialized in Python without needing to write a to_json/from_json by hand. Needing those for 1% of cases seems fine. But also maybe I lean into duck typing a whole lot more than you do at the serialization boundary and don't need a lot of specific class types coming from JSON as long as the data is all in the same shape. You may think that's "primitives obsession", and I may think that a giant class hierarchy of custom objects is a bit non-Pythonic. We're both right, which is why, again, your mileage may vary here.


It just means that you usually deal with data that has a natural tree structure. The moment it becomes a graph, hands-off JSON serialization is broken - but Pickle isn't.
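Easy to see directly:

    import json
    import pickle

    a = {"name": "a"}
    b = {"name": "b", "peer": a}
    a["peer"] = b                     # cycle: a -> b -> a

    pickle.loads(pickle.dumps(a))     # fine: pickle tracks object identity
    json.dumps(a)                     # ValueError: Circular reference detected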


At a possibly serialized boundary sure. That doesn't mean I don't work with graphs, though, just that I guess I don't expect graphs to be entirely reified in memory at all times if I'm going to duck type walk through them. We all have different habits, of course.


What are some alternatives to pickling which can handle cyclic references?

I've looked into ORMs but these are invasive in terms of needing to annotate your classes and fields.


There are several approaches to references in JSON. A common Python library I found via StackOverflow mentions is: https://pypi.org/project/jsonref/ (it supports automatic dereferencing at load time, but dumping references is still a slight challenge).


If you make your own C extensions then you certainly have to write code to be able to pickle your classes.

I did it once, don’t remember why, and it wasn’t that hard but I can imagine it would quickly get out of hand if you were changing class structures on a regular basis.

If you were just wrapping some library with simple C++ classes or something it also probably wouldn’t be that hard to automatically generate the pickling code.
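On the pure-Python side of such a wrapper, __getstate__/__setstate__ usually covers it (a sketch; the object() here stands in for an unpicklable C-level handle):

    import pickle

    class Wrapper:
        def __init__(self, config):
            self.config = config
            self._handle = object()      # stand-in for an unpicklable C-level resource

        def __getstate__(self):
            # drop the raw handle; keep only what is needed to rebuild it
            state = self.__dict__.copy()
            del state["_handle"]
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self._handle = object()      # rebuild the C-level resource from config

    w = pickle.loads(pickle.dumps(Wrapper({"mode": "fast"})))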


JSON really is a terrible serialization format. Even JavaScript can't safely deserialize JSON without silent data corruption. I've had to stringify numbers because of JavaScript, and there were no errors. Perhaps that's the fault of JavaScript, but I find the lack of encoding the numerical storage type to be a bug rather than a feature.


would love to see an example of the data corruption you're talking about


Sounds like they've been bitten by IEEE 754 floating point problems. JS only supports encoding numbers that are representable in 64-bit ("double precision") IEEE 754. Most JSON standards make the same assumption and define JSON numbers to match. (There's no lack of an "encoding" standard there; it just inherits JS's, which is double-precision IEEE 754.) Some JSON implementations in some languages don't follow this particular bit of JSON standardization and instead try to output numbers outside of the range representable by IEEE 754, but that's arguably much more an "implementation error" than an error in the standard.

This most commonly comes up when dealing with int64/"long" numbers towards the top or bottom of that range (given that the floating point layout needs space for the exponent).

There is no JSON standard for numbers outside of the range of double precision IEEE 754 floating point other than "just stringify it", even now that JS has a BigInt type that supports a much larger range. But "just stringify it" mostly works well enough.


The JSON "Number" standard is arbitrary precision decimal[1], though it does mention that implementations MAY limit the parsed value to fit within the allowable range of an IEEE 754 double-precision binary floating point value. JSON "Number"s can't encode all JS numbers, since they can't encode NANs and infinities.

[1] https://www.ecma-international.org/wp-content/uploads/ECMA-4... section 8.


The "dual" standard RFC 8259 [1] (both are normative standards under their respective bodies, ECMA and IETF) is also a useful comparison here. It's wording is a bit stronger than ECMA's, though not by much. ("Good interoperability" is its specific call out.)

It's also interesting that the proposed JSON 5 (standalone) specification [2] doesn't seem to address it at all (but does add back in the other IEEE 754 numbers that ECMA 404 and RFC 8259 exclude from JSON; +/-Infinity and +/-NaN). It both maintains that its numbers are "arbitrary precision" but also requires these few IEEE 754 features, which may be even more confusing than either ECMA 404 or RFC 8259.

[1] https://datatracker.ietf.org/doc/html/rfc8259#section-6

[2] https://spec.json5.org/#numbers


One example that's bitten me is that working with large integers is fraught with peril. If you can't be sure that your integer values can be exactly represented in an IEEE 754 double precision float and you might be exchanging data with a JavaScript implementation, mysterious truncations start to happen. If you've ever seen a JSON API and wondered why some integer values are encoded as strings rather than a native JSON number, that's why.
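The cutoff is 2**53; a quick way to see the truncation (shown in Python, but it's the same conversion JavaScript's JSON.parse does implicitly):

    big = 2**53 + 1            # 9007199254740993
    print(int(float(big)))     # 9007199254740992 -- silently off by one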


JavaScript will parse JSON numbers into its Number type, and numbers in JSON aren't limited in their precision or size.



The author is mostly correct, except about one thing: pickle can be read from Golang code.

I wrote a library for this years ago: https://github.com/hydrogen18/stalecucumber


I would rather use YAML than JSON, if only for the schema tags that let you customize the processor to load data into a custom data structure automatically. Saves time and lets you represent complex data structures in a minimal way. Use ruamel.yaml rather than PyYAML.
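Roughly like this, if I remember the ruamel.yaml API right (register_class is what drives the custom tags):

    import io
    from ruamel.yaml import YAML   # pip install ruamel.yaml

    class Point:
        yaml_tag = "!Point"
        def __init__(self, x, y):
            self.x, self.y = x, y

    yaml = YAML()
    yaml.register_class(Point)     # round-trips Point instances via the !Point tag

    buf = io.StringIO()
    yaml.dump(Point(1, 2), buf)
    restored = yaml.load(buf.getvalue())   # back to a Point instance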


Much the same can be leveled against Java's serialized objects. The OWASP top 10 from 2017 even had "Insecure Deserialization" at #8. The 2021 update[1] changes it to "Software and Data Integrity Failures", still at #8. It's CWE-502: Deserialization of Untrusted Data[2], where Python and Java are specifically mentioned.

1 https://owasp.org/www-project-top-ten/

2 https://cwe.mitre.org/data/definitions/502.html


Does anyone actually use the Java object serialization API for code written this side of 2010? Feels like a vestigial feature that hasn't made sense for a long time, to the point where we'd already discarded another bad serialization format (XML) before looking at JSON or YAML.


Anyone who uses Hibernate uses Java object serialization.


It is worth noting that this article was published in 2014. The accuracy of its comparisons may have changed in the past 8.5 years.


(2014)


I found unpickling a lot slower than json loading.


But then you have to check that the "list" is really a list, that the objects do have the keys, that the strings are strings.

This should be factored into the cost, and it wasn't in the benchmark.


I've found it to be much faster, with large amounts of data, like numpy arrays. And, some things aren't possible to convert to JSON, without writing a bunch of code to do the serialization/deserialization, which often makes things slow again.



