From the conclusion of the article:
>Pickle on the other hand is slow, insecure, and can be only parsed in Python. The only real advantage to pickle is that it can serialize arbitrary Python objects
i.e., a bunch of drawbacks that don't really matter at all for the average home-made Python script, plus the "minor" advantage of being able to pickle literally anything and have it "just work".
None of the other options out there let you build a foolproof "save button" in 3 lines of code.
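For reference, the "save button" really is just something like this (a minimal sketch; the filename and the object being saved are placeholders):

    import pickle

    app_state = {"scores": {1, 2, 3}, "history": [("move", 4)]}  # any Python object

    # "Save button": dump the whole object graph to disk.
    with open("state.pkl", "wb") as f:
        pickle.dump(app_state, f)

    # Restore it later -- no schema, no custom serializers.
    with open("state.pkl", "rb") as f:
        app_state = pickle.load(f)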
The real question is why Python doesn't have something like a class decorator `@json.interchangeable` that you can apply to a class -- maybe dataclasses only? -- to make JSON (de)serialization three lines of code (or less).
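Something close to that is already buildable on top of dataclasses; a rough sketch of what such a hypothetical decorator could look like (the name and behaviour are made up, and it only handles flat dataclasses):

    import json
    from dataclasses import dataclass, asdict

    def json_interchangeable(cls):
        # Hypothetical decorator: bolt to_json/from_json onto a dataclass.
        # (Flat fields only; nested objects would need recursion.)
        cls.to_json = lambda self: json.dumps(asdict(self))
        cls.from_json = classmethod(lambda c, s: c(**json.loads(s)))
        return cls

    @json_interchangeable
    @dataclass
    class Point:
        x: int
        y: int

    p = Point.from_json(Point(1, 2).to_json())  # Point(x=1, y=2)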
They forgot another major problem: You can only reliably unpickle data using the same (or same-enough) code that pickled it. If your class definitions have changed or moved around, unpickling can break.
I ran into a bug in production unpickling some builtins (dicts or sets or something) that were pickled in 2.2 and unpickled in 2.4 (or 2.4 -> 2.6, it's fuzzy).
Between those two versions, the exposed 'dunder' methods of whichever builtin changed, and this resulted in unpickled dicts being empty, IIRC.
More interestingly, as much as numpy and everybody advises against it, I believe that pickling data into a zstd stream is one of the fastest ways of storing sets of large matrices.
The 'recommended' alternatives include numpy.save (uncompressed, which is bad when lz4 is faster than memcpy and you're saving to disk), numpy.savez (uncompressed zip files, even worse), numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's worst formats, and also using zlib), etc. I wish it weren't the case, but it certainly seems like a good argument for pickle.
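Concretely, the pattern is roughly this (a sketch using the third-party zstandard package; the compression level, filenames, and arrays are arbitrary):

    import pickle
    import numpy as np
    import zstandard  # third-party: pip install zstandard

    matrices = {"weights": np.random.rand(1000, 1000),
                "bias": np.random.rand(1000)}

    # Pickle the whole dict of arrays, then compress the byte stream with zstd.
    raw = pickle.dumps(matrices, protocol=pickle.HIGHEST_PROTOCOL)
    with open("matrices.pkl.zst", "wb") as f:
        f.write(zstandard.ZstdCompressor(level=3).compress(raw))

    # Loading is the reverse: decompress, then unpickle (trusted files only!).
    with open("matrices.pkl.zst", "rb") as f:
        matrices = pickle.loads(zstandard.ZstdDecompressor().decompress(f.read()))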
even though all the metadata is weird and overengineered, i would probably still use hdf5 as it provides for interop with other numerical computing environments (matlab, julia).
also hdf5 is at least securable. pickle streams are not designed for that. it's good to be able to send your data to others.
fwiw. matlab .mat files are hdf5 at their core.
i should also note that json is pretty bad for numerical data. the specification says nothing about how much precision to retain and printf/scanf is ridiculously slow for storing floats.
hdf5+zstd is not a thing (or at least not a thing that's interoperable or usable 5y from now). I just wish there was a good off-the-shelf solution, this stuff is not difficult.
yeah, just pointing out that there's nothing inherently wrong with hdf5 and that the gains you speak of are likely just from the use of modern compression standards.
maybe there's room for a simplified standard... or maybe just the addition of better compression to hdf5. (although they move slow for very good reason)
Last time I checked (i.e. ran several benchmarks), parquet with Zstd was about the best way to store compressed data: really fast and really small files.
Zstd is quite good, and is now (iirc) in the linux kernel.
People may have some issue with parquet being column based, which can make inserts a little slower for example, but for a large mostly-set database it is a very good choice. A tsv.zst file could be another way to go as well.
But like others, I really wish hdf5 had some of these compression features and wasn't so dang slow.
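For reference, the parquet+zstd route is basically a one-liner with pandas (a sketch assuming the pyarrow engine is installed; the filename and data are placeholders):

    import pandas as pd  # needs pyarrow installed for the parquet engine

    df = pd.DataFrame({"id": range(1_000_000), "value": 3.14})

    # Columnar storage + zstd compression: small files, fast reads.
    df.to_parquet("data.parquet", compression="zstd")
    df2 = pd.read_parquet("data.parquet")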
... Python is slow. But "slow" means "plenty fast" nowadays and the development speed advantage is immense.
> unpickling malicious data can cause security issues
Why would I do that?
I can't read the linked page because it seems to be down/the link is broken, so I don't know whether this includes user data that is present before pickling and then turns out to be an issue after pickling. Then I would worry; otherwise ... yeah, I'm not gonna unpickle random data.
> Just use JSON
How do I effortlessly restore objects including their methods from JSON?
> How do I effortlessly restore objects including their methods from JSON?
The recommendation from the title is usually made instead of something like "deserializing executable data is harmful". That is exactly the one question where the answer is "don't".
It's not exactly the unpickling process that is the problem. It's how you established that the data isn't malicious. It is very hard to use pickle without creating some local privilege escalation possibilities. And at the end of the process, you usually don't get any capability that replicating the code on both sides of the communication channel wouldn't give you.
(The problem isn't specific to Python either. There was a time when that kind of functionality was very hyped in both industry and academia. For example, Java also got something similar that they had to retract. The famous GNU Hurd OS (the one that would never finish) was supposed to do that at the system level.)
Do you, Programmer,
take this Object to be part of the persistent state of your application,
to have and to hold,
through maintenance and iterations,
for past and future versions,
as long as the application shall live?
Arturo Bejar, as quoted[1] in Mark Miller’s “Safe serialization under mutual suspicion”, which describes what it takes to make reasonable and compatible serialization restoring “all you can do is to send a message” objects.
(The Smalltalk school actually spent quite a bit of time on the upgrade problem, see e.g. Fuel[2] and its references, but it was after the industry took the object orientation shiny and ran away with it, so that work seems to be little-known outside it.)
One thing that's not mentioned is that pickled data is effectively fossilized once you've pickled it. If you want to change the layout of a class and have objects unpickle correctly, it can be an ordeal, as objects are unpickled by their class name, and you need both the original class and the new class to correctly unpickle and migrate.
If you instead selectively pick what you want to serialize about your data and keep the representations separate, you can change the internal model easily without having a huge impact on the serialized model.
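If you do end up stuck with old pickles, the usual escape hatch is overriding pickle.Unpickler.find_class to redirect old class paths to the new ones (a sketch; the module and class names here are hypothetical):

    import pickle

    # Suppose the objects were pickled as myapp.models.User but the class
    # now lives in myapp.accounts.User (hypothetical names).
    import myapp.accounts

    class RenamingUnpickler(pickle.Unpickler):
        def find_class(self, module, name):
            # Redirect lookups for the old location to the new class.
            if (module, name) == ("myapp.models", "User"):
                return myapp.accounts.User
            return super().find_class(module, name)

    with open("old_users.pkl", "rb") as f:
        users = RenamingUnpickler(f).load()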
The benchmark is bad, because after you load JSON you can't really use it directly. To actually use it you must check that lists are really lists, that objects are really objects and have the keys you think they should have, and so on.
The alternative is using something like typedload (which I wrote) or pydantic on top of json.load, to avoid cluttering the code with the countless error-prone checks one must do to use untrusted JSON.
In the end dealing with untrusted json directly is terrible.
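What that looks like in practice is roughly this (a sketch; the dataclass is made up and the exact typedload/pydantic API details vary between versions):

    import json
    from dataclasses import dataclass
    from typing import List

    import typedload  # or pydantic; both validate the shape for you

    @dataclass
    class User:
        name: str
        scores: List[int]

    data = json.loads('[{"name": "a", "scores": [1, 2]}]')

    # Raises if a field is missing or has the wrong type, instead of
    # blowing up later deep inside the business logic.
    users = typedload.load(data, List[User])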
Isn't pickle just as bad if not worse? There's no guarantee that the unpickled object has the right shape either, AND it can arbitrarily execute code while unpickling.
Indeed, but if you say json is better and faster, you have to add all the costs, not just the parsing. Otherwise while json will not execute code, it will crash your program.
And while in python it's ok, in C++ you can still execute code in that way :)
Not in my experience. "Slow" means "it seems fast enough now and I'm sure we'll have time to rewrite it in a fast language once it's grown to a monster that processes 1000 times the data it does now... right?".
> Why would I do that?
Because you are using someone else's code and make the fairly reasonable assumption that deserialising data doesn't cause arbitrary code execution... But of course it's all your fault because you didn't read their code to see that it's using Pickle!
> How do I effortlessly restore objects including their methods from JSON?
There's a good reason the functionality exists: It's a breeze for quickly persisting some state without worrying about anything else. It's not pretty, it's not clean, but it just works (which is arguably a use case Python is very popular for).
A sibling comment pointed out that pickled data makes it annoying to deal with eventual schema changes. For sure. But so do other things that go hand in hand with quick-and-dirty approaches.
"Probably don't use pickle in production" is something I could get behind, but that would, of course, not be such an inflammatory title.
Look closer at the CWE and the linked examples: An attacker can construct an illegitimate, serialized object, like an auth token or session ID, that instantiates one of Python's subprocesses to execute arbitrary commands
That quote supports my statement. Notice that the serialized object is the thing that was constructed by the attacker, not some user data that you serialized yourself.
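For anyone wondering what "constructed by the attacker" means in practice, the classic illustration is only a few lines (a harmless echo stands in for the payload here):

    import pickle

    class Exploit:
        def __reduce__(self):
            import os
            # Whatever this returns gets called at unpickling time.
            return (os.system, ("echo pwned",))

    payload = pickle.dumps(Exploit())
    pickle.loads(payload)  # runs `echo pwned` on the victim's machine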
In cases where I'm doing some sort of interactive or exploratory data analysis with structures of complex Python objects, I want to stash a copy of what I'm working with in case the next thing I do screws them up or, who knows, I lose power. Being able to quickly pickle something and have some confidence I'll be able to get it back in a sensible state is very useful.
I've also used it for debug dumps in experimental software so I have a chance of reproducing odd cases it comes across.
I made a simple library for just such a purpose if you're interested. You can wrap a whole module (like requests or pandas) and cache every function/coroutine result to disk. https://github.com/hmusgrave/ememo
I mainly use it for web scraping to be polite while I figure out the remote API, but I'm sure somebody could have another use.
Who is out there using pickle because they think it is a good idea?
We use it because it is easy and builtin to the language and handles datetime by default!
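e.g. (a minimal comparison):

    import json
    import pickle
    from datetime import datetime

    now = datetime.now()

    restored = pickle.loads(pickle.dumps(now))  # round-trips as a datetime

    try:
        json.dumps(now)
    except TypeError as e:
        print(e)  # "Object of type datetime is not JSON serializable"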
People really don’t get what pickle does apparently. The on disk format (could be JSON) is irrelevant.
I guess you could say JSON is pickle but restricted to only primitive types. The hard part is deciding what a to_json and from_json should be for an arbitrary Python object. That pair of methods is the pickle part.
“just write custom serializers for all objects you want to store to and load from disk and then invent a tagging system so you can keep track of what classes they were” is a surprisingly valid solution but it’s still different.
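What that "tagging system" tends to look like in practice (a sketch; the __type__ key and the Point class are made up):

    import json

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    def to_json(obj):
        # Tag each custom object with its class name so the loader
        # knows what to reconstruct.
        if isinstance(obj, Point):
            return {"__type__": "Point", "x": obj.x, "y": obj.y}
        raise TypeError(f"Cannot serialize {type(obj)}")

    def from_json(d):
        if d.get("__type__") == "Point":
            return Point(d["x"], d["y"])
        return d

    s = json.dumps(Point(1, 2), default=to_json)
    p = json.loads(s, object_hook=from_json)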
Obviously one of those cases where everyone's mileage varies, but the default JSON serialization has covered 99% of everything I've ever serialized/deserialized in Python without needing to write a to_json/from_json by hand. Needing those for 1% of cases seems fine. But also maybe I lean into duck typing a whole lot more than you do at the serialization boundary and don't need a lot of specific class types coming from JSON as long as the data is all in the same shape. You may think that's "primitives obsession", and I may think that a giant class hierarchy of custom objects is a bit non-Pythonic. We're both right, which is why, again, your mileage may vary here.
It just means that you usually deal with data that has a natural tree structure. The moment it becomes a graph, hands-off JSON serialization is broken - but Pickle isn't.
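e.g. the smallest possible demonstration of that:

    import json
    import pickle

    node = {"name": "a", "next": None}
    node["next"] = node  # a cycle: the node points back to itself

    pickle.loads(pickle.dumps(node))  # fine, the cycle is preserved

    try:
        json.dumps(node)
    except ValueError as e:
        print(e)  # "Circular reference detected"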
At a possibly serialized boundary sure. That doesn't mean I don't work with graphs, though, just that I guess I don't expect graphs to be entirely reified in memory at all times if I'm going to duck type walk through them. We all have different habits, of course.
There are several approaches to references in JSON. A common Python library I found via StackOverflow mentions is: https://pypi.org/project/jsonref/ (it supports automatic dereferencing at load time, but dumping references is still a slight challenge).
If you make your own C extensions then you certainly have to write code to be able to pickle your classes.
I did it once, don’t remember why, and it wasn’t that hard but I can imagine it would quickly get out of hand if you were changing class structures on a regular basis.
If you were just wrapping some library with simple C++ classes or something it also probably wouldn’t be that hard to automatically generate the pickling code.
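For the simple-wrapper case, the usual hook is __reduce__ (or copyreg for types you can't modify); a sketch with a hypothetical wrapper class:

    import pickle

    class Matrix:
        """Hypothetical thin wrapper around a C++ matrix object."""
        def __init__(self, rows, cols, data):
            self.rows, self.cols, self.data = rows, cols, data

        def __reduce__(self):
            # Tell pickle how to rebuild the wrapper: call Matrix(...) again
            # with plain-Python arguments that pickle already understands.
            return (Matrix, (self.rows, self.cols, self.data))

    m = pickle.loads(pickle.dumps(Matrix(2, 2, [1, 2, 3, 4])))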
JSON really is a terrible serialization format. Even JavaScript can't safely deserialize JSON without silent data corruption. I've had to stringify numbers because of JavaScript, and there were no errors. Perhaps that's the fault of JavaScript, but I find the lack of encoding the numerical storage type to be a bug rather than a feature.
Sounds like they've been bitten by IEEE 754 floating point problems. JS only supports encoding numbers that are representable in 64-bit ("double precision") IEEE 754. Most JSON standards make the same assumption and define JSON numbers to match. (There's no lack of an "encoding" standard there; it just inherits JS's, which is double-precision IEEE 754.) Some JSON implementations in some languages don't follow this particular bit of JSON standardization and instead try to output numbers outside of the range representable by IEEE 754, but that's arguably much more an "implementation error" than an error in the standard.
This most commonly occurs when dealing with int64/"long" numbers towards the top or bottom of that range (given that the floating point layout needs space).
There is no JSON standard for numbers outside the range of double-precision IEEE 754 floating point other than "just stringify it", even now that JS has a BigInt type that supports a much larger range. But "just stringify it" mostly works well enough.
The JSON "Number" standard is arbitrary precision decimal[1], though it does mention that implementations MAY limit the parsed value to fit within the allowable range of an IEEE 754 double-precision binary floating point value. JSON "Number"s can't encode all JS numbers, since they can't encode NANs and infinities.
The "dual" standard RFC 8259 [1] (both are normative standards under their respective bodies, ECMA and IETF) is also a useful comparison here. It's wording is a bit stronger than ECMA's, though not by much. ("Good interoperability" is its specific call out.)
It's also interesting that the proposed JSON 5 (standalone) specification [2] doesn't seem to address it at all (but does add back in the other IEEE 754 numbers that ECMA 404 and RFC 8259 exclude from JSON; +/-Infinity and +/-NaN). It both maintains that its numbers are "arbitrary precision" but also requires these few IEEE 754 features, which may be even more confusing than either ECMA 404 or RFC 8259.
One example that's bitten me is that working with large integers is fraught with peril. If you can't be sure that your integer values can be exactly represented in an IEEE 754 double precision float and you might be exchanging data with a JavaScript implementation, mysterious truncations start to happen. If you've ever seen a JSON API and wondered why some integer values are encoded as strings rather than a native JSON number, that's why.
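The cliff is easy to see from Python's side (a minimal demonstration):

    import json

    big = 2 ** 53 + 1  # one past JavaScript's Number.MAX_SAFE_INTEGER

    print(json.loads(json.dumps(big)) == big)  # True: Python's json keeps it an int
    print(int(float(big)) == big)              # False: forced through a double, it rounds to 2**53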
I would rather use YAML than JSON, if only for the schema tags that let you customize the processor to load data into a custom data structure automatically. Saves time and lets you represent complex data structures in a minimal way. Use ruamel.yaml rather than PyYAML.
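A sketch of what that looks like with ruamel.yaml's register_class helper (the Monster class is made up; details may differ between versions):

    import io
    from ruamel.yaml import YAML

    class Monster:
        def __init__(self, name, hp):
            self.name, self.hp = name, hp

    yaml = YAML()
    yaml.register_class(Monster)  # dumps/loads with a !Monster tag

    buf = io.StringIO()
    yaml.dump(Monster("orc", 12), buf)
    restored = yaml.load(buf.getvalue())  # a Monster instance again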
Much the same can be leveled against Java's serialized objects. The OWASP top 10 from 2017 even had "Insecure Deserialization" at #8. The 2021 update[1] changes it to "Software and Data Integrity Failures", still at #8. It's CWE-502: Deserialization of Untrusted Data[2], where Python and Java are specifically mentioned.
Does anyone actually use the Java object serialization API for code written this side of 2010? Feels like a vestigial feature that's not made sense for a long time, like to the point where we've already discarded another bad serialization format (XML) before looking at JSON or YAML now.
I've found it to be much faster, with large amounts of data, like numpy arrays. And, some things aren't possible to convert to JSON, without writing a bunch of code to do the serialization/deserialization, which often makes things slow again.