AI Supply Chain Attack: How Malicious Pickle Files Backdoor Models (jchandra.com)
4 points by jchandra 6 days ago | 7 comments





From "Insecurity and Python Pickles" (2024) https://news.ycombinator.com/item?id=39685128 :

> There should be a data-only pickle serialization protocol (that won't serialize or deserialize code).

> How much work would it be to create a pickle protocol that does not exec or eval code?

"Title: Pickle protocol version 6: skipcode pickles" https://discuss.python.org/t/create-a-new-pickle-protocol-ve...


I have to agree with Chris Angelico there:

> Then the obvious question is: Why? Why use pickle? The most likely answer is “because <X> can’t represent what I need to transmit”, but for that to be at all useful to your proposal, you need to show examples that won’t work in well-known safe serializers.


Code in packages should be signed.

Code in pickles should also be signed.
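A minimal sketch of what signing a pickle could look like, using an HMAC from the standard library as a stand-in for a real public-key signature. The helper names (dumps_signed, loads_signed) and the tag-then-payload layout are assumptions for illustration only; the point is that the tag is checked before pickle.loads ever sees the bytes.

```python
import hashlib
import hmac
import pickle

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag (hypothetical layout)."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob: bytes, key: bytes):
    """Verify the tag first; only unpickle if it matches."""
    tag, payload = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

This only proves the bytes came from someone holding the key. It does nothing about a malicious pickle produced by a trusted but compromised signer.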

I have no need for the pickle module now, but years ago I thought there might be a safer way to read data that was already in pickles.

For backwards compatibility, skipcode=False would have to be the default, were someone to implement a pickle str parser that doesn't eval code.
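The closest thing available today, as far as I know, is subclassing pickle.Unpickler and overriding find_class, which the pickle docs describe under restricting globals. A rough sketch follows; the class name DataOnlyUnpickler and the helper data_only_loads are made up here, and this is not the proposed skipcode protocol, just an approximation that refuses every global lookup.

```python
import io
import pickle


class DataOnlyUnpickler(pickle.Unpickler):
    """Hypothetical helper: reject every global/class lookup so that only
    built-in scalars and containers (dict, list, tuple, str, ...) load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def data_only_loads(data: bytes):
    return DataOnlyUnpickler(io.BytesIO(data)).load()


class Evil:
    def __reduce__(self):
        # With plain pickle.loads this would call print(...) at load time.
        return (print, ("code ran during unpickling",))
```

With this, data_only_loads(pickle.dumps({"a": [1, 2]})) returns the dict, while data_only_loads(pickle.dumps(Evil())) raises pickle.UnpicklingError because the payload needs to look up builtins.print. It also rejects every custom class, which is the trade-off being discussed.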

JS/ES/TS Map doesn't map to JSON.


Pickle is still good for custom objects (JSON loses methods and also order), graphs and circular refs (JSON breaks), and functions and lambdas (essential for ML and distributed systems), and it is provided out of the box.

We're contemplating protocols that don't evaluate or run code; that rules out serializing functions or lambdas (i.e., code).
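For what it's worth, the stdlib pickle module does not serialize function bodies in the first place: functions are stored by qualified name, and lambdas are refused. A small check, where can_pickle is a hypothetical helper name:

```python
import pickle


def can_pickle(obj) -> bool:
    """Return False if pickle refuses the object outright."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return False
    return True


# A built-in function is pickled as a reference: the module and attribute
# names are stored, not any bytecode.
by_reference = pickle.dumps(len)
```

Here b"len" appears inside by_reference, and can_pickle(lambda: 1) is False. Shipping actual code needs a third-party serializer layered on top of pickle.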

Custom objects in Python don't have "order" unless they're using `__slots__` - in which case the application already knows what they are from its own class definition. Similarly, methods don't need to be serialized.

A general graph is isomorphic to a sequence of nodes plus a sequence of edge definitions. You only need your own lightweight protocol on top.
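A sketch of that kind of lightweight protocol, assuming node labels are unique strings and edges are directed pairs. The function names and the {"nodes": ..., "edges": ...} layout are invented for this example. Cycles are not a problem because edges refer to nodes by index rather than by nesting.

```python
import json


def graph_to_json(nodes, edges) -> str:
    """Encode a directed graph as a node list plus index pairs."""
    index = {name: i for i, name in enumerate(nodes)}
    return json.dumps({
        "nodes": list(nodes),
        "edges": [[index[src], index[dst]] for src, dst in edges],
    })


def graph_from_json(text: str) -> dict:
    """Decode into an adjacency mapping: node -> list of successors."""
    doc = json.loads(text)
    nodes = doc["nodes"]
    adjacency = {name: [] for name in nodes}
    for src, dst in doc["edges"]:
        adjacency[nodes[src]].append(nodes[dst])
    return adjacency
```

A three-node cycle a -> b -> c -> a round-trips through plain JSON this way without any circular-reference error.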


Because globals(), locals(), classes, and class instances are backed by dicts, and dicts are insertion-ordered in CPython since 3.6 (and in the Python spec since 3.7), object attributes are effectively ordered in Python.

Object instances with __slots__ do not have a dict of attributes.

__slots__ attributes of Python classes are ordered, too.
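A quick illustration of both cases. Point and SlottedPoint are throwaway example classes:

```python
class Point:
    def __init__(self):
        # Assigned y before x on purpose.
        self.y = 2
        self.x = 1


class SlottedPoint:
    __slots__ = ("y", "x")

    def __init__(self):
        self.y = 2
        self.x = 1


# Instance attributes come back in assignment order.
attr_order = list(vars(Point()))

# Slotted instances have no per-instance __dict__ ...
has_dict = hasattr(SlottedPoint(), "__dict__")

# ... but the declared slot names keep their declared order.
slot_order = SlottedPoint.__slots__
```

attr_order is ['y', 'x'], has_dict is False, and slot_order is ('y', 'x').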

(Sorting and order: sorted() and list.sort() compare with <, so Python 3 objects need at least __lt__ to be sorted; @functools.total_ordering derives the remaining comparisons from __eq__ plus one ordering method. https://docs.python.org/3/library/functools.html#functools.t... )
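For example, with a made-up Version class that defines only __eq__ and __lt__ and lets total_ordering supply the rest:

```python
from functools import total_ordering


@total_ordering
class Version:
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def _key(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._key() < other._key()


ordered = sorted([Version(2, 0), Version(1, 5), Version(1, 0)])
keys = [v._key() for v in ordered]
```

keys comes out as [(1, 0), (1, 5), (2, 0)], and Version(1, 5) >= Version(1, 0) works even though __ge__ was never written by hand.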

Are graphs isomorphic if their nodes and edges are in a different sequence?

  assert dict(a=1, b=2) == dict(b=2, a=1)

  from collections import OrderedDict as odict
  assert odict(a=1, b=2) != odict(b=2, a=1)

To cryptographically sign RDF in any format (XML, JSON, JSON-LD, RDFa), a canonicalization algorithm is applied to normalize the input data prior to hashing and cryptographically signing. Like Merkle hashes of tree branches, a cryptographic signature of a normalized graph is a substitute for more complete tests of isomorphism.

RDF Dataset Canonicalization algorithm: https://w3c-ccg.github.io/rdf-dataset-canonicalization/spec/...
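To make the normalize-then-hash idea concrete, here is a much simpler stand-in: sorted-key JSON instead of RDF Dataset Canonicalization. It does not handle blank nodes or anything else the real algorithm has to deal with, and the function names are assumptions for this sketch. The digest is the thing that would then be signed.

```python
import hashlib
import json


def canonical_bytes(doc) -> bytes:
    """Serialize with sorted keys and fixed separators so that key order
    in the input does not change the output bytes."""
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True)
    return text.encode("utf-8")


def canonical_digest(doc) -> str:
    """SHA-256 over the canonical form; this is what would be signed."""
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()
```

canonical_digest({"a": 1, "b": 2}) and canonical_digest({"b": 2, "a": 1}) are equal, while changing any value changes the digest.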

Also, pickle stores the class name to unpickle data into as a (variously-dotted) str. If the version of the object class is not in the class name, pickle will unpickle data from appA.Pickleable into appB.Pickleable (or PickleableV1 into PickleableV2 objects, as long as PickleableV2=PickleableV1 is specified in the deserializer).

So do methods need to be pickled? For security, no. For fidelity, yes: otherwise the data unpickled in appB is not isomorphic with the appA.Pickleable class instances that were pickled.

One solution: add a version attribute on each object, store it with every object, and discard it before testing equality by other attributes.

Another solution: include the source object version in the class name that gets stored with every pickled object instance, and try hard to make sure the dest object is the same.
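A sketch of the first option using the __getstate__ and __setstate__ hooks that pickle calls. The SCHEMA_VERSION constant and the _schema_version key are invented names, and a real application would probably migrate old state rather than just refusing it.

```python
import pickle


class Pickleable:
    SCHEMA_VERSION = 2  # hypothetical; bump when the attribute layout changes

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __getstate__(self):
        # Store the version alongside the ordinary attributes.
        state = dict(self.__dict__)
        state["_schema_version"] = self.SCHEMA_VERSION
        return state

    def __setstate__(self, state):
        state = dict(state)
        found = state.pop("_schema_version", None)  # discard after checking
        if found != self.SCHEMA_VERSION:
            raise pickle.UnpicklingError(
                f"state has schema version {found!r}, "
                f"expected {self.SCHEMA_VERSION!r}")
        self.__dict__.update(state)

    def __eq__(self, other):
        if not isinstance(other, Pickleable):
            return NotImplemented
        # The version never lands in __dict__, so equality ignores it.
        return self.__dict__ == other.__dict__
```

pickle.loads(pickle.dumps(obj)) goes through both hooks, so an instance written under a different SCHEMA_VERSION fails loudly instead of being silently loaded into a class with a different layout.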


You could use https://github.com/trailofbits/fickling for analysis.
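If you only want a first look without installing anything, the stdlib pickletools module can list opcodes without executing the pickle. The sketch below is not fickling's API and is far cruder than what that tool does; the SUSPICIOUS set and the function name are my own choices. It flags the opcodes that import names or call or construct objects.

```python
import pickle
import pickletools

# Opcodes that look up globals or invoke callables/constructors at load time.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ",
              "NEWOBJ", "NEWOBJ_EX", "BUILD"}


def suspicious_opcodes(data: bytes) -> list:
    """Return the sorted names of suspicious opcodes found in the pickle.
    genops only parses the stream; nothing is unpickled or executed."""
    return sorted({op.name for op, arg, pos in pickletools.genops(data)
                   if op.name in SUSPICIOUS})
```

A pickle of plain dicts and lists yields an empty list, while anything built via __reduce__ shows at least REDUCE plus a global lookup. Legitimate pickles of custom classes also trip this, so treat it as a triage signal, not a verdict.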


