

Python Pickle versus HDF5 - tomrod
http://www.shocksolution.com/2010/01/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/

======
TheLoneWolfling
"Warning: The pickle module is not intended to be secure against erroneous or
maliciously constructed data. Never unpickle data received from an untrusted
or unauthenticated source."

Storing pickled data on disk paves a path for all sorts of nasty exploits...
Case in point:

File contents: "cos\nsystem\n(S'ls ~'\ntR."
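
To illustrate why a payload of that shape is dangerous, here is a minimal sketch, assuming Python 3. It uses the same opcode sequence but calls the harmless builtin len() instead of os.system(). The helper names NoGlobalsUnpickler and restricted_loads are made up for this example; overriding find_class is the hook the pickle documentation describes for restricting globals.

```python
import io
import pickle

# Same opcode shape as the quoted payload (GLOBAL, MARK, STRING, TUPLE,
# REDUCE, STOP), but it calls the harmless builtin len() rather than
# os.system().
payload = b"cbuiltins\nlen\n(S'abc'\ntR."

# Plain pickle.loads() imports and calls whatever callable the payload names.
called_result = pickle.loads(payload)


class NoGlobalsUnpickler(pickle.Unpickler):
    """Hypothetical helper: refuse every global lookup."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name)
        )


def restricted_loads(data):
    """Hypothetical helper: unpickle bytes with global lookups disabled."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()
```

With this sketch, restricted_loads() still round-trips plain containers of numbers and strings, but raises pickle.UnpicklingError on the payload above. It only narrows the attack surface; the documentation's advice not to unpickle untrusted data still stands.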

How is HDF5, security-wise?

~~~
pekk
If your local filesystem is literally an untrusted source, then you have big
problems. All your own Python code is coming from that same untrusted source,
along with all the .pyc files in your code and on PYTHONPATH. Is a .py or a
.pyc paving a path for all sorts of nasty exploits, just because it can be
run?

This isn't even limited to Python: every executable you are running on that
machine is coming from the same untrusted source, and every binary you build
on that machine is also tainted by extension.

Does loading a kernel from disk pave a path for all sorts of nasty exploits?

~~~
lvh
I think there's a useful distinction to be made between data and code here.
Your code typically writes to your data store. Your code does not typically
write to your code.

It's a lot more plausible that an attacker manages to convince your code to
write some malicious data than that an attacker has full write access to your
filesystem.

As an analogy: your SQL database probably writes to a filesystem somewhere. If
it's running on the same machine as your app server, it may be on the _same_
filesystem. But, say, SQL injection attacks are still infinitely more common
than "can write to the source file currently being executed".

Or, to rephrase: being writable (by the user executing app code) is a default
state for your data store. It isn't a default state for your app, or your
kernel.

------
rch
PyTables is really useful, but there is also h5py. I sometimes find it handy
to create an HDF data structure in memory, and I think the latter library has
better support in that case. I would like to know more about HDF read and
write performance though, maybe relative to protobuf and msgpack (both could
be much faster, but I wonder).

------
stock_toaster
My goodness. The author should be using cPickle.

~~~
zmk_
With protocol=2 as well.

~~~
stock_toaster
The code example was already using pickle.HIGHEST_PROTOCOL, which is currently 2.
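
For reference, a rough sketch of what the protocol choice changes, assuming Python 3. There is no separate cPickle module there (the C implementation is used automatically), and HIGHEST_PROTOCOL is larger than 2. The sample data is made up, and the sizes depend on the values being pickled.

```python
import pickle

# Made-up sample data: 1000 floats with long decimal representations.
data = [i / 7 for i in range(1000)]

# Protocol 0 is the ASCII format; protocol 2 stores each float as 8 raw bytes.
text_bytes = pickle.dumps(data, protocol=0)
binary_bytes = pickle.dumps(data, protocol=2)

text_size = len(text_bytes)
binary_size = len(binary_bytes)

# Newest protocol available to the running interpreter.
newest = pickle.HIGHEST_PROTOCOL
newest_bytes = pickle.dumps(data, protocol=newest)
```

For data like this the binary protocol is noticeably smaller than the ASCII one. Whether that carries over to speed on large NumPy arrays would need to be measured separately.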

