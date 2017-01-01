This stuff is really all over the place - PMML, Arrow, Dill, pickle.
Some stuff won't work with one or the other. I will actually pay for consistency versus performance.
There are way too many primitive serialization libraries. Surprisingly none for the higher order ML, etc stuff.
Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... But it doesn't matter at the end of the day.
For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say that the de facto standard for that is HDF5, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood for its HDF5 support, and I've been using it with hundreds of millions of rows with no problem.
Arrow is slightly different: it's a specification for the in-memory layout of data that enables faster computation. This is more about what happens if you have data in memory and you want to use it with another tool. Serializing / deserializing and munging formats is a waste of time if tools can standardize how they store dataframes in memory and can work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to the processing tools like pandas), but provides a way of saving and loading that in-memory format to and from disk efficiently and in an interoperable way. (https://github.com/wesm/feather)
Also of note is Parquet, which has similar goals to HDF5 and Feather. The Continuum / dask people have been working on a Python library for it called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF5. Parquet is also one of the de facto standard storage formats in the Hadoop ecosystem, which again is good for interop.
Googling turned up very little for me.
TIA
Edit: libraries mentioned in thread:
PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)
Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns or only certain rows that match a predicate or fall within a range of indexes. These can store hundreds of GB, no problem. They often have some sort of compression, like LZ, snappy or blosc, that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the Python libraries that write them. For this, I'd default to PyTables / HDF5, barring some specific use case where you'd already know which other one you need.
Dill / pickle are for serializing generic python objects. I wouldn't really use them to store anything big, but it's very convenient for complicated data structures, like hierarchies of objects and classes. E.g. to save the current running state of your program. You don't have to think about storage formats and layouts and serialization routines, if you have a list of python objects you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.
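As a minimal sketch of the pickle case described above (the `Node` class and its fields are made up for illustration), a small hierarchy of objects can be round-tripped with the built-in module without thinking about a storage format. Usual caveat: only unpickle data you trust.

```python
import pickle


class Node:
    """Hypothetical tree node, used only for this illustration."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


# Build a small hierarchy of objects.
tree = Node("root", [Node("left"), Node("right", [Node("leaf")])])

# Serialize to bytes and back. pickle.dump(obj, file) and pickle.load(file)
# do the same thing against an open binary file.
payload = pickle.dumps(tree)
restored = pickle.loads(payload)

restored_names = [child.name for child in restored.children]
```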
PMML seems like an XML based format specifically for trained machine learning models. Don't really know much about this.
http://wesmckinney.com/blog/outlook-for-2017/
>Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... But it doesn't matter at the end of the day.
pyarrow; pyarrow.parquet (which uses parquet-cpp).
http://pyarrow.readthedocs.io/en/latest/pandas.html
Take a look at this to understand what I mean: http://stackoverflow.com/questions/32757656/what-are-the-pit...
A good serialization library should serialize:
- classes/objects (best practice: objects for holding data)
- pandas/numpy objects (must have: minimizing space)
- namedtuples (currently: a mess, factory implementation)
- dicts and lists of dicts (must have: space efficiency)
Python is terrible at this and it limits use in real data analysis environments and limits competition with matlab.
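To illustrate the namedtuple complaint with the standard library only (the `Point` type is a made-up example), `json` treats a namedtuple as a plain tuple and drops the field names, so a conversion step is needed on both sides:

```python
import json
from collections import namedtuple

# Hypothetical record type for illustration.
Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

# A namedtuple is a tuple subclass, so json emits a bare array
# and the field names are lost.
as_array = json.dumps(p)

# Converting to a dict first keeps the names...
as_object = json.dumps(p._asdict())

# ...and the type has to be rebuilt by hand when loading.
restored = Point(**json.loads(as_object))
```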
If you want matlab files in Python you can use `scipy.io.loadmat('file.mat')`. PyTables (built on hdf5) is a better solution since the hdf5 format is a lot more flexible than matlab's (ime). But Parquet is looking to be the best solution moving forward as it's gaining a lot of mindshare as the go-to flexible format for data and will be / is used in Arrow.
But really, Matlab is on par with pickles when it comes to serialisation. It's a trap solution.
pickle.dump(anyobject, f)
anyobject = pickle.load(f)
It's great to get a view of other problems people are experiencing.
Now that we've finished wrapping up 1.0.0 we're going to be spending some time on the roadmap of new features. I personally feel that seeing use cases that vary from our own is only going to help make Kim better, so we'll defo look into this problem some more in the near future. Right now though I couldn't say for sure what Kim would have to offer when working with Pandas etc, as we've simply never tried.
Works pretty well - I know of large financial firms that are using this in production to load large trained models of size hundreds of GB
It would be great if you could share some ways that you specifically need serialization to work for something like pandas, or better yet, some ways existing solutions don’t work with pandas. We’ve had some pretty unique requirements ourselves and have not found any blockers yet.
Thanks for the message.
Disclaimer: Posting this comment because my colleague pointed out that I could get some points.
In the case of serialization libraries, unless you are validating as part of your (de)serialization, I'd recommend avoiding schema-driven serialization libraries.
These Kim-like libraries, such as Marshmallow, introduce quite a bit of overhead. If validation isn't required and performance matters, I recommend choosing a lighter-weight serialization/marshalling alternative, such as that provided by asphalt-serialization: https://github.com/asphalt-framework/asphalt-serialization
Asphalt-serialization supports cbor, msgpack, json, ... and is easy to wire up
This recommendation is based on my own experience using Marshmallow for Yosai, analyzing its performance and then refactoring to a ported version of asphalt-serialization.
That's a great point and an important distinction to make. As I mentioned in some of the other comments, we have certainly been focussed on features over performance so far but we are actively working on dramatically improving the performance of Kim.
I guess it's always important to pick the right tool for the job. Thanks for sharing the link to asphalt too. I'd not seen that before.
Library                 Many Objects   One Object
---------------------   ------------   ----------
Custom                  0.0187769      0.00682402
Strainer                0.0603201      0.0337129
serpy                   0.073787       0.038656
Lollipop                0.47821        0.231566
Marshmallow             1.14844        0.598486
Django REST Framework   1.94096        1.3277
kim                     2.28477        1.15237
source: https://voidfiles.github.io/python-serialization-benchmark/
I'll be doing this stuff for myself, but would you be interested in having:
a) Support for lima: https://lima.readthedocs.io/en/latest/
b) more benchmark cases (serializing a larger list of objects)
This is a great start for us understanding where we need to get to! We've got some work to do :)
[1]: https://github.com/marshmallow-code/marshmallow/
We started writing Kim around the same time as the Marshmallow project began as we found it wasn't suitable for our needs at that time, though it has come a long way since then.
They are very similar projects and have similar functionality, but Kim has a focus on making it relatively simple to do unusual or 'advanced' things.
For example, Kim supports polymorphism out of the box, if you have an AnimalMapper subclassed by a CatMapper and a DogMapper, passing a Cat and a Dog to AnimalMapper.many.serialize() will automatically do the right thing in a similar way to SQLAlchemy polymorphism.
We also have support for complex requirements such as nesting the same object to itself (useful when your JSON representation is nested but your DB representation is flat,) serialising multiple object fields to a single JSON field (eg full_name consisting of obj.first_name and obj.last_name,) a range of security models for marshalling nested objects and a fairly extensible roles system.
In general we've followed the philosophy "Simple things should be simple. Complex things should be possible."
I'm excited to try out Kim. I've been very close to just writing my own serialization lib on many occasions.
It looks like your pipelines might bring a bit of sanity to it. :)
It looks like you support a few sorts of validation, but the docs aren't super clear as to what the expected validation strategy is. Could you elaborate on what that looks like?
My typical strategy would be to just pass a list of functions that each take the input and return a boolean, as far as validation goes.
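A rough sketch of that strategy in plain Python (this is not Kim's API; `validate` and the checker names are hypothetical):

```python
def is_positive(value):
    return value > 0


def is_small(value):
    return value < 100


def validate(value, validators):
    """Return the names of the validators that rejected ``value``."""
    return [check.__name__ for check in validators if not check(value)]


checks = [is_positive, is_small]
ok_errors = validate(5, checks)      # no validator fails
bad_errors = validate(250, checks)   # is_small fails
```

Returning the names of the failed checks instead of a single boolean makes it possible to report which rule rejected the input.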
It's great you asked this question, as we noticed part of the documentation was actually broken. Here's a link to a pretty basic example of adding extra validation "pipes" to a pipeline:
http://kim.readthedocs.io/en/latest/user/advanced.html#custo...
We'd be more than happy to discuss how to solve more complex requirements if there's something specific you had in mind though.
Thanks for the message!
If we all require projects to say negative things about other people's projects while talking up their own, a lot of projects are going to distort the facts. In the end, if we don't have the ability to evaluate the software ourselves, then all we are measuring is who can shout the loudest and who is the most aggressive against other projects. Quiet projects will still be good, but now they will be overlooked even more because they aren't shouting. With this requirement you are making your own life easier, but you are making life harder for open source developers by forcing them to deal with unnecessary inter-project drama and to divert lots of effort into marketing that could have been put into code. That might make sense for proprietary products, but in open source this kind of demand just hurts the ecosystem.
If the pain of choosing is too much then choose something that is standardized, or the most popular thing, or what your trusted friend recommends. People will seek out the very specific projects they need. If you don't even know why you are using something, it isn't the responsibility of someone else to tell you why you are using it!
For me this wish for a comparison (that I'd love to be objective and in good spirit of course - naive?) is probably coming more from "shopping around" between projects. Or just from seeing a new thing on HN and wondering if I should investigate adding this particular thing to my toolbox.
A great suggestion though, thanks!
We've not really dug into performance yet, though if you look at the last patch (1.0.2) we yielded a 10% speed up by removing an erroneous try/except block.
We've really focussed on features initially and performance is something we're actively researching now. Perhaps we can get some initial benchmarks together and share them with you this week. They will be useful no doubt as we start to plan a release focussed on speed ups.
Thanks for reaching out!
Thanks for the message. Gonna be honest, I'm not sure what you mean by cycles. Can you elaborate a bit?
A = {}
B = {}
A["ref"] = B
B["ref"] = A
Note that
print A
{'ref': {'ref': {...}}}
class BaseMapper(Mapper):
    __type__ = TestType
    score = Integer()
    nest = Nested('NestedMapper')
    __roles__ = {'nested': blacklist('nest')}

class NestedMapper(Mapper):
    __type__ = TestType
    back = Nested('BaseMapper', role='nested')
    name = String()

obj2 = TestType(name='test')
obj = TestType(score=5, nest=obj2)
obj2.back = obj

>> BaseMapper(obj=obj).serialize()
{'nest': {'back': {'score': 5}, 'name': 'test'}, 'score': 5}
One way is to store a table of objects (as identified by id()) encountered during serialization, indexed by the order in which you encounter them. If you encounter an object you have already serialized, serialize an index into that table instead. On deserialization, construct the same kind of table, and replace each index with a reference to the same object.
See e.g. AMF for an example format that does this: https://en.wikipedia.org/wiki/Action_Message_Format
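A minimal sketch of that reference-table idea, assuming the graph contains only dicts (with string keys), lists and JSON scalars. The `serialize` / `deserialize` names and the `ref` / `value` / `table` layout are invented for this illustration and are not AMF's or any library's format:

```python
import json


def serialize(root):
    """Flatten nested dicts/lists into a table; repeats become index refs."""
    table = []  # table[i] is the encoded form of the i-th container seen
    seen = {}   # id(container) -> index into table

    def encode(obj):
        if isinstance(obj, (dict, list)):
            if id(obj) not in seen:
                index = len(table)
                seen[id(obj)] = index
                table.append(None)  # reserve the slot before recursing
                if isinstance(obj, dict):
                    table[index] = {"dict": {k: encode(v) for k, v in obj.items()}}
                else:
                    table[index] = {"list": [encode(v) for v in obj]}
            return {"ref": seen[id(obj)]}
        return {"value": obj}

    return {"root": encode(root), "table": table}


def deserialize(data):
    table = data["table"]
    # First create empty containers so that references can point at them...
    objects = [{} if "dict" in entry else [] for entry in table]

    def decode(node):
        return objects[node["ref"]] if "ref" in node else node["value"]

    # ...then fill them in.
    for target, entry in zip(objects, table):
        if "dict" in entry:
            for key, child in entry["dict"].items():
                target[key] = decode(child)
        else:
            target.extend(decode(child) for child in entry["list"])
    return decode(data["root"])


a = {}
b = {}
a["b"] = b
b["a"] = a

# The flattened form has no cycles, so it survives a trip through JSON.
wire = json.dumps(serialize(a))
copy = deserialize(json.loads(wire))
```

After the round trip, `copy["b"]["a"] is copy` holds, so the cycle is rebuilt rather than expanded one level deep.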
{'nest': {'back': {'score': 5}, 'name': 'test'}, 'score': 5}
(I might be wrong but it seems to me that the act of serialization has simply expanded the cycle one level deep.)
a = {}
b = {}
a["b"] = b
b["a"] = a
a == deserialize(serialize(a))
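For what it's worth, the built-in pickle already handles this shape. A small sketch follows; note that, as far as I know, comparing two distinct cyclic dicts with `==` in CPython recurses until it hits the recursion limit, so the check below uses identity instead:

```python
import pickle

a = {}
b = {}
a["b"] = b
b["a"] = a

copy = pickle.loads(pickle.dumps(a))

# Check the shape by identity rather than with ==.
cycle_preserved = copy["b"]["a"] is copy
```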
Silly question, what happens with Unicode?
This fundamentally doesn't offer much advantage over a .toJSON() instance method and a .fromJSON() class method.
Don't say "security-focused" if you can't handle cyclic object graphs.