Show HN: Kim – A Python serialization and marshaling framework (readthedocs.io)



We are really looking for serialization libraries that will work with pandas and scikit.

This stuff is really all over the place - PMML, Arrow, Dill, pickle.

Some stuff won't work with one or the other. I will actually pay for consistency versus performance.

There are way too many primitive serialization libraries. Surprisingly, none for the higher-order ML stuff.

Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.


So stuff like this or marshmallow is more for cases when you have some database / ORM objects and you want to serialize them out to a json object, or you want to process form/POST data into a well-structured json or database object.

For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say that the de facto standard for that is HDF5, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood and I've been using this with hundreds of millions of rows with no problem.
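
For instance (a minimal sketch; assumes a DataFrame `df` with a numeric column 'a', and format='table' is what makes on-disk queries possible):

    import pandas as pd

    df = pd.DataFrame({'a': range(1000000), 'b': 1.5})

    # format='table' lays the data out so it can be queried on disk
    df.to_hdf('store.h5', 'df', format='table', data_columns=['a'])

    # read back only the rows matching a predicate, without loading everything
    subset = pd.read_hdf('store.h5', 'df', where='a > 999990')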

Arrow is slightly different - it's a specification for the in-memory layout of data that enables faster computation. This is more about what happens if you have data in memory and you want to use it with another tool - serializing / deserializing and munging formats is a waste of time if tools can standardize how they store dataframes in memory and can work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to the processing tools like pandas), but supports a way of saving and loading that in-memory format to and from disk efficiently and in an interoperable way. (https://github.com/wesm/feather)

Also of note is Parquet, which has similar goals to HDF and Feather, but the Continuum / dask people have been working on a wrapper for that called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF. This is also one of the Hadoop ecosystem's de facto standard storage formats, which again is good for interop.
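
For anyone curious, usage is roughly this (a sketch based on the fastparquet README; assumes a DataFrame `df` and snappy installed for compression):

    import pandas as pd
    from fastparquet import write, ParquetFile

    df = pd.DataFrame({'x': range(100), 'y': 2.0})

    # write a columnar, compressed parquet file
    write('out.parq', df, compression='SNAPPY')

    # read back only the columns you need
    df2 = ParquetFile('out.parq').to_pandas(columns=['x'])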


Do you know of a source that compares these different libraries in terms of capabilities, focus/use cases, size limits, performance, format support, etc.?

Googling turned up very little for me.

TIA

Edit: libraries mentioned in thread:

PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)


No, I don't, but some of these are apples and oranges, that was part of my point. You're conflating many different types of things.

Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns, or only certain rows that match a predicate or fall within a range of indexes. These can store hundreds of GB, no problem. They often have some sort of compression, like LZ, snappy or blosc, that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the python libraries that write them. For this, I'd default to pytables / HDF5, barring some specific use case where you'd already know which other one you need.
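
As a concrete example of the compression side, it's one keyword away in pandas / PyTables (a sketch; blosc availability depends on how your PyTables was built):

    import pandas as pd

    df = pd.DataFrame({'a': range(1000000)})

    # blosc: low CPU overhead, decent compression ratio
    df.to_hdf('compressed.h5', 'df', format='table',
              complib='blosc', complevel=9)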

Dill / pickle are for serializing generic python objects. I wouldn't really use them to store anything big, but they're very convenient for complicated data structures, like hierarchies of objects and classes, e.g. to save the current running state of your program. You don't have to think about storage formats, layouts and serialization routines; if you have a list of python objects you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.
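
A quick illustration of the difference (lambdas are one of the edge cases dill handles):

    import pickle
    import dill

    f = lambda x: x + 1

    dill.loads(dill.dumps(f))(2)   # round-trips fine: returns 3

    try:
        pickle.dumps(f)            # stdlib pickle can't serialize lambdas
    except Exception as e:
        print(e)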

PMML seems to be an XML-based format specifically for trained machine learning models. Don't really know much about this.


McKinney has been hard at work getting parquet and arrow support in pandas.

http://wesmckinney.com/blog/outlook-for-2017/

>Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.

pyarrow; pyarrow.parquet (which uses parquet-cpp).
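
Roughly, the round trip looks like this (a small sketch; assumes pyarrow with parquet support installed):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'x': [1, 2, 3]})

    # pandas -> Arrow in-memory table -> parquet on disk (via parquet-cpp)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, 'data.parquet')

    # and back
    df2 = pq.read_table('data.parquet').to_pandas()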


Wow, this is great. I've been working around the JVM to integrate sklearn and some Spark jobs that produce Parquet. This is a huge relief.


Arrow doesn't do scikit - at least, last time I checked. Has it changed?


pyarrow has methods to convert to pandas, which scikit supports

http://pyarrow.readthedocs.io/en/latest/pandas.html


No - this is not it. Scikit models need to be persisted. The only ways I have found are pickle or dill.

Take a look at this to understand what I mean: http://stackoverflow.com/questions/32757656/what-are-the-pit...
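
Concretely, this is the usual joblib route and its main pitfall (a sketch; per the SO link, a dump is only safe to load with the same scikit-learn version that wrote it):

    from sklearn.externals import joblib
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression().fit([[0], [1]], [0, 1])

    # joblib is pickle-based but more efficient for the large numpy
    # arrays inside fitted models
    joblib.dump(clf, 'model.pkl')

    clf2 = joblib.load('model.pkl')  # must be the same sklearn version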


Python's data infrastructure has a huge problem: serialization, and thus saving data results.

A good serialization library should serialize:

  - classes/objects (best practice: objects for holding data)
  - pandas/numpy objects (must have: minimizing space)
  - namedtuples (currently: a mess, factory implementation)
  - dicts and lists of dicts (must have: space efficiency)
Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this, and it limits use in real data analysis environments and limits competition with Matlab.


> Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

If you want Matlab files in Python you can use `scipy.io.loadmat('file.mat')`. PyTables (built on HDF5) is a better solution, since the HDF5 format is a lot more flexible than Matlab's (ime). But Parquet is looking to be the best solution moving forward as it's gaining a lot of mindshare as the go-to flexible format for data and will be / is used in Arrow.
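
The scipy route round-trips like this, for reference (note that loadmat gives you back a dict of numpy arrays, not your original Python objects):

    import numpy as np
    from scipy import io

    io.savemat('file.mat', {'x': np.arange(10)})
    data = io.loadmat('file.mat')   # {'x': array([[0, 1, ..., 9]]), ...}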

But really, Matlab is on par with pickles when it comes to serialisation. It's a trap solution.


Actually, since Matlab v7.3, .mat files are HDF5 files.


To expand on fnord: to my knowledge, pickle handles all of these things. It's still a bad solution, but it does everything you want.

    pickle.dump(anyobject, f)  # note: the object comes first, then the file
    anyobject = pickle.load(f)


Pickle has size constraints that make it unsuitable in certain ML applications.


Indeed, but I expect that's also true for Matlab's vanilla solution.


Does using protocol version 4 help with this?
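
For context: protocol 4 (added in Python 3.4, PEP 3154) is documented to lift the old 4 GB limits, i.e. something like this (`big_object` standing in for whatever needs storing):

    import pickle

    big_object = list(range(10))  # stand-in for the large model/data in question

    with open('state.pkl', 'wb') as f:
        pickle.dump(big_object, f, protocol=4)  # PEP 3154: 64-bit sizes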


Thanks for expanding on that, mhneu. So our primary focus with Kim has certainly been around serializing/marshaling JSON, though we've used it for plenty of other use cases.

It's great to get a view of other problems people are experiencing.

Now we've finished wrapping up 1.0.0 we're going to be spending some time on the roadmap of new features. I personally feel variation in use cases from our own is only going to help make Kim better, so we'll defo look into this problem some more in the near future. Right now though I couldn't say for sure what Kim would have to offer when working with pandas etc. as we've simply never tried.


defo?


definitely.


I believe the benchmark is set by R and its RData format. It saves everything in the R domain: ML models, dataframes, everything.

Works pretty well. I know of large financial firms that are using this in production to load large trained models, hundreds of GB in size.


One of the things we felt very strongly about when developing Kim was that simple things should be simple and complex things should be possible. To that end, the Pipeline system behind the Field objects really does allow anything to be achieved, whether that's producing values from composite fields or handling unique or non-standard data types.

It would be great if you could share some ways that you specifically need serialization to work for something like pandas, or better yet, some ways existing solutions don't work with pandas. We've had some pretty unique requirements ourselves and have not found any blockers yet.

Thanks for the message.


I'd like to congratulate the authors on the clever naming. I totally get the Eminem reference.

Disclaimer: Posting this comment because my colleague pointed out that I could get some points.


Kim: A JSON Serialization and Marshaling framework that Mathers


Cool project!

In the case of serialization libraries, unless you are validating as part of your (de)serialization, I'd recommend avoiding schema-driven serialization libraries. These Kim-like libraries, such as Marshmallow, introduce quite a bit of overhead. If validation isn't required and performance matters, I recommend choosing a lighter-weight serialization/marshalling alternative, such as that provided by asphalt-serialization: https://github.com/asphalt-framework/asphalt-serialization

Asphalt-serialization supports CBOR, msgpack, JSON, ... and is easy to wire up.
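
From a quick look, usage appears to be roughly this (untested sketch; the module path and class name are my assumption from the project layout, so check the docs):

    # assumed API - verify against the asphalt-serialization docs
    from asphalt.serialization.serializers.json import JSONSerializer

    serializer = JSONSerializer()
    payload = serializer.serialize({'id': 1, 'name': 'kim'})  # -> bytes
    obj = serializer.deserialize(payload)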

This recommendation is based on my own experience using Marshmallow for Yosai, analyzing its performance and then refactoring to a ported version of asphalt-serialization.


Hey Dowwie!

That's a great point and an important distinction to make. As I mentioned in some of the other comments, we have certainly been focussed on features over performance so far but we are actively working on dramatically improving the performance of Kim.

I guess it's always important to pick the right tool for the job. Thanks for sharing the link to asphalt too. I'd not seen that before.


Keep up the good work, Mikey. :) See you at PyCon, maybe?


One of the engineers from our team is going to be there for sure. I'm certainly keen to go, so fingers crossed.


I added Kim to my ongoing set of Python serialization framework benchmarks. Here is how it ranks:

  Library                  Many Objects    One Object
  ---------------------  --------------  ------------
  Custom                      0.0187769    0.00682402
  Strainer                    0.0603201    0.0337129
  serpy                       0.073787     0.038656
  Lollipop                    0.47821      0.231566
  Marshmallow                 1.14844      0.598486
  Django REST Framework       1.94096      1.3277
  kim                         2.28477      1.15237
Comments on how to improve the benchmark are appreciated.

source: https://voidfiles.github.io/python-serialization-benchmark/


This is brilliant, exactly what I was looking for. I did a profile recently on some API calls and found that 40-50% was being spent on serialization with marshmallow, which I'm looking to drop.

I'll be doing this stuff for myself, but would you be interested in having:

a) Support for lima: https://lima.readthedocs.io/en/latest/

b) more benchmark cases (serializing a larger list of objects)


Just a minor note: It seems you don't mention anywhere what those numbers actually mean. I'm assuming they are seconds, but I can't know for certain, which makes it really unclear if Kim is the fastest or the slowest.


Thanks so much for this, Voidfiles. We were under no illusions that we were the most performant library out there (yet).

This is a great start for us understanding where we need to get to! We've got some work to do :)


Nice, but I recommend closing the issues (https://github.com/mikeywaites/kim/issues) which have fixes (some of them show 'merge'). It's one thing I as a user look at when choosing whether to adopt a project or not.


Absolutely. I'm a bit annoyed at myself that I hadn't got round to that yet, but thanks for raising it.


I will submit a PR for some doc fixes :-) It's on the way, look out for it in the next 24 hours! This has been an awesome project for a couple of years now. Great run!


It does look like marshmallow[1]. How does Kim relate to it?

[1]: https://github.com/marshmallow-code/marshmallow/


(I'm Jack, another developer at OSL.)

We started writing Kim around the same time as the Marshmallow project began as we found it wasn't suitable for our needs at that time, though it has come a long way since then.

They are very similar projects and have similar functionality, but Kim has a focus on making it relatively simple to do unusual or 'advanced' things.

For example, Kim supports polymorphism out of the box: if you have an AnimalMapper subclassed by a CatMapper and a DogMapper, passing a Cat and a Dog to AnimalMapper.many.serialize() will automatically do the right thing, in a similar way to SQLAlchemy polymorphism.
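
In code, the usage side looks roughly like this (a hypothetical sketch: the declarations mimic the roles example further down the thread, but the exact way Kim wires up polymorphic dispatch isn't shown here, so treat them as illustrative):

    # illustrative only - not confirmed Kim declaration syntax
    class AnimalMapper(Mapper):

        __type__ = Animal

        name = String()

    class CatMapper(AnimalMapper):

        __type__ = Cat

        lives = Integer()

    class DogMapper(AnimalMapper):

        __type__ = Dog

        breed = String()

    # per the comment above, a mixed list picks the right mapper per object
    AnimalMapper.many().serialize([Cat(name='felix', lives=9),
                                   Dog(name='rex', breed='lab')])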

We also have support for complex requirements such as nesting the same object inside itself (useful when your JSON representation is nested but your DB representation is flat), serialising multiple object fields to a single JSON field (e.g. full_name consisting of obj.first_name and obj.last_name), a range of security models for marshalling nested objects, and a fairly extensible roles system.

In general we've followed the philosophy "Simple things should be simple. Complex things should be possible."


I've been saddened by Marshmallow on many occasions (I have gripes with the particular way defaults/validation play together. This is true for WTForms too).

I'm excited to try out Kim. I've been very close to just writing my own serialization lib on many occasions.

It looks like your pipelines might bring a bit of sanity to it. :)

It looks like you support a few sorts of validation, but the docs aren't super clear as to what the expected validation strategy is. Could you elaborate on what that looks like?

The strategy I'd typically like is just a list of functions that take the input and return a boolean, as far as validation goes.
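
i.e. something like this plain-Python sketch:

    # validation as a plain list of predicates over the input value
    validators = [
        lambda v: isinstance(v, str),    # must be a string
        lambda v: 0 < len(v) <= 80,      # non-empty, bounded length
    ]

    def is_valid(value):
        return all(check(value) for check in validators)

    is_valid('hello')  # True
    is_valid('')       # False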


Hey, that's really great to hear (that you're keen to use Kim). WTForms and Marshmallow both solve problems and they do it well, but it seems like us, you wanted something that offered just a bit more flexibility. That's totally the idea behind pipelines in Kim. They are like tiny little computer programmes and are really capable of anything (providing it's possible in Python, of course :D).

It's great you asked this question, as we noticed part of the documentation was actually broken. Here's a link to a pretty basic example of adding extra validation "pipes" to a pipeline:

http://kim.readthedocs.io/en/latest/user/advanced.html#custo...

We'd be more than happy to discuss how to solve more complex requirements if there's something specific you had in mind though.

Thanks for the message!


Obviously no OS developer owes anybody an explanation, but man would I appreciate it if more projects had a "why you should use this over related projects" section (like e.g. pendulum does: https://github.com/sdispater/pendulum/blob/master/README.rst...)


I know the pain of searching for software to meet your requirements. But unless you have a friend you can really trust to provide informed recommendations, nobody can take this pain away for you.

If we all require projects to say negative things about other people's projects while talking up their own, a lot of projects are going to distort the facts. In the end, if we don't have the ability to evaluate the software ourselves, then all we are measuring is who can shout the loudest and who is the most aggressive against other projects. Quiet projects will still be good, but now those would be overlooked even more because they aren't shouting. With this requirement you are making your life easier, but you are making life harder on open source developers by forcing them to deal with unnecessary inter-project drama and to divert lots of effort into marketing that could have been put into code. That might make sense for proprietary products, but in open source this kind of demand just hurts the ecosystem.

If the pain of choosing is too much then choose something that is standardized, or the most popular thing, or what your trusted friend recommends. People will seek out the very specific projects they need. If you don't even know why you are using something, it isn't the responsibility of someone else to tell you why you are using it!


I think you're right, when actually using it for production software it's probably wise to not be a trailblazer :)

For me this wish for a comparison (that I'd love to be objective and in good spirit of course - naive?) is probably coming more from "shopping around" between projects. Or just from seeing a new thing on HN and wondering if I should investigate adding this particular thing to my toolbox.


Hey. It's a great point and something we will certainly look to add to the documentation. To be honest, the docs were the major thing that held up the release of Kim. We made the mistake of leaving them until last, so in the end we opted for quality over quantity to get them finished.

A great suggestion though, thanks!


I think marshmallow's primary use case is to deserialize to nested dicts/lists, while Kim outputs full classes. Did I understand that right?


Cool! Are there any speed comparisons available between this and marshmallow (or other alternatives)?


Hi Siddhant,

We've not really dug into performance yet, though if you look at the last patch (1.0.2) we yielded a 10% speed up by removing an erroneous try/except block.

We've really focussed on features initially and performance is something we're actively researching now. Perhaps we can get some initial benchmarks together and share them with you this week. They will be useful no doubt as we start to plan a release focussed on speed ups.

Thanks for reaching out!


Can it serialize cycles?


Hey Amelius,

Thanks for the message. Gonna be honest, I'm not sure what you mean by cycles. Can you elaborate a bit?


Roughly speaking, by cycles I mean a structure that refers to itself somehow. For example:

    A = {}
    B = {}
    A["ref"] = B
    B["ref"] = A
So would it be possible to serialize A and B, and of course to deserialize them?

Note that

    print A
gives

    {'ref': {'ref': {...}}}
which is of course not a suitable serialization, since you can't recover the original structure from it.


Yes, this is possible as long as the second level nested object has a role to stop infinite recursion from occurring. Cycles are not automatically detected.

    class BaseMapper(Mapper):

        __type__ = TestType

        score = Integer()
        nest = Nested('NestedMapper')

        # the 'nested' role blacklists the 'nest' field, which is what
        # stops the recursion
        __roles__ = {'nested': blacklist('nest')}

    class NestedMapper(Mapper):

        __type__ = TestType

        # serialize the back-reference using the restricted 'nested' role
        back = Nested('BaseMapper', role='nested')
        name = String()

    obj2 = TestType(name='test')
    obj = TestType(score=5, nest=obj2)
    obj2.back = obj  # create the cycle

    >>> BaseMapper(obj=obj).serialize()
    {'nest': {'back': {'score': 5}, 'name': 'test'}, 'score': 5}


That's not a solution as it will not restore the same object graph, it will just repeat values.

One way is to store a table of objects (as identified by id()) encountered during serialization, indexed by the order you encounter them. If you encounter an object you have already serialized, serialize an index into that table instead. On deserialization, construct the same kind of table, and resolve an index to a reference to the same object.

See e.g. AMF for an example format that does this: https://en.wikipedia.org/wiki/Action_Message_Format
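
A toy version of that idea for plain dicts (illustrative sketch only; real formats like AMF also cover arrays, strings and so on):

    # encode nested dicts, replacing already-seen objects with an index
    # into a table ordered by first encounter
    def encode(obj, memo=None):
        if memo is None:
            memo = {}
        if isinstance(obj, dict):
            if id(obj) in memo:
                return {'$ref': memo[id(obj)]}      # back-reference by index
            memo[id(obj)] = len(memo)
            return {'$id': memo[id(obj)],
                    '$data': {k: encode(v, memo) for k, v in obj.items()}}
        return obj                                   # assume scalars otherwise

    def decode(obj, table=None):
        if table is None:
            table = {}
        if isinstance(obj, dict):
            if '$ref' in obj:
                return table[obj['$ref']]            # resolve to the same object
            out = table[obj['$id']] = {}
            for k, v in obj['$data'].items():
                out[k] = decode(v, table)
            return out
        return obj

    A, B = {}, {}
    A['ref'], B['ref'] = B, A
    A2 = decode(encode(A))
    A2['ref']['ref'] is A2   # True: the cycle is restored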


The output isn't clear to me:

    {'nest': {'back': {'score': 5}, 'name': 'test'}, 'score': 5}
How can this be mapped back unambiguously to a cyclic structure?

(I might be wrong but it seems to me that the act of serialization has simply expanded the cycle one level deep.)


    a = {}
    b = {}
    a["b"] = b
    b["a"] = a

    a == deserialize(serialize(a))


So this takes JSON and maps it to namedtuples?

Silly question, what happens with Unicode?


I was just looking for something like this or marshmallow.


why not pickle :) :)


Sorry, I must be harsh. No.

This fundamentally doesn't offer much advantage over a .toJSON() instance method and a .fromJSON() class method.

Don't say "security-focused" if you can't handle cyclic object graphs.


Please elaborate on the reasons for your opinion :)




