
Show HN: Kim – A Python serialization and marshaling framework - mikeywaites
http://kim.readthedocs.io/en/latest/
======
sandGorgon
We are really looking for serialization libraries that will work with pandas
and scikit.

This stuff is really all over the place - PMML, Arrow, Dill, pickle.

Some stuff won't work with one or the other. I will actually _pay_ for
consistency versus performance.

There are way too many primitive serialization libraries. Surprisingly none
for the higher order ML, etc stuff.

Given the kind of people behind Arrow, I would love a wrapper that uses Arrow
to do all of this... But it doesn't matter at the end of the day.

~~~
mhneu
Python's data infrastructure has a huge problem: serialization and thus saving
data results.

A good serialization library should serialize:

    
    
      - classes/objects (best practice: objects for holding data)
      - pandas/numpy objects (must have: minimizing space)
      - namedtuples (currently: a mess, factory implementation)
      - dicts and lists of dicts (must have: space efficiency)
    

Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this and it limits use in real data analysis
environments and limits competition with matlab.

~~~
fnord123
> Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

If you want matlab files in Python you can use `scipy.io.loadmat('file.mat')`.
PyTables (built on hdf5) is a better solution since the hdf5 format is a lot
more flexible than matlab's (ime). But Parquet is looking to be the best
solution moving forward as it's gaining a lot of mindshare as the go-to
flexible format for data and will be / is used in Arrow.

But really, Matlab is on par with pickles when it comes to serialisation. It's
a trap solution.
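
For comparison, a minimal sketch of what the pickle equivalent of that Matlab one-liner might look like. The `save` and `load` helpers are hypothetical names chosen to mirror the Matlab calls; only the standard library is assumed.

```python
import os
import pickle
import tempfile


def save(path, obj):
    """Write any picklable object to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load(path):
    """Read back whatever `save` wrote. Only use this on trusted files."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Illustrative data: a dict holding a list and a nested dict.
results = {"run": 7, "scores": [0.91, 0.87], "params": {"alpha": 0.1}}

path = os.path.join(tempfile.mkdtemp(), "results.pkl")
save(path, results)
restored = load(path)
```

The convenience is real, but so is the trap: a pickle file is tied to the importable class definitions that existed when it was written, and unpickling untrusted data can execute arbitrary code.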

~~~
auxym
Actually, since Matlab v7.3, .mat files are actually hdf5 files.

------
limdauto
I'd like to congratulate the authors on the clever naming. I totally
get the Eminem reference.

Disclaimer: Posting this comment because my colleague pointed out that I could
get some points.

~~~
_e
Kim: A JSON Serialization and Marshaling framework that Mathers

------
Dowwie
Cool project!

In the case of serialization libraries, unless you are validating as part of
your (de)serialization, I'd recommend avoiding schema-driven serialization
libraries. These Kim-like libraries, such as Marshmallow, introduce quite a
bit of overhead. If validation isn't required and performance matters, I
recommend choosing a lighter-weight serialization/marshalling alternative,
such as that provided by asphalt-serialization:
[https://github.com/asphalt-framework/asphalt-serialization](https://github.com/asphalt-framework/asphalt-serialization)

Asphalt-serialization supports cbor, msgpack, json, ... and is easy to wire up.

This recommendation is based on my own experience using Marshmallow for Yosai,
analyzing its performance and then refactoring to a ported version of asphalt-
serialization.

~~~
mikeywaites
Hey Dowwie!

That's a great point and an important distinction to make. As I mentioned in
some of the other comments, we have certainly been focussed on features over
performance so far but we are actively working on dramatically improving the
performance of Kim.

I guess it's always important to pick the right tool for the job. Thanks for
sharing the link to asphalt too. I'd not seen that before.

~~~
Dowwie
Keep up the good work, Mikey. :) See you at PyCon, maybe?

~~~
mikeywaites
One of the engineers from our team is going to be there for sure. I'm certainly
keen to go, so fingers crossed.

------
voidfiles
I added Kim to my ongoing set of Python serialization framework benchmarks.
Here is how it ranks:

    
    
      Library                  Many Objects    One Object
      ---------------------  --------------  ------------
      Custom                      0.0187769    0.00682402
      Strainer                    0.0603201    0.0337129
      serpy                       0.073787     0.038656
      Lollipop                    0.47821      0.231566
      Marshmallow                 1.14844      0.598486
      Django REST Framework       1.94096      1.3277
      kim                         2.28477      1.15237
    

Comments on how to improve the benchmark are appreciated.

source: [https://voidfiles.github.io/python-serialization-benchmark/](https://voidfiles.github.io/python-serialization-benchmark/)

~~~
makmanalp
This is brilliant, exactly what I was looking for. I did a profile recently on
some API calls and found that 40-50% was being spent on serialization with
marshmallow, which I'm looking to drop.

I'll be doing this stuff for myself, but would you be interested in having:

a) Support for lima:
[https://lima.readthedocs.io/en/latest/](https://lima.readthedocs.io/en/latest/)

b) more benchmark cases (serializing a larger list of objects)

------
yeukhon
Nice, but I recommend closing issues
[https://github.com/mikeywaites/kim/issues](https://github.com/mikeywaites/kim/issues)
which have fixes (some of them show 'merge'). It's one thing I as a user look
at when choosing whether to adopt a project or not.

~~~
mikeywaites
Absolutely. I'm a bit annoyed at myself that I hadn't got round to that yet, but
thanks for raising it.

~~~
yeukhon
I will submit a PR for some doc fixes :-) It's on the way, look out for it in
the next 24 hours! This has been an awesome project for a couple of years,
great run!

------
sakawa
It does look like marshmallow[1]. How does Kim relate to it?

[1]: [https://github.com/marshmallow-code/marshmallow/](https://github.com/marshmallow-code/marshmallow/)

~~~
tinnet
Obviously no OS developer owes anybody an explanation, but man would I
appreciate it if more projects had a "why you should use this over related
projects" section (like e.g. pendulum does
[https://github.com/sdispater/pendulum/blob/master/README.rst...](https://github.com/sdispater/pendulum/blob/master/README.rst#why-not-arrow))

~~~
pekk
I know the pain of searching for software to meet your requirements. But
unless you have a friend you can really trust to provide informed
recommendations, nobody can take this pain away for you.

If we all require projects to say negative things about other people's projects
while talking up their own, a lot of projects are going to distort the facts.
In the end, if we don't have the ability to evaluate the software ourselves,
then all we are measuring is who can shout the loudest and who is the most
aggressive against other projects. Quiet projects will still be good, but now
those would be overlooked even more because they aren't shouting. With this
requirement you are making your life easier but you are making life harder on
open source developers by forcing them to deal with unnecessary inter-project
drama and to divert lots of effort into marketing that could have been put
into code. That might make sense in proprietary products, but in open source
this kind of demand just hurts the ecosystem.

If the pain of choosing is too much then choose something that is
standardized, or the most popular thing, or what your trusted friend
recommends. People will seek out the very specific projects they need. If you
don't even know why you are using something, it isn't the responsibility of
someone else to tell you why you are using it!

~~~
tinnet
I think you're right, when actually using it for production software it's
probably wise to not be a trailblazer :)

For me this wish for a comparison (that I'd love to be objective and in good
spirit of course - naive?) is probably coming more from "shopping around"
between projects. Or just when seeing a new thing on HN and wondering if I
should investigate adding this particular thing to my toolbox.

------
siddhant
Cool! Are there any speed comparisons available between this and marshmallow
(or other alternatives)?

~~~
mikeywaites
Hi Siddhant,

We've not really dug into performance yet, though if you look at the last
patch (1.0.2) we yielded a 10% speed up by removing an erroneous try/except
block.

We've really focussed on features initially and performance is something we're
actively researching now. Perhaps we can get some initial benchmarks together
and share them with you this week. They will be useful no doubt as we start to
plan a release focussed on speed ups.

Thanks for reaching out!

------
amelius
Can it serialize cycles?

~~~
mikeywaites
Hey Amelius,

thanks for the message. Gonna be honest, I'm not sure what you mean by cycles.
Can you elaborate a bit?

~~~
amelius
Roughly speaking, by cycles I mean a structure that refers to itself somehow.
For example:

    
    
        A = {}
        B = {}
        A["ref"] = B
        B["ref"] = A
    

So would it be possible to serialize A and B, and of course to deserialize
them?

Note that

    
    
        print(A)
    

gives

    
    
        {'ref': {'ref': {...}}}
    

which is of course not a suitable serialization, since you can't recover the
original structure from it.
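
For reference, a small sketch of how the standard library treats this exact structure; it says nothing about Kim itself. With default settings `json.dumps` refuses the cycle, while `pickle` records which objects it has already written and so round-trips it.

```python
import json
import pickle

A = {}
B = {}
A["ref"] = B
B["ref"] = A

# json detects the cycle and raises ValueError instead of recursing forever.
try:
    json.dumps(A)
    json_error = None
except ValueError as exc:
    json_error = str(exc)

# pickle memoizes objects it has already seen, so the cycle is preserved.
A2 = pickle.loads(pickle.dumps(A))
B2 = A2["ref"]
cycle_restored = B2["ref"] is A2
```

After the round trip `A2` and `B2` are new objects, but they point at each other just as `A` and `B` did.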

~~~
jackqu7
Yes, this is possible as long as the second level nested object has a role to
stop infinite recursion from occurring. Cycles are not automatically detected.

    
    
        class BaseMapper(Mapper):
    
            __type__ = TestType
    
            score = Integer()
            nest = Nested('NestedMapper')
    
            __roles__ = {'nested': blacklist('nest')}
    
        class NestedMapper(Mapper):
    
            __type__ = TestType
    
            back = Nested('BaseMapper', role='nested')
            name = String()
    
        obj2 = TestType(name='test')
        obj = TestType(score=5, nest=obj2)
        obj2.back = obj
    
        >> BaseMapper(obj=obj).serialize()
        {'nest': {'back': {'score': 5}, 'name': 'test'}, 'score': 5}

~~~
arnarbi
That's not a solution as it will not restore the same object graph, it will
just repeat values.

One way is to store a table of objects (as identified by id()) encountered
during serialization, indexed by the order you encounter them. If you
encounter an object you have already serialized, serialize an index into that
table. On deserialization, construct the same kind of table, and deserialize
an index with a reference to the same object.

See e.g. AMF for an example format that does this:
[https://en.wikipedia.org/wiki/Action_Message_Format](https://en.wikipedia.org/wiki/Action_Message_Format)
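
The table-based approach described above could be sketched roughly as follows. This is an illustration only, not how Kim or AMF implement it: the `encode`/`decode` names and the `$ref`/`$val` markers are made up for the example, it handles only dicts (with string keys), lists and JSON scalars, and it uses plain recursion.

```python
import json


def encode(root):
    """Flatten a graph of dicts/lists into a table of containers.

    Every container is stored once; every mention of it becomes an index.
    """
    index_of = {}  # id(obj) -> position in `table`, in order of first encounter
    table = []

    def visit(obj):
        if isinstance(obj, (dict, list)):
            key = id(obj)
            if key not in index_of:
                slot = index_of[key] = len(table)
                table.append(None)  # reserve the slot before recursing
                if isinstance(obj, dict):
                    items = {k: visit(v) for k, v in obj.items()}
                    table[slot] = {"kind": "dict", "items": items}
                else:
                    items = [visit(v) for v in obj]
                    table[slot] = {"kind": "list", "items": items}
            return {"$ref": index_of[key]}
        return {"$val": obj}  # scalars are stored inline

    return {"root": visit(root), "table": table}


def decode(doc):
    """Rebuild the graph: create empty containers first, then fill them."""
    table = doc["table"]
    objects = [{} if entry["kind"] == "dict" else [] for entry in table]

    def resolve(node):
        return objects[node["$ref"]] if "$ref" in node else node["$val"]

    for obj, entry in zip(objects, table):
        if entry["kind"] == "dict":
            for k, v in entry["items"].items():
                obj[k] = resolve(v)
        else:
            obj.extend(resolve(v) for v in entry["items"])
    return resolve(doc["root"])


A = {"name": "a"}
B = {"name": "b"}
A["ref"] = B
B["ref"] = A

wire = json.dumps(encode(A))   # plain JSON, no cycles in the encoded form
A2 = decode(json.loads(wire))
```

Here `A2["ref"]["ref"] is A2` holds after decoding, so the object graph is restored rather than the values being repeated.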

------
ziikutv
So this takes JSON and maps it to namedtuples?

Silly question, what happens with Unicode?

------
rat87
I was just looking for something like this or marshmallow.

------
ff7c11
why not pickle :) :)

------
BuuQu9hu
Sorry, I must be harsh. No.

This fundamentally doesn't offer much advantage over a .toJSON() instance
method and a .fromJSON() class method.
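
For readers weighing that claim, a minimal sketch of the hand-rolled approach might look like this. The `Score` class and its fields are hypothetical, and the methods are spelled `to_json`/`from_json` to follow Python naming conventions.

```python
import json


class Score:
    """Hypothetical domain object used only for illustration."""

    def __init__(self, name, score):
        self.name = name
        self.score = score

    def to_json(self):
        """Serialize this instance to a JSON string."""
        return json.dumps({"name": self.name, "score": self.score})

    @classmethod
    def from_json(cls, text):
        """Build an instance from a JSON string, validating by hand."""
        data = json.loads(text)
        # These checks are the part a schema-driven library would automate.
        if not isinstance(data.get("name"), str):
            raise ValueError("name must be a string")
        if not isinstance(data.get("score"), int):
            raise ValueError("score must be an integer")
        return cls(name=data["name"], score=data["score"])


original = Score("test", 5)
copy = Score.from_json(original.to_json())
```

The trade-off is that the field list and the validation rules are repeated by hand in every class, which is the repetition that mapper-style libraries aim to remove.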

Don't say "security-focused" if you can't handle cyclic object graphs.

~~~
mafro
Please elaborate on the reasons for your opinion :)

