
Saving Millions by Dumping Java Serialization - jnewhouse
https://www.quantcast.com/blog/saving-millions-by-dumping-java-serialization/
======
hrshtr
Was using Thrift or Protobuf an option?

~~~
quest88
I'd like to know this too. As a passerby, it seems those have solved
serialization, so I'm curious why you need rowfiles instead of e.g. Protobuf.

~~~
hrshtr
One reason off the top of my head: using such a communication protocol would
require changes to the other services consuming it.

~~~
rst
So did switching to their homebrew serialization format -- in fact, most of
the article is about how they managed the changes (which touched codebases at
multiple sites in a fairly large organization).

~~~
jnewhouse
Those switches all occurred at the pipeline level, leaving the map-reduce
platform untouched. Switching our base logs to something like Parquet, Thrift
or Protobuf would be a much larger project. We do support writing and reading
Parquet to allow us to interface with other big data systems.

------
Cieplak
Some interesting benchmarks of various Java serialization libraries:

[https://github.com/eishay/jvm-serializers/wiki](https://github.com/eishay/jvm-serializers/wiki)

------
jnewhouse
Author here, let me know if you have any questions/want more details.

~~~
user5994461
So... what's quantcast?

~~~
jnewhouse
We're a big data advertising and measurement company based in San Francisco.
We run online display ad campaigns for marketers across real-time bidding
(RTB) exchanges, such as those run by Google and AppNexus. We also provide a publisher
product to give site owners insights into their audience. Stack Overflow's
profile is at
[https://www.quantcast.com/stackoverflow.com](https://www.quantcast.com/stackoverflow.com).

------
cntlzw
I am by no means an expert, but I always wonder why people don't adopt ASN.1
for serialization. I know it is not pretty, but writing machine-readable stuff
never is.

------
bluecarbuncle
Portable Object Format from 10 years ago?
[https://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_pof.htm#COHDG1367](https://docs.oracle.com/cd/E24290_01/coh.371/e22837/api_pof.htm#COHDG1367)

------
MS_Buys_Upvotes
Can someone explain to an amateur why binary serialization is faster than,
say, passing raw JSON?

It seems like parsing JSON would be faster than the serialize -> deserialize
process, but with the popularity of things like Protobuf it's clear that JSON
is slower.

~~~
ScottBurson
For one thing, all numbers can be written in binary, saving the lexing and
conversion time. For another, strings can be written by first writing the
length (in binary, of course), then writing the raw contents; there's no need
to scan the input looking for the closing quote, handle backslash escapes, or
do UTF-8 conversion.

That's probably most of the gain right there, but more things can be done
along those lines.
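
As a rough illustration (not from the article), here is that idea with the
JDK's standard data streams: numbers go out as fixed-width binary and strings
carry a length prefix, so the reader never scans for a closing quote,
unescapes backslashes, or converts text to numbers.

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    // Sketch only: length-prefixed binary encoding with DataOutputStream.
    public class LengthPrefixed {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);

            out.writeLong(42L);                         // number as 8 raw bytes
            byte[] s = "hello \"world\"".getBytes(StandardCharsets.UTF_8);
            out.writeInt(s.length);                     // length prefix
            out.write(s);                               // raw bytes, no escaping

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            long n = in.readLong();                     // no lexing or conversion
            byte[] t = new byte[in.readInt()];
            in.readFully(t);                            // consume exactly t.length bytes
            System.out.println(n + " / " + new String(t, StandardCharsets.UTF_8));
        }
    }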

~~~
chiph
And no whitespace or curly braces taking up room, so the serialized data is
smaller, and thus faster to transmit/store. Downside: Legibility? Future-
proofing? What's that?

~~~
coldtea
> _Downside: Legibility? Future-proofing? Whats that?_

There's nothing in this practice that is against future-proofing.

Legibility, yes, but those formats are not meant to be human readable.

~~~
chiph
Without the ability to future-proof being inherent in the format (like XML,
which is self-describing), the sad reality of development practices in
programming shops means that one day, someone will make an undocumented change
or take a shortcut that tightly couples the binary format to the specific
version of the code used to produce and read it. Which is fine, as long as you
know that coupling will happen when you're planning things. Not so fun when
you have to go back and read a 3-year-old file, only to discover that you
can't.

Something that comes to mind is the old COM formats that MS Office used to
use. Eventually they had to abandon them (and not just because of the EU
lawsuit) because they were unmaintainable, and no one understood how they
worked well enough to avoid breaking backwards compatibility in the next
release.

------
Alupis
> Secondly, Java serialization produces very bulky outputs. Each serialization
> contains all of the data required to deserialize. When you’re writing
> billions of records at a time, recording the schema in every record
> massively increases your data size.

Sounds to me like you shouldn't be storing objects in your database.

Why not just write the data into tables, and then create new POJOs when
necessary, using the selected data?

~~~
jnewhouse
A standard database table isn't large enough to handle our large datasets. For
example, the Hercules dataset was over 2 petabytes and even after optimization
is almost 1 petabyte. Big data systems like Spark, Impala, Presto, etc. are
designed to make the data look like a table, even though it is spread out into
many files in a distributed filesystem. This is what we do. It's pretty common
to reimplement some database features onto these big data file formats. In our
case we have very fast indexes that let us quickly fetch data, similar to an
index on a PostgreSQL table.

~~~
Alupis
Well, you understand your system and requirements better than I, obviously,
but...

    A standard database table isn't large enough to handle our large datasets

... isn't much of an answer as-to why you're storing objects in your database.

As you already mentioned in your post, serialized objects are big - they
contain all of their data, plus everything necessary to deserialize the object
into something usable.

I imagine your objects contain the standard mix of strings, characters,
numbers, booleans, etc. Why not just store those in the database and select
them back out when needed? Less data in the database, and faster retrieval,
since you skip serialization in both steps (storage and retrieval). Even if
you have nested objects within nested objects, surely you can write out a
"flat" version of the data to a couple of joined tables.

On the other hand, serializing the object is probably simpler to implement
and use... but then you get the classic tradeoff of performance vs.
convenience.

~~~
barrkel
What's "the database" that you have in mind?

Start out with the idea that you have hundreds of machines in your cluster,
with 1000s of TB of data. Suppose the current data efficiency is on the order
of 80% - that is, 80% of the 1000s of TB is the actual bytes of the data
fields. What database do you have in mind to store this data, still on the
order of 1000s of TB?

You say: a couple of joined tables. So you have hundreds of machines, and the
tables are not all going to fit on one machine; they're going to be scattered
across hundreds of machines each. How do you efficiently do a join across two
distributed tables?

It's no picnic.

If each row in one table only has a few related rows in the other table, it's
much, much better to store the related data inline. Locality is key; you want
data in memory right now, not somewhere on disk across the network.
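
A hypothetical sketch of that difference (class names invented for
illustration): with the related rows stored inline, one sequential read
yields a record and everything attached to it, whereas a normalized layout
needs a distributed join to reassemble the same data.

    import java.util.List;

    // Illustrative only: two ways to lay out the same data.

    // Normalized: child rows live in a separate, distributed table, so
    // reassembling a user's record means a join across machines.
    class EventRow {
        long userId;      // join key back to the user table
        long timestamp;
        String adId;
    }

    class UserRow {
        long userId;      // events must be fetched from elsewhere
    }

    // Denormalized: the few related rows are stored inline, so one
    // sequential read brings the whole record into memory.
    class UserRecord {
        long userId;
        List<EventRow> events;  // locality: the data travels with the record
    }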

------
jankotek
Perhaps I could share my project, which tries to 'fix' Java serialization?

It was originally part of a database engine, but was extracted into a
separate project. It solves things like cyclic references, non-recursive
graph traversal, and incremental serialization of large object graphs.

[https://github.com/jankotek/elsa/](https://github.com/jankotek/elsa/)

------
user123
TLDR: we had shitty code, optimized it, now it runs well. No code examples,
nothing.

~~~
jnewhouse
If you want more details: we were packing a Row class into a base64-encoded
string using an ObjectOutputStream. This is a fine thing for small-scale
serialization but sucks at scale, for the reasons mentioned in the post.
Sorry we don't have code examples, but it's unclear how useful they'd be
given that no one else uses our file format. For a bit more detail on how the
format works: each metadata block contains a list of typed columns defining
the schema of a given part. Our map-reduce framework has a bunch of internal
logic that tries to reconcile the written Row class with the one the Mapper
class is asking for. This allows us to do things like ingest different
versions of a row within the context of a single job. I think questions of
serialization at this scale are generally interesting, although ymmv. I know
of one company using Avro, which doesn't let you cleanly update or track
schema. They've ended up storing every schema in an HBase table and reserving
the first 8 bytes of each row to do a lookup into this table to know the
row's schema.
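
To make the pattern concrete, here's a generic sketch (not our actual Row
class or codec) of packing a serializable object into a base64 string. Every
encoded record carries the full Java class descriptor, which is what makes
this so bulky when repeated billions of times.

    import java.io.*;
    import java.util.Base64;

    // Sketch of the pattern described above, not the production code.
    class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        long timestamp;
        String url;
    }

    class RowCodec {
        static String encode(Row row) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(row);  // writes class metadata + field values
            }
            return Base64.getEncoder().encodeToString(buf.toByteArray());
        }

        static Row decode(String s) throws IOException, ClassNotFoundException {
            byte[] bytes = Base64.getDecoder().decode(s);
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (Row) in.readObject();
            }
        }
    }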

~~~
cakoose
What do you mean when you say Avro doesn't let you "cleanly update or track
schema"?

From what I've read about Avro:

1. It can transform data between two compatible schemas.
2. It can serialize/load schemas off the wire, so you can send the schema in
the header.

If schema serialization causes too much overhead, you can set things up so you
only send the schema version identifier, as long as the receiver can use that
to get access to the full schema.
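
A rough sketch of that version-identifier approach (the registry and framing
here are invented for illustration): prefix each record with a fixed-width
schema id and let the reader resolve it through a shared registry, rather
than shipping the full schema with every record.

    import java.nio.ByteBuffer;
    import java.util.Map;

    // Illustrative only: fixed-width schema-id framing.
    class VersionedRecord {
        static byte[] frame(long schemaId, byte[] payload) {
            return ByteBuffer.allocate(8 + payload.length)
                    .putLong(schemaId)       // 8-byte schema identifier
                    .put(payload)
                    .array();
        }

        static byte[] unframe(Map<Long, String> registry, byte[] framed) {
            ByteBuffer buf = ByteBuffer.wrap(framed);
            long schemaId = buf.getLong();   // recover the writer's schema id
            String writerSchema = registry.get(schemaId);
            if (writerSchema == null)
                throw new IllegalStateException("unknown schema " + schemaId);
            byte[] payload = new byte[buf.remaining()];
            buf.get(payload);
            return payload;                  // decode payload using writerSchema
        }
    }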

~~~
jnewhouse
I think what I'd heard about was likely a poorly implemented use of Avro. I
haven't actually worked with it.

