
Getting Started with Apache Avro - luu
http://idarlington.github.io/2017/Getting-started-avro/
======
rubyn00bie
Yeah... you're only going to save money if you're at an absolutely enormous
scale. I'd also really question that point when you can compress JSON as it's
being transmitted (GZIP, baby).

Next, I have yet to see somewhere... anywhere, outside the scale of something
like Uber or a high-performance game server, that really needs to worry about
serialization and deserialization of JSON. Almost all serialization libraries
are written in C for MAXIMUM PERFORMANCE regardless of the language you're
using, or use some sort of hand-tuned, magical bytecode fuckery that has the
same effect.

I honestly can't imagine this being useful except for obscenely
high-throughput service-to-service communication or as an internal
serialization protocol (for something like FoundationDB, where you're storing
the Avro-encoded instance to disk). Truly, JSON and REST are good enough for
99.999% of services, and the cost of using something non-standard will likely
never pay off given the added developer and maintenance overhead.

This just smells like tech debt and an architecture that can't be debugged.

~~~
nly
I'm toying with Avro right now for a project and our common JSON payload went
from 1400 to 70 bytes, which makes it much more suitable for use over UDP.

Also, using JSON in a statically-typed language is tedious and error-prone.

~~~
latchkey
How much of your data is repeated as part of a cyclical graph? You may want to
try jsog + compression.

[1] [https://github.com/jsog/jsog](https://github.com/jsog/jsog)

------
regnerba
If the author is around:

> Translates to less money saved on provisioning hardware.

I read that as it will cost more. Saving less money is bad.

I assume that is a typo?

------
bsder
Okay, so the article lays out (without much evidence) the areas where Avro
shines.

I'm more interested in where it _fails_. It's the places where a library
falls down that are likely to burn me when I'm 24 hours from shipping a
product.

~~~
sfev
We’ve used Avro in production for a number of years now. We use it to define
the schema for Kafka messages passing between parts of the system and it has
worked well - but you want some downsides, so..

Compatibility: Avro deals with this well, but there are gotchas, especially
when you need/want to have “forward compatibility” as well as “backward
compatibility” in your messages. Adding a new type to a union? Breaking
change. Adding a new element to an enum? Breaking change. Want to add a new
member to a type? Best give it a default, or.. breaking change. We use the
Confluent schema registry to manage our schemas on Kafka and we find ourselves
having to do carefully managed rollouts (stop all producers, put schema-reg
into compat mode, upgrade consumers, upgrade and start producers) every now
and again.
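
To make the default-value point concrete, here's a minimal sketch (the Order
record is made up, not one of our real schemas) using the Java library's
SchemaCompatibility helper to check that a field added with a default keeps
old data readable:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    public class CompatCheck {
        public static void main(String[] args) {
            // Illustrative only: the Order schema here is made up.
            // Old writer schema, as producers currently emit it.
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"}]}");

            // New reader schema with one extra field. Because it has a
            // default, records written with the old schema can still be
            // read; drop the default and it becomes the breaking change
            // described above.
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

            SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);
            System.out.println(result.getType());  // COMPATIBLE
        }
    }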

Performance: we frequently see Avro as the “hot” code in our apps - mainly I
suspect because we’ve got a fairly large schema with lots of union types,
etc., and our messages are up to 2 KB or so in size. This results in our
applications topping out at the low tens of thousands of messages per second,
though it scales linearly with the number of cores. The most recent Java
version has a “fast read” flag that promises to speed up our use case, but we
haven’t tried it yet.

~~~
nly
> Adding a new type to a union? Breaking change. Adding a new element to an
> enum? Breaking change.

This is incorrect. I've personally gone from type X to union {X,Y} in C++;
you just need to use the resolving decoder that takes both schemas. Enums can
also be resolved based on names. Check out the schema evolution rules here:

[https://avro.apache.org/docs/current/spec.html#Schema+Resolu...](https://avro.apache.org/docs/current/spec.html#Schema+Resolution)

Of all the binary serialization formats of this ilk (proto, thrift,
flatbuffers etc), Avro definitely has the most schema flexibility.
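
For anyone who wants to see it in code, here's a rough Java equivalent of
what I did in C++ (the Msg record is made up): give the datum reader both the
writer and reader schemas and it resolves the plain field into the union.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class UnionEvolution {
        public static void main(String[] args) throws Exception {
            // Illustrative Msg schema, not a real one from our system.
            // Writer schema: the field is a plain string (type X).
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
              + "{\"name\":\"payload\",\"type\":\"string\"}]}");

            // Reader schema: the same field widened to a union (union {X, Y}).
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
              + "{\"name\":\"payload\",\"type\":[\"string\",\"int\"]}]}");

            // Encode a record with the old schema.
            GenericRecord rec = new GenericData.Record(writer);
            rec.put("payload", "hello");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
            enc.flush();

            // Decode with both schemas: the reader resolves writer -> reader.
            GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<>(writer, reader);
            GenericRecord decoded = datumReader.read(null, DecoderFactory.get()
                .binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null));
            System.out.println(decoded.get("payload"));  // hello
        }
    }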

------
nly
The JSON schema format is tedious to write and difficult to read. I recommend
that anyone using Avro write their schemas in the IDL language[0] and use the
Java tool to convert them to the JSON schema format.

java -jar avro-tools-1.9.2.jar idl yourschema.avdl

[0]
[https://avro.apache.org/docs/current/idl.html](https://avro.apache.org/docs/current/idl.html)
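
For a rough idea of what the IDL looks like (the names here are made up), a
file like this is what you'd feed to the command above; there is also an
idl2schemata subcommand that emits one .avsc file per named type:

    // Illustrative types only.
    @namespace("com.example")
    protocol Example {
      record User {
        string name;
        union { null, string } email = null;
        array<string> tags = [];
      }
    }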

------
cmollis
When you're managing millions of transactions through Kafka that have to be
serialized to S3, read natively by Hive, and translated to Parquet by Spark,
while supporting nullable complex types (map, struct, array) and allowing
schema evolution forward and back, Avro is a good choice for your base data
format.
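
To sketch the Spark leg of that pipeline (bucket paths and the app name are
placeholders, and this assumes the spark-avro module is available as the
"avro" data source), the Avro-to-Parquet hop is roughly:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AvroToParquet {
        public static void main(String[] args) {
            // Paths and app name are placeholders.
            SparkSession spark = SparkSession.builder()
                .appName("avro-to-parquet")
                .getOrCreate();

            // Read the Avro files a Kafka sink wrote to S3.
            Dataset<Row> events = spark.read()
                .format("avro")
                .load("s3a://example-bucket/events-avro/");

            // Nullable complex types (map, struct, array) carry through to
            // the Parquet schema, and Hive can query either representation.
            events.write()
                .mode("overwrite")
                .parquet("s3a://example-bucket/events-parquet/");

            spark.stop();
        }
    }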

