
Importing JSON into Hadoop via Kafka - jonbaer
https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/
======
mpd
> it is impractical to transport data in a binary format that is unparseable
> without extra information (schemas)

Well, Avro embeds the schema with the messages, and deserializing the message
will default to using that schema.
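
For Avro data files at least, that default is literal: the writer schema sits
in the file header, and the reader falls back to it when you don't hand it
one. A minimal sketch with fastavro (the record name and fields are invented
for illustration):

    import fastavro

    # hypothetical record type, purely for illustration
    schema = {
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }

    # Writing embeds the schema in the file header.
    with open("pageviews.avro", "wb") as out:
        fastavro.writer(out, schema, [{"url": "/wiki/Kafka", "ts": 1484265600}])

    # Reading takes no schema argument; the reader defaults to the
    # writer schema it finds in the header.
    with open("pageviews.avro", "rb") as f:
        reader = fastavro.reader(f)
        print(reader.writer_schema)  # the embedded schema
        for record in reader:
            print(record)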

I think JSON is great for a one-off, but I'd hate to be on a team that doesn't
schema their data exchange formats.

~~~
andrewmccall
Do you embed the schema with every Kafka message? Otherwise you need some
out-of-band method to distribute schema updates with Avro.

~~~
WWLink
No. I think the ideal setup here is to send JSON over Kafka and store the
data in Avro files. The Avro files have the schema at the start.
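
A rough sketch of that pipeline with kafka-python and fastavro (the broker,
topic, and schema are all invented for illustration):

    import json
    import fastavro
    from kafka import KafkaConsumer  # kafka-python

    # hypothetical broker and topic
    consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")

    # hypothetical schema matching the JSON messages
    schema = {
        "type": "record",
        "name": "Event",
        "fields": [
            {"name": "url", "type": "string"},
            {"name": "ts", "type": "long"},
        ],
    }

    # Consume plain JSON off the wire and batch it up...
    batch = []
    for msg in consumer:
        batch.append(json.loads(msg.value))
        if len(batch) >= 10000:
            break

    # ...then land it as an Avro container file, which carries the
    # schema in its header for whoever reads the data later.
    with open("events.avro", "wb") as out:
        fastavro.writer(out, schema, batch)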

I wonder what they're using to retrieve that data for analysis later on. I do
something very similar to this, but having to sift through millions of
messages for a given time period to find a subset of them is kinda annoying.

It's a good thing they didn't use Confluent Camus. -shudder- It supports
Avro-over-Kafka out of the box, with the caveat that every single time it
reads a message off Kafka, it pings the schema registry to get the schema for
it. That's great and all, until you've got thousands of messages per second.
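
The usual fix for that is caching the lookup by schema ID, so you pay one HTTP
round trip per schema instead of per message. A sketch against Confluent's
GET /schemas/ids/{id} endpoint (the registry address here is made up):

    import functools
    import json
    import urllib.request

    REGISTRY = "http://schema-registry:8081"  # hypothetical address

    @functools.lru_cache(maxsize=1024)
    def fetch_schema(schema_id: int) -> str:
        # Cached by schema ID: one registry round trip per schema,
        # not one per message read off Kafka.
        url = f"{REGISTRY}/schemas/ids/{schema_id}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["schema"]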

~~~
lacksconfidence
> I wonder what they're using to retrieve that data for analysis later on. I
> do something very similar to this, but having to sift through millions of
> messages for a given time period to find a subset of them is kinda
> annoying.

It looks like Hive or Spark, depending on the use case. The data is also
loaded into Druid for aggregate statistics, rather than for pulling full
detail on individual messages.
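
On the sifting problem: Camus lands data in time-bucketed directories, so a
Hive or Spark query can prune down to the hours you care about instead of
scanning everything. A PySpark sketch (the path layout and field name are
hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("events").getOrCreate()

    # Hourly directories mean a time-range query only touches the
    # matching partitions (this layout is hypothetical).
    df = spark.read.json("hdfs:///camus/events/hourly/2017/01/13/*")
    df.where("event_type = 'pageview'").show()  # hypothetical field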

> It's a good thing they didn't use Confluent Camus. -shudder- It supports
> Avro-over-Kafka out of the box, with the caveat that every single time it
> reads a message off Kafka, it pings the schema registry to get the schema
> for it. That's great and all, until you've got thousands of messages per
> second.

They are using Camus; much of the post is dedicated to it. It looks like they
are also running Avro over Kafka+Camus for some application logging, but at a
lower volume (~10k messages/sec peak).

~~~
WWLink
IIRC, Confluent has their own version of Kafka/Camus that uses a schema
registry, where the first few bytes of each Kafka message identify the
schema.
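
That framing is the Confluent wire format: one zero magic byte, a 4-byte
big-endian schema registry ID, then the Avro-encoded body. Pulling it apart
is a few lines:

    import struct

    def split_confluent_frame(raw: bytes):
        # magic byte 0x00, 4-byte big-endian schema ID, Avro body
        magic, schema_id = struct.unpack(">bI", raw[:5])
        if magic != 0:
            raise ValueError("not a Confluent-framed message")
        return schema_id, raw[5:]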

The Wikimedia article sounds like they're just using regular Camus with their
own decoder. That would perform a bit better :) I still wonder why they
didn't just write a Spark job to do the same thing.
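
For what it's worth, the Spark version can be short. A Structured Streaming
sketch (broker, topic, and paths invented; it also needs the spark-sql-kafka
package on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Land the raw JSON payloads on HDFS in timed micro-batches,
    # roughly what Camus does with its scheduled MapReduce runs.
    (stream.selectExpr("CAST(value AS STRING) AS value")
           .writeStream
           .format("text")
           .option("path", "hdfs:///data/events")
           .option("checkpointLocation", "hdfs:///checkpoints/events")
           .trigger(processingTime="10 minutes")
           .start()
           .awaitTermination())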

