
Publishing with Apache Kafka at The New York Times - rmoff
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
======
manigandham
The very definition of over-engineered. This is just event-sourcing turned
into a marketing article for Kafka.

It doesn't really matter what the "source of truth" system is, although Kafka
doesn't seem quite mature/stable enough for that. With so little data, they
could push it into a nice graph database instead, run it entirely in memory,
and meet 10x any demand they'll ever see along with all the query types
they'll need. Add in an Elasticsearch cluster on the side and the problem is
solved.

Any database can serve as a log for replay: as long as you save all the
versions, a replay is just a query.

~~~
vikiomega9
Unless I'm mistaken, if I were to build out a simple event log on top of a
relational DB, I'd have bottlenecks when writing to it and lag in processing
the events; and if I were also pushing those events to a queue to hydrate
aggregate snapshots, I'd need client logic to deal with duplicate events,
processed-but-unacked events, etc.?

Intuitively, I guess Kafka is "more realtime" and "more available" when
compared to the home-brew event log?

EDIT: obviously those constraints in my home-brew event log are relaxed when
my problem domain is amenable to things like associative operators,
idempotency, inverses etc.
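
To make the duplicate-handling point concrete, here's a minimal sketch of the
client-side de-duplication I'm imagining under at-least-once delivery (the
Event and Snapshot types are hypothetical, not tied to any particular queue):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    record Event(String eventId, String payload) {}

    class Snapshot {
        final StringBuilder state = new StringBuilder();
        void apply(Event e) { state.append(e.payload()); }
    }

    class SnapshotHydrator {
        private final Set<String> seen = new HashSet<>();

        // Apply each event at most once, even if the queue redelivers it
        // because an ack was lost or a consumer crashed mid-batch.
        void hydrate(List<Event> batch, Snapshot snapshot) {
            for (Event e : batch) {
                if (seen.add(e.eventId())) { // false => duplicate, skip it
                    snapshot.apply(e);
                }
            }
        }
    }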

~~~
cookiecaper
There's no reason not to make good use of Kafka or similar solutions. The
issue is that people use it without understanding it. In this article, they
say that Kafka is their system of record and their primary long-term storage.
That's very silly.

~~~
subsubsub
Why is it silly to use it as a system of record or for long-term storage?

------
jdcarter
FWIW, the article mentions the book "Designing Data-Intensive Applications" by
Martin Kleppmann. I want to throw out my own endorsement for the book: it's
been instrumental in helping me design my own fairly intensive data pipeline.

~~~
teej
Dear HN reader - if you're not quite ready to buy the book, take a listen to
this episode of Software Engineering Daily
([https://softwareengineeringdaily.com/2017/05/02/data-
intensi...](https://softwareengineeringdaily.com/2017/05/02/data-intensive-
applications-with-martin-kleppmann/)). It will give you a sense of what Martin
Kleppmann is all about and how he thinks about problems. I ordered my copy of
"Designing Data-Intensive Applications" after listening to this episode.

------
pm90
Excellent, well-written article. The key takeaway seems to be that, since the
number of news articles (and associated assets) is finite and cannot explode,
they store all the "logs" forever instead of using a temporary event stream
(I'm using the term log as it's defined in the article, as a unit of a
time-ordered data structure).

I wonder if NYT can help other news websites by making their code open source?
I'm a huge fan of NYT and their jump to digital has just been amazing.
However, I would also like my local newspaper (which covers more regional
news) to be able to serve quality digital content.

~~~
knowtheory
> _I wonder if NYT can help other news websites by making their code open
> source?_

Hey! I, and a number of other news nerds, have been encouraging FOSS for the
past decade or so. And in fact a number of major open source projects have
come out of news-related projects, including Django,
Backbone.js/Underscore.js, Rich Harris's work on Svelte, and a whole lot
more.

Most often the problem with local news organizations is operational
constraints. The news biz has seen a huge downturn over this same period of
time. Most orgs, both on the reporting side and on the tech side, are super
strapped for people-time.

It's not enough to have FOSS software; you also have to have folks doing
devops and maintaining systems, often at below-market salaries.

------
cturner
This is a flawed architecture. It will work at release, but it will be
difficult to manoeuvre with, and they will grow to hate it.

As your business changes, your data changes. Imagine if on day one, they had
one author per article. On day 1000, they change this to be a list of authors.

Kafka messages are immutable. Each of those green boxes on the right hand side
of the first diagram will need to have special-case logic to unpack the kafka
stream, with knowledge of its changes (up until 17 May 2017, treat the data
like this, but between then and 19 May 2017 do x, and after that do y).
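
Whether you key the branches off dates or off an embedded version number, the
consumer ends up carrying every historical case forever. A minimal sketch of
what I mean, assuming a hypothetical schemaVersion field on each message:

    import java.util.List;

    // Hypothetical message shape: early messages had a single author string,
    // later ones a comma-separated list. Both live in the log forever.
    record ArticleMessage(int schemaVersion, String authorField) {}

    class ArticleUnpacker {
        List<String> authors(ArticleMessage m) {
            if (m.schemaVersion() < 2) {
                // Day-one messages: exactly one author per article.
                return List.of(m.authorField());
            }
            // Post-change messages: a list of authors.
            return List.of(m.authorField().split(","));
        }
    }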

Document pipelines are one of the rare contexts where XML is the best choice.
They should have defined normalised file formats for each of their data
structures. Something like the gateway on the left of the first diagram would
write files in that format. (At some future time, they will need to modify
the normalised formats. Files are good for that. You can change the gateway
and your stored files in coordination.)

Secondly, they should have a gateway coming out of the file store. For each
downstream consumer, they should have a distinct API.

These APIs might look the same on the first day of release. But they should be
separate APIs so that you are free to refactor them independently.

When you have a one-to-one API relationship, you can negotiate significant
refactors in a single phone call. When you have more than one codebase
consuming, you need to have endless meetings and project managers. I call
this, "The Principle of Two."

Some of the other comments here say that they should have used databases. So
far, they have not made the case for it. And databases are easily abused in
settings like this one. People connect multiple codebases to them, and use SQL
as a chainsaw. Again, you can't negotiate changes.

When you create a system, your data structures are the centre of that system.
You need to do everything you can to keep your options open to refactor them
at a later time, and to do so in a way that respects APIs that you are
offering your partners.

Kafka is a good tool. If used well, your deployment design will stop your
system regularly (e.g. every day), nuke the channels, recreate them from
scratch, and restart your system against these empty channels. You shouldn't
use it as a long-term data store.
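
A sketch of that kind of reset, using the admin client (topic name and sizing
are only illustrative):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Properties;

    public class NightlyReset {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Nuke the channel...
                admin.deleteTopics(List.of("work-queue")).all().get();
                // ...and recreate it empty. (Deletion finishes asynchronously
                // on the brokers, so a real script would wait or retry here.)
                admin.createTopics(
                        List.of(new NewTopic("work-queue", 8, (short) 3)))
                     .all().get();
            }
        }
    }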

~~~
amenod
> Kafka messages are immutable. Each of those green boxes on the right hand
> side of the first diagram will need to have special-case logic to unpack the
> kafka stream, with knowledge of its changes (up until 17 May 2017, treat the
> data like this, but between then and 19 May 2017 do x, and after that do y).

I respectfully disagree. The genius of this approach is that you can run a
transformation over the original Kafka stream to change its schema and
prepare the new feed. Once you are satisfied with the results and have
switched all subscribers to the new feed, just turn off the old one. Voila -
you only have y.
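
A minimal sketch of that migration with the plain Java clients; the topic
names and the transform step here are placeholders, not anything from the
article:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class FeedMigrator {
        // Placeholder for the actual old-schema -> new-schema rewrite.
        static byte[] transform(byte[] oldValue) { return oldValue; }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "feed-migrator");
            props.put("auto.offset.reset", "earliest"); // replay from the start
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

            // One Properties object for both clients, just to keep this short.
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
                consumer.subscribe(List.of("monolog-v1"));
                while (true) { // a real migrator would stop at the old topic's end
                    ConsumerRecords<byte[], byte[]> records =
                        consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<byte[], byte[]> r : records) {
                        producer.send(new ProducerRecord<>(
                            "monolog-v2", r.key(), transform(r.value())));
                    }
                }
            }
        }
    }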

> This is a rare case where use of XML makes sense.

Sorry, but no. Just no.

~~~
btgeekboy
I still don't understand the hatred around XML. Is it slightly verbose? Yes.
Does it support lots of neat functionality that makes it great for
interoperating between systems, like validations and transformations? Yep.
Sure, it's possible to go full architecture astronaut with it, but you can do
that with pretty much any programming language.

Meanwhile, I'm just sitting over here wondering whether my YAML file is
supposed to have certain indents here or not, "-" or not, and trying to go
figure out which magic incantation I need to get it to handle a multi-line
string the way I'm expecting.

~~~
annnnd
I think the parent was objecting to the usage of XML in this particular case,
which is not appropriate. XML has many strengths (as you have outlined), but
it has also been (mis/ab)used so many times that it gained a bad reputation.
What my GGP suggests is an example of such misuse. There is nothing to gain
from XML here that any proper DB (with a schema) wouldn't offer, or, in this
case, protobuf.

Kafka logs, however, are solving a different problem. The mental model is
different - they do not record state, but the whole history of transactions,
which makes it trivial to change the schema if/when the need arises. Saying
that the schema should be thought out in advance and shouldn't change is not
realistic IMHO.

------
toomim
> Traditionally, databases have been used as the source of truth ... [but] can
> be difficult to manage in the long run. First, it’s often tricky to change
> the schema of a database. Adding and removing fields is not too hard, but
> more fundamental schema changes can be difficult to organize without
> downtime.

This argument sounds self-contradictory. Kafka doesn't let you change its
schema at all! At least postgres gives you the option.

It seems that the author is excited about having a single source of truth that
doesn't change, and didn't realize that he could do that with a database, if
he just never used the schema-changing features.

Am I missing something? It seems like the author could be totally happy with a
bunch of derived postgres databases sitting in front of a "source of truth"
database, where he never changes the source of truth database's schema.

Why use kafka?

~~~
BenoitEssiambre
I tend to be the one arguing this (stick to postgres for most things), but
even I will admit it does depend on scale.

I'm not sure what the NYT's requirements are, but from my understanding of
Kafka, its persistent, redundant, distributed queues scale horizontally
across machines automatically to support colossal amounts of data. It's
possible that they had difficulty fitting everything in a postgres instance.

~~~
theossuary
See, that's where I'm confused. I'm no Kafka expert, but they say they use a
"single-partition topic", which I believe means the only way they can
replicate the data is by replicating the entire log; they can't shard because
it's a single partition. The reasoning behind this is that Kafka doesn't
support ordering between partitions.
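
For what it's worth, here's a sketch of what creating such a topic looks like
with the Java AdminClient (numbers are illustrative, not the NYT's): one
partition gives a total order over all messages, and the replication factor,
rather than sharding, is what provides redundancy, so every replica has to
hold the whole log on one disk.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Properties;

    public class CreateSinglePartitionTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 1 partition => global ordering; replication factor 3 =>
                // three full copies of the log, each on a single disk.
                NewTopic monolog = new NewTopic("monolog", 1, (short) 3);
                admin.createTopics(List.of(monolog)).all().get();
            }
        }
    }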

Also, I've never thought of Kafka as a persistent data storage solution; it's
interesting that Confluent is supporting Kafka being used in this way.

~~~
tlberglund
You are not even a little bit confused! I think you have it perfectly right.
And you are not alone in not thinking of Kafka as persistent storage, but when
you get down to it there is no reason not to, and people are indeed using it
in just that way. And yes, Confluent does give its +1 to this practice. :)

------
tabeth
I wonder how much of this kind of stuff exists out of necessity and how much
of it exists because very smart people are just bored and/or unsatisfied.

Are there any articles that supplement this one and explain how much business
value is added/lost by the existence/removal of these kinds of features? In
the case of the NYT, I suspect its popularity is maintained because of the
perception (real or not) of high-quality journalism, in spite of any
technical failings.

---

How much would be lost if the NYT were just implemented as text articles that
are cached and styled with some CSS? "Personalization" could be added via
tags on each article and a small component that shows the three most recent
articles that share the same tag.

~~~
cookiecaper
> I wonder how much of this kind of stuff exists out of necessity and how much
> of it exists because very smart people are just bored and/or unsatisfied.

That's a ton of it. Like it or not, publishing a digital newspaper is _not_ a
hard or unsolved problem; it's one of the web's core competencies. If you hire
people who want to build cool stuff to supervise a CMS, well, you get this
kind of outcome.

The raw cost is understated because these experimental setups misinterpret the
functionality of the new architectures/formats they're using. It doesn't truly
rear its ugly head until there is a major data loss or corruption event. It's
not that these _never_ happen with an RDBMS; it's just that an RDBMS
contemplates this possibility and tries to make it pretty hard to do, whereas
message queues just automatically delete stuff (_by design_, so they can
serve as functional message queues!).

RDBMS have spoiled us and we take their feature set, 40+ years in the making,
for granted. We need to be careful and not assume that `GROUP BY` is the only
thing we leave on the table when we "adopt" (more accurately _abuse_) one of
these new-wave solutions as a system of record.

Since no one is going to admit to their boss "this wouldn't have happened if
we used Postgres", and since most bosses are not going to know what that
means, most of these spectacular failures will never be accurately attributed
to their true cause: developers putting their interest in trying new things
above their duty to ensure their employer's systems are reliable, stable, and
resilient.

~~~
awinder
There are non-negligible problems in the news space, like:

      1. Supporting full-text search for a fair number of concurrent users
      2. Availability of the system with minimal downtime
      3. Scalability within the day and year; traffic around e.g. breaking news events will far surpass 2AM traffic
      4. Notifications

I could go on and on but honestly, it's just a tone-deaf response.

Parting pot-shot: "No one is going to admit to their boss that the reason a
worldwide news organization can't publish any stories is that their one
postgres master node went down, or is waiting on a state transfer to a
fallback master."

~~~
cookiecaper
"We can't publish right now because the database had to enter an unplanned
maintenance period" is a lot different from "our authoritative archive is gone
and we have to try to rebuild it from all these separate 'materialized views',
woops."

------
look_lookatme
_This is very similar to a normalized model in a relational database, with
many-to-many relationships between the assets._

_In the example we have two articles that reference other assets. For
instance, the byline is published separately, and then referenced by the two
articles. All assets are identified using URIs of the form
nyt://article/577d0341-9a0a-46df-b454-ea0718026d30. We have a native asset
browser that (using an OS-level scheme handler) lets us click on these URIs,
see the asset in a JSON form, and follow references. The assets themselves
are published to the Monolog as protobuf binaries._

When consuming this data, do you have to programmatically do relationship
fetching on the client side, or is eager loading/joins available in some way
in Kafka?

Additionally, there seems to be a focus on point-in-time views of this data,
but are you able to construct views using arbitrary values/functions? Let's
say each article is annotated with some geo data; can you construct regional
versions of these materialized views of articles at the Kafka level? If not,
it seems like you are pushing a fair amount of sophisticated behavior that
already exists at the RDBMS level up into custom-built application servers.

------
iooi
> Because the topic is single-partition, it needs to be stored on a single
> disk, due to the way Kafka stores partitions. This is not a problem for us
> in practice, since all our content is text produced by humans — our total
> corpus right now is less than 100GB, and disks are growing bigger faster
> than our journalists can write.

Before this line, the author mentions that they also store images. There's no
way that all their text + images is <100GB, right? Something is inconsistent
here.

~~~
kod
More than likely they store references to images in kafka, with the actual
image bytes being in a different store.

------
oliveralbertini
Interesting. We use RabbitMQ instead of Kafka, and we have a re-indexation
system... not sure if it's more complex than what I see here.

------
khlbrg
I also recommend the talk from Kafka summit. [https://kafka-
summit.org/sessions/source-truth-new-york-time...](https://kafka-
summit.org/sessions/source-truth-new-york-times-stores-every-piece-content-
ever-published-kafka/)

------
LaFolle
If all articles are put into the Monolog, what is the procedure to fetch all
articles published in, let's say, the year 1857? Will that be an O(n)
operation (assuming all messages published to the Monolog have a timestamp
field)?
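
In other words, a sketch of the O(n) version I have in mind: replay the log
once, filter or index by year, and answer later queries from the derived view
instead of rescanning (the Article shape here is made up):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    record Article(int publicationYear, String uri) {}

    class YearIndex {
        private final Map<Integer, List<Article>> byYear = new HashMap<>();

        // Called once per message while replaying the log front to back: O(n).
        void index(Article a) {
            byYear.computeIfAbsent(a.publicationYear(),
                                   y -> new ArrayList<>()).add(a);
        }

        // After the replay, this lookup is O(1) instead of another full scan.
        List<Article> publishedIn(int year) {
            return byYear.getOrDefault(year, List.of());
        }
    }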

------
aug_aug
Figure 3: The Monolog, containing all assets every published by The New York
Times.

------
mehh
Why not just have the 'logs' in S3 in sorted/indexed buckets?

------
qaq
Resume driven architecture taken to extreme.

------
mateuszf
Isn't this just event sourcing?

~~~
arthurk
Yes it is, but for some reason they called it "Log-based architecture" in the
article

~~~
eropple
They call it that because Kafka is a log-based datastore.

------
cookiecaper
> We need the log to retain all events forever, otherwise it is not possible to
> recreate a data store from scratch.

 _SIGH_. Cue the facepalm, head in hands, etc.

I'm not going to get into a big thing here. But if you find yourself saying "I
need to keep this thing forever no matter what" and then you try to use
something that even entertains the notion of automatic eviction/deletion
semantics as the system of record, _you're doing it wrong_.

Not to burst the bubble of the techno-hipsters, but Kafka is "durable"
relative to message brokers like RabbitMQ, not relative to a system _actually
designed_ to store decades of mission-critical data. Those systems are called
"RDBMS".

Elsewhere in the article he says that they have less than 100GB of data and
that it's mostly text. This is massive overarchitecture that isn't even
covering the basic flanks that it thinks it is, such as data permanence.

I would really like to read the article that discusses why Postgres or MySQL
_couldn't_ have served this purpose equally well.

~~~
bognition
OOC, what makes an RDBMS more durable than Kafka? Both of them are systems for
representing data on disk. I'd love to hear why one representation system is
better at disaster recovery than another.

~~~
cookiecaper
In Postgres, I never have to worry that the server will be accidentally loaded
with `retention.bytes` or `retention.ms` set too low and, as a result, choose
to delete everything in the database, generating a wholly artificial
"disaster" that can result in long periods of disruption or downtime (at a
minimum; the worst case is permanent data loss).

It is true that someone could issue `DROP DATABASE`, `rm -rf` the filesystem
on the database server, and so forth, so my point is not that other systems
are invincible. It's just that a properly-configured RDBMS is designed to take
data integrity extremely seriously and provides numerous failsafes and
protective mechanisms to try to ensure that any data "loss" is absolutely
intentional.

On a RDBMS, things like foreign key constraints prevent deletion of dependent
records, mature and well-defined access control systems prevent accidental or
malicious record alteration, concurrency models and transactions keep data in
a good state, etc. Kafka, on the other hand, is designed to automatically and
silently delete/purge data whenever a couple of flags are flipped.
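
To be concrete, those flags are ordinary topic-level settings. A sketch
(values illustrative, topic name made up) of pinning them so nothing is ever
evicted, which is also exactly how easily someone could flip them back:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class RetainForever {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic topic = new NewTopic("system-of-record", 1, (short) 3)
                    .configs(Map.of(
                        "retention.ms", "-1",       // never delete by age
                        "retention.bytes", "-1"));  // never delete by size
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }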

That is not a flaw in Kafka itself; it's designed to do that so that you don't
have to interrupt your day and purge expired/old/processed data all the time.
It's a flaw in architectures that misinterpret Kafka's log paradigm as a
_replacement_ for a real data storage/retrieval/archive system.

I've had this argument countless times with people who've tried to use
RabbitMQ as a system of record (if only for a few minutes while the messages
sat in queue). There's just some fundamental disconnect for a lot of
developers where they don't understand that something accepting the handoff
doesn't mean that the data is inherently safe.

~~~
lima
Kafka is a fine replacement for an RDBMS if it fits your particular use case.
It has very strong data consistency guarantees - stronger than most RDBMS - if
you configure it properly (acks=all et al). It won't even lose data if the
leader of a partition crashes during a commit.

It has been explicitly designed for these use cases and even has features like
compaction:

[https://kafka.apache.org/documentation/#compaction](https://kafka.apache.org/documentation/#compaction)
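
As a sketch of what compaction gives you on a topic created with
cleanup.policy=compact: republish an asset under the same key and, after
compaction runs, only the latest record for that key survives (the topic name
is made up; the key reuses the URI scheme quoted from the article):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class PublishAssetVersion {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props)) {
                String key = "nyt://article/577d0341-9a0a-46df-b454-ea0718026d30";
                // After compaction, only the "v2" record remains for this key;
                // records with other keys are untouched.
                producer.send(new ProducerRecord<>("assets-compacted", key, "v1"));
                producer.send(new ProducerRecord<>("assets-compacted", key, "v2"));
            }
        }
    }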

Now, I agree with you that _in most cases_, using Kafka as your primary data
store instead of an RDBMS is madness - but that doesn't mean it's a bad idea
in general.

~~~
dreamfactored
Isn't that what 'in general' means?

