
Billions of Messages a Day – Yelp's Real-Time Data Pipeline - justinc-yelp
http://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html?hn=hn
======
z3t4
100 million reviews/year is only about 3 reviews per second on average. Sure, they
seem to do more than just that, like voting, comments, etc. But it still seems
like something an old-school stack could handle on a single large instance.

Reading between the lines, it seems the problem wasn't scaling but programmer
productivity. Smaller code bases are often easier to work with, so I guess they
solved that by dividing it up into many small services. The blog could use a
more detailed description of the problem they are actually solving.

~~~
eellpp
Applying the 80/20 principle: 80% of traffic occurs in 20% of the year's time.
That works out to around 12 reviews per second. Still modest.
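
Both rates can be checked with a quick back-of-envelope calculation (a sketch; the 100 million reviews/year figure and the 80/20 split are the commenters' assumptions):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~31.5 million seconds

reviews_per_year = 100_000_000

# Flat average rate across the whole year.
avg_rate = reviews_per_year / SECONDS_PER_YEAR

# 80/20 skew: 80% of the reviews arrive within 20% of the time.
peak_rate = (0.8 * reviews_per_year) / (0.2 * SECONDS_PER_YEAR)

print(f"average: {avg_rate:.1f}/s, peak: {peak_rate:.1f}/s")
# average: 3.2/s, peak: 12.7/s
```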

~~~
NhanH
Not entirely on topic, but is the 80/20 principle recursive?

~~~
thesimpsons1022
20 percent of it is 80 percent recursive

------
markpapadakis
I’ve come to realise that no matter what your data-engine choice is (Storm,
Spark, Flink, Dataflow, Ruby or Bash scripts, whatever), it is extremely
beneficial to persist incoming data to a distributed log first.

Even if all you want to do is accept events and immediately persist them to
some data store (Cassandra, MySQL, text files, etc.), you’re far better off
first publishing them to a distributed log, then consuming from it and
persisting to, say, a MySQL or Cassandra cluster.

You decouple the data flow from processing, which means you don’t need to
attach your firehose directly to the processing systems. You can just accept
events as they come and deal with them later, if ever. In fact, many distinct
processing engines can each, asynchronously and independently, access and scan
those previously collected event streams, and they can do so at whatever pace
makes sense (i.e., depending on how fast they can process messages/events).

You are probably familiar with the log abstraction anyway. See
[https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying)
and
[http://www.confluent.io/blog/stream-data-platform-1/](http://www.confluent.io/blog/stream-data-platform-1/)

Using Kafka as a core infrastructure technology, and persisting all incoming
and generated messages/events to it, should be the default strategy. It’s also
extremely unlikely you will hit service capacity limits: in Kafka, or a
Kafka-like service, publishing is mostly about appending data to a file, and
consuming is mostly about streaming (sequentially scanning) from files, all
very fast, low-overhead operations.
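
The decoupling described above (one append-only log, many independent consumers, each reading at its own pace) can be sketched as a toy in-memory model; this is hypothetical illustration code, not any real Kafka client API:

```python
class Log:
    """Toy append-only log: producers append, consumers scan sequentially."""
    def __init__(self):
        self.entries = []

    def append(self, event):
        self.entries.append(event)
        return len(self.entries) - 1  # offset of the new entry


class Consumer:
    """Each consumer owns its offset, so it can lag behind or replay freely."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events=10):
        batch = self.log.entries[self.offset:self.offset + max_events]
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append({"event": i})  # the "firehose" just appends and moves on

fast = Consumer(log)   # e.g. a process persisting to MySQL/Cassandra
slow = Consumer(log)   # e.g. a batch analytics job, running later

assert fast.poll(10) == [{"event": i} for i in range(5)]
assert slow.poll(2) == [{"event": 0}, {"event": 1}]  # reads at its own pace
```

The point of the sketch: appending and sequential scanning are cheap, and nothing about the log forces the consumers to keep up with each other, which is exactly the decoupling the comment argues for.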

Shameless plug: if you want a Kafka-like service with, currently, fewer
features but better performance, far fewer requirements, and a standalone
operation mode (no need for ZooKeeper), you may want to check out
[https://github.com/phaistos-networks/TANK](https://github.com/phaistos-networks/TANK)

Anyway, I can’t recommend Kafka and investing in logs enough. Also, Kafka
Streams is very elegant, likely based on Google’s Dataflow design and
semantics. If you are using Kafka or plan to use it, you may want to evaluate
it and adopt it over heavier-footprint, more complicated systems (e.g. Storm
or Spark).

~~~
discordance
... Kafka is so good Microsoft paid $26B for it :)

------
pixelmonkey
Is Yelp still using pyleus and Apache Storm? Or have they migrated to Spark
and Kafka Streams?

~~~
justinc-yelp
Most of the stream processing in the Data Pipeline happens inside of an
internal project called PaaStorm, which is Storm-like. It was built to take
advantage of our platform as a service
([http://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html](http://engineeringblog.yelp.com/2015/11/introducing-paasta-an-open-platform-as-a-service.html)),
which handles process scheduling really well. Architecturally, it's pretty
similar to Samza, with distributed processes communicating using Kafka.

We do use Spark Streaming, and are starting to use Kafka Streams and Dataflow
where they're a better fit. I'm personally most excited about Beam/Flink.
We'll probably end up replacing the PaaStorm internals with some other tool,
when one with good Python support matures. Beam's event-time handling and
windowing seem really promising at this point.
[https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)
is a great overview of the different concerns for stream processing.

~~~
ricardobeat
Hi Justin! Thanks for sharing, very interesting stuff.

How do you scale Kafka to handle the massive amount of traffic (and storage)
that you seem to generate daily?

With services talking among themselves via HTTP there is a lot of resilience
built-in. Do you have anything in place to avoid this becoming a single point
of failure? It must have become the most critical piece of your infra.

~~~
poooogles
Scaling Kafka is pretty simple; the operations document covers most things
you'll need to get started [1].

We push 500k documents a second through 10 hosts (6 cores / 24 GB RAM each)
pretty uneventfully. The only real pointers are to size ZooKeeper
appropriately and to leave plenty of memory for the file system cache.

1 - [https://kafka.apache.org/documentation.html#operations](https://kafka.apache.org/documentation.html#operations)

------
blergh999
I don't understand the following:

"Yelp passed 100 million reviews in March 2016. Imagine asking two questions.
First, “Can I pull the review information from your service every day?” Now
rephrase it, “I want to make more than 1,000 requests per second to your
service, every second, forever. Can I do that?” At scale, with more than 86
million objects, these are the same thing."

Who is making 1000 requests per second to retrieve all of the 100 million
reviews? Other services within Yelp? Why would they pull all reviews every
time, and not just the reviews they haven't already processed?

How is 1000 requests per second the same thing as pulling all review
information every day?

This section is really confusing and needs some clearer explanation.

~~~
artpepper
I think it's saying that a single iteration over the entire set would
translate to more than 1,000 requests per second sustained for a day (if done
naively as one request per object). It's really talking about the N+1 problem.
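
The arithmetic behind that reading (a quick sketch, using the article's 100 million figure):

```python
objects = 100_000_000        # reviews, per the article (March 2016)
seconds_per_day = 24 * 3600  # 86,400

# One request per object, naively, spread over a single day:
rate = objects / seconds_per_day
print(round(rate))  # 1157, i.e. "more than 1,000 requests per second"
```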

~~~
vhost-
This is how I read it too. I think the OP was taking it too literally; they
were just saying that the two are essentially equivalent at that scale.

------
orliesaurus
This is a nice blog post. It would also be nice to read one explaining how you
"judge" whether a review is fake or not. I have heard so many times from
small business owners that legitimate reviews get hidden/deleted from their
page. I wonder if it's an algorithm or a "humanized" process with lots of
mechanical turks :) (Not in detail, of course; we don't want people to game
your algos.)

------
overcast
"In 2011, Yelp had more than a million lines of code in a single monolithic
repo"

Did I read that correctly? The Yelp product is a MILLION lines of code?

For comparison, the Doom 3 source code has 601,000.

~~~
deadmutex
It's not that surprising. A rough estimate (I could be way off):

100 engineers x 50 lines of code per engineer per week x 50 weeks per year (2
week vacation) x 4 years = ~1 million LOC
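
The estimate multiplies out exactly (a sketch; all four numbers are the commenter's assumptions):

```python
engineers = 100
net_loc_per_week = 50   # net lines per engineer per week
weeks_per_year = 50     # 52 weeks minus 2 weeks of vacation
years = 4

total = engineers * net_loc_per_week * weeks_per_year * years
print(total)  # 1000000
```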

~~~
hueving
Sweet jesus. Engineers that only produce code and never delete any are like
cancer.

~~~
dexterdog
I think that is why the weekly number is only 50. It's a net.

~~~
vkou
Not to mention that some number of those LOC are unit tests / integration
tests (which can take more code than the feature itself).

~~~
je42
Mmh, usually the LoC of unit/integration/end-to-end tests is much higher than
that of the code under test. In my experience, between 3 and 6 times as much.

------
lifeisstillgood
One of my big questions is how did you arrange security? Kerberos? What about
maintaining service users and permissions to send/receive messages, or some
form of web of trust? With message-based comms, does encryption for the
intended recipients become possible?

------
sneak
That's cool and all, but let us not forget that Yelp are extortionist jerks
who fuck over small businesses.

~~~
vu4374fv18
As far as I can tell those accusations have not been substantiated. I'm ready
to believe it if you have evidence; a recording of a Yelp salesperson using
extortionate language would do.

------
jack9
When we're talking about 5-50k msgs per second, I'll be mildly interested.

------
agjacobson
Yelp may have a value, but it is not proportional to the square of the number
of users.

