Hacker News new | past | comments | ask | show | jobs | submit login
Billions of Messages a Day – Yelp's Real-Time Data Pipeline (yelp.com)
172 points by justinc-yelp on July 15, 2016 | hide | past | web | favorite | 47 comments

100 million reviews/year is only 3 reviews per second on average. Sure, they seem to do more then just that, like voting, comments, etc. But it still seems like something an old school stack could handle on a single large instance.

Reading between the lines it seems the problem wasn't scaling, but programmer productivity. Smaller code bases is often easier to work with so I guess they solved that by dividing it up into many small services. The blog could use a more detailed description of the problem they are actually solving.

Whenever I see headlines like these, "X billion messages per day/hour/second in some web service Y", I wonder what the hell is that service doing that it generates so much messages? I.e. I could understand Facebook, with its billion+ of users and built-in messaging platform, could generate billion of "messages" a day. But a mostly read-based service like Yelp?

But I finally realized - those messages are probably mostly tracking, ads, more tracking, some infrastructure work and even more ads & tracking. The sausage machine that turns people into money.

Indeed. I'm working in adtech and we get four billion requests for bids per day (more on Black Friday and the holiday season). Add in the ad serving and processing of all the results and it blows up fast.

Applying the 80/20 principle: 80% of traffic occurs in the 20% of years time. Which is around 12 reviews per second. Still modest.

Not entirely on topic, but is the 80/20 principle recursive?

20 percent of it is 80 percent recursive


Even a Sqlite DB could handle this

Also, 80% (often 90%) are reads and 10-20% are writes.

100 million is cumulative, not annual.

  > 100 million reviews/year is only 3 reviews per second on average
That's writes, I imagine the reads would be a bit higher.

And ultimately, writes resulting from those reads. I imagine yelp has pretty stout analytics/metrics pipeline fed off views and conversion tracking.

This. A lot of people look at apps like Instagram and Facebook, and think that they're simple to build and manage, when they don't realize that the part of the app the consumer directly interacts with on their screen is the tip of the iceberg when it comes to the entire business.

Traffic is usually not averaged out.. it's spiky.

I ‘ve come to realise that no matter what your data engine choice is(Storm, Spark, Flink, DataFlow, Ruby or Bash scripts, whatever) it is extremely beneficial to persist incoming data first to a distributed log.

Even if all you want to do is accept events and immediately persist them on some data store (Cassandra, mySQL, text files, etc), you re far better of first publishing them on a distributed log, and then consuming from it and persist on, say, a mySQL or Cassandra cluster.

You decouple the data flow from processing - and this means you don’t need to necessarily directly attach your firehose to the processing systems. You can just accept them as they come and deal with them later, if ever - in fact many distinct processing engines can each, asynchronously and independently, access and scan those previously collected event streams, and they can do so at whatever pace makes sense(i.e depending on how fast they can process messages/events).

You are probably familiar with the the log abstraction anyway. See https://engineering.linkedin.com/distributed-systems/log-wha... and http://www.confluent.io/blog/stream-data-platform-1/

Using Kafka as a core infrastructure technology, and for persisting any incoming and generated messages/events to it should be the default strategy. Also, it’s extremely unlikely you will hit service capacity limits, because publishing to Kafka, or a Kafka like service; publishing is mostly about appending data to a file and consuming is mostly about streaming (sequential scan) from files — all very fast and low overhead operations.

Shameless plug: if you want a Kafka like service, with, currently, fewer features but with better performance and far less requirements and a standalone operation mode(no need for ZooKepeper), you may want to check https://github.com/phaistos-networks/TANK

Anyway, I can’t recommend Kafka and investing on logs enough. Also, Kafka Streams is very elegant; likely based on Google’s DataFlow design and semantics. If you are using Kafka or plan to use it, you may want to evaluate it and adopt it over other heavier footprint and more complicated systems (e.g Storm or Spark).

... Kafka is so good Microsoft paid $26B for it :)

> with better performance Where are the benchmarks? Couldn't find anything on the project page.

Is Yelp still using pyleus and Apache Storm? Or have they migrated to Spark and Kafka Streams?

Most of the stream processing in the Data Pipeline happens inside of an internal project called PaaStorm, which is storm-like. It was built to take advantage of our platform as a service (http://engineeringblog.yelp.com/2015/11/introducing-paasta-a...), which handles process scheduling really well. Architecturally, it's pretty similar to Samza, with distributed processes communicating using Kafka.

We do use Spark streaming, and are starting to use Kafka Streams and Data Flow, where they're a better fit. I'm personally most excited about Beam/Flink. We'll probably end up replacing the PaaStorm internals with some other tool, when one with good python support matures. Beam's event-time handling and windowing seem really promising at this point. https://www.oreilly.com/ideas/the-world-beyond-batch-streami... is a great overview of the different concerns for stream processing.

Hi Justin! Thanks for sharing, very interesting stuff.

How do you scale Kafka to handle the massive amount of traffic (and storage) that you seem to generate daily?

With services talking among themselves via HTTP there is a lot of resilience built-in. Do you have anything in place to avoid this becoming a single point of failure? It must have become the most critical piece of your infra.

Scaling Kafka is pretty simple, the operations document contains most things you'll need to get started [1].

We push 500k documents a second through over 10 6 core/24gb ram hosts pretty uneventfully. Only real pointer is to size ZK appropriately and make sure you leave lots of memory for the file system cache.

1 - https://kafka.apache.org/documentation.html#operations

I'm not actually a good resource on scaling Kafka. Our distributed systems teams do a great job of providing reliable infrastructure and scaling it up, so on the application side we are mostly able to treat it like a black box that just works.

In general, I do think poooogles covered it well. Kafka is designed to scale. The one thing we do that you might not expect is splitting data across clusters, depending on what guarantees we want to provide.

We also tend to make sure all data is replicated using geographic distribution to avoid SPOF issues. We do use the min ISR settings and different required ack levels, depending on we want to trade off durability and availability for an application.

Well, in the article they claim they're using Kafka Streams, and there's a mention of Spark in one of the diagrams.

I don't see any mention of pyleus or Apache Storm.

I don't understand the following:

"Yelp passed 100 million reviews in March 2016. Imagine asking two questions. First, “Can I pull the review information from your service every day?” Now rephrase it, “I want to make more than 1,000 requests per second to your service, every second, forever. Can I do that?” At scale, with more than 86 million objects, these are the same thing."

Who is making 1000 requests per second to retrieve all of the 100 million reviews? Other services within Yelp? Why would they pull all reviews every time, and not just the reviews they haven't already processed?

How is 1000 requests per second the same thing as pulling all review information every day?

This section is really confusing and needs some clearer explanation.

I think it's saying that a single iteration over the entire set, would translate to 1000 requests per second for a day (if done naively as one request per object). It's really talking about the N+1 problem.

This is how I read it too. I think op was taking that too literally when they were just stating that the two are essentially equivalent at that scale.

I think the unstated assumption is that there's some sort of processing that occurs on each review every day. So with 100 million reviews every 24 hours, that's just over 1157 requests per second.

It has to be internal because they are incredibly protective of their API (hell they sold their firehouse to just 1 darn company). My guess it's NLP type processing. Things like review highlights, recommendations, etc. that's probably 99% of their load.

The user-centric stuff like submitting reviews, comments is trivial - as another user said , a large single instance is enough.

This is a nice blog, it would be nice to also read one where you explain how you "judge" whether a review is fake or not - I have heard so many times from small business owners how legitimate reviews get hidden/deleted from their page. I wonder if it's an algorithm or a "humanized" process with lots of mechanical turks :) (Not in detail of course we don't want people to game your algos)

"In 2011, Yelp had more than a million lines of code in a single monolithic repo"

Did I read that correctly, the Yelp product is a MILLION lines of code?

For comparison, the Doom 3 source code has 601,000.

It's quite easy to get to that number of lines once you have code for integrations with other products, monitoring, devops, patched versions of broken or unmaintained libraries, a growing number of tests etc. We run a relatively small product and we already have 30k lines with just two people.

It's not that surprising. Some estimates (I could be way off)

100 Engineers x 50 lines of code per engineer per week x 50 weeks per year (2 week vacation) x 4 years = ~ 1 million LOC

Sweet jesus. Engineers that only produce code and never delete any are like cancer.

I think that is why the weekly number is only 50. It's a net.

Not to mention that some number of those LOC are unit tests / integration tests (Which can take more code than a feature.)

mmh. usually the LoC of unittests/integration/end2end are much more than the code under test. In my experience: between 3-6 times as much.

What do Doom 3 and Yelp have to do with each other?

You'd hope a CRUD app used in "hello world" tutorials (https://www.fullstackreact.com/articles/react-tutorial-cloni...) would be simpler than a 3d physics engine.

Displaying a list and showing a map is one feature of Yelp, but you could also say showing "hello world" in 3D text with OpenGL is a 3D graphics engine. In the real world they are both complicated things.

You realize that just because a tutorial that purports to build a Yelp clone does not mean that it will actually teach you to build the entirety of the Yelp app, right?

I work on a code base based on UE4. There's well over 3million LOC.

Maybe counting all the dependencies? You know, you need millions of lines of code to run an npm powered js app :)

one of my big questions is how did you arrange security? Kerberos? What about maintaining service-users and permissions to send / rec messages, or some form of web of trust? With message based comms encryption for the intended users becomes possible?

That's cool and all, but let us not forget that Yelp are extortionist jerks who fuck over small businesses.

As far as I can tell those accusations have not been substantiated. I'm ready to believe it if you have evidence - a recording of a Yelp salesperson using extorionate language would do.

When we're talking about 5-50k msgs per second I'll be mildly interested.

Yelp may have a value, but it is not proportional to the square of the number of users.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact