Reading between the lines it seems the problem wasn't scaling, but programmer productivity. Smaller code bases are often easier to work with, so I guess they solved that by dividing it up into many small services. The blog could use a more detailed description of the problem they are actually solving.
But I finally realized - those messages are probably mostly tracking, ads, more tracking, some infrastructure work and even more ads & tracking. The sausage machine that turns people into money.
> 100 million reviews/year is only 3 reviews per second on average
Even if all you want to do is accept events and immediately persist them in some data store (Cassandra, MySQL, text files, etc.), you're far better off first publishing them to a distributed log, and then consuming from it and persisting to, say, a MySQL or Cassandra cluster.
You decouple the data flow from the processing, which means you don't need to attach your firehose directly to the processing systems. You can just accept events as they come and deal with them later, if ever. In fact, many distinct processing engines can each, asynchronously and independently, access and scan those previously collected event streams, and they can do so at whatever pace makes sense (i.e. depending on how fast they can process messages/events).
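To make the decoupling concrete, here's a toy in-memory sketch (my own illustration, not Kafka's actual API): producers only append to the log, and each downstream consumer keeps its own offset, so each one scans at its own pace.

```python
# Toy append-only log: producers append, consumers scan from their own offset.
class Log:
    def __init__(self):
        self.records = []  # append-only sequence of events

    def append(self, event):
        self.records.append(event)
        return len(self.records) - 1  # offset of the appended record

    def read_from(self, offset):
        return self.records[offset:]  # sequential scan from a saved offset

log = Log()
for e in ["review:1", "review:2", "review:3"]:
    log.append(e)

# Two independent consumers, each tracking its own position in the log:
mysql_offset, search_offset = 0, 0
mysql_batch = log.read_from(mysql_offset)    # e.g. persist to MySQL later
search_batch = log.read_from(search_offset)  # e.g. index for search, independently
```

The point is that neither consumer blocks the producer or the other consumer; a slow indexer just falls behind in offset while ingestion continues.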
You are probably familiar with the log abstraction anyway. See https://engineering.linkedin.com/distributed-systems/log-wha... and http://www.confluent.io/blog/stream-data-platform-1/
Using Kafka as a core infrastructure technology and persisting all incoming and generated messages/events to it should be the default strategy. It's also extremely unlikely you will hit service capacity limits with Kafka or a Kafka-like service: publishing is mostly about appending data to a file, and consuming is mostly about streaming (sequential scans) from files, all very fast, low-overhead operations.
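A rough sketch of why that's cheap (illustration only; Kafka's actual on-disk segment format and indexing are more involved than this): publishing is just appending lines to a file, consuming is just a sequential scan.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "topic.log")

# "Publish": append one record per line to the end of the file
with open(path, "a") as f:
    for event in ("signup", "review", "checkin"):
        f.write(event + "\n")

# "Consume": sequential scan from the beginning (or from any saved offset)
with open(path) as f:
    events = [line.rstrip("\n") for line in f]
```

Appends and sequential reads are the access patterns disks and page caches are best at, which is why a log can sustain throughput that a random-access datastore struggles with.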
Shameless plug: if you want a Kafka-like service with, currently, fewer features but better performance, far fewer requirements, and a standalone operation mode (no need for ZooKeeper), you may want to check out https://github.com/phaistos-networks/TANK
Anyway, I can't recommend Kafka and investing in logs enough. Also, Kafka Streams is very elegant; it's likely based on Google's DataFlow design and semantics. If you are using Kafka or plan to use it, you may want to evaluate it and adopt it over other heavier-footprint, more complicated systems (e.g. Storm or Spark).
We do use Spark streaming, and are starting to use Kafka Streams and Data Flow, where they're a better fit. I'm personally most excited about Beam/Flink. We'll probably end up replacing the PaaStorm internals with some other tool, when one with good python support matures. Beam's event-time handling and windowing seem really promising at this point. https://www.oreilly.com/ideas/the-world-beyond-batch-streami... is a great overview of the different concerns for stream processing.
How do you scale Kafka to handle the massive amount of traffic (and storage) that you seem to generate daily?
With services talking among themselves via HTTP there is a lot of resilience built-in. Do you have anything in place to avoid this becoming a single point of failure? It must have become the most critical piece of your infra.
We push 500k documents a second through over ten 6-core/24 GB RAM hosts pretty uneventfully. Only real pointer is to size ZK appropriately and make sure you leave lots of memory for the file system cache.
1 - https://kafka.apache.org/documentation.html#operations
In general, I do think poooogles covered it well. Kafka is designed to scale. The one thing we do that you might not expect is splitting data across clusters, depending on what guarantees we want to provide.
We also tend to make sure all data is replicated using geographic distribution to avoid SPOF issues. We do use the min ISR settings and different required ack levels, depending on how we want to trade off durability and availability for an application.
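For readers unfamiliar with those knobs, the setting names below are real Kafka configs (`min.insync.replicas` on the topic/broker, `acks` on the producer), but the specific values are illustrative, not Yelp's actual settings:

```python
# Topic-level durability: a write only succeeds if enough replicas are in sync.
topic_config = {
    "replication.factor": 3,     # copies of each partition across brokers
    "min.insync.replicas": 2,    # writes fail if fewer than 2 replicas are in sync
}

# Producer-level: acks controls how many replicas must confirm each write.
producer_config = {
    "acks": "all",  # wait for all in-sync replicas: max durability, higher latency
    # "acks": "1"   # leader only: lower latency, weaker durability
    # "acks": "0"   # fire and forget: fastest, can silently lose data
}
```

With `acks="all"` plus `min.insync.replicas=2`, an acknowledged write survives the loss of a broker; relaxing either setting trades that durability for availability or latency, which is the per-application trade-off the comment describes.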
I don't see any mention of pyleus or Apache Storm.
"Yelp passed 100 million reviews in March 2016. Imagine asking two questions. First, “Can I pull the review information from your service every day?” Now rephrase it, “I want to make more than 1,000 requests per second to your service, every second, forever. Can I do that?” At scale, with more than 86 million objects, these are the same thing."
Who is making 1000 requests per second to retrieve all of the 100 million reviews? Other services within Yelp? Why would they pull all reviews every time, and not just the reviews they haven't already processed?
How is 1000 requests per second the same thing as pulling all review information every day?
This section is really confusing and needs some clearer explanation.
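As I read it, the arithmetic behind the quote is: pulling every one of ~100 million reviews once per day, at one object per request, works out to over a thousand requests per second sustained.

```python
# Back-of-the-envelope for the quoted "1,000 requests per second" claim
reviews = 100_000_000
seconds_per_day = 24 * 60 * 60       # 86,400
rate = reviews / seconds_per_day     # ~1,157 requests/second
```

So "pull all reviews every day" and "more than 1,000 requests per second, forever" are the same load, which is presumably the point the post was making, even if it could have spelled that out.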
The user-centric stuff like submitting reviews and comments is trivial; as another user said, a single large instance is enough.
Did I read that correctly, the Yelp product is a MILLION lines of code?
For comparison, the Doom 3 source code has 601,000.
100 Engineers x 50 lines of code per engineer per week x 50 weeks per year (2 week vacation) x 4 years = ~ 1 million LOC