How Netflix uses Druid for realtime insights (netflixtechblog.com)
104 points by uaas on March 16, 2020 | hide | past | favorite | 59 comments



"To keep reading this story, create a free account."

Seriously? I know it's the Medium hustle but someone at Netflix should know better.


I guess it costs slightly less to hire someone who can manage a basic word editor than someone who can manage a self-hosted website.


I think one cool feature Medium could introduce is enterprise permissioning. If you have an enterprise plan and pay Medium oodles of money, you should be able to maintain your own permissioning and control who sees your articles, while still having access to the Medium social network. Best of both worlds.


> oodles of money

Medium isn't worth oodles, though. Looking at Squarespace and Wordpress pricing, for a company's tech blog, it's worth $25-$100 per month. Unless there's value from the Medium brand...


That should be fixed. Thanks for calling it out.


I can read the whole article without being signed in; it also shows up in incognito mode. That popup is usually just asking you to create an account, but you should be able to close it and continue. I see the whole article, with no paywall, while not logged in.


Firefox add-on to avoid the hassle: https://github.com/iamadamdev/bypass-paywalls-firefox


Can anyone from Netflix comment on why not ClickHouse? We tried Druid, and the performance was pretty bad compared to ClickHouse.


Can confirm ClickHouse is generally faster across most typical workloads.


ClickHouse is much more resource efficient in many cases, but it is less flexible and, importantly, less extensible than Druid.

Druid can easily be extended through available 3rd party extensions and you can write your own to implement custom serialisation formats, aggregations, connect to new streaming systems, read directly from whatever cold storage you have etc.

In the ClickHouse model you have to work out a lot more of that stuff yourself, though these days it can read from Kafka directly, which is useful.

One thing that is important for ClickHouse vs Druid at big scale is the rather large difference in indexing approaches. ClickHouse uses bloom filters and other probabilistic data structures to index large chunks of data; for the most part, though, actually checking for rows requires a full scan of that chunk to strip false positives.

This is different to Druid, which uses full inverted indices for dimension filtering.

The tradeoff is basically that ClickHouse is cheaper, especially when scaling out, but Druid is faster, especially when the cluster is under heavy concurrent query load, like serving analytics dashboards or data exploration interfaces to users.

ClickHouse excels when you want to scan most, but not all, of the data most of the time: namely reporting or bulk analytics queries that will hit most rows in a block.

I consider both to be excellent databases.
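To make the contrast concrete, here is a toy sketch in plain Python (not real ClickHouse or Druid code; the data, chunk size, and names are all made up). An ordinary set stands in for a bloom filter or compressed bitmap, so the false-positive aspect isn't modeled, only the access pattern: chunk summary plus per-chunk row scan vs. direct lookup in an inverted index.

```python
from collections import defaultdict

rows = [{"id": i, "country": c} for i, c in enumerate(
    ["US", "DE", "US", "FR", "US", "DE", "FR", "US"])]

# ClickHouse-style: split rows into chunks and keep a per-chunk summary
# (here an exact set; a real bloom filter would admit false positives).
CHUNK = 4
chunks = [rows[i:i + CHUNK] for i in range(0, len(rows), CHUNK)]
chunk_summaries = [{r["country"] for r in chunk} for chunk in chunks]

def query_chunked(value):
    hits = []
    for summary, chunk in zip(chunk_summaries, chunks):
        if value in summary:          # skip the whole chunk on a miss...
            for r in chunk:           # ...otherwise scan every row in it
                if r["country"] == value:
                    hits.append(r["id"])
    return hits

# Druid-style: a full inverted index from value -> row ids, so a filter
# touches exactly the matching rows with no per-chunk scan.
inverted = defaultdict(list)
for r in rows:
    inverted[r["country"]].append(r["id"])

def query_inverted(value):
    return inverted[value]

assert query_chunked("FR") == query_inverted("FR") == [3, 6]
```

Both return the same rows; the difference is how much data each has to touch to get there.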


Druid committer here. (Also, I think we've met before in SF!)

One thing I wanted to add with regard to performance. Druid does indeed get a big boost from the fact that it uses inverted indexes for filtering. It also gets a boost from having a wide variety of approximate algorithms you can use if you want (for things like topN, count distinct, set difference/intersection, quantiles, etc). But straight scan performance has been improving quite a bit recently too.

The biggest change related to straight scan perf is fully vectorizing the query engine, which is partially done as of the latest release (0.17): https://druid.apache.org/docs/latest/querying/query-context..... In benchmarks, the implementation so far has been posting row scan rate improvements in the 2-3x range. I expect we'll be able to round it out and have it work for all queries over the next couple of releases. The multiples involved mean this is quite meaningful if you do a lot of straight scans.

There's plenty of other stuff going on too: our latest release added parallel merging of large result sets. Our next one (0.18) is going to add a new, more efficient hash aggregation engine. That next release is also going to add a JOIN operator -- not perf related, but probably the number one most requested feature.


I was reading the details on inverted index usage in Druid, but what is described seems to be bitmap indexes, not inverted indexes.

Inverted indexes map distinct values in a column to a list of document ids containing the value. Bitmap indexes map distinct values to an array of booleans the same length as the number of documents, with true for presence and false for absence. Both index types can be highly compressed, of course.

Can you clarify what Druid is using?


Logically, an array of booleans and a set of integers are equivalent. So in the Druid developer community we usually use the terms interchangeably. But to be precise, our indexes are all stored as bitmaps and compressed with bitmap compression libraries.
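A tiny plain-Python illustration of that equivalence (real bitmap libraries such as Roaring store the boolean-array form compressed, but the logical content is identical):

```python
num_rows = 8
row_ids = {1, 3, 6}                                # "set of integers" view

bitmap = [i in row_ids for i in range(num_rows)]   # "array of booleans" view

# Converting back and forth loses nothing:
assert {i for i, bit in enumerate(bitmap) if bit} == row_ids

# Boolean AND on bitmaps corresponds to set intersection:
other = [i in {3, 4, 6} for i in range(num_rows)]
anded = [a and b for a, b in zip(bitmap, other)]
assert {i for i, bit in enumerate(anded) if bit} == (row_ids & {3, 4, 6})
```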


Good point regarding aggregations; especially if done at ingest time with rollup, they really make a big difference. ClickHouse has aggregations too, but they are done at merge time in the background.
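For anyone unfamiliar with rollup, here is a hypothetical sketch of what ingest-time rollup buys you (plain Python, invented event data and bucket size): events sharing the same time bucket and dimensions are pre-aggregated into a single stored row, so queries later touch far fewer rows.

```python
from collections import defaultdict

events = [
    {"ts": 1000, "country": "US", "views": 1},
    {"ts": 1001, "country": "US", "views": 1},
    {"ts": 1002, "country": "DE", "views": 1},
    {"ts": 1061, "country": "US", "views": 1},
]

BUCKET = 60  # seconds; roll events up to one row per minute per country

rollup = defaultdict(int)
for e in events:
    key = (e["ts"] // BUCKET, e["country"])
    rollup[key] += e["views"]

# 4 raw events collapse to 3 stored rows:
assert rollup[(16, "US")] == 2   # ts 1000 and 1001 share a bucket
assert len(rollup) == 3
```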

Vectorized query engine and JOINs sounds awesome.

(We did meet in SF! Beer hall!)


I think what you're getting at can be accomplished with materialized views in ClickHouse now. Most queries that might be fast with inverted indices can be solved that way.

Also, I don't think they use bloom filters for the index as far as I can tell from the documentation. There is certainly an option to use a bloom filter aggregator on a table for faster counts, but it's not the default. If you're referring to the fact that count() is not precise, there's an exact count function too. This is my speculation, though, and you may be right.


Yeah, bloom filters aren't used by default, my bad; by default you get no indices at all! I was thinking about the tokenbf/ngrambf indices. These help a ton in improving more sparse queries.

I will need to check out the materialised views. :)


> I think what you're getting at can be accomplished with materialized views in clickhouse now. Most queries that might be fast with inverted indices can be solved that way.

Let's say you have data with a few dozen dimensions, and want to compute aggregations filtered by any user-supplied union or intersection of dimension values. This is a fairly common use case in analytics dashboards. How do materialized views help with that?
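A toy sketch of that use case (plain Python with invented data; the per-value sets stand in for compressed bitmaps): with an inverted index, any ad-hoc union or intersection of dimension filters can be evaluated at query time, whereas a materialized view has to anticipate the combinations.

```python
# Inverted index: (dimension, value) -> set of matching row ids.
index = {
    ("country", "US"): {0, 2, 4},
    ("country", "DE"): {1, 5},
    ("device", "ios"): {0, 1, 2},
    ("device", "tv"):  {3, 4, 5},
}
revenue = [10, 20, 30, 40, 50, 60]   # one metric value per row id

def aggregate(filters, mode="and"):
    """filters: list of (dimension, value) pairs; mode: 'and' or 'or'."""
    sets = [index[f] for f in filters]
    rows = set.intersection(*sets) if mode == "and" else set.union(*sets)
    return sum(revenue[r] for r in rows)

# Arbitrary, user-supplied combinations at query time:
assert aggregate([("country", "US"), ("device", "ios")]) == 40        # rows 0, 2
assert aggregate([("country", "DE"), ("device", "tv")], "or") == 170  # rows 1, 3, 4, 5
```

With a few dozen dimensions, the number of possible filter combinations explodes, which is why pre-built views struggle here.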


In our experience, at a significantly smaller scale, ClickHouse is vastly easier to operate than Druid, with its various components that each have knobs and dials to configure and have to be orchestrated.


Druid committer here. Fwiw, Druid was designed to run on huge clusters and that really shows up in the multi-process architecture. The idea is that if you separate the components needed for ingestion, historical processing, query routing, and coordination, then there are two benefits: they don't interfere with each other (spikes in ingestion load won't interfere with ability to query historical data), and also you can scale each one individually for your workload. You could even auto-scale some of them. For example, the original Druid cluster was operated with load-based auto-scaling for the ingestion processes.

That being said we are currently working on reducing the number of processes to 4 (from the current 6) for a "standard" setup. The main reason is that at smaller scale there isn't as much of a purpose to having a larger number of processes.

We're also working on removing some of the knobs. Actually, depending on what version you originally looked at, many of them might already be gone.


It's been a few years since we evaluated Druid. It's great to hear that you're simplifying things, especially for smaller setups!


Yes this is definitely also true.

Druid complexity is coming down a bit compared to where it started. These days you need brokers, middlemanagers and historicals - for queries, ingestion and storage respectively.

In the past to do batch ingestion it also required Hadoop but there is now a native parallel batch ingestion system that runs on the middlemanagers as worker tasks that can read from S3/GCS/existing Druid segments.

Druid is by far the more complex, but you get a lot for it, and with k8s it's not as hard to run/manage as it was in the past.


> During software updates, we enable the new version for a subset of users and ... compare how the new version is performing vs the previous version. Any regression in the metrics gives us a signal to abort the update and revert

How do you account for the possibility that the update only performs badly because it’s different than what users are used to, but would actually be an improvement in the long run?


> How do you account for the possibility that the update only performs badly because it’s different than what users are used to, but would actually be an improvement in the long run?

That's assuming that the update contains a UI/UX change. A lot of updates that will roll out won't include that, they'll be fixing or optimising things.


You acknowledge that fact and if you're making a severe change you either run a longer test or you break up the large change into manageable chunks (Google Maps is a great example of this).

For something that is effectively binary and can't be changed incrementally the only way is to extend the evaluation window or make a call with limited information.


Regressions due to "users are used to it" usually go away after a while; you'll see a temporary dip in the results.

If you are convinced it's better in the long run, then what's the point in measuring?


Have you worked at a large company before? You don't!


> How can we be confident that updates are not harming our users?

Maybe this is why they finally figured out the auto-playing trailer stuff was a horrible change (I cancelled my account over that). One can only hope.


Is it possible to view this article without creating a medium account?


Delete your cookies for the site.


It looks like Medium is opening their paywall modal based on viewed articles (and cookies). Therefore, you could try private browsing.


Can confirm. Opening the link in incognito worked for me.


I wonder how this choice compares to using a hosted solution like Splunk (both in capabilities and cost).


That's an interesting thought.


So this is behind a paywall. I am sure Netflix is not marking this story to be behind a paywall, and Medium cannot put this story behind a paywall without consent from Netflix.


It's not. It's medium.com asking for signin (which is a free account) but you can delete your cookies and reload to view.


I am already logged in and it still is a featured story. It's counted against the 3 featured stories I can read in a month. So it is behind a paywall.


It's a soft limit by Medium. You don't have to be signed in to read it. Delete your cookies or use a private window.


Then log out. It’s not a paywall. Log out and you can read it. Medium is playing you.


Proper response: delete account and don't visit medium again.

Even better: sites such as hn should never allow links to sites employing dark patterns.


People are interested in the content. No need for HN to start filtering because of the blogging service. You can avoid the site if you want.


There most certainly is a pressing need for sites such as HN to start caring about stuff like this.

If people start to care about being exploited the content will move to a more ethical site. Or it might influence medium.


People don't care though, because they can make their own choices but they're clearly still sharing and reading on medium. You're asking for HN to arbitrarily enforce your wishes on top of what users are doing.


They do, and it is hardly my wishes alone.

If 1% don't care and share the crap, the other 99% still have to suffer for it.


That's assuming the 99% care to suffer. I don't: if I see a post from the Wall Street Journal, I don't click. I know it's a paywall. People have a choice to click or not.

I do agree that Medium is a bit of a mess, where one is "less privileged" when signed in.

But what we should be asking for is companies like Netflix not using Medium in the first place.


Ban medium from HN and companies like netflix will switch in 5 minutes.


There are plugins to help you fight Medium's paywall. However, if you don't want to use a plugin, you can easily use a private session, which circumvents Medium's paywall.


That should be fixed. Thanks for flagging it.


You can avoid this issue on Medium and lots of other sites using the firefox add-on Bypass Paywalls: https://github.com/iamadamdev/bypass-paywalls-firefox


Yep, I was wondering how a company can put posts that show off their "super skills" behind a paywall.


I wonder if materialize.io could handle such workloads at this stage.


Hi, I work at Materialize!

It's early days, but I can confirm that it's a bit hard to know out of the gates whether Materialize can handle such workloads. In particular, unless I missed it the query workload isn't really discussed in the post, and that is all that matters.

Druid makes a few compromises that Materialize isn't willing to make. For example AFAIK Druid doesn't support deletions or modifications in "realtime", which means they can track "min/max" style queries much more efficiently, but they lose out on some other features (connecting pipelines of these views where you might have to "retract" a prior min or max). This could certainly let it scale more, but also means that you learn a few months down the road that it doesn't do everything you hoped it would.

No joins in Druid is also reportedly a bit of a pain. Instead, you get to pre-denormalize your data. No need with Materialize (and hey, feel free to push down reductions through the joins, rather than denormalize and then reduce).

Short version, Materialize is definitely a "higher sophistication" data processing play. Whether that works out well remains to be seen! I'll slap up a blog post later today with a worked example.


Materialize, today, looks to be limited to running on a single node. https://materialize.io/docs/overview/architecture

Netflix's workload would likely exhaust the resources of even a vertically-scaled single node.


Mind you, given the “Timely Dataflow” abstraction Materialize operates on top of, if you give it a query that only requires certain result rows from one of its mat views, then Materialize is only going to compute the intermediate rows (and, further back, retrieve the source rows) required to “render” the particular result-rows you ask for. (Sort of like how Excel, in memory-constrained conditions, only computes the intermediate cell-values required to render the cells currently in view, and thus, if an intermediate cell requires an `XmlHttpRequest` call to resolve, that call won’t fire if the cell’s value doesn’t need resolving yet.)

Because of that, you don’t really need to scale Materialize in a sharding sense. You can just have a bunch of “the same” Materialize node (i.e. every node just freestanding clone of a template node, with exactly the same sources and matviews) and then hit them with the parts of a map-reduce query launched by, say, Citus—where Citus was thinking it was talking to a bunch of Citus shard nodes each holding a table-shard named X, but was actually talking to Materialize nodes each holding a matview named X. As long as the query sent from the map-reduce job to each node is constrained in its WHERE clause to only the part of the data it expects to get from that node—rather than relying on the node to know what data it has—then the Materialize nodes would each just do the work required to supply that data (including only pulling in the parts of the configured sources required to compute that result.)

That’s just my intuition from how Materialize presents itself as PG-wire-protocol compatible, though; I haven’t tried this myself, and there might be some footguns in the path of anyone really trying to implement it.

And, of course, this is all irrelevant the moment you write a query that needs a pure reduce (e.g. the computation of a current finite-state-machine state over an event-stream source) rather than a map-reduce. Druid/Clickhouse/etc. can probably “scale” those, in at least the Hadoop “move the job around between each serial stage, so each stage has data-locality for the data of that stage” sense; while Materialize would give you no benefit at all in such a job over just querying a plain PG view defined on top of a Foreign Data Wrapper source.


> if you give it a query that only requires certain result rows from one of its mat views, then Materialize is only going to compute the intermediate rows

This is absolutely correct!

> You can just have a bunch of “the same” Materialize node (i.e. every node just freestanding clone of a template node, with exactly the same sources and matviews) and then hit them with the parts of a map-reduce query

This should work, but we have been thinking about it/testing it differently internally. In general you should be able to create materialized views on different "shards" that have different `where` conditions, allowing you to control memory that way. This technique does require data that is actually partitionable in this way, same as it must be partitionable in mapreduce.

> this is all irrelevant the moment you write a query that needs a pure reduce

Of course, with materialize's sinks you can spin up a bunch of `materialized`s and connect them for a final reduce after data has gone through e.g. kafka or shared files. Being able to write joins and aggregates across heterogenous sources makes this kind of workload actually pretty pleasant.


I could swear I saw that they will be offering multi-node commercially soon. I can't find it now, apart from their cloud offering at

https://materialize.io/download/


materialize has to be able to keep all of its state in memory which only makes sense from a cost perspective for workloads which have a high value to space ratio. fine grained user behavior data typically just isn't that valuable.


> materialize has to be able to keep all of its state in memory

For now. We have a pretty good idea of what needs to be done to shed state to disk, and have designed to be able to implement it. We expect it to "just" be a matter of putting in the engineering effort.


Wondering if any performance comparison was done with Apache Pinot?


I tried setting up a Druid cluster once; they have so many components that it's hard to manage. Also, Druid didn't support SQL at the time, and we didn't want to learn its weird query language due to the lack of documentation.



