Query serving systems: An emerging category of data systems (petereliaskraft.net)
53 points by KraftyOne 11 days ago | hide | past | favorite | 14 comments

The referenced paper on Uniserve (https://petereliaskraft.net/res/uniserve.pdf) is interesting, but it seems to focus on systems where storage and compute are colocated, and it doesn't discuss (or maybe I skimmed too quickly) more modern architectures where compute and storage are separated (usually with a caching layer built into the compute nodes). In those architectures, most concerns about shifting data around at query time are moot.

Also, in my experience, building the scatter-gather query functionality and re-aggregation is usually the easiest part. The hard part is figuring out how to build fair multi-tenancy and QoS into what is essentially a massively parallel, user-facing, real-time data lake.
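For readers unfamiliar with the pattern being referenced: scatter-gather means fanning a query out to every shard in parallel, then re-aggregating the partial results. A minimal sketch (toy in-memory shards standing in for RPCs to real shard servers; `query_shard`, `scatter_gather`, and the lambdas are all hypothetical names for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard, predicate):
    # Stand-in for an RPC to one shard server; returns partial results.
    return [row for row in shard if predicate(row)]

def scatter_gather(shards, predicate, aggregate):
    # Scatter the query to every shard in parallel, then gather and
    # re-aggregate the partial results into a final answer.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, predicate), shards))
    return aggregate(partials)

# Toy usage: count even values across three shards.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
total = scatter_gather(shards,
                       lambda x: x % 2 == 0,
                       lambda parts: sum(len(p) for p in parts))
print(total)  # 4 even values across all shards
```

The hard problems the comment alludes to (fair scheduling across tenants, admission control, per-tenant quotas) all live outside this simple fan-out loop.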

That's a great point, and I definitely agree that supporting disaggregated architectures is important and a potential next step for the project. It raises new challenges (systems like Snowflake need to know a lot about how data is represented on disk in order to move it around efficiently), but it ought to be possible to define new abstractions for those representations (or reuse existing ones) in a way that cuts across a lot of systems.

I don't see the purpose.

We can always group stuff in a higher level category.

There's no difference between backend, frontend, gaming, embedded, etc.; essentially they're all bit manipulators.

But... What's the purpose here?

I think the idea is to observe that these kinds of systems share many common components and internal functionality. This is obvious to anyone who has built more than one of these things. It therefore follows that it might be interesting to build a general purpose toolkit comprising the common parts, allowing specialization as a layer above. This could for example make it easier to create new categories of query processing system because less foundational work is needed.

These

> OLAP systems like Druid and Clickhouse

and these

> data warehouses like Snowflake and Redshift

are fundamentally the same, and I've yet to see any reason other than "marketing shenanigans" and "avoiding benchmarks" as to why they should be given their own special category. Call them all modern OLAP, or call them all data warehouses; it doesn't matter.

> general-purpose data placement algorithm for query serving systems that improves latency by maximizing query parallelism, spreading out shards that are frequently queried together.

This is cool; it will be interesting to know whether the added parallelism wins out over the network overhead and the added coordination required. Maybe there are ways to shift where that line lies as well?
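The quoted idea (spread shards that are frequently queried together across servers so they can be scanned in parallel) can be sketched with a simple greedy heuristic. To be clear, this is an illustrative sketch, not the paper's actual algorithm; `place_shards` and the co-query-count input are assumed names and shapes:

```python
from collections import defaultdict

def place_shards(shards, co_query_counts, num_servers):
    """Greedy heuristic (illustrative only): assign each shard to the
    server where it is least colocated with shards it is frequently
    queried together with, breaking ties by server load. Hot co-queried
    pairs end up on different servers, so they scan in parallel."""
    placement = {}
    load = defaultdict(int)

    def weight(s):
        # Total co-query traffic touching shard s; place hot shards first.
        return sum(co_query_counts.get(frozenset((s, t)), 0)
                   for t in shards if t != s)

    for shard in sorted(shards, key=weight, reverse=True):
        def colocation_cost(server):
            colocated = [t for t, srv in placement.items() if srv == server]
            return sum(co_query_counts.get(frozenset((shard, t)), 0)
                       for t in colocated)
        best = min(range(num_servers),
                   key=lambda srv: (colocation_cost(srv), load[srv]))
        placement[shard] = best
        load[best] += 1
    return placement

# Toy usage: A and B are almost always queried together,
# so the heuristic places them on different servers.
counts = {frozenset(("A", "B")): 100, frozenset(("B", "C")): 1}
placement = place_shards(["A", "B", "C"], counts, num_servers=2)
print(placement["A"] != placement["B"])  # True: the hot pair is spread out
```

The trade-off the comment raises shows up directly here: every split pair buys parallel scans at the cost of an extra network hop at gather time.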

I would say one major difference between those two categories of systems is that Druid/Clickhouse are designed to be deployed in "user-facing" settings, where you can put queries to them in the critical path of your app, whereas I've never heard of anyone doing that for Snowflake/Redshift. I'm sure you could, but I bet the cost would be prohibitive, and I'm not sure how well they'd handle the concurrency without a lot of safeguards in your application.

I’ve been on a project where we _experimented_ with putting Snowflake on a user-facing path. It was expensive and ineffectual.

Given that the likes of Clickhouse and Druid can be made user-facing and can also support backend analytics workloads, doesn’t that imply that Snowflake/Redshift are just outright less capable?

Not really. Clickhouse is amazing, but if you want to run it at massive scale you’ll have to invest a lot into sharding and clustering and all that. Druid is more distributed by default, but doesn’t support as sophisticated of queries as Clickhouse does.

Neither Clickhouse nor Druid can hold a candle to what Snowflake can do in terms of query capabilities, as well as the flexibility and richness of their product.

That’s just scratching the surface. They’re completely different product categories IMO, although they have a lot of technical / architectural overlap depending on how much you squint.

Devil is in the details basically.

> Neither Clickhouse nor Druid can hold a candle to what Snowflake can do in terms of query capabilities, as well as the flexibility and richness of their product.

Do you have something specific in mind?

My previous experience with Snowflake was that the query functionality was lacking, performance was subpar (at best), and half the purported features were a joke (looking at you, “Kafka integration”) or just gimmicky (the time travel feature).

Clickhouse and Druid are not very good at complex OLAP queries. Clickhouse is pretty upfront about needing to denormalize your schema to avoid distributed joins. Neither is anywhere close to the performance of top DWs on analytical benchmarks like TPC-H or TPC-DS.

SingleStoreDB is heavily used for this type of app. We used to call this use case real-time analytics (though it has many other names today). [1]

[1] https://www.singlestore.com/blog/the-technical-capabilities-...

(Disclosure: SingleStoreDB cofounder)

CockroachDB is definitely not the first db that comes to mind when I think OLTP.

How is it different to OLAP? It’s exactly what a data mart does.

That it is not different is exactly the point of TFA.
