MBkkt's comments

TL;DR: It's not our choice, but it's meaningful, because this 5 GB is a single data segment, which is literally what you will have in Elastic/etc. when you have TBs of data overall. See https://www.elastic.co/docs/deploy-manage/production-guidanc... (a single shard is one Lucene index that contains multiple data segments).


It's like saying that Postgres was designed for distributed setups just because there are large Postgres installations. We all understand that ClickHouse (and Postgres) are great databases, but it's strange to call them designed for distributed setups. How about insertion that doesn't go through a single master? Scalable replication? And a bunch of other important features, not just the ability to keep independent shards that can be queried in a single query.


ClickHouse does not have a master replica (every replica is equal), and every machine processes inserts in parallel. It allocates block numbers through the distributed consensus in Keeper. This allows for a very high insertion rate, with several hundred million rows per second in production. The cluster can scale both by the number of shards and by the number of replicas per shard.

Scaling by the number of replicas of a single shard is less efficient than scaling by the number of shards. For ReplicatedMergeTree tables, due to physical replication of data, it is typically less than 10 replicas per shard, where 3 replicas per shard are practical for servers with non-redundant disks (RAID-0 and JBOD), and 2 replicas per shard are practical for servers with more redundant disks. For SharedMergeTree (in ClickHouse Cloud), which uses shared storage and does not physically replicate data (but still has to replicate metadata), the practical number of replicas is up to 300, and inserts scale quite well on these setups.


> some reviewers have the need to demonstrate their superior intellect/knowledge/whatever by insisting on minutia.

> Few senior engineers are humble, and consider code reviews as a collaborative effort to deliver the best possible solution.

The only difference between these two is that in the first case the author didn't agree with the reviewer, and in the second case the author did.

The really worst type of review is the one that approves without looking, or with an "IDC" (I don't care) attitude.


A few weeks ago at work we spent a week (not all of it, but still a lot of time) finding out why a sort with libc++ was incorrect. As a result, we added a similar (but simpler) check.

I don't think always running such checks is a good idea (for performance reasons), but enabling them under address sanitizer sounds very good to me.
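The comment doesn't say what the check actually looked like, so here is only a rough sketch of one plausible form: validating that the comparator behaves like a strict weak ordering on a small prefix of the range, and doing so only in address sanitizer builds. The names `looks_like_strict_weak_order`, `checked_sort` and `SORT_CHECKS_ENABLED` are made up for illustration, and the check covers irreflexivity, asymmetry and transitivity only (not transitivity of equivalence).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Enable the extra check only in AddressSanitizer builds.
// GCC defines __SANITIZE_ADDRESS__; Clang exposes __has_feature(address_sanitizer).
#if defined(__SANITIZE_ADDRESS__)
#  define SORT_CHECKS_ENABLED 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define SORT_CHECKS_ENABLED 1
#  endif
#endif
#ifndef SORT_CHECKS_ENABLED
#  define SORT_CHECKS_ENABLED 0
#endif

// Hypothetical helper: tests irreflexivity, asymmetry and transitivity of
// `comp` on at most `max_items` leading elements (cubic cost, so keep it small).
template <class RandomIt, class Comp>
bool looks_like_strict_weak_order(RandomIt first, RandomIt last, Comp comp,
                                  std::size_t max_items = 16) {
    const auto len = static_cast<std::size_t>(std::distance(first, last));
    const std::size_t n = std::min(len, max_items);
    for (std::size_t i = 0; i < n; ++i) {
        if (comp(first[i], first[i])) return false;  // must be irreflexive
        for (std::size_t j = 0; j < n; ++j) {
            if (!comp(first[i], first[j])) continue;
            if (comp(first[j], first[i])) return false;  // must be asymmetric
            for (std::size_t k = 0; k < n; ++k) {
                // a < b and b < c must imply a < c
                if (comp(first[j], first[k]) && !comp(first[i], first[k])) return false;
            }
        }
    }
    return true;
}

// Hypothetical wrapper: validates the comparator before sorting, but only
// when SORT_CHECKS_ENABLED is set (and assert is not disabled by NDEBUG).
template <class RandomIt, class Comp>
void checked_sort(RandomIt first, RandomIt last, Comp comp) {
#if SORT_CHECKS_ENABLED
    assert(looks_like_strict_weak_order(first, last, comp) &&
           "comparator is not a strict weak ordering");
#endif
    std::sort(first, last, comp);
}
```

With something like this, `checked_sort(v.begin(), v.end(), std::less_equal<int>{})` would trip the assertion in a sanitizer build, while regular builds pay nothing extra.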


Seastar is an absolutely different approach, based on a shared-nothing architecture. It's not bad, but you cannot easily compare an async multithreaded architecture with it.

Seastar is something like a multi-process architecture, where each process has no synchronization except n SPSC queues per process (used to communicate between cores) and has only cooperative multitasking.
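To make the "SPSC queue between cores" idea concrete, here is a minimal sketch of a bounded single-producer/single-consumer ring buffer. It's not Seastar's actual implementation; `SpscQueue` and its methods are hypothetical names, and it assumes C++17, a default-constructible `T`, and exactly one producer thread and one consumer thread per queue.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical bounded SPSC ring buffer. One slot is always left empty to
// tell "full" from "empty", so it holds at most Capacity - 1 items.
template <class T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "need at least one usable slot");

public:
    // Called only by the producer thread.
    bool try_push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        slots_[tail] = std::move(value);
        // Publish the slot to the consumer.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called only by the consumer thread.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = std::move(slots_[head]);
        // Hand the slot back to the producer.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

In a shared-nothing design along these lines, each ordered pair of cores would presumably own one such queue, so no slot ever has more than one writer and no locks are needed.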

So it scales well if you don't have a lot of data to share and you have a very good workload balancer.

But commonly you don't, so a Go-like approach is much easier.

