It took about 5 months for our implementation, with a big chunk of that work spent figuring out how to integrate our internal auth, as well as using HashiCorp Vault as a clean, automated way to get auth tokens for an AWS IAM role.
Overall, we are very pleased, and the rest of the engineering org is very excited about it and is planning to migrate most of our SQS and Kinesis apps.
Ask me anything in the thread and I'll try to answer questions. At some point we will do a blog post on our experience.
- what k8s definitions do you use, e.g. do you use the official Helm Chart, or have you written your .yaml's from scratch?
- have you practiced disaster recovery scenarios in the context of k8s? Can you describe them briefly?
- how do you upgrade/redeploy the Pulsar k8s components? i.e., does this cause the Bookies to trigger a cluster rebalance, or does it trigger the Autorecovery?
- for the Bookies, do you use AWS EBS volumes with EKS or just local instance storage (that is, if you use persistent topics)?
- do you use the Proxy pods' EKS k8s pod IPs as exposed on the AWS network, or do you use a NodePort type of service for the Proxy components (using the EKS node IPs)?
- have you been bitten by the recent EKS k8s network plugin bug (loss of pod connectivity), and/or how do you maintain your EKS cluster?
- do you run your EKS nodes in a multi-AZ setting?
We have practiced some disaster recovery, but it isn't 100% exhaustive (is it ever?); it is also aided by how Pulsar is designed. We have killed bookie nodes as well as lost all our state in ZooKeeper. The first is handled pretty easily by the replication factor of BookKeeper data, and for ZooKeeper we do an extra backup step: we just dump the state to S3 and can restore it from there. What we haven't tested in practice, but know how to do in theory, is restoring a k8s stateful set from EBS volume snapshots. However, we see that as a real edge case. In Pulsar, we offload our data to S3 after a few hours, so we only need to worry about potentially losing a few hours of data in BK, as the ZooKeeper state is very easy to just snapshot and restore from S3. In other words, we are still working on getting more and more confident with the data story and don't yet recommend teams use it for mission-critical, non-recoverable data, but there are a ton of use cases for it now and we can continue to improve on the DR front
We have done multiple upgrades and deploy all the time. Because BookKeeper nodes are in a stateful set and we don't do automated rollouts, we have a manual process to replace the BK nodes. However, replacements don't trigger a rebalance, as each bookie closes gracefully and then re-attaches its EBS volume from the stateful set
We use EBS volumes: a PIOPS volume for the journal and a larger, slower volume for the ledger store. This is one of the great parts of the BookKeeper design: the two disk pools are separate, so we just need a small chunk of really fast storage, and the journaled data is copied over to the ledger volume by a background process. We figure for really high write throughput we could use instance storage for the journal volume and EBS for the ledger; that would have some complications on recovery, but would still be easier than having to rebuild the whole ledger data.
We use the pulsar proxy and expose it via a k8s service with the AWS specific NLB annotations.
We haven't had any issues with the k8s plugin and haven't really had any issues with EKS version upgrades. We just add new nodes when we migrate the kubelets
Yes, we have automation (via terraform) to allow us to add many different pools of compute and we use labels and taints to get specific apps mapped to specific pools of compute. For Pulsar, we run all the components multi-AZ
I have questions myself:
1. Did it reduce or increase total cost of ownership (TCO) versus using Kinesis and SQS/SNS?
1a. Interestingly, there's no global replication with those AWS services. Why did you require global replication with the move to Apache Pulsar?
2. Since you mention internal auth: Weren't Cognito / KMS / Secrets Manager up to the job? Given these are integrated out-of-the-box with EC2?
3. Was it ever under consideration to roll out pub/sub on top of Aurora for Postgres with Global Replication? https://layerci.com/blog/postgres-is-the-answer/
1a. You are correct that we didn't require geo-replication for existing use cases. However, we initially saw geo-replication as an easy way to improve DR, and we have an internal requirement for a DR zone in another region. Now that we have done the work, we are starting to see multiple places where we can simplify things with geo-replication, so we think the feature will be really valuable long term.
2. We split up auth into two main components: auth of users (where we use Okta) and auth of services. For users, we just wrote a small webapp that people can log into via Okta to generate credentials. For apps/services, we already had HashiCorp Vault in place and wanted to piggyback off our existing form of identity (IAM roles). Essentially, a user just associates an IAM role with a Pulsar role, and we generate and drop off credentials into a per-role unique shared location in Vault that any IAM role can access (across multiple AWS accounts)
3. Once again, geo-replication wasn't really a hard requirement initially but more of something that we really like now that we have it. I think the biggest reason not to go with Postgres is that we have combined message rates (not everything is migrated yet) on the order of 300k msgs/sec across a few dozen services. Pulsar is designed to scale horizontally and also has really great organizational primitives, as well as an ecosystem of tools. While I think you could maybe figure that out with some PG solution, having something purpose-built really pays big dividends when you are trying to make a solution that can easily integrate into a complex ecosystem of many teams and many different apps/use cases
For replication across regions, do you peer VPCs via Transit Gateways or some such, or do it over the public Internet? I ask because a lot of folks complain about exorbitant AWS bandwidth charges for cross-AZ and cross-region communication (esp over the Internet versus over AWS' backbone): At 300k msgs/sec, the bandwidth costs might add up quickly?
Consequently, maintaining a multi-region, multi-AZ VPC peering might have been complicated without Transit Gateway, so I'm curious how the network side of things held up for you.
Bandwidth is certainly a concern, and one of the nice bits about Pulsar is that not everything is replicated. You mark a namespace for replication by adding the additional clusters it should replicate to. We don't expect to replicate everything, just the things teams care about.
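For reference, marking a namespace for replication is a one-line admin call. A minimal sketch with the Java admin client (the admin URL, tenant/namespace, and cluster names here are illustrative, and the tenant must already be allowed in those clusters):

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import java.util.Set;

    public class EnableReplication {
        public static void main(String[] args) throws Exception {
            // Admin endpoint URL is illustrative.
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://pulsar-admin.example.com:8080")
                    .build();

            // Replicate only this namespace to the listed clusters;
            // namespaces without this setting stay local to one region.
            admin.namespaces().setNamespaceReplicationClusters(
                    "my-tenant/replicated-ns",
                    Set.of("us-east-1", "us-west-2"));

            admin.close();
        }
    }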
When we did this, Transit Gateway was just within the same region. At re:Invent they announced the cross-region Transit Gateway, which we will look at moving to as well, but for now it is just a full mesh of VPC peers, which for 8 regions isn't bad... but certainly gets worse with each new region we need to add.
For exposing the service into other VPCs in the same region we use PrivateLink endpoints, so as to avoid needing to do even more peering.
While there certainly are some aspects you need to be aware of, generally Pulsar is much more "cloud native" and maps quite well to k8s primitives.
Redis Streams now offers a capable persistent message stream solution and if/when Yugabyte supports that then it might be a good competitor. Otherwise Kafka/Pulsar offer much for persistent streams.
There's also KeyDB which is a forked Redis that comes with multithreading and persistence to support larger-than-RAM workloads and active/active failover. https://github.com/JohnSully/KeyDB
Given that the project is a top-level Apache project, has adoption by quite a few different companies, and has a number of corporate sponsors, the future of the project is a pretty safe bet.
We've had some really large companies move to Stream from their in-house tech and save 30-70% comparing Stream's monthly fees vs their in-house hosting (the difference gets much larger if you add the engineering & maintenance cost of their in-house systems). If your team is in the USA and funded, it's pretty difficult to build in-house with a good ROI.
- ordering is really hard: you don't get guaranteed ordering unless you write one message at a time or take on a lot of complexity on writes (see https://brandur.org/kinesis-order), and the shards are simply too small for many of our ordered use cases
- cost: we just don't send some data right now because it would be too expensive relative to the utility of the data (we would need something like 250 shards)
- retention: long term, we want to store data in Pulsar with up to unlimited retention so we can rebuild views. There is still some complexity there (like getting parallel access to segments in a topic for batch processing), but it is much further along than any other option
- client APIs for consumers: we are a polyglot shop, and really the only language where consuming Kinesis isn't terrible is Java (and other JVM languages). For every other language we use Lambda, and while Lambda is great, it is still a distinct deploy and management process from the rest of the app. Being able to deploy a simple consumer just as part of the app is really nice
Do you need to be able to generate a materialized view for a specific time window?
It feels weird to me to use your pub sub system to handle your persistent storage for views, but I am definitely missing context into pulsar and your use case
The idea is you have an event-sourced view and can rebuild it at any time by resetting the stream pointer to the beginning.
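For anyone curious what that looks like in practice, here's a hedged sketch using Pulsar's Reader API (the service URL and topic name are made up; "apply" is a stand-in for your view-building logic):

    import org.apache.pulsar.client.api.MessageId;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Reader;

    public class RebuildView {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // illustrative URL
                    .build();

            // Point the "stream pointer" at the first retained message
            // and replay the whole log to rebuild the view.
            Reader<byte[]> reader = client.newReader()
                    .topic("persistent://my-tenant/my-ns/events")
                    .startMessageId(MessageId.earliest)
                    .create();

            while (reader.hasMessageAvailable()) {
                byte[] event = reader.readNext().getValue();
                // apply(event): fold each event into the materialized view
            }

            reader.close();
            client.close();
        }
    }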
Seems like a cool use case!
For example, Kinesis is really limiting with its short retention, and it makes doing any real ordering at scale very difficult due to the really tiny size of each shard.
Similarly, SQS does pub/sub well, but we keep finding that we need to use the data beyond the first delivery. Instead of having multiple systems where we store that data, we have one.
As for why we didn't go with Kafka, the biggest single reason is that Pulsar is easier operationally, with no need to re-balance, and it has the awesome feature that is tiered storage via offloading, which allows us to actually have topics with unlimited retention. Perhaps more importantly for adoption, pub/sub is much easier with Pulsar, and the API is just much easier for developers to reason about than all the complexity of consumer groups, etc. There are a ton of other nice things, like topics being so cheap that we can have hundreds of thousands of them, all of the built-in multi-tenancy features, geo-replication, the flexible ACL system, Pulsar Functions and Pulsar IO, and many other things that really have us excited about all the capabilities
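To give a feel for the API difference, here's a minimal consumer sketch in Java (URLs and names made up): a shared subscription fans messages out across any number of consumer instances with no partition/consumer-count bookkeeping.

    import org.apache.pulsar.client.api.*;

    public class SimpleConsumer {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // illustrative URL
                    .build();

            // Shared subscription: add more consumers with the same
            // subscription name and work is load-balanced across them.
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://my-tenant/my-ns/orders")
                    .subscriptionName("order-processor")
                    .subscriptionType(SubscriptionType.Shared)
                    .subscribe();

            while (true) {
                Message<byte[]> msg = consumer.receive();
                try {
                    // process(msg) ...
                    consumer.acknowledge(msg);
                } catch (Exception e) {
                    consumer.negativeAcknowledge(msg); // redeliver later
                }
            }
        }
    }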
For GDPR a lot of us have to provide exportable 'user activity'. Can you, in theory, have a topic per user (we had ~50 million users) and publish any user activity to that topic?
It might be worth chatting with Pulsar devs on their slack community (https://apache-pulsar.herokuapp.com/).
Most commonly what I hear people doing for this is either one of two approaches (or a combination of both):
- encrypt the user data and delete the key, eventually the user data will get removed
- regularly compact the topic (Pulsar has a compaction feature) and write in a tombstone record, which will remove any user data after compaction (sketched below)
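A hedged sketch of the tombstone approach in Java (topic and key are made up; as I understand Pulsar's compaction behavior, an empty payload for a key removes that key from the compacted view, and consumers need readCompacted to see only compacted data):

    import org.apache.pulsar.client.admin.PulsarAdmin;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;

    public class EraseUser {
        public static void main(String[] args) throws Exception {
            String topic = "persistent://my-tenant/my-ns/user-activity";

            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // illustrative
                    .build();
            Producer<byte[]> producer = client.newProducer().topic(topic).create();

            // Empty payload for the key acts as a tombstone: compaction keeps
            // only the latest message per key, and empty entries are dropped.
            producer.newMessage().key("user-12345").value(new byte[0]).send();
            producer.close();
            client.close();

            // Kick off compaction (it can also run automatically by policy).
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080") // illustrative
                    .build();
            admin.topics().triggerCompaction(topic);
            admin.close();
        }
    }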
I understand why you chose Pulsar over RabbitMQ, but wouldn't have Kafka been a good choice as well?
I have used NATS in several different ways, but since it can be lossy it's never been considered as a replacement for a pub/sub message queue on my end. We used it for a chat message layer and that worked pretty well.
As for its message passing layer, that can be interesting, but you end up writing all the retry and failure logic anyway, so it's usually just better to use an existing message layer that handles all of that for you without all the funky abstractions.
Again, it's interesting, but nowhere near close to being a Rabbit or Pulsar replacement if reliability is a goal.
Also, NATS has client libraries for many languages that adds request/response semantics on top of the pub/sub semantics.
I am interested to see where NATS goes but for where we are today Pulsar was a much more obvious choice.
I'm currently building a data processing system that is backed by S3 -> SQS based events, for persistent message passing.
You don't have to answer my questions, I am just shouting into the void. I'm glad it works for y'all.
Most IOT devices that aren’t running HTTP stacks are using MQTT.
Or for IOT
50k messages a second is roughly 130 billion requests a month; at $0.40 per million requests that works out to around $52k a month for AWS SQS (math could be wrong, didn't double check).
Plus, with sqs, you get what they have. No customizations.
No way they would actually hit that sustained throughput for the entire month.
Even the other justifications about wanting to reference messages after delivery do not, to me, justify migrating off SQS/Kinesis, especially not at the cost of 5 months of development effort.
Maybe they have chatty IoT devices. Maybe a ton of sensors monitoring manufacturing plants for multiple different metrics in real-time. Maybe they just have large scale.
Having said that, assuming Pulsar has a feature similar to Kafka's log compaction, I can definitely see the appeal!
Also the fact that SQS Fifo doesn’t integrate with this setup is super annoying.
Edit: Log compaction not key compaction
Really? It is one of the few open source projects that we've felt has had modern documentation. How long ago was this?
> As a small startup
You'll spend more time & money on the OpEx cost with Kafka than picking up the client library for Pulsar.
I completely disagree with the opex of picking up kafka vs developing a whole client library. Please could you try and explain how you came to this conclusion?
1. Stateless brokers
With Kafka any time a broker goes down you need to be aware of the kafka broker id. Yes, this can be fixed by creating your entire infrastructure as code and keeping track of state.
This is something of great OpEx. I've seen few people successfully automate this; Netflix is one of the few. The rest just use a manual process with tooling to get by: a pager, Kafka tooling to spawn a replacement node with the looked-up broker id, etc.
2. Kafka MirrorMaker
Granted I have not used v2, which recently came out in ~2.6, but dear gosh v1 was so bad that Uber wrote their own replacement from the ground up called uReplicator. The amount of time wasted on broken replication across regions is disgusting.
3. Optimization & Scaling
Kafka bundles compute & storage. There's no way that I know of (maybe an upcoming KIP) to split these. This means you'll waste time on the ops side deciding on tradeoffs between your broker throughput and your broker space.
Worse yet, time & money will be wasted here. I'd just rather hire more people than waste time on silly things like this. This is where I justify taking on the expense of client libs.
4. Segments vs Partitions
The major time wasters are the situations where the cluster gets utterly destroyed. It will happen; it isn't a question of if but of when (or the company goes belly up first and nobody cares).
It's 3 AM, the producer is getting back pressure, you get a page, and now you have to add write capacity to avoid a hot spot. Don't forget you can't just simply do a rebalance in Kafka, or you'll break the contract with every developer who has developed under the golden rule of "Your partition order will always be the same".
You'll successfully pay the cost of upgrading the entire cluster and then spend 3 days coming up with a solution to rebalance without making all your devs riot against you when you break that golden contract.
Having spent a couple of years dealing with Kafka, I'm sorry to burst people's bubbles, but it is dead. Even Confluent doesn't have a good enough story these days not to switch to Pulsar; they're going to sell you on the same consulting BS: "We're more mature", "We've got better tooling", "Better support"...
Yes, of course, it has been in the open source community 5 years longer, and the company has also been around longer for that time. Kafka is dead, long live Pulsar.
5. Kafka is silly expensive
Pulsar supports message ack with subscription groups. The worst case with Pulsar is you're storing the entire retention period.
Let's say you have a 4 day retention window, to cover an outage happening on Friday and not having to deal with it until Monday. This is pretty typical with what I see in the Kafka world for small-mid size companies who don't want to pay the 1.5x OT on call.
So, with Pulsar you're at worst storing the 4 days of data but at best you're only storing the messages within the lag period of all consumer groups acknowledging the message.
Now, without getting too deep into Pulsar's feature set, even that is a lie, because Pulsar has tiered storage as a first-class citizen. The messages could be shipped off to S3 after the four days if we wanted, or even within 1 day depending on our use case, and this is all built into Pulsar, no OpEx tooling required. Even accessing the messages from S3 through Pulsar is abstracted; there's no tooling required to pull them back in if you wanted.
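To make the "no OpEx tooling" point concrete, here's roughly what configuring offload looks like with the Java admin client (a sketch; the URL and namespace are made up, and the S3 bucket/offload driver are assumed to be configured in broker.conf):

    import org.apache.pulsar.client.admin.PulsarAdmin;

    public class ConfigureOffload {
        public static void main(String[] args) throws Exception {
            PulsarAdmin admin = PulsarAdmin.builder()
                    .serviceHttpUrl("http://localhost:8080") // illustrative
                    .build();

            // Offload ledgers to the long-term store (e.g. S3) once a
            // namespace's topics exceed 10 GiB on BookKeeper. Reads of
            // offloaded data are pulled back transparently.
            admin.namespaces().setOffloadThreshold(
                    "my-tenant/my-ns", 10L * 1024 * 1024 * 1024);

            admin.close();
        }
    }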
Now with Kafka our worst case is simply 4 days of retention data. This can get very expensive as compute & storage are tied together; it means scaling up all the brokers (even though we don't need the throughput) for the storage increase. Now, yes, MSK basically abstracts all this from you, but you're paying for it.
6. AWS Managed Service are not equal citizens to EC2 standalone
Managed services right now don't fall under the new Saving Plan: https://aws.amazon.com/blogs/aws/new-savings-plans-for-aws-c...
This costs you a 30-60% discount on your entire Kafka bill.
7. Excel Life
If I look at the numbers for what I'm doing, it would have cost ~$4M for Kafka vs ~$1M for Pulsar.
A DC/OS implementation easily takes care of 1. and 2. Points 3. and 4. are valid, but I think in real life these scenarios are usually related to cloud service cost optimization, and I would never recommend anyone run Kafka in a cloud for these reasons.
There is one more reason, which was not cited, but which poses a real killer for the cloud Kafka dream AFM: clouds, being prone to all kinds of network interruptions, are not well suited to running Zookeeper ensembles with decent uptime.
Disclaimer: I have never tried or used Apache Pulsar, and just examining its documentation after spotting this thread.
These can be just as fragile, and now you have to learn how to manage the orchestrator. Even Confluent's own Kubernetes operator has issues. There are just too many issues with Kafka's design that hinder easy operations.
> "I would never recommend anyone running Kafka in a cloud"
That's a major problem considering that's where most computing is heading. At this point, running in noisy, overloaded cloud environments is a good test of the reliability and durability of a software system. Kafka fails massively here.
Could you elaborate why this would be the case?
(1) choose a "flavor" wrapper (confluent seems to be a popular one), because the base project isn't easy to develop against
(2) write your own wrappers of those wrappers, to keep your developers from shooting themselves in the foot with wacky defaults
(3) suffer the immense pain that is authenticating topic write/reads, if that's even possible???
(4) stand up zookeeper... and probably lose some data along the way.
(5) suffer zookeeper outages due to buggy code in kafka/zk (I've experienced lost production data due to unpredictable bugs in kafka/zk, but obviously YMMV).
Based on my naive assessment, the kafka/zookeeper ecosystem is maybe 10x as complicated as the problem it's solving, and that shows up in the OpEx. I personally doubt that Pulsar is that much better, but it might be.
In this churny environment, where you want to keep on the latest versions (necessitated by the bugs mentioned above), you need abstractions to protect you somewhat from the churn.
Confluent also seems to have a fair amount of churn, so you need wrappers for that, that you can update all at once for your developers.
My biggest problems with it were when developers who didn't really understand Kafka started setting properties that had promising names to bad values to "ensure throughput" - let's set max.poll.records to 1 to ensure we always get a record as soon as one is available!
That might be my biggest issue with Kafka - it requires a decent amount of knowledge of Kafka to use it well as a developer. I'm not sure if Pulsar removes that cognitive burden for devs or not, but I'm interested in finding out.
And yeah, the wrappers to remove that burden were written in our company too - but then proved quite limiting for the varying use cases for a Kafka client in our system. sigh
Are we heading toward a split between apache/java/zookeeper stacks on one side and go/etcd on the other? I've seen an issue related to that question on pulsar, and this got me investigating the distributed KV part of the stack.
It seems, looking at some benchmarks, that etcd is much more performant than zookeeper, and to some people operating two stacks seems like too high an operational maintenance cost. Is that a valid concern?
Also, I've seen that kafka is working on removing the dependency on zookeeper; is pulsar going to take the same road?
Looking at the codebase of Pulsar, it looks like a typical sprawling Apache-style Java project with more than a thousand directories, many thousands of files, and more than a hundred dependencies. By comparison, NATS, which is in Go, has a few hundred files, less than a hundred directories, and about a dozen or so dependencies.
It's not one or the other, they're just different tools.
The team is working on an entirely new system called Jetstream to eventually replace it.
The core of it is this point:
"Functions placed at low levels of a system may be redundant or of little value when compared with the cost of providing them at that low level."
That is, in order to get that message redundancy, or exactly-once delivery, or message persistence, you pay a high cost, and you may be better off delegating to the endpoints.
This blog provides a good overview
Here is the original paper
"..Message/event persistence - NATS Streaming offers configurable message persistence: in-memory, flat files or database. The storage subsystem uses a public interface that allows contributors to develop their own custom implementations."
"At-least-once-delivery - NATS Streaming offers message acknowledgements between publisher and server (for publish operations) and between subscriber and server (to confirm message delivery). Messages are persisted by the server in memory or secondary storage (or other external storage) and will be redelivered to eligible subscribing clients as needed."
Disclaimer: I'm the author and former core contributor of NATS and NATS Streaming.
I think that the modern approach to distributed systems is moving towards golang style microservices and lightweight / simple system design with RPC communication, reconcile type loops for state reconciliation, and backing CP databases. I think this is the influence of k8s (and maybe google's approach to distributed systems).
I will almost certainly get downvoted for this (as I always seem to when I criticize the JVM), but Apache/JVM style architecture feels REALLY long in the tooth to me. I think you are committing to an outdated and very expensive approach to building software if you use anything running on the JVM, especially Apache based anything. Cassandra is a great example of this - out of the box it's a terribly performing database that is extremely expensive to run and tune. Throw enough resources and time at it and you can get it to acceptable scalability - but running on the JVM which is a huge memory hog will always make it expensive to run (and even then, you will always get terrible latency distributions with the JVM's awful GC).
If I was building a business I would run far far away from any JVM based solution. The only thing it has going for it is momentum. If you need to hire 100s of engineers off the street for a large project, then a JVM based stack is about your only option unfortunately.
This just makes it seem like you are trolling. JVM devs have done more to advance state of art in this area than any other language. The problem is that most JVM apps just produce too much garbage, not necessarily that the algo itself is awful.
Either way, there's no such thing as an optimal GC algorithm, just different trade-offs depending on your use case. Not everyone cares about latency.
This is a strange way of saying that Java the language, and most frameworks around it, force apps to generate this much garbage.
There's a ton of projects Apache Foundation hosts that fit your description but it's a mistake, I think, to confuse individual projects with Apache in general. Bad enough that people confuse the license with the foundation.
I generally don't use Java or Go in my applications, but Go components are usually more lightweight and easier to use if you're not majorly working with those runtimes. JVM has a big overhead, and Java applications usually make things worse if they use things like dependency injection.
I guess my point is that k8s lets you shift the operational burden to dev teams if they need it. If you have a centralized operations team running a giant, common zk/etcd, yeah, this would be an additional operational burden.
Yes, it's in the works
It doesn't look like it's going to be ready anytime soon though.
Reliable pub/sub that supports message rates over 100k/sec (even up to the millions) has been available for a while now, and with a great deal of efficiency (e.g., the Aeron project). The incredible amount of effort to support complex partitions, extreme fault tolerance (instead of more clever recovery logic), etc. adds a lot of overhead. To the point of talking about "low latency" overhead on the order of 5ms instead of microseconds or even nanoseconds as is expected in trading.
Worse, many startups try to adopt these technologies when their message rates are minuscule. To give you some context, even two beefy machines with an older message queue solution like ZeroMQ can tolerate throughput in excess of what most companies produce.
This is not to discredit the authors of Pulsar or Kafka at all... but it's just a concerning trend where easy to use horizontally scalable message queues are being deployed everywhere. Similar to how everyone was running hadoop a few years back even when the data fit in memory.
[EDIT] in fact it's pretty much the norm outside startup land, too, as soon is you're involved with any kind of bigco "innovation" or greenfield-development division.
I like that it is adaptable to run without an external broker (using embedded media driver option), yet can add that for scalability/redundancy.
Our use cases do not require a query on top of streams.
Instead, all data would go into TimescaleDB.
For example, say you have a KV Store with basic mathematical Set operations like GET, SET, UNION, INTERSECT, EXCEPT, etc. The Engine would parse the SQL and then call the low-level KV Store Set operations, returning the result or updating KV pairs. This explains how Join relates to Set operations:
Another thing I'd like is if KV stores exposed a general purpose functional programming language (maybe a LISP or a minimal stack-based language like PostScript) for running the same SQL Set operations without ugly syntax. I don't know the exact name for this. But if we had that, then we could build our own distributed databases, similar to Firebase but with a SQL interface as well, from KV stores like Pulsar. I'm thinking something similar to RethinkDB but with a more distinct/open separation of layers.
The hard part would be around transactions and row locking. A slightly related question is if anyone has ever made a lock-free KV store with Set operations using something like atomic compare-and-swap (CAS) operations. There might be a way to leave requests "open" until the CAS has been fully committed. Not sure if this applies to ledger/log based databases since the transaction might already be deterministic as long as the servers have exact copies of the same query log.
Edit: I wrote this thinking of something like Redis, but maybe Pulsar is only the message component and not a store. So the layering might look like: [Pulsar][KV Store (like Redis)][minimal Set operations][SQL query engine].
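On the CAS question: the usual trick is an optimistic retry loop where you never mutate a stored value in place; you build a new one and swap it in only if the old one is still there. A toy, in-memory illustration in Java (all names hypothetical, nothing to do with Pulsar's actual internals):

    import java.util.Set;
    import java.util.TreeSet;
    import java.util.concurrent.ConcurrentHashMap;

    public class CasSetStore {
        private final ConcurrentHashMap<String, Set<String>> kv =
                new ConcurrentHashMap<>();

        // Lock-free UNION: read the current set, build the merged set,
        // and commit only if no concurrent writer changed the key (CAS).
        // Stored sets are never mutated in place, only replaced.
        public void union(String key, Set<String> other) {
            while (true) {
                Set<String> current = kv.get(key);
                if (current == null) {
                    if (kv.putIfAbsent(key, new TreeSet<>(other)) == null) {
                        return; // committed
                    }
                } else {
                    Set<String> merged = new TreeSet<>(current);
                    merged.addAll(other);
                    if (kv.replace(key, current, merged)) {
                        return; // committed
                    }
                }
                // CAS failed: another writer won the race, retry fresh.
            }
        }

        public Set<String> get(String key) {
            return kv.getOrDefault(key, Set.of());
        }
    }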
Pulsar has support for Presto: https://pulsar.apache.org/docs/en/sql-overview/
Pulsar isn't a KV store though, it's a distributed log/messaging system that supports a "key" for each message that can be used when scanning or compacting a stream. GET and SET aren't individual operations but rather scans through a stream or publishing a new message.
If you just want to have a SQL interface to KV stores or messaging systems that support a message key, then Apache Calcite can be used as a query parser and planner. There are examples of it being used for Kafka.
The most obvious way to model a secondary index on top of a pure KV store is to map indexed values to keys. For example, given the (rowID, name) tuples (123, "Bob"), (345, "Jane"), (234, "Zack"), you can store these as keys:
Now you can easily find the rowID of Jane by doing a key scan for "name:Jane:", which should be efficient in a KV store that supports key range scans. You can do prefix searches this way ("name:Jane" finds all keys starting with "Jane"), as well as ordinal constraints ("age > 32", which requires that the age index is encoded in a fixed-width, lexicographically ordered form, something like "age:000032:<rowID>", so that byte order matches numeric order).
This is what TiDB and FoundationDB's record layers do, which both have a strict separation between the stateless database layer and the stateful KV layer.
The performance bottleneck will be the network layer. Your range scan operations will be streaming a lot of data from the KV store to the SQL layer, and potentially you'll be reading a lot of data that is discarded by higher-level query layers. This is why TiKV has "co-processor" logic in the KV store that knows how to do things like filter; when TiDB plans your query, it pushes some query operators down to TiKV itself for performance.
Unfortunately, this is not possible with FoundationDB. This is why FoundationDB's authors recommend you co-locate FDB with your application on the same machine. But since FDB key ranges are distributed, there's no way to actually bring the query code close to the data (as far as I know!).
I'm sure you could do something similar with Redis and Lua scripting, i.e. building query operators as Lua scripts that work on sorted sets. I wouldn't trust Redis as a primary data store, but it can be a fast secondary index.
As for a generic relational layer over K/V stores, I think it’s a superficially appealing idea that would be impossible to optimize adequately in practice. Honestly, if you want to implement a distributed relational database, I would recommend starting with Postgres as your local storage engine and pushing down as much relational logic as possible to that layer. I have worked on such a system in the past and it produced very good results for minimal development effort.
The row ID can be encoded into the key. From my example, a basic mapping might be:
name:Bob:123 => <empty value>
For example, if I want to search for "name = 'Bob'", then I simply start at the key "name:Bob:" and pluck the row ID from each key, scanning sequentially until I reach the end of my range.
This works great for multiple values. For example:
name:Bob:123 => <empty value>
name:Bob:124 => <empty value>
name:Bob:125 => <empty value>
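A tiny in-memory sketch of that prefix scan, with a TreeMap standing in for an ordered KV store (the keys are from the example above):

    import java.util.Map;
    import java.util.TreeMap;

    public class IndexScan {
        public static void main(String[] args) {
            // Ordered KV stand-in; index entries are keys, values are empty.
            TreeMap<String, String> kv = new TreeMap<>();
            kv.put("name:Bob:123", "");
            kv.put("name:Bob:124", "");
            kv.put("name:Bob:125", "");
            kv.put("name:Jane:345", "");
            kv.put("name:Zack:234", "");

            // Range scan: start at the prefix, stop once we leave it.
            String prefix = "name:Bob:";
            for (Map.Entry<String, String> e : kv.tailMap(prefix).entrySet()) {
                if (!e.getKey().startsWith(prefix)) break;
                String rowId = e.getKey().substring(prefix.length());
                System.out.println("matched row " + rowId); // 123, 124, 125
            }
        }
    }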
If you store row IDs in the value, you'll risk read/write contention on the value. Let's say there are multiple rows with "Bob", you'll end up storing something like:
name:Bob => [123, 124, 125]
This also puts a constraint on the number of row IDs you can fit in a single value. KV stores typically co-locate value data with keys, so now you might not be able to efficiently scan a large range without also loading that data. You can do tricks like introducing an indirection, where you don't store row IDs but "row page IDs", with each page holding a maximum of N row IDs, allowing you to sidestep the size limit on values. But that comes with other costs.
I'm not counting in-memory stores like Redis here. Implementing in-memory indexes is a completely different ball game to something that needs to live on disk.
As for "superficially appealing idea that would be impossible to optimize adequately in practice", your assertion is demonstrably false: TiDB and CockroachDB both implement performant relational databases on top of general-purpose KV stores.
Which ones would you recommend looking at?
TVS and Bajaj are major motorbike manufacturers in India, and TVS had a model named "Apache" and Bajaj had a model named "Pulsar".
These other systems are designed to be remote with a network interface. You can use the client drivers to handle acknowledgements/retries/local-buffering in your own app, or use something like Logstash, FluentD, or Vector for message forwarding if you want a local agent to send to. You might have to wire up several connectors, since none of them forward directly to Pulsar today.
Also RabbitMQ is absolute crap. There are better options for every scenario so I advise using something else like Redis, NATS, Kafka, or Pulsar.
Yikes, that's a bit harsh! I've been using RabbitMQ on multiple projects for several years, and I think it's a great pub/sub system. It also has a large userbase and community around it, as well as lots of plugins available.
I've never heard of using Redis for pub/sub - were you suggesting rolling your own on top of Redis? I'm not familiar with NATS (but it's been mentioned several times here, so I will definitely learn more!), but Kafka is a streaming log/event system, not a pub/sub system like RabbitMQ (different use cases).
I will say though that if something is wrong in your RabbitMQ config, the stack traces that Erlang produces when it crashes are a nightmare to decipher!
It's poor architecture and implementation. One of the worst products built with Erlang. It's hard to work on (for new contributors) while not really using any of the natural advantages of the language.
Redis supports pub/sub channels. It's very fast and if you already use it then it saves running another system.
Streaming log vs pub/sub is a fuzzy distinction and basically has no difference at this point. Publishers send to a topic and consumers listen. You can do the ephemeral pub/sub in Kafka by reducing retention to seconds, using a single partition per pub/sub topic, and having consumers listen from latest without a consumer group. Or use Pulsar which does both much more naturally.
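A hedged sketch of that Kafka setup in Java (broker address and topic made up): no group.id, assign the single partition directly, seek to the end, and you get ephemeral listen-from-now semantics with no committed offsets.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class EphemeralListener {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // illustrative
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // No group.id: we assign the partition directly, so there is
            // no consumer group and nothing is ever committed.

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("alerts", 0); // 1 partition
                consumer.assign(List.of(tp));
                consumer.seekToEnd(List.of(tp)); // only messages from "now" on

                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.println(r.value());
                    }
                }
            }
        }
    }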
Also, I didn't mention it, but there are great commercial messaging products like AMPS or Solace if you need much more advanced features and support.
IME, any Erlang system is difficult for noobs to contribute to; Erlang's syntax alone is... frightening :)
I really haven't hit any of the issues you mentioned with rabbit. Several clusters in production for years have been rock solid, connections are stable (and when they do drop, the client libraries, of which there are a multitude, can handle reconnection), throughput is excellent, and I've never once seen data corruption (and combined we must have processed many billions of messages).
The whole HiPE debacle I can at least agree on though! (I seem to recall it's deprecated now, but I might have dreamed that)
If you need persistence then Redis v5 has a new data type called Streams which is similar to Kafka/Pulsar. Push messages to a stream and read with multiple consumers (using consumer groups) that can ack each msg to track read state.
Edit: Maybe you're looking for acks/confirms? https://www.rabbitmq.com/confirms.html
Clustering the Rabbit machine helps one particular failure scenario, but it's not a solution to the problem.
That solves the problem of lost messages if you have e.g. a network problem between servers. I don't know what to do about RabbitMQ being down, all you can really do about that is move the exact same problem somewhere else by making things much more complicated, such that it's probably not worth it. If the place you're storing outbound messages is broken, you (obviously) can't store them, whether that's PostgreSQL, Redis, a remote RabbitMQ server, a local RabbitMQ, whatever.
I'm building a distributed system with RabbitMQ just now, where producers may be offline due to transient networking issues. I write messages to a local SQLite database, with another thread responsible for sending them to RabbitMQ and deleting them on successful delivery.
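For the curious, the relay thread looks roughly like this (a sketch, not my production code: it assumes the xerial sqlite-jdbc driver, the RabbitMQ Java client, and an outbox table like outbox(id INTEGER PRIMARY KEY, body BLOB); publisher confirms ensure a row is only deleted once the broker has the message):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class OutboxRelay {
        record Pending(long id, byte[] body) {}

        public static void main(String[] args) throws Exception {
            Connection mq = new ConnectionFactory().newConnection(); // localhost
            Channel ch = mq.createChannel();
            ch.confirmSelect(); // enable publisher confirms

            try (java.sql.Connection db =
                     DriverManager.getConnection("jdbc:sqlite:outbox.db")) {
                while (true) {
                    // 1. Read a batch of messages queued by the app threads.
                    List<Pending> batch = new ArrayList<>();
                    try (Statement st = db.createStatement();
                         ResultSet rs = st.executeQuery(
                             "SELECT id, body FROM outbox ORDER BY id LIMIT 100")) {
                        while (rs.next()) {
                            batch.add(new Pending(rs.getLong(1), rs.getBytes(2)));
                        }
                    }
                    // 2. Publish each; delete only after the broker confirms.
                    for (Pending p : batch) {
                        ch.basicPublish("", "work-queue", null, p.body());
                        ch.waitForConfirmsOrDie(5_000);
                        try (PreparedStatement del = db.prepareStatement(
                                "DELETE FROM outbox WHERE id = ?")) {
                            del.setLong(1, p.id());
                            del.executeUpdate();
                        }
                    }
                    Thread.sleep(1_000); // idle poll interval
                }
            }
        }
    }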
NoSql for this seems to be a better use-case.
I also depend on nats.io instead of RabbitMQ
My (very limited, and purely gained through this thread!) understanding of nats suggests it doesn't handle persistence, instead leaving it up to producers and consumers - but I need to not lose any messages in the event of an outage. Waiting for consumers to signal completion of processing isn't practical, as it will sometimes take several minutes or even hours before messages are processed; I think the real solution here is persistence at the broker?
That's how to persist in case of a network failure.
The outbox pattern removes the sent message from its datastore once it is sent/acknowledged.
Persistency is different from the outbox pattern: it covers what happens after the message broker has received the message, while the outbox pattern handles failures before the message reaches the message broker.
NoSQL and event stores normally have methods to adjust data to the latest version of the model; e.g. Postgres (which can be used as NoSQL using bson) supports JS v8 transformations. This is useful for edge cases in the outbox pattern.
In short, I was more or less talking about the tech used: Rabbit vs NATS and SQLite vs NoSQL (in my case, using Postgres as NoSQL).
Ps. Nats can be persistent as others mentioned in this thread, look up: nats streaming
- it is much more performant than RabbitMQ
- it's a commit log as well, not just a pub-sub system, i.e. it is a good candidate as the storage backend for event sourcing
- it supports geodistributed and tiered storage (eg. some data on NVMe drives, some on a coldline storage)
- it's persistent, not in-memory (primarily)
.. and so on.
Why use RabbitMQ and Kafka if you can use ZeroMQ? Meaning, isn’t it far more performant and distributed?
Maybe I am missing something here.
They are totally different, you're comparing apples with oranges.
ZeroMQ gives you basic, very fast tooling to communicate between distributed processes. ZeroMQ does not provide tooling for e.g. maintaining a strictly ordered, multi-terabyte event log. And so on.
Basically, one is decentralized and you can set up a massively parallel architecture, with eg each topic or subthread having its own pubsub.
The other is a monolithic centralized pubsub architecture.
You could argue that git in large institutional projects converges to a monolithic repo so at that point it’s less efficient even than svn.
But for most use cases, ZeroMQ would allow far more flexible distributed systems topologies and solutions. No?
Edit: HN and Google are both awesome: https://news.ycombinator.com/item?id=9634925
Not true. Facebook and Google do not use Git. Microsoft does not use vanilla Git for their monorepo. They created this extension to make it scalable https://en.wikipedia.org/wiki/Virtual_File_System_for_Git
I don't think Apache cares if it's maintaining similar projects.
This blog post offers some more info and leaders to other posts comparing Pulsar to RabbitMQ and Kafka: https://jack-vanlightly.com/blog/2018/10/2/understanding-how...
Firebase is a completely different animal.
When a team you are on starts discussing switching over to a technology like Pulsar because of its amazing benefits, unless your pants are on fire, it is much more likely than not that you do not stand to gain much from the benefits that such software brings but you are accepting the maintenance burden that it represents.
Kafka Streams - Pulsar Functions don't intend to (by the looks of it) provide all of the functionality available in Kafka Streams, e.g. joining streams, treating a stream as a table (a log with a key can be treated as a table), or aggregating over a window. They seem to be more focused on doing X or Y to a given record. That said, you don't need Kafka Streams for that; other streaming products like Spark Streaming can do it too (although last I checked, Spark Structured Streaming still had some limitations compared to Kafka Streams - can't do multiple aggregations on a stream, etc.). A use case I have and love Kafka Streams' KTables for is enriching/denormalising transaction records on their way through the pipeline (see the sketch after this list).
Kafka Connect - Pulsar IO will get there with time, but currently KC has a lot more connectors. For example, Pulsar IO's Debezium source is nice (Debezium was built to use Kafka Connect, but can run embedded), but you may not want to publish an entire change event stream onto a topic - you might just want a view of a given database table available - so KC's JDBC connector is a lot more flexible in that regard, and Pulsar IO currently doesn't have a JDBC source. It also looks like its JDBC sink only supports MySQL and SQLite (according to the docs) - KC's JDBC connector as a sink has a wider range of DB support, and can take advantage of functionality like Postgres 9.5+'s UPSERT. Likewise, there are no S3 sources or sinks - the tiered storage Pulsar offers is really nice, but you may only want to persist a particular topic to S3.
KSQL - KSQL lets your BAs etc. write SQL queries for streams. That said, I do like Pulsar SQL's ability to query your stored log data. When I've needed to do this with Kafka, I've had to consume the data with a Spark job, which adds overhead for troubleshooting.
So yeah, that's the main areas I can see, but it's really a function of time until Pulsar or community code develops similar features.
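For reference, the KTable enrichment pattern mentioned above looks roughly like this (a sketch; topic names and string serdes are made up, the two topics must be co-partitioned on customer id, and real code needs error handling and a shutdown hook):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class EnrichTransactions {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Changelog topic treated as a table: latest record per key.
            KTable<String, String> customers = builder.table("customers");
            // Transactions keyed by customer id.
            KStream<String, String> txns = builder.stream("transactions");

            // Denormalise each transaction with the current customer
            // record and write the enriched result downstream.
            txns.join(customers, (txn, customer) -> txn + "|" + customer)
                .to("transactions-enriched");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "txn-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                    Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                    Serdes.String().getClass());
            new KafkaStreams(builder.build(), props).start();
        }
    }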
The only other major difference I can see is that at the current time it comprises three distributed systems (Pulsar brokers, ZooKeeper, BookKeeper), which is one more distributed system to maintain, with all the fun that entails.
That said, I'll be keeping my eye on this, and trialling it when I get some spare time, as I've found that people will inevitably use Kafka like a messaging queue, and that is a bit clunky. Plus I'm a little over having people ask me how many partitions they need :D
I think Pulsar should really focus on integration with the rest of the Apache stack if they want to gain traction.
Pulsar also lags in the size of its community and ecosystem, where Kafka has much more available.
Kafka and Pulsar persist every message and different consumers can replay the stream from their own positions. Pulsar also supports ephemeral pub/sub like NATS with a lot more advanced features.
NATS does have the NATS Streaming project for persistence and replay but it has scalability issues. They're working on a new project called Jetstream to replace this in the future.
I am confused by this. The format of Kafka's log files is designed to allow reading and sending to clients directly using sendfile, in sequential reads of batches of messages. http://kafka.apache.org/documentation/#maximizingefficiency
Pulsar separates storage into a different layer (powered by Apache Bookkeeper) which allows consumers to read directly from multiple nodes. There's much more IO throughput available to handle consumers picking up anywhere in the stream.
When consumers fall behind, they start to request data that might not be in the page cache, causing things to slow down.
I just want to clarify this - you're limited to N concurrent consumers for N partitions per consumer group.
There are helm charts for running an actual cluster.
You have to manually call ".receive()" to attempt to receive a message.