Kafka is a message broker. It’s not a database and it’s not close to being a database. This idea of stream-table duality is not nearly as profound or important as it seems at first.
Apparently the primary reason they went out of business was sales-related, not purely technical, but if they hadn't used Kafka, they could have had 2x the feature velocity, or better yet 2x the runway, which might have let them survive and eventually thrive.
Imagine, thinking you want a message bus as your primary database.
Kafka is awesome for what it does, but there are a ton of people out there using it for weird stuff.
Then... you decide to attend a Kafka developer conference like Kafka Summit to sharpen your skills and learn best practices.
At that conference, you see the CEO of Confluent give a keynote speech about how Kafka is a transaction store; you 'turn your database inside out', run SQL queries on it, even run your BI dashboards on top of it! They make it seem like this is a best practice that all the smartest engineers are doing, so you better do it as well.
At that point you're investing in a path that will make Confluent millions in revenue consulting you on implementing their experimental, up-sell architecture on top of Kafka. Maybe their sales rep will suggest you buy a $100k/month subscription to Confluent Cloud to really patch up any holes and make it easier to maintain.
Moral of the story is to do your best to separate the business side and the tech side of open source. Behind many open source projects, there's a company trying to make money on an 'enterprise' version.
Ethically, I prefer data companies like Snowflake that are at least transparent about trying to make money off their product, rather than companies that use open source as a ladder to drive adoption, then try to bait-and-switch you into the same 7-figure deals with software that is more brittle and still requires you to develop your own solutions with it.
The developers who adopted the commissioned application were way too inexperienced to know one way or another, and by the time I was asked to help out, no one had really learned anything about Kafka.
My suggestions to replace it with a basic queue were rejected, and my suggestions to store events in a database for easier retrieval were rejected. They were sure that learning Kafka better would fix everything instead.
Ironically there's not even that much to learn about it, in the scheme of things. It's like learning more about your car so you can make it fly one day. It's just not the right tool for the job.
1. Why do we want to use Kafka? [insert thinking] So that we can do x yet avoid y.
2. Why is the customer asking for these features? [insert listening] So that they can do z and think of themselves as w.
And then failure to remember and revisit that "why?".
And even worse, a lot of this heavy lifting isn't obvious until you are fairly far into a project and suddenly realize that to fix a bug you need to implement transactions or referential integrity.
It's not a drop-in replacement though, that's for sure.
That's incorrect, unless I'm misunderstanding what you mean by primary columns. It's just not as efficient.
And missing joins are by design, and one of the reasons Cassandra is as fast as it is. And as I said before: it's not a drop-in replacement. You need to architect your application around its strengths to leverage its performance.
It is, however, usable as an RDBMS replacement if you know what you're doing and your data is fine with eventual consistency.
And knowing what Cassandra does with your data is important as well. It's actually a key-value store on steroids. Once you get that, its limitations and strengths become obvious.
I'm referring to "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
> It's just not as efficient.
That's what I thought, too, so I "allowed filtering". And crashed the database. (Apparently the correct solution here is to use Spark).
Cassandra is at its heart a key-value store. For every queryable field, you need to create a new denormalized copy of the data, so you're basically saving the data n times, once for each way you wish to query it.

If you try to query on a field which hasn't been marked as queryable, the cluster will have to basically select everything in the table and do a for loop to filter for the 'where' clause :-)
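The model above in miniature, using plain dicts as stand-ins for denormalized tables (table and field names are invented for illustration; this is a sketch of the data-modeling idea, not of Cassandra itself):

```python
# One denormalized "table" per query pattern, keyed by the partition key
# you intend to query on.
users_by_id = {}       # supports: lookups by id
users_by_email = {}    # supports: lookups by email

def insert_user(user):
    # Every queryable field costs you another full copy of the row.
    users_by_id[user["id"]] = user
    users_by_email[user["email"]] = user

def get_by_email(email):
    # O(1): hits a purpose-built table.
    return users_by_email.get(email)

def get_by_country(country):
    # No table keyed by country: this is roughly what ALLOW FILTERING
    # amounts to, a full scan over every row with a for-loop filter.
    return [u for u in users_by_id.values() if u["country"] == country]

insert_user({"id": 1, "email": "a@example.com", "country": "DE"})
insert_user({"id": 2, "email": "b@example.com", "country": "FR"})
```

The `get_by_country` path is the one that looks harmless on a laptop and falls over on a cluster, because the scan cost grows with the whole dataset rather than with the result.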
But I haven't used it in production yet, so you've got more experience than I do.
It's tough to be consulted and watch people go against your advice like that, though.
Both your comment and the GP use examples of failed startups as the basis of your criticism. One mentions "50 engineers" and I'd bet not a single one of them had an engineering bone in their body, or the training. So I read that as 50 blog-driven programmers failing to execute a sophisticated system design they read about on some blog. Surprise?
Using Kafka as the distributed log of a deconstructed database is technically an order of magnitude more demanding than wiring MVC widgets, or using the latest paint by numbers framework, etc., because that is in fact developing a custom distributed database.
Technical leadership and experience are forgotten things but they do matter. Let's not blame the stack.
They could have replaced their brokers with simple worker queues and fed the raw data and results into a db. That was all they needed. The fear was that when they got MASSIVE, that architecture wouldn't scale for them. In reality it absolutely would, but like you said, the blogs told them otherwise.
Kafka is fine, using it for the wrong thing isn't.
I'm not criticizing Kafka so much as the way I've seen it used in software. It's largely down to user error, and like I said, Kafka is good software. I had to learn a lot about it for that project and it was a lot of fun. It was my first deep dive into message brokers and I really loved it, so I mean no disrespect to the software or its maintainers.
You could replace Kafka with Pulsar (or DistributedLog for that matter) in my comments above and it would stand.
Is using a distributed log as the WAL component of an event-sourcing or distributed database entirely a bad idea? I am of the opinion that it is a viable but sophisticated approach.
I've built (and sold) an entire company around this architecture, and am working on the next. It can be incredibly powerful, and is what you _have_ to do if you want to mitigate the significant downsides of a Lambda architecture. We successfully used it to power graph building, reporting, and data exports, all in real-time and at scale.
But, we didn't use Kafka for two key reasons (as evaluated back in 2014):
* We wanted vanilla files in cloud storage to be the source-of-truth for historical log segments.
This was really important, because it makes the data _accessible_. Applications that don't require streaming (Spark, Snowflake, etc) can trivially consume the log as a hierarchy of JSON/protobuf/etc files in cloud storage. Plus, who the heck wants to manage disk allocations these days?
* We wanted read/write IO isolation of logs.
This architecture is only useful if new applications can fearlessly back-fill at scale without putting your incoming writes at risk. Cloud storage is a great fit here: brokers keep an index and re-direct streaming readers to the next chunks of data, without moving any bytes themselves. Cloud read IOPs also scale elastically with stored data.
Operational concerns were certainly a consideration: we were building in Go, targeting CoreOS and then Kubernetes, had no interest in Zookeeper, wanted broker instances to be ephemeral & disposable (cattle, not pets), etc, but those two requirements are pretty tough to bolt on after the fact.
The result, Gazette, is now OSS (https://gazette.dev) and has blossomed to provide a bunch of other benefits, but serves those core needs really well.
- we do not need to read messages in a strict order
- using S3's API is super easy
- no operations overhead
- no compute resources (for the queue part)
After doing it for some time, I think I was drinking too much Kool-Aid with streaming services like Kinesis and Kafka and failed to see how easy it is to implement a simple analytics event service using purely S3.
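A minimal sketch of what I mean, written against the local filesystem (the bucket name and key layout are invented; for real S3 you'd swap the file write for a `put_object` call):

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def event_key(event, now):
    # Hive-style date partitioning makes the "log" trivially consumable
    # by batch tools (Spark, Athena, Snowflake, ...) with no broker at all.
    return f"events/dt={now:%Y-%m-%d}/hour={now:%H}/{event['id']}.json"

def write_event(root, event, now):
    # One newline-delimited JSON object per key; append-only by convention.
    path = os.path.join(root, event_key(event, now))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(json.dumps(event) + "\n")
    return path

root = os.path.join(tempfile.gettempdir(), "demo-bucket")
p = write_event(root, {"id": "e1", "type": "click"},
                datetime(2024, 5, 1, 13, tzinfo=timezone.utc))
print(p)  # .../demo-bucket/events/dt=2024-05-01/hour=13/e1.json
```

No brokers, no offsets, no cluster: consumers just list keys under a prefix. The trade-off, as noted above, is that you give up strict ordering.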
For instance, my nephew was heaping praises on Firebase and would use it for all his projects. It took me a while to understand that the convenience of using it masked some of the issues that would only get exposed at scale.
For whatever reason a vast number of developers are entering the workforce believing in Kafka’s capability to solve all problems. And going against that majority will be very tiring and even costly.
So: a much more costly mistake.
You can construct a state (table) from a stream, but you cannot do the reverse. You cannot deconstruct a table into its original stream because the WAL / CDC information is lost -- you can only create a new stream from the table. This means you lose all ability to retrieve previous states by replaying the stream. Information is lost.
Duality in math is an overloaded term but it generally means you can go in either direction. This is not true here.
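The asymmetry is easy to demonstrate with a toy fold (account names and deltas invented for illustration): building the table from the stream is a one-liner, but the resulting table no longer contains the information needed to recover the stream.

```python
# Stream -> table: fold each event into current state. Lossy by construction.
events = [
    ("alice", +100),
    ("alice", -30),
    ("bob",   +50),
    ("alice", -30),
]

table = {}
for account, delta in events:
    table[account] = table.get(account, 0) + delta

print(table)  # {'alice': 40, 'bob': 50}
# The table alone cannot tell you whether alice reached 40 in one event or
# three, and it cannot reproduce any intermediate state. Replaying to a
# previous point in time requires the original stream.
```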
One benefit of using the table as a source of truth this way is that you can run migrations on the entire “stream” if you will, and republish the events in order, in the new schema definition.
Another benefit is that you can run SQL queries against your entire transaction history. Some (imo sketchy) tools exist that try to give you a SQL interface to Kafka, but that’s just bad. It’s much easier to have your cake and eat it in postgres than it is in Kafka, imo. Unless your data volume is absurd, in which case Kafka is your only realistic choice.
Of course, in a "literal" sense you're correct, it's not possible. But from a practical point of view, if your bitemporal data implementation is stable, you can totally depend on your ability to produce streams from a database. We do this at my current place of work all the time.
Pointing out (GP and yourself) that this is an 'asymmetric duality' is informative and valid.
A symmetric dual is one where there is equivalence: the dual of the dual is the primal.
But it is possible to have a "weak" duality that is asymmetric, where the dual encompasses the primal -- where there are regions of equivalence instead of a 1-to-1 equivalence.
I'm not sure if stream-table duality fits this conceptual mold exactly, but thank you for providing a counterpoint to my assertion. I'm going to think on it.
Also, note that your believing it is the most profitable mindset - for them. Company blogs are marketing, first and foremost.
I don’t have a proof, but for things like state transitions, I’ve often solved this with a <new_state>_at column. Seeing that field populated would indicate a second event needs to be emitted. (Assumes no cycles in the state machine.)
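A minimal sketch of that `<new_state>_at` idea (column names and row shape invented; it only works because an acyclic state machine sets each `*_at` column at most once):

```python
# Derive transition events from nullable *_at timestamp columns.
STATES = ["created_at", "paid_at", "shipped_at"]  # state-machine order

def pending_events(row, already_emitted):
    # Emit one event per populated *_at column we haven't announced yet;
    # col[:-3] strips the "_at" suffix to get the event name.
    return [col[:-3] for col in STATES
            if row.get(col) is not None and col[:-3] not in already_emitted]

row = {"created_at": "2024-01-01", "paid_at": "2024-01-02", "shipped_at": None}
print(pending_events(row, {"created"}))  # ['paid']
```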
Often that table is not the one you have, even if (perhaps especially if) you emitted events for every update — seems like some entropy-like force pushes the universe toward emitting details in events that aren’t reflected in the table records, over a long enough timeline.
Difference algebra over the deltas could find some interesting relevance.
That one would so confidently assert that "Codata is the categorical dual of Data" has me thinking there might be something here?
Anyone want to tell me how far off I am?
More usefully, data is a finite data type and codata is an infinite data type.
sum  = foldr (+) 0   -- data: consumes a finite list, yields one value
sum' = scanl (+) 0   -- codata: yields a (possibly infinite) list of partial sums
sum' [1,2,3]
> [0, 1, 3, 6]
sum' [1..]
> [0, 1, 3, 6, 10...
Induction describes how to process data without diverging - process each input element in a finite time.
Co-induction describes how to process codata without diverging - produce each output element in a finite time.
This is useful if you want to prove things about infinite processes like operating systems.
No. The notion of a "stream-table duality" is a powerful concept that I've found can change the way any engineer thinks about how they store and retrieve data, for the better (it's an old idea, rooted in Domain-Driven Design, but for some reason a lot of engineers, myself included, still need to learn, or relearn, it).
The misleading part is the notion of relying on a stream as the primary data persistence abstraction or mechanism in a production system, at least for now. I'd argue Kafka pushes us in a direction that makes progress along that dimension, and you can apply it successfully with a lot of effort. But to match what you can get from a more traditional DBMS? The tech just isn't there (yet).
I've definitely seen table-focused projects that really struggle because they need to be thinking in stream-focused terms. That's especially true in the age of mobile hardware. E.g., if you want to take your web-focused app and have it run on mobile devices with imperfect connectivity so people in the field can use it, table-only thinking is a pain. It's much easier to reconcile event streams than tables.
Sure you can't expect a stream-oriented tool to do everything an SQL database does. But the reverse is true as well. If we threw out every tool because people unthinkingly used it for the wrong job, we wouldn't have many tools left.
And if you don't want to use that, there are also products for this specifically, such as Event Store.
It can make sense to do both, so as to de-couple internal table structures from externally exposed events. Debezium supports doing this safely via the transactional outbox pattern: events are inserts into an outbox table in the same database as the actual business data, both updated in one transaction. Debezium can then capture the outbox events from the logs and send them to Kafka. To me this hits the sweet spot for many use cases, as you avoid the potential inconsistencies of the dual writes which you mention, while still not exposing internal structures to external consumers (for other use cases, like updating a cache or search index, on the contrary, CDC-ing the actual tables may be preferable).
Disclaimer: I work on Debezium
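A minimal sketch of the outbox write path described above, with sqlite3 standing in for the real database and an invented schema and payload shape (a connector would then ship the outbox rows to Kafka; that part is omitted):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         aggregate TEXT, type TEXT, payload TEXT);
""")

def place_order(order_id, total):
    # One transaction commits both rows or neither: no dual-write window
    # where the business data exists but the event doesn't (or vice versa).
    with db:
        db.execute("INSERT INTO orders (id, total) VALUES (?, ?)",
                   (order_id, total))
        db.execute(
            "INSERT INTO outbox (aggregate, type, payload) VALUES (?, ?, ?)",
            ("order", "OrderPlaced",
             json.dumps({"id": order_id, "total": total})))

place_order(42, 99.5)
rows = db.execute("SELECT type, payload FROM outbox").fetchall()
print(rows)  # [('OrderPlaced', '{"id": 42, "total": 99.5}')]
```

The key property is that the event row rides on the same commit as the business row; the connector reading the outbox can then be as eventually consistent as it likes.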
I still can’t get behind using CDC via a DB watching tool unless I have little control of the system writing to the DB. There is so much context lost interpreting the change at the DB (who was the user that made this change, etc.) — unless you are sending this data to the DB (then yikes your DB schema is much more complex).
If your DB transaction fails after the Kafka message has been sent, you'll end up with inconsistencies between your local database state and what has been published externally via Kafka. That's exactly the kind of dual write issues the outbox pattern suggested in my other comment avoids.
> unless you are sending this data to the DB
You actually can do this, and still keep this metadata separately from your domain tables. The idea is to add this metadata into a separate table, keyed by transaction id, and then use downstream stream processing to enrich the actual change events with that metadata. I've blogged about an approach for this a while ago here: https://debezium.io/blog/2019/10/01/audit-logs-with-change-d....
Either way, I’m intrigued by the outbox pattern. As long as it is transparent to developers, it does seem like an ideal CDC mechanism.
Being able to elegantly and efficiently do "AS OF" queries is still a hard problem, I don't see that KTables really solve it. Hate to say it but Oracle's solution to this is still unmatched. https://oracle-base.com/articles/10g/flashback-query-10g#fla...
The article does not recommend writing this code yourself, it shows how to aggregate data into usable forms.
So I think your concerns may be a bit overblown. If you think that ksqlDB or Kafka Streams, the tools shown in this blog post, are at risk for what you warn about, this comment would be a valid criticism. But it's clear that the article isn't advocating for people to write their own versions of that...
By using Kafka (or anything else) as a "commit log" you've just resigned yourself to building the rest of the database around it. In a real RDBMS the commit log is just a small piece of maintaining consistency.
Every time I've worked on a project where we handle our own "projections" (database tables) we ended up mired deep in the consistency concerns normally handled by our database.
What's so hard? Compaction is an evil problem. We figured we could "handle it later". Well, it turns out maintaining an uncompactable commit log of every event, even for testing data, consumes many terabytes. This and "replays" sunk the project indirectly.
Replays. The ES crowd pretends this is easy. It's not. If you have 3 months of events to replay every time you make certain schema changes, you are screwed. Even if your application can handle 100x normal load, it's going to take you over a day to restore using a replay. This also happened to us. Every time an instance of anything goes down, it has to replay all the events it cares about. If you have an outage of more than a few instances, the thundering herd will take down your entire system no matter how much capacity you have. With a normal system, the DB always has the latest version of all the "projections"; only when you're restoring from a backup does it have to replay a partial commit log, and only once.
Data duplication. Turns out "projecting" everything everywhere is way worse on memory and performance than having the DB that's handling your commit log also store it all in one place. Who knew.
Data corruption. If you get one bad event somewhere in your commit log you are cooked. This happens all the time from various bugs. We resorted to manually deleting these events, negating all the supposed advantages of an immutable commit log. We would have been fine if we let the db handle the commit log events.
Questionable value of replays. You go through a ton of BS to get a system that can rewind to any point in time. When did we use this functionality? Never. We never found a use case. When we abandoned ES we added Hibernate Auditing to some tables. It's relatively painless, transparent, and handled all of our use cases for replays.
I disagree. Kafka is a log. That's half of a database. The other half is something that can checkpoint state. In analytics that's often a data warehouse. They are connected by the notion that the data warehouse is a view of the state of the log. This is one of a small number of fundamental concepts that come up again and again in data management.
Since the beginning of the year I've talked to something like a dozen customers of my company who use exactly this pattern to implement large and very successful analytic systems. Practical example: you screw up aggregation in the data warehouse due to a system failure. What to do? Wipe aggregates from the last day, reset topic offsets, and replay from Kafka. This works because the app was designed to use Kafka as a replayable, external log. In analytic applications, the "database" is bigger than any single component.
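The recovery pattern described above, in miniature (a plain list stands in for a Kafka topic; the checkpointed offset and event names are invented):

```python
# A replayable log: consumers track an offset into it and can rewind.
log = [("clicks", 1), ("views", 1), ("clicks", 1), ("clicks", 1)]

def replay(from_offset):
    # Recompute warehouse aggregates from the given offset forward.
    counts = {}
    for event, n in log[from_offset:]:
        counts[event] = counts.get(event, 0) + n
    return counts

full = replay(0)      # full rebuild: {'clicks': 3, 'views': 1}
# After a botched load: wipe the affected aggregates, reset the consumer
# offset to the last known-good checkpoint (say, 2), and replay forward.
last_day = replay(2)  # {'clicks': 2}
```

The whole trick is that the log, not the warehouse, is the source of truth, so recomputation is deterministic.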
I agree there's a problem with the stream-table duality idea, but it's more that people don't understand the concept that's behind it. The same applies to transactions, another deceptively simple idea that underpins scalable systems. "What is a transaction and why are they useful?" has been my go-to interview question to test developers' technical knowledge for over 25 years. I'm still struck by how many people cannot answer it.
It's the same as moving from a framework to a library. At first, the framework seems great, it solves all your problems for you. But then as soon as you want to do something slightly different from the way your framework wanted you to do it, you're stuffed. Much better to use libraries that offer the same functionality that the framework had, but under your own control.
If you use the sledgehammer of a traditional RDBMS you are practically guaranteed to get either lost updates, deadlocks, or both. If you take the time to understand your dataflow and write the stream -> table transformation explicitly, you front-load more of the work, but you get to a database that works properly more quickly.
99% of webapps don't use database transactions properly. Indeed during the LAMP golden age the most popular database didn't even have them. The most well-designed systems I've seen built on traditional databases used them like Kafka - they didn't read-modify-write, they had append-only table structures and transformed one table to compute the next one. At which point what is the database infrastructure buying you?
If you're using a traditional database it's overwhelmingly likely that you're paying a significant performance and complexity cost for a bunch of functionality that you're not using. Given the limited state of true master-master HA in traditional databases, you're probably giving up a bunch of resilience for it as well. If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.
Oh ho, there's some deep irony in calling a database overengineered when we're looking at Kafka.
I'm running a copy of SQLite right here, and it's pretty nice; it does queries, it's portable, and it even comes as a single .c file that I can drop directly into a project. I use the same API for my mobile app.

Where's the equivalent for Kafka?

No? I have to install Java, run several independent processes (forget embedded), and configure a cluster to... um... do what? Send some messages? If I want to do more than that I have to add my own self-implemented logic to handle everything.
Different things, you argue; but alas, that's the point. The point you made was that Kafka would 'be it' if it had come first, but since it cannot serve the purpose of a DB in most contexts, you're just wrong.
It's a different thing, and it might be suitable for some things, but I have deployed and managed (and watched others manage) Kafka clusters, and I'll tell you a secret:

Confluent sells managed Kafka. They have a compelling product.

You know why? ...because maintaining a Kafka cluster is hard, hard work, and since that's the product Confluent sells, they're not about to fix it. Ever.
Now, if we're talking about event streaming vs. tables, there are some interesting thoughts about what a simple single-node embedded event-streaming system might look like (hint: absolutely nothing like Kafka), and what an alternate history where that became popular might have been...

...but Kafka, this Kafka, would never have filled that role, and even now it's highly dubious that it fills that role in most cases; even if we acknowledge that event streaming could fill it, were it implemented in a different system.
SQLite can't serve the purpose of a db in most contexts either: it can't be accessed concurrently from multiple hosts, it can't safely be accessed over a network filesystem at all, and even on a single host concurrent writers serialize on a single lock. If we both took a list of general-purpose database use cases and raced to implement them via Kafka or via SQLite, I guarantee the Kafka person would finish a lot quicker.
> You know why? ...because maintaining a kafka cluster is hard, hard work, and since that's the product confluent sells, they're not about to fix it. Ever.
Maintaining a Kafka cluster sucks. So does maintaining a Galera cluster (I've done both), and I'll bet the same is true for any other proper HA (master-master) datastore.
Comparing Kafka against "SQL databases" is a category error, sure. But in the big use case that everyone talks about, that all the big-name SQL databases focus on - as a backend for a network service that runs continuously where you don't want a single point of failure - those big-name SQL databases are overengineered and underwhelming compared to Kafka, and in most other use cases an SQL database is overengineered compared to a (notional) similarly-designed event-sourcing system for use with the same sort of problems.
A huge advantage is that backing up and restoring from backup with sqlite is trivially implemented with scp in a bash script. Ever tried to set up database restore from backup (that works) in AWS?
There are some real wins from using sqlite in prod.
I've run apps that were written like that as well (though not by me). What you say is true, as far as it goes, but those apps have no business using an SQL database: they don't need ad-hoc queries (and usually don't offer the possibility of doing them), they don't need the SQL table model (and usually have to contort themselves to fit it), they don't need any but the crudest transactions... frankly in my experience all the developer needed was some way to blat structured data onto disk, and they reached for SQLite because it was familiar rather than because they'd carefully considered their options.
> A huge advantage is backing up and restoring from backup with sqlite is trivially implemented with scp in a bash script. Ever tried to setup database restore from backup (that works) in AWS?
You're not comparing like with like though: if you're willing to stop all your server processes and shut down your database to do the backup and restore (which is what you have to do with sqlite, you just don't notice because it's something you do all the time anyway) then it's easy to back up practically any datastore.
If you don’t need them you might as well use a file system directly but if you need it, there is not much alternative :-)
The POSIX filesystem API doesn't offer much for concurrency control other than locks.
In my line of work we use HDFS or AWS S3 as the "filesystem", so we are not using the OS file-access API at all, and most files are append-only, written by a single server.
But if you don’t need it, there are plenty of good distributed file systems that will not eat your data.
Just because something is designed for x doesn't mean it's any good at x. Kafka is also designed for handling concurrent events, and it's significantly better at it.
> But if you don’t need it there are plenty of good distributed file system that will not eat your data.
There really aren't any distributed filesystems that are nice to work with. S3 isn't a full filesystem and your writes take a while before they're visible in your reads, HDFS isn't a full filesystem and can be slow to failover, AFS has generally weird performance and locks up every so often. If your needs are simple then anything will work, including distributed filesystems, but it's very rare that a distributed filesystem is an actively good choice, IME.
Schemas are a good idea, but SQL is bad at expressing them. Uniqueness constraints and foreign keys seem like a good idea until you think about how you want to handle them; often a stream transformation gives you better options. Unindexed joins and indexed joins are very different, and it's much easier to understand the difference in a Kafka-like environment than in an SQL database.
> Can you imagine designing a simple crud app with a couple of tables and different products, but now do it with Kafka as your backing store.
Yes, that's exactly what I'm advocating. It's clearer and simpler and avoids a bunch of pitfalls - no lost updates, no deadlocks, your datastore naturally has decent failover.
> You’ve added so much more complexity to your application because you didn’t want to use a database.
On the contrary, if you use a database you add a huge amount of complexity that you don't want or need. You have to manage transactions even though you can't actually gain any of their benefits (if two users try to edit the same item they're going to race each other and one's update is going to get lost). You have to flatten your values into 2D tables and define mappings with join tables etc., rather than just serialising and deserialising in a standard way. Bringing in new indices in parallel is flaky, as is any change to data format. And getting proper failover set up is an absolute nightmare, to the point that a lot of people will set up failover for the webserver but leave the database server as a single point of failure because it's too hard to do anything else.
Querying can be ad hoc, i.e. you don't know what your query will be, and in that case you want to offload your data to an OLAP system which will perform about 1 or 2 orders of magnitude better than your typical row-oriented MySQL database. Moving the data out of your primary database will prevent deadlocks and shooting yourself in the foot.
If you know the query beforehand you simply compute and update a materialized index with results of your query.
Building a system in such a decoupled way is conceptually and implementationally much simpler than your typical overcomplicated SQL database with unpredictable execution plans. These databases are good only if 1. you don't know what you are doing, or 2. you need some quick and dirty system to work without much thinking, and you are fine that it will never scale up. Many small CRUD e-commerce apps are like that, and these databases get used because that is what your typical medium.com article recommends as best practice.
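The "compute and update a materialized index" approach above, in miniature (event shape and names invented):

```python
# If the query is known up front - revenue grouped by product, say - maintain
# its answer incrementally instead of asking a planner to derive it at read
# time.
revenue_by_product = {}  # the "materialized index"

def apply(event):
    # Each incoming event updates the precomputed answer in O(1).
    if event["type"] == "sale":
        p = event["product"]
        revenue_by_product[p] = revenue_by_product.get(p, 0) + event["amount"]

for e in [{"type": "sale", "product": "a", "amount": 10},
          {"type": "sale", "product": "b", "amount": 5},
          {"type": "sale", "product": "a", "amount": 7}]:
    apply(e)

print(revenue_by_product)  # {'a': 17, 'b': 5}
```

Reads then become dictionary lookups with fully predictable cost, at the price of deciding the query shape before the data arrives.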
You can’t simply put every entity you might need to update atomically in the same key/document.
I agree that when using a SQL database that uses locks instead of snapshot isolation, you need to be careful about the order in which you scan rows and join tables, to reduce the frequency of deadlocks.
My experience using PostgreSQL is that deadlocks are extremely rare, and when they happen they are automatically detected by the database; the client simply aborts and retries the transaction.
You make each operation an event, you make all your logic into stream transformations, and you make sure the data only ever flows through in one direction (and you have to either avoid "diamonds" in your dataflow, or make sure that the consequences of a single event always commute with each other).
So at its simplest, someone trying to make a reservation would commit a "reservation request" event and then listen for a "reservation confirmed" event (technically async, but it's fast enough that you can handle e.g. web requests this way - indeed even with a traditional database people recommend using async drivers), and on the other end there's a stream transformation that turns reservation requests (and plane scheduling events and so on) into reservation confirmations/denials. Since events in (the same partition of) a single stream are guaranteed to have a well-defined order, if there are two reservation requests for the last seat on a given plane then one of them will be confirmed and the other will be rejected because the plane is full.
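The last-seat case above can be sketched in a few lines (names invented; a list stands in for one ordered stream partition, and the loop for the stream transformation):

```python
# Seat allocation decided purely by event order in a single partition.
# Whoever's request lands earlier in the log wins; no locks, no
# read-modify-write races, because there is exactly one decider reading
# the requests in a well-defined order.
def process(requests, seats_left):
    decisions = []
    for req in requests:  # the partition's order IS the tiebreak
        if seats_left > 0:
            seats_left -= 1
            decisions.append((req, "confirmed"))
        else:
            decisions.append((req, "rejected"))
    return decisions

print(process(["alice", "bob"], seats_left=1))
# [('alice', 'confirmed'), ('bob', 'rejected')]
```

The requesters then see their outcome as a "reservation confirmed/rejected" event, which is the async round-trip described above.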
> My experience using PostgreSQL is that deadlock are extremely rare and when they happen they are automatically detected by the database and client simply abort and retry the transaction when this happen.
That's no good. If you do that then you haven't solved the problem and you're guaranteed to get a full lockup as your traffic scales up. You have to actually understand which operations have caused the deadlock and rearrange your dataflow to avoid them. (Not to mention that if it's possible for the client to "simply abort and retry" then you've already proven that you didn't really need traditional database-style synchronous transactions).
People found that they missed the abstraction that SQL and transactions offer, as it simplifies your code a lot. They simply needed something that scaled better than doing manual sharding on top of MySQL :-)
Since we're discussing misunderstandings in the community, it should be pointed out that a Database Management System (DBMS) is not merely a database, to say nothing of a data store. Oracle, Postgres, et al are genuine "DBMS". ATM you are very likely putting together a data store in your app code.
I interpreted the OP as DBMS=database, which absolutely includes application code that stores and retrieves data in proprietary formats.
Even a linked list mmap'd to disk can be a database, just maybe not a very good one.
To be fair, this is a somewhat fuzzy categorization (DBMS, DB, data store) and it can cause confusion. A DBMS is a system, just like it says right on the acronym tin: one that provides capabilities such as DDL, DML, auxiliary processes, etc. to manage a database. There are various data models for databases, e.g. a triple store, but the data model X is orthogonal to the XDBMS.
Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.). Depending on your domain, this could be a large amount of your data, or not. Just like anything, Kafka and Kafka Streams are tools that can be used and misused.
There is perhaps an exciting future where something like ksqlDB makes even highly relational data sit well in Kafka, but we're definitely not there yet, and I'm not terribly confident about getting there.
Ah, you mean that if the data is in a normalized form. Relational data means, basically, data in tabular form. Tabular data can be normalized or not.
> Depending on your domain, this could be a large amount of your data, or not.
Assuming you are talking about data in normalized form, that's orthogonal to the domain. That's in fact the major breakthrough of relational databases–that data across domains can be modelled as relational and normalized.
Now, in the relatively rarer cases that you don't care about normalizing data, a stream can be useful. But in the vast majority of cases IMHO, we should be looking at a proper relational database like Postgres before thinking about streams.
>Assuming you are talking about data in normalized form, that's orthogonal to the domain.
I don't agree with this though. Nothing about a software architecture is _ever_ orthogonal to the domain.
Keep in mind that I don't know anything about Kafka, but it seems like the same kind of thinking. You can think about your data as a log of events. Since you can always build the current state from the log of events, then you might as well just save the log. This works incredibly well when you treat your data as immutable and need to be able to see all of your history, or you want to be able to revert data to previous states. It also solves all sorts of problems when you want to do things in a distributed fashion.
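The "build the current state from the log" idea is literally a left fold over the events. A hedged sketch (event shapes are made up for illustration):

```python
from functools import reduce

def apply(state, event):
    """Fold one event into the state; state is treated as immutable."""
    kind, key, value = event
    new = dict(state)
    if kind == "put":
        new[key] = value
    elif kind == "delete":
        new.pop(key, None)
    return new

log = [("put", "a", 1), ("put", "b", 2), ("delete", "a", None), ("put", "b", 3)]

# The latest state is the fold over the whole log...
state_now = reduce(apply, log, {})
# ...and "time travel" is the same fold over a prefix of the log.
state_then = reduce(apply, log[:2], {})
```

Because the fold is deterministic, keeping only the log is sufficient: any historical state can be reconstructed on demand, which is exactly the property that makes reverting and auditing cheap.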
The application which fits this model well, and that almost every developer is familiar with, is source control (specifically git). If that is the kind of thing you want to do with your data, then it can be truly wonderful. If it's not what you want to do, then it will be the lodestone that drags you underwater to your death.
The legacy application I work on is a sales system and actually it's not a bad area for this kind of data. You enter data. Modifications are essentially diffs. You can go back and forth through the DB to see how things have changed, when they were changed and often why they were changed. In many ways it is ideal for accounting data.
But as you say, if you want a full-fledged DBMS, this should not be your primary data store. It can act as the frontend for the data store -- with an eventually consistent DB that you use for querying (and you really have to be OK with that "eventually consistent" thing). But you don't want to build your entire app around it.
Kafka is, in some sense, a database, even if specialized for logs of events. For a similar discussion, I would refer to Jim Gray's "Queues Are Databases": https://arxiv.org/pdf/cs/0701158.pdf.
The diagram juxtaposing a chess board (the table) and a sequence of moves (the stream) seems enlightening but actually hides several issues.
* An initial state is required to turn a stream into a table.
* A set of rules is required to reject any move which would break the integrity constraints.
* In the reverse direction, a table value cannot be turned into a stream. We need a history of table values for that.
Hence, the duality, or more accurately the correspondence, is not between stream and table but between constrained-stream and table history.
Even if these issues are not made explicit, the Confluent articles and the Kafka semantics are correct.
* "A table provides mutable data". This highlights that stream-table-duality is relative to a sequence of table states.
* "Another event with key bob arrives" is translated into an update ... to enforce an implicit uniqueness constraint on the key for the tables.
Otherwise, loading data into a real database/data warehouse and using ETL with triggers/views/aggregations is better if you need advanced querying rather than simple stream-to-table transforms.
Edit: From the sibling comments, the answer is https://ksqldb.io.
If your application assumes ACID compliance, it will fail. Also, I haven’t looked into this one as directly, but I imagine that query latency compared to an RDBMS would be quite poor.
IMO, most people reach for Kafka too soon, when an RDBMS with a proper schema that includes transaction-time and valid-time history would better suit the domain. It's possible to write very efficient and ergonomic "append-only", temporal data schemas in Postgres.
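The append-only temporal idea is simpler than it sounds: never UPDATE, only INSERT rows stamped with a validity time, and derive "current" with a query. A hypothetical sketch, using the stdlib sqlite3 so it runs anywhere (in Postgres you would reach for `timestamptz` and range types for the valid period; table and column names here are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE price_history (
        sku        TEXT NOT NULL,
        price      INTEGER NOT NULL,
        valid_from TEXT NOT NULL      -- rows are only ever inserted
    )""")
# A price change is a new row, not an overwrite of the old one.
db.executemany("INSERT INTO price_history VALUES (?, ?, ?)",
               [("widget", 100, "2024-01-01"),
                ("widget", 120, "2024-06-01")])

# "Current" state is just the latest row per key; older rows remain
# for audit, replay, and as-of queries.
current = db.execute("""
    SELECT price FROM price_history
    WHERE sku = 'widget'
    ORDER BY valid_from DESC LIMIT 1""").fetchone()[0]
history = db.execute(
    "SELECT COUNT(*) FROM price_history WHERE sku = 'widget'").fetchone()[0]
```

You get the Kafka-style immutable history while keeping SQL querying, indexes, and transactions.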
Do not do this. Kafka is not a database! Kafka should never be the source of truth for your business. The source of truth should be in whatever consumes data from Kafka downstream when messages are committed as read. Why? Because in your middle layer you can do all the data normalization, sanity checking, processing, and interaction with a REAL database system downstream that can give you things like transactions, ACID, etc.
Of course Confluent wants you to try to use Kafka as a DB, so your usage of it is very high, you pay for the top support package, and they have you by the cojones - but that doesn't mean you should do it. You will miss out on all the benefits of using a real database, and for what benefit? Having a simple client API?
For the record, I have good real world experience with all kinds of databases (relational, NoSQL, and even legacy multivalue and hierarchical ones), and I don't see why what you say has to be "always true".
One way of looking at Kafka is that it is an unbundled transaction log, nothing more or less, so it could be used to permanently store and replay transactional activity, if one wishes. Note that even most databases don't store an immutable, permanent transaction log (they often grow huge and are truncated every so often, with tables holding the current state).
This article by Confluent seems to cover the topic (yes, recognizing it is written by the very vendor you suggest is trying to lock us in): https://www.confluent.io/blog/okay-store-data-apache-kafka/.
Ok, so how about the idea of a persistent, immutable, never-ending transaction log (uhoh, sounds like blockchain now!)? Setting aside Kafka for now, what do you think about the basic design pattern? To me it sounds a bit like it could represent a temporal database in raw transactional log form. Why not?
EDIT: After rereading your comment I see your main concern is using Kafka as a database management system (DBMS). I would agree, that's not what Kafka is for. But, I don't think Confluent intends that use case, do they? I look at it more as an unbundled single component that is very useful by itself, and is part of a more complex data platform/architecture (ex. Lambda or Kappa architecture).
Because nobody wants to be replaying events all the time to get their actual data. They want the data to be, well, materialized. Replaying events can be helpful if you need an audit trail but the systems which need that have mostly all evolved their own audit trail techniques, e.g. double-entry accounting.
The people who invented these streaming event systems quickly realized that continually replaying events from the beginning would get absurdly expensive, so they implemented 'checkpoint' events that snapshot the current state of the data every once in a while, so that replays can start from (a hopefully recent) snapshot and finish quickly. At that point you have to encode the logic of how to roll all events up into the current state into your checkpointing code, which immediately enforces the notion of a global current state of the data, which is in fact what RDBMSs give you anyway.
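The checkpointing idea can be sketched in a few lines (a toy model, not any real system's API): periodically store the folded state keyed by offset, then on recovery resume from the newest snapshot and replay only the tail.

```python
def apply(state, event):
    """Fold one event into a copy of the state."""
    state = dict(state)
    state[event["key"]] = event["value"]
    return state

events = [{"key": "k", "value": i} for i in range(10)]
SNAPSHOT_EVERY = 4

snapshots = {}  # offset -> state as of that offset
state = {}
for offset, ev in enumerate(events):
    state = apply(state, ev)
    if (offset + 1) % SNAPSHOT_EVERY == 0:
        snapshots[offset + 1] = state   # checkpoint at offsets 4, 8, ...

# Recovery: start from the newest snapshot instead of offset 0.
start = max(snapshots)
recovered = snapshots[start]
for ev in events[start:]:
    recovered = apply(recovered, ev)
```

Note that writing the snapshot forces you to define what "the current state" even is - which is the point the comment above is making: you end up rebuilding the materialized state an RDBMS maintains for you.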
My main concern is that when all you have is a hammer, or you already have a hammer lying around, everything looks like a nail. Kafka can be great as a raw event stream, and yes you can store raw events forever, but are raw events really the source of truth for your business? If your workflow is, as I think is appropriate in almost all cases, Kafka->Consumer service->DB, why do you want to rely on Kafka when you have a consumer that can have better logic and custom handling regarding how you actually interpret your events? Moreover, why keep the data in Kafka when you can just plug it into a temporal DB from your service?
The standard hub-and-spoke model would probably have a loan book service that all others call into via HTTP requests, but then you'll have to build a query language on top of that REST API, or share DB access, I suppose.
Or in this model you would stream the loan book updates through Kafka and every consumer that needs a copy of the loan book can keep their own, query it against their own chosen materialised store in its native query language (some might just append to Mongo, or kSQL, some might always use event sourcing per-request). “select * from loans where user.id = current_user” becomes possible here rather than stitching the HTTP responses.
Now I’m not saying one is right or wrong outright as you’re only trading a more native query interface with data consistency concerns, but this for me still feels like a valid architectural pattern and one I’d consider.
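The "every consumer keeps its own materialised copy" idea can be sketched like this (event shapes and store layouts are made up for illustration): two consumers fold the same loan-book update stream into stores shaped for their own queries.

```python
# The shared stream of loan book updates (later update wins per loan).
updates = [
    {"loan_id": 1, "user_id": "u1", "balance": 500},
    {"loan_id": 2, "user_id": "u2", "balance": 900},
    {"loan_id": 1, "user_id": "u1", "balance": 450},
]

# Consumer A materialises by loan id, for point lookups.
by_loan = {}
for u in updates:
    by_loan[u["loan_id"]] = u

# Consumer B materialises by user, i.e. the equivalent of
# "select * from loans where user_id = current_user".
by_user = {}
for u in updates:
    by_user.setdefault(u["user_id"], {})[u["loan_id"]] = u["balance"]
```

Each consumer pays for only the index shape it needs, at the cost of eventual consistency with the stream.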
Whether that's useful for a given use case is certainly debatable. IIRC, the New York Times uses Kafka as the primary storage for their entire archive.
That's only true in the (frequent) case of write-ahead logs. Some database engines may use rollback logs or undo logs.
It’s a different way of working for sure, but for an event-sourced system I think Kafka is an excellent choice. In a SQL database, keeping the complete history of changes for years is much less practical than in Kafka.
What you shouldn’t do is look at Kafka as a database. It’s a message log. You push messages in, you eventually read them back out. The key word is eventually. You can’t read your own writes, and that makes building an interactive system on top hard.
You've probably heard of smth thing and wonder how smth differ from smth else. In this article about smth we will dive into smth things to discover how smth is well better suited to do things thanks to smth things and other things that are really specific to smth! The power of smth enable things to things in a way things do things that make smth something you need to learn about.
So during this journey about smth will be cut in four part the first being an introduction.
(...) --sudo click the link for first part
This is the first part of a fourth series about smth to learn more about things and things in smth.
Smth is very specific about things, and that's specifically why smth is well suited for things. Smth is a new way to do something that will make you think more about things and things. Let's now dive into smth...
Smth use things as a things to do the smth things with tings on smth.........
--sudo repeat marketing ad nauseam
No matter whether the actual product is pure gold or pure garbage, you probably just lost 50% of the readers at some point.
After much angst trying to work around this issue, I finally gave up and switched to Pulsar. Pulsar isn’t without its own issues (mostly around bugs and general maturity) but it handles this particular scenario admirably.
It’s much better than Kafka when you don’t know when the consumer will come back looking for what changed
All that said, if I had my time again I’d probably just use one of the cloud providers’ solutions and spend my efforts elsewhere...
If you're not already technically chained into it, and Confluent hasn't already upsold your poor organization, avoid it.
If you want early flexibility and a rapid PoC, just look at AWS Kinesis/Firehose.
If you're looking at large scale (1+ Gbit ingest, 100k msgs/s, that kind of stuff) then Apache Pulsar is where to go.
Pulsar is still niche in most enterprises.
Kafka is not dead; there are many enterprises (including 2 successful ones I've worked at in the past 5 years) that have built POCs and successful products on Kafka. It supports all languages performantly and has tons of community support. I would argue there is nothing better to build a POC on.
It is possible to write clients in any language; however, it is not that simple, especially when you need to handle the logic of scaling your Kinesis stream out or in (which will split or join shards) and when you have multiple consumers in the same consumer group (you will need a distributed locking mechanism and logic to steal locks if one consumer dies).
So it is not trivial.
This is the only official implementation from AWS:
The client handles horizontal scaling, checkpointing, and shard splits and merges. Using just the SDK, you have to build this yourself (unless you are using Kinesis for use cases that don't need it done correctly).
This is the doc for developing consumers using the SDK
And in the second paragraph of this documentation:
"These examples discuss the Kinesis Data Streams API and use the AWS SDK for Java to get data from a stream. However, for most use cases, you should prefer using the Kinesis Client Library (KCL) . For more information, see Developing KCL 1.x Consumers."
Apache Pulsar definitely has a better architecture, separating ingestion from storage, but once something gains widespread adoption (like Kafka), that leaves little room for alternatives.
If it’s just for some kind of cache, or a system where it's OK to randomly corrupt and drop data, I would sleep better :-)
I wouldn't use it for transactions, though.
Pulsar is great, I love how it combines message queue and pub-sub semantics. Tiered storage etc. The built-in cluster replication is brilliant, although it has some limitations - you can't replicate from cluster A to cluster B to cluster C due to how Pulsar avoids the active/active infinite replication circle of doom.
But, it's early days for it. Anyone adopting it will need to invest more time in learning the innards compared to Kafka.
If you don’t need low end-to-end latency (tailing consumer blocked on long polling), it’s better to use something like HDFS or AWS S3; if you need it but have low throughput, it’s better to use something like RabbitMQ. If you have both high throughput and need tailing, then it’s worth investing in Apache Pulsar.
Kafka actually doesn't have that many great clients either; most are just wrappers around the C library librdkafka.
Pulsar clients are also easy to create because of the stateless protocol, per-message acknowledgements, and optional websockets API. Or you can just use their Kafka bridge adapter and use your existing Kafka clients.
One benefit of Pulsar is that the client library is completely stateless and much simpler. It does not need to know where data is stored; that's all handled by the Pulsar broker.
We might use it ourselves instead of the official Python library which is not async. Async publishing and consuming is a requirement for our project.
Otherwise look at NATS with NATS Streaming (although they're currently redesigning a better version of it). There's also Liftbridge which is an alternative NATS Streaming fork.
Because we increased the allowed message size, I wanted to make sure that disk space was freed after a given interval. This was super opaque to get working. I don't remember the specific config keys, but there were about 4 parameters involved, and 3 had no effect if the 4th wasn't set properly. They all used slightly different naming conventions, so they didn't appear near each other in the documentation.
Still, I agree that defaults are usually the way to go with Apache stuff. If you find yourself turning a lot of knobs to stabilize something, there's probably a design issue.
Kafka in my experience is always the worst solution, unless you need to aggregate HTTP server logs to do offline analytics using something like Spark.
I know folks like Netflix are using it for streaming UI logs into Flink for real time sentiment analysis. There are some very serious use cases for Kafka at scale.
If one just needs a message broker, well, there are other, better ones.
We now have a very overly complex file transmission system and a new CIO.
Sounds awesome, but the implementation pretty much breaks Kafka in that one can no longer scale: topics and partitions are all constrained to a single broker. Hot spots abound!
I've yet to run into a workload that actually needed Kafka (and if you're not a Java shop don't add Java pieces unless you know you need what those Java pieces provide).
Proof of concept (with diagrams in the comments):
Services would commit messages to their own databases along with the rest of their data (with the same transactional guarantees), and then messages are "realtime" replicated (with all of its features and guarantees) to the receiving service's database, where their workers (e.g. https://github.com/que-rb/que, skip-locked polling, etc.) are waiting to respond by inserting messages into their database to be replicated back. Throw in a trigger to automatically acknowledge/cleanup/notify messages and I think we've got something that resembles a queue? Maybe make that same trigger match incoming messages against a "routes" table (based on message type, certain JSON schemas in the payload, etc.) and write matches to the que-rb jobs table instead for some kind of distributed/replicated work queue hybrid?
I'm looking to poke holes in this concept before sinking any more time into exploring the idea. Any feedback/warnings/concerns would be much appreciated, thanks for your time!
Also, do look at the Outbox Pattern. It's basically what you are describing.
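The core of the outbox pattern is that the business row and the outgoing message commit in one local transaction; a separate relay (or logical replication, as in the proposal above) ships the outbox rows afterwards. A hedged sketch using the stdlib sqlite3 (table names and payload shape are made up; a real version would use Postgres):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT, sent INTEGER DEFAULT 0);
""")

# Single transaction: both rows commit together or not at all, so the
# message can never exist without the order (and vice versa).
with db:
    db.execute("INSERT INTO orders (id, item) VALUES (1, 'book')")
    db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
               ("order_created", '{"order_id": 1}'))

# The relay polls unsent rows, publishes them downstream, then marks
# them sent - at-least-once delivery without distributed transactions.
pending = db.execute(
    "SELECT topic, payload FROM outbox WHERE sent = 0").fetchall()
```

This sidesteps the dual-write problem the parent comment's replication scheme is also trying to solve.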
Not exactly the same architecture you are proposing, and quite complex tbh, but it is working for them.
But if the working set fits in memory on the instances processing the writes, you could use PostgreSQL logical replication to update materialized views on other servers easily.
But then it starts looking like Amazon Aurora or the Datomic database.
But if you are OK with partitions being unavailable for many hours, or losing all written data because the cluster did not automatically move a partition to 3 new replicas after 2 of its replicas failed... then it’s a good scalable (speed-wise) product.
There is also no serious multi-tenant support. So if you need multitenancy, you've got to use Kubernetes, do one cluster per tenant, and automate that yourself.
Distributed systems are hard to design correctly.
When using HDFS you can create a single 1TB file if the cluster has enough capacity.
And with HDFS, HBase, Cassandra, etc., if 5 servers permanently fail in the span of 24 hours, you don’t need manual intervention and your data is not lost.
In Kafka, your partition data needs to fit entirely on a single server, and you might still run into issues if that server is also hosting other partitions.
And with Kafka, if the 3 servers manually assigned as replicas for one partition fail one by one before an engineer wakes up in the middle of the night and has time to fix it manually, you have just lost all your data.
This seems unavoidable in any scenario.
All other distributed systems handle this with no issue.
E.g. Cassandra, HBase/HDFS, CockroachDB.
Also, did you look into Cruise Control at all?
But I agree that having to fit any entire replica of a partition on one machine is a weakness of Kafka, and something that Pulsar improves on by using Bookkeeper which only needs to be able to store a segment of a ledger - BK's autorecovery daemon is nice too.
In the case of Pulsar, the ack quorum can be much smaller than the replication factor, so you don’t have to wait for slow replicas.
Also, no number of nodes going down will cause new writes to fail, and under-replicated segments will be fixed automatically using the remaining healthy machines right away.
> In the case of Pulsar, the ack quorum can be much smaller than the replication factor, so you don’t have to wait for slow replicas. Also, no number of nodes going down will cause new writes to fail, and under-replicated segments will be fixed automatically using the remaining healthy machines right away.
I agree, all good points about Pulsar using BK, although worth noting the autorecovery daemon has to be explicitly enabled.
But I'll mention Cruise Control again - it solves all of the issues you've described without manual intervention.
But you will still have brokers at ~100% CPU usage and high network usage while others are mostly idle.
Cruise Control tries to patch around Kafka's design flaws, but there is only so much it can do.
Any broker in the ISR going down still causes a big latency spike, and if that broker was the leader for several partitions, all those partitions will be down until the single cluster controller finishes doing the leader switch sequentially for each of them.
Typically, on a cluster with many partitions per topic, each broker would be the leader for ~1000 partitions, which means it can take up to 15 minutes for the leader switches to complete for all partitions.
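As a back-of-envelope check of that figure (the per-election time here is my assumption, derived only to make the arithmetic line up with the comment's numbers):

```python
# ~1000 partitions per broker, leader elections performed sequentially
# by the single controller at roughly 0.9 s each (assumed).
partitions_per_broker = 1000
seconds_per_election = 0.9

total_minutes = partitions_per_broker * seconds_per_election / 60
# 1000 * 0.9 s = 900 s, i.e. 15 minutes of partitions coming back one by one.
```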
Because this means you don’t care about losing some of the messages for which the broker confirmed to the client library that the write was successful.
If you care more about maximum throughput and efficiency than about data integrity and availability, then Kafka is a good choice.
Nope - the leader only acks once the write has been appended to its local log. It's called at-least-once because it could fail between appending and acking, and then the producer would try again.
You're thinking of acks=0, at-most-once.
Roughly you can think of it as:
- acks=0 - can tolerate only a small (but most common!) subset of failures; better to keep going with the next datum than fail/retry if anything is wrong, stale data might as well be no data or you only need 99% of it anyway; very common for web traffic analysis, hardware sensor data, etc;
- acks=1 - can tolerate client, client/broker, or single broker failure as long as client can retry; nice for e.g. API uploads where the client has more work to get onto but also wants to provide some reasonable feedback to the caller; a good default;
- acks=-1 - can tolerate client, client/broker, broker/broker, or arbitrary broker failure as long as client can retry; good for events which should absolutely not be lost and small enough to keep locally e.g. "batch job finished", "IDS alerted".
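The three levels above map directly onto producer configuration. An illustration only - the `acks` key matches real Kafka producer clients (e.g. kafka-python or confluent-kafka), but nothing here connects to a broker, and the broker address is a placeholder:

```python
base = {"bootstrap_servers": "localhost:9092"}  # placeholder address

# acks=0: fire-and-forget; don't wait for any acknowledgement.
fire_and_forget = {**base, "acks": 0}

# acks=1: wait for the partition leader to append the write (a good default).
leader_only = {**base, "acks": 1}

# acks=all (equivalently -1): wait for all in-sync replicas; for events
# that absolutely must not be lost.
full_quorum = {**base, "acks": "all"}
```

Each step up trades producer throughput and latency for a smaller window of failures that can lose an acknowledged write.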
"At least once" and "at most once" are more often used to describe consumer commit behaviors than producer.
That said, I do find the tutorial flip-flops a bit in target audience. It's mainly "this is what Kafka is", but sometimes has weird asides like "This is how to optimise Kafka" (redundancy, number of partitions, etc) which are pretty distracting from the more fundamental points.
Conveniently so, I might add.
Any project just starting out should use Pub/Sub. One thing I really like is that GCP provides emulators of Pub/Sub et al for local testing. That used to be a bit of an obstacle not too long ago.
In terms of lock-in, I don't see how that applies to an AMQ. The data moving through it should only be transiently persisted, up to a week or two at most in the usual case.
If you want to avoid cloud lock-in, have DB backups, use Postgres/MySQL/etc, containerize your service(s), replicate data in object storage, etc. Common sense stuff, if that's something that's of concern.
Personally, I've seen "vendor lock-in" weaponized as an excuse for a lot of costly NIH bullshit. It's painful to reflect back on a project that could have involved literally a tenth of the time and pain it ended up taking because of that one choice alone.
Dropbox is famous for doing this when they moved from S3 to Magic Pocket - an in-house storage layer they built.
At my company I'm guessing we push over 200M messages a day through it, and it is a breeze to maintain and stable as hell. Maybe gets harder a couple orders of magnitude more, but no complaints here.
Certainly. In my experience the operational cost of Kafka is enormous. I've seen incidents too often for a relatively small (<10 machines) cluster. I would have expected that, after a first long phase of finding the right configuration, we'd reach a phase where it runs smoothly, but I haven't seen it yet.
Google provides a Pub/Sub emulator for local development.
I don't really buy the vendor lock-in thing for Pub/Sub-like systems. The Cloud Pub/Sub usage pattern is basically the same as Kafka, you can have a library that abstracts away the differences. There are open source libraries that do that. If you ever need to switch cloud providers, or want a messaging system to span cloud providers, you can switch without changing lots of code.
I would love to know your experience
I wouldn't recommend relying primarily on something vendor-specific like Amazon SQS, but there are very good out-of-the-box tools like RabbitMQ or Kafka available.
Writing your own messaging system is like writing your own database, it's the wrong choice 99.9% of the time.
1) Abstract the real library behind a tech-agnostic interface to remove as many tech details as possible and focus on behaviour.
2) Create the simplest in-memory dummy implementation of that interface, basically a mock.
3) Write a test spec for that interface's behaviour (does xxx on yyy, returns zzz otherwise) and run it against both the real wrapped implementation and the mock to be sure they're consistent.
4) Make the code load the mock in the local testing env.
5) Use real Kafka when running on a testing pipeline.
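Steps 1-3 can be sketched like this (all names are illustrative; a real Kafka-backed implementation would wrap an actual client behind the same interface):

```python
from abc import ABC, abstractmethod

class MessageBus(ABC):
    """Step 1: tech-agnostic interface; no Kafka details leak through."""
    @abstractmethod
    def publish(self, topic: str, message: str) -> None: ...
    @abstractmethod
    def poll(self, topic: str) -> list:
        """Return all messages seen so far on a topic, in order."""

class InMemoryBus(MessageBus):
    """Step 2: the simplest dummy implementation, basically a mock."""
    def __init__(self):
        self._topics = {}
    def publish(self, topic, message):
        self._topics.setdefault(topic, []).append(message)
    def poll(self, topic):
        return self._topics.get(topic, [])

def behaviour_spec(bus):
    """Step 3: one spec, run against BOTH the mock and the real wrapper."""
    bus.publish("t", "a")
    bus.publish("t", "b")
    assert bus.poll("t") == ["a", "b"]   # ordering is preserved
    assert bus.poll("empty") == []       # unknown topics are empty

behaviour_spec(InMemoryBus())  # the Kafka-backed impl would reuse this call
```

Running the same spec against both implementations is what keeps the mock honest: if the mock and the real wrapper diverge, the spec catches it before your local tests start lying to you.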
In the same way you can abstract various other things - e.g. once I abstracted away proprietary services and locally flushed everything into a local Postgres. You can emulate multiple services there - a database, a somewhat-usable queue, etc. - and get some extra control that's nice for tests (e.g. make it throw errors on demand, or add timeouts to simulate network delays). That saved me a lot of local development time without sacrificing quality.
If you truly want something embedded, Debezium's KafkaCluster could be interesting: https://github.com/debezium/debezium/blob/master/debezium-co....
It's used for our own tests of Debezium, but I'm aware of several external users. It spins up AK and ZK embedded. Disclaimer: I work on Debezium.
Dev time is virtually identical between the two kinds of infrastructure.
The whole human activity of reducing science to practical art will do just fine without knowing Apache Kafka, thanks.