What that post describes (all work going to one or a few workers) doesn't really happen in practice if you properly randomize the ID of the item/task (e.g. just use a random UUID) when inserting it into Kafka.
With that (and partitioning based on that ID/value), all your consumers/workers will get an equal amount of messages/tasks.
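That even-distribution claim can be sanity-checked with a quick simulation - a sketch assuming random UUID keys and a hash-then-modulo partitioner (Kafka's default partitioner uses murmur2 on the key; md5 below is just a deterministic stand-in for the simulation):

```python
import uuid
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    # Kafka's default partitioner does murmur2(key) % num_partitions;
    # md5 here is only a stand-in hash for this simulation.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def distribution(num_messages: int, num_partitions: int) -> Counter:
    counts = Counter()
    for _ in range(num_messages):
        counts[partition_for(str(uuid.uuid4()), num_partitions)] += 1
    return counts

# With enough messages, every partition ends up close to 1/8 of the total.
print(sorted(distribution(100_000, 8).values()))
```

At high volume the per-partition counts land within a few percent of each other - which is exactly the regime this comment assumes.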
Both the post and the general theme of the comments here seem to be trashing the choice of Kafka for low volume.
Interestingly, both ignore other valid reasons/requirements that make Kafka a perfectly good choice despite low volume, e.g.:
- multiple different consumers/workers consuming same messages at their own pace
- needing to rewind/replay messages
- a guarantee that all messages related to a specific user (think bank transactions in the textbook CQRS example) will be handled by one pod/consumer, in a consistent order
- needing to chain async processing
And I'm probably forgetting a bunch of other use cases.
And yes, even with good sharding - a mix of small/quick tasks and big/long ones can still lead to non-optimal situations where the quick work waits for a bigger task to finish.
However - if you have other valid reasons to use Kafka, and it's just this mix of small and big tasks that's making you hesitant... IMHO it's still worth trying Kafka.
Between using bigger buckets (so instead of fetching one item, fetch more messages and handle the work via async/threads/etc.) and Kafka automatically redistributing partitions when some workers are slow... you might be surprised that it just works.
And sure - you might need to create more than one topic (e.g. light, medium, heavy) so your light work doesn't need to wait for the heavier one.
Finally - I still haven't seen anyone mention the actual deal breakers for Kafka.
Off the top of my head, a big one is that there's no guarantee a message will be processed only once - even without you manually rewinding/reprocessing it.
It's possible (and common) for a worker to pick up a message from Kafka, process it (write/materialize/update something), and only when it's about to commit the Kafka offset (effectively marking it as really done) find out that Kafka has already rebalanced and another pod now owns that particular partition.
So if you can't model the messages (or the rest of the system) in a way that handles such things - say, with versioning you might be able to just skip the work if you know the underlying materialized data already incorporates it, or maybe the whole thing is fine with INSERT ... ON DUPLICATE KEY UPDATE - then Kafka is probably not the right solution.
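For illustration, a minimal sketch of that versioning idea - a hypothetical in-memory store standing in for a DB table with a version column, so a redelivered message becomes a no-op:

```python
# Hypothetical in-memory stand-in for a table with a version column;
# a real system would do this with e.g. INSERT ... ON DUPLICATE KEY UPDATE
# guarded by "version < new_version".
store: dict[str, tuple[int, str]] = {}  # key -> (version, value)

def apply_message(key: str, version: int, value: str) -> bool:
    """Apply a message idempotently; return False for a stale/duplicate delivery."""
    current = store.get(key)
    if current is not None and current[0] >= version:
        return False  # already incorporated - safe to skip on redelivery
    store[key] = (version, value)
    return True

assert apply_message("user-1", 1, "balance=10") is True
assert apply_message("user-1", 1, "balance=10") is False  # duplicate delivery
assert apply_message("user-1", 2, "balance=25") is True
```

With this shape, the at-least-once delivery described above becomes harmless: reprocessing an already-applied message just returns False.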
You say:
> What that post describes (all work going to one or a few workers) doesn't really happen in practice if you properly randomize the ID of the item/task (e.g. just use a random UUID) when inserting it into Kafka.
I would love to be wrong about this, but I don't _think_ this changes things. When you have few enough messages, you can still get unlucky and randomly choose the "wrong" partitions. To me, it's a fundamental probability thing - if you roll the dice enough times, it all evens out (high enough message volume), but this article is about what happens when you _don't_ roll the dice enough times.
Fair enough. I agree .25^20 is basically infinitesimal, and even with a smaller exponent (like .25^3) the odds are not great, so I appreciate you calling this out.
Flipping this around, though: if you have 4 workers total and 3 are busy with jobs (1 idle), your next job has only a 25% chance of hitting the idle worker. This is what I see most in practice - there is a backlog, and yet not all workers are busy.
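A quick simulation of that "not enough dice rolls" effect - randomly assigning only 8 messages across 4 workers and counting how often some worker ends up with nothing to do:

```python
import random

def trial(num_messages: int, num_workers: int) -> bool:
    """True if at least one worker receives no messages at all."""
    loads = [0] * num_workers
    for _ in range(num_messages):
        loads[random.randrange(num_workers)] += 1
    return 0 in loads

trials = 100_000
hits = sum(trial(8, 4) for _ in range(trials))
# With only 8 messages over 4 workers, some worker sits completely idle
# roughly 38% of the time (exact: ~0.377 by inclusion-exclusion).
print(hits / trials)
```

So even with perfectly random assignment, at low volume the lumpiness is very real - it only averages out as message counts grow.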
With Kafka you normally don't pick a worker - Kafka does that. IIRC with some sort of consistent hashing, but for simplicity's sake let's say it's just modulo: 'messageID % numberOfPartitions'.
You control/configure numberOfPartitions, and it's usually set an order of magnitude bigger than your expected number of workers (to be precise, the number of pods or hardware boxes/servers) - e.g. 32, 64 or 128.
So in practice Kafka assigns multiple partitions to each of your workers (and if you have more workers than partitions, some workers don't get any work).
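A toy sketch of that idea (real Kafka consumer groups use range/round-robin/sticky assignment strategies; the plain round-robin below is a simplification):

```python
def assign_partitions(num_partitions: int, workers: list[str]) -> dict[str, list[int]]:
    # Simplified round-robin stand-in for Kafka's consumer-group
    # partition assignment.
    assignment: dict[str, list[int]] = {w: [] for w in workers}
    for p in range(num_partitions):
        assignment[workers[p % len(workers)]].append(p)
    return assignment

# 32 partitions over 3 workers: each worker owns ~10-11 partitions, so
# adding/removing a worker just reshuffles partition ownership.
print(assign_partitions(32, ["pod-a", "pod-b", "pod-c"]))
```

This is why over-provisioning partitions relative to workers gives you headroom: scaling consumers up or down only moves partition ownership around.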
And while each of your workers is limited to one thread for consuming Kafka messages, each worker can still process multiple messages at the same time - in different threads/async tasks.
To me it seems like your underlying assumption is "1 worker can only work on one message/item at a time", right?
While you could also use Kafka like that - and it might even work for your use case, as long as you configure the option (sorry, I forgot the name) that makes Kafka redistribute partitions when particular workers/consumers are too slow.
AFAIK the usual way is for each worker to fetch more than one message/item at a time, and do the actual work in a separate thread/worker pool (or another async mechanism).
Kafka then keeps track of which messages were picked up by each worker/consumer, and how big the gap is between that and the committed offset (i.e. what's marked as done).
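A minimal sketch of that pattern, with a plain list standing in for the polled batch (no real Kafka client involved):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a consumer loop: poll a batch, fan the work
# out to a thread pool, and only commit the offset once the batch is done.
messages = [f"task-{i}" for i in range(10)]  # pretend this came from poll()
committed_offset = 0

def handle(msg: str) -> str:
    return msg.upper()  # placeholder for the real work

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, messages))

# Commit only after every message in the batch is processed: a crash
# mid-batch means reprocessing (at-least-once), never losing work.
committed_offset += len(messages)
print(committed_offset, results[0])
```

The single consuming thread stays fast (it only polls and commits), while the actual work runs in parallel behind it.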
It gets a bit more tricky if you:
- can't afford to process some messages/work twice (at the extreme end this might actually be a showstopper for using Kafka)
- need automatic retry on error/failure - how quickly or slowly to retry, how many times to retry...etc.
- can't afford to temporarily "lose" some pending items (picked up from Kafka but offset not yet committed) to random events (worker OOMKilled, solar flare hitting a network cable...)
We've actually solved some of these by simply having another (set of) worker(s) that consumes the same topic with a delay (imagine a cron job that runs every 5 minutes) and, if there's no record of a task being done, puts it into the same topic again for retry...etc.
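A sketch of such a delayed sweeper, with dicts standing in for the "picked up" and "done" records (the names and the 5-minute timeout are illustrative, not from the original setup):

```python
import time

# Hypothetical sweeper: anything picked up but not marked done within
# `timeout` seconds gets pushed back onto the queue for another attempt.
queue: list[str] = []
pending: dict[str, float] = {"task-1": time.time() - 600, "task-2": time.time()}
done: set[str] = set()

def sweep(timeout: float = 300.0) -> None:
    now = time.time()
    for task, picked_up_at in list(pending.items()):
        if task not in done and now - picked_up_at > timeout:
            queue.append(task)   # re-enqueue for retry
            del pending[task]

sweep()
print(queue)  # task-1 was pending too long; task-2 is still fresh
```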
The other thing that's a PITA with Kafka is failure/retry handling.
If you want to continue processing other/newer messages (and usually you do), you need to commit the Kafka topic offset - leaving you to figure out what to do with the failed message.
One simple option is just re-inserting it into the same topic (at the end). If it was a transient error, that could be enough.
Instead of the same topic, you can also insert it into a separate failedX Kafka topic (and have that topic processed by a cron-like scheduled task).
And if you need things like progressive backoff before reprocessing attempts - you likely want to push failed items into something else.
While that could be another task system/setup where you can specify how many reprocessing attempts to make, how long to wait before the next attempt...etc., often it's enough to have a simple DB table.
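A sketch of such a DB-backed retry table, with a dict standing in for the table and an assumed doubling-backoff policy (the delays and attempt limit are illustrative):

```python
import time

BASE_DELAY = 30.0   # seconds before first retry (assumed policy)
MAX_ATTEMPTS = 5    # give up and dead-letter after this many failures

# Dict standing in for a DB table with (attempts, next_attempt_at) columns.
retries: dict[str, dict] = {}

def record_failure(task_id: str) -> bool:
    """Schedule the next attempt; return False once the task is given up on."""
    entry = retries.setdefault(task_id, {"attempts": 0})
    entry["attempts"] += 1
    if entry["attempts"] >= MAX_ATTEMPTS:
        return False  # push to a dead-letter destination instead
    # 30s, 60s, 120s, ... - doubling per attempt
    entry["next_attempt_at"] = time.time() + BASE_DELAY * 2 ** (entry["attempts"] - 1)
    return True

assert record_failure("task-42") is True  # first retry scheduled
assert retries["task-42"]["attempts"] == 1
```

A periodic job then just selects rows with next_attempt_at in the past and re-enqueues them.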
And while kids often end up with just a headache and a slight fever that goes away after a day - adults end up with at least a proper flu (higher fever, joint/muscle pain...) for a week or two.
Are you sure the frustration comes from you needing more time (to ramp up, understand, go deeper...etc.), or from something else?
Perhaps you really dislike (micro)management and need more autonomy/ownership/control (or just input/participation on your end) over your time/approach/order of tasks?
Also think/elaborate more on where this "wasted a day investigating something for nothing" feeling is coming from.
Perhaps you're not understanding (or not even getting to hear/see) the whole picture of why "business thinks investigating that is worth one day of your salary"? Or maybe you know and disagree?
Or is it happening so often that you're basically constantly being bounced between such tasks?
Anyway - I wouldn't necessarily listen to the "you're still getting paid so who cares" advice crowd...
I believe such stuff is like cilantro/coriander.
For some of us - those without specific gene mutations - that stuff actually smells/tastes like dish soap.
And while some people can just clock in and collect the paycheck - others are wired differently.
So really spend more time zeroing in on what exactly is (de)motivating you.
Meanwhile there's not that much posted/discussed about good (so not 10x - just 1x instead of 0.5x or negative) engineering/product managers.
In my experience having worn both IC and TL/manager hats - even just going from an "adequate" to an "OK" team/engineering/product manager leads to impact/output that's bigger than having the best (that mythical 10x) engineer.
Simply being connected to a mobile (or wifi) network is enough to get your location - sometimes with pretty good precision.
So if you're worried about the NSA or the like - you'd better not have a mobile phone/device (or a car - new cars sold in the EU all have an eSIM for built-in emergency calls) at all.
As a particular first-hand example - the Xplora smart watch/phone got super confused when my kids' school physically moved.
The new building has an indoor sports hall/gym (I think it's basketball-court sized) on the top floor - and all the reinforced concrete means mobile reception can be hit and miss (even on 3G/4G).
Despite the phone/watch never connecting to the school's (staff-only) wifi, after the move to the new building the watch and parent app would regularly bounce the location between the old and new school buildings.
When even 3rd-party companies have a mapping between wifi SSIDs and approximate geo locations, you can imagine state-sponsored actors have at least an order of magnitude more.
If your wifi client device can't find an access point, it goes around broadcasting probes for every saved wifi network you've got. You can learn an awful lot from that.
Probably the school moved all of their Wi-Fi access points with it? These are often used for indoor positioning, even without any device connecting to them.
We had the same experience with the Xplora watches: we moved 6km away and took the internet contract with us. Whenever the watch was inside the house, it would show it as being at the old address. Outside of the house and away from wifi, it showed the location correctly. I imagine this is an edge case.
Those daily long lunches were the time/place where, on rare occasions, ad-hoc discussions would lead to a new way of doing something, a new (micro)service...
Basically the type of synergy/innovation that business typically believes happens by the water cooler.
On the other hand, in my experience and observation - just one or two 2-3 day hackathons (so <=6 work days in total) produced almost an order of magnitude more synergy/innovation than those daily (long) lunches over the remaining ~5 years of workdays.
Though sure - some things like initial employee onboarding (or even an internal move to a different unit/team), trainings/workshops, (semi)annual or quarterly planning...etc. are usually better in person.
Yet I've basically never had any of those at an office desk. It was either a bigger conference room (or a dedicated auditorium/presentation/workshop room), or a meeting/lunch between a manager or (more) senior person and one or two "newbies".
And that's after initial onboarding (which is usually a week or two of the "newbie" not being at a desk with the rest of their new team).
All of those things add up to 1-2 days in the office per month (including giving talks/trainings/workshops to "newbies").
Add 1 day for interviewing candidates face to face, and we're up to 2-3 days per month.
Even if you have a hackathon every month - and the most I've seen was 2-3 days once a quarter, though usually it was 2-3 days per half-year, so nothing close to the mythical Google 20% time - we're still talking about at most 4-5 days in the office per month.
A while ago I read somewhere that it was because, for a long time, (American) trucks were technically not cars - so manufacturers could save money on a bunch of safety requirements.
The modern take would be to simply not open anything to the outside world - except WireGuard (Tailscale or such).
From there everything is either considered "localhost" or a local network.
You can set up one or two central boxes (the actual home-lab "server" where you already have HTTP-based services, plus a Raspberry Pi Zero 2 as backup) with Tailscale.
With remote devices (including phones) in the same Tailscale network, you can access anything on the home network as if you were physically home (and also have ACLs for kids/friends/etc.).
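For a rough idea, a minimal WireGuard config sketch for such a central box (the addresses, interface name, and key placeholders are illustrative, not a working config):

```ini
# /etc/wireguard/wg0.conf on the home "server" (sketch only)
[Interface]
Address = 10.8.0.1/24          ; virtual network the remote devices join
ListenPort = 51820             ; the one UDP port exposed to the outside
PrivateKey = <server-private-key>

# One [Peer] section per remote device (phone, laptop, ...)
[Peer]
PublicKey = <phone-public-key>
AllowedIPs = 10.8.0.2/32       ; this peer's address inside the tunnel
```

Tailscale manages the equivalent of this (plus key exchange and NAT traversal) for you, and adds ACLs on top.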
On the other (professional) end - there NginX and SSH aren't even on the same network interface, and you run the NginX LB/reverse proxy on separate boxes from the ones where the actual apps/websites live...etc.
In the case of a "zero trust network", the answer is no, it doesn't violate it.
With WireGuard or Tailscale/Cloudflare/etc. you still know/verify the identity of every person/device that has access to the (virtual, and through it, real) network.