
Great question. The complication arises because of failures: one or more destinations may be down for any length of time, individual payloads may be bad, etc. To handle all of this you need to retry with timeouts while not blocking other events or other destinations. Also, not all events may be going to all destinations.
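
Roughly, the isolation looks like this. This is just a toy sketch in Go; the destination names and the deliver call are made up for illustration, not our actual code:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // deliver is a hypothetical stand-in for an HTTP call to a destination.
    func deliver(ctx context.Context, dest, payload string) error {
        if payload == "bad" {
            return errors.New("payload rejected")
        }
        fmt.Printf("sent %q to %s\n", payload, dest)
        return nil
    }

    // runDestination retries failed payloads with a timeout per attempt.
    // Each destination runs in its own goroutine, so a slow or down
    // destination never blocks the others.
    func runDestination(dest string, events <-chan string) {
        for payload := range events {
            for attempt := 1; attempt <= 5; attempt++ {
                ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
                err := deliver(ctx, dest, payload)
                cancel()
                if err == nil {
                    break
                }
                time.Sleep(time.Duration(attempt*500) * time.Millisecond) // simple backoff
            }
        }
    }

    func main() {
        // Per-destination queues; not every event goes to every destination.
        dests := map[string]chan string{
            "warehouse": make(chan string, 100),
            "analytics": make(chan string, 100),
        }
        for name, ch := range dests {
            go runDestination(name, ch)
        }
        dests["warehouse"] <- "event-1"
        dests["analytics"] <- "event-1"
        dests["warehouse"] <- "event-2"
        time.Sleep(time.Second) // let the goroutines drain (demo only)
    }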

We built our own streaming abstraction on top of Postgres. Think layman's leveled compaction. We will write a blog post on that soon. The code (jobsdb/jobsdb.go) has some comments too in case you want to check it out. Segment had a similar architecture and a blog post on it, but I can't seem to find it. Also, eventually we will replace Postgres with something lower-level like RocksDB or even native files.
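
If it helps picture it, here's a toy version of the idea: an append-only jobs table plus a separate status table. This is not the real jobsdb schema, just an illustration:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // Postgres driver; the DSN below is illustrative
    )

    const schema = `
    CREATE TABLE IF NOT EXISTS jobs (
        job_id     BIGSERIAL PRIMARY KEY,   -- append-only, monotonically increasing
        user_id    TEXT NOT NULL,
        payload    JSONB NOT NULL,
        created_at TIMESTAMPTZ DEFAULT now()
    );
    CREATE TABLE IF NOT EXISTS job_status (
        job_id       BIGINT REFERENCES jobs(job_id),
        state        TEXT NOT NULL,         -- e.g. waiting / failed / succeeded
        attempted_at TIMESTAMPTZ DEFAULT now()
    );`

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/eventlog?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(schema); err != nil {
            log.Fatal(err)
        }

        // Writers only ever append to jobs; state changes are appended to job_status.
        var id int64
        err = db.QueryRow(
            `INSERT INTO jobs (user_id, payload) VALUES ($1, $2) RETURNING job_id`,
            "user-42", `{"event":"page_view"}`).Scan(&id)
        if err != nil {
            log.Fatal(err)
        }

        // The reader picks up jobs whose latest status is not yet terminal, in
        // job_id order; fully delivered ranges can later be dropped in bulk
        // (the "layman's leveled compaction" mentioned above).
        if _, err := db.Exec(
            `INSERT INTO job_status (job_id, state) VALUES ($1, 'succeeded')`, id); err != nil {
            log.Fatal(err)
        }
    }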

Yes, in theory you can use Kafka's streaming abstraction and create a topic per destination. Two reasons we didn't go that route:

1) We were told Kafka is not easy to support in an on-prem environment. We are not Kafka experts, but we paid heed to people who have designed and shipped such on-prem software.

2) More importantly, for a given destination we have dozens of writers all reading from the same stream. The only ordering requirement is that events from a given device (end consumer) stay in order, so we assign the same user to the same writer. The writers themselves are independent: if a payload fails, we just block events from that user while other users continue. Blocking the whole stream for that one bad payload (retried 4-5 times) would slow things down quite a bit. Achieving the same abstraction on Kafka would have meant dozens of topics per destination.
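
A toy sketch of that assignment and the per-user blocking (illustrative only, not the actual router code):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numWriters = 8 // dozens in practice; fixed per destination

    // writerFor pins every event from a given user to the same writer,
    // which preserves per-user ordering without ordering the whole stream.
    func writerFor(userID string) int {
        h := fnv.New32a()
        h.Write([]byte(userID))
        return int(h.Sum32()) % numWriters
    }

    // Inside a writer, a failure only blocks the offending user's queue;
    // other users assigned to the same writer keep flowing.
    type writer struct {
        pending map[string][]string // per-user FIFO queues
        blocked map[string]bool     // users whose last payload failed and is being retried
    }

    func (w *writer) drain(send func(user, payload string) error) {
        for user, q := range w.pending {
            if w.blocked[user] || len(q) == 0 {
                continue // skip blocked users; everyone else proceeds
            }
            if err := send(user, q[0]); err != nil {
                w.blocked[user] = true // retry later with backoff (not shown)
                continue
            }
            w.pending[user] = q[1:]
        }
    }

    func main() {
        fmt.Println("user-123 goes to writer", writerFor("user-123"))
    }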




I'm curious how you guys are using Postgres as a log.

I took a brief look at the code, and while the append-only dataset strategy is sound, it looks like your scenario only has a single reader and a single writer?

In my experience, it's not entirely trivial when you have:

1. Multiple readers, each of which needs to be able to follow the log from a different position in real time.

2. Multiple writers receiving events that need to be written to the end of the log.

3. Real-time requirements.

From what I can tell — I could be wrong here — your system doesn't need to poll the table constantly, because you also save the log to RAM, so whenever you receive an event, you can optimistically handle it in memory and merely issue status updates. If anything goes wrong, a reader can replay from the database.

But that doesn't work with multiple writer nodes where each node receives just a part of the whole stream. The only way for this to work would be to dedicate a writer node to each stream so that it goes through the same RAM queue. So then you need a whole system that uses either Postgres or a consensus system like etcd to route messages to a single writer, and you need to be able to recover when a writer has been unavailable.
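
For the Postgres route, something like session-scoped advisory locks could do the writer election. A rough sketch, purely illustrative and not from their codebase:

    package main

    import (
        "context"
        "database/sql"
        "hash/fnv"
        "log"

        _ "github.com/lib/pq"
    )

    // claimStream tries to become the single writer for a stream by taking a
    // Postgres advisory lock keyed on the stream name. Advisory locks are
    // session-scoped, so we pin a dedicated connection; if the holder dies,
    // its connection drops and another node can claim the stream.
    func claimStream(ctx context.Context, db *sql.DB, stream string) (*sql.Conn, bool, error) {
        conn, err := db.Conn(ctx)
        if err != nil {
            return nil, false, err
        }
        h := fnv.New64a()
        h.Write([]byte(stream))
        var got bool
        if err := conn.QueryRowContext(ctx,
            `SELECT pg_try_advisory_lock($1)`, int64(h.Sum64())).Scan(&got); err != nil {
            conn.Close()
            return nil, false, err
        }
        if !got {
            conn.Close() // someone else owns this stream
            return nil, false, nil
        }
        return conn, true, nil // keep conn open for as long as we own the stream
    }

    func main() {
        ctx := context.Background()
        db, err := sql.Open("postgres", "postgres://localhost/eventlog?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        if _, ok, err := claimStream(ctx, db, "user-123"); err != nil {
            log.Fatal(err)
        } else {
            log.Printf("claimed user-123: %v", ok) // demo: conn is discarded here
        }
    }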

Edit: I see you wrote that "we assign same user to same writer", so you're doing something like that.


Agreed. Our current implementation does not work when there are multiple readers for the same event stream, where we would need to track per-reader watermarks. We have a very simple model where one reader reads from the DB and distributes the work to multiple workers (e.g. network writers), which in turn update the job status.
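
If we ever add multiple readers, the per-reader watermark could be as simple as a small offsets table keyed by reader. Sketch only, assuming a jobs table like the one above; this is not implemented:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq"
    )

    // Each reader keeps its own watermark: the highest job_id it has fully
    // processed. Readers advance independently, so a slow reader never holds
    // back a fast one.
    const offsetsSchema = `
    CREATE TABLE IF NOT EXISTS reader_offsets (
        reader_id TEXT PRIMARY KEY,
        last_job  BIGINT NOT NULL DEFAULT 0
    );`

    func nextBatch(db *sql.DB, readerID string, limit int) (*sql.Rows, error) {
        // Read strictly above this reader's watermark, in append order.
        return db.Query(`
            SELECT j.job_id, j.payload
              FROM jobs j, reader_offsets r
             WHERE r.reader_id = $1 AND j.job_id > r.last_job
             ORDER BY j.job_id
             LIMIT $2`, readerID, limit)
    }

    func commit(db *sql.DB, readerID string, lastJob int64) error {
        _, err := db.Exec(`
            INSERT INTO reader_offsets (reader_id, last_job) VALUES ($1, $2)
            ON CONFLICT (reader_id) DO UPDATE SET last_job = EXCLUDED.last_job`,
            readerID, lastJob)
        return err
    }

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/eventlog?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(offsetsSchema); err != nil {
            log.Fatal(err)
        }
        if _, err := db.Exec(`INSERT INTO reader_offsets (reader_id) VALUES ('router-1')
                              ON CONFLICT (reader_id) DO NOTHING`); err != nil {
            log.Fatal(err)
        }
        log.Println("reader_offsets ready; use nextBatch/commit per reader")
    }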

Multiple writers should work though. StoreJob() should handle that.

I missed the logging-to-RAM part. Yes, we've always wanted to do that but haven't gotten to it yet. Right now, all events move through the DB - between the gateway and processor and then the router. Hence, we poll the table constantly.

Would love it if you joined our Discord channel https://discordapp.com/channels/625629179697692673/625629179.... It's slightly easier to have technical discussions there :)


Referring to this Segment blog post? Was my first read on the space, found it pretty informative.

https://segment.com/blog/exactly-once-delivery/


Thanks. Will check it out :)



