
Secor: A service persisting Kafka logs to S3, GCS and Azure Blob Storage - mooreds
https://github.com/pinterest/secor
======
jonathanoliver
I love this. This is a property already baked into another message queuing
system, Pulsar, and I've thought about writing my own for internal use
(maybe to be open-sourced). The only quirk that I see is this line in the
project's README:

> it is guaranteed that each message will be saved in exactly one S3 file

This can get expensive. Specifically, a 1-to-1 write per message to S3 (or
any cloud storage provider) gets expensive very, very quickly. Further, when
reading from cloud storage, you've got to read one message (object) at a
time, which adds up as well. I'm not talking about egress bandwidth either;
I'm talking about S3 PUT/GET operations. I'd love to see some kind of
batching operation that takes groups of messages (perhaps with configurable
time/size limits?) and writes them as a blob to cloud storage.
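
For a rough sense of scale (assuming S3 standard pricing on the order of
$0.005 per 1,000 PUT requests, and ~1KB messages):

    per-message writes: 1e9 messages -> 1e9 PUTs         -> ~$5,000 in PUT fees
    256MB batches:      1e9 x 1KB ~= 1TB -> ~4,000 PUTs  -> ~$0.02  in PUT fees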

~~~
onefuncman
It's not a 1:1 write per message; it relies on Kafka offset guarantees to
batch messages into S3 at a configurable threshold. 256MB per S3 object is
common.
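
For anyone curious what that pattern looks like, here's a minimal sketch in
Java (not Secor's actual code; the 256MB threshold and the upload call are
illustrative): buffer records locally, do a single PUT per chunk, and only
commit offsets after a durable upload, so a crash replays the un-uploaded
tail instead of losing it.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.io.ByteArrayOutputStream;
    import java.time.Duration;

    public class BatchingUploader {
        // Configurable threshold; 256MB per object, as mentioned above.
        private static final long MAX_CHUNK_BYTES = 256L * 1024 * 1024;

        public static void run(KafkaConsumer<byte[], byte[]> consumer) {
            ByteArrayOutputStream chunk = new ByteArrayOutputStream();
            while (true) {
                ConsumerRecords<byte[], byte[]> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    chunk.writeBytes(record.value()); // Java 11+
                    chunk.write('\n');
                }
                if (chunk.size() >= MAX_CHUNK_BYTES) {
                    upload(chunk.toByteArray()); // one PUT for the whole chunk
                    consumer.commitSync();       // offsets advance only after upload
                    chunk.reset();
                }
            }
        }

        private static void upload(byte[] chunk) {
            // e.g. a single S3 PutObject call for the whole chunk (omitted)
        }
    }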

~~~
jonathanoliver
Oh! That's awesome. The way that sentence is written is a little tricky to
parse, but the behavior of batching lots of messages (into configurable
256MB+ chunks) is exactly what I'm hoping for.

------
derefr
I've wondered for a while now why there isn't something like a cross between
Kafka (durable queue broker) and Datomic (a "database" partitioned into
storage chunks, where each chunk canonically lives in object storage and is
only _cached_ by the instances doing the queries.)

I'd love to see a queuing infrastructure that consists of two parallel
clusters, running on the same machines:

• a set of _write nodes_, which each "own" partitions on a topic (in the
sense of being that partition's sole writer); where a write-node receives
writes to its partition until it fills up a chunk, at which point it shoves
the chunk into object storage;

• a set of _read nodes_, which serve _reads_ and/or _streaming subscription
requests_ by 1. streaming archival data by retrieving chunks from object
storage into a read-node-local MRU-on-disk-pressure cache; and then 2. once
the read stream "catches up" to the end of the archival data, switching to
proxying to the relevant write-node, which serves the query/stream from its
uncommitted chunk (sketched below).
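
A rough sketch of that read-node routing logic, purely to illustrate the
design above (the ChunkCache, ObjectStore, and WriteNodeClient interfaces
are hypothetical, not an existing system's API):

    import java.util.Optional;

    interface ChunkCache {
        Optional<byte[]> get(String topic, int partition, long chunkId);
        void put(String topic, int partition, long chunkId, byte[] data);
    }

    interface ObjectStore {
        Optional<byte[]> fetchChunk(String topic, int partition, long chunkId);
    }

    interface WriteNodeClient {
        byte[] readUncommitted(String topic, int partition, long chunkId);
    }

    public class ReadNode {
        private final ChunkCache cache;       // MRU-on-disk-pressure local cache
        private final ObjectStore store;      // canonical home of committed chunks
        private final WriteNodeClient writer; // sole writer for the partition

        public ReadNode(ChunkCache cache, ObjectStore store, WriteNodeClient writer) {
            this.cache = cache;
            this.store = store;
            this.writer = writer;
        }

        public byte[] read(String topic, int partition, long chunkId) {
            // 1. Archival reads: serve from the local cache, pulling the
            //    chunk down from object storage on a miss.
            Optional<byte[]> cached = cache.get(topic, partition, chunkId);
            if (cached.isPresent()) {
                return cached.get();
            }
            Optional<byte[]> archived = store.fetchChunk(topic, partition, chunkId);
            if (archived.isPresent()) {
                cache.put(topic, partition, chunkId, archived.get());
                return archived.get();
            }
            // 2. Caught up to the head: proxy to the write node, which
            //    serves the read from its uncommitted, in-progress chunk.
            return writer.readUncommitted(topic, partition, chunkId);
        }
    }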

-----

I feel like this queue architecture would be useful for all sorts of things,
at least for my own use-cases. Especially because it'd scale perfectly well
_down_ to running one read-node and one write-node on one machine, while
still being able to make durability guarantees about the _committed_ data.
(The durability of _uncommitted_ data I don't need to worry much about,
personally.)

Is there anything like this on the market?

~~~
rad_gruchalski
Maybe Pulsar gets close. Your general question regarding Kafka reminds me of
this: https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32929d0a57b2

I’ve written that.

Edit: the Kafka JIRA did not get far:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3726

~~~
PhoenixReborn
Confluent has actually released "tiered storage", which appears to be a
similar feature to what Pulsar has:
https://www.confluent.io/blog/infinite-kafka-storage-in-confluent-platform/

It's still up in the air whether this feature will make its way into
open-source Kafka, though.

------
chrisjc
I've been following this project for a long time, but I have to ask: what
are the advantages over using Kafka Connect sinks?

https://docs.confluent.io/current/connect/kafka-connect-s3/index.html

~~~
mooreds
Looks like the S3 connector was released in 2017, so if you wanted this
functionality before then, Secor was a good option:
https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once

It also supports cloud object stores other than S3.

~~~
chrisjc
Yup, there was definitely a reason to consider Secor before 2017. That's
part of the reason I'm familiar with Secor in the first place.

Just to be clear, there are dozens of Kafka Connect sinks and sources, and
finding Kafka Connect sinks for other blob storage solutions should be easy.
If you can't, the KC framework makes it pretty easy to develop your own.
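
To give a sense of how little that takes, here's a bare-bones sink task
skeleton (the class name and the buffering/upload details are placeholders;
a complete connector also needs a matching SinkConnector class and a config
definition):

    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    import java.util.Collection;
    import java.util.Map;

    public class MyBlobStoreSinkTask extends SinkTask {
        @Override
        public String version() {
            return "0.1.0";
        }

        @Override
        public void start(Map<String, String> props) {
            // Read connector config here (bucket, credentials, batch size, ...).
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            // Buffer records into an in-progress blob; the framework handles
            // offset tracking, retries, and consumer-group rebalances.
            for (SinkRecord record : records) {
                // append record.value() to the current chunk
            }
        }

        @Override
        public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
            // Upload any buffered data before Connect commits these offsets.
        }

        @Override
        public void stop() {
            // Release clients / file handles.
        }
    }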

------
thruhiker
At a prior company we used Secor to back up Kafka topics to S3, and we used
it to replay/reingest data with success. If I were to stand up Kafka
somewhere new, it would definitely be one of my "Go to Production"
requirements.

------
lflux
We've used this at Patreon to get data from our various sources in the
application to Redshift. We've since replaced it with Kinesis Firehose.

~~~
mooreds
Are you still using kafka? Does Kinesis Firehose read from Kafka?

~~~
chrisjc
No. If you want to use Firehose, you need to transfer data from Kafka to
Kinesis first. You can probably use one of many Kafka Connect sinks for
this, e.g. https://github.com/awslabs/kinesis-kafka-connector

At that point, you might as well set up a Kafka Connect cluster with a Kafka
Connect sink and dump to S3 yourself.

~~~
otterley
Why not go whole-hog and use Kinesis Streams and eliminate Kafka altogether?

~~~
chrisjc
Not sure if you're being facetious, but it sounded like the parent's data
was already in Kafka.

But you're right, there's no reason not to start with Kinesis in the first
place, especially with all the AWS options around it, including Kinesis
Analytics (streaming "SQL"), Flink (supported by AWS now), or plain old EMR
(Spark, Flink, etc.).

There is a lot of work that goes into maintaining your own Kafka ecosystem,
even if you use Confluent, Lenses, etc...

------
kureikain
I also used Fluentd for this. Basically, we already had Fluentd, and Fluentd
can output to anything, not just S3.

