
Commanding infinite streaming storage with Apache Kafka and Pyrostore - lbradstreet
http://pyrostore.io/blog/2018/05/10/kafka-potential-past-present.html
======
stingraycharles
I like it. Personally, one of my biggest problems with Kafka is its
operational complexity. I’ve just had one too many instances of Kafka brokers
getting stuck while doing an upgrade and things like that.

Additionally, I would really, really like to be able to use it as an Event
Store, easily accessible by anyone in the org with infinite data retention. I
know Kafka kind-of sort-of provides this functionality, but it doesn’t work in
practice.

This appears to be a solution to this problem. Will be interesting to see
whether it gains traction.

~~~
linkmotif
> I know Kafka kind-of sort-of provides this functionality, but it doesn’t
> work in practice.

How so?

~~~
stingraycharles
It’s difficult to search through, query, or run projections over. Also, the
API assumes you want to stream realtime data rather than query historical
data.

~~~
sidlls
Use the correct tool for the job: hook an analytics DB up to the Kafka pipe
and store the data for future queries. Kafka was never intended to support
your use case.

If the inbound data you'd like to put into Kafka isn't large, by the way,
just write straight to the DB. It's irritating to see Kafka used where it's
not necessary. It adds complexity to an infrastructure, and that cost has to
be justifiable.
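A minimal sketch of the "hook an analytics DB up to the Kafka pipe" idea. The records are simulated in-process and sqlite3 stands in for a real analytics DB; in practice the loop body would sit inside a real consumer (e.g. kafka-python's KafkaConsumer or a Kafka Connect sink), and the table/field names here are made up for illustration.

```python
import json
import sqlite3

# Simulated Kafka records; a real pipeline would poll these from a consumer.
simulated_records = [
    {"offset": 0, "timestamp": 1525939200000, "value": {"user": "a", "event": "login"}},
    {"offset": 1, "timestamp": 1525939201000, "value": {"user": "b", "event": "purchase"}},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           kafka_offset INTEGER PRIMARY KEY,  -- dedupe key on replay
           ts INTEGER,
           payload TEXT
       )"""
)

for rec in simulated_records:
    # INSERT OR IGNORE makes the sink idempotent if the consumer replays a
    # batch after a crash (at-least-once delivery).
    conn.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?)",
        (rec["offset"], rec["timestamp"], json.dumps(rec["value"])),
    )
conn.commit()

# Historical queries now run against the DB, not against Kafka.
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Keying the table on the Kafka offset is what makes replays safe: reprocessing the same batch is a no-op.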

------
tomconnors
Everything Distributed Masonry does is very interesting. Wish I had more
excuses to use your stuff at work.

Storing all data forever in a single source of truth is awesome until
regulation like GDPR comes along. Do you have plans to support excision or is
your guidance on personal data to avoid putting it into a system like
Kafka/Pyrostore?

~~~
insensible
You might enjoy reading Greg Young's
[https://leanpub.com/esversioning](https://leanpub.com/esversioning), which
covers this topic.

It covers several strategies, three of which are:

* Encrypt it and then throw away the key to forget it

* Store private data outside the event with the event just pointing to it

* Delete events (on systems that support this)
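The first strategy ("crypto-shredding") can be sketched as below. The XOR keystream here is NOT real cryptography and is only there to keep the example dependency-free; in practice you'd use a vetted AEAD cipher (e.g. AES-GCM via the `cryptography` library). The key-store layout and names are illustrative assumptions.

```python
import hashlib
import secrets

def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream from the key (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Symmetric: the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Per-subject keys live in a separate key store, never in the event log.
key_store = {"user-42": secrets.token_bytes(32)}

# The event stored in the immutable log is ciphertext.
event = xor_crypt(b'{"user": "user-42", "email": "x@example.com"}',
                  key_store["user-42"])

# Normal reads: look up the subject's key and decrypt.
plaintext = xor_crypt(event, key_store["user-42"])

# GDPR erasure: delete only the key. The log itself is untouched, but the
# ciphertext for this subject is now unrecoverable.
del key_store["user-42"]
```

The appeal is that "forgetting" becomes a single key deletion, with no rewrite of the immutable log.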

------
taherchhabra
Integration with Azure Managed Disks: due to the ingestion-heavy nature, the
disks attached to the nodes in the cluster often end up being the bottleneck.
Traditionally, to get past this bottleneck, more nodes need to be added. Azure
Managed Disks is a technology that provides cheaper, scalable disks at a
fraction of the cost of a node. HDInsight Kafka has integrated with these
disks to provide up to 16 TB/node instead of the traditional 1 TB. This gives
much higher scale while reducing costs correspondingly.

[https://azure.microsoft.com/en-us/services/hdinsight/apache-kafka/](https://azure.microsoft.com/en-us/services/hdinsight/apache-kafka/)

Is this the same approach as Pyrostore?

~~~
lbradstreet
Our approach archives topics to cheap, highly durable and available object
stores, while keeping the data available for blending between warehoused and
live data sets.

This reduces operational complexity significantly vs. scaling nodes up and
dealing with rebalancing, under-replicated partitions, etc.

------
lmsp
This is what Apache Pulsar
([https://pulsar.incubator.apache.org/](https://pulsar.incubator.apache.org/))
already provides: infinite streaming storage, with a simple, flexible
messaging/streaming API, plus Kafka compatibility.

------
chrisjc
Very interesting and reminds me of Pravega
([http://pravega.io/](http://pravega.io/)). Seems like unbounded streams will
be the next big step in streaming technology.

[https://www.youtube.com/watch?v=cMrTRJjwWys](https://www.youtube.com/watch?v=cMrTRJjwWys)

------
mavdi
These are the guys behind www.onyxplatform.org. That alone tells me this is
legit stuff. We will give it a try.

------
dominotw
> tradeoffs in our operation of Kafka have lossy effects on stream-ability.
> Balancing costs and operational feasibility, we ask Kafka to forget older
> data through retention policies.

What does "lossy effects on stream-ability" mean here? Does the stream slow
down, is data lost, or something else?

~~~
lbradstreet
Pyrostore co-founder here. When practitioners archive their data from Kafka to
other storage products (S3, SQL database, etc.) today, they are giving up the
log-ordered structure of the data and the ability to consume it in its
original ordering, with its original offsets and timestamps. Pyrostore
structures and indexes your data in S3 in order to provide a consumer that
implements the Kafka consumer interfaces, ensuring you are always able to
stream from hot and cold storage alike.
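The general idea (archive records *with* their partition/offset/timestamp so a cold-storage reader can present the same log-ordered view a Kafka consumer would) can be sketched as below. This is a hypothetical illustration, not Pyrostore's actual format: a plain dict stands in for the S3 bucket, and the object-key layout is an assumption.

```python
import json

bucket = {}  # object key -> bytes, standing in for an S3 bucket

def archive(topic: str, partition: int, records: list) -> None:
    # One object per (topic, partition) segment, named by its base offset.
    base = records[0]["offset"]
    key = f"{topic}/{partition}/{base:020d}.json"
    bucket[key] = json.dumps(records).encode()

def replay(topic: str, partition: int):
    # Zero-padded base offsets sort lexicographically, so iterating sorted
    # keys yields records in their original log order.
    prefix = f"{topic}/{partition}/"
    for key in sorted(k for k in bucket if k.startswith(prefix)):
        for rec in json.loads(bucket[key]):
            yield rec  # offset and timestamp survive intact

archive("clicks", 0, [{"offset": 0, "timestamp": 1525939200000, "value": "a"},
                      {"offset": 1, "timestamp": 1525939201000, "value": "b"}])
archive("clicks", 0, [{"offset": 2, "timestamp": 1525939202000, "value": "c"}])

offsets = [r["offset"] for r in replay("clicks", 0)]
```

Contrast this with dumping values alone into S3, where the offsets and ordering, and hence resumable, Kafka-style consumption, are lost.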

------
ah-
I wonder if this would ever be integrated into Kafka proper. Shipping out
historical chunks onto infinite storage seems like a generally sensible thing.

This would be even better if it didn't need a modified client.

~~~
rad_gruchalski
I did suggest a potential solution a while ago:
[https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32929d0a57b2](https://medium.com/@rad_g/the-case-for-kafka-cold-storage-32929d0a57b2)
Relevant JIRA ticket:
[https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3726](https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3726)

