At Asana, we've been beta testing Kinesis Firehose. It's been quite convenient not having to manage much and having the data end up in S3. We're also using Kinesis streams, and have a simple KCL app to pull from streams and write to Firehose. We're looking forward to when streams can be as easy to manage, or when our KCL apps can read from Firehose.
Ah... because AWS built these as two separate user-facing services, with completely different APIs and no way to use their common underlying foundation together... go Amazon...
To avoid interference between user applications and theirs? Kinesis has pretty delicate per-shard throttling, and having a user application outside their control working off the same shard would probably make the Firehose side much more fragile.
If you're already doing all the Kinesis shard management/KCL work willy-nilly, why not just dump it to S3 yourself? Firehose seems to be targeting users who don't want to deal with sharding.
Yes, I figured out that Firehose is a higher-level product than Kinesis. Maybe in the future it will be extended with an API to read from a delivery stream.
There are competing products, like Google Cloud Pub/Sub, where there is no need to manage shards manually or run your own workers like the KCL.
$ sudo pip install awscli --upgrade
...
$ aws firehose help

FIREHOSE()                                                          FIREHOSE()

NAME
       firehose -

DESCRIPTION
       Amazon Kinesis Firehose is a fully-managed service that delivers
       real-time streaming data to destinations such as Amazon S3 and Amazon
       Redshift.

AVAILABLE COMMANDS
       o create-delivery-stream
       o delete-delivery-stream
       o describe-delivery-stream
       o help
       o list-delivery-streams
       o put-record
       o put-record-batch
       o update-destination

FIREHOSE()
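The `put-record-batch` command in the help output maps to the PutRecordBatch API, which accepts at most 500 records and roughly 4 MiB of data per call. As an illustrative sketch (the limits are from the Firehose API; the chunking helper itself is hypothetical, not part of the CLI), splitting a list of records to respect those limits might look like:

```python
# Chunk records into PutRecordBatch-sized batches: at most 500 records
# and ~4 MiB of payload per call (Firehose API limits).
MAX_RECORDS = 500
MAX_BYTES = 4 * 1024 * 1024

def chunk_records(records):
    """Yield lists of byte strings that each fit in one PutRecordBatch call."""
    batch, batch_bytes = [], 0
    for rec in records:
        # Start a new batch if adding this record would exceed either limit.
        if batch and (len(batch) >= MAX_RECORDS or batch_bytes + len(rec) > MAX_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += len(rec)
    if batch:
        yield batch

# Example: 1200 small records split into batches of <= 500.
batches = list(chunk_records([b"event"] * 1200))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch could then be passed to something like boto3's `firehose` client via `put_record_batch`, checking `FailedPutCount` in the response for partial failures.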
I have a question: if I were to send, say, 10,000 event objects into an Amazon Kinesis Firehose stream, it's clear that I should expect them to show up in an S3 bucket of my choosing, but should I also expect that my account will not incur any S3 HTTP POST API request fees?
Is dodging those HTTP POST fees the value-add over simply using the S3 HTTP API yourself?
> You will be billed separately for charges associated with Amazon S3 and Amazon Redshift usage including storage and read/write requests. However, you will not be billed for data transfer charges for the data that Amazon Kinesis Firehose loads into Amazon S3 and Amazon Redshift. For further details, see Amazon S3 pricing and Amazon Redshift pricing.
To me this sounds like a way to avoid the hassle of creating a Lambda to buffer streaming data into chunks before HTTP POSTing them to the S3 API.
But to me it wouldn't make much sense to use Kinesis Firehose unless the fees were cheaper than what it would cost to use AWS Lambda for the same work. I mean, it can't be all that many lines of Node.js code to pop events off a stream, flush them into a tempfile in batches, and HTTP POST those batches to S3 with appropriate error handling/retry logic.
Admittedly I haven't crunched the math, but just by eyeballing their pricing I suspect it may be more expensive than using your own Lambda. I wonder if Kinesis Firehose is implemented internally as an AWS Lambda; it wouldn't surprise me.
I suppose that eventually the pricing on this service will drop once someone open-sources such a Lambda, especially since installing a Lambda is so remarkably simple.
The only complication with this approach right now is that Lambda scripts can't communicate with anything inside a VPC (such as most Redshift instances). I imagine they will fix this issue in the future though.
Good question. Here's one benefit. With Kinesis you can batch a bunch of writes to S3 that would have otherwise resulted in many small files in S3.
In other words, you can make small writes to Kinesis and then read out in larger amounts and write larger files to S3. This is a huge optimization for any job that runs across the data in S3. Many small files can really undermine performance in something like Hadoop MapReduce because of the additional request overhead.
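To put rough, purely illustrative numbers on that (the 100 events/second rate is assumed; the 60-second window is within Firehose's documented 60-900 second buffer-interval range):

```python
events_per_sec = 100          # assumed ingest rate, for illustration
seconds_per_day = 86_400

# One S3 object per event, versus one object per 60-second buffer window.
unbuffered_objects = events_per_sec * seconds_per_day
buffered_objects = seconds_per_day // 60

print(unbuffered_objects, buffered_objects)  # 8640000 1440
```

Going from millions of tiny objects to a few thousand larger ones per day is exactly the difference that matters for a Hadoop job listing and opening files in that bucket.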
Analytics event or IoT event ingestion, for example. Then, you can hook up AWS Lambda to the Kinesis stream to actually do any processing/aggregation.
Or, if you're using the new Firehose product, you could also use Lambda and attach it to the S3 event source using the bucket that the Firehose dumps into to perform batch processing on these records.
Kinesis is basically Kafka as a service. IIUC, Kinesis Firehose is just a way to ingest data into S3 buckets/Redshift, with goodies like monitoring, fault tolerance, etc. included.
The closest thing I could compare it to would be services that make it easier to get data into Redshift or S3. E.g. segment.io's redshift product: https://segment.com/redshift.
This looks like an AWS version of Apache Flume and Kinesis looks like an AWS version of Apache Storm. Does AWS have a Kafka equivalent i.e. a pub/sub message queue?
Ah, OK. I saw that it kept state, so I thought it was like Storm. I don't fully understand Storm. However, Kafka cannot keep state (and neither can Flume). So Kinesis is a queue that has some state?
Hmmmm, I certainly hope dang or another admin consolidates all these stories into a single "Amazon" thread. Can't have multiple stories from a single company eating up space on the front page.
Someone complains about this every time there's an AWS announcement day. Each product gets substantially different conversations. It'd be very frustrating to have to skim through dozens of top-level comments to find one on the service you're interested in.