Amazon Kinesis Firehose – Simple and Scalable Data Ingestion (amazon.com)
90 points by hepha1979 on Oct 7, 2015 | 30 comments



It says at the bottom of the post that this service is available now, but I can't see it in the console, at least in eu-west-1 and us-east-1.


It often takes several hours for AWS to roll it out to the various consoles.


At Asana, we've been beta testing Kinesis Firehose. It's been quite convenient not having to manage much, and having the data end up in S3. We're also using Kinesis streams, and have a simple KCL app to pull from streams and write to Firehose. We're looking forward to when streams can be as easy to manage, or when KCL apps can read from Firehose.
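
A minimal sketch of that kind of relay, using plain boto3 polling instead of a full KCL app for brevity (the stream and delivery-stream names are made up, and checkpointing/shard discovery are left out):

    import time
    import boto3

    kinesis = boto3.client('kinesis')
    firehose = boto3.client('firehose')

    # Hypothetical stream names; a real KCL app would also handle
    # checkpointing, shard discovery, and lease management.
    shard_it = kinesis.get_shard_iterator(
        StreamName='events-stream',
        ShardId='shardId-000000000000',
        ShardIteratorType='LATEST')['ShardIterator']

    while True:
        out = kinesis.get_records(ShardIterator=shard_it, Limit=500)
        if out['Records']:
            # put_record_batch accepts up to 500 records per call.
            firehose.put_record_batch(
                DeliveryStreamName='events-to-s3',
                Records=[{'Data': r['Data']} for r in out['Records']])
        shard_it = out['NextShardIterator']
        time.sleep(1)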


I'm confused.. why would you need a KCL app with Firehose (for reading or writing)?


Ah.... because AWS built these as two separate user-facing services, with completely different APIs and no way to use their common underlying foundation together.... go Amazon...


I don't understand why Kinesis Firehose isn't more tightly integrated with Kinesis Streams, with a unified API.

I want to be able both to read at different offsets in the stream AND to back it up to S3 or ingest it into Redshift.

With the current offering I need to duplicate data into two different services with different APIs.


To avoid interference between the user's application and theirs? Kinesis has a pretty delicate per-shard throttle, and having a user application that is out of their control working off the same shard would probably make the Firehose part much more fragile.

If you're already doing all the Kinesis shard management/KCL willy-nilly, why not just dump it to S3 yourself? Firehose seems to be targeting users who don't want to deal with sharding.


Yes, I figured out that Firehose is a higher-level product than Kinesis. Maybe in the future it will be extended with an API to read from a delivery stream.

There are competing products, like Google Cloud Pub/Sub, where there is no need to manage shards manually or run your own workers like the KCL.


    $ sudo pip install awscli --upgrade
    ...
    $ aws firehose help

    FIREHOSE()                                                          FIREHOSE()



    NAME
           firehose -

    DESCRIPTION
           Amazon  Kinesis  Firehose  is  a  fully-managed  service  that delivers
           real-time streaming data to destinations such as Amazon S3  and  Amazon
           Redshift.

    AVAILABLE COMMANDS
           o create-delivery-stream

           o delete-delivery-stream

           o describe-delivery-stream

           o help

           o list-delivery-streams

           o put-record

           o put-record-batch

           o update-destination



    	                                                            FIREHOSE()
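
For what it's worth, the same put-record calls are available from boto3; a quick sketch against a hypothetical delivery stream named clickstream-to-s3:

    import boto3

    firehose = boto3.client('firehose')

    # Single record; Firehose buffers it and delivers it to S3/Redshift later.
    firehose.put_record(
        DeliveryStreamName='clickstream-to-s3',
        Record={'Data': b'{"event": "page_view"}\n'})

    # Or batch up to 500 records per call.
    firehose.put_record_batch(
        DeliveryStreamName='clickstream-to-s3',
        Records=[{'Data': b'{"event": "click"}\n'} for _ in range(10)])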


This is basically Kafka + https://github.com/linkedin/camus which is pretty cool.


I have a question: if I were to send, say, 10,000 event objects into an Amazon Kinesis Firehose stream, it's clear that I should expect them to show up in an S3 bucket of my choosing, but should I also expect that my account will not incur any S3 HTTP POST API request fees?

Is dodging those HTTP POST fees the value-add over simply using the S3 HTTP API yourself?


Unless I am reading it wrong, it sounds like you need to pay for the requests as well. From https://aws.amazon.com/kinesis/firehose/pricing/:

> Storage

> You will be billed separately for charges associated with Amazon S3 and Amazon Redshift usage including storage and read/write requests. However, you will not be billed for data transfer charges for the data that Amazon Kinesis Firehose loads into Amazon S3 and Amazon Redshift. For further details, see Amazon S3 pricing and Amazon Redshift pricing.


To me this sounds like a way to avoid the hassle of creating a Lambda to buffer streaming data into chunks before HTTP POSTing them to the S3 API.

But to me it wouldn't make much sense to use Kinesis Firehose unless the fees were cheaper than what it would cost to utilize AWS Lambda for the same work. I mean, it can't be all that many lines of Node.js code to pop events off a stream, flush them into a tempfile in batches, and HTTP POST those batches to S3 with appropriate error handling/retry logic.

Admittedly I haven't crunched the math, but just by eyeballing their pricing I suspect it may be more expensive than using your own Lambda. I wonder if Kinesis Firehose is implemented internally as an AWS Lambda; it wouldn't surprise me.

I suppose that eventually the pricing on this service will drop once someone open sources such a Lambda, especially since installing a Lambda is so remarkably simple.
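
A rough sketch of such a handler, in Python rather than Node.js for illustration, with the bucket name as a placeholder and error handling/retries omitted:

    import base64
    import uuid
    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'my-events-bucket'  # placeholder

    def handler(event, context):
        # Lambda hands this function a batch of Kinesis records; concatenate
        # them into one newline-delimited object instead of many small files.
        lines = [base64.b64decode(r['kinesis']['data']) for r in event['Records']]
        body = b'\n'.join(lines) + b'\n'
        key = 'batches/%s.json' % uuid.uuid4()
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        return {'records_written': len(lines)}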


The only complication with this approach right now is that Lambda scripts can't communicate with anything inside a VPC (such as most Redshift instances). I imagine they will fix this issue in the future though.


I'm still a bit ignorant as to how Kinesis works; can someone explain why this would be preferable to uploading directly to S3?


Good question. Here's one benefit. With Kinesis you can batch a bunch of writes to S3 that would have otherwise resulted in many small files in S3.

In other words, you can make small writes to Kinesis and then read out in larger amounts and write larger files to S3. This is a huge optimization for any job that runs across the data in S3. Many small files can really undermine performance in something like Hadoop MapReduce because of the additional request overhead.
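
If I understand the docs correctly, the knob that controls this on the Firehose side is the buffering configuration on the S3 destination, roughly like this (ARNs and names are placeholders):

    import boto3

    # Firehose flushes to S3 whenever either threshold is hit, so many small
    # put-record calls end up as a handful of multi-megabyte objects.
    boto3.client('firehose').create_delivery_stream(
        DeliveryStreamName='events-to-s3',
        S3DestinationConfiguration={
            'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
            'BucketARN': 'arn:aws:s3:::my-firehose-bucket',
            'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300},
            'CompressionFormat': 'GZIP',
        })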


What are some typical use cases for Amazon Kinesis streams, on the web?


Analytics event or IoT event ingestion, for example. Then, you can hook up AWS Lambda to the Kinesis stream to actually do any processing/aggregation.

Or, if you're using the new Firehose product, you could also use Lambda and attach it to the S3 event source using the bucket that the Firehose dumps into to perform batch processing on these records.
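
For the second approach, a sketch of a handler wired to the bucket's ObjectCreated notifications, where process() is a stand-in for whatever batch processing you need:

    import boto3

    s3 = boto3.client('s3')

    def process(line):
        pass  # stand-in for your actual per-record logic

    def handler(event, context):
        # Each record points at an object Firehose just delivered to the bucket.
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            for line in body.splitlines():
                process(line)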


How is that different than just saving database records?


Just so I'm understanding, is this a Kafka competitor?


Kinesis is basically Kafka as a service. IIUC, Kinesis Firehose is just a way to ingest data into S3 buckets/Redshift and Firehose would give you the goodies like monitoring/fault tolerance etc.


The closest thing I could compare it to would be services that make it easier to get data into Redshift or S3. E.g. segment.io's redshift product: https://segment.com/redshift.


This looks like an AWS version of Apache Flume and Kinesis looks like an AWS version of Apache Storm. Does AWS have a Kafka equivalent i.e. a pub/sub message queue?


Kinesis != Apache Storm. The easiest way to think about Kinesis is as a managed queue that can remember the history for the past 24 hours.


Ah OK. I saw that it kept state so I thought it was like Storm. I don't fully understand Storm. However, Kafka cannot keep state (and neither can Flume). So Kinesis is a queue that has some state?


Kinesis is a LOT more like Kafka than Storm.


Hmmmm, I certainly hope dang or another admin consolidates all these stories into a single "Amazon" thread. Can't have multiple stories from a single company eating up space on the front page.


Someone complains about this every time there's an AWS announcement day. Each product gets substantially different conversations. It'd be very frustrating to have to skim through dozens of top-level comments to find one on the service you're interested in.


I hope you detected my sarcasm. I'm 100% with you and was upset when they consolidated all the MSFT hardware announcement threads the other day.


Sorry, I didn't... because someone genuinely complains every time there's a big AWS dump. :-p



