
AWS Kinesis with Lambdas: Lessons Learned - omn1
https://tech.trivago.com/2018/07/13/aws-kinesis-with-lambdas-lessons-learned/
======
stlava
The post is good but just scratches the surface on running Kinesis Streams /
Lambda at scale. Here are a few additional things I found while running
Kinesis as a data ingestion pipeline:

\- Only write out logs that matter. Searching logs in CloudWatch is already a
major PITA. Half the time I just scan the logs manually because search never
returns. Also, the fewer println statements you have, the quicker your
function will be.

\- Lambda is cheap; reporting function metrics to CloudWatch from a Lambda is
not. Be very careful about using this.

\- Having metrics from within your Lambda is very helpful. We keep track of
spout lag (delta between when an event got to Kinesis and when it was read by
the Lambda), source lag (delta between when the event was emitted and when it
was read by the Lambda), and the number of events processed (were any dropped
due to validation errors?).
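Those two lag metrics are cheap to compute inside the handler. A minimal sketch, assuming the producer stamps each payload with an `emitted_at` epoch timestamp (that field name is my invention); the record shape mirrors what Lambda hands a Kinesis-triggered function:

```python
import base64
import json
import time

def lag_metrics(record, now=None):
    """Compute spout and source lag (seconds) for one Kinesis record.

    Spout lag: time between the record arriving in Kinesis and us reading it.
    Source lag: time between the event being emitted and us reading it
    (assumes the producer put an 'emitted_at' epoch timestamp in the payload).
    """
    now = now or time.time()
    arrived = record["kinesis"]["approximateArrivalTimestamp"]
    payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
    return {
        "spout_lag": now - arrived,
        "source_lag": now - payload["emitted_at"],
    }

# Example with a hand-built record shaped like a Lambda Kinesis event entry.
record = {
    "kinesis": {
        "approximateArrivalTimestamp": 1_531_000_100.0,
        "data": base64.b64encode(
            json.dumps({"emitted_at": 1_531_000_090.0}).encode()
        ).decode(),
    }
}
print(lag_metrics(record, now=1_531_000_105.0))
# {'spout_lag': 5.0, 'source_lag': 15.0}
```

Shipping these numbers to your own metrics backend (rather than CloudWatch custom metrics) is how you keep the cost point above under control.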

\- Avoid using the Kinesis autoscaler tool. In theory it's a great idea, but
in practice we found that scaling a stream with 60+ shards causes issues with
API limits. (Maybe this is fixed now...)

\- Have plenty of disk space on whatever is emitting logs. You don't want to
run into a scenario where you can't push logs to Kinesis (e.g. throttling) and
they start filling up your disks.

\- Keep in mind that you have to balance your emitters, Lambdas, and your
downstream targets. You don't want too few or too many shards, and you don't
want 100 Lambda instances hitting a service with 10 events each invocation.
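A back-of-the-envelope sketch of that balance, assuming the one-concurrent-invocation-per-shard model Kinesis triggers used at the time and a roughly one-second poll interval (both assumptions, not measured numbers):

```python
def fanout_estimate(shards, batch_size, events_per_sec):
    """Back-of-the-envelope view of emitter/shard/Lambda balance.

    With Kinesis triggers, Lambda runs at most one concurrent invocation
    per shard, so shard count caps concurrency. Events per invocation is
    roughly the per-shard event rate times the poll interval, capped by
    the configured batch size (a 1s poll interval is assumed here).
    """
    per_shard_rate = events_per_sec / shards
    events_per_invocation = min(batch_size, max(1, round(per_shard_rate)))
    return {
        "max_concurrency": shards,
        "events_per_invocation": events_per_invocation,
    }

# 100 shards at 1,000 events/s: 100 concurrent Lambdas seeing ~10 events
# each -- exactly the fan-out pattern to avoid for a small downstream.
print(fanout_estimate(shards=100, batch_size=500, events_per_sec=1000))
# {'max_concurrency': 100, 'events_per_invocation': 10}
```

The design lever is that fewer shards mean fewer, fatter invocations downstream, at the cost of per-shard throughput headroom.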

\- Lambda deployment tools are still young but find one that works for you.
All of them have tradeoffs in how they are configured and how they deploy.

There are some good tidbits in the Q&A section from my re:Invent talk [1].
Also, for anyone wanting to use Lambda but not wanting to reinvent the wheel,
check out Bender [2]. Note: I'm the author.

[1]
[https://www.youtube.com/watch?v=AaRawf9vcZ4](https://www.youtube.com/watch?v=AaRawf9vcZ4)
[2] [https://github.com/Nextdoor/bender](https://github.com/Nextdoor/bender)

edit: formatting

~~~
bogaczio
I've heard that they'll be releasing a cheap Splunk-like service over
CloudWatch for log searching, which will hopefully alleviate the issue.

~~~
nikhilkuria
There was this one time when we had trouble finding some errors on Cloudwatch
logs. This one library was helpful,
[https://github.com/jorgebastida/awslogs](https://github.com/jorgebastida/awslogs)

------
dbllxr
> For us, increasing the memory for a Lambda from 128 megabytes to 2.5
> gigabytes gave us a huge boost.

> The number of Lambda invocations shot up almost 40x.

One thing I've learned from talking to AWS support is that increasing memory
also gets you more vCPUs per container.
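A rough sketch of why that matters, using the ratio AWS has since documented (about 1,792 MB of memory corresponds to one full vCPU, with allocation proportional below and above that point):

```python
def approx_vcpus(memory_mb, mb_per_vcpu=1792):
    """Rough CPU share for a Lambda: allocation is proportional to memory,
    with ~1,792 MB corresponding to one full vCPU per AWS's documentation."""
    return memory_mb / mb_per_vcpu

print(round(approx_vcpus(128), 2))   # ~0.07 of a vCPU
print(round(approx_vcpus(2560), 2))  # ~1.43 vCPUs
```

That ~20x difference in CPU share is a plausible explanation for the "huge boost" the article saw going from 128 MB to 2.5 GB.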

\-----

Serverless is great in scaling and handling bursts, but you may find it VERY
difficult in terms of testing and debugging.

A while back I started using an open source tool called LocalStack [1] to
mirror some AWS services locally. Despite some small discrepancies in certain
APIs (which are to be expected), it's made testing a lot easier for me.
Something worth looking into if testing serverless code is causing you
headaches.

[1][https://github.com/localstack/localstack](https://github.com/localstack/localstack)

~~~
nikhilkuria
Thanks! For testing on our local machines, we use SAM Local:
[https://aws.amazon.com/blogs/aws/new-aws-sam-local-beta-build-and-test-serverless-applications-locally/](https://aws.amazon.com/blogs/aws/new-aws-sam-local-beta-build-and-test-serverless-applications-locally/).
This has very similar capabilities to LocalStack. Still, there is always a
delay between a service being released by AWS and it becoming available in
tools like the Serverless Framework or SAM Local.

------
cagenut
The article doesn't appear to get into the end-destination of this data, it
just says "to AWS".

My initial thought is: restore last night's backup to another MySQL instance
on AWS and then let it catch up on the binlog?

But I guess the unstated assumption is that their goal is to also transform to
some other datastore.

~~~
toomuchtodo
You could also have replicas in AWS replicating off your on-prem DBs in real
time. Is Kafka offering something as the message bus/data transfer mechanism
that replication and ETL off the AWS replica can’t?

~~~
xenji_fm
Disclaimer: I work for trivago and I am partially responsible for the
architecture behind trivago's Kafka use.

I'd like to invite you to watch this talk and get a few more insights why we
use Kafka in the ways we do:
[https://www.youtube.com/watch?v=cU0BCVl4bjo](https://www.youtube.com/watch?v=cU0BCVl4bjo)

Let me try to come up with a TL;DR here: trivago comes from a completely on-
premises, central-database point of view. Change Data Capture via Debezium
into Kafka enables a lot of migration strategies in different directions (e.g.
cloud) in the first place, without the need to change everything on the spot.
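For a sense of what CDC actually puts on the wire, here is a sketch that diffs a simplified Debezium change event (real messages also carry a schema section and much richer source metadata; the row values here are made up):

```python
import json

# Simplified shape of a Debezium change event; real messages also include
# a schema section and source metadata (binlog position, table, etc.).
change_event = json.dumps({
    "payload": {
        "op": "u",                                  # c=create, u=update, d=delete
        "before": {"id": 42, "city": "Berlin"},
        "after":  {"id": 42, "city": "Dusseldorf"},
        "ts_ms": 1531000000000,
    }
})

def changed_columns(raw):
    """Return which columns an update touched -- the kind of per-row diff
    CDC gives you that a nightly dump-and-restore cannot."""
    p = json.loads(raw)["payload"]
    if p["op"] != "u":
        return {}
    before, after = p["before"], p["after"]
    return {k: (before.get(k), v) for k, v in after.items() if before.get(k) != v}

print(changed_columns(change_event))
# {'city': ('Berlin', 'Dusseldorf')}
```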

It seems to be a common pattern to compare Kafka with a pure MQ technology.
Kafka can also serve as persistent data storage and a source of truth for
data.

I hope this makes the picture a bit clearer. Feel free to ask if I missed
something.

~~~
wenc
Thanks for the heads up about Debezium. That is very cool.

Debezium seems to be a production version of Martin Kleppmann's CDC-to-Kafka
POC, Bottled Water [1].

Database replication is the killer app for CDC, but CDC can be used for so
much more than replication, like event-based alerting, triggering, etc.

[1] [https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/](https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/)

~~~
gunnarmorling
(Disclaimer: Debezium lead here)

While the basic idea of using PG logical decoding for CDC is the same,
Debezium is a completely different code base than Bottled Water. Also, we
provide connectors for a variety of databases (MySQL, Postgres, MongoDB;
Oracle and MongoDB connectors are in the workings). If you like, you can also
use Debezium independently of Kafka by embedding it as a library in your own
application, e.g. if you don't need to persist change events or want to
connect it to streaming solutions other than Kafka.

In terms of CDC use cases, I keep seeing more the longer I work on it. Besides
replication there are, e.g., updates of full-text search indexes and caches,
propagating data between microservices, facilitating the extraction of
microservices from monoliths (by streaming changes from writes to the old
monolith to new microservices), maintaining read models in CQRS
architectures, live-updating UIs (by streaming data changes to WebSocket
clients), etc. I touch on a few in my Debezium talk
([https://speakerdeck.com/gunnarmorling/data-streaming-for-microservices-using-debezium?slide=5](https://speakerdeck.com/gunnarmorling/data-streaming-for-microservices-using-debezium?slide=5)).

~~~
wenc
Thanks for the explanation.

Any plans to support SQL Server? (SQL Server is prevalent in the enterprise
world)

~~~
gunnarmorling
Yes, we're working on it right now (meant to say "Oracle and SQL Server are in
the workings" above).

------
otterley
I'm curious about the economics of this design. Real-time stream consumption
implies that the consumer is always running, and if you need to run software
24x7, running it on EC2 instances is likely to be far cheaper than running
Lambda functions continuously.

~~~
teej
The lambdas are only triggered when there is data in the stream, so the
"consumer" in this case is not always running. Under the hood there is a
process that polls Kinesis to check for records but it's completely managed by
AWS and you aren't charged for it.
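The "you never write the poller" point is visible in the handler shape itself: AWS invokes the function with a batch of records, and all you do is decode and process them. A minimal sketch with a made-up validation rule:

```python
import base64
import json

def handler(event, context=None):
    """Minimal sketch of a Kinesis-triggered Lambda: decode each record,
    drop anything that fails validation, and report counts. Kinesis data
    arrives base64-encoded inside event["Records"]."""
    processed, dropped = 0, 0
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        try:
            payload = json.loads(raw)
            if "id" not in payload:           # toy validation rule
                raise ValueError("missing id")
        except ValueError:
            dropped += 1
            continue
        processed += 1                        # real work would go here
    return {"processed": processed, "dropped": dropped}

# Hand-built event shaped like a real Kinesis trigger payload.
def rec(obj):
    return {"kinesis": {"data": base64.b64encode(json.dumps(obj).encode()).decode()}}

print(handler({"Records": [rec({"id": 1}), rec({"id": 2}), rec({"nope": 3})]}))
# {'processed': 2, 'dropped': 1}
```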

Plus, in any real-world scenario, if you're running 24x7 your load isn't
evenly distributed throughout the day. That means you're either setting up
autoscaling (with added time and complexity) or provisioning for peak, wiping
out your cost savings.

In my experience if your data volumes are low enough like in the article, a
Kinesis+Lambda setup is stress free and quick to implement. That makes it
worth the cost over raw instances.

~~~
djhworld
It also depends on what your business problem is.

If it's just doing simple ETL then it's probably OK, but if you need to do
aggregations on the data, you're going to have a bad time, or end up
implementing some sort of ersatz map-reduce framework in lambda.
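A tiny sketch of why: each invocation aggregates only its own batch, so any aggregate that spans batches has to be merged in an external store (records are simplified to plain dicts here, and the flush step is hypothetical):

```python
from collections import Counter

def handler(event, context=None):
    """Aggregating inside a Lambda only works within one batch: each
    invocation starts from scratch, so partial counts have to be merged
    somewhere external (DynamoDB, Redis, ...) -- the 'ersatz map-reduce'
    you end up building."""
    partial = Counter(r["page"] for r in event["Records"])
    # flush_to_external_store(partial)  # hypothetical merge step
    return dict(partial)

batch1 = {"Records": [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]}
batch2 = {"Records": [{"page": "/home"}]}
# Two invocations, two independent partial aggregates:
print(handler(batch1))  # {'/home': 2, '/about': 1}
print(handler(batch2))  # {'/home': 1}
```

Frameworks like Flink keep that windowed state for you, which is why they are a better fit once aggregation enters the picture.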

------
djhworld
We used to run an architecture similar to this a few years ago. I work for a
broadcaster, and unfortunately it failed badly during a big event.

The Kinesis stream was adequately scaled, but the poller between Kinesis and
Lambda just couldn't cope. This was discovered after lots of support calls
with AWS.

It might be better these days, I don't know. We moved to using Apache Flink +
Apache Beam, which have a lot more features and allow us to do things like
grouping by a window, aggregation, etc.

------
sixdimensional
I'm not being negative about this; it's a cool setup.

But just to be clear, the pattern (minus the marketing terms) is change data
capture (essentially tailing the database transaction log) into a message
queue, with message/job processors that can take any action, including writing
the messages to other databases.

Kinesis is SQLStream underneath, which is probably why the lifetime of
messages is limited - it's not originally intended to be Kafka or a durable
message queue.

EDIT: Note above: when SQLStream first came out, it didn't seem intended as a
long-term store. That was really early on, when I saw it at Strata. It looks
like they made the storage engine pluggable, and Kafka is an option too, so my
statement above is likely incorrect.

Lambda is being used as a distributed message/job processor, much like any
worker process processing a queue would be scaled up.

~~~
otterley
> Kinesis is SQLStream underneath

That's the first I've heard of this. Any citations you can provide to
substantiate this claim?

My understanding is that SQLStream is an event streaming processor, which
would make it a potential Kafka consumer, not a basis for a durable message
queue.

~~~
sixdimensional
Amazon Web Services Has Licensed SQLstream Technology for Amazon ...
[https://sqlstream.com/amazon-web-services-has-licensed-sqlstream-technology-for-amazon-kinesis-analytics-service-launched-today/](https://sqlstream.com/amazon-web-services-has-licensed-sqlstream-technology-for-amazon-kinesis-analytics-service-launched-today/)

~~~
pixelmonkey
I don't have this on authority, but I believe Kinesis Streams is an AWS-
maintained fork of Kafka around 0.7. Firehose is a wrapper around Lambda +
Kinesis Streams. And Kinesis Analytics is a SQLStream wrapper with Kinesis
Streams as its input. That's my best understanding, anyway.

~~~
coredog64
I heard a rumor that Amazon wanted to offer Kafka as a managed service,
couldn't get it to work the way they wanted, and then said "f--k this" and
wrote something that is functionally similar.

~~~
pixelmonkey
That's wholly possible. They might have just copied the "rough design" of
topics/consumers and queue-as-log, but clean-house implemented it. I wouldn't
be surprised. Kafka was a pretty minimal codebase at that stage.

------
dmlittle
> For us, increasing the memory for a Lambda from 128 megabytes to 2.5
> gigabytes gave us a huge boost.

I thought the maximum memory limit for Lambda was 3008 MB and that you
couldn't bump this limit through a service request.

Does anyone know if you can request a bump to the memory limit or the
uncompressed deployment package limit (250 MB)?

~~~
dmlittle
And here I am realizing the stupidity of my question... (3008 MB is close to
3GB)

------
mpd
With so much overlap in the functionality and use cases of Kafka and Kinesis,
it's not clear why they increase their surface area by using both.

Is Kinesis' write latency better than it was? IIRC it wrote to 3 data centers
synchronously, which led to some pretty bad performance. This was almost 2
years ago though.

~~~
nikhilkuria
Exactly. The biggest motivation to use Kinesis was the other services we could
plug in from the AWS ecosystem. Lambdas have Kinesis triggers available
natively. We did not have any concerns about the latency of Kinesis. Now I'm
curious, do you have any literature on the bad performance history?

~~~
mpd
Just past experience. In the end, the low data retention time was the real
reason Kinesis didn't fit our needs, but the difference in write performance
between the two was pretty stark. Read performance was fine for what we
needed.

------
sheeshkebab
Does copying a billion records really scream for streaming? Do you mean a
couple of hundred gigs that could be copied overnight (and maybe transformed
with some script)?

~~~
nikhilkuria
Rather than doing a one-time copy of the data, we have applications
continually writing to our on-premises MySQL. This data has to reactively
reach our datastore on AWS. Also, this is not a 1:1 copy; there are several
transformations and cross-validations in play.

------
abledon
Anyone else reading this post with the trivago repetitive jingle on loop in
the back of their mind? Din dan din dan dan —

