- Only write out logs that matter. Searching logs in CloudWatch is already a major PITA; half the time I just scan the logs manually because search never returns. Also, the fewer print statements you have, the quicker your function will be.
- Lambda is cheap; reporting function metrics to CloudWatch from a Lambda is not. Be very careful about doing this.
- Having metrics from within your lambda is very helpful. We keep track of spout lag (the delta between when an event got to Kinesis and when it was read by the lambda), source lag (the delta between when the event was emitted and when it was read by the lambda), and the number of events processed (were any dropped due to validation errors?). See the sketch further down for what this looks like.
- Avoid using the Kinesis auto-scaler tool. In theory it's a great idea, but in practice we found that scaling a stream with 60+ shards causes issues with API limits. (Maybe this is fixed now...)
- Have plenty of disk space on whatever is emitting logs. You don't want to run into the scenario where you can't push logs to Kinesis (e.g. due to throttling) and they start filling up your disks.
- Keep in mind that you have to balance your emitters, your lambda, and your downstream targets. You don't want too few or too many shards, and you don't want 100 lambda instances hitting a service with 10 events each per invocation.
- Lambda deployment tools are still young, but find one that works for you. All of them have trade-offs in how they are configured and how they deploy.
There are some good tidbits in the Q&A section from my re:Invent talk. Also, for anyone wanting to use Lambda but not wanting to re-invent, check out Bender. Note: I'm the author.
* e.g. "ETLLambda METRIC RECORDS_IN 948575"
> The number of Lambda invocations shot up almost 40x.
One thing I've learned from talking to AWS support is that increasing memory also gets you more vCPUs per container.
Serverless is great at scaling and handling bursts, but you may find it VERY difficult in terms of testing and debugging.
A while back I started using an open-source tool called localstack to mirror some AWS services locally. Despite some small discrepancies in certain APIs (which are to be expected), it's made testing a lot easier for me. Worth looking into if testing serverless code is causing you headaches.
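For example, pointing boto3 at localstack is just a matter of overriding the endpoint. Something like this sketch works for me; the edge port (4566) and the dummy credentials are assumptions that depend on your localstack version and config:

```python
import boto3

# Talk to localstack instead of real AWS by overriding the endpoint URL.
kinesis = boto3.client(
    "kinesis",
    endpoint_url="http://localhost:4566",  # localstack edge port (assumed)
    region_name="us-east-1",
    aws_access_key_id="test",              # dummy credentials
    aws_secret_access_key="test",
)

kinesis.create_stream(StreamName="test-stream", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="test-stream")
kinesis.put_record(StreamName="test-stream", Data=b"hello", PartitionKey="1")
```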
My initial thought is: restore last night's backup to another MySQL instance on AWS and then let it catch up on the binlog?
But I guess the unstated assumption is that their goal is also to transform the data into some other datastore.
For example, at MindMup, we're using a similar setup to the one described in the article for appending events to user files. It's critical that we don't get two updates going over each other in a single file, which is tricky with Lambda alone because there's no guarantee how many Lambdas will kick off if updates for the same file come in concurrently from different users. With Kinesis, we just use the file ID as the sharding key, so no more than a single Lambda ever works on a single user file, but we can still have multiple Lambdas working in parallel on different files.
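The producer side of that is roughly the sketch below (the stream name and payload shape are made up; the key point is using the file ID as the PartitionKey, so every event for a given file lands on the same shard and is processed serially):

```python
import json
import boto3

kinesis = boto3.client("kinesis")


def append_event(file_id: str, event: dict) -> None:
    # Same file ID -> same shard -> at most one Lambda working on that file
    # at a time, while different files can be processed in parallel.
    kinesis.put_record(
        StreamName="file-events",            # assumed stream name
        Data=json.dumps(event).encode(),
        PartitionKey=file_id,
    )
```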
How are you appending to files from Lambda? EFS?
> With Kinesis, we just use the file ID as the sharding key, so no more than a single lambda ever works on a single user
Is there a risk of "heavy" users causing hot shards?
I'd like to invite you to watch this talk to get a few more insights into why we use Kafka the way we do: https://www.youtube.com/watch?v=cU0BCVl4bjo
Let me try to come up with a TL;DR here:
trivago comes from a completely on-premise, central-database point of view. Change Data Capture via Debezium into Kafka enables a lot of migration strategies in different directions (e.g. to the cloud) in the first place, without the need to change everything on the spot.
It seems to be a common pattern to compare Kafka with a pure MQ technology. Kafka can also serve as persistent data storage and a source of truth for data.
I hope this makes the picture a bit more clear to you. Feel free to ask if I missed something.
Debezium seems to be a production version of Martin Kleppmann's CDC-to-Kafka POC, Bottled Water.
Database replication is the killer app for CDC, but CDC can be used for so much more than replication, like event-based alerting, triggering, etc.
While the basic idea of using PG logical decoding for CDC is the same, Debezium is a completely different code base than Bottled Water. We also provide connectors for a variety of databases (MySQL, Postgres, MongoDB; an Oracle connector is in the works). If you like, you can also use Debezium independently of Kafka by embedding it as a library in your own application, e.g. if you don't need to persist change events or want to connect it to streaming solutions other than Kafka.
In terms of CDC use cases, I keep seeing more and more the longer I work on it. Besides replication there are e.g. updates of full-text search indexes and caches, propagating data between microservices, facilitating the extraction of microservices from monoliths (by streaming changes from writes to the old monolith to the new microservices), maintaining read models in CQRS architectures, live-updating UIs (by streaming data changes to WebSocket clients), etc. I touch on a few in my Debezium talk (https://speakerdeck.com/gunnarmorling/data-streaming-for-mic...).
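For anyone curious what the setup looks like in practice, here's a rough sketch of registering a Debezium MySQL connector through Kafka Connect's REST API. The host names, credentials, and topic names are placeholders; check the Debezium docs for the full and current set of options.

```python
import requests

# Minimal connector registration against a Kafka Connect worker (placeholder hosts).
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",  # becomes the Kafka topic prefix
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
```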
Any plans to support SQL Server? (SQL Server is prevalent in the enterprise world)
Plus, in any real-world scenario, if you're running 24x7 your load isn't evenly distributed throughout the day. That means you're either setting up autoscaling (with added time and complexity) or provisioning for peak, which wipes out your cost savings.
In my experience, if your data volumes are low enough, as in the article, a Kinesis+Lambda setup is stress-free and quick to implement. That makes it worth the cost over raw instances.
If it's just doing simple ETL then it's probably OK, but if you need to do aggregations on the data, you're going to have a bad time, or end up implementing some sort of ersatz map-reduce framework in lambda.
The Kinesis stream was adequately scaled, but the poller between Kinesis and Lambda just couldn't cope. This was discovered after lots of support calls with AWS.
It might be better these days, I don't know; we moved to using Apache Flink + Apache Beam, which has a lot more features and allows us to do things like grouping by a window, aggregation, etc.
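To give a sense of what I mean, here's the kind of windowed aggregation that's trivial to express in Beam but awkward in per-invocation Lambda. This is just a Python SDK sketch with a stand-in in-memory source (we actually run on Flink with a real streaming source):

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("user-1", 1), ("user-2", 1), ("user-1", 1)])  # stand-in source
        | beam.WindowInto(FixedWindows(60))   # 60-second tumbling windows
        | beam.CombinePerKey(sum)             # events per key per window
        | beam.Map(print)
    )
```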
But just to be clear, the pattern (not the marketing terms) is doing change data capture (essentially reading the database transaction log) into a message queue, with message/job processors that can take any action, including writing the messages to other databases.
Kinesis is SQLStream underneath, which is probably why the lifetime of messages is limited - it's not originally intended to be Kafka or a durable message queue.
EDIT: Note above, when SQLStream first came out it didn't seem intended as a long-term store. That was really early on, when I saw it at Strata. It looks like they've made the storage engine pluggable and Kafka is an option too, so my statement above is likely incorrect.
Lambda is being used as a distributed message/job processor, much like any worker process processing a queue would be scaled up.
That's the first I've heard of this. Any citations you can provide to substantiate this claim?
My understanding is that SQLStream is an event streaming processor, which would make it a potential Kafka consumer, not a basis for a durable message queue.
* Kinesis Streams (the product being discussed in the post)
* Kinesis Firehose
* Kinesis Analytics
Most people, when they discuss Kinesis, mean Kinesis Streams specifically. The other products are Streams consumers.
Is this just a clarification of the architecture, i.e. how Lambda fits into pre-existing patterns, or are you suggesting that since Lambda is being used as a distributed job processor, there would be a better tool for that specific task?
I thought the maximum memory limit for Lambda was 3008 MB and that you couldn't bump this limit through a service request.
Does anyone know if you can request a bump to the memory limit or to the uncompressed deployment package limit (250 MB)?
Is Kinesis' write latency better than it was? IIRC it wrote to 3 data centers synchronously, which led to some pretty bad performance. This was almost 2 years ago though.