How we built a serverless architecture with AWS (hypertrack.com)
114 points by kdeorah on July 11, 2019 | 67 comments

Well, you folks now made your business super coupled with AWS. I just have 1 word: Oracle

It's pretty obvious that someone at Amazon is watching and voting on this thread because this is absolutely true. This kind of thing is vendor lock-in to the nth degree.

Funny, right. I worked for at least 2 companies that at some point in time put a lot of money into Oracle. One of them is a leading gaming network, a lot of billions in revenue.

Teams struggled with migration off it. It was a multi-year, multi-million project and there is no end to it. And newcomers were saying -> oh, that was a silly idea to use all this stuff (why didn't they use Dynamo :) ), however, 15 years ago it was pretty ok + Oracle solution architects were all over the company.

I don't see how Amazon's strategy is different. And I don't get how folks who say Oracle lock-in was bad but Amazon is ok can justify such thinking.

I will put my money on it, in 10 years those will be good examples of how not to do things. Like, for example, when AWS leadership changes. And internet will be: who would've seen it coming.

Same reason that people think that Chrome lock-in is okay and IE lock-in was bad; the new product is shiny and has good features. At the moment.

Well, and you gotta justify that 1 million $ investment (R&D + Costs) into your AWS architecture somehow.

Lots of folks from Amazon participate in Hacker News, like Tim Bray, Colm MacCarthaigh, and Jeff Barr. Jeff Barr often comments in threads about announcements he's written. One of Tim's blog posts was recently on the front page [1]. See: timbray, colmmacc, jeffbarr

I doubt there's any kind of voting cabal, but if folks are participating then they're probably voting according to their inclinations. (I don't vote too much myself, either on comments or articles.)

Any time you invent a new technology with a unique interface, then software built using that technology is coupled to it to some extent. It's actually fairly rare for software components to be so completely interchangeable that you can swap out implementations without changing the software that uses it.

[1] https://news.ycombinator.com/item?id=20210850

At the most basic level, it wouldn't be a tough migration to any other FAAS. Yes it would be work, but I can't think of any other infra migrations that would be less effort.

But also you don't need to think of lambda code as code that can _only_ necessarily run on AWS Lambda.

We organize related lambdas (that would traditionally constitute an 'application') as a gradle multiproject, one module per lambda, with a common module for shared code, like DAOs. The CI creates and uploads an individual jar per Lambda, but updates them all every release.

We then have an extra module that pulls all of those together onto a web API and can be run as a container independently of any FaaS. At that point the fact you're deploying to Lambda is basically irrelevant to your code-base, it looks and feels like any other 'application' and is probably even a little more organized.
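A minimal sketch of that layout, in Python rather than the Gradle/Java setup described above. The handler names, the in-memory `UserDao`, the route table and the `dispatch` helper are all hypothetical; the point is only that each handler is a plain function that a separate module can call without Lambda.

```python
import json


# Shared code, analogous to the "common" module (hypothetical in-memory DAO).
class UserDao:
    def __init__(self):
        self._users = {"1": {"id": "1", "name": "Ada"}}

    def get(self, user_id):
        return self._users.get(user_id)


dao = UserDao()


# One handler per function, each with the usual (event, context) shape.
def get_user_handler(event, context):
    user = dao.get(event["pathParameters"]["id"])
    if user is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(user)}


def health_handler(event, context):
    return {"statusCode": 200, "body": json.dumps({"ok": True})}


# The "extra module": a route table that a web framework or a container
# entrypoint could call, with no dependency on the Lambda runtime.
ROUTES = {
    ("GET", "/health"): health_handler,
    ("GET", "/users/{id}"): get_user_handler,
}


def dispatch(method, route, path_params=None):
    handler = ROUTES.get((method, route))
    if handler is None:
        return {"statusCode": 404, "body": json.dumps({"error": "no route"})}
    event = {"httpMethod": method, "pathParameters": path_params or {}}
    return handler(event, None)
```

Deployed to Lambda, each handler would be uploaded on its own; run as a container, only `dispatch` needs wiring to an HTTP server.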

Usage of AWS services is a conscious decision, absolutely. However, the product architecture that uses these AWS services is subject to careful review for design and functional components integrity. Any of these components must be replaceable as issues/bottlenecks are identified. For example, if AppSync is proven to have issues as the company scales further, AppSync can be replaced with self-hosted GraphQL clusters. Additionally, other components in the architecture can be similarly replaced.

Another way to phrase the question: How reproducible are architectures on AWS?

It's the reproducible build problem again, but at the architecture/infrastructure level

It occurs to me that you can stick a single colon into the title after “architecture” and you pretty much get the summary of the article.

I like the idea of serverless architectures, but I still wouldn't use it for anything that is important.

- Using a serverless architecture almost always implied getting married to your provider. You can run your code in only one place. You have given up all bargain power. When the relationship ends you have to build your system over again.

- It isn't really serverless; they're just not your servers.

- They are only efficient for the workloads the architecture is designed for. Stray outside the parameters and things start to become expensive, slow or both.

- If you use serverless architectures you have to make damn sure the people who built it stick around, because the only value you are left with if your provider folds or increases prices on you is inside the heads of the people who built the solution.

I have already seen friends getting burnt by this. Typically people build a prototype or a technology demo, it gets funding, the CTO insists that it isn't important to do anything about the serverless bits and just go with them (there is no pause to make good on "we'll fix it later" once money gets involved), then get jerked around by the service provider because they can't provide the support needed. Then they slam head first into the costs of actual production traffic which, somehow, even though it requires only basic arithmetic skills, none of them had been able to calculate before the huge bills started rolling in.

> Using a serverless architecture almost always implied getting married to your provider

I don’t disagree but I’ve made a web app with aws serverless. Frontend on s3, Backend Python flask on serverless and MySQL server (haven’t tried RDS serverless yet). Works fine, had a compiled library that did not work but all standard stuff. No marriage. :)

Then strictly speaking it isn’t serverless.

"- Using a serverless architecture almost always implied getting married to your provider. You can run your code in only one place. You have given up all bargain power. When the relationship ends you have to build your system over again."

I think you can use a format that is mostly provider-independent. Any movement will change just the integration. Also, this kind of applies for anything large. You get married to the API anyway and movement will always be painful to some degree.

"- It isn't really serverless; they're just not your servers."

You mean they are not on prem? 'Cuz they can be. You mean that the name is bad? It really isn't as bad as people seem to imply. When done right, you don't worry about the servers. You mean that you have no control over the execution environment? Well, if Spectre and Meltdown have taught us anything, it is that you really lose control at some point anyway.

I don't understand what you really gain with this setup. I mean, this extreme vendor lock-in situation is so short term. The absolute wrong strategy if you ask me. I would be curious to see this company in 5 years from now.

Let me ask you this question differently: let's say you are exploring a new market opportunity for an exciting product you want to build. Would you rather spend cycles building what AWS has done already and thus delay speed to market, or would you rather use what has already been done by AWS managed services and build what nobody has done before? Surely, this architecture will evolve over time, but it will only evolve as the startup quickly discovers what the market needs.

Oh boy, I bet that's costing a new car each month.

I used to hate aws for how expensive their bandwidth and storage was, until I started actually using it last year. I think their new serverless stack is about to leave a lot of devops out of a job.

You can set up a CI/CD pipeline in about half an hour with Amplify; at the previous company I remember it taking a good 3 weeks to get CircleCI up and running properly.

And then moving a microservice over to it is basically 1 command, a few options, mostly just copy over the config from your old express backend with a few changes, and you're done. It's insane.

One other dev I showed the Lighthouse scores of the React stack I deployed on it even said "this should be illegal". And they're right, it's pretty much automated devops, the whole app now loads in 300ms. If you have server side rendering in your app the static content will automatically be cached on their CDNs.

And if you want to save a bit of money you can just use Google Firebase for your authentication and db. GraphQL is surprisingly a breeze too as a middle layer if you want to leave your Java or .NET backend APIs untouched.

At the end of the day, nodejs is completely insecure by design, your infrastructure will never be as secure as running it on gcp or aws. That's why you go serverless and stop messing with security and front end scalability.

If they solve the cold-start issue of databases on aurora they will completely dominate the market even more than they already have.

>You can set up a CI/CD pipeline in about half an hour with Amplify; at the previous company I remember it taking a good 3 weeks to get CircleCI up and running properly.

>And then moving a microservice over to it is basically 1 command, a few options, mostly just copy over the config from your old express backend with a few changes, and you're done. It's insane.

As an engineer at a decent sized tech company, this sounds pretty normal, because our infrastructure teams have been providing it (and much more) to service authors via internal APIs/web UIs for much longer than "serverless" has been a buzzword.

Except now you don't need an infrastructure team. That's the whole point of serverless architecture: to be able to scale without having a huge team scale as well.

You haven’t needed an infrastructure team since PHP shared hosting, and certainly not since Heroku or Elastic Beanstalk, except that people kept wanting greater complexity at lower cost. There is nothing new about “serverless” there.

There is a difference between what was then and what is today. The key difference is that the "serverless" term is massively overloaded here and once you dissect it you will see that it's a mix of multiple serverlessly managed services that we are able to take advantage of: Kinesis, DynamoDB streams, Kinesis Firehose, SNS, Lambda, Cloudwatch, and GraphQL/AppSync. Serverless computing came a long way.

Agree that serverless is a buzzword here, just as data science and machine learning have now become. It is about taming greater complexity at lower cost, and at increasingly more granular levels.

Exactly. We were able to support millions of new devices without infrastructure involvement.

Can you elaborate on Amplify? Is it really that good? It didn't take me terribly long to set up gitlab-ci with ECS and later Fargate; both of these feel more appropriate for web-serving apps.

I may see it in a full-JS app, but I still can't find a good fit for a JS-based backend. I've recently been exploring alternatives to Django for API backends and seriously considering a JS-based framework. I have yet to find one that is all three of: good, simple, in typescript. TypeORM looks excellent for the ORM side but there's still the matter of writing APIs; anything I've looked at (Express, Koa…) is atrociously repetitive compared to Django REST Framework -- NestJS is the best I've found, and it's still miles away.

SSR is going to do wonders for page load times on the internet as it finally gets popular via React/Vue. I hope it's the future for all of these heavy-weight user-facing JS apps.

SSR is the future? What has PHP been doing for 15+ years?

I'm talking about Next.js/Nuxt.js style JS front-ends replacing exactly that plus JS heavy frontends like Angular and SPA react apps which was the last decade's modus operandi.

The way SSR hooks React/Vue into these JS apps "hydrating" them after loading prerendered component based views...to make them interactive without losing any performance compared to static HTML, is unique and extremely powerful, which most people don't understand until they do it. It really is the future of frontend development.

SSR combined with async loaded chunked bundles of components is far more than prerendering some server side web app templating library with full HTTP requests in between. All the power of a full fledged SPA but with none of the performance or SEO downsides, with automatic offline + service worker caching. It's great for the web's future.


Yo dawg I heard you like job security, so we put a program in your program, now you can render while you render

More seriously, I do understand the difference, but disagree with the whole approach in 95% of cases

Even the Haskell people are adopting SSR’d JS-heavy frontends (Miso, Purescript, etc) for their web apps. That’s when you know it’s mainstream. Good luck with PHP!

SSR is the default for the web since it started.

Why do you say that nodejs is completely insecure by design and how does gcp or aws mitigate those security concerns?

Because you will inevitably have hundreds/thousands of dependencies, controlled by at least as many people, any one of whom could inject code to backdoor your server.

A supply chain attack will sooner or later be the cause of a major incident.

It's the same for any other language: Java, C++, .NET, PHP and even Erlang. None of them force you to use governed central repositories. And that's a good thing.

The scale is on a different level however. Your average node project will have 10/100x as many dependencies compared to other languages. Too many to conceivably check. Also due to how dynamic the language is, I think it is way easier to hide something.

The V8 runtime itself is pretty secure. However, every npm package has total access to your filesystem and network I/O. This is by design; the author of Node himself has apologized for it and admitted that nothing can be done now because it'd basically break the internet. This means any package (e.g. eslint), dependency, anything that has code from just one malicious contributor can grab all your API keys, SSH keys (if you still use those), environment variables, and your users' crypto wallets (this has actually happened a few times now at scale).

With something like aws-amplify you just go on their site and put your environment variables there, instead of keeping them on your own machine.

Now you don't have to worry about using sketchy docker images, or your junior devs using their work laptops on a malware infested gaming cafe while still running their localhost server.

AWS and GCP can afford to have way better internal security and regular pentesting of their containers and infrastructure, so now wrapping those protecting layers around node, express, etc... is their problem. You just push your code to the production or testing branch and they handle all the provisioning, builds and deployments in 3-5 minutes.

The npm dependency issue is a serious concern, but I'm not convinced that gcp or aws would mitigate the issue. If the problem is unaudited code that could be potentially compromised, gcp and aws will run that compromised code without protest.

It's very easy to incur high costs here. We implemented cost analysis dashboards that allow us to monitor costs per each event, device, with visibility into each AWS service we use, with charts showing historical data. Fiscal planning is now part of our architecture design and implementation.

It all depends on how much data they're processing. It looks to be mostly a pay-per-use model.

I'd say their big cost is Kinesis and potentially API Gateway. Lambda is great for this kind of workload (mostly).

Top 3 are EC2, DynamoDB and Lambda.

I don't see EC2 in your architecture diagram. Where are you using EC2?

How are you structuring your dynamo tables? Is there one table that is used much more than another?

Not using EC2 directly, though AWS breaks the cost down with a line item for EC2.

Many tables in DynamoDB. Two out of those are most used (equally).

Saw this post a while ago: https://medium.com/@dadc/aws-appsync-the-unexpected-a430ff71... - did you hit any limitations with AppSync?

We haven't hit any scaling issues yet. GraphQL is nice. It's really about getting data directly from DynamoDB and Aurora to an end point that Android/iOS/React-JS can query and subscribe to. Apache Velocity Template Language that AppSync uses is a pain though. This post captures it well (unfortunately): https://www.reddit.com/r/graphql/comments/b0zomv/aws_appsync...

AppSync does have limitations we have to contend with. Custom scalar types cannot be defined, hence we are not able to define strictly typed GeoJSON objects. Apache VTL has its own learning curve; once you master it you can implement functionality without invoking Lambda functions, which avoids paying for their usage in high-volume GraphQL call scenarios and gives faster access to queried data.

Just FYI, Hasura GraphQL Engine has native support for GeoJSON types:


PS: I work here. Apologies for plug.

I've been using Hasura for several months at work and its approach to GraphQL has nailed the level of abstraction needed for early product development.

It's a great complement to serverless and static front-ends.

How do you deal with Lambda concurrency? I have found it's pretty easy to hit 1K concurrents if functions take a long time to run and receive bursty traffic.

You can IIRC ping support and ask for a concurrency limit increase, but probably what I would do first is try to segregate lambda deployments and API endpoints (or whatever trigger) by region so that total load is distributed (you get 1000 concurrents per region). Obviously at this point you would also profile your code to optimise function executions.

Do you mean you don't want it to handle 1k concurrent requests (you want some to be rejected or queued instead?) or do you mean that the concurrent execution causes some other problem?

(honest question, not snark)

I think they mean there’s a 1k concurrent request limit that they hit. Though the alternative would be dedicated servers and load balancers, no?

Right, I'm referring to AWS limits. I was running a benchmark yesterday against a logging endpoint I made with a similar architecture to the article. One function is attached to a public ALB endpoint and does some validation then writes the event to SQS; this was taking 100-200ms with 128 MB of RAM. A second function was attached to the SQS queue; its job was to pull events and write them out to an external service (Stackdriver, which sinks to BigQuery). This function was taking 800-1200ms at 128 MB RAM, or 300-500ms at 512 MB (expensive!).

While running some load testing with Artillery I found that I was often getting 429 errors on my front-end endpoint. When pushing 500+ RPS, the 2nd function was taking up over 50% of the concurrent execution limit and new events coming into the front-end would get throttled and in this case thrown out. That also means that any future Lambdas in the same AWS account would exacerbate this problem. Our traffic is spiky and can easily hit 500+ RPS on occasion, so this really wasn't acceptable.

My solution was to refactor the 2nd function into a Fargate task that polls the SQS queue instead. It was easily able to handle any workload I threw at it, and also able to run 24/7 for a fraction of the cost of the Lambda. Each invocation of the Lambda was authenticating with the GCP SDK before passing the event and the Lambda has to stay executing while the 2 stages of network requests were completed.

I'm happy to report I haven't been able to muster a test that breaks anything since I started using Fargate!
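The numbers in that benchmark can be sanity-checked with Little's law (in-flight executions are roughly arrival rate times duration). This is a rough sketch that assumes one invocation per event, which ignores SQS batching, and uses midpoints of the durations quoted above. The 1000 figure is the per-region limit mentioned earlier in the thread.

```python
def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Little's law: in-flight executions ~= arrival rate * time in system."""
    return requests_per_second * avg_duration_seconds


ACCOUNT_LIMIT = 1000  # per-region concurrent executions, as quoted in the thread

# Assumed midpoints of the ranges reported in the benchmark.
front = estimated_concurrency(500, 0.15)   # 100-200 ms validation function
writer = estimated_concurrency(500, 1.0)   # 800-1200 ms writer at 128 MB

share = (front + writer) / ACCOUNT_LIMIT
print(f"front={front:.0f} writer={writer:.0f} share of limit={share:.1%}")
```

Under those assumptions the slow writer alone accounts for about half the limit at 500 RPS, which lines up with the throttling described.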


> the 2nd function was taking up over 50% of the concurrent execution limit and new events coming into the front-end would get throttled and in this case thrown out.

It sounds like you already found a great solution for your particular case. But it's also worth mentioning that you can apply per-function concurrency limits, which can be another way to prevent a particular function from consuming too much of the overall concurrency. For anyone whose Lambda workload is cheaper than a 24/7 task, that could be a good option.
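As a hedged sketch, a per-function limit can be set through the Lambda API's `put_function_concurrency` call. The wrapper name `cap_concurrency` and the function name in the usage note are made up, and the client is passed in so the snippet does not itself require boto3 or credentials.

```python
def cap_concurrency(lambda_client, function_name, max_concurrent):
    """Reserve (and thereby cap) concurrency for a single function.

    `lambda_client` is expected to be a boto3 Lambda client, for example
    boto3.client("lambda"). It is injected so this sketch can be exercised
    without AWS access.
    """
    if max_concurrent < 0:
        raise ValueError("max_concurrent must be >= 0")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=max_concurrent,
    )
```

For example, `cap_concurrency(boto3.client("lambda"), "sqs-writer", 200)` would keep a hypothetical `sqs-writer` function from using more than 200 concurrent executions, leaving the rest of the pool for the front-end function.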

> Each invocation of the Lambda was authenticating with the GCP SDK before passing the event

I'm curious whether you tried moving the authentication outside of the handler function so it could be reused for multiple events? I've found that can make a huge difference for some use cases.

1 lambda hitting 1k concurrent or many lambdas hitting that in aggregate?

How does the cost of DynamoDB (and other components) compare to other options that you considered, especially at scale? Would the economics work with the same architecture at, say, 100X scale?

Good question. At 100x, probably not. At 10x, yes would be better than managing services on our own. By that time, we would have a better prioritized list of which services to self-manage and which ones to leave to AWS. Are you specifically concerned about DynamoDB for some reason?

How easy or hard would it be to switch to self-managed components as you grow from 10X to 100X? Quite often, they end up becoming a tech debt that remains in the back burner. Just curious.

Ah yes. The engineer would tell you we can move when we want. The manager would tell you it is harder than it looks. Management would tell you it will never happen. :-)

See it as reducing startup risk and deferring the payment to when you become successful and have money/time to throw at problem. Though there are best practices to do it in a clean way so moving is easier.

Do you know some known gotchas here?

I’d be curious why you think at 100x is where you would lose out on TCO with self managed. I feel like staff time commitment should only go up with larger fleets, you’d really start running into the pricing advantages on 0-rated network here etc.

This is not going to scale. Lambdas are hella slow. The cold starts will kill you.

The only two places where the cold starts will hurt is the API key auth and the JWT auth/posting to kinesis. Plus if it's being called with any kind of decent frequency it won't matter.

With constant traffic cold starts should not be happening. Also lambdas will stick around for 45-60 minutes before going cold.

That's exactly right. Background location tracking leads to constant traffic. Not running into cold starts as an issue.

> This is not going to scale. Lambdas are hella slow. The cold starts will kill you.

Frameworks like Zappa handle this out of the box by setting up executions to run via CloudWatch event cron.
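Roughly, the keep-warm idea is a scheduled event that invokes the function every few minutes, with the handler returning early for those pings. The sketch below is not Zappa's implementation; it only assumes the standard shape of a CloudWatch scheduled event (`source` of `aws.events`, `detail-type` of `Scheduled Event`), and the response bodies are made up.

```python
def is_warmup_ping(event):
    """True for CloudWatch scheduled events, which are used here as warm-up pings."""
    return (
        event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def handler(event, context):
    if is_warmup_ping(event):
        # Return immediately: the only goal is to keep the container alive.
        return {"warmed": True}
    # Normal request path (placeholder for real work).
    return {"statusCode": 200, "body": "handled " + event.get("path", "/")}
```

This keeps one container warm per ping; it does not help when a burst needs more concurrent containers than are currently alive.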

You can pay to never hit cold starts..

No you can't, you can just try and keep your lambdas warm, but that isn't the same thing.

wouldn't EC2 be an example of paying to never hit cold starts?

No, you can't.
