Hacker News
Show HN: The Segment AWS Stack (segment.com)
237 points by calvinfo on June 16, 2016 | 58 comments

Very nice. I didn't know about terraform, and it seems like a very powerful combo... Can anyone comment on the differences in functionality between ansible AWS playbooks and terraform?

The Segment Stack is a good showcase of best practices for AWS ... only one thing I didn't understand: why is a NAT instance needed in each subnet? https://github.com/segmentio/stack/blob/master/vpc/main.tf#L... It seems a little wasteful. Couldn't you just allow traffic between the subnets?

The biggest thing I've noticed is that terraform doesn't, out of the box, have any way of bringing in your existing resources, and sharing the [tfstate file] that has the current state of your world is its own challenge. The ec2 module of ansible has `count_tag`, and that was exactly what I wanted: "I want this many boxes that are identified by these tags."

Played with an external tool called [terraforming] that creates a terraform setup from your existing boxes, but that, too, was way more than I wanted.

Terraform is a very neat tool, but learning the AWS ecosystem _and_ this third-party tool that has its own very strong set of best practices was too much for me. If you're experienced, and especially if you're already using CloudFormation heavily, you can probably get a lot of mileage out of terraform.

Doing a quick search to make sure I'm not crazy, it looks like terraform 0.7 will start to support importing, so that's something: https://github.com/hashicorp/terraform/issues/581

[tfstate file]: https://www.terraform.io/docs/state/remote

[terraforming]: https://github.com/dtan4/terraforming

I work on Terraform at HashiCorp - Terraform import is on the way, starting with 0.7. We have a multi-phase approach - first importing state with 0.7 and then moving on to generating configuration in subsequent releases. Many AWS resources are already supported for import, and we're intending to get full support over time. If there are specific resource types that you need that aren't currently supported, please open issues on the repository and we will endeavour to support them quickly!

Oh man, I just wanted to gush and say thanks for such awesome software!

- NAT Instances are placed in each Subnet for High Availability. This way, if one Availability Zone goes down and it happens to be where your NAT Instance is located, you don't lose connectivity.

- Terraform vs. Ansible: /u/mitchellh (I think that's his HN name) started terraform and explained his "Terraform vs. Alternatives" in a Google Groups Post[1].

In addition to that post, Terraform also has nice support of multiple providers, so you can mix AWS, Azure, and others in one set of templates.

In general, Terraform is doing a lot of cutting-edge thinking with orchestration tools and represents IMO a best-of-breed approach.

That being said, there are a lot of bugs, especially around eventual consistency issues in AWS. They're getting better, and most of them are recoverable, though.

[1] https://groups.google.com/d/msg/terraform-tool/6Fxnl_bejX4/0...

I believe you can use a Managed NAT Gateway now instead of NAT instances in each AZ for outbound VPC connectivity.
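For reference, a managed NAT Gateway can be declared in Terraform with just an Elastic IP, the gateway itself, and a default route. This is a hedged sketch: the `aws_subnet.public` and `aws_route_table.private` resources it references are assumed to be defined elsewhere in your config.

```hcl
# Elastic IP for the managed NAT gateway
resource "aws_eip" "nat" {
  vpc = true
}

# The gateway must live in a public subnet (assumed defined elsewhere)
resource "aws_nat_gateway" "main" {
  allocation_id = "${aws_eip.nat.id}"
  subnet_id     = "${aws_subnet.public.id}"
}

# Send the private subnets' outbound traffic through the gateway
resource "aws_route" "private_outbound" {
  route_table_id         = "${aws_route_table.private.id}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${aws_nat_gateway.main.id}"
}
```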


> In general, Terraform is doing a lot of cutting-edge thinking with orchestration tools and represents IMO a best-of-breed approach.

Terraform isn't really doing anything that snazzy compared to Cloudformation [1] (an AWS tool) unless you're also orchestrating in concert with non-AWS services.

[1] https://aws.amazon.com/cloudformation/details/

> Terraform isn't really doing anything that snazzy compared to Cloudformation

I'd disagree with that. Just take a look at the Terraform Changelog[1] for some of the latest & greatest.

For example, the concept of "Data sources" is pretty cool. Basically, you can reference pre-existing data, potentially do rich queries against it, and get a read-only value back. For example, you can use a Data source to find the latest AMI for a given search string.[2]

CloudFormation has a concept of Custom Resources[3] which could achieve similar functionality, but not without a lot of hassle.

Terraform has also been building up a rich language of interpolation functions [4] that can be used for string replacement, hash generation, and even arithmetic.
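To illustrate, here's a small hedged example combining string functions, hashing, and arithmetic in interpolations (the AMI id and variable names are placeholders):

```hcl
variable "environment" {
  default = "Staging Env"
}

resource "aws_instance" "web" {
  count         = 2
  ami           = "ami-123456" # placeholder
  instance_type = "t2.micro"

  tags {
    # replace() and lower() for string manipulation; count.index + 1 for arithmetic
    Name = "web-${lower(replace(var.environment, " ", "-"))}-${count.index + 1}"

    # md5() for hash generation over a local file
    ConfigHash = "${md5(file("user-data.sh"))}"
  }
}
```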

There's a lot more, too. I think it goes well beyond "cloud-agnostic."

[1] https://github.com/hashicorp/terraform/blob/master/CHANGELOG...

[2] https://github.com/hashicorp/terraform/blob/master/website/s...

[3] http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuid...

[4] https://www.terraform.io/docs/configuration/interpolation.ht...

I don't disagree it has some "cool" features, but we've been bitten pretty hard in production when Terraform takes action that wasn't expected (and fell outside of what the plan called for before performing the apply).

Doing devops, I prefer boring and reliable over cool.

Definitely hear you on that, and we've taken steps to mitigate Terraform's "surface area" after we encountered some problematic use cases.

This excellent article [1] does a nice job of talking about why it's important to keep your tfstate files small and isolated. Since we started doing that, working with Terraform has been much nicer (and safer!).

[1] https://charity.wtf/2016/03/30/terraform-vpc-and-why-you-wan...
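Once state is split up, one project can still read another's outputs via the `terraform_remote_state` data source. A hedged sketch, assuming an S3 backend and a hypothetical bucket name; the `subnet_id` output would have to be declared in the VPC project:

```hcl
# Read outputs from a separate, smaller state file
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config {
    bucket = "my-tfstate-bucket" # hypothetical
    key    = "vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

# Consume the VPC project's output without sharing one giant state
resource "aws_instance" "app" {
  ami           = "ami-123456" # placeholder
  instance_type = "t2.micro"
  subnet_id     = "${data.terraform_remote_state.vpc.subnet_id}"
}
```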

I agree managed NAT is probably better (although can be more costly), but you still need a managed NAT Gateway in each AZ.

Absolutely correct. We are using NAT Gateways in our architecture, and I love not having to manage it all myself, but it is definitely a tradeoff in costs.

From the AWS documentation [1]:

"Each NAT gateway is created in a specific Availability Zone and implemented with redundancy in that zone. You have a limit on the number of NAT gateways you can create in an Availability Zone. For more information, see Amazon VPC Limits.

Note If you have resources in multiple Availability Zones and they share one NAT gateway, in the event that the NAT gateway’s Availability Zone is down, resources in the other Availability Zones lose Internet access. To create an Availability Zone-independent architecture, create a NAT gateway in each Availability Zone and configure your routing to ensure that resources use the NAT gateway in the same Availability Zone."

[1]: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-n...
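The per-AZ pattern the documentation describes can be sketched in Terraform using `count` and `element()`. This is illustrative only: `var.az_count`, and the `aws_subnet.public` / `aws_route_table.private` lists it indexes into, are assumed to exist elsewhere.

```hcl
# One EIP and one NAT gateway per availability zone
resource "aws_eip" "nat" {
  count = "${var.az_count}"
  vpc   = true
}

resource "aws_nat_gateway" "per_az" {
  count         = "${var.az_count}"
  allocation_id = "${element(aws_eip.nat.*.id, count.index)}"
  subnet_id     = "${element(aws_subnet.public.*.id, count.index)}"
}

# Each AZ's private route table points at that AZ's own gateway,
# so an AZ outage doesn't take out the other zones' internet access
resource "aws_route" "private_nat" {
  count                  = "${var.az_count}"
  route_table_id         = "${element(aws_route_table.private.*.id, count.index)}"
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = "${element(aws_nat_gateway.per_az.*.id, count.index)}"
}
```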

> why a NAT instance is needed in each subnet?

I don't think it's strictly needed, but it's best practice because instances in each AZ remain independent from failures in other AZs. Were the AZ with the single NAT to go down, then instances in the other AZs wouldn't be able to communicate outside the VPC (i.e. to the rest of the internet).

There's also a side benefit of much lower latency using a NAT in the same AZ vs going across AZ (unscientific benchmark is 0.1ms in same AZ vs 0.3ms across AZ)

Note that this is only important if your use-case requires your servers to have constant, direct 'phone out' access. If you're using ELBs or similar as your link to the outside world for your use-case, a NAT isn't really necessary for constant access. We use a single micro NAT in each VPC, which is only used for system updates (no, I don't have a package cache yet...) and for when we're manually in the server troubleshooting. If there's an outage, well, there's not much we can do in that case, and the NAT isn't needed for production use. And if we really need that NAT back up, just spin up another one and modify the VPC.

As you say, it's not strictly needed. It really depends on your use case. If your use case suffers when the NAT is down, then you need HA on it. If it can wait, then no. With the new managed NAT in AWS, you may as well go with that if you need HA - its cheapest tier is roughly twice the price of a micro instance anyway, and it's one less bit of clutter in your instance list.

If, for whatever reason, you can't run VPC endpoints then you also want the NAT to be able to reach S3 (and some other service endpoints)

> Were the AZ with the single NAT to go down, then instances in the other AZs wouldn't be able to communicate outside the VPC (ie. to the rest of the internet)

Oh I see. Though assuming app servers are wired up behind an ELB, the service will only be partially degraded (no app server outbound connectivity, like you said).

The one-NAT per AZ is a more robust design but at $30/mo each (for NAT gateway) seems expensive ;) Even at $10/mo (small do-it-yourself NAT instance) it's not free.

The idea with terraform is you describe your infrastructure as config (data), and terraform keeps track of the state within AWS (or whichever provider).

So, as you change your config, terraform will come up with a plan to move from your present AWS state to the desired AWS state.

Terraform allows you to easily express the dependencies in your infrastructure, and also knows about the various dependencies that naturally come with AWS services, which allows it to make better decisions in its planning compared to tools like ansible, chef, etc.
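Those implicit dependencies fall out of references between resources. A minimal sketch: because the subnet interpolates the VPC's id, Terraform knows to create the VPC first (and destroy it last) without any explicit ordering.

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# The reference to aws_vpc.main.id creates an implicit dependency,
# so Terraform orders the VPC before the subnet in its plan
resource "aws_subnet" "public" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}
```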

You only need a NAT associated with the top-level VPC. The subnets that use it just need to refer to it in their route table. It looks like that's what they've set up, but I'm not familiar enough with Terraform to say for sure.

I think the reason most people put a NAT per AZ is in case there is a whole AZ outage taking your NAT with it. However, these days I would argue that for most cases it is better to use NAT Gateway rather than NAT Instances.

And Terraform had support for the AWS NAT Gateway a long time before AWS' own CloudFormation did. I think there was a PR for TF a day or so after AWS added it to the API. Very impressed with Terraform.

It was about 5 hours after the API became available - I remember opening the PR ;-)

yup, I was on that early thread and really impressed how quickly support got into TF. I was working on a deadline and was able to use NAT Gateways for HA NAT instead of setting it all up myself. Super convenient :)


This is great!

I've been working on an open source stack that is somewhat similar and serves the same purpose: allowing startups to bootstrap their stack on AWS.


Great work on open sourcing this. I will see what I can do to contribute.

Github organization is here: https://github.com/the-startup-stack

I am developing everything "in the open" so feel free to contribute / ask questions.

I spent the last 3 months collecting usage information from startups using the-startup-stack and am about to make another effort to commit all that knowledge back into the project.

This is amazing. Kudos to the Segment team for putting in the effort to share and open source this. And also kudos for leveraging Terraform rather than CloudFormation. I may fork this to get it running with kubernetes on GCP.

One thing that I would have loved some more detail on is how are secrets and credentials being handled?

Getting secrets and credentials correct seems to be the crux of most architectures. I'd also love to hear more from the Segment team on their approach with this setup!

This is so timely for me! I was struggling to build a repeatable stack with CloudFormation, started down the road of Terraform just last night. This will help me skip a lot of learning curve. Thank you!

Thank you for sharing, simple and detailed. However I'd like to comment that AWS is not cheap.

What would you consider to be cheaper than AWS?

If all you want is stuff running on the equivalent of EC2 instances, you can use DigitalOcean or some other competitor. However, AWS offers a degree of depth and sophistication that just isn't available from the competition. Trying to do something that is as security-clean as this VPC using DO or Linode or something, from scratch, sounds like weeks of hell to me.

"Not Invented Here" is a big problem in this industry in general. We tend to be too comfortable cobbling a solution together from stone knives and bearskins, rather than using someone else's solution (and paying for it). If you are running a business, though, you shouldn't be building things that aren't what you sell, unless you really cannot otherwise buy them.

> If you are running a business, though, you shouldn't be building things that aren't what you sell [or that you can't build more efficiently], unless you really cannot otherwise buy them.

I'd add the above. Otherwise AWS would have never been built.

I'd add, though, that "efficiently" isn't just whether you can build cheaper than buying. It's if the cost of building the functionality is cheaper than the profit/growth you can generate with a sellable product built using the same amount of developer effort. That's a very, very different (and probably more expensive) proposition. That's why executives should make the decisions, not engineers!

Back in the dot-com days, I worked on a project to build some functionality in-house that we could easily have bought off the shelf. The engineers argued that we'd save the company a million dollars. But frankly, we just wanted to do it because it was badass. And it turned out our solution would actually have cost us more per-system than the commercial solution we sneered at (hardware costs, not just development cost). Six man-months of engineering when everyone knew we were racing the clock before the money ran out? Absolute stupidity.

If I were CEO/CTO and caught wind of such a project, I'd tell people that if they lifted a finger on it, they'd be fired. But that's a very different perspective than I had back then. Risking the very existence of what could have been a very big company in order to someday save a million dollars? Feh. (Of course, no one stopped us, because the CTO was just head nerd, and the money execs were busy fundraising rather than supervising)

> That's why executives should make the decisions, not engineers!

I understand your point and it's valid, but opportunities like this are juicy steaks for Oracle and IBM sales guys. I think the best possible outcome is to involve engineers, solicit a reasonable internal cost bid, then invite external contracts, then pick what makes sense.

"We can just buy it" is what funneled money that could better be spent elsewhere into the coffers of legacy infrastructure companies for decades.

As a for-example, we were implementing this stuff in the days before MySQL had ACID transactions (back when they bragged about how much faster they were than other databases, handwaving over the cost others had for implementing transaction logs and rollback). The DBA and I tried to get Sybase in, but the Sybase sales rep, unused to startups, quickly offended the CTO with vague pricing. He was an open source purist and wanted MySQL.

I cannot begin to explain how much extra code we had to write to deal with the lack of transactions in MySQL! Sybase would have saved us a ton of time and risk.

Indeed. Remember, AWS started as Amazon providing infrastructure as a service to internal projects, so individual teams and projects wouldn't have to go buy a bunch of hardware and ops staff in order to build products. Amazon itself is the biggest customer AWS has.

Thanks to economies of scale, it's worthwhile for Amazon to develop new features as full-scale products, which leads to truly amazing things like Lambda. No one in their right mind would develop Lambda in-house, but for AWS, it makes a ton of sense.

I think Netflix is their biggest customer.

> If you are running a business, though, you shouldn't be building things that aren't what you sell, unless you really cannot otherwise buy them.

Using the "aws specific" features that are always touted as the reason it's expensive, have a secondary "invisible" cost to them that many don't or won't recognise: you're tying your business directly to a single provider, and one that has a history of predatory pricing to gain market control.

You don't need to buy physical servers to not use aws. There's lots of middle ground that can be achieved for equal or lower costs, with less lock-in and more control for your business.

t2.nano instances are less than $5 a month on AWS.

Also I wouldn't even consider DigitalOcean a competitor to AWS. Not even close.

Yes, but Digital Ocean also includes storage and bandwidth.

We were exclusively on AWS, but now are distributed across Linode, DigitalOcean, and Vultr. We're saving a lot, and getting better performance.

Can you share a little what your stack looks like? And how are you doing infrastructure automation to build and reproduce it? I'm really interested in where you found ways to cut costs, and what kind of effort was involved.

Sure, think of it this way:

If you had a perfectly clean install of your distro of choice, could you write a shell script that could build your server from scratch?

If you can answer Yes to that (and you should), then you can build an if/then-heavy shell script that works with each of the APIs to create a perfectly clean install of your favorite distro.

One script to create the clean slate machine.

One script to build what you need.

We have 9 "pods" around the globe, each with API servers (Java/Tomcat/Apache), static servers (Varnish), a MySQL slave, and an HAProxy Maître D'. With close to 60 servers, our monthly bills are less than $1,000, and we haven't had downtime in years. Spinning up a new server is just: `sh build.sh atl api 1`.

Feel free to ping me if you want any more details: mark@areyouwatchingthis.com.

How do you tie things together and handle cross cloud failover?

I use AWS Route 53 DNS with Health Checks, and put one "A" record in for each "pod". If an entire pod somehow disappears, Route 53 will take it out of the rotation.
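That pattern can also be expressed in Terraform. A hedged sketch for one hypothetical "atl" pod (the domain, zone id, and IP are placeholders, and the routing-policy block syntax varies across Terraform versions):

```hcl
# Health check against the pod's endpoint
resource "aws_route53_health_check" "pod_atl" {
  fqdn              = "atl.example.com" # hypothetical pod endpoint
  port              = 443
  type              = "HTTPS"
  failure_threshold = 3
  request_interval  = 30
}

# Weighted A record; Route 53 drops it from rotation if the check fails
resource "aws_route53_record" "pod_atl" {
  zone_id        = "${var.zone_id}"
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  set_identifier = "atl"

  weighted_routing_policy {
    weight = 10
  }

  health_check_id = "${aws_route53_health_check.pod_atl.id}"
  records         = ["203.0.113.10"] # example address
}
```

One such record per pod gives the health-checked rotation described above.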

Terraform is not AWS-specific - there are 30-odd providers covering most major cloud services and many SaaS systems also. It can be valuable for multi-cloud orchestration!

Trust me, no matter how you slice it, it's a shitload of effort. That is exactly how AWS makes a ton of money: they make hard things easy (and expensive).

Do you want the pain, or do you want the money? That's the basic proposition here. When you start looking at 25, 50, 150k/month of savings by doing stuff yourself, the choice becomes much clearer. In many cases I've seen, you could theoretically hire an entire team to take care of the stuff that AWS does for you, and still come out ahead.


I'm assuming that this was originally sarcastic, but bare metal can actually turn out really well in some situations. If you're big enough to save just $50k/month by moving from AWS to physical machines, that's several System Engineers you can hire to do your day-to-day ops work and maintenance on your Ansible/whatever scripts.

The thing that's more difficult to do on physical hardware, obviously, is scaling down every day if your peak load is some insane multiple of your base load for a 24h cycle. That's where AWS makes a lot of sense.

I don't think there's an obvious 'best' answer -- it depends on what you need.

It just bugs me that the current state of Ops/Infrastructure sometimes looks so much like the worst parts of the Javascript ecosystem.

If you were to set this up and leave it running (say for a dev environment), what would your monthly bill look like?

So, my co-founder and I have basically been independently building a commercially supported alternative to this excellent open source package. But there are a few differences that make sense when offering this as a paid service:

- We wrote our own terraform testing framework to validate that every change to our modules doesn't break functionality

- We actively update our modules based on feedback from new client engagements

- We provide commercial support for each module

- We combine our modules with consulting and training as needed

And of course, there are many similarities:

- We give 100% of the source code to our clients

- Everything runs in the client's AWS account

- Everything is self-documented, modularized, and can be combined/composed as the needs of different teams require

I didn't mean for this to be a shameless plug; more just that I found it interesting to compare the open source vs. commercial approach to solving this same problem. Props to the Segment team for sharing this.

There's a lot of potential for such a service as a consultancy - advice and in-house customization included along with the software. As cool as the Segment stack is, it doesn't let people off the hook for designing effectively in the first place, or for taking a poor design and re-engineering it.

My initial approach with the-startup-stack[1] was exactly that.

Building an open source solution with a "pro" level all setup included in it. I had a very hard time quantifying how much companies will pay for this and whether they even will.

How I see it, you are either on Heroku (or other similar) and you don't care about anything except `git push` or you have a full blown stack.

I know the middle ground between the two is where the stack is but I just couldn't figure out how many companies are actually experiencing those difficulties and how much are they willing to pay for the help.

Since it's not a startup for me, it's just an open source project I decided not to worry about it, but would still be nice to get some input.

[1] http://the-startup-stack.com

Based on personal experience, I strongly suspect there is a market for that type of middle ground. Also, the author of the top-level comment in this subthread said they are doing commercial consulting work, so that shows there is some market for it. Granted, he/she did not say how good business was, where/how/if they find customers, if they are profitable, etc.

What tool did you use to make the flow chart, or is it custom?

[0] https://segment.com/blog/the-segment-aws-stack/images/main.p...

That's the `terraform graph` command.

[0] https://www.terraform.io/docs/commands/graph.html

`terraform graph` was used in the post but it's not the one he linked to.

This seems custom to me, maybe Sketch or something.

I so much wish segment would offer their services on the Azure stack.
