More

ocdnix · 2025-01-05T11:06:44 1736075204

Fascinatingly effective, but maybe I'm the only one getting the heebie-jeebies when someone suggests implementing this in production corp networks. Sure it's super convenient, but the thought of bypassing all traditional NATs and firewalls, and instead relying solely on a software ACL, seems super risky. Maybe I just don't understand how it works, but it seems that a bad actor getting access to a stray VM with Tailscale on it in, say, your AWS testing env, essentially has an clear path all the way into your laptop on the internal corp network, through the kernel, into user space and into the Tailscale ACL code as the sole arbiter of granting or blocking access. Would I even know someone unauthorized made it that far?

dfawcus · 2025-01-05T15:18:31 1736090311

That is why many of us keep repeating that NAT is not a security mechanism.

Punching through NAT, and most associated state tracking filters, is very easy.

I've implemented such in a production corp environment, as a product to be sold. There is no magic here, it is all well understood technology by the practitioners.

If you actually want to have packet filtering (a firewall) then deploy a firewall instance distinct from any NAT, and with appropriate rules. However that only really helps for traffic volume reduction, the actual security gain from a f/w per se is now minimal, as most attacks are over the top: HTTP/HTTPS, POP/IMAP etc.

cyberax · 2025-01-05T23:12:22 1736118742

> That is why many of us keep repeating that NAT is not a security mechanism.

You can say that in general, network firewalls are not a security mechanism. They are at most a means to prevent brute-force attacks from outside of the network.

dijit · 2025-01-05T11:22:07 1736076127

to be completely fair with you, everyone misinterprets NAT as a security mechanism, because traditionally it is deployed alongside a stateful firewall.

In reality, of course, the stateful firewall is doing all of the heavy lifting that NAT is getting the credit for. Tailscale does not get rid of the firewall in fact it has a much more comprehensive setup based on proper ACLs.

Though I’m definitely the first to admit that their tooling around ACL’s could be significantly improved

tptacek · 2025-01-05T17:51:17 1736099477

I think they mostly interpret NAT as a security mechanism because that's what it originally was; "NAT" was a species of firewall, alongside "stateful" and "application layer". And NAT obviously does serve a security purpose; just not the inside->out access control function we're talking about here.

avianlyric · 2025-01-05T20:55:23 1736110523

> think they mostly interpret NAT as a security mechanism because that's what it originally was; "NAT" was a species of firewall

That’s simply wrong. NAT is, and always has been for the sole purpose of Network Address Translation, I.e. allowing a large IP address space to hide behind a much smaller IP address space (usually a single IP address), for the purpose of mitigating IP address exhaustion.

NATs were meant to be a stop gap solution between IPv4 running out, and the rollout of IPv6. But we all know how that panned out.

The “firewall” like aspects of a NAT are purely incidental. The only reason why a NAT “blocks” unsolicited inbound traffic is because it literally has no idea where to send that traffic, and /dev/null is the only sensible place to direct what’s effectively noise from the NATs perspective.

The fact that NATs shares many of basic building blocks as a very simple stateful firewall is just a consequence of both NATs and firewalls being nothing more than stateful packet routing devices. The same way any standard network switch is (they internally keep a mapping of IP to MAC address of connected devices based of ARP packets, which incidentally blocks certain types of address spoofing, but nobody calls a network switch a firewall).

tptacek · 2025-01-05T21:15:14 1736111714

You're trying to piece this together axiomatically, but you can just read the history of the Cisco PIX firewall to see that the story is not as simple as you want it to be. One of the first and clearly the most popular NAT middlebox products of the 1990s was a firewall, and Cisco made a whole big deal about how powerful NAT was as a security feature.

avianlyric · 2025-01-06T01:54:49 1736128489

You’re working backwards here from Cisco’s marketing material. Just because someone in Cisco’s marketing team was smart enough to realise they could market NAT as a security feature, doesn’t mean it was designed to be a firewall.

Apple advertises their iPads as “computer replacements”, that doesn’t mean the iPad was originally designed to be a computer replacement, and it certainly doesn’t make iPads a good computer replacement for many people.

I would also highlight that Cisco PIX had a dedicated firewall layer in addition to its NAT layer, which provided much more capabilities than the NAT layer alone. The fact that these two layers intelligently built on each other is just good implementation engineering, it doesn’t change the fundamental fact that NAT isn’t, and never has been, a proper security tool.

tptacek · 2025-01-06T01:58:17 1736128697

I'm working forwards from my experience at the time as a security engineer working with products that claimed NAT was a security feature, since it allowed for machines to access the Internet without being routable from the Internet for initiated connections, which is why the first commercial PIX product, after Cisco bought Network Translation (which named PIX), was a firewall.

People confuse the fact that NAT is not an especially powerful or coherent security feature with the idea that it isn't a security feature, which leads you to the crazy rhetorical position of having to argue that PIX, the first mainstream NAT product, was not a security product. I have friends who worked on PIX, for many years. I assure you: they were in the Security BU.

I think this position is pretty hopeless, though if you want to drag us around through the network security marketing of the mid-1990s, I'm happy to do so, just for nostalgia's sake. NAT is absolutely a security feature, and was originally deployed as one, in an era where it was still feasible to get routable allocations for individual workstations.

avianlyric · 2025-01-06T17:45:28 1736185528

> NAT was a security feature, since it allowed for machines to access the Internet without being routable from the Internet for initiated connections

I'm sure you also know, that any stateful firewall can achieve the same result without having to provide NAT capabilities. Sure Cisco PIX may have been a security appliance, but that doesn't make NAT's a firewall. You don't need Network Address Translation to create a firewall that allows devices to connect to the internet, but makes those machines unrouteable to unsolicited requests. For your claim that NATs are meant to be a firewall, you need to provide an explanation as to why we don't use NATs with IPv6.

Why would increasing the IP address space so that it's once again possible to get routable allocation for indivual workstations, result in people not deploying IPv6 NATs, when apparently they're an important security tool for IPv4, in even in the days when "it was still feasible to get routable allocations for individual workstations"?

tptacek · 2025-01-06T19:18:29 1736191109

Now you're arguing that NAT isn't a good security feature. We agree. There's no reason for us to drill for things to disagree about.

avianlyric · 2025-01-07T15:18:09 1736263089

No I’m arguing that NAT isn't a security feature, and wasn’t meant to be a security feature. The fact people sold it as a security feature, and the fact that it might incidentally behave like a poor firewall, doesn’t change the fact that NAT isn’t and never was meant to be a security feature, good or bad.

tptacek · 2025-01-07T16:38:18 1736267898

I feel like I've provided black-letter proof that it was meant to be a security feature; the commercial product of its inventor was a firewall that advertised NAT as a security feature. I don't really understand how you can argue around that.

Nobody's reading this thread anymore, so why don't we leave our arguments where they stand.

cesarb · 2025-01-06T03:34:04 1736134444

> The same way any standard network switch is (they internally keep a mapping of IP to MAC address of connected devices based of ARP packets, which incidentally blocks certain types of address spoofing, but nobody calls a network switch a firewall).

I thought standard network switches kept a mapping of MAC address to physical network ports, and didn't concern themselves with the IP layer at all (other than things like IGMP/MLD snooping)? Mapping from IP to MAC addresses is a function of hosts/gateways, not switches.

yencabulator · 2025-01-06T21:31:06 1736199066

Lots of switches filter out ARP responses that would change the destination of traffic to preexisting clients.

For example: https://www.arubanetworks.com/techdocs/AOS-S/16.10/ASG/YAYB/...

dijit · 2025-01-05T17:58:32 1736099912

I mean, it really isn’t a security mechanism of any kind. Any security properties at all are completely accidental.

One need only disable stateful firewalling and use that to see how completely dire the situation would be. As all outbound connections open up your host to the internet.

Hilift · 2025-01-05T11:49:30 1736077770

> production corp networks.

Networking has long been the toxic wasteland of security and misconfiguration. Now combine that with newer host-based networking models for containers. The Windows network stack is substantially different now due to that, and more complex. Since Wireguard has been part of Linux, everyone and their brother now has a VPN, somewhere connecting to a VPS. It's probably worse than you think because you don't know what you don't know.

rixed · 2025-01-05T14:39:12 1736087952

This is to go through NAT, which are devices designed to work around the rarefaction of IPv4 addresses.

Firewalling is a different concept, but since you raise that issue of connectivity wrt. security, I have to say that what makes /me/ sad and anxious is to see how internet security has always been hinging on bloquing paquets based on the destination port.

Doing what's easy rather than what's correct, exemplified and labelled "professional solutions"...

navigate8310 · 2025-01-05T14:51:27 1736088687

I'm rather more curious as to why you stylized "bloquing paquets"?

aborsy · 2025-01-05T15:13:52 1736090032

Maybe the OP is French? :)

rixed · 2025-01-05T22:59:50 1736117990

Haha indeed.

I also frequently stumble upon "connexion" and "trafic", in the same department.

antod · 2025-01-06T18:49:58 1736189398

They are emphasizing the queuing implementation.

Muromec · 2025-01-05T12:44:14 1736081054

That’s how all voip worked since forever and we also have a bunch of standard and public facing infrastructure to make it easier. All the ice, turn and friends.

It still needs something on the inside to talk to outside first, so the actual firewall should whitelist both outbound and inbound connections.

Than again, if you rely on perimeter, it’s a matter of time when someone figures out what’s your equivalent of high wiz jacket is.

totallywrong · 2025-01-05T16:08:16 1736093296

It's no different from traditional VPNs. The tailnet admin has control over the routes that are exposed to clients and ACLs are available to further limit access. It's an overlay network, it doesn't magically give you access to user space on people's laptops.

ray_v · 2025-01-05T18:47:20 1736102840

Given how tailscale works and many of the features (the SSH features especially) it's not terribly hard to imagine a critical flaw or misconfigured setup giving access to userspace

avianlyric · 2025-01-05T21:02:08 1736110928

Everything beyond tailscales core VPN features are opt-in. The risk of misconfiguring Tailscale is the same (arguably it’s much smaller) as just misconfiguring SSH on a machine.

At the end of the day, Tailscale works just like any other VPN, from the perspective of the type of data that can traverse between machines connected to the same virtual network. Tailscales use of a P2P wireguard mesh is just an implementation detail, it’s no more or less risky that having every machine connect to a central VPN gateway, and bouncing all their traffic off that. Either way, all the machines get access to a virtual network shared by every other machine, and misconfigured ACLs could result in stuff getting exposed between those machines, which shouldn’t be exposed.

If anything the Tailscale mesh model is much more secure, because it forces every device to use a true zero trust model. Rather than the outdated, “oh it managed to connect to the VPN, it must be safe then” model that traditional VPNs often end up implementing.

rainsford · 2025-01-05T15:16:03 1736090163

I'm not sure how to compare the risk and attack surface of traditional NATs and firewalls vs Tailscale's ACL code, but I'm not sure Tailscale is obviously the riskier choice there. I think more traditional network devices are more familiar and more of a known quantity, but there's a lot of janky, unpatched, legacy network devices out there without any of the security protections of modern operating systems and code.

It's also worth considering that exploitability of ACL code is just one factor in comparing the risk and Tailscale or similar solutions allow security conscious setups that are not possible (or at least much more difficult) otherwise. For example, the NAT and firewall traversal means you don't have to open any ports anywhere to offer a service within your Tailscale network. Done correctly, this means very little attack surface for a bad actor to gain access to that stray VM in the first place. You can also implement fairly complex ACL behavior that's effectively done on each endpoint without having to trust your network infrastructure at all, behavior that stays the same even if your laptop or other devices roam from network to network.

Not to say I believe Tailsclae is bulletproof or anything, but it does offer some interesting tradeoffs and it's not immediately obvious to me the risk is worse than legacy networks (arguably better), and you gain a lot of interesting features and convenience.

miki123211 · 2025-01-05T19:13:46 1736104426

And for whatever it's worth, Tailscale is written in a language that makes buffer overflow and memory corruption vulnerabilities extremely unlikely.

bradleyjg · 2025-01-05T18:23:15 1736101395

You don’t want to be hard on the outside, soft on the inside. Especially because you probably aren’t that hard on the outside!

Defense in depth.

devmor · 2025-01-05T11:14:19 1736075659

That is a whole lot of different levels of exploits that would have to be chained together that you just trivialized there.

How do you suppose they gained access to the kernel and userspace just by having a network connection to the laptop?

Muromec · 2025-01-05T13:14:41 1736082881

By using an unpatched rce in any network exposed code. The whole point of firewall is to prevent bad hackers from the bad internet to exploit your unpatched rces, abuse your default passwords, host based security you shouldn't have had in the first place and access stuff using compromised credentials you didn't revoke or didn't know you should have revoked. Because consistently doing all of that all of the time is hard for creative professionals. It's a chore. It's a tax.

avianlyric · 2025-01-05T21:17:31 1736111851

How exactly is Tailscale different to literally any other piece of network capable software in that regard?

NAT traversal requires careful coordination between two devices to create a connection, it’s not like any random device on the internet can perform NAT traversal against a machine just because it’s running Tailscale (not to mention every modern browser has NAT traversal built in for WebRTC connections).

And if the issue doesn’t arise from using NAT traversal, then how does Tailscale expose anything more significant than what a traditional VPN will expose? After all the only difference between a P2P VPN and a traditional VPN, is that a traditional VPN bounces all your traffic off a common server, rather than attempting P2P connections.

cyberpunk · 2025-01-05T11:25:45 1736076345

I think the point is not that there are necessarily exploits, but by compromising one node in the tailnet they now have the ability to hit code in these locations, or services running on your tun0 interface on your laptop etc.

avianlyric · 2025-01-05T21:20:06 1736112006

How is compromising a a single node in a tailnet more dangerous than compromising a single node in a traditional VPN?

Traditional VPN don’t usually put firewalls between machines on the network, because traditionally the whole point of a VPN is to avoid the need for firewalls to provide security between nodes on the virtual network, by assuming that only safe machines can connect to the VPN.

aborsy · 2025-01-05T12:04:43 1736078683

You would typically remove the default any to any ACL rule, and allow the connections that you need. The compromised node normally would not have access to anything interesting. Normally it’s jailed, or would not be able to make outgoing connections.

Am I missing something?

cyberpunk · 2025-01-05T12:18:38 1736079518

The ACL logic happens in the tailscaled on the destination though doesn't it? So even if you block the access via the ACL the packet has still gone through the network stack and go runtime etc before the traffic is dropped which is a significantly bigger surface than a (traditional) external network firewall.

aborsy · 2025-01-05T12:45:09 1736081109

I see your point. You are talking about a vulnerability.

You are right. Tailscale nodes can send packets processed in any other node, irrespectively of ACLs. Essentially each node gets to “run code” in other nodes, which is normally dropped. I don’t know how deep the Tailscale packets go before being dropped (perhaps the coordination server distributes firewall rules).

But you have to compare with another access method, like, the hub and spoke VPN. The compromised and uncompromised nodes connect to a VPN access server. A compromised node sends packets that are processed in the VPN server, but can also connect to the uncompromised node, meaning, the latter has to process and drop the packets of the former. You have to trust the OS IP stack. To some extent, the same is true if the trusted node VPNs directly to the untrusted node. During an established connection, the networking stack of the trusted node has to block the other side.

Maybe someone familiar with the implementation of ACLs in Tailscale could chime in.

Update: The ACL rules are applied to the incoming packets over tailscale interface. The filtering is then done by tailscaled. The packet has gone past the interface and processes by tailscaled. So an unauthenticated packet indeed travels through the kernel space all the way to the userspace.

avianlyric · 2025-01-05T21:27:19 1736112439

How is this any different to any other piece of network capable software that’s listening to a port on your machine?

An external network firewall can only offer protection if you can somehow guarantee that every packet that hits a specific node is first routed via that firewall. Traditionally nobody has setup networks like that, because it requires routing every single packet via a single common bottleneck, causing huge latency and throughput problems.

As for packets going via the network stack, and then the go runtime. Do you honestly believe there’s set of vulnerabilities out there which would allow random external packets to be sent to a random machine, and cause an RCE by virtue of simply being process by the OS kernel, which somehow can only be exploited if you’re running Tailscale? Better still, if such a vulnerability exists, what on earth makes you think your firewall isn’t also vulnerable to same issue, given that pretty much every firewall out there is built on the Linux kernel these days.

cyberpunk · 2025-01-08T17:39:56 1736357996

It goes into tailscaled before it's dropped, and tailscaled has had a number of security issues in the past (including RCE):

https://tailscale.com/security-bulletins#ts-2024-005 https://tailscale.com/security-bulletins#ts-2022-004

kortilla · 2025-01-05T12:46:30 1736081190

> Am I missing something?

Yes

> You would typically remove the default any to any ACL rule

This part doesn’t happen.

Defaults are rarely changed.

0xbadcafebee · 2025-01-05T16:12:20 1736093540

Network security is a myth. NATS, firewalls, ACLs, etc don't keep you safe. Even on your Wifi LAN right now, you aren't safe from local network attacks originating from outside attackers.

dd_xplore · 2025-01-05T20:14:21 1736108061

0xbadcafebee · 2025-01-06T02:32:29 1736130749

Because hackers can contort themselves into amazing shapes in order to fit through tiny holes in the oddest places. Once they position themselves correctly, and are able to reach the network address and port of a given service, and it has no authentication, it's open season. It may seem difficult, nigh impossible, for a hacker to reach all the way into your WiFi LAN. But there are always twists and turns to take.

From the public internet: tens of thousands of internet routers have publicly known exploits right now, which the router vendors refuse to fix. Just scan the internet for the routers, use your exploit, and you're inside.

From the opposite direction: malware in a website can redirect your browser to the management interface of a router on your local LAN, where it can reconfigure your router. If there is a password but you have logged in from your browser, the active session token lets it right in, and CSRF protection is often disabled or incorrectly set up. And even if it has a password, many such routers have exploits that will work despite a password. Many people also fall for phishing attacks that can drop payloads on your machine directly.

In some cases, the ISP itself has shipped a firmware update to routers that included malware.

All of these things have happened in the past 2 years, to millions of internet users, that we know of. Many large attacks go unnoticed for years. Once the router is compromised, it can be configured to forward ports or enable UPnP, or simply persist malware inside the router itself. The network is wide open and at the attacker's fingertips.

And this is just one class of attack. There are many more that can attack private networks. So there is no place safe from network attacks. Not in a corporate network, not on your local LAN, nowhere. There is no network security. The only network services that can be somewhat trusted are ones which require strong authentication, authorization, and encryption.

avianlyric · 2025-01-05T21:10:11 1736111411

A better question is, “why do you think your local network is safe?”.

Have you taken steps to validate the integrity of every single device connected to the network?

If a single device is compromised, how will detect its been compromised?

If a device is compromised, what prevents it from being used to launch an attack on other devices in your network, especially if your security model assumes that all devices on your local network are “safe”?

For a practical example of this happening, in a very impressive manner: https://arstechnica.com/security/2024/11/spies-hack-wi-fi-ne...

For a more boring everyday equivalent, just search around for one of the many botnets that are assembled from compromised SoHo routers, or IoT devices, around the world.

https://arstechnica.com/security/2024/01/chinese-malware-rem...

Assuming a local network is safe and secure is foolish. There’s nothing inherently secure about a local network, the only reason it offers any level of security is due to a local network being many-many orders of magnitude smaller than the entire internet. So the probability of a hostile device (whether intentional installed as hostile, or became hostile after a remote attack) being connected is smaller. But at the end of the day, is security via “being luckier than the next dude”.

miki123211 · 2025-01-05T19:10:23 1736104223

As far as I understand, Tailscale won't even let you initiate a connection (or give you WireGuard keys for a node) unless there's an ACL rule that allows it.

kennes913 · 2025-01-08T15:18:17 1736349497

Currently evaluating tailscale as a VPN-like solution and read the same thing:

"At a less granular level, the coordination server (key drop box) protects nodes by giving each node the public keys of only the nodes that are supposed to connect to it. Other Internet computers are unable to even request a connection, because without the right public key in the list, their encrypted packets cannot be decoded. It’s like the unauthorized machines don’t even exist. This is a very powerful protection model; it prevents virtually any kind of protocol-level attack. As a result, Tailscale is especially good at protecting legacy, non-web based services that are no longer maintained or receiving updates."

Source: https://tailscale.com/blog/how-tailscale-works#bonus-acls-an...

binary132 · 2025-01-05T15:28:20 1736090900

Isn’t this essentially what a VPN does? I mean, that’s what TailScale is built on: Wireguard.

tantalor · 2025-01-05T13:27:54 1736083674

Security in depth.

ocdnix · 2024-03-16T07:56:49 1710575809

This lists r7iz at the top, and AWS says they run at 3.9 GHz. Dunno where this leaves the m5zn and x2iezn family (at 4.5 GHz), or even the z1d (at 4.0 GHz). Frequency isn't everything, but seems strange that they're not to be found in the table.

crohr · 2024-03-18T08:59:05 1710752345

You're right, just added m5zn. Base frequency is 3.8Ghz and turbo boost 4.5Ghz, but since it is an outdated CPU, they are average at best.

I'll also benchmark the z1d.

eisa01 · 2024-03-16T08:42:41 1710578561

z1d is ancient, from 2017. My theory is that they keep it around because of corporates not switching, and wasting $ on slow and more expensive compute?

z1d: Skylake, 4.0 GHz, 2017 X2iezn: Cascade Lake, 4.5 GHz, 2019

ocdnix · on Sept 10, 2021

What pain points do you see with Amazon ES, if I may ask? I'm considering migrating to it.

ocdnix · on Feb 14, 2021

How do 32 and 64ths look if you swing on 16ths? Do they end up on a sine-like interpolated curve of sorts, or do changes in velocity happen instantaneously like a square wave?

ganafagol · on Feb 17, 2021

It gets blurry beyond the first swingified level. Most common is binary subdivisions ( meaning that your 1/3rds become 1/6ths), but if the swing is slow enough, another layer of 1/3rds is possible too, making 1/9ths effectively.

ocdnix · on Feb 11, 2021

Reminds me of Lyft's thing from a couple of months ago: https://news.ycombinator.com/item?id=25000950

I would love to get answers to questions like "which users have access to resource X, including implicitly through one or more assume-role jumps, across these N accounts, including stuff like iam:PassRole, even including tag-based policies?". Add a time dimension too, like "who had access to X between Jun and Aug 2020?", and you'd have a winner. Would such queries be possible here?

fatjohnny · on Feb 11, 2021

The joins are very powerful. For example - you can connect a lambda function to its IAM role and then right through to the attached policies. We have quite a few join examples scattered through the AWS table docs. For tags, Steampipe actually normalizes a tags column across AWS, Azure, GCP & DigitalOcean tables. It's always available as a JSONB {"foo":"bar"} format, even if the source was labels like DigitalOcean, so definitely possible to find resources with specific tags. We have multi-account on the near-term roadmap, but the idea of historical searches is a super interesting and challenging idea… we haven’t started to contemplate yet. Perhaps using snapshots into a materialized view would work for comparisons over time?

whoknew1122 · on Feb 11, 2021

I just glanced over the source, but I think the answer is no in both cases.

> "which users have access to resource X, including implicitly through one or more assume-role jumps, across these N accounts, including stuff like iam:PassRole, even including tag-based policies?"

This would be difficult to pull off because you'd need to make separate calls to each of your accounts to determine this sort of thing. And if you're looking at assuming roles through mulitple accounts, you have to consider whether external Ids are defined.

And if external Ids are defined, how do you handle that? Do you assume the caller has the external Id?

> "who had access to X between Jun and Aug 2020?"

This one would be easier, but would require integration with AWS Config.

ocdnix · on Jan 25, 2021

That cloud provider should be AWS, no? There's no corresponding incident on AWS' status page, which seems a bit strange.

ocdnix · on Jan 4, 2021

With AWS Route 53's 10 000 records/zone, 400 values/record 255 chars/TXT and base64's ~35% overhead, you have a bit over 600 megabytes of binary value storage.

ocdnix · on Oct 22, 2020

I was expecting this to be about the NLB's strange "lag" in updating its flows, wreaking havoc when it comes to changes in the target group, and possibly also relating to weirdly long delays before starting health checks of newly registered targets. I'm bewildered why this hasn't been more of a problem for others, and also why AWS seem to have kind of silently acknowledged the issue (by not closing them), while not coming up with a fix, even after a year. Am I the only one seeing this problem? Ref: https://github.com/aws/containers-roadmap/issues/470 and https://github.com/aws/containers-roadmap/issues/469

ocdnix · on Aug 12, 2020

This doesn't cover other interesting uses, like tag-based automation. Random examples: Tagging DynamoDB tables to identify which should be backed up and at which frequency (when you don't quite trust the built-in backup); tagging dev RDS databases with a shut-down schedule for nights/week-ends; tagging Elastic IPs and Auto Scaling Groups with a "IP pool ID", and a Lambda that re-assigns EIPs to ASG instances as they are recycled; using a "data flow ID" tag on resources that are in the hot-path of data flows that are subject to high-volume bursts, so you can easily list them and scale them up before known events.

acdha · on Aug 12, 2020

One pattern I like is having a tag for security groups indicating that they should accept traffic from a CDN or other partner service which a scheduled Lambda function will periodically update from a canonical list of CIDR ranges. This makes it really easy to avoid people leaving origins open by mistake since you can still have a blanket ban on 0.0.0.0/0 rules.

These days I think you can use the new customer managed IP prefix list feature they added last month for this specific need so this approach could be simplified if you need to share the same ranges across accounts:

https://docs.aws.amazon.com/vpc/latest/userguide/sharing-man...

toeknee123 · on Aug 12, 2020

Those are really practical and interesting use cases for tags that we should definitely cover more of. Let me note this and we'll be sure to cover more in-depth uses as we develop more parts to this guide! Appreciate the feedback.

ocdnix · on July 23, 2020

Turns out this is about EC2 spot instances for ECS. How would it compare to ECS Fargate spot these days?

I'm also missing a discussion about designing for interruption, either by not keeping state, or by being able to shed state quickly, to be picked up by other instances.

Also, if you set up EC2 spot with a launch template or ASG with very differently-sized instance types (to reduce risk of running out), is there a way to even out the load coming through an ALB? The least-connections scheduling can help in some cases, but a connection might not map 1:1 to one unit of load. The ALB can use weighted balancing, but on the target group level. Dunno how easy it would be to allocate different instance sizes to different target groups and weigh them accordingly.

ollyculverhouse · on July 23, 2020

AFAIK with Fargate a lot of this is handled for you, as long as you have the auto scaling group.

We have this setup with two capacity providers (FARGATE_SPOT and FARGATE) with a 75/25% split, meaning that even if there are no spot instances available we will still be up.

The benefit of Fargate being that we don't need to care if certain instance sizes are not available as that is handled by AWS.

l33tman · on July 23, 2020

Cool, when fargate launched they didn't have a spot possibility (AFAIK) and since we run ECS on Spot instances it would just be a massive increase in cost to switch to FG, but if it now can use underlying spot instances, it might be worth looking at again..

Epixors · on July 23, 2020

Yeah spot capacity providers for Fargate only got added a few months ago, been running well for us in production.

pluies · on July 23, 2020

(Not the OP, but running a fairly similar setup, albeit for EKS nodes rather than ECS nodes :) )

Fargate Spot is about a third of the price of Fargate (at least in eu-west-1 now according to: https://aws.amazon.com/fargate/pricing/ ); so the savings are roughly identical.

Re risk of running out, our current strategy is to use different-but-closely-similar instance groups; so for example we have an autoscaling group running a mix of:

- m5.large - m5dn.large - m5n.large - m5ad.large - m5d.large

Which are the same price on Spot instances, but I'd wager it'd be pretty rare to have all these families reclaimed at once.

(We also use some on-demand only ASGs with lower priority in the cluster-autoscaler to ensure that if it _does_ happen, then we'll have a fallback)