I understand you may not want my advice, but anyway, here it goes :)
I've dealt with similar issues at my full-time job. We also ingest data in the billions range, and our costs are a lot lower than what you describe.
* On the public endpoint, can you move the first point of contact closer to the edge? You want to block requests before they even reach your ingestion pipeline. If not, add a few geographical Load Balancers with the sole job of accepting requests before forwarding them to your Lambda. Divert traffic at the DNS level using a geo routing policy on Route53.
* Are you also hitting DynamoDB for every request as part of your spam system? I found it quite expensive on-demand, unless you provision capacity. We used to pay tens of thousands per month for DynamoDB when a single Redis instance would have done the job. Especially if it's temporary, non-critical data such as spam detection.
* Do you batch the incoming events before triggering SQS?
* How are your networking costs going? They tend to creep up in the AWS bills too.
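On the batching point: a minimal sketch of grouping events before they hit SQS, assuming boto3 and a queue URL (both hypothetical here, not from your setup). SendMessageBatch caps at 10 entries per call, so you pay for one API request instead of ten:

```python
import json

def chunk_events(events, batch_size=10):
    """Split events into SQS-sized batches (SendMessageBatch caps at 10 entries)."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

def to_sqs_entries(batch):
    """Build SendMessageBatch entries; Id must be unique within one batch."""
    return [{"Id": str(i), "MessageBody": json.dumps(e)} for i, e in enumerate(batch)]

# Usage (needs AWS credentials and a real queue URL):
# import boto3
# sqs = boto3.client("sqs")
# for batch in chunk_events(pending_events):
#     sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=to_sqs_entries(batch))
```

Even this naive chunking cuts your per-request SQS cost by up to 10x.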
It just sounds to me like you need something in between the public endpoint and those Lambdas, but maybe I'm missing some info. Otherwise you're going to keep paying for capacity that a "traditional" server could handle without costing you more for each request.
If you can't fight back and don't want real requests to be lost, put a server at the battle-front and buffer all raw HTTP requests into a queue built for this (like Kinesis). You can then process the events and recover the data at your own pace. I did this in the past and could handle 60k+ req/s with 2 instances on AWS. I used a simple Go tool to capture the traffic: https://github.com/buger/goreplay
Also, at my full-time job we use Kinesis to buffer all analytics events, it's cheaper than SQS and handles billions of data points per month. This also keeps the ingest rate constant, but you seem to already do that with your workers.
Full disclosure: I'm Anthony from http://panelbear.com/ , and I just want to offer honest help.
It's not just a theoretical case, I saw this for many customers with Panelbear. Many JS frameworks use the history API in ways you may not be considering.
That's why I would recommend de-duplicating events on the client-side before sending them to your ingestion pipeline.
But that's aside from the DDoS matter.
How do they even know it was "a lonely nerd"? In fact, it's much more likely that it was carried out by a well-socialized team of shady professionals, on a commission from competitors.
It's 2020, people, can we drop it with the "Hack3rs" stereotypes...?
Well, stereotype accuracy is one of the most robust and replicable findings from psychology research...
I'm just amazed at the amount of money and time people are willing to give AWS so that they don't have to manage dedicated servers themselves. I mean, paying $36k/year just to have someone manage firewall rules is plain laughable... not to mention this is happening on "Hacker News".
> They're competitors of ours and it's just not a good fit.
I sincerely hope that AWS doesn't start an analytics product and become a "competitor", because it sounds like you'd have to rewrite all your software outside of AWS, and it sure seems extremely locked in...
The reality is that AWS has terabits of bandwidth and can mitigate these attacks upstream from your servers, which have bandwidth measured in gbps, not tbps.
So, unless you have a global network the size and scale of Amazon or Cloudflare, no, you can't just mitigate these attacks with a few firewall rules.
I'm not very well versed in DDoS, but that's just traffic generated from thousands of bots. It still takes bandwidth and will literally overload your link.
I assume AWS has multiple upstream links, so the traffic arrives spread across them and it's easier for them to handle than for someone with one or a few upstream links.
I'm currently still on your free tier, but would be happy to pay when I need to.
Just wanted to say, thank you!
puppeteer with setRequestInterception configured to block responses in the right way maybe https://github.com/puppeteer/puppeteer/blob/v5.5.0/docs/api.....
I can imagine nightmares along those lines: js client logic that works fine in a normal network environment could be persuaded to continually reissue failed requests when browser requests/responses are blocked in just the wrong way. That could cause the same client-side IP to appear in the logs over and over again.
And then the IP variability could be the customer spinning up browser bots in various regions around the world to measure their site's region-specific latency.
Just offering up an alt theory, as I'm with some of the others: not sure the graphs shown immediately scream DoS to me.
It boggles my mind that anyone thinks Fathom is valuable enough to target with such a persistent series of attacks.
Cloudflare is dominating the "consumer-grade" market, but not really past that.
https://www.cloudflare.com/case-studies/wikimedia-foundation is an example deployment covering Wikimedia's datacenters.
I did a quick check for usefathom.com though, the only somewhat interesting keyword would be "google analytics alternative", but fathom ranks on page 2, which imo would not really justify an attack. Maybe there is some other long-tail keyword with a high enough value though where usefathom.com pushed small niche players out of the top positions.
Also, a DDoS is obviously bad for the user experience of paying customers; nobody wants their analytics service to be down or have missing data because of downtime, etc.
I'd be almost certain that AWS' staff looked at the logs to determine this, they would've sorted it in 20 minutes by seeing a load of requests coming from a specific host.
> privacy-first analytics solution [...] The only downside of this is that we need to keep access logs (IP & User-Agent, no browsing history) for 24 hours
Keeping IPs doesn't make you super privacy friendly.
* We're getting hit with a huge DDoS attack, repeatedly over 3 weeks, with no sign of stopping
* With zero access logs, there was no way to find patterns in the attack, and we had no way to block it
* Our service was going offline during these attacks
* We introduced access logs that are auto-deleted after 24 hours. We redacted all information about the site/page/activity etc. but keep IP & User Agent for pattern matching
* We were then able to identify a pattern and block the attack on Saturday
* Without access logs (even redacted ones), this wasn't possible
I was hoping a more senior engineer on Hacker News would comment and I can't wait to hear how you'd do it. I have no experience in DDoS protection at all, and this seems like the only possible way. Even rate limiting requires storing IP addresses. But if you know a more privacy-focused way to block these attacks, I'm sure I'll buy you a few beers when we hang out.
I've read the full blog post, and I'm not convinced it's a DDoS attack. Traffic patterns for web analytics will come from all over the place and will look like a DDoS when it's not. For example, a customer misplacing their analytics in a JS loop on a moderate-traffic blog will generate billions of requests from all over the globe. Event tracking alone can generate billions of requests as well. Never attribute to malice that which is adequately explained by simpler means.
> * With zero access logs, there was no way to find patterns in the attack, and we had no way to block it
It's a waste of time and energy. Your system should be able to handle these billions of requests.
> * We were then able to identify a pattern and block the attack on Saturday
Was it specific accounts?
> * Without access logs (even redacted ones), this wasn't possible
You can one-way hash the IP. So you can still look for patterns, but you've lost the actual IP. And the same one-way-hashed IP can be used to block whichever IP seems devious in your firewall (md5 can be enough).
> But if you know a more privacy-focused way to block these attacks, I'm sure I'll buy you a few beers when we hang out.
Haha, no need to. But come say hi if you're ever in Austin. julien _at_ serpapi.com.
The plaintext space (the number of possible IPs, even more so for IP blocks) is so small that you can try all the possible plaintexts within seconds, essentially reversing the hash.
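To illustrate how trivial that reversal is, here's a sketch that brute-forces an unsalted md5'd IPv4 address back to the plaintext. For speed I assume the attacker knows or guesses the first two octets (a made-up `203.0.` prefix here), which shrinks the search to 65k candidates, but the full IPv4 space is only ~4.3 billion values, enumerable in minutes on one machine:

```python
import hashlib

def md5_ip(ip: str) -> str:
    """The naive 'anonymization' being discussed: an unsalted md5 of the IP."""
    return hashlib.md5(ip.encode()).hexdigest()

def reverse_hash(target_digest: str, prefix: str = "203.0."):
    """Recover the IP behind a digest by enumerating the last two octets."""
    for c in range(256):
        for d in range(256):
            candidate = f"{prefix}{c}.{d}"
            if md5_ip(candidate) == target_digest:
                return candidate
    return None
```

So an unsalted hash is really just an obfuscated copy of the IP, not an anonymized one.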
By using a Bloom filter you avoid storing IPs and save tons of space. They're basically required for rate limiting in IPv6, where the address space is so large an attacker can run you out of RAM just by using so many different IPs.
It's also pretty common to have several thresholds of blocking based on individual IPs vs. subnets. So if more than 2 IPs get blocked in a small subnet, you block the whole subnet.
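A sketch of that escalation logic (the threshold of 2 per /24 and the class name are arbitrary choices here, not from any particular system):

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

class SubnetBlocker:
    """Blocks individual IPs; escalates to the whole /24 once a threshold is hit."""

    def __init__(self, per_subnet_threshold: int = 2):
        self.threshold = per_subnet_threshold
        self.blocked_ips = set()
        self.blocked_subnets = set()
        self._per_subnet = defaultdict(set)  # subnet -> set of blocked IPs in it

    def block(self, ip: str) -> None:
        self.blocked_ips.add(ip)
        subnet = ip_network(f"{ip}/24", strict=False)
        self._per_subnet[subnet].add(ip)
        if len(self._per_subnet[subnet]) > self.threshold:
            self.blocked_subnets.add(subnet)

    def is_blocked(self, ip: str) -> bool:
        if ip in self.blocked_ips:
            return True
        return any(ip_address(ip) in net for net in self.blocked_subnets)
```

In production you'd push the resulting subnet list down into the firewall rather than check it in application code.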
Could you salt the hash with something far less predictable that gets discarded when the next one is generated? Assuming you only kept logs for a day, for example, you could regenerate it daily and store the salt somewhere only the most senior of senior techs can access (if anyone at all).
I like the idea of an MD5 hash, although I'm not too certain why logging an IP for 24 hours would be a bad thing. From a privacy-law perspective, the MD5 hash is still considered PII. And if we see an IP address in an access log, we know that it visited one of the tens of thousands of websites Fathom runs on, but we don't know which one.
Edit: Something else: with MD5 there's no way of finding patterns in the IPs, so you'd have to play whack-a-mole. Whereas raw IPs allow no real privacy invasion whilst still allowing pattern detection.
Layer 7 just means they're making a ton of HTTP requests from a lot of IPs; however, that's the nature of web analytics.
> we know that an IP visited one of the tens of thousands of websites Fathom runs on, but we don't know which one.
How can the attackers know which websites have your analytics on? Crawling the web is super hard.
You should be able to get back which accounts are affected from your analytics call:
Exactly! Which is why it was so hard. There's no path pattern. Everything hits "/". So the only way to fight back is to match IP / header patterns (but even then, we have to redact sensitive headers).
> How can the attackers know which websites have your analytics on? Crawling the web is super hard.
The attacker went after some of our more high profile customers. They're known via testimonials or from Twitter.
> The access log should store this `sid` allowing you to just block the account. (or ask them for ton of money if it's a legitimate use :) )
We could certainly temporarily block traffic to a site. The problem is, without some kind of firewall (e.g. WAF), our application has to absorb so much traffic, and that's the issue. We need to block it at the edge.
I think you need some high performance ingestion code.
One day somebody is going to load test or misconfigure their site. Or you'll get a really big client. Or somebody will get Reddit frontpaged or slashdotted. Blocking huge volumes of traffic isn't always going to be an option. You should be able to handle it.
I would get off all this "sexy" stuff like Lambda and SQS and build some old-school battle boxes at a co-lo. Run some extreme-performance framework like Actix or Vert.x built to handle giant piles of traffic. A single box with those frameworks and a 40-gig line can handle millions of requests a second.
Use counting Bloom filters synced between boxes for IP blocking. This avoids saving IPs and is needed to prevent RAM-exhaustion attacks anyway. Use two layers: one for individual IPs and a second for blocking subnets with lots of bad actors in them. Block for ~24h by using dual sets of Bloom filters in a sliding window with a 12h overlap, where IPs get added to both. When a filter is 24h old, discard it. This needs to be done because you can't remove items from Bloom filters; they saturate over time.
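The sliding-window scheme above can be sketched like this (a deliberately tiny plain Bloom filter, not a counting one, with made-up sizing; rotation would be driven by a 12h timer in practice):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per item, no removals,
    small false-positive rate, never stores the item itself."""

    def __init__(self, size_bits: int = 1 << 20, k: int = 4):
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

class SlidingBlocklist:
    """Two overlapping filters; blocks go into both, and every rotation
    (e.g. each 12h) drops the older filter. An IP blocked just before a
    rotation therefore still stays blocked for a full extra window."""

    def __init__(self):
        self.current, self.previous = BloomFilter(), BloomFilter()

    def block(self, ip: str) -> None:
        self.current.add(ip)
        self.previous.add(ip)

    def is_blocked(self, ip: str) -> bool:
        return ip in self.current or ip in self.previous

    def rotate(self) -> None:
        # The filter that has lived through two rotations (~24h) is discarded,
        # which is how entries "expire" despite Bloom filters having no delete.
        self.previous, self.current = self.current, BloomFilter()
```

Each rotation throws away saturated bits, so memory stays bounded no matter how many IPs the attacker burns through.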
On these super boxes you can do filtering, blocking, batching, and dedup. Then send the data off to wherever for further processing.
You mentioned not wanting to use Cloudflare because they're a competitor. That's fine but I would take a look at their blog. They go over the tech they use for mass data ingestion and filtering, and it's basically what I just described.
Potentially, but then customers don’t see a ton of spam on their dashboard. Just a lot of repeated requests.
They don’t come in random waves either. Valid traffic is pretty well spaced out.
> Your system should be able to handle these billions of requests
Potentially, but you really don't want to pay for them. The only real way to go about stopping it is blocking offending IPs.
I wouldn't worry about short-term IP caching. AWS's upstream load balancers and your own servers are probably doing it anyway to maintain TCP state tables (the Linux kernel's "conntrack"). If you don't want to cache IPs, you can use a probabilistic data structure like a Bloom filter synced between instances. It has a small false-positive rate but is very fast and doesn't store the whole IP. Bloom-filter-based IP filtering is used in every big DDoS prevention system I know of.
As much as you like lambda, I would ditch it. And the queues. My general advice is that any time you need to add a work queue to something, it's not fast enough.
Your analytics endpoint data ingestion should be something lightning fast like Go, Rust, or an async Java back end. Analytics is a lossy process, you lose traces because of browser behavior and plugins all the time anyways so I wouldn't prioritize 100% accuracy.
I would focus on power per dollar over reliability. If I were you, my ingest boxes would be load-balanced with DNS round robin and sitting at various colocation providers. Get a fat 40-gig unlimited data pipe. Build some stupid-fast Rust/Go/Java backend that can saturate that pipe, and do all your filtering/spam analysis there.
I don't think Lambda, SQS queues, and PHP are the best technologies for this kind of mass data ingestion. I don't even think your ingestion layer should be on AWS. I would follow the lead of other companies doing mass data ingestion and build your own machines. That's how Cloudflare, Netflix, etc. are able to handle so much traffic without going bankrupt.
I would consider yourself lucky that your first DDoS was so small. 10k requests/sec is tiny; ~400k/sec can be generated on a regular desktop with a fiber internet connection. Right now, a single user could knock you offline by messing around with JMeter. I think it's a wake-up call that you're in the infrastructure business whether you like it or not, and you need to massively beef up your data ingestion layer. Realistically, you should be able to handle a ~50-gigabit attack with 10 million requests/sec. I think that's achievable with a couple of boxes colocated on 40-gig lines running fast software.
I've had a similar DDoS attack which managed to bypass the Cloudflare protection, and our app handled it with $0 increase in our bill.
Please be respectful of OP; he put a few weeks into working on the issue. It's not fair to tell him that you think everything he says is wrong just because you don't believe him.
* I found AWS Shield Advanced organically
* It was a DDoS attack
* I am incredibly happy with the personalized service. It’s like hiring someone to handle it, except you have people on call 24x7
This is what confuses me about a lot of these replies saying you're being swindled for paying for this "absurd service" when you should just "own your infra yourself and hire people to manage this".
Do they think owning all this yourself and hiring specialized DDoS people is gonna cost less than $36k/yr? Because I don't think it will.
A few thoughts:
1. Build a “live status” dashboard for your processing system so you can see the profile on the IPs or customers being affected, so you have more visibility into the issue when it happens
2. Can you just put the system behind cloudfront and process CF log files instead of having to invoke lambdas?
3. Identify the affected customer and put their lambda results in a separate queue, to help limit blast radius if it happens again from other customers
4. Double check there wasn’t throttling because downstream services were throttling lambda (you mentioned SQS?)
5. Make your lambda even faster (if it’s 100ms make it 10ms or less); can you move your logic out of lambda? Can you instead of inserting into SQS log to cloudwatch and then read its data? Or ensure no cross region lambda <> SQS invocations, etc
6. Consider replacing lambda (that cannot batch process invocations) with app servers that can absorb and then batch send many requests to SQS (let them absorb and coalesce a large amount of requests into aggregate SQS messages at the edge); going a step further eg even uploading aggregated data files to S3 for later processing then put a single message on the SQS queue pointing to that file
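Point 6 above can be sketched roughly like this: an in-memory aggregator that coalesces many incoming requests into one object, flushes on a count or age threshold, and lets you enqueue a single pointer message instead of thousands of individual ones. All names and thresholds here are illustrative, not from Fathom's stack:

```python
import json
import time
import uuid

class EventAggregator:
    """Buffers events in memory; when a count or age threshold is hit,
    returns one (s3_key, ndjson_payload) pair ready for upload."""

    def __init__(self, max_events: int = 5000, max_age_s: float = 30.0):
        self.max_events, self.max_age_s = max_events, max_age_s
        self.buffer = []
        self.started = time.monotonic()

    def add(self, event: dict):
        """Returns a flush result when a threshold trips, else None."""
        self.buffer.append(event)
        full = len(self.buffer) >= self.max_events
        stale = time.monotonic() - self.started >= self.max_age_s
        return self.flush() if (full or stale) else None

    def flush(self):
        key = f"ingest/{uuid.uuid4()}.ndjson"
        payload = "\n".join(json.dumps(e) for e in self.buffer)
        self.buffer, self.started = [], time.monotonic()
        return key, payload

# Usage (AWS calls omitted; bucket/queue names are hypothetical):
# s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
# sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"s3_key": key}))
```

One SQS message per aggregate file instead of one per pageview is where most of the queue cost disappears.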
Hmm, not the most economical of choices, but...
> We had decided that we were going to simply increase our lambda concurrency limit to 8,000,000 requests per second (800,000 concurrents) and handle the spam attacks
... wait, what? If you're using Lambda functions for something as trivial as logging page views, you evidently have more money than sense.
The author is concerned about this story reading like an advert for AWS Shield, but to me it reads like a case study in when "Serverless" is a bad idea.
I'd be happy to help make this more efficient though (at no charge of course). Offhand I'd say "web server which accumulates data and uploads it to S3 every N requests or M seconds" would probably get you what you need at a tiny fraction of the cost of "lambda which posts to SQS". Create an AMI and toss it at an autoscaling group and you really won't need to worry about scaling issues either.
For us, the cost works and we have appropriate margin for it. The cost savings aren't worth the extra "we have to monitor these servers" thoughts. Our approach is 100% emotional.
The offer is open if you ever change your mind (e.g. if the cost becomes annoyingly large when you scale up further).
You might want to take them up on the offer, you don't get an opportunity like that everyday.
But seriously, it's clear making this more efficient isn't a priority for them, and I completely understand and respect that. I ignore plenty of good advice for the same reason: "Working on more important stuff in Tarsnap right now".
Fathom has a $34/month plan which handles up to 400k pageviews per month. 400k Lambda invocations is 16 cents, if they're doing it efficiently. Transfer pricing seems negligible also at only ~2KB a request.
I'm pleasantly surprised, actually. But of course, when someone throws millions of improper requests at you every second... I guess that's when the game changes ;-)
But if you're regularly rebooting servers because they stop responding, something has gone very wrong.
You have done what only a few small shops would or could do, and this knowledge will be a valuable asset. You can expect a lot of requests for help/information based on your write-up. Prepare some canned answers and a cool presentation you can whip out when needed.
What would be an advisable approach for a start up with limited resource in a situation like this?
I am the co-founder of a SaaS in the higher-ed-tech sector (universities, 300+ of them). Going to make sure we implement support for Fathom (in addition to GA, GTM, Matomo) and reach out to our customers. Hope it helps a bit. Our users would also benefit from stuffing less data to G.
While some attackers have access to compromised home routers and IoT devices, most script kiddies do not, so they use Tor (because they don't want to get caught).
I think they need to take notes.
Btw, two servers plus a bit of architecture simplification would solve this and cut the costs down to ~$400 or less.
And in all honesty it's a lot easier to take a site offline through depletion of budget than trying to exhaust an infinitely-scaling service.
That said, it’d be a no brainer at enterprise level.