How I attacked myself using Google and I ramped up a $1000 bandwidth bill (behind-the-enemy-lines.com)
775 points by Panos 1682 days ago | 142 comments



+1 For Amazon for kindly reimbursing the overage charge.

-1 For Google for creating what is the biggest threat to content providers by enabling easy-to-use DDOS attacks across the entire interwebs.

</hyperbole>

Seriously, is this what we have to look forward to when Google Spreadsheets, and God knows what else, become ever-more popular?

Think about all the additional onerous costs that would be incurred by content providers as more and more Google Spreadsheet users hotlink images, MP3s, videos...

This has to be a bad design decision by Google; there's no need to re-download assets by-the-hour, on-the-hour, regardless of whether the user's spreadsheet is open or not.

Is it time to go back to the days of putting your web assets behind $HTTP_REFERER?
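For anyone who never lived through the $HTTP_REFERER days, here's a rough sketch of what that gating looks like (Flask, the /assets route, and the ALLOWED_HOSTS set are just illustrative assumptions, not anything from the article):

    # Referer-based hotlink protection: serve assets only when the request was
    # referred by one of our own pages. Empty Referers are allowed so direct
    # visits and privacy proxies still work.
    from urllib.parse import urlparse

    from flask import Flask, abort, request, send_from_directory

    app = Flask(__name__)
    ALLOWED_HOSTS = {"example.com", "www.example.com"}  # hypothetical: your own page hosts

    @app.route("/assets/<path:filename>")
    def assets(filename):
        referer = request.headers.get("Referer", "")
        host = urlparse(referer).hostname
        if referer and host not in ALLOWED_HOSTS:
            abort(403)
        return send_from_directory("static", filename)

Of course, a determined fetcher can just spoof the Referer header; this only stops casual hotlinking.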


It's not a DDOS because it's not distributed and there was no denial of service. And it doesn't work "across the entire interwebs" because the Google bots are rate-limited against most websites. There are some whitelisted sites, like S3, that are not rate-limited. S3 did not go down in this "attack". In fact the app didn't even go down. So the decision not to rate-limit against S3 was sound, since there was no DDOS.


Ironically, it might be DoS in the sense that it drives app maintenance costs so high that the owner could decide to, naturally, deny the service (shut down).

Otherwise, maybe it's the new type of flood attack, cost-of-service (CoS), applicable against those who use ‘invincible’ cloud infrastructure such as Amazon's.


Is that a term? "Cost of service attack" ? If not, let's coin it :)

Cost-of-service-attack: Consuming bandwidth or other resources in cloud based solutions to drive up the cost of running the service.

Very easy when the cost is so tightly coupled with the resource use...


I believe this kind of attack is already referred to as EDoS (Economic Denial of Service/Sustainability) - basically killing off a service through bankruptcy.

This is why most cloud services have a customisable limit on how many virtual servers can be booted automatically; a solution has yet to be found for protecting S3.


Thanks..

"[snip]..one might envision that instead of worrying about a lack of resources, the elasticity of the cloud could actually provide a surplus of compute, network and storage utility that could be just as bad as a deficit"

http://rationalsecurity.typepad.com/blog/2009/01/a-couple-of...


OK, so now we have DoS and CoS!

I dunno though, it's all technically bandwidth flood in the end. The consequences (or goals) of flood may include immediate or eventual denial of service, high cost of service, various security attacks, or probably all at once—further classification surely is complicated.

Meanwhile, both terms DoS and CoS are a bit vague, too. Say, turning off a server or forcing providers to alter their DNS records more or less qualifies as denial of service attack (which, by Wikipedia's definition, is “an attempt to make a computer or network resource unavailable to its intended users”).


I would argue that "DoS" is not vague, just a broad term.

Anyway... You're right, it boils down to bandwidth flood (or maybe CPU) in the end.

Maybe CoS is not really a technical term, just made more relevant by the technological changes. OP would have been charged much less had he not been using "the cloud".

Let's say I have a smallish competitor hosting their app on EC2. Using classical DoS techniques, how much can I ramp up their bill before they notice? Or before they can actually stop it? What if I run it on a smaller scale, "just" doubling their bandwidth bill... Will they notice? Maybe they'll believe it's extra users ;)


Maybe Amazon should add smart log mining?

If it requires a lot of effort to productize, Amazon could potentially offer it as a service at $0.50/hr to ensure you are not making major goof-ups.


Are you saying Google isn't distributed?

In any case, it may not be a denial-of-service, but if you can find someone with an S3 bucket with large files you could maliciously cause them to rack up a huge bandwidth bill using this mechanism. I guess you could say it's a DDOM (Distributed Denial of Money).


lol I'll take DDOS any day over DDOM!!


Come on, really? This is clearly an edge case they simply didn't see and will probably solve once the right people know about it. The sky may be falling, but it's not because of Google Spreadsheets...


As the author notes, I don't think the problem was with Google. It was no fault of theirs. The reason: it's a fine line between maintaining privacy policies and managing such events. If Google were storing/caching these links, there would have been an outcry from those worried about user privacy and stuff.

About the by-the-hour downloads, well there again is a trade-off between providing data quickly and doing a lazy evaluation.

I think, as the author notes, it was just an unfortunate event that was a consequence of good design decisions gone bad circumstantially and wreaking havoc for the author.

It was certainly nice of Amazon to have made that refund. The resources (read bandwidth) were used after all.


Google are presumably storing/caching these photos for an hour before expiring and redownloading them.

I don't see why Google doesn't re-download them using If-Modified-Since, reusing the originally downloaded photos when the origin server says they haven't changed.
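A sketch of what that conditional re-download could look like on the client side (using Python's requests library; the URL is a placeholder, and this is obviously not Feedfetcher's actual code):

    import requests

    url = "https://example-bucket.s3.amazonaws.com/photo.png"

    # First fetch: keep the body plus the validators S3 returns.
    first = requests.get(url)
    cached_body = first.content
    validators = {}
    if first.headers.get("Last-Modified"):
        validators["If-Modified-Since"] = first.headers["Last-Modified"]
    if first.headers.get("ETag"):
        validators["If-None-Match"] = first.headers["ETag"]

    # Hourly refresh: revalidate instead of re-downloading 250 GB of images.
    refresh = requests.get(url, headers=validators)
    if refresh.status_code == 304:
        body = cached_body          # unchanged; almost no bandwidth used
    else:
        body = refresh.content      # changed; accept the new bytes

Since S3 sends both Last-Modified and ETag, a 304 response costs a few hundred bytes instead of the full object.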


I don't understand the privacy concern - a publicly accessible URL doesn't offer any privacy. It's the same as a transparent proxy.


A password works because there is a piece of text that only you know; when you give that password to a server, the server knows who you are.

Can't you also view a URL as a password? (If only I know the URL, then only I can download the file.)

I am able to give out a URL to someone else so they can access the file; likewise, I can give out my file server's username and password, and whoever has it can also access my files.


You have a point, but the difference is that URLs are not usually considered secret and are therefore not treated the same way a password would be. For example, in the browser...


Google is already storing urls to those files.


I think it is primarily a legal concern. Legally, caching public content is not always an acceptable (or at least clear-cut) practice. I concur in thinking that's a little silly, but enough people disagree for it to be an issue that Google has to take into account.

There have been similar complaints about attempts to aggregate the contents of Hacker News posts, for example.


> If Google were storing/caching these links, there would have been an outcry from those worried about user privacy and stuff.

Where would we be without Google taking such a noble and strongly principled stance on user privacy, applied equally and consistently, striving to avoid legal responsibility, even when it costs unrelated parties thousands of dollars.


> -1 For Google for creating what is the biggest threat to content providers by enabling easy-to-use DDOS attacks across the entire interwebs.

Wanna help make the web better? Stop using Google and use a competitor like DuckDuckGo.com. Google has too much power, too much control.


Reminds me of the email cannon we built on Gmail, not exactly on purpose. We needed a full Gmail account to do some testing on our company's email backup system. At the time that meant 8GB of email. And you couldn't just send a bunch of huge attachments, as it had to be like regular email. So we took a gmail account and signed it up for dozens of active linux related email lists (because they were easy to find).

The result was an email account that got an email message every second or so in a variety of languages.

We quickly realized that you could have some fun by forwarding that address to someone else's email account.

Fast forward a few months, and Gmail smartly requires a confirmation before allowing you to forward.


Once in middle school in the late '90s I created an email loop between two free email accounts that forwarded an extra copy of the email every iteration. It was an experiment out of curiosity and mischief. I don't think the server actually went down when I started the loop (at least, I can't remember if it did), since the mailboxes would have filled up and messages would simply fail to be delivered.

I think I was hoping to produce an "email cannon" similar to yours - an email spammer to be directed at whoever I liked.


Ah yes, the days of computer labs in high school and figuring out that Outlook lets you copy and paste emails in the Outbox. Just get a big enough collection and keep copying and pasting. Easy recipe for getting hundreds to thousands of emails.


It's really awesome that Amazon was reasonable and refunded the charges because they were accidental. I mean, technically it was still your fault, so it would have been easy for them to be jerks about it.


They wouldn't be jerks if they asked for the charges; you still generated the traffic and they had to pay for it.


Just as "legal" does not mean "ethical", "contractually permitted" does not mean "not a jerk move".


Just like asking you to pay for the services you used is not a jerk move.

Amazon did a nice thing, but the services were used. Asking the user to pay for those services (even if it was a mistake) would not have been a "jerk move".


Amazon was genuinely nice in this case. I had no expectation that they would refund the charges. It was a self-inflicted wound and Amazon had no obligation to pay for my own stupidity.


Bandwidth pricing is a funny thing, in that it's only metered because it's convenient to do so. You haven't consumed any kind of finite resource by moving 1GB or 1000GB. There isn't any "use".

And it makes good business sense besides. They can let the guy off, eat the (probably less than a hundred dollars of) bandwidth this guy actually cost them with their upstream providers, and get a good write-up and look better as a result...

...or kill his account, take him to collections, etc., etc. (which would probably cost them more than $1K anyway), and lose a customer and get a PR black eye while they're at it.

So yeah, it would be a jerk move, and a pretty dumb one at that.


In the end, we are talking about a finite resource, because the Internet can only handle so much data transfer at one time. One way or another, your packets have to travel over physical infrastructure which, just like your home network, has a maximum capacity. Use of this infrastructure costs money; charging by bandwidth should help prevent users from "clogging the tubes" willy-nilly because there is a cost associated with excessive use.


Except most providers don't charge by time and demand (which would actually make sense), but by a fixed cap or fixed cost per byte. If the problem is congestion as you suggest, surely it would make more sense to charge different amounts based on time of day, length of session, etc?

Which would give Amazon more trouble? 1,000 TB spread out over a month, or 1,000 TB spread out over a day? Their current pricing model assumes both of these are equal, which they are plainly not.

Aside: Caps make the same mistake. If you have a 250GB cap, the ahem ISP charges the same whether you burn through that cap in a month or in a day.


It is more complicated than that. Customer convenience must be considered. For instance, I and many others like me will not buy data transfer from someone when figuring the cost requires me to account for the time of day and phase of the moon.


Customers already deal with such billing from other companies and it's not difficult.

Ever had a cell phone? During the day counts against an allotment, after a certain time at night, calls are free. For a home internet connection it would probably be reversed, as peak usage is right after work lets out.

Some power companies also do the same thing.


You've consumed energy.

But in any case, even if in theory nothing is used, are you sure that Amazon doesn't have to pay for it too? Because then it's irrelevant whether it was really used or not.


Not $1,000 worth in either case.


Are you taking into account the time spent by the tech person(s) to read his email and process the rebates? People's time is expensive, particularly if some manager needs to approve it.

Maybe it wouldn't be $1000, but it'd still be significantly more than $0.


We're not counting rebates yet - we're saying "This guy ran X amount of bandwidth through our system, which we normally charge Y for." How much did Amazon pay their upstream for that bandwidth and for the billing (which is all completely automated)?


That's my point: if you want them to charge less, the billing for this case is not automated (and is thus much more expensive).


I'd be flabbergasted to find out their billing system isn't at least as simple as their AWS console. 5 minutes of CSR work tops.

Would probably even be easier to completely drop a bill as opposed to modifying it to charge a lesser amount.


I'd expect Amazon and Google to peer directly within US territory, one being a major byte-generator and another one being the largest byte-sucker.

So this traffic must be almost as cheap as Amazon's own intra-territory exchange. That's probably why they were being so nice. They are probably even somewhat happy to generate some more traffic directly for Google, as the ratio is usually leverage for the network guys.

But, surely, it's still very cool of them to drop the customer's bill.

I wonder, though, how one was ever to know that Google stripped their Feedfetcher of even the basics of "intellect" and allowed one single spreadsheet to generate 250 incoming gigs hourly while not even being open.

So, after all, it's Amazon's and Google's poor designs that hammered the guy, and common business sense for Amazon to let it go for the customer and just study the case together with Google.


Nor does failing "to be really nice" equal "jerk move".

Offering the refund was really nice. Asking to pay for the services would not have been a jerk move, because Amazon wasn't responsible for the extra costs either. I usually root for the little guy, but in this case, if anyone else might be expected to foot the bill, it should be Google, for building a system that needlessly slurps 250GB/h without any apparent capping or rate-limiting, caused by a single spreadsheet. Just because Amazon S3 is huge and probably doesn't feel a thing doesn't mean it's a good idea; what if it were 500 spreadsheets doing this?

What if that FeederBot were to gorge itself on one of Google's own CDNs? Sure, it's probably just a drop in the ocean for Google as well, but I bet they'd rather not experience this "minor inefficiency".


It's possible Google and Amazon have a peering arrangement so they may have paid little to nothing for the traffic.


But why re-download every hour?

Does merely having the spreadsheet passively open in a browser trigger that, or was some other process re-loading the spreadsheet every hour? (If the former, I wouldn't be as forgiving of Google. I understand the desire not to cache possibly-private data, but proper URL design and conditional GETs should be able to prevent the entire download on an automatic hourly schedule. And even if the latter – the author had chosen to reload the spreadsheet each hour – I'd want Google's design to allow browser-side caching to work for such embedded and/or generated images.)


About an hour is the standard time for external data in Google Spreadsheet to be refreshed. I've come across this with JSON data.


Even if no one has it open in a browser? Or they do have it open, but haven't interacted with it?

In either case, this seems odd, unless the URL is especially noted as 'volatile', and/or there are other parts of the spreadsheet that might trigger conditional calculations/notifications based on that URL's contents. (And don't S3 resources have last-modified-dates or etags for conditional GETs?)


I don't remember, but it wouldn't surprise me; that way it updates correctly for offline mode, and if you have conditional logic it can work on changed data. It also improves page load time, which is very important for Google in general but doubly so for an online office suite (Excel opens quickly, so should a Google spreadsheet).


There must be something else to it, because if the spreadsheet is sitting passively open there would be no need for Google to go and pre-fetch again.


I've seen this behavior with Google gadgets as well, even with no user interaction, or without even having the page open.


This really underscores Amazon's glaring omission of a billing cutoff on Amazon Web Services. How hard would it be for them to let me, say, cut off my services at $100/month?

This is the main reason I'd never use AWS to host anything public.


Or perhaps an email saying "We're past the limit you set, approve the overage in the next X hours by clicking this link". That way you don't cut off accidentally during a spike you want, but know about it.


End of the article: "PS: Amazon was nice enough to refund the bandwidth charges, as they considered this activity accidental and not intentional. Thanks TK!"


Would they have refunded him if someone did this to him maliciously, as he points out is possible at the end of his post?


That might also qualify as an "accident".


We are already testing a billing charges monitor that works through CloudWatch. Here's some more info:

http://blog.bitnami.org/2011/12/monitor-your-estimated-aws-c...


Looks great! Any chance that this will become available to accounts without Premium support in the future?


It's not that simple.

First, the problem isn't inherent to virtual hosting services, you could just as easily get hit by this on a bare-metal site, though the interplay of S3 and Google Docs is an added dimension.

Cutting off all services opens the door for a class of DoS attacks. Simply direct enough traffic at a single account's assets, and you'll knock them offline for a given billing cycle. If the attack is cheap to launch (botnet, URL referral network, etc.) it's a cheap attack. Different entities would have different cut-off and degradation policies.

Better would be to identify the parameters of a specific anomalous traffic pattern, but this can be hard.

A more general solution is to set asset (server side) and client (remote side) caps in tiers. You'd want generous (but not unlimited) rates for legitimate crawlers, your own infrastructure, and major clients. The rest of the Net generally gets a lower service level. Such rules are not trivial to set up, and assistance through AWS or other cloud hosting providers would be very useful.
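To make the "caps in tiers" idea concrete, here's a minimal token-bucket sketch; the tier names and rates are made-up numbers, not anything AWS offers:

    import time

    class TokenBucket:
        """Refill tokens (bytes) continuously; serve a response only if the
        bucket still holds enough tokens to cover it."""
        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate = rate_bytes_per_s
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def allow(self, nbytes):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True
            return False

    # Generous for known crawlers and your own infrastructure, tight for the rest.
    tiers = {
        "trusted": TokenBucket(rate_bytes_per_s=50_000_000, burst_bytes=500_000_000),
        "default": TokenBucket(rate_bytes_per_s=1_000_000, burst_bytes=10_000_000),
    }

    def may_serve(client_tier, response_size):
        bucket = tiers.get(client_tier, tiers["default"])
        return bucket.allow(response_size)  # False -> respond 429/509 instead of the bytes

The hard part, as noted, isn't the bucket; it's deciding which clients land in which tier.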


Cutting off all services opens the door for a class of DoS attacks. Simply direct enough traffic at a single account's assets, and you'll knock them offline for a given billing cycle.

I don't see this as a real issue (the potential DoS is there, but it's not a real problem with having a bandwidth-metric kill switch). I'd assume this would be a configurable setting and would have to be enabled by choice. In the author's case I'm sure he'd prefer to have his service cut off prior to running up a $1k bill. If one of my test instances started bleeding bandwidth I'd prefer for it to just get killed rather than rack up an ungainly bill. If it's your production service, then don't configure a cutoff.


Or if you have the connections, contact one of the Vetted and get the attackers at the origin. Well that's according to http://www.wired.com/politics/security/magazine/15-09/ff_est...


> you'll knock them offline for a given billing cycle

Er... Because caps can never be raised mid-cycle?


Sure, they could be, but it's still a pretty blunt alerting system.

Depending on the organization, though, you'd have to get purchase approval on the overage, etc. That will depend on specifics. Makes for sticky questions to answer there as well, which is another argument for putting better management tools at the cloud level.


You're making this way more complicated than it needs to be. It's not an alerting system, it's a knob, like a thermostat. I turn the knob where I want it, the system follows the "run until this limit is hit" instruction. If I want to change my mind later, even after the limit is hit and the system shuts down, I can make that decision at that time. I don't want the system or anybody else to make it for me.

As for the "purchase approval" nonsense -- if the knob isn't right for your company, then don't use it, but there are a huge number of companies where the guy turning the knob is the one and only person with any say over money being spent at all.


I guess that would be a nice feature to offer as an option, but most people would not want their web service cut off if they get a spike in traffic.


That's where the "option" part comes in. People who aren't worried about the cost simply don't turn it on.

Those of us who can't afford an extra £1000/month (or more if it's a dedicated attack) would love to be able to just turn it off when a threshold is hit.

Users may lose access, but at least you don't go bankrupt in the process.


> most people would not want their web service cut off if they get a spike in traffic.

Only if they are sure enough about their finances and their web service to be confident that the spike will be profitable.


A cutoff is good, but a triggered alarm would go a long way.

For example in Linode you can configure trigger conditions for CPU (triggers in x% cpu for y time) and Network usage (there are other conditions as well)

Very useful for flagging suspicious activity (or just plain accidents)


You can set this up with CloudWatch - add your EC2 costs as a metric, then you can create alarms based on your desired threshold (although I'm not sure how frequently this data is updated).
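For reference, a sketch of that alarm with boto3 (which postdates this thread; billing metrics also have to be enabled on the account first, and the SNS topic ARN here is a placeholder):

    import boto3

    # Billing metrics live in us-east-1 under the AWS/Billing namespace.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="estimated-charges-over-100-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=6 * 3600,          # the estimate is only refreshed every few hours
        EvaluationPeriods=1,
        Threshold=100.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )

It alerts rather than cuts off, so it answers the "tell me, don't kill me" wish above but not the hard-cap one.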


I have just discovered www.cloudability.com - they generate reports to give you insights into your AWS bill, and you can configure budget notifications. I am trying the beta now and am very impressed so far.


Amazon did send him a warning email.


Quite sad that Amazon still hasn't implemented bill capping. A lot of users want this... it's been on their TODO list since 2006: https://forums.aws.amazon.com/thread.jspa?threadID=58127&...


The argument in favor of bill-capping: it puts the onus on Amazon (or whatever other service provider) to have the tools to identify what's causing cap-busting behavior, as they're going to eat the costs.

I killed my EVDO wireless service based on a similar experience with my cell provider, after requesting multiple times that they provide me capped service. It simply wasn't worth the downside cost risk.


I cannot help notice that Hetzner offers 5000 GB/month AND a full dedicated server, for 39 EUR (51 USD) [1], so his traffic would have cost him at that rate a total of 100 USD if he were to use a dedicated server instead of Amazon.

(Before mentioning Amazon's scalability, consider that Hacker News is run on a single dedicated server, and the moral of the story seems to be how not to scale, especially when you don't want to.)

[1] - http://www.hetzner.de/en/hosting/produkte_rootserver/x3


Of course if you want to do away with any kind of built in redundancy and take over the sysadmin duties involved you can get the raw space and transfer cheaper.


I have a VPS on buyvm which includes 2TB/month for $6. Each extra TB is another $2.50. This is not a special case; you can find other providers with very good pricing for data transfer. I don't understand how the difference in bandwidth pricing can be so large. I'm also certain that Amazon doesn't pay as much for bandwidth as the guy from buyvm.


One factor is: Amazon charges for actual bandwidth used. Most buyvm users only use a tiny fraction of what they are allowed to.


buyvm is overselling their bandwidth to you. Real uplink pricing is not 2 dollars per terabyte. If you actually tried to use, say, 13 terabytes from a single VM, it would not work, either because of rate limiting on your uplink or because they would cut you off.


Well, real uplinks are mostly billed on the 95th percentile (sometimes capped, and Cogent does the 90th percentile).

I'm signing a deal (if all goes well) tomorrow for $0.65 per Mbit/sec: a 5 gigabit commit on a 10 gigabit port ($1.15/megabit overage charge, billed on the 90th percentile), from Cogent, probably the cheapest provider that claims 'tier 1' status (there's a lot of argument about Cogent's "tier 1" status... which is kind of funny; if you only have one provider, a good tier 2 is going to be more reliable than a tier 1 anyhow). But it is a real uplink and I really can use 5 gigabits of that.

If you ran one megabit full out for a month, assuming 2,629,743 seconds in a month and assuming Seagate (decimal) gigabytes, divide by 8 and you get 328,717 megabytes. So, uh, yeah, two dollars a terabyte? Assuming they are larger than I am and their BGP mix is the low-end stuff (Cogent and HE.net or the like), they aren't losing money.
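Sanity-checking that arithmetic in Python, with the same assumptions (a 2,629,743-second month, decimal megabytes):

    seconds_per_month = 2_629_743
    megabits_moved = seconds_per_month * 1          # one megabit/s, flat out
    megabytes_moved = megabits_moved / 8
    print(int(megabytes_moved))                     # 328717 MB, ~0.33 TB

    cost_per_megabit = 0.65                         # USD/month at the quoted commit
    print(round(cost_per_megabit / (megabytes_moved / 1_000_000), 2))  # ~1.98 USD per TB

So a sustained megabit at $0.65/month really does come out to roughly two dollars a terabyte.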


Prices have gone down since the last time I had to deal with all of that. I remember the company I was working with paying $1.67 per Mb/sec, I've seen offers for 1 Gigabit/sec for $1000 so roughly $1.00 per Mb/sec (using 1000, not 1024), but I didn't know they would go down to $0.65 per Mb/sec!

Guess the more you commit to, the better pricing you get!


Bandwidth pricing falls dramatically with bulk. Cogent offered me $0.75/meg for a 3g commit, $0.65 for a 5g commit, and $0.50 for a 10G commit. I know a guy that is getting $4.00/meg on a 100M commit on a 1000M pipe, also from Cogent. (and he chose that over going through me for $650/month for a capped full 1000M pipe) Bandwidth is also a "negotiated good" without a standard price, so what I pay may be rather different from what you pay.

On top of that, bandwidth costs fall... dramatically[1] in competitive markets, and I'm in silicon valley, probably one of the most competitive transit markets. According to the graphs I've seen[1] I'm paying below average, but it won't be very long before what I'm paying is average. I was explaining this to the real-estate guy that owns my data center that was wanting to get in on the bandwidth business (he wanted to get people in for cheap and crank up the price on renewal, like you do with real-estate and data-center space) His response? "but then why would anyone want to be in the bandwidth business?"

I mean, the fiber in the ground? That's like real estate. The prices go up, the prices go down, eh, whatever. But the amount of traffic you can push over two strands of fiber? That goes up all the time; I have some old Cisco 15540 ESPx DWDM units that can do 16 10Gbit/sec waves over one pair of fiber. They were awesome back in the day. Modern DWDM equipment? You can get 80 100Gbit/sec waves on one pair of fiber.

It's irritating, though, as like everything in this industry, you have to negotiate for months to get the "real price" - I asked Cogent for a single, capped 1 gigabit port for $1000/month north of 6 months ago. "Call me back when you can get me a buck a meg." They kept calling me back: "how about $3 a meg? how about $1.75 a meg?" I mean, even now, they beat the buck-a-meg price point, but I had to buy a lot more than I needed (I'm splitting it half and half with another company, at cost, so while the transaction cost was huge, once you factor in the discounted setup fees, well, I am still paying more than a grand a month, but it's still a pretty comfortable fee for me.) I imagine Cogent has spent several thousand dollars of salesperson time, and I /know/ I have spent several thousand dollars of my time on this, and they are charging me rather less than if I had wasted almost none of their time. They even dropped the setup charge down to almost nothing.

And now I've gotta do the same thing all over again with a second provider (most likely he.net) What a waste of time and effort all around. I mean, a little bit? it's kinda fun, I mean, sales people are always ridiculously overdressed extroverts, and in this industry, most of them can pick up on my personality and act in a way that is tolerable or even fun for short periods of time, but I really am an introvert. I mean, it can be fun for a while? but man, I have had like 5 meetings the last two days, between dealing with Cogent and dealing with the people that are buying half the pipe from me. It's exhausting, and I guess I have a hard time seeing how this is the most efficient way to sell bandwidth. I mean, I guess some people throw up their hands and pay the $3/meg asking price, and if they can break even on me, god damn, you wouldn't need many of the $3/meg customers to get really rich.

But then that raises the question: why bother with me? I mean, I'm going to turn around and sell transit very near this cost in a very public way, which will mean that more of those $3/meg customers are going to turn around and ask Cogent for a discount.

I guess they can rely on the fact that, well, I'm a scruffy introvert, and no large corporation will do business with me. Still, I mean, Cogent will let me drop 1G (capped, sadly) ports into datacenters where I don't have equipment for $650/month, and I've been running ads saying I would be willing to sell those at cost (and make my profit off of the difference between the list setup fee, $2500, and what they are actually charging me.) - This was mostly a way to get the higher commit pricing without actually paying for it all.

[1]http://drpeering.net/white-papers/Internet-Transit-Pricing-H...


At the time we were looking at a 2 gb commit...

$4 a meg sounds really expensive; at that point it is almost worth going up a tier to have room for future expansion and pay less per meg, or doing what you do and splitting the bandwidth and cost.


I realize they bank on their users not fully utilizing the allocated bandwidth, but take whatbox (or any other seedbox provider) as a somewhat similar example: they sell bandwidth pretty cheap and their users are in general huge bandwidth hogs.


For me the biggest reason to use AWS is ease of use and the APIs. I have used dedicated servers before, but with AWS I don't even think about servers. The application sits somewhere in the cloud and it will scale if required, or I will get an alert if something is not right.

Also, the majority of people will not need more than a single EC2 instance, which costs ~$20 a month (less for Linux). Amazon reviews and lowers prices periodically.


That such a simple thing on your part could result in such an extreme and expensive result implies something is wrong with the design of the system you're using.


I was going to post exactly the same thing. It's very lenient of him to blame himself instead of Google for this.


Feedfetcher strikes again, huh? I never did find out why they were so interested in one of my images.

http://rachelbythebay.com/w/2011/10/27/wtfgoog/


I loved your reference to the huge Russian bomb Tsar Bomba. I for one had never heard of it, and it made a great metaphor.

[1] http://en.wikipedia.org/wiki/Tsar_Bomba


Check out the Nuclear Effects Calculator; you can see what effect various historical bombs would have. (They include Tsar Bomba.)

(http://nuclearsecrecy.com/blog/2012/02/03/presenting-nukemap...)

(http://news.ycombinator.com/item?id=3624714)


This was an extremely interesting article. I hope Google rectifies this type of behavior. This is actually pretty scary.


This isn't Google's fault. I could get a couple machines and do the same thing if I had a list of 250 gigs of files. And I wouldn't be limited to once every hour.


But those machines wouldn't be Google's, and you would probably either be committing a crime (building an exploited botnet) or using someone's money (be it your employer, university, or yourself) to run them.

This is pretty clearly a different league of attack: rather than attacking systems or spending money, you'd be exploiting a Google feature to use Google's resource for free to incur a giant amount of data transfer.


It would almost certainly be criminal regardless of the method (Google, EC2, botnet).


If I wanted to launch an attack on something like Instagram all I would need to do is put a bunch of images (hosted on instagram) into a Google Spreadsheet? Then the google crawler will come through and download them all once an hour?


That's what I'm wondering too, I don't see how this would work on a normal website since a normal website probably wouldn't be whitelisted by Google like AWS is... What am I missing here?


I don't think you'd be able to take down a website with this strategy. The OP's site never went down; it just cost him a lot of money. Even pretty small webservers should be able to serve up static image files pretty easily.

But you could do a denial of service by making the service too expensive and all of your work would be hidden behind the anonymity of the google bot.


Try http://cloudability.com to keep track of your cloud costs. Note: I am not affiliated with them.


Thanks for the link!!


kudos to Amazon for running time backward and letting you pull your foot out of the way.

I can't help but think that if benign decisions lead to disasters like this in the cloud, how much destruction could robots wreak in the future due to similar benign choices?


That was exactly what I was thinking after reflecting on this issue...


"To err is human, but to really foul things up requires a computer." - Bill Vaughan


I'm pretty surprised Google didn't have the client download the images instead. Wouldn't that be a better solution or am I missing something here?

Pretty interesting though and if this becomes a big enough story you can bet Google will be changing something; the last thing they need is someone using Google Docs to DOS websites.


Google Docs is served over HTTPS, so they need to proxy the assets like GitHub does: https://github.com/blog/743-sidejack-prevention-phase-3-ssl-...
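A toy sketch of that proxying approach (Flask and requests are assumptions here; a real proxy like GitHub's also signs URLs, caps sizes, and caches aggressively):

    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/proxy")
    def proxy():
        # Fetch the plain-HTTP asset server-side and re-serve it over our HTTPS
        # origin, so the document page stays free of mixed-content warnings.
        url = request.args.get("url", "")
        if not url.startswith("http://"):
            return Response("bad url", status=400)
        upstream = requests.get(url, timeout=10)
        return Response(
            upstream.content,
            content_type=upstream.headers.get("Content-Type", "application/octet-stream"),
        )

Which also explains why it's Google's servers, not the viewer's browser, that hit the S3 bucket.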


Interesting. I never thought of that. It seems almost silly that you're bringing insecure content over a secure channel like that but then again it is only images.


"only images"

As long as they don't do WMFs, with their code-injection-by-design functionality...

http://en.wikipedia.org/wiki/Windows_Metafile_vulnerability


The function lets you specify height/width; passing it through makes some sense. It's also usually better (though not in this case) not to hotlink: a document can be viewed by thousands of people. There are some privacy issues too: viewing a Google Doc could have your browser making requests to unknown servers that could return whatever they want.


Perhaps the server needs to know the image dimensions for page layout.


They use quite a bit of HTML 5 so they should be able to do this client-side I would imagine. But I dunno.


Can you limit bandwidth with AWS?

Also, why would the spreadsheet be calling these images every hour? Did you have the spreadsheet open? Does Google do this call even when no one is viewing the spreadsheet?


> Can you limit bandwidth with AWS?

Not on S3, no.


You can put a robots.txt in the bucket.


Which will be ignored by Feedfetcher :-)

Plus, you cannot put a robots.txt at s3.amazonaws.com, so if the object is accessed through an https://s3.amazonaws.com/.... URL, the robots.txt will not work.


You could put robots.txt in the bucket if you address it using the http://mybucket.s3.amazonaws.com/ alternative URL scheme - a robots.txt in the root of the bucket would then be available at http://mybucket.s3.amazonaws.com/robots.txt
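For completeness, a sketch of dropping a robots.txt into the bucket root so it's served at that address (boto3 is an anachronism for this thread, and the bucket name and path are placeholders), though as noted below, Feedfetcher ignores it anyway:

    import boto3

    s3 = boto3.client("s3")
    rules = "User-agent: *\nDisallow: /big-images/\n"
    s3.put_object(
        Bucket="mybucket",
        Key="robots.txt",
        Body=rules.encode(),
        ContentType="text/plain",
        ACL="public-read",   # crawlers must be able to read it anonymously
    )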


Yes, that would solve the issue of not being able to have your own robots.txt file, and I did not know about that. On the other hand, Feedfetcher would still ignore the robots.txt.


Google's justification for ignoring this is very weak.


I disagree. Feedfetcher is no different than a browser: it fetches the URL the user inserted, nothing more (unlike a spider, which discovers URLs by itself).


Not true. It fetches the URL every single hour, not just when the user requests it. So Google is claiming they can ignore robots.txt because it was an action performed by a user (true) but they're unleashing a huge problem with this background refreshing. Google is wasting gobs of their own money, too. What if I made a bot that generated 1000s of Google accounts with 1000s of spreadsheets hotlinking 1000s of big files stored on S3? This one guy's one file did TERABYTES of transfers over a week. The underlying problem is that Google is relying on the domain name to indicate the company size, and thus the bandwidth allocation for this service.


Background refreshing is a common feature in client applications, like RSS readers. I think their reasoning makes sense.

I do think they should change their process (making it lazy-load instead), but that's a different issue to robots.txt.


I believe the parent's point was that, for the HTTPS scheme, you can't use any alternative CNAMEs, because they won't match the certificate S3 serves--so if your site is designed to be HTTPS-by-default, and is attached to an S3 bucket, putting a robots.txt in it is moot.


According to the article, that would not have helped; feedfetcher is meant to be manually triggered and thus does not obey robots.txt


For certain definitions of "manually" :)

It's manually triggered to start downloading resources every hour regardless of whether someone needs them.

In that sense, any web spider is "manually triggered" as well ;-)


The article states that (1) you can't and (2) the bot ignores it.


No, you can't. It'd have to be at the root of http://s3.amazonaws.com/.

This is mentioned specifically in the article, in fact.


Pick subdomain-safe bucket names and you have an alternative.

    http://[bucket name].s3.amazonaws.com/robots.txt


Amusing thing about that page - it's full of '\$100', and uses Javascript to strip the \'s out, replacing them with empty <span> elements. Not sure I really want to know why...


Mathjax :-)


Ah, that's a good excuse. Thanks :)


Here's the big money question, since I haven't used S3 - do you have any throttling control? Can you simply block the feedfetcher bot, or block repeated hits like this at all? It seems like S3 is a nightmare $$$ hole if there aren't some really robust tools to manage this sort of problem.


Sometimes a 509 Bandwidth limit exceeded helps!

AWS could actually add a setting for a max bandwidth per hour or so and alert early if there is suspicious activity.


Scary indeed. Many thanks for the warning. I was contemplating starting to use Amazon's cloud and some Google tools, but I definitely will not touch any of it now. Who knows how many other traps like this are lying in wait for the unwary?

All this automation is very well, but this illustrates the dangers of running stuff on others' machines of unknown complexity and out of your own control, while having to pay for whatever may happen. Not for me, thanks.


I don't think the original author would want you to take this as FUD. These are incredibly powerful tools that give you more control than the alternatives (AWS, anyway), and yes, unintended consequences are possible, but this really is a freak occurrence.


Seeing how popular the story is, Amazon could not have made a better $1000 PR campaign than being classy and reimbursing him!


"What I find fascinating in this setting is that Google becomes such a powerful weapon due to a series of perfectly legitimate design decisions."

I call into question that these are "perfectly legitimate design decisions". Basically, if Google thinks that the data is private or too sensitive to cache, then it shouldn't be this easy to have it automatically keep hitting a site like this. Google should have realized the potential for abuse here. I'm guessing it truly is an oversight on their part, as I can't imagine they would want to waste all this bandwidth either; however, it's something they should figure out a solution for.


This is the same kind of attack I've demonstrated here, but for users of a popular analytics service: http://news.ycombinator.com/item?id=3873774


I'm curious to know whether the objects were stored with Cache-Control: or Expires: headers. Does having such headers make a difference?

Clearly the client's not presenting If-Modified-Since: pragmas as I believe S3 honors those.
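For what it's worth, a sketch of attaching those headers when the object is uploaded (boto3, placeholder names; S3 also returns Last-Modified and ETag on its own, so a well-behaved client could revalidate instead of re-downloading):

    import boto3

    s3 = boto3.client("s3")
    with open("photo.png", "rb") as f:
        s3.put_object(
            Bucket="mybucket",
            Key="photo.png",
            Body=f,
            ContentType="image/png",
            CacheControl="public, max-age=86400",   # hint: reuse this object for a day
        )

Whether Feedfetcher pays any attention to Cache-Control is another question.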


Quite an interesting article. I didn't even think of it as a potential DDoS tool until he mentioned it. It makes sense, but it seems like Google should have built a fail-safe for something like that situation.


You can use http://cloudcost.teevity.com to protect yourself from such situations (disclaimer: I'm the founder and CEO).


I have a VPS. When I read stories like this, I think there isn't any point in going over to GAE or AWS. But lots of people do use these services, so what am I missing?


He's getting 100+ requests per second.

You could rate-limit the IPs. The question is: how many IPs is Feedfetcher using?


Kudos to Amazon for refunding the guy. I don't think any other hosting company would just take the bullet for the customer.


>"But how come did Google download the images again and again?"

"But how come did" indeed.


Communication skills matter, sorry.


But you get the idea of what he's trying to say, no? Not all of us here speak English as our mother language. Please excuse :)


very interesting findings, good read.



