The best win is to build out in a data center with a 10G pipe into Amazon's network so that you can spin up AWS only on peaks or while you are waiting to bring your own stuff up. That gives you the best of both worlds.
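To make the "AWS only on peaks" idea concrete, here is a rough sketch of the sizing decision. The function name, the requests-per-second units, and the capacity numbers are all hypothetical, purely for illustration; a real setup would also have to account for instance spin-up time.

```python
import math

def cloud_instances_needed(demand_rps, owned_capacity_rps, per_instance_rps):
    """Return how many cloud instances are needed to absorb demand
    that exceeds what the owned data-center hardware can serve.

    All three parameters are hypothetical inputs (requests per second).
    """
    # Only the overflow above owned capacity goes to the cloud.
    overflow = max(0.0, demand_rps - owned_capacity_rps)
    # Round up: a partially used instance still has to be running.
    return math.ceil(overflow / per_instance_rps)

# Normal load fits on owned hardware, so nothing is rented.
print(cloud_instances_needed(800, 1000, 50))
# A peak above owned capacity spills over into rented instances.
print(cloud_instances_needed(1120, 1000, 50))
```

The point is that the baseline runs on hardware you own, and you only rent for the part of the curve above it.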
AWS has a product called Direct Connect to reduce bandwidth costs between AWS and your infrastructure.
There is a latency spike between local and Amazon infrastructure, so it would be critical to build your system so that putting this spike in the mix didn't impact your own flow path.
I had thought a bit about how we might do that at Blekko, and I would probably shift crawling into the cloud (users don't see that compute resource) and move crawler hosts over to the frontend/index side. But I'm sure there would have been a bunch of ways to slice it.
They certainly can and likely will respond by tweaking their cost models. Bandwidth costs at AWS are completely out of whack: list prices at AWS per TB are tens of times higher than Hetzner's, for example. I presume that's based on looking at what customers are most sensitive to. E.g. if you retrieve only a small percentage of your objects every month, it won't matter much. Similarly, if most of your retrievals are from EC2 rather than from the public internet, the bandwidth prices won't be a big deal, and you may not pay attention to how much you actually pay per TB.
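A back-of-the-envelope sketch of why the per-TB price matters so little for some usage patterns. Everything here is an assumption for illustration: the function and parameter names are made up, the price is a placeholder rather than an actual AWS list price, and retrievals from EC2 are treated as free to keep the model simple.

```python
def monthly_egress_cost(stored_tb, retrieved_fraction, internet_fraction, price_per_tb):
    """Estimate the monthly bandwidth bill for stored objects.

    stored_tb          -- total data stored, in TB (hypothetical)
    retrieved_fraction -- share of stored data read per month, 0..1
    internet_fraction  -- share of reads going to the public internet
                          rather than to EC2, 0..1
    price_per_tb       -- placeholder egress price per TB, not a real quote
    """
    retrieved_tb = stored_tb * retrieved_fraction
    # Simplifying assumption: only internet-bound traffic is billed.
    billable_tb = retrieved_tb * internet_fraction
    return billable_tb * price_per_tb

# Mostly-cold data read largely from EC2: the bill is small.
print(monthly_egress_cost(100, 0.05, 0.2, 90.0))
# Everything served to the public internet every month: the same
# per-TB price now dominates.
print(monthly_egress_cost(100, 1.0, 1.0, 90.0))
```

With the same placeholder price, the second case costs a hundred times more than the first, which is the difference between not noticing the per-TB price and shopping around.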
The high bandwidth prices hit some niches much more than others, and it may be more expedient for AWS to keep it straightforward for those affected to put caches in front than it is to rattle the cage of other customers. E.g. if you consume huge amounts of bandwidth, you'll sooner or later speak to specialized CDNs or start looking at peering anyway.