What are good use-cases for on-demand high I/O servers?
At $3.10/hr, these instances work out to $2k/mo.
There are probably many more cost-effective options if you want a 2TB SSD server.
Since the benefit of using EC2 is that you can provision instances elastically, what are the sorts of scenarios in which one needs to provision high I/O servers elastically?
[edit: A few minutes of Googling, and I can't find any dedicated servers with 2 TB of SSD.]
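For what it's worth, the $2k/mo figure is just the hourly rate run around the clock. A quick sketch of the arithmetic, assuming a 30-day month (the function name is mine, purely for illustration):

```python
HOURLY_RATE_USD = 3.10  # on-demand price quoted in the post above


def monthly_cost(hourly_rate: float, days: int = 30) -> float:
    """Cost of running one instance 24 hours a day for `days` days."""
    return hourly_rate * 24 * days


print(f"${monthly_cost(HOURLY_RATE_USD):,.2f} per 30-day month")
```

That comes to a bit over $2.2k, so "$2k/mo" is a fair round number.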
I do genome mapping where our indexes won't entirely fit in memory. It would be very handy to be able to spin up a few of these instances, load the indexes from an EBS volume onto the local SSDs, then run for a couple of hours or so. This is a very I/O intensive job that we need to run about once a week, but then the rest of the time could be idle.
SSDs would make our jobs run significantly faster. So much so that we've toyed with the idea of adding SSDs to our in-house cluster, but couldn't quite justify the costs. This might actually shift the cost savings to get our lab to migrate to EC2 as opposed to our in-house or university cluster.
I'm working on a data visualisation app, which is getting a lot of interest from biologists and bioinformaticians. I'd like to learn a bit more about your work. Can I email you somewhere? Or please drop me an email at hrishi@prettygraph.com. Thanks!
Based on this, it shouldn't take more than half a day under the worst circumstances (a single EBS drive with crappy performance), and if you RAID together enough drives, you can do it in about an hour. Correct me if I'm wrong, but you pay for EBS by size, not by physical disk, so the more you can split up your data into blocks, the more performance you're going to get.
You can get some of the data from the 1000 genomes project directly from Amazon, so you don't need to pay to download it. There's about 200TB of data there (so far).
What I'm working on is mapping those short sequences (50-75 bases) to the genome and then looking for either mutations or expression levels (how many of those reads map to a particular location). There are a couple of ways to do the mapping, but most tools these days use either a big hash table or a Burrows-Wheeler transform.
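To make the hash-table approach concrete, here is a toy sketch (not how any particular aligner is implemented): index every k-mer of the reference, look up the first k bases of each read, and verify the rest of the read by direct comparison. The reference string, the read, and the value of k are made up for illustration; real mappers also tolerate mismatches and gaps, which this does not.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read(read: str, reference: str, index: dict, k: int) -> list:
    """Return reference positions where the whole read matches exactly.

    The first k bases of the read act as the seed for the hash lookup;
    each candidate position is then checked against the full read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy data, far smaller than a real genome or read set.
reference = "ACGTACGTTAGCACGTTAGG"
index = build_kmer_index(reference, k=4)
print(map_read("ACGTTAG", reference, index, k=4))  # [4, 12]
```

The index is what has to live in memory or on fast storage: for a full genome it has one entry per reference position, which is why it may not fit in RAM.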
Well, the raw output of a typical so-called "next gen sequencing" machine (which is actually very much current gen) is around 1TB (at least, for the ones we used here).
This is the raw file, though, so once processed (but not yet analyzed) I believe we have sizes around 50 to 100GB (but that's not really what I work on, so don't quote me fully on this).
The next steps vary depending on what you want to do exactly, but they usually involve alignment of base pairs (basically, trying to tie sequences of DNA together by their ends by seeing if they "fit").
Essentially you sequence tons of short bits of dna and then either fit them together (assemble) or fit them to a reference (align). You can find example data sets in the Short Read Archive:
http://www.ncbi.nlm.nih.gov/sra/
Cloudburst (a hadoop based aligner) has a good description of an algorithm:
http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.p...
Though they can get much more sophisticated and there are a number of open and closed source implementations...I only link this one because of the quality of the figure.
The data sets we work with in my group can be up to 400GB of compressed text for the reads from a single individual.
Another example from biology with a similar computational profile would be searching through a huge number of mass spectrometer outputs to identify the components in a new sample.
[edit: A few minutes of Googling, and I can't find any dedicated servers with 2 TB of SSD.]
I am the founder of SSD Nodes, Inc. (http://www.ssdnodes.com) and we offer lower-sized SSD-backed storage in addition to custom cloud and dedicated plans that range from 1-12TB of local SSD storage, at comparable performance. [/plug]
Given your "1-12tb" range doesn't list prices online, can you tell us whether your prices are comparable? At $1249/month for 2x 200GB SSD, it seems unlikely, but maybe I'm wrong?
Typically clients requiring that much SSD-backed storage have performance targets and a very specific workload, so this affects options along with pricing. With that said, ballpark for what Amazon is advertising is comparable to our pricing, except with us you are _guaranteed_ the resources whereas they are using a multi-tenant environment (you're not using the only instance on their host and your I/O is influenced by everyone else using that host).
I haven't downvoted (and in fact couldn't if I wanted to since you replied to me), but personally my issue with your comment is that you're being as vague as your website's prices. Would be more interesting and relevant if you actually gave a price for a box comparable to the AWS specs.
This assumes Amazon is using SLC and not MLC drives, which are considerably cheaper. Also, the quote above is from SoftLayer I believe, which is not the cheapest provider for high end hardware.
That is only one of the benefits of EC2. If you are not using elasticity, then you have to factor in the reserved instance pricing, which drops the prices down by 71% (as in, to 29% of the list price; and I mean even if you include the up-front cost: that's overall savings). You, like most people who comment on the price of EC2, do not seem to be taking this into consideration. :(
One example, which stems from "on-demand ness" (which you added: I was only responding to "elasticity"), is that you can do "test runs" of migrations and deployments without even thinking about it: you can rent, for just an hour, a setup identical to your existing one, often based on a consistent and atomic snapshot of your production machine, so you can try something "likely correct but possibly horribly wrong"; then, if it works, rather than replicating the change on your "real" machine, you can just cut over to the new one and shut down the old.
Way too many people seem to believe that the only benefit of "on-demand" is "elasticity", and then make bogus arguments here that "if you can plan your traffic you shouldn't be using EC2": EC2 is cheaper than people like to claim (and is in fact quite price competitive) and your ability to turn on/off machines on a whim changes the way you look at hardware so drastically that, in all honesty, it makes traditional ways of dealing with hardware seem draconian and only worth putting up with if you are dealing with some weird corner case or have horribly special requirements.
How often do we have to repeat this argument here on HN?
When running 24/7, EC2 instances are 2-3x more expensive than the cheapest equivalent rented-server options and orders of magnitude more expensive than the physical hardware if you buy it.
The numbers have been recited countless times, I'm not digging them up yet again.
So, no, EC2 is not cost effective for steady loads at the mid-range. EC2 shines at the very low and the very high end and in specific workloads, i.e. it shines when the benefits can be quantified to an amount greater than the price difference.
For us it was more like a 10x, but a few things went into that:
- We found screamin' deals on hardware by snatching it up when it was available, not when we needed it.
- We were at a fairly cheap colo, and haggled hard to get the cheapest rack possible. We went on a tour, noticed they had _tons_ of empty racks, and used that as some leverage.
- We didn't add any additional ops overhead by having everyone responsible for ops.
We were in the 5k/month ballpark for EC2, and cut it to under $600 with a few grand outlay for hardware spread over the course of a quarter.
That said, all of my current projects are on EC2 for the provisioning flexibility, and because I hate having to drive down to a datacenter at 4AM to swap a drive.
Please tell me that you were including the cost of you driving to the datacenter at 4AM to swap a drive in the cost of the hardware in your price comparisons, as if you are just talking about the cost of the raw hardware and are not including the opportunity cost of all this time and energy spent haggling and performing maintenance, then this is simply a dishonest comparison: you could easily have been spending that time doing just about anything else, from working on new features for your product to improving your sales/investment pitch to simply sleeping (which will improve the quality of all of your work the next day). I'm also curious what your replacement plan is: are you intending to do this again next year, or are you intending to wait until all of your hardware starts failing and the operational overhead starts becoming painful? Finally, "having everyone responsible for ops" might mean that you didn't have to add a new explicit hire, but you can't claim that that isn't overhead: that is now state that everyone has to keep in their head and is a liability that could cause anyone to randomly get interrupted; it might even be cheaper in the long term to hire a new, more dedicated person than to reuse existing people.
Yes, 10x is common. 100x is a little contrived but possible when you spec out enough RAM in EC2 instances (>2T), then compare to a physical box over 3 years.
I would be very interested in knowing what factors you are trading off for "equivalent" other than "on-demand". Many of my friends use co-location services for their businesses, and most of them purchase only on price, and their servers honestly suck: they have high latency, they are unstable, they don't have remote serial console access... they are living in a ghetto that burns tons of their time into "becoming server people".
If you find a company that has reasonable support, reliable servers, good datacenters, and the minimal features required to debug issues remotely, then you are looking at prices fairly equivalent to those offered by EC2 heavy-utilization reserved instances (and are going to end up with a similar contract length anyway). If this somehow doesn't work out: call Amazon AWS's sales division and see if you are compelling enough for them to negotiate with (they totally have a sales division, and they really do "want your business").
Regardless, your choice of quote is really bothersome: "people like to claim" that EC2 is as expensive as their on-demand list prices, and that's a fact clearly demonstrable by the person I'm responding to (who is quite clearly and obviously claiming EC2 is more expensive than it really is) and one that is not defensible as the price you should be looking at is the heavy-utilization reserved pricing; if you'd like to respond to my comment "and is in fact quite price competitive" then you should quote that and adjust your argument appropriately.
Honestly, the history of HN is not much better (as I scour around trying to find the "numbers" you claim "have been recited countless times"). It is actually difficult to find people who don't claim that Amazon EC2 is more expensive than it is; I'm almost wondering if you and I are living on different versions of the site...
"EC2 is about 10-20 times more expensive than dedicated hosting. Even if reserved instances save us 22% over 3 years, it still doesn't even come close." -- cmer
^ No, EC2 reserved instances save you 71% over 3 years.
"It costs $576/mo to run an extra large EC2 instance fulltime" -- stephenjudkins
^ No, even two years ago (before "heavy utilization reserved instances") you could drop this price by 66% to $195.84/mo.
"With EC2 prices at about $0.10 per hour, I can't imagine ever using a service with such a high premium." -- apinstein
^ Obviously: no, but the fact that this person is angry about the price of a small instance at $72/mo is quite telling; he isn't willing to go lower than $20/mo.
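The percentages in the rebuttals above are easy to sanity-check. A small sketch, assuming a 720-hour (30-day) month; the helper names are illustrative only:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def monthly_from_hourly(hourly_rate: float) -> float:
    """Monthly cost of an instance billed by the hour and left running."""
    return hourly_rate * HOURS_PER_MONTH


def after_discount(list_price: float, discount: float) -> float:
    """Price remaining after a fractional discount (0.66 means 66% off)."""
    return list_price * (1.0 - discount)


# $576/mo extra large instance with 66% off, as quoted above.
print(round(after_discount(576.00, 0.66), 2))

# Small instance at $0.10/hr, run for a whole month.
print(round(monthly_from_hourly(0.10), 2))

# A 71% saving leaves 29% of the list price.
print(round(after_discount(100.0, 0.71), 2))
```

These reproduce the $195.84/mo and $72/mo figures in the comment.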
I found a price comparison by vladd from earlier this year, comparing a high-end VPS to EC2's largest offering, coming up with a nearly 10x difference, but the server is entirely useless: it is a consumer-level product running non-ECC RAM. Later comments claim the same hosting company has "competitively priced servers with ECC ram".
A couple months ago I found a thread that linked to a fairly detailed argument[1] stating that EC2 instances are 2-3x more expensive than a VPS. However, this person again is performing a comparison with non-ECC hardware. What damns this comparison, however, is that he is not taking advantage of 3-year reserved instances for a long-term high-end use case: his numbers seem to be based on 49% off, when he can easily get 71% off, nearly a 2x difference. <- Again, EC2 is cheaper than people like to claim.
Seriously: I can't find anyone who is actually doing legitimate comparisons of Amazon's offerings. People either compare EC2 to "I spent a week of time negotiating a deal to take over a bunch of hardware from a failing company down the street" (which, for the record, will also give you a great deal on chairs and office furniture: comparing the cloud to a fire sale is inane), assume "a server is a server is a server" and find "the cheapest" option (which seems to always have unreliable RAM), or (frankly: "and") fail to take into account Amazon's reserved instance discounts.
That said, Hacker News has a really horrible search system, and I'm trying to find something kind of esoteric (as I want to search for a dollar sign, and thereby have to use proxies such as "expensive" and "cost"). I would thereby love to see an honest comparison, and am happily willing to believe that I missed it: do you have a link to such?
it is a consumer-level product running non-ECC RAM
Sorry to break it to you, but EC2 instances are in all likelihood not running on ECC RAM either[1]. If they had ECC RAM, Amazon would probably advertise that prominently, or at least respond when asked directly. If you can find a link to prove the opposite, I'll take that back.
I would thereby love to see an honest comparison, and am happily willing to believe that I missed it: do you have a link to such?
You have probably already seen any of the blog-posts I could cite here, so I'll instead just try to wrap your two claims up:
1. You claim that dedicated servers are more labor intensive (setup, hardware failures) and require more staff. This is not my experience at all. In fact the complexity and idiosyncrasies of the AWS platform are much harder to abstract in the beginning, and no less labor intensive in the long term. You're just trading one set of problems (hardware issues) for a different one (cloud issues). What you may save on the hardware management front you have to spend on adapting your application for a cloud-environment.
2. You claim that equivalent hardware to an EC2 instance (with comparable performance, good support, network, etc.) would be roughly the same price as an EC2 instance. Sorry, but that is laughable: when did you last benchmark an EC2 instance? Even a cheap rented dedicated server (hetzner, leaseweb, ovh) will normally give you twice the bang for the buck on every key metric (I/O, RAM, CPU). And this quickly rises to beyond an order of magnitude when you start comparing EBS to a local array, or a 256G RAM box to 256G of RAM in EC2 instances. Where redundancy is a concern you can usually quite literally buy two of each and still be cheaper than EC2.
I'll say what I always say: EC2 does have its place. However for deployments in the range of 10-~50 servers you will in pretty much all cases save a lot of money by sticking with dedicated servers for the base-load. That is unless your app needs the cloud-flexibility, of course (most apps don't).
What makes you believe this flexibility would come for free anyway? Like all things, it comes with a price tag, and actually quite a hefty one in this case.
They don't advertise it because it goes without saying that servers have ECC. EC2 uses Xeons and Opterons which only support ECC. It should only be a few percent more expensive, which is nothing when you consider the premium Amazon charges (which is something I definitely agree with you about).
I've been dealing for long enough with hosters and hardware to know that nothing goes without saying.
Xeons and Opterons which only support ECC
Have you actually checked the CPU models they use?
All I know is that Amazon uses a range of different CPUs, and some Xeon/Opteron models do accept non-ECC RAM.
only be a few percent more expensive
In the past ECC DIMMs used to be significantly more expensive.
Either way, as said, I don't know whether they're using ECC Ram. I agree it should go without saying, but I don't share your optimism that it actually does. I also wonder why they explicitly mention it for their GPU-instances when it goes without saying otherwise.
No source aside from personal experience working with them, sorry. They avoid publicizing anything about the hardware/infrastructure if possible, partly so that they can change it without customer awareness and partly because they have secret sauce in places (no, ECC isn't secret sauce).
I do this too, not for risky migrations, but for daily updates. The app relies on a data service that's normally read-only, but gets fresh data daily. When everything was running on the bare metal, we had to schedule the updates for the middle of the night and carefully migrate the data in-place to avoid interrupting service. It would take 8 to 12 hours.
Since we moved to EC2, updating is simpler. The service runs on a micro instance. We launch a large instance to do all the CPU- and IO-intensive processing that prepares the new dataset, then launch a new micro instance, upload the dataset, run a few smoke tests, and if all is well, cut over to it. Because we're doing it off line, we were able to optimize the data processing for speed rather than low resource usage, and cut the runtime down to 45 minutes.
One thing that's often missed in discussions of IaaS versus bare metal is that the elasticity of a particular application can be affected by its design. When we were running on dedicated machines, we smoothed out the load to avoid idle hardware, but after moving to EC2 we concentrated it into spikes to get maximum productivity from running instances. In our case, spiky load is better from a business point of view, because serving data that's 1-25 hours old is better than data that's 8-32 hours old.
There are also the network benefits (pun intended). If the rest of your app does benefit from elasticity, you've had to choose between:
1. Keeping your app on EC2 and working around the lack of high I/O options
2. Keeping your app somewhere else and working around the lack of EC2-style elasticity
In TFA, Netflix had chosen #1, and they used to run an extra memcached layer + I/O on 48 instances. They were able to bring this down to 15 I/O instances with no intervening cache, and lower overall latency.
That said, I'd guess the on-demand hi1.xlarge won't get a lot of usage; I imagine they offer it just for orthogonality's sake (all other instances are available both on-demand and reserved), plus the ability to try before you buy.
What's really exciting is that Amazon clearly recognizes their lack of good I/O solutions. Maybe we'll see a whole range of options stem out of this... one can hope.
Say I have a batch process that has huge I/O requirements that has to run once per month and be finished within X hours of starting or SLAs are broken. (Plenty of these types of custom workloads exist in the enterprise)
I can either buy a server with specialist enterprise-grade SSD / Fusion-IO / whatever (> $20,000 most likely) for this once a month process or I can spin up one of these high I/O servers for 1 day per month for a grand total of $50.
In this scenario, this new server type is a godsend.
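A rough sketch of that trade-off, assuming the $3.10/hr on-demand rate quoted at the top of the thread and the $20,000 hardware figure from the post (the function names are illustrative, and this ignores power, hosting, and staff time on the hardware side):

```python
HOURLY_RATE_USD = 3.10          # on-demand rate quoted earlier in the thread
HARDWARE_COST_USD = 20_000.0    # lower bound given in the post above


def run_cost(hours: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of renting one instance for a batch run of `hours` hours."""
    return hours * hourly_rate


def months_to_break_even(hardware_cost: float, hours_per_month: float) -> float:
    """Number of monthly runs before buying hardware becomes cheaper than renting."""
    return hardware_cost / run_cost(hours_per_month)


print(f"one 24-hour run: ${run_cost(24):.2f}")
print(f"break-even after {months_to_break_even(HARDWARE_COST_USD, 24):.0f} monthly runs")
```

At that rate a full 24 hours is closer to $74 than $50 (and $50 buys roughly 16 hours), but either way the hardware would take a couple of decades of monthly runs to pay for itself.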
Monthly financial consolidation for large companies? My employer has a rack of servers for one application that are used very heavily for a few days a month then are hardly used for the rest of the time - the databases are currently being moved to internal Fusion-IO devices rather than Fibre Channel SAN drives.
The key consideration there is how many consulting hours will you need to figure out how to build up and tear down whatever Oracle/SAP/etc software that you need to do your financial calculations on the EC2 VM.
You may find that Fusion-IO is cheaper than $50 + 3 months of consulting time.
The closest I can find is Hetzner's EX8S[1], which is EUR99 for the box itself without disks, plus about EUR80 for four 240GB SSDs. This is a little less than half of what Amazon gives you, but it costs about EUR180 a month and EUR150 for setup. You do lose the elasticity of Amazon. Mind you, it is 1/10th of the price.
"Since the benefit of using EC2 is that you can provision instances elastically" That is certainly a benefit, but I think many would argue that it's far from the only benefit.
And as with all EC2 pricing, reservations drop the price substantially. A 1-year reservation for a "heavy utilization" instance comes out to $7280 per year + $0.621 per hour, for a total of $12719.
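The total above follows from the upfront fee plus the hourly charge over a full year. A minimal sketch, assuming 8760 hours of continuous use and comparing against the $3.10/hr on-demand rate quoted at the top of the thread (names are illustrative):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, i.e. the instance never shuts down


def reserved_annual_cost(upfront: float, hourly_rate: float) -> float:
    """Upfront reservation fee plus a year of the reduced hourly rate."""
    return upfront + hourly_rate * HOURS_PER_YEAR


def on_demand_annual_cost(hourly_rate: float) -> float:
    """A year of the on-demand hourly rate with no upfront fee."""
    return hourly_rate * HOURS_PER_YEAR


reserved = reserved_annual_cost(7280.0, 0.621)
on_demand = on_demand_annual_cost(3.10)
savings = 1.0 - reserved / on_demand

print(f"reserved:  ${reserved:,.2f}/yr")
print(f"on-demand: ${on_demand:,.2f}/yr")
print(f"savings:   {savings:.0%}")
```

So the 1-year reservation roughly halves the bill for a machine that runs all year.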
A reserved instance is about $650/month. One possible use: run your master postgres on it and use streaming replication to transmit the data to something more durable.
What sort of map-reduce space needs fast random I/O? The only things which come to mind are disguised table joins, where the right solution is to use sorting instead.
(Not trying to be argumentative, I'm just trying to figure out what you have in mind.)
I can't think of an example, but I'd guess this would work out as a good cost-effective solution for people who need to do "big data" map-reduce type jobs, but with less utilisation than 1 week/month. Or, if you can solve the "getting the data in" problem (perhaps you've already got all your data stored on S3), you could use one of these for an hour or two at a time, perhaps running end-of-week or end-of-day batch processing, at a somewhat lower cost than having a similar sort of pay-by-the-month colo server.
We use EC2 within our telecom infrastructure, which manages peaks of several thousand calls with full-duplex call recording on (think I/O here). We are using the elasticity of EC2 to scale on demand.
This kind of instance is a nice addition for us as far as our auto scaling is concerned.
If lots of your infrastructure is in EC2, you may need a good DB server inside EC2 that is used constantly. E.g., if you have a read-heavy Cassandra workload, you need SSDs.