The x1 cost per GB is about 2/3 that of r3 instances, but you get 4x as many memory channels if you spec the same amount of memory via r3 instances, so the cost per memory channel is more than twice as high for x1 as for r3. DRAM is valuable precisely because of its speed, but the speed itself is not cost-effective with the x1. As such, the x1 is really for applications that can't scale with distributed memory. (Nothing new here, but this point is often overlooked.)
Similarly, you get a lot more SSDs with several r3 instances, so the aggregate disk bandwidth is also more cost-effective with r3.
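To make the arithmetic concrete, here's a rough sketch of the claim. All figures are my own illustrative assumptions (approximate launch-era us-east on-demand pricing, 4 DDR channels per socket), not numbers from the parent: x1.32xlarge at ~$13.338/hr with 1952 GB over 16 channels, r3.8xlarge at ~$2.66/hr with 244 GB over 8.

```python
# Back-of-envelope check of the cost-per-GB vs. cost-per-channel claim.
# All figures below are assumptions, used only to illustrate the ratios.
x1_price, x1_gb, x1_channels = 13.338, 1952, 16   # x1.32xlarge (assumed)
r3_price, r3_gb, r3_channels = 2.66, 244, 8       # r3.8xlarge (assumed)

x1_per_gb = x1_price / x1_gb
r3_per_gb = r3_price / r3_gb
print(f"x1 $/GB-hr is {x1_per_gb / r3_per_gb:.2f}x the r3 rate")  # about 2/3

x1_per_ch = x1_price / x1_channels
r3_per_ch = r3_price / r3_channels
print(f"x1 $/channel-hr is {x1_per_ch / r3_per_ch:.2f}x the r3 rate")  # >2x
```

With these assumed specs, the per-GB rate comes out around 0.63x and the per-channel rate around 2.5x, matching the parent's "2/3" and "more than twice" figures.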
2. If DRAM were not faster than NVRAM/SSD, nobody would use it. "Speed" involves both bandwidth and latency. Latency is probably similar or higher for the X1 instances, but I haven't seen numbers. We can make better estimates about realizable bandwidth based on the system stats.
https://www.youtube.com/watch?v=vS47RVrfBvE main system board
https://www.youtube.com/watch?v=_poMPOUGRa0 memory risers
Edit: Supermicro has several 2TB boards, and even some 3TB ones: http://www.supermicro.com/products/motherboard/Xeon1333/#201...
(Disclaimer: AWS employee, no relation to EC2)
6TB is about where single machines currently top out, due to hardware constraints shared across multiple vendors and architectures, and memory bandwidth starts being an issue. You have to throw 96 x 64GB DIMMs at the ones that exist, so wave buh-bye to a cool half a million USD or so. If you're sitting on a 12TB box, I want a SKU (I want one!).
I don't actually think Supermicro makes a 6TB SKU, even. That's Dell and HP land.
Sure, http://www.supermicro.com/products/motherboard/Xeon/C600/X10... supports 3TB in a 48 x 64GB DIMM configuration.
This shows a logical diagram of how they cobble all these cores together: http://www.redbooks.ibm.com/abstracts/tips0972.html?Open
I've seen these both opened up and racked up. They are basically split into max 4 rackmount systems, each I think was 2U IIRC. The 4 systems (max configuration) are connected together by a big fat cable, which is the interconnect between nodes in the Redbook I've linked above. The RAM was split 4 ways among the nodes, and NUMA really matters in these systems, since memory local to your nodes is much faster to access than memory across the interconnect.
This is what I observed about 5-6 years ago. I'm sure things have miniaturized further since then...
4 CPUs, 60 cores, 120 threads ("cloud cores"), 3TB RAM, 90TB SSD, 4 x 40GbE, 4 RU. $120K.
Same price as the AWS instance for one year of on demand.
One of the links on the top points to a server with 96 DIMM slots, supporting up to 6 TB of memory in total.
You could do exhaustive analysis on that dataset fully in memory.
I only point this out to try and correct a common error I see. You're absolutely right that it is awesome that the entire data set can be analyzed in RAM!
Looking at the thread from the release, I see no explanation of how he got the data, but I see several people commenting that they finally have a way to get comments beyond the 1000 per account:
Practically speaking, on the other hand, reddit likely never took user content licensing into account to begin with - they never knew they'd get all of digg a couple years in, and 2006 was a different time besides.
The only way reddit would be able to release the data now would either be a specific clause (added from the start) allowing such in the TOS, or a general-purpose "we can do w/e we want with your data"; nobody thought to add the former (as noted), and of course the latter would have been roasted alive the moment it was noticed.
So, me thinks licensing issues.
Modmailing /r/reddit.com is the canonical way to reach the admins (for any reason), if you'd like more info.
What time of day the accounts most frequently comment. (I'd bet there is an interesting grouping of those that post while at work during the day, and those who post from home at night.)
or what subreddits people comment in most during the day vs which /r/ they post to at night ;)
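A minimal sketch of that day-vs-night grouping, assuming the dump gives each comment an author and a UTC epoch timestamp (the field names here are hypothetical):

```python
# Bucket each author's comments by hour of day, then measure what fraction
# fall in (UTC) working hours. Field names are assumptions about the dump.
from collections import Counter, defaultdict
from datetime import datetime, timezone

def hour_histogram(comments):
    """Count comments per hour-of-day for each author."""
    hist = defaultdict(Counter)
    for c in comments:
        hour = datetime.fromtimestamp(c["created"], tz=timezone.utc).hour
        hist[c["author"]][hour] += 1
    return hist

def daytime_share(counter, start=9, end=17):
    """Fraction of an author's comments posted between start and end (UTC)."""
    total = sum(counter.values())
    work = sum(n for h, n in counter.items() if start <= h < end)
    return work / total if total else 0.0
```

Authors with a high daytime_share would be the at-work crowd; near zero, the night posters. A real analysis would want local time zones, which the dump probably doesn't carry.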
Sounds like you'd enjoy reading this guy's stuff:
The storage format of a dataset can make a big difference in memory usage.
It makes for an interesting exercise to load in your data, do your analytics, and then store out the meta data. I wonder if the oil and gas people are looking at this for pre-processing their seismic data dumps.
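As a toy illustration of how much the storage format matters, compare a million epoch timestamps held as individual Python string objects versus a packed integer array (exact sizes are CPython-specific):

```python
# Same data, two storage formats: per-object strings vs. a packed array.
import sys
from array import array

n = 1_000_000
as_strings = [str(1438387200 + i) for i in range(n)]       # one object each
as_ints = array("q", range(1438387200, 1438387200 + n))    # 8 bytes each

str_bytes = sys.getsizeof(as_strings) + sum(map(sys.getsizeof, as_strings))
int_bytes = sys.getsizeof(as_ints)
print(f"strings: ~{str_bytes / 1e6:.0f} MB, packed ints: ~{int_bytes / 1e6:.0f} MB")
```

Roughly an 8x difference here, before you even reach for compression or columnar formats.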
Why don't you compare aws to building your own cpu?
Is your definition of "aws-believer" someone who feels that AWS is a superior solution for deploying a web-facing application in all cases? Does your definition include economics (like $/month vs. requests/month vs. latency)?
Can I assume that you consider comparing AWS to building your own CPU as an apples to oranges comparison? I certainly do, because I define a CPU to be a small component part of a distributed system hosting a web facing application.
Building your own datacenters is also an alternative to aws-ec2, but you go to that step by step (I think?) (dedicated > colocation > datacenter). In some cases when you have crazy growth you can skip a/some steps (ex: dropbox going from aws to their own datacenter)
They don't even compare them in whatever $/request/latency/$metric since they don't even mention them. And dedicated has also many options from low-price+no-support-shitty-network to high price/good-network/support etc.
Or load a data set in this monster and then use GPU workers to hit it?
Basically you take waves that are transiting the area of interest and do transforms on them to ascertain the structure underground. Dave Hitz of NetApp used to joke these guys have great compression algorithm, they can convert a terabyte of data into 1 bit (oil/no-oil).
One of the challenges is that the algorithms are running in a volume of space, so 'nearest neighbor' in terms of samples has more than 8 vectors.
In the early 2000's they would stream their raw data off tape cartridges into a beowulf type cluster, process it, and then store the post processed (and smaller) data to storage arrays. Then that post processed data would go through its own round of processing. One of their challenges was that they ended up duplicating the data on multiple nodes because they needed it for their algorithm and it was too slow to fetch it across the network.
A single system image with a TB of memory would let them go back to some of their old mainframe algorithms which, I'm told, were much easier to maintain.
Can PostgreSQL/MySQL use such type of hardware efficiently and scale up vertically? Also can MemCached/Redis use all this RAM effectively?
I am genuinely interested in knowing this. Most of the times I work on small apps and don't have access to anything more than 16GB RAM on regular basis.
I set up a handful of pgsql and Windows servers around this size. SQL Server at the time scaled better with memory. Pgsql never really got faster after a certain point, but with a lot of cores it handled tons of connections gracefully.
PostgreSQL scales nicely here. Main thing you're getting is a huge disk cache. Makes repeated queries nice and fast. Still I/O bound to some extent though.
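For what it's worth, on a box like this most of the tuning is about telling Postgres how much cache exists. A hypothetical postgresql.conf sketch (values are rough rules of thumb for a 2TB machine, not recommendations):

```
shared_buffers = 64GB            # Postgres' own buffer cache; bigger isn't always better
effective_cache_size = 1500GB    # hint to the planner that the OS page cache is enormous
work_mem = 256MB                 # per sort/hash per connection, so be conservative
maintenance_work_mem = 8GB       # for VACUUM, CREATE INDEX, etc.
```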
Redis will scale nicely as well. But it won't be I/O bound.
Honestly, if you really need 1TB+ it's usually going to be for numerically intensive code. This kind of code is generally written to be highly vectorizable so the hardware prefetcher will usually mask memory access latency and you get massive speedups by having your entire dataset in memory. Algorithms that can memoize heavily also benefit greatly.
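The memoization point in miniature: with RAM to burn, caching intermediate results can turn an exponential-time recursion into a linear one. A toy sketch:

```python
# lru_cache trades memory for time; with a huge heap you can afford to
# cache aggressively. (Toy example: counting 1- and 2-step stair climbs.)
from functools import lru_cache

@lru_cache(maxsize=None)   # unbounded cache: fine when RAM is plentiful
def ways(n):
    return 1 if n <= 1 else ways(n - 1) + ways(n - 2)

print(ways(300))  # instantaneous with the cache, hopeless without it
```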
No idea about MySQL, people tend to scale that out rather than up.
Scaling for performance reasons: Past a certain point, many workloads become difficult to scale due to limitations in the database process scheduler and various internals such as auto increment implementation and locking strategy. As you scale up, it's common to spend increasing percentages of your time sitting on a spinlock, with the result that diminishing returns start to kick in pretty hard.
Scaling for dataset size reasons: Still a bit complex, but generally more successful. For example, to avoid various nasty effects from having to handle IO operations on very large files, you need to start splitting your tables out into multiple files, and the sharding key for that can be hard to get right. But MySQL
In short, it's not impossible, but you need to be very careful with your schema and query design. In practice, this rarely happens because it's usually cheaper (in terms of engineering effort) to scale out rather than up.
It's easy to poke at Java for being a hog when in reality it's just poor coding and operating practices that lead to bloated runtime behavior.
After spending 4 days trying to diagnose a problem with hbase given the two errors "No region found" and "No table provided" and finally figuring out it was due to a version mismatch I now believe it is the culture.
At the very least you should be printing a WARN when you connect to an incompatible version.
I've never seen another community that would actually have a builder for the configuration for the factory for the settings for a class.
If you want bad logging, look at most PHP projects...
For really thoughtful logging, the Apache HTTP client and HikariCP are good Java examples.
Both were written in Java.
And don't get me started on Forte (developed by Sun itself, no less). It was even slower and more memory-hungry than JBuilder.
Sorry, but you just made my day. :P
You could work around it by using shared memory regions and the like but then you're doing a lot of extra work.
With a managed language and a bit of care around exception handling, you can write code that's pretty much invincible without much effort because you can't corrupt things arbitrarily.
Also, depending on the dataset in question you might find that things shrink. The latest HotSpots can deduplicate strings in memory as they garbage collect. If your dataset has a lot of repeated strings then you effectively get an interning scheme for free. I don't know if G1 can really work well with over 1TB of heap, though. I've only ever heard of it going up to a few hundred gigabytes.
The JVM has crashed on me in the past (as in hard crash, not a Java exception). Less often than the C++ programs I write do? Yes, but I of course I wouldn't test a program on a 1TB dataset before ironing out all the kinks.
>The latest HotSpots can deduplicate strings in memory as they garbage collect
Obviously, when working with huge datasets I would implement some kind of string deduplication myself. Most likely even a special string class and a memory-allocation scheme optimized for write-once, read-many access and cache friendliness.
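A minimal sketch of such a hand-rolled deduplication scheme (the class and its API are hypothetical, not any particular library; Python's own sys.intern does something similar for identifiers):

```python
# Write-once string pool: identical strings collapse to one shared object,
# so repeated values cost a pointer each instead of a full copy.
class StringPool:
    def __init__(self):
        self._pool = {}

    def intern(self, s):
        """Return the canonical copy of s, storing it on first sight."""
        return self._pool.setdefault(s, s)

pool = StringPool()
prefix = "r/"
a = pool.intern(prefix + "datasets")   # built at runtime: a fresh object
b = pool.intern(prefix + "datasets")   # ...deduplicated to the first copy
print(a is b)  # True
```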
Or I would use memory mapping for the input file and let the OS's virtual memory management sort it out.
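A quick sketch of the mmap approach with Python's stdlib; the OS pages in only what's touched and can evict under memory pressure, so the file can exceed physical RAM:

```python
# Stream over a file via mmap without materializing it in process memory.
import mmap

def count_lines(path):
    """Count newline-delimited records in a (non-empty) file via mmap."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0,
                                          access=mmap.ACCESS_READ) as mm:
        count = 0
        pos = mm.find(b"\n")
        while pos != -1:
            count += 1
            pos = mm.find(b"\n", pos + 1)
        return count
```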
But the real impact is if you want to be mutating that data set. The default behaviour of "tear down the entire process on error" can of course be worked around even if you ignore data corruption errors, but not having to do things by hand is the point of managed runtimes in the first place.
IIRC, the images for the site were closer to 7-8TB, but I don't know how typical that is for other types of sites, and caching every image on the site in memory is pretty impractical... just the same... damn.
The reason is so if you fuck up a scaling script for example you can't launch 1000 machines and take all the capacity and then bitch that you won't pay for it.
It's a stop gap.
However, aside from the hard limit of 100 S3 buckets, all other limits are configurable at the request of your AWS rep.
If practice in recent decades has taught us anything, it's that performance is found in intelligently using the cache. In a multi-core concurrent world, our tools should be biased towards pass by value, allocation on the stack/avoiding allocating on the heap, and avoiding chasing pointers and branching just to facilitate code organization.
EDIT: Or, as placybordeaux puts it more succinctly in a nephew comment, "VM or culture? It's the culture."
EDIT: It just occurred to me -- Programming suffers from a worship of Context-Free "Clever"!
Whether or not a particular pattern or decision is smart is highly dependent on context. (In the general sense, not the function-call one.) The difficulty with programming is that context is often very involved and hard to convey in media. As a result, a whole lot of arguments are made for or against patterns/paradigms/languages using largely context-free examples.
This is why we end up in so many meaningless arguments akin to, "What is the ultimate bladed weapon?" That's simply a meaningless question, because the effectiveness of such items is very highly dependent on context. (Look up Matt Easton on YouTube.)
The analogy works in terms of the degree of fanboi nonsense.
Still really cool to see something like this; I didn't even know you could get close to 2TB of RAM in a single server at any kind of scale.
Or significantly higher if you don't restrict yourself to single-system-image, shared-memory machines: there are at least two 1,300-1,500 TB systems on the Top 500 list.
Do a little research before implying that there's no way that Java can address gigantic heaps.
Scala _beats_ Java in most of the benchmarks: http://benchmarksgame.alioth.debian.org/u64q/scala.html
Not according to that data!
* $117K / year on-demand
* $81K / year for one-year commitment, nothing up front
* $69K / year for one-year commitment, $34,285 up front
* $67K / year for one-year commitment, $67,199 up front
* $35K / year for three-year commitment, $52,166 up front
* $33K / year for three-year commitment, $98,072 up front
Plus, eventually, the spot market, and of course you can save money with on-demand if you only need the instance occasionally.
That's what I consider in the "free or nearly free" tier, off hand. The other benefits come with being able to interface seamlessly and quickly (same infrastructure) to the rest of AWS services.
You might do better at finding a piece of hardware that does that (and I'm curious now what a 2TB RAM server goes for) but I think you'd be hard pressed to find a way to start from scratch to deploy that instance and all of the services that come with it for under that price. People with on-prem compute likely have some of that already, but the value here is that you could request an X1 today without ever having been a customer before, and you'll get all of that, and access to more, just with that one instance.
If that's not a good value proposition today, then I'd say wait just a few months. Today probably marks the highest price anyone will ever see for an X1. Given past history, it's just going to go downhill from here.
If you need to do some analysis or computation on a massive data set once a month or something, it's going to be cheaper to pay $5k/yr (assuming you run for 24 hours a month and don't make use of the spot market) than to purchase and maintain the hardware and infrastructure.
If your unique snowflake in your own datacenter (don't forget to factor the physical space and your datacenter personnel into your costs) doesn't work well, it can mean replacement, additional costs, and downtime. If the AWS instance has a hiccup, terminate and replace (I'm not saying that's going to be trivial either at a 2TB-RAM instance size).
A big hunk of hardware of your own also represents significant CapEx and a depreciating asset for business. Spinning this monster up costs you $13 to start testing immediately, and you can walk away from it at any time. That's worth a lot.
For some use cases, the pricing of AWS makes sense.
How many hours can I run the machine before it costs more than building one? Probably a month? (random guess) Will I be running it longer than that?
If yes, build the machine, if no, just rent it from AWS.
I can't even imagine a scenario where I run this 24 hours a day 365 that I wouldn't build out a Hadoop cluster or similar for.
if (hourly rate * 24 * 90) > y

So at $13.338/hour (assuming no reserved instance), 90 days of round-the-clock use runs about $28,800, so if y is less than that you make your money back in 3 months. Of course a machine with those stats will cost more than $28,800.
It looks like one of these bad boys would set you back about $40000. So you'd break even at 4 months. If you're going to go for the 3 year reserved instance, with $100K upfront, you'd be way better off on capex and opex just buying and colocating the thing (not considering other expenses.)
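Checking the parent's numbers: at $13.338/hr on-demand against a ~$40,000 box (ignoring colo, power, and people, as noted):

```python
# Naive break-even: months of 24/7 on-demand before the hardware price wins.
hourly = 13.338        # x1 on-demand $/hr (from the thread)
hardware = 40_000      # rough price of a comparable box (parent's estimate)

monthly = hourly * 24 * 30
breakeven_months = hardware / monthly
print(f"~${monthly:,.0f}/month on demand; break-even at {breakeven_months:.1f} months")
```

About $9,600 a month, so the four-month break-even figure above checks out.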
The bigger you grow the more the balance tips towards do it yourself because your costs for colocation, managing hardware, etc amortize out better with scale. AWS gets a little better with scale but not nearly as quickly.
I'll give you a case in point. I have a friend who works for a construction company. He has three servers he's cobbled together, all at one site, and a NAS for storage. They provide email and document sharing, accounts management, etc. Every new blueprint that comes through needs to be reviewed for accuracy, but they come in as images, so he has another server that just runs tesseract OCR when he gets a PDF in. He asked me if he should get three servers, or one and virtualize the three. He wanted to plan for growth, but didn't know what that would be. The NAS was underutilized.
He's buying no servers. He's setting up Workmail and Workdocs for their users. (Workspaces is an option in the future.) Blueprints are uploaded to S3 as they come in, where the file arrival notification triggers a lambda function that performs an OCR pass on the file and dumps it into a new bucket. Because this function runs on-demand, he avoids paying for a server to sit there, and only pays a few cents for a doc. His other services no longer run on servers he has to manage, so he doesn't worry about patching. He has reliable, off-site backups for the first time and version control, and he only pays for the exact amount of storage he's using as opposed to buying space for his NAS in advance. And he doesn't need to worry about scaling.
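For the curious, the S3-to-Lambda leg of that setup looks roughly like this. This is a hedged sketch: the bucket name and the ocr() helper are hypothetical, and the real thing needs boto3 plus an OCR binding packaged with the function.

```python
# Sketch of an S3 put-notification handler that OCRs new uploads.
import urllib.parse

OUTPUT_BUCKET = "blueprints-ocr-output"   # hypothetical bucket name

def parse_s3_event(event):
    """Extract (bucket, key) pairs from a standard S3 notification event."""
    records = []
    for r in event.get("Records", []):
        bucket = r["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(r["s3"]["object"]["key"])
        records.append((bucket, key))
    return records

def handler(event, context):
    import boto3                          # provided in the Lambda runtime
    s3 = boto3.client("s3")
    for bucket, key in parse_s3_event(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        text = ocr(body)                  # hypothetical OCR helper
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key + ".txt", Body=text)
```

The function only runs (and only bills) when an object lands in the bucket, which is the "no server sitting there" point above.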
That story plays out all the time. It's arguably a bigger savings for smaller people who can't afford the upfront costs than large corporations that already are likely to have massive sunk costs in datacenter space and assets. If you're a small business, the moment you go cloud you not only save money, but you gain capabilities (the biggest being in redundancy and reliability, which are the most important...how many small businesses have a DR/COOP capability on-prem?) which you would never have had otherwise.
Well, first let me say that I have been doing all of this (as a business person who knows tech and Unix) for quite some time, and what you wrote above spins my head because I don't know much about AWS (though I can easily turn up a server if needed). My point is that what you are describing for a small-business person requires a tech person knowledgeable with AWS to implement (and keep on top of), which is the "cost," as opposed to perhaps saving money in this particular example for this particular customer. I am not doubting that AWS (which I see you work for in your profile) is good for certain types of uses (such as perhaps the example you are giving). But you still need a knowledgeable tech guy to keep it all working; it is not like using webmail vs. running your own email server (IMAP, SMTP, etc.).
Additionally, once you are on AWS you are pretty much locked into AWS by the way (as you describe) things are set up. I am not convinced that at some time in the future AWS will not change their pricing to take advantage of the lock-in that they have (even though now they continually drop prices), or add extra fees or whatever. Being in business so many years I have seen this happen, and changing from AWS will be near impossible. (In other words, for people on AWS such as you describe, there is no easy alternative competition for a system once designed and specific to AWS in the way you detail.)
This is just FUD. AWS prices are always lowering, there is never a time they have increased. Even if they increased their prices, Google, Rackspace and Microsoft would eat their lunch. There is plenty of competition in the computing space.
> Being in business so many years I have seen this happen and changing from AWS will be near impossible. (In other words for people on AWS such as you describe there is no easy alternative competition for a system once designed and specific to AWS in the way you detail).
If your solutions are so narrowly defined that they won't run on anything else than AWS, then you are doing it wrong. You might run on specific AWS services, but there should be no reason why someone could not recreate their solutions over on Google Compute or other services. Yes, you might have to rework some of your solutions, but it should be very doable.
The prices are lowered, but AWS is also continuously introducing new more powerful instance types and deprecates older ones.
> If your solutions are so narrowly defined that they won't run on anything else than AWS, then you are doing it wrong. You might run on specific AWS services, but there should be no reason why someone could not recreate their solutions over on Google Compute or other services.
AWS has many solutions that are different from what you are used to. For example, there is no NFS (well, there is EFS, which is supposed to be similar, but it has been in preview for at least a year now). You're forced to improvise and use their solutions. Typically S3, which is not exactly a 1:1 alternative to NFS, so you'll need to rework your applications to fit that model.
If you want to move to something else, you no longer will have these services available.
> Yes, you might have to rework some of your solutions, but it should be very doable.
Exactly, that's what's pointed out. You'll have to rewrite your applications. There's no lock in that can't be solved by rewriting. The problem is that the rewrite might be quite expensive.
It may seem that way, but in actuality, it really doesn't. In the case I described, their "tech" guy was someone who'd been pushed into the role by necessity. He had the most knowledge out of a body of people who didn't really make it their area of expertise, and became the IT guy as a result.
He did a good enough job working on his own. But there are things he's going to miss, and it's a lot to keep up with if you want to run an infrastructure right. As you pointed out, running a Software-as-a-Service solution like webmail is a lot different from Infrastructure- or Platform-as-a-Service.
But that's the nice thing about AWS (and, to be fair, Azure and Google do this quite well too): there's a solution for your tech level. If he were really knowledgeable and wanted to roll his own mail solution, he could stand up instances running sendmail, put his own MX records into our DNS service, and manage all of the bells and whistles. Or he can use Workmail (or Gmail for business, if you want to be agnostic) and point-and-click his way to getting mail set up.
So there's a lot of range there, and the higher you go up the stack, the more management you hand off to the cloud provider at the (hopefully small) expense of control. But if the provider's offering fulfils all of your requirements, then it's a great deal for small, non-technical shops, who can offload their workloads onto a provider that is expert at managing infrastructure, and you, the business owner, can create the occasional account and otherwise focus on business.
There's a great chart on this you might have seen, but it looks something like this: http://1u88jj3r4db2x4txp44yqfj1.wpengine.netdna-cdn.com/wp-c...
In any case, there's a solution for everyone depending on their skillset, and arguably, (in my humble yet biased opinion =P ) AWS has the best spread. If I had one self-criticism about us, it's that it's not always intuitive as to which of our services fall in the SAAS/PAAS/IAAS stack. Education will always be a challenge no matter who the provider is, and being that cloud is still relatively young, it's up to AWS, Microsoft, Google, and everyone else to explain just what "cloud" means and how to start with it.
As for locked in, it’s AWS’ goal to keep you by ensuring you /want/ to use the service, not because you’re forced to be here. There’s no lock in other than transit fees. Which, if you have thousands of terabytes, can be hefty. That’s bandwidth for you. But there’s nothing proprietary. Take your code and data with you. I’ve seen engineers walk customers through the process of getting their stuff out. It happens. Not often, but it does, and we’ll help with that too.
OK, that is something I actually need to do. I have a project now to bring up an SMTP and IMAP server, and I can put it on the colo box or do it on AWS. Is there a specific step-by-step guide to doing this with AWS? Quite frankly, with all of the things you do there I wouldn't even know where to start. OTOH, if I go to, say, Rackspace, I can easily and quickly spin up a server (with RAID) and not have to give any thought to doing so, because it's a CentOS server running on a VPS. With AWS I don't know the difference (and would have to learn) as far as the different instances, availability zones, all of that (and quite frankly I have little time to do that, so I would typically go with what I already know). Make sense?
 So what I am saying is I can do this on a colo box but don't even know where to begin to do it on AWS. I don't mean I need handholding to install and get the server working, just to get it working given the various offerings on AWS.
Is that all the server will be used for? If you just want to run a Linux server 24x7 you will not be taking advantage of any AWS services, so you would probably be better off getting a colo box or VPS to act as your mail server.
If you really want to run this on AWS, you just need to spin up an EC2 instance, and configure the Security Group (think of it like a firewall). There is a wizard which will guide you through the process. Once the instance is running you can (optionally) use Route 53 to set up DNS records for your new instance.
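If you'd rather script it than click through the wizard, the boto3 equivalent is roughly this sketch (the AMI id, key-pair name, and region are placeholders, not real values):

```python
# Rough boto3 version of the console wizard: open the mail ports in a
# Security Group, then launch one instance into it.
MAIL_PORTS = [25, 143, 587, 993]   # SMTP, IMAP, submission, IMAPS

def mail_ingress_rules(ports, cidr="0.0.0.0/0"):
    """Build the IpPermissions list authorize_security_group_ingress expects."""
    return [{"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
             "IpRanges": [{"CidrIp": cidr}]} for p in ports]

def launch_mail_server():
    import boto3
    ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region
    sg = ec2.create_security_group(GroupName="mail",
                                   Description="mail server ports")
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"], IpPermissions=mail_ingress_rules(MAIL_PORTS))
    ec2.run_instances(ImageId="ami-xxxxxxxx",            # placeholder AMI
                      InstanceType="t2.micro", MinCount=1, MaxCount=1,
                      KeyName="my-key",                  # placeholder key pair
                      SecurityGroupIds=[sg["GroupId"]])
```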
Also, feel free to buy my book :-) It starts with the basics like launching instances, and then moves on to more interesting AWS-specific features.
If you have any other questions, feel free to reach out to our forums, or drop me a line. =)
If you can save an FTE admin, you're saving at least $10K/month; that can cover the hosting and bandwidth for a pretty big site/software/service on AWS and the like.
If you have a whole bunch of memory optimized instances, this can allow you to simplify and consolidate and still save you money. Still, don't put all your eggs in one virtual basket.
Big data version of Cannonball Run.
I'm sure you would also get a hefty discount if you were a customer who regularly buys servers that cost that much.
Every time I look at AWS, it just doesn't make sense from a financial standpoint (even after you add another machine for redundancy, and remote hands -- you're ahead after 12 months).
Having this type of instance available via an API call within seconds is really cool.
(Not that I don't understand that this could be said of many things, and might actually be true in economic terms, etc. It just never occurred to me with other heavy machinery.)
We've experimented with something similar on Google Cloud, where an instance that is considered dead has its IP address and persistent disks taken away, then attached to another (live or just-created) instance. It's hard to say whether this can recover from all failures without having experienced them, or even work better than what Google claims it already does (moving failing servers from hardware to hardware). Anyone with practical experience in this type of recovery where you don't duplicate your resource requirements?
Not too surprising, given how close SAP and Amazon AWS have been ever since SAP started offering cloud solutions. Going back a couple of years, when SAP HANA was still in its infancy and they were trying it on servers with 20~100+ TB of memory, this seems like an obvious progression.
Of course there's always the barrier of AWS pricing.
That being said, a three year commitment is still hard to swallow compared to dedicated servers that are month-to-month.
18 (cores) * 4 (CPUs) * 2 (HT) = 144
"Ben we're not utilizing all the ram."
"Add another for loop."
Something volatile running on a RAM disk, maybe?
Edit, y'all don't get the reference: famous computer urban legend...
To answer the question at hand though, MySQL seems great at scaling up with larger hardware, as does Redis (from what I've seen, it almost seems to get more efficient at sizes >500GB). MongoDB (my go-to database) occasionally doesn't scale very nicely off the bat, but after some configuration it too scales very well, possibly even better than MySQL.
That's what makes us friends.
I'm not here to prove myself - I participate in the tech community here and it's like everyone like you just tries to put on some superiority complex and act like tech is ONLY serious business.
Get over yourself.
I'd be interested in hearing what Gates has to say about it, though.

"640 kB ought to be enough for anybody"