Much easier to request a client provisions 20 of their standard machines, or get them from AWS. People don't like custom hardware, and for good reason.
I just speced out a 6TB Dell server. Price? It is already at $600K, and I haven't fully speced it out yet (just processor, memory, drive). Maybe that memory requirement is high (though it is about what I would need); 1TB is somewhat over $200K.
For the right situation that sort of thing maybe makes sense, though I'm SOL if I need high availability (power out, internet flakey, RAM chip goes bad, etc leaves me dead in the water).
So I would need to stand up a million or more in equipment in several places, or just use AWS and suffer the scorn of someone saying 'you could have put that in RAM'. Yes. Yes I could have.
Well, I keep arguing against that, because you still get 90%+ of the maintenance work, plus some new maintenance work you didn't have before, to avoid some some relatively minor hardware maintenance. And you can get most of the benefits of non-cloud deployment with managed hosting where you never have to touch the hardware yourself.
I work both on "cloud only" setups and on physical hardware sitting in racks I manage, and you know what? The operational effort for the cloud setup is far higher even considering it costs me 1.5 hours in just travel time (combined both ways) every time I need to visit the data centre.
For starters, while servers fail and require manual maintenance, those failures are rare compared to the litany of issues I have to protect against in cloud setups because they happen often enough to be a problem. (The majority of the servers I deal with have uptimes in the multi-year range; average server failure rate is low enough that maintenance cost per server is in the single digit percentage of server and hosting costs). Secondly I have to fight against all kind of design issues with the specific cloud providers that are often sub-optimal and require extra effort (e.g. I lose flexibility to pick the best hardware configurations).
Cloud services have their place, but far too many people just assumes they're going to be cheaper, and proceed to spend three times as much what it'd cost them to just buy or lease some hardware, or rent managed hosting services.
Even if you don't want to maintain your own hardware, AWS is almost never cost effective if you keep instances alive more than 6-8 hours of the day in general. Your mileage may wary, of course.
I am reasonably happy configuring an old school Linux box. Heroku is much more of a pain in the arse to deploy to in my experience, despite much of the work being done for you already. Debugging deployment issues is particularly painful.
This assumes that a) You're in the valley where average developer salary is $10k a month or more, b) You're a large company paying developers that salary.
There are lots of other places where a) Developers are cheaper, or b) You're a cash strapped startup whose developers are the founder(s) working for free.
That's not the only option. You can rent a cluster for a lot cheaper.
If you could do the same job on a fleet of 8-16GB servers.. you can get a lot more CPU for a lot less dollars. Depends if you really need everything on 1 machine or not (as of course nothing will beat same machine in memory locality)
softlayer, dual Xeon 2000 Series: 512GB, $1,823.00/mo (3.56 $/gb/mo)
these are on-demand prices. pre-pay, or use a term discount, and its cheaper.
Build it yourself: You can build a Dell or similar on a 2-Xeon-proc (E5 series), your main limit is getting good prices on 16x 32GB DIMMS. But lets say you can buy the RAM for ~$6500, then its just dependent on the rest of your kit, lets say $10,000 flat for the whole server. $277.77/mo over 36 months, but you still need network infrastructure, and you might want a new one in 12 months, but you get the general idea.
 - http://www.rackspace.com/en-us/cloud/servers/onmetal
I originally saw it on HN, but almost two years ago. Old comments: https://news.ycombinator.com/item?id=6398650
If your working set is small, say 1TB -- and so fits in RAM -- for what kind of problems would using SQL be so much of an overhead that you need a different approach? And what would that approach be?
I suppose you could have a massive set of linear equations that you might be able to fit into 1TB of RAM, but would be difficult to work with as tables in Postgres?
I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".
It's very much dependent on how frequently you update the data and whether or not (re)loading the data or updating your structure in memory can be done efficient or not to determine whether or not such an approach is useful or not but going from 'too long to wait for' to 'near instant' for the result of a query is a nice gain.
In the end 'programmer efficiency' versus 'program efficiency' is one trade-off and cost of the hardware to operate the solution on is another. Making those trade-offs and determining the optimum can be hard.
But a rule of thumb is that a solution built up out of generic building blocks will usually be slower, easier to set up, will use more power and will be more expensive to operate but cheaper to build initially than a custom solution that is more optimal over the longer term.
So for a one-off analysis such a custom solution would never fly, but if you need to run your queries many 100's of times per second and the power bill is something that worries you then a more optimal solution might be worth investing in.
Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could probably get a magnitude or two of improvement by diving down outside of SQL -- but there's a big difference between using the schema you have, and setting up a dedicated system.
I was perhaps not clear, but I was wondering if it's not often the case with real-world data, that you could create a new, tailored schema in SQL, and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.
I suppose one could argue when something stops being SQL, and start becomming SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead.". The implication being (as I understood it, at least) -- that you not only have to change how the date is stored and queried, but that SQL databases are unsuitable to that that task -- and that by going away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.
I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Are using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".
If it fits in memory, it's going to be magnitudes faster to work with than on any other infrastructure you can build.
So the trick is, you take their "big data problem" and hand them a server where everything can be hot in memory and their problem no longer exists.
Re-do the same thing using an optimal algorithm operating on a compact datastructure and you make it look easy, fast and cheap. Of course you're not going to make nearly as much money.