The subtext is that running a fancy distributed system is more exciting and beneficial for ones resume than simply buying a massive bloody server and putting postgres on it, and that people are making tech decisions on this basis.
This of course ignores that it's much easier to get your hands on a cluster of average machines than one massive bloody server, and all the non-performance-oriented benefits of running a cluster (availability etc.).
Much easier to request a client provisions 20 of their standard machines, or get them from AWS. People don't like custom hardware, and for good reason.
Yes. Just yesterday in some other story everyone was arguing for the cloud because who wants to maintain their own hardware? This morning the hue and cry is "slap more RAN in that puppy".
I just speced out a 6TB Dell server. Price? It is already at $600K, and I haven't fully speced it out yet (just processor, memory, drive). Maybe that memory requirement is high (though it is about what I would need); 1TB is somewhat over $200K.
For the right situation that sort of thing maybe makes sense, though I'm SOL if I need high availability (power out, internet flakey, RAM chip goes bad, etc leaves me dead in the water).
So I would need to stand up a million or more in equipment in several places, or just use AWS and suffer the scorn of someone saying 'you could have put that in RAM'. Yes. Yes I could have.
> everyone was arguing for the cloud because who wants to maintain their own hardware?
Well, I keep arguing against that, because you still get 90%+ of the maintenance work, plus some new maintenance work you didn't have before, to avoid some some relatively minor hardware maintenance. And you can get most of the benefits of non-cloud deployment with managed hosting where you never have to touch the hardware yourself.
I work both on "cloud only" setups and on physical hardware sitting in racks I manage, and you know what? The operational effort for the cloud setup is far higher even considering it costs me 1.5 hours in just travel time (combined both ways) every time I need to visit the data centre.
For starters, while servers fail and require manual maintenance, those failures are rare compared to the litany of issues I have to protect against in cloud setups because they happen often enough to be a problem. (The majority of the servers I deal with have uptimes in the multi-year range; average server failure rate is low enough that maintenance cost per server is in the single digit percentage of server and hosting costs). Secondly I have to fight against all kind of design issues with the specific cloud providers that are often sub-optimal and require extra effort (e.g. I lose flexibility to pick the best hardware configurations).
Cloud services have their place, but far too many people just assumes they're going to be cheaper, and proceed to spend three times as much what it'd cost them to just buy or lease some hardware, or rent managed hosting services.
Even if you don't want to maintain your own hardware, AWS is almost never cost effective if you keep instances alive more than 6-8 hours of the day in general. Your mileage may wary, of course.
"The cloud" is basically a new non standard OS to learn.
I am reasonably happy configuring an old school Linux box. Heroku is much more of a pain in the arse to deploy to in my experience, despite much of the work being done for you already. Debugging deployment issues is particularly painful.
Depends on what you mean by massive bloody server. You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost developing software to run on a cluster.
> You can get a server with a terabyte of RAM for a price that's insignificant compared to the cost developing software to run on a cluster.
This assumes that a) You're in the valley where average developer salary is $10k a month or more, b) You're a large company paying developers that salary.
There are lots of other places where a) Developers are cheaper, or b) You're a cash strapped startup whose developers are the founder(s) working for free.
Comparison still holds, because if you buy a cluster with X amount of RAM the price will be roughly the same as a single server with X amount of RAM. Except that for some large X there won't be any off the shelf servers you can buy with that amount of RAM (let's say 2000GB), but lets be honest here, 99% of companies needs are under that X especially if we're talking about startups.
If you could do the same job on a fleet of 8-16GB servers.. you can get a lot more CPU for a lot less dollars. Depends if you really need everything on 1 machine or not (as of course nothing will beat same machine in memory locality)
Not true, 8x16GB costs as much as 1x256 on Amazon. The issue here is that Amazon is hilariously expensive in general. Hetzner will rent you a 256GB server for €460 per month. Or you can buy one from Dell for $5000. These are not high numbers, in 1990 you paid more than that for a "cheap" home computer. For the price of a floppy drive back then you can now get a 32GB server.
these are on-demand prices. pre-pay, or use a term discount, and its cheaper.
Build it yourself: You can build a Dell or similar on a 2-Xeon-proc (E5 series), your main limit is getting good prices on 16x 32GB DIMMS. But lets say you can buy the RAM for ~$6500, then its just dependent on the rest of your kit, lets say $10,000 flat for the whole server. $277.77/mo over 36 months, but you still need network infrastructure, and you might want a new one in 12 months, but you get the general idea.
and fwiw, costs at amazon will scale linearly with resources. the 1 beefy box with 256GB RAM box costs about as much as 16 boxes with 16GB of RAM each.
There is no point deploying a heavy, complex (and usually pretty slow due to the overheads involved) distributed database, when you could just buy a server with xTB ram, load any sql database on it, and run your queries in a fraction of the time. If your data is so large that it can't fit in the RAM of a single machine, then distributed databases make more sense (since loading data off disk is very slow, modulo SSD).
If your working set is small, say 1TB -- and so fits in RAM -- for what kind of problems would using SQL be so much of an overhead that you need a different approach? And what would that approach be?
I suppose you could have a massive set of linear equations that you might be able to fit into 1TB of RAM, but would be difficult to work with as tables in Postgres?
Take a graph, you could use an SQL database to store it and do your graph analysis using SQL, or, alternatively, you could convert your graph to an extremely compact in-memory format and then do your analysis on that. Much better efficiency for the same size problem, bonus: you can now analyze much larger graphs with the same hardware.
I appreciate you taking the time to answer -- and I get that there's a reason for why we have graph databases. But I really meant something more concrete, as in here's a real-world example that isn't feasible to do on machine X with postgresql, but easy(ish) with a proper graph structure/db -- rather than "not all data structures are easy to map to database tables in a space-efficient manner".
Ok, one more example: A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.
It's very much dependent on how frequently you update the data and whether or not (re)loading the data or updating your structure in memory can be done efficient or not to determine whether or not such an approach is useful or not but going from 'too long to wait for' to 'near instant' for the result of a query is a nice gain.
In the end 'programmer efficiency' versus 'program efficiency' is one trade-off and cost of the hardware to operate the solution on is another. Making those trade-offs and determining the optimum can be hard.
But a rule of thumb is that a solution built up out of generic building blocks will usually be slower, easier to set up, will use more power and will be more expensive to operate but cheaper to build initially than a custom solution that is more optimal over the longer term.
So for a one-off analysis such a custom solution would never fly, but if you need to run your queries many 100's of times per second and the power bill is something that worries you then a more optimal solution might be worth investing in.
> A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.
Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could probably get a magnitude or two of improvement by diving down outside of SQL -- but there's a big difference between using the schema you have, and setting up a dedicated system.
I was perhaps not clear, but I was wondering if it's not often the case with real-world data, that you could create a new, tailored schema in SQL, and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.
I suppose one could argue when something stops being SQL, and start becomming SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead.". The implication being (as I understood it, at least) -- that you not only have to change how the date is stored and queried, but that SQL databases are unsuitable to that that task -- and that by going away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.
I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Are using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".
Yeah, sorry about that :-) I really do appreciate your input -- and the previously linked page-rank example show how using manual data layout can squeeze a problem down -- but I still wonder how many of those techniques could've been made to work with SQL -- and what the resulting overhead would've been. Size on disk would probably have been a problem -- AFAIK postgres tend to compress data a bit, but probably nowhere near as much as manual rle+gzip. But I wonder how far one could've gotten with just hosting the db on a compressed filesystem...
Right, RAM an order of magnitude faster than disk, so calculations will be performed very quickly. Big data usually implies clusters of servers because the data won't fit on one server (even on the disk).
Big Data usually implies 'big dollars', not necessarily a large amount of data. Simply use some in-efficient algorithm and a datastore with sufficient overhead and you're in Big Data territory.
Re-do the same thing using an optimal algorithm operating on a compact datastructure and you make it look easy, fast and cheap. Of course you're not going to make nearly as much money.