I've seen corporate code bases consisting of entirely PL/SQL before, it's a nightmare. No thanks, from both a maintenance perspective and a performance perspective. YMMV.
SQL (and RDBMS) is good for three things:
- The relational model can model the nature of our data well.
- Transaction control, recovery, etc are notoriously hard to implement. Most people (I bet 99.99% of the programmers) would fail to design these systems. RDBMS takes care of these for you.
- Physical data independence: The declarative nature of SQL enables you to specify what data you want, rather than how you want them. The optimizer picks a logical and physical execution plan (that doesn't suck) for you. The same query from the 70s running on IBM mainframes can still be running today on say 1000 Amazon EC2 nodes, although the underlying computer and storage architecture is poles apart. (There are political reasons why this is not as true as it should be, but that's another story.)
Now of course, the features are double-edged swords. Some engineers dislike RDBMSes because they don't understand what is going on in the system. If you are Google, you have the engineering power and the business needs to design systems that precisely tailor to the nature of your data.
1) Are RDBMS useful at all, ever? Sure. If I don't have an extreme scaling need, I'm the first one to say throw MySQL on that box and call it a day, it's tested, it works, gives you a bunch of functionality for 5 minutes of installation, developers are used to it, easy.
2) Should you be putting application logic in the database? I argue no. I understand that this isn't unanimous, but you're sacrificing long term flexibility and scalability for some short-term convienence that isn't even that convenient. I mean, PL/SQL sucks. I'm not saying you shouldn't ever do a join in the database, or shouldn't ever normalize your data, but if you're writing a lot of PL/SQL code, it's probably gonna come back to bite you if you ever need to scale or change architectures.
RE: that query from the 70s -- that query from the 70s that will actually run across different implementations contains exactly the subset of SQL that I consider acceptable and worth using (minus limit and a couple other small fry, I guess). If it's more complicated than that, it should be app code IMO.
Actually, if you have to change architectures, you are probably bitten anyway.
If you need performance:
1. Do as much as you can in one SQL statement (unless you are using MySQL where the optimizer sucks)
2. If you can't do it in one SQL statement, use PL/SQL to do data processing as close as possible to the data - this will save on network load and PL/SQL is highly optimized to reduce overhead for embedded SQL statements.
With all the talk about "horizontal scalability" we sometimes forget the benefits of writing fast code in first place.
Performance really comes down to what work are you doing, where does it take place, and how many disks and CPUs can you get involved to spread the load. For small amounts of load, you could say there's some gain from pushing more work in the database but it doesn't matter anyways. Large load, you're making the DB a bottleneck.
Here's a neat trick - for every query SELECT * FROM TABLE, then filter and sort it in the application too.
And then implement the app logic in some slow-ass web language that doesn't hold a candle to your database's optimized C. What happens when your web server can't keep up? Buy another web server. That's an easily solvable problem. A saturated DB, on the other hand, is very hard to fix, especially if all of your application logic is running in it and would require a rewrite to scale.
FWIW I work on a "non-scalable" RDBMS that comfortably handles thousands of commits/sec, tens of thousands of selects/sec and tens of terabytes of data. All done with stored procs, one database, no need for "sharding" either. Will it scale indefinitely? No, but neither will anything else. The limits of the RDBMS are orders of magnitude greater than the NoSQL crowd think they are.
And, yeah, damn right I'm hesitant about a single piece of hardware that costs $200k and still leaves you with a SPOF. Effectively you're saying it's a 400k piece of hardware after replication. Oh, and Oracle replication software costs you an extra several hundred thousand. So for about a million dollars, yeah, unless it's incredibly transactional in nature, I'd give other options a very serious look.
I still think embedding all your app logic together with your underlying data in PL/SQL or any equivalents is a bad idea but hey that's for anyone to do and find out themselves :)
Agreed. There's a couple of orders of magnitude just from I/O improvement in hardware, using conventional hardware, configured intelligently.
There's at least another order of magnitude in optimizations that aren't possible in NoSQL's strawman, such as separate tablespaces for data and index, partial indexes, and a galaxy of query tuning from an EXPLAIN that actuall provides a query plan.
 Meaning commodity-priced, nothing fancier than $400 RAID cards and spinning disks.
Usually they learned how to use MySQL from thirdhand PHP & MySQL tutorials off somebody's blog or something, and thought it was representative of all RDBMSs.
Not saying everyone using "NoSQL" is poorly informed, just that sometimes peoples' impressions of performance aren't very accurate. It makes me suspicious when somebody's benchmark only uses MySQL.
Moore's law doesn't apply to RPMs of spinning disks.
Agreed, if you include fast interconects like SAS and exclude the network requirement of SANs.
sharding/distribution across multiple nodes another.
I disagree, for the sam reason that doing so with iSCSI over ethernet isn't: too much added latency.
Infiniband may help, but I have yet to try it empirically.
 Switching/routing, multiple initiators, distances longer than a few dozen meters.
This is only significant if one is limited to a trivial number of spinning disks. 20 years ago, with separate disk controllers, this was the case.
If you run some benchmarks, I expect you'll find that, for random I/O, N disks perform better than N times one of those disks.
SCSI provided (arguably) an order of magnitude for number of disks per system.
Now, SAS provides another. $8k will buy 100 disks (and enclosures, expanders, etc). How many IOPS is that?
ETA: The Fujitsu Eagle (my archetype of 20ish years ago disk technology) had, IIRC, an average access time of 28ms. If its sequential transfer rate was one 60-100th of modern disks, what fraction of a modern disk's 4k IOPS could it do?
PL/SQL, though, with global data reach and advanced locking states for every single transaction, make it really hard to move off of a single host. So it's more and more work to get more disks attached to that host, and CPU is a hard upper limit.
>The limits of the RDBMS are orders of magnitude greater than the NoSQL crowd think they are.
Wish I had more upvotes. Knowing what logic to put in stored procs, and what to put in the application code, so as to play to the strengths of the db and the strengths of the appserver, seems to be a lost art these days.
It's not that there's anything fundamentally wrong with the relational model, it's that, as formulated, it's really hard to split up into multiple machines.
A HDD is limited to well under a thousand reads per second, and there's multiple reads per SQL select, so I'll assume either yes or you're leaving something out. If the tens of thousands of reads are over a 4MB table that stays in memory and the rest of your 10TB are infrequently accessed, congrats, you have a single-node problem. If you actually need to deliver, say, 1,000 queries per second over a 10TB dataset? It's not happening in an RDBMS unless you get a whole bunch of SSDs.
Also, RE: websites, of course, substitute your environment's front end if you'd like.
In one of the code-word projects I saw an $25k system with an OLTP dataset do 1 billion rows in a second.
Edit: We also had statistics that said that 90% of our customers had less than 100GB of data. 99% had < 1 TB. The vast majority of database users shouldn't even be thinking about looking at non-RDBMS systems.
If your dataset fits in memory on one node, I'm all about using a database (prefer MySQL personally, might look at postgres now that it finally has replication).
Still doesn't add up to 10k select statements per second over a 10TB dataset on a singlenode. Even without writes, that's not happening. I call BS on grandparent post.
With a half-decent SAN with 15k drives and 4Gbit fibrechannel connections, you can get 1000+ IOPS without the storage system even breaking a sweat. Under load it can easily give 10 times that.
This is something that's everywhere in the business world.
Pair this with a bunch of cores and a few GB of memory, and you can have an RDBMS that chews through impressive amounts of data. Unless, of course, you optimize nothing and swamp it with lame queries that do nothing that table scans. Funny enough, the same people that are fine with doing everything in code are the ones that can't be bothered to think more than one second about what kind of queries they are throwing at the database.
Again, as I say, the NoSQL crowd have no idea about what the state of the art is in the RDBMS world.
it's awfully hard to even fit 10TB on a RAID1+0 setup
It would actually be hard for me to buy an array that small...
How many commodity servers could I buy for that?
I have a pretty solid idea what state of the art is in the RDBMS world - it's diminishing returns as a machine that's twice as powerful costs 10X as much, all the way up the enterprise ladder. It's spending 100k on your software licenses, 100k on your storage and 500 bucks on a CPU.
Not that there's anything wrong with that. It's ok. If your domain is highly transactional, it's probably a better move than implementing your own transactions over something else. Just don't pretend that your limitations are actually strengths -- you have your own strengths.
Because you can't compete at this level by chucking increasing amounts of anything at the problem - people, dollars, spindles, nodes, you name it.
If your problem is extremely transactional and legitimately unshardable, feel free to drop 6 mil on exadata. Or a half a mil on a database server and backup. But frankly, your objections are starting to have a religious feel to them. All I was saying is that PL/SQL is a pile of crap to code in and fundamentally unscalable without spending a boatload of money. A little better design can get the same thing with a lot less cash.
EDIT: No, those are facts, PL/SQL looks like it was designed in 1965 and, yes, putting all of your CPU processing into a single node is fundamentally unscalable. I've seen it. It was fundamentally unscalable.
I'm not making a religious point about RDBMS - it can be the best model in many situations. I'm making a point about single bottlenecks for your architecture.
Oracle tried to market their Exalogic as "no bottlenecks" which is nearly as funny as "unbreakable linux" and "zero latency".
I've always seen databases as an index storage layer and thought that "outsourcing" the apps job to the database while convenient in a lot of cases, are hell to maintain at scale.
So yes I rather keep my databases light and get the app to deal with the processing as much as possible, throwing a new webserver at it when the growth requires it.
That's me deliberately misinterpreting your comment, beating the straw man with a killer counter-argument and declaring victory.
Dunno why everyone's taking this so personally. It's like they're personally threatened and their approach is to get angry rather than try and understand.
I guess people don't appreciate debating anymore.
Mongo has some pretty good performance characteristics, according to people that have very, very large datasets being updated at a massive rate. As for reliability, I personally don't really see to much difference in the reliability of Python logic vs SQL logic. It's like you're comparing the reliability of English vs Spanish.
"> Here's a neat trick - for every query SELECT * FROM TABLE, then filter and sort it in the application too."
Huh? Mongo (my personal database of choice) doesn't do that. You specify what you want when you run find(). No other database does either.
Again, if you have a point, make it. The last three posts you're written have been shoving words in to other peoples mouths, sarcasm and statements without any actual backup or citations.
In my experience, and perhaps with my bias as someone who's very comfortable with (relational) databases, I have seen plenty of nightmarish scenarios involving, say, Hibernate.
I'd rather create a service layer on top of my DB and access it through that. Then I have more control over logging, managing access, pubsub, etc.
Discussions at this level presuppose a certain amount of background knowledge. Do try to keep up.
things need to be tested, strongly typed DB or not.
Please quit it with the Rude manner. It's against HN guidelines and makes your arguments weaker to boot. Or leave and go somewhere else.
The scalability problem? Good luck getting 90%+ of your processing out of the database. Rewrite from scratch?