Unlike HTML & CSS, which seem designed to spread into a greater morass of obtuse and interdependent hacks with every trial-and-error iteration, SQL has always felt elegant to me.
It takes just as much time to craft an SQL query as it does to translate a design into HTML/CSS, but 99% of the time I can sit back and be proud of my handiwork, and even re-use it later.
On the other hand, I have never completed a webpage without looking at the resulting markup and thinking, "What an ugly goddamn kluge." SASS and JQuery and HTML5 and CSS3 don't make a difference. Maybe I'm just bad at it, though.
I eschew ORMs at every opportunity, and it is mysterious to me why "writes SQL for you" is such a common "feature." I would liken it to a tool which "paints paintings for you" or "designs your app for you." Why would I entrust a computer program to do something which requires the creative spark of a human?
However, for that remaining 1% of complicated queries, a good developer can do a much better job than any ORM.
The company's product was, broadly speaking, a specialized analytics engine with temporal features. The analyses were written in a dialect of SQL-92 (just SELECT statements, plus extra features for temporal logic and currency conversion), which I had to compile to PL/SQL and T-SQL to run as stored procedures.
If so, I agree strongly, but I feel like that problem could be solved by better error reporting on the part of the compiler. I dread SQL errors now, at least with Oracle (which I work with the most often), because they almost never tell me what I need to know. To a certain extent it's essential to the abstraction of data reporting itself, but certain errors are just comically vague.
It's the Read part that gets me. I'm all for frameworks that map HTML forms to model object properties to database fields so you don't have to write boilerplate, but I don't think you can treat "Read" like the other three. For the life of me I cannot articulate my reasoning right now, elegantly or otherwise, but ORMs brush a lot of complexity under the rug in that respect.
Indexes. Everybody says it, but I'm still surprised at the number of devs who have next to no idea about indexes and how they work.
A few guidelines: Every index adds a write penalty, so don't just add them to everything. If you've got a compound index, it will only be used if the query also uses all the fields in the index (if you build an index on surname, firstname and dob, but execute select * from users where firstname="Bob" and surname="Jones", the index won't be used). Benchmark the difference each index makes, and stand back in awe at the improvement a few well-placed indexes can contribute.
[Edit: as pointed out below, my experience is well out of date, and it looks like there are a number of database engines where the compound index thing doesn't apply]
Learn to think in sets. Most devs are used to operating on discrete pieces of data; we're not used to thinking in sets, and so writing efficient SQL doesn't come naturally. Put in the effort to learn it and you'll be well rewarded. SQL is built around sets of data, and can almost assuredly do stuff faster than your code can. Hint: any time you find yourself getting the result of a query, iterating over the result set, and executing another query for each result, you're doing it wrong.
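To make that hint concrete, here's a minimal sketch of the anti-pattern versus the set-based version, using Python's stdlib sqlite3 and a hypothetical customers/orders schema (all names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0);
""")

# Anti-pattern: iterate over one result set and fire one extra query per row.
n_plus_one = []
for (cid, name) in conn.execute("SELECT id, name FROM customers ORDER BY id"):
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?", (cid,)
    ).fetchone()
    n_plus_one.append((name, total))

# Set-based: one query; the database does the join and aggregation itself.
set_based = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.total), 0)
    FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

print(n_plus_one)  # [('Ann', 12.5), ('Bob', 3.0)]
print(set_based)   # same result, one round trip instead of N+1
```

Both produce the same rows, but the loop version costs one round trip per customer, which is exactly the pattern to avoid.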
Learn how your ORM works. If you're using an ORM (and most of us do), learn how to use it efficiently. This is where having a good working knowledge of SQL helps, because you can then find out how to help your ORM create good SQL. If you're using ActiveRecord, understand how to correctly use :conditions, :include etc., and what they're doing behind the scenes. Ideally you want your ORM to make one call to the DB, and to retrieve everything you need in that one call (calls are expensive, even for very small queries). Poorly written code causes the DB to get hit constantly. Not good.
I don't believe that's true, at least not for all RDBMS. I just checked the docs for MySQL and PostgreSQL and they seem to agree with me.
http://dev.mysql.com/doc/refman/5.0/en/multiple-column-index... | "The name index is an index over the last_name and first_name columns. The index can be used for queries that specify values in a known range for last_name, or for both last_name and first_name."
http://www.postgresql.org/docs/8.2/static/indexes-multicolum... | "A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns"
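The leftmost-prefix behavior those docs describe is easy to check directly. Here's a small sketch with Python's stdlib sqlite3 (SQLite follows the same rule; the schema mirrors the surname/firstname/dob example from above, plus an extra column so SELECT * can't be answered from the index alone):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (surname TEXT, firstname TEXT, dob TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_name ON users (surname, firstname, dob)")

def plan(sql):
    # Join the detail column of each EXPLAIN QUERY PLAN row into one string.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filter on the leading column: the compound index is usable (a SEARCH).
by_surname = plan("SELECT * FROM users WHERE surname = 'Jones'")
# Filter on a non-leading column only: full table scan.
by_firstname = plan("SELECT * FROM users WHERE firstname = 'Bob'")

print(by_surname)    # reports a SEARCH using idx_name
print(by_firstname)  # reports a SCAN of the table
```

So a query on surname alone, or surname plus firstname, can use the index; a query on firstname alone cannot seek into it.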
- If I do an UPDATE query, and I don't modify indexed columns, the index doesn't need to be modified.
- If I have indexes on (last_name, first_name) and (last_name) the (last_name) index is redundant. Some queries may use the last_name index when I don't specify the first_name - but they could have used the other index. Now I have bloated what contents need to be memory resident (affecting reads as well).
Total system I/O is a finite resource, but that ceiling is harder to hit if indexes are on separate spindles from the data and from each other. This can cost substantially more initially, but the extra cost disappears at scale.
Perhaps, then, as for the OP's original topic, one mistake is not hiring someone who knows these performance strategies. Another is using an RDBMS that doesn't support them.
 which, granted, is far and away the norm
 or sets of spindles, or SSDs
ALTER TABLE staff ADD INDEX idx (last_name, first_name);
SELECT first_name FROM staff WHERE last_name = 'n';
A covering index is something very special - all results are returned from the index. Not all databases do this - Postgres does not, because it does not want to maintain all the versions (see MVCC in wikipedia) in the index.
I thought that operations using primary keys were always fast, that they were somehow always indexed.
I thought this would be fast:
SELECT content FROM test1 WHERE id = constant;
SELECT content FROM test1 WHERE content = 'xxx';
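The difference between those two queries shows up directly in the query plan. A minimal sketch with Python's stdlib sqlite3 (in SQLite, an INTEGER PRIMARY KEY is the rowid, so lookups on it are index seeks):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER PRIMARY KEY, content TEXT)")

def plan(sql):
    # Join the detail column of each EXPLAIN QUERY PLAN row into one string.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

by_id = plan("SELECT content FROM test1 WHERE id = 42")
by_content = plan("SELECT content FROM test1 WHERE content = 'xxx'")

print(by_id)       # a SEARCH: seek on the primary key
print(by_content)  # a SCAN: every row examined
```

The primary-key lookup is a seek; the lookup on the unindexed content column has to scan the whole table, and only an index on content would change that.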
In a primary index, the leaf nodes are the actual data records, so when you've followed the index to the leaf you have the exact physical location on disk that you need to get the data from. If you do a query that retrieves multiple records, for example
SELECT * FROM test1 WHERE id > constant;
Contrast that with a secondary index, where each leaf node essentially contains the physical address of the record, so for each record you're returning, you have to make an additional random seek to the underlying datastore to get the entire record. This is why primary indexes are faster than secondary indexes, in general. Various caching schemes smooth out the difference though.
Another good tip to keep in mind when writing indexes is that you can have covering indexes. If you often do a query of the type
SELECT a, b FROM table WHERE c = ... AND d > ... ORDER BY e, f
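A covering index for a query of that shape puts the filter columns first and appends the selected columns, so the index alone can answer the query. A sketch using Python's stdlib sqlite3 (table and column names are the hypothetical ones from the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT, d INT, e INT, f INT)")
# Filter columns (c, d) lead, then the ORDER BY and selected columns follow,
# so every column the query touches lives in the index.
conn.execute("CREATE INDEX idx_cov ON t (c, d, e, f, a, b)")

plan = " ".join(
    row[3] for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT a, b FROM t WHERE c = 1 AND d > 2 ORDER BY e, f"
    )
)
print(plan)  # reports a COVERING INDEX: the table itself is never touched
```

Because the query never has to visit the table's data pages at all, this is about as fast as a read can get; note that the range predicate on d still forces a sort for the ORDER BY.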
That's a clustered index. In some data stores primary keys are always clustered (InnoDB), in some they're not (MyISAM), and some even let you optionally cluster the table data on a non-primary index (SQL Server).
> Contrast that with a secondary index, where each leaf node essentially contains the physical address of the record, so for each record you're returning, you have to make an additional random seek to the underlying datastore to get the entire record.
Not necessarily: often the non-leaf nodes of the clustered index are stored somewhere completely separate from the leaf nodes and end up holding the address of a leaf-node page. In that case seeks may not be any better in principle, but scans on the clustered key where you need lots of non-key data from the data page are. If you do not need many non-key columns from the data page, scans on the clustered key can actually be slower, because the key columns are stored much further apart rather than all next to each other.
On the other hand, the secondary index may not end in row addresses but in primary keys, which must then be looked up in turn; in that case it logically cannot be any faster than a primary-key seek to a data page.
> This is why primary indexes are faster than secondary indexes, in general.
Clustered indexes are actually often not faster for many operations. A secondary index is often much faster if it covers the fields selected by the query without needing a bookmark lookup to the data pages, and scans on secondary indexes are generally faster than clustered-key scans, again moderated by whether you need to do bookmark lookups or not.
You are definitely right about each RDBMS having their own quirks though! Solid advice: know what you are working with and its nuances.
I did not know that - I always thought that a compound index would be used for a query against a subset of those columns provided the subset was a prefix of the overall list - so in your example, a query on surname would use the index but a query on firstname and dob would not.
Don't assume that once you make your query run perfectly it will stay that way forever. Once the row count grows, your query might start to perform very poorly. Happened to me just recently, when a query was running in the 0.0001s range but overnight crawled to a halt and took 1 minute(!) to execute. Always pay attention to your queries and monitor whether any of them is starting to slow down.
Don't expect some gradual curve of performance loss, either. Once a table exceeds the cache size, an OK-performing query can go very bad with no warning (say, 100x longer). This might mean that you really need to do a production patch of some kind. Plan for this before it happens.
It's possible I'm using it wrong; I didn't get much past trying it out. It's fairly easy to roll my own rough timer in a DB wrapper, but I haven't got around to finishing that yet.
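A rough DB-wrapper timer of that sort might look like the sketch below, using Python's stdlib sqlite3 and time (the class and threshold are hypothetical, just one way to log queries that cross a slowness threshold):

```python
import sqlite3
import time

class TimedConnection:
    """Wrap a DB connection and record queries slower than a threshold."""

    def __init__(self, conn, slow_ms=100.0):
        self.conn = conn
        self.slow_ms = slow_ms
        self.slow_queries = []  # (sql, elapsed_ms) pairs for later review

    def execute(self, sql, params=()):
        start = time.perf_counter()
        cur = self.conn.execute(sql, params)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms >= self.slow_ms:
            self.slow_queries.append((sql, elapsed_ms))
        return cur

# With the threshold at 0 ms, every query is logged (handy for a smoke test).
db = TimedConnection(sqlite3.connect(":memory:"), slow_ms=0.0)
db.execute("CREATE TABLE t (x INT)")
db.execute("SELECT * FROM t")
print(len(db.slow_queries))  # 2
```

In production you'd set the threshold to something meaningful and ship the slow-query log somewhere you actually look at, so the "query goes bad overnight" case gets noticed.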
There is likely some overhead for this feature, as it needs to make more system calls (getrusage).
And measure every optimization you make, too :) I saw someone doing an optimization that should have accelerated the processing, only to discover once in production it was 4x slower.
I love SQL and find it easy to solve problems with it, but I feel the opposite about ORMs. They add another learning barrier for me. I hate it because sometimes I know how to solve a problem in SQL but have a really hard time solving it with the ORM's syntax. I also have my doubts about the database-portability argument, because at the start of the project you are supposed to decide what DB you will be using anyway.
Also, am I the only one who wasted countless hours trying to figure out how to execute a query using an ORM? While being absolutely sure how to do it with SQL?
This is my current situation. I am not that experienced and have a lot to learn. That is why I will be spending more time learning the ORMs I will be using and more hours figuring it out :)
Maybe it comes from having started building web and db apps for large enterprises where the DB was the nucleus of the operation and the apps the satellites. It seems among the web crowd it's the opposite - the app is the nucleus, with the DB and presentation layer the satellites, resulting in ORM among other things.
In those enterprises, access to the DBs was strictly controlled by the DBA team, and data was made available to apps via stored procs only. The app devs would request a stored proc with certain parameters that would return a certain dataset, and the DBA team would implement it and grant access to that app only (or grant access to an already-existing sp that did the same thing).
The idea of ORM still bugs me for that reason, even though I use it anyway. I can't help but suspect it's the formalization of a convenient but bad practice. I could be wrong about that though, biased as I am in this regard.
Indeed, this is one of the strongly-held "opinions" of Ruby on Rails. The database is more or less treated with contempt, a mere bitbucket.
#2: don't put stuff that aren't true facts into your database -- just because it's a reliable cross-process data store does NOT mean it's appropriate for all cross-process data sharing.
#3: don't fall in love with SQL. It fucking sucks.
Say you have a table for people and another for the cities where they currently live. If you were to do this for a small country, you wouldn't deal with many different cities, so a join wouldn't waste much time. But if you're using all the countries in the world, then the number of cities is really big. When that happens, use two different, simple queries: one to get all the people, another to get all the cities. Then just check that both entries exist in your sets and use them in a while or for-each loop. It can be a big improvement.
Note: This is a really simple example which isn't the best case scenario for this tip. Think large sets with lots of joins between them.
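A minimal sketch of the two-query approach, using Python's stdlib sqlite3 (the people/cities schema is the hypothetical one from the tip; the dict lookup stands in for the join):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO cities VALUES (1, 'Lisbon'), (2, 'Porto');
    INSERT INTO people VALUES (1, 'Ana', 2), (2, 'Rui', 1);
""")

# Query 1: fetch all cities once, into a dict keyed by id.
cities = {cid: name for (cid, name) in conn.execute("SELECT id, name FROM cities")}

# Query 2: fetch all people; an in-memory dict lookup replaces the join.
pairs = [
    (pname, cities.get(city_id))
    for (pid, pname, city_id) in conn.execute(
        "SELECT id, name, city_id FROM people ORDER BY id"
    )
]
print(pairs)  # [('Ana', 'Porto'), ('Rui', 'Lisbon')]
```

Whether this actually beats a join depends on the engine and data sizes, so it's worth benchmarking both; the win, when there is one, comes from fetching each city exactly once.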
Once I had this problem with a calendar in some application for a client. It was a month calendar, and for every day there were some conditions that needed to be verified: is it a holiday? is it fully scheduled? is there at least one person available that day? and so on.
How was it working? Well, for every day it would query the database for each one of those conditions, which in the end added up to 5 minutes, since it was a not-so-small dataset.
So how could we fix this? I started with the usual approach: let's just try to get this done in one query and some joins. There was some improvement, but it was a messy query full of joins and even some sub-queries. It took something like 20 seconds to load that calendar, and that wasn't good enough.
Next approach: divide the most complex joins and sub-queries into some really simple queries. Examples: get a set of all the holidays, get a set of all the fully booked days, etc. Then, when you loop to show the calendar, you just check whether the current day is in the holidays set, then the next set, until you find one that matches; if none do, just leave it blank. This made the calendar load right away.
Benefits: cleaner and (much) quicker code (and also easier to cache if you can, just cache each query individually since some of them don't change that often, like holidays).
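The loop described above boils down to set-membership tests, something like this sketch (the sets and day range are hypothetical stand-ins for the results of those simple queries):

```python
import datetime

# Each set would come from one simple query, e.g. SELECT day FROM holidays;
# here they're hard-coded stand-ins.
holidays = {datetime.date(2024, 1, 1)}
fully_booked = {datetime.date(2024, 1, 2)}

statuses = []
day = datetime.date(2024, 1, 1)
for _ in range(3):  # loop over the days shown on the calendar
    if day in holidays:          # O(1) set membership instead of a per-day query
        statuses.append("holiday")
    elif day in fully_booked:
        statuses.append("booked")
    else:
        statuses.append("free")
    day += datetime.timedelta(days=1)

print(statuses)  # ['holiday', 'booked', 'free']
```

Each condition costs one query for the whole month instead of one query per day per condition, and each set can be cached on its own schedule (holidays rarely change, bookings change often).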
Is it making any sense now?
Also, on the surface, this seems like a good application for materialized views. I obviously don't know the ins and outs of the usage here, but it looks like a case where slowing down inserts would be a good trade for faster retrieval.
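For engines without native materialized views, the same trade-off can be emulated with a summary table that is recomputed explicitly. A sketch with Python's stdlib sqlite3 (SQLite has no CREATE MATERIALIZED VIEW; the sales/daily_totals schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-01', 5.0), ('2024-01-01', 7.0), ('2024-01-02', 3.0);
""")

def refresh_daily_totals(conn):
    # The "materialized view": drop and rebuild the cached aggregate.
    # This is the slow write you pay so that reads become a plain lookup.
    conn.executescript("""
        DROP TABLE IF EXISTS daily_totals;
        CREATE TABLE daily_totals AS
            SELECT day, SUM(amount) AS total FROM sales GROUP BY day;
    """)

refresh_daily_totals(conn)
totals = conn.execute("SELECT day, total FROM daily_totals ORDER BY day").fetchall()
print(totals)  # [('2024-01-01', 12.0), ('2024-01-02', 3.0)]
```

In PostgreSQL the refresh step would be REFRESH MATERIALIZED VIEW instead of a drop-and-rebuild; the read side is the same cheap query either way.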
I'm not going to even feign being an expert at database optimization, but that seems like it might be a more efficient way to approach it.
It feels really good to learn something new. I'm thinking of some scenarios which might improve with this type of view, thanks!
I've experienced this myself, and while natural keys do work in theory, in practice, they are prone to failure.
Number 6 needs to be used with discretion.
When you reference a complex view, you are in essence making use of a stored query snippet -- there is no magic to the view, and unless you've materialized it (which is very rare), there is no efficiency advantage. All it really is is code sharing.
So what's the problem then?
The problem then is that the query analyzer has to assume that you actually care about every element of said view, every returned column, every spurious join. I've dealt with too many performance problems where people made use of an overly generous view that invalidated most indexes, removed the potential for covering indexes (the #1 most important performance element of a database), and did a lot of joins that were irrelevant to the usage.
Look at it this way: if someone comes to you and says, "I want to get from New York to California in 2 hours, so I'm going to take a 747," then off the top of your head you could tell them, "I don't think you can do that without a Concorde." A computer could easily optimize that plan.
But if someone says, "I want to get from New York to California in 2 hours, so I'm going to stick grapes in my nostrils and soak my shoes in canola oil," there's nothing you can do to help this person, because they clearly have no idea where to even start. A computer could do no better.
Using OR instead of IN() is taking a 747 instead of a Concorde. But many of the points in that article, like using DISTINCT to mask a lack of understanding of how you're asking for data, are very much the grapes and canola-oil Nikes of SQL.
EDIT: Whoops. Sorry; since you had been downvoted, I misread the gray-on-light-gray text as "Given the amazing amount of run-time query optimization performed by most relational DBs." Guess I should have run my comment through an HN Post Optimizer.
NoSQL solutions are not a drop-in replacement for relational databases, either, though. They're different tools for different jobs.
No matter what, you're going to need to look at the same data in different ways. If that isn't true, then the decision is easy. But in pretty much every non-trivial application I've ever worked with, the source information and the requirements for the resulting data format have never been the same for every feature.
When you decide to go with NoSQL, you are implicitly accepting some of the burden previously handled by the RDBMS: The onus is now on you to do your own "indexing," your own "views," your own "normalization." I only have experience with CouchDB, which provides an excellent array of tools for handling these responsibilities, but for certain types of data, there is still no substitute for an RDBMS.
I'm actually rather hoping that someone produces a hybrid NoSQL-SQL system that gives the best of both worlds.
Most "old" RBDMSes (i.e., not MySQL) are actually ok to quite good at optimizing queries. Sure, you will always find things that are not optimized properly, but having worked on a query optimizer myself, this is actually much harder than you might think it is. Not only coming up with optimizations, but also making sure they are legal, and don't degrade certain queries. There's nothing worse than optimizing a legit query to pieces...