Hacker News
"I've isolated the bug to a database query" (thedailywtf.com)
394 points by dmarinoc on Nov 9, 2011 | 165 comments

One company I was at did a merger with another startup, and most of the other company's engineers quit. Amongst the piles of Visual Basic we found a stored procedure that was about 5 pages long. It took about 18 hours to run; its job was to do a daily report.

I hate being afraid of code. I spent a day with it, got to understand it, then rewrote it as a couple of queries and some Java code, whereupon it took about five minutes to run.

[... then there was the guy who implemented bitwise AND and OR by precomputing some 65536 entry tables. Wow. Why do I find all the really howling bad stuff so close to databases?]

"Why do I find all the really howling bad stuff so close to the databases?"

Because the database and the queries against it were coded as an afterthought.

Because the programmers ran all the db access tests against tables with 4 rows of data when the customers would wind up with 100 million rows in production.

Because most undergrad education around databases is poor and antiquated. Plus, it is not SQL-focused; much time is spent talking about data modeling. Not wrong, but not often helpful in practice.

Because DBAs are (rightly) focused on making sure your db is properly backed up, replicated, and fault-tolerant. They may not even understand SQL queries beyond the basic one-table select.

Because the users will force data into the systems that the developers never anticipated.

Because db jobs are segregated by product for development. If you've done SQL coding on Oracle for the last 10 years, good luck applying for a job doing SQL coding on MS SQL Server.

Because SQL hacking doesn't get much respect. I've been in job interviews where deep into the process they ask me a question with data in two tables that requires an outer join. And answering correctly differentiates you significantly from other applicants. I guess it is the "fizz buzz" of SQL these days . . .
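For the curious, the "fizz buzz" question described above boils down to something like the following sketch (Python with SQLite; the customer/orders tables and rows are invented for illustration):

```python
# Hypothetical two-table interview setup: list every customer with an
# order count, INCLUDING customers who have no orders at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0);
""")

# An inner join would silently drop bob; the LEFT OUTER JOIN keeps him
# with a zero count, which is the crux of the interview question.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('alice', 2), ('bob', 0)]
```

Candidates who only know inner joins tend to produce a query that loses the customers with no matching rows, which is exactly what the question is probing for.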

I have a slightly stupid question. How is it that we've come to a point where we think of application and database development as separate things? I think this sets the stage for a lot of the problems you're getting at.

On the one hand we make-believe that all you need to know is how to sling code and as long as you can somehow get your data into the database and out, then you've done your job as a developer. I think that's bullshit. No one would use your application without the data, so the database is as much of your responsibility as any other part of the system. You should know how to use it well and should take responsibility for its quality.

On the other hand, there are things that belong in application code, rather than, for example, in a stored procedure. Most things you could do with a stored procedure probably belong in application code. The app is where the business logic is, the database is where the data is. As a DBA you aren't doing your job right if you feel OK about implementing business rules in SQL. It's much poorer at expressing intent than an OO language would be, it's less readable, and doubles the number of places you need to look for domain logic.

The WTFs go both ways.

How is it that we've come to a point where we think of application and database development as separate things?

My take is that the principles of database engineering only make intuitive sense at scale.

If every programmer started their career as an assistant DBA at a web host with a lot of traffic, or a data warehouse with a ton of transactions and report generation going on, or a company where lots of different groups were developing client code against the central database, every programmer would be inspired to understand databases.

As it happens, programmers start their career building little apps with one client at a time and a handful of data. And many of them finish their career building such apps, because they are very handy. And these programmers may never feel a burning need to know how databases perform with large datasets, or a huge stream of colliding transactions, or a mission-critical data integrity and safety requirement, or a variety of clients all trying to generate complex reports.

I agree with your point, but with an important caveat.

You are right that application developers should not be ignorant of databases, and DBAs should not be ignorant of what application developers are doing. All too often, this is the case.

However, the point when people came to think of application and database development as two separate things was the origin of databases themselves, and arguably one of the most important historical points in computer science itself. The notion that data has a life outside of any given application, and that it deserves itself to be managed, was and is incredibly powerful.

The same reason we have front-end developers who don't know how to write business logic, and back-end developers who don't know how to write a presentation layer.

The app is where the business logic is, the database is where the data is. As a DBA you aren't doing your job right if you feel OK about implementing business rules in SQL

Sigh... This again. Out here in the real world, we have databases with a hundred apps connected, with tens of thousands of concurrent users, and these apps are written in a dozen or more languages, there's COBOL, VBA, C++, Java, Python... There's green screens, desktop apps, websites, embedded devices...

The ONLY way to have consistent, enforced business logic, from ANY application in ANY language (even people executing SQL directly) is to do it in stored procedures. Doubles the number of places? Your way is 100x the number of places. Little websites with one application running on one database are not really representative of serious database use.

The ONLY way? I call bullshit on that too. Databases like the one you mentioned exist, but I still think if you have business rules you want to enforce across all applications connected to a database, you wrap that in a service and let the applications call the service. Clearly stored procedures aren't the ONLY way.

You're seriously proposing going back over 25+ years of applications and rewriting them all to talk to a "service" instead of the database? Over what protocol, CORBA, DCOM, SOAP, WSDL, some new thing? And what about third party apps that connect to the DB, have the vendors rewrite them too, to talk to your whizzy new "service"? And what language is your service written in, the fashionable one du jour, that in 10 years your successors will sneer at as "legacy" anyway? And it's as reliable and high-performance as your most mission-critical application, since now it's a single point of failure and a bottleneck?

Ho ho ho. Website experience doesn't translate to real database work.

I'm not proposing anything. My point is that there are other options.

As to the rest of your response, I don't mean to be rude but you're simply rambling. Obviously with a legacy system you work with what you've got, but the point is what you would choose if you had a brand new system and needed to make the call today. Choice of language depends on system requirements as we all know.

I really feel uncomfortable about all this posturing. I'm all for discussion, but if you honestly consider me a moron for saying what I have, you shouldn't have any interest in responding. Participating in a discussion merely to dismiss (and disparage) the other party is pointless and boring. Down-vote and move on.

It depends on many factors, "system requirements" is so vague as to be meaningless. Conciseness of expressing the algorithm is one, sure, so is the need to interoperate with other systems, both bespoke and off the shelf, so is the existing skillset of the organization and the talent pool it hires from.

It must be nice to only work on greenfield projects, but as I keep saying, it is not experience that transfers to large scale development efforts, which span decades and continents. If you spread this kind of misinformation, you create more problems than you solve, when the "business logic" is in a hundred places and no-one knows what bit of the system does what. Sure SQL isn't the perfect language, but it is the de facto lingua franca. And there is no "the app" for the business logic to live in. There are a hundred apps, each of which is a piece of "the app".

And if we're pointing fingers, it was you who asserted that a DBA who sees the big picture (funny how every developer thinks theirs is the only app in prod, when they are really in a cast of thousands) doesn't know how to do their job. Show a little humility and you will learn.

I work with large databases every day. I cannot tell you the number of people I've interviewed who weren't able to solve a left join question. Let's leave right and full outer joins out of the discussion; a left join is as advanced as I dare get off the bat in an interview.

I've once had a guy I worked with ask for help when he was with a new company. I told him to look at the query plan. He replied, "Yeah, but that doesn't tell you much."

Relational databases are oft-misunderstood and underestimated time and time again. Where are the Codds of yesteryear?

I've never understood the typical programmer's aversion to query writing. Writing SQL is in general pretty easy (basic set theory).

I'm guessing it's all about what people have experienced. I have been coding against RDBMSes since my first job while still in college. I have even read Celko for fun at one point :)

IMHO, learning the hard parts of SQL only comes from experience with particular RDBMSes and dealing with large datasets. For example, knowing that the only way to get a query to run against a large Teradata table without running out of spool space is to create and use temp tables only comes as a result of trying to run the query.

I think it's easy once you've got your head round the model of data sets but there are plenty of experienced programmers who, due to lack of experience with SQL, still drop back to iterating through large collections of data.

It's the fastest way for them to solve a problem in terms of development effort, but it often leads to poor solutions and poor SQL.

Most developers learned to program in school, i.e. to write algorithms. Most development in the "business" world is not that; it is basically managing and transforming data, something that RDBMSes and SQL are very good at.

The result is that many developers will do selection and sorting algorithmically (taking the bugs and LOC that entails), instead of using declarative SQL.
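A contrived sketch of that difference, using SQLite from Python (table and values invented): the same "totals over 100" report done algorithmically and declaratively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (user TEXT, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?)",
                 [("a", 60), ("a", 70), ("b", 20), ("c", 150)])

# Algorithmic version: fetch everything, then select and sort by hand.
totals = {}
for user, amount in conn.execute("SELECT user, amount FROM payment"):
    totals[user] = totals.get(user, 0) + amount
by_hand = sorted(((u, t) for u, t in totals.items() if t > 100),
                 key=lambda x: -x[1])

# Declarative version: the database does selection, grouping, and sorting.
declarative = conn.execute("""
    SELECT user, SUM(amount) AS total FROM payment
    GROUP BY user HAVING total > 100 ORDER BY total DESC
""").fetchall()

assert by_hand == declarative  # both: [('c', 150.0), ('a', 130.0)]
```

The hand-rolled loop is more lines, carries its own bugs, and reimplements work the query planner already knows how to do.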

I have worked at places where the database is just seen as something in the periphery that you just have to deal with, since Oracle is a requirement from customers. It's sick, really, when you think about it.

I'm disappointed by many of the comments here that basically say "SQL sucks because I don't understand it".

In four weeks I've gone from not remembering the difference between inner and outer joins to isolating a bug in the Postgres optimizer... and yeah, it's all about daily experience.

I have no formal CS schooling, so thinking in sets is far from basic - I often have to stop myself from thinking in iteration. Reading Celko's "Thinking in SQL" book is in fact helpful (although why does he keep yelling at me? and what's FORTRAN?)

But it's very foreign and black-box-y to many programmers, and a tiny bit of SQL voodoo, like a tiny bit of Javascript voodoo, will get you a very long way with very poor performance.

I learned SQL with copy/paste/modify from PHPBB code base. When I got my first web programming job I could write basic select/insert/update/delete but knew almost nothing of joins. I learned the rest by reading MySQL docs on the job. (Later I also improved my understanding of joins by doing an MS Access project.)

So SQL is just as easy to pick up as anything else, but you're going to be horrible at it at first and need to get past the mental block of seeing something so different conceptually to "normal" programming.

One of the biggest hurdles that typical programmers must overcome is that SQL is a declarative language, and therefore quite different to the imperative languages they are used to.

Imperative programmers all secretly believe in a declarative style, that's why they compile with -O and hope the compiler's "query optimizer" will make it fast for them.

I work in a very heavy database environment, and it's interesting. Nobody here ever talks about "left" vs. "right" joins. We also don't use ansi syntax (oracle), so for a typical outer join, we'll do either:

  select * 
    from foo f, bar b
   where f.baz = b.baz (+)

  select * 
    from foo f, bar b
   where f.baz (+) = b.baz
But we don't call one a "left" and the other a "right" join. Just bringing this up to note that the right vs. left outer join distinction might be unclear to even relatively experienced engineers coming from such an environment.

This kind of notation (a comma-separated list of tables) used to be the standard. We've moved beyond it for a very good reason: there's a very common class of queries that this notation cannot express.

It's very common to want to LEFT JOIN two tables, and then see specifically which ones on the left had no corresponding record on the right. But the old way cannot specify the JOIN criteria separately from the filter criteria, so we can't express the query.

For example, in my current database, I might do the following to see what categories are not represented by any products:

    FROM Products.Category AS c 
    LEFT JOIN Products.Product AS p ON c.CategoryId = p.CategoryId
    WHERE p.ProductId IS NULL 
By specifying the JOIN criteria in the ON, I can then use the WHERE to winnow down to the left-only rows. But if that were expressed in the old notation, it would be mistaken for JOIN criteria, and thus all categories would appear to have no products.
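The anti-join described above runs as-is once the schema prefix is dropped; here is a sketch in Python/SQLite with a few made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (CategoryId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Product  (ProductId INTEGER PRIMARY KEY, CategoryId INTEGER);
    INSERT INTO Category VALUES (1, 'books'), (2, 'toys'), (3, 'food');
    INSERT INTO Product  VALUES (10, 1), (11, 1), (12, 3);
""")

# JOIN criteria live in ON; the WHERE then keeps only the left-side rows
# that found no match -- i.e. the categories with no products.
empty = conn.execute("""
    SELECT c.Name
    FROM Category AS c
    LEFT JOIN Product AS p ON c.CategoryId = p.CategoryId
    WHERE p.ProductId IS NULL
""").fetchall()
print(empty)  # [('toys',)]
```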

That’s expressible in Oracle syntax as:

    FROM Products.Category AS c
       , Products.Product AS p
    WHERE c.CategoryId = p.CategoryId (+)
      AND p.ProductId IS NULL
so I think your point is mistaken. I do agree that the explicit JOIN syntax is clearer.

I find the JOIN syntax far more maintainable too: it becomes really hard to pick out the salient WHERE clauses from amongst the dross of boilerplate join conditions. Things also get confusing as soon as you throw an OR into the WHERE clause; too many unnecessary brackets.

Also, changing the schema tends to be easier, as you can often just delete the JOIN line if you've been coding consistently, instead of having to pick through the WHERE clause on more complex queries.

In the end, though, I guess a lot of it comes down to what you're used to. I have been exposed to both and would definitely be in the JOIN camp, and would help utterly destroy your evil commas if we had to have a worldwide battle to decide the fate of SQL-kind.

That’s expressible in Oracle syntax

Let's modify the query slightly, to retrieve all categories that are not represented by any products from a particular manufacturer. The standard syntax would look like this:

    FROM Products.Category AS c 
    LEFT JOIN Products.Product AS p ON c.CategoryId = p.CategoryId AND p.ManufacturerID = 123
    WHERE p.ProductId IS NULL 
How will you express this in your notation? In particular, what differentiates the "p.ManufacturerID = 123" of the join criteria from the "p.ProductId IS NULL" of the filter criteria?
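For readers following along, the distinction in question can be demonstrated directly (Python/SQLite sketch, schema simplified from the example above): the same manufacturer filter placed in the ON versus the WHERE of a LEFT JOIN gives different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Category (CategoryId INTEGER PRIMARY KEY);
    CREATE TABLE Product  (ProductId INTEGER PRIMARY KEY,
                           CategoryId INTEGER, ManufacturerID INTEGER);
    INSERT INTO Category VALUES (1), (2);
    INSERT INTO Product  VALUES (10, 1, 123), (11, 2, 999);
""")

# Filter in ON: "categories with no products from manufacturer 123".
# Category 2's only product is from manufacturer 999, so it qualifies.
in_on = conn.execute("""
    SELECT c.CategoryId FROM Category c
    LEFT JOIN Product p ON c.CategoryId = p.CategoryId
                       AND p.ManufacturerID = 123
    WHERE p.ProductId IS NULL
""").fetchall()

# Filter in WHERE: the ManufacturerID test can never pass on a
# NULL-extended row, so the outer join silently collapses back into an
# inner join and the query returns nothing.
in_where = conn.execute("""
    SELECT c.CategoryId FROM Category c
    LEFT JOIN Product p ON c.CategoryId = p.CategoryId
    WHERE p.ProductId IS NULL AND p.ManufacturerID = 123
""").fetchall()

print(in_on, in_where)  # [(2,)] []
```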

I'm no expert in SQL (Especially the Oracle flavor) but why would the following not work?

SELECT * FROM Products.Category AS c , Products.Product AS p WHERE c.CategoryId = p.CategoryId (+) AND p.ProductId IS NULL AND p.ManufacturerID = 123

I never ask for an explicit left or outer join in an interview. It's always a question, like, "Get me the total sales of every salesperson for this quarter given this schema." And then they return it, and I say, "Can I get the list of all salespeople, even if they haven't had any sales?" And that's where it falls apart, sadly.

Unfortunately, I must agree with you. Nine times out of ten, candidates cannot answer the question you've posed. One time I even had a candidate with 17 years of experience (!!!) who didn't even know what a join was (masked in question form). I let it slide, pretended it was an "advanced" question, and came back to it later (I figured he was nervous). The second time around I phrased the question, "So, imagine I want to join data from these two tables so that the output would be first name and location" (two different tables). Even then, the candidate did not get the hint... Instead, he offered to change the schema of the fake database and put all the information in one table...

I also find the Oracle outer join operator (+) easier to use.

I just searched the full (Oracle) sql codebase of the project I'm working on, and it seems that there is no clear winner in the "outer join" vs "(+)" battle (outer = 46%, (+) = 54%). There is no code style requirement for joins.

I have trouble with the oracle syntax once you get to two or three tables left joined against each other, and consider it more a relic of old oracle programmers than any actual advantage one way or the other.

I think I read recently that using the Oracle (+) operator can give you the error "cannot outer join to more than one table" if you try to outer join to two or more, but the ANSI syntax will allow you to do that - I really should test this out ...

edit - I tried this out and right now cannot get that error, so not sure where I got that idea from.

This code:

    Where a.client_id = c.client_id
    And a.client_id = b.client_id (+)
    And c.client_id = b.client_id (+)
Doesn't fly in our database

There's no reason ever to use a right join. I consider those to be a code smell.

Agree with this as well. I forgot exactly what one was and googled it to make sure. The example used here would indicate orphaned records.


There are valid use cases for orphaned records. A quote database might want to allow registered and anonymous submission, for example.

Or, à la Reddit, comments from deleted users.

There's about as much reason to use a right join as there is to use a left join.

Yes, but one of them means that you're thinking about the problem backwards.

I don't think I've ever used a right join in a production query (perhaps a couple of times when I was doing some digging and it was easier), but I don't understand your statement. Left joins are insanely useful. For example, getting the total sales for every salesperson:

    select
        sp.SalesPersonKey,
        sum(s.Amount) as TotalSales
    from
        SalesPerson sp
        left join Sales s on
            sp.SalesPersonKey = s.SalesPersonKey
    group by
        sp.SalesPersonKey
    order by
        TotalSales desc
If you did a correlated subquery, it would take a ton of time to complete (`select sum(Amount) from Sales s where s.SalesPersonKey = sp.SalesPersonKey`), especially on large tables.

Left joins (and full outer joins) are plenty useful and I use them almost daily. Care to explain what you mean?

What he (probably) means is that the following query returns the same result as your query:

    select
        sp.SalesPersonKey,
        sum(s.Amount) as TotalSales
    from
        Sales s
        right join SalesPerson sp on
            sp.SalesPersonKey = s.SalesPersonKey
    group by
        sp.SalesPersonKey
    order by
        TotalSales desc
EDIT: Formatting.

That's what was meant, I assume, but it's misguided. Right joins don't create a "code smell" for technical reasons, but for practical reasons. Because left joins are the standard approach, if you're using a non-standard approach to a common problem you'd better have a damned good reason (because you're just confusing maintenance programmers... and thus causing errors... with no justification, otherwise).


I must take my SQL skills for granted. It really just "clicks" for me, I guess.

Me too. Most programmers don't understand the "brain" of a select statement. People gtalk me all the time when they are stuck, and the answer is often to remove 90% of the query and replace it with a GROUP BY and a HAVING. Most people don't think about the steps the interpreter goes through, because it is a declarative language.
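As a small illustration of the GROUP BY / HAVING point (table and data invented): finding duplicate emails needs no self-join, temp table, or client-side loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "x@a"), (2, "y@a"), (3, "x@a")])

# One aggregate filter does the whole job: group rows by email,
# then keep only the groups with more than one member.
dupes = conn.execute("""
    SELECT email, COUNT(*) AS n
    FROM users
    GROUP BY email
    HAVING n > 1
""").fetchall()
print(dupes)  # [('x@a', 2)]
```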

When I interviewed for my current position (which involves database design/query writing/storage/etc), the interview was heavily SQL-focused. My would-be manager started with very easy, general questions, and then continued to plumb the depths of my knowledge 'til he hit bottom (took about 30 minutes[1]). He is of the opinion that if you can think on your feet about general database design and set theory, you're a pretty competent programmer for the work the company does.

[1] Though I apparently did well enough to get the job, I felt as if I utterly failed the interview.

Can you explain each of them, when they are typically used and why they matter? I've used them before but honestly, I am not that comfortable with them and don't ever think about using them.

Jeff Atwood already did it much better than I can:


Hope that helps!

I cannot help but think that the recent flavor-of-the-month rush to all that NoSQL is nothing but the average web monkeys not understanding databases, SQL, and the fundamental problems RDBMSes deal with. So, yeah, when I have no idea about all that, I would prefer a network-accessible key-value map as well...

One of our basic interview questions is doing schema modeling on a white board from business concepts we lob at them.

"Because most undergrad education around databases is poor and antiquated. Plus, it is not SQL focused - much time is spent on talking about data modeling. Not wrong, but not often helpful in practice."

I am taking a database management class this semester. It's awful. We spent over half the semester drawing diagrams. Just last week we started working with actual SQL.

Have you been to DB-Class.org? It might make a good supplement to your real-life class.

The & and | operators weren't good enough for him?

That stuff ends up in databases because on your average team the knowledge of databases drops dramatically once you get past selects and maybe left joins. I worked with a few developers who wrote and lived with queries that took 30 seconds each to run on their local machines, because they couldn't imagine how to fix them. They just kept raising the timeout values until the pages rendered, and waited for me to get assigned to the project in a week or two so I would fix it. The problem was simple: they were accessing a table with 30 million rows without using any index at all, so changing the query to use the clustered index made it return almost instantly.

I've seen similar things where people come up with crazy ways of doing your most basic database tasks.
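A scaled-down sketch of the fix described above: the same lookup with and without an index, inspected via SQLite's query-plan dumper (EXPLAIN QUERY PLAN; table and data are made up, and the exact plan wording varies by SQLite version).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER, payload TEXT)")

# No index: the plan is a full table scan, e.g. "SCAN big".
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM big WHERE id = 42").fetchall()
print(plan_before[0][3])

# With an index the planner switches to an index search,
# e.g. "SEARCH big USING INDEX big_id (id=?)".
conn.execute("CREATE INDEX big_id ON big (id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM big WHERE id = 42").fetchall()
print(plan_after[0][3])
```

The same before-and-after exercise works in any serious RDBMS; only the plan-dumping command differs.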

So show your coworkers how to run the query-plan dumper for your DB brand. The curtain is torn away, and the underlying ISAM is revealed for all to see to allow the needed "hints", indices or restructuring to be understood.

Don't lord it over them, explain.

As my prior boss used to say... "Give a man a fish, feed him for a day. Teach a man to fish, and you'll have to answer fucking fish questions the rest of your life."

Second this - I've cooked up some weird SQL to do neat things.

Other devs on my team had real problems with it - they were stuck and found it really difficult to "think in sets." They kept asking me about how to use cursors to go through data . . . heck, they may even have been right from a clarity and maintenance perspective.

But the object-oriented part of my brain has apparently been excised by spending so much time fishing (SQL). So now I have too many fish questions in that direction.

You're assuming the other people want to learn. Generally, if people want to learn something, they ask questions. Telling co-workers how to do something when they don't ask can be... tricky. It depends on the environment you're in, your relationship with the people, and the people themselves.

Fair enough. You can lead an idiot to knowledge, but you can't make him think.

They were uninterested. Under normal conditions you are correct, and most people would be interested, but they were not.

I've had more than just a taste of this myself recently. I'm no database expert but I know how to build a query, normalisation, relational integrity and all that. Enough to identify bad SQL in a CRUD application. And I'll take the time to learn more about it and how to better construct a schema and query (and whether or not an RDBMS is even appropriate for the application).

With this in mind my boss mentioned one of our sites was "dying on its arse." It was hammering the slow query log, pages could take 10, maybe 20 seconds to load.

Turned out there were NO indexes at all. Maybe the odd primary key from the default CMS install (which was by no means database efficient). So we sorted that and that sorted out the immediate speed issues.

Of course, it went without saying that a poorly constructed schema must be related to a poorly constructed query or two. We were actually wrong. It was more like a query or twenty.

The submission of one form was done field by field. There was an UPDATE query constructed for each field and processed there and then, on a form with at least 12 fields. There was a comment above, "We have to do it this way. Trust me." As far as I could tell they didn't know how to dynamically build a query.
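"Dynamically building a query" here means something like the following sketch (Python/SQLite; the table and helper function are hypothetical). Column names come from the code's own schema, never from user input, and all values go through placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE form (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO form VALUES (1, 'old', 'old')")

def update_row(conn, table, row_id, fields):
    # Collapse N per-field UPDATEs into one parameterized statement.
    # `table` and the keys of `fields` must come from trusted code,
    # not from the submitted form itself.
    assignments = ", ".join(f"{col} = ?" for col in fields)
    sql = f"UPDATE {table} SET {assignments} WHERE id = ?"
    conn.execute(sql, list(fields.values()) + [row_id])

update_row(conn, "form", 1, {"name": "alice", "city": "leeds"})
print(conn.execute("SELECT name, city FROM form").fetchone())  # ('alice', 'leeds')
```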

Further along, different update queries were performed by first deleting the record, and then re-inserting it with one of the fields updated.

I never quite got as far as that. There was no comment anywhere about the reasoning for such a strange approach.

I think some developers don't bother figuring it out because they think there are better things to do than optimise your code.

Nothing builds confidence in a fellow developer like

  //We have to do it this way. Trust me
followed by crap code...

Hi there! I made mostly random changes until it worked, so I have no idea why this contorted approach fixes the earlier failures. But I spent two weeks on this section, so don't touch it!

[this comment removed and replaced with the "Trust me" line, because maybe multi-line comments are breaking things today...]

We have a legacy database we support that creates a new dynamic stored procedure (with the same body) every time it performs a particular operation. How you can have enough knowledge to do this, but not to use a single procedure is beyond me.

Is your name a wire reference? If so well done.

I tried caching the hell out of Rails, kept adding indexes to my SQL tables, created 4 MySQL replicas, added more RAM to my Linodes, and tuned many default parameters in my.cnf, but the website had still slowed to a crawl.

This parameter saved it all: "innodb_buffer_pool_size = 768M" ... how am I supposed to know that Rails creates InnoDB tables by default? I thought MySQL's default was always MyISAM.
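For reference, the fix being described amounts to a my.cnf fragment along these lines (the 768M figure is from the comment above; the right size depends on how much RAM the host can spare for MySQL):

```ini
# my.cnf fragment -- values illustrative, not a recommendation.
# On a dedicated database host the buffer pool is usually sized to
# a large fraction of available RAM; the shipped default is tiny.
[mysqld]
default-storage-engine = InnoDB
innodb_buffer_pool_size = 768M
```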

MySQL's default may be MyISAM, but it's a bad default. It is a good thing that Rails uses InnoDB by default.

I think from MySQL 5.5 onwards the default is InnoDB.

This is changing because MyISAM is not really a true database engine (in the ACID-compliant sense of the term). I guess I would have assumed they were MyISAM too, but they don't perform differently enough at low usage levels that you would notice, right? How big was this application?

About 5 million pageviews / month. The website is quite processing intensive.

That's only two pageviews a second, on average.

We can agree that traffic rarely comes in at an average rate.

Assuming the standard bell curve it's six pageviews a second, at peak.

And that's apparently spread across four read-only slaves, too. I'm more inclined to blame the code than the database in this case.

No, it is not the code. It is InnoDB's default buffer pool being set to 8MB that caused my website to be so slow ...

Dumping the schema from the database itself will reveal the table type. Knowing the physical organization of the schema is required for optimization work. If you only used the logical organization information (UML diagram for example) you were doing it wrong.

In real RDBMS systems, tables can be clustered (index-organized in Oracle-speak) or not, nested or not, in different tablespaces, with many different parameters that do not change the semantics of DML/queries but change the performance significantly.

Yes, I know, but it just never crossed my mind that my tables had been InnoDB all this while ... among all the optimizations, this is the one I overlooked. Well, I'm not a DBA after all.

Isn't rewriting a stored procedure into multiple queries plus Java code sort of the opposite of what most DB best-practices advocate? Now your query logic is in a mixture of SQL and Java, rather than all in SQL, and you've hidden some of the logic from the database's query planner by moving it into application code.

(On the other hand, I've rarely found query planners, even for Oracle, to be as magical at optimizing as one might hope, so maybe that's why.)

Sure, but I took an unmaintainable stored procedure (with very few comments, by the way) and turned it into something that was small, well-documented, and that ran over two orders of magnitude faster.

If this is wrong and goes against "DB best practices" then I guess this answers my question about where the howlers come from.

[We had a project under development. At some point, and an utter surprise to the team, an Oracle consultant showed up one day and said, "I'm here to tune your database."


And he did. Against a half-built system with a schema we were still designing, he "optimized" our queries and determined buffers sizes and whatnot, and then he went away.

Apparently the people in Sales were panicked that we had a database under development that we hadn't yet made plans to tune, so they had hired a guy to come in and fix things.

We tossed all of his scripts and rolled back everything he'd broken, and continued coding. Total waste of money.

Databases are a weird confluence of the power-mad, the knuckle-draggers, the money-grubbers, and a few techies who know what the heck they're doing.]

You are assuming that the Java code is query logic instead of pure business logic. Many people try to do too much in SQL.

The performance problem with using multiple queries comes mostly from the network latency and query overhead incurred every time your application makes a call to the database. Prepared statements minimize the latter and connection pools help with the former. The primary thing to worry about when writing queries is whether the logic behind them is efficient and doesn't force the database to do stupid things or fetch the wrong information (like using count(*) where exists is what you mean). Paying attention to what your queries will make the database actually do is more important than following any guidelines.

At least that's my experience.
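The count(*)-versus-exists point can be sketched like so (Python/SQLite, invented table). Both forms answer "does at least one row match", but the COUNT form tallies every matching row first, while EXISTS can stop at the first hit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,)] * 100000)

# Semantically equivalent boolean answers...
slow = conn.execute(
    "SELECT (SELECT COUNT(*) FROM orders WHERE customer_id = 1) > 0").fetchone()[0]
fast = conn.execute(
    "SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 1)").fetchone()[0]

# ...but the COUNT version counted all 100000 rows to produce its 1.
print(slow, fast)  # 1 1
```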

I'd agree. I'd also add that well written stored procs are often more readable than well written ad hoc queries.

The longest query I have ever written was, I thought, approaching unmaintainability at about 100 lines, but I can sit down and digest it without too much effort. I have, however, spent a week debugging a query that was three times that long. I think the time to understand starts increasing with the square of query length: 100 lines may not take too long to understand, but 200 lines takes four times as long, 1000 lines takes 100 times as long, and so on.

At least my experience is that even well-written SQL starts becoming a bear really fast once it's over 100 lines long.

Some databases really do lose their minds when the query contains many tables. Sybase was particularly painful and pretty much required breaking up big queries. Sybase's optimizer for version 12.5 and earlier (I cannot speak for later versions) often required forcing indexes and plans in stored procedures.

Because SQL is declarative and few(er) people can wrap their mind around that than straight procedural code.

See also Linq, you get some really crazy stuff written in Linq because people don't really understand it.

"Wow. Why do I find all the really howling bad stuff so close to databases?"

I have seen a few of those, but you also have to include not only stored procedures but also table layout etc. Things like "if I don't define my foreign keys, I can use a foreign key to reference rows in EITHER table a or table b!"

And then of course when you write sprocs against that mess, you get more messes.

The thing is that databases are fundamentally math engines with some extra services tacked on. This means that designing databases is ideally a mathematically sound representation of your data not specific to your application, and the database queries bridge the gap. Most developers don't understand this though and so as the application changes, you get big messes.

Now, obviously the above is an ideal which really can't be met in most cases. However, the closer one gets to that ideal, the better the situation generally is, and most apps are really far from it.

"Wow. Why do I find all the really howling bad stuff so close to databases?"

BTW, as a counterexample, I saw one developer figure out how many invoices were closed by retrieving millions of them from the db and checking in Perl. All for want of a HAVING clause. Needless to say, while this worked OK for small amounts of data, it failed with large sets.
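For the curious, the aggregate that was missing might look something like this. The schema is a toy invented for the sketch (one row per invoice line, with a settled flag); any real invoice model would differ, but the shape of the HAVING clause is the point:

```python
import sqlite3

# Hypothetical schema: one row per invoice line, flagged when settled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_lines (invoice_id INTEGER, settled INTEGER)")
conn.executemany("INSERT INTO invoice_lines VALUES (?, ?)",
                 [(1, 1), (1, 1),   # invoice 1: fully settled -> closed
                  (2, 1), (2, 0),   # invoice 2: one open line
                  (3, 1)])          # invoice 3: closed

# One aggregate query instead of dragging every row into the client:
# HAVING keeps only the groups where nothing is outstanding.
closed = conn.execute("""
    SELECT invoice_id
    FROM invoice_lines
    GROUP BY invoice_id
    HAVING sum(settled) = count(*)
""").fetchall()

print(len(closed))  # 2 closed invoices
```

The database does the grouping and filtering; the client only ever sees the handful of rows it actually asked about.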

I've made a few SQL errors in my past that failed on large sets, but fortunately I was working against large sets, and they failed miserably, and we had to figure out why. I got good advice from an expert and fixed the problem. In many ways keeping queries separate, and focusing on readability/code standards helps a great deal I think.

The worst problem I've seen in a similar vein was a build process that involved munging a large (200+ mb) xml file. Someone had written a little JS tool for the purpose, and later the use of it had expanded so that it was being run over 150 times during the build. Since it took about 4 minutes (each time) just to read and save the xml file this meant that the build was now spending around 10 hours every day just munging the xml file. I rewrote the JS bit so that it took in a config file and could do all the processing necessary in a single pass, which only took 4 minutes.

Just so folks know, there are tools that will decompose queries and make nice little pictures out of them. With something like this, you'd have to use it just to get started.

Once you've visually decomposed it, you'd physically decompose it by splitting it inside-out. Then proceed to understand and debug inside-outwards.

Not fun, but not impossible. Just a huge pain in the ass. Making it more fun would be a database with bad RI, nulls, and duplicate data all over the place. Don't get me wrong -- from looking at the image it definitely looks like aspirin will be required. :)

I use the cheapest and the best tool available: EXPLAIN EXTENDED

You cannot do any better than have the DB tell you exactly what it is going to do with your query. Then you can experiment with changing / adding / removing clauses and see how it would affect the query plan produced by EXPLAIN.

For example if EXPLAIN says the query would generate a temp table you could often achieve improvement by managing the same temp tables explicitly. Many times you can get a huge performance lift by using a "group by index". You could identify and rewrite un-indexed table scans too.

I've found lots of other "unobvious" optimizations that cut down queries that ran for days or hours to minutes or seconds.
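The same experiment is easy to reproduce in miniature. This sketch uses SQLite's EXPLAIN QUERY PLAN (a rough analogue of the MySQL EXPLAIN being discussed) to watch a query flip from a full table scan to an index search; the exact plan strings vary by SQLite version, so treat the comments as examples:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")

# Without an index, the planner has no choice but to scan the table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 'x'").fetchall()

conn.execute("CREATE INDEX idx_val ON t (val)")

# With the index in place, the plan switches to an index search.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 'x'").fetchall()

print(plan_before[-1][-1])  # e.g. "SCAN t"
print(plan_after[-1][-1])   # e.g. "SEARCH t USING INDEX idx_val (val=?)"
```

This is exactly the experiment loop the parent describes: change a clause or an index, re-run EXPLAIN, and see what the database says it will actually do.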

Here are a few references to get started-

1) http://dev.mysql.com/doc/refman/5.5/en/execution-plan-inform...

2) http://www.mysqlperformanceblog.com/2006/07/24/extended-expl...

3) http://www.mysqlperformanceblog.com/2010/06/15/explain-exten...

I've always believed those tools must exist, but sorting out the SEO crap and advertising copy that gloms up search results for them is painful. Can you name any Linux/Mac tools like that?

Not linux, but I have a windows example. Being an El Cheapo, what I do from a windows box is fire up MS Access, link into the database (and this works even if the database is on a *nix box somewhere), then drag the text of the query into the graphical query builder tool. It's taken some pretty complex stuff apart for me in the past.

It's been a while since I last deep-dove in a complex database, so I don't have any other examples handy. Sorry. Maybe somebody else can pick up the thread. I remember reviewing a bunch of them several years ago -- these tools have been around for a long time.

On PostgreSQL it's part of pgAdmin; you don't need a separate tool.

I haven't found anything for Mac/Linux to do it but on Windows Visio can actually do a good job of visualizing a schema and SP structure.

I want to see someone use a language model for SQL to reconstruct the query just from that photograph, and then debug it.

I once worked on a site where the original developer clearly didn't know joins existed, so if he wanted data from two related tables, he'd get all the required results from table one, then loop through them, one by one, querying table two for the corresponding record. Sometimes this went 3 or 4 tables deep, the site would take nearly a minute to load a table of products.

Mmmm... nested loop joins. Little more code and he would have his own database. :)

I've seen this in an Access application. I've also noticed a strange character that constantly shows up: the accountant that has acquired a taste for programming. Luckily, there never seems to be more than one per institution.

I think Accenture is the corporate instantiation of that character.

I see this pretty often with people who haven't spent much time with relational databases. I suppose it's understandable since thinking in sets with SQL is a different paradigm than they are used to, but I even see tons of PHP/MySQL tutorials online that use this method when they should be using a join.

In the old days, MySQL didn't support foreign key indices, so the LAMP developer community had this mentality that "joins are slow" or even "joins are evil" and actively encouraged people not to use them. So it's not really surprising this kind of thing is still out there - PHP tutorials are somewhat infamous for promoting 'worst practices'.

Yeah, I know a Rails developer who thinks four-way joins are a design smell.

I think it would have been understandable if he'd grabbed all the required ids from table one and done some kind of WHERE IN() with those ids on table two. But the way he did it, if table one returned 100 rows, then he'd be making 101 queries to the database. Don't think I could ever understand that :S
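Here are the anti-pattern and the fix side by side, sketched with SQLite from Python (the products/categories tables are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category_id INTEGER);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO categories VALUES (1, 'tools'), (2, 'toys');
    INSERT INTO products (category_id) VALUES (1), (1), (2);
""")

# The N+1 anti-pattern: one query for the parent rows, then one more
# round trip per row for the related record.
queries = 0
rows = conn.execute("SELECT id, category_id FROM products").fetchall()
queries += 1
n_plus_1 = []
for pid, cat_id in rows:
    (name,) = conn.execute("SELECT name FROM categories WHERE id = ?",
                           (cat_id,)).fetchone()
    queries += 1
    n_plus_1.append((pid, name))

# The same result in a single round trip with a join.
joined = conn.execute("""
    SELECT p.id, c.name
    FROM products p JOIN categories c ON c.id = p.category_id
""").fetchall()

print(queries)                              # 4 queries for 3 rows
print(sorted(n_plus_1) == sorted(joined))   # True
```

With 100 rows the loop version really is 101 queries; the join version stays at one no matter how many rows come back.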

I'm currently reading "Mastering Relational Database Querying and Analysis" by John Carlis, which posits that SQL is inherently flawed for several reasons. To paraphrase from the text:

First, both the syntax and the way querying is generally presented in textbooks, lead you to think that your task when querying is to display one unnamed table. The author objects to each of those four words.

Second, many people have found querying with SQL terribly difficult. Even experts find SQL hard to create and read. Do not be surprised if an analyst struggles to understand his/her own SQL. It is impossible for users to understand any but the simplest SQL.

Third, SQL practice suffers from the notion of a "correlated query" -- which has a monolithic subquery that is executed repeatedly via looping, once for each value of a candidate row picked by an outside SELECT.

The book has much more to say on the topic of SQL before going on to offer relational algebra (built on top of SQL) as an alternative.

I disagree with the readability part. I make heavy use of "WITH" to name my intermediate steps and comment the tricky parts (as I would do with any other programming language), and my colleagues find them pretty readable (or that's what I'm told). That's how I "reverse engineer" such monster queries: refactoring them into intermediate relations with names. Pretty often the same subselect is used more than once.

In fact, due to the lack of side effects, it's much easier to do than with procedural code.
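A miniature version of that WITH style, run through SQLite from Python (all the names are invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (customer_id INTEGER, amount REAL);
    INSERT INTO payments VALUES (1, 10), (1, 30), (2, 5);
""")

# Each intermediate step gets a descriptive name via a CTE, and the
# same subselect can be reused by name instead of being pasted twice.
rows = conn.execute("""
    WITH totals AS (                  -- total paid per customer
        SELECT customer_id, sum(amount) AS total
        FROM payments
        GROUP BY customer_id
    ),
    big_spenders AS (                 -- the tricky filtering step, named
        SELECT customer_id FROM totals WHERE total >= 20
    )
    SELECT t.customer_id, t.total
    FROM totals t
    JOIN big_spenders b ON b.customer_id = t.customer_id
""").fetchall()

print(rows)  # [(1, 40.0)]
```

Reading top to bottom, the query tells the same story a sequence of well-named local variables would in procedural code, which is the readability argument being made above.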

Doh! I must be one of the SQL bunnies everyone else is superior to...

I didn't know about the WITH statement. Thanks for enlightening me.

I disagree with those criticisms of SQL, but I think it is inherently flawed from another perspective. Obviously how textbooks portray SQL has nothing to do with whether it is inherently flawed.

The real issues with SQL come down to the fact that it is an imperfect representation of relational math, and then also the dreaded ambiguity regarding NULLs.

The problem with NULLs is that NULL is used to refer to two very distinct conditions. In one case (outer joins) they are used to refer to a NULL set. In another, unknown values. These really should be distinct values.
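A tiny demonstration of that ambiguity, using SQLite from Python: a stored NULL ("we don't know the value") and an outer-join NULL ("no matching row exists") come back looking identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phones (person_id INTEGER, number TEXT);
    INSERT INTO people VALUES (1, 'ann'), (2, 'bob');
    -- ann has a phone row whose number is unknown (stored NULL);
    -- bob has no phone row at all.
    INSERT INTO phones VALUES (1, NULL);
""")

rows = conn.execute("""
    SELECT p.name, ph.number
    FROM people p LEFT JOIN phones ph ON ph.person_id = p.id
    ORDER BY p.name
""").fetchall()

# Both cases collapse into the same NULL in the result set.
print(rows)  # [('ann', None), ('bob', None)]
```

To tell the two cases apart you have to select an extra column (e.g. ph.person_id) and test it for NULL, which is exactly the kind of workaround the comment is objecting to.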

A lot of the problems I have seen with queries (other than DBA issues) is the conflict between application developers and report writers. A lot of databases are designed for transactions and resources are not often available to do a proper reporting database or at least summary data.

I have a very simple rule for myself - "if a user of the application is concerned about a certain attribute or state an element (e.g. person, truck, plane) is in, then a report will be required showing all elements with that attribute or state."

If your database design cannot support that rule, then trouble will happen and you will have serious performance problems.

To give a simple example, suppose you are running a group of storage garages. You have a table with all your customers, a table with all your storage units, an assoc table joining customers and units with an active flag + start date, and a table with all your payments. Good enough to do transactions and figure out, for a unit, whether they are paid up.

On the other hand, writing the report to tell who hasn't paid is going to be kind of a pain. It is a simple example, but not much different from what you find in large systems.
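One sketch of such a report, under the simplifying assumptions that payment is a flat fee per unit per month and that a payments row records which period it covers (SQLite from Python; every name here is invented, and a real schema would be richer, as the replies below discuss):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE units (id INTEGER PRIMARY KEY);
    CREATE TABLE rentals (customer_id INTEGER, unit_id INTEGER,
                          active INTEGER, start_date TEXT);
    CREATE TABLE payments (customer_id INTEGER, unit_id INTEGER,
                           period TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO units VALUES (10), (11);
    INSERT INTO rentals VALUES (1, 10, 1, '2011-01-01'),
                               (2, 11, 1, '2011-02-01');
    INSERT INTO payments VALUES (1, 10, '2011-11', 50.0);  -- only ann paid
""")

# "Who hasn't paid this period?": active rentals with no matching
# payment row, found via LEFT JOIN ... IS NULL.
delinquent = conn.execute("""
    SELECT c.name, r.unit_id
    FROM rentals r
    JOIN customers c ON c.id = r.customer_id
    LEFT JOIN payments p ON p.customer_id = r.customer_id
                        AND p.unit_id = r.unit_id
                        AND p.period = '2011-11'
    WHERE r.active = 1 AND p.period IS NULL
""").fetchall()

print(delinquent)  # [('bob', 11)]
```

Even in this toy form the report needs a three-table join plus an anti-join, which is the parent's point: a schema that is perfectly adequate for transactions makes even the most basic report non-trivial.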

Is your point here that the data being stored is insufficient (e.g. you'd want an end date, not just an active flag; this doesn't cope at all with prices changing over time, bulk discounts, or different customers paying different rates; there's no concept of invoices, or whether payment is due based on calendar months or based on opening date; etc) or that you're ignoring all that sort of stuff just to keep the example simple (so assume everyone pays a fixed rate per unit, due weekly; someone can't close their account until they're paid up; etc) but that you'd still want a more complex schema so as to be able to more easily generate a "Who owes us money?" report?

If it's the former, then sure: you need to be able to model all these things properly. If it's the latter, then I'm not so sure. The SQL to create that sort of report is going to be non-trivial, but it shouldn't be overly complex for someone who knows what they're doing, and if you have the correct indexes it shouldn't take very long to run either.

If you want to start doing all sorts of fancy data warehouse slicing and dicing, you're usually better extracting daily (or more/less frequent depending on needs) dumps of your transactional database into a different structure more suitable for reporting, than in restructuring your 'live' database and having to deal with all the resulting denormalisation issues, etc.

As stated, I believe all the information needed to generate the report is there. My point was that what is sufficient for an application is generally insufficient for basic reporting. So, when budgets get short, the database work to make queries by reporting easier is often ignored and complicated queries and processes become the norm. Those laying the problem solely at the feet of the DBA's are missing the other group that tends to make these queries necessary: App Developers. I once had an app developer tell me one type of invoice was impossible to make because some of the relationships and data were intrinsic to the application and would need to be modeled the same as the application. He wasn't far off.

There is a point between transactional and data warehousing that needs to be hit. Simple summaries, or considering the question of "how do I retrieve everything in this state," will sometimes suffice. Building a system to get single transactions in and out tends to make routine reports take all night or be impossible to get in a reasonable amount of time.

If your schema and app requires non-trivial SQL, then expect maintenance nightmares and lack of ability to scale.

Not always. An app requiring non-trivial SQL can be an indication of a problem requiring non-trivial workflows.

If the problem is complex, you want to model relations, not ignore them.

(But clearly not in this hairy query.)

The denormalization argument is misguided.

Denormalization is a performance consideration, and now even happens transparently at the storage layer in some RDBMs. So restructuring a 'live' database should be done for semantic reasons: because the workflow changes; because the app needs to change.

When I saw this article in my RSS reader, it reminded me of one particular query that was generated in the application I'm maintaining. My irrational fear of sending too many queries to the database (I've outgrown that in the last 6 years) caused a single query to be generated which was 4KB in size.

Which of course is much less than the one in the picture, but still very, very bad.

So we've refactored the beast. Now it's 2-3 smaller queries (which are much easier to optimize for PostgreSQL and, above all, individually cacheable) which led to a nearly 100% speedup for common cases. Also, the code is infinitely more readable, which means that it's much easier to extend it.

I'm incredibly happy that we've seen the light and fixed it before it grew to proportions like the ones on the original article shudder

4KB? That's nothing. Search for 'media.sql' in any recent Adobe installation media. You'll find 3 MB+ SQL files, containing:

  -(BASE64?) encoded InstallerIcon,
  -(BASE64?) encoded EULAs in various languages
  -GUIDs like {01C3BD72-7371-4472-B179-B4DFE6DDD251}
  -and my personal favorite: a 25 KB XML fragment

There was a source control system whose code I had the pleasure of reading several months ago whose favorite way to store data was as a SQLite3 database, with a single table, with a single column, with a single row, containing JSON. Words failed me.

Based on what you're describing, I now believe those developers were poached from Adobe.

It's common to use SQLite for data storage in Mac OS X, but 3MB SQLite databases aren't the same thing as a 4KB database query, which is the horror the original comment was admitting to. :o)

I can see the value of sticking that kind of stuff into a SQLite database rather than some obscure structured resource format.

People in that thread are bragging about their 10-page queries with 20 joins or 8 unions.

I'm looking at a query here that is 37 printed pages, with 92 joins over 25 unions.

I think at this point I am bragging that I have never written one. OK, I have used views of views, but...

I have, however, had the misfortune of troubleshooting those 10-page queries. Finding a stupid typo in one of those is like looking for a needle in a haystack.

And thus Skynet was born.

My first assignment in my first ever job working for a 'proper consultancy' was to babysit an overnight process which was a SQL Server stored procedure. Back in those days they had a 64k limit, so it was split into 3 or 4 sections. It took around 12 hours to run.

I'd like to say I rewrote it, but I didn't. I just left.

Reading the comments below, I get the impression that all the "good" DB people hang out on HN, not like those "other" incompetent nits out there who don't know what a join is. Hubris, people.

Maybe HNers are just better at keeping their mouths shut when they don't know anything about a topic.

I guess you're new around here ;-)

Really nasty piece of SQL code, definitely not for human-based processing. What do you think about tools that may decipher and visualize such complex queries in a more structured way, like DBClarity (http://www.microgen.com/dbclarity/)? Have you been using something similar recently?

(disclaimer: I work for mcgn)

The original version of foursquare.com contained a lot of stuff like this (though not as epicly bad). It was a very small amount of poorly written PHP code surrounding a bunch of unreadable SQL statements. It's amazing that it worked at all.

Dens is a great guy, but I hope I never have to rewrite his code again.

I was just trying to figure out a stored procedure that queries one table, loops over the rows, and within the loop queries another table using the values from the first query. Now, looping over these rows, it has a third query and a corresponding loop over those rows.

And all that for inserting the values taken from the three tables into a 4th table. This could have been done with a simple 3-table join query. Hell, it could even have been done with a single insert statement! I wonder how people fail to recognize an N+1 selects problem when it's staring them in the face.

Well, to be fair, this problem I described isn't exactly an N+1 problem is it? More like an N(M(L+1)+1)+1 selects problem. ;-) (Unless I've got my math all wrong there?)

How I hate working with PL/SQL stored procedures! :(
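The single-statement version of that nested-cursor pattern might look like this (SQLite from Python; the three source tables and their join keys are invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, x TEXT);
    CREATE TABLE b (a_id INTEGER, y TEXT);
    CREATE TABLE c (b_y TEXT, z TEXT);
    CREATE TABLE target (x TEXT, y TEXT, z TEXT);
    INSERT INTO a VALUES (1, 'x1');
    INSERT INTO b VALUES (1, 'y1');
    INSERT INTO c VALUES ('y1', 'z1');
""")

# The whole three-level cursor dance collapses into one
# INSERT ... SELECT over a three-table join.
conn.execute("""
    INSERT INTO target (x, y, z)
    SELECT a.x, b.y, c.z
    FROM a
    JOIN b ON b.a_id = a.id
    JOIN c ON c.b_y = b.y
""")

rows = conn.execute("SELECT * FROM target").fetchall()
print(rows)  # [('x1', 'y1', 'z1')]
```

No loops, no intermediate result sets shipped back to the client: the database does the whole thing in one statement.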

Reminds me of the stock update system for one of our major clients at my first job. The predecessor of myself and my colleague had thought it a brilliant idea to build a clothing ecommerce site, with a complete list of all stock going back to the year dot with ASP and Access (that's Classic ASP, not .NET). Towards the end the stock update would take pretty much an entire afternoon to run.

Eventually we got the approval to change to MySQL for the database. When they ran the first stock update with the new version they rang us up to check it had worked because it was near instantaneous.

The moral of the story: Access is BAD! VERY BAD!

I've seen and had to debug longer stored procedures for sure but never a single query. I can't see the last page so maybe this is a stored procedure. It's hard to tell.

I worked for a company where there were queries somewhat like this, however they were obscured because they would create views on the fly. So a query would look deceptively simple only to realize (not exaggerating here) there were four levels of views underneath it. Bugs were a pain, but the worst was trying to optimize those queries. Just untangling what the actual query was made life really difficult.

I've seen SQL that looked like this but didn't wind up being very complicated. I've also seen seemingly simple queries that were actually very tricky!

I can't read much of the query, but at least a few lines are checking for null values. I wouldn't be surprised if 80-90% of the query is simply output formatting. Depending on the DB platform, some formatting and null-check statements are fairly verbose.

People can write queries that large without formatting? Or was that the result of some query generation application?

I'll wager it's generated within the code, based on report parameters or something similar.

That does look like rough going... though in my experience, a nasty-looking mess like that may not be so bad if you simply format it -- with indenting to indicate subquery levels, and maybe some color (like light grey for formatting/null check fluff and bold for SQL keywords).

Often a massive query like that indicates business logic and even display details built into the query -- are there lots of WHEN clauses? Big chunks of the query just managing formatting the output?

If SQL is your "hammer" to fix every problem, you can take a simple query -- say, fetching a user's full name and address from a single table -- and make it really damned long just by cramming everything into the query... is there an address2? A middle name? Did the user capitalize their names (you could fix that in SQL...)? Etc..

I have written some very long SQL before -- a few years ago I flattened many many pages of buggy PHP into a dozen carefully-designed (but a bit complex) materialized views that were the base for the final reporting queries (which could be pretty straightforward; the data involved was already neat & tidy).

If you blew up each view involved and removed all formatting, the reporting queries would be pretty tough to digest; that's the point, though -- all of the complexity was cut up & compartmentalized into well-named bite-sized pieces, so other developers were still able to maintain and extend it after I left.

It's got to be from a point-and-click Query-O-Matic tool.

Or an ORM.

Please forgive me, this is OT: can anyone here recommend a good online resource for learning SQL "the hard way"?

http://sql.learncodethehardway.org/ (not yet finished though)

I will say, Ingres was not my favorite database, but its query plan display should be used to explain how a database query works. It showed a tree of operations for each query. If you saw FSM (full sort merge) or Cartesian Product you better mean them or re-write the query.

I'm pleased that most of the commenters recognize that SQL fills a need and is its own language, for good and bad. I really expected to find a troll popping out "NoSQL rulez" type comments, and the level of understanding of how and where SQL can help is very encouraging.

Soooo... what is it supposed to do? Looks like a million sub-selects and joins! Already have a headache just looking at it. I'd probably rewrite it all over again from scratch.

A place I used to work at used an ORM which gradually constructed SQL throughout the flow of a request. One of the calls we generated was probably a couple of pages long.

I'd love to see an EXPLAIN on that. ;)

  * 1. row *
  id: 1
  select_type: SIMPLE
  table: lots_of_em
  type: not_good
  possible_keys: none
  key: none
  key_len: n/a
  ref: NULL
  rows: googol
  filtered: 0
  Extra: You're screwed. Do not pass go.

This is an old link, but you just reminded me of: http://howfuckedismydatabase.com

That's awesome, especially this - http://browsertoolkit.com/fault-tolerance.png

hahaha great :D

htsql (www.htsql.org) is a business reporting language -- one line in htsql can generate 5 or 6 lines of SQL.

This query could be condensed considerably if rewritten in htsql.

(htsql automatically generates SQL code that covers all corner cases and executes faster than hand-crafted SQL.)

While we're sharing horror stories...

I once encountered a stored procedure that returned HTML in a result set. It literally created the UI of a webpage. It returned several columns of HTML that the app would place in strategic parts of the page. Well, as years went by, the app required a more innovative and web 2.0 UI. Rather than remove the HTML from the sproc, more columns were returned with more HTML, Javascript and the like. One time I had to fix some Javascript that rendered on the page. When I finally found where the errant JS was coming from, I realized I had to file a database change ticket to fix the UI :)

I had a boss one time who, rather than write a CGI, would write SQL statements against Oracle to format the results as HTML.

Things like

  select '<table>' from dual;
  select '<tr><td>'||column1||'</td><td>'||column2||'</td></tr>' from dual;

At least some versions of Sybase SQL Anywhere have a small HTTP server built in. Specially named/declared stored procs return HTML. I've built a smallish web app based on one, complete with Ajax touches. It was fun for one of my first programming jobs, but I don't know how maintainable it ended up. On one hand, at least your logic is close to your data... on the other hand, oh my god.

I once joked (as an April fools joke) about writing an HTTP server as a stored procedure on PostgreSQL (that would mean patching PostgreSQL to run as root since only root can listen on port 80).

You can read the email at http://ledger-smb.1045705.n5.nabble.com/Announcing-the-Devel...

Unfortunately people thought I was serious and I still get questions about it. Reading this thread I am starting to understand why....

Sounds like Oracle's Portal product?

I would guess Oracle's mod_plsql

It was a custom built web app that had grown over the course of ten years or so. It was actually a fairly heavily trafficked web site, and this was a particularly old piece of it.

See, this kind of crap is what happens when you have the programmers sit on the bench and let the "engineers" take charge. Why do you hate society, you Luddites?!

This is kinda impressive.

I'm sure 12 pages of Java code doing this procedurally instead is much more maintainable.

Sorry, I call BS.

I might have just been lucky enough to always work at professional companies and startups where this kind of stuff can never happen.

But something tells me there is no reasonable way an SQL query can grow to these proportions.

All the good techniques and technologies you use now? Yeah, those were invented because the way a lot of people used to do things was terrible.

I'm not old enough to have built systems like that, but I have inherited some for maintenance and debugging. That's not likely to be BS. Back in the day "do it in a stored procedure" was common advice. All the logic for untold applications and millions of lines of code lived at the database level.

Count yourself lucky perhaps to have entered the market when you did, but don't discount that you're standing on tall, sometimes ugly, shoulders.

There's still plenty of DBAs out there preaching "do it in a stored procedure", because in theory it lets them do things like changing normalization without affecting the application interface. (In practice, I've never seen happen that way.)

But in general, people did things that way because there wasn't any common middleware or services interface. The RDBMS was the "app server".

It's a good point. At one time the database was the application. Stored procedures and such were part of the application UI (as it were). At one time features were being added to database systems so you could build command-line interfaces for the end-users. These days the database is mostly just used as a storage engine.

Not that I really care about "back in the day!" But it can be helpful to understand why certain things were done in a certain way. It seems like ancient history, but you definitely will find some old code when you start working at companies that have been around for more than a few years.

Systems like that don't even have to be particularly old. I used to work on a system that started out as an Access database, and rather than doing things properly, the developers basically just translated everything into SQL statements and used ASP and COM+ on top of it.

It was a load of crap, but it still ran about 10x faster than what customers were used to, so they weren't complaining. Last I knew they were still running the same basic setup, and it's been 5 years since I worked there.

Most of the stuff on Daily WTF comes from corporate America, often from places where the chief focus of the company is not technology/web.

Don't know what to say, other than it's not only possible, it's not all that uncommon. I can't speak to legitimacy of this particular entry, but the absurdity of it doesn't shock me at all.

Spend some time in the marketing or IT divisions of many Fortune 1000 companies, and you'll understand how Daily WTF is therapeutic for those who have to work in those environments.

I was once handed a database with four SQL procedures in it. Each procedure was 6+ pages of run-on PL/SQL. The picture in the original article is frighteningly close to what I had in hand at that point. Mostly nested cursors and imperative style programming munged into the proc.

I've seen other cases, none quite as bad as that one. It tends to be the result of giving a developer who hasn't had any formal training or experience with SQL a requirements doc without any supervision. If all you have ever done is normal imperative coding, you will not even know you are doing it wrong.

You can also find some weird SQL created by report generator tools.

I worked at a company that had a ton of business logic in stored procedures. Yes, queries can get ridiculously long. All of their queries were written by hand.

Why not? It's probably computer-generated. You can actually see the first page, and if you look at it, it uses nested SELECT statements. No one said it was smart, but it certainly is reasonable.

Automated query builders.

No one claimed it was reasonable. But it happens.

