It's funny that this even needs to be a blog post. I can see why simple key-value stores have become so popular in recent years. Developers don't want to learn SQL properly, end up using the database as if it were a key-value store, and at some point begin to question why they need SQL at all, so they switch to NoSQL. Later, when they do need more sophisticated querying, they have to slap Elastic on top of their key-value store to get that feature back, when they could've been using SQL all along. Full circle!
So, while I agree that too many developers don't write SQL these days, my experience is that the database is usually the bottleneck for any application I write on the web. There's a reason there's been a trend away from doing things in the database for years, and a lot of it has to do with Moore's law slowing down in the last decade. I can trivially scale out web heads to do many of the calculations the database could be doing for me, so if I do work in the database, I take a bigger and bigger risk that I'm going to need to add a read replica, at which point my code is at least an order of magnitude more complex. I've supported systems where the replication latency of the read slave caused enough problems, due to assumptions baked into the code, that we ended up just turning off the read replicas and living with the degraded performance of the app.
So I would say, in the spirit of the article, that we probably aren't writing enough SQL; there are definitely things SQL can do that are essentially free in terms of computation/memory/network. I will warn that we should do it with eyes open and treat the database as the scarce resource it will likely become. It's often important to ask, "will this query that returns exactly the data I want put less load on the DB than a simpler query that returns more data, counting its network load?" I've certainly picked up battle scars even from trivial SQL queries.
Once we were sorting 10 small results in SQL, which led to huge load on the database: the sort was on a TEXT blob, and even though the sort could have been done in memory, MySQL required a filesort, which meant it had to create a temp table, which meant it had to touch the disk, which meant I/O latency and huge application degradation. The fix ended up being changing the TEXT field to a VARCHAR(65000), but it was still a wake-up call for me about just how alien and non-intuitive the SQL ecosystem can be. (As a side note, it led me to realize just how foreign modern computer architectures are in general compared to the CS I learned as an undergrad. I also highly recommend Martin Thompson's talk on mechanical sympathy: https://www.youtube.com/watch?v=MC1EKLQ2Wmg.)
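Roughly what that looked like, with made-up table and column names (the real schema isn't shown here):

    -- Sorting on a TEXT column keeps MySQL from using an in-memory temporary
    -- table, so even a tiny result set can end up in an on-disk filesort.
    EXPLAIN
    SELECT id, body
    FROM notes
    WHERE user_id = 42
    ORDER BY body;
    -- Extra column typically reports: Using where; Using filesort

    -- The fix described above: make the column a VARCHAR so the sort can stay
    -- in memory (whether 65000 fits depends on the character set and the rest
    -- of the row).
    ALTER TABLE notes MODIFY body VARCHAR(65000);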
I agree with what you're saying, but I think it's important to point out that there is a developer experience in most mainstream languages, and then there is the operations experience of deploying, configuring, maintaining, scaling, and paying for various RDBMSs.
Developers grow up crashing their programs. Crashing your DBMS (or similarly messing it up) is not as friendly or immediate in feedback/consequence, so people shy away from it.
The end result is that many developers are influenced to use code more and databases less.
Elasticsearch quite conveniently makes it so difficult and fucking annoying to do data analytics with their pile of turd custom query un-language shambles that it leaves anyone stuck with it rightly reinventing oblong wheels with diamond bearings lubricated with iron filings in code. Aah, but dev-mis-ops say that it’s easy to deploy and maintain on AWS. It scales too! Well, at least the whole set-up keeps everyone employed.
Could you elaborate more on your specific ES problems? Having worked with Elasticsearch extensively for many years, I do not share this sentiment.
Many ES queries fundamentally aren't meant to have a 1-to-1 equivalent in SQL, as ES's goal is to specialize as a distributed search index, not a general relational database.
If your point is that ES is increasingly being used as an alternative to just writing/scaling/using a SQL service to its full potential, then I can somewhat understand the sentiment.
I completely agree with the idea that ES might be good as a distributed search index. I really wish people would leave it at that.
The company where I last worked had a main product which would send in readings every second. The only interface to these readings was Elasticsearch. dev-mis-ops were convinced that this was a good idea. Even the front end would send its queries to Elasticsearch to display statistics.
My task was to look at the average energy usage over one year, broken down by day of the week and divided into fifteen-minute intervals. This is literally just a SELECT with MOD and GROUP BY in SQL. I am not even talking about more advanced SQL features here like window functions, transactions, joins and so on.
I tried for ages, wading through terrible documentation, trying to get it to work, at which point I thought I'd have one of the resident ES experts take a taste of his own medicine. It took about five attempts of back and forth over two days to get it right.
This wasn’t the only query we had problems with. At some point, I just asked them to give me a CSV interface so I could do it in AWK. You can’t even download CSVs of data sets without some external tool! They were still convinced that ES was fit for purpose.
What version of Elasticsearch are you using? Even in version 1.x, aggregation queries are so intuitive and pleasant to work with, imo. Now, the actual libraries written in Ruby and other languages may be clunky.
In fact, I am not sure why people need all those fancy BI tools. 99% of the time, Elasticsearch aggregations give me the data I need.
Challenge, with context in my other comment at the same depth:
Suppose you have a table which includes power usage readings for all devices, sent every five seconds. Something like this:
(id, timestamp, power)
Write me an ES query which gets the average power usage for device id=x over the last year, broken down by day of the week, with each day divided into fifteen-minute intervals.
To clarify:
The groups will be the product of the days of the week with the ninety-six fifteen-minute intervals in a day, making 7*96 unique groups.
You will not be able to do this in any simple way.
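For comparison, the SQL I have in mind is roughly this (PostgreSQL-flavoured, using the (id, timestamp, power) columns above; :device_id is a placeholder):

    SELECT
        EXTRACT(DOW FROM "timestamp") AS day_of_week,                -- 0 (Sunday) .. 6
        FLOOR((EXTRACT(HOUR FROM "timestamp") * 60
             + EXTRACT(MINUTE FROM "timestamp")) / 15) AS interval_of_day,  -- 0 .. 95
        AVG(power) AS avg_power
    FROM readings
    WHERE id = :device_id
      AND "timestamp" >= now() - INTERVAL '1 year'
    GROUP BY 1, 2
    ORDER BY 1, 2;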
This is an indexing problem. Add 2 extra fields. For each timestamp, you also need to index the 15-minute interval (ie 7:13:05 maps to 7:00-7:15) and the day of the week.
After that, do a terms aggregation on day of week with a subaggregation on 15 min interval.
If you can’t reindex all your data, then of course it will be hard.
But once the data is indexed and mapped the right way, it's easy.
So you couldn’t write the query. It just isn’t practical to create an index for every query which comes to mind during research, and even to create the index, you have to use a terrible, undocumented embedded query language.
Not create a new index - reindex the existing documents
I have personally managed 10+ TB clusters and reindexed multiple times when I need to add new fields. The key is to take some time to clearly understand what analytics you want to conduct, then add the fields you need afterwards.
Thanks for the detailed reply. I have been a bit inflammatory explaining my problem. I appreciate that for queries which are expected to be re-run, it is useful to have an appropriate index.
What I see from your proposal though is that the work for building the same query is split over re-indexing plus another now slightly less complex query.
What I have missed in Elasticsearch is the ease of expressing a data problem. I am also no stranger to learning new languages when needed. You might think differently, but I am of the opinion that the problem that I described translates a lot more easily into SQL.
What I need in this context is to have a language and tool where it is easy to try lots of different queries in succession; to be able to get a feeling for how the data works and test hypotheses. I haven’t been able to do this in Elasticsearch. I am not saying that there’s nothing which Elasticsearch is good for, but it’s just about using the right tool for the job.
I understand where the author is coming from, and for purposes of aggregation or filtering, they are correct; it’s often better to do those things in SQL. There are two problems, though. CPU and IOs on your DB server are likely your organization’s most scarce resource, and SQL is often hard to test and has unpredictable performance characteristics (cliffs, blocking, etc). By all means do the things you must do to get your app servers the data they need with the fewest IOs and bytes across the wire possible, but please do your string concatenation and sorting (when not using TOP or LIMIT) on cheap app servers that you can always deploy more of.
Relational databases are brilliant, but SQL is terrible and I don't understand how it hasn't gone the way of COBOL yet. For a start, the bizarrely English-oriented structure forces a completely different ordering of the code than what comes out, and the syntax appears to completely ignore the underlying tabular concepts. Not to mention that IMO joining on foreign keys shouldn't be a thing--the database should be able to figure out what you want when you write `order.customer.address.taxRegion.rate` without specifying any joins at all. Even if you use an ORM you still have to understand the code it will emit to avoid building bad queries. There's just so much awkwardness in trying to write a SQL query that I can definitely see why people would avoid it.
> Relational databases are brilliant, but SQL is terrible and I don't understand how it hasn't gone the way of COBOL yet.
Having written extensive amounts of SQL and greater than zero lines of COBOL (f. you, Lawson), this really couldn't be further from the truth.
> For a start, the bizarrely English-oriented structure forces a completely different ordering of the code than what comes out, and the syntax appears to completely ignore the underlying tabular concepts.
Every language has keywords; I'm not sure how you expect to write code that doesn't involve English in some form. Is the concept of `SELECT this_data FROM this_source WHERE true_conditions ORDER BY columns` really that convoluted? Sure, it definitely gets more complicated, but is there really a significantly better way to do it that doesn't involve taking away low-level control from the programmer, in exchange for something sexier and easier to pick up?
> Not to mention that IMO joining on foreign keys shouldn't be a thing--the database should be able to figure out what you want when you write `order.customer.address.taxRegion.rate`
Why can't you have a key-value storage engine that doesn't involve using keys to look up values?! (Is that really your question?) Okay, just as an example, what if you have more than one address for a customer? You need a way to tell the engine which address(es) you want.
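To make that concrete with made-up table names: once a customer can have several addresses, the dotted path order.customer.address.taxRegion.rate is ambiguous, and the join condition is exactly where you say which address you mean:

    SELECT o.id, tr.rate
    FROM orders o
    JOIN customers   c  ON c.id = o.customer_id
    JOIN addresses   a  ON a.customer_id = c.id
                       AND a.kind = 'billing'   -- the disambiguation the engine can't guess
    JOIN tax_regions tr ON tr.id = a.tax_region_id;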
> There's just so much awkwardness in trying to write a SQL query that I can definitely see why people would avoid it.
While I don't claim that SQL is easy (or always intuitive), saying that it is "terrible" for the above reasons is like saying that C is terrible because it's not Pythonic.
Re: Verbose/obtuse. There are different ways to use SQL. For example, IMHO, if you use the unnecessary "<LEFT/RIGHT> JOIN" syntax, then statements often come across as pretty unreadable, whereas you can generally achieve the same thing with alternate, more readable syntax.
As opposed to SELECT * FROM a, b WHERE a.id = b.id? Looks ok on paper, but in exchange for brevity you end up demolishing your readability on complex queries, since you can’t easily identify filters (as opposed to join conditions). Most style guides, including my team’s, consider explicit JOINs mandatory, and there’s a good reason for doing so.
Admittedly my work is very heavy on data manipulation; maybe complex queries are less of a concern for other use cases.
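Roughly, the contrast I mean (made-up tables); the explicit form keeps join conditions (ON) separate from row filters (WHERE):

    -- Implicit join: relationships and filters are mixed in one WHERE clause.
    SELECT *
    FROM orders o, customers c
    WHERE o.customer_id = c.id
      AND c.country = 'NZ';

    -- Explicit join: the same query, with the relationship spelled out in ON.
    SELECT *
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'NZ';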
I'm not often SQL wrangling these days, so you are probably more aware of the intricacies than I am, but often you can wrap subqueries like:
SELECT stuff FROM tables WHERE conditions AND table.field IN (
    SELECT othertable.otherfield FROM othertables WHERE more_conditions...
);
IMHO this reads far more easily and both depth and order of execution are clear. I guess at this point in my career (20 years in) my bias is toward code I can walk away from for years and come back to easily.
Clarity is better than cleverness. - Eric S. Raymond, The Art of Unix Programming (2003)
Be careful using GROUP_CONCAT in MySQL - it silently truncates the result to the group_concat_max_len setting, which defaults to 1024, and will not tell you it truncated the result in the logs. It took us forever to find the cause of this issue in one of our batch jobs (I wish MySQL issued a warning when truncating):
"In MySQL, you can get the concatenated values of expression combinations. To eliminate duplicate values, use the DISTINCT clause. To sort values in the result, use the ORDER BY clause. To sort in reverse order, add the DESC (descending) keyword to the name of the column you are sorting by in the ORDER BY clause. The default is ascending order; this may be specified explicitly using the ASC keyword. The default separator between values in a group is comma (,). To specify a separator explicitly, use SEPARATOR followed by the string literal value that should be inserted between group values. To eliminate the separator altogether, specify SEPARATOR ''.
The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024. The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet. The syntax to change the value of group_concat_max_len at runtime is as follows, where val is an unsigned integer"
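Roughly, the trap and the usual workaround (hypothetical orders table; the effective ceiling is still max_allowed_packet):

    SELECT customer_id,
           GROUP_CONCAT(order_id ORDER BY order_id SEPARATOR ',') AS order_ids
    FROM orders
    GROUP BY customer_id;   -- silently truncated at group_concat_max_len (default 1024)

    -- Raise the limit for the current session.
    SET SESSION group_concat_max_len = 1000000;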
SQL is a real programming language (even though many people seem to think otherwise). I use SQLite, although there are other SQL implementations (with different features) too.
It is true that many things can be done well with SQL, whether you call it from other programs or use it standalone.
Whether it will be faster to use SQL probably depends on which SQL implementation is in use, whether the indexing is set up properly, and what programming language you are calling it from.
Also, many things can be done with shorter code in SQL if you are doing such things as sorting the result set.
Some things are harder to do in SQL than in code. For example, assigning a rank based on a field in MySQL. I did this for production code in a Rails app, and it wasn't just ugly (requiring separate DB calls for variable initialization, output, and updating) but it ran slowly enough that it had to be scheduled for off-peak time, and it requires temporary variables (or you can do it with whole temporary tables, I believe).
That's not to say that I would do it in Ruby if I had it to do over again; I'd probably use Redis. But still, sometimes SQL isn't as fast as other solutions.
I like SQL. Optimizing DB calls to speed up stuff is fun for me. But it's not always the right choice.
(There's probably a clever and fast way to do it in postgres, but DB choices were not mine to make).
edit: Incidentally, if you just need to know one record's rank at a time, like getting a player's position on a leaderboard, instead of computing the rank of every record in the DB you can do a select on scores (or distinct scores) greater than the current record's score and add 1, and that can be much faster.
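Roughly, against a hypothetical scores table (:player_id is a placeholder):

    -- One record's rank, as in the edit above: count the better scores, add 1.
    SELECT COUNT(*) + 1 AS player_rank
    FROM scores
    WHERE score > (SELECT score FROM scores WHERE player_id = :player_id);

    -- On databases with window functions (PostgreSQL, MySQL 8+), ranking every
    -- row no longer needs the user-variable gymnastics described above.
    SELECT player_id,
           RANK() OVER (ORDER BY score DESC) AS player_rank
    FROM scores;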
I think the problem is with Excel sheets and CSVs. We aren't taught how to manipulate data otherwise without the use of code. SQL seems to have immense overhead. Set up a db, import data, etc. Although it makes sense, it's an afterthought. We know how to write scripts, but we don't necessarily need to learn SQL to get the job done.
This discusses the "how" but not the "why." I'm not a SQL expert, so I find it far easier to maintain my code in a language I am familiar with than to maintain (seemingly cryptic) SQL expressions, queries, etc. Is there a quantifiable performance gain from moving more operations into SQL rather than implementing them in my language of choice?
The conclusion states:
> Exploit the power of SQL to write less code because “the best code is the code that was never written”. If it is not written there is no need to maintain it.
But they're wrong, since you still need to maintain all of the SQL you wrote instead...
Edit: I previously said "I'm not a DBA", but some folks seem to be getting hung up on that rather than the rest of my comments/questions.
I’d love to see an answer from someone more experienced on this as well.
In my experience, the database is just faster than application code at most stuff, and queries can often be optimized by the database in ways that the equivalent logic at the application level can't be. Additionally, database code is usually (always?) going to be faster than something like Python/Ruby.
Code review is more difficult at the database level, however, and SQL talent is rarer. It may be more challenging to scale and maintain a team of engineers working on a codebase with a lot of application logic implemented in stored procs/SQL versus ORM code (such as Active Record or Django). Just my opinion; I have worked on teams with heavy ORM usage as well as with separate DBA teams and all logic in SQL. SQL by hand is more powerful and flexible, surely, but for most use cases I would prefer to keep code in one place (the application) and only reach for the tool (raw SQL) when needed for optimizations.
There's a meaningful difference to where a particular calculation happens. Even if the exact same steps are done, it's more efficient to look at 10 items in the database, pick 1 item, and transfer it out to the application rather than transfer 10 items to the application and then pick one item there; doing the calculation close to the source data enables ways to do it efficiently.
The optimizer (with its statistical knowledge about the data), being closer to the data (important data may always be in memory), and most importantly supporting data structures like indexes and materialized views.
Key-value stores are fine if you are always simply grabbing a single row/tuple from a single dataset. Any time you move beyond that, you're in for a world of hurt compared to a relational database that uses SQL.
The reasons are:
1) You often need a transactional, consistent view of the data across multiple datasets (tables). Relational databases using SQL provide that, key-value stores may or may not (most likely not).
2) Any time you're doing any kind of aggregation, key-value stores will require that you pull all data in locally and perform the operation there. It's questionable whether the local environment even has the resources to accomplish such a task, and the efficiency of such operations will be dismal compared to a SQL-based relational database server.
Key-value stores are, from an interface perspective, very much like the older ISAM databases that were popular before SQL and relational databases gained a foothold in the industry. They also recycled the exact same problems that SQL-based relational databases solved so elegantly (round and round we go).
Another way to put it is this: leaving aside scaling for now, SQL-based relational databases can typically do everything that a key-value store can do, whereas key-value stores cannot typically do everything that a SQL-based relational database can do.
I wasn't so much asking about relational databases vs key/value stores, but rather why should I implement things like advanced sorting, mathematical formulas, etc in SQL rather than in the application code that fetches data from SQL?
It's simply more efficient? If I have a table with millions of rows, I'm not going to bring down the entire data set on every request just to sort by latest and grab the 10 most recently updated rows that haven't been soft deleted (WHERE active=1 ORDER BY updated_at DESC LIMIT 10).
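Spelled out against a hypothetical posts table, with the composite index that lets the database answer this without scanning or sorting millions of rows:

    CREATE INDEX idx_posts_active_updated ON posts (active, updated_at);

    SELECT *
    FROM posts
    WHERE active = 1
    ORDER BY updated_at DESC
    LIMIT 10;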
- Advanced sorting plus pagination comes to mind. How would you do that without pulling unnecessary rows?
- GROUP_CONCAT or STUFF in SQL Server already decreases the number of rows. His example has one row returned instead of 3.
- Aggregating in SQL would also reduce number of rows returned instead of pulling all the aggregated rows to code and doing it there
- I would guess RDBMS implementations are written in something faster than whatever the ORM is implemented in, and that their code is more optimized - but that would be speculation.
Also, his examples are like the tip of the iceberg. There are all kinds of things like windowing functions and common table expressions that have so much power to get exactly as much data as you need right at the RDBMS without having to carry that baggage anywhere else.
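For example (hypothetical schema): the three most recent orders per customer, computed entirely in the database instead of dragging every order into application code:

    WITH ranked AS (
        SELECT o.*,
               ROW_NUMBER() OVER (PARTITION BY o.customer_id
                                  ORDER BY o.created_at DESC) AS rn
        FROM orders o
    )
    SELECT *
    FROM ranked
    WHERE rn <= 3;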
Anything that deals with the source data, such as aggregation, sorting, or partitioning, should be done with SQL.
Anything that deals with the formatting/display of the columns in the result set rows can be done in SQL, but doesn't have to be done in SQL. It's really a matter of convenience to do it in SQL, especially when dealing with multiple front-end languages, because it means that the formatting is done in one place.
The difference really becomes apparent when your database isn't an app specific data store, but a business data warehouse that is used in many places. You could build a full API layer and force people to go through that for everything, then have another thing to maintain, or you can implement it at the data level.
Each method has pros and cons, it depends on the capabilities of your database, your requirements in terms of security, and the degree of cross connection between data stores. For internal business tools I like to do stuff at the database layer, but it would be the wrong choice for a high volume, simple, public API.
Off the top of my head I'm thinking that there's a serious lack of thought in these use cases, especially string manipulation which I could imagine becoming painfully slow as the database grows; any manipulation of the projection like this is going to incur a large performance overhead.
It would be quite good to see this blog post, or a response to it, evaluate the SQL code's performance on different datasets.