So I would say that, in the spirit of the article, we probably aren't writing enough SQL; there are definitely things that SQL can do that are essentially free in terms of computation/memory/network. I will warn that we should do it with eyes open, and treat the database as the scarce resource it will likely become. And often it's important to ask, "will this query that returns the exact data I want put less load on the db than a simpler query that returns more data, along with its network load?". I've certainly had battle scars even with trivial SQL queries.
Once we were sorting 10 small results in SQL, which led to huge load on the database: the sort was on a TEXT blob, and even though the sort could have been done in memory, MySQL required a filesort, which meant it had to create a temp table, which meant it had to touch the disk, which meant I/O latency and huge application degradation. The solution ended up being to change the TEXT field to a varchar(65000), but it was still a wake-up call for me about just how alien and non-intuitive the SQL ecosystem can be. (As a side note, it led me to realize just how foreign modern computer architectures are in general compared to the CS I learned as an undergrad. I also highly recommend Martin Thompson's talk on mechanical sympathy: https://www.youtube.com/watch?v=MC1EKLQ2Wmg).
Developers grow up crashing their programs. Crashing your DBMS (or similarly messing it up) is not as friendly or immediate in feedback/consequence, so people shy away from it.
The end result is that many developers are influenced to use code more and databases less.
Many of the ES queries fundamentally aren't meant to have a 1-to-1 equivalent in SQL, as ES's goal is to specialize as a distributed search index, not a general relational database.
If your point is that ES is being more frequently used as an alternative to just writing/scaling/using a SQL service to its full potential, then I can somewhat understand the sentiment.
The company where I last worked had a main product which would send in readings every second. The only interface to these readings was Elasticsearch. dev-mis-ops were convinced that this was a good idea. Even the front end would send its queries to Elasticsearch to display statistics.
My task was to look at the average energy usage over one year for each of the seven days of the week, divided into fifteen-minute intervals. This is literally just a SELECT with MOD and GROUP BY in SQL. I am not even talking about more advanced SQL features here, like windows, transactions, joins and so on.
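To illustrate how short that query can be, here is a sketch of the weekday/fifteen-minute-interval average, using SQLite and made-up table and column names (MySQL would use DAYOFWEEK() and its own date functions instead of strftime):

```python
import sqlite3

# Illustrative schema: one power reading per row, as in the comment above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, ts TEXT, power REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", [
    (1, "2023-01-02 00:07:00", 10.0),  # a Monday, first 15-minute slot
    (1, "2023-01-02 00:08:00", 20.0),  # same slot -> averaged together
    (1, "2023-01-03 13:40:00", 5.0),   # a Tuesday, slot 54
])

# Average power per (day of week, 15-minute interval): at most 7 * 96 groups.
rows = conn.execute("""
    SELECT strftime('%w', ts)                                  AS dow,
           (strftime('%H', ts) * 60 + strftime('%M', ts)) / 15 AS slot,
           AVG(power)                                          AS avg_power
    FROM readings
    GROUP BY dow, slot
    ORDER BY dow, slot
""").fetchall()
print(rows)  # [('1', 0, 15.0), ('2', 54, 5.0)]
```

The whole "divide the day into 96 slots" part is one integer division in the GROUP BY.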
I tried for ages looking through terrible documentation trying to get it to work, at which point I thought I’d have one of the resident ES experts take a bit of his own medicine. It took about five attempts of back and forth over two days to get it right.
This wasn’t the only query we had problems with. At some point, I just asked them to give me a CSV interface so I could do it in AWK. You can’t even download CSVs of data sets without some external tool! They were still convinced that ES was fit for purpose.
In fact, I am not sure why people need all those fancy BI tools. 99% of the time, Elasticsearch aggregations give me the data I need.
Suppose you have a table which includes power usage readings for all devices, sent every five seconds. Something like this:
(id, timestamp, power)
The groups will be the product of the days of the week with the ninety-six fifteen-minute intervals in a day, making 7*96 unique groups.
You will not be able to do this in any simple way.
After that, do a terms aggregation on day of week with a sub-aggregation on the 15-minute interval.
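Assuming the readings were indexed with derived fields named something like `day_of_week` and `interval_15m` (illustrative names, not from this thread), the nested terms aggregation body might look roughly like this:

```python
# Sketch of an Elasticsearch aggregation request body; the field names are
# assumptions. "size": 0 suppresses the raw hits so only buckets come back.
agg_body = {
    "size": 0,
    "aggs": {
        "by_day": {
            "terms": {"field": "day_of_week", "size": 7},
            "aggs": {
                "by_interval": {
                    "terms": {"field": "interval_15m", "size": 96},
                    "aggs": {
                        "avg_power": {"avg": {"field": "power"}}
                    },
                }
            },
        }
    },
}
```

This would be POSTed to the index's `_search` endpoint; the result is 7 day buckets, each with up to 96 interval buckets carrying an average.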
If you can’t reindex all your data, then of course it will be hard.
But once the data is indexed and mapped the right way, it's easy.
I have personally managed 10+ TB clusters and reindexed multiple times when I needed to add new fields. The key is to take some time to clearly understand what analytics you want to conduct, then add the fields you need afterwards.
What I see from your proposal, though, is that the work of building the same query is split across re-indexing plus another, now slightly less complex, query.
What I have missed in Elasticsearch is the ease of expressing a data problem. I am also no stranger to learning new languages when needed. You might think differently, but I am of the opinion that the problem that I described translates a lot more easily into SQL.
What I need in this context is to have a language and tool where it is easy to try lots of different queries in succession; to be able to get a feeling for how the data works and test hypotheses. I haven’t been able to do this in Elasticsearch. I am not saying that there’s nothing which Elasticsearch is good for, but it’s just about using the right tool for the job.
Having written extensive amounts of SQL and greater than zero lines of COBOL (f. you, Lawson), this really couldn't be further from the truth.
> For a start, the bizarrely English-oriented structure forces a completely different ordering of the code than what comes out, and the syntax appears to completely ignore the underlying tabular concepts.
Every language has keywords; I'm not sure how you expect to write code that doesn't involve English in some form. Is the concept of `SELECT this_data FROM this_source WHERE true_conditions ORDER BY columns` really that convoluted? Sure, it definitely gets more complicated, but is there really a significantly better way to do it that doesn't involve taking away low-level control from the programmer, in exchange for something sexier and easier to pick up?
> Not to mention that IMO joining on foreign keys shouldn't be a thing--the database should be able to figure out what you want when you write `order.customer.address.taxRegion.rate`
Why can't you have a key-value storage engine that doesn't involve using keys to look up values?! (Is that really your question?) Okay, just as an example, what if you have more than one address for a customer? You need a way to tell the engine which address(es) you want.
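For illustration, here is the path `order.customer.address.taxRegion.rate` spelled out as explicit joins, with a hypothetical schema where the many-addresses ambiguity is resolved by picking the billing address:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers   (id INTEGER PRIMARY KEY);
    CREATE TABLE addresses   (id INTEGER PRIMARY KEY, customer_id INTEGER,
                              kind TEXT, tax_region_id INTEGER);
    CREATE TABLE tax_regions (id INTEGER PRIMARY KEY, rate REAL);
    CREATE TABLE orders      (id INTEGER PRIMARY KEY, customer_id INTEGER);

    INSERT INTO customers   VALUES (1);
    INSERT INTO tax_regions VALUES (10, 0.08), (20, 0.05);
    -- the customer has two addresses: the join condition must pick one
    INSERT INTO addresses   VALUES (100, 1, 'billing', 10),
                                   (101, 1, 'shipping', 20);
    INSERT INTO orders      VALUES (1000, 1);
""")

rows = conn.execute("""
    SELECT o.id, r.rate
    FROM orders o
    JOIN customers   c ON c.id = o.customer_id
    JOIN addresses   a ON a.customer_id = c.id AND a.kind = 'billing'
    JOIN tax_regions r ON r.id = a.tax_region_id
""").fetchall()
print(rows)  # [(1000, 0.08)]
```

Drop the `a.kind = 'billing'` condition and the order matches both addresses, which is exactly the ambiguity the dotted-path syntax would have to resolve somehow.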
> There's just so much awkwardness in trying to write a SQL query that I can definitely see why people would avoid it.
While I don't claim that SQL is easy (or always intuitive), saying that it is "terrible" for the above reasons is like saying that C is terrible because it's not Pythonic.
I’ve had learning Datalog on my to-do list for a long time. I’d love to see some other DSLs for the relational algebra as well.
Admittedly my work is very heavy on data manipulation; maybe complex queries are less of a concern for other use cases.
I'm not often SQL wrangling these days, so you are probably more aware than I am of the intricacies, but often you can wrap subqueries like:
SELECT stuff FROM tables WHERE conditions AND table.field IN (
SELECT othertable.otherfield FROM othertables WHERE more_conditions...)
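A concrete, runnable toy version of that shape (the tables and conditions here are made up purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER, country TEXT);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'US');
    INSERT INTO orders    VALUES (10, 1, 9.0), (11, 2, 42.0), (12, 2, 7.5);
""")

# Outer query filtered by a subquery in the IN clause, as sketched above.
rows = conn.execute("""
    SELECT id, total FROM orders
    WHERE total > 8.0
      AND customer_id IN (SELECT id FROM customers WHERE country = 'US')
""").fetchall()
print(rows)  # [(11, 42.0)]
```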
Clarity is better than cleverness. - Eric S. Raymond, The Art of Unix Programming (2003)
He's got a point. Quote via http://github.com/globalcitizen/taoup
"In MySQL, you can get the concatenated values of expression combinations. To eliminate duplicate values, use the DISTINCT clause. To sort values in the result, use the ORDER BY clause. To sort in reverse order, add the DESC (descending) keyword to the name of the column you are sorting by in the ORDER BY clause. The default is ascending order; this may be specified explicitly using the ASC keyword. The default separator between values in a group is comma (,). To specify a separator explicitly, use SEPARATOR followed by the string literal value that should be inserted between group values. To eliminate the separator altogether, specify SEPARATOR ''.
The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024. The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet. The syntax to change the value of group_concat_max_len at runtime is as follows, where val is an unsigned integer"
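As a small runnable illustration of the same idea: SQLite's group_concat is a simpler cousin of MySQL's GROUP_CONCAT (the DISTINCT / ORDER BY / SEPARATOR options quoted above are MySQL-specific), but the shape is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (post_id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, "sql"), (1, "db"), (2, "search")])

# One concatenated string of tags per post, with an explicit separator.
# (SQLite does not guarantee the order of elements within a group.)
rows = conn.execute("""
    SELECT post_id, group_concat(tag, ', ')
    FROM tags GROUP BY post_id ORDER BY post_id
""").fetchall()
print(rows)
```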
All massively more capable than the toydb that is mysql.
I think the trouble with most developers is that their experience is mostly with this terrible DB.
It is true that many things can be done well with SQL, whether you call it from other programs or use it standalone.
Whether it will be faster or not to use SQL probably depends on what SQL implementation is in use, whether the indexing is set up properly, and what programming language you are calling it from.
Also, many things can be done with shorter code in SQL if you are doing such things as sorting the result set.
That's not to say that I would do it in Ruby if I had it to do over again; I'd probably use Redis. But still, sometimes SQL isn't as fast as other solutions.
I like SQL. Optimizing DB calls to speed up stuff is fun for me. But it's not always the right choice.
(There's probably a clever and fast way to do it in postgres, but DB choices were not mine to make).
edit: Incidentally, if you just need to know one record's rank at a time, like getting a player's position on a leaderboard, instead of ranking every record in the DB, you can do a select on scores (or distinct scores) greater than the current record's score and add 1, and that can be much faster.
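That rank trick, sketched with made-up data in SQLite (COUNT(DISTINCT score) gives the "distinct scores" variant mentioned above, so tied scores share a rank):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 100), ("b", 250), ("c", 175), ("d", 250)])

# Rank of one player: count the distinct scores above theirs, plus 1.
(rank,) = conn.execute("""
    SELECT COUNT(DISTINCT score) + 1 FROM scores
    WHERE score > (SELECT score FROM scores WHERE player = ?)
""", ("c",)).fetchone()
print(rank)  # 2: only 250 beats 175
```

With an index on score, this is a cheap range count instead of a full ranking pass.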
The conclusion states:
> Exploit the power of SQL to write less code because “the best code is the code that was never written”. If it is not written there is no need to maintain it.
But they're wrong, since you still need to maintain all of the SQL implementations...
Edit: I previously said "I'm not a DBA", but some folks seem to be getting hung up on that rather than the rest of my comments/questions.
Regarding your question:
> Is there a quantifiable performance gain with moving more operations to SQL rather than implementing it in my language of choice?
It depends, of course, but in my experience the answer is yes for 90% of the cases.
In my experience, the database is just faster than application code at most stuff, and queries can be written in ways whose equivalent logic can't be optimized at the application level. Additionally, database code is usually (always?) going to be faster than something like Python/Ruby.
Code review is more difficult at the database level, however, and SQL talent is more rare. It may be more challenging to scale and maintain a team of engineers working on a codebase with a lot of application logic implemented in stored procs/SQL versus ORM code (such as active record or Django). Just my opinion, have worked on teams with ORM usage as well as separate DBA teams and all logic in SQL. SQL by hand is more powerful and flexible, surely, but for most use cases I would prefer to keep code in one place (application) and only reach for the tool (raw SQL) when needed for optimizations.
The reasons are:
1) You often need a transactional, consistent view of the data across multiple datasets (tables). Relational databases using SQL provide that, key-value stores may or may not (most likely not).
2) Any time you're doing any kind of aggregation, key-value stores will require that you pull all data in locally and perform the operation there. It's questionable whether the local environment even has the resources to accomplish such a task, and the efficiency of such operations will be dismal compared to a SQL-based relational database server.
Key-value stores are, from an interface perspective, very much like the older ISAM databases that were popular before SQL and relational databases gained a foothold in the industry. They also reintroduced the exact same problems that SQL-based relational databases had solved so elegantly (round and round we go).
Another way to put it is this: leaving aside scaling for now, SQL-based relational databases can typically do everything that a key-value store can do, whereas key-value stores cannot typically do everything that a SQL-based relational database can do.
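To make point 2 concrete: with SQL the aggregation runs server-side and only the summary crosses the wire, whereas a key-value store forces something like the client-side loop below (toy data; the second query stands in for "pull all data in locally"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (category TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1.0), ("a", 2.0), ("b", 5.0)])

# SQL: one small result set comes back, however many rows the table holds.
db_side = conn.execute(
    "SELECT category, SUM(amount) FROM events GROUP BY category ORDER BY category"
).fetchall()

# Key-value style: every row crosses the wire, then we aggregate locally.
local = {}
for category, amount in conn.execute("SELECT category, amount FROM events"):
    local[category] = local.get(category, 0.0) + amount

print(db_side)                # [('a', 3.0), ('b', 5.0)]
print(sorted(local.items()))  # same totals, computed client-side
```

Same answer either way, but the second approach ships the whole table to the client first, which is exactly where the resource and efficiency concerns come from.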
It's simply more efficient? If I have a table with millions of rows, I'm not going to bring down the entire data set on every request just to sort by latest and grab the 10 most recently updated rows that haven't been soft deleted (WHERE active=1 ORDER BY updated_at DESC LIMIT 10).
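A runnable sketch of that pattern in SQLite, with a composite index so the filter/sort/limit can be served from the index rather than a full scan (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE items
                (id INTEGER PRIMARY KEY, active INTEGER, updated_at TEXT)""")
conn.execute(
    "CREATE INDEX idx_items_active_updated ON items (active, updated_at DESC)")
# 30 rows, alternating soft-deleted (active = 0) and live (active = 1).
conn.executemany("INSERT INTO items (active, updated_at) VALUES (?, ?)",
                 [(i % 2, f"2023-01-{i:02d}") for i in range(1, 31)])

rows = conn.execute("""
    SELECT id, updated_at FROM items
    WHERE active = 1 ORDER BY updated_at DESC LIMIT 10
""").fetchall()
print(len(rows))  # 10, no matter how large the table grows
```

Only ten rows ever leave the database, which is the whole point.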
Also, his examples are like the tip of the iceberg. There are all kinds of things like windowing functions and common table expressions that have so much power to get exactly as much data as you need right at the RDBMS without having to carry that baggage anywhere else.
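For example, a common table expression feeding a window function returns a running total directly from the database (SQLite 3.25+ here; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2023-01-01", 10.0), ("2023-01-01", 5.0), ("2023-01-02", 7.0)])

# CTE computes per-day totals; the window function accumulates them in order.
rows = conn.execute("""
    WITH daily AS (
        SELECT day, SUM(amount) AS total FROM sales GROUP BY day
    )
    SELECT day, total,
           SUM(total) OVER (ORDER BY day) AS running_total
    FROM daily ORDER BY day
""").fetchall()
print(rows)  # [('2023-01-01', 15.0, 15.0), ('2023-01-02', 7.0, 22.0)]
```

Doing the same in application code means re-implementing the grouping and the accumulation, and dragging every raw row along for the ride.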
Anything that deals with the formatting/display of the columns in the result set rows can be done in SQL, but doesn't have to be done in SQL. It's really a matter of convenience to do it in SQL, especially when dealing with multiple front-end languages, because it means that the formatting is done in one place.
Each method has pros and cons; it depends on the capabilities of your database, your requirements in terms of security, and the degree of cross-connection between data stores. For internal business tools I like to do stuff at the database layer, but it would be the wrong choice for a high-volume, simple, public API.
It would be quite good to see this blog post, or a response to it, evaluate the SQL code's performance on different datasets.