
How I slashed a SQL query's runtime with two Unix commands - DSpinellis
https://www.spinellis.gr/blog/20180805/
======
manigandham
The DISTINCT on the date format is the problem. You should probably use a
separate query on each table to fill in the formatted date as a new column
first, and then join afterwards.

"Wrong tool for the job" also comes to mind. MariaDB/MySQL are just not great
at joins and query planning, and at this scale, especially with all the work
involved to export and use Unix tools, why not use one of the columnar data
warehouses that can handle this much faster?

Export it to BigQuery via CSVs and the query will probably finish in 30
seconds.

~~~
olavgg
I highly recommend ClickHouse for this. It is blazing fast and can process
over 100GB/s on a single modern machine if you have enough RAM. And it is very
easy to install and configure.

~~~
manigandham
> And it is very easy to install and configure

On a single node, yes; unfortunately not beyond that. I really wish there were
a bigger focus on operational features, and a bigger community, for a rather
fantastic product.

~~~
RhodesianHunter
This was my exact reaction to the OP's comment as well.

ClickHouse is an amazing piece of tech, but getting the configuration just
right across many nodes is quite complex, and most of the experts writing
about how to do so write in Russian.

The documentation seems to be improving rapidly though.

------
WorkLifeBalance
I don't know MySQL/MariaDB that well, but I know that in MS SQL Server, a
DISTINCT on a Function([Column]) will not use an index on [Column].

In this case it would be much faster to have an intermediate table or column
holding that computed value, which could then be properly indexed.
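
In MariaDB/MySQL, a generated column is one way to get that indexed
intermediate column. A hedged sketch, assuming the post's schema; the
`commit_week` column and index name are made up, and older MariaDB versions
spell STORED as PERSISTENT:

    -- Sketch: persist the formatted week once, then index it.
    alter table commits
      add column commit_week char(7)
        generated always as (date_format(created_at, '%x%v1')) stored,
      add index commits_week_idx (commit_week);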

~~~
bhrgunatha
It even has a name - SARGable, i.e. able to use an index on the [S]earch
[ARG]uments - but it's still easy to get tripped up.

[https://www.definitions.net/definition/sargable](https://www.definitions.net/definition/sargable)

[https://stackoverflow.com/questions/799584/what-makes-a-sql-statement-sargable](https://stackoverflow.com/questions/799584/what-makes-a-sql-statement-sargable)
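
A minimal illustration with the post's `commits` table (the dates are made
up): both predicates select the same week, but only the second is sargable,
because it leaves the indexed column bare.

    -- Not sargable: the function hides created_at from any index on it.
    select count(*) from commits
    where date_format(created_at, '%x%v1') = '2018311';

    -- Sargable: a bare range over the indexed column.
    select count(*) from commits
    where created_at >= '2018-07-30' and created_at < '2018-08-06';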

------
lunchladydoris
I'd like to see the runtime for the same query but (1) without DISTINCT, (2)
without the date formatting, and (3) with an inner join.
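
For reference, a sketch of that variant with all three changes applied at
once (untested, using the post's table names):

    select
      project_commits.project_id,
      commits.created_at
    from commits
    inner join project_commits
      on project_commits.commit_id = commits.id;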

~~~
ckastner
My thoughts exactly.

The post references a Stack Exchange question [1] with a simple analysis and
four helpful suggestions (similar to yours), about which the author states:

_"I tried the first suggestion, but the results weren't promising. As
experimenting with each suggestion could easily take at least half a day, I
proceeded with a way I knew would work efficiently and reliably."_

~~~
schrodinger
1: [https://dba.stackexchange.com/questions/213983/can-i-speed-up-a-large-mysql-mariadb-join-through-partitioning-or-merging](https://dba.stackexchange.com/questions/213983/can-i-speed-up-a-large-mysql-mariadb-join-through-partitioning-or-merging)

------
languagehacker
Am I missing something, or is this blog post just someone discovering that ETL
exists?

~~~
pjmlp
Lots of "modern" blog posts seem like people rediscovering that X exists.

It's a consequence of many people skipping traditional learning paths, and of
developers now also being forced into casting calls and portfolios.

~~~
Spooky23
It's also a negative side effect of the improvement of software and automation
in this space.

In ye olden times, there would have been a grumpy DBA in this story
terminating the query and yelling at the user. Many, many people consume
databases without understanding how to approach SQL optimization, because
nobody forces them to!

~~~
sidlls
There's a happy middle ground between what we have now and those DBAs playing
hardass without even trying to understand the need.

~~~
mmt
> DBAs playing hardass without even trying to understand the need.

Although the parent posed a caricature with "yelling", this is an even more
extreme one.

The "forcing" function merely requires a bit of "hardass", not the willful
ignorance you associate with it. At the risk of a no-true-Scotsman argument, I
believe that competent DBAs capable of performing the forcing function would
already understand the need without trying.

As such, I don't think there needs to be any kind of middle ground between two
extremes of ignorance but rather, coming back to the GP's point, a reduction
of ignorance in "what we have now".

Of course, I don't believe hardass/yelling is the best way to achieve that
either, instead favoring (original) DevOps culture; but that also doesn't work
if Ops (including DBA) skills aren't recognized as valuable any more (and
"DevOps" just means "replace Ops with Devs").

------
frou_dh
If you initiate the shell commands from inside the database, does it count as
less hacky? :)
[https://github.com/petere/plsh](https://github.com/petere/plsh)
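
For the curious, a hedged sketch of the plsh pattern: the function body is a
plain shell script, arguments arrive as $1..$n, and stdout becomes the return
value. The function name is made up, and GNU date's %G%V only roughly mirrors
the query's '%x%v1' week format:

    create extension plsh;

    create function commit_week(d text) returns text as $$
    #!/bin/sh
    date -d "$1" +%G%V1
    $$ language plsh;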

~~~
snorkel
Code beauty is subjective. If the next developer who inherits your code yells
“WTF!” and then incredulously tells all the other devs what you did, then it's
hacky. Yes, invoking shell commands in a DB would likely cause that.

------
linsomniac
I've run into situations like this where I was able to use a window function
or a materialized view to get huge speed gains. It basically meant
restructuring the tables into a new view, much like the shell script did in
this case, and that view could then be optimized much more effectively by the
query planner.

One was with an internal database we used for tracking employee time and
generating invoices and reports. I forget the exact details, but this
technique took it down from an hour to a second, or thereabouts.

Another, more recent, case was in an interview, using the MySQL employee
sample database. I had a naive query that did most of what was needed, but
after quite a bit more work I was able to build a materialized view that took
the query from using 64+GB of RAM and dying from OOM after 4 hours to running
in 45 minutes in 2GB of RAM.

It usually takes me hours and hours to put together, though, because I do it
so infrequently.
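
MySQL/MariaDB have no native materialized views, so the usual stand-in is a
summary table you build and refresh yourself. A minimal sketch against the
post's schema (the table and index names are made up):

    -- Hand-rolled "materialized view" of weekly commit counts.
    create table weekly_commits_mv as
    select pc.project_id,
           date_format(c.created_at, '%x%v1') as week_commit,
           count(*) as n_commits
    from commits c
    join project_commits pc on pc.commit_id = c.id
    group by pc.project_id, week_commit;

    create index weekly_mv_idx on weekly_commits_mv (project_id, week_commit);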

------
ww520
Isn't the Unix join essentially doing an inner join? Why is the original SQL
doing a left join while the text processing with Unix join is doing an inner
join?

Yes, the left join will scan the whole table: a left join must return every
row from the left-hand table, so that side can't be driven by an index.

An inner join with an index is way faster than a left join.
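
In SQL terms, the two shapes being compared look like this (the post's
tables; Unix join's default output corresponds to the second):

    -- Left join: every commits row is returned, matched or not.
    select c.id, pc.project_id
    from commits c
    left join project_commits pc on pc.commit_id = c.id;

    -- Inner join: matched pairs only. This is what Unix join emits by
    -- default; join -a 1 would be needed to mimic the left join.
    select c.id, pc.project_id
    from commits c
    join project_commits pc on pc.commit_id = c.id;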

------
ben509
If you find yourself exporting data and running it through `uniq` or the like,
`create temporary table as` is probably what you want. If you think about it,
it's essentially what you're doing, without the additional filesystem
involvement.

The optimizer _should_ handle this for you, though for very large datasets,
constructing a new table from a query also avoids the locking issues you might
otherwise run into.
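
A sketch of that pattern against the post's tables (the staging table name
and column choices are assumed):

    -- Stage the deduplicated pairs once, then work from the staging table.
    create temporary table pc_weeks as
    select distinct
           pc.project_id,
           date_format(c.created_at, '%x%v1') as week_commit
    from commits c
    join project_commits pc on pc.commit_id = c.id;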

------
nextweek2
I'd also like to see a breakdown of the time taken to complete the project.

RAM is cheap, and jumping from 16GB to 64GB (or even 128GB) might have cost as
much as the analysis time did.

The only clue that a memory upgrade might have been a quicker fix is that the
merged file was 133GB. It seems to me that an upgrade to 128GB of RAM might
have led to a dramatically shorter query execution time.

~~~
hermitdev
If you need to add hardware to solve a software optimization problem such as
this, you've failed at your job as a software engineer (if you should even be
called an engineer, nay, a developer).

I'll probably get downvoted for this, but you should examine opportunities to
optimize your SQL and other storage patterns before you assume anything else
when your only indicator is a slow DB.

~~~
mmt
> If you need to add hardware to solve a software optimization problem such as
> this, you've failed your job as a software engineer

That's an interestingly narrow viewpoint with which I, narrowly, agree.

I suspect that the narrow viewpoint is just a form of the aphorism "if all you
have is a hammer, everything looks like a nail."

> (if you should even be called an engineer, nay, a developer)

However, in the broader context of engineering as problem solving (or
developing a _product_), I disagree.

Not all (computer) problems are best solved with software.

As a sysadmin (who can code but doesn't love to), I routinely struggle with
managers who don't have an intuitive sense of even a first approximation of
what modern hardware is capable of, because their entire background is
programming.

The fact that (latency) "numbers every programmer should know" exists (and has
for quite some time) is fairly telling. What's more telling is that there are
no dollar values attached to any of them.

------
pytyper2
I think the date_format call forces a table scan; the query could be rewritten
to be much faster. Optimizing the SQL statement seems less error-prone than
extracting the data to text files.

~~~
barrkel
There's no predicate or limit or inner join that could reduce the set of rows,
so the table might as well be scanned.

~~~
pytyper2
Or precompute the date on insert, there are a dozen ways to refactor the data
model and the query, all of which are better than extracting to txt files, a
format with no schema and not based on a standard. That would be the direction
I would give if a junior dev was attempting this method.
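
One hedged sketch of the precompute-on-insert idea, using a plain column
(names assumed; a generated column, as sketched earlier in the thread,
automates the same thing):

    -- Backfill once, index, then have inserts populate created_week.
    alter table commits add column created_week char(7);
    update commits set created_week = date_format(created_at, '%x%v1');
    create index commits_created_week_idx on commits (created_week);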

------
yread
There is some more information in this Reddit thread:

[https://www.reddit.com/r/programming/comments/94r6w6/how_i_s...](https://www.reddit.com/r/programming/comments/94r6w6/how_i_slashed_a_sql_query_runtime_from_380_hours/)

------
jerry40
As far as I understand, 'project_commits' and 'commits' have a one-to-one
relationship (project_commits.commit_id = commits.id). From my point of view
that is a strange design, since they could just add a 'project_id' column to
the 'commits' table; the 'project_commits' table seems redundant here.

I'd write this query:

    select pc.project_id, date_format(c.created_at, '%x%v1'), count(*)
    from commits c
    join project_commits pc
      on c.id = pc.commit_id
    group by pc.project_id, date_format(c.created_at, '%x%v1')

And I'd use a left join only at the stage where the 'projects' dictionary
needs to be joined to the result of this query.

------
protomyth
Am I missing something fundamental, like a WHERE clause? I see a join, but no
limiting of rows beyond that.

~~~
Sean1708
I think they just wanted all of the data.

~~~
protomyth
Is there an actual schema and does MariaDB do index scans?

------
Animats
MariaDB can do sort-merge joins. It's rare that you want to force one, but you
can; that's what IGNORE INDEX is for.

If you're testing, take a sample of the database and test on that.

First try the simple form:

    explain
    select
      project_commits.project_id,
      date_format(commits.created_at, '%x%v1') as week_commit
    from commits, project_commits
    where project_commits.commit_id = commits.id;

That ought to call for a sort and merge.

------
oliverped
The query is simply very badly written. Who left-joins a 5-billion-row table
to a smaller table while at the same time expecting a DISTINCT over the "left"
table? And if the DISTINCT is what solves the problem, without the join being
expected to match, then why have a join in the first place?

------
dekhn
I once had a huge performance problem with a simple self-join that I expected
to run quickly.

I talked to our resident SQL expert (retired from IBM, then working as the
lab administrative assistant). She suggested sorting the data before inserting
it.

That sped up the query.

It still bothers me that data order matters so much. But I can't reproduce the
problem in MySQL any more anyway.

~~~
miahi
I guess you were missing an index, as data order should only matter when doing
full scans.

~~~
dekhn
This was a full self-join (SELECT * FROM blah a JOIN blah b ON a.id = b.id). I
ran strace on it: for each row of a, it was seeking to the matching b row and
reading a whole page. By pre-sorting, I increased locality.

~~~
adrianmsmith
For a long time MySQL only supported “nested loop” joins; 5.6 added join
optimizations such as batched key access, and true hash joins only arrived in
8.0.

Depending on when you originally ran the query, maybe you were using an older
version, which would explain why you now can’t reproduce it?

~~~
dekhn
It's very possible. I was running the query ~14 years ago, and it wouldn't
surprise me if any number of improved condition-pushdown mechanisms have been
implemented since.

------
mamcx
After reading the post, and considering the time taken to export and import
the data, I think it would be better to denormalize the data (i.e. precompute
the result).

I did something like this in the recent past, but with much heavier
calculations. Denormalizing with triggers turned queries that took minutes
into ones that take less than a second.
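
A hedged sketch of the trigger approach against the post's schema (the
summary table and trigger name are made up):

    -- Keep a running weekly count as project_commits rows arrive.
    create table weekly_commits (
      project_id int not null,
      week       char(7) not null,
      n_commits  int not null default 0,
      primary key (project_id, week)
    );

    create trigger project_commits_ai after insert on project_commits
    for each row
      insert into weekly_commits (project_id, week, n_commits)
      select new.project_id, date_format(c.created_at, '%x%v1'), 1
      from commits c where c.id = new.commit_id
      on duplicate key update n_commits = n_commits + 1;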

------
slifin
How long does the DISTINCT take to run on its own? I'd be interested to know
what happens if the DISTINCT is done in a subquery whose result is then joined
with the commits table.

Ensure project_commits.project_id is covered by an index before running the
test.
~~~
slifin
Sorry, disregard - I misread the query.

------
thecopy
The question I'm interested in is why MySQL did not use the index for the
lookup.

~~~
jarym
Not sure if MySQL is the same, but Postgres won't use an index if the
statistics suggest doing so would be slower than a full table scan - I'd
expect MySQL to be similar.

With 5bn+ rows and a memory constraint, the type of index begins to make a
difference - e.g. in Postgres I would have tried using a bloom index.
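
For reference, a sketch of that in Postgres (bloom indexes are lossy, suit
equality probes, and by default only handle int4/text columns - the key types
here are assumed):

    create extension if not exists bloom;

    -- Compact, lossy index; candidate rows are rechecked on fetch.
    create index pc_bloom on project_commits
      using bloom (commit_id, project_id);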

~~~
protomyth
It can be very frustrating that, even with updated statistics, the optimizer
decides to take a different path through a query. When I was using Sybase,
this happened quite a lot once tables got rather large (corporation large, not
Google large). The usual response is to force the indexes on the query. Due to
log-space issues, I do remember having an awk script between queries back in
the Ingres days.
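
In MySQL/MariaDB, that forcing looks like this (the index name is made up):

    -- Override the optimizer's access-path choice for one table.
    select c.id, pc.project_id
    from project_commits pc force index (commit_id_idx)
    join commits c on c.id = pc.commit_id;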

------
edgarvm
Naive question: are your tables partitioned?

------
CyanLite2
TL;DR: the author doesn’t know how to use database indexes properly.

~~~
sampleinajar
Possibly. The author at least knew enough to observe "Both join fields are
indexed. However, MariaDB implements the join with a full scan of
project_commits and an index lookup on commits."

~~~
mjevans
That may actually have been a result of the ordering and poor query design.

Given that commits.id is a primary key, it must (IIRC) also be NOT NULL; thus
the LEFT JOIN is uselessly equivalent to an inner join, which most database
engines seem to prefer expressed as a WHERE clause (since those operations are
more trivially re-ordered).

Had it been written as an inner-join-equivalent WHERE clause, or with the
smaller table specified first, the full-table scan should at least have been
over the smaller table.

Your database storage engine of choice might differ or have other options
(e.g. in PostgreSQL an index on the result of comparing the two source keys
COULD be created, and a carefully structured query written to use /that/... I
think; I haven't tested it, but it'd at least be worth the experiment).

MySQL (MariaDB, presumably?) can't:
[https://stackoverflow.com/questions/8509026/is-cross-table-indexing-possible](https://stackoverflow.com/questions/8509026/is-cross-table-indexing-possible)

