

MySQL: Find records in one table that are not in another - revised - edw519
http://www.electrictoolbox.com/mysql-find-records-in-one-table-not-in-another-revised/

======
mwexler
NOT EXISTS has been known to be faster than a join for years now. In what
possible way is this news? NOT EXISTS allows for short circuiting, like an OR:
if at least one record exists, processing can quit. The Join would need to
process all records.

This puts into context all the people complaining about SQL and relational
databases, and how great key-value stores are. Perhaps some of the sturm und
drang is over the fact that, while SQL people admittedly may have little
experience with Hadoop or other non-relational processing and storage, sounds
like some HN folks have actually had little experience with the SQL they are
disparaging.

Joe Celko's SQL for Smarties is full of great explanations of how to really
use SQL, and is worth picking up if you want to see what SQL can really do and
work around it's (many) warts. [http://www.amazon.com/Joe-Celkos-SQL-Smarties-
Programming/dp...](http://www.amazon.com/Joe-Celkos-SQL-Smarties-
Programming/dp/0123693799/ref=sr_1_1)

~~~
didroe
The query optimiser should be dealing with that. If you're doing an outer join
and checking for the key being NULL then it can easily see that the operation
can be short circuited.

It's all implementation details. I've worked with databases where a NOT EXISTS
would run a query for every source row where as an outer join would run
through once doing the join and was therefore faster.

------
pradocchia
Yeah, LEFT JOIN ... IS NULL is problematic. If you don't have a true 1:M
relationship between the two tables, you could get cartesian sub-products in
the result set. If do have 1:M relationship, but the database doesn't know
about it, it still has to _check_ for cardinality on both sides. Even if you
have declared foreign keys, the optimizer might not know how to use that
information.

The basic rule is, don't ask the database to do more than is necessary. Don't
include tables in the FROM clause that won't be used downstream (GROUP BY,
HAVING, SELECT, ORDER BY). If you only need it in the WHERE clause, try using
a correlated subquery (ie, EXISTS).

Another example for finding duplicates:

    
    
        select * from table1 as a where exists (
          select * from table1 as b
          group by key1
          having count(*) > 1
             and a.key1 = b.key1
          )
    

(Not sure if this syntax works on MySQL, but there's the idea.)

~~~
ars
That works fine in MySQL, but assuming you have an index on (key1,
PRIMARY_KEY) this is faster:

    
    
      select distinct a.*
      from table1 a join table1 b
       on (a.key1 = b.key1 and a.PRIMARY_KEY != b.PRIMARY_KEY)
    

Yours took 0.59 sec, and mine was 0.10 sec (on some random test data).

In general MySql doesn't do so well with subqueries - it likes them as joins.
But in this case I think any database would be faster with my query (anyone
care to test)?

And when I say joins instead of subqueries I mean:

    
    
      select *
      from table1 as a join
       (select key1 from table1 as b
        group by key1
        having count(*) > 1
       ) c
        on (a.key1 = c.key1)
    

Which took 0.00 sec (must say I was not expecting that).

~~~
pradocchia
I have to say, I don't understand your join in the first example. In my head,
key1 was the primary key.

In the second example, if key1 is unique and the database knows it, the
optimizer should be able to determine that _c_ returns no records, and
therefore the outer query returns no records as well. It bypasses execution
entirely, thus 0.00 sec.

If key1 is not unique, or it is unique but the database doesn't know about it,
that is impressive.

~~~
ars
If key1 is the primary key then how can you have duplicates of it? By
definition it's not a unique key, and c does return records.

~~~
pradocchia
of course. i wasn't certain from context that you were running it against an
non-primary key.

------
jaxn
Ok, I haven't benchmarked this, but shouldn't NOT IN be faster than NOT
EXISTS, especially if the fields are indexed.

SELECT * FROM content c WHERE c.content_id NOT IN ( SELECT c2t.content_id FROM
content_to_tags c2t )

~~~
NyxWulf
In general not exists is faster than NOT IN because the optimizer will usually
eliminate any result set from the sub table. Usually they can handle select *,
but I always do select 1 just to make sure. It's been a few years since I
tested that on SQL Server, but not exists and exists were substantially faster
than in/not in.

------
Tichy
OK, I am beginning to see how to get a technical blog going, if discovering a
standard SQL form allows for a lengthy blog article.

------
trezor
_In my orginal post I used a LEFT JOIN from content_to_tags to content. ...
Using NOT EXISTS was just over two times faster, and I would assume that as
the tables get populated with more records the difference exponential._

If using sub-selects is faster than JOINS on what I would assume is indexed
data, your query-optimizer is doing something horribly, horribly wrong.

And given how everyone complains about MySQL's support for sub-selects and the
performance of those, I can't even start to comprehend how bad and
inefficiently it implements JOINs if sub-selects are indeed faster.

Yeah, I think I'll just stick to SQL Server for now.

~~~
pradocchia
correlated subqueries are faster than joins on SQL Server too.

~~~
trezor
Is this also true when the query would involve a full table scan or clustered
index scan anyway?

Maybe I'm missing something crucial, but that doesn't seem to make much sense
for me from a data-access perspective.

