
How SQL Database Engines Work, by the Creator of SQLite (2008) [video] - zbaylin
https://www.youtube.com/watch?v=Z_cX3bzkExE
======
ryanworl
There is a more recent lecture on the same topic from 2015 at CMU:
[https://youtu.be/gpxnbly9bz4](https://youtu.be/gpxnbly9bz4)

~~~
_wmd
This video is a 100-foot overview; the original link is about SQL internals

~~~
netgusto
Detailed explanations start at 26m40s:

[https://youtu.be/gpxnbly9bz4?t=26m40s](https://youtu.be/gpxnbly9bz4?t=26m40s)

------
dmoreno
I recently created a database engine (exosql [1]): query only, no storage. It
uses Postgres-like foreign data wrappers to get all its data.

It's not suitable for big datasets, as it stores everything in memory (or
maybe it is?), but as a learning experience it has been amazing to think
through and develop a real database: planner, executor, choosing algorithms,
implementing features such as lateral joins, and so on.

I will definitely listen very carefully to these talks.

[1]
[https://gitHub.com/Serverboards/exosql](https://gitHub.com/Serverboards/exosql)

~~~
zerr
Any books/lectures/articles you followed?

~~~
dmoreno
I followed especially the Postgres and SQLite documentation. For some
specific areas I checked their source code. But mainly I used EXPLAIN from
Postgres as a reference for which algorithms (seq scan, hash join and so on)
it uses for specific queries.
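
SQLite offers the same kind of feedback through `EXPLAIN QUERY PLAN`, which reports whether the planner chose a full scan or an index search. A minimal sketch via Python's built-in `sqlite3` module (the table and index names here are invented for illustration, and the exact plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
con.executemany("INSERT INTO users (email) VALUES (?)",
                [(f"user{i}@example.com",) for i in range(1000)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index on email, SQLite has to scan the whole table.
plan_before = plan("SELECT * FROM users WHERE email = 'user42@example.com'")
print(plan_before)  # e.g. ['SCAN users']

con.execute("CREATE INDEX idx_users_email ON users(email)")

# With the index in place, the planner switches to a b-tree search.
plan_after = plan("SELECT * FROM users WHERE email = 'user42@example.com'")
print(plan_after)  # e.g. ['SEARCH users USING INDEX idx_users_email (email=?)']
```

Comparing the plan before and after adding an index is a quick way to confirm which algorithm the engine picked for a given query.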

------
pipu
I highly recommend CMU's Andy Pavlo's video lectures on the topic (and also
his more advanced material)

[https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds...](https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds8xUt1rcmyT7x)

~~~
PretzelFisch
I found Prof. Dr. Jens Dittrich's database playlists interesting and pleasant
to watch.
[https://www.youtube.com/channel/UCC9zrtAkl6yY4dpcnWrCHjA](https://www.youtube.com/channel/UCC9zrtAkl6yY4dpcnWrCHjA)

------
logicallee
Any video filter experts here?

 _Request to any video filter expert_

\------------------------------------

I started watching this. The slides are unreadable, but the camera is
perfectly still and the slides stay on screen across several "key frames",
where the compression algorithm decides to replace one set of compression
artifacts with another.

For example try to read the first keyword under "Translates into:":

[https://www.youtube.com/watch?v=Z_cX3bzkExE&t=2m14s](https://www.youtube.com/watch?v=Z_cX3bzkExE&t=2m14s)

The keyword is unreadable at the start but as you keep looking at it over 50
keyframes it becomes readable to me.

Since the camera is in a fixed position, it should be possible to combine the
data from those artifacts into a single superresolution video with very
modest assumptions (i.e. assuming the image is the same image until at least
a 5% change, or something like that). There's not even anyone moving in front
of it.

-> Can someone who actually knows this stuff apply a superresolution interlacing filter to this video and post the superresolution version somewhere?

I hope this is not too much work, and I am sure we would all appreciate the
results since the slides are not human-readable before applying some kind of
superresolution!

~~~
peterwwillis
I'm sorry you got downvoted for this comment. HN voting is the worst.

~~~
logicallee
Perhaps I was naive about the state of the art. After the now-dead reply I
received, I searched, and found a couple of papers like this

[https://arxiv.org/abs/1801.04590](https://arxiv.org/abs/1801.04590) \-
"Frame-Recurrent Video Super-Resolution"

but if you look at p. 8, I think many of the algorithms still wouldn't end up
with readable text. This paper is from this year, so it is an area of active
research.

I wrote a quick mail to the authors to see if they would put the video through
their setup (since the last paper update was just 3 months ago) and share
their results.

2\. Trying it myself...

After my downvotes I tried this small piece of software:

[http://www.infognition.com/VideoEnhancer/](http://www.infognition.com/VideoEnhancer/)

Which shows a before/after. Here is their page on their super-resolution
algorithm:

[http://www.infognition.com/articles/what_is_super_resolution...](http://www.infognition.com/articles/what_is_super_resolution.html)

I used their plugin on virtualdub on a sample of the video. The results
weren't useable. Here is a picture which shows the before and after:

[https://imgur.com/a/0rhy7q7](https://imgur.com/a/0rhy7q7)

(The diagonal lines are a watermark because I didn't pay to register video
enhancer.) Also note that though it might look like a sharpen mask was
applied, in fact it was not: this is just the superresolution that video
enhancer came up with.

Now granted, I don't think this particular site uses state-of-the-art
algorithms (its references on the page I linked are decades old), but it's
the first one I found.

The site also has a page explaining when it doesn't work:

[http://www.infognition.com/articles/when_super_resolution_do...](http://www.infognition.com/articles/when_super_resolution_doesnt_work.html)

It specifically calls out "If your video is compressed to a low bitrate, in
many cases this is very bad for super-resolution."

This certainly seems to be the case here. In my comparison picture above you
can see that it is an improvement; it is just not enough. I still can't read
most of the lines. I think this also doesn't use as many keyframes as it
could. (Which makes sense: it is rare that a static image is up for, in this
case, 25 full seconds!)

There are at least 14 full keyframes there so I think there is more detail to
be extracted, but it would, obviously, take longer analysis. I'll let you know
if I find anything better or get an answer from the paper authors.

~~~
acqq
It's certainly interesting that you're trying to figure out the "state of the
art", but as far as I can see the amount of information is too low to recover
more readability, and that was my estimate even before your experiments. The
way I saw it, the algorithm would have to model the "compression interference
patterns" to do that, and even if something like that existed, the amount of
information seems far too low.

If it's really about the given talk, and not about exploring the state of the
art in algorithms for recovering the (textual?) information lost to video
compression, you'd be better off communicating

1) with Dr Hipp who gave the talk -- he published the more recent original
slides on the sqlite site:

[https://www.sqlite.org/talks/](https://www.sqlite.org/talks/)

so he might reasonably be willing to publish these older slides (which he
probably considers outdated in some respects). Then, only if that fails:

2) with the author of the video, who possibly still has a higher-quality
version, if the quality was lost during compression or preparation for
YouTube, e.g. while trying to reduce bandwidth or re-encode from the native
recording format.

~~~
logicallee
Yeah, after I wrote to those paper authors I also wrote to the slide author,
who is the only one who emailed me back, so I got the slides.

Still, I think more information is recoverable. For example I could tell Video
Enhancer wasn't using that many frames (from the encoding progress and
framerates), maybe just a few adjacent frames.

Yet you could see a very clear improvement:

[https://imgur.com/a/0rhy7q7](https://imgur.com/a/0rhy7q7)

(As though a sharpen mask was applied, but it isn't.)

I did check whether I could get an improvement by forcing it to use more
frames, running it through a few times and doubling the video speed each time
so I'd end up with one frame from the whole segment, but it didn't come out
better. Anyway, now I have the original.

------
dicroce
So I was trying to figure out why a query was slow the other day... it was a
nasty query with like 14 joins... I used EXPLAIN and saw that it was a
mess... in my case I was able to switch to outer joins and nest related joins
and got it fast... but I had some interesting thoughts.

In SQL, indexes are implicit... they are used if available, but it's easy for
a large query to fall back to a scan when it shouldn't... what if there were
a different query language with explicit index syntax? I think you'd get a
lot more predictable performance.

~~~
DenisM
Two reasons to _not_ have indexes in the query:

1\. A query expresses the result it produces, not the method used to obtain
it. Semantics vs implementation. It may be a pain to write, but it will be
easier to read later.

2\. A DBA can add/drop indexes on the fly to tune the performance of a live
system without making any changes to the application code, while being 100%
certain he is not changing the semantics of what's going on.

As others noted, if you must, you can use query hints to force a particular
index to be used for a particular operation. MSSQL also lets you pin down a
query plan you like for a given query so that it doesn't drift away later due
to environment changes.

I agree it is sometimes a pain to force SQL to use the index you wanted it to
use.

~~~
AmericanChopper
I’ve never worked anywhere where I had to worry about DBAs running around
dropping indexes. The main reasons not to build an index are usually storage
and write overhead. Every index a table has means you have to do another write
operation on every insert, which can really start to add up. They can also add
significant overhead to any migration operation that happens to require an
index rebuild.

In my experience, the most common reason for an optimizer choosing not to use
an existing index, is out of date statistics. For those who aren’t aware, the
database collects table statistics for things like cardinality, number of
distinct values, etc... This is the information the optimizer uses when it’s
building a plan. If they get out of date the optimizer will start to come up
with nonsense plans. Even worse, if your stats get too out of date, you can
become scared to update them, because a new set of stats can potentially
change the plans built for every single query in ways that are hard to
predict.

As others have stated, you can put index hints directly into your queries, but
this should be avoided as they’re hard to maintain. Most ‘enterprise’ RDBMS
also have some form of plan management, but this should be avoided even more,
as managed plans permanently bypass the optimizer, which is even harder to
maintain.
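
SQLite has a lightweight version of the statistics machinery described above: `ANALYZE` gathers cardinality statistics into the `sqlite_stat1` table, which the planner consults when costing plans. A sketch via Python's `sqlite3` (table and column names invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
con.execute("CREATE INDEX idx_events_kind ON events(kind)")
# A skewed column: only two distinct values across 1000 rows.
con.executemany("INSERT INTO events (kind) VALUES (?)",
                [("click",)] * 999 + [("purchase",)])

# ANALYZE writes per-index statistics (total rows, average rows per
# distinct key value) into sqlite_stat1 for the planner to use.
con.execute("ANALYZE")
stats = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # e.g. [('events', 'idx_events_kind', '1000 500')]
```

If these numbers go stale, say the table grows a thousandfold after the last `ANALYZE`, the planner costs plans against the old figures, which is exactly the nonsense-plan failure mode described above.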

------
api
SQLite is incredible. Tiny and usually used for small stuff, but I have heard
of 1TB+ databases with acceptable performance.

------
stevoski
If you are a Java programmer and want to learn how an SQL database engine
works, take a look at the source code of H2.

Even better, try to add a basic feature to H2 (e.g. a new built-in function).
It is surprisingly easy, and you come away with a decent understanding of the
basics of building an SQL database engine.

------
A_Person
Gosh, I must say there seems to be some misunderstanding of RDBMS concepts in
some posts in this thread!

I was writing database systems professionally, back in the days before the
RDBMS concept was even a thing. So here's my (enormously long and convoluted)
2 cents worth. Please be sure to pack a sandwich and get hydrated before you
continue.

Say you were dealing with doctors and patients, and needed to store that
information in a database. Back in the day, you'd typically use a so-called
hierarchical database. To design one of those, you need to decide, what is the
most common access method expected to be: getting the patients for a given
doctor, or the doctors for a given patient? You'd design the schema
accordingly. Then, the preferred access method was easy to code, and efficient
to run. The "other" access method was still possible, but harder to code, and
slower to run. The database schema depended on how you thought the users would
access the data.

But that is absolutely _NOT_ what to do with an RDBMS. Certainly you look
at the users' screens, reports, and so on - but that's just to determine what
unique entities the system must handle - in this case, doctors and patients.
Then you _ignore_ how the users will access the data, and work out what are
the inherent logical relationships between all the entities.

Your initial answer might be this. A doctor can have many patients, and a
patient can have many doctors. As any competent relational designer will
instantly know, this means you need a resolving table whose primary key is a
composite key comprising the primary keys of the other two tables. So if Mary
was a patient of Tom, you'd add Mary to the patients table (if not already
there), Tom to the doctors table (ditto), then add a Mary/Tom record to the
resolving table. By that means, a doctor could have any number of patients, a
patient could have any number of doctors, and it's trivially easy to write
simple, performant SQL to access that data however you want.

But then you'd have a ghastly thought: patients can also be doctors, and
doctors can also be patients! Say Tom was also a patient of Mary! Now you need
a Tom record in the patients table, but that would inappropriately duplicate
all his data from the doctors table! Something's clearly wrong. You'd soon see
that from a data modelling viewpoint, you don't want doctors and patients as
separate entities - you want a single entity Person, and a resolving table to
relate arbitrary people in specified ways.

So what?!!

So this. IMHO, many developers using relational databases have absolutely no
idea about any of that. They design hopelessly unnormalised schemas, which
then need reams of ghastly SQL to get anything out. The query planner can
barely parse all that crap, let alone optimise it. The database has to stop
every five minutes to wet its brow and take a drink.

So here's my advice to any inexperienced relational database designers who
have actually managed to get this far!! If you can answer the following
questions off the top of your head, you're probably on the right track. If you
_can 't,_ you're lacking basic knowledge that you need in order to use an
RDBMS properly:

\- What is a primary key?
\- What is a foreign key?
\- What is an important difference between primary keys and foreign keys?
\- What is a composite key? When would you use one?
\- What are the three main relations?
\- What is normalisation?
\- What is denormalisation?
\- What is a normal form?

And so on.

Just my 2c! :-)

~~~
kimdotcom
So, were you writing DB software before Codd's research at IBM was available?

~~~
A_Person
Further to my other reply, I've just checked the intertubes, and found that
Codd's paper "A Relational Model of Data for Large Shared Data Banks" was
published in 1970, and Dijkstra's "Go To Statement Considered Harmful" in
1968. So I think my fading memories of all this are accurate (for once!).

------
okket
Sadly bad audio (room mic with all the ambient noise) and bad video quality
(slides are almost unreadable). But great content.

------
randop
Thank you. Very educational. Interesting to know that ORDER BY carries a
significant performance penalty without LIMIT.
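
One way to see that penalty in SQLite itself: without a usable index, the query plan shows an extra temp-b-tree sort step that must finish before the first row comes back; with an index, rows already come out in order, and a LIMIT lets the engine stop early. A sketch via Python's `sqlite3` (exact plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, score REAL)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# No index on score: every row passes through a temporary b-tree
# sorter before the first result can be returned.
unindexed = plan("SELECT * FROM t ORDER BY score")
print(unindexed)  # e.g. ['SCAN t', 'USE TEMP B-TREE FOR ORDER BY']

con.execute("CREATE INDEX idx_t_score ON t(score)")

# With the index, the sort step disappears, and the LIMIT means the
# engine can stop after reading just a few index entries.
indexed = plan("SELECT * FROM t ORDER BY score LIMIT 10")
print(indexed)  # e.g. ['SCAN t USING INDEX idx_t_score']
```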

------
serioushaha
Slides: [https://www.slideshare.net/VikasBansal23/how-sqlite-works](https://www.slideshare.net/VikasBansal23/how-sqlite-works)

------
angelfreak
Really great, thanks for posting.

------
bitmapbrother
This is a much better talk with better video and sound.

[https://www.youtube.com/watch?v=Jib2AmRb_rk](https://www.youtube.com/watch?v=Jib2AmRb_rk)

------
cup-of-tea
What is that acronym he keeps saying? MBCC?

~~~
ntonozzi
MVCC - Multi version concurrency control.
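
For intuition only, and not how any real engine implements it: the core MVCC idea is that writers create new versions instead of overwriting in place, and each reader sees the newest version no later than its snapshot. A toy sketch:

```python
class MVCCStore:
    """Toy multi-version store: readers get a consistent snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (txn_id, value), oldest first
        self._clock = 0      # monotonically increasing transaction id

    def begin(self):
        # A snapshot is simply the clock value at transaction start.
        return self._clock

    def write(self, key, value):
        # Writers never overwrite; they append a new stamped version.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))

    def read(self, key, snapshot):
        # Return the newest version written at or before the snapshot.
        result = None
        for txn_id, value in self._versions.get(key, []):
            if txn_id <= snapshot:
                result = value
            else:
                break
        return result

store = MVCCStore()
store.write("x", 1)
snap = store.begin()   # this reader's snapshot sees x == 1
store.write("x", 2)    # a later writer does not disturb the reader
old_val = store.read("x", snap)
new_val = store.read("x", store.begin())
print(old_val, new_val)  # 1 2
```

Real engines layer commit visibility, conflict detection, and garbage collection of old versions on top of this skeleton.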

------
anothergoogler
Love how he started. People who don't stop their conversations for a presenter
are the worst. People who don't stop their conversations for a presentation by
Richard Hipp deserve a spell of laryngitis.

------
blackrock
I'm not going to call anyone out here, but why do people keep using the word
orthogonal?

It doesn't even compute. It doesn't even make any sense, in how they use it in
relation to the topic.

Are the issues at right angles of one another? No.

Are the issues statistically independent of one another? Perhaps.

I suggest using a more appropriate word to describe the situation.

You folks should read the urban meaning of orthogonal, to understand how
people roll their eyes at you, when you inappropriately use the term.

[https://www.urbandictionary.com/define.php?term=orthogonal](https://www.urbandictionary.com/define.php?term=orthogonal)

Just another friendly PSA.

~~~
ternaryoperator
That's how languages evolve. Words that meant one very distinct thing come to
mean something only partially like the original. People decry the misuse. And
finally the new meaning becomes the one true meaning and the original sense is
marked in dictionaries as "archaic."

For a word that's gone through that exact cycle, have a stare at "artificial,"
which was the adjective for "artifice," which at one time meant craftsmanship.
When St. Paul's Cathedral was first shown to King Charles II, he praised it
for being "very artificial" \-- a compliment. [1]

In the meantime, I agree that it can be frustrating to see words apparently
misused. But I think this is hardly the mark of an "idiot," as you put it.

[1] [https://quoteinvestigator.com/2012/10/31/st-pauls-cathedral/](https://quoteinvestigator.com/2012/10/31/st-pauls-cathedral/)

------
jokoon
I don't like to use SQL engines because I don't understand how they work. I
never really know whether my query will be O(1), O(log(n)), O(n), etc., or
what kind of algorithm will optimize my query.

Who really understands how a SQL engine works? Don't you usually need to
understand how something works before you start using it? Which SQL analyst
or DB architect really knows about the internals of a SQL engine? Do they
know about basic data structures? Advanced data structures? Backtracking?

That's why I tend to systematically avoid using a SQL engine unless the data
schema is very, very simple, and instead manage and filter the data case by
case in code. SQL is good for archiving and storing data, and works as an
intermediary, but I don't think it should drive how a piece of software
works. Databases can be very complex, and unfortunately, since developers
like to make things complicated, it gets hairy.

I think SQL was designed when RAM was scarce and expensive, so to speed up
data access, it had to be properly indexed by a database engine. I really
wonder who, today, has data that cannot fit in RAM, apart from the big
players.

I tend to advocate for simple designs and avoid complexity as much as I can,
so I might be biased, but many languages already offer things like sets,
maps, multimaps, etc. Tailoring data structures might yield good results
too.

Databases still scare me.

~~~
jnwatson
For Postgres at least, you can literally ask it how a query works, via
EXPLAIN. Now, there’s a skill to understanding the output of that, but at
least it isn’t a black box.

~~~
ebikelaw
And the query plan can literally change out from under you at any time. SQL
sucks. You should be able to dictate the query plan to the engine directly. If
SQL exists as a tool to create and serialize such plans via exploration and
experimentation, that’s fine. As a runtime query system it is completely
unsuitable.

~~~
purerandomness
SQL is a declarative language: You describe _what you want_ , not how to get
it.

If that's not what you need, there are plenty of procedural languages with
which you can describe _how to get things_.

A query plan will change based on statistics. It's the engine's job to decide
whether your query is best served by parallelizing it across multiple cores,
or whether it's worth JITing it before execution.

The actual execution of the query is an implementation detail of the engine,
and should be. That's the entire point of a SQL standard: To provide users
with an interface to talk to engines, which then retrieve the data you asked
for.

That is its core strength, and the reason why it's so successful.

~~~
ebikelaw
"You describe what you want, not how to get it."

The entire archive of the pgsql-users mailing list disagrees. Everyone wants
to know why their plan is suboptimal, or why it changed. People also want to
know why the planner takes 100ms to generate the plan and only 1ms to execute
the query, and so forth.

The idea that you just say what you want and you get the optimal result from
your database is just ridiculous to anyone who has had to use them under any
significant load.

~~~
zbentley
> The entire archive of the pgsql-users mailing list disagrees. Everyone
> wants to know why their plan is suboptimal, or why it changed.

And rather a lot of the archives of $scripting_language_of_your_choice are
people confused about duck-typing/type system failures. That doesn't mean
scripting languages should be replaced with statically typed ones; just that
there are pain points in every system, and right (or wrong) tools for every
job.

Don't believe me? Check how much of the FAQ traffic from first-time Rustaceans
(or Swift/Java/etc. newcomers) has to do with how to satisfy their language's
type system.

You pick your poison. SQL gives you a clearly defined set of tradeoffs up
front. If that's not for you, no worries, move along.

~~~
zzzcpan
The biggest tradeoff SQL gives you is exactly the one GP points out: it's a
leaky abstraction where understanding performance is tied so closely to the
implementation that you pretty much lose all the benefits of SQL, except
familiarity with the technology.

