
A practical relational query compiler in 500 lines of code - ot
http://scattered-thoughts.net/blog/2016/10/11/a-practical-relational-query-compiler-in-500-lines/
======
gravypod
I don't like the idea of executing "real" code in a database. A DB's query
parser shouldn't allow a query to make network connections or do arbitrary I/O.

~~~
makmanalp
Well, in an actual system, this would be sandboxed and stuff.

If this all sounds dubious, consider:

In modern databases, most everything already lives in RAM, and CPUs are
incredibly fast, meaning most of the time the CPU is sitting there doing
nothing; the bottleneck is moving data up and down the memory hierarchy into
the CPU caches and registers. There are orders-of-magnitude differences in
the speed of accessing data in registers vs L1/L2 vs main memory.

Now, usually, for sequential access to memory, the hardware is smart and will
pre-fetch stuff into the cache, giving near-optimal performance. The more
random memory access your code does, the more cache misses, and the worse the
performance. Now, imagine making a function call per tuple! That means
swapping the stack, code, and variables in and out. And often it's a virtual
call (e.g. MySpecialProcessor->next_tuple()), which is even more indirection
and more cache misses. On top of that, there are branches: abstracted code
has ifs and special cases for situations that we may already know can never
happen in the current query, so why keep them? And branches are also bad for
other reasons to do with pipelining in modern CPUs.

Additionally, the code you're using is competing for space in the upper levels
of the memory hierarchy with the data you're trying to process. Meaning if
your code is large, then parts of it need to be paged in and out while you
loop through your data, which again, is sloooooooow.

If this sounds like hyper-optimization, consider that this is the inner-most,
hottest loop in the entire database software, and gets called millions and
billions of times per second. Trust me when I say this _does_ make orders of
magnitude differences overall. For a toy example, check out:
https://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

Also, we see this method being used to some degree in Apache Spark:
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

~~~
gravypod
I'm not saying performance isn't important in this case, I'm saying it's an
inherently unsafe implementation. I can't think of a single way to secure
the system if the password falls into the wrong hands.

I'd much rather have this done in a query language and have the compiled
queries cached. A compiled query should be very small (maybe only a few
instructions) and, done right, could be very fast without any fancy meta
evaluation. You'd also get "some" way to sandbox the system inherently.

The code should only be able to access your data, and if you design a
language around that idea you have something infinitely safer than something
that isn't.

------
threepipeproblm
Thank you very much for this.

