
DuckDB: SQLite for Analytics - tosh
https://github.com/cwida/duckdb
======
scoresmoke
When I was developing my pet project for Web analytics
([https://github.com/dustalov/ballcone](https://github.com/dustalov/ballcone)),
I aimed to use an embedded columnar database for fast analytic queries with
zero maintenance; quite literally, I wanted an ‘OLAP SQLite’. There are
essentially two options, DuckDB and MonetDBLite, both developed by the same
research group. I tried both of them.

DuckDB is a relatively new project, and I generally enjoyed its design and code
quality. However, I found that its current storage implementation works
better for dense data, which is not my case:
[https://github.com/cwida/duckdb/issues/632](https://github.com/cwida/duckdb/issues/632).
I think it would be pretty cool to have data compression and the more careful
NULL handling needed for storing HTTP access logs.

MonetDBLite seems more mature in terms of functionality, but it lags
significantly behind the non-embedded client-server version, MonetDB:
[https://github.com/monetdb/monetdb](https://github.com/monetdb/monetdb). I
experienced unexpected segfaults when using the DATE type on any platform and
window functions on aarch64. Nevertheless, I am still using MonetDBLite
primarily due to the more efficient disk usage. I will be happy to switch to a
more lightweight and actively maintained solution.

~~~
mytherin
Hannes and I developed both MonetDBLite and DuckDB, precisely for the need
that you described :) We noticed that there was no easy RDBMS aimed at single-
machine analytical workloads, whereas these kinds of workloads are very common
(e.g. R/Python data science workloads).

MonetDBLite was our initial approach, and is essentially an embedded version
of MonetDB. We wrote a paper about it
([https://arxiv.org/pdf/1805.08520.pdf](https://arxiv.org/pdf/1805.08520.pdf)).
While it works, the system was not built with embeddability in mind, and we
had to rewrite a lot of code to get it to work. Because of that we ended up
with a fork, as the rewrite was too big to be merged back upstream. This
caused a lot of problems with the fork becoming outdated, and a lot of
headaches with constantly merging changes.

MonetDBLite had a number of issues stemming from the fact that the original
system was made as a stand-alone system. For example, the database system,
once started in-process, could not be shut down, as the regular database
system would rely on the process shutting down to clean up certain parts of
the system.

In total, the features we wanted that would not be possible to implement in
MonetDB without huge rewrites are as follows:

* Multiple active databases in the same process (reading different database files)

* Multiple processes reading the same database file

* In-database shutdown/restart

* Single-file database format

* Dependency free system

* Single compilation file (similar to the SQLite amalgamation [https://www.sqlite.org/amalgamation.html](https://www.sqlite.org/amalgamation.html))

* Control over resource/memory usage of the database system

* Vectorized execution engine

* Compressed storage and compressed execution

Because of that (and increasing frustration with constantly merging changes)
we opted to develop a new system instead of sticking with MonetDB, as
rewriting the entire system to get those features would likely be more work
than just starting from scratch (and not politically feasible as well ;)).

The result of this is DuckDB. While it is still early in the process, it is
relatively stable and we hope to ship a v1.0 sometime this year, along with an
updated website :)

~~~
scoresmoke
You are doing a fantastic job and I wish you the best of luck!

I used only the Python API of both DBs, and what confused me is the mandatory
requirement of NumPy and Pandas. I think ndarray/DataFrame retrieval and
conversion should surely be optional. Some applications do not require all
these features and can go ahead with the built-in types (mine just uses
_fetchall()_ ).

~~~
mytherin
Good point, we will make it optional :) Thanks for the feedback!

------
banana_giraffe
As someone who uses SQLite a bit for analytics, I feel like the elevator
pitch on the GitHub page and the website
([https://www.duckdb.org/](https://www.duckdb.org/)) is missing something.

I'm sure there's some great reason, but I don't see it: why would I want to
use this over SQLite? Is it considerably faster? Does it handle large scale
better? I suppose I could try throwing some use cases at it, but I'd worry
I'm missing the best use case for this tool.

~~~
kyllo
Here's a YT video of one of the implementers giving a talk to CMU students
about what DuckDB is, why they made it, and how it works:
[https://www.youtube.com/watch?v=PFUZlNQIndo](https://www.youtube.com/watch?v=PFUZlNQIndo)

In a nutshell, it's a file-per-database embedded DB like SQLite, and it's
installable as a Python or R package from PyPI or CRAN, but unlike SQLite,
it's a column store with vectorized operations so it's more performant for
analytical queries.

~~~
alecco
I usually just extend SQLite with virtual columns and try to do the
filtering/aggregate before sending tuples to the engine. This does the trick
most of the time and it's relatively straightforward to implement.
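One reading of this trick, sketched with SQLite's generated ("virtual") columns; the `logs` table is hypothetical, and the feature assumes SQLite >= 3.31:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# kbytes is a VIRTUAL generated column (SQLite >= 3.31), computed on read;
# predicates on it are evaluated inside the engine before rows come back.
con.execute("""
    CREATE TABLE logs (
        bytes  INTEGER,
        kbytes INTEGER AS (bytes / 1024) VIRTUAL
    )
""")
con.executemany("INSERT INTO logs (bytes) VALUES (?)",
                [(2048,), (512,), (4096,)])
big = con.execute("SELECT count(*) FROM logs WHERE kbytes >= 2").fetchone()[0]
print(big)  # 2
```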

~~~
mywacaday
I need to move an Excel sheet to something like SQLite and hadn't heard of
virtual columns; this fixes most of the issues I have, thanks!

------
stereosteve
The CMU Database Group channel recently had the DuckDB lead on:

[https://youtu.be/PFUZlNQIndo](https://youtu.be/PFUZlNQIndo)

~~~
anarchyrucks
Thanks! Loved the presentation.

------
lumost
I was looking for this exact type of project to back the datasets for some of
my ML projects. These datasets:

* Are less than 1 TB at maximum, and are typically less than 50 GB

* Are written once and read many times

* Require heavy and variable post-processing prior to training

* Usually run on an ephemeral large instance which reads from S3, does its
thing, and then shuts down.

This last transformation step is incredibly slow in Python, and on a standard
SQLite/Postgres DB it is limited to a single thread. Having access to an
embeddable columnar SQL engine could greatly simplify and accelerate this
final transformation stage before training.

~~~
nojito
Instead of using python for data transformation, just move over to data.table.

[https://h2oai.github.io/db-benchmark/](https://h2oai.github.io/db-benchmark/)

Cut your time AND your costs.

------
econcon
I've been using ClickHouse with our 100 GB of click-log data per day. It just
works. Before that we were using BigQuery, and BigQuery was cheaper than our
home-brewed solution :(

So now we use the two together: only the recent data that must be available
for fast querying in the user dashboard is served from the ClickHouse cluster.

And older data, where the customer is willing to tolerate slower responses of
up to 2 seconds, is served by BigQuery.

------
mmsimanga
I know this isn't a support forum, but as someone who works in analytics I
tend to use the same SQL editor to connect to different databases. At the
moment it is DBeaver [0]. Maybe I am missing it, but I cannot see how to
connect to DuckDB using a regular SQL editor.

[0][https://dbeaver.io](https://dbeaver.io)

~~~
sceadu
Try connecting to it as if it's SQLite? I'm not sure if they match any
existing wire protocol, but that would be my guess if they do.

~~~
justinclift
SQLite is an in-process database, so it doesn't really have a wire format.

------
snissn
Can this be made into a backend for SQLite?

Edit: an example, like this:
[https://github.com/LMDB/sqlightning](https://github.com/LMDB/sqlightning)

~~~
alexdoesh
SQLite v4 was supposed to be built on top of LMDB, as I remember. For some
reason the project was closed; they even have a note about it somewhere on
the official site.

~~~
hyc_symas
SQLite v4 was supposed to have modular, pluggable backends, of which LMDB was
only the first (of hopefully many). They were also working on their own
from-scratch implementation of an LSM backend, which was also abandoned.

As I recall, they decided that the mmap-related performance tricks they
learned from working with LMDB could just be plugged into their own SQLite v3
B-tree engine. (Indeed, recent SQLite v3 is a bit faster than when the SQLite
v4 project began. But it still doesn't match LMDB.)

------
hathym
Tried the online version [1]; this simple query does not seem to work for me:
SELECT * FROM lineitem WHERE l_discount < 1

Any reason why?

[1]
[https://duckdbdemo.project.cwi.nl/select.html](https://duckdbdemo.project.cwi.nl/select.html)

------
lowkey
I just installed this and, looking at the dependencies, I see it includes
both NumPy and pandas.

Could anyone advise how these relate to DuckDB? I.e., is a DuckDB database
accessible as a Pandas dataframe?

~~~
dloss
Yes, the Python bindings include functions to deal with Pandas dataframes.
Look at the examples:
[https://github.com/cwida/duckdb/blob/master/examples/python/duckdb-python.py](https://github.com/cwida/duckdb/blob/master/examples/python/duckdb-python.py)

