DuckDB: SQLite for Analytics (github.com)
274 points by tosh 1 day ago | 65 comments





When I was developing my pet project for web analytics (https://github.com/dustalov/ballcone), I aimed to use an embedded columnar database for fast analytical queries with zero maintenance, so literally, I wanted an ‘OLAP SQLite’. There are essentially two options, DuckDB and MonetDBLite, both developed by the same research group. I tried both of them.

DuckDB is a relatively new project, and I generally enjoyed its design and code quality. However, I found that its current storage implementation works better for dense data, which is not my case: https://github.com/cwida/duckdb/issues/632. I think it would be pretty cool to have data compression and the more careful NULL handling needed for storing HTTP access logs.

MonetDBLite seems more mature in terms of functionality, but it lags significantly behind the non-embedded client-server version, MonetDB: https://github.com/monetdb/monetdb. I experienced unexpected segfaults when using the DATE type on any platform, and with window functions on aarch64. Nevertheless, I am still using MonetDBLite, primarily because of its more efficient disk usage. I will be happy to switch to a more lightweight and actively maintained solution.


Hannes and I developed both MonetDBLite and DuckDB, precisely for the need that you described :) We noticed that there was no easy RDBMS aimed at single-machine analytical workloads, even though these kinds of workloads are very common (e.g. R/Python data science workloads).

MonetDBLite was our initial approach, and is essentially an embedded version of MonetDB. We wrote a paper about it (https://arxiv.org/pdf/1805.08520.pdf). While it works, the system was not built with embeddability in mind, and we had to rewrite a lot of code to get it to work. Because of that we ended up with a fork, as the rewrite was too big to be merged back upstream. This caused a lot of problems with the fork becoming outdated, and a lot of headaches with constantly merging changes.

MonetDBLite had a number of issues stemming from the fact that the original system was made as a stand-alone system. For example, once started in-process, the database could not be shut down, as the stand-alone system relied on the process shutting down to clean up certain parts of the system.

In total, the features we wanted that would not have been possible to implement in MonetDB without huge rewrites are as follows:

* Multiple active databases in the same process (reading different database files)

* Multiple processes reading the same database file

* In-database shutdown/restart

* Single-file database format

* Dependency-free system

* Single compilation file (similar to the SQLite amalgamation https://www.sqlite.org/amalgamation.html)

* Control over resource/memory usage of the database system

* Vectorized execution engine

* Compressed storage and compressed execution

Because of that (and increasing frustration with constantly merging changes) we opted to develop a new system instead of sticking with MonetDB, as rewriting the entire system to get those features would likely have been more work than just starting from scratch (and would not have been politically feasible either ;)).

The result of this is DuckDB. While it is still early in the process, it is relatively stable and we hope to ship a v1.0 sometime this year, along with an updated website :)
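For a sense of what that looks like from the embedding side, here is a minimal Python sketch assuming the current duckdb package API (connect/execute/fetchall); exact names may still shift before v1.0. The table and file names are made up for illustration.

    import duckdb

    # Two independent database files open in the same process --
    # one of the features listed above that MonetDBLite could not offer.
    con_a = duckdb.connect('web_analytics.db')
    con_b = duckdb.connect('experiments.db')

    con_a.execute("CREATE TABLE hits (ts TIMESTAMP, path VARCHAR, status INTEGER)")
    con_a.execute("INSERT INTO hits VALUES (CURRENT_TIMESTAMP, '/index.html', 200)")
    print(con_a.execute("SELECT count(*) FROM hits").fetchall())

    # In-process shutdown/restart without killing the host process.
    con_a.close()
    con_b.close()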


You are doing a fantastic job, and I wish you the best of luck!

I used only the Python API of both DBs, and what confused me is the mandatory dependency on NumPy and Pandas. I think ndarray/DataFrame retrieval and conversion should be optional. Some applications do not require these features and can get by with the built-in types (mine just uses fetchall()).


Good point, we will make it optional :) Thanks for the feedback!

Out of curiosity, do you plan on binding in common functions for the DS/ML use cases? Things like:

- String similarity measures

- ROC-AUC/MSE/correlation/Precision/Recall etc.

- LSH

- Sampling/joining with random records.

Keeping all of the transformation/prep logic in the SQL engine seems like a great performance saving over Python, and would also speed up the dev time for building the code surrounding the ML functionality.


We already have a number of statistical ops (e.g. correlation) available, and we are planning to add more. I cannot promise an exact timeline, but feel free to open issues for the specific operations you are interested in or think would be useful. We are always happy to review PRs as well :)
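As an illustration (a sketch only; check the docs for exactly which aggregates have landed), the statistical aggregates are plain SQL functions, so simple prep logic can stay inside the engine:

    import duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("CREATE TABLE obs (x DOUBLE, y DOUBLE)")
    con.execute("INSERT INTO obs VALUES (1, 2), (2, 4), (3, 7), (4, 8)")

    # corr() and stddev_samp() are used here as ordinary SQL aggregates.
    print(con.execute("SELECT corr(x, y), stddev_samp(x) FROM obs").fetchall())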

Can someone point me to what OLAP is? I've read the wiki page, but I don't get it. All the databases I have worked with are multidimensional (i.e., complex ERP systems with hundreds of tables joined together in various ways). Granted, most of my analytical work is done in R, after performing queries.

But I've never had the need for 'special SQL'.

Is this 'cubes' all over again? I understand cubes as basically views precalculated to perform faster, and mostly to charge more?

But OK... if you need to work on datasets bigger than, e.g., 5 GB, you begin to eat RAM fast. Is this the use case? Then why not simply say that?


Going by example, let's say you're designing a sales system.

OLTP workload is when you create/read one order at a time, or a short list of them.

OLAP workload is when you want to aggregate sales numbers across all orders for a given broad criteria, without particular interest in any one order. For example sales by customer by month, or sales by region by date.

The two cases benefit from different data structures and algorithms to achieve their goals.

A system for OLTP may have a table of orders T0 (id, date-time, customer, $total, address, ship method, contact person, ...) where any one record can be fetched efficiently. However, reading all of them requires a lot of disk I/O.

A system for OLAP can reduce this table to T1 (id, date-time, customer, $total) or even just T2 (date-time, customer, $total), which is good enough to answer most statistical queries while consuming less than half the disk bandwidth.

Further yet, you likely do not care about individual orders made on a given date by a customer, so you can add up all orders for a given day before storing the data: T3 (day, customer, $total-for-the-day-for-this-customer), or for the week: T4 (week, customer, $total-for-the-week-for-this-customer), or for the month: T5 (month, customer, $total-for-the-month-for-this-customer). Alternatively, if you're a big boss, you may not care about individual customers but about regions, so the table you will want to see is T6 (week, region, $total-for-the-week-for-all-customers-in-region).

These smaller tables will answer your questions several orders of magnitude faster than T0. And no, they do not have to fit into memory; on the contrary, the entire purpose of the exercise is to make this work efficiently with disk.

One could also choose not to compute, e.g., T6 and instead derive it on the fly from T4. It's slower than having T6 pre-computed, but it allows more flexibility, e.g., computing sales by ZIP code or by proximity to major bodies of water, etc.
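A toy sketch of the rollup idea in SQL (table and column names invented for illustration, and date_trunc is assumed to be available): T3 is materialized once, and a coarser weekly view is then derived from it rather than from T0.

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE t0 (order_id INTEGER, order_date DATE,
                         customer VARCHAR, total DOUBLE)
    """)

    # T3: pre-aggregate per day per customer.
    con.execute("""
        CREATE TABLE t3 AS
        SELECT order_date AS day, customer, sum(total) AS day_total
        FROM t0
        GROUP BY order_date, customer
    """)

    # A weekly rollup (T4) derived on the fly from the daily table, not from T0.
    con.execute("""
        SELECT date_trunc('week', day) AS week, customer, sum(day_total)
        FROM t3
        GROUP BY date_trunc('week', day), customer
    """).fetchall()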


Okay. So it's a matter of optimising for speed. Since my work requires flexibility, we just try to approximate normalization (I'm not strong on 1st, 2nd normal form, etc.) - I guess what you describe as T2. Even with millions of rows it's normally pretty fast to aggregate rows for, e.g., one customer. Depending on the complexity, it might take a few seconds to filter by delivery method, group by payment, etc.

But anyway... it seems to me one would always design for flexibility (T2), and then just create views for the special cases you describe?


>it seems to me, one would always design for flexibility (T2)

Clearly not - the OLAP field exists and brings in untold billions of dollars, so there is no shortage of takers. OLAP design is about finding an optimal point between two extremes:

1. Put effort into pre-computing everything and gain a lot of speed when looking at the results.

2. Don't pre-compute anything and spend more time waiting to visualize the results.

If things are fast enough for you at extreme #2, then you don't need to pre-compute everything and you can enjoy the flexibility. But for many people this is not the case - they either want results in a split second (highly interactive systems, modeling with varying parameters over and over again), or they want to dramatically increase the amount of data (e.g. not just all sales but also all page views).


Yes, of course. It humbles me. I thought I'd dealt with pretty big data, but clearly with no need for it to be speedy. Thx

In my case, the typical scenario is running aggregation and analytical functions over a small number of columns, while the number of rows is huge. The SQL syntax is the same, but the columnar storage model fits this case better. A good illustration is available in the ClickHouse documentation: https://clickhouse.tech/docs/en/#why-column-oriented-databas....

Thanks... reading the link.

Edit: really interesting. That makes sense. But also a very strict set of requirements.

Thanks.


Just as always, there is no silver bullet.

As someone who uses SQLite a bit for analytics, I feel like the elevator pitch on the GitHub page and the website (https://www.duckdb.org/) is missing something.

I'm sure there's some great reason, but I don't see it: why would I want to use this over SQLite? Is it considerably faster? Does it handle large scale better? I suppose I could try throwing some use cases at it, but I'd worry I was missing the best use case for this tool.


Here's a YT video of one of the implementers giving a talk to CMU students about what DuckDB is, why they made it, and how it works: https://www.youtube.com/watch?v=PFUZlNQIndo

In a nutshell, it's a file-per-database embedded DB like SQLite, and it's installable as a Python or R package from PyPI or CRAN; but unlike SQLite, it's a column store with vectorized operations, so it's more performant for analytical queries.


The video clarified my initial confusion. DuckDB is an embedded column-store library that borrows the SQLite C API and some of the SQLite philosophy (single cpp/hpp amalgamation file, no external dependencies, and resource sharing), but it uses the PostgreSQL syntax and parser, and a column-oriented memory/storage engine based on their experience with MonetDB/VectorWise.

DuckDB borrows code from HyPer, which looks like an in-memory hybrid OLTP/OLAP clone of SAP HANA. They did not discuss any hybrid OLTP/OLAP implementation details in the video, so the borrowed HyPer code may only be the wrapper for the PostgreSQL parser.

Desktop column stores do have a precedent in the enterprise market. Microsoft's first attempt at a column store started with embedded integration with PowerPivot and eventually made its way into SQL Server. Sybase IQ, built on zero-admin SQL Anywhere, has been used as a local column store for Pivot Table applications (think BusinessObjects, Cognos, Tableau, QlikView, and Excel Pivot Tables). Pivot Table integrations are typically custom-built for each data warehouse vendor or MDX/XMLA driver.

DuckDB is focused on integrating with the desktop data science platforms R and Python (dbplyr and pandas). Analytics/OLAP in SQL means lightning-fast GROUP BY queries and, more recently, SQL OLAP window functions. "SQLite for Analytics" confused me, but that is probably due to my own historical mental clutter. The name DuckDB refers to a pet duck, not to the duck typing system in SQLite.


The fact that it's column-oriented is a massive improvement for analytics over SQLite; I don't understand why they don't mention this more explicitly on their website and GitHub pages.

I usually just extend SQLite with virtual columns and try to do the filtering/aggregation before sending tuples to the engine. This does the trick most of the time and is relatively straightforward to implement.

I need to move an Excel sheet to something like SQLite, and I hadn't heard of virtual columns; this fixes most of the issues I have, thanks!

The pitch is that it's faster and more space-efficient, since column stores are far better for analytics than row stores. Some benchmarks found a ~5-10x speedup: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

Consider someone who analyzes medium-sized volumes of (slowly-changing) data -- OLAP, not OLTP. People who need to do this primarily have 2 alternatives:

* columnar database (redshift, snowflake, bigquery)

* a data lake architecture (spark, presto, hive)

The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that require the former.

Traditional databases can be hugely wasteful for this use case -- space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lock-in (there are very few good open-source column stores -- ClickHouse is one, DuckDB is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here -- data that's big enough to be unwieldy, but not big enough to need something like Redshift. It's analogous to the niche SQLite fills in the transactional database world.


I read the post and had a couple of points:

1. Is DuckDB similar to having indexes on each column? Because generally when something is slow, the solution is indexes. I have a 100 GB database which records real-time data and is lightning fast thanks to some minor tuning.

2. The example of STDDEV not being available shows the author's unfamiliarity with SQLite, which worries me.

https://docs.python.org/2.7/library/sqlite3.html#sqlite3.Con...

One could very easily have built a similar interface if necessary.
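For instance, the standard sqlite3 module's create_aggregate hook is enough for this; a quick sketch of a user-defined STDDEV (sample standard deviation via Welford's method):

    import sqlite3, math

    class StdDev:
        """Sample standard deviation via Welford's online algorithm."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
        def step(self, value):
            if value is None:
                return
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        def finalize(self):
            return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else None

    con = sqlite3.connect(":memory:")
    con.create_aggregate("stddev", 1, StdDev)
    con.execute("CREATE TABLE t (x REAL)")
    con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (4,)])
    print(con.execute("SELECT stddev(x) FROM t").fetchone())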


Hi, one of the authors of DuckDB here.

ad 1) An index is an additional data structure that takes space and needs maintenance when the data changes. This is especially problematic for real-time data. DuckDB's main data structure and implementation are designed for efficient scanning without needing an additional index ("vectorised columnar").

ad 2) We support STDDEV_SAMP and STDDEV_POP. There is also an interface to define your own functions. We are very aware of how SQLite does things and even support its client API. But we might also eventually add their way of adding UDFs from Python.


Indexes don't solve the same problem as columnar storage, because with an index you still have to dereference a pointer, read the page (performing I/O if it's on disk), and locate each tuple you want within the page--the tuples you need to aggregate may be scattered across many pages, and you have to read each whole page into memory even if you only want one column.

With a column store, the values of the same column are stored contiguously, so if you're summing a column, for example, you can sequentially read pages that consist entirely of values of the column you need. It's much more efficient for aggregations due to cache locality--far fewer cache misses and I/O operations are needed to summarize a column.

The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.
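A deliberately simplified picture of the layout difference, with Python lists standing in for disk pages (real engines work on fixed-size typed pages, not Python objects):

    # Row layout: each record is a tuple; summing one column drags every
    # column of every record through the cache.
    rows = [(i, "customer%d" % (i % 100), float(i)) for i in range(1_000_000)]
    row_sum = sum(r[2] for r in rows)

    # Column layout: each column is its own contiguous sequence; summing
    # one column reads only the values it needs.
    totals = [float(i) for i in range(1_000_000)]
    col_sum = sum(totals)

    assert row_sum == col_sum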


Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. dates may only span a period of a few years - these take much less space to encode than the full allocation for a date data type. Similarly, we may have categorical fields with only a few values. Imagine a product category with only 8 values. This takes a set of numbers that might otherwise be stored as integers and encodes each one in 3 bits, plus some space for a dictionary.

Another easy bit of compression to apply to columns is run-length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can choose to store a data structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode them with only three likely-integer-sized values, rather than 1,000 repetitions of a single likely-integer-sized value.
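A toy sketch of the run-length idea (here as (value, run length) pairs rather than start/end indices, but the effect is the same):

    def rle_encode(sorted_values):
        """Collapse runs of equal adjacent values into [value, run_length] pairs."""
        runs = []
        for v in sorted_values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return runs

    # 1,000 orders on the same date collapse into a single entry.
    dates = ["2020-05-11"] * 1000 + ["2020-05-12"] * 250
    print(rle_encode(dates))   # [['2020-05-11', 1000], ['2020-05-12', 250]]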

Such compression offers several benefits. First, it reduces data volumes and allows more of the query workload to be served from RAM, either through DB or filesystem caching, or by an in-memory architecture. Additionally, the smaller the data, the faster a scan completes. Analytic workloads tend toward queries that require a full table scan (though with a column store, it's more of a full column scan). If the data is smaller, you can stream the entire thing through whatever your query is doing that much faster.


Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.

Ah yes, embarrassing not to remember basic DB architecture :-)

Do you have an example of a workload that benefits from this structure?


Yes, it's geared toward reporting / analytics / data science workloads where the most common read operation is calculating aggregate metrics like count/min/mean/max and histograms. Statistics stuff.

Thank you very much for your patience :-)

Taking stats on a time-ordered collection of sensor observations.

I'd love to see how it compares to MonetDB. I've used MonetDB successfully to look at smallish data sets (a few hundred million rows) that neither Postgres nor SQLite could run queries on at a useful speed on my laptop (at the time, an X220 with a spinning disk).

https://www.monetdb.org/Home


The primary difference for analytical workloads is the processing model. MonetDB uses a column-at-a-time processing model, which causes large intermediates to be materialized in memory. This significantly increases memory usage and causes poor performance when data exceeds memory. DuckDB uses a vectorized execution engine, which materializes only small subsets of the columns at a time. This increases cache locality, decreases memory usage and also improves parallel execution capabilities (although parallel execution is still WIP currently).
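A toy model of the difference, with Python lists standing in for columns (a real engine uses typed arrays and compiled primitives, so this is only about the shape of the intermediates):

    # Column-at-a-time: the whole intermediate result is materialized at once.
    def column_at_a_time(prices, quantities):
        revenue = [p * q for p, q in zip(prices, quantities)]  # full column in memory
        return sum(revenue)

    # Vectorized: operate on small chunks that fit in the CPU cache, so the
    # intermediate never grows beyond one vector.
    def vectorized(prices, quantities, vector_size=1024):
        total = 0.0
        for i in range(0, len(prices), vector_size):
            p_chunk = prices[i:i + vector_size]
            q_chunk = quantities[i:i + vector_size]
            total += sum(p * q for p, q in zip(p_chunk, q_chunk))
        return total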

DuckDB is developed by the same database research group at CWI that developed MonetDB, so they have applied a lot of the lessons they learned from MonetDB.

Learnings from both Monet AND Vectorwise.

I bookmarked DuckDB some time ago intending to revisit it, and when I did, I had forgotten why I bookmarked it, and the website's contents unfortunately didn't help remind me.

If the documentation said something like "a columnar database based on SQLite, optimized for analytics", I think that would pique the interest of many. Columnar database = column compression = fast filtering/aggregations = high-performance analytics compared to pure row stores.

There are very few free columnar databases available that are production grade (or built on production grade technology), so a columnar database built on top of sqlite is big news.


Apologies, a website rework is currently in progress :)

DuckDB is indeed a free columnar database system, but it is not entirely built on top of SQLite. It exposes the same front-end and uses components of SQLite (the shell and the testing infrastructure), but the execution engine and storage code are new.


I can see why you are avoiding "built on top of SQLite", but I really think you should reconsider and go with that terminology.

It means the "right" thing to most people: I can use this like I use SQLite now.


If you want to dig into the differences more, I highly recommend the book Designing Data Intensive Applications. Analytics systems have different requirements from normal transaction processing systems, so it naturally follows that DuckDB could specialize to fill those requirements better than SQLite. Many people bring up row stores (SQLite) vs column stores, but there are many other interesting differences and optimizations to be made, so I can see how SQLite may be leaving some niche unfilled.

Just came here to say the same thing. I read the readme and it's not clear what this does exactly compared to SQLite, particularly since it uses SQL. What kinds of analytics is it optimized for? Does it have different data structures for streaming data? I can tell a lot of work went into this project, and I think it could benefit from a more thorough introduction.

It's a columnar database, and the query engine processes columns in vectors - "Execution engine: The vectorized execution engine is inspired by the paper "MonetDB/X100: Hyper-Pipelining Query Execution" by Peter Boncz, Marcin Zukowski and Niels Nes.".

Interesting that the distinctive features are buried in the acknowledgements section.

I wouldn't call that "buried in the acknowledgements section." I wish more software projects would present themselves in this way actually. Inevitably everything is built upon prior art, so it makes sense to say where the ideas came from or which parts are novel (if any - sometimes packaging together a bunch of existing ideas is valuable in itself).

It sure beats "[project] is a new key-value store that is even faster than [other project]".


Agreed. I was trying to understand it earlier, and there doesn't seem to be a description of what its features are and what it's good for. It jumps straight to "how to use" without "why to use".

My first instinct based on a cursory glance (ie. specific language compatibilities) is that it's a portable embedded database for data analytics of the cybersecurity domain, particularly mobile penetration testing possibly.

It’s much faster for analytical queries, because it’s a column store rather than a row store.

Some more info about columnar DBs here https://en.wikipedia.org/wiki/Column-oriented_DBMS for anyone not familiar with the idea.

Adding to the pile of people agreeing. SQLite works okay for analytics for now, and I would be super-interested in something embeddable more tailor-made for analytics, but don't understand yet in what specific ways DuckDB is an improvement.

It might be dating me, but this seems like a larger version of the difference between C and FORTRAN arrays, C being row-store-like and FORTRAN being column-store-like.

now get off my lawn :)


Well, it seems to me the difference between row stores and column stores might be a little more profound than just row- and column-major order in C and FORTRAN (which I remember from back in the day).

In column stores, it's not just the ordering that's advantageous for analytics. It's also the possibility of scanning only the columns required (to the exclusion of others), and the possibility of column compression using RLE, which speeds up searches when there are many repeated elements (compression also makes searches cache-efficient). This means that if your workload only involves a few columns, and your primary operations are filter and aggregate -- analytics workloads -- then the performance gains are tremendous.

On the other hand, if you want to optimize for row by row writes (OLTP), and if your workloads involve operations on many columns, then a rowstore is advantageous.

Row and column major layouts are more about exploiting sequentiality, but in practice both C and FORTRAN can provide good linear algebra performance.

But for analytics workloads, the difference in performance between row stores and column stores can be orders of magnitude. The fastest analytics databases in the world (kdb+, Vertica, Exasol, some of the newer GPU-based databases) are almost all columnar.


Same. Best elevator pitch would be how it complements (or replaces) SQLite

The CMU Database Group's video series recently had the DuckDB lead on:

https://youtu.be/PFUZlNQIndo


Thanks! Loved the presentation.

This presentation was awesome!

I was looking for this exact type of project to back some of my ML projects' datasets. These datasets:

- Are less than 1 TB at maximum, and are typically less than 50GB

- Written once and read many times

- Require heavy and variable post-processing prior to training

- Usually run on an ephemeral large instance which reads from S3, does its thing, and then shuts down.

This last task is incredibly slow in Python, and on a standard SQLite/Postgres DB it is limited to a single thread. Having access to an embeddable columnar SQL engine could greatly simplify and accelerate this final transformation stage before training.


Instead of using Python for data transformation, just move over to data.table.

https://h2oai.github.io/db-benchmark/

Cut your time AND your costs.


I've been using ClickHouse with our 100 GB of click log data per day. It just works; before that we were using BigQuery, and BigQuery was cheaper than our home-brewed solution :(

So now we use the two together: only the recent data that needs to be available for fast querying in the user dashboard is served from the ClickHouse cluster.

And older data, where the customer is willing to tolerate slower responses of up to 2 seconds, is served by BigQuery.


I know this isn't a support forum, but as someone who works in analytics I tend to use the same SQL editor to connect to different databases. At the moment it is DBeaver[0]. Maybe I am missing it, but I cannot see how to connect to DuckDB using a regular SQL editor.

[0] https://dbeaver.io


As a possible hack, you could give it the same soname as SQLite and put it first in the library search path. It would have to implement the same ABI, though.

Try connecting to it as if it were SQLite? Not sure if they match any existing wire protocol, but that would be my guess if they do.

SQLite is an in-process database, so it doesn't really have a wire format.

Can this be made into a back end for SQLite?

Edit: for example, like this: https://github.com/LMDB/sqlightning


SQLite v4 was supposed to be built on top of LMDB, as I remember. For some reason the project was closed; they even have a note about it somewhere on the official site.

I tried the online version [1], but this simple request does not seem to work for me: SELECT * FROM lineitem WHERE l_discount < 1

Any reason why?

[1] https://duckdbdemo.project.cwi.nl/select.html


I just installed this, and looking at the dependencies, it includes both NumPy and pandas.

Could anyone advise how these relate to DuckDB? I.e., is a DuckDB database accessible as a pandas DataFrame?


Yes, the Python bindings include functions for working with pandas DataFrames. Look at the examples: https://github.com/cwida/duckdb/blob/master/examples/python/...
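A short sketch of that round trip, assuming the register()/fetchdf() helpers in the current bindings (the exact names may have changed; the DataFrame contents here are made up):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"path": ["/", "/about", "/"], "status": [200, 200, 404]})

    con = duckdb.connect()
    con.register("hits", df)  # expose the DataFrame as a relation
    result = con.execute(
        "SELECT path, count(*) AS n FROM hits GROUP BY path"
    ).fetchdf()               # and pull the result back as a DataFrame
    print(result)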


