DuckDB is a relatively new project, and I generally enjoyed its design and code quality. However, I found that its current storage implementation works better for dense data, which is not my case: https://github.com/cwida/duckdb/issues/632. I think it would be pretty cool to have data compression and the more careful NULL handling needed for storing HTTP access logs.
MonetDBLite seems to be more mature in terms of functionality, but it lags significantly behind the non-embedded client-server version, MonetDB: https://github.com/monetdb/monetdb. I experienced unexpected segfaults when using the DATE type on any platform, and when using window functions on aarch64. Nevertheless, I am still using MonetDBLite, primarily due to its more efficient disk usage. I would be happy to switch to a more lightweight and actively maintained solution.
MonetDBLite was our initial approach, and is essentially an embedded version of MonetDB. We wrote a paper about it (https://arxiv.org/pdf/1805.08520.pdf). While it works, the system was not built with embeddability in mind, and we had to rewrite a lot of code to get it to work. Because of that we ended up with a fork, as the rewrite was too big to be merged back upstream. This caused a lot of problems with the fork becoming outdated, and a lot of headaches with constantly merging changes.
MonetDBLite had a number of issues stemming from the fact that the original system was designed as a stand-alone system. For example, once the database system was started in-process, it could not be shut down, because the regular database system relies on the process exiting to clean up certain parts of the system.
In total, the features we wanted that would not be possible to implement in MonetDB without huge rewrites are as follows:
* Multiple active databases in the same process (reading different database files)
* Multiple processes reading the same database file
* In-database shutdown/restart
* Single-file database format
* Dependency-free system
* Single compilation file (similar to the SQLite amalgamation https://www.sqlite.org/amalgamation.html)
* Control over resource/memory usage of the database system
* Vectorized execution engine
* Compressed storage and compressed execution
Because of that (and increasing frustration with constantly merging changes) we opted to develop a new system instead of sticking with MonetDB, as rewriting the entire system to get those features would likely be more work than just starting from scratch (and not politically feasible as well ;)).
The result of this is DuckDB. While it is still early in the process, it is relatively stable and we hope to ship a v1.0 sometime this year, along with an updated website :)
I used only the Python API of both DBs, and what confused me was the mandatory requirement of NumPy and Pandas. I think ndarray/DataFrame retrieval and conversion should surely be optional. Some applications do not require all these features and can get by with the built-in types (mine just uses fetchall()).
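For illustration, the built-in-type path looks like this in the DB-API style these drivers follow. The sketch below uses the stdlib sqlite3 module as a stand-in (the table and data are made up):

```python
import sqlite3  # stand-in: DuckDB's Python API follows the same DB-API shape

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hits (path TEXT, status INTEGER)")
con.executemany("INSERT INTO hits VALUES (?, ?)",
                [("/index", 200), ("/missing", 404)])

# fetchall() returns plain tuples of built-in Python types --
# no NumPy or pandas machinery involved
rows = con.execute("SELECT path, status FROM hits ORDER BY status").fetchall()
```

The result is just a list of tuples, which is all many applications need.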
- String similarity measures
- ROC-AUC/MSE/correlation/Precision/Recall etc.
- Sampling/joining with random records.
Keeping all of the transformation/prep logic in the SQL engine seems like a great performance saving over Python, and would also speed up dev time when building the code surrounding the ML functionality.
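As a sketch of the idea, a metric like MSE can be computed entirely inside the engine. This uses the stdlib sqlite3 module and a hypothetical preds table of true/predicted values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE preds (y_true REAL, y_pred REAL)")
con.executemany("INSERT INTO preds VALUES (?, ?)",
                [(1.0, 1.5), (2.0, 1.5), (3.0, 3.0)])

# mean squared error computed by the SQL engine, not in Python
(mse,) = con.execute(
    "SELECT AVG((y_true - y_pred) * (y_true - y_pred)) FROM preds"
).fetchone()
```

In a columnar engine, an aggregation like this only has to scan the two columns involved.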
But I've never had the need for 'special sql'.
Is this 'cubes' all over again? I understand cubes as basically precalculated views that perform faster, mostly so vendors can charge more?
But OK... if you need to work on datasets bigger than, say, 5 GB, you begin to eat RAM fast. Is this the use case? Then why not simply say so?
OLTP workload is when you create/read one order at a time, or a short list of them.
OLAP workload is when you want to aggregate sales numbers across all orders for a given broad criteria, without particular interest in any one order. For example sales by customer by month, or sales by region by date.
The two cases benefit from different data structures and algorithms to achieve their goals.
A system for OLTP may have a table of orders T0 (id, date-time, customer, $total, address, ship method, contact person, ...) where any one record can be fetched efficiently. However, reading all of them will require a lot of disk IO.
A system for OLAP can reduce this table to T1 (id, date-time, customer, $total) or even just T2 (date-time, customer, $total), which is good enough to answer most statistical queries while consuming less than half the disk bandwidth.
Further yet, you likely do not care about an individual order made on a given date by a customer, so you can add up all orders for a given day before storing the data: T3 (day, customer, $total-for-the-day-for-this-customer); or for the week: T4 (week, customer, $total-for-the-week-for-this-customer); or for the month: T5 (month, customer, $total-for-the-month-for-this-customer). Alternatively, if you're a big boss you may not care about individual customers but may care about regions, so the table you will want to see is T6 (week, region, $total-for-the-week-for-all-customers-in-region).
These smaller tables will answer your questions several orders of magnitude faster compared to T0. And no, they do not have to fit into memory; on the contrary, the entire purpose of the exercise is to make it work efficiently with disk.
One could also choose not to compute, e.g., T6 and instead derive it on the fly from T4. That is slower than having T6 pre-computed, but allows more flexibility, e.g. computing sales by ZIP code or by proximity to major bodies of water, etc.
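The roll-up from T0 to T3 described above is a plain GROUP BY. A minimal sketch using the stdlib sqlite3 module (table and column names simplified from the text):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# T0: one row per order
con.execute("CREATE TABLE t0 (id INTEGER, day TEXT, customer TEXT, total REAL)")
con.executemany("INSERT INTO t0 VALUES (?, ?, ?, ?)", [
    (1, "2020-05-01", "acme", 10.0),
    (2, "2020-05-01", "acme", 5.0),
    (3, "2020-05-01", "globex", 7.0),
])

# T3: one row per (day, customer) instead of one row per order
con.execute("""CREATE TABLE t3 AS
               SELECT day, customer, SUM(total) AS total
               FROM t0 GROUP BY day, customer""")
rows = con.execute("SELECT * FROM t3 ORDER BY customer").fetchall()
```

Queries over T3 then touch one row per customer per day rather than one row per order.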
But anyway... it seems to me one would always design for flexibility (T2), and then just create views for the special cases you describe?
Clearly not - the OLAP field exists and brings in untold billions of dollars, so there is no shortage of takers. OLAP design is about finding an optimal point between two extremes:
1. Put in effort to pre-compute everything, and gain a lot of speed when looking at the results.
2. Don't pre-compute anything, and spend more time waiting to visualize the results.
If things are fast enough for you at extreme #2, then you don't need to precompute everything and you can enjoy the flexibility. But for many people this is not the case: they either want results in a split second (highly interactive systems, modeling with varying parameters over and over again), or they want to dramatically increase the amount of data (e.g. not just all sales but also all page views).
Edit: really interesting. That makes sense. But also a very strict set of requirements.
I'm sure there's some great reason, but I don't see it: Why would I want to use this over SQLite? Is it considerably faster? Does it handle large scale better? I suppose I could try and throw it some use cases I could come up with, but I'd worry I'm missing the best use case for this tool.
In a nutshell, it's a file-per-database embedded DB like SQLite, and it's installable as a Python or R package from PyPI or CRAN, but unlike SQLite, it's a column store with vectorized operations so it's more performant for analytical queries.
DuckDB borrows code from HyPer, which looks like an in-memory hybrid OLTP/OLAP clone of SAP HANA. They did not discuss any hybrid OLTP/OLAP implementation details in the video, so the borrowed HyPer code may only be the wrapper for the PostgreSQL parser.
Desktop column stores do have a precedent in the enterprise market. Microsoft's first attempt at a column store started with embedded integration with PowerPivot and eventually made its way into SQL Server. Sybase IQ, built on zero-admin SQL Anywhere, has been used as a local column store for Pivot Table applications (think BusinessObjects, Cognos, Tableau, QlikView, and Excel Pivot Tables). Pivot Table integrations are typically custom-built for each data warehouse vendor or MDX/XMLA driver.
DuckDB is focused on integrating with the desktop data science platforms R and Python (dbplyr and pandas). Analytics/OLAP in SQL means lightning-fast GROUP BY queries and, more recently, SQL window functions. "SQLite for Analytics" confused me, but that is probably due to my own historical mental clutter. The name DuckDB refers to a pet duck, not the duck typing system in SQLite.
Consider someone who analyzes medium-sized volumes of (slowly-changing) data -- OLAP, not OLTP. People who need to do this primarily have 2 alternatives:
* columnar database (redshift, snowflake, bigquery)
* a data lake architecture (spark, presto, hive)
The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that require the former.
Traditional databases can be hugely wasteful for this use case: space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lock-in (there are very few good open-source column stores; ClickHouse is one, DuckDB is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here: data that's big enough to be unwieldy, but not big enough to need something like Redshift. It's analogous to the niche SQLite fills in the transactional database world.
1. Is DuckDB similar to having indexes on each column? Because generally when something is slow, the solution is indexes. I have a 100 GB database which records real time data and is lightning fast because of some minor tuning.
2. The example of STDDEV not being available shows the author's unfamiliarity with SQLite, which worries me.
Could very easily have made a similar interface if necessary.
ad 1) An index is an additional data structure that takes space and needs maintenance when the data changes. This is especially problematic for realtime data. DuckDB's main data structure and implementation are designed for efficient scanning without needing an additional index ("vectorised columnar").
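As a toy illustration of the vectorised-scan idea in (1) (not DuckDB's actual implementation; the vector size is illustrative), processing a column in fixed-size chunks looks like:

```python
VECTOR_SIZE = 1024  # illustrative; real engines pick a size tuned to CPU cache

def vectorized_sum(column):
    """Scan a column in fixed-size vectors rather than value-at-a-time.

    Each inner sum() touches one contiguous chunk of the column, which is
    the cache-friendly sequential access pattern a columnar scan exploits,
    so no separate index structure is needed.
    """
    total = 0
    for start in range(0, len(column), VECTOR_SIZE):
        total += sum(column[start:start + VECTOR_SIZE])
    return total
```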
ad 2) We support STDDEV_SAMP and STDDEV_POP. There is also an interface for defining your own functions. We are very aware of how SQLite does things and even support their client API. We might also eventually add their way of adding UDFs from Python.
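For comparison, this is the mechanism SQLite's Python driver already offers for aggregate UDFs (sqlite3.Connection.create_aggregate), here used to add the STDDEV that SQLite itself lacks:

```python
import sqlite3
from math import sqrt

class StddevSamp:
    """Sample standard deviation as a SQLite aggregate UDF."""
    def __init__(self):
        self.values = []

    def step(self, value):
        # called once per row; skip SQL NULLs
        if value is not None:
            self.values.append(value)

    def finalize(self):
        n = len(self.values)
        if n < 2:
            return None
        mean = sum(self.values) / n
        return sqrt(sum((v - mean) ** 2 for v in self.values) / (n - 1))

con = sqlite3.connect(":memory:")
con.create_aggregate("stddev_samp", 1, StddevSamp)
con.execute("CREATE TABLE t (x REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (3.0,), (4.0,)])
(sd,) = con.execute("SELECT stddev_samp(x) FROM t").fetchone()
```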
With a column store, the values of the same column are stored contiguously, so if you're summing a column, for example, you can sequentially read one or more pages that consist entirely of values of the column you need to sum. It's much more efficient for aggregations due to cache locality: far fewer cache misses and I/O operations are needed to summarize a column.
The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.
A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g., dates may only span a period of a few years; these will take much less space to encode than the full allocation for a date data type. Similarly, we may have categorical fields with only a few values. Imagine a product category with only 8 values. Values that might otherwise be stored as full integers can be encoded in 3 bits each, plus some space for a dictionary.
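A minimal sketch of the dictionary-encoding idea above (illustrative only, not any engine's actual storage format):

```python
def dict_encode(values):
    """Dictionary-encode a categorical column.

    Each distinct value gets a small integer code, so a column with
    8 distinct categories needs only 3 bits per value on disk, plus
    one copy of the dictionary mapping codes back to values.
    """
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes
```

For example, dict_encode(["books", "toys", "books"]) yields the code stream [0, 1, 0] plus the dictionary {"books": 0, "toys": 1}.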
Another easy bit of compression to apply to columns is run length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can choose to store a data structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode this with only three likely-integer-sized values, rather than 1,000 repetitions of a single likely-integer-sized value.
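The (start index, date value, end index) scheme above can be sketched as:

```python
def rle_encode(column):
    """Run-length encode a column as (start_index, value, end_index) triples.

    On a date-sorted column, 1,000 rows sharing one date collapse into a
    single triple instead of 1,000 repeated values.
    """
    runs = []
    for i, v in enumerate(column):
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0], v, i)  # extend the current run
        else:
            runs.append((i, v, i))  # start a new run
    return runs
```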
Such compression offers several benefits. First, it reduces data volumes, allowing more of the query workload to be served from RAM, either through DB or filesystem caching, or through an in-memory architecture. Additionally, the smaller the data size, the faster a scan can be completed. Analytic workloads tend toward queries that require a full table scan (though with a columnstore, it's more a full column scan). If the data size is smaller, then you can stream the entire thing through whatever your query is doing faster.
Do you have an example of a workload that benefits from this structure?
If the documentation had said something like "a columnar database based on sqlite, optimized for analytics", I think that would pique the interest of many. Columnar database = column compression = fast filtering/aggregations = high performance analytics compared to pure row stores.
There are very few free columnar databases available that are production grade (or built on production grade technology), so a columnar database built on top of sqlite is big news.
DuckDB is indeed a free columnar database system, but it is not entirely built on top of SQLite. It exposes the same front-end and uses components of SQLite (the shell and testing infrastructure), but the execution engine/storage code is new.
It means the "right" thing to most people: I can use this like I use SQLite now.
It sure beats "[project] is a new key-value store that is even faster than [other project]".
now get off my lawn :)
In columnstores, it's not just the ordering that's advantageous for analytics. It's also the possibility of scanning only the columns required (to the exclusion of others), and the possibility of column compression using RLE, which speeds up searches when there are many repeated elements (compression also makes searches cache-efficient). This means that if your workload only involves a few columns, and your primary operations are filter and aggregate (i.e., analytics workloads), then the performance gains are tremendous.
On the other hand, if you want to optimize for row by row writes (OLTP), and if your workloads involve operations on many columns, then a rowstore is advantageous.
Row and column major layouts are more about exploiting sequentiality, but in practice both C and FORTRAN can provide good linear algebra performance.
But for analytics workloads on row vs columnstores, the difference in performance can be in the orders of magnitude. The fastest analytics databases in the world (Kdb, Vertica, Exasol, some of the newer GPU based databases) are almost all columnar.
- Are less than 1 TB at maximum, and typically less than 50 GB
- Written once and read many times
- Require heavy and variable post-processing prior to training
- Usually run on an ephemeral large instance which reads from S3, does its thing, and then shuts down.
This last task is incredibly slow in Python, and limited to a single thread on a standard SQLite/Postgres DB. Having access to an embeddable columnar SQL engine could greatly simplify and accelerate this final transformation stage before training.
Cut your time AND your costs.
So now we use the two together: only recent data, which needs to be available for fast querying in the user dashboard, is served from the ClickHouse cluster.
And old data, where the customer is willing to tolerate slower responses of up to 2 seconds, is served by BigQuery.
Edit: an example like this: https://github.com/LMDB/sqlightning
any reason why?
Could anyone advise how these relate to DuckDB? I.e., is DuckDB accessible as a Pandas dataframe?