Microsoft SQL Server has columnstore indexes and can even be combined with its in-memory tables. MemSQL has been doing this for years and v6 is incredibly fast, also combines in-memory row-stores. ClickHouse is very good if you don't mind more operations work. MariaDB has the ColumnStore storage engine, Postgre has the cstore_fdw extension. Vertica, Greenplum, Druid, etc. EventQL was an interesting project but abandoned now.
AWS RedShift, Azure SQL Data Warehouse, Snowflake Data, Google BigQuery are the hosted options, with BQ being the most advanced with its vertical integration.
If you want to operationalize Apache Arrow today, Dremio is built around it and works similar to Apache Drill and Spark to run distributed queries and joins across data sources.
It heavily relies on compression, data locality and SIMD instructions and supports external dictionaries for lookup.
How much of big queries performance do you think stems from Capacitor versus the rest of the system. For example if you switched it out with parquet, but kept everything else (Colossus, Dremel, Background reordering, metadata stored in Spanner etc) would it still be 10/30/50% worse or would it be an order of magnitude worse.
And yeah, this really depends not just on the dataset, but also on how selective your queries are, what predicates and aggregations they employ, etc. A significant percentage of queries gets orders of magnitude faster. I can’t disclose how much faster things got on average, but it was a significant gain, way more than would be sufficient to offset the increased cost of encoding (which is another aspect people typically don’t consider), even considering that much of the data people encode is hardly ever touched.
For anyone who doesn't know what Capacitor is: https://cloud.google.com/blog/big-data/2016/04/inside-capaci...
It appears that relatively new accounts that post here are automatically flagged. I've seen it before.
Usually if I click on their specific comment and vouch for them, they get unflagged, but it didn't work this time.
If Apache Arrow takes off I hope BigQuery will support it as a data interchange format in the future. Zero-copy is pretty awesome, as are open standards in general. This feature does not exist in BigQuery today (as far as I know - definitely not as discussed in the source).
So it’s a little incorrect to think of this as a “file format” in the first place. If you end up designing it like that, you’d not be able to have a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push down as much filtering and aggregation in there as is feasible, exposing a higher level interface with _at least_ filtering predicate support, and do it in C++.
The main precedent here is Spanner's Ressi storage engine, which, according to the most recent paper , uses a PAX-like format (with blocks arranged row-oriented, but values within a block are column-oriented, so kind of like Parquet) for on-disk data, but combines it with a traditional log-structured merge tree for writes and point queries.
The fun thing to try and imagine here is having literally the same physical data format that works for both kinds of workloads.
If this is what you mean by a "database frontend", I am really confused as to why you object to this?
I think this property of time series is going to prove very important in the 2020s
The question we're trying to answer is whether there exists at all a storage engine, at the scope of a single node (obviously generalizable/scalable), that can fit all use cases well. Once you figure out the answer to that, obviously having one storage engine is simpler than having two storage engines.
Which is exactly the tradeoff RDBMS indexes already make.
Two reasons are at the top of my mind:
1. The current best efforts in benchmarking are focusing on queries that "look" similar, and yet kdb is still 400x faster than Hadoop for those queries. For example:
select avg size by sym,time.hh from trade where date=d,sym in S
SELECT sym,HOUR(time),AVG(size) FROM trade
NATURAL JOIN s WHERE date=d GROUP BY sym,HOUR(time);
The 2020 column store has to do a lot of work to even match kdb, but assuming it does that, and even ekes out a few extra percent of performance on these queries, there's another issue:
2. Most kdb programmers don't write this way.
Sure some write their application in Java and send these queries over to the kdb "server" get the results, and do stuff with the results, etc., just like the application programmers that use Hadoop, but most kdb programmers don't. They just write their application in kdb.
That means that there isn't an extra second or two delay while this chunky result set is sent over to another process.
UDF/Stored Procedures/Foreign procedures are the rest of the world's solution for this problem, and they are massively under-utilised: Tooling like version control and testing of stored procedures just doesn't work as well, and I don't see any suggestion that's going to change in the next decade or so.
It's a really elegant way of dealing with large amounts of data, although the downside is that you've typically got to build a lot of the nice-to-have dbms type infrastructure yourself.
It would be interesting if BigQuery or Redshift ever figure out that they could have a much more powerful system if they stuck an array language on the front of their storage engines.
Everyone has seen "numbers every programmer should know" < https://gist.github.com/jboner/2841832 > If you're going to do complex data analysis, e.g. machine learning, you want your data access latency to be on the short end of this chart :) When you end up on the long end (like in RDBMS) this is known as N+1 problem.
But modern size data doesn't fit into memory. Distributed systems necessarily add latency, and to fix that we add caching, which hurts consistency. I blew up this thread further down about how Datomic's core idea is to provide consistent data to your code; which is the opposite of how most DBMS (including kbd) make you bring the code into the database.
Indeed, it's hard for exhaustive search beat index lookup.
kdb isn't using any indexes for this query.
Once you're both distributed and consistent, the problems today's stores are architected around, go away. Your distributed queries can index the immutable log however they like. column-oriented, row-oriented, documents, time-oriented, graphs, immutability means you can do all of it, as a library in your application process
http://www.datomic.com/ - it's what you get when Facebook's graph datastore has a baby with immutability.
This is interesting, because consider a large system like Facebook. Transactions naturally fall within certain boundaries. You never transact to Events, Photos, and Instagram all at once - from the write side, they don't have to share the same single-writer process delivering ACID.
You do however, on the read side, need to have fluid queries across them all, as if they were one database. RDBMS can't do that, but Datomic can, and Git can too - consider submodules. Immutability is what makes it possible to shard like this without sacrificing query expressiveness, strong consistency or ACID (like every other distributed system that isn't accumulate only)
To scale writes you shard writes. This generally means running multiple databases (in any DBMS)
The key insight is that Datomic can cross-query N databases as a first class concept, like a triple store. You can write a sophisticated relational query against Events and Photos and Instagram, as if they are one database (when in fact they are not).
This works because Datomic reads are distributed. You can load some triples from Events, and some triples from Photos, and then once you've got all the triples together in the same place you can do your queries as if they are all the same database. (Except Datomic is a 5-store with a time dimension, not a triple store.)
In this way, a Datomic system-of-systems can scale writes, you have regional ACID instead of global ACID, with little impact to the programming model of the read-side, because reads were already distributed in the first place.
For an example of the type of problems Datomic doesn't have, see the OP paragraph "To sort, or not to sort?" - Datomic maintains multiple indexes sorted in multiple ways (e.g. a column index, a row index, a value index). You don't have to choose. Without immutability, you have to choose.
Immutability in-and-of-itself is not the selling point. Consider why everyone moved from CVS/SVN to Git/dvcs. The number of people who moved to git then and said "man I really wish I could go back to SVN" is approximately zero. Immutability isn't why. Git is just better at everything, that's why. Immutability is the "how".
I don't see why I would use an RDBMS to store data ever again. It's not like I woke up one day and started architecting all my applications around time travel†. It's that a lot of the accidental complexity inherent to RDBMS - ORM, N+1 problems (batching vs caching), poorly scaling joins, pressure to denormalize to stay fast, eventual consistency at scale...
Datomic's pitch is it makes all these problems go away. Immutability is simply the how. Welcome to the 2020s.
†actually I did, my startup is http://hyperfiddle.net/
If I understand Datomic correctly, I can think of the time dimension in the triple store as a sequence of transactions each of which produces a new state. How that maps onto real storage is flexible (Datomic supports a number of backends which don't explicitly support the notion of immutable database).
So what would a key-value-time store be useful for? Conceptually it seems that if Datomic triples are mapped onto the key-value database then time in Datomic becomes transactions over the log. So one area of interest is as a database backend for Datomic that is physically designed to be immutable. There is a lot of hand waving, and Datomic has a large number of optimizations. Thanks for answering some of those questions about Datomic. It's a really fascinating system.
I will have to think about if a key-value-time storage backend would have any implications in Datomic. One thing Datomic Pro doesn't have is low-cost database clones through structure sharing, and it doesn't make sense to me why.
Here is a description of some storage problems that arise with large Datomic databases, though it's not clear to me if ZLog can help with this < http://www.dustingetz.com/ezpyZXF1ZXN0LXBhcmFtcyB7OmVudGl0eS..., >
This common challenge is perhaps magnified in systems that have deep storage indexes. For example, the link you posted seems to suggest that queries in Datomic may have cache misses that require pointer chasing down into storage, adding latency. Depending on where that latency cost is eaten, it could have a lot of side effects (e.g. long running query vs reducing transaction throughput).
This problem is likely exacerbated in the key-value store running on zlog because the red-black tree can become quite tall. Aggressive, optimistic caching on the db nodes, and server-side pointer chasing helps. It's definitely an issue.
I don't know anything about Datomic internals, so disclaimer, I'm only speculating about why Datomic doesn't have low cost clones: which is that the underlying storage solution isn't inherently copy-on-write. That is, Datomic emulates this so when a clone is made there isn't an easy metadata change that creates the logical clone.
Despite the lack of features etc in the kv-store running on zlog, database snapshots and clones are both the same cost of updating the root pointer of the tree.
At any decent scale, most companies now just use a dedicated log like Kafka or Pulsar as the main backbone to support more flexibility in producers and consumers. Either way, none of this has to do with column-stores as the actual representation of data.
CQRS has the right fundamental constituents but puts them together in the wrong way, I think. The command abstraction is in the application layer (we're talking about stores not apps) and the properties of the read-side are fixed (e.g. decisions about document-orientation, column- or row- are coded in advance). But those same parts can be used to make something more flexible, that lets the properties of the read-side be less fixed.
Datomic maintains multiple builtin indexes to support several query styles (so multiple parallel "representations of data") < http://docs.datomic.com/indexes.html >, so Datomic has native support for querying in the shape of: documents, rows, columns, time, values. The storage is actually represented all those ways in parallel copies. (and yet the time dimension keeps us linearized and strongly consistent, like git!)
More interesting than the builtin indexes though, is that you can conceptually implement your own index, since immutability lets you distribute/cache the data in your application processes, the query engine is actually a library running in your query process. (Datomic Datalog is literally a jar file running in your elastic app processes and it works on vanilla JVM data structures)
This is called "code/data locality" and it's extremely powerful. You don't need to go fork the database codebase to add a new type of index, like people had to do to add geospatial index to a fork of Postgres. You can incrementally maintain your own indexes. You can ignore datalog and implement your own query functions to query your special index. Or you can seamlessly compose your own query functions inside datalog queries, you can pass your index as an input to a datalog query. Here's a snippet of what that looks like: https://i.imgur.com/GJuTkJR.png
Some existing stabs in this direction include that some Samsung SoCs had a simple "MComp" memory compressor (https://github.com/XileForce/Vindicator-S6-Unified/blob/mast...), that the new Qualcomm Centriq chips use compression (https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-...), and that some IBM POWER CPUs have dedicated circuitry for memory compression (http://ibmsystemsmag.com/aix/administrator/lpar/ame-intro/). There's also hardware zlib offload, like Intel QuickAssist.
I'd expect more of this in the future because 1) space-is-speed is just a fundamental thing we deal with computing, 2) chips keep getting faster relative to RAM, 3) you already see lots of special-purpose instructions being added (crypto, strings, fancier vector stuff...) as it gets more expensive to improve general-purpose IPC. Maybe there's some additional value given the arrival of 3D XPoint and (probably) other future NVRAMs--would help you fit more on them without spending more time compressing than writing--but regardless, the trends seem to point to compression assists being interesting.
One reason I could turn out wrong is if the general-purpose facilities we have make software the best place to write compressors anyway, i.e., fast software packers get so good it becomes difficult to justify hardware assists. General-purpose low- and medium-compression algos like LZ4 and Zstd run pretty fast already, and we have even faster memory compressors (WKdm, Density). Of course, that's on big Intel cores; maybe special-purpose compressor hardware will continue to mostly be more interesting alongside smaller cores.
> That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits
Ideally you don't need to do ceil, if you had an low number like 5 items, then it looks like you need 3 bits to store it, but you can store it in 2.4 bits (just pack 10 values into 24 bits instead of 30).
Getting distinct and repeated values by tearing apart data so that you can use these algorithms is something which I could use some papers to refer to.
For instance, here's what we're trying to do with Double encoding loops, but it still suffers from the problems of a car moving from 0.3 -> 0.2 location.
 - http://bit.ly/2zt70iL
I ask because on a physical medium like hard disk, storing data on physical disk in column orientation can make a significant improvement to read operations.
But with SSD/NVME, you don’t have to worry about the inherent slowness of physical platters that exist in hard disk.
Yes. SSD generally has lower latency for responses.
While sending data (throughput) is similar, the operating system doesn't ask for all of the blocks of a file at once -- even if you read(fd,buf,size) -- because other processes might ask for other blocks in the meantime. An IO schedular is making decisions, and that latency helps turn around those decisions faster.
> I ask because on a physical medium like hard disk, storing data on physical disk in column orientation can make a significant improvement to read operations.
I'm not really sure this is true. Hard Drives have (for over a decade, probably longer) had logic that lies about the physical layout of the disk to the point where all I can believe about the linear block address is that the circuitry "believes" that likelihood software will ask for the next linear block address is higher than any other one.
Further that, I imagine SSDs can probably make similar optimisations.
The reason column-orientation helps is that it reduces the volume of data that needs reading. If you have a table with 100 columns in it, but a query that operates on 2, then a column-oriented database needs to read 2 things, while the row-oriented database either reads 100 things, or it interleaves reads of 2 things with skips of 98 things.
It isn't difficult to believe that the circuitry needed to handle the former will outpace the circuitry needed to handle the latter for a long time.
Also, even with NVMe SSD's, there is a lot of operating system overhead associated with every read. Having layers of drivers to orchestrate the transfer of 2 bytes of data you wanted really slows it down...
So when you query on an indexed column, you'll have contiguous access on the index, but the rows themselves may be stored on disk based on a different column (so you'll could get the row pointers for a range query in one go from the index but fetching the rows would be random lookups). But if you want to view an entire row, its trivial, because the full row data is contiguous on disk.
Column store groups the table values by the column, so the values of col A will be contigous, and col B will be contiguous, (but not pairwise!) but if you want to view the entire row, you'll probably have to do 2 lookups in random locations. But the range query on col A, selecting only col A, becomes a trivial fetch.
Thats my understanding anyways
1. How data is represented. A database is a collection of column objects. For example, it is easy to create a column or delete a column.
2. How data is being processed (queried). The engine processes columns as objected managed by the system
put data files in directory
name data files after columns
use ".data" filename extension for data files
write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in data file
If data files are all < 4GB, use a 32 bit unsigned integer to represent the byte offset in the index file
Each index file is a packed array of 32 bit integers
Write a tool to create length files ".length" that count the number of entries in a data file
Generate .length files for all data files
Use mmap to access index files
Use C for all of the above
This is for variable-length data values. Not every column will have these, making the .index files redundant in this case; the .index files should not be created in this case and program logic should support both uniform value length access and nonuniform value length access. The reason to prefer two access modes is to keep data from the .index files out of the cache when it is redundant.
When all of this is done, the next thing to do is write a tool to test the cache characteristics on your processor by implementing sorting algorithms and testing their performance. Unless you are using a GPU (why?) all data your algorithm touches will go through every level of the cache hierarchy, forcing other data out. If possible, use a tool that reports hardware diagnostics. These tools may be provided by the processor vendor.
Now, there is a trend to give the programmer control over cache behavior
I don't know if this is worth exploring or a wild goose chase. It may improve performance for some tasks, but it sounds a little strange for the programmer to tell the computer how to use the cache...shouldn't the operating system do this?
Anyway, that's a start.
> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).
In this blog post, I don’t agree with:
> Almost every single major data-processing platform that has emerged in the last decade has been either open source.
That’s somewhat true by definition. OTOH, I also know most financial firms use proprietary solutions (which leverage open source components).