How should you build a high-performance column store for the 2020s? (lemire.me)



Most of these techniques are already in production:

Microsoft SQL Server has columnstore indexes, which can even be combined with its in-memory tables. MemSQL has been doing this for years, and v6 is incredibly fast; it also combines in-memory row stores. ClickHouse is very good if you don't mind more operations work. MariaDB has the ColumnStore storage engine, Postgres has the cstore_fdw extension. Vertica, Greenplum, Druid, etc. EventQL was an interesting project but is now abandoned.

AWS Redshift, Azure SQL Data Warehouse, Snowflake, and Google BigQuery are the hosted options, with BQ being the most advanced thanks to its vertical integration.

If you want to operationalize Apache Arrow today, Dremio is built around it and works similarly to Apache Drill and Spark to run distributed queries and joins across data sources.


Yandex's recently open-sourced ClickHouse[1] column store does some of these.

It heavily relies on compression, data locality and SIMD instructions and supports external dictionaries for lookup.

[1]: https://clickhouse.yandex/


I am always impressed with ClickHouse, especially when it holds up against massive data processing systems while running on a laptop: http://tech.marksblogg.com/benchmarks.html


HN discussion on ClickHouse from a few years ago: https://news.ycombinator.com/item?id=11908254


This already exists, in Google BigQuery. Uses darn near every trick in the book, and some that aren’t in the book. Source: shipped it.


It is frustrating that Google is always 5 to 10 years ahead of everyone else, but they never open source their back-end technologies (except recently with the ML stuff). The whole reason Hadoop exists is because Google only released whitepapers (which was good) but not code. I wonder whether Google really benefits from this strategy, given that they have to be an ecosystem instead of benefiting from being a part of one. I also wonder whether the industry is better off for having had to cooperatively reinvent the Google architecture. I doubt the Hadoop ecosystem would have arisen had Google code been at its heart.


Apache Beam, gRPC/protobuf, Kubernetes. There are examples besides TensorFlow.


Not code but they do publish papers that contain decent detail.


[Edit]: Why was the above comment flagged?

How much of BigQuery's performance do you think stems from Capacitor versus the rest of the system? For example, if you switched it out with Parquet but kept everything else (Colossus, Dremel, background reordering, metadata stored in Spanner, etc.), would it still be 10/30/50% worse, or would it be an order of magnitude worse?


We looked at Parquet early on and it wasn’t competitive even with what we were using at the time.

And yeah, this really depends not just on the dataset, but also on how selective your queries are, what predicates and aggregations they employ, etc. A significant percentage of queries gets orders of magnitude faster. I can’t disclose how much faster things got on average, but it was a significant gain, way more than would be sufficient to offset the increased cost of encoding (which is another aspect people typically don’t consider), even considering that much of the data people encode is hardly ever touched.


It would have to depend on the dataset, right?

For anyone who doesn't know what Capacitor is: https://cloud.google.com/blog/big-data/2016/04/inside-capaci...


> Why was the above comment flagged?

It appears that relatively new accounts that post here are automatically flagged. I've seen it before.

Usually if I click on their specific comment and vouch for them, they get unflagged, but it didn't work this time.

Lame.


BigQuery uses a lot of tricks to get efficiency, but this post emphasizes Apache Arrow and open data formats like it as the way forward (in particular the last point, "Be open, or else…"), which BigQuery does not currently support.

If Apache Arrow takes off I hope BigQuery will support it as a data interchange format in the future. Zero-copy is pretty awesome, as are open standards in general. This feature does not exist in BigQuery today (as far as I know - definitely not as discussed in the source).


One thing people commonly miss is that all of this is meaningless if you don't have the corresponding runtime integration; these techniques very much imply co-design, and therefore tight coupling, between the format and the runtime. To give a concrete example, to make any of this efficient and fast you must have predicate pushdown directly into the decoder, such that filters can skip the data they don't need to decode. Some aggregations can be handled the same way.

So it’s a little incorrect to think of this as a “file format” in the first place. If you end up designing it like that, you’d not be able to have a lot of the gains that the Abadi paper (and others like it) alludes to. My suggestion would be to go whole hog and push down as much filtering and aggregation in there as is feasible, exposing a higher level interface with _at least_ filtering predicate support, and do it in C++.
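To make that concrete, here is a minimal C sketch (my own illustration, not any particular engine's internals) of a predicate pushed into a run-length decoder: a count query evaluates the filter once per run and never materializes the rows.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* A run-length-encoded column: each run is (value, repeat count). */
    typedef struct { int32_t value; uint32_t length; } rle_run_t;

    /* Count rows matching the predicate without decompressing the column:
       the predicate is evaluated once per run, not once per row. */
    static uint64_t rle_count_where(const rle_run_t *runs, size_t nruns,
                                    int (*pred)(int32_t)) {
        uint64_t count = 0;
        for (size_t i = 0; i < nruns; i++)
            if (pred(runs[i].value))
                count += runs[i].length;
        return count;
    }

    static int is_positive(int32_t v) { return v > 0; }

    int main(void) {
        /* 1000 zeros, 250 sevens, 4000 zeros, 50 threes */
        rle_run_t col[] = {{0, 1000}, {7, 250}, {0, 4000}, {3, 50}};
        printf("%llu rows match\n",
               (unsigned long long)rle_count_where(col, 4, is_positive));
        return 0;
    }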


I think one interesting project in the near future could be to try and build a column-oriented storage engine that's "good enough" for both OLAP and OLTP workloads.

The main precedent here is Spanner's Ressi storage engine, which, according to the most recent paper [1], uses a PAX-like format (with blocks arranged row-oriented, but values within a block are column-oriented, so kind of like Parquet) for on-disk data, but combines it with a traditional log-structured merge tree for writes and point queries.

[1] https://static.googleusercontent.com/media/research.google.c...


"I think one interesting project in the near future could be to try and build a column-oriented storage engine that's "good enough" for both OLAP and OLTP workloads." SAP Hana is an example of a system that fits into this category. This isn't new either and has been existing since the 90s. Sybase IQ (which SAP acquired) was the first commercially successful columnar database. They have an in-memory row engine to handle OLTP. OLAP queries perform exceptionally well due to the column oriented nature of the storage. Customer deployments are in the 100s of TBs and low PBs these days. Blows most open source software in terms of performance if you are willing to shell out the $. Source: I work at SAP.


Oh, I don't mean a database frontend that can handle both OLTP and OLAP workloads, usually by having some kind of OLAP column-store and some kind of OLTP main memory row-store. I know there's a lot of those (not only HANA, but also MemSQL, SQL Server, etc.)

The fun thing to try and imagine here is having literally the same physical data format that works for both kinds of workloads.


You actually don't need to have the same storage data layout if you use a time series as a starting point, because you can maintain different data layouts in parallel and the time dimension permits strong consistency across them all.

If this is what you mean by a "database frontend", I am really confused as to why you object to this?

I think this property of time series is going to prove very important in the 2020s


Again, I don't care about distributed consistency here, nor is it mutually exclusive with what I'm talking about.

The question we're trying to answer is whether there exists at all a storage engine, at the scope of a single node (obviously generalizable/scalable), that can fit all use cases well. Once you figure out the answer to that, obviously having one storage engine is simpler than having two storage engines.


Parallel but consistent indexes could be done in a single process too. The tradeoff is natural reads at the cost of an extra constant factor of write computation and storage.

Which is exactly the tradeoff RDBMS indexes already make.


That's still not what I'm talking about. https://cs.brown.edu/~ugur/fits_all.pdf


thank you for that link !!


I wonder if the 2020s column store would outperform kdb, which was written in the 1990s with a UI from the 1950s.


Unlikely.

Two reasons are at the top of my mind:

1. The current best efforts in benchmarking are focusing on queries that "look" similar, and yet kdb is still 400x faster than Hadoop for those queries. For example:

    select avg size by sym,time.hh from trade where date=d,sym in S

    SELECT sym,HOUR(time),AVG(size) FROM trade
            NATURAL JOIN s WHERE date=d GROUP BY sym,HOUR(time);
To answer this question, the database has to read two or three columns across ten billion rows -- it's hard to be much faster than kdb: 10 billion rows completes in 70msec on kdb, but Hadoop takes something like 30 seconds.

The 2020 column store has to do a lot of work to even match kdb, but assuming it does that, and even ekes out a few extra percent of performance on these queries, there's another issue:

2. Most kdb programmers don't write this way.

Sure, some write their application in Java and send these queries over to the kdb "server", get the results, and do stuff with the results, etc., just like the application programmers that use Hadoop, but most kdb programmers don't. They just write their application in kdb.

That means that there isn't an extra second or two delay while this chunky result set is sent over to another process.

UDF/Stored Procedures/Foreign procedures are the rest of the world's solution for this problem, and they are massively under-utilised: Tooling like version control and testing of stored procedures just doesn't work as well, and I don't see any suggestion that's going to change in the next decade or so.


kdb also allows you to do much more than SQL, since a select is more or less just syntactic sugar for a certain set of operations on columns as arrays. The full language is available in the context of the query, or you could just treat your table as a bunch of arrays in the context of a larger program.
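For a rough illustration in C (purely my own sketch; real kdb columns are memory-mapped and vectorized), a simplified "select avg size by sym from trade" is just one pass over two parallel column arrays:

    #include <stdio.h>

    int main(void) {
        /* the trade "table" is just parallel column arrays;
           sym is dictionary-encoded as small integers */
        enum { NSYM = 3 };
        int    sym[]  = {0, 1, 0, 2, 1, 0};
        double size[] = {100, 200, 300, 50, 150, 600};
        int    nrows  = sizeof sym / sizeof sym[0];

        double sum[NSYM] = {0};
        int    cnt[NSYM] = {0};
        for (int i = 0; i < nrows; i++) {   /* one tight pass over two columns */
            sum[sym[i]] += size[i];
            cnt[sym[i]] += 1;
        }
        for (int s = 0; s < NSYM; s++)
            printf("sym %d: avg size %.1f\n", s, sum[s] / cnt[s]);
        return 0;
    }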

It's a really elegant way of dealing with large amounts of data, although the downside is that you've typically got to build a lot of the nice-to-have dbms type infrastructure yourself.

It would be interesting if BigQuery or Redshift ever figure out that they could have a much more powerful system if they stuck an array language on the front of their storage engines.


Yes! "code/data locality" is key for real apps.

Everyone has seen "numbers every programmer should know" < https://gist.github.com/jboner/2841832 >. If you're going to do complex data analysis, e.g. machine learning, you want your data access latency to be on the short end of this chart :) When you end up on the long end (like in RDBMS), this is known as the N+1 problem.

But modern-sized data doesn't fit into memory. Distributed systems necessarily add latency, and to fix that we add caching, which hurts consistency. I blew up this thread further down about how Datomic's core idea is to provide consistent data to your code, which is the opposite of how most DBMSs (including kdb) make you bring the code into the database.


> it's hard to be much faster than kdb: 10 billion rows completes in 70msec on kdb, but Hadoop takes something like 30 seconds.

Indeed, it's hard for exhaustive search to beat index lookup.


> it's hard for exhaustive search to beat index lookup.

kdb isn't using any indexes for this query.


I talked to someone recently who ran kdb using an SSD. Is that the standard approach?


Whatever works, as much and as fast as you can afford. The kdb disk game is more about serial transfer rates and quantity.


The datastore of the 2020s will be designed around an immutable log, because it permits both strong consistency and horizontal scaling (like git).

Once you're both distributed and consistent, the problems today's stores are architected around go away. Your distributed queries can index the immutable log however they like: column-oriented, row-oriented, documents, time-oriented, graphs. Immutability means you can do all of it, as a library in your application process.

http://www.datomic.com/ - it's what you get when Facebook's graph datastore has a baby with immutability.


You can't just say "immutable log" and then be done. You certainly don't want to have just one immutable log, because then unrelated operations, for example to different parts of a key space, have to "see" each other. If you go the route of Datomic, your writes can't outpace the one CPU that processes them. (Correct me if I'm wrong, I'm just reading its documentation.) Git, with a DAG history, is just eventual consistency.


In RDBMS, when you shard, read shards and write shards are in lock-step, which is the whole problem with sharding. In Datomic (and in git), by sharding writes, it doesn't really impact reads.

This is interesting, because consider a large system like Facebook. Transactions naturally fall within certain boundaries. You never transact to Events, Photos, and Instagram all at once - from the write side, they don't have to share the same single-writer process delivering ACID.

You do however, on the read side, need to have fluid queries across them all, as if they were one database. RDBMS can't do that, but Datomic can, and Git can too - consider submodules. Immutability is what makes it possible to shard like this without sacrificing query expressiveness, strong consistency or ACID (like every other distributed system that isn't accumulate only)


I was under the impression, based on its docs, that Datomic only supports processing transactions serially through a single transactor.


Not surprisingly, it also says it's not designed for high throughput write workloads, the topic of this blog post.


Both what you write and what I wrote are true.

To scale writes you shard writes. This generally means running multiple databases (in any DBMS)

The key insight is that Datomic can cross-query N databases as a first class concept, like a triple store. You can write a sophisticated relational query against Events and Photos and Instagram, as if they are one database (when in fact they are not).

This works because Datomic reads are distributed. You can load some triples from Events, and some triples from Photos, and then once you've got all the triples together in the same place you can do your queries as if they are all the same database. (Except Datomic is a 5-store with a time dimension, not a triple store.)

In this way, a Datomic system-of-systems can scale writes, you have regional ACID instead of global ACID, with little impact to the programming model of the read-side, because reads were already distributed in the first place.

For an example of the type of problems Datomic doesn't have, see the OP paragraph "To sort, or not to sort?" - Datomic maintains multiple indexes sorted in multiple ways (e.g. a column index, a row index, a value index). You don't have to choose. Without immutability, you have to choose.


What is the primary reason people choose Datomic? From reading the website, I get the impression that the append-only nature and time-travel features are a major selling point, but in other places it's just the Datalog and Clojure interfaces. I'm sure it's a mix, but what brings people to the system to begin with?


Datomic is my default choice for any data processing software (which is almost everything).

Immutability in-and-of-itself is not the selling point. Consider why everyone moved from CVS/SVN to Git/dvcs. The number of people who moved to git then and said "man I really wish I could go back to SVN" is approximately zero. Immutability isn't why. Git is just better at everything, that's why. Immutability is the "how".

I don't see why I would use an RDBMS to store data ever again. It's not like I woke up one day and started architecting all my applications around time travel†. It's that RDBMS comes with a lot of accidental complexity: ORM, N+1 problems (batching vs caching), poorly scaling joins, pressure to denormalize to stay fast, eventual consistency at scale...

Datomic's pitch is it makes all these problems go away. Immutability is simply the how. Welcome to the 2020s.

† Actually I did; my startup is http://hyperfiddle.net/


Thanks for the explanation, and I definitely agree with all your points. The reason I ask is that over the past year or so, we have been hacking on a little side project that maps a copy-on-write tree to a distributed shared log (https://nwat.io/blog/2016/08/02/introduction-to-the-zlog-tra...). This design effectively produces an append-only database with transactions (similar to the rocksdb interface), and means that you can have full read scalability for any past database snapshot. We don't have any query language running on top, but have been looking for interesting use cases.


A lot of stuff in those links that I'm not familiar with. I'd expect challenges around what will happen when the log doesn't fit in memory? What if even the index you need doesn't fit in memory? If I understand, zlog is key-value-time, so I'm not sure what type of queries are interesting on that. Datomic is essentially a triple store with a time dimension so it can do what triple stores do. What do you think the use cases are for a key-value-time store?


The log is mapped onto a distributed storage system, so fitting it in memory isn't a concern, though effective caching is important. The index / database also doesn't need to fit into memory. We cap memory utilization and resolve pointers within the index down into the storage system. Again, caching is important.

If I understand Datomic correctly, I can think of the time dimension in the triple store as a sequence of transactions each of which produces a new state. How that maps onto real storage is flexible (Datomic supports a number of backends which don't explicitly support the notion of immutable database).

So what would a key-value-time store be useful for? Conceptually it seems that if Datomic triples are mapped onto the key-value database then time in Datomic becomes transactions over the log. So one area of interest is as a database backend for Datomic that is physically designed to be immutable. There is a lot of hand waving, and Datomic has a large number of optimizations. Thanks for answering some of those questions about Datomic. It's a really fascinating system.


My questions were in the context of effective caching in the query process, sorry that wasn't clear. A process on a single machine somewhere has to get enough of the indexes and data into memory to answer questions about it. Though I guess there are other types of queries you might want to do, like streamy mapreduce type stuff that doesn't need the whole log.

I will have to think about whether a key-value-time storage backend would have any implications for Datomic. One thing Datomic Pro doesn't have is low-cost database clones through structure sharing, and it's not clear to me why.

Here is a description of some storage problems that arise with large Datomic databases, though it's not clear to me if ZLog can help with this < http://www.dustingetz.com/ezpyZXF1ZXN0LXBhcmFtcyB7OmVudGl0eS..., >


> A process on a single machine somewhere has to get enough of the indexes and data into memory to answer questions about it.

This common challenge is perhaps magnified in systems that have deep storage indexes. For example, the link you posted seems to suggest that queries in Datomic may have cache misses that require pointer chasing down into storage, adding latency. Depending on where that latency cost is eaten, it could have a lot of side effects (e.g. long running query vs reducing transaction throughput).

This problem is likely exacerbated in the key-value store running on zlog because the red-black tree can become quite tall. Aggressive, optimistic caching on the db nodes, and server-side pointer chasing helps. It's definitely an issue.

I don't know anything about Datomic internals, so disclaimer: I'm only speculating about why Datomic doesn't have low-cost clones. My guess is that the underlying storage solution isn't inherently copy-on-write; that is, Datomic emulates it, so when a clone is made there isn't an easy metadata change that creates the logical clone.

Despite the lack of features etc in the kv-store running on zlog, database snapshots and clones are both the same cost of updating the root pointer of the tree.


Ah, I have looked into Datomic a bit but didn't realize this, though it makes sense in retrospect.


You're confusing OLTP and OLAP, and it isn't even really relevant to the topic at hand. Column stores are mostly about how to layout the physical representation of data in a single node for read-mostly workloads on wide tables with selective filters and projections. The discussion of how distributed consistency may work is irrelevant here.


All datastores already have WAL logging which is effectively the same, and commonly used for replication, changefeeds and other downstream consumers. Saving the entire history (with compaction) and some CQRS patterns is nothing new.

At any decent scale, most companies now just use a dedicated log like Kafka or Pulsar as the main backbone to support more flexibility in producers and consumers. Either way, none of this has to do with column-stores as the actual representation of data.


It's definitely not new, but it is innovative. Kafka can totally be an implementation detail of a system like what we are discussing. Once you're immutable, we're no longer constrained to a single "actual representation of data"; you can maintain many in parallel, so long as there is a way to keep the representations consistent (that time dimension is really important!)

CQRS has the right fundamental constituents but puts them together in the wrong way, I think. The command abstraction is in the application layer (we're talking about stores not apps) and the properties of the read-side are fixed (e.g. decisions about document-orientation, column- or row- are coded in advance). But those same parts can be used to make something more flexible, that lets the properties of the read-side be less fixed.

Datomic maintains multiple builtin indexes to support several query styles (so multiple parallel "representations of data") < http://docs.datomic.com/indexes.html >, so Datomic has native support for querying in the shape of: documents, rows, columns, time, values. The storage is actually represented all those ways in parallel copies. (and yet the time dimension keeps us linearized and strongly consistent, like git!)

More interesting than the builtin indexes though, is that you can conceptually implement your own index, since immutability lets you distribute/cache the data in your application processes, the query engine is actually a library running in your query process. (Datomic Datalog is literally a jar file running in your elastic app processes and it works on vanilla JVM data structures)

This is called "code/data locality" and it's extremely powerful. You don't need to go fork the database codebase to add a new type of index, like people had to do to add a geospatial index to a fork of Postgres. You can incrementally maintain your own indexes. You can ignore Datalog and implement your own query functions to query your special index. Or you can seamlessly compose your own query functions inside Datalog queries; you can pass your index as an input to a Datalog query. Here's a snippet of what that looks like: https://i.imgur.com/GJuTkJR.png


How does datomic play with encryption? Say I wanted things encrypted at rest, but still queryable?


That's a question for the mailing list, but IIRC the new product "Datomic Cloud" (aws-native rewrite) does this out of the box. Datomic Cloud isn't out yet but Cognitect says 2017 Q4. https://www.youtube.com/watch?v=Ljvhjei3tWU


Datomic honestly sounds amazing based on everything I've read and heard about it. I wish it wasn't completely proprietary though. Even an open-core model would make it a much more viable option.


Counters are the Achilles' heel of immutable-log-based DBs.


If I had to guess new capabilities chips will add in the 2020s, hardware-accelerated compression or compact encoding are near the top of the list. That could be anything from branch-free instructions to read/write varints to fully-self-contained (un)packer(s) you just point at some data and run. I'm most interested in something so fast as to be worth considering to replace fast software algos or to use in places we don't think about compressing at all now, though hardware accelerated zlib would obviously have applications too.

Some existing stabs in this direction include that some Samsung SoCs had a simple "MComp" memory compressor (https://github.com/XileForce/Vindicator-S6-Unified/blob/mast...), that the new Qualcomm Centriq chips use compression (https://3s81si1s5ygj3mzby34dq6qf-wpengine.netdna-ssl.com/wp-...), and that some IBM POWER CPUs have dedicated circuitry for memory compression (http://ibmsystemsmag.com/aix/administrator/lpar/ame-intro/). There's also hardware zlib offload, like Intel QuickAssist.

I'd expect more of this in the future because 1) space-is-speed is just a fundamental thing we deal with in computing, 2) chips keep getting faster relative to RAM, 3) you already see lots of special-purpose instructions being added (crypto, strings, fancier vector stuff...) as it gets more expensive to improve general-purpose IPC. Maybe there's some additional value given the arrival of 3D XPoint and (probably) other future NVRAMs--it would help you fit more on them without spending more time compressing than writing--but regardless, the trends seem to point to compression assists being interesting.

One reason I could turn out wrong is if the general-purpose facilities we have make software the best place to write compressors anyway, i.e., fast software packers get so good it becomes difficult to justify hardware assists. General-purpose low- and medium-compression algos like LZ4 and Zstd run pretty fast already, and we have even faster memory compressors (WKdm, Density). Of course, that's on big Intel cores; maybe special-purpose compressor hardware will continue to mostly be more interesting alongside smaller cores.


Pilosa, https://github.com/pilosa/pilosa which is mentioned, is actually open source, and a relatively readable Go codebase if anyone is interested in what "an entire data engine on top of bit-vectors" looks like.


(man, I'd love to go work on this for three years, without worrying about a "customer" or "backwards compatibility")

> That is, if you have N distinct values, you can store them using ceil(log(N)/log(2)) bits

Ideally you don't need to apply the ceil per value: if you have a low number of distinct values, like 5, it looks like you need 3 bits per value, but you can get it down to 2.4 bits (just pack 10 values into 24 bits instead of 30).
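A small C sketch of that, under the assumption of a 5-symbol alphabet (the names are mine, just for illustration): ten base-5 digits fit in 24 bits via Horner's rule, because 5^10 = 9,765,625 < 2^24.

    #include <stdint.h>
    #include <stdio.h>
    #include <assert.h>

    /* Pack ten values from a 5-symbol alphabet into 24 bits (2.4 bits/value)
       by treating them as base-5 digits, instead of the 30 bits that
       ceil(log2(5)) = 3-bit fields would need. */
    static uint32_t pack_base5(const uint8_t vals[10]) {
        uint32_t packed = 0;
        for (int i = 0; i < 10; i++) {
            assert(vals[i] < 5);
            packed = packed * 5 + vals[i];   /* Horner's rule in base 5 */
        }
        return packed;                       /* at most 5^10 - 1 = 9765624 < 2^24 */
    }

    static void unpack_base5(uint32_t packed, uint8_t out[10]) {
        for (int i = 9; i >= 0; i--) {       /* peel digits off, least significant first */
            out[i] = packed % 5;
            packed /= 5;
        }
    }

    int main(void) {
        uint8_t in[10] = {4, 0, 3, 2, 1, 1, 0, 4, 2, 3}, out[10];
        uint32_t p = pack_base5(in);
        unpack_base5(p, out);
        for (int i = 0; i < 10; i++) assert(in[i] == out[i]);
        printf("packed value %u fits in 24 bits: %s\n", p, p < (1u << 24) ? "yes" : "no");
        return 0;
    }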

Getting distinct and repeated values by tearing apart data so that you can use these algorithms is something for which I could use some papers to refer to.

For instance, here's[1] what we're trying to do with Double encoding loops, but it still suffers from the problems of a car moving from 0.3 -> 0.2 location.

[1] - http://bit.ly/2zt70iL


Sure comes across as arrogant for Prof Abadi to remark:

> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression).

In this blog post, I don’t agree with:

> Almost every single major data-processing platform that has emerged in the last decade has been either open source.

That’s somewhat true by definition. OTOH, I also know most financial firms use proprietary solutions (which leverage open source components).


Are column-store databases relevant on SSD/NVME?

I ask because on a physical medium like hard disk, storing data on physical disk in column orientation can make a significant improvement to read operations.

But with SSD/NVME, you don’t have to worry about the inherent slowness of physical platters that exist in hard disk.


Columnar stores are as much about the compression benefits as the physical layout on disk. The article goes through a bunch of different relevant compression strategies.


> Are column-store databases relevant on SSD/NVME?

Yes. SSD generally has lower latency for responses.

While sending data (throughput) is similar, the operating system doesn't ask for all of the blocks of a file at once -- even if you read(fd,buf,size) -- because other processes might ask for other blocks in the meantime. An IO scheduler is making decisions, and lower latency helps turn those decisions around faster.

> I ask because on a physical medium like hard disk, storing data on physical disk in column orientation can make a significant improvement to read operations.

I'm not really sure this is true. Hard drives have (for over a decade, probably longer) had logic that lies about the physical layout of the disk, to the point where all I can believe about the linear block address is that the circuitry "believes" the likelihood that software will ask for the next linear block address is higher than for any other one.

Further to that, I imagine SSDs can probably make similar optimisations.

The reason column-orientation helps is that it reduces the volume of data that needs reading. If you have a table with 100 columns in it, but a query that operates on 2, then a column-oriented database needs to read 2 things, while the row-oriented database either reads 100 things, or it interleaves reads of 2 things with skips of 98 things.

It isn't difficult to believe that the circuitry needed to handle the former will outpace the circuitry needed to handle the latter for a long time.


Hard drives exhibit the straightforward performance that you'd expect from block addresses linearly increasing around the circle according to all the experiments I've done.


Even SSDs are super slow compared to RAM. When you want to read a few bytes from millions of rows, an SSD has to decode an entire block of data for every read.

Also, even with NVMe SSDs, there is a lot of operating system overhead associated with every read. Having layers of drivers to orchestrate the transfer of the 2 bytes of data you wanted really slows it down...


Possibly naive question, but isn't an index (in a classical relational database) the same as a column store?


An index stores pointers to rows based on the column value; the values are still stored as rows though.

So when you query on an indexed column, you'll have contiguous access on the index, but the rows themselves may be stored on disk based on a different column (so you could get the row pointers for a range query in one go from the index, but fetching the rows would be random lookups). But if you want to view an entire row, it's trivial, because the full row data is contiguous on disk.

A column store groups the table values by column, so the values of col A will be contiguous and the values of col B will be contiguous (but not pairwise!), but if you want to view the entire row, you'll probably have to do 2 lookups in random locations. But a range query on col A, selecting only col A, becomes a trivial fetch.

That's my understanding, anyway.
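In C terms, a hedged and very simplified sketch of the two physical layouts (the names and structs are mine, just for illustration):

    #include <stdio.h>
    #include <stddef.h>

    #define NROWS 100000

    /* Row store: whole rows are contiguous, so fetching one row is one read,
       but scanning a single column drags every other field through the cache. */
    typedef struct { int id; char name[16]; double price; } row_t;
    row_t heap[NROWS];

    /* A classic index: (key, row position) pairs kept sorted by key. It gives
       contiguous access to the keys, but the rows it points at still live
       wherever they happen to sit in the heap. */
    typedef struct { double price; size_t row; } price_index_entry_t;
    price_index_entry_t price_index[NROWS];

    /* Column store: each column is contiguous on its own, so a scan of price
       touches only price data, but rebuilding row i needs one lookup per column. */
    int    col_id[NROWS];
    char   col_name[NROWS][16];
    double col_price[NROWS];

    /* Scanning the price column reads 8 bytes per row; the same scan over
       heap[] would pull ~32 bytes per row through the cache, roughly 4x more. */
    double total_price_columnar(void) {
        double total = 0;
        for (size_t i = 0; i < NROWS; i++)
            total += col_price[i];
        return total;
    }

    int main(void) {
        printf("total price: %f\n", total_price_columnar());
        return 0;
    }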


An index for a column store and a row store is (conceptually) the same. What is different is:

1. How data is represented. A database is a collection of column objects. For example, it is easy to create a column or delete a column.

2. How data is processed (queried). The engine processes columns as objects managed by the system


Not Invented Here, huh?


put data in text files, ASCII printable characters, one data point per line

put data files in directory

name data files after columns

use ".data" filename extension for data files

write a tool to create index files (append ".index" to the name of the input text file) that map record number to byte offset in data file

If data files are all < 4GB, use a 32 bit unsigned integer to represent the byte offset in the index file

Each index file is a packed array of 32 bit integers

Write a tool to create length files ".length" that count the number of entries in a data file

Generate .length files for all data files

Use mmap to access index files

Use C for all of the above

This is for variable-length data values. Not every column will have these; for fixed-length columns the .index files are redundant and should not be created, and the program logic should support both uniform-length and nonuniform-length access. The reason to prefer two access modes is to keep redundant .index data out of the cache.
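Here is a rough C sketch of the index-building tool under those assumptions (newline-delimited values, data files under 4 GB, packed 32-bit offsets); error handling and the .length tool are left out:

    /* index_build.c: for each data file given on the command line, write a
       matching ".index" file: a packed array of uint32 byte offsets, one per
       newline-delimited record. Assumes files < 4 GB, as described above. */
    #include <stdio.h>
    #include <stdint.h>

    static int build_index(const char *data_path) {
        char index_path[4096];
        snprintf(index_path, sizeof index_path, "%s.index", data_path);

        FILE *in  = fopen(data_path, "rb");
        FILE *out = fopen(index_path, "wb");
        if (!in || !out) return -1;

        uint32_t pos = 0, start = 0;       /* start = offset of the current record */
        unsigned long nrecords = 0;
        int c;
        while ((c = fgetc(in)) != EOF) {
            if (pos == start) {            /* first byte of a record: record its offset */
                fwrite(&start, sizeof start, 1, out);
                nrecords++;
            }
            pos++;
            if (c == '\n') start = pos;    /* next record begins after the newline */
        }
        fclose(in);
        fclose(out);
        printf("%s: %lu records\n", data_path, nrecords);
        return 0;
    }

    int main(int argc, char **argv) {
        for (int i = 1; i < argc; i++)
            if (build_index(argv[i]) != 0)
                fprintf(stderr, "failed: %s\n", argv[i]);
        return 0;
    }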

When all of this is done, the next thing to do is write a tool to test the cache characteristics on your processor by implementing sorting algorithms and testing their performance. Unless you are using a GPU (why?) all data your algorithm touches will go through every level of the cache hierarchy, forcing other data out. If possible, use a tool that reports hardware diagnostics. These tools may be provided by the processor vendor.

Now, there is a trend to give the programmer control over cache behavior

https://stackoverflow.com/questions/9544094/how-to-mark-some...

I don't know if this is worth exploring or a wild goose chase. It may improve performance for some tasks, but it sounds a little strange for the programmer to tell the computer how to use the cache...shouldn't the operating system do this?

Anyway, that's a start.


This sounds almost identical to the datastore honeycomb.io built and describes in this talk: https://www.youtube.com/watch?v=tr2KcekX2kk


I think incorporating a blockchain element could prove an interesting way to implement this in practice.



