
Capacitor, BigQuery’s next-generation columnar storage format - fhoffa
https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
======
eva1984
This simple idea has helped me a lot in past projects.

The performance gain from columnar storage comes from the compression ratio.
By ordering similar attributes together, you greatly reduce the entropy
between rows, which in turn leads to higher compression and better
performance.

The trick is to smartly pick which column to have all the rows sorted on.

At my previous company, we were using Redshift to encode a billion rows, and
the simple change of sorting the table by user_id reduced the whole table
size by 50%: that's half a TB of disk storage. The improvement was nothing
short of jaw-dropping. I think Google here just turns this trick into a more
systematic method, which is really neat.

The takeaway: in a columnar storage system, take ordering into account. Try an
ordering that you think could maximize the redundancy between rows; usually
it's the primary id that is most representative of the underlying data. You
don't need a fancy system like this one to leverage this powerful idea; it
applies to all columnar systems.
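A quick way to see the effect: compress the same column serialized in arrival
order vs. after clustering rows by the sort key. A minimal sketch with
synthetic data, using zlib as a stand-in for whatever codec the store uses:

```python
import random
import zlib

random.seed(0)

# Hypothetical rows: (user_id, country) pairs arriving in random order.
rows = [(random.randrange(1000), random.choice(["US", "DE", "IN", "BR"]))
        for _ in range(100_000)]

def column_bytes(rows, col):
    # Serialize one column the way a columnar store lays it out: all the
    # values of that column, contiguously.
    return "\n".join(str(r[col]) for r in rows).encode()

unsorted_size = len(zlib.compress(column_bytes(rows, 0)))

sorted_rows = sorted(rows)  # cluster rows by user_id
sorted_size = len(zlib.compress(column_bytes(sorted_rows, 0)))

# Sorting creates long runs of identical/adjacent user_ids, so the sorted
# column compresses substantially smaller.
print(unsorted_size, sorted_size)
```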

~~~
mtanski
Sorting, especially if your data has some kind of natural sort order, can
increase compression (in columnar stores), decrease index size (bitmap or
fancy bitmap indexes), or lower query execution time (data is clustered vs.
random reads).

Here's one paper: Sorting improves word-aligned bitmap indexes,
[http://arxiv.org/abs/0901.3751](http://arxiv.org/abs/0901.3751)
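The bitmap-index effect is easy to demonstrate by counting runs in a
run-length-encoded bitmap before and after sorting. A rough sketch, with a
plain run count standing in for the word-aligned encoding the paper studies:

```python
import random
from itertools import groupby

random.seed(1)
values = [random.choice("ABCD") for _ in range(10_000)]

def bitmap(vals, v):
    # One bitmap per distinct value: bit i set iff row i equals v.
    return [x == v for x in vals]

def run_count(bits):
    # Number of runs in a run-length encoding; fewer runs => smaller index.
    return sum(1 for _ in groupby(bits))

unsorted_runs = sum(run_count(bitmap(values, v)) for v in "ABCD")
sorted_runs = sum(run_count(bitmap(sorted(values), v)) for v in "ABCD")

# After sorting, each bitmap is at most three runs (0s, 1s, 0s), so the
# total collapses from thousands of runs to about ten.
print(unsorted_runs, sorted_runs)
```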

------
kgp7
> doesn’t end here. BigQuery has background processes that constantly look at
> all the stored data and check if it can be optimized even further. Perhaps
> initially data was loaded in small chunks, and without seeing all the data,
> some decisions were not globally optimal. Or perhaps some parameters of the
> system have changed, and there are new opportunities for storage
> restructuring. Or perhaps, Capacitor models got more trained and tuned, and
> it is possible to enhance existing data. Whatever the case might be, when the
> system detects an opportunity to improve storage, it kickstarts data
> conversion tasks. These tasks do not compete with queries for resources,
> they run completely in parallel, and don’t degrade query performance. Once
> the new, optimized storage is complete, it atomically replaces old storage
> data — without interfering with running queries. Old data will be garbage-
> collected later.

I wonder if they could share more details on how this is handled.

~~~
jandrewrogers
There are a few ways to do it. It is not that difficult in principle: you need
to collect selectivity statistics for both writes and queries, and have a
storage engine that is flexible enough to rewrite layouts on the fly. The
mechanics are pretty simple, since rewriting a shard can be viewed as a
trivial subset of splitting or replicating a shard under load. Some closed
source databases also do this to one extent or another, adapting layout to
load.

You can do adaptive layout rewriting at either the page or shard level,
depending on the design. There are advantages and disadvantages to both
models. Some designs can do layout conversion in place without the need for
garbage collection but it is much trickier to do correctly.
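A minimal single-node sketch of the atomic-swap part, assuming immutable shard
files and POSIX rename semantics (all names here are hypothetical; this is
not BigQuery's actual mechanism):

```python
import os
import tempfile

def rewrite_shard(path, optimize):
    """Background task: rewrite an immutable shard file with a better layout,
    then atomically swap it in. Readers still holding the old file open are
    unaffected; on POSIX the old inode is reclaimed (garbage-collected) once
    the last reader closes it."""
    with open(path, "rb") as f:
        data = f.read()
    new_data = optimize(data)

    # Write the optimized layout to a temp file in the same directory so the
    # final rename stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(new_data)

    # Atomic on POSIX: concurrent queries see either the old shard or the
    # new one, never a partially written file.
    os.replace(tmp, path)
```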

------
polskibus
Slightly offtopic, but it's great to see that Mosha of MDX and SQL Server
Analysis Services fame (data warehouse from MS) is now part of Google's
BigQuery. His thorough blog posts full of technical details are a blessing
even after so many years.

------
michaelmior
I'm surprised they didn't give a justification as to why they couldn't just
adopt Parquet[0].

[0] [https://parquet.apache.org/](https://parquet.apache.org/)

~~~
int_handler
It is clearly stated in the article:

    When Google published the Dremel paper in 2010, it explained how this
    structure is preserved within column store.
    ...
    The definition and repetition levels encoding is so efficient for
    semistructured data that other open source columnar formats, such as
    Parquet, also adopted this technique.

Parquet is an open-source reimplementation of the columnar storage format
described in Google's 2010 Dremel paper. Capacitor is Google's next-generation
columnar storage format.
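For the curious, the repetition/definition-level encoding from the Dremel
paper can be sketched for a simple two-level schema: a record holding a
repeated group of repeated strings, so the maximum level is 2. This is an
illustrative toy, not Capacitor's or Parquet's actual code:

```python
def encode(records):
    """Encode one column as (value, repetition_level, definition_level)
    triples. repetition_level says at which repeated ancestor the value
    repeats (0 = new record); definition_level says how many levels of the
    path are actually present, so missing data needs no placeholder values."""
    out = []
    for rec in records:          # rec is a list of groups of strings
        if not rec:
            out.append((None, 0, 0))   # record exists, group absent
            continue
        r = 0                          # first value in a record: r = 0
        for group in rec:
            if not group:
                out.append((None, r, 1))  # group exists, value absent
            else:
                for v in group:
                    out.append((v, r, 2))
                    r = 2          # further values repeat at the inner level
            r = 1                  # next group repeats at the outer level
    return out

# Two records: [["a","b"],["c"]] and an empty one.
print(encode([[["a", "b"], ["c"]], []]))
```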

~~~
michaelmior
That answer isn't very clear to me, but thanks for pulling the relevant bits.

