
Ask HN: Is this a good fit for a column-oriented database? - meowface
I'm looking for a database which can fit this use case well:

- Able to store 100M+ items/records

- Everything stored in one table/collection, with 100+ columns/fields (probably)

- Many columns will have numerous duplicate values (often with a small handful of distinct values accounting for most of the data, and a long tail of rare values, e.g. most "favorite ice cream flavor" data clusters around vanilla, chocolate, and strawberry)

- Pre-defined schema, but each item will only have values for a subset of the columns (5 - 20 on average); everything else will be empty / null

- No need for a sophisticated or relational query language, but basic aggregation support is needed (ideally group by count, sum, and average)

- A very common query will be a "group by count" for a few columns

- Most queries will only select a few of the column values

- Regex support required in the query language

- Support for streaming ingestion is a plus

- There will be about 3-6x more reads than writes

- Potential for data loss is acceptable

- No need for an in-memory datastore

- Python driver support

I'd be fine with using PostgreSQL, but for this use case, I wonder if a column-oriented database like Cassandra might make more sense. Any recommendations?
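To make the core query pattern concrete, here is a minimal, engine-agnostic sketch of a regex-filtered "group by count" using Python's stdlib sqlite3 (table and column names are invented for illustration; SQLite itself would not be a serious candidate at 100M+ rows, this just shows the shape of the query):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has a REGEXP operator but no built-in implementation;
# register one. SQLite calls it as regexp(pattern, value).
conn.create_function(
    "REGEXP", 2,
    lambda pattern, value: value is not None
    and re.search(pattern, value) is not None,
)

# Hypothetical sparse table: most columns null for any given row.
conn.execute("CREATE TABLE items (flavor TEXT, topping TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("vanilla", "sprinkles"), ("vanilla", None), ("chocolate", "nuts"),
     ("strawberry", None), ("vanilla", "fudge")],
)

# The common query: group-by count, restricted by a regex on one column.
rows = conn.execute(
    "SELECT flavor, COUNT(*) FROM items "
    "WHERE flavor REGEXP '^(van|choc)' GROUP BY flavor ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('vanilla', 3), ('chocolate', 1)]
```

Any engine you shortlist should be able to express roughly this query; the differences are in how cheaply it scans one sparse, low-cardinality column out of 100+.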
======
nishantvyas
Things that would work in favor of a column store (like Cassandra, MariaDB
ColumnStore, or Redshift):

- Everything stored in one table/collection, with 100+ columns/fields

- Many columns will have numerous duplicate values (often with a small
handful of distinct values accounting for most of the data, and a long tail of
rare values, e.g. most "favorite ice cream flavor" data clusters around
vanilla, chocolate, and strawberry)

- Pre-defined schema, but each item will only have values for a subset of the
columns (5 - 20 on average); everything else will be empty / null

- No need for a sophisticated or relational query language, but basic
aggregation support is needed (ideally group by count, sum, and average)

- A very common query will be a "group by count" for a few columns

- Most queries will only select a few of the column values

- Support for streaming ingestion is a plus

- There will be about 3-6x more reads than writes

- Potential for data loss is acceptable

- No need for an in-memory datastore

Not so good for Cassandra:

- Regex support required in the query language

- You did not mention whether most queries will look up rows by primary key;
if not, that access pattern is an anti-pattern for Cassandra...
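If I remember correctly, CQL has no regex operator, GROUP BY is restricted to primary-key columns, and filtering on a non-key column requires ALLOW FILTERING, which forces a cluster-wide scan. A sketch of the anti-pattern (table and column names are hypothetical):

```sql
-- Hypothetical table; illustrates the non-key-access anti-pattern in CQL.
SELECT id, flavor
FROM items
WHERE flavor = 'vanilla'   -- flavor is not part of the primary key
ALLOW FILTERING;           -- required by CQL, and a performance red flag
```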

------
caymanjim
This is not a good use case for Postgres. You might want to look into one of
the schemaless solutions, like Mongo. If you're using a single table with lots
of duplicate and empty data, Postgres is going to waste space and processing
time. You're not going to use the features of, or gain any benefit from, an
RDBMS. You might even be able to get away with using Redis or a similar
key-value store, for even better performance.

What I really think you need to do is reconsider why you're storing data like
this. Anonymous columns, duplicate data, and excessive nulls are all big red
flags. If you're not storing relational data, don't use a relational DB. It
sounds to me like you might be cramming relational data into a junk bin for
what are likely ill-conceived reasons.

------
PeterZaitsev
Check out ClickHouse [https://clickhouse.yandex/](https://clickhouse.yandex/);
it looks like a very good fit for what you describe.
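A rough sketch of how the schema might look in ClickHouse (table and column names are invented for illustration): `LowCardinality(String)` dictionary-encodes columns dominated by a few distinct values, `Nullable` columns compress well when mostly empty, and `match()` provides regex filtering:

```sql
-- Hypothetical schema; names are illustrative only.
CREATE TABLE items
(
    id UInt64,
    -- Few distinct values dominate: dictionary-encode them.
    flavor  LowCardinality(Nullable(String)),
    topping LowCardinality(Nullable(String))
)
ENGINE = MergeTree
ORDER BY id;

-- The common "group by count" query, with a regex filter.
SELECT flavor, count() AS n
FROM items
WHERE match(flavor, '^(van|choc)')
GROUP BY flavor
ORDER BY n DESC;
```

Because ClickHouse only reads the columns a query touches, selecting a few columns out of 100+ stays cheap.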

------
stochastimus
If you want open source, I would look at Postgres and try out the cstore_fdw
foreign data wrapper.
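If memory serves, the setup looks roughly like this (the table definition is invented for illustration); columnar storage plus compression should help with the sparse, repetitive columns described above:

```sql
-- Assumes the cstore_fdw extension is installed on the server.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Hypothetical table; column names are illustrative.
CREATE FOREIGN TABLE items (
    flavor  text,
    topping text
)
SERVER cstore_server
OPTIONS (compression 'pglz');
```

Regular Postgres SQL, including the `~` regex operator and GROUP BY aggregates, works against the foreign table.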

