
Ask HN: How to store billions of rows of time-series data? - Kkoala
I'm trying to build a system to store price data. According to my calculations, I'm going to be inserting about 20 million rows every day. This will quickly add up to billions of rows.

The queries will be very ad-hoc, mostly things like histograms, and grouping by certain attributes. It's also possible that some of the products will rarely/never be queried, not sure if it makes a difference. Read performance is still more important.

Is something like TimescaleDB a good choice?

I would also like to not spend a fortune on this (couple hundred a month at maximum). So is this even possible with the cost constraint?
======
valyala
Take a look at VictoriaMetrics. It supports an OHLC function (rollup_candlestick)
via MetricsQL [1], which is backwards compatible with PromQL, the query language
used in Prometheus. PromQL is optimized for typical queries over time-series
data [2].

VictoriaMetrics works out of the box without the need to set up third-party
systems or write complex configs. It is also very fast for both data
ingestion and querying [3], [4].
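For instance, assuming the prices are ingested as a metric named `price` (a hypothetical name), daily OHLC candlesticks are a one-liner in MetricsQL:

```
rollup_candlestick(price[1d])
```

This returns open/high/low/close values per series over each 1-day window.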

[1]
[https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Metr...](https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/MetricsQL)

[2] [https://medium.com/@valyala/promql-tutorial-for-beginners-9a...](https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085)

[3] [https://medium.com/@valyala/measuring-vertical-scalability-f...](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae)

[4] [https://medium.com/@valyala/billy-how-victoriametrics-deals-...](https://medium.com/@valyala/billy-how-victoriametrics-deals-with-more-than-500-billion-rows-e82ff8f725da)

------
nknealk
We use Kinesis Firehose and Athena as a SQL API for customer-interaction
analytics on our digital properties (among other things). Data sizes are
upwards of 10M rows per day. Athena will blow through months of data pretty
quickly at very reasonable cost.

Some gotchas: AWS lock-in, especially since S3 is your storage back end.
Kinesis Firehose writes files by insert date, and you may need ETLs to
partition by some other time dimension inherent to the data. This gets tricky
if records arrive really late, and you may have to decide to drop records
older than N days to make such an ETL work at reasonable cost.

Some pros: easy to POC. Pure variable-cost architecture. Serverless (also a
downside, because if any of the services go down, so do you) with little
administrative work. It can get expensive if you don't think through
partitioning for analytics, but Kinesis Firehose can be configured to write
Parquet to S3 natively, allowing you to use existing Spark/Hive/whatever if
your org has that as its SQL layer.
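To put rough numbers on the cost side (all figures below are assumptions, not measurements; Athena's on-demand price is about $5 per TB scanned as of writing), a sketch of what date partitioning saves at the OP's volume:

```python
# Back-of-the-envelope Athena scan-cost model; every constant is an assumption.
ROWS_PER_DAY = 20_000_000
BYTES_PER_ROW = 10       # guessed compressed Parquet footprint per row
USD_PER_TB = 5.0         # Athena on-demand price per TB scanned

def query_cost_usd(days_scanned: int) -> float:
    """Cost of one query that scans `days_scanned` days of Parquet data."""
    return ROWS_PER_DAY * BYTES_PER_ROW * days_scanned / 1e12 * USD_PER_TB

# An unpartitioned query over 2 years scans everything; a date-partitioned
# query touching one week scans ~100x less:
two_years = query_cost_usd(730)   # ~$0.73 per query
one_week = query_cost_usd(7)      # ~$0.007 per query
```

Per query the dollars are tiny either way; it only starts to matter once many ad-hoc queries run daily.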

Edit: come to think of it, there is another division in my company using
BigQuery with whatever Google's streaming ingestion product is. They're
achieving similar results at similar cost. I haven't personally done
development in GCP, though, so I can't comment on it.

------
gwittel
A time-series DB may work fine. 20 million rows per day isn't bad, unless you
expect huge spikes in a short duration.

What do your rows look like? What time ranges do you expect to query? Does
your data arrive in a stream or large batches? This will help determine a
rough IO and storage range needed (or at least worst case).

At this scale, and going for low cost, you have some trade-offs that could
reduce costs and improve performance. I would try to keep it simple and test
something like Timescale with dummy data first. Chunks of Parquet or ORC data
plus Presto/Athena may also work, depending on your workload.

That aside, other things may make this even easier for you. Can you reduce the
problem at all (e.g. reduce the number of rows you need)?

For example:

What is the minimum time period granularity you need for a single datapoint?
Lower res = fewer rows.

Can you aggregate upon ingest?

For distributions, can you use approximations (e.g. t-digest) or do you need
exact values?
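To illustrate the aggregate-on-ingest idea, here's a minimal sketch (field names and the 1-minute granularity are assumptions) that collapses a raw tick stream into per-minute min/max/last rows before anything hits the database:

```python
def aggregate_minutes(ticks):
    """Collapse (timestamp_sec, product_id, price) ticks into 1-minute buckets.

    Returns {(minute, product_id): (min, max, last)} -- far fewer rows to store.
    """
    buckets = {}
    for ts, pid, price in sorted(ticks):  # sort by time so "last" is well-defined
        key = (ts // 60, pid)
        if key in buckets:
            lo, hi, _ = buckets[key]
            buckets[key] = (min(lo, price), max(hi, price), price)
        else:
            buckets[key] = (price, price, price)
    return buckets

# Four ticks for one product collapse into two stored rows:
rows = aggregate_minutes([(0, 1, 10), (30, 1, 5), (59, 1, 7), (61, 1, 9)])
# rows == {(0, 1): (5, 10, 7), (1, 1): (9, 9, 9)}
```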

~~~
Kkoala
Rows will be at least (id, other_id, price_1, price_2); the prices should be
32-bit integers, id should be a 64-bit integer or a string, and other_id can
be a 32-bit int. id and other_id reference other entities. If normalization is
not possible, then each row would also carry a bunch of names and other
metadata: 6 more strings at minimum, but more like 10 strings or so.

The data will be a stream basically.

Granularity is quite important, but I think older data could be made less
granular and turned into approximations.
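For what it's worth, the normalized version of that row is tiny; a quick sanity check (ignoring a timestamp column and per-row storage overhead, and assuming the 64-bit id):

```python
ROWS_PER_DAY = 20_000_000
BYTES_PER_ROW = 8 + 4 + 4 + 4    # id (64-bit) + other_id + price_1 + price_2

bytes_per_day = ROWS_PER_DAY * BYTES_PER_ROW    # 400 MB of raw data per day
bytes_per_year = bytes_per_day * 365            # ~146 GB of raw data per year
```

So even before columnar compression, a year of normalized data fits comfortably on a single node; the denormalized string metadata is what would blow this up.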

------
seektable
Yandex ClickHouse is one more free/open-source solution that can be considered
for this purpose. Setup/maintenance may be a bit tricky, but you'll be
surprised how fast CH executes your ad-hoc queries (SQL!) over billions of
rows.
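As a rough sketch of what that could look like in ClickHouse (table and column names are made up to match the OP's schema, not a tuned design):

```sql
CREATE TABLE prices (
    ts       DateTime,
    id       UInt64,
    other_id UInt32,
    price_1  Int32,
    price_2  Int32
) ENGINE = MergeTree()
ORDER BY (id, ts);

-- ClickHouse even has a built-in histogram aggregate for ad-hoc distributions:
SELECT histogram(20)(price_1) FROM prices WHERE other_id = 42;
```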

~~~
Kkoala
Interesting, haven't heard of ClickHouse before. I will check it out, thanks.

------
mfreed
(Timescale person here.)

Many of our users store many billions of rows, at 10s of millions per day.
Native compression will give you significant storage savings, plus very
flexible options for indexing to support various queries. And if some things
are commonly queried, it's easy to build automated continuous rollups (e.g.,
continuous rollup to OHLCV candlesticks).

You can deploy either self-managed or on Timescale Cloud
(timescale.com/cloud), where you can find deployments within $100s per month.

Cheers!
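As a hedged sketch of that rollup (table and column names are hypothetical; `time_bucket`, `first`, and `last` are TimescaleDB functions), a continuous rollup to daily OHLC looks roughly like:

```sql
CREATE MATERIALIZED VIEW daily_candles
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS day,
       id,
       first(price_1, ts) AS open,
       max(price_1)       AS high,
       min(price_1)       AS low,
       last(price_1, ts)  AS close
FROM prices
GROUP BY day, id;
```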

------
GauntletWizard
Yes, it sounds like what you want is a time-series DB. Prometheus can handle
tens of millions of rows and billions of datapoints with ease, and its query
language is designed for that kind of grouping via set arithmetic.

------
icsa
Try the 32-bit evaluation version of kdb+
([https://kx.com/](https://kx.com/)). Twenty million rows of data per day is
easy with kdb+ - even on a laptop.

~~~
anonu
That's not a great suggestion given the constraints. kdb+ costs way more than
"a couple hundred per month". Also, a billion rows quickly falls outside the
32-bit addressable range, given that most of the calcs happen in memory.

What's wrong with PostgreSQL in this context? The OP doesn't mention any
speed, latency, or bandwidth requirements...

~~~
icsa
I have built many billion+ row databases on a laptop with 16 GB RAM, using the
32-bit version. kdb+ uses memory mapping and can support databases _much_
larger than available RAM.

PostgreSQL is a fine choice and I use it for many projects. However, kdb+ is a
much better choice when time-series analysis is required.

I'm happy to provide more details if you're interested.

------
sman393
Depending on what your reporting needs are, I would definitely recommend
TimescaleDB. I love Prometheus for storing server resource metrics or basic
app metrics (counts/timers), but for more business-facing data, having the
query functionality and visualization tooling of Postgres is a major win. I am
working on a similar use case to yours; we considered Cassandra, Timescale,
and ClickHouse, and with our insert rates and reporting needs, we decided to
start with TimescaleDB.

------
runT1ME
My current gig is using Scylla for storing time-series data; we're
partitioning by day and ingesting anywhere from 5-15 million rows per day.

We've mostly been using Apache Beam/Cloud Dataflow for working with the data
en masse for occasional reports, and we're planning on trying Presto soon as
well. I'm interested in what solution you come up with; happy to chat more
about this kind of thing on Twitter or email.

~~~
PeterCorless
Thanks for the shout-out! Given that the OP was mentioning only having a
budget in the hundreds per month, this seems an appropriate time to cite our
most recent blog: [https://www.scylladb.com/2020/04/16/ifood-relies-on-scylla-t...](https://www.scylladb.com/2020/04/16/ifood-relies-on-scylla-to-deliver-over-100-million-events-a-month-to-restaurants/)

------
jll29
Google BigQuery is your friend.

~~~
Kkoala
I have heard it's quite expensive. The storage itself seems quite cheap, but I
think the queries also cost, so that's where the costs come from.
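That split is easy to sanity-check with rough numbers (prices below are assumptions from memory: roughly $0.02/GB/month for active storage and $5 per TB scanned on-demand; the 20-byte row is also an assumption):

```python
ROWS_PER_DAY = 20_000_000
BYTES_PER_ROW = 20               # assumed compact, normalized schema

stored_gb_per_year = ROWS_PER_DAY * BYTES_PER_ROW * 365 / 1e9   # ~146 GB

storage_usd_per_month = stored_gb_per_year * 0.02   # ~$2.92/month after one year
full_scan_usd = stored_gb_per_year / 1000 * 5.0     # ~$0.73 per full-table scan
```

So at this scale storage really is cheap, and the query side only dominates if ad-hoc scans run constantly or don't prune columns/partitions.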

