
Ask HN: Recommended way to store financial time-series based data for trading? - dfischer
I've been studying and experimenting with this as a hobby and want to get more serious now.

I've been storing data in flat files, but I now want to experiment with scaling out my ideas and infrastructure to collect my own raw data 24/7 across the globe.

I see various time-series databases to use, but there's no clear winner to me. I looked at InfluxDB, TimescaleDB, and various others. Most of them have material geared towards IoT and not much towards finance.

I've been considering a stack built entirely on GCP that looks roughly like:

regional ingestor (compute) -> pub/sub -> Dataflow -> pub/sub -> Firestore and BigQuery

The idea is to allow clients to subscribe to prebuilt aggregation metrics from Dataflow/Beam and optimize for latency cross-regionally. The automated rules would at most need to react in seconds, not milliseconds. I would be more than happy with a guaranteed rolling window of 5-15 seconds for my most time-hungry decisions.

Basic aggregations: OHLC, stdev
Advanced aggregations: values based on custom strategies that would be injected into the feed for a client (automated trading app) to consume and act on.

Is it crazy to do all the rolling-window / strategy calculations in the Dataflow piece of the architecture, or does it make more sense to compute them per client?

Visually, I imagine various signals/strategies would be separate Dataflow templates, and a client would subscribe to whatever strategy it wants to use.

Thanks.
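The basic aggregations described (OHLC and stdev over a 5-15 second rolling window) could be sketched roughly like this - a minimal, illustrative Python version; in the proposed stack the equivalent logic would live inside a Dataflow/Beam windowed transform, and the class and parameter names here are hypothetical:

```python
from collections import deque
from statistics import pstdev

class RollingOHLC:
    """Maintain OHLC + stdev over a fixed-length rolling time window of ticks."""

    def __init__(self, window_seconds=15):
        self.window = window_seconds
        self.ticks = deque()  # (timestamp_seconds, price), oldest first

    def add(self, ts, price):
        """Record a tick and evict ticks that have fallen out of the window."""
        self.ticks.append((ts, price))
        while self.ticks and self.ticks[0][0] <= ts - self.window:
            self.ticks.popleft()

    def snapshot(self):
        """Return the current window's aggregates, or None if the window is empty."""
        prices = [p for _, p in self.ticks]
        if not prices:
            return None
        return {
            "open": prices[0],      # oldest tick still in the window
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],    # most recent tick
            "stdev": pstdev(prices) if len(prices) > 1 else 0.0,
        }
```

A per-strategy Beam template would then publish each `snapshot()` to the outbound pub/sub topic for clients to consume.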
======
ArtWomb
As an alternative to rolling your own, you may be able to leverage IEX Cloud.
The advantage is access to all the high-res trading data right on the platform ;)

[https://iexcloud.io/cloud-cache/](https://iexcloud.io/cloud-cache/)

[https://medium.com/@jun.ji/build-your-own-neural-network-sto...](https://medium.com/@jun.ji/build-your-own-neural-network-stock-market-prediction-model-on-gcp-c1fe07b1a87)

~~~
dfischer
I do want to roll my own for learning's sake; I do fall back to IEX when I
need it for historical data and backtesting.

Did you write the Medium article? I'm curious about the first leg of the
architecture - Cloud Scheduler to Cloud Functions would only make sense for
longer periods between ingestions, right? Something running 24/7 would be most
efficient on an always-on instance via Compute/Kubernetes/App Engine, since
it's not request-based, right?

~~~
srazzaque
If you are interested in learning how to roll a timeseries database, and if
you're not already familiar with it, I'd second the suggestions to try KDB/Q.
Even if only to be inspired and understand what "good" looks like in this
space.

I say this as someone who is generally averse to proprietary enterprise
software. More often than not, it's a horrible pile of crap that's sold on the
golf course, and there's almost always a path through open source and custom
build that does an infinitely better job.

But, in the case of KDB, whilst it was pressed upon me in a former role of
mine, I found it to be an _extremely_ impressive piece of tech once I got over
the initial hurdles (my biases, and basics of Q). It's not without its warts,
but the amount of power and expressivity it packs into a small number of
characters is quite mind-blowing. I watched other teams on
flavour-of-the-month tech struggle with queries over "millions" of rows,
whilst our stack routinely
served complex queries over billions of rows.

~~~
dfischer
What's the insertion capability of the freeware version? Does it handle
event-time ordering well for distributed ingestion?

Appreciate the push.

It does look interesting. I like the idea of working with a native array-type
construct, as the mental model is very close to analyzing time-series data for
a strategy.

~~~
srazzaque
Ingest - the free version, as I'm sure you're aware, is 32-bit only, so it is
limited in how much data it can address. But the _rate_ of ingest I'd imagine
is close to that of the 64-bit version, with the caveat that you'll hit the
ceiling sooner (untested).

The system I was working with was ingesting in the order of 10s of millions of
rows per day from multiple sources. This was obviously the 64bit version.

Also if you do plan on using it for anything beyond tinkering and personal
research I'd review the licensing. I'm not too sure what is and isn't allowed.

Time ordering - yes. The typical architecture of a production system in KDB
will have something called a Tickerplant (TP), a process through which all
updates must traverse. This will stamp its own timestamp on all the updates it
receives.

So for time ordering - you can (1) use the TP timestamps, and/or (2) have your
source nodes also provide a timestamp in their own column (e.g.
sourceNodeTimestamp), which gives you a means of detecting skew across your
systems by comparing against the TP timestamps. You could then order your
data by whichever timestamp you want to use in your queries.
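The skew check in point (2) could be sketched like this - a minimal Python illustration of the idea (in a real KDB deployment this would be a q query over the TP and source-timestamp columns; the function name and threshold are hypothetical):

```python
def detect_skew(updates, threshold_ms=500.0):
    """Flag updates whose source-node timestamp drifts from the
    tickerplant (TP) stamp by more than threshold_ms.

    `updates` is a list of (tp_ts, source_ts) pairs, both as epoch
    seconds (floats). Returns the offending pairs with their skew in ms.
    """
    flagged = []
    for tp_ts, src_ts in updates:
        skew_ms = abs(tp_ts - src_ts) * 1000.0
        if skew_ms > threshold_ms:
            flagged.append((tp_ts, src_ts, skew_ms))
    return flagged
```

Persistent one-sided skew on a given source node usually points at a clock-sync (NTP) problem rather than network latency.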

------
natalyarostova
How much data are we talking about here?

~~~
dfischer
Inserts would be in the hundreds to thousands, maxing out under 10k ticks per
second, spread across <100 symbols. I generally only focus on 10 at a time,
but I'm scaling up the window there for historical needs.

~~~
bananapear
Where is your market data coming from? At that resolution I imagine you’re
paying for colocation?

~~~
dfischer
Either directly from the exchange, using the closest server to it based on
latency tests (crypto), or from a tick-data provider (Polygon, IEX) - but I'm
trying to move away from the providers and go more direct to the source
(easier with crypto right now).

------
shhshahassa
kdb/q

~~~
notduncansmith
This comment might be more helpful if it mentioned why.

KDB is a database (with a query language, Q) written in K by Arthur Whitney.
Arthur hails from the fintech sector and specifically designed these
technologies to work efficiently on the types of problems described by OP.

