Ask HN: Recommended way to store financial time-series based data for trading?
21 points by dfischer | 14 comments
I've been studying and experimenting with this as a hobby and want to get more serious now.

I've been storing data in flat files, but I now want to experiment with scaling out my ideas and infrastructure to collect my own raw data 24/7 across the globe.

I see various time-series databases to use, but there's no clear winner to me. I looked at Influx, TimescaleDB, and various others. Most of them have material geared toward IoT and not much toward finance.

I've been considering a stack built entirely on GCP that looks roughly like:

regional ingestor (Compute Engine) -> Pub/Sub -> Dataflow -> Pub/Sub -> Firestore and BigQuery

The idea is to allow clients to subscribe to prebuilt aggregation metrics from Dataflow/Beam and to optimize for cross-regional latency. At most, the automated rules would need to react in seconds, not milliseconds. I would be more than happy with a guaranteed rolling window of 5-15 seconds for my most time-hungry decisions.
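
To make that concrete, here is a minimal sketch of what the Pub/Sub -> Dataflow -> Pub/Sub leg could look like with the Beam Python SDK. It assumes ticks arrive as JSON like {"symbol": ..., "ts": ..., "price": ...}; the project, topic, and field names are made up, and the per-window aggregate here is just a mean price as a placeholder:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import SlidingWindows

    # Hypothetical topic names -- substitute your own.
    TICKS_TOPIC = "projects/my-project/topics/ticks"
    METRICS_TOPIC = "projects/my-project/topics/metrics"

    def run():
        opts = PipelineOptions(streaming=True)
        with beam.Pipeline(options=opts) as p:
            (p
             | "ReadTicks"   >> beam.io.ReadFromPubSub(topic=TICKS_TOPIC)
             | "Parse"       >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
             | "KeyBySymbol" >> beam.Map(lambda t: (t["symbol"], t["price"]))
             # 15s windows emitted every 5s, matching the 5-15s tolerance above.
             | "Window"      >> beam.WindowInto(SlidingWindows(size=15, period=5))
             | "MeanPrice"   >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
             | "Encode"      >> beam.Map(lambda kv: json.dumps(
                   {"symbol": kv[0], "mean_price": kv[1]}).encode("utf-8"))
             | "Publish"     >> beam.io.WriteToPubSub(topic=METRICS_TOPIC))

    if __name__ == "__main__":
        run()

Clients (the automated trading apps) would then just subscribe to the output topic rather than computing anything themselves.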

Basic aggregations: OHLC, stdev.
Advanced aggregations: values based on custom strategies that would be injected into the feed for a client (an automated trading app) to consume and act on.
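
As a rough, untested sketch of the OHLC piece: a Beam CombineFn that could replace the mean-price step above, assuming the keying step is changed to emit (symbol, (ts, price)) so open/close can be tied to timestamps rather than arrival order:

    import apache_beam as beam

    class OhlcCombineFn(beam.CombineFn):
        # Accumulator: (first_ts, open, last_ts, close, high, low, tick_count).
        # Open/close are anchored to timestamps, so merge order does not matter.
        def create_accumulator(self):
            return None

        def add_input(self, acc, tick):
            ts, price = tick
            if acc is None:
                return (ts, price, ts, price, price, price, 1)
            fts, o, lts, c, h, l, n = acc
            if ts < fts:
                fts, o = ts, price
            if ts >= lts:
                lts, c = ts, price
            return (fts, o, lts, c, max(h, price), min(l, price), n + 1)

        def merge_accumulators(self, accumulators):
            merged = None
            for acc in accumulators:
                if acc is None:
                    continue
                if merged is None:
                    merged = acc
                    continue
                fts, o, lts, c, h, l, n = merged
                fts2, o2, lts2, c2, h2, l2, n2 = acc
                if fts2 < fts:
                    fts, o = fts2, o2
                if lts2 >= lts:
                    lts, c = lts2, c2
                merged = (fts, o, lts, c, max(h, h2), min(l, l2), n + n2)
            return merged

        def extract_output(self, acc):
            if acc is None:
                return None
            _, o, _, c, h, l, n = acc
            return {"open": o, "high": h, "low": l, "close": c, "ticks": n}

A rolling stdev would follow the same pattern (accumulate count, sum, and sum of squares), and the "advanced" strategy values would just be further CombineFns or DoFns hanging off the same windowed stream.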

Is it crazy to do all the rolling-window / strategy calculations in the Dataflow/Beam piece of the architecture, or does it make more sense to compute them per client?

Visually, I am imagining the various signals/strategies as separate Dataflow templates, with a client subscribing to whichever strategy it wants to use.

Thanks.




As an alternative to rolling your own, you may be able to leverage IEX Cloud. The advantage is access to all the high-res trading data right on the platform ;)

https://iexcloud.io/cloud-cache/

https://medium.com/@jun.ji/build-your-own-neural-network-sto...


I do want to roll my own for learning's sake; I fall back to IEX when I need it for historical data and backtesting.

Did you write the Medium article? I'm curious about the first leg of the architecture - Cloud Scheduler to Cloud Functions would only make sense for longer periods between ingestions, right? Something running 24/7 would be most efficient on an always-on instance via Compute Engine / GKE / App Engine, since it's not request-based, right?


If you are interested in learning how to roll a time-series database, and if you're not already familiar with it, I'd second the suggestions to try KDB/Q - even if only to be inspired and understand what "good" looks like in this space.

I say this as someone who is generally averse to proprietary enterprise software. More often than not, it's a horrible pile of crap that's sold on the golf course, and there's almost always a path through open source and custom build that does an infinitely better job.

But, in the case of KDB, whilst it was pressed upon me in a former role of mine, I found it to be an _extremely_ impressive piece of tech once I got over the initial hurdles (my biases, and the basics of Q). It's not without its warts, but the amount of power and expressivity it packs into a small number of chars is quite mind-blowing. I watched other teams on flavour-of-the-month tech struggle with queries over "millions" of rows, whilst our stack routinely served complex queries over billions of rows.


kdb/q looks interesting! My main consideration in suggesting Cloud Functions is cost and portability. It will always save money versus running something like AresDB on a GPU cluster. Performance is within typical cloud latencies, so it's acceptable for generating trading statistics for humans. Data is cached in memory (using multiple buffers), Cloud Functions can be called synchronously, and all eventing is handled via goroutines.


What's the insertion capability of the free version? Does it handle event-time ordering well for distributed ingestion?

Appreciate the push.

It does look interesting. I like the idea of working with an array-native type construct, as the mental model is very close to analyzing time-series data for a strategy.


Ingest - the free version, as I'm sure you're aware, is 32-bit only, so it is limited in how much data it can address. But the /rate/ of ingest, I'd imagine, is close to that of the 64-bit version, with the caveat that you'll hit the ceiling sooner (untested).

The system I was working with was ingesting on the order of tens of millions of rows per day from multiple sources. This was obviously the 64-bit version.

Also if you do plan on using it for anything beyond tinkering and personal research I'd review the licensing. I'm not too sure what is and isn't allowed.

Time ordering - yes. The typical architecture of a production KDB system will have something called a Tickerplant (TP), a process through which all updates must pass. This will stamp its own timestamp on all the updates it receives.

So for time ordering - you can (1) use the TP timestamps, and/or (2) have your source nodes also provide a timestamp in their own column (e.g. sourceNodeTimestamp), and then you've got some means of detecting skew across your systems by comparing against the TP timestamps. You could then order your data by whichever timestamp you want to use in your queries.
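
Outside of q, the same idea looks roughly like this (sketched in Python/pandas purely as illustration, with made-up sample rows; the column names are just the ones mentioned above):

    import pandas as pd

    # Each row carries the source node's timestamp and the TP's ingest timestamp,
    # so clock skew between nodes is directly observable per row.
    ticks = pd.DataFrame({
        "sym": ["BTCUSD", "BTCUSD", "ETHUSD"],
        "price": [9100.5, 9101.0, 186.2],
        "sourceNodeTimestamp": pd.to_datetime([
            "2019-10-01 12:00:00.120",
            "2019-10-01 12:00:00.050",
            "2019-10-01 12:00:00.300",
        ]),
        "tpTimestamp": pd.to_datetime([
            "2019-10-01 12:00:00.160",
            "2019-10-01 12:00:00.155",
            "2019-10-01 12:00:00.310",
        ]),
    })

    ticks["skew"] = ticks["tpTimestamp"] - ticks["sourceNodeTimestamp"]  # per-row drift
    ordered = ticks.sort_values("sourceNodeTimestamp")  # or order by tpTimestamp
    print(ordered[["sym", "price", "skew"]])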


How much data are we talking about here?


Inserts would be in the hundreds to thousands per second, <10k/second max, of tick data spread across <100 symbols. I generally only focus on 10 symbols at a time, but I'm scaling that window up for historical needs.
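
(For rough sizing: a sustained 10k rows/second would be about 10,000 × 86,400 ≈ 864M rows/day, but the realistic average is far below that peak.)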


Where is your market data coming from? At that resolution I imagine you’re paying for colocation?


Coming either from the exchange directly, using the closest server to it based on latency tests (crypto), or from a tick-data provider (Polygon, IEX), but I'm trying to move away from the providers and go more direct to the source (easier with crypto right now).


kdb/q


This comment might be more helpful if it mentioned why.

KDB is a database (with a query language, Q) written in K by Arthur Whitney. Arthur hails from the fintech sector and specifically designed these technologies to work efficiently on the types of problems described by OP.


If your data is big enough, the free 32-bit version will fall over pretty quickly. The 64-bit version will cost you...


Yeah trying to stay away from this one due to the cost requirements.



