The current visualization is far from perfect, but it was hard for me to fit more information in there. Please share your ideas to improve it, or make your own!
I mostly aimed at making an aesthetically pleasing image that shows which cells were controlled and which moves were used the most.
As for usage examples, it's very easy to see the difference between European and Indian openings (the former advancing in the center, the latter on the sides), and it's quite easy to guess who won by looking at who controlled the most cells at the end.
On the tech side, this is a single-file, local-first, vanilla JS app that queries the (non-official) chessgames.com API through corsproxy.io (to work around CORS). I then draw with SVG elements and use canvg [^1] to produce PNG images. The JS code is embedded in the HTML, so you can read it just by viewing the source (or look on GitHub [^2]). I also maintain a Python version that produces the same outputs as the browser version.
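To give an idea of the drawing step, here is a rough sketch of what the Python version does conceptually (not the actual code; control_counts is a placeholder for the per-square statistics):

    # rough sketch, not the real implementation: render an 8x8 "control" heatmap as SVG
    CELL = 40  # pixel size of one square

    def board_svg(control_counts):
        """control_counts: 8x8 nested list of ints, how often each square was controlled."""
        peak = max(max(row) for row in control_counts) or 1
        rects = []
        for rank in range(8):
            for file in range(8):
                opacity = control_counts[rank][file] / peak
                rects.append(
                    f'<rect x="{file * CELL}" y="{rank * CELL}" width="{CELL}" height="{CELL}" '
                    f'fill="darkred" fill-opacity="{opacity:.2f}" stroke="gray"/>'
                )
        return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{8 * CELL}" height="{8 * CELL}">'
                + "".join(rects) + "</svg>")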
I actually quit a quant trading job after 2 weeks because they used kdb+. I could use it but the experience was so bad...
People could complain about the abysmal language design or debugging, but what frustrated me most was the coding conventions they had (or didn't have), and I think the language and the community play a big role there. The company culture matters too: when I asked why the code was so poorly documented (no comments, single-letter parameters, arcane function names), the answer was, "We understand it after some time, and this way other teams cannot use our ideas."
Overall, their whole stack was outdated, and of course they could not do very interesting things with a tool such as Q. For example, they plotted graphs by copying data from qStudio to Excel...
The only good thing was that they did not buy into the Docker / k8s BS and deployed directly onto servers. It makes sense that quants should be able to fix things in production very quickly, but I think it would also make sense for web app developers not to wait 10 minutes (and that's when you have good infra) to see a fix in production.
I have a theory on why quants actually like kdb: it's a good *weapon*. It serves some purpose but I would not call it a *tool* as building with it is tedious. People like that it just works out of the box. But although you can use a sword to drive nails, it is not its purpose.
Continuing with that theory, LISP (especially Racket) would be the best *tool* available: it is not the most powerful language out of the box, but it lets you build a lot of abstractions, with features for modifying the language itself. C++ and Python are just great programming languages, as you can build good software with them; Python is also a fairly good weapon.
Q might give the illusion of being the best language for exploring quant data, but that's just because quants do not invest enough time in building good software and using good tools. When you actually master a Python IDE, you are definitely more productive than any Q programmer.
And don't get me started on performance (the link covers it anyway even though the prose is bad).
The article calls out Python and DuckDB as possible successors.
I remember being very impressed by kdb+ (I went to their meetups in Chicago). Large queries ran almost instantaneously. The APL-like syntax was like a magic incantation that only math types were privy to. The salesperson mentioned kdb+ was so optimized that it fit in the L1 cache of a processor of the day.
Fast forward 10 years. I’m doing the same thing today with Python, DuckDB, and Jupyter on Parquet files. DuckDB not only parallelizes, it vectorizes. I’m not sure how it benchmarks against kdb+, but the responsiveness of DuckDB at least feels as fast as kdb+ on large datasets (though I’m sure kdb+ is vastly more optimized). The difference? DuckDB is free.
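To make it concrete, a typical notebook cell looks something like this (the path and column names are made up):

    import duckdb

    # query a directory of Parquet files straight from a notebook cell;
    # 'trades/*.parquet' and the columns are placeholders
    df = duckdb.sql("""
        SELECT symbol, avg(price) AS avg_price, sum(size) AS volume
        FROM 'trades/*.parquet'
        GROUP BY symbol
        ORDER BY volume DESC
    """).df()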
We use DuckDB similarly but productionize by writing pyarrow code. All the modern tools (DuckDB, pyarrow, polars) are fast enough if you store your data well (Parquet), though most of the time what we work with is not quite “big data”.
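A simplified sketch of the pyarrow side (the schema and path are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # build a table in memory and write it out as Parquet;
    # in production this would come from the actual pipeline
    table = pa.table({
        "symbol": ["AAPL", "MSFT", "AAPL"],
        "price": [189.5, 412.1, 190.2],
        "size": [100, 250, 75],
    })
    pq.write_table(table, "prices.parquet", compression="zstd")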
It’s worth remembering that all the modern progress builds on top of years of work by Wes McKinney & co (many, many contributors).
Also a tip: for interactive queries, do not store Parquet in S3.
S3 is high-throughput but also high-latency storage. It's good for bulk reads, but not random reads, and querying Parquet involves random reads. Parquet on S3 is ok for batch jobs (like Spark jobs) but it's very slow for interactive queries (Presto, Athena, DuckDB).
The solution is to store Parquet on low-latency storage. S3 has an offering called S3 Express One Zone (low-latency S3 that costs a bit more). Or use EBS, which is block storage that doesn't suffer from S3's high latency.
You can do realtime in the sense that you can build NumPy arrays in memory from realtime data and then use them as columns in DuckDB. This is the approach I took when designing KlongPy to interop array operations with DuckDB.
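Stripped of the KlongPy specifics, the basic idea looks something like this (tick values made up):

    import duckdb
    import numpy as np
    import pandas as pd

    # columns built in memory from (hypothetical) realtime ticks
    prices = np.array([101.2, 101.4, 101.3])
    sizes = np.array([100, 250, 75])
    ticks = pd.DataFrame({"price": prices, "size": sizes})

    # DuckDB can scan the in-memory DataFrame directly by its variable name
    print(duckdb.sql("SELECT avg(price) AS avg_price, sum(size) AS volume FROM ticks").df())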
Not real time, just historical. (I don’t see why it can’t be used for real time though... but haven’t thought through the caveats)
Also, not sure what you mean by Parquet not being good at appending? On the contrary, Parquet is designed for an append-only paradigm (like Hadoop back in the day). You can just drop a new Parquet file and it's appended.
If you have 1.parquet, all you have to do is drop 2.parquet in the same folder or Hive hierarchy. Then query:
    SELECT * FROM '*.parquet'
DuckDB automatically scans all the Parquet files in that directory structure when it queries. If there's a predicate, it uses Parquet metadata (the row-group statistics) to skip files that don't contain the data requested, so it's very fast.
In practice we use a directory structure called Hive partitioning, which helps DuckDB do partition elimination to skip over irrelevant partitions, making it even faster.
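Concretely, with DuckDB's Python API it looks something like this (the directory layout and column name are made up):

    import duckdb

    # files laid out like data/trade_date=2024-01-02/part-0.parquet (hypothetical)
    duckdb.sql("""
        SELECT count(*)
        FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
        WHERE trade_date = DATE '2024-01-02'  -- partition column comes from the directory names
    """).show()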
Now, it's not so good at updating, because Parquet is a write-once format (not read-write). Updating a single record entails regenerating the entire Parquet file. So if you have late-arriving updates, you need to do extra work to identify the partition involved and overwrite it. Either that, or use bitemporal modeling (add a data-arrival timestamp [1]) and put a latest-arrival clause in your query (which costs more compute). If you have a scenario where existing data changes a lot, Parquet is not a good format for you; you should look into Timescale (a time-series database built on Postgres).
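For the bitemporal route, the "latest arrival wins" query is roughly this (the path and column names are made up):

    import duckdb

    # keep only the most recently arrived version of each record;
    # trade_id / arrival_ts and the path are placeholders
    latest = duckdb.sql("""
        SELECT *
        FROM read_parquet('trades/*.parquet')
        QUALIFY row_number() OVER (PARTITION BY trade_id ORDER BY arrival_ts DESC) = 1
    """).df()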
Not surviving more than two weeks in a QF role because of kdb, and then suggesting they should rewrite everything in LISP, is one of the more recidivous HN-level comments I think I have ever seen.
You didn’t learn Q in two weeks to the extent that you are qualified to assert that someone who knows how to use a Python IDE is more productive than a quant dev with decades of experience.
I find it much more likely that you couldn’t understand their code and quit out of frustration.
If you were a highly skilled quant dev and this was a good seat, quitting after two weeks would have been a disaster to manage the next transition given the terms these contracts always have.
Their pykx integration goes a long way toward fixing some of the gaps in:
- charting
- machine learning/statsmodels
- html processing/webscrapes
Because for example you can just open a Jupyter Notebook and do:
    import pykx as kx
    import matplotlib.pyplot as plt

    # run a q query, then convert the result to pandas for plotting
    df = kx.q("select from foo where bar").pd()
    plt.plot(df["x"], df["y"])
It’s truly an incredibly seamless and powerful integration. You get the best of both worlds, and it may be the saving feature of the product over the next 10 years.
I think this will only work with regular qSQL on a specific database node, i.e. RDB, IDB, or HDB [1]. It will be much harder for a mortal Python developer to use Functional qSQL [2], which joins/merges/aggregates data from all these nodes. The join/merge/aggregation is usually application-specific and done on some kind of gateway node(s). Querying each of them is slightly different, with different keys and secondary indices, and requires using a parse tree (AST) of the query.
---
[1] RDB - RAM DB (recent in-memory data), IDB - Intraday DB (recent data that doesn't fit into RAM), HDB - Historical DB (usually partitioned by date or another time-based or integral column).
That’s accurate enough. I think the workflow was built more for a q dev occasionally dipping into Python than the other way around.
I think you touch on something really interesting, which is the kink in the kdb+ learning curve when you go from really simple functions, tables, etc. to actually building a performant kdb architecture.
I'm perfectly capable of learning obscure language _and_ thinking I'm a special snowflake. (In fact, I'm a special snowflake _because_ I am into weird languages.)