
Nasdaq ITCH - veyron
http://www.ryanday.net/?p=573
======
ggruschow
Some things to think about:

    
    
      - 50MB is >> the total cache on most systems you'll use.
      - This is far from the only function you need to optimize.
      --> The best performing algorithm in a micro-benchmark can be sub-optimal in real use.
    
      - You've got a fixed-for-the-day list of <10k unique upper-case strings, each <9 letters.
      - The bulk of the ones you'll process are <= 4 letters (CMCSA/CMCSK be damned).
      - The median length is 3 and the mean is <3.6.
      - The symbols aren't anywhere close to uniformly active.
      --> Benchmark with a typical data stream, not a uniformly distributed one.
    
      - You're likely not even considering trading most of them.
    
      - How much better is your algorithm than the obvious binary-search?
      - Or worse -- plain ol' front-to-back linear search?
      - Are you sure you understand 100% of how the computer works?
        --> Yes? Mail a check to Goldman and find a better line of work.
        --> No? Test it with a quick benchmark.
    
      - Is binary search cache friendly?
      - Why not? Plot out the memory accesses if need be.
    
      - What's wrong with taking a linear-interpolated guess?
      - Is that cache friendly? .. or could it be?
      - When your guess misses, what's that cost?
      - Can you think of a cheap modification to lower that cost?
      - Is the distribution really linear?
    
      - What's wrong with a "perfect hash function"?
      - Does it need to be perfect?
    

Incidentally, hardware wins at this function. You can just generate logic to
do all the if symbol == "XYZ": return 3's in parallel.
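
Before reaching for hardware, the questions above are easy to answer empirically. A minimal benchmark sketch along those lines -- the symbol list and the skewed activity weights here are made up for illustration, not real ITCH data:

```python
import bisect
import random
import timeit

# Hypothetical fixed-for-the-day symbol universe: short, upper-case,
# with activity heavily skewed toward a few names.
symbols = sorted({"AAPL", "MSFT", "IBM", "SPY", "QQQ", "GE", "F", "BAC",
                  "CMCSA", "CMCSK", "XYZ", "ZVZZT"})
table = {s: i for i, s in enumerate(symbols)}

# Skewed stream: the first few symbols account for most lookups,
# per the "benchmark with a typical data stream" point above.
random.seed(1)
weights = [2 ** (len(symbols) - i) for i in range(len(symbols))]
stream = random.choices(symbols, weights=weights, k=10_000)

def linear(sym):
    """Plain ol' front-to-back linear search."""
    for i, s in enumerate(symbols):
        if s == sym:
            return i

def binary(sym):
    """The obvious binary search over the sorted list."""
    return bisect.bisect_left(symbols, sym)

def hashed(sym):
    """Ordinary (not perfect) hash table lookup."""
    return table[sym]

for fn in (linear, binary, hashed):
    t = timeit.timeit(lambda: [fn(s) for s in stream], number=10)
    print(f"{fn.__name__}: {t:.3f}s")
```

The point isn't the absolute numbers -- it's that a ten-line harness like this settles the linear-vs-binary-vs-hash question for your actual symbol distribution in minutes.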

~~~
tptacek
This is a crazy smart comment. Thanks.

... is anyone really using hardware to do this, by the way?

Hardware regexing at this scale is practically a COTS part, or was when I
stopped doing network security products in '05.

~~~
veyron
Unfortunately it's not that simple anymore. You actually have to do most of
the order-book-building legwork to figure out if the quote is relevant: every
order modification message, such as canceling an order or execution, omits the
ticker. Makes sense: after all, the add-quote actually tells you the ticker,
so there's no need to repeat that info.

This is where it gets tricky. FPGA-based systems don't really give an edge
here, and Tilera-based solutions are still nascent.
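
A minimal sketch of the bookkeeping described above (handler and field names simplified, not the real wire format): in ITCH, only the Add Order message carries the stock symbol; Order Executed, Cancel, and Delete refer to the order by its reference number alone, so you have to keep the mapping yourself to know which book a later message touches.

```python
order_symbol = {}  # order reference number -> stock symbol

def on_add_order(order_ref, symbol, shares, price):
    # Add Order is the only message that names the ticker; remember it.
    order_symbol[order_ref] = symbol
    # ... add the order to `symbol`'s book ...

def on_execute(order_ref, shares):
    # Execution omits the ticker; recover it from the earlier add.
    symbol = order_symbol[order_ref]
    # ... reduce resting size on that symbol's book ...
    return symbol

def on_delete(order_ref):
    # Order is gone; drop the mapping so the table doesn't grow forever.
    return order_symbol.pop(order_ref)

on_add_order(1001, "AAPL", 100, 185.25)
on_add_order(1002, "IBM", 200, 142.80)
```

This is why you can't filter by ticker without doing most of the book-building legwork: relevance of any non-add message is only known via this table.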

~~~
aortega
This craziness is what happens when you try to implement a specialized
algorithm on a general-purpose state machine. What is a "cache" anyway but an
optimization, there only because we are not using real computers but hugely
complex state machines called CPUs? FPGAs are as close to a real computer as
we can have today.

~~~
tptacek
The parallels to high end network security are again interesting here; for
instance, on the more interesting network processors, memory accesses were
asynchronous, and instead of a "cache" you had a manually addressed "scratch"
memory with latency comparable to registers.

... because "general purpose cache" wasn't helpful, but "exploitable locality"
definitely was.

~~~
wladimir
That's pretty close to how the first NVIDIA GPGPU (G80) worked. There was no
cache, but only different kinds of memory, of which one was as fast as
registers. Optimal access to RAM memory was possible with a limited set of
access patterns, so "caching" had to be managed carefully manually.

In later GPUs they switched (mainly due to programmer laziness :-) ) to a more
CPU-like, automatically managed general-purpose cache. The problem was that in
most cases, being able to quickly write code that is reasonably fast mattered
more to developers than being able to squeeze out every cycle (though I
believe it's still possible to disable the cache and manage it manually...)

The problem with an automatic cache is the same as with all "intelligent" CPU
features: the CPU tries to predict what the program is doing, so the
programmer has to predict what the CPU will predict in order to optimize for
that CPU. In the next generation, the CPU vendor will optimize again for the
programs currently around. In the end, optimizing becomes much more complex.

There's certainly an advantage to keeping the logic "dumb" and simply having
the software manage everything.

------
wglb
This sounds like the beginning of a fun project. However, may I humbly suggest
a few things?

(In 2005, I built a system to do interesting computations on each stock option
tick. The volume for the time was pretty high. Each time the IBM stock, for
example, moved a penny, this generated 400 quotes. While small by today's
volumes, 100k quotes with an OAS calculation with a binomial matrix on each
tick was a challenge. So I am not totally unfamiliar with these throughput
issues.)

And Python was used, but only on the GUI side, presenting the results. While I
didn't measure it, I feel quite confident in saying -- at the risk of being a
premature optimizer -- that Python isn't going to keep up.

Before you optimize the stock look-up function, you might want to look ahead a
bit. veyron has commented below or elsewhere that a very important calculation
is going to be your total holdings on each tick. This is how you will
calculate your risk, or at least the fundamental start.

Another thing to consider is your goal in trading. How will you be making
money? For example, will you be buying/selling on, say, 10 second intervals
and collect profits from that? Or longer intervals? Shorter?

Whatever your overall strategy function is, you will spend some time in that
calculation. Will you do it on every tick? This computation time might well at
this point be unknown. If it is an interesting calculation, it might well be
complex.

So how is a startup going to deal with this firehose of a problem?

Well, consider splitting up the processing. One obvious way to do this is to
filter the tickers you are interested in on one box and push a much more
manageable stream to your main logic box. Perhaps start with one frequently-
traded stock and tune your algorithms. This will be a rich learning experience.
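
A sketch of that filter stage -- the watchlist and message shape here are made up for illustration, and a real filter box would sit on the raw feed rather than on dicts:

```python
# Hypothetical filter box: pass through only messages for the symbols we
# actually trade, so the main logic box sees a much smaller stream.
WATCHLIST = frozenset({"AAPL", "IBM"})  # made-up watchlist

def filter_stream(messages):
    """Yield only the messages whose symbol is on the watchlist."""
    for msg in messages:
        if msg["symbol"] in WATCHLIST:
            yield msg

feed = [{"symbol": "AAPL", "px": 185.25},
        {"symbol": "XOM", "px": 104.10},
        {"symbol": "IBM", "px": 142.80}]
kept = list(filter_stream(feed))
```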

Then, watch the CPU percentage. If it is above 60% at 0910 central, maybe
start profiling your code. Your time-consuming parts will surprise you.

If it is going well, and you want to expand, consider adding one box for each
stock that you plan to trade. You can do it for about $200 diskless, quantity
one retail. Perhaps put the quote stream or a divided quote stream on a
multicast channel, two NICs per box. Allocate a pre-determined amount of
capital for each stock so they don't need to coordinate to give you your risk
calculation.

Also, the idea of High-Frequency Trading is kind of a fractal concept. On the
low end, you have a trading situation that is slightly more active than a day
trader. On the high end, you have boys who rewrite network stacks in Windows
or Linux for their boxes co-located with exchange points in New Jersey, where
the cables are measured to the inch so everybody sees the tick at the same
time. Those are the 100 nanosecond boys. These are the fellows who order
servers by the multiple racks of 30. And they trade in a whole different style
--their net hold time might average out to zero, but because of agreements
they have with the exchanges, they have to stick with the market when it takes
a dive.

So don't try to be them at the start.

In summary, success here will come more from the engineering than the computer
science, and not trying to beat the big boys at their game.

------
z_
I suspect you'll want something a touch faster than a linear scan for every
lookup; in fact, given the previous stipulation that memory is no problem, you
should instead use a hashtable.

~~~
rwolf
Exactly. I was amazed at how little ground was covered by the end of the post,
off on a yak shave for the part of this problem that is _not_ hard or novel.

------
veyron
I'm happy someone was interested enough to just start coding :)

------
cdtitch41
Here's some trivial YAML-driven python code:
<https://bitbucket.org/cdtitch41/itch41/overview>

It doesn't construct books or anything else sophisticated but it may be of
interest to ETL folks.

------
dr_rezzy
Lots of parallels here: <http://www.weatherlabs.com/2011/08/device-
detection.html>

------
rday
Thanks for all the comments! Veyron had told me a hashtable would be a better
idea. I have been reading about creating a perfect hash so I don't have to
deal with collisions, although I'm (obviously) not the best algorithm guy. So
that post may be coming soon.
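
One note on the perfect-hash angle, echoing ggruschow's "does it need to be perfect?" question: since the symbol list is fixed for the day, you can often skip the textbook construction and just search for a table size under which an ordinary deterministic hash happens to be collision-free for that day's list. A rough sketch (the hash choice and symbol list are illustrative, not a recommendation):

```python
def fnv1a(s):
    """Deterministic 32-bit FNV-1a hash over the symbol's bytes."""
    h = 0x811C9DC5
    for b in s.encode("ascii"):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def find_collision_free_size(symbols, max_size=1 << 16):
    """Smallest table size for which fnv1a is collision-free on `symbols`."""
    for size in range(len(symbols), max_size):
        if len({fnv1a(s) % size for s in symbols}) == len(symbols):
            return size
    raise ValueError("no collision-free size found; try another hash")

# Made-up day's symbol list; rebuild the table each morning.
symbols = ["AAPL", "MSFT", "IBM", "GE", "F", "CMCSA", "CMCSK"]
size = find_collision_free_size(symbols)
table = [None] * size
for s in symbols:
    table[fnv1a(s) % size] = s

def lookup(sym):
    """One hash, one mod, one compare -- no collision chains to walk."""
    return table[fnv1a(sym) % size] == sym
```

The table may be sparse, but memory was stated to be no problem, and the per-lookup cost is fixed.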

I had thought that, due to the extreme low-latency requirements, a hardware
solution would be needed. So that may be the next place to start experimenting.

------
ajays
Is there a place where we can download end-of-the day ITCH data? There are so
many HFT/LLT outfits out there with this data; surely one will be generous
enough to release it for free or some trivial amount?

~~~
veyron
I looked into it: it's expensive to redistribute market data (you have to
register with the various exchanges to redistribute their data, and it's not
cheap). Though Nasdaq did have some data available on their FTP site:
ftp://emi.nasdaq.com/ITCH/

