
Google and Amazon Vie for Big Inroad into Wall Street Data Trove - walterbell
http://www.bloomberg.com/news/articles/2016-08-30/google-and-amazon-vie-for-big-inroad-into-wall-street-data-trove
======
elecengin
The CAT on HN! I went to one of the early meetings for this project (years ago
now!). It was a question-and-answer session for potential bidders.

My favorite question: "How long is the contract for?" (the SEC reps look to
each other, and then respond...)

"There is no term."

"And the bidder is committed to storing all generated data?"

"Yes."

------
highlynt
This has come up on HN before... One of the bidders has apparently run a load
test on Google Cloud with some impressive numbers:
[https://cloudplatform.googleblog.com/2016/03/financial-services-firm-processes-25-billion-stock-market-events-per-hour-with-Google-Cloud-Bigtable.html](https://cloudplatform.googleblog.com/2016/03/financial-services-firm-processes-25-billion-stock-market-events-per-hour-with-Google-Cloud-Bigtable.html)

~~~
boulos
Yep, that's FIS, running atop our now Generally Available release of Cloud
Bigtable
([https://cloud.google.com/bigtable/](https://cloud.google.com/bigtable/)).
With the HBase compatibility, several folks have swapped out Cassandra for
Bigtable (like Spotify, mentioned in our GA announcement
[https://cloudplatform.googleblog.com/2016/08/Google-Cloud-Bigtable-is-generally-available-for-petabyte-scale-NoSQL-workloads.html](https://cloudplatform.googleblog.com/2016/08/Google-Cloud-Bigtable-is-generally-available-for-petabyte-scale-NoSQL-workloads.html)).

Disclosure: I work on Google Cloud, so I want you to use Bigtable ;).

~~~
TheEzEzz
~20GB/s read/write across a thousand-plus cores seems slow, especially for
embarrassingly parallel data such as this (split on security). That works out
to roughly 20 megabytes per second per core. Am I missing something?
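For reference, the back-of-envelope arithmetic (a sketch; the ~20 GB/s and
1,000-core figures are the rough numbers from this thread, and the 25 billion
events/hour comes from the linked blog post):

```python
# Back-of-envelope check of the per-core throughput estimate.
# Assumed inputs: ~20 GB/s aggregate read/write over ~1,000 cores,
# and the blog post's 25 billion market events per hour.

aggregate_gb_per_sec = 20    # approximate aggregate throughput, GB/s
cores = 1000                 # "a thousand+ cores"

per_core_mb_per_sec = aggregate_gb_per_sec * 1024 / cores
print(f"~{per_core_mb_per_sec:.0f} MB/s per core")

events_per_hour = 25_000_000_000
events_per_sec = events_per_hour / 3600
print(f"~{events_per_sec / 1e6:.1f}M events/sec sustained")
```

So the claim amounts to roughly 20 MB/s per core, or about 7 million events
per second sustained across the cluster.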

~~~
mbrukman
They're not doing sequential scans of files on disk, they're doing random
reads and writes in a database, where each write is replicated and durable, in
parallel, across the entire key space of market transactions. The task was to
reconcile market transactions end-to-end by matching orders with their
parent/child orders (e.g., as orders get merged/split or routed from
broker/dealers to others or to exchanges to be executed), thus building
millions (billions?) of graphs across the entire dataset. You can see more
details in the video of the presentation at the bottom of this blog post:
[https://cloudplatform.googleblog.com/2016/03/financial-services-firm-processes-25-billion-stock-market-events-per-hour-with-Google-Cloud-Bigtable.html](https://cloudplatform.googleblog.com/2016/03/financial-services-firm-processes-25-billion-stock-market-events-per-hour-with-Google-Cloud-Bigtable.html),
but I presume you're much more familiar with the intricacies of the stock
market than I am. :)

Here's the performance you can expect to see per Cloud Bigtable server node in
your cluster, whether for random reads/writes or for sequential scans:
[https://cloud.google.com/bigtable/docs/performance](https://cloud.google.com/bigtable/docs/performance)

Here's a benchmark comparing Cloud Bigtable to HBase and Cassandra that may be
of interest (on a different benchmark than presented in the FIS blog post, but
shows the relative price/performance):
[https://cloudplatform.googleblog.com/2015/05/introducing-Google-Cloud-Bigtable.html](https://cloudplatform.googleblog.com/2015/05/introducing-Google-Cloud-Bigtable.html)

Disclosure: I am the product manager for Google Cloud Bigtable. Let me know if
you have any other questions, I'm happy to discuss further.
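A minimal sketch of the kind of order-lineage reconciliation described above,
using a union-find over order IDs (the record format and field names here are
invented for illustration, not the actual CAT or Bigtable schema):

```python
# Hypothetical sketch: grouping market events into order-lineage graphs
# with union-find. Records and field names are invented for illustration.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Each event links an order to the order it was derived from,
# e.g. a broker splitting a parent order into child orders.
events = [
    {"order_id": "A1", "parent_id": None},   # original customer order
    {"order_id": "A2", "parent_id": "A1"},   # child routed to exchange X
    {"order_id": "A3", "parent_id": "A1"},   # child routed to exchange Y
    {"order_id": "B1", "parent_id": None},   # unrelated order
]

uf = UnionFind()
for ev in events:
    if ev["parent_id"] is not None:
        uf.union(ev["order_id"], ev["parent_id"])

# Orders sharing a root belong to the same lineage graph.
groups = {}
for ev in events:
    groups.setdefault(uf.find(ev["order_id"]), []).append(ev["order_id"])

print(list(groups.values()))  # A1/A2/A3 in one group, B1 alone
```

At CAT scale the interesting part is that each linkage lookup is a random
read/write against the full key space, which is why the per-node random-access
numbers in the performance docs above are the relevant benchmark, not
sequential scan throughput.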

------
thr0waway1239
Here is a section from the article I found interesting:

"Some worry that any insight into what could be the world’s largest repository
of securities transactions will provide ways for either company to profit
beyond cloud services....It’s also specified in the CAT proposal that whoever
wins the bid must ensure the security and confidentiality of the data, and
agree to use it only for appropriate surveillance and regulatory activities."

How will they actually enforce such clauses? Who is going to monitor what goes
on inside these big corporations?

And why not initiate a public-private partnership to form an independent
entity dedicated solely to this purpose, if there is such a desperate need?

If Wall Street was too big to fail during the last recession, doesn't that
mean Amazon and Google will be conferred the same blessing once they become
the repository of such information, especially if there really isn't any
simple way to enforce these clauses? So two of the biggest tech companies
become candidates for bailouts - is the threat they already pose with the data
they possess not enough for people?

Would love to hear thoughts from the folks who are already working in fintech
who might be more familiar with the enforcement of such clauses.

~~~
danblick
I don't have direct experience to answer your question, but I think perhaps
audits are part of the answer.

Really, you can go a long way by asking: who has access to the system storing
the data? What is the policy for granting and revoking that access? What are
the policies for handling the data (to avoid leaking it)?

I've worked at big tech companies but never come across a customer credit card
number because there are policies for handling that data and audits to make
sure they are obeyed. I think basic checks will go a long way.

(Granted, you're talking about a situation in which a company would have an
incentive to subvert the controls on data; that's not really the case for
credit card data.)

~~~
thr0waway1239
I would expect as much. But in these cases, would the auditors be expected to
make their findings public?

My understanding is that the typical audit is stakeholder-driven. For an audit
of Google's and Amazon's data-handling policies in this kind of scenario, who
is the stakeholder?

------
saretired
This is an excellent idea, and clearly it would also help the SEC investigate
illegal trading. Prediction, based on the exceptional efficiency of Wall St
lobbying: Congress will refuse to give the SEC the funding for this.

------
lordnacho
There must be more to this than what it says in the article. Where I work, I
can look at the exact state of the orderbook, for tens of thousands of
securities, on dozens of exchanges, at any point in history, or live. I can
run simulations over the data in a few minutes per day simulated.

It didn't cost nearly $100M to build.

------
rboyd
Are there not feeds that already collect this data? I was just listening to a
podcast that described something similar from Nanex called NxCore.

~~~
usefulcat
Existing market data doesn't include info like the actual names of the
firms/individuals behind each order. From the article it sounds like they
might be including (or proposing to include) that level of detail.

For anyone doing trading, or considering it, it would be _immensely_ useful if
you knew which orders were from the same firm, even if you didn't know the
real names.

------
vgt
Here is the video from Sungard FIS and Google Cloud discussing their approach:

[https://www.youtube.com/watch?v=fqOpaCS117Q](https://www.youtube.com/watch?v=fqOpaCS117Q)

TL;DR: They achieved a peak of 56M QPS and a sustained 38M QPS while
processing market data.

(disc: I work at Google)

