
How Streak built a graph database on Cloud Spanner to wrangle billions of emails - mooreds
https://cloud.google.com/blog/products/databases/how-streak-built-a-graph-database-on-cloud-spanner-to-wrangle-billions-of-emails
======
latchkey

> "Between Google App Engine and Cloud Datastore, we’ve never had to have an explicit infrastructure on-call rotation."

This.

So many people gloss over this fact. Hiring top-notch DevOps is extremely hard
and expensive for a startup (the best ones all have jobs and likely work for
Google). Nobody likes being on rotation (or carrying a pager).

If you can base your initial platform on something like GAE, this decision
alone will more than pay for itself over the long run.

~~~
patrickg_zill
You don't need a $350K/year worked-at-Netflix guru to set up 2 or 3 PostgreSQL
or MySQL boxes in separate datacenters and ship transaction logs around.

I think many people overestimate, by a huge factor, how tough it is to run and
maintain things. The cloud providers, of course, love this!

~~~
forgingsheep
You are correct. For this system, we didn't go that route because the amount
of data involved and the rate at which data are ingested are both large enough
to make maintaining our own fleet of database servers operationally
challenging.

We have a small, experienced team with a lot of operational experience, so we
know how we could build this with other technologies and how to maintain it
(and how much time it would take). All of that would be time away from
actually building the product, and it would also change some of the character
of what is expected for the job.

We have worked at some of the larger Bay Area companies, and there are some
really nice aspects of those jobs. One of them is having a strong enough
operational base so that you can concentrate on planning and building a thing,
instead of always being caught in a reactive loop.

Postgres is always going to have a special place in my heart, and we do use it
as well; we are just careful to use it in operational ways that won't ruin our
weekends. If this feature were the one core thing that made Streak special, we
likely would have built it with Postgres and Citus, but it is just one of many
features, and going with Spanner lets us treat it that way.

------
porker
No details of the "graph database" bit. Sounds like it's a graph only in the
sense that any RDBMS is, without the specialised query language or helpers
that a graph database provides.

~~~
frew
(I'm the author of the post - just woke up and saw this discussion. Hi!)

Definitely wasn't meaning this post's title to be clickbait-y. The intention
wasn't to hold out Spanner as a purpose-built graph database, but rather to
talk about using it for a graph-y database use case. Will put together a
follow-up with some more information, but in summary:

* yeah, no built-ins other than relational SQL for graphs
* the key things that make this work are the ability to easily construct
  global indexes that aren't sharded by the primary key and reasonably fast
  joins between them
* it's also helpful that Spanner does a reasonable job of parallelizing
  queries (e.g. a lot of times we'll get a 15x increase in speed vs. a
  sequential plan)
* we then do the fan-out across the graph in our Java client

~~~
porker
Thank you for engaging here!

With respect, the title is wrong, as that's not a graph database. "How Streak
built a graph on Cloud Spanner" would be more accurate (and even then, it
needs the follow-up post with some actual details).

------
logiclabs
Isn't this just 4 many-to-many tables and querying using joins, not a graph
database?

~~~
syastrov
Indeed. They admit that it could be done with a traditional relational
database but argue that it wouldn’t be able to scale in the same way as with
Cloud Spanner. But they are making it sound as though Cloud Spanner has some
graph data model.

My interpretation is that they are arguing that they would have to sacrifice
the ability to fully model all connections in the system or answer any kind of
query (they talk about doing some things “per-user”) because they would have
scalability issues or would have to do manual sharding of data. They are
considering the full model with all connections as a “graph database”. This
terminology is confusing and seems to ascribe extra capabilities to Cloud
Spanner.

~~~
frew
(I'm the author of the post - just woke up and saw this discussion. Hi!)

The intention wasn't to hold out Spanner as a purpose-built graph database,
but rather to talk about using it for a graph-y database use case. Will put
together a follow-up with some more information, but in summary:

* yeah, no built-ins other than relational SQL for graphs
* the key things that make this work are the ability to easily construct
  global indexes that aren't sharded by the primary key and reasonably fast
  joins between them
* it's also helpful that Spanner does a reasonable job of parallelizing
  queries (e.g. a lot of times we'll get a 15x increase in speed vs. a
  sequential plan)
* we then do the fan-out across the graph in our Java client

------
forgingsheep
I'm an engineer at Streak and the primary creator of Ratchet, so I'm happy to
answer any questions about the library that people might have.

------
sandGorgon
This is interesting, but how do you do the actual graph queries?

The ones built on Cassandra do this using Spark, and Neo4j has a built-in
engine (Gremlin). Any examples of how you map graph queries to relational
table structures? Especially the ones that need traversal.

~~~
frew
(I'm the author of the post - just woke up and saw this discussion. Hi!)

Will put together a follow-up with some more information, but in summary:

* yeah, no built-ins for graph operations in Spanner other than relational
  SQL for standard joins
* the key things that make this work are the ability to easily construct
  global indexes that aren't sharded by the primary key and reasonably fast
  joins between them
* it's also helpful that Spanner does a reasonable job of parallelizing
  queries (e.g. a lot of times we'll get a 15x increase in speed vs. a
  sequential plan)
* we then do the fan-out across the graph in our Java Spanner client - each
  distributed SQL index read takes ~10 ms so we can do multiple round trips
  of graph traversal in the client
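
To make the client-side fan-out idea concrete, here is a minimal sketch of a level-by-level traversal that issues one batched edge lookup per hop. It is written in Python for brevity rather than Java, and it is not Streak's code: `make_in_memory_lookup`, `fan_out`, and the edge data are all hypothetical, and the in-memory dictionary only stands in for what would be an indexed SQL read against Spanner.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical signature for one batched index read: given a set of node ids,
# return the outgoing neighbours of each. In a real system this would be a
# single SQL query against an edge index (assumed to be one network round trip).
EdgeLookup = Callable[[Set[str]], Dict[str, List[str]]]


def make_in_memory_lookup(edges: Iterable[Tuple[str, str]]) -> EdgeLookup:
    """Build a stand-in for the database index from (source, target) pairs."""
    index: Dict[str, List[str]] = {}
    for src, dst in edges:
        index.setdefault(src, []).append(dst)

    def lookup(nodes: Set[str]) -> Dict[str, List[str]]:
        return {node: index.get(node, []) for node in nodes}

    return lookup


def fan_out(start: str, lookup: EdgeLookup, max_hops: int) -> Tuple[Dict[str, int], int]:
    """Breadth-first traversal driven by the client.

    Returns (hop distance per reachable node, number of lookups issued).
    Each hop sends the whole frontier in one batched lookup.
    """
    distance = {start: 0}
    frontier = {start}
    round_trips = 0
    for hop in range(1, max_hops + 1):
        if not frontier:
            break  # nothing left to expand
        neighbours = lookup(frontier)
        round_trips += 1
        next_frontier: Set[str] = set()
        for node in frontier:
            for nxt in neighbours.get(node, []):
                if nxt not in distance:  # skip nodes already visited
                    distance[nxt] = hop
                    next_frontier.add(nxt)
        frontier = next_frontier
    return distance, round_trips


if __name__ == "__main__":
    # Made-up edges: contact -> thread -> contact.
    demo = make_in_memory_lookup([
        ("alice", "thread-1"),
        ("thread-1", "bob"),
        ("bob", "thread-2"),
    ])
    print(fan_out("alice", demo, max_hops=2))
```

Because the frontier is batched, the number of lookups grows with traversal depth rather than with the number of nodes visited. Taking the ~10 ms per index read figure above at face value, a two-hop traversal would spend roughly 20 ms waiting on lookups.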

~~~
sandGorgon
Would love to know more details about the graph traversal/fan-out part. This is
the stuff that most of us go to Spark for, so would love to know how Spanner
makes this easier.

------
topicseed
What is the graph engine used to query the nodes and edges with traversals?

