
Spanner: Becoming a SQL System [pdf] - elvinyung
http://dl.acm.org/authorize?N37621
======
elvinyung
This "modern" Spanner feels very different from the one we saw in 2012 [1].
Some interesting takeaways:

* There is a native SQL interface in Spanner, rather than a separate SQL layer on top, a la F1 [2]

* Spanner is no longer on top of Bigtable! Instead, the storage engine seems to be a heavily modified Bigtable with a column-oriented file format

* Data is resharded frequently and concurrently with other operations -- the shard layout is abstracted away from the query plan using the "distributed union" operator

* Possible explanation for why Spanner doesn't support SQL DML writes: writes are required to be the last step of a transaction, and there is currently no support for reading uncommitted writes (this is in contrast to F1, which _does_ support DML)

* Spanner supports full-text search (!)

[1]
[https://research.google.com/archive/spanner.html](https://research.google.com/archive/spanner.html)

[2]
[https://research.google.com/pubs/pub41344.html](https://research.google.com/pubs/pub41344.html)
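
To make the "distributed union" point concrete, here is a toy sketch of how I read it: the plan names one operator, and that operator fans the same subquery out to whatever shards exist at execution time, then concatenates the results. Everything here (`Shard`, `scan`, `distributed_union`) is made up for illustration and is not Spanner's actual API or implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, int]


class Shard:
    """Hypothetical shard: just an in-memory slice of a table."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = list(rows)

    def scan(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Local filter, evaluated where the data lives.
        return [row for row in self.rows if predicate(row)]


def distributed_union(
    shards: List[Shard], subquery: Callable[[Shard], List[Row]]
) -> List[Row]:
    """Run the same subquery on every shard and concatenate the results.

    The caller never mentions shard boundaries, so the shard list can
    change between executions without touching the plan.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        # pool.map preserves the order of `shards` in its results.
        parts = list(pool.map(subquery, shards))
    return [row for part in parts for row in part]


if __name__ == "__main__":
    rows = [{"id": i, "v": i * 10} for i in range(6)]
    query = lambda shard: shard.scan(lambda r: r["v"] >= 20)

    before = [Shard(rows[:3]), Shard(rows[3:])]                # two shards
    after = [Shard(rows[:2]), Shard(rows[2:4]), Shard(rows[4:])]  # resharded

    # Same query, different layouts, same answer.
    print(distributed_union(before, query) == distributed_union(after, query))
```

The only thing the sketch is meant to show is that the query is written once against the operator and the layout is resolved at run time.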

~~~
iambvk
Did not digest the paper yet. Does it provide nested transactions?

My understanding from the 2012 paper was that it doesn't support nested
transactions, even at the storage layer.

Can anybody with insider knowledge say whether it was even a requested feature
from Google devs internally?

~~~
elvinyung
Since writes are required to be the last step of a transaction, I suspect
there wouldn't be much of a point in having nested transactions.
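
A rough sketch of what I mean, assuming writes are buffered on the client and only applied at commit. The `Store` and `Transaction` classes are hypothetical and only model the behavior described above; they are not Spanner's client API.

```python
from typing import Dict, List, Optional, Tuple


class Store:
    """Hypothetical committed state: a plain key/value map."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}


class Transaction:
    """Toy transaction whose writes are buffered until commit."""

    def __init__(self, store: Store) -> None:
        self._store = store
        self._mutations: List[Tuple[str, int]] = []
        self._committed = False

    def read(self, key: str) -> Optional[int]:
        # Reads only see committed state; buffered mutations are invisible.
        return self._store.data.get(key)

    def buffer_write(self, key: str, value: int) -> None:
        if self._committed:
            raise RuntimeError("transaction already committed")
        self._mutations.append((key, value))

    def commit(self) -> None:
        # Applying the buffered writes is the last step of the transaction.
        if self._committed:
            raise RuntimeError("transaction already committed")
        for key, value in self._mutations:
            self._store.data[key] = value
        self._committed = True


if __name__ == "__main__":
    store = Store()
    store.data["balance"] = 100

    txn = Transaction(store)
    txn.buffer_write("balance", 50)
    print(txn.read("balance"))    # still the committed value
    txn.commit()
    print(store.data["balance"])  # the write is visible only after commit
```

In this model an inner transaction could never observe the outer one's pending writes, so there is little left for nesting to do.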

------
rusanu
Tangential.

For me the fascinating thing is looking at the list of authors and recognizing
so many from the 2005-2012 Microsoft SQL Server team. Folk I know personally as
exceptional performers. Same when I look at Aurora papers. I see this as the
result of Ballmer's famous HR initiatives and the massive brain drain that
occurred at Microsoft around 2010-ish.

~~~
pavlov
Can you tell us some more about the HR initiatives? (I would just google it,
but it's a bit unclear what I should be looking for.)

~~~
rusanu
You can point back to the 2004 benefits overhaul [0], the famous Towels story
[1], and low compensation rates compared to Google, Amazon and Facebook [2].
Just go over the minimsft.blogspot.com posts from that time.

Add to this the lack of vision and direction, catastrophic acquisitions,
dismal flagship product releases. At the time there were running jokes about
the Inbox filling up with "After 15 years, is time to send that email" subject
lines...

    
    
      [0] http://old.seattletimes.com/html/businesstechnology/2001938654_microsoft26.html  
      [1] http://www.zdnet.com/article/microsoft-brings-back-the-towels-5000148135/  
      [2] http://minimsft.blogspot.com/2006/03/internal-microsoft-compensation.html

------
speedplane
I was at the Google Cloud conference a few months ago and spoke to a few
engineers there about Spanner. When I asked them how it would affect their
other storage options (e.g., Datastore, Cloud SQL, etc.), they said that over
time, all of their internal storage systems would be moved over to Spanner.
One engineer's words were "you'll be using Spanner whether you know it or not."

This engineer was clearly a cheerleader for the product, so I'm dubious as to
whether that will actually happen, but it's clear that they have quite a bit
of confidence in it.

~~~
Artemis2
That sounds a lot like the announcement of Azure Cosmos DB:
[https://docs.microsoft.com/en-us/azure/cosmos-db/introduction](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction)

> we made the service available externally to all Azure Developers in the form
> of Azure DocumentDB. Azure Cosmos DB is the next big leap in the evolution
> of DocumentDB and we are now making it available for you to use. As a part
> of this release of Azure Cosmos DB, DocumentDB customers (with their data)
> are automatically Azure Cosmos DB customers.

From my understanding, DocumentDB ≈ Dynamo, while Cosmos DB would be closer to
Cloud Spanner.

------
sheeshkebab
Is anyone outside of Google or msft using these proprietary k/v/SQL/JSON
databases?

As a developer I can't bring myself to code against something I can't install
on my laptop. And as far as enterprises go, they won't use any data store that
can't run their financials/hr/ldap/sharepoint thing.

So, who uses them and for what?

~~~
tyingq
Your viewpoint is shared by many, but there are lots of enterprises using
proprietary cloud features. They either use an abstraction layer for running
on a laptop, or otherwise a CI process that kicks off dev instances and test
cases on demand, forcing you to be online when you check things in. That's not
terribly new. Teams have had to find solutions/stand-ins for things like AWS
load balancers, lambdas, certificate servers, etc.

Cloud Spanner, though, being fairly new and unusual (SQL, but no
INSERT/UPDATE), doesn't yet have a big-name customer. Jda.com and quizlet.com
were their reference customers.

------
gravypod
Is Spanner free/open source software? Can we look at the code?

~~~
slackingoff2017
This is part of a worrying new trend. Increasingly you can't buy software
anymore, only rent.

Innovation is being kept from scrutiny, hidden behind closed doors. That's the
kind of thing patents were meant to prevent back when the system wasn't broken.

Google is one of the better players in this regard, at least telling the world
what they're up to. Try to figure out how something like Amazon's systems work
and you'll run into a deafening wall of silence.

Funny that we're so willing to trust these "clouds" when we know next to
nothing about their internal workings. I don't think the honeymoon will last
forever. Somebody will eventually abuse their position and within a few years
everyone will be "on prem" again.

~~~
brandur
> _This is part of a worrying new trend. Increasingly you can't buy software
> anymore, only rent._

More or less agreed, but you'd probably have to concede that the existence of
papers like this one and some of Google's other Spanner publications is
admirable; they're being more open about the system's design than they have to
be.

Of course Google knows that the system's secret sauce is not the concept
itself, but the cost of its implementation, the infrastructural harness to
support it, and the resources to reliably operate it. Even with a rough
understanding of how Spanner works, it's still going to be difficult to ever
migrate off it for practical reasons alone: who else is going to be able to
build and run an alternative?

I have huge respect for what the Spanner team is doing, but this is a reason
that Citus [1] is also very interesting to me right now. You could conceivably
start out with nothing but Postgres and migrate into a Citus cluster when (and
_if_ ) you need to.

If a point comes where you realize that you need out, you could either (1) see
if you can scale back down to simple Postgres, (2) host your own cluster on
your own infrastructure with the Citus source code, or (3) migrate onto your
own Postgres sharding scheme à la Instagram. At no point do you lock yourself
into custom gRPC APIs which are going to be ~impossible to get off of.

GCP and AWS provide hugely useful foundational IaaS, but they're incentivized
to move beyond that layer and provide more custom solutions that (1) provide
better margins, and (2) lock you into their services. As people and companies
building on top of these clouds, we should be looking for whatever
opportunities we can to keep our stacks generic so that AWS <-> GCP <-> Azure
migrations are possible, even if only as a last resort.

[1]
[https://www.citusdata.com/product/cloud](https://www.citusdata.com/product/cloud)

~~~
slackingoff2017
Does this feel like rent seeking to you? The internet is not young anymore. It
feels like cloud hosting is little more than the giants allowing controlled
competition as long as you rent their servers. The platforms are just
extensions of the massively powerful systems they use internally.

I wonder what would happen if the open source community built a viable
alternative to the cloud IaaS. Like OpenStack but not a failure :). OpenFlow
has shown promise and could form the core for an open IaaS. Network
virtualization is the hardest part.

~~~
speedplane
> I wonder what would happen if the open source community built a viable
> alternative to the cloud IaaS. ... Network virtualization is the hardest
> part.

I would think that the hardware itself is the hardest part. Companies move to
the cloud because it reduces their internal ops team, and they can scale
hardware up and down extremely easily.

~~~
slackingoff2017
I respectfully disagree. Every company I've been part of moving to the cloud
already had plenty of hardware. It was the ease of provisioning VMs, backups,
deployment, and networking that really convinced them.

VMware comes close to offering the same thing on premises but Oracle is too
obsessed with wringing money from it to let it reach its full potential.

------
guilt
Did they just write it during CockroachDB's release?

So sketchy.

