

My [Google] Compute Engine Pipe Dream - tghw
http://bjk5.com/post/26346827037/my-compute-engine-pipe-dream

======
ChuckMcM
I read this, and I wonder "What happens to your company when Google goes
away?"

It seems like Google has everyone gunning for it, from anti-trust, to privacy,
to patents, to governments. Every tech company that has been in a war of
attrition with the world has paid a heavy price. IBM and Microsoft were
relatively recent examples. The battles wear on the company and while shields
are at maximum now, they erode over time. People get burned out from fighting,
and governments change.

I have no idea what the Google of 2020 will look like compared to the Google
of 2010. But having been at Sun as a 'startup' and watched it rise, flare up,
and then fade away, I realize that tech companies have a horrible track record
of permanence. The list is long: Compaq (the unkillable titan), DEC (the
company that 'invented computing for the masses'), 3Com, Tandem, Cabletron,
Ungermann-Bass, Telebit, etc.

Google is on a path to join them. Google is intensely protective about how
they do what they do. A common complaint on the ex-Google email lists is
"Where can I find feature <x> in the FOSS or even enterprise software world?"
When they slip below the waves, there is a very real possibility that stuff
running on their infrastructure not only won't be extractable (due to lack of
access) but, even if it were, could not be recreated elsewhere.

I see Google's CE/AE and Amazon's EC2/AWS as a quick way to demonstrate your
product has 'legs'. If you can afford to run it on their stuff and still make
a profit, then once you get above about 500 'instances' you can afford to run
it yourself as well. The downside is that once you get that validation, it's
probably a good idea to start planning your migration off of them.

------
kanwisher
7 terabytes seems like a lot of data; perhaps the author is counting blob
storage also. In which case, surely S3 is already the largest blob storage
database.

~~~
kamens
No, but I'm counting our indexes which store tons and tons of denormalized
data (Google counts these when assessing datastore size). If you completely
ignore all indexes, we're at 1TB (but even that includes some denormalized
data).

~~~
meanguy
It seems like AppEngine is saving your ass at the moment, but aren't you
worried about scale? This is sort of a classic "storage not data" problem
where you mapreduce raw data to a structured store for reporting. Are you
really still querying everything live? When do you expect this to break down?

~~~
meanguy
It broke down for me on AppEngine. I had to move data out of the store to
blobs, then use AppEngine queues to reduce the data into the store for
reporting access.

Basically they promised me what they promised you and, after I got past a few
TB of real data, the whole thing blew up.

Also, what "front end user apps" are you unable to write on AppEngine itself
that require something like EC2? Splatting data out the HTTP hole was the
least of my worries.

~~~
kamens
I'm not sure what you mean. Do you mean "why would a team choose EC2 over App
Engine?" I never claimed that you are unable to do anything specific on App
Engine.

~~~
meanguy
Your post confused me because it said a lot of things about App Engine's
datastore that conflicted with my direct experience. Khan Academy is one of
the few sites that I'm excited about at the moment, so I'm concerned.

I chose AppEngine because I was very much aware of the issues around big data
and I thought I could avoid having to deal with it. I came away from your post
with the feeling that you may be underestimating what you're up against. Step
one: look at your data size and querying cost every day!

Right now you can access the datastore externally via the remote_api shim or
an API you put on your app. Performance isn't great. (An OData-style HTTP
interface to the datastore seems like an obvious addition.)

Specific to my query: you say you're excited about Google's EC2 equivalent.
I'd be more excited about the managed Hadoop that's likely the next step along
your dev path whether you're aware of it yet or not. Custom mapreduce
operations against the Google App Engine datastore, ironically, really suck
and are really expensive.

So... was this general excitement or is there something specific you want to
do with App Engine but you can't yet? And have you estimated out the
transactional costs for walking across your full record set even if they gave
you access to it?
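That transactional-cost question can be sketched as back-of-envelope arithmetic. Note the price constant below is a placeholder assumption for illustration, not a figure quoted from Google's rate card:

```python
# Back-of-envelope estimate of the cost to walk an entire datastore
# record set. The price constant is an assumed placeholder -- check
# your provider's actual per-operation rates.
PRICE_PER_100K_READS = 0.06  # USD per 100,000 read ops (assumed)

def full_scan_cost_usd(num_entities, read_ops_per_entity=1):
    """Estimated dollar cost of reading every entity once."""
    total_ops = num_entities * read_ops_per_entity
    return (total_ops / 100000.0) * PRICE_PER_100K_READS

# e.g. a mapreduce touching 100 million entities, one read op each:
print(full_scan_cost_usd(100000000))
```

Even at pennies per hundred thousand operations, the dollars add up fast once you multiply by every entity and every index you touch, which is the hidden cost of "walking across your full record set".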

You're likely going to find yourself stuffing at least some things in a SQL
store and talking to that.

~~~
kamens
Ah. Well, first of all, we have already gone through the pain of building a
pipeline to export the majority of our heavy data analytics to a Hadoop/Hive
setup on EC2. So, yes, we only use App Engine's mapreduce in certain cases
where it makes sense.

However, what I'm specifically referring to in this blog post is the ability
to keep relying on App Engine's datastore for the everyday work involved in
serving our application (forget the mapreduce stuff) while gaining more
flexibility to run non-App Engine pieces of software on the virtual servers
without suffering the App Engine-to-EC2 latency pain.

A trivial example would be Lucene (right now we have to run it on EC2 and
communicate back'n'forth). Another example would be our own memcached servers
that we control the size of.

