

Behind a Backend-as-a-Service Provider: The How and Why of Our Architecture. - spladow
http://www.spire.io/posts/our-architecture.html

======
larubbio
Full disclosure, I'm heading up the backend development for Zipline Games'
Moai Cloud (<http://getmoai.com/>). We're targeting game developers, so that
led us to some different use cases and language choices (like Lua).

Thanks for sharing this. On first read-through it sounds a lot like the
architecture mongrel2 provides (which we use), as if you had swapped out the
node.js dispatcher for mongrel2 and the Redis queue for ZeroMQ.

Have you run into any issues using Redis as a queue? If it were replicated
across machines I wonder if you could have multiple workers dequeuing the
same request. And if it's on a single server, wouldn't the blocking operation
on the list become a bottleneck?

Again, thanks for sharing, the offering looks great.

~~~
automatthew
Regarding the architectural question:

We actually tested a few designs using m2 and 0mq during our R&D phase. 0mq's
push/pull sockets provide the same "take" behavior as the Redis RPUSH/BLPOP,
so we definitely could have used m2 as the HTTP end of a similar architecture
to what we use now.

One of the considerations that led to the choice of Redis was the transparency
of the queueing and dequeueing. The messages in the queue are easily
inspected, and the Redis MONITOR command helps greatly in debugging.

More important from a design perspective was our desire to hide the HTTP
specifics from the workers. m2 pushes a JSON or tnetstring representation of
the HTTP request to its workers, but we want the task we send to the workers
to be generalized, stripped of information that is only meaningful for HTTP.
We also want to classify each task by resource type and requested action,
which allows us to use multiple queues. Multiple queues allow us to implement
workers on an ad hoc basis.

M2 could work here if our request classification depended only on the URL,
but that is a limitation we are not willing to accept. Request headers can be
very useful in dispatching, especially those related to content negotiation
(Accept, Content-Type, etc.).
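To make the dispatching idea concrete, here is a toy sketch in Python (not
spire.io's actual code; the queue-naming scheme, method-to-action mapping, and
the `text/event-stream` rule are all invented for illustration) of classifying
a request into a task queue by resource, action, and content negotiation:

```python
def classify(method, path, headers):
    """Map an HTTP request onto a (resource, action) task queue name.

    A dispatcher would then RPUSH a generalized task -- stripped of its
    HTTP dressing -- onto the returned queue.
    """
    # Resource type: first path segment, e.g. /messages/42 -> "messages".
    resource = path.strip("/").split("/")[0] or "root"

    # Requested action: derived from the HTTP method.
    action = {"GET": "read", "POST": "create",
              "PUT": "update", "DELETE": "delete"}.get(method, "other")

    # Content negotiation can refine the choice: a client asking for an
    # event stream wants a subscription, not a one-shot read.
    if "text/event-stream" in headers.get("Accept", ""):
        action = "subscribe"

    return "tasks:%s:%s" % (resource, action)

print(classify("GET", "/messages/42", {"Accept": "application/json"}))
# -> tasks:messages:read
print(classify("GET", "/messages", {"Accept": "text/event-stream"}))
# -> tasks:messages:subscribe
```

With a scheme like this, spinning up a new worker type is just a matter of
pointing a process at one more queue name.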

There is an interesting hybrid approach using mongrel2: write one or more m2
handlers that perform the same function as our node.js dispatchers. I.e. m2
sends the JSON-formatted request to an m2 handler that deserializes it,
removes the HTTP dressing, classifies the request according to type and
action, then queues a task in the appropriate queue. A worker takes the task,
does its own little thing, and sends the result to an m2 handler that knows
how to re-clothe the result as an HTTP response and deliver it to the m2
cluster.

Regarding the question about queue behavior across replicated Redises:

I do not know for certain, but I do certainly hope it is not possible for an
item in a Redis list to be popped by more than one client, no matter how the
replication is configured.

With our architecture, we could relieve at least some of the strain on the
task/result messaging system by using a cluster of Redis servers for the task
queues. Each queue server might have its own cadre of workers listening for
tasks. The return trip (getting a result from a worker back to the HTTP front
end) is a little trickier, because it matters which HTTP server is holding
open the HTTP connection. You could use PUB/SUB (which I believe is how m2
currently does it), or each HTTP server could be popping results from its own
result queue.
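The "each HTTP server pops from its own result queue" variant can be sketched
in a few lines of Python (a toy in-memory stand-in for Redis lists; the queue
names and the `reply_to` field are invented for illustration):

```python
from collections import defaultdict, deque

# In-memory stand-in for Redis lists, keyed by queue name.
queues = defaultdict(deque)

def rpush(name, item):
    queues[name].append(item)

def lpop(name):
    return queues[name].popleft() if queues[name] else None

# Front end "web-1" enqueues a task tagged with its own result queue.
rpush("tasks:messages:create",
      {"body": "hi", "reply_to": "results:web-1"})

# A worker on any machine takes the task, does its own little thing,
# and routes the result back to wherever the task says it came from.
task = lpop("tasks:messages:create")
rpush(task["reply_to"], {"status": 201})

# Only web-1 pops from results:web-1, so the result lands on the server
# that is still holding the open HTTP connection.
result = lpop("results:web-1")
```

The same tagging works unchanged if the task queues are spread over a cluster
of Redis servers, since the `reply_to` travels with the task.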

When using a single Redis server, the only hard limitation we have seen with
using the BLPOP operation is the number of client connections Redis can keep
open. In case it's not clear, the BLPOP is blocking for the client, not the
server.

~~~
djb_hackernews
Nice post.

Do you do anything special to make the queue your workers are BLPOPing from
durable?

Is there a reason you didn't use Redis pub/sub? Seems like the perfect use
case.

~~~
automatthew
The "queue" is merely a Redis list. Durable by default.

Redis PUB/SUB is not suited for the task queues we use, because any number of
subscribers will receive the messages. We want to guarantee that only one
worker will act upon each message.
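The difference is easy to demonstrate with a toy simulation in Python (using
the standard library's thread-safe queue as a stand-in for a Redis list; this
is an illustration, not spire.io's code): however many workers are blocked on
a take, each item is consumed exactly once, whereas PUB/SUB would deliver
every message to every subscriber.

```python
import threading
import queue

tasks = queue.Queue()
taken = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()      # blocking take, like BLPOP
        if item is None:        # shutdown sentinel
            break
        with lock:
            taken.append(item)

# Three workers all listening on the same queue.
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()

for i in range(10):
    tasks.put(i)
for _ in threads:               # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()

# Ten tasks in, ten consumptions out -- no task handled twice.
print(sorted(taken))            # -> [0, 1, 2, ..., 9]
```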

~~~
djb_hackernews
Err, I guess I meant something else. For instance, what happens if one of the
workers goes down after popping? The message is lost, right?

I think the architecture I invented for you in my head isn't near the
reality. I imagined you were using the workers to take incoming published
messages and push them into the queues that subscribed connections are
popping off of, effectively building your own fanout. In that context, I was
wondering why not just use pub/sub, which would handle the fanout and get rid
of the entire worker model.

Thanks for the reply!

------
stefanve
Maybe a bit off topic,

but is it just me, or is the BaaS term just a marketing ploy, or at least
unnecessary?

I think everything fits within the IaaS, PaaS, and SaaS models.

~~~
reinhardt
They're not the first and they won't be the last; just search for '"as a
service" -software -platform -infrastructure -saas -iaas -paas' and weep (or
laugh). The "as a service" meme is begging to be parodied if it hasn't been
done already.

~~~
kozubik
I don't understand the business model at all.

I understand the attraction of implementing web-based "messaging" (chat) in
javascript. But why wouldn't I just point that javascript back at myself?

Why would I route the product of JS-based chat through a third party when it
could just communicate with the server it got the HTTP from in the first
place?

My guess is that this is for folks that don't have any control over their back
end - it's just a web serving black box, and this is just some more content to
paste into it. Is that about right ?

The missing piece, though, is the revenue model - the users who would generate
more than 30 million messages in a month are the same users who actually might
have their own back end, and the wherewithal to use it. I would think if you
need to use third party javascript snippets, you're ipso facto a smaller,
lower volume user ...

~~~
nl
_The missing piece, though, is the revenue model - the users who would
generate more than 30 million messages in a month are the same users who
actually might have their own back end, and the wherewithal to use it_

This is true, but if you had a service with the potential to generate, say,
50 million messages per month, would you spend $60/month and use this, or
many thousands of dollars to develop your own?

(Also, note that a big market for this is mobile, not just javascript on
websites)

~~~
kozubik
Ok, fair enough. I'm still wrapping my head around JSAI (javascript as
infrastructure) so bear with me ...

------
knwang
Typo on the front page:

We run our servers on Amazon Web Services and use their elastic load balancer
(which is were we terminate SSL).

which is _where_ (?)

~~~
lolcatstevens
We've been bitten by EC2 instances having issues accepting incoming
connections (multiple HAProxy boxes in TCP mode to an STunnel cluster), and
we've never had that issue in testing with ELB. ELB also beats our failover
time when we lose an EC2 machine.

But ELB is in no way a permanent part of our infrastructure (nothing is
permanent), especially as we move toward supporting technologies such as SPDY
on spire.io or, for the right customer requirement, SSL throughout the
network stack. We're also fond of Stud running on our internal servers. I do
think ELB is the right tool for our cloud today.

~~~
sdepablos
Haven't you run into performance issues terminating SSL on ELB? For me the
performance is so-so: I'm using a 2048-bit key and it seems I hit the maximum
requests-per-second limit pretty fast. There are a couple of threads
regarding this issue on the AWS forums, where a user did a really exhaustive
test of ELB and even got an Amazon engineer to look into the issue:

<https://forums.aws.amazon.com/thread.jspa?messageID=327283>
<https://forums.aws.amazon.com/thread.jspa?messageID=327715>

~~~
lolcatstevens
We're looking pretty good in AWS West 1a. The second thread you linked shows
great performance from markdcorner's second load balancer -- the
[https://forums.aws.amazon.com/servlet/JiveServlet/download/3...](https://forums.aws.amazon.com/servlet/JiveServlet/download/30-89413-335305-6425/Both-ELB-Latency-Avg.png)
image -- vs the original ELB, which SpencerD@AWS describes as a custom ELB
with some sort of customer-requested secret sauce (maybe some sort of "slow
start" to some backend servers?). Indeed, we have reached out to AWS a few
times in the past and had some magic done to our ELBs (at the time, ciphers
and removing SSL v2).

We have seen issues with performance on ELB which is why we originally went
with TCP mode HAProxy on the edge of our stack to a cluster of STunnel
servers, but again reliability was an issue here and our ELB performance with
up to 10K rps looks great in benchmarks. Past 10K we are considering separate
dispatchers behind a separate ELB. But at that point I am also tempted to,
frankly, switch to our own metal.

Curious: are you comparing ELB performance vs High I/O EC2 instances (say
m1.xlarge) open to the world?

~~~
sdepablos
Well, I didn't know you could fine-tune your ELBs. In fact, reading the
Developer Guide
([http://awsdocs.s3.amazonaws.com/ElasticLoadBalancing/latest/...](http://awsdocs.s3.amazonaws.com/ElasticLoadBalancing/latest/elb-dg.pdf),
page 46), it seems it's now possible to choose SSL protocols and ciphers via
the web interface (and I suppose also via the API).

Regarding the comparison you suggest, we are too short-handed right now not
only to do that kind of comparison but even to think about managing our own
load balancers ;) Thanks for the suggestion anyway.

------
gyaresu
Your chat demo is broken. <http://www.spire.io/examples/chat/>

~~~
spladow
I know this is the least satisfying support answer possible, but the chat
appears to be available for me right now. I'd like to help figure out the
problem; can you give me some details, like your OS and browser?

~~~
gyaresu
Opera Version: 11.62 Build: 1347 Platform: Mac OS X System: 10.7.3

[https://www.dropbox.com/s/nhqolr7q11vb4hj/Screen%20Shot%2020...](https://www.dropbox.com/s/nhqolr7q11vb4hj/Screen%20Shot%202012-05-04%20at%2010.38.54%20AM.png)

------
rwolf
I like the apologetics for node.js in the first section. A similar argument
for why you are using JRuby instead of node.js for the "backend
whatchamajigs" would be nice.

~~~
automatthew
The short answer is: Use of synchronous Redis calls made it easier to develop
and to experiment with storage patterns.

Workers can be written in any language, which was another major design
consideration. We have about a dozen types of worker right now; when we want
to evaluate a new language, we can port just one. Alternatively, we can write
any new worker types in an arbitrary language.

Thus we're not married to JRuby.

------
pydanny
I saw a presentation on it last night and it looked pretty awesome. We're
planning to use their Messaging API in our upcoming project.

~~~
spladow
Cool. What are you planning to build, if you don't mind me asking?

~~~
pydanny
One of our client projects involves having customer service staff talk and
resolve payment issues with paying users. The messaging API means we have one
less thing to worry about, allowing us to focus on handling payment resolution
rather than the communication side of things.

------
skrebbel
Software architecture stories without pictures make me sad :-(

