Ask HN: Node.js+Redis+CouchApp+BigCouch+CouchDB+TurnkeyLinux = Nirvana?
11 points by lzw 2140 days ago | 28 comments
We've been very happy with AppEngine, especially since I hate system administration and maintenance, but we've got features coming up that simply can't be done on AppEngine. I've also got some "hacker time" coming up, about two months with no deliverables, so I can work on a new infrastructure. I'm ruminating on a solution, which I'll share below. I'm keen to know if any of this is clueless or foolish, as I have not worked with any of these new hot technologies before, though I do know Erlang and JavaScript pretty well.

All these technologies are to be assembled into a pre-built appliance. Most of it is the set of apps described in the title, but the real magic comes from a couple relatively static applications written in node.js. To increase capacity, we add more nodes. Should be zero configuration.

Under node.js there are services I have to write:

cluster management-- which is a catch-all for connecting new nodes to the cluster, or migrating off of nodes going away.

authoritative DNS-- the cluster is its own name servers. This way load can be balanced, and we can dynamically add customers' custom domains to our app and support them right away.

web platform with caching-- this will run couch apps: custom JavaScript stored in the couch but run outside it, under node.js.

Simple queue service, and cron service-- using node.js and redis.

This is tied into redis to cache dns records, static or unchanged pages using the etag couch provides, etc. The cluster manager reports the local node's CPU load in redis as well, so the dns knows to return addresses that direct traffic to nodes in the cluster that are less busy. This way apps can add customer dns so that our neophyte customers can get set up by merely putting our name servers into their registrar.
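The load-aware answer ordering described above could look roughly like the following. This is only a sketch: the `node_loads` mapping, the `order_by_load` name and the example addresses are assumptions made for illustration, and a plain dict stands in for the readings the cluster manager would store in the cache.

```python
def order_by_load(node_loads, max_answers=3):
    """Return node addresses sorted from least to most busy.

    node_loads maps an IP address to the most recent CPU load reported
    by that node's cluster manager (a hypothetical data shape).
    """
    # Sort by load first, then by address so ties are broken predictably.
    ranked = sorted(node_loads.items(), key=lambda item: (item[1], item[0]))
    return [address for address, _load in ranked[:max_answers]]


# Example load readings, as they might be read back from the shared cache.
loads = {"10.0.0.1": 0.82, "10.0.0.2": 0.15, "10.0.0.3": 0.40, "10.0.0.4": 0.95}
answers = order_by_load(loads)
print(answers)  # ['10.0.0.2', '10.0.0.3', '10.0.0.1']
```

The DNS service would return these addresses in this order, so the busiest node is left out of the answer.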

Most of the product functionality will be written in CouchApps, except for places where couchapp is limited, like showing multiple views on a page, and checking for user authentication. Don't want one customer to be able to see another's private info. We'll do that, and cron-like tasks, in node.js.

All of this is stored in couchdb, which is distributed dynamo style by BigCouch. Like couchapp, we have JavaScript in design documents; this makes deployment really easy. Part of the node.js web platform will support a/b testing and beta/production splits.

In this way, new node.js level functionality can be added by putting functions in JavaScript into documents and adding urls to the URL routing scheme that point to these handlers, all loaded from the couch, but executed in node.js. So, deployment still is just a push into the couch.
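To make the "handlers live in documents" idea concrete, here is a minimal sketch in Python rather than JavaScript. A dict stands in for the couch, and the document ids, the `handle` entry point and the `dispatch` function are all hypothetical names chosen for the example.

```python
# Hypothetical documents: a routing table plus handler source code.
documents = {
    "routes": {"/hello": "hello_handler", "/sum": "sum_handler"},
    "hello_handler": (
        "def handle(request):\n"
        "    return 'hello ' + request.get('name', 'world')\n"
    ),
    "sum_handler": (
        "def handle(request):\n"
        "    return str(request['a'] + request['b'])\n"
    ),
}


def load_handler(doc_id):
    """Compile the source stored under doc_id and return its handle()."""
    namespace = {}
    # exec runs arbitrary code, so this is only safe for trusted documents.
    exec(documents[doc_id], namespace)
    return namespace["handle"]


def dispatch(path, request):
    """Look up the route, load the handler from its document, and call it."""
    doc_id = documents["routes"].get(path)
    if doc_id is None:
        return "404 not found"
    return load_handler(doc_id)(request)
```

Deploying a new version of a handler is then just overwriting its document, for example `documents["hello_handler"] = new_source`. The front end process itself never restarts.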

The reason for doing things this way is to make continuous deployment easier, but also to never have to reboot or change the servers much. The node.js services will be well tested and unchanging.

All of this will be built up using turnkey linux into an appliance. This can then be deployed on amazon or vps or wherever without the need for a load balancer in front, since it does its own dns and the authoritative dns is the cluster itself.

This should reduce maintenance to the bare minimum, and I can push out a new set of appliances over time and slowly migrate traffic to them once or twice a year when I need to update core functionality.

I might go with webmachine and Riak instead of node.js and couch if I can find a compelling reason to, but the big holdup there is the question of how much of a hassle it would be to attach beam files to documents and get webmachine to run them.

I know this is unconventional. I'm up for criticisms and suggestions, especially if there is a way to avoid writing all this code in node.js. Please don't tell me it is wrong because it's unconventional or that I don't need this flexibility. I want to just focus on the business and the product. If I could get this kind of capability from a hosted solution, I would, but I can't. (For instance, I need to download zip files nightly, unpack them, and feed the data into a database, something I can't do on app engine, or cloudant, etc.)

Thanks. What do you think?
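For the nightly zip job mentioned in the post, a minimal sketch of the unpack-and-load step might look like this. It assumes one record per line of text, uses sqlite3 purely as a stand-in for the real database, and builds a tiny archive in memory instead of downloading one. The table layout and the `load_zip_into_db` name are invented for the example.

```python
import io
import sqlite3
import zipfile


def load_zip_into_db(zip_bytes, connection):
    """Unpack every file in the archive and store one row per line."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, line TEXT)")
    count = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Members are read line by line, without unpacking to disk.
            with archive.open(name) as member:
                for raw in member:
                    line = raw.decode("utf-8").rstrip("\r\n")
                    if line:
                        connection.execute(
                            "INSERT INTO records (source, line) VALUES (?, ?)",
                            (name, line))
                        count += 1
    connection.commit()
    return count


# Build a small archive in memory to stand in for the nightly download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("feed.csv", "shoes,42\nhats,7\n")

db = sqlite3.connect(":memory:")
inserted = load_zip_into_db(buffer.getvalue(), db)
```

A multi-gigabyte archive would be opened from a file on disk rather than from bytes in memory, but the loop over members and lines would stay the same.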

This sounds like a good fit to me. I'd watch the complexity, though, and see what you can get done with the simplest thing that could possibly work, then build up, always with the end-goal in mind.

Thanks for the advice. It is my hope that I can keep the custom code to a bare minimum. There may already be a node based dns server out there, found a couple references, and am hoping the web platform is very minimal, letting me extend it later without changing the appliance itself.

As I dig into it more, I may have some contributions to CouchApp to make, but may just end up making the node part of it a separate open source project. Not sure if I'll need to hack couchapp or not.

Edit: reread your comment. Will start just with couch, then add couch apps, then add simple node server front end, look at node based frameworks, then add executing js in node from the couch, then redis for caching, then add the load balancing, logging and see what features I need for apps.

If you come up with a generic way to run show and list functions in front of Couch in node, that will be useful to many people.

>> "what do you think?"

until now i thought there were only 3 stages of startups. thanks for introducing me to the fourth kind.

.stages of a startup.

if you do nothing at all, you might start at a -1 index. you will fail because you're not in the market. if you get some traction you're up on 0. you're not making a difference yet but you're in the market and there will be competition and whoever executes best, wins. which brings me to index 1 if you're really crushing it, and finally becoming a success.

you just put yourself at -2, because you don't need either your competitors, or the market or timing to fail. you just need yourself.

the defaults right now are that you're not obligated to get past your creative integrations choices, which basically puts yourself and your teammates at -2 in my book.

i'd love it if you 'do' get across and prove me wrong though. i possibly would have been in your place 3 years back when my team & I were all roughly 23. it's safe to say that i'm now older, bruised, but more pumped & wiser and have more clarity of purpose in my hacking.


Already doing very well in the marketplace, thank you very much. Our iPad product is a huge success, so I figure when the market expands to 100 million devices rather than a couple million, I'll be spending some of my upcoming vacation time giving myself a competitive advantage. I've been an engineer longer than you've been alive. I know what I'm capable of and this won't take much time, and even if it were a total failure, I'd still have app engine.

By definition, every web startup has to assemble a web technology stack. While investigating the best choices for mine, I posted it here for feedback. This has saved me a bunch of time, already.

I wonder about people like you who feel that they must so desperately try to make others feel like failures. Projecting, maybe?

I don't have an answer for you, but I do have a question: What types of features are you implementing that can't be done with AppEngine?


Here's a list of things that may be related to features we're going to do that we can't do in app engine. What features we actually do will depend on customer response, but we're increasingly needing to do stuff that app engine doesn't support. But also, I'm uncomfortable with app engine's lock-in and odd performance limitations that require a fair bit of optimization. CouchDB makes it seem like I will write things efficiently the first time.

Anyway, it is impossible or very hard to:
-- support new domains for app engine apps. The only way to do it is to manually add them to your google apps account, which takes from 48 hours to 48 weeks and requires control over the domain, which is hard when it is a customer's domain.
-- run the apple push notification service, or any kind of long-standing process.
-- do big data processing. You can do map, but not reduce, on app engine, and you can't unzip a 25gb file and then process the results.

Much of the influence here and the focus on easy deployment are inspired by app engine, though.

Thanks for the answer. Makes sense -- especially the domain stuff, hadn't thought of that one.

Sorry for the language, the iPad is ornery about changing my words on me!

Wouldn't the ability to host GAE apps outside of Google solve the problem? GAE is open source isn't it?

(disclaimer: I have been thinking of offering this but assume others are already doing it; and don't have a "first customer" to build around, which is always a requirement for putting time into such a thing.)

I think the short answer is "no", but check out appscale. They are working on doing exactly that. Though much of the stuff behind GAE is proprietary-- like BigTable-- so these components have to be replaced. Gives some portability but not much.

The idea here is to sorta build my ideal app engine using off the shelf components as much as possible.

It seems the setup I was looking at earlier is essentially a VM with the Python and Java SDKs preinstalled and running, which is not the same thing, as you point out.

If you are going to have to do so much re-coding, though, wouldn't it make sense to start with Appscale and just add in couchdb or whatever else you like, to get the data storage you want? After all, you are already familiar with the rest of the way GAE works.

I'll look at appscale again. I'm more fond of the way CouchApp/CouchDB works than the way GAE works, though, so I'm going to try and do this with very little coding. I really think that I'll only have to write a couple hundred lines of code to bring this stack together, but I may be missing something.

I have a feeling you will be spending the next year debugging a bunch of bleeding-edge stuff. But I see you have already got your ear-plugs on:

"Please don't tell me it is wrong because it's unconventional or that I don't need this flexibility. I want to just focus on the business and the product"

Your point about these projects being bleeding edge is well taken. I picked BigCouch and CouchDB because both of them have been used in production for over a year.

I haven't vetted redis in the same way, so maybe it isn't quite stable enough. Do you think so? Maybe I should look at using memcached instead? (update: looks like redis is being used in production.)

Instead of doing a node.js based web framework and caching, conceivably I could use HAProxy, nginx or Squid as the caching front end. They'll listen to etags, right? And I could use another web framework, maybe something off the shelf... so long as I can pull script code from the database and execute it, I can replicate the CouchApp deployment style.

Node.js is new, and I am most wary of it because it involves the most customization. I may punt and get powerDNS installed and hooked up to a couchDB backend. I may put in a little python based RESTful server to do some of this, or to completely replace node.js if it proves problematic. I think I can load python code from the couch and execute it.

Personally, I'd avoid node.js. It is most definitely cool, interesting and has been proving itself useful, but (!) there are a lot of details that won't come up until you really need them. If you're focusing on business logic and you already have a rather large set of applications that are relatively "bleeding edge", then I'd stick to a language that has a bit more history on the basics.

It is the really simple things like form parsing and cookies/sessions that may not be fully fleshed out in Node.js. Not that it won't be soon, but it is a nightmare to spend time on low level issues when you don't have to. This has always been the issue trying new languages and frameworks for projects. I get part of the way there and realize that some basic aspect of web development hasn't been done. Sometimes I start to handle the issues, but if the project is something I want to finish, it usually ends up with me turning back to Python. This might be a bad reflection on myself, but when you want to get something done that doesn't need the latest/greatest tech, it is nothing but frustrating to hit your head against a wall doing basics.

Just my two cents!

Thanks for the warning. Any efficient front end could replace node.js in this design. I'm not wedded to it and it is the most recent component added.

I was trying to get out a huge concept in my head, and probably wasn't very clear. The real "web development" will be happening in CouchApp. I'm just putting node.js in front to provide a couple services, each of which I expect to be very small.

I'm expecting to write well less than 100 lines of node.js code to interface in front of CouchApp. Basically, it will just provide a quick lookup into the routes, and switch between two CouchApp handlers for certain pages depending on if the user is authenticated (Eg: if you look at the profile of user Y, it would be good to know if you're logged in as user Y and thus can edit the profile, or are some other user, and thus see only the public info for user Y.) Also, to pull results from a couple lists or shows and compose them into a single page, which has to be done outside of CouchApp... but really isn't very complicated.
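As a rough illustration of that front end logic, the sketch below picks one of two handlers based on who is logged in and then composes the fragments into one page. It is written in Python rather than node.js, and the handler functions and page template are hypothetical stand-ins for CouchApp shows and lists.

```python
from string import Template

# Page skeleton that the separately rendered fragments are dropped into.
PAGE = Template("<html><body>$header$body</body></html>")


def public_profile(owner):
    """Stand-in for the show that renders only public info."""
    return "<p>Public info for %s</p>" % owner


def editable_profile(owner):
    """Stand-in for the show that renders the editable profile."""
    return "<form><p>Editing profile of %s</p></form>" % owner


def render_profile(owner, logged_in_user):
    """Choose the handler by authentication, then compose the page."""
    handler = editable_profile if logged_in_user == owner else public_profile
    header = "<h1>%s</h1>" % owner
    return PAGE.substitute(header=header, body=handler(owner))
```

For example, `render_profile("Y", "Y")` yields the edit form, while `render_profile("Y", "Z")` or an anonymous visitor (`None`) gets only the public fragment.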

The caching bit will simply be checking to see if the etag for a page in couchapp is in redis. If it is, update its TTL to keep it fresh and send it to the user; if it isn't, get the full/new page from CouchApp, push it into redis and send it along. This will be problematic if the APIs for redis and couchapp are synchronous in their node.js drivers. Will have to check on that.
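The etag check described above can be sketched as follows. This is an in-memory illustration only: a dict stands in for redis, and `fetch_etag` and `fetch_page` are hypothetical callables standing in for the cheap etag lookup and the full page render.

```python
import time


class EtagCache:
    """In-memory stand-in for the cache; keys are etags, values are pages."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # etag -> (expires_at, page)

    def get_page(self, path, fetch_etag, fetch_page):
        etag = fetch_etag(path)
        now = self.clock()
        entry = self.entries.get(etag)
        if entry is not None and entry[0] > now:
            page = entry[1]          # hit: serve the cached copy
        else:
            page = fetch_page(path)  # miss or expired: render the page again
        # Store or refresh the entry so pages in active use stay cached.
        self.entries[etag] = (now + self.ttl, page)
        return page


# Fake fetchers that record how often the full page is requested.
fetches = []


def fake_etag(path):
    return "etag-v1"


def fake_page(path):
    fetches.append(path)
    return "<html>profile</html>"


cache = EtagCache(ttl_seconds=30.0, clock=lambda: 0.0)
first = cache.get_page("/profile", fake_etag, fake_page)
second = cache.get_page("/profile", fake_etag, fake_page)
```

The second request is served from the cache, so the page is fetched only once. Entries for stale etags are never purged in this sketch; a real cache would expire them.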

Anyway, like I said, I'd be interested in hearing of a good replacement for node.js.

Just need something I can write simple caching and routing code in... have looked at webmachine a bit, and it might be better... not sure I can pull erlang out of the couch and run it, though.

Twisted may also be a possibility.

If you can clear your mind of all the attractive tech that could potentially fill gaps, it may help isolate the essential needs you have. After reading your description I didn't have a solid idea of what you were working to accomplish so it's tough to judge.

Then add technology as necessary. I bet you'd be surprised what you could accomplish with couchdb, or node/express+a DB alone.

I'm wet behind the ears with web development (just started late last year), but look forward to seeing how you and your team make out.

Further investigation reveals that node.js is probably not ready for prime time, which is too bad given that it is trying to do the right thing. This shouldn't be too bad because powerDNS has been around a long time, and webmachine or another framework in front of the couch should let me composite multiple shows and lists. I think I may end up writing even less code now.

Also, looks like redis has some nice features, but memcached is what I really want. Want that built in LRU for expiring objects from the cache.

Redis is rock-solid. If you make sure you've got enough RAM for your data, there's no downside to using Redis.

I believe it, but I was just going to be using it as a cache and memcached seems more appropriate. I'd rather have a memory-only cache that uses least-recently-used purging than one that is trying to be a persistent database. Might still use redis, I've an open mind there, but it seems memcached was designed for this purpose.
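For readers unfamiliar with least-recently-used purging, here is a small sketch of the policy itself. It illustrates the eviction rule only and is not how memcached is implemented; the `LRUCache` class and its capacity of two items are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key, default=None):
        if key not in self.items:
            return default
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now more recently used than "b"
cache.put("c", 3)   # over capacity, so "b" is evicted
```

After these calls the cache holds "a" and "c"; "b" is gone because it was touched least recently.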

Update: -------

-- PowerDNS as authoritative server is almost laughably easy. They have a pipe backend and example which would take just a few minutes to rework in python to pull records from CouchDB. A few minutes more and it could check memcached to see which server is loaded and re-order results appropriately. Geo wouldn't even be that hard to add.

-- Twisted and Tornado might work to replace node.js. It is kinda shocking how many people are investing in node.js, though.

-- If anyone can recommend a caching proxy server that will also work as a sort of template engine and auth engine in front of couchDB, that would be great. I can write this in any number of tools, but if it already exists, great. Remember, it needs to take multiple CouchDB shows/lists and run them thru a template to produce a final resulting html page.

-- CouchDB caches the trees in memory so can quickly tell me if the etag is different, but for commonly used data, I do want to cache it to save time on disk seeks, given how dramatic the latency of disks is vs. memory. But only for the most commonly used things.

-- as I originally mentioned, I don't want to reinvent the wheel, I'd rather rent wheels by the hour. Unfortunately, none of the app hosting options offer the kind of wheels I need. The app we're building will involve some heavy lifting, in the 10s of gigabytes of data a day range, with a fair bit of custom processing, and even app engine doesn't provide the kind of indexing we really need (But that CouchDB does well, and incrementally.)

-- I think this might work as an open source project, an open stack for people to use and build upon.
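To give a feel for the PowerDNS pipe backend idea in the first point of the update, here is a minimal sketch of the request handling. It follows my understanding of version 1 of the pipe backend's tab-separated protocol (a `HELO` handshake, `Q` questions, `DATA` answers terminated by `END`), so check the PowerDNS documentation before relying on the details. The `RECORDS` dict is a hypothetical stand-in for records pulled from CouchDB, and the names and addresses are examples.

```python
import sys

# Hypothetical records; in the design above these would come from CouchDB.
RECORDS = {
    ("example.com", "A"): ["192.0.2.10", "192.0.2.11"],
    ("example.com", "NS"): ["ns1.example.com"],
}


def answer(line, records=RECORDS, ttl=60):
    """Return the response lines for one tab-separated request line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "HELO":
        return ["OK\tsketch backend ready"]
    if fields[0] != "Q" or len(fields) < 6:
        return ["FAIL"]  # this sketch only handles HELO and Q
    qname, qclass, qtype, query_id = fields[1], fields[2], fields[3], fields[4]
    responses = []
    for (name, rtype), values in sorted(records.items()):
        if name == qname.lower() and qtype in (rtype, "ANY"):
            for value in values:
                responses.append("\t".join(
                    ["DATA", qname, qclass, rtype, str(ttl), query_id, value]))
    responses.append("END")
    return responses


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer requests until the server closes the pipe."""
    for line in stdin:
        for response in answer(line):
            stdout.write(response + "\n")
        stdout.flush()  # the server waits for each complete answer
```

A real backend script would end by calling `serve()`. Re-ordering the `A` records by node load would happen inside `answer`, just before the `DATA` lines are built.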

I have only used Node.js a little bit so far, but I've been pleased. I'd say at this point it is great for JavaScript business logic. I'd hold off on using it for long running processes like caches or proxies (http server behind a load balancer would be fine, as those are cheap to restart.)

Basically Node is solid but has the occasional hiccough. Also the libraries are rapidly evolving. Real life example: CouchOne hosting originally prototyped our proxy in Node.js before @_jhs moved it to Erlang. We still do business logic (async operations) in Node and it's great for that. In a few months it should be ready for prime time in the critical path as well.

Seems like a lot of reinvention to me - cluster management, DNS service, caching, queue service, etc. Reinventing cluster-management infrastructure, for example, is just a bad idea plain and simple. It will take a lot of time, and that time won't be very enjoyable. You seem rather proud of having been an engineer longer than some commenters have been alive, but just about everybody who could say that about me is retired or dead by now and I actually have done production-level work in some of these areas so maybe you'll have to find a different excuse for dismissing my advice.

Although it's possible to determine that the large pile of bleeding-edge technology and almost-as-large pile of reinvented wheels that you've described is almost certainly the wrong solution, it's hard to tell what the right solution might be without any description of what you're trying to achieve. At least part of it looks like an attempt to clone DynDNS, so perhaps one starting point would be to look at what they're doing and how it might be improved. Using something like BigCouch that's already clustered seems like a good start, though I'd contend that something like Riak/Voldemort/Cassandra does that particular kind of clustering better. Adding Redis to the mix seems to add very little, since it's inherently a SPOF and you don't mention anything that would use its unique capabilities. Similarly, why not measure the effectiveness of the caching already inherent in your primary Couch/Riak/whatever store before you consider adding yet another caching layer?

My advice would be to use BigCouch/Riak or similar as your authoritative data store, and not add Redis or separate caching. That leaves your node.js front ends as the only thing that could possibly benefit from your home-grown cluster management, and if they're written properly so that they're not coupled to one another (bad for availability anyway) then you won't need that either. If you try that and still find that you need to build something else then fine, but I very seriously doubt that you would except as an artifact of implementing the simpler architecture poorly.

You mentioned Riak/Cassandra etc. do the clustering better, and this may be true, but a big part of the attraction of couchdb is couchApps and the built-in map reduce, which does the data crunching we need to do in a way well suited to our problem. Will have to investigate Riak more, just haven't quite gotten my head around walking links, or how webmachine fits in and whether it is comparable to couch apps.

If you want a context in which to think of our product, think of it as a recommendation engine or search engine. A large amount of data comes in, some of it realtime, and it needs to be processed and queries of various types need to be handled, some using customer specific information. Eg: someone who likes shoes will see more shoes in their results.

If we're at amazon, we'll use amazon's load balancer. But it looks like we'll be at linode or somewhere like that initially... And since we need to host a dns server anyway, figured round robin dns is the way to go.

I suspect it would take more coding effort to get powerDNS pulling records out of the couch than it would to adapt a pre-written authoritative server for node to our needs. But since node seems like it won't be ready in time...

I'm not planning any wheel reinventing, and a big part of the reason for posting here was for someone to say "hey, here's a wheel you can buy off the shelf!" not "hur hur, you're stupid because you're assembling a web stack, fail!".

Already had decided, if you'll notice the followups below, to replace redis with memcached, though I don't see how it would be a single point of failure if I was just using it as a cache, which is all I was using it for. I just saw it had support for some nice data structures and figured it would be better... But changed my mind when I realized it didn't have an LRU.

As for cluster management, turnkey linux provides some of that, and otherwise it was just going to be keeping queues and load readings in the distributed cache so that large jobs could be distributed to less busy machines. This doesn't seem like a big deal to me, because it wasn't even something we needed in a big way, though I didn't mention that. But if someone has already done it, feel free to let me know.

I also figured I'd need to write some scripts to have nodes join and leave the clusters, mostly calling the APIs built into the components of the stack... Figure everyone has to do this.

I know there are some open source projects for managing cloud clusters, but I'm not very familiar with them and would appreciate recommendations if there are some that would be a good fit.

>and a big part of the reason for posting here was for someone to say "hey, here's a wheel you can buy off the shelf!" not "hur hur, you're stupid because you're assembling a web stack, fail!".

I don't think assembling a web stack is automatic fail, but I think not buying a wheel you don't need is even better than buying one off the shelf. ;)

>Already had decided, if you'll notice the followups below, to replace redis with memcached, though I don't see how it would be a single point of failure if I was just using it as a cache

If you're just using it as a cache then the SPOF doesn't matter, you're right, but if you're just using it as a cache then I'd say it's the wrong tool. It's great at doing fancy stuff with data structures in memory, but memory on a single node is all you get. (Dumping memory to disk improves durability, but does nothing for capacity.) If you want more cache than that, then you'll have to add your own clustering. Memcached itself is not any better, actually. At the very least you'll need to add libketama to consistent-hash across several memcached instances. Even better would be to use something like Kumofs or Voldemort which provide roughly the same semantics - Kumofs even uses the same protocol - while also handling server-set changes and such transparently.
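The consistent hashing mentioned above can be illustrated with a small hash ring. This is a generic sketch of the technique, not libketama's exact algorithm; the server names and the number of points per server are arbitrary choices for the example.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to servers so that few keys move when servers change."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (point, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _point(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Several points per server spread its share around the ring.
        for i in range(self.replicas):
            point = self._point("%s#%d" % (server, i))
            bisect.insort(self.ring, (point, server))

    def remove(self, server):
        self.ring = [entry for entry in self.ring if entry[1] != server]

    def server_for(self, key):
        if not self.ring:
            raise LookupError("no servers in the ring")
        # First point clockwise from the key's position, wrapping around.
        index = bisect.bisect(self.ring, (self._point(key), ""))
        return self.ring[index % len(self.ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = ["key-%d" % i for i in range(200)]
before = {key: ring.server_for(key) for key in keys}
ring.remove("cache-c")
after = {key: ring.server_for(key) for key in keys}
```

Only the keys that lived on the removed server are reassigned. Every key that mapped to `cache-a` or `cache-b` before the removal still maps to the same server afterwards, which is the property that keeps most of the cache warm when the server set changes.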

That brings me to the point about whether to implement a separate caching layer at all. Why? Is there some known or inevitable performance need that wouldn't be solved by having enough cache on the nodes serving the authoritative data? If not, then implementing interfaces to two different layers - with all of the consistency problems and impedance mismatches involved - will just be a waste. My general advice is to never add a tier unless you're absolutely and empirically sure that the pain you're gaining won't outweigh the pain you're avoiding.

>I also figured I'd need to write some scripts to have nodes join and leave the clusters, mostly calling the APIs built into the components of the stack

Writing those scripts can be a significant part of implementing your own cluster infrastructure. I'm sure we could have an interesting discussion about whether writing those scripts would be easy or hard, but IMO the more pressing question is whether they need to be written at all. If your storage layer (e.g. BigCouch) already does its own clustering, then you need to do practically nothing at that level. If you then keep your front-end servers as stateless as possible (i.e. not storing any persistent/shared data other than in the primary data store) then adding/removing those servers should be a complete no-brainer. Just give them some addresses and credentials, and boom - no cluster magic necessary.

It's not components I'm recommending so much as a pattern that avoids the need for so many components, but I hope it's still useful. Good luck.

Thanks for your thoughts, and I don't disagree with the general philosophy you're pitching. I think, though, you may be imagining something different than I'm trying to describe. It seems that you're thinking in terms of front end and back end servers and a cache that is separate servers, which is why you suggest that the cache in the nodes serving the authoritative data should be sufficient. I'm describing a system whereby there is only one server configuration and it contains the whole stack, and thus to increase capacity you add more servers, which means more cache, more database capacity, etc.

This may be non-traditional, but it is amenable to an ec2 type of setup where you scale the cluster up or down without having to keep ratios of types of servers and their connections coordinated. I don't intend to change the cluster size, except to slowly add more servers over time. But having different types means different configurations, and keeping it coordinated. The idea here is, one type, it is already configured, it is actually an image and you just clone it.

Didn't realize that memcached worked that way. I do need to look into it more, and thanks for the alternative recommendations.

There are some elements, like images that appear on every page, or CSS files, that will be served much more than others. I can store all these elements in the couch, and maybe they will live in ram due to the system's virtual memory, making explicit caching pointless. I will need to do tests here, you are right.

But I was going by some previous experience and common "best practices" advice to try and explicitly keep some of this in ram. Further, I do want to keep some operational state information in ram, so that it can be written often, even though it doesn't need to be persisted, but does need to be read by multiple machines in the cluster.

Anyway, I appreciate your suggestion, and agree with the desire to keep complexity down. One of the ways I'm doing this is to have a single machine type, thus no complex hierarchy of machines that have to be configured to work with each other. Just add additional nodes to add capacity.

Btw, one thing I've been looking for is a relatively high level queue/worker service that is already written and distributed. I'd like to be able to write workers in JavaScript or python, and found one the other night in erlang that does this, called DISCO. However, it seems a bit heavyweight.

Just need to spawn periodic tasks that will gather data and feed it into the couch. I could do it with cron, but something that divides them over each machine in the cluster seems like something people would have written already. Maybe DISCO is what I need but it seems a bit heavyweight, while cron is too little.
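One lightweight way to divide periodic tasks over identical machines, without a central scheduler, is to have every node compute the same task-to-node assignment and run only its own share. The sketch below uses rendezvous (highest random weight) hashing for that. The node and task names are invented, and it assumes every node sees the same membership list.

```python
import hashlib


def owner_of(task_name, nodes):
    """Pick the node with the highest hash score for this task."""
    def score(node):
        text = "%s|%s" % (node, task_name)
        return hashlib.sha1(text.encode("utf-8")).hexdigest()
    # Sorting first makes the result independent of input order.
    return max(sorted(nodes), key=score)


def tasks_for(node, task_names, nodes):
    """Return the tasks this node should run from its local cron."""
    return [task for task in task_names if owner_of(task, nodes) == node]


nodes = ["node-1", "node-2", "node-3"]
tasks = ["fetch-feed-a", "fetch-feed-b", "rebuild-index", "send-digest"]
assignment = {node: tasks_for(node, tasks, nodes) for node in nodes}
```

Each node's cron job would call `tasks_for` with its own name and skip everything else, so every task runs on exactly one machine. When a node leaves, only the tasks it owned move to other nodes.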
