How can I learn to scale my project?
25 points by voidfiles on Dec 18, 2007 | 18 comments
I really want to start testing and developing methods for scaling and speeding up a project of mine, and also just for general knowledge. I can't think of a way to do this without creating a bunch of Amazon EC2 instances, or buying more computers and doing this from my house.

Can anyone think of a way to do this for free or for very little money? Is there something I am not thinking of?

Background

I am creating a website, http://www.tastestalkr.com. Right now I run it on http://www.dreamhost.com. They are great when it comes to serving webpages, but not if you want to run a crawler. I run the crawler from a computer in my house. This works great for now, and I understand that I won't need to scale for a long time, but I am doing this all by myself. I am using Django and Python. I am thinking that maybe something like Hadoop is going to be my best bet.



1) Profile. You really shouldn't start blindly optimizing code without profiling. Profile it, find your bottlenecks, optimize, rinse, and repeat. You should also benchmark your hardware; run scripts that make requests to your www server. Check out Apache's ab benchmark. Etc... (A small profiling sketch follows this list.)

2) Cache. Google "Memcached". Django has support for this. This will alleviate huge database bottlenecks (if your app is read heavy); see the cache sketch after this list. Also possibly consider some sort of FS (file-system) structure specific to your app. It looks like you deal with a bunch of media files; those could get hard to manage without a proper FS plan.

3) Write clean code. There is no sense in profiling, optimizing, caching, etc if your code is horrible. Those things will lead to even worse, unreadable, unmanageable, possibly slower code.
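For point 1, a minimal profiling sketch in Python using the standard-library cProfile module; crawl_page here is just a placeholder for whatever view or crawler step you want to measure:

    import cProfile
    import pstats

    def crawl_page(url):
        # stand-in for the real work you want to measure
        return url.lower() * 1000

    # profile one call and dump the stats to a file
    cProfile.run("crawl_page('http://example.com/')", "crawl.prof")

    # show the 10 most expensive calls by cumulative time
    pstats.Stats("crawl.prof").sort_stats("cumulative").print_stats(10)

For the HTTP side, ab -n 1000 -c 10 http://yoursite/ will hammer one URL with 1000 requests, 10 at a time, and report requests per second.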
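For point 2, Django's per-view cache decorator pairs nicely with memcached once the cache backend in settings.py points at your memcached server. A rough sketch; the view and model names are made up:

    from django.http import HttpResponse
    from django.views.decorators.cache import cache_page

    @cache_page(60 * 15)    # cache the rendered response for 15 minutes
    def artist_detail(request, artist_id):           # hypothetical view
        # the expensive query runs at most once per 15 minutes per URL
        artist = Artist.objects.get(pk=artist_id)    # hypothetical model
        return HttpResponse(artist.name)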

From your description you asked slightly more specific questions, but you really need to understand the basics above before specializing for your app. At least that's what I believe.


You are already thinking in the right direction if you understand how Hadoop works. But don't bother for now. No, really, don't!

Just be ready to profile and refactor your code when you do need to scale.

That said, you can improve the odds with practices that are almost cost-free, but will help a lot later on.

- Read Cal Henderson's book.

- The center of your design should be the data store, not a process. You transition the data store from state to state, securely and reliably, in small increments.

- Avoid globals and session state. The more "pure" your function is, the easier it will be to cache or partition.

- Don't make your data store too smart. Calculations and renderings should happen in a separate, asynchronous process.

- The data store should be able to handle lots of concurrent connections. Minimize locking. (Read about optimistic locking; there is a small sketch after this list.)

- Protect your algorithm from the implementation of the data store, with a helper class or module or whatever. But don't (DO NOT) try to build a framework for any conceivable query. Just the ones your algorithm needs.
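A minimal sketch of the optimistic-locking idea, assuming a plain DB-API cursor and a made-up profiles table with a version column: read the row, do your work, and only write back if nobody else bumped the version in the meantime.

    class ConflictError(Exception):
        pass

    def save_profile(cursor, profile_id, new_data, seen_version):
        # the UPDATE only matches if the row still has the version we read
        cursor.execute(
            "UPDATE profiles SET data = %s, version = version + 1 "
            "WHERE id = %s AND version = %s",
            (new_data, profile_id, seen_version),
        )
        if cursor.rowcount == 0:
            # someone else got there first: re-read, re-apply, and retry
            raise ConflictError("profile %s was modified concurrently" % profile_id)

No row stays locked for longer than the single UPDATE, so readers and other writers never sit around waiting on you.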

Think this is obvious? Just the other day I heard of a project, staffed by so-called experts, that made every one of the mistakes I mentioned above. And in simulations, they cannot even keep up with the load they expect at launch.


> - Avoid globals and session state. The more "pure" your function is, the easier it will be to cache or partition.

I'd like to emphasize this point in particular. Shared state is a big, inefficient, centralized bureaucrat and is the enemy of horizontal scaling. Statelessness and decentralization are best-friends-forever (they each have a denormalized copy of different aspects of the same friendship bracelet!), and you should figure out as many ways as possible to exploit statelessness and minimize unnecessary shared state in your application.

If you have the time, learn statelessness-in-the-small as well: play with a strict functional programming language (for your own edification, not necessarily for implementing this particular project!), and you will learn how to become keenly aware of the flow of state in any program, and how to maximize statelessness in any language. This will improve anything you ever program again, and will form a powerful mental model of state that will carry over by analogy to statelessness-in-the-large.
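A toy Python illustration of what that awareness buys you (the scoring logic is invented):

    # stateful: leans on a module-level dict that each server mutates,
    # so two machines handling the same item can give different answers
    _seen_counts = {}

    def score_stateful(item_id, clicks):
        _seen_counts[item_id] = _seen_counts.get(item_id, 0) + 1
        return clicks / float(_seen_counts[item_id])

    # pure: the result is a function of the arguments alone, so it can be
    # cached by key, recomputed on any box, or partitioned across workers
    def score_pure(clicks, seen_count):
        return clicks / float(seen_count)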


Ok - I am using the session to speed up performance, by keeping around an object that is central to a user's workflow for just about every request, without going back to the db for it.

Are you suggesting that going back to the db each time is more scalable, or that I am better off using some kind of method level cache?

Honestly - just wondering - any good articles on what you are discussing? I guess I just don't quite 'get' it - not having worked on an application that scaled to the point that session access was the bottleneck.


> Are you suggesting that going back to the db each time is more scalable, or that I am better off using some kind of method level cache?

Yes and yes.

This is a really long and deep topic. There are all sorts of reasons why sessions are not a good idea. But let's stick to scalability.

If you're using your session as a sort of cache for objects, that's probably okay (although, consider using something designed for this, like memcached). The point is, you ought to be able to reconstruct all the objects you need from just the request parameters.
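As a concrete sketch (the model and key scheme here are invented): derive everything from the request, check memcached, and fall back to the database on a miss, instead of pinning the object to a session.

    from django.core.cache import cache
    from django.http import HttpResponse

    def dashboard(request, workflow_id):
        # everything needed to rebuild the object comes from the URL,
        # not from server-side session state
        key = "workflow:%s" % workflow_id              # hypothetical key scheme
        wf = cache.get(key)
        if wf is None:
            wf = Workflow.objects.get(pk=workflow_id)  # hypothetical model
            cache.set(key, wf, 60)                     # cache for 60 seconds
        return HttpResponse("workflow %s loaded" % workflow_id)

Any request, hitting any web server, can now rebuild the same object; losing a server loses nothing but a warm cache.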

This is a pretty good article about it all.

"Session State is Evil" -- http://davidvancouvering.blogspot.com/2007/09/session-state-...


Yep, sessions are nice when your site is in the 1-5k visitors per day range, but don't seem to scale much beyond that. (coming from an RoR background here)


Do you have enough traffic that scaling is really a concern? It's typically putting the cart before the horse to worry about that too soon. You'd be much better off trying to improve the site and get the name out otherwise.


You are absolutely right that the site doesn't need any form of scaling right now. I have never had the chance to write a massively scalable system of any kind. I am looking for a challenge, and I don't have a lot of money, so I am trying to figure out how to learn to scale on a budget, for fun, not for business. If the time comes, though, I will have some tricks up my sleeve.


Ah. It's definitely an interesting problem. Hopefully you'll get a chance to test anything you come up with.


"... How can I learn to scale my project? ..."

Go work for Google? [0]

".... I really want to start testing and developing methods for scaling and speeding a project of mine, and also just for general knowledge. I can't think of a way to do this without creating a bunch of amazon ec2 instances, or buying more computers and doing this from house. ..."

Think distributed.

Create a useful installable tool that puts under-utilised spare CPU cycles to work. Or get some experience with some of the existing systems. Some examples that I can think of are:

- SETI ~ http://setiathome.berkeley.edu/

- Electric Sheep ~ http://electricsheep.org/

- PlanetQuest ~ http://www.wired.com/science/space/news/2005/03/66757

Another approach is to take a look at how you can create a Beowulf cluster ~ http://www.beowulf.org/showcase/index.html You don't have to build it (though operacy is worth 10x reading); you can just look at the software. If you are still at school or know somebody there, you can see if anyone is working on parallel software.

[0] Take a read of this blog on the development of a personalised RSS crawler ~ http://blog.persai.com/ to see the kinds of problems you have to overcome (scaling, data integrity, storage, re-writes).


The Persai blog was really helpful; it's great to find other people dealing with this idea of natural language parsing who aren't PhD students.


Scalable Internet Architectures by Theo Schlossnagle was a good introduction for me.

http://www.amazon.com/Scalable-Internet-Architectures-Develo...


I would suggest that you consider a couple things:

1) How can you break down the work tasks your site/crawler is doing so that it could be divvied out across N processes? In the case of the crawler, how would you handle breaking up your work queue so that multiple spiders could coordinate their efforts without all crawling the same pages and without skipping any? How do you recombine the results back together when they're finished? What happens if a spider grabs a URL to crawl and dies before it can report in the results - will it still get crawled or has it been lost? (A sketch of one approach follows this list.)

2) Google's view of mass concurrency looks at hardware in the same sort of way. They considered questions like: What happens to the running processes if this disk drive goes offline? How can work being processed on multiple computers be recombined together? What happens when a critical computing node dies while you're waiting on it to report its results?

3) As pertains to database and other resources: how can the data needed to power the site be distributed across my computer resources as evenly as possible? How do I go about identifying and correcting bottlenecks?
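For the first point, a toy coordinator sketch in plain Python (in practice you would back this with a database or a real queue): URLs are handed out on a lease, and any URL whose spider never reports back gets re-queued instead of lost.

    import time

    class CrawlQueue(object):
        def __init__(self, urls, lease_seconds=120):
            self.pending = list(urls)    # not yet handed out
            self.leased = {}             # url -> time it was handed out
            self.done = set()
            self.lease_seconds = lease_seconds

        def checkout(self):
            # reclaim URLs whose spider died or stalled
            now = time.time()
            for url, started in list(self.leased.items()):
                if now - started > self.lease_seconds:
                    del self.leased[url]
                    self.pending.append(url)
            if not self.pending:
                return None
            url = self.pending.pop()
            self.leased[url] = now
            return url

        def report_done(self, url):
            self.leased.pop(url, None)
            self.done.add(url)

Each spider loops on checkout(), crawls the URL, and calls report_done(url); if it crashes mid-crawl the lease expires and another spider picks the URL back up.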

It does not seem like you need Hadoop to either think through these problems or work on coding solutions that take this sort of thing into account. Though if you build a system capable of mass concurrency, I can understand the desire to test it out in a widely distributed environment.


Read Cal Henderson's book "Building Scalable Web Sites".


Good book, but it's somewhat biased. Not gonna go into details, but definitely worth a read.


You can plug into someone else's crawler. For example http://spinn3r.com/.

In general take a look at http://highscalability.com/.


Fall on your ass many, many times :)


Try and get lots of people to use it - then you'll have to scale it. Otherwise it's a bit imaginary. :)



