

How can I learn to scale my project? - voidfiles

I really want to start testing and developing methods for scaling and speeding up a project of mine, and also just for general knowledge. I can't think of a way to do this without creating a bunch of Amazon EC2 instances, or buying more computers and doing this from my house.

Can anyone think of a way to do this for free or for very little money? Is there something I am not thinking of?

Background

I am creating a website, http://www.tastestalkr.com. Right now I run it on http://www.dreamhost.com. They are great when it comes to serving web pages, but not if you want to run a crawler. I run the crawler from a computer in my house. This works great for now, and I understand that I won't need to scale for a long time, but I am doing this all by myself. I am using Django and Python. I am thinking that maybe something like Hadoop is going to be my best bet.
======
brooksbp
1) Profile. You really shouldn't start blindly optimizing code without
profiling. Profile it, find your bottlenecks, optimize, rinse, and repeat. You
should also benchmark your hardware: run scripts that make requests to your
web server. Check out Apache's ab benchmarking tool, etc.
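A minimal sketch of that first step using Python's built-in cProfile (Python, since the poster is on Django); the profiled function here is a made-up stand-in for real request-handling or crawling code:

```python
import cProfile
import io
import pstats

def fetch_and_parse():
    # Stand-in for whatever code path you want to measure; replace
    # with your actual request handler or crawler step.
    total = 0
    for i in range(100000):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
fetch_and_parse()
profiler.disable()

# Print the ten most expensive calls by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
print(stream.getvalue())
```

The "rinse and repeat" part matters: fix the top entry in that listing, re-profile, and the ranking usually changes.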

2) Cache. Google "Memcached". Django has support for this. This will alleviate
huge database bottlenecks (if your app is read heavy). Also possibly consider
some sort of FS (file-system) structure specific for your app. It looks like
you deal with a bunch of media files, that could get hard to manage without a
proper FS plan.
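The cache-aside pattern memcached gives you looks roughly like this; a plain dict stands in for the memcached daemon so the sketch is runnable, and in Django you would call `django.core.cache.cache.get`/`set` instead (the query function is a hypothetical stand-in):

```python
cache = {}

def expensive_db_query(key):
    # Stand-in for a slow, read-heavy database hit.
    return "row-for-%s" % key

def get_with_cache(key):
    value = cache.get(key)
    if value is None:           # cache miss: hit the database once...
        value = expensive_db_query(key)
        cache[key] = value      # ...then store the result for next time
    return value

get_with_cache("user:42")   # miss, queries the "database"
get_with_cache("user:42")   # hit, served from cache
```

For a read-heavy app, most requests end up on the second path and never touch the database at all.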

3) Write clean code. There is no sense in profiling, optimizing, caching, etc
if your code is horrible. Those things will lead to even worse, unreadable,
unmanageable, possibly slower code.

From your description you asked slightly more specific questions, but you
really need to understand the basics above before specializing for your
app. At least that's what I believe.

------
neilk
You are already thinking in the right direction if you understand how Hadoop
works. But don't bother for now. No, really, don't!

Just be ready to profile and refactor your code when you _do_ need to scale.

That said, you can improve the odds with practices that are almost cost-free,
but will help a lot later on.

\- Read Cal Henderson's book.

\- The center of your design should be the data store, not a process. You
transition the data store from state to state, securely and reliably, in small
increments.

\- Avoid globals and session state. The more "pure" your function is, the
easier it will be to cache or partition.

\- Don't make your data store too smart. Calculations and renderings should
happen in a separate, asynchronous process.

\- The data store should be able to handle lots of concurrent connections.
Minimize locking. (Read about optimistic locking).

\- Protect your algorithm from the implementation of the data store, with a
helper class or module or whatever. But don't (DO NOT) try to build a
framework for any conceivable query. Just the ones your algorithm needs.
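The optimistic-locking point above can be sketched in a few lines. Each row carries a version number; a write only succeeds if the version is unchanged since the read, so nobody holds a lock while a user thinks. The in-memory dict is a stand-in for a database table; in SQL this is roughly `UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?`:

```python
store = {"page:1": {"version": 1, "body": "hello"}}

def read(key):
    row = store[key]
    return row["version"], row["body"]

def write(key, expected_version, new_body):
    row = store[key]
    if row["version"] != expected_version:
        return False            # someone else wrote first; caller retries
    row["version"] += 1
    row["body"] = new_body
    return True

v, body = read("page:1")
assert write("page:1", v, body + " world")      # succeeds
assert not write("page:1", v, "stale write")    # stale version rejected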

Think this is obvious? Just the other day I heard of a project, staffed by so-
called experts, that made every one of the mistakes I mentioned above. And in
simulations, they cannot even keep up with the load they expect at launch.

~~~
icky
> \- Avoid globals and session state. The more "pure" your function is, the
> easier it will be to cache or partition.

I'd like to emphasize this point in particular. Shared state is a big,
inefficient, centralized bureaucrat and is the enemy of horizontal scaling.
Statelessness and decentralization are best-friends-forever (they each have a
denormalized copy of different aspects of the same friendship bracelet!), and
you should figure out as many ways as possible to exploit statelessness and
minimize unnecessary shared state in your application.

If you have the time, learn statelessness-in-the-small as well: play with a
strict functional programming language (for your own edification, not
necessarily for implementing this particular project!), and you will learn how
to become keenly aware of the flow of state in any program, and how to
maximize statelessness in any language. This will improve anything you ever
program again, and will form a powerful mental model of state that will carry
over by analogy to statelessness-in-the-large.

~~~
goodgoblin
Ok - I am using the session to speed up performance - by keeping around an
object that is central to a user's workflow for just about every request w/o
going back to the db for it.

Are you suggesting that going back to the db each time is more scalable, or
that I am better off using some kind of method level cache?

Honestly - just wondering - any good articles on what you are discussing? I
guess I just don't quite 'get' it - not having worked on an application that
scaled to the point that session access was the bottleneck.

~~~
neilk
_Are you suggesting that going back to the db each time is more scalable, or
that I am better off using some kind of method level cache?_

Yes and yes.

This is a really long and deep topic. There are all sorts of reasons why
sessions are not a good idea. But let's stick to scalability.

If you're using your session as a sort of cache for objects, that's probably
okay (although, consider using something designed for this, like memcached).
The point is, you ought to be able to reconstruct all the objects you need
from just the request parameters.
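That reconstructability point can be shown concretely: the cache is a pure optimization, so wiping it (or routing the request to a different server) changes nothing, because everything derives from the request parameters. The names here are made up for illustration:

```python
cache = {}

def load_workflow_from_db(user_id):
    # Stand-in for the authoritative database read.
    return {"user_id": user_id, "step": "checkout"}

def handle_request(params):
    user_id = params["user_id"]          # everything derives from the request
    workflow = cache.get(user_id)
    if workflow is None:
        workflow = load_workflow_from_db(user_id)
        cache[user_id] = workflow
    return workflow

first = handle_request({"user_id": 7})
cache.clear()                            # simulate a cache/session wipe
second = handle_request({"user_id": 7})
assert first == second                   # the request alone was enough
```

A session that is the *only* place an object lives fails this test: clear it and the request can no longer be served.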

This is a pretty good article about it all.

"Session State is Evil" --
<http://davidvancouvering.blogspot.com/2007/09/session-state-is-evil.html>

~~~
sbraford
Yep, sessions are nice when your site is in the 1-5k visitors per day range,
but don't seem to scale much beyond that. (coming from an RoR background here)

------
mattmaroon
Do you have enough traffic that scaling is really a concern? It's typically
putting the cart before the horse to worry about that too soon. You'd be
much better off trying to improve the site and get the name out otherwise.

~~~
voidfiles
You are absolutely right that the site doesn't need any form of scaling right
now. I have never had the chance to write a massively scalable system of
any kind. I am looking for a challenge, and I don't have a lot of money, so I
am trying to figure out how to learn to scale on a budget, for fun, not for
business. If the time comes, though, I will have some tricks up my sleeve.

~~~
mattmaroon
Ah. It's definitely an interesting problem. Hopefully you'll get a chance to
test anything you come up with.

------
bootload
_"... How can I learn to scale my project? ..."_

Go work for google? [0]

 _".... I really want to start testing and developing methods for scaling and
speeding a project of mine, and also just for general knowledge. I can't think
of a way to do this without creating a bunch of amazon ec2 instances, or
buying more computers and doing this from house. ..."_

Think distributed.

Create a useful installable tool that utilises spare cpu cycles that are being
under-utilised. Or get some experience with some of the existing systems. Some
examples that I can think of are

\- SETI ~ <http://setiathome.berkeley.edu/>

\- Electric Sheep ~ <http://electricsheep.org/>

\- PlanetQuest ~ <http://www.wired.com/science/space/news/2005/03/66757>

Another approach is to take a look at how you can create a Beowulf
cluster ~ <http://www.beowulf.org/showcase/index.html> You don't have to build
it (though doing is worth 10x reading); you can look at the software. If you
are still at school or know somebody there, you can see if anyone is working
on parallel software.

[0] Take a read of this blog on the development of a personalised RSS crawler
~ <http://blog.persai.com/> to see the kinds of problems you have to overcome
(scaling, data integrity, storage, re-writes)

~~~
voidfiles
The persai blog was really helpful; it's great to find other people dealing
with this idea of natural language parsing who aren't PhD students.

------
pistoriusp
Scalable Internet Architectures by Theo Schlossnagle was a good introduction
for me.

<http://www.amazon.com/Scalable-Internet-Architectures-Developers-Library/dp/067232699X>

------
Kaizyn
I would suggest that you consider a couple things:

1) How can you break down the work tasks your site/crawler is doing so that it
could be divvied out across N processes? In the case of the crawler, how would
you handle breaking up your work queue so that multiple spiders could
coordinate their efforts without all crawling the same pages and without
skipping any? How do you recombine the results back together when they're
finished? What happens if a spider grabs a URL to crawl and dies before it can
report in the results - will it still get crawled or has it been lost?
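One common answer to that last question is to hand out URLs as leases with a deadline, and re-queue any lease that is never acknowledged. The in-memory queue and the names here are illustrative; a real system would keep this state in a shared store the spiders can all reach:

```python
import time

pending = ["http://example.com/a", "http://example.com/b"]
leases = {}          # url -> deadline by which the spider must report back
LEASE_SECONDS = 30

def checkout():
    """A spider calls this to claim the next URL to crawl."""
    reclaim_expired()
    if not pending:
        return None
    url = pending.pop(0)
    leases[url] = time.time() + LEASE_SECONDS
    return url

def ack(url):
    """The spider reports success; the URL is done for good."""
    leases.pop(url, None)

def reclaim_expired():
    """URLs whose spider never reported back go back into the queue."""
    now = time.time()
    for url, deadline in list(leases.items()):
        if deadline < now:
            del leases[url]
            pending.append(url)

url = checkout()     # spider claims a URL...
ack(url)             # ...and finishes it before the lease expires
```

A spider that dies mid-crawl simply never calls `ack`, so its URL reappears in `pending` once the lease runs out; nothing is lost, at the cost of an occasional double crawl.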

2) Google's view of mass concurrency looks at hardware in the same sort of
way. They considered questions like: What happens to the running processes if
this disk drive goes offline? How can work being processed on multiple
computers be recombined together? What happens when a critical computing node
dies while you're waiting on it to report its results?

3) As pertains to database and other resources: how can the data needed to
power the site be distributed across my computer resources as evenly as
possible? How do I go about identifying and correcting bottlenecks?

It does not seem like you need Hadoop to either think through these problems
or work on coding solutions that take this sort of thing into account. Though
if you build a system capable of mass concurrency, I can understand the
desire to test it out in a widely distributed environment.

------
simonw
Read Cal Henderson's book "Building Scalable Web Sites".

~~~
brooksbp
Good book, but it's somewhat biased. Not gonna go into details, but definitely
worth a read.

------
toddh
You can plug into someone else's crawler. For example <http://spinn3r.com/>.

In general take a look at <http://highscalability.com/>.

------
robertgaal
Fall on your ass many, many times :)

------
benn
Try and get lots of people to use it - then you'll have to scale it. Otherwise
it's a bit imaginary. :)

