I'm one of the co-founders of Media Wombat (a Flash search engine at http://mediawombat.com), a startup that's about 8 months old. We have no funding beyond what we can afford to put in ourselves, no investors, and very little spare hardware.
Our site is a search engine - like Google, but for Flash content. We threw the site together in a weekend and have been slowly tweaking it over time, but recently we've run into some growth issues. If you have funding or investors and you hit growth issues, you can just throw more hardware at the problem and ta-da! You're fast again. For those of us who don't have money being thrown at us, though, we have to be a little more creative and start looking at optimization.
I've got a couple of old machines and an 8-drive SCSI RAID in my basement that I use to crawl the web and process the data we index. The machines are not quad-core and don't have 64GB of RAM in them. They're old and tiny.
When we first put http://mediawombat.com together, we threw it together just to get it working, as quick and dirty as we could. We used Perl and MySQL for the back-end. The crawler was straightforward, single-threaded, slow, and clunky, but it worked. After about 4 months of collecting data, we started to see growth issues: searches were becoming slow.
We were running a live search through all of our indexed data. The first step of optimization was caching, of course - a pretty easy no-brainer. We recorded every search that people did on our site and pre-cached the search results for the 2,000 most popular search phrases. That way, when someone searches for a popular phrase, they get (almost) immediate results. Not too bad of a solution.
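In rough shell terms, the pre-caching idea looks something like this (just a sketch - the log path, cache directory, and run_search.pl script are placeholders, not our actual setup):

    #!/bin/sh
    # Sketch of the pre-caching step. Assumes searches.log has one query per line.
    # Pull out the 2,000 most popular queries and warm a cache file for each one.
    sort /var/log/wombat/searches.log | uniq -c | sort -rn | head -n 2000 |
    sed 's/^ *[0-9][0-9]* //' |
    while read -r query; do
        # cache key = md5 of the query, so odd characters don't break file names
        key=$(printf '%s' "$query" | md5sum | cut -d' ' -f1)
        ./run_search.pl "$query" > "/var/cache/wombat/$key.html"   # placeholder search script
    done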
Just a few weeks ago, I noticed that our crawler had become the slowest part of our back-end process. We had crawled most of our initial sites and gotten some good data back, but now the crawler was just churning through lots of uninteresting URLs and getting nothing of value in return. We had overflowed onto sites with no Flash content and were wasting resources crawling pages that returned nothing useful.
So, I was at my mother-in-law's place last weekend and she doesn't have any internet connectivity. I was bored and needed some time away from the family to geek out, so I started thinking about how I could optimize the back-end crawler and database, and one of those cartoonish lightbulb-over-the-head moments hit me. I rewrote the crawler in C and made it multi-threaded. Instead of reading and writing to a database, I used flat files, and I pre-processed everything outside of the database using the old-style Unix text utilities (grep, sort, uniq, sed, awk, ...).
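To give a flavor of what that pre-processing looks like (the file names and formats here are made up for illustration, not our real pipeline):

    #!/bin/sh
    # Illustrative only -- file names and formats are invented, but this is the
    # flavor of it: flat files plus the standard text utilities, no database.
    #   extracted_urls.txt : links pulled out of the pages we just fetched
    #   crawled_urls.txt   : every URL we've already visited, kept sorted
    grep -i '\.swf' extracted_urls.txt |   # keep links that look like Flash content
    sed 's/#.*$//' |                       # strip fragment identifiers
    sort -u |                              # de-duplicate
    comm -23 - crawled_urls.txt |          # drop URLs we've already seen
    head -n 50000 > next_batch.txt         # cap the next crawl batch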
The Unix text utilities were written in the '60s and '70s, when computers ran at a few MHz and had a few megabytes of RAM at most. Of course these utilities are going to be lean and mean! Perl was a memory hog: if I multi-threaded it, it ate up most of the RAM on my machine once I spawned more than 5 threads.
I read the man pages for all of the Unix text utilities I could find. I even found some I didn't know about before (and I've been using Unix (Linux) as my primary OS since 1990). I managed to replace about 90% of my crawler, previously written in Perl, with a handful of Unix utilities, a few shell scripts, and my multi-threaded crawler in C. I ran my crawling operations in bulk and processed the results in the background while the crawler was doing its thing.
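The driver for that bulk/background split is roughly the following (a sketch only - the ./crawler binary, process_pages.sh, and directory layout are placeholders, not the real thing):

    #!/bin/sh
    # Sketch of the bulk crawl + background processing idea: the C crawler chews
    # through one batch of URLs while the text-utility pipeline digests the last one.
    for batch in batches/batch-*.txt; do
        # digest the previous batch's output while the crawler fetches the next one
        if [ -n "$prev_out" ]; then
            ./process_pages.sh "$prev_out" &
        fi
        out="results/$(basename "$batch" .txt).out"
        ./crawler "$batch" > "$out"          # the multi-threaded C crawler (placeholder name)
        prev_out="$out"
    done
    wait                                     # let any background processing finish
    [ -n "$prev_out" ] && ./process_pages.sh "$prev_out"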
I was super proud that I had optimized the code as much as I had. I went from about 30k URLs crawled per day to about 60k URLs crawled per hour! To me, that was a huge speedup! Anyway, to make a long story short, I'm still looking for ways to optimize things, and I've got a long list of things to do if/when the time becomes available. I've got more time than money at this point, so it's worth the effort - and it's really rewarding!