
Shoestring Budget?  Starting to feel growth issues on your back-end?  Embrace unix and C - scumola
I'm one of the co-founders of Media Wombat (a Flash search engine at http://mediawombat.com), a startup that's about 8 months old.  We have no funding except what we can afford ourselves, no investors, and very little spare hardware.

Our site is a search engine - like Google, but for Flash content.  We threw the site together in a weekend and have been slowly tweaking it over time, but recently we've run into some growth issues.  If you have funding or investors and hit growth issues, you can just throw more hardware at the problem and ta-da!  You're fast again.  For those of us who don't have money being thrown at us, though, we have to be a little more creative and start looking at optimization.

I've got a couple of old machines and an 8-drive SCSI RAID in my basement that I'm using to crawl the web and process the data we index.  My machines are not quad-core and don't have 64GB of RAM in them.  They're old and tiny.

When we first put http://mediawombat.com together, we threw it all together just to get it working, as quick and dirty as we could.  We used Perl and MySQL for the back-end.  The crawler was straightforward, single-threaded, slow, and clunky, but it worked.  After about 4 months of collecting data, we started to see some growth issues: searches were becoming slow.

We were running every search live against all of our indexed data.  The first step in optimization was caching, of course - a pretty easy no-brainer.  We recorded all of the searches people did on our site and pre-cached the results for the 2,000 most popular search phrases.  Now, when someone searches for a popular phrase, they get (almost) immediate results.  Not too bad a solution.

Just a few weeks ago, I noticed that the crawler had become the slowest part of our back-end process.  We had crawled most of our initial sites and gotten some good data back, but by then the crawler was just grinding through lots of uninteresting URLs and getting nothing of value in return.  We had overflowed onto sites with no Flash and were crawling pages that returned no useful data.  We were wasting resources.

So, I was at my mother-in-law's place last weekend, and she doesn't have any internet connectivity.  I was bored and needed some time away from the family to geek out, so I thought about how I could make the back-end crawler and database more efficient.  I rewrote the crawler in C and made it multi-threaded.  Instead of reading and writing to a database, I used flat files, and I pre-processed everything outside of the database using the old-style Unix text utilities (grep, sort, uniq, sed, awk, ...).  One of those cartoonish lightbulb-over-the-head moments hit me.

The unix text utilities were written in the 60's and 70's when computers were 33mhz and had 5MB of ram.  Of course these utilities are going to be lean and mean!  Perl was a memory hog, and when I multi-threaded it, it ate up most of the available RAM on my machine if I spawned > 5 threads.

I read the man pages on every Unix text util I could find.  I even found some new ones I didn't know about before (and I've been using Unix (Linux) as my primary OS since 1990).  I managed to replace about 90% of the crawler, previously written in Perl, with a bunch of Unix utilities, a few shell scripts, and my multi-threaded crawler in C.  I did my crawling operations in bulk and processed them in the background while the crawler was doing its thing.

I was super proud of how much I had optimized the code.  I went from about 30k URLs crawled a day to about 60k URLs crawled an hour!  To me, that was a huge speedup!  Anyway, to make a long story short, I'm still looking for ways to optimize things.  I've got a long list of things to do if/when the time becomes available, and since I've got more time than money at this point, it's worth the effort - and it's really rewarding!
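For the curious, here's roughly the shape of the fetch side - a simplified sketch rather than the real code (the URL list, the output filenames, and the write_cb helper are just placeholders for the actual queue and flat-file plumbing); it's basically pthreads plus libcurl:

    /* fetch.c -- simplified sketch of a multi-threaded fetch worker.
     * The URL list stands in for the real crawl queue, and each thread
     * just appends whatever it downloads to its own flat file. */
    #include <curl/curl.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static const char *urls[] = {          /* placeholder for the real queue */
        "http://example.com/a.swf",
        "http://example.com/b.swf",
        "http://example.com/c.swf",
        "http://example.com/d.swf",
    };

    /* Append each downloaded chunk to the per-thread output file. */
    static size_t write_cb(char *data, size_t size, size_t nmemb, void *userp)
    {
        return fwrite(data, size, nmemb, (FILE *)userp);
    }

    static void *worker(void *arg)
    {
        int i = (int)(long)arg;
        char path[64];
        snprintf(path, sizeof path, "crawl-%d.out", i);
        FILE *out = fopen(path, "w");
        if (!out)
            return NULL;

        CURL *curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, urls[i]);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_perform(curl);           /* each thread blocks on its own fetch */
        curl_easy_cleanup(curl);
        fclose(out);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        curl_global_init(CURL_GLOBAL_ALL); /* must run before any threads start */
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)(long)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        curl_global_cleanup();
        return 0;
    }

Build with something like cc -pthread fetch.c -lcurl.  Each thread blocks on its own transfer, so even a handful of threads keeps the pipe busy.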
======
paul
This may have been a fun and interesting project for you (and a great way to
learn the unix utils), but I wouldn't recommend that other startups follow
this path.

Rewriting something in C should be the last thing you do, not the first. The
first thing you should do is find out why it's slow. In your case, it sounds
as though you were fetching one URL at a time (blocking). Switching to async
I/O alone could have fixed this.
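For instance (a sketch, not a drop-in fix - the URL list is a placeholder and
the responses are simply discarded), libcurl's multi interface lets a single
thread drive many transfers concurrently:

    /* async_fetch.c -- sketch of overlapping transfers with libcurl's
     * multi interface instead of one blocking fetch at a time. */
    #include <curl/curl.h>

    /* Discard the response body; a real crawler would parse or save it. */
    static size_t discard(char *data, size_t size, size_t nmemb, void *userp)
    {
        (void)data; (void)userp;
        return size * nmemb;
    }

    int main(void)
    {
        const char *urls[] = { "http://example.com/1",
                               "http://example.com/2",
                               "http://example.com/3" };
        enum { NURLS = 3 };
        CURL *handles[NURLS];

        curl_global_init(CURL_GLOBAL_ALL);
        CURLM *multi = curl_multi_init();

        for (int i = 0; i < NURLS; i++) {       /* queue all transfers up front */
            handles[i] = curl_easy_init();
            curl_easy_setopt(handles[i], CURLOPT_URL, urls[i]);
            curl_easy_setopt(handles[i], CURLOPT_WRITEFUNCTION, discard);
            curl_multi_add_handle(multi, handles[i]);
        }

        int running = 0;
        do {                                    /* one thread drives every transfer */
            curl_multi_perform(multi, &running);
            if (running)
                curl_multi_wait(multi, NULL, 0, 1000, NULL);
        } while (running);

        for (int i = 0; i < NURLS; i++) {
            curl_multi_remove_handle(multi, handles[i]);
            curl_easy_cleanup(handles[i]);
        }
        curl_multi_cleanup(multi);
        curl_global_cleanup();
        return 0;
    }

The point is overlapping the I/O, not the language - you can get the same
effect from Perl without rewriting anything in C.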

Btw, if you're using the GNU utilities, it's unlikely that they were written
in the 60's and 70's (also, people were processing much smaller amounts of
data back then).

~~~
scumola
Yea, fetching one URL at a time was a major issue, but things like the
executable's size in memory were also a problem. Remember, this is an old
machine. MySQL takes 400M for the index of my data, and my other processes run
on the same machine too and all take memory. I initially added swap, but
that's clearly not the right direction.

Also, I'm 40 years old. When I went to college, C programming was the norm, so
it wasn't that difficult for me, didn't take long to implement, and seemed
like a great place to start optimizing. It's fast, saturates my home Comcast
pipe (they might cap my bandwidth, which I'm now worried about), and takes
about 200M of memory with 30k URLs loaded into the queue. I suppose I could
reduce the app's memory requirements some more by keeping the URLs in a
BerkeleyDB or something, but I think 200M is acceptable for my needs right
now. For me, it's all about getting the most out of my available resources
without throwing more money at it.

Re: the GNU utils and 60's/70's comments ... Yea, I admit that my knowledge of
the GNU utils only goes back to 1990. I know that Unix has been around since
the 60's, so the parts of my initial post claiming the utils date from the
60's, and overstating the CPU speeds and memory sizes of machines back then,
were incorrect. My bad. ;)

~~~
barryfandango
Funny how these 2.0 young bucks consider a C rewrite extreme, when for many
coders it is the lingua franca, the bread and butter. I got started on high-
level languages, but I've made an effort to get familiar with C because it's
small, beautiful, and fast as all hell. Fast is powerful.

~~~
paul
Part of knowing a tool is knowing when it's inappropriate. A rewrite in ANY
language to "fix" a problem that you don't understand is generally a mistake.
Doing things the needlessly complex way doesn't make you a better or manlier
programmer.

By the way, a lot of these perl and python scripts spend most of their time
calling into C libraries anyway, so the performance difference is often
negligible.

~~~
barryfandango
And unless my knowledge of it is way off, the Python standard library is
written in Python, not C. That's what all those .py files in the /lib folder
are. So Python scripts spend most of their time calling into Python libraries.
It's true that the interpreter is written in C (C++?); maybe that's what you
were thinking of.

------
thomasmallen
Great post!

Please, HN, change the color of text in posts like the above. There is very
little contrast between #828282 (copy) and #F6F6EF (background), and I for one
am sick of having to fix this with Firebug.

~~~
jcl
I've always assumed that this font coloring is meant to discourage people from
making lengthy text-based posts on HN.

~~~
thomasmallen
I think that's what a letter or word limit is for, though.

~~~
palish
Artificial limits don't seem like the correct answer. The community is the
filter.

------
DanHulton
scumola, this should have been a blog post that you linked to. I'm not saying
that because I think this isn't HN-quality material - it's actually an awesome
little story to read. But if this were on a company blog somewhere, it would
be generating juice for your company in addition to HN.

This is great news and a great way to spread the word about your service.
Don't let that opportunity go to waste!

------
ConradHex
>The unix text utilities were written in the 60's and 70's when computers were
33mhz and had 5MB of ram.

I'm not positive (I was born in the 70s), but I'm pretty sure they had less
RAM and speed than this.

~~~
ojbyrne
The very first computer I saw Unix on was a minicomputer/workstation in 1985.
It had a 68020 processor running at 20 MHz and 4 MB of RAM. It cost around
$75k.

~~~
rbanffy
The first one I saw (not used - the first one I used was a 680x0 box) was a
Z-80 based Cromemco monster running Cromix on less than 512K.

Unix can be really small, if we sacrifice some stuff.

~~~
ojbyrne
I thought we were making the point that hardware was slow and expensive much
later than the 60s and 70s, not that Unix could be really small.

~~~
rbanffy
I was making the point that even a lowly hare-brained 8-bit processor can run
a Unix-like OS.

------
palish
Okay, there is a lot of noise in this thread, so this needs to be clearly
stated:

Good job, scumola.

It's impressive that you diagnosed the architectural bottleneck of your design
and solved it with the least amount of effort (from your standpoint) and
achieved a 48x speedup. There are many developers who simply can't do that;
they have tunnel vision, wasting a ton of time on improving portions of their
systems without first thinking deeply about the problem they're trying to
solve (and about simple ways to sidestep that problem).

In your case, you identified the correct problem, which was "How can I
maximize the number of sites crawled per day?" and not "How can I optimize
[the database, the perl scripts, etc]?" And then you did the most
straightforward optimization you thought of, accomplishing your goal in one or
two nights. Your solution is valid, maintainable, and most importantly works,
and so I personally don't see anything wrong with it. Again, nice hack.

------
ivankirigin
This is one reason I like Python so much. Writing hooks into C/C++ is pretty
easy. Often they are already there for you. For example, OpenCV is an
excellent image processing library in C++, and already has hooks to Python.

Lots of companies take this approach, of Python + C/C++. A few that come to
mind are Google, Weta, ILM, iRobot, and Tipjoy.

~~~
kingkongrevenge
Python has no advantage in this regard over any other language supported by
SWIG. And perl almost certainly has more available library bindings than
python.

~~~
inklesspen
Someone hasn't heard of ctypes: <http://docs.python.org/lib/module-
ctypes.html>

It lets you use arbitrary C libraries in Python. It automatically does the
wrapping for you.

~~~
kingkongrevenge
I hadn't heard of ctypes, but the same functionality is available in perl. It
seems pretty silly and crash prone to me to dynamically load object code and
then wrap it with a load of error checking rather than write proper bindings.

------
silentbicycle
Congrats!

If you haven't yet, check out _The Unix Programming Environment_ and _The
Practice of Programming_, both by Rob Pike and Brian Kernighan (K of K&R).
They're concise, highly informative books about using the Unix toolset to
their maximum potential. The former was written back when computers were slow
and had little memory; the latter is from 1999 but very much in the same
spirit. (It seems to include a lot of insights from developing Plan 9.)

Also, a dissenting opinion here: C's performance vs. higher level languages'
development speed is not necessarily an either/or choice. Some languages
(OCaml, the Chicken Scheme compiler, implementations of Common Lisp with type
annotations or inference for optimizing, Haskell (under certain
conditions...), others) can perform very favorably compared to C, but tend to
be much, _much_ easier to maintain and debug.

As a generalization, languages that let you pin down types are faster because
they only need to determine casts once, at compile time, but if those
decisions can be postponed until your program is already mostly worked out (or
better still, automatically inferred and checked for internal consistency),
you can keep the overall design flexible while you're experimenting with it.
Win / win.

Also (as I note in a comment below), Python can perform quite well when the
program is small amounts of Python tying together calls to its standard
library, much of which (e.g. string processing) is written in heavily
optimized C.

Alternately, you could embed Lua in your C and write the parts that don't need
to be tuned (or the first draft of everything) in that.
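
As a rough sketch of what that looks like (the script name and the
should_crawl function are made-up examples; the calls are the standard Lua 5.x
C API), the embedding side is only a few lines:

    /* host.c -- sketch of embedding Lua in a C program (Lua 5.x API).
     * crawl_logic.lua is a hypothetical script holding the still-changing
     * logic you'd rather not recompile the C host for. */
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>
    #include <stdio.h>

    int main(void)
    {
        lua_State *L = luaL_newstate();    /* fresh interpreter state */
        luaL_openlibs(L);                  /* load the standard Lua libraries */

        if (luaL_dofile(L, "crawl_logic.lua") != 0) {
            fprintf(stderr, "lua error: %s\n", lua_tostring(L, -1));
            lua_close(L);
            return 1;
        }

        /* Call a function defined in the script, e.g. should_crawl(url). */
        lua_getglobal(L, "should_crawl");
        lua_pushstring(L, "http://example.com/page.html");
        if (lua_pcall(L, 1, 1, 0) == 0)
            printf("should_crawl -> %s\n", lua_toboolean(L, -1) ? "yes" : "no");

        lua_close(L);
        return 0;
    }

Link against the Lua library (e.g. cc host.c -llua -lm), and everything you
keep in the script stays editable without recompiling the C host.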

~~~
palish
"Some languages ... can perform very favorably compared to C, but tend to be
much, much easier to maintain and debug."

Keep in mind that each language has a learning curve. As you learn more
languages, that curve becomes much easier to traverse. However, the goal is to
Get Stuff Done. The most straightforward engineering tactic to accomplish that
is to become highly skilled with a few choice tools, then use those tools to
solve almost all of your problems. (Note: that's different than a "one tool
for every problem" mindset. You can solve most problems with a small toolbox
while still avoiding the trap of using an inappropriate tool for the job.)

~~~
silentbicycle
Fully agreed. I'm posting for the archives and other readers at least as much
as for him, trying to add in a reminder that "rewrite the whole shebang in C"
isn't necessary for good performance.

I dabble in languages for fun, but find that I ultimately end up using the new
techniques I learn in a few practical multiparadigm languages, e.g. OCaml,
Lisp, and Python* . If you want to learn about types/powerful type systems,
try Haskell or one of the MLs. If you want to see a brilliantly designed
module system, look at the MLs. If you want to see well-designed syntactic
sugar, look at Python and Lua (among others). If you want to understand OO
better, look at Smalltalk. Etc. Spending a week (or weekend) now and then
exploring the ideas and mindsets in unfamiliar languages, particularly those
influential on what you do your real work in, can really expand your toolkit.
(Also, read code.)

* Python is not fully multiparadigm, e.g. it's awkward for functional programming, but it's fairly flexible, and its giant standard library makes up for several weaknesses IMO.

Really knowing any one of the several languages I listed in the initial
comment would probably be enough, and some approaches (e.g. Lua in C) could
probably be learned relatively quickly. I suspect it would probably take much
more time to learn to use C++ really well than OCaml or Lisp, though.

------
huhtenberg
> _multi-threadded crawler in C_

If your crawler is I/O-bound, then just wait till you discover _epoll_ :)

Or, on a more general note, have a look at <http://www.kegel.com/c10k.html>

~~~
cliff
Do epoll / other c10k solutions address I/O-bound situations?

I thought that most of these solutions deal with CPU-bound and sometimes RAM-
bound situations -- i.e. they fix spending too much time spinning the CPU in
various ways waiting for I/O, or too many threads at once taking up too much
RAM.

~~~
spc476
epoll() is for IO-bound situations. You write your program using an event
model, where the events are IO-related (you select events based upon file
descriptors being ready for reading and/or writing), which (in my opinion)
makes for easy programming of network-based daemons (
<http://boston.conman.org/2007/03/08.1> ).

The general method appears to be: write the program using an event model,
then create a thread per CPU, with each thread waiting on epoll() (timeouts
optional).
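
In sketch form (Linux-specific, error checking trimmed, and the echo handler
is just a stand-in for real protocol logic), one of those per-thread loops
looks something like this:

    /* epoll_echo.c -- minimal shape of the event loop described above: one
     * epoll instance, and work happens only when descriptors report ready.
     * For the thread-per-CPU variant, each thread runs a loop like this. */
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    int main(void)
    {
        /* Plain TCP listener on port 8000 (error checking trimmed). */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8000);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        int epfd = epoll_create(64);         /* size hint; ignored by modern kernels */
        struct epoll_event ev, ready[MAX_EVENTS];
        ev.events = EPOLLIN;
        ev.data.fd = lfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

        for (;;) {
            /* Block until some descriptors are ready for reading/writing. */
            int n = epoll_wait(epfd, ready, MAX_EVENTS, -1);
            for (int i = 0; i < n; i++) {
                int fd = ready[i].data.fd;
                if (fd == lfd) {             /* new connection: start watching it */
                    int cfd = accept(lfd, NULL, NULL);
                    ev.events = EPOLLIN;
                    ev.data.fd = cfd;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
                } else {                     /* client spoke: echo it back */
                    char buf[4096];
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0) { close(fd); continue; }
                    write(fd, buf, (size_t)r);
                }
            }
        }
    }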

~~~
cliff
Maybe I'm missing something -- if you're I/O-bound why would it matter whether
you use epoll() vs poll() other than CPU usage?

From what I understand (and how I've used it in my work), one uses epoll()
specifically because they're NOT I/O-bound and so need to come up with
strategies for using the minimum amount of CPU and RAM per simultaneous I/O so
as to avoid becoming CPU- or RAM-bound.

Hence my original point -- if one is I/O-bound while using poll(), it doesn't
really matter whether the CPU is spinning on that or epoll(), since I/O won't
happen any faster.

~~~
spc476
I just found epoll() much easier to work with than select() (been there, done
that, rather not go back to it) or poll(), plus you avoid scanning an array of
descriptors upon return.

------
softbuilder
Congratulations. That is an impressive speed up.

I agree about the built-in Unix utils. You just do not see people taking
advantage of these powerful and extremely optimized programs any more. I
wonder how many times grep or sort have been unwittingly rewritten in Perl or
Ruby because the programmer lacked familiarity with basic Unix tools?

As for your crawler, I think the significant thing here is that you rewrote
something in C after you already had it working in another language. Not to
bag on C, but writing the original in a higher level language first gives you
a better shot at correcting any bugs in the actual solution domain. Then if
you move to C you're only fighting against C, not against C _and_ bugs in your
solution at the same time.

------
ConradHex
Congrats on your big speedup; successful optimization like that is always a
rush.

I wonder what the result would be if you did everything you describe, but
wrote the code that's now in C, in Python instead. I suspect the speed would
be very similar. (I like C, for what it's worth.)

~~~
thomasmallen
_I wonder what the result would be if you ... wrote the code that's now in C,
in Python instead. I suspect the speed would be very similar._

Is that a joke?

~~~
jmtulloss
Not necessarily. Let's think of some operations a crawler would need to do:

1. Spawn some threads (Python uses native threading)
2. Connect to and load some URLs (this isn't going to be slow anywhere)
3. Run some regular expressions (Python's regexp engine is all C)
4. Write to the disk (Python uses the standard system libraries for this)

It wouldn't be as fast, but it might be surprisingly close. There would be
fewer lines of code to boot.

The problem would be the GIL (Global Interpreter Lock). You couldn't actually
have more than 1 thread running at a time. It sounds like the man only has 1
processor, so that wouldn't be the end of the world, but if he has more, you
can just swap in the process module for the threading module. Of course, then
you might have some RAM issues.

~~~
jrockway
Why use threads when you can use an event-based IO system? (I think Twisted is
the Python way of doing this.)

~~~
jmtulloss
I was just going off of what he had already done. I agree that async is a
better way to do this.

Twisted is one way, but Python ships with asynchronous libraries
<http://docs.python.org/lib/module-asyncore.html>.

------
hopeless
"Starting to feel growth issues on your back-end?"

I think you'd better get a doctor to look at that.

~~~
LogicHoleFlaw
"Does this algorithm make my memory footprint look big?"

------
pjf
Key to innovation: no funds, no Internet, a laptop and free time.

------
bnolan
We had some views in our Rails app that could be hit several times per second
by our users, and they were uncacheable, so we implemented the views in C++
using libpq and FastCGI. So much awesomely faster.

If you've got some code that works well and that you can solidify into C++
(i.e., you're no longer tweaking it 3 times a day), it's totally worth
spending the time to rewrite.

I think C++ on Rails would be a great idea, i.e., some code generators to help
you port your most-used .rhtml views over to C++.

------
axod
Very cool. I have had a similar experience with Mibbit...

Next step, get rid of those threads and use non-blocking IO ;)

~~~
maxklein
Mibbit is written in C? So you run it at home? How do you pay for the
bandwidth?

~~~
axod
Not quite, Mibbit is in Java. I took "c" to mean "currently unpopular non-
hyped language". Much like Java is... Mibbit runs on pretty much a single
server at the moment so it's been a similar scaling exercise.

I'd definitely agree though with the OP... having limitations forces you to
look at better ways to do stuff, and ways to squeeze every last drop out of
the hardware/resources.

~~~
jrockway
_I took "c" to mean "currently unpopular non-hyped language". Much like Java
is._

Java is a popular, hyped language. It's just not a very good one.

------
youngnh
If you enjoyed learning/using sed, awk, grep and other Unix text processing
utils, you'll love this: <http://borel.slu.edu/obair/ufp.pdf>

It's called Unix For Poets and it'll show you just how far you can really push
these tools.

------
natch
What's your average page size?

Does your crawler support gzip compressed transfers?

What speed is your Comcast link?

~~~
scumola
Page size varies, but when I was doing the initial speed tests, pages were
around 20k each (guesstimate).

Yea, the crawler uses libcurl, which supports gzip compressed content.

My Comcast link is (theoretically) 7Mbit down, 385Kbit up.

------
lallysingh
First, I hope you weren't using the obscenely slow version of Perl that ships
with RedHat.

Second, for more perf analysis, there are some very good unix tools for
profiling & optimizing C code. Many of them free.

------
tyohn
What's interesting about this posting is that, unbeknownst to my co-founder,
yesterday I was discussing some of the issues we were facing with a "retired"
Unix programmer, and he talked about early sorting and searching methods. It
was quite remarkable what they achieved with very little in the way of
hardware and memory.

------
zenspider
perl threading was (and I'm sure still is) absolutely horrid.

I'd be interested in seeing the results of a rewrite not to C, but to python
or ruby where the threading support is much much better. Then you could
rewrite functions at a time in C as needed, but not have the extra burden of
rewriting the whole thing.

I totally agree with the rest of the approach. Going low tech and using unix
tools is a very good way to reduce overhead, increase parallelism, and delay
calculations. One of the nice things about this approach is you can cobble up
another $50 unix box to do some of the bulk processing via nfs or other means.

Congrats... It sounds like a very interesting project.

~~~
kaens
Threading support is better in python than in perl? I'm not familiar with
perl's threading model (or libraries), but this seems wrong - considering
things like the GIL in python.

Could you elaborate a bit on this?

~~~
kingkongrevenge
The thing with perl threads is that pretty much nobody uses them. They work
fine, as far as I know, but with a default 16MB stack size and "share nothing"
semantics, you don't want to start more than a few. Perl is more Unix-centric
and people just prefer to fork and use the well-baked IPC mechanisms. Various
libraries such as POE make fork+IPC easy enough that it's hard to see the need
for threads in the kinds of domains where a language like perl is applicable.

~~~
zenspider
they certainly didn't work fine when I tried to use them... after they'd gone
into production releases of perl, the canonical demo/example scripts that were
out there all crashed the perl runtime in a fiery horrible death.

I can only hope that it has improved since then, but I've moved on to greener
pastures.

~~~
kingkongrevenge
They were clearly marked as experimental in the documentation all the way up
to recent 5.8 releases. From what I hear they work fine now.

------
corentin
> I'm still looking for ways to optimize things

Re-implement your crawler on a FPGA.

Just kidding :) It's great that you got what you were expecting, but as Paul
said, you have to profile before you optimize (otherwise you will "optimize"
useless stuff, and not only waste your time but also likely do counter-
optimizations).

Anyway, your post was very interesting, because a lot of people assume they
have to use layers on top of the OS, while modern Unix systems have good file
systems and memory managers (maybe have a look at DragonFly BSD; they are
going in an interesting direction).

------
hs
I used an implementation of Cilk when I crawled ~1,000,000 HTML pages from a
social network in 2 hours (can't disclose which -- the site went into
maintenance during/after my experiments ... just to exercise my curiosity).

I had to be root to raise the user's maxproc-max (previous experiments locked
me out -- "sh can't fork" messages ... couldn't even ssh in).

It was all done on a 100Mbps, AMD 2000+, 256MB RAM, 40GB IDE OpenBSD colo ...
but I don't think the hardware matters that much (the Cilk is really the key).

------
pwoods
Yes, I love it when things can be sped up with old-school techniques.
Although I'm not sure C was really a great choice in case you need to scale up
later.

------
mattmaroon
Most startups that have scaling problems have them due to number of customers,
in which case it's trivial to either monetize or raise some money.

~~~
jshen
I'm not sure this is true. I've been at a couple of startups where the scaling
issues came from the database growing every day with a relatively fixed set of
users.

~~~
mattmaroon
I would still bet that the number of startups with problems like that is <10%.

~~~
jshen
That's probably true. We weren't getting VC funding and had a stream of
revenue to keep it going for a couple of years.

------
donniefitz2
Congrats. I'm sure that's a good feeling.

------
thwarted
I once had a boss who refused to buy the programmers Pentium machines and made
us use 486s instead. He said it wasn't because he was cheap, but so that if we
wrote code that ran really fast on a 486, just think how fast it would run on
a Pentium. Never mind that none of the new Pentium features, like SIMD, could
be taken advantage of.

That company is no longer in business.

------
kingkongrevenge
It's true that utilities like grep, sort, and join are highly optimized. Many
people have had wins replacing scripting logic with calls to these utilities.
But I'm having some difficulty understanding the merits of putting the top
level logic in C, using shell scripts, and the alleged advantages of flat text
files.

Shell scripts are hard to make robust. I don't understand why you wouldn't
just drive the unix utilities from perl.

Why use perl threads on Unix? Everyone knows they suck. Why not fork and use a
transactional database for IPC?

I remain unconvinced by the alleged wins people have with flat-file solutions.
It's usually about replacing a toy like MySQL or some convoluted BerkeleyDB
mess. You can use DB2 for free with up to 2 gigs of memory. Why waste even a
minute replicating transactional RDBMS functionality by hand? As soon as
you're dealing with flock and company when you COULD be using a DB, you should
be, as far as I'm concerned.

~~~
scumola
kingkongrevenge - I didn't mean to sound like I replaced all of my DB
functionality with flat files. I have several stages that my data flows
through (fetching, ripping, sanitizing, indexing, uploading), and I replaced
the constant calls to MySQL by writing things to a flat file and munging the
data on disk with the GNU tools before it gets written to the database. With
large datasets, DB calls (even with good indexing) can get expensive, so I
tried to avoid them as much as possible for the crawling stage of the process.
I certainly need a database for the other stages of the process.

A simplified example: if I crawl a website, the chances are good that I'll
pull several thousand copies of the same URL from the site over and over
again. I want to insert every URL that I find back into the database so I can
crawl it later, but I don't want any dupes - I just want to crawl each URL
once and then move on. If I insert a URL into the DB and let the DB check
whether it's a dupe, I waste DB time/IO. My solution was to crawl thousands of
pages, dump all of the URLs into a text file, run the text file through
'sort | uniq', and then dump those URLs into the DB and let the DB ignore the
dupes at that level. I still have to use a DB to do some of the work, but
pre-processing the data up-front with 'sort | uniq' is mega-speedier in my
special case.
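
Driven from C just for illustration (urls.txt is a made-up filename, and in
practice this step is a one-liner in the shell scripts rather than part of the
crawler itself), the dedup pass is basically:

    /* dedup.c -- sketch of the pre-processing pass described above: the
     * crawler has dumped every URL it saw into urls.txt, and the shell's
     * sort | uniq collapses duplicates before the DB is ever touched. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* popen() runs the command through /bin/sh, so the pipeline works as-is. */
        FILE *p = popen("sort urls.txt | uniq", "r");
        if (!p) {
            perror("popen");
            return 1;
        }

        char url[4096];
        while (fgets(url, sizeof url, p)) {
            url[strcspn(url, "\n")] = '\0';
            /* Here the real code would batch these into the DB insert
             * (letting the DB ignore any remaining duplicate keys). */
            printf("unique url: %s\n", url);
        }
        pclose(p);
        return 0;
    }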

Also, I did wrap most of the scripts with Perl to talk to MySQL and do other
sanity-checking that's easier in Perl than bash or tcsh. :)

------
cdr
Nice, but are your growth issues so bad that you couldn't make this a blog
entry on mediawombat.com instead?

