Hacker News new | comments | show | ask | jobs | submit login
Shoestring Budget? Starting to feel growth issues on your back-end? Embrace unix and C
140 points by scumola on Sept 9, 2008 | hide | past | web | favorite | 120 comments
I'm one of the co-founders of Media Wombat ( flash search engine at http://mediawombat.com ) and we're a startup that's about 8 months old. We have no funding except what we can afford to do ourselves, no investors and very little spare hardware.

Our site is a search engine - like google, but for flash content. We threw together the site in a weekend and have been slowly tweaking it over time but recently have run into some growth issues. If you have funding or investors and have growth issues, you can just throw more hardware at the problem and ta-da! You're fast again. However, for those of us who don't have money being thrown at us, we have to be a little more creative and start to look at optimization.

I've got a couple of old machines and an 8-drive SCSI RAID in my basement that I'm using for our search engine to crawl the web and process the data that we index for our search engine. My machines are not quad-core and don't have 64GB of ram in them. They're old and tiny.

When we first put http://mediawombat.com together, we threw it all together just to get it to work. We did everything as quick and dirty as we could. We used perl and mysql for the back-end. The crawler was straight-forward, single-threaded, slow, clunky, but it worked. After about 4 months of collecting data, we started to see some growth issues. Searches were becoming slow.

We were using a live search through all of our indexed data. First step to optimization - caching of course. This was a pretty easy no-brainer. We recorded all of the searches that people did on our site and we pre-cached the search results for the top-2000 of the most-popular searches. This way, when someone does a search for a popular search phrase, they get (almost) immediate results. Not too bad of a solution.

Just a few weeks ago, I noticed that our crawler has become the slowest part of our back-end process. We had crawled most of our initial sites and gotten some good data back, but now, the crawler is just crawling lots of uninteresting urls and not getting anything of any value back. We overflowed onto other sites with no flash and were now crawling sites that didn't return any useful data back. We were wasting resources.

So, I was at my mother-in-law's place last weekend and she doesn't have any internet connectivity. I was bored and needed some time away from the family to geek-out. I thought to myself how I could make the back-end crawler and database more optimized ... I re-wrote the crawler in C and made it multi-threadded. And instead of reading and writing to a database, I used flat files. I also pre-processed everything outside of the database using the old-style unix text utilities (grep, sort, uniq, sed, awk, ...). One of those cartoonish lightbulb-over-the-head moments happened to me.

The unix text utilities were written in the 60's and 70's when computers were 33mhz and had 5MB of ram. Of course these utilities are going to be lean and mean! Perl was a memory hog and if I multi-threadded it, ate up most of my available ram on my machine if I spawned > 5 threads.

I read the man pages on all of the unix text utils that I could find. I even found some new ones that I didn't even know about before (and I've been using unix (linux) as my primary OS since 1990). I managed to replace about 90% of my crawler that was previously written in perl, to a bunch of unix utilities, a few shell scripts and my multi-threadded crawler in C. I did my crawling operations in bulk and processed them in the background while the crawler was doing it's thing.

I was super-proud that I had optimized the code as much as I had. I went from about 30k urls a day to about 60k urls crawled an hour! To me, that was a huge speedup! Anyway, to make a long story short, I'm still looking for ways to optimize things and I've got a long list of things to do if/when the time becomes available and I've got more time than money at this point so it's worth the effort, and it's really rewarding!

This may have been a fun and interesting project for you (and a great way to learn the unix utils), but I wouldn't recommend that other startups follow this path.

Rewriting something in C should be the last thing you do, not the first. The first thing you should do is find out why it's slow. In your case, it sounds as though you were fetching one url at a time (blocking). Switching to async io all could have fixed this.

Btw, if you're using the Gnu utilities, it's unlikely that they were written in the 60's and 70's (also, people were processing much smaller amounts of data back then).

Yea, fetching one URL at a time was a major issue, but things like the executable size in memory was also a problem. Remember, this is an old machine. Mysql takes 400M for its index for my data, my other processes run on the same machine also and all take memory. I initially added swap for things, but that's clearly not the right direction.

Also, I'm 40 years old. When I went to college, C programming was the norm, so it wasn't that difficult for me and didn't take long to implement and seemed like a great place to start to optimize things. It's fast, saturates my home Comcast pipe (they might cap my bandwidth, which I'm now worried about) and takes about 200M of memory with 30k URLs loaded into the queue. I suppose that I could reduce the memory requirements of the app some more by keeping the URLs in a BerkDB or something, but I think that 200m is acceptable for my needs now. For me, it's all about getting the most out of my available resources without throwing more money at it.

Re: gnu utils and 60's and 70's comments ... Yea, I admit that my knowledge of the gnu utils started in 1990 and I know that unix has been around since the 60's, so my initial post stating that the utils have been around since the 60's and that CPU speed and memory sizes were larger than they actually were back then were incorrect. My bad. ;)

Funny how these 2.0 young bucks consider a C-rewrite extreme, when for many coders is is the lingua franca, the bread and butter. I got started on high-level languages but I've made an effort to get familiar with C because it's small, beautiful, and fast as all hell. Fast is powerful.

Part of knowing a tool is knowing when it's inappropriate. A rewrite in ANY language to "fix" a problem that you don't understand is generally a mistake. Doing things the needlessly complex way doesn't make you a better or manlier programmer.

By the way, a lot of these perl and python scripts spend most of their time calling into C libraries anyway, so the performance difference is often negligible.

And unless my knowledge of the Python standard library is way off, the Python standard library is written in Python, not C. That's what all those .py files are in the /lib folder. So python scripts spend most of their time calling into Python libraries. It's true that the interpreter is written in c (c++?,) maybe that's what you were thinking of.

Why would you bring manliness into the discussion? Are you suggesting that Perl and Python are somehow manlier? Because I have no idea what that means.

Speed comes in two forms, execution time and development and maintenance time, C is only good for one of these. Python or Ruby would be better for the other.

People that complain so much about C never had a change to learn it well. Go talk to Linus Torvalds about changing from C anything else....

I wrote almost exclusively in C for 12 years. I wrote cgiemail, which was very widely used. Today I would never start a server-side web project in C.

Which parts don't I know well?

apples and oranges. Linus writes OS code. Good luck doing your next web 2.0 startup if you write it primarily in C

Why do you think C is such a maintenance issue?

I like it's well accepted that garbage collection is a huge boon for productivity and stability

C has a variety of garbage collectors, here's one:


I wonder if it's really more of an intimidation factor than anything (and I'm not saying this to be a jerk, I am truly curious why people have the perception that C is such a nightmare to maintain).

Again, you're looking at best case. Talk to us when GC is mandatory, not optional, because until then, we'll still have to work with C code that can't be trusted because it may not use a GC. To folks who want higher level languages, CG is not an optional feature, it's a core part of the language.

Are you suggesting that you can be as productive in C as I can be in ruby? If not then you answered your own question.

Second, are those garbage collectors widely used in C code and if not why?

Because it is. It's much easier to blow your foot off in C than in Python or Ruby. I think it's pretty well accepted that scripting languages are far more maintainable than C. Sure you can write maintainable C, but that's best case, I think it's more revealing to look at the worse case.

Yeah but C's worse case is still better than most of Perl's, PHP's, etc. best cases. Just as an example.

I don't disagree, btw, that dynamic languages are easier, but I'm really wondering if C's maintenance is really more of a perception than a reality.

That's just an absurd statement. C's worse case is obfuscated garbage http://en.wikipedia.org/wiki/International_Obfuscated_C_Code... that just about no one could read or maintain and you think that's better that Python, PHP, Pearl, or Ruby's best case?

Or is it more likely that like most C folk, you're just stuck on execution speed as the only measure of a language or you don't know what maintainability actually means?

C's being hard to maintain is not a perception, it's a reality. If C were easy to maintain there wouldn't have been room for scripting languages to exist, everyone would have just kept using C. C is extremely low level and extremely powerful, making it hard to work in and very dangerous.

Yes you can do anything with a for loop, but I'd much rather use higher level abstractions like higher order functions, first class functions, objects, continuations, and exceptions and not be able to directly play with memory or pointers because I don't give a crap about the machine, I give a crap about abstracting and solving my problem with the least amount of necessary boilerplate or repetitive code.

Look, make it simple, assume for the sake of argument that Python and C have exactly equal execution speed under all circumstances, given that axiom, why would anyone choose C over Python? Which language do you think most people would choose?

You reveal the primary flaw in your thinking with the phrase "C folk." Languages are not enemy camps. Just learn them both and use the appropriate tool in the appropriate situation - python for glue, prototyping, business systems, web programming, etc. C for operating systems, networking layers, database engines, etc. Try not to attach your ego to either one.

You're hand waving, and there was no flaw in my thinking. C doesn't scare me and I use many languages, and that still has no bearing on the point of the conversation. C is a maintainability nightmare compared to lighter weight scripting languages. That fact stands on its own.

There are "C folk" who won't use anything else because they're stuck in the rut of only thinking of execution speed, pointing that out is not a flaw, it's a fact.

Ok, I will address your argument directly. You state that it's possible to write maintainable C in the best case, but that the worst-case is more revealing. However if we're to be fair, we must also look at the worst case for scripting languages. For example, which of these is more maintainable?

http://www.p-nand-q.com/python/obfuscated_python.html http://en.wikipedia.org/wiki/International_Obfuscated_C_Code...

I would suggest that neither one is particularly maintainable, nor are either of these styles of code likely to appear in the real world. Therefore the worst case for these languages is not really more telling, and not applicable to this discussion.

I wasn't the one who claimed C's worse case was better than Python's best. Perhaps you should reread the thread and see what I was objecting to. You can write bad code in any language, but some don't let you blow your foot off as easily. It's well known and well accepted that higher level languages are more maintainable that C, that's just a fact. If you don't agree, then you're the one making radical claims, not I. Regardless, I'm bored now, this is a dead horse.

I'm curious as to what you are storing besides the URL. Given the two figures, (200M RAM, 30k URLs) I calculate that you are using 7k per URL, which seems a bit excessive to me. Or does the 200M include other data besides the URLs?

Yes, that's the right question to be asking. There's no reason why anything should be taking 7k per url.

fully agree. From one 40 year old hacker to another, C is a very meaningful and useful language. It is not a "premature optimization" for someone who really gets unix/linux. I have built http://shellshadow.com with C and have been very happy with the decision to get "closer to the metal" from the beginning. My server processes are lean and mean. They don't crash for inexplicable reasons. When I have a bug, there are not many layers of technology to slug through.

A server written with boost's asio in C++ will be just as lean and mean, but a fraction of the code.

even if you don't have funding how hard is it to get some old desktops that can have a couple of gigs of memory?

Your urls don't need to be in BDB. Just write them to a text file and stream from that. If you want to check for duplicates, just find a simple bloom filter library. There, now you're under a megabyte. :)

dbserver swf # ps auxw | grep crawl

root 6984 3.0 0.5 215832 5168 pts/5 Sl+ 12:59 0:00 ./crawl

I'm using a linked-list in C to store all of the queued-up URLs so all of the threads can mark them as being worked on or whatever. Also, the app and curl lib takes a few megs too. I'm sure that the heap and multi-threadding libs are responsible for some of the resident memory usage. :) the on-disk binary is only 12k if that make a difference. ;)

The struct:

typedef struct my_url { char * url; int retcode; double old_size; double new_size; status status; struct my_url * next; char * filename; int flashurl_id; int depth; } my_url;

I also like keeping them in memory as opposed to a BDB because of access speed - 25+ threads don't have to wait when it's all in memory, but 25+ threads hitting a BDB would be much nastier on the OS.

Great post!

Please, HN, change the color of text in posts like the above. There is very little contrast between #828282 (copy) and #F6F6EF (background), and I for one am sick of having to fix this with Firebug.

I've always assumed that this font coloring is meant to discourage people from making lengthy text-based posts on HN.

I think that's what a letter or word limit is for, though.

Artificial limits don't seem like the correct answer. The community is the filter.

Use the "zap" bookmarklet, found here:


Add it to your bookmarks toolbar, and any site you use it on will turn all text into black-on-white (in fact, it gets rid of most background colors and images), remove anything flash-based, links are changed to blue & underlined, etc.

here here!

*hear hear

scumola, this should have been a blog post that you linked to. I'm not saying this because I don't think this is HN-quality material, it's actually an awesome little story to read. But if this was on a company blog somewhere, it would be generating juice for your company in addition to for HN.

This is great news and a great way to spread the word about your service. Don't let that opportunity go to waste!

>The unix text utilities were written in the 60's and 70's when computers were 33mhz and had 5MB of ram.

I'm not positive (I was born in the 70s), but I'm pretty sure they had less RAM and speed than this.

Yep. Unix was written on a PDP-7 http://www.linfo.org/pdp-7.html -- probably with 8,000 or 16,000 words of memory (18 bit words), probably running on the order of several thousand instructions per second

My 1990's Amiga 1200 had 14 MHz and 2-8 MB ram, so you're off by 20-30 years :-)

Yeah, my first non-commodore computer was a blazing fast 33MHz 386 with 4MB RAM -- 1991 (~$3,400 by the way)

Damn, do I feel spoiled watching a modest app compile in a few seconds today!

The very first computer I saw Unix on was a minicomputer/workstation in 1985. It had a 68020 processor running at 20 mhz, and had 4mb of RAM. It cost around $75k.

The first one I saw (not used - the first one I used was a 680x0 box) was a Z-80 based Cromemco monster running Cromix on less than 512K.

Unix can be really small, if we sacrifice some stuff.

I thought we were making the point that hardware was slow and expensive much later than the 60s and 70s, not that Unix could be really small.

I was making the point that even a lowly hare-brained 8-bit processor can run a Unix-like OS.

IIRC, my first unix machine was a 16 MHz 68020 with about a megabyte of RAM and about 200 MB of fixed disk storage.

It served around 40 terminals.

The Amiga 500 came in the late 80s and had a 7.14Mhz CPU and 512kb of RAM, and it was considered awesome... So yeah.

So much misinformation indeed! One reason, out of many, of course, that I stopped reading Slashdot was the amount of misinformation propagated there. I was afraid that I would read something false while in an insufficiently wary frame of mind and believe it. I'd hate to see the same start to happen here.

What are you talking about?

People shooting off things like "The unix text utilities were written in the 60's and 70's when computers were 33mhz and had 5MB of ram." as though they were fact.

Okay, there is a lot of noise in this thread, so this needs to be clearly stated:

Good job, scumola.

It's impressive that you diagnosed the architectural bottleneck of your design and solved it with the least amount of effort (from your standpoint) and achieved a 48x speedup. There are many developers who simply can't do that; they have tunnelvision, wasting a ton of time on improving portions of their systems without first thinking deeply about the problem they're trying to solve (and about simple ways to sidestep that problem).

In your case, you identified the correct problem, which was "How can I maximize the number of sites crawled per day?" and not "How can I optimize [the database, the perl scripts, etc]?" And then you did the most straightforward optimization you thought of, accomplishing your goal in one or two nights. Your solution is valid, maintainable, and most importantly works, and so I personally don't see anything wrong with it. Again, nice hack.

This is one reason I like Python so much. Writing hooks into C/C++ is pretty easy. Often they are already there for you. For example, OpenCV is an excellent image processing library in C++, and already has hooks to Python.

Lots of companies take this approach, of Python + C/C++. A few that come to mind are Google, Weta, ILM, iRobot, and Tipjoy.

It's also true of Perl. But it wouldn't have solved the problem in this case...a blocking IO single-process vs. non-blocking multi-process configuration is obviously not going to fare well, and it couldn't be made (significantly) faster by rewriting some bits and pieces in C.

There are also pretty big costs associated with shell scripted processing, like spawning a new shell for every command (while Perl, Python, etc. do not, unless you're explicitly forking). While there was a thread about programmers greps a few days ago, and out of curiosity I compared find+grep to grin and ack, and found that find piped to grep required a quarter the time of ack (written in Perl) and more than an order of magnitude less time than grin (written in Python) to perform a simple search, I also know that for more complex cases the cost of shell-based processing grows as complexity grows.

For an interesting comparison, look at the GNU autotools vs. cons (in Perl) or SCons (in Python). Both cons are dramatically faster than the autotools. Partly because they aren't as complex and don't target quite as many build systems, but also partly because they do most everything from a single running process.

This also seems to be true with Lua. I'm relatively new to the language, but in many ways it feels like a minimalistic Python, intended to be embedded in a C program as a library (though it's usable on its own as well).

Most of the language is closely tied to "tables", which are very similar to Python's dictionaries. Unlike Python, however, it has real closures and tail call optimization.

Lua is semantically more similar to Scheme, according to the language designers. The resemblance is quite intentional. Ease of embedding into C is a first-priority of the language implementation. It's almost trivial to interface to C, and in many cases it's easier to write bindings to the Lua API than to massage SWIG generated interfaces.

"userdata" are first-class values in Lua. They are containers for pointers to arbitrary C structures, and by overloading their behavior it's extremely easy to interface between C and Lua.

Lua tables are extremely elegant, and are the basis of object and class systems, all data structures, namespaces, packaging, and libraries.

The Lua interpreter is quite fast by itself, and with the LuaJIT written by Mike Pall your speeds can be even greater. The rumor is that the first public release of LuaJIT 2.0 is coming this winter... that one is a tracing compiler based on the same concepts as Tamarin, SpiderMonkey, and V8.

Yes. It seems to have many of the same abilities as Scheme (and I forgot to mention it has coroutines.), though it doesn't have read / the sexp syntax or macros. It's extraordinarily portable, though, and also seems to also contain most of what I like about Python (though it lacks its remarkable libary).

I read The Implementation of Lua 5.0 (www.tecgraf.puc-rio.br/~lhf/ftp/doc/jucs05.pdf, 15ish page pdf) last week, experimented some with _Programming in Lua_ 2nd ed. since, and I've been very impressed with how tasteful the overall language design seems. It seems like a brilliant tiny language, much like Forth and its descendants. (I've read that the Lua source is good reading, too. See: http://www.reddit.com/comments/63hth/ask_reddit_which_oss_co...)

The Lua Emacs mode is nothing special, but I'll see if I can improve it. :)

Python has no advantage in this regard over any other language supported by SWIG. And perl almost certainly has more available library bindings than python.

Someone hasn't heard of ctypes: http://docs.python.org/lib/module-ctypes.html

It lets you use arbitrary c libraries in Python. It automatically does the wrapping for you.

I hadn't heard of ctypes, but the same functionality is available in perl. It seems pretty silly and crash prone to me to dynamically load object code and then wrap it with a load of error checking rather than write proper bindings.

ctypes, cython, Boost::python and pyrex are real advantages in this regard; SWIG is extremely painful in comparison.

Both Perl and Python have far nicer ways to link to C than SWIG. Having worked with both, I think they're about equal these days on the binding to C front (though, as always, in Perl there's far more than one way to do it, and there's only a couple of obvious ways to do it in Python).

Do you know of any perl & C++ solutions other than perlxs and swig?

I've never used C++ from Perl, only C, but there is a version of Inline for C++ here: http://search.cpan.org/~neilw/Inline-CPP-0.25/

The nice thing about Inline is that the bindings are obviously named (function foo() in C becomes sub foo in Perl), and it's all automagic. Of course, if your C/C++ code is big or an existing library you want to bind to rather than just a few bits and pieces, perlxs (and its related utilities) might very well still be the way to go. There is XSpp, which supposedly makes C++ and XS play nicer together. But perlxs is dramatically better than SWIG in most circumstances I can think of (I don't mean to imply that SWIG is not a wonderful tool--it's a fantastic tool, but both Perl and Python have language-specific solutions to most of the same problems that are superior).

But you can't beat Inline::C (and I guess Inline::CPP) for simplicity. If you know both Perl and C/C++ their usage is absolutely obvious. perlxs requires a little more reading, and I have to consult the docs whenever I cross paths with it. Python has Weave (part of SciPy), if you like the inline model, though I'm not sure it's being maintained heavily these days.

If large data sets are your problem (some of the things you might use the numeric data structure tools in Boost for), PDL might be a solution, since it has a lot of tools for manipulating large data sets without having to revert to writing your own C. Likewise NumPy in Python. And, of course, there are already Perl bindings for Boost in CPAN, though the vast majority of Boost seems to be just making C++ more like Perl, rather than adding anything that Perl coders would find interesting.

But python is a pleasure to code. Also, I prefer Boost, and know the python bindings are solid.


If you haven't yet, check out _The Unix Programming Environment_ and _The Practice of Programming_, both by Rob Pike and Brian Kernighan (K of K&R). They're concise, highly informative books about using the Unix toolset to their maximum potential. The former was written back when computers were slow and had little memory, the latter in from 1999 but very much in the same spirit. (It seems to include a lot of insights from developing Plan 9.)

Also, a dissenting opinion here: C's performance vs. higher level languages' development speed is not necessarily an either/or choice. Some languages (OCaml, the Chicken Scheme compiler, implementations of Common Lisp with type annotations or inference for optimizing, Haskell (under certain conditions...), others) can perform very favorably compared to C, but tend to be much, much easier to maintain and debug.

As a generalization, languages that let you pin down types are faster because they only need to determine casts once, at compile time, but if those decisions can be postponed until your program is already mostly worked out (or better still, automatically inferred and checked for internal consistency), you can keep the overall design flexible while you're experimenting with it. Win / win.

Also (as I note in a comment below), Python can perform quite well when the program is small amounts of Python tying together calls to its standard library, much of which (e.g. string processing) is written in heavily optimized C.

Alternately, you could embed Lua in your C and write the parts that don't need to be tuned (or the first draft of everything) in that.

"Some languages ... can perform very favorably compared to C, but tend to be much, much easier to maintain and debug."

Keep in mind that each language has a learning curve. As you learn more languages, that curve becomes much easier to traverse. However, the goal is to Get Stuff Done. The most straightforward engineering tactic to accomplish that is to become highly skilled with a few choice tools, then use those tools to solve almost all of your problems. (Note: that's different than a "one tool for every problem" mindset. You can solve most problems with a small toolbox while still avoiding the trap of using an inappropriate tool for the job.)

Fully agreed. I'm posting for the archives and other readers at least as much as him, trying to add in a reminder that "rewrite the whole shebang in C" isn't necessary for good performance.

I dabble in languages for fun, but find that I ultimately end up using the new techniques I learn in a few practical multiparadigm languages, e.g. OCaml, Lisp, and Python* . If you want to learn about types/powerful type systems, try Haskell or one of the MLs. If you want to see a brilliantly designed module system, look at the MLs. If you want to see well-designed syntactic sugar, look at Python and Lua (among others). If you want to understand OO better, look at Smalltalk. Etc. Spending a week (or weekend) now and then exploring the ideas and mindsets in unfamiliar languages, particularly those influential on what you do your real work in, can really expand your toolkit. (Also, read code.)

* Python is not fully multiparadigm, e.g. it's awkward for functional programming, but it's fairly flexible, and its giant standard library makes up for several weaknesses IMO.

Really knowing any one of the several languages I listed in the initial comment would probably be enough, and some approaches (e.g. Lua in C) could probably be learned relatively quickly. I suspect it would probably take much more time to learn to use C++ really well than OCaml or Lisp, though.

> multi-threadded crawler in C

If your crawler is I/O-bound, then just wait till you discover epoll :)

Or, on a more general note, have a look at http://www.kegel.com/c10k.html

Do epoll / other c10k solutions address I/O-bound situations?

I thought that most of these solutions deal with CPU-bound and sometimes RAM-bound situations -- i.e. they fix spending too much time spinning the CPU in various ways waiting for I/O, or too many threads at once taking up too much RAM.

epoll() is for IO-bound situations. You write your program using an event model, where the events are IO related (you select events based upon file descriptors being ready for reading and/or writing) which (in my opinion) make for easy programming of network based daemons ( http://boston.conman.org/2007/03/08.1 ).

The general method appears to be, write the program using an event model, then create a thread per CPU, which each thread waiting on epoll() (timeouts optional).

Maybe I'm missing something -- if you're I/O-bound why would it matter whether you use epoll() vs poll() other than CPU usage?

From what I understand (and how I've used it in my work), one uses epoll() specifically because they're NOT I/O-bound and so need to come up with strategies for using the minimum amount of CPU and RAM per simultaneous I/O so as to avoid becoming CPU- or RAM- bound.

Hence my original point -- if one is I/O-bound while using poll(), it doesn't really matter whether the CPU is spinning on that or epoll(), since I/O won't happen any faster.

I just found epoll() much easier to work with than select() (been there, done that, rather not go back to it) or poll(), and the fact that you avoid scanning an array upon return.

> The general method appears to be, write the program using an event model, then create a thread per CPU, which each thread waiting on epoll() (timeouts optional).

Alternative arrangement is to have a single IO thread that deals exclusively with getting data off the wire and multiple "worker" threads that handle the actual processing. I like this model better, because it provides cleaner logic separation, run-time threading control and it has several other advantages. Though "your experience may vary".

Congratulations. That is an impressive speed up.

I agree about the built-in Unix utils. You just do not see people taking advantage of these powerful and extremely optimized programs any more. I wonder how many times grep or sort have been unwittingly rewritten in Perl or Ruby because the programmer lacked familiarity with basic Unix tools?

As for your crawler, I think the significant thing here is that you rewrote something in C after you already had it working in another language. Not to bag on C, but writing the original in a higher level language first gives you a better shot at correcting any bugs in the actual solution domain. Then if you move to C you're only fighting against C, not against C and bugs in your solution at the same time.

Congrats on your big speedup; successful optimization like that is always a rush.

I wonder what the result would be if you did everything you describe, but wrote the code that's now in C, in Python instead. I suspect the speed would be very similar. (I like C, for what it's worth.)

I wonder what the result would be if you ... wrote the code that's now in C, in Python instead. I suspect the speed would be very similar.

Is that a joke?

Not necessarily. Let's think of some operations a crawler would need to do:

1. Spawn some threads (Python uses native threading) 2. Connect to and load some URLs (This isn't going to be slow anywhere.) 3. Run some regular expressions (Python's regexp engine is all C) 4. Write to the disk. (Python uses the standard system libraries for this)

It wouldn't be as fast, but it might be surprisingly close. There would be fewer lines of code to boot.

The problem would be the GIL (Global Interpreter Lock). You couldn't actually have more than 1 thread running at a time. It sounds like the man only has 1 processor, so that wouldn't be the end of the world, but if he has more, you can just swap in the process module for the threading module. Of course, then you might have some RAM issues.

Or, briefly: Certain classes of Python programs can actually be faster than average C, because they consist of small amounts of Python code connecting libraries that are already written in very thoroughly tuned C.

Why use threads when you can use an event-based IO system? (I think Twisted is the Python way of doing this.)

I was just going off of what he had already done. I agree that async is a better way to do this.

Twisted is one way, but Python ships with asynchronous libraries http://docs.python.org/lib/module-asyncore.html.

> The problem would be the GIL (Global Interpreter Lock). You couldn't actually have more than 1 thread running at a time.

Not actually true; socket calls release the GIL when they're not active, as do many of python's C libraries. Especially the better-written standard ones.

True, although I would argue that my statement is still correct. You cannot have more than 1 thread running simultaneously.

The GIL is not as big of a problem as most people see it, it only interferes with CPU bound, highly parallelizable tasks.

The "..." above had some important bits in it. My point was, how much of the speedup came from switching to C, and how much came from all the other stuff he did (much of which could also be done in Python).

Switching from a scripting language to C doesn't mean an automatic huge speed boost. It depends where your time is being spent.

"everything [he] described" regarding code was porting code over to C.

"Starting to feel growth issues on your back-end?"

I think you'd better get a doctor to look at that.

"Does this algorithm make my memory footprint look big?"

Key to innovation: no funds, no Internet, a laptop and free time.

We had some views in our rails app that could be hit several times per second by our users, and they were uncacheable, so we implemented the views in c++ using libpg and fastcgi. So much awesomely faster.

If you've got some code that works well and you can solidify into c++ (ie - you're no longer tweaking it 3 times a day) - it's totally worth spending the time to rewrite.

I think C++ on rails would be a great idea, ie - some code generators to help you port your most used .rhtml views over to C++.

Very cool. I have had a similar experience with Mibbit...

Next step, get rid of those threads and use non blocking IO ;)

Mibbit is written in C? So you run it at home? How do you pay for the bandwidth?

Not quite, Mibbit is in Java. I took "c" to mean "currently unpopular non-hyped language". Much like Java is... Mibbit runs on pretty much a single server at the moment so it's been a similar scaling exercise.

I'd definitely agree though with the OP... having limitations forces you to look at better ways to do stuff, and ways to squeeze every last drop out of the hardware/resources.

I took "c" to mean "currently unpopular non-hyped language". Much like Java is.

Java is a popular, hyped language. It's just not a very good one.

If you enjoyed learning/using sed, awk, grep and other Unix text processing utils, you'll love this: http://borel.slu.edu/obair/ufp.pdf

Its called Unix For Poets and it'll show you just how far you can really push these tools.

What's your average page size?

Does your crawler support gzip compressed transfers?

What speed is your Comcast link?

Page size varies, but when I was doing the initial speed tests, pages were around 20k each (guestimate).

Yea, the crawler uses libcurl, which supports gzip compressed content.

My comcast link is (theoretically) 7Mbit down, 385Kbit up.

First, I hope you weren't using the obscenely slow version of Perl that ships with RedHat.

Second, for more perf analysis, there are some very good unix tools for profiling & optimizing C code. Many of them free.

What's interesting about this posting is unbenost to my co-founder, yesterday I was discussing some of the issues we were facing with a "retired" unix programmer and he talked about early sorting and searching methods. It was quite remarkable what they achieved with very little in the way of hardware and memory.

perl threading was (and I'm sure still is) absolutely horrid.

I'd be interested in seeing the results of a rewrite not to C, but to python or ruby where the threading support is much much better. Then you could rewrite functions at a time in C as needed, but not have the extra burden of rewriting the whole thing.

I totally agree with the rest of the approach. Going low tech and using unix tools is a very good way to reduce overhead, increase parallelism, and delay calculations. One of the nice things about this approach is you can cobble up another $50 unix box to do some of the bulk processing via nfs or other means.

Congrats... It sounds like a very interesting project.

Threading support is better in python than in perl? I'm not familiar with perls threading model (or libraries), but this seems wrong - considering things like the GIL in python.

Could you elaborate a bit on this?

I'm much more familiar with threads in ruby. It would actually be a real boon to this project imo.

I don't have as much experience with python threads as I do ruby. I've toyed with the internals of python and perl as well and know perl to be a wasteland and python to be rather elegantly clean. So I would expect python's threads to be better implemented. Poking at it, it does appear to be so.

Doesn't ruby only have green threads?

Yes - unless its jruby then they are non green (forget the term for it at this time of the morning).

yes... and they're great. Esp for situations like the OP was describing. Fire off a ton of incredibly lightweight threads and they'll quietly sit waiting for IO... perfect.

The thing with perl threads is that pretty much nobody uses them. They work fine, as far as I know, but with a default 16MB stack size and "share nothing" semantics don't start more than a few. Perl is more unix centric and people just prefer to fork and use the well baked IPC mechanisms. Various libraries such as POE make fork+IPC easy enough that it's hard to see the need for threads in the kinds of domains where a language like perl is applicable.

people just prefer to fork and use the well baked IPC mechanisms

Ahh, that makes sense. I'm willing to bet that the vast majority of perl code runs on *nix systems (I know that with the exception of one app, all the perl I've ever written was only run on linux servers).

they certainly didn't work fine when I tried to use them... after they'd gone into production releases of perl, the canonical demo/example scripts that were out there all crashed the perl runtime in a fiery horrible death.

I can only hope that it has improved since then, but I've moved on to greener pastures.

They were clearly marked as experimental in the documentation all the way up to recent 5.8 releases. From what I hear they work fine now.

> I'm still looking for ways to optimize things

Re-implement your crawler on a FPGA.

Just kidding :) It's great that you got what you were expecting, but as Paul said, you have to profile before you optimize (otherwise you will "optimize" useless stuff, and not only waste your time but also likely do counter-optimizations).

Anyway, your post was very interesting. Because a lot of people assume they have to use layers on top of the OS, while modern Unix systems have good file systems and memory managers (maybe have a look at DragonFly BSD; they are going into an interesting direction).

i used an implementation of cilk when i crawled ~1000000 htmls from a social network in 2 hours (can't disclose -- the site went into maintenance during/after my experiments ... just to exercise my curiosity)

had to be root to raise the user's maxproc-max (previous experiments locked me out --- "sh can't fork" messages ... can't even ssh in)

it's done in a 100mbps amd2000+ 256 mb 40 GB IDE OpenBSD colo ... but i don't think the hardware matters that much (the cilk is really the key)

Yes, I love it when things can be speed up with old school techniques. Although I'm not sure C was a really great choice in case you need to upscale later.

Most startups that have scaling problems have them due to number of customers, in which case it's trivial to either monetize or raise some money.

i'm not sure this is true. I've been at a couple of startups where the scaling issues came from the database growing everyday with a relatively fixed set of users.

I would still bet the number of startups that have problems like that are <10%

that's probably true. We weren't getting VC funding and had a stream of revenue to keep it going for a couple years.

Congrats. I'm sure that's a good feeling.

I once had a boss who refused to buy the programmers Pentium machines and made us use 486s instead. He said it wasn't because he was cheap, but it was so that if we wrote code that ran really fast on a 486, just think how fast it will run on a Pentium. Nevermind that none of the new Pentium features, like SIMD, could be taken advantage of.

That company is no longer in business.

It's true that utilities like grep, sort, and join are highly optimized. Many people have had wins replacing scripting logic with calls to these utilities. But I'm having some difficulty understanding the merits of putting the top level logic in C, using shell scripts, and the alleged advantages of flat text files.

Shell scripts are hard to make robust. I don't understand why you wouldn't just drive the unix utilities from perl.

Why use perl threads on unix? Everyone know they suck. Why not fork and use a transactional database for IPC?

I remain unconvinced of the alleged wins people have with flat file solutions. It's usually about replacing a toy like mysql or some convoluted berkelydb mess. You can use db2 for free up to 2 gigs of memory. Why waste even a minute replicating transactional RDBMS functionality by hand? As soon as you're dealing with flock and company and you COULD be using a DB, you should, as far as I'm concerned.

kingkongrevenge - I didn't mean to sound like I replaced all of my DB functionality with flat files. I have several stages that my data flows through (fetching, ripping, sanitizing, indexing, uploading) and I replaced the constant calls to mysql by writing things to a flat file, munging the data on-disk with the gnu tools before they get written to the database. With large datasets, DB calls (even with good indexing) can get expensive, so I tried to avoid them as much as possible for the crawling stage of the process. I certainly need a database for the other stages of the process.

A simplified example: If I crawl a website, the chances are good that I'll pull several thousand copies of the same url from the site over and over again. I want to insert all URLs that I find back into the database so I can can crawl them, but I don't want any dupes. I just want to crawl the URL once and then move on. If I insert a URL into a DB and let the DB check if it's a dupe, then I waste DB time/IO. My solution was to crawl thousands of pages, dump all of the urls into a text file, run the text file through 'sort | uniq' and then dump those URLs into the DB and let the DB ignore the dupes at that level. I still have to use a DB to do some of the work, but pre-processing the data up-front using 'sort | uniq' is mega-speedier in my special case.

Also, I did wrap most of the scripts with perl to talk to mysql and do other sanity-checking that's easier in perl than bash or tcsh. :)

Nice, but are your growth issues so bad that you couldn't make this a blog entry on mediawombat.com instead?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact