
Taco Bell Programming (2010) - Jarred
http://web.archive.org/web/20101202135616/http://teddziuba.com/2010/10/taco-bell-programming.html
======
yuvipanda
> The Taco Bell answer? xargs and wget. In the rare case that you saturate the
> network connection, add some split and rsync. A "distributed crawler" is
> really only like 10 lines of shell script.

As someone who has had to clean up the messes of people who started with this
and built dense, many-hundred-line bash scripts... please do not do this.

> I made most of a SOAP server using static files and Apache's mod_rewrite. I
> could have done the whole thing Taco Bell style if I had only manned up and
> broken out sed, but I pussied out and wrote some Python.

I feel sad for whoever inherited this person's systems.

"Write code as if whoever inherits it is a psychopath with an axe who knows
where you live" is something I heard pretty early on in life and it's been
pretty useful.

~~~
mruniverse
> "built many hundred line dense bash scripts... please do not do this"

Of course. But that's true of any language once you have many hundreds of
lines of dense code.

His point is that if something can be done simply with built-in, proven
tools, use them until you need something more.
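
For concreteness, the kind of thing the article is gesturing at is roughly
this (a sketch, not the author's actual script; urls.txt and the flag choices
are assumptions):

    # fetch every URL in urls.txt, 16 downloads at a time,
    # mirroring each site's directory layout on disk
    xargs -P 16 -n 1 wget -q -x < urls.txt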

~~~
skybrian
No, it's not the same.

Most experienced programmers know a little bash and enough UNIX commands to
get by. This is enough to write a script that handles the happy path, but not
enough to handle all error conditions correctly. There are all sorts of tricks
you need to know that are commonly skipped. (Forgetting to use -print0, for
example, and that's an easy one; see the sketch below.) The resulting script
is probably okay if you
run it interactively and check the output but will blow up or silently do the
wrong thing for unexpected input in production. To properly review a bash
script for errors you need to be an expert.
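
For example, the -print0 trick (a minimal illustration; the file pattern is
made up):

    # naive: breaks on filenames containing spaces or newlines
    find . -name '*.log' | xargs rm
    # correct: NUL-delimit the filenames end to end
    find . -name '*.log' -print0 | xargs -0 rm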

By contrast, Go programmers with a few months of experience typically know all
of Go.

The older tool is not necessarily better if it has lots of obscure sharp edges
that most people don't learn.

~~~
csours
+1 - If you thought "it works on my machine" was bad with binaries, shell
scripts are so much worse.

Just like Excel "programs", though, shell scripts can be easily mined for
requirements for a real program.

~~~
itomato
This is a simple UNIX pipeline, not multi-hundred-line spaghetti of Korn, C,
Bash, or even Z shell scripts.

No builtins were used in the example, just core utilities deployed the way
they were designed.

Reinventing the wheel is completely bogus, doubly so when you ultimately make
calls to those same utilities, as is common when 'admins-cum-programmers'
start getting their hands dirty with Python.

~~~
gkop
I am generally in favor of the idea of the OP, but the core utilities _do_
differ across environments and this will bite you sooner or later.

I was bitten recently by some pretty boring search-and-replace functionality
differing between sed on OS X and on Debian. I would have had to pass a
different argument to sed depending on the version of sed (so I switched to
Perl for the task). But this is an insidious category of bug, where you don't
discover it until you try to run the script in another environment, and then
you're potentially stuck debugging the script from the top down.
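
The classic instance of this (whether or not it was the parent's exact bug)
is in-place editing; file.txt is a placeholder:

    # GNU sed (Debian): in-place edit, no backup file
    sed -i 's/foo/bar/g' file.txt
    # BSD sed (OS X): -i requires an explicit, possibly empty, backup suffix
    sed -i '' 's/foo/bar/g' file.txt
    # Perl behaves the same everywhere, hence the switch
    perl -pi -e 's/foo/bar/g' file.txt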

~~~
csours
If you need something that is only used once or for a short time, in one
place, script away!

I'm actually strongly in favor of scripts; but the web is eating everything.
If it has to scale, put it on the web.

------
mabbo
While I agree with some comments here that "Fuck you, I had to MAINTAIN your
bullshit Taco Bell system", for any project that will likely run no more than
once (prototypes, single-run analysis, etc.) or is never going to be checked
into source control, the power of the shell cannot be overstated.

I had an intern a couple years ago. Nice guy, but he didn't listen when we
said "Keep this simple". We had all the data from an A/B test he ran, and we
needed to do the analysis. He broke out MapReduce on EMR and all sorts of
other complexity. It was a few MB of data!

After his analysis went pretty poorly, I wrote up a shell script in a few
hours (sed, awk, xargs, woo) and got us the data we needed. I'd never ask
someone to maintain that madness, but I was able to break it down into simple
functions, piped into each other, in a single file.
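
For flavor, the shape of it was something like the sketch below; the function
names and the column layout here are invented, not the real analysis:

    #!/bin/bash
    # each stage is a small function; the analysis is one pipeline
    extract_events() { awk -F'\t' '$3 == "click"' "$1"; }
    group_by_variant() { sort -t$'\t' -k2,2; }
    summarize() { awk -F'\t' '{n[$2]++} END {for (v in n) print v, n[v]}'; }

    extract_events results.tsv | group_by_variant | summarize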

~~~
TTPrograms
Using MapReduce for a few MB of data? Jeez. That's a whole new can of worms.

I remember being younger and doing things the way I found "interesting"
rather than the way I found practical. Eventually, on a project or two, you
end up with basically nothing to show for all that complexity, because you're
focusing on the wrong things. I learned to start putting functionality first.

------
twic
This article is also on Ted Dziuba's current site, where you don't need to go
way back to see it:

[http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html)

------
peterjmag
I found myself nodding along (if not necessarily agreeing 100%) until I got to
this:

 _I could have done the whole thing Taco Bell style if I had only manned up
and broken out sed, but I pussied out and wrote some Python._

I know this article was written over 5 years ago, but I still feel the need to
say: expressions like _manned up_ and _pussied out_ are a huge turn-off for
me. Regardless of the ideas surrounding them or the author's programming
skills, the author loses a good chunk of credibility in my eyes simply by
using them. Sure, it's a stylistic choice, but it's one that I feel is
actively harmful to our industry, especially for those who are new to software
development or interested in learning.

So if you're reading this and you're one of those new or interested people:
please don't let this turn you off from discovering and learning tools like
sed or Python. One is not inferior to the other—they are simply different
tools that can be used for a wide range of different things. Don't be afraid
of using the "wrong" tool for the job (because even experienced developers do
this sometimes), just keep on learning new tools and adding them to your own
tool belt. Share your work with others, and then when someone tells you you're
using the "wrong" tool, ask them to explain why and to propose alternatives.
Lather, rinse, repeat.

~~~
seanclayton
Is the mindset of "cats are easily scared" considered socially unacceptable
and offensive nowadays? Or is the word pusillanimous a derogatory term? I'm
extremely confused by your disdain for the phrase "pussied out." "Manned up"
is understandable, though.

~~~
skybrian
It doesn't refer to cats. It's a rude way to refer to women.

~~~
AnkhMorporkian
Do you have a reference for that? I've always interpreted it as in 'scaredy
cat,' since female genitalia aren't traditionally considered afraid of
anything, and cats are.

~~~
skybrian
I don't really want to search for a reference, but one other person posted
with the same interpretation and this one got upvoted, which suggests at least
some other people understood it the same way.

~~~
AnkhMorporkian
I'm not claiming it's the correct interpretation, and honestly my googling is
coming up short because of the vast number of usages of that phrase on Urban
Dictionary, but honest to god I had never heard a misogynistic interpretation
of it before this thread. I never use the phrase in real life because it
sounds vulgar, but it's surprising.

If my interpretation was correct, I wonder if it's sort of a 'niggardly'
situation, where people stop using a word because it sounds similar to
something vulgar.

~~~
skybrian
Yeah, that's why I didn't google it.

Like all words, its meaning depends on what most people think it means. I'm
not all that surprised that you'd never heard of it. Maybe I just had rude
friends growing up.

------
zdw
Another, similar post:

[http://www.leancrew.com/all-this/2011/12/more-shell-less-egg...](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)

The point of both of these posts is more that Unix-style tools are extremely
powerful and expressive, to the point that writing more involved code for most
simple tasks (especially one-off tasks) frequently isn't worth the effort.

Complexity doesn't improve things, especially when it doesn't add value.

------
wwweston
Related:

    http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/
    

TL;DR: Donald Knuth and Doug McIlroy each wrote a program that would read a
text file, determine the n most frequently used words, and print out a sorted
list of those words along with their frequencies.

Knuth's was 10 pages of (very tight, well-written, literately documented)
Pascal.

McIlroy's was tweetable:

    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q
    

Now... a programmer doesn't always have the luxury of working with a full
suite of convenient tools well-suited to their problem domain (as UNIX shell
tools were in this case), and the merits of Knuth's careful and literate
approach can serve well across many domains.
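
To make the comparison concrete: saved as a script, the pipeline above runs
like this (the file names are placeholders):

    # print the 10 most frequent words in a text
    sh wordfreq.sh 10 < moby-dick.txt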

Still, there's something about this that I think should strike any developer.
It seems we talk about reuse and composability a lot more than we see them
done this well, and when it is done this well it's not just elegant but kind
of shocking.

(And I bet the astute FP programmers see some familiar lessons at work
here...)

------
lukaslalinsky
Sometimes you have to admit that the problem you need to solve is hard. Hard
problems usually can't be solved by easy solutions.

Some examples:

> I have far more faith in xargs than I do in Hadoop.

Me too, but those things are very far from comparable. You can only compare
xargs to Hadoop if you have a Hadoop cluster with one node, and I'm really
not sure why anybody would use Hadoop like that.

> I trust syslog to handle asynchronous message recording far more than I
> trust a message queue service.

You mean you trust the protocol that sends messages over UDP and can silently
truncate or lose messages?

Standard Unix tools are nice and I always try to use them first, but for some
tasks, they are just not the right tools.

~~~
samcheng
I think the point is that most problems aren't actually hard enough to need
the more complicated tools.

------
bluedino
It's like the old programming interview: someone sits down and is asked to
write a program that sorts a file containing a list of 100 random numbers.

The programmer enters the UNIX command _sort -n numbers_

------
jlgaddis
Off-topic: every Taco Bell restaurant has a server running "Taco Bell Linux"
(or did, a few years ago, anyways -- I assume they still do).

~~~
golergka
Off-topic of off-topic: I tried to google it and came up with this instead:
[https://www.youtube.com/watch?v=FcAgIapM9HM](https://www.youtube.com/watch?v=FcAgIapM9HM)
and it was just too magnificent not to share.

~~~
mclovinit
Hilarious and strangely addicting! We need to know what is in that Taco Bell
Linux flavor.

~~~
jlgaddis
It was OpenSUSE flavored, if memory serves. :)

------
cdevs
I was going to write a program to generate static files from our horrible
WordPress site, and then I remembered: oh yeah, wget. I relate to this post
in that I'm a system admin who hopes the devs pick up some daily tricks, like
faster text manipulation at scale, but keep the system calls out of
production.
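
(For the curious, the wget incantation for that is roughly the following; the
URL is a placeholder:)

    # snapshot the site as static files, rewriting links to work locally
    wget --mirror --convert-links --page-requisites --no-parent https://blog.example.com/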

------
VLM
Hearing hooves out in the wilderness? Ah, obviously zebras escaped from the
zoo. What if all the world's zebras escaped continuously from all the zoos?
Better make it scalable. You'll never be able to reimplement it, so better
make it very generic... like a self-reproducing herd of artificially
intelligent superhuman androids. I've had to maintain systems that should
have been implemented with herds of artificially intelligent androids but
were not. Most importantly, this is gonna look awesome on my resume.

Or... those hooves could be a couple of horses. Use the standard low-effort
solution of stepping out of their way. Redirect the energy that would have
been wasted into something actually useful.

------
fffrad
If only we could get paid to find solutions, not write lines of code.

I was once asked to create a pie chart of the number of lines of code my team
wrote every week. I still haven't figured it out.

~~~
vmorgulis
diff + sloccount?
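
Or, if the code is in git, a rough cut (a sketch only, and "lines added per
author" is at best a proxy for anything):

    # lines added per author over the past week (binary files are skipped)
    git log --since='1 week ago' --numstat --pretty='@%an' |
      awk '/^@/ {a = substr($0, 2)} /^[0-9]/ {add[a] += $1}
           END {for (n in add) print n, add[n]}'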

~~~
scwoodal
They could figure it out, but there's no value, so why bother?

------
halayli
> The Taco Bell answer? xargs and wget. In the rare case that you saturate the
> network connection, add some split and rsync.

If you're going to download millions of webpages, you'll instantly saturate
your network I/O. Yes, you can split and rsync, but you'll lose proper error
reporting, the ability to systematically retry, dynamic scaling, machine
failure recovery, and many other things that a properly designed system would
provide.

It often depends on what you require and expect from your solution.

~~~
timonovici
Unless, of course, you read the wget documentation and realize most of that
stuff is available as flags: you can log errors, you can use GNU parallel,
and a bunch of other specialized tools.

It's still complexity in the end, but you just have to factor it all in. I
bet you that 90% of those "big data" problems can fit in a small server's
RAM. Most of the time it's just people making shit up so they'll have a job.

~~~
halayli
I've read the wget docs, and I am a cURL contributor as well; I know the
capabilities of each very well. But when you have 1M+ URLs to download and
you care about proper error handling at large scale, those tools will fall
short. Not all errors need to be handled in the same way: some need retries
and some don't, depending on the HTTP response code, for example.

Another problem you'll hit is how to make sure the machines stay saturated.
How many jobs should be running at any given time depends on what's being
downloaded and how much room you have to run additional downloads.

Again, it all depends on what you need from the system and how much leeway you
have.

------
alexchamberlain
I'm not sure I agree. They may work under good circumstances, but how do you
test the error cases? What about when you need to scale beyond a single node
or you need an "online mode"?

~~~
themckman
Well, of course, the "online mode" is an `nc -l` at the beginning of the
pipeline.

------
cayblood
Most crawlers of any reasonable size need to use anonymizing proxies,
throttling and shuffling of proxy IP addresses. All this starts to get
complicated in bash.

------
vmorgulis
Taco Bell programs as one-liners.

~~~
cube00
Hopefully the 80-character restriction applies.

------
jlgaddis
N.B.: [2010]

------
johansch
"Resume ability" is kind of useful when dealing with large data sets though.
This kind of scripting often fails in that regard.

~~~
tyingq
The tools he mentions support "resume ability".

[https://www.gnu.org/software/wget/manual/wget.html#Time_002d...](https://www.gnu.org/software/wget/manual/wget.html#Time_002dStamping)

~~~
johansch
Well, yes, for a single download. Not for crawling, which is a repeated
pattern of download, inspect for links, download some more, etc.

~~~
tyingq
Not sure I understand your point. Wget does recursive crawling and supports
timestamp comparison. Perhaps you are noting that it wouldn't prioritize new
pages over changed pages? Or that it doesn't support something like "don't
check for changes at all unless the download is more than X days old"?

Generally, yes, there is some point at which wget would fall short of a
purpose-built tool. I think that point, though, is farther out than you're
suggesting.

~~~
johansch
In this very specific example:

It doesn't keep the crawler state (which links have been visited, which links
are discovered but not yet visited) persistently.

(You made it this specific by picking this particular example; my more general
observation is that this is a common thing in command line/shell script
constructs. They remain simple only until you start to care about such
things.)

~~~
tyingq
>>It doesn't keep the crawler state

It does though. Just not in the way you're expecting.

The timestamp support would keep it from re-downloading anything already
downloaded (though it would do an HTTP HEAD request for the comparison).

Or, it also has a "no-clobber" feature that would keep it from even trying to
download.

Yes, both approaches are more limited than a specific state datastore, but
there is state.
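
Concretely (example.com stands in for the real site):

    # re-crawl, fetching only pages whose server copy is newer than local
    wget -r -N http://example.com/
    # re-crawl, skipping anything that already exists on disk
    wget -r -nc http://example.com/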

------
douchescript
A shell script that crawls the web? Surely doable, but never secure.

