
Mining the Web with Clojure - fogus
http://measuringmeasures.com/blog/2010/11/8/mining-the-web-with-clojure.html
======
peregrine
Another pretty good Clojure library for scraping is enlive, which is
normally a templating language based on selectors. But using the same model
you can easily scrape the web in an idiomatic way in Clojure.

<https://github.com/swannodette/enlive-tutorial> is probably one of the
best places to start learning.

------
SageRaven
This looks like something I might want to check out.

I've been writing modest little web bots/crawlers in shell script for quite a
few years now. I think it was a comment on HN just yesterday that prompted an
impulse purchase of Michael Schrenk's _Webbots, Spiders, and Screen Scrapers_
(ebooks at nostarch.com currently 1/2 the dead-tree version price). Even
though I knew the platform going in (PHP w/ the CURL lib), at every page I
can't help but gasp at what an awkward platform this is for the task. I'm only
1/3 of the way through, so I'm still hoping to learn some advanced techniques
that I haven't thought of yet. At the least I hope to walk away from the book
with some new ways to look at the problem space.

Anyway, as someone who's wanted to learn and dig into Lisp, this library
sounds really cool. I'm not a fan of anything which depends on Java, so maybe
the library can be ported to something more scriptable like ECL, NewLISP, or
SLisp. I will, however, install Clojure and give it a spin.

------
mark_l_watson
Looks useful for scraping the web and shows a now-common pattern: start with a
Java library, then wrap it with Clojure to make the APIs nice to use.

~~~
djacobs
Even though every Clojure book out there says something like "We're telling
you, wrapping Java libraries in Clojure functions isn't _idiomatic_!", I think
this is the way Clojure will go. (And good for it.)

Why?

1\. No one likes calling Java in an imperative or OO manner from Clojure
("easy" as it may be -- it really just isn't Lisp that way).

2\. A lot of people who are now interested in Clojure once coded using Ruby.
We Rubyists wrap _everything_.

~~~
bradfordcross
For sure, guys, you are exactly right.

However, a couple things to add based on the fact that Java libs usually stink
pretty bad, and there are usually 10 for a given task.

1) One step is to find the 1 in 10 libs that is remotely close to being sane.
This often takes a bit of time, trying to figure out which project is
strongest from an engineering perspective, has mavenized its jars, has some
tests, and has an API that is close to being usable, etc.

2) We've had a number of java libs crap out on us after some time. There are
10 libs for everything, and none are complete - that is the general rule. So
there is this pattern now of wrapping a java lib, evolving the wrapper layer,
and then replacing the underlying lib once you find too many limitations. We've
done this with libs like Rome and even Lucene and we're about to do it again
for our linear algebra (but will try using Mahout first).

~~~
bokchoi
Lucene crapped out? Can you elaborate? We're using it in a project since it
seems like a very popular library for indexing.

------
wslh
Don't know how that compares to using htmlunit or lxml.html. I've found that a
lot of HTML parsers have issues with malformed HTML. And htmlunit is one of the
best for scraping javascript/ajax sites...

