Mining the Web with Clojure (measuringmeasures.com)
95 points by fogus on Nov 8, 2010 | 20 comments



Another pretty good Clojure library for scraping is Enlive, which is normally a templating library based on selectors. Using the same selector model, you can easily scrape the web in an idiomatic way in Clojure.

https://github.com/swannodette/enlive-tutorial is probably one of the best places to start learning.


This looks like something I might want to check out.

I've been writing modest little web bots/crawlers in shell script for quite a few years now. I think it was a comment on HN just yesterday that prompted an impulse purchase of Michael Schrenk's Webbots, Spiders, and Screen Scrapers (ebooks at nostarch.com currently 1/2 the dead-tree version price). Even though I knew the platform going in (PHP w/ the CURL lib), at every page I can't help but gasp at what an awkward platform this is for the task. I'm only 1/3 of the way through, so I'm still hoping to learn some advanced techniques that I haven't thought of yet. At the least I hope to walk away from the book with some new ways to look at the problem space.

Anyway, as someone who's wanted to learn and dig into Lisp, this library sounds really cool. I'm not a fan of anything that depends on Java, so maybe the library can be ported to something more scriptable like ECL, NewLISP, or SLisp. I will, however, install Clojure and give it a spin.


Looks useful for scraping the web and shows a now common pattern: start with a Java library, wrap with Clojure making the APIs nice to use.
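
A minimal sketch of that pattern, assuming nothing beyond the JDK. The function name parse-uri and the map keys are made up for illustration and are not taken from the linked article; the only Java API used is java.net.URI.

```clojure
(ns example.wrap
  (:import (java.net URI)))

(defn parse-uri
  "Hypothetical wrapper: hides the java.net.URI getters behind a
  function that returns a plain Clojure map."
  [s]
  (let [u (URI. s)]
    {:scheme (.getScheme u)
     :host   (.getHost u)
     :port   (.getPort u)   ; -1 when the URI has no explicit port
     :path   (.getPath u)
     :query  (.getQuery u)}))

;; Callers now work with ordinary Clojure data.
(:host (parse-uri "http://example.com:8080/a/b?x=1"))
```

The point of returning a map is that the rest of the code can use get, destructuring, and the seq functions without ever touching the Java object.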


Otherwise known as "wrap the crap."

http://www.datawrangling.com/how-flightcaster-squeezes-predi...

"Building layer upon layer of abstraction is a big key. On the jvm, you have to do this, it is the path around the verbosity of Java and the vast abyss of poorly done APIs. You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like Clojure. The technical term for this is 'wrap the crap.'"


A word of caution -- this works until it doesn't. For example, if you have to optimize.


Even though every Clojure book out there says something like "We're telling you, wrapping Java libraries in Clojure functions isn't idiomatic!", I think this is the way Clojure will go. (And good for it.)

Why?

1. No one likes calling Java in an imperative or OO manner from Clojure ("easy" as it may be -- it really just isn't Lisp that way).

2. A lot of people who are now interested in Clojure once coded using Ruby. We Rubyists wrap everything.


For sure, guys, you are exactly right.

However, a couple things to add based on the fact that Java libs usually stink pretty bad, and there are usually 10 for a given task.

1) One step is to find the 1 in 10 libs that is remotely close to being sane. This often takes a bit of time: figuring out which project is strongest from an engineering perspective, has mavenized its jars, has some tests, has an API that is close to being usable, etc.

2) We've had a number of Java libs crap out on us after some time. There are 10 libs for everything, and none are complete; that is the general rule. So there is now this pattern of wrapping a Java lib, evolving the wrapper layer, and then replacing the underlying lib once you find too many limitations. We've done this with libs like Rome and even Lucene, and we're about to do it again for our linear algebra (but will try using Mahout first).
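
A rough sketch of the wrap-then-replace idea, using two JDK facilities as stand-ins for "the old lib" and "the new lib". All of the names here (words, words-via-tokenizer, words-via-regex) are hypothetical and have nothing to do with Rome, Lucene, or Mahout.

```clojure
(ns example.words
  (:import (java.util StringTokenizer)))

;; First backing implementation: the JDK's StringTokenizer,
;; which splits on whitespace by default.
(defn- words-via-tokenizer [^String s]
  (let [t (StringTokenizer. s)]
    (loop [acc []]
      (if (.hasMoreTokens t)
        (recur (conj acc (.nextToken t)))
        acc))))

;; Replacement backing implementation: a plain regex.
(defn- words-via-regex [s]
  (vec (re-seq #"\S+" s)))

;; The wrapper is the only name callers use, so swapping the
;; underlying implementation means changing this one definition.
(def words words-via-regex)

(words "wrap the  crap")
```

As long as callers only depend on words returning a vector of strings, the implementation underneath can be replaced without touching them.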


Lucene crapped out? Can you elaborate? We're using it in a project since it seems like a very popular library for indexing.


  Even though every Clojure book out there says something like "We're telling you, wrapping Java libraries in Clojure functions isn't idiomatic!"

I know one Clojure book that doesn't say that. ;-)


Okay, fair point. I am only halfway through Joy of Clojure, so I semi-retract the "every" in my claim.


This makes portability with Clojure-CLR a real problem. I don't think Clojure-CLR has much in the way of adoption right now, but it would be nice to use any Clojure library in any Clojure runtime, like you can with Ruby and all their runtimes.


Given Clojure's emphasis on Java interop, I don't think full portability was ever the motivating factor behind Clojure.


What do the motivating factors behind Clojure have to do with anything?

The CLR port appears to be fairly active (with regard to the amount of work going into it), and most of the libraries that are just shells over Java libraries can't be used there.


  What do the motivating factors behind Clojure have to do with anything?
Motivating factors dictate current practices. If every piece of Clojure PR says things like "harness all the power of Java from a language that uses Lisp semantics/syntax", of course library authors will be comfortable doing so.

Part of the reason Clojure is great is that we don't have to re-implement all the functionality that Java libraries have given us in order to have a viable language. That benefit goes away if we start saying "stay away from Java".


I don't know of any Clojure book that says that. Did you have a specific book in mind?


I have a digital copy of Programming Clojure with me, so I can give direct evidence for that now. On page 30:

  Because the Java invocation syntax in Clojure is clean 
  and simple, it is idiomatic to use Java directly, 
  rather than to hide Java behind Lispy wrappers.
I believe Pragmatic Clojure says something to that effect, too, but I'll have to look it up in my dead-tree version later.
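
For anyone who hasn't seen it, here is roughly what "using Java directly" looks like next to a Lispy wrapper. This is my own sketch, not an example from the book, and upper-case* is a made-up name.

```clojure
;; Direct interop: instance method, static method, constructor.
(.toUpperCase "clojure")                            ; => "CLOJURE"
(Math/abs -3)                                       ; => 3
(.getHost (java.net.URI. "http://example.com/a"))   ; => "example.com"

;; The same instance-method call hidden behind a wrapper function.
;; The ^String hint only avoids reflection; it does not change behavior.
(defn upper-case* [s]
  (.toUpperCase ^String s))

(map upper-case* ["java" "clojure"])
```

One practical difference: the wrapper is a first-class function, so it can be passed to map directly, while the raw method call has to be wrapped in an anonymous function first.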


Newcomers to Clojure may dislike the intrusion of Java class and method names into their Clojure code, and rush to wrap every Java method call in a Clojure function. More experienced Clojure programmers appreciate the power offered by Java libraries and are comfortable mixing Java methods and Clojure functions.


Thanks, I knew I'd read it in there.


It is idiomatic... sometimes. But if you want a full API, nobody wants to work with Java all the time.


I don't know how that compares to using HtmlUnit and lxml.html. I've found that a lot of HTML parsers have issues with malformed HTML, and HtmlUnit is one of the best for scraping JavaScript/Ajax sites.



