

Ask HN: Interested in a centralized large-scale crawl architecture? - jarsj

I have been using Nutch for a while and find it quite low-level. The batch-based indexing system is good but doesn't work well in many scenarios.

I am talking about a single-point, auto-updated crawl database. The distributed database will support APIs to inject/remove URLs, fetch content for a URL, register callbacks for when a crawl completes, etc.
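A minimal sketch of the API shape the post describes, as an in-memory stand-in for the distributed database. All names here (`CrawlDB`, `inject`, `on_crawled`, `_complete_crawl`) are hypothetical, not from any existing system:

```python
from typing import Callable, Dict, List, Optional

class CrawlDB:
    """In-memory stand-in for the single-point crawl database."""

    def __init__(self) -> None:
        self._content: Dict[str, str] = {}   # url -> last crawled body
        self._pending: List[str] = []        # urls queued for crawling
        self._callbacks: List[Callable[[str, str], None]] = []

    def inject(self, url: str) -> None:
        """Queue a URL for crawling (idempotent)."""
        if url not in self._content and url not in self._pending:
            self._pending.append(url)

    def remove(self, url: str) -> None:
        """Drop a URL from both the queue and the content store."""
        self._content.pop(url, None)
        if url in self._pending:
            self._pending.remove(url)

    def fetch(self, url: str) -> Optional[str]:
        """Return the last crawled content for a URL, or None."""
        return self._content.get(url)

    def on_crawled(self, cb: Callable[[str, str], None]) -> None:
        """Register a callback fired when a crawl of any URL completes."""
        self._callbacks.append(cb)

    def _complete_crawl(self, url: str, body: str) -> None:
        """Simulate the crawler finishing a fetch and notifying listeners."""
        if url in self._pending:
            self._pending.remove(url)
        self._content[url] = body
        for cb in self._callbacks:
            cb(url, body)
```

In a real distributed version, `inject`/`remove`/`fetch` would be RPC or HTTP endpoints and `on_crawled` a webhook registration, but the contract would be the same.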
======
mathgladiator
I am, so I wrote one. I plan to open it up as a service.

<http://www.wheelbarro.ws/>

I'm building out the final components, but the extraction engine is
basically done. I convert the HTML into JSON and pass it off to your
script, which can then extract the data using jEX (my own jQuery-style
library).

Once I get my stuff figured out, I'll probably add jsdom support with jQuery.
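The HTML-to-JSON step above can be sketched with the standard-library parser. This is a guess at the general technique, not wheelbarro.ws's actual representation; the `tag`/`attrs`/`children` key names are assumptions, and it expects reasonably well-formed HTML (void elements aren't special-cased):

```python
from html.parser import HTMLParser

class HtmlToJson(HTMLParser):
    """Convert an HTML fragment into a nested dict/list tree."""

    def __init__(self) -> None:
        super().__init__()
        self.root = {"tag": "root", "attrs": {}, "children": []}
        self._stack = [self.root]  # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1]["children"].append({"text": text})

def html_to_json(html: str) -> dict:
    parser = HtmlToJson()
    parser.feed(html)
    return parser.root
```

A user script would then walk (or query) the resulting tree instead of raw markup, which is what a jQuery-style selector layer like jEX would sit on top of.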

------
il
Yes, definitely, especially if you 1. can scale up and down quickly, like
cloud hosting, and 2. make extracting text from a large set of pages as
easy as writing a regular expression.

Shoot me an email, I would love to talk about this further.

