And here's his list of resources: http://tomazkovacic.com/blog/56/list-of-resources-article-te...
For reference, this post originally linked to http://www.readwriteweb.com/hack/2011/03/text-extraction.php
I do text extraction for lynk.ly. First we clean the HTML for the page, then remove a bunch of tags we consider not useful, like script or embed.
Then we look at all the opening tags and check whether they contain specific text like "hide" or "display:none". If they do, we skip the tag.
Finally, we take the text from a tag if its length is above a specific threshold.
This seems to have worked best for us.
To see it in action, check it out at lynk.ly
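A minimal sketch of that kind of pipeline, assuming BeautifulSoup; the tag list, the "hidden" markers, and the 80-character threshold are made-up placeholders, not lynk.ly's actual values:

    from bs4 import BeautifulSoup

    STRIP_TAGS = ["script", "style", "embed", "iframe", "noscript"]  # tags treated as not useful
    HIDDEN_MARKERS = ("hide", "display:none")  # attribute text suggesting hidden content
    MIN_TEXT_LEN = 80  # length threshold below which a tag's text is ignored

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")

        # Clean the page and drop tags considered not useful.
        for tag in soup.find_all(STRIP_TAGS):
            tag.extract()

        # Skip tags whose attributes contain "hide"/"display:none"-style markers.
        hidden = []
        for tag in soup.find_all(True):
            attr_text = " ".join(
                " ".join(v) if isinstance(v, list) else str(v)
                for v in tag.attrs.values()
            ).lower()
            if any(marker in attr_text for marker in HIDDEN_MARKERS):
                hidden.append(tag)
        for tag in hidden:
            tag.extract()

        # Keep a tag's own text only if it is above the length threshold.
        chunks = []
        for tag in soup.find_all(True):
            own_text = " ".join(
                s.strip() for s in tag.find_all(string=True, recursive=False)
            ).strip()
            if len(own_text) >= MIN_TEXT_LEN:
                chunks.append(own_text)
        return "\n\n".join(chunks)

    # Usage: print(extract_text(open("article.html").read()))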
I have since started a completely different (and profitable) company, and the code has just been bit-rotting. I'm not sure what I should do with it. Keep it around in case I ever decide to pursue a business model like some of these companies (I probably don't have time for that)? Open source it (cleaning up the code is time-consuming, and what do I stand to gain from that)? I guess I could use the open-sourced stuff to help me find freelance contracts, but I just don't see a lot of NLP remote work being offered.
Still, I hate seeing the code rot...
What is it written in?
News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.
If your code works well, I also think you should put it up on GitHub. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....
Decruft also has a couple of bug fixes for python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
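For what it's worth, the readability-lxml fork on PyPI exposes a small Document API; something like the following works there (the forks discussed here may differ, and the URL is just a placeholder):

    # pip install readability-lxml  (one maintained fork of python-readability)
    import requests
    from readability import Document

    html = requests.get("http://www.example.com/some-article.html").text  # placeholder URL
    doc = Document(html)

    print(doc.short_title())  # extracted article title
    print(doc.summary())      # main article body, as cleaned-up HTML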
I already have a set of a few thousand classifications at hand. What would be a good algorithm to run it through? I assume I'd use something like webstemmer/boilerpipe/... to extract just the main text first.
What I'm a bit uncertain about is what to do after that. My guess would be to isolate the nouns/adjectives with the highest frequency and do clustering, using my already-categorized dataset as training data.
Does that roughly make sense?
If yes: any recommendations or alternatives for libraries (preferably Ruby), or just the algorithms themselves (k-means, SVM, neural networks...)?
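Since the few thousand examples are already labeled, this is closer to supervised classification than clustering. A rough sketch of the common TF-IDF + linear SVM route in Python/scikit-learn (the question mentions Ruby, so treat this as an illustration of the pipeline shape rather than a library recommendation; the toy texts and labels are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    # Toy stand-ins: in practice, `texts` would be the main text pulled out by
    # webstemmer/boilerpipe/etc., and `labels` the existing few thousand categories.
    texts = [
        "quarterly earnings beat expectations as revenue grew",
        "the striker scored twice in the second half",
        "shares fell after the company cut its guidance",
        "the team advanced to the final after a penalty shootout",
        "the central bank raised interest rates again",
        "the coach praised the defense after the win",
    ] * 2
    labels = ["business", "sports", "business", "sports", "business", "sports"] * 2

    # TF-IDF features feeding a linear SVM: no part-of-speech filtering is needed
    # up front, since the classifier weights the informative terms itself.
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
        LinearSVC(),
    )

    # Cross-validation gives a rough idea of how predictable the categories are.
    print(cross_val_score(clf, texts, labels, cv=3).mean())

    clf.fit(texts, labels)
    print(clf.predict(["the index rallied after strong earnings reports"]))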
To do "template drop out", it would seem productive to look longitudinally across pages on a single site, or in a subdirectory. For instance, almost all pages in Hacker News have the same chrome. Methods used for DNA clustering (such as Hidden Markov Models) could quickly find 'conserved' and 'unconserved' areas of documents.
This touches semantic technology because it links the ability to find nameless statistical patterns with meaningful semantic identifiers, such as domain names.
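Not an HMM, but a crude frequency-based sketch of the same 'conserved region' idea: text blocks that recur across many pages from the same site get treated as template chrome and dropped. The line-based block splitting and the 30% cutoff here are arbitrary assumptions:

    from collections import Counter

    def text_blocks(page_text):
        # Very naive block splitter: one block per non-empty line of visible text.
        # A real implementation would segment on the DOM instead.
        return [line.strip() for line in page_text.splitlines() if line.strip()]

    def drop_template(pages, conserved_ratio=0.3):
        """pages: visible text of several pages from the same site or subdirectory.
        Blocks recurring on more than `conserved_ratio` of the pages are treated
        as 'conserved' template chrome and removed; the rest is kept as content."""
        counts = Counter()
        for page in pages:
            counts.update(set(text_blocks(page)))

        cutoff = conserved_ratio * len(pages)
        conserved = {block for block, n in counts.items() if n > cutoff}

        return [
            "\n".join(b for b in text_blocks(page) if b not in conserved)
            for page in pages
        ]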
Looking at pages in isolation is much more horizontally scalable. You can take a look at Webstemmer (http://www.unixuser.org/~euske/python/webstemmer/index.html) for a method exploiting similarities.
Your argument is like Chomsky's argument about the poverty of the stimulus, just in reverse. There are heuristics that let us radically prune the N^2 possible relationships between things into a much smaller set, letting us do things that would otherwise be unscalable.
I can imagine the cost/benefit of the approach is favorable for the largest search engines, like Google and Bing, which are trying to squeeze the last few percentage points of precision out of their results.
For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.
Rather, I'm selective with my inputs, so I start with unscrambled eggs; that way I don't just improve precision by "a few percentage points" but reduce the false positive rate by an order of magnitude.
My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (at web scale it's too big a game for small players, plus search as we know it probably can't get much better because the queries are not precise; better performance will require changing the game), but I've been forced to put effort into it because end users expect it.
I've spent the last month writing scrapers for newspaper sites. No one uses any of these things.