Readability has a LOT of hand-tuned heuristics for figuring out the most likely content of the page, but the primary indicator of whether a tag with text in it is part of an article is the number of commas in the tag. It's my favorite thing about the algorithm because it's a dumb idea that works. The comma rule gets the extraction correct on about 70% of the web; the rest of the heuristics are mostly there to cover the screwy ways people structure their articles.
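To make the comma idea concrete, here is a minimal sketch. It is not Readability's actual code: the `ParagraphCollector` and `rank_by_commas` names are made up for illustration, and it assumes the page uses plain, properly closed `<p>` tags.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text inside each <p> element.

    Hypothetical helper; nested or unclosed <p> tags are not handled.
    """

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def rank_by_commas(html):
    """Return (comma_count, text) pairs, highest comma count first."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    scored = [(text.count(","), text) for text in collector.paragraphs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


SAMPLE = """
<p>Home | About | Contact</p>
<p>The parser walks the tree, scores each block, and keeps the best one,
which is usually the article body.</p>
<p>Copyright 2010</p>
"""

if __name__ == "__main__":
    for commas, text in rank_by_commas(SAMPLE):
        print(commas, text[:40])
```

Navigation and footer text tends to have no commas at all, so even this crude score pushes the prose paragraph to the top.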
I've been dabbling in content scraping. What bugs me is that with all the AJAX trickery going on, merely analyzing the XHTML source doesn't get you very far in many cases. Executing the page (JS, DOM and all) via browser programming is an option, but of course quite expensive. A headless browser is what's needed!
Yeah, I think that is the challenge. A good way to get around the AJAX problem is to see if a site has an RSS feed and use that to extract content. I wish sites had a URL for bots built in so you didn't have to do all this fancy stuff to extract the content.
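A rough sketch of the RSS route, assuming a plain RSS 2.0 feed with `title`, `link`, and `description` children on each `item`. The feed below is made-up sample data, `extract_items` is a hypothetical helper, and real feeds often add namespaces or only carry a summary.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one would be fetched from the site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Text of the second post.</description>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

No JavaScript has to run for this, which is the whole appeal, but it only helps when the feed carries the full text.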
I studied a fair amount of NLP (a true passion of mine) at school and after I graduated I spent several months working on tech which did this (and other things). That was intended to be a startup, but sadly, at the time, my business sense sucked and I couldn't decide on a good product to fit the tech to (the fact that I was developing tech before I had a strong sense of my product is already telling).
I have since started a completely different (and profitable) company and the code has just been bit-rotting. I'm not sure what I should do with it. Keep it around in case I ever decide to do a business model like some of these companies (I probably don't have time for that)? Open source it (time-consuming to clean the code and what do I stand to gain from that)? I guess I could use the open-sourced stuff to help me find contracts for freelancing, but I just don't see a lot of NLP remote work being offered.
News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.
I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is Readability, so I will probably be going with the Python port of that.
When I started using the internet in 1992 Usenet (which BTW was almost always referred to as netnews or just plain "news" before Time magazine, etc, used their influence as explainers of the internet to the general public to change the name) was the social heart of the internet the way the web is now, and you did not need algorithms to extract the text from Usenet because the text was all dead-simple plain text files.
I always wanted to generate a simple service that classifies websites. Something where you dump in the HTML/URL and it returns something like "agriculture"/"government"/"retail"/"education".
I already have a set of a few thousand classifications at hand. What would probably be a good algorithm to run it through? I assume I'd use something like webstemmer/boilerpipe/... to extract just the main text first.
What I am a bit uncertain about is what I should do after that. My guess would be that I isolate the nouns/adjectives with the highest frequency and do a clustering with my already categorized dataset as training data.
Does that somewhat make sense?
If yes: any recommendations or alternatives for libraries (preferably Ruby) or just algorithms themselves (k-means, SVM, neural network...)?
For simple text classification, you should just use something like Weka, rainbow or the Google Prediction API. The hardest part is always labeling and verifying your category training data. There are many open source algorithms that can do the classification, and any form of naive Bayes will probably be good enough.
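For a sense of how little code "any form of naive Bayes" needs, here is a minimal multinomial naive Bayes sketch with add-one smoothing. The `NaiveBayes` class, the tokenizer, and the tiny training set are all made up for illustration; in practice you would feed it the extracted main text of each labeled page.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Hypothetical minimal multinomial naive Bayes classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = (sum(self.word_counts[label].values())
                           + len(self.vocabulary))
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples standing in for labeled page text.
TRAINING = [
    ("farm crops harvest tractor soil", "agriculture"),
    ("wheat harvest and cattle farm prices", "agriculture"),
    ("shop discount sale checkout cart", "retail"),
    ("online store sale with free shipping", "retail"),
]

classifier = NaiveBayes()
for text, label in TRAINING:
    classifier.train(text, label)

if __name__ == "__main__":
    print(classifier.classify("harvest season on the farm"))
    print(classifier.classify("discount sale in our store"))
```

With a few thousand labeled sites this is a reasonable baseline to beat before reaching for SVMs or neural networks.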
Unless I'm missing something, all the methods he's talking about involve looking at web pages in isolation, or, alternatively across the set of all web pages.
To do "template drop out", it would seem productive to look longitudinally across pages on a single site, or in a subdirectory. For instance, almost all pages in Hacker News have the same chrome. Methods used for DNA clustering (such as Hidden Markov Models) could quickly find 'conserved' and 'unconserved' areas of documents.
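A much simpler stand-in for the HMM idea shows the principle: count how many pages of one site each line appears on, and treat lines seen on most pages as conserved chrome. This is plain frequency counting, not a Hidden Markov Model, and the function names, the 0.8 threshold, and the sample pages are assumptions for illustration.

```python
from collections import Counter


def find_template_lines(pages, min_fraction=0.8):
    """Return lines that occur on at least min_fraction of the pages."""
    page_count = len(pages)
    seen_on = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        lines = {line.strip() for line in page.splitlines() if line.strip()}
        seen_on.update(lines)
    return {line for line, n in seen_on.items()
            if n / page_count >= min_fraction}


def drop_template(page, template_lines):
    """Return the page's non-empty lines that are not template lines."""
    kept = []
    for line in page.splitlines():
        line = line.strip()
        if line and line not in template_lines:
            kept.append(line)
    return kept


# Made-up text renderings of three pages from the same site.
PAGES = [
    "Hacker News | new | comments\nReadability uses comma counts.\nGuidelines | FAQ",
    "Hacker News | new | comments\nHeadless browsers are expensive.\nGuidelines | FAQ",
    "Hacker News | new | comments\nRSS feeds sidestep AJAX.\nGuidelines | FAQ",
]

if __name__ == "__main__":
    template = find_template_lines(PAGES)
    for page in PAGES:
        print(drop_template(page, template))
```

The same counting could be done per subdirectory instead of per site when different sections use different templates.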
This touches semantic technology because it links the ability to find nameless statistical patterns with meaningful semantic identifiers, such as domain names.
Methods that do clustering on similar web pages are mostly too CPU intensive for processing larger sets (we're talking millions of web pages). They are also harder to scale from a data-locality perspective: you need to figure out which pages to put together and then get the data together.
Great reply. However, I think something is only worth doing if it's impossible.
Your argument is like Chomsky's argument about the poverty of the stimulus, just in reverse. There are heuristics that let us radically prune the N^2 possible relationships between things into a much smaller set that will let us do things that would be otherwise unscalable.
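One example of such a pruning heuristic, following the earlier suggestion to compare pages within a single site: only pair up URLs that share a hostname. The sketch below is illustrative only; `candidate_pairs` and the URL list are made up.

```python
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def candidate_pairs(urls):
    """Return only the URL pairs that share a hostname."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    pairs = []
    for group in by_host.values():
        pairs.extend(combinations(group, 2))
    return pairs


# Made-up URLs: three on one host, two on another, one on its own.
URLS = [
    "http://news.ycombinator.com/item?id=1",
    "http://news.ycombinator.com/item?id=2",
    "http://news.ycombinator.com/newest",
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/only",
]

if __name__ == "__main__":
    all_pairs = list(combinations(URLS, 2))
    print(len(all_pairs), "pairs without pruning")
    print(len(candidate_pairs(URLS)), "pairs after grouping by host")
```

Grouping by host also answers the data-locality worry to some extent, since each group can be processed on one machine.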
Let me know if you know of this approach being used somewhere in production processing millions of web pages. I would be very interested to know how they overcome the difficulties!
I can imagine the cost/benefit of the approach is favorable for the largest search engines like Google and Bing that are trying to squeeze the last few percentage points of precision out of results.
For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.
Google and Bing are doing billions of web pages, not millions. I process millions of web pages myself with 3 computers -- millions aren't a lot these days, although I'm not currently using clustering methods.
Rather, I'm selective with my inputs: I start with unscrambled eggs, so I can improve precision not by "a few percentage points" but by reducing the false positive rate by an order of magnitude.
My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (at web scale it's too big of a game for small players, plus search as we know it probably can't get much better because the queries are not precise; better performance will require changing the game), but I've been forced to put effort into it because end users expect it.
I have been thinking there should be a way in HTML to mark the main part of a web page, as opposed to the header, navigation, footer, etc. It could be used when printing, and by screen readers, search engines, and Readability. I don't know if the W3C would approve such a tag, or if site owners would bother to use it, though.