Using Machine Learning to Understand Page Templates (bloomreach.com)
83 points by nwinkels on July 6, 2018 | 13 comments

When I saw this title I assumed it was about using ML to build dynamic scrapers that self-configure to find where the changing data elements are. Does that exist anywhere?

I think this is what Diffbot does!

At a very high level, it's similar. We use computer vision and ML to extract structured data from any web page, even ones we haven't seen before. https://www.diffbot.com/

If anyone has any questions or wants to try it out, feel free to email me directly at dru@diffbot.com

Interesting read. I liked the part about identifying rare templates. Plus, if I ever plan to do ML on HTML, the article gave me a good place to start on feature extraction.

Might have some uses in identifying phishing URLs, etc.
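
On the feature extraction point, here is a rough sketch of one possible starting place. It is not the article's method, just an assumption about what simple structural features could look like: count root-to-element tag paths with Python's standard html.parser, then compare two pages by Jaccard similarity over those paths. The names TagPathCounter, tag_path_features and jaccard are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts root-to-element tag paths such as 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_path_features(html):
    """Return a Counter of tag paths for one HTML document."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard(a, b):
    """Jaccard similarity between the sets of tag paths of two pages."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0


# Hypothetical toy pages: two share a template, one does not.
product_a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
product_b = "<html><body><div><h1>B</h1><p>y</p></div></body></html>"
article = "<html><body><article><h2>T</h2></article></body></html>"

same_template = jaccard(tag_path_features(product_a), tag_path_features(product_b))
different_template = jaccard(tag_path_features(product_a), tag_path_features(article))
```

Pages built from the same template should score close to 1.0 and unrelated layouts much lower, so a similarity like this could in principle feed a clustering step. Real pages would likely need more than this (class attributes, depth limits, some de-noising).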

The title should be "Clustering HTML pages". It's not a terribly interesting application. The only thing new I got from it was a de-noising technique.

Not sure why you're getting down-voted. You're right, the author didn't close the loop. Typically we would expect to see an insight after you apply the ML technique.

So they extracted features, clustered the pages and found... what?

I am sure they learned something but it might be proprietary.

What I'd really like to see is a presentation of an algorithm that automatically recognizes and hides the first dismissive HN comment that inevitably appears. Any takers?

Recognize, hide, and automatically post to /iamverysmart.

You can do it yourself, if you're so inclined. Out-of-the-box sentiment analysis is ~90% accurate. Feel free to provide training data for it.

Why would you want to have your comment hidden?

On a more serious note, I would not mind a bit more of a critical and less clickbaity attitude in tech. Not every if statement or regexp is ML and AI, and not everything requires blockchain.


most of the images won't load for me.

and the title of the browser tab is flashing at me...

