
Using Machine Learning to Understand Page Templates - nwinkels
https://www.bloomreach.com/en/blog/2018/07/using-machine-learning-to-learn-page-templates.html
======
philipodonnell
When I saw this title I assumed it was about using ML to build dynamic
scrapers that self-configure based on where the changing data elements are.
Does that exist anywhere?

~~~
ksahin
I think this is what Diffbot does!

~~~
dwynings
At a very high level, it's similar. We use computer vision and ML to extract
structured data from any web page, even ones we haven't seen before.
[https://www.diffbot.com/](https://www.diffbot.com/)

If anyone has any questions or wants to try it out, feel free to email me
directly at dru@diffbot.com

------
autokad
Interesting read. I liked the part about identifying rare templates. Plus, if
I ever plan to do ML on HTML, the article gave me a good place to start on
feature extraction.

It might have some uses in identifying phishing URLs, etc.

------
abadon
The title should be "Clustering HTML pages". It's not a terribly interesting
application. The only thing new I got from it was a de-noising technique.

~~~
pagnol
What I'd really like to see is a presentation of an algorithm that
automatically recognizes and hides the first dismissive HN comment that
inevitably appears. Any takers?

~~~
goostavos
Recognize, hide, and automatically post to /r/iamverysmart.

------
edhu2017
most of the images won't load for me.

~~~
coding123
and the title of the browser tab is flashing at me...

