Ask HN: What info would you web scrape in 2020?

PaulHoule · on Jan 15, 2020

For 2020 I'd like to get the web browser out of my life as much as possible. That is motivated by this work I've done:

https://ontology2.com/essays/HackerNewsForHackers/ https://ontology2.com/essays/ClassifyingHackerNewsArticles/

I'd like to crawl a large number of sites that have quality articles, for instance

https://voxeu.org/ https://www.anandtech.com/

and put them through a workflow where I never see an article more than once, things get classified, etc.
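A minimal sketch of the "never see an article more than once" step, assuming articles are identified by URL. The `normalize_url` rules, the `SeenStore` class, and the example.com URLs are hypothetical illustrations, not part of the software discussed here; a real crawler would likely persist the set to disk and might also hash article text to catch reposts under a different URL.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenStore:
    """Remembers which articles were already offered (in memory only)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def unseen(urls, store):
    """Keep only the URLs that have not been offered before."""
    return [u for u in urls if store.is_new(u)]


if __name__ == "__main__":
    store = SeenStore()
    crawl = [
        "https://example.com/articles/one",
        "https://EXAMPLE.com/articles/one/#comments",  # same article again
        "https://example.com/articles/two",
    ]
    for url in unseen(crawl, store):
        print(url)
```

Hashing the normalized URL keeps the stored keys a fixed size, which matters more once the set moves from memory into a database.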

One major issue I have is ads. In 2020 it is not just a matter of ads getting in the way of content, but rather ads getting in the way of ads. That voxeu site doesn't have ads, but it does abuse JavaScript in such a way that the back button doesn't work properly.

The web is breaking down to the extent that I'd really like to filter the junk out and have an order-of-magnitude better interface.

tiburon · on Jan 15, 2020

@PaulHoule Thanks for sharing your research paper and classifier; it is interesting. I have a question, though, that came up while going through the top 200 articles picked by your algorithm: wouldn't the classifier be more effective if it had the content behind all those shared links instead of just the titles or some metadata? That is, if the web pages were crawled and scraped to feed the classifier in real time, wouldn't that bring more accurate results?

I totally agree about ads getting in the way of ads. What solution do you think could work at scale for that?

tiburon · on Jan 15, 2020

@PaulHoule I've seen in the conclusion of your research that you point toward classifying the content of the webpages behind the links, so I guess you are working on it. I think the classifier will improve greatly if it has more content to analyse.

PaulHoule · on Jan 16, 2020

Check my profile and send me an email and I'd be glad to talk more.

Here is the progress I've made since then.

After I did that project I spent a year working on text analysis tools for somebody else. Then I was looking for a new job, and I made a new version of that software to scrape thousands of job listings and do a similar classification based on the whole text of each listing, which is usually a few paragraphs.

That software has a much better user interface than the old software for adding labels and it's designed to handle "workflow" tasks that have some human and some automated elements.

If I do more work in this area I will probably build on that code. Personally I think the framework for getting training data and putting the model to work is more important than the model itself. (That said, with a good document embedding I think you could get good results with less training data.)
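To make the "some human and some automated elements" idea concrete, here is a rough sketch of one way such a workflow could be wired up. It is not the software described above: the naive Bayes model, the `route` function, the 0.8 threshold, and the sample listings are all assumptions chosen for illustration. Confident predictions are labeled automatically, and everything else goes to a person, whose labels become new training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def add_example(self, text, label):
        """Record one human-labeled document."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict_proba(self, text):
        """Return {label: probability}; needs at least one labeled example."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Subtract the maximum before exponentiating to avoid underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(weights.values())
        return {k: w / norm for k, w in weights.items()}


def route(model, text, threshold=0.8):
    """Auto-label confident predictions; send the rest to a human queue."""
    probs = model.predict_proba(text)
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return ("auto", label) if p >= threshold else ("human", label)


if __name__ == "__main__":
    model = NaiveBayes()
    # Hypothetical hand-labeled job listings.
    model.add_example("python backend engineer django", "interesting")
    model.add_example("senior python data engineer", "interesting")
    model.add_example("retail sales associate store", "skip")
    model.add_example("sales manager retail store", "skip")
    for listing in ["python engineer", "store engineer"]:
        print(listing, "->", route(model, listing))
```

The model here is deliberately the simplest part. Swapping in a better one (for example, a classifier over document embeddings) would only change `predict_proba`; the routing and labeling loop would stay the same.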

kristianp · on Jan 15, 2020

I'm toying with scraping a certain type of product listing and running classifiers to help users find the product they want.

It sounds like you're building a scraping service and looking for clients.

tiburon · on Jan 15, 2020

@kristianp Thanks for the input. I am simply looking for ideas; my initiative is clear about that. Can you give a real-world example of how a user would benefit from a product classifier? Haven't those been on the market for a long time?