
Ask HN: What info would you web scrape in 2020? - tiburon
I've been web scraping for a while and I am running out of ideas. I use an anonymous crawling service that provides HTML content at scale. It also has a set of predefined scrapers, which I use instead of maintaining my own, and that speeds up any scraping I do.

I need new ideas to build that are unique and useful to people. I build services based on scraping that can be used in different fields, such as marketing, SEO, and drop shipping. In many cases I need JS crawling capabilities, and the service I use gives me the widgets and rendered pages, so I can focus on the data and the idea.

Given the resources I have, what would you be looking to scrape today? I like services that help people, so any feedback would be great. I am trying to think outside the box and would appreciate some inspiration.

I'm also open to hearing what data you would scrape from the web in real time if you had the right tools to scale your scraping.
======
PaulHoule
For 2020 I'd like to get the web browser out of my life as much as possible.
That is motivated by this work I've done:

[https://ontology2.com/essays/HackerNewsForHackers/](https://ontology2.com/essays/HackerNewsForHackers/)
[https://ontology2.com/essays/ClassifyingHackerNewsArticles/](https://ontology2.com/essays/ClassifyingHackerNewsArticles/)

I'd like to crawl a large number of sites that have quality articles, for
instance

[https://voxeu.org/](https://voxeu.org/)
[https://www.anandtech.com/](https://www.anandtech.com/)

and put them through a workflow where I never see an article more than once,
things get classified, etc.
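
The "never see an article more than once" step could be sketched roughly as below. This is only an illustration, not the actual workflow: `normalize_url` and `SeenFilter` are hypothetical names, the article paths in the usage note are made up, and it assumes that two URLs differing only in host case, fragment, or trailing slash point to the same article.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    Assumption: these differences never distinguish two articles.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenFilter:
    """Remembers a hash of every article URL that has already been shown."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is offered, False afterwards."""
        key = hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A crawler would call `is_new(url)` before queueing an article, for example with made-up paths such as `https://voxeu.org/article/example`. In practice the set would have to be persisted, and near-duplicate content under different URLs would need a content hash as well.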

One major issue I have is ads. In 2020 it is not just a matter of ads getting
in the way of content, but rather ads getting in the way of ads. That voxeu
site doesn't have ads, but it does abuse JavaScript in such a way that the
back button doesn't work properly.

The web is breaking down to the extent that I'd really like to filter the junk
out and have an order-of-magnitude better interface.

~~~
tiburon
@PaulHoule Thanks for sharing your research paper and classifier; it is
interesting. I have a question, though, which came up while going through the
top 200 articles picked by your algorithm: would it not be a better classifier
if it had the content behind all those shared links instead of just the titles
and some metadata? For example, if the web pages were crawled and scraped all
at once to feed the classifier in real time, would that not bring more
accurate results?

I totally agree about ads in ads. What solution do you think can work at scale
for those?

~~~
tiburon
@PaulHoule I've seen in the conclusion of your research that you point toward
classifying the content of the web pages behind the links, so I guess you are
working on it. I think the classifier will improve a lot if it has more
content to analyse.

~~~
PaulHoule
Check my profile and send me an email and I'd be glad to talk more.

Here is the progress I've made since then.

After I did that project I spent a year working on text analysis tools for
somebody else. Then I was looking for a new job, and I made a new version of
that software to scrape thousands of job listings and do a similar
classification based on the whole text of the listings, which are usually a
few paragraphs.

That software has a much better user interface than the old software for
adding labels and it's designed to handle "workflow" tasks that have some
human and some automated elements.

If I do more work in this area I will probably build on that code. Personally
I think the framework for getting training data and putting the model to work
is more important than the model itself. (That said, with a good document
embedding I think you could get good results with less training data.)
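
To make the "some human and some automated elements" idea concrete, here is a minimal sketch, not the software described above. It assumes a small naive Bayes model with add-one smoothing stands in for whatever classifier is actually used; `NaiveBayes`, `route`, and the 0.9 threshold are hypothetical choices for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whole-text tokens."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def add_example(self, text, label):
        """Record one human-labeled document."""
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return (best_label, probability). Needs at least one example."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into probabilities in a numerically stable way.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


def route(model, text, threshold=0.9):
    """Accept confident predictions; send the rest to a human for labeling."""
    label, confidence = model.predict(text)
    if confidence >= threshold:
        return ("auto", label)
    return ("human", label)
```

Items routed to `"human"` would be labeled by hand and fed back through `add_example`, which is how the training set grows over time.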

------
kristianp
I'm toying with scraping a certain type of product listing and running
classifiers to help users find the product they want.

It sounds like you're building a scraping service and looking for clients.

~~~
tiburon
@kristianp Thanks for the input. I am simply looking for ideas; the post is
clear about that. Can you give a real-world example of how a user would
benefit from a product classifier? Haven't those existed in the market for a
long time?

