Ask HN: What info would you web scrape in 2020?
2 points by tiburon on Jan 15, 2020 | 6 comments
I've been web scraping for a while and I am running out of ideas. I use an anonymous crawling service that provides HTML content at scale. It also has a set of predefined scrapers, which I use instead of maintaining my own, and that speeds up any scraping I do.

I need new ideas for things to build that are unique and useful to people. I build scraping-based services that can be used in different fields, such as marketing, SEO, and drop shipping. In many cases I need JS crawling capabilities, and the service I use hands me the widgets and rendered pages, so I can focus on the data and the idea.

Knowing that you have the resources I have, what would you be looking to scrape today? I like services that help people, so any feedback would be great. I am trying to think outside the box and would appreciate some inspiration.

I'm also open to hearing what data you would scrape from the web in real time, if you had the right tools to scale your scraping.




For 2020 I'd like to get the web browser out of my life as much as possible. That's motivated by this work I've done:

https://ontology2.com/essays/HackerNewsForHackers/ https://ontology2.com/essays/ClassifyingHackerNewsArticles/

I'd like to crawl a large number of sites that have quality articles, for instance

https://voxeu.org/ https://www.anandtech.com/

and put them through a workflow where I never see an article more than once, things get classified, etc.
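A minimal sketch of the "never see an article more than once" step, assuming articles arrive as (URL, text) pairs. The `SeenFilter` class and its helpers are hypothetical names made up for illustration, not part of any existing tool, and an in-memory set stands in for whatever persistent store a real workflow would use.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def content_fingerprint(text):
    """SHA-256 of the lowercased text with whitespace runs collapsed."""
    collapsed = " ".join(text.lower().split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers URLs and text fingerprints (hypothetical helper)."""

    def __init__(self):
        self._urls = set()
        self._fingerprints = set()

    def is_new(self, url, text):
        """Return True only the first time a URL or an identical text shows up."""
        key = normalize_url(url)
        fingerprint = content_fingerprint(text)
        if key in self._urls or fingerprint in self._fingerprints:
            return False
        self._urls.add(key)
        self._fingerprints.add(fingerprint)
        return True
```

Checking both the URL and the text catches the same article republished under a different address. It only matches exact text after whitespace and case folding, so near-duplicates with small edits would still get through.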

One major issue I have is ads. In 2020 it is not just a matter of ads getting in the way of content, but rather ads getting in the way of ads. That voxeu site doesn't have ads, but it does abuse JavaScript in such a way that the back button doesn't work properly.

The web is breaking down to the extent that I'd really like to filter the junk out and have an order-of-magnitude better interface.


@PaulHoule Thanks for sharing your research paper and classifier; it is interesting. I have a question, though, that came up while going through the top 200 articles picked by your algorithm: wouldn't the classifier be more effective if it had the content behind all those shared links instead of just the titles or some metadata? For example, if the web pages were crawled and scraped all at once to feed the classifier in real time, wouldn't that bring more accurate results?

I totally agree about ads getting in the way of ads. What solution do you think can work at scale for those?


@PaulHoule I've seen in the conclusion of your research that you point to classifying the content of the web pages behind the links, so I guess you are working on it. I think the classifier will improve greatly if it has more content to analyse.


Check my profile and send me an email and I'd be glad to talk more.

Here is the progress I've made since then.

After I did that project I spent a year working on text analysis tools for somebody else. Then I was looking for a new job, and I made a new version of that software to scrape thousands of job listings and do a similar classification based on the whole text of each listing, which is usually a few paragraphs.

That software has a much better user interface than the old software for adding labels and it's designed to handle "workflow" tasks that have some human and some automated elements.

If I do more work in this area I will probably build on that code. Personally I think the framework for getting training data and putting the model to work is more important than the model itself. (That said, with a good document embedding I think you could get good results with less training data.)
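To make "classification based on the whole text" concrete, here is a rough sketch using a multinomial naive Bayes classifier in plain Python. This is not the software described above, whose model isn't specified; the `NaiveBayes` class, the tokenizer, and the "interested" / "skip" labels are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def add_example(self, text, label):
        """Record one hand-labeled document."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayes()
clf.add_example("Senior Python developer building data pipelines and web scrapers", "interested")
clf.add_example("Backend engineer, Python, text analysis and machine learning", "interested")
clf.add_example("Retail store manager, weekend shifts, customer service", "skip")
clf.add_example("Warehouse associate, forklift experience required", "skip")
print(clf.predict("Python engineer for text analysis pipelines"))
```

Because `add_example` can be called one label at a time, a labeling interface could feed it as a person works through listings, which is the part of the loop the comment argues matters more than the model.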


I'm toying with scraping a certain type of product listing and running classifiers to help users find the product they want.

It sounds like you're building a scraping service and looking for clients.


@kristianp Thanks for the input. I am simply looking for ideas, as my post makes clear. Can you give a real-world example of how a user would benefit from a product classifier? Haven't those been on the market for a long time?



