
Web scraping. I love figuring out how to reverse engineer websites and defeat systems designed against web scrapers. It's also super interesting (concerning?) how much data websites leak. 4 out of the 5 bug bounties I've discovered have been while poking around in my scraping efforts.



I like web scraping and made a lot of money building farms 15 or so years ago; I somehow started to dislike it when Python more or less took over. What are you using? I'm also interested in doing it as cheaply as I can, which is a lot of fun for technical reasons.


What's a good tool/language to write scrapers in these days? A decade ago I was using Ruby with Mechanize and Hpricot. I hope tools have improved since then, especially for scraping sites that use JavaScript.


It really depends on how big your scraping operation is going to be. These days there are a lot of "managed" providers that give you headless browsers / proxy rotators through an easy API, so it's relatively easy to plug them into your code. Examples would be https://www.browserless.io or https://www.scrapingbee.com for headless browsers to render JS.
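For a rough idea of what a proxy rotator does under the hood, here is a minimal sketch using only the Python standard library. It is not the API of either provider above (those hide this behind their own endpoints); the `ProxyRotator` class, the user agent string, and the proxy addresses are all hypothetical.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Hands out a fresh opener per request, cycling through proxies round-robin."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("need at least one proxy URL")
        self._cycle = itertools.cycle(proxy_urls)

    def next_opener(self):
        # Pick the next proxy in the cycle and route both schemes through it.
        proxy = next(self._cycle)
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        opener = urllib.request.build_opener(handler)
        # Hypothetical user agent; a real scraper would set its own.
        opener.addheaders = [("User-Agent", "example-scraper/0.1")]
        return proxy, opener


# Hypothetical proxy addresses, for illustration only.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
])
```

Each call to `rotator.next_opener()` returns the proxy in use plus an opener whose `open(url)` sends the request through it. A managed service does roughly this for you, along with retries and ban detection.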

From my experience working on a large scraping stack with thousands of integrations, I can say that we are very happy with our own custom framework, written in Go (https://github.com/PuerkitoBio/goquery for HTML parsing) and using headless Chrome for JS rendering.


A fun way, though there may be much more productive ones, is to learn Scheme and/or Lisp and, using an implementation that has a library for it, convert the HTML into one big s-expression. The page is then in the same form as the language itself, so you can do literally anything with it.
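To make the s-expression idea concrete without a Lisp at hand, here is a rough sketch in Python using only the standard library's `html.parser`. Nested lists stand in for s-expressions; the names `SexpBuilder`, `html_to_sexp`, `render`, and `find_all` are made up for this example, and real-world HTML will need more care than this handles.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SexpBuilder(HTMLParser):
    """Builds a nested-list tree shaped like [tag, {attrs}, child, child, ...]."""

    def __init__(self):
        super().__init__()
        self.root = ["document", {}]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs)]
        self.stack[-1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].append(text)


def html_to_sexp(html):
    builder = SexpBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def quote(text):
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def render(node):
    """Render the tree as Lisp-style text."""
    if isinstance(node, str):
        return quote(node)
    tag, attrs, *children = node
    parts = [tag]
    parts += [f":{k}" if v is None else f":{k} {quote(v)}" for k, v in attrs.items()]
    parts += [render(child) for child in children]
    return "(" + " ".join(parts) + ")"


def find_all(node, tag):
    """Yield every element node with the given tag, depth first."""
    if isinstance(node, str):
        return
    if node[0] == tag:
        yield node
    for child in node[2:]:
        yield from find_all(child, tag)
```

For example, `render(html_to_sexp('<p>Hi <a href="/x">there</a></p>'))` gives `(document (p "Hi" (a :href "/x" "there")))`, and `find_all` walks the same tree to pull out every link.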



