More notes on writing web scrapers (cushychicken.github.io)
14 points by cushychicken on Feb 25, 2022 | 7 comments



I'd note, regarding storing data, that HTML compresses extremely well: it will often shrink to less than 10% of its original size. If you want to save it, you'd be a fool to save it uncompressed.
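For example, with Python's standard gzip module (the filenames here are just placeholders):

    import gzip

    # Compress a scraped page before writing it to disk.
    with open("page.html", "rb") as f:
        html = f.read()

    with gzip.open("page.html.gz", "wb") as f:
        f.write(html)

    # Decompress it later when you want to parse it.
    with gzip.open("page.html.gz", "rb") as f:
        restored = f.read()

    assert restored == html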


Guess I'm a bit of a fool then, because I was definitely saving it uncompressed, haha.

Maybe I'll give it a try at some point, but right now, I don't have any need for it. It'd be work without a clear payoff. I'm starting a new job at the end of March, and I'd rather concentrate on some features that help me run the site between now and then.


Author here. Let's talk web scrapers.


Copying a request as cURL, PowerShell, fetch, etc. right from inside the DevTools is a blessing.


Isn't it though?

It does create a bit of work when you have to figure out which parts of the cURL command need to be ported to Python, and which can be safely omitted. Copying a request as cURL pulls in a lot of headers, many of which I still don't properly understand the purpose of.
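For example, a copied cURL command often carries a dozen sec-ch-ua/sec-fetch-* headers, but in my experience a trimmed Python port frequently needs only one or two. A rough sketch (the URL and header values below are made up, not a real target):

    import requests

    # A copied cURL command might include sec-ch-ua, sec-fetch-*,
    # accept-language, and more. Many sites respond fine with just a
    # User-Agent, sometimes plus cookies or a Referer.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        # "Referer": "https://example.com/listings",  # add back if required
    }

    resp = requests.get("https://example.com/jobs?page=1", headers=headers)
    resp.raise_for_status()
    print(resp.text[:200])

Starting from the full copied request and deleting headers one at a time until the response breaks is a quick way to find the minimal set.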

I'll get there eventually. :) In the meantime, thank god for whoever wrote "Copy as cURL"!


Is there such a thing as an unscrapable site? I tried to open driver.uber.com with Pyppeteer and it fails. I'm guessing it's due to redirects, so what have you seen that solves this problem?
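For reference, here's roughly what I'm running (a minimal sketch; the waitUntil option is my guess at letting the redirect chain settle, and it may not get past dedicated bot detection):

    import asyncio
    from pyppeteer import launch

    async def main():
        browser = await launch(headless=True)
        page = await browser.newPage()
        # 'networkidle0' waits until there are no network connections
        # for 500 ms, which gives redirect chains time to finish.
        resp = await page.goto(
            'https://driver.uber.com',
            {'waitUntil': 'networkidle0', 'timeout': 60000},
        )
        print(resp.status, page.url)
        await browser.close()

    asyncio.get_event_loop().run_until_complete(main())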


Workday is my 800-pound scraping gorilla.

Never heard of Pyppeteer.



