
I've been involved in many web scraping jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used Scrapy. I went with XML tools for navigating the DOM.

It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an ec2 instance. It checks its tags to see if it should execute the script. If so, it does so. When it's done with its scraping for the day, it turns itself off.

This is a tiny snapshot of why it's been so difficult for me to go from python2 to python3. I'm strongly in the camp of "if it ain't broke, don't fix it".
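Roughly, the on-instance side boils down to a tag check and a self-shutdown. A simplified sketch (the tag name, region, and spider name are placeholders, and it's written Python 3-style here even though the real job is still on Python 2):

    import subprocess
    import urllib.request

    import boto3

    # Ask the instance metadata service which instance we are.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]

    # Only scrape if the instance is tagged to do so.
    if any(t["Key"] == "RunScraper" and t["Value"] == "true" for t in tags):
        subprocess.run(["scrapy", "crawl", "daily_spider"], check=False)

    # Done for the day: power off.
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)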



Using `2to3` might get you 80% of the way there, though cases like this are exactly where having a test suite really pays off.
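As a tiny example of the kind of thing it handles automatically (a made-up fragment, not your actual code):

    # What `2to3 -w spider.py` rewrites, on a toy two-line fragment:
    #
    #   print "scraped %d items" % count     # Python 2 print statement
    #   for url in seen.iterkeys():          # Python 2 dict.iterkeys()
    #       pending.append(url)
    #
    # becomes (stand-in values added so the snippet runs on its own):

    count, seen, pending = 1, {"https://example.com/": True}, []

    print("scraped %d items" % count)   # print is a function now
    for url in seen.keys():             # iterkeys() -> keys()
        pending.append(url)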


Why can't you just keep using Python 2? Surely some people out there are interested enough to keep updating and maintaining it?


I certainly can keep using it. There have been so many efforts to get people to update Python 2 code to Python 3 code that it's on my backlog to do it. Will I get to it this year? Probably not.


"I send a command at a random time between 11pm and 4am to wake up an ec2 instance."

Any chance you could tell me your setup for this?


Not my project, but if I had to do it I'd try something like the following:

* Set up an autoscaling group with your instance template, max instances 1, min instances 0, desired instances 0 (nothing is running).

* Set up a Lambda function that sets the autoscaling group desired instances to 1.

* Link that function to an API Gateway call, give it an auth key, etc.

* From any machine you have, set up your cron with a random sleep and a curl call to the API.

And that should do the trick, I think.
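For the Lambda in the second step, something like this should be enough (group name and region are placeholders):

    import boto3

    # Placeholder names; point these at your real ASG and region.
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    def lambda_handler(event, context):
        # Ask the ASG for one instance; it boots from the launch template.
        autoscaling.set_desired_capacity(
            AutoScalingGroupName="scraper-asg",
            DesiredCapacity=1,
            HonorCooldown=False,
        )
        return {"statusCode": 200, "body": "scraper requested"}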


>From any machine you have, set up your cron with a random sleep and a curl call to the API.

You might as well just call the ASG API directly.
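Something like this from the cron machine would do it (the group name and the five-hour window are made up to match the original setup):

    import random
    import time

    import boto3

    # Cron fires at 11pm; sleep 0-5 hours so the actual start lands
    # somewhere between 11pm and 4am.
    time.sleep(random.randint(0, 5 * 60 * 60))

    # Uses whatever AWS credentials/region are configured on this machine.
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName="scraper-asg",  # placeholder name
        DesiredCapacity=1,
    )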


Why use autoscaling and not just launch the instance directly from Lambda? The run time is short, so there's no danger of two instances running in parallel.
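For a single known instance that's just stopped between runs, the whole Lambda could be something like this (the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")

    def lambda_handler(event, context):
        # Start the one known (stopped) scraper instance.
        ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
        return {"statusCode": 200}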


The tl;dr for all web scraping is to just use Scrapy (and scrapyd) - otherwise you end up writing a poorer implementation of what has already been built.

My only recent change is that we no longer use Items and ItemLoaders from Scrapy - we've replaced them with a custom pipeline of Pydantic schemas and objects.
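The basic shape, in case it's useful (the model and fields here are illustrative, not our actual schema): spiders yield plain dicts, and an item pipeline validates them against a Pydantic model, dropping anything that doesn't conform.

    from pydantic import BaseModel, ValidationError
    from scrapy.exceptions import DropItem


    class Product(BaseModel):
        url: str
        title: str
        price: float


    class PydanticValidationPipeline:
        def process_item(self, item, spider):
            try:
                # Validate/coerce the raw dict the spider yielded.
                validated = Product(**item)
            except ValidationError as exc:
                raise DropItem(f"invalid item: {exc}")
            # Hand a plain, validated dict to whatever pipeline comes next.
            return validated.dict()

It gets wired up through ITEM_PIPELINES in settings.py like any other pipeline.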



