
I've been involved in many web scraping jobs over the past 25 years or so. The most recent one, which was a long time ago at this point, used Scrapy. I went with XML tools for navigating the DOM.

It's worked unbelievably well. It's been running for roughly 5 years at this point. I send a command at a random time between 11pm and 4am to wake up an ec2 instance. It checks its tags to see if it should execute the script. If so, it does so. When it's done with its scraping for the day, it turns itself off.

This is a tiny snapshot of why it's been so difficult for me to go from python2 to python3. I'm strongly in the camp of "if it ain't broke, don't fix it".
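Roughly, the on-instance side boils down to a tag check and a self-shutdown. A simplified sketch (the tag name, region, and spider name are placeholders, and it's written Python 3-style here even though the real job is still on Python 2):

    import subprocess
    import urllib.request

    import boto3

    # Ask the instance metadata service which instance we are.
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]

    # Only scrape if the instance is tagged to do so.
    if any(t["Key"] == "RunScraper" and t["Value"] == "true" for t in tags):
        subprocess.run(["scrapy", "crawl", "daily_spider"], check=False)

    # Done for the day: power off.
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)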



Using `2to3` might get you 80% of the way there, though cases like this are exactly where having a test suite really pays off.
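As a tiny example of the kind of thing it handles automatically (a made-up fragment, not your actual code):

    # What `2to3 -w spider.py` rewrites, on a toy two-line fragment:
    #
    #   print "scraped %d items" % count     # Python 2 print statement
    #   for url in seen.iterkeys():          # Python 2 dict.iterkeys()
    #       pending.append(url)
    #
    # becomes (stand-in values added so the snippet runs on its own):

    count, seen, pending = 1, {"https://example.com/": True}, []

    print("scraped %d items" % count)   # print is a function now
    for url in seen.keys():             # iterkeys() -> keys()
        pending.append(url)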


Why can't you just keep using Python 2? Surely some people out there are interested enough to keep updating and maintaining it?


I certainly can keep using it. There have been so many efforts to get people to update Python 2 code to Python 3 code that it's on my backlog to do it. Will I get to it this year? Probably not.


"I send a command at a random time between 11pm and 4am to wake up an ec2 instance."

Any chance you could tell me your setup for this?


Not my project, but if I had to do it I'd try something like the following:

* Set up an autoscaling group with your instance template, max instances 1, min instances 0, desired instances 0 (nothing is running).

* Set up a Lambda function that sets the autoscaling group desired instances to 1.

* Link that function to an API Gateway call, give it an auth key, etc.

* From any machine you have, set up your cron with a random sleep and a curl call to the API.

And that should do the trick, I think.
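For the Lambda in the second step, something like this should be enough (group name and region are placeholders):

    import boto3

    # Placeholder names; point these at your real ASG and region.
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    def lambda_handler(event, context):
        # Ask the ASG for one instance; it boots from the launch template.
        autoscaling.set_desired_capacity(
            AutoScalingGroupName="scraper-asg",
            DesiredCapacity=1,
            HonorCooldown=False,
        )
        return {"statusCode": 200, "body": "scraper requested"}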


>From any machine you have, set up your cron with a random sleep and a curl call to the API.

You might as well just call the ASG API directly.
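Something like this from the cron machine would do it (the group name and the five-hour window are made up to match the original setup):

    import random
    import time

    import boto3

    # Cron fires at 11pm; sleep 0-5 hours so the actual start lands
    # somewhere between 11pm and 4am.
    time.sleep(random.randint(0, 5 * 60 * 60))

    # Uses whatever AWS credentials/region are configured on this machine.
    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName="scraper-asg",  # placeholder name
        DesiredCapacity=1,
    )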


Why use autoscaling and not just launch the instance directly from Lambda? The run time is short, so there's no danger of two instances running in parallel.
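For a single known instance that's just stopped between runs, the whole Lambda could be something like this (the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2")

    def lambda_handler(event, context):
        # Start the one known (stopped) scraper instance.
        ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])
        return {"statusCode": 200}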


The tl;dr for all web scraping is to just use Scrapy (and scrapyd) - otherwise you end up writing a poorer implementation of what has already been built.

My only recent change is that we no longer use Items and ItemLoaders from Scrapy - we've replaced them with a custom pipeline of Pydantic schemas and objects.
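The basic shape, in case it's useful (the model and fields here are illustrative, not our actual schema): spiders yield plain dicts, and an item pipeline validates them against a Pydantic model, dropping anything that doesn't conform.

    from pydantic import BaseModel, ValidationError
    from scrapy.exceptions import DropItem


    class Product(BaseModel):
        url: str
        title: str
        price: float


    class PydanticValidationPipeline:
        def process_item(self, item, spider):
            try:
                # Validate/coerce the raw dict the spider yielded.
                validated = Product(**item)
            except ValidationError as exc:
                raise DropItem(f"invalid item: {exc}")
            # Hand a plain, validated dict to whatever pipeline comes next.
            return validated.dict()

It gets wired up through ITEM_PIPELINES in settings.py like any other pipeline.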



