Hacker News

Scraping with Selenium in Docker is pretty great, especially because you can use the Docker API itself to spin up and shut down containers at will. So you can spin up a container to hit a specific URL in a second, scrape whatever you're looking for, then kill the container. This can be driven from a job queue (Sidekiq if you're using Ruby) to do all sorts of fun stuff.
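A minimal sketch of that lifecycle, driving the Docker CLI from Python via the stdlib. The image is the official `selenium/standalone-chrome`; the function and container names are illustrative, and the actual WebDriver call is left as a stub:

```python
# Sketch: start a throwaway Selenium container for one scrape, then kill it.
# A job-queue worker (e.g. a Sidekiq job in the Ruby case) would call
# scrape_once() per URL. Names here are hypothetical, not from the thread.
import subprocess

IMAGE = "selenium/standalone-chrome"

def run_command(name, port):
    """Build the `docker run` invocation for a disposable Selenium node."""
    return [
        "docker", "run", "-d", "--rm",
        "--name", name,
        "-p", f"{port}:4444",  # expose the WebDriver endpoint
        IMAGE,
    ]

def scrape_once(url, name="scrape-node-1", port=4444):
    """Start a container, scrape one URL, tear the container down."""
    subprocess.run(run_command(name, port), check=True)
    try:
        # point a Remote WebDriver at http://localhost:<port> and fetch `url`
        pass
    finally:
        subprocess.run(["docker", "kill", name], check=True)
```

The `--rm` flag makes Docker delete the container as soon as it is killed, so nothing accumulates between scrapes.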

That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.

Serious question: What do you gain from having an extra layer like Docker?

Well, it makes it easy to deploy a scrape node to any type of machine you might encounter. A diverse set of source IPs is extra important for scraping, which means you might need to deploy to AWS, Azure, Google Cloud, Rackspace, DigitalOcean, random VPS provider X, and so on. Instead of maintaining custom provisioning profiles for every hosting provider/image combination, you just need to get Docker running on a host and you're good to go.

Because you can use pre-packaged Selenium Docker images with a few commands: https://github.com/SeleniumHQ/docker-selenium

Selenium Grid runs in Docker, so it's easy to have multiple instances running. Better control.
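A minimal compose sketch of such a grid, assuming the current Selenium 4 image naming (older docker-selenium images wired nodes to the hub with different environment variables):

```yaml
# Hypothetical docker-compose sketch: one hub plus Chrome nodes.
services:
  hub:
    image: selenium/hub:4
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome:4
    shm_size: 2gb        # Chrome needs more shared memory than the default
    depends_on:
      - hub
    environment:
      - SE_EVENT_BUS_HOST=hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
```

Running `docker compose up --scale chrome=5` then gives you five browser nodes behind one hub endpoint.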

Also, if you use Kubernetes to manage the grid, you can scale out to your credit card limit on GKE: https://github.com/kubernetes/kubernetes/tree/master/example...
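A hypothetical Deployment fragment for the browser nodes (all names here are illustrative, not from the linked example); scaling out is then just a matter of bumping `replicas`, or running `kubectl scale deployment selenium-node-chrome --replicas=50`:

```yaml
# Sketch: Chrome nodes as a Deployment, pointed at a hub Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-node-chrome
spec:
  replicas: 10
  selector:
    matchLabels:
      app: selenium-node-chrome
  template:
    metadata:
      labels:
        app: selenium-node-chrome
    spec:
      containers:
        - name: node-chrome
          image: selenium/node-chrome
          env:
            - name: SE_EVENT_BUS_HOST
              value: selenium-hub   # assumed name of the hub's Service
```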

What are the advantages of this versus a thread pool of web drivers? I'm not really familiar with Selenium Grid.

Grid can dynamically dispatch based on the browser and capabilities you want when you create the session.
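A toy Python sketch of that dispatch idea: match a session request's desired capabilities against registered nodes. The real Grid matcher is written in Java; these node records and field names are purely illustrative:

```python
# Toy capability matcher: return the first registered node that
# satisfies every capability the session request asked for.
NODES = [
    {"browserName": "chrome", "platformName": "linux", "slots": 4},
    {"browserName": "firefox", "platformName": "linux", "slots": 2},
]

def match_node(desired, nodes):
    """Return the first node whose capabilities satisfy the request, or None."""
    for node in nodes:
        if all(node.get(key) == value for key, value in desired.items()):
            return node
    return None
```

So a request for `{"browserName": "firefox"}` lands on the Firefox node, while an unsatisfiable request (say, Safari here) gets no node and would queue or fail.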

True that, I hope Zuck sues me so I'll get extra famous
