
Show HN: Instagram-scraper – Scrape instagram photos by tags, without API - meetmangukiya
https://github.com/meetmangukiya/instagram-scrapper
======
meetmangukiya
This scraper was written to get images and create a dataset for ML models for a
personal project while studying Machine Learning and Artificial Intelligence.

~~~
nsfmc
curious why you are scraping instagram for this purpose and not something like
flickr which has a reasonable public api and tagged creative commons licensed
images that are suitable for your ML purposes. at the very least, it's worth
investigating archive.org's many freely licensed archives for this sort of
thing.

as somebody that has fielded numerous emails from friends asking me to remove
tagged photos of them from flickr, i sort of wonder about the ethics of
harvesting these sorts of images from instagram, a community whose norms sort
of revolve around semi-public sharing of photos. I don't doubt that there's
some rationale for harvesting the images from ig, but aside from thumbing your
nose at their TOS, it feels like it's a greater violation of trust to harvest
your friends' and strangers' photos for an ML project without their informed
consent.

at the very least, it's worth considering pointing your app's gaze at a set of
images licensed for any purpose whatsoever rather than ones that are
explicitly licensed All Rights Reserved by their respective photographers.

------
Jonovono
I wish Instagram had an API for their photos. You used to be able to query
photos by coordinates and whatnot. I started building a weather app that
would only show you the feels-like temperature and wanted to show some photos
of what it looked like outside near your location. Instagram kept rejecting my
app and then they shut down the API completely. So what I had to do is this:

Query Facebook for a list of location IDs near your location -> use those
location IDs to get photos tagged with that location on Instagram -> wait for
response for all of those photos to come back and then sort by recently taken.
It ends up taking fairly long.

I still made the app anyway:
[https://itunes.apple.com/ca/app/feels-see-what-it-feels-like...](https://itunes.apple.com/ca/app/feels-see-what-it-feels-like/id1318570720?mt=8)
but I am going to transition to using public Snapchat
stories.
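
The three-step pipeline described in this comment (look up nearby location IDs, fetch the photos for each location, then sort by time taken) could be sketched roughly as below. This is only an illustration: `fetch_location_ids` and `fetch_photos` are hypothetical callables standing in for the Facebook and Instagram requests, and the `taken_at` field is an assumed timestamp key, not a documented one. The per-location fetches run in a thread pool, which is one possible way to shorten the wait the comment mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def recent_photos_near(lat, lng, fetch_location_ids, fetch_photos, max_workers=8):
    """Collect photos for all locations near (lat, lng), newest first.

    fetch_location_ids(lat, lng) -> list of location IDs   (hypothetical)
    fetch_photos(location_id)    -> list of photo dicts     (hypothetical)
    Each photo dict is assumed to carry a sortable "taken_at" value.
    """
    location_ids = fetch_location_ids(lat, lng)
    if not location_ids:
        return []

    # Fetch every location's photos concurrently instead of one by one.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(fetch_photos, location_ids))

    # Flatten the per-location batches and sort by most recently taken.
    photos = [photo for batch in batches for photo in batch]
    return sorted(photos, key=lambda p: p["taken_at"], reverse=True)
```

The function still has to wait for the slowest location before it can sort, so the total time is bounded by the slowest single request rather than the sum of all of them.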

------
ciferkey
I've used InstaLooter as a library to great success:
[https://github.com/althonos/InstaLooter](https://github.com/althonos/InstaLooter)

I even opened an issue on the repo when I ran into a problem setting it up on
AWS, and the maintainer was fast to respond.

Used it to make a bot to scrape soup special menus from a sandwich place near
my work:
[http://blog.matthewbrunelle.com/projects/2018/05/07/Soup-Bot...](http://blog.matthewbrunelle.com/projects/2018/05/07/Soup-Bot.html)

------
katzgrau
If you've ever tried scraping Facebook you'd know that it's nearly impossible
to do so reliably. They have a formidable anti-scrape strategy. Instagram is
currently ridiculously easy to scrape - but I doubt it's going to be that way
for long.

~~~
retube
what do they do that makes it so hard?

~~~
epmaybe
They randomize the tag ids and classes a lot of the time. At least that has
been my experience with the Facebook messenger website.

~~~
slow_donkey
If you're doing Messenger, there's a community API on GitHub that emulates the
browser

------
cavDXF
I use
[https://github.com/rarcega/instagram-scraper](https://github.com/rarcega/instagram-scraper)

The rate limit imposed by Instagram is a bit tough, but as I only use it for
archiving a few of my close friends' accounts (it supports private accounts),
that's OK.

------
stingraycharles
What is HN’s opinion on the legality of these types of scrapers? Instagram’s
robots.txt disallows this kind of scraping, same for their ToS. Legal
precedents have been mixed - the recent LinkedIn vs HiQ case is a good signal,
but it’s still in appeals court.

~~~
Rjevski
If the data is made available to me as a human, then I am free to delegate the
job of retrieving it to a machine if I choose, and I will be doing it whether
you like it or not.

~~~
madeofpalk
Unfortunately "whether you like it or not" doesn't carry much legal weight.

~~~
merinowool
That's why it is important to root out corruption from law making, nowadays
called "lobbying" - this should be illegal. If you expose an endpoint to the
public you can't restrict who or what can consume it. You can do throttling on
your side but that's it. Otherwise this is just racism towards machines.

~~~
notduncansmith
How do you outlaw lobbying?

~~~
qbrass
It probably involves lobbying.

------
fapjacks
Well, this is just a Python web scraper, and Instagram does in fact attempt to
detect and prevent/rate-limit this kind of scraping. They rely very heavily on
the source IP to help them determine when to cut you off.
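
To make the idea of IP-based rate limiting concrete, here is a minimal sliding-window sketch. It is purely illustrative and is not a description of how Instagram actually does it; the class name `PerIPRateLimiter`, its parameters, and the thresholds used below are all assumptions.

```python
import time
from collections import defaultdict, deque


class PerIPRateLimiter:
    """Allow at most max_requests per window_seconds for each source IP."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self.clock()
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this IP is over its budget; cut it off
        hits.append(now)
        return True


limiter = PerIPRateLimiter(max_requests=100, window_seconds=60)
if not limiter.allow("203.0.113.7"):
    print("429 Too Many Requests")
```

Because the budget is keyed on the source address alone, a scraper running from a single IP hits the limit quickly, which matches the behaviour described above.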

~~~
donjoe
... indeed. Isn't the script 'under the hood' calling the API anyways (looking
at 'scrolldown' here)?

I reverse-engineered the API myself a couple of weeks ago which was great fun
- especially figuring out Instagram's rate limits on interactions such as
comments and likes per day/hr.

~~~
meetmangukiya
Not the Instagram developer APIs, but the one that instagram's frontend
consumes. The script scrapes instagram's frontend here.

------
chiefalchemist
A couple years ago, in order to replace (the bloated and slow) Instagram
widget on a website, I whipped up a simple PHP scraper for an account page. I
don't see why it would have taken much to do it by tag. All it did, more or
less, was visit a URL, scrape, and then parse.

------
bigmit2011
What is this regex looking for?:

re.compile('(?:#)([A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\\.(?!\\.))){0,28}(?:[A-Za-z0-9_]))?)')

~~~
meetmangukiya
hashtag, as the key name 'hashtag' states :)
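
For readers with the same question, here is a minimal sketch of what the pattern matches, using only Python's standard `re` module. The sample caption and the helper name `extract_hashtags` are made up for illustration; the pattern itself is the one quoted above, written as a raw string.

```python
import re

# Same pattern as quoted above; in a raw string each "\\." is written as "\.".
HASHTAG_RE = re.compile(
    r'(?:#)([A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?)'
)


def extract_hashtags(text):
    """Return the tag names (without the leading '#') found in text."""
    return HASHTAG_RE.findall(text)


caption = "Sunset at the beach #sunset #beach_2018 #a.b #trailing. #x..y"
print(extract_hashtags(caption))
# ['sunset', 'beach_2018', 'a.b', 'trailing', 'x']
```

In words: a `#`, followed by 1 to 30 characters drawn from letters, digits, and underscores, where a single dot is allowed in the middle but never doubled and never at the start or end.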

------
r0f1
Does not really work... downloads the same 5 pictures over and over again.

~~~
anonu
There are dozens and dozens of similar projects on Github and elsewhere...

~~~
fapjacks
Right, this is what I was driving at. This does not belong on HN, as it is
neither interesting nor novel. And it's self-promotion which, when it's also
completely unimpressive, stinks even more in my opinion. If the OP had posted
a link to his blog which explained his ML project in detail -- and it was an
interesting project and not for example an Intro to ML Coursera project or
something equally lame -- then perhaps it might be appropriate for HN. This is
not that.

