Scraping Instagram (scrapingfish.com)
55 points by mateuszbuda on March 31, 2022 | 36 comments



Can anyone explain to me how these services are legal? I haven't read Instagram's terms and conditions, but I'm pretty sure there are tons of points against scraping, copying and distributing their data, in particular using it to make money.

How is this possible?


I don't follow this topic closely but it is definitely in a legal grey area and under frequent debate (and lawsuits).

To summarize at a high level:

A frequent allegation is that this is unauthorized access to computer systems. The scrapers argue that this is public data, so they are just accessing it; their access isn't meaningfully different from that of regular users, which is allowed. From their point of view, if the service doesn't want to share the data, it shouldn't make it available.

Another common accusation is breaching the ToS. Generally the defense is that they didn't agree to any contract.

A last effort is some sort of copyright claim. Generally the scrapers will argue that the data can't be copyrighted, isn't owned by the service, or that some sort of license was given (back to the public data argument).

Of course every case is different and has different points but these are the common ones that I have seen.


Potentially useful reference regarding the status of one of the most important such lawsuits:

https://news.bloomberglaw.com/us-law-week/supreme-court-scra...


Yeah, post-LinkedIn: it gave the green light to scrape any publicly available information. Craigslist bullied scrapers via lawsuit (the EFF covered it), but post-LinkedIn there are zero grounds for Craigslist to use the DDoS argument (since the website is built to handle far more traffic than scrapers can generate).


Breach of ToS has nothing to do with legality. It's definitely a breach of ToS, but legality will depend on the local jurisdiction, and enforcement will depend on whether the user is within reach of a legal system that cares about it (good luck when the user is anonymous or based in Russia or another US-unfriendly country).


The simple answer is: this is not legal and also doesn't work at scale. Try running this type of scraping for a few thousand profiles - you will quickly be restricted.


It's definitely a breach of ToS, but I wouldn't be so fast at calling it illegal. It's a grey area that has yet to be properly litigated - I think the closest we've got is the LinkedIn scraping case and I don't remember whether that one even reached a conclusive answer.

In fact this is one of the downsides of the US legal system - litigation is so expensive that nobody dares try it even though it could set a legal precedent that would benefit society at large. This is IMO something a consumer-friendly regulatory environment (such as the EU) should settle in advance, like with the GDPR for example, but given they can't even be bothered to enforce that effectively, I don't have much hope (if they enforced it, it would actually remove a big use case for scraping Instagram, as you would be able to use the official clients without compromising your privacy).


AFAIK the latest status of the LinkedIn case is still inconclusive (due to the Supreme Court stepping in).

https://news.bloomberglaw.com/us-law-week/supreme-court-scra...


You are wrong. This is not illegal. With a 4G/LTE proxy machine you can easily generate thousands of profiles rapidly and cheaply. They would be able to detect them at some point (it will be harder if you go slowly), but it wouldn't stop the scraping.

The only way is for Instagram to restrict registration altogether, but that might create a black market where existing users sell their accounts, and cannibalize its own userbase (bad for Meta's stock price).


I may be wrong about this being illegal (depending on the country you reside in), but it is certainly not an approach that scales. Meta/Instagram have multiple teams dedicated to preventing this type of scraping. Unless you're willing to invest an equivalent level of resources, any success in scraping Instagram data will be temporary.


> it is certainly not an approach that scales

If there's demand for their service I don't see why it wouldn't scale. Get more phones, more SIM cards, and have automation around all this infrastructure to automate away as much of the stuff as possible.

> Meta/Instagram have multiple teams dedicated to preventing this type of scraping

That's great but ultimately they still have a weakness: they want people to be able to see their stuff - at least some of it - without logging in. As long as you can either simulate a normal device perfectly, or even better, use real devices or virtualize them, there isn't much that Facebook can do without impacting legitimate usage which they don't want.


Instagram is one service that is very particular about enforcing their API usage. Anyone attempting to monetize Instagram data obtained outside of the developer program will get a C&D very quickly.

The most kosher way to get Instagram data is to get it through CrowdTangle which is owned by Meta but has its own caveats.


So? ToS is not the law! Nor can you use the CFAA here. It is not hacking. And if the operator lives in a jurisdiction that does not respect an American corporation's C&Ds, what happens? Instagram has no legal ground to request extradition because somebody is scraping them lol.

You think Instagram is going to get the FBI to bust doors in Mogadishu or wherever the operators are?

It might be an issue if you are in the US or the West, since it's behind a walled garden (you need to authenticate to access it), but you do not need to pay for it, nor are new registrants restricted (they have access to everything), so it's a public website that forces user accounts. The best Instagram can do is throttle or ban the accounts doing the scraping.


It’s a bit difficult to monetize all the data you get from Instagram if you don’t have an Instagram account. And Instagram will happily mess you up by requiring phone number confirmation, and by banning IP addresses or phone numbers.


The business model here is that they've streamlined the process to get phone numbers & IPs - Facebook can't do shit without impacting other, legitimate users on the same IP & number ranges.


Scraping doesn't necessarily imply monetization. There are plenty of not-for-profit use cases for scraping this data.


Instagram is hard to scrape. They don't want people having that data. How about searching old hashtag results by date?


What kind of things are people scraping Insta for? I have a hard time with scraping apps anyway, but at least some of it makes sense when comparing prices or whatnot. But I'm just not imaginative enough to come up with why you'd want to scrape obviously copyrighted images.

*Also, I'm not an Insta user, so in my mind it is just a thread of images and comments. Maybe my understanding of Insta is off?


I scrape the data that public officials post (though not on Instagram yet) -- that has lots of utility in determining their positions on issues, where in their jurisdiction they visit most, who they're meeting with, etc.


Friends and family publish via Instagram but I don’t use it myself. Scraping allows me to follow via “RSS” feeds so I’m not left out.

(Remember when Facebook and Twitter had this built in?)
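
For anyone curious what the feed side of this looks like, here is a minimal sketch in Python. It assumes the posts have already been fetched by some other means; the `Post` record and `build_rss` function are hypothetical names for illustration, and nothing here talks to Instagram itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


@dataclass
class Post:
    # Hypothetical record for one already-scraped post.
    url: str
    caption: str
    posted_at: datetime


def build_rss(title: str, link: str, posts: list[Post]) -> str:
    """Render posts as an RSS 2.0 document, newest first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Posts from {title}"
    for post in sorted(posts, key=lambda p: p.posted_at, reverse=True):
        item = ET.SubElement(channel, "item")
        # Use the start of the caption as the item title.
        ET.SubElement(item, "title").text = post.caption[:80] or "(no caption)"
        ET.SubElement(item, "link").text = post.url
        ET.SubElement(item, "guid").text = post.url
        ET.SubElement(item, "description").text = post.caption
        # RSS expects RFC 822 style dates.
        ET.SubElement(item, "pubDate").text = format_datetime(post.posted_at)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    sample = [
        Post("https://example.com/p/1", "Beach day",
             datetime(2022, 3, 30, tzinfo=timezone.utc)),
        Post("https://example.com/p/2", "New puppy",
             datetime(2022, 3, 31, tzinfo=timezone.utc)),
    ]
    print(build_rss("Family feed", "https://example.com/family", sample))
```

Write the output to a file your feed reader can poll and you get the "follow without an account" experience; the hard part remains fetching the posts in the first place.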


Real estate agencies in my town post newly available properties on Insta. I'm looking for a place to rent, so I'd like to scrape it so that I don't have to be checking my phone constantly.


have we really gotten to the point that this is the only place they post the data? you have to be "cool" to know the listings are available rather than checking "lame" websites? If true, I weep for society


Unfortunately yes. More and more stuff is being monopolized by Facebook.


This just seems like it would violate some sort of MLS rule.


If scraping Instagram was allowed or easy, there are a tonne of use cases. One example: detection of products and sentiment for marketing (e.g. a post: I love my new Apple Watch!)
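
To make that concrete, a toy sketch in Python of product and sentiment detection on a caption. The word lists and the `analyze` function are made up for illustration; a real pipeline would presumably use a trained model rather than keyword matching.

```python
import re

# Hypothetical, tiny lexicons for illustration only.
PRODUCTS = ["apple watch", "iphone", "airpods"]
POSITIVE = {"love", "great", "amazing"}
NEGATIVE = {"hate", "broken", "terrible"}


def analyze(caption: str) -> dict:
    """Return the products mentioned and a crude sentiment label."""
    text = caption.lower()
    words = set(re.findall(r"[a-z']+", text))
    # Product names may span several words, so match on the full text.
    products = [p for p in PRODUCTS if p in text]
    # Score is the count of positive words minus the count of negative words.
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"products": products, "sentiment": sentiment}


if __name__ == "__main__":
    print(analyze("I love my new Apple Watch!"))
```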


yep, it is. The one loophole that made it easy to scrape was using the Facebook URL scrape API, but they removed that loophole a year or so ago.


Never heard of this product before, did you guys launch recently? What are the differences between you and ScrapingBee?


We focus on offering the most premium mobile 4G/LTE proxies to constantly get new IPs.


heh I remember that HN thread about how anybody can pay a few grand to buy a 4G/LTE proxy that can generate endless new IPs and scrape anything.

what do you guys do differently vs somebody just making the leap and purchasing the 4g/LTE proxy hardware and doing things themselves?

Where are you guys located?


Do you have a link to that? I'd be interested in knowing about it.


just a quick google "how to build your own 4g proxy" brings up this thread where the OP talks about scraping instagram by building his own

https://www.blackhatworld.com/seo/diy-how-to-create-your-own...


There is proxidize: https://proxidize.com But we're not using it. We've built our own infrastructure.


One is a bee while the other is a fish, so there are many differences. One lives in water while the other lives on land. One is tasty when filleted, the other vomits tasty goodness. I'm sure you can think of other differences. </kidding>


[flagged]


How would you hack your way to the front page of HN? I'm genuinely curious to know if it is possible; I think dang said it is quite hard to do.


[flagged]


so how many nicks do you have lying around https://news.ycombinator.com/user?id=nxmnxm99 and why spend all this time & effort?

you don't log in for 2 weeks and are suddenly focusing your comments on me after the last one under your other nick got flagged ;)


[deleted]


[deleted]


[flagged]


Please don't break the site guidelines like this! It's really important not to harass new users—we don't want HN to become a closed, stale community.

https://news.ycombinator.com/newsguidelines.html

Yes, some new accounts are abusive, but we have ways of dealing with that (you can always email hn@ycombinator.com to alert us). But it's very important to err on the side of welcoming people. There's a limit to how much damage new accounts can do anyhow.



