
Ask HN: How do I build a robust social media profile data scraper? - nerdynapster
* Any language would do, but Python is preferred.
* Any existing solutions, paid/free, are welcome.
======
AshArchangel
Hello, I do some webscraping at my job using python, but I've found that
scraping social media for non-specific data is often the same amount of work
as manual searching or using google search tools like "site:". With that said,
and similar to anigbrowl's comment, without a specific goal in mind you will
be hard pressed to solve your problem. Social media scraping varies heavily by
platform in terms of what information is available to scrape (without brute
force).
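
For reference, a minimal sketch of the "site:" approach mentioned above, using only the standard library. The helper name `site_query_url` and the example domain and search terms are made up for illustration; it only builds the search URL and does not fetch anything.

```python
from urllib.parse import urlencode


def site_query_url(site, terms):
    """Build a Google search URL restricted to one domain with the site: operator.

    `site` and `terms` are illustrative inputs; multi-word terms are quoted
    so they are matched as exact phrases.
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    query = f"site:{site} " + " ".join(quoted)
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical example: look for a name plus a keyword on one platform.
url = site_query_url("twitter.com", ["jane doe", "photographer"])
print(url)
```

Opening the printed URL in a browser is the same as typing `site:twitter.com "jane doe" photographer` into the search box.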

If you want some social media OSINT tools that are already built in python,
Black Arch has a list of open-source tools that you can access and use:

[https://blackarch.org/social.html](https://blackarch.org/social.html)

If you are trying to identify someone's social media, Sherlock or Spiderfoot
are commonly cited, but again, I don't think that these tools save that much
time compared to just using Google's search logic efficiently.

[https://github.com/sherlock-project/sherlock](https://github.com/sherlock-project/sherlock)

------
anigbrowl
You need to be more specific about what part you're having a problem with and
what your goal is: to build a scraper you can sell or give away, to accumulate
social media data for commercial purposes, or some research goal?

There's no generic solution, since every platform is different, and there's no
one scraping library (or approach) to rule them all. Most efforts I've seen
use BeautifulSoup to parse web pages and/or Selenium to automate browser
actions, but I'm sure there are better alternatives. It is a frustrating space
to work in as many/most tools are limited and the methods jealously guarded,
much as most social media companies jealously guard the data they harvest.
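
To make the "parse web pages" step concrete, here is a rough sketch of pulling a bio out of an already-downloaded profile page. To keep it dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup (with BeautifulSoup this would be a one-line class lookup). The HTML sample, the `bio` class name, and the `BioExtractor` helper are all invented for illustration; real platforms use different and frequently changing markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BioExtractor(HTMLParser):
    """Collect the text of the first element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside the target
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    @property
    def text(self):
        # Join the collected fragments and collapse runs of whitespace.
        return " ".join("".join(self.parts).split())


# Made-up profile markup standing in for a fetched page.
SAMPLE = """
<html><body>
  <div class="profile">
    <h1 class="name">Jane Doe</h1>
    <p class="bio user-bio">Photographer &amp; <b>coffee</b> nerd. Based in Oslo.</p>
    <p class="bio">second match, ignored</p>
  </div>
</body></html>
"""

parser = BioExtractor("bio")
parser.feed(SAMPLE)
print(parser.text)
```

This only works when the bio is present in the static HTML. Pages that render the profile with JavaScript are where browser automation such as Selenium comes in.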

You could probably learn a lot by leveraging existing tools and seeing what
you can do on the analysis side. Twitter has a fairly well-specified API and
if you are getting frustrated with the limits of that, there's twint. Facebook
is the biggest 'pile' of data but they know it and when you look at the source
of a FB page you can see there's a lot of stuff that messes up your ability to
parse that data, accidentally or deliberately. You might be better off providing
tooling for small but growing social media platforms that are not as big (and
so less valuable/profitable to scrape) but also don't have the accumulated
digital sediment that makes it difficult to do so.

------
TechBro8615
It’s not an easy problem, and doing it successfully will be expensive.

First, you need a reliable API, ignoring rate limiting / verification
concerns. To get that, you should reverse engineer the mobile apps and
replicate their API calls. For many apps, you can find an active GitHub
project doing this already. But note that it’s a moving target. As an
alternative, you may consider setting up a device farm and automating
interactions via something like Cycript.

Next, you need to circumvent anti-abuse measures. For most social networks,
this means you need to create fake profiles. This will likely entail phone
verification. You will also need multiple residential proxies to route traffic
through, but not too many per account.

For phone verification, VOIP numbers will not work. Check blackhat forums for
services offered in countries in SE Asia with SIM farms for real numbers. Note
that you may not have perpetual access to these numbers, so if you’re prompted
to re-verify a number for an account, you might have to just burn the account.
You may be able to appear “less suspicious” by setting up TOTP (which can be
automated) on your accounts and removing the phone number, if possible.

For IP addresses, you need non-datacenter IPs that other people are not using.
Your best bet is luminati.io, which is the business side of the Hola chrome
extension that routes your requests through users’ computers. You can get
“sticky” IPs but only insofar as a user continues to be online. The minimum
commitment is $500 per month, bandwidth is expensive and all requests are
tracked. You will need to pass a Skype interview to sign up.

tl;dr It’s possible, but doing it successfully requires significant investment
in infrastructure and time. You will need to partake in “gray market”
activities and deal with some shady operations. Depending on your
jurisdiction, you will likely violate at least one law.

------
nerdynapster
thank you fellas for your comments, they are definitely helpful. for now, i'm
finding a way to extract bios (biographies) from twitter, instagram and FB for
research purposes; later I plan to scale that up to include other social media
platforms (Youtube, LinkedIn, ...)

I am thinking of using APIs for it (if they exist) or building a crawler to
scrape the data.

is it something that can be done without violating any law?

------
verdverm
You could pay for LexisNexis if you want the real dirt on people

