
Wanna have zillions of google profiles? - hazelnut
http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fgstatic.com%2Fs2%2Fsitemaps%2F
======
avar
Here's a Torrent with a tar.xz that contains all of them:
<http://v.nix.is/~leech/google-profiles.tar.xz.torrent>

That's files sitemap-000 through sitemap-3278. They contain a total of
16,256,271 profile URLs.

------
blocke
I must be missing something...

When I filled out my Google Profile I was told by Google that the information
would be added to their search and would make me more easily discoverable.

Are people going to act all shocked that this means it's indexable? _gasp_

------
acangiano

        curl -O http://www.gstatic.com/s2/sitemaps/sitemap-[000-3278].txt

~~~
mahmud
damn, i have always been saving the files explicitly:

        curl http://foo.com/[1-1024] -o "#1.txt"
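
For anyone not using curl, the bracketed range in the two commands above can be approximated in a few lines of Python. This is only a sketch: the function name `sitemap_urls` is made up for illustration, and it assumes the file names are zero-padded to three digits and simply grow to four digits past 999, which matches the sitemap-000.txt and sitemap-3278.txt URLs quoted in this thread.

```python
def sitemap_urls(first=0, last=3278, width=3):
    """Yield sitemap URLs, mimicking the range sitemap-[000-3278].txt."""
    base = "http://www.gstatic.com/s2/sitemaps/sitemap-{}.txt"
    for n in range(first, last + 1):
        # Pad to at least `width` digits: 0 -> "000", 42 -> "042", 3278 -> "3278".
        yield base.format(str(n).zfill(width))


urls = list(sitemap_urls())
print(len(urls))  # 3279
print(urls[0])
print(urls[-1])
```

The list only holds URLs; fetching them is left to whatever downloader you prefer.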

------
ptn
OK, I'll be _that_ guy: what are we looking at here?

~~~
pixelbath
Alright, I'll be the other guy: what would this be useful for? Google profiles
don't allow links, so SEO appears to be out.

Just to do it?

~~~
ElbertF
Google profiles don't allow links?

<http://www.google.com/profiles/112195136944637672112>

~~~
gscott
And none of the links have a nofollow tag.

------
fdb
It's interesting that the profiles list the public followers and following
stats. With some deep crawling, you could build a huge social graph from this
data.

~~~
joshfraser
you need to be signed in to google to see the followers/following data. it
shouldn't be hard to pass a valid cookie along, but it might get a little
suspicious for one user to make ~60m requests.

------
jpablo
This morning I started getting spam that used my real name on my gmail
account. I wonder if this is connected.

------
pierrefar
The odd thing is that getting a list of all profiles the proper way doesn't
work. If you go to Google's robots.txt and scroll to the bottom, you'll find
that the profiles sitemap is located at:

<http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml>

Which 404s.

Luckily, it's quite easy to reverse engineer. Start with:

<http://www.gstatic.com/s2/sitemaps/sitemap-000.txt>

and increment 000 till you 404. From quick testing, it's less than 3500 but
more than 3200.
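
Given a known-good lower bound and a known-404 upper bound like the 3200 and 3500 above, the exact last file can be found with a binary search instead of incrementing one by one. The sketch below is hedged: `url_exists` and `last_existing` are illustrative names, the HEAD-request check is one possible way to test for a 404, and the search assumes the files are contiguous with no gaps. The demonstration at the bottom uses a stand-in predicate rather than real requests.

```python
import urllib.error
import urllib.request

BASE = "http://www.gstatic.com/s2/sitemaps/sitemap-{}.txt"


def url_exists(n):
    """Return True if sitemap number n answers a HEAD request without an HTTP error."""
    req = urllib.request.Request(BASE.format(str(n).zfill(3)), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10):
            return True
    except urllib.error.HTTPError:
        return False


def last_existing(exists, lo, hi):
    """Largest n in [lo, hi) for which exists(n) is True.

    Assumes exists(lo) is True, exists(hi) is False, and that the
    predicate flips exactly once in between (no missing files).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Offline demonstration: a fake predicate stands in for url_exists.
print(last_existing(lambda n: n <= 3278, 3200, 3500))  # 3278
```

With bounds 300 apart this takes roughly nine probes. To run it against the live site you would pass `url_exists` as the first argument.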

~~~
blueberry
As of this comment it's exactly 3278. I wonder why you didn't bother doing 3-4
more trials before writing the comment :)
<http://www.gstatic.com/s2/sitemaps/sitemap-3278.txt>

~~~
pierrefar
Because I spent too much time trying to figure out why the index sitemap XML
file was returning a 404. I'm really surprised that Google, one of the
inventors of sitemaps and big proponent of using them, can have such a high-
profile error. I dug around to see if I could find what's going on.

By the time I got to figuring out how many text files there are, I was getting
bored.

------
Mark_B
I wonder - what could be some non-evil things to do with easy access to all of
this information?

~~~
niels_olson
openid?

------
deltaqueue
sitemap-3278 is the upper limit, so there appear to be roughly 16.4 million
profiles available. I would be interested to see how many of these have public
information (name, address, etc.).
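
A quick sanity check on the numbers, using only figures quoted in this thread (sitemap-000 through sitemap-3278, and the 16,256,271 total from the torrent comment). The 5,000-per-file figure in the last line is purely an assumption for illustration; nobody here states a per-file cap, but it is one guess that would land near 16.4 million.

```python
files = 3278 - 0 + 1      # sitemap-000 through sitemap-3278, inclusive
total_urls = 16_256_271   # exact total quoted earlier in the thread

average = total_urls / files
print(files)              # 3279
print(round(average, 1))  # average URLs per file

# Hypothetical ceiling if every file held 5,000 URLs (assumed, not sourced).
ceiling = files * 5_000
print(ceiling)            # 16395000
```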

------
DanielRibeiro
Twitter provides pretty much the same info through its API. Granted, it
doesn't disclose email.

------
dacort
The thing that really bugs me about Google profiles is that they're directly
connected to email addresses. I guess Google thinks their ability to deal with
spam is good enough that making millions of their users' email addresses
publicly available isn't a concern.

~~~
frognibble
The email address is not displayed on the profile page.

A user can customize his or her profile URL with an email address. The
settings page clearly states that the email address is publicly discoverable
if the email address is used in the profile URL.

~~~
dacort
Shouldn't I be allowed to customize my profile URL with a string other than my
email address? Why force me to expose my email if I want something other than
a 21-character numeric id?
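
To get a feel for how many profiles use the numeric form versus a customized URL, something like the following could be run over the URLs in the sitemap files. It is a sketch only: `profile_kind` is a made-up helper, the 21-digit rule comes from the "21-character numeric id" mentioned above, and `example.user` is a hypothetical custom name, not a real profile.

```python
import re
from urllib.parse import urlparse

NUMERIC_ID = re.compile(r"\d{21}")


def profile_kind(url):
    """Return 'numeric' if the last path segment is a 21-digit ID, else 'custom'."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return "numeric" if NUMERIC_ID.fullmatch(last_segment) else "custom"


print(profile_kind("http://www.google.com/profiles/112195136944637672112"))  # numeric
print(profile_kind("http://www.google.com/profiles/example.user"))           # custom
```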

~~~
frognibble
I suspect that Google used email addresses because they already have tools for
managing that namespace.

------
DotSauce
Would be useful to know which profiles have Buzz enabled and further be able
to search those profiles for specific keyword interests.

Point being to follow like-minded people on Buzz to gain exposure to your own
profile.

------
joshfraser
i get the feeling my machines have a busy day ahead of them.

------
underdown
I may be a little daft but how is this a big deal other than these files
showing up in G's index?

~~~
blueberry
I don't think it's a big deal but I guess the author wanted to point out an
easy way to crawl this public info. There are similar ways to crawl public
Facebook profiles too, so it's really not a big deal.

~~~
retube
Re facebook: really? I'd be interested in a hint here....

~~~
ulrich
[http://www.commandlinefu.com/commands/view/4726/view-
faceboo...](http://www.commandlinefu.com/commands/view/4726/view-facebook-
friend-list-hidden-or-not-hidden)

~~~
thamer
Have you tried it? This was posted in January, but doesn't seem to work
anymore.

    
    
        curl -A Opera 'http://www.facebook.com/ajax/typeahead_friends.php?u=4&__a=1' 
        for (;;);{"error":0,"errorSummary":"","errorDescription":"","errorIsWarning":false,"silentError":0,"payload":{"friends":[]}}
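
Note that the response above is JSON with a `for (;;);` guard in front of it, so it has to be stripped before decoding. A minimal sketch, assuming the guard is always exactly that prefix (the helper name `parse_guarded_json` is made up for illustration); the sample string is the output pasted above.

```python
import json

PREFIX = "for (;;);"


def parse_guarded_json(body):
    """Strip a leading 'for (;;);' guard, if present, then decode the JSON."""
    text = body.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return json.loads(text)


response = (
    'for (;;);{"error":0,"errorSummary":"","errorDescription":"",'
    '"errorIsWarning":false,"silentError":0,"payload":{"friends":[]}}'
)
data = parse_guarded_json(response)
print(data["error"], data["payload"]["friends"])  # 0 []
```

Decoded this way, the payload shows an empty friends list with no error flagged.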

------
ajaimk
About 817 results (0.22 seconds) - Not the Zillions I was expecting

~~~
jonah
Each of the ~800 sitemap-???.txt files contains ~20k profile entries.

------
andrewbadera
Those profile lists aren't complete. For more discussion, see the Google Buzz
list where, until the Buzz team released their own firehose, any of us who
wanted a firehose had to build our own. Also noteworthy, it's easier to just
download and parse those XML files -- why use a search engine for that to
begin with when it's already aggregated for you?

------
NewSoftzzz
Or easier..
[http://www.google.com/#q=site%3Ahttp%3A%2F%2Fgoogle.com%2Fpr...](http://www.google.com/#q=site%3Ahttp%3A%2F%2Fgoogle.com%2Fprofiles)

~~~
cj
No, harder. Your link would require you to crawl thousands of Google results
pages, while the OP's link gives you sitemap URLs of many profiles at once.

