

LinkedIn sitemap.xml - wslh
http://www.linkedin.com/sitemap.xml

======
JackWebbHeller
It would be helpful if an editor could add some kind of warning about this to
the title.

My netbook is still trying to open this over dialup, and xPUD has ground to a
halt.

~~~
gaving
I agree, it just beach-balled Safari here.

~~~
navs
I've noticed it doesn't take much to beach-ball Safari, at least for me. OS X
10.8, 2008 MBP, Safari 6.x

Canary kept on chugging at the expense of the whole OS hanging.

------
makmanalp
I've actually been working on my own site's sitemap, and I've figured out a
bunch of things:

You can use a sitemap index [0] to split a sitemap into multiple smaller ones.
This is beneficial because you can change one part without making Google re-
crawl everything, since the index is updated right after a whole sitemap has
been crawled. As a result, changes you make to a smaller sitemap get indexed
much faster.
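For reference, a sitemap index is just a small XML file pointing at the individual sitemaps. A minimal sketch (the file names and dates here are made up):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <sitemap> entry per sub-sitemap; crawlers can re-fetch a
       sub-sitemap when its <lastmod> changes, leaving the rest alone. -->
  <sitemap>
    <loc>http://example.com/sitemap-important.xml</loc>
    <lastmod>2012-03-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap-pages-a.xml</loc>
    <lastmod>2012-02-15</lastmod>
  </sitemap>
</sitemapindex>
```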

I've switched to grouping my bulk pages into sitemaps alphabetically, and then
putting important ones (front page, about) into different sitemaps, and
specific landing pages into yet another different one.

[0] <http://en.wikipedia.org/wiki/Sitemap_index>

------
sleepyhead
They might want to consider using sitemap index files:
<http://www.sitemaps.org/protocol.html#index>

Edit: "If you want to list more than 50,000 URLs, you must create multiple
Sitemap files." According to 'icebraining' they have 47,785 URLs, so they
still fit in a single file.

~~~
tszming
They already have it: <http://www.linkedin.com/robots.txt>

~~~
sleepyhead
Ah, ok, strange that they have /sitemap.xml in addition to that, then. I would
assume any robot that understands sitemaps also understands that they can be
specified in robots.txt.
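For reference, the robots.txt mechanism is just a `Sitemap:` directive pointing at a sitemap or sitemap index, e.g. (using the index URL mentioned elsewhere in this thread):

```
Sitemap: http://partner.linkedin.com/sitemaps/smindex.xml.gz
```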

------
WatchDog
Here is one for Google Plus:
<https://ssl.gstatic.com/s2/sitemaps/profiles-sitemap.xml>

~~~
mokash
Wow, this is a sitemap of .gz files containing the links to the actual
profiles. I did a wget to download all the .gz files. It's still going 10
minutes later.
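For what it's worth, a rough sketch of that bulk fetch (assuming the index is an uncompressed sitemap whose `<loc>` tags hold the .gz URLs, and that GNU grep/sed and wget are available):

```shell
#!/bin/sh
# Sketch: pull the profiles sitemap index, extract the URLs from the
# <loc> tags, then fetch each gzipped sub-sitemap one by one.
curl -s https://ssl.gstatic.com/s2/sitemaps/profiles-sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' \
  | xargs -n 1 wget -q
```

The grep/sed pair is a crude XML "parser"; it works here only because sitemap files put each URL in its own `<loc>` element.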

~~~
sahaskatta
Do you have a script to fetch them all?

~~~
mokash
Here, I wrote about it on my blog: [http://mgakashim.com/123/downloading-and-processing-a-list-o...](http://mgakashim.com/123/downloading-and-processing-a-list-of-all-google-plus-profiles)

------
mooism2
What about it?

~~~
wslh
An easier scraping?

~~~
pjbrunet
It's against their TOS.

~~~
adanto6840
The clickwrap-based TOS, you mean? ;-)

It's frustrating, at least for me, that the legality here is still so gray.

Much of their content is likely in the public domain (facts / basic non-
creative information), although there is definitely plenty that is not; the
lack of 'black-and-white' rules is what frustrates me...

Clickwrap - <http://en.wikipedia.org/wiki/Clickwrap>

------
binarysolo
Is this actually for all 150M users? I guess they really want crawlers to
reach the short URLs more easily...

~~~
icebraining
Not even close to 150M:

    
    
      $ curl http://www.linkedin.com/sitemap.xml | grep -c '<url>'
      47785
    

I wonder why these were selected. Probably the most searched.

~~~
arcatek

      user@/tmp > grep lastmod sitemap.xml  | cut -d - -f 1 | uniq -c
      47785     <lastmod>2006
    

They were probably sitemapping all their users when they started the service,
then decided to stop shortly after.

------
mokash
Managed to get a list of just the URLs.

<https://dl.dropbox.com/u/8433360/linkedinusersurls.txt>

------
message
Found a link to the Sitemap in robots.txt:
<http://partner.linkedin.com/sitemaps/smindex.xml.gz>

------
krob
This document just locked up my browser, an i7 w/ 8GB RAM, using Chrome.

------
scottmcleod
Interesting, thanks for the share.

------
pjbrunet
I'm not In the sitemap :(

In case you were looking for me:

<http://linkedin.com/in/pjbrunet> :)

