

Google Bot Attempts to Crawl Shortest URLs First - foxhop
http://russell.ballestrini.net/google-bot-attempts-to-crawl-shortest-urls-first/

======
AshleysBrain
It's probably just that a short URL is a heuristic for an important page.
www.site.com/section is probably more important than
www.site.com/section/subsection/detail/page/5/comments. A good move for the
crawler - don't get distracted by "deep" pages - try and stick to high level
ones first.

Edit: this would also encourage webmasters to use short URLs, which benefits
users by being easier to remember, too.

~~~
cma
I don't agree with the last point; seems like it would encourage things like
foo.com/1hu83FG2 or lead to excessive abbreviation.

~~~
DrJokepu
If I remember correctly, your page ranks better in Google if the search terms
are in the URL (even more so if they are in the hostname).

------
JonnieCache
There is probably some highly non-obvious reason that sorting your queue of
URLs by length is optimal, which was arrived at after a lot of modelling and
testing.

We're unlikely to ever know the answer unless someone from Google explains it
to us.

~~~
JonnieCache
Thinking about it more, it's probably just a breadth-first search. Duh.

~~~
frisco
Breadth-first search means crawling all of the links on the page and adding
all of the links on the child pages to the queue at once rather than drilling
down on one link-path first before moving to the other links on the first
page. There's probably some highly non-obvious reason for crawling by url
length.

~~~
xyzzyz
The thing is, the deeper you are, the longer the URLs are, so if you do a
breadth-first search, you are more likely to visit short URLs first.
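
To make the two positions above concrete, here is a minimal sketch of a plain FIFO (breadth-first) crawl over a tiny made-up link graph. The `SITE` dictionary and its URLs are hypothetical and only mimic a /state/city/id/schoolname layout; nothing here is known about how Googlebot actually works.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs it links to.
SITE = {
    "/": ["/ct", "/ny"],
    "/ct": ["/ct/hartford", "/ct/new-haven"],
    "/ny": ["/ny/albany"],
    "/ct/hartford": ["/ct/hartford/101/example-school"],
    "/ct/new-haven": [],
    "/ny/albany": [],
    "/ct/hartford/101/example-school": [],
}

def bfs_order(graph, start):
    """Return URLs in the order a FIFO (breadth-first) crawl visits them."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

order = bfs_order(SITE, "/")
# Path depth of each visited URL, counted as the number of "/" separators.
depths = [url.rstrip("/").count("/") for url in order]
lengths = [len(url) for url in order]
print(order)
print(depths)
print(lengths)
```

In this toy graph the depths come out non-decreasing, but the lengths do not: /ny/albany is visited after the longer /ct/new-haven. So breadth-first crawling would explain a rough short-to-long trend, while a strict ordering by length within one level would need some other explanation.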

------
esryl
The site adheres to a strict URL structure: /state/city/id/schoolname.
Entering from the homepage, the only way to crawl the site one level at a time
is to crawl the shortest URLs first. This structure is also emphasised in the
breadcrumbs on every page, and the shortest URLs are also the ones with the
most internal links.

Why would you crawl the site in any other way?

~~~
foxhop
If you look at a particular city page you will notice that the cities are in
alphabetical order, however google bot still crawls by length of url...

------
personalcompute
I did a more scientific analysis of the googlebot requests in the provided log
(graph! <http://i.imgur.com/uMoUT.png>) and it definitely looks like it is
taking shortest urls first. Anyone else with a large site want to check as
well for further data?
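
For anyone who wants to run the same check on their own logs, here is a rough sketch. It assumes the access log is in the common "combined" format and that Googlebot can be recognised by the string "Googlebot" in the user-agent field; the sample lines, the function names and the scoring metric are all made up for illustration and are not the method used for the graph above.

```python
import re
from itertools import combinations

# Pulls the request path and user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Return requested paths, in log order, for lines whose agent mentions Googlebot."""
    paths = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            paths.append(match.group("path"))
    return paths

def ordered_fraction(paths):
    """Fraction of request pairs where the earlier URL is no longer than the later one."""
    pairs = list(combinations(paths, 2))
    if not pairs:
        return None
    return sum(len(a) <= len(b) for a, b in pairs) / len(pairs)

# Hypothetical sample lines (192.0.2.x is a documentation-only address range).
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'192.0.2.1 - - [01/Jan/2011:00:00:01 +0000] "GET /ct HTTP/1.1" 200 512 "-" "{BOT}"',
    '192.0.2.9 - - [01/Jan/2011:00:00:02 +0000] "GET /some/long/path HTTP/1.1" 200 99 "-" "Mozilla/5.0"',
    f'192.0.2.1 - - [01/Jan/2011:00:00:03 +0000] "GET /ct/hartford HTTP/1.1" 200 512 "-" "{BOT}"',
    f'192.0.2.1 - - [01/Jan/2011:00:00:04 +0000] "GET /ct/hartford/101/example-school HTTP/1.1" 200 512 "-" "{BOT}"',
]

print(ordered_fraction(googlebot_paths(sample)))
```

A value near 1.0 would mean the bot's requests were almost perfectly ordered from short to long, and a value near 0.5 would mean no visible relationship. The pairwise comparison is quadratic, so a large log would need sampling or a proper rank correlation instead.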

~~~
foxhop
Thanks for the graph, I've added it to the blog page

------
meow
That's probably because short urls usually tend to be static pages while long
ones tend to be dynamically generated content.

~~~
personalcompute
The entire site he observed this behavior on is static.

~~~
jules
But Google doesn't know that.

------
christianwilde
I imagine that in this specific case it is because longer URLs represent deeper
pages on the site that are less "important" (in terms of internal incoming
links and PageRank) than the shorter ones. It doesn't seem logical that Google
would order the URLs by length and then crawl them in that order; URL length is
probably one factor that the bot takes into account, but not the only one in
the way this article suggests :)

Anyway, good point; it deserves more testing before drawing any conclusions.

------
orijing
The question that I have is whether this is a relative behavior (i.e. whether,
for a given domain 'domain.com' Google prioritizes domain.com/short-url over
domain.com/longer/url.html) or a global one (i.e. prioritizing short.com/url
over very-long-domain.com/nested/pages/hierarchy.html, all else equal).

I can definitely see the local/relative effects being a natural consequence of
prioritizing by pagerank, but the global part sounds more like a separate
signal.

Does anyone have insights?

------
arn
I noticed this behavior also when I was following Googlebot's crawl of my old
pages after I had done a redesign.

It's not because of the sitemap, the URL structure, or dynamic content.

Mine were blog articles in the same format. This is how it was crawled:

sitename.com/year/mo/day/stub

sitename.com/year/mo/day/stub-one

sitename.com/year/mo/day/stub-one-two

sitename.com/year/mo/day/stub-one-two-three

sitename.com/year/mo/day/stub-one-two-three-four

------
_grrr
A poor man's PageRank algorithm, assuming nothing else, would assign a higher
PageRank to shorter site links on a page. Presumably the crawler visits pages
with a higher PR first.


------
TuxPirate
PageRank determines crawl rate, amongst other things
([http://techpatio.com/2009/search-engines/google/matt-cutts-g...](http://techpatio.com/2009/search-engines/google/matt-cutts-google-crawls-pagerank-video-bloggers)).

I find it is also likely that short URLs (especially in the case of a
directory-type site) are seen first by the spider and that this order is
respected by the crawler (FIFO).

You can also ask in #seo on irc.freenode.net. I know there are knowledgeable
SEO people in there who might be able to provide you with a decent answer.

------
abrudtkuhl
It starts at the top of your sitemap, which likely lists the shorter URLs first.

~~~
bauchidgw
Can't confirm this. Sitemaps are used for discovery: the URLs listed in the
sitemap get pushed into the 'discovered URLs queue', then this queue is
prioritized for crawling, and, if there are no other factors, the shorter
URLs get prioritized higher (as there is a bigger chance that a shorter URL is
the canonical version of a longer URL than the other way round).
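
As an illustration of the idea described above, and not of Google's real pipeline, a discovered-URLs queue that falls back to URL length when nothing else is known could look roughly like this. The `DiscoveredQueue` class and the example URLs are hypothetical.

```python
import heapq
from itertools import count

class DiscoveredQueue:
    """Toy crawl queue: shortest URL first, ties broken by discovery order."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = count()  # increasing tie-breaker keeps ordering stable

    def push(self, url):
        if url in self._seen:
            return  # already discovered, do not queue twice
        self._seen.add(url)
        heapq.heappush(self._heap, (len(url), next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

# URLs arrive in arbitrary (for example, sitemap) order.
queue = DiscoveredQueue()
for url in ["/articles/index.php", "/about", "/articles/", "/", "/about"]:
    queue.push(url)

crawl_order = [queue.pop() for _ in range(len(queue))]
print(crawl_order)
```

With a queue like this, /articles/ is fetched before /articles/index.php no matter which one was discovered first, which is the canonical-before-duplicate effect suggested above. A real crawler would presumably mix in other signals, so length would only decide the order when everything else is equal.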

------
ignifero
Maybe because that's how they are sorted in the hashtable/database they use to
queue urls? Or maybe because they want to index the shortest pages first, so
that they are processed before any duplicates with longer urls (i.e. get
/articles/ before /articles/index.php)

~~~
georgemcbay
I suspect you're on the right track with your first guess.

Most people posting here are looking for some sort of deep meaning in this
when IMO it is more likely just due to a localized side-effect of doing
something such as storing the urls in a trie-like structure and then iterating
over it breadth-first.
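
A quick sketch of that side effect, assuming a plain character-level trie stored as nested dictionaries. The helper names and sample URLs are made up, and whether Google stores URLs anything like this is pure speculation.

```python
from collections import deque

END = ""  # sentinel key marking "a stored URL ends here"; never a real character

def build_trie(urls):
    """Build a character-level trie as nested dicts."""
    root = {}
    for url in urls:
        node = root
        for ch in url:
            node = node.setdefault(ch, {})
        node[END] = url
    return root

def breadth_first(trie):
    """Yield stored URLs level by level; depth in the trie equals URL length."""
    queue = deque([trie])
    while queue:
        node = queue.popleft()
        if END in node:
            yield node[END]
        for key, child in node.items():
            if key != END:
                queue.append(child)

urls = ["/articles/index.php", "/articles/", "/a", "/about"]
walk = list(breadth_first(build_trie(urls)))
print(walk)
```

Because every URL sits at a depth equal to its character count, a breadth-first walk hands them back shortest first without any explicit sort, regardless of the order in which they were inserted.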

