

War on Urchin - ryannielsen
http://blog.pinboard.in/2011/06/war_on_urchin/

======
seldo
Claiming that the urchin URL parameters are "one of the malicious effects of
URL shortening" is confusing cause and effect. Urchin's software dates from at
least 2003, and was already massively distributed by Google in 2005, while URL
shortening did not catch on in any meaningful way until the advent of Twitter,
which was not founded until 2006.

The very long URLs created by Urchin and other web analytics were one of the
problems that URL shorteners were created to solve.

Whether short URLs have a "malicious" effect is a topic for a different
discussion, but whatever the effects of URL shortening, Urchin parameters are
not among them.

Urchin parameters certainly _do_ make it harder to detect duplicate URLs,
which would be Pinboard's primary problem with them. However, so do lots of
other URL parameters, such as sessions, landing page refs, etc. Urchin's are
just the most common ones.
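
To make the duplicate-detection point concrete, here is a minimal sketch of a comparison key that ignores `utm_*` parameters. `dedupKey` is a hypothetical helper name, not Pinboard's actual code, and it assumes a runtime with the standard `URL` class.

```javascript
// Sketch: build a comparison key for a bookmarked URL by dropping utm_* parameters.
// `dedupKey` is a hypothetical helper name, not Pinboard's actual code.
function dedupKey(rawUrl) {
  const url = new URL(rawUrl);
  // Collect the matching keys first, then delete, so nothing is mutated mid-iteration.
  const tracking = [...url.searchParams.keys()].filter((key) => key.startsWith("utm_"));
  for (const key of tracking) {
    url.searchParams.delete(key);
  }
  return url.toString();
}
```

Two URLs that differ only in their `utm_*` parameters get the same key. Session IDs and other parameters would need their own rules, and deleting parameters this way may re-encode the remaining ones.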

[Disclosure: I work for awe.sm, which provides social media analytics using,
amongst other methods, short URLs]

~~~
idlewords
I'm not arguing that URL shorteners came first, but rather that their
widespread use is what has enabled this kind of URL crapification to become so
pervasive.

[Disclosure: I hope your entire product category dies]

~~~
seldo
Wow. That's a little rude.

I'm arguing that URL shorteners have had nothing to do with the popularity of
Google Analytics, which is the source of urchin parameters. The dominance of
Google and the massive profit incentive around accurately tracking adwords and
adsense is what's driving that. Consumers don't care what their URLs look
like, so it doesn't matter whether URL shorteners obfuscate them.

I recognize that shortened URLs are troublesome for a bookmarking service, but
they are hardly insurmountable. I'm not sure I understand why you'd have so
much hate for us.

(And for the record, URL shortening is not our primary product)

~~~
SoftwareMaven
What he's saying is that, because URL shorteners hide the Urchin crap, it has
been allowed to become more pervasive: most people don't see the Urchin crap
until they have clicked through, by which point it's hidden far to the right
in the browser.

~~~
mattmcknight
Couldn't you say the same thing about an ordinary hyperlink on a page? I see
the text of the link and even less of the URL than you see in the address bar
in the 64 char preview at the bottom of the browser when I hover. It's only a
very small percentage of the population that copies the destination URL and
strips out unnecessary query parameters. The URL shorteners are just an
inferior version of regular anchor tags...

------
txxxxd
I sense some confusion here. Those parameters are also used by Google
Analytics which the author is using on that very page.

The utm parameters allow a site to track campaign information and replace much
more annoying techniques like setting up unique landing pages or redirects.
There's no relationship between these parameters, which have been around for
nearly 10 years, and URL shorteners.

~~~
dangrossman
This is because Google Analytics was based on Urchin Live (Google acquired
Urchin in 2005).

------
famfamfam
I have always been uneasy about the fact that Urchin/Google Analytics
'campaign tags', when copied and distributed via other services, do not
continue to represent the original campaign. For example, the following
campaign URL includes three additional pieces of information:

[http://example.com/?utm_source=June%2B2011%2BNewsletter&...](http://example.com/?utm_source=June%2B2011%2BNewsletter&utm_medium=email&utm_campaign=Free%2BSummer%2BTickets)

Source: June 2011 Newsletter

Medium: Email

Campaign Name: Free Summer Tickets
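
As a rough sketch of how those three values come out of the URL above, assuming a runtime with the standard `URL` class (`campaignTags` is a hypothetical helper name):

```javascript
// Sketch: read the three campaign tags from the example URL.
// `campaignTags` is a hypothetical helper name.
function campaignTags(rawUrl) {
  const params = new URL(rawUrl).searchParams;
  const tags = {};
  for (const name of ["utm_source", "utm_medium", "utm_campaign"]) {
    if (params.has(name)) {
      // The example writes spaces as %2B, which decodes to a literal "+",
      // so turn the remaining "+" characters into spaces for display.
      tags[name] = params.get(name).replace(/\+/g, " ");
    }
  }
  return tags;
}

const tags = campaignTags(
  "http://example.com/?utm_source=June%2B2011%2BNewsletter&utm_medium=email&utm_campaign=Free%2BSummer%2BTickets"
);
console.log(tags);
```

Nothing in these values says how the visitor actually arrived; they only repeat what was baked into the link when it was first sent.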

The moment you store that URL in another service, a number of those tags
become incorrect anyway (it is not an email anymore), and the stats you get
from it will be tainted. Campaign tags are useful, but this approach by
Pinboard may end up making tracking more accurate (certainly in terms of
tracking campaign media/terms/content), at the cost of removing the campaign
name.

There is also little difference between this and a URL such as
[http://example.com/Free-Summer-Tickets/June-2011-Newsletter?...](http://example.com/Free-Summer-Tickets/June-2011-Newsletter?via=email)
being set up to serve the original content, except that with the campaign tags
you at least have a single canonical URL using the more correct query
parameter mechanism.

~~~
joeyespo
Yes, I think this is exactly what bothers me about it too. The fact that I use
bookmarks and send links to others makes the information slightly skewed.

I also think, maybe I'm missing something. Maybe it's actually decent
information to have. For instance, if you send a link with source=newsletter
to somebody, it still is the newsletter that brought both of you to the site.
You may not have visited otherwise. And your friend, probably even less so.

I don't know. I still don't like seeing it. It really does defeat the purpose
of making your site have pretty links.

------
jannes
Note that there are many other implementations of this type of click tracking,
so it wouldn't be possible to strip all of these tracking strings. For example
ReadWriteWeb also does this for posts on HN. Just take a look at this recent
submission:

<http://news.ycombinator.com/item?id=2643515> The link contains this URL
fragment: #.TfMwNJgETxs;hackernews

I think it's bad design to not assume that people can and will always share
URLs with others. Putting tracking strings in URLs ignores this fact.

I often see Urchin URL parameters even on submissions on Hacker News. Most of
the time they say utm_source=feedburner, which means that they were taken from
an RSS reader. Just think about how easily such a submission reaching the top
of Hacker News would distort the statistics.

Doesn't this make the statistics almost worthless? Or at least harder to
interpret? I think so.

~~~
sbarre
I disagree, I think there is still value in knowing where the link originally
came from (the one that was posted to Hacker News), even if you get lots of
secondary traffic. Using referrer logs you can probably differentiate between
"real" views from RSS and the ones that were propagated by a re-posting of
that link.
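
A minimal sketch of that idea, assuming the log gives you the requested URL and the `Referer` value for each hit. `classifyHit`, its labels, and the list of feed-reader hosts are all hypothetical:

```javascript
// Sketch: decide whether a hit tagged as feed traffic plausibly came from a feed.
// `classifyHit`, the labels, and `feedReaderHosts` are hypothetical.
function classifyHit(requestUrl, referrer, feedReaderHosts) {
  const source = new URL(requestUrl).searchParams.get("utm_source");
  if (source === null) return "untagged";
  // Many clients send no Referer at all, so this case stays ambiguous.
  if (!referrer) return "tagged-no-referrer";
  const host = new URL(referrer).hostname;
  return feedReaderHosts.includes(host) ? "tagged-from-feed" : "tagged-reshared";
}
```

This is only a heuristic: a hit with a feed tag and a Hacker News referrer is almost certainly a re-post, but a hit with no referrer cannot be classified either way.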

------
benologist
Pretty sure SEO is the real reason long urls are such a problem, it's why URLs
are long and it's why they're all tagged to hell.

There was a time where your site would be

site.com/product.php?id=x or

site.com/product.asp?id=x,

now it's:

site.com/keyword-keyword/keyword/keyword/product-name?urchin or google
analytics crud

~~~
SoftwareMaven
There is a middle-ground, which is good and wholesome.
site.com/category/page-name.html gives good context for users (those few of us
who understand URLs, anyway ;) that site.com/index.php?id=x doesn't without
bloating things for SEO purposes.

~~~
andybak
Except of course - you wouldn't really need .html (.php or .anything) on a
url...

~~~
sbierwagen
Seconded.

Example: <http://www.w3.org/Provider/Style/URI>

------
there
if a user initially sees your url in your rss feed, and then posts it to
reddit or hn or bookmarks it on delicious, doesn't it screw up all of your
statistics for visitors-from-rss-feeds if it has those url tokens? i mean
everyone would be visiting the page telling your statistics engine that they
came from an rss feed when they didn't.

it seems like a better way of doing it would be to capture those tokens at an
initial url, but then redirect the user to the proper, clean url without them.
that way you get accurate stats for visitors-from-rss-feeds, but everyone else
that clicks through as that clean url is passed along appears as a different
source.
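
a rough sketch of that capture-then-redirect idea, written as a pure function so it does not assume any particular server. `planRedirect` is a hypothetical name; the caller is assumed to log `campaign` and answer with a redirect to `location` whenever it is not null.

```javascript
// Sketch: record campaign tokens once, then send the visitor to the clean URL.
// `planRedirect` is a hypothetical helper; the caller is assumed to log `campaign`
// and respond with a redirect to `location` when it is not null.
function planRedirect(rawUrl) {
  const url = new URL(rawUrl);
  const campaign = {};
  // Spread into an array first so deleting does not disturb the iteration.
  for (const [key, value] of [...url.searchParams]) {
    if (key.startsWith("utm_")) {
      campaign[key] = value;
      url.searchParams.delete(key);
    }
  }
  const tagged = Object.keys(campaign).length > 0;
  return { campaign, location: tagged ? url.toString() : null };
}
```

a request for the tagged url would be counted once and redirected; anyone who later copies the address bar passes along only the clean url.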

------
initself
Maybe certain Pinboard users _want_ to store those parameters along with their
URL for one reason or another? Seems a little extreme for one guy to make some
arbitrary decision about the content the users themselves are choosing to
store.

~~~
sbierwagen
Why would _anyone_ want to save tracking code?

~~~
willwagner
If you include affiliate codes as part of tracking codes, I can definitely see
some users wanting to keep that data preserved. For instance, someone may want
to reward some shareware author with their amazon purchases. Social
organizations like churches, clubs, etc. might encourage their members to tack
on an affiliate code to keep them funded. It's hard to see the use case for
keeping the utm_* tags, but not all tracking codes are meaningless to an end
user.

------
a5seo
detecting duplicate urls? seomoz's api will do that for you...

url cruft is much more an aesthetic thing. see
[http://www.mattcutts.com/blog/clean-up-extra-url-parameters-...](http://www.mattcutts.com/blog/clean-up-extra-url-parameters-when-searching-google/)

~~~
nikcub
there is an RFC section dedicated to URL normalization, this doesn't need an
API:

<http://tools.ietf.org/html/rfc3986#section-6>

I am currently working on making a urllib for Python which is RFC 3986 compliant.
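
For a feel of what section 6 asks for, here is a minimal JavaScript sketch of a few of its syntax-based steps. It is not the Python library mentioned above, and `normalizeUrl` is a hypothetical name. It leans on the WHATWG `URL` parser, which is not an RFC 3986 implementation but happens to perform several of the same normalizations.

```javascript
// Sketch of a few RFC 3986 section 6.2.2 and 6.2.3 normalizations.
// `normalizeUrl` is a hypothetical name; the WHATWG URL parser used here is not an
// RFC 3986 implementation, it just performs several of the same steps.
function normalizeUrl(rawUrl) {
  // The parser lowercases scheme and host, drops a default port,
  // and removes "." and ".." path segments.
  const href = new URL(rawUrl).href;
  return href.replace(/%[0-9a-fA-F]{2}/g, (escape) => {
    const char = String.fromCharCode(parseInt(escape.slice(1), 16));
    // Decode escapes of unreserved characters; uppercase the hex digits of the rest.
    return /[A-Za-z0-9\-._~]/.test(char) ? char : escape.toUpperCase();
  });
}
```

Stripping tracking parameters is not part of the RFC; this only covers case, percent-encoding, dot segments, and default ports.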

~~~
yahelc
Anywhere I can follow progress on that? I'd love a Python-based URL
normalization tool for <http://SharedCount.com>

~~~
nikcub
email me and ill ping you once it is up (email in profile)

------
yahelc
You could use the HTML5 History API to remove these query strings once Google
Analytics is done using them. (This example assumes you'd never want query
strings, but could be refined to just remove utm_* query strings.)

    
    
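    // Only run where the History API exists and the URL carries utm_ parameters.
    if (window.history && history.replaceState && location.search.match(/utm_/)) {
        // Poll until Google Analytics has set its __utmz campaign cookie.
        var check = setInterval(function () {
            if (document.cookie.indexOf("__utmz=") !== -1) {
                // assuming you want no query string
                history.replaceState({}, "", location.pathname);
                clearInterval(check);
            }
        }, 500);
    }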
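
One possible shape for the refinement mentioned above, which drops only the utm_* parameters and keeps everything else. This is a sketch that assumes a browser with `URLSearchParams`; `searchWithoutUtm` is a hypothetical helper name.

```javascript
// Sketch: compute the query string that remains once utm_* parameters are dropped.
// `searchWithoutUtm` is a hypothetical helper name.
function searchWithoutUtm(search) {
  const params = new URLSearchParams(search);
  // Copy the keys first so deleting does not disturb the iteration.
  for (const key of [...params.keys()]) {
    if (key.startsWith("utm_")) params.delete(key);
  }
  const rest = params.toString();
  return rest ? "?" + rest : "";
}

// In the browser, the replaceState call could then become:
// history.replaceState({}, "", location.pathname + searchWithoutUtm(location.search) + location.hash);
```

Appending `location.hash` also keeps any fragment on the page URL.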

------
sleepyhead
Canonical URLs would solve this.

------
mmaunder
"They create needless URL bloat" - You just said you don't like short URLs.
You can't have it both ways.

"erode user privacy" - The data gathered is aggregated and doesn't identify
individuals.

"make it more difficult to identify duplicate content" - You're already
stripping them because you don't like them, so just ignore them when de-
duping. Again, you can't have it both ways.

"and benefit ad publishers at the expense of everyone else." - You're out to
screw the guys who pay for all those wonderful free services you use, like
GMail for example:

    ;; ANSWER SECTION:
    pinboard.in. 3600 IN MX 1 s3.pinboard.in.
    pinboard.in. 3600 IN MX 2 s5.pinboard.in.
    pinboard.in. 3600 IN MX 4 ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 5 ALT1.ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 5 ALT2.ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 10 ASPMX2.GOOGLEMAIL.COM.
    pinboard.in. 3600 IN MX 10 ASPMX3.GOOGLEMAIL.COM.

~~~
tptacek
The data gathered absolutely does pose a threat to user privacy.

~~~
staunch
How? All those vars are tracking is campaign/traffic source, etc.

~~~
jrockway
The more information you have along with an HTTP request, the easier it is to
identify the person that made that HTTP request.

User-agent / IP / list of fonts that Flash can use pretty much identifies
individuals uniquely. Adding "I got to this page by clicking a link on site X"
to the URL adds one more piece of data that makes it even easier for the site
to guess that you are you.
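
One way to put numbers on this is to count bits of identifying information: a value shared by a fraction `p` of visitors carries `-log2(p)` bits. The sketch below uses made-up fractions purely for illustration (they are not measurements) and assumes the attributes are independent, which real ones generally are not.

```javascript
// Sketch: surprisal, in bits, of an attribute value shared by a fraction `p` of visitors.
function bits(p) {
  return -Math.log2(p);
}

// Made-up fractions for illustration only; these are not measurements.
const hypothetical = { userAgent: 1 / 1024, fonts: 1 / 256, utmSource: 1 / 8 };

// If the attributes were independent, their bits would simply add up.
const totalBits = Object.values(hypothetical).reduce((sum, p) => sum + bits(p), 0);
console.log(totalBits);
```

Under those assumptions the campaign source adds a few bits on top of what the other attributes already provide.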

~~~
staunch
Why would a site need to rely on utm_* query params to identify you when they
probably already have your IP/user agent/cookie session?

Can you think of a practical way in which this would actually make a
difference?

