War on Urchin (blog.pinboard.in)
211 points by ryannielsen on June 11, 2011 | 64 comments

Claiming that the urchin URL parameters are "one of the malicious effects of URL shortening" is confusing cause and effect. Urchin's software dates from at least 2003, and was already massively distributed by Google in 2005, while URL shortening did not catch on in any meaningful way until the advent of Twitter, which was not founded until 2006.

The very long URLs created by Urchin and other web analytics tools were one of the problems that URL shorteners were created to solve.

Whether short URLs have a "malicious" effect is a topic for a different discussion, but whatever the effects of URL shortening, Urchin parameters are not among them.

Urchin parameters certainly do make it harder to detect duplicate URLs, which would be pinboard's primary problem with them. However, so do lots of other URL parameters like sessions, landing page refs, etc. Urchin's are just the most common ones.

[Disclosure: I work for awe.sm, which provides social media analytics using, amongst other methods, short URLs]

I'm not arguing that URL shorteners came first, but rather that their widespread use is what has enabled this kind of URL crapification to become so pervasive.

[Disclosure: I hope your entire product category dies]

Wow. That's a little rude.

I'm arguing that URL shorteners have had nothing to do with the popularity of Google Analytics, which is the source of urchin parameters. The dominance of Google and the massive profit incentive around accurately tracking adwords and adsense is what's driving that. Consumers don't care what their URLs look like, so it doesn't matter whether URL shorteners obfuscate them.

I recognize that shortened URLs are troublesome for a bookmarking service, but they are hardly insurmountable. I'm not sure I understand why you'd have so much hate for us.

(And for the record, URL shortening is not our primary product)

What he's saying is that URL shorteners hide the Urchin crap, which has allowed it to become more pervasive: most people don't see it until after they click, when it's hidden far to the right in the browser's address bar.

Couldn't you say the same thing about an ordinary hyperlink on a page? I see the text of the link and even less of the URL than you see in the address bar in the 64 char preview at the bottom of the browser when I hover. It's only a very small percentage of the population that copies the destination URL and strips out unnecessary query parameters. URL shorteners are just an inferior version of regular anchor tags...

Shortened URLs are also troublesome for anyone who likes clicking on links, due to the propensity for URL shorteners to run out of money and die, breaking tens of thousands of links.

Content creators need analytics and that is not going to die.

I first saw URL shorteners in a paper magazine: articles about the Internet gave their links in that format (tinyurl.com), which is much easier to type on a computer keyboard.

Yes, we just want free content, no strings attached. There must be no way anyone could ever benefit even from distributing free stuff, because that's how entitled we are.

We pay for content one way or another. At this point money would be the least painful of alternatives, if only it were an alternative.

If only more people were willing. I'm 100% positive that most current ad-based free content providers would like nothing more than to charge for their content and get rid of the ads.

Welp, now I know another service not to use. Thanks for clearing that up for me.

Why? Because he doesn't like URL shorteners?

No; because he seems to be convinced that the advertising-funded revenue model is the only viable source of funding for content providers.

I can think of quite a few other support frameworks for content creators -- some within the capitalist system, some outside it.

(My take is that almost all advertising sucks, from a consumer point of view, because it's designed to steal the consumer's attention from the item that drew it in the first place. At its worst, it becomes as unwelcome as spam. After all, we've only got 168 hours in any given week to pay attention to stuff: ads steal the only thing from us that money can't buy, and that's time.)

Might want to change that to "he (eropple)" there. I started typing out an enraged response here before I figured out that you were talking about him, not idlewords.

Ah thank you - that completely threw me!

Would you mind linking to some resources about alternative revenue models? I've never been a fan of ads and I'm curious as to other viable alternatives.

I pay via Readability and (while I'm no expert) I think that is a good balance between the content producers' need to feed their kids and my aversion to irrelevant ads.

At this point I should probably just stop talking, but I don't see how my comments indicate that I think advertising is the only way for content providers to make money. I said it was a very popular way, and that popularity has affected the way URLs are formed.

No; I don't much like URL shorteners myself. I will avoid his products because he's a dink. There's no reason to mock somebody because they work for a company you don't like.

I think the argument is that URL shorteners hide these types of URLs, which has led to them being used more often since they fly under people's radars.

Sounds reasonable to me.

It doesn't sound reasonable to me, and for obvious reasons I spend quite a lot of time thinking about URLs and how people interact with them.

The truth is that outside of nerd circles most people don't even understand URLs, far less care whether they have extra parameters in them. Major browsers are considering getting rid of the URL bar entirely. The tidiness of URLs is therefore of almost zero concern to major online publishers, but accurate analytics is, which is why UTM tracking is so popular.

It's not just tidiness. There's also a valid privacy concern.

These parameters have nothing to do with serving the content the user wants and are only there to track users and behavior and are metadata about the real url that's attached like a parasite. I understand the value for content providers, but I think stripping them off for archival purposes is appropriate.

These parameters are exactly the ones that help webmasters identify what the user wants. Knowing which pages are popular, where the users come from, and which acquisition channels you can use (and at what cost) to bring eyeballs to your content directly influences which content areas providers invest in and expand, which makes them do exactly what the user wants.

In a world without analytics you have tons of crappy content written or created at the whip of an executive who thinks she's good at guessing market demand (and nobody in the company can prove her wrong scientifically speaking before the job is done). That content will prove to be a failure when it's launched in let's say 9 out of 10 cases, which means 9 bankrupt projects, 10 times less interesting content on the web at 10 times higher costs of production, which in turn leads to less competition, higher prices (paywalls anyone?), reduced rate of learning/innovation etc.

I have a balanced view on the issue and I know the pros and cons of each side, including the privacy issues involved for everyone when surfing the web. But I'm sad when I see remarks such as "I hope your product dies" or when someone chooses to blatantly represent just one side of the story.

"these parameters are the ones helping webmasters identify what the user wants."

That's a completely one sided and biased portrayal. They also help webmasters manipulate the psychology of users, hinder privacy, make an open web more difficult, etc, etc.

Nobody is suggesting we abolish analytics as a practice, but it's misrepresenting the issues to suggest that content providers will die a sad death if they lose out on whatever benefits those querystring params provide.

Content providers already have feedburner for rss metrics (also, now by google) not to mention google analytics (nee urchintracker) and good ol' fashioned server access logs (which you can analyze with urchin proper (or mint, or what have you)).

I can see from a gut-reaction standpoint how you could write what you did, but aside from gawker (who notoriously uses analytics, c.f.[1]), how many legitimate content providers use analytics as anything other than a rough barometer for trends? The problems you describe are problematic for certain types of tabloid publishers (drudge report, ny post, etc.), but they are hardly addressed by a handful of querystring parameters.

[1] http://www.newyorker.com/reporting/2010/10/18/101018fa_fact_...

> how many legitimate content providers use analytics as anything other than a rough barometer for trends?

Nowadays it's built into Adwords, so the answer is anybody that sets up conversion tracking in his Adwords account.

More details: Google provides a tool called conversion optimizer[1]: it's enough to put a tracking code on one of your objective pages (the purchase page, the signup page etc) and Adwords will use machine learning to see the analytics for ads that convert well to your objective (what keywords did they use in Google, what locations are they coming from etc). This way, you can stop paying money for keywords with 20 clicks and 0 conversions, and instead you can raise your ad bids on those keywords performing well (i.e. 5 clicks and 3 conversions). The publisher is happy (more conversions, fewer clicks, less money), Google is happy (better targeting means fewer impressions used means more impressions remaining in the inventory to be sold to others for additional income), the customers are happy (publishers with lower customer acquisition costs can pour more money into the actual content/product).

[1] http://www.google.com/adwords/conversionoptimizer/howitworks...

In a world with analytics you have content farms.

I do too! It's absolutely the right behaviour for a service like pinboard, and it's even the right behaviour from Google's perspective, since any clicks on that link will be from pinboard, and not whatever the original source was listed in the UTM parameters.

Also, if stripping the utm parameters is supposedly going to make it easier to dedup URLs, why not just make the deduping logic strip them before running? So that argument is not fully thought out either. Of course, that doesn't make the conclusion invalid, it just means that this point cannot rationally be used to support it.

Because it's not just going to compare URLs; he might well index by that column and compare hashes or something. And doesn't want to store two copies of each URL - decrapified and normal.
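A minimal sketch of the stripping step being debated here, assuming that only parameters whose names begin with `utm_` count as tracking cruft. The helper names `stripUtmParams` and `isDuplicate` are hypothetical, and nothing here claims to be how pinboard actually does it. Storing only the stripped form means a single indexed column can serve for comparison.

```javascript
// Hypothetical helper: returns the URL with every utm_* parameter removed.
function stripUtmParams(rawUrl) {
  const url = new URL(rawUrl);
  // Collect the keys first; deleting while iterating searchParams can skip entries.
  const trackingKeys = [...url.searchParams.keys()].filter((key) =>
    key.toLowerCase().startsWith("utm_")
  );
  for (const key of trackingKeys) {
    url.searchParams.delete(key);
  }
  return url.toString();
}

// Two bookmarks are treated as duplicates when their stripped forms match.
function isDuplicate(a, b) {
  return stripUtmParams(a) === stripUtmParams(b);
}
```

Parameters that change the page, such as `id=7`, are left alone, so two URLs that differ only in those are still treated as distinct.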

I sense some confusion here. Those parameters are also used by Google Analytics which the author is using on that very page.

The utm parameters allow a site to track campaign information and replace much more annoying techniques like setting up unique landing pages or redirects. There's no relationship between these parameters, which have been around for nearly 10 years, and URL shorteners.

This is because Google Analytics was based on Urchin Live (Google acquired Urchin in 2005).

The point is that URL shorteners hide those parameters; you're reading it the wrong way around :-)

I have always been uneasy about the fact that Urchin/Google Analytics 'campaign tags', when copied and distributed via other services, do not continue to represent the original campaign. For example, the following campaign URL includes three additional pieces of information:


Source: June 2011 Newsletter

Medium: Email

Campaign Name: Free Summer Tickets
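For illustration, the three tags listed above correspond to Google Analytics' `utm_source`, `utm_medium`, and `utm_campaign` query parameters. A minimal sketch of reading them follows. The URL is made up for this example (it is not the one from the comment), and `readCampaignTags` is a hypothetical helper name.

```javascript
// Hypothetical helper: extract the Google Analytics campaign tags from a URL.
// Tags that are absent come back as null.
function readCampaignTags(rawUrl) {
  const params = new URL(rawUrl).searchParams;
  return {
    source: params.get("utm_source"),
    medium: params.get("utm_medium"),
    campaign: params.get("utm_campaign"),
  };
}

// Made-up URL, for illustration only.
const tags = readCampaignTags(
  "http://example.com/tickets?utm_source=June+2011+Newsletter" +
    "&utm_medium=Email&utm_campaign=Free+Summer+Tickets"
);
console.log(tags);
```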

The moment that you store that URL in another service, a number of those tags become incorrect anyway (it is not an email anymore), and the stats you will get from it will be tainted. Campaign tags are useful, but this approach by pinboard may end up making tracking more accurate (certainly in terms of tracking campaign media/terms/content), at the cost of removing the campaign name.

There is also little difference between this and a URL such as http://example.com/Free-Summer-Tickets/June-2011-Newsletter?... being set up to serve the original content, other than that with the campaign tags you've at least got a single canonical URL using the more correct query parameter mechanism.

Yes I think this is exactly what bothers me about it too. The fact that I use bookmarks and send links to others makes the information slightly skewed.

I also think, maybe I'm missing something. Maybe it's actually decent information to have. For instance, if you send a link with source=newsletter to somebody, it still is the newsletter that brought both of you to the site. You may not have visited otherwise. And your friend, probably even less so.

I don't know. I still don't like seeing it. It really does defeat the purpose of making your site have pretty links.

> The moment that you store that URL in other service, a number of those tags become incorrect anyway (it is not an email anymore), and the stats you will get from it will be tainted.

It sounds like you're probably misunderstanding the actual use case involved.

If you're a business, paying for traffic, you want to know ROI. If you spend X dollars and get Y visitors which make you Z dollars, you want Z > X, and you want to know what the relationship between the three values is, to know if spending more would be worthwhile. You don't actually care if the human being who clicked the link was currently in their email, rss, or anything else. You want to know "spending dollars this way resulted in this profit". You want to identify the source, in your dollars, of your income.
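The X, Y, Z relationship above can be sketched as a few lines of arithmetic. The function name `campaignReturn` and the sample figures are invented for illustration, and real campaign accounting would likely involve more than these three numbers.

```javascript
// Hypothetical helper: summarize one paid campaign.
// spend = X dollars, visitors = Y, revenue = Z dollars.
function campaignReturn(spend, visitors, revenue) {
  return {
    profit: revenue - spend,               // positive only when Z > X
    roi: (revenue - spend) / spend,        // net return per dollar spent
    costPerVisitor: spend / visitors,      // X / Y
    revenuePerVisitor: revenue / visitors, // Z / Y
  };
}

// Made-up numbers: $500 spent, 1000 visitors, $750 earned.
const summary = campaignReturn(500, 1000, 750);
console.log(summary);
```

Spending more looks worthwhile only while revenue per visitor stays above cost per visitor, which is why the campaign tag has to identify where the money went rather than where the reader happened to be sitting.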

Note that there are many other implementations of this type of click tracking, so it wouldn't be possible to strip all of these tracking strings. For example ReadWriteWeb also does this for posts on HN. Just take a look at this recent submission:

http://news.ycombinator.com/item?id=2643515

The link contains this URL fragment: #.TfMwNJgETxs;hackernews

I think it's bad design to not assume that people can and will always share URLs with others. Putting tracking strings in URLs ignores this fact.

I often see Urchin URL parameters even on submissions on Hacker News. Most of the time they say utm_source=feedburner, which means that they were taken from an RSS reader. Just think about how easily such a submission reaching the top of Hacker News would distort the statistics.

Doesn't this make the statistics almost worthless? Or at least harder to interpret? I think so.

I disagree, I think there is still value in knowing where the link originally came from (the one that was posted to Hacker News), even if you get lots of secondary traffic. Using referrer logs you can probably differentiate between "real" views from RSS and the ones that were propagated by a re-posting of that link.
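A rough sketch of the referrer comparison suggested above, assuming the site knows which referrer hosts it expects for a given tagged source. The function name `classifyVisit` and the `expectedHosts` list are hypothetical. Visits with no referrer at all (common for desktop feed readers and email clients) are not flagged, since nothing can be inferred from them.

```javascript
// Hypothetical helper: flag visits whose utm_source tag disagrees with the referrer.
function classifyVisit(landingUrl, referrer, expectedHosts) {
  const taggedSource = new URL(landingUrl).searchParams.get("utm_source");
  const referrerHost = referrer ? new URL(referrer).hostname : null;
  // "Reshared" here means: tagged as coming from one place, but actually
  // referred by a host we did not expect for that tag.
  const reshared =
    taggedSource !== null &&
    referrerHost !== null &&
    !expectedHosts.includes(referrerHost);
  return { taggedSource, referrerHost, reshared };
}
```

For example, a hit on `?utm_source=feedburner` whose referrer is a news.ycombinator.com page would come back with `reshared: true`, while the same URL arriving from an expected feed host would not.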

Pretty sure SEO is the real reason long urls are such a problem, it's why URLs are long and it's why they're all tagged to hell.

There was a time where your site would be

site.com/product.php?id=x or


now it's:

site.com/keyword-keyword/keyword/keyword/product-name?urchin or google analytics crud

For a short time there, it was:


We should get back to that. There need to be better ways to do SEO than polluting URLs. Slapping this information on a URL is a misuse of the purpose of URLs (permanent locators for a resource on the web. E.g. even if the resource has moved, there is a response code and a way of reaching the new location).

There is a middle-ground, which is good and wholesome. site.com/category/page-name.html gives good context for users (those few of us who understand URLs, anyway ;) that site.com/index.php?id=x doesn't, without bloating things for SEO purposes.

Except of course - you wouldn't really need .html (.php or .anything) on a url...

Those sorts of URLs are good for discovery as well. It's easy to navigate and guess other URLs with that.

URLs are long and tagged with parameters for SEM purposes, not SEO. Long URLs that are spidered by Google are usually considered bad from an SEO perspective because they're not "pretty" URLs. There's a lot of incorrect information being passed around on this thread.

All the parameters you see specify campaign variables to help marketers track their campaigns. As for privacy, these parameters don't reveal user behavior on a site. It's only when it's connected to Google Analytics that is placed ON the site that the campaign data is then connected to site data. Long URLs have no adverse effect on the user browsing experience and URL shorteners do a great job of hiding them so removing those parameters simply serves to screw over marketing people that have spent time and money crafting their campaigns and gathering valuable data.

In regards to making it harder to detect duplicate URLs...really??? It's that hard to strip out everything before parameters?

No, because you would reasonably expect requests with different parameters to return different pages.

if a user initially sees your url in your rss feed, and then posts it to reddit or hn or bookmarks it on delicious, doesn't it screw up all of your statistics for visitors-from-rss-feeds if it has those url tokens? i mean everyone would be visiting the page telling your statistics engine that they came from an rss feed when they didn't.

it seems like a better way of doing it would be to capture those tokens at an initial url, but then redirect the user to the proper, clean url without them. that way you get accurate stats for visitors-from-rss-feeds, but everyone else that clicks through as that clean url is passed along appears as a different source.
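The capture-then-redirect idea above can be sketched as a pure function that a server handler might call. This is an assumption-laden illustration: `captureAndClean` is a hypothetical name, and the step that actually records the campaign and sends the HTTP redirect is left out.

```javascript
// Hypothetical helper: split a request URL into its campaign tokens and the
// clean URL the visitor should be redirected to.
function captureAndClean(requestUrl) {
  const url = new URL(requestUrl);
  const campaign = {};
  // Iterate over a copy so deleting entries does not disturb the loop.
  for (const [key, value] of [...url.searchParams]) {
    if (key.startsWith("utm_")) {
      campaign[key] = value;
      url.searchParams.delete(key);
    }
  }
  const hasCampaign = Object.keys(campaign).length > 0;
  // redirectTo is null when there was nothing to strip, so no redirect is needed.
  return { campaign, redirectTo: hasCampaign ? url.toString() : null };
}
```

A handler would log `campaign` for the first hit and then redirect to `redirectTo`, so anyone who copies the address afterwards shares the clean URL and shows up under their real source.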

Maybe certain Pinboard users want to store those parameters along with their URL for one reason or another? Seems a little extreme for one guy to make some arbitrary decision about the content the users themselves are choosing to store.

Why would anyone want to save tracking code?

If you include affiliate codes as part of tracking codes, I can definitely see some users wanting to keep that data preserved. For instance, someone may want to reward some shareware author with their amazon purchases. Social organizations like churches, clubs, etc. might encourage their members to tack on an affiliate code to keep them funded. It's hard to see the use case for keeping the utm_* tags, but not all tracking codes are meaningless to an end user.

detecting duplicate urls? seomoz's api will do that for you...

url cruft is much more an aesthetic thing. see http://www.mattcutts.com/blog/clean-up-extra-url-parameters-...

there is an RFC section dedicated to URL normalization, this doesn't need an API:


I am currently working on making a urllib for python which is 3986 compliant

Anywhere I can follow progress on that? I'd love a Python-based URL normalization tool for http://SharedCount.com

email me and I'll ping you once it is up (email in profile)
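As a rough idea of what syntax-based normalization in the spirit of RFC 3986 involves, here is a small sketch built on the standard `URL` class, which follows the WHATWG URL Standard rather than RFC 3986 itself, so treat it as an approximation. The function name `normalizeUrl` is hypothetical. It does not decode percent-encoded unreserved characters or sort query parameters.

```javascript
// Hypothetical helper: approximate syntax-based URL normalization.
function normalizeUrl(rawUrl) {
  // The URL parser lowercases the scheme and host, drops default ports
  // (such as :80 for http), and resolves "." and ".." path segments.
  const href = new URL(rawUrl).href;
  // Uppercase the hex digits of percent-encoded triplets (%2f becomes %2F).
  return href.replace(/%[0-9a-f]{2}/gi, (triplet) => triplet.toUpperCase());
}

console.log(normalizeUrl("HTTP://Example.COM:80/a/./b/../c%2fd?q=%e2"));
```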

You could use the HTML5 History API to remove these query strings once Google Analytics is done using them. (This example assumes you'd never want query strings, but could be refined to just remove utm_* query strings.)

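    if (window.history && history.replaceState && location.search.match(/utm_/)) {
        // Poll until Google Analytics has written its __utmz campaign cookie.
        var check = setInterval(function () {
            if (document.cookie.indexOf("__utmz=") !== -1) {
                clearInterval(check); // stop polling once the cookie exists
                history.replaceState({}, "", location.pathname); // assuming you want no query string
            }
        }, 500);
    }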

Canonical URLs would solve this.

"They create needless URL bloat" - You just said you don't like short URL's. You can't have it both ways.

"erode user privacy" - The data gathered is aggregated and doesn't identify individuals.

"make it more difficult to identify duplicate content" - You're already stripping them because you don't like them, so just ignore them when de-duping. Again, you can't have it both ways.

"and benefit ad publishers at the expense of everyone else." - You're out to screw the guys who pay for all those wonderful free services you use, like GMail for example:

    ;; ANSWER SECTION:
    pinboard.in. 3600 IN MX 1 s3.pinboard.in.
    pinboard.in. 3600 IN MX 2 s5.pinboard.in.
    pinboard.in. 3600 IN MX 4 ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 5 ALT1.ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 5 ALT2.ASPMX.L.GOOGLE.COM.
    pinboard.in. 3600 IN MX 10 ASPMX2.GOOGLEMAIL.COM.
    pinboard.in. 3600 IN MX 10 ASPMX3.GOOGLEMAIL.COM.

The data gathered absolutely does pose a threat to user privacy.

How? All those vars are tracking is campaign/traffic source, etc.

The more information you have along with an HTTP request, the easier it is to identify the person that made that HTTP request.

User-agent / IP / list of fonts that Flash can use pretty much identifies individuals uniquely. Adding "I got to this page by clicking a link on site X" to the URL adds one more piece of data that makes it even easier for the site to guess that you are you.

Why would a site need to rely on utm_* query params to identify you when they probably already have your IP/user agent/cookie session?

Can you think of a practical way in which this would actually make a difference?

Care to elaborate beyond pounding the table?

How do you know he isn't paying for google apps for domains?

My MX listings look awfully similar to that and it isn't a free service for me.

It's even better than that - I'm both paying for it AND not using it (you can see how the first two MX records point to my own servers). I'd feel like a double winner if the grandparent's original argument had made any sense.
