The reserved characters are different for each part
But if it's true that you can _optionally_ percent-encode any character in any part of the URL (path or query) -- is this true? I think so -- then there is in fact a way to _encode_ any URI without as much syntactical awareness of the URI structure.
This means that the "blue+light blue" string has to be encoded differently in the path and query parts: "http://example.com/blue+light%20blue?blue%2Blight+blue". From there you can deduce that encoding a fully constructed URL is impossible without a syntactical awareness of the URL structure.
Does it? Could you optionally encode it as "http://example.com/blue%2Blight%20blue?blue%2Blight%20blue"? That is, percent-encode (not the legacy encode-space-as-plus) both the plus and the space in both path and query?
* You'd still need enough syntactic awareness of the URI structure to know to leave the path separator "/" alone, and the query-beginning "?" alone.
* And you'd need even more syntactic awareness to properly _decode_ URIs, which may not have chosen to optionally always-percent-encode.
* But it might be wise if libraries chose to encode things in that consistent way: for instance, encoding plus in the path even though it's not required, and never encoding space as plus even in the query (see the sketch after this list).
* I think there is _no_ good reason for any modern library to encode spaces, even in the query string, as "+" instead of %20. Even though many of them do. Don't do it.
* I don't use Java much, and couldn't entirely follow those examples (this stuff is confusing to talk about -- as in all escaping issues) -- but it sure does seem like those parts of stdlib/commonly used libraries are pretty darn broken. (And I'm sure Java is not alone here -- there is a long history of devs being confused about this stuff. Again, as with just about any escaping-related issue).
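As a rough sketch of that consistent style (in Python; the helper name is mine, nothing standard):

from urllib.parse import quote

def encode_component(s):
    # safe="" percent-encodes everything outside the unreserved set,
    # so "+" becomes %2B and " " becomes %20 in path and query alike
    return quote(s, safe="")

value = "blue+light blue"
print("http://example.com/" + encode_component(value) + "?" + encode_component(value))
# http://example.com/blue%2Blight%20blue?blue%2Blight%20blue

Decoding is still position-sensitive, of course, because a legacy encoder may have used "+" for space in the query.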
The + as a space in a URL often originates from form data submitted via GET (the wisdom of which is a separate topic for discussion). At least we've gotten to the point now where we've pretty much dispensed with nested hierarchies of framesets, in which the "percent" encoding needed to be done multiple times to different parts of the URL so each target could decode its source and dispatch an appropriately-encoded source to its descendants, yielding %25252520 and other similar monstrosities.
You can use an encoded target=frame to pass it down as many levels as you want (encode the ampersand and the equals sign, and of course the percent sign for things that aren't at the immediate child level). The '90s sucked for web dev, but there was almost always a (scary and hackish) way.
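Just to make the stacked encodings concrete, each extra level of quoting multiplies the percent signs (a quick Python sketch):

from urllib.parse import quote

value = "a b"          # innermost value, containing a space
once = quote(value)    # 'a%20b'     -- encoded for the immediate child
twice = quote(once)    # 'a%2520b'   -- survives one extra decode
thrice = quote(twice)  # 'a%252520b' -- and so on down the frameset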
But which mechanism would do the substitution of those parameters in the HTML of the child frame? I still fail to see how it would be done without either Javascript or server-side code. (And I did web dev in the 90s, with frames, but apparently was not evil enough to think of such stuff.)
I wanted to make the same comment after reading the article. You're correct. The "+" literal in the URL is a painful remnant, and at least when encoding I think it should be avoided entirely -- so I think the article has it wrong when it says the string has to be encoded differently depending on its place within the URL. However, it's something to be aware of when decoding.
I'm always amazed and dismayed that this heap of junk has become the foundation of such a massive set of activities.
I mean, who thought it was a good idea to have different escaping for different parts of a URL? That's like having a car that uses a lever to turn right and a foot pedal to turn left.
Of course, the answer is: nobody. URLs weren't designed, they grew organically. The result is the ubiquitous mess we have now.
It does work. The organic, evolving web has survived and thrived while sanely-designed standards have withered and died. Maybe it's ultimately the best way to do things.
But it still bugs me that I have to deal with such weird, awful nonsense when I need to do something related to the web. TCP/IP, while certainly not lacking in quirks, is still a million times better. It's sad that we couldn't end up with something a little more consistent at the higher levels.
There seems to be a drive in some of the core standards to allow as flexible a "human" encoding as possible. Instead of writing a tight-as-possible encoding, it's like a game to see how complicated yet still unambiguous they can make the syntax.
HTTP and SIP, for instance, have ridiculous parsing rules and tons of edge cases. The SIP authors even put together a "torture test" RFC where they take glee in making the most insane messages that still parse. And when they don't parse, they suggest the implementation infer the meaning. Seriously.
HTTP and SIP allow newlines in headers that get consumed in parsing, so you can manually word-wrap lines. They allow comments in HTTP messages. SIP (HTTP too?) allows headers _in the URL_.
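The folding rule is simple enough to sketch in Python -- a continuation line starting with a space or tab extends the previous header's value (note that RFC 7230 later deprecated this folding):

raw = "X-Note: a very long\r\n  folded value\r\nHost: example.com\r\n"

headers = []
for line in raw.split("\r\n"):
    if line[:1] in (" ", "\t") and headers:  # continuation line
        name, value = headers[-1]
        headers[-1] = (name, value + " " + line.strip())
    elif line:                               # new "Name: value" header
        name, value = line.split(":", 1)
        headers.append((name, value.strip()))

print(headers)
# [('X-Note', 'a very long folded value'), ('Host', 'example.com')]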
When a non-committee (TCP/IP) or non-academic (SPDY) entity does a spec, they tend to remove a lot of this cruft and realise that human-readability comes second (unlike in programming languages).
That's an interesting thought. There is definite advantage in having a protocol be human-readable and human-writable. However, there's a vast gulf between a protocol that you can write if you have the spec at hand and take your time, and a protocol that you can comfortably write by hand routinely, like a programming language. The latter strikes me as a really bad idea, one responsible for much of this craziness.
In my reading of the RFCs, HTTP doesn't allow headers in the URI. But it does demand that you accept nested comments in header values -- and only there; these comments appear nowhere else.
I'm amused by the fact that a link is broken in the very first paragraph due (presumably) to a poor linkifier and what it thinks of URL encoding (or potentially other syntactic features of the source language). The document was probably done in Markdown, and the trailing right parenthesis in the link is not treated as part of the URL, so instead of being http://en.wikipedia.org/wiki/Java_(programming_language) the link target becomes http://en.wikipedia.org/wiki/Java_(programming_language with the parenthesis appearing after the link.
I think you're really close, except I think that they have an autolink regex to find unanchored URLs and anchor them. And that it is that regex that is failing to deal with Wikipedia's brackets in URLs.
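A hypothetical autolink regex like this one would reproduce the bug exactly:

import re

# naive autolinker: assume a URL ends at whitespace or ")"
autolink = re.compile(r"https?://[^\s)]+")

text = "see http://en.wikipedia.org/wiki/Java_(programming_language) here"
print(autolink.search(text).group(0))
# http://en.wikipedia.org/wiki/Java_(programming_language
# the trailing ")" lands outside the anchor, breaking the link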
I've been chugging away at a direct translation of RFC 2616 into parser/pretty-printer pairs in Haskell [1]. It's taken me a few times to get it right and I've learned a bit in the process such as RFC 2396 having an ambiguous grammar (look for 2 calls to `micro` in my code which penalize parsing to simulate a greedy algorithm). I cannot imagine almost any URL decoder/encoder 'in the wild' being actually RFC compliant.
There's also some difficulty with how RFC 2616 (the current HTTP/1.1 spec) demands using definitions of URLs from RFC 2396, which is not the current URI spec; but the updated URI spec seemed to me to be inconsistent with general usage of HTTP. If I remember correctly, it ended up implying that you cannot have a query string without sending your scheme as well, but I may be wrong.
The URI spec is hideously complex, but also very comprehensively defined in those RFCs. It's an interesting job to look through them.
It wasn't that URLs break RFC 3986; it was more an instance of Postel's Law, where some common URLs people might send and parse in HTTP messages every day actually weren't allowed in HTTP according to the upgrade path from RFC 2396. When I became worried about that possibility, I decided to jump back to 2396, since that's the RFC that 2616 truly inherits from.
That went from "oh, another article about the parts of a URI" to "wow..." very quickly. The path parameters were kind of neat; I was under the impression ";" could be used instead of "&" and was only for query string parameters (didn't even know path parameters were a thing).
I believe it used to be recommended that applications accept ";" to separate query string parameters, so that if a URL containing query string parameters was double-HTML-escaped for some reason...
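For what it's worth, Python's parse_qs used to split on both "&" and ";" by default; nowadays it takes a single explicit separator instead:

from urllib.parse import parse_qs

print(parse_qs("a=1&b=2"))                 # {'a': ['1'], 'b': ['2']}
print(parse_qs("a=1;b=2", separator=";"))  # {'a': ['1'], 'b': ['2']}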
As a career-long test dev I have to ask . . . how do you test all of that? Do you test all of that? You probably don't in practice.
Maybe I'm jaded. As a newly-minted SDET I once tried to test email addresses as defined by RFCs and eventually realized that I was trying to test adherence to standards instead of what actually happened.
My humble suggestion is to use a well-vetted URI-handling library and employ a combination of eyeballing it and some faith. (I hate to recommend faith but there you go.) And if there's not already a lib for that in Java (gotta be but I try to avoid Java) maybe some Java-head could write one for everyone . . .
The last time I came across this article I was excited to learn about 'path parameters' / 'matrix parameters' but I couldn't find anything that actually supports them, at least in node.js.
What surprised me the most is that URL components can contain encoded slashes (%2f). I tried replacing some of the slashes in my open tabs with %2fs. Some sites treat them as path separators, some don't and some treat them differently depending on which components they appear between.
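The ordering hazard is easy to see even client-side with Python's urllib: the %2F survives URL splitting, so you have to split the path first and decode each segment afterwards:

from urllib.parse import urlsplit, unquote

path = urlsplit("http://example.com/a%2Fb/c").path  # '/a%2Fb/c'
segments = path.split("/")[1:]                      # ['a%2Fb', 'c'] -- two segments
print([unquote(s) for s in segments])               # ['a/b', 'c']
# decoding before splitting would yield '/a/b/c' and invent a third segment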
I checked Python's urllib just now... it's not complete, but it seems to be mostly fine.
Basically, unlike Java, it doesn't give you an encode() function that takes an arbitrary string... the only urlencode() function expects data representing a query.
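For example (and note that urlencode still defaults to the legacy plus-for-space via quote_plus, though you can opt out with quote_via):

from urllib.parse import urlencode, quote

params = {"q": "blue+light blue"}
print(urlencode(params))                   # q=blue%2Blight+blue   (space -> '+')
print(urlencode(params, quote_via=quote))  # q=blue%2Blight%20blue (space -> %20)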
Obviously, you still have to remember to handle the quoting of each part of the URL separately... if you build your URL (actually just resource+query+fragment), and then just quote() it at the end, you're no better off than with Java.
E.g., if you have a path made of two segments, "yadda/yadda" and "foo/bar",

from urllib.parse import quote
quote("/".join(["yadda/yadda", "foo/bar"]))

yields 'yadda/yadda/foo/bar' (quote's default safe='/' leaves slashes alone), which might not be correct if what you want is actually

"/".join(quote(segment, safe="") for segment in ["yadda/yadda", "foo/bar"])

which yields 'yadda%2Fyadda/foo%2Fbar'.
Kinda error-prone, if you ask me.
Also, Python's urlparse seems not to handle path parameters correctly:
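For instance, with the article's "egypt/nile" example (output from Python 3's urllib.parse):

from urllib.parse import urlparse

print(urlparse("http://example.com/egypt;p=0/nile;p2=1;p3=2"))
# ParseResult(scheme='http', netloc='example.com',
#             path='/egypt;p=0/nile', params='p2=1;p3=2',
#             query='', fragment='')
# urlparse only splits params off the last path segment,
# so ";p=0" stays embedded in the path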
I think part of the confusion is that we think of encoding as "the way in which symbols are mapped onto bytes" -- but if we use that meaning, it's not correct to talk about "URL encoding", because each part of the URL cannot be converted to ASCII while ignoring the context (are the slashes meaningful?) and the place where it appears in the URL.
It's more like a "URL language", and if we talked about "parsing" and "formatting" instead, IMHO we'd have fewer ambiguities and misunderstandings.
PS: I realized just now that for the "http://example.com/egypt;p=0/nile;p2=1;p3=2" example, the doc suggests using the urlsplit function instead (which avoids parsing the path parameters altogether... kind of a non-solution IMO, but at least it's known).
Maybe they haven't experienced enough Java (especially the J2EE stack... it's a big and complicated stack, and unless you've really worked with it, you won't know it).