

What every web developer must know about URL encoding - bemmu
http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding

======
jrochkind1
This is great stuff.

 _The reserved characters are different for each part_

But if it's true that you can _optionally_ percent-encode any character in any
part of the URL (path or query) -- is this true? I think so -- then there is
in fact a way to _encode_ any URI without as much syntactical awareness of the
URI structure.

 _This means that the "blue+light blue" string has to be encoded differently
in the path and query parts:
"http://example.com/blue+light%20blue?blue%2Blight+blue".
From there you can deduce that encoding a fully constructed URL is impossible
without a syntactical awareness of the URL structure._

Does it? Could you _optionally_ encode it as:

    
    
       http://example.com/blue%2Blight%20blue?blue%2Blight%20blue
    

That is, _percent_ encode ( _not_ the legacy encode-space-as-plus) _both_ the
plus and the space in both path and query?

* You'd still need enough syntactic awareness of the URI structure to know to leave path-separator "/" alone, and query-part-beginning "?" alone.

* And you'd need even more syntactic awareness to properly _decode_ URIs, which may not have chosen to optionally always-percent-encode.

* But it might be wise if libraries chose to encode things in that consistent way, for instance encoding plus in path even though it's not required, never encoding space as plus even in query.

* I think there is _no_ good reason for any modern library to encode spaces, even in the query string, as "+" instead of %20. Even though many of them do. Don't do it.

* I don't use Java much, and couldn't entirely follow those examples (this stuff is confusing to talk about -- as in all escaping issues) -- but it sure does seem like those parts of stdlib/commonly used libraries are pretty darn broken. (And I'm sure Java is not alone here -- there is a long history of devs being confused about this stuff. Again, as with just about any escaping-related issue).
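
A minimal sketch of the "always percent-encode" idea from the list above, assuming Python 3's `urllib.parse` (the thread itself predates it; the variable names are illustrative only). It encodes one component the same way for path and query, then shows why decoding still depends on which part you are in:

```python
from urllib.parse import quote, unquote, unquote_plus

raw = "blue+light blue"

# safe="" exempts nothing, so "+" and "/" are percent-encoded too.
encoded = quote(raw, safe="")
print(encoded)

# The same encoded component can be used in the path and in the query.
url = "http://example.com/" + encoded + "?" + encoded
print(url)

# Decoding input from others is still part-specific, because they may
# have used the legacy "+ means space" convention in the query.
path_value = unquote("blue+light%20blue")        # "+" stays a plus
query_value = unquote_plus("blue%2Blight+blue")  # "+" becomes a space
print(path_value, "|", query_value)
```

Both decoded values come out as the original string, but only because each one was decoded with the rule that matches the part it came from.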

~~~
stan_rogers
The + as a space in a URL often originates from form data submitted via GET
(the wisdom of which is a separate topic for discussion). At least we've
gotten to the point now where we've pretty much dispensed with nested
hierarchies of framesets, in which the "percent" encoding needed to be done
multiple times to different parts of the URL so each target could decode its
source and dispatch an appropriately-encoded source to its descendants,
yielding %25252520 and other similar monstrosities.
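
The pile-up of "25"s described above can be reproduced with a short sketch, assuming Python 3's `urllib.parse.quote`; `encode_for_depth` is a hypothetical helper, not anything from a frameset toolkit. Each nesting level percent-encodes the previous result, so the "%" itself becomes "%25":

```python
from urllib.parse import quote


def encode_for_depth(value, levels):
    """Percent-encode `value` once per nesting level (hypothetical helper)."""
    for _ in range(levels):
        value = quote(value, safe="")
    return value


# A single space, encoded for targets one to four levels down.
for depth in range(1, 5):
    print(depth, encode_for_depth(" ", depth))
```

Four levels of encoding turn one space into `%25252520`.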

~~~
gpvos
Are you using some kind of framework? I don't think there is a way in plain
HTTP+HTML to pass URLs to child frames in a frameset.

~~~
Hello71
There's target=frame (never used in practice), but that's only one level down.

~~~
stan_rogers
You can use an encoded target=frame to pass it down as many levels as you want
(encode the ampersand and the equals sign, and of course the percent sign for
things that aren't at the immediate child level). The '90s sucked for web dev,
but there was almost always a (scary and hackish) way.

~~~
gpvos
But which mechanism would do the substitution of those parameters in the HTML
of the child frame? I still fail to see how it would be done without either
Javascript or server-side code. (And I did web dev in the 90s, with frames,
but apparently was not evil enough to think of such stuff.)

------
chrismorgan
I'm amused by the fact that a link is broken in the very first paragraph due
(presumably) to a poor linkifier and what it thinks of URL encoding (or
potentially other syntactic features of the source language). The document was
probably done in Markdown, and the trailing right parenthesis in the link is
_not treated as part of the URL_ , so instead of being
[http://en.wikipedia.org/wiki/Java_(programming_language)](http://en.wikipedia.org/wiki/Java_\(programming_language\))
the link target becomes
[http://en.wikipedia.org/wiki/Java_(programming_language](http://en.wikipedia.org/wiki/Java_\(programming_language)
with the parenthesis appearing after the link.

Assuming Markdown, that would be:

    
    
        [Java](http://en.wikipedia.org/wiki/Java_(programming_language))
    

Bad parser.

~~~
buro9
I think you're really close, except I think that they have an autolink regex
to find unanchored URLs and anchor them. And that it is that regex that is
failing to deal with Wikipedia's brackets in URLs.
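
A rough sketch of how an autolink regex can fail in this way, and one possible workaround. Both patterns are assumptions for illustration; nobody in the thread has seen the blog's actual linkifier. The second approach matches greedily and then trims a trailing ")" only while the parentheses in the URL are unbalanced:

```python
import re

# Naive pattern (assumed): stop at whitespace or at any ")".
NAIVE = re.compile(r"https?://[^\s)]+")

# Alternative: take everything up to whitespace, then trim the tail.
GREEDY = re.compile(r"https?://\S+")


def find_urls(text):
    """Hypothetical autolinker that keeps balanced parentheses in URLs."""
    urls = []
    for match in GREEDY.finditer(text):
        url = match.group(0)
        while url:
            last = url[-1]
            if last in ".,;:!?":
                url = url[:-1]  # sentence punctuation, not part of the URL
            elif last == ")" and url.count(")") > url.count("("):
                url = url[:-1]  # unmatched ")" belongs to the prose
            else:
                break
        urls.append(url)
    return urls


text = "See (http://en.wikipedia.org/wiki/Java_(programming_language)) for details."
print(NAIVE.findall(text))
print(find_urls(text))
```

The naive pattern drops the closing parenthesis of the Wikipedia URL; the trimming version keeps it and discards only the outer one.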

------
tel
I've been chugging away at a direct translation of RFC 2616 into
parser/pretty-printer pairs in Haskell [1]. It's taken me a few tries to get
it right and I've learned a bit in the process, such as RFC 2396 having an
ambiguous grammar (look for 2 calls to `micro` in my code which penalize
parsing to simulate a greedy algorithm). I can hardly imagine any URL
decoder/encoder 'in the wild' being actually RFC compliant.

There's also some difficulty with how RFC 2616 (the current HTTP/1.1 spec)
demands using definitions of URLs from RFC 2396, which is not the current URI
spec, but the updated URI spec seemed to me to be inconsistent with general
usage of HTTP. If I remember correctly, it ended up implying that you cannot
have a query string without sending your scheme as well, but I may be wrong.

The URI spec is hideously complex, but also very comprehensively defined in
those RFCs. It's an interesting job to look through them.

[1] [https://github.com/tel/requests](https://github.com/tel/requests)

~~~
pyre

      | you cannot have a query string without sending
      | your scheme as well, but I may be wrong
    

I don't remember reading that in the RFC.

Also, I'm curious for examples of URLs that break RFC 3986[1].

[1] [http://tools.ietf.org/html/rfc3986](http://tools.ietf.org/html/rfc3986)

~~~
tel
It wasn't URLs that break RFC 3986, it was more an instance of Postel's Law
where some common URLs people might send and parse in HTTP messages every day
actually weren't allowed in HTTP according to the upgrade path from RFC 2396.
When I became worried about that possibility I decided to jump back to 2396
since that's the RFC that 2616 truly inherits from.

------
Bockit
That went from "Oh, another article about the parts of a URI" to "wow..." very
quickly. The path parameters were kind of neat. I was under the impression
that ";" could be used instead of "&" and was only for query string parameters
(I didn't even know path parameters were a thing).

~~~
Yhippa
I didn't either and I like the idea of that. I'll have to try that when
writing up the paging for my next web app for something like search results.

~~~
adamauckland
Path parameters aren't particularly well supported on many server platforms,
unfortunately.

Of course, it does depend on what parser gets them first!

------
drewcoo
As a career-long test dev I have to ask . . . how do you test all of that? Do
you test all of that? You probably don't in practice.

Maybe I'm jaded. As a newly-minted SDET I once tried to test email addresses
as defined by RFCs and eventually realized that I was trying to test adherence
to standards instead of what actually happened.

My humble suggestion is to use a well-vetted URI-handling library and employ a
combination of eyeballing it and some faith. (I hate to recommend faith but
there you go.) And if there's not already a lib for that in Java (gotta be but
I try to avoid Java) maybe some Java-head could write one for everyone . . .

------
jameswyse
The last time I came across this article I was excited to learn about 'path
parameters' / 'matrix parameters' but I couldn't find anything that actually
supports them, at least in node.js.

Shame, I can see a use for them.

------
rossy
What surprised me the most is that URL components can contain encoded slashes
(%2f). I tried replacing some of the slashes in my open tabs with %2fs. Some
sites treat them as path separators, some don't and some treat them
differently depending on which components they appear between.
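
The inconsistency comes down to whether a site decodes before or after splitting the path. A small sketch, assuming Python 3's `urllib.parse` and a made-up path:

```python
from urllib.parse import unquote

path = "/files/reports%2F2012/summary"  # hypothetical example path

# Split on the literal "/" first, then decode each segment.
segments = [unquote(s) for s in path.split("/") if s]
print(segments)

# Decode first and %2F turns into a real separator.
decoded_first = [s for s in unquote(path).split("/") if s]
print(decoded_first)
```

The first list has three segments, one of which contains a slash; the second has four, so the encoded slash has silently become structure.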

------
pacifika
Does anyone have a verified list of libraries, per language, that deal with
this correctly?

------
mikeash
I'm always amazed and dismayed that this heap of junk has become the
foundation of such a massive set of activities.

I mean, who thought it was a good idea to have different escaping for
different parts of a URL? That's like having a car that uses a lever to turn
right and a foot pedal to turn left.

Of course, the answer is: nobody. URLs weren't designed, they grew
organically. The result is the ubiquitous mess we have now.

It does work. The organic, evolving web has survived and thrived while sanely-
designed standards have withered and died. Maybe it's ultimately the best way
to do things.

But it still bugs me that I have to deal with _such_ weird, awful nonsense
when I need to do something related to the web. TCP/IP, while certainly not
lacking in quirks, is still a million times better. It's sad that we couldn't
end up with something a little more consistent at the higher levels.

~~~
MichaelGG
There seems to be a drive in some of the core standards to allow as flexible
"human" encoding as possible. Instead of writing a tight-as-possible encoding,
it's like it becomes a game to see how complicated yet still unambiguous they
can make syntax.

HTTP and SIP, for instance, have ridiculous parsing rules and tons of edge
cases. The SIP authors even put together a "torture test" RFC where they take
glee in making the most insane messages that still parse. And when they don't
parse, they suggest the implementation infer the meaning. Seriously.

HTTP and SIP allow newlines in headers that get consumed in parsing, so you
can manually word-wrap lines. They allow _comments_ in HTTP messages. SIP
(HTTP too?) allows headers _in the URL_.

When a non-committee (TCPIP) or non-academic (SPDY) entity does a spec, they
tend to remove a lot of this cruft and realise that human-readability comes
second (unlike in programming languages).
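
For the header word-wrapping mentioned above: RFC 2616 let a header value continue on the next line if that line started with a space or tab (RFC 7230 later deprecated this). A minimal sketch of what a parser has to do, with a hypothetical `unfold` helper and a made-up `X-Note` header:

```python
import re


def unfold(raw_headers):
    """Join folded continuation lines onto the header they belong to."""
    # A line break followed by spaces or tabs continues the previous line.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", raw_headers)
    return [line for line in unfolded.splitlines() if line]


raw = "X-Note: a header\r\n  wrapped by hand\r\nHost: example.com\r\n"
headers = unfold(raw)
print(headers)
```

This only handles folding; comments and quoted strings inside values need their own handling.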

~~~
tel
In my reading of the RFCs, HTTP doesn't allow headers in the URI. But it does
demand that you accept nested comments in header _values_. Only. These comments
appear nowhere else.

------
noptic
>> A URL cannot be analysed after decoding

Sadly true. Drove me nuts trying to analyze a URL on Apache/PHP.
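
A quick illustration of the quoted point for query strings, sketched in Python 3 rather than PHP, with a made-up query. Once the whole string is decoded, an encoded "&" or "=" inside a value can no longer be told apart from a real delimiter:

```python
from urllib.parse import parse_qsl, unquote

query = "q=a%26b%3Dc&page=2"  # hypothetical query string

# Parse the still-encoded string: delimiters and data stay distinct.
correct = parse_qsl(query)
print(correct)

# Decode first, then parse: the value "a&b=c" falls apart.
broken = parse_qsl(unquote(query))
print(broken)
```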

------
noinsight
Python's urllib and urlparse make this trivial: urllib.urlencode,
urlparse.parse_qs and urlparse.urlparse.
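
The module names above are the Python 2 ones; in Python 3 the same functionality lives in `urllib.parse`. A short sketch under that assumption (the `colour` parameter and URL are made up) showing encoding a query, choosing `%20` over `+`, and splitting and decoding again:

```python
from urllib.parse import urlencode, parse_qs, urlsplit, quote

params = {"colour": "blue+light blue"}

# Default form encoding: space becomes "+", a literal plus becomes %2B.
form_style = urlencode(params)
print(form_style)

# quote_via=quote percent-encodes the space as %20 instead.
percent_style = urlencode(params, quote_via=quote)
print(percent_style)

# Split first, then decode only the part you are interested in.
parts = urlsplit("http://example.com/search?" + form_style + "#top")
decoded = parse_qs(parts.query)
print(parts.path, parts.fragment, decoded)
```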

~~~
justincormack
Do they get all these cases right?

~~~
noinsight
I haven't tested it thoroughly, but from what I remember urlparse works fine.

~~~
berdario
I checked it right now... it's not complete, but seems to be mostly fine

basically, unlike Java, it doesn't give you an encode() function that takes an
arbitrary string... the only urlencode() function expects data representing a
query

obviously, you still have to remember to handle the quoting of each part of
the url separately... if you build your url (actually just
resource+query+fragment) , and then you just quote() it at the end, you're no
better than with java

e.g, if you have a path made by 2 segments "yadda/yadda" and "foo/bar"

    quote("/".join(["yadda/yadda", "foo/bar"]))

yields 'yadda/yadda/foo/bar', which might not be correct, if what you want is
actually

    "/".join(quote(segment, safe="") for segment in ["yadda/yadda", "foo/bar"])

that yields 'yadda%2Fyadda/foo%2Fbar'

kinda error-prone, if you ask me

Also, python's urlparse doesn't seem to handle path parameters correctly:

    urlparse("http://example.com/egypt;p=0/nile;p2=1;p3=2")

only recognizes p2=1;p3=2 as path parameters

I think that part of the confusion is that we think of encoding as "The way in
which symbols are mapped onto bytes", but if we use that meaning, it's not
correct to talk about "url encoding", because each part of the url cannot be
converted to ascii while ignoring the context (are the / meaningful?) and the
place where it appears in the url

it's more like a "url language", and if we talked about "parsing" or
"formatting" imho we would get fewer ambiguities and misunderstandings

PS: I realized just now that for the
"http://example.com/egypt;p=0/nile;p2=1;p3=2"
example, the doc suggests using the urlsplit function instead (which avoids
parsing the path parameters altogether... kind of a non-solution imo,
but at least it's known)
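
A sketch of the behaviour described above, plus one way to get per-segment parameters by hand. It assumes Python 3's `urllib.parse`; `split_path_params` is a hypothetical helper, not part of the standard library:

```python
from urllib.parse import urlparse, urlsplit, unquote

url = "http://example.com/egypt;p=0/nile;p2=1;p3=2"

# urlparse only peels parameters off the last path segment.
parsed = urlparse(url)
print(parsed.path, "|", parsed.params)

# urlsplit leaves the path untouched, parameters included.
raw_path = urlsplit(url).path
print(raw_path)


def split_path_params(path):
    """Return [(segment, {param: value}), ...] for every path segment."""
    result = []
    for raw_segment in path.split("/"):
        if not raw_segment:
            continue
        name, *raw_params = raw_segment.split(";")
        params = {}
        for item in raw_params:
            key, _, value = item.partition("=")
            # Decode only after splitting on the delimiters.
            params[unquote(key)] = unquote(value)
        result.append((unquote(name), params))
    return result


segments = split_path_params(raw_path)
print(segments)
```

Splitting on "/", ";" and "=" before decoding keeps encoded delimiters inside names and values intact.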

~~~
chii
java's
[http://docs.oracle.com/javaee/6/api/javax/ws/rs/core/UriBuil...](http://docs.oracle.com/javaee/6/api/javax/ws/rs/core/UriBuilder.html)
is pretty good.

~~~
berdario
Thanks! I wonder why the post author didn't use/discover it

~~~
chii
Maybe they haven't had enough experience with Java (especially the J2EE
stack... it's a bit of a big and complicated stack, and unless you've really
worked with it, you won't know it).

Guava also has tonnes of good stuff -
[http://docs.guava-libraries.googlecode.com/git/javadoc/com/g...](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/package-summary.html),
and it's ostensibly lighter weight and more modular too.

