The reserved characters are different for each part
But if it's true that you can _optionally_ percent-encode any character in any part of the URL (path or query) -- is this true? I think so -- then there is in fact a way to _encode_ any URI without as much syntactical awareness of the URI structure.
This means that the "blue+light blue" string has to be encoded differently in the path and query parts: "http://example.com/blue+light%20blue?blue%2Blight+blue". From there you can deduce that encoding a fully constructed URL is impossible without a syntactical awareness of the URL structure.
Does it? Could you optionally encode it as "http://example.com/blue%2Blight%20blue?blue%2Blight%20blue"? That is, percent-encode (not the legacy encode-space-as-plus) both the plus and the space in both path and query?
* You'd still need enough syntactic awareness of the URI structure to know to leave the path separator "/" alone, and the query-beginning "?" alone.
* And you'd need even more syntactic awareness to properly _decode_ URIs, which may not have chosen to optionally always-percent-encode.
* But it might be wise if libraries chose to encode things in that consistent way: for instance, encoding plus in the path even though it's not required, and never encoding space as plus even in the query (see the sketch after this list).
* I think there is _no_ good reason for any modern library to encode spaces, even in the query string, as "+" instead of %20. Even though many of them do. Don't do it.
* I don't use Java much, and couldn't entirely follow those examples (this stuff is confusing to talk about -- as in all escaping issues) -- but it sure does seem like those parts of stdlib/commonly used libraries are pretty darn broken. (And I'm sure Java is not alone here -- there is a long history of devs being confused about this stuff. Again, as with just about any escaping-related issue).
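As a rough sketch of that consistent style (in Python; the helper name is mine, nothing standard):

from urllib.parse import quote

def encode_component(s):
    # safe="" percent-encodes everything outside the unreserved set,
    # so "+" becomes %2B and " " becomes %20 in path and query alike
    return quote(s, safe="")

value = "blue+light blue"
print("http://example.com/" + encode_component(value) + "?" + encode_component(value))
# http://example.com/blue%2Blight%20blue?blue%2Blight%20blue

Decoding is still position-sensitive, of course, because a legacy encoder may have used "+" for space in the query.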
The + as a space in a URL often originates from form data submitted via GET (the wisdom of which is a separate topic for discussion). At least we've gotten to the point now where we've pretty much dispensed with nested hierarchies of framesets, in which the "percent" encoding needed to be done multiple times to different parts of the URL so each target could decode its source and dispatch an appropriately-encoded source to its descendants, yielding %25252520 and other similar monstrosities.
You can use an encoded target=frame to pass it down as many levels as you want (encode the ampersand and the equals sign, and of course the percent sign for things that aren't at the immediate child level). The '90s sucked for web dev, but there was almost always a (scary and hackish) way.
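Just to make the stacked encodings concrete, each extra level of quoting multiplies the percent signs (a quick Python sketch):

from urllib.parse import quote

value = "a b"          # innermost value, containing a space
once = quote(value)    # 'a%20b'     -- encoded for the immediate child
twice = quote(once)    # 'a%2520b'   -- survives one extra decode
thrice = quote(twice)  # 'a%252520b' -- and so on down the frameset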
But which mechanism would do the substitution of those parameters in the HTML of the child frame? I still fail to see how it would be done without either Javascript or server-side code. (And I did web dev in the 90s, with frames, but apparently was not evil enough to think of such stuff.)
I wanted to make the same comment after reading the article. You're correct. The "+" literal in the URL is a painful remnant, and at least when encoding I think it should be avoided entirely -- so I think the article has it wrong when it says the string has to be encoded differently depending on its place within the URL. However, it's something to be aware of when decoding.
I'm always amazed and dismayed that this heap of junk has become the foundation of such a massive set of activities.
I mean, who thought it was a good idea to have different escaping for different parts of a URL? That's like having a car that uses a lever to turn right and a foot pedal to turn left.
Of course, the answer is: nobody. URLs weren't designed, they grew organically. The result is the ubiquitous mess we have now.
It does work. The organic, evolving web has survived and thrived while sanely-designed standards have withered and died. Maybe it's ultimately the best way to do things.
But it still bugs me that I have to deal with such weird, awful nonsense when I need to do something related to the web. TCP/IP, while certainly not lacking in quirks, is still a million times better. It's sad that we couldn't end up with something a little more consistent at the higher levels.
There seems to be a drive in some of the core standards to allow as flexible a "human" encoding as possible. Instead of writing a tight-as-possible encoding, it's like a game to see how complicated yet still unambiguous they can make the syntax.
HTTP and SIP, for instance, have ridiculous parsing rules and tons of edge cases. The SIP authors even put together a "torture test" RFC where they take glee in making the most insane messages that still parse. And when they don't parse, they suggest the implementation infer the meaning. Seriously.
HTTP and SIP allow newlines in headers that get consumed in parsing, so you can manually word-wrap lines. They allow comments in HTTP messages. SIP (HTTP too?) allows headers _in the URL_.
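The folding rule is simple enough to sketch in Python -- a continuation line starting with a space or tab extends the previous header's value (note that RFC 7230 later deprecated this folding):

raw = "X-Note: a very long\r\n  folded value\r\nHost: example.com\r\n"

headers = []
for line in raw.split("\r\n"):
    if line[:1] in (" ", "\t") and headers:  # continuation line
        name, value = headers[-1]
        headers[-1] = (name, value + " " + line.strip())
    elif line:                               # new "Name: value" header
        name, value = line.split(":", 1)
        headers.append((name, value.strip()))

print(headers)
# [('X-Note', 'a very long folded value'), ('Host', 'example.com')]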
When a non-committee (TCP/IP) or non-academic (SPDY) entity does a spec, they tend to remove a lot of this cruft and realise that human-readability comes second (unlike in programming languages).
That's an interesting thought. There is definite advantage in having a protocol be human-readable and human-writable. However, there's a vast gulf between a protocol that you can write if you have the spec at hand and take your time, and a protocol that you can comfortably write by hand routinely, like a programming language. The latter strikes me as a really bad idea, one responsible for much of this craziness.
In my reading of the RFCs, HTTP doesn't allow headers in the URI. But it does demand that you accept nested comments in header values -- and only there; these comments appear nowhere else.
I'm amused by the fact that a link is broken in the very first paragraph due (presumably) to a poor linkifier and what it thinks of URL encoding (or potentially other syntactic features of the source language). The document was probably done in Markdown, and the trailing right parenthesis in the link is not treated as part of the URL, so instead of being http://en.wikipedia.org/wiki/Java_(programming_language) the link target becomes http://en.wikipedia.org/wiki/Java_(programming_language with the parenthesis appearing after the link.
I think you're really close, except I think that they have an autolink regex to find unanchored URLs and anchor them. And that it is that regex that is failing to deal with Wikipedia's brackets in URLs.
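A hypothetical autolink regex like this one would reproduce the bug exactly:

import re

# naive autolinker: assume a URL ends at whitespace or ")"
autolink = re.compile(r"https?://[^\s)]+")

text = "see http://en.wikipedia.org/wiki/Java_(programming_language) here"
print(autolink.search(text).group(0))
# http://en.wikipedia.org/wiki/Java_(programming_language
# the trailing ")" lands outside the anchor, breaking the link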
I've been chugging away at a direct translation of RFC 2616 into parser/pretty-printer pairs in Haskell [1]. It's taken me a few times to get it right and I've learned a bit in the process such as RFC 2396 having an ambiguous grammar (look for 2 calls to `micro` in my code which penalize parsing to simulate a greedy algorithm). I cannot imagine almost any URL decoder/encoder 'in the wild' being actually RFC compliant.
There's also some difficulty with how RFC 2616 (the current HTTP/1.1 spec) demands using definitions of URLs from RFC 2396, which is not the current URI spec; but the updated URI spec seemed to me to be inconsistent with general usage of HTTP. If I remember correctly, it ended up implying that you cannot have a query string without sending your scheme as well, but I may be wrong.
The URI spec is hideously complex, but also very comprehensively defined in those RFCs. It's an interesting job to look through them.
It wasn't that URLs break RFC 3986; it was more an instance of Postel's Law, where some common URLs people might send and parse in HTTP messages every day actually weren't allowed in HTTP according to the upgrade path from RFC 2396. When I became worried about that possibility, I decided to jump back to 2396, since that's the RFC that 2616 truly inherits from.
That went from "oh, another article about the parts of a URI" to "wow..." very quickly. The path parameters were kind of neat; I was under the impression ";" could be used instead of "&" and was only for query string parameters (didn't even know path parameters were a thing).
I believe it used to be recommended that applications accept ";" to separate query string parameters, so that if a URL containing query string parameters was double-HTML-escaped for some reason...
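For what it's worth, Python's parse_qs used to split on both "&" and ";" by default; nowadays it takes a single explicit separator instead:

from urllib.parse import parse_qs

print(parse_qs("a=1&b=2"))                 # {'a': ['1'], 'b': ['2']}
print(parse_qs("a=1;b=2", separator=";"))  # {'a': ['1'], 'b': ['2']}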
As a career-long test dev I have to ask . . . how do you test all of that? Do you test all of that? You probably don't in practice.
Maybe I'm jaded. As a newly-minted SDET I once tried to test email addresses as defined by RFCs and eventually realized that I was trying to test adherence to standards instead of what actually happened.
My humble suggestion is to use a well-vetted URI-handling library and employ a combination of eyeballing it and some faith. (I hate to recommend faith but there you go.) And if there's not already a lib for that in Java (gotta be but I try to avoid Java) maybe some Java-head could write one for everyone . . .
The last time I came across this article I was excited to learn about 'path parameters' / 'matrix parameters' but I couldn't find anything that actually supports them, at least in node.js.
What surprised me the most is that URL components can contain encoded slashes (%2f). I tried replacing some of the slashes in my open tabs with %2fs. Some sites treat them as path separators, some don't and some treat them differently depending on which components they appear between.
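The ordering hazard is easy to see even client-side with Python's urllib: the %2F survives URL splitting, so you have to split the path first and decode each segment afterwards:

from urllib.parse import urlsplit, unquote

path = urlsplit("http://example.com/a%2Fb/c").path  # '/a%2Fb/c'
segments = path.split("/")[1:]                      # ['a%2Fb', 'c'] -- two segments
print([unquote(s) for s in segments])               # ['a/b', 'c']
# decoding before splitting would yield '/a/b/c' and invent a third segment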
I checked Python's urllib just now... it's not complete, but it seems to be mostly fine.
Basically, unlike Java, it doesn't give you an encode() function that takes an arbitrary string... the only urlencode() function expects data representing a query.
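For example (and note that urlencode still defaults to the legacy plus-for-space via quote_plus, though you can opt out with quote_via):

from urllib.parse import urlencode, quote

params = {"q": "blue+light blue"}
print(urlencode(params))                   # q=blue%2Blight+blue   (space -> '+')
print(urlencode(params, quote_via=quote))  # q=blue%2Blight%20blue (space -> %20)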
Obviously, you still have to remember to handle the quoting of each part of the URL separately... if you build your URL (actually just resource+query+fragment), and then just quote() it at the end, you're no better off than with Java.
E.g., if you have a path made of two segments, "yadda/yadda" and "foo/bar",

from urllib.parse import quote
quote("/".join(["yadda/yadda", "foo/bar"]))

yields 'yadda/yadda/foo/bar' (quote's default safe='/' leaves slashes alone), which might not be correct if what you want is actually

"/".join(quote(segment, safe="") for segment in ["yadda/yadda", "foo/bar"])

which yields 'yadda%2Fyadda/foo%2Fbar'.
Kinda error-prone, if you ask me.
Also, Python's urlparse seems not to handle path parameters correctly:
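For instance, with the article's "egypt/nile" example (output from Python 3's urllib.parse):

from urllib.parse import urlparse

print(urlparse("http://example.com/egypt;p=0/nile;p2=1;p3=2"))
# ParseResult(scheme='http', netloc='example.com',
#             path='/egypt;p=0/nile', params='p2=1;p3=2',
#             query='', fragment='')
# urlparse only splits params off the last path segment,
# so ";p=0" stays embedded in the path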
I think part of the confusion is that we think of encoding as "the way in which symbols are mapped onto bytes" -- but if we use that meaning, it's not correct to talk about "URL encoding", because each part of the URL cannot be converted to ASCII while ignoring the context (are the slashes meaningful?) and the place where it appears in the URL.
It's more like a "URL language", and if we talked about "parsing" and "formatting" instead, IMHO we'd have fewer ambiguities and misunderstandings.
PS: I realized just now that for the "http://example.com/egypt;p=0/nile;p2=1;p3=2" example, the doc suggests using the urlsplit function instead (which avoids parsing the path parameters altogether... kind of a non-solution IMO, but at least it's known).
Maybe they haven't experienced enough Java (especially the J2EE stack... it's a big and complicated stack, and unless you've really worked with it, you won't know it).