> We are well aware of the problem with URL.equals and URL.hashCode. The cause of the problem is due to the existing spec and implementation, where we will try to compare the two URLs by resolving the host IP addresses, instead of just doing a string comparison. Because hashCode has to maintain certain relationships with equals, namely if two objects are equal, they should have the same hashCode, the implementation of hashCode also tries to resolve the host in the URL into an IP address. As a result, we are facing problems with http virtual hosting, as described in the Description part, and performance hit due to DNS name resolutions.
> Unfortunately, changing the behavior now would break backward compatibility in a serious way, plus Java Security mechanism depends on it in some parts of the implementation. We can't change it now.
> However, to address URI parsing in general, we introduced a new class called URI in Merlin (jdk1.4). People are encouraged to use URI for parsing and URI comparison, and leave URL class for accessing the URI itself, getting at the protocol handler, interacting with the protocol etc. So, at present, we don't plan on changing the URL.equals/hashCode behavior and we will leave the bug open until Tiger, when we re-investigate our options.
When I was a Java developer ~8 years ago, "Use java.net.URI" was a whole sentence in my vernacular. Not just as a replacement for URL, but also for all of the dumb string concatenation people were doing to build URLs and query parameters.
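A small jshell sketch of what I mean, with made-up values: the multi-argument URI constructors percent-encode illegal characters for you, so you never have to splice strings by hand:

    jshell> new URI("https", "example.com", "/search results", "q=jalapeño", null).toASCIIString()
    $1 ==> "https://example.com/search%20results?q=jalape%C3%B1o"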
It was already ancient advice then, but I kept finding myself having to teach it to a new group of people, usually after some incident had happened (because they wouldn't listen until something had already broken).
Yes, I’m constantly giving advice to Ruby and JS developers that “URLs are not strings” and to use URI libraries for manipulation. This is also true for phone numbers, emails, IPs, etc.
Hell, even Rails maintainers rejected a PR I opened that tried to get Rails link helpers to work with URI objects.
I see why this was useful, though it still wasn't a good idea. This looks like it was added in ~1995, at which point (a) it was common for multiple hostnames to refer to the same server and (b) they were guaranteed to act identically because there was no Host: header yet.
Your b) was actually only ever true for the HTTP protocol, but URLs are not strictly limited to that protocol. They are not limited to any finite set of protocols at all. You could easily make up your own protocol which differentiates between host names in a similar way as modern HTTP does. And you could have made that in 1995.
You could have made one in 1995, but looking over https://www.iana.org/assignments/protocol-numbers/protocol-n... I think HTTP/1.1 was the first protocol to send the host name like this. (Though I could be missing something, since that's a lot of protocols.)
I might be wrong, it was a long time ago, but IIRC a different DNS rebinding attack was actually part of the reason this behavior was introduced to the URL class, to help protect against such attacks in Java Applets.
Hell, no need for a record update: just multiple A records for the same hostname, with a short enough TTL that two different URL instances could conceivably resolve differently.
You should also use a Gradle or Maven plugin called forbidden-apis to prevent people from using deprecated stuff like this. I've been adding it to pretty much every Java/Kotlin project for years, with some custom patterns that also prevent people from using shadowed packages that some libraries expose. In this case, forbidding java.net.URL.equals is probably a good idea. It might already be in the default patterns.
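For reference, a sketch of what the signatures file might look like (syntax from memory of the forbidden-apis docs, so double-check it; the "@ message" suffix is shown to the developer on a violation):

    # forbidden-signatures.txt
    java.net.URL#equals(java.lang.Object) @ does DNS resolution; compare java.net.URI instead
    java.net.URL#hashCode() @ does DNS resolution; use java.net.URI as Map/Set keys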
Also sounds like something a good static code analyzer should be able to detect.
I wish that were part of the Javadoc. I just looked up the JDK 8 Javadoc for URL, URL.equals, and URL.hashCode, and I would have no idea that it does DNS requests. I am a very seasoned Java dev (since 2001) and I am sure I read about this at some point but forgot. I do use URL from time to time, though, and I am pretty sure I have used it with a Map or Set somewhere.
A clear warning at the top of URL and URL(String) would be good, to make it recognizable in most IDEs when using it.
No, and it will probably never happen in javac. Some IDEs may try to do static code analysis to identify possible uses of URL.equals (e.g. Map<? extends URL, ?>), but I doubt anyone would invest in this feature.
Doesn't look like it's turned on by default though, which seems strange. Presumably if you care enough to go into the settings to enable that specific warning you're unlikely to make the mistake in the first place. It's exactly the type of non-obvious gotcha that would be nice to have your IDE warn you about out of the box.
I agree and disagree. I mean, the thing is that outside of personal projects it's not uncommon to inherit a codebase from other people, sometimes quite a large codebase. Such a warning can be useful when eliminating boo-boos (acting as a lint, I guess).
In this case Sonar will do a better job. IDE warnings are good for fixing them immediately - when they accumulate it’s very easy to stop paying any attention.
This is nonsense. The spec is wrong. There is no chance I would consider two URLs with equal local parts and query parameters, but different authorities resolving to the same addresses as equal -- that would be WRONG. The authority could easily provide different semantics for the same URL based on differing Host: values.
I'm really curious as to why they don't make it, say, protected where the JVM wouldn't actually barf at runtime (or would it? I'm a bit too enmeshed in Scala bincompat at the moment), but where compiling from source would error out.
If I'm wrong: why not invent a 'protection level' where this would be feasible... even if just to be able to remove old APIs and really force people to stop slapping @SuppressWarnings("deprecation") all over the place. That's not a solution to anything.
> "Two hosts are considered equivalent if both host names can be resolved into the same IP addresses"
Uhmm... yikes. Why are they resolving anything? A URL is a string, this should just be doing a string comparison. All of the parts of the URL object are strings or ints, so at worst they should just be comparing all of those individually, not resolving domains and comparing IP addresses. That makes no sense at all.
This might have something to do with the way that name resolution worked inside of Sun at the time. If there was a host, say
doppio.eng.sun.com
it would be referred to simply as "doppio" from within the engineering ("eng.sun.com") domain, or possibly as "doppio.eng" from other domains inside of Sun. It was fairly rare to use FQDNs inside of Sun to refer to other hosts inside of Sun. Thus, the following URLs all referred to the same resource:

    http://doppio/
    http://doppio.eng/
    http://doppio.eng.sun.com/
It's a plausible point of view that URL.equals() should report true for any two of the above URLs. (That doesn't mean that I think it was a good idea, though.)
Agreed, but consider that the internet, and state of HTTP hosting, was a very different place in 1995. Was the Host header even a thing back then? (And if it was, was it widely deployed?) This is clearly wrong in hindsight, but I could see why it was naively designed that way in the first place.
Regardless, given all Java's warts, this doesn't even make my top 10.
There's one thing in Javadoc that says it all:
`@since JDK 1.0`
Java is as good as Windows in the sense that its standard library and set of APIs is very stable and supports a lot of legacy software. I doubt there's a real need for the URL class in new code by now, given that the URI class was introduced in JDK 1.4, almost 18 years ago. There are plenty of dependencies, though, so URL will probably stay in the core library forever, but URI represents a superset of URLs, has a reasonable implementation of equals/hashCode, and is sufficient for the majority of uses.
I'm a seasoned Java dev (since before 1.4) and I didn't realize I was supposed to use URI. I did know of it, but I'm always confused about what to use and wind up using URL.
But, just like Vector remains in use in Swing, URL remains in use in the core Java library (looking at you, ClassLoader.getResource), so it's easy for me to have made that mistake.
Then again, I don't use it often; I mostly come in contact with it when using the ClassLoader API.
Eh, I dunno, I've been an on-and-off Java person since ~2000 (with consistent JVM development over the last 8 years or so), and I only learned about this URL gotcha within the last couple years.
Or if the computer running the JVM doesn't have a reliable network connection? Does the behavior of URL.equals() change if your network connection is down?
... and what happens when you try to compare a URL which contains a nonexistent hostname, like http://asdfghjkl.example.com/ ? Does that compare as equal to all other URLs with unresolvable hostnames?
One thing I find interesting is that it doesn't canonicalize the path portion at all. E.g.:
scala> new URL("http://localhost/foo/bar/baz").equals(new URL("http://localhost/foo/../foo/bar/baz"))
res0: Boolean = false
It's just funny to me that they went to the effort to do a full network request to see if the hostnames resolve to the same IP, but didn't bother to normalize paths.
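To be fair, java.net.URI's equals() doesn't normalize paths either, but at least it exposes normalization as an explicit, purely syntactic operation. A jshell sketch:

    jshell> URI.create("http://localhost/foo/../foo/bar/baz").normalize()
    $1 ==> http://localhost/foo/bar/baz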
> For example, because the "http" scheme makes use of an authority
> component, has a default port of "80", and defines an empty path to be
> equivalent to "/", the following four URIs are equivalent:
>
>     http://example.com
>     http://example.com/
>     http://example.com:/
>     http://example.com:80/
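For what it's worth, java.net.URI treats all four of those as distinct: its equals() is a syntactic, component-wise comparison with no scheme-specific normalization. A quick jshell check:

    jshell> URI.create("http://example.com").equals(URI.create("http://example.com:80/"))
    $1 ==> false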
* A URL is a parameter to a function returning a resource.
* A URL is a string, conforming to a certain format.
Neither of these specifies that the resource it points to is always the same resource, nor that it's always possible to resolve it. That's by design; hostnames change. The Web is not permanent, and that's why we have, e.g., HTTP 404.
Network topologies also change.
However, the default equals() comparison between two objects is supposed to compare those two objects, not the current topology of the Internet.
There is no way this behaviour ever made sense, nor any way it ever could make sense. It's moronic, through and through, for any language or library, to implement default equality in this way.
If you want to implement a comparer which does stuff like this and accepts a hostname resolver as a dependency, great. But there is simply no excuse for this kind of stupidity in a default dependency-less implementation.
It definitely doesn't make sense, because two unrelated hosts can share the same IP. If you try to fetch http://a.b.c.d/some/path, the request will fail because the server doesn't know which host you mean.
In this case, Java has thrown away the critical distinction between the two.
And it especially doesn’t make sense because resolving to the same IP on the server right now doesn’t guarantee it will resolve the same on the client, or even on the same server in 15 minutes time.
I mean, I can see host resolution equivalence being a useful piece of functionality. I could even see it being built-in to a std URL library. Just, y'know... not as the default equality operator.
And yes, I know that overriding .equals() does not change the behavior of == in Java. But I would still consider .equals() to be the "default" equality operator for objects.
I'm pretty sure the specification of this class predates name-based virtual hosting. After all, that only appeared in 1997, or perhaps late 1996, and it took a couple of years to be standardized. The URL class existed in Java in 1995 :)
(Edit: And really, virtual hosts are an extremely weird HTTP feature. What other internet protocol cares about what domain name you used to establish a connection?)
> (Edit: And really, virtual hosts are an extremely weird HTTP feature. What other internet protocol cares about what domain name you used to establish a connection?)
Kerberos? (btw, this is older than 1995) Or SMTP (a little bit different from the HTTP version, though).
Buy a domain name, point it at Enterprise Business Inc.'s server, and submit a link using your domain to a form they operate which does a
if (internalURLs.contains(submittedURL)) {
check. Then change your DNS records to point at some other server, once your domain is in their database and they assume it has already been validated as internal.
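A minimal self-contained sketch of that attack surface (all hostnames here are hypothetical):

    // Membership in a Set<URL> is decided by whatever DNS answers at lookup
    // time: hashCode() and equals() both resolve the hosts.
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class AllowlistPitfall {
        public static void main(String[] args) throws Exception {
            Set<URL> internalURLs = new HashSet<>();
            internalURLs.add(new URL("http://intranet.example.com/admin"));

            // If attacker.example currently resolves to the same IP as
            // intranet.example.com, this check passes -- and the answer can
            // flip whenever the attacker edits their DNS records.
            URL submitted = new URL("http://attacker.example/admin");
            System.out.println(internalURLs.contains(submitted));
        }
    }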
Still, this sort of expansionary philosophy in library design does give some cause for optimism. In general, library authors want to make their specs efficiently implementable, so when a method has this sort of semantics in an official spec, it makes one suspect that a breakthrough in AI/algorithms research is right around the corner.
It's possible to have URLs that are not equal as strings but resolve to the same thing so this actually makes perfect sense. If you really want to test if URLs are equal as strings, I'm sure you could still do it that way.
Not only is it possible to have two hostnames that resolve to the same IP address, hostname -> IP address is not unique either:
$ dig amazon.com
...
;; ANSWER SECTION:
amazon.com. 60 IN A 205.251.242.103
amazon.com. 60 IN A 176.32.103.205
amazon.com. 60 IN A 176.32.98.166
Something resolving "amazon.com" is welcome to resolve it to any of those three addresses. Code that tries to "resolve" these IPs is at the very least coming perilously close to deciding that amazon.com != amazon.com, nondeterministically, if the underlying resolution call changes its behavior.
The further obvious change of trying to compare the whole record just gets worse; what is the answer if domain 1 resolves to IPs (V, X, Y) and domain 2 resolves to (X, Y, Z)?
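You can watch this from Java directly; each call may return several addresses, in an order that varies by time and resolver (a sketch, output obviously varies):

    import java.net.InetAddress;
    import java.util.Arrays;

    public class MultiARecord {
        public static void main(String[] args) throws Exception {
            System.out.println(Arrays.toString(InetAddress.getAllByName("amazon.com")));
        }
    }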
Oh, and let's not forget, DNS can be different depending on your geographical location, so in the US two domains may happen to resolve to the same IP but in Europe they may be different. What's the use of a URL equality operator that changes behavior based on where the user is? Whatever the use of such a thing may be (I mean, yeah, I get the internet contrarian impulse, yeah, I can construct some bizarre situation in which it is useful), it is certainly less useful than a simpler operator.
Basically, the entire idea is just fundamentally flawed and shouldn't be used. I make no claim to have exhaustively enumerated all the ways in which this is a bad idea, merely added to the pile, and demonstrated sufficient evidence for my claim that it shouldn't be used.
Edit: They know, see yzmtf2008: https://news.ycombinator.com/item?id=21766138 and consider my post here a lightly educational post on further reasons why it's a bad idea. If you only take one thing away, remember, a function that takes a hostname and yields an IP doesn't have very many useful properties beyond just that it yields an IP or failure. You can count on very little else about that IP. It should be treated as opaque and not compared, stored (other than logging), etc.
Unlike virtual hosting, which wasn't possible when java.net.URL was written, this kind of DNS behavior had been around for years: https://tools.ietf.org/html/rfc1794
> It's possible to have URLs that are not equal as strings but resolve to the same thing so this actually makes perfect sense.
No, that doesn't make "perfect sense." Two different URLs are different, they cannot be equal if they're different, and a library called "URL" shouldn't use voodoo-magic to say that two different URLs are equal when they're not.
If it was e.g. Net.DNS.Equal(Uri, Uri), perhaps. But even then it is ambiguous, as DNS has multiple record types, not just "A" records. So is it pulling all records (A, AAAA, MX, etc.) and comparing them all? Or arbitrarily comparing just one?
But as it stands, it is a URL library that is ignoring the URL part of the URL and using resolution to decide equality instead. It is nonsensical. And even if they did resolve identically they may not be treated identically by routing or endpoints.
This seems absolutely ripe for abuse, as now there's a nice string you can search for in GitHub to see where that's used as a security feature. I'm imagining things like "if URL("http://example.com/some/path").equals(URL(checkedUrl)) { return AllowEditRights }", and checkedUrl = "http://wiki.example.com/some/path" or similar.
That's an interesting way to think about URLs. If both `url.com` and `url2.com` resolve to the same IP then they are functionally the same. It's like comparing two strings whose variable names are different but that resolve to the same space in memory.
It also means that if one URL changes, then they could be equal sometimes and not others.
My company Datastreamer has been around for a decade. We provide crawl data (usually a massive amount of crawl data, north of 300GB per day) to our customers.
... so we have a LOT of real-world experience pushing data to customers in production over long time periods.
Here's what we've learned.
Networking libraries around HTTP have been fundamentally broken for a long time, and they're broken in pathological ways that you don't realize until years later in production.
DNS caching is a good one. A lot of systems do infinite DNS caching. Java, until at least Java 8, does infinite DNS caching.
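If you're stuck with those defaults, the positive DNS cache is at least tunable; a sketch, which has to run before the first lookup (-1 means "cache forever"):

    // networkaddress.cache.ttl is a java.security.Security property, in seconds.
    java.security.Security.setProperty("networkaddress.cache.ttl", "60");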
Some default to an infinite HTTP timeout. Timeouts are awesome; you should use one. Without a timeout, if the network breaks, your code just locks up.
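With the Java 11 HttpClient both knobs are at least explicit; a minimal sketch (host and durations arbitrary):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class TimeoutDemo {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(10))   // TCP connect budget
                    .build();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                    .timeout(Duration.ofSeconds(30))          // whole-request budget
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }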
Some libs provide no API to change TCP buffer sizes (which you have to do at the kernel level).
So about 5 years ago we took a harsh stance. NO CUSTOM CLIENTS.
We have a streaming firehose client that we implemented from the ground up to do everything properly. The API is literally that we just stream JSON files to disk.
It's a docker container now so not too hard to deploy.
Your job is just to listen to the disk and wait for new files to be written. We do a move from a tmpdir to the final dir so that the entire file is written and you don't have to worry about partial reads.
About 80% of our customers love it. The other 20% of customers seem to initially hate it and we have to explain to their CTO or senior architect that, no, you DO NOT want to implement this from the ground up.
What happens is that it works immediately, but then 18 months in it breaks pathologically, and everyone who ran it has moved on or it's in some datacenter that no one has access to.
This causes us to break our SLAs and means we have upset customers.
This decision by far was one of the best decisions I've ever made and has really helped our growth and stability over time.
It's really really really nice to keep customers for 5-10 years. They're happy and you get steady checks and predictable growth.
A long time ago I remember seeing those timeout parameters when I took a Java class and saying, "ehhh, it's probably fine to leave it as null". Also, the docs at the time seemed to agree that an infinite timeout was totally fine.
Been there. I remember fixing a performance bug ages ago where we had URL (or maybe some other address class) used as a key in a HashMap (which seemed like a perfectly valid thing to do with a value class). We were doing literally thousands of DNS lookups for what seemed like the most trivial algorithms.
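For anyone who hasn't been bitten: a sketch of how innocently the lookups get triggered (hostname hypothetical):

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    public class UrlKeyPitfall {
        public static void main(String[] args) throws Exception {
            Map<URL, Integer> counts = new HashMap<>();
            counts.put(new URL("https://example.com/a"), 1);  // DNS lookup (hashCode)
            counts.get(new URL("https://example.com/a"));     // fresh instance: another lookup
            // The hash is cached per URL instance, so each new instance pays again.
            // java.net.URI keys avoid all of this: pure string comparisons.
        }
    }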
equals() and hashCode() are probably among the weakest points of Java. While it seems like an obvious candidate for a contract, the issue has always been that one person ends up defining equality for everyone, when different use cases will often warrant different definitions of equality. Are objects equal if they have the same identity? If they have the same data? If they resolve to the same thing? It's easy enough to leave them unimplemented, but the issue then is a lack of standard library support for providing custom hash and equality functions for Maps and Sets.
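One stopgap that does exist: the sorted collections take a Comparator, which sidesteps equals()/hashCode() entirely. A jshell sketch imposing plain string comparison on URLs:

    jshell> Set<URL> urls = new TreeSet<>(Comparator.comparing(URL::toString))
    urls ==> []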
I agree! I get that it's easy to critique in hindsight but it's clear that they eventually solved this sort of thing well via the Comparable/Comparator relationship.
It's less of an issue, so I get why it's been backlogged, but it's one that I'd love to see worked out.
What's frustrating in hindsight is they knew the URL class was obviously terrible, but never made a serious effort to stop people using it.
I'm fine with not breaking old jars, but they could have made newer javac's by default die with a notice "use -legacy to build this". Same thing with Date, Vector, all the broken thread semantics, you name it.
These classes should have been relegated to a handful of ancient jars, they shouldn't keep popping up in modern Java because newbies have to learn a whole host of classes they aren't supposed to use.
I sorta agree when it comes to Vector and some of the Thread stuff, but URL is only broken in hashCode() and equals(). The rest of the class is perfectly fine. It's a great simple curl that takes you pretty far before you have to reach for a real HTTP client lib.
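A sketch of that simple-curl use (host hypothetical):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SimpleFetch {
        public static void main(String[] args) throws Exception {
            // What URL is still reasonable for: opening the resource, not comparing it.
            try (InputStream in = new URL("https://example.com/").openStream()) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }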
Java has a bunch of relics like this: holdovers from the more zealous OOP days of the past, when updating objects was supposed to update the database, when URLs stood for the concept behind a domain name instead of the data of a literal URL, etc. I think most people have moved away from this mindset because it leads to code that is hard to reason about.
Java could really benefit from more pruning/obsoleting. There are far too many "gotcha" pieces that nobody uses because better alternatives have come along, or whose ideas have been lost to time. Instead of leaving them in as traps for newbies, they should mark them as bad or outright remove them.
There's plenty of software worth somewhere between $10^11 and $10^12 which is still working because all those legacy APIs still exist. Java is not <put the name of any other platform in permanent beta stage>. It prospered for 20 years and it will stay alive for another half century, because it is conservative enough.
“Cleanup” vs “backwards compatibility” is a really tough problem for language designers/maintainers. Breaking backwards compatibility leads to a fractured community (think Python 2/3), as well as resentment from all the users who want stability, and don’t want to waste time on painful upgrades. But over time, not breaking backwards compatibility leads to all sorts of weird warts, adding lots of overhead to learning the language, and using it safely.
Ultimately I think it’s a somewhat natural part of language evolution that they accumulate cruft over time, because the costs of breaking backwards compatibility are too great. And after a few decades or so, new languages come out that are (currently) much cleaner, and they take over until the cycle repeats. i.e. “cleaner Java” will not be a new version of Java, but a new language solving similar problems (Go?).
Ok kids. URL was deprecated in favor of URI before some of you were born. It's easy to be a harsh judge now, but back when Java was first developed the concept of a "design pattern" did not even exist. In fact, the development of design patterns was initially driven primarily by people figuring out good ways of doing things in Java.
> URL was deprecated in favor of URI before some of you were born.
It hasn't been deprecated, it's still in use in the standard library[1][2][3] and they don't offer additional classes that accept URIs.
> Easy to be a harsh judge now but back when java was first developed...
They've made no effort over 11 major versions to diminish its use beyond a brief note that encoding is easier with URIs. There's not even a warning on the equals or hashCode method, let alone the class documentation, it just quietly mentions that it's resolving a name. There's no way people are finding out that you should avoid URL except through lore.
Ok, Java was influential in a ton of subsequent design patterns. For a long time Java was pretty much the test-bed for these ideas. And the other half of the point remains: modern design patterns were not part of the standard toolbox at the time. To be clear, I am not defending the design of the URL class, just that you need to judge it in context.
I don't mean @deprecated. Just that the recommendation has been to use URI instead of URL wherever possible since URI was introduced. URL is not needed any longer with the new HttpClient stuff in Java 11.
Submitter here. Someone asked me how often this is actually a problem given there's already the URI class. Unfortunately GitHub doesn't do global exact-match searching, but randomly clicking around I found this in a minute:
Three of the truisms in my varied programming career are:
* DNS will fail even if it is implemented correctly on your clients' network
* It probably isn't implemented correctly on your clients' network
* Your software is probably doing more DNS queries than you think
This seems like a particularly unfortunate example but things like this are not uncommon. Doing any kind of RPC, even just to a server on your local machine? Half the time your library is doing DNS queries under the hood for no good reason. Or performing reverse DNS queries just to display a hostname in a log file.
It's easy to accidentally trigger a lookup. If it happens in a loop, an operation that should take 0.2 seconds now takes 30 minutes.
“This method assumes that path is a directory if it ends with a slash. If path does not end with a slash, the method examines the file system to determine if path is a file or a directory.”
People often overlook this and then wonder why their app stutters randomly when the UI thread gets blocked on this NSURL ctor :^)
> The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL().
and at that point you might as well just use URI. The resource resolution functionality is basically the only thing URL offers over URI.
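The division of labor being suggested there, as a sketch:

    import java.net.URI;
    import java.net.URL;

    public class UriFirst {
        public static void main(String[] args) throws Exception {
            URI uri = URI.create("https://example.com/a?b=c");  // parse, compare, store as URI
            URL url = uri.toURL();        // convert only when you need the protocol handler
            URI back = url.toURI();       // and back, if a legacy API hands you a URL
            System.out.println(uri.equals(back));  // true, via pure string comparison
        }
    }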
They quietly mention at the bottom that they recommend using URI.
I don't think it's unreasonable to ask that an API be deprecated when core functionality, like comparing the thing, is broken. The stdlib docs should provide best practices to new developers and clearly mark old, broken APIs as deprecated.
Same with Date[1]; you have to make a defensive copy every time you accept or return it, or you risk your internal state being changed. You're not going to pick this up from the docs unless you have (a) learned about the problems with mutability, and (b) carefully scrutinized the docs and noticed, oh hey, there are all these setter methods.
What's nuts is the APIs in Date that they did deprecate aren't even broken, they're just not very general! It's the Date object itself that is hilariously dangerous to use.
Actually, some of the deprecated methods on Date are broken. If you have a Date representing the wrong date and invoke the wrong sequence of deprecated methods, you can end up with Exceptions. Admittedly, the only such bug I'm aware of affects only dates before October 15th, 1582.
To be fair, more modern languages have the same problem. For instance, Go's "syscall" package is considered to be a bad idea to use (and people are pushed to use "golang.org/x/sys/unix" instead).
I can see why a method like this could be helpful. The URL RFC is incredibly flexible about what constitutes a URL. The hostname/address section can be particularly hairy, to the point that it likely becomes impractical or infeasible to compare hostnames for equality. Comparison via DNS resolution is likely a simple, desirable solution to a real problem.
That being said, URL.equals() is a terribly opaque and non-obvious method signature for performing DNS-based comparisons of hostnames. It lacks any indication that calling it involves network IO.
This is only a desirable solution if you ignore basic internet stuff such as round robin resolution, non-infinite-TTL, SNI and Host headers. Even the authors realized it was a bad idea...
I can't see when it can be helpful. Resolving an IP shows you only a subset of DNS information; basing equality on this behavior makes zero sense. It should have been deprecated and removed a long time ago. Backward compatibility like this is cancer.
IntelliJ's code analysis (and other tools as well) warns about this. In particular, it recommends not sticking URL objects in Set or Map structures, due to the inherent equality check.
Why is URL.equals(url) doing DNS Resolution comparison under the hood? That seems wildly outside the scope of what the URL library should be doing. It might be a useful feature, but there's no rational reason it should be here.
Some quick googling suggests people are using java.net.URI instead to bypass this poor design.
Well the "why are they doing DNS resolution" comes down to "because the contract requires it":
> Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
> Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
> Since hosts comparison requires name resolution, this operation is a blocking operation.
So it's not an implementation bug, it's a requirements bug. Now why on Earth they thought this was a reasonable requirement for an equals() method, that's a fair question.
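The requirement is easy to observe by timing it (hosts arbitrary; the comparison pays for two resolver round trips):

    import java.net.URL;

    public class BlockingEquals {
        public static void main(String[] args) throws Exception {
            URL a = new URL("https://example.com/x");
            URL b = new URL("https://example.org/x");
            long t0 = System.nanoTime();
            boolean eq = a.equals(b);  // resolves both hosts before comparing
            System.out.printf("equal=%s in %d ms%n", eq, (System.nanoTime() - t0) / 1_000_000);
        }
    }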
---
To quote a JDK bug ticket: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4434494
> We are well aware of the problem with URL.equals and URL.hashCode. The cause of the problem is due to the existing spec and implementation, where we will try to compare the two URLs by resolving the host IP addresses, instead of just doing a string comparison. Because hashCode has to maintain certain relationships with equals, namely if two objects are equal, they should have the same hashCode, the implementation of hashCode also tries to resolve the host in the URL into an IP address. As a result, we are facing problems with http virtual hosting, as described in the Description part, and performance hit due to DNS name resolutions.
> Unfortunately, changing the behavior now would break backward compatibility in a serious way, plus Java Security mechanism depends on it in some parts of the implementation. We can't change it now.
> However, to address URI parsing in general, we introduced a new class called URI in Merlin (jdk1.4). People are encouraged to use URI for parsing and URI comparison, and leave URL class for accessing the URI itself, getting at the protocol handler, interacting with the protocol etc. So, at present, we don't plan on changing the URL.equals/hashCode behavior and we will leave the bug open until Tiger, when we re-investigate our options.