When URL parsers disagree (canva.dev)
156 points by hannob 8 months ago | hide | past | favorite | 56 comments



When I designed Cloudflare's filters, a.k.a. Firewall Rules... we had an API written in Go, and the edge written in Lua. The risk I foresaw was that building and testing rules that act on URLs and perform matching in one language (Go), then deploying and executing them in another (Lua), meant that any difference in the behaviour of the two engines (no matter how closely they implemented the same spec) could produce a rule that behaved ever so slightly differently at the edge... for me, that was a huge hole: if it were known, it would be leveraged.

The first piece of Rust in Cloudflare was introduced for this reason and for this project. We needed a single piece of code to implement the filters that could be executed from Lua, and that Go could call too. It needed to ensure that only a single piece of code was responsible for things like UPPER(), LOWER(), URL_PARSE() precisely to prevent any differences being leveraged.


WAF reminded me of the Cloudflare outage on July 2, 2019 due to small regex mistake [1]

[1] https://blog.cloudflare.com/details-of-the-cloudflare-outage...


Yup, and the filters described above replaced the pcre regex matching in lua.

That was already planned at the time of this incident, but the incident accelerated the pace at which the new system was put in place, and also introduced controls and processes for the release of WAF rules (regardless of which engine they were applied in).


Out of interest, was this the project that eventually became wirefilter [1]?

[1]: https://github.com/cloudflare/wirefilter


Yup, that's the one.

Uses Wireshark's display filter syntax to implement the ability to match traffic at any part of the stack.

It's still used internally, but clearly they stopped updating the OSS version just after I left. When I left they were updating it to be fully compatible with everything OWASP, which meant additions beyond the Wireshark definition of display filters. Everything firewall and WAF was moving to the filters, as were other traffic-matching features.


Thank you for releasing it! I adopted wirefilter for a firewall rule testing project, firewalker [1]. But indeed, I wish Cloudflare kept maintaining its OSS version.

[1]: https://github.com/SerCeMan/firewalker/


Unfortunately cloudflare has a poor OSS track record. Either not maintaining the public version, or promising to open things that then don't see the light of day (quicksilver database and replication, and their Rust reverse proxy and nginx/lua replacement - both of which were announced but never released).

Most of what is OSS at cloudflare came in from elsewhere (V8) or was needed for collaboration (IETF), rather than started at cloudflare and opened.


This is such a cool example of real-world defensive engineering. I think I'll refer back to it when I want to illustrate the concept to somebody.


For curious readers, it sounds like the URL equivalent of header smuggling https://en.wikipedia.org/wiki/HTTP_request_smuggling


Why Rust instead of cgo?


Lots of reasons that on their own were not enough to decide but together made it compelling.

We weren't going to use C (Cloudbleed); cgo didn't give us as much control over the FFI as we wanted and produced slightly harder to read output (maintainability, for me, is everything: always write code that the drunk-at-2am future version of yourself can understand when you're paged); we didn't necessarily want another garbage-collected process despite being good at those (Cloudflare's DNS server is written in Go); we wanted memory safety and wanted to keep things small (Go and Rust stand out); we wanted compile-time safety over runtime errors; we believed that Rust macros would help make the filter code more readable / maintainable than rolling our own parser in Go or using YACC (see the earlier point about your future self maintaining this at an ungodly hour)... no single thing was a determining factor, but the reasons accrued until the case felt overwhelming.

We certainly were not looking to add another language at the time as being first to do so incurs pushback from the org, adding into all of the build and release chains, and the typical higher bar for proving you know what you're doing. Once done everyone can benefit from having the option of another language and being able to select the right tool for the job, but going first incurs a legitimate cost.


A student and I have been using coverage-guided grammar-aware differential fuzzing to discover bugs in URL parsers for a while now. There is extreme variation in this space; it's trivial to turn up meaningful bugs in widely-used URL parsers.

".://" is a particularly egregious example. (and, by the same principle, "evil.com://good.com")

- Python 3.6's urllib.parse sees the "." as the URL's scheme, and an empty authority.

- Python 3.11's urllib.parse sees the entire ".://" as the URL's path.

- urllib3.util.parse_url sees the "." as the URL's hostname, the ":" as the separator for an empty port number, and the "//" as the path. (this is one of the most downloaded packages on PyPI)

- Boost::URL rejects the URL outright.

If you're going by RFC 3986, then only Boost::URL is exhibiting the correct behavior. If you're going by the WHATWG URL standard, then I don't know which one of these behaviors (if any) is correct.
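For the curious, the stdlib half of this disagreement is easy to reproduce. A minimal sketch (only urllib.parse is shown, since urllib3 requires installing a package; exact output varies by Python version, as noted above):

```python
from urllib.parse import urlparse

# On recent Pythons "." cannot start a scheme, so the whole input is
# treated as a path; Python 3.6 instead reported "." as the scheme.
print(urlparse(".://"))

# "evil.com" is accepted as a scheme, because "." is a legal scheme
# character everywhere except the first position.
u = urlparse("evil.com://good.com")
print(u.scheme, u.netloc)  # evil.com good.com
```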

If you're interested in collaborating on this project, please send me email. My address is in the footer at https://kallus.org


WHATWG rejects ".://", yeah? It's not the most readable spec, but there's a tester here: https://jsdom.github.io/whatwg-url/

I recently published bindings for ada (an implementation of the WHATWG URL Spec) for Python with the hope of having something that follows a single standard.


Indeed, ".://" is a hard error under the WHATWG URL spec. If the URL doesn't start with an ASCII alpha character, then the scheme start state transitions to the no scheme state [0]. In that state, if there's no base URL that the input is relative to, then parsing must fail [1].

However, "evil.com://good.com" is a valid URL string per WHATWG, since its state machine accepts "." within the scheme after the first codepoint. The resulting URL object has a scheme of "evil.com", a host of "good.com", an empty path, and a null port, query, and fragment.

[0] https://url.spec.whatwg.org/#scheme-start-state

[1] https://url.spec.whatwg.org/#no-scheme-state
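That scheme-start rule is small enough to sketch directly. A hypothetical helper in Python (not the spec's full state machine, just the scheme grammar as described above):

```python
import re

# WHATWG scheme shape: one ASCII alpha, then any mix of ASCII
# alphanumerics, "+", "-", and ".".
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

def is_valid_scheme(scheme: str) -> bool:
    return _SCHEME.fullmatch(scheme) is not None

print(is_valid_scheme("evil.com"))  # True: "." is allowed after the first codepoint
print(is_valid_scheme("."))         # False: must start with an ASCII alpha
```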


It’s not fair to call it a hard error: it’s only invalid as an absolute URL. As a relative URL, it’s fine, just like “example.com” is invalid as an absolute URL but valid as a relative URL.


True; I neglected to mention relative URL parsing, mostly since most URL manipulation I've personally done has been with absolute URLs.


As another example of an egregious difference that might bite you, compare this JS in recent Node vs web browsers:

    new URL('postgres://user:pass@host/db')
Node parses this as if it were a web address, into protocol, username, password, host, pathname, etc.

Browsers just parse out the protocol and call everything else a pathname.


Browsers do that for a good reason.

They used to do the protocol, username, password, host, pathname, etc. But scammers used it to have a user name that looked like a well-known URL, while actually directing the user to a domain under the scammer's control. Not honoring the spec was therefore a security feature.


Why not just delete the url bar contents after attempting that url or something? Or take to a warning page.


Any such decision requires no longer honoring https://www.rfc-editor.org/rfc/rfc1738.


Please be more specific.


Section 3.1 specifies

    /<user>:<password>@<host>:<port>/<url-path>
Which means that if you're presented with that and don't send it to that host, you've violated the RFC. But this was demonstrably resulting in confused users being sent to scammers' domains.


That ("if you're presented with that and don't send it to that host, you've violated the RFC") has nothing to do with the comment you responded to, which described a UI choice compatible with RFC 3986, which says "Applications should not render as clear text any data after the first colon (":") character found within a userinfo subcomponent" (and which also goes on to say "Applications may choose to ignore or reject such data when it is received as part of a reference").


`URL` in Node and browsers is supposed to be compatible; this looks like a bug (browsers ignoring protocols they don't know).


URL parsing semantics are defined by and dependent upon the scheme. (That's spec.) By definition, if you don't recognize the scheme, you cannot guarantee a correctly parsed URL. From RFC 1738:

The Internet Assigned Numbers Authority (IANA) will maintain a registry of URL schemes. Any submission of a new URL scheme must include a definition of an algorithm for accessing of resources within that scheme and the syntax for representing such a scheme.

The behavior described (extracting the scheme and treating the rest as an opaque string) is pretty much the only thing you can do when the scheme is unrecognized. (The other options being to throw an exception or return null.)

Based on the description, it sounds like neither is breaking spec—it's just that Node supports "postgres". That is, unless it's true that Node's URL implementation is supposed to match what browsers do, in which case Node is breaking spec—its own.


RFC 3986 provides a generic URI grammar that is not scheme-specific, though other standards that define URL schemes may choose to subset that grammar as they see fit. If a URL parser does not recognize a scheme, I would expect it to parse the URL using the generic parsing procedure.
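For what it's worth, Python's urllib.parse behaves this way: the generic grammar yields full components even for a scheme it knows nothing about (a sketch, using the same postgres example from upthread):

```python
from urllib.parse import urlparse

# An unregistered scheme still parses into components, because the
# "//" authority syntax belongs to the generic RFC 3986 grammar.
u = urlparse("postgres://user:pass@host/db")
print(u.scheme, u.username, u.password, u.hostname, u.path)
# postgres user pass host /db
```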


Well, they don't. Browsers implement the WHATWG's spec, which says not to do that and was created to supersede RFC 3986.


When I worked at one of the tech giants that develops a certain suite of office apps, the C++ class they used to model URLs in the cross platform layer had a flaw around escaped characters. I worked in the group that developed the iOS/macOS versions of the apps and we had a number of kludges in place to deal with conversion between these C++ classes and NSURL and CFURL. In my time on different occasions devs discovered the flaws and composed lengthy emails explaining the fixes that needed to be made, but the C++ class was too entrenched to be fixed. That was about 5 years ago, I doubt it has changed.


I remember having some difficulty with characters for HTML URLs on the clipboard (for images) that happened only on the Mac version of a certain office app, but not the Windows version.


I recently had a wild ride with using java.net.URL and java.net.URI, to parse and deconstruct URIs so they can be stored in a database with at least a little normalization.

The documentation page is great but the API is bizarre. So many different constructors with different permutations of whether or not they expect various segments to be URL-encoded.

If nothing else, the complexity convinced me not to write my own!

- https://docs.oracle.com/javase/8/docs/api/java/net/URI.html

- https://docs.oracle.com/javase/8/docs/api/java/net/URL.html


One of the weirdest things about Java's URL API is this:

    equals
    
         Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
If you're doing bulk operations on stored URLs, you may just be sending out tons of DNS resolutions without ever calling an obvious networking API (like openConnection). If your DNS is down or slow, equals() will block.

I'm sure most Java programmers here have learned this lesson, but new programmers get surprised by it every time.


So I'm certainly not a Java developer and I don't know very much about URLs, but isn't that a semantically surprising comparison as well? You can host wildly different websites on the same server/IP as a temporary or permanent arrangement, but I'm not sure I'd expect that to make you group them together.


That's an ancient API that is de facto deprecated by best practices. Java 1.0 was far from perfect in its design decisions.


From the docs:

> The java.net.URL constructors are deprecated. Developers are encouraged to use java.net.URI to parse or construct a URL.


Except it accepts some inputs that URI balks at.

I don't recall the exact case, but when I was trying to tolerantly parse a URI, I fell back to URL and then converted it to a URI.


It is beyond surprising, it is horrifying.

Also, URI not URL has been the preferred choice for years and years and years.

But the footgun persists, for backwards compatibility


java.net.URL is deprecated. It’s a mistake from the 90s.


Learned it, but completely forgot about this. What remained was a vague suspicion that the benefits of clear type separation (which you'd lose by using a plain string) might not necessarily be worth it. (And then you eventually start rolling your own, and one day someone goes wild with the liberties allowed in the characters of the username:password prefix...)

To sum things up: never ever use java.net.URL, it's a bad http client and it's worse at everything else you might expect it to be.


It doesn't really make sense to use URL today (it does not even support encoding). URI can be used to identify the resource and then more explicit API can be used for I/O instead of URL.openConnection().


It's up there with "java.text.SimpleDateFormat is not threadsafe". SONAR catches that at least.


Just have to convince your teammates to install SonarLint or convince management to pay for SonarQube.

I’ve been successful at neither.


Then just claim the credit for fixing their bugs.


Yeah, a golden rule of Java dev is don't use URL. Use URI instead.

URL is one of those "it seemed like a good idea in the 90s" legacy warts.


That description reads like an xkcd comic.


It is worth mentioning that Java started supporting URIs even before RFC 2396 became a standard. And then RFC 3986 came along and made incompatible changes, e.g. by moving the asterisk to the reserved characters (I could not find an explanation why in the W3C email archive - someone asked this question at the draft stage, but it was not answered there).


Speaking of URL parsing differences, Python's urllib library recently had a CVE for failing to strip whitespace from the scheme and domain.

https://github.com/python/cpython/issues/102153


One thing that drives me mad on the Python side of things is the way frameworks and middleware (even the WSGI spec) corrupt the most basic "quoting" mechanism of RFC 3986. All of these layers of trying to be clever or friendly subvert the correctness of the original concept. And these broken approaches become precedents that reinforce a culture of web breakage and compatibility hacks.

Specifically, "reserved" characters have a different meaning from their URL quoted form. It is obvious that the intent of this was to allow their use as meta-syntax to separate parts of a URL that might then have the quoted form embedded in them. But every stupid middleware layer that tries to unquote _before_ parsing and routing is throwing away this information and making it impossible to know what was the original meta-syntax and what was embedded quoted material that might be a fragment derived from arbitrary, user-generated content.


That sounds frustrating. Can you provide an example of how they mangle quoted URLs?


If I can get the formatting to work here... the simplest example would be:

   /foo/element%2Ftwo/bar
this three element path ought to be easily parsed and routed to a handler for "foo" that can have two args which would receive the path elements:

   element%2Ftwo
   bar
in other words, one path element has a quoted slash in it. But instead, it gets unquoted at the WSGI layer into:

   /foo/element/two/bar
and then such routing would fail entirely or get the arguments wrongly chopped up, depending on how routing patterns work in the framework.

This same problem applies to all the "reserved" chars from the spec, e.g. slashes, semi-colons, colons, equal-sign, asterisk, parentheses, and more.
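The information loss is mechanical: once the path is percent-decoded, the embedded slash is indistinguishable from a separator. A sketch using the stdlib's unquote:

```python
from urllib.parse import unquote

raw = "/foo/element%2Ftwo/bar"

# Route first, decode second: split on the reserved "/" while it is
# still unambiguous, then unquote each segment.
good = [unquote(seg) for seg in raw.split("/")[1:]]
print(good)  # ['foo', 'element/two', 'bar']

# What the middleware described above does: decode the whole path up
# front, destroying the distinction between "/" and "%2F".
bad = unquote(raw).split("/")[1:]
print(bad)  # ['foo', 'element', 'two', 'bar']
```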


This is also a pretty cool presentation on the topic: https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-Ne...


Browsers have discrepancies too of course. Here's an interesting Chromium bug I've been following: https://bugs.chromium.org/p/chromium/issues/detail?id=125253... and an associated WHATWG discussion: https://github.com/whatwg/url/issues/606

Some multiple examples of browsers disagreeing: https://www.yagiz.co/url-parsing-and-browser-differences


On the topic of disagreeing parsers, this is how the "Psychic Paper" iOS vulnerability worked: https://blog.siguza.net/psychicpaper/


XInclude is just like XML External Entities in that it never should have existed because of the glaring security issues.

Worse, it seems most XML parsers have XXE enabled by default.


I think XXE would have been fine if it followed the same rules as AJAX does in web browsers. Although I guess it's a glaring example of XML not being sure what level of abstraction it's trying to be.

The real WTF is recursive entity expansion (billion laughs).


I have temporarily forgotten where I got this from but below is a URL test I have used. Perhaps someone here recognises it.

1 indicates valid

   http://foo.com/blah_blah 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://foo.com/blah_blah/ 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://foo.com/blah_blah_(wikipedia) 1 1 1 1 1 1 1 1 1 0 1 1 1
   http://foo.com/blah_blah_(wikipedia)_(again) 1 1 1 1 1 1 1 1 1 0 1 1 1
   http://www.example.com/wpstyle/?p=364 1 1 1 1 1 1 1 1 1 1 1 1 1
   https://www.example.com/foo/?bar=baz&inga=42&quux 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://✪df.ws/123 0 0 1 1 1 1 0 1 1 1 1 1 0
   http://userid:password@example.com:8080 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid:password@example.com:8080/ 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid@example.com 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid@example.com/ 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid@example.com:8080 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid@example.com:8080/ 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid:password@example.com 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://userid:password@example.com/ 0 1 1 1 1 0 1 1 1 1 1 1 1
   http://142.42.1.1/ 0 1 1 1 1 1 1 1 1 1 1 1 1
   http://142.42.1.1:8080/ 0 1 1 1 1 1 1 1 1 1 1 1 1
   http://➡.ws/䨹 0 0 1 1 1 0 0 1 1 0 1 1 0
   http://⌘.ws 0 0 1 1 1 0 0 1 1 1 1 1 0
   http://⌘.ws/ 0 0 1 1 1 0 0 1 1 1 1 1 0
   http://foo.com/blah_(wikipedia)#cite-1 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://foo.com/blah_(wikipedia)_blah#cite-1 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://foo.com/unicode_(✪)_in_parens 1 1 1 1 1 1 0 1 1 1 1 1 0
   http://foo.com/(something)?after=parens 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://☺.damowmow.com/ 0 1 1 1 1 0 0 1 1 1 1 1 0
   http://code.google.com/events/#&product=browser 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://j.mp 1 1 1 1 1 1 1 1 1 1 1 1 1
   ftp://foo.bar/baz 0 0 1 1 1 1 1 1 1 1 1 1 1
   http://foo.bar/?q=Test%20URL-encoded%20stuff 0 1 1 1 1 1 1 1 1 1 1 1 1
   http://مثال.إختبار 0 0 1 1 1 0 0 1 1 0 1 1 0
   http://例子.测试 0 0 1 1 1 0 0 1 1 0 1 1 0
   http://उदाहरण.परीक्षा 0 0 1 1 1 0 0 1 1 0 1 1 0
   http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com 0 1 0 1 1 0 0 1 1 1 1 1 1
   http://1337.net 1 1 1 1 1 1 1 1 1 1 1 1 1
   http://a.b-c.de 1 1 1 1 1 1 1 1 1 1 0 1 1
   http://223.255.255.254 0 1 1 1 1 1 1 1 1 1 1 1 1
   https://foo_bar.example.com/ 1 1 1 1 1 1 0 1 1 1 1 0 0
   These URLs should fail (0 -> correct)
   http:// 0 0 0 0 0 0 0 0 1 0 0 0 0
   http://. 0 0 0 0 1 0 1 0 1 0 0 0 0
   http://.. 0 0 0 0 1 0 1 0 1 0 0 0 0
   http://../ 0 1 1 1 1 0 1 0 1 1 0 0 0
   http://? 0 0 0 0 1 0 0 0 1 0 0 0 0
   http://?? 0 0 0 0 1 0 0 0 1 0 0 0 0
   http://??/ 0 1 1 1 1 0 0 0 1 1 0 0 0
   http://# 0 0 0 1 1 0 0 0 1 1 0 0 0
   http://## 0 1 0 1 1 0 0 0 1 1 0 0 0
   http://##/ 0 1 1 1 1 0 0 0 1 1 0 0 0
   http://foo.bar?q=Spaces should be encoded 0 1 1 1 1 1 0 0 1 1 0 0 0
   // 0 0 0 0 0 0 0 0 0 0 0 0 0
   //a 0 0 0 0 0 0 0 0 0 0 0 0 0
   ///a 0 0 0 0 0 0 0 0 0 0 0 0 0
   /// 0 0 0 0 0 0 0 0 0 0 0 0 0
   http:///a 0 1 1 1 1 0 0 0 1 1 0 0 0
   foo.com 0 0 0 0 1 0 0 0 0 0 0 0 0
   rdar://1234 0 0 1 1 1 0 1 0 1 0 0 0 1
   h://test 0 0 1 0 1 0 1 0 1 0 0 0 1
   http:// shouldfail.com 0 0 0 0 1 0 0 0 1 1 0 0 0
   :// should fail 0 0 0 0 0 0 0 0 0 0 0 0 0
   http://foo.bar/foo(bar)baz quux 0 1 1 1 1 1 0 0 1 1 0 0 0
   ftps://foo.bar/ 0 0 1 1 1 0 1 0 1 0 0 0 1
   http://-error-.invalid/ 0 1 1 1 1 1 1 1 1 1 0 0 0
   http://a.b--c.de/ 1 1 1 1 1 1 1 1 1 1 0 0 1
   http://-a.b.co 1 1 1 1 1 1 1 1 1 1 0 0 0
   http://a.b-.co 1 1 1 1 1 1 1 1 1 1 0 0 0
   http://0.0.0.0 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://10.1.1.0 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://10.1.1.255 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://224.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://1.1.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://123.123.123 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://3628126748 0 1 1 1 1 0 1 1 1 1 1 0 1
   http://.www.foo.bar/ 0 1 1 1 1 0 1 0 1 1 0 0 0
   http://www.foo.bar./ 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://.www.foo.bar./ 0 1 1 1 1 0 1 0 1 1 0 0 0
   http://10.1.1.1 0 1 1 1 1 1 1 1 1 1 1 0 1
   http://10.1.1.254 0 1 1 1 1 1 1 1 1 1 1 0 1


It's from https://mathiasbynens.be/demo/url-regex and rather fantastic. Thanks for sharing



