
XPath post 1.0 got ridiculous, like many things do. What started as a simple, elegant language morphed into one with an HTTP client, filesystem methods, JSON support, functions, loops, extensions and the ability to read environment variables.

I wrote a post about it a while back[1] (I regret some of the wording used there) and maintain a tool[2] that can exploit XPath injection issues. I'd recommend sticking with 1 or maybe 2, and pretending 3.x doesn't exist.

1. https://tomforb.es/xcat-1.0-released-or-xpath-injection-issu...

2. https://github.com/orf/xcat
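For anyone unsure what XPath injection looks like: XPath 1.0 string literals have no escape syntax, so user input pasted between quotes can break out of the literal. Below is a minimal sketch of one common mitigation, building the literal with concat(). `xpath_literal` is a hypothetical helper, not part of xcat or any library, and the `//user[name=...]` query shape is an assumption for illustration.

```python
def xpath_literal(value: str) -> str:
    """Return an XPath 1.0 expression that evaluates to exactly `value`."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote kinds present: split on ' and glue the pieces back with concat().
    pieces = []
    for i, part in enumerate(value.split("'")):
        if i:
            pieces.append('"\'"')      # a literal single quote, wrapped in double quotes
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


user_input = "' or '1'='1"
naive = "//user[name='" + user_input + "']"             # input escapes the literal
safe = "//user[name=" + xpath_literal(user_input) + "]"  # input stays inside one literal
print(naive)  # //user[name='' or '1'='1']
print(safe)   # //user[name="' or '1'='1"]
```

Where the engine supports XPath variables, binding the input as a variable is usually the cleaner option; the helper above is only for cases where the query has to be assembled as a string.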



I largely agree. XPath 2.0 started the downwards trajectory and XPath 3 made it worse.

The one thing XPath 2.0 and later do improve on over XPath 1.0 is the "standard library": most of EXSLT got standardised in 2.0, and useful new functions were added in later revisions (e.g. contains-token from 3.1 is XPath finally adding the ~= operator from CSS).
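To make the contains-token comparison concrete, here is a minimal Python sketch of the token test, next to a model of the usual XPath 1.0 workaround `contains(concat(' ', normalize-space(@class), ' '), ' tok ')`. Both function names are made up for illustration. The sketch assumes a single input string and plain codepoint comparison; the real fn:contains-token also accepts a sequence of strings and a collation.

```python
def contains_token(value: str, token: str) -> bool:
    """Model of contains-token for one string: split on whitespace, exact match."""
    token = token.strip()
    if not token:
        return False
    # Note: str.split() splits on any Unicode whitespace, slightly broader than XML's.
    return token in value.split()


def xpath1_class_test(value: str, token: str) -> bool:
    """Model of contains(concat(' ', normalize-space(value), ' '), concat(' ', token, ' '))."""
    normalized = " ".join(value.split())  # roughly what normalize-space() does
    return f" {token} " in f" {normalized} "


for cls in ("btn", "btn primary", "  primary\tbtn ", "btn-primary"):
    print(repr(cls), contains_token(cls, "btn"), xpath1_class_test(cls, "btn"))
```

Both agree on these inputs, which is roughly the point: the function is a convenience over an idiom people were already writing by hand in 1.0.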

Here's the deal though: it should be possible to add most functions without updating the rest of the engine (indeed the majority were originally developed for 1.0). I think some of the functions are designed to work with and around types, which would not be useful in 1.0.


There are other useful things besides functions

Sequences, for example. In XPath 1 the query returns a set, so the output is always in document order. When the document reorders things, the query output changes, and you can never get the original output. In a sequence, the query can output anything in any order.
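A rough Python model of the difference, with list positions standing in for document order. This is a sketch of the semantics only, not of any engine's API; `DOC`, `as_node_set` and `as_sequence` are hypothetical names.

```python
# Nodes are modelled as strings; their index in DOC stands in for document order.
DOC = ["title", "author", "summary", "body"]


def as_node_set(selected):
    """XPath 1.0 style: a duplicate-free set, typically handed back in document order."""
    return sorted(set(selected), key=DOC.index)


def as_sequence(selected):
    """XPath 2.0+ style: an ordered list that keeps the query's own order and duplicates."""
    return list(selected)


wanted = ["body", "title", "body"]
print(as_node_set(wanted))   # ['title', 'body']
print(as_sequence(wanted))   # ['body', 'title', 'body']
```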


XPath/XSLT 2+ also have only a single implementation (by the spec author) so don't meet W3C's requirement of two interoperating implementations. Basically, XSLT ceased to be a "standard" whereas XSLT 1.0 had excellent portability across libxslt, Saxon, Xalan, and MS' xslt.exe.

Edit: there is/was a token implementation for XSLT 2.0 called Gestalt


It's also worth noting that the specification’s author built his company on this single implementation.


> XPath/XSLT 2+ also have only a single implementation

As far as XPath goes, that's wrong:

1. Saxon (the one you talk about)

2. BaseX (an XQuery 3.1 processor)

3. Xidel (implements many XQuery 3.1 features)

4. eXistdb

5. fonto-xpath (NodeJS)

6. frameless.io (JS, also XSL)

And these are just the ones that are publicly visible. I think Microsoft has a 2.x implementation, and I am pretty sure IBM and Oracle do as well.

Now, as for XSL-T, you are right: the only easily available implementation is Saxon, though apparently also frameless.io (which I just found out about a few minutes ago, so I may be wrong). But again, I guess big enterprises have their own solutions bundled.


Saxon supports all xpath versions though? It also bundles some very dangerous functions, some of which xcat can take advantage of.


I've read your article... Holy shit. They took a simple, sed-like tool and turned it into an abomination.


It ain't done before it can receive e-mail.


It can receive email. See my follow-up here with a working implementation:

https://github.com/clopen/xpath-receive-email

https://news.ycombinator.com/item?id=24960548


Hahah, that just really got carried away. OK, any idea if there is a replacement? I found version 1 useful for scraping websites. Or should I just stick with 1.0?


Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.


I was going to suggest Vim as a counterexample, but sure enough https://github.com/soywod/iris.vim


That's a community based plugin, Vim is still focused on text editing and not much else.


Most of the functionality in editors like Vim and Emacs comes from community-based plugins. Most people would not use them if there were no such expansions.


This baseless assertion is simply wrong. Plugins are nice to have, but the bulk of their use is to customize default installs.


Curious what is the problem with this? You can still use your small sed-like subset of the language in your project?


>>> They took a simple, sed-like tool and turned it into an abomination.

> Curious what is the problem with this?

Product Managers.


With the rise of numerous hierarchical document formats (JSON, YAML, TOML, properties files), what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.


> what XPath REALLY should have evolved into was a more format-flexible path language with format-specific extensions as needed.

You can probably already do that just fine: ignore attribute nodes, and e.g.

    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

    /menu/popup/menuitem/*[last()]/preceding-sibling::*[1]/value
selects "Open". Something along those lines.
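A quick way to try this with nothing but the Python standard library is to map the JSON onto an element tree and query that. This is only a sketch under a few assumptions: array members are wrapped in hypothetical `item` elements, the tree hangs off a made-up `root` wrapper because ElementTree rejects absolute paths on elements, and since ElementTree implements only a subset of XPath (no preceding-sibling axis), the equivalent `item[last()-1]` is used instead of `*[last()]/preceding-sibling::*[1]`.

```python
import json
import xml.etree.ElementTree as ET

DOC = json.loads("""
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
""")


def to_element(parent, name, value):
    """Map a JSON value onto child elements: objects by key, array members as <item>."""
    node = ET.SubElement(parent, name)
    if isinstance(value, dict):
        for key, child in value.items():
            to_element(node, key, child)
    elif isinstance(value, list):
        for child in value:
            to_element(node, "item", child)
    else:
        node.text = str(value)
    return node


root = ET.Element("root")
for key, child in DOC.items():
    to_element(root, key, child)

# Second-to-last array member, i.e. the sibling just before the last one.
selected = root.findtext("menu/popup/menuitem/item[last()-1]/value")
print(selected)  # Open
```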

Maybe relax nodetypes so they can be pluggable per-language, but I'm not sure that's even useful or necessary.


XPath always was extensible, at least at the implementation level. E.g. in 'lxml' it's trivial to add XPath functions with Python. Homegrown, of course, but still possible. In addition to extension elements this is about the only way to hook XSLT into the rest of the system. How else one is supposed to read environment variables from XSLT? The only other way is to pass everything via command line as parameters.

It's insecure to run untrusted XPath, but isn't it same with untrusted anything? A good solution here could be a way to sandbox such XPath, i.e. to limit which functions can be called, the same way it's done with XML where you can forbid the processor to use network or access arbitrary files on case-by-case basis.
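As a very rough illustration of "limit which functions can be called", here is a lexical allowlist check in Python. Everything in it is an assumption made for the sketch: the `ALLOWED` set, the helper names, and the regex-based detection of calls. It errs on the side of rejecting, and it is not a real sandbox: XPath 3.x constructs such as named function references (`doc#1`) never show up as `name(`, so actual isolation has to come from the engine itself.

```python
import re

# Hypothetical allowlist: a few XPath 1.0 functions plus the node tests that look like calls.
ALLOWED = frozenset({
    "contains", "starts-with", "normalize-space", "string-length",
    "count", "last", "position", "not", "text", "node",
})

_STRING_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")
_CALL = re.compile(r"([A-Za-z_][\w.-]*(?::[A-Za-z_][\w.-]*)?)\s*\(")


def called_functions(expr: str) -> set:
    """Names that appear directly before '(' once string literals are blanked out."""
    stripped = _STRING_LITERAL.sub("''", expr)
    return set(_CALL.findall(stripped))


def is_allowed(expr: str, allowed=ALLOWED) -> bool:
    """True only if every call-like name in the expression is on the allowlist."""
    return called_functions(expr) <= allowed


print(is_allowed("//user[contains(name, 'x')]/text()"))  # True
print(is_allowed("doc('file:///etc/passwd')"))           # False
```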


> How else one is supposed to read environment variables from XSLT?

Setting aside whether it’s even a good idea to allow XSLT to do that, XPath is only a subset of XSLT, so you’re just changing the subject. The “path” in XPath should be a hint at what it’s supposed to be: a query language to select nodes by path in XML documents, as opposed to an alternative to Awk or Perl.


I'd say XPath is a way to get a nodeset or another XPath type out of something. E.g. the current date is not selected from a document. There will always be a need to get yet another thing as a nodeset, e.g. to list a directory. Or, for boolean expressions, there will always be a need to test yet another thing, such as an environment variable.

These things, of course, should come as extension functions rather than special syntax, but then there will be a need to provide a small standard library of such functions :)

So yes, I believe it's useful if we're going to use XPath in a trusted environment, e.g. as a typical command-line tool. You won't deny Bash or Python this and other powerful abilities, will you? But of course it would be very unwise to run an untrusted Bash script.


XPath 3 was conceived with support for XSLT and XQuery in mind - where reading environment variables and text files are most definitely very useful features. This is indeed not something you want in a browser, but that ship had already sailed by then.


Damn. XPath went off the rails after v2. Though, to be fair, so did JavaScript, and look where that is today!


A bloated abomination?


An abomination with 500000 open job offers and people wanting their page to take 2 minutes to load, because we can do that “async” and download half of the internet to display a table.


That is the most popular language in tech right now? (Unfortunately?)


But does it send email?


I don't know what XPath 3.1 implementation you use, but the two major implementations, I have at hand, SaxonEE and BaseX, both do not understand your code. You write:

  for-each(normalize-unicode(upper-case(json-doc('x.json'))) => tokenize("\s+"),
    function($a) {
      let $a := $a * 10
      load-xquery-module('abc'):some-func(
        function-lookup($a, 1)(array:map($a, function($b) {
          let $c := unparsed-text-lines($b)
          trace($c)
          if ($c) {
            return xml-to-json($b)
          } else {
            error('This is an error')
          }
        })) 
      }
  )
1. fn:json-doc(), by default and typically, returns either an XPath map or array datatype. Therefore fn:tokenize() cannot be used with it, and neither can fn:normalize-unicode() or fn:upper-case() (both take a string as input). The error is: [FOTY0013] Items of type map(*) cannot be atomized.

2. in your anonymous function 'function($a) {...' you use a 'let' expression without 'return'. This is illegal.

3. As is the use of a colon ':' after fn:load-xquery-module(). The colon separates a namespace prefix from a local name. Did you mean the question mark '?'? That would make sense, since fn:load-xquery-module() returns a map. But then 'some-func(...' would not work out, since the returned map has two keys, "variables" and "functions", and 'some-func()' would be referenced in another map, which is the value of the 'functions' key.

4. Calling fn:function-lookup() (as the parameter to some-func()) with $a, which must be a string and which you then multiply by an integer (10), will already have errored out (multiplying a string by an integer is not possible). Even if that were possible, the next error would arise because the first parameter to this function must be of type xs:QName, which a number (or string) clearly is not.

5. The function you look up in the external module takes an array:map() function as its first parameter. Such a function does not exist in the XPath 3.1 standard (see https://maxtoroq.github.io/xpath-ref). You may have studied a version of the specification written before the array functions were finalized in 2017. That could mean that what you wanted to express would now be array:fold-left(), which is how a map() function is called in XPath. However, that function takes three parameters.

6. Again, this is not valid XPath grammar:

    let $c := unparsed-text-lines($b)
    trace($c)
    if ($c) {
      return xml-to-json($b)
    } else {
      error('This is an error')
    }
It would need to be:

    let $c := unparsed-text-lines($b)
    return (
             trace($c)
           , if ($c)
             then xml-to-json($b)
             else error('This is an error')
           )
Also, $c can not evaluate to a boolean ([FORG0006] Effective boolean value not defined for xs:string+), since it is a sequence of strings. Of course, such things could be implementation dependent...
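For readers who have not met the effective boolean value rules, here is a small Python model of how fn:boolean() treats a sequence, as I understand the XPath 2.0+ rules. It is a sketch only: Python lists stand in for sequences, the `Node` class is a hypothetical placeholder for any XML node, and TypeError stands in for the FORG0006 error.

```python
import math


class Node:
    """Stand-in for an XML node; only the fact that it is a node matters here."""


def effective_boolean_value(seq):
    """Model of fn:boolean() over a Python list standing in for an XPath sequence."""
    if not seq:
        return False                     # empty sequence
    first = seq[0]
    if isinstance(first, Node):
        return True                      # first item is a node
    if len(seq) == 1:
        if isinstance(first, bool):      # checked before int: bool is an int subclass
            return first
        if isinstance(first, str):
            return len(first) > 0
        if isinstance(first, (int, float)):
            return not (first == 0 or math.isnan(first))
    raise TypeError("FORG0006: effective boolean value not defined")


print(effective_boolean_value([]))            # False
print(effective_boolean_value(["one line"]))  # True
```

In this model a file with no lines or exactly one line would still give a defined value; it is the two-or-more-strings case that raises, which matches the xs:string+ in the error message.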

I don't want to go any further, since I can not totally recapitulate, what your code is supposed to do. It may be some 'blackhattish' dark magic, that fucks up some engines, but the engines I use do not even go through with compilation, due to the invalid XPath code. As is, your code does not make much sense. At least, if we talk about XPath 3.1.

Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.

Many of the changes to XPath, starting with version 2.1, stem from the fact, that XPath became a subset to XQuery, a fully functional and declarative programming language, that satisfies the need, to query XML documents as they would be databases. This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

XQuery 3.1, as a superset of XPath 3.1, is a language made in heaven! Nowhere is it that simple to work with (X)HTML, JSON and XML documents as here! A fully functional, declarative language that has templating (think handlebars, moustache) built in as an integral part of the language, where you can just intermix your XML code with program code, with easy error tracing (stateless!), quick "to production" development cycles and a painless approach to anything XML!

XML is one of the most misunderstood technologies in our industry and the only solid document technology I know of. Sadly, also based on this misunderstanding, a whole generation of developers has evolved which listened to those who jumped on the hype train of XML, just to realize that a document format may not be the best tool for the job (RPC, configuration files, etc.). And instead of admitting to themselves that they were wrong, they accuse the technology they abused, teaching the kids to make their lives more difficult. And these kids are now in charge of browser development, etc. And adding to that, your lightheaded comments do not really help the issue.


The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

> Your bashing of XPath 3.1 (and also 2.1) makes no sense either, since you seem to use a completely non-standard processor, with a different syntax and functions, that behave very differently from XPath 3.1, or, even worse, did not understand the language.

Granted, it’s been a few years since I’ve looked at XPath but I feel that I know it quite well. xcat is a testament to that.

> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”? No, if anything the original xpath was pretty well positioned to be consumed and used by non-programmers. The current, not so much.

The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose. Sorry.

> XML is one of the most misunderstood technologies in our industry

It’s pretty well understood and has valid use cases. However it had a history of overly complex, over engineered tooling created by a committee and stuffed full of acronyms.

This quite rightly puts people off, and there are just better formats and technologies to use that aren’t encumbered with XML baggage.


> The code sample is illustrative and shows off as much of the crazy, unneeded dynamism as possible rather than something that would actually work.

You can not make up fantasy code, in a language, that tries to look like the actual thing, but is not, and then complain about the language not working. Your code is not working, since that is not XPath code.

>> This was done in order to satisfy the needs of non-programmers, like in the digital humanities, the publishing industry, etc.

> This makes no sense. “We made a weird programming language to satisfy the needs of non-programmers”?

It's not weird. I am one of these people, and I enjoy it greatly! As an example for some (functioning) XQuery code, everybody is invited to check this: https://gist.github.com/joewiz/6762f1d8826fc291c3884cce3634e... I don't think, that is weird. Or what about this:

  for $contact in $contacts/contact
  where $contact/familyname/data() = "Smith"
  group by $key := $contact/zip
  order by $key
  return <group>{ $contact }</group>
which will return all contacts named 'Smith', placed in the same 'group' element as long as they live in the same location. Not weird at all! But then, this may be a matter of taste.
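For comparison, here is roughly the same where / group by / order by / return pipeline spelled out imperatively in Python with the standard library. It is a sketch, not a faithful XQuery translation: the sample contacts are invented, `smiths_by_zip` is a hypothetical name, keys are ordered as plain strings, and the contact elements are reused rather than copied.

```python
import xml.etree.ElementTree as ET

# Invented sample data, shaped to match the paths used in the query above.
CONTACTS = ET.fromstring("""
<contacts>
  <contact><familyname>Smith</familyname><zip>20001</zip></contact>
  <contact><familyname>Jones</familyname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><zip>10001</zip></contact>
  <contact><familyname>Smith</familyname><zip>20001</zip></contact>
</contacts>
""")


def smiths_by_zip(contacts):
    groups = {}
    for contact in contacts.findall("contact"):
        if contact.findtext("familyname") == "Smith":                       # where
            groups.setdefault(contact.findtext("zip"), []).append(contact)  # group by
    result = []
    for key in sorted(groups):                                              # order by
        group = ET.Element("group")
        group.extend(groups[key])                                           # return <group>...</group>
        result.append(group)
    return result


for group in smiths_by_zip(CONTACTS):
    print(group.findtext("contact/zip"), len(group))
```

With this sample data it prints one group for 10001 with a single contact and one for 20001 with two. Whether the five-line FLWOR version or this is the "weird" one is, as said, a matter of taste.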

> The current xpath/xquery language is poorly supported, mostly ignored and incredibly over engineered to the point where it’s almost comically unfit for purpose.

As for ignored and unsupported, that has mostly psychological, social and historical reasons (the needs of programmers vs. the needs of power users, people who fell for the hype train and then were turned off, etc.). As for "over-engineered", my experience is that XQuery is extremely lean. One may argue about some of the datatypes (i.e. dates and times), but they were requested by database/enterprise people. XML is a technology that had a lot of interest groups, who all wanted their share. The nice thing is: you don't need to use it if you don't require it. But those who do are happy. Also, these datatypes are not XPath/XQuery, they are XML Schema. What would your example for "over engineering" in XPath be? I am really curious!

> It’s pretty well understood and has valid use cases.

Again, my experience is very different. Most people do not know whether to "push" or to "pull" when writing XSL-T, which is a strong indicator that they have not understood XSL-T, at least. They just use it as a programming language and start complaining. Then there are those who compare it with JSON, which is comparing apples to airplanes. They call it "verbose" while not realizing that a complex data format that implies a lot of logic requires simple tools (XPath one-liner, anyone?), while simple datastructs require much more logic on the side of the programmer. Yes, something as simple as JSON is a low-hanging fruit, just like "make money fast". And then you realize the small print. In XML you start lowly, just as in XPath. No need to type anything. Just code on. Do the typing before production release. The rest comes over time, like everywhere.

Verbosity really happens on the code and the overly complex toolchains (just think ECMAScript and all the difficulties, that stem from combining HTML with JSON, in order to be somewhat semantic)

> However it had a history of overly complex, over engineered tooling created by a committee and stuffed full of acronyms. This quite rightly puts people off, and there are just better formats and technologies to use that aren’t encumbered with XML baggage.

Well, lazy people, who want to speak a foreign language without learning it (best example is your XPath code). I seldom read the W3C specs, only if I can't help myself any further. There are some nice books on each of these technologies, which are pretty simple to understand. However, one needs to read them, rather than just "coding on".

I do not doubt, that you are a capable programmer. However, judging by your code example, you got no clue about the XPath language. You may know how to abuse functions in a language, that access a server, and I guess, most of these attacks are pretty standard and do not require deeper knowledge of XPath. It's just the functions, that offer access.


Holy crap. What is this atrocity that is XPath 3.0!? What was wrong with sprinkling some XPath 1.0 queries into a Python script?



