
XPath is actually pretty useful once it stops being confusing - sugnid
http://news.rapgenius.com/Mat-brown-xpath-is-actually-pretty-useful-once-it-stops-being-confusing-lyrics
======
kjhughes
XPaths are extremely useful. I actually enjoy writing them, much like I enjoy
writing regular expressions. In fact, I consider both to return manifold the
modest investment they require to learn well.

XPath : XML :: regex : text

~~~
6ren
Now you have three problems.

(But along with relational algebra, they are among the few abstractions that
work really well)

------
gopalv
xpath is awesome, especially once you understand what an axis is.

And that is what I've found most people who have trouble with it don't
understand - like what exactly following-sibling or child means.

I spent about 2 months writing my own xpath evaluator once and it gets so much
easier (to implement too) when you understand this is just a tree-traversal
with an iterator following the axis.

Unfortunately the axis syntax makes it very verbose to read.
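
To make the "iterator following the axis" idea concrete, here is a minimal sketch in Python. It is not the evaluator described above: the function names (`child`, `descendant`, `following_sibling`, `step`) and the sample document are made up for illustration, and only three axes with plain name tests are covered.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """ElementTree nodes have no parent pointer, so record one per node."""
    return {kid: parent for parent in root.iter() for kid in parent}


def child(node, parents):
    """child:: axis, the node's direct children in document order."""
    yield from node


def descendant(node, parents):
    """descendant:: axis, a depth-first walk below the node."""
    for kid in node:
        yield kid
        yield from descendant(kid, parents)


def following_sibling(node, parents):
    """following-sibling:: axis, later children of the same parent."""
    parent = parents.get(node)
    if parent is None:
        return
    siblings = list(parent)
    yield from siblings[siblings.index(node) + 1:]


def step(nodes, axis, name, parents):
    """One location step: follow the axis from each context node, keep matches."""
    for node in nodes:
        for candidate in axis(node, parents):
            if name == "*" or candidate.tag == name:
                yield candidate


doc = ET.fromstring("<p>One<br/><b>two</b><i>three</i><b>four</b></p>")
parents = build_parent_map(doc)

br = next(step([doc], child, "br", parents))
after_br = [el.tag for el in step([br], following_sibling, "*", parents)]
bold_text = [el.text for el in step([doc], descendant, "b", parents)]
```

Chaining `step` calls gives roughly what a path like `child::br/following-sibling::*` does: each step takes the previous step's nodes as its context.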

------
neves
W3C is full of terrible standards: the verbose dom, the obtuse xml schema, the
crippled css (you can't have a variable), and others. XPath isn't one of them.
It is the best way to query XML documents in a forward compatible way. Maybe
someday we will be able to use XPath in a CSS file instead of their crazy
selectors.

------
d23
> it's a whopping 11 lines of code.

If you think 11 lines of code is a lot, you're overly focused on concision at
the expense of readability. I've never (read: never) worked on any Ruby code,
yet I find the posted example more readable than the supposedly more valuable
xpath.

At the very least, they're the same. If you're writing code in Ruby, 11 lines
is nothing. If you're writing code in Ruby and xpath is used nowhere else in
the project, that single line of super-compact xpath might as well be 1000
lines of Ruby -- it doesn't matter.

If you're trying to compact 11 lines of code you're probably doing it wrong.

~~~
GuiA
I may or may not agree with you, but one could make the argument that xpath is
likely heavily tested and proven, and will handle unexpected corner cases that
arise in the future, which those 11 lines of ruby will not. In that case,
using xpath lowers the amount of future headaches with that code.

------
graue
After a great article on what looked to be a handy tool, this part
disappointed me:

_for this particular task, XPath is actually considerably slower than the
pure-Ruby implementation. Interestingly, that's not true if you take out the
<br> part and only look for text at the beginning of paragraphs. My guess is
that the following-sibling axis is the culprit, since it has to select all the
following siblings of the br tags, and then filter them down to only the first
sibling._

I was hoping selectors were lazy, in which case, selecting all the following
siblings but then immediately filtering that selection down to the first would
be cheap. Lazy or not, can there really be no efficient way to do the
equivalent of jQuery next()?
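
For what it's worth, here is a small Python sketch of what "lazy" would buy in that situation. It says nothing about how any real XPath engine behaves; the `following_siblings` generator and the `visited` log are invented purely to count how many nodes get touched when only the first following sibling is wanted (the `following-sibling::*[1]` case).

```python
def following_siblings(siblings, index, visited):
    """Lazily yield the siblings after position `index`, logging each visit."""
    for i in range(index + 1, len(siblings)):
        visited.append(siblings[i])
        yield siblings[i]


siblings = ["br", "text-1", "b", "text-2", "i"]

# Eager: materialise every following sibling, then keep the first one.
eager_visited = []
eager_first = list(following_siblings(siblings, 0, eager_visited))[0]

# Lazy: pull a single item from the iterator and stop.
lazy_visited = []
lazy_first = next(following_siblings(siblings, 0, lazy_visited), None)
```

Both approaches return the same node, but the eager version walks all four remaining siblings while the lazy one touches only one.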

------
radicalbyte
If you have great tools which keep the feedback loop short, then XPath, like
Regex, SQL and CSS is extremely powerful and productive.

Just make sure that you document your test cases (i.e. what you should match)
or your colleagues will hate you.

------
moron4hire
Years ago, I wrote a tool for wrapping .NET XmlDocuments and making them far
easier to work with via XPath:
[https://github.com/capnmidnight/xml-stuff](https://github.com/capnmidnight/xml-stuff)

On its own, .NET's XML libraries are really only good for consuming XML
documents, but even that is a rather painful experience, especially as it
forces a namespace on all documents, complicating the XPath expressions
necessary to query it. Actually authoring documents is a nightmare. My XmlEdit
project makes it almost as simple as key-value-pair config files.

~~~
pragmatic
That's cool. Do you have any example comparing your work to LinqToXml?
[http://msdn.microsoft.com/en-us/library/bb308960.aspx](http://msdn.microsoft.com/en-us/library/bb308960.aspx)

~~~
moron4hire
Well, I wrote this off of the top of my head, and it has been several years
since I've used the library heavily (though I have a project now that needs
it, so I most likely will be dusting it off and fixing any hairy bits).

[https://github.com/capnmidnight/xml-stuff/blob/master/README...](https://github.com/capnmidnight/xml-stuff/blob/master/README.md)

However, just take note that the main concept of the library is "make it
work". The idea was that, given an XPath expression with several attribute
selectors, it would fill in any necessary nodes to just make it happen. So you
can technically chain a ton of editing commands together, by using an
appropriately complex XPath expression.
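
A rough Python sketch of that "fill in whatever is missing" idea, for readers who want to see the shape of it. This is not the library's API: `ensure_path` is a hypothetical helper, it only understands steps of the form `tag` or `tag[@attr='value']`, and it splits naively on `/`, so attribute values containing a slash would break it.

```python
import re
import xml.etree.ElementTree as ET

# One step: a tag name, optionally followed by a single [@attr='value'] test.
STEP = re.compile(
    r"^(?P<tag>[\w.-]+)(?:\[@(?P<attr>[\w.-]+)='(?P<value>[^']*)'\])?$"
)


def ensure_path(root, path):
    """Walk `path` from `root`, creating any step that does not exist yet."""
    node = root
    for raw in path.split("/"):
        match = STEP.match(raw)
        if match is None:
            raise ValueError(f"unsupported step: {raw!r}")
        found = node.find(raw)
        if found is None:
            # Missing node: create it and give it the attribute the step asked for.
            found = ET.SubElement(node, match["tag"])
            if match["attr"]:
                found.set(match["attr"], match["value"])
        node = found
    return node


config = ET.fromstring("<config/>")
ensure_path(config, "servers/server[@name='web']/port").text = "8080"
ensure_path(config, "servers/server[@name='web']/host").text = "example.org"
result = ET.tostring(config, encoding="unicode")
```

The second call reuses the `server` element the first call created, which is what lets several editing commands chain together.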

It came out of a need to repair thousands of broken XML documents. It's
probably not very complete. It was written for one project--and though I took
time to make it generalized--it never took on a key role in any other
project; I just didn't ever again have the need to deal with XML documents on
such a scale.

It's actually one of the first "big" things I wrote out of college. I'm not
too happy with some of the design right now, but the functionality has held up
over the years and it's not as shitty as some of the other code I wrote at the
time. I guess I knew that a lot of the project was hinging on how easy it was
to write XML documents, so I made sure I did a ton of testing to make it work.

------
slig
> But it gets more interesting if the lyrics are stored as an HTML fragment.

Is there any reason to store the HTML version with <p>s and <br>s instead of a
plain text and converting it to HTML with simple rules à la markdown? (single
line break = <br>, double line break = <p>)
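
As a sketch of how small those conversion rules are, here is one way to write them in Python. The function name `text_to_html` and the sample text are made up, and this covers only the two rules mentioned (blank line starts a new paragraph, single newline becomes `<br>`), plus HTML escaping.

```python
import html
import re


def text_to_html(text):
    """Blank lines separate paragraphs; single newlines become <br>."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    rendered = []
    for paragraph in paragraphs:
        # Escape each line so stray <, > and & in the text stay literal.
        lines = [html.escape(line.strip()) for line in paragraph.split("\n")]
        rendered.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(rendered)


lyrics = "first line\nsecond line\n\nnew verse"
markup = text_to_html(lyrics)
```

The trade-off is that any formatting richer than line and paragraph breaks has no place to live in the plain-text form.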

~~~
aliakbarkhan
Well, at a minimum, it saves the processing time required to format the text,
which lessens the server cost of each page hit. It's a small optimization, but
when the vast majority of the users are just coming to the site to read text,
I'd imagine it would save a lot of CPU time.

~~~
vdaniuk
Not really, the underlying text is just an HTML page that rarely changes and
the requests rarely hit the database because caching.

------
sixbrx
I also like XPath for some purposes, but I think it really suffers from (IIRC)
having been designed before the xml namespaces, which it only integrates very
awkwardly IMO and which ruins the simplicity of XPath. Or maybe XML namespaces
spoil everything they affect to some degree :)

~~~
fein
It's the latter (XML namespaces spoil everything).

In every single task I do that involves munching on XML with xpath, the
largest timesink is figuring out how the ns needs to be set up. (Looking at
you, PHP simpleXML).
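
For comparison, this is roughly what the namespace setup looks like with Python's standard `xml.etree.ElementTree`. The document, the `urn:example:*` URIs and the `o`/`c` prefixes are invented for the example; the point is that the prefixes in the query come from the dictionary you pass in, not from the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

XML = """
<order xmlns="urn:example:orders" xmlns:c="urn:example:contact">
  <c:contact><c:first-name>Ada</c:first-name></c:contact>
  <item>widget</item>
</order>
"""
root = ET.fromstring(XML.strip())

# Query-side aliases: any prefix works as long as it maps to the right URI.
NS = {"o": "urn:example:orders", "c": "urn:example:contact"}

# An unprefixed name in the path means "no namespace", so this finds nothing
# even though <item> carries no prefix in the document (it inherits the default).
unqualified = root.findall(".//item")

items = [el.text for el in root.findall(".//o:item", NS)]
first_names = [el.text for el in root.findall(".//c:first-name", NS)]
```

The empty `unqualified` result is the classic trap: a default namespace on the root silently applies to every unprefixed element below it.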

~~~
masklinn
> In every single task I do that involves munching on XML with xpath

And more generally that's true of every single task involving munching on
namespaced XML. Namespaces are a good idea implemented absolutely terribly.

XPath is a good idea well-implemented (no, XPath 2 does not exist, there is
only one XPath). One of the few I've found in XML-land. I still hate that we
have to use CSS selectors rather than XPath (although that's understandable
considering CSS selectors predate XPath), most of the improvements since CSS1
were in XPath day 1, and the rest (pseudo-classes) could probably have been
implemented using functions.

Also, that might have finally gotten us a non-eye-stabbing standard function
for "match any item of a space-separated list in an attribute" (matching HTML
classes in XPath without custom helpers is the worst)

~~~
yeahbutbut
> Also, that might have finally gotten us a non-eye-stabbing standard function
> for "match any item of a space-separated list in an attribute"

CSS handles this rather nicely:

    [class~=foo]

[https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_s...](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors)

~~~
masklinn
Whereas xpath... does not. Which is a severe understatement considering the
equivalent to the CSS selector you wrote up (or to `.foo`) in xpath 1 is
something along the lines of:

    [contains(concat(' ', normalize-space(@class), ' '), ' foo ')]

The normalize-space can be dropped iff you're certain all spaces are already
normalized; the spaces around the needle cannot.

xpath 2[0] does quite a bit better through `tokenize`:

    //*[tokenize(@class, '\s+')='foo']

but still not great. And god forbid you need to match multiple classes in the
same selector.

[0] or xpath 1 + exslt if your xpath implementation provides it. exslt
actually does slightly better as the pattern is optional and defaults to
whitespace characters
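
To see why the space padding matters, here is the same logic mirrored in plain Python string operations. These helpers are only stand-ins for the XPath expressions above (the names are made up, and Python's `split()` is merely a close analogue of `normalize-space` and `tokenize`), but they show where the naive substring test goes wrong.

```python
def naive_contains(class_attr, name):
    """Like contains(@class, 'foo'): a substring test, so 'foobar' matches too."""
    return name in class_attr


def padded_contains(class_attr, name):
    """Like contains(concat(' ', normalize-space(@class), ' '), ' foo ')."""
    normalized = " ".join(class_attr.split())  # collapse runs of whitespace
    return f" {name} " in f" {normalized} "


def tokenized(class_attr, name):
    """Like tokenize(@class, '\\s+') = 'foo': compare against whole tokens."""
    return name in class_attr.split()


def has_all_classes(class_attr, names):
    """Matching several classes means repeating the whole test per class."""
    return all(padded_contains(class_attr, name) for name in names)
```

The padding turns "is `foo` a substring" into "is `foo` a whole token", which is the job `[class~=foo]` does in CSS.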

~~~
voltagex_
Over in .NET land we appear to be stuck on XPath 1.0 forever. A project I used
to work on used it extensively, but I now use the HtmlAgilityPack (badly
formatted HTML) or XDocument (XHTML strict or XML) where I have the choice.

~~~
masklinn
> Over in .NET land we appear to be stuck on XPath 1.0 forever.

I don't think it's a bad idea, most of the improvements in XPath 2 are the new
standard functions which depending on your XPath implementation may be
available as extensions (e.g. tokenize comes from exslt, an xpath 1.0 library)
but along with that it brings significantly higher complexity and I think the
spec has gone from "difficult to read" to "meaningless word-salad".

I really like XPath, but I can't say I was impressed by XPath 2, it loses much
of xpath's simplicity with little to show for the added complexity.

------
hfsktr
"This is a perfectly reasonable solution, but it's a whopping 11 lines of
code. Further, it feels like we're using the wrong tool for the job: why are
we using Ruby iterators and conditionals to get at DOM nodes?"

Is it really that bad to have 11 lines in Ruby?

Initially I didn't get the wrong tool part but after reading it all that did
make more sense. I haven't used XPath more than a few times and they were
pretty simple so can't complain. Just something I'll have to keep in mind.

~~~
jpatokal
Agreed. I find the 11 lines of Ruby a lot more readable/obvious than that
fairly convoluted XPath. Although I'd probably have opted for RSLT instead:

[http://hackpackers.lonelyplanet.com/2013/03/05/XML-Transform...](http://hackpackers.lonelyplanet.com/2013/03/05/XML-Transformation-With-RSLT.html)

------
narrator
One problem with XPath is that it can be a lot slower than native or JIT'd
code depending on the implementation. Interestingly enough you can do xpath
like things in Scala with native code using pattern matching:

[http://ofps.oreilly.com/titles/9780596155957/HerdingXMLInSca...](http://ofps.oreilly.com/titles/9780596155957/HerdingXMLInScalaDSLs.html)

------
tomasien
Here's how I use XPath - constant Googling! This is a good explanation, maybe
I'll actually learn it now.

~~~
Kiro
When do you use XPath?

~~~
tomasien
Scraping mostly

------
ianbicking
If you are curious about XPath and CSS you might want to play with
[http://css2xpath.appspot.com](http://css2xpath.appspot.com)

------
gavinpc
I'm happy to see many other XPath fans here.

But as far as the OP, this seems like a case of worrying about the code
instead of the data structure. This would be easier to address before the
lines are transformed into HTML. Which I assume is not how they are stored.

~~~
sugnid
Actually, they are! We host many document types on RG, not just song lyrics,
so the “lyrics” field of a song is just a specific case of the general “body”
field of a text. And texts can definitely have rich formatting via HTML
markup.

------
habosa
XML gets a bad rap because of how much prettier JSON is, but there are a lot
of cool tools associated with it. XPath is pretty awesome. I had to write an
XPath parser/executor once (for a class) and it made me appreciate the value
and simplicity.

Then there was XSLT, which was a pretty sweet way to turn a data format into a
variety of "print" or "display" formats. Definitely been replaced by bigger
and better things but it's a pretty awesome technology that does one thing
really well.

~~~
skrebbel
XSLT is a turing-complete programming language. Really, the moment your source
XML has a slightly different structure from the target output, XSLT files
become monsters.

Programming in XML is _never_ a good idea. It isn't in XSLT, it isn't in
Spring, it isn't in Maven. Anything that's XML and has elements or attributes
with names like "if", "else" or "while", something went horribly, horribly
wrong somewhere. It's horribly verbose, you can't reasonably debug it, and
there's virtually no engineering best practices, which results in near-
impossible maintenance tasks.

Any modern programming language with a good, concise XML parsing library is a
more effective tool to transform XML into something else than XSLT.

Don't code in XML.

~~~
ygra
Would you say _»Programming in S-expressions is_ never _a good idea«_ as well?

~~~
skrebbel
No. Why?

~~~
ygra
Then where do you draw the line between XML as data vs. XML as code (which
seems to be bad) and sexpr as data vs. sexpr as code (which seems to be good)?

Just curious; it's just syntax, after all, and I think that XSLT really has an
advantage in transforming XML to other XML (or text) compared to a program in
other languages.

------
GeorgeMac
I have made a little XPath primer, which is very much a work in progress.
Check it out on my GitHub:
[https://github.com/GeorgeMac/xpath-primer](https://github.com/GeorgeMac/xpath-primer)

There are a few issues I am having with my markdown editor, in comparison to
GitHub's markdown support.

------
d0m
I prefer to use html parsers for such problems, such as beautifulsoup in
python. I used xpath in the past but the resulting code wasn't that much shorter
than a more verbose version based on beautifulsoup. And, for someone looking
at the code, the beautifulsoup version makes so much more sense. Xpath also
feels like a big regex that magically works.

I'm not saying it's not useful. Actually, I believe that if you only have one
use-case, then using xpath might be overkill because of all the added
complexity of maintaining a new library/technology/ideology. But if it's the
sort of domain where xpath would be useful more than once, then sure, use it.

~~~
TylerE
The thing about xpath is that it runs in highly efficient native code. When
iterating over a 100MB+ file it makes an _immense_ difference.

It's conceptually the difference between:

    count = 0
    for row in db.query("select id from <table>"):
        count += 1

and

    db.query("select count(*) from <table>")

I also don't find it at all confusing, you just have to understand the tree
nature of XML.

~~~
d0m
Oh no, you've used the performance argument against me ;-) Obviously, when
performance issues are on the line, you often need to trade simplicity and
maintainability.

~~~
TylerE
I'm not sure how using a platform specific library is more maintainable than
an open standard with myriad implementations.

------
callmeed
I just started getting into xpaths pretty hardcore with my trivia generator
for [http://playhattrick.com](http://playhattrick.com) ... I use it for
identifying tables of data to scrape. It's not as fun as regex IMO but it is
powerful.

Pro Tip: the chrome inspector lets you right-click on an element and get its
xpath.

Pro Warning: sometimes the xpath generated by chrome doesn't work when
scraping with Nokogiri. I'm not sure why yet, I've just learned not to rely on
it.

~~~
garethadams
There's not "an" XPath for an element, as XPath describes the _route_ you take
from the root of a document to the element in question. The correct route for
your situation depends on your use case.

Describing an element as "the first child of the fifth child of the second
child of the first child of the eighth child of the second child of HTML" is
as much the right path to an element as if you described the way to your house
as "Walk past the park then walk past the bus stop then walk past the hardware
store then walk past the butchers then turn left then walk past the pizza shop
then walk past the library"
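
A quick Python illustration of two routes to the same element, using the standard library's limited XPath support. The page markup here is invented; the second expression has the same shape as the kind of id-anchored path a browser might hand you.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html><body>
  <div id="nav"><table><tr><td>menu</td></tr></table></div>
  <div id="details"><article>
    <table><tr><td>stats</td></tr></table>
    <table><tr><td>notes</td></tr></table>
  </article></div>
</body></html>
""".strip())

POSITIONAL = "./body/div[2]/article/table[1]"       # turn-by-turn directions
ANCHORED = ".//*[@id='details']/article/table[1]"   # start from a landmark

by_position = page.find(POSITIONAL)
by_id = page.find(ANCHORED)
same_before = by_position is by_id

# The page changes: a new <div> appears at the top of <body>.
page.find("body").insert(0, ET.Element("div"))

by_position_after = page.find(POSITIONAL)   # now points at the wrong <div>
by_id_after = page.find(ANCHORED)           # still finds the stats table
```

Both expressions are "correct" for the original page; they differ only in which changes to the page they survive.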

~~~
callmeed
I'm aware what XPath is/does.

And I'm not disagreeing with you–I'm only saying Chrome has this feature. I
don't know what route they choose for you but I know they don't always work in
tools that parse (HT/X)ML.

Here's a screenshot in case you don't believe me:
[http://imgur.com/9FZSMSt](http://imgur.com/9FZSMSt)

The XPath Chrome returns for this page is: //*[@id="details"]/article/table[1]

------
jefflinwood
Interesting, (and this could just be a made up problem to illustrate the blog
post), but wouldn't it have been much easier to just store the lyrics in
another format (not HTML)?

For example, you could use TEI XML
([http://www.tei-c.org/index.xml](http://www.tei-c.org/index.xml)), and then
use stanzas and lines. Then when you go to render your lyrics, you can
capitalize the first letters in your presentation code.
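
A sketch of what that could look like, in Python. It assumes TEI-style `<lg>` (line group) and `<l>` (line) elements, leaves out the TEI namespace for brevity, and uses made-up sample text and a made-up `render` function; a real TEI document would need the namespace handled in the queries.

```python
import html
import xml.etree.ElementTree as ET

TEI_FRAGMENT = """
<lg type="song">
  <lg type="stanza"><l>first line of the verse</l><l>second line</l></lg>
  <lg type="stanza"><l>chorus line</l></lg>
</lg>
"""


def render(song):
    """Render each stanza as a paragraph, capitalising every line's first letter."""
    stanzas = []
    for stanza in song.findall("lg"):
        lines = [(line.text or "") for line in stanza.findall("l")]
        lines = [html.escape(text[:1].upper() + text[1:]) for text in lines]
        stanzas.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(stanzas)


song = ET.fromstring(TEI_FRAGMENT.strip())
rendered = render(song)
```

With the structure stored explicitly, the capitalisation becomes a presentation detail, with no need to hunt for line starts inside HTML.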

------
m_st
XPath is alright but as sixbrx noted it suffers from problems with namespaces.
I keep using this xslt transformation to remove the ns info. Then it works
just fine:
[http://stackoverflow.com/a/413088/34022](http://stackoverflow.com/a/413088/34022)
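
For anyone not using XSLT, the same trick can be sketched in Python with the standard library. This is a swapped-in technique, not the stylesheet from the linked answer, and `strip_namespaces` is a made-up helper. It relies on ElementTree storing qualified names as `{uri}local`, and it throws information away: elements that differed only by namespace become indistinguishable afterwards.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rewrite '{uri}local' tags and attribute names to plain 'local', in place."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
        for key in list(el.attrib):
            if key.startswith("{"):
                el.attrib[key.split("}", 1)[1]] = el.attrib.pop(key)
    return root


doc = ET.fromstring(
    '<a:root xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    '<b:item a:id="1">x</b:item>'
    "</a:root>"
)
strip_namespaces(doc)
item = doc.find("item")  # plain names now work in path expressions
```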

~~~
dionidium
This is fine, I guess, and it's clearly something people want to do based on
how often it gets asked on Stackoverflow, but there's a reason XML zealots get
snarky when people ask how to do this. Some questions you might want to ask
yourself:

- Why are the namespaces there in the first place?

- Do I _really_ not care if the element is found in a namespace other than
the one expected?

- Does my host environment have a way to specify the namespace of the element
I want to find (hint: it probably does)?

- Is the reason that I want to remove the namespace that it's actually
something I need to do or is it that I am ignorant of the method for
specifying namespaces in my host environment?

~~~
djur
The question is usually "is the author of this XML actually using namespaces
in a reasonable manner?"

And the answer is usually "no".

I have seen SOAP responses with 20+ namespaces, all of them being essentially
implementation details -- every different section of their internal API
getting its own namespace. Inevitably, the elements are also prefixed in a way
that makes them distinct, or wrapped in a distinguishing element (i.e.
Contact/NameInfo/FirstName rather than FirstName xmlns="contact-name").

In situations like that, your best case scenario is that you do the grunt work
of setting up aliases for all the namespaces, putting them into your XPaths,
and you're done. The worst case scenario (which I've encountered) is when a
version update of the API changes the URIs for half the namespaces, even
though the structure of the data hasn't changed. In a case like that, you're
actually penalized for doing the 'right' thing and not just stripping the damn
things off.

------
gpsarakis
CSS selectors are much easier to remember than XPath. Python's BeautifulSoup
allows you to select elements with selectors and is very convenient. XPath is
a bit more verbose and most people already are familiar with CSS syntax.

~~~
gsnedders
And indeed, any CSS selector can be converted to an equivalent XPath query, at
least for selectors on XML and HTML.
[http://pythonhosted.org/cssselect/](http://pythonhosted.org/cssselect/) is a
Python implementation of such a conversion. (Note that there is no XPath to
CSS selector converter, as XPath can express certain things CSS selectors
cannot, as CSS selectors are designed such that they can be matched using a
streaming parser as soon as the first child of the element appears.)
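
To give a feel for the conversion, here is a toy translator in Python. It is not how cssselect works and covers only a tiny subset: bare element names joined by the descendant combinator (a space) or the child combinator (`>`). The function name `css_to_xpath` and the sample document are made up.

```python
import xml.etree.ElementTree as ET


def css_to_xpath(selector):
    """Translate type selectors joined by ' ' (descendant) or '>' (child)."""
    tokens = selector.replace(">", " > ").split()
    xpath = "."
    separator = "//"  # the first name may match at any depth
    for token in tokens:
        if token == ">":
            separator = "/"  # child combinator: next name is a direct child
            continue
        xpath += separator + token
        separator = "//"  # default back to the descendant combinator
    return xpath


doc = ET.fromstring(
    "<html><div>"
    "<p><a>direct</a></p>"
    "<section><p><a>nested</a></p></section>"
    "</div></html>"
)

child_query = css_to_xpath("div > p a")
descendant_query = css_to_xpath("div p a")
child_hits = [a.text for a in doc.findall(child_query)]
descendant_hits = [a.text for a in doc.findall(descendant_query)]
```

Classes, ids, attribute tests and pseudo-classes are where a real converter earns its keep; this only shows how the two combinators map onto `/` and `//`.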

------
masklinn
> the / in an XPath expression plays the same role as the > in a CSS selector:

The `/` in an XPath expression is probably a better match for the space in CSS
selectors.

~~~
mmastrac
I think you are thinking of "//", which acts like the space (ie: any number of
children).

~~~
masklinn
No, that's just a difference in default axis. I mean that `/` is a separator
between traversal expressions, much like whitespace in CSS selectors.

~~~
mmastrac
Ah yeah, I understand what you mean now. XPath's `/` is something like a token
that means "separate these two things". In CSS the separation between segments
is the zero or more whitespace characters that live between the parts of the
selector.

------
goflyapig
This is a great explanation and quick tutorial on XPath, but, like regex, I
don't think I'd ever use it in production code unless I absolutely had to.

I'm sure I'd have _fun_ coming up with an XPath solution, but for me, the
ultimate goal is maintainability. If I wasn't 90% sure that the next person to
look at that code already knew XPath, then I'd go with the Ruby solution.

Dealing with 11 lines of code in a language you know is better than dealing
with 1 line of code in a language you don't (which ends up forcing you to read
1000 lines of documentation and examples to understand it).

------
optymizer
Writing an XPath 1.0 parser in C was fun. Maybe one day I'll use it to
(partially) replace MongoDB's JSON query language.

------
johnward
I use xpath way more than I care to at my job but it gets the job done.

