
HTML Command Line Utilities - TheZenPsycho
http://www.w3.org/Tools/HTML-XML-utils/README
======
TheZenPsycho
I found these the other day and I wonder how these have largely slipped under
the radar. Of particular interest are hxpipe and hxunpipe, which make
"scraping" tasks absurdly easy by converting HTML to a form that is easy to
manipulate with sed, grep and other fun unix utilities.

 _update:_ tracking the score of this post on the front page using this:

    
    
        curl -s https://news.ycombinator.com/news | hxnormalize | hxpipe | grep -C 20 "TheZenPsycho" | grep points
        -9\n                    points

~~~
vdm
> how these have largely slipped under the radar

Your single example sells this more effectively than the entire OP. cf.
[http://bost.ocks.org/mike/example/](http://bost.ocks.org/mike/example/)

~~~
Thiz
That site is inspiring. How many wonderful things can be built with the right
mix of technology and passion.

Thanks for sharing.

------
TheZenPsycho
You know what makes me sad now? That there doesn't seem to be anything like
these for CSS files, in particular for extracting references to external files
and images, and for moving a CSS file from one directory to another while
maintaining relative links.
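
For the first half of that wish, a naive sketch with grep and sed can list the `url(...)` references in a stylesheet. The function name `css_urls` is made up for illustration, `grep -o` is assumed to be available (it is in GNU, BSD and BusyBox grep), and the pattern does not understand comments, uppercase `URL(`, or `@import "file.css"` without `url()`. It does nothing for the second half (rewriting relative links when a file moves).

```shell
#!/bin/sh
# css_urls is a hypothetical helper, not part of html-xml-utils.
# It reads CSS on stdin and prints one url(...) target per line.
css_urls() {
  grep -oE 'url\([^)]*\)' |
    sed -E 's/^url\([[:space:]]*//; s/[[:space:]]*\)$//' |
    tr -d "\"'"   # drop surrounding quotes (also drops any quote inside the URL)
}

# Demo on a small inline stylesheet.
css_urls <<'EOF'
body { background: url("img/bg.png"); }
.icon { background-image: url( 'img/icon.svg' ); }
@font-face { src: url(fonts/a.woff); }
EOF
```

On a real file this would be run as `css_urls < style.css`, where `style.css` is a placeholder name.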

------
networked
In Debian and Ubuntu those are in the official repositories. You can install
them with

    
    
        sudo apt-get install html-xml-utils

~~~
notfoss
For Arch Linux, they are available in the AUR.

[https://aur.archlinux.org/packages/html-xml-utils/](https://aur.archlinux.org/packages/html-xml-utils/)

------
martijn_himself
From the README page:

    
    
      hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
    

Being unfamiliar with both Perl and AWK, could anyone point me to an
explanation or example of why it is easier to parse, and what format it
generates? Would it be easy to write a similar utility to, say, convert it to
a Lua table?

~~~
p4bl0
The idea is that those utilities work in the UNIX way, which means that they
are line-oriented.

The following two xml documents are equivalent:

    
    
        <a><b><c /></b><d>foo</d></a>
    

and

    
    
        <a>
          <b> <c /> </b>
          <d>foo</d>
        </a>
    

But recognizing that with classical UNIX tools, which are line-oriented, is
quite difficult, so you'll have a hard time doing operations such as "replace
'foo' with 'bar' if it appears as the text node of a 'd' tag".

So the idea of hxpipe is to give you a line-oriented representation that is
similar for both documents.

But it actually fails to do that properly (at least to my taste). I much
prefer the output of xml2. Compare:

    
    
        # first doc, output of hxpipe
        (a
        (b
        |c
        )b
        (d
        -foo
        )d
        )a
        -\n
    
        # second doc, output of hxpipe
        (a
        -\n  
        (b
        - 
        |c
        - 
        )b
        -\n  
        (d
        -foo
        )d
        -\n
        )a
        -\n
    
        # output of xml2, for both documents
        /a/b/c
        /a/d=foo

~~~
martijn_himself
Many thanks for the detailed reply. That makes a lot of sense.

------
bokchoi
Lots of other fun little utilities from the author, Bert Bos:

[http://www.w3.org/People/Bos/#htmlutils](http://www.w3.org/People/Bos/#htmlutils)

------
ezequiel-garzon
There's also tidy-html5, developed by W3C:
[https://github.com/w3c/tidy-html5](https://github.com/w3c/tidy-html5)

~~~
downplay
I tried to use this recently, but failed to build it. Probably my bad.

------
super_mario
This builds with gcc 4.8.2 (and earlier), since the gcc stdlib has a
definition for the min/max functions. But for clang you need to include a MIN
function/macro based on your system type (hxindex.c does not link without it).

Specifically, on Mac OS X you need to modify hxindex.c like this:

    
    
        --- hxindex.c	2013-07-25 17:22:53.000000000 -0400
        +++ hxindex.c.patched	2014-03-11 10:05:55.000000000 -0400
        @@ -43,6 +43,7 @@
          * Version: $Id: hxindex.c,v 1.20 2013-07-25 21:04:05 bbos Exp $
          *
          **/
        +#include <sys/param.h>
         #include "config.h"
         #include <assert.h>
         #include <locale.h>
        @@ -439,7 +440,7 @@
         
           /* Count how many subterms are equal to the previous entry */
           i = 0;
        -  while (i < min(term->nrkeys, globalprevious->nrkeys) &&
        +  while (i < MIN(term->nrkeys, globalprevious->nrkeys) &&
    	     !folding_cmp(term->sortkeys + i, 1, globalprevious->sortkeys + i, 1))
    	 i++;
    

Basically, you need to add the sys/param.h include and change the min function
calls to MIN.

~~~
RBerenguel
Hmmm, weird: I can't compile on Mac OS (clang); it complains about undefined
iofuncs in openurl. I can't figure out exactly what is missing: it seems to be
a library, but where and why? I could install the Homebrew version, but it
seems to have a bug that makes hxselect not work correctly :/

~~~
k6hkUZtLUM
$ brew install -g html-xml-utils

seems to work on OS X 10.9.2

~~~
RBerenguel
That's version 6.4, whereas the most recent is 6.6

------
shawndumas
brew install html-xml-utils

~~~
raimue
sudo port install html-xml-utils

------
p4bl0
Mh, interesting, I need to check this out. Currently I'm using xml2 and 2xml
and classic unix tools (sed, grep, cut…) to deal with HTML in Bash scripts and
Makefiles (this is how my personal website is regenerated automatically when
I commit or push modifications, by calling `make` in the corresponding git
hooks).

~~~
TheZenPsycho
OH! I think I like the line format of those even better.

~~~
p4bl0
Indeed, I just tried hxpipe and its output is quite a mess. Almost impossible
to work with compared to xml2!
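
If only hxpipe is at hand, a rough awk sketch can flatten its output into path lines loosely in the style of the two xml2 lines quoted earlier in the thread. It is an approximation under stated assumptions, not a replacement for xml2: the name `hxpipe_to_paths` is invented, only the `(`, `)`, `|` and `-` line types seen in the thread are handled, every other line is ignored, and a literal `\n` in a text line is treated as whitespace.

```shell
#!/bin/sh
# hxpipe_to_paths is a hypothetical helper, not part of html-xml-utils or xml2.
# It reads hxpipe-style lines on stdin and prints /path and /path=text lines.
hxpipe_to_paths() {
  awk '
    /^\(/ { path = path "/" substr($0, 2); next }   # open: extend the path
    /^\)/ { sub("/[^/]*$", "", path); next }        # close: drop last component
    /^\|/ { print path "/" substr($0, 2); next }    # empty element: print its path
    /^-/  {
      text = substr($0, 2)
      gsub(/\\n/, " ", text)                        # literal \n escape becomes a space
      gsub(/^[ \t]+|[ \t]+$/, "", text)             # trim surrounding blanks
      if (text != "") print path "=" text           # skip whitespace-only text
    }
  '
}

# Demo on the hxpipe lines quoted earlier for the indented document.
printf '%s\n' '(a' '-\n  ' '(b' '- ' '|c' '- ' ')b' '-\n  ' '(d' '-foo' ')d' '-\n' ')a' '-\n' |
  hxpipe_to_paths
```

Dropping whitespace-only text is what makes the compact and the indented document come out the same here; it also throws away whitespace that may matter in mixed content.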

~~~
TheZenPsycho
Nevertheless, there are other gems such as hxselect and hxwls.

------
alexanderri
To find out what's for lunch:

        wget -qO- http://fazer.se/fleminggatan -q | hxselect 'div.OrangeHeader tr' | lynx -dump -stdin

------
shmerl
These look pretty useful for some scripts. Thanks for linking, I've never
heard of them before.

