
Show HN: Pup – A command-line HTML parser - ericchiang
https://github.com/EricChiang/pup
======
nikital
While reading the examples, I was surprised by the placement of the input
redirection:

    
    
        $ pup < robots.html title
    

For some reason I thought that it must come last. Turns out that you can place
it anywhere in the command! All these are equivalent in bash:

    
    
        $ pup title < robots.html
        $ pup < robots.html title
        $ < robots.html pup title

~~~
userbinator
For readability, redirections are usually placed last but indeed they can be
intermixed with all the other words in the command and this is specified by
the POSIX standard (so it's not bash-specific either):

http://pubs.opengroup.org/onlinepubs/009604599/utilities/xcu_chap02.html#tag_02_09_01

'A "simple command" is a sequence of optional variable assignments and
redirections, _in any sequence_, optionally followed by words and
redirections, terminated by a control operator.'
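
A minimal sketch to check this in any POSIX shell. It uses `tr` as a stand-in
for pup (an assumption, so that it runs without pup installed), and the temp
file and its contents are made up for illustration:

```shell
#!/bin/sh
# Hypothetical input file standing in for robots.html.
tmp=$(mktemp)
printf '<title>robots</title>\n' > "$tmp"

# The same simple command with the redirection in three positions.
last=$(tr a-z A-Z < "$tmp")     # redirection after all words
middle=$(tr a-z < "$tmp" A-Z)   # redirection between the words
first=$(< "$tmp" tr a-z A-Z)    # redirection before the command name

# All three lines printed here should be identical.
printf '%s\n%s\n%s\n' "$last" "$middle" "$first"
rm -f "$tmp"
```

The shell strips the redirection out before the command sees its arguments, so
`tr` gets the same two operands in every case.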

------
aw3c2
"I bet it's node or ruby..." Sees .go file extension. "Oh nice, I never used a
Go program before!" But then I am supposed to '$ go get
github.com/ericchiang/pup' to install it.

Why does everything nowadays have to come with its own package manager? I like
the separation between my home directory and the "system packages". I don't
want to have to care for, update, and separately back up ~/go, ~/.npm, and so
on.

This looks super nice, I especially like the detailed list of examples. Sorry
for the rant.

edit: There are binaries in the "dist" directory, the readme just did not
mention them. Thanks!

~~~
vinceguidry
The reason is dependencies, and the fact that an operating system is a
completely different kind of enterprise than a development platform. Platform
libraries are released whenever they're ready to be released, but an operating
system really needs a set release cycle, because it's got to ensure
compatibility between everything installed.

It's way too much to ask of already overworked OS maintainers to handle all of
the libraries of all of the development platforms and it's similarly too much
to ask every library and application developer to maintain packages for all of
the operating systems. You also can't have the One True Package Management
system that works on all the different operating systems, it would just be too
unwieldy to maintain. Even properly maintaining just .deb and .rpm packages is
non-trivial and requires a certain amount of skill, compounding again the
number of things developers need to be proficient in.

Packaging is a Really Hard Problem, and having every platform use its own
packaging system is actually a huge step up from the way open source software
used to be distributed, with tarballs and compilation instructions. The
tarball _was_ the package, and it was up to you to get it onto your system
somehow. So have a little respect and appreciation, wouldya?

~~~
nostrademons
FWIW, once I realized that "./configure && make && sudo make install" was
basically standard and worked the vast majority of the time, I really didn't
mind it. In some ways I prefer it to platform-specific packages, which often
lag development and include odd bugs and incompatibilities that don't bite you
until run time.

~~~
collyw
It works fine for basic packages without many dependencies, but try a bigger
package, where you have a few missing dependencies. Try installing those and
you have more missing dependencies, repeat until you give up.

------
jkbr
Happy to see this. Pup will be a nice companion to HTTPie[1] as it also works
with standard streams:

    
    
        $ http example.org | pup h1 text{} | http httpbin.org/post
    
    

[1] http://httpie.org/

------
ushi
So getting the front page links is now as easy as:

    
    
       curl https://news.ycombinator.com | pup td.title a attr{href}
    

Well done and thx for sharing.

------
grannyg00se
Also see W3C's html-xml-utils. For example, hxextract:
http://www.w3.org/Tools/HTML-XML-utils/man1/hxextract.html

~~~
NaNaN
`hxnormalize` can't format the new HTML5 tags correctly.

------
artursapek
Really great seeing more and more CLI tools being built in Go. :-)

------
mbesto
Wait, what's the difference between this and using a Ruby/Python/etc REPL? In
other words, normally to achieve this same result I would do:

irb -> require 'nokogiri' and require 'open-uri' -> doc =
Nokogiri::HTML(open('http://www.google.com/'))

and no need to store the HTML via wget on my machine. Am I missing something?

~~~
aw3c2
You can use this with pipes and redirections on the command line.

~~~
dashesyan
Nokogiri comes with a command-line tool for just that purpose:
https://github.com/sparklemotion/nokogiri/blob/master/bin/nokogiri

Example: nokogiri https://news.ycombinator.com -e 'puts $_.css("td.title a @href")'

------
Gys
Did you know of goquery (github.com/PuerkitoBio/goquery)?

------
morenoh149
Very nice. Could replace a bunch of one-off awk and sed scripts floating
around on people's hard drives.

~~~
mdaniel
I agree with your sentiment, but because pup knows CSS selectors and
understands the page hierarchy, this will blow the doors off of any line-
oriented tool. I'm also stoked about the pretty-printing, but that's just from
reading the English; I haven't actually tried the tool yet.

I also agree with the author: jq is _invaluable_.

~~~
kolev
I agree that jq is a must, along with httpie, and now pup. Thankfully, jq is
now in all distros I've tried (except Arch Linux), and I think httpie is as
well, so let's hope the same happens to pup.

------
illesim
Is there any way to use pseudo-selectors, like :last-child?

------
mholt
cat and pup play well together.

------
WorldWideWayne
Looks great! Thank you so much for making a Windows build.

