Show HN: Pup – A command-line HTML parser (github.com/ericchiang)
126 points by ericchiang on Sept 13, 2014 | 27 comments

While reading the examples, I was surprised by the placement of the input redirection:

    $ pup < robots.html title

For some reason I thought that it must come last. Turns out that you can place it anywhere in the command! All these are equivalent in bash:

    $ pup title < robots.html
    $ pup < robots.html title
    $ < robots.html pup title

For readability, redirections are usually placed last, but they can be intermixed with the other words in the command. This is specified by the POSIX standard, so it's not bash-specific either:

'A "simple command" is a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and redirections, terminated by a control operator.'
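
A quick way to check this without installing pup is sketched below. It assumes a POSIX-style shell with grep and mktemp available (mktemp is not strictly POSIX, but is widely present). grep stands in for pup, and the scratch file and its contents are made up for illustration.

```shell
# Check that a redirection may sit anywhere in a simple command.
# grep is used instead of pup so this runs without installing anything.
tmp=$(mktemp)                       # scratch file standing in for robots.html
printf 'alpha\nbeta\nalpha\n' > "$tmp"

a=$(grep -c alpha < "$tmp")         # redirection last
b=$(grep < "$tmp" -c alpha)         # redirection in the middle
c=$(< "$tmp" grep -c alpha)         # redirection first

echo "$a $b $c"                     # prints "2 2 2"
rm -f "$tmp"
```

All three forms count the same two matching lines, so the position of the redirection makes no difference to what the command reads.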

"I bet it's node or ruby..." Sees .go file extension. "Oh nice, I never used a Go program before!" But then I am supposed to '$ go get github.com/ericchiang/pup' to install it.

Why does everything nowadays have to come with its own package manager? I like the separation between my home directory and the "system packages". I don't want to have to care for, update, and separately back up ~/go, ~/.npm and so on and so forth.

This looks super nice, I especially like the detailed list of examples. Sorry for the rant.

edit: There are binaries in the "dist" directory, the readme just did not mention them. Thanks!

The reason is dependencies, and the fact that an operating system is a completely different kind of enterprise than a development platform. Platform libraries are released whenever they're ready to be released, but an operating system really needs a set release cycle, because it's got to ensure compatibility between everything installed.

It's way too much to ask of already overworked OS maintainers to handle all of the libraries of all of the development platforms, and it's similarly too much to ask every library and application developer to maintain packages for all of the operating systems. You also can't have the One True Package Management system that works on all the different operating systems; it would just be too unwieldy to maintain. Even properly maintaining just .deb and .rpm packages is non-trivial and requires a certain amount of skill, compounding again the number of things developers need to be proficient in.

Packaging is a Really Hard Problem, and having every platform use its own packaging system is actually a huge step up from the way open source software used to be distributed, with tarballs and compilation instructions. The tarball was the package, and it was up to you to get it onto your system somehow. So have a little respect and appreciation, wouldya?

Thanks, you made me realise the benefit of doing it this way.

> It's way too much to ask of already overworked OS maintainers to handle all of the libraries of all of the development platforms

Not only that, but also different libraries can depend on "slightly" patched versions of the same library, making the kind of determinism an OS package manager needs completely impossible (or pointless, depending on who you're asking).

FWIW, once I realized that "./configure && make && sudo make install" was basically standard and worked the vast majority of the time, I really didn't mind it. In some ways I prefer it to platform-specific packages, which often lag development and include odd bugs and incompatibilities that don't bite you until run time.

It works fine for basic packages without many dependencies, but try a bigger package, where you have a few missing dependencies. Try installing those and you have more missing dependencies, repeat until you give up.

Call me old school but I never had trouble downloading, ./configuring, making, and make installing.

You're welcome to clone the repo yourself and build it. Go doesn't have a centralized package manager like npm, just a tool that automates downloading and building a repo. Nobody is forcing you to use it; it's a convenience.

Were you around in the days before package managers? This way is a lot better and keeps things organised.

Happy to see this. Pup will be a nice companion to HTTPie[1] as it also works with standard streams:

    $ http example.org | pup h1 text{} | http httpbin.org/post

[1] http://httpie.org/

So getting the front page links is now as easy as:

    curl https://news.ycombinator.com | pup td.title a attr{href}

Well done and thx for sharing.

Also see w3's html-xml-utils. For example hxextract: http://www.w3.org/Tools/HTML-XML-utils/man1/hxextract.html

`hxnormalize` can't format the new HTML5 tags normally.

Really great seeing more and more CLI tools being built in Go. :-)

Wait, what's the difference between this and using a Ruby/Python/etc REPL? In other words, normally to achieve this same result I would do:

irb -> require 'nokogiri' and require 'open-uri' -> doc = Nokogiri::HTML(open('http://www.google.com/'))

and no need to store the HTML via wget on my machine. Am I missing something?

The difference is that you don't have to know Python or Ruby, remember specific package names, or install them before use.

You can use this with pipes and redirections on the command line.

Nokogiri comes with a command-line tool for just that purpose: https://github.com/sparklemotion/nokogiri/blob/master/bin/no...

Example: nokogiri https://news.ycombinator.com -e 'puts $_.css("td.title a @href")'

Do you know of goquery (github.com/PuerkitoBio/goquery)?

Very nice. Could replace a bunch of one-off awk and sed scripts floating around on people's hard drives.

I agree with your sentiment, but because pup knows CSS selectors and understands the page hierarchy, this will blow the doors off of any line-oriented tool. I'm also stoked about the pretty-printing, but that's just from reading the English; I haven't actually tried the tool yet.

I also agree with the author: jq is invaluable.

I agree that jq is a must, along with httpie, and now pup. Thankfully, jq is now in all distros I've tried (except Arch Linux), and I think httpie is as well, so let's hope the same happens to pup.

Is there any way to use pseudo-selectors, like :last-child?

cat and pup play well together.

Looks great! Thank you so much for making a Windows build.
