Htmlq: like jq, but for html (github.com/mgdm)
961 points by jabo 15 days ago | 166 comments

This is very nice!

For reasoning about tree-based data such as HTML, I also highly recommend the declarative programming language Prolog. HTML documents map naturally to Prolog terms and can be readily reasoned about with built-in language mechanisms. For instance, here is the sample query from the htmlq README, fetching all elements with id get-help from https://www.rust-lang.org, using Scryer Prolog and its SGML and HTTP libraries in combination with the XPath-inspired query language from library(xpath):

    ?- http_open("https://www.rust-lang.org", Stream, []),
       load_html(stream(Stream), DOM, []),
       xpath(DOM, //(*(@id="get-help")), E).

       E = element(div,[class="flex flex-colum ...",id="get-help"],["\n        ",element(h4,[],["Get help!"]),"\n        ",element(ul,[],["\n       ...",element(li,[],[element(a,[... = ...],[...])]),"\n   ...",element(li,[],[...]),...|...]),"\n        ...",element(div,[class="la ..."],["\n   ...",element(label,[...],[...]),...|...]),"\n    ..."])
    ;  false.
The selector //(*(@id="get-help")) is used to obtain all HTML elements whose id attribute is get-help. On backtracking, all solutions are reported.

The other example from the README, extracting all links from the page, can be obtained with Scryer Prolog like this:

    ?- http_open("https://www.rust-lang.org", Stream, []),
       load_html(stream(Stream), DOM, []),
       xpath(DOM, //a(@href), Link),
       portray_clause(Link),
       false.
This query uses forced backtracking with portray_clause/1 to write all links on standard output.


It's pretty easy in Python too, eg.:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(requests.get("https://www.rust-lang.org").text, "html.parser")
    >>> [x["href"] for x in soup.find_all("a")]

    ['/', '/tools/install', '/learn', 'https://play.rust-lang.org/', '/tools', '/governance', '/community', 'https://blog.rust-lang.org/',...
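If you want to avoid the third-party dependencies entirely, the same extraction can be sketched with nothing but the standard library's html.parser. (The HTML string below is a made-up stand-in for the fetched page, to keep the example self-contained.)

```python
# Collect href attributes from <a> tags using only the stdlib.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<nav><a href="/">Home</a> <a href="/learn">Learn</a></nav>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # → ['/', '/learn']
```

More verbose than the BeautifulSoup one-liner, which rather supports the point about Python needing several language constructs for this.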

In a certain sense (for example, when measuring brevity), it is indeed easy to write this example in Python. However, the Python version also illustrates that many different language constructs are needed to express the intended functionality. In comparison to Prolog, Python is quite a complex language with many different language constructs, including loops, objects, methods, assignment, dictionaries etc., all of which are used in this example.

As I see it, a key attraction of Prolog is its simplicity: With a single language construct (Horn clauses), you are able to express all known computations, and the example queries I posted show that only a single language element, namely again Horn clauses to express a query, is needed to run the code. The Prolog query, and also every Prolog clause, is itself a Prolog term and can be inspected with built-in mechanisms.

As a consequence, an immediate benefit of using Prolog for such use cases is that you can easily reason about user-specified queries in your applications, and for example easily allow only a safe subset of code to be run by users, or execute a user-specified query with different execution strategies etc. In comparison, Python code is much harder to analyze and restrict to a particular subset due to the language's comparatively high syntactic complexity.

The benefit of Python is that developers already know about these language constructs, and that more developers know Python than Prolog.

I don't think the op's point was "how easy it would be to hire developers", or even "taking all the considerations a business is under, I feel Prolog makes sense". He was just touting how easy Prolog's built in pattern matching and declarative style makes implementing and using selectors at a language level.

Honestly, if we didn't talk about the benefits of a language irrespective of how easy it is to hire for it, we'd never have introduced anything beyond FORTRAN, if we even made it that far. Bringing "X is easier to hire for" into a conversation about the language is, at best, a non sequitur.

We might have been better off that way. FORTRAN does have its downsides, but language churn itself has downsides that almost always outweigh the assumed upsides of a better language.

If we had just stuck with FORTRAN forever, how many problems would have been completely avoided!? There’d be better, and more, IDEs, since even if the language is hard to parse, it’s still just one parser that needs all the effort. So many unfortunate problems in education caused by language and ecosystem churn would have been avoided (the infamous “by the time you graduate, it’s always outdated” problem).

The only problem is that FORTRAN is too new. Should’ve stuck with the Hollerith tabulator.

Genuinely having a difficult time determining if this is meant to be satire.

Ye olde pragmatist vs idealist.

Fun aside: In practice I've found that most people touting what's easy to hire for -vastly- overestimate how difficult it is to pick up a new language sufficiently well to be productive in it and able to support it in production. This is doubly amusing when you consider that the same people also frequently tout how they want to "hire the best".

Thanks, that's a rare example of something which is (a) simple enough to understand for a Prolog newbie like me, and (b) more practical than the ubiquitous family-tree example.

I'm always looking for opportunities to dip my toes into Prolog; in hindsight it's clearly a good fit for tree-structured data structures.

Interestingly, the only other context in which I've come across Prolog is from friends who studied at Cambridge, here in the UK. For some reason, the CS 'tripos' (course) there is really heavily focussed on Prolog, and everyone I know from there ended up a huge fan of the language. I'm not sure why that's the case, though, given that almost all other universities seem to use more common languages (Java, C++, etc).

"Prolog as a library" => Given "functional" constraints => $CONSTRAINTS.prolog( "query..." ) => results

...many languages (similar to regex / state-machine) can benefit greatly from offloading a portion to something prolog-ish, but it's unfortunate that prolog knowledge isn't as widely distributed.

I studied CS at a different university in the UK and we used Prolog for one module on AI or perhaps machine vision. I really enjoyed working with it. This was 15 years ago. Looking through their current curriculum I can't see Prolog being mentioned anymore. Shame!

cs.man.ac.uk, at least back in 1992, had a compulsory Prolog module in the first year. Don't know anyone from then who didn't hate that module with a burning passion.

(There was no Java, C++, etc. either. It was SML, Pascal, 68000, and Oracle Pascal-Embedded-SQL.)

AFAIK, this was first proposed and implemented in Ciao Prolog back in late 90s (modern versions here: https://ciao-lang.org/ciao/build/doc/ciao.html/html.html). It was way before Python was popular and JavaScript ever existed.

I tried to run this on my computer now, but as a complete Prolog noob, I'm getting errors running the script. How do you load the http_open module/library in the first place? I tried following some Prolog tutorials in the past but I always get stuck trying to run something in the REPL. I'm using scryer-prolog. Thanks in advance!

The libraries I mentioned can be loaded by invoking the use_module/1 predicate on the toplevel. Here is the complete transcript that loads the SGML, HTTP and XPath libraries in Scryer Prolog:

    ?- use_module(library(sgml)).
    ?- use_module(library(http/http_open)).
    ?- use_module(library(xpath)).
The second query also uses portray_clause/1 from library(format), which you can load with:

    ?- use_module(library(format)).
After all these libraries are loaded, you can post the sample queries from above, and it should work.

There are also other ways to load these libraries: A very common way to load a library is to use the use_module/1 directive in Prolog source files. In that case, you would put for example the following 4 directives in a Prolog source file, say sample.pl:

    :- use_module(library(sgml)).
    :- use_module(library(http/http_open)).
    :- use_module(library(xpath)).
    :- use_module(library(format)).
And then run sample.pl with:

    $ scryer-prolog sample.pl
You can then again post the goals from above on the toplevel, and it will work too.

Another way is to put these directives in your ~/.scryerrc configuration file, which is automatically consulted when Scryer Prolog starts. I recommend doing this for libraries you frequently need. Common candidates for this are for example library(dcgs), library(lists) and library(reif).

Personally, I start Scryer Prolog from within Emacs, and I have set up Emacs so that I can consult a buffer with Prolog code, and also post queries and interact with the Prolog toplevel from within Emacs.

Wow that works fantastically! Thank you for that. It almost seems like magic.

This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath.

[0] https://en.wikipedia.org/wiki/XPath
[1] https://github.com/benibela/xidel

I'd like to state my support for the author's choice of CSS selectors in this particular use case. I think it's a natural fit for this domain and already very well known, perhaps even known better than XPath.

I'd like to add my support here too, but with a note.

When scraping and parsing (or writing an integration-test DSL), I always start out with CSS selectors. But I always hit cases where they fall short or require hoop-jumping, and then fall back on XPath. I then have a codebase with both CSS selectors and XPath, which is arguably worse than having only one method.

I suspect that here, too, one uses this tool until CSS selector limitations get in the way, after which one switches to another tool(chain)

I've not had much friction using either, they are "close enough" that the time to (re)write a query from one to the other is not very significant.

Do you mind giving an example? I'm having trouble following where CSS is limited for selection.

XPath does general data processing, not just selection

E.g. when you have a list of numbers on the website, XPath can calculate the sum or the maximum of the numbers

Or you have a list of names "Last name, First name", then you can remove the last name and sort the first names alphabetically. Or count how often each name occurs and return the most popular name.

Then it goes back to selection, e.g. select all numbers that are smaller than the average. Or calculate the most popular name, then select all elements containing that name
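To make that concrete, here is roughly what such an aggregation looks like when the selection is done with Python's stdlib ElementTree and the arithmetic in plain Python. (A real XPath 2.0 engine, as in xidel, does all of this inside the query itself; the document below is invented for illustration.)

```python
# Select a list of numbers from a document, then aggregate and filter.
import xml.etree.ElementTree as ET

doc = ET.fromstring("<prices><p>3</p><p>8</p><p>5</p></prices>")

numbers = [int(p.text) for p in doc.findall(".//p")]
print(sum(numbers))  # total: 16
print(max(numbers))  # maximum: 8

# Back to selection: numbers smaller than the average
avg = sum(numbers) / len(numbers)
print([n for n in numbers if n < avg])  # → [3, 5]
```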

Like the other commenter says: parent/child. But also selecting by content (e.g. "click the button with the delete-icon" or "find the link with '@harrypotter'") or selecting by attributes (e.g. click the pager-item that goes to the next page) or selecting items outside of body (e.g. og-tags, title etc). All are doable in CSS3 selectors, but everything shouts that they are not meant for this; whereas XPath does this far more naturally.

  # The element(s) before an element
  //h3/preceding-sibling::p[1]

  # Match something's parent
  //title/..

  # Match all ancestors
  //title[@id = 'abc']/ancestor::comment

  # Element with src or href attr
  //*[@src or @href]

  # Multiple conditions
  //article[@state = "approved" and not(comments/comment)]

  # Element with more than two li children
  //ul[count(li) > 2]

  # Element with matching descendants
  //article[.//video]

  # Element text containing substring
  //p[contains(text(), "Foo")]

  # Attribute ending with substring (XPath 2.0)
  //a[ends-with(@href, ".jpg")]

  # Numerical attribute selection
  //product[@price > round(2.5 * @discount)]
  //product[sum(//*[starts-with(name(), 'price-')]/@price) > 0]

  # Attribute values
  //a/@href

  # Text values with spaces normalised
  //a/normalize-space(text())

  # Match all attributes, element nodes, text nodes, or comment nodes
  //user/@*
  //user/node()
  //user/text()
  //user/comment()

Basically from any node in a document you can select its ancestors, children, descendants, siblings, attributes etc, and filtering has the same power as selecting does - in CSS there's :not() that can apply to selection or filtering, with :has() finally on the way and no :or(). CSS selectors match against HTML elements and they're great for that almost all of the time, but while you can filter by attribute value including substring and even by regular expression, for text there's :empty.

But for a query syntax you need to be able to select attributes and text content as well as elements. Either extend XPath to support #id and .class syntax

  //#user-xyz//note/text()
  //code.language-js/@name

or extend CSS to allow selecting attrs and text

  #user-xyz note :text
  code.language-js @name

The former is more powerful, the latter a quick hack (if they only appear at the end of the selector anyway) with instant payoff.

Searching text content is my main remaining use of XPath.

Well, the big one is selecting a parent from the child.

You could do this with the :has() CSS pseudo-class[0], though inverted (select a parent that _has_ the child matching a selector).

Looks like that pseudo-class has not been implemented in the kuchiki library that htmlq uses though.

[0]: https://developer.mozilla.org/en-US/docs/Web/CSS/:has

You can do it either way in XPath thanks to how you can use a path expression and/or predicates almost everywhere in a query

  # Find all elements li and select the parent element for each
  //li/..

  # Find all element nodes with a child element named li
  //*[li]

  # Non-abbreviated queries
  /descendant-or-self::node()/child::li/parent::node()
  /descendant-or-self::node()/child::*[child::li]

  # CSS using :has
  :has(> li)
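Incidentally, even the small XPath subset in Python's stdlib ElementTree can express the parent-from-child selection in both directions; a quick sketch with made-up markup:

```python
# ".." walks up to the parent, and a [tag] predicate acts much like
# the proposed CSS :has(> tag).
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body><ul id='menu'><li>a</li></ul>"
    "<ol id='steps'><li>b</li></ol>"
    "<div id='empty'/></body>"
)

parents = doc.findall(".//li/..")      # the parent of each <li>
print([p.get("id") for p in parents])  # → ['menu', 'steps']

has_li = doc.findall(".//*[li]")       # elements with an <li> child
print([e.get("id") for e in has_li])   # → ['menu', 'steps']
```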

The Playwright people had to solve this for themselves: you can mix CSS and XPath selectors since they are distinct, and they added a few small custom modifications to help with selectors. Playwright-compatible selectors would be nice.

My web scraping tends to start with xidel. If I need a little bit more power I'll use xmlstarlet. If neither of those is enough, I'll use Python's beautifulsoup package :)

I like xmlstarlet too, if only because it's old enough that I can reliably get it in package repositories and the dependency footprint is tiny (less an issue now with this tool written in Rust, but previously I was comparing to NPM- and PyPI-based affairs).

lxml is one of the most pleasing to use Python libraries ever, managing to wrap a hot mess of XML APIs in a consistent and Pythonic fashion that you rarely need to escape. IIRC I used beautifulsoup to parse the HTML of a site, and then lxml and either find items and fields by CSS in IPython for quick and dirty data munging, or knock up an XSLT file to transform what I'd scraped into good data in an XML file :)

Thanks, this looks more powerful. It supports CSS, XPath and XQuery. Maybe I could learn a bit of XQuery when I have a use case for it :)

Well, here’s your first lesson then: if you prepend (: to your comment it will become a valid XQuery document!

(: XQuery comments are marked by mirrored smilie faces, like this. :)

Well, yes, but also no

An empty query is not valid. There needs to be something besides the comment

Nice - I've been writing XQuery for years and I had no clue

Everything that isn't a (: happy comments :) is a FLWOR:

    for $user in //users
    let $comments := //comment[@uid = $user/@id]
    where count($comments) > 0
    order by $user/lastName, $user/firstName
    return <user id="{ $user/@id }">
      <name>{ concat($user/firstName, " ", $user/lastName) }</name>
      <comments count="{ count($comments) }">
        { for $c in $comments return <comment id="{ $c/@id }" /> }
      </comments>
    </user>
It's the bastard child of SQL and XPath 2 lol.


I kinda liked XQuery, but it seemed to never have got much traction.

This looks really neat! It supports a bunch of different query types, and can even do things like follow links to get info about the linked-to pages!

It's also in nixpkgs, though for some reason the nixpkgs derivation is marked as linux-only (i.e. not Darwin). (Edit: probably because the fpc dependency is also Linux-only, with a linux-specific patch and a comment suggesting that supporting other platforms would require adding per-platform patches)

part of the problem with this is that HTML is mostly not valid XML

Once upon a time I was using pup[0] for such things, and later I changed to cascadia[1], which seemed much more advanced.

Comparing the two repos, it seems pup is dead, but cascadia may not be.

These tools, including htmlq, seem to sell themselves as "jq for HTML", which is far from the truth. jq is closer to awk, where you can do just about everything with JSON. Cascadia, htmlq, and pup seem closer to grep for HTML. They can essentially only select data from an HTML source.

[0] https://github.com/EricChiang/pup [1] https://github.com/suntong/cascadia

Well, jq is grep as well as sed and awk, but yeah, htmlq seems to be just grep, for sake of comparison.

But I don't think HTML has any need for a sed/awk tool, or at least not as much. JSON output could very well be piped forward to the next CLI tool after you've changed it slightly with jq. I don't see this scenario as likely with HTML.

> Well, jq is grep as well as sed and awk, but yeah, htmlq seems to be just grep, for sake of comparison.

Exactly, and that is what I mean. If you want to compare, compare it with grep, not jq.

Someone else posted xidel[0] in this thread, which I've not used, but it seems to be the "jq but for html".

[0] https://github.com/benibela/xidel

I've used pup for a few projects, but was unaware of cascadia. Thanks for pointing it out.


This is the kind of obvious tool that, once it exists, you can't really grok why it didn't exist earlier, and why it took until now to appear.

> grok

A good opportunity to introduce `gron` to those unfamiliar!

    ▶ gron "https://api.github.com/repos/tomnomnom/gron/commits?per_page=1" | fgrep "commit.author"
    json[0].commit.author = {};
    json[0].commit.author.date = "2016-07-02T10:51:21Z";
    json[0].commit.author.email = "mail@tomnomnom.com";
    json[0].commit.author.name = "Tom Hudson";
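The transformation gron performs is simple enough to sketch in a few lines of Python. This toy version (which omits gron's URL fetching, ungron mode, and quoting of unusual keys) shows the idea:

```python
# Flatten a JSON value into greppable assignment paths, gron-style.
import json

def gron(value, path="json"):
    lines = []
    if isinstance(value, dict):
        lines.append(f"{path} = {{}};")
        for k, v in value.items():
            lines.extend(gron(v, f"{path}.{k}"))
    elif isinstance(value, list):
        lines.append(f"{path} = [];")
        for i, v in enumerate(value):
            lines.extend(gron(v, f"{path}[{i}]"))
    else:
        # Scalars are serialised as JSON literals
        lines.append(f"{path} = {json.dumps(value)};")
    return lines

doc = {"commit": {"author": {"name": "Tom Hudson"}}}
print("\n".join(gron(doc)))
# json = {};
# json.commit = {};
# json.commit.author = {};
# json.commit.author.name = "Tom Hudson";
```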


"A good opportunity to introduce `gron` to those unfamiliar!"

Thank you - appreciated.

I haven't done much work with json but have had reasons recently to do so - and I immediately saw how difficult it was to pipeline to grep ...

But what I still don't understand is that some json outputs I see have multiple values with the exact same name (!) and that still seems "un-grep-able" to me ...

What am I missing ?

  > But what I still don't understand is that some json
  > outputs I see have multiple values with the exact same name
This is neither explicitly allowed nor explicitly forbidden by the JSON spec. How to handle it is implementation-dependent: does one value override the other? Should they be treated as an array?

In practice, this situation is usually carefully avoided by services that produce JSON. If you are interfacing with a service that does produce duplicate values, I'd be interested in seeing it for curiosity's sake. If you are writing a service and this is the output, then I implore you to reconsider!
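For the curious, Python's json module illustrates the implementation-dependence: json.loads silently keeps the last value, but an object_pairs_hook lets you observe or collect the duplicates instead. (collect_dupes is just an illustrative helper name.)

```python
# Duplicate JSON keys: default behaviour vs. collecting all values.
import json
from collections import defaultdict

raw = '{"a": 1, "a": 2}'

print(json.loads(raw))  # → {'a': 2}  (last value wins)

def collect_dupes(pairs):
    # pairs is the full list of (key, value) tuples, duplicates included
    out = defaultdict(list)
    for key, value in pairs:
        out[key].append(value)
    return {k: v if len(v) > 1 else v[0] for k, v in out.items()}

print(json.loads(raw, object_pairs_hook=collect_dupes))  # → {'a': [1, 2]}
```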

You might be missing a change in index: `obj[0].prop` vs `obj[1].prop`. Or, your JSON might have the same property defined multiple times: `{a:1, a:2}` (though I'm not sure how gron handles that situation).

> (though I'm not sure how gron handles that situation).

It seems both gron and jq only use the value that has been defined last:

  ~  echo '{"a":1,"a":2}' | gron
  json = {};
  json.a = 2;
  ~  echo '{"a":1,"a":2}' | jq .
  {
    "a": 2
  }

The json output likely contains multiple objects. Can you request more specifically the object(s) you need and grep on that?

It's not novel obviously. I have been using pup[1] for years. And xidel[2] is probably older.

[1] https://github.com/ericchiang/pup

[2] https://github.com/benibela/xidel

I did write it a few years ago.


and I use it almost every day. It's great, thank you very much!

There are already tools for xpath, but using css selectors is much more aligned with what I write every day, so that's nice.

Yes, and awk and others. I meant something semantically closer to the need, with CSS selectors.

I have been using hxselect from the html-xml-utils package to do this for many, many years.

It doesn't handle malformed HTML that well but can be coaxed into working about 90% of the time, with the help of the other included package hxclean or something like html-tidy.

"htmlq: like jq, but for HTML"

"jq is like sed for JSON data"

sed: "While in some ways similar to an editor which permits scripted edits (_such as ed_), sed works by making only one pass over the input(s)"

ed: "ed is a line-oriented text editor".

Software definition through a reference to another piece of software is somewhat confusing. Potential users come from different backgrounds (I had no idea what jq is), and it is not clear what the defining features of each project are. Is jq line oriented? Is htmlq operating in a single pass?

1st sentence - Explaining the tool for those the tool was made for without beating around the bush.

2nd sentence - Explaining the tool to folks in the general web domain what it can do for them.

3rd sentence - Explaining where to learn how to use the tool if you've stumbled across it but web is not your area of expertise.

All that info fits in nearly 25 words, then it lists the options for the tool and jumps straight into multiple examples (with outputs!). If the only explanation had been "htmlq: like jq, but for HTML" I'd agree, but having the comparison to explain what it does isn't a bad thing; it's _only_ having the comparison that would be bad.

Personally I think this is a model example of an opening for a GitHub README.

I disagree. The 2nd sentence contains, "extract bits content." What is that?

If you're going to write a minimal introduction, at least make sure it's not confusing.

I get the feeling the author felt compelled to write an introduction and did so with as little effort as possible.

I believe he tailored it to his target audience. If you find it confusing, you are likely not it.

As a web developer for over a decade, "bits content" doesn't mean anything to me. But I understand what the tool does from the rest of the description. Try running a Google search for "bits content" [0]; it's not a commonly used phrase in web development or anything. It's a poor choice of words.

0. https://www.google.com/search?hl=en&q=%22bits%20content%22

It's supposed to be "bits of content", it's not jargon. The author's just accidentally a word, we all do it.

It's more than fair to say that, in technical documentation you intend others to use, a grammatical error or missing word is confusing and a problem. It's the writing equivalent of having a bug in your code. And it's definitely not "writing to a target audience" as the parent comment suggested. We all make mistakes, but don't try to call a mistake effective documentation.

Of course it is, but neither parent nor anyone else is saying anything close to the mistake being effective documentation. There's a single missing word which needs to be added in, but the overall text is clearly writing to a target audience. You are aware of this, and of how small the mistake is, and you understand what the sentence should read as, so I'm not sure what your point is?

My hunch is that this is a typo and it should read "extract bits OF content."

I agree and having a missing word in your text often leads to confusion :)

Honestly you could drop the "bits" which is a bit redundant and use the phrase "Uses CSS selectors to extract content from HTML files."

Exactly this! I’ll fix it after work.

Maybe have the line about "jq" be 2nd. Have the first line be a brief description of what it actually does.

"htmlq is like jq but for html" is a very specific 'dog whistle' for people who use jq. I agree that people who don't know what jq is will get no value and pay no attention. But for people who use jq, the claim is, like a dog whistle, clear, concise, and means exactly what it says. In two seconds, everyone using both jq and html will instantly know what is available and log it away.

So for general purposes, it's a terrible marketing pitch. And yet I think it's a very, very valuable demonstration of knowing some of their 'customers'.

this isn't what a dogwhistle is. it's just explanation by analogy to a model presumed to be shared by the intended audience. a dogwhistle offers a surface meaning to the uninitiated that's anodyne but communicates a hidden, coded message to those who possess some undisclosed, shared knowledge with the author. this kind of analogy entirely lacks the surface meaning and the message shared via jargon also communicates something about how you might learn enough to understand the analogy.

I can't speak for people who don't know jq, but knowing jq, this is a great tagline: it gives me an immediate understanding of what it does, how I could expect to use it and what value and ease of use I can expect.

I'll be trying it out next time I'm on a PC.

> I can't speak for people who don't know jq,

I can, and it's not illuminating at all.

I agree, however if you do know how to use jq then "like jq, but for html" is extremely effective. I use jq all the time and that title hooked me, I immediately wanted to try it.

But if you haven't used jq, then I can see how that title is less than helpful.

The first three are not proper definitions per se but kind of an advertisement, trying to familiarize by self-comparison with a tried and true tool that has proven its worth.

You know Jimmy the famous mechanic? I'm Timmy, _his brother_ but an electrician.

IMO, at least `jq` has proven itself as the indispensable tool for json-data manipulation.

What's this thing called a "computer" that people keep going on about, anyway?

It's a person who does mathematic calculations all day. For example, creating range tables for artillery, calculating averages or totals of a large range of values, or solving complex integrals or differential equations, and so on. They're commonly used in industry or government, especially in astronomy, aerospace and civil engineering for both simulation and analysis. Perhaps the most well-known computers were the Harvard Computers, which operated in the late 19th and early 20th centuries.

As a job, computers were largely automated out of existence by solid-state transistor based automated computers and integrated circuit transistor automated computers in the 60s, 70s and 80s, which replaced the enormously expensive and often largely experimental electro-mechanical automated computers while radically reducing cost and improving performance both by several orders of magnitude.

It's like a programmable loom, but for logical and mathematical operations.

You may be interested in the symbol grounding problem (https://en.wikipedia.org/wiki/Symbol_grounding_problem#Groun...). It's like the binding problem, but for symbols.

Sort of related: [Expecting Short Inferential Distances](https://www.readthesequences.com/Expecting-Short-Inferential...)

Here - this explains it really succinctly:


I mean...if you read the github readme it literally describes what it does in the next line: "Uses CSS selectors to extract bits content from HTML files".

> Software definition through a reference to another software is somewhat confusing.

Possibly, depending on background as you note, but not all promotion is intended at the same audience. When submitting to HN, "like jq, but for X" is short and conveys what it is to most the people that would care, I think. jq has been submitted and talked about here many times with lively discussion over the years.[1] At this point I think most those that are interested in what that is and what this is will understand fairly quickly from the title. Those that don't might be missed, or they might look it up like you, or they might see it through some other submission some other time with a different title which isn't based on a chain of references.

1: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

jq isn't line-oriented, it's JSON-oriented: it operates on a stream of JSON values from stdin, so its query is applied to each value in sequence.

I would expect that htmlq runs the query a single time for a single HTML document, just like jQuery's $('#something') or document.querySelector('#something')

Very nice tool. I've long spoiled myself with Powershell's:


  eg. # what is the latest release of apache-tomcat?
  $LINKS=$(Invoke-WebRequest -Uri 'https://tomcat.apache.org/download-80.cgi' | Select-Object -ExpandProperty Links)
  $LATEST=$($Links | Where-Object -Property href -Match '#8.5.[0-9]+').href.substring(1)
  $FETCH=$($Links | Where-Object -Property href -match "apache-tomcat-${LATEST}.zip$").href

Should it be $LINKS instead of $Links (2x)?

"$links" works too because PWSH is not case sensitive. But I should have used $LINKS like you said for cleaner write-up.

See also the html-xml-utils from w3c.

hxextract and hxselect perform similar extract functions.

hxclean and hxnormalize (combined) will pretty-print HTML.


Funny, a couple of years ago I thought someone should create something for JSON similar to what [XSLT](https://en.wikipedia.org/wiki/XSLT) is for XML. See example here https://www.w3schools.com/xml/xsl_intro.asp

Then I found out about jq because awscli was using it in example docs.

I guess `htmlq` makes sense if it has the exact same syntax as `jq`, and the user is already familiar with the latter?

JSON schemas are a thing that exists and can be useful, and `jq` probably covers the 80% of use cases for querying JSON.

XSLT can be an amazing tool when used properly and I've wondered about a JS equivalent over the years and started writing one on a couple of occasions. But JSON is just a data structure and not structured markup, and there's no sweet spot for a transformation tool like XSLT - you're more likely to be doing a "find items in JSON, filter() them, then map()/reduce() to output format" task that takes a minute or two in Node and then never gets used again, or doing a complete map from one domain to another where you'd need to do it in JS because of the complexity of processing and ability to handle errors, use third-party tools and even write tests.

An XQuery-esque language would be the ideal: something allowing selecting bits of JSON file(s) with filtering, grouping and ordering built-in, combined with a way of projecting results that's no worse than JS allows for, i.e. not having to put quotes around everything and the like :)
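For what it's worth, the throwaway "find items, filter() them, then map()/reduce()" task described above looks like this in Python (the data here is invented):

```python
# The quick one-off JSON munging task: find, filter, map, reduce.
import json

raw = json.loads("""
{"users": [
  {"name": "ada",   "active": true,  "score": 3},
  {"name": "brian", "active": false, "score": 9},
  {"name": "clara", "active": true,  "score": 7}
]}
""")

active = [u for u in raw["users"] if u["active"]]   # filter()
names = [u["name"] for u in active]                 # map()
total = sum(u["score"] for u in active)             # reduce()
print(names, total)  # → ['ada', 'clara'] 10
```

Quick to write, and indeed the kind of script that never gets used again.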

Looks nice! Any comparisons with pup?


I'd use something like this script that you can put together yourself:

  #!/usr/bin/env ruby
  require 'nokogiri'; p Nokogiri::HTML(STDIN.read).css(ARGV[0]).text
Just save it as /usr/local/bin/hq and do chmod +x !$

Then you can do:

  curl -s "https://news.ycombinator.com/news"|hq "tr:first-child .storylink"
It uses Nokogiri[0], which is much more battle tested and works with CSS and XPath selectors.

[0] https://nokogiri.org/tutorials/parsing_an_html_xml_document....

Command just prints a bunch of text stitched together:

  curl -s "https://news.ycombinator.com/news"|hq "tr .storylink"
  Deploy a website on imgur.comMy £4 a month server can handle 4.2M requests a dayFirst Edition...

Make sure you add :first-child as in my example, otherwise you'll get all the stories smooshed together.

I want to see all stories, not the first one only.

Call it "hq".

Just being that guy: is there a reason you didn't call it hq?

I just wanted to be slightly more descriptive and less likely to collide with other tools.

Hahah, I love how this is your second comment in 10 years on HN.

Hah. Yeah. I had another account for a little while but then HN started to let me reset the password for this one quite recently, so here I am.

Not the author (and neither is the poster): jq got away with it because it's one of the few 2-letter combinations that wasn't absolutely overloaded, and "jquery" was already taken. OTOH nobody shortens HTML to H, and HQ is an extremely common acronym, if not one of the most popular 2-letter acronyms you could pick.

jq didn't get away with it! Have you never tried searching for anything to do with it? How I wish it were called `jsonq`!

I’d like to see a tool using lol-html [0] and their CSS selector API as a streaming HTML editor.

[0] https://github.com/cloudflare/lol-html

If anyone is looking for a good library to do this in Python, PyQuery works well:


From examples, this is only like jq in the sense that the q stands for the same thing. Even the way it does that is different.

An xmlq that was really like jq would be fun, about 20 years ago.

There is `xq` today, which parses XML like `jq`. I think it's relatively unknown because it's part of the `yq` package for parsing YAML. So just install `yq` via pip and you'll get `xq` as well.

There is also `xmlstarlet` for parsing XML in a similar fashion.

Just looked into this and I think it's worth mentioning that there are two different projects called `yq`. The first one that came up (written in go instead of python) is not the right one and doesn't have the `xq` tool.

xmlstarlet is really nothing like jq, as a language. But yes, I use it because it is the best commandline xml processor I'd found. That's the only similarity to jq.

Is this the yq? https://kislyuk.github.io/yq/ It does contain an 'xq', as a literal wrapper for jq, piping output into it after transcoding XML to JSON using xmltodict https://github.com/martinblech/xmltodict (which explodes xml into separate JSON data structures).
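The xmltodict "explode XML into JSON" step can be approximated with the stdlib; a rough sketch (the collapsing rules here are simplified compared to xmltodict's actual ones):

```python
import json
import xml.etree.ElementTree as ET

# Explode an XML tree into nested dicts/lists so a jq-style
# tool can operate on the JSON output. Attributes get an "@"
# prefix; a text-only leaf collapses to a plain string.
def etree_to_dict(elem):
    node = {"@" + k: v for k, v in elem.attrib.items()}
    for child in elem:
        node.setdefault(child.tag, []).append(etree_to_dict(child))
    text = (elem.text or "").strip()
    if text and not node:
        return text
    if text:
        node["#text"] = text
    return node

xml = '<config><db host="localhost"><port>5432</port></db></config>'
root = ET.fromstring(xml)
doc = {root.tag: etree_to_dict(root)}
print(json.dumps(doc, indent=2))
```

The output can then be piped straight into jq, which is essentially what the `xq` wrapper automates.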

This is a bash one-liner! But TBF it really is a 'jq for xml'. I think it would be horrible for some things, but you could also do a lot of useful things painlessly.

Thank you for the comments. I've only recently discovered both tools, and literally used them once each. Of the two `xq` was easier for my particular work case (parsing a Magento config) but I keep both tools in my virtual toolbox.

If you have any other suggestions for parsing XML for exploratory purposes I'm very happy to hear them.

Thanks! Not actually a recommendation, but I have used xsltproc (command-line XSLT). It is horrible to use because XSLT syntax is horrible (though XSLT's concepts are pretty cool). One thing it does give you is XPath in all its glory.

Just installed xq. It's nice just seeing the pretty-printed json output, so thanks for the pointer. Probably better than xmlstarlet for my usage, which just queries and outputs text, not xml. hmmm, that's probably true for most commandline uses...

I would still like xmlq, there are (regrettably) still a lot of applications that store data and configuration in xml

> like jq

"jq is a lightweight and flexible command-line JSON processor"

If you make the html well formed, xpath also works great. Great stuff if you ever need to pick html apart. Used this quite a bit when microformats were still a thing together with jtidy.
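For instance, once the markup is valid XML/XHTML, even Python's stdlib ElementTree, which supports only a limited XPath subset, can run this kind of query (the sample document below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML sample; real-world HTML would need tidying first.
xhtml = """<html><body>
  <ul>
    <li><a href="/tools">Tools</a></li>
    <li><a href="/learn">Learn</a></li>
    <li><span>no link here</span></li>
  </ul>
</body></html>"""

root = ET.fromstring(xhtml)
# ".//a[@href]" selects every <a> with an href attribute, at any depth.
links = [a.get("href") for a in root.findall(".//a[@href]")]
print(links)  # ['/tools', '/learn']
```

Tools like tidy exist precisely to get real-world HTML into this well-formed state first.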

Jq is very loosely inspired by that, I guess. Might come full circle here and use some XSL transformations ...

You can usually find a html parser for your language, that you can use xpath/xsl on. It will just make the same assumptions that the browser does, by adding missing closing tags etc.

I made a tool that extracted parts of web pages 10-15 years ago, and it worked well. There are of course cases where the html is so unstructured that the results were unpredictable, but it worked well in general.

it's statically linkable rust, isn't it? Awesome. I'm looking for a successor to

$ xmllint --html --xpath …

that doesn't choke on inline svg.

I tend to reach for XPath selectors before CSS ones when querying HTML.

Nice, I expected something based on XPath (like xpd), but web developers dealing with HTML are infinitely more familiar with CSS selectors, so a great choice!

I want the option to use both, like Nokogiri gives you.

Sure, that sounds nice, but having two simple tools each doing the job well in its own space is perfectly fine for me — do you imagine needing to combine XPath and CSS queries in a single run?

I've had to do it when dealing with some poorly-designed XML apis in the past. Nokogiri was a godsend.

shameless self promotion: parsel[0] is a python script in front of the identically named python lib, and extracts parts of the HTML by CSS selector. the advantage of it compared to most similar tools is that you can navigate in the DOM tree up and down to find precisely what you want if the HTML is poorly marked up, or the searched parts are not close to each other.

[0] https://github.com/bAndie91/tools/blob/master/usr/bin/parsel

I've been looking for a library that can find the best set of selectors to most consistently find the element you're looking for in a page.

Any pointers to something that exists? Interestingly I've also found very little for dom extraction in the OS ML space.

https://jsoup.org/ has been around for a long time and seems a bit more mature & maintained than this two-code-files 2-year-old repo. Highly recommend.

Why? I find xpath's syntax much simpler and regular than jq's.

And a Java version with pre-compiled binaries: https://github.com/ludovicianul/hq

Super useful. You've created a fantastic tool here. Thank you.

When I saw the title I thought this was some machine-learning-specific rmq/0mq-style message-passing tech. Glad to see it's something else entirely.

This is nifty! Python + bs4 takes some googling to remember how to parse a webpage. This is just straightforward, thanks so much.

This looks really cool! I'd love to see a generic query language/tool library for structured data.

brilliant. does this spin up a heavy DOM implementation in the background or do something lighter such as regexp?

You can't parse html with regular expressions :)


"Oh Yes You Can Use Regexes to Parse HTML!"


Yeah, if you allow yourself some Perl to help you with those parts that regexes can't handle...

Technically correct, but did you see the regex he uses? It spans 82 lines...

And the obligatory caveat from the comments:

> While parsing arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML.

The emphasis here is on "known". The tool is general purpose (i.e. handling unknown HTML) so using regexes would be ill-advised.
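For a concrete sense of the "known" case: against a small, fixed snippet whose shape you control, a regex is fine; it's arbitrary HTML (nesting, comments, odd quoting) where it falls apart. A Python sketch with an invented snippet:

```python
import re

# Appropriate only because the snippet's shape is fully known:
# href always comes first, always quoted, no nested weirdness.
snippet = '<a href="/one">One</a> <a href=\'/two\'>Two</a>'
hrefs = re.findall(r"""<a\s+href=["']([^"']+)["']""", snippet)
print(hrefs)  # ['/one', '/two']
```

The same pattern silently misses `<a id="x" href=...>` or unquoted attributes, which is exactly why a general-purpose tool needs a real parser.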

Looks like it uses Servo's html5ever (through kuchiki), so no DOM representation.

Kuchiki materialises what they call a “DOM-like tree”. I’d consider it a DOM tree, myself, despite the differences in precise API.

But it’s not using a full browser to back it, which I suspect is what’s really being asked.

It looks to be using html5ever to parse the HTML, similar to something like BeautifulSoup in Python.

The source is right there. You can read it. It uses html5ever (part of the servo project).

You can’t parse HTML with regexps. It’s not a regular language.

What language implements regexps that actually correspond to regular languages though?

Crazy how a 300-line codebase manages to amass 2000 stars on Github and 700 upvotes on HN. Amazing ROI.

is anyone else using the https://github.com/json-path/JsonPath over the jq route?

I hope we standardize on some jq query language, like we have with a base set of SQL syntax

Maybe call it hq ?

My thoughts EXACTLY... but anyway, great new utility indeed!

Haha, indeed it's a very good utility :D

This is very cool. This will make scraping the web even easier!

is there a brew install command ?

What is jq?

Good work!

This is great! Thanks

Why not just jquery?

what's wrong with using html tidy + xmllint ?

Nothing wrong. Searching the unmodified HTML, though, is sometimes preferable.

Why not incorporate this into jq itself, like perhaps adding some command line arguments to switch to HTML mode?

What would the benefits of fitting a HTML parser into a JSON parser tool be?

Well once there's an HTML parser, then a pdf viewer, and then everything needed for PDFs (ie., programming, emailer, video support, etc.) we'll finally have that ideal operating system we've been waiting for.

JQ is not just a parser but a tool for doing operations, many of which are (or should be) generic across any tree-like data format. Reusing that part across different input formats makes a lot of sense.

sounds a lot more like blockchain.

Would probably be more useful to implement html2json, and pipe in html?

Ed: eg: https://github.com/Jxck/html2json
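As a sketch of the html2json idea, using only Python's stdlib (the [tag, attrs, children] output format here is invented for illustration; the linked project defines its own schema):

```python
import json
from html.parser import HTMLParser

# Convert HTML into nested [tag, attrs, children] triples,
# serializable as JSON and therefore pipeable into jq.
class ToJSON(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ["document", {}, []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs), []]
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][2].append(data.strip())

p = ToJSON()
p.feed('<div id="x"><p>hello</p></div>')
print(json.dumps(p.root))
```

With something like this in front, `curl ... | html2json | jq ...` gets you jq's full query language over HTML without teaching jq anything about markup.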

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact