You can't parse [X]HTML with regex (2009) (stackoverflow.com)
127 points by BerislavLopac 41 days ago | 130 comments

2nd answer: "While arbitrary HTML with only a regex is impossible, it's sometimes appropriate to use them for parsing a limited, known set of HTML."

This. So much this. Yes, you can't parse arbitrary, unknown XML with regex. But I don't find myself parsing arbitrary, unknown XML very often. Usually I know exactly what I'm expecting and if I can't find the information I need then it's a problem. Regex parsing is perfect for this scenario - and much, much faster. I created a regex parser for Java that even handles namespaces and relative paths. Can it parse every XML file? No - you can't parse XML with regex. But I can parse everything I need to parse - and if I can't? I can always fall back on full-featured XML parsers.
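
A minimal sketch of that "regex first, full parser as fallback" idea, in Python rather than the Java mentioned above. The `<price>` element and the function name `extract_price` are made up for illustration; the regex only covers the exact shape we expect, and anything else goes to the standard library's XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema: the document contains a plain <price>NUMBER</price>.
PRICE_RE = re.compile(r"<price>\s*([0-9]+(?:\.[0-9]+)?)\s*</price>")

def extract_price(xml_text):
    """Try the cheap regex first; fall back to a real XML parser."""
    match = PRICE_RE.search(xml_text)
    if match:
        return float(match.group(1))
    # Fallback path: copes with attributes on the element and similar
    # variations that the regex above deliberately ignores.
    root = ET.fromstring(xml_text)
    node = root.find(".//price")
    if node is None or node.text is None:
        raise ValueError("expected a <price> element")
    return float(node.text)
```

If neither path finds the element, the function raises, which matches the "if I can't find the information I need then it's a problem" stance.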

I found a one-pass top-down "XML parser". Not like a proper SAX parser, no... the XML had to be specially formatted, almost like TOML.

    <whatever> <--- Parser parsed this as a new section
    <foo attrib=bar> <--- All "XML" had to be one-per-line
    </whatever> <---- parser ignored this
It was nominally an "XML parser", but really it was just a linear one-pass parser that tricked me into thinking the input was XML.

So really, it was more like TOML (or .INI files) than like XML. But I guess the advantage of making it "bastard XML" instead of TOML is that maybe this worked with XML-editors or something. I dunno...
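
A rough sketch of what such a line-oriented reader might look like in Python. This is a guess at the format based only on the three example lines: the rule that a tag without attributes opens a section, and the names `parse_config`, `OPEN_TAG` and `ATTR`, are assumptions, not the original project's code.

```python
import re

# One tag per line; attribute values are unquoted, as in the example.
OPEN_TAG = re.compile(r"<(\w+)([^<>]*)>")
ATTR = re.compile(r"(\w+)=([^\s>]+)")

def parse_config(text):
    """Single linear pass: no tree, no nesting, closing tags ignored."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("</"):
            continue  # blank lines and closing tags carry no information
        m = OPEN_TAG.fullmatch(line)
        if m is None:
            raise ValueError(f"not a one-tag-per-line entry: {line!r}")
        name = m.group(1)
        attrs = dict(ATTR.findall(m.group(2)))
        if not attrs:
            # Assumption: a tag with no attributes starts a new section.
            current = sections.setdefault(name, {})
        elif current is None:
            raise ValueError(f"entry outside of any section: {line!r}")
        else:
            current[name] = attrs
    return sections
```

Fed the three example lines (without the arrows), this returns `{"whatever": {"foo": {"attrib": "bar"}}}`, which is about as much structure as an .INI file would give you.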

I’ve written an XML parser like this for a toy project. I passed the XML through a prettifier first so that it was in a standard format. It wouldn’t work for every XML file, but it worked on the files that I needed it for.

I have also had success searching through html files with grep after passing them through a prettifier. It’s ugly, but 90% of the time, it works every time!

The specific project I saw that did this had ALL of its configuration files in this "Bastard XML" format.

It's ugly, but when 100% of the files you interpret match the one-per-line (and other clearly made-up rules), then it works 100% of the time!!

When I was working on a link parser in Python for a crawler, I had two choices:

1. use some form of regex

2. use libxml and find links

Option 1 was faster than option 2 by a factor of two.

Would a link-only parser have been faster? Yeah, probably, but it is much more complex to write.
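
To make the two choices concrete, here is a sketch of both in Python. It uses the standard library's `html.parser` as a stand-in for libxml so the example is self-contained; the regex, the class name `LinkCollector` and both function names are illustrative, not what the crawler actually used. No timing claim is made here, only a look at what each approach returns.

```python
import re
from html.parser import HTMLParser

# Option 1: a regex that only understands quoted href values on <a> tags.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def links_regex(page):
    return HREF_RE.findall(page)

# Option 2: let a tokenizing HTML parser find the attributes.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_parser(page):
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links
```

On a page that contains `<a href=/c>` (unquoted value), the regex version silently skips that link while the parser version returns it, which is the kind of accuracy gap a crawler may or may not care about.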

Maybe I'm confused about the problem you're solving but regexp was faster than just:

    from lxml import html
    # Parse the response body into an element tree.
    doc = html.fromstring(response.content)


Yes exactly. I'm not even sure XPath was faster than iterating over the a tags. But I would need to check on that one.

What was the average difference in accuracy?

It's a fair tradeoff especially for a crawler where it's never guaranteed to reach all documents anyway.

In my experience, a regex can be a much more robust solution than an XML parser depending on the use case. Back in the days when I did some webscraping I often had parsers throwing exceptions because of invalid HTML. More often than not I switched to regular expressions, which always worked out flawlessly.

Yes. HTML isn't XML.

As you discovered, HTML in the real world is allowed to be malformed in various ways, while XML is not. A compliant parser MUST barf on various kinds of malformed text. (Search for "fatal error" in https://www.w3.org/TR/REC-xml/ to verify.) This makes XML parsers inappropriate for HTML parsing.

Interestingly, even perfectly valid HTML may not be valid XML. To see that consider <b>this <i>example</b> carefully</i>.
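
The difference is easy to see with Python's standard library. In this sketch the snippet is wrapped in a made-up `<root>` element so the XML parser has a single root; `html.parser` is only a tokenizer, not a full HTML5 tree builder, so it reports the tags in order without complaining about the nesting.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<b>this <i>example</b> carefully</i>"

def xml_accepts(text):
    """True if a conforming XML parser accepts the text as element content."""
    try:
        ET.fromstring(f"<root>{text}</root>")
        return True
    except ET.ParseError:
        return False  # mismatched tags are a fatal error in XML

class TagLogger(HTMLParser):
    """Records start/end tags in the order they are seen."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

def html_events(text):
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.events
```

`xml_accepts(SNIPPET)` is false, while `html_events(SNIPPET)` simply lists `b`, `i`, `/b`, `/i` and leaves it to the caller to decide what the overlap means.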

"always flawless" is a bold claim

It's hard to imagine any web scraping being described as "flawless". But there's a whole lot of room for "worked fine for my purposes."

I think that it's actually a pretty great example of a case where capturing data from HTML may not be best modeled as a parsing task. You might not need a whole parse tree just to match some pattern and grab an associated string. And skipping the parse may enable you to get useful data out of a file that technically can't be parsed due to syntactic errors. It's a fairly classic precision/recall tradeoff situation.

"I've never lost playing Russian Roulette"

All you need to parse HTML is regular expressions (to recognize tags) and a stack (to match tags).

Your programming language has a stack -- a call stack.

So in practice all you really need is regular expressions. (Which I tend to call "regular languages" to make a distinction with Perl-style regexes [1], although they work fine too in practice for this case)

Using the call stack in a more functional style is nicer than using the OOP style that's in the Python standard library, which is probably inherited from Java, etc.

I have done this with a bunch of HTML processors for the Oil blog and doc toolchain:


It works well in practice and is correct and fast.

Big caveat: this style is only for HTML that I generate myself, e.g. the blog and docs. There are a bunch of rules around matching tags in HTML5 that are subtle. Although one of the points here is that you don't have to do a full DOM-style parse and can ignore those rules for many useful cases.

The other caveat is that HTML has a bunch of rules for what happens when you see a stray < or > that isn't part of a tag. This style makes it a hard syntax error, so it's really a subset of HTML (which has no syntax errors). For my purposes that is a feature rather than a bug, basically following Postel's law.

I meant to write a blog post titled "why/when you can parse HTML with regexes" about this but didn't get around to it.

There is also a caveat where parsing arbitrary name=value pairs with regexes isn't ergonomic, because it's hard to capture a variable number of pairs. However the point is that I wrote 5 or 6 useful and compact HTML processors that don't need that. In practice when you parse HTML you often have a "fixed schema".

Concrete examples are generating a TOC from <h1>, <h2>, etc. and syntax highlighting <pre><code> blocks. Those all work great with the regex + call stack style.

[1] http://www.oilshell.org/blog/2020/07/eggex-theory.html

edit: for completeness, another caveat is that the stack-based style is particularly bad for C++/Rust and arbitrary input because you could blow the stack, although we already limited the problem to "HTML generated ourselves"
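
A small sketch of the "regex to recognize tags, call stack to match them" style in Python. This is not the Oil toolchain's code; `lex`, `parse_children` and `ParseError` are names invented here. It assumes every open tag has a matching close tag (no void elements such as `<br>`), and it treats a stray `<` or `>` as a hard error, as described above.

```python
import re

# A token is an open tag, a close tag, or a run of text without < or >.
TOKEN_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>|([^<>]+)")

class ParseError(Exception):
    pass

def lex(text):
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ParseError(f"stray angle bracket at offset {pos}")
        pos = m.end()
        if m.group(4) is not None:
            yield ("text", m.group(4))
        elif m.group(1):
            yield ("close", m.group(2))
        else:
            yield ("open", m.group(2))

def parse_children(tokens, parent):
    """Consume tokens up to the close tag for `parent`.

    Each nested element is handled by a recursive call, so the Python
    call stack plays the role of the tag stack.
    """
    children = []
    for kind, value in tokens:
        if kind == "text":
            children.append(value)
        elif kind == "open":
            children.append((value, parse_children(tokens, value)))
        elif parent is None:
            raise ParseError(f"unexpected </{value}>")
        elif value != parent:
            raise ParseError(f"expected </{parent}>, got </{value}>")
        else:
            return children
    if parent is not None:
        raise ParseError(f"missing </{parent}>")
    return children

def parse(text):
    # lex() returns a generator, so every recursion level shares one cursor.
    return parse_children(lex(text), None)
```

For example, `parse("<ul><li>a</li><li>b</li></ul>")` yields a nested list of `(tag, children)` tuples, and overlapping tags or a bare `<` raise `ParseError` instead of being repaired.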

What you're describing is using regex to _lex_ html, not parse it. The parser is easy to build once you have a lexer (after all most of XML's unpleasantness comes from escaping). It still isn't the same thing as parsing with regular expressions.

Similarly you can't parse S-expressions with regular expressions, but if you have the lexer (e.g. with `lex` or other languages' equivalent) then the parser on top of it becomes absolutely trivial.
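
The S-expression case fits in a few lines of Python, which shows the split nicely: the regex only produces tokens, and the recursion does the actual parsing. This is a toy sketch (no string literals, no quoting), and `tokenize` and `parse` are just illustrative names.

```python
import re

def tokenize(src):
    """Lexer: the regex splits the input into parens and atoms."""
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    """Parser: read one expression from the front of the token list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))  # nesting handled by recursion
        if not tokens:
            raise SyntaxError("missing )")
        tokens.pop(0)  # drop the closing paren
        return items
    if tok == ")":
        raise SyntaxError("unexpected )")
    return tok
```

`parse(tokenize("(define (sq x) (* x x))"))` returns `["define", ["sq", "x"], ["*", "x", "x"]]`. The regex never sees the nesting; it only has to know what a token looks like.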

Right exactly, in fact I call the library "lazylex" because it only does the minimum to find the <> structure.

If you want to recognize more, then you invoke an attribute "name=value" lexer on the tag, but usually you don't. This makes it quite fast (and speed is useful because most doc toolchains are slow).

The lexing is the "generic" part and parsing is specific to the task at hand.

In the case of making an HTML table of contents, you literally just find <h1>, <h2> with regexes, and don't do all this DOM nonsense. It's easier to write than the SAX style with an explicit stack.

So yes the point is that parsing HTML is trivial for most purposes: use the call stack. It helps if your language has exceptions to indicate errors.
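
For the table-of-contents case, a sketch along these lines may be all that is needed. It is not the lazylex code; `HEADING_RE` and `table_of_contents` are names made up here, and it assumes headings are well formed and not nested inside each other, which is reasonable for HTML you generate yourself.

```python
import re

# <h1>..<h6>, optional attributes, and the matching close tag via \1.
HEADING_RE = re.compile(
    r"<h([1-6])(?:\s[^>]*)?>(.*?)</h\1>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")

def table_of_contents(html_text):
    """Return (level, title) pairs in document order."""
    toc = []
    for m in HEADING_RE.finditer(html_text):
        level = int(m.group(1))
        # Drop inline markup such as <code> from the heading text.
        title = TAG_RE.sub("", m.group(2)).strip()
        toc.append((level, title))
    return toc
```

There is no DOM and no explicit stack here; the only "parsing" specific to the task is pairing each heading's open tag with its close tag.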

Yes, HTML is a deterministic context-free language, so you can parse with a DPDA.

In addition, tokens are regular (as they are for many languages), so you can use a regex for tokenization.

All you need for writing a recursive descent parser with backtracking is a call-stack, so are all LL(k) and most practical CFGs also parseable with regex?

I am not sure if you can say that precisely -- I added the caveats about the rules for end tags, and the rules for stray < and >.

But certainly a useful subset of HTML can be parsed with a DPDA. (I'd be interested in more analysis of that; arbitrary tags are another factor)

It's a matter of opinion but I would say recursive descent is "nontrivial", whereas matching tags is "trivial".

Recursive descent involves some choice around lookahead or backtracking. It can be slow if you do the wrong thing, hard to debug, etc. It takes a little practice, and correspondence with a grammar is important.

Whereas matching HTML tags requires no lookahead, and I would say anyone can figure it out just with a simple code reading. Even the "inverted" OOP style is "simple", but annoying for me to read and correctly modify. The call stack reads much better.
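
For comparison, the same tag matching with an explicit stack, which is roughly the pushdown-automaton view: the regex recognizes tags, and the stack is the only memory. A sketch under the same assumptions as the rest of this subthread (every open tag has a close tag, no void elements); `tags_balanced` is an illustrative name.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")

def tags_balanced(markup):
    """Check that open and close tags nest properly, with no lookahead."""
    stack = []
    for m in TAG_RE.finditer(markup):
        is_close, name = m.group(1), m.group(2).lower()
        if not is_close:
            stack.append(name)          # push on open tag
        elif not stack or stack.pop() != name:
            return False                # close tag does not match the top
    return not stack                    # leftover opens mean unclosed tags
```

Each tag is looked at exactly once and the decision depends only on the top of the stack, which is why this feels trivial next to recursive descent for a general grammar.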

I've seen many people complain about StackOverflow, but this is the best example I've ever encountered.

The question: How to match

  <a href="foo">
The answers: rants about how RegEx is not suitable for parsing entire HTML.

Only the 5th answer starts to actually answer the question.

In all fairness, he does provide hints as to why regex will not work (namely: HTML is not a regular grammar).

Sure, it's somewhat obscured by the humorous rant, but it's not that bad an answer, either.

More to the point: I'm not sure I want to suck the humor out of everything. I agree that SO has problems, but humor and poetry are worthwhile things in otherwise serious places. It's all about quantity.

“You’re asking the wrong question” is a valid response.

I disagree, I see it as saying "you don't know what you really want, but I can read your mind". It's disrespectful and not giving the benefit of the doubt.

>It's disrespectful and not giving the benefit of the doubt.

Unfortunately, a good number of users who post questions on StackOverflow have not earned the benefit of the doubt. Browsing the site, you will occasionally come across questions which are the tech equivalent of asking "Which screwdriver is the right size to stick in this electrical socket?"

Frame challenges are a necessary part of learning, so they belong on a Q&A site. If a user doesn't want their problem to be challenged, the onus is on them to clarify in the question why their particular approach is the necessary one. It's only possible to respond with alternative solutions when the problem is not specified enough.

> Which screwdriver is the right size to stick in this electrical socket?

Note that this is a legitimate technique in UK sockets.

The live and neutral pins have a little gate over them that is retracted when you insert the earth pin, so you need to first stick a screwdriver into the earth pin in order to get your fingers into the live pin.

I can't parse if this is humour or a mistake. Putting your fingers on the live pin is not a great idea; trying this to get a euro 15 plug into a UK socket is also not great, but in a different category.

Well, there's also the mains tester screwdriver, which is a legit tool that you stick into a socket; you also become part of the electric current loop so that the light on it lights up.

Good points.

I'm not so sure benefit of the doubt must be earned. More like, any participant in a discussion forum must show it when answering, and do proper research before asking anything. If all questions are good questions, there's no problem. But, as you say, they really aren't. I think poor questions should be downvoted with a brief explanation instead of trying to answer the "real" question. Or moved to a Frame challenge forum.

Are we trying to answer the question or to solve the problem?

> do proper research before asking anything.

Asking on SO is itself research. It is good to review the existing literature before taking contributors' time, of course, but if the problem is not solved in the existing literature, then perhaps the framing issue isn't addressed by the existing literature either. In that case how could the learner know the best way to frame the problem in advance?

> I think poor questions should be downvoted with a brief explanation instead of trying to answer the "real" question. Or moved to a Frame challenge forum.

This precludes the possibility that some contributors might want to address the framing problem, whereas others might want to address the specific question as asked. They may have different opinions about whether it is framed wrong at all. It also means the OP is losing karma or getting penalized for no fault of their own.

The problem is, the answers are useful to more than just the original questioner. Sure, the questioner may be doing things vastly wrong - but the people who land on that question's page via search may have legitimate reasons for doing things a certain way.

The silent majority of viewers will benefit from an answer that does both of (1) explaining why the answer is probably not what is wanted, and (2) answering the initial question _as written_ anyway, for future viewers.

Then those other viewers will either benefit in the same way from the frame challenge as a learning experience, or they will have a sufficiently-specific problem that they can ask their own questions with more justification for taking a specific approach.

Answering the question as written has the risk that any solution will be blindly applied without appreciating why the approach itself should be avoided. This is especially true for those users who see SO as a "write my code" site, and copy-paste anything in backticks.

Strongly disagree. The point of SO is for experts to answer questions. They've learned things the hard way and would like to help others do better. They're not being paid. As such, telling the questioner that their whole approach is wrong is appropriate and even preferable.

From what I've heard Jeff Atwood and Joel Spolsky had different views on this and Spolsky's more tolerant, "no such thing as a stupid question" approach won out within the company, but is less popular among the people who write answers.

I don't think it is disrespectful to suggest someone is falling victim to the XY problem.

Actually I think it is a common and expected outcome that when investigating a new problem, we often get stuck in "XY problem" traps while researching the solution.

I very much value any feedback that suggests I should rethink the entire problem with a simpler model, because without experience it's hard to know what the simplest models are.

Absolutely agree. In my experience this is one of the more valuable features of asking someone to discuss a problem I'm mired in. Because they haven't been looking under every rock and studying the bark of every tree like I have, they're very likely to quickly see when I've wandered into entirely the wrong part of the forest.

unfortunately sometimes people who ask questions are really junior, and need to be told they are going to have an unpleasant surprise if they go down the path they are planning on going.

sometimes people who ask questions know the pitfalls but don't clarify that they know adequately because they are pressed for time. in this case those people unfortunately run the risk of being talked down to and they should accept that.

on the other hand if they have clarified adequately that they know what they're doing and they still want to do something that might seem weird then I agree it is disrespectful. Which is a thing you see often enough on StackOverflow to be notable.

Maybe so, but what about the non-junior person who needs to do something weird for an actual valid reason and stumbles on the refusal to answer the question years later? StackOverflow answers aren’t just for the original asker.

I think in that case - the new person should probably post a new question.

The point is that the original question - as framed - was better served by saying "if you go back a step and reexamine your assumptions, you'll find there is a better path to your intended goal".

The new person has a different goal or a different set of constraints.

Because asking new questions and getting them closed as duplicate because they sound vaguely similar to an existing question is sooo helpful an experience...

Yeah - but I'm just playing with hypothetical ideal cases here. "Annoying flawed habits of Stack Overflow moderators" isn't something that's on my list of things I'm thinking about. ;-)

EDIT - which got me thinking. Maybe the "correct" thing to do is answer the original question as asked but gently point out to the person asking it that there is probably a better solution for them if only they had asked a different question.

The original question still stands and has an answer useful for other people. The original questioner has the opportunity to learn and ask the question they should have asked in the first place.

It's going to be annoying for someone - so it should at least be the person that kicked things off in the first place.

> It's disrespectful

How do you respectfully tell someone you think they are mistaken? I'd rather not be pussyfooted around by someone if I'm in the role of "person who has asked a question based on a faulty assumption". Don't be rude but don't avoid trying to answer truthfully to the best of your ability.

> How do you respectfully tell someone you think they are mistaken?

How about "you're mistaken"?

The problem is with "You don't know what you're talking about, but I do, so let me answer your real question".

The wording used was "You're asking the wrong question", not "You don't know what you're talking about".

I find that perfectly fine. It was slightly disingenuous to reword it.

>It's disrespectful and not giving the benefit of the doubt.

So what?

If OP actually knows that this is a bad approach, then OP will clarify that he's aware and yada yada.

What's the problem? Lack of thick skin?

Yes, but I wish people who like to assume and answer saying so would still answer the question they think is wrong. Context matters and I don't think you can determine that with certainty nearly as often as some people online like to think.

In this case the answer is correct given the parameters of the question: There is no way to have a regex that only matches the things which OP wants to match, but not any of the things OP doesn't want to match.

Given a specific situation, like a particular page or something, sure, regexes are still a possibility for solving the problem. The 2nd highest answer on the page details exactly that. So what is the problem? Is every single contributor obligated to artificially entertain the OP's preconceptions before giving the advice which they believe actually helps best? For example, if I were knowledgeable about XML but not regex, should I just not contribute in such a situation?

Do you educate people about the complexity of the physics and bureaucracy involved with defining the current time every time someone asks you "what time is it?" Or do you avoid going onto irrelevant tangents that get you labeled as crazy and just tell them the current time?

What time is it isn’t an invalid question. “How do I make my hamster grow wings and fly?” is. How to parse HTML with a RegEx is an in-between. For a specialized case, why not? Answer that question, then provide a counter example to show how it will be very fragile, then explain the theory, then show a better way. IME that tends to work better to teach someone what you think they should know.

>Do you educate people about the complexity of the physics and bureaucracy involved with defining the current time every time someone asks you "what time is it?"

Maybe you're (inadvertently) making a caricature by using a simple "what time is it?" question but many user questions are under-specified.

Because of that, Stackoverflow answerers in particular do go into the extra complexities because it's part of its editorial DNA to restate the q&a so it's a high-quality community knowledgebase instead of just answering the direct question as stated. I tried to explain this hard-to-grasp nuance previously: https://news.ycombinator.com/item?id=21115438

But sometimes, this X-Y problem editorializing mechanism gets so enthusiastic that it can detract from a correct answer. Here's a famous example of a string bytes extraction question with smart people arguing with the correct answers from user541686 (was Mehrdad) and Michael Buen:

+ correct answer has lots of X-Y pushback in the comments: https://stackoverflow.com/questions/472906/how-do-i-get-a-co...

+ another correct answer from Buen that emphasizes user541686/Mehrdad works for broken unpaired surrogates: https://stackoverflow.com/questions/472906/how-do-i-get-a-co...

The meta layer issue is that the question is underspecified which causes 2 sides with very intelligent people arguing whether or not it's an X-Y problem!

I think the top answer in your example is highly misleading and deserves to have the caveats highlighted more clearly even though it's not "wrong". It is saying, "you don't need to worry about encoding", but really the point it is proving is "if you just use ONLY toCharArray and BlockCopy on ONLY one system and framework version then you can be sure they always use the same encoding as one another, so in that situation you don't need to worry".

So, the solution works, but only in specific situations which are not clearly explained and might be totally unrelated to OP's situation, and furthermore it doesn't really address the second part of OP's question "why take encoding into consideration?" I wouldn't necessarily call the problems with that answer just "XY pushback".

Only when you know enough about the person's context to be able to tell them what question they should be asking instead.

If you don't have that context, then the correct thing to do is to ask for more information, or say, "did you consider this", or find some other way to come up with a constructive response. You don't just assume you know what the person really wants to do and then try to mansplain it to them.

Really, it depends on the context. You might be aware that’s not something to generally do and still want to know the answer to the actual question.

But not a valid answer. That's what comments are for.

If the question were about full validated parsing of HTML with a regex, then I'd agree that "You can't do that" might be part of a valid answer. But finding tags is not doing a full validating parse.

Note that the set of valid C programs is not a context-free language. Yet it's common to use a context free-based approach to parsing. You just add additional code to handle the context-sensitive aspects (such as a symbol table).

I find this type of answer infinitely more palatable than "your question is answered here" or "comments are not for extended discussion, this conversation has been moved to chat"

According to the post, the more important part of the question is "what do you think", to which "I think you shouldn't, because..." is a good answer.

Yes, I have been down-voted and scolded for answering a question as literally described simply because others would rather assume the ignorance of the questioner. Yes I know people will often ask a question due to not understanding what they are doing, but when 10 other people have already responded with "don't do it that way!" I think it can be useful to actually answer the question as stated (if possible).

Actually, if you read his rant all the way to the end he does offer a helpful suggestion:

> Have you tried using an XML parser instead?

Except he's wrong in this case. The OP could use a regex in this specific scenario.

Not true. The questioner has not provided anywhere near enough detail to determine if regular expressions are sufficient. For example: should <br> match, or not? Its semantics are identical to <br />. To determine if regular expressions are enough, you would need to know exactly what markup you’re dealing with, and that has not been provided.

Yeah. I guess in other parts of this discussion I'm arguing for always probing hidden assumptions and missing background whilst here I'm saying "let's interpret the question in the most charitable way possible".

Plus - Stack Overflow is about trying to generalize any given question to maximize its wider usefulness.

> Plus - Stack Overflow is about trying to generalize any given question to maximize its wider usefulness.

Since when? You don't get extra points if you write stuff that doesn't concern OP's problem. Most SO problems don't go viral and get lots of upvotes from other people. From a game theory perspective, it doesn't make sense to add more to an answer than is needed to make it the accepted one.

If you have slightly different constraints you are encouraged to open another question. Discussions are frowned upon and sometimes even interrupted by admins so you can't discuss if your situation is different from OP's situation and so could warrant a different answer.

This has always been the intent of Stack Overflow, from its very earliest days. One of the stock reasons for closing a question used to be “too specific—this question is unlikely to help anyone else” (or words to that effect), though that has been removed now (I think because it upset too many people who took its blunt message the wrong way). People have always been nudged towards adjusting questions so that they’ll be generally useful.

Discussions on questions are routinely about unrelated or not-closely-related matters, and quite apart from that Stack Overflow wants to be a Q&A platform, not a discussion platform.

This pops up every so often, and it sort of irritates me every time. Partially because it's overly simplistic, but, even more so, because, while it's cute and humorous, it's not actually very good advice and it doesn't actually answer the question. No, you can't parse html with regex. But go look at the question. The author is just trying to detect some tags. That's not exactly parsing.

It's true that there are some complications around things like "What if > appears in an attribute's value?" If you know your input well enough, or you don't need perfection, that might be a problem you can ignore. Alternatively, you can still use regex, if you use a sufficiently powerful regular expression tool. .NET's regular expressions, for example, have a concept of balancing groups that will let you do this.
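
To illustrate the `>`-inside-an-attribute complication, here is a sketch in Python with two patterns of my own: a naive one that stops at the first `>`, and one that skips over quoted attribute values. Python's `re` module has no balancing groups, so this is the "know your input well enough" route rather than the .NET feature mentioned above, and it still ignores comments, CDATA and unterminated quotes.

```python
import re

# Naive: a tag ends at the first ">" no matter where it appears.
NAIVE_TAG = re.compile(r"<[a-zA-Z][^>]*>")

# Quote-aware: inside a tag, consume whole "..." or '...' values as units,
# so a ">" inside a quoted value does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>""")

def first_tag(pattern, markup):
    m = pattern.search(markup)
    return m.group(0) if m else None

sample = '<a title="x > y" href="foo">link</a>'
naive_result = first_tag(NAIVE_TAG, sample)
careful_result = first_tag(QUOTE_AWARE_TAG, sample)
```

With this sample the naive pattern cuts the tag off at `<a title="x >`, while the quote-aware one returns the whole opening tag. The three alternatives inside the group start with different characters, so the pattern should not be prone to catastrophic backtracking.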

I would also point out that a lot of open source HTML parsing libraries are even more dangerous than regular expressions for parsing unknown HTML, because they use recursive descent. Where you have recursion, you have the potential for a stack overflow. With a regex library, you do have to be careful about catastrophic backtracking, but that's at least something you can usually handle in your own code, or, in the worst case, defend against with timeouts.

A parser that's capable of blowing the call stack, and has been exposed to input from the Internet, though, is capable of taking down your process in a way you can't defend against in most languages. And it's difficult to patch up a parser like that without more-or-less rewriting it. I absolutely have had to deal with html handling code getting into situations like that in the past. Malicious input is real. So is plain old bad input. Reading the code before you use it is often a good idea.
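
A toy demonstration of the stack-depth problem, in Python. The functions below are invented for this sketch and only measure nesting depth of bare tags; the point is that the recursive version needs one call frame per nesting level while the iterative one does not. In Python the failure shows up as a catchable `RecursionError`; in languages where the native stack overflows, the process typically just dies.

```python
import re

TAG_RE = re.compile(r"<(/?)\w+>")

def tokens_for(markup):
    """Lex bare tags into "open"/"close" tokens (attributes not handled)."""
    return ["close" if m.group(1) else "open" for m in TAG_RE.finditer(markup)]

def max_depth_recursive(tokens, pos=0):
    """Recursive descent: one Python call frame per nesting level."""
    deepest = 0
    while pos < len(tokens) and tokens[pos] == "open":
        inner, pos = max_depth_recursive(tokens, pos + 1)
        deepest = max(deepest, inner + 1)
        pos += 1  # step over the matching close tag
    return deepest, pos

def max_depth_iterative(tokens):
    """Same answer with a counter instead of the call stack."""
    depth = deepest = 0
    for tok in tokens:
        if tok == "open":
            depth += 1
            deepest = max(deepest, depth)
        else:
            depth -= 1
    return deepest

# Hostile input: 100,000 nested elements, far beyond the default
# recursion limit of a stock CPython interpreter.
hostile = tokens_for("<div>" * 100_000 + "</div>" * 100_000)
try:
    max_depth_recursive(hostile)
    recursive_survived = True
except RecursionError:
    recursive_survived = False
```

On ordinary pages both functions agree; on the hostile input only the iterative one finishes, which is the kind of thing worth checking before pointing a recursive parser at the open Internet.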

I once tweeted some related quizzes. Can you guess the parse trees (or reserializations) for the following HTML fragments without invoking browsers? Assume that everything gets pasted right after the document body.

    1. <a b="42>c">d
    2. <a/b/c=d/e>f
    3. <a/="42>b
    4. <a x=&amp0>&amp0</a>
    5. a<!--->b<!--+->c<!-->d
Really, don't try to answer and just use compliant HTML parsers.

What kind of HTML parser? An SGML one or an HTML5 one?

I'm really sad that they didn't go with an XML base for HTML5.

> I'm really sad that they didn't go with an XML base for HTML5.

I'm really sad that they didn't evangelise an XML base for HTML5, and that many HTML5-ish tools don't explicitly support XML, but it's not strictly true that they didn't go for an XML base for HTML5[0][1]

[0] https://html.spec.whatwg.org/multipage/xhtml.html

[1] https://www.w3.org/TR/html-polyglot/

That's just HTML5 rewritten as a well formed XML document. HTML5 does not describe well formed XML.

That you can put the same information from an HTML5 document into an XML document doesn't help much if most of the HTML5 documents out there are not polyglot.

But the point is that XML syntax is still a thing, supported by all browsers (and it’s reasonable to expect that support to remain as long as HTML remains). See also https://html.spec.whatwg.org/multipage/introduction.html#htm...:

> When a document is transmitted with an XML MIME type, such as application/xhtml+xml, then it is treated as an XML document by web browsers, to be parsed by an XML processor. Authors are reminded that the processing for XML and HTML differs; in particular, even minor syntax errors will prevent a document labeled as XML from being rendered fully, whereas they would be ignored in the HTML syntax.

HTML does use an XML base (elements, attributes, namespaces, &c.), it just doesn’t use an XML parser most of the time. But the XMLness is easily observed in various DOM APIs.

It doesn't help from a client perspective, but depending on your page delivery pipeline could be of potential help for some from a server perspective.

> What kind of HTML parser? An SGML one or an HTML5 one?

I intended the latter. In fact I'm a bit surprised to be asked this at all; I thought "HTML" nowadays exclusively refers to HTML5...

I would have expected the former, given that one of the rationales for HTML5 was "simpler parsing". I'm obviously not up to date with HTML5 parsing.

But why should I? Who writes HTML by hand these days?

The SGML heritage of HTML 4.01 and earlier led to some gruesome legal constructs that look surprisingly similar to your examples. Looks like every generation has to make their own mistakes.

I get why there are so many complaints about the top answer. At the same time one should appreciate its literary qualities in regard to structure and style.

You can't in the general case. But you can in lots of typical cases.

Actually real world HTML usually can't be parsed by any strict parser, as it's not valid. It's just a machine-generated text which pretends to be similar to HTML. So extracting some bits of information with regexes often works.

I believe you really meant that you are frequently dealing with HTML whose structure is already known in advance, not general HTML. Because...

> [...] real world HTML usually can't be parsed by any strict parser [...]

There is the literal standard for parsing HTML [1]. Any conformant implementation (and there are plenty) can of course parse the real world HTML by definition. Just that you don't always need the full HTML parser to do your job.

[1] https://html.spec.whatwg.org/multipage/parsing.html

I think this is only true for HTML5, but previous versions of HTML supposedly weren't specced well enough to write a perfect parser. Fixing this was one of the goals of the HTML5 revision if I'm not mistaken.

Previous versions of HTML were based on SGML. You can write a perfect SGML parser, but the developers of web browsers couldn't be bothered.

“Worse is better” strikes again. In the early days of markup hand written by amateurs, users preferred the browsers that tolerated mistakes. Now the language is defined by a common error handling procedure, instead of a grammar.

Web browsers exist, therefore it's possible to parse HTML, even earlier versions.

You're right that the de jure spec did not match de facto HTML, and browsers didn't necessarily agree with each other. But that's always true. GCC has language extensions that aren't part of the C spec, but you wouldn't say that C is impossible to parse. Old HTML may have taken it up to 11, but it's not fair to say it's impossible to parse.

No, not really. The browsers did guess a lot and did standard-deviating parsing because the typical uses were wrong and they had to work. Nobody would switch to a new browser that doesn't work with existing pages.

Modern example - mXSS. Even though modern html has to be valid xml, the browser will, instead of giving an error when served invalid html, transform what's given to make it standard-compliant.

Modern html by definition is not valid XML, unless you are using the xml serialization of html5, which isn't really recommended and nobody does.

Really no version of the official html spec was valid xml other than XHTML which was never particularly popular.

But I don't really see your point. An implementation having a different idea of how to parse html than you think is correct is not the same thing as something being unparsable. It's a tautology that if there exists a computer program to parse something, then it is possible to parse it with a computer program.

Perl is famously unparseable: it’s impossible to determine a parse tree without executing the code.

(HTML, however, was never unparseable, merely insufficiently defined.)

You are correct, but I don't think they are even slightly relevant in 2021.

> There is the literal standard for parsing HTML [1]. Any conformant implementation (and there are plenty) can of course parse the real world HTML by definition.

I believe GP was alluding to the fact that many actual resources that declare themselves HTML are not spec conformant, and thus can’t be parsed by a parser that only accepts valid HTML.

The distinction between "valid" and "invalid" HTML used to matter once upon a time, but it no longer does, at least for agents (authors can still benefit from error-free HTML because errors can distort their intent). Pretty much every string can be parsed as HTML since HTML5 and all errors are non-fatal, so many modern HTML parsers default to ignoring errors. There are parsers that can be configured to abort on any error, but I don't think the GP intended that.

True, and it's worth noting that since WHATWG HTML 5 has usurped HTML and taken it ad absurdum, WHATWG's parsing spec isn't actually useful nor representative at all of what people usually think HTML is. Nor do people have to follow WHATWG's (= bunch of Chrome developers) idea of HTML any more than WHATWG followed others'.

WHATWG’s HTML spec is the only thing that matters when considering what HTML is, because it’s what every browser uses, which is the primary target of HTML.

WHATWG is not a "bunch of Chrome developers", and if you want to understand what a browser does with HTML, it's the place to look. "HTML, but not the HTML web browsers mean" is a fairly niche concern.


HTML 5 is (as are 5.1, 5.1 second edition, 5.3) W3C

HTML Living Standard is WHATWG.

The html parser spec defines what every sequence of bytes should parse into. It defines certain such sequences as containing "errors", but it still defines exactly how they should be parsed. There is no invalid html. Every browser follows the spec, so every browser will parse the same html to the same thing. This is true even if the html contains "errors". The only checking most html receives is to make sure it renders correctly in a browser. If you are writing your own parser, you likely want it to do the same thing as every browser. In that case, you should use a parser that conforms to the spec.
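
A minimal sketch of what "errors are non-fatal" looks like in practice, using Python's standard-library `html.parser`. One loud assumption: this module is lenient about malformed markup, but it is an event-style tokenizer and not a full implementation of the WHATWG algorithm, so it illustrates the idea rather than the spec. `TagCollector` and `collect_tags` are names made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records start/end tag events; never rejects its input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(markup):
    # Hypothetical helper: feed the markup and return the recorded events.
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Mis-nested and unclosed tags: no exception is raised, events still arrive.
events = collect_tags("<p>unclosed <b>bold <i>oops</b> text")
```

Unlike a spec-conformant parser, this one does not repair the tree for you; it only shows that broken input still produces a usable event stream instead of an error.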

Exactly my experience on this. In a past life I've had to parse valid HTML that was generated by a forum system; the user submitted something akin to bbcode [b]this sort of thing[/b] that was pre-parsed and converted to valid HTML, and then I had to parse that fragment again after the fact.

Given the constraints it's entirely possible to parse (a subset of) irregular grammar with regular expressions. Asking a question along those lines on SO would have only elicited responses that I/someone was doing it wrong.

I won't argue that it was or wasn't the wrong thing to do, but you don't always get to pick your client.

Funny thing. Email addresses need a rant like this too. Yes, you can parse 99% or so with a regex, but like HTML or XML, you really need an email address parser. RFC 2822 was designed to be parsed using string processing (in C no less) and requires some complexity that most regexes fail on. Here's a discussion about using the simpler, older RFC (822) and regex: https://stackoverflow.com/questions/20771794/mailrfc822addre...

For most purposes, if you're trying to use parsing to achieve email address validation perfection, you've already lost the battle.

A valid email address typically isn't just a syntactically correct one; it's also one that can be used to get an email to the recipient. The only way to test that is to send an email and see if it gets to the recipient. Which is why it's much more common to see some minimal client-side validation that uses a simple regex that will (ideally) match all valid email addresses but only catch gross syntactic errors like typing # where you meant @, more for the sake of decent UX than anything else, and rely on asking the user to double-type their email address and sending an activation email to deal with finer-grained syntactic errors and the whole universe of non-syntactic errors.
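
A rough sketch of the "minimal regex, then send an activation email" approach in Python. The pattern and the name `looks_like_email` are assumptions for illustration, not a recommended standard: the check is deliberately loose, and it still rejects some exotic RFC-valid forms (for example quoted local parts that contain `@` or spaces).

```python
import re

# Deliberately loose: something, one "@", something, and no whitespace.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+")


def looks_like_email(text):
    # Only catches gross mistakes; real validation is the activation email.
    return LOOSE_EMAIL.fullmatch(text) is not None
```

Typing `#` instead of `@` is caught because no `@` is left, while an address that merely contains `#` in the local part still passes.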

# is a valid character in an email address. Regexes are a horrible idea to validate or parse emails and have been since the beginning. Most regexes will be too strict. The reason people use regexes on the client side is they copy pasta something they find on the internet. Incidentally, a great way to get emails delivered is to properly validate them. Sure typos are possible... Few things are a worse user experience than having a valid email rejected by a client side parser.

Yes, it's hopeless to try and parse arbitrary HTML with regexes (or with anything, really. I believe the full HTML5 parsing algorithm is not even type-2, it's more or less Turing-complete. In-the-wild HTML can also interact with scripting in all kinds of entertaining ways, to the point that a conforming HTML5 parser has to be able to execute JavaScript while parsing - and the JavaScript can inject additional tokens into the parser's input stream. It's possible to create an HTML5 document that only validates on every second Thursday of the month.)

However, if we know the document is well-formed(!) XHTML, shouldn't it be possible? This would mean the document is valid XML and XML was specifically designed to be regex-friendly, I believe.

At least, off the top of my head, the only gotchas that have to be accounted for are comments and CDATA sections - those may contain arbitrary unescaped text, including angle brackets. However, they also have unambiguous start and end markers and can't be nested, so a regex could account for them.

Attribute values should not be a problem, as angle brackets must be escaped inside those to be valid XML.

I'm not sure about processing instructions and doctypes though.

Yes that's basically what I wrote about here:


I listed 3 or 4 caveats. CDATA might be another one since I'm not handling those... I've never used it so I left it out.

Actually I remember that regular languages weren't entirely enough; I think the .*? non-greedy matching was useful, e.g. for finding --> at the end of comments.
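
For what it's worth, here is a small sketch of that idea in Python: a regex tokenizer for well-formed XML that uses non-greedy `.*?` to find the end markers of comments, CDATA sections and processing instructions. It assumes well-formed input and does not attempt doctypes with an internal subset; `TOKEN` and `tokenize` are names invented for this example.

```python
import re

# Order matters: comments/CDATA/PIs must be tried before generic tags,
# because they may contain unescaped angle brackets.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<cdata><!\[CDATA\[.*?\]\]>)"
    r"|(?P<pi><\?.*?\?>)"
    r"|(?P<close></[^>]+>)"
    r"|(?P<open><[^>]+>)"      # also matches self-closing tags like <br/>
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)


def tokenize(xml):
    # Returns (kind, raw_text) pairs in document order.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(xml)]


tokens = tokenize("<a><!-- <b> --><![CDATA[1 < 2]]>x</a>")
```

This only produces a flat token stream. Checking that tags nest properly still needs something beyond the regex itself, such as a stack.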

On my blog, I write my posts in markdown. After it's converted to HTML, I use regex to search and replace images (for high-res and alternative formats), and get the first paragraph (for a 'preview'). I've been doing this for years, so the 'never use regex over HTML' advice isn't holding up for me.
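
A sketch of that kind of workflow in Python, assuming the HTML comes from your own markdown converter and therefore has a predictable shape (plain `<p>` tags, double-quoted `src` attributes). The `@2x` naming scheme and the function names are hypothetical, just for illustration.

```python
import re

FIRST_P = re.compile(r"<p>(.*?)</p>", re.DOTALL)
IMG_SRC = re.compile(r'(<img\b[^>]*\bsrc=")([^"]+)\.(png|jpg)(")')


def preview(html):
    # First paragraph's inner HTML, or None if there is no <p>...</p>.
    m = FIRST_P.search(html)
    return m.group(1) if m else None


def use_high_res(html):
    # Rewrites src="name.png" to src="name@2x.png" (hypothetical scheme).
    return IMG_SRC.sub(r"\1\2@2x.\3\4", html)
```

This works because the input is a known, machine-generated subset of HTML. It would break on markup such as `<p class="x">` or single-quoted attributes.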

To me, this typifies working with technology and programming. Computer programs only ever look like they are working, because they have not encountered problem data or conditions.

Aka, it works on the happy path.

Software engineering is how we balance how much of the unhappy path and corner cases we take care of, and how we handle them, imo.

Well, as stated that particular answer is both right and wrong...

Yes, you can not use "true" regular expressions to parse recursive structures.

But the libraries that get used for regular expressions quite often include non-regular extensions (and confusingly call the resulting expressions still "regular").

Most notably, PCRE allows for recursive patterns via "(?R)". You can absolutely parse arbitrary HTML with it.

In fact you can parse anything with that, including binary formats. You just can't do it without recursively applying the same "regex" again and again...

And precise error handling is basically impossible without writing a proper lexer anyway, since your regex won't (can't, really) tell you where it was thrown off. It either works or doesn't, the "why" is left to the program to figure out...

My 5yo daughter can already write more or less ok-ish but she has big problems reading that back, especially when she makes spelling mistakes.

I feel more or less the same when I write regular expressions.

Counter argument: Oh Yes You Can Use Regexes to Parse HTML!


Discussion: https://news.ycombinator.com/item?id=26357237

Counter-counter-argument, one of the comments underneath that answer:

> what you have written is not really a regular expression (modern, regular, or otherwise), but rather a Perl program that uses regular expressions heavily. Does your post really support the claim that regular expressions can parse HTML correctly? Or is it more like evidence that Perl can parse HTML correctly?

Do the Halting Problem next.

I think we've all (mostly?) tried it. It really is the Wild West of the web when you're trying to parse other people's HTML, though.

I've played around with this parser which is extremely quick. https://github.com/lexbor/lexbor

You can't parse fully-general HTML with regex, but unless you're writing a web browser or something, that's not what you're trying to do; you're trying to parse the particular subset of HTML that happens to be emitted by this particular website that you got the HTML to be parsed from. And, much like the halting problem or integer factorization, despite the general case being difficult or impossible, the overwhelming majority of specific cases are easy.

Aren't html code highlighters using regex? Isn't vscode using TextMate regex for color highlighting?

Most code highlighters do. They’re normally close enough to accurate for highlighting purposes (though they will commonly have some uncommon constructs that they get wrong), but they tend to fall apart when you try to use that for much more; for example, indentation when you use regular expressions to parse your HTML tends to start falling apart if you take what XML users might consider “shortcuts” (such as omitting optional end tags).

Valid HTML allows optional end tags? For example?

https://html.spec.whatwg.org/multipage/syntax.html#optional-... describes which elements have optional start and end tags.

This document is valid:

  <!doctype html>
And is precisely equivalent to this document in the canonical serialisation:

  <!DOCTYPE html><html><head><title>Hello,</title>

`<script>` is a usual example that you can't self-close and that absolutely needs to be followed by `</script>` in HTML5.

In general though, a self-closing tag has no effect in HTML5 anyway; `<script>` is just an example where the usual heuristic specified by HTML5 doesn't help you at all (since it switches the lexer state).
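
To make the "switches the lexer state" point concrete, here is a small Python comparison. It assumes Python's standard-library `html.parser`, which treats `<script>` content as raw text but is not a full HTML5 parser; the naive pattern and the names `StartTags`, `parser_start_tags` and `regex_start_tags` are invented for this sketch.

```python
import re
from html.parser import HTMLParser

NAIVE_TAG = re.compile(r"<([a-z][a-z0-9]*)")


class StartTags(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parser_start_tags(markup):
    p = StartTags()
    p.feed(markup)
    p.close()
    return p.tags


def regex_start_tags(markup):
    # Has no notion of "inside a script", so it sees tags in string literals.
    return NAIVE_TAG.findall(markup)


page = '<script>document.write("<p>fake</p>");</script><p>real</p>'
```

The parser reports only the real `<p>`, while the naive regex also reports the one inside the JavaScript string.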

You can usually use regexes for tokenization, which is sufficient for syntax highlighting, but you generally can’t use regexes for parsing (nested structures).
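
A minimal sketch of that split in Python: the regex only tokenizes tags, and an explicit stack handles the nesting that a true regular expression cannot track. It assumes XML-like input with explicit end tags and no comments, CDATA or scripts; `TAG` and `is_balanced` are hypothetical names.

```python
import re

# Groups: optional leading "/", tag name, optional trailing "/" (self-closing).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")


def is_balanced(markup):
    stack = []
    for m in TAG.finditer(markup):
        closing, name, self_closing = m.groups()
        if self_closing:
            continue                      # <br/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False              # end tag without matching start
        else:
            stack.append(name)
    return not stack                      # leftover open tags are an error
```

The stack is the part that is not "regular": it is the memory needed to match arbitrarily deep nesting.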

Should 2011 (when answer was first provided) or 2009 (when question was posted) be added to the title?

Clearly it should be 20(?:09|11)

On the other hand, the inclusion of [X] in the title is more than enough to establish the historical setting.


This argument is common, and this is a good answer; but so often people aren't "parsing" XML but extracting a few bits of it and would have benefited from less cargo cult and more thought in the answer cited.

As it is, I've seen this article used to scare people away from "can i make a game behave differently?" efforts that would have been trivial to do and likely given these people a gateway "i can try to be a programmer" experience.

This answer is right. The original question isn't asking anything about parsing, they are trying to search for specific tags, which is a great use for a regex. Yes, it will probably return some invalid matches unless you go out of your way to filter out comments and script tags, but odds are you don't need that.

Applying regexes only really counts as "parsing" when you are matching against an entire document. Searching is not parsing. The same argument applies if you apply grep to an html doc - I wonder why there are no posts about "you can't parse HTML with grep". I apply grep all the time to source code...
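
As a rough illustration of "searching, not parsing", here is a Python sketch that finds opening tags while skipping XHTML-style self-closing ones, which is the kind of thing the original question asked for. The pattern and the name `find_open_tags` are assumptions for this example, and as noted above it will happily report tags that appear inside comments or scripts.

```python
import re

# Tag name, then anything up to ">", but the char before ">" must not be "/".
OPEN_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*(?<!/)>")


def find_open_tags(markup):
    # Returns tag names of opening tags, in order of appearance.
    return OPEN_TAG.findall(markup)


found = find_open_tags('<p><a href="foo">x</a><br /><hr class="foo" /></p>')
```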

You can't parse HTML with regex, but PCRE is not regex. I'm not sure if you can parse HTML with PCRE.

Where have all the comments gone?

(note : I mean the comments on StackOverflow, not the comments here in Hacker News ... )

If curious, past threads:

RegEx match open tags except XHTML self-contained tags - https://news.ycombinator.com/item?id=14942060 - Aug 2017 (6 comments)

You can't parse [X]HTML with regex - https://news.ycombinator.com/item?id=14155015 - April 2017 (1 comment)

Why you can't parse HTML with regex - https://news.ycombinator.com/item?id=9728474 - June 2015 (2 comments)

Can you parse html with regular expressions? - https://news.ycombinator.com/item?id=5264511 - Feb 2013 (2 comments)

Can regular expressions parse HTML or not? - https://news.ycombinator.com/item?id=5257535 - Feb 2013 (23 comments)

Regexes Parse XML Just Fine, Actually - https://news.ycombinator.com/item?id=3088402 - Oct 2011 (27 comments)

Oh Yes You Can Use Regexes to Parse HTML - https://news.ycombinator.com/item?id=2741780 - July 2011 (77 comments)

Why you should not parse (X)HTML with a Regexp - https://news.ycombinator.com/item?id=2423301 - April 2011 (5 comments)

Stackoverflow, HTML by Regex, topmost answer - https://news.ycombinator.com/item?id=1487695 - July 2010 (37 comments)

You Cannot Parse HTML with Regular Expressions - https://news.ycombinator.com/item?id=1274870 - April 2010 (7 comments)

You can't parse [X]HTML with regex. - https://news.ycombinator.com/item?id=941401 - Nov 2009 (1 comment)

I'm surprised that most of these are so small. There must have been others? (I excluded the boring or empty ones.)

> There must have been others?

Yog-Sothoth knows the gate. Yog-Sothoth is the gate. Yog-Sothoth is the key and guardian of the gate. Past, present, future, all are one in Yog-Sothoth. He knows where the Old Threads broke through of old, and where They shall break through again.

Would be interesting to know how many up and down votes HN is sending that answer.

The answer had 4440 upvotes and 27 downvotes at the time it was locked (click on it to reveal the breakdown, if you have sufficient SO "reputation").

The question is locked, so it cannot be voted on.

"This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention."

Stack Overflow can be remarkably humourless at times.

That is a defense against the humourless people that tried to edit the post down into a more objective answer years after the fact:


Yes, that was my point.

Ah I see, I misunderstood your comment as referring to the quote's clinical wording. Sorry!

Note that you cannot even vote on it, and it is marked CW, so it can doubly not give reputation. This reputation resentment makes me sad...

This never gets old.

Old, but gold.
