This. So much this. Yes, you can't parse arbitrary, unknown XML with regex. But I don't find myself parsing arbitrary, unknown XML very often. Usually I know exactly what I'm expecting and if I can't find the information I need then it's a problem. Regex parsing is perfect for this scenario - and much, much faster. I created a regex parser for Java that even handles namespaces and relative paths. Can it parse every XML file? No - you can't parse XML with regex. But I can parse everything I need to parse - and if I can't? I can always fall back on full-featured XML parsers.
<whatever> <--- Parser parsed this as a new section
<foo attrib=bar> <--- All "XML" had to be one-per-line
</whatever> <---- parser ignored this
So really, it was more like TOML (or .INI files) than like XML. But I guess the advantage of making it "bastard XML" instead of TOML is that maybe this worked with XML-editors or something. I dunno...
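For flavor, here's a minimal sketch (entirely hypothetical, not the actual parser) of what such a line-oriented "bastard XML" reader might look like in Python, assuming the made-up rules above: one tag per line, opening tags start sections, closing tags are ignored, and attributes may be unquoted:

    import re

    # Hypothetical sketch of the line-oriented reader described above:
    # one tag per line, open tags start sections, close tags are ignored.
    SECTION_RE = re.compile(r'^<(?P<name>\w+)(?P<attrs>[^>]*)>$')
    ATTR_RE = re.compile(r'(\w+)\s*=\s*"?([^"\s>]+)"?')

    def read_sections(lines):
        sections = []
        for line in lines:
            m = SECTION_RE.match(line.strip())
            if not m:  # closing tags and anything else simply fall through
                continue
            attrs = dict(ATTR_RE.findall(m.group('attrs')))
            sections.append((m.group('name'), attrs))
        return sections

    print(read_sections(['<whatever>', '<foo attrib=bar>', '</whatever>']))
    # [('whatever', {}), ('foo', {'attrib': 'bar'})]

Which really is just an INI reader wearing angle brackets.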
I have also had success searching through html files with grep after passing them through a prettifier. It’s ugly, but 90% of the time, it works every time!
It's ugly, but when 100% of the files you interpret match the one-per-line rule (and the other clearly made-up rules), then it works 100% of the time!!
1. use some form of regex
2. use libxml and find links
1. was faster than 2. by a factor of two
Would a link-only parser have been faster? Yeah, probably, but it's much more complex to do.
from lxml import html
doc = html.fromstring(response.content)
It's a fair tradeoff especially for a crawler where it's never guaranteed to reach all documents anyway.
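For concreteness, a rough sketch of the two approaches being compared (the regex is illustrative, not the one actually used):

    import re
    from lxml import html

    # Illustrative only: grabs quoted href attributes and nothing else.
    HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def links_with_regex(text):
        # Fast, but blind to unquoted attributes, comments, scripts, etc.
        return HREF_RE.findall(text)

    def links_with_lxml(content):
        # Slower, but recovers from malformed markup the way a browser would.
        return html.fromstring(content).xpath('//a/@href')

For a crawler that already tolerates missing documents, the regex version's false negatives are usually an acceptable cost for the speed.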
As you discovered, HTML in the real world is allowed to be malformed in various ways, while XML is not. A compliant parser MUST barf on various kinds of malformed text. (Search for "fatal error" in https://www.w3.org/TR/REC-xml/ to verify.) This makes XML parsers inappropriate for HTML parsing.
Interestingly, even perfectly valid HTML may not be valid XML. To see that consider <b>this <i>example</b> carefully</i>.
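A quick way to see the difference (assuming lxml is installed; its libxml2-based HTML parser recovers from the mis-nesting, though the exact fix-up may differ from a browser's HTML5 algorithm, while the XML parser must reject it):

    from xml.etree import ElementTree
    from lxml import html

    snippet = "<b>this <i>example</b> carefully</i>"

    try:
        ElementTree.fromstring(snippet)   # XML: fatal error on the mismatched </b>
    except ElementTree.ParseError as err:
        print("XML parser:", err)

    # HTML: the parser recovers by re-nesting the elements instead of failing.
    fixed = html.fragment_fromstring(snippet, create_parent="div")
    print(html.tostring(fixed))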
I think that it's actually a pretty great example of a case where capturing data from HTML may not be best modeled as a parsing task. You might not need a whole parse tree just to match some pattern and grab an associated string. And skipping the parse may enable you to get useful data out of a file that technically can't be parsed due to syntactic errors. It's a fairly classic precision/recall tradeoff situation.
Your programming language has a stack -- a call stack.
So in practice all you really need is regular expressions. (Which I tend to call "regular languages" to distinguish them from Perl-style regexes, although those work fine too in practice for this case.)
Using the call stack in a more functional style is nicer than using the OOP style that's in the Python standard library, which is probably inherited from Java, etc.
I have done this with a bunch of HTML processors for the Oil blog and doc toolchain:
It works well in practice and is correct and fast.
Big caveat: this style is only for HTML that I generate myself, e.g. the blog and docs. There are a bunch of rules around matching tags in HTML5 that are subtle. Although one of the points here is that you don't have to do a full DOM-style parse and can ignore those rules for many useful cases.
The other caveat is that HTML has a bunch of rules for what happens when you see a stray < or > that isn't part of a tag. This style makes it a hard syntax error, so it's really a subset of HTML (which has no syntax errors). For my purposes that is a feature rather than a bug, basically following Postel's law.
I meant to write a blog post titled "why/when you can parse HTML with regexes" about this but didn't get around to it.
There is also a caveat where parsing arbitrary name=value pairs with regexes isn't ergonomic, because it's hard to capture a variable number of pairs. However the point is that I wrote 5 or 6 useful and compact HTML processors that don't need that. In practice when you parse HTML you often have a "fixed schema".
Concrete examples are generating a TOC from <h1>, <h2>, etc. and syntax highlighting <pre><code> blocks. Those all work great with the regex + call stack style.
edit: for completeness, another caveat is that the stack-based style is particularly bad for C++/Rust and arbitrary input because you could blow the stack, although we already limited the problem to "HTML generated ourselves"
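A minimal sketch of the regex-lexer-plus-call-stack style being described (hypothetical names, not the actual Oil code, and only for HTML you generated yourself; anything surprising is a hard error):

    import re

    # Regex as the lexer; the call stack (recursion) does the tag matching.
    # (Void/self-closing tags would need their own case; omitted for brevity.)
    TOKEN_RE = re.compile(r'''
        (?P<comment> <!--.*?--> )
      | (?P<close>   </(?P<cname>\w+)\s*> )
      | (?P<open>    <(?P<oname>\w+)[^>]*> )
      | (?P<text>    [^<]+ )
    ''', re.VERBOSE | re.DOTALL)

    def parse_element(tokens, i, name):
        """Consume tokens until </name>; recursion is the only stack."""
        while i < len(tokens):
            tok = tokens[i]
            i += 1
            if tok.group('close'):
                if tok.group('cname') != name:
                    raise SyntaxError(f"expected </{name}>, got {tok.group(0)!r}")
                return i
            if tok.group('open'):
                i = parse_element(tokens, i, tok.group('oname'))
        raise SyntaxError(f"unclosed <{name}>")

    tokens = list(TOKEN_RE.finditer("<div><h1>Title</h1><p>hi <b>there</b></p></div>"))
    parse_element(tokens, 1, tokens[0].group('oname'))  # assumes tokens[0] is an open tag

The SyntaxError is the whole error-handling story, which is exactly what you want for markup you produced yourself.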
Similarly you can't parse S-expressions with regular expressions, but if you have the lexer (e.g. with `lex` or other languages' equivalent) then the parser on top of it becomes absolutely trivial.
If you want to recognize more, then you invoke an attribute "name=value" lexer on the tag, but usually you don't. This makes it quite fast (and speed is useful because most doc toolchains are slow).
The lexing is the "generic" part and parsing is specific to the task at hand.
In the case of making an HTML table of contents, you literally just find <h1>, <h2> with regexes, and don't do all this DOM nonsense. It's easier to write than the SAX style with an explicit stack.
So yes the point is that parsing HTML is trivial for most purposes: use the call stack. It helps if your language has exceptions to indicate errors.
In addition, tokens are regular (as they are for many languages), so you can use a regex for tokenization.
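For the TOC case mentioned a couple of comments up, the whole "parser" can be a flat scan; a sketch, assuming the headings are simple:

    import re

    # Headings that look like <h1>...</h1> / <h2>...</h2>; no DOM required.
    HEADING_RE = re.compile(r'<h([12])[^>]*>(.*?)</h\1>', re.IGNORECASE | re.DOTALL)

    def toc(html_text):
        return [(int(level), re.sub(r'<[^>]+>', '', body).strip())
                for level, body in HEADING_RE.findall(html_text)]

    print(toc("<h1>Intro</h1><p>text</p><h2>Details</h2>"))
    # [(1, 'Intro'), (2, 'Details')]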
All you need for writing a recursive descent parser with backtracking is a call-stack, so are all LL(k) and most practical CFGs also parseable with regex?
But certainly a useful subset of HTML can be parsed with a DPDA. (I'd be interested in more analysis of that; arbitrary tags are another factor)
It's a matter of opinion but I would say recursive descent is "nontrivial", whereas matching tags is "trivial".
Recursive descent involves some choice around lookahead or backtracking. It can be slow if you do the wrong thing, hard to debug, etc. It takes a little practice, and correspondence with a grammar is important.
Whereas matching HTML tags requires no lookahead, and I would say anyone can figure it out just with a simple code reading. Even the "inverted" OOP style is "simple", but annoying for me to read and correctly modify. The call stack reads much better.
The question: how to match open tags (except XHTML self-contained tags) with regex.
Only the 5th answer starts to actually answer the question.
Sure, it's somewhat obscured by the humorous rant, but it's not that bad an answer, either.
More to the point: I'm not sure I want to suck the humor out of everything. I agree that SO has problems, but humor and poetry are worthwhile things in otherwise serious places. It's all about quantity.
Unfortunately, a good number of users who post questions on StackOverflow have not earned the benefit of the doubt. Browsing the site, you will occasionally come across questions which are the tech equivalent of asking "Which screwdriver is the right size to stick in this electrical socket?"
Frame challenges are a necessary part of learning, so they belong on a Q&A site. If a user doesn't want their problem to be challenged, the onus is on them to clarify in the question why their particular approach is the necessary one. Responding with alternative solutions is only possible when the problem isn't specified tightly enough.
Note that this is a legitimate technique in UK sockets.
The live and neutral holes have a little gate over them that is retracted when you insert the earth pin, so you need to first stick a screwdriver into the earth hole in order to get your fingers into the live one.
I'm not so sure benefit of the doubt must be earned. More like, any participant in a discussion forum must show it when answering, and do proper research before asking anything. If all questions are good questions, there's no problem. But, as you say, they really aren't. I think poor questions should be downvoted with a brief explanation instead of trying to answer the "real" question. Or moved to a frame-challenge forum.
Are we trying to answer the question or to solve the problem?
Asking on SO is itself research. It is good to review the existing literature before taking contributors' time, of course, but if the problem is not solved in the existing literature, then perhaps the framing issue isn't addressed by the existing literature either. In that case how could the learner know the best way to frame the problem in advance?
> I think poor questions should be downvoted with a brief explanation instead of trying to answer the "real" question. Or moved to a frame-challenge forum.
This precludes the possibility that some contributors might want to address the framing problem, whereas others might want to address the specific question as asked. They may have different opinions about whether it is framed wrong at all. It also means the OP is losing karma or getting penalized for no fault of their own.
The silent majority of viewers will benefit from an answer that does both of (1) explaining why the answer is probably not what is wanted, and (2) answering the initial question _as written_ anyway, for future viewers.
Answering the question as written has the risk that any solution will be blindly applied without appreciating why the approach itself should be avoided. This is especially true for those users who see SO as a "write my code" site, and copy-paste anything in backticks.
From what I've heard Jeff Atwood and Joel Spolsky had different views on this and Spolsky's more tolerant, "no such thing as a stupid question" approach won out within the company, but is less popular among the people who write answers.
Actually I think it is a common and expected outcome that when investigating a new problem, we often get stuck in "XY problem" traps while researching the solution.
I very much value any feedback that suggests I should rethink the entire problem with a simpler model, because without experience it's hard to know what the simplest models are.
Sometimes people who ask questions know the pitfalls but don't adequately clarify that they know, because they are pressed for time. In this case those people unfortunately run the risk of being talked down to, and they should accept that.
on the other hand if they have clarified adequately that they know what they're doing and they still want to do something that might seem weird then I agree it is disrespectful. Which is a thing you see often enough on StackOverflow to be notable.
The point is that the original question - as framed - was better served by saying "if you go back a step and reexamine your assumptions, you'll find there is a better path to your intended goal".
The new person has a different goal or a different set of constraints.
EDIT - which got me thinking. Maybe the "correct" thing to do is answer the original question as asked but gently point out to the person asking it that there is probably a better solution for them if only they had asked a different question.
The original question still stands and has an answer useful for other people. The original questioner has the opportunity to learn and ask the question they should have asked in the first place.
It's going to be annoying for someone - so it should at least be the person that kicked things off in the first place.
How do you respectfully tell someone you think they are mistaken? I'd rather not be pussyfooted around by someone if I'm in the role of "person who has asked a question based on a faulty assumption". Don't be rude, but don't avoid trying to answer truthfully to the best of your ability.
How about "you're mistaken"?
The problem is with "You don't know what you're talking about, but I do, so let me answer your real question".
I find that perfectly fine. It was slightly disingenuous to reword it.
If actually OP knows that this is bad approach, then OP will clarify that he's aware and yada yada.
What's the problem? lack of thick skin?
Given a specific situation, like a particular page or something, sure, regexes are still a possibility for solving the problem. The 2nd highest answer on the page details exactly that. So what is the problem? Is every single contributor obligated to artificially entertain the OP's preconceptions before giving the advice which they believe actually helps best? For example, if I were knowledgeable about XML but not regex, should I just not contribute in such a situation?
Maybe you're (inadvertently) making a caricature by using a simple "what time is it?" question but many user questions are under-specified.
Because of that, Stackoverflow answerers in particular do go into the extra complexities because it's part of its editorial DNA to restate the q&a so it's a high-quality community knowledgebase instead of just answering the direct question as stated. I tried to explain this hard-to-grasp nuance previously: https://news.ycombinator.com/item?id=21115438
But sometimes, this X-Y problem editorializing mechanism gets so enthusiastic that it can detract from a correct answer. Here's a famous example of a string bytes extraction question with smart people arguing with the correct answers from user541686 (was Mehrdad) and Michael Buen:
+ correct answer has lots of X-Y pushback in the comments: https://stackoverflow.com/questions/472906/how-do-i-get-a-co...
+ another correct answer from Buen that emphasizes user541686/Mehrdad works for broken unpaired surrogates: https://stackoverflow.com/questions/472906/how-do-i-get-a-co...
The meta layer issue is that the question is underspecified which causes 2 sides with very intelligent people arguing whether or not it's an X-Y problem!
So, the solution works, but only in specific situations which are not clearly explained and might be totally unrelated from OP's situation, and furthermore it doesn't really address the second part of OP's question "why take encoding into consideration?" I wouldn't necessarily call the problems with that answer just "XY pushback".
If you don't have that context, then the correct thing to do is to ask for more information, or say, "did you consider this", or find some other way to come up with a constructive response. You don't just assume you know what the person really wants to do and then try to mansplain it to them.
Note that the set of valid C programs is not a context-free language. Yet it's common to use a context free-based approach to parsing. You just add additional code to handle the context-sensitive aspects (such as a symbol table).
> Have you tried using an XML parser instead?
Plus - Stack Overflow is about trying to generalize any given question to maximize its wider usefulness.
Since when? You don't get extra points if you write stuff that doesn't concern OP's problem. Most SO questions don't go viral or attract lots of upvotes from other people. From a game theory perspective, it doesn't make sense to add more to an answer than what it takes to get it accepted.
If you have slightly different constraints you are encouraged to open another question. Discussions are frowned upon and sometimes even interrupted by admins so you can't discuss if your situation is different from OP's situation and so could warrant a different answer.
Discussions on questions are routinely about unrelated or not-closely-related matters, and quite apart from that Stack Overflow wants to be a Q&A platform, not a discussion platform.
It's true that there are some complications around things like "What if > appears in an attribute's value?" If you know your input well enough, or you don't need perfection, that might be a problem you can ignore. Alternatively, you can still use regex, if you use a sufficiently powerful regular expression tool. .NET's regular expressions, for example, have a concept of balancing groups that will let you do this.
I would also point out that a lot of open source HTML parsing libraries are even more dangerous than regular expressions for parsing unknown HTML, because they use recursive descent. Where you have recursion, you have the potential for a stack overflow. With a regex library, you do have to be careful about catastrophic backtracking, but that's at least something you can usually handle in your own code, or, in the worst case, defend against with timeouts.
A parser that's capable of blowing the call stack, and has been exposed to input from the Internet, though, is capable of taking down your process in a way you can't defend against in most languages. And it's difficult to patch up a parser like that without more-or-less rewriting it. I absolutely have had to deal with html handling code getting into situations like that in the past. Malicious input is real. So is plain old bad input. Reading the code before you use it is often a good idea.
1. <a b="42>c">d
4. <a x=&0>&0</a>
I'm really sad that they didn't go with an XML base for HTML5.
I'm really sad that they didn't evangelise an XML base for HTML5, and that many HTML5-ish tools don't explicitly support XML, but it's not strictly true that they didn't go for an XML base for HTML5
That you can put the same information of an HTML5 document into an XML document doesn't help much if most of the HTML5 documents out there are not polyglot.
> When a document is transmitted with an XML MIME type, such as application/xhtml+xml, then it is treated as an XML document by web browsers, to be parsed by an XML processor. Authors are reminded that the processing for XML and HTML differs; in particular, even minor syntax errors will prevent a document labeled as XML from being rendered fully, whereas they would be ignored in the HTML syntax.
HTML does use an XML base (elements, attributes, namespaces, &c.), it just doesn’t use an XML parser most of the time. But the XMLness is easily observed in various DOM APIs.
I intended the latter. In fact I'm a bit surprised that I have ever been asked for this, I thought "HTML" nowadays exclusively refers to HTML5...
But why should I? Who writes HTML by hand these days?
The SGML heritage of HTML 4.01 and earlier led to some gruesome legal constructs that look surprisingly similar to your examples. Looks like every generation has to make their own mistakes.
Actually real world HTML usually can't be parsed by any strict parser, as it's not valid. It's just a machine-generated text which pretends to be similar to HTML. So extracting some bits of information with regexes often works.
> [...] real world HTML usually can't be parsed by any strict parser [...]
There is the literal standard for parsing HTML. Any conformant implementation (and there are plenty) can of course parse real-world HTML by definition. It's just that you don't always need the full HTML parser to do your job.
You're right that the de jure spec did not match de facto HTML, and browsers didn't necessarily agree with each other. But that's always true. GCC has language extensions that aren't part of the C spec, but you wouldn't say that C is impossible to parse. Old HTML may have taken it up to 11, but it's not fair to say it's impossible to parse.
Modern example - mXSS. Even though modern HTML has to be valid XML, the browser will, instead of giving an error when served invalid HTML, transform what's given to make it standard-compliant.
Really no version of the official html spec was valid xml other than XHTML which was never particularly popular.
But I don't really see your point. An implementation having a different idea of how to parse HTML than what you think is correct is not the same thing as something being unparsable. It's a tautology that if there exists a computer program to parse something, then it is possible to parse it with a computer program.
(HTML, however, was never unparseable, merely insufficiently defined.)
I believe GP was alluding to the fact that many actual resources that declare themselves HTML are not spec conformant, and thus can’t be parsed by a parser that only accepts valid HTML.
HTML 5 is (as are 5.1, 5.1 second edition, 5.3) W3C
HTML Living Standard is WHATWG.
Given the constraints it's entirely possible to parse (a subset of) an irregular grammar with regular expressions. Asking a question along those lines on SO would have only elicited responses that I/someone was doing it wrong.
I won't argue whether it was or wasn't the wrong thing to do, but you don't always get to pick your client.
A valid email address typically isn't just a syntactically correct one; it's also one that can be used to get an email to the recipient. The only way to test that is to send an email and see if it gets to the recipient. Which is why it's much more common to see some minimal client-side validation that uses a simple regex that will (ideally) match all valid email addresses but only catch gross syntactic errors like typing # where you meant @, more for the sake of decent UX than anything else, and rely on asking the user to double-type their email address and sending an activation email to deal with finer-grained syntactic errors and the whole universe of non-syntactic errors.
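Something along the lines of this sketch, i.e. the deliberately permissive "catch gross errors only" check described above:

    import re

    # Deliberately permissive: something@something.something and nothing more.
    LOOKS_LIKE_EMAIL = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

    def plausible_email(addr):
        return bool(LOOKS_LIKE_EMAIL.match(addr))

    assert plausible_email("user@example.com")
    assert not plausible_email("user#example.com")   # the '#'-for-'@' typo
    # Actual deliverability is only confirmed by the activation email.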
However, if we know the document is well-formed(!) XHTML, shouldn't it be possible? This would mean the document is valid XML and XML was specifically designed to be regex-friendly, I believe.
At least, off the top of my head, the only gotchas that have to be accounted for are comments and CDATA sections - those may contain arbitrary unescaped text, including angle brackets. However they also have unambiguous start and end markers and can't be nested, so a regex could account for them.
Attribute values should not be a problem, as angle brackets must be escaped inside those to be valid XML.
I'm not sure about processing instructions and doctypes though.
I listed 3 or 4 caveats. CDATA might be another one since I'm not handling those... I've never used it so I left it out.
Actually I remember that regular languages weren't entirely enough; I think the .*? non-greedy matching was useful, e.g. for finding --> at the end of comments.
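For example, something like this, where the non-greedy .*? stops at the first --> instead of swallowing everything up to the last one:

    import re

    COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)

    print(COMMENT_RE.findall("<!-- a --><p>x</p><!-- b -->"))
    # ['<!-- a -->', '<!-- b -->']  (a greedy .* would return one giant match)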
Aka, it works on the happy path.
Software engineering is how we balance how much of the unhappy path and corner cases we take care of, and how we handle them, imo.
Yes, you cannot use "true" regular expressions to parse recursive structures.
But the libraries that get used for regular expressions quite often include non-regular extensions (and confusingly call the resulting expressions still "regular").
Most notably, PCRE allows for recursive patterns via "(?R)".
You can absolutely parse arbitrary HTML with it.
In fact you can parse anything with that, including binary formats. You just can't do it without recursively applying the same "regex" again and again...
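In Python the same idea needs the third-party regex module (the stdlib re has no recursion); a sketch with a single fixed tag to keep it readable:

    import regex  # third-party module; pip install regex

    # (?R) re-enters the entire pattern, so nesting depth is unbounded.
    NESTED_DIV = regex.compile(r'<div>(?:[^<]+|(?R))*</div>')

    print(bool(NESTED_DIV.fullmatch('<div>a<div>b</div>c</div>')))  # True
    print(bool(NESTED_DIV.fullmatch('<div>a<div>b</div>c')))        # False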
And precise error handling is basically impossible without writing a proper lexer anyway, since your regex won't (can't, really) tell you where it was thrown off.
It either works or doesn't, the "why" is left to the program to figure out...
I feel more or less the same when I write regular expressions.
> what you have written is not really a regular expression (modern, regular, or otherwise), but rather a Perl program that uses regular expressions heavily. Does your post really support the claim that regular expressions can parse HTML correctly? Or is it more like evidence that Perl can parse HTML correctly?
I've played around with this parser which is extremely quick. https://github.com/lexbor/lexbor
This document is valid:
In general though, a self-closing tag has no effect in HTML5 anyway; `<script>` is just an example where the usual heuristic specified by HTML5 doesn't help you at all (since it switches the lexer state).
On the other hand, the inclusion of [X] in the title is more than enough to establish the historical setting.
As it is, I've seen this article used to scare people away from "can i make a game behave differently?" efforts that would have been trivial to do and likely given these people a gateway "i can try to be a programmer" experience.
Applying regexes only really counts as "parsing" when you are matching against an entire document. Searching is not parsing. The same argument applies if you apply grep to an HTML doc - I wonder why there are no posts about "you can't parse HTML with grep". I apply grep all the time to source code...
(note : I mean the comments on StackOverflow, not the comments here in Hacker News ... )
RegEx match open tags except XHTML self-contained tags - https://news.ycombinator.com/item?id=14942060 - Aug 2017 (6 comments)
You can't parse [X]HTML with regex - https://news.ycombinator.com/item?id=14155015 - April 2017 (1 comment)
Why you can't parse HTML with regex - https://news.ycombinator.com/item?id=9728474 - June 2015 (2 comments)
Can you parse html with regular expressions? - https://news.ycombinator.com/item?id=5264511 - Feb 2013 (2 comments)
Can regular expressions parse HTML or not? - https://news.ycombinator.com/item?id=5257535 - Feb 2013 (23 comments)
Regexes Parse XML Just Fine, Actually - https://news.ycombinator.com/item?id=3088402 - Oct 2011 (27 comments)
Oh Yes You Can Use Regexes to Parse HTML - https://news.ycombinator.com/item?id=2741780 - July 2011 (77 comments)
Why you should not parse (X)HTML with a Regexp - https://news.ycombinator.com/item?id=2423301 - April 2011 (5 comments)
Stackoverflow, HTML by Regex, topmost answer - https://news.ycombinator.com/item?id=1487695 - July 2010 (37 comments)
You Cannot Parse HTML with Regular Expressions - https://news.ycombinator.com/item?id=1274870 - April 2010 (7 comments)
You can't parse [X]HTML with regex. - https://news.ycombinator.com/item?id=941401 - Nov 2009 (1 comment)
I'm surprised that most of these are so small. There must have been others? (I excluded the boring or empty ones.)
Yog-Sothoth knows the gate. Yog-Sothoth is the gate. Yog-Sothoth is the key and guardian of the gate. Past, present, future, all are one in Yog-Sothoth. He knows where the Old Threads broke through of old, and where They shall break through again.
Stack Overflow can be remarkably humourless at times.