
Parsing HTML Using Regular Expressions - yammesicka
https://stackoverflow.com/a/1732454/1058671
======
jlhawn
Maybe I'm misunderstanding the question, but it sounds like the question is
not asking how to parse HTML with a regex, but how to match HTML open tags
specifically.

While you obviously can't match arbitrary HTML with a regex (because arbitrary
levels of nested elements require a stack-based parser), can you not match
HTML tags with a regex? It seems to me that it should be possible since you
always have the pattern '<' followed by the name of the tag, followed by zero
or more "key=quoted-val" attributes, and finally a '>' token.
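
A minimal Python sketch of that pattern, assuming every attribute is written
as key="value" or key='value' (real HTML also allows unquoted and valueless
attributes, which are deliberately left out here). The names OPEN_TAG, ATTR
and find_open_tags are made up for illustration:

```python
import re

# '<', a tag name, zero or more key=quoted-val attributes, then '>'.
# Assumption: attribute values are always quoted.
OPEN_TAG = re.compile(
    r"""<([A-Za-z][A-Za-z0-9]*)      # tag name right after the angle bracket
        ((?:\s+[A-Za-z_:][-\w:.]*\s*=\s*(?:"[^"]*"|'[^']*'))*)  # attributes
        \s*>                          # closing angle bracket
    """,
    re.VERBOSE,
)

# Pulls the individual key/value pairs out of the attribute run.
ATTR = re.compile(r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def find_open_tags(html):
    """Return a list of (tag_name, {attribute: value}) for each open tag."""
    tags = []
    for match in OPEN_TAG.finditer(html):
        # findall gives '' for the quote style that did not match.
        attrs = {key: dq or sq for key, dq, sq in ATTR.findall(match.group(2))}
        tags.append((match.group(1), attrs))
    return tags
```

Closing tags and self-closing tags such as `<br />` do not match. The sketch
has no notion of context, so tag-like text inside a comment or a script block
would still be reported as a tag.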

So, if the question is limited to just how to parse a single open token then
it seems like all of the answers have just decided to echo what they've heard
in the past which is "don't use regular expressions to parse HTML" when the
truth is that a real HTML lexer/parser does use regular expressions for
creating these "open" and "close" element tokens for the parser.

------
Tloewald
This is a fun (and classic) thread and it's worth reading the pro and con
arguments.

It really falls under the old joke "you have a problem and you decide to solve
it with regex, now you have two problems". HTML is very gnarly, and regex is
very gnarly. Doesn't mean you can't get shit done if you're aware of the
pitfalls.

------
bryanrasmussen
you know when you first read that you think - damn straight you can't parse
html with regex, but as it goes on the idea gets strangely enticing. I mean
maybe, with the correct rituals, and a gun and a willingness to fight with
ancient evils you could maybe parse some html with regex. A sort of
Lovecraft/Action flick.

~~~
ythn
Using regex in python to scrape some data from a website works just fine for
me... _shrugs_

~~~
scarface74
Is there ever a good reason to use regex to do a web scraper instead of using
a proper prebuilt parser and walking the DOM?
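
For comparison, a small sketch of the prebuilt-parser route using Python's
standard-library html.parser. It is event-driven rather than a DOM walk, and
LinkCollector and extract_links are hypothetical names chosen for this
example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return every href found on an <a> tag, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```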

~~~
bryanrasmussen
regex is your hammer and you just by chance happen to have a really
nail-shaped task to do?

------
BrandoElFollito
I am a moderately active user of SE (~25k of flair) and I find the contrast
between the regular channel (say, Stack Overflow) and the Meta one (SO Meta)
horrifying.

The SO Meta community is such a bunch of bullies that I now hardly go there
(even though I recently found two bugs which I did not bother to post). In
contrast, the regular channels are pragmatically helpful (pragmatically
because you still need to make some sacrificial offerings to the gods, called
"what effort have you put into the question", and suffer some psychotic
downvoters). It is interesting to see that both populations are composed of
the same individuals, who seem to have a personality flip when switching
channels.

I would be interested someday to learn about the dynamics of such groups.
There are plenty of places on the Internet populated by mentally deranged
participants (cowards hiding behind the Internet) but the SE Meta ones are, I
believe, more educated / intelligent on average and, sometimes, more traceable.

~~~
xenomachina
For a long time, the SO and Meta.SO scores were completely separate. This
meant that even if you were a top contributor on SO, you might have very low
meta-reputation. It was pretty screwed up. I have a fairly high SO rep (top <
0.2%), but in those days my meta-rep wasn't even high enough to unlock many
basic features. I remember reporting bugs and getting treated like a noob.

I'd even complained about the reps being separate, pointing out how this gave
the power on meta to people who didn't even necessarily contribute on the main
site. A bunch of high meta-rep users descended, simultaneously shooting down
the idea of merging the reps while admitting that the main reason they liked
the status quo was because they didn't want to lose their precious karma. I
was pretty surprised when SO eventually fixed this, but it's kind of too
little, too late. I don't bother with meta anymore despite now having a high
rep on it. Too many bad memories.

~~~
BrandoElFollito
If it was me, I would not have put any rep in Meta. The fact that one is a
genius in Java and Python does not mean that he or she is a good moderator or
site admin.

Anyway, I find it sad that they are losing some possibly useful feedback in
the name of self-adoration. And this particularly because SE is a fantastic
source of knowledge: just reading the Hot Topics made me learn about subjects
I had never looked at.

------
dukoid
It should be possible to _tokenize_ html with regular expressions, and that's
all he seems to be asking for...
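
A rough sketch of what such a tokenizer could look like in Python. This is a
simplification, not how any particular HTML lexer works: TOKEN and tokenize
are illustrative names, and doctype, CDATA and script/style content are not
treated specially.

```python
import re

# One alternative per token kind; order matters because the first match wins.
TOKEN = re.compile(
    r"""(?P<comment><!--.*?-->)
      | (?P<close></[A-Za-z][^>]*>)
      | (?P<open><[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>)
      | (?P<text>[^<]+|<)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(html):
    """Yield (kind, text) pairs; matching up nesting is left to a parser."""
    for match in TOKEN.finditer(html):
        # lastgroup is the name of the alternative that matched.
        yield match.lastgroup, match.group()
```

Quoted attribute values are skipped as a unit, so a '>' inside quotes does
not end the tag early. Deciding which open token pairs with which close token
is exactly the part that needs a stack and is outside the regex.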

------
krallja
(2009)

