

PHP Simple HTML DOM Parser - barredo
http://simplehtmldom.sourceforge.net/

======
encoderer
I've been using this for some serious scraping recently... dozens of threads
scraping for 5, 6 days at a time.

Very good application, but watch out for memory leaks. It leaks like a sieve
if you miss a ->clear() call in your destructors and you're not using the
circular reference detection feature in the PHP 5.3 garbage collector.
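
To make the leak concrete, here is a rough Python analogy. It is not PHP and
not the library's actual code: the `Node` class, its `clear()` method, and
`tree_survives()` are all hypothetical names made up for this sketch. It
assumes CPython, where objects are freed by reference counting, and it turns
the cycle collector off to stand in for a runtime without circular reference
detection.

```python
import gc
import weakref


class Node:
    """Hypothetical DOM-like node: parent and children point at each other."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def clear(self):
        """Break the parent/child cycles so reference counting can free the tree."""
        for child in self.children:
            child.clear()
        self.children = []
        self.parent = None


def tree_survives(call_clear):
    """Build a small tree, drop it, and report whether it is still in memory."""
    root = Node()
    Node(parent=root)            # the child keeps a reference back to root
    probe = weakref.ref(root)    # lets us check liveness without keeping root alive
    if call_clear:
        root.clear()
    del root
    return probe() is not None


gc.disable()  # stand-in for a runtime without cycle collection
try:
    leaked_without_clear = tree_survives(call_clear=False)
    leaked_with_clear = tree_survives(call_clear=True)
finally:
    gc.enable()
    gc.collect()  # clean up the tree that was leaked on purpose
```

Without `clear()` the child's back-reference keeps the dropped tree alive, so
every parsed document in a long-running scraper would pile up. Breaking the
cycle by hand, or leaving the cycle collector on, avoids that.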

------
crux_
Since folks here seem to be in the know, if I were after "parsing as close to
the browser as possible", ideally in Java/Scala or Ruby or Python (sorry PHP!)
-- any recommendations?

I've done scraping (e.g. w/ BeautifulSoup) but haven't looked to see how true
the parses are to what IE/FF/WebKit would produce.

(On my list of things to look into: html5lib --
<http://code.google.com/p/html5lib/> ... is it any good?)

~~~
tremendo
For Ruby there's Hpricot and Nokogiri. Now I must admit I don't understand
what "parsing as close to the browser as possible" would mean. These parsers
are not for displaying the HTML; they're not rendering engines like those in
browsers, but they will help you navigate the DOM of a (X)(HT)ML document
programmatically. Surely I'm missing your meaning.

~~~
crux_
What I'm aiming at would mean "given this lump of (malformed) HTML, what DOM
would a browser give me?" Maybe Hpricot, BeautifulSoup, et al are already
there, but I don't know. :)

------
yannis
Another way to get to the DOM, especially if you are using tidy to clean up
your HTML, is tidy itself: <http://php.net/manual/en/book.tidy.php>. It is not
very hard to get elements using tidy, but you need to write your own
functions. However, it is almost bulletproof, as it can correct malformed
HTML, which some of the other libraries can't.

~~~
dshah
I'm using Tidy now, and it's been working pretty well. What I do is use Tidy
to convert malformed HTML to XHTML and then use the SimpleXML methods for my
parsing.

Anyone know if this new library is better than that approach?

------
gorm
<http://code.google.com/p/phpquery/> This also seems interesting.

------
danw
I've used this before. It's got a lovely API but uses far too much memory.

------
larrykubin
For Python folks, check out pyquery (<http://www.pyquery.org/>). It's really
handy.

------
mildweed
This is a good one to be sure. Also, check out QueryPath
(<http://querypath.org/>).

~~~
kylemathews
I used QueryPath for a recent project and really liked it. Its selectors are
almost identical to jQuery's, so the learning curve was shallow. Two things
QueryPath does that this library doesn't seem to do are working with XML and
allowing chainable method calls. See this IBM developerWorks article:
<http://www.ibm.com/developerworks/opensource/library/os-php-querypath/>

------
grumpycanuck
As someone who works with XML all day long, I am wondering why people don't
use SimpleXML in PHP 5+ for this sort of thing too.

~~~
tetsuo13
SimpleXML will choke on invalid HTML/XML while this library claims to support
it.
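
A quick way to see the difference, sketched in Python rather than PHP: the
standard library's strict XML parser plays the role of SimpleXML here, and
`html.parser` plays the role of a forgiving HTML parser. This is only an
analogy under those assumptions, not a test of SimpleXML or of the library in
the post. `BROKEN_HTML`, `TagCollector`, `strict_parse_ok`, and `lenient_tags`
are made-up names for this sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical tag soup: <br> and both <p> elements are never closed.
BROKEN_HTML = "<html><body><p>first<br><p>second</body></html>"


class TagCollector(HTMLParser):
    """Records every start tag it sees, without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_tags(markup):
    """Return the start tags a forgiving HTML tokenizer finds in the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


xml_accepts_it = strict_parse_ok(BROKEN_HTML)
tags_found = lenient_tags(BROKEN_HTML)
```

The strict parser rejects the whole document at the first mismatched tag,
while the lenient one still reports every element it passed. Note that
`html.parser` only tokenizes; it does not build a repaired DOM the way Tidy
or a browser would.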

------
vinhboy
you guys read my mind.. i've been needing one of these.

