
I created an HTML parsing library in Java to extract data from complex websites - jbax
https://www.univocity.com/pages/html_parser_about
======
jbax
Hi, I'm the guy who built and maintains univocity-parsers
([https://github.com/uniVocity/univocity-parsers](https://github.com/uniVocity/univocity-parsers)) and I recently
released my new parser for HTML.

I've been developing this internally over the years to build scrapers for my
clients who need to extract data from complex pages, and I finally packaged it
into a product that others can use.

I know devs are not very keen on using closed-source code, but this is a
godsend if you work with website scraping: it's fast to run and fast to code
with. You basically don't need to write a lot of code: it's usually _one line_
of very readable code for _each data point_ you want to collect. It beats any
other parser hands down in performance and ease of use.

If you don't feel like clicking around:

– it was built to process intricate pages hundreds of megabytes in size and
generate result rows that can be dumped directly into a database. There's no
need to traverse nodes with code or to define complex XPath or CSS selectors.
You can also annotate a few classes to get objects back and do whatever you
want with them.

– you just define matching rules that crawl the page, and they will be grouped
into threads that collect the data for you

– to define a matching rule you look at the element you want instead of
figuring out a path that navigates to it. You only need to create a rule that
uniquely identifies the data to match based on what is around the element. For
example: "match(td).precededBy(td).withText('any*name')". This will match any
table cell of a page where the cell on the left has text such as 'any previous
name' or 'any nickname'. The cell can be buried inside dozens of other tables
and elements, but you don't have to care about the page structure. Compare
that to an XPath or CSS selector. (There's a rough end-to-end sketch of what
this looks like after this list.)

– it follows links and lets you aggregate data from the linked pages into your
records

– it's easy to detect changes and missed elements when you want to ensure ALL
available data points in a page are being matched and collected
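
For the curious, here's a rough sketch of what a couple of rules look like end
to end. It's simplified, and the class and method names here are illustrative
rather than exact, so treat it as pseudocode and see the self-contained
example linked below for a complete, working version:

    // configure an "entity" with one field per data point to collect
    HtmlEntityList entities = new HtmlEntityList();
    HtmlEntitySettings profile = entities.configureEntity("profile");

    // one line of code per data point: match a <td> whose left-hand
    // neighbor carries the label you're looking for
    profile.addField("name").match("td").precededBy("td").withText("any*name").getText();
    profile.addField("nickname").match("td").precededBy("td").withText("nickname").getText();

    // parse the page (the URL is just a placeholder) and get plain rows
    // back, ready to be dumped into a database
    HtmlParser parser = new HtmlParser(entities);
    HtmlParserResult result = parser.parse(new UrlReaderProvider("https://example.com/profile"));
    for (String[] row : result.getRows()) {
        System.out.println(String.join(", ", row));
    }

Each addField(...) call is one data point; the parser walks the page once and
hands back rows you can write straight to your database.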

There's also a self-contained example using it to read a website:
[https://stackoverflow.com/questions/50914516/scrape-information-from-web-pages-with-java](https://stackoverflow.com/questions/50914516/scrape-information-from-web-pages-with-java)

I'd like to invite you all to test it out and share any feedback you might have.

