

Ask HN: compatibility layer to parse XML and HTML documents? - MasterScrat

Hi guys,

I'm making an aggregator for restaurant menus in Java. I'm looking for the best way to extract the content of documents formatted in various ways: some restaurants provide an RSS feed, others provide an HTML table, and for others you have to compile the results displayed across multiple pages...

What I'm looking for is a way to describe the transformation needed to go from the source document to an easily parsable format. Of course this transformation will have to be custom-made for every data source, but that's not a problem.

I know I could write a Java adapter for each restaurant, but I'm looking for a simpler, more standard solution. Basically something like XSLT but more flexible.

Any ideas? Thanks for your help.
======
gspyrou
Take a look at Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>

------
geuis
You aren't going to find a magic module that does what you want. Break down
the problem a bit.

* Parse XML (RSS/Atom feeds)

* Parse tabular data (sometimes) -> parsing HTML

* Parse HTML (DON'T USE REGEXES)

You're on the right track: you are going to have to write a separate parser
for each source.

I would recommend structuring your app the way an OS handles graphics card
drivers. The OS doesn't need to know about all the thousands of kinds of
graphics card hardware. It just needs a set of drivers for the ones you're
going to use with it.

Each custom parser is a driver. The output of each driver should be some
standard data format you've decided on (JSON is your best bet). Your app layer
should sit on top of this, so it can just ingest the various sources of data
without having to worry about formats.
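
To make the driver idea concrete, here is a rough sketch in Python using only
the standard library. Everything in it is made up for illustration: the
function names, the `name`/`description` fields, the sample documents, and the
assumption that the HTML table has the dish name in the first column. Treat it
as one possible layout, not a finished design.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def rss_driver(text):
    """Hypothetical driver: one menu entry per <item> in an RSS feed."""
    root = ET.fromstring(text)
    return [
        {
            "name": item.findtext("title", default="").strip(),
            "description": item.findtext("description", default="").strip(),
        }
        for item in root.iter("item")
    ]


class _TableRows(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text chunks while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_driver(text):
    """Hypothetical driver: assumes column 1 is the dish, column 2 the notes."""
    parser = _TableRows()
    parser.feed(text)
    parser.close()
    # Header rows contain only <th> cells, so they end up empty and are skipped.
    return [
        {"name": row[0], "description": row[1]}
        for row in parser.rows
        if len(row) >= 2
    ]


# The app layer only knows this registry, never the source formats.
DRIVERS = {"rss": rss_driver, "html_table": html_table_driver}


def ingest(kind, text):
    """Run the named driver and return the common format as a JSON string."""
    return json.dumps(DRIVERS[kind](text))


# Made-up sample documents, one per source type.
RSS_SAMPLE = (
    '<rss version="2.0"><channel><title>Chez Test</title>'
    "<item><title>Soup</title><description>Tomato</description></item>"
    "</channel></rss>"
)
HTML_SAMPLE = (
    "<table><tr><th>Dish</th><th>Notes</th></tr>"
    "<tr><td>Pizza</td><td>Margherita</td></tr></table>"
)
```

Calling `ingest("rss", RSS_SAMPLE)` and `ingest("html_table", HTML_SAMPLE)`
gives back JSON in the same shape, so adding a restaurant means writing one
more driver and registering it, with no changes to the app layer.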

We went through something like this on an internal project at my office last
year, in Java. It's not fun.

I'd honestly recommend going with Python. If you're a Java guy or girl, it'll
be easy to move over to Python. The language is much more expressive, so you
will be able to work more easily, faster, and with better results.

Another benefit of using Python for something like this is all of the useful
modules that are available. Beautiful Soup is particularly useful for HTML
parsing.

