

Ask HN: What technology do you use to collect data from an HTML file? - epynonymous

if a site doesn't provide an API, you can still access its data through the raw HTML. i was just thinking, what if you could create a generic component that gathers data out of HTML pages, quickly normalizes it, and wraps a REST endpoint on top of it (at another domain)? would this be of any value? a hacky way to RESTify legacy websites?

i think the data in an HTML page is often so awfully organized that this would be a difficult task, but perhaps if you could provide this service, you could monetize per REST call.

Any thoughts from fellow hackers?
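
To make the idea concrete, a minimal sketch of the "REST endpoint on top of scraped data" part, using only the Python standard library. Everything named here is an assumption for illustration: scrape_quotes is a hypothetical stand-in for a real per-site scraper, and the /quotes/<symbol> path, the field names and the port are made up.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def scrape_quotes():
    """Hypothetical scraper: a real one would download and parse the
    legacy page. Here it just returns fixed, already-normalized data."""
    return {"ACME": {"price": 12.5, "high_52w": 20.0}}


def build_response(path, scrape=scrape_quotes):
    """Map a request path such as /quotes/ACME to (status, JSON body)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) == 2 and parts[0] == "quotes":
        record = scrape().get(parts[1].upper())
        if record is not None:
            return 200, json.dumps(record)
    return 404, json.dumps({"error": "not found"})


class ScrapeHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer: all routing lives in build_response."""

    def do_GET(self):
        status, body = build_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the endpoint locally until interrupted."""
    HTTPServer(("127.0.0.1", port), ScrapeHandler).serve_forever()
```

Calling serve() and then requesting /quotes/acme on the chosen port would return the JSON record. The hard part, writing a scraper per site, is exactly what scrape_quotes hides.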
======
reirob
I have had to do this several times: collect data from HTML pages and put it
in a small DB for further analysis.

In the end I came up with a shell script using the following UNIX/Cygwin tools:

1.) curl to download the HTML page to a file;

2.) iconv to convert the HTML to UTF-8 if it was in a different encoding
(which was the case once);

3.) tidy -asxml -numeric -utf8 to convert the HTML page to XML;

4.) xmlstarlet (<http://xmlstar.sourceforge.net>) with the sel command and a
bunch of XPath expressions to extract the data I needed from the page and
pipe it to other UNIX tools. Watch out: the data retrieved with xmlstarlet
may contain XML-escaped characters, so I run it through "xmlstarlet unesc".

This approach worked pretty well for me.
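
For comparison, a rough Python sketch of the same steps (download, decode to UTF-8 text, parse, extract, unescape) using only the standard library. It is not the shell script described above: it swaps tidy and XPath for html.parser, and the names fetch, CellCollector and extract_rows, as well as the ISO-8859-1 default, are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1 (curl): download the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


class CellCollector(HTMLParser):
    """Steps 3-4: walk the markup and collect <td> text, grouped by <tr>."""

    def __init__(self):
        # convert_charrefs=True turns &amp;, &eacute; etc. into plain
        # characters, which covers the "xmlstarlet unesc" step.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None
        self._cell = None

    def _flush_cell(self):
        # Store the cell collected so far, tolerating unclosed <td> tags.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self.rows.append(self._row)
            self._row = None


def extract_rows(raw_bytes, source_encoding="iso-8859-1"):
    """Step 2 (iconv) plus parsing: decode the bytes, return table rows."""
    parser = CellCollector()
    parser.feed(raw_bytes.decode(source_encoding))
    parser.close()
    return parser.rows
```

Something like extract_rows(fetch(url)) would then yield a list of rows ready to insert into a small DB. Unlike XPath, this grabs every table cell on the page, so a real collector would still need per-site logic to pick out the right ones.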

------
madhouse
This would vary from site to site, and if the HTML of one changes, the
"collector" will need to be updated too.

It's a ton of work even for a few sites. Making it generic: yeah, you could do
that, but then you'd have to provide unorganised data as a result, which
wouldn't be useful at all.

~~~
epynonymous
that's one of the problems that i considered since html is not that
descriptive. i could also see sites purposely changing format just to cause
incompatibility if they're not excited about you ripping data from them.

i feel that html should be slightly overhauled to add context to all data, so
instead of a bare <td>, the markup would say the cell is a stock's 52-week
high, or something along those lines.
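
HTML microdata attributes such as itemprop are one existing way to attach that kind of context to a cell. A small sketch of how a collector could read them with Python's standard library, assuming flat (non-nested) labelled elements. The property names "name" and "fiftyTwoWeekHigh" are hypothetical labels for illustration, not taken from any particular site or vocabulary.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the text of every element that carries an itemprop label."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.values = {}
        self._current = None  # itemprop name of the element being read

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name:
            self._current = name
            self.values[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        # Assumes labelled elements are not nested inside each other.
        self._current = None


def read_itemprops(html):
    """Return a dict mapping each itemprop name to its text content."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.values.items()}
```

With markup like this, the collector keys on the label rather than on the position of the <td>, so a layout change alone would not break it.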

i just had a thought, perhaps rss is the answer! do sites still use that? i
kind of feel rss is a dead technology.

------
mateo999
open.dapper.net is pretty good for this.

~~~
epynonymous
wow, thanks matt, this is powerful!

