
Toapi – Let any web site provide APIs - marban
http://www.toapi.org
======
anon1253
Now I don't want to be a downer, but we collectively seem to have forgotten
that HTML, as a markup language with sufficient semantic elements, is a perfect
API in itself. In fact, if we had stuck with XHTML I would've postulated that
it would've been an even better API than JSON, thanks to XPath, XQuery and XSLT.

"HyperText is a way to link and access information of various kinds as a web
of nodes in which the user can browse at will. Potentially, HyperText provides
a single user-interface to many large classes of stored information such as
reports, notes, data-bases, computer documentation and on-line systems help.
We propose the implementation of a simple scheme to incorporate several
different servers of machine-stored information already available at CERN,
including an analysis of the requirements for information access needs by
experiments… A program which provides access to the hypertext world we call a
browser — T. Berners-Lee, R. Cailliau, 12 November 1990, CERN "

Web apps are somewhat backwards in my opinion. We completely lost the idea of
a markup language and decided “wait, we need an API”. So we started using JSON
instead of the actual document (HTML or XML) to represent the endpoints. And
then we patted ourselves on the back claiming “accessibility”.
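
To make that concrete: with well-formed markup, one GET plus a few XPath
expressions already is a usable API. A minimal sketch with Python's lxml (the
page and its fields are invented for illustration):

```python
# Treating the document itself as the API: one XPath expression per "field",
# much like addressing keys in a JSON response. The markup here is made up.
from lxml import etree

xhtml = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul id="articles">
      <li><a href="/a/1">First post</a> <span class="score">42</span></li>
      <li><a href="/a/2">Second post</a> <span class="score">17</span></li>
    </ul>
  </body>
</html>"""

ns = {"h": "http://www.w3.org/1999/xhtml"}
doc = etree.fromstring(xhtml)

titles = doc.xpath("//h:ul[@id='articles']//h:a/text()", namespaces=ns)
scores = doc.xpath("//h:ul[@id='articles']//h:span[@class='score']/text()",
                   namespaces=ns)
print(list(zip(titles, map(int, scores))))
# [('First post', 42), ('Second post', 17)]
```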

~~~
porker
That was fine when (X)HTML was about semantics, but once it got used for
presentation too, the contract was broken. I scrape a few sites (due to the
lack of RSS feeds and the desire of UK organisations to treat Facebook pages
as their 'News feed') and it's easy... until they do a redesign and the HTML
changes.

For that reason alone, an API response format that stays constant no matter
how the site looks appeals to me.

~~~
pluma
> but once it got used for presentation too

So everything after HTML 2.0 (RFC 1866)? The STYLE and FONT elements were
added in HTML 3.2, but when tables were added to HTML 2 in RFC 1942 they
already included presentational attributes. Heck, RFC 1866 already included
the IMG element with an ALIGN attribute, as well as typographic elements like
HR, BR, B, I and TT (for monospace text).

It sounds like you're being nostalgic about a time that never was, especially
for XHTML (which the Semantic Web crowd loves to misremember as being 100%
about semantics and not just a hamfisted attempt to make HTML compatible with
the W3C's other XML formats).

~~~
anon1253
Really, people didn't like XHTML because they didn't want to close their
elements. That's it. And now most web pages don't even parse in an XML parser.
Which elements are standard and which are not is completely arbitrary, and what
the browser does with them doesn't really matter either. What matters is that
you can extract the data from it if you know the structure (either by following
some standard, or by having out-of-band documentation). In that respect JSON
and X(HT)ML are similar, except now you can't scrape web pages with a single
GET request from the canonical URI, and instead need to run a fully fledged
browser that executes JavaScript and/or read some site-specific JSON docs (if
any are provided at all).

~~~
pluma
> because they didn't want to close their elements

That is an incredibly uninformed view of early 2000s web content authoring.
The reason people didn't like XHTML was that it provided no tangible benefit.

In fact, my experience was quite the opposite: technical people _loved_ XHTML
because it made them feel more legitimate, they just had to sprinkle a few
slashes over their markup. Validators were the hot new thing and being able to
say your website was XHTML 1.0 Strict Compliant was the ultimate nerd badge of
pride.

But these same people didn't use the XHTML mime type because they wanted their
website to work in as many browsers as possible.

> And now most web pages don't even parse in an XML parser.

Again with the nostalgia for a past that never was: most web browsers never
supported XHTML, they supported tag soup HTML with an XHTML DOCTYPE but an
HTML mime type. Why? Because they supported tag soup HTML and only used the
DOCTYPE as a signal for whether the page tried to be somewhat standards
compliant.

------
pluma
As HN sometimes likes to pretend XHTML is the answer to all problems with HTML
and the Semantic Web would have worked if only developers hadn't been such
idiots at the time, let me reiterate a few things:

First off, a full disclosure: I was (and in principle still am) a big fan of
web standards and was a strong believer in XHTML and XHTML 2. I thought XML
was going to save us all.

Here's a hard truth about HTML: it was never about semantics. HTML was created
to connect digital documents. Most of the initial tags were fairly ad hoc and
loosely based on CERN's SGML guide, as is still evident from the awkward h1-h6
elements[0].

The biggest thing the first W3C spec[HTML32] added to HTML? The FONT element
(and arguably APPLET because Java applets were a thing). Note that it really
didn't add any elements for additional semantics. It was just another
iteration on HTML that merged minor changes to HTML 2.0 with some proprietary
extensions (mostly by Netscape).

HTML 4 followed shortly after and promised to rid the world of presentational
markup because CSS had become somewhat widely supported (having won against
Netscape's JavaScript StyleSheets, which rapidly faded into obscurity). It
attempted to do this in a backwards compatible way by defining three different
DOCTYPEs: Strict, Transitional and Frameset.

At this point it's worth emphasising what has become the most important mantra
in web standards: Don't Break the Web. HTML 4 Strict was trying to enable a
break as an opt-in. The website would behave exactly the same way but the
DOCTYPE would tell browsers the page doesn't use certain elements the spec
defined as deprecated.

Of course by this point web browsers no longer cared about SGML and were
purpose built to handle HTML and be able to render whatever junk a content
author would throw at them. The Browser War was raging (Netscape still
believed it could make money selling commercial licenses for their Navigator
-- yes, surprise, Netscape Navigator wasn't free for commercial use) and if
your browser couldn't render a page but a competitor's could, that's where
users would go.

So in practice Strict vs Transitional didn't make any difference except as an
indicator of how sloppy the author likely was with their markup. Eventually
Internet Explorer started using it as a shibboleth to determine whether to
fall back to "quirks mode" or follow the standards more closely, but deprecated
elements would still render in strict mode, and the only things that cared were
automated validators that spit out pretty badges.

When the W3C created XHTML[XHTML1] the entire point was to drag HTML into the
XML ecosystem the W3C had obsessively created and that just wasn't getting
much traction on the web. Instead of having to understand SGML, browsers could
just learn XML and they'd be able to handle XHTML and all the other beautiful
XML standards that would enable the Semantic Web and unify proprietary XML
extensions and open standards side by side.

Of course the only flaw in that logic was that browsers _didn't_ understand
SGML. Adding support for XML actually meant implementing a second language
alongside HTML, and as XML had much more well-defined implementation
requirements, this among other things meant that web pages that weren't
well-formed XML would have to break _by definition_, and browsers were only
allowed to show a yellow page of death with the syntax error.

This is why browsers for the most part ended up ignoring XHTML: Firefox
supported XHTML, but developers were upset with WYSIWYG tools breaking their
websites in Firefox, and because ensuring all your output is well-formed XML is
hard, the easier fix was to just send your "XHTML" tag soup with the HTML mime
type so Firefox would handle it as HTML (which all the other browsers did
anyway -- except Internet Explorer, which for once did the right thing and
dutifully refused to render proper XHTML because, after all, it only understood
HTML, not XHTML).

But as I said: XHTML wasn't about semantics. In fact, XHTML 1.0 copied the
exact same Transitional/Strict/Frameset model HTML 4 had introduced, allowing
developers to write the same sloppy markup (but now with more slashes). And
the W3C even specified how authors could make sure their XHTML could be parsed
as HTML tag soup by non-XHTML browsers (which led to millions of developers
now thinking closing slashes in _HTML_ are "more standards compliant").

A few moments later the W3C created XHTML 1.1[XHTML11], which was mostly just
XHTML 1.0 Strict but now split into multiple specs to make it easier to sneak
other XML specs into XHTML in the future. Again, nothing new in terms of
semantics.

Meanwhile, work began on XHTML 2.0[XHTML2], which would never see the light of
day. XHTML 2.0 was finally going to be a backwards incompatible change to
HTML, getting rid of all the cruft (more than XHTML 1.1 had tried to shake off
by dropping Transitional, even promising to kill a few darlings like h1-h6)
and replacing lots of HTML concepts with equivalents already available in
other XML specs. XHTML 2.0 would finish the transition that XHTML had started
and replace HTML with the full power of the XML ecosystem.

Except obviously that was not going to happen. Browser vendors for the most
part had given up on XHTML 1 because authors had no interest in buying into a
spec that provided no discernible benefits but would introduce a high risk of
breaking the entire website or requiring workarounds for other browsers.
Netscape was dead, Internet Explorer had stabilised and didn't seem to be
going anywhere, the Browser War that had fueled innovation was largely over.

But even so, XHTML 2 wouldn't really have added anything in terms of
semantics. That wasn't the goal of XHTML 2. The goal of XHTML 2 was to
eliminate redundancies within the new XML/XHTML ecosystem, generalise some
parts of XHTML into reusable XML extensions and specify the rest in a way that
integrates nicely with everything else (leaving the decision what XML
extensions to support up to the implementors).

The real game changer in semantics was Microformats[MFORM]. Because the W3C
was too slow to keep up with the rapid evolution of the real web and too
academic to be aware of real world problems, the web community came up with
ways to use the existing building blocks of HTML to annotate markup in a
machine-readable way, building on existing specs -- all this without XML.
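
To illustrate the mechanism (this uses the h-card vocabulary; a real consumer
would use a full microformats parser like mf2py, this is just the gist in a
few lines of Python):

```python
# Microformats ride on plain HTML class attributes -- no XML involved.
# Toy h-card extraction; a real parser handles the full vocabulary.
import lxml.html

html = """
<div class="h-card">
  <span class="p-name">Ada Lovelace</span>
  <a class="u-url" href="https://example.org">example.org</a>
</div>"""

doc = lxml.html.fromstring(html)
for card in doc.find_class("h-card"):
    name = card.find_class("p-name")[0].text_content()
    url = card.find_class("u-url")[0].get("href")
    print(name, url)  # Ada Lovelace https://example.org
```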

When browser vendors finally gave up relying on the W3C's leadership for HTML
and began working on a new HTML spec[HTML] under WHATWG, the main lesson was
to pave the cowpaths. Instead of relying on authors to explicitly add
annotations (although that is still possible using ARIA[ARIA]) various
semantic elements like MAIN and ASIDE finally made it into the language. But
the most important change was that the spec finally defined the error handling
browsers had previously implemented inconsistently.

But again, even with the new HTML spec, HTML's goal never was to be able to
declare semantics for any possible document. Yes, you can embed metadata (and
the W3C still likes to push RDF-XML as a solution to do that in HTML) but the
core elements are only intended to be good enough to provide sufficient
semantics to structure generic documents.

Domain-level semantics are still left as an exercise to the reader. And anyone
who's tried to actually parse well-formed, valid, well-structured HTML for
metadata can tell you that even for generic documents the HTML semantics just
aren't sufficient.

Sorry for the lengthy history lesson, but it never ceases to amaze me how
rose-tinted some people's glasses are when looking at the mythical Semantic
Web, early HTML and the XML ecosystem. I didn't even go into how much of a
mess the latter is but I think my point stands even so.

[0]: The initial list of HTML tags[TAGS] allowed for "at least six" levels of
heading, but the RFC[HTML2] defined only h1-h6 and explicitly stated how each
level was to be rendered. The original SGML guide[SGML] the tag list was
loosely based on used up to six levels of headings in its examples. So in a
nutshell, the reason there are six is likely a mix of "good enough" and nobody
being able to figure out how to meaningfully define the presentation for
anything beyond h6.

[SGML]: http://cds.cern.ch/record/997909/files/cer-002659963.pdf

[TAGS]: https://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html

[HTML2]: https://tools.ietf.org/html/rfc1866

[HTML32]: https://www.w3.org/TR/REC-html32

[HTML4]: https://www.w3.org/TR/REC-html40-971218/

[XHTML1]: https://www.w3.org/TR/2002/REC-xhtml1-20020801/

[XHTML11]: https://www.w3.org/TR/2010/REC-xhtml11-20101123/

[XHTML2]: https://www.w3.org/TR/2010/NOTE-xhtml2-20101216/

[MFORM]: http://microformats.org/

[HTML]: https://html.spec.whatwg.org/

[ARIA]: https://www.w3.org/WAI/intro/aria

~~~
ChrisSD
> as XML had much more well-defined implementation requirements, this among
> other things meant that web pages that weren't well-formed XML would have to
> break by definition and browsers were only allowed to show a yellow page of
> death with the syntax error.

Indeed, part of the problem with switching to XHTML is that, even today, much
HTML is authored by hand. On top of that web pages are often a mix of content
from various sources. Third party widgets (like Twitter or ads) might be
injected into the page. Or user content, which has often been converted by a
hodgepodge of complex regexes into something vaguely resembling HTML.

------
laktek
I built a similar service called https://Page.REST a couple of months back,
which uses CSS selectors instead of XPath for capturing content.
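
(For anyone wondering about the difference: for this kind of extraction the
two are largely interchangeable, and CSS selectors even compile down to XPath.
A quick sketch with lxml and invented markup; this is not Page.REST's actual
interface:)

```python
# Same extraction via a CSS selector and via XPath. lxml needs the
# `cssselect` package for the CSS route; markup here is made up.
import lxml.html
from lxml.cssselect import CSSSelector

doc = lxml.html.fromstring('<p><a class="title" href="/x">Hello</a></p>')

print(doc.cssselect("a.title")[0].text)            # Hello (CSS selector)
print(doc.xpath("//a[@class='title']/text()")[0])  # Hello (XPath)
print(CSSSelector("a.title").path)                 # the XPath it compiles to
```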

~~~
krisives
What about scraping based on logged in users?

------
julienfr112
I think it's what in French we call a _fausse bonne idée_, or a good idea at
first sight only.

I have scraped many sites, and making an API out of a site is really similar
to web scraping. Despite having tried many times to create a universal
scraper, or at least a helper class for scraping, I found out that restarting
from scratch for each site was the best strategy.

There is a lot of info on Stack Overflow about that, and there was a post
specifically on this issue, but I can't find it.

Maybe one third of the time, everything is OK and I use my favorite tools:
python-requests with requests_cache for the connection and lxml for parsing.
Then Toapi should work. But there is so much variety among internet sites that
most of the time I end up doing something else: using other ways to get the
data (websockets, PhantomJS, headless Chrome, going through a proxy or Tor, or
anything...), or other data extraction methods because XPath does not work
(parsing JSON, taking a screenshot then OCR to get the position of text, even
fasttext!). Sometimes there are encoding issues. Sometimes the HTML is
malformed. In those two thirds of cases, Toapi won't work out of the box. So
you will try to fix it, improve it, etc., and the code base of Toapi will
grow, but it will never handle all the cases.
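
For reference, the happy-path stack above looks roughly like this (the URL and
the XPath expressions are placeholders, not from any real site):

```python
# Happy-path scraping: requests_cache for polite, repeatable fetching,
# lxml for parsing. URL and XPath below are invented placeholders.
import lxml.html
import requests_cache

session = requests_cache.CachedSession("scrape_cache")  # responses cached in SQLite
resp = session.get("https://example.com/listing")
doc = lxml.html.fromstring(resp.content)

for row in doc.xpath("//div[@class='item']"):
    title = row.xpath(".//h2/text()")
    print(title[0].strip() if title else "(no title)")
```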

~~~
porker
> (parsing json, making screenshot then ocr to get position of text, even
> fasttext !)

I'd be really interested to hear how you use fasttext for this.

~~~
julienfr112
I'll make an HN post about that. Don't hold your breath until it happens,
though.

------
ktpsns
This kind of service is great and technically well done. Of course it "should
not be necessary" if people exposed well-structured information in (X)HTML.
However, many actors work hard to publish their valuable data only in a way
they can control (for instance, human-readable only). Such API efforts are
helpful to demonstrate that it is impossible to publish data and keep it
fenced in at the same time.

------
orf
I wish something like this would use asyncio rather than flask

------
0x006A
It somehow misses an option to use local variables, i.e. to use the id of an
item to look up related data, rather than only the info inside that element.

For the Hacker News example, that would be needed to parse the score and
comment count. A rough sketch of the idea is below.
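
Something like this, if Toapi exposed it (the snippet below uses plain lxml
instead, and assumes HN's current markup - tr.athing rows whose id keys a
score_<id> element - which may of course change):

```python
# The "local variable" idea: use the item's id from one element to look up
# related data elsewhere in the page. HN markup details are an assumption.
import lxml.html
import requests

doc = lxml.html.fromstring(requests.get("https://news.ycombinator.com/").content)
for row in doc.xpath("//tr[contains(@class, 'athing')]"):
    item_id = row.get("id")
    title = row.xpath(".//span[@class='titleline']/a/text()")
    score = doc.xpath(f"//span[@id='score_{item_id}']/text()")  # join on the id
    print(item_id, title[0] if title else "?", score[0] if score else "no score")
```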

------
krisives
Does anyone know a service that does scraping of content behind login gates?

------
yeukhon
This is still somewhat far from what I was hoping to see, but nonetheless a
great inspiration and a good start.

I worked on a small Firefox plugin that would enable visually impaired users
to use Firefox and interact with websites by voice. It was a small attempt and
it was not an easy task...

The two biggest challenges were (1) understanding the semantics of the website
and (2) interacting with the browser.

For challenge one, consider the case where a clickable div/span styled with
CSS is used instead of the <button> tag. Traditional screen readers try their
best to figure out how to find these buttons, but it still poses a challenge.
Or the "alt" attribute is missing from an <img> tag, so there is no
description when the image 404s. So wouldn't it be awesome if we were more
responsible and also provided a standardized markup API that screen readers
can use?

So here comes ARIA (Accessible Rich Internet Applications) [1]. But even so, a
lot of popular websites are not ARIA-compliant. Furthermore, dynamic content,
single-page applications, and JavaScript make screen readers more difficult to
use.

As part of the thesis, I worked with a few visually impaired users. Whether
screen readers have evolved since, I do not know, so I'd love to hear feedback
from those on HN.

Since this project is intended to run in a browser, the natural choice is an
add-on (which mostly means JavaScript). Furthermore, to protect privacy, I
wanted to do all the speech recognition locally, without anything leaving the
user's computer. I tackled Gmail and my college's website first.

Parsing Gmail's DOM proved to be very difficult, but fortunately Gmail.js [2]
exists - I owe the author so much (although I did help fix a bug, I think).

Next comes the interaction. The idea is:

1. the user says to the computer "find the latest email"

2. the computer repeats the command back to the user

3. a small wait time gives the user a chance to correct the command before
it's too late

4. when the time expires, the code executes the command

Because there is no AI and because we need to consider states, the best choice
(especially for such a small experiment) was a finite state machine, so I
created one (a toy sketch below). However, the number of states exploded, as
expected, for just 5-6 commands on a single website; clearly neither scalable
nor maintainable. It is unacceptable to limit a user to just a couple of
actions, right?
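
Roughly the shape of it (everything here is a made-up toy: the state names,
the single command, the hear() stub standing in for local speech recognition;
the real plugin was JavaScript inside Firefox):

```python
# Toy confirm-then-execute state machine. Each new command or site-specific
# action multiplies these states -- exactly the explosion described above.
import time

COMMANDS = {"find the latest email": lambda: print("opening latest email")}

def run(hear):
    """hear() is a stand-in for local speech recognition."""
    state, pending = "LISTENING", None
    while True:
        if state == "LISTENING":                 # step 1: user speaks
            pending = hear()
            if pending in COMMANDS:
                print(f"You said: {pending!r}")  # step 2: repeat it back
                state = "CONFIRMING"
        elif state == "CONFIRMING":              # step 3: correction window
            time.sleep(2)
            state = "LISTENING" if hear() == "cancel" else "EXECUTING"
        elif state == "EXECUTING":               # step 4: run the command
            COMMANDS[pending]()
            state = "LISTENING"
```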

What would be awesome is if every website could expose APIs - not just any
API, but one that follows a standard. To start small, take sitemap.xml as an
inspiration. There should be a manifest file describing how to log in, what is
needed to log in, whether there is an "About us" page, how to search, how to
post something, and so on.

This manifest file would be generated on demand and be dynamic, since we
cannot write out 10,000 different interactions for Facebook, right? Then a
screen reader or my plugin can read the manifest file, let users tell us what
they want to do, and look up in the manifest how to make the call.
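
Purely hypothetical, but something like this is what I have in mind (every
field name here is invented; no site actually serves such a file):

```python
# A sketch of a site manifest a screen reader could consume. All names
# and structure here are hypothetical.
site_manifest = {
    "login": {"url": "/login", "method": "POST",
              "fields": ["username", "password"]},
    "search": {"url": "/search", "method": "GET", "params": ["q"]},
    "about": {"url": "/about-us"},
    "post_message": {"url": "/messages", "method": "POST",
                     "fields": ["title", "body"], "requires": "login"},
}

# "search for cats" would then map to: GET /search?q=cats
```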

Toapi seems to be heading a similar direction, which is great.

On the second challenge: an add-on can only do so much. I did have to use the
more low-level APIs to do certain things, and if I remember correctly service
workers weren't exactly easy back in 2013. Content scripts have limited
capability. Also, there was no speech recognition API in Firefox in 2013, so I
turned to speak.js [3], which was the only one that worked offline and had no
extra dependencies like Node.js. There is now some support for the Web Speech
API in Firefox, which is encouraging [4].

The takeaways are:

- we suck at designing usable websites that respect semantics (I am guilty of
that on my own silly homepage)

- those who don't use ARIA should try it. ARIA does not solve everything, but
it's a good start

- the DOM is a horrible place for deducing and finding content, and a
nightmare to interact with

- add-ons have restricted capabilities for good security reasons, so they are
not a great place to implement complex software

- JavaScript can hinder a screen reader from doing its job - honestly, please
stop building single-page applications and please stop hijacking the goddamn
back button

- we hate ads, especially overlay and popup ads (they cannot be fully blocked
even with Chrome's new feature). But also consider that we'd have to unblock
ads to load some popular websites (ironically, HN search will not work on
mobile unless Focus is disabled)

- finally, it would be awesome if there were a way for a machine to consume a
website's or a webpage's functionality

1. https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA

2. https://github.com/KartikTalwar/gmail.js/tree/master

3. https://github.com/kripken/speak.js/

4. https://hacks.mozilla.org/2016/01/firefox-and-the-web-speech-api/

------
rambojazz
Is this any different from any other framework, such as Symfony, that lets you
create regex routes with a controller?

~~~
contingencies
1. This is based on the HTML DOM and XPath, not regexes.

2. It's a service, so you don't have to build the repetition/automation part
yourself, and you get caching/discoverability/documentation/etc. for free.

------
amelius
No support for SQL, to quickly query through complicated data structures?

~~~
jdc0589
I mean... that's a pretty arbitrary thing to assume would be present when the
data structure is HTML. XPath, CSS selectors, etc. are the standards there.

~~~
amelius
So let's say I want to access a huge database; it's too large to fit in
memory. Then why wouldn't SQL be an appropriate way to navigate through it?

~~~
Can_Not
You should use this to translate HTML into data, insert that data into a
database, then do SQL.
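
That pipeline in miniature, assuming lxml for the HTML-to-data step and SQLite
for the database (markup and schema are invented):

```python
# Scrape rows out of HTML with lxml, load them into SQLite, query with SQL.
import sqlite3
import lxml.html

html = """<table>
  <tr class="item"><td>widget</td><td>42</td></tr>
  <tr class="item"><td>gadget</td><td>17</td></tr>
</table>"""

rows = [(tr.xpath("td[1]/text()")[0], int(tr.xpath("td[2]/text()")[0]))
        for tr in lxml.html.fromstring(html).xpath("//tr[@class='item']")]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (name TEXT, score INTEGER)")
db.executemany("INSERT INTO items VALUES (?, ?)", rows)
print(db.execute("SELECT name FROM items WHERE score > 20").fetchall())
# [('widget',)]
```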

