"HyperText is a way to link and access information of various kinds as a web of nodes in which the user can browse at will. Potentially, HyperText provides a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help. We propose the implementation of a simple scheme to incorporate several different servers of machine-stored information already available at CERN, including an analysis of the requirements for information access needs by experiments… A program which provides access to the hypertext world we call a browser — T. Berners-Lee, R. Cailliau, 12 November 1990, CERN "
Web apps are somewhat backwards in my opinion. We completely lost the idea of a markup language and decided “wait, we need an API”. So we started using JSON instead of the actual document (HTML or XML) to represent the endpoints. And then we patted ourselves on the back claiming “accessibility”.
XHTML wouldn't have fixed anything. To make XHTML work on the web, browsers would have needed to make so many compromises we'd just have ended up with something even worse and less specified than HTML is now. HTML5 won because it codified all the compromises browser vendors had to make to deal with broken, real-world markup.
Sure, there are HATEOAS implementations that use HTML as a data format but the semantics of HTML are so generic and insufficient that all you're really gaining is being able to use an HTML parser instead of something easier to implement and more widely supported, like JSON.
Let me repeat that point: HTML semantics are insufficient and too generic. That is why microformats were a thing and embedded metadata was standardised in HTML5. So you can bolt on your custom semantics in a standardised fashion if you really want people to parse your markup.
HTML semantics are only sufficient to describe the structure of a document. And even that they don't do very well right now (just ask anyone who takes accessibility seriously and knows the WHATWG and W3C HTML specs).
Not just people. What's the point of being able to embed MathML in your XHTML if there's only a single browser that understands it? And what's the point in using XHTML when you need to pretend it's just HTML tagsoup for 90% of your visitors' browsers?
> rather than in a schema or DTD (which used to resolve in band, even)
Except that only happened with a very small number of tools. Browsers never cared about DOCTYPE declarations other than as a signal (or at best a bunch of ENTITY definitions). And schemas provide no use other than validation.
XML is more verbose than JSON and XML Schema is more established than most of the JSON equivalents but none of that is relevant when talking about HTML. The only thing you gain from embedding your metadata in your HTML is colocation.
Truth in DOM was a dream that web developers chased for more than a decade. The reason it doesn't work is simply that rendering information is a lossy process. You either end up duplicating the same information multiple times in your markup (once for presentation, once for display) or adding layers of indirection to render or reverse the render at runtime. It's a fool's errand.
The changes don't have to be breaking, if the HTML is well designed. Adding or removing an intermediate tag is OK as long as the data tags have identifiers to avoid having to write complex, fragile XPath queries.
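To make that concrete, here is a minimal sketch using only Python's standard library. The two page versions, the `price` identifier and both helper functions are hypothetical, invented for illustration. It assumes the markup is well-formed XML (the XHTML-style case); real-world tag soup would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page. The redesign (V2) adds
# intermediate wrapper tags around the data but keeps the identifier.
PAGE_V1 = "<html><body><span id='price'>42</span></body></html>"
PAGE_V2 = (
    "<html><body><div class='box'><p>"
    "<span id='price'>42</span>"
    "</p></div></body></html>"
)


def price_by_id(markup: str) -> str:
    """Look the value up by its identifier, wherever it sits in the tree."""
    root = ET.fromstring(markup)
    node = root.find(".//*[@id='price']")
    if node is None:
        raise LookupError("no element with id='price'")
    return node.text


def price_by_position(markup: str):
    """Fragile alternative: depends on the exact nesting of the tags."""
    root = ET.fromstring(markup)
    node = root.find("./body/span")
    return node.text if node is not None else None


print(price_by_id(PAGE_V1), price_by_id(PAGE_V2))              # 42 42
print(price_by_position(PAGE_V1), price_by_position(PAGE_V2))  # 42 None
```

The identifier-based lookup survives the redesign; the positional query silently stops matching.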
Well-established methods for authorization that are easy to use from code, the CLI and so on.
Basic Auth is very established and easy to use.
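For reference, this is roughly all it takes from code. A small sketch with the Python standard library: the URL and both helper names are placeholders, and nothing is actually sent over the network. Basic Auth only base64-encodes the credentials, so it should be used over HTTPS.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic Auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a request carrying Basic Auth credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Hypothetical endpoint; the credentials are the well-known example pair
# from the HTTP authentication RFCs.
req = build_request("https://example.org/api/items", "Aladdin", "open sesame")
print(req.get_header("Authorization"))  # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```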
(I say 'XHTML' because it's just HTML that happens to be XML, rather than actually being served as XHTML. But having it in XML obviously makes things a lot easier.)
For that reason alone, an API response format that stays constant no matter how the site looks appeals to me.
It's just that nobody uses HTML like this and hence IDs and classes in particular have turned into being almost exclusively used for visual / UI aspects of a website (using CSS).
There's nothing in principle though that keeps you from mandating that specific IDs and classes have a meaning beyond what's represented visually and that those signifiers therefore have to both stay the same and must not be used for entities other than those they were meant for.
So everything after HTML 2.0 (RFC 1866)? The STYLE and FONT elements were added in HTML 3.2, but when tables were added to HTML 2 in RFC 1942 they already included presentational attributes. Heck, RFC 1866 already included the IMG element with an ALIGN attribute, as well as typographic elements like HR, BR, B, I and TT (for monospace text).
It sounds like you're being nostalgic about a time that never was, especially for XHTML (which the Semantic Web crowd loves to misremember as being 100% about semantics and not just a hamfisted attempt to make HTML compatible with the W3C's other XML formats).
That is an incredibly uninformed view of early 2000s web content authoring. The reason people didn't like XHTML was that it provided no tangible benefit.
In fact, my experience was quite the opposite: technical people loved XHTML because it made them feel more legitimate, they just had to sprinkle a few slashes over their markup. Validators were the hot new thing and being able to say your website was XHTML 1.0 Strict Compliant was the ultimate nerd badge of pride.
But these same people didn't use the XHTML mime type because they wanted their website to work in as many browsers as possible.
> And now most web pages don't even parse in an XML parser.
Again with the nostalgia for a past that never was: most web browsers never supported XHTML, they supported tag soup HTML with an XHTML DOCTYPE but an HTML mime type. Why? Because they supported tag soup HTML and only used the DOCTYPE as a signal for whether the page tried to be somewhat standards compliant.
I don't remember HTML 3.2 being such a problem (I wasn't around for HTML 2.0) because either people didn't do such complex things with their sites, or they didn't redesign much.
I did enjoy how simple publishing documents online was back then, and I'm nostalgic for that. Though definitely not for dial-up internet!
The shift that made HTML the mess it is today isn't technology but target audience and sponsorship. The early web was mostly hobbyists and academia, people who didn't care much about layout and were fine with just having a way to make something bold. The equivalent of people writing their blog posts in markdown these days.
I'm saying you're being nostalgic because you claim that there was a point when XHTML was about semantics and not presentation. That time never was. It's true that today it's easier to use HTML for presentation than back in the day, mostly thanks to CSS and the DOM, but what held HTML back initially wasn't technical.
That said, there were numerous progressions and many of them overlapped. Flash and Java applets were infinitely worse than the interactive blobs of web technologies we have today. Table layouts were followed by a second semantic renaissance led by the CSS Zen Garden (which for the first time really popularised the idea of separating markup and presentation).
The common sense of the user is that either the information on a website is free or it is not free. But "web APIs" have tried to blur this clear distinction. The information is free only if taken in small amounts. Consume too much, too fast and it is not free.
It is the Google model. Allow users to search public information on the web that Google collected for free from websites and has stored on its computers. But users may only query this public information in small amounts, and not "too fast". Otherwise users get blocked.
Now some websites might claim they need the ability to rate limit because if too many users started accessing their website at Googlebot speeds, the website's performance would degrade. But can Google make that claim? Is their infrastructure really that brittle? We are continually bombarded with PR that suggests Google is state of the art.
Is this really the era of "big data"? Users are restricted to very small data. The solution to any website performance being degraded by "too many" requests is to provide bulk data. For example, pjrc.org, where users can buy Teensy microcontrollers, makes the entire website available as a tarball for users to download. The SEC traditionally provided bulk access to filings. There are countless other examples.
Is this really the age of Artificial Intelligence, Machine Learning, etc.? Teenagers and companies around the world build robots and the press gets excited. Yet websites block requests out of fear they are coming from "(ro)bots"? Is there something wrong with automation? (The majority of requests on the web are indeed from bots; using software to make requests, as Google and myriad other companies do, is far more efficient than manual typing, clicking, swiping and tapping.)
Summary: The "Web API" is nothing more than another senseless urging for users to "sign up" to receive free information. Not every "Web API" user is an app developer submitting to an app store, nor are they necessarily running a "competing" website. Every time a website collects unnecessary "sign up" credentials, it is one more unnecessary risk for the user that those credentials will be leaked.
As for rate-limiting usage of a provider’s services, that’s what the terms are for. They’re providing the infrastructure and directly covering the costs. It is an agreement between you and them to abide by their rules for what fair access is. If you disagree, no one is forcing you to pay and/or use that service. You may use another service or even start your own if you feel you can provide a better service. What you can’t do, however, is expect that these services provide some minimum threshold of computation power, especially at their direct cost. Not unless it’s in the agreement you signed with them.
First off, a full disclosure: I was (and in principle still am) a big fan of web standards and was a strong believer in XHTML and XHTML 2. I thought XML was going to save us all.
Here's a hard truth about HTML: it was never about semantics. HTML was created to connect digital documents. Most of the initial tags were fairly ad hoc and loosely based on CERN's SGML guide, as is still evident from the awkward h1-h6 elements.
The biggest thing the first W3C spec[HTML32] added to HTML? The FONT element (and arguably APPLET because Java applets were a thing). Note that it really didn't add any elements for additional semantics. It was just another iteration on HTML that merged minor changes to HTML 2.0 with some proprietary extensions (mostly by Netscape).
At this point it's worth emphasising what has become the most important mantra in web standards: Don't Break the Web. HTML 4 Strict was trying to enable a break as an opt-in. The website would behave exactly the same way but the DOCTYPE would tell browsers the page doesn't use certain elements the spec defined as deprecated.
Of course by this point web browsers no longer cared about SGML and were purpose built to handle HTML and be able to render whatever junk a content author would throw at them. The Browser War was raging (Netscape still believed it could make money selling commercial licenses for their Navigator -- yes, surprise, Netscape Navigator wasn't free for commercial use) and if your browser couldn't render a page but a competitor's could, that's where users would go.
So in practice Strict vs Transitional didn't make any difference except as an indicator to how sloppy the author likely was with their markup. Eventually Internet Explorer started using it as a shibboleth to determine whether to fall back to "quirks mode" or follow the standards more closely but deprecated elements would still render in strict mode and the only thing that cared were automated validators that spit out pretty badges.
When the W3C created XHTML[XHTML1] the entire point was to drag HTML into the XML ecosystem the W3C had obsessively created and that just wasn't getting much traction on the web. Instead of having to understand SGML, browsers could just learn XML and they'd be able to handle XHTML and all the other beautiful XML standards that would enable the Semantic Web and unify proprietary XML extensions and open standards side by side.
Of course the only flaw in that logic was that browsers didn't understand SGML. Adding support for XML actually meant implementing an additional language in addition to HTML and as XML had much more well-defined implementation requirements, this among other things meant that web pages that weren't well-formed XML would have to break by definition and browsers were only allowed to show a yellow page of death with the syntax error.
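The difference in error handling is easy to reproduce with Python's standard library. In this sketch `html.parser` stands in for a forgiving tag-soup parser (it is not the error-recovery algorithm browsers implement), and the sample markup and helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical hand-written tag soup: nothing is closed properly.
TAG_SOUP = "<p>Unclosed <b>bold<p>Second paragraph"
WELL_FORMED = "<p>Closed <b>bold</b></p>"


class TagCollector(HTMLParser):
    """Lenient parser that just records the start tags it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(markup: str) -> bool:
    """An XML parser must reject anything that is not well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def tags_seen_by_html_parser(markup: str) -> list:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parses_as_xml(WELL_FORMED))          # True
print(parses_as_xml(TAG_SOUP))             # False
print(tags_seen_by_html_parser(TAG_SOUP))  # ['p', 'b', 'p']
```

The XML parser gives up on the first document-level error, while the lenient parser carries on and reports whatever it could recognise.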
This is why browsers for the most part ended up ignoring XHTML: Firefox supported XHTML but developers were upset with WYSIWYG tools breaking their websites in Firefox and because ensuring all your output is proper XML is hard, the easier fix was to just send your "XHTML" tag soup with the HTML mime type so Firefox would just handle it as HTML (which all the other browsers did anyway -- except Internet Explorer, which for once did the right thing and dutifully refused to render proper XHTML because after all it only understood HTML, not XHTML).
But as I said: XHTML wasn't about semantics. In fact, XHTML 1.0 copied the exact same Transitional/Strict/Frameset model HTML 4 had introduced, allowing developers to write the same sloppy markup (but now with more slashes). And the W3C even specified how authors could make sure their XHTML could be parsed as HTML tag soup by non-XHTML browsers (which led to millions of developers now thinking closing slashes in HTML are "more standards compliant").
A few moments later the W3C created XHTML 1.1[XHTML11], which was mostly just XHTML 1.0 Strict but now split into multiple specs to make it easier to sneak other XML specs into XHTML in the future. Again, nothing new in terms of semantics.
Meanwhile, work began on XHTML 2.0[XHTML2] which would never see the light of day. XHTML 2.0 was finally going to be a backwards incompatible change to HTML, getting rid of all the cruft (more than XHTML 1.1 had tried to shake off by dropping Transitional, even promising to kill a few darlings like h1-h6) and replacing lots of HTML concepts with equivalents already available in other XML specs. XHTML 2.0 would finish the transition that XHTML had started and replace HTML with the full power of the XML ecosystem.
Except obviously that was not going to happen. Browser vendors for the most part had given up on XHTML 1 because authors had no interest in buying into a spec that provided no discernable benefits but would introduce a high risk of breaking the entire website or requiring workarounds for other browsers. Netscape was dead, Internet Explorer had stabilised and didn't seem to be going anywhere, the Browser War that had fueled innovation was largely over.
But even so, XHTML 2 wouldn't really have added anything in terms of semantics. That wasn't the goal of XHTML 2. The goal of XHTML 2 was to eliminate redundancies within the new XML/XHTML ecosystem, generalise some parts of XHTML into reusable XML extensions and specify the rest in a way that integrates nicely with everything else (leaving the decision what XML extensions to support up to the implementors).
The real game changer in semantics was Microformats[MFORM]. Because the W3C was too slow to keep up with the rapid evolution of the real web and too academic to be aware of real world problems, the web community came up with ways to use the existing building blocks of HTML to annotate markup in a machine-readable way, building on existing specs -- all this without XML.
When browser vendors finally gave up relying on the W3C's leadership for HTML and began working on a new HTML spec[HTML] under WHATWG, the main lesson was to pave the cowpaths. Instead of relying on authors to explicitly add annotations (although that is still possible using ARIA[ARIA]) various semantic elements like MAIN and ASIDE finally made it into the language. But the most important change was that the spec finally defined the error handling browsers had previously implemented inconsistently.
But again, even with the new HTML spec, HTML's goal never was to be able to declare semantics for any possible document. Yes, you can embed metadata (and the W3C still likes to push RDF-XML as a solution to do that in HTML) but the core elements are only intended to be good enough to provide sufficient semantics to structure generic documents.
Domain-level semantics are still left as an exercise to the reader. And anyone who's tried to actually parse well-formed, valid, well-structured HTML for metadata can tell you that even for generic documents the HTML semantics just aren't sufficient.
Sorry for the lengthy history lesson, but it never ceases to amaze me how rose-tinted some people's glasses are when looking at the mythical Semantic Web, early HTML and the XML ecosystem. I didn't even go into how much of a mess the latter is but I think my point stands even so.
: The initial list of HTML tags[TAGS] allowed for "at least six" levels of heading, but the RFC[HTML2] only defined h1-h6 and explicitly stated how each level was to be rendered. The original SGML guide[SGML] that the tag list was loosely based on used up to six levels of headings in its examples. So in a nutshell, the reason there are six is likely a mix of "good enough" and nobody being able to figure out how to meaningfully define the presentation for anything beyond h6.
Indeed, part of the problem with switching to XHTML is that, even today, much HTML is authored by hand. On top of that web pages are often a mix of content from various sources. Third party widgets (like Twitter or ads) might be injected into the page. Or user content, which has often been converted by a hodgepodge of complex regexes into something vaguely resembling HTML.
I have scraped many sites, and making an API out of a site is really similar to web scraping. Despite having tried many times to create a universal scraper, or at least a helper class for scraping, I found that starting from scratch for each site was the best strategy.
There is a lot of info on Stack Overflow about that, and there was a post specifically on this issue, but I can't find it.
Maybe one third of the time, everything is OK and I use my favorite tools: python-requests with requests_cache for the connection and lxml for parsing. In those cases, Toapi should work. But there is so much variety among websites that most of the time I end up doing something else. I use other ways to get the data (websocket, phantomjs, chromeheadless, or going through a proxy or Tor or anything else). I use other data extraction methods because XPath does not work (parsing JSON, taking a screenshot and running OCR to get the position of the text, even fasttext!). Sometimes there are encoding issues. Sometimes the HTML is malformed. In these two thirds of cases, Toapi won't work out of the box. So you will try to fix it, improve it and so on, and the code base of Toapi will grow, but it will never handle all the cases.
Really interested to hear how you use fasttext for this?
For the Hacker News example, that would be needed to parse the score and comment count.
I worked on a small Firefox plugin that would enable visually impaired users to use Firefox and interact with websites using voice. It was a small attempt and it was not an easy task...
The two biggest challenges were (1) understanding the semantics of the website and (2) interacting with the browser.
For challenge one, consider the case where a div/span styled with CSS is used as a clickable button instead of the <button> tag. Traditional screen readers have tried their best to find these buttons, but it still poses a challenge. Likewise, when the “alt” attribute is missing from an <img> tag, there is no description when the image returns a 404. So wouldn’t it be awesome if we could be more responsible and also provide a standardized markup API that screen readers can use?
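Both problems can at least be detected mechanically. The following is a rough sketch with Python's standard library, not what the plugin did: the `AccessibilityAudit` class, the sample markup and the heuristic itself (a div/span with an `onclick` handler and no `role` attribute) are assumptions for illustration, and real pages attach handlers in many ways this would miss.

```python
from html.parser import HTMLParser


class AccessibilityAudit(HTMLParser):
    """Flag images without alt text and clickable div/span elements
    that carry no role attribute (a crude, hypothetical heuristic)."""

    def __init__(self) -> None:
        super().__init__()
        self.images_without_alt = []
        self.unlabelled_clickables = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.images_without_alt.append(attributes.get("src"))
        if (
            tag in ("div", "span")
            and "onclick" in attributes
            and "role" not in attributes
        ):
            self.unlabelled_clickables.append(attributes.get("id"))


SAMPLE = """
<button id="ok">OK</button>
<div id="send" onclick="send()">Send</div>
<span id="go" role="button" onclick="go()">Go</span>
<img src="logo.png">
<img src="photo.png" alt="A photo">
"""

audit = AccessibilityAudit()
audit.feed(SAMPLE)
audit.close()
print(audit.images_without_alt)     # ['logo.png']
print(audit.unlabelled_clickables)  # ['send']
```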
As part of the thesis, I worked with a few visually impaired users. Whether screen readers have evolved I do not know, so I’d love to hear feedback from those on HN.
Parsing Gmail’s DOM proved to be very difficult, but fortunately Gmail.js exists. I owe the author so much (although I did help fix a bug, I think).
Next comes the interaction. The idea is:
1. user says to the computer “find the latest email”
2. the computer repeats back to the user
3. a small wait time gives the user a chance to correct the command before it is too late.
4. when the time expires, the code executes the command
Because there is no AI and because we need to consider states, I created a finite state machine, which was the best choice (especially for such a small experiment).
However, as expected, the number of states exploded for just 5-6 (??) commands on a single website; clearly neither scalable nor maintainable. It is unacceptable to limit a user to just a couple of actions, right?
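The confirm-then-execute loop from the four steps above can be sketched as a tiny state machine. This is a reconstruction for illustration, not the plugin's code: the class, state and method names are invented, and speech input and the timer are replaced by plain method calls.

```python
from enum import Enum, auto
from typing import List, Optional


class State(Enum):
    IDLE = auto()        # waiting for the user to say something
    CONFIRMING = auto()  # command repeated back, correction window open


class VoiceCommandMachine:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.pending: Optional[str] = None
        self.executed: List[str] = []

    def hear(self, command: str) -> str:
        """Steps 1-2: take a command (or a correction) and repeat it back."""
        self.pending = command
        self.state = State.CONFIRMING
        return f"Did you say: {command}?"

    def timeout(self) -> Optional[str]:
        """Steps 3-4: once the wait time expires, run the pending command."""
        if self.state is not State.CONFIRMING:
            return None
        command = self.pending
        self.executed.append(command)
        self.pending = None
        self.state = State.IDLE
        return command


machine = VoiceCommandMachine()
machine.hear("find the latest email")
machine.hear("find the oldest email")  # correction inside the wait window
print(machine.timeout())               # find the oldest email
```

Even this toy version only covers the confirmation flow. Every command that needs its own follow-up states (pick an email, reply, dictate, send) multiplies the transitions, which is where the explosion comes from.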
What would be awesome is if every website could expose APIs: not just any API, but one that follows a standard. To start small, we can take sitemap.xml as an inspiration. There should be a manifest file that describes how to log in, what is needed to log in, whether there is an “About us” page, how to search, how to post something and so on.
This manifest file is generated on demand and is dynamic, since we cannot write out 10,000 different interactions on Facebook, right? Then a screen reader or my plugin can read the manifest file, let the user tell us what he/she wants to do, and look up in the manifest how to make the call.
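To give an idea of what such a lookup could look like, here is a sketch in Python. No such manifest standard exists as far as I know, so the JSON layout, the field names, the `plan_call` helper and the example.org URLs are all made up for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical manifest a site might serve, in the spirit of sitemap.xml.
MANIFEST_JSON = """
{
  "site": "https://example.org",
  "actions": {
    "search": {"method": "GET",  "path": "/search", "params": ["q"]},
    "login":  {"method": "POST", "path": "/login",  "params": ["username", "password"]},
    "about":  {"method": "GET",  "path": "/about",  "params": []}
  }
}
"""


def plan_call(manifest: dict, action: str, **args):
    """Turn a user intent into (method, url, body) using the manifest."""
    spec = manifest["actions"].get(action)
    if spec is None:
        raise KeyError(f"site does not offer action {action!r}")
    missing = [name for name in spec["params"] if name not in args]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    # Only pass on the parameters the manifest declares.
    params = {name: args[name] for name in spec["params"]}
    url = manifest["site"] + spec["path"]
    if spec["method"] == "GET":
        if params:
            url += "?" + urlencode(params)
        return ("GET", url, None)
    return (spec["method"], url, params)


manifest = json.loads(MANIFEST_JSON)
print(plan_call(manifest, "search", q="latest email"))
# ('GET', 'https://example.org/search?q=latest+email', None)
```

The screen reader or plugin would only need to map what the user said to an action name and its parameters; how the call is made comes from the manifest rather than from scraping the DOM.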
Toapi seems to be heading a similar direction, which is great.
As for the second challenge, an add-on can only do so much. I did have to use the more low-level APIs to do certain things, and if I remember correctly service workers weren’t exactly easy back in 2013. Content scripts have limited capabilities. Also, there was no speech recognition API in Firefox in 2013, so I turned to speak.js, which was the only one that worked offline and had no extra dependencies like Node.js. There is now some support for the Web Speech API in Firefox, which is encouraging.
The takeaways are:
- we suck at designing usable websites that respect semantics (I am guilty of that on my own silly homepage)
- those who don’t use ARIA should try it. ARIA does not solve everything but it gives a good start
- the DOM is a horrible place to deduce and find contents, and a nightmare to interact with
- add-ons have restricted capabilities for good security reasons, so they are not a great place to implement complex software
- we hate ads, especially overlay and popup ads (they cannot be fully blocked even with Chrome’s new feature). But also consider that we’d have to unblock ads to load some of the popular websites (ironically, HN search will not work on mobile unless Focus is disabled)
- finally, it would be awesome if there were a way for machines to consume the functionality of a website or a webpage
2. It's a service, so you don't have to automate the repetition/automation part, and you get caching/discoverability/documentation/etc for free.