

Scribd’s HTML5 is a mess - mikecane
http://loud.anotherquietday.com/post/598121586/scribds-html5-is-a-mess

======
snowmaker
Well, sure - what's he expecting? PDF is a low-level, mostly unstructured
display format. Converting that to fully semantic markup that recognizes all
aspects of high-level document structure is probably an AI-complete problem.

For those of you not familiar with the gory details of PDF, it basically uses
absolute positioning for each character. If we converted that directly into
HTML it would be a disaster. So, we actually extract quite a bit of structure
on top of that, recognizing spaces, lines, columns, and paragraphs, which
enables us to write much cleaner HTML. Scribd (and most PDF readers) does this
with heuristic algorithms that make reasonable guesses.
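
To give a rough idea of what such a heuristic looks like, here is a minimal sketch. It is not Scribd's actual code: the `Glyph` type, the function names, and the tolerance values are all made up for illustration, and it assumes `y` is measured from the top of the page. It groups absolutely positioned characters into lines by baseline and guesses spaces from horizontal gaps.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical type, for illustration only)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline, assumed to be measured from the top of the page
    width: float  # advance width of the glyph


def group_into_lines(glyphs, y_tolerance=2.0):
    """Cluster glyphs whose baselines lie within y_tolerance of a line's first glyph."""
    lines = []
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_text(line, gap_ratio=0.3):
    """Join glyphs, guessing a space wherever the horizontal gap is large."""
    out = []
    prev = None
    for g in line:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


def extract_text(glyphs):
    """Return one string per detected line, top to bottom."""
    return [line_to_text(line) for line in group_into_lines(glyphs)]
```

Even this toy version has two magic numbers (`y_tolerance`, `gap_ratio`) that will be wrong for some documents, and columns and paragraphs need further guesses layered on top of it.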

In the future, we'd like to push those algorithms further, and extract ever
more semantic markup. But this is a "nice to have" for us - mostly, people
just want the documents to display correctly and load quickly. And, anyway,
expecting the output of an automated converter to match what a human would
write shows a basic ignorance of the state of computers and AI.

~~~
ilike
In my opinion, web based documents should have a certain level of semantic
purity. Of course there will be technical difficulties to achieve it, but that
is what hackers are for. Right?

While you have the right to justify the lack of semantic markup in Scribd html
docs, he has the right to expect a non-messy markup underneath.

Calling someone's expectation "ignorance" isn't very fair.

~~~
alextgordon
The author of the article _is_ being ignorant by not understanding the basic
differences between PDF and HTML before going off on a rant about it.

------
mustpax
Web applications are not the same as web pages. A web page is a document,
which ought to have simple clean markup. A web application on the other hand
is a tool that can edit said documents. It's more "meta."

To expect the tool to have the same sort of semantic purity as the documents
it interacts with is overly idealistic. It doesn't matter as much what we can
systematically infer about an application, it is more important that the
content be accessible in some sort of standard format.

Similarly, Scribd's HTML5 viewer is just that: a viewer, not a full-fledged
PDF to HTML converter. Its primary purpose is to allow a user to experience a
document in its entirety in the browser. In a world of trade-offs, and
priorities, the markup of the viewer itself is near the bottom.

~~~
baddox
Would you really consider facebook to be a "web app"? I think facebook is, or
at least should be a pretty normal web page.

~~~
robryan
Far from it, amazing complexity in how certain parts of the page load and
certain page changes are coming through ajax. Very much a web app, probably as
complex as any web app out there.

~~~
baddox
There's no reason for so much complexity. I'm talking of course about the core
of facebook, not the third-party apps. Sure, asynchronous loads of certain
content is nice, but it hardly qualifies it as a "web app"? Which parts of
facebook are you referring to?

------
gruseom
This reads like self-parody. The author doesn't like the way their <P> tags
are positioned. Could there be a more ridiculous criticism?

From where I stand, Scribd's HTML5 implementation (if it turns out as good as
the demo) is one of the biggest leaps forward I've seen on the web in a while,
as well as one of the most impressive pivots I've seen a startup make to turn
a negative into a positive. It's a huge technical challenge to do what they're
doing and get it right; one of those things that seems hard to people who
haven't worked on it and turns out to be far harder than it seems.

There's also the minor detail that they're producing this output for an
enormous class of inputs, so to nitpick the generated HTML as if it were
written by hand is to miss the point colossally.

Edit: on another note, it's probably a positive sign if people with this
mentality don't like your work.

------
invisible
I think this dude needs to read the first part of Coders at Work with Jamie
Zawinski (in reference to Netscape). A version 1 system RUSHED to the public
that actually works and has few bugs is amazing. The fact that it's difficult
to read and hard to follow can be fixed just as bugs can in due time. I think
the concept of "get it done and fix it later" is better than "take forever to
work on it while someone else has already shipped." I guess that's an age-old question
we all ponder though.

~~~
petercooper
He seems to be more peeved that it was promoted with the term "HTML 5." He has
a point on that, if so. Do we want to see the term diluted to mean almost
anything with a valid DOCTYPE, as Web 2.0 and AJAX were? Half the things on
Delicious that are tagged AJAX are just basic JavaScript.

~~~
armandososa
I think the problem is having the specification named "HTML5". If it had been
named (as was originally intended) "Web Applications 1.0", all these
misunderstandings would go away.

------
jasonkester
When people start whinging about the HTML markup for your thing, you know
you're on to something. Early on, we got a guy complaining loudly about how
Twiddla sucked because we had put a DIV into the HEAD tag to display a loading
message. Not sure exactly how it affected his user experience, but there you
go.

If the best criticism that people can come up with is about a piece that
nobody can ever see or be affected by, you're probably doing something right.

------
timdorr
And if it were proper, semantic HTML5, what problem would that solve? I'm
failing to see what benefit that has to the user. They don't need to see the
markup; it just has to work.

------
gfodor
A good rule of thumb: if you think a product is absolutely great until you hit
"view source", you're probably missing the point.

------
Legion
If the author can't be bothered to even pick out a couple of snippets to
support his claim, I can't be bothered to pay his post any mind.

------
tlrobinson
"every major HTML5 _app_ today is built using unreadable _markup_"

Ignoring Scribd specifically for a moment, why do people insist that
_applications_ be built using technologies and best practices designed for
_documents_? These are two very different things. For many applications it
makes little sense to try to shoehorn the user interface into semantically
"correct" markup.

However it often absolutely does make sense to render the _documents_ created
with these applications in clean markup.

That said, Scribd is fundamentally an application centered around displaying
documents, so it would be nice if they were able to render documents in a clean
way. But given the nature of PDF, as the author alludes to, this is difficult
if not impossible, so you can't really fault Scribd for that. Despite this
(which is no worse and arguably much better than the Flash equivalent), there
are other benefits to using HTML/CSS/JavaScript.

------
jdietrich
The real lesson here? There are no awards for correctness. If the biggest and
best webapps use pig-ugly HTML, the rest of us can probably stop worrying.
Build it, launch it, bodge it.

------
ydant
Well, the best part about this article is that it got me to check out Scribd.
I'd been ignoring it, although I saw previous commentary about them moving
from Flash to plugin-free. I'm impressed with the HTML render - it looks
smooth. No such thing as bad publicity, I guess.

------
armandososa
It's like 2004 all over again. Back then, some people were complaining that
"links" on Gmail were not really links, but underlined blue paragraphs with
onclick events.

Remember that? This was the most advanced web application ever made, it had
100x the storage capacity of every mail provider at the time, awesome search
capabilities, and the first mainstream use of asynchronous loading of data.

And yet, these people were mad because the links were not real anchor elements.

Go get some perspective, kids.

------
petercooper
OK, this guy is ragging on what is a good tech demo, but "HTML 5" seems to be
the 2010 equivalent of saying "Web 2.0" or "AJAX" when it comes to getting
early adopters to check out your stuff, and Scribd threw around the term
without it being, hmm.. authentic.. (that is, HTML 5 as a concept means more
than merely producing a document that would pass validation)

------
jsmcgd
Don't give me problems, give me solutions. And by me I mean scribd, not me.

------
gte910h
Is this article a joke? Scribd's new HTML based PDF renderer is working fine.

I think the author missed something.

------
lenni
He's kinda right, a file format meant for printing fixed size pages will
inevitably suffer from some impedance mismatch when automatically converting
it to a relatively simple document format meant for screens.

But I disagree with the claim that it isn't HTML. It sure isn't pretty, but
if you define HTML as the input that the HTML parser of today's browsers can
handle (rather than the HTML spec) it is perfectly fine.

