How Scribd's HTML5 conversion works ... even when it shouldn't (scribd.com)
123 points by matthiaskramm on Aug 26, 2010 | 19 comments

My respect for how hard it is to convert from .pdf to HTML just went waaaay up. Scribd must have really thought displaying in HTML5 was worth a lot of trouble if they went to all this effort to be precise!

Whilst technically impressive, it reminds me of the 'NASA spent millions developing a pen that writes in space while the Russians used a pencil' story.

Google's competing PDF reader just renders PDFs into images that also have selectable text.

As for the NASA story, you might find this link interesting: http://www.snopes.com/business/genius/spacepen.asp

Their entire business model involves displaying PDF content on the internet.

The fact they have to jump through hoops to achieve that, while potentially impressive, is a function of their entire RFB (reason for being... I just coined the term).

"Raison d'être" actually. Damn you France ;]

I don't have any particular use for Scribd but I love them anyway.

Any company that can kick Adobe in the nuts by publicly reconfiguring their business to shun Flash is all right with me. And what technical prowess these guys have. Incredibly smart people there.

I've put a few PDFs of presentation slides on Scribd, and then used their widget to embed those slides directly into a blog post talking about the presentation. Linking to a PDF on another page (and asking many people to a) open Acrobat and b) download a multi-MB file from my host) and/or hosting that PDF yourself is sub-optimal. Let Scribd host it and pay for the bandwidth (you can provide a link to download the actual content you uploaded, rather than their web widget). I never saw the use for it until I started giving talks and wanting to embed them. It's YouTube for PPT :)

Weirdness. They have a section titled "Detecting the font family", where they're supposedly comparing Trebuchet and Courier, but in the example they show Trebuchet and Myriad.

Idle thought - would it be possible using layers to do selectable text in an image - top layer is 100% transparent text, 2nd layer is the image. The font face is preserved (though it still won't wrap), but the text is still selectable for copy and paste.

Browsers are "smart": they won't let you select text that's 100% transparent. 99% transparency works... but it also looks kind of weird if the text and the image don't overlap perfectly.

You can roll your own text selection in JavaScript (on top of the bitmap) if you know the glyph positions, though. That's what Google Books does, for example. It's a valid option if you don't care about zoomability.
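A minimal sketch of the glyph-position idea, assuming the server ships one record per glyph with its bounding box in image pixels. The record shape `{ ch, x, y, w, h }` and the function names are hypothetical, not taken from Google Books or Scribd; the sketch only shows how a drag rectangle could be mapped back to text.

```javascript
// Hypothetical glyph record: { ch, x, y, w, h } in image pixel coordinates.

// True when two axis-aligned boxes overlap (touching edges do not count).
function intersects(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

// Return the text covered by a drag rectangle, in reading order.
// Assumes glyphs on the same line share the same y value.
function selectText(glyphs, rect) {
  return glyphs
    .filter((g) => intersects(g, rect))
    // Sort top-to-bottom, then left-to-right.
    .sort((a, b) => (a.y - b.y) || (a.x - b.x))
    .map((g) => g.ch)
    .join("");
}

const glyphs = [
  { ch: "H", x: 0, y: 0, w: 10, h: 12 },
  { ch: "i", x: 10, y: 0, w: 4, h: 12 },
  { ch: "!", x: 14, y: 0, w: 4, h: 12 },
];

// A drag covering x 0..12 touches "H" and "i" but stops short of "!".
const picked = selectText(glyphs, { x: 0, y: 0, w: 12, h: 12 }); // "Hi"
```

A real viewer would still have to draw a highlight over the selected boxes and hand the string to the clipboard. It would also have to rescale the boxes whenever the bitmap is zoomed, which is presumably where the zoomability caveat comes from.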

Are you sure? I just made a quick test and it works ok in FF, Opera, Chrome, Safari, IE9 (Windows 7, Ubuntu 9.10):

  <body style="background:url(image.png)">
   <span style="color:rgba(0,0,0,0)">Hello world</span>
  </body>

That's almost what the Google Docs PDF viewer does. Each page is a JPG. It is served to the user along with a catalogue of the text layout, and JavaScript is then used to simulate selection and copying.

Yep, you can do just that. Browsers that support rgba for text color make it quite easy and in older browsers you can do onselect tricks and the like to make it work.
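For a full page rather than a single span, the layered approach could be generated from a list of text runs. This is only a sketch under assumptions: the run shape `{ text, x, y, size }`, the page shape `{ url, width, height }`, and both function names are made up for illustration, and the image URL is assumed to be trusted.

```javascript
// Escape text so it is safe inside HTML element content.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build markup for one page: the page image as a background, with each
// text run absolutely positioned on top in a fully transparent color.
// Hypothetical inputs: page = { url, width, height }, run = { text, x, y, size },
// all measured in CSS pixels. page.url is assumed to be a trusted value.
function overlayHtml(page, runs) {
  const spans = runs.map((r) =>
    `<span style="position:absolute;left:${r.x}px;top:${r.y}px;` +
    `font-size:${r.size}px;color:rgba(0,0,0,0)">${escapeHtml(r.text)}</span>`
  );
  return `<div style="position:relative;width:${page.width}px;` +
         `height:${page.height}px;background:url(${page.url})">` +
         spans.join("") + `</div>`;
}

const pageHtml = overlayHtml(
  { url: "image.png", width: 600, height: 800 },
  [{ text: "Hello world", x: 40, y: 60, size: 16 }]
);
```

The transparent text will only line up with the picture if the browser's font metrics roughly match the rendered image, so the selection highlight may drift on long lines.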

I don't find that viewing the mangled PDF on screen is worth the time saved by not downloading the actual PDF. With video and audio I can understand the benefit of in-browser viewing, but why do we need this service for PDFs anyway?

It's horribly confusing to someone who doesn't understand that it's not being rendered by the browser itself. PDFs (especially when rendered inside the browser) break the browsing experience.

Your browser controls stop working as expected, history gets bent, and links don't work as expected. All of a sudden you're working within an Adobe Reader or Foxit Reader application (albeit embedded in the browser) without even realizing you're in an entirely different context outside of an HTML page.

Links and history work as expected (at least in the browsers I know with native PDF support) and I’m not sure why it is bad that some users might not understand that a PDF document is not an HTML page.

I also don’t know how Scribd helps users understand that better. Seems horribly confusing to me if you don’t follow them closely. (Wait what? The PDF is suddenly a webpage? But sometimes Flash? I can still download the PDF? Why doesn’t it look exactly like the webpage? What’s going on?) It’s perfectly usable, even without a deep grasp of the concepts, but so are PDF viewers inside browsers.

The problem historically with plug-in PDF readers was that the Adobe one was very slow. That seems to be fixed now - though I can't tell if the software is better or I just use faster hardware.

Especially since downloading is no extra step if your browser can display PDFs natively. (Safari can, Chrome will soon and I’m pretty sure there are plugins for other browsers.)

I’m using Safari and to me viewing PDFs is the same as viewing webpages. Heck, it’s even a bit more comfortable. (There is one big button with which I can open the PDF in Preview which allows me to rearrange/delete pages and to annotate and there is another big button which allows me to save the PDF for later reading.)

What Scribd does certainly is impressive, I’m just not sure how useful it is.

