PDFy – Instant PDF Host (pdf.yt)
244 points by ridgewell on July 15, 2014 | 98 comments



Hello,

I'm one of the founders of Scribd. You might be surprised to hear this, but I applaud the PDFy service and am glad someone has built it.

Scribd is not designed to be a simple, lightweight way to host a PDF file. Yes, this was the original idea of Scribd 8 years ago, but we've long since left that path. We see that market as having been made irrelevant by a combination of Google Docs, Dropbox/Box.net/etc., and better PDF readers now built into browsers like Chrome. There may be room for someone to build an imgur like service for PDFs too, but that's not what we're doing.

Scribd is really good for two things:

1) Scribd is a subscription reading service ("Netflix for books") where you can read over 400,000 professionally published books for $9 / month, including thousands of new releases and best-sellers. It doesn't include many programming books unfortunately (yet!), but if you like to read other things, it's a good deal.

2) Scribd is good for serious authors and publishers who want to publish a lot of content and organize it well. For example, the World Bank uploads thousands of research reports to Scribd and organizes them into collections. And many serious authors publish books and other writings with us.

We're sorry that we haven't done a good enough job of explaining who we are as it's changed over the years. And we're sorry if you've been frustrated trying to use Scribd for something it's not particularly good at.

To joepie91_ - I think it's cool that you've started this. We have some experience building document hosting services, and I can see you are already encountering some issues we've worked on, like DMCA and copyright. If you'd like to talk, we'd be more than happy to help you out.


Wow, I never knew.

So far I've seen Scribd as a very annoying PDF host, and most often I decide not to read the content at all when given a Scribd link ("just send me the pdf, damn it!").


Same here. But then again we're not the clients. We're just 'using' the technology Scribd provides their clients, I suppose.


This is good to know about Scribd.

Up until now my encounters with Scribd have generally followed this scenario: I'm reading a news article[1] and notice a link to some of the source documents for the article. I click on the link and then am sent to a scribd page that displays the document along with nice little download buttons that purport to let me download a copy of the document[2]. Of course clicking any download button gives me a modal telling me I either need to "Login with Facebook" or create a scribd account. Back buttons are pressed, tabs are closed.

In the example links above these documents are not books, they are not part of a curated collection put up by the "serious author" or publisher of the document. It looks much more like a publisher looking for an easily linkable or embeddable document viewer was snookered into believing that by uploading the document to scribd it would be easily accessible and "available" to the world.

There are countless scribd accounts where I imagine the author really intended their upload to be freely available, not used as bait for scribd to suck people into account creation. For example, the U.S. Naval Research Laboratory[3], various government officials[4] or agencies[5], and in fact entire scribd categories seem to be documents which are neither authored by the uploader nor copyrighted at all, such as public court filings[6].

I think your "Netflix for books" is a great idea, and might even be something I would go for, except for the really bad taste the above interactions leave in my mouth. These documents don't fit into the two categories you say scribd is really good for and you mention you are sorry you haven't done a good enough job explaining who you are as it has changed over the years. A great place and way to explain this would be right next to a download button for this type of content that doesn't require someone to "Login with Facebook" or create an account. Instead of getting the feeling I've gotten suckered by clicking on the link, I might think it is great you are hosting and making available this type of content and I should explore some more about this "Netflix for books" thing you are talking about.

[1] For example: http://arstechnica.com/tech-policy/2014/05/in-18-months-feds...

[2] e.g.: http://www.scribd.com/doc/204954147/Lolli-v-BF-Labs-Journal-...

[3] http://www.scribd.com/USNRL

[4] http://www.scribd.com/SenatorMarkUdall

[5] http://www.scribd.com/stlouisfed

[6] http://www.scribd.com/browse/BusinessLaw/Court-Filings


Scribd was synonymous with copyright violation for years. Nice to see you're finally making an honest business out of it.


...and now pdf.yt has stepped in to take the reins!


So was YouTube, to be fair.


>"I got sick of documents getting locked up behind login walls of services like Scribd."

Now that's a fantastic example of building a product around a real problem. If I even see domains like Scribd anymore I won't even give it a click. No, I don't want to sign up. I'd rather just do site:domain"thedoc.pdf" or some other way.

I hope your product takes off and everyone uses it!


"...much like Imgur does for images. PDFy is free, ad-free, and non-commercial."

They must be using a different imgur than I use, because it's definitely commercial and has ads. I can't see pdfy surviving on donations indefinitely.


I probably should've worded that better, but that text will be changed/moved/removed in the near future anyway. The Imgur comparison really only refers to the upload/sharing process, not the non-commercial bit.

As for running off donations, I've addressed that here: https://news.ycombinator.com/item?id=8034529


imo the worst part is that imgur doesn't even permit direct linking in many cases: they will 3xx-force redirect direct links to their ad filled pages


yikes I did not know that. I generally link directly to the imgur image itself and bypass the image page. will be on the lookout for this.


Linking to the direct imgur image itself will trigger 3xx to the image page if a few conditions are met - referrer (is popular site), enough hits to the image (popular images trigger 3xxs), some weird cookies.

It's incredibly misleading and awful. Also, try uploading an image - you might notice that the "direct link to image" textbox on the right bar no longer exists, and hasn't existed in a very long time.

Check this out in a terminal:

  $ curl -I -H 'Referer: http://twitter.com/' 'http://i.imgur.com/ZKfUroW.png'


Imgur does a lot of weird things like that, having grown from being just for reddit into a huge entity of its own. Maybe they're using ads to help pay for traffic from sources that could afford their own image hosting anyway, like Twitter.


I just uploaded a random pic, and I definitely see the "direct link to image" box.


Are you logged in? http://i.imgur.com/7k3si8M.png


Yes, I am logged in.


It's been running since May and is hosted on rather cheap servers (Ramnode) which limits the amount of funds that actually need to be expended to run such a service like this. At the moment, the project is rather small and doesn't require expensive dedicated servers.


This looks great, but I hope you have a solid system in place to deal with the huge number of DMCA requests you'll get. Free PDF hosting services (free file hosting services) end up being a target of pirates, but not just that, of automated systems that index things specific to pirated text in order to get clicks.

If you don't get way ahead of these kinds of users, you'll end up with an untenable drain on your resources that will make it easier to shut down than to sustain.


I'm just going to wait and see how things go. I generally improvise; there's not really much prior work (let alone documentation) on the way I run projects, so it's mostly just a matter of figuring stuff out as I go along. One thing that's certain is that I have no intention of shutting down the service. I've run stuff that attracted more abuse :)

I have no intention of acting as a 'shield' against DMCA requests for this particular service (if they're valid, they'll be followed up on, as described on the TOS page), so ideally there shouldn't really be a problem. We'll see how it goes.


Yep, all I am suggesting is that you be prepared to deal with things in as automated a way as possible, like blacklisting naughty IP ranges, automating DMCA take-downs, fingerprinting known bad content and dropping it into a black hole etc...
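
A minimal sketch of what that kind of automation could look like, assuming a Python service (PDFy itself is PHP, so this is illustration only). The blocklist contents and the function names are hypothetical, and an exact SHA-256 match only catches byte-identical re-uploads, not re-encoded copies.

```python
import hashlib
import ipaddress

# Hypothetical blocklists; real entries would come from abuse reports
# and processed takedown notices.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_HASHES = set()


def fingerprint(data: bytes) -> str:
    """Return a hex SHA-256 digest used as the content fingerprint."""
    return hashlib.sha256(data).hexdigest()


def is_blocked_ip(addr: str) -> bool:
    """True if the address falls inside any blocked range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)


def record_takedown(data: bytes) -> None:
    """Remember a removed file so identical re-uploads are dropped."""
    BLOCKED_HASHES.add(fingerprint(data))


def should_reject(addr: str, data: bytes) -> bool:
    """Reject uploads from blocked ranges or with a known-bad fingerprint."""
    return is_blocked_ip(addr) or fingerprint(data) in BLOCKED_HASHES
```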


Great idea. Just one little tip: I would not use the combination 'Instant and PDF' in your communication. Instant PDF is a well-known product by Enfocus. It's basically a check-app for print-optimized PDF files. As such it's a worldwide standard designers are forced to use. Create a PDF, run it through Instant PDF, and if approved it will attach a flightcheck report to said PDF so that newspapers, magazines, and printers can process the file. The flightcheck searches for common mistakes (non-embedded fonts, RGB colors, low-res photos, etc.). Recently the Instant PDF name got absorbed by Connect, but the Instant PDF brand is still quite powerful.


I'm skeptical about the longevity of a site that operates entirely on donations, but seeing that you offer the source code for free (and it runs on PHP, which is arguably pretty well-supported), and your license is reasonable (if crass, but who cares?)...

This is really great. Thanks for posting this.


Hi, PDFy owner here. I've been running a number of services for 3 years now, without running into financial problems (see http://cryto.net/, http://cryto.net/~joepie91, http://cryto.net/~joepie91/projectlist). My biggest issue has actually been lack of time, rather than money :)

I've gotten pretty good over the years at running stuff on a shoestring budget (my current hosting expenses are around 100 euro a month for everything together at a large number of hosts, and I have plenty of resources to spare), and as far as I can predict PDFy won't be running into any issues any time soon.

I should add that it definitely helps to custom-develop everything - generic solutions tend to come with a large amount of (resource) overhead, which make it harder to run it on a small budget. By doing just about everything custom, the overhead is minimal. Traffic is dirt cheap nowadays if you know where to look, so that's not really a concern anymore either.


https://pdf.yt/tos -- You might want to get a lawyer to check if there are any pitfalls when using tongue-in-cheek terms of service and whether you wouldn't need to require people to accept ToS in a more nagging way.


hey -- I just loaded the page and I'm guessing 1/2 of the docs on the front page were copyrighted and uploaded w/o perms. I could be wrong.

In either case, you probably want to get very acquainted w/ the dmca and register an agent [1]; hopefully you know all about this but it's worth running requirements to stay w/in the dmca past a lawyer

[1] https://www.techdirt.com/articles/20101028/15533611640/damn-...


Similar content can be found on Scribd, with the additional property of Scribd making money on said content.


What do you mean by "Traffic is dirt cheap if you know where to look"

Are you talking about where to go find traffic for your website? Do you mind sharing?


As MitchellRobert already pointed out, I'm referring to traffic in the sense of data traffic (commonly called bandwidth). A very common remark I've gotten is "but what about the bandwidth usage?!", but nowadays it's not hard to get a few terabytes of monthly traffic allowance for under $10.

For this particular VPS, I'm paying $9.30 a month, and it includes 2TB of monthly traffic. Cheaper offers exist, and there are always providers like OVH that genuinely offer unmetered traffic on the cheap (as long as it's used for a legitimate purpose, e.g. serving hosted files).
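
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 1000 GB), and the 5 MB average PDF size is a made-up figure for illustration, not something measured on PDFy.

```python
def cost_per_gb(monthly_price_usd: float, allowance_tb: float) -> float:
    """Price per gigabyte of included traffic, assuming 1 TB = 1000 GB."""
    return monthly_price_usd / (allowance_tb * 1000)


def downloads_per_month(allowance_tb: float, avg_file_mb: float) -> int:
    """How many downloads fit in the allowance, assuming 1 TB = 1,000,000 MB."""
    return int(allowance_tb * 1_000_000 // avg_file_mb)


# Figures from the comment above: $9.30 a month with 2 TB of traffic.
print(cost_per_gb(9.30, 2))        # a little under half a cent per GB
print(downloads_per_month(2, 5))   # 5 MB average file size is an assumption
```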


He's literally referring to bandwidth. Transit to get data from A (PDFy) to B (the visitor)

:-)


Having licensed a small project or two under the WTFPL, it is delightfully permissive: https://en.wikipedia.org/wiki/WTFPL


Wow, love this so much more than Scribd. Already clicked on the latest uploads and found something cool, and it just worked and didn't require login. Amazing.


This is really great. Scribd can suck it. Also props for mirroring to the Internet archive.


This really needs NSFW tagging. Right now, the "latest public documents" is full of hentai. Doesn't mean that people will actually do that, but at least giving uploaders an option wouldn't hurt.


Yeah, I just ran into the same problem ! Opened the site at work and had to close asap !


I love it. But instead of "document.pdf" as the file name on download, what if you changed it to "[pdf title].pdf"? Several downloads in a row get confusing.


It should give you the original filename upon download, unless your browser ignores/mis-parses the Content-disposition header: https://github.com/joepie91/pdfy/blob/master/public_html/mod...

What browser are you using?
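
For anyone unfamiliar with the header in question, a minimal sketch of how a server might build it so the browser saves the file under its original name. This is Python for illustration only and not PDFy's actual PHP code; `content_disposition` is a hypothetical helper. It emits both a plain `filename` fallback and the percent-encoded `filename*` form defined in RFC 6266.

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build a Content-Disposition value that preserves the upload's name."""
    # Drop control characters (CR, LF, ...) so the name cannot inject headers.
    cleaned = "".join(ch for ch in filename if ch.isprintable())
    # ASCII-only fallback for clients that ignore filename*;
    # non-ASCII characters become "?".
    fallback = cleaned.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")
    # Percent-encoded UTF-8 form for clients that support filename*.
    encoded = quote(cleaned, safe="")
    return "attachment; filename=\"{}\"; filename*=UTF-8''{}".format(
        fallback, encoded
    )
```

A browser that honours the header should then offer that name in the save dialog instead of a generic one such as document.pdf.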


Were you letting users download it via PDF.js before which is why it was document.pdf? It makes sense to download it via PDF.js since the file is actually already loaded once the user renders it.


Oh, that might actually be it. The 'download file' button in the pdf.js menu is a different button from the button at the bottom/right hand side of the page (with different code behind it - the pdf.js 'download' button comes stock with pdf.js). I haven't really tested its behaviour much, but it's quite possible that it's calling things document.pdf.

Perhaps I should just remove the pdf.js 'download' button, seeing as there's one elsewhere in the UI anyway.


Weird, it works now... Was in Chrome 35: http://imgur.com/RhRafJK

Anyhow, thanks for putting together such a great site!


Well, someone could scrape all these PDF links and index them in a search engine, then Scribd could suck it. Or will Google do it automatically?


The gallery is intentionally plain HTML; Google appears to be indexing all public documents on PDFy correctly (both viewer pages and actual PDF files). Unlisted documents get a noindex tag.

That said, I really need to write some code to extract the document metadata from the PDF files and display it on page; right now, the only thing that search engines (and the site itself) have to go off is the filename, which is far from optimal.


Ideally, metadata extraction would be done on upload and presented to the user for optional manual correction. This would be a major contribution to findability, because PDFs often have incorrect metadata (try searching for anything on archive.org) or the person uploading may have metadata relevant to a use case that is unforeseen by the document creator.

In fact, the same file could be uploaded multiple times with different metadata. There's room here for experimentation, e.g. publishing content hashes and linking "duplicates" that have different metadata.

If PDF metadata can be published in a structured format, it should then be possible for Calibre or Docear / Zotero to import the PDF + JSON metadata directly into the document database.
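
As a rough sketch of the content-hash idea, assuming SHA-256 as the hash and a simple in-memory store (the `MetadataIndex` class and the JSON layout are hypothetical, not an existing PDFy or Calibre format):

```python
import hashlib
import json


class MetadataIndex:
    """Groups metadata records by the SHA-256 hash of the file contents."""

    def __init__(self) -> None:
        self._records = {}

    def add(self, data: bytes, metadata: dict) -> str:
        """Attach a metadata record to the file's hash and return the hash."""
        digest = hashlib.sha256(data).hexdigest()
        records = self._records.setdefault(digest, [])
        if metadata not in records:  # skip exact duplicate records
            records.append(metadata)
        return digest

    def export_json(self, digest: str) -> str:
        """Serialize every metadata record known for one content hash."""
        return json.dumps(
            {"sha256": digest, "metadata": self._records.get(digest, [])},
            sort_keys=True,
        )
```

Two uploads of the same bytes with different titles would end up under one hash with two metadata records, which is the "linked duplicates" case described above.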


Linking hashes would free a lot of space, and letting each user place its own piece of metadata on each hash it wants would solve all the metadata and user trustworthiness problems.

Google would do the rest.


The problem is that I don't want to add any more roadblocks to the upload process - it should be as 'instant' as possible. Even the 'public' vs. 'unlisted' selection took quite some consideration.

Perhaps crowdsourcing metadata might be an idea - but that involves quite a bit more complexity, implementation-wise.


How about an "edit metadata" button linked to the submitter's session cookie, which is only active for a few minutes, similar to HN post editing? For those who don't care, upload process stays the same. Those who want to edit have the option, within a few mins.
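
A minimal sketch of that check, assuming the server stores the uploader's session token and upload time with each document. The five-minute window and the function name are placeholders, not anything PDFy implements.

```python
import hmac
import time

EDIT_WINDOW_SECONDS = 5 * 60  # assumed window; "a few minutes"


def can_edit(uploader_token: str, session_token: str,
             uploaded_at: float, now: float = None) -> bool:
    """Allow edits only from the uploading session, shortly after upload."""
    if now is None:
        now = time.time()
    # Constant-time comparison of the stored and presented session tokens.
    same_session = hmac.compare_digest(
        uploader_token.encode("utf-8"), session_token.encode("utf-8")
    )
    within_window = 0 <= now - uploaded_at <= EDIT_WINDOW_SECONDS
    return same_session and within_window
```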


Why not build a re-CAPTCHA type service around crowd sourcing PDF metadata. Read a PDF while you wait for something to happen.

Too idealistic?


Google has indexed PDF contents since 2001[1].

[1]: http://searchenginewatch.com/article/2067225/Google-Does-PDF...


@joepie91_: Just FYI, documents do not load until you accept the site's cookies. I don't know of a technical reason why this should be, off the top of my head, so if there isn't one, you may consider removing that restriction/requirement.


I suspect that might be caused by pdf.js. While PDFy itself will attempt to send you a cookie (even if you just try to download the PDF), it should still work even if the cookie is rejected, as it's not dependent on it for serving the file.

Can you tell me what the name of the offending cookie is?


Chromium reports that there are only two things being attempted by the site: a PHPSESSID cookie, and localStorage. I can't seem to make Chromium accept only one or the other, I can only make it "Allow" for the entire domain altogether, both cookie and localStorage.

I have vague recollections about Chromium and/or Firefox confounding localStorage with cookies when it comes to allowing or denying.


Works fine without accepting cookies in Opera 12. Needs JS and iframes enabled.


In my case, Chromium 35.0.1916.153 .


This is awesome! But I can't see it surviving on donations. Commercialize it please so that it survives. Tasteful ads aren't so bad like what Reddit does or Carbon[something]... I'm just concerned you won't survive on donations.


I absolutely don't need to commercialize it to keep it running. In fact, commercialization would come with an entire set of issues (and cost factors) of its own.

I have a solid track record of running non-commercial services :)



Could you please move the top toolbar to the side to create one side panel? It's taking up a lot of real estate on laptop screens.


On smaller screens, it should automatically move the entire sidebar to be a (relatively thin) footer bar instead. I should probably make the top bar shrink in height at that point as well. I suspect you might be just slightly above the cut-off for the small-screen layout.

EDIT: Here's a ticket: https://github.com/joepie91/pdfy/issues/13


What I'm describing is getting rid of the top bar and putting the logo in the side bar. Vertical screen real estate is precious.


Very nice initiative! Good luck with it!

I know this implies complexity, but I think it'd be nice if there was a comment feature for the PDFs, perhaps even something à la Soundcloud, i.e. per section comments.

I'd love to find things such as my washing machine's manual improved by user experience through comments etc. :-D


I've actually been considering this. I'm not yet sure how to implement it, though - and while I'd love to have positional comments (perhaps 'annotations' would be a better term?), it'd require some significant modifications to the pdf.js viewer.

Right now I'm quite swamped with stuff to do, but once I have some more free time I'll definitely be looking into this and some other enhancements that are currently sitting on the issue tracker.


Contact the author of this presentation, he wrote epub.js and could provide helpful advice.

http://www.w3.org/2014/04/annotation/slides/Hartnell.pdf

http://www.youtube.com/watch?v=Xtj4LYBzRiw

Related: http://www.w3.org/2014/04/annotation/


Annotating PDFs would be great. It's a feature we'd love to implement for ViewerJS, an easy PDF viewer in JS, too. http://viewerjs.org


Flick me an email hengjie (at) notablepdf.com, I think we could provide you with a viewer for free that has annotation features but also based on PDF.js so that you can easily switch it over.


We have that in Notable PDF ( http://notablepdf.com ). We don't require that you host the file with us either - Files are fingerprinted, and comments can be made attached to that fingerprint (private by default, but can be made public to anyone who has the file).

We're focusing on annotating documents within teams, schools, etc, but the public annotations to improve documents (à la rapgenius) is something we are thinking about.

(We actually have the ability to share files with a link too)


This is really sweet, but if I may, what's wrong with using existing solutions like Dropbox or Google Docs? Neither shows a paywall or login screen when accessing the shared links.

Unless you don't use either service, which is totally valid.


Dropbox and Google Docs rarely if ever show up in Google search results.

Will PDFy submit lists of content to search engines or be easily crawlable?


The gallery is plain HTML; all public documents are correctly (and quite rapidly, <1 day) indexed by Google. Unlisted documents get a noindex tag.
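
For illustration, a sketch of how the public/unlisted distinction could be expressed to crawlers. The meta tag covers the HTML viewer page; the `X-Robots-Tag` response header is the mechanism Google documents for non-HTML files such as the PDF itself. Whether PDFy sends that header is not stated here, and both helper names are hypothetical.

```python
def robots_meta_tag(is_public: bool) -> str:
    """HTML snippet for the viewer page; empty for public documents."""
    return "" if is_public else '<meta name="robots" content="noindex">'


def robots_headers(is_public: bool) -> dict:
    """Extra response headers for the raw PDF, which cannot carry a meta tag."""
    return {} if is_public else {"X-Robots-Tag": "noindex"}
```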


Both require accounts to upload files.

EDIT: And both are more involved than just dragging and dropping a PDF.


I also made NextPrev.it a little while ago with a friend. It gives you the option to host a PDF and control which page the viewer sees, perfect for presentations or looking through contracts etc.


I don't want to sound negative, but when I look at the "latest public documents", most of them seem to be copyright infringement.

But good luck anyway.


Great work @joepie91_! I wish your service the very best and hope it holds up against the DMCA brigade.


Lots of copyrighted material there already. O'Reilly, Prentice Hall, etc. Get ready for a DMCA deluge.


This looks like duplicated effort to me. Upload the pdf to Google Drive, set the permissions to "people with link can view", and share the link it gives you. Alternatively, there's a publish to web option - they're identical in this use case. I already do something similar on my portfolio page - I have a link that downloads a pdf version of my resume on Google Drive.


This comment is almost identical to one posted when Dropbox first premiered on HN. They were saying Dropbox was duplicated effort. Interface matters a lot.


That's a great point. The exact differences between the work flows:

1. Adds drag-and-drop to choose the file to upload

2. Cleaner permissions structure

3. The button is much bigger, and there's fewer options to do other things

4. Adds galleries to look at collections of PDF files

5. Exact same number (6) of mouse clicks to copy a link to a hosted pdf to clipboard

6. Removes log-in requirement (though to be fair, it's really easy to stay logged in to Google services)

It seems to me to be a much smaller difference than the UI difference Dropbox made - the automatic mirroring and context menu actions are huge.


> (though to be fair, it's really easy to stay logged in to Google services)

Not particularly easy for the three-in-four Internet users who don't have any form of Google account, though...


Yup. I would have never thought about using Google drive to share PDF publicly. Now that I know, still not sure I would. You're damn right, interface does matter.


A login might be useful. You know.. to check (or delete) what I've uploaded and see the stats later.


Whoa, opened the site to a bunch of obvious nsfw stuff. Maybe there should be an option to filter this?


Add comments on each pdf and it could become an important tool for (anonymously) reviewing papers


Wait, is this the joepie91 that used to hang out on #anonops?


I saw some sexy contents, will you delete them?


Delete them? Hah


Would be cool if we could edit PDFs.


Editing of PDFs really is outside the scope of PDFy. There are quite a few PDF editors available elsewhere already, all of which will likely do the job better than PDFy ever could.

Not to mention that editing of PDFs can be a somewhat painful experience :)


To build on that, here's a list of web PDF editors that are available:

1. NotablePDF (https://web.notablepdf.com)

2. PDFZen (http://pdfzen.com)

3. PDF Escape (https://www.pdfescape.com/)

4. PDF Buddy (http://pdfbuddy.com)


Could we please stop hating on Scribd? In addition to being rude, it's also mistaken: I've found PDFs on Scribd that simply aren't available anywhere else, because the original links on other sites became broken. This is somewhat common for academic papers, for example. Yes, I had to upload a PDF in order to download the PDF from Scribd, but that's a fair trade; I chose to upload an academic paper that perhaps might be useful to someone else.


When you have a horrible, user-hostile service, being the only place to find something only tends to make people hate you more.


Which is better: not being able to find something at all, or having to jump through a hoop to download it?

I thought HN was objective. Maybe that's changing.


> I thought HN was objective. Maybe that's changing.

No, I don't think so, I just think a lot of people here disagree with you in your judgement of Scribd.

Hopefully this service can have the utility of Scribd as you've described and keep its lovely usability.


The lack of response to my point that Scribd has content not found anywhere else is the most telling. Objectively, it seems like that point should win the debate. The disagreement is therefore subjective.

I wouldn't quite call this "tribalism," but perhaps it's one step short. In the old days of HN, a valid point wouldn't simply be dismissed without a response. The fact that no one is stepping up to point out why I'm mistaken is suggestive; if it continues to happen, then it's indicative of a trend. And over the last year I've seen this happen to others somewhat frequently.


> The disagreement is therefore subjective.

No, your reasoning is specious, you're assuming "Scribd has content nobody else has, therefore if there was no Scribd, there would be no content". This is simply broken logic.

Scribd is like the cup I used to drink water from this morning; if that particular cup hadn't been made, would I die of thirst?


(There's a maximum nesting limit for replies?)

Scribd having a lot of unique content isn't a feature; it's an observed state. There is no reason why Scribd would exclusively be able to offer that, and any other PDF host wouldn't. It's simply a consequence of a lot of people using a service.

Scribd having amassed so much data due to their sheer size is not a point in favour of Scribd; if anything, it makes the walled-garden approach of Scribd even more frustrating.


Are you sure? If an academic paper were to become lost from the internet because Internet Archive hadn't archived it and no other site mirrored it, then the world wouldn't be worse off as a result? That seems dubious.

And if it's true that the world is better off for Scribd having the paper, then it must be true that Scribd is beneficial to the world. The fact that people don't like it is irrelevant.

(I created a new account not to dodge downvotes, but because HN wouldn't let me continue submitting replies to this thread with my other one.)


But the alternative to "Scribd having the paper" is not necessarily "no other site having it". If someone uploaded it to Scribd, why wouldn't they upload it to some other site instead?

Your argument is equivalent to saying: the registrar of google.com is MarkMonitor, therefore if MarkMonitor didn't exist, we wouldn't have Google.


> The fact that people don't like it is irrelevant.

It's really not, though. Scribd is terrible in terms of its user experience, and frankly its interface is horrid in my opinion, but it has content that isn't found anywhere else (which is a plus). Those two are not mutually exclusive; you're drawing a false dichotomy. *shrugs*


> There's a maximum nesting limit for replies?

Usually when the "reply" link isn't shown, you can click on the comment's "link" link and reply from there.



