
We were frustrated by the experience of reading machine learning papers on screens (particularly phones/tablets). There are lots of good tools for authoring HTML papers (Distill, Authorea, etc) but nothing that deals with the vast number of PDF papers that already exist.

So, we built Arxiv Vanity: a site that renders Arxiv papers as web pages. It’s still pretty janky, but for the papers that do render correctly, the experience is so much better than reading a PDF. For example:




The source for the LaTeX to HTML renderer is on GitHub[0]. It’s built on Pandoc[1] and Distill.pub’s template[2].

[0] https://github.com/arxiv-vanity/engrafo

[1] https://pandoc.org

[2] https://github.com/distillpub/template

One of the things that I came across when writing my own janky PDF/LaTeX->HTML converter for lecture notes[0] is that Pandoc doesn't handle references and subfigures correctly, even with pandoc-crossref and pandoc-citeproc enabled. I had to write a little Python module[1] that used regexes to extract those and then handle them separately on my own... This is definitely something you should look at.

[0] https://dmaitre.phyip3.dur.ac.uk/NPP/notes/

[1] https://github.com/JBorrow/latex-pandoc-preprocessor
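
To illustrate the regex approach described above, here is a minimal Python sketch. It is not the linked module's actual code: the patterns, the function names, and the single shared counter are all assumptions made for the example, and real LaTeX keeps separate counters for figures, equations, and sections.

```python
import re

# Hypothetical patterns for this sketch: match \label{key} and \ref{key}.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\ref\{([^}]*)\}")


def number_labels(source):
    """Assign 1, 2, 3, ... to each distinct label key, in order of first appearance."""
    numbers = {}
    for key in LABEL_RE.findall(source):
        # setdefault leaves keys that were already numbered untouched.
        numbers.setdefault(key, len(numbers) + 1)
    return numbers


def resolve_refs(source):
    """Replace each reference with its label's number and drop the label commands."""
    numbers = number_labels(source)

    def substitute(match):
        # Unknown keys become "??", mirroring LaTeX's unresolved-reference marker.
        return str(numbers.get(match.group(1), "??"))

    without_labels = LABEL_RE.sub("", source)
    return REF_RE.sub(substitute, without_labels)


sample = r"See Figure \ref{fig:b}. \label{fig:a} \label{fig:b} Also \ref{fig:missing}."
print(resolve_refs(sample))
```

Running a pass like this before handing the text to Pandoc means the cross-references are already plain text, so Pandoc never has to resolve them itself.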

I was looking for a way to turn my (soon-to-be-defended) PhD thesis into an epub, and investigated the various LaTeX2Html converters. I was pretty disappointed when I realized that all of them are terrible and have no hope of handling my manuscript. My current solution is to create a rendering of my thesis in A5 format. :/

This looks quite a bit better, so here is the question: what do you not support at the moment?

Have you seen pandoc? That should be able to do that and the comments about results are usually positive.


One final thing, and wildly off topic, is that when you do your defense, remember that you probably know more about the specifics of the subject than anyone else in the room. Many folks stress over it, but you're almost certainly going to be the actual expert in the room. Good luck!

A lot of things. LaTeX and its packages have so much surface area. Our approach so far is to just make the papers that we read readable. That probably covers the 20% of LaTeX features that 80% of people use.

Here is the broken stuff we are keeping track of: https://github.com/arxiv-vanity/engrafo/issues (feel free to add to it!)

Is there a reason for relying on pandocfilters instead of on Panflute [1]?

I would think that panflute would allow for more readable code, which helps when dealing with all the corner cases and rough edges of LaTeX.

[1] https://github.com/sergiocorreia/panflute

Because we didn't know that existed! That looks so much better, thank you. The pandocfilters library is really hard to use.


I'm not sure I understand. Well, I understand what you're doing but I'm not sure why you'd dislike PDF.

PDF has the great benefit of rendering the same on every system. With very few exceptions, PDF will look exactly the same on every system and will print the same on every system.

HTML doesn't really have that same benefit.

Don't get me wrong, I think your service is a great idea for those who would like HTML formatted results, but I'm not understanding the complaint about PDF.

Could you expand on why you don't like PDF?

PDF pages are usually based on A4 size, which is 210mm wide. Even at full size the writing is often tiny. Once rendered on a 10cm wide screen (landscape) it's pretty darn hard to read.

Also, in general the mobile PDF reading experience sucks.

For example, on Android you have to download the file (rather than browse to it) and then hunt it down to open it.

The PDF readers I've used will easily scroll you to a random page by accident if you touch the screen in the wrong place. Kindle's probably the best, but then you have to email yourself the PDF, which is a hassle.

IMO, reading two-column papers on an iPhone (through PDF) is a real pain -- the format relies on you using your eyes to jump from bottom left to top right, rather than having to scroll from the very bottom to the very top (diagonally). The same problem exists even for single-column styles -- you need to zoom in so much that you have to scroll horizontally as well as vertically.

The need to scroll doesn't exist on a large screen or on a piece of A4, but on smaller devices like mobile phones or even tablets, it's annoying. Having a responsive page means you can scroll vertically as you read, rather than having to make big jumps (or constant horizontal scrolls) that can really break the flow.

I wonder if that's a personal thing? Over the past year, I've been trying to join the mobile revolution - sort of. The majority of my browsing is now done on a tablet.

I read quite a few PDFs and don't actually have any complaints. I am not personally seeing any readability issues and don't mind consuming PDFs at all.

That said, I think I now understand your complaint. Thanks! I just don't personally have any trouble with it. I use multiple tablets, of varied sizes, and I've had good experiences with all of the devices. While some PDFs are horribly formatted, I find that the device choice doesn't help that and it's a design choice from the author.

But, again, thanks for helping me understand.

> The majority of my browsing is now done on a tablet.

Reading PDFs on a tablet isn't too bad because of large screen real estate.

Reading PDFs on a small mobile phone requires me to zoom in to make the font big enough for me to read, and then I have to scroll right to read, and left and down to move to a new section of the column.

Try reading a PDF on a smaller device than a tablet. I'm sure you'll be able to see what we mean.

Two column papers are the worst format. They emphasize compact printability in a world where no one buys proceedings.

I consume almost all my media on my phone. The problem with PDF is precisely that it renders the same on every screen - this makes most PDFs virtually unusable on the phone, as you have to scroll down one column in a page, then up for the second column, etc.

Not the OP, but PDF is bad on tablets and horrible on phones.

What size tablets are people talking about? I find the iPad size practically ideal for consuming PDF papers.

I think part of the issue is that PDF isn't responsive to the size of the device. A PDF is not much more than an image from the perspective of layout. I'd love to be able to reflow text from a PDF such that a single column fills my screen edge-to-edge and scrolling allows me to advance through the paper, as opposed to requiring me to reposition the viewport every time I reach the end of a column. I know this isn't the purpose of PDFs, and I love them in different contexts where layout (including typography) does matter to me. But I also really want to be able to easily consume papers in a way that isn't constrained by the PDF layout.

Yes, I want my cake and a pony. Cakepony.

Fwiw, this conversation greatly varies depending on who is doing the reading -- the rather banal fact is that the average 25-year-old student has much different ability to screen-read than the average 60-year-old professor (or a 60-year-old student, for that matter) :)

So no need to search tablet specs for the culprit. PEBCAK :)

The screen of a tablet is large enough to display a PDF. But PDFs are split into pages. That's perfect with paper, where we flip pages. It's very unnatural on screens, especially touchscreens, where we use vertical scrolling to move around.

Then there are minor issues of margins, possibly zooming to make text readable, etc.

That's why PDFs are so bad on mobile. The ideal format is one column text, figures and tables between paragraphs of that column, no page breaks, bidirectional links to notes. That's HTML, I guess.

Yeah fair point, I use an iPad mini. But I have heard similar complaints from older folks (40+) who have full-size iPads. I think much of it stems from dual-column printing, which is just kind of antiquated/annoying on digital.

Yup. I agree. I even find the experience fantastic on the original iPad, as well as a brand new one. I find it just fine on my phone, which isn't nearly as large a screen.

I am guessing it is an individual taste thing. That makes some sense.

> HTML doesn't really have that same benefit.

Yes, it does.

Not really. Look at the myriad rendering issues between the most popular browsers. PDF should result in pixel-for-pixel reproducibility; browsers don't do that in practice.

That's why we still test pages in different browsers and end up using browser specific code to ensure proper rendering - which often only reaches the 'close enough' format.

Not sure if pixel-perfect matters for consuming papers.

No way. We're testing in multiple browsers only because of JavaScript.

> particularly phones/tablets

I understand the problem with a phone, but PDFs on an iPad/tablet are beautiful and a joy to read. Much better to read the text as originally typeset than to put it through a process such as this, which risks corrupting minor but important details in the mathematical content.

On my phone I put it in landscape mode and that allows me to read a PDF OK, but I don't really get why one would read academic papers on a phone, why not use a tablet?

However I'm very interested in engrafo. It sounds like it will allow me to automatically publish blog style content from my LaTeX sources without having to fork the LaTeX content into a markdown / HTML version.

I just don't understand why you don't like reading academic papers as PDFs on tablets!

This is cool. It would be nice to have a Chrome extension to take me directly to this from the page/PDF.

I actually made one quickly, published here: https://chrome.google.com/webstore/detail/arxiv-vanity-plugi... It injects the Arxiv Vanity link on abstract pages, and if you click the button while viewing an online Arxiv PDF, it opens the corresponding Arxiv Vanity link.
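
The URL mapping such an extension performs can be sketched roughly as follows. This is an illustration in Python, not the extension's code: the function name is hypothetical, the base URL `https://www.arxiv-vanity.com/papers/` is an assumption, and only new-style Arxiv identifiers such as `1708.03312` are handled.

```python
import re

# Assumed URL shapes: arxiv.org/abs/<id> and arxiv.org/pdf/<id>[.pdf], where <id>
# is a new-style identifier with an optional version suffix such as v2.
ARXIV_URL_RE = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?(?:\.pdf)?/?$"
)

# Assumed base URL for rendered papers; not stated anywhere in this thread.
VANITY_BASE = "https://www.arxiv-vanity.com/papers/"


def vanity_url(arxiv_url):
    """Return the rendered-page URL for an Arxiv abstract or PDF URL, or None."""
    match = ARXIV_URL_RE.match(arxiv_url.strip())
    if match is None:
        return None
    # Keep the version suffix if the original URL carried one.
    paper_id = match.group(1) + (match.group(2) or "")
    return VANITY_BASE + paper_id + "/"


print(vanity_url("https://arxiv.org/abs/1708.03312"))
print(vanity_url("https://arxiv.org/pdf/1708.03312v2.pdf"))
```

Returning None for anything that doesn't look like an Arxiv paper URL lets the caller simply do nothing on unrelated pages.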

Yes! This is a great idea, and something we have been thinking of. https://github.com/arxiv-vanity/arxiv-vanity/issues/67

It would be amazing if we could browse "Latest" by category, and for a certain day, much like: https://arxiv.org/list/math.NT/recent

It's very nice. You should expand to cover bioRxiv (biology) too.

bioRxiv doesn't expose LaTeX files; they explicitly only use PDFs to make things easier. Which means you're going to need to reflow PDFs (a la https://docushow.com/), and I would guess there are a lot more edge cases there.

Thanks for linking to https://docushow.com. It's also a work in progress, but PDF reflow is a hard problem, so you never ship if you want to solve all cases :)

Your solution using the LaTeX source generates really nice HTML, congrats!

Lovely idea, and I can't wait until it gains super deep-learning smarts and gets everything perfect :-)

For now, it's hard to read: https://docushow.com/viewdoc?url=https%3A%2F%2Farxiv.org%2Fp...

Here is that first article and how it looks when imported into Authorea in one click: https://www.authorea.com/users/3/articles/208068-automatic-e... (just a couple of labels and SI units do not render). Note: it is forkable and can be commented upon.

In all three cases I find the original PDFs more pleasant to read. HTML typography is not up to snuff. I read them on a laptop, however, and I can see that this would be useful if one is forced to read on a phone.

(One thing that is very ugly in the PDFs, and most scholarly papers, is the use of different-colored boxes for hyperlinks. Authors, please consider putting


in your LaTeX preambles.)

For me the PDF was fuzzier https://imgur.com/a/WJ5y3 and the HTML version was more convenient to read in a single column. The two-column format is nice if I'm skimming to see if a paper is going to be interesting, but when I sit down to read it the HTML version definitely wins.

Just tried your suggestion: it ends up looking much uglier with font colors imo.

The default saturated colors are a bit garish. But you can set them to be anything you want. See the hyperref documentation.
