We were frustrated by the experience of reading machine learning papers on screens (particularly phones/tablets). There are lots of good tools for authoring HTML papers (Distill, Authorea, etc) but nothing that deals with the vast number of PDF papers that already exist.
One of the things that I came across when writing my own janky PDF/LaTeX->HTML converter for lecture notes[0] is that Pandoc doesn't handle references and subfigures correctly, even with pandoc-crossref and pandoc-citeproc enabled. I had to write a little Python module[1] that used regex to extract those and then handle them separately... This is definitely something you should look at.
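For anyone wondering what that kind of regex pass might look like, here is a minimal sketch. It is not the module mentioned above: the names `number_labels` and `resolve_refs` are made up for illustration, and it assumes a single running counter for every label, whereas real LaTeX numbers figures, equations, and sections separately.

```python
import re

# Matches \label{key} and captures the key.
LABEL = re.compile(r"\\label\{([^}]+)\}")
# Matches \ref{key}, \eqref{key}, and \autoref{key} and captures the key.
REF = re.compile(r"\\(?:ref|eqref|autoref)\{([^}]+)\}")


def number_labels(tex):
    """Map each label key to a sequential number, in order of first appearance.

    Assumption: one shared counter for all labels (a simplification).
    """
    numbers = {}
    for key in LABEL.findall(tex):
        numbers.setdefault(key, len(numbers) + 1)
    return numbers


def resolve_refs(tex):
    """Replace every reference with its label's number, or '??' if unknown."""
    numbers = number_labels(tex)

    def replace(match):
        return str(numbers.get(match.group(1), "??"))

    return REF.sub(replace, tex)
```

Running `resolve_refs` on the LaTeX source before handing it to Pandoc would leave plain numbers where the cross-references were, so the converter no longer has to resolve them itself.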
I was looking for a way to turn my (soon-to-be-defended) PhD thesis into an epub, and investigated the various LaTeX2Html converters.
I was pretty disappointed when I realized that all of them are terrible and have no hope of handling my manuscript. My current solution is to create a rendering of my thesis in A5 format. :/
This looks quite a bit better, so here is the question: what do you not support at the moment?
One final thing, and wildly off topic, is that when you do your defense, remember that you probably know more about the specifics of the subject than anyone else in the room. Many folks stress over it, but you're almost certainly going to be the actual expert in the room. Good luck!
A lot of things. LaTeX and its packages have so much surface area. Our approach so far is to just make the papers that we read readable. That probably covers the 20% of LaTeX features that 80% of people use.
I'm not sure I understand. Well, I understand what you're doing but I'm not sure why you'd dislike PDF.
PDF has the great benefit of rendering the same on every system. With very few exceptions, PDF will look exactly the same on every system and will print the same on every system.
HTML doesn't really have that same benefit.
Don't get me wrong, I think your service is a great idea for those who would like HTML formatted results, but I'm not understanding the complaint about PDF.
PDF pages are usually based on A4 size, which is 210mm wide. Even at full size the writing is often tiny. Once rendered on a 10cm wide screen (landscape) it's pretty darn hard to read.
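To put a rough number on that, here is a back-of-the-envelope sketch. The 210mm page and 10cm screen come from the comment above; the 10pt body font is an assumption, as is the helper name `apparent_font_size`.

```python
def apparent_font_size(font_pt, page_width_mm=210.0, screen_width_mm=100.0):
    """Font size as it appears when a full page is scaled to fit the screen width."""
    return font_pt * screen_width_mm / page_width_mm


# Assumed 10pt body text on an A4 page shown on a 100mm wide screen.
size = apparent_font_size(10.0)
print("{:.1f}pt".format(size))  # a bit under 5pt
```

So fitting the whole page width on the screen shrinks the text to a bit under half its printed size, which is why zooming and horizontal scrolling become necessary.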
Also, in general the mobile PDF reading experience sucks.
For example, on Android you have to download the file (rather than browse to it) and then hunt it down to open it.
The PDF readers I've used easily scroll you to a random page by accident if you touch the screen in the wrong place. Kindle's probably the best, but then you have to email yourself the PDF, which is a hassle.
Reading two-column papers on an iPhone (through PDF) is a real pain. IMO the format relies on you using your eyes to jump from bottom left to top right, rather than having to scroll diagonally from the very bottom to the very top. The same problem exists even for single-column styles: you need to zoom in so much that you have to scroll horizontally as well as vertically.
The need to scroll doesn't exist on a large screen or on a piece of A4, but on smaller devices like mobile phones or even tablets, it's annoying. Having a responsive page means you can scroll vertically as you read, rather than having to make big jumps (or constant horizontal scrolls) that can really break the flow.
I wonder if that's a personal thing? Over the past year, I've been trying to join the mobile revolution - sort of. The majority of my browsing is now done on a tablet.
I read quite a few PDFs and don't actually have any complaints. I am not personally seeing any readability issues and don't mind consuming PDFs at all.
That said, I think I now understand your complaint. Thanks! I just don't personally have any trouble with it. I use multiple tablets, of varied sizes, and I've had good experiences with all of the devices. While some PDFs are horribly formatted, I find that the device choice doesn't help that and it's a design choice from the author.
> The majority of my browsing is now done on a tablet.
Reading PDFs on a tablet isn't too bad because of large screen real estate.
Reading PDFs on a small mobile phone requires me to zoom in to make the font big enough for me to read, and then I have to scroll right to read, and left and down to move to a new section of the column.
Try reading a PDF on a smaller device than a tablet. I'm sure you'll be able to see what we mean.
I consume almost all my media on my phone. The problem with PDF is precisely that it renders the same on every screen. This makes most PDFs virtually unusable on a phone, as you have to scroll down one column in a page, then back up for the second column, etc.
I think part of the issue is that PDF isn't responsive to the size of the device. A PDF is not much more than an image from the perspective of layout. I'd love to be able to reflow text from a PDF such that a single column fills my screen edge-to-edge and scrolling allows me to advance through the paper, as opposed to requiring me to reposition the viewport every time I reach the end of a column. I know this isn't the purpose of PDFs, and I love them in different contexts where layout (including typography) does matter to me. But I also really want to be able to easily consume papers in a way that isn't constrained by the PDF layout.
Fwiw, this conversation greatly varies depending on who is doing the reading -- the rather banal fact is that the average 25-year-old student has much different ability to screen-read than the average 60-year-old professor (or a 60-year-old student, for that matter) :)
so no need to search tablet specs for the culprit. PEBCAK :)
The screen of a tablet is large enough to display a PDF. But PDFs are split into pages. That's perfect with paper, where we flip pages. It's very unnatural on screens, especially touchscreens, where we use vertical scrolling to move around.
Then there are minor issues of margins, possibly zooming to make text readable, etc.
That's why PDFs are so bad on mobile. The ideal format is one column text, figures and tables between paragraphs of that column, no page breaks, bidirectional links to notes. That's HTML, I guess.
Yeah fair point, I use an iPad mini. But I have heard similar complaints from older folks (40+) who have full-size iPads. I think much of it stems from dual-column printing, which is just kind of antiquated/annoying on digital.
Yup. I agree. I even find the experience fantastic on the original iPad, as well as a brand new one. I find it just fine on my phone, which isn't nearly as large a screen.
I am guessing it is an individual taste thing. That makes some sense.
Not really; look at the myriad rendering issues between the most popular browsers. PDF should result in pixel-for-pixel reproducibility; browsers don't do that in practice.
That's why we still test pages in different browsers and end up using browser specific code to ensure proper rendering - which often only reaches the 'close enough' format.
I understand the problem with a phone, but PDFs on an iPad/tablet are beautiful and a joy to read. Much better to read the text as originally typeset than to put it through a process such as this, which risks corrupting minor but important details in the mathematical content.
On my phone I put it in landscape mode and that allows me to read a PDF OK, but I don't really get why one would read academic papers on a phone, why not use a tablet?
However I'm very interested in engrafo. It sounds like it will allow me to automatically publish blog style content from my LaTeX sources without having to fork the LaTeX content into a markdown / HTML version.
I just don't understand why you don't like reading academic papers as PDFs on tablets!
I actually made one quickly, published here: https://chrome.google.com/webstore/detail/arxiv-vanity-plugi... . It injects the arxiv vanity link on abstract pages and if you click the button when viewing an online arxiv pdf it opens the respective arxiv vanity link.
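The URL mapping such a plugin needs is small. Here is a rough sketch of it, not the extension's actual code: `vanity_url` is a made-up name, the target pattern is inferred from the example links in the original post, and only new-style arXiv IDs (like 1705.04085v3) are handled.

```python
import re

# New-style arXiv ID (YYMM.NNNNN with optional version) in an abs or pdf URL.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5}(?:v\d+)?)")


def vanity_url(arxiv_url):
    """Return the Arxiv Vanity URL for an arXiv abstract or PDF URL, or None."""
    match = ARXIV_ID.search(arxiv_url)
    if match is None:
        return None
    return "https://www.arxiv-vanity.com/papers/{}/".format(match.group(1))


print(vanity_url("https://arxiv.org/abs/1705.04085v3"))
print(vanity_url("https://arxiv.org/pdf/1708.00884.pdf"))
```

Old-style identifiers with a subject prefix would need a second pattern, and the function returns None for any URL it does not recognise.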
bioRxiv doesn't expose LaTeX files; they explicitly only use PDFs to make things easier. Which means you're going to need to reflow PDFs (a la https://docushow.com/), and I would guess there are a lot more edge cases there.
Thanks for linking to https://docushow.com
Also a work in progress, but PDF reflow is a hard problem, so you'd never ship if you tried to solve all cases :)
Your solution using the LaTeX source generates really nice HTML, congrats!
In all three cases I find the original PDFs more pleasant to read. HTML typography is not up to snuff. I read them on a laptop, however, and I can see that this would be useful if one is forced to read on a phone.
(One thing that is very ugly in the PDFs, and most scholarly papers, is the use of different-colored boxes for hyperlinks. Authors, please consider putting
For me the PDF was fuzzier https://imgur.com/a/WJ5y3 and the HTML version was more convenient to read in a single column. The two-column format is nice if I'm skimming to see if a paper is going to be interesting, but when I sit down to read it the HTML version definitely wins.
So, we built Arxiv Vanity: a site that renders Arxiv papers as web pages. It’s still pretty janky, but for the papers that do render correctly, the experience is so much better than reading a PDF. For example:
https://www.arxiv-vanity.com/papers/1705.04085v3/
https://www.arxiv-vanity.com/papers/1708.00884/
https://www.arxiv-vanity.com/papers/1705.06031v2/
The source for the LaTeX to HTML renderer is on GitHub[0]. It’s built on Pandoc[1] and Distill.pub’s template[2].
[0] https://github.com/arxiv-vanity/engrafo
[1] https://pandoc.org
[2] https://github.com/distillpub/template