So, we built Arxiv Vanity: a site that renders Arxiv papers as web pages. It’s still pretty janky, but for the papers that do render correctly, the experience is so much better than reading a PDF. For example:
The source for the LaTeX to HTML renderer is on GitHub. It’s built on Pandoc and Distill.pub’s template.
This looks quite a bit better, so here is the question: what do you not support at the moment?
One final thing, and wildly off topic, is that when you do your defense, remember that you probably know more about the specifics of the subject than anyone else in the room. Many folks stress over it, but you're almost certainly going to be the actual expert in the room. Good luck!
Here is the broken stuff we are keeping track of: https://github.com/arxiv-vanity/engrafo/issues (feel free to add to it!)
I would think that panflute would allow for more readable code, which helps when dealing with all the corner cases and rough edges of LaTeX.
PDF has the great benefit of rendering the same on every system. With very few exceptions, PDF will look exactly the same on every system and will print the same on every system.
HTML doesn't really have that same benefit.
Don't get me wrong, I think your service is a great idea for those who would like HTML formatted results, but I'm not understanding the complaint about PDF.
Could you expand on why you don't like PDF?
Also in general the mobile pdf reading experience sucks.
For example, on Android you have to download the file (rather than browse to it) and then hunt it down to open it.
The PDF readers I've used make it easy to accidentally jump to a random page if you touch the screen in the wrong place. Kindle's probably the best, but then you have to email yourself the PDF, which is a hassle.
The need to scroll doesn't exist on a large screen or on a piece of A4, but on smaller devices like mobile phones or even tablets, it's annoying. Having a responsive page means you can scroll vertically as you read, rather than having to make big jumps (or constant horizontal scrolls) that can really break the flow.
I read quite a few PDFs and don't actually have any complaints. I am not personally seeing any readability issues and don't mind consuming PDFs at all.
That said, I think I now understand your complaint. Thanks! I just don't personally have any trouble with it. I use multiple tablets, of varied sizes, and I've had good experiences with all of the devices. While some PDFs are horribly formatted, I find that the device choice doesn't help that and it's a design choice from the author.
But, again, thanks for helping me understand.
Reading PDFs on a tablet isn't too bad because of large screen real estate.
Reading PDFs on a small mobile phone requires me to zoom in to make the font big enough for me to read, and then I have to scroll right to read, and left and down to move to a new section of the column.
Try reading a PDF on a smaller device than a tablet. I'm sure you'll be able to see what we mean.
Yes, I want my cake and a pony. Cakepony.
so no need to search tablet specs for the culprit. PEBCAK :)
Then there are minor issues of margins, possibly zooming to make text readable, etc.
That's why PDFs are so bad on mobile. The ideal format is one column text, figures and tables between paragraphs of that column, no page breaks, bidirectional links to notes. That's HTML, I guess.
I am guessing it is an individual taste thing. That makes some sense.
Yes, it does.
That's why we still test pages in different browsers and end up using browser specific code to ensure proper rendering - which often only reaches the 'close enough' format.
I understand the problem with a phone, but PDFs on an iPad/tablet are beautiful and a joy to read. Much better to read the text as originally typeset than to put it through a process such as this, which risks corrupting minor but important details in the mathematical content.
On my phone I put it in landscape mode and that allows me to read a PDF OK, but I don't really get why one would read academic papers on a phone, why not use a tablet?
However I'm very interested in engrafo. It sounds like it will allow me to automatically publish blog style content from my LaTeX sources without having to fork the LaTeX content into a markdown / HTML version.
I just don't understand why you don't like reading academic papers as PDFs on tablets!
Your solution using the LaTeX source generates really nice HTML, congrats!
For now, it's hard to read: https://docushow.com/viewdoc?url=https%3A%2F%2Farxiv.org%2Fp...
(One thing that is very ugly in the PDFs, and most scholarly papers, is the use of different-colored boxes for hyperlinks. Authors, please consider putting
in your LaTeX preambles.)
As for immediate tweaks, I tentatively suggest making the text 100% black (like the original PDF) instead of rgba(0, 0, 0, 0.8). The higher contrast will help those of us with less-than-great eyes.
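To put rough numbers on that suggestion, here is a small sketch using the WCAG contrast-ratio formula. It assumes the page background is pure white, which I have not checked, and the function names are my own.

```python
# Sketch: compare text contrast for rgba(0, 0, 0, 0.8) vs. solid black.
# Assumption: the background is pure white (255, 255, 255).

def srgb_to_linear(channel: int) -> float:
    """Convert one 0-255 sRGB channel to linear light (WCAG definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """WCAG relative luminance of an opaque sRGB colour."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def blend_over(fg, alpha, bg):
    """Flatten a semi-transparent foreground onto an opaque background."""
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(fg, bg))

def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio between two opaque colours (1.0 to 21.0)."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
current_text = blend_over(BLACK, 0.8, WHITE)  # the 80%-opaque black text

print("current text colour:", current_text)
print("current contrast:", round(contrast_ratio(current_text, WHITE), 2))
print("solid black contrast:", round(contrast_ratio(BLACK, WHITE), 2))
```

Under that white-background assumption the current text flattens to a dark grey, and its contrast ratio comes out noticeably lower than the 21:1 you get from solid black.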
For instance, the paper  appears to be quite readable on mobile, and clicking/tapping on a reference such as (8.1) leads you to equation (8.1) as you would expect.
The auto-generation of Arxiv-Vanity is really nice; maybe it would be easy to add the LaTeXML output too?
Only issue I've run into so far is that cross-references to theorem numbers don't seem to always work correctly, e.g. you'll see a lot of "Theorem ?" in https://www.arxiv-vanity.com/papers/1607.06711/.
That said, on a cursory look, this is pretty impressive. LaTeX-to-web converters have existed for a long time, and this appears to have navigated some aspects quite well!
It would be nice if an option to output MathML existed.
In brief, it allows treating Maths as a first-class citizen on the web.
For instance, with MathML the reader can choose what font the equations will be rendered in — if you prefer STIX or Latin Modern Math, then you can specify it with CSS, and the browser will correctly render it. With the mash of spans within spans that arXiv-vanity uses, you couldn't change the font, as then the pre-calculated spacings would be wrong. (Alternatively, the publisher could easily offer several styles, without having to re-render everything, just by changing the CSS.)
Arguably, client-side MathJax offers the same flexibility as MathML, but it's much, much slower, while rendering MathML in Firefox is as fast as rendering standard, static HTML.
Another application of MathML is embedding it in SVGs for beautiful graphs.
MathML can also be pasted into other applications that support it, such as Thunderbird and Mathematica.
I've also been working on a similar open-source project "Sharead".
It has a Chrome extension that uploads Arxiv papers, and you can manage papers with tags.
It also automatically converts pdf to HTML using a library called pdf2html:
Of course, it goes without saying that I want this.
Also a lot of MathJax failures (maybe LaTeX variable names?)
The MathJax failures are either things that MathJax doesn't support, or use of \DeclareMathOperator which we haven't added support for yet.
Edit: Added a more useful error message. :) https://www.arxiv-vanity.com/papers/1608.04012/
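For the \DeclareMathOperator case, one possible workaround is to rewrite the declarations into plain macros before the math reaches the renderer. This is only a sketch and not how engrafo handles it: the function name is hypothetical, and it assumes the renderer accepts \newcommand and \operatorname.

```python
import re

# Matches \DeclareMathOperator{\name}{text} and the starred variant.
# Limitation of this sketch: the operator text must not contain nested braces.
DECLARE_OP = re.compile(
    r"\\DeclareMathOperator(\*?)\s*\{\s*(\\[A-Za-z]+)\s*\}\s*\{([^{}]*)\}"
)

def rewrite_math_operators(tex: str) -> str:
    """Replace each \\DeclareMathOperator with an equivalent \\newcommand."""
    def repl(match):
        star, name, body = match.groups()
        # The starred form keeps its star so limits still go above and below.
        return r"\newcommand{%s}{\operatorname%s{%s}}" % (name, star, body)
    return DECLARE_OP.sub(repl, tex)

preamble = r"\DeclareMathOperator{\argmax}{arg\,max}"
print(rewrite_math_operators(preamble))
```

Anything the pattern does not match is passed through unchanged, so the worst case is the same failure as before.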
Unfortunately, among its sins, PDF discards a lot of the presentation semantics (headers, footnotes etc). Congrats on doing a credible job trying to reconstruct some of that! It's a tough, thankless job.
I was horrified when Adobe introduced PDF and indeed it has turned out at least as badly as I had feared.
(I'm an academic and I'm used to PDFs and I like them myself.)
I tried it on this one: https://www.arxiv-vanity.com/papers/1702.03277/
Some commands don't work (\textsl, \rotatebox, ...) and the thank you footnote is incorporated into the title, but otherwise very readable!
PDF is usually bad, of course, on small screens, unless the publisher makes special versions.
Me? I still mostly prefer reading physical academic papers because of needing to flip back and forth for re-reading (clarification) and adding personal notes/graphs/calculations.
Good job guys.
Tried a couple other papers: "This paper failed to render. Take a look at the original PDF instead."
So...with what probability does this actually work?
LaTeX is really tricky to parse, which is why you're seeing those "failed to render" errors. Judging from our logs, it works about 80% of the time. That's up a lot from plain Pandoc, though, which could render hardly anything from Arxiv.
Tried https://arxiv.org/abs/1511.06343 and a couple others and got the "failed to render."
Tried cloning engrafo, then installing Docker, then building engrafo, then my disk was filling up and I decided I'm done with this for now.
I hope this can be made to work reliably. I generally prefer web pages to pdfs.
Does anyone know if this kind of PDF reader exists? Such a PDF reflow reader would work on scanned old books.
I would love to see a bookmarklet that lets me hop from an arxiv page straight to Arxiv Vanity.
Also, the manicure emoji for the favicon was a great choice!
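On the bookmarklet idea: the URL rewrite itself looks simple, going by the two URL shapes that appear in this thread (arxiv.org/abs/<id> and www.arxiv-vanity.com/papers/<id>/). Here is a minimal sketch of that mapping. The function name is made up, it only handles new-style numeric IDs, and dropping the version suffix is my own assumption about what the site expects.

```python
import re
from typing import Optional

# Accepts abs/ and pdf/ links with an optional version suffix and ".pdf".
# Assumption: only new-style IDs such as 1511.06343 are handled.
ARXIV_URL = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/"
    r"(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?/?$"
)

def vanity_url(arxiv_url: str) -> Optional[str]:
    """Map an arXiv paper URL to the matching Arxiv Vanity URL, or None."""
    match = ARXIV_URL.match(arxiv_url.strip())
    if match is None:
        return None  # not a URL shape this sketch understands
    return "https://www.arxiv-vanity.com/papers/%s/" % match.group(1)

print(vanity_url("https://arxiv.org/abs/1511.06343"))
```

A bookmarklet would do the same rewrite on the current page's address in the browser.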
1. Center the text to the screen
2. Justify the text
(I'm not sure how difficult these are though)
Any recommendations for HTML templates other than the distill.pub one?
Personally, I prefer the PDF versions, but this could be very useful on a phone.
I always have trouble reading papers on Kindle, as the screen is small. Panning and zooming are also painful as the device is slow.
I kinda hope papers can be turned into single column (more kindle friendly.)
I guess I could probably solve this with a custom stylesheet, though.
Shameless plug: I made an Android app for arXiv if anyone wants something simple to search articles on mobile. Graduating soon, so if you try and enjoy it, any positive (but honest) reviews help the looming job search ;)
Edit to clarify: If people want to use or develop a broken sort-of-PDF viewer, that’s fine. However, if someone searches for a paper of mine, I would like them to only find the version where I at least had a chance to see that it renders correctly and is complete. In particular, I do not want to be "responsible" for broken rendering on random third-party websites. This website actually operating illegally does not make me more inclined to support it.
Sounds a lot more positive and might get better results than being as adversarial and negative (imo) as the original comment?
If you want us to remove your paper and just point at the PDF, we're happy to do so. My email's in my profile if you don't want to post the broken render here!
I also don’t want to keep tabs on every arXiv rehoster and inform them manually by e-mail every time a new paper goes up.
May I ask why this was not done together with the arXiv itself? I.e. have the infrastructure run there, let authors check the HTML render at the same time as the PDF render and then, if the author thinks they look ok, have them show directly on the abstracts page? This would even avoid all your license problems, as the arXiv already has the corresponding license!
 I got the impression it was a French site
 just guessing where you live
Longer version: it's illegal only if a license is required, which is a matter of the copyright law of the jurisdiction relevant to the act. In the US, that question may turn on things like fair use analysis, which can be tricky.
Could you clarify why you think that this site does not require a valid license to re-host and re-compile papers?
Time-shifting is one of many examples of where copying a whole work was found to be fair use; the idea that fair use applies only to citations is very, very wrong.
Fair use is extremely precedent dependent (and very hard to predict without clear applicable precedent) because the statute law gives only factors to weigh in the analysis.
> Could you clarify why you think that this site does not require a valid license to re-host and re-compile papers?
I didn't state an opinion on that; I said that, because it skips the question of whether license is required, the blanket statement that rehosting without a valid license is “clearly illegal” is inaccurate and overbroad.
A native Android or iOS tablet app would be neat for tracking your papers, etc.
HTML reads better, is smaller and faster, has more formatting options, and can keep everything in a single file.
Seriously, stop creating PDFs.
I agree that basing the layout on physical paper may not ultimately be necessary, but it gives the reader a familiar structure. The way the advertised web site turns the papers into one long scrolling column of text ... I find it rather disorienting and unstructured. Text that is split up into "pages" (whatever size they are in the end) somehow helps break up the reading flow.
In the end it remains to be shown that the gain from having academic papers not typeset in PDF outweighs the hassle of having to deal with non-standardized ways of rendering properly formatted text on websites (things like MathJax do not support everything that is available in full LaTeX).
What I don't understand is why I got at least 2 downvotes. These days I'm getting downvotes for every opinion I express on HN. It's very annoying.
But anyway, in the context of the discussion about this webpage/project, it's not relevant to ask why these PDFs exist. They do, and the scientific community is nowhere near a transition away from them. So bfirsh is trying to find a solution to consume those existing PDFs.
So think of it not as getting downvoted for expressing your opinion, but rather for not contributing to the discussion about this particular project.