
Show HN: Arxiv Vanity – Read academic papers from Arxiv as responsive web pages - bfirsh
https://www.arxiv-vanity.com/
======
bfirsh
We were frustrated by the experience of reading machine learning papers on
screens (particularly phones/tablets). There are lots of good tools for
authoring HTML papers (Distill, Authorea, etc) but nothing that deals with the
vast number of PDF papers that already exist.

So, we built Arxiv Vanity: a site that renders Arxiv papers as web pages. It’s
still pretty janky, but for the papers that do render correctly, the
experience is so much better than reading a PDF. For example:

[https://www.arxiv-vanity.com/papers/1705.04085v3/](https://www.arxiv-
vanity.com/papers/1705.04085v3/)

[https://www.arxiv-vanity.com/papers/1708.00884/](https://www.arxiv-
vanity.com/papers/1708.00884/)

[https://www.arxiv-vanity.com/papers/1705.06031v2/](https://www.arxiv-
vanity.com/papers/1705.06031v2/)

The source for the LaTeX to HTML renderer is on GitHub[0]. It’s built on
Pandoc[1] and Distill.pub’s template[2].

[0] [https://github.com/arxiv-vanity/engrafo](https://github.com/arxiv-
vanity/engrafo)

[1] [https://pandoc.org](https://pandoc.org)

[2]
[https://github.com/distillpub/template](https://github.com/distillpub/template)

~~~
KGIII
I'm not sure I understand. Well, I understand what you're doing but I'm not
sure why you'd dislike PDF.

PDF has the great benefit of rendering the same on every system. With very few
exceptions, PDF will look exactly the same on every system and will print the
same on every system.

HTML doesn't really have that same benefit.

Don't get me wrong, I think your service is a great idea for those who would
like HTML formatted results, but I'm not understanding the complaint about
PDF.

Could you expand on why you don't like PDF?

~~~
JosephRedfern
IMO, reading two-column papers on an iPhone (through PDF) is a real pain --
the format relies on you using your eyes to jump from bottom left to top
right, rather than having to scroll from the very bottom to the very top
(diagonally). The same problem exists even for single-column styles -- you need
to zoom in so much that you have to scroll horizontally as well as vertically.

The need to scroll doesn't exist on a large screen or on a piece of A4, but on
smaller devices like mobile phones or even tablets, it's annoying. Having a
responsive page means you can scroll vertically as you read, rather than
having to make big jumps (or constant horizontal scrolls) that can really
break the flow.

~~~
KGIII
I wonder if that's a personal thing? Over the past year, I've been trying to
join the mobile revolution - sort of. The majority of my browsing is now done
on a tablet.

I read quite a few PDFs and don't actually have any complaints. I am not
_personally_ seeing any readability issues and don't mind consuming PDFs at
all.

That said, I think I now understand your complaint. Thanks! I just don't
_personally_ have any trouble with it. I use multiple tablets, of varied
sizes, and I've had good experiences with all of the devices. While some PDFs
are horribly formatted, I find that the device choice doesn't help that and
it's a design choice from the author.

But, again, thanks for helping me understand.

~~~
abustamam
> The majority of my browsing is now done on a tablet.

Reading PDFs on a tablet isn't too bad because of the large screen real estate.

Reading PDFs on a small mobile phone requires me to zoom in to make the font
big enough for me to read, and then I have to scroll right to read, and left
and down to move to a new section of the column.

Try reading a PDF on a smaller device than a tablet. I'm sure you'll be able
to see what we mean.

------
jpeloquin
This is really cool and has a lot of potential. Academic papers are dense and
heavily cross-referenced, so experimenting with new display formats that do
more to help the reader could make researchers a lot more productive. For
example, citation tooltips are a big time saver compared to cross-referencing
the bibliography. However, it's also beneficial for every paper to look the
same because this makes skimming easier. The way to get both innovation and
consistency is to develop tools, like Arxiv Vanity, that automatically
transform the source document. This example makes me hopeful that we'll
someday have similar tools for the commercial publishers' papers.

As for immediate tweaks, I tentatively suggest making the text 100% black
(like the original PDF) instead of rgba(0, 0, 0, 0.8). The higher contrast
will help those of us with less-than-great eyes.
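
To put a number on the contrast difference, here is a rough sketch using the WCAG 2.x relative luminance formula. It assumes the text sits on a pure white background, which is an assumption and not something stated in this thread; the function names are made up for illustration.

```python
def srgb_to_linear(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    """Relative luminance of an opaque (r, g, b) color."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def blend_over(fg, alpha: float, bg):
    """Flatten a translucent foreground onto an opaque background."""
    return tuple(round(alpha * f + (1 - alpha) * b) for f, b in zip(fg, bg))


def contrast_ratio(a, b) -> float:
    """WCAG contrast ratio between two opaque colors (always >= 1)."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)  # assumed page background
BLACK = (0, 0, 0)

# rgba(0, 0, 0, 0.8) over white flattens to (51, 51, 51), i.e. #333333.
soft_black = blend_over(BLACK, 0.8, WHITE)

print(soft_black, round(contrast_ratio(soft_black, WHITE), 2))
print(BLACK, round(contrast_ratio(BLACK, WHITE), 2))
```

Under that assumption the softened text comes out at roughly 12.6:1 against 21:1 for pure black, so both are well above common accessibility thresholds, but the gap is real.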

------
dang
This is amazing. I hope you'll keep working on it. There's always a long tail
of details that need taking care of when trying to cover a large corpus, and
ploughing through successive 80%'s is (as you are no doubt acutely aware)
serious grunt work. But you've made a fabulous start, so I hope you find the
stamina to do it!

~~~
bfirsh
Yeah, even building upon Pandoc's LaTeX parsing, 3 months of grunt work got us
this 20% working. Over the next 12 months we'll get the other 80% working. :)

------
jknz
A big challenge is to get references working correctly. LaTeXML is quite good
at converting latex documents to html [1], including references such as
Theorem 2.1, equation (8.1) etc.

For instance, the paper [2] appears to be quite readable on mobile, and
clicking/tapping on a reference such as (8.1) leads you to equation (8.1) as
you would expect.

The auto-generation of Arxiv-Vanity is really nice, maybe it would be easy to
add the LatexML output too?

[1]:
[http://www.albany.edu/~hammond/demos/Html5/arXiv/lxmlexample...](http://www.albany.edu/~hammond/demos/Html5/arXiv/lxmlexamples.html)

[2]:
[http://www.albany.edu/~hammond/demos/Html5/arXiv/LaTeXML/110...](http://www.albany.edu/~hammond/demos/Html5/arXiv/LaTeXML/1108.5305.html#S8.E1)

------
ddinh
This is an awesome tool. Thanks!

Only issue I've run into so far is that cross-references to theorem numbers
don't seem to always work correctly, e.g. you'll see a lot of "Theorem ?" in
[https://www.arxiv-vanity.com/papers/1607.06711/](https://www.arxiv-
vanity.com/papers/1607.06711/).

~~~
bfirsh
Ah, looks like we don't support theorems. You can track it here:
[https://github.com/arxiv-vanity/engrafo/issues/157](https://github.com/arxiv-
vanity/engrafo/issues/157)

Thanks!

------
beezle
It failed on all three papers I tried it on.
[https://arxiv.org/abs/1710.05313](https://arxiv.org/abs/1710.05313)
[https://arxiv.org/abs/1710.06689](https://arxiv.org/abs/1710.06689)
[https://arxiv.org/abs/1710.07508](https://arxiv.org/abs/1710.07508)

~~~
bfirsh
We haven't implemented many of the LaTeX packages used in papers that aren't
machine learning papers yet - sorry. :(

------
tothrow2017
This looks really cool. (The program had some issues with the bibliography and
with custom layout, but other than that, was great.)

It would be nice if an option to output MathML existed.

(Why MathML?

In brief, it allows treating Maths as a first-class citizen on the web.

For instance, with MathML the reader can choose what font the equations will
be rendered in — if you prefer STIX or Latin Modern Math, then you can specify
it with CSS, and the browser will correctly render it. With the mash of spans
within spans that arXiv-vanity uses, you couldn't change the font, as then the
pre-calculated spacings would be wrong. (Alternatively, the publisher could
easily offer several styles, without having to re-render everything, just by
changing the CSS.)

Arguably, client-side MathJax offers the same flexibility as MathML, but it's
much, much slower, while rendering MathML in firefox is as fast as rendering
standard, static HTML.

Another application of MathML is embedding it in SVGs for beautiful graphs.

MathML can also be pasted into other applications that support it, such as
Thunderbird and Mathematica. )

------
sturmen
This is awesome! I was literally rolling my eyes this morning about trying to
read an arXiv paper on my phone.

~~~
ldenoue
shameless plug: give [https://docushow.com](https://docushow.com) a try!

------
strin
Great work.

I've also been working on a similar open-source project "Sharead".

[https://github.com/strin/sharead](https://github.com/strin/sharead)

It has a chrome extension that uploads Arxiv papers, and you can manage papers
with tags.

It also automatically converts PDF to HTML using a library called pdf2htmlEX:

[https://github.com/coolwanglu/pdf2htmlEX](https://github.com/coolwanglu/pdf2htmlEX)

------
j2kun
Looks like it's failing to process some standard tex commands (e.g. \textup)
as well as some user defined macros. See the many display errors in
[https://www.arxiv-vanity.com/papers/1710.07406/](https://www.arxiv-
vanity.com/papers/1710.07406/)

Of course, it goes without saying that I want this.

------
mastazi
When the render fails, why are you redirecting to the pdf file instead of
redirecting to the abstract? E.g. here (link stolen from another comment in
this page) [https://www.arxiv-
vanity.com/papers/1608.04012/](https://www.arxiv-
vanity.com/papers/1608.04012/)

------
jxramos
Noob question but how far does Calibre take pdf to epub conversion? I've been
really interested in learning more about the epub file format and was greatly
intrigued to discover it extends xhtml and is essentially a zip folder if I've
gathered that much correctly.
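
On the "essentially a zip" point: an EPUB is a ZIP container whose first entry is an uncompressed file named `mimetype`, with `META-INF/container.xml` pointing at the package document. Below is a small sketch that builds a skeleton container in memory and inspects it with the standard `zipfile` module. It is not a complete, valid EPUB (the package document and real content are left out), and the helper names and the `OEBPS/` paths are only illustrative.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

CHAPTER_XHTML = """<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><p>Hello, EPUB.</p></body>
</html>
"""


def build_epub_skeleton() -> bytes:
    """Build an in-memory ZIP laid out like an EPUB container (skeleton only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("OEBPS/chapter1.xhtml", CHAPTER_XHTML,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def inspect_epub(data: bytes):
    """Return (member names, declared mimetype) for EPUB-like ZIP bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        mimetype = zf.read("mimetype").decode("ascii")
    return names, mimetype


names, mimetype = inspect_epub(build_epub_skeleton())
print(names)
print(mimetype)
```

The same `inspect_epub` call should work on the bytes of a real `.epub` file, which is a quick way to poke around inside one.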

------
dbranes
Unfortunately it's failing on the first things I tried with a not-so-helpful error
message:

[https://www.arxiv-vanity.com/papers/1608.04012/](https://www.arxiv-
vanity.com/papers/1608.04012/)

[https://www.arxiv-vanity.com/papers/0903.3065/](https://www.arxiv-
vanity.com/papers/0903.3065/)

Also a lot of MathJax failures (maybe LaTeX variable names?)
[https://www.arxiv-vanity.com/papers/1709.09439/](https://www.arxiv-
vanity.com/papers/1709.09439/)

~~~
bfirsh
Those problems are normally Pandoc parsing errors. Considering it's open
source, perhaps we should print the error message so people can actually help
fix it...

The MathJax failures are either things that MathJax doesn't support, or use of
\DeclareMathOperator which we haven't added support for yet.

Edit: Added a more useful error message. :) [https://www.arxiv-
vanity.com/papers/1608.04012/](https://www.arxiv-
vanity.com/papers/1608.04012/)
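
For anyone curious what \DeclareMathOperator support could look like, here is one possible approach, sketched under assumptions: scan the preamble for declarations and turn each into a macro that expands to `\operatorname{...}`, which could then be handed to the math renderer as a macro table. This is not how engrafo does it (the thread says it isn't supported yet), the function name is hypothetical, and the regex deliberately ignores operator bodies that contain nested braces.

```python
import re

# Matches \DeclareMathOperator{\name}{body} and the starred variant.
# Bodies containing nested braces are intentionally not handled.
_DECLARE_RE = re.compile(
    r"\\DeclareMathOperator(\*?)\s*\{\s*\\([A-Za-z]+)\s*\}\s*\{([^{}]*)\}"
)


def operator_macros(preamble: str) -> dict[str, str]:
    """Map each declared operator name to an \\operatorname expansion."""
    macros = {}
    for star, name, body in _DECLARE_RE.findall(preamble):
        # The star is kept so limits are placed like \operatorname* would.
        macros[name] = rf"\operatorname{star}{{{body}}}"
    return macros


example_preamble = "\n".join([
    r"\DeclareMathOperator*{\argmax}{arg\,max}",
    r"\DeclareMathOperator{\Tr}{Tr}",
])

for name, expansion in operator_macros(example_preamble).items():
    print(f"\\{name} -> {expansion}")
```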

------
gumby
Thank you thank you thank you! I detest the reader-hostile PDF (and WTF? why
would you write something and then make it inconvenient to read???)

Unfortunately, among its sins, PDF discards a lot of the presentation
semantics (headers, footnotes etc). Congrats on doing a credible job trying to
reconstruct some of that! It's a tough, thankless job.

I was horrified when Adobe introduced PDF and indeed it has turned out at
least as badly as I had feared.

~~~
Analog24
I believe it's reconstructed from the Latex source, which is how every paper
is submitted to the ArXiv. Not to diminish this site but I'm guessing that
generating HTML from Latex is a lot easier than doing it from PDF format.

~~~
gumby
Thanks. Soon I suppose we'll be able to run a LaTeX->TeX rendering system
compiled to web assembly!

------
zitterbewegung
This is awesome! Going on my home screen now. Love the design. Maybe you
could ask Arxiv to have a button on their site that would direct it so that it
opens on your site.

------
vimarshk
This is so good. I do prefer the HTML over PDF in this scenario.

------
auggierose
Not really my use case as I read PDF papers on an iPad Pro 12.9 inch, which is
just fine, but very neat work!

I tried it on this one: [https://www.arxiv-
vanity.com/papers/1702.03277/](https://www.arxiv-
vanity.com/papers/1702.03277/)

Some commands don't work (\textsl, \rotatebox, ...) and the thank you footnote
is incorporated into the title, but otherwise very readable!

------
captn3m0
This is so much awesome. Thanks for building this.

------
maxxxxx
Nice! PDF is the worst format I can think of to present papers. Especially for
reading on mobile this will be of great help.

~~~
leephillips
PDF is the only format that will preserve the typographical details that are
important in many technical papers; it also avoids the relatively bad
rendering created by the browsers.

PDF is usually bad, of course, on small screens, unless the publisher makes
special versions.

~~~
alexott
Another useful thing is the ability to add comments/annotations - I'm reading on
an iPad, and annotate quite a lot.

------
huangc10
Definitely useful in certain situations. Can't comment on how well the
conversion works (yet) but I can see how this might be useful to a lot of
people.

Me? I still mostly prefer reading physical academic papers because of needing
to flip back and forth for re-reading (clarification) and adding personal
notes/graphs/calculations.

Good job guys.

------
rundigen12
Tried three papers: "Could not find Arxiv ID in that URL. Are you sure it's an
arxiv.org URL?" Why, yes, I copied-and-pasted directly from the browser tab
showing the Arxiv URL.

Tried a couple other papers: "This paper failed to render. Take a look at the
original PDF instead."

So...with what probability does this actually work?

~~~
bfirsh
What Arxiv URLs were you trying?

LaTeX is really tricky to parse, which is why you're seeing those "failed to
render" errors. Judging from our logs, it works about 80% of the time. That's
up a lot from plain Pandoc though - it could render hardly anything from
Arxiv.

~~~
rundigen12
I tried some more and still got the "are you sure this is a valid URL?" e.g.,
[https://arxiv.org/abs/gr-qc/0702106](https://arxiv.org/abs/gr-qc/0702106)

Tried [https://arxiv.org/abs/1511.06343](https://arxiv.org/abs/1511.06343) and
a couple others and got the "failed to render."

Tried cloning engrafo, then installing docker, then building engrafo, then my
disk was filling up and decided I'm done with this for now.
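
The gr-qc URL above uses arXiv's old-style identifier (archive name, slash, seven digits), while the other one uses the newer `YYMM.NNNNN` form. A minimal sketch of an ID extractor that accepts both is below. The patterns are a simplification written for illustration, not the site's actual parsing code, and `extract_arxiv_id` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

# New-style IDs: 1511.06343, optionally with a version suffix like v2.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: gr-qc/0702106 or math.GT/0309136 (simplified pattern).
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def extract_arxiv_id(url: str):
    """Return the arXiv ID found in an arxiv.org URL, or None if there is none."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in {"arxiv.org", "www.arxiv.org"}:
        return None
    for prefix in ("/abs/", "/pdf/"):
        if parsed.path.startswith(prefix):
            candidate = parsed.path[len(prefix):].rstrip("/")
            if candidate.endswith(".pdf"):
                candidate = candidate[: -len(".pdf")]
            if NEW_STYLE.match(candidate) or OLD_STYLE.match(candidate):
                return candidate
    return None


for url in (
    "https://arxiv.org/abs/gr-qc/0702106",
    "https://arxiv.org/abs/1511.06343",
    "https://arxiv.org/pdf/1705.04085v3.pdf",
):
    print(url, "->", extract_arxiv_id(url))
```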

------
pinouchon
Sad to see that it doesn't render the figures properly for this paper:
[https://www.arxiv-vanity.com/papers/1701.03757v1/](https://www.arxiv-
vanity.com/papers/1701.03757v1/)

I hope this can be made to work reliably. I generally prefer web pages to
pdfs.

~~~
bfirsh
LaTeX is hard, but we'll keep at it! That issue is being tracked here:
[https://github.com/arxiv-vanity/engrafo/issues/12](https://github.com/arxiv-
vanity/engrafo/issues/12)

------
visarga
For bitmap based PDFs it would be possible to segment the document into words
and images (just bounding boxes, not OCR), then "reflow" them to a different
page size, by allocating fewer words per row.

Does anyone know if this kind of PDF reader exists? Such a PDF reflow reader
would work on scanned old books.
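
The reflow step described above can be sketched as a greedy line-filling pass over the word boxes. This assumes segmentation has already produced the boxes in reading order, which is the hard part and is not shown; the `Box` type, the `reflow` function, and the spacing parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Size of one segmented word or image, in source pixels."""
    width: float
    height: float


def reflow(boxes, page_width: float, gap: float = 4.0, line_gap: float = 2.0):
    """Greedily place boxes left to right, wrapping when a row would overflow.

    Returns one (x, y) top-left position per input box, in the same order.
    """
    positions = []
    x, y, row_height = 0.0, 0.0, 0.0
    for box in boxes:
        # Wrap only if the row already has content and this box will not fit.
        if x > 0 and x + box.width > page_width:
            y += row_height + line_gap
            x, row_height = 0.0, 0.0
        positions.append((x, y))
        x += box.width + gap
        row_height = max(row_height, box.height)
    return positions


words = [Box(40, 10), Box(40, 10), Box(40, 10)]
print(reflow(words, page_width=100))
```

A narrower `page_width` simply means fewer boxes fit before each wrap, which is the "fewer words per row" effect.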

~~~
ocrcustomserver
I'm working on it, email me for early access.

------
marknadal
Hallelujah, we need to force academics and people who go around touting
"whitepapers" to be ripped out of the proprietary PDF era of terrible
UX/readability and into modern readable web documents. This is definitely the
way to do it, this is brilliant!

~~~
KGIII
FWIW, PDF is an open standard. There exist both open and proprietary readers
and creators. It has been an open standard for nearly a decade.

[https://en.wikipedia.org/wiki/Portable_Document_Format](https://en.wikipedia.org/wiki/Portable_Document_Format)

------
arkades
I would pay at least $10 for an app that made an aesthetically-pleasing HTML
flow of any pdf document I’m looking at. At least. And I can’t imagine I’d be
the only one. So much of my reading is on-screen now, and PDFs do kind of suck
for two column reading.

~~~
ocrcustomserver
What kind of pricing schema would you prefer for this? Monthly or one time
payment?

~~~
arkades
It definitely feels like a one time payment, since my mental model for it
would be “pdf reader.” Also, since I’m accustomed to SaaS in my business
dealings, but detest it in the private consumer space, and this would appeal
to me as a consumer; my business documents are already mostly PowerPoints,
with the occasional word doc.

------
rahimnathwani
If you're on Android or an ereader and frequently read two-column PDFs,
consider using Koreader:
[https://github.com/koreader/koreader](https://github.com/koreader/koreader)

------
stefco_
This is a superb idea! Still not working on some of the papers I read, but
hopefully it will soon.

I would love to see a bookmarklet that lets me hop from an arxiv page straight
to Arxiv Vanity.

Also, the manicure emoji for the favicon was a great choice!

------
deipoda
Great tool for mobile! Have to say I'm too used to PDFs on my laptop, although
I can see this replacing it. Two suggestions-

1. Center the text on the screen
2. Justify the text

(I'm not sure how difficult these are though)

------
ocrcustomserver
Really like this! I'm working on something similar, a generic PDF to HTML
converter that enables reflowing of documents on a mobile device.

Any recommendations for HTML templates other than the distill.pub one?

------
Cacti
Great job!

Personally, I prefer the PDF versions, but this could be very useful on a
phone.

------
nodemaker
Awesome thank you so much :) The GAN paper somehow feels more readable :D

------
billconan
Thanks! This is useful.

I always have trouble reading papers on Kindle, as the screen is small.
Panning and zooming are also painful as the device is slow.

I kinda hope papers can be turned into single column (more kindle friendly.)

------
gnicholas
You might add a contact link so people can easily get in touch. The first-
listed personal homepage is kind of insane with all the flashing colors, and
it contains dead links (e.g., your Twitter).

~~~
bfirsh
Yeah, that's a good point. Andreas's site is a bit out of date. I've made them
link to more useful places. Thanks!

------
pkrumins
I've a large collection of scientific papers and interesting publications all
in PDF format, and I've never had a problem reading a PDF document.

------
netheril96
Now, if this can be integrated with arXiv-sanity.com

------
nmca
So I tried this on a couple of papers I need to read: awesome awesome awesome
awesome. Would upvote a thousand times if I could. Awesome.

------
michaelmior
Looks cool although the first paper I tried it on other than the examples
didn't work :( Looking forward to seeing this improve!

------
thediff
Please make the References section hyperlinked so clicking on a reference
takes you to the paper.

~~~
bfirsh
So much this. This is one of the main reasons I wanted to build this.

[https://github.com/arxiv-vanity/engrafo/issues/127](https://github.com/arxiv-
vanity/engrafo/issues/127)

~~~
thediff
It's also one of the main reasons Larry built Google.

------
HaoZeke
An excellent project, though maybe Sakura CSS would be a lighter alternative
to distill...

------
doall
It would be great if I could bookmark the position where I quit reading a paper.

------
Kip9000
Looks great! Would be even better if the references were URLs to other papers.

------
mihaitodor
That looks great! Well done!

------
visarga
Narrow column would be easier. Full screen width is not best practice.

------
fenwick67
Somebody please, PLEASE do this for Project Gutenberg.

~~~
ocrcustomserver
I just visited Project Gutenberg and tried opening a few books. There seems to
be an HTML option; what did you mean?

~~~
fenwick67
They are inconsistently formatted, and the default formatting usually takes up
the whole page width, making it hard to actually read.

I guess I could probably solve this with a custom stylesheet, though.

------
joshwcomeau
Really neat!! Thanks for sharing :)

------
mbillie1
I love this, thank you so much!

------
alkonaut
This will save so many trees.

------
jwilk
Why "Vanity"?

~~~
bfirsh
A bit of wordplay on our favourite Arxiv tool. :) [http://arxiv-
sanity.com/](http://arxiv-sanity.com/)

------
make3
fuck that's cool

------
cyberpunk0
This site is nice even for mobile!

Shameless plug: I made an Android app for arXiv if anyone wants something
simple to search articles on mobile. Graduating soon so if you try and enjoy
it, any positive (but honest) views help the looming job search ;)

[https://play.google.com/store/apps/details?id=xyz.imaginatri...](https://play.google.com/store/apps/details?id=xyz.imaginatrix.synapse)

~~~
ReverseCold
Also this one:

[https://f-droid.org/packages/com.commonsware.android.arXiv/](https://f-droid.org/packages/com.commonsware.android.arXiv/)

------
claudius
I have to admit I am not impressed. My first paper, which I tried to render,
does not work properly; references are removed and rendered poorly, figures
are misplaced and tables incomplete. Given that not all arXiv papers are under
a permissive license and _you do not have permission to do this_ , I would
much prefer if you at least made sure that arxiv-vanity rendered papers do not
show up in search results, e.g. by offering a suitable robots.txt and with a
bigger link to the author-endorsed version of the paper.

Edit to clarify: If people want to use or develop a broken sort-of-PDF viewer,
that’s fine. However, if someone searches for a paper of mine, I would like
them to only find the version where I at least had a chance to see that it
renders correctly and is complete. In particular, I do not want to be
"responsible" for broken rendering on random third-party websites. This
website actually operating illegally does not make me more inclined to support
it.
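
For reference, the robots.txt suggestion might look something like the rule embedded below, checked here with Python's standard `urllib.robotparser`. The rule itself is a hypothetical example and not the site's actual configuration; the `/papers/` path is taken from the URLs quoted in this thread. Note that `Disallow` only asks crawlers not to fetch those pages, so it is a first step and not a guarantee that nothing is ever indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of rendered papers only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /papers/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in (
    "https://www.arxiv-vanity.com/",
    "https://www.arxiv-vanity.com/papers/1705.04085v3/",
):
    print(url, "->", is_crawlable(ROBOTS_TXT, url))
```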

~~~
bfirsh
This is fair. If somebody stumbles across this thinking it is how you intended
it to be displayed, I can understand you'd be unhappy. We should make it
clearer that we're just a conversion tool, not a source.

If you want us to remove your paper and just point at the PDF, we're happy to
do so. My email's in my profile if you don't want to post the broken render
here!

~~~
claudius
Thank you for your reply. Ideally I’d prefer for you to respect the license
associated with each paper and only re-compile and re-host if the license
actually allows you to do that (i.e. CC0, CC-BY, CC-BY-SA and maybe
CC-BY-NC-SA, depending on whether you think you act commercially).

I also don’t want to keep tabs on every arXiv rehoster and inform them
manually by e-mail every time a new paper goes up.

May I ask why this was not done together with the arXiv itself? I.e. have the
infrastructure run there, let authors check the HTML render at the same time
as the PDF render and then, if the author thinks they look ok, have them show
directly on the abstracts page? This would even avoid all your license
problems, as the arXiv already has the corresponding license!

------
whataretensors
Very cool idea! Thank you.

------
rublev
Very cool. Downloading PDFs has always bothered me and this is a fantastic
and easy way to view papers, esp while commuting.

A native Android or iOS tablet app would be neat to track your papers etc.

~~~
gfody
I'm surprised there doesn't exist a more responsive/reactive PDF viewer. It
seems like that would be easier than converting the document.

~~~
rublev
You mean like some embed like Scribd? I've never really liked that sort of
stuff, always works wonky for me. Text is great.

------
fiatjaf
Why PDFs in the first place? They're hard to write, hard to format, hard to
generate, and they do no good at all to anyone.

HTML is easier to read, smaller, faster, has more formatting options, and can
keep everything in a single file.

Seriously, stop creating PDFs.

~~~
falkod
Why the hate for PDFs? I can see why you would not want to communicate dynamic
content with them, sure. But for academic papers, which are usually static
text and equations and maybe some plots, PDFs do exactly what you want: at
least in principle they look the same on every system. If I want to email them
to somebody, store a copy, read them offline on a reader etc., I have to deal
with one single file, in a format that just works on many readers. What am I
gonna do when I typeset a paper with web technology? Fire up a local server
to display some inline math in my HTML page? No thank you.

I agree that a layout based on physical paper is maybe not ultimately
necessary, but it gives the reader a familiar structure. The way the
advertised web site is transferring the papers into a long scrolling list of
text ... I find it rather disorienting and unstructured. Text that is split up
into "pages" (whatever size they are in the end) somehow helps break up the
reading flow.

In the end it remains to be shown that the gain from having academic papers
not typeset in PDF outweighs the hassle of having to deal with non-standardized
ways of rendering properly formatted text on websites (things like MathJax do
not support everything that is available in full LaTeX).

~~~
fiatjaf
Thank you for your response. I understand your points and will rethink my hate
for PDFs.

What I don't understand is why I got at least 2 downvotes. These days I'm
getting downvotes for every opinion I express on HN. It's very annoying.

~~~
wereHamster
Your question "Why PDFs in the first place?" was answered in bfirsh's comment
([https://news.ycombinator.com/item?id=15534583](https://news.ycombinator.com/item?id=15534583)).

But anyway, in the context of the discussion about this webpage/project, it's
not relevant to ask why these PDFs exist. They do, and the scientific
community is nowhere near a transition away from them. So bfirsh is trying to
find a solution to consume those existing PDFs.

So think of it not as getting downvoted for expressing your opinion, but as
getting downvoted for not contributing to the discussion about this particular
project.

