
Show HN: Read ArXiv Papers on Semantic Scholar as Responsive HTML Documents - undfined
https://www.semanticscholar.org/paper/Self-Paced-Deep-Learning-for-Weakly-Supervised-Obj-Sangineto-Nabi/1a5151b4205ab27b1c76f98964debbfc11b124d5/read
======
thecodeviking
Hi HN,

I'm an engineer on the Semantic Scholar team that worked on integrating this
feature into the site.

Here's a blog post that talks a bit more about what we're doing:
[https://blog.semanticscholar.org/announcing-a-new-way-to-
rea...](https://blog.semanticscholar.org/announcing-a-new-way-to-read-papers-
df0dec59d53b)

I'm around to answer questions / discuss the approach. We're super excited and
would love to hear your feedback!

~~~
ivan_ah
Feature idea 1: perhaps you could make the references section at the bottom
render links, at least for references that provide an arXiv identifier.

Feature idea 2: better/shorter URL structure, the current URL
[https://www.semanticscholar.org/paper/{title}-{authors}/{uui...](https://www.semanticscholar.org/paper/{title}-{authors}/{uuid}/read)
ends up quite long and unreadable (may be good for SEO though). If you're
rendering mostly arXiv papers you could set up a short URL scheme that mirrors
the arXiv URL paths: e.g. if the original URL is
[https://arxiv.org/abs/XXXX.YYYY](https://arxiv.org/abs/XXXX.YYYY) your URL
could be
[https://www.semanticscholar.org/paper/arxiv/XXXX.YYYY](https://www.semanticscholar.org/paper/arxiv/XXXX.YYYY)

Good stuff!

~~~
undfined
(I also work on the Semantic Scholar team)

1\. Totally agree. We know how we can / will do this and plan to do so given
enough interest in the MVP reading experience.

2\. The URL structure is indeed for SEO purposes. (We get a large majority of
users discovering our paper pages through organic search)

~~~
detaro
The proposed shorter format could maybe be a redirect then? Quickly being able
to turn one link into the other would be quite useful.
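
A minimal sketch of the link conversion discussed here, assuming new-style
arXiv identifiers only (old-style ids such as hep-th/... are not handled).
The short path is the one proposed in this thread, not an existing endpoint,
and `short_url_for` is a hypothetical name:

```python
import re
from typing import Optional

# New-style arXiv ids: four digits, a dot, four or five digits, optional version.
ARXIV_ABS = re.compile(r"^https?://arxiv\.org/abs/(\d{4}\.\d{4,5})(v\d+)?/?$")

# Hypothetical prefix taken from the proposal above; assumed, not a real route.
SHORT_PREFIX = "https://www.semanticscholar.org/paper/arxiv/"


def short_url_for(arxiv_url: str) -> Optional[str]:
    """Map an arXiv abstract URL to the proposed short URL, or None if it does not match."""
    match = ARXIV_ABS.match(arxiv_url.strip())
    if match is None:
        return None
    # The version suffix (v1, v2, ...) is dropped so all versions share one link.
    return SHORT_PREFIX + match.group(1)
```

A redirect handler would still need a lookup from the arXiv id to the long
{title}-{authors}/{uuid} path; that table is not shown here.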

~~~
thecodeviking
Definitely, thanks!

------
ivan_ah
Very nice! I like how the HTML rendering is responsive and allows you to read
on narrow screens.

MathJax seems to handle most things, but not 100%:
[https://www.semanticscholar.org/paper/Quantum-Broadcast-
Chan...](https://www.semanticscholar.org/paper/Quantum-Broadcast-Channels-
Yard-Hayden/8be3e848c2b69982150a3220674801e43c943ea0/read)

What are you using for the LaTeX --> (HTML+MathJax) conversion?

~~~
ktpsns
The main workhorse is
[https://dlmf.nist.gov/LaTeXML/](https://dlmf.nist.gov/LaTeXML/) as you can
find once you reach [https://github.com/arxiv-
vanity/engrafo](https://github.com/arxiv-vanity/engrafo).

LaTeXML converts tex to XML by running latex ("only latex can parse latex")
and working on the DVI output.

But nevertheless this is a hard job, so I will look into the engrafo code soon
because I want to apply this to a book we have written.

~~~
thecodeviking
Yup, @ktpsns nailed it. LaTeXML does the heavy lifting in converting TeX to
XML. From there some post processing does the job of converting it to a nice
responsive template (that's done by Engrafo / the ArXiv Vanity team).

We love OSS at AI2, and are looking to collaborate with the Engrafo / ArXiv
Vanity team as we expand the functionality.

~~~
ktpsns
So I dug into the code (engrafo repository) and was quite surprised that --
contrary to the suggestive title -- the method inherits all the problems
LaTeXML already has. The main one is that (compared, for instance, to the
TeX Live distribution) tons of widely used sty files lack a LaTeXML binding,
and thus the conversion fails for a wide range of papers. Converting a TeX
document to XML with LaTeXML really requires a lot of debugging, ideally
starting from a plain LaTeX paper/book and compiling with pdflatex and latexml
at the same time, making sure nothing breaks.
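
One way to spot trouble before converting is to list the packages a document
loads and compare them against the packages your LaTeXML install has bindings
for (bindings ship as files named like `amsmath.sty.ltxml`). A rough sketch,
assuming you supply the supported set yourself; `used_packages` and
`missing_bindings` are hypothetical helpers, not part of LaTeXML or Engrafo:

```python
import re
from typing import Iterable, List

# Matches \usepackage{a,b} and \usepackage[opts]{a}.
USEPACKAGE = re.compile(r"\\usepackage\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}")


def used_packages(tex_source: str) -> List[str]:
    """Return package names loaded via \\usepackage, in order of first appearance."""
    # Strip LaTeX comments (an unescaped % through end of line) before scanning.
    stripped = re.sub(r"(?<!\\)%.*", "", tex_source)
    names: List[str] = []
    for group in USEPACKAGE.findall(stripped):
        for name in group.split(","):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return names


def missing_bindings(tex_source: str, supported: Iterable[str]) -> List[str]:
    """Return loaded packages that are absent from the caller-supplied supported set."""
    supported_set = set(supported)
    return [name for name in used_packages(tex_source) if name not in supported_set]
```

This only sees packages loaded directly in the source text; packages pulled in
by a document class or by other packages would need a deeper scan.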

~~~
thecodeviking
Yea, it's tough work. We're hoping to invest more in the conversion library
(and support the Engrafo team in doing so).

It's going to take a lot of time and elbow grease to get it to where it needs
to be!

------
CardenB
In what ways is this different from arxiv-vanity? (Just append "-vanity" to
"arxiv" in any arxiv link to get a rich html version)

~~~
thecodeviking
It's not that different, at the moment. The only real difference is that we're
pre-computing the HTML so it's faster (ArXiv Vanity runs at request time).
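
To make the distinction concrete, here is a toy sketch of the two serving
paths. It is not how either site is actually built; `Converter` is a
hypothetical stand-in for the real TeX-to-HTML step, and the in-memory dict
stands in for whatever storage is really used:

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical stand-in for the real TeX-to-HTML conversion step.
Converter = Callable[[str], str]


def precompute(paper_ids: Iterable[str], convert: Converter) -> Dict[str, str]:
    """Convert every paper ahead of time and keep the HTML keyed by paper id."""
    return {paper_id: convert(paper_id) for paper_id in paper_ids}


def serve_precomputed(store: Dict[str, str], paper_id: str) -> Optional[str]:
    """Request path when precomputing: a plain lookup, no conversion work."""
    return store.get(paper_id)


def serve_at_request_time(convert: Converter, paper_id: str) -> str:
    """Request path when converting on demand: the reader waits for the conversion."""
    return convert(paper_id)
```

The trade-off is that precomputing pays the conversion cost up front for every
paper, including ones nobody opens, in exchange for a cheap lookup per request.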

We've talked a lot with the ArXiv Vanity team. If all goes well and our users
love the feature, we have an opportunity to support (and contribute to)
their efforts at improving Engrafo and LaTeXML and maintain the front-end
facing portion of the system. That way they don't have to worry about hosting
/ providing a functioning front-end, which we're happy to foot the bill for
(and maintain)!

------
david2016
I would recommend that rather than copying "arXiv-Vanity", your team should
focus on improving and contributing to "Engrafo" since there are many errors
in the HTML conversion. And, leave everything at "arXiv-Vanity" since there is
no point of having 2 different places doing the same exact thing.

~~~
undfined
We plan to do both. We want to give back to Engrafo where it makes sense, but
we also have a unique opportunity given our other semantic features along with
the paper metadata from our corpus to build upon this experience. This release
is an MVP reading experience, but in the future we plan to add a number of
things like: user highlighting, direct linking to authors and citations, and
collaborative commenting.

~~~
thecodeviking
+1 to what @undfined said.

If all goes well, we hope to act as a front-facing host for the Engrafo
engine. That'll enable the ArXiv Vanity team to focus their efforts on
improving the conversion process (which is what they'd like to focus on) while
we can handle the logistics of serving this up to users (as quickly as
possible).

I'm really excited about the opportunity to collaborate and support the ArXiv
Vanity / Engrafo team!

