Tell HN: Replace the X with a 5 in arXiv.org to display a paper in HTML
356 points by bgschulman31 on Feb 1, 2022 | 89 comments
Check it out: https://ar5iv.org/pdf/2106.10522.pdf
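
As a rough illustration of the trick in the title, the rewrite amounts to swapping the host name and leaving the rest of the URL alone. This is only a sketch: the function name `to_ar5iv` is made up here, and it assumes ar5iv mirrors arXiv's paths, as the example link above suggests.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Swap an arxiv.org host for ar5iv.org, keeping path and query intact."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ("arxiv.org", "www.arxiv.org"):
        # Only the host changes: the "x" in arxiv becomes a "5".
        parts = parts._replace(netloc="ar5iv.org")
    # URLs on any other host are returned unchanged.
    return urlunsplit(parts)


print(to_ar5iv("https://arxiv.org/pdf/2106.10522.pdf"))
```

Working on the parsed host rather than doing a plain string replace avoids touching an "arxiv" that happens to appear in a path or query string.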



People put in effort making PDFs and ensuring that nobody can read them without either a 14" portrait screen or a ton of scrolling—and you had to come along and ruin that carefully laid out inconvenience? What's wrong with you.

The incredibly complex layout of ‘a wall of text and occasional pictures’ is there for a reason. That's what the authors wanted, and only PDF is up to the task of representing such delicate formatting.


Maybe not only PDF is up to the task, but this site doesn't show that the web would be. In the section "The quantum circuit model", look at how 1/sqrt(2) looks (the 2 falls out of the sqrt sign). Then there is inline maths using the font "Noto Serif", which isn't suitable for writing formulas and looks really inconsistent next to the formulas that stand on their own lines. uses "Noto Serif" even in dedicated lines - the straight line and the rounded line couldn't be closer together. This font just isn't up to the task. It makes the formulas both less legible and ugly. Inconsistent spacing in "|0>" looks really off. A lot of inline formulas look as if they were about to fall off their lines.

I don't get the use of sans-serif for section titles. The default serif fonts in LaTeX make section titles both stand out and look like they belong in their surroundings; this one doesn't have that effect. The hyphenation algorithm is also suboptimal. Look at the original PDF: there are fewer occurrences of hyphenation, and at the same time the space between words in paragraphs is more evenly distributed. Also, the hyphens are easier to spot.

What PDFs generated by LaTeX have, and what the web has so far failed to replicate, is beautiful fonts for maths, consistency, legibility and overall beauty of the end result. They also have the advantage that they are self-contained, so if I want to archive one or move it to another place, I just grab the single file and I'm done. With the web, a single document is distributed across many files and you need to do a weird wget dance (which is easy to misuse), install some obscure browser extension, etc. Not exactly user-friendly. Also, if you want to print the document to get away from the distraction of the computer, you're guaranteed to get at least as good a result as what you see on the screen.

Don't get me wrong. This attempt is better than any that I've seen so far. Kudos to the author for that. It's no small feat. But it doesn't prove that PDFs are obsolete and that people favouring them are irrational.


Maybe you should look up MathML. Structured PDFs use MathML just like HTML to represent math formulae.

The real problem exists because most people don't use a correctly formatted/structured PDF to begin with. I don't want to think about all the problems MS Word might cause here, or the ways it probably violates the spec.

Vectorized PDFs also don't use embedded fonts and content flow instructions but a bunch of randomly sequenced glyphs that make no sense without a very good OCR. Vector-based PDFs are usually garbage for automated usage, and they are not useful anyhow for assistive purposes (e.g. a screenreader or a converter that uses the DAISY format or similar).

So yeah, I'd argue that PDF is the wrong serialization format. There are standardized alternatives that would be easier to parse, communicate, and license.

And countering your argument about portability: MHTML and WARC formats are very portable, the former is the default format for the page save functionality of all mobile smartphones. They are a single file, containing all necessary resources to display the page.


> This attempt is better than any that I've seen so far.

Have you seen arXiv Vanity? - https://www.arxiv-vanity.com


Not sure if this is sarcasm... but I don't think OP made this (https://github.com/dginev/ar5iv), they are just sharing.

Besides I don't think this is meant to be a total replacement that everyone is going to magically use and abandon PDFs, it might just be a convenience within a workflow to skim and check the full PDF if necessary.


I think the holes on your sarcasm filter might be too large.


I used to think I could recognize sarcasm whenever I saw it. Then I lived through the past few years.


I'm not making detailed arguments about what your sarcasm filter should be, but if this comment isn't detected as sarcasm, you should turn a knob somewhere.


I have to calibrate it to the fact that during the 2016 presidential debates one of the major party candidates assured everyone watching that his sex organs were doing great. And he won. My knobs are on 11, frankly.


I think a lot of people overestimate how well sarcasm can be determined on the internet. Sarcasm detection involves not only culture but also age, morals, and politics. Something that may appear as sarcasm to one person/group when addressed to an "in-group" would appear as someone actually saying that seriously to an "out-group", especially when it reinforces any kind of stereotype. Something that would be easily detected as sarcasm by a friend would be treated absolutely seriously by my parents, and my cousins who are almost a generation younger than me often write sarcasm I can't detect.

There's only one rule of sarcasm on the internet, if you want to be sarcastic, always end your comment with /s, otherwise avoid it.

To get back to the topic, I detected the OP's post as sarcasm, but I had to read most of the way through it before I determined it was probably sarcasm, and even then I wasn't 100% sure.


I couldn't disagree more. The rule of sarcasm on the internet is the same as every other rule where everyone is anonymous. Give people the benefit of the doubt.

Moreover, the claim that sarcasm can't be detected on the internet is extremely dubious. It's really not that hard. The person puts extra effort into showing how the thing that they are pretending to believe is an absurd thing to believe. This is not culture or age dependent.

"It's awful how those people are doing all that good stuff" is a statement that requires you to know zero context. An alien can detect that this sentence is absurd. Moreover, if someone actually believes something absurd, and presents it this way, it doesn't matter if it's sarcasm or not, since they aren't convincing anyone.


> It's really not that hard.

Correction: It's really not that hard for you. It IS hard for me, especially in recent years.


The rules of communication trump the rule of sarcasm. If you want to get your message across clearly, avoid ambiguity. When the target audience misunderstands the message, that's on the author.

If I wanted to hunt for a hidden meaning in written text, I'd rather read poetry.


PDF is also bad for machine translation. It's an accessibility issue.


My biggest complaint with all these super useful sites is that I can never remember them when I need them. Replace this in YouTube to bypass country restrictions, replace that in arxiv to view in browser, etc.

I wish somebody could make an extension or repository system to store all these, and prompt you sometimes when on the sites.


I usually make a userscript for these sites, which adds it automatically. It's a one-liner for simple link insertion. https://github.com/FrozenVoid/Userscripts/blob/main/Arxiv/Ar...


The Redirector add-on for Firefox provides pattern-based (regex and glob) URL redirection on the browser side. You can just add rules to its configuration rather than new userscripts.

I used it in the past to locally correct broken links within intranet/CI pages. E.g. something correctably wrong with a gerrit link, or whatever.
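
The pattern-based redirection described above can be sketched in a few lines. This is a minimal Python illustration of the idea, not Redirector's actual rule format; the `RULES` table and the `redirect` function are hypothetical names, and the three rewrites are simply URL hacks mentioned in this thread.

```python
import re

# Hypothetical rule table: (pattern, replacement) pairs for URL hacks
# mentioned in this thread. This is NOT Redirector's own config format.
RULES = [
    (re.compile(r"^https?://(?:www\.)?arxiv\.org/"), "https://ar5iv.org/"),
    (re.compile(r"^https?://(?:www\.)?github\.com/"), "https://github.dev/"),
    (re.compile(r"^https?://(?:www\.)?reddit\.com/"), "https://www.redditp.com/"),
]


def redirect(url: str) -> str:
    """Return the rewritten URL for the first matching rule, else the input."""
    for pattern, replacement in RULES:
        if pattern.match(url):
            # Replace only the matched scheme-and-host prefix.
            return pattern.sub(replacement, url, count=1)
    return url


print(redirect("https://www.reddit.com/r/gifs/"))
```

Keeping the rules in one table is the point: a new hack becomes one more line of data rather than another script to remember.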


Ah, this comment prompted me to add this to my Anki, so thanks! I got frustrated never remembering https://remove-js.com, so I added it to a deck on Anki and now I'm unlikely to ever forget it. This is useful enough to go on the deck too.


Ah, I always forget about Anki! Wish there were an app to remind me of Anki when I need it the most.


The anki app reminds you of anki


I love Anki and use it like you do, it sounds like.


Does RemoveJS do anything that the “block scripts” button on uBlock doesn't?


Allows me to share it as a link to my parents, for example.

During the initial pandemic period, there were news articles with useful information that I wanted to share. But the pages were so full of JavaScript-based crap that guiding them on how to find the information became a task of its own. RemoveJS was very helpful in making those sites accessible to them.


I'd tell you, but RemoveJS blocks VPN users from viewing their site while uBlock does not.


I'm using Anki to not forget my fitness workouts


Anki is great for remembering your fitness workouts, if you plan to sabotage your fitness with increasing intervals between workouts. :)


Hihi. That's true. Maybe an addon that keeps cards' ease at the same rate would do.

By the way, have you checked out Migaku's vacation add-on? I suggested it to you a while ago.


Replace "github.com" with "github.dev" or "github1s.com" on any repo


I agree, I can't wait to forget about what the name of the repository is


I can’t wait to turn down at least $6B offer for my company :)


Nice. It looks like a site with a similar idea was posted to hn before[1], but the result from ar5iv seems to be a bit slicker/cleaner than that site.

1: https://www.arxiv-vanity.com/


For the most recent papers on arxiv, you can check out another site here https://academ.us/article/2106.10522/ This one uses a different backend (pandoc), so slightly different pros and cons in rendering.


Unfortunately, ar5iv also only hosts the first version of the paper, while arxiv-vanity only hosts the last.


Which is intentional. ar5iv does not aim to be a live preview service, or replace arXiv.

The primary aim is to serve the community with the outputs we have, while we improve the coverage and fidelity of our generator.

And yes, using only the official sources arXiv has released for reuse: https://arxiv.org/help/bulk_data_s3

This is indeed a major difference with -vanity


> The primary aim is to serve the community with the outputs we have, while we improve the coverage and fidelity of our generator.

Could you explain what you mean here? Who is "we" - I assume ar5iv? "while we improve the coverage and fidelity of our generator" <-- does this mean this is a temporary situation, and in the future multiple versions of the paper will be available?


Certainly, sorry for the confusion.

There's actually multiple "we", since there are two institutions involved, and one foil character - I'm the only one responsible for ar5iv "the website", in a personal capacity.

The fidelity of the generator has the "we" of the team behind LaTeXML, the TeX-to-HTML conversion tool. That is in many ways the most important project to remember here, as that is what we want to actively improve to a point where it is "good enough" in creating HTML over the entirety of arXiv.

The institution hosting the website, and wanting to "serve a community" is KWARC, a research group at the university of FAU-Erlangen in Germany. There are all kinds of projects and services brewing on that end, which have interplay with the HTML data behind ar5iv, but are not directly on the site.

And as to all of us reading HN, I think we are actually interested in arXiv itself being maximally useful. And so is the ar5iv site - it's a temporary deployment, that really is aiming to reintegrate back into the arxiv.org site, and general infrastructure.

If/when that happens is unclear, but in the meantime there are a lot of improvements that can be made, both in what HTML can be generated and in deciding what the markup of scientific documents ought to be in the first place, as well as gaining some insights into what new problems arXiv would encounter if they served HTML.


Oh, and the last question - yes, if arXiv integrates the feature, they will be able to serve any of the versions, including the most recent one.

I can technically implement that, but I really don't want to, as I see it as crossing a certain line. Seeing ar5iv as a limited, constrained, service is a good thing - I think it clearly communicates that I do not want to compete with arXiv.


Unlike Ar5iv, Arxiv-vanity will also show papers more recent than a month, which are usually the papers you want to open.


Very nice idea, although the first PDF I tried wasn't available, and the second one seems to be missing a lot:

https://arxiv.org/pdf/2001.00888.pdf (19 pages)

https://ar5iv.org/html/2001.00888 (missing content)

This is no doubt a hard problem ...


If you have a spare minute, please pay a visit to the "report issue" button at the bottom.

Indeed - hard problem and a messy solution. I have no easy answers.


Seems it craps out on the monospaced bits of text?


The project is open source, so this can easily be ported to any site[0]. [0]: https://github.com/dginev/ar5iv


Before you get too excited:

"We are usually at least a month behind the live arXiv article list.

Also, we can only serve papers submitted with their LaTeX sources."

Moderate excitement is probably warranted, though.


If a paper is prepared in latex, arxiv will detect that and insist that the sources are included. (They compile the pdf themselves.) So I think the second point shouldn't be much of a problem in fields where everyone uses latex.


Anyone know what % of papers are submitted with their LaTeX sources?


arXiv have the advantage that they have LaTeX source for a majority of their submissions. Much easier to convert that to HTML than any arbitrary PDF.


This was a very far-sighted move by, I believe, Paul Ginsparg back in the early days. It significantly increases the headache of submitting to the arXiv because you have to get your tex file to compile with the tex distribution on their servers rather than just on your own home box. But it makes the arxiv vastly more future proof than it would be if you could just upload PDFs.


I think the arXiv / LANL preprint service predates PDF and the web (ftp and gopher). TeX still seems more popular in physics than in other disciplines.


Yes, the arXiv started with PDF-less tex. (I believe people just compiled to PostScript files, which could be printed directly.) But when PDFs appeared, and especially as latex distributions became less standardized, there would have been a natural pressure to accept PDFs.


yes, this is why they have accepted PDF for some time https://arxiv.org/help/submit_pdf

(maybe for as long as since 2004 -- but ironically archive.org could not access arXiv for some years https://web.archive.org/web/20041101000000*/https://arxiv.or... )

Most other preprint services, such as those based on eprints (and in other disciplines) have always accepted PDF.


The reason they accept PDF is not because of the difficulty in getting tex files to compile on different distributions that I mentioned. Indeed, as they say on the page you link to, PDF files created from a tex file are specifically rejected by the arXiv.

Rather, PDF files are accepted for the (fairly small) minority of papers that are written using alternative editors like MS Word.


For any arXiv submission that was submitted via LaTeX, you can download the source.
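
A minimal sketch of building that download link, under one loudly stated assumption: that sources are served under `https://arxiv.org/e-print/<id>`. Check arXiv's own help pages before relying on it. The helper name `source_url` is made up for illustration, and only new-style identifiers such as 2106.10522 are handled.

```python
import re

# New-style arXiv identifiers look like 2106.10522, optionally with a
# version suffix such as v2. Older "subject/NNNNNNN" ids are not handled.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def source_url(arxiv_id: str) -> str:
    """Build a source-download URL for a new-style arXiv identifier."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    # Assumption: sources live under /e-print/<id>.
    return f"https://arxiv.org/e-print/{arxiv_id}"


print(source_url("2106.10522"))
```

The function only constructs the URL. Fetching it, and unpacking whatever archive comes back, is left out on purpose.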


Awesome, easy way to read papers in the browser with dark mode without inverting image colors (like what Dark Reader does)


The LaTeX-created PDFs are 100 times more pleasant to read though.


Problem is, the LaTeX-created PDFs have fixed width and read horribly on smaller screens as a result, with no option of a dark mode. Often on mobile you'll be forced to download the PDF as well.


Thanks!

What other sites have these kinds of URL hacks?


replace reddit.com with redditp.com to make it into a slide show of the images in the sub

example:

https://www.reddit.com/r/gifs/ --> https://www.redditp.com/r/gifs/

(there are a few other similar reddit --> image gallery url hacks, just google for them)


YouTube has one too: type "nsfw" before any youtube.com address to bypass the age restriction.


I checked the example link, and it has trouble rendering utf-8 or something (search for the Feynman quote about "nature", it starts with "Nature isn’t classical...", or the poem at the end)

I wonder how well it works with more complicated mathematical formulas containing Greek or Arabic letters (although the examples in that paper look fine to me), or other non-ASCII scripts.

other than that, this looks pretty amazing


Thank you! This is amazing! My only ask is that you bump up the default font size.


Don't most browsers remember your zoom when you revisit a website?


Yes, but why have such a hard-to-read default?


I'm not sure but it could be that it makes it easier to read on big displays. Of course this isn't very Responsive Design like. I will be glad to be proven wrong. Interested in why HN has such a small font by default as well.


Maybe it's a mobile issue? I've never adjusted zoom in my phone.


I was going to suggest bump up the saturation/blackness of the font so that it's easier to read.


Please consider adding a little bit of javascript to find and display the text that describes the variable in a formula when the mouse hovers over it. Even better would be to allow signed-in users (or paid subscribers) to add annotations (and links to youtube videos).

I would prefer if your site requires a paid subscription so you can incentivize people to annotate the content. For a paper author, making the paper terse and complex is more impressive but for the rest of us it is very tedious to decipher.


You can check out a similar site here, for example https://academ.us/article/2202.04668/ If you click on the cross reference of figures and equations it will pop up a sticker showing the corresponding content, if that's what you mean.


Am I a weirdo for preferring PDF to read papers?


By the way: if you can, it would probably be wise to drop justified alignment, and make text aligned to the left. Rectangle blocks of text only look good from a distance—when actually trying to read, the differing inter-word spaces just make the experience jarring. It's especially bad on phones, which is a prime use-case for the HTML conversion.

(Though, from ‘tell HN’, the poster is probably not the site author, right?)


HN can obviously summon the site author (hi!)

What I was wondering - and still am - shouldn't it be possible to get to a "Pleasant" justified layout on the web in general?

The jagged left-aligned paragraphs are some of the first bits people point to when invoking "my PDF looks better". I definitely am not saying I did it perfectly, but shouldn't it be possible to get a good justified scientific article on the web? Why not?


Hi.

“Looks better” only works in regard to justification when one is admiring a page overall. However, that's not how people actually read text. They look at words in lines, and at that time uneven spacing keeps tripping the eye up. I know this argument, but there's no way around this discussion, and that's all there is to say about it (so far). Nicely looking bricks of paragraphs won't make the eye glide smoothly over the holes.

Page layout programs spend some CPU time on fiddling the hyphenation until the spacing is even. Browsers can't afford to do that, afaik (not sure about currently, but that was the situation a while back). Moreover, from what I vaguely heard, the HTML specification defines paragraphs in such a way that browsers don't even have the freedom to fiddle the paragraph height—something about the height being the minimum for the text on hand, or something like that.

Even if you insist on keeping justification on desktop (though I personally can see the holes clearly), for the love of good, please disable it on phones. It's just a mess there.


Got it. So even with hyphenation, the gaps are still bad enough that you'd consider the current ar5iv rendering bumpy and distracting?

I think I can see that, but it's almost there, which is why it feels like there has to be something I'm missing for it to justify "just right".

But yes, at the least you've convinced me we should have a separate theme that goes left-aligned, and possibly makes a number of other choices that maximize readability.

Since I'd still want the folks that want "as good as PDF", to feel justified for sticking around.


I may have a better eye for it than most, all the way to having made my own browser extension to turn justification on paragraphs off. However, if I do turn on left-aligning on the linked example, I get plenty of very jagged right edges—which means that with justification all that uneven space gets shoved between words.


Excellent, whenever I stumble across a proof that P=NP, I think it would be better in HTML


This must be a Dartmouth project.


What's a Dartmouth?


Dartmouth College, a Cornell rival.


Need this for bioRxiv also!


bioRxiv converts PDFs to full-text already. They're right there on the bioRxiv web site—just click the "Full text" tab. For example, our most recent bioRxiv preprint:

https://www.biorxiv.org/content/10.1101/2022.01.07.475366v1....

If you want to do URL munging instead of using the UI, just add `.full` to the end of the URL ;)

This doesn't require that the manuscript be submitted in any particular format either. There are humans involved in making sure the full text formatting from PDF is good.
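
The `.full` suffix mentioned above is easy to apply mechanically. A small sketch, assuming the suffix goes at the very end of the preprint URL as described. The helper name is hypothetical, and the URL in the example is a made-up placeholder rather than a real preprint.

```python
def biorxiv_full_text(url: str) -> str:
    """Append '.full' to a bioRxiv preprint URL unless it is already there."""
    base = url.rstrip("/")  # tolerate a trailing slash
    return base if base.endswith(".full") else base + ".full"


# Placeholder URL for illustration only; not a real preprint.
example = "https://www.biorxiv.org/content/10.1101/0000.00.00.000000v1"
print(biorxiv_full_text(example))
```

Query strings and fragments are not handled; a URL carrying either would need to be split before the suffix is added.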


This is BEAUTIFUL work!


That's a lot less bad than I expected!


Thanks!


How does this compare to arxiv-vanity.com?


Same HTML backend generator (latexml), different frontends, and different coverage of arXiv.

Also, ar5iv may disappear very quickly, since I am unsure if it's more helpful or harmful. But I'll definitely lean on the public attention to keep asking arXiv to integrate an HTML preview for their articles. In the one-and-only arxiv.org itself.

Lastly, one difference that may ignite a curious debate is that ar5iv is committed to being MathML-native. Yes. MathML is the only markup used for math syntax, and you'll see it rendered directly, undisturbed, with Firefox today.

Over 500 million MathML elements in the full dataset too, pretty awe-inspiring.
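
For readers who have not seen MathML, here is roughly what the markup for the 1/sqrt(2) fraction raised earlier in the thread looks like. This is a hand-built sketch using Python's standard library, not the output of LaTeXML or of ar5iv; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def one_over_sqrt_two() -> str:
    """Return presentation MathML for the fraction 1/sqrt(2)."""
    math = ET.Element("math", {"xmlns": MATHML_NS})
    frac = ET.SubElement(math, "mfrac")   # first child: numerator, second: denominator
    ET.SubElement(frac, "mn").text = "1"
    root = ET.SubElement(frac, "msqrt")   # the radical wraps its content
    ET.SubElement(root, "mn").text = "2"
    return ET.tostring(math, encoding="unicode")


print(one_over_sqrt_two())
```

Because the radical is an element that contains the 2, whether the 2 visually "falls out" of the root sign is a matter of how the browser and font render it, not of the markup itself.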


Do this with eprint please!


The entire project is open source[0].

[0]: https://github.com/dginev/ar5iv


This is great! The hoverable footnotes are a nice touch.


wow thank you for this!


that's RAD. awesome work



