People put in effort making PDFs and ensuring that nobody can read them without either a 14" portrait screen or a ton of scrolling—and you had to come along and ruin that carefully laid out inconvenience? What's wrong with you.
The incredibly complex layout of ‘a wall of text and occasional pictures’ is there for a reason. That's what the authors wanted, and only PDF is up to the task of representing such delicate formatting.
Maybe not only PDF is up to the task, but this site doesn't show that the web would be. In section "The quantum circuit model", look how 1/sqrt(2) looks (the 2 falls out of the sqrt sign). Then there is inline maths using the font "Noto Serif", which isn't suitable for writing formulas and makes it look really inconsistent when taken with the formulas which stand in their own lines. ℂ uses "Noto Serif" even in dedicated lines - the straight line with the rounded line couldn't be closer together. This font just isn't up to the task. It makes the formulas both less legible and ugly. Inconsistent spacing in "|0>" looks really off. A lot of inline formulas look as if they were about to fall off their lines. I don't get the use of sans-serif for section titles - default serif fonts in LaTeX make section titles both stand out and look like they belong in their surroundings --- this one doesn't have that effect. The hyphenation algorithm is also suboptimal. Look at the original PDF - at the same time there are fewer occurrences of hyphenation and the space between words in paragraphs is more evenly distributed. Also, the hyphens are easier to spot.
What PDF-s generated by LaTeX have, and what the web has so far failed to replicate, is beautiful fonts for maths, consistency, legibility and overall beauty of the end result. They also have the advantage that they are self-contained, so if I want to archive one or move it to another place, I just grab the single file and I'm done. With the web, a single document is distributed across many files and you need to do a weird wget dance (which is easy to misuse)/install some obscure browser extensions/etc - not exactly user-friendly. Also, if you want to print the document to get away from the distraction of the computer, you're guaranteed to get at least as good a result as what you see on the screen.
Don't get me wrong. This attempt is better than any that I've seen so far. Kudos to the author for that. It's no small feat. But it doesn't prove that PDF-s are obsolete and that people favouring them are irrational.
Maybe you should look up MathML. Structured PDFs use MathML just like HTML to represent math formulae.
The real problem exists because most people don't use a correctly formatted/structured PDF to begin with. I don't wanna think about all the problems MS Word might cause here, or how many parts of the spec it probably violates.
Vectorized PDFs also don't use embedded fonts and content flow instructions but a bunch of randomly sequenced glyphs that make no sense without a very good OCR. Vector-based PDFs are usually garbage for automated usage, and they are not useful anyhow for assistive purposes (e.g. a screenreader or a converter that uses the DAISY format or similar).
So yeah, I'd argue that PDF is the wrong serialization format. There are standardized alternatives that would be easier to parse, communicate, and license.
And countering your argument about portability: MHTML and WARC formats are very portable, the former is the default format for the page save functionality of all mobile smartphones. They are a single file, containing all necessary resources to display the page.
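To make the "single file" point concrete: MHTML is essentially a MIME multipart/related message in which each resource carries a Content-Location header. Here is a rough sketch using only Python's standard email package. The function names, the example URLs and the choice to bundle just HTML plus stylesheets are my own assumptions for illustration, and I make no promise that a given browser will open the result.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url, html, stylesheets):
    """Bundle a page and its stylesheets into one multipart/related string.

    `stylesheets` maps a (hypothetical) stylesheet URL to its CSS text.
    """
    root = MIMEMultipart("related")
    main = MIMEText(html, "html", "utf-8")
    main["Content-Location"] = page_url
    root.attach(main)
    for url, css in stylesheets.items():
        part = MIMEText(css, "css", "utf-8")
        part["Content-Location"] = url
        root.attach(part)
    return root.as_string()


def list_resources(mhtml_text):
    """Return (location, content type) for every leaf part in the bundle."""
    message = message_from_string(mhtml_text)
    return [
        (part["Content-Location"], part.get_content_type())
        for part in message.walk()
        if not part.is_multipart()
    ]


# Illustrative data only; example.org is a placeholder, not a real paper.
bundle = build_mhtml(
    "https://example.org/paper.html",
    "<html><body><p>Hello</p></body></html>",
    {"https://example.org/style.css": "p { text-align: left; }"},
)
resources = list_resources(bundle)
```

Everything the page needs travels in `bundle`, which is the property being argued about here.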
Besides I don't think this is meant to be a total replacement that everyone is going to magically use and abandon PDFs, it might just be a convenience within a workflow to skim and check the full PDF if necessary.
I'm not making detailed arguments about what your sarcasm filter should be, but if this comment isn't detected as sarcasm, you should turn a knob somewhere.
I have to calibrate it to the fact that during the 2016 presidential debates one of the major party candidates assured everyone watching that his sex organs were doing great. And he won. My knobs are on 11, frankly.
I think a lot of people overestimate how well sarcasm can be determined on the internet. Sarcasm detection involves not only culture but also age, morals, and politics. Something that may appear as sarcasm to one person/group when addressed to an "in-group" would appear as someone actually saying that seriously to an "out-group", especially when it reinforces any kind of stereotype. Something that would be easily detected as sarcasm by a friend would be treated absolutely seriously by my parents, and my cousins who are almost a generation younger than me often write sarcasm I can't detect.
There's only one rule of sarcasm on the internet: if you want to be sarcastic, always end your comment with /s, otherwise avoid it.
To get back to the topic, I detected the OP's post as sarcasm, but I had to read most of the way through it before I determined it was probably sarcasm, and even then I wasn't 100% sure.
I couldn't disagree more. The rule of sarcasm on the internet is the same as every other rule where everyone is anonymous. Give people the benefit of the doubt.
Moreover, the claim that sarcasm can't be detected on the internet is extremely dubious. It's really not that hard. The person puts extra effort into showing how the thing that they are pretending to believe is an absurd thing to believe. This is not culture or age dependent.
"It's awful how those people are doing all that good stuff" is a statement that requires you to know zero context. An alien can detect that this sentence is absurd. Moreover, if someone actually believes something absurd, and presents it this way, it doesn't matter if it's sarcasm or not, since they aren't convincing anyone.
The rules of communication trump the rule of sarcasm. If you want to get your message across clearly, avoid ambiguity. When the target audience misunderstands the message, that's on the author.
If I wanted to hunt for a hidden meaning in written text, I'd rather read poetry.
My biggest complaint with all these super useful sites is that I can never remember them when I need them. Replace this in YouTube to bypass country restrictions, replace that in arxiv to view in browser, etc.
I wish somebody could make an extension or repository system to store all these, and prompt you sometimes when on the sites.
The Redirector add-on for Firefox provides pattern-based (regex and glob) URL redirection on the browser side. You can just add rules to its configuration rather than new userscripts.
I used it in the past to locally correct for broken links within intranet/CI pages. E.g. something correctably wrong with a gerrit link, or whatever.
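For anyone who would rather keep such rules in a script, here is a minimal sketch of the same idea: a table of regex rules applied to a URL. This is not Redirector's configuration format, just plain Python. The rule table, the function name and the ar5iv.org/abs/ target pattern are assumptions on my part.

```python
import re

# Hypothetical rule table of (pattern, replacement) pairs.
# The ar5iv target URL shape is an assumption, not taken from this thread.
RULES = [
    (
        re.compile(r"^https?://(?:www\.)?arxiv\.org/abs/(.+)$"),
        r"https://ar5iv.org/abs/\1",
    ),
]


def redirect(url):
    """Return the rewritten URL for the first matching rule, else the URL unchanged."""
    for pattern, replacement in RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url
```

Adding another site is then one more tuple in `RULES`, which is roughly what the add-on's rule list amounts to.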
Ah, this comment prompted me to add this to my Anki, so thanks! I got frustrated never remembering https://remove-js.com , so added it to a deck on Anki and now I'm unlikely to ever forget it. This is useful enough to go on the deck too.
Allows me to share it as a link to my parents, for example.
During the initial pandemic period, there were news articles with useful information that I wanted to share. But the pages were so full of JavaScript-based crap that guiding them on how to find the information became a task of its own. RemoveJS was very helpful in making those sites accessible to them.
For the most recent papers on arxiv, you can check out another site here
https://academ.us/article/2106.10522/
This one uses a different backend (pandoc), so slightly different pros and cons in rendering.
> The primary aim is to serve the community with the outputs we have, while we improve the coverage and fidelity of our generator.
Could you explain what you mean here? Who is "we" - I assume ar5iv? "while we improve the coverage and fidelity of our generator" <-- does this mean this is a temporary situation, and in the future multiple versions of the paper will be available?
There's actually multiple "we", since there are two institutions involved, and one foil character - I'm the only one responsible for ar5iv "the website", in a personal capacity.
The fidelity of the generator has the "we" of the team behind LaTeXML, the TeX-to-HTML conversion tool. That is in many ways the most important project to remember here, as that is what we want to actively improve to a point where it is "good enough" in creating HTML over the entirety of arXiv.
The institution hosting the website, and wanting to "serve a community" is KWARC, a research group at the university of FAU-Erlangen in Germany. There are all kinds of projects and services brewing on that end, which have interplay with the HTML data behind ar5iv, but are not directly on the site.
And as to all of us reading HN, I think we are actually interested in arXiv itself being maximally useful. And so is the ar5iv site - it's a temporary deployment, that really is aiming to reintegrate back into the arxiv.org site, and general infrastructure.
If/when that happens is unclear, but in the meantime there are a lot of improvements that can be made, both in what HTML can be generated and in deciding what the markup of scientific documents ought to be in the first place, as well as gaining some insight into what new problems arXiv would encounter if they served HTML.
Oh, and the last question - yes, if arXiv integrates the feature, they will be able to serve any of the versions, including the most recent one.
I can technically implement that, but I really don't want to, as I see it as crossing a certain line. Seeing ar5iv as a limited, constrained, service is a good thing - I think it clearly communicates that I do not want to compete with arXiv.
If a paper is prepared in latex, arxiv will detect that and insist that the sources are included. (They compile the pdf themselves.) So I think the second point shouldn't be much of a problem in fields where everyone uses latex.
This was a very far-sighted move by, I believe, Paul Ginsparg back in the early days. It significantly increases the headache of submitting to the arXiv because you have to get your tex file to compile with the tex distribution on their servers rather than just on your own home box. But it makes the arxiv vastly more future proof than it would be if you could just upload PDFs.
Yes, the arXiv started with PDF-less tex. (I believe people just compiled to PostScript files, which could be printed directly.) But when PDFs appeared, and especially as latex distributions became less standardized, there would have been a natural pressure to accept PDFs.
The reason they accept PDF is not because of the difficulty in getting tex files to compile on different distributions that I mentioned. Indeed, as they say on the page you link to, PDF files created from a tex file are specifically rejected by the arXiv.
Rather, PDF files are accepted for the (fairly small minority) of papers that are written using alternative editors like MS Word.
Problem is, the LaTeX-created PDFs have a fixed width and read horribly on smaller screens as a result, with no option of a dark mode. Often on mobile you'll be forced to download the PDF as well.
I checked the example link, and it has trouble rendering utf-8 or something (search for the Feynman quote about "nature", it starts with "Nature isnât classical...", or the poem at the end)
I wonder how well it works with more complicated mathematical formulas containing Greek or Arabic letters (although the examples in that paper look fine to me), or other non-ASCII scripts.
I'm not sure but it could be that it makes it easier to read on big displays. Of course this isn't very Responsive Design like. I will be glad to be proven wrong. Interested in why HN has such a small font by default as well.
Please consider adding a little bit of javascript to find and display the text that describes the variable in a formula when the mouse hovers over it. Even better would be to allow signed-in users (or paid subscribers) to add annotations (and links to youtube videos).
I would prefer if your site requires a paid subscription so you can incentivize people to annotate the content. For a paper author, making the paper terse and complex is more impressive but for the rest of us it is very tedious to decipher.
You can check out a similar site here, for example
https://academ.us/article/2202.04668/
If you click on the cross reference of figures and equations it will pop up a sticker showing the corresponding content, if that's what you mean.
By the way: if you can, it would probably be wise to drop justified alignment, and make text aligned to the left. Rectangle blocks of text only look good from a distance—when actually trying to read, the differing inter-word spaces just make the experience jarring. It's especially bad on phones, which is a prime use-case for the HTML conversion.
(Though, from ‘tell HN’, the poster is probably not the site author, right?)
What I was wondering - and still am - shouldn't it be possible to get to a "Pleasant" justified layout on the web in general?
The jagged left-aligned paragraphs are some of the first bits people point to when invoking "my PDF looks better". I definitely am not saying I did it perfectly, but shouldn't it be possible to get a good justified scientific article on the web? Why not?
“Looks better” only works in regard to justification when one is admiring a page overall. However, that's not how people actually read text. They look at words in lines, and at that time uneven spacing keeps tripping the eye up. I know this argument, but there's no way around this discussion, and that's all there is to say about it (so far). Nicely looking bricks of paragraphs won't make the eye glide smoothly over the holes.
Page layout programs spend some CPU time on fiddling the hyphenation until the spacing is even. Browsers can't afford to do that, afaik (not sure about currently, but that was the situation a while back). Moreover, from what I vaguely heard, the HTML specification defines paragraphs in such a way that browsers don't even have the freedom to fiddle the paragraph height—something about the height being the minimum for the text on hand, or something like that.
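To illustrate the difference between a quick first-fit pass and spending extra CPU time on a paragraph, here is a toy comparison. It uses a minimum-raggedness dynamic program, which is a simplified relative of what TeX does and not the real Knuth-Plass algorithm: no hyphenation, no stretchable glue, monospaced characters. I am also not claiming that any browser breaks lines exactly like the greedy version. All function names are made up for this sketch, and it assumes every word fits within the line width.

```python
def greedy_breaks(words, width):
    """First-fit: keep adding words to the current line while they fit."""
    lines, current, length = [], [], 0
    for word in words:
        needed = len(word) if not current else length + 1 + len(word)
        if current and needed > width:
            lines.append(current)
            current, length = [word], len(word)
        else:
            current.append(word)
            length = needed
    if current:
        lines.append(current)
    return lines


def optimal_breaks(words, width):
    """Minimise the summed squared leftover space on every line but the last."""
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: index where the line after i starts
    best[n] = 0
    for i in range(n - 1, -1, -1):
        length = -1
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width:
                break
            cost = 0 if j == n - 1 else (width - length) ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(words[i:nxt[i]])
        i = nxt[i]
    return lines


def badness(lines, width):
    """Squared leftover space per line, ignoring the last line."""
    return sum((width - len(" ".join(line))) ** 2 for line in lines[:-1])


words = "aaa bb cc ddddd".split()
greedy_score = badness(greedy_breaks(words, 6), 6)    # lines: aaa bb / cc / ddddd
optimal_score = badness(optimal_breaks(words, 6), 6)  # lines: aaa / bb cc / ddddd
```

On this tiny input the greedy layout leaves one line with four spare columns, while the optimising pass spreads the slack more evenly. With justification on, that spare space is what ends up between the words.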
Even if you insist on keeping justification on desktop (though I personally can see the holes clearly), for the love of god, please disable it on phones. It's just a mess there.
Got it. So even with hyphenation, the gaps are still bad enough that you'd consider the current ar5iv rendering bumpy and distracting?
I think I can see that, but it's almost there, which is why it feels like there has to be something I'm missing for it to justify "just right".
But yes, at the least you've convinced me we should have a separate theme that goes left-aligned, and possibly makes a number of other choices that maximize readability.
Since I'd still want the folks that want "as good as PDF", to feel justified for sticking around.
I may have a better eye for it than most, all the way to having made my own browser extension to turn justification on paragraphs off. However, if I do turn on left-aligning on the linked example, I get plenty of very jagged right edges—which means that with justification all that uneven space gets shoved between words.
bioRxiv converts PDFs to full-text already. They're right there on the bioRxiv web site—just click the "Full text" tab. For example, our most recent bioRxiv preprint:
If you want to do URL munging instead of using the UI, just add `.full` to the end of the URL ;)
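A small sketch of that URL munging, for the script-inclined. The helper name is made up, and the DOI in the usage line is a placeholder rather than a real preprint. The only thing taken from this thread is the `.full` suffix itself.

```python
from urllib.parse import urlsplit, urlunsplit


def full_text_url(url):
    """Append `.full` to the path of a preprint URL unless it is already there."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".full"):
        path += ".full"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Placeholder identifier, for illustration only.
example = full_text_url("https://www.biorxiv.org/content/10.1101/2099.01.01.000000v1")
```

Going through `urlsplit` rather than plain string concatenation keeps any query string or fragment after the suffix instead of in front of it.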
This doesn't require that the manuscript be submitted in any particular format either. There are humans involved in making sure the full text formatting from PDF is good.
Same HTML backend generator (latexml), different frontends, and different coverage of arXiv.
Also, ar5iv may disappear very quickly, since I am unsure if it's more helpful or harmful. But I'll definitely lean on the public attention to keep asking arXiv to integrate an HTML preview for their articles. In the one-and-only arxiv.org itself.
Lastly, one difference that may ignite a curious debate is that ar5iv is committed to being MathML-native. Yes. MathML is the only markup used for math syntax, and you'll see it rendered directly, undisturbed, with Firefox today.
Over 500 million MathML elements in the full dataset too, pretty awe-inspiring.