ArXiv now offers papers in HTML format (arxiv.org)
1204 points by programd on Dec 21, 2023 | 312 comments



Since the article doesn't link to an example HTML paper, here's a random link:

https://browse.arxiv.org/html/2312.12451v1

It's cool that it has a dark mode. I didn't see a toggle, but it renders according to the system setting.

Overall, this will make arXiv a lot more accessible on mobile.


And here's the PDF of the same paper for comparison: https://arxiv.org/pdf/2312.12451.pdf


The contrast is massive. I'm much more likely to read the HTML version; that PDF is deeply off-putting in some hard-to-define way. Maybe it's the two columns, or the font, or the fact that the format doesn't adjust to fit different screen sizes.


This is very interesting, because for me it's just the opposite. In particular the two column layout is just more readable and approachable for me. The PDF version also allows for a presentation just as the authors intended. I guess it's good that they offer both now.


Do you work extensively with LaTeX?

Two columns is good, albeit annoying on mobile. But the font. The typeface kills me, and almost every LaTeX-generated document sports it.


Hilariously, I would probably tolerate the HTML version a lot better if it had the font from the PDF (and FWIW, the answer for me is "no: I don't work with LaTeX at all... I just read a lot of papers").


If you disable the font rule by commenting it out:

  :root, [data-theme=light] {
    /* --text-font-family: "freight-sans-pro"; */
  }

it switches to "Noto Serif", which is way easier on the eyes.


I hard-override the font in the browser; designers never get it right.


what is your font of choice?


Verdana


https://github.com/neilpanchal/spinzero-jupyter-theme /fonts/{cmu-text,cmu-mono} :

> "Computer Modern" is used for body text to give it a professional/academic look


Hating on Computer Modern (ok, probably now Latin Modern) is something close to blasphemy.


Computer Modern was not designed for easy viewing on screens (think about the screens Knuth would have been using in 1977), it was designed for printing in books.


I hate Computer Modern, and I'm not even particularly fussy about typefaces.


What device and app are you using to read the document?


The authors don’t format the PDF; the editor does. The authors probably sent a double-spaced Word document with the figures and tables in another file.


Not on arXiv (unless I'm much mistaken), which is a preprint server, not a conventional journal.

arXiv accepts various flavors of TeX, or PDFs not produced by TeX [0], and automatically produces PDFs and HTML where possible (e.g. if TeX is submitted). In the case of the example paper under discussion, the authors submitted TeX with PDF figures [1], and the PDF version of the paper was produced by arXiv. The formatting was mainly set by using REVTeX, which is a set of macros for LaTeX intended for American Physical Society journals.

[0] https://info.arxiv.org/help/submit/index.html#formats-for-te... [1] https://arxiv.org/format/2312.12451


FWIW, I recently learned that it is also possible to produce nice PDF papers with GNU roff (groff), have a look at this example: https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....


Looks nice, but it seems strange to switch from two columns to one after the first page? Although maybe they're just trying to demonstrate its capabilities.


W. Richard Stevens (RIP, still hurts) famously used troff for his books.


You typically send a .tar.gz of .tex files (and figures, .bbl, etc.) to the journal. And then you typically upload something very similar to the arXiv (I have an arxivify Makefile target for my papers that handles some arXiv idiosyncrasies, like requiring all figures to be in the same folder as the .tex file, and it also strips all the comments; sometimes you can find amusing things in the source file comments of some papers).

Some fields may use Word files, but in most of physics you would get laughed at...

It is true that most journals will typically reformat your .tex in a different way than is displayed on the arXiv.


In computer science, the usual case is that the author fully formats the paper.


Not only is this wrong about physics/astronomy, but I regularly use the arXiv version because the typography is better (e.g., in the published paper an equation is split, with part of it at the bottom of one column and the rest at the top of the next, whereas the equation is on one line in the arXiv version).


You are very confidently wrong.

In the arxiv you use latex and do everything yourself. There is no editor.


You are completely wrong. ArXiv doesn't work like that.


For what it's worth, two-column layouts are very common in the physical sciences, or at least in physics, which I'm more familiar with. I have a feeling that the reason is at least partly to save page space when using displayed math (e.g. equations that are formatted in a break between blocks of text), which uses the full text width (i.e. the width of one column) to display what may be much less than half a page wide.


It makes sense - for paper. But pixels are infinite - HTML is far better for screen display, which is how people read things nowadays.

The extra column next to the one I'm reading introduces a lot of visual noise, and the content is hard enough as it is. I'm sure physicists have all gotten used to it, but it certainly trips me up.


> The extra column next to the one I'm reading introduces a lot of visual noise

Papers are generally not read start to finish in one go: there's lots of rereading and jumping back and forth between key parts, and anything that moves them further apart makes this harder.


Ah, that makes more sense. I imagined scientists just reading the whole thing start-to-finish.

I still think a flexible layout is best. If you like multi-columns and have a wide screen, why not display 12 columns next to each other?

With PDF this is not possible. With HTML the content can in principle be sliced and diced how you like it.


One can also view PDF pages side by side, which works pretty well with a 4K monitor.


I need to scroll up and down a lot more with two-column layout because a single page doesn't fit on my screen in my chosen font size (which is fairly large).

But HTML is so much more flexible, and ideally people can choose how they want it, although at this point it seems that's not (yet) implemented.

I find jumping back and forth is always a pain on computer screens and ebooks by the way, and is the major reason I much prefer print for this type of thing.


Two column is the default in astronomy also.


Definitely the two columns for me. It's super annoying skimming a paper and having to scroll down and back up again in a zig-zag pattern.


I think the consuming device matters. An iPad or computer has a much wider screen. A one-column layout is too wide there for average people to scan text lines quickly.

Meanwhile, it looks perfectly fine on a phone. A two-column layout looks terrible on a smartphone; the text is too tiny to read comfortably.

It would probably be even better if you could flip pages left and right like an ebook instead of scrolling, to locate content faster. But the current design is good enough IMO (compared to reading a PDF on a phone).


To display a two-column layout you need a tall screen, not a wide one. If you display a two-column layout on a short, wide screen, you have to scroll up and down in a zigzag pattern to read one page.


Just zoom the smartphone into one column. Problem solved.


And then you have to scroll both top-to-bottom and left-to-right, an even worse experience.


It's about "one-column layout is too wide": if you zoom, it's not too wide anymore. Also, smartphones have narrow screens, not wide ones, and tablets can do that too AFAIK.


Scrolling like that is not hard in smartphone format imo


If you read a lot of papers in your line of work you will quickly appreciate the two columns and justification.


Only problem is jagoffs like me who need the text to be bigger. On PDFs you now get to experience a horizontal scrollbar. HTML has text reflow and I can set the line length by resizing the window. I'm willing to make a lot of sacrifices for that experience.


Admittedly, I don't read research papers. But with HTML, surely the choice between one or two columns is a checkbox away.


Which checkbox?

I cannot find anything relevant in any of the 3 browsers I use (Vivaldi, Firefox, Chrome). Would really appreciate this option.

A quick search gave some apparently unmaintained browser extensions, and that's it.


No, I'm saying there should be a checkbox. That way, you can switch between two columns formatted like LaTeX and that font they always use, and one column with Helvetica / Arial.


It would be nice, but I am not holding my breath.


I wonder if perhaps it's a generational thing; I prefer the PDF because it reminds me of printed paper, which is what I used growing up.

(For reference: I am at the end of Gen X, people 3-4 years younger than me are considered Millennials).


Quite so. The font annoys me. This is one of the reasons I hate PDF and why I believe these things should be controlled by the person reading it, not the publisher.

I do not much care what font the author finds pleasant to read, but what I find pleasant to read, and this font isn't it, and neither are the colors.


Seconded. I can (and will) actually just read referenced papers now, instead of hesitating between getting a headache and staying uninformed.

Defaults and UX rule the world. It's unfortunate that $subj wasn't a thing for so long; it probably scared millions of curious minds away from the material. It is so important.


It feels quite standard for a paper


defo concur. will read the html version when on mobile from now on.


I prefer the pdf version, mostly. I can annotate it on the side both in print and digitally with my iPad. I can also invert colors in pdf readers to get some kind of “dark mode” easily.

The html version is wasting a lot of space on the right side and the color scheme is awful (dark grey on a brown background, seriously? How is that any better? Edit: disabling dark mode yields a better reading experience wrt color scheme). Also, somehow links to references make another http request and have no backlink?

The html version could make sense if it had more dynamic functionalities: change fonts/line spacing, toggle color schemes, maybe a mini map or some other navigational tool? Also, some kind of support for highlighting and/or annotating?


It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

- I can imagine authors feeling frustrated if someone reaches out about a problem in the HTML version of their paper, but they have no way to correct it except by hoping that a change to the PDF fixes a change to the generated HTML. Easier to just fix the formatting problem in the PDF outright.

- It would be neat to allow people to experiment with alternative formatting for their papers. For example, imagine a paper about a programming language that embeds a sandbox you can use to play around with the language under discussion. Or a paper about multivariable calculus and you can interact with a three dimensional plot of some function.


No, it would not. It's critically important that there is only one "logical" article, albeit with different representations. In other words, a single "source of truth".

With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.


Maybe some day for some papers HTML could be the source of truth instead of LaTeX. After all, the original use case for HTML and the web was academics. The HTML and CSS specs have evolved a lot since then, with support for the typesetting features you need for papers (justified text, hyphenation, page breaks, page numbers, ...) and even math formulas are possible now again natively with MathML thanks to Igalia. Diagrams can be accessible vector SVGs instead of raster images. Referencing, linking, citing, figures, tables, etc have always been native to HTML. It's trivial nowadays too to wrap a headless chromium in a CLI to convert an HTML document to PDF rendered in the exact same way that the browser would (i.e. not some static conversion tool that lags behind standards or has render implementation differences).


> With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.

Is it not possible to write LaTeX code that produces different contents in HTML vs. PDF?


Well, perhaps by exploiting bugs/shortcomings in PDF and HTML converters. Not by design.

However, bugs get fixed, and since the PDF and HTML are generated dynamically, any such hack would be extremely fragile.

And while "single source of truth" can help to prevent such malicious discrepancy, it's unlikely that people would try to hack the system this way: what for?

Far more likely scenario is unintentional discrepancy, and single source of truth definitely helps to prevent that!


Straight from ChatGPT:

Yes, it is indeed possible to write LaTeX code that produces different contents when compiled to HTML versus PDF. This is typically done by using conditional commands within the LaTeX document that check for the output format being used. These conditional commands can then include or exclude specific content based on whether the document is being compiled to HTML or PDF.

In LaTeX, the ifpdf package is commonly used to check if the output is being compiled to a PDF. For generating HTML from LaTeX, tools like TeX4ht or LaTeX2HTML are used, and they often define their own specific commands or provide a way to detect the output format.

----- It gives simple code that uses:

The \ifpdf ... \else ... \fi command checks if the document is being compiled to PDF. If it is, the content between \ifpdf and \else is included. If not (which would be the case for HTML), the content between \else and \fi is included.

The content outside the \ifpdf ... \fi conditional will appear in both the PDF and HTML versions.
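
For illustration, here is a minimal sketch of the kind of conditional it describes (assuming the ifpdf package; whether a given LaTeX-to-HTML converter actually takes the \else branch depends on how that tool runs TeX):

  \documentclass{article}
  \usepackage{ifpdf}
  \begin{document}
  \ifpdf
    This sentence appears only in the PDF output.
  \else
    This sentence appears only in non-PDF output (e.g. HTML via TeX4ht).
  \fi
  This sentence appears in both versions.
  \end{document}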


Huh? What's the point of the HTML version if you define it as a source of deception?


> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Please don't. Then you will have a mismatch between the source and the "own html" which ruins the point of uploading the source.


Pdf isn't the source


But the PDF is also generated. LaTeX is the single source of truth.


They’d have to define and document a “safe” subset of HTML, and implement a filter/checker for it. Otherwise we’d end up with papers containing ads and tracking and XSS vulnerabilities and whatnot.


Those are issues with JavaScript, not HTML. Wouldn't filtering out iframes pretty much keep us in the clear?


The parent wanted interactive 3D plots, which means JavaScript embedded in or linked from the HTML. Then there's stuff like JavaScript embedded in SVG.


> Those are issues with JavaScript, not HTML

What about various HTML tags that load remote resources? From script and link to things like img, or the CSS background-image property added via a style attribute.

There are a bunch of ways to make remote requests even without JavaScript.


The same problem exists in HN comments. This comment gets converted to html.

   But it is fine!


"gets converted to" and "gets rendered as uploaded by the user" are two different things.

There are no issues with arXiv generating the HTML and sending that over: they control the generation process, and users who visit arXiv already trust it to not be malicious. The issue is with letting the user upload their own and having it sent on to other users as is.


Most authors probably have no interest in learning HTML. Also, most authors want nothing to do with the work by the time it's submitted. It was probably hell getting the project to the point of publishing; they want to be done with it and move on to the next thing in their career ASAP.


I think this is an argument in favor of doing automatic PDF -> HTML conversion for the authors that don't want to touch it, but I don't think it's an argument against letting those who are fine with HTML provide their own.


HTML is not generated from PDF. Both PDF and HTML are generated from LaTeX.


Probably only a small percentage of people are using latex today. I’ve never personally seen it used. Just MS word docs sent to coauthors then to the paper editor.


You hit on an unappreciated truth. By the time my papers appeared in print, I was so sick of them and the endless effort involved in taking them from raw data to finished, edited, proofed, rewritten a zillion times to meet the reviewers' and editors' requests and corrections and suggestions, that I didn't even read the published paper when it arrived as preprints and in the journal.

Enough!

My proof: https://scholar.google.com/citations?user=5DdrMc8AAAAJ&hl=en


I was under the impression that what authors publish to arXiv was a LaTeX file.


Ah, thanks for clarifying!

I looked up the submission formats, and it looks like if you authored the paper in TeX/LaTeX, they do not accept pre-rendered versions of the document.

https://info.arxiv.org/help/submit/index.html#formats-for-te...

But if you did not author it in TeX/LaTeX (e.g., Word, Google Docs, etc.) it appears you can upload a PDF or HTML yourself.


But it's still a single source of truth. Only one document is submitted. So for works submitted as HTML no PDF or LaTeX version is available.


It is.


> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Can you recommend a system I can use to compile my LaTeX, while also making sure the HTML is going to look good? I'd like some kind of CSS-style @media queries to switch between certain parts of the layout, while keeping a single LaTeX file.


With the shelf life of web technologies, authors would constantly have to maintain their "papers" or they just would not be accessible after a while.


Knuth’s stated intent in maintaining TeX is only to fix bugs, not evolve the system in a way that might break old documents. Not sure if this is equally true for Lamport’s LaTeX macros but it wouldn’t surprise me.


Plain HTML from the mid-90s still renders and looks as good as it ever did.

I think CSS is also backwards compatible.

It is the JavaScript bits that change.


The tool being used for this offering is this one: https://github.com/arXiv/arxiv-readability, just to save you a few clicks :)


Wow I did not know they have the LaTeX for all the papers and compile it themselves! That's pretty crazy. What if they don't have packages you need? What if your paper isn't written with LaTeX?


> What if they don't have packages you need?

Unlikely. But if so, you can provide the packages yourself: https://info.arxiv.org/help/submit_tex.html#wegotem

> What if your paper isn't written with LaTeX?

Then they still accept PDF or HTML. See: https://info.arxiv.org/help/submit/index.html#formats-for-te...


They specify what version of texlive they use. This is significantly better than what publishers offer (usually a really old latex version, not even pdflatex).


That's it in spirit, but in practice it's refreshed:

https://github.com/arXiv/arxiv-view-as-html


I wonder how this compares to Pandoc's output.


For anyone who needs it, arxiv-vanity is amazing: https://www.arxiv-vanity.com/



It's a cool feature because it makes the papers more findable, more easily navigable, easier to read online, and faster to scroll through. I am also happy for blind people, who can more easily use arXiv with braille readers now.

(I'm still a fan of printing the PDFs, because I annotate on paper and refer to page numbers, but the HTML feature is in addition to PDF download, not a replacement.)

One thing that still sucks (not ArXiv related though) is reading mathematical formulae on the Kindle - wonder if someone with rendering expertise could have a look into the MOBI format.


This would never happen, but in an ideal world we should be able to click on a citation to jump to the part of the paper being referenced, and each paper page should have a discussion board so we can easily communicate with the authors and group the discussion in one place, instead of having to google to see if there is relevant discussion on Twitter/Reddit. We could even put links to talks, tutorials, blogs, the GitHub repo, demos, paperswithcode/Google Scholar/OpenReview, background material, and a timeline of citations in tree form on the same page (actually, I am seeing more machine learning papers that have a project page doing some of this), or even turn it into a mini wiki. I just think HTML has so much more potential (especially now that, with LLMs, we can do semantic search). I wonder if there would be interest in such a Chrome extension overlay.

Related projects:

https://github.com/ahrm/sioyek

https://github.com/arxiv-vanity/engrafo

https://github.com/dginev/ar5iv

https://academ.us/article/2111.15588/ (powered by https://github.com/jgm/pandoc I believe)


I think https://web.hypothes.is/ would be of interest to you.


This is excellent news. Their HTML formatting is also more pleasant than the HTML articles offered by most journals in my field (e.g. arXiv HTML footnotes are displayed as sidenotes on large displays!)


One of the reasons is to make the papers more accessible to people with disabilities, especially the blind. I participated in a conference they hosted on this a few months ago; I recommend taking a look at the recordings if you're interested in thinking on this.

https://accessibility2023.arxiv.org/


Blind person here, can confirm this. Reading PDFs with a screen reader is bad, reading PDFs that come from LaTeX is worse, reading LaTeX math is pretty much impossible. All the semantic info you need is just thrown away.

You can make decently accessible PDFs, but it's lots of work; you need Acrobat on the producer's side and might also need it on the consumer's side. Free tools don't even come close. There's also the fact that the process of making accessible PDFs in Acrobat isn't itself accessible.

With that said, the way screen readers treat HTML math certainly isn't perfect, it's geared more towards school children than anything above calculus. I'm probably going to stay with my LaTeX source files for now. At least ArXiv offers those, not many sites do. To be fair, that approach also has its own set of problems (particularly when people use some extra fancy formatting in their math equations, making the markup hard to read), but I find this to be the best approach for me so far, at least on AI/ML papers.


I teach math at a university. A couple years ago I had two blind students in my section of first-year calculus, and I really struggled with the tooling. Using latexml, I could produce documents that one of the students could use with a screen reader, but the other student never managed to make it work on their machine. Both students prefer braille but I didn't find anything open source that could typeset mathematical braille easily. Our disability resource office sends things out to a contractor to typeset into braille; the turn-around is measured in weeks.

Anyway, if you (or anyone else reading this) has suggestions I'd really appreciate it!


I learned (the basics of) LaTeX in my last year of middle school, and stuck with it ever since. To be fair, I was into computers since I was a child, played with Rockbox at the age of 10, started to dabble in programming shortly after, so this was a lot less scary than most of the things I was doing already. I took my middle and high school finals (they're kind of like SAT but matter a lot more) by producing LaTeX output, which I then compiled to PDF and printed. The test itself was in braille, as that was all that our government could do.

Throughout college, my first question to most of my professors of math subjects was "do you do LaTeX, and can you give me your source code." Most said yes, and that's how we worked. LaTeX in, LaTeX or PDF out, depending on what the professor preferred.

The amount of LaTeX you need for calculus 1 isn't that great, you could probably teach it to a relatively bright student if you had an hour or two to spare, and then give them the source files. If you have the time, I'd suggest producing "stripped" versions of your files, with as little markup as possible to get your point across and no fancy formatting unless absolutely necessary. The amount of hoops some books and papers jump through to "look nice" drives me crazy.

You could also consider producing, teaching and consuming ASCII math, which seems like an even simpler and friendlier format. I couldn't really use it much in my school career for boring technical reasons, but it looks like a promising option.


Thanks for the suggestions! When you LaTeX your work to turn in, do you work only with the source, or do you have a good way to read the PDF output? I agree the amount of LaTeX needed for calculus is pretty minimal.

One of my students was taking chemistry at the same time, which is (I think) much tougher for blind students. But they also had more teaching assistants for the course.


I don't interact with the PDF output myself, but I can compile and email PDFs if I need to send work over to people who do not wish to receive LaTeX themselves, a fact I used throughout most of my high-school education, where LaTeX knowledge was rare. This is why I eschew formatting where possible, I can do enough to make my symbols look right and be understandable to a sighted reader not familiar with LaTeX, but not necessarily to make things extra pretty. Not actually seeing the output makes it a lot more difficult to check your formatting work.


> Our disability resource office sends things out to a contractor to typeset into braille; the turn-around is measured in weeks.

This seems a massive gap in the market - many institutions have funding earmarked for such things.


I wonder if this is a useful service that an llm could actually outperform humans on.


Interesting! I never thought about this, thank you for sharing.

What kind of turn-around time would be practical? Could you point me to any typeset mathematical braille that would be an example of a solution to your problem? Is Nemeth the only important standard, or are others important for you too?

I'm wondering if it's practical to set this up as back-office work here in Vietnam. There are some outlying provinces here where there are very few job opportunities. Job opportunities for the blind also round down to zero here (e.g. I could hire for proofreading). Maybe there's room to do something cool here.


How's English proficiency (and American braille code proficiency) like in Vietnam?

Keep in mind that most blind people who speak English fluently but don't live in an English-speaking country (myself included) can't read English braille, or at least not well. Because of how voluminous Braille is, it uses contractions, single characters that replace common words and character combinations like "the", "would", "ing" or "ed". Those tend to be language specific, never taught outside their country or countries of use, and hard to get accessible electronic materials for. The math codes are completely different too, we use something derived from Marburg, while English-speaking countries use Nemeth. Even basic characters like + and - differ between those two, not to mention more complicated structures. It's not just the dot patterns that are different but also the design principles, like where you put spaces or when you can omit "begin fraction" / "end fraction" characters.


Our textbook of choice didn't have a braille version, so we sent it out to be converted one chapter at a time. Since textbooks don't change often, a turn-around of weeks is not so bad if we knew the students were going to be in the course.

What would be very useful for me to be able to typeset myself are small things -- homework, quizzes, and (to a lesser extent) exams. Since homework and quizzes often have to adapt to what I actually covered in class, which may or may not match the syllabus, it's hard to rely on sending these out to be typeset by others. (Exams are a little easier since they're usually done days ahead of the actual date.)

AFAIK Nemeth is the only standard that matters. If I can typeset a document, send it to the student, and they can get it on a braille display (no need for this to be on paper), it would solve a ton of problems.


Two decades ago, when I was still in university, I made the argument that PDF is a horrible format because it's purely præsentational, especially for people with disabilities whose software relies on semantic information. Last time I used LaTeX, it didn't even have a different symbol for uppercase Alpha and A, because the glyphs are indistinguishable.

They argued that PDF was superior because the publisher could control how it looked, and it looked the same everywhere, but the point is that it should not. Things such as font size and line spacing should be in the control of the consumer, not the publisher. This isn't simply about blind people, but for instance also persons with dyslexia who use particular fonts to make reading easier for them. Or, in my case, someone who simply gets a headache from fonts and line spacing that are too big. I've also been using dark mode everywhere for so long now that reading black text on a white surface on a screen gives me a headache.


To write an uppercase Alpha you need a modern version of LaTeX (i.e. xelatex or lualatex) and to include the unicode-math package.

https://tex.stackexchange.com/questions/485593/how-to-write-...
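
A minimal sketch (assuming a unicode-math version that defines the \Alpha command):

  % compile with lualatex or xelatex
  \documentclass{article}
  \usepackage{unicode-math}
  \begin{document}
  Greek capital alpha: $\Alpha$ (U+0391); Latin A: $A$ (U+0041).
  \end{document}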


For scientific articles pagination is still important, because it's how you refer to a particular part of a paper. If things like font size and line spacing are at the control of the consumer, pagination is not preserved.

This problem is harder than one would naively think.


Seems like they should use detailed section numbering, like military documents and laws. Referring by page number seems very coarse by comparison.


This would require a change from the currently near-uniformly adopted standard.

The problem with this: you need to create a new standard, get everybody to agree to it, and get busy scientists who are concentrating on content and not representation to adopt this new standard in their writing, essentially requiring them to change their habits and spend extra time on writing (which many of them hate), for no obvious gain from their point of view.

I am not saying it's not possible, or not worth it, but it is not easy and simple either.


Very hard, as it carries over from the physical paper world... and even then you have to make sure the version is right, as page numbers change.


No, the problem is very easy: referring by page number is simply ridiculous. As are all those "(<Family Name>, <year>)" citations.

Besides, in HTML one can directly link to the relevant part.


I am afraid you are being naive... You see only one factor out of many.

Being able to link directly to the relevant part is irrelevant (pardon my pun!). Such links are machine-readable, not human-readable. Scientific text need visual citations and being able to name the referred part for reading comprehension.

And Harvard-style citations (AKA name-date) exist for a reason: when you read a paper, even in an interactive format, it helps when you can recognize citations to certain papers without having to click on them or memorize numbers.

Other styles have their own advantages and disadvantages; that's why they all exist and are used by this or that journal, and no consensus on a single "right" style was ever reached.


I wrote an app called PDF Reflow that reflows the original PDF using image processing to cut out words into tiles so you see the reflowed version of the text in their original look.

https://www.appblit.com/pdfreflow


Any chance of releasing an Android version?


Gv (part of ghostscript) used to do a good job of this for two column documents. When zoomed in to show one column width of text, the spacebar ran through the top of column 1, then the bottom of column 1, then the top of column 2 and so on.

The amount it scrolled probably depended on the aspect ratio of the window, so it might be multiple key presses to scroll an entire column.


It’s using web technologies so yes it could also be on Android. I’ll see what can be done.


+1


+1


Do you think there's potential for language models to play a role here? I know that AI can get tossed around as a buzzword, but hasn't it proved quite successful in fields like computer vision?

I'm not deeply familiar with the state of that art, but it seems like recovering the metadata from a PDF generated by LaTeX would be no more impressive than many other things we're currently seeing language models achieve?


You wouldn't need to use computer vision on a picture of the PDF. arXiv has the tex source for most of the papers. An LLM trained on code could do a pretty good job of translating tex to readable html with a bit of effort.


I'm absolutely positive a few million dollars could get you a system that can "read aloud" pdf math papers in no time. I guess people will wait for it to become cheaper though.


You can also have that cheaper already. But having it stable and reliable - will take some time and possibly more money, depending on your definition of reliable.


Mathpix is trying to achieve something like this, and they do consider the visually impaired market AFAIK, but it's pretty expensive and I have no experience with it personally, so I can't say how good it is.


Hold on... Are you telling me that all these complex sentences are being typed out based on your voice alone? That's insane.


I'd say it would be simple to talk-type these using Windows 11's redux of voice typing. Pretty damn accurate and easy to modify/vary text/options. I use it all the time to make tech/engineering blog posts; it's faster and more organic than typing, typically, and it learns your technoacronyms. Combined with Voice Access, it makes it trivial to fully operate your computer (well, at least browse the web, email, and media apps) from across the room. For anyone who hasn't tried the updated version, I highly suggest hitting Windows key + H and giving it a shot.


Hm, tangential question, but shouldn't touch typing be quite accessible to many blind computer users?


? blind people can use keyboards


There are braille keyboards too


Or normal keyboards? Many people can type blind. Some learned to do so while born blind, others became blind after they had already learned this skill.

I would assume that the majority of persons on HN are not looking at their keyboard as they type.


I was just giving an additional way to use a computer not known by many. Either way, we shouldn't rely on the skills of a few to interact with a computer.


For the math equations, I'm curious: does MathML do any better for you than LaTeX?


Not the person you’re asking the question to, but it’s worth noting (if you don’t already know) that MathML is really not designed at all as an input language for practitioners who just want to write a few equations in some document. It’s designed as an output/presentation language so that devices that want to render some maths can do so faithfully[1]. As such, if you’re a human being who wants to typeset some equation, you’ll want to go to latex every single time rather than mathml and then someone else has to figure out the conversion.

[1] Great explanation here https://tex.stackexchange.com/questions/57717/relationship-b...


On the other hand, the "semantic" flavor of MathML (as opposed to "presentation") is much easier than TeX for things like screen readers, both conceptually and in practice.


Huh. It would seem like, of all the things which should make it easy to generate the correct accessibility information, the pipeline of compiling a paper from source code in LaTeX should nail it... maybe we should all pitch in to some pool to pay someone to put in the required effort to connect all the dots?


Surprisingly it’s not easy, and depending on the field it can be quite challenging. The reason for this is that TeX captures the visual aspects of typesetting, not the semantic meaning of the mathematics.

A simple example is ‘\sum’ which provides no way to capture the expression being summed over - because that’s not necessary for typesetting. That’s not the case in, say, MathML.

Writing MathML is no fun though because mathematical formulae are visually ambiguous and we rely on the context to know how to read them, e.g. does ‘f(x - 1)’ mean function f called with argument x - 1, or does it mean variable f multiplied by x - 1?
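
A toy illustration of that gap, in plain LaTeX:

  % TeX only records what happens to follow the sum sign:
  \[ \sum_{i=1}^{n} a_i b_i + c \]
  % nothing in the markup says whether c is inside the summand or not

Content MathML, by contrast, wraps the summand in an explicit element.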


Kind of tangential, but it's also kind of surprising how difficult it is in LaTeX to make a plot of an equation.

Say I have Equation \ref{eq}. Why can't I just say "plot \ref{eq} for x from -6 to 11" and get my graph?

And yes, I know about pgfplots, PSTricks, TikZ etc. But in all those cases, I need to define the same equation twice, in different syntax to boot. It's kind of unsatisfying.
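
To make the duplication concrete, a minimal sketch (assuming \usepackage{pgfplots} in the preamble; the equation is just a placeholder):

  \begin{equation}\label{eq}
    f(x) = x^2 - 2x
  \end{equation}
  \begin{tikzpicture}
    \begin{axis}[domain=-6:11, samples=100]
      % the same formula again, retyped in pgfplots' expression syntax
      \addplot {x^2 - 2*x};
    \end{axis}
  \end{tikzpicture}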


TeX is a very arcane language, and it doesn't support floating point numbers. Few languages would be less suited for making a plotting library.


pgfplots, PSTricks, and TikZ are all plotting libraries. It seems like it shouldn't be that hard to let them plot an equation written elsewhere in a different syntax.


> Say I have Equation \ref{eq}. Why can't I just say "plot \ref{eq} for x from -6 to 11" and get my graph?

Pretty much for the same reason you cannot press a word and get a pop-up dictionary definition in a paper book.


To be clear, I meant in the LaTeX source code. And there I can already write code that plots equations, I just have to re-type the equation in a new syntax.


TeX is about representation, not semantics, by design. To do anything useful with a function (like plotting) you need to get semantics.

An often cited example: what is f(x+y) ? Is it function f with x+y as its argument, or constant f multiplied by (x+y) ? TeX gives you no clue.

Or what is this i in your equation? Is it an index variable, or the square root of minus one?

You as a human figure this out by looking at the context and using domain knowledge. So does a "TeX to HTML/MathML converter". It is ultimately built on heuristics, and cannot be otherwise.

That's why I said basically "for the same reason a paper page is not interactive". It was designed this way!

The goal of TeX was to generate beautiful printed page. The need for semantic structure was not anticipated. To do semantics you need a "semantic version of MathML", or a language used by Wolfram's product, etc.


Yup, LaTeX math doesn't make sense. I've been trying to hack my way into getting a voice model to read it, but no real progress.


LaTeX is a programming language for generating beautiful pages, basically a typesetting system. It serves this purpose fantastically well.

It was not designed to provide semantic information, unfortunately. So getting anything other than visual representation out of it is hard.


Emacs with Emacspeak has a math reading module.


For accessibility purposes (and regular reading), it would be so much better to drop the justified text. Ragged edge is the way to go!

https://www.boia.org/blog/why-justified-or-centered-text-is-...



Perhaps someone can publish a paper to arXiv that provides a meta-analysis. But still there doesn't seem to be a clear reason to justify it, given that almost all internet text is not justified.


To me one of the exciting aspects of HTML is that we can theme the same article in different ways, tailored to individual preferences - just swap in a different CSS file.

Having a two-column theme, or left-aligned vs justified themes, could be workable in the long run. I hope that we get to see some browser extensions modding the pages before too long.

The reason for the current justified text is that it is the default aesthetic for a LaTeX-based article, and a lot of authors expect it.


A lot of AI/ML papers these days have an accompanying interactive page like [0]; will we see anything like these now directly on arXiv?

[0] https://voyager.minedojo.org/


I think then arXiv would have to deal with maintaining the tech stack and providing the presumably much higher server capacity to serve the more varied web pages that would result, so it seems like a tall order. arXiv already has an experimental integration with Papers with Code [0], which I guess provides similar results for the reader, though the authors have to figure out their own web hosting.

[0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...


Second that. Something I put out recently had an (admittedly video-heavy) webpage that saw 1TB of traffic over the past month. Cloudflare handled it for free for me, but at arXiv's scale it's bound to be a problem.


Seems like the references aren’t working very well.

I really want journals to have two-way links in a paper. I get Google Scholar alerts about certain papers being cited, and I want to skip to "why did they cite this? Did they use it, improve it, or just mention it?"


I’d never considered setting up citation alerts like this.

Thank you for the idea!


Looks like clicking a reference adds the hash to the URL but doesn't scroll to the reference. If you load the hash URL directly in the browser you get a 404 page...



Yeah, it seems like a bug in HTML generator...


It is a bug. Will be fixed soon.


I just hope they don't stop offering the papers in PDF. Even when I'm on a computer, I still prefer to read PDFs.


There is a taste component to it of course, but the history of PDF shows that it's the wrong format for reading on a computer. It was originally meant to be the end result of a publishing process before printing, a layer that sits right between the publishing software and the postscript that gets sent to the printer. This makes the PDF format quite inflexible for reading on a computer, with it being impossible to properly zoom or adjust the reading experience.

Unfortunately, many institutions and businesses have ignored its limitations, because PDF turned out to be an obvious-but-naive way to put a 'sheets of paper' metaphor into a computer system, which in the 1990s appealed to tech-illiterate folks doing bare-bones computerization of existing paper systems. So later we got complicated and error-prone tools for editing PDFs, and many random additions to the spec to allow for unusual use cases.


> This makes the PDF format quite inflexible for reading on a computer, with it being impossible to properly zoom or adjust the reading experience.

As an academic researcher, generally speaking I also prefer PDF, and the inflexibility and static nature is a feature, not a bug. I appreciate the fact that a paper will appear the same everywhere, that I can refer to "the top of page 7", etc.

The exception is if I wanted to just skim a paper; in this case, I think I'd prefer HTML.

I'm a huge fan of what arXiv is doing here. It effectively preserves the status quo, while adding an additional option on the side. The HTML option might prove a little bit useful for me, and it is likely to prove extremely useful for people with disabilities.


> I appreciate the fact that a paper will appear the same everywhere, that I can refer to "the top of page 7", etc.

There are many great solutions to this problem, including ones that don't require Javascript at all. This website (https://gwern.net/silk-road) presents a really good example -- every header and sub-header is a clickable anchor. If more granularity is needed, on newer articles most of the paragraphs start with an italicized margin note -- though for technical writing, paragraph anchors might be better. The page also pays careful attention to print CSS and has a 'reader mode' to convert all links to footnotes when printed.

Some websites will also preserve the text you select in a URL anchor, but more often than not this is just cumbersome. It also has a greater risk of not surviving changes to the webpage.


Also intriguing as a solution, which is potentially much stabler across revisions (page numbers are unstable), is the text-anchor-fragment feature Chrome introduced a while back: https://developer.mozilla.org/en-US/docs/Web/Text_fragments

It's actually hit ~88% of the market https://caniuse.com/mdn-html_elements_a_text_fragments but unfortunately, Firefox remains a holdout* and that's my browser, so I don't use it (although maybe I should just install https://addons.mozilla.org/en-US/firefox/addon/link-to-text-... and try it out - my existing method of making new anchors for annotation purposes is cumbersome).

* Firefox officially is positive on it but no sign of any movement on it: https://mozilla.github.io/standards-positions/#scroll-to-tex... https://github.com/mozilla/standards-positions/issues/194 https://bugzilla.mozilla.org/show_bug.cgi?id=1753933 https://wicg.github.io/scroll-to-text-fragment/


I know of no one who provides only HTML to arXiv; it's either LaTeX or doc/odt, so the PDFs should always be there.


When I open a large pdf on arxiv (100+ MB, not uncommon for ML papers focused on hi-res image generation), there is a significant load time (10+ seconds) before anything is rendered at all other than a loading bar. Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.


> Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.

The default PDF format puts the xref table at the end of the file, forcing a full download before rendering can take place. PDF-1.2 onwards supports linearized PDFs, and most PDF export tools have some way of enabling it (usually an option like "optimize for web").
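
For what it's worth, existing PDFs can also be linearized after the fact, e.g. with qpdf:

  qpdf --linearize paper.pdf paper-linearized.pdf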


I have the same issue. From what I can tell, it's just network-bound and the arXiv servers are slow. They theoretically allow you to set up a caching server, but after spending a while trying to get it set up, I haven't been able to get it to work.

https://info.arxiv.org/help/faq/cache.html


maybe it'll be faster now with fastly

https://news.ycombinator.com/item?id=38723373


It may even be that the time is spent generating the PDF.

The format in which articles are submitted and stored on arXiv is LaTeX. The PDF is automatically generated from it.

Probably arXiv does some caching of PDFs so they don't have to be generated anew every time they are requested, but I don't know how this caching works.


Many here say they prefer html documents. How do you annotate them? How do you make local copies? Also, how will you read them in the decades to come?

I love PDF.


A lot of academic journals (say from Springer) also offer HTML formats for papers published in the past decade or so, which I personally often find more convenient for reading purposes than PDFs. For example, I parse text a lot faster if I use a regex to split each paragraph into sentences and place a linebreak after each sentence, or if I do natural language "syntax highlighting" by assigning a distinctive color to functional words indicating logical structure like 'if/then', 'and', 'or', 'not', 'because', and 'is'. And sometimes it really improves readability to be able to do "semantic highlighting", in the sense of say assigning a different hashed color to each proper name (or each labeled thesis, etc) that occurs in the paper. Such manipulations are basically impossible with PDFs. It makes me wish sci-hub would start archiving HTML versions in addition to PDFs!


IMO PDF and HTML optimize for different things. PDF is easy and pretty. HTML is easy and responsive. But making PDF responsive is impossible, and making HTML pretty is not easy. I think of arXiv as being for well-polished, pretty documents, not responsive, ugly documents. Most researchers don't have time to make their HTML responsive and pretty.


Am researcher, care about responsiveness way more than pretty. I am super glad for the option. Downloading PDFs is super annoying. I'm stoked.


Well... downloading HTML is even harder nowadays, because many pages are dynamically generated. Although there are surely some browser extensions that can help you finish it in a few clicks.


This is brilliant. I don't share academia's love of LaTeX multi-column PDFs.


I like multi-column text on paper (literally), but it's awkward in digital form, where you can just shape text on the fly to whatever column size you want.


The problem is that gaining this responsiveness fundamentally makes your task much more difficult. Instead of just creating a picture, you're now writing code that has to be maintained. In my philosophy, arXiv is for documents which are set in granite - pictures.


> If you are familiar with ar5iv, an arXivLabs collaboration, our HTML offering is essentially bringing this impactful project fully “in-house”. Our ultimate goal is to backfill arXiv’s entire corpus so that every paper will have an HTML version, but for now this feature is reserved for new papers.

IIRC, ar5iv was created on his own initiative by Deynan Ginev

https://twitter.com/dginev/status/1736792316675825981

and it seems that he has worked tirelessly to fix nearly all of the edge cases during the collaboration.

This project creates huge value to humanity so Deynan is to be heartily thanked.


Thanks for the kind words, but some corrections:

1. My name is Deyan (hi!)

2. ar5iv was the latest frontend incarnation, but our actual work on converting LaTeX to HTML goes back nearly 20 years behind the scenes.

3. I was an undergraduate student when I was introduced to the project back in 2007. It was started "in spirit" by 3 senior co-conspirators back then: Michael Kohlhase, Bruce Miller and Robert Miner. And I am by no means a solitary actor today, even if I may be the chief online presence of the people involved. Bruce is doing the bulk of the hard work on LaTeXML to this day.

I documented some of the history in an invited talk for CICM 2022, which you can find on youtube, or see the slides at:

https://prodg.org/talks/welcome_to_ar5iv

It's really great that the HTML has now reached "home base" in arXiv, and I hope their team gets a lot more of the positive attention going forward - today's achievement is entirely theirs!


I remember stumbling upon your work long ago when I was working on a project to have "e-zines" that consumed a series of `article` class files and rendered them out into PDF and HTML as a series package.

I had come across latex2html, Dan Gildea's project, and found myself unpleasantly dissatisfied with how it worked. As I understand it, it's more a "half implementation of lots of packages" rather than what ar5iv seems to be, which is "enough of the core LaTeX engine producing HTML instead of DVI"? I'd love to know more about the nitty gritty of how the engine does its thing.

I'm curious: How has modern web tech (e.g. WebAssembly, Canvas, etc) helped or gotten in the way of getting good LaTeX rendering in the browser?


Right, that's LaTeXML - it tries to emulate as much as possible of the TeX typesetting system, while retaining enough control to emit structured markup.

Which also allows us (and generally all contributors of latexml package support) to conveniently maintain various parallel data structures and metadata needed along the way.

Modern HTML is very often helpful to produce higher quality article renderings. Examples:

1. we recently started using flexbox for subfigures, allowing them to reflow.

2. we have started emitting ARIA accessibility annotations (there is now an "alt" key for \includegraphics)

3. MathML Core allowed us to have native web rendering for math expressions in every browser.

As to LaTeX rendering in the browser, there are various other projects out there you could look up with partial support. For latexml the WebAssembly route seems most realistic, as we are undergoing a rewrite in Rust. But there are quite a number of pieces to flesh out before we get there.


Went through it, and may I ask whether there is any "personal" level of this ar5iv converter, or just the few parsers mentioned?

Btw, given we are quoting the academic world: I wonder whether you might have mentioned that the Gartner Group invented that technology curve. To be honest, there is a variation I like more, which deals with the chasm issue.



So far I'm left wanting an app that gives me a way to easily track and consume newly published work on a given topic. The existing apps are not great; maybe this change will make it easier to provide better "reader" views, and possibly even TTS (I like to listen while reading).


Nice! Now I don’t need to manually replace arxiv with ar5iv. Congrats to the team.


"Our ultimate goal is to backfill arXiv’s entire corpus so that every paper will have an HTML version, but for now this feature is reserved for new papers."

For now it only works for papers submitted this month. But it's great to have this feature, makes it so much easier to read on phones.


This is the reason I've never liked LaTeX from a data point of view. It's made to be printed out or to look beautiful as a PDF, but it was never designed to get you to an HTML file or a Word file.

I've written my thesis in Markdown in the past because of this (best for humans), which can be easily transformed to HTML, Word, PDF, and even LaTeX: https://github.com/tompollard/phd_thesis_markdown

And I think that XML is the best format for machines.


Ugh. I don't belong to the target audience (people with disabilities) but the typesetting doesn't exactly look pleasant on my machine (Chrome on Linux).


Nice! It would be even better if they offered authors of previous papers the option of converting to HTML, as the latex sources are already in the system.


The article states they're going to backfill all, or nearly all, previously submitted papers!


  article {
    text-justify: Knuth-Plass;
  }


Mind explaining?


The comment is invalid CSS to apply the Knuth-Plass algorithm in rendering an HTML article. Knuth being a perfectionist’s perfectionist, TeX uses this algorithm to determine optimal line breaks to provide for better text justification.

Here’s a discussion of hacks to achieve the algorithm’s results on web pages and an upcoming CSS feature as of 2020. https://mpetroff.net/2020/05/pre-calculated-line-breaks-for-...


Thank you!


If only.


doesn't work great with long author lists...

https://browse.arxiv.org/html/2312.12907v1


The PDF is worse, so there is no simple answer to this: https://arxiv.org/pdf/2312.12907v1.pdf

At least the HTML version pairs each author with their affiliations, instead of the PDF which has all the names on page 1, and all the affiliations on page 2. That's completely unreadable.


The PDF is better because I'm trained to scroll past the author list. That takes forever on the HTML version.


You can click the "Introduction" anchor on the left side and it scrolls you past the author list.


well it skips the abstract too, but yes, you can scroll back up to see it.


Yeah, it's a bit weird that the abstract doesn't have a link on the left.


Probably because \abstract{ } is treated differently than \section{ }, I guess...


For me the PDF is much better. It's compact and clean, if I really need to see an affiliation for a particular author, it's really easy to do so in the PDF, not so in the HTML.

It's highly unlikely anybody will read an entire author list this long; typically you would read the first two or three names, or check if some particular name is on the list. So the compactness of the list and being able to quickly get to the article contents is important.


30 years after HTML was invented to support accessibility and collaboration for research and academia, and on the same day as this announcement, the White House released their new accessibility guidance, which happens to be the first time they've published formal new policy natively in HTML rather than PDF: https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...


I'm surprised by how succinct, easy to understand, and sensible the policy (M-23-22) is:

> Default to HTML: HyperText Markup Language (HTML) is the standard for publishing documents designed to be displayed in a web browser. HTML provides numerous advantages (e.g., easier to make accessible, friendlier to assistive technology, more dynamic and responsive, easier to maintain). When developing information for the web, agencies should default to creating and publishing content in an HTML format in lieu of publishing content in other electronic document formats that are designed for printing or preserving and protecting the content and layout of the document (e.g., PDF and DOCX formats). An agency should develop online content in a non-HTML format only if necessitated by a specific user need.

https://www.whitehouse.gov/omb/management/ofcio/delivering-a...


Hmmm ... accessibility is essential, but PDF is far better for static documents: there's no straightforward, standard way to read an HTML document on another platform. Also, the HTML document may not be readable in 10+ years (unlike most PDFs), and updates are too fluid and hard to track.

I think the general problem is that the end-user doesn't control an HTML document, e.g., for annotation, as a local record, etc.


...What are you talking about? HTML files are readable on basically every platform, even more so because they are fundamentally text files (unlike PDFs, which are binaries). PDFs need special software; HTML can be read on the command line. Likewise, HTML is dead simple to edit and annotate.

Seriously, name a single device that has PDF support that doesn't allow you to view HTML.

I think you're conflating "HTML" and "things stored on a server", because all of your objections apply to PDFs stored on a server. The ability to save and annotate PDFs is not an inherent feature of the file format; those tools exist because the format is such a PITA to interact with that specialized programs have to be written. HTML can be saved just as easily, and usually is (on archive.org).


I just tested saving https://browse.arxiv.org/html/2312.12451v1 to disk using Chrome, transferring it to my Android phone, and opening it on the phone. Results:

1. Saving as "Webpage, Single File" (.mhtml): Neither Firefox nor Chrome even showed up in the list of available apps to open it.

2. Saving as "Webpage, Complete": Opened in Chrome but images were broken. Also very difficult to open with the default file browser because it uses a flat folder view and the sidecar folder pollutes the file list.

I was hoping this would work, perhaps you will have different findings. I agree that HTML is the superior format in theory but usability in practice is often lacking. I'm resigned to using both depending on context.


Yes, that's the kind of issue I was talking about. I wish it were otherwise. As a nearby comment pointed out, epub is a potential solution (and I wish arXiv embraced it - without my knowing their other requirements or epub's accessibility features). It's essentially packaged HTML.


Of course, they’re “just text files” only in theory… but theory and practice diverge very very often.


How do I save an HTML document locally, and annotate it, in an easily sharable form, and in a form that is stable - i.e., in a way that will be readable and usable in 20-50 years?


Basically any HTML document from 20-30 years ago (can't go any further because it didn't exist 50 years ago) will be completely readable and usable. The only issue is people creating content (not styling) in formats besides HTML.

As far as annotations go, you can use the native <ruby>[1] tag, or strikethrough, but if you mean "literally drawing on the text" then, yeah, you're looking for an image format at that point (which is fundamentally what PDF is), but we shouldn't default to storing text in image formats just because of one specific use case. (Also, as I said above, the only reason tools exist to easily do that in PDFs is because everyone insists on using a format that's hard to edit.)

Also, note that the context I was responding to was US legal documents, not something more presentation-heavy.

[1] https://twitter.com/antumbral/status/1730829756013375875


You say it as if PDF is somehow better. To begin with, it's a proprietary format. If Adobe goes bankrupt or obscure tomorrow, PDF will go out of use as a failed technology.


> it's a proprietary format. If Adobe goes bankrupt or obscure tomorrow, PDF will go out of use as a failed technology.

It's an ISO standard with a very large ecosystem outside Adobe. Many users and businesses I know don't use Adobe at all.


They will use it, like COBOL. But are COBOL programs usable on your machine?


> There's no straightfoward, standard way to read an html document on another platform.

Such as? What doesn’t have a browser but can render PDFs?


I mean, how do I save it locally on one platform and read it on any platform? Or share it with someone else to read (without them downloading software)? I.e., we don't have a standard, local, single-file html format.


You're right.

We could have such a format if browser and OS vendors were interested in supporting such a use case. Unfortunately, they aren't.

On the browser side, supporting all-in-one HTML files can be as simple as reading a single multipart-encoded page. Heck, if they supported automatically serializing all external resources as data URIs when saving pages, then most browsers would be able to open them without any modification.

On the OS side, operating systems could treat HTML files as first-class citizens: execute them in an offline sandbox (most operating systems have embedded webviews), then extract the icon, title, description, and other metadata to present to the user. An icon that consists of a blank page with a small browser icon in the corner doesn't tell me anything about what the page is about. This needs to change.

In short, HTML can easily be made nicer to deal with locally, thanks to all the parts already being in place. The problem is that no one (tech giants, OS vendors) is interested in doing this.


.mhtml (or .mhtm) is that format. It's an archive containing an HTML file along with all the resources it references (JavaScript, CSS, and images). These browsers support it: Internet Explorer, Edge, Opera, Chrome, Yandex, and Vivaldi. Create one by saving the web page and choosing the .mhtml format. Safari supports another format called webarchive.

https://en.wikipedia.org/wiki/MHTML
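Under the hood, MHTML is just a MIME multipart/related message, so you can even assemble one by hand with Python's standard email module; a minimal sketch ("page.html" and "figure.png" are placeholders, and real browsers may expect extra headers such as Content-Location):

  # Sketch: build a bare-bones .mhtml container by hand.
  from email.message import EmailMessage

  msg = EmailMessage()
  msg["Subject"] = "Saved page"
  with open("page.html", encoding="utf-8") as f:
      msg.set_content(f.read(), subtype="html")
  msg.make_related()  # promote to multipart/related
  with open("figure.png", "rb") as f:
      # the HTML would reference this part as <img src="cid:fig1">
      msg.add_related(f.read(), maintype="image", subtype="png", cid="<fig1>")

  with open("page.mhtml", "wb") as out:
      out.write(bytes(msg))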


There's epub, a single-file HTML document format.


> I mean, how do I save it locally on one platform and read it on any platform?

Ctrl/Meta/Cmd + S should do the trick, or "File > Save page", and you get an HTML file you can open in any browser. If there are images, they'll most likely be loaded remotely, or worst case not load at all. But the rest of the structure is there.


> If there are images, they'll most likely be loaded remotely

Most sites reference images by relative paths, which won't work with saved HTML, and there's also CSS to consider.


A web page is much more than one file. Also, I'm looking for something with end-user control, where they can save the current document statically and long-term.


If both devices have internet, you share the URL. If not, see other replies.


Print it to a pdf


> There's no straightfoward, standard way to read an html document on another platform.

What do you think of the epub format?


I wish so much for it:

Despite all our advances, we lack an editable, local, multimedia, platform- (and form-factor-) independent, self-contained file: essentially a word-processing file for the 21st century (and I mean it's almost a quarter-century overdue). epub has that potential as a format, and being based on web standards it has the capability, a universe of supporting tools and technology, and easy adoption by different applications.

But I haven't heard anyone else express that particular interest, and as of a few years ago epub doesn't allow annotations and is not stable (i.e., I don't know that today's epub file will be readable in 20 or 50 years) - two essential requirements for serious local content, imho.

And even if it meets those specifications, we need epub editors that are the equivalent of word processors for non-technical users.


Unfortunately, I am from Iran so I can't use this new feature. I got a '403 Forbidden' message from the arXiv server. Worse than that, I totally lost my access to arXiv since they changed their CDN to Fastly, because the fucking mullahs don't like Fastly!


Taking a look at a paper I have that went up this month, and another that went up before the Dec cutoff on ar5iv, they look 90% OK! Figures with side-by-side plots and algorithm environments are the common culprits for breakage, though. Particularly in figures, it seems like the width argument isn't being interpreted correctly.

Interestingly, this review paper seems to have its side-by-side figures intact (e.g. fig 2, fig 4). Maybe it's because the author used a subfigure-like environment (judging by the subcaptions)?

https://ar5iv.labs.arxiv.org/html/1609.04747


For the image widths, there is some CSS fine-tuning that is still needed on the arXiv HTML side. I think that will get fixed soon, just needs the right height directive set.

Getting subfigures emulated via flexbox is one of our more recent LaTeXML enhancements, and still has some ongoing work (working on it today actually). It can be a bit finicky to test - there are easily 20 different ways people can write LaTeX for subfigures in arXiv.


Curious to see how well it will work. Does anybody here know a robust and not crazy computationally expensive solution to extract tables from fairly clean PDF files (especially non-English ones)?
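For a baseline, pdfplumber is one commonly suggested open-source option; a minimal sketch ("paper.pdf" is a placeholder, and results depend heavily on how the table was drawn in the PDF):

  # Sketch: table extraction with pdfplumber, which works off the
  # PDF's text and line primitives rather than any ML model.
  import pdfplumber

  with pdfplumber.open("paper.pdf") as pdf:
      for page_number, page in enumerate(pdf.pages, start=1):
          for table in page.extract_tables():
              print(f"--- table on page {page_number} ---")
              for row in table:
                  print(row)  # a list of cell strings (or None)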


So, I'm seeing a lot of chatter in the thread about LaTeX and converting that to HTML and PDF, with LaTeX as the superior single source of truth. Please keep in mind that many areas of science are practically allergic to LaTeX. I even have a colleague, a plasma physicist, who strongly encourages his team not to use LaTeX because a) collaborators get confused and b) it can be a massive time suck.


I agree with your colleague.

At my institution, all of the lowest-quality drafts I read are made with LaTeX. I think it's because the programs people use to write LaTeX do not have spelling and grammar checking. Also, the people who prefer LaTeX are the same types who are more interested in technical things than in spelling and grammar.


[dupe] from yesterday

More here: https://news.ycombinator.com/item?id=38713215


> Didn't see a toggle

You can run toggleColorScheme() twice in the console to switch to the light or dark theme.


This will be one of the most popular applications written in Perl, because it is based on the 20-year-old LaTeXML: https://en.wikipedia.org/wiki/LaTeXML


Fun fact: it seems that if you use Lockdown mode on Apple devices you can't open PDFs from a browser (no official documentation says it, but there is anecdotal evidence). This would allow people with Lockdown mode to open arXiv papers more easily.


Like the maths in Wikipedia's noscript/basic (X)HTML generator:

The magic of inline images at a known DPI; of course, you can provide images for different DPIs.

Reading maths/science noscript/basic (X)HTML documents on my 100 DPI monitor already works on Wikipedia. Not yet fully ready on arXiv.


What I would like is for arXiv to have an LLM rewrite all papers away from the stodgy, stilted language prevalent in every paper. Just write clearly, gang: use proper paragraph breaks and stop with the run-on sentences.


Personally, I would prefer the conventional Latin Modern math font instead of Palatino math.

Latin Modern is used by:

- Wikipedia
- Math.StackExchange
- Nearly all papers, including the ones hosted on arXiv in PDF format
- Nearly any math videos, slides/presentations, notes
- Almost everything, really

Palatino just looks weird.

Also, I imagine that authors might do math formatting hacks that were only tested on Latin Modern, and might end up breaking on Palatino.

TL;DR:

Palatino :(

Latin Modern :)


Hope they benefit from CDN caching now too.

Edit: aaaand they got Fastly https://news.ycombinator.com/item?id=38723373


What do they use to convert a PDF document to a clean, correct HTML document? It's a difficult space, especially with the variety of layouts you may find in PDF documents...


> The tool that it's being used for this offering is this one, https://github.com/arXiv/arxiv-readability, just to save a few clicks :)

https://news.ycombinator.com/item?id=38726582


arXiv encourages users to submit the LaTeX source of their papers rather than the PDF.


It will ease data scraping, automated meta-analysis...
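For example, a minimal sketch of pulling the title and section headings out of a paper's HTML version (URL taken from the example linked in this thread; assumes the requests and beautifulsoup4 packages, and the selectors may need tweaking as the beta markup evolves):

  # Sketch: scraping structure from an arXiv HTML paper.
  import requests
  from bs4 import BeautifulSoup

  url = "https://browse.arxiv.org/html/2312.12451v1"
  soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

  title = soup.find("h1")
  print(title.get_text(strip=True) if title else "no <h1> found")

  for heading in soup.find_all(["h2", "h3"]):  # section headings
      print("-", heading.get_text(strip=True))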


They should also add commenting capabilities under each paper; a good discussion will lead to more research and information discovery.


This is awesome! Push to Kindle (HTML to EPUB) isn't converting the page properly but I'm sure it's coming soon



And, of course, https://ar5iv.labs.arxiv.org/html

However, ar5iv isn't a la carte like arxiv-vanity. They pretty much process last month's papers every month or so, something like that.


Hi, ar5iv creator here.

You can think of both arxiv-vanity and ar5iv as the "alpha" experiments that led into the official arXiv "beta" HTML announced today.

Once a few rounds of feedback and improvements are integrated, and the full collection of articles acquires HTML in the main arXiv site, ar5iv will be decommissioned.

The plan is to turn all existing ar5iv links into redirects to the official HTML, and free up the resources for maintaining it. I am not sure what the plans are for maintaining arxiv-vanity, but I suspect it may head down a similar path some time later.


lmao! The actual creator of ar5iv? Sometimes I forget this isn't reddit and legit accomplished people comment here.

Reminds of Burning Man when people kept telling me, "Never talk trash on the art at the main landmarks. The artists are frequently within listening distance."

So, of course, I'd walk around talking about buying the art for $50K-$60k, knowing it's already scheduled to be burned with the landmark.


I was hoping this meant that HTML-native submissions would be possible, so that people could make interactive explanations.


With the 2024 browser update, this means I can read these articles on my ancient Kindle perfectly fine.


Saw it last night! I was sooo happy! Reading papers on a phone is a nightmare. Well done, guys!


This makes downloading and parsing paper data easy, which is pretty handy in the LLM era.


About time. bioRxiv and medRxiv have been doing this for probably half a decade at this point?


Wrong, arXiv was first. Check this HTML paper from 1997:

https://arxiv.org/html/astro-ph/9708066


medRxiv and bioRxiv get most of their submissions as Word files. It's a much easier conversion, and if necessary they have manual touch-up. Not feasible for arXiv's volume.


I wonder if this could be used to train an LLM to convert PDFs with rich charts into HTML?


I don't read many papers but this makes it easier for me to save them in Joplin.


Wow, this is _so_ much better!


Is there an open source tool to convert any PDF to something like this?


It sounds like (from the shout-out in the post) they're using https://math.nist.gov/~BMiller/LaTeXML/ to convert the paper's LaTeX into HTML, not from PDF.

The most versatile tool I know of for converting various document formats, including PDF to HTML, is the oss ebook tool Calibre: https://manual.calibre-ebook.com/conversion.html

I have seen https://pdfbox.apache.org/ used for extracting text from PDFs for analysis, but you won't get HTML output.


This is great! I browse papers on mobile, and PDF is so bad for that use case.


OMG. This is amazing. I legit hated reading two column pdfs on a smartphone.


I'm sad that the best they can do is HTML format. HTML is a mess.


nice! will make reading papers on the phone so much more pleasant!


That's great. Now I can read the papers on my phone.


This is a great UX addition. Why did it take them so long?


How would you do it quickly?

For example, HTML isn't divided into numbered pages while PDFs are. A lot of LaTeX interacts with page boundaries. Figures tend towards the tops of pages. And there's \clearpage. And the reference list might say which page each citation appeared on. All that stuff needs someone to decide how to handle it and then to implement that handling. Like... what value does \pageheight return? Sometimes I resize things to fit the page height, and if it were doubled then I should have resized to fit the width instead.


LaTeX is a very complicated programming language for creating documents. It is not easy to create a new backend for it.

As a glimpse into the very tip of the iceberg, the diagram at https://tex.stackexchange.com/a/158740/ is generated with 100% LaTeX code.


The conversion is still very error-prone. It can't convert a lot of packages, and in the last paper I read, StarVector, half the HTML version is just missing. (I think it hit an error at a figure of some sort.) I reported the error, but I've been reporting errors against ar5iv and the abstracts for years now, and the long tail of problems just seems like an incredible slog.


Can confirm. From an ar5iv standpoint, 2.56% of articles currently fail to convert entirely, and 22.9% have known converter errors. That leaves 74.5% of articles nominally usable. This success rate is noticeably lower for the newest batches of arXiv submissions, as the converter hasn't caught up with the most recent package innovations.

We have a plan in place to meaningfully fall back for unknown packages, but that will take at least another year to put in place, and likely another couple of years to stabilize.

Meanwhile, there is some hope that with arXiv launching the HTML Beta we will get more contributions for package support (LaTeXML is an open source project, with public domain licensing, everybody benefits).

But again the original point is spot on. Coverage will be hit-or-miss for a while longer yet, for an arbitrary arXiv submission. The good news is that authors could work towards better support for their articles, if they wanted to.


Where are the computer vision people? This is the perfect type of problem for multi-modal LLMs.


Except that the errors made by an LLM might be harder to spot than converter errors, which typically are very blatant and don't usually alter text (perhaps just drop parts of it).

Also, a bug in a converter is conceptually much easier to fix than re-training your LLM.

I am not sure that AI in its current state is useful when "high fidelity" is required.


Almost universally, we prepare conference papers as LaTeX files made to export to PDFs which fit within the conference's template.

It's nontrivial to export this to HTML in all cases, and even then, nobody is asking for HTML from us even though we all want it. I'm guessing arXiv is using some kind of converter which _usually_ but not _always_ works.

That said, this is a long time coming and PDF as the standard should've died a decade ago. I wish I had this when I was in my PhD program.


Because this is a rather conservative field with little dependency on the general public, and thus without much interest in helping disseminate knowledge broadly & accessibly (relative to other priorities, not in absolute terms).


Thank God. Maybe we can now adapt those for mobile?


Nice.... a website that offers even more web pages.


That's great news. I was using arxiv-vanity to read on mobile phones. I am not seeing it on all articles; is it only for new papers?


Reading papers on mobile now considered sane!


Very good decision, always bet on the web.


FUCK YES (excuse my profanity). I have a tool that converts HTML to Neural Speech and I always wanted to push arXiv papers through it, but couldn't be bothered with a PDF implementation.


Finally a modern format you can copy&paste from and read on one of the most popular computing platforms!!!


At this point are academic papers simply peer-reviewed blog posts?


For anyone interested in staying informed about important new AI/ML papers on arXiv, check out https://www.emergentmind.com, a site I'm building that should help.

Emergent Mind works by checking social media for arXiv paper mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks the papers based on how much social media activity there has been and how long since the paper was published (similar to how HN and Reddit work, except using social media activity, not upvotes, for the ranking). Then, for each paper, it summarizes it using GPT-4, links to the social media discussions, paper references, and related papers.

It's a fairly new site and I haven't shared it much yet. Would love any feedback or requests you all have for improving it.
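For the curious, the "HN-style" shape of that ranking looks roughly like the sketch below; the weights and the 1.8 gravity exponent are illustrative placeholders, not Emergent Mind's actual parameters.

  # Sketch of HN-style time-decayed ranking, adapted to count social
  # media "mentions" instead of upvotes. All numbers are made up.
  def rank_score(mentions: int, age_hours: float, gravity: float = 1.8) -> float:
      return mentions / (age_hours + 2) ** gravity

  papers = [
      {"id": "2312.12451", "mentions": 40, "age_hours": 30.0},
      {"id": "2312.12907", "mentions": 12, "age_hours": 3.0},
  ]
  papers.sort(key=lambda p: rank_score(p["mentions"], p["age_hours"]), reverse=True)
  for p in papers:
      print(p["id"], round(rank_score(p["mentions"], p["age_hours"]), 4))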


This is exactly what I was using HN for. But, yeah, it kinda sucked compared to yours. Another thing I was trying to create was some sort of NN model that could use the semanticscholar h-index of authors along with the abstract text and T5 to estimate the one-year-out citations. Just for personal use, though. That whole thing fell apart because semanticscholar is kinda crap at associating author links with the same author. I frequently ended up with the wrong professors, which I'd think would be easily fixable for them.


Just a note to say that factoring authors into the ranking system is high on my todo list. v1 won't be too fancy - just a hardcoded list of prominent authors whose papers warrant extra visibility. A future version will likely automate it to avoid the hardcoded list.

Also, soon-ish I'm going to add the ability for users to follow specific authors, so you can get notified when they publish new papers.


> Also, soon-ish I'm going to add the ability for users to follow specific authors, so you can get notified when they publish new papers.

If you could do it, this would be a dream. My original intent was to be able to look through only papers citing a popular one and filter the results for ones having at least one author with a set minimum h-index. Using Google Scholar data required using SerpAPI, which has some annoying limitations.

The core goal is obviously just not to miss out on a paper that will very likely be influential while not having to comb through the mountain of irrelevant papers.

What's funny is that Microsoft Academic was the best suited, but was retired in 2021.


I did that (used other features). This is how new papers are ranked here:

https://trendingpapers.com


Great site, thanks for sharing. Can you explain how you're determining how many times a paper is cited? Obviously papers include a list of references, but extracting them accurately from the PDF is difficult in my experience (two-column formats, ugh) - though the new HTML versions help. And even if you have a list, many authors just mention arXiv paper titles, not their ids, making it tricky to identify specific references.


Difficult, yes… but not impossible :)

I just extract the titles and look for their respective ids.

The real challenge was how to do that at scale. In CS alone there are well over half a million papers.
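For anyone curious, the per-title lookup step can be sketched against arXiv's public export API; this is a hedged illustration only, and the batching, caching, fuzzy matching, and rate limiting needed at that scale are omitted.

  # Sketch: resolve a cited title to an arXiv id via the public
  # export API (Atom feed).
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  def lookup_arxiv_id(title):
      query = urllib.parse.urlencode(
          {"search_query": f'ti:"{title}"', "max_results": 1})
      with urllib.request.urlopen(
              "http://export.arxiv.org/api/query?" + query) as resp:
          feed = ET.fromstring(resp.read())
      ns = {"atom": "http://www.w3.org/2005/Atom"}
      entry = feed.find("atom:entry", ns)
      if entry is None:
          return None
      # entry ids look like http://arxiv.org/abs/2312.12451v1
      return entry.find("atom:id", ns).text.rsplit("/", 1)[-1]

  print(lookup_arxiv_id("Attention Is All You Need"))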


FYI I started embedding the HTML pages in an iframe on Emergent Mind when the HTML version is available: https://www.emergentmind.com/papers/2312.11444 // should make it even easier to stay informed about trending papers


I've got a somewhat related question:

Is there a site that lists and rates the various LLM models on huggingface.co alongside their various applications?


That looks great. No real feedback yet, but it's the kind of thing I've always been looking for as a better alternative to Twitter.


Thanks! I've got a lot more planned for it too. If anyone has any feedback that doesn't make sense to share here, or if you're a researcher who is open to some questions about how you currently follow arXiv papers, drop me a note at matt@emergentmind.com.


Love the clean design of the website! Looks amazing on mobile.


Thanks! If you ever run into any issues or have any suggestions for improving the site, drop me a note: matt@emergentmind.com.


Would love to see a comments feature at the bottom there. Reddit / HN style

Love the concept though. Added it to my Home Screen on iOS


Thanks for the kind words, it's appreciated.

I might add comments down the road if there's enough interest and if there's enough traffic to warrant it. Don't want to add them just yet and have zero comments on everything and it look like a ghost town.

Keep the suggestions coming though as you use it more: matt@emergentmind.com.


Great site. Bookmarked it.

Would be nice if I could change timeframe. Top this week, month, year, all time.


I'm slowly adding older papers as I work out the kinks in the site. Down the road when the database is more comprehensive, this should definitely be possible.


Works in Chrome, but does not seem to work in Firefox.


Can you (or anyone experiencing similar issues) share any details about what's not working in Firefox? I tested it and all is well for me, though it's definitely possible there's an issue with some other version of it.


Love to see Emergent Mind continuing to innovate!


Probably more accessible in general. (PDF) Papers are psychologically scary.


PDF is by design an image format that can also embed text. It just doesn't have the primitives to properly retain the article structure.


Nah, it's a super-complex system that creates a graph of components, can draw vectors like PostScript, can embed 3D models, etc. The spec is here:

https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...

If you look at sections 14.6 through 14.10 you will find quite baroque facilities for representing the structure of documents in great detail, making documents with accessibility data, making documents that can reflow with HTML, etc. Not to mention the 14.11 stuff, which addresses problems with high-end printing (say you want to make litho plates for a book).

For that matter, sections 14.4 and 14.5 describe facilities that can be used to add additional private data to PDF files for particular applications. For instance, Adobe Illustrator's files are PDF files with some extra private data, and so is https://en.wikipedia.org/wiki/GeoPDF

I like to complain that PDF has no facility to draw a circle but instead makes you approximate a circle with (accursed) Bézier curves, but other than that, the main complaint people make about PDF is that it is too complicated, not that it is lacking this or that feature.

Contrast that to a highly opinionated document format like DjVu

https://en.wikipedia.org/wiki/DjVu

which came out around the same time as PDF and is specialized for the problem of scanned documents and works by decomposing the document into three layers, one of which is a bilevel layer intended to represent text. All three layers have specialized coding schemes; the text layer in particular tries to identify that every copy of (say) the letter "e" or the character "漢" is the same, and reuses the same bitmap for them.


Adobe can surely add whatever extensions they want to address whatever problem. But unfortunately, most implementations outside of Adobe Acrobat itself won't implement all of them. Most libraries just implement the basic parts for printing and marking (at best, supporting forms and JavaScript). The rest basically doesn't exist for anyone.

There is a reason that most people still use docx for forms even though PDF technically supports forms.

PS: The PDF readers in Firefox and Chrome didn't really support forms until quite late versions.


You would normally use a library to create the PDF so you don't need to deal with the complexity of the format. A library would likely provide a function for drawing circles that translates the circle into Bézier curves.
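For the curious, the standard trick such libraries use is to approximate each quarter circle with a single cubic Bézier whose control points sit at offset k = 4/3 * (sqrt(2) - 1) ≈ 0.5523 times the radius; a self-contained sketch of the geometry (not any particular library's API):

  # The classic 4-cubic-Bezier circle approximation a PDF library
  # would emit; max radial error is about 0.02% of the radius.
  import math

  K = 4 / 3 * (math.sqrt(2) - 1)  # ~0.5523

  def circle_as_beziers(cx, cy, r):
      """Return 4 cubic Bezier segments (p0, c1, c2, p1) tracing a circle."""
      k = K * r
      e, n, w, s = (cx + r, cy), (cx, cy + r), (cx - r, cy), (cx, cy - r)
      return [
          (e, (cx + r, cy + k), (cx + k, cy + r), n),  # east  -> north
          (n, (cx - k, cy + r), (cx - r, cy + k), w),  # north -> west
          (w, (cx - r, cy - k), (cx - k, cy - r), s),  # west  -> south
          (s, (cx + k, cy - r), (cx + r, cy - k), e),  # south -> east
      ]

  for segment in circle_as_beziers(0.0, 0.0, 10.0):
      print(segment)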


I am glad to see a sans font being used, rather than trying to replicate the serif font from the original papers. It's a bit narrow and fuzzy on low resolutions, but a massive improvement just by switching to sans.


We detached this comment from https://news.ycombinator.com/item?id=38724925.


PDF is objectively much better than HTML at rendering text documents, and it's not even close. This could easily have been done 10, even 15-20 years ago; that it wasn't is not just inertia. LaTeX and PDF have enormously better text rendering, and the static format locks a commit of state in time that is much easier to go back to and reference/critique, unlike the intrinsically fluid nature of HTML. For academic work, milestone-like formats that lock state in time are useful for those who later build on them. And again, the rendering just doesn't compare, and that imparts [sub]conscious quality signals.



