ArXiv now offers papers in HTML format (arxiv.org)
1204 points by programd on Dec 21, 2023 | 312 comments



Since the article doesn't link to an example HTML paper, here's a random link:

https://browse.arxiv.org/html/2312.12451v1

It's cool that it has a dark mode. I didn't see a toggle, but it renders according to the system setting.

Overall, this will make arXiv a lot more accessible on mobile.


And here's the PDF of the same paper for comparison: https://arxiv.org/pdf/2312.12451.pdf


The contrast is massive. I'm much more likely to read the HTML version; that PDF is deeply off-putting in some hard-to-define way. Maybe it's the two columns, or the font, or the fact that the format doesn't adjust to fit different screen sizes.


This is very interesting, because for me it's just the opposite. In particular the two column layout is just more readable and approachable for me. The PDF version also allows for a presentation just as the authors intended. I guess it's good that they offer both now.


Do you work extensively with LaTeX?

Two columns is good, albeit annoying on mobile. But the font. The typeface kills me, and almost every LaTeX-generated document sports it.


Hilariously, I would probably tolerate the HTML version a lot better if it had the font from the PDF (and FWIW, the answer for me is "no: I don't work with LaTeX at all... I just read a lot of papers").


If you disable the font rule by commenting it out:

  :root, [data-theme=light] {
    /* --text-font-family: "freight-sans-pro"; */
  }

it switches to "Noto Serif", which is way easier on the eyes.


I hard-override the font in the browser; designers never get it right.


what is your font of choice?


Verdana


https://github.com/neilpanchal/spinzero-jupyter-theme /fonts/{cmu-text,cmu-mono} :

> "Computer Modern" is used for body text to give it a professional/academic look


Hating on Computer Modern (ok, probably now Latin Modern) is something close to blasphemy.


Computer Modern was not designed for easy viewing on screens (think about the screens Knuth would have been using in 1977), it was designed for printing in books.


I hate Computer Modern, and I'm not even particularly fussy about typefaces.


What device and app are you using to read the document?


The authors don’t format the PDF; the editor does. The authors probably sent a double-spaced Word document with the figures and tables in another file.


Not on arXiv (unless I'm much mistaken), which is a preprint server, not a conventional journal.

arXiv accepts various flavors of TeX, or PDFs not produced by TeX [0], and automatically produces PDFs and HTML where possible (e.g. if TeX is submitted). In the case of the example paper under discussion, the authors submitted TeX with PDF figures [1], and the PDF version of the paper was produced by arXiv. The formatting was mainly set by using REVTeX, which is a set of macros for LaTeX intended for American Physical Society journals.

[0] https://info.arxiv.org/help/submit/index.html#formats-for-te... [1] https://arxiv.org/format/2312.12451


FWIW, I recently learned that it is also possible to produce nice PDF papers with GNU roff (groff), have a look at this example: https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....


Looks nice, but it seems strange to switch from two columns to one after the first page? Although maybe they're just trying to demonstrate its capabilities.


W. Richard Stevens (RIP, still hurts) famously used troff for his books.


You typically send a .tar.gz of .tex files (and figures, .bbl, etc.) to the journal. And then you typically upload something very similar to the arXiv (I have an arxivify Makefile target for my papers that handles some arXiv idiosyncrasies, like requiring all figures to be in the same folder as the .tex file, and it also strips all the comments; sometimes you can find amusing things in the source file comments of some papers).

Some fields may use Word files, but in most of physics you would get laughed at...

It is true that most journals will typically reformat your .tex in a different way than is displayed on the arXiv.


In computer science, the usual case is that the author fully formats the paper.


Not only is this wrong about physics/astronomy, but I regularly use the arXiv version because the typography is better (e.g., in the published paper an equation is split, with part of it at the bottom of one column and the rest at the top of the next, whereas the equation is on one line in the arXiv version).


You are very confidently wrong.

In the arxiv you use latex and do everything yourself. There is no editor.


You are completely wrong. ArXiv doesn't work like that.


For what it's worth, two-column layouts are very common in the physical sciences, or at least in physics, which I'm more familiar with. I have a feeling that the reason is at least partly to save page space when using displayed math (e.g. equations that are formatted in a break between blocks of text), which uses the full text width (i.e. the width of one column) to display what may be much less than half a page wide.


It makes sense - for paper. But pixels are infinite - HTML is far better for screen display, which is how people read things nowadays.

The extra column next to the one I'm reading introduces a lot of visual noise, and the content is hard enough as it is. I'm sure physicists have all gotten used to it, but it certainly trips me up.


> The extra column next to the one I'm reading introduces a lot of visual noise

Papers are generally not read start to finish in one go: there's lots of rereading and jumping back and forth between key parts, and anything that moves them further apart makes this harder.


Ah, that makes more sense. I imagined scientists just reading the whole thing start-to-finish.

I still think a flexible layout is best. If you like multi-columns and have a wide screen, why not display 12 columns next to each other?

With PDF this is not possible. With HTML the content can in principle be sliced and diced how you like it.


One can also view PDF pages side by side, which works pretty well with a 4K monitor.


I need to scroll up and down a lot more with two-column layout because a single page doesn't fit on my screen in my chosen font size (which is fairly large).

But HTML is so much more flexible, and ideally people can choose how they want it, although at this point it seems that's not (yet) implemented.

I find jumping back and forth is always a pain on computer screens and ebooks by the way, and is the major reason I much prefer print for this type of thing.


Two column is the default in astronomy also.


Definitely the two columns for me. It's super annoying skimming a paper and having to scroll down and back up again in a zig-zag pattern.


I think the consuming device matters. An iPad or computer has a much wider screen. A one-column layout is too wide there for average people to scan text lines quickly.

Meanwhile, it looks perfectly fine on a phone. A two-column layout looks terrible on a smartphone; the text is too tiny to read comfortably.

It would probably be even better if you could flip pages left and right like an ebook instead of scrolling, to locate content faster. But the current design is good enough IMO (compared to reading a PDF on a phone).


To display a two-column layout you need a tall screen, not a wide one. If you display a two-column layout on a short, wide screen, you have to scroll up and down in a zigzag pattern to read one page.


Just zoom the smartphone into one column. Problem solved.


And then you have to scroll both top-to-bottom and left-to-right, an even worse experience.


It's about "one-column layout is too wide": if you zoom, it's not too wide anymore. Also, smartphones have narrow screens, not wide ones, and tablets can do that too AFAIK.


Scrolling like that is not hard in smartphone format imo


If you read a lot of papers in your line of work you will quickly appreciate the two columns and justification.


Only problem is jagoffs like me who need the text to be bigger. On PDFs you now get to experience a horizontal scrollbar. HTML has text reflow and I can set the line length by resizing the window. I'm willing to make a lot of sacrifices for that experience.


Admittedly, I don't read research papers. But with HTML, surely the choice between one or two columns is a checkbox away.


Which checkbox?

I cannot find anything relevant in any of the 3 browsers I use (Vivaldi, Firefox, Chrome). Would really appreciate this option.

A quick search gave some apparently unmaintained browser extensions, and that's it.


No, I'm saying there should be a checkbox. That way, you can switch between two columns formatted like LaTeX and that font they always use, and one column with Helvetica / Arial.


It would be nice, but I am not holding my breath.


I wonder if perhaps it's a generational thing; I prefer the PDF because it reminds me of printed paper, which is what I used growing up.

(For reference: I am at the end of Gen X, people 3-4 years younger than me are considered Millennials).


Quite so. The font annoys me. This is one of the reasons I hate PDF and why I believe these things should be controlled by the person reading it, not the publisher.

I do not much care what font the author finds pleasant to read, but what I find pleasant to read, and this font isn't it, and neither are the colors.


Seconded. I can (and will) actually just read referenced papers now, instead of hesitating between getting a headache and staying uninformed.

Defaults and UX rule the world. It's unfortunate that $subj wasn't a thing for so long; it probably scared millions of curious minds away from the material. It is so important.


It feels quite standard for a paper


defo concur. will read the html version when on mobile from now on.


I prefer the pdf version, mostly. I can annotate it on the side both in print and digitally with my iPad. I can also invert colors in pdf readers to get some kind of “dark mode” easily.

The html version is wasting a lot of space on the right side and the color scheme is awful (dark grey on a brown background, seriously? How is that any better? Edit: disabling dark mode yields a better reading experience wrt color scheme). Also, somehow links to references make another http request and have no backlink?

The html version could make sense if it had more dynamic functionalities: change fonts/line spacing, toggle color schemes, maybe a mini map or some other navigational tool? Also, some kind of support for highlighting and/or annotating?


It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

- I can imagine authors feeling frustrated if someone reaches out about a problem in the HTML version of their paper, but they have no way to correct it except by hoping that a change to the PDF fixes a change to the generated HTML. Easier to just fix the formatting problem in the PDF outright.

- It would be neat to allow people to experiment with alternative formatting for their papers. For example, imagine a paper about a programming language that embeds a sandbox you can use to play around with the language under discussion. Or a paper about multivariable calculus and you can interact with a three dimensional plot of some function.


No, it would not. It's critically important that there is only one "logical" article, albeit with different representations. In other words, a single "source of truth".

With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.


Maybe some day for some papers HTML could be the source of truth instead of LaTeX. After all, the original use case for HTML and the web was academics. The HTML and CSS specs have evolved a lot since then, with support for the typesetting features you need for papers (justified text, hyphenation, page breaks, page numbers, ...) and even math formulas are possible now again natively with MathML thanks to Igalia. Diagrams can be accessible vector SVGs instead of raster images. Referencing, linking, citing, figures, tables, etc have always been native to HTML. It's trivial nowadays too to wrap a headless chromium in a CLI to convert an HTML document to PDF rendered in the exact same way that the browser would (i.e. not some static conversion tool that lags behind standards or has render implementation differences).


> With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.

Is it not possible to write LaTeX code that produces different contents in HTML vs. PDF?


Well, perhaps by exploiting bugs/shortcomings in PDF and HTML converters. Not by design.

However, bugs get fixed, and since the PDF and HTML are generated dynamically, any such hack would be extremely fragile.

And while "single source of truth" can help to prevent such malicious discrepancy, it's unlikely that people would try to hack the system this way: what for?

Far more likely scenario is unintentional discrepancy, and single source of truth definitely helps to prevent that!


Straight from ChatGPT:

Yes, it is indeed possible to write LaTeX code that produces different contents when compiled to HTML versus PDF. This is typically done by using conditional commands within the LaTeX document that check for the output format being used. These conditional commands can then include or exclude specific content based on whether the document is being compiled to HTML or PDF.

In LaTeX, the ifpdf package is commonly used to check if the output is being compiled to a PDF. For generating HTML from LaTeX, tools like TeX4ht or LaTeX2HTML are used, and they often define their own specific commands or provide a way to detect the output format.

----- It gives simple code that uses:

The \ifpdf ... \else ... \fi command checks if the document is being compiled to PDF. If it is, the content between \ifpdf and \else is included. If not (which would be the case for HTML), the content between \else and \fi is included.

The content outside the \ifpdf ... \fi conditional will appear in both the PDF and HTML versions.
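
For illustration, here is a minimal sketch of the kind of conditional it describes (assuming the ifpdf package; whether a given LaTeX-to-HTML converter actually takes the \else branch depends on how that tool runs TeX):

  \documentclass{article}
  \usepackage{ifpdf}
  \begin{document}
  \ifpdf
    This sentence appears only in the PDF output.
  \else
    This sentence appears only in non-PDF output (e.g. HTML via TeX4ht).
  \fi
  This sentence appears in both versions.
  \end{document}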


Huh? What's the point of the HTML version if you define it as a source of deception?


> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Please don't. Then you will have a mismatch between the source and the "own html" which ruins the point of uploading the source.


Pdf isn't the source


But the PDF is also generated. LaTeX is the single source of truth.


They’d have to define and document a “safe” subset of HTML, and implement a filter/checker for it. Otherwise we’d end up with papers containing ads and tracking and XSS vulnerabilities and whatnot.


Those are issues with JavaScript, not HTML. Wouldn't filtering out iframes pretty much keep us in the clear?


The parent wanted interactive 3D plots, which means JavaScript embedded in or linked from the HTML. Then there's stuff like JavaScript embedded in SVG.


> Those are issues with JavaScript, not HTML

What about various HTML tags that load remote resources? From script and link to things like img, or the CSS background-image property added via a style attribute.

There are a bunch of ways to make remote requests even without JavaScript.


The same problem exists in HN comments. This comment gets converted to html.

   But it is fine!


"gets converted to" and "gets rendered as uploaded by the user" are two different things.

There are no issues with arXiv generating the HTML and sending that over: they control the generation process, and users who visit arXiv already trust it to not be malicious. The issue is with letting the user upload their own and having it sent on to other users as is.


Most authors probably have no interest in learning HTML. Also, most authors want nothing to do with the work by the time it's submitted. It was probably hell getting the project to the point of publishing; they want to be done with it and move on to the next thing in their career ASAP.


I think this is an argument in favor of doing automatic PDF -> HTML conversion for the authors that don't want to touch it, but I don't think it's an argument against letting those who are fine with HTML provide their own.


HTML is not generated from PDF. Both PDF and HTML are generated from LaTeX.


Probably only a small percentage of people are using latex today. I’ve never personally seen it used. Just MS word docs sent to coauthors then to the paper editor.


You hit on an unappreciated truth. By the time my papers appeared in print, I was so sick of them and the endless effort involved in taking them from raw data to finished, edited, proofed, rewritten a zillion times to meet the reviewers' and editors' requests and corrections and suggestions, that I didn't even read the published paper when it arrived as preprints and in the journal.

Enough!

My proof: https://scholar.google.com/citations?user=5DdrMc8AAAAJ&hl=en


I was under the impression that what authors publish to arXiv was a LaTeX file.


Ah, thanks for clarifying!

I looked up the submission formats, and it looks like if you authored the paper in TeX/LaTeX, they do not accept pre-rendered versions of the document.

https://info.arxiv.org/help/submit/index.html#formats-for-te...

But if you did not author it in TeX/LaTeX (e.g., Word, Google Docs, etc.) it appears you can upload a PDF or HTML yourself.


But it's still a single source of truth. Only one document is submitted. So for works submitted as HTML no PDF or LaTeX version is available.


It is.


> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Can you recommend a system I can use to compile my LaTeX, while also making sure the HTML is going to look good? I'd like some kind of CSS-style @media queries to switch between certain parts of the layout, while keeping a single LaTeX file.


With the shelf life of web technologies, authors would constantly have to maintain their "papers" or they just would not be accessible after a while.


Knuth’s stated intent in maintaining TeX is only to fix bugs, not evolve the system in a way that might break old documents. Not sure if this is equally true for Lamport’s LaTeX macros but it wouldn’t surprise me.


Plain HTML from the mid-90s still renders and looks as good as it ever did.

I think CSS is also backwards compatible.

It is the JavaScript bits that change.


The tool being used for this offering is this one: https://github.com/arXiv/arxiv-readability, just to save you a few clicks :)


Wow I did not know they have the LaTeX for all the papers and compile it themselves! That's pretty crazy. What if they don't have packages you need? What if your paper isn't written with LaTeX?


> What if they don't have packages you need?

Unlikely. But if so, you can provide the packages yourself: https://info.arxiv.org/help/submit_tex.html#wegotem

> What if your paper isn't written with LaTeX?

Then they still accept PDF or HTML. See: https://info.arxiv.org/help/submit/index.html#formats-for-te...


They specify what version of texlive they use. This is significantly better than what publishers offer (usually a really old latex version, not even pdflatex).


That's it in spirit, but in practice it's refreshed:

https://github.com/arXiv/arxiv-view-as-html


I wonder how this compares to Pandoc's output.


For anyone who needs it, arxiv-vanity is amazing: https://www.arxiv-vanity.com/



It's a cool feature because it makes the papers more findable, more easily navigable, easier to read online, and faster to scroll through. I am also happy for blind people, who can more easily use arXiv with braille readers now.

(I'm still a fan of printing the PDFs, because I annotate on paper and refer to page numbers, but the HTML feature is in addition to PDF download, not a replacement.)

One thing that still sucks (not ArXiv related though) is reading mathematical formulae on the Kindle - wonder if someone with rendering expertise could have a look into the MOBI format.


This would never happen, but in an ideal world we should be able to click on a citation to jump to the part of the paper being referenced, and each paper page should have a discussion board so we can easily communicate with the authors and group the discussion in one place, instead of having to google to see if there is relevant discussion on Twitter/Reddit. We could even put links to talks, tutorials, blogs, the GitHub repo, demos, paperswithcode/Google Scholar/OpenReview, background material, and a timeline of citations in tree form on the same page (actually, I am seeing more machine learning papers that have a project page doing some of this), or even turn it into a mini wiki. I just think HTML has so much more potential (especially now that, with LLMs, we can do semantic search). I wonder if there would be interest in such a Chrome extension overlay.

Related projects:

https://github.com/ahrm/sioyek

https://github.com/arxiv-vanity/engrafo

https://github.com/dginev/ar5iv

https://academ.us/article/2111.15588/ (powered by https://github.com/jgm/pandoc I believe)


I think https://web.hypothes.is/ would be of interest to you.


This is excellent news. Their HTML formatting is also more pleasant than the HTML articles offered by most journals in my field (e.g. arXiv HTML footnotes are displayed as sidenotes on large displays!)


One of the reasons is to make the papers more accessible to people with disabilities, especially the blind. I participated in a conference they hosted on this a few months ago; I recommend taking a look at the recordings if you're interested in thinking on this.

https://accessibility2023.arxiv.org/


Blind person here, can confirm this. Reading PDFs with a screen reader is bad, reading PDFs that come from LaTeX is worse, reading LaTeX math is pretty much impossible. All the semantic info you need is just thrown away.

You can make decently accessible PDFs, but it's lots of work; you need Acrobat on the producer's side and might also need it on the consumer's side. Free tools don't even come close. There's also the fact that the process of making accessible PDFs in Acrobat isn't itself accessible.

With that said, the way screen readers treat HTML math certainly isn't perfect, it's geared more towards school children than anything above calculus. I'm probably going to stay with my LaTeX source files for now. At least ArXiv offers those, not many sites do. To be fair, that approach also has its own set of problems (particularly when people use some extra fancy formatting in their math equations, making the markup hard to read), but I find this to be the best approach for me so far, at least on AI/ML papers.


I teach math at a university. A couple years ago I had two blind students in my section of first-year calculus, and I really struggled with the tooling. Using latexml, I could produce documents that one of the students could use with a screen reader, but the other student never managed to make it work on their machine. Both students prefer braille but I didn't find anything open source that could typeset mathematical braille easily. Our disability resource office sends things out to a contractor to typeset into braille; the turn-around is measured in weeks.

Anyway, if you (or anyone else reading this) has suggestions I'd really appreciate it!


I learned (the basics of) LaTeX in my last year of middle school, and stuck with it ever since. To be fair, I was into computers since I was a child, played with Rockbox at the age of 10, started to dabble in programming shortly after, so this was a lot less scary than most of the things I was doing already. I took my middle and high school finals (they're kind of like SAT but matter a lot more) by producing LaTeX output, which I then compiled to PDF and printed. The test itself was in braille, as that was all that our government could do.

Throughout college, my first question to most of my professors of math subjects was "do you do LaTeX, and can you give me your source code." Most said yes, and that's how we worked. LaTeX in, LaTeX or PDF out, depending on what the professor preferred.

The amount of LaTeX you need for calculus 1 isn't that great, you could probably teach it to a relatively bright student if you had an hour or two to spare, and then give them the source files. If you have the time, I'd suggest producing "stripped" versions of your files, with as little markup as possible to get your point across and no fancy formatting unless absolutely necessary. The amount of hoops some books and papers jump through to "look nice" drives me crazy.

You could also consider producing, teaching and consuming ASCII math, which seems like an even simpler and friendlier format. I couldn't really use it much in my school career for boring technical reasons, but it looks like a promising option.


Thanks for the suggestions! When you LaTeX your work to turn in, do you work only with the source, or do you have a good way to read the PDF output? I agree the amount of LaTeX needed for calculus is pretty minimal.

One of my students was taking chemistry at the same time, which is (I think) much tougher for blind students. But they also had more teaching assistants for the course.


I don't interact with the PDF output myself, but I can compile and email PDFs if I need to send work over to people who do not wish to receive LaTeX themselves, a fact I used throughout most of my high-school education, where LaTeX knowledge was rare. This is why I eschew formatting where possible, I can do enough to make my symbols look right and be understandable to a sighted reader not familiar with LaTeX, but not necessarily to make things extra pretty. Not actually seeing the output makes it a lot more difficult to check your formatting work.


> Our disability resource office sends things out to a contractor to typeset into braille; the turn-around is measured in weeks.

This seems a massive gap in the market - many institutions have funding earmarked for such things.


I wonder if this is a useful service that an llm could actually outperform humans on.


Interesting! I never thought about this, thank you for sharing.

What kind of turn-around time would be practical? Could you point me to any typeset mathematical braille that would be an example of a solution to your problem? Is Nemeth the only important standard, or are others important for you too?

I'm wondering if it's practical to set this up as back-office work here in Vietnam. There are some outlying provinces here where there are very few job opportunities. Job opportunities for the blind also round down to zero here (e.g. I could hire for proofreading). Maybe there's room to do something cool here.


How's English proficiency (and American braille code proficiency) like in Vietnam?

Keep in mind that most blind people who speak English fluently but don't live in an English-speaking country (myself included) can't read English braille, or at least not well. Because of how voluminous Braille is, it uses contractions, single characters that replace common words and character combinations like "the", "would", "ing" or "ed". Those tend to be language specific, never taught outside their country or countries of use, and hard to get accessible electronic materials for. The math codes are completely different too, we use something derived from Marburg, while English-speaking countries use Nemeth. Even basic characters like + and - differ between those two, not to mention more complicated structures. It's not just the dot patterns that are different but also the design principles, like where you put spaces or when you can omit "begin fraction" / "end fraction" characters.


Our textbook of choice didn't have a braille version, so we sent it out to be converted one chapter at a time. Since textbooks don't change often, a turn-around of weeks is not so bad if we knew the students were going to be in the course.

What would be very useful for me to be able to typeset myself are small things -- homework, quizzes, and (to a lesser extent) exams. Since homework and quizzes often have to adapt to what I actually covered in class, which may or may not match the syllabus, it's hard to rely on sending these out to be typeset by others. (Exams are a little easier since they're usually done days ahead of the actual date.)

AFAIK Nemeth is the only standard that matters. If I can typeset a document, send it to the student, and they can get it on a braille display (no need for this to be on paper), it would solve a ton of problems.


Two decades ago, when I was still in university, I made the argument that PDF is a horrible format because it's purely præsentational, especially for people with disabilities whose software relies on semantic information. Last time I used LaTeX, it didn't even have a different symbol for uppercase Alpha and A, because the glyphs are indistinguishable.

They argued that PDF was superior because the publisher could control how it looked, and it looked the same everywhere, but the point is that it should not. Things such as font size and line spacing should be in the control of the consumer, not the publisher. This isn't simply about blind people, but for instance also persons with dyslexia who use particular fonts to make reading easier for them. Or, in my case, someone who simply gets a headache from fonts and line spacing that are too big. I've also been using dark mode everywhere for so long now that reading black text on a white surface on a screen gives me a headache.


To write an uppercase Alpha you need a modern version of LaTeX (i.e. xelatex or lualatex) and to include the unicode-math package.

https://tex.stackexchange.com/questions/485593/how-to-write-...
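
A minimal sketch (assuming a unicode-math version that defines the \Alpha command):

  % compile with lualatex or xelatex
  \documentclass{article}
  \usepackage{unicode-math}
  \begin{document}
  Greek capital alpha: $\Alpha$ (U+0391); Latin A: $A$ (U+0041).
  \end{document}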


For scientific articles pagination is still important, because it's how you refer to a particular part of a paper. If things like font size and line spacing are at the control of the consumer, pagination is not preserved.

This problem is harder than one would naively think.


Seems like they should use detailed section numbering, like military documents and laws. Referring by page number seems very coarse by comparison.


This would require a change from the currently near-uniformly adopted standard.

The problem with this: you need to create a new standard, get everybody to agree to it, and get busy scientists who are concentrating on content and not representation to adopt this new standard in their writing, essentially requiring them to change their habits and spend extra time on writing (which many of them hate), for no obvious gain from their point of view.

I am not saying it's not possible, or not worth it, but it is not easy and simple either.


Very hard, as it carries over from the physical paper world... and even then you have to make sure the version is right, as page numbers change.


No, the problem is very easy: referring by page number is simply ridiculous. As are all those "(<Family Name>, <year>)" citations.

Besides, in HTML one can directly link to the relevant part.


I am afraid you are being naive... You see only one factor out of many.

Being able to link directly to the relevant part is irrelevant (pardon my pun!). Such links are machine-readable, not human-readable. Scientific text need visual citations and being able to name the referred part for reading comprehension.

And Harvard-style citations (AKA name-date) exist for a reason: when you read a paper, even in an interactive format, it helps when you can recognize citations to certain papers without having to click on them or memorize numbers.

Other styles have their own advantages and disadvantages; that's why they all exist and are used by this or that journal, and no consensus on a single "right" style was ever reached.


I wrote an app called PDF Reflow that reflows the original PDF using image processing to cut out words into tiles so you see the reflowed version of the text in their original look.

https://www.appblit.com/pdfreflow


Any chance of releasing an Android version?


Gv (part of ghostscript) used to do a good job of this for two column documents. When zoomed in to show one column width of text, the spacebar ran through the top of column 1, then the bottom of column 1, then the top of column 2 and so on.

The amount it scrolled probably depended on the aspect ratio of the window, so it might be multiple key presses to scroll an entire column.


It’s using web technologies so yes it could also be on Android. I’ll see what can be done.


+1


+1


Do you think there's potential for language models to play a role here? I know that AI can get tossed around as a buzzword, but hasn't it proved quite successful in fields like computer vision?

I'm not deeply familiar with the state of that art, but it seems like recovering the metadata from a PDF generated by LaTeX would be no more impressive than many other things we're currently seeing language models achieve?


You wouldn't need to use computer vision on a picture of the PDF. arXiv has the tex source for most of the papers. An LLM trained on code could do a pretty good job of translating tex to readable html with a bit of effort.


I'm absolutely positive a few million dollars could get you a system that can "read aloud" pdf math papers in no time. I guess people will wait for it to become cheaper though.


You can also have that cheaper already. But having it stable and reliable - will take some time and possibly more money, depending on your definition of reliable.


Mathpix is trying to achieve something like this, and they do consider the visually impaired market AFAIK, but it's pretty expensive and I have no experience with it personally, so I can't say how good it is.


Hold on... Are you telling me that all these complex sentences are being typed out based on your voice alone? That's insane.


I'd say it would be simple to talk-type these using Windows 11's redux of voice typing. Pretty damn accurate and easy to modify/vary text/options. I use it all the time to make tech/engineering blog posts; it's faster and more organic than typing, typically, and it learns your technoacronyms. Combined with Voice Access, it makes it trivial to fully operate your computer (well, at least browse the web, email, and media apps) from across the room. For anyone who hasn't tried the updated version, I highly suggest hitting Windows key + H and giving it a shot.


Hm, tangential question, but shouldn't touch typing be quite accessible to many blind computer users?


? blind people can use keyboards


There are braille keyboards too


Or normal keyboards? Many people can type blind. Some learned to do so while born blind, others became blind after they had already learned this skill.

I would assume that the majority of persons on HN are not looking at their keyboard as they type.


I was just giving an additional way to use a computer not known by many. Either way, we shouldn't rely on the skills of a few to interact with a computer.


For the math equations, I'm curious: does MathML do any better for you than LaTeX?


Not the person you’re asking the question to, but it’s worth noting (if you don’t already know) that MathML is really not designed at all as an input language for practitioners who just want to write a few equations in some document. It’s designed as an output/presentation language so that devices that want to render some maths can do so faithfully[1]. As such, if you’re a human being who wants to typeset some equation, you’ll want to go to latex every single time rather than mathml and then someone else has to figure out the conversion.

[1] Great explanation here https://tex.stackexchange.com/questions/57717/relationship-b...


On the other hand, the "semantic" flavor of MathML (as opposed to "presentation") is much easier than TeX for things like screen readers, both conceptually and in practice.


Huh. It would seem like, of all the things which should make it easy to generate the correct accessibility information, the pipeline of compiling a paper from source code in LaTeX should nail it... maybe we should all pitch in to some pool to pay someone to put in the required effort to connect all the dots?


Surprisingly it’s not easy, and depending on the field it can be quite challenging. The reason for this is that TeX captures the visual aspects of typesetting, not the semantic meaning of the mathematics.

A simple example is ‘\sum’ which provides no way to capture the expression being summed over - because that’s not necessary for typesetting. That’s not the case in, say, MathML.

Writing MathML is no fun though because mathematical formulae are visually ambiguous and we rely on the context to know how to read them, e.g. does ‘f(x - 1)’ mean function f called with argument x - 1, or does it mean variable f multiplied by x - 1?
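
A toy illustration of that gap, in plain LaTeX:

  % TeX only records what happens to follow the sum sign:
  \[ \sum_{i=1}^{n} a_i b_i + c \]
  % nothing in the markup says whether c is inside the summand or not

Content MathML, by contrast, wraps the summand in an explicit element.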


Kind of tangential, but it's also kind of surprising how difficult it is in LaTeX to make a plot of an equation.

Say I have Equation \ref{eq}. Why can't I just say "plot \ref{eq} for x from -6 to 11" and get my graph?

And yes, I know about pgfplots, PSTricks, TikZ etc. But in all those cases, I need to define the same equation twice, in different syntax to boot. It's kind of unsatisfying.
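
To make the duplication concrete, a minimal sketch (assuming \usepackage{pgfplots} in the preamble; the equation is just a placeholder):

  \begin{equation}\label{eq}
    f(x) = x^2 - 2x
  \end{equation}
  \begin{tikzpicture}
    \begin{axis}[domain=-6:11, samples=100]
      % the same formula again, retyped in pgfplots' expression syntax
      \addplot {x^2 - 2*x};
    \end{axis}
  \end{tikzpicture}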


TeX is a very arcane language, and it doesn't support floating point numbers. Few languages would be less suited for making a plotting library.


pgfplots, PSTricks, and TikZ are all plotting libraries. It seems like it shouldn't be that hard to let them plot an equation written elsewhere in a different syntax.


> Say I have Equation \ref{eq}. Why can't I just say "plot \ref{eq} for x from -6 to 11" and get my graph?

Pretty much for the same reason you cannot press a word and get a pop-up dictionary definition in a paper book.


To be clear, I meant in the LaTeX source code. And there I can already write code that plots equations, I just have to re-type the equation in a new syntax.


TeX is about representation, not semantics, by design. To do anything useful with a function (like plotting) you need to get semantics.

An often cited example: what is f(x+y) ? Is it function f with x+y as its argument, or constant f multiplied by (x+y) ? TeX gives you no clue.

Or what is this i in your equation? Is it an index variable, or the square root of minus one?

You as a human figure this out by looking at the context and using domain knowledge. So does a "TeX to HTML/MathML converter". It is ultimately built on heuristics, and cannot be otherwise.

That's why I said basically "for the same reason a paper page is not interactive". It was designed this way!

The goal of TeX was to generate beautiful printed page. The need for semantic structure was not anticipated. To do semantics you need a "semantic version of MathML", or a language used by Wolfram's product, etc.


Yup, LaTeX math doesn't make sense. I've been trying to hack my way into getting a voice model to read it, but no real progress.


LaTeX is a programming language for generating beautiful pages, basically a typesetting system. It serves this purpose fantastically well.

It was not designed to provide semantic information, unfortunately. So getting anything other than visual representation out of it is hard.


Emacs with Emacspeak has a math reading module.


For accessibility purposes (and regular reading), it would be so much better to drop the justified text. Ragged edge is the way to go!

https://www.boia.org/blog/why-justified-or-centered-text-is-...



Perhaps someone can publish a paper to arXiv that provides a meta-analysis. But still there doesn't seem to be a clear reason to justify it, given that almost all internet text is not justified.


To me one of the exciting aspects of HTML is that we can theme the same article in different ways, tailored to individual preferences - just swap in a different CSS file.

Having a two-column theme, or left-aligned vs justified themes, could be workable in the long run. I hope that we get to see some browser extensions modding the pages before too long.

The reason for the current justified text is that it is the default aesthetic for a LaTeX-based article, and a lot of authors expect it.


A lot of AI/ML papers these days have an accompanying interactive page like [0]; will we see anything like these now directly on arXiv?

[0] https://voyager.minedojo.org/


I think then arXiv would have to deal with maintaining the tech stack and providing the presumably much higher server capacity to serve the more varied web pages that would result, so it seems like a tall order. arXiv already has an experimental integration with Papers with Code [0], which I guess provides similar results for the reader, though the authors have to figure out their own web hosting.

[0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...


Second that. Something I put out recently had an (admittedly video-heavy) webpage that saw 1TB of traffic over the past month. Cloudflare handled it for free for me, but at arXiv's scale it's bound to be a problem.


Seems like the references aren’t working very well.

I really want journals to have two-way links in a paper. I get Google Scholar alerts about certain papers being cited, and I want to skip to "why did they cite this? Did they use it, improve it, or just mention it?"


I’d never considered setting up citation alerts like this.

Thank you for the idea!


Looks like clicking a reference adds the hash to the URL but doesn't scroll to the reference. If you load the hash URL directly in the browser you get a 404 page...



Yeah, it seems like a bug in HTML generator...


It is a bug. Will be fixed soon.


I just hope they don't stop offering the papers in PDF. Even when I'm on a computer, I still prefer to read PDFs.


There is a taste component to it of course, but the history of PDF shows that it's the wrong format for reading on a computer. It was originally meant to be the end result of a publishing process before printing, a layer that sits right between the publishing software and the postscript that gets sent to the printer. This makes the PDF format quite inflexible for reading on a computer, with it being impossible to properly zoom or adjust the reading experience.

Unfortunately, many institutions and businesses have ignored its limitations, because PDF turned out to be an obvious-but-naive way to put a 'sheets of paper' metaphor into a computer system, which in the 1990s appealed to tech-illiterate folks doing bare-bones computerization of existing paper systems. So later we got complicated and error-prone tools for editing PDFs, and many random additions to the spec to allow for unusual use cases.


> This makes the PDF format quite inflexible for reading on a computer, with it being impossible to properly zoom or adjust the reading experience.

As an academic researcher, generally speaking I also prefer PDF, and the inflexibility and static nature is a feature, not a bug. I appreciate the fact that a paper will appear the same everywhere, that I can refer to "the top of page 7", etc.

The exception is if I wanted to just skim a paper; in this case, I think I'd prefer HTML.

I'm a huge fan of what arXiv is doing here. It effectively preserves the status quo, while adding an additional option on the side. The HTML option might prove a little bit useful for me, and it is likely to prove extremely useful for people with disabilities.


> I appreciate the fact that a paper will appear the same everywhere, that I can refer to "the top of page 7", etc.

There are many great solutions to this problem, including ones that don't require Javascript at all. This website (https://gwern.net/silk-road) presents a really good example -- every header and sub-header is a clickable anchor. If more granularity is needed, on newer articles most of the paragraphs start with an italicized margin note -- though for technical writing, paragraph anchors might be better. The page also pays careful attention to print CSS and has a 'reader mode' to convert all links to footnotes when printed.

Some websites will also preserve the text you select in a URL anchor, but more often than not this is just cumbersome. It also has a greater risk of not surviving changes to the webpage.


Also intriguing as a solution, which is potentially much stabler across revisions (page numbers are unstable), is the text-anchor-fragment feature Chrome introduced a while back: https://developer.mozilla.org/en-US/docs/Web/Text_fragments

It's actually hit ~88% of the market https://caniuse.com/mdn-html_elements_a_text_fragments but unfortunately, Firefox remains a holdout* and that's my browser, so I don't use it (although maybe I should just install https://addons.mozilla.org/en-US/firefox/addon/link-to-text-... and try it out - my existing method of making new anchors for annotation purposes is cumbersome).

* Firefox officially is positive on it but no sign of any movement on it: https://mozilla.github.io/standards-positions/#scroll-to-tex... https://github.com/mozilla/standards-positions/issues/194 https://bugzilla.mozilla.org/show_bug.cgi?id=1753933 https://wicg.github.io/scroll-to-text-fragment/


I know of no one who provides only HTML to arXiv; it's either LaTeX or doc/odt, so the PDFs should always be there.


When I open a large pdf on arxiv (100+ MB, not uncommon for ML papers focused on hi-res image generation), there is a significant load time (10+ seconds) before anything is rendered at all other than a loading bar. Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.


> Does anyone know what the source of this delay is? Is it network-bound or is Chrome just really slow to render large PDFs? Do PDFs have to be fully downloaded to begin rendering? In any case, this delay is my only gripe with arxiv and a progressively rendered HTML doc that instantly loads the document text would be a huge improvement.

The default PDF format puts the xref table at the end of the file, forcing a full download before rendering can take place. PDF-1.2 onwards supports linearized PDFs, and most PDF export tools have some way of enabling it (usually an option like "optimize for web").
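
For what it's worth, existing PDFs can also be linearized after the fact, e.g. with qpdf:

  qpdf --linearize paper.pdf paper-linearized.pdf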


I have the same issue. From what I can tell, it's just network-bound and the arXiv servers are slow. They theoretically allow you to set up a caching server, but after spending a while trying to get it set up, I haven't been able to get it to work.

https://info.arxiv.org/help/faq/cache.html


maybe it'll be faster now with fastly

https://news.ycombinator.com/item?id=38723373


It may even be that the time is spent generating the PDF.

The format in which articles are submitted and stored on arXiv is LaTeX. The PDF is automatically generated from it.

Probably arXiv does some caching of PDFs so they don't have to be generated anew every time they are requested, but I don't know how this caching works.


Many here say they prefer html documents. How do you annotate them? How do you make local copies? Also, how will you read them in the decades to come?

I love PDF.


A lot of academic journals (say from Springer) also offer HTML formats for papers published in the past decade or so, which I personally often find more convenient for reading purposes than PDFs. For example, I parse text a lot faster if I use a regex to split each paragraph into sentences and place a linebreak after each sentence, or if I do natural language "syntax highlighting" by assigning a distinctive color to functional words indicating logical structure like 'if/then', 'and', 'or', 'not', 'because', and 'is'. And sometimes it really improves readability to be able to do "semantic highlighting", in the sense of say assigning a different hashed color to each proper name (or each labeled thesis, etc) that occurs in the paper. Such manipulations are basically impossible with PDFs. It makes me wish sci-hub would start archiving HTML versions in addition to PDFs!


IMO PDF and HTML optimize for different things. PDF is easy and pretty. HTML is easy and responsive. But making PDF responsive is impossible, and making HTML pretty is not easy. I think of arXiv as being for well-polished, pretty documents, not responsive, ugly documents. Most researchers don't have time to make their HTML responsive and pretty.


Am researcher, care about responsiveness way more than pretty. I am super glad for the option. Downloading PDFs is super annoying. I'm stoked.


Well... downloading HTML is even harder nowadays, because many pages are dynamically generated. Although there are surely some browser extensions that can help you finish it in a few clicks.


This is brilliant. I don't share academia's love of LaTeX multi-column PDFs.


I like multi-column text on paper (literally), but it's awkward in digital form, where you can just shape text on the fly to whatever column size you want.


The problem is that gaining this responsiveness fundamentally makes your task much more difficult. Instead of just creating a picture, you're now writing code that has to be maintained. In my philosophy, arXiv is for documents which are set in granite - pictures.


> If you are familiar with ar5iv, an arXivLabs collaboration, our HTML offering is essentially bringing this impactful project fully “in-house”. Our ultimate goal is to backfill arXiv’s entire corpus so that every paper will have an HTML version, but for now this feature is reserved for new papers.

IIRC, ar5iv was created on his own initiative by Deynan Ginev

https://twitter.com/dginev/status/1736792316675825981

and it seems that he has worked tirelessly to fix nearly all of the edge cases during the collaboration.

This project creates huge value to humanity so Deynan is to be heartily thanked.


Thanks for the kind words, but some corrections:

1. My name is Deyan (hi!)

2. ar5iv was the latest frontend incarnation, but our actual work on converting LaTeX to HTML goes back nearly 20 years behind the scenes.

3. I was an undergraduate student when I was introduced to the project back in 2007. It was started "in spirit" by 3 senior co-conspirators back then: Michael Kohlhase, Bruce Miller and Robert Miner. And I am by no means a solitary actor today, even if I may be the chief online presence of the people involved. Bruce is doing the bulk of the hard work on LaTeXML to this day.

I documented some of the history in an invited talk for CICM 2022, which you can find on youtube, or see the slides at:

https://prodg.org/talks/welcome_to_ar5iv

It's really great that the HTML has now reached "home base" in arXiv, and I hope their team gets a lot more of the positive attention going forward - today's achievement is entirely theirs!


I remember stumbling upon your work long ago when I was working on a project to have "e-zines" that consumed a series of `article` class files and rendered them out into PDF and HTML as a series package.

I had come across latex2html, Dan Gildea's project, and found myself unpleasantly dissatisfied with how it worked. As I understand it, it's more a "half implementation of lots of packages" rather than what ar5iv seems to be, which is "enough of the core LaTeX engine producing HTML instead of DVI"? I'd love to know more about the nitty gritty of how the engine does its thing.

I'm curious: How has modern web tech (e.g. WebAssembly, Canvas, etc) helped or gotten in the way of getting good LaTeX rendering in the browser?


Right, that's LaTeXML - it tries to emulate as much as possible of the TeX typesetting system, while retaining enough control to emit structured markup.

Which also allows us (and generally all contributors of latexml package support) to conveniently maintain various parallel data structures and metadata needed along the way.

Modern HTML is very often helpful to produce higher quality article renderings. Examples:

1. we recently started using flexbox for subfigures, allowing them to reflow.

2. we have started emitting ARIA accessibility annotations (there is now an "alt" key for \includegraphics)

3. MathML Core allowed us to have native web rendering for math expressions in every browser.

As to LaTeX rendering in the browser, there are various other projects out there you could look up with partial support. For latexml the WebAssembly route seems most realistic, as we are undergoing a rewrite in Rust. But there are quite a number of pieces to flesh out before we get there.


Went through it, and may I ask whether there is any "personal" level of this ar5iv converter, or just the few parsers mentioned?

Btw, given we are quoting the academic world: I wonder whether you might have mentioned that the Gartner Group invented that technology curve. To be honest, there is a variation I like more, which deals with the chasm issue.



So far I'm left wanting an app that gives me a way to easily track and consume newly published work on a given topic. The existing apps are not great; maybe this change will make it easier to provide better "reader" views, and possibly even TTS (I like to listen while reading).


Nice! Now I don’t need to manually replace arxiv with ar5iv. Congrats to the team.


"Our ultimate goal is to backfill arXiv’s entire corpus so that every paper will have an HTML version, but for now this feature is reserved for new papers."

For now it only works for papers submitted this month. But it's great to have this feature, makes it so much easier to read on phones.


This is the reason I've never liked LaTeX from a data point of view. It's made to be printed out or to look beautiful as a PDF, but it was never designed to get you to an HTML file or a Word file.

I've written my thesis in Markdown in the past because of this (best for humans), which can be easily transformed to HTML, Word, PDF, and even LaTeX: https://github.com/tompollard/phd_thesis_markdown

And I think that XML is the best format for machines.


Ugh. I don't belong to the target audience (people with disabilities) but the typesetting doesn't exactly look pleasant on my machine (Chrome on Linux).


Nice! It would be even better if they offered authors of previous papers the option of converting to HTML, as the latex sources are already in the system.


The article states they're going to backfill all, or nearly all, previously submitted papers!


  article {
    text-justify: Knuth-Plass;
  }


Mind explaining?


The comment is invalid CSS to apply the Knuth-Plass algorithm in rendering an HTML article. Knuth being a perfectionist’s perfectionist, TeX uses this algorithm to determine optimal line breaks to provide for better text justification.

Here’s a discussion of hacks to achieve the algorithm’s results on web pages and an upcoming CSS feature as of 2020. https://mpetroff.net/2020/05/pre-calculated-line-breaks-for-...


Thank you!


If only.


doesn't work great with long author lists...

https://browse.arxiv.org/html/2312.12907v1


The PDF is worse, so there is no simple answer to this: https://arxiv.org/pdf/2312.12907v1.pdf

At least the HTML version pairs each author with their affiliations, instead of the PDF which has all the names on page 1, and all the affiliations on page 2. That's completely unreadable.


The PDF is better because I'm trained to scroll past the author list. That takes forever on the HTML version.


You can click the "Introduction" anchor on the left side and it scrolls you past the author list.


well it skips the abstract too, but yes, you can scroll back up to see it.


Yeah, it's a bit weird that the abstract doesn't have a link on the left.


Probably because \abstract{ } is treated differently than \section{ }, I guess...


For me the PDF is much better. It's compact and clean, if I really need to see an affiliation for a particular author, it's really easy to do so in the PDF, not so in the HTML.

It's highly unlikely anybody will read an entire author list this long; typically you would read the first two or three names, or check if some particular name is on the list. So the compactness of the list and being able to quickly get to the article contents is important.


30 years after HTML was invented to support accessibility and collaboration for research and academia, and on the same day as this announcement, the White House released their new accessibility guidance, which happens to be the first time they've published formal new policy natively in HTML rather than PDF: https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...


I'm surprised by how succinct, easy to understand, and sensible the policy (M-23-22) is:

> Default to HTML: HyperText Markup Language (HTML) is the standard for publishing documents designed to be displayed in a web browser. HTML provides numerous advantages (e.g., easier to make accessible, friendlier to assistive technology, more dynamic and responsive, easier to maintain). When developing information for the web, agencies should default to creating and publishing content in an HTML format in lieu of publishing content in other electronic document formats that are designed for printing or preserving and protecting the content and layout of the document (e.g., PDF and DOCX formats). An agency should develop online content in a non-HTML format only if necessitated by a specific user need.

https://www.whitehouse.gov/omb/management/ofcio/delivering-a...


Hmmm ... accessibility is essential, but PDF is far better for static documents: there's no straightforward, standard way to read an HTML document on another platform. Also, the HTML document may not be readable in 10+ years (unlike most PDFs), and updates are too fluid and hard to track.

I think the general problem is that the end-user doesn't control an HTML document, e.g., for annotation, as a local record, etc.


...What are you talking about? HTML files are readable on basically every platform, even more so because they are fundamentally text files (unlike PDFs, which are binaries). PDFs need special software; HTML can be read on the command line. Likewise, HTML is dead simple to edit and annotate.

Seriously, name a single device that has PDF support that doesn't allow you to view HTML.

I think you're conflating "HTML" and "things stored on a server", because all of your objections apply to PDFs stored on a server. The ability to save and annotate PDFs is not an inherent feature of the file format; those tools exist because the format is such a PITA to interact with that specialized programs have to be written. HTML can be saved just as easily, and usually is (on archive.org).


I just tested saving https://browse.arxiv.org/html/2312.12451v1 to disk using Chrome, transferring it to my Android phone, and opening it on the phone. Results:

1. Saving as "Webpage, Single File" (.mhtml): Neither Firefox nor Chrome even showed up in the list of available apps to open it.

2. Saving as "Webpage, Complete": Opened in Chrome but images were broken. Also very difficult to open with the default file browser because it uses a flat folder view and the sidecar folder pollutes the file list.

I was hoping this would work, perhaps you will have different findings. I agree that HTML is the superior format in theory but usability in practice is often lacking. I'm resigned to using both depending on context.


Yes, that's the kind of issue I was talking about. I wish it were otherwise. As a nearby comment pointed out, epub is a potential solution (and I wish arXiv embraced it - without my knowing their other requirements or epub's accessibility features). It's essentially packaged HTML.


Of course, they’re “just text files” only in theory… but theory and practice diverge very very often.


How do I save an HTML document locally, and annotate it, in an easily sharable form, and in a form that is stable - i.e., in a way that will be readable and usable in 20-50 years?


Basically any HTML document from 20-30 years ago (can't go any further because it didn't exist 50 years ago) will be completely readable and usable. The only issue is people creating content (not styling) in formats besides HTML.

As far as annotations go, you can use the native <ruby>[1] tag, or strikethrough, but if you mean "literally drawing on the text" then, yeah, you're looking for an image format at that point (which is fundamentally what PDF is), but we shouldn't default to storing text in image formats just because of one specific use case. (Also, as I said above, the only reason tools exist to easily do that in PDFs is because everyone insists on using a format that's hard to edit.)

Also, note that the context I was responding to was US legal documents, not something more presentation-heavy.

[1] https://twitter.com/antumbral/status/1730829756013375875


You say it as if PDF is somehow better. To begin with, it's a proprietary format. If Adobe goes bankrupt or obscure tomorrow, PDF will go out of use as a failed technology.


> it's a proprietary format. If Adobe goes bankrupt or obscure tomorrow, PDF will go out of use as a failed technology.

It's an ISO standard with a very large ecosystem outside Adobe. Many users and businesses I know don't use Adobe at all.


They will use it, like COBOL. But are COBOL programs usable on your machine?


> There's no straightfoward, standard way to read an html document on another platform.

Such as? What doesn’t have a browser but can render PDFs?


I mean, how do I save it locally on one platform and read it on any platform? Or share it with someone else to read (without them downloading software)? I.e., we don't have a standard, local, single-file html format.


You're right.

We could have such a format if browser and OS vendors were interested in supporting such a use case. Unfortunately, they aren't.

On the browser side, supporting all-in-one HTML files can be as simple as reading a single multipart-encoded page. Heck, if they supported automatically serializing all external resources as data URIs when saving pages, then most browsers would be able to open them without any modification.

On the OS side, operating systems could treat HTML files as first-class citizens: execute them in an offline sandbox (most operating systems have embedded webviews), then extract the icon, title, description, and other metadata to present to the user. An icon that consists of a blank page with a small browser icon in the corner doesn't tell me anything about what the page is about. This needs to change.

In short, HTML can easily be made nicer to deal with locally, thanks to all the parts already being in place. The problem is that no one (tech giants, OS vendors) is interested in doing this.


.mhtml (or .mhtm) is that format. It's an archive containing an HTML file along with all the resources it references (JavaScript, CSS, and images). These browsers support it: Internet Explorer, Edge, Opera, Chrome, Yandex, and Vivaldi. Create one by saving the web page and choosing the .mhtml format. Safari supports another format called webarchive.

https://en.wikipedia.org/wiki/MHTML
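Under the hood, MHTML is just a MIME multipart/related message, so you can even assemble one by hand with Python's standard email module; a minimal sketch ("page.html" and "figure.png" are placeholders, and real browsers may expect extra headers such as Content-Location):

  # Sketch: build a bare-bones .mhtml container by hand.
  from email.message import EmailMessage

  msg = EmailMessage()
  msg["Subject"] = "Saved page"
  with open("page.html", encoding="utf-8") as f:
      msg.set_content(f.read(), subtype="html")
  msg.make_related()  # promote to multipart/related
  with open("figure.png", "rb") as f:
      # the HTML would reference this part as <img src="cid:fig1">
      msg.add_related(f.read(), maintype="image", subtype="png", cid="<fig1>")

  with open("page.mhtml", "wb") as out:
      out.write(bytes(msg))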


There's epub, a single-file HTML document format.


> I mean, how do I save it locally on one platform and read it on any platform?

Ctrl/Meta/Cmd + S should do the trick, or "File > Save page", and you get an HTML file you can open in any browser. If there are images, they'll most likely be loaded remotely, or worst case not load at all. But the rest of the structure is there.


> If there are images, they'll most likely be loaded remotely

Most sites reference images by relative paths, which won't work with saved HTML, and there's also CSS to consider.


A web page is much more than one file. Also, I'm looking for something with end-user control, where they can save the current document statically and long-term.


If both devices have internet, you share the URL. If not, see other replies.


Print it to a pdf


> There's no straightfoward, standard way to read an html document on another platform.

What do you think of the epub format?


I wish so much for it:

Despite all our advances, we lack an editable, local, multimedia, platform- (and form-factor-) independent, self-contained file: essentially a word-processing file for the 21st century (and I mean it's almost a quarter-century overdue). epub has that potential as a format, and being based on web standards it has the capability, a universe of supporting tools and technology, and easy adoption by different applications.

But I haven't heard anyone else express that particular interest, and as of a few years ago epub doesn't allow annotations and is not stable (i.e., I don't know that today's epub file will be readable in 20 or 50 years) - two essential requirements for serious local content, imho.

And even if it meets those specifications, we need epub editors that are the equivalent of word processors for non-technical users.


Unfortunately, I am from Iran so I can't use this new feature. I got a '403 Forbidden' message from the arXiv server. Worse than that, I totally lost my access to arXiv since they changed their CDN to Fastly, because the fucking mullahs don't like Fastly!


Taking a look at a paper I have that went up this month, and another that went up before the Dec cutoff on ar5iv, they look 90% OK! Figures with side-by-side plots and algorithm environments are the common culprits for breakage, though. Particularly in figures, it seems like the width argument isn't being interpreted correctly.

Interestingly, this review paper seems to have its side-by-side figures intact (e.g. fig 2, fig 4). Maybe it's because the author used a subfigure-like environment (judging by the subcaptions)?

https://ar5iv.labs.arxiv.org/html/1609.04747


For the image widths, there is some CSS fine-tuning that is still needed on the arXiv HTML side. I think that will get fixed soon, just needs the right height directive set.

Getting subfigures emulated via flexbox is one of our more recent LaTeXML enhancements, and still has some ongoing work (working on it today actually). It can be a bit finicky to test - there are easily 20 different ways people can write LaTeX for subfigures in arXiv.


Curious to see how well it will work. Does anybody here know a robust and not crazy computationally expensive solution to extract tables from fairly clean PDF files (especially non-English ones)?
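For a baseline, pdfplumber is one commonly suggested open-source option; a minimal sketch ("paper.pdf" is a placeholder, and results depend heavily on how the table was drawn in the PDF):

  # Sketch: table extraction with pdfplumber, which works off the
  # PDF's text and line primitives rather than any ML model.
  import pdfplumber

  with pdfplumber.open("paper.pdf") as pdf:
      for page_number, page in enumerate(pdf.pages, start=1):
          for table in page.extract_tables():
              print(f"--- table on page {page_number} ---")
              for row in table:
                  print(row)  # a list of cell strings (or None)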


So, I'm seeing a lot of chatter in the thread about LaTeX and converting that to HTML and PDF, with LaTeX as the superior single source of truth. Please keep in mind that many areas of science are practically allergic to LaTeX. I even have a colleague, a plasma physicist, who strongly encourages his team not to use LaTeX because a) collaborators get confused and b) it can be a massive time suck.


I agree with your colleague.

At my institution, all of the lowest-quality drafts I read are made with LaTeX. I think it's because the programs people use to write LaTeX do not have spelling and grammar checking. Also, the people who prefer LaTeX are the same types who are more interested in technical things than in spelling and grammar.


[dupe] from yesterday

More here: https://news.ycombinator.com/item?id=38713215


> Didn't see a toggle

You can run toggleColorScheme() twice in the console to switch to the light or dark theme.


This will be one of the most popular applications written in Perl, because it is based on the 20-year-old LaTeXML: https://en.wikipedia.org/wiki/LaTeXML


Fun fact: it seems that if you use Lockdown mode on Apple devices you can't open PDFs from a browser (no official documentation says it, but there is anecdotal evidence). This would allow people with Lockdown mode to open arXiv papers more easily.


Like the maths in Wikipedia's noscript/basic (X)HTML generator:

The magic of inline images at a known DPI; of course, you can provide images for different DPIs.

Reading maths/science noscript/basic (X)HTML documents on my 100 DPI monitor already works on Wikipedia. Not yet fully ready on arXiv.


What I would like is for arXiv to have an LLM rewrite all papers away from the stodgy, stilted language prevalent in every paper. Just write clearly, gang: use proper paragraph breaks and stop with the run-on sentences.


Personally, I would prefer the conventional Latin Modern math font instead of Palatino math.

Latin Modern is used by:

- Wikipedia
- Math.StackExchange
- Nearly all papers, including the ones hosted on arXiv in PDF format
- Nearly any math videos, slides/presentations, notes
- Almost everything, really

Palatino just looks weird.

Also, I imagine that authors might do math formatting hacks that were only tested on Latin Modern, and might end up breaking on Palatino.

TL;DR:

Palatino :(

Latin Modern :)


Hope they benefit from CDN caching now too.

Edit: aaaand they got Fastly https://news.ycombinator.com/item?id=38723373


What do they use to convert a PDF document to a clean, correct HTML document? It's a difficult space, especially with the variety of layouts you may find in PDF documents...


> The tool that it's being used for this offering is this one, https://github.com/arXiv/arxiv-readability, just to save a few clicks :)

https://news.ycombinator.com/item?id=38726582


arXiv encourages users to submit the LaTeX source of their papers rather than the PDF.


It will ease data scraping, automated meta-analysis...
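For example, a minimal sketch of pulling the title and section headings out of a paper's HTML version (URL taken from the example linked in this thread; assumes the requests and beautifulsoup4 packages, and the selectors may need tweaking as the beta markup evolves):

  # Sketch: scraping structure from an arXiv HTML paper.
  import requests
  from bs4 import BeautifulSoup

  url = "https://browse.arxiv.org/html/2312.12451v1"
  soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

  title = soup.find("h1")
  print(title.get_text(strip=True) if title else "no <h1> found")

  for heading in soup.find_all(["h2", "h3"]):  # section headings
      print("-", heading.get_text(strip=True))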


They should also add commenting capabilities under each paper; a good discussion will lead to more research and information discovery.


This is awesome! Push to Kindle (HTML to EPUB) isn't converting the page properly but I'm sure it's coming soon



And, of course, https://ar5iv.labs.arxiv.org/html

However, ar5iv isn't a la carte like arxiv-vanity. They pretty much process last month's papers every month or so, something like that.


Hi, ar5iv creator here.

You can think of both arxiv-vanity and ar5iv as the "alpha" experiments that led into the official arXiv "beta" HTML announced today.

Once a few rounds of feedback and improvements are integrated, and the full collection of articles acquires HTML in the main arXiv site, ar5iv will be decommissioned.

The plan is to turn all existing ar5iv links into redirects to the official HTML, and free up the resources for maintaining it. I am not sure what the plans are for maintaining arxiv-vanity, but I suspect it may head down a similar path some time later.


lmao! The actual creator of ar5iv? Sometimes I forget this isn't reddit and legit accomplished people comment here.

Reminds of Burning Man when people kept telling me, "Never talk trash on the art at the main landmarks. The artists are frequently within listening distance."

So, of course, I'd walk around talking about buying the art for $50K-$60k, knowing it's already scheduled to be burned with the landmark.


I was hoping this meant that HTML-native submissions would be possible, so that people could make interactive explanations.


With the 2024 browser update, this means I can read these articles on my ancient Kindle perfectly fine.


Saw it last night! I was sooo happy! Reading papers on a phone is a nightmare. Well done, guys!


This makes downloading and parsing paper data easy, which is pretty handy in the LLM era.


About time. bioRxiv and medRxiv have been doing this for probably half a decade at this point?


Wrong, arXiv was first. Check this HTML paper from 1997:

https://arxiv.org/html/astro-ph/9708066


medRxiv and bioRxiv get most of their submissions as Word files. It's a much easier conversion, and if necessary they have manual touch-up. Not feasible for arXiv's volume.


I wonder if this could be used to train an LLM to convert PDFs with rich charts into HTML?


I don't read many papers but this makes it easier for me to save them in Joplin.


Wow, this is _so_ much better!


Is there an open source tool to convert any PDF to something like this?


It sounds like (from the shout-out in the post) they're using https://math.nist.gov/~BMiller/LaTeXML/ to convert the paper's LaTeX into HTML, not from PDF.

The most versatile tool I know of for converting various document formats, including PDF to HTML, is the oss ebook tool Calibre: https://manual.calibre-ebook.com/conversion.html

I have seen https://pdfbox.apache.org/ used for extracting text from PDFs for analysis, but you won't get HTML output.


This is great! I browse papers on mobile, and PDF is so bad for that use case.


OMG. This is amazing. I legit hated reading two column pdfs on a smartphone.


I'm sad that the best they can do is HTML format. HTML is a mess.


nice! will make reading papers on the phone so much more pleasant!


That's great. Now I can read the papers on my phone.


This is a great UX addition. Why did it take them so long?


How would you do it quickly?

For example, HTML isn't divided into numbered pages while PDFs are. A lot of LaTeX interacts with page boundaries. Figures tend towards the tops of pages. And there's \clearpage. And the reference list might say which page each citation appeared on. All that stuff needs someone to decide how to handle it and then to implement that handling. Like... what value does \pageheight return? Sometimes I resize things to fit the page height, and if it were doubled then I should have resized to fit the width instead.


LaTeX is a very complicated programming language for creating documents. It is not easy to create a new backend for it.

As a glimpse into the very tip of the iceberg, the diagram at https://tex.stackexchange.com/a/158740/ is generated with 100% LaTeX code.


The conversion is still very error-prone. It can't convert a lot of packages, and in the last paper I read, StarVector, half the HTML version is just missing. (I think it hit an error at a figure of some sort.) I reported the error, but I've been reporting errors against ar5iv and the abstracts for years now, and the long tail of problems just seems like an incredible slog.


Can confirm. From an ar5iv standpoint, 2.56% of articles currently fail to convert entirely, and 22.9% have known converter errors. That leaves 74.5% of articles nominally usable. This success rate is noticeably lower for the newest batches of arXiv submissions, as the converter hasn't caught up with the most recent package innovations.

We have a plan in place to meaningfully fall back for unknown packages, but that will take at least another year to put in place, and likely another couple of years to stabilize.

Meanwhile, there is some hope that with arXiv launching the HTML Beta we will get more contributions for package support (LaTeXML is an open source project, with public domain licensing, everybody benefits).

But again the original point is spot on. Coverage will be hit-or-miss for a while longer yet, for an arbitrary arXiv submission. The good news is that authors could work towards better support for their articles, if they wanted to.


Where are the computer vision people? This is the perfect type of problem for multi-modal LLMs.


Except that the errors made by an LLM might be harder to spot than converter errors, which typically are very blatant and don't usually alter text (perhaps just drop parts of it).

Also, a bug in a converter is conceptually much easier to fix than re-training your LLM.

I am not sure that AI in its current state is useful when "high fidelity" is required.


Almost universally, we prepare conference papers as LaTeX files made to export to PDFs which fit within the conference's template.

It's nontrivial to export this to HTML in all cases, and even then, nobody is asking for HTML from us even though we all want it. I'm guessing arXiv is using some kind of converter which _usually_ but not _always_ works.

That said, this is a long time coming and PDF as the standard should've died a decade ago. I wish I had this when I was in my PhD program.


Because this is a rather conservative field with little dependency on the general public, and thus without much interest in helping disseminate knowledge broadly & accessibly (relative to other priorities, not in absolute terms).


Thank God. Maybe we can now adapt those for mobile?


Nice.... a website that offers even more web pages.


That's great news. I was using arxiv-vanity to read on mobile phones. I am not seeing it on all articles; is it only for new papers?


Reading papers on mobile now considered sane!


Very good decision, always bet on the web.


FUCK YES (excuse my profanity). I have a tool that converts HTML to Neural Speech and I always wanted to push arXiv papers through it, but couldn't be bothered with a PDF implementation.


Finally a modern format you can copy&paste from and read on one of the most popular computing platforms!!!


At this point are academic papers simply peer-reviewed blog posts?


For anyone interested in staying informed about important new AI/ML papers on arXiv, check out https://www.emergentmind.com, a site I'm building that should help.

Emergent Mind works by checking social media for arXiv paper mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks the papers based on how much social media activity there has been and how long since the paper was published (similar to how HN and Reddit work, except using social media activity, not upvotes, for the ranking). Then, for each paper, it summarizes it using GPT-4, links to the social media discussions, paper references, and related papers.

It's a fairly new site and I haven't shared it much yet. Would love any feedback or requests you all have for improving it.
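For the curious, the "HN-style" shape of that ranking looks roughly like the sketch below; the weights and the 1.8 gravity exponent are illustrative placeholders, not Emergent Mind's actual parameters.

  # Sketch of HN-style time-decayed ranking, adapted to count social
  # media "mentions" instead of upvotes. All numbers are made up.
  def rank_score(mentions: int, age_hours: float, gravity: float = 1.8) -> float:
      return mentions / (age_hours + 2) ** gravity

  papers = [
      {"id": "2312.12451", "mentions": 40, "age_hours": 30.0},
      {"id": "2312.12907", "mentions": 12, "age_hours": 3.0},
  ]
  papers.sort(key=lambda p: rank_score(p["mentions"], p["age_hours"]), reverse=True)
  for p in papers:
      print(p["id"], round(rank_score(p["mentions"], p["age_hours"]), 4))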


This is exactly what I was using HN for. But, yeah, it kinda sucked compared to yours. Another thing I was trying to create was some sort of NN model that could use the semanticscholar h-index of authors along with the abstract text and T5 to estimate the one-year-out citations. Just for personal use, though. That whole thing fell apart because semanticscholar is kinda crap at associating author links with the same author. I frequently ended up with the wrong professors, which I'd think would be easily fixable for them.


Just a note to say that factoring authors into the ranking system is high on my todo list. v1 won't be too fancy - just a hardcoded list of prominent authors whose papers warrant extra visibility. A future version will likely automate it to avoid the hardcoded list.

Also, soon-ish I'm going to add the ability for users to follow specific authors, so you can get notified when they publish new papers.


> Also, soon-ish I'm going to add the ability for users to follow specific authors, so you can get notified when they publish new papers.

If you could do it, this would be a dream. My original intent was to be able to look through only papers citing a popular one and filter the results for ones having at least one author with a set minimum h-index. Using Google Scholar data required using SerpAPI, which has some annoying limitations.

The core goal is obviously just not to miss out on a paper that will very likely be influential while not having to comb through the mountain of irrelevant papers.

What's funny is that Microsoft Academic was the best suited, but was retired in 2021.


I did that (used other features). This is how new papers are ranked here:

https://trendingpapers.com


Great site, thanks for sharing. Can you explain how you're determining how many times a paper is cited? Obviously papers include a list of references, but extracting them accurately from the PDF is difficult in my experience (two-column formats, ugh) - though the new HTML versions help. And even if you have a list, many authors just mention arXiv paper titles, not their ids, making it tricky to identify specific references.


Difficult, yes… but not impossible :)

I just extract the titles and look for their respective ids.

The real challenge was how to do that at scale. In CS alone there are well over half a million papers.
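For anyone curious, the per-title lookup step can be sketched against arXiv's public export API; this is a hedged illustration only, and the batching, caching, fuzzy matching, and rate limiting needed at that scale are omitted.

  # Sketch: resolve a cited title to an arXiv id via the public
  # export API (Atom feed).
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  def lookup_arxiv_id(title):
      query = urllib.parse.urlencode(
          {"search_query": f'ti:"{title}"', "max_results": 1})
      with urllib.request.urlopen(
              "http://export.arxiv.org/api/query?" + query) as resp:
          feed = ET.fromstring(resp.read())
      ns = {"atom": "http://www.w3.org/2005/Atom"}
      entry = feed.find("atom:entry", ns)
      if entry is None:
          return None
      # entry ids look like http://arxiv.org/abs/2312.12451v1
      return entry.find("atom:id", ns).text.rsplit("/", 1)[-1]

  print(lookup_arxiv_id("Attention Is All You Need"))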


FYI I started embedding the HTML pages in an iframe on Emergent Mind when the HTML version is available: https://www.emergentmind.com/papers/2312.11444 // should make it even easier to stay informed about trending papers


I've got a somewhat related question:

Is there a site that lists and rates the various LLM models on huggingface.co alongside their various applications?


That looks great. No real feedback yet, but it's the kind of thing I've always been looking for as a better alternative to Twitter.


Thanks! I've got a lot more planned for it too. If anyone has any feedback that doesn't make sense to share here, or if you're a researcher who is open to some questions about how you currently follow arXiv papers, drop me a note at matt@emergentmind.com.


Love the clean design of the website! Looks amazing on mobile.


Thanks! If you ever run into any issues or have any suggestions for improving the site, drop me a note: matt@emergentmind.com.


Would love to see a comments feature at the bottom there. Reddit / HN style

Love the concept though. Added it to my Home Screen on iOS


Thanks for the kind words, it's appreciated.

I might add comments down the road if there's enough interest and if there's enough traffic to warrant it. Don't want to add them just yet and have zero comments on everything and it look like a ghost town.

Keep the suggestions coming though as you use it more: matt@emergentmind.com.


Great site. Bookmarked it.

Would be nice if I could change timeframe. Top this week, month, year, all time.


I'm slowly adding older papers as I work out the kinks in the site. Down the road when the database is more comprehensive, this should definitely be possible.


Works in Chrome, but does not seem to work in Firefox.


Can you (or anyone experiencing similar issues) share any details about what's not working in Firefox? I tested it and all is well for me, though it's definitely possible there's an issue with some other version of it.


Love to see Emergent Mind continuing to innovate!


Probably more accessible in general. (PDF) Papers are psychologically scary.


PDF is by design an image format that can also embed text. It just doesn't have the primitives to properly retain the article structure.


Nah, it's a super-complex system that creates a graph of components, can draw vectors like PostScript, can embed 3D models, etc. The spec is here:

https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...

If you look at sections 14.6 through 14.10 you will find quite baroque facilities for representing the structure of documents in great detail, making documents with accessibility data, making documents that can reflow with HTML, etc. Not to mention the 14.11 stuff, which addresses problems with high-end printing (say you want to make litho plates for a book).

For that matter, sections 14.4 and 14.5 describe facilities that can be used to add additional private data to PDF files for particular applications. For instance, Adobe Illustrator's files are PDF files with some extra private data, and so is https://en.wikipedia.org/wiki/GeoPDF

I like to complain that PDF has no facility to draw a circle but instead makes you approximate a circle with (accursed) Bézier curves, but other than that, the main complaint people make about PDF is that it is too complicated, not that it is lacking this or that feature.

Contrast that to a highly opinionated document format like DjVu

https://en.wikipedia.org/wiki/DjVu

which came out around the same time as PDF and is specialized for the problem of scanned documents and works by decomposing the document into three layers, one of which is a bilevel layer intended to represent text. All three layers have specialized coding schemes; the text layer in particular tries to identify that every copy of (say) the letter "e" or the character "漢" is the same, and reuses the same bitmap for them.


Adobe can surely add whatever extensions they want to address whatever problem. But unfortunately, most implementations outside of Adobe Acrobat itself won't implement all of them. Most libraries just implement the basic parts for printing and marking (at best, supporting forms and JavaScript). The rest basically doesn't exist for anyone.

There is a reason that most people still use docx for forms even though PDF technically supports forms.

PS: The PDF readers in Firefox and Chrome didn't really support forms until quite late versions.


You would normally use a library to create the PDF so you don't need to deal with the complexity of the format. A library would likely provide a function for drawing circles that translates the circle into Bézier curves.
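For the curious, the standard trick such libraries use is to approximate each quarter circle with a single cubic Bézier whose control points sit at offset k = 4/3 * (sqrt(2) - 1) ≈ 0.5523 times the radius; a self-contained sketch of the geometry (not any particular library's API):

  # The classic 4-cubic-Bezier circle approximation a PDF library
  # would emit; max radial error is about 0.02% of the radius.
  import math

  K = 4 / 3 * (math.sqrt(2) - 1)  # ~0.5523

  def circle_as_beziers(cx, cy, r):
      """Return 4 cubic Bezier segments (p0, c1, c2, p1) tracing a circle."""
      k = K * r
      e, n, w, s = (cx + r, cy), (cx, cy + r), (cx - r, cy), (cx, cy - r)
      return [
          (e, (cx + r, cy + k), (cx + k, cy + r), n),  # east  -> north
          (n, (cx - k, cy + r), (cx - r, cy + k), w),  # north -> west
          (w, (cx - r, cy - k), (cx - k, cy - r), s),  # west  -> south
          (s, (cx + k, cy - r), (cx + r, cy - k), e),  # south -> east
      ]

  for segment in circle_as_beziers(0.0, 0.0, 10.0):
      print(segment)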


I am glad to see a sans font being used, rather than trying to replicate the serif font from the original papers. It's a bit narrow and fuzzy on low resolutions, but a massive improvement just by switching to sans.


We detached this comment from https://news.ycombinator.com/item?id=38724925.


PDF is objectively much better than HTML at rendering text documents, and it's not even close. This could easily have been done 10, even 15-20 years ago; that it wasn't is not just inertia. LaTeX and PDF have enormously better text rendering, and the static format locks a commit of state in time that is much easier to go back to and reference/critique, unlike the intrinsically fluid nature of HTML. For academic work, milestone-like formats that lock state in time are useful for those who later build on them. And again, the rendering just doesn't compare, and that imparts [sub]conscious quality signals.



