ArXiv LaTeX Cleaner: Clean the LaTeX code of your paper to submit to ArXiv (github.com/google-research)
103 points by t55 9 days ago | 42 comments





It's really a pity that they do this now. Some of their older papers had actually quite some valuable information, comments, discussions, thoughts, even commented out sections, figures, tables in it. It gave a much better view of how the paper was written, or even how the work progressed over time. Sometimes you also see alternative titles being discussed, which can be quite funny.

E.g. from https://arxiv.org/abs/1804.09849:

%\title{Sequence-to-Sequence Tricks and Hybrids\\for Improved Neural Machine Translation}
% \title{Mixing and Matching Sequence-to-Sequence Modeling Techniques\\for Improved Neural Machine Translation}
% \title{Analyzing and Optimizing Sequence-to-Sequence Modeling Techniques\\for Improved Neural Machine Translation}
% \title{Frankenmodels for Improved Neural Machine Translation}
% \title{Optimized Architectures and Training Strategies\\for Improved Neural Machine Translation}
% \title{Hybrid Vigor: Combining Traits from Different Architectures Improves Neural Machine Translation}

\title{The Best of Both Worlds: \\Combining Recent Advances in Neural Machine Translation\\ ~}

There are also a lot of such things in the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762v1


> Some of their older papers had actually quite some valuable information, comments, discussions, thoughts, even commented out sections, figures, tables in it.

I think you answered your own question.


What question?

I think I read the comment as being sceptical as to why. I withdraw my comment in that form.

Maybe papers need to be put under version control.

FigShare and Zenodo grant (DataCite) DOIs for git commit tags.

Maybe papers need to contain executable test assertions.


> Removes all comments from your code (yes, those are visible on arXiv and you do not want them to be).

Why not? I love to peek at .tex file comments, and secretly hope that somebody somewhere is reading mine...


Those comments might also explain how some cool figure was done

Ehh, sometimes you have additional results or insightful remarks that simply don't fit into the page limits. You may want to keep those for yourself and use them for a separate publication rather than give them away.

Well, you don't have page limits on arXiv, though.

This is true, but arXiv submissions are often prepared with a target venue in mind that does have page limits.

Also true, but the arXiv version often (in my experience) contains the entire paper. Indeed, many conferences ask people to submit the full version to arXiv.

Interesting. I know it frequently happens, but I've never seen a conference explicitly make that request.

Here's an example (ICALP):

> Authors are strongly encouraged to also make full versions of their submissions freely accessible in an on-line repository such as ArXiv, HAL, ECCC.

This is from the call for papers a few years ago. The wording has changed in recent CFPs, due to employing (weak) double-blind reviewing.

They still allow uploading to arXiv (with full names and affiliations) despite being anonymous.


Aye, but in this context "full version" usually means "a version with more detailed proofs/results related to the paper's contributions", rather than "a version with additional contributions".

What is the point of concealing tikz source code? It increases the size of the source archive and undermines accessibility.

And obfuscating "raw simulation data"? The tool isn't in favor of research fraud, but it's what a person who was in favor of research fraud would prefer.

Agreed that the phrasing is suspicious!

However, it’s pointless or even counterproductive to embed the raw high-resolution data in the paper, because it doesn’t show up in the rendered copy but balloons its size. For a 6.5” (i.e., full width) figure printed at 300 dpi, you can only show about 1950 points horizontally, and realistically a lot less. Upload the raw traces somewhere and add a link.

Source: As a grad student, I stupidly turned a simple poster into a multi-gigabyte monstrosity by embedding lots of raw data. The guy at the print shop was not happy when it crashed his large-format printer!
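To put rough numbers on that limit: 6.5 in × 300 dpi works out to 1950 pixel columns, so anything beyond a couple of values per column is invisible in print. Here is a minimal Python sketch of min/max decimation, assuming that keeping each column's extremes is enough to preserve the visible envelope of a trace. The names `max_columns` and `decimate_minmax` are made up for illustration and don't come from any tool mentioned in this thread.

```python
def max_columns(width_in: float, dpi: int) -> int:
    """Number of distinct horizontal positions a printed figure can resolve."""
    return int(width_in * dpi)


def decimate_minmax(samples: list[float], columns: int) -> list[float]:
    """Reduce samples to at most 2 * columns values.

    The samples are split into `columns` contiguous buckets and only the
    minimum and maximum of each bucket are kept, in order of appearance.
    """
    if columns <= 0:
        raise ValueError("columns must be positive")
    if len(samples) <= 2 * columns:
        return list(samples)  # already small enough, nothing to do

    n = len(samples)
    out = []
    for c in range(columns):
        bucket = samples[c * n // columns:(c + 1) * n // columns]
        i_min = min(range(len(bucket)), key=bucket.__getitem__)
        i_max = max(range(len(bucket)), key=bucket.__getitem__)
        # A set avoids emitting the same sample twice in a flat bucket.
        for i in sorted({i_min, i_max}):
            out.append(bucket[i])
    return out
```

You would plot `decimate_minmax(trace, max_columns(6.5, 300))` instead of the raw trace and publish the full data separately. Plain striding (`trace[::k]`) is simpler but can drop narrow spikes, which is why this sketch keeps per-bucket extremes.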


Same! I've accidentally rendered a PDF monstrosity where every data point was represented in full vector graphic glory. It was absolutely enormous and dumb, because you couldn't tell that from the figure.

Generate high quality graphics, with the limitations of print, digital displays, and attention in mind. Then toss your data up on Zenodo and cite its DOI.

"Obfuscating" is the wrong word. "Decimate", "project", "render" are all better options, depending on what you mean. Punning on "render" is the most fun of that lot, FWIW.


It's also nice for other people to reuse and adapt your figure, or include it in beamer presentations.

Many researchers learn LaTeX by looking at the idioms used for the papers they really like.

That includes code for Tikz figures.

I hope people will use this tool only to remove the inadvertent disclosure of commented regions and to reduce the file size. But keep the LaTeX source intact otherwise!


It needs to be intact, the pdf is rendered by the arxiv backend based on the source

You can upload just the PDF to arXiv. Useful when you for some reason (e.g. client request) publish in certain engineering conferences that only allow Word submissions...

If arXiv detects that it's a LaTeX-generated PDF, it will reject it. Though it's probably possible to launder the LaTeX-generated PDF through ghostscript or something to evade detection (I haven't tried...)

To remove comments, one can also run, for example `latexpand --empty-comments --keep-includes --expand-bbl document.bbl document.tex > document-arxiv-v1.tex`. Latexpand should come pre-installed with texlive. Without the `--keep-includes` option, it also flattens the tex files into one.

But I'd consider removing comments by hand and leaving any comments that are potentially insightful.
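For anyone who wants to see what comment removal involves before trusting a tool with it, here is a rough Python sketch. It is not how latexpand or arxiv_latex_cleaner work internally; the function name `empty_comments` is hypothetical. It assumes a `%` starts a comment unless it is preceded by an odd number of backslashes, and it knows nothing about `verbatim` environments or `\url{...}` arguments, where a `%` is literal.

```python
import re

# A '%' begins a comment unless an odd number of backslashes precedes it:
# '\%' is a literal percent sign, while '\\%' is a line break plus a comment.
_COMMENT = re.compile(r"(?<!\\)((?:\\\\)*)%.*")


def empty_comments(tex: str) -> str:
    """Drop whole-line comments and empty out trailing ones.

    A trailing comment keeps its bare '%' because in TeX the '%' also
    swallows the end-of-line space; deleting it could change spacing.
    """
    kept = []
    for line in tex.split("\n"):
        if line.lstrip().startswith("%"):
            continue  # whole-line comment: removing the line is safe
        kept.append(_COMMENT.sub(r"\1%", line))
    return "\n".join(kept)
```

Running it on a copy and diffing the result against the original is a cheap way to review exactly which remarks would disappear, and to put the insightful ones back by hand.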


I wish journals would start accepting Typst[0] files. It is definitely the format of the next decade in my opinion. It's both open source and highly performant.

Sadly existing legacy structures prevent it from gaining the critical mass needed for it to thrive just yet.

[0] https://typst.app/


They could produce TeX files.

Some of those are redundant (arXiv will complain if there are unused files, most commonly from accidentally adding the .bib file). My make arxiv target on papers usually just calls latexpand to cull comments and modifies all image includes to not be in a subdirectory (then prepares a tar file with the modified source and all figures).
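The include-rewriting step could look roughly like the Python sketch below. This is an illustration, not the actual make target: `flatten_includes` is a hypothetical name, it only handles `\includegraphics` with an optional `[...]` argument, and it assumes no two figures share a file name once the directories are stripped.

```python
import re
from pathlib import PurePosixPath

# Matches \includegraphics, an optional [options] block, and the {path}.
_INCLUDE = re.compile(r"(\\includegraphics(?:\[[^\]]*\])?\{)([^}]*)(\})")


def flatten_includes(tex: str) -> str:
    """Rewrite every \\includegraphics path to its bare file name."""
    def repl(match: re.Match) -> str:
        name = PurePosixPath(match.group(2).strip()).name
        return match.group(1) + name + match.group(3)

    return _INCLUDE.sub(repl, tex)
```

The figures themselves then need to be copied next to the flattened .tex file before building the tar archive; that part is left out here.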

https://github.com/mo271/arxiv-comments

...and here is the tool to quickly inspect comments that were left in the LaTeX


You can even get sharable links to the comments...

Or, don't put your stuff on the arXiv, but put it on zenodo. You also get a DOI, and you can just publish the PDF, not the source. You can even restrict access to the PDF, and create share links with access to it.

You get a DOI on the arXiv. You can just publish the PDF on the arXiv, but this is a sure sign you are a crackpot.

You cannot just publish the PDF, they have checks that make sure that you didn't produce your PDF with LaTeX. There are probably ways to get around that, but why? Just use zenodo instead.

https://info.arxiv.org/help/submit_pdf.html explains all the constraints on direct PDF publication.

If you disagree with their good reasons for requiring the TeX (https://info.arxiv.org/help/faq/whytex.html), you might be granted an exception.


Or just publish on zenodo, without all that fuss. The reasons the ArXiv gives may be good from their point of view, but if you don’t care too much about that but have your own good reasons for not wanting to publish your source, then zenodo is a great and in many respects superior alternative, no questions asked.

You mean, like Grisha Perelman?

There are exceptions to every rule.

If you are “sure” I expect 100% correctness.

See, every rule has an exception.

Let's assume that every rule has an exception. Then this rule must have an exception as well, so there is a rule with no exception. That is a contradiction.

So most definitely, there are some rules with no exception. The ones you are sure about should be among them.


arXiv issues DOIs for submissions.

I didn't say otherwise. In fact, the "also" is meant to express exactly that.


