
I don't really understand the subject matter enough, so I apologize in advance for the meta-comment...

The author mentions that he would maybe have written this as a scientific paper:

> I tried writing a serious-looking research paper about the bug and my proposed fix, but I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead. (History is written by the winners; blogs are written by…)

Honestly, thank god he didn't. This post is so much more readable and approachable than what gets published in "serious" journals. The tone is self-effacing; it does not have an "ego" the way scientific papers tend to. If all science read like this, and if we were "allowed" to cite research that reads like this, I think we would be much better off. This reads like a conversational, approachable textbook, not like an impenetrable wall.

Is it because I don't understand attention at a PhD level that I hold this opinion? Maybe. Could he be writing like this because he's a layman and utterly wrong about the topic, unlike those Serious Science Authors? Maybe, I don't know.

But my god, wouldn't it be nice to be allowed to write like this?




Nah, scientific papers are supposed to be precise and technical. This reads like those quite frequent suggestions here of switching all equations in papers to plain English or code: it honestly comes from a place of ignorance, and I say that as basically a layman myself.

What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economic incentives don't promote this at all.


    There was this sociologist who had written a paper for us all to read ahead of time. I started to read the damn thing, and my eyes were coming out: I couldn’t make head nor tail of it! I figured it was because I hadn’t read any of the books on the list. I had this uneasy feeling of “I’m not adequate,” until finally I said to myself “I’m gonna stop, and read one sentence slowly so I can figure out what the hell it means.”
    
    So I stopped-at random-and read the next sentence very carefully. I can’t remember it precisely, but it was very close to this: “The individual member of the social community often receives his information via visual, symbolic channels.” I went back and forth over it, and translated. You know what it means? “People read.”
    
    Then I went over the next sentence, and realised that I could translate that one also. Then it became a kind of empty business: “Sometimes people read; sometimes people listen to the radio,” and so on, but written in such a fancy way that I couldn’t understand it at first, and when I finally deciphered it, there was nothing to it.

  -- Feynman
I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. Many papers I've spent many hours/days trying to unravel just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to the paper.

Meanwhile, some of the most impactful papers I've read are direct and to the point: Kademlia, Bitcoin, BitTorrent, DynamoDB, Firecracker, etc.

It seems like, when you have something of substance to say, you say it. When you don't, you overcompensate by building an intricate puzzle of jargon and convoluted equations in an attempt to make what you're saying sound far more important than it really is.

As LLMs get better, I look forward to the day when every journal has a standard LLM filter you're required to apply to your paper, one that unravels all of this nonsense and rewrites it in a more straightforward way; if not to publish directly, then just for the editors to verify there isn't a simpler way to convey your ideas. I suspect that if we had an ELI5 filter for most journal articles, we'd discover that a majority of the words that get published have very little substance at all.


Systems research papers do not represent all research papers out there, not even in computer science.

In cryptography, certainly a paper with formal definitions and proofs can be much more valuable than a corresponding blog post. It's a field where formalism is desired, if not necessary. Otherwise you can't check other people's "proofs", or even know what model you're working in.

I think, since people haven't come up with better formalisms, it's sometimes quite obtuse, which gets mistaken for "academic writing", when really it's a best effort to formalize.


Requiring formalism does not preclude attaching an informal but intuitive description of the formal definition or proof. Unless the authors don't understand very clearly what they are talking about, or they want to prevent others from understanding their concepts too easily, I don't see any reason for the authors not to attach an ELI5 in addition to the formalism.


Sure. But it's an ELI5 "in addition to formalism", not "in lieu of formalism". In theory conferences like STOC or FOCS, the first section of the paper often comprises such an overview.

Certainly some papers are better written than others. But sometimes a blog post cannot replace a paper, unless it also goes into the depth and detail that formalism requires. (Then it becomes a 30 page blog post, where most people don't read past the intro.)


The complaint about research papers is that almost all of them omit the ELI5 and provide only the formalism.

You can have both and weave them together into a digestible narrative. I see Physics textbooks sometimes written this way.


Papers are mostly read by other researchers, where the added background is actively bad because it obscures the real meat of the paper to the main audience.

If you just wanted a digestible intro then you would usually buy a textbook.

I think the argument that every research paper ought to be a mashup of a textbook plus the actual research is a bit silly from a "people should specialize at what they're good at" standpoint.

Put in another context, I also don’t want every recipe to reintroduce what it means to “fry” or “braise” or “marinate”. We have Google for that.


I've long wanted an informational slider for bits of text: something where you can zoom in and out to the level of desired complexity. LLMs might be able to fill in some of those gaps. You could turn any paper into an introduction to the subject it's a part of.


This sounds like a good use case for local LLMs. Browser plugin, precooked prompts for different levels of detail, maybe a LoRA to give the model some idea of the expected output. I bet some of the 13B models could do a useful job on this even if they were imperfect.


Look for "stretchtext"


I don't know that much about AI, but my experience in other areas has shown me that 'more grown up' literature that feels harder to parse when you're starting out later becomes the precise technical information you need as you get deeper into a subject. Like W3Schools when you start out in web dev vs. MDN when your skills are more mature.


I believe Feynman understood that he was oversimplifying, and I believe he was able to do so because his reason for reading the paper was not the same as the reason another sociologist might have. Thus a sentence like "The individual member of the social community often receives his information via visual, symbolic channels" does, to a non-expert, mean "people read", but to another sociologist or a researcher in related fields, phrases like "individual member", "social community", and "visual, symbolic channels" would be terms of art. That means an expert in the field could read "social community" and it would mean, cognitively, an entire set of concepts in the field.

In short, jargon matters. People here can talk about functional, procedural, and object-oriented programming because each of the three words has more than just the dictionary meaning, at least to those of us in the field. In the same way we can talk about linear algebra and know it doesn't mean "algebra on lines".

Yes, it's possible to write scientifically without jargon and wordiness, but it's a lot of effort and takes much more space to say "a group who follow a social structure within a society (culture, norms, values, status). They may work together to organise social life within a particular place, or they may be bound by a sense of belonging sustained across time and space"[1]

1 https://othersociologist.com/2013/11/20/sociology-of-communi...


Visual symbols could be anything from written words to police uniforms. It's not oversimplifying— it's flat-out wrong. It would be like reading

Expressions representing numbers may be combined with an expression representing a primitive procedure (such as + or *) to form a compound expression that represents the application of the procedure to those numbers.

And an English professor haughtily responding, "you know what that means? 'Computers compute!' This SICP book is just a pile of jargon that could be dramatically simplified!"

His dismissal revealed nothing about the topic, but a whole lot about how so many in the "hard" sciences view others. Don't understand the text? It's the text's fault! For I am a real scientist, and if I don't understand it, it's not understandable!

He might have been a genius, but he should have stuck to subatomic particles and left exploring human behavior up to the people who'd done the prerequisite reading.


Well, maybe, but you can rationalize arbitrary amounts of pointless jargon that way.

Besides, in the example Feynman gives, the simple sentence is actually shorter. Maybe that shorter sentence loses some information that the jargon carried, but Occam's razor suggests the writer was just trying to sound smarter.


Some bad writing certainly comes from trying to sound “academic” or “scholarly” but there’s more to it than that.

A lot of research involves lumping and splitting: what underlying properties do these seemingly-different things share (or, conversely, how do seemingly-similar things differ)? For example, reading text is just one possible instantiation of a "visual symbolic channel." Traffic lights, road signs, gauges and dials, logos, and clocks also carry information the same way. If you want to discuss "reading and reading-like activities", you may want some kind of umbrella term.

Plus, you may want to contrast them with other ways of sharing information: non-symbolic systems that literally depict the item in question (photos on a picture menu, for example) or using a different sense altogether, like church bells for telling time.


> It seems like, when you have something of substance to say, you say it.

And this blog post probably could be condensed into 1/4 of its size or less with a less conversational/bloggy tone.


There are words that are added to drive the point in multiple ways, ease into it, and make the text more engaging.

And there are words that are added to add empty padding, keep up academic pretenses, and appear smart.

The post could have been condensed, but it would lose the former, not the latter.


Good rhetoric takes time and energy from both the author and reader


Not an academic here, but I've read (and continue to read) through research papers regularly.

The original Bitcoin paper is a great example. I was able to follow it almost fully on my first read, despite not having a formal background in maths.

...and as you said, many of the insubstantial papers hide behind jargon and unnecessarily complex equations, just to camouflage their lack of substance. It's frustrating to spend time deciphering a paper, only to realize that you've essentially wasted that time.


I hadn't seen that Feynman quote before, but I discovered the same phenomenon when reading Donna Haraway's books (Cyborg Manifesto, Modest_Witness@Second_Millennium.FemaleMan©Meets_OncoMouse, Primate Visions).

The criticism was: "Haraway's work has been criticized for being 'methodologically vague'[39] and using noticeably opaque language that is 'sometimes concealing in an apparently deliberate way'".


>Haraway's work has been criticized for being "methodologically vague"[39] and using noticeably opaque language that is "sometimes concealing in an apparently deliberate way"

So you're saying that "Her work is basically handwaving and bullshitting".


Yes, but also, wrapping the handwaving and bullshitting in a layer of obfuscation:

"Michel Foucault’s biopolitics is a flaccid premonition of cyborg politics, a very open field. By the late twentieth century, our time, a mythic time, we are all chimeras, theorized and fabricated hybrids of machine and organism—in short, cyborgs. The cyborg is our ontology; it gives us our politics. The cyborg is a condensed image of both imagination and material reality, the two joined centers structuring any possibility of historical transformation. In the traditions of “Western” science and politics—the tradition of racist, male-dominant capitalism; the tradition of progress; the tradition of the appropriation of nature as resource for the productions of culture; the tradition of reproduction of the self from the reflections of the other—the relation between organism and machine has been a border war"

(donna was woke before woke was a thing)


> (donna was woke before woke was a thing)

Donna Haraway was born 6 years after “stay woke” in its sense as an admonition to maintain alertness to the racist context was coined. Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing. (Before its recent replacement of “politically correct” as the American Right’s preferred, meaning-stripped, label for everything it disagrees with, sure, but “woke” was a thing long before that.)


>Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing

A game of being pedantic is always welcome:

She very well could have been "woke before woke was a thing", because "woke", as the parent means it in her case, refers to the modern usage (of the last two decades or so), not the original term of the '40s that might have preceded her birth.

So take the parent's comment to mean:

"She was woke, in the modern, circa-2000s+ sense, before woke, in the modern circa-2000s+ sense was a thing, not in the 1950s namesake sense".

Similar to how somebody could have been a hipster (in the 2000s+ sense [1]) before hipsters were a thing (before the 2000s), even if they were born in the '70s. Sure, the term already existed before the '70s, but it referred to a different thing.

[1] https://en.wikipedia.org/wiki/Hipster_(contemporary_subcultu...


The 1938 sense in which it was coined is exactly the sense of the 1950s, and the sense that got increased attention circa the 2000s and was catapulted to prominence alongside BLM (which itself was a response to the same kind of event that the art in which the phrase was coined responded to).

The only newer sense is the American Right’s use of the term to replace “political correctness” as an empty epithet for everything and everyone it disagrees with.


Language grows organically, and the American right gets as much say as the American left in defining what a word means, or how proponents of a movement or social fad are seen in practice (besides, "woke's" standard definition is just "awake", if someone insists on the "original meaning").

So, one side could see woke in theory as a noble activist/social consciousness practice, which can not go wrong and helps liberate us all.

The other side might see woke in practice as intolerable virtue signalling and self-aggrandizing whose actions often border on farcical.


Thanks; that's exactly what I meant. I leave these things out because I assume not everybody is pedantically waiting to call me out on a slight variation on their personal belief system.


> I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. Many papers I've spent many hours/days trying to unravel just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to the paper.

You can say the same thing about code. Some people seemingly just don't want to give away how simple the core logic is, and will lead you through a myriad of twists and turns before you finally see the point.


> There was this sociologist

Found the problem.


The writing quality of academic papers is very poor, whatever its intended characteristics are, and we deserve better.

I'm skeptical that the only way for them to be precise and technical is to make them impenetrable. I think there is a culture of academic writing (many different cultures, really) that has adopted a voice and writing style which became a parody of itself over time.

Here's a trivial example: you frequently see papers use the passive voice, something a middle school English teacher would mark with a red pen: "500 participants were asked" vs. "we asked 500 participants." In what sense is the former more precise and technical? It's not. It does not convey any additional meaning. People use it to sound objective and distant, even when they really aren't.

Realistically, academic writers usually don't even think about it as much as that. They're just copying the tone of other papers, because there is a culture and it enforces certain behaviors on its members irrespective of the value.


A pain in the ass was observed while writing was performed in the passive voice.

Nobody likes doing it, I think. We just do it because we’re scared our papers won’t be accepted otherwise.


In philosophy papers you see authors often use the pronoun "I", similar to blog posts. But they have other ways to make them hard to parse for outsiders.


Either your example is too trivial to justify your point, or the point itself is trivial. It's right for an academic to distance themselves from the subject of their study because we do need researchers who try not to be biased. If they fail that and then correct themselves, then what's the problem? Complaining about inconsequential uses of tone is obsessing about form over function and reeks too much of insecurity, to be honest.


They aren't magically "objective" because they used the passive voice. It's a performance.


Of course language does not guarantee that the study is objective—that would be in the design of the experiment, the reproducibility of results, and the absence of conflicts of interest among the researchers. Using the passive voice, however, elevates the outcomes being reported to facts that actually happened, instead of mere personal experiences.

People complain all the time about news being biased for being told from a reporter’s point of view, but complain all the same when events are reported in an encyclopedic manner as researchers do when they remove themselves from the events and the outcomes of their studies.


I'm convinced that the value of active voice is not precision and clarity, but rather the subliminal egocentrism away from the object (the research) towards the subject (the researchers) who need to receive credit for the work. The royal "we" also helps frame the work as a collaborative effort with the audience.


That's rubbish. Passive voice has a number of detrimental effects: it increases text length without adding information, it makes the subject (acting entity) and the object (entity acted upon) easier to confuse, and it obscures who actually did things (which some people confuse with objectivity).

That said, the assertion that most scientific articles are written in the passive voice has been outdated for quite some time. Most journal style guides advise using the active voice, e.g. https://www.nature.com/nature-portfolio/for-authors/write


> it confuses the reader about who actually did things

When scientific papers have a clear list of authors and delineated section headings, this point is moot. And in such papers, again, repetitive strings of sentences that begin with the same "we..." emphasize the producers of the work over the work itself.


I agree with everything you say. Papers really are a bit too hard to read sometimes, but I'd argue it's often not due to an overly technical tone so much as writers cutting out a lot of background material for brevity and assumed familiarity.

>What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economical incentives don't promote this at all.

I will add onto this that a lot of journals have been pushing for video abstracts and "plain English" abstracts. For the most part I don't see these too often, but when they're there they're appreciated, and I vaguely recall that someone found that citations go up when they're used (specifically for plain-English abstracts; I don't think anything has been done on video abstracts).

There are a lot of good blogs for computational academic subjects (ML, bioinformatics, comp neuro, etc.), but I see less for bio and non-software engineering. Math and physics seem to have some really notable blogs, but beyond what gets posted to HN and linked further on those blogs, I can't comment.


"it honestly comes from a place of ignorance, and I say that as basically a layman myself"

Here is an added complication: succinct technical communication can be efficient when communicating with peers who work in exactly the same domain, on similar problems as you, and want to digest your main ideas quickly.

On the other hand, for any particular paper, the audience to whom it is directly relevant and addressed can be small, while the audience who end up reading it anyway may be vast. (Maybe I am reading your paper because someone cited a method paper that, in lieu of a proof or explanation, writes just two words and a citation to your paper. Maybe I am a freshly minted new student reading it for my first seminar. Maybe I am from a neighboring field and trying to understand what is happening in yours. Maybe I tried to find what people have already done with a particular idea I just had, and a search engine gave me your paper. And so on.)

During my (admittedly lackluster) academic career I recall spending much more time trying to read and understand papers that were not addressed to me than papers that were, where I enjoyed the succinct style that skips details and presents the results. (Maybe it is just an idiosyncratic trust issue on my part, because I am often skeptical of stated results and their interpretation, finding the methods more interesting.) But that is not all.

I also noticed that genuine misunderstandings arising from "brief" communication of technical "details" were quite common: two different researchers would state that they "applied method X to avoid Y/seek Z [citation]" in almost exactly the same words, where X, Y, and Z were complicated technical terms, yet the authors would have quite different opinions about what those words meant, what the intended reading was, and how and why X should be implemented.

In conclusion, I think many a scientific field would benefit from a style where authors were expected to clearly explain what they did and why (as clearly as possible).


>Nah, scientific papers are supposed to be precise and technical.

They're also, more often than not, tedious, badly explained, error prone, oft-skipped, and hardly ever read carefully, even during peer review for the paper that contains them. That's how mistakes stay unnoticed for decades in influential papers with tons of citations.

In essence, a paper's tone and language are often more formality, academic tradition, ritual, and padding for publication purposes than anything serving a real purpose.


Well, I'm not so sure. It seems to me that someone could perfectly well devise an experiment based off of this (another poster chastised me for saying paper, so) blog post.

Equations are perfectly clear. I was able to follow his reasoning perfectly well.

I cannot say the same for so many papers (tm) that I've read. Mostly in a similarly computational (though non- deeplearning) applied math domain.


Strongly agree. “Why are academic papers always written in such mumbo jumbo?” is the same complaint as “Why are contracts written in such legalese?”, which is a manifestation of “I’m smart and I don’t get this, so the author is dumb for not writing clearly.” It’s a natural human bias that most HN denizens insist they don’t possess, but of course we do.


> Nah, scientific papers are supposed to be precise and technical.

> What should be encouraged is for academics to blog about their research as well.

Why so binary? A blog would be hard to find, why not have both in the paper?

My view is similar to that of code vs docs: code should be as small, and as precise as possible, whereas docs are best when they’re explaining to humans how things fit together, high level. Also easier to maintain.

Hyper technical natural language mixed in with math is almost the worst of both worlds: low density of the actual formulas, with an incomprehensible wall of text surrounding it. And clearly this is an issue also for phd domain experts.

Not saying academic writing could be super simple, but I also see no reason to believe the status quo is optimized for comprehension rather than, say, social posturing.


I disagree, because it isn't possible for language to be precise on its own syntactic merit. There is meaning and there is context, and the biggest problem with research papers is that the context of many statements in the paper is incredibly ambiguous. The reason for that is that the papers are trying to be "concise". Context can only be disambiguated with more statements: you must eliminate potential interpretations that a reader could make.

"Spectrum sharing in an “apple-like” or a fixed set sense is not a coexistence. ". What does that mean? Coexist? Who knows, the author thought they were being precise, but they understood the statement they made with a head full of context that gave it precise meaning. As readers, we can only scratch our own heads as to what that context could possibly be.


Leslie Lamport definitely doesn't share your opinion. A known fact about the Paxos paper is that there are no dumbed-down summaries worth reading because the proper thing is so approachable. Not sure whether it's true that you only have to sound smart when you've got nothing to say, but it certainly feels like it could be the case.


Paxos is so mystifyingly hard that Raft was invented as part of a project to understand Paxos (and the advisor and proponent of the project was John Ousterhout, who's pretty badass). There are also, I believe, a few papers trying to explain Paxos more clearly.


Just as a quick source to my claims:

1. The raft paper is titled "In Search of an Understandable Consensus Algorithm"

2. The abstract of this tutorial on Understanding Paxos https://www.ux.uis.no/~meling/papers/2013-paxostutorial-opod...

3. Lamport's own "Paxos made simple" https://lamport.azurewebsites.net/pubs/paxos-simple.pdf


> A known fact about the Paxos paper is that there are no dumbed down summaries worth reading because the proper thing is so approachable.

A known fact is that it's impossible to actually implement it correctly, and the "approachable" paper seems to be a significant factor in this.


I've read a lot of scientific papers in the comp sci / machine learning space and they are rarely precise. It's been over a decade since I've read many papers, so maybe this has changed, but I remember reading a paper out of Microsoft about how to make spell-correcting auto-completion for search, and it was nearly impossible to figure out precisely how it was implemented. Precision would have been achieved easily by providing code and a sample data set; instead it was a mix of prose and math equations with many gaps you had to guess how to fill.


Ah yes, my old supervisor was very fond of that strategy.

"Make it sound like we do cool stuff; but don't make it so precise that they can re-implement what we do. Let them come to us so we can co-author papers."


Not always; ReLU is a fucking line. Most papers write stuff in the most complicated way possible to sound smart.
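To the commenter's point, the entire definition fits in one line of plain Python (a sketch of the standard formula, not any particular paper's code):

```python
def relu(x):
    # ReLU really is just a clipped line: identity for positive
    # inputs, zero for everything else, i.e. max(0, x).
    return x if x > 0.0 else 0.0
```

Everything else in a typical paper introducing an activation like this is framing around that single expression.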


More fundamentally, he's postulating in a blog post that this will work, but he doesn't run any experiment to prove that it does.


I think maybe it's because he didn't have experimental results showing that it worked. Not a knock against the author; there are just so many things that seem like good ideas but don't end up working well in practice that a paper like this without results is hard to value.


Yes, definitely. If he tried to have it published, the lack of experimental results would definitely be a glaring error.

But this is still scientific communication. It's really nice that it's legible!

> Even though softmax1 is facially quite boring, I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research. If you want to run some experiments and prove me right, DM me on Twitter and we’ll get a paper going.

I'm guessing that in the stodgy world of science, a communication like this might happen over lunch at a conference, limited to a small clique of researchers who are zealously guarding their next paper. Who could blame them, publish or perish!

But someone will probably test this theory out (after my read, it will probably happen in llama.cpp with preliminary results on GPT-2 by next week) and achieve results, and it will happen quickly and legibly to the outside world, because this was published openly and without all of the pretension that formal science (tm) has. If it works, it works. Stuff like this is the soul of the internet. Sharing knowledge and making it legible for all.
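For anyone who wants to kick the tires before a full llama.cpp experiment, the proposed change is tiny. Here is a minimal plain-Python sketch of the post's softmax1, exp(x_i) / (1 + Σ_j exp(x_j)); the function name comes from the post, but the max-shift for numerical stability is a standard trick I've added, not the author's code:

```python
import math

def softmax1(xs):
    """Sketch of the post's softmax1: exp(x_i) / (1 + sum_j exp(x_j)).
    The extra 1 in the denominator acts like a hidden zero logit,
    letting an attention head emit near-zero weight everywhere."""
    # Shift by the max score (including the implicit 0 logit) for
    # numerical stability; the result is mathematically unchanged.
    m = max(xs + [0.0])
    exps = [math.exp(x - m) for x in xs]
    denom = math.exp(0.0 - m) + sum(exps)
    return [e / denom for e in exps]
```

Unlike ordinary softmax, the outputs sum to less than 1, and when every score is very negative they all approach 0, which is exactly the "escape valve" the post argues attention heads need.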


There's a perfectly good venue for this communication: a workshop.

Workshop submissions often don't need evidence. They just need a small kernel to spur discussion.

Without experiments, there is no hope of publishing this in anything more than a workshop. Nor should there be.


Then again, if you don't have access to giant compute clusters you can't test this, so it's either a blog post or nothing. I believe the outlier problem that this solves only appears for very large models.


That isn’t true at all. Train a smaller model on a smaller dataset. You can even train on your laptop. It’s definitely feasible. This is just a proof of concept, it doesn’t need to beat state of the art.


Maybe I edited my comment too late.


> I believe the outlier problem that this solves only appears for very large models.

Any reason to believe this? The author never mentioned it, and I can’t think of any other a priori reason why it should be true.


See figure 1:

https://arxiv.org/pdf/2208.07339.pdf

Outliers appear at model size 6.7B and are not present at 2.7B


Sure, emergent properties can arise as parameters increase. Everyone knows that. That’s a much less specific claim than to say that the benefit of modifying softmax can only arise as an emergent property after N parameters, and therefore the benefit can only be evaluated on models above a certain size. To my understanding the author of TFA isn’t suggesting the same issue as the one in your linked paper.


The second heading in the TFA is "It’s All About Outliers"


6.7B isn't "needs a datacenter" scale.


It's in the million-dollar range. XLNet, which is a 1.3B model, cost $245,000 to train, for example.


To finish the author’s analogy:

Blog posts are written by those who arrive first.

In a weird way my mental model is: blog posts are the recon team discovering a new idea. They might have errors. They might be incomplete. Maybe they're outright wrong. The stakes are lower, since it took less effort to get there and there's less loss if a position is abandoned.

Then papers are authored, often much later, and they’re the regulars coming in to fortify a newly captured idea. They provide (or at least are supposed to) rigor to the idea. A fortification of a position that we decide is worth holding.

Yeah, this analogy is probably sloppy. But in my brain there’s an eternal conflict against ignorance as we keep advancing into the unknown.


Counterargument: this blogpost is worthless. You get all the way to the end and then find out he hasn't actually tried it, not even on a toy model. It's just a neat idea he thinks will work.


I wouldn’t quite say its value is zero. It’s worth something, but a lot less than if it had been shown to work empirically.

Explainers and their folksy, imprecise tone are good for things we already know are true. I’m skeptical of them for things that are unproven.


Why would that make it worthless?


Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.

Ideas like sparse attention, tree attention, residual attention, etc, all sound good on paper, but when researchers try to reproduce them they either find no results or results that don't scale. Even AliBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson on its own: you can't beat the original transformer.

Optimizations that do stick around tend to be the ones that preserve the original algorithm but help with caching or memory accesses.


Because there are a thousand ideas a minute in this field that meet the "it's worth trying" bar but don't actually pan out to make any difference. It's the equivalent of a blogpost that says "if someone else turned my idea into a business, it would be a billion dollar business. But I won't bother."


Because until he tries it, who knows if it works?

There are a thousand papers out there making minor tweaks to the transformer architecture. 99% of them are also worthless and forgotten.


> Because until he tries it, who knows if it works?

That's precisely what he shared this for, though. So someone willing to train a model with this tweak tries it.


With, say, system architecture, you can muse on stuff like "well if Kubernetes made this decision, it would definitely be more secure" or "it would scale up quicker" without empirical evidence, and other people could argue "yes I agree because..." or "no I don't because..." etc.

With large ML models, there is probably no intuition like this. We just can't say "if I do the common-sense thing X, it will surely produce better results on a given benchmark"; we have no idea until it is tried out.


He says in the very first paragraph:

> I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead.

So I think your accusation of his burying the lede on the lack of experiment is unwarranted.


> The tone is self-effacing, it does not have an "ego" the way scientific papers tend to have.

I can't imagine judging scientific papers based on whether the author might be looking down on me, or thinks he knows better than me.

> if we were "allowed" to cite research that reads like this

Maybe you're looking down on yourself? You can cite anything you want to cite.


Well if you yourself are trying to publish in a scientific venue you can't always cite exactly what you want to cite. Though it's probably uncommon for a peer reviewer to ask for a specific citation to be removed, the review process absolutely does affect the references list, and expectations about this process affect it doubly so.


In ML, no one is going to police your citation list. I've cited some weird stuff in my papers, including ideas from tweets and random quotes from Jeff Dean. It's never been a problem.


> This paper

It's not a paper. It's an idea that sounds plausible, presented in a highly entertaining form.


A lot of thoughts in this thread on what academic papers are or should be, let me give my own opinion as a person who tries to write papers.

Papers should be structured like fractals - that is, they should be "self-similar". The main text of the paper after the introduction should go into all the necessary details demonstrating the origins of the idea and proving that it has value. Then the introduction section should summarize all this, and take a less rigorous tone. The abstract should be a summary of the introduction. And then the title should summarize the abstract. If you really have a lot of technical work to do, maybe you can write a super long appendix and have the main body summarize that.

I myself probably spend as much time reading paper introductions as I do reading paper bodies, which means that probably 90% of the papers I read, I only read the introduction. I do this because I enjoy it more - I like new ideas, and the intros are a great way to get a lot of them. This blog post reads like a great paper introduction to me. It's easy to trick yourself into believing something is easy though, so an academic paper would have to back this up with an experiment.


There isn't much difference between a blog and a whitepaper, beyond that people tend to write blogs more casually and whitepapers more seriously (and some academics even only accept things that look more serious).

But a good writer can write great articles in whatever format they wish.


I learned more from this post than from a thousand papers. Amazing writing!


> it does not have an "ego" the way scientific papers tend to have.

What do you call it when somebody takes the time to write about "a big discovery" they've made, but don't take the time to check if somebody else already did it? It's not like it's in some forgotten paper nobody has seen. It's in Pytorch itself.

Also this: "I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research."


It's interesting, because as a scientist who reads and writes these kinds of papers, my first impression was: This guy has a pretty big ego or is otherwise badly miscalibrated if he believes his genius idea has a "99.44%" chance of preventing outlier activations without doing any experiments.


Not ego; he's playing on the old Ivory Soap slogan "99 and 44/100% Pure"

https://en.m.wikipedia.org/wiki/Ivory_(soap)


This is why folks like gwern have their own research published this way, i.e. his analysis of GPT-3: https://gwern.net/gpt-3

We call him an "independent AI researcher" because his google scholar is "bland" compared to many academics who play the academia game - https://scholar.google.com/citations?user=yk1QMowAAAAJ&hl=en


I can see AI being used to make scientific papers more approachable like this.


Are most AI papers even published beyond arxiv anyway?


It would be amazing if academia started replacing papers with videos + code

I want to see: an explainer of the science/ideas/experiments/hypotheses

And instructions on how to reproduce the experiments/results

Some YouTubers are going in this direction


+1 to including code with your paper. It improves reproducibility and transparency. There’s even a well-known website dedicated to this purpose.

For the rest of it I don’t care. As long as researchers understand what’s going on, that’s what matters.


I'm not an academic, but some of the notation and terminology they use makes me want to hunt them down and 'clockwork orange their eyes open' until they can show me how their math is "intended" to work.

Inconsistent math notation in papers, along with vague terms in descriptions, makes me so mad.


Most papers already have code, and videos are very common.


There are videos showing some result, but almost never a video of someone explaining the thing they are doing.

When they include good videos, they really stand out


oh god, please, no more videos...



