
I'm a bit of a noob here, but if

a) a linear SSM (a form of RNN?) is equivalent to Attention without the scaling and softmax; and

b) Attention is "all you need" and the thing that made Transformers radically outperform all the previous architectures like LSTMs that used to dominate NLP;

does that imply c) that the scaling and softmax parts of the attention equation, in particular, are the magic touch that makes Transformers work so well?


The major difference is that transformer state grows as the sequence gets longer, while recurrent models use a fixed-size state. So presumably at sequence length (T) > size of state space (N), the transformer will be better on certain tasks: not all of them, but especially those that require the model to select information from the beginning of the sequence conditional on something at the end of the sequence. Transformers can refocus at any time, while SSMs need to guess right from the start what to keep and what to drop. SSMs could use the old trick of repeating the input twice to allow the end to condition on the beginning as well.

An important role is played by the softmax function, which normalizes the attention scores and allows the model to weigh different parts of the input sequence dynamically. This means that, unlike RNNs, which sequentially process inputs and update states, Transformers can directly access and prioritize information from any part of the sequence, and they are not slower for T < N.
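
A rough numpy sketch of that contrast (my own toy illustration: single head, no masking or projections, and the "linear" version is simplified linear attention rather than any particular SSM):

    import numpy as np

    def softmax_attention(Q, K, V):
        # Each query renormalizes its scores over the whole sequence,
        # so it can sharply re-focus on any position.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    def linear_attention(Q, K, V):
        # Drop the softmax and the whole history collapses into a
        # fixed-size state S, updated step by step like an RNN/SSM.
        S = np.zeros((K.shape[-1], V.shape[-1]))
        out = []
        for q, k, v in zip(Q, K, V):
            S += np.outer(k, v)   # what to keep must be decided now
            out.append(q @ S)
        return np.stack(out)

    T, d = 6, 4
    Q, K, V = [np.random.randn(T, d) for _ in range(3)]
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)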


Nice description, thanks.

Doesn’t this effect happen with EVs generally, not just Teslas? (I don’t particularly mean to defend Tesla here, but I wonder if this might be misleading)

When I see an embedded DSL passed around as strings like this I can't help but think "this could be its own programming language"

Then it could have syntax highlighting, auto complete, and so on. The type system for such a language could possibly include verifying shapes at compile time.

What would a language comprised of .ein source files for manipulating tensors, which compiles down to the same low level ops, look like?


No need for .ein source files. We just need a programming language that allows the definition of embedded DSLs without shoving them into one-line strings. A language like Common Lisp.

Here's einsum in 200 lines of Common Lisp. All einsum expressions are statically analyzed, checked for errors, and AOT compiled to machine code: https://github.com/quil-lang/magicl/blob/master/src/high-lev...


This is also how it works in Julia, where macros digest notation for einsum-like operations before compile-time. In fact the linked file's explanatory comment:

     (einsum (A i j) (B i k) (C k j)) 
    results in the updates
      A[i,j] = \sum_k B[i,k]C[k,j],
    which is equivalent to matrix multiplication.
very nearly contains the syntax used by all the Julia packages (where @ marks a macro), which is

    @tensor A[i,j] = B[i,k] * C[k,j]
(using https://github.com/Jutho/TensorOperations.jl, but see also OMEinsum, Finch, and my Tullio and TensorCast.)


I wrote a library in C++ (I know, probably a non-starter for most reading this) that I think does most of what you want, as well as some other requests in this thread (generalized to more than just multiply-add): https://github.com/dsharlet/array?tab=readme-ov-file#einstei....

A matrix multiply written with this looks like this:

    enum { i = 2, j = 0, k = 1 };
    auto C = make_ein_sum<float, i, j>(ein<i, k>(A) * ein<k, j>(B));
Where A and B are 2D arrays. This is strongly typed all the way through, so you get a lot of feedback at compile time, and C is a 2D array object at compile time. It is possible to make C++ template errors reasonable with enable_if and the like; this works well-ish on clang, but not so well in GCC (YMMV).

This library is a lot less automated than most other einsum implementations. You have to explicitly control the loop ordering (in the example above, the `j` loop is innermost because it is loop 0). If you build a good loop order for your shapes, the compiler will probably autovectorize your inner loop, and you'll get pretty good performance. Control over the loop ordering is in general a useful tool, but it's probably a lot lower level than most users want.
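
Roughly, the enum in the example above pins down this loop nest (a Python-style sketch just to illustrate the ordering, not the library's actual generated code):

    import numpy as np

    M, Kd, N = 4, 5, 6
    A, B = np.ones((M, Kd)), np.ones((Kd, N))
    C = np.zeros((M, N))

    # enum { i = 2, j = 0, k = 1 }: i is loop 2 (outermost), k is loop 1,
    # and j is loop 0 (innermost, over contiguous memory, so it can vectorize).
    for i in range(M):
        for k in range(Kd):
            for j in range(N):
                C[i, j] += A[i, k] * B[k, j]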


I played around with the idea of a language motivated by this same thought process last year: https://github.com/lukehoban/ten.

* Succinct syntax and operators tailored to AI model definition

* Fully statically typed tensors, including generic functions over tensor dimension and batch dimensions (...)

* First-class hyper-parameters, model parameters and model arguments for explicit model specification

* Einops-style reshaping and reductions - tensor dimensions are explicit not implicit


Hey, this is neat! Thanks for sharing. I may be interested in collaborating with you on this when I have some free time.


Einsums are the regexes of tensor programming. Should be avoided at all costs IMO. Ideally we should be able to write native loops that get auto-vectorized into einsums for which there is a CUDA/PTX emitting factory. But for some reason neither PyTorch nor JAX/TF took this route and now we are here.

Some of the einsum expressions I have seen for grouped multi-head/query attention are mind-boggling, and they get shipped to prod.
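
For example, just the score computation for grouped-query attention ends up as something like this (shapes and names invented purely for illustration):

    import numpy as np

    b, h, g, tq, tk, d = 2, 4, 2, 8, 8, 16
    Q = np.random.randn(b, h, g, tq, d)   # g query heads share each KV head
    K = np.random.randn(b, h, tk, d)
    scores = np.einsum('bhgqd,bhkd->bhgqk', Q, K) / np.sqrt(d)   # (b, h, g, tq, tk)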


JAX kind of did take this route, no? The main issue is that it’s going to be hard/impossible to compile Python loops to GPU kernels. It’s also maybe not the most ergonomic solution, which is why there is shorthand like einsum. Einsum can be much more clear than a loop because what it can do is so much more limited.


JAX tries to be a functional language that has a Python front end. The problem is if you are outside Google and don't really understand the XLA compiler, then you are screwed.


I like einsum, it's concise and self-explanatory. Way better than multiple lines of nested for loops would be.


I agree that nested loops are verbose. But einsum is unstructured and not extensible, hard to debug in pieces, hard to document, etc.


That sounds very overkill for something that is already overkill for most applications.


They're commoditizing their complement [0][1], inasmuch as LLMs are a complement of social media and advertising (which I think they are).

They've made it harder for competitors like Google or TikTok to compete with Meta on the basis of "we have a super secret proprietary AI that no one else has that's leagues better than anything else". If everyone has access to a high quality AI (perhaps not the world's best, but competitive), then no one -- including their competitors -- has a competitive advantage from having exclusive access to high quality AI.

[0]: https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/

[1]: https://gwern.net/complement


Yes. And it could potentially diminish OpenAI/MS.

Once everyone can do it, then OpenAI's value would evaporate.


Once every human has access to cutting edge AI, that ceases to be a differentiating factor, so the human talent will again be the determining factor.


And the content industry will grow ever more addictive and profitable, with content curated and customized specifically for your psyche. That is the very industry whose growth Meta, of all the tech giants, stands to benefit from the most.


> Once everyone can do it, then OpenAI's value would evaporate.

If you take OpenAI's charter statement seriously, the tech will make most humans' (economic) value evaporate for the same reason.

https://openai.com/charter


> will make most humans' (economic) value evaporate for the same reason

With one hand it takes, with the other it gives - AI will be in everyone's pocket, super-humanly capable of serving our needs; the thing is, you can't copy a billion dollars, but you can copy a LLaMA.


> OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity. We will attempt to directly build safe and beneficial AGI, but will also consider our mission fulfilled if our work aids others to achieve this outcome.

No current LLM is that, and Transformers may always be too sample-expensive for that.

But if anyone does make such a thing, OpenAI won't mind… so long as the AI is "safe" (whatever that means).

OpenAI has been totally consistent in saying that safety includes assuming weights are harmful until proven safe, because you cannot un-release a harmful model; other researchers say the opposite, on the grounds that white-box research makes safety research easier and more consistent.

I lean towards the former, not because I fear LLMs specifically, but because the irreversibility, and the fact that we don't know how close or far we are, mean it's a habit we should turn into a norm before it's urgent.


Very similar to Tesla and EVs


...like open balloon.


Reminds me of Heynote, posted to HN recently[0].

In general I think this approach of "super easy capture into an append-only log" is great, especially if it can be paired with features to enable editing/re-discovery/search/synthesizing old ideas together, which exist in a separate view/mode from the "just get something down as fast as possible" mode. I'm working on something like this, but just in nights-and-weekends free time around other obligations, so it's been slow going.

[0] https://news.ycombinator.com/item?id=38733968


From the title alone, I initially thought this would be about meditation.


Yeah, it's a balance. I love being able to help, and I am generally in favor of asking questions early, but not ones of the form "hey so I ran this code and it errored. Help?"

"... did you read the stack trace? Did you look at the code referenced by the stack trace?"

This is where I've learned responding with "Sure! What have you tried so far?" is relevant.


This is an interesting point, because they are kinda infinite spaces but they also impose a structure on the space, which I would assert is a part of why they are so successful, in contrast to the "infinite structureless blank paper" that the OP is talking about.

The trick is getting the amount of structure just right: not so much that it becomes too restrictive, and not so little that users are lost in the way the person you're replying to describes.


Infinite canvases require panning and zooming with the mouse and tracking motion visually. A lot of users probably don't like the cognitive load of that compared to other types of apps.

Excel handles this better because navigating around a large sheet feels more like "snapping", as there's no UI motion as your view zooms in and out. Plus, with shortcuts like Ctrl + Right Arrow, which immediately snaps your selected cell to the rightmost end of the current range of cells, the infinite canvas feels downright zippy. Excel sheets also have tabs that signal to users how they can split up their data instead of filling up Sheet1 with every scenario. Infinite canvases make you create a new file.


This is fair -- the newest token can attend perfectly to the oldest token, within the context window.

but also, on a broader scale, if a transformer model is presented with a long input that does not fit in its context (e.g.: you are building a chatbot, and you have a very long chat history), it must "compress" or "forget" some of that information (e.g.: repeatedly summarizing historical messages, dropping them and appending the summary at the beginning of the input).
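
(Something like this rough sketch, where `summarize` is a stand-in for whatever LLM call you'd use, and `max_messages` stands in for a real token budget:)

    def compress_history(messages, max_messages, summarize):
        # Keep the most recent messages verbatim; roll everything older
        # into a single summary message at the front of the input.
        if len(messages) <= max_messages:
            return messages
        old, recent = messages[:-max_messages], messages[-max_messages:]
        summary = {"role": "system", "content": summarize(old)}
        return [summary] + recent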

Mamba/RWKV/other "recurrent" architectures can theoretically operate on unbounded input lengths; they "forget" information from earlier tokens over time, but is that not comparable to what a transformer must do with input lengths greater than its context window?


I love this! Simple and solid execution. I've been wanting to build something similar for some time now, might fork and play around with it. Thank you for open sourcing it!

I've started using Obsidian with a new note for each day and separating "blocks" with a Markdown horizontal rule (`---`) to achieve something similar, but this is much cleaner.

The strength of such an approach is making capture extremely easy -- new block, start writing, no thinking about where this goes and how to fit it into pre-existing structure. I find that if I'm trying to do that, then by the time I find where my idea goes, I've lost the idea.

The downside, of course, is finding things again. The ability to tag or title a block and search by tag or title would be great. More ambitiously, it would be cool to experiment with incorporating LLMs and embeddings to automatically tag, summarize, categorize, cluster etc. your blocks.

There's a lot of different directions one could take this, but I'll echo the sentiment of others to refrain from adding too many features and losing the original appeal of simplicity. :)

Also: How do you handle performance when the buffer gets very large?


It's not open source, as it uses the Commons Clause which severely limits what can be done with it (the name is misleading).


As far as a quick google search got me, it seems pretty open with the only caveat being you can't sell or monetize it... how is that not open source?


If you put any restrictions on usage or what can be done with it (like selling), then it's absolutely not open source.

Open source doesn't just mean the source code is available; open source has a specific definition. There is a list of acceptable open source licenses, as defined by the OSI. Similarly, there is a list of acceptable free software licenses, as defined by the FSF. Broadly, the two lists are the same. Commons Clause is definitely not open source.


Not sure why you're being downvoted. Even the Commons Clause itself is clear about it:

https://commonsclause.com/#faq

> Is this “Open Source”?

> No.


There exists a niche of commercial software developers who are actively attempting to water down the commonly accepted meaning of "open source" for their own gain, and I suspect they are voting you down. :(


I didn't put too much consideration into picking a license. If someone has compelling arguments for why I should license it differently, I would absolutely consider it.


If you want to make commercial profit unlikely but encourage contributions and widespread use, GPL would probably work better than Commons Clause.


Performance is mostly handled by CodeMirror (https://codemirror.net/), the underlying editor that Heynote is built upon. It seems to handle quite large buffers well. Where I have seen some minor performance issues is when working with very large blocks in certain language modes.


I use Obsidian for my programming notes, troubleshooting logs, thinking on “paper”, writing and checking assumptions. It’s very powerful and quite performant. AMA.

