joennlae · on Aug 22, 2025

How can I make sure that each github runner uses exactly one cpu core?

joennlae · on May 16, 2024

Trainable Llama-like transformer (with backpropagation) in numpy only (~600 lines)

https://github.com/joennlae/tensorli

Zambyte · on May 17, 2024

The description says GPT-like, but it is just a GPT, right?

p1esk · on May 17, 2024

GPT refers to the specific family of models developed at OpenAI.

Zambyte · on May 17, 2024

It also stands for generative pretrained transformer, which this seems to be.

p1esk · on May 17, 2024

It’s like saying SSD is a YOLO. Both are single shot object detectors, but only YOLO is “a YOLO”.

joennlae · on Nov 28, 2023

+1. Does someone know how to do that?

progbits · on Nov 28, 2023

Firefox only: https://news.ycombinator.com/item?id=36757542

m1el · on Nov 28, 2023

The minimap contains a copy of the content, but with `transform: scale`. The rest is handling `window.onscroll` and mouse events on the overlay.

Klaster_1 · on Nov 28, 2023

Found a canvas-based library for this: https://larsjung.de/pagemap/. Definitely not what OP uses, where the minimap is a shrunk copy of the content markup, with all the drawbacks, such as page search finding the same item twice.

orlp · on Nov 28, 2023

The author should really add at least aria-hidden="true" to the minimap element.

joennlae · on Nov 21, 2023

Author here: Let me try to give an overview, as I saw some questions repeating themselves.

* This accelerator is for an Edge/Inference case, so there is no training on this chip.

* We introduce a differentiable form of Maddness, allowing Maddness to be used in e2e training, and present an application to ResNet.

* We are still in the process of understanding how this will translate to transformers.

* The goal was to show that Maddness is feasible with a good codesign of the hardware.

* Compared to other extreme quantisation (BNN/TNN) and pruning schemes, this is more general as it replaces the matmul with an approximate matmul.

* The model architecture is not fixed in hardware. It is „just“ a matmul unit.

I hope this helps :-)
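To make "replaces the matmul with an approximate matmul" concrete, here is a minimal NumPy sketch of a lookup-table matmul in the product-quantization style. It is not the paper's or the chip's implementation: the function names are hypothetical, prototypes are picked by random sampling, and rows are encoded by nearest-prototype search, whereas Maddness encodes with decision trees.

```python
import numpy as np

def fit_prototypes(A_train, n_codebooks, n_protos, rng):
    """Pick n_protos prototypes per column subspace by sampling training rows.
    Assumption: random sampling stands in for a learned codebook.
    The number of columns must be divisible by n_codebooks."""
    subspaces = np.split(np.arange(A_train.shape[1]), n_codebooks)
    idx = rng.choice(A_train.shape[0], size=n_protos, replace=False)
    return subspaces, [A_train[idx][:, cols] for cols in subspaces]

def encode(A, subspaces, protos):
    """Map each row of A to its nearest prototype index in every subspace.
    Assumption: brute-force nearest neighbour, not Maddness' decision trees."""
    codes = np.empty((A.shape[0], len(subspaces)), dtype=np.int64)
    for c, (cols, P) in enumerate(zip(subspaces, protos)):
        dist = ((A[:, cols][:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
        codes[:, c] = dist.argmin(axis=1)
    return codes

def build_luts(B, subspaces, protos):
    """Precompute prototype-times-B products: one (n_protos, M) table per subspace."""
    return [P @ B[cols, :] for cols, P in zip(subspaces, protos)]

def approx_matmul(codes, luts):
    """Approximate A @ B by summing one looked-up table row per subspace."""
    return sum(lut[codes[:, c]] for c, lut in enumerate(luts))
```

Once `B` is fixed (for example, the weights of a layer), the tables are built once; at inference time only the encoding and the table lookups plus additions remain.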

joennlae · on Nov 21, 2023

Thank you for the feedback :-)

We have to be careful with the comparisons we make. The TPUv3 is a training and datacenter chip and not an Edge/Inference chip. They optimise for a different tradeoff, so while the comparison looks good, it is unfair.

joennlae · on Nov 21, 2023

Author here:

Thank you for the feedback :-) A lot of the work regarding the comparison with „simple“ approximate matrix multiplication has been done in the preceding paper: https://arxiv.org/abs/2106.10860

While I share your enthusiasm regarding the potential, we have to be careful about the limiting factors. Our main contribution on the algorithmic side is the reformulation of Maddness so that it is differentiable (autogradable) and can be used in e2e DNN training, which the original formulation does not allow because decision trees are not differentiable.

We are still in the process of understanding how to optimise the training. In the next step, we want to look into transformers as, for now, we only looked into ResNets for easy comparability.

If you are a student at ETH Zurich and want to work on this -> reach out to me

fxtentacle · on Nov 21, 2023

Thanks for pointing that out :) When I first read the paper, I thought that 4. DIFFERENTIABLE MADDNESS was still part of the 3. BACKGROUND section.

Also, I have to admit that I don't quite understand that section, even after trying a 2nd time. The text implies that Sc would be 15x4 and Hc would be 16x15, but in the illustration it looks like 3x2 and 4x3. I guess I'll have to read Zhang [37] first because, as it stands, I'm not sure what the selection matrix and description matrix do here. That said, (8) and what follows are easy to understand again. You use the softmax to create an approximately correct gradient but use the hard maximum for calculating the forward pass values.
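The "hard maximum forward, softmax gradient backward" idea from the last two sentences can be sketched in NumPy as follows. This is a generic straight-through-style illustration with hypothetical function names, not a reproduction of the paper's equation (8).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hard_select_forward(z):
    """Forward pass: one-hot vector at the arg max of each row (not differentiable)."""
    out = np.zeros_like(z)
    out[np.arange(z.shape[0]), z.argmax(axis=-1)] = 1.0
    return out

def soft_select_backward(z, grad_out):
    """Backward pass: use the softmax gradient as a surrogate for the hard maximum.
    Vector-Jacobian product of softmax: s * (g - sum(g * s))."""
    s = softmax(z)
    return s * (grad_out - (grad_out * s).sum(axis=-1, keepdims=True))
```

The forward values stay exact one-hot selections, while the backward pass gives every logit a nonzero, smooth gradient signal.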

pmontra · on Nov 21, 2023

As you are the author: why the name Stella Nera / Black Star?

fxtentacle · on Nov 21, 2023

Not the author but

https://www.youtube.com/watch?v=N8JCMJQ1jyw&list=OLAK5uy_lYv...

was a platinum hit in Switzerland, where ETH Zürich is located.

joennlae · on Nov 17, 2023

That is true. I went for a simple implementation of the layer norm and included it in the tensorli definition. But it would have been better to define it as a moduli for clarity.
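For illustration, a standalone layer norm written as its own small class might look like the sketch below. It uses plain NumPy arrays and covers only the forward pass; the class name and parameters are assumptions and do not reflect tensorli's actual API.

```python
import numpy as np

class LayerNorm:
    """Forward-only layer normalisation over the last axis (hypothetical sketch)."""

    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)   # learnable scale, initialised to 1
        self.beta = np.zeros(dim)   # learnable shift, initialised to 0
        self.eps = eps              # avoids division by zero for constant inputs

    def __call__(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta
```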

joennlae · on Nov 17, 2023

This would be interesting to consider. But at the moment, nothing is optimized, so many things must be tackled first (especially in the backward pass, for example, buffering) to justify moving to cupy. The goal was to use it as an educational exercise for me.

joennlae · on Nov 17, 2023

They are still applying: https://tmsearch.uspto.gov/bin/showfield?f=doc&state=4805:wl...

Paul-Craft · on Nov 18, 2023

I can't see the search query or results you got from that link, but I did a search as well and found OpenAI applied for GPT-5, GPT-6, and GPT-7, which is no surprise. I'd be surprised if any of those were granted, though, because "GPT" itself is a generic term already in research papers.

joennlae · on Nov 17, 2023

The author here: I absolutely agree with you. I went for a slightly catchier title.