
I learned Git from an O'Reilly book and I loved that it started with the internals first.

The git CLI has some rough edges, but once you have the concepts of the work tree, the index, commits, and diffs down, it is extremely powerful. Magit in Emacs is also incredible.


Ah yes, the hallmark of great software is having to learn how it's implemented to be able to use it.

None of the concepts behind git are difficult to grasp; the problem is the interface and the leaky abstractions all over. Anyone who brings up the reflog is effectively saying they don't understand any of the points above.


> Ah yes, the hallmark of great software is having to learn how it's implemented to be able to use it.

Because learning how a piece of software models a particular problem domain is a great step towards using it efficiently? You can hope it's magic, but that's a recipe for failure if you're a heavy user. Every professional learns the ins and outs of the tools they use often.


https://v5.chriskrycho.com/essays/jj-init/

> The problems with Git are many, though. Most of all, its infamously terrible command line interface results in a terrible user experience. In my experience, very few working developers have a good mental model for Git. Instead, they have a handful of commands they have learned over the years: enough to get by, and little more. The common rejoinder is that developers ought to learn how Git works internally — that everything will make more sense that way.

> This is nonsense. Git’s internals are interesting on an implementation level, but frankly add up to an incoherent mess in terms of a user mental model. This is a classic mistake for software developers, and one I have fallen prey to myself any number of times. I do not blame the Git developers for it, exactly. No one should have to understand the internals of the system to use it well, though; that is a simple failure of software design. Moreover, even those internals do not particularly cohere. The index, the number of things labeled “-ish” in the glossary, the way that a “detached HEAD” interacts with branches, the distinction between tags and branches, the important distinctions between commits, refs, and objects… It is not that any one of those things is bad in isolation, but as a set they do not amount to a mental model I can describe charitably. Put in programming language terms: One of the reasons the “surface syntax” of Git is so hard is that its semantics are a bit confused, and that inevitably shows up in the interface to users.


I truly don't get it. All these supposed software engineers who can't seem to grasp the basics of the tools of the trade. It's like watching carpenters argue about a hammer every few weeks for ten plus years.

There's no excuse. Either learn git, or stop using it. If you can do neither, stop complaining, because plenty of people use it just fine.

And no, you really don't have to understand the internals to use git, but you DO need a mental model of what's happening: what a commit is, what a ref is, what a branch is, what merging/rebasing does, etc. These don't involve knowing the internals, but they do involve maybe reading the manual and actually THINKING about what is happening.

Too many developers confuse not wanting to think about something with that thing being difficult. I learned git one summer during a college internship 15 years ago, and I've been fine ever since. I am really, truly, not that smart, and if I can do it, so can you.

Everyone needs to just quit their bitching and actually learn something for once.


What are you doing programming if you can't handle interfaces and leaky abstractions? That's basically the job!

This quote from "PHP: a fractal of bad design" [0] seems applicable here:

> Do not tell me that it’s the developer’s responsibility to memorize a thousand strange exceptions and surprising behaviors. Yes, this is necessary in any system, because computers suck. That doesn’t mean there’s no upper limit for how much zaniness is acceptable in a system. PHP is nothing but exceptions, and it is not okay when wrestling the language takes more effort than actually writing your program. My tools should not create net positive work for me to do.

[0] https://eev.ee/blog/2012/04/09/php-a-fractal-of-bad-design/


Yes, totally agree. Curious whether you think visual or gamified tools might have been useful for getting an initial grasp of the kinds of concepts you mentioned? And if so, where might they fit in?

https://learngitbranching.js.org/

is my go-to recommendation


One point: if you are re-reviewing, other platforms (e.g. Phabricator, Gerrit) have much more developed ways to compare changes relative to one another.


Phabricator was awful. It loaded commit messages with irrelevant boilerplate and allowed people to post patches that you couldn’t build for testing because they were just diffs and not actual branches in the repo. Not sad to see it go.


This is a great set of comments/questions! To try and answer this a bit briefly:

The input string is tokenized into a sequence of token indices (integers) as the first step of processing the input. For example, "Hello World" is tokenized to:

  [15496, 2159]
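(A quick sketch of how indices like these can be produced, assuming the Hugging Face transformers library and GPT-2's pretrained BPE tokenizer; other tokenizers will give different indices.)

  from transformers import AutoTokenizer

  # Load a pretrained BPE tokenizer (GPT-2's vocabulary is assumed here).
  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  token_ids = tokenizer.encode("Hello World")
  print(token_ids)                                    # e.g. [15496, 2159]
  print(tokenizer.convert_ids_to_tokens(token_ids))   # e.g. ['Hello', 'ĠWorld']
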
The first step in a transformer network is to embed the tokens. Each token index is mapped to a (learned or fixed) embedding (a vector of floats) via the embeddings table; PyTorch's torch.nn.Embedding module is commonly used for this. After mapping, the matrix of embeddings will look something like:

  [[-0.147, 2.861, ..., -0.447],
   [-0.517, -0.698, ..., -0.558]]
where the number of columns is the model dimension.
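
A minimal sketch of that lookup with torch.nn.Embedding (the vocabulary size and model dimension below are just illustrative):

  import torch
  import torch.nn as nn

  vocab_size, d_model = 50257, 768          # illustrative values
  embedding = nn.Embedding(vocab_size, d_model)

  token_ids = torch.tensor([15496, 2159])   # "Hello World"
  x = embedding(token_ids)                  # one row of floats per token
  print(x.shape)                            # torch.Size([2, 768])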

A single transformer block takes a matrix of embeddings and transforms them to a matrix of identical dimensions. An important property of the block is that if you reorder the rows of the matrix (which can be done by reordering the input tokens), the output will be reordered but otherwise identical too. (The formal name for this is permutation equivariance).
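
You can check this property directly on a plain self-attention layer (a sketch using torch.nn.MultiheadAttention as a stand-in for a full transformer block):

  import torch
  import torch.nn as nn

  torch.manual_seed(0)
  attn = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)

  x = torch.randn(1, 5, 8)      # batch of 1, sequence of 5 tokens, dimension 8
  perm = torch.randperm(5)

  out, _ = attn(x, x, x)        # self-attention on the original order
  out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

  # Permuting the input rows just permutes the output rows.
  print(torch.allclose(out[:, perm], out_perm, atol=1e-6))   # True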

In problems related to language it seems inappropriate for the order of tokens not to matter, so to address this we need to adjust the initial embeddings of the tokens based on their position.

There are a few common ways you might see this done, but they broadly work by assigning fixed or learned embeddings to each position in the input token sequence. These embeddings can be added to our matrix above so that the first row gets the embedding for the first position added to it, the second row gets the embedding for the second position, and so on. Now if the tokens are reordered, the combined embedding matrix will not be the same. Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).
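
A sketch of both variants with learned absolute position embeddings (none of the sizes below are specific to any particular model):

  import torch
  import torch.nn as nn

  d_model, max_len, d_pos = 768, 1024, 32
  tok_emb = nn.Embedding(50257, d_model)       # token embeddings
  pos_emb_add = nn.Embedding(max_len, d_model) # positional embeddings (added)
  pos_emb_cat = nn.Embedding(max_len, d_pos)   # positional embeddings (concatenated)

  token_ids = torch.tensor([15496, 2159])
  positions = torch.arange(token_ids.shape[0]) # [0, 1]

  x = tok_emb(token_ids)

  # Variant 1: add position information into the same d_model columns.
  x_added = x + pos_emb_add(positions)                        # shape: (2, 768)

  # Variant 2: keep it separate by concatenating extra columns.
  x_concat = torch.cat([x, pos_emb_cat(positions)], dim=-1)   # shape: (2, 800)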

I put together this repository at the end of last year to better help visualize the internals of a transformer block when applied to a toy problem: https://github.com/rstebbing/workshop/tree/main/experiments/.... It is not super long, and the point is to try and better distinguish between the quantities you referred to by seeing them (which is possible when embeddings are in a low dimension).

I hope this helps!


> Alternatively, these embeddings can be concatenated horizontally to our matrix: this guarantees the positional information is kept entirely separate from the linguistic (at the cost of having a larger model dimension).

Yes, the entire description is helpful, but I especially appreciate this validation that concatenating the position encoding is a valid option.

I've been thinking a lot about aggregation functions, usually summation since it's the most basic aggregation function. After adding the token embedding and the positional encoding together, it seems information has been lost, because the resulting sum cannot be separated back into the original values. And yet, that seems to be what they do in most transformers, so it must be worth the trade-off.

It reminds me of being a kid, when you first realize that zipping a file produces a smaller file and you think "well, what if I zip the zip file?" At first you wonder if you can eventually compress everything down to a single byte. I wonder the same with aggregation / summation, "if I can add the position to the embedding, and things still work, can I just keep adding things together until I have a single number?" Obviously there are some limits, but I'm not sure where those are. Maybe nobody knows? I'm hoping to study linear algebra more and perhaps I will find some answers there?


One thing to bear in mind is that these embedding vectors are high dimensional, so that it is entirely possible that the token embedding and position embedding are near-orthogonal to one another. As a result, information isn't necessarily lost.
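
A quick way to see the near-orthogonality point with plain random vectors (not trained embeddings, just illustrative):

  import torch

  torch.manual_seed(0)
  d = 768
  a, b = torch.randn(d), torch.randn(d)

  # The cosine similarity of independent high-dimensional random vectors
  # concentrates around zero, i.e. they are nearly orthogonal.
  cos = torch.dot(a, b) / (a.norm() * b.norm())
  print(cos.item())   # typically a few hundredths in magnitude for d = 768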


The information might be formally lost for the given token, but remember that transformers train on huge amounts of data.

The (absolute) positional encoding is an arbitrary but fixed bias (a push in some direction). The word "cat" at position 2 is pushed in the 2-direction. This "cat" might be different from a "cat" at position 3, so that the model can learn about this distinction.

Nevertheless, the model could also still learn to keep "cat"s at all positions together, for instance so that "cat"s are more similar to other "cat"s than to "dog"s at any position. More importantly, for some words, the model might learn that a word at the beginning of the sequence should have an entirely different meaning than the same word at the end of the sequence.

In other words, since the embeddings are a free parameter to be learned (usually both as embeddings and weight-tied in the head), there isn't any loss in flexibility. Rather, the model can learn how much mixing is required, or whether the information added by the positional embedding should be separable (for instance by making the embeddings otherwise linearly independent).

If you concat, you carry along an otherwise useless and static dimension, and mixing it into the embeddings would be the very first thing the model learns in layer 1.


> The input string is tokenized into a sequence of token indices (integers)

How is this tokenization done? Sometimes a single word can be two tokens. My understanding is that the token indices are also learned, but by whom? The same transformer? Another neural network?


Hugging Face has good guides on tokenization and tokenizer training. BPE (used by e.g. GPT) and WordPiece (used by e.g. BERT) are two commonly used methods: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt


The tokenization is done by the tokenizer, which can be thought of as just a function that maps strings to sequences of integers before the neural network. Tokenizers can be hand-specified or learned, but in either case this is typically done separately from training the model. Training a new tokenizer is also rarely necessary unless you are dealing with an entirely new input type or language.

Tokenizers can be quite gnarly internally. https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt is a good resource on BPE tokenization.
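
To get a concrete feel for subword splitting (a sketch assuming GPT-2's pretrained BPE tokenizer via Hugging Face; the exact splits depend on the learned vocabulary):

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")

  # A common word may be a single token, while a rarer word is
  # split into several subword pieces.
  print(tokenizer.tokenize("cat"))            # e.g. ['cat']
  print(tokenizer.tokenize("tokenization"))   # e.g. ['token', 'ization']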


At the end of last year I put together a repository to try and show what is achieved by self-attention on a toy example: detect whether a sequence of characters contains both "a" and "b".

The toy problem is useful because the model dimensionality is low enough to make visualization straightforward. The walkthrough also goes through how things can go wrong, and how it can be improved, etc.

The walkthrough and code are all available here: https://github.com/rstebbing/workshop/tree/main/experiments/....

It's not terse like nanoGPT or similar because the goal is a bit different. In particular, to gain more intuition about the intermediate attention computations, the intermediate tensors are named and persisted so they can be compared and visualized after the fact. Everything should be exactly reproducible locally too!
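
For a rough sense of the shape of such a model, here is a generic sketch of a single-head, single-layer attention classifier in PyTorch (not the repository's code; names and sizes are made up):

  import torch
  import torch.nn as nn

  class TinyAttentionClassifier(nn.Module):
      """Embed characters, apply one self-attention layer, mean-pool, classify."""

      def __init__(self, vocab_size=28, d_model=8):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, d_model)
          self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
          self.head = nn.Linear(d_model, 1)   # logit for "contains both 'a' and 'b'"

      def forward(self, token_ids):           # token_ids: (batch, seq_len)
          x = self.embed(token_ids)
          x, _ = self.attn(x, x, x)
          return self.head(x.mean(dim=1)).squeeze(-1)

  model = TinyAttentionClassifier()
  batch = torch.randint(0, 28, (4, 10))       # 4 random sequences of 10 "characters"
  print(model(batch).shape)                   # torch.Size([4])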


I put together a repository at the end of last year to walk through a basic use of a single layer Transformer: detect whether "a" and "b" are in a sequence of characters. Everything is reproducible, so hopefully helpful at getting used to some of the tooling too!

https://github.com/rstebbing/workshop/tree/main/experiments/...

