
For very large models, the weights may not fit on one GPU.

Also, sometimes having more than one GPU enables larger batch sizes if each GPU can only hold the activations for perhaps one or two training examples.

There is definitely a performance hit, but a GPU<->GPU peer transfer has lower latency than GPU->CPU->software context switch->GPU.

For "normal" pytorch training, the data is generally streamed through the GPU. The model does a training step on one batch while the next one is being loaded, and the transfer time is usually less than the time it takes to do the forward and backward passes through the batch.

For multi-GPU there are various data-parallel and model-parallel topologies to choose from, and there are ways of mitigating latency by interleaving some operations so you don't take the full hit, but multi-GPU training is definitely not perfectly parallel. It is almost required for some large models, and sometimes a mildly larger batch helps training convergence enough to overcome the latency hit on each batch.
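
For what it's worth, a minimal data-parallel sketch in PyTorch (the model, dataset, and hyperparameters are placeholders, and I'm assuming a torchrun-style launch that sets the rendezvous environment variables): each GPU runs its own process with a tiny per-GPU batch, the DataLoader workers stream the next batch while the GPU computes, and DDP all-reduces gradients across GPUs each step.

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def train(rank, world_size, model, dataset):
        # one process per GPU; assumes MASTER_ADDR/MASTER_PORT are set (e.g. by torchrun)
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        model = DDP(model.cuda(rank), device_ids=[rank])
        sampler = DistributedSampler(dataset)    # each GPU sees a disjoint shard
        loader = DataLoader(dataset, batch_size=2, sampler=sampler,
                            num_workers=4, pin_memory=True)  # workers prefetch the next batch
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for x, y in loader:
            x = x.cuda(rank, non_blocking=True)
            y = y.cuda(rank, non_blocking=True)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()   # DDP overlaps the gradient all-reduce with backward
            opt.step()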


As a Californian, I had a lot of trouble in the past year getting quotes for collision and comprehensive coverage at reasonable prices, if at all - everyone was happy to write liability, but it sounded like general property damage to autos (fires, floods, landslides, etc.) led a lot of companies to exit the California market or be very selective.

Of course higher repair costs due to general inflation, newer cars having lots of sensors, etc. add to it as well, but the agents I talked to specifically cited general comprehensive (not collision or liability) claim concerns as one of the main reasons.


The simplest way of building an MLP on top of the embeddings is to concatenate the embeddings and put some dense layers on top.

However, if you use a "two towers" approach - several additional MLP layers on top of each embedding separately, and then a dense MLP on the concatenation of the two towers' outputs - the individual tower MLPs act as an embedding transformation and will improve retrieval.

Like this:

        Final MLP
      /           \
   QP MLP       DP MLP
     |            |
    QE            DE
Now, you can apply the document preprocessor ("DP MLP") to your document embeddings ("DE") before storing them in the vector database, and apply the query preprocessor ("QP MLP") to your query embeddings ("QE") before querying your vector database.

This should improve the precision and recall of your vector retrieval step beyond using e.g. raw LLM embeddings. Even better, make the final layer plain cosine similarity, maximum inner product, or L2 distance rather than an MLP, so you can just use a raw threshold (it's at least worth trying).
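
Here's a minimal PyTorch sketch of that diagram (the layer sizes are made up, and "Tower" is just a placeholder name). The same class is instantiated once as the "QP MLP" and once as the "DP MLP", and cosine similarity serves as the final layer so the document side can be precomputed for the vector database:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Tower(nn.Module):
        # a small MLP that re-projects a raw embedding
        def __init__(self, dim_in=768, dim_out=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim_in, 512), nn.ReLU(),
                nn.Linear(512, dim_out))

        def forward(self, x):
            return self.net(x)

    query_tower = Tower()  # "QP MLP", applied to query embeddings (QE)
    doc_tower = Tower()    # "DP MLP", applied to document embeddings (DE)

    def score(query_emb, doc_emb):
        # cosine similarity as the final layer; doc_tower(doc_emb) can be
        # precomputed and stored in the vector DB, queries are mapped at query time
        q = F.normalize(query_tower(query_emb), dim=-1)
        d = F.normalize(doc_tower(doc_emb), dim=-1)
        return (q * d).sum(-1)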


Interesting. This sounds like it would be similar to fine-tuning the embeddings, but with the added benefit of learning different representations for the query and the document. If you keep a distance/similarity measure as the final layer, then I'm assuming this isn't going to work with binary labels?


If you have e.g. cosine similarity as the final layer: if the label is 1 you reward the similarity being close to 1, and if the label is 0 you reward it being close to 0.

The finetuning here is specific to optimizing for retrieval, which may be different from just matching documents, and that can be an advantage.

You may want to force the query and document finetunings to be the same, which makes a lot of sense. But the advantage of letting them differ is that query strings are often rather short and use a different sort of language structure than documents, so when it works well, the separate query and document tunings in some sense "normalize" queries and documents into the same space.
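
As a sketch (the function and argument names are mine, not from any library), the binary-label objective described above pushes cosine similarity toward 1 for relevant pairs and toward 0 for irrelevant ones:

    import torch
    import torch.nn.functional as F

    def binary_cosine_loss(q_emb, d_emb, labels):
        # q_emb/d_emb: query/document embeddings after their respective towers
        sim = F.cosine_similarity(q_emb, d_emb, dim=-1)   # in [-1, 1]
        # reward sim near 1 when label == 1, near 0 when label == 0
        return F.mse_loss(sim, labels.float())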


This has got me curious. I don't really understand how binary labels could give us embeddings that are well-ordered by distance. For a training pair where (Q, D) are highly similar and a pair where (Q, D) are just barely related, the model is being trained that they're the same distance apart. Is there something I'm not seeing here?


See contrastive losses and siamese networks, like here (where they use L2 distance):

https://lilianweng.github.io/posts/2021-05-31-contrastive/

If documents are similar, you want the two embeddings to be close to each other; if they are dissimilar, you want them to be far apart.
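
For reference, a minimal sketch of the classic margin-based contrastive loss with L2 distance (the margin is a placeholder hyperparameter you'd tune):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(q_emb, d_emb, labels, margin=1.0):
        # labels: 1.0 for matching (query, document) pairs, 0.0 for non-matching
        d = F.pairwise_distance(q_emb, d_emb)            # L2 distance per pair
        pos = labels * d.pow(2)                          # pull matches together
        neg = (1 - labels) * F.relu(margin - d).pow(2)   # push non-matches past the margin
        return (pos + neg).mean()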

Binary relevance judgements of course don't necessarily produce an ordering, but usually over a large enough set of training examples some will be better matches than others.

"Learning to Rank" gets you into all kinds of labels and losses if you want to go down that rabbit hole.


On Linux, feenableexcept(3) lets you do this.


aha! feenableexcept! i'd never heard of it!

    // Demonstrate C99/glibc 2.2+ feenableexcept

    #define _GNU_SOURCE
    #include <math.h>
    #include <fenv.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      if (argc != 3) {
        fprintf(stderr, "Usage: %s a b  # floating-point divide a by b\n", argv[0]);
        return 1;
      }

      // Unmask the divide-by-zero FP exception: after this call, dividing a
      // finite nonzero value by zero raises SIGFPE instead of quietly
      // producing ±inf.
      feenableexcept(FE_DIVBYZERO);
      double a = atof(argv[1]), b = atof(argv[2]);
      printf("%g ÷ %g = %g\n", a, b, a/b);
      return 0;
    }
is the machine still ieee-754-compliant after you call feenableexcept? i'm guessing that if it weren't, intel wouldn't have defined the architecture to have this capability, given the historically close relationship between 8087 and ieee-754


See my other comment, but it's still compliant. You are allowed to handle exceptions in a lot of different ways.


What happens when a person gets burnt out and makes mistakes?

Like this? https://en.wikipedia.org/wiki/United_States_bombing_of_the_C...


> As an engineer, my view on days/hackathons is pretty simple: you don't pay me enough for my good ideas.

That depends on how you interpret hack days.

A lot of great hack day projects I've seen aren't introducing a new product line; they're "what if we used this new framework, what's possible?", "what if we used transformers to replace our convolutional neural network?", etc. - things that aren't a new product but have a shot at improving the development flow or end-user experience by 20+%, yet are speculative enough that they're hard to get scheduled in the normal flow of development.

> At our most recent company hackathon, we focused on a new way to do the old thing. A lot of what great hackathons are.

Most hackathon projects are not something you'd be able to start a company around; if they are, maybe you're doing it wrong and overindexing on demos to PMs rather than potential company impact.


We are in agreement :)


> I know this feels like being given a teaspoon and told to excavate a cavern --- your mental health struggle sounds like it has its roots in life challenges that are genuinely stressful, rather than the putative 'chemical imbalance' sort of anxiety/depression. And a pen is no magic wand -- it cannot erase your debts, or restore your sense of ease.

Good habits, antidepressants, etc. may not fix the underlying problems, but they may turn the teaspoon into a tablespoon into a shovel.


You can't crypto-loop your way to $500k in outside funds. Let's say you can borrow a fraction (1 - r) of deposits, for some small r that could even be 0 (in fractional-reserve banking's monetary expansion, r is the "fractional reserve").

At time t0, you deposit 100 ETH, and borrow 99 ethereum worth of USDT.

You loop that back into 99 ETH and redeposit it. You now have 199 ETH of assets and 99 ETH worth (at time-t0 prices) of USDT debt (that you basically owe to yourself) - 100 ETH net, which is what you started with.

You can now borrow an additional 98.01 ETH worth of USDT that you could spend on a house - less than the 100 ETH you started with.

Or you could redeposit that, to have 297.01 ETH total. But now you can only borrow an additional 97.03 ETH.

The total leverage on your ETH/USDT trade goes up, but the amount you can take out of the system (to, say, buy a house) can only stay the same or go down at each step through the loop.
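
A few lines of toy arithmetic (r = 0.01 as above) make the point - the fresh borrow shrinks by a factor of (1 - r) on every trip through the loop, so the amount you can actually take out never exceeds the original deposit:

    r = 0.01
    deposit = 100.0
    borrowable = (1 - r) * deposit   # what you could take out right now
    for step in range(4):
        print(f"loop {step}: can take out {borrowable:.2f} ETH worth of USDT")
        borrowable *= (1 - r)        # ...or redeposit it and borrow a bit less next time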


This is how IKEA has been operating forever.

https://www.investopedia.com/articles/investing/012216/how-i....


Perhaps, but let’s not assume that what a wealthy Nazi sympathizer bribed his way into doing in postwar Europe is a repeatably legal arrangement.


Gatorade wasn't pure research - it was scientists directly trying to solve a problem for their football team, so they already had customers before they finished their science. The customer problem prompted the research, vs. the other way around in most university research. It was very much not pure research.


Paying customers are a whole new class of customer, compared to dozens of athletes acting as test subjects, and when you put that on the scale of a mass-market, the rest is history.

It took years of pressure from other teams before Gatorade was finally available to them and the public; nobody thought it was going to fly off the shelf.

It exceeded everyone's expectations, from the athletes and scientists to the salesmen and marketers. That is something that can come straight out of the university?

When you plan to make lots of bucks, it's already spent before you make it, whether you make it or not.

But if you just come up with something(s) that is outstanding for its intended purpose, and it gains fair recognition, you can end up with more money than you dreamed possible.

Dr. Cade was not trying to contribute to the economy in the original effort.

