
Mobile comment view is messed up on Safari.

The top level view seems to leave a bit too much margins on mobile.


Thanks for the feedback, I need to give mobile some more attention. Will fix ASAP.

Assuming a $30k GPU with 3yr depreciation, it's an additional $1.14/h. Much more than energy.
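The $1.14/h figure falls out of straight-line depreciation of the assumed $30k purchase price over three years of continuous use:

```python
# Straight-line depreciation of an assumed $30k GPU over 3 years of 24/7 use.
gpu_price = 30_000              # USD, the figure assumed in the comment
hours = 3 * 365 * 24            # 26,280 hours in three years
print(round(gpu_price / hours, 2))  # -> 1.14 USD per hour
```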

I guess the short way to say it is that "undecidable" doesn't mean "it can't ever be decided", just not always.

And of course all programs of practical significance are finite state machines (since there is only a finite number of atoms in the universe).


Isn't the point closer to: humans simply go "hey, that seems to be taking a little long?" when a program doesn't halt, so why couldn't a machine? Basically a fairly obvious constraint on the solution space is "completes in less than N wall-clock time".

You can definitely detect a portion of halting machines this way, but it's probably a relatively small portion, because the Busy Beaver numbers grow inconceivably quickly: the longest-running machines that halt run practically forever; you'd need more time than the universe has negentropy left to detect them.
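A minimal sketch of the budget-based heuristic being discussed (toy step function and names are hypothetical, not from any real library): running under a budget can only confirm halting, never refute it, since a machine may halt just past any budget you pick.

```python
# Hypothetical sketch: a step-budget "halting detector".
# A True result proves halting; running out of budget proves nothing,
# which is why Busy Beaver growth makes this miss long-running halters.

def halts_within(step_fn, state, max_steps):
    """Apply step_fn until it returns None (halt) or the budget runs out.
    Returns True (halted) or None (inconclusive)."""
    for _ in range(max_steps):
        state = step_fn(state)
        if state is None:
            return True
    return None  # "unknown", NOT "does not halt"

# Toy machine: counts down from n and halts at 0.
countdown = lambda n: None if n == 0 else n - 1
print(halts_within(countdown, 10, 100))      # True: halts within budget
print(halts_within(countdown, 10**6, 100))   # None: looks non-halting under this budget
```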

One sentence from The Economist seems to explain more than TFA: "Microsoft reported a 31% increase in its indirect (Scope 3) emissions last year from building more data centres (including the carbon found in construction materials) as well as from semiconductors, servers and racks."

So no, it's not about lack of renewable electricity.

https://www.economist.com/the-world-this-week/2024/05/16/bus...


Projects change license for new code going forward. The old code remains available under the previous license (and sometimes new). Here, they are able to change the conditions for existing weights.

From the model card:

Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.

Doesn't say how long though.
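The quoted parallelism degrees are consistent with the GPU count: the world size is the product of the tensor-, pipeline-, and data-parallel degrees.

```python
# GPU count implied by the 3D parallelism config from the model card.
tp, pp, dp = 8, 1, 128        # tensor-, pipeline-, data-parallel degrees
world_size = tp * pp * dp
print(world_size)             # -> 1024, matching the 1024 A100s quoted
```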


It does say how long on Huggingface:

> The model training took roughly two months.


The post seems to be about execution speed though. However, even there it's definitely not the #1 factor, as witnessed by the popularity of CPython...

Survivorship bias? Maybe they just don't catch the ones that only had two suitcases of cash.

This, plus hastiness and overconfidence. When you do it over a long stretch of time, you occasionally try to stretch the limits.

From the fraud cases I've worked on, greed does factor in and you can often see a point in time within the data where it seems like they realize they are 'getting away with it'.

TFA says you can teach it new facts, but it's very slow and makes the model hallucinate more.


Yes. It's speculative decoding but instead of generating just a few sequential tokens with the draft model they generate a whole tree of some sort of optimal shape with hundreds of possible sequences.

It ends up being somewhat faster than regular speculative decoding in the normal setting (GPU only). If you are doing CPU offloading, it's massively faster.

Edit: typo

