ziedaniel1's comments | Hacker News

What are some others?

Very cool idea - but unless I'm missing something, this seems very slow.

I just wrote a simple loop in C++ to sum up 0 to 2^30. With a single thread without any optimizations it runs in 1.7s on my laptop -- matching Bend's performance on an RTX 4090! With -O3 it vectorizes the loop to run in less than 80ms.

    #include <iostream>

    int main() {
      int sum = 0;
      for (int i = 0; i < 1024*1024*1024; i++) {
        sum += i; 
      }
      std::cout << sum << "\n";
      return 0;
    }


Bend has no tail-call optimization yet. It is allocating a 1-billion long stack, while C is just looping. If you compare against a C program that does actual allocations, Bend will most likely be faster with a few threads.
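
Roughly, the shape of what Bend evaluates today is closer to this recursive formulation - a C++ sketch of the call structure, not Bend's actual output:

    #include <cstdint>
    #include <iostream>

    // Recursive sum with no tail-call optimization: the addition happens
    // after the recursive call returns, so every level keeps a frame alive.
    // That is roughly the allocation pattern Bend has right now, versus C
    // just looping in registers.
    uint64_t sum_rec(uint64_t i, uint64_t n) {
      if (i == n) return 0;
      return i + sum_rec(i + 1, n);  // not a tail call
    }

    int main() {
      // Kept small on purpose: without optimization, a depth of 2^30
      // would exhaust a native stack, which is the point being made.
      std::cout << sum_rec(0, 1 << 15) << "\n";
      return 0;
    }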

Bend's codegen is still abysmal, but these are all low-hanging fruit. Most of the work went into making the parallel evaluator correct (which is extremely hard!). I know that sounds like "trust me", but the single-thread performance will get much better once we start compiling procedures, generating loops, etc. It just hasn't been done yet.

(I wonder if I should have waited a little bit more before actually posting it)


> (I wonder if I should have waited a little bit more before actually posting it)

No. You built something that’s pretty cool. It’s not done yet, but you’ve accomplished a lot! I’m glad you posted it. Thank you. Ignore the noise and keep cooking!


>> Bend has no tail-call optimization yet.

I've never understood the fascination with tail calls and recursion among computer science folks. Just write a loop, it's what it optimises to anyway.
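
For a simple accumulation the two really are interchangeable - a sketch of the tail-recursive form next to the loop it typically becomes (my example, nothing from Bend):

    // With optimizations on, compilers generally turn the first function
    // into the second: the recursive call is the last thing that happens,
    // so the current stack frame can simply be reused.
    long long sum_tail(long long i, long long n, long long acc) {
      if (i == n) return acc;
      return sum_tail(i + 1, n, acc + i);  // tail call: nothing left to do after it
    }

    long long sum_loop(long long n) {
      long long acc = 0;
      for (long long i = 0; i < n; i++) acc += i;
      return acc;
    }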


Computer science traditionally focuses on algorithms. Divide-and-conquer algorithms are frequently much more clearly expressed as recursion. Quicksort is the classic example, also various tree operations.
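
A rough sketch of why the recursive form reads naturally (plain C++ with a naive pivot choice, just for illustration):

    #include <algorithm>
    #include <vector>

    // Divide and conquer: partition around a pivot, then recurse on the
    // two halves. The code mirrors the structure of the algorithm.
    void quicksort(std::vector<int>& a, int lo, int hi) {
      if (lo >= hi) return;
      int pivot = a[hi];
      int i = lo;
      for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) std::swap(a[i++], a[j]);
      }
      std::swap(a[i], a[hi]);    // pivot lands in its final position
      quicksort(a, lo, i - 1);   // sort the left part...
      quicksort(a, i + 1, hi);   // ...and the right part
    }

Writing this with an explicit stack and a loop is possible, of course, but the intent is far less obvious.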


I've never understood the fascination with programming languages among computer science folks. Just write machine code directly, it's what it compiles to anyway.


If they’re low-hanging fruit, why not do that before posting about it publicly? All that happens is that you push yourself into a nasty situation: people get a poor first impression of the system and are less likely to trust you the second time around, and in the (possibly unlikely) event that the problems turn out to be harder than you expect, you wind up in the really nasty situation of having to deal with failed expectations and pressure to fix them quickly.


I agree with you. But then there's the whole "release fast, don't wait until it's perfect" school of thought. And then there's the fact that people using it will guide us toward iteratively building what is needed. I'm still trying to find that balance; it isn't so easy. This release comes right after we finally managed to compile it to GPUs, which is a huge milestone people could care about - but there are almost no micro-optimizations yet.


That's how development under open source works. You can't please everyone.


There’s a big difference between developing something and announcing loudly that you have something cool; the developers have done the latter here.


I think it's clearly pretty cool, even if it's not as fast as people expect it to be.


That's completely unfair. They have developed something cool; it just doesn't have all the holes plugged yet.


Dude, we're running unrestricted recursion and closures on GPUs! If that's not cool to you, I apologize, but that's mind-blowingly cool to me, and I wanted to share it, even though the codegen is still early. Hell, I was actually going to publish it with the interpreters only, but I still coded an initial compiler because I thought people would like to see where it could go :(


The closure part was when I had to stop for a moment and go "wait, really?! ... COOL!" and I'm definitely going to try and remember to check back every so often (emphasis on 'try' given I have a brain like a sieve but still ;).


It is a pretty cool milestone, just not production ready.


This is very cool and it's being treated unfairly, though it's also obviously not ready for prime time; it's an existence proof.

To illustrate that: many people on here have been losing their minds over Kolmogorov-Arnold Networks, which are almost identically positioned: interesting idea, kind of cool, does what the paper claims, potentially useful in the future, definitely not any use at all for any real use-case right now.

(In part that's probably because the average understanding of ML here is _not_ strong, so there's more deference and credulousness around those claims.)


You might want to double-check with objdump whether the loop is actually vectorized, or whether the compiler just optimizes it out. Your loop actually performs signed integer overflow, which is UB in C++; the compiler could legally output anything. If you want to avoid the UB, declare sum as unsigned (unsigned integer overflow is well-defined); the optimization will still happen, but at least you'll be guaranteed that it'll be correct.
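
Something along these lines keeps the benchmark well-defined - just a sketch of the suggested tweak, not the original poster's code:

    #include <iostream>

    int main() {
      // unsigned overflow wraps modulo 2^32, which is well-defined,
      // so the compiler can no longer reason from overflow-is-UB
      unsigned sum = 0;
      for (unsigned i = 0; i < 1024u * 1024u * 1024u; i++) {
        sum += i;
      }
      std::cout << sum << "\n";
      return 0;
    }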


I did make sure to check before posting.

Good point about the signed integer overflow, though!


If compiled with -O3 on clang, the loop is entirely optimized out: https://godbolt.org/z/M1rMY6qM9. Probably not the fairest comparison.
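
If the goal is to time the loop itself, one workaround - a sketch of a common benchmarking trick, GCC/Clang-specific, not the GP's code - is an empty inline-asm barrier in the body. It stops the compiler from folding the loop into a closed form, at the cost of also blocking vectorization, so it times the scalar loop:

    #include <iostream>

    int main() {
      unsigned sum = 0;
      for (unsigned i = 0; i < 1024u * 1024u * 1024u; i++) {
        sum += i;
        // Pretend to read and write sum in a way the optimizer can't see
        // through, so the loop can't be eliminated or folded away.
        asm volatile("" : "+r"(sum));
      }
      std::cout << sum << "\n";
      return 0;
    }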


Exactly, this kind of thing always happens with these loops, which is why I think programs that allocate are fairer. But then people point out that the C allocator is terrible, so we can't make that point :')
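
The kind of thing I mean is, e.g., a sum over a freshly allocated tree rather than a foldable loop - a hypothetical C++ counterpart, not one of our official benchmarks:

    #include <iostream>
    #include <memory>

    // A sum that actually allocates: build a binary tree, then add up its
    // node values. This stresses the allocator instead of being reducible
    // to a closed-form expression.
    struct Node {
      long long value;
      std::unique_ptr<Node> left, right;
    };

    std::unique_ptr<Node> build(int depth, long long value) {
      auto n = std::make_unique<Node>();
      n->value = value;
      if (depth > 0) {
        n->left  = build(depth - 1, value * 2);
        n->right = build(depth - 1, value * 2 + 1);
      }
      return n;
    }

    long long sum(const Node* n) {
      if (!n) return 0;
      return n->value + sum(n->left.get()) + sum(n->right.get());
    }

    int main() {
      auto root = build(22, 1);  // a few million nodes, lots of allocator traffic
      std::cout << sum(root.get()) << "\n";
      return 0;
    }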


Might be worth seeing if e.g. jemalloc is enough less terrible for the sort of examples you're looking at to help with that.

(though note that I mention jemalloc because I've had "huh, this C code now magically runs faster" experiences with it, make no claim it's the right one to look at, and am very sure that I don't know what I'm talking about sufficiently wrt allocators to be able to recognise the right one if it bit me on the leg)


I used GCC and checked that it wasn't optimized out (which actually surprised me!)


I think the point is that Bend is a much higher-level language than C++. But to be fair: I also may be missing the point!


The point is that Bend parallelizes everything that can be parallelized without developers having to do that themselves.
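
For contrast, a rough sketch of what doing it by hand usually looks like in C++ (manual splitting and thread management; this is my example, not Bend code):

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
      // Manual data parallelism: split the range, give each chunk to a
      // thread, then combine the partial sums. Bend's pitch is that this
      // plumbing happens implicitly.
      const long long n = 1LL << 30;
      const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
      std::vector<long long> partial(nthreads, 0);
      std::vector<std::thread> workers;

      for (unsigned t = 0; t < nthreads; t++) {
        workers.emplace_back([&partial, t, n, nthreads] {
          long long lo = n / nthreads * t;
          long long hi = (t + 1 == nthreads) ? n : n / nthreads * (t + 1);
          long long local = 0;
          for (long long i = lo; i < hi; i++) local += i;
          partial[t] = local;
        });
      }
      for (auto& w : workers) w.join();

      std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
      return 0;
    }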


Here is the same loop finishing in one second on my laptop, single-threaded, in a very high-level language, q:

  q)\t sum til floor 2 xexp 30
  1031


A couple details worth noting:

- `repr` often outputs valid source code that evaluates to the object, including in the post's example: running `datetime.datetime(2023, 7, 20, 15, 30, 0, 123456)` would give you a `datetime.datetime` object equivalent to `today`.

- Using `_` for throwaway variables is merely a convention and not built into the language in any way (unlike in Haskell, say).


Do you have a source for that? I thought it was closer to 5pJ/byte.


I enjoyed the easter eggs!


As someone who TA'd 6.824 (Distributed Systems) twice, I 100% agree. It's easier to halfway understand Raft than to halfway understand Paxos, but implementing a consensus algorithm correctly requires understanding it fully. Once you start talking about e.g. the "election restriction", Raft becomes a bit hairy.

(Hi Jeremy!)


Hi back atcha!

I took 6.824 under Paxos (I think we took it at the same time, maybe? not sure if you were a TA at that point) and then helped a friend when it switched to Raft, and good lord, that seemed to be worse for them.


(Yeah, I was not a TA then -- I took it under Paxos in Spring 2014 but only TA'd it after the switch to Raft)


This sounds like you're describing energy-based models, not diffusion models.


Ah, thanks. I had mistakenly assumed the diffusion process here was comparable to / a drop-in replacement for the Langevin dynamics used with energy models.


I went and read the paper in more detail. Yeah, my original comment was way off-base, except if you draw a parallel with normalizing flows as iterated refinement (similar to iterated de-noising), and see that the DDPM is a more unconstrained form.

But at a surface level, there isn't a clear connection between DDPM and energy-based models.


I think you misread slightly.

> All of these algorithms take constant time, i.e., time independent of the input coefficients for any particular (n, c).

This means that once you have chosen a particular n and c, the time no longer varies. However, if n and c vary, the running time is definitely allowed to vary also (as the formulas n(c + n) and (c + n)(log cn)^2+o(1) clearly do).


No, I didn't misread. You and I are in (apparently violent) agreement. Constant time does not mean that running time cannot vary, in either complexity theory or cryptography. There are misconceptions on both sides here, with regard to what the terminology means in both complexity theory and cryptography.

For precision, I'll start with a good definition[1] for what "constant time" means in cryptography:

> Constant-time implementations are pieces of code that do not leak secret information through timing analysis. This is one of the two main ways to defeat timing attacks: since such attacks exploit differences in execution time that depend on secret elements, make it so that execution time does not depend on secret elements. Or, more precisely, that variations in execution time are not correlated with secret elements: execution time may still vary, but not in a way that can be traced back to any kind of value that you wish to keep secret, in particular (but not only) cryptographic keys.

Secure, constant-time cryptographic algorithms need not have unvarying execution time. Now going back to complexity theory, it is also an extraordinarily common misconception that "constant time" means "the algorithm has the same execution time regardless of the size of the input." This is not the case. Big O notation doesn't even care about what always happens; it cares about what happens in the worst case. When we use "constant time" in the O(1) sense of the word, we are not precluding the possibility of an algorithm having variable execution time. Again, for precision, we are simply saying that the execution time (number of operations, etc.) has an asymptotic upper bound which is independent of the input. The execution time may vary with the input, and generally speaking it will.

_________________________

1. Thomas Pornin, Why constant time crypto? https://bearssl.org/constanttime.html
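
To make the distinction concrete, here is a standard sort of example (mine, not from Pornin's article): comparing two secret byte strings without an early exit. The running time varies with the public length n, but not with the secret contents:

    #include <cstddef>
    #include <cstdint>

    // "Constant time" in the cryptographic sense: no branch or memory
    // access depends on the secret data. A memcmp-style early exit would
    // leak how many leading bytes match.
    bool equal_ct(const uint8_t* a, const uint8_t* b, size_t n) {
      uint8_t diff = 0;
      for (size_t i = 0; i < n; i++) {
        diff |= a[i] ^ b[i];  // accumulate differences, never branch on them
      }
      return diff == 0;
    }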


> Constant time does not mean that running time cannot vary, in either complexity theory or cryptography.

Again, overloading of the term "constant time" causes pointless misunderstanding and arguments. Your statement here is wrong.

In complexity theory the term "constant time" does indeed mean the running time is bounded by a constant even as the input grows without bound, although it could vary within this bound.

In cryptography the term "constant time" is sometimes used to mean a different concept, that the operation actually takes constant non-varying time, so that an attacker can't exploit this as a side channel to figure out the input values.

The paper seems to be using the latter meaning.


> In cryptography the term "constant time" is sometimes used to mean a different concept, that the operation actually takes constant non-varying time, so that an attacker can't exploit this as a side channel to figure out the input values.

Note that I cited Thomas Pornin for my definition of constant time cryptography, who is a cryptographer in theory and implementation. It is emphatically not necessary for software to run with unvarying execution time in order for it to be "constant time" according to the cryptographic sense of the term. This will be a poor hill for you to die on, but I invite you to provide literature supporting your alternative definition.


Actually, the illegal memory access would be undefined behavior, so it's fine for the compiler to assume that it's living in a world where the segfault never happens. Thus, it can optimize away the extra reads. If this weren't allowed, it would be very hard for compilers to eliminate any unnecessary reads.

This sort of optimization reasoning can result in quite surprising behavior: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


I disagree.

If I write a routine which walks an array one element at a time, in order, up to some max index N, and also stops early when it reaches some other condition (like an element equal to zero), then I am allowed to pass in memory which is only 3 elements long, and an N greater than 3, if I know that there is a zero in the first 3 elements. The function must not be optimized to read past the zero element, regardless of the N passed in, so removing the early exit would be an invalid optimization.

There's no undefined behavior in the above that justifies that optimization.
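
Concretely, the kind of routine I have in mind (a sketch with made-up names):

    #include <cstddef>

    // Walk at most n elements, but stop at the first zero. Callers may
    // legally pass an n larger than the buffer as long as a zero shows up
    // before the end of valid memory, so the compiler must not widen or
    // hoist reads past the early exit.
    long long sum_until_zero(const int* a, size_t n) {
      long long total = 0;
      for (size_t i = 0; i < n; i++) {
        if (a[i] == 0) break;  // the early exit is what guards the out-of-bounds region
        total += a[i];
      }
      return total;
    }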


No, not in this case. The original function is well defined and has no undefined behaviour (in the case I describe) as it would return before it reached "bad memory". The optimised version is what reaches further through memory (while vectorising).


Oh, good point, I didn't read what you wrote carefully enough.


The compiler is allowed to eliminate unnecessary reads; but vectorization requires introducing additional reads. That's not allowed in general. In this case the compiler could vectorize the loop, though it would have to take care that the additional reads are within the same 4K page as the original reads, to prevent introducing segfaults. But that's usually the case for vectorized reads, as long as the compiler takes care of alignment.


The center core is new, but the side boosters are both recycled. See https://en.m.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_...

