
Someone managed to GPU-accelerate program synthesis, a form of symbolic ML. First time for ML that is not deep learning:

https://dl.acm.org/doi/10.1145/3591274

Deep learning took off precisely when the ImageNet paper dropped around 2010. Before that, nobody believed that backprop could be GPU-accelerated.




> ... 2010. Before that, nobody believed that backprop could be GPU-accelerated.

When I was doing my master's in 2004-06, I talked to a guy whose MSc thesis was about running NNs with GPUs. My thought was: you're going to spend a TON of time fiddling with hacky systems code like CUDA, to get basically a minor 2x or 4x improvement in training time, for a type of ML algorithm that wasn't even that useful: in that era the SVM was generally considered to be superior to NNs.

So it wasn't that people thought it couldn't be done, it's that nobody saw why this would be worthwhile. Nobody was going around saying, "IF ONLY we could spend 20x more compute training our NNs, then they would be amazingly powerful".


Exactly.

It's easy to see in retrospect, but hard in prospect: the original paper [1] on GPU acceleration of NNs reports a measly 20x speedup. Assuming a bit of cherry-picking on the authors' side to get the paper published, the 'real-world speedup' will have been assumed by the readership to be even less. But this triggered a virtuous cycle of continuous improvements at all levels that has been dubbed "winning the hardware lottery" [2].

[1] K.-S. Oh, K. Jung, GPU implementation of neural networks. Pattern Recognition, 2004.

[2] S. Hooker, The Hardware Lottery. https://arxiv.org/abs/2009.06489


I went to a talk on "general purpose GPU programming" at the Colorado School of Mines around 2001 that covered exactly that topic. It was very disappointing to have my interest in FPGAs for this purpose be so entirely destroyed by a quirk of graphics card design.

Hinton also addressed the contribution of hardware performance advances to practical deep neural net applications in his talks in the mid-2000s.


As recently as 2012, I was working for a major US investment and financial services bank (one of the very big ones). An algo dev sent his resignation email to a mailing list including all technologists at the firm, stating he was resigning because they weren't taking FPGAs seriously. Yes, it was very unprofessional, but his statement about "let's see who is proved right" seems to have been accurate.


That's cool and all, but the one thing that really made it possible to train deep neural nets was the use of backpropagation, and its polynomial time complexity.

By contrast, there are no known polynomial time algorithms for program synthesis and the standard approach is to search some large combinatorial space [1]. That's the case for all the classical approaches: SMT, SAT, planning and scheduling, etc. At the same time, there are very powerful heuristics for all those other classical problems that can solve many problem instances efficiently.

____________

[1] The one exception to this is Inductive Logic Programming, i.e. the inductive synthesis of logic programs, for which we do know a polynomial time algorithm (but that is my work so I'm not pimping it here).


> backpropagation, and its polynomial time complexity

How do you reconcile the NP-completeness result in [1] about training neural networks with your claim?

[1] A. L. Blum, R. L. Rivest, Training a 3-Node Neural Network is NP-Complete. https://proceedings.neurips.cc/paper/1988/file/3def184ad8f47...


Thanks, wow, that's a great question to check my assumptions. It turns out I don't know where my claim that backpropagation's time complexity is polynomial comes from! I am repeating what I know from the time of my Master's in 2014, when I studied neural nets. It was most likely something discussed with my tutors during the course (I had some excellent tutors).

In my defense, it certainly doesn't seem to be just my own, personal belief. For example, here are lecture notes from a course on backpropagation, which conclude with a sketch proof of a linear time complexity of O(|V| + |E|) for the computation of a partial derivative over a network with V units and E connections, if I got that right:

https://www.cs.princeton.edu/~rlivni/cos511/lectures/lect16....

There are also other similar calculations I could find floating free on the internet. Those generally seem to look at the complexity of calculating partial derivatives by automatic differentiation, basically.
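For intuition, here is a minimal toy sketch (mine, not from those notes) of reverse-mode accumulation over a small computation graph. The forward and backward passes each touch every node and every edge once, which is where a bound like O(|V| + |E|) per partial derivative comes from:

    # Minimal reverse-mode autodiff sketch: each node is an (op, inputs) pair
    # in a DAG given in topological order. One forward pass and one backward
    # pass, each touching every node and edge once, hence O(|V| + |E|).
    import math

    # Toy graph for f(x, y) = (x * y) + sin(x)
    graph = {
        'x':  ('input', []),
        'y':  ('input', []),
        'v1': ('mul',   ['x', 'y']),
        'v2': ('sin',   ['x']),
        'f':  ('add',   ['v1', 'v2']),
    }

    def forward(inputs):
        vals = dict(inputs)
        for node, (op, ins) in graph.items():
            if op == 'mul':
                vals[node] = vals[ins[0]] * vals[ins[1]]
            elif op == 'sin':
                vals[node] = math.sin(vals[ins[0]])
            elif op == 'add':
                vals[node] = vals[ins[0]] + vals[ins[1]]
        return vals

    def backward(vals, output='f'):
        grads = {n: 0.0 for n in graph}
        grads[output] = 1.0
        for node in reversed(list(graph)):        # reverse topological order
            op, ins = graph[node]
            g = grads[node]
            if op == 'mul':
                grads[ins[0]] += g * vals[ins[1]]
                grads[ins[1]] += g * vals[ins[0]]
            elif op == 'sin':
                grads[ins[0]] += g * math.cos(vals[ins[0]])
            elif op == 'add':
                grads[ins[0]] += g
                grads[ins[1]] += g
        return grads

    vals = forward({'x': 2.0, 'y': 3.0})
    print(backward(vals))   # df/dx = y + cos(x), df/dy = x

Of course this only says that one gradient evaluation is cheap; it says nothing about how many gradient steps training needs, which is where the hardness results come in.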

Then there are all the scholarly articles that claim that training neural nets by backpropagation is not efficient (which is not the same as claiming that automatic differentiation is not efficient) and that the corresponding decision problem is somewhere in the class NP, or worse. I found a whole bunch of such papers while investigating my assumption of polynomiality just now, thanks to your question, and I even found some of them on my hard drive, which I didn't remember!

Well, it seems there is an ongoing debate, continuing to this day, and one that goes rather further back than the Blum and Rivest paper. Most results I could find seem to be on the side of NP-completeness (or worse).

However, all those results must be examined side-by-side with the irrefutable empirical evidence that training deep neural nets with thousands of layers and millions of units, on large datasets no less, is common in practice.

I think the answer lies in the observation that, to compute any complexity result, one must make certain axiomatic assumptions about the structure of the machine (in the abstract sense) being investigated. These assumptions may well differ from the assumptions made in common practice for the same general kind of machine. So for example, it seems that the Blum & Rivest paper was criticised, even dismissed as irrelevant, at the time because it assumed a discrete activation function, when continuous activation functions were already (1992) the norm.

Especially with neural nets, it seems that the wild variety of architectures makes it very hard to derive results with general applicability. The following letter to the editor of the journal Neural Networks from 1997 makes this point, and also notes that the popular assumption that a result applying to a simpler architecture can be generalised to more complex architectures is contradicted by the observation that adding more layers or units can _sometimes_, counterintuitively, improve the efficiency of training (though that is not, itself, an assumption that holds in general):

https://dl.acm.org/doi/10.1016/S0893-6080%2897%2900041-5

Unfortunately this is behind a paywall, but email me at the address associated with this github account: https://github.com/stassa and I can send you a copy (totally clandestinely! We'll break the law together :)

The bottom line is: I have no idea whether my claim above about the polynomial time complexity of backpropagation is right or wrong. _But_ it is certainly possible, _in practice_, to train deep neural nets on large datasets, _and_ this was possible even before the use of GPUs was common.

Which is not to say that hardware was not a very important factor in the current domination of deep neural nets in AI research. Hardware, and data.


I'm working on an inductive logic programming algorithm, and from my understanding the search space for logic programs is just as vast as other types of program synthesis. Do you have any information you can share about the polynomial time algorithm?


Of course. It's published as an open-access paper here:

https://link.springer.com/article/10.1007/s10994-020-05945-w

But I recommend the arXiv version, where Springer couldn't mess up my LaTeX formatting:

https://arxiv.org/abs/2101.05050

(Not that my LaTeX formatting is anything to write home about!)

To clarify, the algorithm described in our paper doesn't reduce the size of the search space for logic programs; it sidesteps it.

Note I'm the corresponding author in the paper and my email is in the Springer version (top of the page). I'm always happy to answer questions about my work.


> First time for ML that is not deep learning

What do you mean by this? Virtually all "classic" or "shallow" ML can be GPU-accelerated, from linear regression to SVM to GBM.


Can you point me to papers with reproducible benchmarking that achieves big speedups on those?

Modern GPUs are GPGPUs, where GP means "general purpose": you can run any code on them. But if you want to gain real speed-ups you will have to program in an awkward style ("data parallel"). I am not aware of GPU acceleration of the workhorses of symbolic AI, such as Prolog or SMT solving. There has been a lot of work on running SAT solvers on GPUs, but I don't think it has really succeeded so far.


I think we're conflating two things: shallow/classic ML is not symbolic AI. I'm not sure "ML" even encompasses anything "symbolic"; I see symbolic AI and ML as subfields with little overlap.

I'm not saying symbolic AI has been GPU accelerated in the past, but that non-deep ML has been.


Back when I took AI courses in the early 90s, ML was anything that was trained by data. It did not refer exclusively to dense numerical or statistical methods. It included decision trees, which I think of as being closer to the symbolic camp...


There is no agreement on the exact meaning of ML and AI. They are often used interchangeably. And for good reason, because it's all about getting computers to learn. We should not squabble about semantics.

Can you point towards papers that report substantial GPU acceleration on what you call "non-deep ML"?


You're certainly right that AI/ML is often used interchangeably, though I think the distinction is pretty clear among practitioners: AI is the big circle, ML is a subset of AI, deep learning is a subset of ML. Symbolic AI is also a subset of AI, though I'm not familiar enough with it to say how much it intersects the others.

So ML is AI, but just a small part of it.

As for papers, here's one showing GPU speedups for the 3 big GBM packages: https://arxiv.org/pdf/1809.04559.pdf. Their setup shows a 3-7x speedup for XGBoost.
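For what it's worth, turning on GPU training in XGBoost is roughly a one-parameter change. A sketch (parameter spellings have shifted across versions; "gpu_hist" is the older form, newer releases prefer tree_method="hist" with device="cuda"):

    # Sketch: CPU vs GPU histogram-based tree building in XGBoost.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.random((100_000, 50))
    y = (X[:, 0] + rng.normal(scale=0.1, size=100_000) > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    cpu_params = {"tree_method": "hist", "objective": "binary:logistic"}
    gpu_params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}

    # Same model and data; only the tree-construction backend changes.
    booster = xgb.train(gpu_params, dtrain, num_boost_round=200)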


Let's not forget Gaussian processes: https://gpytorch.ai/ and https://www.kernel-operations.io/keops/index.html. Some of the work I've been doing in this field only became feasible in a reasonable amount of time due to the use of GPUs.


The parent comment is talking about SVM, which is not a form of symbolic AI. SVM and other kernel-based algorithms contain naturally data-parallel steps, like summing f(z - x[i]) for some kernel function f over all sample data x[i]. These will work great on a GPU once you have a few thousand data points.
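To make that concrete, here's a rough sketch of that kind of step (plain NumPy for illustration; the same expression maps directly onto a GPU array library, and the reduction over samples is the embarrassingly parallel part):

    # Sketch: evaluating a kernel expansion sum_i alpha[i] * f(z - x[i])
    # with an RBF kernel f. The sum over i is data parallel, which is what
    # makes SVM-style prediction GPU friendly.
    import numpy as np

    def rbf_kernel_sum(z, x, alpha, gamma=1.0):
        # x: (n, d) sample points, alpha: (n,) weights, z: (d,) query point
        diffs = z - x                                 # (n, d)
        sq_dists = np.einsum('nd,nd->n', diffs, diffs)
        return np.dot(alpha, np.exp(-gamma * sq_dists))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(10_000, 16))
    alpha = rng.normal(size=10_000)
    print(rbf_kernel_sum(rng.normal(size=16), x, alpha))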


Computing a single matrix right-divide is probably faster done in place on the CPU, using vector math instructions.


This isn't too related but

> Deep learning took off precisely when the ImageNet paper dropped around 2010. Before that, nobody believed that backprop could be GPU-accelerated.

Deep learning kicked off with RBMs because you didn't have to do backprop; there was a training algorithm called "contrastive divergence". Each layer could be trained in turn, which meant you could stack them way deeper. In ~2008-2009 I implemented Hinton's paper on GPUs, which meant I could do the same scale of thing that was taking weeks in Matlab in hours on a GPU, and there were lots of GPUs available on the cluster at the uni. Lots of fun CUDA hacking (I'm just glad cuBLAS was around by then). The originally published learning rates etc. were wrong, if I remember right; they didn't match the code.
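For anyone who hasn't seen contrastive divergence, here's a bare-bones toy sketch of one CD-1 update for a single binary RBM layer (my own NumPy cartoon, biases omitted; not Hinton's code). Every step is a matrix product, which is why cuBLAS gave such a big win:

    # Sketch: one contrastive divergence (CD-1) step for a binary RBM.
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr = 784, 256, 0.01
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_step(v0, W):
        # Positive phase: hidden probabilities and samples given the data.
        h0_prob = sigmoid(v0 @ W)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one Gibbs step back down to visible and up again.
        v1_prob = sigmoid(h0 @ W.T)
        h1_prob = sigmoid(v1_prob @ W)
        # Update: difference of correlations, averaged over the batch.
        return W + lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)

    batch = (rng.random((64, n_visible)) < 0.5).astype(float)
    W = cd1_step(batch, W)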


How is program synthesis the same as regular expression search? Honest question


Good question. All supervised learning is a form of search with three components:

- Specification: what are you looking for?

- Search space: where are you looking?

- Search mechanism: how are you going through the search space?

Program synthesis is simply learning where the search space is syntax. In deep learning, taking the ImageNet paper as an example, the specification is a bunch of photos with annotations, the search space is multivariate real functions (encoded as matrices of floats), and the search mechanism is gradient descent (implemented as backprop) with a loss function.
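To make "the search space is syntax" concrete, here's a toy enumerative synthesiser (a made-up example, nothing to do with the paper's algorithm): it enumerates regular expressions from a tiny grammar, smallest first, and returns the first one consistent with the given examples:

    # Toy program synthesis over syntax: enumerate regexes from a small
    # grammar in order of size and return the first one that accepts all
    # positive examples and rejects all negative ones. The number of
    # candidates explodes with size, which is the whole problem.
    import re

    ATOMS = ['a', 'b', '0', '1']

    def programs(size):
        """All regexes built from exactly `size` grammar nodes."""
        if size == 1:
            yield from ATOMS
            return
        for left in range(1, size):
            for p in programs(left):
                for q in programs(size - left):
                    yield p + q            # concatenation
                    yield f'(?:{p}|{q})'   # alternation
        for p in programs(size - 1):
            yield f'(?:{p})*'              # Kleene star

    def synthesise(pos, neg, max_size=5):
        for size in range(1, max_size + 1):
            for prog in programs(size):
                r = re.compile(f'^{prog}$')
                if all(r.match(s) for s in pos) and not any(r.match(s) for s in neg):
                    return prog
        return None

    # Finds (?:ab)* for these examples.
    print(synthesise(pos=['ab', 'abab', ''], neg=['a', 'b', 'ba']))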

I think this paper uses regular expressions as an example of how to search fast over syntax. It claims not to be tied to regular expressions.


https://deepai.org/publication/search-based-regular-expressi...

Here is the full text.

Regexps aren't even Turing-complete as far as I know; if whatever they have in their paper works for arbitrary programs, it would be shocking. I'll give it a read.

*Edit*: The algorithm in the paper is a DP-like algorithm for building regexes. They use a matrix with all the potential strings to be checked on one axis and all the potential regex programs on the other axis; the entries are booleans saying whether the string matches the program. The algorithm builds the matrix iteratively.

I haven't understood how regex evaluation is done (probably directly), but obviously this algorithm only checks whether a particular regex program matches an output, rather than doing general-purpose synthesis.
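Roughly, I picture it like the sketch below (my own toy reconstruction, not the paper's actual data structure, which apparently uses compact bitvectors). The point is that the row for a composite regex can be computed from the rows of its parts, e.g. alternation is just an element-wise OR, so the table grows without re-running any matching:

    # Toy match matrix: rows are candidate regexes, columns are the example
    # strings, entries say whether the regex matches the whole string.
    import re
    import numpy as np

    strings = ['', 'a', 'b', 'ab', 'abab', 'ba']
    atoms = ['a', 'b']

    def match_row(prog):
        r = re.compile(f'^{prog}$')
        return np.array([bool(r.match(s)) for s in strings])

    rows = {p: match_row(p) for p in atoms}

    # Grow the table: the row of an alternation is the element-wise OR of
    # the rows of its parts. (Concatenation would need per-substring match
    # information, which is omitted in this toy version.)
    for p in atoms:
        for q in atoms:
            rows[f'(?:{p}|{q})'] = rows[p] | rows[q]

    for prog, row in rows.items():
        print(f'{prog:10s}', row.astype(int))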

We'll have to wait for AI chips to really scale genetic programming; GPUs won't cut it.


> genetic programming

There is no magic in GP. It is just another form of searching the space of programs, i.e. program synthesis. The search mechanism is a local, stochastic search, known to be especially inefficient (for example you may hit the same program multiple times). What's good about GP is how simple it is, so it's a good starting point.
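As a cartoon of that point, here's a toy mutate-and-keep-if-not-worse loop over the same kind of regex space (not real GP: no population, no crossover). The 'seen' counter shows how often the same candidate gets generated again, which is one of the inefficiencies mentioned above:

    # Toy stochastic local search over regexes, scored against examples.
    import random
    import re
    from collections import Counter

    pos, neg = ['ab', 'abab', ''], ['a', 'b', 'ba']
    atoms = ['a', 'b']

    def score(prog):
        r = re.compile(f'^{prog}$')
        return sum(bool(r.match(s)) for s in pos) + sum(not r.match(s) for s in neg)

    def mutate(prog):
        kind = random.choice(['concat', 'alt', 'star', 'restart'])
        other = random.choice(atoms)
        if kind == 'concat':
            return prog + other
        if kind == 'alt':
            return f'(?:{prog}|{other})'
        if kind == 'star':
            return f'(?:{prog})*'
        return other                      # restart from a random atom

    random.seed(0)
    current, seen = 'a', Counter()
    for _ in range(2000):
        candidate = mutate(current)
        seen[candidate] += 1              # counts repeated visits
        if score(candidate) >= score(current):
            current = candidate
    print(current, score(current), seen.most_common(3))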


This seems like an early paper, and I agree with the consternation.

The paper, in a modern context and based solely on the abstract and having been in the community, is chipping at the "uninteresting" part of the problem. Around that time, program synthesis started switching to SMT (satisfiability modulo theories) methods, meaning basically a super powerful & general SAT solver for the broad search ("write a wild python program") and then, for specialized subdomains, a good way to call out to optimized domain solvers ("write a tiny bit of floating point math here"). The paper would solve what the regex callout looks like... which is specialized. We can argue regex is one of the most minimal viable steps towards moving to general programming on GPUs. Except, as a person who does SIMD & GPU computing, optimizing compute over finite automata is neither general nor representative, and I don't expect to change my thinking much about the broader computational classes. To be fair to the authors... back then, synthesizing regex & SQL was hard in practice even for boring cases.
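As a toy illustration of the SMT route (a sketch with Z3's Python bindings; real synthesizers, e.g. CEGIS-style ones, also search over the shape of the program, not just its constants):

    # Sketch: "synthesize" the constant c in the template f(x) = x + c so
    # that it matches the spec f(x) = x + 5 for every x. The exists/forall
    # structure is the core of SMT-based synthesis.
    from z3 import Int, ForAll, Solver, sat

    c, x = Int('c'), Int('x')
    s = Solver()
    s.add(ForAll([x], x + c == x + 5))
    if s.check() == sat:
        print(s.model()[c])   # prints 5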

Separately, nowadays synthesis has shifted to neural (Copilot, GPT) and, more interestingly to me, neurosymbolic in R&D land. We're doing a lot of (simple) neurosymbolic work in louie.ai, and I'm excited if/when we can get the SMT solver side in. Making GPT call Z3 & Coq was one of the first things I tried with it :) Till then, there's a lot more interesting low-hanging fruit on the AI/ML side vs solvers, but it feels like just a matter of time.


The paper claims that there is no neural / deep learning based solver that performs well on regular expression inference.

Calls to Coq and Z3 will be very slow and not competitive with GPU compute.


Ironically, one of its few comparisons was FlashFill / Excel, which has been going the way of NNs, and now LLMs, for years. The paper largely skips measured and non-superficial comparisons, so I wouldn't draw comparative conclusions based on their writeup.

RE Coq/Z3 vs GPUs, I think of them as different kinds of things, and thus wouldn't compare them directly. A solver like Z3 encapsulates search tricks for tasks where brute-force approaches would effectively run forever. Those tricks are necessary for a reason: effective GPU acceleration requires finding a clean data-parallel formulation, and if a more naive algorithm is picked for GPU friendliness at the expense of inflating the workload asymptotically, the GPU implementation would have high utilization and high speedups vs a CPU version... but would still be (much) worse than Z3. Instead of comparing them as inherently different approaches, the question is whether the practically relevant core of Z3 can be GPU-accelerated while preserving its overall optimization strategies.

Which takes us back to the broader point of this thread, GPUs for synthesis... LLMs are successfully GPU-accelerating program synthesis of more than just regex. It's amazing that half of GitHub code now uses them! Closer to this paper, folks are making inroads on GPU-accelerating the SAT/SMT solvers that the more powerful (classical) program synthesizers use, e.g., https://github.com/muhos/ParaFROST .


Could you point towards a paper that does precise and minimal regular expression (or finite state automaton) inference?


That's a good but narrow request, which is great for POPL and less so for PLDI. As a reference, I believe StarCoder does worse than GPT-4 and has taken the time to publish results on program synthesis that subsumes regex. The FlashFill team has academic PL roots and would also be good to compare to, as I mentioned.

With prompt engineering being so easy, and naive search/refinement on top of it so easy too, it seems strange to cover these general solvers only in a brush-off related work section instead of comparing against them via benchmarks.



