
How is program synthesis the same as regular expression search? Honest question



Good question. All supervised learning is a form of search with three components:

- Specification: what are you looking for?

- Search space: where are you looking?

- Search mechanism: how are you going through the search space?

Program synthesis is simply learning where the search space is syntax. In deep learning, taking the ImageNet paper as an example, the specification is a bunch of photos with annotations, the search space is multivariate real functions (encoded as matrices of floats), and the search mechanism is gradient descent (implemented as backprop) with a loss function.
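To make that concrete, here's a toy sketch (illustrative Python, not from the paper) of synthesis as search over syntax: the specification is a set of example strings to accept/reject, the search space is regexes built from a tiny grammar, and the search mechanism is plain breadth-first enumeration:

    import itertools
    import re

    # Specification: strings the regex must accept / reject.
    positives = ["ab", "aab", "aaab"]
    negatives = ["b", "ba", ""]

    # Search space: regexes generated from a tiny grammar (syntax).
    atoms = ["a", "b", "a+", "b+", "a*", "b*"]

    def candidates():
        # Search mechanism: breadth-first enumeration by length.
        for n in itertools.count(1):
            for parts in itertools.product(atoms, repeat=n):
                yield "".join(parts)

    def satisfies(pattern):
        return (all(re.fullmatch(pattern, s) for s in positives)
                and not any(re.fullmatch(pattern, s) for s in negatives))

    for pattern in candidates():
        if satisfies(pattern):
            print("found:", pattern)  # finds "a+b" on this spec
            break

Swap out the grammar and the enumerator and you get other synthesizers; the three components stay the same.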

I think this paper uses regular expressions as an example of how to search quickly over syntax. It claims not to be tied to regular expressions.


https://deepai.org/publication/search-based-regular-expressi...

Here is the full text.

Regexps aren't even Turing complete as far as I know; if whatever they have in their paper works for arbitrary programs, that would be shocking. I'll give it a read.

*Edit*: The algorithm in the paper is a DP-like (dynamic programming) algorithm for building regexes. They use a matrix with all the candidate strings to be checked on one axis and all the candidate regex programs on the other; the matrix values are booleans saying whether the string matches the program. The algorithm builds the matrix iteratively.

I haven't understood how regex evaluation is done, probably directly, but obviously this algorithm is only for checking whether a particular regex program matches an output, rather than for general-purpose synthesis.
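For concreteness, the core object as I understand it, sketched in Python (a toy reconstruction, not the paper's code):

    import re

    strings = ["ab", "aab", "b", "ba"]      # one axis: strings to check
    programs = ["a", "a+", "b+", "a+b"]     # other axis: candidate regexes

    # Boolean matrix: matrix[i][j] is True iff programs[i] matches strings[j].
    matrix = [[re.fullmatch(p, s) is not None for s in strings]
              for p in programs]

    # The DP flavour: a composite regex can reuse already-computed rows,
    # e.g. the row for the union r1|r2 is the elementwise OR of the rows
    # for r1 and r2 (concatenation needs a split over substrings instead).
    union_row = [x or y for x, y in zip(matrix[1], matrix[2])]  # "a+|b+"

(Here I'm matching with re.fullmatch directly; the paper presumably evaluates candidates far more cleverly than that.)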

We'll have to wait for AI chips to really scale genetic programming; GPUs won't cut it.


> genetic programming

There is no magic in GP. It is just another form of searching the space of programs, i.e. program synthesis. The search mechanism is a local, stochastic search, known to be especially inefficient (for example, you may hit the same program multiple times). What's good about GP is how simple it is, so it's a good starting point.
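For flavour, here's a (1+1)-style mutation hill climb over regex strings, i.e. the local stochastic search described above (illustrative Python with made-up examples, not anyone's actual system):

    import random
    import re

    positives = ["ab", "aab"]
    negatives = ["b", "ba"]
    ALPHABET = "ab+*|"

    def score(pattern):
        try:
            ok = sum(re.fullmatch(pattern, s) is not None for s in positives)
            bad = sum(re.fullmatch(pattern, s) is not None for s in negatives)
        except re.error:                # mutation produced an invalid regex
            return -1
        return ok - bad

    def mutate(pattern):
        if pattern and random.random() < 0.5:   # replace one character...
            i = random.randrange(len(pattern))
            return pattern[:i] + random.choice(ALPHABET) + pattern[i + 1:]
        i = random.randrange(len(pattern) + 1)  # ...or insert one
        return pattern[:i] + random.choice(ALPHABET) + pattern[i:]

    random.seed(0)
    current = "a"
    for _ in range(10_000):
        child = mutate(current)
        if score(child) >= score(current):      # keep ties: wanders freely
            current = child
    print(current, score(current))

Note how nothing stops it from proposing and re-scoring the same pattern over and over; that's the inefficiency.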


This seems like an early paper, and I agree with the consternation.

The paper, in a modern context and based solely on the abstract and having been in the community, is chipping at the "uninteresting" part of the problem. Around that time, program synthesis started switching to SMT (satisfiability modulo theories) methods: basically a super powerful & general SAT solver handles the broad search ("write a wild Python program") and then, for specialized subdomains, calls out to optimized domain solvers ("write a tiny bit of floating point math here"). This paper would solve what the regex callout looks like... which is specialized. We can argue regex is a minimal viable step toward general programming on GPUs. But as a person who does SIMD & GPU computing, optimizing compute over finite automata is neither general nor representative, and I don't expect it to change my thinking much about the broader computational classes. To be fair to the authors... back then, synthesizing regexes & SQL was hard in practice even for boring cases.

Separately, synthesis has nowadays shifted to neural (Copilot, GPT) and, more interesting to me, neurosymbolic in R&D land. We're doing a lot of (simple) neurosymbolic work in louie.ai, and I'm excited for if/when we can get the SMT solver side in. Making GPT call Z3 & Coq was one of the first things I tried with it :) Till then, there's a lot more interesting low-hanging fruit on the AI/ML side vs solvers, but it feels like just a matter of time.
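For anyone who hasn't tried it, the Z3 side is a short call away from Python via the z3-solver package; the constraint here is a made-up toy, just to show the shape of a solver callout:

    # pip install z3-solver
    from z3 import Int, Solver, sat

    x, y = Int("x"), Int("y")

    s = Solver()
    s.add(x + y == 10, x > 0, y > x)   # toy constraint for illustration

    if s.check() == sat:
        m = s.model()
        print(m[x], m[y])              # one satisfying assignment, e.g. 1 9

The neurosymbolic appeal is the propose-and-verify loop: the LLM proposes, the solver checks, repeat.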


The paper claims that there is no neural / deep learning based solver that performs well on regular expression inference.

Calls to Coq and Z3 will be very slow and not competitive with GPU compute.


Ironically, one of its few comparisons was FlashFill / Excel, which has been going the way of NNs, and now LLMs, for years. The paper largely skips measured and non-superficial comparisons, so I wouldn't draw comparative conclusions based on their writeup.

Re: Coq/Z3 vs GPUs, I think of them as different kinds of things, and thus wouldn't compare them directly. A solver like Z3 encapsulates search tricks for tasks where brute-force approaches would effectively run forever, and those tricks are necessary in practice. Effective GPU acceleration requires finding a clean data-parallel formulation, and if a more naive algorithm is picked for GPU friendliness at the expense of inflating the workload asymptotically, the GPU implementation would show high utilization and high speedups vs a CPU version... but would still be (much) worse than Z3. So rather than comparing them as inherently different approaches, the question is whether the practically relevant core of Z3 can be GPU accelerated while preserving its overall optimization strategies.

Which takes us back to the broader point of this thread, GPUs for synthesis... LLMs are successfully GPU-accelerating program synthesis of much more than just regex; it's amazing that half of GitHub code now uses them! Closer to this paper, folks are making inroads on GPU-accelerating the SMT solvers that the more powerful (classical) program synthesizers use, e.g., https://github.com/muhos/ParaFROST .


Could you point towards a paper that does precise and minimal regular expression (or finite state automaton) inference?


That's a good yet narrow request, which is great for POPL and less so for PLDI. As a reference, I believe StarCoder does worse than GPT-4 and has taken time to publish results on program synthesis that subsumes regex. The FlashFill team has academic PL roots and would also be good to compare to, as I mentioned.

With prompt engineering being so easy, and naive search/refinement over prompts so easy too, not comparing to these general solvers via benchmarks, rather than brushing them off in the related work section, seems strange.



