Contributions to AI in the form of automated mathematicians are a descendant of axiomatic set theory and the desire to automate mathematics in the form of rigorous mechanical proofs. From Robin Milner we have the invention of ML, whose original purpose was as a scripting language to aid in the creation of automated proofs! We also have contributions from Automath, which had a lot of modern cutting-edge techniques, such as dependent types, in the late 1960s. All in all, the original goal of automatic mathematics and reasoning proved impractical, but the tools needed for its pursuit have proven anything but.
Then there is Lisp. Lisp contributed to both functional and object-oriented techniques. Lisp was inspired by the lambda calculus, which was itself an offshoot of the whole mechanize-mathematics movement. It was invented by John McCarthy - whose interest was in the old AI notion of symbolic computing - as a Turing-complete way of representing algorithms. A very old language, its influence on all subsequent history is undeniable. From Lisp we see that "garbage collection was invented by John McCarthy around 1959 to solve problems in Lisp". We also notice that the AI researchers had concepts of object orientation in the early '60s, and that the later mixins and multiple inheritance had roots in Lisp. The development of multiple inheritance was also motivated by the needs of the old symbolic reasoning researchers.
As for Walmart? Walmart is a very big user of optimization theory and logistics. Walmart and its ilk also use data mining and rule learning. And optimization is a lot more important than efficiently routing trucks: OR people actually laid much of the foundation of AI and are seeing a lot of their work rediscovered. OR people also contributed dynamic programming, for example. This is what I mention in another post about improving synergy between all these very related silos. These topics are essentially subsets of a more general theory that will be pivotal to AI.
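Since dynamic programming came up: a minimal sketch of Bellman's idea, using the classic 0/1 knapsack problem (the numbers are made up for illustration):

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack via dynamic programming (Bellman's recurrence).

    best[c] holds the best total value achievable with capacity c
    using the items considered so far.
    """
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # → 220
```

The same "solve overlapping subproblems once" trick shows up all over both OR (shortest paths, inventory control) and AI (Viterbi, value iteration), which is exactly the rediscovery the parent describes.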
I personally think optimization is a very important part of our universe. In a way, the principle of least action is doing some kind of optimization. I feel that people working in machine learning, in particular the Bayesian inference and graphical models part of AI, are actually doing a subset of quantum mechanics. Conversely, these techniques will likely prove pivotal in programming quantum systems.
What's next: Minsky traveled back in time in a LISP Machine and impregnated the Rev. Bayes' mother?
I'm willing to give 'em Prolog (take it, please!) but the confluence of FP and AI might owe more to the fact that AI (especially 'strong AI') was considered a Big Deal early on in computing and as such, ideas that were invented/used contemporaneously tend to be associated with AI.
For practicing programmers I've found books like "AI for Game Programmers" to be surprisingly good, though they clearly lack the breadth and theoretical basis of an actual text.
What would be nice are video lectures of a class that follows this text, à la the SICP lectures from MIT.
It seems they don't want to be bothered with the significant difference between search and optimization. So they stick all the stochastic optimization stuff into a section called "local search and optimization" and place it underneath the Search chapter (it's not search, and almost none of it is local). And then they separate out optimization methods like gradient descent etc., placing them under "local search in continuous spaces", as if (1) they were search and (2) stochastic optimization weren't applied to continuous spaces.
And if this wasn't muddled enough, their recommended books for stochastic optimization aren't under the Search chapter at all -- they've been placed under the Machine Learning chapter. And it's a strange collection.
I'm pretty disappointed with AIMA's seemingly poor understanding of this area. Well, I guess at least it's better than their cursory treatment of multiagent systems.
Also, of the techniques you list, simulated annealing, hill climbing and gradient-based optimization are all local search methods (genetic algorithms may or may not be, depending on your emphasis on mutation vs recombination).
But based on my readings, genetic algorithms are better known for being susceptible to local minima, while simulated annealing is, in theory, a global method.
As for genetic algorithms - for optimization I prefer Differential Evolution. For exploration maybe genetic algorithms are more fruitful, although I think genetic programming and learning classifier systems are slept on. Speaking of slept on - another underappreciated but much more widely used method is stochastic gradient descent. Global optima might be overrated.
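For the curious, here's a toy sketch of the DE/rand/1/bin variant of Differential Evolution, minimizing the sphere function (all of the parameter values are illustrative defaults, not canonical):

```python
import random

def differential_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9, iters=200):
    """Minimal DE/rand/1/bin minimizer. Parameters are illustrative."""
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [f(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = random.sample([x for j, x in enumerate(pop) if j != i], 3)
            j_rand = random.randrange(dim)  # ensure at least one mutated dim
            trial = [a[d] + F * (b[d] - c[d])
                     if (random.random() < CR or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            trial = [min(max(t, lo), hi) for t, (lo, hi) in zip(trial, bounds)]
            s = f(trial)
            if s <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]

# Sphere function: global minimum 0 at the origin.
x, fx = differential_evolution(lambda v: sum(t * t for t in v), [(-5, 5)] * 2)
```

The greedy one-to-one selection is what makes DE "pretty exploitative", as noted downthread.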
Both simulated annealing and genetic algorithms are global methods, and both can get caught in local minima. The same goes for pretty much every other global stochastic optimization method out there.
Simulated annealing got this "special" reputation among engineers back in the '80s because the Metropolis algorithm had a formal proof of global guarantees -- run it long enough and it's guaranteed to find the global optimum. But nowadays most every algorithm in this category has similar guarantees. It's relatively easy to make such a guarantee as it turns out.
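To make the Metropolis mechanics concrete, here's a toy 1-d annealer (the schedule and constants are arbitrary illustrations; the test function is a double-well I made up, not anything canonical):

```python
import math
import random

def simulated_annealing(f, x0, step=1.0, t0=5.0, cooling=0.999, iters=10000):
    """Metropolis-style annealing on a 1-d function. Parameters illustrative."""
    x, fx, t = x0, f(x0), t0
    best_x, best_f = x, fx
    for _ in range(iters):
        cand = x + random.uniform(-step, step)
        fc = f(cand)
        # Metropolis rule: always accept improvements; accept uphill
        # moves with probability exp(-delta / t), which shrinks as t cools.
        if fc <= fx or random.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # the cooling schedule is what the global guarantee hinges on
    return best_x, best_f

# Double-well: local minimum near x = 2, global minimum near x = -2.
x_best, f_best = simulated_annealing(lambda x: (x * x - 4) ** 2 + x, x0=2.0)
```

The global-convergence proofs mentioned above require a much slower (logarithmic) cooling schedule than anyone uses in practice, which is part of why the guarantee is easier to state than to benefit from.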
I like differential evolution too, though it's pretty exploitative.
The terms "search" and "optimization" are often conflated, mostly because "search" was used by two different communities early on to mean two different things back in the 1970s. You still see echoes of it in the optimization community, particularly in single-solution techniques like stochastic hillclimbing, or old terms like "beam search".
So here's how I define them. I believe that "search" typically is applied to problems where you have something to search for. A state or point in space which meets a certain criterion. The most common examples of this are state-space search problems, where the path you took to get there matters, and constraint satisfaction search problems, where the path doesn't matter, only the result. It's also applied in simpler fashions: tree search, binary search, etc.
Optimization, on the other hand, applies to problems where you are trying to find a point which is optimal with regard to some criterion. Unlike search, where it's a find-it-or-not proposition, in optimization you do not necessarily expect to find the globally optimal criterion ever, and in fact you don't necessarily even know that one exists (it could just keep going up and up). You're just trying to find the best thing you can given the time and resources you have.
Optimization algorithms are usually more general, and weaker, than search algorithms. Thus you can typically cast a search problem into an optimization problem -- for example, use your heuristic as the optimization criterion -- but optimization algorithms were designed for a more general class of problems and thus are expected to be outperformed by a custom-designed search algorithm (this fact is a general rule of thumb for pretty much everything in optimization, particularly stochastic optimization -- it's a last-ditch approach). Similarly you can use an optimization algorithm to find a decision tree solution: but I'd get smacked around by colleagues in the machine learning community for calling PSO a machine learning algorithm.
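To illustrate the casting trick (my own toy example, not from the book): 8-queens is naturally a constraint-satisfaction search problem, but you can turn it into optimization by using the number of attacking pairs as the criterion and running random-restart hill climbing over it:

```python
import random

def conflicts(board):
    """Number of attacking queen pairs; 0 means the search goal is met."""
    n = len(board)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if board[i] == board[j] or abs(board[i] - board[j]) == j - i)

def hill_climb(n=8, restarts=200):
    """Random-restart steepest-descent hill climbing on the conflict count."""
    for _ in range(restarts):
        board = [random.randrange(n) for _ in range(n)]
        while True:
            # Best single-queen move in the neighborhood (move queen in
            # column c to row r).
            best = min((conflicts(board[:c] + [r] + board[c + 1:]), c, r)
                       for c in range(n) for r in range(n) if r != board[c])
            if best[0] >= conflicts(board):
                break  # local optimum: no neighbor strictly improves
            board[best[1]] = best[2]
        if conflicts(board) == 0:
            return board
    return None

solution = hill_climb()
```

Note the asymmetry the parent describes: a custom CSP search (e.g. backtracking with propagation) solves this far more reliably per unit of work; the optimization formulation only wins on generality.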
So interesting question: is Alpha-Beta Pruning a search algorithm or an optimization algorithm?
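It's a fun edge case because the traversal looks like state-space search while the pruning condition is pure bound reasoning. A minimal sketch over an explicit game tree (the leaf values are the standard textbook toy example):

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Alpha-beta over a game tree given as nested lists; leaves are numbers."""
    if isinstance(node, (int, float)):
        return node  # leaf: static evaluation
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the minimizer will never allow this branch
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break  # alpha cutoff
    return value

# MAX root over three MIN nodes; the minimax value is 3.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # → 3
```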
At any rate -- Russell and Norvig are entirely muddled about this. They admit (p. 121) exactly what I state above, yet go on to lump the two together without any further explanation, and then make the strange case that local search is global if it is "complete" -- a very logician's way of stating it, but not a term in the field so far as I know.
Only the pure of heart Prolog programmer can make his way through that book! 8-))
Also, Craft is organized a bit strangely -- IIRC, it was originally written as a response to a specific Prolog implementation's documentation.
I also really enjoyed _Essentials of Metaheuristics_. I bought a print copy, but thanks for putting a free PDF out there.
Prolog is also partially historic, but it has the added benefit of being logic-based, which is the direction that AI focused on for several years. Around that time, it was believed that AI could be done with pure symbolic logic, and that's exactly how programming in Prolog works. This approach eventually turned out not to work very well, but Prolog is still used in some places because it's a very easy language for interacting with graphs and decision trees (which are big things in AI).
Prolog was and still is used precisely because it is so (relatively) easy to specify some facts and behaviors as Horn clauses, which is important, because it is one of the few places in KRR where I ever hear the phrase "solvable in polynomial time".
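That tractability is easy to see in code: forward chaining over propositional Horn clauses runs in polynomial time because each rule can fire at most once. A toy sketch (the facts and rule names are made up for illustration):

```python
def forward_chain(facts, rules):
    """Propositional Horn-clause forward chaining.

    rules: list of (body, head) pairs, meaning body1 & ... & bodyN -> head.
    Polynomial time: each pass either fires a new rule or terminates,
    and there are only len(rules) heads to derive.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

rules = [(("parent", "male"), "father"),
         (("father",), "ancestor")]
derived = forward_chain({"parent", "male"}, rules)
print(sorted(derived))  # → ['ancestor', 'father', 'male', 'parent']
```

General clauses (disjunctive heads) make the same question NP-hard, which is exactly why the Horn restriction is the one KRR folks reach for.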