Maybe because CVC4 has better practices around fuzzing? https://github.com/CVC4/CVC4/wiki/Fuzzing-CVC4 Not sure.
I am also looking for a recommendation for a survey comparing different SMT solvers, including e.g. https://github.com/SRI-CSL/yices2 and https://github.com/dreal/dreal4
We have treated Z3 and CVC4 exactly equal, i.e. every formula on which Z3 was tested, CVC4 got also tested. It is thus striking that we found almost twice as many bugs in Z3 as compared to CVC4. The solvers have different development models. Whereas Z3 relies on a single main developer and a couple of assisting developers, CVC4 is more of a community effort. CVC4 requires code reviews, Z3 usually not. On the other hand, issues in Z3 get usually fixed faster and Z3 has the larger codebase (Z3: ~400k LoC vs CVC4: ~200k LoC). The fuzzing practices of CVC4 did not play a role, though. As soon as they were in place (since August 2020), we applied their rules to both solvers. Our earlier work "Validating SMT Solvers via Semantic Fusion" published at PLDI 2020 (bug hunting from July 2019 - November 2019) showed the same trend. To the best of our knowledge, all SMT solver bug hunting campaigns prior to ours, found no bugs in CVC4 at all. CVC4 simply seems to be harder to crack.
I'm biased but if I had any purse string authority at Microsoft, RiSE's budget would be a multiple of what it currently is.
A nice thing about the smtlib2 format is that you can try the same problem in different solvers without too much pain. Generally every solver will out perform other solvers on at least some problems.
For general purpose use, Z3 is a good starting point.
I've had good luck with Boolector.
I haven't really tried dreal. It is specialized for problems involving nonlinear predicates over real numbers, so it targets a different application domain than the other solvers.
It could have been improved by digging into why the code of Z3 and CVC4 had those bugs.
It is surprising to me that they could be wrong on such simple statements (and such a simple fuzzing procedure). I expected the faulty statements to be gigantic edge-case monstrosities, but they are just a few lines of boolean logic.