We don't have benchmarking data, but I strongly suspect that these kinds of implementations will be significantly slower than compiled code running on x86-64. This is almost certainly going to be true for stack-based VMs: stack operations are slow compared to register operations, and because each instruction's pushes and pops are implicit and strictly sequential, every instruction depends on the stack state left by the one before it. That makes it very hard to take advantage of the techniques modern CPUs rely on for throughput, such as superscalar and out-of-order execution, and even branch prediction and pipelining are less effective.
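To make the point about implicit, sequential stack traffic concrete, here is a rough sketch in Python. It is a toy model, not a measurement: the instruction encodings, the `critical_path` helper, and the decision to treat the stack pointer `sp` as a true dependency of every stack instruction are all my own assumptions for illustration. Real hardware can soften this (stack-pointer tracking, renaming of stack slots), so read the numbers as a worst case for a naive design.

```python
# Toy model: each instruction is (name, resources_read, resources_written).
# We only track read-after-write dependencies and assume one cycle per
# instruction; both are simplifying assumptions for this sketch.

def critical_path(instrs):
    """Return the length of the longest read-after-write dependency chain."""
    ready = {}    # resource -> depth at which its latest value is available
    longest = 0
    for _name, reads, writes in instrs:
        # An instruction can start only after all of its inputs are ready.
        depth = 1 + max((ready.get(r, 0) for r in reads), default=0)
        for w in writes:
            ready[w] = depth
        longest = max(longest, depth)
    return longest

# (a + b) * (c + d) on a naive stack machine: every instruction implicitly
# reads and updates the stack pointer, so each one waits on the previous one.
stack_prog = [
    ("push a", {"sp"}, {"sp"}),
    ("push b", {"sp"}, {"sp"}),
    ("add",    {"sp"}, {"sp"}),
    ("push c", {"sp"}, {"sp"}),
    ("push d", {"sp"}, {"sp"}),
    ("add",    {"sp"}, {"sp"}),
    ("mul",    {"sp"}, {"sp"}),
]

# The same expression on a register machine, with a..d already in registers:
# the two adds name different registers and are independent of each other.
reg_prog = [
    ("add r1, a, b",   {"a", "b"},   {"r1"}),
    ("add r2, c, d",   {"c", "d"},   {"r2"}),
    ("mul r3, r1, r2", {"r1", "r2"}, {"r3"}),
]

print(critical_path(stack_prog))  # 7: a fully serial chain
print(critical_path(reg_prog))    # 2: both adds can issue in the same cycle
```

In this model the stack version has no instruction-level parallelism to extract at all, while the register version exposes the two independent adds to a superscalar or out-of-order core. The register program also assumes its operands are already loaded, so the comparison flatters it somewhat.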
To get the same level of performance as modern CPUs you really need to take advantage of all the techniques I mentioned above. This is extremely difficult to design, because every valid instruction sequence has to produce the correct result despite all the reordering and parallel execution going on underneath. Modern CPUs do ship with a fair number of bugs in this area, but they tend to show up only in very unusual code, and they are fixed or worked around either by microcode updates or by the OS, so you rarely encounter them. And that is despite the amazingly thorough testing and the huge amount of formal verification that go into CPU designs.
I think FPGA-based designs will continue to be very algorithm-specific for those reasons, even if we get FPGAs everywhere.