Have you looked into Xeon Phi? They have up to 288 hardware threads. On a 256-thread Xeon we got good speedups with up to 128 threads using Folding@home's latest GROMACS core (0xa7). Beyond that, adding threads gave no further improvement. In the long run, massively parallel CPUs could outpace GPUs precisely because of their flexibility.
Note that the 0xa7 core also uses 256-bit AVX, so it combines multiple CPU threads with vector instructions.
I haven't tried MICs personally, but AFAIK you need vectorization to match current Pascal-generation GPU performance with Knights Landing, which is where my comment applies. I don't doubt that you can get good speedups if your code is already vectorized, but if you start from naive CPU code you'll have a lot of work to do, which IMO is similar to the work needed to port to GPGPU.
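To make the "threads plus vector instructions" point concrete, here is a minimal sketch in C with OpenMP. It is not taken from GROMACS or the 0xa7 core; the function names `saxpy_naive` and `saxpy_omp_simd` are made up for illustration, and I'm assuming a compiler with OpenMP 4.0 support (e.g. built with `-fopenmp`).

```c
#include <stddef.h>

/* Naive CPU code: a single thread running a scalar loop.
 * The compiler may or may not auto-vectorize this on its own. */
void saxpy_naive(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Threads plus SIMD: OpenMP splits the iterations across threads and
 * requests vectorization within each thread's chunk. `restrict` promises
 * the compiler that x and y do not overlap. If OpenMP is not enabled,
 * the pragma is ignored and this behaves like the plain loop. */
void saxpy_omp_simd(size_t n, float a,
                    const float *restrict x, float *restrict y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

A trivial loop like this only needs one pragma. The work I mean shows up in real code with branches, indirect indexing, or array-of-structs layouts, where you usually have to restructure the data before either the vector units or a GPU can be used well.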