It's a couple of things; process is a large part. You're also dealing with 4-LUTs instead of transistors, so you pay in both switching power and leakage, since you can't get the same logic-to-transistor density that's available on ASICs.
Also, there's a ton of SRAM for the 4-LUT configuration, so you're paying leakage costs there as well.
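To make the SRAM point concrete, here is a rough software model of a 4-LUT as 16 configuration bits, one per input combination. The names `Lut4` and `config_for` are made up for illustration and do not correspond to any vendor tool or primitive.

```python
class Lut4:
    """Toy model of a 4-input LUT: 16 configuration bits, one per input pattern."""

    def __init__(self, config: int):
        if not 0 <= config < (1 << 16):
            raise ValueError("a 4-LUT holds exactly 16 configuration bits")
        self.config = config

    def evaluate(self, a: int, b: int, c: int, d: int) -> int:
        # The four inputs form an address into the configuration memory.
        index = (d << 3) | (c << 2) | (b << 1) | a
        return (self.config >> index) & 1


def config_for(fn) -> int:
    """Build the 16-bit configuration word for any 4-input boolean function."""
    config = 0
    for index in range(16):
        a = index & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        if fn(a, b, c, d):
            config |= 1 << index
    return config


# Even a 2-input AND still occupies all 16 configuration bits.
and2 = Lut4(config_for(lambda a, b, c, d: a & b))
xor4 = Lut4(config_for(lambda a, b, c, d: a ^ b ^ c ^ d))
```

In this model a trivial function costs the same 16 bits of storage as a complicated one, which is the density penalty being described.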
NVidia managed to get it right about a year and a half ago. Before that, their gates leaked power all over the place.
The LUTs on Stratix are 6-to-2, with specialized adders; they aren't at all the 4-LUTs you are describing here.
All in all, there are places where FPGAs can beat ASICs. One example is complex algorithms like, say, ticker correlations. These are done using dedicated memory and logic (and thus aren't all that CPU friendly, since caches aren't enough), and they change often enough to make an ASIC moot.
Another example is parsing network traffic (deep packet inspection). The algorithms in this field use memory in interesting ways: compute a lot of different statistics for a packet, then compute the KL divergence between a reference model and your result to determine the actual packet type. The histograms are filled in random order and then scanned linearly, all in parallel. GPUs and/or CPUs just do not have that functionality.
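For anyone who hasn't seen the histogram-plus-KL step, here is a minimal software sketch of the idea. It is only an illustration: the byte-value histogram, the 16 bins, the add-one smoothing, the direction of the divergence, and the names `byte_histogram`, `kl_divergence`, and `classify` are all assumptions of mine, not a description of any actual DPI product.

```python
import math


def byte_histogram(payload: bytes, bins: int = 16) -> list[float]:
    """Smoothed, normalized histogram of byte values in equal-width bins."""
    counts = [0] * bins
    for b in payload:
        # Each byte increments an effectively random bin (random-access update).
        counts[b * bins // 256] += 1
    total = len(payload)
    # Add-one smoothing keeps every bin non-zero so the log below is defined.
    return [(c + 1) / (total + bins) for c in counts]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q), computed with one linear scan over the bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def classify(payload: bytes, references: dict[str, list[float]]) -> str:
    """Return the name of the reference model closest to the observed histogram."""
    observed = byte_histogram(payload)
    return min(references, key=lambda name: kl_divergence(observed, references[name]))
```

In hardware, each statistic would get its own small memory so that all histograms update and scan at the same time; the sequential loops here only show the arithmetic.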