There's an assumption in this line of thinking which I think is unlikely to be true: that this stuff is measurable, at least in principle.
I think 'instructional effectiveness' can't ever be clearly and unambiguously defined, much less measured... as with anything that involves people. In my experience, most attempts to do so cause more harm than good.
There are different kinds of measurable, and the purpose of the measure matters. Measuring for the purpose of management (eg creating incentives for teachers and schools), measuring for the purpose of most academic publishing, etc. Each has its own pitfalls, and I agree with you on those.
That said... there is clearly such a thing as better and worse instruction.
Where the 2-sigma claim gets interesting is scale. We're not concerned with marginal differences (management measures) and we're not concerned with legibly identifiable causal relationships (academic publishing). We're just concerned with establishing a high watermark.
The simple measures we have (eg testing) work fine for that... in the context of math, reading/writing, foreign language, etc. Stuff that's easy to test for, assuming we're only interested in big differences.
Two sigma implies that within 1 year of instruction, the two sigma group will have progressed by several years: 12th grade math level by grade 10. This doesn't have to be achievable by every teacher; it just has to be achievable at the high end... assuming a random sample of students.
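To make the scale concrete, here's a minimal sketch of what a two sigma shift means in percentile terms, assuming test scores are roughly normally distributed (my assumption for illustration, not something claimed above):

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# A "two sigma" effect shifts the whole group up by 2 standard deviations.
# The formerly average (50th percentile) student now scores where only a
# +2 sigma outcome landed under conventional instruction.
percentile = normal_cdf(2.0) * 100
print(f"median student moves to roughly the {percentile:.0f}th percentile")
# → roughly the 98th percentile
```

In other words, under this assumption the average tutored student outperforms about 98% of conventionally taught students, which is why even a rough test-based measure is enough to spot an effect this large.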
Agreed though... we have a history of insisting that the unmeasurable be measured, and it's a very common, long-term failure mode: more harm than good. I'm not suggesting implementing anything, let alone implementing anything using measures. Just setting that watermark, so we know what awesome looks like.