I don't think this approach would work. It's quite common to give long-ish dialogue scenes lower-grade animation and save the high-quality work for scenes where something visually interesting is happening on screen.
As an example, the final conversation between Okabe and FB in Steins;Gate is really not well animated, but it is crucial to the plot, and the dialogue and voice acting still make it a very impactful scene.
An extreme example: there's a critical scene towards the end of Neon Genesis Evangelion where a single frame is on screen for about a minute with no dialogue. (Not the elevator scene.)
You are, of course, correct. It's not exactly a bulletproof heuristic. At best, you'd probably only be able to identify likely filler episodes, as opposed to filler scenes.
A truly sophisticated approach capable of identifying filler scenes would probably involve machine learning on data that's not (to my knowledge) actually available to the public, like engagement or watch-time statistics.