Yes, there are optimizations that run before monomorphization; this one occurs during macro expansion.
It's a bit of a chicken-and-egg problem. To monomorphize `clone()`, someone must first emit an implementation. But an optimal implementation requires analyses that aren't available until later in the pipeline. Here the optimization kicks in for types that derive `Copy`, but a generic type parameter is enough to defeat it.
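To make the cutoff concrete, here's a sketch of the two expansions (the type names are illustrative, and the real derive output uses fully qualified paths and differs in minor details):

```rust
// Non-generic type deriving Copy: the derive emits the shallow version.
#[derive(Copy, Clone)]
struct Point {
    x: u32,
    y: u32,
}
// Roughly expands to:
//
// impl Clone for Point {
//     fn clone(&self) -> Point { *self }
// }

// A type parameter defeats the shortcut, even though Copy is also derived:
#[derive(Copy, Clone)]
struct Pair<T> {
    a: T,
    b: T,
}
// Roughly expands to the general field-by-field version:
//
// impl<T: Clone> Clone for Pair<T> {
//     fn clone(&self) -> Pair<T> {
//         Pair {
//             a: Clone::clone(&self.a),
//             b: Clone::clone(&self.b),
//         }
//     }
// }
```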
IIRC, that "optimization" mostly avoids wasting time compiling a complex `Clone` implementation when simply returning `*self` suffices (there are crates with a lot of `#[derive(Copy, Clone)]` types). We try to avoid having a lot of logic like that too early, for precisely the reasons you mention.
I'd be interested in an example where LLVM can't optimize the general version, since that would mean we might want to do this through MIR shims instead (which can be generated when collecting the monomorphic instances to codegen; this is what happens when you clone a tuple or closure, for example).
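To illustrate that parenthetical, a minimal sketch of the tuple case: there's no `Clone` impl for this tuple type written out anywhere in source, so the body comes from a compiler-generated shim for the concrete type.

```rust
fn main() {
    // (u8, String) is not Copy (String isn't), so this is a real clone.
    let t = (1u8, String::from("hi"));
    // rustc resolves this call to a compiler-generated clone shim,
    // materialized when the monomorphic instances are collected.
    let u = t.clone();
    assert_eq!(t, u);
}
```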