So far, I'd say our observations with respect to RPython match the ones discussed in "Tracing vs. Partial Evaluation": language implementers have to do more work in Truffle, but probably get better peak performance in return (not talking about warmup or memory consumption here).
Re "a more optimized bytecode" set: Sista (an extension of the OpenSmalltalkVM) is doing something like that, too. I'd guess performance would be more or less the same as with GraalVM, since Truffle produces highly specialized code. But, of course, this would need to be benchmarked. The advantage of the Sista approach is that it's managed at the image level, so specialized versions of methods are persisted as part of the image. AFAIK the GraalVM team is working toward persisting compiled code caches, which is kind of similar but at the level of the language implementation framework.
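To illustrate the general idea (this is a hedged sketch with made-up opcode names, not the actual Sista bytecode set): a generic send site that keeps observing the same receiver types can rewrite itself in place to a specialized opcode, and because the method's bytecode is part of the image, that rewrite survives with the image.

```python
# Minimal sketch of speculative in-place bytecode rewriting,
# in the spirit of Sista-style specialization.
# GENERIC_ADD / SPECIAL_ADD are hypothetical opcode names.

GENERIC_ADD = "SEND_PLUS"     # generic '+' message send
SPECIAL_ADD = "ADD_SMALLINT"  # specialized small-integer add

def run(method, stack):
    """Execute a tiny bytecode list, rewriting hot generic sends."""
    pc = 0
    while pc < len(method):
        op = method[pc]
        if op == "PUSH":
            pc += 1
            stack.append(method[pc])
        elif op == GENERIC_ADD:
            b, a = stack.pop(), stack.pop()
            if isinstance(a, int) and isinstance(b, int):
                # Speculate: rewrite this bytecode slot in place, so the
                # specialization is stored with the method itself.
                method[pc] = SPECIAL_ADD
                stack.append(a + b)
            else:
                stack.append(str(a) + str(b))  # generic fallback
        elif op == SPECIAL_ADD:
            # Assumes both operands are ints; a real VM would guard
            # and deoptimize back to the generic send on failure.
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        pc += 1
    return stack

method = ["PUSH", 1, "PUSH", 2, "SEND_PLUS"]
print(run(method, []))   # → [3] (generic path, send gets rewritten)
print(method)            # the method now contains ADD_SMALLINT
print(run(method, []))   # → [3] (second run takes the specialized path)
```

Since the rewritten opcode lives in the method object rather than in JIT-internal state, saving the "image" (here, just the list) preserves the specialization across runs, which is the persistence advantage mentioned above.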
Hope this answers your questions!
I posted a link to side-by-side measurements (vw-python3.html) of individual Smalltalk and Python programs.