Well, I'm not sure how to write a compiler that generates machine code without fully understanding the language's semantics and evolution. I have done two of them, and did not build rigorous mathematical semantic models for either. One was an implementation language targeting the 8085, with the compiler written in BLISS-36, which had clearly superior semantics, and there was no hint of the corruption you suggest.
I think it would be a fun project to write a Pascal compiler in Scheme.
I argue that there are things that you are likely to run across in generating code that you might not see in mastering the language.
See, there are two kinds of study going on here. One is the study I have done of Emacs, which is what I think the article was talking about: starting with a small set of knowledge and expanding it bit by bit as you use the tool in daily life.
The other kind of study, which I was trying to illustrate (and apparently failed to), is a thorough, directed study (e.g., reading the User Manual and Report until it is about ready to fall apart), or reading the spec for the target language and working closely with the users. In the second case, we (three of us) wrote the compiler in a year, and just about replaced the use of assembly language in the OS project.
I can't think of one non-procedural language you can "teach yourself" by writing a compiler for. That was my point of reference.
Even if you don't do any semantic modeling of the language itself, you most likely will be reading similar research done for a similar language.
You can teach yourself an Algol dialect by implementing it first (same with Forth), but that's pretty much it. Even Griswold's Icon, procedural at its core, is more sophisticated than anything Algolish.
I'm a little more familiar with SNOBOL, but Icon is imperative and procedural and a bit complicated, and it looks tempting as a target for this little thought experiment of ours: writing a compiler to understand how a language works.
Now it would not be a fair test to go after Scheme or Lisp, as there are many examples out there of people who have implemented toy Schemes and toy Lisps. Toy here in the sense of tutorial or exploratory.
So, being a simple country boy, I seem to be having trouble coming up with examples in line with your point. I wonder if you could give me some examples of non-procedural languages for which this exercise would not work: languages you could not learn by writing a compiler for them, short of doing extensive, rigorous mathematical semantic modeling?
I would say nearly all functional programming languages are beyond the scope of the dedicated hacker with nothing to go on but the Dragon book and the Lex and Yacc manuals. The strict ones are more approachable, but even an FP language in its most primitive form, the untyped lambda calculus, is hard to reason about without some mathematical tools, at least enough to understand efficient implementation of recursion. (Anyone can implement recursion, even in assembly, but its consequences reach far beyond "set base pointer to stack pointer, add sizeof(locals) to stack pointer, and jump to the routine entry point"; that's why you have people like GvR struggling with it, 20+ years after the lambda papers.)
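To make that concrete, here's a minimal sketch (in Python, since GvR came up) of what recursion looks like in the untyped lambda calculus: no function may refer to itself by name, so you need a fixed-point combinator. Python is strict, so this uses the Z combinator, the eager variant of Y; `fact` is just my illustrative example.

```python
# The Z combinator: a fixed-point combinator that works in a
# strict (call-by-value) language, unlike the plain Y combinator.
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Factorial written without any self-reference: the "recursive call"
# arrives as the argument `rec`, supplied by the combinator.
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))

print(fact(5))  # 120
```

This is exactly the kind of thing that is trivial to write down and surprisingly hard to compile efficiently without understanding what the fixed point is doing.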
Type theory would send anyone insane. Static typing is trivial when you have a set of "primitive" types and another of "derived" types; neither C nor Pascal nor any other Algol dialect actually implements proper types. C gives you a bag of bytes in the form of a struct, and a pointer to that; you create your data structures with those. ML dialects take type theory to fantastic extremes. Throw pattern matching into the mix and your regular compiler hacker is off to read the semantics literature.
Reasoning about lazy evaluation is also not trivial. It doesn't even show up in the mainstream compiler literature, which, by the way, has always been about creating the fastest Fortran dialect. Compiler research IS Fortran research, while PL research is mostly FP research (even if it masquerades as OOP sometimes, usually for grant and resume purposes).
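The core mechanism, at least, fits in a few lines: call-by-need is "evaluate at most once, then remember the answer." Here's a sketch with a `Thunk` class of my own invention (a real lazy runtime does this via graph reduction, not a Python object, but the invariant is the same):

```python
# A memoized thunk: defers a computation and caches its result,
# so the suspended expression runs at most once.
class Thunk:
    def __init__(self, compute):
        self.compute = compute
        self.forced = False
        self.value = None

    def force(self):
        if not self.forced:
            self.value = self.compute()  # evaluate exactly once
            self.forced = True
            self.compute = None          # drop the closure so it can be GC'd
        return self.value

calls = []
t = Thunk(lambda: calls.append("ran") or 42)
t.force()
t.force()
print(t.value, len(calls))  # 42 1 -- forced twice, evaluated once
```

The hard part isn't this class; it's reasoning about space behavior and evaluation order once every binding in the program is one of these.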
Logic languages are also beyond the reach of someone coming from a classic compiler-hacking background. LPLs are so far out that not even mainstream PL enthusiasts grok them.
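Part of why: the heart of any Prolog-ish language is unification, which has no analogue anywhere in the Dragon-book pipeline. A minimal sketch (my own toy term representation, not any real system's; variables are capitalized strings, compound terms are tuples):

```python
def is_var(t):
    # By convention here, a variable is a string starting uppercase.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    # Chase a variable through the substitution to its current value.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    # Return an extended substitution that makes a and b equal,
    # or None if no such substitution exists.
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

print(unify(('point', 'X', 2), ('point', 1, 'Y'), {}))  # {'X': 1, 'Y': 2}
```

And unification is the *easy* half; backtracking search and clause indexing are where a classic compiler hacker really runs out of map.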
All is not lost, however. At least one language, Forth, is trivial to implement and best learned by implementing it first, as difficult as it is to reason about. The problem with Forth is that there is nothing magical going on; it's a simple and beautiful little perversion, something that could only come out of a mind untainted by theory. Ditto for Perl, though it is by no means pretty: Perl has the fingerprints of a hobbyist all over it.
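To back up the "trivial to implement" claim, here's a toy Forth inner interpreter: a stack, a dictionary of words, and a token scanner. It's a sketch only (no colon definitions, no return stack, and only a few words), but it is genuinely the whole shape of the thing:

```python
def forth(source):
    # The entire machine: one data stack and one word dictionary.
    stack = []
    words = {
        "+":    lambda: stack.append(stack.pop() + stack.pop()),
        "*":    lambda: stack.append(stack.pop() * stack.pop()),
        "dup":  lambda: stack.append(stack[-1]),
        "swap": lambda: stack.extend([stack.pop(), stack.pop()]),
    }
    for token in source.split():
        if token in words:
            words[token]()          # execute a known word
        else:
            stack.append(int(token))  # anything else is a number literal
    return stack

print(forth("2 3 + dup *"))  # [25]
```

Adding `:` definitions is just letting the dictionary grow at runtime, which is exactly the kind of thing you learn by building it rather than by reading about it.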
Call me selfish, but I treasure programming languages, especially the weird and unconventional ones, and especially if they have some sort of academic/theoretical brownie points.
You have already done BLISS; I have been meaning to clone it ever since I read an old rant by Dennis Ritchie on Usenet. I am weirdly attracted to CMU languages, so I am gonna say Dylan :-) Marlais is open source and you can use it as a starting point. I am halfway through cloning DUIM for Common Lisp and, frankly, all their "syntax" innovation would be lost on me. I am forever ruined for non-Lisp syntaxes, I think. ISO Lisp would be cool too.
If you weren't a compiler hacker I would have recommended Oberon.
For some reason, I could never see compiler hacking outside the traditional code generation for a real machine. If I wanna do source translation I would probably stick to macrology.
Miranda would be cool too, especially since there are no FLOSS compilers for it.
Mozart/Oz has a tiny kernel which you could clone, but I don't want you to learn it that way. Instead, get the book by Van Roy and Haridi and enjoy it at a leisurely pace.
Someday I would love to implement an APL and a Prolog dialect. Someday.