So if you are designing the hardware, give it an extra 4 KiB or so (16 pages), mapped through a paging circuit to the zero page. On entry to a function, increment the current page; on exit, decrement. Now you have a way to push all "local variables" that is faster than even pushing all the registers. You could even get fancy and XOR bits 7-4 of the page register with bits 4-7 of the paged address lines, so the compiler can choose anywhere from 16 256-byte frames up to 256 16-byte frames.
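To make the XOR trick concrete, here is a rough C model of how such a paging circuit might compute the physical address. It is only a sketch of one reading of the idea above: it assumes the low nibble of the page register selects one of the 16 pages, and the names `reverse_nibble` and `frame_address` are made up for illustration.

```c
#include <stdint.h>

/* Reverse the four bits of a nibble: bit 3 <-> bit 0, bit 2 <-> bit 1. */
uint8_t reverse_nibble(uint8_t n)
{
    return (uint8_t)(((n & 1) << 3) | ((n & 2) << 1) |
                     ((n & 4) >> 1) | ((n & 8) >> 3));
}

/* Map (page register, zero-page address) to a 12-bit address inside the
 * extra 4 KiB. Assumption: page register bits 3-0 pick the 256-byte page,
 * and bits 7-4 are XORed onto address bits 4-7 in reversed order
 * (register bit 7 onto address bit 4, ..., register bit 4 onto bit 7). */
uint16_t frame_address(uint8_t page_reg, uint8_t zp_addr)
{
    uint8_t page = page_reg & 0x0F;
    uint8_t mask = (uint8_t)(reverse_nibble(page_reg >> 4) << 4);
    uint8_t low  = zp_addr ^ mask;
    return (uint16_t)((page << 8) | low);
}
```

The bit reversal is what lets the compiler pick the frame size. With 16-byte frames the function only touches addresses 0x00-0x0F, so all 256 register values give disjoint frames. With 32-byte frames the register has to stay below 128, so that bit 7 (the one XORed onto address bit 4) remains zero, and so on up to 256-byte frames with only 16 register values.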
Of course, it was also possible to do traditional banking on the /// so that program code could reside anywhere in the 512K, but I thought the X page thing was pretty neat when I finally figured it out.
I would split the parameter and computation stacks, instead of using one as cc65 does. The parameter stack would work the same as cc65's, but the computation stack would be a FORTH-like stack indexed with the zp,X addressing mode.
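For anyone who hasn't seen the FORTH-style stack, here is a small C model of a 16-bit computation stack kept in zero page and indexed by X. It is a sketch, not cc65 code: the array `zp`, the starting value of `x`, and the function names are all hypothetical, and the comments show the 6502 instructions each step stands in for.

```c
#include <stdint.h>

uint8_t zp[256];   /* model of the 6502 zero page */
uint8_t x = 0x80;  /* X register used as stack pointer (arbitrary start) */

/* zp,X addressing wraps around inside the zero page. */
uint8_t zpx(uint8_t offset)
{
    return (uint8_t)(x + offset);
}

void push16(uint16_t v)
{
    x -= 2;                          /* DEX / DEX */
    zp[zpx(0)] = (uint8_t)(v & 0xFF);/* STA 0,X   */
    zp[zpx(1)] = (uint8_t)(v >> 8);  /* STA 1,X   */
}

uint16_t pop16(void)
{
    uint16_t v = (uint16_t)(zp[zpx(0)] | (zp[zpx(1)] << 8));
    x += 2;                          /* INX / INX */
    return v;
}

/* Add the top two cells, leaving the sum in place of the second one. */
void add16(void)
{
    unsigned lo = zp[zpx(0)] + zp[zpx(2)];             /* CLC / LDA 0,X / ADC 2,X */
    zp[zpx(2)] = (uint8_t)lo;                          /* STA 2,X */
    unsigned hi = zp[zpx(1)] + zp[zpx(3)] + (lo >> 8); /* LDA 1,X / ADC 3,X */
    zp[zpx(3)] = (uint8_t)hi;                          /* STA 3,X */
    x += 2;                                            /* INX / INX */
}
```

The point is that every operand is reachable with a single zp,X access, with no indirect pointer to set up first, which is where the speed advantage over a software stack addressed through a zero-page pointer would come from.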
Second, I would have it build an AST instead of having the parser generate the code. This would open doors to big optimizations, such as automatically detecting when to use 8-bit ops (complicated by C's promotion rules), etc.
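As a rough illustration of the kind of analysis an AST makes possible, here is a minimal C sketch that decides whether an expression can be evaluated entirely with 8-bit operations. It assumes a toy AST with a handful of node kinds (all names here are invented), and it is deliberately conservative: it only accepts expressions that are explicitly truncated back to `unsigned char`, because C's promotion rules otherwise require the full `int`-width result.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { N_CONST, N_VAR, N_ADD, N_AND, N_SHR, N_CAST_U8 } NodeKind;

typedef struct Node {
    NodeKind kind;
    const struct Node *lhs;  /* sole or left operand; NULL for leaves */
    const struct Node *rhs;  /* right operand; NULL for leaves and casts */
} Node;

/* True if the low byte of n's value depends only on the low bytes of its
 * operands, so the upper byte never has to be computed. */
bool low_byte_only(const Node *n)
{
    switch (n->kind) {
    case N_CONST:
    case N_VAR:
        return true;   /* just load the low byte */
    case N_ADD:
    case N_AND:
        return low_byte_only(n->lhs) && low_byte_only(n->rhs);
    case N_CAST_U8:
        return low_byte_only(n->lhs);
    case N_SHR:
        return false;  /* high bits shift down into the low byte */
    }
    return false;
}

/* An expression can use 8-bit code if its result is truncated to 8 bits
 * and everything underneath only needs low bytes. */
bool can_emit_8bit(const Node *n)
{
    return n->kind == N_CAST_U8 && low_byte_only(n->lhs);
}
```

So `(unsigned char)(a + b)` would qualify, while a bare `a + b` would not, since the promoted sum can carry into the high byte. A real compiler would also want to track value ranges and assignment targets, but a one-pass code generator has no tree on which to run even this much.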
Finally, I would implement far pointers for memory-banked systems, most prominently the NES, where cc65 has no banking support as far as I know.
Another thing that could be done is treating the zero page as a bank of 16/32/etc. registers and programming the 6502 like a RISC.
cc65 is an amazing project that's enabled many developers. Its Small-C pedigree is limiting though, and inefficient on a machine that it wasn't designed for.
The 65c816 allows the "zero page" (called the direct page there) to be relocated anywhere in the lower 64k. So if one targets that platform the saving/restoring operation could simply be one of pushing/popping the direct page location to/from the stack when entering or exiting function stack frames. Then just keep all program code above the 64k boundary.
Forth. Problem solved.
I learned Forth on a 6502 alongside BASIC and assembler, and it was the language with the best performance/memory ratio. It was about 10 times slower than assembler, and very efficient in memory consumption. Memory was a really big issue at that time. And Forth as a programming language is not bad. You just need a lot of discipline (as with Lisp) because it is so extremely powerful. I wonder why Forth isn't used for firmware, since buggy code could be replaced at runtime.
I was also into Forth in the 1980s on the Apple ][, and have always meant to get back into it (or one of the new Forth-y languages like Factor) one day.
BTW, Mr. Lutus appears around here from time to time.
I remember the day I broke into Ultima and discovered that it was mostly BASIC. I was like WOW, game programmers write in BASIC too. It validated the way I had hacked together my own Ultima clone. (That was many years after the initial Ultima; I was too young in the early '80s to be writing code.)
EDIT: Here's the manual for Basic Lightning, which mentions in the preface that a compiler was due to be released in 1985.
EDIT2: Ah, it changed its name to Laser BASIC when the compiler was released.
I won't ever forget typing in seemingly endless columns of hex numbers from a magazine to obtain both the interpreter and the compiler... (here are some original scans & documentation: http://www.tmeyer.de/atari/index.html).
Anyway, I once bought a new game, "Kolony", stored on a cartridge. It was a multiplayer game that asked for the number of players before starting. I accidentally pressed CTRL+3, which caused some error, and I got the BASIC prompt. It turned out that the whole game was written in Turbo Basic XL and the cartridge included the interpreter as well. I just had to type NEW to erase the game code itself and start programming in the new BASIC flavour.
I forgot my 6502 days. I had a Commodore PET in 1978, and a VIC-20 in 1980 or 1981.