Explicit stack-based interfaces are used for good reason: they allow simple, accurate garbage collection (see e.g. the Ruby API, which requires somewhat error-prone conservative GC; Python, which requires refcounts all over the place; or OCaml, which requires annotating all local variables in a special block). A custom frame stack is also needed to have coroutines in pure ANSI C (two of the reasons Lua is popular).
I don't believe non-stack-based APIs prevent the implementation of the features you just mentioned. You don't necessarily have to expose the underlying stack manipulation routines as your de facto API, although it is easier to do so. My gripe with this technique is that it makes it much harder for the compiler to catch errors. Personally, I think it would be better to expose the API as helper functions that compose (and hide) the underlying stack routines.
Each CUDA "core" is actually a lane in a 16-wide SIMD processor, so it has 20 CPUs in the traditional sense. (Intel CPUs have 4-way SIMD with SSE.)
Older GPUs could run only one "program" (kernel) at a time. On newer GPUs you can have all the "CPUs" running different programs, but instruction cache space is quite limited.
You can use ElementTree's iterparse() function (also available for lxml) to do incremental parsing while still having XPath-like functionality available for the elements you like (manually calling the .clear() method to release memory).
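A minimal sketch of that approach, using the standard library's xml.etree.ElementTree. The sample document, the tag names (catalog, book, title, price) and the helper name expensive_titles are made up for illustration; only iterparse(), find(), findtext() and clear() come from ElementTree itself.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; in practice this would be a large file on disk.
XML = b"""<catalog>
  <book id="1"><title>A</title><price>10</price></book>
  <book id="2"><title>B</title><price>25</price></book>
  <book id="3"><title>C</title><price>40</price></book>
</catalog>"""


def expensive_titles(source, min_price):
    """Collect titles of books priced at or above min_price, one book at a time."""
    titles = []
    # "end" events fire once an element and all of its children are fully parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "book":
            continue
        # The limited XPath subset works on the completed subtree.
        price = float(elem.findtext("price"))
        if price >= min_price:
            titles.append(elem.find("./title").text)
        # Drop the element's children, text and attributes to release memory.
        # The emptied element itself stays attached to the root.
        elem.clear()
    return titles


print(expensive_titles(io.BytesIO(XML), 20))
```

iterparse() also accepts a filename instead of a file object. For very large inputs the emptied elements still accumulate under the root, so you may want to remove them from the parent as well.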
Buddhism figured this out a long time ago - instead of "self-discipline" there is "mindfulness", a permanent understanding of what is going on with you and around you. Focusing not on the self but on the truth will avoid many of the faults mentioned in the article. The ego should not be controlled - it should be dissolved.