There's a lot of interest for convolutional networks and the best way to implement it will be to wrap Alex Krizhevsky's cuda-convnet, like DeepNet and PyLearn2 have, but this will require a bit more effort.
With respect to other deep learning packages, Hebel doesn't necessarily do everything differently, but depending on your needs it may be the best choice for a particular job.
PyLearn2 is big and monumental and although I haven't used it much personally, it seems to excellent. But as you mentioned, it's not necessarily easy to use and if you want to extend it, you have to learn the Theano development model, which takes some time to grok.
DeepNet is quite similar to Hebel in its approach (even though it offers more models right now). However, DeepNet is based on cudamat and gnumpy, which I have found to often be quite unstable and slow. Hebel is based on PyCUDA which is very stable and according to some preliminary tests I did runs about twice as fast as cudamat.
So, the idea of Hebel is that it should make it easy to train the most important deep learning models without much setup or having to write much code. It is also supposed to make it easy to implement new models through a modular design that lets you subclass existing layers or models to implement variations of them.
I have found that using a trained net for preprocessing can be accomplished using very limited resources (read: Core 2 Duo laptop). This is one of the very nice features of DeCAF, which could allow for some interesting applications on embedded devices.
Great work by the way - I look forward to testing it out soon!
As far as embedded devices go (I assume you're talking about ARM cpus etc), they are probably too underpowered to run Neural nets anyway, or models would have to be written in highly specialized C.