I work at Cartesia, which operates a TTS API similar to Play [1]. I'd venture a guess that our TTS model, Sonic, is SoTA for on-device use, but don't quote me on that. It's the same model that powers our API.
Sonic can be run on a MacBook Pro. Our API sounds better, of course, since that's running the model on GPUs without any special tricks like quantization. But subjectively the on-device version is good quality and real-time, and it possesses all the capabilities of the larger model, such as voice cloning.
Our co-founders did a demo of the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (I will caveat that this sounds quite a bit worse than if you heard it in person today, since this was an early alpha + it's a recording of the output from a MacBook Pro speaker.)
(Not the author but I work in real-time voice.) WebSockets don't really translate to actual GPU load, since they spend a ton of time idling. So strictly speaking, you don't need a GPU per WebSocket assuming your GPU infra is sufficiently decoupled from your user-facing API code.
That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
While sometimes degrading the experience, a little or a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) Mostly important for "real-time" rather than batched/async stuff, of course.
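For anyone curious what "decoupled from your user-facing API code" plus batching can look like, here's a minimal sketch (not anyone's actual architecture): WebSocket handlers only enqueue requests, and a single GPU worker drains the queue in small batches. `run_tts_batch` is a hypothetical stand-in for the real model call, and the websocket object is assumed to support async iteration and `send` (as in the `websockets` library).

```python
import asyncio

MAX_BATCH = 8

def run_tts_batch(texts):
    # Stand-in for the real batched GPU inference call (hypothetical).
    return [f"<audio for {t!r}>".encode() for t in texts]

async def handle_socket(websocket, queue):
    # The WebSocket handler never touches the GPU directly: it enqueues
    # work and awaits the result, so many mostly-idle sockets can share
    # a handful of GPU workers.
    async for text in websocket:
        done = asyncio.get_running_loop().create_future()
        await queue.put((text, done))
        await websocket.send(await done)

async def gpu_worker(queue):
    while True:
        batch = [await queue.get()]
        # Opportunistically pull whatever else is already waiting,
        # up to MAX_BATCH, and run it through the model in one pass.
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        audios = run_tts_batch([text for text, _ in batch])
        for (_, done), audio in zip(batch, audios):
            done.set_result(audio)
```

The point of the sketch is just that the unit of GPU work is the queued request, not the socket, so the number of open WebSockets and the number of GPUs can scale independently.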
As someone who's attended events run by Daily/Kwindla, I can guarantee that you’ll have fun and leave with your IP rights intact. :) (In fact, I don't even know that they're looking for talent and good ideas... the motivation for organizing these is usually to get people excited about what you're building and create a community you can share things with.)
Thanks for the shoutout! We're very excited about how this space is evolving and are working on new features and perf improvements to support experiences like this.
> Most programming work is still text editing [...]
Text no longer needs to be the primary way of conveying programs. There are practical reasons text works best on screens, but if your coding environment is boundless then there’s no reason to believe you can’t do fancier things like direct manipulation of ASTs pretty easily. Imagine "grabbing" an AST node and attaching it to a different parent, all in space.
Beyond simple AST manipulation, the Vision Pro will probably enable Dynamicland-esque "programming spaces" where you manipulate objects in your virtual environment to construct programs.
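To make the AST idea concrete, here's what "grabbing a node and attaching it to a different parent" looks like programmatically; a spatial editor would presumably do something like this under the hood when you drag a node. A small sketch using Python's standard ast module (ast.unparse needs Python 3.9+); the example source is made up.

```python
import ast

source = """
def outer():
    x = 1

def target():
    pass
"""

tree = ast.parse(source)
outer, target = tree.body  # the two function definitions

# "Grab" the assignment node from outer() and attach it to target(),
# replacing target's placeholder `pass` statement.
grabbed = outer.body.pop(0)
outer.body.append(ast.Pass())   # keep outer() syntactically valid
target.body = [grabbed]

ast.fix_missing_locations(tree)
print(ast.unparse(tree))        # x = 1 now lives in target()
```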
This was one of my first thoughts when I tried HoloLens back in the day -- it would be great to watch the execution of my software, or visualize all the messages being passed between functions or different services on my network, and go all in on block-based programming in 3D (trees can be packed more densely in 3 dimensions, even more so in 4).
I was expressing this to a friend who was involved in VR in the 80s (VPL Research) and was simultaneously elated and disheartened to learn that they had the same idea! Googling around for it now, I suppose he was telling me about "Body Electric" or "Bounce", which looks like any other 2D dataflow language [0]. Maybe it was just ahead of its time. A patent about it [1] describes the problem of wires going everywhere and the need to give the user the option to hide any set of connections. I'd want to accomplish this by representing the whole connectome in 4D space, then shifting the projection into 3D to hide and reveal a subset of connections. Further visual filtering could be done with a depth-of-field focus and fog effect, controlling all these parameters to isolate the subset of the system you want to inspect.
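The 4D trick is easy to sketch: give each node an extra coordinate that tags its subsystem, rotate in a plane involving that axis, and drop the extra coordinate to get the 3D view; the residual coordinate can then drive the fog/depth-of-field fade. A minimal numpy sketch with made-up data, just to show the projection step:

```python
import numpy as np

# nodes: columns are (x, y, z, w); w tags which subsystem a node belongs to
nodes = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.5, 1.0],
    [1.5, 1.0, 1.0, 1.0],
])

def project(points4d, angle):
    """Rotate in the x-w plane, then split into 3D positions + residual w."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.eye(4)
    rot[0, 0], rot[0, 3] = c, -s
    rot[3, 0], rot[3, 3] = s, c
    rotated = points4d @ rot.T
    return rotated[:, :3], rotated[:, 3]

xyz, depth = project(nodes, 0.0)     # subsystem w=0 sits in the visible slice
opacity = np.exp(-np.abs(depth))     # fog: off-slice nodes and wires fade out
```

Sweeping the angle shifts which subset of connections lands near the visible slice and which recedes into the fog.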
Sounds like a good reason to try to do better. Unless you suppose the UI of programming is solved and everyone who wants control over a machine just needs to bite down and learn vim?
Grabbing AST nodes and dragging them around is never the bottleneck when I'm programming. Cut/paste is plenty efficient even in notepad and doesn't require gross arm movements. Feel free to try it, but I maintain doubt that a literal forest of code is going to be anything more than a gimmick. By all means, prove me wrong!
:D I'm with you on gross arm movements, I'm just trying to expand the code-viewing experience beyond "strings stacked on top of one another"
While all the VR goggles try to embed your body into virtual space (putting the graphics "around you"), I'm prototyping with a stereo microscope, with its nice heavy knobs for coarse and fine adjustment, such that you're looking into the virtual world from the outside. No illusions to pull off, no motion sickness, and nothing strapped to your face.
Agreed. I used Supabase for a fairly simple project and felt like I had to know a lot about Postgres to implement anything. If you’re building something yourself, I feel like Firebase is still the safer bet. I’m guessing Supabase really shines when you’re building a startup or have a team.
[1] https://cartesia.ai/sonic
[2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886