
I work at Cartesia, which operates a TTS API similar to Play [1]. I'd venture that our TTS model, Sonic, is probably SoTA for on-device use, but don't quote me on that. It's the same model that powers our API.

Sonic can run on a MacBook Pro. Our API sounds better, of course, since it runs the model on GPUs without any special tricks like quantization. But subjectively, the on-device version is good quality and real-time, and it retains all the capabilities of the larger model, such as voice cloning.
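
For a rough sense of what a trick like quantization looks like in practice, here's a minimal PyTorch sketch of post-training dynamic quantization. The toy model is purely illustrative, not Cartesia's actual pipeline:

    import torch

    # Stand-in for a real acoustic model; illustrative only.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 80),
    )

    # Store Linear weights as int8 and quantize activations on the fly,
    # shrinking memory and speeding up CPU inference on a laptop.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    print(quantized)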

Our co-founders demoed the on-device capabilities on the No Priors podcast [2], if you're interested in checking it out for yourself. (Caveat: it sounds quite a bit worse than it would in person today, since this was an early alpha, and it's a recording of the output from a MacBook Pro speaker.)

[1] https://cartesia.ai/sonic

[2] https://youtu.be/neQbqOhp8w0?si=2n1i432r5fDG2tPO&t=1886


Is your model really open source or did you misunderstand the question?


(Not the author, but I work in real-time voice.) WebSocket connections don't translate directly into GPU load, since they spend most of their time idle. So strictly speaking, you don't need a GPU per WebSocket, assuming your GPU infra is sufficiently decoupled from your user-facing API code.

That said, a GPU per generation (for some operational definition of "generation") isn't uncommon, but there's a standard bag of tricks, like GPU partitioning and batching, that you can use to maximize throughput.
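
To make the decoupling concrete, here's a minimal asyncio sketch of the batching idea. Every name here is hypothetical; it's a shape, not anyone's production code:

    import asyncio

    request_queue: asyncio.Queue = asyncio.Queue()
    MAX_BATCH = 8
    BATCH_WINDOW_S = 0.01  # wait up to 10 ms to fill a batch

    async def handle_websocket_message(text: str) -> str:
        # Called per WebSocket message; never touches the GPU directly.
        done = asyncio.get_running_loop().create_future()
        await request_queue.put((text, done))
        return await done  # resolved by the batch worker

    async def gpu_batch_worker():
        # One worker drains many mostly-idle connections into batched calls.
        while True:
            batch = [await request_queue.get()]  # block for the first item
            loop = asyncio.get_running_loop()
            deadline = loop.time() + BATCH_WINDOW_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(request_queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = run_tts_model([text for text, _ in batch])  # one GPU call
            for (_, done), out in zip(batch, outputs):
                done.set_result(out)

    def run_tts_model(texts):
        # Stand-in for the actual batched model invocation.
        return [f"<audio for {t!r}>" for t in texts]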


> that you can use to maximize throughput

Sometimes while degrading the experience, a little or a lot, thanks to possible "noisy neighbors". Worth keeping in mind that most things are trade-offs somehow :) This mostly matters for real-time use rather than batched/async stuff, of course.


As someone who's attended events run by Daily/Kwindla, I can guarantee that you’ll have fun and leave with your IP rights intact. :) (In fact, I don't even know that they're looking for talent and good ideas... the motivation for organizing these is usually to get people excited about what you're building and create a community you can share things with.)


Thanks for the shoutout! We're very excited about how this space is evolving and are working on new features and perf improvements to support experiences like this.


Once you start following people, the noise mostly disappears. I think the initial feed is just a way for users to bootstrap their follow list.


Zed has Copilot support. I've been using it, and it works pretty well.


> Most programming work is still text editing [...]

Text no longer needs to be the primary way of conveying programs. There are practical reasons text works best on screens, but if your coding environment is boundless, there's no reason you couldn't do fancier things like direct manipulation of ASTs. Imagine "grabbing" an AST node and attaching it to a different parent, all in space.
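
As a toy illustration of what "grabbing" a node could mean under the hood, here's re-parenting with Python's ast module (a hypothetical sketch, not a real spatial editor):

    import ast

    source = '''
    def greet():
        name = "world"
        print(f"hello {name}")

    def farewell():
        pass
    '''

    tree = ast.parse(source)
    greet, farewell = tree.body   # the two function definitions

    # "Grab" the print statement and "attach" it to the other function.
    node = greet.body.pop()       # detach from the old parent
    farewell.body = [node]        # replace the placeholder `pass` body

    ast.fix_missing_locations(tree)
    print(ast.unparse(tree))      # a real editor would also re-check bindings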

Beyond simple AST manipulation, the Vision Pro will probably enable Dynamicland-esque "programming spaces" where you manipulate objects in your virtual environment to construct programs.


This seems like a very literal interpretation of "spatial computing"; I don't think anyone will be physically manipulating ASTs with any regularity.


This sounds neat and all, but I can "create" much faster with my fingers.

I think it boils down to it still being a "programming language."

What we need for AR / VR is “programming gestures”

That way there's no syntax, just visual media you manipulate via gestures, which would get compiled to a binary that can be executed.
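
Something like this could bottom out in an ordinary compiler pipeline. A purely speculative sketch (every name here is invented) of a gesture vocabulary lowering to program-construction ops:

    from enum import Enum, auto

    class Gesture(Enum):
        PINCH = auto()   # select a node
        DRAG = auto()    # re-parent the selection
        CIRCLE = auto()  # wrap the selection in a loop
        TAP = auto()     # commit / evaluate

    def lower(gestures):
        # Translate a gesture stream into a toy op list (stand-in for an
        # AST that a backend could then compile to a binary).
        return [{"op": g.name.lower()} for g in gestures]

    print(lower([Gesture.PINCH, Gesture.DRAG, Gesture.TAP]))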


This was one of my first thoughts when I tried HoloLens back in the day -- it would be great to watch the execution of my software, or visualize all the messages being passed between functions or different services on my network, and go all in on block-based programming in 3D (trees can be packed more densely in 3 dimensions, even more so in 4).

I was expressing this to a friend who was involved in VR in the '80s (VPL Research) and was simultaneously elated and disheartened to learn that they'd had the same idea! Googling around for it now, I suppose he was telling me about "Body Electric" or "Bounce", which looks like any other 2D dataflow language [0]. Maybe it was just ahead of its time.

A patent about it [1] describes the problem of wires going everywhere and the need to give the user the option to hide any set of connections. I'd want to accomplish this by representing the whole connectome in 4D space, then shifting the projection into 3D to hide and reveal a subset of connections (rough sketch below). Further visual filtering could be done with a depth-of-field focus and fog effect, controlling all these parameters to isolate the subset of the system you want to inspect.

[0] http://www.art.net/~hopkins/Don/lang/bounce/SpaceSeedCircuit...

[1] https://patents.google.com/patent/US5588104 (bonus, figure 3 shows the dataglove as just a hand plugging into the PC)
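
For what it's worth, the projection trick is easy to prototype. A hedged numpy sketch (toy ring graph, invented parameters) of fading edges by their distance from a focal w-plane:

    import numpy as np

    rng = np.random.default_rng(0)
    nodes4d = rng.uniform(-1, 1, size=(20, 4))      # (x, y, z, w) per node
    edges = [(i, (i + 1) % 20) for i in range(20)]  # toy ring of connections

    def project(points4d, focal_w=0.0, aperture=0.5):
        # Drop w for position; visibility falls off with distance from
        # the focal w-plane -- a depth-of-field analog for connections.
        xyz = points4d[:, :3]
        visibility = np.exp(-((points4d[:, 3] - focal_w) / aperture) ** 2)
        return xyz, visibility

    xyz, vis = project(nodes4d, focal_w=0.2)
    for a, b in edges:
        alpha = min(vis[a], vis[b])  # edge opacity = its dimmer endpoint
        if alpha > 0.1:              # cull edges outside the "focus slab"
            print(f"edge {a}->{b} drawn at opacity {alpha:.2f}")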


Literal spaghetti code?

How do you deal with references? Like defining a variable and using it later in multiple places?


We already have plenty of these, and they suck at doing anything serious.


Sounds like a good reason to try to do better. Unless you suppose the UI of programming is solved, and everyone who wants control over a machine just needs to bite down and learn vim?


Grabbing AST nodes and dragging them around is never the bottleneck when I'm programming. Cut/paste is plenty efficient even in notepad and doesn't require gross arm movements. Feel free to try it, but I maintain doubt that a literal forest of code is going to be anything more than a gimmick. By all means, prove me wrong!


:D I'm with you on gross arm movements; I'm just trying to expand the code-viewing experience beyond "strings stacked on top of one another".

While all the VR goggles try to embed your body into virtual space (putting the graphics "around you"), I'm prototyping with a stereo microscope, with its nice heavy knobs for coarse and fine adjustment, so that you're looking into the virtual world from the outside. No illusions to pull off, no motion sickness, and nothing strapped to your face.


Ha, that turned out to be a pretty good prediction. Wonder what else can be gleaned from features in standard products.


While OP could have worded their comment better, this is exactly the kind of thing that Swartz explicitly aligned himself against [1].

It would be incredibly surprising for Swartz to have supported the kind of bait-and-switch tactics Reddit is employing here.

[1]: https://en.wikipedia.org/wiki/Guerilla_Open_Access_Manifesto


Agreed. I used Supabase for a fairly simple project and felt like I had to know a lot about Postgres to implement anything. If you’re building something yourself, I feel like Firebase is still the safer bet. I’m guessing Supabase really shines when you’re building a startup or have a team.

