Hacker News
Show HN: Supertone Shift – AI powered Real-time voice changer (supertone.ai)
157 points by chaeeunlee9611 7 months ago | 98 comments
Supertone's Shift offers real-time voice changing technology. It lets users immediately switch to any selected voice. Just pick a voice and begin speaking. Shift is suited for VTubers, content creators, and gamers, as well as anyone who wishes to accurately express their chosen persona's voice. Try out Supertone Shift now. >> https://product.supertone.ai/shift



Very interesting!

I would like some clarity on the Terms of Service clause 4:

> The content created using Supertone Shift remains your property. However, by using our Services, you grant Supertone a worldwide, non-exclusive, royalty-free license to use, reproduce, adapt, and display content solely for the purpose of operating and improving Supertone Shift. This license does not grant Supertone any rights to sell or distribute your content.

Does Supertone Shift need the user content in order to further improve the product during the beta period?

Or does it need the user content in normal operation (for example, running the conversion on remote servers vs local processing)?

I can see some hesitation from people if you're recording everything they say, and keeping that recording for an indefinite period of time.

I can appreciate that there may be a problem enforcing a "Don't use our product for evil" clause if you can't review usage.

The challenge here seems overwhelming.


The phrasing is pretty standard; the important part is the middle sentence. Often it includes irrevocable, transferable and sublicensable as well.

That being said, I hate the "remains your property" part. It's just fluff that changes nothing, but distracts from the following sentence.


The reason why this is standard is because functionally, anything that receives data from a user, hosts it, and transmits it to third parties is engaging in distribution of copyrighted content. Without a grant of license, pretty much every message board, social media platform, or any website or internet-based application that does anything with user-generated data could be exposed to copyright liability. You may note that this very site's legal declarations page includes an identical clause.

"Remains your property" is not fluff at all, and explicitly disclaims any ownership of rights associated with content you post, and equivalently indemnifies users against any liability for re-posting or re-using content they posted here, which they'd potentially be exposed to if they were assigning copyright to the hosting platform rather than just granting a license.


Stating you own it, and then licensing it away in the sentence after to give yourself irrevocable rights seems duplicitous.

Neat looking service though.


No, it's definitely not duplicitous, and is standard practice across the industry.

It boils down to "we're not claiming to own the rights to your content -- you still retain those -- but we need a grant of permission from you to ensure that we can publish it on our own site without facing possible liability".


Or.. it's industry standard duplicity.

This isn't specifically an issue with this service alone (I like it), but the approach to UGC (user generated content) in general.

What's unclear is what rights or license remain if someone deletes their account and content. It's trivial to clarify, so the omission is a decision.

Use of the word "improving" is pretty general and broad, and can be about the priorities of the vendor over the customer.

What's missing is the clause that closes the loop and doesn't give them a lifetime license.

"you grant Supertone a worldwide, non-exclusive, royalty-free license to use, reproduce, adapt, and display content solely for the purpose of operating and improving Supertone Shift."


It's open-ended, yes, but that does not imply any duplicity. Platform providers have an incentive to minimize their liability exposure, so they're not going to volunteer to add language that applies additional qualifications and exceptions, but that does not in itself imply that any deception or abuse is going on.

It's certainly proper to call out examples of actual bad behavior by specific organizations, but not so much to treat defensive boilerplate as though it is itself bad behavior.


"Remains your property" is not fluff at all, and explicitly disclaims any ownership of rights associated with content you post,

I think that should not be necessary to do explicitly because it is the default state. At any rate it is weird to talk about IP assignment in the very same paragraph as licensing.


that is totally dependent on the jurisdictions involved, and modified by the agreement that the user accepts to start the service(s) no?


It's part of the agreement that the user accepts to use the service, and there are frameworks in place that already implement cross-jurisdictional reciprocity for things like copyright licenses.


Looks like Facebook's ToS,

we may need your data for some unspecified purpose ("AI model training") that we can't even dream of right now, so we'll just take all the rights


There are dozens of other products in this category, including completely open source ones you can fine tune.

Commercial applications like Voice.Ai and Koe are real time and have celebrity and anime voices respectively.

The RVC ecosystem on GitHub has dozens of different real time open source voice changers. I haven't kept up with the SOTA, but they're incredible, fine tunable, and 100% local.

https://voice.ai

https://koe.ai

https://m.youtube.com/watch?v=zkaBK5erB2c


I've tried making this exact product using all of these services, including using the GitHub repo Koe is based on.

They all use like 50% of my CPU to get real time. I was able to get actual low latency with Koe, but still massive CPU usage. And there's no community of models for it either.

Perhaps someone who really knows what they're doing could optimize these open source models, but it's not me.


> I've tried making this exact product using all of these services, including using the GitHub repo Koe is based on.

Could you share the repo?


You can find Koe's model if you search for "koe llvc".

I just made some modifications to run it as a stream from a microphone. I am trying to develop my own voice changer (https://voicechange.io) so I don't want to share the source code for that.


I'm assuming it's this one which was posted from their Twitter: https://github.com/KoeAI/llvc


You don't need to look far to understand that those terms are standard, and for 'standard' read non-binding, or broad. It doesn't matter what they 'say' here, because you will only find out about Supertone abusing these 'terms' if someone at Supertone lets you know. Meanwhile your voice is siphoned off and used in any way their friends see fit, and no terms laid out here will be broken. As per other replies about standard software terms: for 'standard', read duplicitous.


It could at least have a time limit


Why does the Mac installer require admin rights and a restart? Giving admin rights to an installer requires trust in the vendor. Supertone Shift is just a newborn. I cancelled the installation because of that.

I would love to test the technology without the risk of damaging my computer!


I use the great, free, "Suspicious Package" app [0] to inspect installers like these.

In fact, it was Supertone Shift's installer that prodded me to seek it out (I happened to find and install Shift a couple of weeks ago).

In this case, it needs admin permissions to install to `/Library/Application Support` as well as `/Library/Audio`.

It needs to restart in order for the HAL driver to be loaded (this provides the virtual audio interface for using the app with Teams, Zoom, etc.)

The preinstall/postinstall scripts simply handle the app's directory in Application Support.

I decided it was safe enough, and had some fun playing with it. It contacts what it claims are licensing servers (when it starts), and won't start without it. It wanted to keep contacting those servers constantly, but blocking its network access via Little Snitch didn't prevent it from functioning. The network traffic was in the single-digit kilobyte range, so I felt reasonably confident no audio data was being looted.

[0] https://mothersruin.com/software/SuspiciousPackage/


Thanks for this. I was very eager to try it out, but this is always a deal breaker.


This seems really cool and I can see some great use cases. But the marketing is very odd to me. It says it will let me express myself in a voice that is truly my own…but I can already do that with my natural voice. That seems more likely to be unique than what I would get by adjusting it in software.


I guess the wording is awkward, but as a trans person, I still resonate with it. I'm acutely aware it's not going to be "my voice", but neither is the one I have right now.


It's funny to me that we just had a big front-page thread full of HN users questioning the value of diversity, and then this thread where people struggle to figure out the obvious trans market for voice-changing software.


Non-verbal people might also be interested in such things


Thanks for the explanation. This is definitely something I hadn't considered.


Outside of the trans use-case mentioned here, I could imagine some women gamers getting value out of this too. You kinda need voice comms to play some games properly, and not wanting to reveal yourself as a woman online, especially over voip, is completely reasonable. Because gamers are terrible. Something like this could make hiding that trivial, assuming the latency is accurate (would need to be very fast in some games)


Are gamers really more toxic towards women than men? I feel like switching gender will just switch one kind of toxicity to another.


Yes, unfortunately. It's not a rigorous scientific study, but one recent experiment that I think is illuminating:

https://esports.gg/news/valorant/male-valorant-pros-face-sex...

> The experiment showed just how drastically a woman’s voice can impact the score of a player. One male pro played with his normal voice and earned 15 kills with two deaths. When playing Valorant with a female voice, his score almost completely inversed with three kills and 16 deaths after the other players refused to cooperate in the game.

> The pro players also endured being mocked and insulted with sexist slurs. Many will recognize this as the average experience of women in gaming. During their games, one male teammate told a pro to go back to the kitchen. Likewise, another teammate told the female-voiced pro that “all women should just die.”


Less about toxicity, more about the barrage of creepy messages and cyber stalking. To be clear, I’m a man saying these things. I think it’s neat we’re able to do things like this now and it’s neat to think of what it can do to solve old problems. This is just one more hypothetical.


It's not a perfect shield against online toxicity that's for sure, but the online voice-enabled gaming world is not kind to women:

https://www.prnewswire.com/news-releases/online-gaming-in-20...


It's certainly not kind to men either, and your source does not make any comparisons between men and women in this regard.

The sibling comment's source makes a more compelling case for why women receive more abuse than men.


> 30% of the 489 women polled - almost one in three - said they had experienced abuse and toxicity when gaming online

> Of this 30%, the majority (72%) said the abuse was misogynistic. By comparison, none of the male respondents reported any gender-based discrimination or abuse
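
For scale, the quoted percentages can be turned into approximate head counts. This is a minimal sketch, assuming the 72% applies to the 30% subgroup as the quote states; the variable names and the rounding are illustrative and not from the survey itself.

```python
# Figures quoted from the survey summary above.
women_polled = 489
share_reporting_abuse = 0.30   # "30% of the 489 women polled"
share_misogynistic = 0.72      # "Of this 30%, the majority (72%)"

# Approximate head counts implied by those percentages.
reported_abuse = women_polled * share_reporting_abuse
misogynistic = reported_abuse * share_misogynistic

# Share of all women polled who reported misogynistic abuse.
share_of_all = share_reporting_abuse * share_misogynistic

print(f"reported abuse: about {reported_abuse:.0f}")
print(f"misogynistic abuse: about {misogynistic:.0f}")
print(f"share of all women polled: about {share_of_all:.1%}")
```

That works out to roughly 147 of the 489 women reporting abuse, and roughly 106 (about 21.6% of all women polled) describing it as misogynistic. It says nothing about how many men reported abuse overall, which the quoted passage does not give.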


Ok, I missed that. So they did compare men in the specific category of "gender-based discrimination". Now if you look up this comment thread, you will see that the question I posed was "Are gamers really more toxic towards women than men? I feel like switching gender will just switch one kind of toxicity to another." So if you wanted to actually answer this question, you would compare how much abuse and toxicity women receive compared to men. This is not something that the linked article tries to do, at all. "30% of the 489 women polled - almost one in three - said they had experienced abuse and toxicity when gaming online" - so is that more than men, or is that less than men?


Yeah, we used to play CS:GO quite a bit and the one girl we knew who played was both very good and constantly harassed. When we 4-queued, we'd just boot the other guy who was being annoying. The rest of the time it was mute and shun. I was GNM at my highest, she was GE but had to play on an alt to play with us (too high rank dispersion will fail to queue for matchmaking). Once or twice when we got higher (MG or so) there was less of it. I got maybe a few comments about competency (rightfully I suppose) but she got them all the time and she was way better.

To help clarify: The Gold Nova Master rank is like 60th percentile. GE is 99th.


That particular line is definitely directed towards people with gender identity issues.


Salesperson: You can test drive any car on the lot!

You: Why? I already own a 2002 Ford Escape...

I'm not trying to make fun of you, I think you actually have a unique and impressive perspective! I've always hated hearing my voice on answering machines, so if I could choose any voice I'd choose Chris Cornell or Morgan Freeman.


Pro tip: Some people do not consider their natural voice 'their' voice.


If it was available as a VST plugin for DAWs it would be even more useful than standalone software. From skimming through the website it seems that Supertone already makes a VST plugin, so it may be a matter of time before Shift becomes a VST too.


Self plug, but I've been developing a local AI voice changing VST [1] (bring your own RVC models, or use builtins). It works in DAWs in realtime on modern macs.

[1] https://audio.sunflower.industries


This looks cool and I've downloaded it. Clicking on the "free" tier on the subscription page brings you into Stripe checkout for the $6 tier, FYI.


Would it be possible to embed a watermark in the generated audio? Many people will use voice changing tech for honest purposes, but there will always be those acting to ruin it for the rest of us. There are just too many scenarios where faking your voice confers an illicit benefit.

I know watermarks are never foolproof, but they may deter casual misuse.


Curious Question: Given the low latency, does it run the computation on device or over the network? If on device, are there minimum CPU requirements?


Very interested in this answer! I'd really like to see it on the website for any AI I'm considering. It's an entirely different proposition as to whether you're getting a utility or a service.


I can see this being interesting for gamers and more whimsical pursuits, but I'm more curious about neural speech synthesis for both normal speech and singing--the first because there is a pretty strong demand for automated narration of training videos, and the second because of my music hobby--other than vocaloids and a few niche DAWs, I haven't found any nice Open Source tooling for the latter (the former I can mostly do with XTTSv2).


From what I found, XTTSv2 is based on the Coqui Public Model License, which explicitly disallows commercial usage: "This license allows only non-commercial use of a machine learning model and its outputs."

So, from what I understand, I cannot use it and then upload the training video to Youtube. Or can I?


I guess if it is demonetized it should be ok? Or maybe not if your other content or activity is commercial, as even if the video in itself doesn't make money it would indirectly promote your other commercial activity.

Interesting legal problem.


Seems like we're getting closer and closer to Star Trek's universal translator


Fun looking product. Sad to see no Linux support (yet?).

Would you be interested in any help porting/maintaining a Linux release?


This looks like an amazing tool for indie game developers. Even musicians could find this an amazing help to add some unique tones.


I was trying to make this myself earlier but every single AI model I found used something like 50% of my CPU or GPU.

Any idea how this is possible? Voicemod does something similar and I couldn't figure it out. Is it actually AI or is this just shifting pitch/reverb/etc


Weren't we able to do this before AI? I'm not sure I get what AI is bringing to the table/value-adding for this particular technology, except marketing hype.


Wasn't that very basic pitch shifting only?


It was a bit more complex than that, but that's more or less what this software, which claims the benefits of AI, is doing. It's not, near as I can tell, changing inflection or tone or doing anything other than changing the pitch and maybe adding some frequencies.

It's not even producing natural tones. None of the voices they're demoing sound real in any way. It's a toy, no more, no less.
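
For readers unfamiliar with what "very basic pitch shifting" means, here is a minimal sketch of the naive pre-AI approach: resampling a signal so it plays back faster or slower. It is not Supertone's method (the thread does not establish what Shift does internally), and the function names, the 8000 Hz sample rate and the test tone are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate for this toy example


def sine_wave(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Generate a pure test tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def naive_pitch_shift(samples, ratio):
    """Resample by linear interpolation.

    ratio > 1 raises the pitch and shortens the clip; ratio < 1 lowers
    the pitch and lengthens it. Timbre and inflection are not modelled.
    """
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def estimate_freq(samples, sample_rate=SAMPLE_RATE):
    """Rough frequency estimate from upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


tone = sine_wave(220, 1.0)
shifted = naive_pitch_shift(tone, 2.0)  # one octave up, half as long
print(len(tone), len(shifted))
print(estimate_freq(tone), estimate_freq(shifted))
```

The shifted clip is an octave higher but also half as long, and everything about the voice moves together. That is the limitation that separates this kind of shifting from converting one speaker's voice into another's.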


> The installation has completed. Please restart your Mac.

Seriously?


Not only, it's not possible to quit the installer. Had to kill it and then look for changes done to the system. Hope I've been able to find them all but really upsetting.


So, Supertone Shift creators: this is really good! The first time you hear your own voice as a K-pop star or a nymph it’s genuinely startling.

Just improve the installer so I don’t feel like I’ve been scammed by malware!


Same question. What did I just install that required a restart?


Virtual microphone driver, perhaps?


It's worth it!


Would have liked to know it before installing.


I wonder if this could be applied to educational videos to make the material seem less challenging for children.


Except for purely non lucrative entertainment use cases with a very high novelty factor, I am struggling to see productive use cases for all these AI applications that don't involve some form of deception or at best disingenuous marketing.


As someone who makes indie games as a passion and creative outlet, tools like these drastically expand my creative possibilities.


There's a balance to the ecosystem, though. People in the creative fields have always had to rely on each other to fill in gaps in skill because it's mutually beneficial. With things like this voice changer, one has to think what opportunities are being taken away from others compared to what opportunities the technology affords oneself. So far we've been screwing that balance up pretty egregiously with these AI tools where one implementation cuts the employment prospects and creative participation of a dozen people.


I think this is huge for new content creators that are not native speakers to get rid of the accent. Also if it enables multiple people to sound the same then you can have a YouTube channel with a larger team but only one voice


I don't believe it modifies the accent. I noticed I could hear his Asian accent coming through every character, so it seems to just modify the voice but not the intonation.


Agreed, at the very least it doesn't seem to change _my_ accent, just things like color, tone, pitch, etc.


I can think of a few applications of this technology, although some may fall into the deception category, albeit harmless in my view:

- overcoming social anxiety in voice or online calls. It doesn’t take very many bullying incidents during childhood to become convinced you have a horrible or weird voice. I can see this being used as a useful tool to make people feel more comfortable by having a different voice

- amateur interactive fiction development. Having your characters have a real voice in a game in response to the player's commands is a real need, and being able to record it yourself and be a different character would be a huge enabler of creating something for a solo developer.

- internal HR videos/podcasts. Creating these can be very expensive; not needing different people to read out the dialogue could significantly reduce the effort of recording and producing them

- another instrument for music creators. Auto tune is a very common tool for music production for all skill levels, and this could be applied in a very similar way

It no doubt can be used for disingenuous purposes, any technology can. But these can be real life improving tools enabling many people to do things they never thought possible.

The idea of participating in Q&A session in a webinar would be far too confronting and inconceivable for many people, but to be able to do it semi-anonymously with a different voice would eliminate much of the anxiety preventing them


I also can't help thinking of the "Melanie speaks" episode of 99% invisible [1].

Of course this only works for your "online persona", but still the idea of impacting how you are perceived by working on your voice... is a thing.

[1] https://99percentinvisible.org/episode/melanie-speaks/


This is huge for indie game developers! They can voice every line of dialogue for every character themselves (or with just 1 professional voice actor).

Text-to-speech AI voice generators exist, but you don't have fine control over the emotion/expressiveness/intonation of the lines like you do with this approach.


Would imagine the same sort of reasons people do v tubing in general, such as safety and anonymity.


If they generate good quality then I suppose voice acting could have good use of it.


Nice to see a venture from South Korea!


This is awesome. Very futuristic


more dystopian. yet another "contribution" of AI for destroying the society via misinformation.


Speaking generally, there's an undermarketed positive privacy aspect to such voice changers, in helping with both protecting a user's identity against data scraping and doxxing. Additionally, like one other comment touched on, some people have strong accents that make communication in videos challenging (and can be a turn off for audiences and prejudice initial impressions).

Though by using an online service approach it means providing one's real voice to a service that may be using it for further training. Users have to make the call whether they feel they're good stewards.


My accent seems to translate very well through the app, tho. The tool "only" changes color, pitch, tone, etc. not _how_ I say stuff, i.e. pronunciation, choice of words, ...


All technology can be used for good and evil

If humanity can figure out how to make machines think, maybe we should also figure out how to stop doing evil to each other


Yes, but we can't stop it.


We've become slaves to the technology and are doomed to watch helplessly as it destroys us?


Not quite. We've become slaves to the technology, and we can opt to have some fun as the inevitable is happening.

On a personal level, I don't think one can do much against the zeitgeist. But they can decide what part they play in it.


Long time ago. Since the discovery of electricity, basically.


It's a pity to think that after everything we've achieved we still haven't mastered self control.


Since the invention of the pointy stick already.


Congrats! This is amazing work


All this technology is leading to a world where we can present second-life/alternative identities cohesively online. I wonder if this is going to cause a global decline in the ability for people to express themselves, since it is now so easy to create an identity online that is different than your real-life identity.

I think it's rather sad. Yes, there are some fringe use-cases perhaps but I think this is the wrong direction for humanity. We should find more value in what we already have rather than inventing arbitrary things like this to hide away from real acceptance of ourselves.


It will first lead to a world where fake videos of celebrities will be used to scam you, and your own voice will be used to scam your relatives. Both of those are happening today.

Ironically, this will lead to a world where we need to use these fake personas online to not have our lives messed with offline.

I don’t fully agree with your first paragraph, but I do agree with the second one.


> and your own voice will be used to scam your relatives. Both of those are happening today.

I can't really see it becoming common for cold-calls that pretend to be someone the victim knows (like the terrifying ransom calls), since the operations work at a huge scale expecting most people to not even pick up a "scam likely" call. Even given free and instant model tuning, just having to find voice clips of the person prior to each unanswered automated call seems like it would tank the quantity they're able to make.

I imagine there will be plenty of unevidenced claims that this is what scammers did to them though. Victims have always said "it sounded exactly like him/her", and from there it's more comforting for someone to conclude they must've been fooled by a sophisticated attack rather than something simple.

For more targeted phishing, like pretending to be a company's CEO and phoning employees to get access, I could definitely see it being used. I think we're probably going to have to move "person sounds like boss over the phone" from "plausible to fake" to "trivial to fake".


> just having to find voice clips of the person

You can clone a voice from a clip which is under 5 seconds.

https://www.pcmag.com/news/microsofts-ai-program-can-clone-y...

All you need is a short spam call and you’re done before you even realise anything is happening. Or grab some video out of Facebook since you’ll have the family connections right there for the taking.


There's no human in the process (to be trawling through Facebook pages looking for videos where relatives speak) prior to the victim picking up - and even then often not until the victim has replied to some initial hook. The huge number of phones being automatically called doesn't permit it, as far as I can tell.


[flagged]


I make videos, it might be handy to be able to ‘be’ someone else. I can for sure see a use for this.


...needing an excuse to get access to microphones to solve the problem stated on their own front page (https://supertone.ai/) as We need voice source material to train the AI.


Unless the app is directly lying, it says "For the best performance of Shift we need to listen to your voice for 10 seconds. We do not collect your voice data from this app."

I haven't put a network sniffer or anything on it yet though. Just wanted to take a peek at the UI


Funny to see myself being downvoted for a harmless call for a reason, so, few more comments:

The fact that someone did something and put substantial effort into it is not a good enough reason to justify said effort (other than the benefits of learning) and the product that was created. The world is full of things which actually made it a worse place.

Another comment is actually a meta-comment and might be shocking to some people here:

downvoting is not a good method of making someone stop making statements perceived by some as uncomfortable. In fact, there is nothing wrong with earning points first and then burning them by posting comments that are feared by certain individuals to the level that they "must" be downvoted...


The reason is fun, and no justification is needed. Maybe you've spent too long looking at the world in a purely progress-driven way. EDIT: I've been there before, but imo the point of progress is to have fun in the end anyway


No, I am precisely not looking at the world in a purely progress-driven way. I think many creators do - they think they bring a piece of progress to the world, where in fact they might bring regress - in this context, a service which is enabling people to sound like someone else, adding to the pile of confusion and fake content.


> Funny to see myself being downvoted for a harmless call for a reason

You made a critical snarky dismissal comment, what did you expect?

Might I suggest reviewing: https://news.ycombinator.com/newsguidelines.html

Here’s a couple to specifically call out:

1. Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.

2. Please don't comment about the voting on comments. It never does any good, and it makes boring reading.



