Is this akin to karaoke? (The word never appears on the page.)
> Apple today announced Apple Music Sing, an exciting new feature that allows users to sing along to their favorite songs with adjustable vocals¹ and real-time lyrics.
Users can already "sing along to their favorite songs" -- this feels like a wasted opportunity in the first sentence to explain the differentiator. It feels like they're misusing/misunderstanding what "sing along" means, and the real feature enhancement here is that users now have the ability to suppress vocals in songs. (Which is kinda cool.)
I get that this is a press release, but "show don't tell" would go a long way in announcements like this, even (especially?) if it's not available yet.
The at-home karaoke experience is pretty horrible. Companies put a high price tag on access to limited catalogs with hit and miss quality. Oftentimes, the musical bed is low quality midi music and the lyrics may or may not be timed properly.
I'm thrilled to see this announcement and will likely buy my first Apple TV because of it and switch my Pandora subscription to Apple Music.
As an exercise for the reader: see if you can count the barriers-to-entry for the average person in this suggestion, and then contrast and compare to Apple's offering.
Absolutely, which is why I specified “average person”. I think most anyone on HN could figure it out. But that’s not what Apple is shooting for, I don’t think.
At the end of the day, I just didn’t want to come right out and say “rsync, ftp, something something Dropbox”. :-)
I mean, I'm the one suggesting it and it's still a pain for me to set up after a few drinks. We used to have the microphones going into the computer, with that being the whole output, but now we literally just plug the mics direct from the receiver to the speakers because a macOS update introduced latency.
This is what my wife and I do, it’s pretty flawless. YouTube + Mic & mixer to a separate speaker. It’s the only way, and YouTube has the best karaoke.
Routing the mic through your AV setup _will_ result in a delay which’ll make for a horrible experience. So don’t bother. Get a cheap mixer and a decentish Shure mic, a single active studio monitor and you’re away.
You don't need an ad blocker if you just pay for YouTube premium. Often times I see people complaining about ads there, but I think ads are necessary for users who aren't paying for the product.
Wireless mics work fine too, you just need to make sure it's going through as few devices as possible. So I plug the mic receiver box straight into two speakers, one for each mic.
Even to an introvert like me, the cheesiness and being corny is part of what makes karaoke so good, it has an automatic atmosphere where you are allowed to do badly because we can all laugh at ourselves.
Polish and perfection make this less appealing to me. They make it feel up-tight and a bit snobby.
for me, I can't sing along if it's not close to the original because my brain remembers every tick of the original and wants to match the key of the original as well.
Also, I had a friend who got addicted to Smule, which is an online app for singing duets with random strangers. It uses auto-tune so that anyone can enjoy themselves. I never tried it but he said it was amazing and he got a ton of joy from it.
Me, I like the idea that I can show off my amazing singing skills, so if this auto-tunes everyone then that will go away. On the other hand, it's always been frustrating that box-style karaoke places rarely have a large selection of western music (most of them are targeting the subset of their local market that would actually go to a box-style karaoke place). My hope is that with this more of them will support just using the room as a place to use this service, and then my friends and I can sing from a larger variety of songs.
> I can't sing along if it's not close to the original because my brain remembers every tick of the original and wants to match the key of the original as well.
I don't know if this helps at all, but you probably want to learn to sing in a different octave, not a different key. This means shifting your voice up and down from the original in multiples of eight notes. You can sing in a different octave without changing the background music and it'll sound completely normal, and have the same feel as the original song.
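To make the octave-versus-key distinction concrete, here is a minimal sketch. It assumes standard equal temperament with A4 = 440 Hz and MIDI note numbers, and the function names are made up for illustration. An octave is twelve semitones, so moving by an octave doubles or halves the frequency and keeps the note name, while moving by any other interval lands on a different note, which is what changing key does.

```python
A4_HZ = 440.0   # assumed tuning reference (concert pitch)
A4_MIDI = 69    # MIDI note number of A4

def note_to_hz(midi_note):
    """Frequency of a MIDI note in twelve-tone equal temperament."""
    return A4_HZ * 2 ** ((midi_note - A4_MIDI) / 12)

def shift_octaves(midi_note, octaves):
    """Move a note up (positive) or down (negative) by whole octaves."""
    return midi_note + 12 * octaves

def pitch_class(midi_note):
    """Note name index (0 = C ... 11 = B), ignoring the octave."""
    return midi_note % 12

# Singing A4 an octave lower: half the frequency, still an A.
low_a = shift_octaves(A4_MIDI, -1)
print(note_to_hz(A4_MIDI), note_to_hz(low_a))
print(pitch_class(A4_MIDI) == pitch_class(low_a))
```

Shifting by, say, three semitones instead would change the pitch class, so the backing track would have to be transposed to match.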
> And if Apple knows anything is how to market / rebrand the same old shit and sell it to the masses at really marked up prices for profit.
They're also pretty good at taking a product that is about 75% of the way to being great, and pushing it that last bit. Hell, that's arguably why they are so successful. They're not really inventors, they are perfectors.
It's probably that if they used the word カラオケ, it might create an inherent expectation of what the service would feature, like mic echo, scoring, ranking, etc., which is what most major karaoke apps feature, even on mobile platforms.
Disagree. The reason is that they would get laughed out of court if they tried to trademark karaoke, whereas now they own one of the most common words in the English language. If you had or were planning to make an app called 'Sing along' or 'Sing with me' or suchlike, get ready for a scary legal letter from Apple's trademark lawyers.
IANAL but Karaoke has been trademarked in different decades in different countries. Most of those trademarks are considered "Abandoned" according to the few databases I skimmed in a brief search, but that covered only a couple of countries.
I think most US courts would see "karaoke" as a generic word today, but I don't know about courts outside of the US. Karaoke is also still seen as a part of the "trade dress" of a couple of specific US corporations, and even if they can't entirely protect that trademark, they are still allowed to fight for it. At least a couple seem to still exist primarily as trolls for lawsuits today.
According to the toy aisle of US stores, "Sing" (and "Sing Along") is the only real generic English term in use, and very few (no?) current toys use the word "Karaoke", so I'm guessing those trolls are still winning in the US, for now. (Again, I don't know about the rest of the world, and toy branding is an anecdotal source at best, but still an interesting proxy view in how much large companies assume they will be sued.)
I spent a couple years on the YouTube Music team, adjacent to the folks who handled the nightmare that is music rights and licensing.
It was explained to me that the reason YouTube didn't offer a karaoke feature -- despite having licenses to a lot of lyrics -- is that karaoke is considered a separate license.
Even if you license both the recording and the lyrics, combining them into a karaoke feature isn't on the table by default.
I personally have no idea how accurate that is, but this was the scoop I got from those in a position of authority.
If it is indeed true, then perhaps that is a factor in Apple's decision not to use that word.
This sort of ultra-divided and litigious licensing rights regime is, I suspect, why music hasn't yet had a GPT-3/DALL-E style machine learning breakthrough that has captured the public's imagination.
Given the monumental size of fastidiously annotated music corpuses available, the quite diligent and far more agreed upon systematisation of various forms of scientific description of musical forms, and the arguably simpler space than some of the more recent advances like 3d object generation and the sophisticated artistic output from the latest 2d image generating models... seems rather odd music has no wildly popular machine learning model, something i could ask "give me 5 minutes of synth-wave by Mozart" or "Prodigy's Firestarter, but without the lyrics"...
... My suspicion is that the entire endeavour is tortured by licensing and at every turn must avoid ever sounding like music that could be owned by someone else, and as a consequence cannot be "popular" as it must be crippled and limited, built to at all costs avoid ever sounding like Taylor Swift, Queen, Aerosmith, et al.
For prior art, see what happened to Aereo. A little-known fact is that YouTube TV started with the exact same strategy. But Google, of course, had more money and lawyers to get over the hump.
> Adjustable vocals: Users now have control over a song’s vocal levels. They can sing with the original artist vocals, take the lead, or mix it up on millions of songs in the Apple Music catalog.
Everyone I know would call that karaoke. And I assume will when telling their friends about this new feature.
If I search for Apple Sing on google news, the first 5 articles all call it "karaoke-like" or just "a karaoke feature".
Whatever reason they have for not saying the word karaoke, I don't think it's that it's not what most people consider karaoke.
When I think karaoke, I think microphone. It appears this makes no use of a microphone or captures the singer's vocals in any way, it just isolates the instrumental tracks and eliminates or reduces the vocal track.
A microphone (and ideally having your own vocals mixed with the audio output) is the core feature of karaoke, not playing an instrumental version of the song.
And yet most media coverage I can find of the Apple Music Sing release uses the word karaoke to describe it, thinking that will be clear to their readers. I would feel safe saying your opinion is a minority one.
But this has become a pretty silly "debate".
Me, I'm still curious why Apple did not use the word karaoke, and do not myself think it's because they don't think people will consider it karaoke, but I understand you do, cool.
Singing with something doesn't imply anything about capture. It's just saying that I can sing along to songs without the original vocal while I'm washing the dishes in the kitchen or whatever.
Some "karaoke versions" of songs you can find are literally the same recording with the audio unmixed. Carefully listening can pick up some of the vocals from the other microphones.
Karaoke is where some other artist performs and records the song, without vocals, in the style of whoever popularized the song.
This Apple feature seems to be where the same popular recording gets used, which has vocals, and then some magic processing reduces the vocals.
There has always been software to sorta-kinda suppress vocals from existing recordings, but it usually ends up sounding like trash (artifacts of all sorts) compared to just having a karaoke band re-perform the song (although the talent and production quality can be dubious with the latter). Presumably the innovation here is not having vocal suppression sound so bad.
Aside: and then there are popular artists who release an instrumental version, or stems from which an instrumental version can be trivially created, but this is so extremely rare as to not even be on-topic for a conversation about broad availability.
> Karaoke is where some other artist performs and records the song, without vocals, in the style of whoever popularized the song.
No, this just isn't what the word means, either popularly or in technical usage. I can't find any dictionary that restricts karaoke to covers, and if anything the dictionary emphasizes non-covers:
> an act of singing along to a music video, especially one from which the original vocals have been electronically eliminated.
It's true that in practice most karaoke audio tracks are covers for financial reasons, but that doesn't change what the word means. Most tables have four legs, but six-legged tables are still tables.
I'm also coming at it from a Western perspective, and I agree that in the US karaoke audio tracks are usually covers. But what is typical is not the same as what the word means. Tables are typically not made of gold, because that's so expensive, but if you show someone a solid-gold table they will not be unsure whether it's a table. The defining features of tables have to do with form and function, not the chemical composition. (In fact, we can even go to the extreme of considering a hypothetical single-piece diamond table, which presumably has never existed. If one were created, everyone would instantly recognize it as a table.)
See the many examples (in the West) of tools that remove vocals from audio tracks; they are ubiquitously described as useful for creating karaoke audio tracks:
Is it? I've never taken karaoke to mean covers. Often it is the original soundtrack with the vocals suppressed (sometimes well via remastering and sometimes poorly with an equalizer).
To me karaoke is just singing to music, which is exactly what this is. I found it very strange that they avoided that word.
> To me karaoke is just singing to music, which is exactly what this is. I found it very strange that they avoided that word.
Apple's footnote disclaimer says "does not fully remove vocals" and "kara" means "empty/void" so it kind of makes sense to avoid using such an absolute term to describe something that isn't.
> Often it is the original soundtrack with the vocals suppressed (sometimes well via remastering and sometimes poorly with an equalizer).
Properly licensed karaoke tracks are covers, because licensing custom remixes of original tracks is prohibitive. There are sites where you can buy stems/backing tracks of any song you can think of, which are reproductions by talented musicians.
That feels like an accidental, not essential, property of karaoke. I doubt anyone who enjoys karaoke would reject a track recorded by the original artist because it wasn't a cover by someone they'd never heard of.
> That feels like an accidental, not essential, property of karaoke.
Regardless of whether you believe it's an "essential property", that karaoke music often sounds cheesy is one of several reasons Apple wouldn't want Apple Music Sing to be associated with it.
That's not an essential component of Karaoke though. I live in Japan where Karaoke is extremely popular, and in the more higher-end karaoke systems, you can pick the original music for popular songs (or sometimes you can even pick live performances with the video of the artist on stage and everything)
I see where you're going, but I disagree that "without vocals" would exist anywhere in the definition of "cover." It's all a simple Venn diagram, though. There are covers, and there are karaoke versions, and Western karaoke is usually the intersection.
Sound separation is incredible lately, even the free options. I'm sure Apple tested their model against songs likely to be sung. I'm sure it sounds great.
The real surprise is that the artists are ok with this, they must be getting compensated.
This is not remotely true. There are plenty of karaoke songs that are the original recording from the original artist. At least in Japan, where the word "karaoke" originates.
I had a hard time understanding what they are offering as well and came here to find out. Is this for people in the same room? If so, maybe some vague reference to karaoke is merited. Is it for sharing with other people over the internet?
edit: ultimately I could care less (but not by much) about using it; I am just curious about what they offer and how they compete.
Karaoke is the best activity I do with my parents in-law when I visit. They love it. It's kind of fun, actually. The equipment they have is old and they have special CDs with the songs and lyrics, really low quality. Apple Sing through AppleTV, with voice search, would be a huge improvement.
The fact that you're aware that most people can't sing well is evidence that lack of singing ability in no way deters people from singing along to their favorite songs. You've kind of answered your own question: "most people" are probably the target demographic.
Yeah, the huge number of "sing-along versions" of movies on Disney+ make it pretty clear what at least one major target market for this is. See also: the overwhelming majority of home-karaoke devices that are purchased. Most of the market's shitty overpriced devices that barely work, aimed at kids. This replaces those with something way the hell better, on a device that also does a bunch of other stuff.
This is not only about vocals suppression, but also about visualizing timing (displaying when each syllable should be sung). Many songs have weirdly timed bridges where it's not clear when singing will resume. And some people are just not good at timing - having a visual timing clue will help there.
I agree with “show don’t tell” as a general rule for effectiveness in cases like this, but I find it really hard to believe anyone is going to read this and not immediately grasp what is meant by “with adjustable vocals and real-time lyrics”. Obviously you can already sing along, and the question about how this makes it better is answered within the same sentence.
I don't know, @hamburglar -- judging by just the comments in this thread, it wasn't so obvious to others.
And "real-time lyrics" is _already_ in Apple Music.
In your/Apple's defense, the photo they included is pretty clear what's going on. Important to note the footnote on the page -- this isn't even full-fledged karaoke: "The vocal slider adjusts vocal volume, but does not fully remove vocals."
I'm being nitpicky for sure, but that's kind of what HN and similar forums are for. Especially when it's Apple, which I hold to a pretty high standard for product marketing.
> And "real-time lyrics" is _already_ in Apple Music.
Currently, Apple Music lyrics are synced to the current line of lyrics (if the lyrics metadata supports it).
This new feature appears to sync the lyrics to the current word of lyrics, which is what karaoke usually does. Not sure how big a deal that is to folks looking to do their own vocals.
I'm curious about how they're doing that word-for-word timing. Hardcoded metadata? ML? Is it a simple interpolation based on the existing line-by-line timing?
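For what it's worth, the "simple interpolation" option could look something like the sketch below. To be clear, this is a guess at one possible approach and not how Apple does it: the function name and the idea of weighting each word by its character count are my own assumptions, and real vocals rarely spread this evenly across a line.

```python
def interpolate_word_times(line_text, line_start, line_end):
    """Split a line's time span across its words, weighted by word length.

    Returns a list of (word, start_seconds, end_seconds) tuples.
    Hypothetical helper: assumes only line-level timestamps are available.
    """
    words = line_text.split()
    if not words:
        return []
    total_chars = sum(len(word) for word in words)
    duration = line_end - line_start
    timings = []
    cursor = line_start
    for word in words:
        # Longer words get a proportionally longer slice of the line.
        span = duration * len(word) / total_chars
        timings.append((word, cursor, cursor + span))
        cursor += span
    return timings

# A line known to run from 10 s to 14 s in the line-synced lyrics.
for word, start, end in interpolate_word_times("Don't stop believing", 10.0, 14.0):
    print(f"{start:6.2f} -> {end:6.2f}  {word}")
```

This would drift badly on held notes and instrumental gaps inside a line, which is one reason to suspect something smarter than interpolation is involved.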
The metadata exists in special karaoke recordings, but assuming they're using original recordings not created/modified for karaoke, they'd have to create it on the fly:
I'd guess it's done using the same speech-to-text system used by voice assistants, which can certainly show the words it hears in near-realtime -- and way more quickly when it already knows what words it's listening for.
By the way, karaoke often highlights individual syllables, not just whole words.
License agreements often forbid things similar to “adjustable vocals.” It would be interesting to learn if this is achieved with technology, like automatic stemming, or through partnership agreements that provide vocal tracks separately for adjustment.
Normally it’s important for karaoke that the singer’s voice is amplified to (above) the volume level of the song, and from the description they don’t seem to provide that.
1) Perhaps they don’t want to trigger a different licensing model (on either the music or the lyrics) that applies specifically to karaoke devices and venues. Just a guess, not a music lawyer, but all sorts of unexpected terms apply to music licensing (such as streaming TV services not being able to use music from the original TV broadcast)
2) Perhaps they don’t want to associate with the karaoke “brand” which some users might perceive as kitschy or low-quality (have you ever seen a karaoke video?)
I suspect the second is exactly right. Apple is extremely careful about how their products are perceived as elite and cool and it’s easy to imagine their marketing department thinking the word “karaoke” just evokes an uncool or unsophisticated vibe.
When I hear karaoke I think 90s looking machine with a tethered microphone, or equally old music videos with ugly subtitles. True or not, it is possible they just want to have full control of the image of their service.
Another thing is that when you sing a foreign song, the karaoke also means to read the foreign subtitles in your own language. It's an extra meaning they might want to avoid, I assume this is not a feature they have.
Obvious. Make this look like a more original feature. Redefine the concept on Apple's terms. Have people pay attention to the novelty of it being done automatically, rather than manually in a catalog with lyrics that look like they belong on a CRT screen.
Wow I was making shit up but the real invention is actually quite close.
"The world's first karaoke machine, the Juke-8, was built by Japanese inventor and musician Daisuke Inoue in 1971. But it is Filipino inventor Roberto del Rosario who holds the machine's patent. He developed the Karaoke Sing-Along System in 1975."
Karaoke usually implies no or suppressed vocal channel in the audio mix, doesn't it? It's "sing along" as it's still the original mix, I suppose.
edit: I missed the "adjustable vocals" in the list of features. No idea then. Footnote says "The vocal slider adjusts vocal volume, but does not fully remove vocals."
It's not "yet" Karaoke. It's missing one "key" feature: with a karaoke machine one can adjust the key of the song up or down to be able to sing along. Apple Music Sing appears to only adjust the vocal volume up or down.
While I have zero interest in karaoke, I imagine it's devastating for a bunch of karaoke apps that are now going to see new sales evaporate. I don't think it's cool for the company to just take over a market, or a smart thing to do when their app store monopoly is a political football. I guess they're relying on antitrust enforcement being more of an idea than a practical reality these days.
People will definitely rip no-vocal versions out of there (if they are good quality, that is). I see some pretty neat implications:
- Young artists can (illegally) use beats by popular producers. Young producers can listen to various tracks and better understand some techniques that were shadowed by vocals before.
- More bootleg remixes of popular tracks. Things that weren't possible by just slicing and sampling the track would suddenly be possible, too.
- Better music stemming / demixing ML models. I believe right now the baseline is Spleeter by Deezer [1], which is pretty good but leaves so much room for improvement.
Back in the day, we just took a channel differential (L minus R) with one channel high passed so the difference didn't have all the bass eliminated.
Tons of reasons it sucked: stereo reverb on vocals wouldn't cancel out, other instruments panned dead center (or at least highly correlated) would cancel out along with the vocal, and early MP3 encoders often sounded better in "joint stereo" or "intensity stereo" mode (assuming low/mid bitrates) which very reasonably gave more bits to the center and fewer bits to the sides, so the differential sounded like hot garbage.
Exciting to see that this technique has been replaced with ML.
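For anyone who never tried it, the channel-differential trick described above can be sketched roughly like this. Assumptions on my part: plain Python lists of float samples, a basic first-order high-pass filter standing in for whatever filter people actually used, and a made-up 100 Hz cutoff.

```python
import math

def one_pole_highpass(samples, cutoff_hz, sample_rate):
    """First-order (RC-style) high-pass filter over a list of samples."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out

def remove_center(left, right, cutoff_hz=100.0, sample_rate=44100):
    """Return L minus high-passed R as a mono signal.

    Center-panned content well above the cutoff mostly cancels.
    Content below the cutoff is filtered out of R before the subtraction,
    so the bass from L mostly survives.
    """
    right_hp = one_pole_highpass(right, cutoff_hz, sample_rate)
    return [l - r for l, r in zip(left, right_hp)]
```

Even in this toy version the cancellation is imperfect near the cutoff because the filter shifts phase, and it does nothing about stereo reverb or other center-panned instruments, which matches the list of complaints above.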
Your comment reminded me of ...And Justice For Jason, where people fixed the bass guitar volume of the Metallica album by taking the individual channels from Rock Band.
Stemroller[1] is another audio track separator, based on demucs[2]. The demucs GH page has an accuracy comparison table with spleeter, demucs and others.
If it's realtime, and you are mixing vocals and other separated tracks back together on-the-fly, the actual separations don't need to be _super_ high quality, as the mixing will mask most deficiencies.
The reason I say that is because I suspect they are doing this with a neural net.
I think a neural net would be an overkill here. If you are doing the mixing on device, I think the algorithm could be as simple as just overlaying two tracks.
Splitting tracks on-device would probably use a neural net, but I don't think they're doing that.
I wouldn’t expect the ML-based real-time vocals removal to be high-quality enough for those use cases. It’s a really difficult task, and it doesn’t have to be high-quality to be adequate for karaoke. The press release also says "The vocal slider adjusts vocal volume, but does not fully remove vocals".
Perhaps I'm missing something, but without a remix pack, how would you take vocals and overlay those on top of a completely rewritten track for example, or bite a particular sample from the track that only plays with vocals on top?
There are a couple ways. The method most musicians have used since the mid-90s is phase cancellation, where you take an instrumental and use it to cancel out any sound that isn't instrumental (a-la noise-cancelling). In more dire scenarios, people would split the stereo tracks and cancel them against one another, which actually worked fairly well if you had a DAW precise enough to pull it off.
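As a rough illustration of the phase-cancellation idea, here is a minimal sketch. It assumes the full mix and the instrumental are already sample-aligned, level-matched, and come from the same master, which is the hard part in practice. The function name is hypothetical.

```python
def cancel_instrumental(full_mix, instrumental):
    """Subtract an aligned instrumental from the full mix, leaving the vocal.

    Equivalent to flipping the polarity of the instrumental and summing it
    with the mix. Both inputs are lists of float samples.
    """
    if len(full_mix) != len(instrumental):
        raise ValueError("tracks must be the same length and sample-aligned")
    return [m - i for m, i in zip(full_mix, instrumental)]

# Toy example: a three-sample "song".
instrumental = [0.25, -0.5, 0.125]
vocal = [0.5, 0.25, -0.25]
full_mix = [i + v for i, v in zip(instrumental, vocal)]
print(cancel_instrumental(full_mix, instrumental))
```

If the two files are offset by even one sample, or one went through a different limiter or lossy encoder, the subtraction leaves audible residue instead of a clean vocal.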
Then in the mid-2000s there were a few attempts to use algorithms for this (that mostly failed) and circa-2015 we started to see machine-learning based options like Spleeter and Demucs. This is what Apple is implementing here, and it's been freely available for years. It's not really novel for anyone who isn't writing their music on an iPhone.
I mostly ask because I've spent the past 2 or 3 weeks working on a remix album, and haven't really had trouble finding stem separation tools that work well.
There are many services or apps/plugins that can do this. I think the promise Apple is making isn't necessarily that it will be better, but that they will have done it prior for many songs in their catalog, so that the stems are immediately available.
Can anyone make out if songs like this are specially mastered with separate vocal channels? I’m guessing they’re not minus ones that someone else has arranged and played. If these are coming from the original artist it’ll be interesting to know if they’re being uploaded as separate vocal channels or if Apple is applying ML to do voice isolation.
Do any audio file formats support separate tracks with level info?
My guess is that they're doing it through ML. The old "cancel out the left track with the inverted right track" trick is probably good enough, but I don't see why Apple wouldn't take it one step further.
> Do any audio file formats support separate tracks with level info?
I don't know what file format audio engineering software uses to store all the track info but my guess is that it's akin to a zip file containing a flac for each track.
I assume this comes along with “Mastered for iTunes” / “Apple Digital Masters” at least in part. I think if it’s only available for limited tracks, we’ll know they went the full separate-the-vocal-track route. If it’s everything, then I bet they augmented with ML.
Plenty of file formats support this, but they're generally for DAWs and not for a final mix. In order to support volume the file would have to be packed with each individual track.
Almost assuredly this is just ML-assisted frequency separation.
AAC is already interleaved multi-channel. Two-channel L/R is obviously very common for music, but spatial audio and Dolby Atmos come with more channels (up to 128, which can be arranged and binned into "beds"). 5.1 (6 channels) and 7.1 (8 channels) are also common for video. Having dedicated/isolated vocal tracks as a channel is trivial in the format. I'd expect this is how it's done, and not ML, because Apple have already been driving mixing and end-to-end workflow with studios for Spatial audio and Mastered for iTunes.
Channel and spatial/binaural tracks I'd expect, but separated tracks for instruments/vocals is a lot more time consuming and the kind of thing that studios/producers/masterers would bluster at enough that I wouldn't expect it to come close to giving Apple the volume they'd likely want.
It absolutely could be done, I just think that Apple would want very good coverage and studios would be very slow to provide this format.
Apple need the labels either way. They can't just go creating new derived works without a license, and the artists and producers would far prefer to avoid the artifacts of something that is overly automated.
It's already become pretty common for studios and labels to make stems available (stems are full multi-track files that can be used with a DAW) to industry insiders and even the public sometimes. There's a community of remixers, samplers, even a small cottage industry of YouTubers who work with these regularly. It wouldn't take them more than a few minutes per track to annotate which channel is vocals.
Stems are usually lossless files and have an intended purpose for remixing.
Do you stop at instruments and vocals or are we talking 100+ tracks? It strikes me as nontrivial work for very limited purpose. Apple can turn your vocals down so now your exports have to include extra tracks? I don't think that's a good enough sell. The labels do care about this use case.
It makes a lot of sense for an audio file format to support multiple tracks when each track is intended to target a different output device (left and right speakers, etc.)
It makes less sense when the intent is to mix those tracks together and send them to a single output device. Producers and audio engineers want full control over that mixing process because it's almost never just a simple sum. They are doing audio compression (sometimes multi-band), dynamic EQ, saturation, limiting, etc. They wouldn't want to give that up, because it's an essential part of making a good sounding recording.
Presumably if they have Don’t Stop Believing and a separate karaoke version today, they could basically cross fade between them for the adjustment of original singer vocals. And it would work better than trying to remove vocals after the fact.
But that would require both versions, synchronizing them, and KNOWING which two tracks went together.
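The cross-fade idea is simple enough to sketch. Big assumption: the karaoke version has exactly the same backing as the original and is sample-aligned with it, which a re-recorded cover would not be. The function name and the 0-to-1 slider range are mine.

```python
def crossfade(original, karaoke, vocal_level):
    """Blend the original track with its vocal-free version.

    vocal_level = 1.0 plays the original, 0.0 plays the karaoke version.
    If both share an identical backing, the result is the backing plus
    the vocal scaled by vocal_level.
    """
    if not 0.0 <= vocal_level <= 1.0:
        raise ValueError("vocal_level must be between 0.0 and 1.0")
    if len(original) != len(karaoke):
        raise ValueError("tracks must be the same length and sample-aligned")
    return [
        vocal_level * o + (1.0 - vocal_level) * k
        for o, k in zip(original, karaoke)
    ]
```

With a cover-band backing track the two versions would differ in timing and timbre, so a blend like this would sound like two bands playing at once, which is why knowing which two tracks go together (and synchronizing them) matters.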
Is that easier or harder than getting the labels to just give you a special version with an extra channel for vocals?
Hopefully someone digs in and finds out once this is released.
>> Is that easier or harder than getting the labels to just give you a special version with an extra channel for vocals?
Depends on the label and the artist. Could be easier, could be much harder. Sometimes multi-stems just don't exist anymore; were lost, have additional licensing issues; labels are a nightmare to deal with.
> Do any audio file formats support separate tracks with level info?
There is a 4-track audio format developed by Native Instruments that is mostly used by DJs, and only dance music gets released in that. However, now that I search their website, there's no mention of it anymore. I guess it wasn't popular enough with the labels to get any traction.
Though I have no idea how to get into this beta test.
This is a killer feature for me as I am trying to learn to sing.
There are some services online that use AI to split tracks into instruments and vocals that do a hell of a job, but trying to combine them with written lyrics again is a pretty painful experience.
To anyone wondering how this will work, since the post didn’t say:
> The feature won’t see users switching over to music tracks that already have the vocals removed, however. Instead, it’s relying on an on-device machine learning algorithm that processes the music in real time, Apple says. The algorithm isolates the vocals from the rest of the song, allowing users to adjust their volume accordingly using a new slider button in the Apple Music app.
(TechCrunch)
I wonder how well this will work. Especially when some songs apply all sorts of wacky effects on vocals. Sometimes it’s not even clear to a human what should be considered vocals and what’s part of the instrumental. The context matters too. Maybe they’re using the lyrics as a signal?
Having easy access to instrumental versions of millions of tracks will be huge for remixers, people recording vocal demos, etc.
...well, maybe.
I'm sure you won't be able to export the vocal-less versions out to an audio file directly.
But you should be able to turn off the vocals and capture the resulting audio by the usual means: loopback audio drivers, analog capture, etc. It's still a step, but it's going to be fairly easy and the resulting file should be quite clean.
(It's also going to revolutionize the singing I do alone after a few glasses of wine but that's not super impactful to anybody but me)
It’s not what this is for, but I’ve often thought it would be kind of nice to be able to just put Apple Music in a “instrumental“ mode.
The easy way would be if it knew which tracks were instrumental and only played those. The GREAT way would be if it just played everything but knew how to remove all the vocals by playing just the backing track.
Extracting vocal tracks from songs is something that is done trivially in all audio mixing software out there. This is hardly a "revolution", especially considering Apple isn't even going to let you export the results digitally.
It's been a while since I played around with that, but I remember the results not being super clean.
Is that still the case?
I am assuming, perhaps incorrectly, that Apple has access to multitrack versions of some songs and that for these tracks it will be able to remove the vocals in a pristine way.
Here's a question, is it copyright infringement to serve modified versions of a song? We know that attempts to censor movies have not held up in court, how is this different? Did all the publishers agree to this in their contract with Apple?
IANAL, but Apple Music already pays people per stream, and this type of modification wouldn't result in additional people needing to get paid, so that at least eliminates a whole class of issues.
This looks really cool, and it'll be interesting to see how artists and producers react. Presumably the tracks need to be mixed for the format with a dedicated vocal track. Although it's possible to isolate vocals some of the time from mixes, using stereo separation (the vocals are often the only component that is 100% center mixed), filtering, and even AI techniques ... none are perfect.
I wonder if this will have the real killer feature of professional karaoke ... the ability to transpose the song into a more convenient key for the singer. Apple have had this capability in Logic Pro's pitch shift and it's pretty good even on polyphonic music. Some Karaoke platforms use midi and midi-like formats and synthesize the instruments to do it, but I'd be surprised to see that here.
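For reference, the arithmetic behind transposing in equal temperament: each semitone multiplies frequency by 2^(1/12). The sketch below only computes target frequencies. It is not how Logic Pro or any karaoke platform implements pitch shifting, and the function names are made up for illustration.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transpose_hz(freq_hz, semitones):
    """Return the frequency of a note after shifting it by `semitones`."""
    return freq_hz * semitone_ratio(semitones)


# A4 (440 Hz) moved up an octave, down an octave, and up three semitones.
up_octave = transpose_hz(440.0, 12)
down_octave = transpose_hz(440.0, -12)
up_minor_third = transpose_hz(440.0, 3)
```

A real pitch shifter also has to keep the tempo unchanged while moving every frequency by this ratio, which is the hard part.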
Serato just released Stems, which allows DJs to isolate drums, bassline, vocals in the tracks they are playing live. The algos are getting pretty damn good.
I remember seeing something like this in VirtualDJ the last time I fired it up and updated a few years ago. Only noticed because they'd remapped the high/mid/low EQ knobs on my old Numark controller to remove bass, drums, vox, etc.
Wasn't what I was looking for initially but it was surprisingly good compared to the older "vocal remover" type plugins I remember having for Winamp et al back in the day. If I ever actually DJed anymore it would definitely be useful for mixing tracks.
Most DJ software these days has tempo/key adjustment as well as hooks to connect to subscription karaoke services (basically Spotify for karaoke). Unless I had a business with a ton of old school gear, that's how I'd run a karaoke company nowadays. No worries about licensing/rights for songs, updating your collection with the autotuned robot song of the week, and easy to connect a laptop/mixer to an existing sound system or your own PA.
They should introduce "Discover Weekly", "Release Radar" and "Enhanced Playlists". Why are the recommendations lacking so much behind Spotify? That's the only reason I can't switch over and I would love to.
Apple Music has custom play lists, which it updates every week. One of them is the "New Music Mix", which is sort of a combination of Spotify's "Release Radar" and "Discover Weekly" play lists. Apple also has other mix play lists, which are updated weekly: things like "Favorites Mix", "Chill Mix", and "Get Up! Mix". Finally Apple Music has custom "radio stations", which are endless playlists based on your listening history. (I believe Spotify also had stations, but I never used them much.)
I switched from Spotify to Apple Music about a month ago, mostly because I wanted music in Dolby Atmos. (Not all songs are in Dolby Atmos, but many are, and they sound far superior on my AirPods to anything without it.) After transferring my Spotify playlists to Apple, I've found that the experiences are pretty similar. There are a few things I like better about Spotify (e.g., it has better support for non-Apple platforms like Linux and Windows), but I prefer Apple now. If you're curious, I'd recommend a free trial.
I switched to Apple Music for a couple years and recently switched back to Spotify. Two main reasons:
1. The Apple Music desktop app is utter garbage. Other than performance issues and crashing, which happened to me almost daily, the UX is a decade behind. It's dead simple things... like the artist's name on the currently playing track isn't even a link.
2. Spotify's recommendation engine is outrageously better than Apple's. With Spotify I find new music that I actually like almost weekly, with zero effort at all. All Apple does is pigeonhole you with whatever you've listened to recently, I had a day where I listened to "lo-fi chill" music and that's all it recommended to me for weeks.
I find both Apple Music and Spotify's recommendation services pretty much awful after a few months of use. They both eventually end up pushing the same stuff and recommending me songs by the same artists I'm already listening to.
This will be a huge hit to Karafun’s subscriber base. I’ve paid for a couple of months before. Having this as part of my Apple Music or One subscription would give me no reason to ever subscribe to it again, assuming feature and library parity.
> and can be enjoyed on iPhone, iPad, and the new Apple TV 4K
But the old Apple TV 4K is not powerful enough for displaying text on a TV, even though any old iPhone is?
I get that they want to push device and service adoption but they should pick a lane, either push users to upgrade devices, or push to sign up to paid services. Pushing for both at the same time is greedy.
It's not explicit, but from the descriptions it seems to suggest it's performing recognition of the audio streams of what's being sung, and even supports dual streams for duets.
So I wouldn't be surprised if it relies on a particular hardware chip that the older Apple TV simply doesn't have. That has definitely been the case for everything Apple has launched with regards to Spatial Audio.
I hope they don't do this. It would be so wasteful to have every end user device running an ML model every time a song is played. Just run the model once in the datacenter and then distribute the time stamp metadata.
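A minimal sketch of what "distribute the time stamp metadata" could look like on the client, assuming the server ships a sorted list of (start time, lyric line) pairs. The data layout and names here are hypothetical, not Apple's format.

```python
from bisect import bisect_right

# Hypothetical precomputed metadata: (start_seconds, text), sorted by start time.
LYRICS = [
    (1.0, "first line"),
    (4.5, "second line"),
    (9.0, "third line"),
]
STARTS = [start for start, _ in LYRICS]


def active_line(position_seconds):
    """Return the lyric line active at the given playback position, or None."""
    index = bisect_right(STARTS, position_seconds) - 1
    return LYRICS[index][1] if index >= 0 else None
```

The lookup is a binary search over timestamps, so the device does no audio analysis at all in this model.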
> It's not explicit, but from the descriptions it seems to suggest it's performing recognition of the audio streams of what's being sung, and even supports dual streams for duets.
Just curious: what in the article makes you think that?
Adjustable vocals: Users now have control over a song’s vocal levels. They can sing with the original artist vocals, take the lead, or mix it up on millions of songs in the Apple Music catalog.
Real-time lyrics: Users can sing along to their favorite songs with animated lyrics that dance to the rhythm of the vocals.
Background vocals: Vocal lines sung simultaneously can animate independently from the main vocals to make it easier for users to follow.
Duet view: Multiple vocalists show on opposite sides of the screen to make duets or multi-singer tracks easy to sing along to.
------
The part of the article where they state these things explicitly...
> Adjustable vocals: Users now have control over a song’s vocal levels. They can sing with the original artist vocals, take the lead, or mix it up on millions of songs in the Apple Music catalog.
I think this only requires pre-making two audio files per track, and simultaneously streaming these.
Real-time lyrics, Background vocals and Duet view are all nice features too, but the hardest part processing-wise is analysing how loud you sing into the microphone. It's just karaoke with a good UI.
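To make the two-file idea concrete, here is a minimal sketch assuming the service streams an instrumental stem and a vocal stem as sample lists of equal length. `vocal_level` stands in for the slider; all names are illustrative, and per the TechCrunch quote elsewhere in the thread this is not what Apple says it does.

```python
def mix_stems(instrumental, vocals, vocal_level):
    """Sum two stems sample by sample, scaling the vocal stem by vocal_level (0.0 to 1.0)."""
    if not 0.0 <= vocal_level <= 1.0:
        raise ValueError("vocal_level must be between 0.0 and 1.0")
    if len(instrumental) != len(vocals):
        raise ValueError("stems must have the same length")
    return [i + vocal_level * v for i, v in zip(instrumental, vocals)]


instrumental = [0.25, -0.25, 0.5]
vocals = [0.5, 0.5, -0.5]

full_mix = mix_stems(instrumental, vocals, 1.0)   # original song
karaoke = mix_stems(instrumental, vocals, 0.0)    # vocals muted
half = mix_stems(instrumental, vocals, 0.5)       # vocals at half level
```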
> [Apple says it is] relying on an on-device machine learning algorithm that processes the music in real-time. The tech builds on Apple’s noise-cancellation expertise and other developments it’s made for FaceTime, the company said.
Wonder why they take this approach though, as it is clearly over-engineering (if I correctly understand that the goal is just to make vocals volume adjustable).
> Wonder why they take this approach though, as it is clearly over-engineering (if I correctly understand that the goal is just to make vocals volume adjustable).
Depends what the other non-functional requirements were. i.e. if the NFRs were as follows:
* Cannot increase bandwidth / mobile data usage.
* Cannot impact music quality / bitrate.
* Has to work offline.
* Cannot increase on-device storage.
* Has to be responsive.
Then two audio streams might not work.
Another advantage of doing it on-device is that it doesn't actually change any of the backend architecture either. It might be a lot of change to a lot of systems for a feature which only adds a small amount of functionality - i.e. architecting your entire backend and streaming around separating audio tracks might not be the right focus.
Maybe it's licensing? I can imagine copyright holders being squeamish about Apple processing, permanently storing, and serving heavily altered versions of their music. The difference is silly and pedantic, but by processing it in real-time during playback, one might argue it's just a filter effect like EQ.
Not sure - although I would imagine that it would effectively double the storage and bandwidth/data requirements for Apple Music in general if they had to send two files with equal bitrate.
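A back-of-the-envelope version of that doubling, assuming a 256 kbps stream and a 3.5-minute song. Both figures are illustrative assumptions, and the helper name is made up.

```python
def stream_bytes(bitrate_kbps, seconds, streams=1):
    """Bytes transferred for `streams` constant-bitrate streams of the given length."""
    return bitrate_kbps * 1000 * seconds * streams // 8


SONG_SECONDS = 210  # assumed 3.5-minute track

one_stream = stream_bytes(256, SONG_SECONDS)              # single mixed file
two_streams = stream_bytes(256, SONG_SECONDS, streams=2)  # instrumental + vocals
```

Under these assumptions one song is about 6.7 MB as a single stream and about 13.4 MB as two equal-bitrate stems.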
They don't state anything explicitly referring to real-time processing of the songs.
As a matter of fact, calling out "millions of songs in the Apple Music catalog" actually makes it seem like the adjustable vocals will only be available on certain songs that they've added support for.
It's hard for me to imagine they'd do something special to support "millions of songs" while excluding others, since the entire catalog is ~100m songs.
My guess is that it's entirely dynamic. It's hard to imagine the complexity of doing batch processing to render each song in the library, and maintain that as new songs are uploaded, and update the renders for software improvements. Better to just do it realtime.
And since classical, instrumentals, esoteric ambient stuff, death metal, etc, will probably not be supported by the algorithms, I think the "millions" refers to those that can be processed in realtime.
They do something special: get the necessary sign-offs from legal. Apple's license for some content may bar it from being used that way (that's another reason for doing it on device: it could be a different situation as far as legal & licensing are concerned).
Sure, and a good point. I should have said they wouldn't do anything special on a song-by-song basis for millions of songs. It's not like someone's pushing the button for each song, or building a list of songs. Those that meet the criteria will be included, whether that is 10 or 10 million of them.
I’d guess it’s more probable that lots of songs simply have no lyrics at all, so the claim „all songs in the Apple Music catalog“ would be factually wrong.
> [...] an on-device machine learning algorithm that can process music in real time
Their latest Apple TV includes the A15, which has a 'neural engine' for ML, and this is also included in their latest iPhone / iPad, so that might be part of it.
The divide in expectations is funny. Non-Apple-user: "ML stuff? Must be 'in the cloud'." Apple user: "ML stuff? Must use a special chip in the device."
That makes sense when there's audio that Apple hasn't seen before. With Apple Music Sing, it makes more sense to do that processing once in the datacenter.
> it makes more sense to do that processing once in the datacenter.
Since Apple is all about on-device processing with so many of its features, going back-and-forth to the data center doesn't seem to be its style these days.
That's more of a Google thing.
And no one can accuse Apple of telling its advertisers that you start your day with Funky Cold Medina.
There's a reason Apple's preferring on-device processing: user privacy. This doesn't make sense for music (stems, lyrics) since it's not listener's data.
> There's a reason Apple's preferring on-device processing: user privacy.
Is that the actual reason though? My personal impression has been that it's a combination of reasons that benefit Apple. The increased user privacy being a nice bonus for users, but not the primary reason:
1. Producing phones powerful enough for on-device ML both justifies the high price point to the general public and is a good marketing point (along with increased user privacy)
2. Avoid backend infrastructure costs. Why spend extra money on servers, maintenance, and compliance when they can just offload the work to the devices themselves since they're capable?
3. Bonus: The unplanned obsolescence for new features like the one announced is also a side effect that benefits Apple.
I do not get the impression that Apple's primary focus is to benefit users and their privacy.
>There's a reason Apple's preferring on-device processing:
It makes it so you have to buy new devices sooner.
The privacy thing is a nice side-benefit and PR thing, but let's be realistic here.
EDIT: Just to remind everyone, we are literally in a thread about a new device feature that is trivial to do in the cloud, which Apple chooses to do on-device, which makes it only a feature for its newest generation of products...
There's no extra back and forth. You have to fetch the songs from the datacenter in the first place, right? So you fetch the additional data at the same time.
If Apple wanted to support this for user-provided mp3s then on-device would make sense. It doesn't sound like they support that though.
The note at the bottom reads: "Apple Music Sing will be available on all compatible iPhone and iPad models as well as the new Apple TV 4K".
Maybe it requires the Neural Engine that was added with the A11 and opened to third-party apps with the A12. (Sep 2018).
But that should still allow the Apple TV 4K (second generation) which uses the A12.
iPhones/iPads with the A12 or newer are the iPhone XS/XR, iPad Mini (5th generation), iPad Air (3rd generation), and iPad (8th generation). The iPhone SE (2nd generation) uses the A13 so that should be compatible too.
What if the top comment wasn't the MOST cynical take possible at any given time? :D
What is cool about HN though is that maybe an Apple engineer will show up anonymously and give us the low down about why. In my short experience at Apple there were plenty of problems but NOT shipping things to customers to make money was not even remotely one of them. Bending over backwards on ancient radars to support the iPhone 6 sucks up an enormous amount of time!
Apple products have longer lifespans than competitors'[0]. There's just a secondary market, which is more environmentally efficient. It's a very different world, so it's understandable that people not in the Apple ecosystem see "buy a new iPad" as synonymous with "throw old iPad in trash." But that's not what happens at all. The actual devices go on being used for a very long time.
But ignoring that for a moment, are you proposing that companies shouldn't improve hardware because it leads to software features that don't work on old devices, which leads to old devices being thrown away? That doesn't feel right to me.
> Pushing users to upgrade devices is unethical and harmful. Apple does it all the time and they should be held accountable for that.
Huh. Long-time Apple user and my experience is they're really good about supporting older hardware, including when that hardware can't support the newest features. Usually (not always, but usually) if an older device doesn't get a feature, it's because it can't, but they keep getting updates anyway, just without the features they cannot support. That's the trade-off for more features being local-only or local-mostly rather than "cloud".
It uses an “on-device machine learning algorithm that processes the music in real-time.” Perhaps it doesn’t work as well on the old Apple TV? Or just an excuse.
> But the old Apple TV 4K is not powerful enough for displaying text on a TV, even though any old iPhone is?
Honestly, it might not be. My Apple TV 4k seems slow as molasses with performance more like an AMLogic 905.
Not sure if it’s because it doesn’t have enough RAM, the flash is slow or the CPU is just way under powered. Yes, they advertise it as an A12Z or whatever, but it’s a binned part that wasn’t acceptable for a phone.
I have owned nearly every generation of Apple TV. Other than the original model, I wouldn't describe any of them as slow as molasses. On the contrary, they are consistently the fastest streaming devices I've owned, and I'm comparing to things like Roku (built-in to a TCL), FireTV, and Android TV across a number of devices (Shield Pro, Hisense U6G, Zidoo Z9X). (Technically the Zidoo uses Android, not Android TV.)
I currently own all three generations of the Apple TV 4K. All of them are snappy devices.
Now that said, I use them for streaming video only, both a variety of services as well as use Infuse for streaming from my NAS.
>they are consistently the fastest streaming devices I've owned,
As someone that works on set top box / streaming chips, I don’t think you realize how low that bar is :P
The prev gen ATV 4k is clearly superior, I’m not disputing that. But ‘snappy’ is a relative term. I expect a lot more out of a device sporting the chip it has. I regularly get input lag, stalled apps (that the system cannot recover from for some time), bursted input (like it doesn’t debounce queued remote inputs after stalling)
There are systems-level issues that remain unaddressed, and I speculate they are due to hardware bottlenecks somewhere.
Yeah, I switched from a Shield to the prior gen of Apple TV 4K in part because the Shield was often laggy as hell (also the UI layout was a mess, also the damn ads, also it was significantly buggier) and AFAIK that's about the most powerful non-Apple box on the market. ATV's way better on that front. Frankly a low-end-but-not-quite-bottom-end Roku machine feels snappier and more-reliable than the Shield did (though not as good as the Apple TV)
I'm a big Shield fan, I don't feel like other boxes can touch its range of audio and video codecs. I did replace the default launcher with fLauncher so that I no longer had ads.
That sounds like you may have defective hardware or something else like an overheated environment forcing it to throttle. On a standard unit everything built-in is snappy.
I am not convinced it won't work on multiple models. They don't actually provide a list of supported devices. The language here is unclear.
At first, I thought it meant that this only was on the current Apple TV 4K, but then I thought it might just be marketing speak to get people to buy Apple TVs.
But maybe the jump from the A12 to the A15 does provide something that makes this possible. The only thing I can think of is the neural engine (although that first debuted in the A11).
iPhone 8 that is the oldest supported for iOS 16 has an A11 chip; the second gen 4K has A12. The first gen 4K has an A10X chip - maybe it needs the "Neural Engine" that first appeared in A11 but the 2nd gen does have it.
In addition, the Apple TV is plugged in, it doesn't have to limit its power for battery reasons.
It seems that limiting new headline software features to the latest hardware release is the direction that Apple is going lately. The most recent examples that come to mind are the always-on display in the new iPhone 14 and the hover mode in the new iPad Pro.
You think it's a coincidence that the first iPhone with AOD is also the first one with a LTPO display? If so, why did Apple wait until the 14? If it's just an artificial limitation, wouldn't it have made more sense to do for the XS or the 7 or whatever? It's generally preferable to pull revenue forward, not hold it back for years.
What's worse is that the Apple Music app on Mac doesn't even have lyrics (and is ridiculously bad in many other ways). It feels like they are slowly abandoning the platform.
... it is very peculiar that when I write a less than flattering thing about Apple, even when it is just partly relevant, certain people start to downvote my opinion. It only happens when it is about Apple, btw; it does not happen with anything else.
This sounds cool. I was in glee club in Junior High, that would be almost 50 years ago, and haven't really sung since. On the rare occasions when I have our house to myself, I sometimes partake in Apple Fitness+ dance exercises - I really don't want anyone seeing that. Now I have the option to sing, in a non-embarrassing setting!
This Music Sing service might also be really good at parties.
Interesting yet very anti-competitive move by Apple.
There are lots and lots of existing providers of karaoke tracks: these invest (depending on quality) between several dollars and several hundreds of dollars to record "soundalike" tracks
Sales of such tracks do not generate royalties for the original performer, but do pay out to the composer (per track sale and for things like public performance).
Apple is now garroting these middlemen using technology, and most likely using this capability as leverage in negotiating with recording artists ("hey, give us a 14-day exclusive on iMusicOrWhateverWeCallItThisWeek, and we'll kick back an additional point on residuals").
This is bad news for the existing providers, and barely good news for anyone else.
> There are lots and lots of existing providers of karaoke tracks: these invest (depending on quality) between several dollars and several hundreds of dollars to record "soundalike" tracks
Commercial karaoke establishments will still need to pay for the licensed versions if they want to be legal. That doesn't change.
People who don't want to do that already had tons of options - Adobe Audition and tons of other software can remove/reduce vocals.
So I don't feel like this changes the commercial picture too much? I feel like this will mainly affect at-home singalongs.
Commercial karaoke establishments are, outside Japan and Korea, not very relevant.
The at-home market in the UK, on the other hand, is pretty significant. None of these households know how to operate Adobe Audition or something similar: they just want to sing along with whatever is on the telly*.
There are lots of companies catering to that market. In fact: in the past, Apple was more than happy to allow them on their platform, to fill in the gaps left by Apple's inability to negotiate certain agreements.
In the past year or so, Apple has gotten more and more restrictive with regards to "soundalike" content. And we now know why... Is this inevitable? Possibly. Is it fair? Maybe. Is it yet another cottage industry that Apple strangulates? Definitely.
*And yes, this is a very simplistic caricature by choice. Of course UK consumers are more sophisticated, but...
In what cases do we continue to use a slow, antiquated process to do a task when an automated technology comes out that can do it faster and/or better?
Sure, I could pay a contractor to go through my files to find and replace every instance of a text string with another text string. Or I could use sed. It's not anti-competitive for sed to exist.
It seems like just regular old competition to me. Like lightbulb manufacturers pushing out big candle or automobiles pushing horse breeders out of business.
I'm excited for this feature. My brother purchased a small karaoke machine for my daughter for her birthday this year. It connects to her iPad via Bluetooth; but, so far, she's only been able to use her regular iTunes library for music. This is because the "karaoke music service" pushed in the box and available on the App Store was a subscription for $19.99 PER WEEK.
I don't care if their audio processing to drop the volume of the lyrics isn't perfect. It will give my daughter the whole Apple Music catalog to sing along with and without the side of price-gouging.
I love karaoke, but I wish we could switch to a system with better visual cues.
For example, anyone remember those Disney sing-a-longs with the bouncy mickey mouse icon? IMO, the icon really helps communicate timing information; you can tell what's coming by the speed and arc of the icon. I'd love to have that system with a wider variety of music.
There's also RockBand's approach, which displays scrolling syllables to be sung when they cross a white line. It's a bit harder to read, but I'd take it over traditional karaoke.
I'm kind of disappointed at the priorities of these streaming services. I guess I'm not in the mainstream, music-consumption-wise.
After all, it makes sense. If you want to engage subscribers to your service, what else would you do? How do most people engage with recorded music? Well, they just… sing along…
I'm also curious (like some on this thread) how they are technically separating the vocals. Is this yet another proprietary music format that artists have to mix/master to? (Like spatial audio)
I'm honestly more surprised the other way. I can't believe it took this long for them to offer even a basic karaoke experience. I thought this would be an obvious feature.
Impressive that they seem to have worked out the licensing, which is usually much more expensive for karaoke (presumably some sort of "public performance" license) than for personal listening/streaming.
So when can we get an Apple Arcade Rock Band/Guitar Hero type game that includes vocals, guitar, bass, keyboard and drums and works with any song on Apple Music?
And how about an automatic Dance Dance Revolution/Just Dance/etc. with the entire Apple Music catalog?
There are a lot of apps that have singing lessons. Yousician may be the most popular.
I'd equate being able to speak a foreign language to singing in key. That's what the apps will be able to help with, at least from what I've seen. If that's what you're looking for, that's a good place to start. However, there is a lot of technique to make your voice sound pleasant and interesting beyond just singing on key. I'm not sure if the apps really help out there that much like an actual vocal coach could.
Right, I'm not expecting anything that comes close to a vocal coach - but I'm not dedicated enough to seek one out. I want to spend 15 minutes a day and get incrementally better with some pretty generic tips.
I'm not sure if this is the same as the DJ apps that remove vocals and other instruments but there seems to be a surge of apps that have this feature. It's impressive how well some of these work and I believe they are mostly based on an open source project (I can't recall the name).
There's a neat trick you can do with songs that don't have much reverb. If you load a track. Clone the channel into a second one and then just "flip" the wave form it will remove all the vocals.
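A minimal sketch of that trick, assuming a stereo track where the vocal is identical in both channels. "Flipping" one channel and summing the two is the same as computing left minus right, which cancels anything panned dead center. The sample values and the function name are illustrative only.

```python
def remove_center(left, right):
    """Return a mono signal with center-panned content cancelled.

    Inverting one channel and summing equals (left - right); the 0.5
    factor just keeps the result within the original amplitude range.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [0.5 * (l - r) for l, r in zip(left, right)]


# Toy mix: the "vocal" sits in both channels, the "guitar" only in the left.
vocal = [0.5, -0.5, 0.25, -0.25]
guitar = [0.25, 0.5, -0.25, -0.5]
left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

karaoke = remove_center(left, right)  # only the guitar survives, at half level
```

Stereo reverb on the vocal differs between channels, so it does not cancel, which is why the trick works best on dry mixes. Anything else panned center (often bass and kick drum) disappears along with the vocal.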
I'm curious if artists can opt out of their songs being included in those that have these capabilities. This seems like another way that a streaming service is exploiting artists to me. It's cool technology definitely, but there are some problematic implications for sure.
It's a cool feature, even if you're just reading along.
Our Amazon Echo Show started doing this recently when listening to songs on Amazon's own Music service. I noticed it, thought it was interesting, commented once to the wife, then went back to just listening.
I wish that before working on this they had fixed the long-standing bug where Apple Music playback on the Apple TV 4K (and possibly other versions) disables the screensaver, making it impossible to use on any OLED TV without risking burn-in.
Weird that this is new. I’ve seen it multiple times over the past few months. Maybe I got opted into a test group or something. I activated it by accident several times and ignored it because I primarily use Apple Music when I’m driving.
Yeah, you’re right, I may have just seen regular lyrics and this adds some features (alternate singing positions and speed). So, it’s an enhancement to the existing lyrics feature.
Apple services sometimes seem incongruent because they are. Other times they're experiments (which can and do fail), or windows into far larger projects. For example, QuickTime for Windows 3.0 was a product of porting significant chunks of pre-NeXT MacOS to x86¹, which was eventually leveraged for Carbon².
What I really want is lyrics on CarPlay. I'm sure there would be a ton of huffing in here about distracted driving, but...well, I don't care! I'd really just use it for confirmation of a word or phrase on a small handful of songs myself. It would be great for passengers, though.
Do they really have to give a name to every new hardware and software feature they come up with? Especially when the feature in question is just karaoke.