> Consider that someone legally backs up his own files to MU and never share...

iamben · on Nov 24, 2012

Devil's advocate: what if we've bought our MP3s from the same place?

DanBC · on Nov 24, 2012

That's an important point that I forgot about.

ZoFreX · on Nov 24, 2012

> But if I rip a CD track to MP3 and you rip a CD track to MP3 are we going to get files that have matching hashes? (Even if we use the same settings, and the same meta information.)

Not unless you are both using a very complex ripping process, very good CD drives (for a strange and rare definition of "very good"), have configured the ripping program 100% correctly, and are using the exact same version of the same MP3 encoder.

That said, if you have both bought the MP3 from the same online store, the files would of course be identical. And if you rip it in iTunes but pay for iTunes Match, then when you upload your rip to their cloud it will analyse it, realise it's a song it has in its library (if it does), remember that you have that song, and throw away your data. Of course, it may not be legal for all cloud storage providers to de-duplicate their data this aggressively - Apple only does this for files they already have agreements in place to legally distribute.
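For context on why matching hashes matter here, this is a minimal sketch of hash-based de-duplication, assuming a store that keys uploads by SHA-256 of the raw bytes. `DedupStore` and `upload` are hypothetical names for illustration, not any provider's actual design (and iTunes Match, as described above, analyses the audio rather than comparing file hashes).

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blobs = {}   # hash -> file bytes, stored once
        self.owners = {}  # hash -> set of users who uploaded those bytes

    def upload(self, user, data):
        """Store data for user; return (hash, True if the bytes were new)."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        # Either way, remember that this user has this file.
        self.owners.setdefault(digest, set()).add(user)
        return digest, is_new
```

Two users only share a stored copy if their files match byte for byte, which is the condition the rest of this comment picks apart.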

-- more details for the curious --

Ripping audio from CDs is a depressingly inexact process compared to copying data off them. To guarantee you rip the data as accurately as possible (which is the only simple way to guarantee your copies match) you need to be doing what is commonly called a "secure rip". Normally when you rip a CD it rips in burst mode, which is very fast as it reads the data linearly, once, and makes good use of the cache on the drive. However, if a read error occurs, what you want to happen is for the program to go back and try reading that sector again. This is made very complicated by the read cache! You can buy drives that don't have them, though, or have an option to turn them off (note: doesn't actually work on many drives that "support" it). Even if your drive does have a cache, decent ripping programs can defeat it by instructing the drive to read enough data from another location to evict the sector that needs re-reading. (Side note: your ripping now takes 40x longer in many cases)
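The re-read idea can be sketched roughly as follows. This is a simplified illustration, not how any particular ripper works: `read_sector` and `flush_cache` are hypothetical callables standing in for the drive, and "accept a sector once two consecutive reads agree" is an assumed acceptance rule.

```python
def secure_read(read_sector, flush_cache, lba, max_attempts=16):
    """Re-read one sector until two consecutive reads return the same bytes.

    read_sector(lba) -> bytes and flush_cache() are hypothetical stand-ins
    for the drive. flush_cache represents reading enough data elsewhere to
    evict the sector from the drive's cache before each attempt.
    """
    previous = None
    for _ in range(max_attempts):
        flush_cache()              # make sure the next read hits the disc
        data = read_sector(lba)
        if data == previous:       # two matching reads in a row: accept
            return data
        previous = data
    raise IOError(f"sector {lba}: no two consecutive reads matched")
```

Every sector costs at least two reads plus the cache flushes, which is where the slowdown comes from.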

So that's the complex ripping process. Aside from caching, the drive specs have another role to play: read offset. When instructed to read a given audio sector, drives miss by a constant amount, measured in samples (such as +6 samples). You need to find your drive's offset and put that setting in the ripping program to compensate for this. This introduces an extra problem though, which is that with a non-zero offset you are going to need to ask your drive to read either before 0:00:00.00 or after the end of the last track in order to get all of the data - but not all drives support reading into the lead-in or lead-out.
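A small sketch of what offset compensation does to the data, working on a list of samples. The sign convention is an assumption made for this example (positive `offset` means the drive started reading that many samples late), and `correct_offset` is a hypothetical helper. Samples that would have to come from the lead-in or lead-out are filled with silence here, as if the drive could not read there.

```python
def correct_offset(raw, offset, silence=0):
    """Shift a raw rip back into place, padding the unreadable edge.

    Assumed convention for this sketch: offset > 0 means raw[i] actually
    holds disc sample i + offset, so the data is shifted right and the
    first samples are lost; offset < 0 is the mirror image.
    """
    k = min(abs(offset), len(raw))
    if offset >= 0:
        return [silence] * k + raw[:len(raw) - k]
    return raw[k:] + [silence] * k
```

With a ten-sample "disc" and an offset of 2, everything from sample 2 onwards lines up again after correction, but the first two samples are just padding. Two rippers that handle that edge differently get different bytes.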

Assuming that we have correctly defeated our cache, have managed to read the CD with no read errors occurring, no uncaught errors have occurred either (rare, but possible with some drive/CD combinations, particularly CDs with DRM), and that we have adjusted for the offset and managed to read 100% of the data...

Well, you still have the pre-track gaps to worry about. On a CD, let's say track 1 starts at 0:00 and finishes at 3:20. Track 2 could start at 3:25! On some CD players, you will see the time go from -0:05 to 0:00 while you are in that "pre-track gap". (It's actually a cool feature - it means you can have a bridge between tracks so the CD plays as a continuous mix from start to finish, but can skip around tracks and not hear a few seconds of one song going into another)

When converting them to separate audio files though (which not everyone does, once you descend into the audiophile underworld) you then have to decide what to do with the data in that gap - discard it, append it to the end of the previous track, or prepend it to the next one? So even though this setting is somewhat subjective, it would need to be the same for the data to be identical!
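The three choices can be sketched as follows, assuming the disc is one long list of samples and each track is described by a hypothetical `(gap_start, track_start)` pair of indices. Under "append", the gap before the first track has no previous track to go to, so this sketch simply drops it.

```python
def split_tracks(pcm, marks, policy):
    """Split disc audio into tracks under a given pre-track gap policy.

    marks: list of (gap_start, track_start) sample indices per track
           (hypothetical layout; gap_start == track_start means no gap).
    policy: "discard", "append" (gap goes to the previous track),
            or "prepend" (gap goes to the track it precedes).
    """
    tracks = []
    for i, (gap_start, start) in enumerate(marks):
        if i + 1 < len(marks):
            next_gap, next_start = marks[i + 1]
        else:
            next_gap = next_start = len(pcm)  # last track runs to disc end
        if policy == "discard":
            lo, hi = start, next_gap
        elif policy == "prepend":
            lo, hi = gap_start, next_gap
        elif policy == "append":
            lo, hi = start, next_start
        else:
            raise ValueError(f"unknown policy: {policy}")
        tracks.append(pcm[lo:hi])
    return tracks
```

The same disc data yields three different sets of track files, so two rips only match if both people picked the same policy.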

Luckily there is a convention there (append to previous track) so maybe you both chose that. Great! At this point it is actually very likely that you have the exact same raw PCM data, bit-for-bit!* But now you're going to screw it all up by compressing it.

Not all MP3 encoders are alike, as many people know. The consequence is that two programs can produce MP3s from the same raw data that are both completely specification compliant, yet one is more efficiently compressed than the other because it made better compression decisions (i.e. smaller file size for the same quality, or better quality for the same size). Of course, this means the files they produce are very different - so both people would need to be using the same encoder.

Of course, the same logic applies to different versions of a single encoder! And, perhaps you would even need to be using the exact same binaries? I don't know quite enough at the low level to be sure about this, but I would not be surprised if using different compilers or different optimization methods when compiling a particular version to binary could affect the MP3 files it produces.

Oh, and of course you would need to be passing the exact same options to the MP3 encoder.

And then even if you entered the same tag data, depending which tagging program you used, the files might be different again! There are several different specs for tagging files to begin with, and there's a little too much room for interpretation in them. Some taggers will use the TPE2 "Band/orchestra/accompaniment" tag to represent the "album artist", some use a custom TXXX tag with the key "ALBUMARTIST", and some do the same but with "ALBUM ARTIST". Just to make it even more fun, the majority of tagging programs are definitively incorrect with regard to the specification, and the cherry on top is that most of them add an extra tag, hidden from you in the GUI, recording for posterity which program you tagged the file with!

But, assuming you both entered the exact same tag data into the exact same version of the exact same tagging program, you would have the same files - with the exception of tagging programs that have non-deterministic behaviour (hello, Windows Media Player!).
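To illustrate how much the tags alone affect a whole-file hash, here is a minimal sketch that hashes only the bytes between an ID3v2 tag at the start and an ID3v1 tag at the end. `strip_id3` and `audio_hash` are hypothetical helpers. The sketch ignores other tag formats (APEv2, for instance) and any encoder information stored inside the audio frames, so it is an illustration rather than a robust comparison.

```python
import hashlib


def strip_id3(data):
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe
    # size (7 bits per byte) that excludes the 10-byte header itself.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        total = 10 + size
        if data[5] & 0x10:      # ID3v2.4 footer flag: 10 more bytes
            total += 10
        data = data[total:]
    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data


def audio_hash(data):
    """SHA-256 of the file contents with ID3 tags removed."""
    return hashlib.sha256(strip_id3(data)).hexdigest()
```

Two files with identical audio but different tags hash differently as whole files, yet give the same `audio_hash`. A service that de-duplicates on whole-file hashes would treat them as unrelated.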

You might be reading all this and thinking that the odds sound like zero. In practice there is a group of people (digital age audiophiles) ripping CDs using the same versions of the same software, because they all always use the latest stable versions of their entire toolchain. They're all using the exact same binary to encode MP3s, too, because LAME is the favourite and there are very few good binaries of it around (most sites are passing around the same file for any given version). They even get the same metadata, because the software they use gets it automatically from an online database. It's definitely conceivable that some of them have independently produced identical files!

P.S. I only remembered right at the end that most audio CDs have many different pressings which often differ at the 1s and 0s level, even if there is no observable difference between them ;)

* Of course, CDs do have correction codes, but sadly the vast majority of drives have C2 support so bad that it is of no use at all.

You can even check: there's an online database of checksums to ensure your rip is correct - the caveat being that it doesn't check the very beginning or end of the file, due to the lead-in / lead-out problem I mentioned already.

moconnor · on Nov 24, 2012

If you used the same encoder, sure. CDs contain digital audio; each copy is identical.