1. Number of releases per album individually tagged:
2. The amount of metadata for an album:
When you get used to this kind of high quality metadata, it's just so so sad to see how companies like Spotify treat metadata. As an example, look up Bob Marley & The Wailers on Spotify and try to find original releases, and then compare that to the list found here:
...and the sad part is that the metadata is freely available, with a permissive license.
Consider a recording search for "smells like teen spirit". Any human with cursory knowledge of pop music would point you at the Nirvana single from the 1991 release of Nevermind (in this particular case, it's likely even true in every locale). But MusicBrainz has no notion of popularity, common sense, or the real-world context of any of its entities, so the recording from Nevermind isn't even on the first page of results. Heck, the first result isn't even a Nirvana recording. The second result is from an obscure live bootleg album. In my opinion this should be considered a bug. This stuff matters!
This is an area I've dedicated a lot of time to when integrating MusicBrainz with my project, and it strikes me as something that MetaBrainz could spend time on to make the platform more accessible. Answering simple questions about music is currently quite difficult for newcomers on account of the overwhelming amount of data. Consider a world where it's possible to stream every recording from every release in the MusicBrainz database: it should be easier to make "Alexa, play Dark Side of the Moon" work without it needing to ask whether I mean the 1994 Netherlands CD release.
(FWIW, it's totally possible to build these heuristics on top of MusicBrainz today, but having better built-in support for determining this stuff would be nice. Spotify is absolutely amazing at figuring out what song in its entire catalog should be the top result even when I've only typed a few characters.)
Context certainly matters (first, modified, compilation, remaster, remix, audiophile pressing, and so on) but you can't even nail "canonical" to first release, especially for singles, because there may be early promo mixes, radio mixes, vinyl mixes, iTunes mixes, and so on - all mastered differently.
Most people's idea of "canonical" is really "The version I want to hear without having to specify other details". But that's subjective and likely to be significantly different for some non-trivial percentage of users, especially in different territories.
Spotify probably just makes an informed stab at "most popular" - which is a good heuristic and will work most of the time, but is hard to calculate when you don't have Spotify's stats.
I'm not sure when/if we'll be able to tie it in with MusicBrainz directly, but for someone like exogen, ListenBrainz may be a good basis to figure out relative popularity of various Recordings/Tracks regardless.
> Most people's idea of "canonical" is really "The version I want to hear without having to specify other details". But that's subjective and likely to be significantly different for some non-trivial percentage of users, especially in different territories.
Yup! You are describing the problem literally any search engine faces. And yet, Google/Bing/etc. provide pretty smart results. So, do you think the "Smells Like Teen Spirit" recording by Francis Drake is the BEST first result, as MusicBrainz says it is? Is a live bootleg recording the BEST second result? In any locale? MusicBrainz is NOT primarily a search engine, but all that data has very little value if people (and other software) can't actually find it! This absolutely harms adoption.
OK, so we might not need to nail down a "canonical" version when we live in a world with search ranking scores. I totally realize "canonical" is a bad word choice on my part – but it's really how people think of these things!
> Spotify probably just makes an informed stab at "most popular" - which is a good heuristic and will work most of the time, but is hard to calculate when you don't have Spotify's stats.
I bet they do it that way too, but I think you're throwing in the towel way too early here. :) I have a system that works amazingly well and nearly always chooses the most likely intended recording without any listen count data. MusicBrainz has a LOT of data available to it, so what types of heuristics might make sense here? I use a ranking system that takes all these factors into account and, like Lucene, assigns a score:
• Number of releases & release groups the recording appears on (the most well-known recording is more likely to appear on additional albums like compilations, and more likely to be widely released in lots of countries).
• How old the release is relative to the other search results (earlier matches are more likely to be the original).
• Whether the recording is from a release with a "single from" relation to another album (the target LP is more likely to hold the recording we want).
• Whether it's from a release that's an Album or EP (positive weighting), or Live (negative weighting), whether the recording ONLY appears on Compilation albums (negative weighting), whether it's any other type of release like Bootleg (strong negative weighting).
• Whether the recording has ISRCs entered for it (more well-known recordings are more likely to have ISRCs in the first place, and also more likely for people to have entered them into MusicBrainz).
• Whether MusicBrainz users have entered any tags and ratings for it (weak but positive correlation with how popular it is).
• Domain-specific string similarity metrics; essentially, query expansion that makes sense specifically for song titles & artist names. This lets certain matches remain equivalent when it makes sense (e.g. "mambo number 5", "mambo no. 5", "mambo #5", "mambo number five" should all be exactly equivalent in terms of string matching). Lucene does some of this already of course, but not nearly enough: I have a query expander with hundreds of examples where Lucene does a worse job.
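To make the last bullet concrete, here is a minimal sketch of that kind of title normalization. The rules, the `NUMBER_WORDS` table and the `normalize_title` name are all made up for illustration; this is not the actual query expander, which would need far more cases.

```python
import re

# Hypothetical lookup table; a real expander would cover many more spellings.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
_WORDS = "|".join(NUMBER_WORDS)

def normalize_title(title: str) -> str:
    """Collapse common spellings of 'number N' into one canonical form."""
    text = title.lower()
    # "no. 5", "no 5", "#5" -> "number 5" (only when a digit follows,
    # so phrases like "no one knows" are left alone)
    text = re.sub(r"(?:\bno\b\.?|#)\s*(?=\d)", "number ", text)
    # "number five" -> "number 5"
    text = re.sub(
        rf"\bnumber\s+({_WORDS})\b",
        lambda m: "number " + NUMBER_WORDS[m.group(1)],
        text,
    )
    # collapse any repeated whitespace
    return " ".join(text.split())
```

Both the query and the indexed titles would go through the same function before comparison, so all four "mambo" spellings end up as the same string.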
I can think of more too, that my system doesn't currently use. All that's without relying on any external data source! But if you want to go one better, it's also possible to correlate results with other APIs like WikiData, DBpedia, Spotify, YouTube…
In most cases, I've found that there's enough of a delta between the top score and the second-best score to determine which one is "correct". (Yes, that word, I know…)
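A rough sketch of how a scorer along these lines could be put together, including the "is the gap to second place big enough" check. Every weight, field name and threshold here is an assumption chosen for illustration, not the values from my system, and the "single from" relation is left out to keep it short.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    release_count: int                     # releases the recording appears on
    earliest_year: int                     # year of its earliest release
    release_types: frozenset = frozenset() # e.g. {"Album"}, {"Live"}
    is_bootleg: bool = False
    has_isrc: bool = False
    tag_count: int = 0

def score(c: Candidate, oldest_year: int) -> float:
    """Combine the heuristics into one number (illustrative weights only)."""
    s = float(min(c.release_count, 20))                 # widely released
    s += max(0, 10 - (c.earliest_year - oldest_year))   # close to the oldest match
    if c.release_types & {"Album", "EP"}:
        s += 5
    if "Live" in c.release_types:
        s -= 5
    if c.release_types == {"Compilation"}:              # ONLY on compilations
        s -= 5
    if c.is_bootleg:
        s -= 15
    if c.has_isrc:
        s += 3
    s += min(c.tag_count, 5) * 0.5                      # weak popularity signal
    return s

def best_match(candidates, min_delta=5.0):
    """Return the top candidate, or None if the top two are too close to call."""
    if not candidates:
        return None
    oldest = min(c.earliest_year for c in candidates)
    ranked = sorted(candidates, key=lambda c: score(c, oldest), reverse=True)
    if len(ranked) > 1:
        delta = score(ranked[0], oldest) - score(ranked[1], oldest)
        if delta < min_delta:
            return None
    return ranked[0]
```

Returning `None` when the delta is small is a design choice: it lets the caller fall back to asking the user instead of guessing.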
Ideally MusicBrainz would be on par with a human expert in determining which recording you most likely meant, and I believe that it CAN do this today, but it doesn't.
Of course, MusicBrainz is an open source endeavour. The old search server maintainer was a volunteer from the community. If you believe you can do a better job at running our search server, please join us in #metabrainz at Freenode and introduce yourself.
The more hands we are, the more we can lift. :)
What does "in theory" mean here? Do those tables exist in whole or some part? Is this a matter of indexing an existing data set or hoping some data was acquired by accidental consequence?
Even if it's not collected though, it's data that they at least already have the ability to collect by simply flipping a switch, as opposed to spinning up a whole new ListenBrainz service and hoping it gains traction.
You're absolutely right that we could, in theory, have that data, but we do not currently. Not in any usable form (for this purpose) anyway.
I've been using it daily for a couple of years now (to clean up the tags in my collection - quodlibet integrates nicely with musicbrainz for that) - and I very rarely play something that's not there.
When this is the case I try to add/edit the metadata, but most of the time, they're way ahead of me.
Just to name a few of the other projects: there's AcousticBrainz, collecting acoustic information which may be pretty useful for machine learning; CritiqueBrainz, for collecting user reviews of songs, albums and more; ListenBrainz, an open scrobbling service that a group of people including former last.fm employees initially hacked together in a weekend; and finally BookBrainz, which tries to be what MB is but for books.
During the last year the people running MB have worked on getting companies that use the data to support the project, resulting in a quite impressive list of supporters, including big names like Google, Spotify and the BBC.
MB has also collaborated with our fellow data nerds over at the Internet Archive to create the Cover Art Archive.
In general the project is run by people who equally love both data and hacking. Feel free to stop by on the IRC channels #musicbrainz and #metabrainz on freenode!
I lightly modified a version of the Filemon driver from Sysinternals and wrote a little C program that used the driver to monitor for mp3s being played and then grab the perceptual audio hash of the file using trm.exe from Musicbrainz. It then sent the resulting fingerprint off to my website (written in glorious PHP3 no less!) and you could login with an account to see stats on the music you'd been listening to (done with meta data pulled from Musicbrainz).
Surprisingly, it worked reasonably well ...though I'm very sure that if I looked at the code now I'd run away screaming.
Really cool to see they're still going strong after all these years!
Compare the seven item-specific metadata fields of http://www.last.fm/api/show/track.scrobble (artist, track, album, trackNumber, mbid, albumArtist, duration) to https://listenbrainz.readthedocs.io/en/latest/dev/json.html#... - as ListenBrainz is part of the MetaBrainz "umbrella", one of our own main highlights is that we can now actually submit all MBIDs associated with a file (Artist MBID(s), Release MBID, Release Group MBID, Track MBID, Work MBID(s), possibly Label MBID(s), etc.), not just the Recording MBID, as well as stuff like language, performers, AcoustIDs, ...
Also, ListenBrainz is linked up with MessyBrainz, which should work as a buffer so that even listens submitted without MusicBrainz identifiers can eventually get linked up to the MusicBrainz database.
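To give an idea of what such a submission looks like, here is a small sketch that only builds the JSON body and does not send anything. The field names are written from memory of the JSON documentation linked above, so treat them as assumptions and check them there; `build_listen` is a hypothetical helper and the MBID in the usage example is a placeholder, not a real identifier.

```python
import json
import time

def build_listen(artist, track, release=None, listened_at=None, **additional_info):
    """Build the body for a single listen; MBIDs and extras go in additional_info."""
    metadata = {"artist_name": artist, "track_name": track}
    if release:
        metadata["release_name"] = release
    if additional_info:
        metadata["additional_info"] = dict(additional_info)
    timestamp = listened_at if listened_at is not None else time.time()
    return {
        "listen_type": "single",
        "payload": [{"listened_at": int(timestamp), "track_metadata": metadata}],
    }

# Usage: one listen with MBIDs attached, one with plain text only
# (the second kind is what MessyBrainz is meant to buffer).
tagged = build_listen(
    "Nirvana", "Smells Like Teen Spirit", release="Nevermind",
    listened_at=1500000000,
    recording_mbid="00000000-0000-0000-0000-000000000001",  # placeholder
)
untagged = build_listen("Nirvana", "Smells Like Teen Spirit", listened_at=1500000000)
print(json.dumps(tagged, indent=2))
```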
My solution was to fill in full metadata for all my tracks, though, a pretty big task in itself that I also had to half-automate to achieve. I did consider how to release the stats service for others, too, but realized the number of people with impeccable metadata would be way too low.
My system is actually still running in the background, listening for MPRIS D-Bus track change events and writing them to a simple text file, occasionally flushing the changes from the file to a database for the stats display service -- written also with PHP like most web interfaces of the time.
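The log-then-flush part is simple enough to sketch. This is not the original code (that side was PHP); it is a Python approximation that assumes a tab-separated log file and SQLite as the database, and it leaves out the MPRIS listener that would call `log_track`. All names are hypothetical.

```python
import sqlite3
from pathlib import Path

def log_track(log_path, timestamp, artist, title):
    """Append one track-change event to the plain text log."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(f"{timestamp}\t{artist}\t{title}\n")

def flush_log(log_path, conn):
    """Move all well-formed log lines into the database, then empty the log."""
    path = Path(log_path)
    if not path.exists():
        return 0
    rows = []
    for line in path.read_text(encoding="utf-8").splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1], parts[2]))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listens (ts INTEGER, artist TEXT, title TEXT)"
    )
    with conn:  # commit the inserts as one transaction
        conn.executemany("INSERT INTO listens VALUES (?, ?, ?)", rows)
    # Note: events logged between the read above and this truncate would be lost.
    path.write_text("", encoding="utf-8")
    return len(rows)
```

The text file keeps the listener trivial and crash-tolerant; the database only has to be reachable when the occasional flush runs.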
They never interact with images and audio files, they don't know about metadata at all. They don't use a notebook or a PC. They are vendor locked to iOS or Android. They are not dumb, but fewer and fewer have the initiative to search around and try out new things outside the box. They stay inside their apps, they don't know the vast web outside that can be searched with the Google search engine. (It depends on parents and schools to inspire them to try out more.)
It's not like the previous generation of people all explored the vast web. They didn't. They didn't even use it until relatively recently.
The percentage of those who are intrigued by technology and have the wanderlust to explore the digital landscape is probably exactly the same (perhaps higher now) as in the previous generation. Those people that you refer to as "vendor locked" now would never have even used computers in past generations, or would have used them only for Office apps.
I'm fully into Spotify now, minus the 1000 or so tracks I couldn't match, but damn, does talking about this take me back. Lugging around my 160GB iPod Classic and still not being able to fit all my music. UT2004 instagib & Counter-Strike LAN parties and swapping entire media libraries... movies and series included. It was a fun time :)
This makes playlists resistant to filename changes, moves, or even losing all the actual audio tracks and having to buy them again, all because MusicBrainz provides such accurate metadata.
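A minimal sketch of the idea, assuming the playlist stores recording MBIDs and the library scan yields a `recording_mbid` and `path` per file. Both the data layout and the `resolve_playlist` name are hypothetical.

```python
def resolve_playlist(playlist_mbids, library):
    """Map each recording MBID in the playlist to whatever file currently carries it.

    `library` is a list of dicts with "recording_mbid" and "path" keys,
    e.g. the result of scanning the tags of every file on disk.
    """
    index = {track["recording_mbid"]: track["path"] for track in library}
    found = [(mbid, index[mbid]) for mbid in playlist_mbids if mbid in index]
    missing = [mbid for mbid in playlist_mbids if mbid not in index]
    return found, missing
```

Renaming or moving a file only changes `path` in the next scan; the playlist itself never needs to be touched, and `missing` tells you which recordings to get again.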
You should try out the demo queries linked from that README if you want to get a sense of the depth of information available in their database.
They seem to have everything I throw at them, except for:
1) Extremely new releases (on the order of a-few-hours-after-release)
2) Some niche songs that haven't been officially released (soundtracks for some Korean television shows)
Picard is a metadata editor, doing both audio fingerprinting and manual tagging (while fetching data from MusicBrainz) where that is not available: https://picard.musicbrainz.org/
I remember seeing Robert Kaye wandering around the office when he visited us to talk licensing terms, although as the most junior employee I didn't get to talk to him myself. We also chatted to Col Needham, the founder of IMDB, and asked him "so, how do you become a massive media-encyclopedia site?"; his answer was "it's easy, just start 17 years ago."
Really we had no idea what we were doing, and although we got some surprisingly dedicated users (we sent T-shirts to a couple who'd contributed hundreds of thousands of edits!), the site folded after a few years.
I'm very glad to see that MusicBrainz outlived us and continued to thrive :)
Ever since then I've wanted to know if there are other maintained/curated music databases.
I also didn't realize at first that they offer a public API. The Picard client was decent, but I'd be interested in a command-line solution. Does anyone know if this exists?
No music makes it into my collection unless it's been imported via beets. It has a powerful import/query/alter API, sufficient config options, and a nice plugin system.
It's been a crazy time lurking on IRC and seeing him juggle both huge technical problems that are worth more than a few blog posts as well as most of the business work for the last two years (until he recently hired someone to handle most of the latter).
Of course one shouldn't make this too much of a personality cult and there's always been an amazing team working with him but I can't imagine anyone other than Robert taking MusicBrainz to what it's become today.
There also was freedb (http://www.freedb.org/), not sure if that is still kept up-to-date.
As far as I understand it, its data quality really isn't great and according to their statistics, they've only had just over 7,000 album requests last week.
As I recall, it was pleasant to work with and did what I needed it to quite nicely, aside from a feature that my code had depended on being removed — anonymous/unauthenticated search — at which point the project was already basically dead and not worth trying to fix (that was just the last nail in the coffin). In any case, nice to see that it's still active.
In the end we settled before it ever went to trial, so unfortunately that quote never found its way into a court transcript.
either way, nice to see it
Unlike Wikipedia, there is truly nothing too niche to be part of MB's database. My favourite is a CD with the music that plays while you're on hold calling Lidl Spain. 
You could replace the WAVs with FLACs and save a bunch of space on that account. Maybe enough that you realise you don't need to keep both lossless (WAV/FLAC) and lossy (MP3) copies around. Since FLAC is lossless, it will have the exact same audio quality as the WAV files they'll be sourced from. (And, as you mention, FLAC supports tags directly.)
You'll find a portable player capable of handling vinyl before you find one that plays ogg/opus, but desktop software is mostly fine.
Mp3tag also handles MP4/M4A files wonderfully.
If you're encoding AAC, use either Apple's encoder or one of those from Fraunhofer: Fhg-AAC, available as a paid plug-in or free with Winamp, or the open source FDK AAC implementation, originally for Android but now available cross-platform.
Whoa, I'd never imagined that could be the case and appreciate the heads-up. I've only had experience with the Apple and Fraunhofer encoders so far.
Mind calling out the offender(s)? This summer I'm releasing a best practices guide (and hopefully tools) to help podcasters move to AAC, and that would be useful to mention.
Also for AAC in particular there are a few options https://trac.ffmpeg.org/wiki/Encode/AAC
Very powerful file renaming tool that also leverages TheTVDB, AniDB, TheMovieDB, etc for TV and movies. Both GUI and command-line versions available, I always use the (free) command-line version even though I paid for the GUI.
It also accepts scripts, has smart/fuzzy matching, and is an all-around good renaming tool for other things too.
It's a great tool. I had one problem / wishlist that I couldn't work out, and got a response from the author a few minutes after filing an issue, explaining how to solve my problem.
There's some risk that thetvdb.com's change of API later this year may be a problem, but I suspect a low risk.
https://github.com/dbr/tvnamer , with packages available for most distros AFAIK.
It would be completely against the spirit of the project to close in on itself, and as Leo_Verto mentioned, also pretty hard. With the core data available as CC0 and all the source code needed to run the servers, anyone could legally take all the data and set up a "LibreMusicBrainz" in some hours in the unlikely event that the MetaBrainz Foundation (the organisation created to support MusicBrainz and the other *Brainz projects) should ever flip.
Because of the underlying design and relationships between albums and recordings and musical pieces (or works), once it reaches some level of critical mass you can start to mine the data for things like:
Who has recorded versions of Vivaldi's Four Seasons Spring in London?
Which artists have recorded both Grieg's Piano Concerto and Chopsticks?
Who, other than the Beatles, has recorded "A Day in the Life"?
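As a toy illustration of why the work/recording relationships make questions like these easy, here is a sketch over a handful of invented triples. The data is entirely made up and the flat (artist, work, place) layout is a simplification, not the real MusicBrainz schema.

```python
from collections import defaultdict

# (artist, work, recording location) - invented sample data
RECORDINGS = [
    ("Artist A", "A Day in the Life", "London"),
    ("Artist B", "A Day in the Life", "New York"),
    ("Artist B", "Chopsticks", "Paris"),
    ("Artist C", "Chopsticks", "London"),
    ("Artist C", "Piano Concerto", "London"),
    ("Artist B", "Piano Concerto", "Berlin"),
]

def artists_by_work(recordings):
    """Index: work title -> set of artists who recorded it."""
    index = defaultdict(set)
    for artist, work, _place in recordings:
        index[work].add(artist)
    return index

def recorded_all(recordings, *works):
    """Artists who have recorded every one of the given works."""
    index = artists_by_work(recordings)
    return set.intersection(*(index[w] for w in works))

def recorded_in(recordings, work, place):
    """Artists who recorded the given work at the given place."""
    return {a for a, w, p in recordings if w == work and p == place}

def recorded_except(recordings, work, excluded_artist):
    """Everyone who recorded the work, other than one artist."""
    return artists_by_work(recordings)[work] - {excluded_artist}
```

Each of the three questions above maps onto one of these functions; against the real database they would be joins across works, recordings, artists and places.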
Both MusicBrainz and Freesound are truly international in scope. They cover metadata and sound for Indian classical music and such genres too. The CompMusic research team publishes to both of these.
Edit: CompMusic url - http://compmusic.upf.edu/
I guess it's important to start with my listening style. I go something like David Guetta -> Frank Sinatra -> Depeche Mode -> Slipknot -> Rocky Horror Picture Show. This is a very nomadic pattern. I essentially wanted a really random system of music. I had a lot of thoughts on this, and sort of wish I had more time to work on this almost full time.
* Built-in positive weighting of the song. If you try to increase volume, re-queue that song up. It's done well.
* Built-in negative weighting of the song/selection. If you skip within the first minute of the song, negative weight that genre and song. This goes into.
* Smart shuffling. None of this random integer crap. I want it to see there are X genres in my playlist. The next song will be of a different genre, year, and country.
* Proper geographical distribution. If you're skewing towards just the US then blacklist adding artists from the US. My system went on a South African kick for whatever reason, and <3.
* Ease discovery, you're having new artists pop up all the time. This was the web interface I was working off of. I have a Surface Pro at my desk with the frontend. Easy access to history, lyrics, etc.
* Analyze the song BPM, Key, Wave pattern etc.
* IBM tone analysis of lyrics to denote whether you want happy, sad, or whatever style of music.
* The frontend tied into YouTube, providing a link to the official music video if present.
* Show a tree of how the artists were added. My absolute favorite was 32 levels deep going from Dropkick Murphys to Blondie.
I'm a heavy Spotify user. Discover Weekly usually sticks to EBM music for me and rarely gives me anything new. I want to listen to stuff from Korea, India, Africa, the 20s, 50s and now. None of the systems I've seen work like that. I had a couple of friends use it, and they really liked it, barring the MPD requirement.
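The skip-weighting and smart-shuffle bullets can be sketched roughly like this. It is a simplified reconstruction, not the project's code: the song dict layout, the halving factor and the function names are assumptions, and the volume/re-queue and geography rules are left out.

```python
import random

def weight_of(weights, song):
    """Effective weight = song weight * genre weight (both default to 1.0)."""
    return (weights.get(("song", song["title"]), 1.0)
            * weights.get(("genre", song["genre"]), 1.0))

def on_skip(weights, song, seconds_played, factor=0.5):
    """Skipping within the first minute down-weights both the song and its genre."""
    if seconds_played < 60:
        for key in (("song", song["title"]), ("genre", song["genre"])):
            weights[key] = weights.get(key, 1.0) * factor

def pick_next(previous, candidates, weights, rng=random):
    """Weighted pick among songs that differ in genre, year and country."""
    def differs(song):
        return all(song[k] != previous[k] for k in ("genre", "year", "country"))
    # Fall back to the full list if nothing is different on all three axes.
    pool = [s for s in candidates if differs(s)] or list(candidates)
    return rng.choices(pool, weights=[weight_of(weights, s) for s in pool], k=1)[0]
```

The hard filter gives the genre/year/country hopping, while the weights only bias the choice inside whatever pool survives the filter.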
TouchaTouchaTouchMe and HarmonyGen
There's FreeDB (http://www.freedb.org) which does roughly the same thing, starting from the old CDDB database before Gracenote, and then Sony, bought it. Their database dump is supposedly available.
"The MusicBrainz Database is split into two components for licensing purposes.
"The core data of the database is licensed under the CC0, which is effectively placing the data into the Public Domain. This means that anyone can download and use the core data in any way they see fit. No restrictions, no worries!
"The remaining portions of the database are released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. This allows for non-commercial use of the data as long as MusicBrainz is given credit and that derivative works (works based on the CC licensed data) are also made available under the same license."
It's the constant questioning I do for Wiki sites, since there are multiple for most subjects. Am I alone in this struggle? I wouldn't mind being talked out of using Discogs for the sake of creating / managing metadata that will be the most useful.
Is there a more reliable way to query this data without running a full server on your own?
This is not the first occasion when demand has exceeded capacity, but any capacity added soon gets swallowed up.
Suffice it to say that the MB team is urgently looking into how to stop this happening.
(Note that there's also a "GraphBrainz" project (also not something MetaBrainz is involved with), posted about earlier here: https://news.ycombinator.com/item?id=14479031 )
MusicBrainz is the third project of its kind. Two older projects got bought by the media industry (Sony and Magix). Such a database becomes useless if it doesn't receive updates.
First there was CDDB. CDDB, short for Compact Disc Database, is a database for software applications to look up audio CD (compact disc) information over the Internet. This is performed by a client which calculates a (nearly) unique disc ID and then queries the database. As a result, the client is able to display the artist name, CD title, track list and some additional information. CDDB was invented by Ti Kan around late 1993 as a local database that was delivered with his popular xmcd music player application. CDDB is a licensed trademark of Gracenote. In March 2001, CDDB, now owned by Gracenote, banned all unlicensed applications from accessing their database. As of June 2, 2008, Sony Corp. of America completed acquisition (full ownership) of Gracenote. https://en.wikipedia.org/wiki/CDDB
Then there was freedb. freedb is a database of compact disc track listings, where all the content is under the GNU General Public License. To look up CD information over the Internet, a client program calculates a hash function from the CD table of contents and uses it as a disc ID to query the database. If the disc is in the database, the client is able to retrieve and display the artist, album title, track list and some additional information. It was originally based on the now-proprietary CDDB (Compact Disc DataBase). On October 4, 2006, freedb owner Michael Kaiser announced that Magix had acquired freedb. On June 25, 2007, MusicBrainz – a project with similar goals – officially released their freedb gateway. The latter allows users to harvest information from the MusicBrainz database rather than freedb. https://en.wikipedia.org/wiki/Freedb
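For the curious, the disc ID both services use is small enough to sketch. This follows the classic CDDB algorithm as I understand it: a digit-sum checksum of the track start times, the disc length in seconds, and the track count packed into 32 bits. The function names and the sample table of contents are made up.

```python
def digit_sum(n):
    """Sum of the decimal digits of n."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total

def cddb_disc_id(track_offsets, leadout_offset, frames_per_second=75):
    """Compute the 8-hex-digit disc ID from a CD table of contents.

    Offsets are in CD frames (75 per second) and include the 150-frame
    lead-in, as read from the disc.
    """
    starts = [offset // frames_per_second for offset in track_offsets]
    checksum = sum(digit_sum(s) for s in starts) % 255
    length = leadout_offset // frames_per_second - starts[0]
    return f"{(checksum << 24) | (length << 8) | len(starts):08x}"

# Two slightly different (invented) discs that collide on the same ID,
# which is why the ID is only "nearly" unique.
disc_one = cddb_disc_id([150, 15000, 30000], 45000)
disc_two = cddb_disc_id([150, 15010, 30000], 45000)
```

Because the start times are truncated to whole seconds and then reduced to a digit sum, many distinct discs map to the same ID, so the server may return several candidates for the client to choose from.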