On this page we describe several datasets with the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains a lot of different versions of releases or different versions of recordings. We find it important to collect all of these different versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well structured data into something that fits a user’s desired end-use.
However, sometimes it can be challenging to work out which of the many releases/recordings is the one that “most people will think of the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that and we’re using the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries.
When looking at the descriptions our datasets, please consider that “canonical” implies the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.
As with our database dumps, in order to keep the MetaBrainz Foundation operating so that these datasets can be maintained and updated further, we require financial support from our commercial supporters. Without this support, the future of the datasets cannot be guaranteed. As such, even when a specific dataset is available under the Creative Commons Zero (CC0) license (public domain), we still need commercial supporters of the data to support us, on a moral basis rather than a legal one!
The MusicBrainz canonical data dump contains information about MusicBrainz Recordings, Releases, and Artists. The dataset has been developed to make it easier to reason about the core metadata in the MusicBrainz database, providing one single record for each set of mostly equivalent releases and recordings in the database.
Because of the nature of the MusicBrainz database it contains many instances where metadata exists for many different versions (formats, reissues) of the same album, or many versions (live recordings, alternate versions) of a recording.
However for some tasks such as music recommendation, it’s often useful to be able to use a single identifier when talking about a recording or a release, both in order to improve the quality of a training dataset for machine learning and when providing results to an end user.
The MusicBrainz canonical data dump provides a mapping of all Releases and Recordings in the MusicBrainz database to a single “canonical” version.
|Canonical MusicBrainz Data documentation
|Allowed, but financial support strongly urged, even for CC0 data.
|Twice a month, on the 1st and 15th.
|Creative Commons Zero (CC0)
|zstd compressed CSV files
The MetaBrainz team has also released the Music Listening Histories Dataset+ dataset, an improved version of the original MLHD, released by the DDMAL Lab at McGill Univeristy.
The original MLHD dataset contained a number of errors that were due to poor data matching algorithms when this data was originally collected. Problems that we encountered and fixed in the dataset:
Follow redirects: In MusicBrainz there are times when a data entity was entered more than once, so any duplicates must be deleted from the database. In order to not lose the old MBID, we enter an entry into our redirect tables, keeping track of all of them. In the MHLD+ dataset, we checked each recording and ensured that the dataset contains the most recent and correct version.
Identify Canonical Recordings: Using our cannonical datasets (see above) we've identified all the canonical versions of releases and recordings and used them, rather than the non-canonical versions.
Lookup correct Artist and Releases: Once we identified the canonical recordings, we looked up the canonical release and the correct artist MBID for each of the datapoints. The original matching algorithm often conflated two similarly named artists as the same artist, greatly dimishing the value of this dataset.
Other data problems: Some entries in the dataset has some data elements missing, making those data points useless. We've removed all the data points that contained errors that could not be positively resolved into the MusicBrainz database.
Better compression: The dataset is compressed using zstandard, resulting in a 50% reduction in compressed size on disk.
|Music Listening History Dataset+ documentation
|Not allowed, as per original dataset terms.
|Never gonna give you up...
|None. Non-commercial use only.
|zstd compressed CSV files