Canonical MusicBrainz data

From MusicBrainz Wiki
Revision as of 16:31, 8 March 2023 by RobertKaye (talk | contribs)

We have created several datasets that use the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains many different versions of releases and recordings. We find it important to collect all of these versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well-structured data into something that fits a user’s desired end use.

However, it can be challenging to work out which of the many releases or recordings is the one that “most people will think of as the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that, and we’re using the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries. When looking at the descriptions of our datasets, please keep in mind that “canonical” implies the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.

As with our database dumps, in order to keep the MetaBrainz Foundation operating so that these datasets can be maintained and updated further, we require financial support from our commercial users. Without this support, the future of the datasets cannot be guaranteed. As such, even when a specific dataset is available under the Creative Commons Zero (CC0) license (public domain), we still need commercial users of the data to support us, on a moral basis rather than a legal one!

Our canonical datasets include:

Canonical Release Mapping ( canonical_release_redirect.csv )

For each Release Group in MusicBrainz (e.g. Miles Davis - Kind of Blue), this dataset identifies one Release that is canonical for that Release Group. Given a Release Group, you can use this dataset to look up the Release that is considered canonical for it.

Columns:

 release_mbid
 canonical_release_mbid
 release_group_mbid
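
As a rough illustration of how this mapping might be consumed, the sketch below loads the file into two lookup tables. It assumes the CSV has a header row using the column names listed above; the function name and the sample values are hypothetical placeholders, not real MBIDs.

```python
import csv
import io


def load_canonical_releases(csv_file):
    """Build two lookups from canonical_release_redirect.csv rows.

    Assumes a header row with the three column names listed above.
    """
    by_release = {}        # release_mbid -> canonical_release_mbid
    by_release_group = {}  # release_group_mbid -> canonical_release_mbid
    for row in csv.DictReader(csv_file):
        by_release[row["release_mbid"]] = row["canonical_release_mbid"]
        by_release_group[row["release_group_mbid"]] = row["canonical_release_mbid"]
    return by_release, by_release_group


# Made-up placeholder rows standing in for the real file contents.
SAMPLE = """release_mbid,canonical_release_mbid,release_group_mbid
rel-a,rel-c,rg-1
rel-b,rel-c,rg-1
rel-d,rel-e,rg-2
"""

by_release, by_release_group = load_canonical_releases(io.StringIO(SAMPLE))
print(by_release["rel-a"])       # canonical release for release rel-a
print(by_release_group["rg-1"])  # canonical release for release group rg-1
```

With the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, for example `open("canonical_release_redirect.csv", newline="", encoding="utf-8")`.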

Canonical Recording Mapping ( canonical_recording_redirect.csv )

For each recording in MusicBrainz, this dataset identifies the most canonical version of that recording. This has the effect of redirecting all variations of a recording (single, EP, live recordings, remasters) to a single recording, often the version that appears on the first album where it was released.

Columns:

 recording_mbid
 canonical_recording_mbid
 canonical_release_mbid

Canonical Metadata ( canonical_musicbrainz_data.csv )

A list of metadata used to create the two canonical mappings.

Columns:

 id
 artist_credit_id
 artist_mbids
 artist_credit_name
 release_mbid
 release_name
 recording_mbid
 recording_name
 combined_lookup
 score
 year

All fields except the ones defined below are common MusicBrainz fields, whose definitions you can find in our schema docs. The remaining fields are:

  • combined_lookup: This is composed of the artist_credit_name and recording_name, with punctuation and superfluous whitespace removed and any non-ASCII characters converted into ASCII (see https://pypi.org/project/Unidecode/ for more details on this). This field therefore contains all of the useful components of artist_credit_name and recording_name in a form that makes it easy to look up tracks, especially if you have fuzzy search capabilities.
  • score: The score column indicates a priority order of all of the items. Some search systems, such as Typesense, require their data to be ordered