Canonical MusicBrainz data

From MusicBrainz Wiki

Revision as of 16:17, 8 March 2023

We have created several datasets that use the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains many different versions of releases and recordings. We find it important to collect all of these versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well-structured data into something that fits a user’s desired end use.

However, it can be challenging to work out which of the many releases or recordings is the one that “most people will think of as the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that, and we use the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries. When reading the descriptions of our datasets, please keep in mind that “canonical” means the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.

As with our database dumps, in order to keep the MetaBrainz Foundation operating so that these datasets can be maintained and updated further, we require financial support from our commercial users. Without this support, the future of the datasets cannot be guaranteed. As such, even when a specific dataset is available under the Creative Commons Zero (CC0) license (public domain), we still need commercial users of the data to support us, on a moral basis rather than a legal one!

Our canonical datasets include:

Canonical Release Mapping

For each Release Group in MusicBrainz (e.g. Miles Davis - Kind of Blue), this dataset identifies one Release that is canonical for that Release Group. Given a Release Group, you can use this dataset to look up the Release that is considered canonical for it.

Dump format

File: canonical_release_redirect.csv
Format: CSV
Columns: release_mbid, canonical_release_mbid, release_group_mbid
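As a rough illustration of how this mapping might be used, the sketch below reads the CSV into two dictionaries: one keyed by release and one keyed by release group. It assumes the file starts with a header row that carries the column names listed above, which is not confirmed here. The function name `load_canonical_releases` and the sample rows are hypothetical placeholders, not real MBIDs.

```python
import csv
import io


def load_canonical_releases(csv_file):
    """Build lookup tables from an open canonical_release_redirect.csv file.

    Assumption: the first row is a header with the documented column names.
    """
    by_release = {}
    by_release_group = {}
    for row in csv.DictReader(csv_file):
        by_release[row["release_mbid"]] = row["canonical_release_mbid"]
        by_release_group[row["release_group_mbid"]] = row["canonical_release_mbid"]
    return by_release, by_release_group


# Placeholder data for illustration only; real rows contain MBIDs (UUIDs).
SAMPLE = """release_mbid,canonical_release_mbid,release_group_mbid
rel-a,rel-canon,rg-1
rel-b,rel-canon,rg-1
"""

by_release, by_release_group = load_canonical_releases(io.StringIO(SAMPLE))
print(by_release["rel-a"])        # canonical release for one specific release
print(by_release_group["rg-1"])   # canonical release for a release group
```

With the real dump, `io.StringIO(SAMPLE)` would be replaced by an open file handle, for example `open("canonical_release_redirect.csv", newline="", encoding="utf-8")`.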

Canonical Recording Mapping

For each recording in MusicBrainz, this dataset identifies the most canonical version of that recording. This has the effect of redirecting all variations of a recording (single, EP, live recordings, remasters) to a single recording, often the version that appears on the first album where it was released.

Dump format

File: canonical_recording_redirect.csv
Format: CSV
Columns: recording_mbid, canonical_recording_mbid, canonical_release_mbid
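The redirect idea can be sketched as a simple dictionary lookup. This example assumes a header row with the documented column names, and it also assumes that a recording with no row in the file can be treated as already canonical; neither is confirmed by this page. The names `load_recording_redirects` and `canonicalize` and the sample identifiers are hypothetical.

```python
import csv
import io


def load_recording_redirects(csv_file):
    """Map recording_mbid -> (canonical_recording_mbid, canonical_release_mbid)."""
    redirects = {}
    for row in csv.DictReader(csv_file):
        redirects[row["recording_mbid"]] = (
            row["canonical_recording_mbid"],
            row["canonical_release_mbid"],
        )
    return redirects


def canonicalize(recording_mbid, redirects):
    """Return the canonical recording for a recording.

    Assumption: a recording without a redirect row is returned unchanged.
    """
    canonical_recording, _canonical_release = redirects.get(
        recording_mbid, (recording_mbid, None)
    )
    return canonical_recording


# Placeholder data for illustration only; real rows contain MBIDs (UUIDs).
SAMPLE = """recording_mbid,canonical_recording_mbid,canonical_release_mbid
rec-live,rec-album,rel-album
rec-remaster,rec-album,rel-album
"""

redirects = load_recording_redirects(io.StringIO(SAMPLE))
print(canonicalize("rec-live", redirects))
print(canonicalize("rec-unknown", redirects))
```

Collapsing variations this way means that statistics gathered per recording, such as listen counts, accumulate on one entry instead of being spread across every version.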


Canonical Metadata

A list of metadata used to create the two canonical mappings.

Dump format

File: canonical_musicbrainz_data.csv
Format: CSV
Columns: id, artist_credit_id, artist_mbids, artist_credit_name, release_mbid, release_name, recording_mbid, recording_name, combined_lookup, score, year

All fields except the ones defined below are common MusicBrainz fields, whose definitions you can find in our schema docs. The remaining fields are:

  • combined_lookup: This is composed of the artist_credit_name and recording_name, with punctuation and superfluous whitespace removed and any non-ASCII characters converted into ASCII (see https://pypi.org/project/Unidecode/ for more details on this). This field therefore contains all of the useful components of artist_credit_name and recording_name in a way that makes it easy to look up tracks, especially if you have fuzzy search capabilities.
  • score: The score column indicates a priority order of all of the items. Some searching systems, such as Typesense, require their data to be ordered
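To build a matching key on the query side, something like the following sketch could approximate the combined_lookup normalization. The exact rules are not spelled out here, so lowercasing the text and dropping all whitespace (not only superfluous whitespace) are assumptions, and `make_combined_lookup` is a hypothetical name. The fallback used when Unidecode is not installed only strips accents and is not equivalent to the real package.

```python
import re

try:
    from unidecode import unidecode  # the package referenced in the text above
except ImportError:
    import unicodedata

    def unidecode(text):
        # Rough stand-in: strips accents only; not equivalent to Unidecode.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def make_combined_lookup(artist_credit_name, recording_name):
    """Approximate a combined_lookup key from an artist credit and recording name."""
    # Transliterate to ASCII first so accented letters survive the filter below.
    text = unidecode(artist_credit_name + recording_name)
    # Assumption: lowercase, then drop everything that is not an ASCII
    # letter or digit (punctuation and all whitespace).
    return re.sub(r"[^a-z0-9]+", "", text.lower())


print(make_combined_lookup("Beyoncé", "Déjà Vu"))
```

A key produced this way can be compared against the combined_lookup column, or passed to a fuzzy search engine, only if the normalization really matches the one used to build the dataset, so it is worth checking a few known rows first.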