Canonical MusicBrainz data

From MusicBrainz Wiki
Revision as of 14:46, 10 March 2023 by RobertKaye (talk | contribs)

We have created several datasets that use the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains many different versions of each release and recording. We find it important to collect all of these versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well-structured data into something that fits a user’s desired end-use.

However, it can be challenging to work out which of the many releases/recordings is the one that “most people will think of as the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that, and we’re using the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries. When looking at the descriptions of our datasets, please keep in mind that “canonical” means the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.

As with our database dumps, in order to keep the MetaBrainz Foundation operating so that these datasets can be maintained and updated further, we require financial support from our commercial users. Without this support, the future of the datasets cannot be guaranteed. As such, even when a specific dataset is available under the Creative Commons Zero (CC0) license (public domain), we still need commercial users of the data to support us, on a moral basis rather than a legal one!

Our canonical datasets include:

Canonical Release Mapping ( canonical_release_redirect.csv )

For each Release Group in MusicBrainz (e.g. Miles Davis - Kind of Blue), this dataset identifies one Release that is canonical for that Release Group. Given a Release Group, you can use it to look up the canonical Release directly.

Where to download: [http://data.metabrainz.org/pub/musicbrainz/canonical_data/ Canonical Downloads]
Updated: 1st and 15th of each month
License: Creative Commons Zero (CC0)
Columns:

 release_mbid
 canonical_release_mbid
 release_group_mbid
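
As a minimal sketch of how this mapping might be used, the Python below reads the CSV and builds a lookup from Release Group MBID to canonical Release MBID. It assumes the file starts with a header row naming the three columns listed above; the function name and the sample rows are hypothetical, and the sample IDs are placeholders rather than real MBIDs.

```python
import csv
import io

def load_canonical_releases(csv_file):
    """Map release_group_mbid -> canonical_release_mbid.

    Assumes a header row with the three column names listed above.
    """
    canonical = {}
    for row in csv.DictReader(csv_file):
        canonical[row["release_group_mbid"]] = row["canonical_release_mbid"]
    return canonical

# Placeholder rows for illustration only; these are not real MBIDs.
sample = io.StringIO(
    "release_mbid,canonical_release_mbid,release_group_mbid\n"
    "rel-a,rel-a,rg-1\n"
    "rel-b,rel-a,rg-1\n"
    "rel-c,rel-c,rg-2\n"
)
release_for_group = load_canonical_releases(sample)
print(release_for_group["rg-1"])
```

For the downloaded file, the same function could be given `open("canonical_release_redirect.csv", newline="", encoding="utf-8")` in place of the in-memory sample.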

Canonical Recording Mapping ( canonical_recording_redirect.csv )

For each recording in MusicBrainz, this dataset identifies the most canonical version of that recording. This has the effect of redirecting all variations of a recording (single, EP, live recordings, remasters) to a single recording, often the version that appears on the first album where it was released.

Where to download: [http://data.metabrainz.org/pub/musicbrainz/canonical_data/ Canonical Downloads]
Updated: 1st and 15th of each month
License: Creative Commons Zero (CC0)
Columns:

 recording_mbid
 canonical_recording_mbid
 canonical_release_mbid
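
The redirect described above could be applied roughly as follows. This sketch assumes a header row with the listed column names, and it also assumes that a recording MBID missing from the file should be treated as already canonical; neither point is confirmed here, so check both against the downloaded file. The function names and sample IDs are hypothetical placeholders.

```python
import csv
import io

def load_recording_redirects(csv_file):
    """Map recording_mbid -> canonical_recording_mbid."""
    redirects = {}
    for row in csv.DictReader(csv_file):
        redirects[row["recording_mbid"]] = row["canonical_recording_mbid"]
    return redirects

def canonical_recording(redirects, recording_mbid):
    # Assumption: an MBID that is absent from the file is returned unchanged.
    return redirects.get(recording_mbid, recording_mbid)

# Placeholder rows for illustration only; these are not real MBIDs.
sample = io.StringIO(
    "recording_mbid,canonical_recording_mbid,canonical_release_mbid\n"
    "rec-live,rec-album,rel-x\n"
    "rec-remaster,rec-album,rel-x\n"
)
redirects = load_recording_redirects(sample)
print(canonical_recording(redirects, "rec-live"))
```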

Canonical Metadata ( canonical_musicbrainz_data.csv )

A list of the metadata that the canonical release and canonical recording mappings refer to. While this dataset does not provide any new features, we provide it as a helper for quickly accessing the MusicBrainz data without having to download the entire MusicBrainz database.

This table serves as the basis for our ListenBrainz MBID mapper, which takes incoming listen data consisting only of an Artist Name and a Track Name and attempts to work out which track in MusicBrainz is the canonical one. The combined_lookup field is particularly useful for creating such a mapping; see the detailed column descriptions below.

If you have a body of music metadata that needs to get matched to MusicBrainz, this table can be quite helpful! Using the combined_lookup field you can quickly implement a fuzzy metadata lookup for all of the data elements that you need to find MusicBrainz IDs for.

Where to download: [http://data.metabrainz.org/pub/musicbrainz/canonical_data/ Canonical Downloads]
Updated: 1st and 15th of each month
License: Creative Commons Zero (CC0)
Columns:

 id
 artist_credit_id
 artist_mbids
 artist_credit_name
 release_mbid
 release_name
 recording_mbid
 recording_name
 combined_lookup
 score
 year

All fields except the ones defined below are common MusicBrainz fields, whose definitions you can find in our schema docs. The remaining fields are:

combined_lookup: This is composed of the artist_credit_name and recording_name, with punctuation and superfluous whitespace removed and any non-ASCII characters converted into ASCII (see https://pypi.org/project/Unidecode/ for more details on this). This field therefore contains all of the useful components of artist_credit_name and recording_name in a way that makes it easy to look up tracks, especially if you have fuzzy search capabilities.
score: The score column indicates a priority order of all of the items. Some search systems, such as Typesense, require their data to be ordered.
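
To match your own metadata against combined_lookup, you would need to normalize it in a similar way. The sketch below is one possible approximation, not the exact procedure used to build the dataset: it uses the standard library's unicodedata as a rough stand-in for Unidecode, and it assumes the result is lowercased with everything except letters and digits removed. Both choices are assumptions, so compare the output with a few rows of the real file before relying on it. The function names are hypothetical.

```python
import re
import unicodedata

def to_ascii(text):
    # Rough stand-in for Unidecode: split accented characters into
    # base letter + combining mark, then drop anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def make_combined_lookup(artist_credit_name, recording_name):
    combined = to_ascii(artist_credit_name + recording_name)
    # Assumption: lowercase, then keep only letters and digits.
    return re.sub(r"[^a-z0-9]+", "", combined.lower())

print(make_combined_lookup("Beyoncé", "Déjà Vu"))  # prints "beyoncedejavu"
```

Unlike Unidecode, this stand-in does not transliterate non-Latin scripts (it simply drops them), so the real library is the better choice for data that contains such names.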

Using the combined_lookup and score fields, you can index this data with Typesense and then start making fuzzy lookup queries against it to match your own data to MusicBrainz.
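
To illustrate the idea of a fuzzy lookup without a search server, the sketch below uses Python's standard difflib in place of Typesense. It is only a small in-memory stand-in: the index contents, the placeholder IDs, the function name and the 0.8 cutoff are all assumptions made for the example. The score column is left out because its ordering direction is not specified here.

```python
import difflib

# Hypothetical index: combined_lookup value -> recording_mbid (placeholder IDs).
index = {
    "milesdavissowhat": "rec-1",
    "milesdavisblueingreen": "rec-2",
}

def fuzzy_lookup(index, query, cutoff=0.8):
    """Return the ID of the closest combined_lookup key, or None."""
    matches = difflib.get_close_matches(query, list(index), n=1, cutoff=cutoff)
    return index[matches[0]] if matches else None

# A misspelled query ("sowat" instead of "sowhat") still finds the entry.
print(fuzzy_lookup(index, "milesdavissowat"))
```

A linear scan like this does not scale to the full dataset, which is where a dedicated search index such as Typesense comes in.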