Canonical MusicBrainz data

From MusicBrainz Wiki

Latest revision as of 13:27, 5 May 2023

We have created several datasets that use the term “canonical”. Since MusicBrainz aims to catalog all released music, the database contains many different versions of releases and recordings. We find it important to collect all of these different versions, but in the end it is too much data for most of our users. Fortunately, it is easy to combine multiple pieces of well-structured data into something that fits a user’s desired end-use.

However, sometimes it can be challenging to work out which of the many releases/recordings is the one that “most people will think of as the most representative version”. Even defining what this means is incredibly difficult, but we’ve attempted to do just that, and we’re using the results of this work in our production systems on ListenBrainz to map incoming listens to MusicBrainz entries. When looking at the descriptions of our datasets, please consider that “canonical” implies the most representative version. Each of our canonical datasets has a more detailed description of what “canonical” means in that given dataset.

Please note that we do not guarantee that our canonical MBIDs will be stable over time. What is considered to be canonical in one data dump may not be canonical in the next data dump (for example after a release or recording merge). We don't expect this to happen very frequently, but it can happen.

As with our database dumps, in order to keep the MetaBrainz Foundation operating so that these datasets can be maintained and updated further, we require financial support from our commercial users. Without this support, the future of the datasets cannot be guaranteed. As such, even when a specific dataset is available under the Creative Commons Zero (CC0) license (public domain), we still need commercial users of the data to support us, on a moral basis rather than a legal one!

Canonical Release Mapping (canonical_release_redirect.csv)

For each release group in MusicBrainz (e.g. Miles Davis - Kind of Blue), this dataset identifies one release which is canonical for the release group. Given a release group, this dataset can easily give you the release that is considered canonical for it, for example to get a representative tracklist for an album.

Where to download: Canonical Downloads (https://metabrainz.org/datasets/derived-dumps#canonical)
Updated: Twice a month, on the 1st and 15th.
License: Creative Commons Zero (CC0)
Columns:

 release_mbid
 canonical_release_mbid
 release_group_mbid
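
As a rough illustration of how this mapping could be used, the sketch below loads the CSV into a dictionary and looks up the canonical release for a release group. It assumes the file is comma-separated with a header row that matches the column names above; the function names and the sample MBIDs are made up for the example.

```python
import csv
import io


def load_release_redirects(csv_file):
    """Map each release_mbid to (canonical_release_mbid, release_group_mbid).

    Assumes a header row naming the three columns listed above.
    """
    redirects = {}
    for row in csv.DictReader(csv_file):
        redirects[row["release_mbid"]] = (
            row["canonical_release_mbid"],
            row["release_group_mbid"],
        )
    return redirects


def canonical_release_for_group(redirects, release_group_mbid):
    """Return the canonical release of a release group, or None if unknown."""
    for canonical_mbid, group_mbid in redirects.values():
        if group_mbid == release_group_mbid:
            return canonical_mbid
    return None


# Placeholder identifiers for illustration only; real MBIDs are UUIDs.
SAMPLE = """release_mbid,canonical_release_mbid,release_group_mbid
release-a,release-c,group-1
release-b,release-c,group-1
release-x,release-y,group-2
"""

redirects = load_release_redirects(io.StringIO(SAMPLE))
print(redirects["release-a"][0])
print(canonical_release_for_group(redirects, "group-2"))
```

With the real file you would pass an open file handle instead of the in-memory sample. If you look up release groups often, building a second dictionary keyed by release_group_mbid avoids the linear scan.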

Canonical Recording Mapping (canonical_recording_redirect.csv)

For each recording in MusicBrainz, this dataset identifies the most canonical version of that recording, often the version that appears on the first album where it was released. This has the effect of redirecting all variations of a recording (single, EP, live recordings, remasters) to that one canonical recording, so that you can treat them all as equivalent if desired.

Where to download: Canonical Downloads (https://metabrainz.org/datasets/derived-dumps#canonical)
Updated: Twice a month, on the 1st and 15th.
License: Creative Commons Zero (CC0)
Columns:

 recording_mbid
 canonical_recording_mbid
 canonical_release_mbid
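
To show what "treating all variations as equivalent" might look like in practice, here is a minimal sketch that folds play counts for several versions of a recording into the canonical one. It assumes a comma-separated file with a header row matching the columns above, and it assumes that a recording absent from the file can be treated as its own canonical version; both are assumptions for the example, as are the function names and placeholder IDs.

```python
import csv
import io
from collections import Counter


def load_recording_redirects(csv_file):
    """Map each recording_mbid to its canonical_recording_mbid."""
    return {
        row["recording_mbid"]: row["canonical_recording_mbid"]
        for row in csv.DictReader(csv_file)
    }


def canonicalize(redirects, recording_mbid):
    """Return the canonical recording, falling back to the input itself.

    The fallback is an assumption made for this sketch, not documented behaviour.
    """
    return redirects.get(recording_mbid, recording_mbid)


# Placeholder identifiers for illustration only; real MBIDs are UUIDs.
SAMPLE = """recording_mbid,canonical_recording_mbid,canonical_release_mbid
rec-live,rec-album,release-c
rec-remaster,rec-album,release-c
"""

redirects = load_recording_redirects(io.StringIO(SAMPLE))
listens = ["rec-live", "rec-remaster", "rec-album", "rec-other"]
play_counts = Counter(canonicalize(redirects, mbid) for mbid in listens)
print(play_counts)
```

Because canonical MBIDs are not guaranteed to be stable between dumps, it is probably safer to store the original recording MBIDs and apply the redirect at query time rather than rewriting stored data.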

Canonical Metadata (canonical_musicbrainz_data.csv)

A list of metadata that the canonical release and canonical recording mappings refer to. While this dataset does not provide any new features, we provide it as a helper for quickly accessing the MusicBrainz data without having to download the entire MusicBrainz database.

This table serves as the basis for our ListenBrainz MBID mapper, which takes incoming listen data that consists only of an artist name and a track name and attempts to work out the best canonical recording and release in MusicBrainz. The combined_lookup field is particularly useful for creating such a mapping -- see the detailed column descriptions below.

If you have a body of music metadata that needs to get matched to MusicBrainz, this table can be quite helpful! Using the combined_lookup field you can quickly implement a fuzzy metadata lookup for all of the data elements that you need to find MusicBrainz IDs for.

Where to download: Canonical Downloads (https://metabrainz.org/datasets/derived-dumps#canonical)
Updated: Twice a month, on the 1st and 15th.
License: Creative Commons Zero (CC0)
Columns:

 id
 artist_credit_id
 artist_mbids
 artist_credit_name
 release_mbid
 release_name
 recording_mbid
 recording_name
 combined_lookup
 score
 year

All fields except for the ones defined below are common MusicBrainz fields, whose definition you can find in our schema docs. The remaining fields are:

combined_lookup: This is composed of the artist_credit_name and recording_name, with punctuation and superfluous whitespace removed and any non-ASCII characters converted into ASCII (see https://pypi.org/project/Unidecode/ for more details on this). This field therefore contains all of the useful components of artist_credit_name and recording_name in a way that makes it easy to look up tracks, especially if you have fuzzy search capabilities.
score: The score column indicates a priority order of all of the items. Some search systems, such as Typesense, require their data to be ordered.
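
To query combined_lookup, your own artist and track names need to be normalized in a comparable way. The sketch below is only an approximation of the description above: it uses the standard library's unicodedata as a rough stand-in for Unidecode (it strips diacritics but does not transliterate non-Latin scripts), and it assumes that all whitespace is removed and the result is lowercased. Those last two details are not stated here, so compare the output against a few real rows from the file before relying on it.

```python
import re
import unicodedata


def to_ascii(text):
    """Rough stand-in for Unidecode: strip diacritics, drop other non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def make_combined_lookup(artist_credit_name, recording_name):
    """Build a lookup key from an artist credit name and a recording name.

    Assumptions for this sketch: every non-alphanumeric character
    (punctuation and whitespace) is removed, and the key is lowercased.
    """
    combined = to_ascii(artist_credit_name + recording_name)
    combined = re.sub(r"[^A-Za-z0-9]+", "", combined)
    return combined.lower()


print(make_combined_lookup("Sigur Rós", "Hoppípolla"))
print(make_combined_lookup("Miles Davis", "So What"))
```

If you can install third-party packages, swapping to_ascii for the unidecode function from the Unidecode package linked above should bring the result closer to the dataset's own keys.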

Using both combined_lookup and score you can take this data and index it using Typesense and then start making fuzzy lookup queries against the dataset to match your own data to MusicBrainz.
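
For small experiments without a search server, the same idea can be sketched with Python's standard library. The example below uses difflib as a stand-in for Typesense (it is not how Typesense works internally, and it scans every key, so it will not scale to the full dataset). The sample rows contain only a subset of the real columns, the IDs are placeholders, and whether a lower or higher score means higher priority is not stated here, so the direction is left as a parameter.

```python
import csv
import difflib
import io


def build_index(csv_file):
    """Group rows by their combined_lookup key."""
    index = {}
    for row in csv.DictReader(csv_file):
        index.setdefault(row["combined_lookup"], []).append(row)
    return index


def fuzzy_lookup(index, query, cutoff=0.8, prefer_high_score=False):
    """Return the best row for an already-normalized query, or None.

    prefer_high_score is an assumption knob: set it according to how the
    score column is actually ordered in the dump you are using.
    """
    matches = difflib.get_close_matches(query, list(index), n=1, cutoff=cutoff)
    if not matches:
        return None
    pick = max if prefer_high_score else min
    # Assumes score values are integers.
    return pick(index[matches[0]], key=lambda row: int(row["score"]))


# Illustrative rows with a subset of columns; values are made up.
SAMPLE = """recording_mbid,combined_lookup,score
rec-1,milesdavissowhat,10
rec-2,milesdavissowhat,25
rec-3,milesdavisblueingreen,11
"""

index = build_index(io.StringIO(SAMPLE))
hit = fuzzy_lookup(index, "milesdavissowat")  # query with a typo
print(hit["recording_mbid"] if hit else None)
```

The cutoff of 0.8 is an arbitrary starting point; a lower value tolerates more typos at the cost of more false matches.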