History:Next Generation Schema/Migration Notes

From MusicBrainz Wiki

This page attempts to document what the NGS upgrade script does, for people who know how MusicBrainz works but not in enough detail to understand the schema. It therefore covers only visible changes, not behind-the-scenes ones such as splitting dates into three fields.

Artists

The artist entity is largely still the same. There are two new fields: country and gender. Aliases no longer have to be unique and now have locales.

Artist credits

The main change to how artists work is the introduction of artist credits. Release groups, releases, tracks, recordings and works all have artist credits (note: relationships don't). Each artist credit is initially set to the current artist name.

Collaborations

The upgrade script tries to convert existing collaboration artists into proper artist credits. For example, One Sweet Day is currently credited to the collaboration artist "Mariah Carey & Boyz II Men". The upgrade script will convert this collaboration and replace the release group's collaboration artist with an artist credit of "Mariah Carey & Boyz II Men".

In order for a collaboration to be converted, it must match the following criteria:

  • The collaboration artist must have no relationships other than the collaboration relationship type that links the collaborating artists together.
  • Each of the collaborating artists' names must be found within the collaboration's name.
  • Phrases that join the artist names together (e.g. "&", "and", "vs.", "feat.", etc.) must be 5 characters or fewer (with some exceptions, e.g. "featuring", "versus"...) (-> link to source for full list?)
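The join-phrase rule can be sketched as follows. This is only an illustration: the function names are hypothetical, the exception set holds just the two phrases named above (the script's real list is longer), and measuring the phrase after stripping surrounding whitespace is an assumption.

```python
# Exceptions named in the text; the upgrade script's full list is not reproduced here.
LONG_JOIN_PHRASES = {"featuring", "versus"}


def join_phrases(collaboration, matched_names):
    """Return the text between consecutive matched names, or None if a name is missing.

    matched_names must be given in the order they appear in the collaboration name.
    """
    phrases = []
    pos = 0
    for i, name in enumerate(matched_names):
        idx = collaboration.find(name, pos)
        if idx == -1:
            return None
        if i > 0:
            # Assumption: whitespace around the phrase does not count.
            phrases.append(collaboration[pos:idx].strip())
        pos = idx + len(name)
    return phrases


def join_phrases_ok(phrases, max_len=5):
    """True if every phrase is short enough or is a listed exception."""
    return all(len(p) <= max_len or p.lower() in LONG_JOIN_PHRASES for p in phrases)
```

For "Mariah Carey & Boyz II Men" the only join phrase is "&", which passes the check.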

When trying to find artist names within the collaboration name, the upgrade script first generates a list of possible names for each artist which includes the artist name and the artist's aliases. For names such as "John Lennon", it automatically includes "John", "Lennon" and "J. Lennon" as well. It then goes through the list of names, starting with the longest, until it finds a match in the collaboration's name.
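A minimal sketch of this name search, under stated assumptions: the function names are hypothetical, only two-word names get the extra variants, and matching is a case-sensitive substring test (the text does not say how the script handles case or longer names).

```python
def candidate_names(name, aliases=()):
    """Return the possible credit names for an artist, longest first."""
    names = {name, *aliases}
    parts = name.split()
    if len(parts) == 2:
        first, last = parts
        # "John Lennon" also yields "John", "Lennon" and "J. Lennon".
        names.update({first, last, f"{first[0]}. {last}"})
    # Longest first; ties broken alphabetically so the order is stable.
    return sorted(names, key=lambda n: (-len(n), n))


def find_credit_name(collaboration, name, aliases=()):
    """Return the longest candidate found in the collaboration name, or None."""
    for candidate in candidate_names(name, aliases):
        if candidate in collaboration:
            return candidate
    return None
```

Trying the longest candidates first means "John Lennon" is preferred over "John" whenever the full name is present.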

If the collaboration can't be converted, no changes are made, and the artist will remain as it is with collaboration relationships intact. If it can be converted, the collaboration artist is deleted as it is now empty.

Release groups

Release groups still represent the same concept, but have been given more functionality. They can now have relationships, annotations and a disambiguation comment. Some of the release relationships (e.g. Discogs masters, Wikipedia, IMDb...) are moved to the release group level. There is a table on Next Generation Schema/Relationships Conversion which may or may not be up-to-date. Tags and ratings are also moved from the release level to the release group level.

Releases

Note: old-release refers to the pre-NGS release entity, new-release refers to the NGS release entity.

Now this is where it gets complicated...

The release model has had some major changes, so this is where much of the conversion needs to happen. Switching to a model which tries to capture the actual product you can buy has numerous benefits. Now that relationships and release status are at the new-release level, releases with the same tracklist can have different relationships (e.g. purchase links, cover art, ...), and promo or bootleg versions of an official release with the same tracklist can be distinguished.

First the old-release is turned into new-releases. A tracklist is created from the old-release's tracks. Each release event is then turned into a new-release which has a medium using that tracklist. The disc IDs from the old-release are copied to that medium if the format is a CD or DualDisc. The release name will initially be the same for each new-release created from an old-release. The old-release's MBID will be assigned to the new-release corresponding to the first release event the old-release has (i.e. lowest row ID); the others will get new MBIDs.
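The split can be sketched roughly as below. The class and field names are hypothetical and do not mirror the real schema, and the text does not say what happens to an old-release with no release events, so the sketch does not cover that case.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ReleaseEvent:
    row_id: int
    country: str
    fmt: str


@dataclass
class NewRelease:
    mbid: str
    name: str
    country: str
    fmt: str
    disc_ids: list = field(default_factory=list)


# Formats whose medium receives the old-release's disc IDs.
DISC_ID_FORMATS = {"CD", "DualDisc"}


def split_old_release(old_mbid, name, events, disc_ids):
    """Create one new-release per release event, lowest row ID first."""
    releases = []
    for i, event in enumerate(sorted(events, key=lambda e: e.row_id)):
        releases.append(NewRelease(
            # The first event keeps the old MBID; the others get fresh ones.
            mbid=old_mbid if i == 0 else str(uuid.uuid4()),
            name=name,
            country=event.country,
            fmt=event.fmt,
            disc_ids=list(disc_ids) if event.fmt in DISC_ID_FORMATS else [],
        ))
    return releases
```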

The upgrade script then tries to assign Discogs release, Amazon, part of set and transl(iter)ation relationships to the correct new-releases. All the remaining old-release relationships are copied to each new-release.

Discogs relationships

  • To match Discogs releases to our release events, the upgrade script compares our catalogue numbers and barcodes with the Discogs release's catalogue number (non-alphanumeric characters and leading zeros are removed from both).
  • If that doesn't match, it tries to find a partial match to the Discogs release's catalogue number (?) where the format matches.
  • If that still doesn't match, it tries to match the country, year and format.
  • (what happens if nothing matches? copy to each release?)
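The normalisation in the first step might look like the sketch below. The function names are hypothetical; treating "leading zeros" as zeros at the start of the cleaned string, and leaving letter case untouched, are assumptions. The partial-match and country/year/format fallbacks are left out because the text does not pin them down.

```python
import re


def normalise_catno(value):
    """Drop non-alphanumeric characters, then strip leading zeros."""
    cleaned = re.sub(r"[^0-9A-Za-z]", "", value)
    return cleaned.lstrip("0")


def exact_match(discogs_catno, event_catno, event_barcode):
    """True if the Discogs catalogue number equals our catalogue number or barcode."""
    target = normalise_catno(discogs_catno)
    candidates = {normalise_catno(v) for v in (event_catno, event_barcode) if v}
    # An empty normalised value never counts as a match.
    return bool(target) and target in candidates
```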

Amazon relationships

  • To match Amazon releases, it uses the barcodes (with leading zeros removed).
  • If the barcodes don't match, it tries to find a partial match using barcodes.
  • If that still doesn't match, it tries to find a match to the full release date.
  • Finally, it tries to find a match to the year and format.
  • (again, what happens?)
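The fallback order above can be pictured as a list of stages, where the first stage that finds any match wins. Everything here is an assumption made for illustration: the dictionary keys, dates stored as "YYYY-MM-DD" strings, and reading "partial match" as one barcode containing the other. What the script does when no stage matches is not known, so the sketch simply returns nothing.

```python
def strip_zeros(barcode):
    return (barcode or "").lstrip("0")


def same_barcode(product, event):
    a, b = strip_zeros(product.get("barcode")), strip_zeros(event.get("barcode"))
    return bool(a) and a == b


def partial_barcode(product, event):
    # Assumption: "partial" means one barcode contains the other.
    a, b = strip_zeros(product.get("barcode")), strip_zeros(event.get("barcode"))
    return bool(a) and bool(b) and (a in b or b in a)


def same_full_date(product, event):
    date = product.get("date")
    return bool(date) and len(date) == 10 and date == event.get("date")


def same_year_and_format(product, event):
    p_date, e_date = product.get("date") or "", event.get("date") or ""
    fmt = product.get("format")
    return len(p_date) >= 4 and p_date[:4] == e_date[:4] and bool(fmt) and fmt == event.get("format")


STAGES = [same_barcode, partial_barcode, same_full_date, same_year_and_format]


def match_amazon(product, events):
    """Return (stage name, matching events) for the first stage that matches."""
    for stage in STAGES:
        matched = [e for e in events if stage(product, e)]
        if matched:
            return stage.__name__, matched
    return None, []
```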

Part of set/transl(iter)ation relationships

For both part of set and transl(iter)ation relationships, it tries to find matches with the same date, country, label, catalogue number and barcode (catalogue numbers are lowercased and the last character is ignored).
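One way to picture this comparison is as a key built from the five fields, with the catalogue number lowercased and its last character dropped. The function and field names below are hypothetical.

```python
def catno_key(catno):
    """Lowercase the catalogue number and ignore its last character."""
    return (catno or "").lower()[:-1]


def release_key(release):
    """Fields that must agree for two releases to be considered a match."""
    return (
        release.get("date"),
        release.get("country"),
        release.get("label"),
        catno_key(release.get("catno")),
        release.get("barcode"),
    )
```

Ignoring the last character lets catalogue numbers such as "TOCP-6501" and "TOCP-6502" compare as equal, which is a common pattern for the discs of one set.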


Combining multiple discs

Now it's time to combine discs into one release.

The part of set relationships are loaded and we go through each of the discs and try to combine them into one release. In order for them to be combined, the following apply:

  • The title needs to follow Disc Number Style. This is necessary because we need to extract the release title, the disc number and the disc title.
  • Disc numbers can't be repeated.
  • Every disc can only have one next disc (note: that doesn't mean an old-release can't be linked to more than one disc; it means that after the relationship disambiguation in the previous section, a release should only end up linked to one release).
  • The relationships mustn't be circular (that's just silly :P)
  • The discs need to have the same release title, artist, date, country, label and barcode (note: not catalogue number or format).

The releases which can't be combined will keep their part of set relationships, so they can be cleaned up later.
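The conditions above can be sketched as a single check. The regular expression is an assumed reading of Disc Number Style ("Title (disc 1)" or "Title (disc 1: Disc Title)") and may be stricter than what the script accepts; the dictionary keys and function names are hypothetical.

```python
import re

# Assumed pattern for Disc Number Style titles.
DISC_RE = re.compile(r"^(?P<title>.+?) \(disc (?P<number>\d+)(?:: (?P<disc_title>.+))?\)$")
SHARED_FIELDS = ("artist", "date", "country", "label", "barcode")


def parse_disc_title(name):
    """Return (release title, disc number, disc title or None), or None if no match."""
    m = DISC_RE.match(name)
    if not m:
        return None
    return m.group("title"), int(m.group("number")), m.group("disc_title")


def can_combine(discs, next_disc):
    """discs: list of dicts; next_disc: list of (from_id, to_id) part of set links."""
    parsed = [parse_disc_title(d["name"]) for d in discs]
    if any(p is None for p in parsed):
        return False  # a title does not follow Disc Number Style
    if len({p[0] for p in parsed}) != 1:
        return False  # release titles differ
    numbers = [p[1] for p in parsed]
    if len(set(numbers)) != len(numbers):
        return False  # repeated disc number
    if any(len({d[f] for d in discs}) != 1 for f in SHARED_FIELDS):
        return False  # artist, date, country, label or barcode differ
    sources = [a for a, _ in next_disc]
    if len(set(sources)) != len(sources):
        return False  # a disc has more than one next disc
    nxt = dict(next_disc)
    for start in nxt:
        seen, current = {start}, nxt[start]
        while current is not None:
            if current in seen:
                return False  # circular chain
            seen.add(current)
            current = nxt.get(current)
    return True
```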

Other changes

  • Releases now have a packaging field and can have a disambiguation comment.
  • The release type is dropped because only release groups have types. (does it ignore it completely?)
  • [Unknown Country] is removed, since country is no longer a required field. Any release events which currently have the country set to [Unknown Country] will now have no country set.
  • A number of new release formats have been added. (-> link to a list)
  • Even though a new-release corresponds to an old-release event, it can have multiple labels and catalogue numbers.

Tracks and recordings

Tracks are no longer core entities and all current track entities will become recording entities. Tracks will still exist, but are now simply the link between a release and a recording. That means that:

  • All tracks get a corresponding recording with the same name.
  • Track MBIDs become recording MBIDs.
  • ISRCs and PUIDs are moved from tracks to recordings.
  • Track tags become recording tags.
  • Track ratings become recording ratings.

The upgrade script also tries to merge recordings. It will merge the following:

  • Tracks linked with an earliest release relationship (as long as the difference in track times is no more than 5000ms and the artists are the same).
  • Corresponding tracks on releases linked with the transl(iter)ation relationship (as long as the difference in track times is no more than 3000ms. Here the artists can be different).
  • Tracks which share the same PUID (with what criteria?)
  • Tracks which share the same ISRC (with what criteria?)
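The two rules with known thresholds can be sketched as below. The function name and link type strings are hypothetical, and refusing to merge when a track length is unknown is an assumption. The PUID and ISRC rules are omitted because their criteria are not documented here.

```python
EARLIEST_RELEASE_MAX_DIFF_MS = 5000
TRANSLITERATION_MAX_DIFF_MS = 3000


def can_merge(link_type, length_a_ms, length_b_ms, artist_a, artist_b):
    """Decide whether two tracks' recordings should be merged."""
    if length_a_ms is None or length_b_ms is None:
        return False  # assumption: unknown lengths are never merged
    diff = abs(length_a_ms - length_b_ms)
    if link_type == "earliest release":
        # Artists must be the same for this rule.
        return diff <= EARLIEST_RELEASE_MAX_DIFF_MS and artist_a == artist_b
    if link_type == "transliteration":
        # Artists may differ for this rule.
        return diff <= TRANSLITERATION_MAX_DIFF_MS
    return False
```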

Recordings can also have disambiguation comments.

Works

Works can have a disambiguation comment, ISWCs, relationships, annotations, aliases, tags and ratings.

Some works are created during the upgrade script.

  • Do magic!


URLs

URLs are still URLs of course. However, there is now some code to try and make sure that URLs are entered in the database in a more consistent format (-> uses the output of URI->canonical) and URLs are now merged when editing one URL into another which is already in the database.

The upgrade script checks all the URLs and will update any which aren't in the desired format. The duplicate URLs will be merged.
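The idea can be illustrated with Python's standard library. This is only a rough approximation and not a port of Perl's URI->canonical: it lowercases the scheme and host, drops default ports and gives an empty path a "/", while ignoring user info and IPv6 hosts. Keeping the lowest row ID as the merge target is also an assumption.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical(url):
    """Return a normalised form of the URL (approximation, see above)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, parts.fragment))


def group_duplicates(urls):
    """Map each canonical URL to the row IDs that share it, lowest ID first."""
    groups = {}
    for row_id, url in sorted(urls.items()):
        groups.setdefault(canonical(url), []).append(row_id)
    return groups
```

Any group with more than one row ID is a set of duplicates that would be merged into a single URL.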

Other stuff

  • migrate collections (in ./admin/sql/updates/ngs-rawdata.pl)
  • Relationships can now have multiple different attributes (I guess this is due to the changes in how relationships are stored)
  • Labels didn't change at all?