User:JonnyJD/DiscID

From MusicBrainz Wiki
< User:JonnyJD
Revision as of 16:49, 25 September 2013 by JonnyJD (talk | contribs) (FreeDB ID (fuzzy TOC lookup): make fuzzy lookup bold)

There was some discussion about the issues and a possible removal of Disc IDs on the 13th MusicBrainz Summit. I want to summarize issues and benefits of discID usage a bit.

Purpose of Disc IDs / TOCs

TOC

A TOC is set of sector offsets/times for a specific pressing of CD. A release medium can have multiple TOCs attached and the (primarey) times of the release should be the times from one of the TOCs. This way multiple pressings are grouped in one release-medium. Releases can be found with the TOC of the disc, though currently only a fuzzy search by TOC is available (see below).

Depending on context, there are two different types of TOC:

  • disc TOC: The TOC of a certain disc used for lookup
  • server TOC: The TOC of a certain (disc) pressing saved and attache to a medium on MusicBrainz while "adding a disc ID"

Disc ID

A disc ID is basically a hash of the TOC, so you can generate a disc ID from the TOC, but no TOC from the disc ID. On MusicBrainz these disc IDs can be used as IDs for a specific TOC.

FreeDB ID (fuzzy TOC lookup)

A ID is also an identifier, which can be generated given a TOC. However, standard FreeDB IDs have lots of collisions so one FreeDB ID can have several corresponding (server) TOCs.

The MusicBrainz FreeDB_Gateway mainly uses the (disc) TOC given in the search command to do a fuzzy search on the (primary) track times of the release. This will also include completely unrelated releases, but mostly also includes the correct release in the list. The (misc) freeDB ID returned by the lookup is no "real" freeDB ID, as it isn't generated the usual way. The misc ID is a representation of the medium ID in the database and because of that it is collision free and leads to exactly one medium. Additionally also a search by the FreeDB ID given by the client is done in the (server) TOCs. This adds exactly one result to the list, which isn't necessarily the correct one, due to collisions mentioned above.

Note that the fuzzy search of the FreeDB Gateway is not dependent on attached server TOCs or DiscIDs, but also isn't as accurate (many (unrelated) false positives, few false negatives).

Exact lookup with matching disc and server TOCs would be technically possible to remove false positives.

Disc Lookup

The disc ID is used as a short identifier to lookup a release "by disc". Taggers expect the returned release to match what they see as the disc: same number of tracks etc. Lookup by TOC is in theory possible, but currently isn't as exact (due to fuzzy lookup, see below) and isn't widely supported (also see below).

Lookup by Disc ID + disc TOC gives only exact matches when the Disc ID is the database. However, only some of the releases of a release group might have the Disc ID attached and other releases in that group might not appear in the results. (nearly no (unrelated) false positives, but sometimes false negatives) The releases not appearing in the results are mostly seen as "practically" identical to one release in the results. They mostly have the same release title, artist and track names. Differences in cat# etc. might get lost, but are usually not even shown in the list of results.

Additionally listing all releases in a release group with similar track times as one already in the result would be possible to remove the rate of false negatives.

Issues with (or because of) Disc IDs

Pregap Tracks (locks)

The ability to attach pregrap tracks directly to the release is a long wanted feature and tracked in MBS-967. You can't add a Pregap Track as track 0 when a Disc ID is attached to the release (tracklist is locked). Removing the Disc ID could work, but the release can only be found by clients with TOC lookup support (see below). Removing Disc IDs altogether would extend the problem to all releases.

The additional "track" would confuse many tagger tools, so the pregap track should only be included when explicitely requested, not as a normal track.

Correcting Times (locks)

When disc IDs are attached, the track times can't be set manually. Setting times to one of the disc IDs is the only option.

I didn't grasp yet when it is the case that no correct disc ID is available and there is a better source for the correct time. The notes mention Video CDs, which possibly shouldn't get disc IDs attached in the first place? (when there are no audio tracks). Otherwise I don't see why we should mess around with length of data tracks. --JonnyJD (talk) 14:24, 25 September 2013 (UTC)

Pre NGS disc IDs (precision/false positives, though releated)

Disc IDs added pre-NGS were copied to all releases split off with NGS. So many disc IDs from that time are still attached to all releases in a release group. The results include some wrong entries, but no completely irrelevant entries.

Incencitive to add IDs to multiple releases (recall/false negatives, though similar)

When a disc ID is attached to one release in a release group, the lookup with this ID yields a result that is good enough for most users. The release title, artist and track titles match and the cat# etc. are not relevant for many users. So there is not much incencitive to add the disc ID to more releases in a release group.

This can easily be fixed on server side to include other releases from the release group, although the precision goes down a bit then.

Client support with and without full TOC

The main client library for disc ID support is libdiscid:

  • submission url provided includes DiscID and TOC, to save the TOC on the server
  • web service url provided also includes both, but is outdated (WS/1)
  • only the disc ID can be gathered directly with the API (as of 0.5.2)
  • there will be an upcoming release (0.6.0) with a "toc string API" (LIB-41)

The web service does allow lookup by disc ID and, if it doesn't match, a fuzzy lookup by TOC. The result of a match by disc ID and a fuzzy match by TOC looks completely different (see comment in LMB-36) A syntactically valid (though not existing) disc ID is always required and a match by TOC is currently always fuzzy.

libmusicbrainz is technically able to work with a TOC lookup, but isn't straightforward due to how the web service works. (see above and LMB-36)

python-musicbrainzngs can only lookup by disc ID. There is an outstanding ticket for (fuzzy) lookup by TOC.

Usage statistics

(still left to gather)

Summary/Evaluation/TLDNR

Disc IDs are only IDs, so they are not important data in themselves. The important data is in the TOCs. However, the Disc IDs also aren't the problem, since the locks are in place because of the TOCs. When lookup would work directly by TOC, we would have the same issues with confused taggers (in case of pregap tracks implemented as normal tracks) or inconsistent data (in case of track times in no relation to actual TOCs).

Additionally the client structure is based on lookup by Disc ID, not by TOC directly. Removing Disc IDs will probably break lots of lookup applications (statistics not gathered as of writing though).