Difference between revisions of "User:JonnyJD/DiscID"

From MusicBrainz Wiki

Revision as of 13:26, 27 September 2013

There was some discussion about the issues with Disc IDs, and about possibly removing them, at the 13th MusicBrainz Summit. This page summarizes the issues and benefits of disc ID usage.

Purpose of Disc IDs / TOCs

TOC

A TOC is a set of sector offsets/times for a specific pressing of a CD. A release medium can have multiple TOCs attached, and the (primary) times of the release should be the times from one of the TOCs. This way multiple pressings are grouped in one release medium. Releases can be found with the TOC of the disc, though currently only a fuzzy search by TOC is available (see below).

Depending on context, there are two different types of TOC:

  • disc TOC: The TOC of a certain disc used for lookup
  • server TOC: The TOC of a certain (disc) pressing, saved and attached to a medium on MusicBrainz while "adding a disc ID"

Disc ID

A disc ID is basically a hash of the TOC, so you can generate a disc ID from the TOC, but no TOC from the disc ID. On MusicBrainz these disc IDs can be used as IDs for a specific TOC.
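
To make the "hash of the TOC" idea concrete, here is a minimal sketch of the calculation as I understand it from the published MusicBrainz disc ID description: a SHA-1 over a fixed-width hex rendering of the TOC, encoded with a URL-safe base64 variant. The function and parameter names are made up for this example, and libdiscid remains the reference implementation.

```python
import base64
import hashlib


def musicbrainz_disc_id(first_track, last_track, leadout, offsets):
    """Compute a MusicBrainz-style disc ID from TOC values (sector offsets).

    Sketch only: names are illustrative, not part of any library API.
    """
    # 100 slots: index 0 holds the lead-out offset, 1..99 hold the track
    # offsets; slots for tracks that do not exist stay 0.
    slots = [0] * 100
    slots[0] = leadout
    for number, offset in zip(range(first_track, last_track + 1), offsets):
        slots[number] = offset

    # Fixed-width upper-case hex: 2 digits per track number, 8 per offset.
    text = "%02X%02X" % (first_track, last_track)
    text += "".join("%08X" % slot for slot in slots)

    digest = hashlib.sha1(text.encode("ascii")).digest()

    # Standard base64, with the three URL-unsafe characters swapped out.
    encoded = base64.b64encode(digest).decode("ascii")
    return encoded.replace("+", ".").replace("/", "_").replace("=", "-")


toc_offsets = [150, 15363, 32314, 46592, 63414, 80489]
print(musicbrainz_disc_id(1, 6, 95462, toc_offsets))
```

Because the ID is a digest, changing a single offset by one sector yields a completely different ID, and the TOC cannot be recovered from the ID alone.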

fuzzy (disc) TOC lookup

The MB server can make a fuzzy lookup with the TOC from the disc. This lookup is based on the primary track times for the release and is unrelated to discIDs and (server) TOCs. This kind of lookup includes the correct release most of the time (high recall), but also many completely unrelated releases that happen to have similar times (low precision, example). When a specific release that should be in the results is not, this normally can't be fixed (without also considering exact (server) TOCs).
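
The following is a rough sketch of what a fuzzy comparison on track times could look like. It is not the server's actual algorithm: the tolerance value, the function names and the sample release times are all assumptions chosen for illustration.

```python
FRAMES_PER_SECOND = 75  # CD sectors ("frames") per second


def track_lengths(leadout, offsets):
    """Derive track lengths in seconds from a disc TOC."""
    # Each track ends where the next one starts; the last ends at the lead-out.
    ends = list(offsets[1:]) + [leadout]
    return [(end - start) / FRAMES_PER_SECOND
            for start, end in zip(offsets, ends)]


def fuzzy_match(disc_lengths, release_lengths, tolerance=3.0):
    """True if both track lists have the same size and similar times.

    `tolerance` (seconds per track) is a hypothetical threshold.
    """
    if len(disc_lengths) != len(release_lengths):
        return False
    return all(abs(disc - release) <= tolerance
               for disc, release in zip(disc_lengths, release_lengths))


disc = track_lengths(95462, [150, 15363, 32314, 46592, 63414, 80489])
similar_release = [203, 226, 190, 224, 228, 200]    # made-up primary times
different_release = [203, 226, 190, 224, 228, 300]  # last track far off
print(fuzzy_match(disc, similar_release))
print(fuzzy_match(disc, different_release))
```

Any release whose track times fall within the tolerance matches, related or not, which is where the low precision comes from.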

FreeDB ID

A FreeDB ID is also an identifier that can be generated from a TOC. However, standard FreeDB IDs have lots of collisions, so one FreeDB ID can have several corresponding (server) TOCs.

The MusicBrainz FreeDB_Gateway mainly uses the (disc) TOC given in the search command to do a fuzzy search on the (primary) track times of the release (see above). The (misc) freeDB ID returned by the lookup is not a "real" freeDB ID, as it isn't generated the usual way. The misc ID is a representation of the medium ID in the database, so it is collision-free and leads to exactly one medium. In addition, the FreeDB ID given by the client is searched for in the (server) TOCs. This adds exactly one result to the list, which isn't necessarily the correct one, due to the collisions mentioned above.

An exact lookup that matches disc and server TOCs would be technically possible and would remove false positives.

Disc Lookup

The disc ID is used as a short identifier to look up a release "by disc". Taggers expect the returned release to match what they see as the disc: same number of tracks etc. Lookup by TOC is in theory possible, but currently isn't as exact (lookup is currently fuzzy, see above) and isn't widely supported (also see below).

Lookup by Disc ID + disc TOC gives only exact matches when the Disc ID is in the database. However, only some of the releases of a release group might have the Disc ID attached, so other releases in that group might not appear in the results (nearly no unrelated false positives, but sometimes false negatives). The releases missing from the results are mostly seen as "practically" identical to one release in the results: they mostly have the same release title, artist and track names. Differences in cat# etc. might get lost, but are usually not even shown in the list of results. When the disc ID is not found, a fuzzy lookup (see above) is done.

Additionally listing all releases in a release group whose track times are similar to one already in the results would be possible and would reduce the rate of false negatives.

Issues with (or because of) Disc IDs

Pregap Tracks (locks)

The ability to attach pregap tracks directly to the release is a long-wanted feature, tracked in MBS-967. You can't add a pregap track as track 0 when a Disc ID is attached to the release (the tracklist is locked). Removing the Disc ID could work, but then the release can only be found by clients with (fuzzy) TOC lookup support (see below). Removing Disc IDs altogether would extend the problem to all releases.

The additional "track" would confuse many tagger tools, so the pregap track should only be included when explicitly requested, not as a normal track. Just removing the lock (completely) is no solution, but allowing a special pregap track to be added and named would be possible.

Note that the existence of a pregap track is already visible in (both) TOCs: the first track starts well after the usual 2 seconds (offset >> 150).
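
As a small illustration of reading a pregap track off the TOC, the sketch below compares the first track offset with the standard 150 sectors. The 5-second cut-off for calling it a "track" is an arbitrary assumption for this example, as are the function names.

```python
STANDARD_PREGAP_SECTORS = 150  # 2 seconds at 75 sectors per second
SECTORS_PER_SECOND = 75


def hidden_pregap_seconds(first_track_offset):
    """Seconds of audio before track 1 beyond the standard 2-second pregap."""
    return (first_track_offset - STANDARD_PREGAP_SECTORS) / SECTORS_PER_SECOND


def has_pregap_track(first_track_offset, min_seconds=5.0):
    """Guess whether the disc carries a pregap ("track 0") recording.

    `min_seconds` is a hypothetical threshold, not a MusicBrainz rule.
    """
    return hidden_pregap_seconds(first_track_offset) >= min_seconds


print(has_pregap_track(150))             # ordinary disc
print(has_pregap_track(150 + 75 * 60))   # one minute hidden before track 1
```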

Correcting Times (locks)

When disc IDs are attached, the track times can't be set manually. Setting times to one of the disc IDs is the only option.

I haven't yet grasped in which cases no correct disc ID is available while a better source for the correct times exists. The notes mention Video CDs, which possibly shouldn't get disc IDs attached in the first place (when there are no audio tracks)? Otherwise I don't see why we should mess around with the lengths of data tracks. --JonnyJD (talk) 14:24, 25 September 2013 (UTC)

Pre NGS disc IDs (precision/false positives, though related)

Disc IDs added pre-NGS were copied to all releases split off with NGS. So many disc IDs from that time are still attached to all releases in a release group. The results include some wrong entries, but no completely irrelevant entries.

Incentive to add IDs to multiple releases (recall/false negatives, though similar)

When a disc ID is attached to one release in a release group, the lookup with this ID yields a result that is good enough for most users. The release title, artist and track titles match, and the cat# etc. are not relevant for many users. So there is not much incentive to add the disc ID to more releases in a release group.

This can easily be fixed on the server side by including other releases from the release group, although the precision then goes down a bit.

Client support with and without full TOC

The main client library for disc ID support is libdiscid:

  • submission url provided includes DiscID and TOC, to save the TOC on the server
  • web service url provided also includes both, but is outdated (WS/1)
  • only the disc ID can be gathered directly with the API (as of 0.5.2)
  • there will be an upcoming release (0.6.0) with a "toc string API" (LIB-41)

The web service allows lookup by disc ID and, if that doesn't match, a fuzzy lookup by TOC. The results of a match by disc ID and of a fuzzy match by TOC look completely different (see comment in LMB-36). A syntactically valid (though not necessarily existing) disc ID is always required, and a match by TOC is currently always fuzzy.

libmusicbrainz is technically able to work with a TOC lookup, but this isn't straightforward due to how the web service works (see above and LMB-36).

python-musicbrainzngs can only look up by disc ID. There is an outstanding ticket for (fuzzy) lookup by TOC.
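
For illustration, a request that carries both the disc ID and the TOC could be built as below. This is a sketch under assumptions: the host, the `toc` parameter layout (first track, last track, lead-out, then track offsets, joined by "+") and the helper name are my reading of the WS/2 interface rather than something stated on this page, and the disc ID shown is only a syntactically valid placeholder.

```python
from urllib.parse import quote

WS2_ROOT = "https://musicbrainz.org/ws/2"  # assumed web service root


def discid_lookup_url(disc_id, first_track, last_track, leadout, offsets):
    """Build a /ws/2/discid URL with the TOC as fuzzy-lookup fallback."""
    toc_values = [first_track, last_track, leadout, *offsets]
    toc = "+".join(str(value) for value in toc_values)
    # Disc IDs only use URL-safe characters, but quote defensively anyway.
    return "%s/discid/%s?toc=%s" % (WS2_ROOT, quote(disc_id, safe=""), toc)


placeholder_id = "49HHV7Eb8UKF3aQiNmu1GR8vKTY-"  # placeholder, not verified
print(discid_lookup_url(placeholder_id, 1, 6, 95462,
                        [150, 15363, 32314, 46592, 63414, 80489]))
```

The function only builds the URL; sending the request and parsing the two differently shaped result types is left out.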

Usage statistics

(statistics gathered 20130927)

  • /ws/2/discid: 9 days, 683,503 requests (ca. 76 k per day), only 246 (0.036 %) with toc parameter (enabling fuzzy lookup when ID is not found)
  • mb2freedb: ca. 550-620 k per day (this always uses a fuzzy lookup)

So we have over 7 times more requests via the freedb gateway than via disc IDs directly (-> 88 % of all lookups are freedb gateway lookups). Altogether we have ca. 660 k requests per day, and 88 % of these work with fuzzy lookup.
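
The figures above can be re-derived with a few lines of arithmetic. The only assumption is using the midpoint of the 550-620 k range as the daily gateway figure.

```python
discid_requests = 683_503      # /ws/2/discid requests over the sample period
sample_days = 9
requests_with_toc = 246
gateway_low, gateway_high = 550_000, 620_000  # mb2freedb requests per day

discid_per_day = discid_requests / sample_days
toc_share_percent = 100 * requests_with_toc / discid_requests

gateway_per_day = (gateway_low + gateway_high) / 2  # midpoint (assumption)
ratio = gateway_per_day / discid_per_day
total_per_day = gateway_per_day + discid_per_day
gateway_share_percent = 100 * gateway_per_day / total_per_day

print("disc ID lookups per day:  %.0f" % discid_per_day)
print("share with toc parameter: %.3f %%" % toc_share_percent)
print("gateway vs. disc ID:      %.1f times" % ratio)
print("total lookups per day:    %.0f" % total_per_day)
print("gateway share:            %.1f %%" % gateway_share_percent)
```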

Summary/Evaluation/TL;DR

Disc IDs are only IDs, so they are not important data in themselves. The important data is in the (server) TOCs. However, the Disc IDs also aren't the problem, since the locks are in place because of the TOCs. If lookup worked directly by TOC, we would have the same issues with confused taggers (in case of pregap tracks implemented as normal tracks) or inconsistent data (in case of track times unrelated to the actual TOCs). So the question is rather whether we want to keep maintaining (server) TOCs. Removing DiscIDs without removing the (server) TOCs attached to releases wouldn't make any sense.

Removing Disc IDs/TOCs would leave 88 % of lookups (freedb gateway) unchanged, but break most applications that opted to implement lookup by Disc ID as recommended by MusicBrainz. Almost none of them support fuzzy lookups.

fuzzy disc TOC lookup vs. discid lookup

Fuzzy lookup has the problem of unrelated false positives (low precision). This and the (few) false negatives can only be solved with additional manually attached (server) TOCs or similar data to filter or amend the results.

DiscID lookup has nearly no false positives (high precision), but can miss similar releases with a different cat#. DiscID data does have to be maintained, but all problems can (in theory) be fixed by adding/moving/removing DiscIDs/TOCs and/or making the server smarter (in cases of missing releases from the same release group).

exact disc TOC lookup

Exact disc TOC lookup could be used for ranking the results of a fuzzy TOC lookup (exact matches at the top), but is otherwise functionally equivalent to a discID lookup.