Difference between revisions of "User:JonnyJD/DiscID"

From MusicBrainz Wiki
(add summary/evaluation)
(Incencitive to add IDs to multiple releases (recall/false negatives, though similar): isrcsubmit improvements)
 
(13 intermediate revisions by the same user not shown)
Line 6: Line 6:
 
=== TOC ===
 
=== TOC ===
 
A TOC is set of sector offsets/times for a specific pressing of CD.
 
A TOC is set of sector offsets/times for a specific pressing of CD.
A release can have multiple TOCs attached and the (primarey) times of the release should be the times from one of the TOCs.
+
A release medium can have multiple TOCs attached and the (primarey) times of the release should be the times from one of the TOCs.
This way multiple pressings are grouped in one release.
+
This way multiple pressings are grouped in one release-medium.
Releases can be found by TOC, though currently only a fuzzy search by TOC is available (see below).
+
Releases can be found with the TOC of the disc,
 +
though currently only a fuzzy search by TOC is available (see below).
 +
 
 +
Depending on context, there are two different types of TOC:
 +
* '''disc TOC''': The TOC of a certain disc used for lookup
 +
* '''server TOC''': The TOC of a certain (disc) pressing saved and attache to a medium on MusicBrainz while "[[How to Add Disc IDs|adding a disc ID]]"
  
 
=== Disc ID ===
 
=== Disc ID ===
Line 14: Line 19:
 
On MusicBrainz these disc IDs can be used as IDs for a specific TOC.
 
On MusicBrainz these disc IDs can be used as IDs for a specific TOC.
  
 +
=== fuzzy (disc) TOC lookup ===
 +
The MB server can make a fuzzy lookup with the TOC from the disc.
 +
This lookup is based on the primary track times for the release and is unrelated to discIDs and (server) TOCs.
 +
This kind of lookup includes the correct release most of the time (high recall),
 +
but also many completely unrelated releases that happen to have similar times (low precision, [http://freedb.musicbrainz.org/~cddb/cddb.cgi?cmd=cddb+query+B20DEE0B+1+150+345&proto=6 example]).
 +
When a specific release that should be in the results is not, this normally can't be fixed
 +
(without also considering exact (server) TOCs).
 +
 +
=== FreeDB ID ===
 +
A [https://en.wikipedia.org/wiki/CDDB#Example_calculation_of_a_CDDB1_.28FreeDB.29_disc_ID|FreeDB ID] is also an identifier,
 +
which can be generated given a TOC.
 +
However, standard FreeDB IDs have lots of collisions so one FreeDB ID can have several corresponding (server) TOCs.
 +
 +
The '''MusicBrainz [[FreeDB_Gateway]]''' mainly uses the (disc) TOC given in the search command to do a '''fuzzy search
 +
on the (primary) track times''' of the release (see above).
 +
The (misc) freeDB ID returned by the lookup is no "real" freeDB ID, as it isn't generated the usual way.
 +
The misc ID is a representation of the medium ID in the database and because of that it is collision free and leads to exactly one medium.
 +
Additionally also a search by the FreeDB ID given by the client is done in the (server) TOCs.
 +
This adds exactly one result to the list, which isn't necessarily the correct one, due to collisions mentioned above.
 +
 +
Exact lookup with matching disc and server TOCs would be technically possible to remove false positives.
  
 
=== Disc Lookup ===
 
=== Disc Lookup ===
 
The disc ID is used as a short identifier to lookup a release "by disc".
 
The disc ID is used as a short identifier to lookup a release "by disc".
 
Taggers expect the returned release to match what they see as the disc: same number of tracks etc.
 
Taggers expect the returned release to match what they see as the disc: same number of tracks etc.
Lookup by TOC is in theory possible, but currently isn't as exact (due to fuzzy lookup, see below) and isn't widely supported (also see below).
+
Lookup by TOC is in theory possible, but currently isn't as exact (lookup is currently fuzzy, see above) and isn't widely supported (also see below).
 +
 
 +
Lookup by Disc ID + disc TOC gives only exact matches when the Disc ID is the database.
 +
However, only some of the releases of a release group might have the Disc ID attached and other releases in that group might not appear in the results. (nearly no (unrelated) false positives, but sometimes false negatives)
 +
The releases not appearing in the results are mostly seen as "practically" identical to one release in the results.
 +
They mostly have the same release title, artist and track names.
 +
Differences in cat# etc. might get lost, but are usually not even shown in the list of results.
 +
When the disc ID is not found, a fuzzy lookup (see above) is done.
 +
 
 +
Additionally listing all releases in a release group with similar track times as one already in the result would be possible
 +
to remove the rate of false negatives.
  
 
== Issues with (or because of) Disc IDs ==
 
== Issues with (or because of) Disc IDs ==
  
=== Pregap Tracks ===
+
=== Pregap Tracks (locks) ===
 
The ability to attach pregrap tracks directly to the release is a long wanted feature and tracked in [http://tickets.musicbrainz.org/browse/MBS-967 MBS-967].
 
The ability to attach pregrap tracks directly to the release is a long wanted feature and tracked in [http://tickets.musicbrainz.org/browse/MBS-967 MBS-967].
 
You can't add a [[Pregap Track]] as track 0 when a Disc ID is attached to the release (tracklist is locked).
 
You can't add a [[Pregap Track]] as track 0 when a Disc ID is attached to the release (tracklist is locked).
Removing the Disc ID could work, but the release can only be found by clients with TOC lookup support (see below).
+
Removing the Disc ID could work, but the release can only be found by clients with (fuzzy) TOC lookup support (see below).
 
Removing Disc IDs altogether would extend the problem to all releases.
 
Removing Disc IDs altogether would extend the problem to all releases.
  
 
The additional "track" would confuse many tagger tools, so the pregap track should only be included when explicitely requested, not as a normal track.
 
The additional "track" would confuse many tagger tools, so the pregap track should only be included when explicitely requested, not as a normal track.
 +
Just removing the lock (completely) is no solution, but making the addition / naming of a special pregap track possible is possible.
 +
 +
It should be noted, that the fact that a pregap track exists is already included in (both) TOCs, as the fact that the first track starts after way more than the usual 2 seconds (offset >> 150).
  
=== Correcting Times ===
+
=== Correcting Times (locks) ===
 
When disc IDs are attached, the track times can't be set manually. Setting times to one of the disc IDs is the only option.
 
When disc IDs are attached, the track times can't be set manually. Setting times to one of the disc IDs is the only option.
 
::I didn't grasp yet when it is the case that no correct disc ID is available and there is a better source for the correct time. The notes mention Video CDs, which possibly shouldn't get disc IDs attached in the first place? (when there are no audio tracks). Otherwise I don't see why we should mess around with length of data tracks. --[[User:JonnyJD|JonnyJD]] ([[User talk:JonnyJD|talk]]) 14:24, 25 September 2013 (UTC)
 
::I didn't grasp yet when it is the case that no correct disc ID is available and there is a better source for the correct time. The notes mention Video CDs, which possibly shouldn't get disc IDs attached in the first place? (when there are no audio tracks). Otherwise I don't see why we should mess around with length of data tracks. --[[User:JonnyJD|JonnyJD]] ([[User talk:JonnyJD|talk]]) 14:24, 25 September 2013 (UTC)
 +
 +
=== Pre NGS disc IDs (precision/false positives, though releated) ===
 +
Disc IDs added pre-NGS were copied to all releases split off with NGS.
 +
So many disc IDs from that time are still attached to all releases in a release group.
 +
The results include some wrong entries, but no completely irrelevant entries.
 +
 +
=== Incencitive to add IDs to multiple releases (recall/false negatives, though similar) ===
 +
When a disc ID is attached to one release in a release group, the lookup with this ID yields a result that is good enough for most users.
 +
The release title, artist and track titles match and the cat# etc. are not relevant for many users.
 +
So there is not much incencitive to add the disc ID to more releases in a release group.
 +
::Isrcsubmit (version 2) will include incencitives to attach disc IDs to the exact release the user has ([https://github.com/JonnyJD/musicbrainz-isrcsubmit/pull/75 #75]) --[[User:JonnyJD|JonnyJD]] ([[User talk:JonnyJD|talk]]) 00:20, 8 October 2013 (UTC)
 +
 +
This can easily be fixed on server side to include other releases from the release group, although the precision goes down a bit then.
  
 
== Client support with and without full TOC ==
 
== Client support with and without full TOC ==
Line 54: Line 106:
  
 
== Usage statistics ==
 
== Usage statistics ==
(still left to gather)
+
''(statistics gathered 20130927)''
 +
 
 +
* /ws/2/discid: 9 days, 683,503 requests, only 246 (0.00036 %) with toc parameter (enabling fuzzy lookup when ID is not found)
 +
* mb2freedb: ca. 550-620 k per day (this always uses a fuzzy lookup)
 +
 
 +
So we have over 7 times more requests per freedb gateway than with discids directly (-> 88 % of all lookups are freedb gateway lookups).
 +
Altogether we have ca. 660 k requests per day and 88 % of these work with fuzzy lookup.
  
 
== Summary/Evaluation/TLDNR ==
 
== Summary/Evaluation/TLDNR ==
  
 
Disc IDs are only IDs, so they are not important data in themselves.
 
Disc IDs are only IDs, so they are not important data in themselves.
The important data is in the TOCs. However, the Disc IDs also aren't the problem, since the locks are in place because of the TOCs.
+
The important data is in the (server) TOCs. However, the Disc IDs also aren't the problem, since the locks are in place because of the TOCs.
 
When lookup would work directly by TOC, we would have the same issues with confused taggers (in case of pregap tracks implemented as normal tracks) or inconsistent data (in case of track times in no relation to actual TOCs).
 
When lookup would work directly by TOC, we would have the same issues with confused taggers (in case of pregap tracks implemented as normal tracks) or inconsistent data (in case of track times in no relation to actual TOCs).
 +
So the question would rather be if we want to keep maintaining (server) TOCs.
 +
Removing DiscIDs without removing (server) TOCs attached to releases wouldn't make any sense.
 +
 +
Removing Disc IDs/TOCs would leave 88 % of lookups (freedb gateway) unchanged,
 +
but break most applications that opted to implement lookup by Disc ID as recommended by MusicBrainz.
 +
Nearly none of them support fuzzy lookups.
 +
 +
=== fuzzy disc TOC lookup vs. discid lookup ===
 +
Fuzzy lookup has the problem with unrelated false positives (low precision).
 +
This and the (few) false negatives can only be solved with additional manually attached (server) TOCs or similar
 +
to filter or amend the results.
 +
 +
DiscID lookup does has nearly no false positives (high precision),
 +
but can miss similar releases with different cat#.
 +
DiscID lookup does have to be maintained, but all problems can (in theory) be fixed with adding/moving/removing DiscIDs/TOCs
 +
and/or making the server smarter (in cases of missing releases from the same release group).
  
Additionally the client structure is based on lookup by Disc ID, not by TOC directly.
+
=== exact disc TOC lookup ===
Removing Disc IDs will probably break lots of lookup applications (statistics not gathered as of writing though).
+
Could be used for ranking results for fuzzy TOC lookup (exact matches at the top),
 +
but is otherwise functionally equivalent of a discID lookup.

Latest revision as of 00:20, 8 October 2013

There was some discussion about the issues and a possible removal of Disc IDs at the 13th MusicBrainz Summit. I want to summarize the issues and benefits of discID usage a bit.

Purpose of Disc IDs / TOCs

TOC

A TOC is a set of sector offsets/times for a specific pressing of a CD. A release medium can have multiple TOCs attached, and the (primary) times of the release should be the times from one of the TOCs. This way multiple pressings are grouped in one release medium. Releases can be found with the TOC of the disc, though currently only a fuzzy search by TOC is available (see below).

Depending on context, there are two different types of TOC:

  • disc TOC: The TOC of a certain disc used for lookup
  • server TOC: The TOC of a certain (disc) pressing, saved and attached to a medium on MusicBrainz while "adding a disc ID"

Disc ID

A disc ID is basically a hash of the TOC, so you can generate a disc ID from the TOC, but no TOC from the disc ID. On MusicBrainz these disc IDs can be used as IDs for a specific TOC.
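The one-way nature of the hash can be illustrated with a minimal sketch. It assumes the commonly documented scheme (SHA-1 over the hex-formatted first/last track numbers, the lead-out offset and 99 track offsets, followed by base64 with the characters "+", "/" and "=" replaced); the function name is only illustrative and this is not the libdiscid implementation.

```python
import base64
import hashlib


def musicbrainz_disc_id(first_track, last_track, leadout, offsets):
    """Sketch of a disc ID calculation from a TOC (assumed scheme, see lead-in)."""
    # Slot 0 holds the lead-out offset, slots 1..99 the track offsets.
    # Slots for tracks that do not exist stay 0.
    slots = [0] * 100
    slots[0] = leadout
    for number, offset in zip(range(first_track, last_track + 1), offsets):
        slots[number] = offset

    # Track numbers as 2 hex digits, every offset as 8 hex digits (upper case).
    text = "%02X%02X" % (first_track, last_track)
    text += "".join("%08X" % slot for slot in slots)

    # SHA-1 gives 20 bytes, which base64 turns into 28 characters.
    digest = hashlib.sha1(text.encode("ascii")).digest()
    encoded = base64.b64encode(digest).decode("ascii")

    # Replace the characters that are awkward in URLs.
    return encoded.replace("+", ".").replace("/", "_").replace("=", "-")


if __name__ == "__main__":
    print(musicbrainz_disc_id(1, 2, 30150, [150, 15150]))
```

Changing a single offset by one sector yields a completely different ID, which is why the ID can only be used for exact matches and not for "similar" discs.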

fuzzy (disc) TOC lookup

The MB server can make a fuzzy lookup with the TOC from the disc. This lookup is based on the primary track times for the release and is unrelated to discIDs and (server) TOCs. This kind of lookup includes the correct release most of the time (high recall), but also many completely unrelated releases that happen to have similar times (low precision, example). When a specific release that should be in the results is not, this normally can't be fixed (without also considering exact (server) TOCs).
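A rough sketch of what "fuzzy" means here: the track lengths derived from the disc TOC are compared with the primary track times of a release, and a release matches if every track is within some tolerance. The tolerance value and the matching rule are assumptions for illustration only, not the server's actual algorithm.

```python
SECTORS_PER_SECOND = 75


def track_lengths(offsets, leadout):
    """Track lengths in seconds, derived from the sector offsets of a TOC."""
    bounds = list(offsets) + [leadout]
    return [(end - start) / SECTORS_PER_SECOND
            for start, end in zip(bounds, bounds[1:])]


def fuzzy_match(disc_lengths, release_lengths, tolerance=3.0):
    """True if both lists have the same number of tracks and every track
    length differs by at most `tolerance` seconds (hypothetical rule)."""
    if len(disc_lengths) != len(release_lengths):
        return False
    return all(abs(d - r) <= tolerance
               for d, r in zip(disc_lengths, release_lengths))


if __name__ == "__main__":
    disc = track_lengths([150, 15150], 30150)
    print(disc, fuzzy_match(disc, [201.0, 198.5]))
```

Any unrelated release with the same track count and similar times passes such a test, which is where the low precision comes from.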

FreeDB ID

A FreeDB ID is also an identifier, which can be generated given a TOC. However, standard FreeDB IDs have lots of collisions, so one FreeDB ID can have several corresponding (server) TOCs.
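To see why collisions are common, here is a sketch of the classic CDDB1 calculation as it is usually described: a digit-sum checksum of the track start times, the total playing time in seconds and the track count are packed into 32 bits. The function names are illustrative.

```python
def digit_sum(n):
    """Sum of the decimal digits of n."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def freedb_id(offsets, leadout):
    """Sketch of a CDDB1 (FreeDB) ID from sector offsets and the lead-out."""
    # Checksum over the start time (whole seconds) of every track.
    checksum = sum(digit_sum(offset // 75) for offset in offsets)
    # Playing time in seconds between the first track and the lead-out.
    seconds = leadout // 75 - offsets[0] // 75
    value = ((checksum % 0xFF) << 24) | (seconds << 8) | len(offsets)
    return "%08x" % value


if __name__ == "__main__":
    # Track 2 starts at 204 s in the first TOC and at 213 s in the second,
    # but both have digit sum 6, so the two different discs share one ID.
    print(freedb_id([150, 15363], 95462))
    print(freedb_id([150, 15975], 95462))
```

Only whole seconds and a small checksum go into the ID, so two discs with the same track count and total length collide easily.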

The MusicBrainz FreeDB_Gateway mainly uses the (disc) TOC given in the search command to do a fuzzy search on the (primary) track times of the release (see above). The (misc) freeDB ID returned by the lookup is not a "real" freeDB ID, as it isn't generated the usual way. The misc ID is a representation of the medium ID in the database, and because of that it is collision-free and leads to exactly one medium. Additionally, a search by the FreeDB ID given by the client is done in the (server) TOCs. This adds exactly one result to the list, which isn't necessarily the correct one, due to the collisions mentioned above.

An exact lookup that matches disc TOCs against server TOCs would be technically possible and would remove false positives.

Disc Lookup

The disc ID is used as a short identifier to look up a release "by disc". Taggers expect the returned release to match what they see as the disc: same number of tracks etc. Lookup by TOC is in theory possible, but currently isn't as exact (lookup is currently fuzzy, see above) and isn't widely supported (also see below).

Lookup by Disc ID + disc TOC gives only exact matches when the Disc ID is in the database. However, only some of the releases of a release group might have the Disc ID attached, and other releases in that group might not appear in the results (nearly no (unrelated) false positives, but sometimes false negatives). The releases not appearing in the results are mostly seen as "practically" identical to one release in the results. They mostly have the same release title, artist and track names. Differences in cat# etc. might get lost, but are usually not even shown in the list of results. When the disc ID is not found, a fuzzy lookup (see above) is done.

Additionally listing all releases in a release group with track times similar to one already in the results would be possible, to reduce the rate of false negatives.

Issues with (or because of) Disc IDs

Pregap Tracks (locks)

The ability to attach pregap tracks directly to the release is a long-wanted feature and is tracked in MBS-967. You can't add a Pregap Track as track 0 when a Disc ID is attached to the release (tracklist is locked). Removing the Disc ID could work, but then the release can only be found by clients with (fuzzy) TOC lookup support (see below). Removing Disc IDs altogether would extend the problem to all releases.

The additional "track" would confuse many tagger tools, so the pregap track should only be included when explicitly requested, not as a normal track. Just removing the lock (completely) is no solution, but allowing the addition/naming of a special pregap track would be possible.

Note that the existence of a pregap track is already encoded in (both) TOCs: the first track starts way later than the usual 2 seconds (offset >> 150).
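As a small illustration of reading a pregap out of the TOC: the first track offset is compared with the standard 150 sectors (2 seconds at 75 sectors per second). The threshold for "way more than usual" is a hypothetical value chosen for this sketch.

```python
STANDARD_FIRST_OFFSET = 150  # 2 seconds at 75 sectors per second
SECTORS_PER_SECOND = 75


def pregap_seconds(first_offset):
    """Seconds of audio before track 1 beyond the standard 2-second gap."""
    return (first_offset - STANDARD_FIRST_OFFSET) / SECTORS_PER_SECOND


def has_pregap_track(first_offset, min_seconds=4.0):
    """True if the extra gap is long enough to hide a track.
    min_seconds is an assumed threshold, not a MusicBrainz rule."""
    return pregap_seconds(first_offset) >= min_seconds


if __name__ == "__main__":
    print(has_pregap_track(150), has_pregap_track(15150))
```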

Correcting Times (locks)

When disc IDs are attached, the track times can't be set manually. Setting times to one of the disc IDs is the only option.

I didn't grasp yet when it is the case that no correct disc ID is available and there is a better source for the correct time. The notes mention Video CDs, which possibly shouldn't get disc IDs attached in the first place? (when there are no audio tracks). Otherwise I don't see why we should mess around with length of data tracks. --JonnyJD (talk) 14:24, 25 September 2013 (UTC)

Pre NGS disc IDs (precision/false positives, though related)

Disc IDs added pre-NGS were copied to all releases split off with NGS. So many disc IDs from that time are still attached to all releases in a release group. The results include some wrong entries, but no completely irrelevant entries.

Incentive to add IDs to multiple releases (recall/false negatives, though similar)

When a disc ID is attached to one release in a release group, the lookup with this ID yields a result that is good enough for most users. The release title, artist and track titles match, and the cat# etc. are not relevant for many users. So there is not much incentive to add the disc ID to more releases in a release group.

Isrcsubmit (version 2) will include incentives to attach disc IDs to the exact release the user has (#75) --JonnyJD (talk) 00:20, 8 October 2013 (UTC)

This can easily be fixed on the server side by including other releases from the release group, although the precision then goes down a bit.

Client support with and without full TOC

The main client library for disc ID support is libdiscid:

  • submission url provided includes DiscID and TOC, to save the TOC on the server
  • web service url provided also includes both, but is outdated (WS/1)
  • only the disc ID can be gathered directly with the API (as of 0.5.2)
  • there will be an upcoming release (0.6.0) with a "toc string API" (LIB-41)

The web service allows lookup by disc ID and, if it doesn't match, a fuzzy lookup by TOC. The results of a match by disc ID and of a fuzzy match by TOC look completely different (see comment in LMB-36). A syntactically valid (though not necessarily existing) disc ID is always required, and a match by TOC is currently always fuzzy.

libmusicbrainz is technically able to work with a TOC lookup, but doing so isn't straightforward due to how the web service works (see above and LMB-36).

python-musicbrainzngs can only look up by disc ID. There is an outstanding ticket for (fuzzy) lookup by TOC.
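A request that carries both the disc ID and the TOC could be built as in the following sketch. The host name and the layout of the toc value (first track, last track, lead-out offset, then the track offsets) are assumptions here; only the /ws/2/discid path and the toc parameter are taken from this page. No request is sent.

```python
from urllib.parse import urlencode


def discid_lookup_url(disc_id, first_track, last_track, leadout, offsets,
                      base="https://musicbrainz.org/ws/2/discid/"):
    """Build a lookup URL with a disc ID and a toc parameter (assumed layout)."""
    numbers = [first_track, last_track, leadout, *offsets]
    toc = " ".join(str(n) for n in numbers)
    # urlencode turns the spaces into "+", giving toc=1+2+30150+...
    return base + disc_id + "?" + urlencode({"toc": toc})


if __name__ == "__main__":
    # "EXAMPLE_ID" is a placeholder, not a valid disc ID.
    print(discid_lookup_url("EXAMPLE_ID", 1, 2, 30150, [150, 15150]))
```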

Usage statistics

(statistics gathered 20130927)

  • /ws/2/discid: 9 days, 683,503 requests, only 246 (0.036 %) with toc parameter (enabling fuzzy lookup when ID is not found)
  • mb2freedb: ca. 550-620 k per day (this always uses a fuzzy lookup)

So we have over 7 times more requests via the FreeDB gateway than via discids directly (-> 88 % of all lookups are FreeDB gateway lookups). Altogether we have ca. 660 k requests per day, and 88 % of these work with fuzzy lookup.
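The figures above can be rechecked with a few lines of arithmetic. The sketch uses the midpoint of the 550-620 k range for the gateway, which is an assumption; note that 246 of 683,503 requests is about 0.00036 as a fraction, i.e. roughly 0.036 % as a percentage.

```python
discid_requests = 683_503        # /ws/2/discid requests in the sample
discid_days = 9                  # length of the sample in days
toc_requests = 246               # requests that carried a toc parameter
gateway_per_day = (550_000 + 620_000) / 2  # midpoint of the stated range

toc_share_percent = 100 * toc_requests / discid_requests
discid_per_day = discid_requests / discid_days
gateway_ratio = gateway_per_day / discid_per_day
total_per_day = gateway_per_day + discid_per_day
gateway_share = gateway_per_day / total_per_day

if __name__ == "__main__":
    print("toc share: %.3f %%" % toc_share_percent)
    print("discid requests per day: %.0f" % discid_per_day)
    print("gateway / discid ratio: %.1f" % gateway_ratio)
    print("total per day: %.0f" % total_per_day)
    print("gateway share: %.0f %%" % (100 * gateway_share))
```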

Summary/Evaluation/TLDNR

Disc IDs are only IDs, so they are not important data in themselves. The important data is in the (server) TOCs. However, the Disc IDs also aren't the problem, since the locks are in place because of the TOCs. If lookup worked directly by TOC, we would have the same issues with confused taggers (in case of pregap tracks implemented as normal tracks) or inconsistent data (in case of track times unrelated to the actual TOCs). So the question is rather whether we want to keep maintaining (server) TOCs. Removing DiscIDs without removing the (server) TOCs attached to releases wouldn't make any sense.

Removing Disc IDs/TOCs would leave 88 % of lookups (freedb gateway) unchanged, but break most applications that opted to implement lookup by Disc ID as recommended by MusicBrainz. Nearly none of them support fuzzy lookups.

fuzzy disc TOC lookup vs. discid lookup

Fuzzy lookup has the problem of unrelated false positives (low precision). This and the (few) false negatives can only be solved with additional manually attached (server) TOCs or similar, to filter or amend the results.

DiscID lookup has nearly no false positives (high precision), but can miss similar releases with a different cat#. DiscID lookup does have to be maintained, but all problems can (in theory) be fixed by adding/moving/removing DiscIDs/TOCs and/or making the server smarter (in cases of missing releases from the same release group).

exact disc TOC lookup

It could be used for ranking the results of a fuzzy TOC lookup (exact matches at the top), but is otherwise functionally equivalent to a discID lookup.
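The ranking idea can be sketched as follows: among the candidates returned by a fuzzy lookup, those with a server TOC identical to the disc TOC are moved to the top, and the remaining order is kept. The data layout (a name plus a list of server TOCs per candidate) is hypothetical.

```python
def rank_results(disc_toc, candidates):
    """Put candidates with an exactly matching server TOC first.

    candidates: list of (release_name, list_of_server_tocs) pairs,
    where every TOC is a tuple of sector offsets (assumed layout).
    """
    # sorted() is stable: False (exact match) sorts before True,
    # and the fuzzy order is preserved within both groups.
    return sorted(candidates, key=lambda c: disc_toc not in c[1])


if __name__ == "__main__":
    disc = (150, 15150, 30150)
    fuzzy = [
        ("Unrelated release", [(150, 15100, 30200)]),
        ("Exact pressing", [(150, 15150, 30150)]),
        ("No TOC attached", []),
    ]
    for name, _ in rank_results(disc, fuzzy):
        print(name)
```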