Disc ID Calculation: Difference between revisions

From MusicBrainz Wiki
(added OriginalVersion link (Imported from MoinMoin))
m (Fix formatting)
 
(25 intermediate revisions by 9 users not shown)
Latest revision as of 14:14, 28 October 2023


Let us first have some words about how data is organized on a CD, going top down.

An Audio CD (CD-DA) can hold up to 99 audio tracks.

Sampling is done at a rate of 44.1 kHz using 16 bits resolution per channel, thus there are 44100 x 2 bytes x 2 channels (Stereo) = 176400 bytes of PCM data stored per second.

This audio data is contained in logical blocks of 2352 bytes each on the CD, holding 2352 / 176400 = 1 / 75 seconds of sound.

A logical block plus 882 bytes of error correction and control data forms a raw block of 3234 bytes, which is spread across 98 frames of 33 bytes each. All frames are written along a single spiral track on the CD.
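The arithmetic above can be checked with a few lines of C. This is only a sketch: the constant and function names are made up for illustration, and the numbers are exactly the ones quoted in the text.

```c
#include <assert.h>

enum {
    SAMPLE_RATE_HZ   = 44100, /* samples per second and channel */
    BYTES_PER_SAMPLE = 2,     /* 16 bits resolution */
    CHANNELS         = 2,     /* stereo */
    BLOCK_BYTES      = 2352,  /* audio payload of one logical block */
    OVERHEAD_BYTES   = 882,   /* error correction and control data */
    FRAMES_PER_BLOCK = 98,    /* frames that carry one raw block */
    FRAME_BYTES      = 33     /* size of one frame */
};

/* PCM bytes consumed by one second of stereo audio. */
int pcm_bytes_per_second(void) {
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;
}

/* Logical blocks played per second; the division has no remainder. */
int blocks_per_second(void) {
    assert(pcm_bytes_per_second() % BLOCK_BYTES == 0);
    return pcm_bytes_per_second() / BLOCK_BYTES;
}

/* Size of a raw block: audio payload plus error correction and control data. */
int raw_block_bytes(void) {
    return BLOCK_BYTES + OVERHEAD_BYTES;
}
```

Here `blocks_per_second()` evaluates to 75, which is why one logical block holds 1/75 second of sound, and `raw_block_bytes()` evaluates to 3234, the same as `FRAMES_PER_BLOCK * FRAME_BYTES`.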

At present we can stop at the level of logical blocks for our task. The lower levels will only be needed if the CD-Text format becomes popular, because it stores extra data (like artist and track information) in the control data mentioned above.

Reading the TOC

Audio CD

Now let us have a look at a real world CD-DA. Using a tool like cdrecord on this CD will yield:

$ cdrecord dev=/dev/cdrom -toc
...
first: 1 last 6
track:   1 lba:         0 (        0) 00:02:00 adr: 1 control: 2 mode: 0
track:   2 lba:     15213 (    60852) 03:24:63 adr: 1 control: 2 mode: 0
track:   3 lba:     32164 (   128656) 07:10:64 adr: 1 control: 2 mode: 0
track:   4 lba:     46442 (   185768) 10:21:17 adr: 1 control: 2 mode: 0
track:   5 lba:     63264 (   253056) 14:05:39 adr: 1 control: 2 mode: 0
track:   6 lba:     80339 (   321356) 17:53:14 adr: 1 control: 2 mode: 0
track:lout lba:     95312 (   381248) 21:12:62 adr: 1 control: 2 mode: -1

You can see that the CD has 6 audio tracks and the special lead-out track (it has number 170, or 0xAA, in the CD TOC). Also note that the LBA (Logical Block Address) offsets start at address 0, but the first track actually starts at 00:02:00 (the standard length of the lead-in track). At 75 logical blocks per second, these 2 seconds correspond to 150 blocks, so we need to add 150 logical blocks to each LBA offset. The resulting data needed to calculate a MusicBrainz disc ID are:

track 1:       150    (150 + 0)
track 2:       15363  (150 + 15213)
track 3:       32314  (150 + 32164)
track 4:       46592  (150 + 46442)
track 5:       63414  (150 + 63264)
track 6:       80489  (150 + 80339)
lead-out track: 95462  (150 + 95312)
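The conversion in the table above can be sketched in C as follows. The function and constant names are hypothetical. The second helper assumes that the minute:second:frame column printed by cdrecord counts 75 frames per second from the very start of the disc, which is consistent with the numbers shown.

```c
#include <assert.h>

enum {
    BLOCKS_PER_SECOND = 75,  /* logical blocks per second of audio */
    OFFSET_BLOCKS     = 150  /* the 2 seconds before LBA 0 */
};

/* Convert an LBA as reported by the drive into a disc ID frame offset. */
long lba_to_offset(long lba) {
    assert(lba >= 0);
    return lba + OFFSET_BLOCKS;
}

/* Convert a minute:second:frame address into an absolute frame offset. */
long msf_to_offset(int minute, int second, int frame) {
    assert(minute >= 0 && second >= 0 && second < 60);
    assert(frame >= 0 && frame < BLOCKS_PER_SECOND);
    return ((long)minute * 60 + second) * BLOCKS_PER_SECOND + frame;
}
```

Both routes agree for track 2 of the example: `lba_to_offset(15213)` and `msf_to_offset(3, 24, 63)` each give 15363.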

Multi-session (audio + data) CD

Because MusicBrainz doesn't include data tracks in Disc IDs, reading the TOC from a multi-session disc is a little more complicated. Running cdrecord on this CD will give us:

$ cdrecord dev=/dev/cdrom -toc
...
first: 1 last 8
track:   1 lba:         0 (        0) 00:02:00 adr: 1 control: 0 mode: 0
track:   2 lba:     13959 (    55836) 03:08:09 adr: 1 control: 0 mode: 0
track:   3 lba:     33436 (   133744) 07:27:61 adr: 1 control: 0 mode: 0
track:   4 lba:     52927 (   211708) 11:47:52 adr: 1 control: 0 mode: 0
track:   5 lba:     65631 (   262524) 14:37:06 adr: 1 control: 0 mode: 0
track:   6 lba:     77742 (   310968) 17:18:42 adr: 1 control: 0 mode: 0
track:   7 lba:     99024 (   396096) 22:02:24 adr: 1 control: 0 mode: 0
track:   8 lba:    125824 (   503296) 27:59:49 adr: 1 control: 6 mode: 1
track:lout lba:    188333 (   753332) 41:53:08 adr: 1 control: 6 mode: -1

This is an example of a CD with an extra track of data (what you know as CD-ROM), marketed in this case as CD-Extra, featuring a video and some pictures. More precisely, this is a disc with two sessions: audio and data.

This CD has 8 tracks, but only the first 7 audio tracks should be used to calculate a MusicBrainz disc ID. The problem is that we can't use the offset of track 8 as the "lead-out track" offset, because there is a gap between the audio session and the data session. This gap is 11400 frames long (11250 frames for lead-out/lead-in + 150 frames of pre-gap), so we need to subtract this value from the offset of track 8 to get the end of track 7. The result is:

track 1:       150    (150 + 0)
track 2:       14109  (150 + 13959)
track 3:       33586  (150 + 33436)
track 4:       53077  (150 + 52927)
track 5:       65781  (150 + 65631)
track 6:       77892  (150 + 77742)
track 7:       99174  (150 + 99024)
lead-out track: 114574 (150 + 125824 - 11400)
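The lead-out correction can be sketched in C as shown below. The struct and function names are made up for this example. The sketch assumes that data tracks are recognised by bit 0x04 of the control field (which matches `control: 6` for track 8 in the listing) and that the data track sits in a second session after all audio tracks, as on this CD-Extra disc. Other layouts are not handled.

```c
#include <assert.h>
#include <stddef.h>

struct toc_entry {
    int  number;   /* track number as listed in the TOC */
    long lba;      /* start address in logical blocks */
    int  control;  /* control field; bit 0x04 marks a data track */
};

enum {
    LBA_SHIFT   = 150,   /* added to every LBA offset */
    SESSION_GAP = 11400  /* 11250 lead-out/lead-in + 150 pre-gap frames */
};

/* Return the lead-out offset to use for the disc ID and store the number
 * of the last audio track in *last_audio. leadout_lba is the LBA of the
 * real lead-out, used when the disc has no data track. */
long audio_leadout_offset(const struct toc_entry *tracks, size_t count,
                          long leadout_lba, int *last_audio) {
    assert(tracks != NULL && count > 0 && last_audio != NULL);
    *last_audio = 0;
    for (size_t i = 0; i < count; i++) {
        if (tracks[i].control & 0x04) {
            /* First data track: the audio ends one session gap earlier. */
            return tracks[i].lba + LBA_SHIFT - SESSION_GAP;
        }
        *last_audio = tracks[i].number;
    }
    return leadout_lba + LBA_SHIFT;
}
```

For the TOC above the function returns 114574 and reports track 7 as the last audio track. For a pure audio CD it simply returns the real lead-out shifted by 150.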

Calculating the Disc ID

Step 1: Hashing the binary TOC data

The CD Index algorithm simply takes the following pieces of data and runs them through the SHA-1 hash function:

  • First track number (normally one): 1 byte
  • Last track number: 1 byte
  • Lead-out track offset: 4 bytes
  • 99 frame offsets: 4 bytes for each track
    If there are fewer than 99 tracks (almost always the case), the value 0 is used for the unused offsets.

Before the data is fed through the SHA-1 hash, it is converted to upper-case hex ASCII using printf("%02X", value); for single-byte values and printf("%08X", value); for 4-byte values.

Code is a better definition than English, so here is the code that calculates the DiscID:

/* Feed the first and last track numbers as two upper-case hex digits each. */
sprintf(temp, "%02X", pCDInfo->FirstTrack);
sha_update(&sha, (unsigned char*) temp, strlen(temp));

sprintf(temp, "%02X", pCDInfo->LastTrack);
sha_update(&sha, (unsigned char*) temp, strlen(temp));

/* Feed the lead-out offset (index 0) and the 99 track offsets as eight hex digits each. */
for (i = 0; i < 100; i++) {
    sprintf(temp, "%08X", pCDInfo->FrameOffset[i]);
    sha_update(&sha, (unsigned char*) temp, strlen(temp));
}
sha_final(digest, &sha);

Note that the lead-out track offset is stored in pCDInfo->FrameOffset[0]. The offsets of tracks 1 to 99 follow in FrameOffset[1] to FrameOffset[99].
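To make the hashed input visible, the following sketch builds the ASCII message as one string instead of passing it to the hash piece by piece. The function name `toc_message` and the constant `TOC_MESSAGE_LEN` are hypothetical. The layout follows the list and the code above.

```c
#include <assert.h>
#include <stdio.h>

/* 2 hex digits each for first and last track, 8 for each of 100 offsets. */
enum { TOC_MESSAGE_LEN = 2 + 2 + 100 * 8 };

/* Write the message that gets hashed into out, which must have room for
 * TOC_MESSAGE_LEN + 1 bytes. offsets[0] is the lead-out, offsets[1..99]
 * are the tracks, with 0 for unused entries. Returns the message length. */
int toc_message(int first, int last, const unsigned offsets[100], char *out) {
    assert(first >= 1 && first <= last && last <= 99);
    int n = sprintf(out, "%02X%02X", (unsigned)first, (unsigned)last);
    for (int i = 0; i < 100; i++)
        n += sprintf(out + n, "%08X", offsets[i]);
    return n;
}
```

For the 6-track audio CD above, the message is 804 characters long and starts with `0106000174E600000096`: first track 01, last track 06, lead-out 95462 (hex 174E6) and track 1 at 150 (hex 96). The unused track slots are filled with `00000000`.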

Step 2: Base64-encoding of the hash

The resulting 20-byte SHA-1 signature is converted into a Base64-encoded string of printable ASCII characters, which is the disc ID. Base64 maps every 3 bytes onto 4 characters, so the 20 bytes (padded to 21) always yield exactly 28 characters. In the above audio CD example the disc ID is 49HHV7Eb8UKF3aQiNmu1GR8vKTY-. For details about Base64, please see RFC 4648 or the reasonably good Wikipedia article.

Important note: The Base64 encoding used by MusicBrainz is not the same as the one specified in RFC 4648. The specification uses the +, /, and = characters, all of which have a special meaning in HTTP/URLs. To avoid problems with that, MusicBrainz uses ., _, and - instead. For details on this, please refer to base64.c in the libdiscid source code.
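Both steps can be put together in one self-contained sketch. This is not the libdiscid code. The SHA-1 routine is a compact version written after RFC 3174 and meant only for short messages. The Base64 routine uses the standard alphabet with the three substitutions described above. The names `sha1`, `mb_base64` and `mb_disc_id` are chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Process one 64-byte block and update the five-word state h. */
static void sha1_block(uint32_t h[5], const uint8_t *p) {
    uint32_t w[80], a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    for (int t = 0; t < 16; t++)
        w[t] = ((uint32_t)p[4 * t] << 24) | ((uint32_t)p[4 * t + 1] << 16) |
               ((uint32_t)p[4 * t + 2] << 8) | (uint32_t)p[4 * t + 3];
    for (int t = 16; t < 80; t++)
        w[t] = rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1);
    for (int t = 0; t < 80; t++) {
        uint32_t f, k;
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        uint32_t tmp = rol(a, 5) + f + e + k + w[t];
        e = d; d = c; c = rol(b, 30); b = a; a = tmp;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* SHA-1 of msg[0..len), written to out as 20 bytes. */
void sha1(const uint8_t *msg, size_t len, uint8_t out[20]) {
    uint32_t h[5] = {0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0};
    uint8_t tail[128] = {0};
    size_t full = len / 64 * 64, rem = len - full;
    for (size_t i = 0; i < full; i += 64) sha1_block(h, msg + i);
    memcpy(tail, msg + full, rem);           /* leftover bytes */
    tail[rem] = 0x80;                        /* mandatory 1 bit */
    size_t tail_len = rem < 56 ? 64 : 128;   /* room for the length field */
    uint64_t bits = (uint64_t)len * 8;
    for (int i = 0; i < 8; i++)              /* big-endian bit count */
        tail[tail_len - 1 - i] = (uint8_t)(bits >> (8 * i));
    for (size_t i = 0; i < tail_len; i += 64) sha1_block(h, tail + i);
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 4; j++)
            out[4 * i + j] = (uint8_t)(h[i] >> (24 - 8 * j));
}

/* Base64 with '.' and '_' instead of '+' and '/', padded with '-'.
 * out needs room for 4 * ((len + 2) / 3) + 1 bytes. */
void mb_base64(const uint8_t *in, size_t len, char *out) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789._";
    size_t o = 0;
    for (size_t i = 0; i < len; i += 3) {
        uint32_t v = (uint32_t)in[i] << 16;
        if (i + 1 < len) v |= (uint32_t)in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = (i + 1 < len) ? tbl[(v >> 6) & 63] : '-';
        out[o++] = (i + 2 < len) ? tbl[v & 63] : '-';
    }
    out[o] = '\0';
}

/* Disc ID for a TOC; offsets[0] is the lead-out, offsets[1..99] the tracks. */
void mb_disc_id(int first, int last, const unsigned offsets[100], char out[29]) {
    char msg[2 + 2 + 100 * 8 + 1];
    uint8_t digest[20];
    assert(first >= 1 && first <= last && last <= 99);
    int n = sprintf(msg, "%02X%02X", (unsigned)first, (unsigned)last);
    for (int i = 0; i < 100; i++)
        n += sprintf(msg + n, "%08X", offsets[i]);
    sha1((const uint8_t *)msg, (size_t)n, digest);
    mb_base64(digest, sizeof digest, out);
}
```

Calling `mb_disc_id(1, 6, offsets, id)` with `offsets` set to `{95462, 150, 15363, 32314, 46592, 63414, 80489}` and the remaining entries 0 is meant to reproduce the disc ID quoted above for the audio CD example. For real applications, libdiscid or a tested SHA-1 implementation is the safer choice.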

Remarks

The disc ID scheme has the advantage of being very simple (simple to understand, simple to implement). However, two different pressings of the same disc may have different IDs. To handle this case, the MusicBrainz system will let a user check to see if the CD already exists in the system under a different ID. If so, the system creates a new association for the different pressing of the same CD. Also note that a disc ID is usually not ambiguous, but it can still happen that different CDs have exactly the same set of frame offsets and hence the same disc ID, for example lwHl8fGzJyLXQR33ug60E8jhf4k-.

If you'd like to know more about disc ID calculation, please download the libdiscid source code and check it out. The code is clean and self-documenting.

If you are interested in creating other MusicBrainz clients and need the SHA-1 source code, check out RFC 3174.

Tools

Tools based on our library libdiscid:

  • Our tagger Picard that has a feature “Lookup by CD”
  • Some of the ISRC submit tools that also support submitting Disc ID

Tools based on third party implementations:

  • mbdiscid, a command-line tool (written in Perl) that computes the Disc ID, and either prints it or submits it through a browser
  • MetaBrainz.MusicBrainz.dotnet-mbdiscid (NuGet Package), a command-line tool (using the C#/Mono/.NET library MetaBrainz.MusicBrainz.DiscId) that shows information about an audio CD, including the Disc ID, CD-TEXT data, ISRC values…

Libraries

The disc ID can be calculated using our C library, libdiscid, or any of its many language bindings.

Third party libraries:

  • C#/Mono/.NET - MetaBrainz.MusicBrainz.DiscId (NuGet Package), a library which adds support for CD-TEXT retrieval, but does not currently support Solaris (no recent Mono, no .NET Core) or macOS (no machine available)

Links

  • The How Stuff Works site has a very readable introduction on CD technology intended for the technically curious. Written by Marshall Brain.
  • There is actually a very fine FAQ about toasting available that has lots of interesting facts. Written by Andy McFadden.