How MusicBrainz calculates Disc IDs

Let us first have some words about how data is organized on a CD, going top down.

An Audio CD (CD-DA) can hold up to 99 audio tracks.

Sampling is done at a rate of 44.1 kHz using 16 bits resolution per channel, thus there are 44100 x 2 bytes x 2 channels (Stereo) = 176400 bytes of PCM data stored per second.

This audio data is contained in logical blocks of 2352 bytes each on the CD, holding 2352 / 176400 = 1 / 75 seconds of sound.

A logical block plus 882 bytes of error correction and control data forms a raw block of 3234 bytes that is spread among 98 frames of 33 bytes each that are all together written on one spiral track among the CD.

At present we can stop at the level of logical blocks for our task. The deeper stuff will be needed in case the CD-Text format becomes popular, which stores extra data (like artist and track information) in the above mentioned control data parts.

Now let us have a look at a real world CD-DA.

Using a tool like cdcontrol (FreeBSD) on this CD will yield:

   marc@oranje$ cdcontrol -f /dev/cd0c info
   Starting track = 1, ending track = 15, TOC size = 4 bytes
   track     start  duration   block  length   type
   -------------------------------------------------
       1   0:02.00   4:10.41       0   18641  audio
       2   4:10.41   3:35.51   18641   16026  audio
       3   7:44.17   4:51.08   34667   21683  audio
       4  12:33.25   4:37.31   56350   20656  audio
       5  17:08.56   6:29.63   77006   29088  audio
       6  23:36.44   4:23.60  106094   19635  audio
       7  27:58.29   5:22.56  125729   24056  audio
       8  33:19.10   4:16.50  149785   19100  audio
       9  37:33.60   3:49.00  168885   17025  audio
      10  41:20.60   4:27.44  185910   19919  audio
      11  45:46.29   5:26.13  205829   24313  audio
      12  51:10.42   3:42.17  230142   16517  audio
      13  54:50.59   4:14.55  246659   18955  audio
      14  59:03.39   5:20.15  265614   23865  audio
      15  64:21.54   8:05.28  289479   36253   data
     170  72:25.07         -  325732       -      -

This is an example of a CD with an extra track of data (what you know as CD-ROM), marketed in this case as CD-Extra, featuring a video and some pictures. (More precise: this is disc with two sessions, audio and data)

You should note the special track 170 ( 0xaa ), that marks the beginning of the leadout track, consisting of the remaining blocks on the CD.

The CD Index algorithm simply takes the following pieces of data and runs them through the SHA-1 hash function:

First track (normally one): 1 byte
Last track: 1 byte
Leadout track: 4 bytes
99 frame offsets: 4 bytes for each track.

If there are less than 99 frame offsets (almost certainly), the value 0 will be used instead.

Before the data is fed through the SHA-1 hash, it is converted to ASCII using printf("%02X", value); for byte values and printf("%08X", value); for 32 bit integer values.

Code is a better definition than English, so here is the code that calculates the DiscID:

    sprintf(temp, "%02X", pCDInfo->FirstTrack);
    sha_update(&sha, (unsigned char*) temp, strlen(temp));

    sprintf(temp, "%02X", pCDInfo->LastTrack);
    sha_update(&sha, (unsigned char*) temp, strlen(temp));

    for (i = 0; i<100; i++) {
        sprintf(temp, "%08X", pCDInfo->FrameOffset[i]);
        sha_update(&sha, (unsigned char*) temp, strlen(temp));
    }
    sha_final(digest, &sha);

Note that the leadout track is stored in pCDInfo->FrameOffset[0] .

The resulting 20 byte SHA-1 signature is converted to a base 64 encoded character ASCII string that is the disc Id.

One uses base 64 encoding to map arbitrary bytes onto a string of printable ASCII characters.

It does this by redistributing the 24 bits of three 8-bit codes each into four 6-bit codes from a table of 64 very common ASCII characters.

Thus we end up with an Id string of 28 characters, in the above example with "MUtMmKN402WPj3_VFsgUelxpc8U-" .

Note: This base 64 string is not the same one specified in RFC822. The RFC822 spec uses + , / , and = characters, all of which are special HTTP/URL characters.

To avoid the problems with dealing with that, I (Rob) used . , _ , and - . For details on this, please refer to base64.c in the source code.

This scheme has the advantage of being very simple (simple to understand, simple to implement) and it is not ambiguous.

However, two CDs pressings may not have the same Ids.

To handle this case, the CD Index system will let a user check to see if the CD already exists in the system under a different Id. If so, the system creates a new association for the different pressing of the same CD.

If you'd like to know more about this, please download the client source code and check it out.

The code is clean and self documenting.

If you are interested in creating other CD Index clients and need the SHA-1 source code, you can either dig through the CD Index source, or check the W3C page on SHA-1.

Note: The use of the MD5 algorithm has been discontinued in favour of public domain SHA-1.

Links

The How Stuff Works site has a very readable introduction on CD technology intended for the technically curious. Written by Marshall Brain.
A superb coverage of CDs and formats on a scientific level is given by Prof Kelin J Kuhn from University of Washington. Get the basics lecture here and the format/encoding stuff lecture here.
There is actually a very fine FAQ about toasting available that has lots of interesting facts. Written by Andy McFadden.
[Find out more about how CD-DA and CD-ROM formats have been enhanced through this PDF document provided by Cinram.] This site seems to be permanently down. I suggest we remove it, unless someone can find this document on another source
Another introduction from Kelin Kuhn can be found here, again at University of Washington Electrical Engineering.

OriginalVersion

Disc ID Calculation

How MusicBrainz calculates Disc IDs

Links

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

sites

Tools