Disc ID Calculation
How MusicBrainz calculates Disc IDs
Let us first have some words about how data is organized on a CD, going top down.
An Audio CD (CD-DA) can hold up to 99 audio tracks.
Sampling is done at a rate of 44.1 kHz using 16 bits resolution per channel, thus there are 44100 x 2 bytes x 2 channels (Stereo) = 176400 bytes of PCM data stored per second.
This audio data is contained in logical blocks of 2352 bytes each on the CD, holding 2352 / 176400 = 1 / 75 seconds of sound.
A logical block plus 882 bytes of error correction and control data forms a raw block of 3234 bytes that is spread among 98 frames of 33 bytes each that are all together written on one spiral track among the CD.
At present we can stop at the level of logical blocks for our task. The deeper stuff will be needed in case the CD-Text format becomes popular, which stores extra data (like artist and track information) in the above mentioned control data parts.
Now let us have a look at a real world CD-DA.
Using a tool like cdcontrol (FreeBSD) on this CD will yield:
marc@oranje$ cdcontrol -f /dev/cd0c info Starting track = 1, ending track = 15, TOC size = 4 bytes track start duration block length type ------------------------------------------------- 1 0:02.00 4:10.41 0 18641 audio 2 4:10.41 3:35.51 18641 16026 audio 3 7:44.17 4:51.08 34667 21683 audio 4 12:33.25 4:37.31 56350 20656 audio 5 17:08.56 6:29.63 77006 29088 audio 6 23:36.44 4:23.60 106094 19635 audio 7 27:58.29 5:22.56 125729 24056 audio 8 33:19.10 4:16.50 149785 19100 audio 9 37:33.60 3:49.00 168885 17025 audio 10 41:20.60 4:27.44 185910 19919 audio 11 45:46.29 5:26.13 205829 24313 audio 12 51:10.42 3:42.17 230142 16517 audio 13 54:50.59 4:14.55 246659 18955 audio 14 59:03.39 5:20.15 265614 23865 audio 15 64:21.54 8:05.28 289479 36253 data 170 72:25.07 - 325732 - -
This is an example of a CD with an extra track of data (what you know as CD-ROM), marketed in this case as CD-Extra, featuring a video and some pictures. (More precise: this is disc with two sessions, audio and data)
You should note the special track 170 ( 0xaa
), that marks the beginning of the leadout track, consisting of the remaining blocks on the CD.
The CD Index algorithm simply takes the following pieces of data and runs them through the SHA-1 hash function:
- First track (normally one): 1 byte
- Last track: 1 byte
- Leadout track: 4 bytes
- 99 frame offsets: 4 bytes for each track.
- If there are less than 99 frame offsets (almost certainly), the value 0 will be used instead.
Before the data is fed through the SHA-1 hash, it is converted to ASCII using printf("%02X", value);
for byte values and printf("%08X", value);
for 32 bit integer values.
Code is a better definition than English, so here is the code that calculates the DiscID:
sprintf(temp, "%02X", pCDInfo->FirstTrack); sha_update(&sha, (unsigned char*) temp, strlen(temp)); sprintf(temp, "%02X", pCDInfo->LastTrack); sha_update(&sha, (unsigned char*) temp, strlen(temp)); for (i = 0; i<100; i++) { sprintf(temp, "%08X", pCDInfo->FrameOffset[i]); sha_update(&sha, (unsigned char*) temp, strlen(temp)); } sha_final(digest, &sha);
Note that the leadout track is stored in pCDInfo->FrameOffset[0]
.
The resulting 20 byte SHA-1 signature is converted to a base 64 encoded character ASCII string that is the disc Id.
One uses base 64 encoding to map arbitrary bytes onto a string of printable ASCII characters.
It does this by redistributing the 24 bits of three 8-bit codes each into four 6-bit codes from a table of 64 very common ASCII characters.
Thus we end up with an Id string of 28 characters, in the above example with "MUtMmKN402WPj3_VFsgUelxpc8U-"
.
Note: This base 64 string is not the same one specified in RFC822. The RFC822 spec uses +
, /
, and =
characters, all of which are special HTTP/URL characters.
To avoid the problems with dealing with that, I (Rob) used .
, _
, and -
. For details on this, please refer to base64.c
in the source code.
This scheme has the advantage of being very simple (simple to understand, simple to implement) and it is not ambiguous.
However, two CDs pressings may not have the same Ids.
To handle this case, the CD Index system will let a user check to see if the CD already exists in the system under a different Id. If so, the system creates a new association for the different pressing of the same CD.
If you'd like to know more about this, please download the client source code and check it out.
The code is clean and self documenting.
If you are interested in creating other CD Index clients and need the SHA-1 source code, you can either dig through the CD Index source, or check the W3C page on SHA-1.
Note: The use of the MD5 algorithm has been discontinued in favour of public domain SHA-1.
Links
- The How Stuff Works site has a very readable introduction on CD technology intended for the technically curious. Written by Marshall Brain.
- A superb coverage of CDs and formats on a scientific level is given by Prof Kelin J Kuhn from University of Washington. Get the basics lecture here and the format/encoding stuff lecture here.
- There is actually a very fine FAQ about toasting available that has lots of interesting facts. Written by Andy McFadden.
- [Find out more about how CD-DA and CD-ROM formats have been enhanced through this PDF document provided by Cinram.] This site seems to be permanently down. I suggest we remove it, unless someone can find this document on another source
- Another introduction from Kelin Kuhn can be found here, again at University of Washington Electrical Engineering.