History:Future Proof Fingerprint
|Status: This Page is Glorious History!
The content of this page either is bit-rotted, or has lost its reason to exist due to some new features having been implemented in MusicBrainz, or maybe just described something that never made it in (or made it in a different way), or possibly is meant to store information and memories about our Glorious Past. We still keep this page to honor the brave editors who, during the prehistoric times (prehistoric for you, newcomer!), struggled hard to build a better present and dreamed of an even better future. We also keep it for archival purposes because possibly it still contains crazy thoughts and ideas that may be reused someday. If you're not into looking at either the past or the future, you should just disregard entirely this page content and look for an up to date documentation page elsewhere.
Future Proof Acoustic Fingerprinting Technology
Status: This page discusses the possibility of an open source acoustic fingerprinting technology that could be developed by MusicBrainz contributors. However, this is a massive task that requires extraordinary programming skills. Initially, no one volunteered to do this. In the meantime MusicIP has teamed up with MusicBrainz and we now use their closed source PUID technology. Recently (May 2006), however, a prototype implementation of some of this, by Juha Heljoranta has been made available here.
Therefore This page only collects Geoff Schmidt's proposal, to prevent it from being forgotten. This is not actively developed, unless you take the work upon you. Also note that Geoff is a professional in this realm, so please keep amateurish proposals out of this page.
OK, let's try to sketch something out.
Let's say I give you a function, FutureProofFingerPrintFunction, that takes some fixed amount of audio -- probably in the 250ms-500ms range -- and gives you back a small integer, say in the range of 0 to 63, such that when the audio undergoes "worst case distortion" for the worst case we're willing to handle, there is a 50% chance that the function will still return the same number. Let's say this "distortion" includes not only things like noise, equalization, and lossy encoding, but the movement of the "start point" of our 250ms chunk in the track by up to ± 50ms, and uniform speeding up or slowing down of the playback rate by up to ± 3%. I'll come back to how we get this function in a minute.
You can imagine taking this function and sliding its 250ms "analysis window" over a song, sample by sample. There are long stretches where the same symbol is returned, because I specified that the function had to be reasonably robust to movement of the window. Let's say that we calculate the function every 100ms and that this is our fingerprint for a stretch of audio -- 10 of these numbers between 0 and 63 per second, essentially an extremely compressed description of the song. (The way we take the samples is important. We shouldn't take them at t, t+100ms, t+200ms, etc. We should take them at t ± 50ms, t + 100ms ± 50ms, t + 200ms ± 50ms, etc. That way we never consistently get the worst case of comparing two fingerprints taken at time bases 50ms apart.)
So say we apply this procedure to a bunch of songs and put them in a database, and provide some kind of efficient substring search index. Now say we have an audio sequence to be identified. We subject it to the same fingerprinting process, and then search for substrings of the query sequence that match substrings of records in the database. (It is simpler to think of the database not as a collection of songs, but as one big long sequence, with annotations labeling particular time ranges of that sequence as particular songs.) This finds us all parts of all songs in the database that match parts of our query (above a certain length, presumably.) The substring matching process can't look for perfect matches; it must be content to have up to 50% of the positions in a pairing not match, because that is the performance of our fingerprint function under distortion.
So how do you use this? You can take a 5-10 second sample of a song, throw it at the database, and see what records contain it, to identify a song. You can take an entire song, throw it at the database, and look for the best match across the entire length of the song to distinguish subtle variations of a song that differ in only a few time intervals, like some clean edits. You can take a 10 hour recording of a radio stream and find the songs that were played (find database entries that are substrings of the query.) And you can do a neat trick where you take, say, a month's recording of a radio station and load it into the database blind, without annotations, and then match the same recording against the database -- that is, you find the places an audio sequence matches itself, which is to say the repeated segments in it, which is to say the songs and commercials. A nice thing is that anyone who knows how to do string handling can hack on applications like these, without having to dive into DSP land or write a classifier.
(Collisions are essentially impossible in this scheme with any query sequence length over a second or two. Two songs would have to have the same time course over the length of the query. In my experience this simply doesn't happen except for pathologically chosen fingerprint functions. The exception to this is pathologically chosen audio, like silence. What all of this means is that the sequence data we collect could be used by many people for many different purposes for many years.)
How can we do this indexing? First of all note that this is a well-studied problem as it is the basis for the BLAST tool that is used in bioinformatics to search for similar DNA and protein sequences. We might even be able to use BLAST ourselves if we put an appropriate cache in front of it and only blasted songs whose hash we had never seen before. But for now let's assume we implement something optimized to our needs and do some back-of-the-envelope calculations.
Let's say we want to prove feasibility for a billion songs using the technology of the near future. Let's say that we want to index the entire length of each song for posterity, not just a fixed ~30 second window, which would be more than sufficient for mp3 file identification. Suppose these songs average 4 minutes in length. Assume that the worst-case "fidelity" of the fingerprint function (likelihood of returning the same number under distortion) is 50% as given above. (We should do much, much better under the expected case of mp3 encoding of a CD rip. I picked 50% to see how much really degraded cases hurt us.) Assume that each "symbol" (fingerprint function output) is a number between 0 and 63. Assume that the fingerprint function f(t) is sampled every 100ms, and that f(t) and f(t+100ms) are statistically independent. Assuming that each "symbol" is equally likely to some out of f(t) is too much to ask. Now, because the symbol distribution will be nonflat, we won't get as much information out of evaluating f(t) as we would out of rolling a 64-sided die or pulling 6 bits out of a true random number generator. Assume that the "entropy of f(t)" is more like 5 bits, that we learn as much as we would out of rolling a 32-sided die. That could be 32 equally likely symbols and 32 that have virtually no chance of being output, or it could be some other nonuniform distribution that has the same net effect.
To index a sequence, we will divide it into overlapping "fragment," each a series of 6 symbols. We'll break each song in the database into these fragments and put each mapping (fragment -> song ID containing it at least once) in a giant hash table. Given a query sequence, we'll break it into fragments too, look each fragment up in the hash table, and do a more detailed analysis on each song that had more than n matching fragments. The principle here is that as the length of the query sequence increases (meaning a linear increase in the amount of work), the expected number of fragment matches for a "correct song match" increases much faster than that for a random track in the database that is matching only "coincidentally," so we are better and better able to pick an n that separates the two classes of songs even as we minimize the chance that we could miss a correct match.
Because symbols, even if adjacent, are independent, the "entropy of a fragment" is 6 symbols * 5 bits/symbol = 30 bits. That means if we pick two random fragments, the chance of them matching is one in 2^30, about one in a billion. We are free to create as many of these fragments per second of audio as we want, as long as we do it in a consistent way. We could take 6 adjacent symbols at each 100ms sample point, giving 10 fragments per second. We could do that every 200ms (or better, do it every 100ms, hash the fragments, and discard the 50% of the with the lower set of hash values) and get 5 fragments per second. We could sample twice at every position, the second time skipping the second symbol and including an extra one at the end, and get 20 fragments per second. As we take more fragments, the size of our index increases linearly, and the length of the query sequence we need under adverse conditions decreases (sublinearly, I think, because the fragments overlap.) Let's say we take 10 fragments a second, and let's make-believe that they don't overlap, for purposes of making some independence assumptions about them later.
There are 4*60*109 seconds of audio in the database, and 4*60*109*10 fragments in the index, or 2.4e12 fragment records (assuming for simplicity that even if a fragment appears twice in one song, we store it twice instead of optimizing and storing it once.) When we do a query for a fragment, the chance that a random record in the database will match by chance is one in 2^30. 2.4e12 / 2^30 = an expected 2235 song ID's returned for any given fragment due to coincidence. Additionally, our results might include something noncoincidental: a match to the "correct" song. Since all six symbols must match and there is a 50% chance that each of them will, if we assume that the failures are independent, there is a one in 64 chance that the song ID of the correct match will be in the results.
If we check 300 fragments from the query, which at 10 fragments a second is 30 seconds of music, and there is one chance in 64 of getting a correct match at each position, there is a .9489 chance of getting at least two fragment matches for the correct song (I used the binomial distribution calculator at http://psych.rice.edu/online_stat/java/binomialProb.html for this.) The expected number of matches is 300 / 64 = 4.7. On the other hand, we'll have a total of about 2235 * 300 = 670500 coincidentally returned song IDs, each a randomly (uniformly, independently, we assume) selected number between 0 and a billion minus one. Using the binomial distribution again and a Python program, I estimated that the expected number of song IDs that will coincidentally appear at least twice across all 300 fragments as 224.
So after we complete this process, we'll have a few hundred song IDs. Most will be spurious, but with 95% confidence, even if 50% of the symbols in the fingerprint got changed to random values, the correct song will be in there (if there are multiple correct matches, each has a 95% chance.) We'll load all the song fingerprints off of disk and send them to the client, which can take the edit distance or whatever to distinguish the correct match from the spurious matches, which will be obviously wrong (random symbols plus two six-symbol shared substrings with the query somewhere, not even in the right place relative to each other.) I said 95% confidence, so what about the other 5% of the time? Well, you can keep scanning through the query sequence, boosting the odds as you go, until you fall off the end of the track and have to declare defeat.
How long will all of this take? Putting 670500 entries in a hash table is pretty fast; the dominating factor is a cache miss for each entry. (You don't need a full hash table, just a big bit array. When you see a song ID, hash it to get the bit index. If that bit is not set, the song ID is being seen for the first time, so set the bit and continue. If it's already set, we're probably seeing this song ID for the second time, so add the song ID to the results list if it's not already there.) I did a super-approximate test just now and my low end emachines Celeron can take that many cache misses in 24ms. The expensive part is the 300 lookups in the fragment database.
I said that there were 2.4e12 fragment records. Let's say that each is a 32-bit integer, with the lower 31 bits giving the song ID (snugly fitting our billion songs) and the high bit indicating the end of a list of entries. That means the total amount of data, without the table of pointers, is 9.6e12 bytes, or 8900 gigabytes, so it must go on disk and not in RAM. 500 GB drives and even larger are routinely available today, so let's say that the "near future" I mentioned holds 1 TB drives. At that capacity a billion song database could feasibly fit on one server with a big RAID card (voice of experience, except with 2001 technology) though more likely it'd be distributed over a few machines with queries routed by the fragment ID requested. If storage prices follow historical trends so that these drives can be had for under $500 each, the total cost of the storage (for the index) would be under $5k.
Then we'd need a file that say, given a fragment, where its associated data is. A fragment is 6*6 bits long (remember, not uniformly distributed, so "worth" only 5*6 bits) so there are 236 possible fragments, most of which will probably get used at least once somewhere, and we'll need to associate with each some sort of 64-bit pointer = 23 bytes, so we're looking at 239 bytes. That is 230 * 29 = 512 GB, again large enough that no compression will make it fit in memory, so figure that's on disk too.
The dominating factor in looking up a fragment to find the song IDs whose sequences contain it is disk seek time. Typical consumer hard drives claim to do a random seek in around 10ms. To look up a fragment takes two seeks, one to read the table, and one to read the actual song ID list, which at 2235 * 4 = ~9k doesn't take a significant amount of time to read after the disk head is positioned. We are figuring on 300 fragment lookups, so that's 600 seeks, or 6000ms = 6 sec. This time can be divided over all of the disks in the system, so in a ten-disk system we'd have a 600ms query time.
600ms (or worse, 6 sec on a single-disk server!) is ages, but *remember that this is for the "worst case distortion"* where 50% of the symbols can change. Let's say that a symbol is only 25% likely to get scrambled, so the chance of a fragment matching is ~18%. Then instead of 300 fragments, we only have to check 25 to have a .9522 chance of getting two correct fragment matches, and we average only 1.56 other spurious song ID matches to send to the client instead of 224. If we check 40 fragments, we'd have a .9962 chance, and average 4.00 spurious matches. These would take 500ms and 800ms of disk time respectively; 50ms and 80ms divided over 10 disks. In reality I'm hoping we'll get more like 90% fidelity on clean, fresh-ripped audio; after all, there are a lot more than 64 distinctly different sounds 100ms long. To adapt to both scenarios, I imagine that the search process would be under client control. The client would send a "next match" command until it was satisfied with the results it had gotten.
[Edit: with a clever disk allocation scheme we could eliminate the table and its seek and speed things up almost 50%. Just seek to an approximate disk location based on hashing the fragment and read until you see a header indicating the start of the fragment you want. You'll read for a lot longer, but you'll avoid the extremely expensive seek.]
Remember also that the above is a thought experiment at the billion-song scalability point. To get started, we'd just start collecting the fingerprints and throw together a small-capacity server, with a simple in-memory database or maybe a SQL backend, possibly even in a language like Python. We can use whatever inefficient, lightweight indexing scheme we prefer, and tune it to our small database (for example, make the fragment length 5 symbols instead of 6, making the match odds much higher and the required query length much shorter.) As database size and load increases, we can do whatever we want to the indexing, transparent to the client.
OK, at the top I promised to say something about the DSP side of things -- the magic function that takes a block of audio and returns one of these symbols. It turns out that the naive way to do this works surprisingly well on digital files, with some basic tuning. Downsample to a 8khz sampling rate or so ("telephone bandwidth) making sure to use a lowpass filter to take out anything above the Nyquist frequency as you do so. Remove DC offset (figure out what you need to subtract or add to the sample to make the average sample 0 and do so.) Normalize volume level (for example, find the mean deviation from zero and multiply the signal by the appropriate value to make it 1.0.) Now take an FFT, using of course an appropriate window function, and throw away the phase information leaving only the power spectrum. Beforehand, come up with a "codebook" of specta, each associated with a symbol. The output of the function is the symbol associated with the codebook spectrum with the minimum Euclidean distance to the observed spectrum. To make this cheaper, you can do dimensionality reduction on the power spectrum and the codebook, so you're taking Euclidean distance in 16- or 8-dimensional space instead of 2048-dimensional space. Principal component analysis works just fine, but that's outside the scope of this email. To build the codebook in the first place, build a training corpus of example spectra and use your favorite clustering algorithm, such a k-means.
The reason such a simple thing works is that you are using the FFT to leak large amounts of temporal information (you are blurring time, much like space is blurred when you take off your glasses, and very much like jpegs are blurred at high compression ratios.) This happens two ways. First of all, when you throw away the phase information, you're throwing away the half of the FFT's output where most of the time structure information ended up. Second, because you carefully picked a FFT window that was a lot wider than your fingerprint function sampling interval, when the window is jittered back and forth a bit, only a small amount of information is entering and leaving the analysis window. The other reason such a simple thing works is that even an 8-dimensional space is very difficult for people to visualize, and Euclidean distance between spectra doesn't work out quite the way you think it will.
At Tuneprint, we augmented this with a full psychoacoustic stack, going from the physical power spectrum to a physiological neural excitation plot. That is overkill. However it's probably reasonable to rescale the frequency axis logarithmically, or according to the Bark scale (the human frequency sensitivity curve) and maybe take the log of the energy. This is dangerous territory because only mp3 encoders care how human ears work. (mp3 encoders and people trying to defeat copy protection measures without fucking up the way a song sounds.) Other distortions see only the physical signal and distort accordingly. We don't have a lot to fear from mp3 encoders, especially under 4khz.
As for other distortions, white noise nudges our point a random amount in spectrum space, and a codebook-based approach will usually be really resilient to this. Equalization is trickier. The first question is, how much robustness/insensitivity do you actually want? A bit of sensitivity to equalization changes can be great for distinguishing different masters of the same work, but in an application where you're trying to identify what's playing overhead in the supermarket using a crappy microphone, you want a great deal of robustness. The quick hack to get robustness is to filter the power spectrum, dividing each frequency bin by the average of those around it. A better-performing answer is to use a "feature-detector" based fingerprint function, which could be based on features like the locations of likely note attacks. In the sequence-based server architecture I talked about above, to get a symbol, we'd combine the features into a vector and throw the vector into a classifier. One problem with feature-based approaches is, what if there are no features in this block of audio? Another distortion is structured noise, like a person talking over a song or two songs mixed together. If the noise has known properties and you have a good sample of it, you can increase robustness by fingerprinting in a linear subspace where the noise isn't, but feature-based approaches seem most promising to me in environments with lots of structured noise.
To take a bit more of a theoretical view, you can divide everything after taking the spectrum and before the codebook lookup into two categories, linear operators and nonlinear operators. Any number of linear operators can be lumped together into one big matrix multiplication. The fingerprint function I'm thinking of will have three parts. First of all, there will be some fixed nonlinearities like the volume level scaling and the psychoacoustic normalization if we use it. The output of all of this will be normalized spectra. Then there'll be a multiplication by a Magic Transformation Matrix, which will project each spectrum down into 8 dimensions or so. I will probably make this with independent component analysis and some hacks to suppress dimensions of variation that correspond to those in a noise training set; this is one way to sneak in equalization robustness. Then there'll be a search through a Magic Codebook with from a few dozen to a few hundred entries. Each entry'd be a point in 8 dimensional space, and you'd scan through the codebook, computing the square of the distance to each entry to find the best match. I'd make the codebook with k-means, maybe with a bias to keep the cells from getting too small (negative effect on robustness) and as balanced in probability as possible (to maximize the information content of a symbol.) On the other hand you can also argue that the codebook should just be a regular quantization grid. I had success with both approaches at TP.
What I'd like to do is make the client code read the Magic Matrix and the Magic Codebook out of a file, perhaps even one that can be updated by the server (they are tables sized around 1024x8 and 64x8 respectively). This would make it easy to do some large-scale testing around the beginning.
Does this design make sense? Questions/comments/ideas/competing proposals? Do you see any math errors?