History:Future Proof Fingerprint Function

From MusicBrainz Wiki
Revision as of 20:03, 25 October 2011 by Reosarevok (talk | contribs)
Status: This Page is Glorious History!

The content of this page either is bit-rotted, or has lost its reason to exist due to some new features having been implemented in MusicBrainz, or maybe just described something that never made it in (or made it in a different way), or possibly is meant to store information and memories about our Glorious Past. We still keep this page to honor the brave editors who, during the prehistoric times (prehistoric for you, newcomer!), struggled hard to build a better present and dreamed of an even better future. We also keep it for archival purposes because possibly it still contains crazy thoughts and ideas that may be reused someday. If you're not into looking at either the past or the future, you should just disregard entirely this page content and look for an up to date documentation page elsewhere.

Adapted from an email from Geoff Schmidt, giving additional details of his original proposal for a replacement for TRM, which is kept at FutureProofFingerPrint.

A Java implementation of this, by Juha Heljoranta, was available here, but that link seems dead. The source code that was posted there is here.

This assumes that the code reads a configuration file containing all of the constants it needs. I've included a little spec for such a file format, which hopefully is easily readable with scanf, and which should have enough header data that we can change our minds about what fields we need in the future. Maybe in the early stages the client could shell out to wget and fetch the current config file every night, so we could push a configuration change and get feedback the next day.

File format:

<the magic word "FingerprintConfiguration">

<the name of this configuration>

<the name of the DSP strategy to use -- for now, the magic word "logpower-linear-vq">

<signal_sample_rate, a real number giving the sampling rate to downsample the incoming PCM to>

<analysis_window, the width of the FFT analysis window, in samples>

<sample_interval, the distance between successive evaluations of f(t), in samples>

<basis_vector_count, a non-negative integer>

<basis_vectors, a series of basis_vector_count 'basis vectors', each of which is just analysis_window / 2 reals>

<codebook_size, a non-negative integer>

<codebook, a series of codebook_size 'codebook entries', each of which is just basis_vector_count reals>

<symbols, a series of codebook_size raw ASCII characters, none of which may be whitespace characters>

Example configuration file (angle-bracketed fields are placeholders; the vectors shown imply analysis_window = 8192, basis_vector_count = 3, and codebook_size = 5):

FingerprintConfiguration
<the name of this configuration>
logpower-linear-vq
<signal_sample_rate>
8192
<sample_interval>
3
.0232 .0423 .1310 .0224 ... (total of 4096 numbers)
.9798 .1223 .0001 -2.9343 ... (total of 4096 numbers)
.0000 .0000 .5000 .0000 ... (total of 4096 numbers)
5
.8970 .7669 -.7688
0 0 0
-1 -2 10.237
4 -4 4
.123 .456 .789
<5 non-whitespace symbol characters>

Processing blocks of audio.

0. Parse the configuration file. If the DSP strategy isn't 'logpower-linear-vq', fail with the error "Unsupported DSP strategy."
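A sketch of this parsing step in Python (a whitespace-token reader rather than scanf; the field order follows the spec above, and all names are illustrative):

```python
def parse_config(text):
    """Parse a FingerprintConfiguration file into a dict (illustrative)."""
    it = iter(text.split())  # the format is whitespace-delimited throughout
    if next(it) != "FingerprintConfiguration":
        raise ValueError("bad magic word")
    cfg = {"name": next(it), "strategy": next(it)}
    if cfg["strategy"] != "logpower-linear-vq":
        raise ValueError("Unsupported DSP strategy.")
    cfg["signal_sample_rate"] = float(next(it))
    cfg["analysis_window"] = int(next(it))
    cfg["sample_interval"] = int(next(it))
    basis_vector_count = int(next(it))
    half = cfg["analysis_window"] // 2
    cfg["basis_vectors"] = [[float(next(it)) for _ in range(half)]
                            for _ in range(basis_vector_count)]
    codebook_size = int(next(it))
    cfg["codebook"] = [[float(next(it)) for _ in range(basis_vector_count)]
                       for _ in range(codebook_size)]
    cfg["symbols"] = next(it)  # codebook_size non-whitespace ASCII characters
    if len(cfg["symbols"]) != codebook_size:
        raise ValueError("symbols length must equal codebook_size")
    return cfg
```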

1. If the input stream is compressed, decode it to PCM. If it is stereo, mix it down to mono, mixing both channels equally.

2. Resample the input signal to signal_sample_rate. It's very important to lowpass-filter the signal at the same time, instead of just dropping samples. You can probably get code to do this out of the open-source 'sox' utility.

3. Read the first analysis_window samples, put them in a buffer, and apply the following steps. Then slide all elements of that buffer "to the left" by sample_interval samples, and read sample_interval more samples from the input signal into the "right" side of the buffer. Do the following steps again. Repeat until there are fewer than sample_interval samples available on the input.
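The sliding-buffer loop in step 3 might look like this in Python (read_samples is a hypothetical callable standing in for the decoder; it returns at most n samples, fewer at end of stream):

```python
def blocks(read_samples, analysis_window, sample_interval):
    """Yield successive overlapping analysis windows (step 3 sketch)."""
    buf = read_samples(analysis_window)
    if len(buf) < analysis_window:
        return  # not enough audio for even one full window
    while True:
        yield list(buf)
        step = read_samples(sample_interval)
        if len(step) < sample_interval:
            return  # fewer than sample_interval samples remain on the input
        buf = buf[sample_interval:] + step  # slide left, append new samples

def read_from_list(data):
    """Toy sample source over an in-memory list (stands in for the decoder)."""
    pos = [0]
    def read(n):
        chunk = data[pos[0]:pos[0] + n]
        pos[0] += n
        return chunk
    return read
```

Each yielded window overlaps the previous one by analysis_window - sample_interval samples.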

Normalizing and taking a power spectrum.

4. Make a copy of the analysis_window input samples that we are about to process. This is a good time to promote the samples to 'double' if they are still of integer type.

5. Find the average (mean) value of all of the samples. Subtract it from each sample. Now the mean should be 0. In other words, remove any DC offset.

6. Find the mean absolute value of all of the samples. Divide each sample by this mean, making the mean absolute value 1.0. (Unless the mean absolute value is very small, say less than 1% of the maximum possible value based on the scaling of the input samples -- in that case replace the samples with all 0's. 1% would be 1.28 in the case of 8-bit input samples or 327.68 in the case of 16-bit input samples.) This is gain control; in other words it ensures that volume level changes won't change the fingerprint.

7. Multiply the samples by a Hanning window. In other words, compute a Hanning window in an analysis_window-point-long array, and multiply each sample by the corresponding point in the Hanning array. Google will give you the formula for the Hanning function.
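Steps 4 through 7 can be sketched as (Python; max_abs_value is the full-scale value of the incoming sample format, e.g. 32768 for 16-bit PCM, and the Hanning formula used is one common form):

```python
import math

def normalize_and_window(samples, max_abs_value):
    """Steps 4-7: remove DC, gain-control, and apply a Hanning window."""
    n = len(samples)
    x = [float(s) for s in samples]           # step 4: promote to double
    mean = sum(x) / n
    x = [s - mean for s in x]                 # step 5: remove any DC offset
    mav = sum(abs(s) for s in x) / n          # step 6: mean absolute value
    if mav < 0.01 * max_abs_value:
        x = [0.0] * n                         # too quiet: treat as silence
    else:
        x = [s / mav for s in x]              # gain control
    # step 7: Hanning window, w[i] = 0.5 - 0.5*cos(2*pi*i/(n-1))
    return [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(x)]
```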

8. Perform a real-valued FFT. Ignoring the "mirroring" because of the real-valued input, the output is analysis_window / 2 + 1 complex numbers. An excellent GPL FFT library is FFTW.

9. Take the squared magnitude of each of the complex numbers to get analysis_window / 2 + 1 real numbers: the power spectrum. The first one is the DC component, and we know it is zero because of the normalization we did. Discard it, leaving analysis_window / 2 reals. Call this power_spectrum.

10. Take the natural logarithm of each element in power_spectrum, except that if an element is extremely close to zero, set it to zero instead. This has the effect of expressing power in terms of a decibel-like measure relative to some unspecified physical reference signal.
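Steps 8 through 10 can be illustrated with a naive O(n^2) DFT (pure Python for clarity; a real implementation would use FFTW as suggested above, and eps is an arbitrary "extremely close to zero" threshold):

```python
import cmath
import math

def log_power_spectrum(windowed, eps=1e-12):
    """Steps 8-10: power spectrum of a windowed block, in log domain.

    Returns analysis_window / 2 values; the DC bin is skipped entirely
    since normalization already forced it to zero.
    """
    n = len(windowed)
    spectrum = []
    for k in range(1, n // 2 + 1):            # bins 1 .. n/2
        z = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        power = z.real * z.real + z.imag * z.imag   # squared magnitude
        spectrum.append(0.0 if power < eps else math.log(power))
    return spectrum
```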

Arbitrary linear transformation.

11. Multiply the basis_vectors matrix by power_spectrum to get a vector called transformed_vector that is basis_vector_count elements long:

11a. Allocate an array transformed_vector, basis_vector_count elements long.

11b. For each element i of transformed_vector:

11c. transformed_vector[i] = 0

11d. For j = 0 to analysis_window / 2 - 1 inclusive:

11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]

Vector quantization.

12. Find the codebook vector that is closest (in Euclidean distance) to transformed_vector, and record its index (an integer from 0 to codebook_size - 1 inclusive) as matched_vector:

12a. Set matched_vector = -1

12b. Set matched_distance2, a real, = MAXDOUBLE

12c. For i = 0 to codebook_size - 1 inclusive:

12d. Set this_distance2, a real, = 0

12e. For j = 0 to basis_vector_count - 1 inclusive:

12f. difference = codebook[i][j] - transformed_vector[j]

12g. this_distance2 += difference * difference

12h. End for loop

12i. If this_distance2 < matched_distance2, set matched_vector = i and matched_distance2 = this_distance2


13. Output symbols[matched_vector] as the result of f(t) for this block.
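Steps 11 through 13 taken together amount to a matrix multiply, a nearest-neighbour search, and a table lookup; a Python sketch with toy-sized vectors:

```python
def quantize(power_spectrum, basis_vectors, codebook, symbols):
    """Steps 11-13: linear transform, nearest-codebook search, symbol output."""
    # step 11: transformed_vector = basis_vectors * power_spectrum
    transformed_vector = [sum(p * b for p, b in zip(power_spectrum, row))
                          for row in basis_vectors]
    # step 12: nearest codebook entry by squared Euclidean distance
    matched_vector, matched_distance2 = -1, float("inf")
    for i, entry in enumerate(codebook):
        this_distance2 = sum((c - t) ** 2
                             for c, t in zip(entry, transformed_vector))
        if this_distance2 < matched_distance2:
            matched_vector, matched_distance2 = i, this_distance2
    # step 13: emit the symbol for this block
    return symbols[matched_vector]
```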


My first stab at a simple set of basis vectors will be the product of three matrices that drop all but the "telephone bandwidth" part of the spectrum, do some filtering 'across the length of the vector' to reduce the impact of equalization (since this is all happening in a log domain, this would seem to be linear), and finally do dimensionality reduction for convenience. As a first attempt at a codebook, I'll build a training corpus of transformed_vectors and do basic k-means clustering. This, together with a small sample_interval compared to analysis_window, should be enough for us to at least get started on relatively clean files and see where we stand on fidelity.
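The basic k-means clustering mentioned here could be sketched as follows (Python; the seed, iteration count, and empty-cluster handling are arbitrary choices for illustration):

```python
import random

def kmeans_codebook(vectors, k, iterations=20, seed=0):
    """Cluster training transformed_vectors into a k-entry codebook."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # assign each vector to its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        # move each centroid to the mean of its cluster
        for c, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it is
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids
```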