Future Proof Fingerprint Function

From MusicBrainz Wiki

Revision as of 16:13, 22 May 2006

Adapted from [http://lists.musicbrainz.org/pipermail/musicbrainz-devel/2005-October/001476.html an email] from Geoff Schmidt, giving additional details of his original proposal for a replacement for [[TRM]], which is kept at [[Future Proof Fingerprint|FutureProofFingerPrint]].

An implementation of this, by Juha Heljoranta, is available [http://www.fsfe.org/en/fellows/juha/fpfpf here].
This assumes that the code reads a configuration file that has all of the constants it needs. I've included a little spec for such a file format, which should be easy to read with scanf, and which has enough header data that we can change our minds about what fields we need in the future. Maybe in the early stages the client could shell out to wget and fetch the current config file every night, so we could push a configuration change and get feedback the next day.

File format:

<the magic word "FingerprintConfiguration">

<the name of this configuration>

<the name of the DSP strategy to use -- for now, the magic word "logpower-linear-vq">

<signal_sample_rate, a real number giving the sampling rate to downsample the incoming PCM to>

<analysis_window, the width of the FFT analysis window, in samples>

<sample_interval, the distance between successive evaluations of f(t), in samples>

<basis_vector_count, a non-negative integer>

<basis_vectors, a series of basis_vector_count 'basis vectors', each of which is just analysis_window / 2 reals>

<codebook_size, a non-negative integer>

<codebook, a series of codebook_size 'codebook entries', each of which is just basis_vector_count reals>

<symbols, a series of codebook_size raw ASCII characters, none of which may be whitespace characters>
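As a sketch of what a parser for this format looks like, here is a minimal Python version. It reads whitespace-delimited tokens (so it assumes the configuration name contains no whitespace); a C client would do the same thing with fscanf:

```python
import io

def parse_config(stream):
    """Parse a FingerprintConfiguration file from a whitespace-delimited
    token stream, following the field order given in the spec above."""
    tokens = iter(stream.read().split())
    if next(tokens) != "FingerprintConfiguration":
        raise ValueError("bad magic word")
    name = next(tokens)                      # assumes no whitespace in the name
    if next(tokens) != "logpower-linear-vq":
        raise ValueError("Unsupported DSP strategy.")
    signal_sample_rate = float(next(tokens))
    analysis_window = int(next(tokens))
    sample_interval = int(next(tokens))
    basis_vector_count = int(next(tokens))
    # Each basis vector is analysis_window / 2 reals.
    basis_vectors = [[float(next(tokens)) for _ in range(analysis_window // 2)]
                     for _ in range(basis_vector_count)]
    codebook_size = int(next(tokens))
    # Each codebook entry is basis_vector_count reals.
    codebook = [[float(next(tokens)) for _ in range(basis_vector_count)]
                for _ in range(codebook_size)]
    symbols = next(tokens)                   # codebook_size non-whitespace chars
    return dict(name=name, signal_sample_rate=signal_sample_rate,
                analysis_window=analysis_window, sample_interval=sample_interval,
                basis_vectors=basis_vectors, codebook=codebook, symbols=symbols)
```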

An example file (values that cannot be inferred are shown as angle-bracket placeholders; analysis_window = 8192, basis_vector_count = 3, and codebook_size = 5 follow from the data below):

FingerprintConfiguration

<configuration name>

logpower-linear-vq

<signal_sample_rate>

8192

<sample_interval>

3

.0232 .0423 .1310 .0224 ... (total of 4096 numbers)

.9798 .1223 .0001 -2.9343 ... (total of 4096 numbers)

.0000 .0000 .5000 .0000 ... (total of 4096 numbers)

5

.8970 .7669 -.7688

0 0 0

-1 -2 10 .237

4 -4 4

.123 .456 .789

<symbols: 5 non-whitespace ASCII characters>


Processing blocks of audio.

0. Parse the configuration file. If the DSP strategy isn't 'logpower-linear-vq', fail with the error "Unsupported DSP strategy."

1. If the input stream is compressed, decode it to PCM. If it is stereo, mix it down to mono, mixing both channels equally.

2. Resample the input signal to signal_sample_rate. It's very important to lowpass-filter the signal at the same time, instead of just dropping samples. You can probably get code to do this out of the open-source 'sox' utility.
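Step 2 is sketched below for the special case of an integer decimation factor, with a simple block average standing in for the real lowpass filter. This is only an illustration; a production client should use proper resampling code such as sox's:

```python
def decimate(samples, factor):
    """Downsample by an integer factor. The block average is a crude
    stand-in for a proper anti-aliasing lowpass filter (see sox)."""
    n = len(samples) // factor
    return [sum(samples[i * factor:(i + 1) * factor]) / factor
            for i in range(n)]
```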

3. Read the first analysis_window samples, put them in a buffer, and apply the following steps. Then slide all elements of that buffer "to the left" by sample_interval samples, and read sample_interval more samples from the input signal into the "right" side of the buffer. Do the following steps again. Repeat until there are fewer than sample_interval samples available on the input.
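The sliding-window loop of step 3 can be sketched as a generator, simplified here to only process complete windows:

```python
def frames(samples, analysis_window, sample_interval):
    """Yield successive analysis windows, each advanced by sample_interval
    samples (step 3). Simplification: stops as soon as a full window is no
    longer available, rather than tracking the remaining-input condition."""
    start = 0
    while start + analysis_window <= len(samples):
        yield samples[start:start + analysis_window]
        start += sample_interval
```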

Normalizing and taking a power spectrum.

4. Make a copy of the analysis_window input samples that we are about to process. This is a good time to promote the samples to 'double' if they are still of integer type.

5. Find the average (mean) value of all of the samples. Subtract it from each sample. Now the mean should be 0. In other words, remove any DC offset.

6. Find the mean absolute value of all of the samples. Divide each sample by this mean, making the mean absolute value 1.0. (Unless the mean absolute value is very small, say less than 1% of the maximum possible value based on the scaling of the input samples -- in that case replace the samples with all 0's. 1% would be 1.28 in the case of 8-bit input samples or 327.68 in the case of 16-bit input samples.) This is gain control; in other words it ensures that volume level changes won't change the fingerprint.
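Steps 5 and 6 as a sketch, assuming 16-bit input scaling (full scale 32768, so the 1% silence threshold is 327.68):

```python
def normalize(block, full_scale=32768.0):
    """Steps 5-6: remove the DC offset, then scale so the mean absolute
    value is 1.0. Blocks quieter than 1% of full scale are zeroed."""
    n = len(block)
    mean = sum(block) / n
    centered = [x - mean for x in block]          # step 5: remove DC offset
    mav = sum(abs(x) for x in centered) / n
    if mav < 0.01 * full_scale:                   # near-silence: output zeros
        return [0.0] * n
    return [x / mav for x in centered]            # step 6: gain control
```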

7. Multiply the samples by a Hanning window. In other words, compute a Hanning window in an analysis_window-point-long array, and multiply each sample by the corresponding point in the Hanning array. The Hanning window is w[n] = 0.5 * (1 - cos(2 * pi * n / (N - 1))) for n = 0 .. N - 1, where N = analysis_window.

8. Perform a real-valued FFT. Ignoring the "mirroring" because of the real-valued input, the output is analysis_window / 2 + 1 complex numbers. An excellent GPL FFT library is FFTW.

9. Take the squared magnitude of each complex number (re*re + im*im) to get analysis_window / 2 + 1 real numbers: the power spectrum. The first one is the DC component, and we know it is zero because of the normalization we did. Discard it, leaving analysis_window / 2 reals. Call this power_spectrum.

10. Take the natural logarithm of each element in power_spectrum, except that if an element is extremely close to zero, set it to zero instead. This has the effect of expressing power in terms of a decibel-like measure relative to some unspecified physical reference signal.
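Steps 7 through 10 in sketch form. The naive O(n^2) DFT here is only for illustration (use FFTW in a real client), and the near-zero cutoff of 1e-300 is an arbitrary choice for "extremely close to zero":

```python
import cmath
import math

def log_power_spectrum(block):
    """Steps 7-10: Hanning-window the block, take a (naive) real DFT,
    square the magnitudes, discard the DC bin, and take natural logs."""
    n = len(block)
    # Step 7: Hanning window, w[i] = 0.5 * (1 - cos(2*pi*i / (n - 1))).
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(block)]
    spectrum = []
    for k in range(1, n // 2 + 1):        # bins 1 .. n/2; DC bin 0 discarded
        s = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))        # step 8: one DFT bin
        power = abs(s) ** 2               # step 9: squared magnitude
        # Step 10: natural log, with near-zero values clamped to 0.
        spectrum.append(math.log(power) if power > 1e-300 else 0.0)
    return spectrum
```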

Arbitrary linear transformation.

11. Multiply the basis_vectors matrix by power_spectrum to get a vector called transformed_vector that is basis_vector_count elements long:

11a. Allocate an array transformed_vector, basis_vector_count elements long.

11b. For each element i of transformed_vector:

11c. transformed_vector[i] = 0

11d. For j = 0 to analysis_window / 2 - 1 inclusive:

11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]
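Step 11 is a straightforward matrix-vector product:

```python
def transform(power_spectrum, basis_vectors):
    """Step 11: multiply the basis_vectors matrix (basis_vector_count rows of
    analysis_window / 2 reals) by power_spectrum."""
    return [sum(p * b for p, b in zip(power_spectrum, row))
            for row in basis_vectors]
```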

Vector quantization.

12. Find the codebook vector that is closest (in Euclidean distance) to transformed_vector, and record its index (an integer from 0 to codebook_size - 1 inclusive) as matched_vector:

12a. Set matched_vector = -1

12b. Set matched_distance2, a real, = MAXDOUBLE

12c. For i = 0 to codebook_size - 1 inclusive:

12d. Set this_distance2, a real, = 0

12e. For j = 0 to basis_vector_count - 1 inclusive:

12f. difference = codebook[i][j] - transformed_vector[j]

12g. this_distance2 += difference * difference

12h. End for loop

12i. If this_distance2 < matched_distance2, set matched_vector = i and matched_distance2 = this_distance2


13. Output symbols[matched_vector] as the result of f(t) for this block.
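Steps 12 and 13 amount to a nearest-neighbour search over the codebook (squared distance is enough, since we only compare distances):

```python
def quantize(transformed_vector, codebook, symbols):
    """Steps 12-13: find the codebook entry nearest to transformed_vector
    in squared Euclidean distance and return its output symbol."""
    best_i, best_d2 = -1, float("inf")
    for i, entry in enumerate(codebook):
        d2 = sum((c - t) ** 2 for c, t in zip(entry, transformed_vector))
        if d2 < best_d2:
            best_i, best_d2 = i, d2
    return symbols[best_i]
```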


My first stab at a simple set of basis vectors will be the product of three matrices that drop all but the "telephone bandwidth" part of the spectrum, do some filtering 'across the length of the vector' to reduce the impact of equalization (since this is all happening in a log domain, this would seem to be linear), and finally do dimensionality reduction for convenience. As a first attempt at a codebook, I'll build a training corpus of transformed_vectors and do basic k-means clustering. This, together with a small sample_interval compared to analysis_window, should be enough for us to at least get started on relatively clean files and see where we stand on fidelity.
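The k-means codebook training mentioned above can be sketched as follows (random initial centroids and a fixed iteration count are simplifications; a real training run would test for convergence and handle empty clusters more carefully):

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Basic k-means: cluster the training transformed_vectors and return
    k centroids to use as the codebook."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)        # pick k distinct starting points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to its nearest centroid (squared distance).
            i = min(range(k),
                    key=lambda i: sum((c - x) ** 2
                                      for c, x in zip(centroids[i], v)))
            clusters[i].append(v)
        for i, members in enumerate(clusters):
            if members:                       # empty clusters keep their centroid
                centroids[i] = [sum(xs) / len(members)
                                for xs in zip(*members)]
    return centroids
```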