Difference between revisions of "History:Future Proof Fingerprint Function"

From MusicBrainz Wiki
((Imported from MoinMoin))
 
((Imported from MoinMoin))
Line 75: Line 75:
 
3. Read the first analysis_window samples, put them in a buffer, and apply the following steps. Then slide all elements of that buffer "to the left" by sample_interval samples, and read sample_interval more samples from the input signal into the "right" side of the buffer. Do the following steps again. Repeat until there are fewer than sample_interval samples available on the input.  
 
3. Read the first analysis_window samples, put them in a buffer, and apply the following steps. Then slide all elements of that buffer "to the left" by sample_interval samples, and read sample_interval more samples from the input signal into the "right" side of the buffer. Do the following steps again. Repeat until there are fewer than sample_interval samples available on the input.  
  
[Normalizing and taking a power spectrum.]
+
===Normalizing and taking a power spectrum.===
  
 
4. Make a copy of the analysis_window input samples that we are about to process. This is a good time to promote the samples to 'double' if they are still of integer type.  
 
4. Make a copy of the analysis_window input samples that we are about to process. This is a good time to promote the samples to 'double' if they are still of integer type.  
Line 91: Line 91:
 
10. Take the natural logarithm of each element in power_spectrum, except that if an element is extremely close to zero, set it to zero instead. This has the effect of expressing power in terms of a decibel-like measure relative to some unspecified physical reference signal.  
 
10. Take the natural logarithm of each element in power_spectrum, except that if an element is extremely close to zero, set it to zero instead. This has the effect of expressing power in terms of a decibel-like measure relative to some unspecified physical reference signal.  
  
[Arbitrary linear transformation.]
+
===Arbitrary linear transformation.===
  
 
11. Multiply the basis_vectors matrix by power_spectrum to get a vector called transformed_vector that is basis_vector_count elements long:  
 
11. Multiply the basis_vectors matrix by power_spectrum to get a vector called transformed_vector that is basis_vector_count elements long:  
Line 105: Line 105:
 
11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]  
 
11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]  
  
[Vector quantization.]
+
===Vector quantization.===
  
 
12. Find the codebook vector that is closest (in Euclidean distance) to transformed vector) and record its index (an integer from 0 to codebook_size-1 inclusive) as matched_vector:  
 
12. Find the codebook vector that is closest (in Euclidean distance) to transformed vector) and record its index (an integer from 0 to codebook_size-1 inclusive) as matched_vector:  
Line 127: Line 127:
 
12i. If this_distance2 < matched_distance2, set matched_vector = i and matched_distance2 = this_distance2  
 
12i. If this_distance2 < matched_distance2, set matched_vector = i and matched_distance2 = this_distance2  
  
[Output.]
+
===Output.===
  
 
13. Output symbols[matched_vector] as the result of f(t) for this block.  
 
13. Output symbols[matched_vector] as the result of f(t) for this block.  
  
Commentary:  
+
==Commentary:==
  
 
My first stab at a simple set of basis vectors will be the product of three matrices that drop all but the "telephone bandwidth" part of the spectrum, do some filtering 'across the length of the vector' to reduce the impact of equalization (since this is all happening in a log domain, this would seem to be linear), and finally do dimensionality reduction for convenience. As a first attempt at a codebook, I'll build a training corpus of transformed_vectors and do basic k-means clustering. This, together with a small sample_interval compared to analysis_window, should be enough for us to at least get started on relatively clean files and see where we stand on fidelity.  
 
My first stab at a simple set of basis vectors will be the product of three matrices that drop all but the "telephone bandwidth" part of the spectrum, do some filtering 'across the length of the vector' to reduce the impact of equalization (since this is all happening in a log domain, this would seem to be linear), and finally do dimensionality reduction for convenience. As a first attempt at a codebook, I'll build a training corpus of transformed_vectors and do basic k-means clustering. This, together with a small sample_interval compared to analysis_window, should be enough for us to at least get started on relatively clean files and see where we stand on fidelity.  
 
[[Category:To Be Reviewed]] [[Category:Development]]
 
[[Category:To Be Reviewed]] [[Category:Development]]

Revision as of 15:59, 31 January 2006

Adapded from an email at: http://lists.musicbrainz.org/pipermail/musicbrainz-devel/2005-October/001476.html

This assumes that the code reads a configuration file that has all of the constants it needs. I've included a little spec for such a file format, which hopefully is easily readable with scanf, and should have enough header data that we can change our mind about what fields we need in the future. Maybe in the early stages the client could shell out to wget and get the current config file every night, so we could push a configuration change and get feedback the next day.

File format:

<the magic word "FingerprintConfiguration">

<the name of this configuration>

<the name of the DSP strategy to use -- for now, the magic word "logpower-linear-vq">

<signal_sample_rate, a real number giving the sampling rate to downsample the incoming PCM to>

<analysis_window, the width of the FFT analysis window, in samples>

<sample_interval, the distance between successive evaluations of f(t), in samples>

<basis_vector_count, a non-negative integer>

<basis_vectors, a series of basis_vector_count 'basis vectors', each of which is just analysis_window / 2 reals>

<codebook_size, a non-negative integer>

<codebook, a series of codebook_size 'codebook entries', each of which is just basis_vector_count reals>

<symbols, a series of codebook_size raw ASCII characters, none of which may be whitespace characters>

Example:

FingerprintConfiguration

sampleconfig

logpower-linear-vq

8192

2048

768

3

.0232 .0423 .1310 .0224 ... (total of 4096 numbers)

.9798 .1223 .0001 -2.9343 ... (total of 4096 numbers)

.0000 .0000 .5000 .0000 ... (total of 4096 numbers)

5

.8970 .7669 -.7688

0 0 0

-1 -2 10 .237

4 -4 4

.123 .456 .789

abc

Pseudocode:

Processing blocks of audio.

0. Parse the configuration file. If the DSP strategy isn't 'logpower-linear-vq', fail with the error "Unsupported DSP strategy."

1. If the input stream is compressed, decode it to PCM. If it is stereo, mix it down to mono, mixing both channels equally.

2. Resample the input signal to signal_sample_rate. It's very important to lowpass-filter the signal at the same time, instead of just dropping samples. You can probably get code to do this out of the open-source 'sox' utility.

3. Read the first analysis_window samples, put them in a buffer, and apply the following steps. Then slide all elements of that buffer "to the left" by sample_interval samples, and read sample_interval more samples from the input signal into the "right" side of the buffer. Do the following steps again. Repeat until there are fewer than sample_interval samples available on the input.

Normalizing and taking a power spectrum.

4. Make a copy of the analysis_window input samples that we are about to process. This is a good time to promote the samples to 'double' if they are still of integer type.

5. Find the average (mean) value of all of the samples. Subtract it from each sample. Now the mean should be 0. In other words, remove any DC offset.

6. Find the mean absolute value of all of the samples. Divide each sample by this mean, making the mean absolute value 1.0. (Unless the mean absolute value is very small, say less than 1% of the maximum possible value based on the scaling of the input samples -- in that case replace the samples with all 0's. 1% would be 1.28 in the case of 8-bit input samples or 327.68 in the case of 16-bit input samples.) This is gain control; in other words it ensures that volume level changes won't change the fingerprint.

7. Multiply the samples by a Hanning window. In other words, compute a Hanning window in an analysis_window-point-long array, and multiply each sample by the corresponding point in the Hanning array. Google will give you the formula for the Hanning function.

8. Perform a real-valued FFT. Ignoring the "mirroring" because of the real-valued input, the output is analysis_window / 2 + 1 complex numbers. An excellent GPL FFT library is FFTW.

9. Square the complex numbers to get analysis_window / 2 + 1 real numbers: the power spectrum. The first one is the DC offset, and we know it is zero because of the normalization we did. Discard it, leaving analysis_window / 2 reals. Call this power_spectrum.

10. Take the natural logarithm of each element in power_spectrum, except that if an element is extremely close to zero, set it to zero instead. This has the effect of expressing power in terms of a decibel-like measure relative to some unspecified physical reference signal.

Arbitrary linear transformation.

11. Multiply the basis_vectors matrix by power_spectrum to get a vector called transformed_vector that is basis_vector_count elements long:

11a. Allocate an array transformed_vector, basis_vector_count elements long.

11b. For each element i of transformed_vector:

11c. transformed_vector[i] = 0

11d. For j = 0 to analysis_window / 2 - 1 inclusive:

11e. transformed_vector[i] += power_spectrum[j] * basis_vectors[i][j]

Vector quantization.

12. Find the codebook vector that is closest (in Euclidean distance) to transformed vector) and record its index (an integer from 0 to codebook_size-1 inclusive) as matched_vector:

12a. Set matched_vector = -1

12b. Set matched_distance2, a real, = MAXDOUBLE

12c. For i = 0 to codebook_size - 1 inclusive:

12d. Set this_distance, a real = 0

12e. For j = 0 to basis_vector_count:

12f. difference = codebook[i][j] - transformed_vector[j]

12g. this_distance2 += difference * difference

12h. End for loop

12i. If this_distance2 < matched_distance2, set matched_vector = i and matched_distance2 = this_distance2

Output.

13. Output symbols[matched_vector] as the result of f(t) for this block.

Commentary:

My first stab at a simple set of basis vectors will be the product of three matrices that drop all but the "telephone bandwidth" part of the spectrum, do some filtering 'across the length of the vector' to reduce the impact of equalization (since this is all happening in a log domain, this would seem to be linear), and finally do dimensionality reduction for convenience. As a first attempt at a codebook, I'll build a training corpus of transformed_vectors and do basic k-means clustering. This, together with a small sample_interval compared to analysis_window, should be enough for us to at least get started on relatively clean files and see where we stand on fidelity.