History:Find Duplicate Music Files

From MusicBrainz Wiki

This HOWTO describes how to use TRMs to search a music collection and identify duplicate tracks. It will probably only work on Unix-like operating systems (if anyone has a Windows equivalent, please add details to this page). The method uses the "trm" utility included in the TunePimp library. This solution was provided by Björn Krombholz (also known as Fuchs):

  • IIRC this approach is quite close to a denial-of-service attack on the TRM server, since the server is queried for each generated TRM. So please use this sparingly. --DonRedman


1) For Linux or other *nixes you can get the libtunepimp package, which normally includes the small "trm" utility. In this case you can simply run a bash one-liner like:

 # find / -regex '.*\.\(ogg\|mp3\|wav\)' -exec trm '{}' ';' -print0 | tr '\n\000' ' \n' > /tmp/trms.log

It can take hours to complete for large collections, but it will create the file "/tmp/trms.log" of the form:

 <TRM> <ABSOLUTE PATH TO FILE>

2) You can then sort this file and collapse the lines that share a TRM to find the duplicates, for example by using:

 # sort /tmp/trms.log | uniq -c -w36 | grep -v -e '^ *1 ' > /tmp/duplicates.log

which collapses each run of lines sharing a TRM into a single line prefixed with a count, then drops the lines whose count is 1, leaving only TRMs with two or more associated files. The output will be of the form:

 <NUMBER OF OCCURRENCES> <TRM> <PATH OF THE FIRST FILE WITH THIS TRM>
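
The same counting can also be sketched with awk, which avoids depending on a particular uniq option. This is only an illustration: it assumes the log has the "<TRM> <PATH>" layout shown earlier, with the TRM as the first whitespace-delimited field, and the function name count_dup_trms is made up for this example.

```shell
# Hypothetical helper: print "<count> <TRM> <first path>" for every TRM
# that occurs more than once in a "<TRM> <PATH>" log file.
count_dup_trms() {
    awk '
        {
            trm = $1
            count[trm]++
            # Remember the first full line seen for each TRM.
            if (!(trm in first)) first[trm] = $0
        }
        END {
            for (trm in count)
                if (count[trm] > 1)
                    print count[trm], first[trm]
        }
    ' "$1" | sort -k2,2   # order the result by TRM
}
```

It could be used as "count_dup_trms /tmp/trms.log > /tmp/duplicates.log". Note that the count is not padded to a fixed width here, so the TRM is best extracted by field rather than by byte offset.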

3) Now it's time to check the duplicates. You can do this by hand, looking up each TRM from duplicates.log in trms.log, or you can first create a third file "/tmp/trmdupls.log" that has the same format as the one from 1) but lists only the duplicate files:

 # cut -b9-44 /tmp/duplicates.log | xargs -I XX grep XX /tmp/trms.log > /tmp/trmdupls.log
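
Since this step rescans trms.log once per duplicated TRM, it can be slow for large logs. A possible one-pass alternative, sketched under the same assumption about the "<TRM> <PATH>" layout, lets awk read the log twice: first to count each TRM, then to print every line whose TRM was seen more than once. The name list_dup_files is illustrative only.

```shell
# Hypothetical helper: print every "<TRM> <PATH>" line whose TRM occurs
# more than once in the given log file, grouped by TRM.
list_dup_files() {
    # The log is passed twice; NR == FNR is true only while awk reads
    # the first copy, which is used purely for counting.
    awk '
        NR == FNR { count[$1]++; next }
        count[$1] > 1
    ' "$1" "$1" | sort
}
```

For example, "list_dup_files /tmp/trms.log > /tmp/trmdupls.log" would build the third file directly from the first, without going through duplicates.log.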

That's it. If you don't like hitting return three times ;) you can simply put it all in one command like this (tee keeps a copy of the full list in /tmp/trms.log, which the final grep needs):

 # find / -regex '.*\.\(ogg\|mp3\|wav\)' -exec trm '{}' ';' -print0 \
       | tr '\n\000' ' \n' \
       | tee /tmp/trms.log \
       | sort \
       | uniq -c -w36 \
       | grep -v -e '^ *1 ' \
       | cut -b9-44 \
       | xargs -I XX grep XX /tmp/trms.log \
     > /tmp/trmdupls.log

Comments

Couldn't the same now be done, without ever touching the server, with a Python script, using mutagen (or libtunepimp?) to get the PUID tags and look for matching PUIDs? It'll find mismatched tracks and duplicates - and, assuming someone can figure out a good way to install libtunepimp on a Windows box, it'd be 100% cross-platform... -- BrianSchweitzer 21:02, 04 June 2007 (UTC)