History:Find Duplicate Music Files

From MusicBrainz Wiki

This HOWTO describes how to use TRMs to search a music collection for duplicate tracks. It will probably only work on Unix-like operating systems (if anyone has a Windows equivalent, please add details to this page). The method uses the "trm" utility included in the TunePimp library. This solution was provided by Björn Krombholz (also known as Fuchs):

  • IIRC this approach is quite close to a denial of service attack on the TRM server, since the server is queried for each generated TRM. So use this sparingly, please --DonRedman


For Linux or other *nixes you can get the libtunepimp package, which normally includes the small "trm" utility. In this case you can simply run a bash one-liner like:

 # find / -regex '.*\(ogg\|mp3\|wav\)' -exec trm '{}' ';' -print0 | tr '\n\000' ' \n' > /tmp/trms.log

It can take hours to complete for large collections, but it will create the file "/tmp/trms.log" of the form:

 <TRM> <ABSOLUTE PATH TO FILE>
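Given the note above about server load, a gentler variant could pause between files. The following is a minimal sketch, not part of the original solution: it assumes that "trm FILE" prints a single TRM line on stdout (as the one-liner implies), the names scan_music, TRM_CMD and DELAY are invented here, and the two-second default is an arbitrary guess rather than a documented limit. It also matches files by extension with -name instead of the -regex used above.

```shell
#!/bin/sh
# Sketch only: assumes "trm FILE" prints one TRM line on stdout.
TRM_CMD=${TRM_CMD:-trm}   # fingerprint command; variable name is made up here
DELAY=${DELAY:-2}         # seconds to pause between files; 2 is an arbitrary guess

# scan_music DIR: print "<TRM> <path>" for each audio file under DIR
scan_music() {
    find "$1" -type f \( -name '*.ogg' -o -name '*.mp3' -o -name '*.wav' \) \
        -exec sh -c '
            cmd=$1; delay=$2; shift 2
            for file do
                id=$("$cmd" "$file") || continue   # skip files that fail
                printf "%s %s\n" "$id" "$file"
                sleep "$delay"                     # pause before the next query
            done' sh "$TRM_CMD" "$DELAY" {} +
}
```

It would be used as "scan_music ~/Music > /tmp/trms.log", which should give the same file format as above, only more slowly.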

You can then sort this file and collapse the lines that share a TRM to find the duplicates, for example by using:

 # sort /tmp/trms.log | uniq -c -W1 | grep -v -e '^ *1 ' > /tmp/duplicates.log

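which collapses each run of lines sharing a TRM into a single line prefixed with a count, and then keeps only the lines whose count is 2 or more. The -W1 option appears intended to compare only the first field and is not available in every version of uniq; with GNU uniq, -w36 compares only the first 36 characters, which is the TRM width that the cut -b9-44 step further down assumes. The output will be of the form: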

 <NUMBER OF OCCURRENCES> <TRM> <PATH OF THE FIRST FILE WITH THIS TRM>

Now it's time to check the duplicates. You can either do this by hand, looking up each TRM from duplicates.log in trms.log, or first create a third file "/tmp/trmdupls.log" that has the same format as trms.log but lists only the duplicate files:

 # cat /tmp/duplicates.log | cut -b9-44 | xargs -iXX -n1 grep XX /tmp/trms.log > /tmp/trmdupls.log
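The counting and look-up steps can also be done in a single pass over the sorted file, which avoids depending on a particular uniq option. This is a minimal sketch under the assumption that every line of trms.log starts with the TRM followed by a space; list_duplicates is a name made up for this example.

```shell
#!/bin/sh
# list_duplicates FILE: print every "<TRM> <path>" line whose TRM occurs
# more than once in FILE (same format as trms.log).
list_duplicates() {
    # LC_ALL=C gives plain byte ordering, so equal TRMs end up adjacent
    LC_ALL=C sort "$1" | awk '
        NF == 0 { next }                      # ignore blank lines
        $1 == prev {
            if (held != "") { print held; held = "" }   # first line of the group
            print
            next
        }
        { held = $0; prev = $1 }              # new TRM: hold until a match shows up
    '
}
```

For example, "list_duplicates /tmp/trms.log > /tmp/trmdupls.log" should produce a file with the same content as the cut/xargs/grep step, with all files for one TRM on adjacent lines.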

That's it. If you don't like hitting return three times ;) you can put it all into one command like this (the tee step still writes /tmp/trms.log, which the final grep needs):

 # find / -regex '.*\(ogg\|mp3\|wav\)' -exec trm '{}' ';' -print0 \
       | tr '\n\000' ' \n' \
       | tee /tmp/trms.log \
       | sort \
       | uniq -c -W1 \
       | grep -v -e '^ *1 ' \
       | cut -b9-44 \
       | xargs -iXX -n1 grep XX /tmp/trms.log \
     > /tmp/trmdupls.log

Comments

Couldn't the same now be done without ever touching the server, with a Python script using mutagen (or libtunepimp?) to get the PUID tags and look for matching PUIDs? It would find mismatched tracks and duplicates, and, assuming someone can figure out a good way to install libtunepimp on a Windows box, it would be 100% cross-platform... -- BrianSchweitzer 21:02, 04 June 2007 (UTC)