Auto-Insert From FreeDB

From MusicBrainz Wiki
Revision as of 14:34, 29 September 2005 by DonRedman (talk | contribs) (Reformatted the page and status notice. This is well archived now (Imported from MoinMoin))

Auto-Importing Entries from FreeDB

Important Note: This is no longer done -- we started getting too many duplicates and the community suggested that we need to switch from a quantity to a quality approach. As a first step towards this, we've decided to stop auto-inserting FreeDB matches. The remainder is provided for historical value only.



This page is about the process whereby MusicBrainz "auto-imported" entries from FreeDB.

Summary

  • A web service request is made (via libmusicbrainz) for the "GetCDInfo" call.
  • The requested discid is looked up in MB; if it is found, the matching album is returned, and the process ends.
  • Otherwise, MB looks up the album on FreeDB. If it is not found in FreeDB, then a "not found" result is returned, and the process ends.


The Current Implementation

(as of 2004-07-28, updated 2004-10-20)

Should the FreeDB Match Be Auto-Inserted?

The following rules are applied in order. Processing stops at the first rule to "match":

  • If there are fewer than 5 tracks, the album will NOT be auto-inserted.
  • If the album's artist is either "Various" or "Various Artists", the album will NOT be auto-inserted (but the determination of the album's artist is itself an inexact science).
  • If at least 70% of tracks contain " - ", or at least 70% of tracks contain " / ", or at least 85% of tracks contain "-", or at least 85% of tracks contain "/", then the album will NOT be auto-inserted.
  • If the album's name or artist fails the "Style Check", the album will NOT be auto-inserted.
  • Otherwise the album WILL be auto-inserted.
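
The rules above can be sketched in a few lines of Python. This is only an illustration, not the code that ran on the server: the function names are made up, the "Style Check" is passed in as a callable because it is not described on this page, and comparing the artist name exactly (case-sensitively) against "Various" and "Various Artists" is an assumption.

```python
from typing import Callable, Sequence

# Artist names that block auto-insertion (exact comparison is an assumption).
VARIOUS_NAMES = {"Various", "Various Artists"}


def at_least_percent(titles: Sequence[str], needle: str, percent: int) -> bool:
    """True if at least `percent`% of the titles contain `needle`."""
    hits = sum(needle in title for title in titles)
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return hits * 100 >= percent * len(titles)


def has_artist_separators(titles: Sequence[str]) -> bool:
    """The separator rule: many titles look like 'ARTIST - TITLE'."""
    return bool(titles) and (
        at_least_percent(titles, " - ", 70)
        or at_least_percent(titles, " / ", 70)
        or at_least_percent(titles, "-", 85)
        or at_least_percent(titles, "/", 85)
    )


def should_auto_insert(
    artist: str,
    album: str,
    titles: Sequence[str],
    style_check: Callable[[str], bool],
) -> bool:
    """Apply the rules in order; the first rule that matches decides."""
    if len(titles) < 5:
        return False
    if artist in VARIOUS_NAMES:
        return False
    if has_artist_separators(titles):
        return False
    # style_check is a hypothetical stand-in for Style::UpperLowercaseCheck.
    if not (style_check(album) and style_check(artist)):
        return False
    return True
```

A caller would pass whatever style check it has, for example `should_auto_insert(artist, album, titles, lambda s: True)` to disable the check entirely.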

The "Style Check"

In the code, see Style::UpperLowercaseCheck.

TODO describe it

The Auto-Insertion

This is in two parts:

  • If the FreeDB album matches an existing album in MB, then the disc ID will be added to that album.
    • To "match", we look for an artist where the name matches (by name or sortname),
    • and where the album name matches exactly (case-insensitive),
    • and where the number of tracks matches.
    • If we find such an album, we attempt to add the disc ID to that album. Any errors encountered while doing so are silently ignored.
  • Otherwise, a new album will be created.
    • All these auto-insertions are done as the "FreeDB" moderator. This is because, since the original web service request which started all this is not authenticated, we have no idea to whom we might otherwise attribute this change.
    • All albums inserted as a result of FreeDB moderations should have exactly one discid (but due to bug #898903 some don't).
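
As a rough sketch of the matching step described above (not the original implementation), the following Python uses a hypothetical in-memory `Album` record in place of the database. Comparing the artist name exactly is an assumption, since the page only says the name matches "by name or sortname"; the album title comparison is case-insensitive as stated.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Album:
    """Hypothetical stand-in for an album row in MB."""
    artist_name: str
    artist_sortname: str
    title: str
    track_count: int
    disc_ids: List[str] = field(default_factory=list)


def find_matching_album(
    albums: Iterable[Album], artist: str, title: str, track_count: int
) -> Optional[Album]:
    """Return the first album matching artist, title and track count."""
    for album in albums:
        # Artist may match on either the name or the sortname.
        artist_ok = artist in (album.artist_name, album.artist_sortname)
        # Album title must match exactly, ignoring case.
        title_ok = album.title.casefold() == title.casefold()
        if artist_ok and title_ok and album.track_count == track_count:
            return album
    return None


def attach_disc_id(
    albums: Iterable[Album], artist: str, title: str, track_count: int, disc_id: str
) -> bool:
    """Add disc_id to a matching album; False means a new album is needed."""
    match = find_matching_album(albums, artist, title, track_count)
    if match is None:
        return False
    if disc_id not in match.disc_ids:
        match.disc_ids.append(disc_id)
    return True
```

When `attach_disc_id` returns False, the caller would fall through to the second part and create a new album as the "FreeDB" moderator.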




A Discussion of the Ideal Approach

(at the time of writing - 2004-07-28 - this section was last updated late 2002)

This page is meant to describe the ideal system that uses FreeDB to insert new entries in MusicBrainz or update the existing entries.

When a query fails to return any content from the MusicBrainz database, a query is sent to FreeDB.org to see if it matches any of their content. If FreeDB has an entry, we can use this info. Please remember that FreeDB stores all content as 'single artist albums': multiple artist albums, albums with 1 track, etc. are all represented the same way.

Currently there is a FreeDB moderator on MusicBrainz that inserts new entries in the system. However, sometimes these entries have errors, or are simply total crap. Here are some problems with the entries and (pending) solutions. Feel free to add your own.

  • In MusicBrainz the album already exists, but is incomplete. Solution: do fuzzy matching with present albums and merge while keeping possible TRM entries.
  • FreeDB does not handle multiple artist CDs properly. Solution: when more than half the tracks on the album contain TRACKNAME - ARTIST separators such as --, /, -, |, or \, the AutoInsertFromFreeDB script should try to detect the format automatically. Using the database it can try to detect what the artist part is. This should ease the pressure on our valuable Moderators a bit.
  • FreeDB has bad entries for artists. Solution: creating a new artist should be difficult, because the likelihood of errors is higher. If an album name and artist name already appear in the database, the new info is probably correct. A FreeDB insert of a new artist and album should, for example, require more votes than a simple merge.

--Johan

Should probably be "should require a higher yes to no ratio than a simple merge." (but I'm sure that was meant) --Cmarqu



I've taken the above suggestions and combined them with the current setup to come up with the complete proposed FreeDB auto insert algorithm:

  1. A client looks up a CD at MusicBrainz and the CD is not found*
  2. The CD is looked up at FreeDB and returned to the user*
  3. The CD is sanity checked:
    a. Must have more than one track* (it has been suggested to raise this to 4 or 5)
    b. The text is sanity checked to make sure it's not all upper or lower case*
    c. The album title must not contain the word 'various'.*
    If the sanity check fails, the possible submission gets tossed.
  4. (Step 3.c must be dropped to implement this) If the title contains the word 'various', 'compilation' or 'soundtrack', or if more than half of the tracks contain a track/artist separator (--, /, |, ()) the album will be treated as a multiple artist album.
    a. Determine an appropriate track/artist separator
    b. Split the track titles into left and right portions
    c. Look up the left and right portions in the artist table to determine which side is the track title and which side is the artist name.
  5. If an album with the same title (exact or fuzzy) and number of tracks exists in MB, the CD Index id is associated with the matching album. Stop here.
  6. If an album with the same title (exact or fuzzy) and fewer tracks exists in MB, the new album is merged into the existing album, preserving existing TRM and CD Index ids and filling out missing tracks. Stop here.
  7. If no matching album exists in MB, insert new album*

Items with a * are currently implemented, even if not available on the main server yet.
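
To make the multiple artist step more concrete (pick a separator, split each title, then look both halves up in the artist table), here is a minimal Python sketch. It is an illustration under stated assumptions, not part of the proposal: the function names are invented, the artist table is modelled as a plain set of names, the candidate separators are matched with surrounding spaces, and the paired "()" form is not handled. When the artist table cannot tell the two sides apart, the sketch gives up rather than guessing.

```python
from typing import List, Optional, Set, Tuple

# Candidate separators; the list and its order are assumptions.
SEPARATORS = [" -- ", " / ", " | ", " - "]

Pair = Tuple[Optional[str], str]  # (artist or None, track title)


def pick_separator(titles: List[str]) -> Optional[str]:
    """Return the separator found in more than half the titles, if any."""
    best, best_hits = None, 0
    for sep in SEPARATORS:
        hits = sum(sep in title for title in titles)
        if hits > best_hits:
            best, best_hits = sep, hits
    return best if best_hits * 2 > len(titles) else None


def split_tracks(titles: List[str], known_artists: Set[str]) -> Optional[List[Pair]]:
    """Return (artist, title) pairs, or None if the format is unclear."""
    sep = pick_separator(titles)
    if sep is None:
        return None
    known = {name.casefold() for name in known_artists}
    halves = []
    for title in titles:
        if sep in title:
            left, right = (part.strip() for part in title.split(sep, 1))
            halves.append((left, right))
        else:
            halves.append((None, title))  # no separator: artist unknown
    # Count how often each side is a known artist name.
    left_hits = sum(1 for l, _ in halves if l is not None and l.casefold() in known)
    right_hits = sum(1 for l, r in halves if l is not None and r.casefold() in known)
    if left_hits == right_hits:
        return None  # cannot tell which side is the artist
    if left_hits > right_hits:
        return halves
    # Artist is on the right: swap the halves of every split title.
    return [(r, l) if l is not None else (None, r) for l, r in halves]
```

Returning None in the ambiguous cases leaves the decision to a human moderator, which seems in keeping with the concern about creating bad artists automatically.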

Johan: I don't think that requiring a higher yes to no ratio is going to work here, since most votes go down pretty unanimously anyway. It would just make work for MORE people to have to look at the moderation in order for it to pass. For single artist albums I don't think we need to change anything. However, currently we do not address the multiple artist albums at all, since they can potentially create dozens of new (and quite possibly crappy) artists. I'm not even sure we should do this -- I don't think that the FreeDB data is good enough to do this. :-( --ruaok



Ruaok: Great algorithm, should give us good content. The multiple artist album insert could indeed create crappy artists. It would be great if every user could simply edit the insert proposed by the FreeDB moderator. If it were shown for every artist whether it was new or already present, that would be valuable info for the moderators I think. But how do we handle the Insert edit permission if people have already voted yes/no? --Johan



Ruaok: I don't think albums with Disc IDs should be extended by further tracks. (your point 6) This would lead to Disc IDs associated to albums with more tracks than expected.

Johan: I think this can be handled by the artist aliases / artist merging features. Of course popular artists will get lots of crappy aliases.

(fixed typo in 3c)

--DeKarl



DeKarl: Yeah, I'll need to be careful to make sure that no extra tracks are added. The main point is to 'fill out' any missing tracks, and using the CD Index ID as guide for the 'correct' number of tracks. --ruaok



An addition is to set the album type while inserting.

  • If "Soundtrack" or "O.S.T." is in the album name it is a soundtrack.
  • If "Live" or "Unplugged" is in the album name it is a live recording.
  • If most tracks start with the same words it is a maxi (a single with the radio/album version and possibly lots of remixes)
  • If the album name is the same as a track's name it might be a single (how many tracks does a single have? 3-6?)

--DeKarl
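
The album type heuristics listed above could be sketched like this in Python. Everything here is illustrative: the function name and the returned labels are made up, "Live" and "Unplugged" are matched as whole words (an assumption, so that a title like "Alive" is not flagged), "most tracks" is read as more than half, and "the same words" is read as the first two words of each title.

```python
import re
from collections import Counter
from typing import List, Optional


def guess_album_type(album: str, titles: List[str], prefix_words: int = 2) -> Optional[str]:
    """Guess an album type from its name and track titles, or None."""
    name = album.casefold()
    if "soundtrack" in name or "o.s.t." in name:
        return "soundtrack"
    words = re.findall(r"\w+", name)
    if "live" in words or "unplugged" in words:
        return "live"
    # Maxi: more than half of the tracks share the same leading words.
    if len(titles) > 1:
        prefixes = Counter(
            " ".join(re.findall(r"\w+", title.casefold())[:prefix_words])
            for title in titles
        )
        prefix, count = prefixes.most_common(1)[0]
        if prefix and count * 2 > len(titles):
            return "maxi"
    # Single: the album is named after one of its tracks.
    if any(title.casefold() == name for title in titles):
        return "single"
    return None
```

The checks run in the order the heuristics are listed, so a release that looks like both a maxi and a single is reported as a maxi. The "single" guess is weak, since many regular albums have a title track; a track-count limit such as the 3-6 suggested above would tighten it.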