Auto-Insert From FreeDB

From MusicBrainz Wiki

Revision as of 08:45, 15 March 2009

Auto-Importing Entries from FreeDB

Important Note: This is no longer done -- we started getting too many duplicates and the community suggested that we need to switch from a quantity to a quality approach. As a first step towards this, we've decided to stop auto-inserting FreeDB matches. The remainder is provided for historical value only.



This page is about the process whereby MusicBrainz "auto-imported" entries from FreeDB.

Summary

  • A web service request is made (via libmusicbrainz) for the "GetCDInfo" call.
  • The requested discid is looked up in MB; if it is found, the matching release is returned, and the process ends.
  • Otherwise, MB looks up the release on FreeDB. If it is not found in FreeDB, then a "not found" result is returned, and the process ends.
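
The lookup order above can be sketched roughly as follows. This is only an illustration: the function name `get_cd_info`, the dictionary-based stores and the shape of the result are assumptions made for the example, not the actual server code.

```python
def get_cd_info(discid, mb_releases, freedb_entries):
    """Resolve a disc ID in the order the summary describes (hypothetical helper).

    mb_releases and freedb_entries are plain dicts keyed by disc ID; they
    stand in for the real MusicBrainz database and the FreeDB service.
    """
    # Step 1: a hit in MusicBrainz ends the process immediately.
    release = mb_releases.get(discid)
    if release is not None:
        return {"source": "musicbrainz", "release": release}

    # Step 2: otherwise fall back to FreeDB.
    entry = freedb_entries.get(discid)
    if entry is None:
        # Neither source knows the disc: report "not found".
        return {"source": None, "release": None}

    return {"source": "freedb", "release": entry}
```

A FreeDB hit returned this way is what the rest of this page is about: whether it should also be auto-inserted into MusicBrainz.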


The Current Implementation

(as of 2004-07-28, updated 2004-10-20)

Should the FreeDB Match Be Auto-Inserted?

The following rules are applied in order. Processing stops at the first rule to "match":

  • If there are fewer than 5 tracks, the release will NOT be auto-inserted.
  • If the release's artist is either "Various" or "Various Artists", the release will NOT be auto-inserted (but the determination of the release's artist is itself an inexact science).
  • If at least 70% of tracks contain " - ", or at least 70% of tracks contain " / ", or at least 85% of tracks contain "-", or at least 85% of tracks contain "/", then the release will NOT be auto-inserted.
  • If the release's name or artist fails the "Style Check", the release will NOT be auto-inserted.
  • Otherwise the release WILL be auto-inserted.
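
As a rough sketch, the rule chain above could look like this in Python. The function names are made up for the example, the "Various" comparison is assumed to be an exact string match, and the style check is passed in as a callable (`style_ok`) because its behaviour is not described on this page.

```python
def separator_heavy(tracks):
    """True if the track titles hit any of the separator thresholds."""
    n = len(tracks)
    if n == 0:
        return False

    def at_least(sep, percent):
        # Integer arithmetic avoids floating point edge cases at the threshold.
        count = sum(1 for t in tracks if sep in t)
        return count * 100 >= percent * n

    return (at_least(" - ", 70) or at_least(" / ", 70)
            or at_least("-", 85) or at_least("/", 85))


def should_auto_insert(artist, title, tracks, style_ok):
    """Apply the rules in order; the first rule that matches decides."""
    if len(tracks) < 5:
        return False
    if artist in ("Various", "Various Artists"):  # exact match is an assumption
        return False
    if separator_heavy(tracks):
        return False
    # style_ok is a placeholder for Style::UpperLowercaseCheck.
    if not (style_ok(title) and style_ok(artist)):
        return False
    return True
```

For example, `should_auto_insert("Artist", "Title", ["One", "Two", "Three", "Four", "Five"], lambda s: True)` passes every rule, while the same call with only four tracks is rejected by the first rule.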

The "Style Check"

In the code, see Style::UpperLowercaseCheck.

TODO describe it

The Auto-Insertion

This is in two parts:

  • If the FreeDB release matches an existing release in MB, then the disc ID will be added to that release.
    • To "match", we look for an artist where the name matches (by name or sortname),
    • and where the release name matches exactly (case-insensitive),
    • and where the number of tracks matches.
    • If we find such a release, we attempt to add the disc ID to that release. Any errors encountered while doing so are silently ignored.
  • Otherwise, a new release will be created, as the "FreeDB" moderator.
    • All these auto-insertions are done as the "FreeDB" moderator. This is because, since the original web service request which started all this is not authenticated, we have no idea to whom we might otherwise attribute this change.
    • All releases inserted as a result of FreeDB moderations should have exactly one discid (but due to bug #898903 some don't).
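
The two-part insertion could be sketched as below. The `Release` class, the in-memory list of releases and the function names are all hypothetical; the artist comparison is assumed to be an exact string match, since only the release name is stated to be case-insensitive.

```python
from dataclasses import dataclass, field


@dataclass
class Release:
    artist_name: str
    artist_sortname: str
    title: str
    track_count: int
    discids: set = field(default_factory=set)


def find_matching_release(releases, artist, title, track_count):
    """Return the first release matching on artist, title and track count."""
    for r in releases:
        if artist not in (r.artist_name, r.artist_sortname):
            continue  # artist must match by name or sortname
        if r.title.lower() != title.lower():
            continue  # release name must match, ignoring case
        if r.track_count != track_count:
            continue  # number of tracks must match
        return r
    return None


def auto_insert(releases, discid, artist, title, track_count):
    """Attach the disc ID to a matching release, or create a new release."""
    match = find_matching_release(releases, artist, title, track_count)
    if match is not None:
        try:
            match.discids.add(discid)
        except Exception:
            pass  # errors while adding the disc ID are silently ignored
        return match
    # No match: create a new release with exactly one disc ID.
    # Reusing the name as the sortname is a placeholder for this sketch.
    new_release = Release(artist, artist, title, track_count, {discid})
    releases.append(new_release)
    return new_release
```

In the real system the new release would be entered as a moderation by the "FreeDB" moderator rather than appended to a list.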




A Discussion of the Ideal Approach

(at the time of writing - 2004-07-28 - this section was last updated late 2002)

This page is meant to describe the ideal system that uses FreeDB to insert new entries in MusicBrainz or update the existing entries.

When a query returns no content from the MusicBrainz database, a query is sent to FreeDB.org to see whether it has a matching entry. If FreeDB has an entry, we can use this info. Please remember that FreeDB stores all content as 'single artist releases' only: multiple artist releases, releases with 1 track, etc. are all treated the same.

Currently there is a FreeDB moderator on MusicBrainz that inserts new entries in the system. However, sometimes these entries have errors, or are simply total crap. Here are some problems with the entries and (pending) solutions. Feel free to add your own.

  • In MusicBrainz the release already exists, but is incomplete. Solution: do fuzzy matching with present releases and merge while keeping possible TRM entries.
  • FreeDB does not handle multiple artist CDs properly. Solution: when more than half the tracks on the release contain TRACKNAME - ARTIST separators such as --, /, -, |, or \, the AutoInsertFromFreeDB script should try to detect the format automatically. Using the database it can try to detect which part is the artist. This should ease the pressure on our valuable Moderators a bit.
  • FreeDB has bad entries for artists. Solution: creating a new artist should be difficult, because the likelihood of errors is higher. If a release name and artist name already appear in the database, the new info is probably correct. A FreeDB insert of a new artist and release should, for example, require more votes than a simple merge.

--Johan

Should probably be "should require a higher yes to no ratio than a simple merge." (but I'm sure that was meant) --Cmarqu



I've taken the above suggestions and combined them with the current setup to come up with the complete proposed FreeDB auto insert algorithm:

  1. A client looks up a CD at MusicBrainz and the CD is not found*
  2. The CD is looked up at FreeDB and returned to the user*
  3. The CD is sanity checked:
    a. Must have more than one track* (it has been suggested to raise this to 4 or 5)
    b. The text is sanity checked to make sure it's not all upper or lower case*
    c. The release title must not contain the word 'various'.*
    If the sanity check fails, the possible submission gets tossed.
  4. (Step 3.c must be dropped to implement this) If the title contains the word 'various', 'compilation' or 'soundtrack', or if more than half of the tracks contain a track/artist separator (--, /, |, ()), the release will be treated as a multiple artist release.
    a. Determine an appropriate track/artist separator
    b. Split the track titles into left and right portions
    c. Look up the left and right portions in the artist table to determine which side is the track title and which side is the artist name.
  5. If a release with the same title (exact or fuzzy) and number of tracks exists in MB, the CD Index id is associated with the matching release. Stop here.
  6. If a release with the same title (exact or fuzzy) and fewer tracks exists in MB, the new release is merged into the existing release, preserving existing TRM and CD Index ids and filling out missing tracks. Stop here.
  7. If no matching release exists in MB, insert new release*

Items with a * are currently implemented, even if not available on the main server yet.
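
The multiple artist handling in the proposal (pick a separator, split each title, then use the artist table to decide which side is the artist) might be sketched like this. Everything here is an assumption for illustration: the function names, the default separator list (the "()" case is left out because it is a pair of brackets rather than an infix separator), the tie-break towards "artist on the left", and the use of a plain set of names in place of the artist table.

```python
def pick_separator(tracks, separators=("--", "/", "|")):
    """Return the separator found in more than half of the tracks, or None."""
    best, best_count = None, 0
    for sep in separators:
        count = sum(1 for t in tracks if sep in t)
        if count > best_count:
            best, best_count = sep, count
    return best if best_count * 2 > len(tracks) else None


def split_tracks(tracks, sep, known_artists):
    """Split titles on sep and return (artist, title) pairs.

    known_artists stands in for the artist table. Tracks that do not
    contain the separator are returned with artist None.
    """
    known = {a.lower() for a in known_artists}
    parts = [[p.strip() for p in t.split(sep, 1)] if sep in t else None
             for t in tracks]
    # Count how often each side is a known artist name.
    left_hits = sum(1 for p in parts if p and p[0].lower() in known)
    right_hits = sum(1 for p in parts if p and p[1].lower() in known)
    artist_on_left = left_hits >= right_hits  # tie-break is an assumption

    result = []
    for track, p in zip(tracks, parts):
        if p is None:
            result.append((None, track))
        elif artist_on_left:
            result.append((p[0], p[1]))
        else:
            result.append((p[1], p[0]))
    return result
```

With the tracks "Help! / The Beatles", "Angie / The Rolling Stones" and "Intro", `pick_separator` chooses "/" and `split_tracks` puts the artist on the right-hand side, provided both artists are already known.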

Johan: I don't think that requiring a higher yes to no ratio is going to work here, since most votes go down pretty unanimously anyway. It would just make work for MORE people to have to look at the moderation in order for it to pass. For single artist releases I don't think we need to change anything. However, currently we do not address the multiple artist releases at all, since they can potentially create dozens of new (and quite possibly crappy) artists. I'm not even sure we should do this -- I don't think that the FreeDB data is good enough to do this. :-( --Ruaok



Ruaok: Great algorithm, should give us good content. The multiple artist release insert could indeed create crappy artists. It would be great if every user could simply edit the insert proposed by the FreeDB moderator. If for every artist it was shown whether it was new or already present, that would be valuable info for the moderators I think. But how do we handle the Insert edit permission if people have already voted yes/no? --Johan



Ruaok: I don't think releases with Disc IDs should be extended by further tracks. (your point 6) This would lead to Disc IDs associated to releases with more tracks than expected.

Johan: I think this can be handled by the artist aliases / artist merging features. Of course popular artists will get lots of crappy aliases.

(fixed typo in 3c)

--DeKarl



DeKarl: Yeah, I'll need to be careful to make sure that no extra tracks are added. The main point is to 'fill out' any missing tracks, using the CD Index ID as a guide for the 'correct' number of tracks. --Ruaok



An addition is to set the release type while inserting.

  • If "Soundtrack" or "O.S.T." is in the release name it is a soundtrack.
  • If "Live" or "Unplugged" is in the release name it is a live recording.
  • If most tracks start with the same words it is a maxi (a single with the radio/release version and possibly lots of remixes)
  • If the release name is the same as a track's name it might be a single (how many tracks does a single have? 3-6?)

--DeKarl
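
A possible reading of the release type suggestion, as a sketch only. The function name and return values are invented for the example. It also assumes that matching is case-insensitive, that "Live" and "Unplugged" should match as whole words (so that a title like "Alive" is not flagged), that "start with the same words" can be approximated by comparing only the first word, and that the rules are tried in the order listed.

```python
import re
from collections import Counter


def guess_release_type(title, tracks):
    """Guess a release type from the title and track names, or return None."""
    lowered = title.lower()
    if "soundtrack" in lowered or "o.s.t." in lowered:
        return "soundtrack"
    if re.search(r"\b(live|unplugged)\b", lowered):
        return "live"
    # Approximate "most tracks start with the same words" by the first word.
    first_words = Counter(t.split()[0].lower() for t in tracks if t.split())
    if first_words and first_words.most_common(1)[0][1] * 2 > len(tracks):
        return "maxi"
    if any(t.lower() == lowered for t in tracks):
        return "single"
    return None
```

The track-count question raised for singles (3-6 tracks?) is left open here, so the sketch does not check it.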