Auto-Insert From FreeDB: Difference between revisions

=Auto-Importing Entries from FreeDB=

[[Image:Attention.png]] '''Important Note:''' ''This is no longer done -- we started getting too many duplicates and the community suggested that we need to switch from a quantity to a quality approach. As a first step towards this, we've decided to stop auto-inserting [[FreeDB]] matches. The remainder is provided for historical value only.''

----


This page is about the process whereby [[MusicBrainz]] "auto-imported" entries from [[FreeDB]].

==Summary==

* A web service request is made (via libmusicbrainz) for the "GetCDInfo" call.
* The requested discid is looked up in MB; if it is found, the matching album is returned, and the process ends.
* Otherwise, MB looks up the album on [[FreeDB]]. If it is not found in [[FreeDB]], then a "not found" result is returned, and the process ends.
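The lookup order above can be sketched as follows. This is only an illustration: the function name <code>get_cd_info</code>, the two dictionaries standing in for the MB and [[FreeDB]] databases, and the returned tuples are all assumptions, not the actual server code.

```python
from typing import Dict, Optional, Tuple


def get_cd_info(
    discid: str,
    mb_albums: Dict[str, str],
    freedb_albums: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Look a disc ID up in MB first, then fall back to FreeDB.

    Both dictionaries are hypothetical stand-ins mapping disc ID -> album.
    """
    album = mb_albums.get(discid)
    if album is not None:
        # Found in MB: return the match and stop.
        return ("musicbrainz", album)

    entry = freedb_albums.get(discid)
    if entry is None:
        # Not in MB and not in FreeDB.
        return ("not found", None)

    # Found only in FreeDB; the auto-insert rules would apply from here.
    return ("freedb", entry)
```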

----


==The Current Implementation==

(as of 2004-07-28, updated 2004-10-20)

===Should the FreeDB Match Be Auto-Inserted?===

The following rules are applied in order. Processing stops at the first rule to "match":
* If there are fewer than 5 tracks, the album will NOT be auto-inserted.
* If the album's artist is either "Various" or "Various Artists", the album will NOT be auto-inserted (but the determination of the album's artist is itself an inexact science).
* If at least 70% of tracks contain " - ", or at least 70% of tracks contain " / ", or at least 85% of tracks contain "-", or at least 85% of tracks contain "/", then the album will NOT be auto-inserted.
* If the album's name or artist fails the "Style Check", the album will NOT be auto-inserted.
* Otherwise the album WILL be auto-inserted.
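A minimal sketch of the rules above, applied in the same order. The function names are hypothetical, the exact (case-sensitive) comparison of the artist name is an assumption, and <code>simple_style_check</code> is only a stand-in for the real style check, which is not described on this page.

```python
from typing import Callable, Sequence


def share_at_least(titles: Sequence[str], needle: str, percent: int) -> bool:
    """True if at least `percent` % of the titles contain `needle`."""
    hits = sum(needle in title for title in titles)
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return bool(titles) and hits * 100 >= percent * len(titles)


def simple_style_check(text: str) -> bool:
    """Hypothetical stand-in: reject text that is all upper or all lower case."""
    return not (text.isupper() or text.islower())


def should_auto_insert(
    artist: str,
    album: str,
    track_titles: Sequence[str],
    style_check: Callable[[str], bool] = simple_style_check,
) -> bool:
    """Apply the rules in order; the first rule that matches decides."""
    if len(track_titles) < 5:
        return False
    # Assumption: the artist name is compared exactly, case-sensitively.
    if artist in ("Various", "Various Artists"):
        return False
    # Track titles that look like "artist - title" suggest a compilation.
    if (
        share_at_least(track_titles, " - ", 70)
        or share_at_least(track_titles, " / ", 70)
        or share_at_least(track_titles, "-", 85)
        or share_at_least(track_titles, "/", 85)
    ):
        return False
    if not (style_check(album) and style_check(artist)):
        return False
    return True
```

Passing the style check in as a parameter keeps the rule order testable without committing to what the real check does.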

===The "Style Check"===

In the code, see Style::[[Upper Lowercase Check|UpperLowercaseCheck]].

''TODO describe it''

===The Auto-Insertion===

This is in two parts:
* If the [[FreeDB]] album matches an existing album in MB, then the disc ID will be added to that album.
** To "match", we look for an artist where the name matches (by name or sortname),
** and where the album name matches exactly (case-insensitive),
** and where the number of tracks matches.
** If we find such an album, we attempt to add the disc ID to that album. Any errors encountered while doing so are silently ignored.

* Otherwise, a new album is created.
** All these auto-insertions are done as the "FreeDB" moderator. The original web service request that started the process is not authenticated, so there is no one else to whom the change could be attributed.
** All albums inserted as a result of [[FreeDB]] moderations ''should'' have exactly one discid (but due to [http://sourceforge.net/tracker/index.php?func=detail&aid=898903&group_id=19506&atid=119506 bug #898903] some don't).
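The two parts can be sketched as below. The <code>Artist</code> and <code>Album</code> classes, the in-memory list of albums, and the <code>inserted_by</code> field are hypothetical; comparing artist names exactly and using the artist name as the sortname of a new artist are also assumptions, since the page does not say.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Artist:
    name: str
    sortname: str


@dataclass
class Album:
    artist: Artist
    name: str
    track_count: int
    discids: Set[str] = field(default_factory=set)
    inserted_by: Optional[str] = None


def find_matching_album(
    albums: List[Album], artist_name: str, album_name: str, track_count: int
) -> Optional[Album]:
    """Return the first album matching on artist, album name and track count."""
    for album in albums:
        # Artist matches by name or by sortname (exact comparison assumed).
        if artist_name not in (album.artist.name, album.artist.sortname):
            continue
        # Album name must match exactly, ignoring case.
        if album.name.casefold() != album_name.casefold():
            continue
        if album.track_count != track_count:
            continue
        return album
    return None


def attach_or_create(
    albums: List[Album],
    artist_name: str,
    album_name: str,
    track_count: int,
    discid: str,
) -> Album:
    """Add the disc ID to a matching album, or create a new album."""
    match = find_matching_album(albums, artist_name, album_name, track_count)
    if match is not None:
        # In this sketch adding to a set cannot fail; the page notes that
        # the real code silently ignored errors at this step.
        match.discids.add(discid)
        return match
    new_album = Album(
        artist=Artist(artist_name, artist_name),
        name=album_name,
        track_count=track_count,
        discids={discid},
        inserted_by="FreeDB",  # attributed to the "FreeDB" moderator
    )
    albums.append(new_album)
    return new_album
```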

----



==A Discussion of the Ideal Approach==

''(at the time of writing - 2004-07-28 - this section was last updated late 2002)''

This page is meant to describe the ideal system that uses [[FreeDB]] to insert new entries in [[MusicBrainz]] or update the existing entries.

When a query fails to return any content from the [[MusicBrainz]] database, a query is sent to [[FreeDB]].org to see if it matches any of their content. If [[FreeDB]] has an entry, we can use this info. Please remember that [[FreeDB]] treats all content as 'single artist albums': multiple artist albums, albums with 1 track, etc. are all stored the same way.

Currently there is a [[FreeDB]] moderator on [[MusicBrainz]] that inserts new entries in the system. However, sometimes these entries have errors, or are simply total crap. Here are some problems with the entries and (pending) solutions. Feel free to add your own.
* In [[MusicBrainz]] the album already exists, but is incomplete. Solution: do fuzzy matching with present albums and merge while keeping possible [[TRM]] entries.
* FreeDB does not handle multiple artist CDs properly. Solution: when more than half the tracks on the album contain TRACKNAME - ARTIST separators such as --, /, -, |, or \, the AutoInsertFromFreeDB script should try to detect the format automatically. Using the database, it can try to detect which part is the artist. This should ease the pressure on our valuable [[Moderators]] a bit.
* FreeDB has bad entries for artists. Solution: creating a new artist should be difficult, because the likelihood of errors is higher. If an album name and artist name already appear in the database, the new info is probably correct. A [[FreeDB]] insert of a new artist and album should, for example, require more votes than a simple merge.

--[[Johan]]

Should probably be "should require a higher yes to no ratio than a simple merge." (but I'm sure that was meant) --[[Cmarqu]]

----


I've taken the above suggestions and combined them with the current setup to come up with the complete proposed [[FreeDB]] auto insert algorithm:
# A client looks up a CD at [[MusicBrainz]] and the CD is not found*
# The CD is looked up at [[FreeDB]] and returned to the user*
# The CD is sanity checked:
## Must have more than one track* (it has been suggested to raise this to 4 or 5)
## The text is sanity checked to make sure it's not all upper or lower case*
## The album title must not contain the word 'various'.* If the sanity check fails, the possible submission gets tossed.
# (Step 3.c must be dropped to implement this) If the title contains the word 'various', 'compilation' or 'soundtrack', or if more than half of the tracks contain a track/artist separator (--, /, |, ()), the album will be treated as a multiple artist album.
## Determine an appropriate track/artist separator
## Split the track titles into left and right portions
## Look up the left and right portions in the artist table to determine which side is the track title and which side is the artist name.
# If an album with the same title (exact or fuzzy) and number of tracks exists in MB, the CD Index id is associated with the matching album. Stop here.
# If an album with the same title (exact or fuzzy) and fewer tracks exists in MB, the new album is merged into the existing album, preserving existing TRM and CD Index ids and filling out missing tracks. Stop here.
# If no matching album exists in MB, insert new album*

Items with a * are currently implemented, even if not available on the main server yet.
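The multiple artist steps (choose a separator, split each title, then use known artist names to decide which side is the artist) could look roughly like this. It is a sketch only: the function names are hypothetical, a plain set of lowercase names stands in for the artist table, and the parenthesised form "()" is left out because it is not a simple infix separator.

```python
from typing import List, Optional, Sequence, Set, Tuple

# Infix separators taken from the proposal.
SEPARATORS = ("--", "/", "|")


def pick_separator(titles: Sequence[str]) -> Optional[str]:
    """Return the most common separator if more than half the titles use it."""
    counts = {sep: sum(sep in title for title in titles) for sep in SEPARATORS}
    best = max(SEPARATORS, key=lambda sep: counts[sep])
    return best if counts[best] * 2 > len(titles) else None


def split_tracks(
    titles: Sequence[str], known_artists: Set[str]
) -> Optional[List[Tuple[Optional[str], str]]]:
    """Split titles into (artist, track) pairs, or None if undecidable.

    `known_artists` is a hypothetical stand-in for the artist table and is
    expected to hold lowercase names.
    """
    sep = pick_separator(titles)
    if sep is None:
        return None

    halves = []
    for title in titles:
        if sep in title:
            left, right = (part.strip() for part in title.split(sep, 1))
            halves.append((left, right))
        else:
            halves.append(None)  # title has no separator

    left_hits = sum(h is not None and h[0].lower() in known_artists for h in halves)
    right_hits = sum(h is not None and h[1].lower() in known_artists for h in halves)
    if left_hits == right_hits:
        return None  # the artist table does not favour either side

    artist_on_left = left_hits > right_hits
    result: List[Tuple[Optional[str], str]] = []
    for title, half in zip(titles, halves):
        if half is None:
            result.append((None, title))
        elif artist_on_left:
            result.append((half[0], half[1]))
        else:
            result.append((half[1], half[0]))
    return result
```

Returning <code>None</code> on a tie is a deliberate choice in this sketch: with no evidence from the artist table, guessing a side would create exactly the bad artists the discussion is worried about.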

Johan: I don't think that requiring a higher yes to no ratio is going to work here, since most votes go down pretty unanimously anyway. It would just make work for MORE people to have to look at the moderation in order for it to pass. For single artist albums I don't think we need to change anything. However, currently we do not address the multiple artist albums at all, since they can potentially create dozens of new (and quite possibly crappy) artists. I'm not even sure we should do this -- I don't think that the [[FreeDB]] data is good enough to do this. :-( --[[ruaok]]

----


Ruaok: Great algorithm, should give us good content. The multiple artist album insert could indeed create crappy artists. It would be great if every user could simply edit the insert proposed by the [[FreeDB]] moderator. If it was shown for every artist whether it was new or already present, that would be valuable info for the moderators I think. But how do we handle the Insert edit permission if people have already voted yes/no? --[[Johan]]

----


Ruaok: I don't think albums with Disc IDs should be extended by further tracks. (your point 6) This would lead to Disc IDs associated to albums with more tracks than expected.

Johan: I think this can be handled by the artist aliases / artist merging features. Of course popular artists will get lots of crappy aliases.

(fixed typo in 3c)

--[[User:DeKarl|DeKarl]]

----


[[User:DeKarl|DeKarl]]: Yeah, I'll need to be careful to make sure that no extra tracks are added. The main point is to 'fill out' any missing tracks, using the CD Index ID as a guide for the 'correct' number of tracks. --[[ruaok]]

----


An addition is to set the album type while inserting.
* If "Soundtrack" or "O.S.T." is in the album name it is a soundtrack.
* If "Live" or "Unplugged" is in the album name it is a live recording.
* If most tracks start with the same words it is a maxi (a single with the radio/album version and possibly lots of remixes)
* If the album name is the same as a track's name it might be a single (how many tracks does a single have? 3-6?)

--[[User:DeKarl|DeKarl]]
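The album type suggestions above could be tried out with something like the following. This is a rough sketch under several assumptions: the function name and the returned labels are hypothetical, the rules are applied in the order listed, "Live" and "Unplugged" are matched as whole words (so that "Alive" does not count), and "start with the same words" is simplified to "share the same first word".

```python
import re
from collections import Counter
from typing import Optional, Sequence


def guess_album_type(album_name: str, track_titles: Sequence[str]) -> Optional[str]:
    """Guess an album type from its name and track titles, or return None."""
    name = album_name.lower()

    if "soundtrack" in name or "o.s.t." in name:
        return "soundtrack"

    # Whole-word match is an assumption, to avoid hits inside other words.
    if re.search(r"\b(live|unplugged)\b", name):
        return "live"

    # Simplification: "same words" is reduced to "same first word".
    first_words = Counter(t.split()[0].lower() for t in track_titles if t.split())
    if first_words:
        _, count = first_words.most_common(1)[0]
        if len(track_titles) > 1 and count * 2 > len(track_titles):
            return "maxi"

    if any(t.strip().lower() == name.strip() for t in track_titles):
        return "single"

    return None
```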

[[Category:To Be Reviewed]]

Latest revision as of 03:15, 29 March 2010
