Auto-Insert From FreeDB: Difference between revisions

From MusicBrainz Wiki
m (4 revision(s))
m (Stale, historic, discussions that don't affect the present)
 
#REDIRECT [[FreeDB]]
=Auto-Importing Entries from FreeDB=

[[Image:Attention.png]] '''Important Note:''' ''This is no longer done -- we started getting too many duplicates and the community suggested that we need to switch from a quantity to a quality approach. As a first step towards this, we've decided to stop auto-inserting [[FreeDB]] matches. The remainder is provided for historical value only.''

----


This page is about the process whereby [[MusicBrainz]] "auto-imported" entries from [[FreeDB]].

==Summary==

* A web service request is made (via [[libmusicbrainz]]) for the "GetCDInfo" call.
* The requested discid is looked up in MB; if it is found, the matching release is returned, and the process ends.
* Otherwise, MB looks up the release on [[FreeDB]]. If it is not found in [[FreeDB]], then a "not found" result is returned, and the process ends.

----


===The Current Implementation===

(as of 2004-07-28, updated 2004-10-20)

===Should the FreeDB Match Be Auto-Inserted?===

The following rules are applied in order. Processing stops at the first rule to "match":
* If there are fewer than 5 tracks, the release will NOT be auto-inserted.
* If the release's artist is either "Various" or "Various Artists", the release will NOT be auto-inserted (but the determination of the release's artist is itself an inexact science).
* If at least 70% of tracks contain " - ", or at least 70% of tracks contain " / ", or at least 85% of tracks contain "-", or at least 85% of tracks contain "/", then the release will NOT be auto-inserted.
* If the release's name or artist fail the "Style Check", the release will NOT be auto-inserted.
* Otherwise the release WILL be auto-inserted.
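
The rules above can be sketched roughly as follows. This is only an illustration in Python, not the server's actual code: the function names are made up for this page, the artist comparison is assumed to be an exact string match, and <code>passes_style_check</code> is a stand-in that merely rejects text that is entirely upper or lower case, since the real "Style Check" is not described here.

```python
def fraction_containing(titles, needle):
    """Return the fraction of track titles that contain `needle`."""
    if not titles:
        return 0.0
    return sum(needle in title for title in titles) / len(titles)


def passes_style_check(text):
    # ASSUMPTION: stand-in for the undocumented "Style Check".
    # Rejects text whose letters are all upper case or all lower case.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True
    all_upper = all(c.isupper() for c in letters)
    all_lower = all(c.islower() for c in letters)
    return not (all_upper or all_lower)


def should_auto_insert(artist, release_name, track_titles):
    """Apply the rules in order; the first rule that matches decides."""
    # Rule 1: fewer than 5 tracks.
    if len(track_titles) < 5:
        return False
    # Rule 2: artist looks like a various-artists placeholder
    # (exact comparison is an assumption).
    if artist in ("Various", "Various Artists"):
        return False
    # Rule 3: too many titles look like "artist - title" pairs.
    if (fraction_containing(track_titles, " - ") >= 0.70
            or fraction_containing(track_titles, " / ") >= 0.70
            or fraction_containing(track_titles, "-") >= 0.85
            or fraction_containing(track_titles, "/") >= 0.85):
        return False
    # Rule 4: release name or artist fails the style check.
    if not (passes_style_check(release_name) and passes_style_check(artist)):
        return False
    # Rule 5: otherwise insert.
    return True
```

For example, a five-track release by a named artist with ordinary mixed-case titles would pass, while the same release with four of its five titles written as "Artist - Title" would be rejected by the third rule.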

===The "Style Check"===

In the code, see Style::[[Upper Lowercase Check|UpperLowercaseCheck]].

''TODO describe it''

===The Auto-Insertion===

This is in two parts:
* If the [[FreeDB]] release matches an existing release in MB, then the disc ID will be added to that release.
** To "match", we look for an artist where the name matches (by name or sortname),
** and where the release name matches exactly (case-insensitive),
** and where the number of tracks matches.
** If we find such a release, we attempt to add the disc ID to that release. Any errors encountered while doing so are silently ignored.

* Otherwise, a new release will be created as the "FreeDB" moderator.
** All these auto-insertions are done as the "FreeDB" moderator. This is because, since the original web service request which started all this is not authenticated, we have no idea to whom we might otherwise attribute this change.
** All releases inserted as a result of [[FreeDB]] moderations ''should'' have exactly one discid (but due to [http://sourceforge.net/tracker/index.php?func=detail&aid=898903&group_id=19506&atid=119506 bug #898903] some don't).
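
A minimal sketch of the two-part insertion, using an in-memory list in place of the database. The <code>Artist</code> and <code>Release</code> classes, their field names, and the exact-string artist comparison are all assumptions made for illustration; the real code may compare names differently, and it ignores errors when attaching the disc ID, which cannot happen in this toy model.

```python
from dataclasses import dataclass, field


@dataclass
class Artist:
    name: str
    sortname: str


@dataclass
class Release:
    artist: Artist
    name: str
    track_count: int
    disc_ids: set = field(default_factory=set)
    inserted_by: str = ""  # hypothetical field recording the moderator


def find_matching_release(releases, artist_name, release_name, track_count):
    """Return the first release matching artist, name and track count."""
    for release in releases:
        # Artist matches by name or sortname (exact match is an assumption).
        artist_ok = artist_name in (release.artist.name, release.artist.sortname)
        # Release name matches exactly, ignoring case.
        name_ok = release.name.lower() == release_name.lower()
        if artist_ok and name_ok and release.track_count == track_count:
            return release
    return None


def auto_insert(releases, artist_name, release_name, track_count, disc_id):
    """Attach the disc ID to a matching release, or create a new release."""
    match = find_matching_release(releases, artist_name, release_name, track_count)
    if match is not None:
        match.disc_ids.add(disc_id)
        return match
    # No match: create a new release attributed to the "FreeDB" moderator.
    new_release = Release(
        artist=Artist(artist_name, artist_name),
        name=release_name,
        track_count=track_count,
        disc_ids={disc_id},
        inserted_by="FreeDB",
    )
    releases.append(new_release)
    return new_release
```

In this sketch a newly created release always starts with exactly one disc ID, which is the state the last point above says inserted releases ''should'' be in.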

----

----


==A Discussion of the Ideal Approach==

''(at the time of writing - 2004-07-28 - this section was last updated late 2002)''

This page is meant to describe the ideal system that uses [[FreeDB]] to insert new entries in [[MusicBrainz]] or update the existing entries.

When a query fails to return any content from the [[MusicBrainz]] database, a query is sent to [[FreeDB]].org to see whether it matches any of their content. If [[FreeDB]] has an entry, we can use this info. Please remember that [[FreeDB]] stores all content as 'single artist releases': multiple artist releases, releases with 1 track, etc. are all treated the same.

Currently there is a [[FreeDB]] moderator on [[MusicBrainz]] that inserts new entries in the system. However, sometimes these entries have errors, or are simply total crap. Here are some problems with the entries and (pending) solutions. Feel free to add your own.
* In [[MusicBrainz]] the release already exists, but is incomplete. Solution: do fuzzy matching with present releases and merge while keeping possible [[TRM]] entries.
* FreeDB does not handle multiple artist CDs properly. Solution: when more than half the tracks on the release contain TRACKNAME - ARTIST separators such as --, /, -, |, or \, the AutoInsertFromFreeDB script should try to detect the format automatically. Using the database, it can try to detect which part is the artist. This should ease the pressure on our valuable [[Moderators]] a bit.
* FreeDB has bad entries for artists. Solution: creating a new artist should be difficult, because the likelihood of errors is higher. If a release name and artist name already appear in the database, the new info is probably correct. A [[FreeDB]] insert of a new artist and release should, for example, require more votes than a simple merge.

--[[Johan]]

Should probably be "should require a higher yes to no ratio than a simple merge." (but I'm sure that was meant) --[[Cmarqu]]

----


I've taken the above suggestions and combined them with the current setup to come up with the complete proposed [[FreeDB]] auto insert algorithm:
# A client looks up a CD at [[MusicBrainz]] and the CD is not found*
# The CD is looked up at [[FreeDB]] and returned to the user*
# The CD is sanity checked:
## Must have more than one track* (it has been suggested to raise this to 4 or 5)
## The text is sanity checked to make sure it's not all upper or lower case*
## The release title must not contain the word 'various'.* If the sanity check fails, the possible submission gets tossed.

# (Step 3.c must be dropped to implement this) If the title contains the word 'various', 'compilation' or 'soundtrack', or if more than half of the tracks contain a track/artist separator (--, /, |, ()) the release will be treated as a multiple artist release.
## Determine an appropriate track/artist separator
## Split the track titles into left and right portions
## Lookup the left and right portions in the artist table to determine which side is the track title and which side is the artist name.

# If a release with the same title (exact or fuzzy) and number of tracks exists in MB, the CD Index id is associated with the matching release. Stop here.
# If a release with the same title (exact or fuzzy) and a smaller number of tracks exists in MB, the new release is merged into the existing release, preserving existing TRM and CD Index ids and filling out missing tracks. Stop here.
# If no matching release exists in MB, insert new release.*

Items with a * are currently implemented, even if not available on the main server yet.
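
The multiple artist step of the proposal (pick a separator, split each title, then use the artist table to decide which side is the artist) might look something like the sketch below. It is purely illustrative: the candidate separator list, the function names, and the use of a plain set of known artist names in place of the artist table are all assumptions, and the "()" form mentioned in the list is not handled.

```python
# ASSUMPTION: candidate separators, written with surrounding spaces.
SEPARATORS = (" -- ", " / ", " | ", " - ")


def choose_separator(titles):
    """Return the separator found in more than half the titles, or None."""
    best, best_count = None, 0
    for sep in SEPARATORS:
        count = sum(sep in title for title in titles)
        if count > best_count:
            best, best_count = sep, count
    return best if best_count * 2 > len(titles) else None


def split_tracks(titles, known_artists):
    """Split titles into (artist, track) pairs, or return None if the
    release does not look like a multiple artist release."""
    sep = choose_separator(titles)
    if sep is None:
        return None
    # Split each title into left and right portions at the first separator.
    pairs = []
    for title in titles:
        left, found, right = title.partition(sep)
        pairs.append((left.strip(), right.strip() if found else ""))
    # Look both sides up among known artists to decide which side is which.
    known = {name.lower() for name in known_artists}
    left_hits = sum(left.lower() in known for left, _ in pairs)
    right_hits = sum(right.lower() in known for _, right in pairs)
    artist_on_left = left_hits >= right_hits  # ties default to the left
    result = []
    for left, right in pairs:
        if not right:
            result.append((None, left))  # no separator in this title
        elif artist_on_left:
            result.append((left, right))
        else:
            result.append((right, left))
    return result
```

Note that tracks by artists not yet in the database still get split the same way as their neighbours, which is exactly how this step could end up creating new artists.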

Johan: I don't think that requiring a higher yes-to-no ratio is going to work here, since most votes go down pretty unanimously anyway. It would just mean MORE people have to look at the moderation in order for it to pass. For single artist releases I don't think we need to change anything. However, currently we do not address the multiple artist releases at all, since they can potentially create dozens of new (and quite possibly crappy) artists. I'm not even sure we should do this -- I don't think that the [[FreeDB]] data is good enough to do this. :-( --[[User:Ruaok|Ruaok]]

----


Ruaok: Great algorithm, should give us good content. The multiple artist release insert could indeed create crappy artists. It would be great if every user could simply edit the insert proposed by the [[FreeDB]] moderator. If it were shown for every artist whether it was new or already present, that would be valuable info for the moderators, I think. But how do we handle the Insert edit permission if people have already voted yes/no? --[[Johan]]

----


Ruaok: I don't think releases with Disc IDs should be extended by further tracks. (your point 6) This would lead to Disc IDs associated to releases with more tracks than expected.

Johan: I think this can be handled by the artist aliases / artist merging features. Of course popular artists will get lots of crappy aliases.

(fixed typo in 3c)

--[[User:DeKarl|DeKarl]]

----


[[User:DeKarl|DeKarl]]: Yeah, I'll need to be careful to make sure that no extra tracks are added. The main point is to 'fill out' any missing tracks, using the CD Index ID as a guide for the 'correct' number of tracks. --[[User:Ruaok|Ruaok]]

----


An addition is to set the release type while inserting.
* If "Soundtrack" or "O.S.T." is in the release name it is a soundtrack.
* If "Live" or "Unplugged" is in the release name it is a live recording.
* If most tracks start with the same words it is a maxi (a single with the radio/release version and possibly lots of remixes)
* If the release name is the same as a track's name it might be a single (how many tracks does a single have? 3-6?)

--[[User:DeKarl|DeKarl]]
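
The release type suggestions above could be tried out with something like this rough sketch. Several details are assumptions, not part of the suggestion: "Live" and "Unplugged" are matched as whole words (so that a name such as "Deliverance" does not count as live), "start with the same words" is approximated by comparing only the first word of each track, and the rules are checked in the order listed.

```python
import re
from collections import Counter


def guess_release_type(release_name, track_titles):
    """Guess a release type from names alone; returns None if unsure."""
    name = release_name.lower()
    # Soundtrack: "Soundtrack" or "O.S.T." anywhere in the release name.
    if "soundtrack" in name or "o.s.t." in name:
        return "soundtrack"
    # Live: "Live" or "Unplugged" as a whole word (assumption).
    words = set(re.findall(r"[a-z]+", name))
    if "live" in words or "unplugged" in words:
        return "live"
    # Maxi: most tracks start with the same word (crude approximation).
    first_words = [t.lower().split()[0] for t in track_titles if t.split()]
    if len(first_words) >= 2:
        _, count = Counter(first_words).most_common(1)[0]
        if count * 2 > len(track_titles):
            return "maxi"
    # Single: the release name equals one of the track names.
    if any(name == t.lower() for t in track_titles):
        return "single"
    return None
```

The first-word test is easily fooled, for instance by a release on which most tracks begin with "The", so a real implementation would probably need a stricter comparison.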

[[Category:To Be Reviewed]]

Latest revision as of 03:15, 29 March 2010
