As you may have noticed, I sometimes do a lot of edits in a short time. This is because I run reports on the MusicBrainz database, combining it with other data sources, and use an automated way to contribute high-certainty data to MusicBrainz. My goal is to reduce the amount of pure data-copying edits, and let our editors focus on edits that require judgement and creativity.


Currently I'm focusing on adding ARs based on data taken from the public domain Discogs repository. Though it is a secondary source, the quality is overall quite good and the collection is extensive. This will take some time to get right. After that, though, there are several sources that I'm looking at for a next project, such as bringing in authoritative data from label websites. If you have any ideas, please do [http://musicbrainz.org/show/user/?userid=11780 contact me].
== More information ==

* My [[User_talk:Jeroen|talk page]], where discussion about the approach takes place and any problems are signaled.
* The [http://musicbrainz.org/mod/search/pre/editor-open.html?userid=506697 list of open edits]. I check these manually afterwards, which is the biggest limiting factor on the editing speed. It's a great help if you vote on these.
* The [http://github.com/jeroenl/musicbrainz-databot source code] is a good starting point if you want to know the details of all the quality control rules that are included in the reports and scripts.

== Semi-automated? We are focusing on quality, not quantity! ==

Yes, I'm aware. I've been editing data on albums that I own since joining back in 2003. I've seen a number of iterations of the style guidelines, and I've learned a lot about the way the community wants to add and maintain data. The edits that I contribute through the reports are based on data that I'm confident is correct.

For example, I've been adding performer credits from Discogs data, but I don't do this unless both the artist and the release are linked to their Discogs URL (that is, manually reviewed as being the same album/artist), the track name matches, the track position matches, and the track lengths differ by no more than 10 seconds. Then, before I create the edit, I check that the artist relationship has not been added in the meantime, and also that the credit is not already listed under a performance name or under a group that the artist is a member of. Finally, I double-check that the track I'm matching is still linked to the release that I was matching against, to make sure no changes have been made. A sketch of these checks is shown below.

This process filters out a lot of edits that are probably right, but I want to make sure not to make any errors.
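To make the rules concrete, here is a minimal sketch in Python of what such a filter could look like. The data layout and all field names are illustrative assumptions, not the actual musicbrainz-databot code; the real quality control rules are in the source code linked above.

<pre>
# A minimal sketch of the matching rules described above. The plain-dict
# data layout and every field name are assumptions for illustration.

MAX_LENGTH_DIFF = 10  # maximum allowed track length difference, in seconds

def is_safe_performer_credit(artist, release, track, credit, credited):
    """Accept a performer credit only if every rule passes.

    `credited` is the set of artist ids already credited on the track;
    artist["related_ids"] holds the artist's performance names and the
    groups the artist is a member of, several levels deep (see below).
    """
    return (
        # Both the artist and the release must be manually linked to Discogs.
        artist["discogs_url"] is not None
        and release["discogs_url"] is not None
        # The track must match on name and position.
        and track["name"] == credit["track_name"]
        and track["position"] == credit["track_position"]
        # Track lengths may differ by at most 10 seconds.
        and abs(track["length"] - credit["track_length"]) <= MAX_LENGTH_DIFF
        # The credit must not already exist, directly or under a related name.
        and artist["id"] not in credited
        and not (set(artist["related_ids"]) & credited)
        # The track must still be attached to the release we matched against.
        and track["release_id"] == release["id"]
    )
</pre>

An edit is only created when every rule still passes against the current state of the database, re-checked just before the edit is entered.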

== So how do you make sure the data is correct? ==

Before I go and contribute a lot of data, I go through the following steps:

# '''Explore'''. I spend a couple of days looking at the data, at similar data in MusicBrainz, and at how it's structured. I review the style guidelines, and I make manual edits using the data as a source. I explore the different types of situations that you encounter.
# '''Evaluate'''. I consider whether I feel I can trust the data. If there have been a lot of surprises in the data, or I see a lot of situations that require human judgement, I move on to something else. If the edits I made manually are purely about copying the data, then this is an area where my approach can make a big contribution.
# '''Create'''. Based on my experience, I create the reports that identify possible edits and build the editing tools. I make sure that any uncertain situations I found are excluded, and I double-check that the data is consistent. The editing tool adds a reference to the source data in the edit note, so that other editors can verify my work (see the sketch after this list).
# '''Test'''. After running the reports, I hand-check a set of potential edits before using the editing tools. Where necessary I make changes to the reports, and rerun them. After I'm confident about the edits, I test the editing tools using first a few and then a few dozen edits. I hand-check all of these, to spot any problems. After that I contribute a larger set of edits, doing spot checks and waiting a day or two to see whether other editors have comments about the edits. Any problems that are spotted, I correct immediately.
# '''Contribute'''. After this testing period, I start contributing more of the data. I keep monitoring the edits, and review any comments that come in. If any problems are spotted, I cancel the edit and correct where necessary. Of course, if the same problem could have happened in other areas, I review the edits that could have been touched by this, and manually correct any errors that were made.
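As an illustration of the ''Create'' step, here is a hypothetical sketch of an edit note that carries a reference to the source data. The note wording and the example URL are assumptions for illustration, not the bot's actual output.

<pre>
# A hypothetical sketch of an edit note referencing the source data, so
# that other editors can verify the edit. Wording and URL are made up.

def make_edit_note(source_url, matching_rule):
    """Compose an edit note that lets voters verify the edit themselves."""
    return (
        f"Performer credit taken from {source_url} "
        f"(matched on: {matching_rule})."
    )

# Example with a placeholder Discogs URL:
print(make_edit_note("http://www.discogs.com/release/12345",
                     "track name + position + length within 10s"))
</pre>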

In my experience so far, the thousands of edits that I have made have revealed only a few problems, all of which were corrected.

== But still, you can make errors ==

Yes, of course. With this quantity, there are bound to be some errors, just as there would be with a large number of manual edits. I try to only do edits where I can contribute with the same quality as a human editor who does a lot of data merging. So far, that seems to be working.

As examples, these are a few of the problems that I've seen:
* ''Incorrect relationships''. I've been setting the type of some artists with an unknown type to either group or person, when all of their relationships could apply only to a person or only to a group. Some of the changes I made turned out to be incorrect, but this revealed that the relationships themselves were of the incorrect type. In each case, I've investigated the situation and changed either the type or the relationships.
* ''Incorrect source URL''. For adding relationships, I've been using Discogs as a data source. Because releases can be linked to multiple Discogs releases, and I was reconstructing afterwards which Discogs page I had used, I sometimes provided the wrong source URL in the note. Though the information was correct, this created some irritation among editors who were voting on my edits. I now make sure, and double-check each time, that the URL that provided the information is included in the note. No more irritation.
* ''Listing artists where a performance name was already listed'', or vice versa. I found out that sometimes Discogs was giving credit to a [[PerformanceName|performance name]] while the [[LegalName|legal name]] was already listed. I'm now maintaining a table of all performance names and group memberships, even several levels deep, that I compare against the existing relationships. Since it goes several levels deep, this even catches relations that are not obvious to an editor who isn't familiar with the artist. A sketch of how such a table can be expanded follows below.
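To show how such a table can be built, here is a minimal sketch in Python of the transitive expansion of performance-name and group-membership links. The graph representation is an assumption for illustration, not the bot's actual data model.

<pre>
# A minimal sketch of expanding performance names and group memberships
# several levels deep. The `links` mapping is an illustrative assumption.

def related_names(artist_id, links):
    """Collect every artist id reachable from `artist_id` through
    performance-name or group-membership links, at any depth.

    `links` maps an artist id to the ids it is directly linked to
    (legal name <-> performance name, member <-> group).
    """
    seen = set()
    frontier = [artist_id]
    while frontier:
        current = frontier.pop()
        for linked in links.get(current, ()):
            if linked not in seen:
                seen.add(linked)
                frontier.append(linked)
    seen.discard(artist_id)
    return seen

# Example: a legal name with a performance name that is itself a member
# of a group; every id found here counts as "already credited".
links = {
    "legal": ["alias"],
    "alias": ["legal", "group"],
    "group": ["alias"],
}
print(related_names("legal", links))  # prints the set {'alias', 'group'}
</pre>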

== But ... ==

Considering how important quality is to MusicBrainz, I can understand that some editors will have doubts about the way I work. If you have any other questions, or issues that you would like to discuss, please [http://musicbrainz.org/show/user/?userid=11780 contact me] or post on my [[User_talk:Jeroen|talk page]]. If you see specific issues with the edits that I'm doing, please do respond to those edits. That helps me find potential trouble spots, and maintain the quality of the data.

Thanks,
Jeroen
