History:Data Quality

From MusicBrainz Wiki

Data Quality and Editing Strictness

The concept of data quality is based on the release locking suggestions in QualityAndQuantity and ReleaseLocking. After much discussion and even more time, the concept has undergone a number of changes and will now actually be implemented. This page describes the idea, and the DataQualityDiscussion page lets users chime in on its merits.

Goals

The data quality idea has the following goals:

  • Establish a method to determine the quality of an artist and the releases that belong to that artist. This gives consumers of MusicBrainz data a clue about the relative quality of the data in the database.
  • Provide fine grained control over what efforts are required to edit the database and to vote on those edits.
  • Provide editors with a means to allow easier editing of data that is deemed to be of poor quality.
  • Provide editors with a means to make it harder to edit data that is considered to be of good quality.
  • Reduce the overall number of edits in the system by making the requirements to pass an edit suited to each edit type.

End user feature changes

To accomplish these goals, this feature will allow editors to indicate the quality for a given artist. An artist can be of unknown, low, medium or high data quality. The data quality indicator determines what level of effort is required to change the artist information or to add/remove albums from an artist. An artist with unknown or medium quality will require roughly the amount of effort that MusicBrainz currently requires to edit the database. An artist with low data quality will make it easier to add/remove albums or to change the artist information (name, sortname, aliases). And an artist with high data quality will require more effort to add/remove albums or to change the artist information. The data quality concept also applies to releases in the same manner: changing a release with low data quality will be easier than changing a release with high data quality.

Each artist will have a new link in the edit bar: Change artist quality. This link will allow the user to select a new quality rating for the artist. Each album will have a similar link in its edit bar: Change release quality. As with the artist, this link allows the changing of the data quality rating for this release. Changing the quality rating for releases will also be a batch operation.

The daily artist subscription email will now inform users when the quality of an artist or a release belonging to that artist has been changed.

Data quality affects edit strictness

The quality rating for an artist/release will determine the following edit strictness values:

  • Edit voting duration (in days)
  • Number of unanimous votes to pass
  • Expire action: accept or reject
  • Which EditTypes are AutoEdits

All of this information must be presented to the user on the Edit Conditions page. These conditions will now differ from edit to edit.

In the first implementation of Data Quality we're not going to change the expire times of edits that are pending. The expire time for an edit is determined when the edit is first entered into the database. Changing the quality level for an artist/release will not affect the expire time for previously opened edits.
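A minimal sketch of this rule, under the stated first-implementation assumption (the function and table names here are hypothetical, not actual MusicBrainz server code): the expire time is computed and stored once, when the edit is entered, so a later quality change leaves pending edits untouched.

```python
from datetime import datetime, timedelta

# Hypothetical voting periods per quality level, per the rough table below.
VOTING_PERIOD_DAYS = {"low": 4, "normal": 14, "unknown": 14, "high": 14}

def open_edit(entered_at: datetime, quality_at_entry: str) -> datetime:
    """Compute the expire time once, at entry time. A later change to the
    artist/release quality level does not touch this stored value."""
    return entered_at + timedelta(days=VOTING_PERIOD_DAYS[quality_at_entry])
```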

As a rough illustration the data quality levels could influence the edit strictness as follows:

                             Normal or Unknown         High    Low
Voting period (days)         14                        14      4
Yes votes required to pass   +3 (3 more yes than no)   +4      +1
Action on expiration         accept                    reject  accept
AutoEdits                    see EditType              none    all non-structural changes

(The table above needs to be replaced with a detailed table that lists all of the edit types and their associated edit strictness values)
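The rough table above could be represented as a simple lookup. This is an illustrative sketch only: the names EditConditions and CONDITIONS are hypothetical, and the "all non-structural changes" rule for low-quality AutoEdits is simplified to a boolean here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditConditions:
    voting_period_days: int  # how long the edit stays open for voting
    votes_to_pass: int       # net yes votes needed to pass early
    expire_action: str       # "accept" or "reject" when the period runs out
    auto_edit: bool          # simplified: low quality auto-applies non-structural changes

# One set of edit conditions per data-quality level, per the rough table above.
CONDITIONS = {
    "low":    EditConditions(voting_period_days=4,  votes_to_pass=1, expire_action="accept", auto_edit=True),
    "normal": EditConditions(voting_period_days=14, votes_to_pass=3, expire_action="accept", auto_edit=False),
    "high":   EditConditions(voting_period_days=14, votes_to_pass=4, expire_action="reject", auto_edit=False),
}

# Unknown quality is treated the same as normal quality.
CONDITIONS["unknown"] = CONDITIONS["normal"]

def conditions_for(quality: str) -> EditConditions:
    return CONDITIONS[quality]
```

The detailed per-EditType table mentioned above would replace the single `auto_edit` flag with a set of edit types.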

Unresolved Issues

What kind of edit should a data-quality change be? That is a difficult question. We can only make an educated guess and then see how it works out during beta tests.

Consider the following matrix:

  legitimate raise       illegitimate raise
  legitimate lowering    illegitimate lowering

legitimate raise

User raises DQ legitimately. Any edit entered from this point on should be harder to apply. This means: once the raise-DQ-edit gets applied, all pending edits should become harder to apply. This only works if the voting period for the raise-DQ-edit is not longer than that of the other edits at this quality level.

If we want DataQuality to have meaning then raising it should not happen automatically. It should need some peer review. Do we trust a single user to judge the data quality (esp. if lowering it again is hard)? What about the possibility of "raise-data-quality spam" (like the "random votes" we once had)?

What are the motivations and expected outcome of raising the data quality? For subscribers: lowering their workload of watching silly edits. That is a long-term goal which can take time and needs some effort. For the casual user: honouring and protecting their own work. Ideally that would offer instant gratification. We need to find a balance between these interests.

illegitimate raise

User tries to raise the DQ but gets voted down. All edits should be applied using the lower DQ's strictness at all times.

legitimate lowering

User lowers DQ legitimately. We suppose they enter legitimate edits right afterwards (or even just before). They will only be motivated to do that if this is easier than entering the edits at the current level. Therefore, once the lower-DQ-edit has been accepted, all pending edits should be applied according to the new, easier rules. This means ModBot must apply pending edits according to the entity's current DQ, not according to the DQ at the time the edit was entered!
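The rule in this case could be sketched as follows (hypothetical names; the vote thresholds follow the rough illustration table above). The key point is that ModBot looks up the entity's current quality when processing a pending edit, not the quality recorded at entry time.

```python
# Net yes votes (yes minus no) required to pass, per quality level.
VOTES_TO_PASS = {"low": 1, "normal": 3, "unknown": 3, "high": 4}

def can_apply(yes_votes: int, no_votes: int, current_quality: str) -> bool:
    """A pending edit passes once it has enough net yes votes under the
    rules of the entity's *current* data-quality level, so a lowering that
    has just been accepted immediately relaxes all pending edits."""
    return yes_votes - no_votes >= VOTES_TO_PASS[current_quality]
```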

illegitimate lowering

User tries to lower DQ but gets voted down. All other edits entered should be applied or rejected according to the stricter rules. Also getting the lower-DQ-edit to pass should be considerably harder than getting one or two of the other edits to pass.

Conclusions

Remember: This is a completely unproven initial hypothesis that needs to be tested on both test and live data!

  • RaiseDataQualityEdit
    • Takes relatively few unanimous votes (~1)
    • expire action is difficult to decide upon. It must be "reject" if we experience "raise-quality-spam", but we could probably start out with "accept" and only raise the strictness once we experience problems.
    • voting period is ≤ the voting period for edits at the old DQ
  • LowerDataQualityEdit
    • Is hard to pass. It takes more unanimous votes than other edits in the old DQ (~ 1.5 to 2 times more).
    • expire action is reject.
    • Is quick to apply: the voting period should be about half of the normal voting period for edits at the old DQ.
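The rules of thumb above can be sketched as follows. This is an unproven hypothesis expressed as code, not an implementation: the function names are hypothetical, and the factors (doubled vote count, halved voting period) are the rough guesses stated above.

```python
# "normal_*" values stand for the conditions of ordinary edits
# at the entity's old (current) data-quality level.

def raise_dq_conditions(normal_period_days: int) -> dict:
    return {
        "votes_to_pass": 1,                        # relatively few unanimous votes
        "expire_action": "accept",                 # start lenient; tighten if spammed
        "voting_period_days": normal_period_days,  # never longer than ordinary edits
    }

def lower_dq_conditions(normal_votes: int, normal_period_days: int) -> dict:
    return {
        "votes_to_pass": normal_votes * 2,               # ~1.5 to 2 times harder to pass
        "expire_action": "reject",                       # expires into rejection
        "voting_period_days": normal_period_days // 2,   # quick to apply
    }
```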

Lower-data-quality-edits must be extremely easy to track. There should be special feeds or subscriptions just for them. Ideas to raise the awareness of such edits are:

  • If a change-DQ-edit is entered which would have consequences for an edit that I am looking at, I am informed of this fact. (Probably hard to implement: when a normal edit is entered, check for open change-DQ-edits and add an EditNote; when a change-DQ-edit is entered, add such a note to all relevant pending edits. Uff)
  • A number of mails are sent to random subscribers from a "quality watch" list.