Do not cluster

From MusicBrainz Wiki

Introduction

This page used to be a guideline. It's no longer one, since it shouldn't be the editors' job to worry about clusters - but it is an important issue to keep in mind when designing (and proposing) new relationships. When possible (and reasonable), try to find an alternative that does not imply clusters of relationships.

The relationship cluster problem

A "Relationship Cluster" is a group of entities in the database in which every entity is linked to every other entity. For example, all the siblings in the Jackson family (Rebbie, Jackie, Tito, Jermaine, La Toya, Marlon, Michael, Randy, Janet) need artist entries. Given that we have the Sibling Relationship Type, on the face of it every sibling should be linked to every other sibling. For these nine siblings that requires 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 relationship entries. That is a lot of unnecessary moderation work.
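To make the arithmetic concrete, here is a minimal Python sketch that enumerates every pair in a fully linked group. The names `cluster_links` and `cluster_size` are made up for this illustration and are not part of any MusicBrainz code.

```python
from itertools import combinations

# The nine siblings named above, in the order the text lists them.
SIBLINGS = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
            "Marlon", "Michael", "Randy", "Janet"]


def cluster_links(members):
    """Return one (a, b) pair for every two distinct members of the group."""
    return list(combinations(members, 2))


def cluster_size(n):
    """Closed form for the number of pairwise links among n entities."""
    return n * (n - 1) // 2


links = cluster_links(SIBLINGS)
print(len(links))        # 36, matching 8 + 7 + ... + 1
print(cluster_size(9))   # 36
print(cluster_size(10))  # 45: one more sibling adds nine more links
```

The count grows quadratically, so each entity added to a cluster costs more edits than the one before it.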

The other problem with relationship clusters is the difficulty of updating the database consistently. Let's say that we only have Michael Jackson and Janet Jackson in the database, with a sibling relationship between them. Then someone adds Hugh Jackman to the database, and mistakenly creates sibling links from him to both Michael Jackson and Janet Jackson. If someone then comes in to correct this mistake, they might delete one sibling relationship without deleting the other. Or they might submit both deletions, but one could get voted down while the other gets voted through. Either way the database ends up in a state that is, depending on how you look at it, either inconsistent or merely confusing.

Other potential clusters include: linking various recordings of the same song together; linking artist performance names together; linking re-released releases together; linking individual releases in a box set together; and many more.

Solutions

In general, relationship clusters can be avoided by defining a "special" member of the cluster to which all the members should link. For example, instead of linking to all siblings, only link to the oldest. To encourage this, you generally name the relationship something like "is the oldest sibling of" instead of "is a sibling of". In the Jacksons example, this reduces 36 relationships to just 8.

Unfortunately, this will still produce confusing results on the website. Rebbie Jackson is the eldest, but far from being the most famous. When looking at Janet Jackson's artist page, most people would be far more interested to know that she's the sister of Michael Jackson than that she's the sister of Rebbie Jackson. Even worse, it might give the impression that Janet Jackson is in fact not the sister of Michael Jackson, since why else would such an obvious relationship not appear? This could potentially be fixed by clever website code that figures out all the siblings by going up to the eldest sibling and back down to all the others, but that would require extra development time for every situation that arises.
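As a rough idea of what such "clever website code" might look like, here is a small Python sketch. It assumes the links are available as a plain mapping from each younger sibling to the eldest; the mapping `OLDEST_SIBLING_OF` and the function `siblings_of` are invented for this example and do not describe how the site actually stores or resolves relationships.

```python
# Hypothetical storage for the "is the oldest sibling of" scheme:
# each younger sibling points at the eldest, giving 8 links instead of 36.
YOUNGER = ["Jackie", "Tito", "Jermaine", "La Toya",
           "Marlon", "Michael", "Randy", "Janet"]
OLDEST_SIBLING_OF = {name: "Rebbie" for name in YOUNGER}


def siblings_of(artist, oldest_of):
    """Go up to the eldest sibling, then back down to everyone linked to them."""
    # The eldest has no outgoing link, so they act as their own hub.
    hub = oldest_of.get(artist, artist)
    group = {hub} | {name for name, eldest in oldest_of.items() if eldest == hub}
    group.discard(artist)
    return sorted(group)


print(len(OLDEST_SIBLING_OF))                              # 8
print("Michael" in siblings_of("Janet", OLDEST_SIBLING_OF))  # True
print(len(siblings_of("Rebbie", OLDEST_SIBLING_OF)))       # 8
```

Even in this toy form the lookup is specific to one relationship type, which is the point made above: every "special member" scheme would need its own version.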

Also, it's frequently very difficult to come up with a good definition of which is "special". Which release is the special one in a box set? Frequently they're explicitly ordered, but often not. We'd need to come up with a comprehensive StyleGuideline to define it, which would inevitably be confusing in all kinds of edge cases.

Another possibility is to try to avoid the situation in the first place by relating to some other common entity instead. For example, represent the sibling relationship between the Jacksons indirectly by connecting them via "is the child of" relationships to their parents. This still doesn't show all the sibling relationships on the artist pages, but does at least remove some of the confusion as to which sibling should be related to which other sibling. Another example: when trying to avoid the "is in the same box set as" release cluster, create a new entity (release?) to represent the box set as a whole, and use "is contained in" relationships instead.
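The indirect approach can be sketched the same way. The Python below assumes a mapping from each artist to the set of their parents and treats anyone sharing at least one parent as a sibling (so half-siblings are included). The parent names are placeholders, and `CHILD_OF` and `siblings_via_parents` are invented for this illustration.

```python
from collections import defaultdict

# Placeholder data: "Parent A" and "Parent B" stand in for real artist entries.
CHILDREN = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
            "Marlon", "Michael", "Randy", "Janet"]
CHILD_OF = {name: {"Parent A", "Parent B"} for name in CHILDREN}


def siblings_via_parents(artist, child_of):
    """Derive siblings from "is the child of" links instead of storing them."""
    # Invert the mapping: parent -> set of children.
    children_of = defaultdict(set)
    for child, parents in child_of.items():
        for parent in parents:
            children_of[parent].add(child)
    # Anyone who shares at least one parent with the artist is a sibling.
    found = set()
    for parent in child_of.get(artist, ()):
        found |= children_of[parent]
    found.discard(artist)
    return sorted(found)


total_links = sum(len(parents) for parents in CHILD_OF.values())
print(total_links)                                        # 18
print(len(siblings_via_parents("Janet", CHILD_OF)))       # 8
```

With two parents this stores 18 links rather than 36, and the number grows linearly with each new sibling. As with the eldest-sibling scheme, the siblings only show up if something computes them.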

I should note here that there are big problems with actually implementing both of the above examples; I'm not implying that those are good solutions for those problems, just trying to indicate how problems can potentially be solved. Breaking relationship clusters is often a very tricky problem, and usually requires some lateral thinking and a tolerance of a little counter-intuitive design.

Another note: going back to siblings, it would be possible to link each sibling to the next in succession, instead of linking them all to the eldest. So you would record Rebbie Jackson "is the next oldest sibling of" Jackie Jackson, who "is the next oldest sibling of" Tito Jackson, and so on. This conveys the same meaning, but there are a couple of problems. Firstly, adding or removing someone in the middle of the chain requires several edits instead of just one. Secondly, it's much less efficient for software to compile a list of all the siblings if it needs to step many times through the list of links.
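Both drawbacks of the chain can be seen in a short Python sketch. It assumes the chain follows the order in which the siblings are listed near the top of this page, stored as a mapping from each sibling to the next one, and it assumes the chain contains no loops. The names `NEXT_SIBLING`, `siblings_from_chain` and `edits_to_remove` are made up for this example.

```python
ORDER = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
         "Marlon", "Michael", "Randy", "Janet"]
# Hypothetical chain storage: each sibling points at the next one in the list.
NEXT_SIBLING = dict(zip(ORDER, ORDER[1:]))


def siblings_from_chain(artist, next_sibling):
    """Walk up to the head of the chain, then step down through every link."""
    previous = {younger: older for older, younger in next_sibling.items()}
    head = artist
    while head in previous:          # one step per link above the artist
        head = previous[head]
    group, node = [], head
    while node is not None:          # one step per link in the whole chain
        if node != artist:
            group.append(node)
        node = next_sibling.get(node)
    return group


def edits_to_remove(artist, next_sibling):
    """List the edits needed to take one member out of the chain."""
    previous = {younger: older for older, younger in next_sibling.items()}
    older, younger = previous.get(artist), next_sibling.get(artist)
    edits = []
    if older is not None:
        edits.append(("delete", older, artist))
    if younger is not None:
        edits.append(("delete", artist, younger))
    if older is not None and younger is not None:
        edits.append(("add", older, younger))   # re-join the two halves
    return edits


print(len(siblings_from_chain("Michael", NEXT_SIBLING)))  # 8
print(len(edits_to_remove("Tito", NEXT_SIBLING)))         # 3
print(len(edits_to_remove("Janet", NEXT_SIBLING)))        # 1
```

Removing someone from the middle takes two deletions and one addition, whereas the eldest-sibling scheme needs a single deletion for anyone except the eldest.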

[Image: jacksons.png]