History:Don't Make Relationship Clusters

From MusicBrainz Wiki
Revision as of 23:26, 13 June 2007 by Shepard (talk | contribs) ((Imported from MoinMoin))

The Relationship Cluster Problem

A "Relationship Cluster" is a group of entities in the database in which every entity is linked to every other entity. For example, all the siblings in the Jackson family (Rebbie, Jackie, Tito, Jermaine, La Toya, Marlon, Michael, Randy, Janet) need artist entries. Given that we have the SiblingRelationshipType, on the face of it every sibling should be linked to every other sibling. For these nine siblings that requires 8 + 7 + 6 + 5 + 4 + 3 + 2 + 1 = 36 relationship entries. This is a lot of unnecessary moderation work.
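The count above is simply the number of unordered pairs, n(n-1)/2. As a minimal sketch (the function name `clique_links` is made up for illustration and is not part of any MusicBrainz code), it can be checked with the Python standard library:

```python
from itertools import combinations

JACKSONS = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
            "Marlon", "Michael", "Randy", "Janet"]

def clique_links(members):
    """Return every unordered pair, i.e. the links a full cluster needs."""
    return list(combinations(members, 2))

links = clique_links(JACKSONS)
n = len(JACKSONS)
print(len(links), n * (n - 1) // 2)  # both values are 36 for nine siblings
```

Because the count grows quadratically, each additional sibling adds as many new links as there are existing members.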

The other problem with relationship clusters is the difficulty of updating the database consistently. Let's say that we only have Michael Jackson and Janet Jackson in the database, with a sibling relationship between them. Then someone adds Hugh Jackman to the database, and mistakenly creates a sibling link to both Michael Jackson and Janet Jackson. If someone then comes in to correct this mistake, they might delete one sibling relationship without deleting the other. Or they might submit both deletions, but one could get voted down while the other gets voted through. The resulting state is, depending on how you look at it, either inconsistent or merely confusing.

Other potential clusters include: linking various recordings of the same song together; linking artist performance names together; linking re-released releases together; linking individual releases in a box set together; and many more.

Solutions

In general, relationship clusters can be avoided by defining a "special" member of the cluster to which all the members should link. For example, instead of linking to all siblings, only link to the oldest. To encourage this, you generally name the relationship something like "is the oldest sibling of" instead of "is a sibling of". In the Jacksons example, this reduces 36 relationships to just 8.

Unfortunately, this will still produce confusing results on the website. Rebbie Jackson is the eldest, but far from being the most famous. When looking at Janet Jackson's artist page, most people would be far more interested to know that she's the sister of Michael Jackson than Rebbie Jackson. Even worse, it might give the impression that in fact Janet Jackson is not the sister of Michael Jackson, since why else would such an obvious relationship not appear? This could potentially be fixed by clever website code that figures out all the siblings by going up to the eldest sibling and back down to all the others, but that will require extra development time for every situation that arises.
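The "up to the eldest and back down" lookup described above could look roughly like the following. This is only a sketch under assumptions: the tuple-based link store and the name `siblings_of` are hypothetical and do not reflect how the MusicBrainz server actually stores relationships.

```python
# Hypothetical link store: each tuple reads
# (eldest, younger), meaning eldest "is the oldest sibling of" younger.
ELDEST_LINKS = [("Rebbie", younger) for younger in
                ["Jackie", "Tito", "Jermaine", "La Toya",
                 "Marlon", "Michael", "Randy", "Janet"]]

def siblings_of(artist, links):
    """Go up to the eldest sibling, then back down to all the others."""
    # Up: if the artist is the target of a link, the source is the eldest;
    # otherwise the artist is the eldest (or has no sibling links at all).
    eldest = next((e for e, y in links if y == artist), artist)
    # Down: the eldest plus everyone the eldest links to.
    family = {eldest} | {y for e, y in links if e == eldest}
    return family - {artist}

print(sorted(siblings_of("Janet", ELDEST_LINKS)))
```

With eight stored links, Janet's page could still list Michael, but every relationship type that uses a "special" member would need its own variant of this lookup.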

Also, it's frequently very difficult to come up with a good definition of which is "special". Which release is the special one in a box set? Frequently they're explicitly ordered, but often not. We'd need to come up with a comprehensive StyleGuideline to define it, which would inevitably be confusing in all kinds of edge cases.

Another possibility is to try to avoid the situation in the first place by relating to some other common entity instead. For example, represent the sibling relationship between the Jacksons indirectly by connecting them via "is the child of" relationships to their parents. This still doesn't show all the sibling relationships on the artist pages, but does at least remove some of the confusion as to which sibling should be related to which other sibling. Another example: when trying to avoid the "is in the same box set as" release cluster, create a new entity (release?) to represent the box set as a whole, and use "is contained in" relationships instead.
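To illustrate the indirect approach, here is a sketch that derives siblings from "is the child of" links. The data layout and the name `siblings_via_parents` are assumptions made for this example, and it treats anyone sharing at least one parent as a sibling, which would also include half-siblings.

```python
CHILDREN = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
            "Marlon", "Michael", "Randy", "Janet"]
PARENTS = ["Joe Jackson", "Katherine Jackson"]

# Hypothetical link store: (child, parent) for each "is the child of" link.
CHILD_OF = [(child, parent) for child in CHILDREN for parent in PARENTS]

def siblings_via_parents(artist, child_of):
    """Everyone who shares at least one recorded parent with the artist."""
    parents = {p for c, p in child_of if c == artist}
    return {c for c, p in child_of if p in parents and c != artist}

print(len(CHILD_OF))  # 18 links instead of 36
print(sorted(siblings_via_parents("Janet", CHILD_OF)))
```

The parents act as the common entity, so no sibling has to be singled out as "special", at the cost of requiring artist entries for the parents.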

I should note here that there are big problems with actually implementing both of the above examples; I'm not implying that those are good solutions for those problems, just trying to indicate how problems can potentially be solved. Breaking relationship clusters is often a very tricky problem, and usually requires some lateral thinking and a tolerance of a little counter-intuitive design.

Another note: going back to siblings, it would be possible to link each sibling to the next in succession, instead of linking them all to the eldest. So you would record Rebbie Jackson "is the next oldest sibling of" Jackie Jackson, who "is the next oldest sibling of" Tito Jackson, and so on. This conveys the same meaning, but there are a couple of problems. Firstly, if you need to remove or add someone in the middle of the group, it requires several edits instead of just one. Secondly, it's much less efficient for software to compile a list of all the siblings if it needs to step many times through the list of links.
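Both drawbacks of the chain can be seen in a small sketch. It uses the birth order listed at the top of the page; the dictionary layout and the names `walk_chain` and `remove_member` are hypothetical, chosen only for this illustration.

```python
BIRTH_ORDER = ["Rebbie", "Jackie", "Tito", "Jermaine", "La Toya",
               "Marlon", "Michael", "Randy", "Janet"]

# Hypothetical chain store: each sibling maps to the next younger one.
NEXT_YOUNGER = dict(zip(BIRTH_ORDER, BIRTH_ORDER[1:]))

def walk_chain(start, next_younger):
    """Follow one link per step; the work grows with the chain length."""
    chain = [start]
    while chain[-1] in next_younger:
        chain.append(next_younger[chain[-1]])
    return chain

def remove_member(name, next_younger):
    """Return (new links, number of link edits) after removing one member."""
    links = dict(next_younger)
    edits = 0
    successor = links.pop(name, None)      # delete the outgoing link, if any
    if successor is not None:
        edits += 1
    for older, younger in next_younger.items():
        if younger == name:
            del links[older]               # delete the incoming link
            edits += 1
            if successor is not None:
                links[older] = successor   # add a link bridging the gap
                edits += 1
    return links, edits

links, edits = remove_member("Tito", NEXT_YOUNGER)
print(edits, walk_chain("Rebbie", links))
```

Removing a member from the middle takes two deletions and one addition here, whereas in the "link to the eldest" scheme removing any non-eldest sibling is a single deletion.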

jacksons.png

AdvancedRelationshipTypes Affected

The following AdvancedRelationshipTypes are affected:

Discussion

For relationships, such as SiblingRelationshipType, which are transitive, it would be nifty if the server could treat all the related items as a set, rather than exposing the representation of the set as a set of links (the difference being that removing a link potentially breaks the set in half, while removing an item from the set will not). Also, being treated as a set would result in all items in the set being displayed, rather than only the directly linked items. --MRudat
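The difference between a set of links and a set of items can be sketched as follows. This is an illustration only, assuming a simple union-find over a chain of links; the function name `components` and the data are made up for the example.

```python
def components(members, links):
    """Group members into connected components of the link graph."""
    parent = {m: m for m in members}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for m in members:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())

MEMBERS = ["Jackie", "Tito", "Jermaine", "Marlon"]
CHAIN = [("Jackie", "Tito"), ("Tito", "Jermaine"), ("Jermaine", "Marlon")]

# Representation as links: deleting one link splits the group in two.
broken = [link for link in CHAIN if link != ("Tito", "Jermaine")]
print(len(components(MEMBERS, CHAIN)), len(components(MEMBERS, broken)))

# Representation as a set: removing one item leaves the rest together.
as_set = set(MEMBERS)
print(sorted(as_set - {"Tito"}))
```

Treating the group as a set would also make it natural to display every member on each artist page, not just the directly linked ones.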

To be honest, I don't really see the problem. The two reasons given for not creating clusters that I see above are

  • 1) If they're wrong, they're more confusing to fix
  • 2) They are "unnecessary moderation work".

Regarding #1, this seems more a verification issue than a moderating issue - the goal ought to be preventing incorrect ARs from being added, not preventing correct ARs from being created. There are many other ARs where things are redundant, but we vote not to remove them, because the ARs are still "correct". As long as correct ARs are being formed, I don't see why any are bad.

Regarding #2, this seems to make an assumption regarding the priorities of the moderator. No one *has* to create ARs, but if someone chooses to do them, realistically, for any editor, any editing they do is "unnecessary moderation work". We all do it because we like to, not because we're required to.

The only reason I can see to not create these is that it creates additional edits to be voted on. However, consider: The above example of the Jackson family creates 36 edits. To take an even more extreme example, the 13 children of Bob Marley create 78 sibling relationships, plus 26 parent relationships, for a total of 104 ARs. This sounds like a lot, until you consider the number of ARs created by a single classical work. Take a standard mixed composer release. Assume 8 tracks, each with a composer, librettist, SATB, orchestra, and backing chorus, plus conductors for the chorus and orchestra. That's 80 ARs for the tracks, plus all the standard ARs for the liner notes, artwork, etc. Essentially, 1 classical CD of ARs is just about equal to the entirety of ARs for the Marleys - and the Marley ARs only need to be done once.

Essentially, it all boils down to this question: Why are relationship clusters, if done correctly the first time, a bad thing? -- BrianSchweitzer 23:13, 12 June 2007 (UTC)
  • The assumptions on this page are wrong anyway, I wouldn't pay too much attention to it. ;) Sibling relationships for example are not transitive - if you take into account half-siblings, adoption and whatnot. Example 1: Bob has a son Peter from an earlier marriage. Alice has a daughter Mary from an earlier marriage. Bob marries Alice and they have a son Paul. Now Peter and Paul are half-siblings and Paul and Mary are half-siblings. But Peter and Mary are not siblings, not even half ones. Example 2: Peter and Paul are entered as siblings in the database. Bob is entered as the father of Peter, Alice is entered as the mother of Paul. Can you conclude that Alice and Bob are married or in any way related? No - you don't have to marry to have children, you can divorce, you can adopt children, ... -- Shepard 23:25, 13 June 2007 (UTC)
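Example 1 above can be checked mechanically. The sketch below records only the parents named in the example (the parents from the earlier marriages are left out), and the helper name `share_parent` is made up for illustration.

```python
# Parents named in Example 1; unnamed earlier spouses are omitted.
PARENTS = {
    "Peter": {"Bob"},
    "Mary": {"Alice"},
    "Paul": {"Bob", "Alice"},
}

def share_parent(a, b):
    """True if a and b have at least one recorded parent in common."""
    return bool(PARENTS[a] & PARENTS[b])

print(share_parent("Peter", "Paul"))  # half-siblings through Bob
print(share_parent("Paul", "Mary"))   # half-siblings through Alice
print(share_parent("Peter", "Mary"))  # no common parent
```

Since the first two pairs are related but the third is not, a "shares a parent" relation cannot be assumed to be transitive.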