User:Nikki/Recording lengths 2

From MusicBrainz Wiki

After the initial round of feedback, all but one of the people who responded supported always setting recording lengths automatically. We decided in a dev chat that we will always set them automatically, and we are now focusing on how to calculate the length. As in the initial round of feedback, the discussion on this page does not apply to standalone recordings.

Determining the length

The goals I think we have when determining the length are:

  • To use one of the actual track lengths.
  • To avoid anomalous values, whether that's from appended silence, clipping or erroneous data.
  • To avoid huge changes in lengths when data is added or removed (although this is difficult with only one or two values).

Given those goals:

  • The mean (0 votes previously) does not fit, since it will often not return one of the actual track lengths.
  • Sorting the releases and taking the length from the first one (0 votes previously) does not fit, since it does not avoid anomalous values and does not avoid huge changes in lengths when data is added or removed.
  • Using the shortest length (2 votes previously) does not fit, because it does not avoid all anomalous values (it works for anomalies that make the track longer, but not ones which make it shorter) and also does not avoid huge changes in lengths when data is added or removed.

The two which do fit are the median (3 votes previously) and the mode (4 votes previously). The median is problematic when there is an even number of values, since you would normally then take the mean of the two middle values, which would not necessarily result in an actual track length (e.g. given 3:00 and 5:00, it would give 4:00). The mode is also problematic when there are multiple modes (e.g. again, given 3:00 and 5:00, neither is more common than the other). We could, however, avoid both problems by taking the shorter of the two middle values for the median, or the shortest of the most common values for the mode.
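As a rough illustration of the two tie-breaking variations described above, here is a minimal Python sketch. The function names and the choice of whole seconds as the unit are assumptions made for the example, not anything decided on this page.

```python
from collections import Counter


def lower_median(lengths):
    """Median that always returns an actual value from the list.

    For an even count, the shorter of the two middle values is used.
    """
    ordered = sorted(lengths)
    return ordered[(len(ordered) - 1) // 2]


def shortest_mode(lengths):
    """Most common value, preferring the shortest when several tie."""
    counts = Counter(lengths)
    highest = max(counts.values())
    return min(value for value, count in counts.items() if count == highest)


# Track lengths in seconds: 3:00 and 5:00, as in the example above.
lengths = [180, 300]
print(lower_median(lengths))   # 180
print(shortest_mode(lengths))  # 180
```

With 3:00 and 5:00 both variations return 3:00, so the result is always one of the actual track lengths. Python's standard library also offers `statistics.median_low`, which behaves like `lower_median` here.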

Votes and reasons for how the length should be determined

I think the median is better than the mode, because ...

  • if every track length occurs equally often (e.g. {1:03, 1:04, 1:07, 1:08}) then all of them are the mode, which is not very useful. However, the median could produce a fractional duration. Hawke (talk)
  • LordSputnik - The reasons that hawke said. However, if we think of it statistically, the difference between track lengths is an error. This error will typically be < 10 seconds. Because of this error, it makes no sense to quote the recording length to the same number of significant figures as the track length, so we should calculate the median length, then only show the recording length as an approximation, to the nearest 10 seconds.
  • Same reason as hawke said. You also have fewer options in the sub-optimal case with the median (an even number of tracks) than you have with the mode (all track lengths are different). For the median I'd simply do: sorted_lengths[len(sorted_lengths)/2], which I think is very simple for people to understand, and it produces more stable results than the mode, which can change from the shortest track to the longest as a result of adding just one additional track. Lukáš Lalinský (talk) 17:01, 26 February 2013 (UTC)
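The points made in the comments above can be tried out with a small Python sketch. It assumes lengths in whole seconds, and the function names are made up for the example. The expression sorted_lengths[len(sorted_lengths)/2] relies on integer division; in Python 3 that has to be written with `//`, because `/` returns a float, which cannot be used as a list index. For an even number of values this index picks the longer of the two middle values. The rounding helper is only one possible reading of "to the nearest 10 seconds" (halves round up).

```python
from collections import Counter


def upper_middle(lengths):
    """sorted_lengths[len(sorted_lengths) // 2]: an actual value from the list."""
    ordered = sorted(lengths)
    return ordered[len(ordered) // 2]


def shortest_mode(lengths):
    """Most common value, preferring the shortest when several tie."""
    counts = Counter(lengths)
    highest = max(counts.values())
    return min(value for value, count in counts.items() if count == highest)


def round_to_ten(seconds):
    """Round to the nearest 10 seconds, with halves rounding up."""
    return (seconds + 5) // 10 * 10


before = [180, 181, 182, 300]  # made-up track lengths in seconds
after = before + [300]         # one more track with the longest length

print(upper_middle(before), upper_middle(after))    # 182 182
print(shortest_mode(before), shortest_mode(after))  # 180 300
print(round_to_ten(upper_middle(after)))            # 180
```

In this made-up data set, adding a single track leaves the median at 3:02 but moves the mode from the shortest length to the longest.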

I think the mode is better than the median, because ...

The median and the mode both suck! I think we should ...

I still don't care, just calculate it automatically somehow.

  • Nikki (talk) 02:48, 26 February 2013 (UTC)
  • Ianmcorvidae (talk) 03:22, 26 February 2013 (UTC) (as long as the "choose the shortest" variation is chosen in order to have an actual track length)
  • Reosarevok (talk) 13:03, 26 February 2013 (UTC)

Which lengths should be included?

In the initial round of feedback there were some suggestions of only including lengths from releases with disc IDs if any of the releases have disc IDs. Which track lengths should be included for determining the recording length?
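The disc ID suggestion could be sketched roughly as follows in Python. The (length, has_disc_id) pair used to describe a track is a hypothetical shape chosen for the example and does not reflect the actual MusicBrainz schema.

```python
def lengths_to_consider(tracks):
    """Pick which track lengths feed into the recording length.

    `tracks` is a list of (length_in_seconds, has_disc_id) pairs
    (a hypothetical shape for this example). If any track comes from a
    release with a disc ID, only those lengths are kept; otherwise all
    lengths are used.
    """
    with_disc_id = [length for length, has_disc_id in tracks if has_disc_id]
    if with_disc_id:
        return with_disc_id
    return [length for length, _ in tracks]


print(lengths_to_consider([(180, False), (185, True), (300, False)]))  # [185]
print(lengths_to_consider([(180, False), (300, False)]))               # [180, 300]
```

The same pattern would apply to the "official releases" option by swapping the flag that is tested.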

I don't care

All track lengths

If there are official releases, only include lengths from official releases

If there are releases with disc IDs, only include lengths from releases with disc IDs

  • This, but what ian said above: weighting not exclusion. Hawke (talk)
  • Jesus2099 (talk) 16:57, 26 February 2013 (UTC) lengths coming from discs are the only important and unbiased ones
      • DiscIDs are getting less common though. --JonnyJD (talk) 17:54, 26 February 2013 (UTC)
  • AcoustIDs (duration estimated as with tracks above) would also be reasonable. Hawke (talk)

Weighted (Official/TOC)

  • "Add a track twice" to the list from which the mode/median is chosen when the release is official or has a Disc ID. Not more than twice when both are the case though; that might be too much. --JonnyJD (talk) 04:37, 26 February 2013 (UTC)
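A minimal Python sketch of this weighting idea, under the assumption that each track is described by a hypothetical (length, is_official, has_disc_id) tuple and that the median variant which takes the shorter middle value is used afterwards:

```python
def weighted_lengths(tracks):
    """Build the list of lengths, counting favoured tracks twice.

    `tracks` is a list of (length_in_seconds, is_official, has_disc_id)
    tuples (a hypothetical shape for this example). A track is added
    twice if its release is official or has a disc ID, and still only
    twice if both are true.
    """
    weighted = []
    for length, is_official, has_disc_id in tracks:
        copies = 2 if (is_official or has_disc_id) else 1
        weighted.extend([length] * copies)
    return weighted


def lower_median(lengths):
    """Median that returns the shorter middle value for an even count."""
    ordered = sorted(lengths)
    return ordered[(len(ordered) - 1) // 2]


tracks = [(180, False, False), (185, True, True), (300, False, False)]
print(sorted(weighted_lengths(tracks)))        # [180, 185, 185, 300]
print(lower_median(weighted_lengths(tracks)))  # 185
```

Unlike excluding lengths outright, the unweighted tracks still take part in the calculation; they just count for less.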