MLHD+: Difference between revisions

From MusicBrainz Wiki
Revision as of 14:46, 29 March 2023

Introduction

MLHD+ (Music Listening Histories Dataset+) is an improved version of the Music Listening Histories Dataset, released by the DDMAL at McGill University.

The original MLHD dataset contained a number of errors, which we traced to the poor data matching algorithms used when the data was originally collected. Problems that we encountered and have fixed in the dataset include:

  1. Follow redirects: In MusicBrainz, a data entity is sometimes entered more than once, so any duplicates must be deleted from the database. So that the old MBID is not lost, we add an entry to our redirect tables, which keep track of all of them. In the MLHD+ dataset, we checked each recording and ensured that the dataset contains the most recent and correct version.
  2. Identify Canonical Recordings: Using our canonical datasets we've identified the canonical versions of releases and recordings and used them, rather than the non-canonical versions. This means that if two people listened to different versions of a recording we can identify that they are the same instead of using two distinct MBIDs.
  3. Look up correct Artist and Releases: Once we identified the canonical recordings, we looked up the canonical release and the correct artist MBID for each of the data points. The original matching algorithm often conflated two similarly named artists, treating them as the same artist, which greatly diminished the value of the dataset.
  4. Other data problems: Some entries in the dataset have some data elements missing, making those data points useless. We've removed all the data points that contained errors that could not be positively resolved into the MusicBrainz database.
  5. Better compression: The dataset is compressed using zstandard, resulting in a 50% reduction in compressed size on disk.
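The first two fixes above can be pictured as two lookups applied in sequence. The following is a minimal sketch, not the actual MLHD+ pipeline: it assumes the redirect tables and canonical datasets have been loaded into two plain dictionaries (`redirects` and `canonical`, both hypothetical names), and it uses short placeholder strings in place of real MBIDs.

```python
def resolve_recording(mbid, redirects, canonical):
    """Follow redirects, then map the result to its canonical recording.

    `redirects` maps an old (merged) MBID to the MBID that replaced it.
    `canonical` maps a recording MBID to its canonical recording MBID.
    Both are hypothetical in-memory stand-ins for the real tables.
    """
    seen = set()
    # A merged entity may itself have been merged again, so follow the chain.
    while mbid in redirects and mbid not in seen:
        seen.add(mbid)  # guard against accidental redirect cycles
        mbid = redirects[mbid]
    # Recordings without a canonical mapping are returned unchanged.
    return canonical.get(mbid, mbid)


# Placeholder identifiers, not real MBIDs.
redirects = {"old-1": "old-2", "old-2": "current"}
canonical = {"current": "canonical-version"}
resolved = resolve_recording("old-1", redirects, canonical)
```

Two listens that started from different MBIDs for the same recording end up with the same identifier after this step, which is what makes them comparable.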

Obtaining the dataset

The dataset can be downloaded from https://data.musicbrainz.org/pub/musicbrainz/listenbrainz/mlhd/

Dataset structure

Listening history is stored in one file per user. Each individual file is named with a random UUID and is compressed using zstandard compression. For ease of distribution and handling, files are stored in a directory named after the first two characters of the filename (256 directories in total). These directories are bundled in uncompressed tar archives named after the first character of the filename (16 archives with 16 directories each).

For example, the layout looks as follows:

mlhdplus-complete-0.tar
00/004927e0-092d-4eea-b928-eea7a8ca3b48.txt.zst
00/00174fbf-5e8e-461c-b2ad-95f9319a8fcf.txt.zst
00/…
01/01a34a9c-6044-4cb8-bacf-e56931403a9e.txt.zst
01/011b9c22-0cd9-41de-828c-20a9b68ddde6.txt.zst
01/…
…
0f/0f3eebe3-e553-4af4-95dd-742bc0448f66.txt.zst
0f/0f136730-8b5d-4f55-9865-e47535fc61ba.txt.zst
0f/…
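Given a user's filename, the archive and directory that hold it follow directly from the first characters of the name. The sketch below illustrates this; the function name `locate` is hypothetical, and the `mlhdplus-partial-…` archive name is an assumption based on the statement that the -partial archives use the same filename structure.

```python
def locate(filename, kind="complete"):
    """Return (archive name, path inside the archive) for a user file.

    `kind` is "complete" or "partial"; the partial archive naming is an
    assumption, only the complete naming appears in the example layout.
    """
    prefix = filename[:2].lower()
    archive = f"mlhdplus-{kind}-{prefix[0]}.tar"  # first character
    member = f"{prefix}/{filename}"               # first two characters
    return archive, member


archive, member = locate("0f3eebe3-e553-4af4-95dd-742bc0448f66.txt.zst")
# archive is "mlhdplus-complete-0.tar"
# member is "0f/0f3eebe3-e553-4af4-95dd-742bc0448f66.txt.zst"
```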

The dataset is split into two types of files:

  1. Rows that contain a Recording MBID and that we were able to match to a Canonical Recording are stored in the “-complete” archives.
  2. Rows that are missing data (e.g. only include an artist ID, or only include a timestamp and no MBIDs), or that include data we were unable to match to a valid MBID in MusicBrainz, are stored in the “-partial” archives, using the same filename structure.

If you want to work with verified MBIDs, we recommend that you use only the -complete archives. If you need additional information such as a list of all timestamps where a user listened to something (even if we don’t know what it is), you can also merge data from the -partial archives.
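As an illustration of merging the two kinds of files, the sketch below collects every listening timestamp for one user from both sources. It assumes the rows of the user's -complete and -partial files have already been decompressed into text lines, and that the timestamp is the first tab-separated column; the function name `all_listen_times` is hypothetical.

```python
def all_listen_times(complete_rows, partial_rows):
    """Return all timestamps from both row sources, sorted ascending."""
    times = []
    for rows in (complete_rows, partial_rows):
        for row in rows:
            first_column = row.rstrip("\n").split("\t", 1)[0]
            if first_column:  # skip rows with no timestamp
                times.append(int(first_column))
    return sorted(times)


# Placeholder rows; the letters stand in for MBIDs.
times = all_listen_times(["2\ta\tb\tc\n"], ["1\t\t\t\n", "3\tx\t\t\n"])
```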

Each file contains 4 fields, separated by tabs, with no header row:

  • The timestamp of the listening event
  • Artist MBIDs
  • Release MBID
  • Recording MBID

Timestamps are represented in Unix epoch time (see https://www.epochconverter.com/).

Because recordings in the MusicBrainz database can have multiple artists attached to them (https://wiki.musicbrainz.org/Artist_Credits), the second column may contain multiple MBIDs. In that case, the MBIDs are separated by commas.

In the -partial archives, if there was no data available in the original MLHD data files for a given column then this data will also be missing. If we were unable to resolve a value to a valid MBID, we left it in the data file unchanged. You should independently verify any MBIDs in these files before using them.
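The file format described above can be read with a few lines of Python. This is a sketch under stated assumptions rather than an official reader: it assumes timestamps are whole seconds, that a missing column appears as an empty field, and that the third-party `zstandard` package is installed for decompression. The names `Listen`, `parse_rows` and `read_user_file` are hypothetical.

```python
import io
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class Listen(NamedTuple):
    listened_at: Optional[datetime]
    artist_mbids: list
    release_mbid: Optional[str]
    recording_mbid: Optional[str]


def parse_rows(lines):
    """Yield one Listen per non-empty, tab-separated line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # pad rows that have fewer columns
        ts, artists, release, recording = fields[:4]
        yield Listen(
            # Assumption: the timestamp is in seconds since the Unix epoch.
            datetime.fromtimestamp(int(ts), tz=timezone.utc) if ts else None,
            [a for a in artists.split(",") if a],  # column may hold several MBIDs
            release or None,
            recording or None,
        )


def read_user_file(path):
    """Decompress one extracted .txt.zst file and yield its rows."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from parse_rows(io.TextIOWrapper(reader, encoding="utf-8"))


# Placeholder values stand in for real MBIDs.
sample = ["1234567890\tartist-a,artist-b\trelease-x\trecording-y\n",
          "1234567891\t\t\t\n"]
rows = list(parse_rows(sample))
```

Keeping the parsing separate from the decompression means the same `parse_rows` function works on rows from either the -complete or the -partial archives, where empty columns come back as `None` or an empty list.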

Dataset caveats

The dataset was made with a snapshot of the MusicBrainz database as of March 2023. As the MusicBrainz database changes, some items in MLHD+ may therefore no longer match the current state of the database.