Development/Summer of Code/2022/ListenBrainz

From MusicBrainz Wiki
Revision as of 11:56, 28 February 2022 by RobertKaye

ListenBrainz is one of the newest MetaBrainz projects. Read more information on its homepage.

Getting started

(see also: Getting started with GSoC)

If you want to work on ListenBrainz, you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that we might ask you about:

  • Show that you understand the goals that ListenBrainz wants to achieve, which are written on its homepage
  • Create an OAuth application on the MusicBrainz website and add the configuration information to your ListenBrainz server. Use this to log in to your server with your MusicBrainz details
  • Use the import script that is part of the ListenBrainz server to load scrobbles from Last.fm to your ListenBrainz server, or the main ListenBrainz server
  • Use your preferred programming language to write a submission tool that can send Listen data to ListenBrainz. The data doesn't have to be real; you could make up fake song names and artists.
  • Try to delete the ListenBrainz database on your local server to remove the fake data that you added.
  • Look at the list of tickets that we have open for ListenBrainz and see if you understand what tasks the tickets involve
  • If you want to, see if you can contribute to fixing a ticket. Either add a comment to the ticket or ask in IRC for clarification if you don't understand what the ticket means
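
As a starting point for the submission tool mentioned above, here is a minimal sketch in Python. It assumes the `submit-listens` endpoint and JSON layout of the ListenBrainz API as we understand them (a `listen_type` of `single` and one entry in `payload`); check the API documentation before relying on it. The function names `build_payload` and `submit_listen` are made up for this example, and the artist and track names are fake.

```python
import json
import time
import urllib.request

# Assumed endpoint of the main server; point this at your local server when testing.
API_URL = "https://api.listenbrainz.org/1/submit-listens"


def build_payload(artist_name, track_name, listened_at=None):
    """Build the JSON body for one listen (hypothetical helper)."""
    if listened_at is None:
        listened_at = int(time.time())  # Unix timestamp of the listen
    return {
        "listen_type": "single",
        "payload": [
            {
                "listened_at": listened_at,
                "track_metadata": {
                    "artist_name": artist_name,
                    "track_name": track_name,
                },
            }
        ],
    }


def submit_listen(user_token, payload, api_url=API_URL):
    """POST the payload with the user's token and return the HTTP status code."""
    request = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {user_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    fake = build_payload("Made Up Artist", "Made Up Song", listened_at=1645000000)
    print(json.dumps(fake, indent=2))
    # submit_listen("YOUR-USER-TOKEN", fake)  # uncomment once you have a token
```

Building the payload separately from sending it makes it easy to inspect the fake data before anything reaches a server.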

We're adding a number of new social features to ListenBrainz that we hope will enable people to discover more music they like and find users whose music tastes are similar to their own. We're working on some of these features now, but we will need help with other features:


Create a music recommendation algorithm using the Troi toolkit

Proposed mentors: mayhem
Languages/skills: Python, possibly Postgres.
Estimated Project Length: Can be 175 or 350 hours depending on the scope of the chosen project.
Expected outcomes: One or more finished, debugged and tested plugins for Troi.

Our Troi recommendation toolkit is our playground for developing recommendation algorithms. The toolkit already knows how to fetch stats and collaborative-filtered recommended tracks from ListenBrainz, metadata from MusicBrainz, and tracks with similar acoustic features from AcousticBrainz. We're looking for one student who has an original idea that can be implemented in Troi, ideally using the existing data-sets without having to invent or create new data-sets. This plugin should create a new feature that allows users to discover new music. Please note: We're going to be very selective about which proposals we accept for this project. Before you propose an algorithm to us, you'll need to carefully familiarize yourself with the Troi toolkit and the features it provides. Your idea needs to be novel, at least in the context of Troi.

Integrate more music services for recording listens and playing music

Proposed mentors: lucifer
Languages/skills: Python/Flask, Typescript/React
Estimated Project Length: Can be 175 or 350 hours depending on the integration/service chosen.
Difficulty: Easy
Expected Outcomes: A new music service integration for users to play and record listens on ListenBrainz.

LB has a number of music discovery features that use BrainzPlayer to facilitate track playback. BrainzPlayer (BP) is a custom React component in LB that uses multiple data sources to search for and play a track. As of now, it supports Spotify, YouTube and SoundCloud as backends. LB also supports linking a Spotify account to record listening history. Currently, we are reworking the integration of external music services in LB to make adding other music services easier. We have looked into some other services and found that Deezer and Apple Music also provide music playback and the ability to record listening history. Integrating these services into LB would make for a good SoC project.

Clean up the Music Listening Histories Dataset

Proposed mentors: mayhem
Languages/skills: Python/Postgres/Apache Spark/Typesense
Estimated Project Length: 350 hours
Expected outcomes: Debugged and tested code that can clean up the files contained in the MLHD.

The Music Listening Histories Dataset is a very large dataset with 27 billion rows scraped from Last.fm over many years. Last.fm used their own algorithm for matching their scrobbles (listens in our world) to the MusicBrainz data, but unfortunately the dataset had many problems. Many of the older scrobbles were never updated to keep up with the changes in the MusicBrainz database and thus are out of date. Also, given how Last.fm used to manage its artist data, there were many problems with the different artists who had the same name.

Today MusicBrainz and ListenBrainz have a number of datasets that make it possible to examine each of the 27 billion rows and attempt to resolve the old data to up-to-date data that is more concise. This project would need to happen in three stages:

  1. Download one of the 576 files and process the data using a simple Python script, examining each of the rows in the data set to resolve the data against a current copy of MusicBrainz. This is an exploration project, since we do not know what will work well. We'll need to try a few different approaches to see what will work best.
  2. Once we understand the algorithm that we wish to implement, it should be moved to our Spark cluster. This involves not only porting the queries and code to Spark, but also importing any required datasets that are not already present on our Spark cluster.
  3. Once we can process one file, we will need to automate the process to download all the files in the whole data set and process as many as we can when our Spark cluster is idle during the day. It is hard to say how long this process will take, but there is a chance that baby-sitting the project will take longer than GSoC allows for.
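
To make stage 1 more concrete, here is a rough sketch of what an exploration script could look like. It is one possible approach, not the chosen algorithm. It assumes each row is tab-separated with the columns timestamp, artist MBID, release MBID and recording MBID, and it stands in for the current copy of MusicBrainz with two hypothetical in-memory structures: `redirects` (old MBID to the MBID it was merged into) and `known_recordings` (MBIDs that currently exist).

```python
import csv
from collections import Counter


def resolve_mbid(mbid, redirects):
    """Follow redirects until an MBID with no further redirect is reached."""
    seen = set()
    while mbid in redirects and mbid not in seen:  # 'seen' guards against cycles
        seen.add(mbid)
        mbid = redirects[mbid]
    return mbid


def process_rows(lines, redirects, known_recordings):
    """Resolve recording MBIDs row by row and count what happened to each row."""
    stats = Counter()
    resolved_rows = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 4:  # assumed layout: timestamp, artist, release, recording
            stats["malformed"] += 1
            continue
        timestamp, artist_mbid, release_mbid, recording_mbid = row
        if not recording_mbid:
            stats["missing"] += 1
        else:
            current = resolve_mbid(recording_mbid, redirects)
            if current not in known_recordings:
                stats["unresolved"] += 1
            elif current != recording_mbid:
                stats["redirected"] += 1
                recording_mbid = current
            else:
                stats["current"] += 1
        resolved_rows.append((timestamp, artist_mbid, release_mbid, recording_mbid))
    return resolved_rows, stats


if __name__ == "__main__":
    sample = ["1\tart\trel\told-rec", "2\tart\trel\tcur-rec", "3\tart\trel\t"]
    rows, stats = process_rows(sample, {"old-rec": "new-rec"}, {"new-rec", "cur-rec"})
    print(dict(stats))
```

The per-category counts show how much of a file each approach can resolve, which is the kind of comparison the exploration stage calls for.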


Create a Spotify metadata cache

Proposed mentors: mayhem
Languages/skills: Python, Postgres.
Estimated Project Length: 350 hours
Expected outcomes: Debugged and tested code that loads and maintains this cache.

The BrainzPlayer, our cross-service embedded web music player, supports playing from Spotify. However, for most of the tracks that are played via this player, we query the Spotify Metadata API to find appropriate tracks to play. This process is less than ideal, since the logic for resolving which tracks to play resides in the player. It would be much better if this data resided on the server, in the form of a cache of the Spotify metadata, which would allow us to resolve the tracks on the server when we load a BrainzPlayer page.

This metadata cache consists of a new set of Postgres tables in a new schema for the data and a process that runs continuously and listens to a RabbitMQ queue for new Spotify artists to cache. When a new artist ID is received, the process should fetch all of the releases for this artist and save them to Postgres. Periodically, the cache should also check whether any records have expired or have been marked as dirty and, for those records, re-fetch the data and update the expiration timestamps.
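
The expire-and-refresh behaviour described above can be sketched as follows. This is only an illustration of the logic, under several assumptions: it keeps entries in memory instead of Postgres, it takes artist IDs from plain method calls instead of a RabbitMQ queue, and `fetch_releases` is a hypothetical callable standing in for the Spotify lookup. The names `ArtistCache`, `CacheEntry` and the 30-day lifetime are also made up for the example.

```python
import time
from dataclasses import dataclass

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # assumed lifetime of a cache entry


@dataclass
class CacheEntry:
    artist_id: str
    releases: list
    expires_at: float
    dirty: bool = False


class ArtistCache:
    def __init__(self, fetch_releases, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._fetch = fetch_releases  # hypothetical: artist ID -> list of releases
        self._ttl = ttl
        self._clock = clock
        self._entries = {}

    def handle_new_artist(self, artist_id):
        """Called for each artist ID taken off the queue; caches unseen artists."""
        if artist_id not in self._entries:
            self._store(artist_id)

    def mark_dirty(self, artist_id):
        """Flag an entry so the next refresh pass re-fetches it."""
        if artist_id in self._entries:
            self._entries[artist_id].dirty = True

    def refresh_stale(self):
        """Re-fetch expired or dirty entries; returns the refreshed artist IDs."""
        now = self._clock()
        stale = [
            entry.artist_id
            for entry in self._entries.values()
            if entry.dirty or entry.expires_at <= now
        ]
        for artist_id in stale:
            self._store(artist_id)  # also resets 'dirty' and the expiration time
        return stale

    def get(self, artist_id):
        return self._entries.get(artist_id)

    def _store(self, artist_id):
        releases = self._fetch(artist_id)
        self._entries[artist_id] = CacheEntry(
            artist_id, releases, self._clock() + self._ttl
        )
```

Passing the clock and the fetch function in as parameters keeps the refresh logic testable without a network connection or a real queue.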

We have a slightly more detailed write-up of this project. Please come to IRC and ask us for a link to the document if you are interested in working on this project.

If this project does not take the full 350 hours, we can start to build the lookup portion of this project as well, which, given an artist and recording name, finds the best track in Spotify to play.
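
As a very rough illustration of the lookup portion, the sketch below picks the closest candidate by fuzzy string similarity using Python's standard `difflib`. The matching strategy, the 0.8 threshold, the function name `best_track` and the shape of the candidate rows (dicts with `artist`, `track` and `track_id` keys) are all assumptions made for this example, not a description of how the lookup will actually work.

```python
from difflib import SequenceMatcher


def _similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def best_track(artist_name, recording_name, candidates, threshold=0.8):
    """Return the best-matching candidate row, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # Both the artist and the track name must match well, so take the weaker score.
        score = min(
            _similarity(artist_name, candidate["artist"]),
            _similarity(recording_name, candidate["track"]),
        )
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    cached = [  # fake rows standing in for the metadata cache
        {"artist": "Portishead", "track": "Glory Box", "track_id": "t1"},
        {"artist": "Portishead", "track": "Roads", "track_id": "t2"},
    ]
    print(best_track("portishead", "glory box", cached))
```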