Difference between revisions of "Development/Summer of Code/2022/ListenBrainz"

From MusicBrainz Wiki
 

Latest revision as of 12:55, 11 April 2022

ListenBrainz is one of the newest MetaBrainz projects. Read more information on its homepage.

Getting started

(see also: Getting started with GSoC)

If you want to work on ListenBrainz you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that we might ask you about:

  • Show that you understand the goals that ListenBrainz wants to achieve, which are written on its homepage
  • Create an OAuth application on the MusicBrainz website and add the configuration information to your ListenBrainz server. Use this to log in to your server with your MusicBrainz details
  • Use the import script that is part of the ListenBrainz server to load scrobbles from last.fm to your ListenBrainz server, or the main ListenBrainz server
  • Use your preferred programming language to write a submission tool that can send Listen data to ListenBrainz. You could make up some fake data for song names and artists; it doesn't have to be real.
  • Try to delete the ListenBrainz database on your local server to remove the fake data that you added.
  • Look at the list of tickets that we have open for ListenBrainz and see if you understand what tasks the tickets involve
  • If you want to, see if you can contribute to fixing a ticket. Either add a comment to the ticket or ask in IRC for clarification if you don't understand what the ticket means
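As a starting point for the submission tool mentioned in the list above, here is a minimal Python sketch. It assumes the `submit-listens` endpoint, the `Token` authorization header and the payload layout as we understand them from the ListenBrainz API documentation; please verify all three against the current docs before relying on them. The function names `build_listen_payload` and `submit_listen` are made up for this example.

```python
import json
import time
import urllib.request

# Assumed endpoint of the main server; point this at your own server when testing locally.
SUBMIT_URL = "https://api.listenbrainz.org/1/submit-listens"


def build_listen_payload(artist_name, track_name, listened_at=None):
    """Build the JSON body for submitting a single listen (fake data is fine)."""
    if listened_at is None:
        listened_at = int(time.time())  # Unix timestamp in seconds
    return {
        "listen_type": "single",
        "payload": [
            {
                "listened_at": listened_at,
                "track_metadata": {
                    "artist_name": artist_name,
                    "track_name": track_name,
                },
            }
        ],
    }


def submit_listen(user_token, payload, url=SUBMIT_URL):
    """POST the payload using the user's token and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {user_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

To try it, build a payload with `build_listen_payload("Fake Artist", "Fake Song")` and pass it to `submit_listen` together with the user token from your own ListenBrainz profile page.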

We're adding a number of new social features to ListenBrainz that we hope will enable people to discover more music they like and to find users who have similar music tastes to their own. We're working on some of these features now, but we will need help with other features:

Add Timezone support to ListenBrainz

Proposed mentors: monkey, mayhem
Languages/skills: Python, Postgres, React
Estimated Project Length: 175 hours
Expected outcomes: A finished feature ready to be merged into production code

ListenBrainz currently does not have the concept of a user timezone and all listens are recorded in UTC. Having timezones for individual users would allow us to know when it is day or night for each user, so that we can generate daily recommendations during the night and the user wakes up to a new playlist. This project would involve making an appropriate table in Postgres to store timezones for users, then adding the API endpoints that allow the HTML pages to fetch/set the user timezone, and finally building the React UI pages that allow the user to set/edit their timezone.
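To illustrate why a stored timezone is useful, here is a small Python sketch that decides whether it is currently night for a user. The function names and the "night is 00:00 to 06:00" window are assumptions made for this example only; they are not part of ListenBrainz.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def local_hour(utc_now: datetime, user_tz: tzinfo) -> int:
    """Convert a timezone-aware UTC datetime to the user's local hour (0-23)."""
    return utc_now.astimezone(user_tz).hour


def is_night_for_user(utc_now: datetime, user_tz: tzinfo,
                      night_start: int = 0, night_end: int = 6) -> bool:
    """True when the user's local hour falls in [night_start, night_end)."""
    return night_start <= local_hour(utc_now, user_tz) < night_end


if __name__ == "__main__":
    now = datetime(2022, 4, 11, 23, 30, tzinfo=timezone.utc)
    # Fixed offsets keep the demo self-contained.
    print(is_night_for_user(now, timezone(timedelta(hours=2))))
    print(is_night_for_user(now, timezone(timedelta(hours=-5))))
```

If the table stores IANA timezone names such as "Europe/Berlin", the standard library's `zoneinfo.ZoneInfo` (Python 3.9+) can turn the stored name into the `tzinfo` object that these functions expect.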

Send a track to another user as a personal recommendation

Proposed mentors: lucifer, monkey
Languages/skills: Python, React
Estimated Project Length: 175 hours
Expected outcomes: A finished feature ready to be merged into production code

A popular music service had a feature long ago that allowed a user to send a track, along with some text explaining the recommendation, to another user. We think this would be a neat feature to have in ListenBrainz, so we're offering it up as a Summer of Code project. For this feature you would need to implement only the React UI; the back-end would be implemented by a MetaBrainz staff member in order to keep this project short. The UI implementation includes adding a new menu option "send this track to a user..." to the ListenCard component and then a UI dialog that allows the user to enter a description for the track and send it to the other user. Finally, the user timeline would need to display this recommendation to the user to whom the track was sent.

Bonus feature if there is time: Implement an email notification that tells the user that they have just received a track.

Upcoming and new releases page

Proposed mentors: lucifer, mayhem
Languages/skills: Python, React, design skills a bonus
Estimated Project Length: 175 hours
Expected outcomes: A finished feature ready to be merged into production code

MusicBrainz contains a mountain of data about releases (albums) and their release dates, some of which are in the future. We have not really made this information available to users and this GSoC project aims to fix this. For this project you would be working on the UI components of this feature; the required API endpoints to make this data available would be created by a MetaBrainz staff member in order to keep this project short. This feature includes a single page, and links to the page from our nav bar, that lists upcoming releases in several sections such as: New albums/EPs, Live releases, other releases and re-releases. The data to be shown on the page will be properly formatted and broken into the right sections when it is fetched from the API endpoint, so all your project needs to do is to show the data on the page, complete with links to cover art and links to MusicBrainz for more information. Ideally you would design the look of this page, taking into account the MetaBrainz design system that guides UI design for our projects, and also work with others to ensure that the design fits into our overall plans for the ListenBrainz pages.

Create 'More Tracks Like This' music recommendation plugin for the Troi toolkit

Proposed mentors: mayhem
Languages/skills: Python, possibly Postgres.
Estimated Project Length: Likely 175 hours unless we decide to expand the scope of the project.
Expected outcomes: One or more finished, debugged and tested plugins for Troi.

Our Troi recommendation toolkit is our playground for developing recommendation algorithms. The toolkit already knows how to fetch data from ListenBrainz for stats, collaborative filtered recommended tracks, similar artists and similar recordings. From MusicBrainz it can fetch needed metadata such as genres and tags. The goal of this project is to take in one or more seed song MBIDs and then use the above listed data sets to attempt to find recordings that are similar enough to make a playlist of tracks that have a similar sound and feel to the given seed tracks.

This project could be a little tricky -- the quality of the playlists generated by this project depends very heavily on the quality of the datasets that feed into it. In particular, the artist and recording similarity data sets will play a very important role, but these datasets may not be up to the needed standards to create good playlists. However, this does not invalidate this project nor would it cause us to fail the student -- it is understood that the output of this project will improve as the underlying data improves.

Create a 'Release Radar' plugin for the Troi toolkit

Proposed mentors: mayhem
Languages/skills: Python, possibly Postgres
Estimated Project Length: 175 hours
Expected outcomes: One or more finished, debugged and tested plugins for Troi.

Our Troi recommendation toolkit is our playground for developing recommendation algorithms. The toolkit already knows how to fetch data from ListenBrainz for stats, collaborative filtered recommended tracks, similar artists and similar recordings. From MusicBrainz it can fetch needed metadata such as genres and tags. This project should generate a playlist every Friday that is a collection of selected tracks that have been recently released (last 2 weeks or so) by artists that are in a given user's top artists list. We will have an API endpoint that will list recent releases for a given user, which will be implemented by a MetaBrainz team member, and your Troi plugin should select tracks from these releases and make an exploration playlist from these tracks.
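The "recently released by a top artist" filter could be sketched in Python as follows. This is an illustration only: the dictionary keys `artist_mbid` and `release_date` and the function name are hypothetical, and the real API endpoint may return data in a different shape.

```python
from datetime import date, timedelta


def recent_releases_by_top_artists(releases, top_artist_mbids, today, window_days=14):
    """Keep releases from the last `window_days` days whose artist is a top artist.

    `releases` is a list of dicts with the hypothetical keys
    "artist_mbid" (str) and "release_date" (datetime.date).
    """
    cutoff = today - timedelta(days=window_days)
    top = set(top_artist_mbids)
    return [
        release
        for release in releases
        if release["artist_mbid"] in top and cutoff <= release["release_date"] <= today
    ]
```

The sketch only narrows down the candidate releases; it says nothing about which tracks to pick from each release.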

However, some care must be taken to not select ALL the tracks from a new release, but instead to pick some tracks that we think might be interesting to the user. How would you do this? This question is hard to answer on your own -- you will be required to engage with the ListenBrainz team in IRC to discuss this feature in detail before you make your proposal. Any proposal that does not engage the community to design this feature will not be considered for acceptance, due to the nature of this project.

Integrate more music services for recording listens and playing music

Proposed mentors: lucifer
Languages/skills: Python/Flask, Typescript/React
Estimated Project Length: Can be 175 or 350 hours depending on the integration/service chosen.
Difficulty: Easy
Expected Outcomes: A new music service integration for users to play and record listens on ListenBrainz.

LB has a number of music discovery features that use BrainzPlayer to facilitate track playback. BrainzPlayer (BP) is a custom React component in LB that uses multiple data sources to search and play a track. As of now, it supports Spotify, YouTube and SoundCloud as backends. LB also supports linking a Spotify account to record listening history. Currently, we are reworking the integration of external music services in LB to make adding other music services easier. We have looked into some other services and found that Deezer and Apple Music also provide the music playback and listening history recording capabilities. Integrating these services into LB would make for a good SoC project.

Clean up the Music Listening Histories Dataset

Proposed mentors: alastairp
Languages/skills: Python/Postgres/Apache Spark/Typesense
Estimated Project Length: 350 hours
Expected outcomes: Debugged and tested code that can clean up the files contained in the MLHD.

The Music Listening Histories Dataset is a very large dataset with 27 billion rows scraped from Last.fm over many years. Last.fm used their own algorithm for matching their scrobbles (listens in our world) to the MusicBrainz data, but unfortunately the dataset had many problems. Many of the older scrobbles were never updated to keep up with the changes in the MusicBrainz database and thus are out of date. Also, given how Last.fm used to manage its artist data, there were many problems with the different artists who had the same name.

Today MusicBrainz and ListenBrainz have a number of datasets that make it possible to examine each of the 27 billion rows and attempt to resolve the old data to up-to-date data that is more concise. This project would need to happen in three stages:

  1. Download one of the 576 files and process the data using a simple Python script, examining each of the rows in the data set to resolve the data against a current copy of MusicBrainz. This is an exploration project, since we do not know what will work well. We'll need to try a few different approaches to see what will work best.
  2. Once we understand the algorithm that we wish to implement, it should be moved to our Spark cluster. This involves not only porting the queries and code to Spark, but also importing any datasets that are required in the process and are not already present on our Spark cluster.
  3. Once we can process one file, we will need to automate the process to download all the files in the whole data set and process as many as we can when our Spark cluster is idle during the day. It is hard to say how long this process will take, but there is a chance that baby-sitting the project will take longer than GSoC allows for.
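A first exploration script for stage 1 might look like the Python sketch below. It rests on two assumptions that you should check before using it: that each row is tab-separated with the columns timestamp, artist MBID, release MBID and recording MBID, and that out-of-date recording MBIDs can be resolved through a mapping of old to current MBIDs (here a plain dictionary standing in for data taken from MusicBrainz). The function name `resolve_rows` is made up for this example.

```python
import csv


def resolve_rows(lines, redirects):
    """Yield rows with any out-of-date recording MBID replaced by its current MBID.

    `lines` is any iterable of tab-separated text lines (assumed column layout:
    timestamp, artist MBID, release MBID, recording MBID).
    `redirects` maps old recording MBIDs to current ones (hypothetical input).
    """
    reader = csv.reader(lines, delimiter="\t")
    for timestamp, artist_mbid, release_mbid, recording_mbid in reader:
        current = redirects.get(recording_mbid, recording_mbid)
        yield (int(timestamp), artist_mbid, release_mbid, current)
```

Since the function accepts any iterable of lines, an open file handle for one of the downloaded files can be passed in directly, so the whole file never needs to be held in memory.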

Create a Spotify metadata cache

Proposed mentors: mayhem
Languages/skills: Python, Postgres.
Estimated Project Length: 350 hours
Expected outcomes: Debugged and tested code that loads and maintains this cache.

The BrainzPlayer, our cross-service embedded web music player, supports playing from Spotify. However, for most of the tracks that are played via this player, we query the Spotify Metadata API to find appropriate tracks to play. This process is less than ideal, since the logic for resolving which tracks to play resides in the player. It would be much better if this data resided on the server, in the form of a cache of the Spotify metadata, which would allow us to resolve the tracks on the server when we load a BrainzPlayer page.

This metadata cache consists of a new set of Postgres tables in a new schema for the data and a process that runs continuously and listens to a RabbitMQ queue for new Spotify artists to cache. When a new artist ID is received, the process should fetch all of the releases for this artist and save them to Postgres. Periodically the cache should also check whether any records have expired or have been marked as dirty, and for those records re-fetch the data and update the expiration timestamps.
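The expiry and dirty-flag check described above could be sketched in Python like this. The record layout, the names `CachedArtist`, `needs_refresh` and `mark_refreshed`, and the 30-day lifetime are all assumptions made for illustration; the real design would live in Postgres tables and is up for discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CachedArtist:
    """In-memory stand-in for one row of a hypothetical cache table."""
    spotify_artist_id: str
    expires_at: datetime
    dirty: bool = False


def needs_refresh(record: CachedArtist, now: datetime) -> bool:
    """A record must be re-fetched when it is marked dirty or has expired."""
    return record.dirty or record.expires_at <= now


def mark_refreshed(record: CachedArtist, now: datetime,
                   ttl: timedelta = timedelta(days=30)) -> CachedArtist:
    """After re-fetching the data, clear the dirty flag and push out the expiry."""
    record.dirty = False
    record.expires_at = now + ttl
    return record


if __name__ == "__main__":
    now = datetime(2022, 4, 11, tzinfo=timezone.utc)
    stale = CachedArtist("artist-a", expires_at=now - timedelta(days=1))
    print(needs_refresh(stale, now))
    print(needs_refresh(mark_refreshed(stale, now), now))
```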

We have a slightly more detailed write-up of this project -- please come to IRC and ask us for a link to the document that describes this if you are interested in working on this project.

If this project does not take the full 350 hours, we can start to build the lookup portion of this project as well, which, given an artist and recording name, finds the best track in Spotify to play.

Coalesce feature in ListenBrainz

Proposed mentors: akshaaatt, monkey, lucifer, mayhem
Languages/skills: Python, React, Postgres, Troi.
Estimated Project Length: 175/350 hours
Difficulty: Medium
Expected outcomes: A finished feature ready to be merged into production code.

Our Troi recommendation toolkit is our playground for developing recommendation algorithms. The toolkit already knows how to fetch data from ListenBrainz for stats, collaborative filtered recommended tracks, similar artists and similar recordings. From MusicBrainz it can fetch needed metadata such as genres and tags. The goal of this project is to create a plugin to generate playlists based on the listening habits of two or more users (similar to the Spotify blend feature).

Unlike other Troi projects, this project involves a fair deal of frontend and backend work in the ListenBrainz server as well. A UI will be needed to allow multiple users to consent to creating a playlist with their combined interests. The UI should allow users to optionally configure different parameters for creating such a playlist. New database tables and API endpoints will be needed on the backend side to store these parameters and requests to generate combined playlists.

To connect Troi and the ListenBrainz server, a background process could read the database and invoke Troi patches accordingly. Finally, the actual Troi patch to generate the playlist needs to be written.
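One very simple way to combine several users' tastes is to take tracks from each user in turn and skip duplicates. The Python sketch below shows that idea only; it is not the Troi patch API, and the function name `blend_playlists` and its inputs (one ranked list of recording MBIDs per user) are assumptions made for this example. A real patch would likely weigh the users' listening habits in a smarter way.

```python
from itertools import zip_longest


def blend_playlists(per_user_recordings, max_tracks=50):
    """Interleave each user's ranked recordings round-robin, dropping duplicates.

    `per_user_recordings` is a list with one ranked list of recording MBIDs
    per user (hypothetical input format).
    """
    playlist = []
    seen = set()
    for round_of_picks in zip_longest(*per_user_recordings):
        for mbid in round_of_picks:
            if len(playlist) >= max_tracks:
                return playlist
            # None marks a user whose list has run out; skip repeats too.
            if mbid is None or mbid in seen:
                continue
            seen.add(mbid)
            playlist.append(mbid)
    return playlist
```

Round-robin picking gives every consenting user the same share of the playlist, which seems like a reasonable default before any user-configurable parameters are added.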

There are a lot of parts in this project so it is fine if the contributor only wants to do some of those. These details should be discussed with the mentors beforehand so that the appropriate project schedule and length can be worked out.