Development/Summer of Code/2018/AcousticBrainz

From MusicBrainz Wiki

Proposed mentor: ruaok or alastairp
Languages/skills: Python, Postgres, Flask
Forum for discussion

Getting started

(see also: GSoC - Getting started)

If you want to work on AcousticBrainz you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that you could do to get familiar with the AcousticBrainz project and code:

  • Install the server on your computer or use the Vagrant setup scripts to build a virtual machine
  • Download the AcousticBrainz submission tool and configure it to compute features for some of your audio files and submit them to the local server that you configured
  • Use your preferred programming language to access the API to download the data that you submitted to your server, or other data from the main AcousticBrainz server
  • Create an OAuth application on the MusicBrainz website and add the configuration information to your AcousticBrainz server. Use this to log in to your server with your MusicBrainz details
  • Look at the system to build a Dataset (accessible from your profile page on the AcousticBrainz server) and try to build a simple dataset
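
As a rough sketch of the API step above, the following Python builds the URL of a recording's low-level document and decodes the JSON response. The `/api/v1/<mbid>/low-level` path is the one served by acousticbrainz.org. The local address `http://localhost:8080` is an assumption, so substitute whatever your own setup exposes. The function names are illustrative only.

```python
import json
import urllib.request
import uuid

# Assumption: address of a locally running server; adjust to your setup.
LOCAL_SERVER = "http://localhost:8080"
MAIN_SERVER = "https://acousticbrainz.org"


def low_level_url(mbid, server=MAIN_SERVER):
    """Build the URL of the low-level document for a recording MBID."""
    # uuid.UUID raises ValueError if the MBID is not a well-formed UUID.
    canonical = str(uuid.UUID(mbid))
    return "{}/api/v1/{}/low-level".format(server.rstrip("/"), canonical)


def fetch_low_level(mbid, server=MAIN_SERVER, timeout=10):
    """Download and decode the low-level JSON document for one recording."""
    url = low_level_url(mbid, server)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def bpm_of(document):
    """Return the BPM estimate from a low-level document, or None if absent."""
    return document.get("rhythm", {}).get("bpm")
```

Calling `fetch_low_level(mbid, server=LOCAL_SERVER)` with an MBID that you submitted yourself is a quick way to confirm that the submission tool and your server are talking to each other.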

Join in on development

We like it when potential students show initiative and make contributions to code without asking us what to do next. We have tagged tickets that we think are suitable for new contributors with the "good-first-bug" label. Take a look at these tickets and see if any of them grab your interest. It's a good idea to talk to us before starting work on a ticket, to make sure that you understand what tasks are involved in finishing the ticket and that you're not duplicating any work which has already been done. To talk to us, join our IRC channel or post a message in the forums or on a ticket.

Ideas

Here are some ideas for projects that we would like to complete in AcousticBrainz in the near future. They are a good size for a Summer of Code project, but are in no way a complete list of possible ideas. If you have other ideas that you think might be interesting for the project, join us in IRC and talk to us about them.

Statistics and data description

We have a lot of data in AcousticBrainz, but we don't know much about what this data looks like. This task involves looking at the data that we have and finding interesting ways to show this data to visitors to the AB website. Part of the proposal for this task would be to look at and understand the data and come up with a list of recommended visualisations/descriptions. For many of the types of statistics that we want to show, it is infeasible to compute the data at every page load, so part of this task is also to come up with an appropriate caching system.
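
One simple form that such a caching system could take is a time-based cache that recomputes a statistic only once its previous value is older than a fixed lifetime. The sketch below is only an illustration: the class name `TimedCache` and its parameters are made up for this example, and a real deployment might well keep the cached values in a shared store or a database table rather than in process memory.

```python
import time


class TimedCache:
    """Keep the result of an expensive computation for a fixed number of seconds."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute      # function that produces the statistic
        self._ttl = ttl_seconds      # how long a computed value stays valid
        self._clock = clock          # injectable so that tests can fake time
        self._value = None
        self._expires_at = None      # None means nothing has been computed yet

    def get(self):
        """Return the cached value, recomputing it first if it has expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value
```

A page handler would then call `cache.get()` on every request, and only the first request after each expiry pays the cost of the computation.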

Here are a few ideas for statistics that we have thought of so far:

Automatically updating statistics page, containing data about our submissions:

  • Formats, year, reported genre, other tags (mood)?
  • BPM analysis
  • Compare audio content md5_encoded with MBIDs
  • Use the MusicBrainz MBID redirect tables to find more duplicates
  • Lists of artists + albums/recordings for each artist
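
The md5_encoded comparison in the list above could start from something like the following sketch, which groups submissions by the MD5 of their audio content and reports every MD5 that appears under more than one MBID. The input format (an iterable of `(mbid, md5)` pairs) and the function name are assumptions made for this example; how the pairs are extracted from the database is left open.

```python
from collections import defaultdict


def mbids_sharing_audio(submissions):
    """Map each audio MD5 to the MBIDs it was submitted under.

    Only MD5s that appear under more than one distinct MBID are returned,
    since those are the candidates for mislabelled or duplicate recordings.
    """
    by_md5 = defaultdict(set)
    for mbid, md5 in submissions:
        by_md5[md5].add(mbid)
    return {md5: sorted(mbids) for md5, mbids in by_md5.items() if len(mbids) > 1}
```

Repeated submissions of the same file under the same MBID are ignored here, because they do not indicate a conflict between the audio and its identifier.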

Visualize AB data - either a sub-dataset/list or all data in AB

  • distribution plots for all low-level descriptors
  • expectedness of features for each particular track (paper: Corpus Analysis Tools for Computational Hook Discovery by Jan Van Balen)

2D visual maps

  • Improving visualization of high-dimensional music similarity spaces (Flexer)
  • 2D maps with t-distributed Stochastic Neighbor Embedding (t-SNE, but there are other approaches in the paper) with shared nearest neighbor distance normalization (against hubs)

New machine learning infrastructure

Skills: Python, C++, Machine learning, scikit-learn, PostgreSQL

We build what we call high-level models in AcousticBrainz, which are multiclass SVM models trained using libsvm wrapped in a custom library called Gaia. Gaia performs its task well, but it is written in C++ and is not easy to extend with different machine learning algorithms and newer techniques such as deep learning.

We would like to replace our model training infrastructure with scikit-learn, which is widely known and contains a large number of machine learning algorithms.

  1. Understand the existing Gaia-based training process
  2. Reproduce the existing SVM model process using scikit-learn
  3. Replace the high-level model training process with scikit-learn
  4. Perform an analysis of other ML algorithms in scikit-learn to see if they give better results than those that we currently have with SVM
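
scikit-learn's `SVC` is itself built on libsvm, so a first attempt at step 2 might look roughly like the sketch below: features are standardised, and a small grid search picks the kernel and its parameters by cross-validation. The parameter grid, the synthetic data and the function name `train_svm` are placeholders chosen for illustration. They are not Gaia's settings and not AcousticBrainz data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_svm(features, labels):
    """Fit a multiclass SVM, choosing kernel, C and gamma by 5-fold cross-validation."""
    # make_pipeline names each step after its lowercased class name ("svc").
    pipeline = make_pipeline(StandardScaler(), SVC())
    grid = {
        "svc__kernel": ["rbf", "poly"],  # placeholder values, not Gaia's grid
        "svc__C": [0.1, 1, 10],
        "svc__gamma": [0.01, 0.1],
    }
    search = GridSearchCV(pipeline, grid, cv=5)
    search.fit(features, labels)
    return search


def synthetic_dataset():
    """Stand-in for a real dataset: 150 examples, 10 features, 3 classes."""
    return make_classification(
        n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0
    )
```

After `search = train_svm(*synthetic_dataset())`, `search.best_params_` holds the selected settings and `search.predict(...)` classifies new feature vectors. Step 4 would then amount to swapping `SVC` for other estimators and comparing `search.best_score_` on the same data.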

More detailed integration with MusicBrainz

Skills: Docker, React, Python, PostgreSQL, Machine learning

Recordings in AcousticBrainz are stored based on their MBID from MusicBrainz. With this information we should be able to have a much tighter integration with the MusicBrainz database, allowing us to better understand our data.

We would like to use MusicBrainz data in four places in AcousticBrainz:

  1. Use MBID redirect information to determine when two distinct MBIDs in the AcousticBrainz database refer to the same Recording. This would allow us to select any duplicate recording regardless of its MBID when using the API and the AcousticBrainz website. It should also prevent users from adding the "same" MBID to a class in a dataset using the dataset editor
  2. Use Artist information from MusicBrainz to give users real-time feedback about artist filters when creating datasets. When a dataset is created, we should only include one recording by a given artist in each class. Because the MusicBrainz schema reflects a rich representation of multiple artists per album or recording, we need to consider what to do with groups, group members, and recording artist credits.
  3. Allow users to add all recordings that match given criteria to a particular class in the dataset editor. This could be all recordings with a given tag in MusicBrainz, or all recordings by a given Artist or in a given Release.
  4. Use MusicBrainz data to show statistics of the data that we have in the AcousticBrainz database. This could include information about the most commonly submitted recordings, artists, and albums.
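
For the first item in this list, the core of the redirect handling could be as small as the following sketch. It assumes that the redirect information has already been loaded into a plain dictionary that maps an old MBID to the MBID it now points to; the function names and that dictionary are hypothetical, and how the data is obtained from MusicBrainz is exactly the question discussed below.

```python
from collections import defaultdict


def canonical_mbid(mbid, redirects):
    """Follow redirects until reaching an MBID that is not itself redirected."""
    seen = set()  # guards against an accidental redirect loop in the input
    while mbid in redirects and mbid not in seen:
        seen.add(mbid)
        mbid = redirects[mbid]
    return mbid


def group_duplicates(mbids, redirects):
    """Group MBIDs that resolve to the same recording.

    Returns a mapping from canonical MBID to the submitted MBIDs that
    resolve to it, keeping only groups with more than one member.
    """
    groups = defaultdict(list)
    for mbid in mbids:
        groups[canonical_mbid(mbid, redirects)].append(mbid)
    return {canon: members for canon, members in groups.items() if len(members) > 1}
```

The dataset editor could use `canonical_mbid` in the same way, comparing canonical MBIDs rather than raw ones before accepting a recording into a class.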

There are two ways of performing this integration:

  1. Connect directly to a MusicBrainz database. This is the most straightforward way of performing this integration, but it comes at some cost. Because the MusicBrainz database is separate from the AcousticBrainz database, any code which uses this data would have to do at least one query against the AcousticBrainz database, get some data in Python, and then do another query against the MusicBrainz database.
  2. Copy relevant information from the MusicBrainz database into a separate schema in the AcousticBrainz database. This would allow us to join and filter directly in the database in a single query, but would mean that we need separate tools to populate this data based on what exists in AcousticBrainz, and also to update this data when MusicBrainz is updated.
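
The difference between the two options can be seen in a toy example. The sketch below uses in-memory SQLite databases from the Python standard library purely as a stand-in for Postgres, and the table and column names (`lowlevel`, `recording`, `mbid`, `artist`) are invented for the illustration rather than taken from either real schema.

```python
import sqlite3

RECORDINGS = [("r1", "Artist A"), ("r2", "Artist B"), ("r3", "Artist C")]
SUBMISSIONS = [("r1",), ("r2",), ("r1",)]


def fill_acousticbrainz(conn):
    conn.execute("CREATE TABLE lowlevel (mbid TEXT)")
    conn.executemany("INSERT INTO lowlevel VALUES (?)", SUBMISSIONS)


def fill_musicbrainz(conn, table):
    conn.execute("CREATE TABLE {} (mbid TEXT PRIMARY KEY, artist TEXT)".format(table))
    conn.executemany("INSERT INTO {} VALUES (?, ?)".format(table), RECORDINGS)


def artists_two_databases(ab_conn, mb_conn):
    """Option 1: query AcousticBrainz, then ask MusicBrainz about the results."""
    mbids = [row[0] for row in ab_conn.execute("SELECT DISTINCT mbid FROM lowlevel")]
    placeholders = ", ".join("?" for _ in mbids)
    query = "SELECT mbid, artist FROM recording WHERE mbid IN ({})".format(placeholders)
    return dict(mb_conn.execute(query, mbids))


def artists_copied_schema(conn):
    """Option 2: MusicBrainz data sits next to ours, so one join is enough."""
    query = (
        "SELECT DISTINCT l.mbid, r.artist FROM lowlevel AS l "
        "JOIN musicbrainz.recording AS r ON r.mbid = l.mbid"
    )
    return dict(conn.execute(query))


def demo():
    # Option 1: two independent databases, joined in Python.
    ab, mb = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    fill_acousticbrainz(ab)
    fill_musicbrainz(mb, "recording")
    option_1 = artists_two_databases(ab, mb)

    # Option 2: one connection, with the copy attached as a separate schema.
    combined = sqlite3.connect(":memory:")
    combined.execute("ATTACH DATABASE ':memory:' AS musicbrainz")
    fill_acousticbrainz(combined)
    fill_musicbrainz(combined, "musicbrainz.recording")
    option_2 = artists_copied_schema(combined)
    return option_1, option_2
```

Both options return the same answer here. What the toy example cannot show is the cost at AcousticBrainz scale, where option 1 has to move a list of MBIDs through Python between the two queries, and option 2 has to keep the copied data populated and up to date.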

The first part of this project should be to perform a test using data at the scale of AcousticBrainz to see what method works best for us, in collaboration with the MetaBrainz team.

Storage for AcousticBrainz v2 data

When we release a new version of the AcousticBrainz extractor tool we will want to store data for this new version in addition to data from the current version of the extractor that we provide.

This project needs to consider at least the following items:

  1. Update the database schema to include a data version field, and allow the Submit and Read methods to switch between versions.
  2. Update the frontend, including the dataset editor.
  3. Update the client software to include a check in which it announces its version to the server.
  4. Update the client software so that its local "already submitted" database also allows data from the new version of the extractor.
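
To make the first item more concrete, here is a minimal sketch of a table with a version field and of submit and read functions that take the version as a parameter. It uses SQLite from the Python standard library in place of Postgres, and the table layout, column names and function names are assumptions for this example, not the existing AcousticBrainz schema.

```python
import json
import sqlite3


def create_schema(conn):
    # No uniqueness constraint: the same recording may be submitted many times.
    conn.execute(
        "CREATE TABLE lowlevel ("
        " mbid TEXT NOT NULL,"
        " data_version INTEGER NOT NULL,"
        " data TEXT NOT NULL)"
    )


def submit(conn, mbid, data_version, document):
    """Store one extractor output together with the version that produced it."""
    conn.execute(
        "INSERT INTO lowlevel (mbid, data_version, data) VALUES (?, ?, ?)",
        (mbid, data_version, json.dumps(document)),
    )


def read(conn, mbid, data_version):
    """Return the most recent document for this MBID and version, or None."""
    row = conn.execute(
        "SELECT data FROM lowlevel WHERE mbid = ? AND data_version = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (mbid, data_version),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

With this layout, data from the current extractor and from a new one can live side by side for the same MBID, and callers choose which one they want by passing the version explicitly.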