Difference between revisions of "Development/Summer of Code/2017/AcousticBrainz"

From MusicBrainz Wiki

Revision as of 21:26, 8 February 2017

Proposed mentor: ruaok or alastairp
Languages/skills: Python, Postgres, Flask
Forum for discussion: https://community.metabrainz.org/c/acousticbrainz

Getting started

If you want to work on AcousticBrainz, you should show that you are able to set up the server software and that you understand how some of the infrastructure works. Here are some things that we might ask you about:

  • Install the server on your computer or use the Vagrant setup scripts to build a virtual machine
  • Download the AcousticBrainz submission tool and configure it to compute features for some of your audio files and submit them to the local server that you configured
  • Use your preferred programming language to access the API to download the data that you submitted to your server, or other data from the main AcousticBrainz server
  • Create an OAuth application on the MusicBrainz website and add the configuration information to your AcousticBrainz server. Use this to log in to your server with your MusicBrainz details
  • Look at the system for building a Dataset (accessible from your profile page on the AcousticBrainz server) and try to build a simple dataset
  • Look at the list of tickets that we have open for AcousticBrainz and see if you understand what some of them mean. Feel free to ask questions about what they mean - some ticket descriptions don't have much detail
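As a starting point for the API item above, here is a minimal Python sketch that downloads a document for one recording. It assumes the server exposes the `/api/v1/<mbid>/low-level` and `/api/v1/<mbid>/high-level` endpoints and that your local server listens on `http://localhost:8080`; the port and the helper names `document_url` and `fetch_document` are illustrative, so check them against your own setup.

```python
import json
import urllib.request
import uuid

MAIN_SERVER = "https://acousticbrainz.org"
LOCAL_SERVER = "http://localhost:8080"  # assumption: adjust to your local setup


def document_url(server, mbid, level="low-level"):
    """Build the URL of the low-level or high-level document for a recording MBID."""
    if level not in ("low-level", "high-level"):
        raise ValueError("level must be 'low-level' or 'high-level'")
    # uuid.UUID raises ValueError if the MBID is not a well-formed UUID.
    mbid = str(uuid.UUID(mbid))
    return f"{server.rstrip('/')}/api/v1/{mbid}/{level}"


def fetch_document(server, mbid, level="low-level"):
    """Download one document and return it as a Python dictionary."""
    with urllib.request.urlopen(document_url(server, mbid, level), timeout=30) as response:
        return json.load(response)
```

Calling `fetch_document(LOCAL_SERVER, mbid)` with the MBID of a recording that you submitted should return the same JSON that the submission tool uploaded.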

This page describes ideas that we've had for the AcousticBrainz project. If you are interested in working on them for Summer of Code, or as part of the MusicBrainz project, contact us through the MusicBrainz IRC channels. If you want to explore this data in an academic context, talk to the Music Technology Group (http://mtg.upf.edu/).

Ideas

BigQuery upload and statistics

Skills: Python, design

We have some prototype code to upload the contents of AcousticBrainz to Google BigQuery. We would like to finish this project so that the data is available in BigQuery. This involves:

  1. Finish the upload tool and fix any existing issues
  2. Check that the data that we store in BigQuery is in a format suited to searching. The value that BigQuery brings us is the ability to search through all of our data quickly, so we need to make sure that this is possible. This could involve surveys or emails with others in the MusicBrainz and music research communities
  3. Once the data is uploaded, create an interactive statistics page for the AcousticBrainz website, showing images similar to those in our original blog post
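One way to approach the format question in step 2 is to flatten each nested JSON document into a single row and write the rows as newline-delimited JSON, which BigQuery can load. The sketch below is not the existing prototype; the function names are hypothetical, and it assumes that column names should be restricted to letters, digits and underscores.

```python
import json
import re


def flatten(document, prefix=""):
    """Flatten nested dictionaries into one level, joining keys with underscores."""
    row = {}
    for key, value in document.items():
        # Replace anything that is not a letter, digit or underscore.
        name = re.sub(r"[^0-9A-Za-z_]", "_", f"{prefix}{key}")
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        else:
            # Scalars and lists are kept as they are.
            row[name] = value
    return row


def to_ndjson(documents):
    """Serialise documents as newline-delimited JSON, one flattened row per line."""
    return "\n".join(json.dumps(flatten(doc), sort_keys=True) for doc in documents)


# A made-up fragment in the spirit of a low-level document.
sample = {"rhythm": {"bpm": 120.5}, "metadata": {"audio_properties": {"length": 200.0}}}
print(to_ndjson([sample]))
```

Whether a flat table or nested records is the better fit depends on the queries that researchers want to run, which is what the surveys in step 2 would help to decide.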

New machine learning infrastructure

Skills: Python, C++, Machine learning, scikit-learn, Postgres

We build what we call high-level models in AcousticBrainz. These are multiclass SVM models trained using libsvm, wrapped in a custom library called Gaia. Gaia performs its task well, but it is written in C++ and is not easy to extend with different machine learning algorithms and new techniques like big learning.

We would like to replace our model training infrastructure with scikit-learn, which is widely known and contains a large number of machine learning algorithms. The work would involve:

  1. Understand the existing Gaia-based training process
  2. Reproduce the existing SVM model process using scikit-learn
  3. Replace the high-level model training process with scikit-learn
  4. Perform an analysis of other ML algorithms in scikit-learn to see if they give better results than those that we currently have with SVM
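Steps 2 and 4 could start from something like the sketch below. It trains a multiclass SVM with scikit-learn's `SVC` (which is itself built on libsvm) and compares it with a random forest using cross-validation. The data is synthetic, and the preprocessing and parameters are assumptions that do not reproduce the Gaia configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_classifiers(features, labels, folds=5):
    """Return the mean cross-validated accuracy of each candidate model."""
    candidates = {
        # Scaling the features before an RBF-kernel SVM is a common default.
        "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    return {
        name: cross_val_score(model, features, labels, cv=folds).mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for a dataset of feature vectors with three classes.
features, labels = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
results = compare_classifiers(features, labels)
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

In a real comparison the features would come from AcousticBrainz low-level data and the labels from a user-built dataset, and the SVM parameters would be chosen to mirror the current Gaia training process.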

Docker-based model training for 3rd party ML algorithms

The process of training models in AcousticBrainz runs on our servers. This is because it's a simple process, but also so that we can create estimates for new data when it is submitted to us. Some researchers have shown interest in contributing machine learning algorithms to AcousticBrainz (in contrast to datasets). We would like to be able to run their code on our servers so that we can calculate features for new submissions, but we want to be careful about running 3rd party code on our servers.

A solution to this would be to run 3rd party algorithms inside Docker. Researchers can provide us with an image which has a known API, which we can call automatically.
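The "known API" has not been defined yet. As one possible shape, the sketch below assumes a hypothetical contract in which the container reads a low-level JSON document on standard input and writes a JSON object of results on standard output. The resource limits and function names are illustrative; the `docker run` flags used here (`--rm`, `-i`, `--network none`, `--read-only`, `--memory`, `--cpus`) are standard options.

```python
import json
import subprocess


def build_docker_command(image, memory="512m", cpus="1.0"):
    """Build a docker invocation that isolates the 3rd party container."""
    return [
        "docker", "run",
        "--rm",               # remove the container when it exits
        "-i",                 # keep stdin open so we can send the document
        "--network", "none",  # no network access for untrusted code
        "--read-only",        # read-only root filesystem
        "--memory", memory,
        "--cpus", cpus,
        image,
    ]


def parse_result(stdout):
    """Check that the container printed a JSON object and return it."""
    result = json.loads(stdout)
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object from the container")
    return result


def run_algorithm(image, lowlevel_document, timeout=60):
    """Send one document to the container and return its parsed output."""
    completed = subprocess.run(
        build_docker_command(image),
        input=json.dumps(lowlevel_document),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the container exits with an error
    )
    return parse_result(completed.stdout)
```

Using standard input and output keeps the contract independent of language, so researchers could package code written in any environment. An HTTP interface inside the container would be an alternative design.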

Storage for detailed analysis files