Development/Summer of Code/2019/AcousticBrainz: Difference between revisions

Latest revision as of 10:19, 25 March 2019

Proposed mentors: ruaok or alastairp
Languages/skills: Python, Postgres, Flask
Forum for discussion

Getting started

If you want to work on AcousticBrainz you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that you could do to get familiar with the AcousticBrainz project and code:

Install the server on your computer or use the Vagrant setup scripts to build a virtual machine
Download the AcousticBrainz submission tool and configure it to compute features for some of your audio files and submit them to the local server that you configured
Use your preferred programming language to access the API to download the data that you submitted to your server, or other data from the main AcousticBrainz server
Create an OAuth application on the MusicBrainz website and add the configuration information to your AcousticBrainz server. Use this to log in to your server with your MusicBrainz details
Look at the system to build a Dataset (accessible from your profile page on the AcousticBrainz server) and try to build a simple dataset
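
As a minimal sketch of the API step above, the following Python snippet builds the URL for a recording's low-level document and downloads it. It assumes the `/api/v1/<mbid>/low-level` endpoint; the local server address, the helper names and any MBID you pass in are illustrative assumptions, so adjust them to your own setup.

```python
import json
import urllib.request
import uuid

# Assumed base URLs: the public server, and a hypothetical local development server.
MAIN_SERVER = "https://acousticbrainz.org"
LOCAL_SERVER = "http://localhost:8080"


def lowlevel_url(mbid, base_url=MAIN_SERVER):
    """Build the URL of the low-level document for a recording MBID."""
    # uuid.UUID raises ValueError if the MBID is not a well-formed UUID.
    canonical = str(uuid.UUID(mbid))
    return f"{base_url.rstrip('/')}/api/v1/{canonical}/low-level"


def fetch_lowlevel(mbid, base_url=MAIN_SERVER, timeout=10):
    """Download and decode the low-level JSON document (needs network access)."""
    with urllib.request.urlopen(lowlevel_url(mbid, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_lowlevel(mbid, LOCAL_SERVER)` with an MBID that you submitted should return the same document that the submission tool sent to your server.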

Join in on development

We like it when potential students show initiative and make contributions to code without asking us what to do next. We have tagged tickets that we think are suitable for new contributors with the "good-first-bug" label. Take a look at these tickets and see if any of them grab your interest. It's a good idea to talk to us before starting work on a ticket, to make sure that you understand what tasks are involved in finishing the ticket, and to make sure that you're not duplicating any work which has already been done. To talk to us, join our IRC channel or post a message in the forums or on a ticket.

Ideas

Here are some ideas for projects that we would like to complete in AcousticBrainz in the near future. They are a good size for a Summer of Code project, but are in no way a complete list of possible ideas. If you have other ideas that you think might be interesting for the project, join us on IRC and talk to us about them.

Statistics and data description

We have a lot of data in AcousticBrainz, but we don't know much about what this data looks like. This task involves looking at the data that we have and finding interesting ways to show this data to visitors to the AB website. Part of the proposal for this task would be to look at and understand the data and come up with a list of recommended visualisations/descriptions. For many of the types of statistics that we want to show, it is infeasible to compute the data at every page load, so part of this task is also to come up with an appropriate caching system.
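
One possible shape for such a caching system is a small time-based cache that recomputes a statistic only when the stored value is older than a chosen age. The sketch below keeps values in process memory purely for illustration; a real deployment would more likely use a shared store, and the class and parameter names are hypothetical.

```python
import time


class StatsCache:
    """Keep expensive statistics in memory and recompute them only when stale."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._entries = {}  # name -> (computed_at, value)

    def get(self, name, compute):
        """Return the cached value for `name`, calling `compute()` if it is missing or stale."""
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self.max_age_seconds:
            return entry[1]
        value = compute()
        self._entries[name] = (now, value)
        return value
```

A page handler could then call something like `cache.get("submissions_per_format", run_slow_query)`, so that the slow query runs at most once per `max_age_seconds`.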

Here are a few ideas for statistics that we have thought of so far:

Automatically updating statistics page, containing data about our submissions:

Formats, year, reported genre, other tags (mood)?
BPM analysis
Compare the audio content hash (md5_encoded) with MBIDs
Use the MusicBrainz MBID redirect tables to find more duplicates
Lists of artists + albums/recordings for each artist
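
The md5_encoded comparison above could start from a simple grouping step, sketched here. It assumes that each submission has already been reduced to an (MBID, md5_encoded) pair, for example read from `metadata.audio_properties.md5_encoded` in the low-level document; the function name is hypothetical.

```python
from collections import defaultdict


def mbids_sharing_audio(submissions):
    """Map each audio hash to the set of MBIDs it was submitted under.

    `submissions` is an iterable of (mbid, md5_encoded) pairs. Only hashes
    that appear under more than one MBID are returned.
    """
    by_hash = defaultdict(set)
    for mbid, md5_encoded in submissions:
        by_hash[md5_encoded].add(mbid)
    return {md5: mbids for md5, mbids in by_hash.items() if len(mbids) > 1}
```

Hashes that map to several MBIDs are candidates for mislabelled files or for recordings that have since been merged in MusicBrainz.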

Visualize AB data - either a sub-dataset/list or all data in AB

distribution plots for all low-level descriptors
expectedness of features for each particular track (paper: Corpus Analysis Tools for Computational Hook Discovery by Jan Van Balen)

2D visual maps

Improving visualization of high-dimensional music similarity spaces (Flexer)
2D maps with t-distributed Stochastic Neighbor Embedding (t-SNE, but there are other approaches in the paper) with shared nearest neighbor distance normalization (against hubs)
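
To make the shared nearest neighbor idea concrete, here is a small pure-Python sketch. It is one simple variant chosen for illustration, not necessarily the exact normalization used in the paper: two items are close when their k-nearest-neighbour sets overlap. The function names and the use of Euclidean distance are assumptions.

```python
import math


def k_nearest(points, k):
    """Return, for each point, the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(others[:k]))
    return neighbours


def snn_distance(neighbours, i, j, k):
    """Shared-nearest-neighbour distance: 1 minus the fraction of shared neighbours."""
    return 1.0 - len(neighbours[i] & neighbours[j]) / k
```

The resulting pairwise distances could then be passed to an embedding method that accepts precomputed distances, such as scikit-learn's `TSNE` with `metric="precomputed"`.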

New machine learning infrastructure

Skills: Python, C++, Machine learning, scikit-learn, Postgres

We build what we call high-level models in AcousticBrainz, which are multiclass SVM models trained using libsvm wrapped in a custom library called Gaia. Gaia performs its task well, but it is written in C++ and is not easy to extend with different machine learning algorithms or with newer techniques such as deep learning.

We would like to replace our model training infrastructure with scikit-learn, which is widely known and contains a large number of machine learning algorithms. The project involves the following steps:

Understand the existing gaia-based training process
Reproduce the existing SVM model process using scikit-learn
Replace the high-level model training process with scikit-learn
Perform an analysis of other ML algorithms in scikit-learn to see if they give better results than those that we currently have with SVM
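
As a rough starting point for the reproduction step, the sketch below trains an SVM with scikit-learn, whose `SVC` class is itself built on libsvm. It assumes that the existing process can be approximated by a cross-validated grid search over `C` and `gamma`; the synthetic data and the parameter grid are placeholders, not the values used by Gaia.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for low-level feature vectors and class labels.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Scale features, then fit an RBF-kernel SVM.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical parameter grid; the real grid should mirror the Gaia configuration.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_accuracy = search.best_score_
```

Because the classifier is just one step in the pipeline, comparing other algorithms later mostly means swapping `SVC` for another estimator and adjusting the grid.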

Storage for AcousticBrainz v2 data

When we release a new version of the AcousticBrainz extractor tool we will want to store data for this new version in addition to data from the current version of the extractor that we provide.

This project needs to consider at least the following items:

Update the database schema to include a data version field, and allow the Submit and Read methods to select which version of the data they operate on.
Update the frontend, including the dataset editor
Update the client software so that it announces to the server which extractor version it is running
Update the client software to enhance the "already submitted" local database to allow data from the new version of the extractor
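
The schema change in the first item could look roughly like the sketch below. It uses an in-memory SQLite database only so that the example is self-contained; the production server uses Postgres, and the table, column and function names here are hypothetical.

```python
import json
import sqlite3

# Hypothetical table: every submission row records the extractor version that produced it.
SCHEMA = """
CREATE TABLE lowlevel (
    id INTEGER PRIMARY KEY,
    mbid TEXT NOT NULL,
    extractor_version TEXT NOT NULL,
    data TEXT NOT NULL
)
"""


def submit(conn, mbid, extractor_version, document):
    """Store one submission together with the extractor version that produced it."""
    conn.execute(
        "INSERT INTO lowlevel (mbid, extractor_version, data) VALUES (?, ?, ?)",
        (mbid, extractor_version, json.dumps(document)),
    )


def read(conn, mbid, extractor_version):
    """Return the most recent submission for an MBID and extractor version, or None."""
    row = conn.execute(
        "SELECT data FROM lowlevel WHERE mbid = ? AND extractor_version = ? "
        "ORDER BY id DESC LIMIT 1",
        (mbid, extractor_version),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the version as an explicit column means that data from both extractor versions can live side by side, and callers choose which one they want at read time.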

Recording similarity

Skills: Python, Postgres, data crunching

AcousticBrainz contains acoustic information for a large number of recordings (music tracks). One of the important tools that we haven’t created yet is a tool to compare recordings with each other in order to determine how similar they are. The similarity of recordings is an important dataset that we wish to use for music exploration and recommendation (in collaboration with ListenBrainz), and also in future recommendation engines in the ListenBrainz project. Some researchers have done previous work with data related to the content that we have in AcousticBrainz:

Music Similarity and Recommendation PhD Thesis: http://mtg.upf.edu/node/2817 [1]
Content-based recommendation: https://repositori.upf.edu/handle/10230/33926?locale-attribute=en [2]
Master's thesis on recommendation of data in AcousticBrainz using research from [2]: https://zenodo.org/record/1479769#.XIJ41mRKhTY [3]
- Code https://github.com/philtgun/acousticbrainz-server/tree/master/similarity
An extension of [1] and [2] using both low-level and high-level data: https://repositori.upf.edu/handle/10230/34780?locale-attribute=en [4]

This previous work shows us that content-based recommendation systems using the data that we have available in AcousticBrainz can work, but we don’t yet have a system that works using all of the content in AB. [1], [2] and [4] show that the content that we have works, but not at a large scale. [3] integrates this previous work into AcousticBrainz (based on work in [2]), but we found that the method that we used didn’t scale as well as we had hoped, and was slow to retrieve results.

In order to implement similarity into AcousticBrainz we want to perform the following steps. Note that these steps are everything that we would like to see in such a project, but it is too much work to complete in the timespan of Summer of Code. A successful project proposal will include only parts that a potential student thinks that they can complete in the timespan.

Use the Annoy nearest neighbour software (https://github.com/spotify/annoy) to replicate the results of [3]. We believe that this software will be much faster than our previous attempts using PostgreSQL. It should be possible to copy the code for processing input files and calculating distances directly from the existing code. This should be a proof of concept that Annoy can calculate our distance measures.
Using subsets of the AcousticBrainz database, perform an experiment to see how Annoy works with an increasing number of recordings. What we saw with our PostgreSQL solution is that as we added more recordings, the similarity lookup took longer and longer to compute. We expect that Annoy’s lookup speed will remain relatively constant as more items are added.
Build an index with Annoy, with a system to update it each time that new content gets submitted to AcousticBrainz
Make an API so that people can query the similarity of a recording. This will use the Annoy index
For use in other projects, create an offline matrix of similarity for all recordings in AcousticBrainz. For each recording, have an ordered list of the most similar recordings (maybe 1000). We need to determine a data structure or storage system that can give a similarity result in constant time regardless of how many recordings we have in the system. This could be something like a NumPy array, or a system using PostgreSQL tables or Redis. It might require some investigation to find the fastest solution.
Determine a way to update this similarity matrix when new recordings come into the AcousticBrainz database, and decide how often it should be updated
Provide this pre-computed similarity matrix as a data dump that can be downloaded and used by anyone
Update the AcousticBrainz website to include navigation by similarity. This can be an extension of the work started in [2]
Use the similarity to remove duplicate submissions. We have multiple submissions for the same MBIDs, and we also have duplicate submissions for merged recording IDs. We expect that duplicate submissions will have a very close similarity. We can use this similarity lookup to determine that the submissions are in fact the same.
Build on the work of [3] using work from [1] and [4] to perform different types of similarity using the high-level data in AcousticBrainz

