https://wiki.musicbrainz.org/index.php?title=Development/Summer_of_Code/2021/AcousticBrainz&feed=atom&action=historyDevelopment/Summer of Code/2021/AcousticBrainz - Revision history2024-03-28T11:47:53ZRevision history for this page on the wikiMediaWiki 1.39.4https://wiki.musicbrainz.org/index.php?title=Development/Summer_of_Code/2021/AcousticBrainz&diff=75196&oldid=prevAlastairp: 2021 ideas2021-02-17T10:16:55Z<p>2021 ideas</p>
<table style="background-color: #fff; color: #202122;" data-mw="interface">
<col class="diff-marker" />
<col class="diff-content" />
<col class="diff-marker" />
<col class="diff-content" />
<tr class="diff-title" lang="en">
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Older revision</td>
<td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 10:16, 17 February 2021</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 43:</td>
<td colspan="2" class="diff-lineno">Line 43:</td>
</tr>
<tr>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* 2d maps with t-Stochastic Neighbor Embedding (TSNE, but there are other approaches in the paper) with shared nearest neighbor distance normalization (against hubs)</div></td>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>* 2d maps with t-Stochastic Neighbor Embedding (TSNE, but there are other approaches in the paper) with shared nearest neighbor distance normalization (against hubs)</div></td>
</tr>
<tr>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>===Storage for AcousticBrainz v2 data===</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>When we release a new version of the AcousticBrainz extractor tool we will want to store data for this new version in addition to data from the current version of the extractor that we provide.</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td>
<td class="diff-marker"></td>
<td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>===Machine learning feature temporary storage and evaluation===</div></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>This project needs to consider at least the following items:</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>We have a machine learning process that takes new data submissions and combines them with a set of models, to produce new features.</div></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div># Update the database schema to include a data version field, and allow the Submit and Read methods to switch between them.</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>We also have a system where we can produce new datasets and build models for new tasks.</div></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div># Update the frontend including the dataset editor</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>We currently don't have a way of promoting a new model into production, and want to add this functionality. This has a few steps:</div></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div># Update the client software to include a check where they announce to the server what version they are</div></td>
<td colspan="2" class="diff-empty diff-side-added"></td>
</tr>
<tr>
<td class="diff-marker" data-marker="−"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">#</del> <del style="font-weight: bold; text-decoration: none;">Update</del> the <del style="font-weight: bold; text-decoration: none;">client</del> <del style="font-weight: bold; text-decoration: none;">software</del> <del style="font-weight: bold; text-decoration: none;">to</del> <del style="font-weight: bold; text-decoration: none;">enhance</del> <del style="font-weight: bold; text-decoration: none;">the</del> <del style="font-weight: bold; text-decoration: none;">"already</del> <del style="font-weight: bold; text-decoration: none;">submitted"</del> <del style="font-weight: bold; text-decoration: none;">local</del> database to <del style="font-weight: bold; text-decoration: none;">allow</del> <del style="font-weight: bold; text-decoration: none;">data from</del> the <del style="font-weight: bold; text-decoration: none;">new</del> <del style="font-weight: bold; text-decoration: none;">version of the</del> <del style="font-weight: bold; text-decoration: none;">extractor</del></div></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div><ins style="font-weight: bold; text-decoration: none;">*</ins> <ins style="font-weight: bold; text-decoration: none;">Use</ins> the <ins style="font-weight: bold; text-decoration: none;">new</ins> <ins style="font-weight: bold; text-decoration: none;">model</ins> <ins style="font-weight: bold; text-decoration: none;">on</ins> <ins style="font-weight: bold; text-decoration: none;">a</ins> <ins style="font-weight: bold; text-decoration: none;">significant</ins> <ins style="font-weight: bold; text-decoration: none;">subset</ins> <ins style="font-weight: bold; text-decoration: none;">of</ins> <ins style="font-weight: bold; text-decoration: none;">the AB</ins> database<ins style="font-weight: bold; text-decoration: none;"> in order</ins> to <ins style="font-weight: bold; text-decoration: none;">verify</ins> <ins style="font-weight: bold; text-decoration: none;">that</ins> the <ins style="font-weight: bold; text-decoration: none;">results</ins> <ins style="font-weight: bold; text-decoration: none;">look</ins> <ins style="font-weight: bold; text-decoration: none;">good</ins></div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* Optionally: provide a way to evaluate this computation, to ensure that the model works at a large scale</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>* Once the model has been approved, integrate it into the production system, and compute features for all existing submissions, before computing data again as new items are submitted to AcousticBrainz</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>===TensorFlow-based transfer learning===</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In 2020 we had a Summer of Code project to integrate scikit-learn into AcousticBrainz. This gave us an up-to-date tool for performing machine learning, but there has also been a lot of work in music analysis using deep learning techniques.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>One very useful technique for analysing content like the data we have in AcousticBrainz is transfer learning, where a model is first built using a large general dataset and is then refined using a smaller, more specific dataset.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>An example of this type of process can be found here: https://github.com/jordipons/sklearn-audio-transfer-learning</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>We would like to extend the existing machine learning system to support these transfer learning-based processes.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>===Identifying bad data submissions===</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>AcousticBrainz accepts submissions from anyone. Submissions are identified by their recording MBID, but sometimes this value is incorrect. As a result, for some MBIDs we have hundreds of duplicate submissions, and we know that some of them have been tagged incorrectly.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>We want to perform clustering on the submissions to identify which submissions have been tagged incorrectly.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>As a result of our 2019 Summer of Code project, we have a nearest neighbor system which allows us to group data in an n-dimensional space. We want to use this system to cluster submissions which have the same MBID and see if some submissions are always identified as outliers. These outliers can be marked so that they're not given to users or used in other machine learning tasks.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>===Further analysis on the quality of descriptors in AcousticBrainz===</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>At one of the top conferences on music data analysis, a paper was presented analysing the quality of the data in the AcousticBrainz dataset: https://program.ismir2020.net/static/final_papers/137.pdf</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty diff-side-deleted"></td>
<td class="diff-marker" data-marker="+"></td>
<td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>This is interesting preliminary work which would be great to continue. It could help us identify submissions, or categories of submissions, that don't provide good data in AcousticBrainz and should be removed.</div></td>
</tr>
</table>Alastairphttps://wiki.musicbrainz.org/index.php?title=Development/Summer_of_Code/2021/AcousticBrainz&diff=75167&oldid=prevRobertKaye: Created page with "Proposed mentors: ''ruaok'' or ''alastairp''<br> Languages/skills: Python, Postgres, Flask<br> [https://community.metabrainz.org/c/acousticbrainz Forum for discussion] == Get..."2021-02-04T10:13:35Z<p>Created page with "Proposed mentors: ''ruaok'' or ''alastairp''<br> Languages/skills: Python, Postgres, Flask<br> [https://community.metabrainz.org/c/acousticbrainz Forum for discussion] == Get..."</p>
<p><b>New page</b></p><div>Proposed mentors: ''ruaok'' or ''alastairp''<br><br />
Languages/skills: Python, Postgres, Flask<br><br />
[https://community.metabrainz.org/c/acousticbrainz Forum for discussion]<br />
<br />
== Getting started ==<br />
(see also: [[Development/Summer_of_Code/Getting_started|GSoC - Getting started]])<br />
<br />
If you want to work on AcousticBrainz you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that you could do to get familiar with the AcousticBrainz project and code:<br />
* Install the server on your computer or use the Vagrant setup scripts to build a virtual machine<br />
* Download the AcousticBrainz submission tool and configure it to compute features for some of your audio files and submit them to the local server that you configured<br />
* Use your preferred programming language to access the API to download the data that you submitted to your server, or other data from the main AcousticBrainz server<br />
* Create an OAuth application on the MusicBrainz website and add the configuration information to your AcousticBrainz server. Use this to log in to your server with your MusicBrainz details<br />
* Look at the system to build a Dataset (accessible from your profile page on the AcousticBrainz server) and try to build a simple dataset<br />
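<br />
As a rough illustration of the API step above, here is a minimal Python sketch that builds the URL of a low-level document and downloads it. The base URL <code>http://localhost:8080</code> is an assumption about a local installation, and the helper names are hypothetical; check the API documentation of your server for the exact endpoints it provides.<br />

```python
import json
import urllib.request

# Assumption: address of a locally installed server; adjust to your setup.
LOCAL_SERVER = "http://localhost:8080"


def low_level_url(mbid, base_url=LOCAL_SERVER, n=0):
    """Build the URL for the n-th low-level submission of a recording MBID."""
    return f"{base_url}/api/v1/{mbid}/low-level?n={n}"


def fetch_low_level(mbid, base_url=LOCAL_SERVER, n=0):
    """Download and decode one low-level document (needs a running server)."""
    with urllib.request.urlopen(low_level_url(mbid, base_url, n)) as response:
        return json.load(response)
```

Calling <code>fetch_low_level</code> with an MBID that you submitted should return the same JSON document that the submission tool sent to your server.<br />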
<br />
== Join in on development==<br />
<br />
We like it when potential students show initiative and make contributions to code without asking us what to do next. We have tagged [https://tickets.metabrainz.org/issues/?jql=labels%20%3D%20good-first-bug%20AND%20project%20%3D%20AB%20AND%20status%20in%20(Open%2C%20'In%20Progress') tickets that we think are suitable for new contributors with the "good-first-bug" label]. Take a look at these tickets and see if any of them grab your interest.<br />
It's a good idea to talk to us before starting work on a ticket, to make sure that you understand what tasks are involved to finish the ticket, and to make sure that you're not duplicating any work which has already been done.<br />
To talk to us, join our [[IRC]] channel or post a message in the forums or on a ticket.<br />
<br />
==Ideas==<br />
<br />
Here are some ideas for projects that we would like to complete in AcousticBrainz in the near future. They are a good size for a Summer of Code project, but are in no way a complete list of possible ideas. If you have other ideas that you think might be interesting for the project, join us on IRC and talk to us about them.<br />
<br />
===Statistics and data description===<br />
We have a lot of data in AcousticBrainz, but we don't know much about what this data looks like. This task involves looking at the data that we have and finding interesting ways to show this data to visitors to the AB website. Part of the proposal for this task would be to look at and understand the data and come up with a list of recommended visualisations/descriptions. For many of the types of statistics that we want to show, it is infeasible to compute the data at every page load, therefore part of this task is to also come up with an appropriate caching system.<br />
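<br />
To make the caching requirement concrete, here is a minimal Python sketch of a time-based cache: a statistic is recomputed only when its stored value is older than a fixed lifetime. The class name, the one-hour default and the in-memory dictionary are all assumptions for illustration; a real proposal would need to choose where cached values live and when they are refreshed.<br />

```python
import time


class StatsCache:
    """Keep the last computed value of each statistic for `ttl` seconds."""

    def __init__(self, ttl=3600.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # statistic name -> (expiry time, value)

    def get(self, name, compute):
        """Return the cached value, calling compute() only if it has expired."""
        now = self.clock()
        entry = self._entries.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[name] = (now + self.ttl, value)
        return value
```

A page handler could then call something like <code>cache.get("bpm_histogram", compute_bpm_histogram)</code>, where <code>compute_bpm_histogram</code> is a hypothetical function that runs the expensive query.<br />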
<br />
Here are a few ideas for statistics that we have thought of so far:<br />
<br />
Automatically updating statistics page, containing data about our submissions:<br />
* Formats, year, reported genre, other tags (mood)?<br />
* BPM analysis<br />
* Compare the md5_encoded value of the audio content with MBIDs<br />
* Use the MusicBrainz MBID redirect tables to find more duplicates<br />
* Lists of artists + albums/recordings for each artist<br />
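<br />
As one possible reading of the md5 comparison above, the following Python sketch groups submissions by their audio md5 and reports every md5 that appears under more than one MBID. The function name and the input shape, a list of (MBID, md5) pairs, are assumptions made for illustration; in practice these values would be read from the submitted documents.<br />

```python
from collections import defaultdict


def mbids_sharing_audio(submissions):
    """Map each audio md5 to its MBIDs, keeping only md5s seen under 2+ MBIDs.

    `submissions` is an iterable of (mbid, md5) pairs (a hypothetical input shape).
    """
    groups = defaultdict(set)
    for mbid, md5 in submissions:
        groups[md5].add(mbid)
    return {md5: sorted(mbids) for md5, mbids in groups.items() if len(mbids) > 1}
```

Each reported md5 points to identical audio that was submitted under different MBIDs, which suggests either a duplicate recording in MusicBrainz or a mistagged file.<br />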
<br />
Visualize AB data - either a sub-dataset/list or all data in AB<br />
* distribution plots for all low-level descriptors<br />
* expectedness of features for each particular track (paper: Corpus Analysis Tools for Computational Hook Discovery by Jan Van Balen)<br />
<br />
2D visual maps<br />
* Improving visualization of high-dimensional music similarity spaces (Flexter)<br />
* 2d maps with t-Stochastic Neighbor Embedding (TSNE, but there are other approaches in the paper) with shared nearest neighbor distance normalization (against hubs)<br />
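<br />
To illustrate the shared nearest neighbor idea mentioned above, here is a small pure-Python sketch. It uses one common definition, which is an assumption here and may differ from the one in the paper: the distance between two items is 1 minus the fraction of their k nearest neighbors that they have in common. The function names are hypothetical, and the brute-force search is only suitable for small examples.<br />

```python
import math


def nearest_neighbours(points, k):
    """For each point, the set of indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(set(others[:k]))
    return result


def snn_distance_matrix(points, k):
    """Distance = 1 - (number of shared k nearest neighbours) / k."""
    neighbours = nearest_neighbours(points, k)
    n = len(points)
    return [
        [0.0 if i == j else 1.0 - len(neighbours[i] & neighbours[j]) / k
         for j in range(n)]
        for i in range(n)
    ]
```

The resulting matrix could then be given to a t-SNE implementation that accepts precomputed distances, in place of the raw feature distances.<br />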
<br />
===Storage for AcousticBrainz v2 data===<br />
When we release a new version of the AcousticBrainz extractor tool we will want to store data for this new version in addition to data from the current version of the extractor that we provide.<br />
<br />
This project needs to consider at least the following items:<br />
# Update the database schema to include a data version field, and allow the Submit and Read methods to switch between them.<br />
# Update the frontend including the dataset editor<br />
# Update the client software to include a check where they announce to the server what version they are<br />
# Update the client software to enhance the "already submitted" local database to allow data from the new version of the extractor</div>RobertKaye