Development/Summer of Code/2017

Are you interested in working with MetaBrainz in Google Summer of Code 2017? You're in the right place!

Where to start

New to MetaBrainz?
List of MetaBrainz projects
New to MetaBrainz development and/or GSoC?
Getting started with GSoC
New to the idea of linked open data?
Linked open data article on wikipedia
Ready to apply?
GSoC applications @ community.metabrainz.org
Be aware of the content of our Development/Summer of Code/Application Template

Mentors

Mentor list
Name | IRC nick | Project
Robert Kaye | ruaok | AcousticBrainz, ListenBrainz, MusicBrainz
Michael Wiencek | bitmap | MusicBrainz
Alastair Porter | alastairp | AcousticBrainz, ListenBrainz
Ben Ockmore | LordSputnik | BookBrainz
Sean Burke | Leftmost | BookBrainz
Roman Tsukanov | Gentlecat | CritiqueBrainz, AcousticBrainz, ListenBrainz

Some potential mentors are listed by each project; this is far from a normative list, but it might give you somebody to ask about the project.

Note: Contacting the mentors privately (e.g., via e-mail or private IRC messages) will get you off to a very bad start with us, and any application you send afterwards is almost certain to be rejected.

About proposals

Before you dive in and send a proposal to us through Google, it's a good idea to take some time and learn about the MusicBrainz community. At MusicBrainz we pride ourselves on having a strong community - most of us know each other in some way, and some of us know each other face to face from development summits.

A good way to get a feel for this is to talk about your ideas and proposals on IRC. However, starting off by sending private messages to potential mentors is not a good way to introduce yourself to the community. Please don't do that!

If you're not sure where to start, Development/Summer of Code/Getting started might help.


Projects

AcousticBrainz

AcousticBrainz is our new project that crowdsources acoustic information for all music in the world and makes it available to the public. We already have low-level information about more than three million tracks. What we need is a good way for users and developers to interact with all this data and help improve the algorithms that are used to analyze it.

It would suit someone with experience or an interest in machine learning algorithms, though the majority of the project will probably involve creating infrastructure around our existing algorithms.

Languages/skills: Python, PostgreSQL, Flask
Ideas page | Main page | Forums | Blog

BookBrainz

BookBrainz is a database of book metadata.

This year we're interested in projects that help us get more data. The three suggested ideas to build proposals around are data importing, a web API and gamification of editing. Please see our sub-project ideas page for information on getting started and more details about the ideas themselves.

Top 3 Desired Skills: Node.js, Python, SQL
Ideas page | Main page | Forums

CritiqueBrainz

CritiqueBrainz fills the gap between music critics and raw data by providing a platform created for the sole purpose of Creative Commons licensed reviews.
Languages/skills: Python, Flask, SQL, PostgreSQL
Ideas page | Main page | Forums

ListenBrainz

ListenBrainz is an open source music website that allows users to import their listen history. One of the goals is for this data to be used for building open music recommendation systems.
Languages/skills: Python
Ideas page | Main page

MusicBrainz

MusicBrainz is a community-maintained open source music encyclopedia that collects music metadata and makes it available to the public.
Languages/skills: JavaScript (React), Perl, Python, PostgreSQL, SQL
Ideas page | Main page | Forums



This page captures our ideas for Google Summer of Code projects for 2017:

MusicBrainz

Bitmap: Please add ideas!

AcousticBrainz

BigQuery upload and statistics

Skills: Python, design

We have some prototype code to upload the contents of AcousticBrainz to Google BigQuery. We would like to finish this project so that the data is available in BigQuery. This involves:

  1. Finishing the upload tool and fixing any existing issues
  2. Checking that the data stored in BigQuery is in a format suitable for searching. The value that BigQuery brings us is the ability to search through all of our data quickly, so we need to make sure that this is possible. This could involve surveys or emails with others in the MusicBrainz and music research communities
  3. Once the data is uploaded, creating an interactive statistics page for the AcousticBrainz website, showing images similar to those in our original blog post

New machine learning infrastructure

Skills: Python, C++, machine learning, scikit-learn, PostgreSQL

We build what we call high-level models in AcousticBrainz, which are multiclass SVM models trained using libsvm wrapped in a custom library called Gaia. Gaia performs its task well, but it is written in C++ and is not easy to extend with different machine learning algorithms and new techniques like big learning.

We would like to replace our model training infrastructure with scikit-learn, which is widely known and contains a large number of machine learning algorithms. The steps would be:

  1. Understand the existing Gaia-based training process
  2. Reproduce the existing SVM model process using scikit-learn
  3. Replace the high-level model training process with scikit-learn
  4. Perform an analysis of other ML algorithms in scikit-learn to see if they give better results than those that we currently have with SVM

Docker-based model training for 3rd party ML algorithms

The process of training models in AcousticBrainz runs on our servers. This is because it's a simple process, but also so that we can create estimates of new data when it is submitted to us. Some researchers have shown interest in contributing machine learning algorithms to AcousticBrainz (in contrast to datasets). We would like to be able to run their code on our servers so that we can calculate features for new submissions, but we want to be careful about running 3rd party code on our servers.

A solution to this would be to run 3rd party algorithms inside docker. Researchers can provide us with an image which has a known API, which we can call automatically.
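As a rough illustration of what a "known API" could mean in practice, the sketch below validates the JSON that a model container might return before the result is stored. The field names (`model`, `version`, `probabilities`) and the rule that probabilities sum to 1 are assumptions made for this example only; no such contract has been defined by the project.

```python
import json


class InvalidModelResponse(ValueError):
    """Raised when a third-party container returns an unexpected payload."""


def parse_model_response(raw):
    """Validate the JSON a (hypothetical) model container sends back.

    Assumed contract: an object with a string "model", a string "version",
    and a "probabilities" object mapping class names to numbers in [0, 1]
    that sum to 1.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidModelResponse("response is not valid JSON") from exc

    if not isinstance(payload, dict):
        raise InvalidModelResponse("response must be a JSON object")
    for key in ("model", "version"):
        if not isinstance(payload.get(key), str):
            raise InvalidModelResponse(f"missing or non-string field: {key}")

    probs = payload.get("probabilities")
    if not isinstance(probs, dict) or not probs:
        raise InvalidModelResponse("probabilities must be a non-empty object")
    for name, value in probs.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            raise InvalidModelResponse(f"bad probability for class {name!r}")
    if abs(sum(probs.values()) - 1.0) > 1e-6:
        raise InvalidModelResponse("probabilities must sum to 1")
    return payload
```

Checking the payload on our side means a misbehaving image can be rejected without inspecting the code inside it.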

Storage for detailed analysis files

CritiqueBrainz

Direct access to MusicBrainz database

Proposed mentor: Gentlecat
Languages/skills: Python, Flask, SQL (PostgreSQL, SQLAlchemy), Docker, Consul

So far, the biggest cause of slowdown in CritiqueBrainz is requests to the MusicBrainz web service. It's not that the MusicBrainz WS is slow, it's just that some pages on CritiqueBrainz require a lot of MusicBrainz data, which might take a very long time to retrieve. This can be caused by the complexity of a request, or by the number of requests (when showing multiple items, since there's no way to do batch requests).

New infrastructure allows us to easily read data directly from the MusicBrainz database. Doing this in CritiqueBrainz will probably give a significant speedup.

See https://tickets.metabrainz.org/browse/CB-231.
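To illustrate why direct database access helps with pages that show many items, here is a minimal sketch that fetches several entities in a single query instead of one web service request per item. An in-memory SQLite database stands in for the MusicBrainz PostgreSQL database, and the table and column names are simplified assumptions, not the schema CritiqueBrainz would actually query.

```python
import sqlite3


def fetch_release_groups(conn, mbids):
    """Return {mbid: name} for all requested MBIDs using a single query."""
    if not mbids:
        return {}
    placeholders = ", ".join("?" for _ in mbids)
    rows = conn.execute(
        f"SELECT gid, name FROM release_group WHERE gid IN ({placeholders})",
        list(mbids),
    )
    return dict(rows.fetchall())


# In-memory SQLite is only a stand-in for the real PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE release_group (gid TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO release_group VALUES (?, ?)",
    [("mbid-1", "First Album"), ("mbid-2", "Second Album"), ("mbid-3", "Third Album")],
)

# One round trip for the whole page; unknown MBIDs are simply absent.
names = fetch_release_groups(conn, ["mbid-1", "mbid-3", "mbid-missing"])
```

With the web service, the same page would need one request per MBID, which is where most of the waiting time goes.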

Book reviews

Proposed mentor: Gentlecat
Languages/skills: Python, Flask, SQL (PostgreSQL, SQLAlchemy), JavaScript

Currently people can review release groups, events, and places. We'd like to expand that list by allowing people to publish reviews for books. Metadata about books can be retrieved from the BookBrainz project (by accessing the web API or its database directly). This project would require (1) extension of CritiqueBrainz back-end to support book reviews, and (2) changes to the web interface.

See https://tickets.metabrainz.org/browse/CB-240.

ListenBrainz

Create charts/graphs for user behaviour

Proposed mentors: mayhem, alastairp
Languages/skills: Python, Flask, BigQuery, InfluxDB, data science, graphing, visualization, data architecture

ListenBrainz is preparing to stream its listen data to BigQuery, where anyone can have access to it in real time. From the data stored in BigQuery we wish to have a student build a general charting/graphing system that allows future contributors to explore the data with BigQuery. Any user should be able to craft a query that can be turned into a graph/visualization on the ListenBrainz site with minimal effort. If a user crafts an interesting query, they should be able to open a pull request and supply the details of the query so that the LB team can add this graph to the site.

This project requires building the behind-the-scenes BigQuery access, caching, periodic updates and synchronization between the ListenBrainz server and the BigQuery data store.
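The caching part could start from something as small as the sketch below: results are kept per query text and refreshed once they are older than a fixed time-to-live. The class name, the `run_query` callable, and the injectable clock are all hypothetical; a real implementation would call BigQuery and would likely store results outside the web process.

```python
import time


class QueryCache:
    """Cache query results for a fixed number of seconds."""

    def __init__(self, ttl_seconds, run_query, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._run_query = run_query  # hypothetical callable that hits BigQuery
        self._clock = clock          # injectable so the cache can be tested
        self._entries = {}           # query text -> (expiry time, result)

    def get(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh, skip the expensive query
        result = self._run_query(query)
        self._entries[query] = (now + self._ttl, result)
        return result
```

A periodic job could call `get` for every registered graph so that visitors always see a cached result rather than waiting on a live query.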

A way to associate listens with MBIDs

Proposed mentors: ruaok, alastairp, gentlecat
Languages/skills: Python
Forum for discussion

Last.fm is broken because of the terrible way it handles metadata (artists with the same name are jumbled into a single page; at the same time, there are often multiple pages for the same artist/album/track due to spelling variations). ListenBrainz is smarter by taking advantage of MBIDs. But there needs to be some sort of interface for identifying listens as being for a particular track (or recording) MBID. This could allow the user to identify an album they listened to on Spotify as the same one they listen to in iTunes a few days later. Then they wouldn't remain separate artists or albums in the stats due to differences in metadata alone.
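A very small sketch of the bookkeeping involved: listens whose artist and track names differ only in case, accents, or punctuation are mapped to the same key, and that key is tied to an MBID that the user has confirmed. The class and function names are hypothetical. Note that string normalization alone cannot tell apart two artists with the same name, which is why the association comes from the user rather than from the matching itself.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in without_marks.casefold())
    return " ".join(cleaned.split())


def listen_key(artist, track):
    """Build the lookup key for a listen from its free-text metadata."""
    return (normalize(artist), normalize(track))


class MbidIndex:
    """Maps normalized (artist, track) pairs to a user-confirmed recording MBID."""

    def __init__(self):
        self._by_key = {}

    def associate(self, artist, track, mbid):
        self._by_key[listen_key(artist, track)] = mbid

    def lookup(self, artist, track):
        return self._by_key.get(listen_key(artist, track))
```

Once a user has associated one spelling with an MBID, later listens submitted with a slightly different spelling resolve to the same recording.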

BookBrainz

Data Importing

Proposed mentors: LordSputnik or Leftmost
Languages/skills: Browser JS, Node.js or Python, SQL/PostgreSQL
Forum for discussion

At last year's summit, the two BookBrainz lead developers, Leftmost and LordSputnik, worked on a plan for importing third party data into BookBrainz. This plan has several stages. First, data sources need to be identified, including mass import sources with freely available data, such as libraries, and manual import sources, such as online book stores and other user-contributed databases. The next stage is to update the database to introduce an "Import" object, which can be used to distinguish mass imported data from (usually better quality) user contributions. Then, actual import bots for mass import and userscripts for manual import will need to be written. Finally, it would be desirable (but not necessary if time is short) to introduce an interface to the BookBrainz site to allow users to review automatically imported data and approve it.

Web API

Proposed Mentors: LordSputnik/Leftmost
Languages/skills: Node.js, ES6, Python, Redis, OAuth
Forum for discussion

We’re currently in the process of switching to using Node.js for all server side code. As part of this, our schema has been redesigned, and the current Python-based web API will no longer work.

We'd like a new and improved JSON web API to be designed and implemented. The design would clearly describe the result of each different query to the web API, and give examples of output. It would also describe the workings of any additional features to be implemented - for example, authentication, caching and rate limiting. Authentication in the web API is a particular challenge, since the current MB OAuth setup requires a GUI.

The web API should be written using the koa.js Node.js server framework, so that the resulting code is as clean and minimal as possible. Tests should be written in parallel with the implementation, adapting and expanding on the tests for the existing Python web API. The priority for this task is a solid plan and quality code, not a complete implementation (although that would be nice!).
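For the rate limiting mentioned above, one common approach is a token bucket. The sketch below is written in Python purely for illustration; the actual web API would be implemented in Node.js as described, and the class name and parameters here are assumptions rather than part of any existing design.

```python
import time


class TokenBucket:
    """Allow `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Return True if the request may proceed, consuming one token."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In a server, one bucket would typically be kept per client (for example per API key or IP address), and a rejected request would be answered with HTTP status 429.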

MusicBrainz Picard

Qt5/Py3 Port

Proposed Mentors: zas/bitmap
Languages/skills: Python, PyQt5

Picard is going to receive a complete makeover of its code, features, and UI for v2.0. Part of the v2.0 plan is porting the existing code from PyQt4/Py2 to PyQt5/Py3. This allows for future support and better encoding compatibility, brings the new features of Qt5 (such as High DPI scaling, better widget and networking support, and better cross-platform compatibility), and picks up the bug fixes that come with PyQt5, all of which will help clean up the Picard code base.

For more information see: PICARD-588 PICARD-960

Picard Plugins API v2

Proposed Mentors: zas/bitmap
Languages/skills: Python, PyQt5, Flask

Picard 2.0 is going to be PyQt5-based, which means all the existing plugins have to be ported to PyQt5/Py3. Since Picard v1 plugins will be incompatible with v2 and vice versa, we need to implement a new v2 endpoint for the Picard website. There is also a need to provide a uniform GUI API for plugins to allow quick GUI option settings, i18n support, and UI uniformity with the rest of the Picard code. Also, additional metadata headers need to be implemented to allow cross-platform compatibility checks.
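A compatibility check based on such metadata headers might look roughly like the sketch below. The header names (`api_versions`, `platforms`) and the function are hypothetical and only illustrate the idea; the real header format is part of what this project would design.

```python
def is_plugin_compatible(metadata, host_api_version, host_platform):
    """Return True if a plugin's declared metadata matches the running Picard.

    `metadata` is a hypothetical dict of plugin headers:
      - "api_versions": list of supported plugin API versions (required)
      - "platforms": optional list of supported platforms; absent means any
    """
    if host_api_version not in metadata.get("api_versions", []):
        return False
    platforms = metadata.get("platforms")
    return platforms is None or host_platform in platforms


# Example plugin that targets API v2 on two platforms only.
plugin = {"api_versions": ["2.0"], "platforms": ["linux", "win32"]}
```

The website's v2 endpoint could apply the same check when listing plugins, so that users are only offered plugins that will load in their installation.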

For more info see: PICARD-977