Development/Summer of Code/2020/ListenBrainz
Latest revision as of 15:41, 28 February 2020
ListenBrainz is one of the newest MetaBrainz projects. Read more information on its homepage.
Getting started
(see also: Getting started with GSoC)
If you want to work on ListenBrainz you should show that you are able to set up the server software and understand how some of the infrastructure works. Here are some things that we might ask you about:
- Show that you understand the goals that ListenBrainz wants to achieve, which are written on its homepage
- Install the server on your computer or use the Vagrant setup scripts to build a virtual machine
- Create an OAuth application on the MusicBrainz website and add the configuration information to your ListenBrainz server. Use this to log in to your server with your MusicBrainz details
- Use the import script that is part of the ListenBrainz server to load scrobbles from last.fm to your ListenBrainz server, or the main ListenBrainz server
- Use your preferred programming language to write a submission tool that can send Listen data to ListenBrainz. You could make up some fake data for song names and artists. This data doesn't have to be real.
- Try deleting the ListenBrainz database on your local server to remove the fake data that you added
- Look at the list of tickets that we have open for ListenBrainz and see if you understand what tasks the tickets involve
- If you want to, see if you can contribute to fixing a ticket. Either add a comment to the ticket or ask in IRC for clarification if you don't understand what the ticket means
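The submission-tool exercise above can be sketched in a few lines of Python. This is an illustrative sketch, not the authoritative client: it assumes the public submit-listens endpoint (`POST /1/submit-listens` with a `Token` authorization header) and the documented single-listen payload shape; check the ListenBrainz API documentation before relying on either.

```python
import json
import time
import urllib.request

# Point this at your local server instead to test your own setup
API_URL = "https://api.listenbrainz.org/1/submit-listens"

def build_listen(artist, track):
    """Build one fake listen in the submission format (data doesn't have to be real)."""
    return {
        "listen_type": "single",
        "payload": [{
            "listened_at": int(time.time()),
            "track_metadata": {
                "artist_name": artist,
                "track_name": track,
            },
        }],
    }

def submit_listen(token, listen):
    """POST the listen; the user token comes from your LB profile page."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(listen).encode("utf-8"),
        headers={
            "Authorization": "Token " + token,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

listen = build_listen("Fake Artist", "Fake Track")
# submit_listen("your-user-token", listen)  # uncomment to actually send
```

A loop over `build_listen` with made-up song names is enough to generate the fake data mentioned above.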
Ideas
Add more statistics and graphs for users and our community
Proposed mentors: mayhem, alastairp, iliekcomputers
Languages/skills: Python, JavaScript, D3, Apache Spark, data science, graphing, visualization
ListenBrainz now has a statistics infrastructure that collects and computes statistics from the listen data that we have stored in our database (and in an Apache Spark cluster). So far we've only implemented a top artists per user query that shows that our statistics infrastructure is working. However, we're interested in adding a lot more statistics/graphs to this setup:
- top album for a user
- top album for a user by genre
- top tracks for a user or everyone
- users with similar music tastes to mine
- when did I start listening to this artist/album?
There are many more interesting charts/graphs/statistics that we wish to show but haven't thought of yet. If you are interested in this project, we will ask you to think about possible user stats and to come up with other examples of statistics that we might be interested in capturing/producing. One part of this project involves writing queries in Apache Spark and Python glue code to take the results and ship them from our Spark cluster to our production servers. The other part involves serving these statistics from our servers using Python and then rendering the results as good-looking charts built in JavaScript with the D3 toolkit.
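To make the shape of such a statistic concrete, here is a pure-Python sketch of the "top album for a user" aggregation over listen records. In production this would be an equivalent Apache Spark query over the full listen history; the field names (`user_name`, `release_name`) are illustrative assumptions, not the exact production schema.

```python
from collections import Counter

def top_albums(listens, user, n=3):
    """Count listens per album (release) for one user and return the top n.

    `listens` is any iterable of dicts; albums without a release name
    are skipped. Field names here are illustrative.
    """
    counts = Counter(
        l["release_name"] for l in listens
        if l["user_name"] == user and l.get("release_name")
    )
    return counts.most_common(n)

# Fake sample rows standing in for the listen store
listens = [
    {"user_name": "rob", "release_name": "OK Computer"},
    {"user_name": "rob", "release_name": "OK Computer"},
    {"user_name": "rob", "release_name": "Kid A"},
    {"user_name": "alice", "release_name": "Kid A"},
]

print(top_albums(listens, "rob"))  # [('OK Computer', 2), ('Kid A', 1)]
```

The project work is then largely about expressing aggregations like this one in Spark, shipping the results to the production servers, and rendering them with D3.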
Create a high performance listen ingester
Proposed mentors: mayhem, iliekcomputers
Languages/skills: Rust or Go, Python, Protobuf, RabbitMQ, TimescaleDB
ListenBrainz currently processes incoming listens in pure Python: processing a listen requires parsing JSON, validating the data, re-serializing the JSON, and sending it to the database component for deduplication and writing to our datastore. The current process takes up too many resources and simply isn't very scalable; the code also isn't ideally laid out, causing us to serialize and deserialize each listen more than once.
For this summer of code project we would like a student to implement a single API endpoint (submit listen) and to port our existing ingestion pipeline to use Protocol Buffers. The new ingester should parse the JSON, validate the data, handle and report errors in exactly the same manner that is currently in use in our production system. Furthermore, the incoming listen pipeline should be converted to use the new protobuf based format for internal communication in order to make the new ingester as performant as possible.
This will require the creation of a very small Go/Rust server that handles the submit-listens endpoint (ingester) and a tool that will read incoming listens from the RabbitMQ queue, write them to TimescaleDB and then pass the unique listens down another RabbitMQ pipeline (was influx_writer, will soon be timescale_writer).
At this point we haven't quite settled on Rust or Go for this project. Do you have a feeling for that?
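Whichever language is chosen, the validation step the new ingester must reproduce looks roughly like the sketch below. It is written in Python only for illustration (a Go or Rust implementation would mirror it); the specific checks shown are assumptions about typical listen validation, not the exact rules of the production system.

```python
import json
import time

def validate_listen(raw):
    """Parse and validate one submitted listen, returning the clean dict.

    Raises ValueError with a message suitable for reporting back to the
    client, since the new ingester must report errors exactly as the
    current production system does.
    """
    try:
        listen = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError("invalid JSON: %s" % e)

    meta = listen.get("track_metadata") or {}
    if not meta.get("artist_name") or not meta.get("track_name"):
        raise ValueError("track_metadata must include artist_name and track_name")

    ts = listen.get("listened_at")
    # Reject non-integer timestamps and listens claimed from the far future
    if not isinstance(ts, int) or ts > int(time.time()) + 600:
        raise ValueError("listened_at must be a unix timestamp, not in the future")

    return listen
```

After validation, the listen would be serialized once into the new protobuf format and passed down the RabbitMQ pipeline, avoiding the repeated JSON round-trips of the current code.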
Relevant links:
- LB submit-listen API endpoint: https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz/webserver/views/api.py#L21
- LB influx-writer: https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz/influx_writer
- Protocol Buffers: https://developers.google.com/protocol-buffers
- Timescale DB: https://www.timescale.com/
Add 'love/hate a recording' support to ListenBrainz
Proposed mentors: mayhem, alastairp, iliekcomputers
Languages/skills: Python, InfluxDB, data science
Many music services let users give feedback on a track: whether they like or dislike it. This information is very useful for tuning the recommendation algorithms that we're in the process of developing. This project would involve three distinct parts:
1. Adding views and user-interface elements to allow a user to mark a track as like/hate from the LB web pages.
2. Adding underlying data-store functionality for storing and retrieving likes/hates.
3. Adding API endpoints for users to fetch/submit this data.

See also: Add support for like/hate on a listen (https://tickets.metabrainz.org/browse/LB-489)
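To make the three parts concrete, here is a minimal in-memory sketch of the feedback store and its fetch/submit semantics. The real implementation would sit behind Flask API endpoints and a proper database, and the score convention used here (1 = love, -1 = hate, 0 = remove) is an assumption for illustration, not the settled API.

```python
class FeedbackStore:
    """In-memory stand-in for the likes/hates data store (part 2 above)."""

    def __init__(self):
        # (user_id, recording_msid) -> score
        self._scores = {}

    def submit(self, user_id, recording_msid, score):
        """Record feedback; assumed convention: 1 = love, -1 = hate, 0 = remove."""
        if score not in (-1, 0, 1):
            raise ValueError("score must be -1, 0 or 1")
        if score == 0:
            self._scores.pop((user_id, recording_msid), None)
        else:
            self._scores[(user_id, recording_msid)] = score

    def get_feedback_for_user(self, user_id):
        """Fetch all feedback for one user (what a fetch endpoint would return)."""
        return [
            {"recording_msid": msid, "score": score}
            for (uid, msid), score in self._scores.items()
            if uid == user_id
        ]

store = FeedbackStore()
store.submit("rob", "msid-1", 1)
store.submit("rob", "msid-2", -1)
store.submit("rob", "msid-2", 0)   # user withdraws the hate
print(store.get_feedback_for_user("rob"))  # [{'recording_msid': 'msid-1', 'score': 1}]
```

Parts 1 and 3 then wrap this store in web views and JSON API endpoints that the recommendation pipeline can read from.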