Server Log Analysis
The following data will be mined:
The 10 most visited artists, release groups, releases, recordings, labels, and works could be selected for each day. In the long run these daily statistics will be summarized, and we could even create some sort of "billboard". Later we will try to collect the most popular recordings for each artist, though that task may be too time-consuming to run every day.
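The daily top-10 selection could be sketched roughly like this. The request-path layout (`/artist/<mbid>`, `/release/<mbid>`, etc.) and the function name are assumptions for illustration, not the actual log schema:

```python
from collections import Counter

def top_visited(paths, entity_type, n=10):
    """Hypothetical sketch: count visits per entity of one type from
    request paths such as "/artist/<mbid>" (path layout assumed)."""
    counter = Counter()
    for path in paths:
        parts = path.strip("/").split("/")
        if len(parts) >= 2 and parts[0] == entity_type:
            counter[parts[1]] += 1
    return counter.most_common(n)

# Toy data standing in for one day of hashed log entries:
sample = ["/artist/a1", "/artist/a1", "/artist/a2", "/release/r1"]
```

Summarizing over many days would then just merge the daily `Counter` objects before taking `most_common`.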
Visit counts of "static" pages
It could be interesting to see which menu items are more popular than others. This has already been mined from the sample data, but summarizing these statistics over more days would give more accurate results. The same approach can be extended to pages that are not part of the menu.
Web service usage statistics
Various usage statistics can be mined about the web services. Part of this has also been done on the sample log: the most popular "inc" parameters were selected for each web service, and the number of times each entity type (artist, release, etc.) was searched was counted. Mining data across multiple days would let us see trends.
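Tallying "inc" parameters could look like the sketch below. The ws/2-style URLs are illustrative; note that the `+`-separated inc list arrives as space-separated after standard URL decoding, which is what `parse_qs` produces:

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def count_inc_params(urls):
    """Sketch: tally individual "inc" values across web-service requests.
    The URL shape is an assumption for illustration."""
    counter = Counter()
    for url in urls:
        query = parse_qs(urlsplit(url).query)
        for inc in query.get("inc", []):
            # "inc=aliases+tags" decodes to "aliases tags", so split on
            # whitespace to recover the individual values.
            counter.update(inc.split())
    return counter

requests = [
    "/ws/2/artist/123?inc=aliases+tags",
    "/ws/2/artist/456?inc=aliases",
]
```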
Finding problem users
If commercial companies are downloading data without permission, the logs can be used to detect these cases. Privacy is a concern here too and must be handled carefully: this information should never be released publicly, but if such a case occurs the student should be able to verify the validity of the data.
The architecture will allow arbitrary queries to be added and run periodically. Along with each query, some metadata could be provided. This metadata will state when to run the query (daily, weekly, etc.) and how its output should be handled (public or private).
The original project proposal mentions IP geolocation as an interesting source of data to mine. Privacy is a major problem here: the Summer of Code student must not be able to access IP addresses in the logs, so the addresses (along with other sensitive data) are hashed. Geolocation could be resolved before hashing happens, but this is a low-priority task that we may not do at all; Google Analytics already provides good information on this.
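The hashing of sensitive fields could be done with a keyed one-way hash, so identical IPs still map to the same token (allowing counting) while remaining unlinkable without the key. This is a minimal sketch, assuming HMAC-SHA256 and a server-side secret; the key name and function are hypothetical:

```python
import hashlib
import hmac

# Placeholder: the real secret stays with the server admins and is
# never shared with the student.
SECRET_KEY = b"replace-with-a-real-secret"

def anonymize_ip(ip: str) -> str:
    """Keyed one-way hash of a client IP before logs are handed over."""
    return hmac.new(SECRET_KEY, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because the hash is deterministic for a given key, per-IP request counts (e.g. for finding problem users) still work on the anonymized logs.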
We use Splunk to process the logs. Splunk will be accessed through its Python SDK by a Python script that is run periodically. The queries will be stored in a text file together with their metadata (possibly in a simple CSV format).
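A possible shape for that query file, and the scheduling side of the periodic script, is sketched below. The column names and the example Splunk search strings are assumptions, not a fixed format; each returned row's "query" string would then be submitted through the Splunk Python SDK:

```python
import csv
import io

# Hypothetical query file: one row per saved query, with its schedule
# and output visibility. The searches shown are illustrative only.
QUERY_FILE = """name,schedule,visibility,query
top_artists,daily,public,search sourcetype=access artist | top 10
ws_inc_usage,weekly,private,search sourcetype=ws | stats count by inc
"""

def queries_due(csv_text, schedule):
    """Return the query rows that should run for the given period."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["schedule"] == schedule]
```

A daily cron run would call `queries_due(QUERY_FILE, "daily")` and hand each row off to Splunk; the "visibility" column then decides where the results are published.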
Public data could be published on stats.musicbrainz.org. As a special case, the "billboard" of most popular entities could be published on musicbrainz.org itself. Private data should only be accessible to authenticated members. Privacy is again a problem here, since the student would not be allowed to access these areas. It is not yet clear where to post the private results: if they go on stats.musicbrainz.org, a new form of authentication has to be added; on musicbrainz.org the existing user account system could authenticate members, but the data would feel out of place there given that we already have a dedicated place for statistics.