User:Jokipii: Difference between revisions
From MusicBrainz Wiki
Latest revision as of 17:35, 26 July 2012
Antti Jokipii [MB: Jokipii and operator of Jokipii_bot | IRC: Jokipii | Wiki: Jokipii | Last.fm AnttiJokipii]
I am currently trying to improve linking between MusicBrainz and Discogs. I have both databases installed on PostgreSQL. The bot code can be found at musicbrainz-bot, and the code that builds the Discogs database from the monthly XML dumps can be found at discogs-xml2db.
Userscripts
Here is a userscript that makes voting on Discogs links easier.
Bot queue
Set descriptions and number of links
In Progress
Artist Discogs links based on name match and other evidence (linked releases, linked release-groups, urls, release names, "member of band" and "is person"-links, VA release tracks, release level credits)
- 39570

Label links based on name match, catalog numbers, and release names
- 10496

Release links identified by match on normalized catalog number, release name, linked label, format, same number of tracks, same release country, same release year, barcode
- 21966
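The release criteria listed above can be sketched roughly as follows. This is only an illustration, not the bot's actual code: the `Release` record, its field names, the catalog number normalization (uppercase, keep only ASCII letters and digits), and the case-insensitive name comparison are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    # Hypothetical record; the field names are illustrative, not a real schema.
    name: str
    catalog_number: str
    label_id: int              # shared id of an already linked label (assumed)
    release_format: str
    track_count: int
    country: str
    year: Optional[int]
    barcode: Optional[str]


def normalize_catno(catno: str) -> str:
    """Assumed normalization: uppercase, then drop all but A-Z and 0-9."""
    return re.sub(r"[^A-Z0-9]", "", catno.upper())


def is_match(mb: Release, dc: Release) -> bool:
    """Return True only if every listed criterion agrees."""
    return (
        normalize_catno(mb.catalog_number) == normalize_catno(dc.catalog_number)
        and mb.name.casefold() == dc.name.casefold()  # assumed case-insensitive
        and mb.label_id == dc.label_id
        and mb.release_format == dc.release_format
        and mb.track_count == dc.track_count
        and mb.country == dc.country
        and mb.year == dc.year
        and mb.barcode == dc.barcode
    )
```

Requiring every field to agree keeps the candidate set small and conservative, which suits edits that are submitted by a bot.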
Not Started
Release-groups based on linked artists, unique names (in the artist's context), and same earliest release year
- 35289

Advanced relationships between recordings and artists where both are already linked to Discogs
- remixer ~ 40000

Barcodes from linked Discogs releases
- 500

Artist "is person"-links based on Discogs data
- 1962

Artist "member of band"-links based on Discogs data
- 28183

Advanced relationships between releases and artists where both are already linked to Discogs
- Producer (Hand made example) 74164
- Mastered 27970
- and certainly many more in other relationship classes
Done
- Artist Discogs links with name match. Have release(s) with Discogs links. All releases found point to same artist at Discogs.
- Artist types based on disambiguation comment
- Artist country based on disambiguation comment
- Artist country based on Discogs profile text
- Release links identified by exact match on catalog number, release name, linked label, format, same number of tracks and same release country.
- Release links identified by match on normalized catalog number, release name, linked label, format, same number of tracks, same release country, same release year and same barcode.
- Removing dead (404 error) Discogs artist and label links
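As an illustration of the first rule in this list (name match, linked releases, all pointing to the same Discogs artist), here is a minimal sketch. The function name and the input shape (one list of `(discogs_artist_id, discogs_artist_name)` credits per linked Discogs release) are hypothetical and chosen only for the example; the bot itself works against the PostgreSQL databases.

```python
from typing import Optional, Sequence, Tuple

Credit = Tuple[int, str]  # (discogs_artist_id, discogs_artist_name), assumed shape


def pick_discogs_artist(
    mb_name: str,
    linked_releases: Sequence[Sequence[Credit]],
) -> Optional[int]:
    """Return the Discogs artist id only if every linked release agrees on it."""
    candidates = set()
    for credits in linked_releases:
        # Artists on this Discogs release whose name matches exactly.
        matching = {artist_id for artist_id, name in credits if name == mb_name}
        if len(matching) != 1:
            return None  # this release gives no evidence, or ambiguous evidence
        candidates |= matching
    # An empty input or disagreeing releases yield no link.
    return candidates.pop() if len(candidates) == 1 else None
```

For example, `pick_discogs_artist("Foo", [[(1, "Foo"), (2, "Bar")], [(1, "Foo")]])` returns `1`, while two releases that credit different artists named "Foo" return `None`.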
Bot programming tasks
- Merge bot code to musicbrainz-bot (Done)
- Start using discogs-xml2db to produce Discogs database (Done)
- Better documentation
  - User:Jokipii/Discogs mapping: documentation of how the bot maps values between Discogs and MB (Need some updates)
  - User:Jokipii/Discogs_credits_mapping: documentation of how the bot maps Discogs credits to ARs
- Map Discogs credits <-> MB Advanced relationships (Started)
- Start using a schema for each database and move the data into one database. This removes the dependency on the dblink extension, makes data duplication unnecessary, speeds up complex operations, and makes them less complex.
Some stats
 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) | Sum of unique MusicBrainz releases connected to linked entities | Percent of all MusicBrainz releases | Sum of unique Discogs releases connected to linked entities | Percent of all Discogs releases |
---|---|---|---|---|---|---|---|---|
Releases: | 988301 | 2720810 | 171170 | 17% | | | | |
Release groups: | 822442 | 365081 | 47387 | 13% | 106445 | 11% | 277207 | 10% |
Artist: | 626598 | 2100250 | 110825 | 18% | 606737 | 61% | 1738474 | 64% |
Label: | 55844 | 245988 | 16004 | 29% | 414436 | 42% (see note) | 1542803 | 57% |
Note: In MB, only 567392 releases have label information; 420909 do not.
2012-02-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1008061 | 2926422 | 182581 | 18% |
Release Groups: | 839314 | 405891 | 73656 | 18% |
Artists: | 644784 | 2251519 | 126210 | 20% |
Labels: | 58038 | 300452 | 17141 | 30% |
2012-03-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1016640 | 2926422 | 200576 | 20% |
Release Groups: | 846637 | 405891 | 80482 | 20% |
Artists: | 651243 | 2251526 | 143468 | 22% |
Labels: | 58872 | 300452 | 17322 | 29% |
2012-04-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1027422 | 3045567 | 225871 | 22% |
Release Groups: | 854785 | 423610 | 90931 | 21% |
Artists: | 658827 | 2328770 | 149744 | 23% |
Labels: | 59775 | 323996 | 17529 | 29% |
2012-05-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1035849 | 3045567 | 229590 | 22% |
Release Groups: | 860665 | 423610 | 94024 | 22% |
Artists: | 664330 | 2328770 | 164656 | 25% |
Labels: | 60456 | 323996 | 17949 | 30% |
2012-06-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1044483 | 3166200 | 239847 | 23% |
Release Groups: | 867139 | 441382 | 96020 | 22% |
Artists: | 669934 | 2407765 | 183851 | 27% |
Labels: | 61057 | 347133 | 19938 | 33% |
2012-07-23 | MusicBrainz Total | Discogs Total | Links (all these are not unique) | Percent done (compared to smaller total) |
---|---|---|---|---|
Releases: | 1053427 | 3166200 | 244006 | 23% |
Release Groups: | 874072 | 441382 | 98747 | 22% |
Artists: | 676840 | 2407765 | 194864 | 29% |
Labels: | 61302 | 347133 | 22366 | 36% |
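The "Percent done (compared to smaller total)" column appears to be the number of links divided by the smaller of the two database totals, rounded to a whole percent. A minimal sketch under that assumption (the function name is hypothetical):

```python
def percent_done(mb_total: int, discogs_total: int, links: int) -> int:
    """Links as a whole-number percentage of the smaller of the two totals."""
    smaller = min(mb_total, discogs_total)
    return round(100 * links / smaller)


# Labels row of the 2012-07-23 table: 22366 links, smaller total is 61302.
print(percent_done(61302, 347133, 22366))  # 36
```

Using the smaller total gives an upper bound on how complete the linking can get, since an entity present in only one database cannot be linked at all.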