User:Ianmcorvidae/Ingestion: Difference between revisions

From MusicBrainz Wiki

Revision as of 00:09, 11 December 2012

i.e. "ingestr"

Terms

  • index: a specific data source, e.g. Jamendo releases
  • item: a specific entry in an index, e.g. a specific Jamendo release
  • subitem: an extracted, uniquely identifiable entity from a given index, which may or may not appear in more than one item; e.g. Jamendo artist IDs
  • mapping: a function converting something in a given index from its native format to a common intermediate format; e.g. a function that takes the JSON representation of a single Jamendo release and outputs JSON in a standard MusicBrainz-inspired format suitable for seeding and similar tasks
  • matching: a link between an item or subitem and an MBID (plus a type) or a collection of MBIDs (plus a type), tagged with username and timestamp; e.g. Jamendo release 5334 would be linked to the correct MB release. Note: collections of MBIDs are supported here primarily for artists; e.g. "Foo and Bar" should generally be linked to Foo and to Bar. The most recent match (preferring non-automatic matches, where relevant) is considered correct; there is no voting/quality-control system other than social pressure not to fuck up.
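
The "most recent match, preferring non-automatic matches" rule can be sketched as follows. This is a minimal illustration, not the actual implementation: the `Matching` record and its field names are hypothetical, and it assumes one reading of the rule, namely that any human matching outranks any automatic one and the newest wins within each group.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Matching:
    """One link from an item or subitem to one or more MBIDs (hypothetical shape)."""
    external_id: str                 # e.g. "jamendo-release:5334" (made-up ID scheme)
    mbids: Tuple[str, ...]           # a collection, so "Foo and Bar" can link to both artists
    entity_type: str                 # e.g. "release" or "artist"
    timestamp: datetime
    username: Optional[str] = None   # set for human matchings
    algorithm: Optional[str] = None  # set for automatic matchings

    @property
    def is_automatic(self) -> bool:
        # Assumption: a matching without a username was submitted automatically.
        return self.username is None


def current_matching(matchings: Sequence[Matching]) -> Optional[Matching]:
    """Return the matching considered correct, or None if there are none."""
    if not matchings:
        return None
    # Tuples compare left to right: human beats automatic, then newest wins.
    return max(matchings, key=lambda m: (not m.is_automatic, m.timestamp))
```

Under this reading, an older human matching still wins over a newer automatic one; if the intent is instead "newest wins, with human matchings only breaking ties", the sort key would need to change.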

Design Process

Users: this is designed primarily for advanced users, at least as far as direct use of the interface goes. It also has the potential to see heavy automated use, e.g. by people trying to figure out what a given external ID maps to in MB.

Use cases:

  1. Manual Matching + Importing: editors using this as a data source will want to work through matching things to MB, and importing things that aren't there. This will likely happen via a search result of some sort.
  2. Automatic Matching: some datasets are already partially or fully matched by some automatic process to MB, and with access to lots of data we or others may want to contribute additional automatically-derived matchings. This must therefore be possible programmatically and with relative ease.
  3. Lookup: editors verifying information or third parties wanting to match something-they-have to something-we-have will want to be able to look things up by external identifier or by MBID and find the information they're looking for.

Design Principles

  • Collections Everywhere: In the common mapping format, very few fields hold a single value; most are, instead, lists. For things like artists, which can be multivalued, collections provide a slightly simpler data model than full artist credits; for things like titles, collections provide a set of potential alternates, as a given index doesn't always agree with itself on the same bit of information (especially as mappings sometimes can't account for non-structured dissimilarities between items). (Derives from the fact that data doesn't necessarily map cleanly, so collections are more flexible.)
  • First Class API: For matchings, primarily: the interface calls out to an endpoint to perform matchings, which could also be used by external sources (authentication is of course required for linking a matching to a user). Automated matchings are added by not authenticating to the endpoint and instead sending an extra parameter identifying which automatic matching algorithm is being used (e.g. "archive.org acoustid-based matching", "ianmcorvidae's janky jamendo perl script"). For simply getting information, this is basically a pass-through to elasticsearch. (Derives from the fact that many users may be automated.)
  • Always Show Everything: the mapping isn't the only way into the data. The full JSON source is always available; some things may be hidden by default (e.g. archive.org generates sha1, md5, and crc for every file, but we probably only need one), but with a "show all" link. This also applies to the mappings: pages need prominent links to the source code for the mappings so they can be inspected. Ideally (i.e. eventually), it'll support passing elasticsearch's query DSL directly through as the starting point for a search, so users aren't limited to the basic query-string matching and default filters. (Derives from the fact that the users are generally advanced: full access is important when the users are smarter than the developer! Generality of search is useful for people wanting less fuzzy search results, or for third parties wanting to perform certain kinds of lookup.)
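
To illustrate Collections Everywhere, here is a rough sketch of what a mapping might look like. Everything specific in it is an assumption: the source field names (`name`, `artist_name`, `releasedate`, `tracks`, `album_name`) are hypothetical stand-ins rather than the real Jamendo schema, and the output keys are just one possible shape for the common format.

```python
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """Wrap a scalar in a list, copy lists, and map missing values to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return list(value)
    return [value]


def map_jamendo_release(source: Dict[str, Any]) -> Dict[str, Any]:
    """Map one release (hypothetical source fields) to a list-valued common format."""
    titles = _as_list(source.get("name"))
    # Track-level metadata may disagree with the release; keep both as alternates.
    for track in source.get("tracks", []):
        album = track.get("album_name")
        if album and album not in titles:
            titles.append(album)
    return {
        "title": titles,
        "artist": _as_list(source.get("artist_name")),
        "date": _as_list(source.get("releasedate")),
        # Always Show Everything: the untouched source stays reachable.
        "source": source,
    }
```

Because every field is a list, a missing value is simply an empty list and a disagreement within the index becomes a second entry, so consumers never need to special-case either situation.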

Various Design Decisions

  • Search modes: "Search items" with a basic textbox and the default filters listed below; "search by linked MBID" with an MBID selector; and "search by external ID" with a way of selecting an external ID (i.e. subitem). Ideally/in the future: search by linked MBID would have an option for direct vs. indirect links (i.e. via subitems), and there would be a fourth mode for simply providing a JSON-formatted elasticsearch Query DSL query.
  • Default filters for search: trying to optimize for usefulness without making implementation take forever, we have some checkboxes: "Show human-matched items", "Show automatically matched items", and "Show unmatched items" (with the last two on by default). We also have a multi-select for which index or indices to search within. In the future: search by the user who matched most recently, or by any user who has contributed a match at some point, and perhaps filter by number of matches.
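
The default filters could translate into an elasticsearch request body roughly as below. This is only a sketch under several assumptions: it uses the `bool` query with a `filter` clause as found in current elasticsearch versions, and the field names `match_state` and `index_name` are hypothetical (they presume each document carries a precomputed match state and the name of the index, in the sense defined under Terms, that it came from).

```python
from typing import Any, Dict, List, Sequence


def build_search_body(
    text: str,
    indices: Sequence[str] = (),
    show_human: bool = False,
    show_automatic: bool = True,
    show_unmatched: bool = True,
) -> Dict[str, Any]:
    """Build a bool query from the filter checkboxes (field names are hypothetical)."""
    states: List[str] = []
    if show_human:
        states.append("human")
    if show_automatic:
        states.append("automatic")
    if show_unmatched:
        states.append("unmatched")

    # An empty list of states matches nothing, mirroring "no boxes ticked".
    filters: List[Dict[str, Any]] = [{"terms": {"match_state": states}}]
    if indices:
        # The multi-select: restrict to the chosen data sources.
        filters.append({"terms": {"index_name": list(indices)}})

    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": text}}],
                "filter": filters,
            }
        }
    }
```

The keyword defaults follow the checkbox defaults described above (automatically matched and unmatched items shown, human-matched items hidden). A raw Query DSL mode would bypass this function entirely and send the user's JSON through as-is.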