Difference between revisions of "User:Ianmcorvidae/Ingestion"

From MusicBrainz Wiki

Revision as of 06:44, 18 January 2013

i.e. "geordi"

Terms

  • elasticsearch
  • index: a specific data source, e.g. Jamendo releases (we'll say this index is called "jamendo")
  • item: a specific entry in an index, e.g. a specific Jamendo release (perhaps identified by Jamendo release ID -- so including the index, "jamendo/5334" -- in elasticsearch, these are stored as "{index}/item/{item id}")
  • subitem: an extracted, uniquely identifiable entity from a given index, which may or may not appear in more than one item; e.g. a Jamendo artist ID, stored in elasticsearch as "jamendo/subitem/0-444" if artist IDs are declared as subitem type 0 for the "jamendo" index (the mapping defines this)
  • mapping: a class (specific to a given index) with a variety of methods converting item information to several common intermediate formats; e.g. a 'jamendo' class whose 'map' method takes the JSON representation of a single Jamendo release and outputs JSON in a standard MusicBrainz-inspired format suitable for display and seeding MusicBrainz (see below for formats for this and other methods of such classes)
  • matching: a link between an item or subitem and an MBID (plus type) or collection of MBIDs (plus a type), tagged with username and timestamp; e.g. Jamendo release 5334 would be linked to the correct MB release. Note: collections of MBIDs are supported here primarily for artists; e.g. "Foo and Bar" should generally be linked to Foo and to Bar. The most recent match (preferring non-automatic matches, where relevant) is considered correct; there is no voting/quality-control system other than social pressure not to fuck up. Matchings are stored within items and subitems in elasticsearch.
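
The storage paths described above can be sketched as two small helpers. This is a minimal illustration, not geordi's actual code: the function names are hypothetical, and the "<subitem type>-<external ID>" form of a subitem ID is inferred from the single "jamendo/subitem/0-444" example.

```python
def item_path(index, item_id):
    """Return the elasticsearch path for an item, e.g. 'jamendo/item/5334'."""
    return "{0}/item/{1}".format(index, item_id)


def subitem_path(index, subitem_type, external_id):
    """Return the elasticsearch path for a subitem, e.g. 'jamendo/subitem/0-444'.

    Assumption: a subitem ID is '<subitem type>-<external ID>', as suggested
    by the one example in the Terms list.
    """
    return "{0}/subitem/{1}-{2}".format(index, subitem_type, external_id)
```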

Design Process

Users: this is designed primarily for advanced users, as far as direct use of the interface goes. It may also see heavy automated use, e.g. by people trying to figure out what a given external ID maps to in MB.

Use cases:

  • Manual Matching + Importing: editors using this as a data source will want to work through matching things to MB, and importing things that aren't there. This will likely happen via a search result of some sort.
  • Automatic Matching: some datasets are already partially or fully matched by some automatic process to MB, and with access to lots of data we or others may want to contribute additional automatically-derived matchings. This must therefore be possible programmatically and with relative ease.
  • Lookup: editors verifying information or third parties wanting to match something-they-have to something-we-have will want to be able to look things up by external identifier or by MBID and find the information they're looking for.

Design Principles

  • Collections Everywhere: For the common mapping format, very few things are individual items; most are, instead, lists. For things like artists, which can be multivalued, collections provide a slightly simpler data model than full artist credits; for things like titles, collections provide a set of potential alternates, as given indices don't always agree with themselves on the same bit of information (especially as mappings sometimes can't account for non-structured dissimilarities between items). (since data doesn't necessarily map cleanly, collections are more flexible)
  • First Class API: For matchings, primarily: the interface calls out to an endpoint to perform matchings, which could also potentially be used by external sources (but authentication is of course required for linking to users). Automated matchings are simply added by not authenticating to the endpoint and sending an extra parameter to identify what automatic matching algorithm is being used (e.g. "archive.org acoustid-based matching", "ianmcorvidae's janky jamendo perl script"). For simply getting information, this is basically a pass-through to elasticsearch. (derives from the fact many users may be automated)
  • Always Show Everything: the mapping isn't the only way into the data. The full JSON source is always available; some things may be hidden by default (e.g. archive.org generates sha1, md5, and crc for every file, but we probably only need one), but with a "show all" link. This also applies to the mappings: pages need to have prominent links to the source code for the mappings so they can be inspected. Ideally (i.e. eventually), it'll support passing elasticsearch's query DSL directly through as a start for a search result, so users aren't limited to the basic query-string matching and default filters. (derives from the fact that the users are generally advanced -- full access is important when the users are smarter than the developer! Generality of search is useful for people wanting less fuzzy search results, or for third-parties wanting to perform certain kinds of lookup)

Various Design Decisions

  • Search modes: "search items" with a basic textbox and the default filters listed below; "search by linked MBID" with an MBID selector; "search by external ID" with a way of selecting an external ID (i.e. subitem); and searching items purely by providing a JSON query. Ideally/in the future: search by linked MBID would have an option for direct vs. indirect links (i.e. via subitems)
  • Default filters for search: trying to optimize for usefulness without making implementation take forever, we have some checkboxes: "Show human-matched items", "Show automatically matched items", and "Show unmatched items" (with the last two on by default). We also have a multi-select for which index or indices to search within. In the future: search by user who matched most recently, or all users that have contributed a match at some point, perhaps filter by number of matches.
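
A rough sketch of how the three checkboxes might be applied to an item's list of matchings. The names are hypothetical, and one point is an assumption rather than something stated here: an item with any non-automatic matching is treated as human-matched, in line with the preference for non-automatic matches described under Terms.

```python
# Checkbox defaults: the last two ("automatically matched", "unmatched") are on.
DEFAULT_FILTERS = {"human": False, "auto": True, "unmatched": True}


def match_status(matchings):
    """Classify a list of matching dicts as 'human', 'auto', or 'unmatched'."""
    if not matchings:
        return "unmatched"
    # Assumption: any non-automatic matching makes the item human-matched.
    if any(not m["auto"] for m in matchings):
        return "human"
    return "auto"


def is_shown(matchings, filters=DEFAULT_FILTERS):
    """Return True if an item with these matchings passes the filters."""
    return filters[match_status(matchings)]
```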

Mapping Format

Each index has a specific class that's used for performing mappings.

Still in progress; more keys will appear as it gets fleshed out further. str for string, int for integer. TODO: Probably should change back to "everything is strings". This is extracted with the 'map' method of a mapping class, passing the data for an item.

This data will be available via _geordi.mapping in a given item.

{
  "version" : int,                 # This and other versions for selective invalidation or updating, should it be desired
  "release" : {
    "title": [str],
    "date": [str],
    "artist": [{
      "name": str,                 # TODO should be list
      "subitem": str               # optional, specifies the subitem this artist is represented by
    }],
    "other_artist": [{             # for any other random artist associated with the release (e.g. composers, conductors, etc.)
      "name": str,                 # TODO should be list
      "subitem": str               # optional, specifies the subitem this artist is represented by
    }],
    "combined_artist": "",         # Typically calculated by: ", ".join(artist[:-1]) + " and " + artist[-1]
    "label": [{
      "name": str,                 # TODO should be list
      "catalog_number": [str],     # optional, associated catalog numbers with this label specifically
                                   #           for data sources that support this sort of association
      "subitem": str               # optional, specifies the (single) subitem this label is represented by
    }],
    "catalog_number": [str],       # Unassociated catalog numbers
    "urls": [{                     # optional, linked URLs (or urlish things like cover art) at the release level
      "url": [str],
      "type": "link type"|"cover art", # for now, if known
      "link_type": [str],          # if applicable and available
      "cover_art_type": [str]      # if applicable and available
    }, ...],
    "tracks": [{
      "title": [str],
      "artist": [{
        "name": str,               # TODO should be list
        "subitem": str             # optional, specifies the (single) subitem this artist is represented by
      }],             
      "length": [int],             # integer milliseconds
      "length_formatted": [str],   # '?:??', HH:MM:SS, MM:SS, or XX ms
      "number": [str],
      "subitem": str,              # optional, specifies the (single) subitem this track is represented by
      "totaltracks": [int],        # optional
      "medium": [str],             # optional
      "acoustid": [uuid]           # optional
    }, ...]
  }
}
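
To make the shape concrete, here is a minimal sketch of a mapping class with a 'map' method. The class name and the input field names ("name", "artists") are invented for illustration and do not describe any real index. The helper follows the "combined_artist" formula from the comment above, with extra handling for zero or one artist, since the bare formula would yield " and Foo" for a single name.

```python
def combine_artists(names):
    """Join artist names as 'A, B and C'; handles zero, one, or many names."""
    if not names:
        return ""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]


class ExampleMapping(object):
    """Hypothetical mapping class; the input keys below are assumptions."""

    version = 1

    def map(self, data):
        """Convert one item's source JSON into the common mapping format."""
        artists = [{"name": name} for name in data.get("artists", [])]
        return {
            "version": self.version,
            "release": {
                # Collections everywhere: even a single title is a list.
                "title": [data["name"]] if "name" in data else [],
                "date": [],
                "artist": artists,
                "combined_artist": combine_artists([a["name"] for a in artists]),
                "tracks": [],
            },
        }
```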

Subitem mapping format, extracted with the 'link_types' method of a mapping class (with no arguments). The index of a mapping within the returned array is the identifier for the mapping.

{"<identifier>": {
  "name": str,  # Used as a description of the subitem type
  "key": str    # Which key to use to extract an ID from a link
  }, ...,
 "version": int}
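
As an illustration only, a 'link_types' return value for a hypothetical "jamendo" mapping might look like the following, together with a helper that uses the "key" entry to pull an ID out of an extracted link. The "artist_id" key name and the "<identifier>-<id>" subitem ID form are assumptions, the latter inferred from the "0-444" example under Terms.

```python
def link_types():
    """Hypothetical subitem type definitions for a 'jamendo' mapping class."""
    return {
        "0": {"name": "artist", "key": "artist_id"},
        "version": 1,
    }


def subitem_id(identifier, link, types):
    """Build '<identifier>-<id>' from a link dict using the type's 'key'."""
    key = types[identifier]["key"]
    return "{0}-{1}".format(identifier, link[key])
```

For example, a link of {"artist_id": "444", "name": "Foo"} under identifier "0" would give the subitem ID "0-444".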

Matchings/subitem extraction format

All of this is within the '_geordi' key at the top level of an item or subitem.

{
  "matchings": {
    "matchings" :[{
      "auto": boolean,
      "user": str,
      "timestamp": str,
      "type": str,
      "mbid": [uuid],
      "version": int
    }, ...],
    "version": int
  },
  "links": {
    "links" : {
      "<identifier>": [{     # Extracted via the 'extract_linked' method of a mapping class, passing the data for an item
        "<various>": str,    # Could be anything; the subitem mapping definition includes which
                             # key to use to identify; others are just whatever will help identify
        ...}, ...], ...}, ...,
    "version": int
  }
}
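
The rule from the Terms section (the most recent match wins, preferring non-automatic matches) could be applied to this structure roughly as follows. This is a sketch under two assumptions: the function name is hypothetical, and timestamps are taken to be ISO 8601 strings in a single timezone, so that comparing them as strings orders them chronologically.

```python
def current_matching(geordi):
    """Pick the matching considered correct from an item's '_geordi' data.

    Returns None when there are no matchings at all.
    """
    matchings = geordi.get("matchings", {}).get("matchings", [])
    if not matchings:
        return None
    # Prefer human (non-automatic) matchings when any exist.
    human = [m for m in matchings if not m["auto"]]
    candidates = human or matchings
    # Assumption: ISO 8601 timestamps sort chronologically as strings.
    return max(candidates, key=lambda m: m["timestamp"])
```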

API

All API endpoints are accessed by GET, and all return JSON in the following format:

{ "code": int,           # Same as HTTP code number
  "error": str,          # if code is not 200
  "<various>": <various> # endpoint-specific return values
}
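
A client might unwrap this envelope along these lines. The function and exception names are hypothetical; the only behaviour taken from the format above is that "code" mirrors the HTTP status and "error" is present when the code is not 200.

```python
import json


class ApiError(Exception):
    """Raised when the response envelope carries a non-200 code."""


def parse_response(body):
    """Decode a response body and return only the endpoint-specific values."""
    payload = json.loads(body)
    if payload["code"] != 200:
        raise ApiError("{0}: {1}".format(payload["code"], payload.get("error", "")))
    # Strip the envelope keys, leaving e.g. {'document': {...}}.
    return {k: v for k, v in payload.items() if k not in ("code", "error")}
```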

/api/search

Same parameters as the interface search page, but returns the results in JSON.
Parameters: type (item [default], subitem, raw), from, query, index [multivalued], subitem_index [for type=subitem], subitem_type [for type=subitem]
JSON values: 'result': search result, JSON; 'mapping': mapped data for search items
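
A sketch of building a search request URL from these parameters. The base URL and function name are placeholders, and encoding the multivalued 'index' parameter as a repeated query key is an assumption about how the endpoint reads it.

```python
from urllib.parse import urlencode


def search_url(base, query, search_type="item", indices=(), start=None):
    """Build an /api/search URL; 'base' is a placeholder for the geordi host."""
    params = [("type", search_type), ("query", query)]
    if start is not None:
        params.append(("from", start))
    # Assumption: multivalued parameters are sent as repeated keys.
    params.extend(("index", name) for name in indices)
    return base.rstrip("/") + "/api/search?" + urlencode(params)
```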

/api/item/<index>/<item>

Returns JSON for an item. No parameters.
JSON values: 'document': item JSON

/api/subitem/<index>/<subitem>

Returns JSON for a subitem. No parameters.
JSON values: 'document': subitem JSON

/api/matchitem/<index>/<item>

Register a match for an item. Authenticated users will be matched under their name as human matches; unauthenticated callers need to provide a name for the automated process being used via the 'user' parameter, and their IP will be recorded.
Parameters: type (release [default], release-group, artist, etc.), mbid [multivalued] or mbids [comma-separated], user [for automatic matches]

/api/matchsubitem/<index>/<subitem>

Register a match for a subitem. Same semantics re: authentication as above for item matches. Default type is different, however.
Parameters: type (release, release-group, artist [default], etc.), mbid [multivalued] or mbids [comma-separated], user [for automatic matches]
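
Both match endpoints take the same parameters, so one helper can build either URL. This is only a sketch: the base URL and function name are placeholders, the MBIDs are sent via the multivalued 'mbid' parameter as repeated keys (an assumption about the encoding), and 'type' is omitted unless given so that the endpoint's own default applies.

```python
from urllib.parse import quote, urlencode


def match_url(base, kind, index, identifier, mbids, match_type=None, user=None):
    """Build a /api/matchitem or /api/matchsubitem URL.

    kind is 'item' or 'subitem'; user names the automated process and is
    only needed for unauthenticated (automatic) matches.
    """
    if kind not in ("item", "subitem"):
        raise ValueError("kind must be 'item' or 'subitem'")
    path = "/api/match{0}/{1}/{2}".format(
        kind, quote(str(index), safe=""), quote(str(identifier), safe=""))
    params = []
    if match_type is not None:
        params.append(("type", match_type))
    # Assumption: the multivalued 'mbid' parameter is sent as repeated keys.
    params.extend(("mbid", mbid) for mbid in mbids)
    if user is not None:
        params.append(("user", user))
    return base.rstrip("/") + path + "?" + urlencode(params)
```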