User:Ianmcorvidae/RDF

From MusicBrainz Wiki
Revision as of 04:10, 20 February 2012 by Ianmcorvidae (section 1, disclaimers about this being a terrible draft, etc.)

This here page is almost entirely about my opinions. If you've been directed here, it's probably because you heard me say something along the lines of "Gah, this would be way easier with RDF!" in IRC or elsewhere and asked what RDF is, what sort of use of RDF I'm talking about, or why I think it would be useful.

It's quite possible (perhaps likely) that what I say here is totally off-base and unreasonable. (If it is, though, I'd love to hear about it and correct this page, so please tell me.)

HORRIBLE DRAFT FORM AT THE MOMENT, PLEASE DON'T REFERENCE OR ANYTHING

What is RDF?

In this section, I hope to explain RDF in terms of MusicBrainz. If you're not familiar with the way MusicBrainz represents data, this will probably do exactly nothing for you, and you'll probably find more typical foundational documents more useful; I'll attempt to cite a few for your use.

To start, RDF is primarily a way of representing data. It's a standard defined by the W3C (World Wide Web Consortium), created as part of the process of defining the tools for the Semantic Web. There are also some query tools and serialization formats standardized by the W3C for RDF, but those don't need to come into play in this discussion. Look to your search engine if you're interested!

RDF has, by my estimation, three core concepts that need covering, which map fairly loosely onto some existing MusicBrainz concepts (but differ in some core ways, which I'll note). URIs are sort of like MBIDs, or like our entities (Recordings, Releases, Works, etc.). Literals are sort of like a lot of the "core properties" of entities (such as names, release dates, or catalog numbers) and many link attributes (such as start/end dates, assorted checkbox-type attributes, or medium numbers). Triples are sort of like ARs.

URIs and Resources (MBIDs/Entities)

Like MusicBrainz, RDF thinks it's very important to have unique identifiers. MusicBrainz uses the MBID, but RDF uses Uniform Resource Identifiers, or URIs. URLs (Uniform Resource Locators), which you probably know about already, are the most common kind of URI; another kind is the URN (Uniform Resource Name), which you can read about elsewhere (here's a hint, though: identifiers starting with 'urn:', such as 'urn:isbn:<ISBN>', are URNs). The 'mailto:<email address>' links you've seen are URIs too, using the mailto: scheme, even though they aren't web addresses. This choice is pretty closely linked to RDF's origins as part of the web: URIs, invented for the web, already have to be unambiguous (obviously, you can't have the same URL point to two different places), so they got reused. URIs have to identify some sort of thing: RDF calls these things "Resources", while MusicBrainz generally calls them entities. Every URI identifies exactly one resource. Strictly speaking, RDF does let one resource be named by more than one URI, but for our purposes it's easiest to think of the pairing as one-to-one.
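As a rough illustration of how an MBID could be turned into a URI, here is a minimal Python sketch. The base address, the `entity_uri` helper, and the MBID used below are all assumptions made up for this example; nothing here is an official MusicBrainz scheme.

```python
import uuid

# Assumed base address, for illustration only; any stable prefix would do.
BASE = "https://musicbrainz.org"

def entity_uri(entity_type: str, mbid: str) -> str:
    """Build a URI for an entity from its type and its MBID (hypothetical helper)."""
    # MBIDs are UUIDs, so uuid.UUID rejects malformed input with ValueError
    # and str() gives back the canonical lowercase form.
    canonical = str(uuid.UUID(mbid))
    return f"{BASE}/{entity_type}/{canonical}"

# A made-up MBID, not a real artist.
example = entity_uri("artist", "12345678-1234-5678-1234-567812345678")
print(example)
```

The only point of the sketch is that a globally unique MBID plus a fixed prefix already gives you a globally unique URI.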

One thing that RDF does a bit differently than MusicBrainz is that it assigns identifiers to nearly everything. Tracks, tracklists, and mediums, which don't get to have MBIDs in NGS, would get globally-unique URI identifiers in RDF. In fact, some of our ARs would even result in distinct resources (for example, events, such as albums being recorded, would be resources themselves, with identifying URIs). Using more jargon, what RDF calls a "Resource" is a lot broader than what MusicBrainz calls an "entity". If this proliferation of identifiers seems shocking to you, don't worry – there are benefits to being able to identify anything uniquely, as you can read below!

Literals (Core Properties/Link Attributes)

Some things, however, are more fundamental than needing an identifier. Numbers, blocks of text (or short strings), dates, and an assortment of other things don't need to be assigned a URI — the number 5 is always going to be the same thing: the number 5. In MusicBrainz, the most core properties of an entity are something like literals, as are the available options for many link attribute types. The name "Perfume" doesn't need to be uniquely identified by a URI, even if it's used by a number of artists (3, last I checked). Note that I'm talking here about the string itself, rather than its association with any given artist. For those familiar with the database schema, literals are the values we stick into database cells (not the schemata, which add meaning to the values in the cells). For those who use Picard, literals are the things we put in the tags (not the tag names, which add meaning to the values in the tags).
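To make the string-versus-association distinction concrete, here is a small Python sketch. The three artist URIs are made-up placeholders under example.org, not real MusicBrainz identifiers; the point is only that the identifiers differ while the literal is one and the same value.

```python
# Hypothetical artist URIs (placeholders, not real MusicBrainz identifiers).
artist_names = {
    "http://example.org/artist/perfume-1": "Perfume",
    "http://example.org/artist/perfume-2": "Perfume",
    "http://example.org/artist/perfume-3": "Perfume",
}

# Three distinct resources, each needing its own identifier...
distinct_artists = len(artist_names)

# ...but only one distinct literal, which needs no identifier at all.
distinct_names = len(set(artist_names.values()))

print(distinct_artists, distinct_names)
```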

Triples (ARs)

Now that we've talked about how to talk about individual things, we can talk about how to connect things together, which is really the core of the thing. RDF does this with the triple: a subject, a predicate, and an object, in that order. The predicate is identified with a URI (something like how our AR types are identified with UUIDs), the subject is a resource (identified by a URI, like our entities, remember?), and the object is either a resource or a literal. You're probably starting to see how sticking arbitrary sets of three together looks a lot like MusicBrainz's AR system.

An AR as a triple, roughly, thinks of the first entity as the subject, the other entity as the object, and the AR type as the predicate. So a composition AR linking the artist Foo Bar to their work "Vagaries of Data Representation Formats" would end up as a triple of: the URI identifier for Foo Bar, the URI identifier for the composition AR type, and the URI identifier for "Vagaries of Data Representation Formats".
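The Foo Bar example might be sketched in Python like this. The `Triple` type and every URI under example.org are placeholders invented for the illustration; real identifiers would come from whatever naming scheme was actually adopted.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A minimal stand-in for an RDF triple (hypothetical type)."""
    subject: str    # URI of the first entity
    predicate: str  # URI of the AR type
    obj: str        # URI of the other entity

EX = "http://example.org/"  # placeholder namespace, not a real one

composer_ar = Triple(
    subject=EX + "artist/foo-bar",
    predicate=EX + "ar-type/composer",
    obj=EX + "work/vagaries-of-data-representation-formats",
)

# A triple unpacks into exactly three parts, in order.
subject, predicate, obj = composer_ar
print(subject, predicate, obj)
```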

What's different about triples compared with ARs is that ARs have to have entities at both ends, while a triple can have a literal as its object. If you've ever thought that an "AR to a text field" would be useful, that's putting a string literal as the object of a triple! This can also be used for some things we consider "fundamental properties", like an artist's name, which is a triple of: the URI for the artist, the URI for the 'name' predicate, and the name (a literal string such as "The Beatles"). In MusicBrainz terms: you don't need an "artist name" field if there's an Artist -> text field AR for the artist name.
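Here is a minimal Python sketch of the "AR to a text field" idea, keeping resources and literals apart with two small wrapper types. The `URI` and `Literal` classes, the `names_of` helper, and the example.org identifiers are all assumptions for illustration, not part of any real library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URI:
    """A resource, identified by its URI (hypothetical wrapper)."""
    value: str

@dataclass(frozen=True)
class Literal:
    """A plain value such as a string; it has no identifier of its own."""
    value: str

EX = "http://example.org/"  # placeholder namespace
beatles = URI(EX + "artist/the-beatles")
name = URI(EX + "predicate/name")

# The graph is just a set of (subject, predicate, object) tuples.
graph = {
    (beatles, name, Literal("The Beatles")),
}

def names_of(graph, artist):
    """Return every literal attached to `artist` by the name predicate."""
    return [o.value for (s, p, o) in graph
            if s == artist and p == name and isinstance(o, Literal)]

print(names_of(graph, beatles))
```

Nothing about the artist resource itself carries a name field; the name exists only as the object of a triple.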

Triples also get used for defining types: saying that a given URI is an identifier for an album, for example, is a triple of: the URI for the album, the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type (which defines the type of a given resource), and the URI for the 'album' type.
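A type assertion is just another triple, as this small Python sketch tries to show. The standard type predicate lives at http://www.w3.org/1999/02/22-rdf-syntax-ns#type; everything under example.org, and the `instances_of` helper, is a made-up placeholder.

```python
# The standard RDF "type" predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # placeholder namespace

triples = [
    (EX + "release/album-1", RDF_TYPE, EX + "type/Album"),
    (EX + "release/album-2", RDF_TYPE, EX + "type/Album"),
    (EX + "artist/artist-1", RDF_TYPE, EX + "type/Artist"),
]

def instances_of(triples, type_uri):
    """Return the subjects declared to have the given type, sorted by URI."""
    return sorted(s for (s, p, o) in triples
                  if p == RDF_TYPE and o == type_uri)

print(instances_of(triples, EX + "type/Album"))
```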

In fact, triples can describe everything in MusicBrainz (or perhaps everything in the world!), especially with the addition of some useful resource types like collections.

One More Thing Worth Mentioning

One other useful concept to understand, when talking about RDF, is the idea of a vocabulary/taxonomy/ontology. I'll just call these 'ontologies' from here; the distinction between a vocabulary, taxonomy, and ontology is something to leave to terrible pedants like me. An ontology, in short, is a reusable set of URIs and some descriptions (in triples) of them. Usually ontologies mostly consist of types of predicates and types for resources, and descriptions of these things. Some fundamental concepts are expressed in ontologies that everyone uses; one such example is the core RDF vocabulary, which includes the "type" predicate I used earlier. This and other core ontologies (such as RDF Schema and OWL) define concepts like domain and range of predicates, ways of defining collections and lists, subclasses, equivalence, transitivity, cardinality (which covers things like "all artists must have a name" and "all releases must have at least one medium"), and many other things.
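To give a feel for a rule like "all artists must have a name", here is a hand-rolled Python check over a list of triples. This is only a sketch of the idea: real ontologies state such constraints declaratively, as triples themselves, and the example.org URIs and the `artists_missing_name` helper are invented for the illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"        # placeholder namespace
ARTIST = EX + "type/Artist"       # hypothetical 'artist' type
NAME = EX + "predicate/name"      # hypothetical 'name' predicate

triples = [
    (EX + "artist/a", RDF_TYPE, ARTIST),
    (EX + "artist/a", NAME, "Foo Bar"),
    (EX + "artist/b", RDF_TYPE, ARTIST),  # deliberately has no name
]

def artists_missing_name(triples):
    """Return artists with no name triple, i.e. violations of the rule."""
    artists = {s for (s, p, o) in triples if p == RDF_TYPE and o == ARTIST}
    named = {s for (s, p, o) in triples if p == NAME}
    return sorted(artists - named)

print(artists_missing_name(triples))
```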

How and why might MusicBrainz use RDF?

Having read the above section (you read it if you don't know the basics of what RDF is, right?), you'll understand that RDF is a pretty general thing. As such, there are a number of different places where it could be used by MusicBrainz. Usually when I'm ranting about this I'm referring to using RDF as a data storage format, which will be what most of this section covers. The other primary way to use RDF is as a sort of API via Linked Data principles. I'll mention this usage as well, but in my opinion this is the less-useful application of RDF for MusicBrainz (despite being the one more-often tried), so I'll spend less time on it.

RDF as a data format

RDF as an API