User:Ianmcorvidae/RDF

From MusicBrainz Wiki

This here page is almost entirely about my opinions. If you've been directed here, it's probably because you heard me say something along the lines of "Gah, this would be way easier with RDF!" in IRC or elsewhere and asked what RDF is, what sort of use of RDF I'm talking about, or why I think it would be useful.

It's quite possible (perhaps likely) that what I say here is totally off-base and unreasonable. (If it is, though, I'd love to hear about it and correct this page, so please tell me.)

HORRIBLE DRAFT FORM AT THE MOMENT, PLEASE DON'T REFERENCE OR ANYTHING

What is RDF?

In this section, I hope to explain RDF in terms of MusicBrainz. If you're not familiar with the way MusicBrainz represents data, this will probably do exactly nothing for you, and you'll probably find documents like the RDF Primer more useful.

RDF is primarily a way of representing data. It's a standard defined by the W3C (World Wide Web Consortium), created as part of the process of defining the tools for the Semantic Web. There are also query tools and serialization formats built around it, which probably aren't useful here.

RDF has roughly three core concepts, which fairly loosely map to existing MusicBrainz concepts (but are different in notable ways which lead to the benefit I discuss in section 2): URIs and the closely related Resources are sort of like MBIDs and the closely related entities (Recordings, Releases, Works, etc.); Literals are sort of like the things editors type into form inputs; and Triples are sort of like ARs.

URIs and Resources (MBIDs/Entities)

Like MusicBrainz, RDF thinks it's very important to have unique identifiers. MusicBrainz uses MBIDs; RDF uses Uniform Resource Identifiers, or URIs. URLs (Uniform Resource Locators), which you probably know about already, are the most common kind of URI; another kind is the URN (Uniform Resource Name), which you can read about elsewhere (here's a hint, though: URNs use the urn: scheme, as in urn:isbn:{ISBN}, and the mailto:{email address} links you've seen use yet another URI scheme, mailto:). This choice is pretty closely linked to RDF's origins as part of the web: URIs, invented for the web, already have to be unique (obviously, you can't have the same URL point to two different places), so they got reused.

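URIs must identify some sort of identifiable "thing": RDF calls these things "Resources", roughly equivalent to what MusicBrainz generally calls "entities". The pairing works much like MBIDs and entities: every URI identifies exactly one Resource. (Strictly, RDF doesn't forbid two different URIs from identifying the same Resource, a bit like an entity keeping its old MBIDs as redirects after a merge, but a single URI never identifies two different Resources.)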
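To make the MBID/URI comparison a bit more concrete, here's a minimal sketch of turning an MBID into a URI. It assumes the URI is simply the familiar https://musicbrainz.org/{entity type}/{MBID} page address; using that address as the RDF identifier is my assumption for illustration, and mbid_to_uri is a made-up helper, not anything the server provides.

```python
import uuid

# Assumed base address, for illustration only; not an official RDF namespace.
BASE = "https://musicbrainz.org"


def mbid_to_uri(entity_type: str, mbid: str) -> str:
    """Build a URI for an entity from its type and MBID (hypothetical helper)."""
    # MBIDs are UUIDs, so uuid.UUID both validates the value and
    # normalises it to the lowercase, hyphenated form.
    normalised = str(uuid.UUID(mbid))
    return f"{BASE}/{entity_type}/{normalised}"


# The MBID of the artist The Beatles, deliberately given in upper case.
beatles_uri = mbid_to_uri("artist", "B10BBBFC-CF9E-42E0-BE17-E2C3E1D2600D")
print(beatles_uri)
```

Anything that isn't a well-formed UUID raises ValueError, so a malformed MBID can never turn into an identifier.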

One thing that RDF does a bit differently than MusicBrainz is that it assigns identifiers to, functionally, everything. Tracks, tracklists, and mediums, which don't get to have MBIDs, would get globally-unique URI identifiers in RDF. Every AR type would have a URI identifier. In fact, many of our ARs would create "intermediate entities" of a sort (for example, an event, such as an album being recorded, would be a resource). Using more jargon: what RDF calls a "Resource" is a lot broader than what MusicBrainz calls an "entity".

Literals (Input values)

Some things are more fundamental than needing an identifier. Numbers, blocks of text (or short strings), dates, and an assortment of other things don't need to be assigned a URI — the number 5 is always going to be the same thing: the number 5. In MusicBrainz, the things we type into fields are like literals: the piece of text "The Beatles" is a different thing from the artist that piece of text is the name of. Literals are what Picard puts into tags (that can't be interpreted without the tag names/the mapping) and what the server puts into cells in the database (that can't be interpreted without the schema).

Triples (ARs)

Now that we know how to talk about individual things, we can talk about relating things, which is the more interesting part. RDF does this with the triple: an ordered group of a subject, a predicate, and an object. The predicate is identified with a URI (something like how our AR types are identified with UUIDs); the subject is a resource identified by a URI (like our entities, remember?), and the object is either another resource or a literal. You're probably starting to see how sticking arbitrary groups of three together looks a lot like MusicBrainz's AR system.

An AR as a triple, roughly, thinks of the first entity as the subject, the other entity as the object, and the AR type as the predicate. So a composition AR linking the artist Foo Bar to their work Vagaries of Data Representation Formats would end up as a triple of: the URI identifier for Foo Bar, the URI identifier for the composition AR type, and the URI identifier for Vagaries of Data Representation Formats.
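The Foo Bar example can be sketched in a few lines of plain Python. Every URI here is invented under example.org purely for illustration, and Triple and triples_from are hypothetical names, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A subject, predicate, and object, each given as a URI string."""
    subject: str
    predicate: str
    obj: str


# Made-up URIs standing in for the artist, the AR type, and the work.
FOO_BAR = "https://example.org/artist/foo-bar"
COMPOSER_OF = "https://example.org/ar-type/composer"
WORK = "https://example.org/work/vagaries-of-data-representation-formats"

# The composition AR: first entity, AR type, other entity.
composition = Triple(FOO_BAR, COMPOSER_OF, WORK)


def triples_from(subject: str, triples: list[Triple]) -> list[Triple]:
    """Return every triple whose subject is the given URI."""
    return [t for t in triples if t.subject == subject]


print(triples_from(FOO_BAR, [composition]))
```

Looking up "all ARs starting at this entity" is then just a filter on the subject position.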

What's different about triples compared with ARs is that ARs have to have entities at both ends, while a triple can have a literal – if you've ever thought that an "AR to a text field" would be useful, that's putting a string literal as the object of a triple! This can also be used for some things we consider "fundamental properties", like an artist's name, which is a triple of: the URI for the artist, the URI for the 'name' predicate, and the name (a literal string such as "The Beatles"). In MusicBrainz terms: you don't need an "artist name" field if there's an Artist -> text field AR for the artist name.
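Here's a rough sketch of the artist-name triple, written out in the line-based N-Triples format (URIs in angle brackets, literals in double quotes, a full stop at the end). The URI and Literal classes and both helper functions are hypothetical; the artist URI assumes the MusicBrainz page address doubles as the identifier, and borrowing FOAF's name predicate is my choice for the example, not something MusicBrainz defines. The escaping is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


def term_to_nt(term) -> str:
    """Render one term: <...> for a URI, "..." for a string literal."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    # Simplified escaping: backslashes, double quotes, and newlines only.
    escaped = (
        term.value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def triple_to_nt(subject: URI, predicate: URI, obj) -> str:
    """Render a whole triple as one N-Triples line."""
    return f"{term_to_nt(subject)} {term_to_nt(predicate)} {term_to_nt(obj)} ."


artist = URI("https://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d")
name = URI("http://xmlns.com/foaf/0.1/name")  # assumed choice of predicate
line = triple_to_nt(artist, name, Literal("The Beatles"))
print(line)
```

The only thing separating an ordinary AR from an "AR to a text field" is which kind of term sits in the object position.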

Triples also get used for defining types: saying that a given URI is an identifier for an album, for example, is a triple of: the URI for the album, the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type (which defines the type of a given resource), and the URI for the 'album' type.
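A small sketch of type triples in use: finding everything that has been declared an album. The type predicate's URI is the one from the RDF specification (http://www.w3.org/1999/02/22-rdf-syntax-ns#type); the album, artist, and type URIs are invented under example.org, and resources_of_type is a hypothetical helper.

```python
# The standard RDF type predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Made-up type URIs, for illustration only.
ALBUM = "https://example.org/type/Album"
ARTIST = "https://example.org/type/Artist"

# Each triple is a plain (subject, predicate, object) tuple.
triples = [
    ("https://example.org/release/abbey-road", RDF_TYPE, ALBUM),
    ("https://example.org/release/revolver", RDF_TYPE, ALBUM),
    ("https://example.org/artist/the-beatles", RDF_TYPE, ARTIST),
]


def resources_of_type(type_uri, triples):
    """Return the subjects of every type triple pointing at type_uri."""
    return [s for s, p, o in triples if p == RDF_TYPE and o == type_uri]


albums = resources_of_type(ALBUM, triples)
print(albums)
```

Nothing about the type triples is special: they sit in the same list as every other triple and are found with the same kind of filter.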

In fact, triples can describe everything in MusicBrainz (or perhaps everything in the world!), especially with the addition of some useful resource types like collections.

One More Thing Worth Mentioning: Ontologies

One other useful concept to understand, when talking about RDF, is the idea of a vocabulary/taxonomy/ontology. I'll just call these 'ontologies' from here; distinguishing between vocabularies, taxonomies, and ontologies is best left to pedants (like me). An ontology, in short, is a reusable set of Resources and some descriptions of them (expressed in RDF, of course). Usually ontologies consist of types of predicates and types for resources, and descriptions of these things. Some fundamental concepts are expressed in some core ontologies; one such example is the core RDF vocabulary, which includes the "type" predicate used for typing resources above, and another is the RDF Schema ontology built on top of it. These and other core ontologies (OWL, for instance) define concepts like domain and range of predicates, ways of defining collections and lists, subclasses, equivalence, transitivity, cardinality (which covers things like "all artists must have a name" and "all releases must have at least one medium"), and many other things. Ontologies tend to be referred to using namespace prefixes, such that http://www.w3.org/1999/02/22-rdf-syntax-ns#type is usually just called rdf:type.
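Namespace prefixes are nothing more than string substitution, which a short sketch can show. The two namespace URIs are the real ones for the core RDF vocabulary and RDF Schema; the expand and compact functions are hypothetical helpers written for this page.

```python
# Prefix -> namespace URI for two core vocabularies.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'rdf:type' into its full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


def compact(uri: str) -> str:
    """Turn a full URI into a prefixed name when a known namespace matches."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known namespace: leave the URI as it is


print(expand("rdfs:subClassOf"))
print(compact("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"))
```

An unknown prefix raises KeyError in expand, while compact simply hands back a URI it doesn't recognise.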

In some ways, ontologies in RDF are like schemas in more traditional systems like MusicBrainz, but there's nothing stopping anyone from using a predicate or type that's not defined by an ontology, where schemas prevent using new things without first changing the schema.

How and why might MusicBrainz use RDF?

Having read the above section (you read it if you don't know the basics of what RDF is, right?), you'll understand that RDF is a pretty general thing. As such, there are a number of different places where it could be used by MusicBrainz. Usually when I'm ranting about this I'm referring to using RDF as a data storage format, which will be what most of this section covers. The other primary way to use RDF is as a sort of API via Linked Data principles. I'll mention this usage as well, but in my opinion this is the less useful application of RDF for MusicBrainz (despite being the one more often tried), so I'll spend less time on it.

RDF as a data format

RDF as an API