User:Ianmcorvidae/RDF

From MusicBrainz Wiki

This here page is almost entirely about my opinions. If you've been directed here, it's probably because you heard me say something along the lines of "Gah, this would be way easier with RDF!" in IRC or elsewhere and asked what RDF is, what sort of use of RDF I'm talking about, or why I think it would be useful.

It's quite possible (perhaps likely) that what I say here is totally off-base and unreasonable. (If it is, though, I'd love to hear about it and correct this page, so please tell me.)

What is RDF?

Don't fear, RDF is here to help!

In this section, I hope to explain RDF in terms of MusicBrainz. If you're not familiar with the way MusicBrainz represents data, this will probably do exactly nothing for you, and you'll probably find documents like the RDF Primer more useful.

RDF is primarily a way of representing data. It's a standard defined by the W3C (World Wide Web Consortium), created as part of the process of defining the tools for the Semantic Web. There are also some query tools and serialization formats, which probably aren't useful here.

RDF has roughly three core concepts, which fairly loosely map to existing MusicBrainz concepts (but are different in notable ways which lead to the benefit I discuss in section 2): URIs and the closely related Resources are sort of like MBIDs and the closely related entities (Recordings, Releases, Works, etc.); Literals are sort of like the things editors type into form inputs; and Triples are sort of like ARs.

URIs and Resources (MBIDs/Entities)

Like MusicBrainz, RDF thinks it's very important to have unique identifiers. MusicBrainz uses MBIDs; RDF uses Uniform Resource Identifiers, or URIs. URLs (Uniform Resource Locators), which you probably know about already, are the most common kind of URI. Another kind is the URN (Uniform Resource Name), which uses the urn: scheme and which you can read about elsewhere. (Here's a hint that URIs are broader than web addresses, though: the mailto:{email address} links you've seen are URIs too, using the mailto: scheme.) This choice is pretty closely linked to RDF's origins as part of the web: URIs, invented for the web, already have to be unambiguous (obviously, you can't have the same URL point to two different places), so they got reused.

URIs must identify some sort of identifiable "thing": RDF calls these things "Resources", roughly equivalent to what MusicBrainz generally calls "entities". Every URI identifies exactly one Resource. The reverse isn't strictly guaranteed: the same Resource can end up with more than one URI (which is what the owl:sameAs predicate mentioned later is for), though a carefully maintained dataset can keep the pairing effectively one-to-one.

One thing that RDF does a bit differently than MusicBrainz is that it assigns identifiers to, functionally, everything. Tracks, tracklists, and mediums, which don't get to have MBIDs, would get globally-unique URI identifiers in RDF (or blank node identifiers -- but changing these into URIs is an in-place operation that almost by definition can't affect any code). Every AR type would have an identifier. In fact, many of our ARs would create "intermediate entities" of a sort (for example, an event, such as an album being recorded, would be a resource). Using more jargon: what RDF calls a "Resource" is a lot broader than what MusicBrainz calls an "entity".

Literals (Input values)

Some things are more fundamental than needing an identifier. Numbers, blocks of text (or short strings), dates, and an assortment of other things don't need to be assigned a URI — the number 5 is always going to be the same thing: the number 5. In MusicBrainz, the things we type into fields are like literals: the piece of text "The Beatles" is a different thing from the artist that piece of text is the name of. Literals are what Picard puts into tags (that can't be interpreted without the tag names/the mapping) and what the server puts into cells in the database (that can't be interpreted without the schema).

Triples (ARs)

Now that we know how to talk about individual things, we can talk about relating things, which is the more interesting part. RDF does this with the triple: an ordered group of a subject, a predicate, and an object. The predicate is identified with a URI (something like how our AR types are identified with UUIDs). The subject is a resource, identified by a URI or a blank node (like our entities, remember?), and the object is either a resource or a literal. You're probably starting to see how sticking arbitrary groups of three together looks a lot like MusicBrainz's AR system.

An AR as a triple, roughly, thinks of the first entity as the subject, the other entity as the object, and the AR type as the predicate. So a composition AR linking the artist Foo Bar to their work Vagaries of Data Representation Formats would end up as a triple of: the URI identifier for Foo Bar, the URI identifier for the composition AR type, and the URI identifier for Vagaries of Data Representation Formats.
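To make that concrete, here's a rough Python sketch of the composition example, with a plain tuple standing in for a real triple store. The example.org URIs and the composer-of predicate name are made up for illustration; they aren't real MusicBrainz identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the first entity
    predicate: str  # URI of the AR type
    obj: str        # URI of the other entity


# Hypothetical URIs, invented for this example only.
FOO_BAR = "http://example.org/artist/foo-bar"
COMPOSER_OF = "http://example.org/ar-type/composer-of"
VAGARIES = "http://example.org/work/vagaries-of-data-representation-formats"

# The composition AR, expressed as a single triple.
composition = Triple(subject=FOO_BAR, predicate=COMPOSER_OF, obj=VAGARIES)
```

Nothing about the tuple is specific to composition: any other AR type would just be a different URI in the middle slot.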

What's different about triples compared with ARs is that ARs have to have entities at both ends, while a triple can have a literal – if you've ever thought that an "AR to a text field" would be useful, that's putting a string literal as the object of a triple! This can also be used for some things we consider "fundamental properties", like an artist's name, which is a triple of: the URI for the artist, the URI for the 'name' predicate, and the name (a literal string such as "The Beatles"). In MusicBrainz terms: you don't need an "artist name" field if there's an Artist -> text field AR for the artist name.
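A small Python sketch of the artist-name case might look like the following. The URI and Literal classes are simplified stand-ins (not any particular library's API), and the example.org URIs are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    """An identifier for a resource."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain value, such as a piece of text, that needs no identifier."""
    value: str


# Hypothetical URIs, invented for this example only.
THE_BEATLES = URI("http://example.org/artist/the-beatles")
NAME = URI("http://example.org/predicate/name")

# "An AR to a text field": the object is a literal, not another entity.
name_triple = (THE_BEATLES, NAME, Literal("The Beatles"))


def is_text_ar(triple):
    """Return True when the object of the triple is a literal."""
    return isinstance(triple[2], Literal)
```

Keeping URI and Literal as separate types mirrors the point above: the text "The Beatles" is a different thing from the artist it names.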

Triples also get used for defining types: saying that a given URI is an identifier for an album, for example, is a triple of: the URI for the album, the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type (which states the type of a given resource), and the URI for the 'album' type.
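As a sketch of how type triples get used, the Python below finds every resource declared to be of a given type. The predicate is the standard rdf:type URI; the example.org URIs and the Album and Artist type names are assumptions made up for the example.

```python
# The standard rdf:type predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical type and resource URIs, invented for this example only.
ALBUM = "http://example.org/type/Album"
ARTIST = "http://example.org/type/Artist"

triples = {
    ("http://example.org/release-group/abbey-road", RDF_TYPE, ALBUM),
    ("http://example.org/artist/the-beatles", RDF_TYPE, ARTIST),
}


def resources_of_type(triples, type_uri):
    """Return the sorted URIs of all resources declared to be of `type_uri`."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == type_uri)


albums = resources_of_type(triples, ALBUM)
```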

In fact, triples can describe everything in MusicBrainz (or perhaps everything in the world!), especially with the addition of some useful resource types like collections.

One More Thing Worth Mentioning: Ontologies

One other useful concept to understand, when talking about RDF, is the idea of a vocabulary/taxonomy/ontology. I'll just call these 'ontologies' from here; distinguishing between vocabularies, taxonomies, and ontologies is best left to pedants (like me). An ontology, in short, is a reusable set of Resources and some descriptions of them (expressed in RDF, of course). Usually ontologies consist of predicates and types for resources, and descriptions of these things. Some fundamental concepts are expressed in a few core ontologies; the most basic is the RDF vocabulary itself, which includes the "type" predicate I used above, and another is the RDF Schema ontology, which builds on it. These and other core ontologies define concepts like domain and range of predicates, ways of defining collections and lists, subclasses, equivalence, transitivity, cardinality (which covers things like "all artists must have a name" and "all releases must have at least one medium"), and many other things. Ontologies tend to be referred to using namespace prefixes, such that http://www.w3.org/1999/02/22-rdf-syntax-ns#type is usually just called rdf:type and http://www.w3.org/2000/01/rdf-schema#domain is usually just called rdfs:domain.
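Namespace prefixes are nothing more than string abbreviation, which a few lines of Python can show. The rdf, rdfs, and owl namespaces below are the standard ones; the mb prefix and its example.org namespace are hypothetical.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "mb": "http://example.org/mb/",  # hypothetical MusicBrainz namespace
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'rdf:type' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


def shorten(uri):
    """Turn a full URI into a prefixed name when a known namespace matches."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known namespace: leave the URI as it is
```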

In some ways, ontologies in RDF are like schemas in more traditional systems like MusicBrainz, but there's nothing stopping anyone from using a predicate or type that's not defined by an ontology, where schemas prevent using new things without first changing the schema.

How and why might MusicBrainz use RDF?

Having read the above section (you read it if you don't know the basics of what RDF is, right?), you'll understand that RDF is a pretty general thing. As such, there are a number of different places where it could be used by MusicBrainz. Usually when I'm ranting about this I'm referring to using RDF as a data storage format, which will be what most of this section covers. The other primary way to use RDF is as a sort of API via Linked Data principles. I'll mention this usage as well, but in my opinion this is the less-useful application of RDF for MusicBrainz (despite being the one more-often tried), so I'll spend less time on it.

RDF as a data format

Let me first clarify that what I mean by 'data format' is 'database format' – I'm talking about using RDF as the way we store data. This would involve switching out our current database and putting in its place a specialized RDF-optimized storage facility called a triple store. This infrastructure-change burden is probably the single biggest legitimate reason, in my mind, for us not to switch wholesale to RDF starting tomorrow.

So: what would be the upsides and downsides of using RDF for our base-level storage and representation of data? To explore this, unsurprisingly, we'll look at the differences I mentioned in the previous section between the MusicBrainz conceptual data model and RDF. To summarize:

  1. RDF assigns identifiers to Just About Everything (which is to say: everything except literals)
  2. RDF lacks the restriction of needing schema change for changes in data structure (ontologies are descriptive, not prescriptive)
  3. RDF allows more than just entities as the "endpoints" of its analogy to ARs (triples)
  4. RDF stores everything as URIs and triples

Let's explore these in order.

Identifiers for Just About Everything

Since RDF's concept of the Resource is much more general than our current concept of an entity, many more things would end up with identifiers: tracks, mediums, tracklists, etc. Certainly at least everything that's currently a table in our database (with the exception of some small, optimization-oriented tables like artist_name, and some likely simplification of the AR system) would end up as an identifiable resource. There is a potential downside to this, in that it probably means people start using these identifiers (at least, if they're actual URIs rather than blank node identifiers). However, never once in my time as a contributor to MusicBrainz have I heard anyone ask for fewer things to be uniquely identifiable as public resources. Rather, I hear frequent requests to bring back track MBIDs, to be able to uniquely identify mediums and tracklists, and requests to reveal more of the unique identifiers we already have, such as UUIDs for AR types. While it might be more work initially to make sure identifiers are assigned in a reasonable way and that recommended practices are carefully documented (for example, recommending against referring to some kinds of identifiers from the "outside world" – though if they don't appear in webservice output it's not likely they'll be used), we would be a lot less likely to have the problem of something needing a unique identifier and finding we don't really have anything satisfactory.

Flexible and Schema-Free

Although we'd probably want to define an ontology describing our use of terms and assigning them to a namespace (for internal use and documentation if nothing else!), changing this or adding things wouldn't be nearly as arduous as schema changes are now, because the ontology is merely extra information about terms, not a technical prerequisite for storing information using those terms. A schema change to add a column must be applied before anything can add data to or use data from that column; if a replicated server doesn't have the right schema, it simply cannot run certain code. If an ontology is out of date on a replicated server, though, the worst likely situation is that the server lacks some cardinality information and thus isn't necessarily checking for required elements – but since it's replicated, all that data is coming from the main server, which will be doing those checks anyway. Amusingly, RDF ontologies are also RDF, so they could be put into the same RDF store as the data, and changes to them could be replicated alongside the data. So, not only are schema sequence numbers and potentially broken schema upgrade processes unnecessary, but what vestigial analogs to them remain are easier to automate.
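Here's a minimal Python sketch of the "ontology is just more data" idea: a rule like "all artists must have a name" sits in the same store as the data, and a check reads it from there. The REQUIRES predicate is a made-up stand-in for real cardinality vocabulary, the example.org URIs are invented, and literals are simplified to plain Python strings.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical URIs. REQUIRES is an invented stand-in for real
# cardinality vocabulary, chosen only to keep the example short.
ARTIST = "http://example.org/type/Artist"
NAME = "http://example.org/predicate/name"
REQUIRES = "http://example.org/ontology/requiresPredicate"

store = {
    # Ontology statement, stored right next to the data it describes.
    (ARTIST, REQUIRES, NAME),
    ("http://example.org/artist/1", RDF_TYPE, ARTIST),
    ("http://example.org/artist/1", NAME, "The Beatles"),
    ("http://example.org/artist/2", RDF_TYPE, ARTIST),  # no name yet
}


def missing_required(store):
    """Return sorted (resource, predicate) pairs where a required predicate is absent."""
    required = {(s, o) for s, p, o in store if p == REQUIRES}
    present = {(s, p) for s, p, _ in store}
    problems = []
    for resource, p, type_uri in store:
        if p != RDF_TYPE:
            continue
        for cls, predicate in required:
            if cls == type_uri and (resource, predicate) not in present:
                problems.append((resource, predicate))
    return sorted(problems)
```

If the ontology statement is missing (say, on an out-of-date replica), the check simply finds nothing to enforce; the data itself still loads and reads fine.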

This schema-freeness is also what makes me gloss over the difference between URIs and blank nodes -- since blank node identifiers aren't ever going to be referenced directly anyway (largely can't be, in fact), changing them wholesale to real URIs is possible whenever it's decided that something should have an external identifier. It might make for a large replication packet, but that's the extent of it.
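The wholesale swap could be sketched in Python roughly like this. Blank node identifiers are written "_:label" as in N-Triples; the tracklist data and example.org URIs are hypothetical.

```python
def promote_blank_nodes(triples, mapping):
    """Return a copy of `triples` with blank node ids swapped for URIs.

    `mapping` maps blank node ids (written "_:label") to the URIs
    that should replace them, in subject and object position alike.
    """
    def swap(term):
        return mapping.get(term, term)

    return {(swap(s), p, swap(o)) for s, p, o in triples}


# Hypothetical data: a tracklist that only has a blank node id so far.
HAS_TRACKLIST = "http://example.org/predicate/tracklist"
TRACK_COUNT = "http://example.org/predicate/trackCount"
before = {
    ("http://example.org/medium/1", HAS_TRACKLIST, "_:tracklist1"),
    ("_:tracklist1", TRACK_COUNT, 12),
}
after = promote_blank_nodes(
    before, {"_:tracklist1": "http://example.org/tracklist/1"}
)
```

The number of triples and their shape stay the same; only the identifier changes.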

In case anyone's worried, it's still possible to make some arbitrary divisions in our data (like, say, not including editor passwords in our data dumps and replication packets) by way of a feature called named graphs.
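One simple way to picture named graphs is as a fourth element on each statement saying which graph it belongs to. The Python sketch below uses that model; the graph names, predicates, and values are all hypothetical.

```python
# Each statement is a quad: (subject, predicate, object, graph name).
# All URIs and values here are invented for this example only.
PUBLIC = "http://example.org/graph/public"
PRIVATE = "http://example.org/graph/private"
EDITOR = "http://example.org/editor/1"

quads = [
    (EDITOR, "http://example.org/predicate/name", "some_editor", PUBLIC),
    (EDITOR, "http://example.org/predicate/passwordHash", "not-a-real-hash", PRIVATE),
]


def export_graph(quads, graph):
    """Return the triples of one named graph, dropping the graph name."""
    return [(s, p, o) for s, p, o, g in quads if g == graph]


# A data dump would only ever be built from the public graph.
dump = export_graph(quads, PUBLIC)
```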

More Than Just Entities

Triples, unlike our current ARs, don't require both ends to be entities. This means, first of all, that we get the capability for ARs to literal data – differentiated comments on different subtopics, version numbers, boolean or enumeration-type flags, etc. In combination with the ease of data structure change mentioned above, this also happens to allow for ARs with any number of endpoints and any number and type of attributes (since it makes AR endpoints and attributes fundamentally the same thing).

Any AR with attributes (all of them, since dates are available for all AR types) would in fact have to be migrated as a type of resource as well as a predicate. For example, the '{additional} {guest} {solo} {instrument}' AR would be translated as, say: a PerformanceEvent resource, linked to the Artist with a performedBy relation and to the Recording with a performance relation, with the attributes expressed as additional, guest, and solo relations to literal booleans and an instrument relation to a Resource of type Instrument or, if we're lazy, to a literal string with the instrument's name (but just making the Resource/entity for an instrument would probably be easier, especially since it's a requested feature). We probably wouldn't need to keep all those duplicate, very-slightly-different ARs based on different attributes either, which is a nice reduction in conceptual complexity. Oh, did I mention that lots of other requested new entities are also both easier and possibly the best way to migrate to RDF?
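The performance example might come out something like the Python sketch below. The predicate and type names follow the ones suggested above, but the example.org namespace, the artist, the recording, and the instrument are invented, and booleans are simplified to plain Python values.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace

event = "_:performance1"  # blank node for the intermediate resource
triples = [
    (event, RDF_TYPE, EX + "type/PerformanceEvent"),
    (event, EX + "performedBy", EX + "artist/foo-bar"),
    (event, EX + "performance", EX + "recording/some-recording"),
    (event, EX + "additional", False),
    (event, EX + "guest", True),
    (event, EX + "solo", True),
    (event, EX + "instrument", EX + "instrument/guitar"),
]


def attributes(triples, subject):
    """Collect every predicate/object pair attached to one resource."""
    result = {}
    for s, p, o in triples:
        if s == subject:
            result.setdefault(p, []).append(o)
    return result


performance = attributes(triples, event)
```

The two "endpoints" and the four attributes end up as the same kind of statement about the event, which is why adding another endpoint or attribute needs no structural change.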

There are plenty of other things that move much more to the relationship-editor end of the task list, rather than the developer end, like sortnames for more entities or a subtitle field, which would just be new predicates (AR types) to text strings. More-complicated things like location and date fields for entities very efficiently reuse the aforementioned migrations of AR attributes as new resources.

Oh, did I mention there's a predicate owl:sameAs that very nicely links to other names for the same resource?
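As a sketch of what that buys us, the Python below collects every known name for a resource by following owl:sameAs links in both directions. The predicate URI is the standard one; the example.org URIs are invented.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def aliases(triples, uri):
    """Return every URI linked to `uri` by owl:sameAs, in either direction."""
    seen = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != OWL_SAME_AS:
                continue
            if s == current and o not in seen:
                seen.add(o)
                frontier.append(o)
            elif o == current and s not in seen:
                seen.add(s)
                frontier.append(s)
    return seen


# Hypothetical URIs, invented for this example only.
A = "http://example.org/artist/the-beatles"
B = "http://other.example.org/resource/The_Beatles"
C = "http://third.example.org/id/12345"
triples = [
    (A, OWL_SAME_AS, B),
    (C, OWL_SAME_AS, B),
    (A, "http://example.org/predicate/name", "The Beatles"),
]
```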

Everything Looks The Same

Everything looks the same means that: everything is identified by URI (and URI only: no duality of gids versus database row numbers) and every relationship between bits of data is expressed as triples and perhaps some intermediate entities. Anyone who's ever written some Lisp understands how nice uniform representation can be, and how powerful. Anyone who's ever dealt with heterogeneous sources of information (say, anyone who's ever hunted down sources for an edit, or written citations for a research paper) understands that having things in the same format makes things more reusable. And when specialized interfaces haven't been written yet, having everything look the same means a general interface can be written as a fallback, letting features exist for advanced-user use as soon as possible and letting the developers structure their time more freely.
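A general fallback interface can be very small precisely because everything is a triple. The Python sketch below renders whatever is known about any resource without knowing what kind of thing it is; the data and example.org URIs are hypothetical.

```python
def describe(triples, uri):
    """Render every statement about `uri` as sorted 'predicate: object' lines."""
    lines = [f"{p}: {o}" for s, p, o in triples if s == uri]
    return "\n".join(sorted(lines))


# Hypothetical data about two quite different kinds of resource.
ARTIST_1 = "http://example.org/artist/1"
MEDIUM_1 = "http://example.org/medium/1"
triples = [
    (ARTIST_1, "http://example.org/predicate/name", "The Beatles"),
    (MEDIUM_1, "http://example.org/predicate/position", 1),
    (MEDIUM_1, "http://example.org/predicate/format", "CD"),
]
```

The same function works for the artist and the medium alike, and would keep working for any resource type added later.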

Need I even mention that URIs being a basic currency of the underlying database makes handling URLs a lot easier?

Downsides

Lest I seem completely unabashedly in favor of an immediate switch (which, admittedly, isn't that far off – but still), let's review some of the downsides of using RDF.

  • Unusual: the old "nobody got fired for buying IBM" mantra – RDF isn't the way most people do these things (even if I think they should), and that means a number of problems, such as the one where anything that goes wrong is likely to get blamed on RDF whether or not it's involved, the one where it's a lot harder to sell (and we do need more customers), and the one where fewer people know how it works
  • Continuing from there, I don't think our core devs know much of anything about RDF, so a lot of learning would need to happen if MusicBrainz doesn't hire me (which, much as I'd love it, probably isn't something MusicBrainz has the money to do)
  • And likewise: our data customers might be less than pleased to have to set up a triplestore and such, no matter how nice we make it. More dependencies is more dependencies, and putting sysadmin burden on our customers isn't always a great plan. That being said, musicbrainz-server requires a lot of sysadmin burden already, so it's not necessarily hurting all that much.
  • We get to rewrite... oh, yeah, a WHOLE BUNCH OF CODE. Like, at the very least all our data-access. And then some, likely. It's hard to know how much support there is for what we do, at a technical level, in existing triple stores – I imagine there's something like triggers, for example, but I honestly don't know. Perl bindings? Who knows.

If we're moving this way anyway (which I'd argue we are, even if we don't use RDF specifically – this sort of generality is the direction NGS is a shift towards), of course, it's just a question of timing: when are these downsides minimized? Probably the fourth one will become harder over time as more code gets written in the current system, while the other three have the potential to become easier if we wait. The benefits only come when the switch happens, though, which is another implied downside of any waiting. Nevertheless, for now this only serves as a reference page and not as a proposal – perhaps in the future these issues will make RDF, or a similar system, a viable idea.

RDF as an API

Most people think of RDF more in line with its original goals: as a way of sharing information. That's what this section is about. The recent Linked Data movement is all about that, and in some ways MusicBrainz was one of the original parts of it (with our ancient RDF Web Service, now obsolete due to Doing Things Wrong in many ways). Our recent work with RDFa and LinkedBrainz is more Done Right, but it aims to be interoperable – which, while desirable for an API, limits our flexibility, since the project is based on the Music Ontology, which is beyond our control. The notion of RDF as an API technology, while ancient news in the MusicBrainz community, is also not well-known or respected in the outside world. Our XML Web Service is much more normal for consumers of that data; this, it is my understanding, is part of the reason the RDF Web Service was discontinued. Really, this is the part of the RDF ecosystem that's being developed now: how to interoperate and share. RDF was designed to be shared and interoperable, but thus far the fact is it's seen more use as a pure representational format, meaning that using RDF as an API has even more of the unusualness and instability problems I mentioned above.

In short: When I say I want RDF, I don't mean at this level for the most part. I'd love to see it there too, but as experimental efforts like LinkedBrainz that flex the muscles of the system without committing the project to a potentially-soon-obsolete methodology. I want RDF at a data level, where it's well-tested and admired in many communities, and where it's easier for MusicBrainz to use it as it's useful for us without worrying about interoperability (just as we don't worry about it with our current database system).