User:Ianmcorvidae/RDF: Difference between revisions

Revision as of 07:10, 20 February 2012

This here page is almost entirely about my opinions. If you've been directed here, it's probably because you heard me say something along the lines of "Gah, this would be way easier with RDF!" in IRC or elsewhere and asked what RDF is, what sort of use of RDF I'm talking about, or why I think it would be useful.

It's quite possible (perhaps likely) that what I say here is totally off-base and unreasonable. (If it is, though, I'd love to hear about it and correct this page, so please tell me.)

What is RDF?

In this section, I hope to explain RDF in terms of MusicBrainz. If you're not familiar with the way MusicBrainz represents data, this will probably do exactly nothing for you, and you'll probably find documents like the RDF Primer more useful.

RDF is primarily a way of representing data. It's a standard defined by the W3C (World Wide Web Consortium), created as part of the process of defining the tools for the Semantic Web. There are also query tools and serialization formats, which probably aren't relevant here.

RDF has roughly three core concepts, which fairly loosely map to existing MusicBrainz concepts (but are different in notable ways which lead to the benefit I discuss in section 2): URIs and the closely related Resources are sort of like MBIDs and the closely related entities (Recordings, Releases, Works, etc.); Literals are sort of like the things editors type into form inputs; and Triples are sort of like ARs.

URIs and Resources (MBIDs/Entities)

Like MusicBrainz, RDF thinks it's very important to have unique identifiers. MusicBrainz uses MBIDs; RDF uses Uniform Resource Identifiers, or URIs. URLs (Uniform Resource Locators), which you probably know about already, are the most common kind of URI. Another kind is the URN (Uniform Resource Name), which you can read about elsewhere. Not every URI is a web address, either: the mailto:{email address} links you've seen use the mailto: URI scheme, for example. This choice is pretty closely linked to RDF's origins as part of the web: URIs, invented for the web, already have to be unique (obviously, you can't have the same URL point to two different places), so they got reused.

URIs must identify some sort of identifiable "thing": RDF calls these things "Resources", roughly equivalent to what MusicBrainz generally calls "entities". Every URI identifies exactly one Resource. Strictly speaking the reverse doesn't hold (a Resource can end up with more than one URI, which is what predicates like owl:sameAs are for), but within a single dataset it's usually safe to think of the pairing as one-to-one.

One thing that RDF does a bit differently than MusicBrainz is that it assigns identifiers to, functionally, everything. Tracks, tracklists, and mediums, which don't get to have MBIDs, would get globally-unique URI identifiers in RDF. Every AR type would have a URI identifier. In fact, many of our ARs would create "intermediate entities" of a sort (for example, an event, such as an album being recorded, would be a resource). Using more jargon: what RDF calls a "Resource" is a lot broader than what MusicBrainz calls an "entity".

Literals (Input values)

Some things are more fundamental than needing an identifier. Numbers, blocks of text (or short strings), dates, and an assortment of other things don't need to be assigned a URI — the number 5 is always going to be the same thing: the number 5. In MusicBrainz, the things we type into fields are like literals: the piece of text "The Beatles" is a different thing from the artist that piece of text is the name of. Literals are what Picard puts into tags (that can't be interpreted without the tag names/the mapping) and what the server puts into cells in the database (that can't be interpreted without the schema).

Triples (ARs)

Now that we know how to talk about individual things, we can talk about relating things, which is the more interesting part. RDF does this with the triple: an ordered group of a subject, a predicate, and an object. The predicate is identified with a URI (something like how our AR types are identified with UUIDs). The subject is a resource, normally identified by a URI (like our entities, remember?), and the object is either another resource or a literal. You're probably starting to see how sticking arbitrary sets of three together looks a lot like MusicBrainz's AR system.

An AR as a triple, roughly, thinks of the first entity as the subject, the other entity as the object, and the AR type as the predicate. So a composition AR linking the artist Foo Bar to their work Vagaries of Data Representation Formats would end up as a triple of: the URI identifier for Foo Bar, the URI identifier for the composition AR type, and the URI identifier for Vagaries of Data Representation Formats.
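To make that concrete, here's a minimal sketch in Python using plain tuples rather than any real RDF library. The example.org URIs are placeholders I've made up for illustration; they are not real MusicBrainz identifiers, and the output format is only meant to resemble the N-Triples serialization.

```python
# Hypothetical URIs: none of these are real MusicBrainz identifiers.
ARTIST = "http://example.org/artist/foo-bar"
COMPOSER_OF = "http://example.org/ar-type/composer"
WORK = "http://example.org/work/vagaries-of-data-representation-formats"

# A triple is just (subject, predicate, object), in that order.
composition = (ARTIST, COMPOSER_OF, WORK)


def format_triple(triple):
    """Render a URI-only triple as one N-Triples-style line."""
    subject, predicate, obj = triple
    return f"<{subject}> <{predicate}> <{obj}> ."


print(format_triple(composition))
```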

What's different about triples compared with ARs is that ARs have to have entities at both ends, while a triple can have a literal – if you've ever thought that an "AR to a text field" would be useful, that's putting a string literal as the object of a triple! This can also be used for some things we consider "fundamental properties", like an artist's name, which is a triple of: the URI for the artist, the URI for the 'name' predicate, and the name (a literal string such as "The Beatles"). In MusicBrainz terms: you don't need an "artist name" field if there's an Artist -> text field AR for the artist name.
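As a rough sketch of the "artist name as a triple" idea, again with plain Python tuples: the Literal class and the example.org URIs below are my own inventions for illustration, not part of any real RDF library or MusicBrainz schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A plain value (text, number, date) that needs no URI of its own."""
    value: str


# Hypothetical URIs, made up for this example.
BEATLES = "http://example.org/artist/the-beatles"
NAME = "http://example.org/predicate/name"

# The object of this triple is a literal, not another resource.
name_triple = (BEATLES, NAME, Literal("The Beatles"))


def artist_name(triples, artist):
    """Return the first literal attached to `artist` via the name predicate."""
    for subject, predicate, obj in triples:
        if subject == artist and predicate == NAME and isinstance(obj, Literal):
            return obj.value
    return None
```

The point of the Literal wrapper is just to keep the text "The Beatles" visibly distinct from the URI that identifies the artist.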

Triples also get used for defining types: saying that a given URI is an identifier for an album, for example, is a triple of: the URI for the album, the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type (which defines the type of a given resource), and the URI for the 'album' type.

In fact, triples can describe everything in MusicBrainz (or perhaps everything in the world!), especially with the addition of some useful resource types like collections.

One More Thing Worth Mentioning: Ontologies

One other useful concept to understand, when talking about RDF, is the idea of a vocabulary/taxonomy/ontology. I'll just call these 'ontologies' from here; distinguishing between vocabularies, taxonomies, and ontologies is best left to pedants (like me). An ontology, in short, is a reusable set of Resources and some descriptions of them (expressed in RDF, of course). Usually ontologies consist of types of predicates and types for resources, and descriptions of these things. Some fundamental concepts are expressed in some core ontologies; one such example is the core RDF vocabulary, which includes the "type" predicate I used above. This and other core ontologies, such as RDF Schema and OWL, define concepts like domain and range of predicates, ways of defining collections and lists, subclasses, equivalence, transitivity, cardinality (which covers things like "all artists must have a name" and "all releases must have at least one medium"), and many other things. Ontologies tend to be referred to using namespace prefixes, such that http://www.w3.org/1999/02/22-rdf-syntax-ns#type is usually just called rdf:type.
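Namespace prefixes are nothing more than string abbreviation, which a few lines of Python can show. The three namespace URIs below are the standard W3C ones for RDF, RDF Schema, and OWL; the expand function is just my own illustrative helper, not an API from any library.

```python
# Standard W3C namespace URIs for three core vocabularies.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand(prefixed_name):
    """Turn a prefixed name like 'rdf:type' into its full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


print(expand("rdf:type"))
print(expand("rdfs:subClassOf"))
print(expand("owl:sameAs"))
```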

In some ways, ontologies in RDF are like schemas in more traditional systems like MusicBrainz, but there's nothing stopping anyone from using a predicate or type that's not defined by an ontology, where schemas prevent using new things without first changing the schema.

How and why might MusicBrainz use RDF?

Having read the above section (you read it if you don't know the basics of what RDF is, right?), you'll understand that RDF is a pretty general thing. As such, there are a number of different places where it could be used by MusicBrainz. Usually when I'm ranting about this I'm referring to using RDF as a data storage format, which will be what most of this section covers. The other primary way to use RDF is as a sort of API via Linked Data principles. I'll mention this usage as well, but in my opinion this is the less-useful application of RDF for MusicBrainz (despite being the one more-often tried), so I'll spend less time on it.

RDF as a data format

Let me first clarify that what I mean by 'data format' is 'database format' – I'm talking about using RDF as the way we store data. This would involve switching out our current database and putting in its place a specialized RDF-optimized storage facility called a triple store. This infrastructure-change burden is something I'll mention below in downsides to this use of RDF (and is probably the single biggest legitimate reason, in my mind, for us not to switch wholesale to RDF starting tomorrow).

So: what would be the upsides and downsides of using RDF for our base-level storage and representation of data? To explore this, it shouldn't be surprising that we look at the differences I mention in the previous section between the MusicBrainz conceptual data model and RDF. To summarize:

  1. RDF assigns identifiers to Just About Everything (which is to say: everything except literals)
  2. RDF lacks the restriction of needing schema change for changes in data structure (ontologies are descriptive, not prescriptive)
  3. RDF allows more than just entities as the "endpoints" of its analogy to ARs (triples)
  4. RDF stores everything as URIs and triples

Let's explore these in order.

Identifiers for Just About Everything

Since RDF's concept of the Resource is much more general than our current concept of an entity, many more things would end up with URIs as identifiers: tracks, mediums, tracklists, etc. Certainly at least everything that's currently a table in our database (with the exception of some small, optimization-oriented tables like artist_name, and some likely simplification of the AR system) would end up being identifiable resources. There is a potential downside to this, in that it probably means people start using these identifiers. However, never once in my time as a contributor to MusicBrainz have I heard anyone ask for fewer things to be uniquely identifiable as public resources. Rather, I hear frequent requests to bring back track MBIDs, to be able to uniquely identify mediums and tracklists, and requests to reveal more of the unique identifiers we already have, such as UUIDs for AR types. While it might be more work initially to make sure identifiers are assigned in a reasonable way and that recommended practices are carefully documented (for example, recommending against referring to some kinds of identifiers from the "outside world"), we would be a lot less likely to have the problem of something needing a unique identifier and finding we don't really have anything satisfactory.

Flexible and Schema-Free

Although we'd probably want to define an ontology describing our use of terms and assigning them to a namespace (for internal use and documentation if nothing else!), changing this or adding things wouldn't be nearly as arduous as schema changes are now, because the ontology is merely extra information about terms, not a technical prerequisite for storing information using those terms. A schema change to add a column must be applied before anything can add data to or use data from that column; if a replicated server doesn't have the right schema, it simply cannot run certain code. If an ontology is out of date on a replicated server, though, the worst likely situation is that it's lacking some cardinality information and thus isn't necessarily checking for required elements. But if the server is replicated, all that data is coming from the main server, which will be doing those checks anyway. Amusingly, RDF ontologies are also RDF, so they could be put into the same RDF store as the data and changes to them can be replicated alongside the data. So, not only are schema sequence numbers and potentially broken schema upgrade processes unnecessary, but what vestigial analogs to them remain are easier to automate.
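Here's a toy illustration of the "descriptive, not prescriptive" point, assuming a store that is nothing but a Python set of tuples. The predicates, the REQUIRED set (standing in for ontology cardinality information), and the helper names are all hypothetical; a real triple store and a real ontology would be considerably richer.

```python
# Hypothetical predicate and resource URIs, made up for this sketch.
NAME = "http://example.org/predicate/name"
SUBTITLE = "http://example.org/predicate/subtitle"  # declared nowhere
RELEASE = "http://example.org/release/1"

# The "triple store": just a set of (subject, predicate, object) tuples.
# Literals are plain Python strings here to keep the sketch short.
store = set()

# Stand-in for ontology cardinality info: predicates every resource needs.
REQUIRED = {NAME}


def add(triple):
    """Storing a triple never consults the ontology."""
    store.add(triple)


def missing_required(subject, required=REQUIRED):
    """An optional check layered on top: which required predicates are absent?"""
    present = {p for s, p, _ in store if s == subject}
    return required - present


# A brand-new predicate can be used immediately, with no schema change.
add((RELEASE, SUBTITLE, "Deluxe Edition"))
print(missing_required(RELEASE))  # the check is advice, not a gatekeeper

add((RELEASE, NAME, "Some Release"))
print(missing_required(RELEASE))
```

A replicated server with a stale REQUIRED set would still be able to store and read every triple; it would only be running weaker checks.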

In case anyone's worried, it's still possible to make some arbitrary divisions in our data (like, say, not including editor passwords in our data dumps and replication packets) by way of a feature called named graphs.
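One way to picture named graphs is as a fourth element tacked onto each triple, saying which graph the statement belongs to. The sketch below uses that "quad" idea with plain tuples; the graph names, predicates, and values are invented for illustration and say nothing about how a real MusicBrainz deployment would divide its data.

```python
# Hypothetical graph names and predicates.
PUBLIC = "http://example.org/graph/public"
PRIVATE = "http://example.org/graph/private"
EDITOR = "http://example.org/editor/1"
NAME = "http://example.org/predicate/name"
PASSWORD = "http://example.org/predicate/passwordHash"

# Each statement is (subject, predicate, object, graph).
quads = [
    (EDITOR, NAME, "some_editor", PUBLIC),
    (EDITOR, PASSWORD, "not-a-real-hash", PRIVATE),
]


def dump(quads, graph):
    """Return only the triples belonging to one named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == graph]


# A data dump would export the public graph and nothing else.
public_dump = dump(quads, PUBLIC)
print(public_dump)
```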

More Than Just Entities

Triples, unlike our current ARs, don't require both ends to be entities. This means, first of all, that we get the capability for ARs to literal data – differentiated comments on different subtopics, version numbers, boolean or enumeration-type flags, etc. In combination with the ease of data structure change mentioned above, this also happens to allow for ARs with any number of endpoints and any number and type of attributes (since it makes AR endpoints and attributes fundamentally the same thing).

Any AR with attributes (all of them, since dates are available for all AR types) would in fact have to be migrated as a type of resource as well as a predicate. For example, the '{additional} {guest} {solo} {instrument}' AR would be translated as, say: a PerformanceEvent resource, linked to the Artist with a performedBy relation, to the Recording with a performance relation, with the attributes being expressed as additional, guest, and solo relations to literal booleans and an instrument relation to a Resource of type Instrument, or if we're lazy, to a literal string with the instrument's name (but just making the Resource/entity for an instrument would probably be easier, especially since it's a requested feature). Oh, did I mention that lots of other requested new entities are also both easier and possibly the best way to migrate to RDF?
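Spelled out as data, that translation might look something like the following. This is only a sketch under my own assumptions: the example.org namespace, the resource URIs, and the choice to make the event the subject of every triple are all hypothetical, and Python booleans stand in for literal booleans.

```python
EX = "http://example.org/"  # hypothetical namespace for this sketch
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

event = EX + "performance/1"

# One attribute-laden AR becomes an intermediate resource plus plain triples.
triples = {
    (event, RDF_TYPE, EX + "PerformanceEvent"),
    (event, EX + "performedBy", EX + "artist/foo-bar"),
    (event, EX + "performance", EX + "recording/some-recording"),
    (event, EX + "additional", False),
    (event, EX + "guest", True),
    (event, EX + "solo", False),
    (event, EX + "instrument", EX + "instrument/guitar"),
}


def attributes(triples, subject):
    """Collect a resource's properties in our namespace, keyed by short name."""
    return {
        predicate[len(EX):]: obj
        for s, predicate, obj in triples
        if s == subject and predicate.startswith(EX)
    }


print(attributes(triples, event))
```

Note that the endpoints (performedBy, performance) and the attributes (guest, instrument, and so on) come out of the same function: they really are the same kind of thing.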

There are plenty of other things that move much more to the relationship-editor end of the task list, rather than the developer end, like sortnames for more entities or a subtitle field, which would just be new predicates (AR types) to text strings. More-complicated things like location and date fields for entities very efficiently reuse the aforementioned migrations of AR attributes as new resources.

Oh, did I mention there's a predicate owl:sameAs that very nicely links to other names for the same resource?

Everything Looks The Same

"Everything looks the same" means that everything is identified by URI (and URI only: no duality of gids versus database row numbers) and every relationship between bits of data is expressed as triples and perhaps some intermediate entities. Anyone who's ever written some Lisp understands how nice uniform representation can be, and how powerful. Anyone who's ever dealt with heterogeneous sources of information (say, anyone who's ever hunted down sources for an edit, or written citations for a research paper) understands that having things in the same format makes things more reusable. And when specialized interfaces haven't been written yet, having everything look the same means a general interface can be written as a fallback, letting features exist for advanced-user use as soon as possible and letting the developers structure their time more freely.
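As a sketch of what such a fallback interface could be built on: because every fact has the same shape, one small function can summarize any resource without knowing its type. The URIs and sample data below are hypothetical, and literals are plain strings for brevity.

```python
from collections import defaultdict

# Hypothetical URIs and data.
ARTIST = "http://example.org/artist/foo-bar"
WORK = "http://example.org/work/some-work"
NAME = "http://example.org/predicate/name"
COMPOSER_OF = "http://example.org/ar-type/composer"

triples = [
    (ARTIST, NAME, "Foo Bar"),
    (ARTIST, COMPOSER_OF, WORK),
    (WORK, NAME, "Some Work"),
]


def describe(triples, resource):
    """Group everything said about `resource` by predicate, whatever its type."""
    properties = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == resource:
            properties[predicate].append(obj)
    return dict(properties)


# The same function works for artists, works, or anything added later.
print(describe(triples, ARTIST))
print(describe(triples, WORK))
```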

Need I even mention that URIs being a basic currency of the underlying database makes handling URLs a lot easier?

RDF as an API

TBA