From MusicBrainz Wiki

Support for Unicode means that we can support non-ASCII characters, which is critical for non-US artists and releases as well as artists, album titles and track titles that use non-ASCII characters. We need to not be American-centric in our support for music. It is not so some lameass can say "I like the Macintosh smartquote so I'll just change all the ASCII apostrophes to suit my whim." That's utter b*llsh*t.

  1. It wastes a lot of resources: MB CPU time, bandwidth, database transactions, e-mails, editors reviewing, etc.
  2. It accomplishes absolutely nothing of value.
  3. It will screw up rendering in MP3 players that do not support Unicode (hint: there are a LOT more than iPods out there).
  4. You can accomplish the same thing with a simple plugin to the Picard tagger.

Note that point 4 has actually been used by the SmartQuote Contingent to claim that those who are unhappy with it can just rename them back to ASCII. O_o


I've been using MusicBrainz seven years, people, SEVEN years, and this is the stupidest frelling b*llsh*t I have ever seen in MB, bar none. There's a massive litany of all the problems it causes, all the hassles to adapt to it, and yet no one can provide one single pro for it other than "Gee, it looks pretty". The argument that we shouldn't "hobble" the MB database by the lowest common denominator is the exact opposite of what we should do. When you have a massive platform to support (all music information in the entire world), you NEED to go by the lowest common denominator. So we need to support Unicode, but we shouldn't waste resources on fiddly crap that can be corrected on the client end. --riffer 15:03, 1 February 2011 (UTC)

What about dashes vs. hyphens? See, I've encountered a similar problem there. Wouldn't it make a lot of sense if we made full use of Unicode in the database and had some automated heuristic in the tagger to convert to ASCII if necessary? This heuristic could be made optional and dropped once a sufficient number of clients support Unicode. I mean, the tagger does support Unicode in tags already, and so do several software players. Why should some hardware players pose a restriction for the whole system? Considering European languages, the replacement list shouldn't become too large to handle. Also, it doesn't have to be perfect, because in ASCII you can only do approximations of the correct typography anyway. I'd volunteer to set up and maintain the list. AnswerMe, please, at least concerning the dashes vs. hyphens question ... Thanks. Editor:selig

  • It would probably be better to ask on the mailing list. --LukasLalinsky
  • I agree with Editor:selig. We ought to have the data in MusicBrainz rise above the particular limitations of this player or that filesystem of today, because if we don't, then tomorrow when the limitation is removed, our data will still be hobbled. It's the job of the tagger to dumb down the data to the limitations of the destination player or format of the moment. And by the way, this means I disagree with the guideline about "…"; our data ought to favour the ellipsis over three periods "...". — JimDeLaHunt 2008-01-07
    • While I would love to see better punctuation, it then raises its own cast of headaches, namely:
      • English editors would have to break built-in habits and learn some punctuation that most schools never actually bother with, ie, the use of «» in English, when to use ”“ vs "", when to use ’‘ vs ' ' (and how those usages differ between American English and British English).
      • Danish, Swedish, and Norwegian editors would have to deal with ““ and ‘‘, and the inevitability that people would try to incorrectly use ”“ and ’‘ instead. Swedish would also then have the complexity of when to use those, when to use »», and when to use (ignore the brackets here) [ - ][ - ] (space hyphen space, space hyphen space).
      • Dutch editors would have to deal with „“ and ‚‛ (and the inevitability that people would use them backwards)
      • « » instead of "" on French releases (and then which do we use in mixed language releases, such as classical?) We would also add in with this the complexity of non-closed guillemets in French («» becomes «?, «…, «! when the sentence ends in other than a period, and «—» (not «» «») when you have two adjacent quotations.)
      • German editors would have to deal with „“ and ‚‛ (and the inevitability that people would use them backwards), as well as «» and «‹›».
      • Italian editors (ie, all classical editors, etc) would have to distinguish between - – and —.
      • Italian editors would have to worry about when to use "", when to use «», and when to use "« »" (same as "' '" in English).
      • and so on...
    • That's only 7 languages - and the above is only Latin script, where essentially the same punctuation marks are used (just with slightly differing rules), and ignores the complexities of totally different punctuation *marks* in different languages... Add in typographic differences between quarterwidth, halfwidth, normalwidth, and fullwidth characters, and it'd get really confusing. Now, not to be misunderstood - I do wish correct typography was something we could pull off. But considering how many basic misspellings we get, and how confused 99% of editors get when trying to deal with more complex formulations like CSG, I'd think asking everyone to suddenly become typographical experts before using punctuation, and adjusting their proper typographical punctuation dependent upon which language they're dealing with... just never would happen. We'd end up with an incredibly confusing guideline, with at least 1, normally 3 or more, language-specific differences, for each and every language. With as much difficulty as we have even getting everyone to use the proper capitalization for a given language, I just see getting everyone to also use and remember any given PunctuationStandardLanguage as great in theory, but something that would never happen in practice. -- BrianSchweitzer 05:47, 08 January 2008 (UTC)
      • But wouldn't that complexity be offset by the fact that only people who are familiar with the issues would care about them? Two examples: (1) people who don't know German (or French or Latin) often don't respect the caps rules, but that's OK because those who do will eventually fix them (rather than delete the incorrect entries). (2) I don't know Russian or Chinese or Japanese, so (a) I never edit them, (b) I couldn't tell if they are correct or not, so it wouldn't bother me either way, and (c) they are entered mostly by and fixed only by people who use those languages regularly and thus know what to do and how to do it. We'll never get even spelling 100% correct on everything, but we strive for it, and any errors are fixed by those who both care and can fix them. Having guidelines that are consistently (except for English/Latin) wrong but simple can just make the database consistently "wrong". Consistently "correct but hard" guidelines would mean more correct data, though a bit more inconsistent (ie, not everything is correct, but at least some of it is). This is a bit of an "if a tree falls and nobody hears it, it doesn't matter if it makes a sound" argument. If nobody notices an entry is wrong (because they can't tell the difference between half-width and full-width characters, for example), they don't care. If they notice it, they probably can fix it and they probably will. --BogdanB
        • Quite agreed, so long as we always maintain the option, such that either generic entry is ok or a copy/paste solution exists. -- BrianSchweitzer 17:41, 01 February 2008 (UTC)

Why have rules on the ellipsis (… U+2026) and quotation marks? If someone offers to use typographically correct characters, what is blocking them? I don't really understand the real purpose of those small rules. They enforce pre-Unicode-era legacy compromises without a clear motive. Oh yes, there is one: the motive is clients which don't support Unicode (I personally can't name any, but I'm biased). What is the deadline for Unicode support, then? Year 2099? -- jesus2099 08:57, 25 February 2009 (UTC)

I agree with you. The names of things inside the database should be written correctly with regard to content and typography. Clients should be able to convert the correct signs to ASCII replacements if they want. --Akirom 09:57, 27 March 2009 (UTC)
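
For illustration only, the client-side conversion discussed above (an optional replacement list applied by the tagger, leaving the database untouched) could look roughly like the Python sketch below. Everything named here is an assumption made up for this example: the `ASCII_FALLBACKS` table, its entries, and the `to_ascii_punctuation` function are hypothetical and are not part of Picard or of any agreed MusicBrainz list.

```python
# Hypothetical replacement list: typographic punctuation -> ASCII approximation.
# The entries are illustrative only; a real list would need to be agreed upon
# and maintained, as suggested in the discussion.
ASCII_FALLBACKS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u201A": "'",    # single low-9 quotation mark
    "\u201C": '"',    # left double quotation mark
    "\u201D": '"',    # right double quotation mark
    "\u201E": '"',    # double low-9 quotation mark
    "\u00AB": '"',    # left-pointing guillemet
    "\u00BB": '"',    # right-pointing guillemet
    "\u2010": "-",    # hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(ASCII_FALLBACKS)


def to_ascii_punctuation(text: str, enabled: bool = True) -> str:
    """Replace listed punctuation with ASCII approximations.

    Only characters in ASCII_FALLBACKS are touched; accented letters and
    non-Latin scripts pass through unchanged. With enabled=False the text
    is returned as-is, so the conversion stays optional.
    """
    if not enabled:
        return text
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(to_ascii_punctuation("Don\u2019t Stop\u2026"))
```

Because only punctuation in the table is replaced, a title such as "Björk – Post" would keep its "ö" and lose only the en dash, and a user whose player handles Unicode can simply switch the conversion off.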