Internationalization (i18n) and Multiple Language Support
The major issue with multi-language support in MusicBrainz, specifically support for non-English album and track titles, is that many releases exist in multiple versions in different languages. The database can only assign one album per CDID, which means we can't match against multiple sets of languages: it's one or the other. Taking Japanese titles as an example, some people will want to tag their files with, and view, the romanised versions; others will want the Japanese titles. We can't presently do both, and workarounds like "Japanese Title [Romanised Title]" are ugly and prone to incorrect ordering and so forth (and which title should go first is another argument in itself).
Really, this needs a broader change in the database itself to be able to handle this information. I have outlined a proposal to deal with this and several other issues with the present system: ReleaseGroups. Please review this and give me some feedback.
I will be setting up a separate mailing list specifically for dealing with internationalisation issues, and I will also be asking a few prominent people to become part of a team focused on these issues, to which I will defer these kinds of decisions. I'm doing this mainly because I have no second language besides English, and whilst I can understand the overall issues, I feel that I will not be able to deal with some of the more specific issues (such as UTF-8 encodings) without support.
Issues with the current system:
- BrowseArtists only has a Latin alphabet. Though 'browse by symbol' allows one to browse artists beginning with non-Latin letters, it is still not the best solution. Perhaps letting the user choose the alphabet to browse in from a drop-down or somesuch would be a good solution (Latin, Scandinavian and Cyrillic come to mind).
- BrowseArtists is based on SortNames, which by convention are always romanized; so choosing the alphabet won't really make any difference. It might be helpful to display the ArtistName as well as the SortName in the display, though. @alex
- There is no way for a person who does not speak the language to search for foreign artists whose names are written in their native language. That is, an English speaker may not know how to spell the name of the Russian band Аквариум (Aquarium) in Cyrillic (or indeed even have a Cyrillic layout available on his keyboard). This can be helped easily enough by simply adding transliterated and/or translated names of the artist to the aliases.
- This has been done pretty well for all the artists whose ArtistName is in a non-Latin character set. The MisencodingFAQ has served as a training manual for a number of moderators who have done an excellent job of adding these aliases. @alex
- Automatic transliteration could be done for many languages if no transliterated/translated alias is available. For best results it is necessary to know the language: Cyrillic script, for example, is used by several languages, and transliteration from Ukrainian differs subtly from transliteration from Azerbaijani; in the case of Chinese, differences between dialects are even more dramatic. For Japanese, where identical kanji can have multiple different readings, the correct transliteration may not be easy to determine at all. In addition, individual artists may prefer a nonstandard transliteration of their names, or may have an "English" name that isn't really a transliteration.
- For the above and other reasons, a field should be added identifying the language of the artist name or alias, e.g. "ru", "zh" etc. This could also be used to identify misencoded alias names, e.g. "zh.big5" for an alias containing Big5-misencoded Chinese, and could even be used to automatically generate misencoding aliases for artist names in common character encodings.
- In discussion with djce on IRC, the issue of distinguishing types of aliases came up: it would be good to have a way for programs to distinguish between good data (correct titles) and bad data (misspelled titles added as aliases for search purposes). Ideally, search should not rely on aliases entered with typos but rather use a phonetic similarity algorithm of some sort. Since we do not have this, at the very least there should be a way to distinguish legitimate aliases (pseudonyms, transliterations and translations) from misspelling aliases added to ease the search engine's work.
- The same issues with translations exist for album and track titles by foreign artists, and they cannot be solved as easily, since neither albums nor tracks have a way of adding aliases. Therefore, a way to add aliases to tracks and albums should be added, so that a person could search for "Sestra Haos", "Sister Chaos" or "Сестра Хаос" and turn up the same album for all three searches. Ditto for tracks.
- Having multiple aliases for each track title seems far too complex a mechanism ever to be implemented; instead it will probably be preferable to have translations on an album basis, so that there will be different aliases for albums sharing the same track time data and TRMs. (Release data and disc IDs may be something we don't want to share, so that the Japanese release (and disc ID) is associated with the Japanese translation but not the English titles; this is not yet entirely clear.) @alex
- The Tagger does not handle non-Latin characters correctly (it displays garbage for Cyrillic titles fetched from MB, anyway). That's a major bug that should be addressed.
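As a rough illustration of why the language tag proposed above would help, here is a minimal Python sketch. It shows the tag selecting a transliteration table and labelling an automatically generated misencoding alias. The `Alias` record, the `lang.codec` tag format and the deliberately tiny transliteration tables are all assumptions made for this example; nothing here reflects the actual MusicBrainz schema or server code.

```python
from dataclasses import dataclass

# Illustrative tables only: a handful of letters, not a real romanisation scheme.
CYRILLIC_COMMON = {"а": "a", "к": "k", "н": "n", "о": "o", "р": "r"}
CYRILLIC_BY_LANG = {
    "ru": {"г": "g", "и": "i"},  # Russian readings
    "uk": {"г": "h", "и": "y"},  # Ukrainian readings of the same letters
}


@dataclass(frozen=True)
class Alias:
    name: str
    lang: str  # hypothetical tag such as "ru", "uk", "zh" or "zh.big5"


def transliterate(alias: Alias) -> str:
    """Romanise a Cyrillic alias using the table chosen by its language tag."""
    base_lang = alias.lang.split(".")[0]
    table = {**CYRILLIC_COMMON, **CYRILLIC_BY_LANG.get(base_lang, {})}
    # Letters missing from the toy tables are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in alias.name.lower())


def misencoded_alias(alias: Alias, codec: str) -> Alias:
    """Build the garbled name produced when `codec` bytes are read as Latin-1."""
    garbled = alias.name.encode(codec).decode("latin-1")
    return Alias(garbled, f"{alias.lang}.{codec}")


if __name__ == "__main__":
    print(transliterate(Alias("гора", "ru")))  # gora
    print(transliterate(Alias("гора", "uk")))  # hora
    print(misencoded_alias(Alias("光", "zh"), "big5").lang)  # zh.big5
```

The same Cyrillic string romanises differently depending on the tag, which is the reason an untagged name cannot be transliterated reliably. The misencoding helper assumes the client misread the bytes as Latin-1; other wrong guesses (for example a Windows code page) would need their own variant.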
Moderation Proposal for Working With the Current System
- Artist names (stolen from dave's suggestion in the classical music naming thread) - this would apply to the display names, or at the very least the sort names, which should always be in the roman alphabet, even after language tags are added to artist names/aliases.
- Use the name as written in the artist's home country as long as it is recognizable by those familiar with the common English spelling. This would include Vietnamese names, even though they are not representable in any 8859 encoding.
- If the name is unrecognizable, e.g. because it uses a non-roman character set (Cyrillic, Kanji etc.), use the standard transliteration into roman characters if there is one.
- If the artist's home country's language does not define a standard transliteration into roman characters (as Pinyin does for Chinese, or romaji for Japanese), use the transliteration into English; transliterations into other languages may be different (e.g. Ч = Ch in English, but Ч = Tch in French).
- Note that common English spellings should be preferred, e.g. Tchaikovsky, not Chajkovskij.
- Any (and all!) alternative names using different spellings, character sets and/or transliterations should be entered as artist aliases.
- As there are sometimes multiple ways to represent the same symbols with different Unicode byte sequences (e.g. using combining accent marks), it may be desirable to enter these as aliases; a normalized form should be used for the principal artist name. See http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0002.html - it's possible that the MusicBrainz server and/or database software may (or should) do this already.
- Album and track titles
- For titles in any latin alphabet, use the title as written in the original language.
- For titles in non-latin alphabets, use the title as written in the original language. If known, include a transliteration or translation into English in square brackets, e.g. "光 [Hikari]" or "Я Тебе Люблю [I Love You]". Parts of the original title that are written in latin letters, e.g. (remix), should be omitted from the English version, e.g. "光 (original karaoke) [Hikari]" or "Знаю Я (remix) [I Know]".
- If the original language does not define a standard transliteration, use the transliteration into English as noted above.
- If a title is only known in translation or transliteration, do not use square brackets, e.g. "Yang Ku Tunggu" not "[Yang Ku Tunggu]".
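The point above about one symbol having several Unicode byte sequences can be made concrete with Python's standard `unicodedata` module. This is only a sketch: choosing NFC as "the" normalized form for the principal name is an assumption (the proposal does not say which form), and the helper names are invented for the example.

```python
import unicodedata


def canonical_name(name: str) -> str:
    """Return the composed (NFC) form, assumed here to be the principal form."""
    return unicodedata.normalize("NFC", name)


def alias_forms(name: str) -> list:
    """Return the other normalization forms of `name`, as alias candidates."""
    forms = {unicodedata.normalize(form, name) for form in ("NFC", "NFD")}
    forms.discard(canonical_name(name))
    return sorted(forms)


if __name__ == "__main__":
    composed = "Bj\u00f6rk"      # o-with-diaeresis as a single code point
    decomposed = "Bjo\u0308rk"   # plain "o" followed by a combining diaeresis
    print(composed == decomposed)                                   # False
    print(canonical_name(composed) == canonical_name(decomposed))   # True
    print([len(form) for form in alias_forms(composed)])            # [6]
```

The two spellings look identical on screen but compare unequal until they are normalized, which is why a search for one can miss the other unless the server normalizes or an alias exists.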
Issues With Current Moderation Proposals:
- An issue with incorporating any non-latin alphabets in titles is that filesystem support for non-latin filenames is patchy and inconsistent. Many users rename their files after track titles, and a non-latin filename on a system that doesn't support it can make the file unreadable or even irrecoverable. Personally, I have not had this issue with MusicBrainz, but I have had it with downloaded files with non-latin (particularly CJK) filenames. I feel that MusicBrainz should err on the side of caution, however.
The problem with erring on the side of caution is that you end up with the lowest common denominator of all the different filesystems that you support, which would negate the use of Unicode in the first place.
An important distinction should be drawn between metadata and filenames, since the two are quite separate. Currently, for example, characters such as / are removed from filenames (not metadata) on Unix systems. On Windows a variety of characters are suppressed or altered. It makes sense to make the change at the point where you meet the problem, i.e. the filesystem; if you try to fix the database instead, you will end up with conflicting requirements.
I'm no expert on i18n issues but is the recent work to translate from Unicode into different encodings not the correct solution to this problem? -- bawjaws
- After consideration and examining recent non-latin moderations, I've come to feel that bawjaws is correct in this matter. MusicBrainz can handle Unicode exceptionally well, and all modern filesystems handle Unicode as well (the issues I ran into seem to be with EUC and Shift-JIS encodings). Further testing in this matter still needs to take place, but if no further issues are raised we can write off these filesystem issues.
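The metadata/filename distinction drawn above can be sketched in a few lines of Python. The title stored as metadata is left untouched, and characters are only replaced at the point where a filename is derived for a particular platform. The function name, the `platform` argument and the choice of "_" as a replacement are assumptions for illustration; this is not how the Tagger actually does it, and the character lists are not exhaustive.

```python
# Characters Windows does not allow in file names; Unix only forbids "/"
# (and the NUL byte, which is ignored in this sketch).
WINDOWS_RESERVED = '<>:"/\\|?*'
UNIX_RESERVED = "/"


def filename_for(title: str, platform: str) -> str:
    """Derive a filename from a title without altering the title itself."""
    forbidden = UNIX_RESERVED if platform == "unix" else WINDOWS_RESERVED
    return "".join("_" if ch in forbidden else ch for ch in title)


if __name__ == "__main__":
    title = "AC/DC: Live?"                 # metadata stays exactly as entered
    print(filename_for(title, "unix"))     # AC_DC: Live?
    print(filename_for(title, "windows"))  # AC_DC_ Live_
    print(filename_for("光 (original karaoke)", "windows"))  # unchanged
```

Non-latin characters pass through untouched in this sketch; only the characters the target filesystem forbids are altered, so the database never has to settle for the lowest common denominator.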
Proposed long-term solution:
- An ideal solution, in my view, would be to offer an alternate title for albums and tracks that is a transliteration into the latin alphabet. The user would then have the choice of naming files after the original or the transliteration, somewhat like artist names and sortnames. After implementation, later moderations can provide latin titles for albums and tracks.
Moderators have conflicting ideas on how internationalisation should be represented in the database as it stands now. I hope this wiki helps resolve any questions people may have.
This issue keeps cropping up over and over again -- we will need to formulate a plan of attack for offering the MB site in multiple languages.
- How do you do this technically without making the process to create new pages awkward and painful?
- How do you handle translation issues when some of the volunteer translators for a language are slacking and you are ready to roll out a new version of the website?
It seems like there are several social and technical issues to address here.
- One thing that has occurred to me is that we may want to provide something like a wiki interface to allow moderators to update and edit the translated content for MB pages. The easier we make it to edit the content, the more likely we are to get contributions of translations. @alex
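One common way of coping with lagging translations, offered here only as a possible starting point for the questions above, is to fall back to the English string whenever a translation is missing, and to report what is still untranslated so that volunteers know where to help. The catalog layout and function names below are invented for this sketch and say nothing about how the MB site is actually built.

```python
# Hypothetical per-language catalogs keyed by the English source string.
CATALOGS = {
    "de": {"Search": "Suche", "Artist": "Künstler"},
    "ja": {"Search": "検索"},  # "Artist" has not been translated yet
}


def translate(message: str, lang: str) -> str:
    """Return the translation, or the English source string if none exists."""
    return CATALOGS.get(lang, {}).get(message, message)


def untranslated(messages: list, lang: str) -> list:
    """List the source strings that still lack a translation for `lang`."""
    catalog = CATALOGS.get(lang, {})
    return [message for message in messages if message not in catalog]


if __name__ == "__main__":
    page_strings = ["Search", "Artist"]
    print([translate(s, "ja") for s in page_strings])  # ['検索', 'Artist']
    print(untranslated(page_strings, "ja"))            # ['Artist']
```

With a fallback like this, a new version of the site can be rolled out before every language is complete: pages in an incomplete language show a mix of translated and English text rather than breaking.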