MusicBrainz Server/Internationalization

From MusicBrainz Wiki
Revision as of 10:54, 27 June 2023 by YvanZo (talk | contribs) (→‎Unicode issues: Strike characters that have been added in comment to https://tickets.metabrainz.org/browse/STYLE-1402)

Getting started

If you want to help translate, go to the Transifex page and create an account. If there is already a team for your language, you can join it; if not, you can ask for a new team to be created.

There used to be an i18n mailing list, but it has been discontinued and replaced by the new forums (using categories and tags).

Questions or problems

If you have any questions or you're having any problems, you're welcome to ask in the #metabrainz IRC channel.

If you find a bug in the server, you can enter an issue in our bug tracker.

Translation components

The following components are available for translation:

Attributes

It contains the names and descriptions of MusicBrainz entity attributes, such as artist types.

It is also used by MusicBrainz Picard.

Countries

It contains the names of release countries.

It is also used by MusicBrainz Picard.

Note that country names should be the same as area aliases; see jira:MBS-13140 for follow-up.

Only the Release/Country documentation is not localized for now; see jira:MBS-13109 for follow-up.

Instrument Descriptions

It contains only the descriptions of instruments.

Instruments

It contains only the names of instruments.

Note that instrument names should be the same as instrument aliases; see jira:MBS-13141 for follow-up.

Languages

It contains the names of languages that can be set for a release’s tracklist and a work’s lyrics.

Relationship Types

It contains the names, descriptions, and (forward/long/reverse) link phrases of relationship types as well as the names and descriptions of relationship attributes. See also Relationships.

Scripts

It contains the names of scripts that can be set for a release’s tracklist.

Note: Because of transliteration, a language is not necessarily paired with its usual script/writing system.

Server

It contains the messages shown to users and admins by the MusicBrainz website.

Statistics

It contains the events in the MusicBrainz timeline and the messages for the Database Statistics section of the website UI.

Viewing the translations

Some of the more complete translations (generally those over 50% translated) are available on the beta server at https://beta.musicbrainz.org/. The translations do not update automatically (see development beta cycle), but the beta server uses the same database as the main server. If you want to use the beta server all of the time for your editing, click the "Use beta site" link in the footer of https://musicbrainz.org/.

Variables

Translatable messages not only contain plain text or HTML markup, they can also contain replaceable variables. For example:

  • In “{entity1} has a BookBrainz page at {entity0}”, which is a URL-Work relationship link phrase, there are two entity variables whose names should not be translated: {entity1} will be replaced by a work title and {entity0} by a URL.
  • In link phrases, variables are often used for (optional) attributes, in order to avoid inflating the number of messages. Below are examples with the “additional” attribute:
    • {additional} will be replaced by “additional” if the “additional” attribute is set, otherwise it will be removed from the text.
    • {additional:additionally} will be replaced by “additionally” if the “additional” attribute is set, otherwise it will be removed from the text.
    • {additional:an|a} will be replaced by “an” if the “additional” attribute is set, otherwise it will be replaced by “a”.
    • {additional:%|regular} will be replaced by “additional” if the “additional” attribute is set, otherwise it will be replaced by “regular”.
    • Hence, {additional} can be translated as {additional:aldona} in Esperanto.
  • Note that {instrument} and {vocals} variables are replaced by the specific instrument/vocals name:
    • {instrument:%|instruments} will be replaced by “piano” (or its translation) if the related instrument is “piano”, otherwise it will be replaced by “instruments”.
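
The substitution rules listed above can be sketched in a few lines of Python. This is only an illustration of the behaviour described on this page, not the server's actual implementation; the function name expand_link_phrase and the shape of the attributes mapping are assumptions made for the example, and entity variables such as {entity0} are not handled.

```python
import re

# Matches {name}, {name:if_set} and {name:if_set|if_unset}.
PLACEHOLDER = re.compile(r"\{(\w+)(?::([^|{}]*)(?:\|([^{}]*))?)?\}")


def expand_link_phrase(phrase, attributes):
    """Expand attribute variables in a link phrase (hypothetical helper).

    `attributes` maps a variable name to the text that stands for it, e.g.
    {"additional": "additional", "instrument": "piano"}. A missing or empty
    entry means the attribute is not set.
    """
    def replace(match):
        name, if_set, if_unset = match.groups()
        value = attributes.get(name)
        if value:
            if if_set is None:
                return value  # plain {name}
            # "%" stands for the attribute's own (possibly translated) name.
            return if_set.replace("%", value)
        # Attribute not set: use the alternative text, or remove the variable.
        return if_unset or ""

    expanded = PLACEHOLDER.sub(replace, phrase)
    # Collapse the extra spaces left behind by removed variables.
    return re.sub(r" {2,}", " ", expanded).strip()


phrase = "{additional} {instrument:%|instruments}"
print(expand_link_phrase(phrase, {"additional": "additional", "instrument": "piano"}))
# -> additional piano
print(expand_link_phrase(phrase, {}))
# -> instruments
```

A translator only changes the text around and inside the variables, for example {additional:aldona}, while the variable names themselves stay untouched.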

Development

The MusicBrainz Server code uses gettext to internationalize the messages and text used in the Perl code and templates.

A .pot file containing all the strings used in the server is provided. The strings are in English.

Beyond translation

Current features

Current issues

Most current issues are tracked through MusicBrainz Server internationalization tickets. Some longer-term goals are not tracked yet.

There are most likely some internationalization issues with fuzzy search in some languages (those with agglutinative words or ideographic characters). Fixing them mostly requires making proper use of the language analysis provided by Apache Solr.

Future

Overview

One of the goals of MusicBrainz is to store information about music from all over the world, and since that music is written in many languages, support for those languages is essential. In the future, we also want people to be able to use MusicBrainz in any language, not just English, especially since the people who know the most about music in other languages are often native speakers of those languages.

The decision to use Unicode for MusicBrainz was an important first step on the road to internationalization, and it has allowed entry of hundreds of international artists with works in dozens of languages, but there remains much work to be done. The work of adapting software so that it can be used with different languages or in different regions is called internationalization (abbreviated as I18N), and translating it into each of those languages and regions is called localization (abbreviated as L10N). Both of these are substantial efforts, but the resources needed are different. Internationalization requires specialized understanding of aspects of many languages, but that is often easier to find than the native linguistic ability in non-Western languages needed for localization.


The following is a breakdown of the many issues for i18n and l10n by area. Issues that should have RFEs filed are marked with RFE ME and a note on the priority (low, med, high). Where RFEs have already been filed, they should be linked.

Database

Many of the most crucial issues for i18n are with the database schema, in order to support the additional data needed to properly localize artists, releases, etc. The localization itself is done by moderators, and can even be done to some extent without full i18n support in the database.

Locales

Just as releases have countries associated with them, artists, aliases, and releases should have locales associated with them; this would be a way of capturing the language (and country and encoding variants) of the names and titles. The Open I18N guidelines for locale names should be used where possible: the basic format for standard locales is lc_CC.CSet, where lc is an ISO 639 two-letter language code (three-letter codes may be used if no two-letter code exists), CC is an ISO 3166 two-letter country code, and CSet is an IANA-registered preferred MIME encoding name or, if none is preferred, a standard name from the Open I18N Codeset Alias Table.

Alternately, we could use the convention adopted for CSS (and other XML/HTML/HTTP?) of using hyphen ("-") as the separator for all components, instead of underscore ("_") and period ("."). The disadvantage of that form is that it doesn't allow you to omit leading components. Extending the Open I18N guidelines, both by allowing any of language/country/encoding to be omitted (if either of the second two components is omitted, their preceding separator would also be omitted) and perhaps to add another variant component, as noted below, adds some functionality that may be very useful.

In most cases, the language code alone would be used, but there would be uses for country variants, e.g. for the group known as "Yazoo" ("en") but in the U.S. as "Yaz" ("en_US"). Although it is not strictly speaking correct, simplified Chinese is often identified as "zh_CN" and traditional Chinese as "zh_TW" (although both are used outside of those regions); see below for a discussion on Chinese languages and scripts.

It may be that the best solution is to add more components for scripts and "dialects" (each preceded by a hyphen "-") so that you could have "zh-hant-guoyu_CN.UTF-8" to indicate a title in Mandarin (guoyu) using traditional (hant) Chinese characters, in the PRC, using UTF-8 encoding. But this could be overkill. On the other hand, there are many languages that use multiple scripts (typically Latin "Latn", Cyrillic "Cyrl" or Arabic "Arab"; see the IANA language tags for examples like Azerbaijani). There are others, like Moldovan, and many cases where very similar dialects (e.g. Hindi-Urdu, Serbo-Croatian) are divided mostly by their use of different scripts.

Encoding components could be used to identify misencoded alias names, e.g. "zh.BIG5" for an alias with Big5-misencoded Chinese; and could even be used to automatically generate misencoding aliases for artist names in common character encodings.

It might be desirable to have a fourth variant component (preceded by ":" or another character?) that could be used to identify multiple variants; it could be used to represent misspellings (e.g. "en:TYPO") or performance variants, for association with particular releases (e.g. a release could be marked "en:2" to get the second variant of the artist name, marked with the same locale). There's a discussion of why this might be desirable on the MailingList.

Some possible examples for usage:

  • ".UTF-8" (standard alias for any UTF-8 locale in absence of a more specific match; the preferred Artist Name might have this locale implicitly)
  • ".ISO-8859-1" (Latin-1 representation)
  • ".ASCII" (an ASCII representation without accents etc.)
  • "en" (English name, typically for a non-English artist)
  • "en_US" (Preferred name in USA, e.g. "Yaz")
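
To make the proposed syntax concrete, the following sketch parses the extended locale form discussed above: an optional language with optional script and dialect parts, then optional "_CC", ".CSet" and ":variant" components. The grammar is this page's proposal rather than an existing standard, and the names LocaleTag and parse_locale are assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional


class LocaleTag(NamedTuple):
    language: Optional[str]
    script: Optional[str]
    dialect: Optional[str]
    country: Optional[str]
    encoding: Optional[str]
    variant: Optional[str]


# Every component is optional. A four-letter part after the language is read
# as a script (e.g. "hant"), so a four-letter dialect would be ambiguous.
LOCALE = re.compile(
    r"""^
    (?:(?P<language>[a-z]{2,3})
       (?:-(?P<script>[A-Za-z]{4}))?
       (?:-(?P<dialect>[A-Za-z]+))?
    )?
    (?:_(?P<country>[A-Z]{2}))?
    (?:\.(?P<encoding>[A-Za-z0-9-]+))?
    (?::(?P<variant>\w+))?
    $""",
    re.VERBOSE,
)


def parse_locale(tag: str) -> LocaleTag:
    """Split a locale tag into its components (hypothetical helper)."""
    match = LOCALE.match(tag)
    if not tag or match is None:
        raise ValueError(f"not a recognised locale tag: {tag!r}")
    return LocaleTag(**match.groupdict())


for example in (".UTF-8", "en", "en_US", "zh.BIG5", "en:2", "zh-hant-guoyu_CN.UTF-8"):
    print(example, parse_locale(example))
```

Omitted components come back as None, which keeps ".UTF-8" (encoding only) and "en" (language only) distinguishable from a fully specified tag.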

Users could specify their preferred locale in their preferences; it might also be possible to glean something from the Accept-Language header (and similar headers) in HTTP requests.
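
The standard HTTP request header that carries a browser's language preferences is Accept-Language. As a minimal sketch of how a preference order could be gleaned from it, assuming a simplified reading of the header (comma-separated tags with an optional "q=" weight), the hypothetical helper below returns the tags from most to least preferred. It is not taken from the server code.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, best first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        tag, _sep, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # a missing weight means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unreadable weight: treat as not acceptable
        if quality > 0:
            # Sort by descending weight, then by original position.
            ranked.append((-quality, position, tag))
    return [tag for _quality, _position, tag in sorted(ranked)]


print(parse_accept_language("fr-CH, fr;q=0.9, en;q=0.8, *;q=0.5"))
# -> ['fr-CH', 'fr', 'en', '*']
```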

Artists

One of the most pressing needs is for i18n of artist names. Although all transliterations and translations can be supported by aliases, currently only the "official" name is used for tagging (ArtistSortNames are displayed but not yet tagged). Especially since tagging of non-Latin names is poorly supported, the existing localization of artist names to Japanese and Chinese (or even Cyrillic) creates problems for other users, especially with VariousArtists releases where one or two artists with non-Latin names appear together with mostly Western artists.

ArtistNames

A default locale for releases by the artist (and locale indicator for the official ArtistName) should be added. RFE ME med

ArtistAliases

Since an artist can have an unlimited number of artist aliases, there is some support for i18n already; the MisencodingFAQ has served as a training manual for a number of moderators who have done an excellent job of adding aliases in different languages. RFE 1059830 high suggests adding locales to each alias to indicate their language.

ArtistSortNames

Currently, the database only supports a single sortname for each artist. In order to provide a consistent sort order across multiple alphabets, the generally accepted guideline is to use only roman (Latin) alphabet characters in ArtistSortNames. While this is less than ideal, a fully international solution is extremely complex, since the rules for sorting vary by locale, and conventions about spelling out numbers in names differ as well. Given the total amount of i18n and l10n work needed for more important problems, and the difficulty of solving this relatively unimportant one, it is probably best to postpone a better solution for this until after the i18n effort is largely complete. In the meantime, some of the following points should probably be added to the style guidelines:

  1. ArtistSortNames should be restricted to the Latin-1 (8859-1) character set (current convention allows any roman characters, even Vietnamese)
  2. Sortnames should indicate family name in Asian languages with comma, even though reversal is unneeded, e.g. Mao, Tse-Tung
  3. Sortnames should be transliterations of the official artist name, not translations (currently, translations are sometimes used)
  4. Transliterations should use the artist's home country's standard transliteration into roman characters supported by 8859-1
  5. If there is no standard transliteration, the standard English transliteration should be used - transliterations into other languages may be different (e.g. Ч = Ch in English, but Ч = Tch in French)
  6. However, common English spellings should be preferred, e.g. Tchaikovsky, not Chajkovskij.

Points 1 and 3 represent a change from existing convention - comments are welcome on the mailing list. In particular, point 3 is often broken for Asian artists who have "English names" that are not transliterations, but more like alternate names.
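
Point 1 is simple to check mechanically. The sketch below, with the hypothetical helper name is_latin1_sortname, tests whether a sortname can be encoded in ISO 8859-1. Note that Python's "latin-1" codec also accepts the control characters in that byte range, so a real validator would need further checks.

```python
def is_latin1_sortname(sort_name):
    """Return True if every character fits in ISO 8859-1 (Latin-1)."""
    try:
        sort_name.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


print(is_latin1_sortname("Mao, Tse-Tung"))  # True
print(is_latin1_sortname("Björk"))          # True: "ö" is in Latin-1
print(is_latin1_sortname("Nguyễn"))         # False: "ễ" is outside Latin-1
print(is_latin1_sortname("Чайковский"))     # False: Cyrillic
```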

For a more international sort, there could be a point requiring numbers to be written numerically (e.g. "4 Tops, The" rather than "Four Tops, The"), but as most languages specify sorting numbers as if written out in full (in the local language, of course), this is likely to meet weak acceptance and strong opposition.

Releases and Tracks

Having multiple aliases for each track title seems far too complex a mechanism to ever be implemented; instead it will probably be preferable to have translations on a release basis, so that there will be duplicate entries for releases sharing the same track time data and TRMs. (Release data and disc IDs may be something we don't want to share, so that the Japanese release (and disc ID) is associated with the Japanese translation but not the English titles; this is not yet entirely clear.)

Especially now that the database supports assigning DiscIDs to multiple releases, it is quite reasonable to have VirtualDuplicateReleases that are not truly duplicates, since they represent different translations/transliterations.

AdvancedRelationships could potentially be used to link different releases that represent translations/transliterations of each other. TarragonAllen's ReleaseGroups proposal could provide a framework for this as well.

It may be desirable to have per-track locale information, but this should probably be used to record the performance language of a particular track, which would not necessarily be the same as the language of the track title (especially on a translated release).

For titles where translations or transliterations are present together with the original title, perhaps there should be a StyleGuideline specifying use of square brackets; however, in most cases it will be preferable to have them in separate titles on duplicate releases. Parts of the original title that are written in Latin letters, e.g. (remix), should be omitted from the translated version, e.g. "Знаю Я (remix) [I Know]".

If a title is given only in translation or transliteration, do not use square brackets, e.g. "Yang Ku Tunggu" not "[Yang Ku Tunggu]".

On VariousArtists releases, Artist aliases that are most compatible with the locale of the release itself should be used. Thus, on an "en" compilation where a Chinese artist appears, her "en" alias (if any) would be preferred to her official "zh" name. There would probably need to be some interaction with user preferences here as well.
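
A minimal sketch of that selection rule, assuming aliases are stored as a mapping from locale to name: try the release's full locale first, then its language alone, and fall back to the official name. The function name pick_alias and the data layout are hypothetical, and user preferences are left out.

```python
def pick_alias(official_name, aliases, release_locale):
    """Choose the artist name that best matches a release's locale."""
    language = release_locale.split("_")[0]
    # Prefer an exact locale match ("en_US"), then the bare language ("en").
    for candidate in (release_locale, language):
        if candidate in aliases:
            return aliases[candidate]
    return official_name


print(pick_alias("Yazoo", {"en_US": "Yaz"}, "en_US"))   # Yaz
print(pick_alias("Yazoo", {"en_US": "Yaz"}, "en_GB"))   # Yazoo
print(pick_alias("王菲", {"en": "Faye Wong"}, "en_US"))  # Faye Wong
print(pick_alias("王菲", {"en": "Faye Wong"}, "zh"))     # 王菲
```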

Unicode issues

As there are sometimes multiple ways to represent the same symbols with different Unicode code point sequences (e.g. using combining accent marks), it may be desirable to enter these as aliases; a normalized form should be used for all names and titles in the database. See http://lists.w3.org/Archives/Public/www-xml-canonicalization-comments/2000Jan/0002.html - it's possible that the MusicBrainz server and/or database software may (or should) do this already.

There's a Perl Unicode normalization tool, Charlint, that could be used or adapted to do this normalization. For artist names and aliases, which are supposedly unique, NFKC (Normalization Form KC: compatibility decomposition followed by canonical composition) would probably be best, while for release and track titles, NFC (Normalization Form C: canonical decomposition followed by canonical composition, which doesn't change visible appearance) would probably be better.
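
The difference between the two forms can be seen with Python's standard unicodedata module. This is only a sketch of the suggestion above, not what the server does; the helper names normalize_name and normalize_title are assumptions.

```python
import unicodedata


def normalize_name(name):
    """NFKC: also folds compatibility characters (for artist names and aliases)."""
    return unicodedata.normalize("NFKC", name)


def normalize_title(title):
    """NFC: composes combining sequences without changing visible appearance."""
    return unicodedata.normalize("NFC", title)


decomposed = "Bjo\u0308rk"          # "o" followed by a combining diaeresis
fullwidth = "\uff21\uff22\uff23"    # fullwidth "ABC"
roman_nine = "\u2168"               # single-character Roman numeral nine

print(normalize_title(decomposed) == "Bj\u00f6rk")  # True: composed to one code point
print(normalize_title(fullwidth) == fullwidth)      # True: NFC leaves fullwidth forms alone
print(normalize_name(fullwidth))                    # ABC
print(normalize_name(roman_nine))                   # IX
```

NFKC therefore makes visually similar names compare equal, at the cost of discarding distinctions such as fullwidth forms, which is why it is suggested only for names that are meant to be unique.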

There are also ranges of Unicode characters that should be avoided as they do not provide increased range of expression but merely create interoperability issues for those without complete Unicode fonts. In particular, the following should be explicitly prohibited by style guidelines (and perhaps enforced by the database):

  1. Soft hyphen U+00AD
  2. Non-breaking space U+00A0
  3. Fullwidth latin and halfwidth kana/hangul U+FF00-FFEF
  4. Byte order mark U+FEFF
  5. Narrow non-breaking space U+202F
  6. Ideographic space U+3000
  7. Medium Mathematical space U+205F
  8. Typographic spaces U+2000-200B
  9. Roman numeral characters, e.g. single character IX U+2160-217F
  10. Private use surrogates and codes U+DB80-DBFF U+E000-F8FF
  11. Control characters U+0000-001F U+007F U+0080-009F

The first two are not specifically Unicode issues as they occur in Latin-1, but these plus the third are among the ones where database enforcement is most desirable as they can lead to artist names that are visually identical in appearance but which are, in fact, different. RFE ME low
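
A check for these characters could look like the sketch below. The ranges are copied from the list as printed on this page, and the names DISCOURAGED_RANGES and find_discouraged are hypothetical; this is an illustration of possible enforcement, not an existing server feature.

```python
# (first code point, last code point, label), following the list above.
DISCOURAGED_RANGES = [
    (0x0000, 0x001F, "control character"),
    (0x007F, 0x009F, "control character"),
    (0x00A0, 0x00A0, "non-breaking space"),
    (0x00AD, 0x00AD, "soft hyphen"),
    (0x2000, 0x200B, "typographic space"),
    (0x202F, 0x202F, "narrow non-breaking space"),
    (0x205F, 0x205F, "medium mathematical space"),
    (0x2160, 0x217F, "Roman numeral character"),
    (0x3000, 0x3000, "ideographic space"),
    (0xDB80, 0xDBFF, "private use surrogate"),
    (0xE000, 0xF8FF, "private use code"),
    (0xFEFF, 0xFEFF, "byte order mark"),
    (0xFF00, 0xFFEF, "fullwidth or halfwidth form"),
]


def find_discouraged(text):
    """Return (index, code point, label) for each discouraged character."""
    hits = []
    for index, char in enumerate(text):
        code = ord(char)
        for low, high, label in DISCOURAGED_RANGES:
            if low <= code <= high:
                hits.append((index, f"U+{code:04X}", label))
                break
    return hits


print(find_discouraged("AC\u00adDC"))  # [(2, 'U+00AD', 'soft hyphen')]
print(find_discouraged("Björk"))       # []
```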

Web Server

Automatic Transliteration

Automatic transliteration could be done for many languages if no transliterated/translated alias is available. For best results it is necessary to know the language: for example, Cyrillic script is used by several languages, and transliteration from Ukrainian will be subtly different from transliteration from Azerbaijani. In the case of Chinese, differences between dialects are even more dramatic. For Japanese, where identical kanji can have multiple different readings, the correct transliteration may not be easy to determine at all. In addition, individual artists often prefer a nonstandard transliteration of their names, or may have an "English" name that isn't really a transliteration.
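
To illustrate why the source language matters, here is a toy sketch with per-language tables for Cyrillic. The tables are deliberately tiny and cover only the letters of the example; they use Russian and Ukrainian as the contrasting pair, which is an assumption made for the example (common romanizations render "г" as "g" and "и" as "i" from Russian, but as "h" and "y" from Ukrainian). The names TABLES and transliterate are hypothetical.

```python
# Incomplete, illustration-only tables keyed by source language code.
TABLES = {
    "ru": {"а": "a", "г": "g", "и": "i", "р": "r", "ш": "sh"},
    "uk": {"а": "a", "г": "h", "и": "y", "р": "r", "ш": "sh"},
}


def transliterate(text, language):
    """Romanize `text` letter by letter using the table for `language`."""
    table = TABLES[language]
    result = []
    for char in text:
        latin = table.get(char.lower())
        if latin is None:
            result.append(char)  # leave unknown characters untouched
        elif char.isupper():
            result.append(latin.capitalize())
        else:
            result.append(latin)
    return "".join(result)


print(transliterate("Гриша", "ru"))  # Grisha
print(transliterate("Гриша", "uk"))  # Hrysha
```

The same written name yields two different romanizations, so an automatic fallback would need the language of the name, not just its script.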