History:Character Encodings

From MusicBrainz Wiki
Revision as of 20:04, 25 October 2011 by Reosarevok (talk | contribs)

Examples of different character (mis)encodings

  • UTF-8 :
  • This is the most common encoding for Unicode; it is not a mis-encoding unless it gets interpreted as ISO 8859-1 (Latin-1), as shown in this example.
   伊藤è³¢æ²» = 伊藤賢治 
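  • Each non-ASCII character generally turns into a group of two characters (for accented European letters, Cyrillic, Greek, Hebrew, Arabic etc.) or a group of three characters (for most Asian texts), so the garbled runs have an even length or a length that is a multiple of three. Each group of two starts with an accented capital such as Ã or Ð, and each group of three starts with an accented lowercase letter such as ä, å or è.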
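
The two-or-three expansion described above can be reproduced with a short Python sketch. The function names here are made up for illustration, and the sketch assumes the text was garbled by a strict Latin-1 reading; in practice the culprit is often Windows-1252, which differs in the 0x80-0x9F range, so treat this as a starting point rather than a guaranteed fix.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate the mis-encoding: UTF-8 bytes displayed as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_utf8_read_as_latin1(garbled: str) -> str:
    """Undo the mis-encoding by recovering the bytes and decoding as UTF-8.

    Raises UnicodeEncodeError or UnicodeDecodeError if the input was not
    produced by this particular mistake.
    """
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    for sample in ("é", "я", "賢"):
        garbled = misread_utf8_as_latin1(sample)
        # Accented Latin and Cyrillic letters become 2 characters, CJK becomes 3.
        print(sample, "->", len(garbled), "characters")
```

If the repair function raises an error, the text was probably not UTF-8 misread as Latin-1, and one of the other encodings on this page is a better guess.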
  • Big5 (variant HKSCS) :
  • This is probably the most common encoding for Chinese; it is used primarily for Traditional Chinese (as used in Taiwan and Hong Kong), and also turns up for Japanese and sometimes Korean titles.
   °ª³Ó¬ü  =  高勝美 
  • Generally has an even number of (mis)encoded characters (each Chinese character expands to two); large numbers of symbols, some accented characters.
  • GB2312 (variants GB18030, HZ) :
  • This is the second most common encoding for Chinese; it is used for Simplified Chinese (used in mainland China and Singapore).
   º«±¦ÒÇ  =  韩宝仪 
  • Essentially the same as Big5: even number of misencoded characters; large numbers of symbols.
  • EUC-KR :
  • An infrequent encoding for Korean.
   Á¶°ü¿ì  =  조관우 
  • If this generates something plausible that doesn't look like Korean (e.g. no circles in the characters), the correct encoding is probably EUC-JP.
  • Windows-1255 (CP1255, variant ISO 8859-8i ISO-Logical) :
  • Most common encoding for Hebrew.
   âìé÷øéä  =  גליקריה 
  • This misencoding will generate almost entirely accented alphabetics, in lower case only, with frequent presence of the division sign (÷ = ק). Note that the text displays right->left once the encoding is corrected, so if the first (leftmost) word of the misencoded title has two letters, the rightmost word of the correctly encoded title will have two letters.
  • ISO 8859-8 (ISO Visual) :
  • Bizarre left-to-right (reversed) encoding for Hebrew.
   ñéèøåô éîø  =  רמי פורטיס 
  • I have never seen this as a misencoding in FreeDB entries, but the encoding does occur in Web pages, e.g. Rami Fortis' discography. The character values are the same as in ISO 8859-8i (a subset of CP1255), but the insane thing is that the text is stored in display order, left->right (i.e. the same as English), so that all the words and phrases are sdrawkcab nettirw (written backwards). So, as in this example, if the rightmost word of the misencoded title has three letters, so does the rightmost word of the correctly encoded title (once it is written forwards). Serious problems can occur if you cut and paste text from a web page in ISO Visual into an edit page in MusicBrainz: since the character values are no different (only the display direction changes), the (Unicode) text that gets pasted is still in reverse order. (This is definitely true for Internet Explorer through 6.0, and possibly for other browsers as well.) To avoid this problem, look closely at pasted data to make sure it is correct. If you don't read Hebrew, check results closely with Google, and look carefully to see whether Hebrew web pages are encoded in ISO Visual. Be alert for pages that don't declare the encoding correctly but rely on browser auto-detection that doesn't work right, as in the Fortis discography pages. A giveaway there is that if Visual encoding is not selected, the track numbers show up on the left for English titles, instead of all on the right for both English and Hebrew. There is no known workaround for the cut & paste problem, but you may find the Unix program rev (which reverses the characters in each line) useful.
  • Windows-1253 (CP1253) :
  • Most common encoding for Greek.
   ÃéÜííçò Êüôóéñáò  =  Γιάννης Κότσιρας 
  • This misencoding will generate almost entirely accented alphabetics in proper case, with occasional uppercase letters in the misencoding for some Greek vowels carrying a tonos (accent mark), e.g. Ü = ά.
  • Windows-1251 (CP1251) :
  • Most common encoding for Cyrillic (used for Russian, Ukrainian, Serbian, Bulgarian).
   ÀóêöÛîí  =  АукцЫон 
  • This misencoding will generate almost entirely accented alphabetics in proper case. Occasional presence of the division sign (÷ = ч) and cedilla (¸ = ё).
  • KOI8-R :
  • This is a fairly rare encoding for Cyrillic, used mostly on older pre-Unicode Unix systems.
   ìÀÂÜ  =  Любэ 
  • Like Windows-1251, most characters are accented alphabetics; however, it is easily distinguished because uppercase Cyrillic letters map to lowercase Latin-1 and lowercase Cyrillic letters map to uppercase Latin-1, giving it a distinctive "broken caps-lock" look.
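
The "broken caps-lock" observation suggests a simple way to choose between the two Cyrillic encodings: decode the text both ways and keep whichever result has mostly lowercase letters. The sketch below is a heuristic under that assumption only, and the function name is made up for this example. It will guess wrong for titles written entirely in capitals.

```python
def guess_cyrillic(garbled: str) -> tuple:
    """Return (codec, text) for the more plausible Cyrillic decoding.

    Assumes the garbled text is Latin-1 and that real titles contain more
    lowercase than uppercase letters. On a tie, cp1251 wins.
    """
    raw = garbled.encode("latin-1")
    best = None
    for codec in ("cp1251", "koi8_r"):
        # cp1251 leaves one byte undefined, so replace instead of failing.
        text = raw.decode(codec, errors="replace")
        lower = sum(ch.islower() for ch in text)
        upper = sum(ch.isupper() for ch in text)
        score = lower - upper
        if best is None or score > best[0]:
            best = (score, codec, text)
    return best[1], best[2]


if __name__ == "__main__":
    for sample in ("ÀóêöÛîí", "ìÀÂÜ"):
        print(sample, "->", guess_cyrillic(sample))
```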
  • TIS-620/ISO 8859-11 (variant CP874) :
  • Standard encoding for Thai.
   ¾ÃÒÇ  =  พราว 
  • This appears similar to the Chinese encodings in that many symbols appear, but can be distinguished because titles may have an odd number of characters (although note that Chinese misencodings sometimes contain a soft hyphen which is not displayed, creating the appearance of an odd number of characters).
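
When the visual hints above are not enough, it can help to simply try every encoding on this page and look at the results side by side. The following Python sketch does that with the codecs from the standard library. The list of candidates and the function names are choices made for this example, and the sketch assumes the garbled text is what you get when the original bytes are displayed as Latin-1 (with Windows-1252 as a fallback).

```python
CANDIDATE_CODECS = [
    "utf-8", "big5", "gb2312", "euc_kr", "euc_jp",
    "cp1255", "cp1253", "cp1251", "koi8_r", "tis_620",
]


def to_raw_bytes(garbled: str) -> bytes:
    """Recover the original bytes from text displayed as Latin-1 or CP1252."""
    try:
        return garbled.encode("latin-1")
    except UnicodeEncodeError:
        # Characters such as curly quotes only exist in Windows-1252.
        return garbled.encode("cp1252")


def candidate_decodings(garbled: str) -> dict:
    """Map each codec that decodes the bytes without error to its result."""
    raw = to_raw_bytes(garbled)
    results = {}
    for codec in CANDIDATE_CODECS:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results


if __name__ == "__main__":
    for codec, text in candidate_decodings("¾ÃÒÇ").items():
        print(f"{codec:8} {text}")
```

Several codecs will usually decode without error, so a person still has to pick the result that reads as real text; the ISO Visual case also needs the reversal step on top.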