Unicode & UTF
☆ unicode toys

ASCII

ASCII was a standard 7-bit code (128 characters, codes 0-127). The first 32 characters were unprintable and used for control. Setting the eighth bit gave codes 128-255, whose meaning depended on which of many possible "code pages" was in use: the same byte could map to Hebrew, Greek, Chinese, or whatever the chosen code page defined.
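
For example, the same byte decodes to different characters under different code pages. A small Python sketch (the byte value 0xE9 and these particular code pages are just illustrative choices):

  # One byte, three different meanings, depending on the code page used to decode it.
  b = bytes([0xE9])
  print(b.decode("latin-1"))   # 'é'  (Western European)
  print(b.decode("cp1253"))    # 'ι'  (Greek)
  print(b.decode("cp1255"))    # 'י'  (Hebrew)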

Unicode

As the need to represent more characters (Chinese, Thai, Japanese) grew, a larger, unified coding scheme was needed.

Examples:   合 🌎 𡔒 𡔓 𡔚 Ϫ ڜ ۞ ߷ ߹ ੴ ⨹ 👰 🐜 🂡 

"Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code [UCS-2] where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad." (Spolsky)

Example

'Hello' in ASCII is 5 bytes: 48 65 6C 6C 6F. 'Hello' in Unicode is represented by five code points: U+0048 U+0065 U+006C U+006C U+006F.
Unicode is a straightforward list of code points, but encoding Unicode characters is more complex than one might assume. "There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway." (Spolsky) The Unicode character set is huge, containing more than 110,000 characters. Unicode can be implemented by many different character encodings, including UTF-8 and UTF-16.
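
The same 'Hello' example can be checked in Python (a small sketch; any Python 3 will do):

  s = "Hello"
  print([hex(ord(c)) for c in s])  # ['0x48', '0x65', '0x6c', '0x6c', '0x6f'] -- the five code points
  print(s.encode("utf-8").hex())   # '48656c6c6f' -- encoded in UTF-8, the same five bytes as ASCII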

UTF Encoding of Unicode

UTF: Unicode Transformation Format

A simple fixed 16-bit encoding of Unicode characters had problems (see the sketch after this list):
  1. It doubles the length of ALL text, even plain (and common) ASCII text where character codes are <= 127.
  2. It introduces little-endian vs. big-endian issues.
  3. Not enough bits: a 65,536-character limit, while Unicode defines over 100,000 characters as of 2013.
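
A rough sketch of these three problems, using Python's UTF-16 codecs (which behave like a plain 16-bit encoding for characters below U+FFFF):

  s = "Hello"
  print(len(s.encode("ascii")), len(s.encode("utf-16-le")))  # 5 vs. 10 bytes -- problem 1: everything doubles
  print(s.encode("utf-16-le"))     # b'H\x00e\x00l\x00l\x00o\x00'
  print(s.encode("utf-16-be"))     # b'\x00H\x00e\x00l\x00l\x00o' -- problem 2: same text, two byte orders
  print("🌎".encode("utf-16-be"))  # b'\xd8<\xdf\x0e' -- problem 3: U+1F30E doesn't fit in 16 bits (surrogate pair)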

UTF-8 was invented as a variable-length encoding scheme for Unicode. It allows ONE-byte encoding for traditional (and common) ASCII characters and up to 4 bytes (21 data bits) for other characters. It is endianness neutral.
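
A quick sketch of the variable-length property (the characters below are arbitrary picks, one from each length class):

  for ch in "A", "é", "合", "🌎":
      b = ch.encode("utf-8")
      print(f"U+{ord(ch):04X} -> {len(b)} byte(s): {b.hex()}")
  # U+0041 -> 1 byte(s): 41
  # U+00E9 -> 2 byte(s): c3a9
  # U+5408 -> 3 byte(s): e59088
  # U+1F30E -> 4 byte(s): f09f8c8e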

UTF-16 was invented as a variable-length encoding scheme for Unicode. It allows two-byte or four-byte encoding of characters. UTF-16 suffers from endian specificity and so must be encoded differently depending on architecture (UTF-16BE, UTF-16LE). UTF-16 is used by the .NET environment, Mac OS X's Cocoa and Core Foundation frameworks, as well as Python and Java. "UTF-16 is the worst of both worlds - variable length and too wide. It exists for historical reasons, adds a lot of confusion and will hopefully die out." (UTF-8 Everywhere Manifesto)
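
The byte-order problem is easy to see with Python's built-in codecs (a sketch; the plain "utf-16" codec writes the machine's native byte order, shown here for a little-endian machine, prefixed with a byte order mark):

  print("A".encode("utf-16-be"))  # b'\x00A'          -- big-endian
  print("A".encode("utf-16-le"))  # b'A\x00'          -- little-endian
  print("A".encode("utf-16"))     # b'\xff\xfeA\x00'  -- BOM + native (here little-endian) order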

UTF-8 Definition

UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n > 1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.

The table below summarizes the format of these different octet types. The letter x indicates bits available for encoding bits of the character number.

  Char. number range (hexadecimal) | UTF-8 octet sequence (binary)        | data bits
  ---------------------------------+--------------------------------------+----------
  0000 0000 - 0000 007F            | 0xxxxxxx                             |     7
  0000 0080 - 0000 07FF            | 110xxxxx 10xxxxxx                    |    11
  0000 0800 - 0000 FFFF            | 1110xxxx 10xxxxxx 10xxxxxx           |    16
  0001 0000 - 0010 FFFF            | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  |    21

Encoding a character to UTF-8 proceeds as follows:

  1. Determine the number of octets required from the character number and the first column of the table above. It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character.
  2. Prepare the high-order bits of the octets as per the second column of the table.
  3. Fill in the bits marked x from the bits of the character number, expressed in binary. Start by putting the lowest-order bit of the character number in the lowest-order position of the last octet of the sequence, then put the next higher-order bit of the character number in the next higher-order position of that octet, etc. When the x bits of the last octet are filled in, move on to the next to last octet, then to the preceding one, etc., until all x bits are filled in.
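
The procedure above maps directly onto a few shifts and masks. A minimal sketch (the function name utf8_encode is my own, and this sketch does not reject the surrogate range U+D800..U+DFFF; Python's built-in str.encode("utf-8") is the real thing):

  def utf8_encode(cp: int) -> bytes:
      """Encode one code point (U+0000..U+10FFFF) as UTF-8, following the table above."""
      if cp <= 0x7F:        # 1 octet:  0xxxxxxx
          return bytes([cp])
      if cp <= 0x7FF:       # 2 octets: 110xxxxx 10xxxxxx
          return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
      if cp <= 0xFFFF:      # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
          return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
      if cp <= 0x10FFFF:    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                        0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
      raise ValueError("code point out of range")

  # Sanity check against the built-in encoder.
  for ch in "Aé合🌎":
      assert utf8_encode(ord(ch)) == ch.encode("utf-8")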

Summary

"If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that 'plain' text is ASCII." (Spolsky)

References

  Spolsky on Character Codes
  UTF-8 and Unicode
  Brief History of Character Codes
  UTF-8
  Unicode Reference
  UTF-8 Encoding Format