Tuesday, September 3, 2013

Notes on Character Encoding and Unicode

We often see terms such as ASCII, UTF-8, and Unicode. They can be confusing, so here are some notes on them.
  1. An encoding is a way to represent the characters of a language as numbers that a computer can store.
  2. ASCII was designed for English. It uses 7 bits, so each value from 0 to 127 is mapped to a character. That is not enough for other languages.
  3. ISO 8859 is a family of 8-bit encodings (ISO 8859-1, also called Latin-1, is the best known). It can be thought of as an extension of ASCII: values 0 to 127 represent the same characters as in ASCII, and the extra bit provides another 128 values. Each part of the family covers a different group of languages, and ISO 8859-1 is sufficient for most Western European languages.
  4. Unicode also maps characters to numbers (called code points). Code points range from 0x0 to 0x10FFFF, so they need at most 21 bits. They are often stored in 32 bits, but most 32-bit values are not valid code points. For every character that has an ASCII value, the Unicode code point and the ASCII value of that character are the same.
  5. Unicode is a specification, not an implementation (the same can be said of ASCII). It assigns a code point to each character. UTF-8 and UTF-16 are two of the encoding forms that define how those code points are stored.
  6. UTF-8 is a variable-length encoding: it uses 8 bits for ASCII characters and 16, 24, or 32 bits for other characters. UTF-16 is also a variable-length encoding: it uses 16 bits for code points up to 0xFFFF and 32 bits for the rest. So the numbers 8 and 16 in the names of these two encoding schemes can be thought of as the smallest number of bits used to encode a character.
  7. Java uses Unicode for String and stores it as UTF-16. In Java source, you can use a Unicode escape to represent a character. For example, you can use \u0078 for 'x'. This format has four hex digits, so its largest possible value is \uFFFF, which does not cover all the values in Unicode (the largest is 0x10FFFF). For code points above 0xFFFF, Java uses a surrogate pair, which is a combination of two \u escapes.
  8. In Java, you can also use another format to represent a char value. For example,
    char var = '\101';
    Here the variable var is equal to 'A'. The formal name for this representation is "octal escape". The maximum value allowed in Java is '\377', which is equivalent to the decimal value 255 because 3*8^2 + 7*8 + 7 = 255. It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated in C.
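  9. Note that Unicode escapes in Java source are translated before the rest of the compilation takes place. So for example "\u0033\u0078\u1000\u0079" is a valid string that consists of the 4 characters '3', 'x', U+1000 (a Myanmar letter, which shows up as "?" when the font lacks it), and 'y'.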
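
The notes above can be checked with a small Java sketch. The class name EncodingNotes, the helper methods utf8Bytes and utf16Bytes, and the four sample characters are my own choices for illustration. Only standard JDK calls are used, and the sample characters are written as escapes so the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;

public class EncodingNotes {

    // Bytes needed to store the string in UTF-8.
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the string in UTF-16
    // (big-endian variant, so no byte order mark is added).
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Notes 4 and 7: 'x' has the same number in ASCII and Unicode
        // (0x78 in hex is 120 in decimal).
        System.out.println((int) 'x');        // 120
        System.out.println('\u0078' == 'x');  // true

        // Note 8: octal escapes stop at 377 octal, which is 255 decimal.
        System.out.println('\101' == 'A');    // true
        System.out.println((int) '\377');     // 255

        // Note 6: the same character can need a different number of bytes
        // in each encoding. Samples: U+0078, U+00E9, U+1000, U+1F600.
        String[] samples = { "x", "\u00E9", "\u1000", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s));
        }

        // Note 7: U+1F600 is above 0xFFFF, so Java stores it as a surrogate pair.
        String face = new String(Character.toChars(0x1F600));
        System.out.println(face.length());                             // 2 UTF-16 code units
        System.out.println(face.codePointCount(0, face.length()));     // 1 code point
        System.out.println(Character.isHighSurrogate(face.charAt(0))); // true
    }
}
```

For the four samples the loop should report 1, 2, 3, and 4 bytes in UTF-8, and 2, 2, 2, and 4 bytes in UTF-16. That matches note 6: 8 and 16 are the smallest sizes, not the only ones.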

3 comments:

  1. Great tips! Just a couple of corrections:

    ---------------------------
    4. Unicode also maps characters to numbers (called code points). It uses 32 bits. But the range of the values is not all the 32 bit numbers. The range is from 0x0 to 0x10FFFF. Unicode uses different numbers from ASCII to represent common characters. For example, ASCII uses 120 for 'x'. But Unicode uses 78 for 'x'.
    ---------------------------

    Oops: 0x78 (hexadecimal) is the same as 120 (decimal). Unicode and ASCII use the same number for 'x'. That's how UTF-8 can look the same as ASCII if you use only ASCII characters (no ≈, ↑, é, ä, ŭ, ĉ, etc.)


    ---------------------------
    8. In java, you can also use another format to represent a char value. For example, '\65' represents character 'A'. This representation uses the ASCII values to represent characters. But you will find that the maximum value in this format is not 127 or 0x10FFFF. It is '\377' for some reason. Other values are not allowed by Java. The formal name for this representation is "octal escapes". It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated in C.

    ---------------------------
    The code for 'A' is 65 (decimal), 0101 (octal), or 0x41 (hexadecimal). So both '\101' and '\u0065' represent 'A'.

  2. Thanks. And now I just noticed that my own comment has an error. 'A' is \u0041, and not \u0065! :-)
