What is UTF-8?
UTF stands for Unicode Transformation Format. The 8 refers to the 8-bit units (bytes) the
encoding is built from: each character is stored in as few as one or as many as 4 bytes. UTF-8 is an
encoding scheme for Unicode characters. Its encoding method is compatible with standard ASCII encoded characters.
It is widely used as a standard for character encoding, particularly on the web.
UTF-8 stays compatible with standard ASCII by encoding values less than 128 in a single byte,
using the 7 lowest-order bits and leaving the highest-order bit 0. If the value is higher than 127,
the highest-order bits of the first byte are set to indicate how many bytes will be used to store the character.
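The rule above can be sketched in Python. This is only an illustration: the function name utf8_sequence_length is made up for this example, and it looks at the bit pattern of the first byte alone, so it does not reject every byte value that a strict UTF-8 decoder would refuse.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the byte count signaled by the first byte of a UTF-8 sequence.

    Hypothetical helper for illustration; it checks only the leading bit
    pattern and does not perform full UTF-8 validation.
    """
    if first_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx marks a continuation byte, and 11111xxx is never used.
    raise ValueError(f"not a leading byte: {first_byte:#04x}")


# An ASCII character produces the same single byte in ASCII and in UTF-8.
ascii_matches = "A".encode("utf-8") == "A".encode("ascii")

# Characters above 127 need more bytes; the first byte says how many.
lengths = [utf8_sequence_length(ch.encode("utf-8")[0]) for ch in "Aé€"]
```

Here lengths pairs each character with the length read from its first byte, which should agree with the number of bytes Python's own encoder produces for that character.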
Example: The first byte of a two-byte UTF-8 encoded character starts with 110, counting from the highest-order bit.
The two 1 bits indicate that two bytes will be used. The 0 indicates the end of the signaling bits.
That leaves 5 bits for the character value. The second byte starts with a 1 and a 0, and
its remaining 6 bits hold the rest of the character value. The second, third, and fourth bytes of any
multibyte UTF-8 encoded character each start with a 1 and a 0, followed by 6 bits of the character value.
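The two-byte layout described above can be worked through by hand. The sketch below is a minimal illustration, not a full encoder: encode_two_byte is a hypothetical name, and it handles only the range that fits in 5 + 6 = 11 bits while still needing two bytes (U+0080 through U+07FF).

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point as a two-byte UTF-8 sequence (illustrative only)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers U+0080 through U+07FF only")
    # First byte: signaling bits 110, then the top 5 bits of the value.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: marker bits 10, then the low 6 bits of the value.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


encoded = encode_two_byte(ord("é"))           # U+00E9
print([f"{b:08b}" for b in encoded])          # ['11000011', '10101001']
print(encoded == "é".encode("utf-8"))         # True
```

The value 0xE9 is 00011101001 as an 11-bit number. Its top 5 bits (00011) follow the 110 prefix in the first byte, and its low 6 bits (101001) follow the 10 prefix in the second byte.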