What is UTF-8?

UTF stands for Unicode Transformation Format. The 8 refers to the size of the code unit: UTF-8 encodes each character as a sequence of 8-bit units (bytes), using as few as one and as many as four bytes per character. UTF-8 is an encoding scheme for Unicode characters, and it is compatible with standard ASCII. It is now widely used as the standard for character encoding, particularly on the web. UTF-8 stays compatible with ASCII by encoding values below 128 in a single byte, using the 7 lowest-order bits and leaving the highest-order bit as 0. If the value is higher than 127, the highest-order bits of the first byte are set to indicate how many bytes are used to store the character.
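To make the byte counts concrete, here is a minimal Python sketch that prints the UTF-8 bytes of a few characters in binary. The helper name utf8_bits and the sample characters are chosen for illustration only; str.encode is Python's built-in encoder.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


# One sample character for each possible length (1 to 4 bytes).
for ch in ["A", "é", "€", "😀"]:
    byte_count = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {byte_count} byte(s): {utf8_bits(ch)}")
```

The ASCII letter A comes out as a single byte whose highest-order bit is 0, identical to its ASCII value. The other characters start with 110, 1110, and 11110 respectively, one leading 1 bit per byte in the sequence.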

Example: The first byte of a two-byte UTF-8 encoded character starts with 110, counting from the highest-order bit. The two leading 1 bits indicate that two bytes are used, and the 0 marks the end of the signaling bits. That leaves 5 bits of the first byte for the character value. The second byte starts with 10, and its remaining 6 bits hold the rest of the character value, so a two-byte sequence carries 11 bits of character value in total. The second, third, and fourth bytes of any multibyte UTF-8 encoded character each start with 10 and carry 6 bits of the character value.
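The two-byte layout described above can be sketched by hand in Python. The function name encode_two_byte is hypothetical and covers only the two-byte range (U+0080 to U+07FF); it is an illustration of the bit layout, not a replacement for the built-in encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    # First byte: signaling bits 110, then the top 5 bits of the value.
    lead = 0b11000000 | (code_point >> 6)
    # Second byte: signaling bits 10, then the low 6 bits of the value.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


# U+00E9 (é) should match what Python's own encoder produces.
manual = encode_two_byte(0xE9)
print(manual == "é".encode("utf-8"))
```

For U+00E9 the value in 11 bits is 00011 101001, so the first byte becomes 110 00011 and the second becomes 10 101001.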