Unicode Character Set (UTF-8-bit)
(CP65001/ISO-10646-UTF-1/UTF-8)
UTF-8 is Unicode’s 8-bit transformation format, well suited to Web sites, which mostly rely on 8-bit communication paths. It is my preferred transformation format, because most documents written in UTF-8 take up less space than the same documents in UTF-16. The exception is text made up almost entirely of Asian or Indic characters, which need three bytes each in UTF-8 but only two in UTF-16.
Another advantage of UTF-8 is that you can edit UTF-8 documents directly in text editors that are not Unicode-compliant without having to confront non-printing, non-editable C0 control-code bytes (the first 32 characters of almost every character set). In such an editor, characters that are not Basic Latin are displayed as pairs or trios of characters: each group is a header (lead) byte followed by the middle and trailer (continuation) bytes that UTF-8 uses to build every character beyond Basic Latin. Extraplanar characters are displayed as four-character groups.
For most HTML browsers, it is unnecessary to include the byte-order mark (which in UTF-8 is the byte sequence 0xEF 0xBB 0xBF) for the document to be interpreted as UTF-8, provided the META tag declaring the character set is properly included in the HTML header. Some modern text editors, however, do need the mark, and without it would open the HTML file as ASCII.
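The size, display, and byte-order-mark points above can be checked with a short Python sketch. The helper names (`encoded_sizes`, `as_seen_by_legacy_editor`, `with_bom`) are made up for illustration, and Windows-1252 is only an assumed stand-in for whatever code page a non-Unicode editor might use.

```python
import codecs


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, not counting any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def as_seen_by_legacy_editor(text):
    """Decode the UTF-8 bytes of text as Windows-1252 (an assumed legacy code page).

    Each multi-byte UTF-8 sequence shows up as a group of two to four characters.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def with_bom(text):
    """Encode text as UTF-8 preceded by the byte-order mark EF BB BF."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


latin = "Unicode"                                   # Basic Latin only
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"      # six Devanagari code points

print(encoded_sizes(latin))    # UTF-8 is half the size of UTF-16 here
print(encoded_sizes(hindi))    # UTF-16 is the smaller one here
print(with_bom("<html>")[:3])  # the three BOM bytes
```

With these definitions, `as_seen_by_legacy_editor("\u00e9")` gives the two-character pair `Ã©`, and the euro sign `"\u20ac"` comes out as the trio `â‚¬`, which is the kind of display described above.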
All of Unicode’s Plane-0 characters in UTF-8 are displayed below on two separate pages: a single, 474-KB page for the public characters, which may take a lot of patience to load if you don't have a high-speed Internet connection; and a smaller page for the private characters. All of Unicode 5.0’s extraplanar characters (Planes 1, 2, 14, 15, & 16), which in UTF-8 require four bytes each, are displayed below on individual pages of 600 KB or larger.
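To see why Plane-0 characters fit in one to three bytes while extraplanar characters always take four, here is a minimal sketch of the UTF-8 bit layout in Python. The function name `utf8_encode` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does the same job, and the sketch is only meant to make the header and continuation bytes visible.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if code_point < 0x80:
        # Basic Latin: one byte, identical to ASCII.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes (rest of Plane 0): 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes (Planes 1 to 16): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+1D11E (Plane 1) and U+E0001 (Plane 14) both need four bytes.
print(utf8_encode(0x1D11E).hex(" "))
print(utf8_encode(0xE0001).hex(" "))
```

For example, `utf8_encode(0x1D11E)` yields the bytes F0 9D 84 9E, and the result agrees with `chr(0x1D11E).encode("utf-8")`.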

For further help on Unicode, please visit the Unicode Consortium.
Run by: Leroy Vargas. For feedback related to Unicode or this website, Leroy can be contacted through his Lycos address.