Friday, August 22, 2014

Character encoding, entity references and UTF-8

There have been a lot of questions in the forums recently, all of which touch on a very important but often misunderstood part of building websites - character encoding, or how a document stores and displays different characters on a page.
The basics of character encoding - US-ASCII
In the beginning there was binary - all information is stored as a series of ones and zeros, or "on" and "off" - the heart of computing and electronics. In order to display alphanumeric characters, a standard was created which defined which binary sequence represented which character. This was the American Standard Code for Information Interchange, or ASCII. There were a few variants, the most well-known by far being US-ASCII, still in widespread use today.
With ASCII, each character is represented by a single-octet sequence. One byte, one letter. The biggest weakness with US-ASCII is that it only includes characters used in English, excluding any accented letters or regional variations such as the German double S.
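To make the "one byte, one letter" rule concrete, here is a minimal sketch in Python (my choice of language for illustration; the sample words are my own). It shows that an English-only string encodes to exactly one byte per character, while a word containing the German sharp s cannot be encoded in ASCII at all.

```python
# US-ASCII: every character maps to exactly one byte in the range 0-127.
text = "Strasse"
ascii_bytes = text.encode("ascii")
one_byte_per_letter = len(ascii_bytes) == len(text)

# The German sharp s (U+00DF) is outside the ASCII repertoire,
# so asking for an ASCII encoding raises an error.
try:
    "Stra\u00dfe".encode("ascii")
    sharp_s_fits = True
except UnicodeEncodeError:
    sharp_s_fits = False

print(list(ascii_bytes))
print(one_byte_per_letter, sharp_s_fits)
```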
Stage two - the ISO standards
To meet the needs of users who required more than the basic a-z / A-Z sequence, extensions to ASCII were developed and approved by the ISO. The best known are the ISO-8859 series, which use the same sequences as ASCII but add extra characters for accented letters and regional variations. ISO-8859-1 covers most western European languages, such as English, French and Italian.
ISO-8859-1 versus windows-1252
ISO-8859-1 became the standard encoding for most Unix and Unix-like systems. However, when Microsoft developed Windows, it used a slight variation on ISO-8859-1 commonly known as windows-1252. The differences between the two boil down to 27 characters (including the Euro symbol, certain angled quote marks, the ellipsis and the conjoined oe or "oe ligature") which windows-1252 uses in place of 27 control characters in ISO-8859-1. Within Windows, ISO-8859-1 is silently replaced by windows-1252, which often means that copy/pasting content from, say, a Word document leaves the web page with validation errors. Many web authors incorrectly assume that the fault is with the characters themselves and that using entity references is the only way to include accented characters. In fact, if you are using a western European language and ISO-8859-1, most accented characters such as é è û î etc. can be used without resorting to entities such as &eacute; or similar. (ISO-8859-1 does not include an oe ligature for a very bizarre reason, but that's another story! You must therefore use &oelig; instead.)
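The difference between the two encodings can be checked directly. The sketch below uses Python's built-in "cp1252" and "latin-1" codecs (Python's names for windows-1252 and ISO-8859-1) and walks the byte range 0x80-0x9F, where ISO-8859-1 has control characters. The helper name is my own, and the count depends on how Python's codec table defines windows-1252.

```python
def decodes_as_cp1252(byte_value):
    """Return True if windows-1252 assigns a character to this byte."""
    try:
        bytes([byte_value]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False

# Bytes 0x80-0x9F are control characters in ISO-8859-1.
printable_in_cp1252 = [b for b in range(0x80, 0xA0) if decodes_as_cp1252(b)]

# The same byte means different things in the two encodings.
euro = b"\x80".decode("cp1252")          # Euro sign in windows-1252
control = b"\x80".decode("latin-1")      # C1 control character in ISO-8859-1
oe_ligature = b"\x9c".decode("cp1252")   # oe ligature, absent from ISO-8859-1

# Ordinary accented letters are identical in both.
same_e_acute = "\u00e9".encode("cp1252") == "\u00e9".encode("latin-1")

print(len(printable_in_cp1252), euro, oe_ligature, same_e_acute)
```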
Character encoding on the web - HTML entity references
In order to get around character encoding problems on the web, a method was introduced to "encode" non-ASCII characters in HTML without having to change charsets away from the widely-supported US-ASCII. Accented characters such as é (e acute) can be encoded as &eacute; and the user agent would "translate" that into the appropriate character. These entity references or character entities are defined within the HTML document type definition (DTD) - in HTML 4.0, for example, there are over two hundred different entity references defined.
There are several weaknesses with the entity references approach. Firstly, they are excessively verbose - in ISO-8859-1 an e acute takes up one byte of space, whereas the entity reference takes up 8 bytes. The second problem is that they are only useful in the context of a parsed HTML document - read the source code as plain text and the result can end up verging on gibberish, especially if you are using a language which relies heavily on accents, such as Polish. Even in French, if you want to write the phrase à côté it ends up as &agrave; c&ocirc;t&eacute;.
HTML entities are a tag soup solution to a tag soup problem, and this is seen most clearly with the third problem with entity references - XML.
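The size comparison above is easy to reproduce. This short sketch uses html.unescape from Python's standard library to resolve the entity reference for e acute, then compares the byte counts; the variable names are only illustrative.

```python
import html

reference = "&eacute;"
literal = html.unescape(reference)  # the actual e acute character

reference_size = len(reference.encode("ascii"))    # bytes used by the entity reference
literal_size = len(literal.encode("iso-8859-1"))   # bytes used by the character itself

# The French phrase from the text, as it looks in entity-encoded source.
phrase_source = "&agrave; c&ocirc;t&eacute;"
phrase = html.unescape(phrase_source)

print(reference_size, literal_size)
print(len(phrase_source), "characters of source for", len(phrase), "characters of text")
```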
HTML entity references, RSS and XML
The entity references "solution" falls down once you start working with XML. Unlike HTML 4.0 or XHTML 1.0, which have DTDs which define the entity references, most XML does not have a doctype declaration, so none of those entity references are valid. What's worse, as XML doesn't share HTML's liberal error-handling, using undefined entities will break the document.
XML actually has only five defined entity references, the bare minimum required for functionality. They are: &amp; &apos; &quot; &lt; and &gt;. There are various hacks and methods to add extra entity references to your XML, but the only real solution is to avoid their use entirely.
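A quick way to see this behaviour is to feed small fragments to an XML parser. The sketch below uses xml.etree.ElementTree from Python's standard library; the parses helper is my own. It also tries a numeric character reference, which is a different mechanism from named entity references and needs no DTD.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The five predefined XML entity references are always available.
predefined_ok = parses("<p>&amp; &apos; &quot; &lt; &gt;</p>")

# An HTML entity reference is undefined without a DTD, so parsing fails.
html_entity_ok = parses("<p>caf&eacute;</p>")

# A numeric character reference works without any DTD.
numeric_ok = parses("<p>caf&#233;</p>")

print(predefined_ok, html_entity_ok, numeric_ok)
```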
The most popular use of XML on the web at the moment is RSS and syndication. RSS is an XML format, so if you are using entity references, for example held in a database, then you will have difficulties producing a valid RSS feed. What's more, encoding directly in, say, ISO-8859-1 doesn't completely solve your problem as you are limited in the characters you can use. Want to add a copyright notice in your feed? In HTML you can use &copy;, but in RSS you just get a parsing error. (ISO-8859-1 does include the copyright sign itself, but for characters outside its repertoire, such as the Euro symbol or curly quotes, it offers no alternative.)
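One way to deal with entity references already sitting in stored content is to resolve them to real characters before building the feed. The following is a minimal sketch under stated assumptions: stored_title is a made-up database value, and only a single item element is built rather than a complete RSS feed. ElementTree escapes the characters that XML requires and writes everything else as UTF-8.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical value as it might be stored in a database with HTML entities.
stored_title = "Caf&eacute; news &copy; 2014 &amp; beyond"

# Resolve HTML entity references into the real characters.
clean_title = html.unescape(stored_title)

# Build a fragment of a feed; ElementTree escapes & < > as needed.
item = ET.Element("item")
ET.SubElement(item, "title").text = clean_title
feed_bytes = ET.tostring(item, encoding="utf-8")

# Parsing the output again confirms it is well-formed and lossless.
round_trip = ET.fromstring(feed_bytes).findtext("title")

print(feed_bytes)
```

Note that the ampersand is re-escaped as one of the five predefined XML references, while the e acute and the copyright sign are written as raw UTF-8 bytes.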
One encoding for every language - Unicode and UTF-8
In order to overcome the hodge-podge of incomplete, conflicting and aging standards (the ISO-8859 series dates from the 1980s), Unicode was developed. The differing versions of the ISO-10646 standard (Unicode has been approved by the ISO) are beyond the scope of this very brief introduction, but the important difference with Unicode is that it offers one single character set for all of the world's languages. The second difference is that its encodings are multi-byte rather than a simple one byte per character representation.
By far the most important Unicode encoding on the web is UTF-8. This standard has numerous advantages, the most important of which is that it remains compatible with the much earlier US-ASCII standard. In fact, all of the single-byte ASCII characters are represented in exactly the same way in UTF-8. Only extended characters are different, made from multi-byte strings defined for each character, whether an e acute, an oe ligature, or characters from Arabic, Russian, Urdu or Japanese.
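The ASCII compatibility and the multi-byte sequences can both be seen by encoding a few sample characters. This is only an illustrative sketch; the particular characters chosen are my own examples.

```python
# Sample characters, written as escapes so the source itself stays ASCII.
samples = {
    "A": "Latin capital A (ASCII)",
    "\u00e9": "e acute",
    "\u0153": "oe ligature",
    "\u0416": "Cyrillic letter zhe",
    "\u65e5": "Japanese kanji for 'sun'",
}

utf8_lengths = {}
for char, name in samples.items():
    encoded = char.encode("utf-8")
    utf8_lengths[char] = len(encoded)
    print(f"{name}: {len(encoded)} byte(s), hex {encoded.hex()}")

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_text = "Hello, world"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```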
UTF-8 is especially important for XML as it is the default encoding for all XML documents. And as you can't use HTML entity references and earlier ISO-8859 standards are incomplete, UTF-8 is the only logical choice when dealing with XML formats such as RSS or Atom which, even if you are only using English, are more than likely to eventually need more than the basic ASCII charset can offer.
UTF-8 is incredibly useful in HTML/XHTML too - no more entity references, the possibility to use extended characters such as curly quotes or long dashes, the possibility of using one charset across a multi-lingual site.
The downsides to UTF-8
There remain a few hurdles to UTF-8 acceptance, most of which can be minimised or overcome.
- Browser support is excellent, with IE5.x up supporting UTF-8 fully, as do Mozilla/Firefox, Opera, Safari, Konqueror, etc. However earlier browsers such as IE4 and NN4 have problems, and IE3/NN3 and earlier lack support. Bear in mind that documents using markup older than HTML 4.0 cannot use UTF-8.
- The scripting language PHP (and some others) can have problems with multi-byte strings. See an excellent earlier WebmasterWorld thread by ergophobe: UTF-8, ISO-8859-1, PHP and XHTML [webmasterworld.com]. However if you check out how beautifully the PHP-driven WordPress handles UTF-8 content, it is clear that UTF-8 and PHP can successfully mix.
- Just because you can add content in, say, traditional Chinese to your site doesn't mean that the end-user has an appropriate font to display it - you still need to test and ensure compatibility when it comes to defining font families and such for your target audience.
How to implement UTF-8 on your site
If your site's language is English, simply swapping your ISO-8859-1 meta tags to UTF-8 gives the impression that you have succeeded. However, there is a little more to it than that. You still need to ensure that any non-ASCII content is correctly encoded. Users of other languages will almost certainly need to convert their files to UTF-8.
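Why changing the label alone is not enough can be shown with a few bytes. In this sketch, a word saved as ISO-8859-1 is read as if it were UTF-8, which is effectively what a browser does when only the meta tag has been changed. The sample words are my own.

```python
# A word saved in ISO-8859-1: the e acute is the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

# Reading those bytes as UTF-8 fails, because 0xE9 alone is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# A lenient reader substitutes the replacement character instead.
shown_to_reader = latin1_bytes.decode("utf-8", errors="replace")

# Pure ASCII content survives the relabelling, which is why English-only
# pages appear to work after changing nothing but the meta tag.
english_only = "cafe".encode("iso-8859-1").decode("utf-8")

print(decodes_cleanly, repr(shown_to_reader), english_only)
```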
Most modern text and wysiwyg editors handle UTF-8 perfectly - in most cases, it is simply a case of going to "Save As" and choosing "UTF-8" or "Unicode" from the options. From then on, you can tidy up any entity references and start using the true characters. One useful tip is to copy/paste from a word-processing program such as Word which automagically replaces, for example, straight quotes with the appropriate "curly" opening and closing quotes.
If you are using a Linux or similar Unix-like server or desktop, you can use iconv to batch-convert many files at once.
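With iconv, a single file is typically converted with something like iconv -f ISO-8859-1 -t UTF-8 old.html > new.html. For readers without iconv, the same idea can be sketched in Python. The function names, the "*.html" pattern and the assumption that every file really is ISO-8859-1 are all mine, not part of the original post; the demonstration at the end runs on a temporary directory so no real files are touched.

```python
import tempfile
from pathlib import Path

def convert_to_utf8(path, source_encoding="iso-8859-1"):
    """Rewrite one file in place as UTF-8, assuming it is in source_encoding."""
    raw = Path(path).read_bytes()
    text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode("utf-8"))

def convert_tree(root, pattern="*.html", source_encoding="iso-8859-1"):
    """Convert every matching file under root and return the converted paths."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        convert_to_utf8(path, source_encoding)
        converted.append(path)
    return converted

# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.html"
    sample.write_bytes("<p>\u00e0 c\u00f4t\u00e9</p>".encode("iso-8859-1"))
    converted = convert_tree(tmp)
    result = sample.read_bytes()

print(len(converted), result)
```

Run a conversion like this only once per file and keep a backup: converting a file that is already UTF-8 would double-encode its accented characters.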
Conclusion
If you are serious about standards, character encoding matters - even if you are just producing content in English. UTF-8 offers huge advantages, and you have everything to gain by moving to UTF-8 for new content.
Further reading
If you want a better or more detailed introduction to Unicode and character encoding in general, try some of these links:
  • A tutorial on character code issues [cs.tut.fi] (Jukka Korpela)
  • On the Goodness of Unicode [tbray.org] (Tim Bray)
  • What Is Unicode? [unicode.org] (Unicode Consortium)
  • http://www.webmasterworld.com/forum21/11176.htm
