February 24, 2006

Understanding Unicode

David Goodger set himself the possibly ambitious goal of dispersing unicodaphobia, which he defined as fear of Unicode and all the words surrounding it. Unicode is intended to be a cross-platform standard applicable to any program and any language (possibly even including Dwarvish and Klingon in the future). Its character repertoire is kept synchronized with ISO/IEC 10646. What follows is my attempt to record my developing understanding during the presentation.

The basic element in Unicode is the character: a character set takes a subset of the characters and maps each one to a code point, a numeric value associated (in that particular character set) with that character. There are a number of 8-bit character sets that all associate the ASCII characters with the code points from 0 to 127 but map completely different characters onto the upper half of the code points (128 to 255).

Many languages, unlike those written in Western alphabets, have large character sets that cannot be mapped onto only 256 code points.
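As a rough illustration of that last point (a Python 3 sketch; the three codec names are my own picks from the standard library, not examples from the talk), the same upper-half byte decodes to a different character in each character set, while ASCII bytes decode identically:

```python
# Three 8-bit character sets available as standard-library codecs
# (chosen for illustration only).
CHARSETS = ["latin-1", "iso8859-7", "cp1251"]

ascii_bytes = b"Python"   # every byte value is below 128
high_byte = b"\xe9"       # one byte from the upper half (value 233)

# The lower half is shared: ASCII bytes decode to the same text everywhere.
lower_half = {name: ascii_bytes.decode(name) for name in CHARSETS}

# The upper half differs: the same byte is a different character in each set.
upper_half = {name: high_byte.decode(name) for name in CHARSETS}

for name in CHARSETS:
    print(name, lower_half[name], upper_half[name])
```

Here byte 233 comes out as a Latin, a Greek, and a Cyrillic letter respectively, which is why bytes alone are ambiguous unless you know the character set.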

An encoding is a specification of how the code points of a character set are represented as bytes. An encoding is performed by a codec. Encoding a Unicode text maps each of its characters to one or more bytes in an external representation. Decoding is the converse process. In a Python program the internal representation might use 16 or 32 bits per character, depending on the build configuration.
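A minimal sketch of encoding and decoding, assuming Python 3 (where `str` is the Unicode text type and `bytes` is the external representation); the sample word is my own choice:

```python
text = "Espa\u00f1a"  # "España": six characters

# Encoding: characters -> bytes. The codec decides how many bytes each takes.
utf8_bytes = text.encode("utf-8")      # the n-with-tilde becomes two bytes
latin1_bytes = text.encode("latin-1")  # the n-with-tilde becomes one byte

# Decoding: bytes -> characters. It must use the codec that did the encoding.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_latin1 = latin1_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(latin1_bytes))
```

The character count stays the same while the byte count depends on the codec, which is the distinction the paragraph above is drawing.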

Whatever the internal representation of the Unicode text, text that is read in must be decoded from its external representation to the internal representation, and when a Unicode string is written out it must be encoded into a suitable encoding (but not necessarily the same one that was used to read it in). ASCII is an encoding, as are Latin-1 through Latin-N and Windows-1252; all these encodings are ASCII-compatible because each code point is a single byte, with the ASCII set on code points 0 through 127. UCS-4 and UTF-16 are encodings capable of representing all Unicode characters; UCS-2 is limited to the 65,536 code points of the Basic Multilingual Plane. Because they use multi-byte code units, none of the three is ASCII-compatible.
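The read-in and write-out steps can be sketched like this in Python 3 (the byte string stands in for a file's contents, and the choice of Latin-1 in and UTF-16 out is an assumption made purely for illustration):

```python
# Bytes as they might arrive from a file written by a Latin-1 program.
incoming = b"Am\xe9rica"

# Reading in: decode external bytes to the internal Unicode string.
text = incoming.decode("latin-1")

# Writing out: encode again, here into a different encoding than we read.
outgoing = text.encode("utf-16-le")

# ASCII-compatible encodings leave plain ASCII text byte-for-byte unchanged.
ascii_compatible = {
    name: "abc".encode(name) == b"abc"
    for name in ("ascii", "latin-1", "cp1252", "utf-16-le")
}
print(ascii_compatible)
```

Only the UTF-16 entry comes out false, because every character occupies at least two bytes there.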

The new default encoding is UTF-8, which is ASCII-compatible because the characters of the ASCII set still map onto code points represented as single bytes with values of 0 through 127. It is capable of representing all Unicode characters, because it uses multi-byte representations for the non-ASCII characters (2 to 4 bytes under the current definition in RFC 3629; the original design allowed up to 6), and none of the bytes in those representations is in the range 0 through 127. The first byte of a non-ASCII character is always 0xC0 or above (0xC2 to 0xF4 under RFC 3629, up to 0xFD in the original design), and every subsequent byte is in the range 0x80 to 0xBF.
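To make the byte ranges concrete, here is a small Python 3 sketch; `utf8_byte_kinds` is a hypothetical helper of my own, not something from the talk or the standard library:

```python
def utf8_byte_kinds(text):
    """Classify each byte of the UTF-8 encoding of text (illustrative helper)."""
    kinds = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            kinds.append("ascii")          # single-byte ASCII character
        elif byte <= 0xBF:
            kinds.append("continuation")   # 0x80..0xBF: never starts a character
        else:
            kinds.append("lead")           # 0xC0 and up: starts a multi-byte character
    return kinds

# One sample character for each encoded length that Python will produce.
samples = ["A", "\u00f1", "\u20ac", "\U0001d11e"]
lengths = [len(s.encode("utf-8")) for s in samples]

print(lengths)
print(utf8_byte_kinds("a\u00f1"))
```

Because lead and continuation bytes occupy disjoint ranges, a decoder can find character boundaries from any position in the byte stream.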

For this student the main failing of the presentation was the complete absence of any graphical materials. Some of us understand pictures better than just words. I was left with the feeling that my understanding of Unicode had increased, but I couldn't necessarily say I can proceed confidently from here on in. I wish the presentation hadn't overrun its time, as the noise of delegates departing early distracted from the later slides.

6 comments:

Anonymous said...

Thanks for the summary, Steve. I have a better understanding of character encoding now, although I'm in the same boat: some supplemental materials would be great.

Anonymous said...

The basic element of Unicode isn't a character, but a code point.

Nor are character sets fundamental to Unicode--Unicode is just one big set of code points,
and encodings map those Platonic code points to various character sets.

At least as I understand it...

Steve said...

I was trying to log what David said as accurately as I could, rather than correct him on the fly, and given the deadline I'd given myself (to publish before the next session) I had no chance to seek clarification from him - the slides for his talk weren't available at the time.

Technically I believe each Unicode character has a name (such as "PERCENT SIGN") and a code point (such as U+0025) and that a character set maps a (sub)set of Unicode characters to their representations. I had to cover Unicode in Python Web Programming, so owners of that book can check my understanding there.
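For what it's worth, the name and code point can be checked from Python 3 with the standard `unicodedata` module (a quick sketch, not an extract from the book):

```python
import unicodedata

ch = "%"

name = unicodedata.name(ch)          # the character's official Unicode name
code_point = f"U+{ord(ch):04X}"      # its code point in the usual notation

# The same character has a different byte representation in each encoding.
representations = {enc: ch.encode(enc) for enc in ("ascii", "utf-8", "utf-16-be")}

print(name, code_point, representations)
```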

However, for those wanting to study these things in more detail I can offer two references: Unicode Secrets by Uche Ogbuji is useful for its list of resources, one of which is Unicode for Programmers by Jason Orendorff.

Anonymous said...

Thanks for your article.
I work in a Latin American country, where Unicode has caused me more than one headache and legal problems.
Frequently I receive many ASCII files to process, and many compiled programs (specifically .NET ones) that read these files assume Unicode, so when I read, for example, "España" or "América" I "dangerously" get "Espa?a" or "Am?rica" as output.
I mean, the interchange of information between ASCII and Unicode files is really critical. Do you know what the solution is? How can I avoid worrying about whether a file is in ASCII or Unicode format,
and be sure that the conversion between ASCII and Unicode is done correctly?
Many, many thanks!!

Steve said...

Your thanks, of course, should principally go to David Goodger, who took the time to prepare and present the session. We should be grateful to all the PyCon speakers. It takes a lot of work to put a short presentation together.

It seems that the only way to avoid worries about whether a file is in ASCII or Unicode would be for everyone to use Unicode all the time. There are some strategies that you can use to determine the incoming representation, but they aren't guaranteed to work, so you can always end up making the wrong decision.
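One such strategy, sketched in Python 3 purely as an assumption on my part (the `guess_decode` helper and the Latin-1 fallback are hypothetical choices, not a recommendation from the talk): try strict UTF-8 first, and fall back to an 8-bit encoding if that fails.

```python
def guess_decode(data, fallback="latin-1"):
    """Hypothetical helper: return (text, guessed_encoding) for raw bytes."""
    try:
        # Strict UTF-8 rejects most non-UTF-8 input; "utf-8-sig" also
        # strips a leading byte order mark if one is present.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this never fails,
        # but it may silently produce the wrong characters.
        return data.decode(fallback), fallback

word = "Espa\u00f1a"
from_utf8 = guess_decode(word.encode("utf-8"))
from_latin1 = guess_decode(word.encode("latin-1"))

# A wrong guess: Cyrillic text stored as cp1251 is "successfully" misread.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"
misread, guessed = guess_decode(russian.encode("cp1251"))

print(from_utf8, from_latin1, guessed, misread == russian)
```

The last case shows the limitation: no error is raised, yet the decoded text is wrong, so a guess of this kind is only as good as your knowledge of where the file came from.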

Marc André Lemburg and Martin von Löwis are the people whose names I tend to regard as authoritative in this area for the Python world. It's probably worth Googling for those names and the word "unicode" to see what's available in the Python world.