David Goodger set himself the possibly ambitious goal of dispersing unicodaphobia, which he defined as fear of Unicode and all the words surrounding it. Unicode is intended to be a cross-platform standard applicable to any program and any language (possibly even including Dwarvish and Klingon in the future). The standard is maintained by the Unicode Consortium, and its character repertoire is kept in step with ISO/IEC 10646. What follows is my attempt to record my developing understanding during the presentation.
The basic element in Unicode is the character. A character set takes a subset of the characters and maps each one to a code point, a numeric value associated (in that particular character set) with that character. A number of 8-bit character sets all associate the ASCII characters with code points 0 through 127 but map completely different characters onto the upper half of the code points (128 through 255).
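As a small illustration of the point about 8-bit character sets, the sketch below (assuming Python 3 and its standard `latin-1` and `iso-8859-5` codecs, which were not part of the talk as far as I recall) decodes the same byte values under two different sets. The variable names are my own.

```python
# ord() and chr() convert between a character and its Unicode code point.
assert ord("A") == 65
assert chr(65) == "A"

# A byte in the lower half (0-127) means the same thing in both character sets.
lower_byte = bytes([0x41])
lower_latin1 = lower_byte.decode("latin-1")       # "A"
lower_cyrillic = lower_byte.decode("iso-8859-5")  # "A"

# A byte in the upper half (128-255) maps to a different character in each set.
upper_byte = bytes([0xE9])
as_latin1 = upper_byte.decode("latin-1")        # U+00E9, e with acute accent
as_cyrillic = upper_byte.decode("iso-8859-5")   # U+0449, Cyrillic shcha

print(hex(ord(as_latin1)), hex(ord(as_cyrillic)))
```

The byte alone is ambiguous; only the choice of character set says which character it stands for.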
Unlike the Western alphabets, the writing systems of many other languages have large character sets that cannot be mapped onto only 256 code points.
An encoding is a specification of how the code points of a character set are represented as bytes. Encoding is performed by a codec. Encoding a Unicode text maps each of its characters into one or more bytes in an external representation; decoding is the converse process. In a Python program the internal representation might use 16 or 32 bits per character, depending on the build configuration.
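A minimal sketch of the encode/decode round trip, assuming Python 3, where `str` holds Unicode text and `bytes` holds an encoded representation. The sample string is my own choice, not one from the presentation.

```python
text = "naïve café"  # 10 characters, two of them non-ASCII

# Encoding: each character becomes one or more bytes.
encoded = text.encode("utf-8")

# Decoding: the converse, from bytes back to characters.
decoded = encoded.decode("utf-8")

assert isinstance(encoded, bytes)
assert decoded == text

# The two accented characters take two bytes each, so the byte count is larger.
print(len(text), len(encoded))
```

The character count and the byte count differ as soon as any character needs more than one byte, which is why the two types are kept distinct.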
Whatever the internal representation of the Unicode text, text that is read in must be decoded from its external representation to the internal representation, and when a Unicode string is written out it must be encoded into a suitable encoding (but not necessarily the same one that was used to read it in). ASCII is an encoding, as are Latin-1 through Latin-N and Windows-1252; all these encodings are ASCII-compatible because each character is represented by a single byte, with the ASCII set on byte values 0 through 127. UCS-4 and UTF-16 are encodings capable of representing all Unicode characters; UCS-2 uses a fixed two bytes per character and so can represent only the first 65,536 code points. Because these three use multi-byte code units, none of them is ASCII-compatible.
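The following sketch, again assuming Python 3 and its standard codec names, illustrates three of the points above: ASCII text is byte-for-byte identical in the ASCII-compatible encodings but not in UTF-16, text can be read in one encoding and written in another, and a character beyond the first 65,536 code points needs two 16-bit units in UTF-16. The sample data is hypothetical.

```python
ascii_text = "Hello"

# ASCII-compatible encodings all produce the same single bytes for ASCII text.
single_byte_forms = {
    name: ascii_text.encode(name)
    for name in ("ascii", "latin-1", "cp1252", "utf-8")
}

# UTF-16 uses two bytes per character here, so it is not ASCII-compatible.
utf16_form = ascii_text.encode("utf-16-le")

# Read in one encoding, write out in another.
latin1_bytes = b"caf\xe9"              # external data, known to be Latin-1
text = latin1_bytes.decode("latin-1")  # decode to the internal representation
utf8_bytes = text.encode("utf-8")      # encode for output as UTF-8

# A character outside the first 65,536 code points: UTF-16 needs a
# surrogate pair (two 16-bit units), which plain UCS-2 cannot express.
high_char = "\U0001F600"
utf16_high = high_char.encode("utf-16-le")

print(single_byte_forms["utf-8"], utf16_form, utf8_bytes, len(utf16_high))
```

Decoding with the wrong codec would not necessarily raise an error, so the external encoding has to be known or declared rather than guessed.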
The new default encoding is UTF-8, which is ASCII-compatible because the characters of the ASCII set still map onto single bytes with values of 0 through 127. It is capable of representing all Unicode characters because it uses multi-byte representations for the non-ASCII characters (up to 6 bytes in the original design; RFC 3629 later limited this to 4), and none of the bytes in those multi-byte representations is in the range 0 through 127. The first byte of a non-ASCII character is always in the range 0xC0 to 0xFD (0xC2 to 0xF4 under the four-byte limit), and every subsequent byte is in the range 0x80 to 0xBF.
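To make the byte ranges concrete, here is a small sketch (Python 3 assumed) that labels each byte of a UTF-8 encoding using only the ranges described above. The function name `classify_utf8_bytes` is hypothetical, and the code does not validate its input; it only classifies bytes that Python's own encoder has produced.

```python
def classify_utf8_bytes(text):
    """Label each byte of text's UTF-8 encoding by the range it falls in."""
    labels = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            labels.append("ascii")         # 0x00-0x7F: a complete ASCII character
        elif byte < 0xC0:
            labels.append("continuation")  # 0x80-0xBF: second or later byte
        else:
            labels.append("lead")          # 0xC0 and up: first byte of a sequence
    return labels


# "a" is one ASCII byte; the euro sign (U+20AC) encodes as E2 82 AC.
labels = classify_utf8_bytes("a\u20ac")
print(labels)
```

Because the three ranges never overlap, a reader of a UTF-8 byte stream can tell from any single byte whether it starts a character, continues one, or is plain ASCII.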
For this student the main failing of the presentation was the complete absence of any graphical materials. Some of us understand pictures better than just words. I was left with the feeling that my understanding of Unicode had increased, but I couldn't necessarily say that I could proceed confidently from here on in. I wish the presentation hadn't overrun its time, as the noise of delegates departing early distracted from the later slides.