Python Topics : Unicode & Character Encodings
What's a Character Encoding?

ASCII encoding encompasses

  • lowercase English letters: a through z
  • uppercase English letters: A through Z
  • some punctuation and symbols: includes "$" and "!"
  • whitespace characters: an actual space (" "), a newline, carriage return, horizontal tab, vertical tab, and a few others
  • some non-printable characters: characters such as backspace, "\b", which can't be printed literally

each single character has a corresponding code point
characters are segmented into different ranges within the ASCII table

Code Point Range Class
0 through 31 Control/non-printable characters
32 through 64 Punctuation, symbols, numbers, and space
65 through 90 Uppercase English alphabet letters
91 through 96 Additional graphemes, such as [ and \
97 through 122 Lowercase English alphabet letters
123 through 126 Additional graphemes, such as { and |
127 Control/non-printable character (DEL)
the entire ASCII table contains 128 characters
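a quick way to see a couple of those ranges, assuming a standard Python 3 REPL (the range endpoints come from the table above):
>>> # code points 65 through 90 are the uppercase letters
>>> "".join(chr(cp) for cp in range(65, 91))
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> # code points 97 through 122 are the lowercase letters
>>> "".join(chr(cp) for cp in range(97, 123))
'abcdefghijklmnopqrstuvwxyz'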

The String Module
string constants
# From lib/python3.7/string.py

whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
can use these constants for everyday string manipulation
>>> import string

>>> s = "What's wrong with ASCII?!?!?"
>>> s.rstrip(string.punctuation)
"What's wrong with ASCII"
A Bit of a Refresher
a bit is a signal that has only two possible states
different ways of symbolically representing a bit that all mean the same thing
  • 0 or 1
  • 'yes' or 'no'
  • True or False
  • 'on' or 'off'
binary versions of 0 through 10 in decimal
Decimal Binary (Compact) Binary (Padded Form)
0 0 00000000
1 1 00000001
2 10 00000010
3 11 00000011
4 100 00000100
5 101 00000101
6 110 00000110
7 111 00000111
8 1000 00001000
9 1001 00001001
10 1010 00001010
a handy way to represent ASCII strings as sequences of bits in Python
each character from the ASCII string gets pseudo-encoded into 8 bits
spaces in between the 8-bit sequences that each represent a single character
>>> def make_bitseq(s: str) -> str:
...     if not s.isascii():
...         raise ValueError("ASCII only allowed")
...     return " ".join(f"{ord(i):08b}" for i in s)

>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'

>>> make_bitseq("CAPS")
'01000011 01000001 01010000 01010011'

>>> make_bitseq("$25.43")
'00100100 00110010 00110101 00101110 00110100 00110011'

>>> make_bitseq("~5")
'01111110 00110101'
the f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language
is a way of specifying formatting for replacement fields in format strings
  • the left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output
    using the Python ord() function gives the base-10 code point for a single str character
  • the right hand side of the colon is the format specifier
    08 means width 8, 0 padded
    the b tells Python to output the resulting number in base 2 (binary)
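the same mini-language handles other bases too; a small illustrative sketch (the character "a" / code point 97 is just an arbitrary example):
>>> f"{ord('a'):08b}"   # base 2, zero-padded to width 8
'01100001'
>>> f"{ord('a'):x}"     # base 16
'61'
>>> f"{ord('a'):o}"     # base 8
'141'
>>> f"{ord('a'):d}"     # plain base 10
'97'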
We Need More Bits!
given a number of bits, n, the number of distinct possible values that can be represented in n bits is 2**n
  • 1 bit : 2**1 == 2 possible values
  • 8 bits : 2**8 == 256 possible values
  • 64 bits : 2**64 == 18,446,744,073,709,551,616 possible values
what we're really solving for is n in the equation 2**n = x, where x is already known
from math import ceil, log

def n_bits_required(nvalues: int) -> int:
    return ceil(log(nvalues) / log(2))

>>> n_bits_required(256)
8
need to use a ceiling in n_bits_required() to account for values that are not clean powers of 2
need to store a character set of 110 characters total
this should take log(110) / log(2) == 6.781 bits
no such thing as 0.781 bits
110 values will require 7 bits
>>> n_bits_required(110)
7
Covering All the Bases - Other Number Systems
common numbering systems
  • Binary: base 2
  • Octal: base 8
  • Hexadecimal (hex): base 16
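Python has integer literal prefixes and int() bases for each of these; a minimal sketch (the digits "11" are an arbitrary example):
>>> 0b11, 0o11, 0x11          # "11" read as binary, octal, and hex literals
(3, 9, 17)
>>> int("11", 2), int("11", 8), int("11", 16)
(3, 9, 17)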
Enter Unicode

the problem with ASCII is that it's not nearly a big enough set of characters to accommodate the world's set of languages, dialects, symbols, and glyphs
Unicode has 1,114,112 possible code points
ASCII is a perfect subset of Unicode
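a small, illustrative check of that subset relationship:
>>> "a".encode("ascii") == "a".encode("utf-8")   # ASCII text encodes identically either way
True
>>> ord("a"), hex(ord("🤨"))                     # Unicode code points run far past ASCII's 0-127
(97, '0x1f928')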

Unicode vs UTF-8
Unicode is an abstract encoding standard, not an encoding
UTF-8, UTF-16 and UTF-32 are encoding formats for representing Unicode characters as binary data of one or more bytes per character

Encoding and Decoding in Python 3
str type represents human-readable text, can contain any Unicode character
bytes type represents binary data
encode str type to bytes type
decode bytes type to str type
>>> "résumé".encode("utf-8")
b'r\xc3\xa9sum\xc3\xa9'
>>> "El Niño".encode("utf-8")
b'El Ni\xc3\xb1o'

>>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
'résumé'
>>> b"El Ni\xc3\xb1o".decode("utf-8")
'El Niño'
\xc3\xb1 are the two bytes representing the ñ

Python 3 - All-in on Unicode
  • Python 3 source code is assumed to be UTF-8 by default
  • all text (str) is Unicode by default
    str type can contain any literal Unicode character
  • Python 3 accepts many Unicode code points in identifiers
    résumé = "~/Documents/resume.pdf" is a valid assignment
  • re module defaults to the re.UNICODE flag rather than re.ASCII
    r"\w" matches Unicode word characters, not just ASCII letters
  • default encoding in str.encode() and bytes.decode() is UTF-8
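a minimal sketch of the re behaviour mentioned above (the sample string is arbitrary):
>>> import re
>>> re.findall(r"\w+", "résumé für Ñoño")                  # Unicode-aware by default
['résumé', 'für', 'Ñoño']
>>> re.findall(r"\w+", "résumé für Ñoño", flags=re.ASCII)  # only ASCII word characters match
['r', 'sum', 'f', 'r', 'o', 'o']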
the default encoding for the built-in open() is platform-dependent and depends on the value of locale.getpreferredencoding()
>>> # Mac OS X High Sierra
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

>>> # Windows Server 2012; other Windows builds may use UTF-16
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
make no assumptions
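the safe habit is to pass encoding explicitly rather than relying on the platform default; a minimal sketch (the file name here is hypothetical):
>>> with open("cafe.txt", "w", encoding="utf-8") as f:   # hypothetical file
...     f.write("café")
...
4
>>> open("cafe.txt", encoding="utf-8").read()
'café'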

One Byte, Two Bytes, Three Bytes, Four
a crucial feature is that UTF-8 is a variable-length encoding
ASCII encoding only requires one byte per character
a Unicode character can occupy between 1 and 4 bytes
>>> ibrow = "🤨"
>>> len(ibrow)
1
>>> ibrow.encode("utf-8")
b'\xf0\x9f\xa4\xa8'
>>> len(ibrow.encode("utf-8"))
4

>>> # Calling list() on a bytes object gives you
>>> # the decimal value for each byte
>>> list(b'\xf0\x9f\xa4\xa8')
[240, 159, 164, 168]
subtle but important feature of len()
  • the length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies
  • the length of the same character encoded to bytes will be anywhere between 1 and 4
What About UTF-16 and UTF-32?
two variations of decoding the same bytes object may produce results which aren't even in the same language
encoding four Greek letters with UTF-8 and then decoding them back to text with UTF-16 produces a str that is in a completely different language (Korean)
>>> letters = "αβγδ"
>>> rawdata = letters.encode("utf-8")
>>> rawdata.decode("utf-8")
'αβγδ'
>>> rawdata.decode("utf-16") 
'뇎닎돎듎'
the range or number of bytes under UTF-8, UTF-16, and UTF-32
Encoding    Bytes per Char    Variable Length
UTF-8       1 to 4            yes
UTF-16      2 to 4            yes
UTF-32      4                 no
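a quick sanity check of the table, using the "-le" codec variants so that no byte order mark gets prepended (the Greek letters are just a convenient sample):
>>> letters = "αβγδ"
>>> len(letters.encode("utf-8"))       # 2 bytes per Greek letter here
8
>>> len(letters.encode("utf-16-le"))   # also 2 bytes each for these characters
8
>>> len(letters.encode("utf-32-le"))   # always 4 bytes per character
16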

Python's Built-in Functions
Python has a group of built-in functions that relate in some way to numbering systems and character encoding
can be logically grouped together based on their purpose
ascii(), bin(), hex(), oct()
  • each obtains a different representation of an input, and each one produces a str
  • ascii() produces an ASCII-only representation of an object, with non-ASCII characters escaped
  • the other three give binary, hexadecimal, and octal representations of an integer, respectively
  • these results are only representations, not a fundamental change in the input

bytes(), str(), int()
  • class constructors for their respective types
  • each offers ways of coercing the input into the desired type

ord(), chr()
  • inverses of each other
  • ord() converts a str character to its base-10 code point; chr() does the opposite
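a short, illustrative tour of those built-ins in the REPL (values chosen arbitrarily):
>>> ascii("résumé")                  # non-ASCII characters get escaped
"'r\\xe9sum\\xe9'"
>>> bin(97), oct(97), hex(97)
('0b1100001', '0o141', '0x61')
>>> int("0x61", 16)                  # coerce a hex string back to an integer
97
>>> ord("a"), chr(97)                # inverses of each other
(97, 'a')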
Python String Literals
six ways that Python will allow entering the same Unicode character
Escape Sequence    Meaning                                              How to express "a"
(none)             plain literal, using the str constructor             "a"
"\ooo"             character with octal value ooo                       "\141"
"\xhh"             character with hex value hh                          "\x61"
"\N{name}"         character named name in the Unicode database         "\N{LATIN SMALL LETTER A}"
"\uxxxx"           character with 16-bit (2-byte) hex value xxxx        "\u0061"
"\Uxxxxxxxx"       character with 32-bit (4-byte) hex value xxxxxxxx    "\U00000061"

two main caveats

  1. not all of these forms work for all characters
    the hex representation of the integer 300 is 0x012c, which isn't going to fit into the 2-hex-digit escape code "\xhh"
    the highest code point that can be squeezed into this escape sequence is "\xff" ("ÿ")
    similarly for "\ooo", it will only work up to "\777" ("ǿ") (see the short check below)
  2. for \xhh, \uxxxx, and \Uxxxxxxxx exactly as many digits are required as are shown in these examples
    the way that Unicode tables conventionally display the codes for characters is with a leading U+ and variable number of hex characters
    the key is that Unicode tables most often do not zero-pad these codes
the "\Uxxxxxxxx" form is the only escape sequence that is capable of holding any Unicode character

Other Encodings Available in Python
many other encoding schemes
Latin-1 (also called ISO-8859-1) which is technically the default for HTTP
Windows has its own Latin-1 variant called cp1252
complete list of accepted encodings is in the documentation for the codecs module

to quickly get the escaped Unicode literal form of a str, encode it with "unicode-escape"

>>> alef = chr(1575)  # Or "\u0627"
>>> alef_hamza = chr(1571)  # Or "\u0623"
>>> alef, alef_hamza
('ا', 'أ')
>>> alef.encode("unicode-escape")
b'\\u0627'
>>> alef_hamza.encode("unicode-escape")
b'\\u0623'
You Know What They Say About Assumptions ...
Python assumes UTF-8 for its own source code and for the encode()/decode() defaults
you should not blindly operate with the same assumption for external data
>>> data = b"\xbc cup of flour"
>>> data.decode("utf-8")
Traceback (most recent call last):
  File "", line 1, in 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte
>>> data.decode("latin-1")
'¼ cup of flour'
Odds and Ends: unicodedata
do lookups on the Unicode Character Database (UCD)
>>> import unicodedata

>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'