Unicode & Character Encodings

What's a Character Encoding?

The String Module
A Bit of a Refresher
We Need More Bits!

ASCII encoding encompasses

lowercase English letters	a through z
uppercase English letters	A through Z
digits	0 through 9
some punctuation and symbols	includes "$" and "!"
whitespace characters	an actual space (" "), a newline, carriage return, horizontal tab, vertical tab, and a few others
some non-printable characters	characters such as backspace "\b" which can't be printed literally

each single character has a corresponding code point
characters are segmented into different ranges within the ASCII table

Code Point Range	Class
0 through 31	Control/non-printable characters
32 through 64	Punctuation, symbols, numbers, and space
65 through 90	Uppercase English alphabet letters
91 through 96	Additional graphemes, such as `[` and `\`
97 through 122	Lowercase English alphabet letters
123 through 126	Additional graphemes, such as `{` and `|`
127	Control/non-printable character (`DEL`)

the entire ASCII table contains 128 characters
the ASCII table displays the complete ASCII character set
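
The ranges in the table can be checked with ord() and chr(). This is a minimal sketch; classify() is a hypothetical helper written for these notes, not something from the standard library.

```python
def classify(ch: str) -> str:
    """Return the ASCII table class for a single ASCII character."""
    cp = ord(ch)  # base-10 code point
    if cp > 127:
        raise ValueError("not an ASCII character")
    if cp <= 31 or cp == 127:
        return "control"
    if cp <= 64:
        return "punctuation/symbol/digit/space"
    if 65 <= cp <= 90:
        return "uppercase"
    if 97 <= cp <= 122:
        return "lowercase"
    return "additional grapheme"  # 91-96 and 123-126

print(ord("A"), chr(97))   # 65 a
print(classify("\t"))      # control
print(classify("$"))       # punctuation/symbol/digit/space
print(classify("{"))       # additional grapheme
```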

The String Module

string constants

# From lib/python3.7/string.py

whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace

can use these constants for everyday string manipulation

>>> import string

>>> s = "What's wrong with ASCII?!?!?"
>>> s.rstrip(string.punctuation)
"What's wrong with ASCII"
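
Because the constants are ordinary str objects, membership tests against them are plain substring checks. A small sketch along those lines (the variable names are only illustrative):

```python
import string

s = "What's wrong with ASCII?!?!?"

# Each `ch in constant` test is True or False; sum() counts the True values.
n_punct = sum(ch in string.punctuation for ch in s)
n_upper = sum(ch in string.ascii_uppercase for ch in s)
all_printable = all(ch in string.printable for ch in s)

print(n_punct, n_upper, all_printable)  # 6 6 True
```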

A Bit of a Refresher

a bit is a signal that has only two possible states
different ways of symbolically representing a bit that all mean the same thing

0 or 1
'yes' or 'no'
True or False
'on' or 'off'

binary versions of 0 through 10 in decimal

Decimal	Binary (Compact)	Binary (Padded Form)
0	0	00000000
1	1	00000001
2	10	00000010
3	11	00000011
4	100	00000100
5	101	00000101
6	110	00000110
7	111	00000111
8	1000	00001000
9	1001	00001001
10	1010	00001010
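
The table above can be regenerated with format specifiers. This is a sketch; binary_rows() is a hypothetical helper name.

```python
def binary_rows(stop: int = 11) -> list:
    # "b" gives the compact binary form; "08b" zero-pads it to width 8.
    return [(n, f"{n:b}", f"{n:08b}") for n in range(stop)]

for decimal, compact, padded in binary_rows():
    print(decimal, compact, padded, sep="\t")
```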

a handy way to represent ASCII strings as sequences of bits in Python
each character from the ASCII string gets pseudo-encoded into 8 bits
spaces in between the 8-bit sequences that each represent a single character

>>> def make_bitseq(s: str) -> str:
...     if not s.isascii():
...         raise ValueError("ASCII only allowed")
...     return " ".join(f"{ord(i):08b}" for i in s)

>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'

>>> make_bitseq("CAPS")
'01000011 01000001 01010000 01010011'

>>> make_bitseq("$25.43")
'00100100 00110010 00110101 00101110 00110100 00110011'

>>> make_bitseq("~5")
'01111110 00110101'

the f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language
is a way of specifying formatting for replacement fields in format strings

the left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output
using the Python ord() function gives the base-10 code point for a single str character
the right hand side of the colon is the format specifier
08 means width 8, 0 padded
the b functions as a sign to output the resulting number in base 2 (binary)
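
The same format specifier also works with the built-in format(), and int(..., 2) undoes it. The sketch below repeats make_bitseq() from above and adds from_bitseq(), a hypothetical inverse that is not part of the original notes.

```python
def make_bitseq(s: str) -> str:
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(i):08b}" for i in s)

def from_bitseq(bits: str) -> str:
    """Inverse of make_bitseq(): parse each 8-bit group as a base-2 integer."""
    return "".join(chr(int(group, 2)) for group in bits.split())

assert from_bitseq(make_bitseq("bits")) == "bits"
print(format(ord("b"), "08b"))   # 01100010
print(format(ord("b"), "#x"))    # 0x62
```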

We Need More Bits!

given a number of bits n, the number of distinct possible values that can be represented in n bits is 2ⁿ

1 bit : 2¹ == 2 possible values
8 bits : 2⁸ == 256 possible values
64 bits : 2⁶⁴ == 18,446,744,073,709,551,616 possible values

the goal is to solve for n in the equation 2ⁿ = x, where x is already known

from math import ceil, log

def n_bits_required(nvalues: int) -> int:
    return ceil(log(nvalues) / log(2))

n_bits_required(256)  # 8

need to use a ceiling in n_bits_required() to account for values that are not clean powers of 2
need to store a character set of 110 characters total
this should take log(110) / log(2) ≈ 6.781 bits
no such thing as 0.781 bits
110 values will require 7 bits

>>> n_bits_required(110)
7
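
Dividing two floating-point logarithms can be off by one for some inputs because of rounding. An integer-only alternative is sketched below, using int.bit_length(). The name n_bits_required_int() is hypothetical and this is not the method used in the notes above.

```python
def n_bits_required_int(nvalues: int) -> int:
    """Bits needed to represent nvalues distinct values, using integers only."""
    if nvalues < 1:
        raise ValueError("nvalues must be positive")
    # nvalues distinct values can be numbered 0 .. nvalues - 1, and
    # bit_length() is the number of bits needed for the largest of them.
    return (nvalues - 1).bit_length()

print(n_bits_required_int(110))  # 7
print(n_bits_required_int(256))  # 8
```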

Covering All the Bases - Other Number Systems

common numbering systems

Binary: base 2
Octal: base 8
Hexadecimal (hex): base 16
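
Python has a literal prefix for each of these bases, and int() can parse a string in any of them. A short sketch, using eleven as the example value:

```python
# The same integer, eleven, written as literals in four bases.
values = [11, 0b1011, 0o13, 0xb]
assert len(set(values)) == 1

# int() parses a string in a given base; bin(), oct(), hex() go the other way.
print(int("1011", base=2), int("13", base=8), int("b", base=16))  # 11 11 11
print(bin(11), oct(11), hex(11))  # 0b1011 0o13 0xb
```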

Enter Unicode

Unicode vs UTF-8
Encoding and Decoding in Python 3
Python 3 - All-in on Unicode
One Byte, Two Bytes, Three Bytes, Four
What About UTF-16 and UTF-32?

the problem with ASCII is that it's not nearly a big enough set of characters to accommodate the world's set of languages, dialects, symbols, and glyphs
Unicode has 1,114,112 possible code points
ASCII is a perfect subset of Unicode
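
Both claims can be checked from Python. The sketch below assumes CPython 3.3 or later, where sys.maxunicode is always the last code point.

```python
import sys

# Code points run from U+0000 through U+10FFFF inclusive.
assert sys.maxunicode == 0x10FFFF
assert sys.maxunicode + 1 == 1_114_112   # 17 planes * 65,536 code points

# The first 128 code points are the ASCII characters, and they
# encode to the same bytes under ASCII and UTF-8.
ascii_chars = "".join(chr(cp) for cp in range(128))
assert ascii_chars.isascii()
assert ascii_chars.encode("ascii") == ascii_chars.encode("utf-8")

# chr() rejects anything past the last code point.
try:
    chr(0x110000)
except ValueError as exc:
    print(exc)
```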

Unicode vs UTF-8

Unicode is an abstract encoding standard, not an encoding
UTF-8, UTF-16 and UTF-32 are encoding formats for representing Unicode characters as binary data of one or more bytes per character

Encoding and Decoding in Python 3

str type represents human-readable text, can contain any Unicode character
bytes type represents binary data
encode str type to bytes type
decode bytes type to str type

>>> "résumé".encode("utf-8")
b'r\xc3\xa9sum\xc3\xa9'
>>> "El Niño".encode("utf-8")
b'El Ni\xc3\xb1o'

>>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
'résumé'
>>> b"El Ni\xc3\xb1o".decode("utf-8")
'El Niño'

\xc3\xb1 are the two bytes representing the ñ
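
Encoding can fail when the target encoding has no representation for a character. The sketch below shows the round trip and two of the standard error handlers accepted by the errors parameter of str.encode().

```python
text = "El Niño"
raw = text.encode("utf-8")
assert raw.decode("utf-8") == text          # round trip

# ASCII has no code for "ñ", so a strict encode fails...
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)               # UnicodeEncodeError

# ...unless an error handler is given.
print(text.encode("ascii", errors="replace"))            # b'El Ni?o'
print(text.encode("ascii", errors="backslashreplace"))   # b'El Ni\\xf1o'
```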

Python 3 - All-in on Unicode

Python 3 source code is assumed to be UTF-8 by default
all text (str) is Unicode by default
str type can contain any literal Unicode character
Python 3 accepts many Unicode code points in identifiers
```
résumé = "~/Documents/resume.pdf"
```
is valid
re module defaults to the re.UNICODE flag rather than re.ASCII
r"\w" matches Unicode word characters, not just ASCII letters
default encoding in str.encode() and bytes.decode() is UTF-8
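
The difference between the default Unicode matching and re.ASCII shows up as soon as the text contains accented letters. A small sketch:

```python
import re

text = "résumé niño 123"

unicode_words = re.findall(r"\w+", text)          # default for str patterns
ascii_words = re.findall(r"\w+", text, re.ASCII)  # \w is [a-zA-Z0-9_] only

print(unicode_words)  # ['résumé', 'niño', '123']
print(ascii_words)    # ['r', 'sum', 'ni', 'o', '123']
```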

the default encoding to the built-in open() is platform-dependent and depends on the value of locale.getpreferredencoding()

>>> # Mac OS X High Sierra
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

>>> # Windows Server 2012; other Windows builds may use UTF-16
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

make no assumptions
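
One way to avoid relying on the platform default is to pass encoding= explicitly to open(). The sketch below writes to a throwaway temporary file; the file name is arbitrary.

```python
import os
import tempfile

text = "El Niño"
path = os.path.join(tempfile.mkdtemp(), "nino.txt")  # throwaway file

# Pass encoding= on both sides so the result does not depend on the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    assert f.read() == text

# Reading raw bytes shows what was actually written to disk.
with open(path, "rb") as f:
    print(f.read())  # b'El Ni\xc3\xb1o'
```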

One Byte, Two Bytes, Three Bytes, Four

a crucial feature is that UTF-8 is a variable-length encoding
ASCII encoding only requires one byte per character
a Unicode character encoded as UTF-8 can occupy between 1 and 4 bytes

>>> ibrow = "🤨"
>>> len(ibrow)
1
>>> ibrow.encode("utf-8")
b'\xf0\x9f\xa4\xa8'
>>> len(ibrow.encode("utf-8"))
4

>>> # Calling list() on a bytes object gives you
>>> # the decimal value for each byte
>>> list(b'\xf0\x9f\xa4\xa8')
[240, 159, 164, 168]

subtle but important feature of len()

the length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies
the length of the same character encoded to bytes with UTF-8 will be anywhere between 1 and 4
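
A quick comparison of len() on the str and on its UTF-8 bytes, using one sample character for each byte width (the sample characters are an arbitrary choice):

```python
sizes = {}
for ch in ["a", "ñ", "€", "🤨"]:
    encoded = ch.encode("utf-8")
    sizes[ch] = len(encoded)
    # code point, length as str, length as UTF-8 bytes
    print(f"U+{ord(ch):04X}", len(ch), len(encoded))
```

Each row prints a str length of 1, while the byte length grows from 1 to 4.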

What About UTF-16 and UTF-32?

two variations of decoding the same bytes object may produce results which aren't even in the same language
encoding four Greek letters with UTF-8 and then decoding back to text in UTF-16 produces a text str which is in a completely different language (Korean)

>>> letters = "αβγδ"
>>> rawdata = letters.encode("utf-8")
>>> rawdata.decode("utf-8")
'αβγδ'
>>> rawdata.decode("utf-16") 
'뇎닎돎듎'

the range or number of bytes under UTF-8, UTF-16, and UTF-32

Encoding	Bytes per Char	Variable Length
UTF-8	1 to 4	yes
UTF-16	2 to 4	yes
UTF-32	4	no
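
The byte counts in the table can be measured directly. This is a sketch; encoded_sizes() is a hypothetical helper, and it uses the little-endian codec names so that the byte order mark is not counted.

```python
def encoded_sizes(ch: str) -> dict:
    # The "-le" variants skip the byte order mark (BOM) that plain
    # "utf-16" and "utf-32" prepend, so only the character is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch in "añ€🤨":
    print(ch, encoded_sizes(ch))
```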

Python's Built-in Functions

Python has a group of built-in functions that relate in some way to numbering systems and character encoding
can be logically grouped together based on their purpose

Functions	Descriptions
ascii() bin() hex() oct()	Each obtains a different representation of an input, and each one produces a str. ascii() produces an ASCII-only representation of an object, with non-ASCII characters escaped. The other three give binary, hexadecimal, and octal representations of an integer, respectively. The results are only representations, not a fundamental change in the input.
bytes() str() int()	Class constructors for their respective types. Each offers ways of coercing the input into the desired type.
ord() chr()	Inverses of each other. ord() converts a str character to its base-10 code point; chr() does the opposite.
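
One call per function from the three groups, written as assertions so the expected values are explicit (the inputs are arbitrary examples):

```python
# Representations: each returns a str and leaves the input unchanged.
assert ascii("niño") == "'ni\\xf1o'"
assert (bin(400), hex(400), oct(400)) == ("0b110010000", "0x190", "0o620")

# Constructors: coerce the input into the target type.
assert bytes("€", "utf-8") == b"\xe2\x82\xac"
assert str(b"\xe2\x82\xac", "utf-8") == "€"
assert int("11", base=2) == 3

# ord() and chr() are inverses.
assert ord("a") == 97 and chr(97) == "a"
assert chr(ord("€")) == "€"
```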

Python String Literals

six ways that Python will allow entering the same Unicode character

Escape Sequence	Meaning	How to Express "a"
none	plain string literal, no escape	"a"
"\ooo"	character with octal value ooo	"\141"
"\xhh"	character with hex value hh	"\x61"
"\N{name}"	character named name in the Unicode database	"\N{LATIN SMALL LETTER A}"
"\uxxxx"	character with 16-bit (2-byte) hex value xxxx	"\u0061"
"\Uxxxxxxxx"	character with 32-bit (4-byte) hex value xxxxxxxx	"\U00000061"

two main caveats

not all of these forms work for all characters
the hex representation of the integer 300 is 0x012c, which won't fit into the 2-hex-digit escape code "\xhh"
the highest code point which can be squeezed into this escape sequence is "\xff" ("ÿ")
similarly for "\ooo", it will only work up to "\777" ("ǿ")
for \xhh, \uxxxx, and \Uxxxxxxxx, exactly as many digits are required as are shown in these examples
the way that Unicode tables conventionally display the codes for characters is with a leading U+ and variable number of hex characters
the key is that Unicode tables most often do not zero-pad these codes

the "\Uxxxxxxxx" form is the only escape sequence that is capable of holding any Unicode character
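
The six forms and the zero-padding caveat can be confirmed with a few assertions. The emoji code point U+1F928 is used here only as an example of a character beyond four hex digits.

```python
forms = ["a", "\141", "\x61", "\N{LATIN SMALL LETTER A}", "\u0061", "\U00000061"]
assert all(f == "a" for f in forms)

# A code point shown as U+1F928 in a Unicode table needs zero-padding to
# eight hex digits, because it does not fit in the four digits of "\uxxxx".
assert "\U0001F928" == chr(0x1F928)
assert "\u1F928" == "\u1F92" + "8"   # parsed as U+1F92 followed by "8"
assert len("\u1F928") == 2
```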

Other Encodings Available in Python

many other encoding schemes
Latin-1 (also called ISO-8859-1), which is technically the default for HTTP per RFC 2616
Windows has its own Latin-1 variant called cp1252
complete list of accepted encodings is in the documentation for the codecs module

to quickly get a representation of a decoded string's escaped Unicode literal use "unicode-escape"

>>> alef = chr(1575)  # Or "\u0627"
>>> alef_hamza = chr(1571)  # Or "\u0623"
>>> alef, alef_hamza
('ا', 'أ')
>>> alef.encode("unicode-escape")
b'\\u0627'
>>> alef_hamza.encode("unicode-escape")
b'\\u0623'
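
The "unicode-escape" codec also decodes, so the escaped bytes can be turned back into the original character. A short sketch, with the UTF-8 bytes printed for comparison:

```python
alef = chr(1575)
escaped = alef.encode("unicode-escape")   # bytes holding the literal text \u0627
assert escaped == b"\\u0627"
assert escaped.decode("unicode-escape") == alef

# For comparison, the UTF-8 encoding of the same character:
print(alef.encode("utf-8"))   # b'\xd8\xa7'
```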

You Know What They Say About Assumptions ...

Python assumes UTF-8 for source code and as the default for str.encode() and bytes.decode()
should not operate with the same assumption for external data

>>> data = b"\xbc cup of flour"
>>> data.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 0: invalid start byte
>>> data.decode("latin-1")
'¼ cup of flour'
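
When the encoding of external data is unknown, one common approach is to try a list of candidates in order. The sketch below uses a hypothetical helper, decode_with_fallback(). Latin-1 maps every byte to a character, so it never fails, and a successful decode does not prove the guess was right.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")) -> tuple:
    """Try each encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

print(decode_with_fallback(b"\xbc cup of flour"))       # ('¼ cup of flour', 'latin-1')
print(decode_with_fallback("¼ cup".encode("utf-8")))    # ('¼ cup', 'utf-8')
```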

Odds and Ends: unicodedata

do lookups on the Unicode Character Database (UCD)

>>> import unicodedata

>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'
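
Two more lookups from the same module, unicodedata.category() and the optional default argument of unicodedata.name(), are sketched below with an arbitrary set of sample characters.

```python
import unicodedata

for ch in "aÑ€٣":
    # code point, two-letter general category, official character name
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# name() raises ValueError for unnamed characters unless a default is given.
assert unicodedata.name("\n", "<unnamed>") == "<unnamed>"
```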