| What's a Character Encoding? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ASCII encoding encompasses
characters are segmented into different ranges within the ASCII table
the ASCII table displays the complete ASCII character set
The String Module
string constants
# From lib/python3.7/string.py
whitespace = ' \t\n\r\v\f'
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
ascii_letters = ascii_lowercase + ascii_uppercase
digits = '0123456789'
hexdigits = digits + 'abcdef' + 'ABCDEF'
octdigits = '01234567'
punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable = digits + ascii_letters + punctuation + whitespace
can use these constants for everyday string manipulation
>>> import string
>>> s = "What's wrong with ASCII?!?!?"
>>> s.rstrip(string.punctuation)
"What's wrong with ASCII"
A Bit of a Refresher
a bit is a signal that has only two possible states
there are different ways of symbolically representing a bit that all mean the same thing
each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character
>>> def make_bitseq(s: str) -> str:
...     if not s.isascii():
...         raise ValueError("ASCII only allowed")
...     return " ".join(f"{ord(i):08b}" for i in s)
>>> make_bitseq("bits")
'01100010 01101001 01110100 01110011'
>>> make_bitseq("CAPS")
'01000011 01000001 01010000 01010011'
>>> make_bitseq("$25.43")
'00100100 00110010 00110101 00101110 00110100 00110011'
>>> make_bitseq("~5")
'01111110 00110101'
the f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language, which is a way of specifying formatting for replacement fields in format strings
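As a small illustration of the format spec described above, the sketch below applies the mini-language to a single code point. The variable names are made up for this example.

```python
# "08b" reads as: pad with zeros (0) to a width of 8, render as binary (b).
code_point = ord("a")  # 97

as_binary = f"{code_point:08b}"  # zero-padded, 8 binary digits
as_octal = f"{code_point:o}"     # octal digits, no padding
as_hex = f"{code_point:02x}"     # at least 2 lowercase hex digits

# format() accepts the same spec string as the f-string does.
same_as_binary = format(code_point, "08b")

print(as_binary, as_octal, as_hex)
```

The same spec string works in f-strings, in format(), and in str.format(), so it only needs to be learned once.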
We Need More Bits!
given a number of bits n, the number of distinct possible values that can be represented in n bits is 2^n (2**n in Python)
from math import ceil, log
def n_bits_required(nvalues: int) -> int:
    return ceil(log(nvalues) / log(2))
n_bits_required(256)
need to use a ceiling in n_bits_required() to account for values that are not clean powers of 2
say you need to store a character set of 110 characters total: this should take log(110) / log(2) ≈ 6.781 bits, but there is no such thing as 0.781 bits, so 110 values will require 7 bits
>>> n_bits_required(110)
7
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Covering All the Bases - Other Number Systems | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
common numbering systems
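A minimal sketch of the common numbering systems as Python sees them, assuming the systems meant here are binary, octal, decimal, and hexadecimal. The value 11 and the variable names are chosen only for illustration.

```python
# The same integer, eleven, written as a literal in four bases.
values = [0b1011, 0o13, 11, 0xB]
all_equal = all(v == 11 for v in values)

# int() parses a string of digits in any base from 2 to 36.
from_binary = int("1011", base=2)
from_octal = int("13", base=8)
from_hex = int("b", base=16)

print(all_equal, from_binary, from_octal, from_hex)
```

The literal prefix (0b, 0o, 0x) only affects how the number is written in source code; the resulting int objects are identical.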
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Enter Unicode | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Unicode has 1,114,112 possible code points
ASCII is a perfect subset of Unicode
Unicode vs UTF-8
Unicode is an abstract encoding standard, not an encoding
UTF-8, UTF-16 and UTF-32 are encoding formats for representing Unicode characters as binary data of one or more bytes per character
Encoding and Decoding in Python 3
str type represents human-readable text, can contain any Unicode character
bytes type represents binary data
encode: str type to bytes type
decode: bytes type to str type
>>> "résumé".encode("utf-8")
b'r\xc3\xa9sum\xc3\xa9'
>>> "El Niño".encode("utf-8")
b'El Ni\xc3\xb1o'
>>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
'résumé'
>>> b"El Ni\xc3\xb1o".decode("utf-8")
'El Niño'
\xc3\xb1 are the two bytes representing the ñ
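To make the byte-level view concrete, here is a short sketch that inspects the encoded form of ñ. The escape "\u00f1" is used so the example does not depend on how the source file itself is saved; the variable names are illustrative.

```python
n_tilde = "\u00f1"  # ñ, LATIN SMALL LETTER N WITH TILDE

encoded = n_tilde.encode("utf-8")

# Iterating over a bytes object yields integers in the range 0-255.
byte_values = list(encoded)
bit_strings = [f"{b:08b}" for b in encoded]

# Decoding with the same encoding recovers the original character.
round_trip = encoded.decode("utf-8")

print(encoded, byte_values, bit_strings, round_trip)
```

One character of text becomes two bytes, that is, sixteen bits, once encoded as UTF-8.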
Python 3 - All-in on Unicode
>>> # Mac OS X High Sierra
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> # Windows Server 2012; other Windows builds may use UTF-16
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
make no assumptions about the platform's default encoding
One Byte, Two Bytes, Three Bytes, Four
a crucial feature is that UTF-8 is a variable-length encoding
ASCII encoding only requires one byte per character
a Unicode character encoded in UTF-8 can occupy between 1 and 4 bytes
>>> ibrow = "🤨"
>>> len(ibrow)
1
>>> ibrow.encode("utf-8")
b'\xf0\x9f\xa4\xa8'
>>> len(ibrow.encode("utf-8"))
4
>>> # Calling list() on a bytes object gives you
>>> # the decimal value for each byte
>>> list(b'\xf0\x9f\xa4\xa8')
[240, 159, 164, 168]
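subtle but important feature of len(): the length of a str counts Unicode characters (code points), while the length of the encoded bytes counts bytes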
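The difference between the two lengths can be sketched with one character from each UTF-8 width. The sample characters are picked for illustration and written as escapes so the example is independent of the file's own encoding.

```python
# One sample character for each UTF-8 width: a, ñ, €, 🤨
samples = ["a", "\u00f1", "\u20ac", "\U0001f928"]

# len() of a str counts code points; len() of bytes counts bytes.
str_lengths = [len(ch) for ch in samples]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n_chars, n_bytes in zip(samples, str_lengths, utf8_lengths):
    print(f"{ch!r}: {n_chars} character, {n_bytes} byte(s) in UTF-8")
```

Every sample has a str length of 1, while the encoded length grows from 1 to 4 bytes.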
What About UTF-16 and UTF-32?
two variations of decoding the same bytes object may produce results which aren't even in the same language
encoding four Greek letters with UTF-8 and then decoding back to text in UTF-16 produces a text str which is in a completely different language (Korean)
>>> letters = "αβγδ"
>>> rawdata = letters.encode("utf-8")
>>> rawdata.decode("utf-8")
'αβγδ'
>>> rawdata.decode("utf-16")
'뇎닎돎듎'
the range or number of bytes per character: 1 to 4 under UTF-8, 2 or 4 under UTF-16, and always 4 under UTF-32
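A rough way to check the byte counts for the three encoding formats is to encode a few sample characters and measure the result. The helper name encoded_sizes is hypothetical. The sketch uses the "-le" codec variants because the plain "utf-16" and "utf-32" codecs prepend a byte order mark, which would inflate the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies under each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# a, ñ, €, 🤨 written as escapes
sizes = {ch: encoded_sizes(ch) for ch in ["a", "\u00f1", "\u20ac", "\U0001f928"]}

for ch, by_encoding in sizes.items():
    print(repr(ch), by_encoding)
```

UTF-8 is the most compact for ASCII-heavy text, while UTF-32 trades space for a fixed width per character.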
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Python's Built-in Functions | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Python has a group of built-in functions that relate in some way to numbering systems and character encoding; they can be logically grouped together based on their purpose
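One possible grouping, sketched below, is by what each built-in returns: a string representation of a number, a bytes object, or a conversion between characters and integers. The grouping and sample values are this example's own choice, not necessarily the one the notes have in mind.

```python
# Group 1: string representations of text and integers.
ascii_repr = ascii("jalape\u00f1o")  # non-ASCII characters get escaped
binary, hexa, octal = bin(400), hex(400), oct(400)

# Group 2: raw binary data from text.
euro_bytes = bytes("\u20ac", "utf-8")

# Group 3: conversions between characters, integers, and text.
euro_char = chr(8364)                # code point -> one-character str
euro_code = ord("\u20ac")            # one-character str -> code point
three = int("11", base=2)            # digit string -> int
euro_text = str(euro_bytes, "utf-8") # bytes -> str, same as .decode()

print(ascii_repr, binary, hexa, octal, euro_bytes, euro_char, euro_code, three)
```

chr() and ord() are inverses of each other, as are bytes(text, encoding) and str(data, encoding).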
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Python String Literals | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
six ways that Python will allow entering the same Unicode character
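As an illustration, the letter "a" can be written with each escape form Python's string literals support. This is a sketch of the general idea; the choice of "a" as the sample character is an assumption made for this example.

```python
spellings = [
    "a",                           # the literal character
    "\141",                        # octal escape (up to three octal digits)
    "\x61",                        # two-digit hex escape
    "\u0061",                      # four-digit Unicode escape
    "\U00000061",                  # eight-digit Unicode escape
    "\N{LATIN SMALL LETTER A}",    # escape by Unicode character name
]

# All six literals produce the very same one-character string.
all_same = len(set(spellings)) == 1

print(spellings, all_same)
```

The \x and octal forms can only reach code points up to 0xFF and 0o777 respectively, so characters beyond that range need \u, \U, or \N.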
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Other Encodings Available in Python | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
many other encoding schemes exist
Latin-1 (also called ISO-8859-1), which is technically the default for HTTP
Windows has its own Latin-1 variant called cp1252
complete list of accepted encodings is in the documentation for the codecs module
to quickly get a representation of a decoded string's escaped Unicode literal, use "unicode-escape"
>>> alef = chr(1575)  # Or "\u0627"
>>> alef_hamza = chr(1571) # Or "\u0623"
>>> alef, alef_hamza
('ا', 'أ')
>>> alef.encode("unicode-escape")
b'\\u0627'
>>> alef_hamza.encode("unicode-escape")
b'\\u0623'
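The relationship between Latin-1 and cp1252 can be sketched with a single byte. The two encodings agree on most byte values but differ in the range 0x80 to 0x9F, where cp1252 places printable characters such as the euro sign. Variable names are illustrative.

```python
byte_0x80 = b"\x80"

# cp1252 maps 0x80 to the euro sign; Latin-1 maps it to a C1 control character.
as_cp1252 = byte_0x80.decode("cp1252")
as_latin1 = byte_0x80.decode("latin-1")

# Both encodings agree on accented letters such as é (0xE9).
e_acute_matches = "\u00e9".encode("latin-1") == "\u00e9".encode("cp1252")

# Latin-1 has no euro sign at all, so encoding it fails.
try:
    "\u20ac".encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print(repr(as_cp1252), repr(as_latin1), e_acute_matches, latin1_has_euro)
```

This is why text that looks fine when read as cp1252 can show stray control characters when read as Latin-1.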
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| You Know What They Say About Assumptions ... | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Python makes the assumption of UTF-8 encoding for files and code, but you should not operate with the same assumption for external data
>>> data = b"\xbc cup of flour"
>>> data.decode("utf-8")
Traceback (most recent call last):
File "
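One defensive pattern is to catch the decoding error and retry with an encoding you have reason to believe the source used. The sketch below assumes, purely for illustration, that the data came from a Latin-1 source. Latin-1 maps every possible byte to a character, so a successful Latin-1 decode never proves the guess was right.

```python
data = b"\xbc cup of flour"

try:
    text = data.decode("utf-8")
    used_encoding = "utf-8"
except UnicodeDecodeError:
    # 0xBC cannot start a UTF-8 sequence, so fall back to the assumed
    # source encoding (an assumption made for this example).
    text = data.decode("latin-1")
    used_encoding = "latin-1"

print(used_encoding, text)
```

When the real encoding is unknown, it is usually better to find out from the data's producer than to guess.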
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Odds and Ends: unicodedata | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
the unicodedata module lets you do lookups on the Unicode Character Database (UCD)
>>> import unicodedata
>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'
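Beyond name() and lookup(), the same module exposes other UCD properties. The short sketch below shows category() and normalize(); the sample characters are chosen for illustration.

```python
import unicodedata

euro = "\u20ac"      # €
n_tilde = "\u00f1"   # ñ

# General category: "Sc" stands for Symbol, currency.
euro_category = unicodedata.category(euro)
n_tilde_name = unicodedata.name(n_tilde)

# NFD splits ñ into a plain "n" plus a combining tilde (U+0303);
# NFC composes the pair back into the single code point.
decomposed = unicodedata.normalize("NFD", n_tilde)
recomposed = unicodedata.normalize("NFC", decomposed)

print(euro_category, n_tilde_name, len(decomposed), recomposed == n_tilde)
```

Normalizing both sides to the same form before comparing strings avoids mismatches between visually identical text.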
|