What's a Character Encoding?
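ASCII encoding encompasses lowercase and uppercase English letters, digits, punctuation and symbols, whitespace characters, and some non-printable control characters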
characters are segmented into different ranges within the ASCII table
the ASCII table displays the complete ASCII character set
The String Module
string constants

    # From lib/python3.7/string.py
    whitespace = ' \t\n\r\v\f'
    ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz'
    ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ascii_letters = ascii_lowercase + ascii_uppercase
    digits = '0123456789'
    hexdigits = digits + 'abcdef' + 'ABCDEF'
    octdigits = '01234567'
    punctuation = r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
    printable = digits + ascii_letters + punctuation + whitespace

can use these constants for everyday string manipulation

    >>> import string
    >>> s = "What's wrong with ASCII?!?!?"
    >>> s.rstrip(string.punctuation)
    "What's wrong with ASCII"

A Bit of a Refresher
a bit is a signal that has only two possible states
different ways of symbolically representing a bit that all mean the same thing
each character from the ASCII string gets pseudo-encoded into 8 bits, with spaces in between the 8-bit sequences that each represent a single character

    >>> def make_bitseq(s: str) -> str:
    ...     if not s.isascii():
    ...         raise ValueError("ASCII only allowed")
    ...     return " ".join(f"{ord(i):08b}" for i in s)
    ...
    >>> make_bitseq("bits")
    '01100010 01101001 01110100 01110011'
    >>> make_bitseq("CAPS")
    '01000011 01000001 01010000 01010011'
    >>> make_bitseq("$25.43")
    '00100100 00110010 00110101 00101110 00110100 00110011'
    >>> make_bitseq("~5")
    '01111110 00110101'

the f-string f"{ord(i):08b}" uses Python's Format Specification Mini-Language, which is a way of specifying formatting for replacement fields in format strings
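As a rough sketch of how the format spec in make_bitseq() reads: in `08b`, the `0` asks for zero padding, the `8` is the minimum width, and the `b` selects binary. The same integer can be rendered in other bases by swapping the last letter. The helper bitseq_to_str() below is hypothetical (it is not part of the notes above) and assumes its input is a run of space-separated binary groups.

```python
def make_bitseq(s: str) -> str:
    """Render each ASCII character as a zero-padded 8-bit binary group."""
    if not s.isascii():
        raise ValueError("ASCII only allowed")
    return " ".join(f"{ord(ch):08b}" for ch in s)


def bitseq_to_str(bits: str) -> str:
    """Hypothetical inverse: parse each group as base 2 and map it to a character."""
    return "".join(chr(int(group, 2)) for group in bits.split())


n = ord("a")                  # the integer 97
binary = format(n, "08b")     # zero-padded to width 8, base 2
octal = format(n, "o")        # base 8
hexa = format(n, "x")         # base 16, lowercase digits

round_trip = bitseq_to_str(make_bitseq("bits"))
print(binary, octal, hexa, round_trip)
```

The round trip works only because every group is a complete character; the bit strings themselves carry no information about where one character ends and the next begins.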
We Need More Bits!
given a number of bits n, the number of distinct possible values that can be represented in n bits is 2^n

    from math import ceil, log

    def n_bits_required(nvalues: int) -> int:
        return ceil(log(nvalues) / log(2))

    n_bits_required(256)

need to use a ceiling in n_bits_required() to account for values that are not clean powers of 2
need to store a character set of 110 characters total
this should take log(110) / log(2) ≈ 6.781 bits
no such thing as 0.781 bits
110 values will require 7 bits

    >>> n_bits_required(110)
    7
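Because n_bits_required() goes through floating-point logarithms, an integer-only variant can serve as a cross-check. The sketch below is an alternative technique, not the approach used in the notes; the name n_bits_required_int is hypothetical, and it assumes nvalues is at least 1.

```python
def n_bits_required_int(nvalues: int) -> int:
    """Bits needed to distinguish `nvalues` distinct values, using integers only."""
    if nvalues < 1:
        raise ValueError("nvalues must be a positive integer")
    # With values numbered from 0, the largest one is nvalues - 1, and
    # bit_length() is the number of binary digits needed to write it.
    return (nvalues - 1).bit_length()


# n bits can represent 2**n distinct values.
values_per_width = {n: 2 ** n for n in (1, 7, 8, 16)}

print(values_per_width)
print(n_bits_required_int(110), n_bits_required_int(256), n_bits_required_int(257))
```

Working in integers avoids any rounding concerns near exact powers of 2, at the cost of being a little less obviously tied to the logarithm formula.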
Covering All the Bases - Other Number Systems
common numbering systems
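A short sketch of how the common numbering systems show up in Python: integer literals take a `0b`, `0o`, or `0x` prefix for binary, octal, and hexadecimal, and int() with a base argument parses strings written in those systems. The variable names are chosen here for illustration.

```python
# The same integer written as binary, octal, decimal, and hexadecimal literals.
as_binary = 0b1011
as_octal = 0o13
as_decimal = 11
as_hex = 0xB

# int() parses a string in a given base; bin(), oct(), and hex() go the other way.
parsed = [int("1011", 2), int("13", 8), int("11", 10), int("b", 16)]
rendered = (bin(11), oct(11), hex(11))

print(as_binary == as_octal == as_decimal == as_hex)
print(parsed, rendered)
```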
Enter Unicode
Unicode has 1,114,112 possible code points
ASCII is a perfect subset of Unicode
Unicode vs UTF-8
Unicode is an abstract encoding standard, not an encoding
UTF-8, UTF-16 and UTF-32 are encoding formats for representing Unicode characters as binary data of one or more bytes per character
Encoding and Decoding in Python 3
str type represents human-readable text, can contain any Unicode character
bytes type represents binary data
encode str type to bytes type
decode bytes type to str type

    >>> "résumé".encode("utf-8")
    b'r\xc3\xa9sum\xc3\xa9'
    >>> "El Niño".encode("utf-8")
    b'El Ni\xc3\xb1o'
    >>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
    'résumé'
    >>> b"El Ni\xc3\xb1o".decode("utf-8")
    'El Niño'

\xc3\xb1 are the two bytes representing the ñ
Python 3 - All-in on Unicode

    >>> # Mac OS X High Sierra
    >>> import locale
    >>> locale.getpreferredencoding()
    'UTF-8'

    >>> # Windows Server 2012; other Windows builds may use UTF-16
    >>> import locale
    >>> locale.getpreferredencoding()
    'cp1252'

make no assumptions about the platform's default encoding
One Byte, Two Bytes, Three Bytes, Four
a crucial feature is that UTF-8 is a variable-length encoding
ASCII encoding only requires one byte per character
a Unicode character can occupy between 1 and 4 bytes in UTF-8

    >>> ibrow = "🤨"
    >>> len(ibrow)
    1
    >>> ibrow.encode("utf-8")
    b'\xf0\x9f\xa4\xa8'
    >>> len(ibrow.encode("utf-8"))
    4
    >>> # Calling list() on a bytes object gives you
    >>> # the decimal value for each byte
    >>> list(b'\xf0\x9f\xa4\xa8')
    [240, 159, 164, 168]

subtle but important feature of len(): a single Unicode character as a str always has length 1, while the same character encoded to bytes in UTF-8 has a length between 1 and 4
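To make the variable-length behavior concrete, the sketch below encodes one sample character from each UTF-8 length class and records both lengths. The sample characters and the names `samples` and `byte_lengths` are chosen here for illustration.

```python
samples = ["a", "ñ", "€", "🤨"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # len(ch) counts code points in the str; len(encoded) counts bytes.
    print(f"U+{ord(ch):04X}  len(str)={len(ch)}  len(bytes)={len(encoded)}  {encoded!r}")

str_lengths = [len(ch) for ch in samples]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Each sample is a single code point, so the str length stays at 1 while the byte length grows with the size of the code point.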
What About UTF-16 and UTF-32?
two variations of decoding the same bytes object may produce results which aren't even in the same language
encoding four Greek letters with UTF-8 and then decoding back to text in UTF-16 produces a text str which is in a completely different language (Korean)

    >>> letters = "αβγδ"
    >>> rawdata = letters.encode("utf-8")
    >>> rawdata.decode("utf-8")
    'αβγδ'
    >>> rawdata.decode("utf-16")
    '뇎닎돎듎'

the range or number of bytes under UTF-8, UTF-16, and UTF-32: 1 to 4, 2 to 4, and exactly 4 bytes per character, respectively
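One way to see the byte counts for the three encodings side by side is to encode the same characters under each and compare lengths. This is a sketch with a hypothetical helper, encoded_sizes(); it uses the little-endian codec names on the assumption that a byte order mark should not be counted.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one character under each encoding (hypothetical helper)."""
    # The "-le" codec names pin the byte order, so no byte order mark (BOM)
    # is prepended and the counts reflect the character alone.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


sizes = {ch: encoded_sizes(ch) for ch in ("a", "α", "€", "🤨")}

for ch, row in sizes.items():
    print(f"U+{ord(ch):04X}", row)
```

Plain "utf-16" and "utf-32" would add 2 and 4 bytes respectively for the BOM when encoding, which is why the explicit byte-order variants are used in this comparison.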
Python's Built-in Functions
Python has a group of built-in functions that relate in some way to numbering systems and character encoding; they can be logically grouped together based on their purpose
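One possible grouping, sketched below, splits these built-ins into string representations (ascii(), bin(), oct(), hex()), type constructors (bytes(), str()), and conversions between characters and integers (ord(), chr(), int()). The grouping and the sample values are an illustration rather than an official classification.

```python
# String representations of an object or an integer.
reprs = (ascii("jalapeño"), bin(400), oct(400), hex(400))

# Constructors for the two text-related types, with an explicit encoding.
as_bytes = bytes("€", "utf-8")            # str -> bytes
as_str = str(b"\xe2\x82\xac", "utf-8")    # bytes -> str

# Conversions between characters and integers.
code_point = ord("€")         # character -> integer code point
char = chr(8364)              # integer code point -> character
parsed = int("0x20ac", 16)    # string in a given base -> integer

print(reprs)
print(as_bytes, ascii(as_str), code_point, ascii(char), parsed)
```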
Python String Literals
six ways that Python will allow entering the same Unicode character
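The six forms can be sketched with the letter "a" (code point 97) as the example character; the choice of character and the name `ways` are assumptions made here for illustration.

```python
import unicodedata  # used only to confirm the character's official name

ways = [
    "a",                          # the literal character
    "\141",                       # octal escape, up to three digits
    "\x61",                       # hex escape, exactly two digits
    "\N{LATIN SMALL LETTER A}",   # name from the Unicode database
    "\u0061",                     # 16-bit escape, exactly four hex digits
    "\U00000061",                 # 32-bit escape, exactly eight hex digits
]

all_same = len(set(ways)) == 1
official_name = unicodedata.name(ways[0])
print(all_same, official_name)
```

The octal and `\x` forms can only reach small code points, so characters outside that range need the `\u`, `\U`, or `\N{...}` forms, or the literal character itself.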
Other Encodings Available in Python
many other encoding schemes
Latin-1 (also called ISO-8859-1), which is technically the default for HTTP
Windows has its own Latin-1 variant called cp1252
complete list of accepted encodings is in the documentation for the codecs module
to quickly get a representation of a decoded string's escaped Unicode literal, use "unicode-escape"

    >>> alef = chr(1575)  # Or "\u0627"
    >>> alef_hamza = chr(1571)  # Or "\u0623"
    >>> alef, alef_hamza
    ('ا', 'أ')
    >>> alef.encode("unicode-escape")
    b'\\u0627'
    >>> alef_hamza.encode("unicode-escape")
    b'\\u0623'
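A small sketch of where Latin-1 and cp1252 agree and where they part ways. The sample text and variable names are chosen for illustration; the byte 0x80 is used because it is one of the positions where the two encodings differ.

```python
text = "résumé"

# For characters that both encodings map the same way, the bytes are identical.
same_bytes = text.encode("latin-1") == text.encode("cp1252")

# Byte 0x80 is one place they diverge: Latin-1 maps it to the control
# character U+0080, while cp1252 maps it to the euro sign.
latin1_char = b"\x80".decode("latin-1")
cp1252_char = b"\x80".decode("cp1252")

print(same_bytes, ascii(latin1_char), ascii(cp1252_char))
```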
You Know What They Say About Assumptions ...
Python makes the assumption of UTF-8 encoding for files and code, but you should not operate with the same assumption for external data

    >>> data = b"\xbc cup of flour"
    >>> data.decode("utf-8")
    Traceback (most recent call last):
      ...
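A hedged sketch of handling the failing decode: catch the error, then decode with the encoding the data was really written in. Treating the bytes as Latin-1 is an assumption made for this example; with real external data the encoding has to come from the source, not from a guess.

```python
data = b"\xbc cup of flour"

try:
    data.decode("utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # 0xbc is not a valid first byte of any UTF-8 sequence.
    error_name = type(exc).__name__

# Assumption for this sketch: the bytes were really produced as Latin-1.
as_latin1 = data.decode("latin-1")

# errors="replace" never raises, but substitutes U+FFFD for undecodable bytes.
lossy = data.decode("utf-8", errors="replace")

print(error_name, ascii(as_latin1), ascii(lossy))
```

Decoding as Latin-1 never raises, because every byte value maps to some character, so a successful decode is not evidence that the guess was right.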
Odds and Ends: unicodedata
do lookups on the Unicode Character Database (UCD)

    >>> import unicodedata
    >>> unicodedata.name("€")
    'EURO SIGN'
    >>> unicodedata.lookup("EURO SIGN")
    '€'
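Beyond name() and lookup(), a few other lookups from the same module can be sketched as follows. The sample characters and the fallback string "<unnamed>" are chosen here for illustration.

```python
import unicodedata

category = unicodedata.category("€")    # two-letter general category
fraction = unicodedata.numeric("¼")     # numeric value of a character

# name() raises ValueError for characters without a name, such as control
# characters, unless a default is passed as the second argument.
fallback = unicodedata.name("\n", "<unnamed>")

# lookup() and name() are inverses for named characters.
round_trip = unicodedata.lookup(unicodedata.name("€"))

print(category, fraction, fallback, ascii(round_trip))
```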