Python Topics : Strings and Character Data
Getting to Know Strings and Characters in Python
Python doesn't have a character type
single characters are strings of length one
strings are immutable sequences of characters
any operation that modifies a string will create a new string
a string is also a sequence
allows access to characters using zero-based integer indices
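The points above can be checked with a minimal sketch. The variable names are illustrative and not part of the original notes.

```python
word = "Python"

# a single character is just a string of length one
first = word[0]
is_str = isinstance(first, str) and len(first) == 1

# item assignment fails because strings are immutable
try:
    word[0] = "J"
    mutated = True
except TypeError:
    mutated = False

# "modifying" a string really builds a new string object
new_word = "J" + word[1:]

print(first, is_str, mutated, new_word, word)
```

The original object bound to word is left untouched; only new_word holds the changed text.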
Creating Strings in Python
Standard String Literals
a string literal is just a sequence of characters enclosed in quotes
can use single or double quotes
# create an empty string object
x = ''
y = ""
# create a string object
z = 'not an empty string'
can use triple-quoted strings to create multiline strings
>>> '''A triple-quoted string
... spanning across multiple
... lines using single quotes'''
'A triple-quoted string\nspanning across multiple\nlines using single quotes'

>>> """A triple-quoted string
... spanning across multiple
... lines using double quotes"""
'A triple-quoted string\nspanning across multiple\nlines using double quotes'
Escape Sequences in String Literals
an escape sequence allows interpretation of characters as something different
  • apply special meaning to characters
  • suppress special character meaning
use a backslash (\) character combined with other characters
here a backslash suppresses the single quote's usual meaning as a delimiter
>>> 'This string contains a single quote (\') character'
"This string contains a single quote (') character"
escape sequences

character  usual interpretation           escape sequence  escaped interpretation
'          delimits a string literal      \'               literal single quote (') character
"          delimits a string literal      \"               literal double quote (") character
<newline>  terminates line input          \<newline>       newline is ignored
\          introduces an escape sequence  \\               literal backslash character

how <newline> works

>>> "Hello\
... , World\
... !"
'Hello, World!'
additional escape sequences
escape sequence  escape interpretation
\a               ASCII Bell (BEL) character
\b               ASCII Backspace (BS) character
\f               ASCII Formfeed (FF) character
\n               ASCII Linefeed (LF) character
\N{<name>}       Character from Unicode database with given <name>
\r               ASCII Carriage return (CR) character
\t               ASCII Horizontal tab (TAB) character
\uxxxx           Unicode character with 16-bit hex value xxxx
\Uxxxxxxxx       Unicode character with 32-bit hex value xxxxxxxx
\v               ASCII Vertical tab (VT) character
\ooo             Character with octal value ooo
\xhh             Character with hex value hh
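A short sketch of a few of these escape sequences in use. The sample characters (a tab, the Greek capital omega, the letter A) are arbitrary choices for illustration.

```python
# a tab escape counts as one character
tabbed = "a\tb"

# three ways to write the same Unicode character
omega_by_name = "\N{GREEK CAPITAL LETTER OMEGA}"
omega_16 = "\u03a9"
omega_32 = "\U000003a9"

# hex and octal escapes for the letter A (code point 65)
letter_hex = "\x41"
letter_oct = "\101"

print(len(tabbed), omega_by_name, omega_16, omega_32, letter_hex, letter_oct)
```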
Raw String Literals
with raw string literals, you can create strings that don't translate escape sequences
any backslash characters are left in the string
to create a raw string prepend the string literal with the letter r or R
>>> print("Before\tAfter")  # Regular string
Before    After

>>> print(r"Before\tAfter")  # Raw string
Before\tAfter
raw strings are commonly used to create regular expressions
regular expressions use backslashes heavily, and a raw string lets you write them without doubling each one
for example, to create a regular expression that matches email addresses
>>> import re

>>> pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
>>> pattern
'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b'

>>> regex = re.compile(pattern)

>>> text = """
...     Please contact us at [email protected]
...     or [email protected] for further information.
... """

>>> regex.findall(text)
['[email protected]', '[email protected]']
Formatted String Literals
formatted string literals (f-strings) let you interpolate values into strings and format them as needed
to create an f-string, prepend the string literal with the letter f or F
values are interpolated into replacement fields inside the string literal
create these fields using curly brackets
>>> name = "Jane"

>>> f"Hello, {name}!"
'Hello, Jane!'
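Replacement fields can also hold expressions and a format specifier after a colon. The sketch below assumes some made-up values (price, the "Report" header) purely for illustration.

```python
name = "Jane"
price = 1234.5  # hypothetical sample value

# left-align the name in 6 columns, right-align the price in 12
line = f"{name:<6}|{price:>12,.2f}|"

# any expression can appear inside the curly brackets
total = f"{price * 2:.1f}"

# fill and center, written inside an f-string
header = f"{'Report':=^20}"

print(line)
print(total)
print(header)
```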
The Built-in str() Function
can create new strings using the built-in str() function
more common use case is to convert other data types into strings
>>> str()
''

>>> str(42)
'42'

>>> str(3.14)
'3.14'

>>> str([1, 2, 3])
'[1, 2, 3]'

>>> str({"one": 1, "two": 2, "three": 3})
"{'one': 1, 'two': 2, 'three': 3}"

>>> str({"A", "B", "C"})
"{'B', 'C', 'A'}"
Using Operators on Strings
Concatenating Strings: The + Operator
the + operator is used to concatenate strings
concatenation involves joining two or more string objects to create a single new string
>>> greeting = "Hello"
>>> name = "Pythonista"

>>> greeting + ", " + name + "!!!"
'Hello, Pythonista!!!'

>>> file = "app"

>>> file += ".py"
>>> file
'app.py'
Repeating Strings: The * Operator
the repetition operator is the asterisk (*)
the repetition operator takes two operands
one operand is the string to be repeated
the other operand is an integer representing the number of repetitions
>>> "=" * 10
'=========='

>>> 10 * "Hi!"
'Hi!Hi!Hi!Hi!Hi!Hi!Hi!Hi!Hi!Hi!'

>>> sep = "-"
>>> sep *= 10
>>> sep
'----------'
Finding Substrings in a String: The in and not in Operators
use membership tests when you need to determine whether a substring appears in a given string
>>> "food" in "That's food for thought."
True

>>> "food" in "That's good for now."
False
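The not in operator is the negated form of the same test. A small sketch, with sample strings chosen only for illustration:

```python
text = "That's food for thought."

has_food = "food" in text          # substring is present
lacks_spam = "spam" not in text    # substring is absent
case_sensitive = "Food" in text    # membership tests are case-sensitive

print(has_food, lacks_spam, case_sensitive)
```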
Exploring Built-in Functions for String Processing
function  description
len()     returns the length of a string
str()     returns a user-friendly string representation of an object
repr()    returns a developer-friendly string representation of an object
format()  allows for string formatting
ord()     converts a character to an integer
chr()     converts an integer to a character

Finding the Number of Characters: len()
>>> len("Python")
6

>>> len("")
0
Converting Objects Into Strings: str() and repr()
to convert objects into their string representation, use the built-in str() and repr() functions
the str() function converts a given object into its user-friendly representation
this type of string representation is targeted at end users
>>> str(42)
'42'

>>> str(3.14)
'3.14'

>>> str([1, 2, 3])
'[1, 2, 3]'

>>> str({"one": 1, "two": 2, "three": 3})
"{'one': 1, 'two': 2, 'three': 3}"

>>> str({"A", "B", "C"})
"{'B', 'C', 'A'}"
the repr() function returns a developer-friendly representation of the object
>>> repr(42)
'42'

>>> repr(3.14)
'3.14'

>>> repr([1, 2, 3])
'[1, 2, 3]'

>>> repr({"one": 1, "two": 2, "three": 3})
"{'one': 1, 'two': 2, 'three': 3}"

>>> repr({"A", "B", "C"})
"{'B', 'C', 'A'}"
ideally, you should be able to copy the output of repr() and use it to re-create the original object

difference between repr() and str()

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"{type(self).__name__}(name='{self.name}', age={self.age})"

    def __str__(self):
        return f"I'm {self.name}, and I'm {self.age} years old."
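A possible way to see the two representations side by side. The class is repeated here only so the sketch runs on its own, and the sample values ("Jane", 25) are assumptions.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"{type(self).__name__}(name='{self.name}', age={self.age})"

    def __str__(self):
        return f"I'm {self.name}, and I'm {self.age} years old."


jane = Person("Jane", 25)

as_str = str(jane)         # calls __str__(), aimed at end users
as_repr = repr(jane)       # calls __repr__(), aimed at developers
via_fstring = f"{jane!r}"  # the !r conversion also uses __repr__()

print(as_str)
print(as_repr)
```

print() and plain f-string fields use the str() form, while the interactive prompt echoes the repr() form.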
Formatting Strings: format()
>>> import math
>>> from datetime import datetime

>>> format(math.pi, ".4f")  # Four decimal places
'3.1416'

>>> format(1000000, ",.2f")  # Thousand separators
'1,000,000.00'

>>> format("Header", "=^30")  # Centered and filled
'============Header============'

>>> format(datetime.now(), "%a %b %d, %Y")  # Date
'Mon Jul 29, 2024'
the ".4f" specifier formats the input value as a floating-point number with four decimal places.
the ",.2f" format specifier formats a number using commas as thousand separators and with two decimal places
the "=^30" specifier formats the string "Header" centered in a width of 30 characters using the equal sign as a filler character

Processing Characters Through Code Points: ord() and chr()
the ord() function returns an integer value representing the Unicode code point for the given character
the chr() function does the reverse of ord()
returns the character value associated with a given code point
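A brief sketch of the round trip between characters and code points. The sample characters and the one-step shift of "HAL" are illustrative choices.

```python
code_a = ord("A")        # code point of an ASCII letter
char_a = chr(65)         # back from code point to character
euro = ord("\u20ac")     # the euro sign, a non-ASCII character

# chr() undoes ord()
round_trip = chr(ord("x"))

# shift every character by one code point
shifted = "".join(chr(ord(c) + 1) for c in "HAL")

print(code_a, char_a, euro, round_trip, shifted)
```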
Indexing and Slicing Strings

Python's strings are ordered sequences of characters

Indexing Strings
can access individual characters from a string using the characters' associated index
sequences are zero-indexed
can use positive or negative index
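A small sketch of positive and negative indexing, reusing the "foobar" string that the slicing notes use:

```python
s = "foobar"

first = s[0]     # positive indices count from the left, starting at 0
last = s[-1]     # negative indices count from the right, starting at -1
same = s[len(s) - 1] == s[-1]

# an index past either end raises IndexError
try:
    s[6]
    out_of_range = False
except IndexError:
    out_of_range = True

print(first, last, same, out_of_range)
```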

Slicing Strings
slice is an expression of the form s[m:n]
returns the portion of s starting at index m, and up to but not including index n
by default the first index is zero
slicing starts at the beginning of the string
if the second index is not provided slicing returns the rest of the string
>>> s = "foobar"

>>> s[2:5]
'oba'
an optional third index is the step (stride) to use
>>> numbers = "12345" * 5
>>> numbers
'1234512345123451234512345'

>>> numbers[::5]
'11111'
>>> numbers[4::5]
'55555'
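The default indices and the step can be combined in several ways. A sketch with illustrative variable names:

```python
s = "foobar"

head = s[:4]          # first index omitted: start at the beginning
tail = s[2:]          # second index omitted: run to the end
last_three = s[-3:]   # negative indices work in slices too
every_other = s[::2]  # step of 2 takes indices 0, 2, 4
reversed_s = s[::-1]  # a negative step walks the string backward

print(head, tail, last_three, every_other, reversed_s)
```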
Exploring str Class Methods
Manipulating Casing
method         description
.capitalize()  returns a copy of the target string with its first character converted to uppercase, and all other characters converted to lowercase
.lower()       returns a copy of the target string with all alphabetic characters converted to lowercase
.swapcase()    returns a copy of the target string with uppercase alphabetic characters converted to lowercase and vice versa
.title()       returns a copy of the target string in which the first letter of each word is converted to uppercase and the remaining letters are lowercase
.upper()       returns a copy of the target string with all alphabetic characters converted to uppercase
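A quick sketch of the five casing methods applied to one sample string (the string itself is an arbitrary choice):

```python
s = "hello WORLD"

capitalized = s.capitalize()
lowered = s.lower()
swapped = s.swapcase()
titled = s.title()
uppered = s.upper()

# every method returns a new string; s is unchanged
print(capitalized, lowered, swapped, titled, uppered, s, sep=" | ")
```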

Finding and Replacing Substrings
.count(sub[, start[, end]])
sub is the substring to search for
start is the index to start with
default is beginning of the string
end is an exclusive index
default is the end of the string
returns the number of non-overlapping occurrences of the substring
>>> "foo goo moo".count("oo")
3

.find(sub[, start[, end]])
arguments are the same as above
can use .find() to check whether a string contains a particular substring
calling .find(sub) returns the lowest index in the target string where sub is found
returns -1 if substring is not found
>>> "foo bar foo baz foo qux".find("foo")
0

>>> "foo bar foo baz foo qux".find("grault")
-1

.index(sub[, start[, end]])
similar to .find()
raises an exception if the substring is not found
>>> "foo bar foo baz foo qux".index("foo")
0

>>> "foo bar foo baz foo qux".index("grault")
Traceback (most recent call last):
    ...
ValueError: substring not found

.rfind(sub[, start[, end]])
similar to .find()
returns the highest index in the target string where the substring sub is found
>>> "foo bar foo baz foo qux".rfind("foo", 0, 14)
8

>>> "foo bar foo baz foo qux".rfind("foo", 10, 14)
-1

.rindex(sub[, start[, end]])
similar to .rfind()
raises an exception if the substring is not found
>>> "foo bar foo baz foo qux".rindex("foo")
16

>>> "foo bar foo baz foo qux".rindex("grault")
Traceback (most recent call last):
    ...
ValueError: substring not found

.startswith(prefix[, start[, end]])
returns True if the target string starts with the specified prefix and False otherwise
>>> "foobar".startswith("foo")
True

>>> "foobar".startswith("bar")
False
comparison is restricted to the substring indicated by start and end if they're specified
>>> "foobar".startswith("bar", 3)
True

>>> "foobar".startswith("bar", 3, 5)
False

.endswith(suffix[, start[, end]])
returns True if the target string ends with the specified suffix and False otherwise
>>> "foobar".endswith("bar")
True

>>> "foobar".endswith("foo")
False

>>> "foobar".endswith("oob", 0, 4)
True

>>> "foobar".endswith("oob", 2, 4)
False
Classifying Strings
classify a string based on its characters
in all cases the methods are predicates returning True or False
.isalnum()
returns True if the target string isn't empty and all its characters are alphanumeric
>>> "abc123".isalnum()
True

>>> "abc$123".isalnum()
False

.isalpha()
returns True if the target string isn't empty and all its characters are alphabetic
whitespaces aren't considered alpha characters
>>> "ABCabc".isalpha()
True

>>> "abc123".isalpha()
False

>>> "ABC abc".isalpha()
False

.isdigit()
returns True if the target string is not empty and all its characters are numeric digits
>>> "123".isdigit()
True

>>> "123abc".isdigit()
False

.isidentifier()
returns True if the target string is a valid Python identifier according to the language definition
will return True for a string that matches a Python keyword even though that wouldn't be a valid identifier
>>> "foo32".isidentifier()
True

>>> "32foo".isidentifier()
False

>>> "foo$32".isidentifier()
False

>>> "and".isidentifier()
True

iskeyword()
not a str method; this function is contained in the keyword module
use it to check whether a string is a Python keyword
>>> from keyword import iskeyword

>>> iskeyword("and")
True

.islower()
returns True if the target string isn't empty and all its alphabetic characters are lowercase
>>> "abc".islower()
True

>>> "abc1$d".islower()
True

>>> "Abc1$D".islower()
False

.isprintable()
returns True if the target string is empty or if all its characters are printable
>>> "a\tb".isprintable()
False

>>> "a b".isprintable()
True

>>> "".isprintable()
True

>>> "a\nb".isprintable()
False

.isspace()
returns True if the target string isn't empty and all its characters are whitespaces
most commonly used whitespace characters are space (" "), tab ("\t"), and newline ("\n")
>>> " \t \n ".isspace()
True

>>> "   a   ".isspace()
False

.istitle()
returns True if
  • the target string isn't empty
  • the first alphabetic character of each word is uppercase
  • all other alphabetic characters in each word are lowercase
>>> "This Is A Title".istitle()
True

>>> "This is a title".istitle()
False

>>> "Give Me The #$#@ Ball!".istitle()
True

.isupper()
returns True if the target string isn't empty and all its alphabetic characters are uppercase
>>> "ABC".isupper()
True

>>> "ABC1$D".isupper()
True

>>> "Abc1$D".isupper()
False
Formatting Strings
.center(width[, fill])
returns a string consisting of the target string centered in a field of width characters
default padding consists of the ASCII space character
if the target string is as long as width or longer then it's returned unchanged
>>> "foo".center(10)
'   foo    '

>>> "bar".center(10, "-")
'---bar----'

>>> "foo".center(2)
'foo'

.expandtabs(tabsize=8)
replaces each tab character ("\t") found in the target string with spaces
default assumes eight characters per tab
>>> "a\tb\tc".expandtabs()
'a       b       c'

>>> "aaa\tbbb\tc".expandtabs()
'aaa     bbb     c'

>>> "a\tb\tc".expandtabs(4)
'a   b   c'

>>> "aaa\tbbb\tc".expandtabs(tabsize=4)
'aaa bbb c'

.ljust(width[, fill])
returns a string consisting of the target string left-justified in a field of width characters
default padding consists of the ASCII space character
if the target string is as long as width or longer, then it's returned unchanged
>>> "foo".ljust(10)
'foo       '

>>> "foo".ljust(10, "-")
'foo-------'

>>> "foo".ljust(2)
'foo'

.rjust(width[, fill])
similar to .ljust() but right-justifies the string
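A sketch of right justification that mirrors the .ljust() examples above:

```python
padded = "foo".rjust(10)        # default padding is the space character
dashed = "foo".rjust(10, "-")   # custom fill character
too_narrow = "foo".rjust(2)     # width too small: returned unchanged

print(repr(padded), repr(dashed), repr(too_narrow))
```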
.removeprefix(prefix)
returns a copy of the target string with prefix removed from the beginning
if the original string doesn't begin with prefix, then the string is returned unchanged
>>> "http://python.org".removeprefix("http://")
'python.org'

>>> "http://python.org".removeprefix("python")
'http://python.org'

.removesuffix(suffix)
similar to .removeprefix()
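A minimal sketch of suffix removal. The file names are made up for illustration, and both .removeprefix() and .removesuffix() require Python 3.9 or later.

```python
stem = "app.py".removesuffix(".py")        # suffix present: removed
unchanged = "app.py".removesuffix(".js")   # suffix absent: returned as is

# only one trailing occurrence is removed
once = "data.tar.gz.gz".removesuffix(".gz")

print(stem, unchanged, once)
```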
.lstrip([chars])
returns a copy of the target string with any whitespace characters removed from the left end
>>> "   foo bar baz   ".lstrip()
'foo bar baz   '

>>> "\t\nfoo\t\nbar\t\nbaz".lstrip()
'foo\t\nbar\t\nbaz'
optional chars argument is a string that specifies the set of characters to be removed
>>> "http://cnn.com".lstrip("/:htp")
'cnn.com'

.rstrip([chars])
similar to .lstrip() but removes characters from the right end
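A sketch of .rstrip() with and without the chars argument. The last line shows a common pitfall: chars is treated as a set of characters, not as a suffix. The sample strings are illustrative.

```python
trimmed = "   foo bar baz   ".rstrip()   # whitespace removed on the right only
no_punct = "foo.$$$;".rstrip(";$.")      # strips any of ; $ . from the right

# chars is a set of characters, so this strips more than the ".py" suffix
surprising = "app.py".rstrip(".py")

print(repr(trimmed), repr(no_punct), repr(surprising))
```

When the goal is to drop an exact suffix, .removesuffix() is usually the safer choice.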
.strip([chars])
trims whitespace from ends of string
>>> "   foo bar baz   ".strip()
'foo bar baz'
optional chars argument is a string that specifies the set of characters to be removed
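A sketch of .strip() with a chars argument, using a hypothetical host name as sample input:

```python
# strips any of the characters w . m o c from both ends
host = "www.example.com".strip("w.moc")

# characters in the middle are never touched
starred = "***foo*bar***".strip("*")

print(host, starred)
```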


.replace(old, new[, count])
use the .replace() method to replace a substring of a string
returns a copy of the target string with all the occurrences of the old substring replaced by new
>>> "foo bar foo baz foo qux".replace("foo", "grault")
'grault bar grault baz grault qux'
if the optional count argument is specified, then a maximum of count replacements are performed
starts at the left end of the target string
>>> "foo bar foo baz foo qux".replace("foo", "grault", 2)
'grault bar grault baz foo qux'

.zfill(width)
returns a copy of the target string left-padded with zeroes to the specified width
if the target string contains a leading sign, then it remains at the left edge of the result string after zeros are inserted
if the target string is as long as width or longer, then it's returned unchanged
>>> "42".zfill(5)
'00042'

>>> "+42".zfill(8)
'+0000042'

>>> "-42".zfill(8)
'-0000042'
will zero-pad a string that isn't a numeric value
>>> "foo".zfill(6)
'000foo'
Joining and Splitting Strings
these methods operate on or return iterables
.join(iterable)
takes an iterable of string objects
returns the string that results from concatenating the objects in the input iterable (argument) separated by the target string (separator)
>>> "**".join(["foo", "bar", "baz", "qux"])
'foo**bar**baz**qux'

.partition(sep)
the .partition(sep) call splits the target string at the first occurrence of string sep
the return value is a tuple with three objects
  1. the portion of the target string that precedes sep
  2. the sep object itself
  3. the portion of the target string that follows sep
if the string ends with the target sep, then the last item in the tuple is an empty string
if sep isn't found then the returned tuple contains the string followed by two empty strings
>>> "foo.bar".partition(".")
('foo', '.', 'bar')

>>> "foo@@bar@@baz".partition("@@")
('foo', '@@', 'bar@@baz')

>>> "foo.bar@@".partition("@@")
('foo.bar', '@@', '')

>>> "foo.bar".partition("@@")
('foo.bar', '', '')

.rpartition(sep)
works like .partition(sep) except that the target string is split at the last occurrence of sep instead of the first
>>> "foo@@bar@@baz".partition("@@")
('foo', '@@', 'bar@@baz')

>>> "foo@@bar@@baz".rpartition("@@")
('foo@@bar', '@@', 'baz')

.split(sep=None, maxsplit=-1)
without arguments, .split() splits the target string into substrings delimited by any sequence of whitespace
consecutive whitespace characters are combined into a single delimiter
returns the substrings as a list
>>> "foo bar baz qux".split()
['foo', 'bar', 'baz', 'qux']

>>> "foo\n\tbar   baz\r\fqux".split()
['foo', 'bar', 'baz', 'qux']
if the sep argument is specified, it will be used as the separator
>>> "foo.bar.baz.qux".split(".")
['foo', 'bar', 'baz', 'qux']
if the optional parameter maxsplit is specified, then a maximum of that many splits are performed
>>> "foo.bar.baz.qux".split(".", 1)
['foo', 'bar.baz.qux']
if maxsplit isn't specified, then the results of .split() and .rsplit() are identical
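The difference between .split() and .rsplit() only shows up when maxsplit limits the number of splits. A sketch reusing the dotted sample string from above:

```python
s = "foo.bar.baz.qux"

from_left = s.split(".", 1)     # splits are counted from the left
from_right = s.rsplit(".", 1)   # splits are counted from the right

# without maxsplit both methods give the same list
same = s.split(".") == s.rsplit(".")

print(from_left, from_right, same)
```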
.splitlines() splits the target string at line boundaries; the following escape sequences can work as line boundaries

sequence    description
\n          newline
\r          carriage return
\r\n        carriage return + line feed
\v or \x0b  line tabulation
\f or \x0c  form feed
\x1c        file separator
\x1d        group separator
\x1e        record separator
\x85        next line (C1 control code)
\u2028      Unicode line separator
\u2029      Unicode paragraph separator
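These boundaries are the ones recognized by the built-in str.splitlines() method. A short sketch, with a sample string that mixes several of them:

```python
text = "foo\nbar\r\nbaz\fqux\u2028quux"

# each boundary ends a line; "\r\n" counts as a single boundary
lines = text.splitlines()

# keepends=True keeps the boundary characters on each line
with_ends = text.splitlines(keepends=True)

print(lines)
print(with_ends)
```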