| Scrape and Parse Text From Websites | 
| 
Build Your First Web Scraper

urllib contains tools for working with URLs. The urllib.request module contains a function named urlopen() that you can use to open a URL within a program.
>>> from urllib.request import urlopen
>>> url = "http://olympus.realpython.org/profiles/aphrodite"
# returns HTTPResponse object
>>> page = urlopen(url) 
>>> page
<http.client.HTTPResponse object at 0x105fef820>
# read the bytes
>>> html_bytes = page.read()
# decode to text
>>> html = html_bytes.decode("utf-8")
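The three steps above (open, read, decode) can be wrapped in a small helper. This is a sketch, not part of the tutorial; the name fetch_html is our own, and the demo uses a data: URL (which urllib also understands) so it runs without a network connection:

```python
from urllib.request import urlopen

def fetch_html(url, encoding="utf-8"):
    """Open a URL and return its body decoded to text."""
    with urlopen(url) as page:          # response object, like HTTPResponse
        html_bytes = page.read()        # raw bytes
    return html_bytes.decode(encoding)  # decode bytes to str

# the data: scheme keeps this demo offline; %20 decodes to a space
html = fetch_html("data:text/plain,Hello%20from%20urlopen")
print(html)  # Hello from urlopen
```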
Extract Text From HTML With String Methods

The find() method returns the index of the first occurrence of a substring. To get the page's title:
>>> start_index = html.find("<title>") + len("<title>")
>>> start_index
14
>>> end_index = html.find("</title>")
>>> end_index
39
>>> title = html[start_index:end_index]
>>> title
'Profile: Aphrodite'
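The slicing logic above can be packaged as a function that also handles the missing-tag case. This is a sketch; the name text_between is our own:

```python
def text_between(html, start_tag, end_tag):
    """Return the text between the first start_tag and the following end_tag.

    Returns None when either tag is missing (find() returns -1).
    """
    start_index = html.find(start_tag)
    if start_index == -1:
        return None
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    if end_index == -1:
        return None
    return html[start_index:end_index]

html = "<head><title>Profile: Aphrodite</title></head>"
print(text_between(html, "<title>", "</title>"))  # Profile: Aphrodite
```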
This approach relies on the HTML tags being syntactically consistent: if the title tag were written as <Title > instead, find("<title>") would return -1 and the slice would be wrong.

Get to Know Regular Expressions

Regular expressions are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library's re module. The regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "ac" matches this pattern:
>>> import re
# re.findall(< expression to match >, < string to search >)
>>> re.findall("ab*c", "ac")
['ac']

Same pattern with different strings:
>>> re.findall("ab*c", "abcd")
['abc']
>>> re.findall("ab*c", "acc")
['ac']
>>> re.findall("ab*c", "abcac")
['abc', 'ac']
# the d between b and c breaks the pattern, so an empty list is returned
>>> re.findall("ab*c", "abdc")
[]

Pattern matching is case sensitive. To ignore case, pass re.IGNORECASE as a third argument:
>>> re.findall("ab*c", "ABC")
[]
>>> re.findall("ab*c", "ABC", re.IGNORECASE)
['ABC']

A period (.) stands for any single character in a regular expression. You can use it to find all the strings that contain the letters "a" and "c" separated by a single character:
>>> re.findall("a.c", "abc")
['abc']
>>> re.findall("a.c", "abbc")
[]
>>> re.findall("a.c", "ac")
[]
>>> re.findall("a.c", "acc")
['acc']

You can use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because it returns a Match object that stores different groups of data. Calling .group() on the Match object returns the first and most inclusive result:
>>> match_results = re.search("ab*c", "ABC", re.IGNORECASE)
>>> match_results.group()
'ABC'

re.sub() is short for substitute: it replaces the text in a string that matches a regular expression with new text. Python's regexes are greedy by default, meaning they try to find the longest possible match:
>>> string = "Everything is <replaced> if it's in <tags>."
>>> re.sub("<.*>", "ELEPHANTS", string)
"Everything is ELEPHANTS."

The non-greedy qualifier *? matches the shortest possible string instead:
>>> re.sub("<.*?>", "ELEPHANTS", string)
"Everything is ELEPHANTS if it's in ELEPHANTS."

Extract Text From HTML With Regular Expressions

# regex_soup.py
import re
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)  # remove the HTML tags
print(title)
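The same title-extraction idea can be packaged as a reusable function and exercised on an inline snippet instead of the live page. This is a sketch; the name extract_title is our own:

```python
import re

def extract_title(html):
    """Pull the <title> text out of an HTML document with regexes."""
    match_results = re.search("<title.*?>.*?</title.*?>", html, re.IGNORECASE)
    if match_results is None:
        return None
    title = match_results.group()
    return re.sub("<.*?>", "", title)  # strip the surrounding tags

# the sloppy tags here would defeat the string-method approach
html = "<head><TITLE >Profile: Dionysus</title  / ></head>"
print(extract_title(html))  # Profile: Dionysus
```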
 | 
| Use an HTML Parser for Web Scraping in Python | 
| Install Beautiful Soup

python -m pip install beautifulsoup4

Create a BeautifulSoup Object

# beauty_soup.py
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

The second argument to the BeautifulSoup constructor names Python's built-in HTML parser. The program does three things: it opens the URL with urlopen(), reads and decodes the HTML, and creates a BeautifulSoup object from the HTML text.
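A BeautifulSoup object can also strip all tags at once with .get_text(). A sketch on an inline snippet, assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Profile: Dionysus</title></head>
<body><h2>Name: Dionysus</h2><br/>Favorite animal: Leopard</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# .get_text() removes all tags and returns only the text content
print(soup.get_text())
```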
Use a BeautifulSoup Object

.find_all() returns a list of all instances of a particular tag. To get all the image tags:
imgs = soup.find_all("img")
print(imgs)
# [<img src="/static/dionysus.jpg" />, <img src="/static/grapes.png" />]
# can unpack the list in a call
>>> image1, image2 = soup.find_all("img")
# get the src attribute of each tag with dictionary-style access
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
# certain tags in HTML documents can be accessed as attributes of the soup object
>>> soup.title
<title>Profile: Dionysus</title>
# get the title as a string
>>> soup.title.string
'Profile: Dionysus' | 
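.find_all() can also filter on attributes, and .find() returns just the first match. A sketch on an inline snippet, assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

html = """<body>
<img src="/static/dionysus.jpg"/>
<img src="/static/grapes.png" alt="grapes"/>
<a href="/profiles">All Profiles</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")
# keyword arguments filter on attribute values; alt=True matches any alt attribute
with_alt = soup.find_all("img", alt=True)
print([img["src"] for img in with_alt])  # ['/static/grapes.png']
# .find() returns the first matching tag instead of a list
link = soup.find("a")
print(link["href"], link.text)  # /profiles All Profiles
```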
| Interact With HTML Forms | 
| Sometimes scraping HTML requires interacting with forms. MechanicalSoup installs what's known as a headless browser: a web browser with no graphical user interface, controlled programmatically from a Python program.

Install MechanicalSoup

python -m pip install MechanicalSoup

Create a Browser Object

>>> import mechanicalsoup
>>> browser = mechanicalsoup.Browser()
# get a web page
>>> url = "http://olympus.realpython.org/login"
>>> page = browser.get(url)
# check the response code
>>> page
<Response [200]>
# page.soup returns a BeautifulSoup object
>>> type(page.soup)
<class 'bs4.BeautifulSoup'>
# view the HTML
>>> page.soup
<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br /><br />
<h2>Please log in to access Mount Olympus:</h2>
<br /><br />
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
</center>
</body>
</html>

Submit a Form With MechanicalSoup

The key part of the HTML above is the login form:
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>

Three steps to log in:
import mechanicalsoup
# 1. fetch the login page
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup
# 2. fill in the username and password inputs
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"
# 3. submit the form; browser.submit() returns the resulting page
profiles_page = browser.submit(form, login_page.url)
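Step 2 edits the parsed form in place; those mechanics can be seen without a network connection by running the same .select() calls on an inline copy of the login form. A sketch, assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

login_html = BeautifulSoup("""
<form action="/login" method="post" name="login">
Username: <input name="user" type="text"/><br/>
Password: <input name="pwd" type="password"/><br/><br/>
<input type="submit" value="Submit"/>
</form>
""", "html.parser")

form = login_html.select("form")[0]
# setting a value on a Tag mutates the parsed tree that browser.submit() posts
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"
print(form.select("input")[0])
```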
Use .select() to get all links on the page:
links = profiles_page.soup.select("a")

Iterate over the list and print each link's text and href:
for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")

The href values are relative URLs, so prefix the host URL to make them absolute:
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}") | 
| Interact With Websites in Real Time | 
| # mech_soup.py
import time
import mechanicalsoup
browser = mechanicalsoup.Browser()
for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")
    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10) |
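The parsing step inside the loop above can be pulled out into a function and exercised on an inline snippet, with no network requests or waiting. A sketch, assuming beautifulsoup4 is installed; the name parse_result is our own:

```python
from bs4 import BeautifulSoup

def parse_result(html):
    """Extract the text of the element with id="result" from a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select("#result")[0]  # CSS id selector, as in the loop above
    return tag.text

html = '<html><body><h2 id="result">4</h2></body></html>'
print(f"The result of your dice roll is: {parse_result(html)}")  # ... is: 4
```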