Python Topics : Web Scraping
Scrape and Parse Text From Websites
Build Your First Web Scraper
urllib contains tools for working with URLs
the urllib.request module contains a function named urlopen() that you can use to open a URL within a program
urlopen()
>>> from urllib.request import urlopen
>>> url = "http://olympus.realpython.org/profiles/aphrodite"
# returns HTTPResponse object
>>> page = urlopen(url) 
>>> page
<http.client.HTTPResponse object at 0x105fef820>
# read the bytes
>>> html_bytes = page.read()
# decode to text
>>> html = html_bytes.decode("utf-8")
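urlopen() can also be used as a context manager so the connection is closed automatically; a minimal sketch of the same steps, assuming the same Aphrodite profile URL (the script name is made up):

# fetch_page.py (sketch)

from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/aphrodite"

# the HTTPResponse object is a context manager,
# so the connection is closed when the with block exits
with urlopen(url) as page:
    html = page.read().decode("utf-8")

print(html[:40])  # show the first 40 characters as a sanity check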
Extract Text From HTML With String Methods
the find() method returns the index of the first occurrence of a substring
to get the page's title
>>> title_index = html.find("<title>") + len("<title>")
>>> title_index
21
>>> end_index = html.find("</title>")
>>> end_index
39
>>> title = html[title_index:end_index]
>>> title
'Profile: Aphrodite'
this approach relies on the HTML tags being syntactically correct and written exactly as expected
if the title tag is written differently, for example <Title >, then find("<title>") returns -1
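a minimal sketch of that failure mode, using a made-up snippet of badly formatted HTML (the messy_html string is an assumption, not the live page):

>>> messy_html = "<Title >Profile: Aphrodite</Title  />"
>>> messy_html.find("<title>")  # no exact, case-sensitive match
-1
>>> title_index = messy_html.find("<title>") + len("<title>")
>>> title_index  # -1 + 7 silently points at the wrong position
6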

Get to Know Regular Expressions
regular expressions are patterns which can be used to search for text within a string
Python supports regular expressions through the standard library's re module

the regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two
re.findall() returns a list of all matches
the string "ac" matches this pattern

>>> import re
# re.findall(< expression to match >, < string to search >)
>>> re.findall("ab*c", "ac")
['ac']
same pattern with different strings
>>> re.findall("ab*c", "abcd")
['abc']
>>> re.findall("ab*c", "acc")
['ac']
>>> re.findall("ab*c", "abcac")
['abc', 'ac']
# d breaks the pattern, an empty list is returned
>>> re.findall("ab*c", "abdc")
[]
pattern matching is case sensitive
to ignore case, pass re.IGNORECASE as an additional argument
>>> re.findall("ab*c", "ABC")
[]

>>> re.findall("ab*c", "ABC", re.IGNORECASE)
['ABC']
can use a period (.) to stand for any single character in a regular expression
can find all the strings that contain the letters "a" and "c" separated by a single character
>>> re.findall("a.c", "abc")
['abc']
>>> re.findall("a.c", "abbc")
[]
>>> re.findall("a.c", "ac")
[]
>>> re.findall("a.c", "acc")
['acc']
can use re.search() to search for a particular pattern inside a string
This function is somewhat more complicated than re.findall()
the function returns a Match object (MatchObject) that stores different groups of data
unlike re.findall(), re.search() stops at the first location where the pattern matches and returns None if there is no match
calling .group() on the Match object returns the text of the match
>>> match_results = re.search("ab*c", "ABC", re.IGNORECASE)
>>> match_results.group()
'ABC'
re.sub() is short for substitute
replace the text in a string that matches a regular expression with new text
by default Python's regexes are greedy
they try to find the longest possible match
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*>", "ELEPHANTS", string)
>>> string
'Everything is ELEPHANTS.'
to make a regex non-greedy, add ? after the *
a non-greedy pattern matches the shortest possible string
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*?>", "ELEPHANTS", string)
>>> string
"Everything is ELEPHANTS if it's in ELEPHANTS."
Extract Text From HTML With Regular Expressions
# regex_soup.py

import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)
  • <title.*?> - matches the opening <TITLE> tag in html. the <title part of the pattern matches <TITLE because re.search() is called with re.IGNORECASE
    .*?> matches any text after <TITLE up to the first instance of >
  • .*? - non-greedily matches all text after the opening <TITLE>, stopping at the first match for </title.*?>
  • </title.*?> - matches the closing </title / > tag in html
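a minimal sketch of the pattern against a made-up, badly formatted title tag (the messy_html string is an assumption used for illustration, not the live page):

>>> import re
>>> messy_html = "<head><TITLE >Profile: Dionysus</title  /></head>"
>>> match_results = re.search("<title.*?>.*?</title.*?>", messy_html, re.IGNORECASE)
>>> match_results.group()
'<TITLE >Profile: Dionysus</title  />'
>>> re.sub("<.*?>", "", match_results.group())
'Profile: Dionysus'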
Use an HTML Parser for Web Scraping in Python
Install Beautiful Soup
python -m pip install beautifulsoup4
Create a BeautifulSoup Object
# beauty_soup.py

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
the second argument to the BeautifulSoup constructor, "html.parser", tells it to use Python's built-in HTML parser
the program does three things
  1. opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module
  2. reads the HTML from the page as a string and assigns it to the html variable
  3. creates a BeautifulSoup object and assigns it to the soup variable
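once the soup object exists, a quick follow-up sketch (reusing the soup variable from beauty_soup.py above) is to print all the page text with .get_text():

# extract the page's text, dropping all HTML tags;
# assumes the soup object created in beauty_soup.py
print(soup.get_text())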
Use a BeautifulSoup Object
.find_all("img") returns a list of all <img> tags on the page
imgs = soup.find_all("img")
print(imgs)
# [<img src="/static/dionysus.jpg" />, <img src="/static/grapes.png" />]
# can unpack the list in an assignment
>>> image1, image2 = soup.find_all("img")
# access each tag's src attribute with dictionary-style indexing
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
# certain tags, such as <title>, can be accessed as attributes of the BeautifulSoup object
>>> soup.title
<title>Profile: Dionysus</title>
# get the title as a string
>>> soup.title.string
'Profile: Dionysus'
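the pieces above can be combined into a small script; a minimal sketch, assuming the same Dionysus profile page (the script name is made up):

# soup_images.py (sketch)

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
html = urlopen(url).read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# print the title text, then the src attribute of every <img> tag
print(soup.title.string)
for img in soup.find_all("img"):
    print(img["src"])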
Interact With HTML Forms
sometimes scraping HTML requires interacting with forms
MechanicalSoup provides what's known as a headless browser
a web browser with no graphical user interface
the browser is controlled programmatically from a Python program

Install MechanicalSoup
python -m pip install MechanicalSoup
Create a Browser Object
>>> import mechanicalsoup
>>> browser = mechanicalsoup.Browser()
# get a web page
>>> url = "http://olympus.realpython.org/login"
>>> page = browser.get(url)
# check the response code
>>> page
<Response [200]>
# page.soup returns a BeautifulSoup object
>>> type(page.soup)
<class 'bs4.BeautifulSoup'>
# view the HTML
>>> page.soup
<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br /><br />
<h2>Please log in to access Mount Olympus:</h2>
<br /><br />
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
</center>
</body>
</html>
Submit a Form With MechanicalSoup
the key part of the HTML above is the login form
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
logging in takes three steps
import mechanicalsoup

# 1
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup
# 2
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"
# 3
profiles_page = browser.submit(form, login_page.url)
  1. create a Browser instance
    use it to request the URL http://olympus.realpython.org/login
    use the .soup property to assign the HTML content of the page to the login_html variable
  2. login_html.select("form") returns a list of all <form> elements on the page
    the page has only one <form> element
    can access the form by retrieving the element at index 0 of the list
    the next two lines select the username and password inputs and set their values (a by-name alternative is sketched after this list)
  3. submit the form with browser.submit()
    pass two arguments to the method
    • the form object
    • the URL of the login_page
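selecting the inputs by position breaks if the form ever changes; a sketch of the same step using CSS attribute selectors to pick the inputs by name instead (reuses browser, login_html, and login_page from the listing above):

# select the username and password inputs by their name attributes
form = login_html.select("form")[0]
form.select('input[name="user"]')[0]["value"] = "zeus"
form.select('input[name="pwd"]')[0]["value"] = "ThunderDude"

profiles_page = browser.submit(form, login_page.url)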
successful submission redirects to the /profiles page
use .select("a") to get all links (<a> elements) on the page
links = profiles_page.soup.select("a")
iterate and print the list
for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")
relative link addresses can be prefixed with the base URL to build full URLs
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")
Interact With Websites in Real Time
# mech_soup.py

import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)