Python Topics : Web Scraping
Scrape and Parse Text From Websites
Build Your First Web Scraper
urllib contains tools for working with URLs
the urllib.request module contains a function named urlopen() that you can use to open a URL within a program
urlopen()
>>> from urllib.request import urlopen
>>> url = "http://olympus.realpython.org/profiles/aphrodite"
# returns HTTPResponse object
>>> page = urlopen(url) 
>>> page
<http.client.HTTPResponse object at 0x105fef820>
# read the bytes
>>> html_bytes = page.read()
# decode to text
>>> html = html_bytes.decode("utf-8")
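urlopen() can also be used as a context manager so the connection is closed automatically; a minimal sketch of the same steps, assuming the same Aphrodite profile URL (the script name is made up):

# fetch_page.py (sketch)

from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/aphrodite"

# the HTTPResponse object is a context manager,
# so the connection is closed when the with block exits
with urlopen(url) as page:
    html = page.read().decode("utf-8")

print(html[:40])  # show the first 40 characters as a sanity check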
Extract Text From HTML With String Methods
the find() method returns the index of the first occurrence of a substring
to get the page's title
>>> title_index = html.find("<title>") + len("<title>")
>>> title_index
21
>>> end_index = html.find("</title>")
>>> end_index
39
>>> title = html[title_index:end_index]
>>> title
'Profile: Aphrodite'
this approach relies on the HTML tags being syntactically correct and written exactly as expected
if the title tag is written differently, for example <Title >, then find("<title>") returns -1
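a minimal sketch of that failure mode, using a made-up snippet of badly formatted HTML (the messy_html string is an assumption, not the live page):

>>> messy_html = "<Title >Profile: Aphrodite</Title  />"
>>> messy_html.find("<title>")  # no exact, case-sensitive match
-1
>>> title_index = messy_html.find("<title>") + len("<title>")
>>> title_index  # -1 + 7 silently points at the wrong position
6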

Get to Know Regular Expressions
regular expressions are patterns which can be used to search for text within a string
Python supports regular expressions through the standard library's re module

the regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two
re.findall() returns a list of all matches
the string "ac" matches this pattern

>>> import re
# re.findall(< expression to match >, < string to search >)
>>> re.findall("ab*c", "ac")
['ac']
same pattern with different strings
>>> re.findall("ab*c", "abcd")
['abc']
>>> re.findall("ab*c", "acc")
['ac']
>>> re.findall("ab*c", "abcac")
['abc', 'ac']
# d breaks the pattern, an empty list is returned
>>> re.findall("ab*c", "abdc")
[]
pattern matching is case sensitive
to ignore case, pass re.IGNORECASE as an additional argument
>>> re.findall("ab*c", "ABC")
[]

>>> re.findall("ab*c", "ABC", re.IGNORECASE)
['ABC']
can use a period (.) to stand for any single character in a regular expression
can find all the strings that contain the letters "a" and "c" separated by a single character
>>> re.findall("a.c", "abc")
['abc']
>>> re.findall("a.c", "abbc")
[]
>>> re.findall("a.c", "ac")
[]
>>> re.findall("a.c", "acc")
['acc']
can use re.search() to search for a particular pattern inside a string
This function is somewhat more complicated than re.findall()
the function returns a Match object (MatchObject) that stores different groups of data
unlike re.findall(), re.search() stops at the first location where the pattern matches and returns None if there is no match
calling .group() on the Match object returns the text of the match
>>> match_results = re.search("ab*c", "ABC", re.IGNORECASE)
>>> match_results.group()
'ABC'
re.sub() is short for substitute
replace the text in a string that matches a regular expression with new text
by default Python's regexes are greedy
they try to find the longest possible match
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*>", "ELEPHANTS", string)
>>> string
'Everything is ELEPHANTS.'
to make a regex non-greedy, add ? after the *
a non-greedy pattern matches the shortest possible string
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*?>", "ELEPHANTS", string)
>>> string
"Everything is ELEPHANTS if it's in ELEPHANTS."
Extract Text From HTML With Regular Expressions
# regex_soup.py

import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)
  • <title.*?> - matches the opening <TITLE> tag in html. the <title part of the pattern matches <TITLE because re.search() is called with re.IGNORECASE
    .*?> matches any text after <TITLE up to the first instance of >
  • .*? - non-greedily matches all text after the opening <TITLE>, stopping at the first match for </title.*?>
  • </title.*?> - matches the closing </title / > tag in html
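a minimal sketch of the pattern against a made-up, badly formatted title tag (the messy_html string is an assumption used for illustration, not the live page):

>>> import re
>>> messy_html = "<head><TITLE >Profile: Dionysus</title  /></head>"
>>> match_results = re.search("<title.*?>.*?</title.*?>", messy_html, re.IGNORECASE)
>>> match_results.group()
'<TITLE >Profile: Dionysus</title  />'
>>> re.sub("<.*?>", "", match_results.group())
'Profile: Dionysus'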
Use an HTML Parser for Web Scraping in Python
Install Beautiful Soup
python -m pip install beautifulsoup4
Create a BeautifulSoup Object
# beauty_soup.py

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
the second argument to the BeautifulSoup constructor, "html.parser", tells it to use Python's built-in HTML parser
the program does three things
  1. opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module
  2. reads the HTML from the page as a string and assigns it to the html variable
  3. creates a BeautifulSoup object and assigns it to the soup variable
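once the soup object exists, a quick follow-up sketch (reusing the soup variable from beauty_soup.py above) is to print all the page text with .get_text():

# extract the page's text, dropping all HTML tags;
# assumes the soup object created in beauty_soup.py
print(soup.get_text())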
Use a BeautifulSoup Object
.find_all("img") returns a list of all <img> tags on the page
imgs = soup.find_all("img")
print(imgs)
# [<img src="/static/dionysus.jpg" />, <img src="/static/grapes.png" />]
# can unpack the list in an assignment
>>> image1, image2 = soup.find_all("img")
# access each tag's src attribute with dictionary-style indexing
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
# certain tags, such as <title>, can be accessed as attributes of the BeautifulSoup object
>>> soup.title
<title>Profile: Dionysus</title>
# get the title as a string
>>> soup.title.string
'Profile: Dionysus'
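the pieces above can be combined into a small script; a minimal sketch, assuming the same Dionysus profile page (the script name is made up):

# soup_images.py (sketch)

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
html = urlopen(url).read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# print the title text, then the src attribute of every <img> tag
print(soup.title.string)
for img in soup.find_all("img"):
    print(img["src"])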
Interact With HTML Forms
sometimes scraping HTML requires interacting with forms
MechanicalSoup provides what's known as a headless browser
a web browser with no graphical user interface
the browser is controlled programmatically from a Python program

Install MechanicalSoup
python -m pip install MechanicalSoup
Create a Browser Object
>>> import mechanicalsoup
>>> browser = mechanicalsoup.Browser()
# get a web page
>>> url = "http://olympus.realpython.org/login"
>>> page = browser.get(url)
# check the response code
>>> page
<Response [200]>
# page.soup returns a BeautifulSoup object
>>> type(page.soup)
<class 'bs4.BeautifulSoup'>
# view the HTML
>>> page.soup
<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br /><br />
<h2>Please log in to access Mount Olympus:</h2>
<br /><br />
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
</center>
</body>
</html>
Submit a Form With MechanicalSoup
the key part of the HTML above is the login form
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
logging in takes three steps
import mechanicalsoup

# 1
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup
# 2
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"
# 3
profiles_page = browser.submit(form, login_page.url)
  1. create a Browser instance
    use it to request the URL http://olympus.realpython.org/login
    use the .soup property to assign the HTML content of the page to the login_html variable
  2. login_html.select("form") returns a list of all <form> elements on the page
    the page has only one <form> element
    can access the form by retrieving the element at index 0 of the list
    the next two lines select the username and password inputs and set their values (a by-name alternative is sketched after this list)
  3. submit the form with browser.submit()
    pass two arguments to the method
    • the form object
    • the URL of the login_page
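selecting the inputs by position breaks if the form ever changes; a sketch of the same step using CSS attribute selectors to pick the inputs by name instead (reuses browser, login_html, and login_page from the listing above):

# select the username and password inputs by their name attributes
form = login_html.select("form")[0]
form.select('input[name="user"]')[0]["value"] = "zeus"
form.select('input[name="pwd"]')[0]["value"] = "ThunderDude"

profiles_page = browser.submit(form, login_page.url)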
successful submission redirects to the /profiles page
use .select("a") to get all links (<a> elements) on the page
links = profiles_page.soup.select("a")
iterate and print the list
for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")
relative link addresses can be prefixed with the base URL to build full URLs
base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}")
Interact With Websites in Real Time
# mech_soup.py

import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)