Scrape and Parse Text From Websites |
Build Your First Web Scraper
urllib contains tools for working with URLs. The urllib.request module contains a function named urlopen() that you can use to open a URL within a program.

>>> from urllib.request import urlopen
>>> url = "http://olympus.realpython.org/profiles/aphrodite"
>>> page = urlopen(url)  # returns an HTTPResponse object
>>> page
<http.client.HTTPResponse object at 0x105fef820>
>>> html_bytes = page.read()  # read the raw bytes
>>> html = html_bytes.decode("utf-8")  # decode the bytes to text

Extract Text From HTML With String Methods
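The read-then-decode step can be exercised without a network call; a minimal sketch using made-up bytes in place of the HTTPResponse:

```python
# Stand-in for the raw bytes that HTTPResponse.read() would return.
html_bytes = b"<html><head><title>Profile: Aphrodite</title></head></html>"

# .read() yields bytes; .decode("utf-8") converts them into a str.
html = html_bytes.decode("utf-8")
print(type(html).__name__)
print(html)
```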
The find() method returns the index of the first occurrence of a substring. To get the page's title:

>>> title_index = html.find("<title>")
>>> title_index
14
>>> start_index = title_index + len("<title>")
>>> start_index
21
>>> end_index = html.find("</title>")
>>> end_index
39
>>> title = html[start_index:end_index]
>>> title
'Profile: Aphrodite'

This relies on the HTML tags being syntactically correct: if the opening tag were written as <Title >, then find("<title>") would return -1.

Get to Know Regular Expressions
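The steps above can be wrapped in a small helper (the function name is made up for illustration); it returns None instead of slicing with -1 when a tag is missing:

```python
def extract_title(html):
    """Pull the text between <title> and </title> using string methods.

    Returns None if either tag is missing, since find() returns -1 then.
    """
    open_tag, close_tag = "<title>", "</title>"
    title_index = html.find(open_tag)
    end_index = html.find(close_tag)
    if title_index == -1 or end_index == -1:
        return None
    start_index = title_index + len(open_tag)
    return html[start_index:end_index]

print(extract_title("<head><title>Profile: Aphrodite</title></head>"))
# Profile: Aphrodite
print(extract_title("<head><Title >Profile: Aphrodite</title></head>"))
# None
```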
Regular expressions are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library's re module.

The regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches; the string "ac" matches this pattern:

>>> import re
>>> # re.findall(<expression to match>, <string to search>)
>>> re.findall("ab*c", "ac")
['ac']

The same pattern with different strings:

>>> re.findall("ab*c", "abcd")
['abc']
>>> re.findall("ab*c", "acc")
['ac']
>>> re.findall("ab*c", "abcac")
['abc', 'ac']
>>> re.findall("ab*c", "abdc")  # d breaks the pattern, so an empty list is returned
[]

Pattern matching is case sensitive; pass re.IGNORECASE as an argument to ignore case:

>>> re.findall("ab*c", "ABC")
[]
>>> re.findall("ab*c", "ABC", re.IGNORECASE)
['ABC']

A period (.) stands for any single character in a regular expression. For example, you can find all the substrings that contain the letters "a" and "c" separated by a single character:

>>> re.findall("a.c", "abc")
['abc']
>>> re.findall("a.c", "abbc")
[]
>>> re.findall("a.c", "ac")
[]
>>> re.findall("a.c", "acc")
['acc']

re.search() searches for a particular pattern inside a string. It's somewhat more complicated than re.findall(): it returns a Match object that stores different groups of data, because there might be matches inside other matches. Calling .group() on the Match object returns the first and most inclusive result:

>>> match_results = re.search("ab*c", "ABC", re.IGNORECASE)
>>> match_results.group()
'ABC'

re.sub(), short for "substitute", replaces the text in a string that matches a regular expression with new text. Python's regexes are greedy by default: they try to find the longest possible match. Adding ? after a quantifier (as in .*?) makes it non-greedy, so it matches the shortest possible string instead.

Extract Text From HTML With Regular Expressions
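Greedy versus non-greedy matching matters when stripping HTML tags; a short re.sub() sketch (the example string is made up for illustration):

```python
import re

string = "Everything is <replaced> if it's in <tags>."

# Greedy: <.*> grabs the longest match, from the first < to the last >.
print(re.sub("<.*>", "ELEPHANTS", string))
# Everything is ELEPHANTS.

# Non-greedy: <.*?> matches the shortest possible string, one tag at a time.
print(re.sub("<.*?>", "ELEPHANTS", string))
# Everything is ELEPHANTS if it's in ELEPHANTS.
```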
# regex_soup.py
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "
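A complete sketch of the same idea, assuming a non-greedy pattern for the <title> tag; the pattern and the inline HTML (used here so the example runs offline) are illustrative:

```python
import re

# Inline HTML standing in for the downloaded profile page.
html = "<html><head><TITLE >Profile: Dionysus</title  ></head></html>"

# Non-greedy pattern: match an opening <title ...> tag, its contents, and
# the closing tag, tolerating odd casing and stray whitespace in the tags.
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()

# Remove the remaining HTML tags from the matched text.
title = re.sub("<.*?>", "", title)
print(title)
# Profile: Dionysus
```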
|
Use an HTML Parser for Web Scraping in Python |
Install Beautiful Soup
python -m pip install beautifulsoup4

Create a BeautifulSoup Object
# beauty_soup.py
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

soup = BeautifulSoup(html, "html.parser")

The second argument to the BeautifulSoup constructor, "html.parser", is Python's built-in HTML parser. The program does three things: it opens the URL with urlopen(), reads the HTML and decodes it into a string, and parses that string into a BeautifulSoup object.
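The same parsing works on any HTML string, not just a downloaded page; an offline sketch with a made-up snippet of the profile page:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the downloaded page.
html = """<html><head><title>Profile: Dionysus</title></head>
<body><img src="/static/dionysus.jpg"/><img src="/static/grapes.png"/></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Tag properties and searches work the same as on a live page.
print(soup.title.string)
print([img["src"] for img in soup.find_all("img")])
```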
Use a BeautifulSoup Object
.find_all() returns a list of all image tags
>>> imgs = soup.find_all("img")
>>> imgs
[<img src="/static/dionysus.jpg" />, <img src="/static/grapes.png" />]
>>> # can unpack the list in a single assignment
>>> image1, image2 = soup.find_all("img")
>>> # get each tag's src attribute
>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'
>>> # some tags in HTML documents can be accessed as properties of the Tag object
>>> soup.title
<title>Profile: Dionysus</title>
>>> # get the title as a string
>>> soup.title.string
'Profile: Dionysus' |
Interact With HTML Forms |
Sometimes scraping HTML requires interacting with forms. MechanicalSoup installs what's known as a headless browser: a web browser with no graphical user interface, controlled programmatically from a Python program.

Install MechanicalSoup
python -m pip install MechanicalSoup

Create a Browser Object
>>> import mechanicalsoup
>>> browser = mechanicalsoup.Browser()
>>> # get a web page
>>> url = "http://olympus.realpython.org/login"
>>> page = browser.get(url)
>>> page  # check the response code
<Response [200]>
>>> # page.soup returns a BeautifulSoup object
>>> type(page.soup)
<class 'bs4.BeautifulSoup'>
>>> # view the HTML
>>> page.soup
<html>
<head>
<title>Log In</title>
</head>
<body bgcolor="yellow">
<center>
<br /><br />
<h2>Please log in to access Mount Olympus:</h2>
<br /><br />
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>
</center>
</body>
</html>

Submit a Form With MechanicalSoup
the key part of the code above is
<form action="/login" method="post" name="login">
Username: <input name="user" type="text" /><br />
Password: <input name="pwd" type="password" /><br /><br />
<input type="submit" value="Submit" />
</form>

There are three steps to log in:

import mechanicalsoup

# 1. Create a browser object and fetch the login page
browser = mechanicalsoup.Browser()
url = "http://olympus.realpython.org/login"
login_page = browser.get(url)
login_html = login_page.soup

# 2. Select the form and fill in the username and password fields
form = login_html.select("form")[0]
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

# 3. Submit the form
profiles_page = browser.submit(form, login_page.url)
Use .select() to get all links on the page:

links = profiles_page.soup.select("a")

Iterate over the list and print each link:

for link in links:
    address = link["href"]
    text = link.text
    print(f"{text}: {address}")

The href values are relative, so prefix them with the base URL:

base_url = "http://olympus.realpython.org"
for link in links:
    address = base_url + link["href"]
    text = link.text
    print(f"{text}: {address}") |
Interact With Websites in Real Time |
# mech_soup.py
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10) |