Use the attached document to complete the Web Scraping exercise. All work must be documented with the required full-screen snapshots.
An Introduction to Web Scraping using Python
Table of Contents
· Scrape and Parse Text From Websites
· Your First Web Scraper
· Extract Text From HTML With String Methods
· A Primer on Regular Expressions
· Extract Text From HTML With Regular Expressions
· Check Your Understanding
· Use an HTML Parser for Web Scraping in Python
· Install Beautiful Soup
· Create a BeautifulSoup Object
· Use a BeautifulSoup Object
Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information—and misinformation—on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.
In this tutorial, you’ll learn how to:
· Parse website data using string methods and regular expressions
· Parse website data using an HTML parser
· Interact with forms and other website components
Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:
1. The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
2. Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
Important: Before using your Python skills for web scraping, you should always check your target website’s acceptable use policy to see if accessing the website with automated tools is a violation of its terms of use. Legally, web scraping against the wishes of a website is very much a gray area.
Please be aware that the following techniques may be illegal when used on websites that prohibit web scraping.
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.
Your First Web Scraper
One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the urllib.request module contains a function called urlopen() that can be used to open a URL within a program.
In IDLE’s interactive window, type the following to import urlopen():
>>>
>>> from urllib.request import urlopen
The web page that we’ll open is at the following URL:
>>>
>>> url = "http://olympus.realpython.org/profiles/aphrodite"
To open the web page, pass url to urlopen():
>>>
>>> page = urlopen(url)
urlopen() returns an HTTPResponse object:
>>>
>>> page
<http.client.HTTPResponse object at 0x105fef820>
To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:
>>>
>>> html_bytes = page.read()
>>> html = html_bytes.decode("utf-8")
Now you can print the HTML to see the contents of the web page:
>>>
>>> print(html)
(Insert full-screen snapshot here)
Once you have the HTML as text, you can extract information from it in a couple of different ways.
Extract Text From HTML With String Methods
One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.
Let’s extract the title of the web page you requested in the previous example. If you know the index of the first character of the title and the first character of the closing </title> tag, then you can use a string slice to extract the title.
Since .find() returns the index of the first occurrence of a substring, you can get the index of the opening <title> tag by passing the string "<title>" to .find():
>>>
>>> title_index = html.find("<title>")
>>> title_index
[Answer Here]
You don’t want the index of the <title> tag, though. You want the index of the title itself. To get the index of the first letter in the title, you can add the length of the string "<title>" to title_index:
>>>
>>> start_index = title_index + len("<title>")
>>> start_index
[Answer Here]
Now get the index of the closing </title> tag by passing the string "</title>" to .find():
>>>
>>> end_index = html.find("</title>")
>>> end_index
[Answer Here]
Finally, you can extract the title by slicing the html string:
>>>
>>> title = html[start_index:end_index]
>>> title
[Answer Here]
(Insert full-screen snapshot for the above four code exercises here)
Real-world HTML can be much more complicated and far less predictable than the HTML on the Aphrodite profile page. Here’s another profile page with some messier HTML that you can scrape:
>>>
>>> url = "http://olympus.realpython.org/profiles/poseidon"
Try extracting the title from this new URL using the same method as the previous example:
>>>
>>> url = "http://olympus.realpython.org/profiles/poseidon"
>>> page = urlopen(url)
>>> html = page.read().decode("utf-8")
>>> start_index = html.find("<title>") + len("<title>")
>>> end_index = html.find("</title>")
>>> title = html[start_index:end_index]
>>> title
(Insert full-screen snapshot here)
A Primer on Regular Expressions
Regular expressions—or regexes for short—are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module.
Note: Regular expressions aren’t particular to Python. They’re a general programming concept and can be used with any programming language.
To work with regular expressions, the first thing you need to do is import the re module:
>>>
>>> import re
Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more of whatever comes just before the asterisk.
In the following example, you use findall() to find any text within a string that matches a given regular expression:
>>>
>>> re.findall("ab*c", "ac")
['ac']
The first argument of re.findall() is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern "ab*c" in the string "ac".
The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", and has zero or more instances of "b" between the two. re.findall() returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.
Here’s the same pattern applied to different strings:
>>>
>>> re.findall("ab*c", "abcd")
[Answer Here]
>>> re.findall("ab*c", "acc")
[Answer Here]
>>> re.findall("ab*c", "abcac")
[Answer Here]
>>> re.findall("ab*c", "abdc")
(Insert full-screen snapshot of the above strings here)
Notice that if no match is found, then findall() returns an empty list.
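As a quick, minimal illustration of that behavior (the test string here is made up for the example), you can run the pattern against a string that contains no match:
>>>
>>> re.findall("ab*c", "xyz")
[]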
Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE:
>>>
>>> re.findall("ab*c", "ABC")
[Answer Here]
>>> re.findall("ab*c", "ABC", re.IGNORECASE)
(Insert full-screen snapshot of the above strings here)
You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:
>>>
>>> re.findall("a.c", "abc")
[Answer Here]
>>> re.findall("a.c", "abbc")
[Answer Here]
>>> re.findall("a.c", "ac")
[Answer Here]
>>> re.findall("a.c", "acc")
(Insert full-screen snapshot of the above strings here)
The pattern .* inside a regular expression stands for any character repeated any number of times. For instance, "a.*c" can be used to find every substring that starts with "a" and ends with "c", regardless of which letter—or letters—are in between:
>>>
>>> re.findall("a.*c", "abc")
[Answer Here]
>>> re.findall("a.*c", "abbc")
[Answer Here]
>>> re.findall("a.*c", "ac")
[Answer Here]
>>> re.findall("a.*c", "acc")
[Answer Here]
(Insert full-screen snapshot of the above strings here)
Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because it returns a Match object that stores different groups of data about the match rather than a simple list of strings.
The details of the Match object are irrelevant here. For now, just know that calling .group() on a Match object returns the entire match, which in most cases is just what you want:
>>>
>>> match_results = re.search("ab*c", "ABC", re.IGNORECASE)
>>> match_results.group()
'ABC'
There’s one more function in the re module that’s useful for parsing out text. re.sub(), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text. It behaves sort of like the .replace() string method.
The arguments passed to re.sub() are the regular expression, followed by the replacement text, followed by the string. Here’s an example:
>>>
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*>", "ELEPHANTS", string)
>>> string
(Insert full-screen snapshot of the above strings here)
Perhaps that wasn’t quite what you expected to happen.
re.sub() uses the regular expression "<.*>" to find and replace everything between the first < and last >, which spans from the beginning of <replaced> to the end of <tags>. This is because Python’s regular expressions are greedy, meaning they try to find the longest possible match when characters like * are used.
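To see that greedy behavior in isolation, here's a small sketch using re.findall() on a made-up string containing two tag-like substrings; the single match stretches from the first < to the last >:
>>>
>>> re.findall("<.*>", "<first> middle <second>")
['<first> middle <second>']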
Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:
>>>
>>> string = "Everything is <replaced> if it's in <tags>."
>>> string = re.sub("<.*?>", "ELEPHANTS", string)
>>> string
[Answer Here]
(Insert full-screen snapshot of the above strings here)
This time, re.sub() finds two matches, <replaced> and <tags>, and substitutes the string "ELEPHANTS" for both matches.
Extract Text From HTML With Regular Expressions
Armed with all this knowledge, let’s now try to parse out the title from a new profile page, which includes this rather carelessly written line of HTML:
<TITLE >Profile: Dionysus</title / >
The .find() method would have a difficult time dealing with the inconsistencies here, but with the clever use of regular expressions, you can handle this code quickly and efficiently:
import re
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags
print(title)
Let’s take a closer look at the first regular expression in the pattern string by breaking it down into three parts:
1. <title.*?> matches the opening <TITLE > tag in html. The <title part of the pattern matches with <TITLE because re.search() is called with re.IGNORECASE, and .*?> matches any text after <TITLE up to the first instance of >.
2. .*? non-greedily matches all text after the opening <TITLE >, stopping at the first match for </title.*?>.
3. </title.*?> differs from the first pattern only in its use of the / character, so it matches the closing </title / > tag in html.
The second regular expression, the string "<.*?>", also uses the non-greedy .*? to match all the HTML tags in the title string. By replacing any matches with "", re.sub() removes all the tags and returns only the text.
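If you want to experiment with this pattern without making a network request, here's a minimal sketch that applies the same two regular expressions to the messy tag copied as a plain string:
import re

# The carelessly written tag, copied as a plain string for offline testing
messy = "<TITLE >Profile: Dionysus</title / >"

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, messy, re.IGNORECASE)
title = match_results.group()       # the full '<TITLE >...</title / >' match
title = re.sub("<.*?>", "", title)  # remove the opening and closing tags
print(title)                        # Profile: Dionysus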
Note: Web scraping in Python or any other language can be tedious. No two websites are organized the same way, and HTML is often messy. Moreover, websites change over time. Web scrapers that work today are not guaranteed to work next year—or next week, for that matter!
Regular expressions are a powerful tool when used correctly. This introduction barely scratches the surface. For more about regular expressions and how to use them, check out the two-part series Regular Expressions: Regexes in Python.
Use an HTML Parser for Web Scraping in Python
Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.
Install Beautiful Soup
To install Beautiful Soup, you can run the following in your terminal:
$ python3 -m pip install beautifulsoup4
Run pip show to see the details of the package you just installed:
$ python3 -m pip show beautifulsoup4
Name: beautifulsoup4
Version: 4.9.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: [email protected]
License: MIT
Location: c:\realpython\venv\lib\site-packages
Requires:
Required-by:
In particular, notice that the latest version at the time of writing was 4.9.1.
Create a BeautifulSoup Object
Type the following program into a new editor window:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
This program does three things:
1. Opens the URL http://olympus.realpython.org/profiles/dionysus using urlopen() from the urllib.request module
2. Reads the HTML from the page as a string and assigns it to the html variable
3. Creates a BeautifulSoup object and assigns it to the soup variable
The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.
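If you'd like to see how the constructor behaves before working with a full page, here's a minimal sketch that parses a short, made-up HTML string with the built-in parser:
from bs4 import BeautifulSoup

# A small, made-up HTML snippet used only to illustrate the constructor
snippet = "<html><head><title>Tiny Page</title></head><body><p>Hello</p></body></html>"

tiny_soup = BeautifulSoup(snippet, "html.parser")  # "html.parser" is Python's built-in parser
print(tiny_soup.title.string)                      # Tiny Page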
Use a BeautifulSoup Object
Save and run the above program. When it’s finished running, you can use the soup variable in the interactive window to parse the content of html in various ways.
For example, BeautifulSoup objects have a .get_text() method that can be used to extract all the text from the document and automatically remove any HTML tags.
Type the following code into IDLE’s interactive window:
>>>
>>> print(soup.get_text())
(Insert full-screen snapshot of the above results here)
There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string .replace() method if you need to.
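For example, here's a minimal sketch of one way to do that, assuming the soup variable from the program above:
>>>
>>> text = soup.get_text()
>>> print(text.replace("\n\n", "\n"))  # collapse each pair of newlines into one
A single .replace() call only collapses pairs of newlines, so very long runs of blank lines may need a loop or a regular expression instead.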
Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.
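As a rough sketch of that approach, again assuming the soup variable from the program above, you can extract the text once and then search it with .find():
>>>
>>> text = soup.get_text()
>>> text.find("Dionysus")  # index of the first occurrence, or -1 if the substring isn't found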
However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of <img> HTML tags.
In this case, you can use find_all() to return a list of all instances of that particular tag:
>>>
>>> soup.find_all("img")
[Answer Here]
(Insert full-screen snapshot of the above strings here)
This returns a list of all <img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.
Let’s explore this a little by first unpacking the Tag objects from the list:
>>>
>>> image1, image2 = soup.find_all("img")
Each Tag object has a .name property that returns a string containing the HTML tag type:
>>>
>>> image1.name
'img'
You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.
For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.
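As a minimal, self-contained sketch of that dictionary-style access, you can parse the example link tag from the previous paragraph on its own (with some link text added so the fragment reads naturally):
>>>
>>> from bs4 import BeautifulSoup
>>> link_soup = BeautifulSoup('<a href="https://realpython.com" target="_blank">Real Python</a>', "html.parser")
>>> link = link_soup.a  # the Tag object for the <a> element
>>> link["href"]
'https://realpython.com'
>>> link["target"]
'_blank'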
To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:
>>>
>>> image1["src"]
[Answer Here]
>>> image2["src"]
(Insert full-screen snapshot of the above strings here)
Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:
>>>
>>> soup.title
[Answer Here]
(Insert full-screen snapshot of the above strings here)
If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you’ll notice that the <title> tag as written in the document looks like this:
<title >Profile: Dionysus</title/>
Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.
You can also retrieve just the string between the title tags with the .string property of the Tag object:
>>>
>>> soup.title.string
[Answer Here]
(Insert full-screen snapshot of the above strings here)
One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():
>>>
>>> soup.find_all("img", src="/static/dionysus.jpg")
[Answer Here]
(Insert full-screen snapshot of the above strings here)
This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures.
When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.
Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag you’re interested in and extract the data you need.
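As a hypothetical sketch of that workflow (the tag name and id attribute below are invented for illustration and won't exist on the practice page), targeting a single element might look like this:
>>>
>>> tag = soup.find("div", id="contact-info")  # hypothetical tag; returns None if not found
>>> tag.get_text() if tag is not None else None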
In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.
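For example, once lxml is installed (one way is python3 -m pip install lxml), Beautiful Soup can use it as the underlying parser simply by swapping the second argument; this is just a sketch of the swap, not something this exercise requires:
from bs4 import BeautifulSoup

# html here is the page source string read earlier with urlopen()
soup = BeautifulSoup(html, "lxml")  # requires the lxml package to be installed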
Note: HTML parsers like Beautiful Soup can save you a lot of time and effort when it comes to locating specific data in web pages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup can’t interpret the HTML tags properly.
In this case, you’re often left with using .find() and regular expression techniques to try to parse out the information you need.
BeautifulSoup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then BeautifulSoup alone won’t get you very far.