Codecademy exercises have nicely packaged data prepared for us to use. However, there will be many occasions when we would like to explore a question that does not have a dataset ready for us to download. For example, based on their performance last year, which MLB players should I select in my fantasy baseball draft?
As we search for alternatives online, we may find a website that shares information we’d like to include in our analysis. However, it would be laborious to navigate to dozens of pages and manually copy and paste the information we find into a spreadsheet.
The good news is that Python web scraping libraries like Beautiful Soup can automate the collection of data from websites. Codecademy has a new course introducing you to the basics of web scraping and Beautiful Soup. In this article, we will walk through an example of how to use Beautiful Soup to collect MLB player stats from the 2018 season.
Specifically, we will cover how to:
- Set up Beautiful Soup and Requests
- Connect to the ESPN website with Requests
- Extract MLB player stats with Beautiful Soup
- Populate a Pandas DataFrame with the scraped player stats
At the end of this article, we’ll have our own .csv file containing the batting performance of all 331 players in the league, which will help inform our fantasy baseball picks. Afterwards, you’ll be able to use Beautiful Soup to pull information from anywhere online. Let’s start scraping!
Setting Up Beautiful Soup
In order to prepare our Python environment for web scraping, we need to install two Python packages, Requests and Beautiful Soup.
Requests is a library that allows us to read a website’s source code, and Beautiful Soup is the library that searches and parses this HTML source code based on its tags. We can install both these packages by simply going to our command line interface and executing this command:
pip install requests beautifulsoup4
If you need additional assistance installing these packages, feel free to consult Requests and Beautiful Soup’s own guides for installation.
Then, import these libraries for use in your Python development environment of choice (Jupyter Notebooks, Spyder, Atom, IDLE, you name it) with two import statements:
import requests
from bs4 import BeautifulSoup
While we’re at it, let’s import two other libraries that will come in handy as we write our program: Pandas and Regular Expressions (re). Now our Python environment should be ready for web scraping!
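Putting all four imports together, the top of our script might look like this (the pd alias for Pandas is a common convention, not a requirement):

import requests
from bs4 import BeautifulSoup
import pandas as pd   # for organizing and exporting the scraped stats
import re             # for matching class attributes with regular expressions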
As we build our own DataFrame with information we scrape from the ESPN page, we will introduce the Requests and Beautiful Soup methods that are essential for web scraping. Namely, we will go over the Requests library’s .get() method, and Beautiful Soup’s .find_all(), .find(), and .get_text() methods.
Connect to the Website
We can connect to a website by passing a string of the URL we would like to visit into requests.get(). So, to connect to the MLB statistics page, we would execute this code:

page = requests.get('http://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2018/start/1')

Then, we can look at the source code behind this website through the response’s .text property: page.text.
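Before parsing, it can be worth a quick sanity check that the request succeeded; this short snippet (our own addition, not a required step) prints the status code and previews the raw HTML:

print(page.status_code)   # 200 means the request succeeded
print(page.text[:500])    # the first 500 characters of the HTML source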
Next, we are going to have Beautiful Soup parse the source code, page.text, as HTML by passing the argument 'html.parser' and saving the result to a variable, soup. You can learn more about the different parsers Beautiful Soup supports in its documentation, but the most common one we will encounter is the HTML parser.
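In code, that parsing step is one line (the variable name soup matches what the rest of this article uses):

soup = BeautifulSoup(page.text, 'html.parser')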
Now that we have the source code, we can begin to locate the HTML elements we want to scrape.
Useful Beautiful Soup Methods
Next, we’re going to use Beautiful Soup’s .find_all() and .find() methods to extract information from the source code. The .find_all() method is essential for locating the elements that contain the players’ statistics, like HRs (Home Runs) or RBIs (Runs Batted In).

If you’ve gone through our Beautiful Soup lesson, you’ll recall that .find_all() works by searching the HTML source code for the tags and attributes we specify.

First, .find_all() requires that we state the tag of the element we want to scrape. Next, we specify attributes in .find_all()’s attrs argument, which is constructed like so:
attrs = {'attribute1_name': 'attribute1_value', 'attribute2_name': 'attribute2_value'}
When you put it all together, here is an example of us extracting just the first player’s row from the table:
soup.find_all('tr', attrs = {'class': 'oddrow player-10-33039'})
Please note that .find_all() always returns a list [ ]. This is helpful when we want to collect multiple elements that share the same tag or attributes.
However, if we just want to focus on a single element, we can use the .find() method, which returns only the first element that meets our search criteria. The arguments for .find() are constructed in exactly the same way as .find_all(): we specify the tags and/or attributes of the elements we want to scrape.

For instance, it is preferable to use .find() when we scrape the table header that lists the column names (Player, HR, RBI, etc.). ESPN designed the table with this header repeated every 10 players, so it appears five times on the current page. However, we only need to collect this table header once.
Note the difference in output between a .find_all() and a .find() constructed with the same parameters: .find_all('tr', attrs={'class': 'colhead'}) returns a list of all five table headers that appear in the table, while .find('tr', attrs={'class': 'colhead'}) returns just one table header element listing the column names from PLAYER to BA (Batting Average).
Now that we know how to narrow in on the element(s) we desire with .find_all() and .find(), it would be helpful to remove the HTML notation and keep only the plain text that contains the players’ statistics. For instance, in Mookie Betts’ <tr> element, we can begin to make out some of his statistics. Specifically, we know that the last <td> tag contains .346, his batting average. Ultimately, we want to keep just the text that reports Betts’ batting average and remove all the HTML notation (the start and end <td> tags).

Fortunately, we can select just the text with Beautiful Soup’s .get_text() method, which we call on a Beautiful Soup element. This method extracts all the text that falls outside of the tag notation (<>). Here is what it looks like when we extract the text from each <td> element in Mookie Betts’ table row:
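A minimal sketch of that step, reusing the class value from the .find_all() example above (the variable names here are our own):

betts_row = soup.find('tr', attrs={'class': 'oddrow player-10-33039'})

# Pull the plain text out of every <td> cell in Betts' row
betts_stats = [td.get_text() for td in betts_row.find_all('td')]
print(betts_stats)   # the last entry is his batting average, '.346'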
Finally, we are ready to begin web scraping on a larger scale! With the knowledge of how to connect to websites with requests.get(), how to narrow in on HTML elements with .find_all() and .find(), and how to extract readable text from HTML elements with .get_text(), we can begin building a more involved web scraper that repeats this process for every player in the league.
Populating Our DataFrame
Now that we have all the tools to scrape all the players’ names and stats from the HTML source code, let’s create a final DataFrame using Pandas to organize and save this information. At the end, we will have a DataFrame with the batting stats for all 331 players that we’ll export to a .csv file for analysis before our fantasy draft.
We want the column names for this final DataFrame to match the column names used in the table on ESPN, so let’s scrape the text from the table header. When we look at the source code on ESPN, we see that the table header row’s start tag is <tr class="colhead">. Since we only want to scrape a single element (and not the header repeated every 10 rows in the table), let’s use the .find() method.
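That lookup might look like this (header is our own variable name):

header = soup.find('tr', attrs={'class': 'colhead'})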
Once we have the table header, we want to extract the text from each column, which is stored in table data (<td>) elements. We will use a list comprehension with Beautiful Soup’s .get_text() method to extract the text from each <td> element and save it to a variable, columns. We will then create an empty DataFrame called final_df and set its columns equal to the values we just scraped from the table header. As we scrape the players’ stats, we will populate final_df with each player’s performance.
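A sketch of those two steps, building on the header variable from above:

# Extract the text from each <td> cell in the header row
columns = [col.get_text() for col in header.find_all('td')]

# Create an empty DataFrame with those column names
final_df = pd.DataFrame(columns=columns)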
Now it’s time to scrape each baseball player’s stats. As we noticed with Mookie Betts’ row in the source code, each player’s data is stored in a table row (<tr>) element with a class attribute. Fortunately, as we look at multiple players’ rows, we notice a pattern in the value of the class attribute: for instance, the rows for the Atlanta Braves’ Nick Markakis and for Manny Machado each have a class value that begins with oddrow or evenrow, followed by player-10- and then some identifying digits.

Knowing this, we can scrape all the player elements by using the re module’s compile() function to pick up every element whose class value contains the string row player-10-:
players = soup.find_all('tr', attrs={'class':re.compile('row player-10-')})
Now that we have all the HTML elements that contain the players’ data on this page, we want to extract the stats from each player’s row and concatenate them to the end of the final_df DataFrame.

To accomplish this, we will loop through every player’s element and extract the text from its <td> tags using a list comprehension and Beautiful Soup’s .get_text() method, much like we did with the table header. Then, we will create a temporary DataFrame, temp_df, that stores an individual player’s stats under the same column names we extracted from the table header. We can concatenate temp_df to the end of final_df with pd.concat(), merging each successive player’s stats with the stats we have already collected.
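One way to write that loop, assuming each player row has one <td> per header column:

for player in players:
    # Extract the text from each <td> cell in this player's row
    stats = [stat.get_text() for stat in player.find_all('td')]

    # Store this player's stats as a one-row DataFrame with the header's column names
    temp_df = pd.DataFrame(stats).transpose()
    temp_df.columns = columns

    # Append this player's row to everything collected so far
    final_df = pd.concat([final_df, temp_df], ignore_index=True)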
Now we have all the player’s stats on this page, but we’re not quite finished. Each page on the ESPN site only lists 50 players, but there were 331 batters in the league during the 2018 season. In order to collect data on all 331 players, we need to scrape data from all 7 of the URLs that contain player data.
We notice a pattern in the URL address: if we just change the last number after /start/, our table will begin at that rank and list the next 50 players by batting average.

Therefore, we can recreate all the URLs we want to visit with a for loop that iterates through a range from 1 to 331 in increments of 50 and plugs each value into the base URL address. We will then repeat the scraping of player data for each of these URLs and concatenate the results to the end of our final_df.
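Putting the pieces together, the outer loop might look like the sketch below; the stop value of 332 is our choice so that range() includes 301, the start of the final page:

for start in range(1, 332, 50):   # 1, 51, 101, ..., 301 -- the 7 pages of players
    url = 'http://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2018/start/' + str(start)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Grab every player row on this page and append it to final_df
    players = soup.find_all('tr', attrs={'class': re.compile('row player-10-')})
    for player in players:
        stats = [stat.get_text() for stat in player.find_all('td')]
        temp_df = pd.DataFrame(stats).transpose()
        temp_df.columns = columns
        final_df = pd.concat([final_df, temp_df], ignore_index=True)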
We now have the batting statistics for all 331 MLB players in our DataFrame final_df! For our final step, let’s export this DataFrame to a .csv file for us to analyze in preparation for our fantasy baseball draft.
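The export is a one-liner; the filename here is just a suggestion:

final_df.to_csv('mlb_player_stats_2018.csv', index=False)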
Conclusion
Like we did with the baseball players’ stats on ESPN, you can go forth and collect information from websites for your own analysis using Python’s Beautiful Soup. You have the tools to navigate the messy source code used to create web pages and extract the information that interests you.

Web scraping will certainly save you time and energy when you want to collect data from multiple websites with a similar structure, or when you pull data from the same webpage on a daily cadence. Now, instead of visiting all of these webpages individually or visiting the same webpage each day, all you have to do is run your Python script written with Beautiful Soup.
To learn more about what you can do with Beautiful Soup, the best place to begin is Codecademy’s “Web Scraping with Beautiful Soup” course. Codecademy’s lesson can expand your Beautiful Soup expertise by reviewing how to navigate the HTML structure with the .parent and .children properties, or how to select HTML elements based on their CSS selectors.
If you’re feeling overwhelmed or rusty navigating the source code of the webpages you’re trying to scrape, we recommend that you brush up on your HTML knowledge. Then, you’ll be scraping in no time.