Most of the time, high-level decision-makers require aggregated data. For example, to understand sales trends, business analysts need to aggregate individual sales transactions by month, quarter, or fiscal year. Data aggregation is a key skill that can drive value for many organizations.
Pokémon is a video game where creatures (known as Pokémon) of different types battle each other for glory. To commemorate the release of the newest Pokémon games for the Nintendo Switch (Let’s Go Pikachu and Let’s Go Eevee), we will aggregate and analyze Pokémon data in order to answer the following questions:
- How many Pokémon of each type are there?
- Which Pokémon type is the most powerful?
First, we will complete our analysis using spreadsheets because spreadsheets are the most widely used tool for data analysis. However, Python programming provides more flexible and more scalable analysis options than spreadsheets, so we will complete the analysis using Python and the Pandas library.
Starting with spreadsheets
Let’s start with a dataset of all Pokémon from Pokémon Database. To follow along, download the data here (right click and select “Save As…”).
We will use Google Sheets, a free spreadsheet application, for our analysis. Regardless of the specific spreadsheet tool you use, the underlying concepts will be the same. Below is a preview of the data.
Each Pokémon belongs to one or two types, and certain types are strong against other types. For example, Charizard is a flying, fire-breathing lizard, so it is both flying
and fire
types, and it is weak against water
types.
Pokémon that belong to two types occupy two rows in our spreadsheet. In addition, every Pokémon has multiple stats to determine how it performs in battle. A description of each stat can be found in the Pokémon Database. For example, Blastoise has a higher defense stat than Charizard, so it will better withstand physical attacks. For our analysis, we will look at the type with the highest number for each stat.
A pivot table is a tool designed specifically to aggregate data, and it will be the easiest way to aggregate Pokémon by type in our spreadsheet. To create a pivot table, select your data, and select the Pivot Table option. Since we are aggregating by Pokémon type, we will add “Type” to rows in the pivot table options.
Right now, our pivot table is blank, and we need to add values to it. Since we want to count the number of unique Pokémon in each type, we would add it to values in our pivot table options.
Since there is no type that is definitively the “strongest”, we will look at the strongest type for each stat. This would be useful for Pokémon players who are building balanced teams with both offensive;y- and defensively-inclined Pokémon. To see which type has the highest median values for each stat, we will add additional options to values in the pivot table options.
At this point, we can see our results in the pivot table:
Here are some interesting observations from this initial analysis:
- There are a lot of
water
-type Pokémon and very fewice
-type Pokémon. Clearly, most Pokémon aren’t living in winter conditions! 🌴 -
Dragon
-type Pokémon seem to be the strongest whilebug
-type Pokémon are the weakest. 🐉 > 🐞 -
Fairy
-type Pokémon have low attack. Guess we don’t have to worry about being attacked by the tooth fairy. 🧚 -
Steel
-type Pokémon have the strongest defense. Does the aluminum industry have something to say about that? 🤔 -
Rocks
are slow. Even theground
andgrass
are faster…🗿
Now to Python
Spreadsheets are great, and we were able to glean some fun insights from our pivot table analysis. However, using Python with the Pandas library is far superior to spreadsheet analysis.
Writing code with Pandas is significantly quicker than interacting with a spreadsheet’s GUI interface (did you see all of those screenshots above?).
To aggregate data with Pandas, you will need to complete the following steps:
- Import the Pandas library
- Upload your data to a Pandas DataFrame
- Complete the aggregation
For our Pokémon analysis, our commented code and output are below:
Unlike the spreadsheet analysis, there are no intermediate steps when aggregating data with Python and Pandas.
Modifying our analysis
The dataset we used contained all Pokémon. However, only a subset of Pokémon are available in each “Let’s Go” game. Download the data with only the subset here (right click and select “Save As…“).
To aggregate this new data with spreadsheets, we would have to repeat all of the manual steps involved in making a pivot table. However, with Python, we only need to modify a single line of code:
When only Pokémon available in “Let’s Go Pikachu” and “Let’s Go Eevee” are included, Dragon
-type Pokémon still have the highest overall stats. However, they do not dominate as much as they did before in each of the stats, and overall, the stats are distributed more evenly across types.
What’s next
High-level decision makers often require analysts to make minor adjustments to view data in a slightly different format. In these cases, Python will save significantly more time when compared to traditional spreadsheet analysis.
I’d encourage you to try analyzing the data yourself. Below are other modifications you can apply to our Pokémon analysis:
- Include only final evolved Pokémon
- Exclude legendary Pokémon that are ineligible for Pokémon competitions
To learn more about data wrangling with Python and Pandas, take a look at Codecademy’s Data Analysis with Pandas course.