In hopes of pursuing a career in Public Policy, I’ve spent the past year reorienting my programming towards Data Science. Naturally, doing so required significant practice finding, reading, visualizing and analyzing data. This experience has impressed upon me the importance of having a framework in mind when tackling a big data project.
To demonstrate this, I will use my most recent data science project, which you can read more about here. Hopefully, the steps suggested below help you create a framework for your own projects.
Step One: Familiarize yourself with your area of interest.
Context is key when analyzing any dataset, and unfortunately, many otherwise impressive works have been ruined by the author’s lack of familiarity with the subject matter. This doesn’t mean that you have to know every surrounding detail. Working with large datasets would be impossible if that were the case. However, if you’re going to write about wage growth, you’ll need to know the difference between real and nominal wages. If you want to analyze migration patterns to the United States, you should probably read about the 1965 Immigration Act.
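To make the wage example concrete, here is a minimal sketch of the real versus nominal distinction. The wage and price index figures are hypothetical and chosen only so the arithmetic is easy to follow; the function name `real_wage` is my own, not part of any library.

```python
def real_wage(nominal_wage, cpi_current, cpi_base):
    """Express a nominal wage in base-period dollars using a price index."""
    return nominal_wage * cpi_base / cpi_current

# Hypothetical figures, for illustration only.
nominal_then, nominal_now = 20.0, 25.0   # hourly wage in each period
cpi_then, cpi_now = 100.0, 125.0         # price index in each period

# Growth measured in the dollars of each period (ignores inflation).
nominal_growth = nominal_now / nominal_then - 1

# Growth after converting the later wage into earlier-period dollars.
real_now = real_wage(nominal_now, cpi_now, cpi_then)
real_growth = real_now / nominal_then - 1

print(f"nominal growth: {nominal_growth:.0%}, real growth: {real_growth:.0%}")
```

With these made-up numbers the paycheck grew by a quarter, but prices grew just as fast, so purchasing power did not change. An analysis that reported only the nominal figure would tell the wrong story.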
A key part of my piece was learning about the history of American colleges—where they initially formed, the circumstances that caused college enrollment to accelerate in the 20th century, and how students with degrees compare to students without degrees. Equally important was familiarizing myself with current research regarding intergenerational mobility. Without the background knowledge I gained from doing this, I would have struggled mightily in finding a dataset suited to answering my question.
You will inevitably make mistakes, and that’s why peer review has its own step. But some background reading will save you a good bit of time and prevent a serious error from getting through. If you’re lucky, step two will be completed in the midst of step one.
Step Two: Determine your question
What do you want to know?
Unfortunately, this step doesn’t end with simple curiosity. When deciding what your question should be, you need to consider a number of complementary ones.
- How likely is it that the information you are looking for exists?
- Has this question been answered before? If so, what is distinct about your approach?
- Are you comfortable enough with the math and/or logic necessary to answer this question?
- Is this question interesting enough to you to sustain your frustration when you inevitably encounter difficulty?
My research went a long way toward explaining the difference between collegiate and non-collegiate incomes, while also exploring the divergent prospects of high-education and low-education counties. What was missing was the difference in outcomes between individual colleges. This inspired me to explore mobility from the standpoint of the colleges themselves, which led me to Raj Chetty’s “Mobility Report Cards”.
Step Three: Choose your dataset
The length of this step can vary drastically. Sometimes, what you need can be found through the Census Bureau or the Bureau of Labor Statistics. These are accurate, trustworthy sources that, barring a few edge cases, will give you a fairly comprehensive picture of what you’re trying to discover. Needless to say, it’s not always that easy. It can take hours, if not days, to find the dataset you need.
As mentioned earlier, my background research allowed me to quickly identify the right dataset for analyzing college mobility rates. Chetty’s “Mobility Report Cards” had substantial information on school selectivity and income across multiple decades. The underlying data was provided by the Department of Education and the Internal Revenue Service, and it was clearly documented and laid out.
This documentation made the whole process far easier. I couldn’t have asked for a better dataset. Unfortunately, expecting this for every project is unrealistic, and doing so is likely to hurt your motivation. It can take a long time to find the right datasets, and sometimes you’re only able to find data that is auxiliary to your main question. In these situations, my advice would be twofold:
- Reach out to others and see if someone more familiar with the subject knows where to locate the correct data.
- Consider modifying your question according to what you have and repeating step two.
This can be a long, painful process. While plowing through the mud is oftentimes necessary, don’t ever be afraid to adjust your parameters based on what you have.
Step Four: Familiarize yourself with the dataset
You will likely be creating several visualizations from this data and cross-checking it throughout the project. Before you start, take an hour or so to explore it. Create simple graphs, find the maxima and minima, and visualize the distribution. Most importantly, see if you can isolate trends within your data.
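A first pass like this doesn’t need much tooling. The sketch below uses only Python’s standard library to pull out the extremes, the quartiles, and a rough text histogram. The income values are hypothetical, and `summarize` and `text_histogram` are helper names I made up for illustration; in a real project you would more likely reach for a dataframe and plotting library.

```python
import statistics
from collections import Counter

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "q1": q1,
        "median": median,
        "q3": q3,
    }

def text_histogram(values, bin_width):
    """Map each bin's lower edge to the number of values that fall in it."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical median incomes in thousands of dollars, for illustration only.
incomes = [28, 31, 35, 35, 42, 47, 51, 58, 64, 90]
BIN_WIDTH = 20

stats = summarize(incomes)
histogram = text_histogram(incomes, BIN_WIDTH)

print(stats)
for edge, n in histogram.items():
    # One '#' per observation gives a quick look at the distribution's shape.
    print(f"{edge:>3}-{edge + BIN_WIDTH - 1:<3} {'#' * n}")
```

Even a crude histogram like this shows whether the values cluster, skew, or contain an outlier worth checking against the documentation.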
This step, along with step one, is a great way of preemptively identifying trends in your dataset. The graph above is a scatter plot of college tiers vs the percentage of parents in the top national income quartile. Drawing the regression line showed me what to expect when plotting the differences in top quartile representation by college tier. The same was also true for median income.
This proved immensely valuable, as knowing the trend in the data allowed me to quickly identify when I had made a mistake in creating the graphs I ended up using for the final piece. Simple plotting can be an invaluable tool when preparing yourself for a more thorough analysis.
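The same kind of check can be done numerically. Here is a minimal sketch of an ordinary least-squares line fit in plain Python. The tier numbers and percentages are invented for illustration and are not taken from the Mobility Report Cards data, and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: tier 1 is the most selective group of colleges.
tiers = [1, 2, 3, 4, 5]
top_quartile_share = [70, 60, 50, 40, 30]  # percent of parents in top quartile

slope, intercept = fit_line(tiers, top_quartile_share)
trend = "decreasing" if slope < 0 else "increasing"

print(f"slope={slope:.1f}, intercept={intercept:.1f}, trend={trend}")
```

Knowing the sign and rough size of the slope ahead of time gives you something to compare later charts against. If a final graph shows the opposite direction, the likelier explanation is a mistake in the plotting code than a surprise in the data.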
Learn the subject, find your question, choose your dataset, and familiarize yourself with it. Hopefully, these steps help you create your own work. Analyzing and visualizing data is frustrating, often monotonous work, especially when you’re still learning the ropes as I am. However, the power that comes from being able to make unique observations and contribute to the historical conversation is well worth the effort.
So go on and give it your best shot.