Not All Data Is Created Equal

High-quality and relevant data can be a powerful force for good, but flawed data only perpetuates inequalities under the guise of fairness.

At its best, data science can impact global societies in incredible ways. It can work to enhance ocean health, identify and deliver food surpluses to feed the hungry, and use cellphone data to standardize public transportation routes in developing areas like Nairobi.

Data scientists in both the public and private sectors must understand the underlying opportunities to use data in new applications, address potential ethical and bias risks, and weigh the need for data regulation.

Before algorithms can be used appropriately, it’s necessary to access good data sources and evaluate the quality of all available data. According to Vinod Bakthavachalam, a senior data scientist at Coursera, critical questions to ask before using a data set in any application include: Is there measurement error? Do I understand how the data was captured? Are there weird outliers or other abnormal numbers?

“Even if the data on its own is good, there’s always a chance it may be unusable if it’s not right for a specific purpose,” he says.

For example, you may have high-quality data on a consumer’s willingness to spend over $100 on shoes, but perhaps that data was collected during the holiday season when shoppers traditionally spend more and is thus inapplicable to predicting year-round shopping trends. In other words, it may be the best data in the world, but whether it’s the most relevant data is an entirely different matter.
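One of the questions above, whether a data set contains odd outliers, can be checked mechanically before any modeling starts. The following is a minimal sketch, assuming a simple interquartile-range rule; the function name `iqr_outliers`, the multiplier `k`, and the sample spending figures are illustrative assumptions, not anything described by Bakthavachalam or Coursera.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the middle 50% of the data."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical shoe-spending amounts in dollars; 940 looks like a data-entry error.
spend = [82, 95, 101, 88, 110, 97, 104, 91, 99, 940]
suspicious = iqr_outliers(spend)
print(suspicious)
```

A flagged value is only a prompt to ask how the data was captured. Whether to drop, correct, or keep it still depends on what the data is meant to predict.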

Data scientists must also understand that although algorithms can make a positive difference in society, there is a risk that some algorithms instead further entrench cultural prejudice and bias.

Machine learning algorithms are among the most common data-driven algorithms in daily life. They are frequently used to suggest products for consumers on e-commerce sites, and they’re increasingly applied to decisions such as hiring or lending. Used correctly, such algorithms can remove racial or gender bias by focusing on the characteristics that actually predict success, sidestepping the human tendency to prefer people who are similar to oneself.

However, used incorrectly, these models simply provide a veneer of respectability to an otherwise unethical process. An algorithm that sees bias in its training data will spit out biased conclusions when fed new data, because machine learning algorithms don’t make the best decisions; they make the decisions the humans who “trained” them would have made. For example, if a company has only hired white males in the past and trains its hiring algorithm on that data, the algorithm will perpetuate those hiring practices. Biased data, then, leads to biased results.
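The mechanism is easy to see in a toy example. The sketch below is deliberately simplistic and is not a real hiring system: the "model" only memorizes the past hire rate for each group, and the groups "A" and "B", the `train` and `predict` functions, and the historical records are all made up for illustration.

```python
from collections import defaultdict

def train(history):
    """Learn, for each group, the fraction of past applicants who were hired."""
    counts = defaultdict(lambda: [0, 0])  # group -> [hired, total]
    for group, hired in history:
        counts[group][0] += int(hired)
        counts[group][1] += 1
    return {group: hired / total for group, (hired, total) in counts.items()}

def predict(model, group, threshold=0.5):
    """Recommend hiring only if the group's historical hire rate meets the threshold."""
    return model.get(group, 0.0) >= threshold

# Hypothetical history: group A was usually hired, group B never was.
history = [("A", True)] * 8 + [("A", False)] * 2 + [("B", False)] * 10
model = train(history)

print(predict(model, "A"))  # the model repeats the past preference for group A
print(predict(model, "B"))  # and repeats the past rejection of group B
```

Nothing in the code evaluates whether an applicant would succeed. It reproduces whatever pattern the historical decisions contain, which is the sense in which biased data leads to biased results.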

To avoid such biases, Coursera deliberately chose to ignore gender when training its machine learning algorithms to recommend classes to potential students.

“In the U.S., women are less likely to enroll in STEM classes, so if we used gender, it wouldn’t recommend certain courses to women,” Bakthavachalam says. “We want to encourage women to enroll in STEM classes and avoid any biases in the algorithms.”
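In code, leaving an attribute out of training can be as simple as dropping it from each record before the features reach the model. This is only a sketch of the general idea, not Coursera's implementation; the field names, the `SENSITIVE_FIELDS` set, and the `strip_sensitive` helper are hypothetical.

```python
# Attributes the model should never see (an assumed, illustrative list).
SENSITIVE_FIELDS = frozenset({"gender"})

def strip_sensitive(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of the record without the sensitive fields."""
    return {key: value for key, value in record.items() if key not in sensitive}

# Hypothetical learner profile.
learner = {"gender": "F", "completed_courses": 4, "interest": "statistics"}
features = strip_sensitive(learner)
print(features)
```

The original record is left unchanged, so the excluded attribute remains available for auditing the recommendations afterward even though the model never trains on it.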

Coursera’s experience underscores that although there is no silver bullet for avoiding algorithmic bias, it is not too complicated a problem to fix. In fact, it’s more a matter of awareness than a difficult engineering problem, and it begins with the knowledge that artificial intelligence is by no means perfect. According to Bakthavachalam, data scientists must avoid treating machine learning algorithms as black boxes because “if you don’t know what’s going on under the hood, it’s hard to imagine and diagnose issues.”

Data scientists must also be vigilant in their initial examination of training data, a process that calls for a diverse team and, in some situations, outside reviewers. The biggest risk, according to Bakthavachalam, is that data scientists recognize the potential for data misuse but don’t put in the work needed to address it.

“Everyone has different value systems, and being open and upfront about the algorithm can lead collectively to the right decision,” says Bakthavachalam.

On a positive note, data science makes it easier to eliminate bias by quantifying prejudices and highlighting trends that may otherwise go unnoticed. This allows data scientists to remove bias by analyzing only legitimately relevant information, thereby empowering companies to provide services to previously underserved populations, especially in financial services.
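Quantifying a disparity can start with something as basic as comparing outcome rates across groups. The sketch below assumes a list of (group, approved) loan decisions; the group labels, the counts, and the `approval_rates` and `rate_ratio` helpers are invented for illustration and do not describe any company's data.

```python
def approval_rates(decisions):
    """Compute the share of approved applications for each group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(ok)
    return {group: approved[group] / totals[group] for group in totals}

def rate_ratio(rates):
    """Ratio of the lowest to the highest approval rate (1.0 means equal rates)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0

# Hypothetical decisions: 30 of 40 urban and 12 of 40 rural applicants approved.
decisions = (
    [("urban", True)] * 30 + [("urban", False)] * 10
    + [("rural", True)] * 12 + [("rural", False)] * 28
)
rates = approval_rates(decisions)
print(rates, round(rate_ratio(rates), 2))
```

A low ratio does not prove unfair treatment on its own, but it puts a number on a trend that might otherwise go unnoticed and tells the team where to look more closely.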

An example is MyBucks, a fintech company powered by a machine learning-enabled credit-scoring engine that serves the underbanked in 11 African nations. By aggregating large amounts of data, MyBucks gains greater insight into which individuals are likely to default, allowing it to move beyond a reliance on more simplistic predictors like credit score.

In Kenya, for instance, data is pulled solely from an individual’s phone, and loans are paid directly into mobile wallets within minutes.

This service is especially important in nations where schools require full tuition payment upfront, historically a significant barrier to pursuing an education in some poorer countries.

Above all, data scientists must avoid getting lost in the techniques and methods of their trade. They must ask who will be affected by their work and how they can ensure that doing “good” for one group doesn’t inadvertently harm another.

It’s through transparency about how data is collected, how it’s defined, and its limitations that analysts working together can get the most impactful results. Machines can learn, but it’s the human insights and supervision that enable organizations to balance power and fairness.