Jiawei Han, a professor of computer science at the University of Illinois at Urbana-Champaign, was recently named a Michael Aiken Chair, one of the University’s highest awards. The endowed chair is the latest honor in Han’s distinguished and pioneering career, with notable accomplishments including creating core data mining algorithms and co-authoring the textbook that is considered by many to have defined the field. Professor Han is also a busy and successful teacher with a love for “train[ing] the younger generation, whether at UIUC or all over the world on Coursera.” Professor Han had three PhD students graduate in May, with one becoming a professor at Georgia Tech, one joining Google, and one joining Facebook. Students taking his classes as part of the Online Master of Computer Science in Data Science degree have an opportunity to learn from him through videos and can ask him questions directly during live office hours.
In this conversation, Professor Han shares his perspective on the history and future of data mining, the challenge of the “data explosion” problem, and why he thinks the University of Illinois sets students up for long-term success.
Can you explain what you mean when you talk about the “data explosion” problem?
Originally, people would say they were ‘data poor’ and that they couldn’t get enough data. Now there is lots of data – the new problem is actually extracting knowledge from it.
Whether you’re a journalist, a biologist, an engineer, or in almost any other discipline, there is this ‘data explosion’ problem: you need to turn unstructured data into structured knowledge. That means spending a lot of time figuring out how to structure your unstructured data into networks, and then how to mine that data.
For example, I have a group of students who work on how to handle biomedical literature. With biomedical literature, we can easily get 36 million papers – but to effectively use this huge corpus, you would have to ask experts to label which terms are genes, which are proteins, and which are diseases.
It’s not realistic to ask humans to go through 1,000 papers, reading every sentence and labeling it. So, we take existing dictionaries with lists of genes, diseases, or chemicals as our starting place. Then we take the massive unlabeled corpus and build a network that finds patterns and linkages automatically. Data mining can take over that tedious human work.
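To make the idea concrete, here is a minimal sketch of that dictionary-based starting point. The term lists and the sentence are invented for illustration; a real pipeline would go further, learning patterns from these matches so the machine can recognize entities the dictionaries never listed:

```python
import re

# Hypothetical starter dictionaries; in practice these would come from
# curated resources such as gene and disease lexicons, not a hard-coded list.
DICTIONARIES = {
    "GENE": {"BRCA1", "TP53", "EGFR"},
    "DISEASE": {"breast cancer", "glioblastoma"},
    "CHEMICAL": {"cisplatin", "tamoxifen"},
}

def weak_label(sentence):
    """Tag every dictionary term found in a sentence with its entity type."""
    labels = []
    for entity_type, terms in DICTIONARIES.items():
        for term in terms:
            # Word-boundary match so a term is not found inside a longer word.
            if re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE):
                labels.append((term, entity_type))
    return labels

print(weak_label("Tamoxifen targets tumors with BRCA1 mutations in breast cancer."))
# [('BRCA1', 'GENE'), ('breast cancer', 'DISEASE'), ('tamoxifen', 'CHEMICAL')]
```

Labels produced this cheap, automatic way become the seed from which the mining system discovers new terms and linkages at corpus scale.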
You founded the data mining group several decades ago. What led you to get into this field to start with, and what has your research group accomplished over the years?
It’s a long journey! I started with databases. In the 1980s, when I did my PhD, lots of people were building database systems that let us index, sort, and search data in powerful ways. I talked to my advisor and said I wanted the database to have intelligence, so my PhD thesis was essentially about feeding the database logic and defining rules to make it more intelligent.
Later, I found that if you ask humans to build the rules for the database, it is still a prohibitive burden. You have limited experts, but unlimited data and unlimited problems – you cannot scale up. The best way is to let the data show its patterns by itself: data mining.
I clearly remember the first international Knowledge Discovery in Databases (KDD) workshop in 1989; it was just 20 to 30 people. But I got together with some of my collaborators to write and present a paper on a method to dig rules out of data. After a few years, lots of people found this direction promising, and in 1995 the first international conference on KDD was held. To everybody’s surprise, 500 to 600 people attended!
For the second conference, they elected me co-chair, and I shifted the majority of my research from deductive databases to inductive databases – from “you give me rules, I will derive more data” to “you give me data, I will derive rules.” I had many students join me to work on this, and we wrote several very impactful papers and algorithms. Two of these algorithms are so influential that they are introduced in many textbooks on pattern discovery. In the Spark Machine Learning Library (MLlib), only two pattern discovery algorithms have been included – FPGrowth and PrefixSpan – and both are from my group.
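For readers who want a quick taste of what these algorithms do, here is a minimal sketch using the FPGrowth implementation in Spark’s MLlib. The transactions and threshold values are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# Toy transaction data: each row is one "basket" of items.
df = spark.createDataFrame(
    [
        (0, ["bread", "milk"]),
        (1, ["bread", "diapers", "beer"]),
        (2, ["milk", "diapers", "beer"]),
        (3, ["bread", "milk", "diapers", "beer"]),
    ],
    ["id", "items"],
)

# minSupport and minConfidence here are illustrative, not recommendations.
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)

model.freqItemsets.show()      # frequent patterns, e.g. [diapers, beer]
model.associationRules.show()  # rules such as diapers -> beer

spark.stop()
```

PrefixSpan (pyspark.ml.fpm.PrefixSpan) is used in a similar way but mines sequential patterns, where the order of items matters.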
And you’re teaching this influential research in your Coursera course on pattern discovery.
Yes, in 1999, I finished the first data mining textbook (Data Mining: Concepts and Techniques), and it basically defined what data mining is. The major contribution of this book is that it defined the key issues of data mining and the key things a student needs to learn. Data mining has its own dedicated algorithms, like pattern discovery, and we also use a lot of statistics and machine learning techniques like classification and cluster analysis.
What sets the U of I data science track apart from other universities?
Because the field and its applications are so broad, we need lots of different types of experts. At UIUC, we have professors from very different backgrounds: we have people from computer science, but we also have people from library and information science, and we have people from statistics. So, I think UIUC has a unique advantage just because the university has so many great departments that students wouldn’t typically have access to.
The field of data mining has changed a lot over your career – where do you see it going?
Data mining basically serves as a bridge between core techniques – machine learning, statistics, and optimization – and their application to real-world problems, and we are not confined to any one approach, as we can use and develop different technologies for different problems. That’s the reason data mining has life: you are facing the real world, which is so diverse.