25 Terms Every Data Scientist Should Know

Common data science terms your manager will expect you to know.

Data science is, among other things, a language, according to Robert Brunner, a professor in the School of Information Sciences at the University of Illinois. This concept might come as a shock to those who associate data science jobs with numbers alone.

Data scientists increasingly work across entire organizations, and communication skills are as important as technical ability. Data science is booming in every industry, as more people and companies are investing their time to better understand this constantly expanding field. The ability to communicate effectively is a key talent differentiator.

Whether you pursue a deeper knowledge of data science by learning a specialty, or simply want to gain a smart overview of the field, mastering the right terms will fast-track you to success on your educational and professional journey.

According to Vinod Bakthavachalam, a senior data scientist at Coursera, using the following data science terms accurately will help you stand out from the crowd:

  1. Business Intelligence (BI). BI is the process of analyzing and reporting historical data to guide future decision-making. BI helps leaders make better strategic decisions moving forward by determining what happened in the past using data, like sales statistics and operational metrics.
  2. Data Engineering. Data engineers build the infrastructure through which data is gathered, cleaned, stored and prepped for use by data scientists. Good engineers are invaluable, and building a data science team without them is a “cart before the horse” approach.
  3. Decision Science. Under the umbrella of data science, decision science applies math and technology to business problems, and adds behavioral science and design thinking (a process that aims to better understand the end user).
  4. Artificial Intelligence (AI). AI refers to computer systems that can perform tasks that normally require human intelligence, such as speech recognition, decision-making and language translation. This doesn’t necessarily mean replicating the human mind; instead, it involves using human reasoning as a model to provide better services or create better products.
  5. Machine Learning. A subset of AI, machine learning refers to the process by which a system learns from inputted data by identifying patterns in that data, and then applying those patterns to new problems or requests. It allows data scientists to teach a computer to carry out tasks, rather than programming it to carry out each task step-by-step. It’s used, for example, to learn a consumer’s preferences and buying patterns to recommend products on Amazon or sift through resumes to identify the highest-potential job candidates based on key words and phrases.
  6. Supervised Learning. This is a specific type of machine learning that involves the data scientist acting as a guide to teach the desired conclusion to the algorithm. For instance, the computer learns to identify animals by being trained on a dataset of images that are properly labeled with each species and its characteristics.
  7. Classification. Classification is an example of supervised learning in which an algorithm puts a new piece of data under a pre-existing category, based on a set of characteristics for which the category is already known. For example, it can be used to determine if a customer is likely to spend over $20 online, based on their similarity to other customers who have previously spent that amount.
  8. Cross Validation. Cross validation is a method to validate the stability, or accuracy, of your machine-learning model. Although there are several types of cross validation, the most basic one involves splitting your training set in two and training the algorithm on one subset before applying it to the second subset. Because you know what output you should receive on the second subset, you can assess the model’s validity.
  9. Clustering. Clustering is classification but without the supervised learning aspect. With clustering, the algorithm receives inputted data and finds similarities in the data itself by grouping data points together that are alike.
  10. Deep Learning. A more advanced form of machine learning, deep learning refers to neural networks with multiple layers between input and output, as opposed to shallow networks with a single such layer. Data passes through each layer in turn, which helps computers solve complex, real-world problems.
  11. Linear Regression. Linear regression models the relationship between two variables by fitting a linear equation to the observed data. By doing so, you can predict an unknown variable based on its related known variable. A simple example is the relationship between an individual’s height and weight.
  12. A/B Testing. Generally used in product development, A/B testing is a randomized experiment in which you test two variants to determine the best course of action. For example, Google famously tested various shades of blue to determine which shade earned the most clicks.
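  13. Hypothesis Testing. Hypothesis testing is the use of statistics to assess whether observed data are consistent with a given hypothesis, typically by asking how likely such data would be if the null hypothesis (no effect) were true. It’s frequently used in clinical research.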
  14. Statistical Power. Statistical power is the probability of making the correct decision to reject the null hypothesis when the null hypothesis is false. In other words, it’s the likelihood a study will detect an effect when there is an effect to be detected. A high statistical power means a lower likelihood of concluding incorrectly that a variable has no effect.
  15. Standard Error. Standard error is the measure of the statistical accuracy of an estimate: it is the standard deviation of the estimate’s sampling distribution. A larger sample size decreases the standard error.
  16. Causal Inference. Causal inference is a process that tests whether there is a relationship between cause and effect in a given situation, which is the goal of many data analyses in the social and health sciences. Such analyses typically require not only good data and algorithms, but also subject-matter expertise.
  17. Exploratory Data Analysis (EDA). EDA is often the first step when analyzing datasets. With EDA techniques, data scientists can summarize a dataset’s main characteristics and inform the development of more complex models or logical next steps.
  18. Data Visualization. A key component of data science, data visualization is the visual representation of information, which makes patterns, trends and correlations easier to detect and recognize. It helps people understand the significance of data by placing it in a visual context.
  19. R. R is a programming language and software environment for statistical computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
  20. Python. Python is a general-purpose programming language and is one language used to manipulate and store data. Many highly trafficked websites, such as YouTube, are created using Python.
  21. SQL. Structured Query Language, or SQL, is another programming language that is used to perform tasks such as updating or retrieving data in a database.
  22. ETL. ETL is a type of data integration that refers to the three steps (extract, transform, load) used to blend data from multiple sources. It’s often deployed to build a data warehouse. An important aspect of this data warehousing is that it consolidates data from multiple sources and transforms it into a common, useful format. For example, ETL normalizes data from multiple business departments and processes to make it standardized and consistent.
  23. GitHub. GitHub is a code-sharing and publishing service, as well as a community for developers. It provides access control and several collaboration features, such as bug tracking, feature requests, task management and wikis for every project. GitHub offers both private repositories and free accounts, which are commonly used to host open-source software projects.
  24. Data Models. Data models define how datasets are connected to each other and how they are processed and stored inside a system. Data models show the structure of a database, including the relationships and constraints, which helps data scientists understand how the data can best be stored and manipulated.
  25. Data Warehouse. A data warehouse is a repository where all the data collected by an organization is stored and used as a guide to make management decisions.
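
A few of the statistical terms above are easier to remember once you have seen them in code. The Python sketch below is an illustration only, not a production method. It uses the standard library to compute a standard error (term 15), fit a linear regression of weight on height (term 11), and run the most basic form of cross validation described in term 8: train on one half of the data and check the predictions on the other half. The function names (`standard_error`, `fit_line`, `holdout_error`) and the height and weight figures are made up for this example.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_error(xs, ys, train_fraction=0.5):
    """Train on the first part of the data and return the mean absolute
    prediction error on the remaining part."""
    cut = int(len(xs) * train_fraction)
    slope, intercept = fit_line(xs[:cut], ys[:cut])
    errors = [abs(y - (slope * x + intercept))
              for x, y in zip(xs[cut:], ys[cut:])]
    return statistics.fmean(errors)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    # Synthetic data (assumption): heights in cm, weights in kg plus noise.
    heights = [rng.uniform(150, 200) for _ in range(200)]
    weights = [0.9 * h - 90 + rng.gauss(0, 5) for h in heights]

    slope, intercept = fit_line(heights, weights)
    print(f"fitted line: weight = {slope:.2f} * height + {intercept:.2f}")
    print(f"standard error, 20 people:  {standard_error(weights[:20]):.2f}")
    print(f"standard error, 200 people: {standard_error(weights):.2f}")
    print(f"holdout error (kg): {holdout_error(heights, weights):.2f}")
```

Running the script prints the fitted line, the standard error for a small and a large sample, and the average prediction error on the held-out half. In practice, most data scientists would reach for a library such as scikit-learn for these tasks; the hand-written versions here only show what the terms mean.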

Mastering these terms is an excellent first step towards a durable data science career. Equally important is ensuring they’re understood throughout your organization so that data scientists can operate more efficiently and effectively with their non-data science partners. Like anything, this takes practice, but by putting these data science building blocks in place, you’ll be at a natural advantage when opportunities arise.