This Data Science Glossary Can Help Make Your Life in the Industry a Little Easier

data scientists at work

Data science terminology can be daunting for industry newcomers and veterans alike. Because the field evolves constantly, practitioners often hold different views on even its foundational concepts. A firm grasp of core terms also makes it easier to absorb new ideas as they emerge.

The following data science glossary covers concepts and words often used by practitioners. Early-career professionals can refer to it as they get acquainted with their careers. Experts can also use it to reacquaint themselves with core concepts.


Algorithm

A step-by-step process followed by a computer to solve a specific problem or class of problems. Algorithms have been used for centuries in solving mathematical problems. In data science, these processes are essential for converting raw data into information digestible by human users.

Artificial Intelligence (AI)

The programmed emulation of human intelligence and behavior by machines. AI-enabled technology uses input data and algorithms to produce insights. For example, AI software can turn neighborhood-level data on power usage into predictions of areas susceptible to future outages.


Bias

A systematic difference between the results of predictive modeling and the true or anticipated outcome. This deviation can be caused by flaws in raw data, modeling issues, or mistakes made by data scientists.

Big Data

Massive volumes of raw data that require AI and other methods for further analysis. The entire world of data encompassed an estimated 44 zettabytes (44,000,000,000,000,000,000,000 bytes) by the end of 2021. This data science term often refers to raw information that is too large for manual processing or simple data processing systems.

Causal Inference

A process for determining how outcomes are influenced by different inputs. This process takes into account complicating factors called confounders, which affect both the inputs and the outcomes, along with variable conditions. Data scientists conduct experiments to determine the causes of effects found in data sets.


Classification

A process for placing new data points into categories based on comparisons with stored data points. Each point is evaluated for its known characteristics through machine learning processes. This process helps predict how a new item is likely to behave, based on its similarity to items whose categories are already known.
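
One simple way to illustrate classification is a nearest-neighbor vote: a new point takes the most common category among the stored points closest to it. The sketch below is a minimal illustration using only the Python standard library. The customer data, the function name `classify`, and the choice of `k=3` are assumptions made for the example, not part of any particular system.

```python
from collections import Counter
from math import dist


def classify(new_point, stored_points, k=3):
    """Label new_point by majority vote among its k nearest stored points."""
    # Sort stored (features, label) pairs by Euclidean distance to the new point.
    nearest = sorted(stored_points, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored data: (hours of use per day, monthly spend) -> customer type.
stored = [
    ((1.0, 10.0), "casual"),
    ((1.5, 12.0), "casual"),
    ((2.0, 15.0), "casual"),
    ((6.0, 60.0), "power"),
    ((7.0, 65.0), "power"),
    ((8.0, 80.0), "power"),
]

print(classify((1.2, 11.0), stored))  # casual
print(classify((7.5, 70.0), stored))  # power
```

Real classifiers usually scale the features first so that one large-valued column does not dominate the distance.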


Correlation

A statistical measure of the strength and direction of the relationship between two or more variables. Data scientists conduct correlation analyses to determine how strongly data points or sets move together. A strong correlation does not by itself show that one variable causes the other.
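
A common way to quantify the strength of a relationship between two variables is the Pearson correlation coefficient, which ranges from -1 (perfect inverse relationship) to 1 (perfect direct relationship). The sketch below computes it by hand; the study-hours data and the name `pearson_r` are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: hours studied versus exam score.
hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 58, 65, 70, 79]

r = pearson_r(hours_studied, exam_scores)
print(round(r, 3))  # close to 1, a strong positive relationship
```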

Cross Validation

A method for estimating how accurately a machine learning model will perform on data it has not seen. A basic version of cross validation involves training a model on one subset of the data. The model is then tested on the remaining, held-out subsets of the same set, and the process is repeated with different splits so the results can be averaged.
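
The splitting idea can be sketched as k-fold cross validation: the data is divided into k parts, and each part takes one turn as the held-out test set. The example below is a minimal sketch under simplifying assumptions: the "model" just predicts the mean of its training values, the data is not shuffled, and the names `k_fold_splits` and `cross_validate` are hypothetical.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover items.
        stop = n_items if fold == k - 1 else start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


def cross_validate(values, k=3):
    """Average held-out squared error of a predict-the-training-mean model."""
    errors = []
    for train, test in k_fold_splits(len(values), k):
        prediction = sum(values[i] for i in train) / len(train)
        fold_error = sum((values[i] - prediction) ** 2 for i in test) / len(test)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Hypothetical sensor readings.
readings = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
score = cross_validate(readings, k=3)
print(score)  # mean squared error on data the model did not train on
```

In practice the rows are usually shuffled before splitting so that each fold is representative of the whole set.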

Data Visualization

A graphical or image-based representation of what data means in the real world. This data science term applies to media ranging from line graphs to Tableau dashboards.

Data Warehouse

The central storage system for all data sets created or owned by a single organization. Data warehouses are fed by departmental systems and outside vendors. Practitioners use Structured Query Language (SQL) to access data sets for further analysis.

Data Wrangling

The acquisition, cleansing, and structuring of raw data into usable information. Data scientists start by reformatting data that uses conflicting styles and restructuring for in-house systems. This process also includes validation of random samples for accuracy and user-friendly access for future updates.
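
Reformatting data that arrives in conflicting styles is a typical wrangling step. The sketch below normalizes dates written in several styles into one ISO format. The list of accepted formats and the name `normalize_date` are assumptions for illustration; a real pipeline would match the formats to its actual sources.

```python
from datetime import datetime

# Hypothetical set of date styles found in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    return None


raw_dates = ["2021-03-05", "03/07/2021", " 09.03.2021 ", "unknown"]
cleaned = [normalize_date(value) for value in raw_dates]
print(cleaned)  # ['2021-03-05', '2021-03-07', '2021-03-09', None]
```

Values that cannot be parsed are returned as `None` rather than guessed, so they can be reviewed during validation.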


ETL

An acronym that refers to the Extract, Transform, Load process for populating data warehouses. Data extracted from outside sources is transformed into the end user’s preferred format before it is loaded into warehouses and databases. This process, along with data wrangling, converts global data streams into consumable insights.
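
The three stages can be sketched in a few lines. In the example below, an in-memory SQLite database stands in for the warehouse, and the CSV text, table name `sales`, and function names are all hypothetical. It is an illustration of the pattern, not a production pipeline.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an outside source: inconsistent casing, amounts as text.
RAW_CSV = """region,amount
north,100.50
SOUTH,200.25
North,50.00
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and convert amounts to numbers."""
    return [(row["region"].strip().lower(), float(row["amount"])) for row in rows]


def load(records, connection):
    """Load: write the cleaned records into the warehouse table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", records)
    connection.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)

# Once loaded, the data can be queried with SQL.
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('north', 150.5), ('south', 200.25)]
```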

Exploratory Data Analysis

An analytical method for finding the primary features of data sets prior to modeling. Data scientists use visual representations of data to find patterns and commonalities. This approach defines the limits and potential in sets as practitioners determine their next steps.

Machine Learning

A branch of artificial intelligence involving automated systems trained to analyze and make decisions based on input data. Approaches range from unsupervised learning, which finds patterns in unlabeled data sets, to supervised learning, which trains on examples labeled by data scientists. Practical uses range from recommendations on streaming services to diagnostic predictions in medicine.

Neural Network

A subset of machine learning whereby algorithms mimic neurons in the human brain to analyze data. Interconnected nodes assign weights to their inputs and pass the results on to other nodes, so that the network as a whole learns which features of the data are relevant. Data practitioners develop these networks for analysis of complex data including videos and audio files.
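
The weighting idea can be shown with a single artificial neuron: it multiplies each input by a weight, adds a bias, and squashes the total into the range 0 to 1. The sketch below wires three such neurons into a tiny network. The weights are hand-picked assumptions for illustration; in practice they are learned from data during training.

```python
from math import exp


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def tiny_network(inputs):
    """Two hidden neurons feeding one output neuron (hypothetical weights)."""
    hidden = [
        neuron(inputs, [0.5, -0.5], 0.0),
        neuron(inputs, [-0.5, 0.5], 0.0),
    ]
    return neuron(hidden, [1.0, 1.0], -1.0)


print(tiny_network([0.0, 0.0]))  # 0.5
print(tiny_network([2.0, 1.0]))  # a value between 0 and 1
```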


Overfitting

The overdependence of a data model on a particular data set. Overfitted models cannot make accurate predictions with new data because their parameters are matched too closely to the specific set they were trained on, including its noise. This modeling issue is contrasted by underfitting, in which a model is too simple to capture the underlying pattern in the data.


Regression

A technique for determining the relationship between one or more independent variables and a dependent variable. The most familiar version, linear regression, fits a trend line to the past behavior of the variables. Data experts use this technique to predict health outcomes, stock market trends, and consumer behaviors.
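
A trend line of this kind is commonly fitted with ordinary least squares, which picks the slope and intercept that minimize the squared vertical distances to the data points. The sketch below fits one independent variable; the sales figures and the name `fit_line` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures.
months = [1, 2, 3, 4, 5]
sales = [12.0, 15.0, 19.0, 22.0, 27.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # Extend the trend line to month 6.
print(round(slope, 2), round(intercept, 2), round(forecast, 2))  # 3.7 7.9 30.1
```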


Residuals

A diagnostic approach to machine learning systems that compares modeled and real-world outcomes. Each residual is the difference between an observed value and the value the model predicted, and in an unbiased, well-fitted system the residuals scatter randomly around zero. Data scientists look for large differences between the outcomes or patterns not reflective of data to correct their systems or data inputs.
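
The comparison between modeled and real-world outcomes is usually computed as a residual: the actual value minus the predicted value. The sketch below uses hypothetical actual and predicted figures to show two simple checks, the average residual and the single largest miss.

```python
def residuals(actual, predicted):
    """Residual for each observation: actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical observed outcomes and the corresponding model predictions.
actual = [12.0, 15.0, 19.0, 22.0, 27.0]
predicted = [11.6, 15.3, 19.0, 22.7, 26.4]

res = residuals(actual, predicted)
mean_residual = sum(res) / len(res)
largest_miss = max(res, key=abs)

print([round(r, 2) for r in res])  # [0.4, -0.3, 0.0, -0.7, 0.6]
print(round(abs(mean_residual), 6))  # 0.0, no systematic over- or under-prediction
print(round(largest_miss, 2))  # -0.7, the observation the model missed most
```

An average far from zero suggests bias, while a visible pattern in the residuals suggests the model is missing some structure in the data.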

Statistical Significance

A method for evaluating whether trends found within data sets reflect real effects or random chance. Data samples are tested against hypotheses created by practitioners to evaluate the likelihood of sampling errors. A result is deemed significant when it would be unlikely to occur by chance alone if the null hypothesis (the assumption of no real effect) were true, typically when its p-value falls below a preset threshold such as 0.05.
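
One way to make this concrete is a permutation test, which asks how often a difference at least as large as the observed one would appear if group labels were assigned at random. The sketch below enumerates every possible split of a small data set. The two groups of measurements and the name `permutation_p_value` are assumptions for illustration, and 0.05 is a conventional threshold rather than a fixed rule.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    count = 0
    # Try every way of relabeling n_a of the pooled values as "group A".
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-12:  # Small tolerance for float rounding.
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical measurements from a control group and a treatment group.
control = [12, 14, 15, 16]
treatment = [18, 19, 21, 22]

p_value = permutation_p_value(control, treatment)
print(round(p_value, 4))  # 0.0286, below the conventional 0.05 threshold
```

Exhaustive enumeration only works for small samples; larger data sets usually rely on random resampling or a standard test such as the t-test.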


Variance

A measure of how far the values in a data set spread out from their mean. It is found by subtracting the mean from each value, squaring each difference, and averaging the squared differences. Data practitioners use variance to determine the distribution of figures in large or complex sets.
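
As a worked example, variance is the average of the squared differences between each value and the mean of the set. The sketch below computes the population variance, which divides by the number of data points; the sample variance divides by one less than that number. The data values are made up for illustration.

```python
import statistics


def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(data))  # 4.0 (mean is 5; squared deviations sum to 32; 32 / 8 = 4)
print(variance(data) == statistics.pvariance(data))  # True, matches the standard library
```

The square root of the variance is the standard deviation, which is 2.0 for this data set.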

Advance Your Computer Science Career at Baylor University

A graduate degree in computer science can further build your understanding of the concepts in this data science glossary. Your career prospects are enhanced with a master’s degree from an innovative school. Baylor University’s Online Masters in Computer Science teaches in-demand skills to students throughout the country.

This completely online graduate degree program covers many of the major concepts listed in the data science glossary. As a degree candidate, you’ll build advanced knowledge in core courses like:

  • Intro to Computational Theory
  • Advanced Algorithms
  • Advanced Databases
  • Advanced Data Communications
  • Intro to Machine Learning
  • Software Engineering

The program also offers two tracks that open new career paths after graduation. The Data Science track prepares graduates to create data-driven insights for public and private sector employers. Students in the Software Engineering track learn to build programs and systems that meet client needs.

Baylor University degrees confer the school’s stellar reputation on each graduate. In 2021, U.S. News & World Report recognized the school’s excellence in the following categories:

  • No. 25 in Most Innovative Schools
  • No. 47 in Best Colleges for Veterans
  • No. 76 in Best National Universities

Request your free Online Masters in Computer Science program guide and connect with an enrollment advisor today to find out more about our online degree options.