Glossary

Whether you're an experienced AI professional or just starting to explore this exciting field, our glossary is designed to provide you with a comprehensive reference that will help you navigate the complex and ever-evolving world of Data-Centric AI. Start exploring the terminology that underpins this groundbreaking technology.
Artificial Intelligence (AI)
The simulation of human intelligence processes by machines, especially computer systems.
AI Models
Mathematical algorithms that are trained on data to make predictions or decisions about new, unseen data. AI models include various types of machine learning models, such as supervised, unsupervised, and reinforcement learning models, and are used in a wide range of applications, such as natural language processing, computer vision, and predictive analytics. Their accuracy and reliability depend on the quality and quantity of the data used to train and test them, as well as on the algorithm and the parameters chosen for the model.
Big Data
Extremely large data sets that can be analyzed to reveal patterns, trends, and associations.
Data
Facts and statistics collected together for analysis.
Data Analytics
The process of examining data sets to draw conclusions about the information they contain.
Data Augmentation
The process of artificially increasing the size of a dataset by generating additional examples using techniques such as flipping, rotating, cropping, or adding noise to the original data.
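For illustration, here is a minimal augmentation sketch in Python with NumPy; the 28x28 "image" is a random placeholder standing in for a real training example.

```python
import numpy as np

image = np.random.rand(28, 28)  # placeholder grayscale image

flipped = np.fliplr(image)                               # horizontal flip
rotated = np.rot90(image)                                # 90-degree rotation
cropped = image[2:26, 2:26]                              # central 24x24 crop
noisy = image + np.random.normal(0, 0.05, image.shape)   # additive Gaussian noise

# Each variant can join the training set as an extra example with the same label.
augmented = [flipped, rotated, cropped, noisy]
```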
Data Bias
The presence of systematic errors or inaccuracies in data that can affect the performance and fairness of machine learning models. Data bias can be introduced by a variety of factors, such as sample selection bias, measurement bias, or societal biases, and can result in models that are inaccurate or unfair for certain groups of people. Data bias can be addressed through techniques such as data augmentation, sampling methods, and model calibration, as well as through responsible data governance and management practices that promote diversity, equity, and inclusion.
Data-Centric AI
An approach to AI development and deployment that places a greater emphasis on the data used to train and test machine learning models, as well as on the data generated by the AI systems themselves. Data-centric AI recognizes that the quality and quantity of data have a significant impact on the accuracy, reliability, and fairness of machine learning models, and that data governance and management practices are essential for the ethical and responsible use of AI. It involves tools and techniques for data quality assessment, data labeling, data augmentation, and data bias detection and mitigation, which help ensure that the data used to train and test AI models is accurate, representative, and unbiased. It also involves monitoring and analyzing the data generated by AI systems, to verify that they are performing as intended and to identify potential issues.
Data Labeling
The process of assigning predefined tags or categories to data samples, such as images, text, or audio, to enable supervised machine learning. Data labeling is often used to create labeled datasets that can be used to train and test machine learning models, such as image recognition, natural language processing, or speech recognition models. Data labeling can be performed by humans, using manual annotation or crowdsourcing platforms, or by using automated techniques, such as clustering or rule-based classification. Data labeling can be a time-consuming and expensive process, and the quality of the labeled data can have a significant impact on the accuracy and performance of the machine learning models.
Data Mining
The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
Data Privacy
The practice of protecting sensitive or confidential information in data from being disclosed, shared, or misused without the consent of the individuals or entities to which the data belongs. Data privacy involves ensuring that data is only used for its intended purpose, and that it is stored, processed, and transmitted securely to prevent unauthorized access, use, or disclosure. This includes complying with regulations and laws related to data protection, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA). Data privacy can also involve ethical considerations, such as ensuring that individuals have the right to access, correct, or delete their personal data. Techniques for ensuring data privacy include data encryption, access controls, and anonymization techniques, such as k-anonymity and differential privacy.
Data Profiling
The process of analyzing and summarizing the characteristics of a dataset, such as the type, format, distribution, and quality of the data. Data profiling is used to identify potential data quality issues, such as missing or duplicate values, inconsistent data types, or outlier values, which can affect the accuracy and reliability of data analysis and modeling. Data profiling tools can automate the process of analyzing large datasets and can provide summary statistics, visualizations, and data quality reports to help data scientists and analysts understand the data and identify potential problems.
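As a minimal illustration, a few lines of pandas can already surface column types, summary statistics, missing values, and duplicates; the tiny inline dataset is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 45, None, 29, 45],
    "city": ["Vienna", "Graz", "Vienna", None, "Graz"],
})

print(df.dtypes)                    # column types
print(df.describe(include="all"))   # summary statistics per column
print(df.isna().sum())              # missing values per column
print(df.duplicated().sum())        # fully duplicated rows
```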
Data Science
An interdisciplinary field that involves the use of scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Datasets
Collections of data that are organized and stored in a specific format, often for use in machine learning, data analysis, or other research applications. Datasets can include structured or unstructured data and can come from a variety of sources, such as surveys, sensors, social media, or public databases. Datasets can vary in size, complexity, and quality, and often require preprocessing and cleaning before they can be used in machine learning or data analysis tasks.
Data Sharing
The practice of making data available to others for the purpose of research or other activities, often to promote scientific collaboration or to enable broader analysis of the data. Data sharing can involve sharing raw data, data summaries, or metadata, and may be subject to legal or ethical restrictions, such as data privacy and security concerns or intellectual property rights. Data sharing can also involve the development of standards and protocols for sharing data, such as data repositories, data management plans, and data citation practices.
Data Silos
Separate and isolated collections of data that are stored and managed independently of other data sources within an organization. Data silos can arise from factors such as organizational structure, legacy systems, or proprietary data formats, and can create inefficiencies and barriers to effective data management and analysis. Data silos can make it difficult to access and analyze data across different parts of an organization, leading to duplication of effort, data inconsistencies, and missed opportunities for insights and innovation. Techniques for breaking down data silos include data integration, data governance, and data sharing policies, which can help to promote collaboration, standardization, and transparency across different parts of an organization.
Data Simulation
The process of creating synthetic data by modeling the statistical properties of real-world data using simulations or generative models.
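A minimal sketch of the idea in Python: estimate the statistical properties of real data, then sample new records from the fitted model. The "real" data here is itself randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=[50, 170], scale=[10, 8], size=(1000, 2))  # placeholder real data

mean = real.mean(axis=0)            # estimate the properties of the real data
cov = np.cov(real, rowvar=False)

synthetic = rng.multivariate_normal(mean, cov, size=1000)  # simulated synthetic rows
```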
Deep Learning
A subset of machine learning that involves training artificial neural networks with large datasets to recognize patterns and make predictions.
Differential Privacy
A privacy-enhancing technique that adds random noise to a dataset to protect the privacy of individuals while still allowing for useful statistical analysis.
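As a minimal illustration, the classic Laplace mechanism for a counting query looks roughly like this; the epsilon value and toy data are illustrative choices, not recommendations.

```python
import numpy as np

def dp_count(data, predicate, epsilon=1.0):
    true_count = sum(predicate(x) for x in data)
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    noise = np.random.laplace(0, sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 29, 52, 47]
print(dp_count(ages, lambda a: a > 40))  # noisy answer to "how many are over 40?"
```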
Exploratory Data Analysis (EDA)
The process of analyzing and summarizing a dataset to understand its main characteristics and patterns, often using graphical and statistical methods. EDA is used to identify potential problems with the data, generate hypotheses, and identify relevant features for further analysis. The goal of EDA is to provide insights and understanding about the data that can be used to inform subsequent data modeling and decision-making. EDA often involves visualizing the data using plots and charts and using statistical tools such as descriptive statistics, hypothesis testing, and correlation analysis to uncover patterns and relationships in the data.
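A minimal first pass in pandas might look like this; the tiny dataset is hypothetical, and the histogram call assumes matplotlib is installed.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [32_000, 54_000, 48_000, 61_000, 39_000],
    "spend":  [12_000, 20_000, 19_000, 25_000, 15_000],
})

print(df.describe())   # descriptive statistics per column
print(df.corr())       # correlations between numeric columns
df.hist()              # quick distribution plots (uses matplotlib)
```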
Fairness in AI
The idea that AI systems should not discriminate against individuals or groups based on characteristics such as race, gender, or age, and should provide equal opportunities and outcomes for everyone. Fairness in AI is important for ensuring that AI systems do not perpetuate or amplify existing biases and inequalities in society. It can be pursued through techniques such as data sampling, algorithmic transparency, and fairness metrics, which help to identify and mitigate bias in AI systems, and it is closely linked to other principles of responsible AI, such as privacy, security, and transparency.
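One simple fairness metric is the demographic parity difference: the gap in positive-prediction rates between two groups. A minimal sketch, with hypothetical predictions and group labels:

```python
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0])         # model outputs (1 = positive)
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

rate_a = predictions[group == "A"].mean()
rate_b = predictions[group == "B"].mean()
print(abs(rate_a - rate_b))  # 0.0 would indicate demographic parity on this data
```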
Generative Adversarial Networks (GANs)
A class of machine learning models that use a two-part architecture to generate synthetic data that is similar to real-world data. GANs consist of a generator network that produces synthetic data, and a discriminator network that tries to distinguish between the synthetic data and real data. The generator network is trained to produce data that is increasingly difficult for the discriminator to distinguish from real data. GANs have been used to generate synthetic images, videos, and audio, and have applications in areas such as image synthesis, data augmentation, and unsupervised learning. However, GANs can be difficult to train and can suffer from instability and mode collapse, where the generator produces only a few distinct outputs, rather than a diverse set of outputs.
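A minimal GAN sketch in PyTorch, assuming the "real" data is a one-dimensional Gaussian; the network sizes, learning rates, and step count are illustrative choices rather than a recipe.

```python
import torch
import torch.nn as nn

latent_dim = 8
generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, 1),                      # outputs one synthetic value
)
discriminator = nn.Sequential(
    nn.Linear(1, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),        # probability the input is real
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) * 2 + 5      # "real" data drawn from N(5, 2)
    fake = generator(torch.randn(64, latent_dim))

    # Discriminator: label real samples 1 and synthetic samples 0.
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: try to make the discriminator label its output as real.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```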
Generative Models
Machine learning models that can learn to generate new data samples that are similar to the training data.
Machine Learning
A type of AI that allows computer systems to learn and improve from experience without being explicitly programmed.
Natural Language Processing (NLP)
A subfield of AI that deals with the interaction between humans and computers using natural language.
Predictive Analytics
The use of statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data.
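A minimal sketch with scikit-learn: fit a model on historical examples, then estimate the likelihood of a future outcome. The churn-style data here is hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Historical data: [monthly_usage_hours, support_tickets] -> churned (1) or stayed (0)
X = [[5, 4], [40, 0], [8, 3], [35, 1], [3, 5], [50, 0]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[10, 2]])[0][1])  # estimated probability of churn
```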
Privacy-Preserving Synthetic Data
Synthetic data that is designed to protect the privacy of individuals by ensuring that it does not reveal any personally identifiable information.
Reinforcement Learning
A type of machine learning that involves training an agent to make decisions in an environment by receiving feedback in the form of rewards or punishments.
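A minimal tabular Q-learning sketch on a made-up five-state corridor, where reaching the rightmost state yields a reward; all hyperparameters are illustrative.

```python
import random

n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if nxt == n_states - 1 else 0.0   # reward only at the goal
    return nxt, reward, nxt == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: usually exploit the best known action, sometimes explore.
        action = (random.randrange(n_actions) if random.random() < epsilon
                  else max(range(n_actions), key=lambda a: Q[state][a]))
        nxt, reward, done = step(state, action)
        # Move the estimate toward reward + discounted best value of the next state.
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt
```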
Responsible AI
An approach to developing and deploying AI systems that takes into account ethical, social, and legal considerations, as well as the potential impact of the technology on individuals and society. Responsible AI aims to ensure that AI systems are transparent, fair, and accountable, and that they do not reinforce biases or perpetuate discrimination. It involves considering the ethical implications of the data used to train and test AI models, the potential unintended consequences of AI systems, and the social and legal frameworks in which they operate.
Synthetic Data
Artificially generated data that mimics the statistical properties of real-world data, but does not contain any information about real individuals or entities.

Would you like to see our solution in action?