
The DataPrepOps Landscape

Data-Centric AI landscape by YData

Since Andrew Ng coined the term "data-centric AI" in 2021, the number of companies that identify themselves as providing data-centric AI tools has exploded. From synthetic data to data monitoring, companies across the machine learning workflow have jumped on the term to align themselves with the exciting developments in model creation and productionization.

But what does it mean for these companies to offer data-centric AI tools? Are their offerings contradictory, complementary, or something in between? While our previous post focused on the intuition of how DataPrepOps is the equivalent of MLOps for data-centric AI, in this post we’re going to highlight some of the tools and platforms that data scientists use to iteratively improve their models through data-centric AI best practices.

To avoid overwhelming readers with a hodgepodge of tools and companies which may be direct competitors or may have nothing to do with each other, we’ve organized the DataPrepOps landscape into 4 major categories, based on which step of the data-centric AI model development lifecycle they help with:

  • Data Understanding: tools used to visualize, explore, and understand the data that models are being trained on;
  • Data Preparation: tools for both designing data transformations and orchestrating them to run;
  • Data Versioning: tools for keeping track of changes to a dataset and its schema, which can otherwise become a pitfall for models after they go into production;
  • Data Monitoring: tools for tracking the quality and distribution of data, making sure that models aren’t being trained on or fed low-quality or out-of-distribution data.

Let’s take a look at each of these categories and learn about some of the tools we can use to improve our models from each one.


Data Understanding

Data Understanding tools allow users to develop intuition about the nature and characteristics of their datasets. They are useful because, without understanding the data and the relationships between the variables it expresses, data scientists are flying blind.

Data Understanding can be seen as an umbrella term encompassing several related, although distinct, categories of tooling that aim to describe the available data: data analytics solutions, data profiling tools, and data catalogs. Whereas data analytics is more grounded in the business intelligence space (e.g., analytics and reporting), data profiling focuses on summarizing the basic data descriptors, visualizing the data’s main characteristics, and highlighting potential data quality issues. Fostering an even more profound data understanding, data catalogs maintain an inventory of all data assets within an organization and track the flow of data as it moves through the processes, transformations, and teams in the organization’s ecosystem.

Fabric Data Catalog (which leverages the incredibly popular, open-source ydata-profiling), Sweetviz, and AutoViz are powerful tools for understanding your data. You can find additional tools for data understanding listed in our Awesome Data-Centric AI Repository, including Lux, DataPrep, and more.
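
To make that concrete, here is a minimal sketch of profiling a dataset with the open-source ydata-profiling package; the CSV file name and report title are placeholders for your own data.

```python
# A quick profiling pass with ydata-profiling (formerly pandas-profiling).
# "customers.csv" is an illustrative placeholder for your own tabular dataset.
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("customers.csv")

# Generates an interactive HTML report with per-column statistics, correlations,
# missing-value patterns, and basic data quality warnings.
profile = ProfileReport(df, title="Customer Data Profile")
profile.to_file("customer_data_profile.html")
```

A report like this is often the fastest way to spot skewed distributions, unexpected missing values, or duplicated columns before any modeling work begins.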

Data Preparation

Transforming raw data into a dataset that a model can be trained on (which includes data cleaning and feature engineering, among other tasks) is a vital stage in the DataPrepOps process. Without this step, ML models would be virtually untrainable because they wouldn’t have access to high-quality data to learn relationships from. Importantly, data preparation involves two steps: first, designing the transformations that clean the data and turn it into actionable features; and second, orchestrating those transformations to run repeatedly once operationalized.

Data Labeling

Data labeling tools are platforms or service providers which allow you to either label your data more effectively yourself or outsource your data labeling. Some data labeling tools automate the labeling of new data, while others rely on human labeling to ensure high labeling accuracy.

Examples of automated labeling providers include Alectio, Snorkel, Galileo, and Cleanlab. For human labelers, you can turn to companies like Toloka, whereas if you want a platform that you can use for your own data labelers to more efficiently label your data, you can check out HumanSignal or LabelBox.
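
As a rough illustration of the automated side, here is a minimal sketch of flagging likely label errors with the open-source cleanlab library; the toy dataset and classifier are placeholders, and a real project would use out-of-sample probabilities from its own model and labeled data.

```python
# Sketch: flagging likely label errors with cleanlab on a toy dataset.
# In practice, "features" and "labels" would come from your own labeled data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

features, labels = make_classification(
    n_samples=1000, n_classes=3, n_informative=5, random_state=0
)

# Out-of-sample predicted probabilities via cross-validation.
model = LogisticRegression(max_iter=1000)
pred_probs = cross_val_predict(model, features, labels, cv=5, method="predict_proba")

# Indices of examples whose given label disagrees with the model's confident predictions,
# ranked so the most suspicious labels come first.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(issue_indices)} examples flagged for relabeling review")
```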

Synthetic Data

Synthetic data tools allow you to synthesize new data based on the data that you already have. This means that you can train models on larger, higher-quality, anonymized datasets, without the need to collect and label new data. Synthetic data tools train a new, generative model based on the data you’ve previously collected and then generate new data from that model.

Synthetic data tools include YData Fabric’s Data Synthesizer, Gretel, Mostly AI, Hazy, and Synthesized. On the open-source landscape, you can kickstart your synthetic data experiments with ydata-synthetic, gretel-synthetics, and SDV, among others.
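
As a concrete starting point, here is a minimal sketch using the open-source SDV library’s single-table API (the imports follow SDV 1.x and may differ in other versions); the CSV file is a placeholder for your own data.

```python
# Sketch: fitting a generative model on real data and sampling synthetic rows with SDV.
# "transactions.csv" is a placeholder for your own tabular dataset.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.read_csv("transactions.csv")

# Infer column types (numerical, categorical, datetime, ...) from the data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Train a generative model on the real data, then sample brand-new rows from it.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
```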

Feature Engineering

Feature Engineering encompasses the process of selecting, transforming, and refining existing features to improve the input given to machine learning models. The tasks carried out during feature engineering depend on the specific domain and problem we’re hoping to solve, but robust solutions in this space include Fabric, Featuretools, tsfresh, and scikit-learn, all popular tools that data scientists reach for when designing feature transformations for the data preparation step of the DataPrepOps lifecycle.
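
For example, here is a minimal sketch of declaring feature transformations as a reusable scikit-learn pipeline; the column names are illustrative placeholders.

```python
# Sketch: declaring feature transformations as a reusable scikit-learn pipeline.
# Column names ("age", "income", "country", "plan") are illustrative placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]
categorical_features = ["country", "plan"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# The same preprocessor is fit on training data and reused, unchanged, at inference time:
# X_train_prepared = preprocessor.fit_transform(X_train)
```

Capturing the transformations in a single pipeline object keeps the feature logic consistent between training and inference, and makes it easy to hand off to the orchestration step described below.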

Data Orchestration

Once data transformations have been designed, they also need to be orchestrated to run repeatedly and automatically on new data. While a simple job scheduler like cron can be sufficient for the simplest of data pipelines, many data engineers and data scientists rely on more advanced tools for this step.

Often, these tools allow users to define a series of transformation steps with dependencies between them as a directed acyclic graph (a DAG), so that each data transformation step is only kicked off once all of its prerequisites have completed. Tools that allow users to program and schedule data transformation DAGs include YData Fabric’s Pipelines (which offers a managed version of Kubeflow), Astronomer (which offers a managed version of Airflow), and Prefect (which offers a managed version of… also Prefect!).
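
As a small illustration of the DAG idea, here is a minimal sketch using Prefect’s task and flow decorators (Prefect 2.x style); the step functions are trivial placeholders, and Prefect infers the dependency graph from how results are passed between tasks.

```python
# Sketch: a tiny dependency graph of data-prep steps using Prefect's task/flow decorators.
# Each step body is a stand-in for real extraction, cleaning, and feature-building logic.
from prefect import flow, task

@task
def extract():
    return {"raw_rows": 1000}  # stand-in for pulling raw data

@task
def clean(raw):
    return {"clean_rows": raw["raw_rows"] - 10}  # stand-in for data cleaning

@task
def build_features(clean_data):
    return {"features": clean_data["clean_rows"]}  # stand-in for feature engineering

@flow
def data_prep_pipeline():
    raw = extract()
    cleaned = clean(raw)       # runs only after extract() completes
    build_features(cleaned)    # runs only after clean() completes

if __name__ == "__main__":
    data_prep_pipeline()
```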

Data Versioning

Tools like git (and platforms like GitHub and GitLab) allow software engineers to track changes and developments in a codebase through version control. Similarly, data scientists and machine learning engineers can track changes in a dataset through data versioning. Changes can come in the form of changes to a schema (e.g. adding or removing a column, or changing the definition of a column) or changes to the contents of a dataset (e.g. adding, removing, or modifying a row). By tracking changes to a dataset, data scientists can see how changing a dataset impacts the performance of models trained on it. Popular tools for data versioning include the aptly named Data Version Control (DVC) and Pachyderm.
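
To give a flavor of how this looks in practice, here is a minimal sketch using DVC’s Python API to read a specific, pinned version of a dataset; the repository URL, file path, and git tag are illustrative placeholders, and the file is assumed to have been tracked earlier with the dvc add command.

```python
# Sketch: reading a pinned version of a dataset with DVC's Python API.
# The repository URL, file path, and git tag below are illustrative placeholders;
# the file itself is assumed to have been tracked earlier with "dvc add data/train.csv".
import io
import pandas as pd
import dvc.api

data_text = dvc.api.read(
    path="data/train.csv",
    repo="https://github.com/example-org/example-repo",
    rev="v1.2",  # the git tag or commit that pins this exact dataset version
)
df_v1_2 = pd.read_csv(io.StringIO(data_text))
print(df_v1_2.shape)
```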

Data Monitoring

Data monitoring is useful because it allows data scientists to get alerted about changes in a dataset or data stream, thereby giving them the power to proactively respond to data quality or data drift issues. Data monitoring is also closely related to activities like data quality validation and model monitoring, both of which are essentially variations of data monitoring.

Some data monitoring companies, such as WhyLabs or Evidently, focus specifically on data that’s being used for machine learning. Others, such as Monte Carlo and Metaplane, are more focused on data that’s being stored in SQL databases and data warehouses, and therefore might be used for business intelligence or other analytics instead of machine learning. Still others, like Great Expectations and Deepchecks, are focused on providing data quality checks that allow data teams to monitor their machine learning pipelines.
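
As one example of what this looks like in code, here is a minimal sketch of a data drift check with the open-source Evidently library (the API shown follows the 0.4.x Report interface and may differ in other versions); the reference and current CSV files are placeholders for your own data snapshots.

```python
# Sketch: comparing a current data snapshot against a reference window with Evidently.
# "reference.csv" and "current.csv" are placeholders for your own data snapshots.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_csv("reference.csv")  # e.g. the data the model was trained on
current = pd.read_csv("current.csv")      # e.g. last week's production data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")  # per-column drift tests and an overall summary
```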

Conclusion

DataPrepOps tools offer streamlined and automated data preparation, cleaning, and transformation processes, implementing the best practices of Data-Centric AI. Ultimately, the goal of every vendor and tool that we’ve listed above is to make data scientists more effective at enhancing data quality and consistency, consequently improving their ML models. 

From data understanding to data monitoring, you now have a complete overview of the state-of-the-art tools and companies that will help you get the most out of your data.

The question is: are you ready to take your data science endeavours to the forefront of AI development?
