
Should Data Science teams use Kubernetes? Hell no!


Data science teams should focus on analysing data and building models, not infrastructure management.


Kubernetes is great!

  1. “Kubernetes is a future-proof solution.” Because it is super cool to say “future-proof”. Nobody knows what the future looks like, but everyone seems sure that Kubernetes will be part of it. I have to agree that the trend indicates that Kubernetes is the way, until something cooler is invented.
  2. 99.9999999999999999% availability. Maybe I got excited with the 9’s, but there are plenty of them. It is not 100% because it will fail when you need it the most. Just kidding: a good implementation of K8s can ensure almost flawless reliability for your application.
  3. Cost reduction, aka scale to zero. This is a super cool feature that helps you save on infrastructure costs when the application does not need computational power!

It is indeed a great tool! And looking at the pros, it seems to make a lot of sense to use it for machine learning: both Kubernetes and machine learning are tech buzzwords used to describe the future, so it makes sense to have them together. Seriously, autoscaling for time-consuming and computationally intensive processes like model training? Makes total sense to use it.
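To put rough numbers on the scale-to-zero argument, here is a back-of-the-envelope sketch in Python. The hourly price and the training hours are made up for illustration; they are assumptions, not figures from any cloud provider or from a real workload.

```python
def monthly_cost(hourly_rate: float, hours_running: float) -> float:
    """Cost of one node that is billed only for the hours it is running."""
    return hourly_rate * hours_running


HOURS_PER_MONTH = 24 * 30  # assuming a 30-day month
HOURLY_RATE = 2.50         # hypothetical price of a GPU node, in USD per hour
TRAINING_HOURS = 40        # hypothetical hours of actual model training per month

# A node that never shuts down versus one that only exists while training runs.
always_on = monthly_cost(HOURLY_RATE, HOURS_PER_MONTH)
scale_to_zero = monthly_cost(HOURLY_RATE, TRAINING_HOURS)
savings = always_on - scale_to_zero

print(f"Always on:     ${always_on:,.2f}")
print(f"Scale to zero: ${scale_to_zero:,.2f}")
print(f"Saved:         ${savings:,.2f} ({savings / always_on:.0%})")
```

The more bursty the workload, the bigger the gap between the two numbers, which is why the feature looks so attractive for model training.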

But should data science teams start working with it?


Photo by Growtika on Unsplash

 

Kubernetes is not so great

  1. Kubernetes is the new oil. Oh wait, that’s data! Then why are data scientists spending tons of time on Kubernetes and not on the data?
  2. Kubernetes is a great tool and it is becoming the new standard for cloud applications. It is so cool that it is being used not only for software applications but also for machine learning. What we shouldn’t forget is that building software applications and building machine learning models are not the same thing.
  3. Kubernetes has plenty of components, processes, services, subsystems, jobs, code… That’s great for someone who is an expert on the subject, but for data scientists it means plenty of risks and problems. Not to mention all the routing and networking services, reverse proxies, and so on!

What are we missing here?

A great tool for data science, so it should be used by data scientists? Not quite. Kubernetes is an infrastructure tool; hence, it should be used by infrastructure specialists.

There’s a new movement around this topic and it is called MLOps. I joined a community that is taking the first steps in the movement, MLOps.community, and surprisingly it grew from a bunch of people (~40) trying to figure out how to make Machine Learning processes simpler into a strong community (~900) sharing knowledge and creating webinars and tools. It is agreed that the skills needed to scale ML using Kubernetes are very different from the ones needed to build models. However, there is not yet a common standard for the role of these people in a company: some are called Machine Learning Engineers, others kept their previous role of Infrastructure Engineer, DevOps or SRE, and some are innovating by being named “AI Infrastructure Engineers”.

To give you an idea of how complex it is to create an MLOps culture and processes, there are companies specializing in it. Yes, if you are now thinking of delegating this job to that one person, consider that there are entire companies, maybe bigger than yours, working on this. MLOps platforms, AI platforms, DataOps platforms, and Data Science platforms are all similar and focus on solving scalability for data science, among other technicalities.

Think carefully before deciding between building internally and buying.

Conclusion

Data scientists are spending tons of time dealing with containerization and Kubernetes in general. This is not a task for data scientists; it belongs to a new, specialized, infrastructure-related role that is emerging. Let data scientists work in their field and create value for the company, instead of having them deal with issues outside their scope.

 

Gonçalo Martins Ribeiro is CEO at YData.

