Summary
In this episode of the AI Engineering podcast, host Tobias Macey interviews Tammer Saleh, founder of SuperOrbital, about the potentials and pitfalls of using Kubernetes for machine learning workloads. The conversation delves into the specific needs of machine learning workflows, such as model tracking, versioning, and the use of Jupyter Notebooks, and how Kubernetes can support these tasks. Tammer emphasizes the importance of a unified API for different teams and the flexibility Kubernetes provides in handling various workloads. Finally, Tammer offers advice for teams considering Kubernetes for their machine learning workloads and discusses the future of Kubernetes in the ML ecosystem, including areas for improvement and innovation.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Tammer Saleh about the potentials and pitfalls of using Kubernetes for your ML workloads.
- Introduction
- How did you get involved in Kubernetes?
- For someone who is unfamiliar with Kubernetes, how would you summarize it?
- For the context of this conversation, can you describe the different phases of ML that we're talking about?
- Kubernetes was originally designed to handle scaling and distribution of stateless processes. ML is an inherently stateful problem domain. What challenges does that add for K8s environments?
- What are the elements of an ML workflow that lend themselves well to a Kubernetes environment?
- How much Kubernetes knowledge does an ML/data engineer need to know to get their work done?
- What are the sharp edges of Kubernetes in the context of ML projects?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with Kubernetes?
- When is Kubernetes the wrong choice for ML?
- What are the aspects of Kubernetes (core or the ecosystem) that you are keeping an eye on which will help improve its utility for ML workloads?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for ML workloads today?
Links
- SuperOrbital
- CloudFoundry
- Heroku
- 12 Factor Model
- Kubernetes
- Docker Compose
- Core K8s Class
- Jupyter Notebook
- Crossplane
- Ochre Jelly
- CNCF (Cloud Native Computing Foundation) Landscape
- Stateful Set
- RAG == Retrieval Augmented Generation
- Kubeflow
- Flyte
- Pachyderm
- CoreWeave
- Kubectl ("koob-cuddle")
- Helm
- CRD == Custom Resource Definition
- Horovod
- Temporal
- Slurm
- Ray
- Dask
- Infiniband
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today, I'm interviewing Tammer Saleh about the potentials and pitfalls of using Kubernetes for your machine learning workloads. So, Tammer, can you start by introducing yourself?
[00:00:31] Tammer Saleh:
Oh, absolutely. Thank you, Tobias. So my name is Tammer. I founded SuperOrbital back in 2017. We are an engineering services and training company focused on solving extremely difficult problems using cloud native technologies. And what that means is just, you know, we do the hard things on Kubernetes. Before that, I was part of Pivotal Cloud Foundry and went back and forth between software engineering and operations. So I kinda have that DevOps mindset just because of my history.
[00:01:06] Tobias Macey:
And do you remember how you first got started working in that cloud and Kubernetes ecosystem and maybe a bit of your exposure to and background in machine learning as well?
[00:01:16] Tammer Saleh:
Yeah. Absolutely. So, like, the cloud ecosystem was kind of interesting because, well, I'm old. Right? I started off when Linux administration meant driving down to the server room and, you know, racking and stacking machines and installing everything by hand. I'm sure you remember. Right? And, yeah, I swear every time I went to the server room, I would come back bleeding. I hated it so much. I carried Band Aids with me because, you know, those VA Linux boxes were terrible. And I remember back in the day, the cloud was just, it was literally the thing you drew on the whiteboard. Right? It was the thing to represent the rest of the Internet. Right?
And when cloud computing became a thing, that's when I was working for another startup called Engine Yard, and it was basically the worst interpretation of cloud computing you could imagine. It was just like, we resold AWS machines with a little bit of configuration on them and a lot of white glove service. That was our superpower. But then Heroku came around. Do you remember Heroku? Did you ever use it?
[00:02:24] Tobias Macey:
I actually still use it today for my work.
[00:02:30] Tammer Saleh:
Well, that's fantastic.
[00:02:31] Tobias Macey:
Yes. I've been using Heroku for many years now, off and on, so I'm altogether too familiar with it.
[00:02:38] Tammer Saleh:
Well, the interesting thing is, Heroku was even more ambitious when it first came out. Do you remember it had that cloud IDE that was an integral part of Heroku, that they later decommissioned? Right?
[00:02:50] Tobias Macey:
I think I came to it right after that, but I have heard many recountings of that history.
[00:02:55] Tammer Saleh:
Yeah. And it looked like a toy. We made fun of it. We're like, well, nobody's gonna host anything on that. How do you even SSH into that? How do you back it up? How do you do any of the stuff that you're supposed to do? But that was the groundbreaking transitional moment when people really started thinking about what the cloud and what a platform would mean. Right? And to be very frank with you, and I'm saying this as somebody who runs a company based on Kubernetes technology, I still feel that the industry took a step backwards from that twelve factor model.
I think it needed to. The whole 12 factor thing was a little bit presumptuous, a little bit egotistical. You know, we tell you how you should build applications. But it was very natural for me when I embraced it, because at the end of the day, 12 Factor, the whole Heroku Cloud Foundry model, was really just: write a good UNIX-y process, and we'll take care of it. You know, you stay within the bounds of, like, djbdns, like the old runit stuff. Stay within the bounds of what it means to be a good Unix process, and we're good.
So, anyways, because of that, I kinda naturally transitioned into cloud, moved into Pivotal, ran engineering for Pivotal Cloud Foundry, and then formed SuperOrbital shortly after. With the ML workloads, it's really interesting. Now, first of all, I just wanna say, like, I'm not a lawyer. Right? I am not an ML engineer. That is not my trade. I wrote a couple of neural networks when I was in college. Like, we've all tried neural nets in college. Right? I experimented, and that was basically it. But myself and all the people here are obviously extreme experts in all things Kubernetes and beyond.
And the reason that Kubernetes beat the whole twelve factor platform is because it came to you. Right? It said, look, you don't have to adapt to our model of what compute should look like. We will embrace any workload you wanna throw at us. And so it was really it still is really flexible as a platform. And because of that, it attracts interesting problems, and ML is full of interesting problems. So because of that, we've got a ton of clients that come our way saying, we we need to run these weird ML work workflows and workloads on Kubernetes, and we help them out.
[00:05:37] Tobias Macey:
And now, for people who are maybe tangentially familiar with Kubernetes, or maybe this is the first time that they've heard about it, although they must have been living under a rock for the past 5 years: how do you summarize it? What is your elevator pitch for somebody about what Kubernetes is and why they should care?
[00:05:56] Tammer Saleh:
Yeah. That's a great question. So the simplest way to put it, what it says on the tin for Kubernetes, is that it's a system that will run your containerized workloads at scale in production. And if you don't know what a container is, you could just think of it as a little tiny machine, like a little micro VM. Right? Very fast, very lightweight. If you've used Docker, you know what a container is. And if you've used Docker, you might think, well, I know what Docker Compose is. Like, I could just use that. Why would I not run all my stuff on Docker Compose and call it a day? Because there's a lot of extra stuff that has to be thought about when you're running that in production. Docker Compose is great for your laptop, but even though I know a couple of companies that have done this before, you should not just docker compose up on a server and walk away. Right? You need something stronger for that, and that's what Kubernetes provides. That's what it says on the tin, and that is the thing that you would use it for. But the reason that Kubernetes became so popular so quickly is because it embodied all of the Google SRE best practices, the site reliability engineering best practices. Right? And as part of that, it has this incredible API for automation.
And that is the thing that makes Kubernetes so wonderful and so ubiquitous: it is this one API that, if you as a software engineer learn that API, you no longer have to worry so much about learning the AWS API and the GCP API and the Azure API. And, frankly, the Kubernetes API is extremely well built. I can say with confidence I've never seen an API that is so well reasoned about and so well created. It's just really wonderful. So when we teach students about Kubernetes in, like, our Core K8s class, that's the message that we hammer home a lot: sure, it's there to run your workloads, and it does it using all these SRE best practices, but it's really the API that makes it shine. Unfortunately, though, Kubernetes is pretty complicated. I mean, like, that workshop I just mentioned, it's 5 days long of, like, afternoons, but, like, tons and tons of content just to get you up to speed on Kubernetes. It is a very complex beast. It's about as complicated as learning Linux from scratch.
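To ground the API discussion above, here is a minimal sketch of what that declarative model looks like in practice. The manifest is illustrative, not from the episode: the name and image are placeholders, and the point is simply that you declare a desired end state and Kubernetes converges on it.

```yaml
# Minimal Deployment sketch: declare three replicas of a container and the
# control loop keeps converging on that state. Name and image are
# illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # any containerized workload
          ports:
            - containerPort: 80
```

Applying this with kubectl apply and then deleting one of the pods shows the convergence behavior: the Deployment controller immediately recreates the missing replica to match the declared state.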
[00:08:25] Tobias Macey:
Which many people do just for fun.
[00:08:33] Tammer Saleh:
Yes.
[00:08:35] Tobias Macey:
For the purpose of this conversation and this audience, Kubernetes as it applies to the machine learning and AI ecosystem: what are the aspects of machine learning that you would like to discuss, and how do they pertain to Kubernetes as an environment, either for deployment or experimentation, and just Kubernetes as a target for machine learning?
[00:08:58] Tammer Saleh:
Right. Great question. And, again, I'm not a machine learning engineer, so I'm talking kind of secondhand with regards to the type of work that we help our clients with. Right? But, you know, like I said before, Kubernetes is extremely flexible. So it's kind of there to help with most aspects of ML workloads, you know, model tracking and versioning, exploration through Jupyter Notebooks. We've built systems that run a huge number of Jupyter Notebooks at scale with tons of data behind them. And that's really interesting, because that's a thing that 12 Factor would be really bad at, because Jupyter Notebooks are kinda semi stateful. Right? Like, you don't want a long running Jupyter Notebook computation to just be killed in the middle.
Kubernetes is pretty good at that kind of stuff. Obviously, inference: you run the model in production and you serve the actual model, where you care about things like the efficient and multi tenant use of GPUs. Then there's also training, where you've got efficient parallel scheduling across massive amounts of GPUs. All these things can be done on top of Kubernetes.
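As a concrete sketch of the GPU scheduling just mentioned: on clusters running a GPU device plugin (the NVIDIA plugin exposes the nvidia.com/gpu extended resource), a training or inference pod requests GPUs the same way it requests CPU and memory. The image and script names below are hypothetical.

```yaml
# Hedged sketch of a pod requesting one GPU. Assumes the NVIDIA device
# plugin is installed on the cluster; the image and command are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
    - name: train
      image: registry.example.com/team/trainer:latest  # hypothetical image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1   # GPUs are scheduled as whole units by default
```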
[00:10:05] Tobias Macey:
In your experience of working with teams who do have machine learning workloads and are using Kubernetes: machine learning never exists in a vacuum. There are always many other support systems, both organizationally and technically, not least of which is data engineering, for being able to feed data into those machine learning models for training and updating, but also operations, for being able to monitor those models and identify when they need to be retrained or tuned. What are the benefits of Kubernetes in that socio technical ecosystem of machine learning, and some of the ways that you've seen it be leveraged to improve the efficacy of machine learning engineers who are trying to just get their job done?
[00:10:49] Tammer Saleh:
That's a great question. I mean, I think it comes back to that ubiquity. Right? Kubernetes provides that one single API that all these different teams are very comfortable working within. The operations team is used to running Kubernetes for their normal full stack workloads. And so it's very comfortable for them to be using Kubernetes to support the data engineers who are moving all this data around. And because Kubernetes is a lot more flexible than the old twelve factor models, it's much easier to construct a system through the Kubernetes Lego bricks, the different volume types and stuff, where you can efficiently move that data around. So all these support teams, and you're right, we see this in almost every company that we work with. It's not just an ML team off in a corner. You have to have many teams supporting them for them to be effective.
They can all leverage a platform like Kubernetes to get their job done, and they're all speaking the same language that way.
[00:11:51] Tobias Macey:
The other interesting element of Kubernetes is that it has become, I guess, the ochre jelly, if you will, of the cloud ecosystem, where it just consumes everything that it touches and everything that comes in contact with it. And as a result, when people think Kubernetes and the inherent complexity that you already mentioned, that complexity balloons to encompass all of the things that Kubernetes has consumed in the process or that have become attached to it as a result. For one example, you mentioned that if you're using Kubernetes, you don't necessarily have to be as familiar with the AWS or the GCP APIs. And there's actually a project, I think Crossplane is the name, whose whole purpose is that they actually just push those APIs underneath Kubernetes so that you use Kubernetes to manage those resources. And I'm wondering how the emergent complexity of the Kubernetes ecosystem has maybe acted as a deterrent for people from approaching Kubernetes as a solution to their problem, because of the fact that they just view it as this amorphous mass of complexity that they don't know how to get started with.
[00:12:53] Tammer Saleh:
Absolutely. Absolutely. I mean, people talk about the complexity of Kubernetes, and it is very complex. Like I said, it's on par with learning Linux and bash and all that from scratch. But it's nothing compared to the complexity of the wider CNCF ecosystem. We've all seen it, or if you haven't, just Google CNCF Landscape, and you'll find this website that has all the logos of all the companies and open source technologies that are participating in and surrounding Kubernetes. Right? And some of those are fairly simple. Argo CD, things like that: not too hard to understand, and used in almost every Kubernetes installation.
Some things, like Crossplane, are incredibly complicated, almost as complicated as learning Kubernetes itself. Same thing with, like, Istio. Istio is incredibly complicated. And so you do need to be mindful and careful of how much you add on to Kubernetes, because you're taking on the load of maintenance and understanding and configuration and everything with every one of those components you use. You're spending your innovation points. Right?
[00:14:04] Tobias Macey:
Now, for the purpose of actually using Kubernetes in an ML team or as an ML engineer: you mentioned the statefulness aspect when you were talking about Jupyter Notebooks. Machine learning is inherently a stateful operation, at least for the training portion, serving slightly less so. Because of that fact, what are the ways that machine learning workflows start to hit up against the edge cases of Kubernetes, and some of the ways that you have to customize your Kubernetes runtime to be able to account for the statefulness, and also the extremes of the use cases around machine learning?
[00:14:43] Tammer Saleh:
Right. I mean, Kubernetes had, for the longest time, a bad reputation around stateful workloads. Right? It started as completely stateless. It was, you know, run your full stack workloads on Kubernetes, and if you needed a database, well, that's what you go to Amazon for afterwards. Right? Databases should be outside of k8s. And then it added this thing called the StatefulSet, which was originally called the PetSet back in 1.5. It has been a slowly maturing feature; things like volume snapshots are relatively new. But at this point, it's about polish with Kubernetes and stateful workloads.
Many of our customers run databases, large amounts of data, on Kubernetes, and it's just fine. Right? It works well. Does it work better than running it on raw EC2? Probably not. In fact, definitely not. But you get that ubiquitous API. You get that consistent substrate for managing those databases, and that's worth it to a lot of our customers. And you're right. Machine learning workloads have a lot of state associated with them, a lot of data associated with them, but they're not quite what I'd call stateful in the same way that a database is. Like, I think of them as semi stateful. And what I mean by that is, if you lose your disk during the training phase of working with a model, your company is not gonna go bankrupt. Right? Like, when GitHub lost their primary database way back in the day, that was big news. Right? Everybody who used GitHub knew that that had happened, and it caused days of downtime while they recovered. It was really bad.
And, yeah, you'll lose, you know, potentially hours or days of data if you lose your disk, but it's not the same as a database. That being said, ML workloads are very interesting. Training, for example: you're running thousands of these processes in a single batch job in parallel, but not independently. Right? So when you're a full stack developer, you think, oh, in parallel, that's great. That means, like, if I'm running a thousand of these processes and one of them dies, it's totally fine. That's not how it works with training, as a bunch of your audience probably already knows. One of those processes crashes, and that's it for the job; back to the last checkpoint. So, hopefully, your training process is doing frequent checkpointing.
And that's especially bad when coupled with the GPUs that are being pushed out of the factory so quickly nowadays. Like, they're very flaky. So your processes will die frequently, and you won't know if it's because your Python code had a bug in it or because the GPU itself just, you know, pooped the bed. That's some of the interesting stuff that is challenging on Kubernetes. I can get more into that in a bit. But there are parts of Kubernetes that work really well with ML workloads. Again, it was designed to do everything. Some things can take more effort, but things like inference work very well, where you do have challenges around distributing the data of that massive model.
But Kubernetes has tons of solutions for that kind of challenge. You do end up with, like, these massive containers sometimes, or you've got, like, volume management for those models, but that's okay. Kubernetes gives you these Lego blocks that are very flexible, so you can solve it that way. And like I said, that API is great for automation, which means things like dynamic RAG pipelines, where each one of your ML engineers who's making some change to the RAG pipeline gets their own preview environment for that RAG pipeline. That works swimmingly. Like, that's fantastic on Kubernetes. It's kind of what it was designed for.
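Tying back to the StatefulSet discussion above, here is a hedged sketch of the kind of semi-stateful workload Tammer describes, such as per-user notebook servers with durable scratch space. Each replica gets a stable identity and its own PersistentVolumeClaim, so a restarted pod reattaches to its data. Names and sizes are illustrative.

```yaml
# StatefulSet sketch: stable per-replica identity plus one PVC per replica
# via volumeClaimTemplates. Assumes a headless Service named "notebook"
# exists; names and the storage size are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: notebook
spec:
  serviceName: notebook
  replicas: 2
  selector:
    matchLabels:
      app: notebook
  template:
    metadata:
      labels:
        app: notebook
    spec:
      containers:
        - name: jupyter
          image: jupyter/base-notebook:latest
          volumeMounts:
            - name: work
              mountPath: /home/jovyan/work   # notebooks survive pod restarts
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```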
[00:18:26] Tobias Macey:
To that point of being able to spin up these cloned ecosystems or cloned environments of a process or of a workflow: that is something that is very necessary in the experimentation phase of machine learning and data science use cases. What are some of the ways that you're seeing teams address that environment cloning or quick setup, and in particular the experiment tracking, checkpointing, and retry management around that workflow, to be able to accelerate the experimentation cycle of machine learning engineers before they get to the point of solidifying, saying, okay, this is exactly the model that I am building, this is how I am getting it into production? But earlier in that cycle, to be able to use that flexibility of Kubernetes and the ease of being able to take copies of environments, to enable teams to work together on that experimentation.
[00:19:24] Tammer Saleh:
Yeah. I mean, that's a great question. There are a bunch of different tools for that workflow management, but it's partly just the declarative nature of the Kubernetes API that enables that. Right? You don't have to clone an existing RAG pipeline. You already have, you could call it, the template. You've already got the container image and the source code that goes into that and all that. So you just create another identical version, because the Kubernetes API is all about declarative resources and declarative workloads. Now, there are a bunch of higher order tools on top of Kubernetes. Kubeflow is the one that everybody points to. It's getting a little long in the tooth. But, frankly, the problem with the Kubernetes ML ecosystem right now is that, just like the ML ecosystem in general, it's moving so fast and there are so many different solutions that it's really hard to pick a winner right now and to say, like, oh, this is the successor to Kubeflow, everybody should be using this.
Every company that we work with is using a different tool, and some of them, massive enterprise companies, are using Kubeflow all over the place, very happy with it. But it really depends upon your team.
[00:20:39] Tobias Macey:
Kubernetes is something that a lot of operations engineers and cloud engineers have put in the hours and put in the time to figure out and become accustomed to. That's not necessarily the case for ML engineers, because they have enough complexity that they're dealing with just trying to understand how deep learning works and these generative models. What are some of the foundational elements of Kubernetes knowledge that you think are necessary for them to be able to use it effectively to solve their problems, and some of the ways that you, either personally at SuperOrbital but also in the teams that you're working with, are seeing the educational elements of Kubernetes for these ML teams, and some of the ways that you're trying to frame it in a way that feels natural and doesn't become overwhelming and just another job?
[00:21:31] Tammer Saleh:
That's a great question. I mean, I wanna say the ideal answer, which is: you'd need very little knowledge of Kubernetes to be effective in your role. In fact, the authors of Kubernetes have said on the record that they never expected the users of Kubernetes to be slinging YAML. Right? They never expected them to be building the Kubernetes resources by hand. Kubernetes was always intended to be a platform for building platforms. And so in this ideal state, you'd have this perfect platform that insulates the ML engineer from the Kubernetes primitives by using these simple abstractions.
You wouldn't be thinking about containers or pods or services. It would just be, like, here's my code, please run it and give me the answer back. You would need something that exposes metrics, logs, real time telemetry. It would have to provide debugging tools, the works. And we've seen teams go down that path of trying to build that. You end up building a better and more flexible version of Heroku, which means you need millions of dollars of funding to make that work. It never works. Right? That platform, unfortunately, doesn't yet exist. We're always fingers crossed, kinda hoping for that to come about in the marketplace, but it hasn't yet. And in some ways, it's unlikely to exist, because each team is different in their needs, in their workflows. And Kubernetes really appeals to those damn blue collar tweakers, because everyone wants to believe that they're special and their own workflow is special, and Kubernetes facilitates that. Right? It's one of the downsides of Kubernetes from my point of view. It is a bunch of Lego bricks, and you can construct whatever your heart desires in there. And because of that, there's no platform that's gonna have that same level of flexibility. It's kinda the same story as Linux in the nineties. You remember, everybody was trying to produce these admin panels on top of Linux. Like, you'd install Red Hat, you'd get your Red Hat admin panel. And the idea was you could do all of your operations work in that nice little GUI interface, but no one ever successfully provided that abstraction. It was always the case that the devil was in the details, and you always had to drop down to the terminal, edit files by hand, and read all the man pages, because the truth was that the abstraction already being presented by this myriad of utilities on top of Linux turned out to be the simplest thing that made sense if everybody needs to be able to modify it. So in the end, it was simpler for all the people in the nineties just to learn Linux. Like, nowadays, it's kind of assumed that you are competent on the Unix shell. You're not given a Linux laptop as part of your new job and say, like, sweet, where's the admin panel? You'd be fired. Right? So, realistically, it's the same thing. We all need to learn Kubernetes, and I don't think that's a positive statement.
It's a really negative statement. Right? But it is the truth, just like we all have to understand Bash and Linux. Now, it doesn't mean we have to understand all of it. We don't have to become kernel hackers. We don't have to understand the ins and outs of systemd, for example, on Linux. And you don't have to understand every aspect of Kubernetes as well, but you do need to know more than just the pod, Deployment, Service triad. Right? You need to understand things like the pod life cycle, how they start, stop, and fail, or exactly how and when traffic gets routed to a pod, or the different volume types that are provided, and there's a ton of them. And exactly how each of those works, what, you know, ReadWriteMany versus ReadWriteOnce, and all those types of phrases.
And the container ecosystem: you need to understand, like, how containers work, and especially, like, the container image layering system and how layer caches work. That's very important for ML engineers. So I'd like to say that ML engineers do not need to learn the details of Kubernetes. But just like you had to pay that tax to learn Linux back in the day, you have to pay that tax again to learn Kubernetes now. The only silver lining is, just like Linux, it was, like, the last one. Like, it was clear at the time that BSD, OpenBSD, FreeBSD, and, like, Windows and such were not where the market was going for engineering skills. So you were pretty confident, like, okay, I'm gonna invest this time, I'm gonna learn Linux, and it's gonna serve me well for multiple decades.
I truly believe that's the case with Kubernetes. The Kubernetes ecosystem is only growing massively. And so the positive side is you don't need to learn AWS and GCP and all these different APIs. You can just learn the one. Right?
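For the access-mode phrases mentioned above, a hedged illustration: ReadWriteOnce volumes can be mounted read-write by a single node, while ReadWriteMany volumes can be mounted by many nodes at once, which is typically what you want when a fleet of training pods reads one shared dataset. The storage class below is an assumption; ReadWriteMany needs a backing store that supports it, such as NFS or a similar shared filesystem.

```yaml
# PVC sketch showing the ReadWriteMany access mode. The storageClassName
# is hypothetical and depends entirely on what the cluster provides.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
spec:
  accessModes:
    - ReadWriteMany        # mountable read-write from many nodes at once
  storageClassName: nfs-client   # placeholder; cluster-specific
  resources:
    requests:
      storage: 500Gi
```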
[00:25:55] Tobias Macey:
Apropos of nothing, but as you were talking about the Docker layering and image caching, it brought to mind the parallel with deep learning networks and the transformer architectures, of being able to replace the output layer with your customizations and take advantage of all the pretraining that's been done. It's a very interesting parallel there.
[00:26:21] Tammer Saleh:
That's a good analogy. I like that.
[00:26:27] Tobias Macey:
And to your point of Kubernetes being a platform for building platforms: for many years, there have been conversations saying, oh, well, Kubernetes will fade into the background, and you won't have to think about Kubernetes, you'll just be using the thing on top of it. I believed that dream. I did. There have been some projects in the ML and data engineering ecosystem that build on top of Kubernetes and aim to be that. Most notable that come to mind is Flyte, from the folks at Lyft, which is their orchestrator for data engineering and ML workflows. And an early entrant, one of the first ones to build on top of Kubernetes, was Pachyderm.
I'm not sure what the state of that project is. I think they still exist in some fashion. But as you said, they are a facade on top of Kubernetes, and you're never going to be able to do absolutely everything without having to actually dig down into the next layer, Kubernetes itself.
[00:27:16] Tammer Saleh:
Right. It's absolutely true. And when I talk to people about, like, Flyte and the different ML tools, I'm not saying that one of them is not gonna come out as, like, the thing that you use. I really hope that one of them will mature to the point that that's the case. But it's always on the spectrum of either too complicated, so it's really hard to get your ML team going with adopting the tool, or not nearly mature enough, so that you're constantly fighting bugs and having to keep up to date with the latest version in order to avoid all the myriad issues that you're hitting. But, again, it's really about the investment in Kubernetes itself because, like I said, it's a ubiquitous API. I mean, CoreWeave, for example. You're familiar with CoreWeave, the GPU cloud. Right? When you sign up for CoreWeave, you don't get a bespoke API that they created in order to compete with AWS and GCP. What do you get? You get a kubectl config. Right? They give you direct access to Kubernetes, which was a genius move, by the way. And I think we're gonna see more and more clouds adopting that model, just saying, look, Kubernetes is a fantastic API. We do not need to reinvent that. In fact, by just using that, our customers are already familiar with how to use our tool.
[00:28:29] Tobias Macey:
Another interesting aspect of those frameworks on top of Kubernetes is what level of coupling you see as being beneficial, where Flyte and Pachyderm, for instance, are very tightly coupled to Kubernetes. You cannot run one without the other, whereas there are other frameworks that support Kubernetes as an operational substrate but do not require it. And I'm wondering what you see as the appropriate balance for that tooling ecosystem.
[00:29:00] Tammer Saleh:
That's a great question. Everybody's gonna have a different answer, but I actually fall pretty firmly on one side of the spectrum. And what we're talking about is what we internally refer to as either a cloud native workload, where it's just maybe twelve factor, just built for the cloud, and you can run it anywhere, versus a Kubernetes native workload, where it's making use of every ounce of the platform. Now, the same question applied back in the days of raw cloud, AWS, for example. As a company, you could say, I'm gonna just build, like, a generally good distributed system based on Unix best practices, and it happens to run on AWS, which means I'm taking on a lot of the extra responsibilities. Like, I have to, you know, deploy RabbitMQ instead of making use of AWS's queuing system. I have to deploy my own databases instead of RDS. And that is a really good idea if you are concerned about supporting multiple clouds later on. Maybe you wanna go on premise. You don't wanna be beholden to AWS's ecosystem of stuff. But the companies that got the most acceleration were the companies that went all in on one particular cloud. They said, we understand we're never gonna be running on GCP, we're never gonna run on premise, it's not part of our future business model. So we're just gonna make the very most we can out of each cloud. We're gonna go 100% in, full throttle, but they're locked in. The thing with Kubernetes, though, is that it is becoming ubiquitous.
More and more companies are coming to us saying, we want to be able to provide our SaaS on premise, and we see Kubernetes as being the delivery model to do it. So it's actually inverted the equation, where if you go all in on Kubernetes, you're actually making things easier for the vast majority of users out there, because Kubernetes is becoming more and more commonly the thing that people expect to have. Now you've just got a Helm chart. It's super easy to install. Right? So my philosophy is that, at this point, you should not be worried about, in fact, you might even wanna avoid, tools that are designed for general purpose computing and just happen to have Kubernetes as a bolt on attachment, because the integration's not gonna be nearly as good.
It's gonna feel like an afterthought.
[00:31:08] Tobias Macey:
Now, Kubernetes, as we've been saying, has these sharp edges. It has the edge cases that you're going to run into, where the facade that you're putting on top isn't going to do all the pieces that you want it to. I'm wondering, what are some of those elements of Kubernetes and the way that it is designed, the way that it is used or deployed, that start to become a hindrance when you're in an ML workflow?
[00:31:30] Tammer Saleh:
That's a really good question. So I'm gonna say something that I think people will scoff at. And I say that, like, I think a lot of people who use Kubernetes in anger on a daily basis are gonna be like, this is BS, there's no way that's true. I don't think Kubernetes has a lot of sharp edges. And I'm gonna say that because sharp edges are basically justified bugs. Right? There are probably a lot of Python listeners on this channel. So the whole Python default argument thing, where the values are retained between executions of a function call. Right? Come on. That's a bug. Everybody knows that's a bug. Like, no other language does that. It shoots you in the foot every time, and yet the Python team has no intention of fixing it. They're like, no, that's how it's designed to work. Or Ruby, which I used to do a lot of Ruby programming in.
The differences between procs and blocks and lambdas: like, that's crazy. Like, it should have just been one thing. They justified it for the longest time, but it's basically a bug. Right? That's a sharp edge. Kubernetes is actually really well engineered. So, of course, there are bugs in Kubernetes, but they're treated with respect. Every once in a while, there's a bug where the core team will say, we get you. Like, that is definitely a bug. We don't know how to fix it, and we're working on it. Like, one that we teach in our workshops is: traffic will still get routed to pods that are in the process of dying. It's well known and well documented. You could Google this, and you'll get a dozen articles that show you what's going on and explain it very well. And that's a bug. Right? But they're relatively few. What Kubernetes does have is a lot of constraints and complexities that ML engineers are gonna be unused to.
For example, a constraint would be the lack of raw access to the underlying node. Right? An ML engineer, especially a data scientist, is gonna expect to have the entire machine at their disposal. They're used to SCP and SSH. Right? They copy their code in, they SSH in, they run it, and they just leave it running. In Kubernetes, they have a container. So a lot of that access that they need, like CPU, memory, and also access to GPUs, etcetera, all that needs to be configured. And so those constraints usually can be overcome, but in doing so, you're drastically increasing the amount of complexity that you have to deal with in order to get there. It is a lot more complex, especially in the beginning. The activation energy is a lot higher than spinning up an EC2 instance, SSHing in, and just running some random Python code.
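As an aside on the routing-to-dying-pods behavior Tammer called out earlier: the commonly documented mitigation is to delay the container's shutdown briefly so that endpoint removal can propagate before the process stops serving. This is a sketch of that pattern, not a prescribed fix, and the sleep duration is a judgment call rather than a magic number.

```yaml
# Hedged sketch: a preStop hook that sleeps before termination, giving the
# endpoints controller time to stop routing traffic to the dying pod.
# SIGTERM is only sent after the preStop hook completes.
apiVersion: v1
kind: Pod
metadata:
  name: graceful-web
spec:
  terminationGracePeriodSeconds: 40   # must exceed the preStop delay
  containers:
    - name: web
      image: nginx:1.25
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # drain window before shutdown
```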
[00:33:59] Tobias Macey:
In your experience of working in this ecosystem, working with teams who are using Kubernetes for ML workloads, what are some of the most interesting or unexpected or innovative ways that you have seen them tackle that problem?
[00:34:15] Tammer Saleh:
You know, everything about ML workloads on Kubernetes, by the very nature of how nascent this whole industry is: every one of our solutions ends up being interesting and innovative and unusual in some way. Right? We're not at the point of maturity in this industry where you can cookie cutter out a solution and say, here you go, now you've got your ML team, you're good to go. Right? It's always a bespoke configuration that is very dependent upon the needs of the team. We have seen some, like, unexpected, challenging lessons.
So I've said before, multiple times now, that Kubernetes is all about the API. Right? The API is really gorgeous. If you're an engineer, it's worth studying the Kubernetes API just so you can build better APIs in your day job. It is extremely consistent, and it's very predictable because of that. Like, once you get used to Kubernetes, you can just guess what the API is gonna be, and you're probably gonna be right. It's mostly declarative. It could be better on that, but that's a different topic. But it's designed to be declarative. You say, this is the end state I want, and Kubernetes will converge upon that end state. It's democratic, which is very interesting to me. It's one API that you talk to, and it's the same API that all the internal components of Kubernetes also talk to. It's the same language.
It's just a matter of level of access. If you have enough access as a user, you can mimic all the different parts of Kubernetes yourself. Right? You're talking to the same API. And most interesting, the API is expandable. It's extendable. You can do the equivalent of a SQL CREATE TABLE against the API. You can teach the API new tricks through these CRDs. Right? So the API is really fantastic, because it was built by API focused engineers. It came out of the culture of Google. But the thing is, that's, like, at best 10% of the engineering world. Right? I mean, realistically, maybe 1%. And the challenge that we've seen comes when the other 90 to 99% of the industry is forced to use Kubernetes.
So there's sort of an irony there. We created the best system we could imagine for working through an API, and most of the world is baffled by it. And the best part of it, that API, goes completely unexploited by these teams. And data scientists, in particular, are a great example. We're like, here, data science team, we have bequeathed you with this phenomenal API driven system, and the response is, by and large: that's great, I want SSH. Where's my SSH? Like, I'm very confused right now. Why don't I have SSH? And it's not their fault. To be very clear, that complexity is a major fault of Kubernetes.
And that's why these data science teams need other teams around them to produce all that automation and to leverage that API on behalf of the data science team. But, yeah, that's the thing that was most surprising for me, because I came out of that world. I came out of the API driven world, and I'm like, oh, this is perfect, everybody's gonna love this. And it turns out, maybe not so much. Right? It's the law of conservation of complexity, where you can never remove complexity. You can only hide it away in different blocks. Right. Exactly.
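To make the CREATE TABLE analogy above concrete, here is a hedged sketch of a CustomResourceDefinition. Registering it makes the API server accept and store a new kind of object alongside the built-in ones; the group, kind, and fields here are invented purely for illustration.

```yaml
# Minimal CRD sketch: teaches the API server a new resource type. All the
# names (example.com, TrainingRun, gpus) are hypothetical.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: trainingruns.example.com   # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: trainingruns
    singular: trainingrun
    kind: TrainingRun
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                image:
                  type: string
                gpus:
                  type: integer
```

Once applied, kubectl get trainingruns works like any built-in resource, and a controller watching the new type can act on each object, which is how most of the higher-order ML tooling discussed in this episode is built.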
[00:37:39] Tobias Macey:
We've been talking about using Kubernetes for running machine learning. But as you were talking about the fact that Kubernetes is this API driven system that has all of these declarative and eventually consistent capabilities, it has this idea of convergence, in the same way that machine learning models will converge upon the optimal solution within that neural network environment. Which brought to mind the question: what are some of the ways that you're seeing machine learning being applied to Kubernetes for its operational capabilities?
[00:38:10] Tammer Saleh:
Oh, that's a great question. I do have to be careful here, because one of our clients is focused in that space. So how do you say this? There's MLOps and there's AIOps. Right? I think that's the way we've decided to kinda split this up. MLOps is operations for machine learning teams, and AIOps is the other way around: machine learning for Kubernetes, right, or for operations. I think that is going to be very important. I think that's gonna be groundbreaking for the industry, for the Kubernetes industry in particular. I'm also extremely bullish on ML and AI and all this stuff in the future in general. But we just talked about all of that complexity.
And we talked about the fact that you can't build a platform that provides the right abstractions on top of that complexity. It seems fairly natural to me that what we need is something like that solution, something where you're applying AI to that problem. Again, I need to be really careful about how far I go with that conversation. But yeah.
[00:39:11] Tobias Macey:
Yeah. It definitely seems as if, if it doesn't already exist, we are not very far off from that world. The analogous space in data engineering is text to SQL, where I say, this is what I want, and the AI gives me a SQL query that I can run. The equivalent in Kubernetes would be saying, I want this to be running in this environment, and then it says, here's a bunch of YAML. And so, for people who are in the process of deciding how they want to build and manage and deploy their ML workloads and organize their ML teams, what are the cases where Kubernetes is just absolutely the wrong choice?
[00:39:54] Tammer Saleh:
Thank you for asking that, because I feel like I'm coming off as a bit of a Kubernetes fanboy, and there's an obvious bias. We built a company on Kubernetes. Right? But at the same time, our value to our customers is being very transparent and clear about when Kubernetes is not the right solution. I was just talking to a non ML team who needed just a regular platform for 12 factor ish stuff, and I'm, like, you should not be running Kubernetes. It's very complicated. Right? And there are definitely some gaps in the ML space. So, like I mentioned before, ML workloads are very interesting, and training is a great example. You're running those thousands of processes in a batch job in parallel, but again, not independently. If one dies, it's very bad.
And because of that, training is not ideal on Kubernetes. Not only because of the complexity of Kubernetes, but because of limitations with regards to its scheduler. The k8s scheduler is very simplistic. Ironically, it's, like, the biggest chunk of code in Kubernetes. But because scheduling is such a difficult problem, it's not as sophisticated as it could be, especially for batch workloads. There is a concept of a Job in Kubernetes. It's pretty crap. It's basically: run this pod once, fire and forget. It does have these parallelism and completions fields, but they're useless. Like, nobody uses Jobs in that way. There are tools like Kubeflow where you can get the concept of running a set of short lived pods, like, more than one at the same time, but it's still not a great experience. And by the way, that's called gang scheduling. I'm sure your listeners already know: you're running, like, a wide set of processes in this job, and they all have to start at the same time, and the whole job's not complete till they all complete. That's gang scheduling. Right? Now, that's tricky but doable on Kubernetes. The experience is not great. What I find interesting is, there's a fundamental limitation in Kubernetes: it does not know how long a job is supposed to take. Right? You submit a short lived pod. It runs until completion, either failure or success, and then it's done. But you don't tell Kubernetes, by the way, this job should take 4 hours. And because of that, it can't do some really interesting things. Like, imagine you had this massive parallel training job, a thousand pods for a thousand nodes. Each pod takes up basically the entire node. Like, a huge parallel job. I say huge. That job may never run in Kubernetes, even though you've got a thousand nodes, because smaller jobs will keep getting scheduled on those nodes. Like, the Kubernetes scheduler, or Kubeflow, for example, might say, okay, I can't run these yet.
I'm gonna wait until all thousand of the nodes are completely empty. That state's never gonna happen, because in the meantime, the normal k8s scheduler is gonna be sending regular jobs in, and Kubeflow is gonna be sending regular jobs in. The smaller jobs are going to starve the larger job. But older schedulers, like Slurm, for example. Have you heard of Slurm? You've probably used it. Right?

Tobias Macey: Yes. I've been hit with it in HPC.

Tammer Saleh: Yeah. Slurm's terrible. Right? It sucks in many ways. It's basically a hodgepodge of shell script and terrible C code. Right? But it understands the needs of batch jobs in a way that Kubernetes does not. It's also, by the way, much simpler to use. For those who've never used Slurm: it looks like you're just writing a shell script, and you send that into Slurm through sbatch, I think it is. And that's it. Kubernetes is much more complicated to use; you have to create a container image and all these things and YAMLify it. But fundamentally, Slurm knows how long jobs should take. And because of that, it can do these really smart scheduling gymnastics across time. So it can say, well, this job needs all 1,000 of the nodes.
I can see that all of the nodes are currently scheduled to be available in exactly 12 hours. In the meantime, somebody else sends another small job that has marked on it: this is only a 4 hour job. Slurm says, I can totally schedule that. All the nodes will still be available in 12 hours. That's called backfill scheduling, and that is something that you can't even graft onto Kubernetes right now. It is a fundamental limitation of the platform. And because of that, many ML teams have said, you know, we're gonna use Kubernetes for inference, for exploration, for all these different aspects, you know, for managing the model states and all that kind of stuff. But when it comes down to actual training, we're just gonna use a bare cluster of EC2, or actually spin up a Slurm cluster or something like that. We're gonna go a little bit lower level and not try and lean on Kubernetes for the real training.
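For reference, this is roughly what the built-in Job looks like, with the fields Tammer alludes to (the actual names are parallelism and completions). It is a hedged sketch: the image is a placeholder, and, as discussed, nothing here guarantees the eight pods start together, which is exactly the gang scheduling gap.

```yaml
# Indexed Job sketch: 8 pods run in parallel, each receiving its shard
# index via the JOB_COMPLETION_INDEX environment variable. There is no
# gang scheduling: pods are placed independently as capacity allows.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-shards
spec:
  completionMode: Indexed
  completions: 8
  parallelism: 8
  backoffLimit: 2          # failed pods are retried individually
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/team/trainer:latest  # hypothetical
          command: ["sh", "-c", "python train.py --shard $JOB_COMPLETION_INDEX"]
```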
[00:44:57] Tobias Macey:
At the risk of getting too far down another rabbit hole, I'm wondering what you have seen in the broader ecosystem of being able to maybe work at the layer just above the Kubernetes scheduler to take advantage of its flexibility, thinking in particular in terms of things like Ray or Dask, or, I know there's a project called Horovod that is designed for parallel training of large machine learning jobs, things like that.
[00:45:10] Tammer Saleh:
I mean, like I said before, there are a ton of those types of tools. But at the end of the day, some of this is a fundamental limitation in Kubernetes. No tool on top of it can stop the traditional Kubernetes scheduler from throwing the wrong process onto the nodes at the wrong time, so that you can no longer set aside all that space. I think this is really interesting, because until now, the Kubernetes core team has kind of, I don't wanna say rested on their laurels, they're all doing really hard work, but the project has moved into a polish and fit phase, right, where the major features are pretty much done, and the core team has been focused on fixing all those bugs, looking for rough edges, improving things like stateful workload management. But none of it's been, like, fundamental changes to Kubernetes. And I think that the ML industry is starting to push the boundaries of what the core team ever expected workflows to look like, and because of that, it's gonna be putting pressure on the core team to go back to the drawing board for some of these larger features. It's gonna force Kubernetes to evolve when Kubernetes had moved out of an evolution phase.
[00:46:27] Tobias Macey:
And as somebody who is working very closely in the Kubernetes space, and as you're working with clients who are very interested in and invested in this ecosystem of machine learning, what are the elements of the Kubernetes project and ecosystem that you're keeping an eye on to help improve the story for people who are trying to use Kubernetes for machine learning?
[00:46:51] Tammer Saleh:
Yeah. That's a great question. Beyond what we were just talking about with, like, the scheduler and the evolution there, I think there are some other areas where Kubernetes needs to improve, and we're keeping an eye on what those improvements look like. Most of them can be handled by tools on top of Kubernetes. But areas like networking speed and good use of InfiniBand: that's something where you can do it on Kubernetes, but it's really complex and hard. Another one is deeper access to the actual hardware. For example, being able to align a CPU core with a GPU and the memory bus in between, really lining it all up to eke out the most performance that you can get, is something that's, I think, impossible. One of my engineers swears that he knows how to get that done, but I think that's impossible with Kubernetes right now. The different volume types, I think that's going to explode a lot more. And efficient distribution of container images is another area where there are tools for this, but I think that they're gonna have to improve. I think we're gonna see some evolution there as well.
[00:47:56] Tobias Macey:
Are there any other aspects of Kubernetes, machine learning, or the confluence of the two that we didn't discuss yet that you'd like to cover before we close out the show?

Tammer Saleh: I think this has been a fantastic conversation. I feel like we've done a great job of exploring the different corners of the space.

Tobias Macey: Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for machine learning today.
[00:48:32] Tammer Saleh:
That's a great question. Again, I'm not an ML engineer, but from what I've seen, it's that problem of the massively parallel jobs that are not independent. Any one process that dies basically pushes the entire job back to the last checkpoint. Like, hopefully, your jobs are doing really good, regular checkpointing. Your whole job goes back to that last state. And you don't know offhand whether that death was due to a hardware failure or to your own code; you have to do a lot of investigation to figure that out. I think that there's a lot of room in the observability space for these massively parallel jobs, the observability and even remediation space. I think it'll be really interesting to see how ML itself is shined back on itself and used to solve that same problem, because that does feel like a problem that would be really well geared towards an AI solution.
[00:49:15] Tobias Macey:
Well, thank you very much for taking the time today to join me and share your experience and expertise in the Kubernetes ecosystem and the ways that you're seeing machine learning be run on that substrate. I appreciate all the time and energy that you and your team are putting into helping folks on that journey, and I hope you enjoy the rest of your day.

Tammer Saleh: This has been a lot of fun. I really appreciate the conversation, and I hope you have a great rest of your day as well.
[00:49:47] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today, I'm interviewing Tammer Saleh about the potentials and pitfalls of using Kubernetes for your machine learning workloads. So, Tamir, can you start by introducing yourself?
[00:00:31] Tammer Saleh:
Oh, absolutely. Thank you, Tobias. So my name is Tammer. I founded Super Orbital back in 2017. We are a engineering services and and training company focused on solving extremely difficult problems using cloud native technologies. And what that means is just, you know, we do the hard things on Kubernetes. Before that, I was part of Pivotal Cloud Foundry and went back and forth between software engineering and operations. So I kinda have that DevOps mindset just because of my history.
[00:01:06] Tobias Macey:
And do you remember how you first got started working in that cloud and Kubernetes ecosystem and maybe a bit of your exposure to and background in machine learning as well?
[00:01:16] Tammer Saleh:
Yeah. Absolutely. So so, like, the the the cloud ecosystem was kind of interesting because I start I'm old. Right? I started off when, Linux administration meant driving down to the server room and, you know, racking and stacking machines and installing everything from hand. I'm sure you remember. Right? And, yeah, I swear every time I went to the server room, I would come back bleeding. I hated it so much. I carried Band Aids with me because, you know, those VA Linux boxes were terrible. And I remember back in the day, the cloud was just it was literally the thing you drew on the whiteboard. Right? It was the thing to represent the rest of the Internet. Right?
And when cloud computing became a thing, that's when I was working for another startup. It was called, Engine Yard, and it was basically the worst interpretation of cloud computing you could imagine. It was just like we resold AWS machines with a little bit of configuration on them and a lot of white glove service. That was the that was our superpower. But then Heroku came around. Do you do you remember Heroku?
[00:02:24] Tobias Macey:
I actually still use it today for my work.
[00:02:30] Tammer Saleh:
Well, that's fantastic.
[00:02:31] Tobias Macey:
Yes. I've been using Heroku for many years now, off and on, so I'm altogether too familiar with it.
[00:02:38] Tammer Saleh:
Well, the interesting thing is Heroku was even more ambitious when it first came out. Do you remember it had that cloud IDE that was an integral part of Heroku, which they later decommissioned?
[00:02:50] Tobias Macey:
I think I came to it right after that, but I have heard many recountings of that history.
[00:02:55] Tammer Saleh:
Yeah. And it looked like a toy. We made fun of it. We were like, well, nobody's gonna host anything on that. How do you even SSH into that? How do you back it up? How do you do any of the stuff that you're supposed to do? But that was the groundbreaking transitional moment when people really started thinking about what the cloud and what a platform would mean. Right? And to be very frank with you, and I'm saying this as somebody who runs a company based on Kubernetes technology, I still feel that the industry took a step backwards from that twelve factor model.
I think it needed to. The whole 12 Factor thing was a little bit presumptuous, a little bit egotistical. You know, we tell you how you should build applications. But it was very natural for me when I embraced it, because at the end of the day, 12 Factor, the whole Heroku Cloud Foundry model, was really just: write a good UNIX-y process, and we'll take care of it. You know, you stay within the bounds of, like, djbdns, like the old runit stuff. Stay within the bounds of what it means to be a good Unix process, and we're good.
So, anyways, because of that, I kinda naturally transitioned into cloud, moved into Pivotal, ran engineering for Pivotal Cloud Foundry, and then formed SuperOrbital shortly after. With the ML workloads, it's really interesting. Now, first of all, I just wanna say, like, I'm not a lawyer. Right? I am not an ML engineer. That is not my trade. I wrote a couple of neural networks when I was in college. Like, we've all tried neural nets in college. Right? You know, I experimented, and that was basically it. But myself and all the people here are obviously extreme experts in all things Kubernetes and beyond.
And the reason that Kubernetes beat the whole twelve factor platform is because it came to you. Right? It said, look, you don't have to adapt to our model of what compute should look like. We will embrace any workload you wanna throw at us. And so it was really, and it still is, really flexible as a platform. And because of that, it attracts interesting problems, and ML is full of interesting problems. So because of that, we've got a ton of clients that come our way saying, we need to run these weird ML workflows and workloads on Kubernetes, and we help them out.
[00:05:37] Tobias Macey:
And now for people who are maybe tangentially familiar with Kubernetes, or maybe this is the first time that they've heard about it, although they must have been living under a rock for the past 5 years, how do you summarize it? What is your elevator pitch for somebody about what Kubernetes is and why they should care?
[00:05:56] Tammer Saleh:
Yeah. That's a great question. So the simplest way, what it says on the tin for Kubernetes, is that it's a system that will run your containerized workloads at scale in production. And if you don't know what a container is, you could just think of it as a little tiny machine, like a little micro VM. Right? Very fast, very lightweight. If you've used Docker, you know what a container is. And if you've used Docker, you might think, well, I know what Docker Compose is. Like, I could just use that. Why would I not run all my stuff on Docker Compose and call it a day? Because there's a lot of extra stuff that has to be thought about when you're running that in production. Docker Compose is great for your laptop, but, even though I know a couple of companies that have done this before, you should not just Docker Compose up on a server and walk away. Right? You need something stronger for that, and that's what Kubernetes provides. That's what it says on the tin, and that's the thing that you would use it for. But the reason that Kubernetes became so popular so quickly is because it embodied all of the Google SRE best practices, the site reliability engineering best practices. Right? And as part of that, it has this incredible API for automation.
And that is the thing that makes Kubernetes so wonderful and so ubiquitous: it is this one API that, if you as a software engineer learn that API, you no longer have to worry so much about learning the AWS API and the GCP API and the Azure API. And, frankly, the Kubernetes API is extremely well built. I can say with confidence I've never seen an API that is so well reasoned about and so well created. It's just really wonderful. So when we teach students about Kubernetes in, like, our Core K8s class, the message that we hammer home a lot is that, sure, it's there to run your workloads, and it does it using all these SRE best practices, but it's really the API that makes it shine. Unfortunately, though, Kubernetes is pretty complicated. I mean, that workshop I just mentioned is 5 days of, like, afternoons, but tons and tons of content just to get you up to speed on Kubernetes. It is a very complex beast. It's about as complicated as learning Linux from scratch.
[00:08:25] Tobias Macey:
Which many people do just for fun.
[00:08:33] Tammer Saleh:
Yes.
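To make that elevator pitch concrete for readers following along, here is a minimal sketch of a Kubernetes Deployment. It is an illustrative example rather than anything discussed on air, and the name and image are placeholders: you declare how many replicas of a container you want, and Kubernetes keeps that many running, rescheduling them when they fail.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web                       # placeholder name
    spec:
      replicas: 3                     # desired state: keep three copies running
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: nginx:1.27       # any container image; nginx is a stand-in
              ports:
                - containerPort: 80

Applying this with kubectl apply and then deleting one of the pods shows the convergence behavior: a replacement appears without any operator action.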
[00:08:35] Tobias Macey:
For the purpose of this conversation and this audience, Kubernetes as it applies to the machine learning and AI ecosystem, what are the aspects of machine learning that you would like to discuss, and how do they pertain to Kubernetes as an environment, either for deployment or experimentation, and just Kubernetes as a target for machine learning?
[00:08:58] Tammer Saleh:
Right. Great question. And, again, not a machine learning engineer, talking kind of secondhand with regards to the type of work that we help our clients with. Right? But, you know, like I said before, Kubernetes is extremely flexible. So it's kind of there to help with most aspects of ML workloads, you know, model tracking and versioning, exploration through Jupyter Notebooks. We've built systems that run a huge number of Jupyter Notebooks at scale with tons of data behind them. And that's really interesting, because that's a thing 12 Factor would be really bad at, because Jupyter Notebooks are kinda semi stateful. Right? Like, you don't want a long running Jupyter Notebook computation to just be killed in the middle.
Kubernetes is pretty good at that kind of stuff. Obviously, there's inference: you run the model in production and serve it, where you care about things like the efficient, multi-tenant use of GPUs. Then there's also training, where you've got efficient parallel scheduling across massive amounts of GPUs. All these things can be done on top of Kubernetes.
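As a rough sketch of the inference case, a pod can request a GPU through the extended resources mechanism. The image name below is hypothetical, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed on the cluster:

    apiVersion: v1
    kind: Pod
    metadata:
      name: inference
    spec:
      containers:
        - name: model-server
          image: registry.example.com/model-server:v1   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1   # schedule onto a node with a free GPU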
[00:10:05] Tobias Macey:
In your experience of working with teams who do have machine learning workloads and are using Kubernetes: machine learning never exists in a vacuum. There are always many other support systems, both organizationally and technically, not least of which is data engineering, for being able to feed data into those machine learning models for training and updating, but also operations, for being able to monitor those models and identify when they need to be retrained or tuned. What are the benefits of Kubernetes in that socio-technical ecosystem of machine learning, and some of the ways that you've seen that be leveraged to improve the efficacy of machine learning engineers who are trying to just get their job done?
[00:10:49] Tammer Saleh:
That's a great question. I mean, I think it comes back to that ubiquity. Right? Kubernetes provides that one single API that all these different teams are very comfortable working within. The operations team is used to running Kubernetes for their normal full stack workloads. And so it's very comfortable for them to be using Kubernetes to support the data engineers who are moving all this data around. And because Kubernetes is a lot more flexible than the old twelve factor models, it's much easier to construct a system through the Kubernetes LEGO bricks, the different volume types and stuff, where you can efficiently move that data around. So all these support teams, and you're right, we see this in almost every company that we work with. It's not just an ML team off in a corner. You have to have many teams supporting them for them to be effective.
They can all leverage a platform like Kubernetes to get their job done, and they're all speaking the same language that way.
[00:11:51] Tobias Macey:
The other interesting element of Kubernetes is that it has become, I guess, the ochre jelly, if you will, of the cloud ecosystem, where it just consumes everything that it touches and everything that comes in contact with it. And as a result, when people think Kubernetes and the inherent complexity that you already mentioned, that complexity balloons to encompass all of the things that Kubernetes has consumed in the process, or that have become attached to it as a result. For one example, you mentioned that if you're using Kubernetes, you don't necessarily have to be as familiar with the AWS or the GCP APIs. And there's actually a project, I think Crossplane is the name, whose whole purpose is to push those APIs underneath Kubernetes so that you use Kubernetes to manage those resources. And I'm wondering how the emergent complexity of the Kubernetes ecosystem has maybe acted as a deterrent for people from approaching Kubernetes as a solution to their problem, because of the fact that they just view it as this amorphous mass of complexity that they don't know how to get started with.
[00:12:53] Tammer Saleh:
Absolutely. Absolutely. I mean, people talk about the complexity of Kubernetes, and it is very complex. Like I said, it's on par with learning Linux and Bash and all that from scratch. But it's nothing compared to the complexity of the wider CNCF ecosystem. We've all seen it, or if you haven't, just Google CNCF Landscape, and you'll find this website that has all the logos of all the companies and open source technologies that are participating in and surrounding Kubernetes. Right? And some of those are fairly simple: Argo CD, things like that, not too hard to understand and used in almost every Kubernetes installation.
Some things, like Crossplane, are incredibly complicated, almost as complicated as learning Kubernetes itself. Same thing with Istio. Istio is incredibly complicated. And so you do need to be mindful and careful of how much you add on to Kubernetes, because you're taking on the load of maintenance and understanding and configuration and everything with every one of those components you use. You're spending your innovation points. Right?
[00:14:04] Tobias Macey:
Now, for the purpose of actually using Kubernetes in an ML team or as an ML engineer, you mentioned the statefulness aspect when you were talking about Jupyter Notebooks. Machine learning is inherently a stateful operation, at least for the training portion, serving slightly less so. Because of that fact, what are the ways that machine learning workflows start to hit up against the edge cases of Kubernetes, and some of the ways that you have to customize your Kubernetes runtime to account for the statefulness and also the extremes of the use cases around machine learning?
[00:14:43] Tammer Saleh:
Right. I mean, Kubernetes had, for the longest time, a bad reputation around stateful workloads. Right? It started as completely stateless. It was, you know, run your full stack workloads on Kubernetes, and if you needed a database, well, that's what you go to Amazon for afterwards. Right? Databases should be outside of k8s. And then it added this thing called the StatefulSet, which was originally called the PetSet back in 1.5. It has been a slowly maturing feature; things like volume snapshots are relatively new. But at this point, it's about polish with Kubernetes and stateful workloads.
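For reference, a StatefulSet sketch looks something like the following; the Postgres details are illustrative only. Each replica gets a stable identity (db-0, db-1) and its own PersistentVolumeClaim stamped out from the template:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db
    spec:
      serviceName: db
      replicas: 2
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
            - name: postgres
              image: postgres:16
              env:
                - name: POSTGRES_PASSWORD
                  value: change-me          # demo only; use a Secret in practice
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:                 # one volume per replica, reattached on reschedule
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi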
Many of our customers run databases, large amounts of data, on Kubernetes, and it's just fine. Right? It works well. Does it work better than running it on raw EC2? Probably not. In fact, definitely not. But you get that ubiquitous API. You get that consistent substrate for managing those databases, and that's worth it to a lot of our customers. And you're right. Machine learning workloads have a lot of state associated with them, a lot of data associated with them, but they're not quite what I'd call stateful in the same way that a database is. I think of them as semi stateful. And what I mean by that is, if you lose your disk during the training phase of working with a model, your company is not gonna go bankrupt. Right? Like, when GitHub lost their primary database way back in the day, that was big news. Right? Everybody who used GitHub knew that that had happened, and it caused days of downtime while they recovered. It was really bad.
And, yeah, you'll lose, you know, potentially hours or days of data if you lose your disk, but it's not the same as a database. That being said, ML workloads are very interesting. Training, for example: you're running thousands of these processes in a single batch job, in parallel but not independently. Right? So when you're a full stack developer, you think, oh, in parallel, that's great. That means, like, if I'm running a thousand of these processes and one of them dies, it's totally fine. That's not how it works with training, as a bunch of your audience probably already knows. One of those processes crashes, and that's it for the job: back to the last checkpoint. So, hopefully, your training process is doing frequent checkpointing.
And that's especially bad when coupled with the GPUs that are being pushed out of the factory so quickly nowadays. Like, they're very flaky. So your processes will die frequently, and you won't know if it's because your Python code had a bug in it or because the GPU itself just, you know, pooped the bed. That's some of the interesting stuff that is challenging on Kubernetes. I can get more into that in a bit. But there are parts of Kubernetes that work really well with ML workloads. Again, it was designed to do everything. Some things can take more effort, but things like inference work very well, where you do have challenges around distributing the data of that massive model.
But Kubernetes has tons of solutions for that kind of challenge. You do end up with, like, these massive containers sometimes, or you've got, like, volume management for those models, but that's okay. Kubernetes gives you these LEGO blocks that are very flexible, so you can solve it that way. And like I said, that API is great for automation, which means things like dynamic RAG pipelines, where each one of your ML engineers who's making some change to the RAG pipeline gets their own preview environment for that RAG pipeline. That works swimmingly. Like, that's fantastic on Kubernetes. It's kind of what it was designed for.
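The preview-environment pattern Tammer describes can be as simple as applying the same declarative manifests into a fresh namespace per engineer. A hypothetical sketch, with rag-preview-alice standing in for an engineer-specific environment:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: rag-preview-alice     # hypothetical per-engineer namespace
      labels:
        team: ml
        purpose: rag-preview

With the namespace in place, something like kubectl apply -n rag-preview-alice -f rag-pipeline/ gives that engineer an isolated copy of the whole pipeline from the same templates.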
[00:18:26] Tobias Macey:
To that point of being able to spin up these cloned ecosystems or cloned environments of a process or of a workflow, that is something that is very necessary in the experimentation phase of machine learning and data science use cases. What are some of the ways that you're seeing teams address that environment cloning or quick setup, and in particular the experiment tracking, checkpointing, and retry management around that workflow, to be able to accelerate the experimentation cycle of machine learning engineers before they get to the point of solidifying and saying, okay, this is exactly the model that I am building, this is how I am getting it into production? But earlier in that cycle, to be able to use that flexibility of Kubernetes and the ease of being able to take copies of environments, to enable teams to work together for that experimentation.
[00:19:24] Tammer Saleh:
Yeah. I mean, that's a great question. There are a bunch of different tools for that workflow management, but it's the declarative nature of the Kubernetes API that enables it. Right? You don't have to clone an existing RAG environment. You already have what you might call the template. You've already got the container image and the source code that goes into it and all that. So you just create another identical version, because the Kubernetes API is all about declarative resources and declarative workloads. Now, there are a bunch of higher order tools on top of Kubernetes. Kubeflow is the one that everybody points to. It's getting a little long in the tooth. But, frankly, the problem with the Kubernetes ML ecosystem right now is that, just like the ML ecosystem in general, it's moving so fast and there are so many different solutions that it's really hard to pick a winner right now and to say, oh, this is the successor to Kubeflow. Everybody should be using this.
Every company that we work with is using a different tool, and some of them, massive enterprise companies, are using Kubeflow all over the place, very happy with it. But it really depends upon your team.
[00:20:39] Tobias Macey:
Kubernetes is something that a lot of operations engineers and cloud engineers have put in the hours and put in the time to figure out and become accustomed to. That's not necessarily the case for ML engineers, because they have enough complexity that they're dealing with just trying to understand how deep learning works and these generative models. What are some of the foundational elements of Kubernetes knowledge that you think are necessary for them to be able to use it effectively to solve their problems, and some of the ways that you, either personally at SuperOrbital, but also in the teams that you're working with, are seeing the educational elements of Kubernetes for these ML teams, and some of the ways that you're trying to frame it in a way that feels natural and doesn't become overwhelming and just another job?
[00:21:31] Tammer Saleh:
That's a great question. I mean, I wanna say the ideal answer, which is: you'd need very little knowledge of Kubernetes to be effective in your role. In fact, the authors of Kubernetes have said on the record that they never expected the users of Kubernetes to be slinging YAML. Right? They never expected them to be building the Kubernetes resources by hand. Kubernetes was always intended to be a platform for building platforms. And so in this ideal state, you'd have this perfect platform that insulates the ML engineer from the Kubernetes primitives by using these simple abstractions.
You wouldn't be thinking about containers or pods or services. It would just be: here's my code, please run it and give me the answer back. You would need something that exposes metrics, logs, real time telemetry. It would have to provide debugging tools, the works. And we've seen teams go down that path of trying to build that. You end up building a better and more flexible version of Heroku, which means you need millions of dollars of funding to make that work. It never works. Right? That platform, unfortunately, doesn't yet exist. We're always fingers crossed, kinda hoping for that to come about in the marketplace, but it hasn't yet. And in some ways, it's unlikely to exist, because each team is different in their needs and their workflows. And Kubernetes really appeals to those damn blue collar tweakers, because everybody wants to believe that they're special and their own workflow is special, and Kubernetes facilitates that. Right? It's one of the downsides of Kubernetes from my point of view. It is a bunch of LEGO bricks, and you can construct whatever your heart desires in there. And because of that, there's no platform that's gonna have that same level of flexibility. It's kinda the same story as Linux in the nineties. You remember, everybody was trying to produce these admin panels on top of Linux. Like, you'd install Red Hat, you'd get your Red Hat admin panel. And the idea was you could do all of your operations work in that nice little GUI interface, but no one ever successfully provided that abstraction. It was always the case that the devil was in the details, and you always had to drop down to the terminal, edit files by hand, and read all the man pages, because the truth was, the abstraction already being presented by this myriad of utilities on top of Linux turned out to be the simplest thing that made sense if everybody needs to be able to modify it. So in the end, it was simpler for all the people in the nineties to just learn Linux. Like, nowadays, it's kind of assumed that you are competent on the Unix shell. You're not given a Linux laptop as part of your new job and say, like, sweet, where's the admin panel? You'd be fired. Right? So, realistically, it's the same thing. We all need to learn Kubernetes, and I don't think that's a positive statement.
It's a really negative statement. Right? But it is the truth, just like we all have to understand Bash and Linux. Now, it doesn't mean we have to understand all of it. We don't have to become kernel hackers. We don't have to, like, understand the ins and outs of systemd, for example, on Linux. And you don't have to understand every aspect of Kubernetes as well, but you do need to know more than just the pod, deployment, service triad. Right? You need to understand things like the pod life cycle, how they start, stop, and fail; or exactly how and when traffic gets routed to a pod; or the different volume types that are provided, and there's a ton of them, and exactly how each of those works: ReadWriteMany versus ReadWriteOnce and all those types of phrases.
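Those access modes matter a lot for ML data volumes. A sketch of a PersistentVolumeClaim, with the size and mode chosen only for illustration; ReadWriteMany requires a storage backend that supports it, such as an NFS-style filesystem:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: training-data
    spec:
      accessModes:
        - ReadWriteMany    # many pods, e.g. parallel workers, mount it at once
      resources:
        requests:
          storage: 500Gi   # illustrative size

ReadWriteOnce, by contrast, limits the volume to pods on a single node, which regularly surprises teams fanning a dataset out across workers.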
And the container ecosystem: you need to understand, like, how containers work, and especially the container image layering system and how layer caches work. That's very important for ML engineers. So I'd like to say that ML engineers do not need to learn the details of Kubernetes. But just like you had to pay that tax to learn Linux back in the day, you have to pay that tax again to learn Kubernetes now. The only silver lining is, just like Linux, it was, like, the last one. Like, it was clear at the time that BSD, OpenBSD, FreeBSD, and, like, Windows and such were not where the market was going for engineering skills. So you were pretty confident, like, okay, I'm gonna invest this time. I'm gonna learn Linux, and it's gonna serve me well for multiple decades.
I truly believe that's the case with Kubernetes. The Kubernetes ecosystem is only growing massively. And so the positive side is you don't need to learn AWS and GCP and all these different APIs. You can just learn the one. Right?
[00:25:55] Tobias Macey:
Apropos of nothing, but as you were talking about the Docker layering and image caching, it brought to mind the parallel with deep learning networks and the transformer architectures, of being able to replace the output layer with your customizations and take advantage of all the pretraining that's been done. It's a very interesting parallel there.
[00:26:21] Tammer Saleh:
That's a good analogy. I like that.
[00:26:27] Tobias Macey:
And to your point of Kubernetes being a platform for building platforms, for many years there have been conversations saying, oh, well, Kubernetes will fade into the background, and you won't have to think about Kubernetes. You'll just be using the thing on top of it. I believed that dream. I did. There have been some projects in the ML and data engineering ecosystem that build on top of Kubernetes and aim to be that. Most notably that come to mind are Flyte, from the folks at Lyft, which is their orchestrator for data engineering and ML workflows. And an early entrant, one of the first ones to build on top of Kubernetes, was Pachyderm.
I'm not sure what the state of that project is. I think they still exist in some fashion. But as you said, they are a facade on top of Kubernetes, and you're never going to be able to do absolutely everything without having to actually dig down into the next layer of Kubernetes itself.
[00:27:16] Tammer Saleh:
Right. It's absolutely true. And when I talk to people about, like, Flyte and the different ML tools, I'm not saying that one of them is not gonna come out as, like, the thing that you use. I really hope that one of them will mature to the point that that's the case. But it's always on the spectrum of either too complicated, so it's really hard to get your ML team going with adopting the tool, or not nearly mature enough, so that you're constantly fighting bugs and having to keep up to date with the latest version in order to avoid all the myriad issues that you're hitting. But, again, it's really about the investment in Kubernetes itself because, like I said, it's a ubiquitous API. I mean, CoreWeave, for example. You're familiar with CoreWeave, the GPU cloud. Right? When you sign up for CoreWeave, you don't get a bespoke API that they created in order to compete with AWS and GCP. What do you get? You get a kubectl config. Right? They give you direct access to Kubernetes, which was a genius move, by the way. And I think we're gonna see more and more clouds adopting that model. Just saying, look, Kubernetes is a fantastic API. We do not need to reinvent that. In fact, by just using that, our customers are already familiar with how to use our tool.
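For context, "you get a kubectl config" means something like the sketch below, with every value hypothetical. Point kubectl at a file like this and the provider's cluster behaves like any other Kubernetes API endpoint:

    apiVersion: v1
    kind: Config
    clusters:
      - name: gpu-cloud                      # hypothetical provider cluster
        cluster:
          server: https://k8s.example.com    # placeholder API endpoint
    users:
      - name: me
        user:
          token: REDACTED                    # credential issued by the provider
    contexts:
      - name: gpu-cloud
        context:
          cluster: gpu-cloud
          user: me
    current-context: gpu-cloud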
[00:28:29] Tobias Macey:
Another interesting aspect of those frameworks on top of Kubernetes is what level of coupling you see as being beneficial, where Flyte and Pachyderm, for instance, are very tightly coupled to Kubernetes. You cannot run one without the other, whereas there are other frameworks that support Kubernetes as an operational substrate but do not require it. And I'm wondering what you see as the appropriate balance for that tooling ecosystem.
[00:29:00] Tammer Saleh:
That's a great question. I mean, everybody's gonna have a different answer, but I actually fall pretty firmly on one side of the spectrum. And what we're talking about is what we internally refer to as either a cloud native workload, where it's just maybe 12 Factor, built for the cloud, and you can run it anywhere, right, versus a Kubernetes native workload, where it's making use of every ounce of the platform. Now, the same question applied back in the days of raw cloud, AWS, for example. As a company, you could say, I'm gonna just build, like, a generally good distributed system based on Unix best practices, and it happens to run on AWS, which means I'm taking on a lot of the extra responsibilities. Like, I have to, you know, deploy RabbitMQ instead of making use of AWS's queuing system. I have to deploy my own databases instead of RDS. And that is a really good idea if you are concerned about supporting multiple clouds later on. Maybe you wanna go on premise. You don't wanna be beholden to AWS's ecosystem of stuff. But the companies that got the most acceleration were the companies that went all in on one particular cloud. They said, we understand. We're never gonna be running on GCP. We're never gonna run on premise. It's not part of our future business model. So we're just gonna make the very most we can out of each cloud. We're gonna go 100% in, full throttle. But they're locked in. The thing with Kubernetes, though, is that it is becoming ubiquitous.
More and more companies are coming to us saying, we want to be able to provide our SaaS on premise, and we see Kubernetes as being the delivery model to do it. So it's actually inverted the equation, where if you go all in on Kubernetes, you're actually making things easier for the vast majority of users out there, because Kubernetes is becoming more and more commonly the thing that people expect to have. Now you've just got a Helm chart. It's super easy to install. Right? So my philosophy is that at this point, you should not be worried about lock-in. In fact, you might even wanna avoid tools that are designed for general purpose computing and just happen to have Kubernetes as a bolt-on attachment, because the integration's not gonna be nearly as good.
It's gonna feel after the fact.
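As a small illustration of that Helm delivery model, the metadata for a chart is just another declarative file; the application name here is a placeholder:

    # Chart.yaml
    apiVersion: v2
    name: myapp                 # placeholder chart name
    description: On-premise distribution of a hypothetical SaaS
    type: application
    version: 0.1.0              # chart version
    appVersion: "1.0.0"         # version of the packaged application

Customers then install or upgrade the whole stack with a single helm install or helm upgrade against their own cluster.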
[00:31:08] Tobias Macey:
Now, Kubernetes, as we've been saying, has these sharp edges. It has the edge cases that you're going to run into, where the facade that you're putting on top isn't going to do all the pieces that you want it to. I'm wondering, what are some of those elements of Kubernetes, in the way that it is designed and the way that it is used or deployed, that start to become a hindrance when you're in an ML workflow?
[00:31:30] Tammer Saleh:
That's a really good question. So I'm gonna say something that I think people will scoff at. And I say that, like, I think a lot of people who use Kubernetes in anger on a daily basis are gonna be like, this is BS. There's no way that's true. I don't think Kubernetes has a lot of sharp edges. And I'm gonna say that because sharp edges are basically justified bugs. Right? There are probably a lot of Python listeners on this channel. So the whole Python default argument thing, where the default values are retained between executions of a function call. Right? Come on. That's a bug. Everybody knows that's a bug. Like, no other language does that. It shoots you in the foot every time, and yet the Python team has no intention of fixing it. They're like, no, that's how it's designed to work. And Ruby, I used to do a lot of Ruby programming.
The differences between procs and blocks and lambdas, like, that's crazy. Like, it should have just been one thing. They justified it for the longest time, but it's basically a bug. Right? That's a sharp edge. Kubernetes is actually really well engineered. So, of course, there are bugs in Kubernetes, but they're treated with respect. Every once in a while, there's a bug where the core team will say, we get you. Like, that is definitely a bug. We don't know how to fix it, and we're working on it. Like, one that we teach in our workshops is: traffic will still get routed to pods that are in the process of dying. It's well known and well documented. You could Google this, and you'll get 12 articles that show you what's going on and explain it very well. And that's a bug. Right? But they're relatively few.
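For listeners who run into that terminating-pods behavior, a widely used mitigation, not a fix, is a short preStop sleep so endpoint removal can propagate before the container exits. A sketch, with illustrative values:

    apiVersion: v1
    kind: Pod
    metadata:
      name: web
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: web
          image: nginx:1.27
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # crude buffer while endpoints update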
What Kubernetes does have is a lot of constraints and complexities that ML engineers are gonna be unused to. For example, a constraint would be the lack of raw access to the underlying node. Right? An ML engineer, especially a data scientist, is gonna expect to have the entire machine at their disposal. They're used to SCP and SSH. Right? They copy their code in, they SSH in, they run it, and they just leave it running. In Kubernetes, they have a container. So a lot of the access that they need, like CPU, memory, and also access to GPUs, etcetera, all of that needs to be configured. And so those constraints usually can be overcome, but in doing so, you're drastically increasing the amount of complexity that you have to deal with in order to get there. It is a lot more complex, especially in the beginning. The activation energy is a lot higher than spin up an EC2 instance, SSH in, and just run some random Python code.
[00:33:59] Tobias Macey:
In your experience of working in this ecosystem, working with teams who are using Kubernetes for ML workloads, what are some of the most interesting or unexpected or innovative ways that you have seen them tackle that problem?
[00:34:15] Tammer Saleh:
You know, everything about ML workloads on Kubernetes, by the very nature of how nascent this whole industry is, means every one of our solutions ends up being interesting and innovative and unusual in some way. Right? We're not at the point of maturity in this industry where you can cookie cutter out a solution and say, here you go, now you've got your ML team, you're good to go. Right? It's always a bespoke configuration that is very dependent upon the needs of the team. We have seen some, like, unexpected, challenging lessons.
So I've said before, multiple times now, that Kubernetes is all about the API. Right? The API is really gorgeous. If you're an engineer, it's worth studying the Kubernetes API just so you can build better APIs in your day job. It is extremely consistent. It's very predictable because of that. Like, once you get used to Kubernetes, you can just guess what the API is gonna be, and you're probably gonna be right. It's mostly declarative. It could be better on that, but that's a different topic. But it's designed to be declarative. You say, this is the end state I want, and Kubernetes will converge upon that end state. It's democratic, which is very interesting to me. It's one API that you talk to, and it's the same API that all the internal components of Kubernetes also talk to. It's the same language.
It's just a matter of level of access. If you have enough access as a user, you can mimic all the different parts of Kubernetes yourself. Right? You're talking to the same API. And most interesting, the API is expandable. It's extendable. You can do the equivalent of a SQL CREATE TABLE against the API. You can teach the API new tricks through these CRDs. Right? So the API is really fantastic, because it was built by API focused engineers. It came out of the culture of Google. But the thing is, that's, at best, 10% of the engineering world. Right? I mean, realistically, maybe 1%. And the challenge that we've seen comes when the other 90 to 99% of the industry is forced to use Kubernetes.
So there's sort of an irony there. We created the best system we could imagine for working through an API, and most of the world is baffled by it. And the best part of it, that API, goes completely unexploited by these teams. And data scientists, in particular, are a great example. We're like, here, data science team, we have bequeathed you with this phenomenal API-driven work-doing system, and the response is, by and large: that's great. I want SSH. Where's my SSH? Like, I'm very confused right now. Why don't I have SSH? And the value of that API goes unrealized. And it's not their fault, to be very clear. That complexity is a major fault of Kubernetes.
And that's why these data science teams need other teams around them to produce all that automation and to leverage that API on behalf of the data science team. But, yeah, that's the thing that was most surprising for me, because I came out of that world. I came out of the API driven world, and I'm like, oh, this is perfect. Everybody's gonna love this. And it turns out, maybe not so much. Right? It's the law of conservation of complexity, where you can never remove complexity. You can only hide it away in different blocks. Right. Right. Exactly. Exactly.
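To make the CREATE TABLE analogy concrete, a CustomResourceDefinition teaches the API server a brand new resource type. The TrainingJob kind below is made up purely for illustration:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: trainingjobs.example.com   # must be <plural>.<group>
    spec:
      group: example.com
      scope: Namespaced
      names:
        plural: trainingjobs
        singular: trainingjob
        kind: TrainingJob
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    gpus:
                      type: integer    # hypothetical field: GPUs per worker

Once applied, kubectl get trainingjobs works exactly like any built-in resource, which is the extensibility being described here.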
[00:37:39] Tobias Macey:
We've been talking about using Kubernetes for running machine learning. But as you were talking about the fact that Kubernetes is this API driven system with all of these declarative and eventually consistent capabilities, it has the idea of convergence in the same way that machine learning models will converge upon the optimal solution within that neural network environment. Which brought to mind the question: what are some of the ways that you're seeing machine learning being applied to Kubernetes for its operational capabilities?
[00:38:10] Tammer Saleh:
Oh, that's a great question. I do have to be careful here, because one of our clients is focused in that space. So how do I say this? There's MLOps and there's AIOps. Right? I think that's the way we've decided to kinda split this up. MLOps is operations for machine learning teams, and AIOps is the other way around: machine learning for operations. I think that is going to be very important. I think that's gonna be groundbreaking for the industry, for the Kubernetes industry in particular. I'm also extremely bullish on ML and AI and all this stuff in the future in general. But we just talked about all of that complexity.
And we talked about the fact that you can't build a platform that provides the right abstractions on top of that complexity. It seems fairly natural to me that what we need is something like that solution, something where you're applying AI to that problem. Again, I need to be really careful about how far I go with that conversation.
[00:39:11] Tobias Macey:
Yeah. It definitely seems as if, if it doesn't already exist, we are not very far off from this world. The analogous space in data engineering is text to SQL, where I say, this is what I want, and the AI gives me a SQL query that I can run. The equivalent in Kubernetes would be saying, I want this to be running in this environment, and it says, here's a bunch of YAML. And so for people who are in the process of deciding how they want to build and manage and deploy their ML workloads and organize their ML teams, what are the cases where Kubernetes is just absolutely the wrong choice?
[00:39:54] Tammer Saleh:
Thank you for asking that, because I feel like I'm coming off as a bit of a Kubernetes fanboy, and there's an obvious bias. We built a company on Kubernetes. Right? But at the same time, our value to our customers is being very transparent and clear about when Kubernetes is not the right solution. I was just talking to a non-ML team who needed just a regular platform for 12 Factor-ish stuff, and I'm like, you should not be running Kubernetes. It's very complicated. Right? And there are definitely some gaps in the ML space. So, like I mentioned before, ML workloads are very interesting, and training is a great example. You're running those thousands of processes in a batch job, in parallel but not independently, and if one dies, it's very bad.
And because of that, training is not ideal on Kubernetes, not only because of the complexity of Kubernetes, but because of limitations with regards to its scheduler. The k8s scheduler is very simplistic. Ironically, it's like the biggest chunk of code in Kubernetes, but because scheduling is such a difficult problem, it's not as sophisticated as it could be, especially for batch workloads. There is a concept of a Job in Kubernetes. It's pretty crap. It's basically: run this pod once, fire and forget. It does have these parallelism and completions fields, but they're useless. Like, nobody uses Jobs in that way. There are tools like Kubeflow where you can get the concept of run this set of short lived pods, more than one at the same time, but it's still not a great experience. And by the way, that's called gang scheduling, as I'm sure your listeners already know. You're running, like, a wide set of processes in this job, they all have to start at the same time, and the whole job's not complete till they all complete. That's gang scheduling. Right? Now, that's tricky but doable on Kubernetes. The experience is not great. What I find interesting is, there's a fundamental limitation in Kubernetes: it does not know how long a job is supposed to take. Right? You submit a short lived pod. It runs until completion, either good or bad, failure or success, and then it's done. But you don't tell Kubernetes, by the way, this job should take 4 hours. And because of that, it can't do some really interesting things. Like, imagine you had this massive parallel training job, a thousand pods for a thousand nodes. Each pod takes up basically the entire node. A huge parallel job, and I say huge. That job may never run in Kubernetes, even though you've got a thousand nodes, because smaller jobs will keep getting scheduled on those nodes. Like, the Kubernetes scheduler, or Kubeflow, for example, might say, okay, I can't run these yet.
I'm gonna wait until all thousand of the nodes are completely empty. That state's never gonna happen, because in the meantime, the normal k8s scheduler is gonna be sending regular jobs in, and Kubeflow is gonna be sending regular jobs in. The smaller jobs are going to starve the larger job. But older schedulers, like Slurm, for example. Have you heard of Slurm? You've probably used it. Right? Yes, I've been there with it in HPC. Yeah. Slurm's terrible. Right? It sucks in many ways. It's basically a hodgepodge of shell script and terrible C code. Right? But it understands the needs of batch jobs in a way that Kubernetes does not. It's also, by the way, much simpler to use. For those who have never used Slurm, it looks like you're just writing a shell script, and you send that into Slurm through sbatch, I think it is. And that's it. Kubernetes is much more complicated: you have to create a container image and all these things and YAMLify it. But fundamentally, Slurm knows how long jobs should take. And because of that, it can do these really smart scheduling gymnastics across time. So it can say, well, this job needs all 1,000 of the nodes.
I can see that all of the nodes are currently scheduled to be available in exactly 12 hours. In the meantime, somebody else sends another small job that has marked on it: this is only a 4 hour job. Slurm says, I can totally schedule that, and all the nodes will still be available in 12 hours. That's called backfill scheduling, and that is something that you can't even graft onto Kubernetes right now. It is a fundamental limitation of the platform. And because of that, many ML teams have said, you know, we're gonna use Kubernetes for inference, for exploration, for managing the model states and all that kind of stuff. But when it comes down to actual training, we're just gonna use a bare cluster of EC2, or actually spin up a Slurm cluster or something like that. We're gonna go a little bit lower level and not try and lean on Kubernetes for the real training.
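For reference, the built-in Job API that Tammer calls limited looks roughly like this sketch. The parallelism and completions fields do exist, but notice there is no field anywhere for expected duration, which is exactly what Slurm-style backfill scheduling would need; the trainer image is hypothetical:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: train-shard
    spec:
      parallelism: 8      # run up to 8 worker pods at once
      completions: 8      # done when 8 pods have succeeded
      backoffLimit: 2     # retry budget for failed pods
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: registry.example.com/trainer:latest   # hypothetical image
              command: ["python", "train.py"]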
[00:44:57] Tobias Macey:
At the risk of getting too far down another rabbit hole, I'm wondering what you have seen in the broader ecosystem of being able to maybe work at the layer just above the Kubernetes scheduler to take advantage of its flexibility, thinking in particular in terms of things like Ray or Dask, or, I know there's a project called Horovod that is designed for parallel training of large machine learning jobs, things like that.
[00:45:10] Tammer Saleh:
I mean, like I said before, there's a ton of those types of tools. But at the end of the day, some of this is a fundamental limitation in Kubernetes. No tool on top of it can stop the traditional Kubernetes scheduler from throwing the wrong process onto the nodes at the wrong time, so that you can no longer set aside all that space. I think this is really interesting, because until now, the Kubernetes core team has kind of, I don't wanna say rested on their laurels, they're all doing really hard work, but the project has moved into a polish and fit phase, right, where the major features are pretty much done, and the core team has been focused on fixing all those bugs, looking for rough edges, improving things like stateful workload management. But none of it's been, like, fundamental changes to Kubernetes. And I think that the ML industry is starting to push the boundaries of what the core team ever expected workflows to look like, and because of that, it's gonna be putting pressure on the core team to go back to the drawing board for some of these larger features. It's gonna force Kubernetes to evolve when Kubernetes had moved out of an evolution phase.
[00:46:27] Tobias Macey:
And as somebody who is working very closely in the Kubernetes space, and as you're working with clients who are very interested in and invested in this ecosystem of machine learning, what are the elements of the Kubernetes project and ecosystem that you're keeping an eye on to help improve the story for people who are trying to use Kubernetes for machine learning?
[00:46:51] Tammer Saleh:
Yeah, that's a great question. Beyond what we were just talking about with, like, the scheduler and the evolution there, I think there are some other areas where Kubernetes needs to improve, and we're keeping an eye on what those improvements look like. Most of them can be handled by tools on top of Kubernetes. But areas like networking speed, good use of InfiniBand, and things like that: that's something where Kubernetes, you can do it, but it's really complex and hard. Another one is deeper access to the actual hardware. For example, being able to align a CPU core with a GPU core and the memory bus in between, really lining it up to eke out the most performance that you can get, is something that's, I think, impossible. One of my engineers swears that he knows how to get that done, but I think that's impossible with Kubernetes right now. The different volume types, I think that's going to explode a lot more. And efficient distribution of container images is another area where there are tools for this, but I think that they're gonna have to improve. I think we're gonna see some evolution there as well.
[00:47:56] Tobias Macey:
Are there any other aspects of Kubernetes, machine learning, or the confluence of the two that we didn't discuss yet that you'd like to cover before we close out the show? I think this has been a fantastic conversation. I feel like we've done a great job of exploring the different corners of the space. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for machine learning today.
[00:48:32] Tammer Saleh:
That's a great question. Again, I'm not an ML engineer, but from what I've seen, it's that problem of the massively parallel jobs that are not independent, where any one process that dies basically pushes the entire job back to the last checkpoint. Like, hopefully, your jobs are doing really good, regular checkpointing, but your whole job goes back to that last state. And you don't know if that death was due to, I mean, you have to do a lot of investigation to figure out if that death was due to a hardware failure or to your own code. I think that there's a lot of room in the observability space for these massively parallel jobs, the observability and even remediation space. I think it'll be really interesting to see how ML is shined back on itself and used to solve that same problem, because that does feel like a problem that would be really well geared towards an AI solution.
[00:49:15] Tobias Macey:
Well, thank you very much for taking the time today to join me and share your experience and expertise in the Kubernetes ecosystem and the ways that you're seeing machine learning be run on that substrate. I appreciate all the time and energy that you and your team are putting into helping folks on that journey, and I hope you enjoy the rest of your day. This has been a lot of fun. I really appreciate the conversation, and hope you have a great rest of your day as well.
[00:49:47] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Tammer's Background and Journey into Cloud and Kubernetes
Kubernetes Flexibility and Machine Learning Workloads
Explaining Kubernetes to Beginners
Kubernetes for Machine Learning and AI Ecosystem
Benefits of Kubernetes in Machine Learning Teams
Challenges of Stateful Workloads in Kubernetes
Educational Elements of Kubernetes for ML Engineers
Hindrances of Kubernetes in ML Workflows
Applying Machine Learning to Kubernetes Operations
When Kubernetes is the Wrong Choice for ML Workloads
Broader Ecosystem and Future of Kubernetes in ML
Key Elements to Improve Kubernetes for ML
Closing Remarks and Contact Information