Summary
Because machine learning models are constantly interacting with inputs from the real world, they are subject to a wide variety of failures. The most commonly discussed error condition is concept drift, but there are numerous other ways that things can go wrong. In this episode Wojtek Kuberski explains how NannyML is designed to compare the predicted performance of your model against its actual behavior to identify silent failures and provide context to allow you to determine whether and how urgently to address them.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix, and track their data across the ML workflow (pre-training, post-training, and post-production) – no more Excel sheets or ad-hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
- Your host is Tobias Macey and today I’m interviewing Wojtek Kuberski about NannyML and the work involved in post-deployment data science
- Introduction
- How did you get involved in machine learning?
- Can you describe what NannyML is and the story behind it?
- What is "post-deployment data science"?
- How does it differ from the metrics/monitoring approach to managing the model lifecycle?
- Who is typically responsible for this work? How does NannyML augment their skills?
- What are some of your experiences with model failure that motivated you to spend your time and focus on this problem?
- What are the main contributing factors to alert fatigue for ML systems?
- What are some of the ways that a model can fail silently?
- How does NannyML detect those conditions?
- What are the remediation actions that might be necessary once an issue is detected in a model?
- Can you describe how NannyML is implemented?
- What are some of the technical and UX design problems that you have had to address?
- What are some of the ideas/assumptions that you have had to re-evaluate in the process of building NannyML?
- What additional capabilities are necessary for supporting less structured data?
- Can you describe what is involved in setting up NannyML and how it fits into an ML engineer’s workflow?
- Once a model is deployed, what additional outputs/data can/should be collected to improve the utility of NannyML and feed into analysis of the real-world operation?
- What are the most interesting, innovative, or unexpected ways that you have seen NannyML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on NannyML?
- When is NannyML the wrong choice?
- What do you have planned for the future of NannyML?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- NannyML
- F1 Score
- ROC Curve
- Concept Drift
- A/B Testing
- Jupyter Notebook
- Vector Embedding
- Airflow
- EDA == Exploratory Data Analysis
- Inspired book (affiliate link)
- ZenML
[00:00:10] Unknown:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building natural language processing models to programmatically inspect, fix, and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10x faster ML iterations. Galileo is offering listeners of the Machine Learning Podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to themachinelearningpodcast.com/galileo
[00:01:03] Unknown:
and request a demo today. Your host is Tobias Macey. And today, I'm interviewing Wojtek Kuberski about NannyML and the work involved in post deployment data science. So, Wojtek, can you start by introducing yourself?
[00:01:14] Unknown:
Absolutely. So thanks for having me, Tobias. I'm a cofounder at NannyML. What I do is basically everything that's related to tech. So I kinda manage the product side of things and the research side of things, which are actually separate at NannyML because we are a deep tech company. My background is in mechanical engineering, and then I did my master's in artificial intelligence in Belgium, where I stayed for a while. I freelanced for a bit, started my own consultancy, handed off the consultancy to someone else, and moved to Portugal for, you know, the sun, the food, and everything else that you can get here.
[00:01:49] Unknown:
And, yeah, that's basically me. And do you remember how you first got started working in machine learning?
[00:01:55] Unknown:
Yes. So that is, I think, a typical story when you come from a physics or mechanical engineering background. My bachelor thesis was about optimizing the shape of wind turbine blades and generally wind turbine stuff. And then I realized I actually enjoyed the optimization and computational side more than the actual physics. And since I'm a very lazy guy, I wanted to automate it all, and that's where machine learning came in handy. So I basically decided, let's just try to optimize everything and automate everything, and machine learning is really the tool for that. The algorithms themselves were just really, really interesting. So that's basically when I made my decision to switch from mechanical engineering to AI. And I got lucky that I got accepted to one of the master's programs in AI, so I didn't have to start from scratch with my education.
And that was really it.
[00:02:48] Unknown:
So as you mentioned, you have now cofounded a business in NannyML. I'm wondering if you can describe a bit about what it is that you're building there and some of the story behind why you decided that this was the problem that you wanted to spend your time and energy on. I started a previous company, a consultancy,
[00:03:03] Unknown:
together with my current cofounders at NannyML. And our goal was always to find a product that we could make big, basically some unfulfilled need that we could try to fulfill. And we had a couple of ideas, but there was one thing that came up over and over again with our consulting clients, which was the question of, what am I gonna do after you deploy these models for me? How do I make sure that these models actually keep on working? And we couldn't really make it our problem as part of the consulting because it was way too big to just do as a kind of one-off project. So we decided it's not our problem. You need to deal with it. We hand off, we deploy your models for you, and that's it.
And then after a while, since it was really a pattern, we decided that if we can make it our problem, we can probably actually create a great company and a great product. There's also a kind of lucky coincidence that we had a mentor, and we still have a mentor, that built a unicorn. And, basically, he kept on pushing us to actually jump into product already; there's no point in staying in consulting. So this was kind of like a perfect storm of opportunity, the right mentor, and the right pull from the market, because there was actual pull from the market. And then we decided to ditch the consultancy and basically go all in with NannyML, raise funds, and build the product.
[00:04:27] Unknown:
On the landing page and some of the marketing material around NannyML, it brings up this concept of post deployment data science, which is what you were saying about how you build these models, you hand them over to your clients, and then they say, what do I do with it now? And I'm wondering if you can give your definition of what that term encapsulates and some of the ways that it differs from the kind of metrics and monitoring approach that a lot of people have been falling into because it's something that they're familiar with from application or infrastructure monitoring?
[00:04:55] Unknown:
It's basically exactly what you said, that there is right now this kind of default approach to MLOps where it's all about engineering. And what we noticed is that data science has always been about providing business value and finding insights in data that allow you to bring some value, extract some insights. And this, for the time being, most of the time stops the moment you finish developing your model. And we think that extending this concept, of trying to bring insights and trying to ensure the value is there, to everything that happens after deployment is really the key to making sure that your models keep on delivering value. That's kind of fluffy.
But to be a bit more specific, what we mean by post deployment data science is doing things like monitoring, of course, but from the performance perspective. So we want to make sure that the performance of your models stays high. And if something goes wrong, then we employ data science tools to find out what went wrong. So we have data analysis. We are actually doing machine learning on top of your machine learning models to figure out what went wrong and then how you can make sure that your model is fixed. So it's not from the engineering perspective, the uptime, the throughput, whether the model pings, but looking at all kinds of silent failures that can happen and then trying to analyze, using data science tools, what went wrong and how we can fix it. That's basically
[00:06:22] Unknown:
it. As far as the current state of affairs where you have data scientists and ML engineers building the models and then handing them over to operations teams to manage the deployment and upkeep, I'm wondering who you typically see as being responsible for that work of the care and feeding of the model after it has, you know, gone past the training phase and you have that artifact to be able to put into production. And I'm wondering how you think about the design of NannyML for being able to help those people augment their skills and also being able to loop those data scientists and ML engineers into that overall workflow?
[00:07:00] Unknown:
Yeah. So that is something that we're honestly still kind of figuring out. We've identified three main personas, or three main roles, that could benefit from NannyML. The first one, and the kind of people that are most interested in NannyML, is actually heads of data science, because these are the people who really answer to upper level management for the actual performance and the impact of their algorithms. So that's one thing. They need to have a tool like NannyML to make sure that the algorithms actually deliver value, but they will not be the ones using it day to day. They have the need, but they are not the end user. And then the end user we see is, in a sense, maybe not equally split, but split between data scientists and ML engineers.
And so far, we see that it's more data scientists that want to make sure that the models they developed keep on working. So they still feel the ownership: this is my model, I want to make sure that it keeps on delivering value. And, also, their skill set is more suited for this kind of monitoring. Because, again, post deployment data science is not so much about engineering, but actually understanding what's going on. But, of course, when it comes to deploying NannyML to production to make sure it can provide the insights on day to day operation, this is something that will be handled by ML engineers. So in a way, it's similar to the models themselves: you need an ML engineer to deploy them and to deploy NannyML to monitor your model. And then you need the data scientists to actually look at the data, analyze the results, and see, if something went wrong, how they can make it right.
[00:08:36] Unknown:
Given the fact that the data scientists are interested in being involved in the continued care and feeding of the models and understanding more about how the models are operating past the training cycle, I'm wondering what have been some of the points of friction that they've experienced previously that maybe prevent them from being able to be more involved in that process, and that has led the current state of the industry more towards this, just, you know, collect some metrics and try and figure out, you know, has this crossed the threshold of whether or not you need to care about it? So I think that
[00:09:10] Unknown:
the biggest problem is really that people had assumed that it's unsolvable. And it's something that has been a problem for as long as machine learning in production has existed: models fail. But from a lot of our early design partners and early users, we heard that, yeah, that's how it is. This is just a reality of life as a data scientist. So the first thing is really awareness that you can actually make sure that your models work well and, when they fail, find it faster than you normally would. So that's the first thing. Another thing is we are still extremely early on when it comes to adoption of data science. We're kind of in a bubble where, you know, we work with cool companies that actually work with machine learning. But when you look at the general market, there is a lot of prototyping, and I think we were just simply too early. You first need to put models in production to monitor them. So I think right now, we're kind of crossing this threshold where data science is, as they say, crossing the chasm. Models are actually making their way to production, and people start to realize that the problem exists and the problem is solvable.
And when it comes to the friction points, or maybe the examples of what prevented them: first, especially in big organizations, it's on the political level, when there's a different person responsible, or even different departments responsible, for deploying models, maintaining them, and making sure there is an impact. And that, I think, is something that will change in the future, when there's maybe even going to be a new role for a person who actually makes sure, like a post deployment data scientist or production data scientist, the person who makes sure that these models keep on delivering value after they've been deployed.
[00:10:57] Unknown:
As far as just the overall state of data science and the health of the models that are being built and deployed, I'm wondering what you see as the long term effects of having the data scientists more involved in the post deployment stages of the model development, and the ways that gaining a better understanding of how the model actually operates in the real world allows them to bring that learning back into the process of developing the models in the first place, or developing successive models, how to learn more about the real world impacts of machine learning, you know, both the model on the real world and the real world on the model, and some of the ways that that will feed into the ongoing research and development, both in companies and in academia?
[00:11:46] Unknown:
That is actually a very insightful question. And like you mentioned, there's definitely this feedback loop between the real world and the model itself. If you deploy, let's say, a churn model and the model works well, it reduces churn, which will also impact the population. So we're gonna have some kind of drift there already, which might mean, for example, that it looks like the model is going to shit, but in reality, the model is still working fine and it's just the population changing. So one thing that the data scientists working on these models will definitely understand better is how the real world interacts with the model and that it's a two-way interaction. It's not just that the model interacts with the world and changes the world, but also the other way around, because the population will change, and if you retrain the model, the model is also going to be significantly different.
Another thing is I think people will understand robustness much better. Right now, you can spend weeks of compute optimizing your hyperparameters. And most people know that they shouldn't evaluate things on the test set more than once, but they still do it. They just keep on looking for, you know, the set of hyperparameters that performs best on the test set instead of the validation set. And they know that this idea of a holdout is needed, but they don't really take it to the extremes I think they should. And when you deploy models, you will start realizing that things are not as robust as expected, especially if you spend a lot of time optimizing your models and you're just getting good performance due to luck and due to just trying things out and actually overfitting in some more subtle ways. And I think awareness of that will increase as you see that when you deploy your models, performance drops compared to your test set, even though it shouldn't. And people are still asking why and finding the answers to that.
[00:13:39] Unknown:
More on the monitoring and alerting side of things. I'm wondering what are some of the ways that you have experienced model failures and some of the particular pain that you've encountered in terms of the available tooling, the visibility or lack thereof, and some of the ways that that has motivated you to want to invest the time and energy on solving this for everyone?
[00:14:00] Unknown:
So as I mentioned, the moment when we became aware of the problem was during our consulting days. And one very interesting thing that happened is that we were developing a segmentation model that would remove the background from images of industrial tools, and they would be put in a catalog. So basically, an automation use case. And interestingly enough, somewhere throughout the process, as the model was being deployed, they changed the cameras. And the model really went to shit. And it's something that, of course, we couldn't have predicted, but it made us realize that things change in the real world. And as they do, the models will not perform well and failure is inevitable. It's just part of life, things change, and we need to be able to monitor that. Another interesting use case that we stumbled upon a bit more recently is in wind energy, where what you see is that there's always covariate shift, so change in your model inputs, also known as data drift.
And there, it really, really impacts the performance. There is no change in the patterns because it's all based in physics, so there's no concept drift. But a change in just how the data is distributed will have a huge impact on performance. This is something that happens seasonally, and you should know whether you should be worried about it or whether it's just normal behavior. So just monitoring how your model is performing is not even enough, because maybe this is something that's expected. Maybe as the weather changes, there are more gusts of wind. If there are more gusts of wind, of course, your model will not predict the power production as well as in periods of low wind. So that's another use case. And then when it comes to credit scoring, one thing we noticed is that a lot of credit scoring companies, and banks as well, have this problem: they deploy a model and they need to wait for a year, sometimes two years, before they really get the production targets and can actually evaluate the performance of their models. So they're kind of flying blind for a year or two years, and they are using their models in basically the most important use case in their business.
So there's a huge need there to estimate performance in some way even without ground truth. So these three things are really what made us think of NannyML and the shape it has taken right now: you cannot just simply monitor performance, because it's not enough, and sometimes monitoring performance is not available, so you need to estimate performance instead. And looking at data drift is important from the performance perspective, but not for the sake of data drift itself, because that is something that will always happen, and it's not necessarily bad if your models generalize well. Yeah. And to that point of data drift being the
[00:16:45] Unknown:
kind of most frequently brought up case of issues with model performance, or things that you have to watch out for after you get a model into production, I'm wondering what are some of the other, more insidious or less understood or less observed, ways that models can start to go wrong without you noticing, based on maybe, you know, ROC or F1 scores, or just some of the ways that the model can still start giving you bad outputs without you necessarily noticing it. And, you know, conversely, how tracking these high level metrics, or these specific metrics that people are able to observe more directly, contributes to issues of alert fatigue in ML teams?
[00:17:32] Unknown:
That is a very loaded question, so I'm gonna answer it part by part. So first, starting with what can contribute to model failure. Let's assume that we cannot simply measure performance, as is the case in most of these situations, because there are delays, or in automation use cases, we don't have full ground truth. So just measuring F1 or whatever metric you choose is not easily doable, or not doable at all. And then there are basically two main reasons why models can fail. The first one is data drift, or covariate shift. So the change in the joint model input distribution, to be very technical here. So we're looking at all the features, and their joint distribution changes. If that happens, it might mean that your data is going to be drifting to regions where it's harder for the model to make its predictions.
Maybe because the model was not trained in this region, or maybe it's a region that, let's say, is very close to the class boundary, so it's just harder to make the correct prediction. Then you will see that the performance of your model is decreasing. And this is something we can actually fully estimate, the impact of covariate shift on performance, with NannyML, which is one of the cool unique features we have. And then the second reason why models can fail is concept drift. And concept drift is a change in the pattern, or mapping, or function, however you want to call it, between the model inputs and the target.
So it's the probability of y given x. And that means that, basically, the pattern that your model has learned during training is no longer valid, so we will see a drop in performance. So these are kind of the two ways that models can fail. Now looking at how they can fail without you even being able to tell by just looking at the metric: one thing is, imagine you have data drift to regions that are less uncertain, so the model performance should increase. But at the same time, you have concept drift, so the pattern changes, so the performance does not increase.
And that means that even if your F1 stays the same, in reality, the model is worse. It's not predicting as well as it could be. And this is something you would not be able to see unless you can estimate the impact of data drift on performance. If you do, you will see that given this data drift, the model should be performing better, but it's not. So something's off, and this off part is the concept drift. So that's one of the really insidious kinds of silent failure. And the inverse is also true. Like I mentioned when it comes to wind energy, it might be that you have periods when the model performance drops.
But there's still nothing to worry about, because there is just a momentary drift to regions that are harder to predict, like gusts of wind because the wind blows from the mountains. Yeah. I think that's enough talking for one question.
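The distinction Wojtek draws here, covariate shift changing p(x) versus concept drift changing p(y|x), can be made concrete with a small simulation. This is not from the episode, just a hedged toy sketch: the data generator, the logistic model, and the shift sizes are arbitrary illustrations of the two failure modes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, x_mean=0.0, flip_pattern=False):
    # p(y|x) is a logistic function of the inputs; flip_pattern simulates
    # concept drift by changing that mapping, while x_mean shifts p(x) only.
    x = rng.normal(loc=x_mean, scale=1.0, size=(n, 2))
    logits = 2.0 * x[:, 0] + x[:, 1]
    if flip_pattern:
        logits = -logits  # the mapping the model learned no longer holds
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return x, y

# Reference period: train and evaluate here.
X_ref, y_ref = make_data(20_000)
model = LogisticRegression().fit(X_ref, y_ref)
print("reference accuracy:", accuracy_score(y_ref, model.predict(X_ref)))

# Covariate shift only: p(x) moves away from the class boundary, p(y|x) is
# unchanged, so realized accuracy typically goes up despite the drift.
X_cov, y_cov = make_data(20_000, x_mean=1.5)
print("covariate shift accuracy:", accuracy_score(y_cov, model.predict(X_cov)))

# Concept drift: p(x) is unchanged, but p(y|x) flipped, so the learned
# pattern is invalid and accuracy collapses even though no feature "drifted".
X_con, y_con = make_data(20_000, flip_pattern=True)
print("concept drift accuracy:", accuracy_score(y_con, model.predict(X_con)))
```

If both shifts happen at once, the two effects can roughly cancel in the headline metric, which is exactly the silent failure described above.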
[00:20:24] Unknown:
And in terms of the detection capabilities that you're bringing into NannyML, you mentioned that one of the ways that you're able to identify some of these failure modes that are often silent if you're just tracking, you know, bare metrics, is the fact that you need to be able to say, based on this prediction and the input data that I'm seeing, I'm assuming that this is going to be the performance that I see, but because the distribution of the data is different than what the model was trained against, the concept drift is such that it's actually degrading the performance despite my prediction that it will get better. And I'm wondering what are some of the other ways that that predictive capability that you're bringing into NannyML is able to highlight some of these failure modes that are kind of flying under the radar, and some of the broader impacts on confidence in ML in the organization, just some of the broader impacts that these silent failures have when they come to light, and some of the ways that this greater visibility that NannyML brings allows ML teams to build up better trust in the organization?
[00:21:32] Unknown:
That's another loaded question. So, again, I'm gonna start with the first part, and I'll try to combine it with the alert fatigue that you mentioned. So one of the things that happens a lot is that people, when they do monitoring, look at every single feature kind of in a void. And if your model has 50 features, something is basically guaranteed to drift. And just because a drift is significant from a statistical perspective, it doesn't mean it's significant from a model monitoring perspective. You could see a very strong drift in a feature that's mostly irrelevant, and the model is very robust against changes there, and everything is gonna be fine. That's why it's so important to try to figure out what the impact of data drift on performance is.
Another thing is that these tests, if you do some kind of statistical test or you measure the difference, some kind of distance between two distributions, they assume that all features are independent. They don't track the change in the correlations or relationships between the features. So they kind of don't think about this joint part of the distribution. And for that, we developed multivariate data drift capabilities, where you basically train, let's say, an autoencoder, for the ease of explanation, on data for which you know everything's fine. And then you see how well this autoencoder performs on the dataset in question, your analysis dataset.
And if you see a drop in performance, it means that the internal geometry of the dataset has changed. So there's data drift. And this is a way to capture data drift that would not be captured by the tools that just look at everything from a univariate perspective. Now for the second part of your question, which is how monitoring, how silent failures, can impact the trust in machine learning. I think right now, because the failures are silent, all the failures are unexpected, and they are seen as something that was not planned, something that should never happen, it's all your fault.
So what monitoring can really bring is, first, increased trust that if, let's say, the head of data science is reporting that the models are performing well, they actually are performing well, because there's much more knowledge in the company. And another thing is bringing more understanding that failures will happen, they will be resolved fast, and they're just part of normal operating procedure. Just like you see right now, you need to maintain your physical equipment, and physical equipment will have to be out of operation for a while. The same is true for machine learning models. Things will happen, and the model is not performing as well as it used to, and it's gonna take us, let's say, two days to fix it. That is something that should be normal in companies, but now every time there is an issue, everyone's ringing alarm bells that, no, you cannot trust machine learning, it's not reliable.
It's reliable. It's just not perfectly reliable. And if we quantify this reliability, that will bring more trust.
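NannyML's multivariate drift detection is built on the reconstruction-error idea Wojtek describes (its released implementation uses PCA rather than an autoencoder). Below is a minimal, hedged sketch of the concept using scikit-learn; the data, component count, and thresholds are illustrative only, and this is not NannyML's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_reconstructor(reference: np.ndarray, n_components: int):
    # Fit the "compressor" only on a period known to be healthy.
    scaler = StandardScaler().fit(reference)
    pca = PCA(n_components=n_components).fit(scaler.transform(reference))
    return scaler, pca

def reconstruction_error(scaler, pca, chunk: np.ndarray) -> float:
    # Compress and decompress a chunk; the average error measures how far the
    # chunk sits from the structure learned on the reference period.
    z = scaler.transform(chunk)
    recon = pca.inverse_transform(pca.transform(z))
    return float(np.sqrt(((z - recon) ** 2).sum(axis=1)).mean())

rng = np.random.default_rng(1)
# Reference data: two strongly correlated features.
reference = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5_000)
scaler, pca = fit_reconstructor(reference, n_components=1)
print("baseline error:", reconstruction_error(scaler, pca, reference))

# Drifted data: identical marginals, but the correlation is gone. Univariate
# tests see nothing unusual; the reconstruction error jumps.
drifted = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=5_000)
print("drifted error:", reconstruction_error(scaler, pca, drifted))
```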
[00:24:29] Unknown:
Another aspect of the question of alert fatigue is that if you do get an alert, you don't necessarily know what to do about it, where a lot of the conversation that you see in the ML space is, oh, if you see concept drift, then that just means that you need to go back and retrain your model based on the new data that you have and redeploy it, and everything will be great. I'm curious what you have seen as some of the ways that teams who are using NannyML, or just this better understanding of what is happening with the model, allows them to think about what actions can and should be taken based on the actual type of failure that you're seeing, and the overall question of what are the remediation actions that can and should be taken when you do see that there is some problem that is manifesting?
[00:25:16] Unknown:
So when I think about monitoring, I think there are kind of three parts to it. The first one is detecting failure. The second one is realizing why the failure happened, and the third one is resolving the failure. And right now, with NannyML, we basically provide the first and the second. And the third one is, for lack of a better word, manual. Because what we realized is that it really is extremely use case dependent. The first thing you should do, your default action, should be to retrain, but instead of redeploying immediately, you should either run an A/B test, and this is something that we saw some people doing, or, and that is a very interesting case, you retrain it, but you retrain just on part of the drifted data, and you see how the model performs on the rest of the drifted data. So you do kind of a phantom deployment.
And then if everything's fine, you can redeploy. If not, you go back and basically do some kind of root cause analysis, which you can do with the data drift capabilities, both univariate and multivariate. And then also, you know, tap into your domain knowledge to see what changes in data quality upstream might have caused the failure, what changes on the business side of things might have caused the failure. Maybe there is a new marketing campaign, and this marketing campaign completely fucked up our recommenders. That's something that might have happened, but a tool like NannyML has no way of seeing that because it's completely external to the machine learning system itself. So there is always a human component there, because you're basically like an investigator trying to figure out why things happened.
And then once you know the full picture of why, it's a matter of retraining or redeveloping your model, maybe with a different architecture, maybe by adjusting the data. So if you have a strong data drift, maybe you should adjust your old data, apply some kind of shift there, to make it match the current distribution, because the model will just get confused otherwise, trying to learn two different patterns at the same time. Yeah. I guess that's it.
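The retrain-on-part-of-the-drifted-data idea can be sketched roughly as follows. This is a hedged illustration, not NannyML functionality: `phantom_evaluate`, the column names, and the use of F1 as the deciding metric are all hypothetical choices, and it assumes labels for the drifted period eventually become available.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import f1_score

def phantom_evaluate(champion, drifted: pd.DataFrame, feature_cols, target_col="y"):
    # Retrain a challenger on the first half of the drifted period only.
    split = len(drifted) // 2
    fit_part, holdout = drifted.iloc[:split], drifted.iloc[split:]
    challenger = clone(champion).fit(fit_part[feature_cols], fit_part[target_col])

    # "Phantom deployment": score the rest of the drifted data with both models
    # and compare, without touching the production endpoint.
    scores = {
        "champion": f1_score(holdout[target_col], champion.predict(holdout[feature_cols])),
        "challenger": f1_score(holdout[target_col], challenger.predict(holdout[feature_cols])),
    }
    winner = challenger if scores["challenger"] > scores["champion"] else champion
    return winner, scores
```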
[00:27:16] Unknown:
Another interesting kind of roadblock to the question of just retrain and redeploy is that sometimes, for a particular model instance, that retrain cycle might take, you know, several hours or days. And I'm wondering how that factors into the decision about, okay, what do I do in this instance? Like, is it typically a case of, this is an urgent failure and I need to be able to resolve this immediately? Or are the failure modes...
[00:27:55] Unknown:
Oh, yeah. That's another interesting question. And you mentioned that it can take, you know, days. But if you look at the wider organization in some banks, it can actually take months, because you need to submit a request to retrain a model, it has to be approved by the compliance team, and it really takes ages. One of our data scientists at NannyML used to work at one of the more advanced banks. And even there, just because the regulation is so heavy, you cannot simply retrain your model. You need to go through steps with multiple different teams and multiple different external validators before you can retrain your model. So it's at least a month.
And I think that is also why knowing whether you actually have to retrain your model, or whether it's just nice to have, is important. So, again, estimating the performance and monitoring the performance is really the key here. And from there, you can see what the impact is: assess the severity, and assess also whether it's a long term impact or just a passing impact due to seasonality. So again: data drift, the impact of data drift on performance, estimating performance, and measuring performance. Oh, and here there is another thing at play, which is the uncertainty estimation of the performance itself.
This is something that we will be adding to the library, I think two weeks from now, when we'll also be giving uncertainty estimates on our estimated and calculated performance. So you will not just know that, let's say, ROC AUC is right now 0.85, but 0.85 plus or minus 0.06. And if you see that you have a strong drop in performance, but the uncertainty is also really high, maybe you just got unlucky with the data, and you should keep on going as it is. You don't have to retrain. Maybe you do, but let's wait and see. So if you combine these uncertainty estimates with the data drift and performance estimation, you should have a full picture of whether you need to retrain or not.
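The uncertainty estimates described here had not shipped at the time of recording, so the snippet below is not NannyML's method, just a generic hedged illustration of the idea: bootstrap the evaluation sample to put an interval around a realized ROC AUC, so a drop can be judged against how noisy the metric is at that data volume.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    # Resample the evaluation data with replacement and recompute the metric,
    # giving a rough interval that reflects sampling noise at this volume.
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC needs both classes present in the resample
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    low, high = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), (float(low), float(high))

# Usage: mean_auc, (low, high) = bootstrap_auc(labels, predicted_probabilities)
# A "drop" that stays inside the interval may just be noise, not a failure.
```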
[00:29:54] Unknown:
This might be going a little too far afield and too far into the systems design aspect of it. But in the case where you do have a model and it is seeing critical errors in the ways that it is generating predictions, I'm wondering what are some of the application or systems engineering patterns that you've seen for people to be able to include some sort of a circuit breaker, where it says, okay, this model is too far off the rails, we can't trust it, retraining it is gonna take too long, so we need to have some sort of fallback mechanism to be able to say, you know, this capability is no longer functional right now, check back later. Or, you know, we'll just go with a more statistical, heuristic approach that doesn't rely on this more advanced model capability. So you have kind of a graceful degradation of capabilities, but the model is no longer in the loop until such time as you can fix the underlying problem that it's experiencing.
[00:30:45] Unknown:
To be honest, I think that's a step too far, not necessarily for us here, but for the reality of the market. I don't think we're there yet. Maybe there's a handful of companies that already have those capabilities. I haven't spoken to anyone that actually already had a full blown system for how to handle failure. Most of the companies that do monitoring do it manually. Everything is manual. All decisions are basically taken on the spot, and I think we're quite early when it comes to that. One thing I saw is that there is an off switch. If the failure is really critical, to the point where instead of bringing value it's detrimental to the company, then the first thing you do is you just turn it off.
Or you still run predictions, but you don't actually act on them, so nothing that would actually have an impact. So the model is just running in the void, and then you start resolving the problem. And once the problem is resolved and you confirm that the failure is no longer there, you can deploy the new model.
[00:31:44] Unknown:
That's basically the only thing I know of. Yeah. It's definitely an interesting thing to think about, like all the different ways that things can go wrong and how to resolve them. And at a certain point, you just have to say, I can't predict it. It is what it is, and we'll figure it out when we get there. Yep. And so in terms of the NannyML project, can you talk through how it's implemented and some of the overall design questions that you had to address in terms of the technical and user experience interactions and how to be able to fit this into people's machine learning workflows?
[00:32:17] Unknown:
From a high level perspective, we decided to go for a Python open source library because, let's be honest, almost all data science right now is done in Python. People want to just pip install things, or conda, or whatever your preferred method is. So that's one thing. When it comes to the actual implementation, we basically have a default structure, a default workflow, for NannyML itself. First, you plug in your data, and your data comes in two parts: your reference dataset, for which you know everything is fine, and your analysis dataset, which is the one you'd like to analyze. And this kind of mimics the manual monitoring process where, from time to time, let's say every week, every two weeks, however often you need to do it, you will manually run NannyML and see what's going on there. And then the flow is that you get your data via Parquet files or whatever you want, or via a microservice that provides the data.
You see what's going on, starting with performance calculation or estimation, depending on whether targets are available or not. If everything's fine, good. If things are not fine, we need to go into data drift, and you do multivariate and univariate data drift detection. And then we have a, honestly, pretty minimal capability to link data drift to performance, so you'll be able to see what the potential reasons for the dropping performance are. And from then on, we go into the manual quest of resolving the issues. That's kind of one workflow. The other workflow is when you deploy it to production.
In that case, NannyML is running via Airflow or some other kind of service. We just released our CLI, so right now it's much easier to just run NannyML as part of your ML system, and you get your alerts. And you have, let's say, a script or notebook where NannyML is running, where you can just view an updated dashboard, however often you want to run NannyML. It can be every 10 minutes, it can be every 5 minutes, basically as often as you want. There is one thing that I generally caution about, which is trying to do streaming with monitoring.
And the thing is that you can obviously use NannyML in a streaming fashion, but you need to have a certain volume of data before you can say that the performance has degraded. Degradation is an inherently probabilistic process. So if you just get one new data point, it really does not make sense to run the whole analysis again, because things cannot have changed that much. And even if they did, you would not be able to tell from just one point, and also the algorithms cannot take that. So what I would recommend is basically running batches every few minutes, or every few hours, however often you need, once you have enough volume of data. And then if there are alerts, the rest of the process triggers, and that can either automatically trigger retraining, or, in advanced organizations that deal with automation use cases, it could trigger manual spot checking.
So then you basically create a task and send a signal to the team that does spot checking, so they manually reevaluate whether the performance actually has decreased. And from then on, you can take actions to remediate.
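A rough sketch of that batch flow, written against the NannyML Python API as documented around the time of this episode, might look like the following. Argument names and result accessors have changed across releases (newer versions, for example, require extra arguments such as the problem type), and the file and column names here are assumptions about how your prediction log is laid out, so treat this as illustrative rather than copy-paste ready.

```python
import pandas as pd
import nannyml as nml

# Reference period: data you trust (e.g. validation set or early production).
reference = pd.read_parquet("reference.parquet")
# Analysis period: the latest production batch to be checked.
analysis = pd.read_parquet("last_week.parquet")

feature_columns = [c for c in reference.columns
                   if c not in ("y_pred", "y_pred_proba", "y_true", "timestamp")]

# Estimate performance without waiting for targets (CBPE), since labels are delayed.
estimator = nml.CBPE(
    y_pred="y_pred",
    y_pred_proba="y_pred_proba",
    y_true="y_true",
    timestamp_column_name="timestamp",
    metrics=["roc_auc"],
    chunk_period="W",  # weekly chunks; pick a size with enough volume per chunk
)
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# If estimated performance alerts, dig into drift: multivariate first, then univariate.
drift_calculator = nml.DataReconstructionDriftCalculator(
    column_names=feature_columns,
    timestamp_column_name="timestamp",
    chunk_period="W",
)
drift_calculator.fit(reference)
multivariate_drift = drift_calculator.calculate(analysis)

# Both result objects can be plotted or exported to data frames for alerting;
# the exact accessor names depend on the installed NannyML version.
```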
[00:35:34] Unknown:
In the process of building NannyML and starting to work with some of the end users of it, I'm wondering what are some of the initial concepts or ideas that you had about how to approach the problem or how to integrate it into people's systems that you had to reconsider or reevaluate as you started getting it in front of people with real world use cases?
[00:35:55] Unknown:
Yeah. So our first idea was to basically market it towards business people, because that is kind of the end consumer. And while the idea is sound in principle, in reality it really did not work, because they just don't understand machine learning at all. That's why we switched to data scientists and heads of data science. And the other thing is that we started with a dashboard; we thought people would want to view it as a dashboard. What we learned from early user interviews and from early design partners is that they actually don't care that much for a dashboard. Like, there are certainly some companies and organizations where they want it, but they care about a dashboard as kind of a second step, because they deploy it themselves via, literally, Jupyter Notebooks.
They view things themselves there, everything's working fine or not fine, and then if there is a problem or they need to communicate it to a business stakeholder, that's when the dashboards come in. So then we had to completely change the priorities. We dropped the dashboard and focused on usability within Jupyter Notebooks. And really, the entire question of UX is: I want to run things fast using an interface that I know. So we often default to how scikit-learn is doing it, or, when it comes to plotting, how Seaborn is doing it. We look there, and that's our default. And unless we have a very strong reason to change the interface, we basically copy them, because they did it well, people know how to use them, and they are happy with these libraries.
So I guess that is one big thing we needed to reevaluate. Another one is that we had assumed that the idea of estimating the performance without the ground truth is something that would click in users' heads. But we realized it's not as easy, and we had to do a lot of nudging, both in our documentation and in the visualizations themselves, to show that we are estimating this performance; it's not the realized performance, we are estimating what is likely to have happened due to data drift. And this is still something of an open question, because for a lot of people, it takes a while to even realize that it's possible to estimate the performance of a machine learning model. And it's definitely not an intuitive idea.
[00:38:08] Unknown:
Another interesting challenge in the ML space is that you need to be able to support a fairly large variety of different frameworks and data types and model types and model architectures. And I'm wondering what were the initial target capabilities and initial focus that you had for that kind of matrix of use cases, and the ways that you have thought about the initial foundational aspects of building NannyML to allow you to then branch out into, you know, maybe going from tabular data to unstructured data or image data, etcetera, or, you know, the different frameworks that you wanted to support or model architectures that you wanted to be able to understand how to generate these predictions for?
[00:38:56] Unknown:
So that's something we actually thought a lot about. And what we realized is that, from the use case perspective, there's basically one huge difference: whether the ground truth is delayed or not. If we work with cases where the ground truth is not really delayed, so high frequency trading or delivery time prediction, where it's gonna take you half an hour to get your targets, that is something that should be monitored in a completely different way than, let's say, loan default prediction for mortgages, where you wait years until you know your target.
Then on the model and framework side, we made the decision very early on, also because, honestly, we got lucky with the research, that we're gonna be fully agnostic when it comes to everything there. So we don't actually need the model. NannyML does not need the model file and does not need to know the framework. You just work with data. That was a very committal decision that we took in the beginning, that NannyML should always be able to work with just data. So then, by default, we actually support all kinds of models, because we don't need to look at them, and all kinds of frameworks, because we don't need to use them. So we managed to kind of sidestep the problem, and so far, it's working quite well. Now when it comes to the actual data types, that is, I would say, the biggest challenge. That's not something that we're able to sidestep.
We decided to focus on tabular data, because that's where, so far, we've seen the value is in companies, and also this is the type of data that is most prone to failure. Like I always say, a horse is a horse. It's not gonna stop being a horse anytime soon. So if you look at image data, it's less likely to change. But, of course, as I already mentioned, sometimes the cameras change. One interesting example we got from a design partner is that they were doing detection of COVID based on x-ray images. And there was a new strain, and the model failed because the new strain actually showed itself in the x-ray images differently. So, of course, there's drift there, but we decided to focus on churn use cases, credit default scoring, upselling, cross selling, all this boring AI, because these models are actually in production, and if they fail, that's a huge problem.
And now moving on to image and text data. We are right now actually running a small internal project to see what changes we need to make to support them natively. And what we found out is that you can already use NannyML for text and image data, but instead of working with the raw data, we need to go to the embedding layer. And there, you can basically treat it as tabular data, and performance estimation works well, multivariate drift detection works well. Univariate drift detection returns bullshit, because it just gets pixel values, or you get, you know, vector dimension 17 has changed. Yeah. That's not interpretable and it's not very useful.
So we don't have the interpretability there, but the core capability of drift detection and performance estimation is still there.
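The embedding-layer approach described here can be sketched with a few lines of scikit-learn. This is a hedged illustration, not NannyML's native support: `embed` is a hypothetical stand-in for whatever encoder you already use (a sentence encoder, your model's penultimate layer), and the reconstruction-error detector mirrors the earlier multivariate sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: in practice this would be your own encoder producing
    # fixed-length vectors. Here: fake 16-d vectors for illustration only.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 16))

def embedding_drift_score(reference_texts, analysis_texts, n_components=8):
    # Treat the embedding matrices as tabular data and compare how well a
    # compressor fitted on the reference period reconstructs each set.
    ref, ana = embed(reference_texts), embed(analysis_texts)
    pca = PCA(n_components=n_components).fit(ref)

    def err(x: np.ndarray) -> float:
        return float(np.linalg.norm(x - pca.inverse_transform(pca.transform(x)), axis=1).mean())

    # A ratio well above 1 suggests the analysis embeddings have drifted.
    return err(ana) / err(ref)
```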
[00:41:57] Unknown:
As far as being able to integrate NannyML into an ML system, what are some of the foundational systems and capabilities that are necessary to be in place to be able to handle feeding the data into it, generating the predictions, and being able to capture the outputs of NannyML and integrate that with the other monitoring and alerting systems that can and should be in place?
[00:42:24] Unknown:
So our default kind of workflow is you have your model deployed as a microservice. So the first hard requirement is that your model is actually in production. People sometimes say that it is in production, but it's actually not in production, or it's in production, but nobody's acting on the predictions. So one important thing is you should implement NannyML as you deploy your models or after you deploy your models; it's a tool for post deployment data science. So that's one kind of obvious thing, but it's sometimes less obvious than it seems. And then we basically assume that you have a microservice architecture, where you have a model deployed somewhere that is scheduled via Airflow or whatever you want. And then what you want to do is spin up a Docker container, let's say, or containerize your microservice, with NannyML.
And what you feed to NannyML is all the inputs that come to the model and all the outputs from the model. Plus, if you have access to this data, also the target data, the production target data. On top of that, you will need some period for which you know everything went fine. This can be your validation set, or you could just get some historical production data from when everything was fine, let's say the first month of production. And then, if you do batch processing, you can run NannyML at the same time as you query the model. The exact same data you first run through the model, and then you feed the inputs and outputs to NannyML. And then you need to have a way to display it. The default, following the Jupyter Notebook flow, is to just have NannyML running; it displays the data, you view it in your browser, and that's it. So there is not that much to be done on this front.
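As a concrete, and entirely hypothetical, sketch of that data contract: the monitor only needs the model's inputs, its outputs, timestamps, and, whenever they arrive, the targets, joined on whatever request identifier the serving layer logs. The column and function names below are made up for illustration.

```python
from typing import Optional
import pandas as pd

def build_monitoring_frame(inputs: pd.DataFrame,
                           predictions: pd.DataFrame,
                           targets: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    # inputs:      one row per scored request (features + request_id + timestamp)
    # predictions: request_id, y_pred, y_pred_proba as logged by the serving layer
    # targets:     request_id, y_true, possibly arriving days or months later
    frame = inputs.merge(predictions, on="request_id", how="inner")
    if targets is not None:
        # Left join: rows stay NaN in y_true until the labels eventually land.
        frame = frame.merge(targets, on="request_id", how="left")
    return frame.sort_values("timestamp")

# The reference period is just a slice of this same frame that you trust,
# e.g. the validation set or the first month of production, used to fit the
# monitor before analyzing newer batches.
```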
[00:44:09] Unknown:
As far as the post deployment data science aspect of it, we discussed this a little bit, but what are some of the ways that the outputs of NannyML can get fed into maybe a partner model that sits, you know, alongside NannyML and feeds back into the model that you're monitoring, or just, you know, some ways that the information that NannyML surfaces can be used to build additional models to then serve as kind of automated checks or automated remediation flows for the model that you're actually putting into production and is generating the value?
[00:44:46] Unknown:
The easiest way is basically the alerts. We have alerts that are returned both in visual form and as a data frame. And what you will see there is that if you have an alert on performance, which can be set with a custom threshold or a default threshold, that is the easiest thing to act on. So basically, you can look at alerts on performance, alerts on data drift, and alerts on specific features. And then you can use that information to first trigger automated retraining. This is kind of the minimal fully automated loop. You automatically retrain, you deploy the model via phantom deployment or A/B testing, and then you pass it through NannyML again to see whether the issue has been resolved. If the issue has been resolved, with this feedback loop, then the model gets deployed to production.
So that's kind of the default remediation technique. And then you could see another flow if the issue is not resolved, so retraining is not sufficient. You could pass what has been tried, plus all the other analysis on performance and data drift, to a data scientist, who could basically start with that and do an EDA there to figure out what went wrong. And then you go back to the same process, where you deploy the model via phantom deployment or A/B testing, run it through NannyML again, and see if the issue is resolved. If the issue is resolved, you deploy the model. So there is this kind of feedback loop approach, where you can use NannyML to both alert you that something has gone wrong and confirm that the issue is resolved.
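A hedged sketch of that alert-driven loop, assuming the monitoring results can be exported as a data frame with a per-chunk boolean alert column (roughly how NannyML exposes its results, though the exact column names vary by version); the retrain and notify hooks are hypothetical placeholders for your own pipeline.

```python
import pandas as pd

def act_on_alerts(results: pd.DataFrame, retrain_fn, notify_fn) -> None:
    # Expecting one row per chunk/metric with a boolean 'alert' column; the
    # column names are assumptions, adapt them to what your version exports.
    performance_alerts = results[(results["metric"] == "roc_auc") & results["alert"]]
    if performance_alerts.empty:
        return  # nothing to do this batch

    candidate = retrain_fn()  # e.g. kick off the training pipeline, returns an identifier
    # Phantom deployment: the candidate is scored alongside the champion but not
    # acted on until a later monitoring run confirms the alert has cleared.
    notify_fn(f"{len(performance_alerts)} performance alert(s); "
              f"candidate {candidate!r} staged for shadow evaluation.")
```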
[00:46:18] Unknown:
And then in terms of the ways that you're thinking about making this project and business sustainable, I'm curious how you have thought about the kind of governance aspect of the open source project and the ways that the business you're building around that, you know, feeds into it, and just some of the division between what is freely available, what is commercial, and just the overall business model that you're targeting?
[00:46:43] Unknown:
So for the time being and for the foreseeable future, we are fully focused on open source, and it is going to be an open source led product. So the open source is always going to be the main thing. And we kinda follow the open core philosophy, where all the core algorithms that provide value, the, you know, new science that we're developing, or novel engineering at least, are always going to be open source. Right now, we're not monetizing anything. What we're focusing on is adoption first and foremost, trying to get as many people as possible to use NannyML in production in real life use cases. And based on that, we will be seeking more funding soon, soon-ish.
And we have a bit of runway yet. And in the future, there are basically two ways we plan to monetize. One is everything that has to do with an enterprise edition in terms of security, integration, privacy, helping big enterprises create workflows where NannyML can be used to provide value, just like the one I described before. And another thing is business oriented features. So trying to use additional signals from the business that are not in the dataset to help improve model monitoring. If it's possible to quantify the impact of a model on the business itself, also monitoring those metrics, so maybe ROI on the model.
A simple one in credit scoring would be, you know, risk adjusted ROI on how much money you're getting per loan. Is that changing? How is that related to data drift and model metrics? Can we also estimate that? These would be the things that almost nobody cares about except business people. So it's the perfect thing to keep outside of the open source solution, because it's something that would not drive adoption and does not drive value for the vast majority of people, but is also a good way to monetize and make it a business. Because if these capabilities are needed, they are really needed, and people are willing to pay for them in a way that, you know, really shows that it provides value.
[00:48:51] Unknown:
In your experience of building the project and working with some of your design partners, what are some of the most interesting or innovative or unexpected ways that you've seen NannyML used?
[00:49:01] Unknown:
The idea of using NannyML to A/B test models is not mine. I mentioned it a few times, but it was something that actually came from a design partner. At first, they were not even that interested in the monitoring aspect of it, but they were doing retraining. And the biggest thing is that they never knew if the automated retraining actually made sense: maybe we're actually hurting the model more than we're fixing it. And that is something that actually happened once, when there was an upstream data issue. That meant that basically the quality of, I think, two months of data was much lower than the previous months of data, and the model was retrained, and the performance dropped significantly, because, even though the data quality issue was found and resolved, the same data was still used to retrain the model. I don't know why. It just happens in enterprises.
And we were able to actually spot that with NannyML. So that is one way: to monitor automated retraining to make sure that you don't do something wrong by retraining your model. That is really the biggest one. Another one would be to look at the training data itself and understand it better, so it can be used as an analysis tool, where you basically want to see what the changes within the training data are and whether you should use all the data for training or not. So to basically see whether this data is really relevant or not. That's something that normally would only be done in a very manual way. Let's say you have the last 20 years of credit scoring data. You would know, as a person working in finance, that everything that happened before 2008 and after 2008 looks completely different.
But that's something you need to know, and there are maybe multiple different changes like that that you would not be aware of. And you can use NannyML to kind of automatically analyze these changes in the concept and the data drift within the training set,
[00:51:08] Unknown:
to figure out which data should be used and which data should not be used to train your model. In your own experience of building the project and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:21] Unknown:
One thing is that we really had no clue how to build a product, and we kind of had a clue how to build a business because we had built a consultancy. And on the product side, I was like, what's product? You engineer the thing, you do the data science, and then you release. We really did not realize that there's this entire thing called product thinking, where you need to combine what's needed with how to present it, and make sure that there is value, usability, business viability, and actual feasibility. And this was something that was completely not on my radar.
That was one big thing that I learned. I read a book called Inspired, and that is really something I would recommend to everyone: engineers, data scientists, whoever you are, if you work at a tech company, this book is really great. Another thing is the value of early user testing. At first, we decided to build one thing. We built it, and I was like, yeah, it's kind of there. It works, but not great. And after we hired a product designer, we basically switched to always testing mocks and prototypes, testing as often as possible.
And once we decide to build something, we have a much higher chance of it being the right thing. On top of that, we're also much more committed to actually building things and not abandoning them halfway through because we found out it's not something people want; by the time we build something, we already know what people want. And the last thing I would say is that adoption is much more nuanced than expected. We started with a pure business focus because that is objectively where the value of data science is, but that doesn't matter. If those people are not interested or don't understand what we're trying to do, we will not be able to bring value to them because they are not open to it.
And that's why we decided to switch our focus. Being open about who your user actually is is very important, and focusing on the people who will use NannyML immediately is more important than focusing on the person who will, in the end, get some value out of it. Working with engineers is completely different from working with data scientists. They are different people, they are driven by different things, and there's even a huge communication gap between engineers and data scientists. One thing that I noticed, and of course I'm going to make a terrible generalization here, is that data scientists are in general driven by curiosity. What makes them tick is figuring things out, and with engineers it's much less so. They are driven by building things. They want to create things.
These are completely different motivations, completely different ways of looking at the world. For me, as a data scientist, the problem ends when I figure out how to do it. Doing it is almost an afterthought; if I already know how it works, that's what I wanted. Whereas engineers don't really care that much about how things work; for them it's about building things. These two sometimes stand on seemingly opposite sides of the spectrum, and I had to learn how to recast the problem in a different light to appeal to different people.
[00:54:33] Unknown:
Yeah, it's definitely an interesting insight about the challenge of being able to motivate people because of the areas where they want to spend their time and focus and what it is that they are actually intrinsically interested in.
[00:54:43] Unknown:
Absolutely. And then you have business people, and most of them just want to have one metric and drive that metric as fast as possible. So it's yet another deep motivation, and it's all very interesting.
[00:54:56] Unknown:
For people who are deploying machine learning and have it in production, what are the cases where NannyML is the wrong choice?
[00:55:04] Unknown:
So I would say that there are a few use cases that we're just not ready for as a library, as a product. Let's say that you want to monitor it all and you work at JPMorgan Chase, and you're responsible for creating the entire platform for monitoring everything. At this point, we don't have the capability to implement NannyML for users. We wouldn't be able to assist you there, and you should probably look for a closed source solution that comes with the entire playbook of how to do it for a huge organization, from start to finish.
NannyML is inherently an open source product, which means that we're happy to help and assist with anything that has to do with implementing NannyML, or with monitoring, observability, or post-deployment data science in general, but we simply don't have the manpower and the capability to do the integration for you. So if you don't have the capability to integrate NannyML yourself, NannyML is not the right choice for you. This is true, I think, for most open source projects, and just as much for NannyML. Another thing is really big data.
We optimized our library, and it works reasonably fast. If you have, let's say, single digit terabytes of data per day, then it's a good solution for you. If we're looking at petabytes per day, it's going to crash. It's not going to work. It's not designed to work with that much data.
[00:56:32] Unknown:
As you continue to build out NannyML and iterate on the product and add new capabilities, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:56:46] Unknown:
So when it comes to the product roadmap, we always have a research roadmap as well, because we are trying to come up with new ways to solve problems that have not been solved before. The main one being, of course, performance estimation without ground truth, but also multivariate drift detection. And there, we recently figured out how to do performance estimation for regression models. So we will be releasing full support for regression in the coming weeks, to be more specific within the next four weeks. It's going to be in the library.
On the short-term horizon, as already mentioned, there is expected sampling error, so a way to measure the uncertainty of our estimates, both for data drift and for performance. A bit more long term, we will be looking into segment-level analysis: first automatically segment your data from the perspective of model monitoring and then run the entire analysis on those segments. You'll be able to find underperforming segments where maybe at the general population level the model still looks fine, but there is some segment of your data where the model is not performing well, and you should be looking into that. So, kind of fine-grained analysis. Then, roughly within the same time frame, we'll have explicit, full support for text data and for image data.
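To make those two roadmap ideas concrete, here is a rough sketch (not NannyML's implementation) of segment-level performance combined with a bootstrap estimate of sampling error; the DataFrame `scored` and its columns are hypothetical.

```python
# Sketch: per-segment ROC AUC plus a bootstrap estimate of sampling error, so a
# drop on a small segment can be judged against noise. Not NannyML's API; a
# DataFrame `scored` with columns "segment", "y_true", "y_pred_proba" is assumed.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def segment_report(scored: pd.DataFrame, n_boot: int = 200, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    rows = []
    for segment, grp in scored.groupby("segment"):
        auc = roc_auc_score(grp["y_true"], grp["y_pred_proba"])
        boots = []
        for _ in range(n_boot):
            sample = grp.sample(frac=1.0, replace=True,
                                random_state=int(rng.integers(1 << 31)))
            if sample["y_true"].nunique() == 2:      # AUC needs both classes present
                boots.append(roc_auc_score(sample["y_true"], sample["y_pred_proba"]))
        rows.append({"segment": segment, "roc_auc": auc,
                     "sampling_std": float(np.std(boots)) if boots else float("nan")})
    return pd.DataFrame(rows)
```

A segment sitting several `sampling_std` below the population-level metric is worth a closer look; a smaller gap may just be sampling noise, which is exactly the distinction the expected-sampling-error work is meant to automate.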
And then a bit more long term, we'll also be looking at concept drift detection, so ways to detect not only data drift but actual concept drift, and to link it to performance. This is something we just started the research on. But since it's a mostly unsolved problem if you don't have access to your ground truth, we are starting with the assumption that you can access your target data, and we'll be releasing support for that shortly. Concept drift detection without access to ground truth is a really open research problem.
But if we are able to figure that out, we'll be able to almost always estimate the performance of your models quite accurately, which is kind of the holy grail of ML monitoring. But that's a while from now.
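One crude way to picture what becomes possible once targets are available (again a sketch, not NannyML's method): fit the same model class on an old and a recent labeled window and compare their predictions on common probe data. If the inputs look similar but the two mappings disagree strongly, the pattern itself has likely changed.

```python
# Sketch: a rough concept-shift signal when ground truth is available.
# Train the same model class on an old and a new labeled window, then compare
# their predicted probabilities on shared probe data; large disagreement points
# at a change in the X -> y mapping rather than in the inputs alone.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def concept_shift_score(X_old, y_old, X_new, y_new, X_probe) -> float:
    old_model = GradientBoostingClassifier().fit(X_old, y_old)
    new_model = GradientBoostingClassifier().fit(X_new, y_new)
    p_old = old_model.predict_proba(X_probe)[:, 1]
    p_new = new_model.predict_proba(X_probe)[:, 1]
    return float(np.mean(np.abs(p_old - p_new)))   # 0.0 means the learned mappings agree
```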
[00:58:54] Unknown:
Are there any other aspects of the work that you're doing at NannyML, or the overall space of post-deployment data science, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:08] Unknown:
Oh, I just remembered one thing about the roadmap, on the integration side of things. On everything related to engineering, we will definitely be looking into explicit integrations with MLOps tools to make deployment of NannyML itself as seamless and effortless as possible, so tools like ZenML and the like, on top of what we have right now with our CLI and Docker deployment. When it comes to other things, maybe learnings, something I learned in my journey is that recruitment is extremely important and hard. But because we spent a lot of time actually figuring out who we want and describing it very well, we never had issues with not having enough applicants or not being able to find the person we're looking for.
So it's just kind of a fluffy thing: focus on recruitment more than you think you should. Recruitment decisions are really one of the most important decisions you can make in a startup. I think that's it, actually. Nothing specific to talk about. Just, you know, I would like to encourage everyone to go to our GitHub, give it a try, and I am always happy to either assist or receive any kind of feedback. Oh, and you always get haters. We got our haters recently, and we actually felt good about it, people bashing NannyML, because if they are, that means that they care, and that's good.
[01:00:31] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:00:45] Unknown:
I think it's mostly the awareness of potential impact. This is not going to be, like, the popular default answer here. But I think if businesses and executives realized the potential impact of machine learning in their companies, and if they were able to actually track it, they would be much more willing to invest in proper structure and resources for deployment. If you see the potential upside, problems tend to disappear and get resolved when there is a real need for that. Another thing is well-structured processes, not only for developing machine learning models, but also for deploying and monitoring them. I think right now we have a very strong disconnect between prototyping, deployment, and everything that happens after deployment.
What we will hopefully see in the future is that these things will come closer together, and data scientists who develop models will also be looking at them through the lens of: can I actually deploy this thing? Will it work when it's deployed? People who work on deployment will think about what happens after deployment, so that once a model is deployed it's easy to retrain and easy to redevelop. And people who work in post-deployment data science will be able to look back to the previous stages, making this whole process more structured and not as disjointed as it is now.
[01:02:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at NannyML. It's definitely a very interesting problem space, and it's great to see folks tackling the question of how you actually understand what your model is doing in production and what to do about it. So I appreciate all the time and energy that you and your team are putting into addressing that problem, and I hope you enjoy the rest of your day. Thank you, it was great to be here. Enjoy.
[01:02:42] Unknown:
Thank you for listening. Don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea of the data scientists building natural language processing models to programmatically inspect, fix, and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10 x faster ML iterations. GALILEO is offering listeners of the Machine Learning podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to the machine learning podcast.com/galileo
[00:01:03] Unknown:
and request a demo today. Your host is Tobias Macey. And today, I'm interviewing Wojtek Kubertski about Nanny ML and the work involved in post deployment data science. So, Wojtek, can you start by introducing yourself?
[00:01:14] Unknown:
Absolutely. So thanks for having me, Tobias. I'm a cofounder at Nenemel. What I do is basically everything that's related to tech. So I kinda manage the product side of things, the research side of things, which are actually separate in that email because we are a deep tech company. My background is in mechanical engineering, and then I did my master's in artificial intelligence, Belgium, where I stayed for a while. I freelance for a bit, starting my own consultancy, handed off the consultancy to someone else. I moved to Portugal where, you know, the sun, the food, and everything else that you can get here.
[00:01:49] Unknown:
And, yeah, that's basically me. And do you remember how you first got started working in machine learning?
[00:01:55] Unknown:
Yes. So that is, I think, a typical story when you come from physics background or mechanical engineering background is that my bachelor thesis was about optimizing the shape of wind turbine blades and generally wind turbine stuff. And then I realized I actually enjoyed the optimization and computational side more than the actual physics. And then since I'm a very lazy guy, I wanted to automate it all, and that's where machine learning came in handy. So I basically decided that let's just try to optimize everything and automate everything, and machine learning is really the tool for that. The algorithms themselves were just really, really interesting. So that's basically when I made my decision to kinda switch from mechanical engineering to AI. And I got lucky that I got accepted to 1 of the masters in AI, so I didn't have to kind of start start from scratch with my education.
And that was really it.
[00:02:48] Unknown:
So as you mentioned, now you have cofounded a business in Nanny ML. I'm wondering if you can describe a bit about what it is that you're building there and some of the story behind why you decided that this was the problem that you wanted to spend your time and energy on. I started previous company, a consultancy,
[00:03:03] Unknown:
together also with my current cofounders at Nanny and El. And our goal was always to find a product that we could make big, basically, some unfulfilled need that we could try to fulfill. And we had a couple of ideas, but there was 1 thing that we came up over and over again with our consulting clients is that there was the question of what am I gonna do after you deploy these models for me? How do I make sure that these models actually keep on working? And we couldn't really make it our problem as part of the consulting because it was way too big to just do it as an kind of 1 off project. So we decided it's not our problem. Yeah. You need to deal with it. We hand off, we deploy your models for you, and that's it.
And then after a while, since it was really a pattern, we decided that if we can make it our problem, let's probably actually create a great company and a great product. And there's also kind of a lucky coincidence that we had a mentor and we still have a mentor that built a unicorn. And, basically, he kept on pushing us to actually jump into product already. There's no point in staying in consulting. So this was kind of, like, a perfect start of opportunity, right mentor, and the right pull from the market because I was actually pulled from the market. And then we decided to digital consultancy and start basically go all in with Nanny ML, raise funds, and build the product.
[00:04:27] Unknown:
In the landing page and some of the marketing material around Nanny ML, it brings up this concept of post deployment data science, which is what you were saying about you build these models, you hand them over to your clients, and then they say, what do I do with it now? And I'm wondering if you can give your definition of what that term encapsulates and some of the ways that it differs from the kind of metrics and monitoring approach that a lot of people have been falling into because it's something that they're familiar with from application or infrastructure?
[00:04:55] Unknown:
It's basically exactly what you said that there is right now this kind of default approach to MLOps that it's all about engineering. And what we noticed is that data science has always been about providing business value and finding insights in data that allow you to bring some value, extract some insights. And this, for the timing, we most of the time stops the moment you finish developing your model. And we think that if you extend this concept of let's try to bring insight. Let's try to ensure the value is there to everything that happens after deployment. This is really the the key to make sure that your models keep on delivering value. That's kind of fluffy.
But to be a bit more specific, what we mean by post deployment data science is doing things like monitoring, of course, but from the performance perspective. So we want to make sure that performance of your mouse stays high. And if something goes wrong, then we employ data science tools to find out what went wrong. So we have data analysis. We are doing actually machine learning on top of your machine learning models to figure out what went wrong and then how you can make sure that your model is fixed. So it's not from the engineering perspective. Up the uptime, throughput, whether the model pings, but looking at all kinds of silence that can happen and then trying to analyze using data science tools to figure out what went wrong and how we can fix it. That's basically
[00:06:22] Unknown:
it. As far as the current state of affairs where you have data scientists and ML engineers building the models and then handing it over to operations teams to manage the deployment and upkeep. I'm wondering who you typically see as being responsible for that work of the care and feeding of the model after it has, you know, gone past the training phase and you have that artifact to be able to put into production. And I'm wondering how you think about the design of Nanny ML for being able to help those people kind of augment their skills and also being able to loop those data scientists and ML engineers into that overall workflow?
[00:07:00] Unknown:
Yeah. So that is something that we're honestly still kind of figuring out. Identified kind of 3 main persona or 3 main kind of roles that could benefit from that email. The first 1 and the kind of people that are most interested in that email is actually heads of data science because these are the people who really answer to upper level management with the actual performance and the impact of their algorithms. So that's 1 thing. They need to have a tool like many more to make sure that the albums actually deliver value, but they will not be the ones using it day to day. They have the need, but they are not the end user. And then the end user we see is kind of, in a sense, maybe not equally split, but split between data scientists and ML engineers.
And so far, we see that it's more data scientists that kind of want to make sure that their models keep on develop working the most that they develop. So they still feel the ownership that this is my model. I want to make sure that it keeps on delivering value. And, also, there's that skill set that's kind of more suited for monitoring. Because, again, post deployment data science, it's not so much about engineering, but actually understanding what's going on. But, of course, when it comes to deploying any amount to production to make sure it can provide the insights on, like, day to day operation, this is something that will be handled by ML engineers. So in a way, it's similar to the models themselves when you need ML engineer to deploy them, deploy that email to monitor your model. And then you need the data scientists to actually look at the data, analyze the results, and see if something went wrong, how can I make it right?
[00:08:36] Unknown:
Given the fact that the data scientists are interested in being involved in the continued care and feeding of the models and understanding more about how the models are operating past the training cycle. I'm wondering what have been some of the points of friction that they've experienced previously that maybe prevent them from being able to be more involved in that process and that has led the current state of the industry more towards this just, you know, collect some metrics and try and figure out, you know, has this crossed the threshold of whether or not you need to care about it? So I think that
[00:09:10] Unknown:
the biggest problem is really that people had assumed that it's unsolvable. And it's something that has been a problem since machine learning production existed that models fail. But from a lot of our, like, early design partners and early users, we heard that, yeah, that's how it is. This is just a reality of life as a data scientist. So the first thing is really awareness that you can actually make sure that your models work well and they fail find it faster than you normally would. So that's the first thing. Another thing is we are still extremely early on when it comes to adoption of data science. We're kind of in a battle when, you know, work in cool companies that actually work with machine learning. But when you look at the general market, there is a lot of prototyping, and I think we were just simply too early. You first need to put models in production to monitor them. So I think right now, we're kind of crossing this threshold when data sciences, as they say, crossing the chasm. Models are actually making its way to production, and people start to realize that the problem exists and the problem is solvable.
And when it comes to kind of the friction points or maybe the examples of what prevented them, First, it's just on the kind of especially in big organizations on the political level, when there's different person responsible or even different departments responsible for deploying models and maintaining them and making sure there is an impact. And that I think is something that will change in the future. And when there's going to be maybe even a new role of a person who actually makes sure like, a post deployment data scientist or production data scientist. The person who makes sure that, these models keep on the different value after they've been deployed.
[00:10:57] Unknown:
As far as just the overall state of data science and the health of the models that are being built and deployed, I'm wondering what you see as the long term effects of having the data scientists more involved in the post deployment stages of the model development and the ways that gaining better understanding of how the model actually operates in the real world allows them to bring that learning back into the process of developing the models in the 1st place or developing successive models about how to learn more about the real world impacts of machine learning, you know, both the model on the real world and the real world on the model and kind of some of the ways that that will feed into kind of the ongoing research and development, both in companies and in academia?
[00:11:46] Unknown:
That is actually a very insightful question. And like you mentioned, there's definitely this feedback loop between the real world and the model itself. If you deploy, let's say, a churn model and the model works well, reduce the churn, which will also impact the population. So we're gonna have some kind of drift there already, which might mean, for example, that looks like the model is going to shit, but in reality, the model is still working while it's just population exchanging. So 1 thing that people definitely understand better, the data scientists that are working on these models, is how the real world interacts with the model and that it's a 2 way interaction. It's not just that model interacts with the world and changes the world, but also the other way around because the population will change, and if you return the model, the model is also going to be significantly significantly different.
Another thing is I think people will understand robustness much better. As right now, you can spend weeks of compute optimizing your hyperparameters. And most people know that they shouldn't evaluate things and test set more than once, but they still do it. And they just keep on looking for, you know, set of hyperparameters that performs best on your test set instead of your validation set. And they know that this idea of holdout is needed, but they don't really take it to the extremes. I think they should. And when you deploy models, you will restart realizing that things are not as robust as possible, especially if you spend a lot of time optimizing your models, and you're just getting good performance due to lag and due to just trying out and actually overfitting in some more subtle ways. And I think awareness of that will increase as you see that when you deploy your models, performance drops compared to your test. Even though it shouldn't, but it still does. And people are still asking why and finding the answers to that.
[00:13:39] Unknown:
More on the monitoring and alerting side of things. I'm wondering what are some of the ways that you have experienced model failures and some of the particular pain that you've encountered in terms of the available tooling, the visibility or lack thereof, and some of the ways that that has motivated you to want to invest the time and energy on solving this for everyone?
[00:14:00] Unknown:
So as I mentioned, kind of moment when we became aware of the problem was during our consulting days. And 1 very interesting thing that happened is that we were developing a segmentation model that would remove background from industrial tools, and they would be put in a catalog. So basically, for automation use case. And interestingly enough, somewhere throughout the process, as the model is being deployed, they changed the cameras. And the model really went to shit. And it's something that, of course, we couldn't have predicted, but it made us realize that things change in the real world. And as they do, the models will not perform well and failure is inevitable. Just part of life, things change, and we need to be able to monitor that. Another interesting use case that stumbled upon a bit more recently is in wind energy, where what you see is that there's always covariance shift, so change in your model inputs, also known as data drift.
And there, it really, really impacts the performance. There is no change in the patterns because it's all based in physics, so there's no concept drift. But change in just how data is distributed will have a huge impact on performance. This is something that happens seasonally, and you should know whether you should be worried about that or just normal performance. So just monitoring how your model is performing is not even enough because maybe this is something that's expected. Maybe as weather changes, there's more gusts of wind. If there's more gusts of wind, of course, your model will not predict the power production as well as in periods of low wind. So that's another use case. And then when it comes to credit scoring, 1 thing we noticed is that a lot of credit scoring companies and banks as well have this problem is that they deploy a model and they need to wait for a year, sometimes 2 years before they really get the production targets when they can actually evaluate the performance of their models. So they're kind of flying in blind for a year or 2 years, and they are using their models in basically the most important use case in their business.
So there's a huge need there to know whether to estimate performance in some way even if web ground truth. So these 3 kind of things really is what made us think of Nanny and Mal and how and the shape it is taken right now is that you cannot just simply monitor performance because it's not enough. Sometimes monitoring performance is not available, so you need to estimate performance instead. And looking at data drift is important from the performance perspective, but not for the sake of data drift itself because that is something that will always happen, and it's not necessarily bad if your models generalize well. Yeah. And to that point of data drift being the
[00:16:45] Unknown:
kind of most frequently brought up case of issues with model performance or things that you have to watch out for after you get a model into production. I'm wondering what are some of the other kind of more insidious or less understood or less observed ways that models can start to go wrong without based on maybe, you know, ROC or f 1 scores or just some of the ways that the model can still start giving you bad outputs without you necessarily noticing it. And, you know, conversely, how tracking these, you know, high level metrics or or these specific metrics that people are able to observe more directly, how that contributes to issues of alert fatigue in ML teams?
[00:17:32] Unknown:
That is a very loaded question, so I'm gonna answer it by by part. So first, starting with what can contribute to model failure. Let's assume that we cannot simply measure performance as it is the case in actually most of these cases because there's there are delay. Or in automation use cases, we don't have full ground truth. So just measuring f 1 or whatever metric you choose is not easily doable or doable at all. And then there is basically 2 main reasons why the models can fail. The first 1 is data drift or covaritship. So the change in the joint model input distribution to be very technical here. So we're looking at all the features and its joint distribution changes. If that happens, it might mean that your model your data is going to be drifting to regions when it's harder for the model to make its prediction.
Maybe because the model is not trained in this region, or maybe it's a region that, let's say, it's very close to the class boundary. So it's just harder to make the correct prediction. Then you will see that the performance of your model is decreasing. And this is something we can actually fully estimate the impact of convergence on performance with Honeywell, which is 1 of the cool unique features we have. And then kind of second, the reason why mouse can fail is concept drift. And concept drift is changing the pattern or mapping or a function, however you want to call it, between the model inputs and the target.
So it's probability of y given x. And that means that basically, the pattern that your model has learned during training is no longer so we will see a drop in performance. So these are kind of 2 ways that, models can fail. Now looking at how they can fail, and you will not even be able to tell by just looking at the metric. 1 thing is that imagine you have data drift to regions that are less uncertain. So the model performance should increase. But at the same time, you have concept drift, so the pattern changes. So the performance does not increase.
And that means that even if your f 1 stays the same, in reality, the model is worse. It's not predicting as well as it could be. And this is something you would not be able to see unless you can estimate the impact of data drift on performance. If you do it, you will see that given this data drift, the model should be performing better, but it's not. So something's off, and this off part is the constant drift. So that's 1 of, like, ways of really insidious silent failure. And also the inverse is true. Like I mentioned in when it comes to wind energy, it might be that you have periods when the model performance drops.
But there's still nothing to worry about because there is just a momentary drift to regions that are higher to predict them, like gusts of wind because wind blows from the mountains. Yeah. I think that's enough talking for 1 question.
[00:20:24] Unknown:
And in terms of the detection capabilities that you're bringing into NannyML, you mentioned that 1 of the ways that you're able to identify some of these failure modes that are often silent if you're just tracking, you know, bare metrics is the fact that you need to be able to see based on this prediction and the input data that I'm seeing. I'm assuming that this is going to be the performance that I see, but because the distribution of the data is different than what the model is trained against, the concept drift is such that it's actually degrading the performance despite my prediction that it will get better. And I'm wondering what are some of the other ways that that predictive capability that you're bringing in Nanny ML is able to highlight some of these failure modes that are kind of flying under the radar and some of the kind of broader impact on maybe confidence in ML and the organization, just some of the broader impacts that these silent failures have when they come to light and some of the ways that this greater visibility that Nanny ML brings allows ML teams to kind of build up better trust in the organization?
[00:21:32] Unknown:
That's another loaded question. So, again, I'm gonna start the first part, and I'll try to combine it with the alert fatigue that you mentioned. So 1 of the things that happens a lot is that people, when they do monitoring, they look at every single feature kind of in a void. And if your model has a 50 features, something is basically guaranteed to drift. And just because a drift is is significant from a statistical perspective, it doesn't mean it's significant from a model monitoring perspective. You could see a very strong drift in feature that's mostly irrelevant, and the model is very robust against changes there, and everything is gonna be fine. That's why it's so important to try to figure out what is the impact of data drift on performance.
Another thing is that these tests, if you do some kind of statistical test or you measure the difference, some kind of distance between 2 distributions, they assume that all features are independent. They don't track the change and the correlations or relationships between the features. So they kind of don't think about this joint part of the distribution. And for that, develop multivariate data drift capabilities, where you can basically train your data trainer, let's say, an autoencoder for the ease of explanation on your data for which you know everything's fine. And then you see how well this autoencoder performs on the dataset in question, your analysis dataset.
And if you see a drop in performance, it means that the internal geometry of the dataset has changed. So there's data drift. And this is a way to capture data drift that would not be captured by these tools that just look at everything from Univariate Research. Now for the second part of your question, which is how monitoring can change how smart sensors can impact the trust in machine learning. I think right now because the failures are silent, all the failures are not expected, and they are seen as something that's was not planned in saying Tibet should never happen. It's all your fault.
So what monitoring can really bring is first increase the trust that if, let's say, head of data science is reporting that the months are performing well, they actually are performing well because there's much more knowledge in the company. And another thing is kind of bring more understanding that failures will happen. They will be resolved fast, and they're just part of normal operating procedure. Just like like you see right now, you need to maintain your physical equipment, and physical equipment will have to be out of operation for a while. The same is true for machine learning models. Things will happen, and model is not performing as well as it used to, and it's gonna take us, let's say, 2 days to fix it. That is something that should be normal in companies, but now every time there is an issue, everyone's ringing alarm bells that it's no. You cannot trust machine learning, and it's not reliable.
It's reliable. It's just not perfectly reliable. And if we quantify this reliability, that will bring more trust.
[00:24:29] Unknown:
Another aspect of the question of alert fatigue is that if you do get an alert, you don't necessarily know what to do about it, where a lot of the conversation that you see in the ML space is, oh, if you see concept drift, then that just means that you need to go back and retrain your model based on the new data that you have and redeploy it, and everything will be great. I'm curious what you have seen as some of the ways that teams who are using Nanny ML or just this better understanding of what is happening with the model allows you to think about what actions can and should be taken based on the actual type of failure that you're seeing and the overall question of what are the remediation actions that can and should be taken when you do see that there is some problem that is manifesting?
[00:25:16] Unknown:
So when I think about monitoring, I think there's kind of 3 parts of it. The first 1 is detecting failure. The second 1 is realizing why failure happened, and the third 1 is resolving the failure. And right now, with Nanaimao, we basically provide the first and the second. And the third 1 is, for better words, manual. Because what we realized is that really is extremely use case dependent. The first thing you should do, your default action should be retrained, but instead of redeploying immediately, you should either run an AB test, and this is something that we saw some people doing. Or and that is a very interesting case, is that you retrain it, but you retrain just on the part of the drifted data, and you see how the model performs on the rest of the drifted data. So you do kind of a phantom deployment.
And then if everything's fine, it can redeploy. If not, to go back and basically do some kind of root cause analysis, which you can do with data drift capabilities, both being varied and multivariate. And then also, you know, tap into your domain knowledge to see what are the changes in data quality upstream that might have caused the failure. What are the changes in the business side of things that might have caused the failure? Maybe there is a new marketing campaign, and this marketing campaign completely fucked up our recommenders. That's something that might have happened, but a tool like Nanaimala has no way of seeing that because it's completely external to the to the machine learning system itself. So there are always the human component there because you're basically like an investigator trying to figure why things happen.
And then once you know the full picture of why, it's a matter of retraining or redeveloping your model, maybe with different architecture, maybe by adjusting the data. So if you have a strong data drift, maybe you should adjust your old data to some kind of shift there to make it match the current distribution because 1 will just get confused otherwise trying to complete the different patterns at the same time. Yeah. I guess that's
[00:27:16] Unknown:
it. Another interesting kind of roadblock to the question of just retrain and redeploy it is that sometimes for for a particular model instance, that retrain cycle might take, you know, several hours or days. And I'm wondering how that factors into the decision about, okay, what do I do in this instance? Like, is it typically a case of this is an urgent failure, and I need to be able to resolve this immediately? Or are the failure modes Oh,
[00:27:55] Unknown:
yeah. That Oh, yeah. That's another interesting question. And you mentioned that it can takes, you know, days. But if you look at wider organization in some banks, it can actually take months because you need to submit a request to train a model. It's to be approved by the compliance team, and it it really takes ages. 1 of our data scientists at Nanny Mal used to work at 1 of the more advanced banks. And even there, just because the regulation is so heavy, you cannot simply retrain your model. And you need to go through steps with multiple different teams, multiple different external validators before you can return your model. So it's at least a month.
And I think that is also why you actually have to return your model or it's nice to have is important. So, again, estimating the performance and monitoring the performance is really the key here. And from there, you can see where the impact is. Long term and severe assess the severity, assess also whether it's a long term impact or just passing impact to seasonality. And so again, data drift impact of data drift on performance, estimating performance, and measuring performance. Oh, and here there is another action thing at play, which is the uncertainty estimation of the performance itself.
This is something that we will just be missing to to the library, and I think 2 weeks time from now, when we'll also be giving the uncertainty estimates on our performance, our calculated performance. So you will not know that let's say, rock AUC is right now 0.85, but 0.85 plus minus 0.6. And if you see that you have strong drop in performance, but also uncertainty is really high, maybe that is something that, you know, you just got unlucky with the data, and you should keep on going as it is. You don't have to retrain. Maybe you do, but let's wait and see. So if you combine this uncertainty estimates with the data drift and performance estimation, you should have a full picture of whether you need to retrain or not.
[00:29:54] Unknown:
This might be going a little too far afield and too far into kind of the systems design aspect of it. But in the case where you do have a model, it is seeing critical errors in the ways that it is generating predictions. I'm wondering what are some of the kind of application or systems engineers' patterns that you've seen for people to be able to include some sort of a circuit breaker where it says, okay. This model is too far off the rails. We can't trust it. Retraining it is gonna take too long, so we need to have some sort of fallback mechanism to be able to say, you know, oh, the this capability is no longer functional right now. Check back later. Or, you know, we'll just go with kind of a more statistical heuristic approach that doesn't rely on this more advanced model capability. So we will say, you know, I have kind of a graceful degradation of capabilities, but the model is no longer in the loop such time as you can fix the underlying problem that it's experiencing.
[00:30:45] Unknown:
To be honest, I think that it's step too far, not necessary for us here, but for the reality. I don't think we're there yet. Maybe there's a handful of companies that already have those capabilities. I haven't spoken to anyone that actually already had a full blown system of how to do failure. Most of the companies that do monitoring do it manually. Everything is manual. All decisions are basically taken on the spot, and I think we're quite early when it comes to that. 1 thing I saw is that there is an off switch. If the failure is really critical to the point when it's still bringing values detrimental to the company, then the first thing you do is you just turn it off.
Or you still run a prediction, but you don't actually act on it. So you then something of that would actually have an impact. So the model is just running in the void, and then you start resolving the problem. And once the problem is resolved and you confirm that the failure is no longer there, you can deploy the new model.
[00:31:44] Unknown:
That's basically the only thing I know of. Yeah. It's definitely an interesting thing to think about of, like, all the different ways that things can go wrong and how to resolve them. And at a certain point, you just have to say, I can't predict it. It is what it is, and we'll figure it out when we get there. Yep. And so in terms of the Nanny ML project, can you talk through how it's implemented and some of the overall design questions that you had to address in terms of the technical and user experience interactions and how to be able to fit this into people's machine learning workflows?
[00:32:17] Unknown:
On high level perspective, I've decided to go for a Python open source library because, let's be honest, almost all data science right now is done in Python. People want to just install things or condense or whatever is your preferred method here. So that's 1 thing. When it comes to the actual implementation, we basically have kind of a default structure, default workflow for Nanaimo itself. Well, first, you plug in your data, and your data can be in form of kind of 1 of analysis when you have your reference datasets for which you know if it's fine, your analysis datasets for that you'd like to analyze. And this kind of mimics the manual monitoring process when from time to time, let's say, every week, every 2 weeks. However often you need to do it, you will manually run manual, see what's going on there. And then the flow is that is that you get your data via parquet files or whatever you want, or if you have it's a microservice that provides the data.
You see what's going on starting with performance calculation or estimation depending whether targets are available or not, Then everything's fine, good. If things are not fine, we need to go into data drift, and you'll go do a multivariate and univariate data drift detection. And then we have a, honestly, pretty minimal capability to link data drift to performance. So we'll be able to see what are the potential reasons for dropping performance. And from then on, we go to a manual quest of resolving the issues. That's kind of 1 workflow. Another workflow is that if you deploy to production.
In that case, Manheiml is running via Airflow or some other kind of service. We just release our CLI. So right now, it's much easier to just run an email as part of your ML system, and you get your alerts. And you have a, let's say, a script or notebook when an email is running when you can just view updated dashboard with however often you want to run an email. It can be every 10 minutes. It can be every 5 minutes, basically, as often as you want. There is 1 thing that I generally caution about, which is trying to do streaming with monitoring.
And the thing is that you can obviously use an animal in streaming fashion, but you need to have certain volume of data before you can say that the performance has degraded. Degradation is like inherently probabilistic process. So if you just get 1 new data point, it really does not make sense to run the whole analysis again because things cannot have changed that much. And even if they do, you will not be able to tell by just 1 point, and also the algorithms cannot take that. So what I would recommend is basically running batch every few minutes, few hours, how often do you need when you have enough volume of the data. And then if there's alerts, then the rest of the process triggers, and that can either automatic trigger retraining or an advanced organization that deal with automation use cases that could trigger manual spot checking.
So then you basically create a task, send the signal to team that does spot checking. So manually reevaluate whether the performance actually has decreased. And from then on, you can take actions to remediate.
[00:35:34] Unknown:
In the process of building Nanny ML and starting to work with some of the end users of it, I'm wondering what are some of the initial concepts or ideas that you had about how to approach the problem or how to integrate it into people's systems that you had to reconsider or reevaluate as you started getting it in front of people with real world use cases?
[00:35:55] Unknown:
Yeah. So our first idea is to basically market it towards business people because that is kind of the end consumer. And while the idea is sound in principle, in reality, it really did not work because they just don't understand machine learning at all. That's why you switched to data scientist and heads in data science. And the other thing is that we started with a dashboard. People want to view it as a dashboard. What we learned from early user interviews and from early design partners is that we actually don't care that much for a dashboard. Like, there's certainly some some companies organization when they want it, and they care about dashboard as kind of a second step when they deploy themselves via literally Jupyter Notebooks.
They view things themselves there. Everything's working fine or not fine. And then if there is a problem or they need to communicate it to business stake holder, that's when the dashboards come in. So then we had to completely change the priorities. We dropped the dashboard. Or focused on usability within Jupyter Notebooks. And really, in the entire question of UX is I want to run things fast using interface that I know. So we often default to how is scikit learn doing it, or how is when it comes to plotting, how is Seaborn doing it. And we look there, and it's our default. And unless we have a very strong reason to change the interface, we basically copy them because they did it well, people know how to use it, and they are happy with these libraries.
So I guess that is 1 big thing we needed to reevaluate. Another 1 is kind of we had assumed that the idea of estimating the performance without the ground truth is something that will click in in user's head. But we realized it's not as easy, and we had to do a lot of, like, nudging both in our documentation and in the visualizations themselves to show that we are estimating this performance. This is real performance. We are estimating what is likely to have happened due to data drift. And this is something still an open question because for a lot of people, it takes a while to even realize that it's possible to even estimate the performance in a machine learning model. And it's definitely not an intuitive idea.
[00:38:08] Unknown:
Another interesting challenge in the ML space is that you need to be able to support a fairly large variety of different frameworks and data types and model types and model architectures. And I'm wondering what are the initial target capabilities and initial focus that you had for that kind of matrix of use cases and the ways that you have thought about the initial foundational aspects of building Nanny ML to allow you to then re branch out into, you know, maybe going from tabular data to unstructured data or image data, etcetera, or, you know, the different frameworks that you wanted to support or model architectures that you wanted to be able to understand how to generate these predictions?
[00:38:56] Unknown:
So that's something we actually thought a lot about. And what we realized that from the use case perspective, there's basically 1 huge difference is that whether the ground truth is delayed or not. If we work with cases when ground truth is not really delayed, so high frequency training or delivery time prediction, when it's gonna take you half an hour to get your sorry, to get your targets. That is something that should be monitored in a completely different way than, let's say, loan default prediction for mortgages when you wait years until you know your target.
Then on the model and framework side, we made the decision very early on. Also, because, honestly, we got lucky with research that we're gonna be fully agnostic when it comes to everything there. So we don't actually need the model. Nanimo does not need the model file, does not need to know the framework. You just work with data. That was a very kind of committal decision that we took in the beginning that an animal should always be able to work with just data. So then by default, we actually support all kinds of models because we don't need to look at them, all kinds of framework because we don't need to use them. So we managed to kind of size the the problem, and so far, it's working quite well. Now when it comes to the actual data types, that is, I would say, the biggest challenge. That's something that we're able to sidestep.
We decided to focus on tabular data because that's where, so far, we've seen both the value is in the companies, and also this is the type of data that is most prone to failure. Like I always say, a horse is a horse. It's not gonna stop being a horse anytime soon. So if you look at image data, it's less likely to change. But, of course, as I already mentioned, sometimes the cameras change. 1 interesting example we got from a design partner is that they were doing detection for COVID based on x-ray images. And there was a new strain, and the model failed because the new strain actually showed itself in the x-ray images differently. So, of course, there's truth there, but we decided to focus on churn use cases, credit default scoring, upselling, cross selling, all this boring AI because these models are actually in production, and if they fail, that's a huge problem.
And now moving on to images and x data. We are right now actually running a small internal project to see what changes we need to do to support them natively. And what we find out, you can see you can already use an email for text and image data, but instead of working with the raw data, we need to go to the embedded layer. And there, you could basically treat it as tabular data, and performance estimation works well. Multivariate data detection works well. Univariate data detection returns bullshit because it just get pixel values or you get, you know the vector 17 has changed. Yeah. That's not interpretable and it's very useful.
So we don't have the interpretability there, but the core capability of drift detection and performance estimation is still there.
[00:41:57] Unknown:
As far as being able to integrate Nanny ML into an ML system, what are some of the foundational systems and capabilities that are necessary to be in place to be able to handle feeding the data into it, generating the predictions, being able to capture the outputs of Nanny ML and integrate that with the other monitoring and alerting systems that can and should be in place?
[00:42:24] Unknown:
So our default kind of workflow is you have your model deployed as a microservice. So first, hard requirement is that your model is actually in production, which people sometimes say that it is in production, but it's actually not in production. Or it's in production, but nobody's acting on the predictions. So 1 important thing is you should implement, an animal as you deploy your models or after you deploy your models, but it's a tool for post deployment data science. So that's 1 kind of obvious thing, but it's sometimes less obvious than it seems. And then we basically assume that you have a microservice architecture when you have a model deployed somewhere that is scheduled via Airflow or whatever you want. And then what you want to do is you want to spin a Docker container, let's say, or have a you you containerize your your microservices with Nanaimo.
And what you feed to Nanaimo is all the inputs that come to the model and all the outputs from the models. Plus, if you have access to this data, also the target data, the production target data. On top of that, you will need some periods for which you know everything went fine. This can be your validation set or you could just get some production historical data for everything, when everything was fine. Let's say, first month of production. And then you just if you do batch processing, you can print an email at the same time as you query the model. The same exact data you first run through the model, then you get more of the inputs and outputs to Nanima. And then we will need to have a way to display it. The defaults are following the Jupyter Notebooks only just have 9 month running. It displays the data. You view it in your browser and that's it to run it. So there is not that much to be done on this front.
[00:44:09] Unknown:
As far as the post-deployment data science aspect of it, we discussed this a little bit, but what are some of the ways that the outputs of NannyML can get fed into a partner model that sits alongside NannyML and feeds back into the model that you're monitoring, or some of the ways that the information NannyML surfaces can be used to build additional models that serve as automated checks or automated remediation flows for the model that you're actually putting into production and that is generating the value?
[00:44:46] Unknown:
The easiest way is basically the alerts. We have alerts that are returned both in visual form and as a data frame. And what you will see there is that if you have an alert on performance, which can be set with a custom threshold or a default threshold, that is the easiest thing to act on. So basically, you can look at alerts on performance, alerts on data drift, and alerts on specific features. And then you can use that information to first trigger automated retraining. This is kind of the minimal, fully automated loop: when you automatically retrain, you deploy the model, and then you pass it through phantom deployment or A/B testing, through NannyML again, to see whether the issue has been resolved. If the issue has been resolved with this feedback loop, then the model gets deployed to production.
So that's kind of the default remediation technique. And then there's another flow if the issue is not resolved, so retraining is not sufficient. You could pass what has been tried, plus all the other information on performance and data drift, to a data scientist, who can basically start from that and do an EDA to figure out what went wrong. And then you go back to the same process: you deploy the model via phantom deployment or A/B testing, you run it through NannyML again, and you see if the issue is resolved. If the issue is resolved, you deploy the model. So there is this kind of feedback loop approach, where you can use NannyML both to alert you that something has gone wrong and to confirm that the issue is resolved.
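The feedback loop described here could be wired up roughly as follows. The layout of NannyML's results dataframes (including the alert columns) varies by version, and the retrain, phantom-evaluation, promote, and notify callables are hypothetical placeholders for whatever your own pipeline provides.

```python
# Sketch of the alert-driven remediation loop. The alert column layout differs
# between NannyML versions (newer releases use a multi-indexed results frame),
# and the callables passed in are hypothetical pipeline steps.
def has_alerts(results) -> bool:
    """True if any chunk in the results raised an alert (column names vary by version)."""
    df = results.to_df()
    alert_cols = [c for c in df.columns if "alert" in str(c).lower()]
    return bool(df[alert_cols].fillna(False).astype(bool).to_numpy().any())

def remediation_loop(estimator, drift_calc, analysis_df, retrain, phantom_eval, promote, notify):
    """One pass of the loop: alert -> retrain -> verify in shadow -> promote or escalate."""
    perf = estimator.estimate(analysis_df)
    drift = drift_calc.calculate(analysis_df)
    if not (has_alerts(perf) or has_alerts(drift)):
        return  # nothing to do

    candidate = retrain()                # e.g. retrain on data excluding the bad period
    shadow_df = phantom_eval(candidate)  # inputs/outputs collected via phantom deployment or A/B test
    if not has_alerts(estimator.estimate(shadow_df)):
        promote(candidate)               # issue confirmed resolved: roll the candidate out
    else:
        notify(perf, drift)              # retraining wasn't enough: hand off for manual EDA
```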
[00:46:18] Unknown:
And then in terms of the ways that you're thinking about making this project and business sustainable, I'm curious how you have thought about the kind of governance aspect of the open source project and the ways that the business you're building around that, you know, feeds into it, and just some of the division between what is freely available, what is commercial, and just the overall business model that you're targeting?
[00:46:43] Unknown:
So for the time being, and for the foreseeable future, we are fully focused on open source, and it is going to be an open-source-led product. The open source is always going to be the main thing, and we kind of follow the open core philosophy, where all the core algorithms that provide value, the new science that we're developing, or the novel engineering at least, are always going to be open source. Right now, we're not monetizing anything. What we're focusing on is adoption first and foremost, trying to get as many people as possible to use NannyML in production in real-life use cases. And based on that, we will be seeking more funding soon, soon-ish.
And we have a bit of runway yet. In the future, there are basically two ways we plan to monetize. One is everything that has to do with an enterprise edition: security, integration, privacy, helping big enterprises create workflows where NannyML can be used to provide value, just like the one I described before. The other is business-oriented features: trying to use additional signals from the business that are not in the dataset to help improve model monitoring, and, if it's possible to quantify the impact of a model on the business itself, also monitoring those metrics. So maybe ROI on the model.
A simple one in credit scoring would be, you know, risk-adjusted ROI on how much money you're getting per loan. Is that changing? How does that relate to data drift and model metrics? Can we also estimate that? These are things that almost nobody cares about except business people, so it's a perfect thing to keep outside of the open source solution, because it's something that would not drive adoption and does not drive value for the vast majority of people, but it is a good way to monetize and make it a business. Because if these capabilities are needed, they are really needed, and people are willing to pay for them in a way that really shows it provides value.
[00:48:51] Unknown:
In your experience of building the project and working with some of your design partners, what are some of the most interesting or innovative or unexpected ways that you've seen NannyML used?
[00:49:01] Unknown:
The idea of using NannyML to A/B test models is not mine. I've mentioned it a few times, but it was something that actually came from a design partner. At first, they were not even that interested in the monitoring aspect of it, but they were doing retraining, and the biggest thing is that they never knew whether the automated retraining actually made sense. Maybe we're actually hurting the model more than we're fixing it. And that is something that actually happened once when there was an upstream data issue. It meant that the quality of, I think, two months of data was much lower than the previous months of data, the model was retrained, and the performance dropped significantly, because even though the data quality issue was found and resolved, the same data was still used to retrain the model. I don't know why. It just happens in enterprises.
And we were able to actually spot that with NannyML. So that is one way to monitor automated retraining, to make sure that you don't do something wrong by retraining your model. That is really the biggest one. Another one would be to look at the training data itself and understand it better, so NannyML can be used as an analysis tool when you basically want to see what the changes within the training data are and whether you should use all the data for training or not; to basically see whether this data is really relevant or not. That's something that normally would only be done in a very manual way. Let's say you have the last 20 years of credit scoring data. As a person working in finance, you would know that everything before 2008 and after 2008 looked completely different.
But that's something you need to know, and there are maybe multiple other changes like that that you would not be aware of. And you can use NannyML to kind of automatically analyze these changes in the concept and the data drift within the training set, to figure out which data should be used and which data should not be used to train your model.
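A sketch of how that training-set analysis might look: fit a drift calculator on the most recent slice of historical data and scan the older years for structural changes. The class and argument names assume a recent NannyML release, and the `history` dataframe with an `origination_date` column is hypothetical.

```python
# Sketch: treat recent history as the reference and run drift detection
# backwards over older years to spot regime changes (e.g. pre- vs post-2008
# credit data). `history` is a hypothetical dataframe of historical training
# data; argument names assume a recent NannyML release.
import nannyml as nml

feature_cols = [c for c in history.columns if c != "origination_date"]

calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_cols,
    timestamp_column_name="origination_date",
    chunk_period="Y",  # one chunk per calendar year
)
calc.fit(history[history["origination_date"] >= "2018-01-01"])   # recent data as the baseline
results = calc.calculate(history[history["origination_date"] < "2018-01-01"])

# Years flagged with an alert look structurally different from recent data and
# are candidates to exclude (or reweight) when retraining.
print(results.to_df())
```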
[00:51:08] Unknown:
In your own experience of building the project and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:21] Unknown:
One thing is that we really had no clue how to build a product, and we kind of had a clue how to build a business because we had built a consultancy. But the product side, I was like, what's product? You engineer the thing, you do the data science, and then you release. And we really did not realize that there's this entire thing called product thinking, where you need to figure out and combine what's needed with how to present it, and make sure that there is value, usability, business viability, and actual feasibility. And this was completely not on my radar.
That was one big thing that I learned. I read a book called Inspired, and that is really something I would recommend to everyone: engineers, data scientists, whoever you are, if you work at a tech company, this book is really great. Another thing is the value of early user testing. At first, we decided to build one thing, we built it, and I was like, yeah, it's kind of there, it works, but not great. And after we hired a product designer, we basically switched to always trying to test mocks and prototypes, and to test as often as possible.
And once we decide to build something, we have a much higher chance of it being the right thing. On top of that, we are also much more committed to actually building things and not abandoning them halfway through because we found out it's not something people want, because when we build something, we already know that people want it. And the last thing I would say is that adoption is much more nuanced than expected. We started with a pure business focus, because that is objectively where the value of data science is, but that doesn't matter. If these people are not interested or don't understand what we're trying to do, we will not be able to bring value to them, because they are not open to it.
And that's why we decided to switch our focus. Being open to who your user actually is is very important, and focusing on the people who will use NannyML immediately is more important than focusing on the person who will, in the end, get some value out of it. Working with engineers is completely different from working with data scientists. They are different people, they are driven by different things, and there's even a huge communication gap between engineers and data scientists. One thing that I noticed is that data scientists, in general, and of course I'm gonna make a terrible generalization here, are driven by curiosity. What makes them tick is figuring things out, and with engineers, it's much less so. They are driven by building things. They want to create things.
These are completely different motivations, completely different ways of looking at the world. For me as a data scientist, the problem ends when I figure out how to do it; actually doing it is almost an afterthought. If I already know how it works, that's what I wanted. Whereas engineers don't really care that much about how things work; it's about building things. These sometimes stand on seemingly opposite ends of the spectrum, and I had to learn how to recast the problem in a different light to appeal to different people. Yeah. It's definitely an interesting insight about kind of the
[00:54:33] Unknown:
challenge of being able to motivate people, because of the kinds of areas where they want to spend their time and focus, and what it is that they're actually intrinsically interested in.
[00:54:43] Unknown:
Absolutely. And then you have business people, and most of them just want to have one metric and drive that metric as fast as possible. So it's yet another motivation, and it's very
[00:54:56] Unknown:
interesting. For people who are deploying machine learning and have it in production, what are the cases where NannyML is the wrong choice?
[00:55:04] Unknown:
So I would say that there are a few use cases that we're just not ready for as a library, as a product. Let's say that you don't do any monitoring at all yet, you work at JPMorgan Chase, and you're responsible for creating the entire platform for monitoring everything. We don't have the capability to actually implement NannyML for users, so we wouldn't be able to assist you there, and you should probably look for a closed source solution that comes with the entire playbook of how to do it for a huge organization, from scratch to finish.
NannyML is inherently an open source product, which means that we're happy to help and assist with anything that has to do with implementing NannyML, or with monitoring, observability, or post-deployment data science in general, but we simply don't have the manpower and the capability to do the integration for you. So if you don't have the capability to integrate NannyML yourself, NannyML is not the choice for you. This is true, I think, for most open source projects, and just as much for NannyML. Another thing is really big data.
We optimized our library and it works reasonably fast. If you have, let's say, single-digit terabytes of data per day, then it's a good solution for you. If we're looking at petabytes per day, this is gonna crash. It's not gonna work. It's not designed to work with that scale of data.
[00:56:32] Unknown:
As you continue to build out NannyML and iterate on the product and add new capabilities, what are some of the things you have planned for the near to medium term or any projects that you're particularly excited to dig into?
[00:56:46] Unknown:
So when it comes to the product roadmap, we always have, like, a research roadmap, because we are trying to come up with new ways to solve problems that have not been solved before. The main one being, of course, performance estimation without ground truth, but also multivariate drift detection. And there, we recently figured out how to do performance estimation for regression models, so we will be releasing full support for regression in the coming weeks, to be more specific, within the next four weeks. It's going to be in the library.
On the short-term horizon, as already mentioned, there is expected sampling error, so kind of a way to measure the uncertainty of our predictions, for data drift and for performance as well. A bit more long term, we will be looking into segment-level analysis: first, automatically segment your data from the perspective of model monitoring, and then run the entire analysis on these segments. You'll be able to find underperforming segments where maybe at the general population level the model is still fine, but there is some segment of your data where the model is not performing well, and you should be looking into that. So, kind of fine-grained analysis. Then, roughly within the same time frame, we'll have explicit, full support for text data and for image data.
And then, a bit more long term, we'll also be looking at concept drift detection, so ways to detect not only data drift but actual concept drift, and to link it to performance. This is something we just started the research on. But since it's a mostly unsolved problem if you don't have access to your ground truth, we are starting by assuming you can access your target data, and we'll be releasing support for that shortly. When it comes to concept drift detection without access to ground truth, that is a really open research problem.
But if we are able to really figure that out, we'll be able to almost always estimate the performance of your models quite correctly, which is kind of the holy grail of ML monitoring. But that's a while from now.
[00:58:54] Unknown:
Are there any other aspects of the work that you're doing at NannyML or the overall space of post-deployment data science that we didn't discuss yet that you'd like to cover before we close out the show? Oh, I just remembered one thing about the roadmap
[00:59:08] Unknown:
on the integration side of things. So everything related to engineering: we will definitely be looking into explicit integrations with MLOps tools to make deployment of NannyML itself as seamless and effortless as possible, so tools like ZenML and the like, on top of what we have right now with our CLI and Docker deployment. When it comes to other things, maybe to learnings, something I learned in my journey is that recruitment is extremely important and hard. But because we spent a lot of time actually figuring out who we wanted and describing it very well, we actually never had issues with not having enough applicants or with not being able to find the person we were looking for.
So it's just kind of a fluffy thing: focus on recruitment more than you think you should. Recruitment decisions are really some of the most important decisions you can make in a startup. I think that's it, actually; nothing specific to talk about. Just, you know, I would like to encourage everyone to go to our GitHub, give it a try, and I am always happy to either assist or receive any kind of feedback. Oh, and you always get haters. We got our haters recently, and we actually felt good about it, people bashing NannyML, because if they are, that means that they care, and that's good.
[01:00:31] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:00:45] Unknown:
I think it's mostly the awareness of potential impact. This is not going to be, like, the popular default answer here. But I think if businesses and executives realized the potential impact of machine learning in their companies, and if they were able to actually track it, they would be much more willing to invest proper structure and resources into deployment. And if you see the potential upside, problems tend to disappear and get resolved if there is a real need for that. Another thing is well-structured processes, not only for developing machine learning models, but also for deploying and monitoring them. And I think right now we have a very strong disconnect between prototyping, deployment, and everything that happens after deployment.
What we will hopefully see in the future is that these things will come closer together. Data scientists who develop models will also be looking at them through the lens of: can I actually deploy this thing? Will it work when it's deployed? People who work on deployment will think about what happens after deployment: if I deploy it this way, is it easy to retrain, easy to develop further? And people who work in post-deployment data science will be able to look back to the previous stages, to make this whole process more structured and not as disjointed as it is now.
[01:02:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at NannyML. It's definitely a very interesting problem space, and it's great to see folks tackling the question of how do you actually understand what your model is doing in production, and how do you understand what to do about it? So I appreciate all the time and energy that you and your team are putting into addressing that problem, and I hope you enjoy the rest of your day. Thank you. It was great to be here. Enjoy.
[01:02:42] Unknown:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to The Machine Learning Podcast
Interview with Wojtek Kuberski
Founding NannyML
Post Deployment Data Science
Roles in Model Maintenance
Challenges in Post Deployment
Real World Impact of Models
Model Failures and Monitoring
Detection Capabilities of NannyML
Remediation Actions for Model Failures
Systems Design for Model Failures
Implementation of NannyML
User Feedback and Iteration
Supporting Various Data Types and Models
Integration into ML Systems
Automated Remediation Flows
Business Model and Sustainability
Innovative Uses of NannyML
Lessons Learned in Building NannyML
When NannyML is Not the Right Choice
Future Plans for NannyML
Closing Remarks