Summary
Because machine learning models are constantly interacting with inputs from the real world, they are subject to a wide variety of failures. The most commonly discussed error condition is concept drift, but there are numerous other ways that things can go wrong. In this episode Wojtek Kuberski explains how NannyML is designed to compare the predicted performance of your model against its actual behavior to identify silent failures and provide context to allow you to determine whether and how urgently to address them.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix, and track their data across the ML workflow (pre-training, post-training, and post-production) – no more Excel sheets or ad-hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30 day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
- Your host is Tobias Macey and today I’m interviewing Wojtek Kuberski about NannyML and the work involved in post-deployment data science
- Introduction
- How did you get involved in machine learning?
- Can you describe what NannyML is and the story behind it?
- What is "post-deployment data science"?
- How does it differ from the metrics/monitoring approach to managing the model lifecycle?
- Who is typically responsible for this work? How does NannyML augment their skills?
- What are some of your experiences with model failure that motivated you to spend your time and focus on this problem?
- What are the main contributing factors to alert fatigue for ML systems?
- What are some of the ways that a model can fail silently?
- How does NannyML detect those conditions?
- What are the remediation actions that might be necessary once an issue is detected in a model?
- Can you describe how NannyML is implemented?
- What are some of the technical and UX design problems that you have had to address?
- What are some of the ideas/assumptions that you have had to re-evaluate in the process of building NannyML?
- What additional capabilities are necessary for supporting less structured data?
- Can you describe what is involved in setting up NannyML and how it fits into an ML engineer’s workflow?
- Once a model is deployed, what additional outputs/data can/should be collected to improve the utility of NannyML and feed into analysis of the real-world operation?
- What are the most interesting, innovative, or unexpected ways that you have seen NannyML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on NannyML?
- When is NannyML the wrong choice?
- What do you have planned for the future of NannyML?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- NannyML
- F1 Score
- ROC Curve
- Concept Drift
- A/B Testing
- Jupyter Notebook
- Vector Embedding
- Airflow
- EDA == Exploratory Data Analysis
- Inspired book (affiliate link)
- ZenML
[00:00:10] Unknown:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building natural language processing models to programmatically inspect, fix, and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10x faster ML iterations. Galileo is offering listeners of the Machine Learning Podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to themachinelearningpodcast.com/galileo
[00:01:03] Unknown:
and request a demo today. Your host is Tobias Macey. And today, I'm interviewing Wojtek Kuberski about NannyML and the work involved in post deployment data science. So, Wojtek, can you start by introducing yourself?
[00:01:14] Unknown:
Absolutely. So thanks for having me, Tobias. I'm a cofounder at NannyML. What I do is basically everything that's related to tech. So I kinda manage the product side of things and the research side of things, which are actually separate at NannyML because we are a deep tech company. My background is in mechanical engineering, and then I did my master's in artificial intelligence in Belgium, where I stayed for a while. I freelanced for a bit, started my own consultancy, handed off the consultancy to someone else, and moved to Portugal for, you know, the sun, the food, and everything else that you can get here.
[00:01:49] Unknown:
And, yeah, that's basically me. And do you remember how you first got started working in machine learning?
[00:01:55] Unknown:
Yes. So that is, I think, a typical story when you come from a physics or mechanical engineering background. My bachelor thesis was about optimizing the shape of wind turbine blades and generally wind turbine stuff. And then I realized I actually enjoyed the optimization and computational side more than the actual physics. And since I'm a very lazy guy, I wanted to automate it all, and that's where machine learning came in handy. So I basically decided, let's just try to optimize everything and automate everything, and machine learning is really the tool for that. The algorithms themselves were just really, really interesting. So that's basically when I made my decision to switch from mechanical engineering to AI. And I got lucky that I got accepted to one of the master's programs in AI, so I didn't have to start from scratch with my education.
And that was really it.
[00:02:48] Unknown:
So as you mentioned, you have now cofounded a business in NannyML. I'm wondering if you can describe a bit about what it is that you're building there and some of the story behind why you decided that this was the problem that you wanted to spend your time and energy on. I started a previous company, a consultancy,
[00:03:03] Unknown:
together with my current cofounders at NannyML. And our goal was always to find a product that we could make big, basically some unfulfilled need that we could try to fulfill. And we had a couple of ideas, but there was one thing that came up over and over again with our consulting clients, which was the question of, what am I gonna do after you deploy these models for me? How do I make sure that these models actually keep on working? And we couldn't really make it our problem as part of the consulting because it was way too big to just do as a kind of one-off project. So we decided it's not our problem. You need to deal with it. We hand off, we deploy your models for you, and that's it.
And then after a while, since it was really a pattern, we decided that if we can make it our problem, we can probably actually create a great company and a great product. There's also a kind of lucky coincidence that we had a mentor, and we still have a mentor, that built a unicorn. And, basically, he kept on pushing us to actually jump into product already; there's no point in staying in consulting. So this was kind of like a perfect storm of opportunity, the right mentor, and the right pull from the market, because there was actual pull from the market. And then we decided to ditch the consultancy and basically go all in with NannyML, raise funds, and build the product.
[00:04:27] Unknown:
On the landing page and some of the marketing material around NannyML, it brings up this concept of post deployment data science, which is what you were saying about how you build these models, you hand them over to your clients, and then they say, what do I do with it now? And I'm wondering if you can give your definition of what that term encapsulates and some of the ways that it differs from the kind of metrics and monitoring approach that a lot of people have been falling into because it's something that they're familiar with from application or infrastructure monitoring?
[00:04:55] Unknown:
It's basically exactly what you said, that there is right now this kind of default approach to MLOps where it's all about engineering. And what we noticed is that data science has always been about providing business value and finding insights in data that allow you to bring some value, extract some insights. And this, for the time being, most of the time stops the moment you finish developing your model. And we think that extending this concept, of trying to bring insights and trying to ensure the value is there, to everything that happens after deployment is really the key to making sure that your models keep on delivering value. That's kind of fluffy.
But to be a bit more specific, what we mean by post deployment data science is doing things like monitoring, of course, but from the performance perspective. So we want to make sure that the performance of your models stays high. And if something goes wrong, then we employ data science tools to find out what went wrong. So we have data analysis. We are actually doing machine learning on top of your machine learning models to figure out what went wrong and then how you can make sure that your model is fixed. So it's not from the engineering perspective, the uptime, the throughput, whether the model pings, but looking at all kinds of silent failures that can happen and then trying to analyze, using data science tools, what went wrong and how we can fix it. That's basically
[00:06:22] Unknown:
it. As far as the current state of affairs where you have data scientists and ML engineers building the models and then handing them over to operations teams to manage the deployment and upkeep, I'm wondering who you typically see as being responsible for that work of the care and feeding of the model after it has, you know, gone past the training phase and you have that artifact to be able to put into production. And I'm wondering how you think about the design of NannyML for being able to help those people augment their skills and also being able to loop those data scientists and ML engineers into that overall workflow?
[00:07:00] Unknown:
Yeah. So that is something that we're honestly still kind of figuring out. We've identified three main personas, or three main roles, that could benefit from NannyML. The first one, and the kind of people that are most interested in NannyML, is actually heads of data science, because these are the people who really answer to upper level management for the actual performance and the impact of their algorithms. So that's one thing. They need to have a tool like NannyML to make sure that the algorithms actually deliver value, but they will not be the ones using it day to day. They have the need, but they are not the end user. And then the end user we see is, in a sense, maybe not equally split, but split between data scientists and ML engineers.
And so far, we see that it's more data scientists that want to make sure that the models they developed keep on working. So they still feel the ownership: this is my model, I want to make sure that it keeps on delivering value. And, also, their skill set is more suited for this kind of monitoring. Because, again, post deployment data science is not so much about engineering, but actually understanding what's going on. But, of course, when it comes to deploying NannyML to production to make sure it can provide the insights on day to day operation, this is something that will be handled by ML engineers. So in a way, it's similar to the models themselves: you need an ML engineer to deploy them and to deploy NannyML to monitor your model. And then you need the data scientists to actually look at the data, analyze the results, and see, if something went wrong, how they can make it right.
[00:08:36] Unknown:
Given the fact that the data scientists are interested in being involved in the continued care and feeding of the models and understanding more about how the models are operating past the training cycle, I'm wondering what have been some of the points of friction that they've experienced previously that maybe prevent them from being able to be more involved in that process, and that has led the current state of the industry more towards this, just, you know, collect some metrics and try and figure out, you know, has this crossed the threshold of whether or not you need to care about it? So I think that
[00:09:10] Unknown:
the biggest problem is really that people had assumed that it's unsolvable. And it's something that has been a problem for as long as machine learning in production has existed: models fail. But from a lot of our early design partners and early users, we heard that, yeah, that's how it is. This is just a reality of life as a data scientist. So the first thing is really awareness that you can actually make sure that your models work well and, when they fail, find it faster than you normally would. So that's the first thing. Another thing is we are still extremely early on when it comes to adoption of data science. We're kind of in a bubble where, you know, we work with cool companies that actually work with machine learning. But when you look at the general market, there is a lot of prototyping, and I think we were just simply too early. You first need to put models in production to monitor them. So I think right now, we're kind of crossing this threshold where data science is, as they say, crossing the chasm. Models are actually making their way to production, and people start to realize that the problem exists and the problem is solvable.
And when it comes to the friction points, or maybe the examples of what prevented them: first, especially in big organizations, it's on the political level, when there's a different person responsible, or even different departments responsible, for deploying models, maintaining them, and making sure there is an impact. And that, I think, is something that will change in the future, when there's maybe even going to be a new role for a person who actually makes sure, like a post deployment data scientist or production data scientist, the person who makes sure that these models keep on delivering value after they've been deployed.
[00:10:57] Unknown:
As far as just the overall state of data science and the health of the models that are being built and deployed, I'm wondering what you see as the long term effects of having the data scientists more involved in the post deployment stages of the model development, and the ways that gaining a better understanding of how the model actually operates in the real world allows them to bring that learning back into the process of developing the models in the first place, or developing successive models, how to learn more about the real world impacts of machine learning, you know, both the model on the real world and the real world on the model, and some of the ways that that will feed into the ongoing research and development, both in companies and in academia?
[00:11:46] Unknown:
That is actually a very insightful question. And like you mentioned, there's definitely this feedback loop between the real world and the model itself. If you deploy, let's say, a churn model and the model works well, it reduces churn, which will also impact the population. So we're gonna have some kind of drift there already, which might mean, for example, that it looks like the model is going to shit, but in reality, the model is still working fine and it's just the population changing. So one thing that the data scientists working on these models will definitely understand better is how the real world interacts with the model and that it's a two-way interaction. It's not just that the model interacts with the world and changes the world, but also the other way around, because the population will change, and if you retrain the model, the model is also going to be significantly different.
Another thing is I think people will understand robustness much better. Right now, you can spend weeks of compute optimizing your hyperparameters. And most people know that they shouldn't evaluate things on the test set more than once, but they still do it. They just keep on looking for, you know, the set of hyperparameters that performs best on the test set instead of the validation set. And they know that this idea of a holdout is needed, but they don't really take it to the extremes I think they should. And when you deploy models, you will start realizing that things are not as robust as expected, especially if you spend a lot of time optimizing your models and you're just getting good performance due to luck and due to just trying things out and actually overfitting in some more subtle ways. And I think awareness of that will increase as you see that when you deploy your models, performance drops compared to your test set, even though it shouldn't. And people are still asking why and finding the answers to that.
[00:13:39] Unknown:
More on the monitoring and alerting side of things. I'm wondering what are some of the ways that you have experienced model failures and some of the particular pain that you've encountered in terms of the available tooling, the visibility or lack thereof, and some of the ways that that has motivated you to want to invest the time and energy on solving this for everyone?
[00:14:00] Unknown:
So as I mentioned, the moment when we became aware of the problem was during our consulting days. And one very interesting thing that happened is that we were developing a segmentation model that would remove the background from images of industrial tools, and they would be put in a catalog. So basically, an automation use case. And interestingly enough, somewhere throughout the process, as the model was being deployed, they changed the cameras. And the model really went to shit. And it's something that, of course, we couldn't have predicted, but it made us realize that things change in the real world. And as they do, the models will not perform well and failure is inevitable. It's just part of life, things change, and we need to be able to monitor that. Another interesting use case that we stumbled upon a bit more recently is in wind energy, where what you see is that there's always covariate shift, so change in your model inputs, also known as data drift.
And there, it really, really impacts the performance. There is no change in the patterns because it's all based in physics, so there's no concept drift. But a change in just how the data is distributed will have a huge impact on performance. This is something that happens seasonally, and you should know whether you should be worried about it or whether it's just normal behavior. So just monitoring how your model is performing is not even enough, because maybe this is something that's expected. Maybe as the weather changes, there are more gusts of wind. If there are more gusts of wind, of course, your model will not predict the power production as well as in periods of low wind. So that's another use case. And then when it comes to credit scoring, one thing we noticed is that a lot of credit scoring companies, and banks as well, have this problem: they deploy a model and they need to wait for a year, sometimes two years, before they really get the production targets and can actually evaluate the performance of their models. So they're kind of flying blind for a year or two years, and they are using their models in basically the most important use case in their business.
So there's a huge need there to estimate performance in some way even without ground truth. So these three things are really what made us think of NannyML and the shape it has taken right now: you cannot just simply monitor performance, because it's not enough, and sometimes monitoring performance is not available, so you need to estimate performance instead. And looking at data drift is important from the performance perspective, but not for the sake of data drift itself, because that is something that will always happen, and it's not necessarily bad if your models generalize well. Yeah. And to that point of data drift being the
[00:16:45] Unknown:
kind of most frequently brought up case of issues with model performance, or things that you have to watch out for after you get a model into production, I'm wondering what are some of the other, more insidious or less understood or less observed, ways that models can start to go wrong without you noticing, based on maybe, you know, ROC or F1 scores, or just some of the ways that the model can still start giving you bad outputs without you necessarily noticing it. And, you know, conversely, how tracking these high level metrics, or these specific metrics that people are able to observe more directly, contributes to issues of alert fatigue in ML teams?
[00:17:32] Unknown:
That is a very loaded question, so I'm gonna answer it part by part. So first, starting with what can contribute to model failure. Let's assume that we cannot simply measure performance, as is the case in most of these situations, because there are delays, or in automation use cases, we don't have full ground truth. So just measuring F1 or whatever metric you choose is not easily doable, or not doable at all. And then there are basically two main reasons why models can fail. The first one is data drift, or covariate shift. So the change in the joint model input distribution, to be very technical here. So we're looking at all the features, and their joint distribution changes. If that happens, it might mean that your data is going to be drifting to regions where it's harder for the model to make its predictions.
Maybe because the model was not trained in this region, or maybe it's a region that, let's say, is very close to the class boundary, so it's just harder to make the correct prediction. Then you will see that the performance of your model is decreasing. And this is something we can actually fully estimate, the impact of covariate shift on performance, with NannyML, which is one of the cool unique features we have. And then the second reason why models can fail is concept drift. And concept drift is a change in the pattern, or mapping, or function, however you want to call it, between the model inputs and the target.
So it's the probability of y given x. And that means that, basically, the pattern that your model has learned during training is no longer valid, so we will see a drop in performance. So these are kind of the two ways that models can fail. Now looking at how they can fail without you even being able to tell by just looking at the metric: one thing is, imagine you have data drift to regions that are less uncertain, so the model performance should increase. But at the same time, you have concept drift, so the pattern changes, so the performance does not increase.
And that means that even if your F1 stays the same, in reality, the model is worse. It's not predicting as well as it could be. And this is something you would not be able to see unless you can estimate the impact of data drift on performance. If you do, you will see that given this data drift, the model should be performing better, but it's not. So something's off, and this off part is the concept drift. So that's one of the really insidious kinds of silent failure. And the inverse is also true. Like I mentioned when it comes to wind energy, it might be that you have periods when the model performance drops.
But there's still nothing to worry about, because there is just a momentary drift to regions that are harder to predict, like gusts of wind because the wind blows from the mountains. Yeah. I think that's enough talking for one question.
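The distinction Wojtek draws here, covariate shift changing p(x) versus concept drift changing p(y|x), can be made concrete with a small simulation. This is not from the episode, just a hedged toy sketch: the data generator, the logistic model, and the shift sizes are arbitrary illustrations of the two failure modes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, x_mean=0.0, flip_pattern=False):
    # p(y|x) is a logistic function of the inputs; flip_pattern simulates
    # concept drift by changing that mapping, while x_mean shifts p(x) only.
    x = rng.normal(loc=x_mean, scale=1.0, size=(n, 2))
    logits = 2.0 * x[:, 0] + x[:, 1]
    if flip_pattern:
        logits = -logits  # the mapping the model learned no longer holds
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return x, y

# Reference period: train and evaluate here.
X_ref, y_ref = make_data(20_000)
model = LogisticRegression().fit(X_ref, y_ref)
print("reference accuracy:", accuracy_score(y_ref, model.predict(X_ref)))

# Covariate shift only: p(x) moves away from the class boundary, p(y|x) is
# unchanged, so realized accuracy typically goes up despite the drift.
X_cov, y_cov = make_data(20_000, x_mean=1.5)
print("covariate shift accuracy:", accuracy_score(y_cov, model.predict(X_cov)))

# Concept drift: p(x) is unchanged, but p(y|x) flipped, so the learned
# pattern is invalid and accuracy collapses even though no feature "drifted".
X_con, y_con = make_data(20_000, flip_pattern=True)
print("concept drift accuracy:", accuracy_score(y_con, model.predict(X_con)))
```

If both shifts happen at once, the two effects can roughly cancel in the headline metric, which is exactly the silent failure described above.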
[00:20:24] Unknown:
And in terms of the detection capabilities that you're bringing into NannyML, you mentioned that one of the ways that you're able to identify some of these failure modes that are often silent if you're just tracking, you know, bare metrics, is the fact that you need to be able to say, based on this prediction and the input data that I'm seeing, I'm assuming that this is going to be the performance that I see, but because the distribution of the data is different than what the model was trained against, the concept drift is such that it's actually degrading the performance despite my prediction that it will get better. And I'm wondering what are some of the other ways that that predictive capability that you're bringing into NannyML is able to highlight some of these failure modes that are kind of flying under the radar, and some of the broader impacts on confidence in ML in the organization, just some of the broader impacts that these silent failures have when they come to light, and some of the ways that this greater visibility that NannyML brings allows ML teams to build up better trust in the organization?
[00:21:32] Unknown:
That's another loaded question. So, again, I'm gonna start with the first part, and I'll try to combine it with the alert fatigue that you mentioned. So one of the things that happens a lot is that people, when they do monitoring, look at every single feature kind of in a void. And if your model has 50 features, something is basically guaranteed to drift. And just because a drift is significant from a statistical perspective, it doesn't mean it's significant from a model monitoring perspective. You could see a very strong drift in a feature that's mostly irrelevant, and the model is very robust against changes there, and everything is gonna be fine. That's why it's so important to try to figure out what the impact of data drift on performance is.
Another thing is that these tests, if you do some kind of statistical test or you measure the difference, some kind of distance between two distributions, they assume that all features are independent. They don't track the change in the correlations or relationships between the features. So they kind of don't think about this joint part of the distribution. And for that, we developed multivariate data drift capabilities, where you basically train, let's say, an autoencoder, for the ease of explanation, on data for which you know everything's fine. And then you see how well this autoencoder performs on the dataset in question, your analysis dataset.
And if you see a drop in performance, it means that the internal geometry of the dataset has changed. So there's data drift. And this is a way to capture data drift that would not be captured by the tools that just look at everything from a univariate perspective. Now for the second part of your question, which is how monitoring, how silent failures, can impact the trust in machine learning. I think right now, because the failures are silent, all the failures are unexpected, and they are seen as something that was not planned, something that should never happen, it's all your fault.
So what monitoring can really bring is, first, increased trust that if, let's say, the head of data science is reporting that the models are performing well, they actually are performing well, because there's much more knowledge in the company. And another thing is bringing more understanding that failures will happen, they will be resolved fast, and they're just part of normal operating procedure. Just like you see right now, you need to maintain your physical equipment, and physical equipment will have to be out of operation for a while. The same is true for machine learning models. Things will happen, and the model is not performing as well as it used to, and it's gonna take us, let's say, two days to fix it. That is something that should be normal in companies, but now every time there is an issue, everyone's ringing alarm bells that, no, you cannot trust machine learning, it's not reliable.
It's reliable. It's just not perfectly reliable. And if we quantify this reliability, that will bring more trust.
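NannyML's multivariate drift detection is built on the reconstruction-error idea Wojtek describes (its released implementation uses PCA rather than an autoencoder). Below is a minimal, hedged sketch of the concept using scikit-learn; the data, component count, and thresholds are illustrative only, and this is not NannyML's actual code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def fit_reconstructor(reference: np.ndarray, n_components: int):
    # Fit the "compressor" only on a period known to be healthy.
    scaler = StandardScaler().fit(reference)
    pca = PCA(n_components=n_components).fit(scaler.transform(reference))
    return scaler, pca

def reconstruction_error(scaler, pca, chunk: np.ndarray) -> float:
    # Compress and decompress a chunk; the average error measures how far the
    # chunk sits from the structure learned on the reference period.
    z = scaler.transform(chunk)
    recon = pca.inverse_transform(pca.transform(z))
    return float(np.sqrt(((z - recon) ** 2).sum(axis=1)).mean())

rng = np.random.default_rng(1)
# Reference data: two strongly correlated features.
reference = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5_000)
scaler, pca = fit_reconstructor(reference, n_components=1)
print("baseline error:", reconstruction_error(scaler, pca, reference))

# Drifted data: identical marginals, but the correlation is gone. Univariate
# tests see nothing unusual; the reconstruction error jumps.
drifted = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=5_000)
print("drifted error:", reconstruction_error(scaler, pca, drifted))
```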
[00:24:29] Unknown:
Another aspect of the question of alert fatigue is that if you do get an alert, you don't necessarily know what to do about it, where a lot of the conversation that you see in the ML space is, oh, if you see concept drift, then that just means that you need to go back and retrain your model based on the new data that you have and redeploy it, and everything will be great. I'm curious what you have seen as some of the ways that teams who are using NannyML, or just this better understanding of what is happening with the model, allows them to think about what actions can and should be taken based on the actual type of failure that you're seeing, and the overall question of what are the remediation actions that can and should be taken when you do see that there is some problem that is manifesting?
[00:25:16] Unknown:
So when I think about monitoring, I think there are kind of three parts to it. The first one is detecting failure. The second one is realizing why the failure happened, and the third one is resolving the failure. And right now, with NannyML, we basically provide the first and the second. And the third one is, for lack of a better word, manual. Because what we realized is that it really is extremely use case dependent. The first thing you should do, your default action, should be to retrain, but instead of redeploying immediately, you should either run an A/B test, and this is something that we saw some people doing, or, and that is a very interesting case, you retrain it, but you retrain just on part of the drifted data, and you see how the model performs on the rest of the drifted data. So you do kind of a phantom deployment.
And then if everything's fine, you can redeploy. If not, you go back and basically do some kind of root cause analysis, which you can do with the data drift capabilities, both univariate and multivariate. And then also, you know, tap into your domain knowledge to see what changes in data quality upstream might have caused the failure, what changes on the business side of things might have caused the failure. Maybe there is a new marketing campaign, and this marketing campaign completely fucked up our recommenders. That's something that might have happened, but a tool like NannyML has no way of seeing that because it's completely external to the machine learning system itself. So there is always a human component there, because you're basically like an investigator trying to figure out why things happened.
And then once you know the full picture of why, it's a matter of retraining or redeveloping your model, maybe with a different architecture, maybe by adjusting the data. So if you have a strong data drift, maybe you should adjust your old data, apply some kind of shift there, to make it match the current distribution, because the model will just get confused otherwise, trying to learn two different patterns at the same time. Yeah. I guess that's it.
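The retrain-on-part-of-the-drifted-data idea can be sketched roughly as follows. This is a hedged illustration, not NannyML functionality: `phantom_evaluate`, the column names, and the use of F1 as the deciding metric are all hypothetical choices, and it assumes labels for the drifted period eventually become available.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import f1_score

def phantom_evaluate(champion, drifted: pd.DataFrame, feature_cols, target_col="y"):
    # Retrain a challenger on the first half of the drifted period only.
    split = len(drifted) // 2
    fit_part, holdout = drifted.iloc[:split], drifted.iloc[split:]
    challenger = clone(champion).fit(fit_part[feature_cols], fit_part[target_col])

    # "Phantom deployment": score the rest of the drifted data with both models
    # and compare, without touching the production endpoint.
    scores = {
        "champion": f1_score(holdout[target_col], champion.predict(holdout[feature_cols])),
        "challenger": f1_score(holdout[target_col], challenger.predict(holdout[feature_cols])),
    }
    winner = challenger if scores["challenger"] > scores["champion"] else champion
    return winner, scores
```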
[00:27:16] Unknown:
Another interesting kind of roadblock to the question of just retrain and redeploy is that sometimes, for a particular model instance, that retrain cycle might take, you know, several hours or days. And I'm wondering how that factors into the decision about, okay, what do I do in this instance? Like, is it typically a case of, this is an urgent failure and I need to be able to resolve this immediately? Or are the failure modes...
[00:27:55] Unknown:
Oh, yeah. That's another interesting question. And you mentioned that it can take, you know, days. But if you look at the wider organization in some banks, it can actually take months, because you need to submit a request to retrain a model, it has to be approved by the compliance team, and it really takes ages. One of our data scientists at NannyML used to work at one of the more advanced banks. And even there, just because the regulation is so heavy, you cannot simply retrain your model. You need to go through steps with multiple different teams and multiple different external validators before you can retrain your model. So it's at least a month.
And I think that is also why knowing whether you actually have to retrain your model, or whether it's just nice to have, is important. So, again, estimating the performance and monitoring the performance is really the key here. And from there, you can see what the impact is: assess the severity, and assess also whether it's a long term impact or just a passing impact due to seasonality. So again: data drift, the impact of data drift on performance, estimating performance, and measuring performance. Oh, and here there is another thing at play, which is the uncertainty estimation of the performance itself.
This is something that we will be adding to the library, I think two weeks from now, when we'll also be giving uncertainty estimates on our estimated and calculated performance. So you will not just know that, let's say, ROC AUC is right now 0.85, but 0.85 plus or minus 0.06. And if you see that you have a strong drop in performance, but the uncertainty is also really high, maybe you just got unlucky with the data, and you should keep on going as it is. You don't have to retrain. Maybe you do, but let's wait and see. So if you combine these uncertainty estimates with the data drift and performance estimation, you should have a full picture of whether you need to retrain or not.
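The uncertainty estimates described here had not shipped at the time of recording, so the snippet below is not NannyML's method, just a generic hedged illustration of the idea: bootstrap the evaluation sample to put an interval around a realized ROC AUC, so a drop can be judged against how noisy the metric is at that data volume.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
    # Resample the evaluation data with replacement and recompute the metric,
    # giving a rough interval that reflects sampling noise at this volume.
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC needs both classes present in the resample
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    low, high = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), (float(low), float(high))

# Usage: mean_auc, (low, high) = bootstrap_auc(labels, predicted_probabilities)
# A "drop" that stays inside the interval may just be noise, not a failure.
```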
[00:29:54] Unknown:
This might be going a little too far afield and too far into the systems design aspect of it. But in the case where you do have a model and it is seeing critical errors in the ways that it is generating predictions, I'm wondering what are some of the application or systems engineering patterns that you've seen for people to be able to include some sort of a circuit breaker, where it says, okay, this model is too far off the rails, we can't trust it, retraining it is gonna take too long, so we need to have some sort of fallback mechanism to be able to say, you know, this capability is no longer functional right now, check back later. Or, you know, we'll just go with a more statistical, heuristic approach that doesn't rely on this more advanced model capability. So you have kind of a graceful degradation of capabilities, but the model is no longer in the loop until such time as you can fix the underlying problem that it's experiencing.
[00:30:45] Unknown:
To be honest, I think that's a step too far, not necessarily for us here, but for the reality of the market. I don't think we're there yet. Maybe there's a handful of companies that already have those capabilities. I haven't spoken to anyone that actually already had a full blown system for how to handle failure. Most of the companies that do monitoring do it manually. Everything is manual. All decisions are basically taken on the spot, and I think we're quite early when it comes to that. One thing I saw is that there is an off switch. If the failure is really critical, to the point where instead of bringing value it's detrimental to the company, then the first thing you do is you just turn it off.
Or you still run predictions, but you don't actually act on them, so nothing that would actually have an impact. So the model is just running in the void, and then you start resolving the problem. And once the problem is resolved and you confirm that the failure is no longer there, you can deploy the new model.
[00:31:44] Unknown:
That's basically the only thing I know of. Yeah. It's definitely an interesting thing to think about, like all the different ways that things can go wrong and how to resolve them. And at a certain point, you just have to say, I can't predict it. It is what it is, and we'll figure it out when we get there. Yep. And so in terms of the NannyML project, can you talk through how it's implemented and some of the overall design questions that you had to address in terms of the technical and user experience interactions and how to be able to fit this into people's machine learning workflows?
[00:32:17] Unknown:
From a high level perspective, we decided to go for a Python open source library because, let's be honest, almost all data science right now is done in Python. People want to just pip install things, or conda, or whatever your preferred method is. So that's one thing. When it comes to the actual implementation, we basically have a default structure, a default workflow, for NannyML itself. First, you plug in your data, and your data comes in two parts: your reference dataset, for which you know everything is fine, and your analysis dataset, which is the one you'd like to analyze. And this kind of mimics the manual monitoring process where, from time to time, let's say every week, every two weeks, however often you need to do it, you will manually run NannyML and see what's going on there. And then the flow is that you get your data via Parquet files or whatever you want, or via a microservice that provides the data.
You see what's going on, starting with performance calculation or estimation, depending on whether targets are available or not. If everything's fine, good. If things are not fine, we need to go into data drift, and you do multivariate and univariate data drift detection. And then we have a, honestly, pretty minimal capability to link data drift to performance, so you'll be able to see what the potential reasons for the dropping performance are. And from then on, we go into the manual quest of resolving the issues. That's kind of one workflow. The other workflow is when you deploy it to production.
In that case, NannyML is running via Airflow or some other kind of service. We just released our CLI, so right now it's much easier to just run NannyML as part of your ML system, and you get your alerts. And you have, let's say, a script or notebook where NannyML is running, where you can just view an updated dashboard, however often you want to run NannyML. It can be every 10 minutes, it can be every 5 minutes, basically as often as you want. There is one thing that I generally caution about, which is trying to do streaming with monitoring.
And the thing is that you can obviously use NannyML in a streaming fashion, but you need to have a certain volume of data before you can say that the performance has degraded. Degradation is an inherently probabilistic process. So if you just get one new data point, it really does not make sense to run the whole analysis again, because things cannot have changed that much. And even if they did, you would not be able to tell from just one point, and also the algorithms cannot take that. So what I would recommend is basically running batches every few minutes, or every few hours, however often you need, once you have enough volume of data. And then if there are alerts, the rest of the process triggers, and that can either automatically trigger retraining, or, in advanced organizations that deal with automation use cases, it could trigger manual spot checking.
So then you basically create a task and send a signal to the team that does spot checking, so they manually reevaluate whether the performance actually has decreased. And from then on, you can take actions to remediate.
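A rough sketch of that batch flow, written against the NannyML Python API as documented around the time of this episode, might look like the following. Argument names and result accessors have changed across releases (newer versions, for example, require extra arguments such as the problem type), and the file and column names here are assumptions about how your prediction log is laid out, so treat this as illustrative rather than copy-paste ready.

```python
import pandas as pd
import nannyml as nml

# Reference period: data you trust (e.g. validation set or early production).
reference = pd.read_parquet("reference.parquet")
# Analysis period: the latest production batch to be checked.
analysis = pd.read_parquet("last_week.parquet")

feature_columns = [c for c in reference.columns
                   if c not in ("y_pred", "y_pred_proba", "y_true", "timestamp")]

# Estimate performance without waiting for targets (CBPE), since labels are delayed.
estimator = nml.CBPE(
    y_pred="y_pred",
    y_pred_proba="y_pred_proba",
    y_true="y_true",
    timestamp_column_name="timestamp",
    metrics=["roc_auc"],
    chunk_period="W",  # weekly chunks; pick a size with enough volume per chunk
)
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# If estimated performance alerts, dig into drift: multivariate first, then univariate.
drift_calculator = nml.DataReconstructionDriftCalculator(
    column_names=feature_columns,
    timestamp_column_name="timestamp",
    chunk_period="W",
)
drift_calculator.fit(reference)
multivariate_drift = drift_calculator.calculate(analysis)

# Both result objects can be plotted or exported to data frames for alerting;
# the exact accessor names depend on the installed NannyML version.
```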
[00:35:34] Unknown:
In the process of building NannyML and starting to work with some of the end users of it, I'm wondering what are some of the initial concepts or ideas that you had about how to approach the problem or how to integrate it into people's systems that you had to reconsider or reevaluate as you started getting it in front of people with real world use cases?
[00:35:55] Unknown:
Yeah. So our first idea was to basically market it towards business people, because that is kind of the end consumer. And while the idea is sound in principle, in reality it really did not work, because they just don't understand machine learning at all. That's why we switched to data scientists and heads of data science. And the other thing is that we started with a dashboard; we thought people would want to view it as a dashboard. What we learned from early user interviews and from early design partners is that they actually don't care that much for a dashboard. Like, there are certainly some companies and organizations where they want it, but they care about a dashboard as kind of a second step, because they deploy it themselves via, literally, Jupyter Notebooks.
They view things themselves there, everything's working fine or not fine, and then if there is a problem or they need to communicate it to a business stakeholder, that's when the dashboards come in. So then we had to completely change the priorities. We dropped the dashboard and focused on usability within Jupyter Notebooks. And really, the entire question of UX is: I want to run things fast using an interface that I know. So we often default to how scikit-learn is doing it, or, when it comes to plotting, how Seaborn is doing it. We look there, and that's our default. And unless we have a very strong reason to change the interface, we basically copy them, because they did it well, people know how to use them, and they are happy with these libraries.
So I guess that is one big thing we needed to reevaluate. Another one is that we had assumed that the idea of estimating the performance without the ground truth is something that would click in users' heads. But we realized it's not as easy, and we had to do a lot of nudging, both in our documentation and in the visualizations themselves, to show that we are estimating this performance; it's not the realized performance, we are estimating what is likely to have happened due to data drift. And this is still something of an open question, because for a lot of people, it takes a while to even realize that it's possible to estimate the performance of a machine learning model. And it's definitely not an intuitive idea.
[00:38:08] Unknown:
Another interesting challenge in the ML space is that you need to be able to support a fairly large variety of different frameworks and data types and model types and model architectures. And I'm wondering what were the initial target capabilities and initial focus that you had for that kind of matrix of use cases, and the ways that you have thought about the initial foundational aspects of building NannyML to allow you to then branch out into, you know, maybe going from tabular data to unstructured data or image data, etcetera, or, you know, the different frameworks that you wanted to support or model architectures that you wanted to be able to understand how to generate these predictions for?
[00:38:56] Unknown:
So that's something we actually thought a lot about. And what we realized is that, from the use case perspective, there's basically one huge difference: whether the ground truth is delayed or not. If we work with cases where the ground truth is not really delayed, so high frequency trading or delivery time prediction, where it's gonna take you half an hour to get your targets, that is something that should be monitored in a completely different way than, let's say, loan default prediction for mortgages, where you wait years until you know your target.
Then on the model and framework side, we made the decision very early on, also because, honestly, we got lucky with the research, that we're gonna be fully agnostic when it comes to everything there. So we don't actually need the model. NannyML does not need the model file and does not need to know the framework. You just work with data. That was a very committal decision that we took in the beginning, that NannyML should always be able to work with just data. So then, by default, we actually support all kinds of models, because we don't need to look at them, and all kinds of frameworks, because we don't need to use them. So we managed to kind of sidestep the problem, and so far, it's working quite well. Now when it comes to the actual data types, that is, I would say, the biggest challenge. That's not something that we're able to sidestep.
We decided to focus on tabular data, because that's where, so far, we've seen the value is in companies, and also this is the type of data that is most prone to failure. Like I always say, a horse is a horse. It's not gonna stop being a horse anytime soon. So if you look at image data, it's less likely to change. But, of course, as I already mentioned, sometimes the cameras change. One interesting example we got from a design partner is that they were doing detection of COVID based on x-ray images. And there was a new strain, and the model failed because the new strain actually showed itself in the x-ray images differently. So, of course, there's drift there, but we decided to focus on churn use cases, credit default scoring, upselling, cross selling, all this boring AI, because these models are actually in production, and if they fail, that's a huge problem.
And now moving on to image and text data. We are right now actually running a small internal project to see what changes we need to make to support them natively. And what we found out is that you can already use NannyML for text and image data, but instead of working with the raw data, we need to go to the embedding layer. And there, you can basically treat it as tabular data, and performance estimation works well, multivariate drift detection works well. Univariate drift detection returns bullshit, because it just gets pixel values, or you get, you know, vector dimension 17 has changed. Yeah. That's not interpretable and it's not very useful.
So we don't have the interpretability there, but the core capability of drift detection and performance estimation is still there.
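The embedding-layer approach described here can be sketched with a few lines of scikit-learn. This is a hedged illustration, not NannyML's native support: `embed` is a hypothetical stand-in for whatever encoder you already use (a sentence encoder, your model's penultimate layer), and the reconstruction-error detector mirrors the earlier multivariate sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: in practice this would be your own encoder producing
    # fixed-length vectors. Here: fake 16-d vectors for illustration only.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 16))

def embedding_drift_score(reference_texts, analysis_texts, n_components=8):
    # Treat the embedding matrices as tabular data and compare how well a
    # compressor fitted on the reference period reconstructs each set.
    ref, ana = embed(reference_texts), embed(analysis_texts)
    pca = PCA(n_components=n_components).fit(ref)

    def err(x: np.ndarray) -> float:
        return float(np.linalg.norm(x - pca.inverse_transform(pca.transform(x)), axis=1).mean())

    # A ratio well above 1 suggests the analysis embeddings have drifted.
    return err(ana) / err(ref)
```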
[00:41:57] Unknown:
As far as being able to integrate NannyML into an ML system, what are some of the foundational systems and capabilities that are necessary to be in place to be able to handle feeding the data into it, generating the predictions, and being able to capture the outputs of NannyML and integrate that with the other monitoring and alerting systems that can and should be in place?
[00:42:24] Unknown:
So our default kind of workflow is you have your model deployed as a microservice. So the first hard requirement is that your model is actually in production. People sometimes say that it is in production, but it's actually not in production, or it's in production, but nobody's acting on the predictions. So one important thing is you should implement NannyML as you deploy your models or after you deploy your models; it's a tool for post deployment data science. So that's one kind of obvious thing, but it's sometimes less obvious than it seems. And then we basically assume that you have a microservice architecture, where you have a model deployed somewhere that is scheduled via Airflow or whatever you want. And then what you want to do is spin up a Docker container, let's say, or containerize your microservice, with NannyML.
And what you feed to NannyML is all the inputs that come to the model and all the outputs from the model. Plus, if you have access to this data, also the target data, the production target data. On top of that, you will need some period for which you know everything went fine. This can be your validation set, or you could just get some historical production data from when everything was fine, let's say the first month of production. And then, if you do batch processing, you can run NannyML at the same time as you query the model. The exact same data you first run through the model, and then you feed the inputs and outputs to NannyML. And then you need to have a way to display it. The default, following the Jupyter Notebook flow, is to just have NannyML running; it displays the data, you view it in your browser, and that's it. So there is not that much to be done on this front.
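As a concrete, and entirely hypothetical, sketch of that data contract: the monitor only needs the model's inputs, its outputs, timestamps, and, whenever they arrive, the targets, joined on whatever request identifier the serving layer logs. The column and function names below are made up for illustration.

```python
from typing import Optional
import pandas as pd

def build_monitoring_frame(inputs: pd.DataFrame,
                           predictions: pd.DataFrame,
                           targets: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    # inputs:      one row per scored request (features + request_id + timestamp)
    # predictions: request_id, y_pred, y_pred_proba as logged by the serving layer
    # targets:     request_id, y_true, possibly arriving days or months later
    frame = inputs.merge(predictions, on="request_id", how="inner")
    if targets is not None:
        # Left join: rows stay NaN in y_true until the labels eventually land.
        frame = frame.merge(targets, on="request_id", how="left")
    return frame.sort_values("timestamp")

# The reference period is just a slice of this same frame that you trust,
# e.g. the validation set or the first month of production, used to fit the
# monitor before analyzing newer batches.
```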
[00:44:09] Unknown:
As far as the post deployment data science aspect of it, we discussed this a little bit, but what are some of the ways that the outputs of NannyML can get fed into maybe a partner model that sits, you know, alongside NannyML and feeds back into the model that you're monitoring, or just, you know, some ways that the information that NannyML surfaces can be used to build additional models to then serve as kind of automated checks or automated remediation flows for the model that you're actually putting into production and is generating the value?
[00:44:46] Unknown:
The easiest way is basically the alerts. We have alerts that are returned both in visual form and as a data frame. And what you will see there is that if you have an alert on performance, which can be set with a custom threshold or a default threshold, that is the easiest thing to act on. So basically, you can look at alerts on performance, alerts on data drift, and alerts on specific features. And then you can use that information to first trigger automated retraining. This is kind of the minimal fully automated loop. You automatically retrain, you deploy the model via phantom deployment or A/B testing, and then you pass it through NannyML again to see whether the issue has been resolved. If the issue has been resolved, with this feedback loop, then the model gets deployed to production.
So that's kind of the default remediation technique. And then you could see another flow if the issue is not resolved, so retraining is not sufficient. You could pass what has been tried, plus all the other analysis on performance and data drift, to a data scientist, who could basically start with that and do an EDA there to figure out what went wrong. And then you go back to the same process, where you deploy the model via phantom deployment or A/B testing, run it through NannyML again, and see if the issue is resolved. If the issue is resolved, you deploy the model. So there is this kind of feedback loop approach, where you can use NannyML to both alert you that something has gone wrong and confirm that the issue is resolved.
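A hedged sketch of that alert-driven loop, assuming the monitoring results can be exported as a data frame with a per-chunk boolean alert column (roughly how NannyML exposes its results, though the exact column names vary by version); the retrain and notify hooks are hypothetical placeholders for your own pipeline.

```python
import pandas as pd

def act_on_alerts(results: pd.DataFrame, retrain_fn, notify_fn) -> None:
    # Expecting one row per chunk/metric with a boolean 'alert' column; the
    # column names are assumptions, adapt them to what your version exports.
    performance_alerts = results[(results["metric"] == "roc_auc") & results["alert"]]
    if performance_alerts.empty:
        return  # nothing to do this batch

    candidate = retrain_fn()  # e.g. kick off the training pipeline, returns an identifier
    # Phantom deployment: the candidate is scored alongside the champion but not
    # acted on until a later monitoring run confirms the alert has cleared.
    notify_fn(f"{len(performance_alerts)} performance alert(s); "
              f"candidate {candidate!r} staged for shadow evaluation.")
```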
[00:46:18] Unknown:
And then in terms of the ways that you're thinking about making this project and business sustainable, I'm curious how you have thought about the kind of governance aspect of the open source project and the ways that the business you're building around that, you know, feeds into it, and just some of the division between what is freely available, what is commercial, and just the overall business model that you're targeting?
[00:46:43] Unknown:
So for the time being and for the foreseeable future, we are fully focused on open source, and it is going to be an open source led product. So the open source is always going to be the main thing. And we kinda follow the open core philosophy, where all the core algorithms that provide value, the, you know, new science that we're developing, or novel engineering at least, are always going to be open source. Right now, we're not monetizing anything. What we're focusing on is adoption first and foremost, trying to get as many people as possible to use NannyML in production in real life use cases. And based on that, we will be seeking more funding soon, soon-ish.
And we have a bit of runway yet. And in the future, there are basically two ways we plan to monetize. One is everything that has to do with an enterprise edition in terms of security, integration, privacy, helping big enterprises create workflows where NannyML can be used to provide value, just like the one I described before. And another thing is business oriented features. So trying to use additional signals from the business that are not in the dataset to help improve model monitoring. If it's possible to quantify the impact of a model on the business itself, also monitoring those metrics, so maybe ROI on the model.
A simple one in credit scoring would be, you know, risk adjusted ROI on how much money you're getting per loan. Is that changing? How is that related to data drift and model metrics? Can we also estimate that? These would be the things that almost nobody cares about except business people. So it's the perfect thing to keep outside of the open source solution, because it's something that would not drive adoption and does not drive value for the vast majority of people, but is also a good way to monetize and make it a business. Because if these capabilities are needed, they are really needed, and people are willing to pay for them in a way that, you know, really shows that it provides value.
[00:48:51] Unknown:
In your experience of building the project and working with some of your design partners, what are some of the most interesting or innovative or unexpected ways that you've seen NannyML used?
[00:49:01] Unknown:
The idea of using NannyML to A/B test models is not mine. I mentioned it a few times, but it was something that actually came from a design partner. At first, they were not even that interested in the monitoring aspect of it, but they were doing retraining. And the biggest thing is that they never knew if the automated retraining actually made sense: maybe we're actually hurting the model more than we're fixing it. And that is something that actually happened once, when there was an upstream data issue. That meant that basically the quality of, I think, two months of data was much lower than the previous months of data, and the model was retrained, and the performance dropped significantly, because, even though the data quality issue was found and resolved, the same data was still used to retrain the model. I don't know why. It just happens in enterprises.
And we were able to actually spot that with NannyML. So that is one way: to monitor automated retraining to make sure that you don't do something wrong by retraining your model. That is really the biggest one. Another one would be to look at the training data itself and understand it better, so it can be used as an analysis tool, where you basically want to see what the changes within the training data are and whether you should use all the data for training or not. So to basically see whether this data is really relevant or not. That's something that normally would only be done in a very manual way. Let's say you have the last 20 years of credit scoring data. You would know, as a person working in finance, that everything that happened before 2008 and after 2008 looks completely different.
But that's something you need to know, and there are maybe multiple different changes like that that you would not be aware of. And you can use NannyML to kind of automatically analyze these changes in the concept and the data drift within the training set,
[00:51:08] Unknown:
to figure out which data should be used and which data should not be used to train your model. In your own experience of building the project and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:21] Unknown:
One thing is that we really had no clue how to build a product, and we kind of had a clue how to build a business because we had built a consultancy. And on the product side, I was like, what's product? You engineer the thing, you do the data science, and then you release. We really did not realize that there's this entire thing called product thinking, where you need to combine what's needed with how to present it, and make sure that there is value, usability, business viability, and actual feasibility. And this was something that was completely not on my radar.
That was one big thing that I learned. I read a book called Inspired, and that is really something I would recommend to everyone: engineers, data scientists, whoever you are, if you work at a tech company, this book is really great. Another thing is the value of early user testing. At first, we decided to build one thing. We built it, and I was like, yeah, it's kind of there. It works, but not great. And after we hired a product designer, we basically switched to always testing mocks and prototypes, testing as often as possible.
And once we decide to build something, we have a much higher chance of it being the right thing. On top of that, we're also much more committed to actually building things and not abandoning them halfway through because we found out it's not something people want; by the time we build something, we already know what people want. And the last thing I would say is that adoption is much more nuanced than expected. We started with a pure business focus because that is objectively where the value of data science is, but that doesn't matter. If those people are not interested or don't understand what we're trying to do, we will not be able to bring value to them because they are not open to it.
And that's why we decided to switch our focus. Being open about who your user actually is is very important, and focusing on the people who will use NannyML immediately is more important than focusing on the person who will, in the end, get some value out of it. Working with engineers is completely different from working with data scientists. They are different people, they are driven by different things, and there's even a huge communication gap between engineers and data scientists. One thing that I noticed, and of course I'm going to make a terrible generalization here, is that data scientists are in general driven by curiosity. What makes them tick is figuring things out, and with engineers it's much less so. They are driven by building things. They want to create things.
These are completely different motivations, completely different ways of looking at the world. For me, as a data scientist, the problem ends when I figure out how to do it. Doing it is almost an afterthought; if I already know how it works, that's what I wanted. Whereas engineers don't really care that much about how things work; for them it's about building things. These two sometimes stand on seemingly opposite sides of the spectrum, and I had to learn how to recast the problem in a different light to appeal to different people.
[00:54:33] Unknown:
Yeah, it's definitely an interesting insight about the challenge of being able to motivate people because of the areas where they want to spend their time and focus and what it is that they are actually intrinsically interested in.
[00:54:43] Unknown:
Absolutely. And then you have business people, and most of them just want to have one metric and drive that metric as fast as possible. So it's yet another deep motivation, and it's all very interesting.
[00:54:56] Unknown:
For people who are deploying machine learning and have it in production, what are the cases where NannyML is the wrong choice?
[00:55:04] Unknown:
So I would say that there are a few use cases that we're just not ready for as a library, as a product. Let's say that you want to monitor it all and you work at JPMorgan Chase, and you're responsible for creating the entire platform for monitoring everything. At this point, we don't have the capability to implement NannyML for users. We wouldn't be able to assist you there, and you should probably look for a closed source solution that comes with the entire playbook of how to do it for a huge organization, from start to finish.
NannyML is inherently an open source product, which means that we're happy to help and assist with anything that has to do with implementing NannyML, or with monitoring, observability, or post-deployment data science in general, but we simply don't have the manpower and the capability to do the integration for you. So if you don't have the capability to integrate NannyML yourself, NannyML is not the right choice for you. This is true, I think, for most open source projects, and just as much for NannyML. Another thing is really big data.
We optimized our library, and it works reasonably fast. If you have, let's say, single digit terabytes of data per day, then it's a good solution for you. If we're looking at petabytes per day, it's going to crash. It's not going to work. It's not designed to work with that much data.
[00:56:32] Unknown:
As you continue to build out NannyML and iterate on the product and add new capabilities, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:56:46] Unknown:
So when it comes to the product roadmap, we always have a research roadmap as well, because we are trying to come up with new ways to solve problems that have not been solved before. The main one being, of course, performance estimation without ground truth, but also multivariate drift detection. And there, we recently figured out how to do performance estimation for regression models. So we will be releasing full support for regression in the coming weeks, to be more specific within the next four weeks. It's going to be in the library.
On the short-term horizon, as already mentioned, there is expected sampling error, so a way to measure the uncertainty of our estimates, both for data drift and for performance. A bit more long term, we will be looking into segment-level analysis: first automatically segment your data from the perspective of model monitoring and then run the entire analysis on those segments. You'll be able to find underperforming segments where maybe at the general population level the model still looks fine, but there is some segment of your data where the model is not performing well, and you should be looking into that. So, kind of fine-grained analysis. Then, roughly within the same time frame, we'll have explicit, full support for text data and for image data.
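To make those two roadmap ideas concrete, here is a rough sketch (not NannyML's implementation) of segment-level performance combined with a bootstrap estimate of sampling error; the DataFrame `scored` and its columns are hypothetical.

```python
# Sketch: per-segment ROC AUC plus a bootstrap estimate of sampling error, so a
# drop on a small segment can be judged against noise. Not NannyML's API; a
# DataFrame `scored` with columns "segment", "y_true", "y_pred_proba" is assumed.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def segment_report(scored: pd.DataFrame, n_boot: int = 200, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    rows = []
    for segment, grp in scored.groupby("segment"):
        auc = roc_auc_score(grp["y_true"], grp["y_pred_proba"])
        boots = []
        for _ in range(n_boot):
            sample = grp.sample(frac=1.0, replace=True,
                                random_state=int(rng.integers(1 << 31)))
            if sample["y_true"].nunique() == 2:      # AUC needs both classes present
                boots.append(roc_auc_score(sample["y_true"], sample["y_pred_proba"]))
        rows.append({"segment": segment, "roc_auc": auc,
                     "sampling_std": float(np.std(boots)) if boots else float("nan")})
    return pd.DataFrame(rows)
```

A segment sitting several `sampling_std` below the population-level metric is worth a closer look; a smaller gap may just be sampling noise, which is exactly the distinction the expected-sampling-error work is meant to automate.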
And then a bit more long term, we'll also be looking at concept drift detection, so ways to detect not only data drift but actual concept drift, and to link it to performance. This is something we just started the research on. But since it's a mostly unsolved problem if you don't have access to your ground truth, we are starting with the assumption that you can access your target data, and we'll be releasing support for that shortly. Concept drift detection without access to ground truth is a really open research problem.
But if we are able to figure that out, we'll be able to almost always estimate the performance of your models quite accurately, which is kind of the holy grail of ML monitoring. But that's a while from now.
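One crude way to picture what becomes possible once targets are available (again a sketch, not NannyML's method): fit the same model class on an old and a recent labeled window and compare their predictions on common probe data. If the inputs look similar but the two mappings disagree strongly, the pattern itself has likely changed.

```python
# Sketch: a rough concept-shift signal when ground truth is available.
# Train the same model class on an old and a new labeled window, then compare
# their predicted probabilities on shared probe data; large disagreement points
# at a change in the X -> y mapping rather than in the inputs alone.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def concept_shift_score(X_old, y_old, X_new, y_new, X_probe) -> float:
    old_model = GradientBoostingClassifier().fit(X_old, y_old)
    new_model = GradientBoostingClassifier().fit(X_new, y_new)
    p_old = old_model.predict_proba(X_probe)[:, 1]
    p_new = new_model.predict_proba(X_probe)[:, 1]
    return float(np.mean(np.abs(p_old - p_new)))   # 0.0 means the learned mappings agree
```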
[00:58:54] Unknown:
Are there any other aspects of the work that you're doing at NannyML, or the overall space of post-deployment data science, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:08] Unknown:
Oh, I just remembered one thing about the roadmap, on the integration side of things. On everything related to engineering, we will definitely be looking into explicit integrations with MLOps tools to make deployment of NannyML itself as seamless and effortless as possible, so tools like ZenML and the like, on top of what we have right now with our CLI and Docker deployment. When it comes to other things, maybe learnings, something I learned in my journey is that recruitment is extremely important and hard. But because we spent a lot of time actually figuring out who we want and describing it very well, we never had issues with not having enough applicants or not being able to find the person we're looking for.
So it's just kind of a fluffy thing: focus on recruitment more than you think you should. Recruitment decisions are really one of the most important decisions you can make in a startup. I think that's it, actually. Nothing specific to talk about. Just, you know, I would like to encourage everyone to go to our GitHub, give it a try, and I am always happy to either assist or receive any kind of feedback. Oh, and you always get haters. We got our haters recently, and we actually felt good about it, people bashing NannyML, because if they are, that means that they care, and that's good.
[01:00:31] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:00:45] Unknown:
I think it's mostly the awareness of potential impact. This is not going to be, like, the popular default answer here. But I think if businesses and executives realized the potential impact of machine learning in their companies, and if they were able to actually track it, they would be much more willing to invest in proper structure and resources for deployment. If you see the potential upside, problems tend to disappear and get resolved when there is a real need for that. Another thing is well-structured processes, not only for developing machine learning models, but also for deploying and monitoring them. I think right now we have a very strong disconnect between prototyping, deployment, and everything that happens after deployment.
What we will hopefully see in the future is that these things will come closer together, and data scientists who develop models will also be looking at them through the lens of: can I actually deploy this thing? Will it work when it's deployed? People who work on deployment will think about what happens after deployment, so that once a model is deployed it's easy to retrain and easy to redevelop. And people who work in post-deployment data science will be able to look back to the previous stages, making this whole process more structured and not as disjointed as it is now.
[01:02:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at NannyML. It's definitely a very interesting problem space, and it's great to see folks tackling the question of how you actually understand what your model is doing in production and what to do about it. So I appreciate all the time and energy that you and your team are putting into addressing that problem, and I hope you enjoy the rest of your day. Thank you, it was great to be here. Enjoy.
[01:02:42] Unknown:
Thank you for listening. Don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea of the data scientists building natural language processing models to programmatically inspect, fix, and track their data across the ML workflow, from pretraining to posttraining and postproduction. No more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs while seeing 10 x faster ML iterations. GALILEO is offering listeners of the Machine Learning podcast a free 30 day trial and a 30% discount on the product thereafter. This offer is available until August 31st, so go to the machine learning podcast.com/galileo
[00:01:03] Unknown:
and request a demo today. Your host is Tobias Macey. And today, I'm interviewing Wojtek Kubertski about Nanny ML and the work involved in post deployment data science. So, Wojtek, can you start by introducing yourself?
[00:01:14] Unknown:
Absolutely. So thanks for having me, Tobias. I'm a cofounder at Nenemel. What I do is basically everything that's related to tech. So I kinda manage the product side of things, the research side of things, which are actually separate in that email because we are a deep tech company. My background is in mechanical engineering, and then I did my master's in artificial intelligence, Belgium, where I stayed for a while. I freelance for a bit, starting my own consultancy, handed off the consultancy to someone else. I moved to Portugal where, you know, the sun, the food, and everything else that you can get here.
[00:01:49] Unknown:
And, yeah, that's basically me. And do you remember how you first got started working in machine learning?
[00:01:55] Unknown:
Yes. So that is, I think, a typical story when you come from physics background or mechanical engineering background is that my bachelor thesis was about optimizing the shape of wind turbine blades and generally wind turbine stuff. And then I realized I actually enjoyed the optimization and computational side more than the actual physics. And then since I'm a very lazy guy, I wanted to automate it all, and that's where machine learning came in handy. So I basically decided that let's just try to optimize everything and automate everything, and machine learning is really the tool for that. The algorithms themselves were just really, really interesting. So that's basically when I made my decision to kinda switch from mechanical engineering to AI. And I got lucky that I got accepted to 1 of the masters in AI, so I didn't have to kind of start start from scratch with my education.
And that was really it.
[00:02:48] Unknown:
So as you mentioned, now you have cofounded a business in Nanny ML. I'm wondering if you can describe a bit about what it is that you're building there and some of the story behind why you decided that this was the problem that you wanted to spend your time and energy on. I started previous company, a consultancy,
[00:03:03] Unknown:
together also with my current cofounders at Nanny and El. And our goal was always to find a product that we could make big, basically, some unfulfilled need that we could try to fulfill. And we had a couple of ideas, but there was 1 thing that we came up over and over again with our consulting clients is that there was the question of what am I gonna do after you deploy these models for me? How do I make sure that these models actually keep on working? And we couldn't really make it our problem as part of the consulting because it was way too big to just do it as an kind of 1 off project. So we decided it's not our problem. Yeah. You need to deal with it. We hand off, we deploy your models for you, and that's it.
And then after a while, since it was really a pattern, we decided that if we can make it our problem, let's probably actually create a great company and a great product. And there's also kind of a lucky coincidence that we had a mentor and we still have a mentor that built a unicorn. And, basically, he kept on pushing us to actually jump into product already. There's no point in staying in consulting. So this was kind of, like, a perfect start of opportunity, right mentor, and the right pull from the market because I was actually pulled from the market. And then we decided to digital consultancy and start basically go all in with Nanny ML, raise funds, and build the product.
[00:04:27] Unknown:
In the landing page and some of the marketing material around Nanny ML, it brings up this concept of post deployment data science, which is what you were saying about you build these models, you hand them over to your clients, and then they say, what do I do with it now? And I'm wondering if you can give your definition of what that term encapsulates and some of the ways that it differs from the kind of metrics and monitoring approach that a lot of people have been falling into because it's something that they're familiar with from application or infrastructure?
[00:04:55] Unknown:
It's basically exactly what you said that there is right now this kind of default approach to MLOps that it's all about engineering. And what we noticed is that data science has always been about providing business value and finding insights in data that allow you to bring some value, extract some insights. And this, for the timing, we most of the time stops the moment you finish developing your model. And we think that if you extend this concept of let's try to bring insight. Let's try to ensure the value is there to everything that happens after deployment. This is really the the key to make sure that your models keep on delivering value. That's kind of fluffy.
But to be a bit more specific, what we mean by post deployment data science is doing things like monitoring, of course, but from the performance perspective. So we want to make sure that performance of your mouse stays high. And if something goes wrong, then we employ data science tools to find out what went wrong. So we have data analysis. We are doing actually machine learning on top of your machine learning models to figure out what went wrong and then how you can make sure that your model is fixed. So it's not from the engineering perspective. Up the uptime, throughput, whether the model pings, but looking at all kinds of silence that can happen and then trying to analyze using data science tools to figure out what went wrong and how we can fix it. That's basically
[00:06:22] Unknown:
it. As far as the current state of affairs where you have data scientists and ML engineers building the models and then handing it over to operations teams to manage the deployment and upkeep. I'm wondering who you typically see as being responsible for that work of the care and feeding of the model after it has, you know, gone past the training phase and you have that artifact to be able to put into production. And I'm wondering how you think about the design of Nanny ML for being able to help those people kind of augment their skills and also being able to loop those data scientists and ML engineers into that overall workflow?
[00:07:00] Unknown:
Yeah. So that is something that we're honestly still kind of figuring out. Identified kind of 3 main persona or 3 main kind of roles that could benefit from that email. The first 1 and the kind of people that are most interested in that email is actually heads of data science because these are the people who really answer to upper level management with the actual performance and the impact of their algorithms. So that's 1 thing. They need to have a tool like many more to make sure that the albums actually deliver value, but they will not be the ones using it day to day. They have the need, but they are not the end user. And then the end user we see is kind of, in a sense, maybe not equally split, but split between data scientists and ML engineers.
And so far, we see that it's more data scientists that kind of want to make sure that their models keep on develop working the most that they develop. So they still feel the ownership that this is my model. I want to make sure that it keeps on delivering value. And, also, there's that skill set that's kind of more suited for monitoring. Because, again, post deployment data science, it's not so much about engineering, but actually understanding what's going on. But, of course, when it comes to deploying any amount to production to make sure it can provide the insights on, like, day to day operation, this is something that will be handled by ML engineers. So in a way, it's similar to the models themselves when you need ML engineer to deploy them, deploy that email to monitor your model. And then you need the data scientists to actually look at the data, analyze the results, and see if something went wrong, how can I make it right?
[00:08:36] Unknown:
Given the fact that the data scientists are interested in being involved in the continued care and feeding of the models and understanding more about how the models are operating past the training cycle. I'm wondering what have been some of the points of friction that they've experienced previously that maybe prevent them from being able to be more involved in that process and that has led the current state of the industry more towards this just, you know, collect some metrics and try and figure out, you know, has this crossed the threshold of whether or not you need to care about it? So I think that
[00:09:10] Unknown:
the biggest problem is really that people had assumed that it's unsolvable. And it's something that has been a problem since machine learning production existed that models fail. But from a lot of our, like, early design partners and early users, we heard that, yeah, that's how it is. This is just a reality of life as a data scientist. So the first thing is really awareness that you can actually make sure that your models work well and they fail find it faster than you normally would. So that's the first thing. Another thing is we are still extremely early on when it comes to adoption of data science. We're kind of in a battle when, you know, work in cool companies that actually work with machine learning. But when you look at the general market, there is a lot of prototyping, and I think we were just simply too early. You first need to put models in production to monitor them. So I think right now, we're kind of crossing this threshold when data sciences, as they say, crossing the chasm. Models are actually making its way to production, and people start to realize that the problem exists and the problem is solvable.
And when it comes to kind of the friction points or maybe the examples of what prevented them, First, it's just on the kind of especially in big organizations on the political level, when there's different person responsible or even different departments responsible for deploying models and maintaining them and making sure there is an impact. And that I think is something that will change in the future. And when there's going to be maybe even a new role of a person who actually makes sure like, a post deployment data scientist or production data scientist. The person who makes sure that, these models keep on the different value after they've been deployed.
[00:10:57] Unknown:
As far as just the overall state of data science and the health of the models that are being built and deployed, I'm wondering what you see as the long term effects of having the data scientists more involved in the post deployment stages of the model development and the ways that gaining better understanding of how the model actually operates in the real world allows them to bring that learning back into the process of developing the models in the 1st place or developing successive models about how to learn more about the real world impacts of machine learning, you know, both the model on the real world and the real world on the model and kind of some of the ways that that will feed into kind of the ongoing research and development, both in companies and in academia?
[00:11:46] Unknown:
That is actually a very insightful question. And like you mentioned, there's definitely this feedback loop between the real world and the model itself. If you deploy, let's say, a churn model and the model works well, reduce the churn, which will also impact the population. So we're gonna have some kind of drift there already, which might mean, for example, that looks like the model is going to shit, but in reality, the model is still working while it's just population exchanging. So 1 thing that people definitely understand better, the data scientists that are working on these models, is how the real world interacts with the model and that it's a 2 way interaction. It's not just that model interacts with the world and changes the world, but also the other way around because the population will change, and if you return the model, the model is also going to be significantly significantly different.
Another thing is I think people will understand robustness much better. As right now, you can spend weeks of compute optimizing your hyperparameters. And most people know that they shouldn't evaluate things and test set more than once, but they still do it. And they just keep on looking for, you know, set of hyperparameters that performs best on your test set instead of your validation set. And they know that this idea of holdout is needed, but they don't really take it to the extremes. I think they should. And when you deploy models, you will restart realizing that things are not as robust as possible, especially if you spend a lot of time optimizing your models, and you're just getting good performance due to lag and due to just trying out and actually overfitting in some more subtle ways. And I think awareness of that will increase as you see that when you deploy your models, performance drops compared to your test. Even though it shouldn't, but it still does. And people are still asking why and finding the answers to that.
[00:13:39] Unknown:
More on the monitoring and alerting side of things. I'm wondering what are some of the ways that you have experienced model failures and some of the particular pain that you've encountered in terms of the available tooling, the visibility or lack thereof, and some of the ways that that has motivated you to want to invest the time and energy on solving this for everyone?
[00:14:00] Unknown:
So as I mentioned, kind of moment when we became aware of the problem was during our consulting days. And 1 very interesting thing that happened is that we were developing a segmentation model that would remove background from industrial tools, and they would be put in a catalog. So basically, for automation use case. And interestingly enough, somewhere throughout the process, as the model is being deployed, they changed the cameras. And the model really went to shit. And it's something that, of course, we couldn't have predicted, but it made us realize that things change in the real world. And as they do, the models will not perform well and failure is inevitable. Just part of life, things change, and we need to be able to monitor that. Another interesting use case that stumbled upon a bit more recently is in wind energy, where what you see is that there's always covariance shift, so change in your model inputs, also known as data drift.
And there, it really, really impacts the performance. There is no change in the patterns because it's all based in physics, so there's no concept drift. But change in just how data is distributed will have a huge impact on performance. This is something that happens seasonally, and you should know whether you should be worried about that or just normal performance. So just monitoring how your model is performing is not even enough because maybe this is something that's expected. Maybe as weather changes, there's more gusts of wind. If there's more gusts of wind, of course, your model will not predict the power production as well as in periods of low wind. So that's another use case. And then when it comes to credit scoring, 1 thing we noticed is that a lot of credit scoring companies and banks as well have this problem is that they deploy a model and they need to wait for a year, sometimes 2 years before they really get the production targets when they can actually evaluate the performance of their models. So they're kind of flying in blind for a year or 2 years, and they are using their models in basically the most important use case in their business.
So there's a huge need there to know whether to estimate performance in some way even if web ground truth. So these 3 kind of things really is what made us think of Nanny and Mal and how and the shape it is taken right now is that you cannot just simply monitor performance because it's not enough. Sometimes monitoring performance is not available, so you need to estimate performance instead. And looking at data drift is important from the performance perspective, but not for the sake of data drift itself because that is something that will always happen, and it's not necessarily bad if your models generalize well. Yeah. And to that point of data drift being the
[00:16:45] Unknown:
kind of most frequently brought up case of issues with model performance or things that you have to watch out for after you get a model into production. I'm wondering what are some of the other kind of more insidious or less understood or less observed ways that models can start to go wrong without based on maybe, you know, ROC or f 1 scores or just some of the ways that the model can still start giving you bad outputs without you necessarily noticing it. And, you know, conversely, how tracking these, you know, high level metrics or or these specific metrics that people are able to observe more directly, how that contributes to issues of alert fatigue in ML teams?
[00:17:32] Unknown:
That is a very loaded question, so I'm gonna answer it by by part. So first, starting with what can contribute to model failure. Let's assume that we cannot simply measure performance as it is the case in actually most of these cases because there's there are delay. Or in automation use cases, we don't have full ground truth. So just measuring f 1 or whatever metric you choose is not easily doable or doable at all. And then there is basically 2 main reasons why the models can fail. The first 1 is data drift or covaritship. So the change in the joint model input distribution to be very technical here. So we're looking at all the features and its joint distribution changes. If that happens, it might mean that your model your data is going to be drifting to regions when it's harder for the model to make its prediction.
Maybe because the model is not trained in this region, or maybe it's a region that, let's say, it's very close to the class boundary. So it's just harder to make the correct prediction. Then you will see that the performance of your model is decreasing. And this is something we can actually fully estimate the impact of convergence on performance with Honeywell, which is 1 of the cool unique features we have. And then kind of second, the reason why mouse can fail is concept drift. And concept drift is changing the pattern or mapping or a function, however you want to call it, between the model inputs and the target.
So it's probability of y given x. And that means that basically, the pattern that your model has learned during training is no longer so we will see a drop in performance. So these are kind of 2 ways that, models can fail. Now looking at how they can fail, and you will not even be able to tell by just looking at the metric. 1 thing is that imagine you have data drift to regions that are less uncertain. So the model performance should increase. But at the same time, you have concept drift, so the pattern changes. So the performance does not increase.
And that means that even if your f 1 stays the same, in reality, the model is worse. It's not predicting as well as it could be. And this is something you would not be able to see unless you can estimate the impact of data drift on performance. If you do it, you will see that given this data drift, the model should be performing better, but it's not. So something's off, and this off part is the constant drift. So that's 1 of, like, ways of really insidious silent failure. And also the inverse is true. Like I mentioned in when it comes to wind energy, it might be that you have periods when the model performance drops.
But there's still nothing to worry about because there is just a momentary drift to regions that are higher to predict them, like gusts of wind because wind blows from the mountains. Yeah. I think that's enough talking for 1 question.
[00:20:24] Unknown:
And in terms of the detection capabilities that you're bringing into NannyML, you mentioned that 1 of the ways that you're able to identify some of these failure modes that are often silent if you're just tracking, you know, bare metrics is the fact that you need to be able to see based on this prediction and the input data that I'm seeing. I'm assuming that this is going to be the performance that I see, but because the distribution of the data is different than what the model is trained against, the concept drift is such that it's actually degrading the performance despite my prediction that it will get better. And I'm wondering what are some of the other ways that that predictive capability that you're bringing in Nanny ML is able to highlight some of these failure modes that are kind of flying under the radar and some of the kind of broader impact on maybe confidence in ML and the organization, just some of the broader impacts that these silent failures have when they come to light and some of the ways that this greater visibility that Nanny ML brings allows ML teams to kind of build up better trust in the organization?
[00:21:32] Unknown:
That's another loaded question. So, again, I'm gonna start the first part, and I'll try to combine it with the alert fatigue that you mentioned. So 1 of the things that happens a lot is that people, when they do monitoring, they look at every single feature kind of in a void. And if your model has a 50 features, something is basically guaranteed to drift. And just because a drift is is significant from a statistical perspective, it doesn't mean it's significant from a model monitoring perspective. You could see a very strong drift in feature that's mostly irrelevant, and the model is very robust against changes there, and everything is gonna be fine. That's why it's so important to try to figure out what is the impact of data drift on performance.
Another thing is that these tests, if you do some kind of statistical test or you measure the difference, some kind of distance between 2 distributions, they assume that all features are independent. They don't track the change and the correlations or relationships between the features. So they kind of don't think about this joint part of the distribution. And for that, develop multivariate data drift capabilities, where you can basically train your data trainer, let's say, an autoencoder for the ease of explanation on your data for which you know everything's fine. And then you see how well this autoencoder performs on the dataset in question, your analysis dataset.
And if you see a drop in performance, it means that the internal geometry of the dataset has changed. So there's data drift. And this is a way to capture data drift that would not be captured by these tools that just look at everything from Univariate Research. Now for the second part of your question, which is how monitoring can change how smart sensors can impact the trust in machine learning. I think right now because the failures are silent, all the failures are not expected, and they are seen as something that's was not planned in saying Tibet should never happen. It's all your fault.
So what monitoring can really bring is first increase the trust that if, let's say, head of data science is reporting that the months are performing well, they actually are performing well because there's much more knowledge in the company. And another thing is kind of bring more understanding that failures will happen. They will be resolved fast, and they're just part of normal operating procedure. Just like like you see right now, you need to maintain your physical equipment, and physical equipment will have to be out of operation for a while. The same is true for machine learning models. Things will happen, and model is not performing as well as it used to, and it's gonna take us, let's say, 2 days to fix it. That is something that should be normal in companies, but now every time there is an issue, everyone's ringing alarm bells that it's no. You cannot trust machine learning, and it's not reliable.
It's reliable. It's just not perfectly reliable. And if we quantify this reliability, that will bring more trust.
[00:24:29] Unknown:
Another aspect of the question of alert fatigue is that if you do get an alert, you don't necessarily know what to do about it, where a lot of the conversation that you see in the ML space is, oh, if you see concept drift, then that just means that you need to go back and retrain your model based on the new data that you have and redeploy it, and everything will be great. I'm curious what you have seen as some of the ways that teams who are using Nanny ML or just this better understanding of what is happening with the model allows you to think about what actions can and should be taken based on the actual type of failure that you're seeing and the overall question of what are the remediation actions that can and should be taken when you do see that there is some problem that is manifesting?
[00:25:16] Unknown:
So when I think about monitoring, I think there's kind of 3 parts of it. The first 1 is detecting failure. The second 1 is realizing why failure happened, and the third 1 is resolving the failure. And right now, with Nanaimao, we basically provide the first and the second. And the third 1 is, for better words, manual. Because what we realized is that really is extremely use case dependent. The first thing you should do, your default action should be retrained, but instead of redeploying immediately, you should either run an AB test, and this is something that we saw some people doing. Or and that is a very interesting case, is that you retrain it, but you retrain just on the part of the drifted data, and you see how the model performs on the rest of the drifted data. So you do kind of a phantom deployment.
And then if everything's fine, it can redeploy. If not, to go back and basically do some kind of root cause analysis, which you can do with data drift capabilities, both being varied and multivariate. And then also, you know, tap into your domain knowledge to see what are the changes in data quality upstream that might have caused the failure. What are the changes in the business side of things that might have caused the failure? Maybe there is a new marketing campaign, and this marketing campaign completely fucked up our recommenders. That's something that might have happened, but a tool like Nanaimala has no way of seeing that because it's completely external to the to the machine learning system itself. So there are always the human component there because you're basically like an investigator trying to figure why things happen.
And then once you know the full picture of why, it's a matter of retraining or redeveloping your model, maybe with different architecture, maybe by adjusting the data. So if you have a strong data drift, maybe you should adjust your old data to some kind of shift there to make it match the current distribution because 1 will just get confused otherwise trying to complete the different patterns at the same time. Yeah. I guess that's
[00:27:16] Unknown:
it. Another interesting kind of roadblock to the question of just retrain and redeploy it is that sometimes for for a particular model instance, that retrain cycle might take, you know, several hours or days. And I'm wondering how that factors into the decision about, okay, what do I do in this instance? Like, is it typically a case of this is an urgent failure, and I need to be able to resolve this immediately? Or are the failure modes Oh,
[00:27:55] Unknown:
yeah. That Oh, yeah. That's another interesting question. And you mentioned that it can takes, you know, days. But if you look at wider organization in some banks, it can actually take months because you need to submit a request to train a model. It's to be approved by the compliance team, and it it really takes ages. 1 of our data scientists at Nanny Mal used to work at 1 of the more advanced banks. And even there, just because the regulation is so heavy, you cannot simply retrain your model. And you need to go through steps with multiple different teams, multiple different external validators before you can return your model. So it's at least a month.
And I think that is also why you actually have to return your model or it's nice to have is important. So, again, estimating the performance and monitoring the performance is really the key here. And from there, you can see where the impact is. Long term and severe assess the severity, assess also whether it's a long term impact or just passing impact to seasonality. And so again, data drift impact of data drift on performance, estimating performance, and measuring performance. Oh, and here there is another action thing at play, which is the uncertainty estimation of the performance itself.
This is something that we will just be missing to to the library, and I think 2 weeks time from now, when we'll also be giving the uncertainty estimates on our performance, our calculated performance. So you will not know that let's say, rock AUC is right now 0.85, but 0.85 plus minus 0.6. And if you see that you have strong drop in performance, but also uncertainty is really high, maybe that is something that, you know, you just got unlucky with the data, and you should keep on going as it is. You don't have to retrain. Maybe you do, but let's wait and see. So if you combine this uncertainty estimates with the data drift and performance estimation, you should have a full picture of whether you need to retrain or not.
[00:29:54] Unknown:
This might be going a little too far afield and too far into kind of the systems design aspect of it. But in the case where you do have a model, it is seeing critical errors in the ways that it is generating predictions. I'm wondering what are some of the kind of application or systems engineers' patterns that you've seen for people to be able to include some sort of a circuit breaker where it says, okay. This model is too far off the rails. We can't trust it. Retraining it is gonna take too long, so we need to have some sort of fallback mechanism to be able to say, you know, oh, the this capability is no longer functional right now. Check back later. Or, you know, we'll just go with kind of a more statistical heuristic approach that doesn't rely on this more advanced model capability. So we will say, you know, I have kind of a graceful degradation of capabilities, but the model is no longer in the loop such time as you can fix the underlying problem that it's experiencing.
[00:30:45] Unknown:
To be honest, I think that it's step too far, not necessary for us here, but for the reality. I don't think we're there yet. Maybe there's a handful of companies that already have those capabilities. I haven't spoken to anyone that actually already had a full blown system of how to do failure. Most of the companies that do monitoring do it manually. Everything is manual. All decisions are basically taken on the spot, and I think we're quite early when it comes to that. 1 thing I saw is that there is an off switch. If the failure is really critical to the point when it's still bringing values detrimental to the company, then the first thing you do is you just turn it off.
Or you still run a prediction, but you don't actually act on it. So you then something of that would actually have an impact. So the model is just running in the void, and then you start resolving the problem. And once the problem is resolved and you confirm that the failure is no longer there, you can deploy the new model.
[00:31:44] Unknown:
That's basically the only thing I know of. Yeah. It's definitely an interesting thing to think about of, like, all the different ways that things can go wrong and how to resolve them. And at a certain point, you just have to say, I can't predict it. It is what it is, and we'll figure it out when we get there. Yep. And so in terms of the Nanny ML project, can you talk through how it's implemented and some of the overall design questions that you had to address in terms of the technical and user experience interactions and how to be able to fit this into people's machine learning workflows?
[00:32:17] Unknown:
On high level perspective, I've decided to go for a Python open source library because, let's be honest, almost all data science right now is done in Python. People want to just install things or condense or whatever is your preferred method here. So that's 1 thing. When it comes to the actual implementation, we basically have kind of a default structure, default workflow for Nanaimo itself. Well, first, you plug in your data, and your data can be in form of kind of 1 of analysis when you have your reference datasets for which you know if it's fine, your analysis datasets for that you'd like to analyze. And this kind of mimics the manual monitoring process when from time to time, let's say, every week, every 2 weeks. However often you need to do it, you will manually run manual, see what's going on there. And then the flow is that is that you get your data via parquet files or whatever you want, or if you have it's a microservice that provides the data.
You see what's going on starting with performance calculation or estimation depending whether targets are available or not, Then everything's fine, good. If things are not fine, we need to go into data drift, and you'll go do a multivariate and univariate data drift detection. And then we have a, honestly, pretty minimal capability to link data drift to performance. So we'll be able to see what are the potential reasons for dropping performance. And from then on, we go to a manual quest of resolving the issues. That's kind of 1 workflow. Another workflow is that if you deploy to production.
In that case, Manheiml is running via Airflow or some other kind of service. We just release our CLI. So right now, it's much easier to just run an email as part of your ML system, and you get your alerts. And you have a, let's say, a script or notebook when an email is running when you can just view updated dashboard with however often you want to run an email. It can be every 10 minutes. It can be every 5 minutes, basically, as often as you want. There is 1 thing that I generally caution about, which is trying to do streaming with monitoring.
And the thing is that you can obviously use an animal in streaming fashion, but you need to have certain volume of data before you can say that the performance has degraded. Degradation is like inherently probabilistic process. So if you just get 1 new data point, it really does not make sense to run the whole analysis again because things cannot have changed that much. And even if they do, you will not be able to tell by just 1 point, and also the algorithms cannot take that. So what I would recommend is basically running batch every few minutes, few hours, how often do you need when you have enough volume of the data. And then if there's alerts, then the rest of the process triggers, and that can either automatic trigger retraining or an advanced organization that deal with automation use cases that could trigger manual spot checking.
So then you basically create a task, send the signal to team that does spot checking. So manually reevaluate whether the performance actually has decreased. And from then on, you can take actions to remediate.
[00:35:34] Unknown:
In the process of building Nanny ML and starting to work with some of the end users of it, I'm wondering what are some of the initial concepts or ideas that you had about how to approach the problem or how to integrate it into people's systems that you had to reconsider or reevaluate as you started getting it in front of people with real world use cases?
[00:35:55] Unknown:
Yeah. So our first idea is to basically market it towards business people because that is kind of the end consumer. And while the idea is sound in principle, in reality, it really did not work because they just don't understand machine learning at all. That's why you switched to data scientist and heads in data science. And the other thing is that we started with a dashboard. People want to view it as a dashboard. What we learned from early user interviews and from early design partners is that we actually don't care that much for a dashboard. Like, there's certainly some some companies organization when they want it, and they care about dashboard as kind of a second step when they deploy themselves via literally Jupyter Notebooks.
They view things themselves there. Everything's working fine or not fine. And then if there is a problem or they need to communicate it to business stake holder, that's when the dashboards come in. So then we had to completely change the priorities. We dropped the dashboard. Or focused on usability within Jupyter Notebooks. And really, in the entire question of UX is I want to run things fast using interface that I know. So we often default to how is scikit learn doing it, or how is when it comes to plotting, how is Seaborn doing it. And we look there, and it's our default. And unless we have a very strong reason to change the interface, we basically copy them because they did it well, people know how to use it, and they are happy with these libraries.
So I guess that is 1 big thing we needed to reevaluate. Another 1 is kind of we had assumed that the idea of estimating the performance without the ground truth is something that will click in in user's head. But we realized it's not as easy, and we had to do a lot of, like, nudging both in our documentation and in the visualizations themselves to show that we are estimating this performance. This is real performance. We are estimating what is likely to have happened due to data drift. And this is something still an open question because for a lot of people, it takes a while to even realize that it's possible to even estimate the performance in a machine learning model. And it's definitely not an intuitive idea.
[00:38:08] Unknown:
Another interesting challenge in the ML space is that you need to be able to support a fairly large variety of different frameworks and data types and model types and model architectures. And I'm wondering what are the initial target capabilities and initial focus that you had for that kind of matrix of use cases and the ways that you have thought about the initial foundational aspects of building Nanny ML to allow you to then re branch out into, you know, maybe going from tabular data to unstructured data or image data, etcetera, or, you know, the different frameworks that you wanted to support or model architectures that you wanted to be able to understand how to generate these predictions?
[00:38:56] Unknown:
So that's something we actually thought a lot about. And what we realized that from the use case perspective, there's basically 1 huge difference is that whether the ground truth is delayed or not. If we work with cases when ground truth is not really delayed, so high frequency training or delivery time prediction, when it's gonna take you half an hour to get your sorry, to get your targets. That is something that should be monitored in a completely different way than, let's say, loan default prediction for mortgages when you wait years until you know your target.
Then on the model and framework side, we made the decision very early on. Also, because, honestly, we got lucky with research that we're gonna be fully agnostic when it comes to everything there. So we don't actually need the model. Nanimo does not need the model file, does not need to know the framework. You just work with data. That was a very kind of committal decision that we took in the beginning that an animal should always be able to work with just data. So then by default, we actually support all kinds of models because we don't need to look at them, all kinds of framework because we don't need to use them. So we managed to kind of size the the problem, and so far, it's working quite well. Now when it comes to the actual data types, that is, I would say, the biggest challenge. That's something that we're able to sidestep.
We decided to focus on tabular data because that's where, so far, we've seen both the value is in the companies, and also this is the type of data that is most prone to failure. Like I always say, a horse is a horse. It's not gonna stop being a horse anytime soon. So if you look at image data, it's less likely to change. But, of course, as I already mentioned, sometimes the cameras change. 1 interesting example we got from a design partner is that they were doing detection for COVID based on x-ray images. And there was a new strain, and the model failed because the new strain actually showed itself in the x-ray images differently. So, of course, there's truth there, but we decided to focus on churn use cases, credit default scoring, upselling, cross selling, all this boring AI because these models are actually in production, and if they fail, that's a huge problem.
And now moving on to images and x data. We are right now actually running a small internal project to see what changes we need to do to support them natively. And what we find out, you can see you can already use an email for text and image data, but instead of working with the raw data, we need to go to the embedded layer. And there, you could basically treat it as tabular data, and performance estimation works well. Multivariate data detection works well. Univariate data detection returns bullshit because it just get pixel values or you get, you know the vector 17 has changed. Yeah. That's not interpretable and it's very useful.
So we don't have the interpretability there, but the core capability of drift detection and performance estimation is still there.
[00:41:57] Unknown:
As far as being able to integrate Nanny ML into an ML system, what are some of the foundational systems and capabilities that are necessary to be in place to be able to handle feeding the data into it, generating the predictions, being able to capture the outputs of Nanny ML and integrate that with the other monitoring and alerting systems that can and should be in place?
[00:42:24] Unknown:
So our default kind of workflow is you have your model deployed as a microservice. So first, hard requirement is that your model is actually in production, which people sometimes say that it is in production, but it's actually not in production. Or it's in production, but nobody's acting on the predictions. So 1 important thing is you should implement, an animal as you deploy your models or after you deploy your models, but it's a tool for post deployment data science. So that's 1 kind of obvious thing, but it's sometimes less obvious than it seems. And then we basically assume that you have a microservice architecture when you have a model deployed somewhere that is scheduled via Airflow or whatever you want. And then what you want to do is you want to spin a Docker container, let's say, or have a you you containerize your your microservices with Nanaimo.
And what you feed to Nanaimo is all the inputs that come to the model and all the outputs from the models. Plus, if you have access to this data, also the target data, the production target data. On top of that, you will need some periods for which you know everything went fine. This can be your validation set or you could just get some production historical data for everything, when everything was fine. Let's say, first month of production. And then you just if you do batch processing, you can print an email at the same time as you query the model. The same exact data you first run through the model, then you get more of the inputs and outputs to Nanima. And then we will need to have a way to display it. The defaults are following the Jupyter Notebooks only just have 9 month running. It displays the data. You view it in your browser and that's it to run it. So there is not that much to be done on this front.
[00:44:09] Unknown:
As far as the post-deployment data science aspect of it, we discussed this a little bit, but what are some of the ways that the outputs of NannyML can get fed into a partner model that sits alongside NannyML and feeds back into the model that you're monitoring, or some of the ways that the information NannyML surfaces can be used to build additional models that serve as automated checks or automated remediation flows for the model that you're actually putting into production and that is generating the value?
[00:44:46] Unknown:
The easiest way is basically the alerts. We have alerts that are returned both in visual form and as a data frame. And what you will see there is that if you have an alert on performance, which can be set with a custom threshold or a default threshold, that is the easiest thing to act on. So basically, you can look at alerts on performance, alerts on data drift, and alerts on specific features. And then you can use that information to first trigger automated retraining. This is kind of the minimal, fully automated loop: when you automatically retrain, you deploy the model, and then you pass it through phantom deployment or A/B testing, through NannyML again, to see whether the issue has been resolved. If the issue has been resolved with this feedback loop, then the model gets deployed to production.
So that's kind of the default remediation technique. And then there's another flow if the issue is not resolved, so retraining is not sufficient. You could pass what has been tried, plus all the other information on performance and data drift, to a data scientist, who can basically start from that and do an EDA to figure out what went wrong. And then you go back to the same process: you deploy the model via phantom deployment or A/B testing, you run it through NannyML again, and you see if the issue is resolved. If the issue is resolved, you deploy the model. So there is this kind of feedback loop approach, where you can use NannyML both to alert you that something has gone wrong and to confirm that the issue is resolved.
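The feedback loop described here could be wired up roughly as follows. The layout of NannyML's results dataframes (including the alert columns) varies by version, and the retrain, phantom-evaluation, promote, and notify callables are hypothetical placeholders for whatever your own pipeline provides.

```python
# Sketch of the alert-driven remediation loop. The alert column layout differs
# between NannyML versions (newer releases use a multi-indexed results frame),
# and the callables passed in are hypothetical pipeline steps.
def has_alerts(results) -> bool:
    """True if any chunk in the results raised an alert (column names vary by version)."""
    df = results.to_df()
    alert_cols = [c for c in df.columns if "alert" in str(c).lower()]
    return bool(df[alert_cols].fillna(False).astype(bool).to_numpy().any())

def remediation_loop(estimator, drift_calc, analysis_df, retrain, phantom_eval, promote, notify):
    """One pass of the loop: alert -> retrain -> verify in shadow -> promote or escalate."""
    perf = estimator.estimate(analysis_df)
    drift = drift_calc.calculate(analysis_df)
    if not (has_alerts(perf) or has_alerts(drift)):
        return  # nothing to do

    candidate = retrain()                # e.g. retrain on data excluding the bad period
    shadow_df = phantom_eval(candidate)  # inputs/outputs collected via phantom deployment or A/B test
    if not has_alerts(estimator.estimate(shadow_df)):
        promote(candidate)               # issue confirmed resolved: roll the candidate out
    else:
        notify(perf, drift)              # retraining wasn't enough: hand off for manual EDA
```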
[00:46:18] Unknown:
And then in terms of the ways that you're thinking about making this project and business sustainable, I'm curious how you have thought about the kind of governance aspect of the open source project and the ways that the business you're building around that, you know, feeds into it, and just some of the division between what is freely available, what is commercial, and just the overall business model that you're targeting?
[00:46:43] Unknown:
So for the time being, and for the foreseeable future, we are fully focused on open source, and it is going to be an open-source-led product. The open source is always going to be the main thing, and we kind of follow the open core philosophy, where all the core algorithms that provide value, the new science that we're developing, or the novel engineering at least, are always going to be open source. Right now, we're not monetizing anything. What we're focusing on is adoption first and foremost, trying to get as many people as possible to use NannyML in production in real-life use cases. And based on that, we will be seeking more funding soon, soon-ish.
And we have a bit of runway yet. In the future, there are basically two ways we plan to monetize. One is everything that has to do with an enterprise edition: security, integration, privacy, helping big enterprises create workflows where NannyML can be used to provide value, just like the one I described before. The other is business-oriented features: trying to use additional signals from the business that are not in the dataset to help improve model monitoring, and, if it's possible to quantify the impact of a model on the business itself, also monitoring those metrics. So maybe ROI on the model.
A simple one in credit scoring would be, you know, risk-adjusted ROI on how much money you're getting per loan. Is that changing? How does that relate to data drift and model metrics? Can we also estimate that? These are things that almost nobody cares about except business people, so it's a perfect thing to keep outside of the open source solution, because it's something that would not drive adoption and does not drive value for the vast majority of people, but it is a good way to monetize and make it a business. Because if these capabilities are needed, they are really needed, and people are willing to pay for them in a way that really shows it provides value.
[00:48:51] Unknown:
In your experience of building the project and working with some of your design partners, what are some of the most interesting or innovative or unexpected ways that you've seen NannyML used?
[00:49:01] Unknown:
The idea of using NannyML to A/B test models is not mine. I've mentioned it a few times, but it was something that actually came from a design partner. At first, they were not even that interested in the monitoring aspect of it, but they were doing retraining, and the biggest thing is that they never knew whether the automated retraining actually made sense. Maybe we're actually hurting the model more than we're fixing it. And that is something that actually happened once when there was an upstream data issue. It meant that the quality of, I think, two months of data was much lower than the previous months of data, the model was retrained, and the performance dropped significantly, because even though the data quality issue was found and resolved, the same data was still used to retrain the model. I don't know why. It just happens in enterprises.
And we were able to actually spot that with NannyML. So that is one way to monitor automated retraining, to make sure that you don't do something wrong by retraining your model. That is really the biggest one. Another one would be to look at the training data itself and understand it better, so NannyML can be used as an analysis tool when you basically want to see what the changes within the training data are and whether you should use all the data for training or not; to basically see whether this data is really relevant or not. That's something that normally would only be done in a very manual way. Let's say you have the last 20 years of credit scoring data. As a person working in finance, you would know that everything before 2008 and after 2008 looked completely different.
But that's something you need to know, and there are maybe multiple other changes like that that you would not be aware of. And you can use NannyML to kind of automatically analyze these changes in the concept and the data drift within the training set, to figure out which data should be used and which data should not be used to train your model.
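A sketch of how that training-set analysis might look: fit a drift calculator on the most recent slice of historical data and scan the older years for structural changes. The class and argument names assume a recent NannyML release, and the `history` dataframe with an `origination_date` column is hypothetical.

```python
# Sketch: treat recent history as the reference and run drift detection
# backwards over older years to spot regime changes (e.g. pre- vs post-2008
# credit data). `history` is a hypothetical dataframe of historical training
# data; argument names assume a recent NannyML release.
import nannyml as nml

feature_cols = [c for c in history.columns if c != "origination_date"]

calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_cols,
    timestamp_column_name="origination_date",
    chunk_period="Y",  # one chunk per calendar year
)
calc.fit(history[history["origination_date"] >= "2018-01-01"])   # recent data as the baseline
results = calc.calculate(history[history["origination_date"] < "2018-01-01"])

# Years flagged with an alert look structurally different from recent data and
# are candidates to exclude (or reweight) when retraining.
print(results.to_df())
```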
[00:51:08] Unknown:
In your own experience of building the project and building the business, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:51:21] Unknown:
One thing is that we really had no clue how to build a product, and we kind of had a clue how to build a business because we had built a consultancy. But the product side, I was like, what's product? You engineer the thing, you do the data science, and then you release. And we really did not realize that there's this entire thing called product thinking, where you need to figure out and combine what's needed with how to present it, and make sure that there is value, usability, business viability, and actual feasibility. And this was completely not on my radar.
That was one big thing that I learned. I read a book called Inspired, and that is really something I would recommend to everyone: engineers, data scientists, whoever you are, if you work at a tech company, this book is really great. Another thing is the value of early user testing. At first, we decided to build one thing, we built it, and I was like, yeah, it's kind of there, it works, but not great. And after we hired a product designer, we basically switched to always trying to test mocks and prototypes, and to test as often as possible.
And once we decide to build something, we have a much higher chance of it being the right thing. On top of that, we are also much more committed to actually building things and not abandoning them halfway through because we found out it's not something people want, because when we build something, we already know that people want it. And the last thing I would say is that adoption is much more nuanced than expected. We started with a pure business focus, because that is objectively where the value of data science is, but that doesn't matter. If these people are not interested or don't understand what we're trying to do, we will not be able to bring value to them, because they are not open to it.
And that's why we decided to switch our focus. Being open to who your user actually is is very important, and focusing on the people who will use NannyML immediately is more important than focusing on the person who will, in the end, get some value out of it. Working with engineers is completely different from working with data scientists. They are different people, they are driven by different things, and there's even a huge communication gap between engineers and data scientists. One thing that I noticed is that data scientists, in general, and of course I'm gonna make a terrible generalization here, are driven by curiosity. What makes them tick is figuring things out, and with engineers, it's much less so. They are driven by building things. They want to create things.
These are completely different motivations, completely different ways of looking at the world. For me as a data scientist, the problem ends when I figure out how to do it; actually doing it is almost an afterthought. If I already know how it works, that's what I wanted. Whereas engineers don't really care that much about how things work; it's about building things. These sometimes stand on seemingly opposite ends of the spectrum, and I had to learn how to recast the problem in a different light to appeal to different people. Yeah. It's definitely an interesting insight about kind of the
[00:54:33] Unknown:
challenge of being able to motivate people, because of the kinds of areas where they want to spend their time and focus, and what it is that they're actually intrinsically interested in.
[00:54:43] Unknown:
Absolutely. And then you have business people, and most of them just want to have one metric and drive that metric as fast as possible. So it's yet another motivation, and it's very
[00:54:56] Unknown:
interesting. For people who are deploying machine learning and have it in production, what are the cases where NannyML is the wrong choice?
[00:55:04] Unknown:
So I would say that there are a few use cases that we're just not ready for as a library, as a product. Let's say that you don't do any monitoring at all yet, you work at JPMorgan Chase, and you're responsible for creating the entire platform for monitoring everything. We don't have the capability to actually implement NannyML for users, so we wouldn't be able to assist you there, and you should probably look for a closed source solution that comes with the entire playbook of how to do it for a huge organization, from scratch to finish.
NannyML is inherently an open source product, which means that we're happy to help and assist with anything that has to do with implementing NannyML, or with monitoring, observability, or post-deployment data science in general, but we simply don't have the manpower and the capability to do the integration for you. So if you don't have the capability to integrate NannyML yourself, NannyML is not the choice for you. This is true, I think, for most open source projects, and just as much for NannyML. Another thing is really big data.
We optimized our library and it works reasonably fast. If you have, let's say, single-digit terabytes of data per day, then it's a good solution for you. If we're looking at petabytes per day, this is gonna crash. It's not gonna work. It's not designed to work with that scale of data.
[00:56:32] Unknown:
As you continue to build out NannyML and iterate on the product and add new capabilities, what are some of the things you have planned for the near to medium term or any projects that you're particularly excited to dig into?
[00:56:46] Unknown:
So when it comes to the product roadmap, we always have, like, a research roadmap, because we are trying to come up with new ways to solve problems that have not been solved before. The main one being, of course, performance estimation without ground truth, but also multivariate drift detection. And there, we recently figured out how to do performance estimation for regression models, so we will be releasing full support for regression in the coming weeks, to be more specific, within the next four weeks. It's going to be in the library.
On the short-term horizon, as already mentioned, there is expected sampling error, so kind of a way to measure the uncertainty of our predictions, for data drift and for performance as well. A bit more long term, we will be looking into segment-level analysis: first, automatically segment your data from the perspective of model monitoring, and then run the entire analysis on these segments. You'll be able to find underperforming segments where maybe at the general population level the model is still fine, but there is some segment of your data where the model is not performing well, and you should be looking into that. So, kind of fine-grained analysis. Then, roughly within the same time frame, we'll have explicit, full support for text data and for image data.
And then, a bit more long term, we'll also be looking at concept drift detection, so ways to detect not only data drift but actual concept drift, and to link it to performance. This is something we just started the research on. But since it's a mostly unsolved problem if you don't have access to your ground truth, we are starting by assuming you can access your target data, and we'll be releasing support for that shortly. When it comes to concept drift detection without access to ground truth, that is a really open research problem.
But if we are able to really figure that out, we'll be able to almost always estimate the performance of your models quite correctly, which is kind of the holy grail of ML monitoring. But that's a while from now.
[00:58:54] Unknown:
Are there any other aspects of the work that you're doing at NannyML or the overall space of post-deployment data science that we didn't discuss yet that you'd like to cover before we close out the show? Oh, I just remembered one thing about the roadmap
[00:59:08] Unknown:
on the integration side of things. So everything related to engineering: we will definitely be looking into explicit integrations with MLOps tools to make deployment of NannyML itself as seamless and effortless as possible, so tools like ZenML and the like, on top of what we have right now with our CLI and Docker deployment. When it comes to other things, maybe to learnings, something I learned in my journey is that recruitment is extremely important and hard. But because we spent a lot of time actually figuring out who we wanted and describing it very well, we actually never had issues with not having enough applicants or with not being able to find the person we were looking for.
So it's just kind of a fluffy thing: focus on recruitment more than you think you should. Recruitment decisions are really some of the most important decisions you can make in a startup. I think that's it, actually; nothing specific to talk about. Just, you know, I would like to encourage everyone to go to our GitHub, give it a try, and I am always happy to either assist or receive any kind of feedback. Oh, and you always get haters. We got our haters recently, and we actually felt good about it, people bashing NannyML, because if they are, that means that they care, and that's good.
[01:00:31] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:00:45] Unknown:
I think it's mostly the awareness of potential impact. This is not going to be, like, the popular default answer here. But I think if businesses and executives realized the potential impact of machine learning in their companies, and if they were able to actually track it, they would be much more willing to invest proper structure and resources into deployment. And if you see the potential upside, problems tend to disappear and get resolved if there is a real need for that. Another thing is well-structured processes, not only for developing machine learning models, but also for deploying and monitoring them. And I think right now we have a very strong disconnect between prototyping, deployment, and everything that happens after deployment.
What we will hopefully see in the future is that these things will come closer together. Data scientists who develop models will also be looking at them through the lens of: can I actually deploy this thing? Will it work when it's deployed? People who work on deployment will think about what happens after deployment: if I deploy it this way, is it easy to retrain, easy to develop further? And people who work in post-deployment data science will be able to look back to the previous stages, to make this whole process more structured and not as disjointed as it is now.
[01:02:13] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at NannyML. It's definitely a very interesting problem space, and it's great to see folks tackling the question of how do you actually understand what your model is doing in production, and how do you understand what to do about it? So I appreciate all the time and energy that you and your team are putting into addressing that problem, and I hope you enjoy the rest of your day. Thank you. It was great to be here. Enjoy.
[01:02:42] Unknown:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to The Machine Learning Podcast
Interview with Wojtek Kuberski
Founding NannyML
Post Deployment Data Science
Roles in Model Maintenance
Challenges in Post Deployment
Real World Impact of Models
Model Failures and Monitoring
Detection Capabilities of NannyML
Remediation Actions for Model Failures
Systems Design for Model Failures
Implementation of NannyML
User Feedback and Iteration
Supporting Various Data Types and Models
Integration into ML Systems
Automated Remediation Flows
Business Model and Sustainability
Innovative Uses of NannyML
Lessons Learned in Building NannyML
When NannyML is Not the Right Choice
Future Plans for NannyML
Closing Remarks