Update Your Model's View Of The World In Real Time With Streaming Machine Learning Using River

Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine learning.

Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model's behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to learn more and get started.

Your host is Tobias Macey, and today I'm interviewing Max Halford about River, a Python toolkit for streaming and online machine learning. So, Max, can you start by introducing yourself?

Oh, hey there. So I'm Max Halford. I consider myself a data scientist. My day job is doing data science. I actually measure the carbon footprint of clothing items. But I have a wide interest in, you know, technical topics, be it software engineering or data engineering. I do a lot of open source. My academic background leans towards finance, computer science, and statistics. I actually did a PhD in applied machine learning, which I finished a couple of years ago. So, yeah, an all-around nerd, basically.

And do you remember how you first got started working in machine learning?

Kind of, yes. I was a late bloomer. I got started maybe when I was 21 or 22, when I was at university. I basically had no idea what machine learning was, but I started this curriculum that was around statistics. And we had a course, which was maybe 2 or 3 hours a week, about machine learning, and it did kind of blow my mind. It was around the time when, well, machine learning, and particularly deep learning, was starting to explode. So I kinda started at university, and I was lucky enough to get a theoretical training.

And in terms of the River project, can you describe a bit more about what it is that you've built and some of the story behind how it came to be and why you decided that you wanted to build it in the first place?

When I was at university, I received a kind of regular introduction to machine learning. And then I did some internships. I started a PhD after my internships. And I also did a lot of Kaggle competitions on the side. So I was kind of hooked on machine learning, and it always felt to me that something was off, because when we were learning machine learning, everything made sense. But then when you get to do it in practice, you often find that, well, the playground scenarios that they describe at university when you learn machine learning just do not apply in the real world. In the real world, you have data that's coming in, like a flow of data, or every day there's new data. There's like an interactive aspect to the world around us, the way the data is flowing. It's not like a CSV file.

Yeah. It just felt like fitting a square peg in a round hole. So I was always curious, in the back of my mind, about how you could do online machine learning. Well, I didn't know it was called online machine learning. When I was a kid, I remember growing up and thinking that AI was this kind of intelligent machine that would keep learning as it went on and as it experienced the world around it. Anyway, when I started my PhD, I was lucky enough to have a lot of time to read. So I read a lot of papers and blog posts and whatnot. And I can't remember the exact day or week that I stumbled upon it, but I just started learning about online machine learning. Maybe some blog posts or something. And then it was like a big explosion in my head, and I was like, wow. This is crazy. Right? This actually exists. And I was so curious as to why it wasn't more popular.

And at the time, I did a lot of open source as a way to learn, and so it just felt natural to me to start implementing algorithms that I'd read about in papers and elsewhere. I just started writing code to learn, basically, just to confirm what I'd learned and whatnot. That's just the way I learn.

And it kind of evolved into what River is, which is, well, an open source package that people use. Now if I may expand a little bit, River is actually the merger of 2 projects. The first project is called scikit-multiflow. It was a package that was developed before I even got into machine learning. It has roots in academia in New Zealand and comes from an older package called MOA, written in Java. Anyway, I was not aware of that. On my end, I started a package called creme at the time. In French, crème means cream, and it plays on the word incremental, which is another way to say online. So for a year, I developed creme by myself. And at some point, it just made sense to reach out to the guys from scikit-multiflow and propose a merger. It took us quite a while, but after 9 months of negotiation and, you know, figuring out the details, we merged, and we called the new package River.

You mentioned that it's built around this idea of online machine learning. And in the documentation, you also refer to it as streaming machine learning. I'm curious if you can just talk through what that really means in the context of building a machine learning system and some of the practical differences between that and the typical batch oriented workflow that most folks who are working in ML are going to be familiar with.

First, just to recap on machine learning, the whole point of machine learning is to teach a model to learn from data and to take decisions. So, you know, monkey see, monkey do. And the typical way you do that is that you fit a model to a bunch of data, and that's it really. But online machine learning is the equivalent of that, but for streaming data. So you stop thinking about data as a file or a table in a database, and you think of it as a flow, a stream of data. You could call it incremental machine learning. You could call it streaming machine learning. I more often see online machine learning being used, although if you Google that, you mostly find online courses about machine learning, so the name isn't ideal. But anyway, it's just a way to ask, can I do machine learning, but with streaming data?

And so the rule is that an online model is one that can learn 1 sample at a time. Usually, you show a model a whole dataset, and it can work with that dataset. It can calculate the average of the dataset. It can do a whole bunch of stuff. But the restriction here with online machine learning is that the model cannot see the whole dataset. It can't hold it in memory. It can only see 1 sample at a time, and it has to work like that. So it's a restriction. Right? It makes it harder for the models to learn, but it also has many, many implications. If you have a model that can learn that way, well, you have a model that can just keep learning as you go on.

Because with a regular machine learning model, once it's been fitted to a dataset, you have to retrain it from scratch if you want to incorporate new samples into your model. That can be a source of frustration, and that's why I was talking about the square peg in the round hole before. So say you have an online model that is just as performant as a batch model. Well, regardless of performance and accuracy, that has many implications, and it actually makes things easier. Because if you have a model that can keep learning, well, you don't have to, for instance, schedule the training of your model. Every time a new sample arrives, you can just tell your model to learn from it, and then you're done. That ensures that your model is always as up to date as possible, and that obviously has many, many benefits. If you think about people working on the stock market, trying to forecast the evolution of a particular stock, they've actually been doing online machine learning since the eighties. But because, obviously, they have a lot to lose by making all this public, it just never became a big thing, and it always stayed inside stock market companies.

So the practical differences are that you are working over a stream of data. You're not working with a static dataset. This stream of data has an order, meaning that the fact that 1 sample arrives before the other has a lot of meaning, and that actually reflects what's happening in the real world. In the real world, you have data that arrives in a certain order. If you train your model offline on that data, you want to process it in the same order. That ensures that you are actually reproducing the conditions that happen in the real world.

Now another practical consideration is that online learning is much less popular or predominant than batch learning, and so a lot less research and software work has been put into online learning. So if you are a newcomer to the field, well, there's just not a lot of resources to learn from. Actually, you could just spend a day on Google, and you'd probably find all the resources there are, because there's just not so many of them. There's probably, from memory, just 10 links on Google that you can learn from about online learning. So it's a bit of a niche topic.
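To make the one-sample-at-a-time restriction concrete, here is a minimal sketch in plain Python. It is not River's implementation; the class name `RunningStats` and the made-up stream are assumptions for illustration, and only the `learn_one` method name mirrors the convention River uses. It shows how the average of a dataset, mentioned above, can be maintained without ever holding the dataset in memory (Welford's method).

```python
class RunningStats:
    """Mean and variance updated one sample at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def learn_one(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 until at least 2 samples were seen.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def stream():
    # Stand-in for an unbounded stream; the values are made up.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in stream():
    stats.learn_one(value)  # each sample is seen once, then discarded

print(stats.n, stats.mean, stats.variance)
```

The state is three numbers no matter how long the stream runs, which is the property that lets an online model keep learning indefinitely.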

In terms of the fact that batch is such a predominant mode of building these ML systems despite the fact that it's not very reflective of the way that the real world actually operates, why do you think it's the case that streaming or online machine learning is still such a niche topic and hasn't been more broadly adopted?

Sometimes it feels like I'm trying to teach a new religion, which feels a bit weird because there's not a lot of us doing it. So I never try to force people into this. There's obviously many good reasons why batch learning is still done.

Now, from a historical point of view, I think it's interesting because we always used to use statistical models to explain data and not necessarily to predict. So you just have a dataset, and you'd like to understand, you know, what variables are affecting a particular outcome. For instance, if you take linear regression, historically it's been used to explain the impact of global warming on the rise of sea level, but not necessarily to predict, if the temperature of the globe were higher, what the impact on the sea level would be. But then someone said, let's use machine learning to predict outcomes in a business context, and that's why we have this big advent of machine learning. And we've kind of been using the tools that have been lying around. So we've been using all these tools that we used to fit to a dataset and explain it, but now we've been using them for predicting.

So these models are static. When we started doing linear regression, we never really worried about streaming data because datasets were small and datasets were static. Well, the Internet didn't even exist, so there was no real notion of IoT or sensors or streaming data. So the fact is that we've never needed online models. And so as a field, if you look at academia and industry, we're very used to batch learning, and we're very comfortable with it. There's a lot of good software, and this is what people are being taught at university. So I'm not saying that online learning is necessarily better than batch learning, but I do think that the reason why batch learning is so predominant in comparison is that we are too used to it, basically.

And I do see it every week: people who are trying to rethink their job or their projects and say, maybe I could be doing online learning. It actually makes more sense. So I think it's a question of habits, really.

For people who are assessing which approach to take in their ML projects, what are some of the use cases where online or streaming ML is the more logical approach, and what do the decision factors look like for somebody deciding, do I go with a batch oriented process where I'm going to have this large set of tooling available to me, or do I want to use online or streaming ML because the benefits outweigh the potential costs of plugging into this ecosystem of tooling?

So I'll be honest, I think it always makes sense to start with a batch model. Why? Because, you know, if you're pragmatic and you actually have deadlines to meet and you just wanna be productive, there's so many good solutions to train a batch model and deploy it. So, you know, I would just go with that to start with. And then there's the question of, could I be doing this online? I think there's 2 cases.

There's cases where you need it, and I have a great example. When Netflix does recommendations, you know, you arrive on the website and Netflix recommends movies to you. Netflix actually retrains a model every night or every week, and they have many models anyway. But they are learning from your behavior to retrain their models and update their recommendations. Right? There's a team at Netflix that is working on learning instantly. So if you are scrolling on the Netflix website and you see a recommendation from Netflix, the fact that you did not click on that recommendation is a signal that maybe you do not want to watch that movie, or that the recommendations should be changed. So if you're able to have a model, for instance, maybe in your browser, that would learn in real time from your browsing activity and that could update and learn on the fly, that'd be really powerful. And the only way to do that is to have a model per user that is learning online. You cannot just use batch models for that. Every time a user scrolls or ignores a movie, you can't just take all the history of data and refit the model. It would be much too heavy, and it's just not practical. So sometimes the only way forward is to do online learning. But, again, this is quite niche. Netflix recommendations are obviously working reasonably well, I believe, just judging from their market value. But if you are pushing the envelope, then sometimes you need online learning.

Now another case is when you do not necessarily need it, but you want it because it makes things easier. A good example I have is, imagine you're working on an app that categorizes tickets, for instance, in help desk software. So, you know, you go on the website and you fill in a form or you send a message or an email, and you have some problem, maybe with a reimbursement for an item you bought on Amazon. And then, you know, there's a customer service behind that, human beings who are actually answering those questions. And it's really important to be able to categorize each request and put it into a bucket so that it gets assigned to the right person. And maybe the product manager has decided that we need a new category. So there's this new category, and your model is classifying tickets into 1 of several categories. If you introduce a new category, it means that you have to retrain the model to incorporate it.

I was in discussion with a company, and their budget was such that they were only able to retrain the model every 3 months. So if you introduced a new category into your system, the model would only pick it up and predict it after 3 months. That sounded kind of insane and, you know, wasn't practical at all. I was not aware of the exact details, but it just seemed too expensive for them to retrain their model from scratch. If they were using an online model, well, potentially, that model could just learn the new tickets on the fly. You'd land in this feedback loop where you introduce a new category. People send emails. Maybe a human assigns a ticket to that category, so that becomes a signal for the model. The model picks that up, learns, and, yeah, it's gonna incorporate that category into its next predictions. So that's a scenario where you don't necessarily need online learning, but, actually, online learning just makes more sense and makes your system easier to maintain and to work with, basically.
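The ticket scenario can be sketched with a small online classifier that accepts labels it has never seen before. This is a hand-rolled multinomial naive Bayes in plain Python, not the model the company in the anecdote used and not River's implementation; the class name `OnlineNaiveBayes`, the category names, and the ticket texts are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class OnlineNaiveBayes:
    """Multinomial naive Bayes over word counts, updated one ticket at a time."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                        # Laplace smoothing
        self.class_counts = Counter()             # tickets seen per category
        self.word_counts = defaultdict(Counter)   # word counts per category
        self.vocab = set()

    def learn_one(self, text: str, label: str) -> None:
        words = text.lower().split()
        self.class_counts[label] += 1             # an unseen label is simply added
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def predict_one(self, text: str):
        if not self.class_counts:
            return None                           # nothing learned yet
        words = text.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            # The +1 reserves probability mass for words never seen before.
            denom = sum(counts.values()) + self.alpha * (len(self.vocab) + 1)
            score = math.log(n / total)
            for w in words:
                score += math.log((counts[w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = OnlineNaiveBayes()
history = [
    ("my package never arrived", "shipping"),
    ("where is my package", "shipping"),
    ("i cannot log in to my account", "account"),
    ("reset my account password", "account"),
]
for text, label in history:
    model.learn_one(text, label)

before = model.predict_one("please refund my order")

# A new category is introduced; human agents label the first few tickets.
model.learn_one("i want a refund for my order", "refund")
model.learn_one("please refund the damaged item", "refund")

after = model.predict_one("please refund my order")
print(before, "->", after)
```

No retraining from scratch happens when the new category shows up: the two labeled tickets are enough for the model to start predicting it.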

There are a number of interesting things to dig into there. 1 of the things that you mentioned is the idea of having a model per user in the Netflix example. And I'm wondering if you can maybe talk through some of the conceptual elements of saying, okay, I've got this baseline structure for it. This is how I'm going to build the model. Here is the initial training set. I've got this model deployed. And now every time a specific user interacts with this model, it is going to learn their specific behaviors and be tuned to the data that they are generating. Would you then take that information and feed that back into the baseline model that gets loaded at the time that the browser interacts with the website? And what are some of the ways to think about the potential approaches for how to say, okay, I've got a model, but it's going to be customized per user, and managing the kind of fan out, fan in topologies that might result from those event based interactions spread out across n number of users or entities?

Oh, it sounds insane when you say it, because to have 1 model per user and, you know, have it deployed on the user's browser or mobile phone or, god knows what, an Apple Watch, it does kinda sound insane, but it is interesting, I guess. I'm not aware of a lot of companies that would have the justification to actually do this, and I've never had the occasion to actually work in a setting where I would do this.

But I had 1 good example where I was kinda doing some pro bono consulting. It was this car company, and for the onboard navigation system, they wanted to build a model that could guess where you're going. So, basically, depending on where you left from, if you left home in the morning, you're probably going to work. And they would then use this to, you know, send you news about your itinerary or things like that. They really needed a model that would be able to learn online, and they made the bold decision to say, okay, we're going to embed the model into the car. It's not going to be, like, a central model that's hosted on some big server and that the car interacts with. The intelligence is actually happening in the car. And when you think about that, it's really interesting because it creates a decentralized system. It actually creates a system where you don't even need the Internet for the model to work. So there's so many operational requirements for that. Actually, now that I think of it and I'm talking about cars, I realize that's actually what Tesla is doing. They're computing and making decisions inside the car, you know, doing a bunch of stuff, and they're also communicating with mother servers. But for the actual compute, they have GPUs in the car running their deep learning models and whatnot. So it's definitely possible to do this. Right? But it's clearly not something that every company would go through or would have the need to do.

It's interesting also to think about how something like that would play into the federated learning approach, where you have federated models and there is that core model that's being built and maintained at the center. And then as users interact at the edge, whether it's on their device or in their browser, it loads a federated learning component that has that streaming ML capability. So the model evolves as the user is interacting with it on their device, and that information is then sent back to the centralized system to feed back into the core model, so that you can have these kind of parallel streams: the model that the user is interacting with is customized to their behavior at the time that they're interacting with it, but it does still get propagated back into the larger system so that the new intelligence is able to generate an updated experience for everybody who then goes and interacts with it in the future.

Yeah. That's really, really interesting. So I think, first off, the fact is that I'm actually still young, and there's so many things that I don't know. And I don't have, like, the technical savvy to be able to suggest ways forward. But these are obviously things that, you know, I think about.

So there's things like Hogwild, which is a project from Google, and they have a paper where they discuss these things.

I think there's a really simple pattern if you, the listener, wanted to do something like this, which is to maybe once a month have a model that you retrain in batch, and that model is gonna be like a hydra. You're gonna copy it, and you're gonna send it to each user. And then each copy for each user is going to be able to learn in its own environment. And, for instance, a good idea would maybe be to increase the learning rate of your model so that every sample that the user gives you matters a lot. So if we take the Netflix example, you would have, you know, your run-of-the-mill recommendation system model that you would just train in batch, using all the tools that we in the community use. But then you would embed that into each person's browser, and maybe you do this once a month. And then that model for each user would be a copy, a clone, just a separate model now. And, you know, it would keep learning in an online manner. So maybe your model was trained in batch initially, but now, for each user, it's being trained online. For instance, you can do this with factorization machines, which can be trained in batch, but also online. And, yeah, you would use a high learning rate so that every sample matters a lot, basically. And so you, the user, are tuning your model.
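The "hydra" pattern described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a logistic regression stands in for the factorization machine mentioned above, the class name `OnlineLogReg`, the feature names, and both learning rates are made up, and several passes over a small shared dataset stand in for the monthly batch training.

```python
import copy
import math


class OnlineLogReg:
    """Logistic regression trained by SGD, one (features, label) pair at a time."""

    def __init__(self, lr: float = 0.05):
        self.lr = lr
        self.weights = {}   # feature name -> weight
        self.bias = 0.0

    def predict_proba_one(self, x: dict) -> float:
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x: dict, y: int) -> None:
        error = self.predict_proba_one(x) - y   # gradient of the log loss w.r.t. z
        self.bias -= self.lr * error
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v


# 1. "Batch" phase: several passes over a shared, made-up dataset.
shared_data = [
    ({"genre_action": 1.0}, 1),
    ({"genre_action": 1.0}, 1),
    ({"genre_romance": 1.0}, 0),
    ({"genre_romance": 1.0}, 0),
]
base = OnlineLogReg(lr=0.05)
for _ in range(200):
    for x, y in shared_data:
        base.learn_one(x, y)

# 2. Ship a copy to one user and raise the learning rate so each click counts.
user_model = copy.deepcopy(base)
user_model.lr = 0.5

romance = {"genre_romance": 1.0}
p_before = user_model.predict_proba_one(romance)
for _ in range(5):                       # this user keeps clicking on romance
    user_model.learn_one(romance, 1)
p_after = user_model.predict_proba_one(romance)

print(round(p_before, 3), "->", round(p_after, 3))
```

The base model is untouched by the user's clicks; only the per-user copy moves, and the higher learning rate is what makes a handful of interactions noticeably shift its predictions.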

I don't know how YouTube does it, for instance, but I do imagine they have some sort of core model that's just learning how to make good recommendations. But, obviously, on YouTube, there are some rules that make it so that, you know, recommendations are tailored to each user. And I don't know if that is done online, and I don't know if it's actually machine learning. It's probably just rules or scores. But, yeah, I think it's a really fun idea to play around with, and I do think that online learning enables this even more.

As far as the operational model for people who are using online and streaming machine learning, if they're coming from a batch background, they're going to be used to dealing with the train, test, deploy cycle where I have my dataset. I build this model. I validate the model against the test dataset that I've held out from the training data. Everything looks good based on the, you know, area under the curve or whatever metrics I'm using to validate that model. Now I'm going to put it into my production environment, where maybe it's being served as a Flask app or a FastAPI app. And then I'm going to monitor it for concept drift, and eventually, I say, okay, this is no longer performing up to the specification. So now I need to go back and retrain the model based on my updated datasets. And I'm wondering what that process looks like for somebody building a streaming ML model with something like River and how you address things like concept drift and, you know, how concept drift manifests in this streaming environment where you are continually learning and you don't have to worry about, you know, the real world data that I'm seeing being widely divergent from the data that I used to train against.

There's so many things to dig into, and I'll try to give a comprehensive answer. So first off, it's important to understand that River itself is to online learning what scikit-learn is to batch learning. So it only aims to be a machine learning library. Right? It just contains, basically, algorithms and routines to train a model and to have models that can learn and predict. And what you're getting at with your question is MLOps. So what does the life cycle look like for an online model? And this is something that I'm spending a lot of time looking into.

The first part of the answer is that online learning enables different patterns, and I believe that these patterns are simpler to reason about. So as you said, you usually start off by training a model, then evaluating it against a test set, and maybe going to report to your stakeholders and show them the performance and guarantee that, you know, the percentage of false positives is underneath a certain threshold, so yes, we can diagnose cancer with this model, or not. And then you kind of deploy it, maybe, if you get lucky, if you get the approval, and you sleep well or not well at night depending on how much you trust your model. But there's this notion that you deploy a model, and it's like a baby in the world, and this baby is not going to keep learning. So, you know, it's a lie to believe that if you deploy a batch model, you're going to be able to just let it run by itself. There's actually many things that have to happen there. The reality is that any machine learning project, you know, any serious project, is never finished. It's like software, basically. We have to think of machine learning projects as software engineering. And obviously, we all know that you never just deploy a software engineering feature and then never look at it. You monitor it, you take care of it, investigating bugs and whatnot. So batch learning in that sense is a bit difficult to work with, because if your model is drifting, meaning that its performance is dropping because the data that it's looking at is different from the training set it was trained on, you basically have to be very lucky for your model to recover its performance. So you're gonna have to do something about it. And, yeah, you can just retrain it.

But what you do with online learning is that you can have the model just keep learning as you go. So there is no distinction between training and testing. What online learning encourages you to do is to deploy your model as soon as possible. Say you have a model, and it hasn't been trained on anything. Well, you can put it into production straight away. When samples arrive, it's gonna make a prediction. So maybe a user arrives on your website, you make recommendations, and that's your prediction. And then your user is going to click or not on something you recommended to her, and that's gonna be feedback for your model to keep training. So that is already a guarantee that your model is kind of up to date and kind of learning. And that's really interesting because it just enables so many good patterns. You can still monitor the performance of your model. If the performance of your online model is dropping, I mean, I haven't seen that yet, but it probably means that your problem is really hard to solve.

So the really cool thing I stumbled upon was this idea of test then train. Imagine the scenario where you have a classification model that is running online. What would happen is that your model is generating features. Say the user arrives on the website and the features are: what's the time of the day? What's the weather like? What are the top films at the moment? These are features that you have at a certain point in time, and you generate them at prediction time. And then later on, you get the feedback. So was your recommendation a success or not? That's training data for your model. You use the same features for training that you used for the prediction. And so you can see here that there's a clear feedback loop. The event happens. The user comes on the website. Your model generates features. And then at some later point in time, the feedback arrives. So was the prediction successful or not? And if not, how large was the error? And then, yeah, you can take this feedback, join it with the features that you generated when predicting, and use that as training data. So you essentially have, like, a small queue or database that's storing your predictions, your features, and the labels that make up your training data.

So the big difference here is that you do not necessarily have to do a train-test phase before deploying your model. You can actually just deploy your model initially, and it just learns online, and then you can monitor it.
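The feedback loop just described (predict, park the features, join them with the label when it arrives, score, then learn) can be sketched as follows. This is a plain-Python illustration, not River code: `OnlineService`, `MostFrequentLabel`, the event ids, and the features are hypothetical, an in-memory dict stands in for the "small queue or database", and the toy model deliberately ignores its features to keep the sketch short.

```python
from collections import Counter


class MostFrequentLabel:
    """Toy online classifier: ignores features, predicts the commonest label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


class OnlineService:
    """Predict now, learn later once the label for that event arrives."""

    def __init__(self, model):
        self.model = model
        self.pending = {}   # event_id -> (features, prediction) awaiting feedback
        self.n_scored = 0
        self.n_correct = 0

    def predict(self, event_id, features):
        prediction = self.model.predict_one(features)
        self.pending[event_id] = (features, prediction)
        return prediction

    def feedback(self, event_id, label):
        # Join the label with the exact features used at prediction time.
        features, prediction = self.pending.pop(event_id)
        # Test first: score the prediction made before the model saw this label.
        self.n_scored += 1
        self.n_correct += int(prediction == label)
        # Then train.
        self.model.learn_one(features, label)

    @property
    def accuracy(self):
        return self.n_correct / self.n_scored if self.n_scored else 0.0


service = OnlineService(MostFrequentLabel())   # deployed with no prior training

service.predict("e1", {"hour": 9})     # nothing learned yet, so no useful prediction
service.predict("e2", {"hour": 21})
service.feedback("e1", "click")        # feedback arrives some time later
service.predict("e3", {"hour": 10})    # made after the model learned from e1
service.feedback("e2", "click")
service.feedback("e3", "click")

print(service.n_correct, "of", service.n_scored, "predictions were correct")
```

Because every prediction is scored before the model learns from its label, the running accuracy is an honest monitoring metric even though no separate test set exists.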

A really cool thing is that if you do this, you have a log of people coming to your website, you making predictions, you generating features, and people clicking around and interacting with your recommendations. This creates a log of what's happening on your website. And what's really cool with this log is that offline, after the fact, you can process it in the same order it arrived in, and you can replay the history of what happened. It means that on the side, when you're redeveloping a model or you want to develop a better model, you can just take this log of events, run through it, and do this prediction and training dance, the whole life cycle. You know, you're replaying the feedback loop, and then you have a very accurate representation of how your model would have performed on that sequence of events. That's really powerful, because the way you're designing your model there is that you have a rough sketch of a model, which you deploy, and then you have a log for that model. So you can evaluate the performance of that model, but more importantly, you have a log of the events. And then when you're designing version 2 of your model, you have a very reliable way to estimate how your new model would have performed.
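Replaying a log to compare a deployed model with a candidate can be sketched like this. It is a plain-Python illustration with a made-up log and two hypothetical toy models (`MostFrequentLabel` as version 1, `PerSegmentLabel` as version 2); a real log would also carry timestamps and would replay the delay between prediction and feedback, which this sketch leaves out.

```python
from collections import Counter, defaultdict


class MostFrequentLabel:
    """Version 1: ignores features, predicts the commonest label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


class PerSegmentLabel:
    """Version 2: keeps one label counter per value of a chosen feature."""

    def __init__(self, key):
        self.key = key
        self.counts = defaultdict(Counter)

    def predict_one(self, x):
        seen = self.counts.get(x[self.key])
        return seen.most_common(1)[0][0] if seen else None

    def learn_one(self, x, y):
        self.counts[x[self.key]][y] += 1


def replay(model, log):
    """Run the predict-then-learn loop over a log, in arrival order."""
    correct = 0
    for features, label in log:
        correct += int(model.predict_one(features) == label)  # test first
        model.learn_one(features, label)                       # then train
    return correct / len(log)


# Made-up event log: evening visitors click, morning visitors skip.
log = [
    ({"period": "evening"}, "click"),
    ({"period": "morning"}, "skip"),
    ({"period": "evening"}, "click"),
    ({"period": "morning"}, "skip"),
    ({"period": "evening"}, "click"),
    ({"period": "morning"}, "skip"),
]

acc_v1 = replay(MostFrequentLabel(), log)
acc_v2 = replay(PerSegmentLabel("period"), log)
print(acc_v1, acc_v2)
```

Both candidates start untrained and see the events in the same order production did, so the two numbers are directly comparable estimates of how each version would have performed.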

And that's really cool, because when you are doing train-test splits in batch learning, that is not representative of the way the world works. The whole problem with train and test is that people are spending so much time making sure that their train-test split is correct, when in fact, even a good train-test split is not a good proxy for the real world. A good proxy for the real world is to just replay through history. And that's something that you can only do with online learning. That's really cool.

Now to come to your point about concept drift. There's many different kinds of concept drift, and Chip Huyen has a really good post about it on her blog. What matters really is that the result of concept drift is usually that your model is not performing as well. Right? There's gonna be a drop in performance. And so the first thing you see in your monitoring dashboard is that a metric has dropped. And then when you dig into it, you see that maybe there's a class imbalance, or that the correlation between a feature and a class has changed, or something like that. So it's essentially saying that the data the model has been trained on is not representative of the new data that is being seen in production.

But again,

I have said this a few times, but online models,

if you put them in place with the correct MLOps setup, they are able to learn as soon as possible. So that just guarantees that your model is as up to date as possible. So you're basically really doing the best you can.

So drift is always possible. You can always obviously have a model that's degrading or that's just going haywire,

that's not related necessarily to the

online learning aspect of things. And so there are also ways to cope with this. So

for instance, Dan Crankshaw and his team at Berkeley developed a system called Clipper. It's kind of an ops tool. It's a research project, and I think it's been deprecated, but the ideas are still there. It's a project where they have a meta model, which is kind of looking at many models being run in production and deciding online which model should be used for making predictions. So it's kind of like a teacher

selecting the best student at a certain point in time and, you know, kind of seeing throughout the year

how the students are evolving and, like, which students are getting better or not.

And so you can do this with bandits, for instance.

But yeah. So just to say that there are many ways to deal with concept drift, and online models, again, help to cope with concept drift in a way that, actually, just makes more sense than batch models.
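The teacher-and-students idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Clipper's actual design: the `EWMA` toy model and the `BestModelSelector` class are hypothetical names, and the selector simply serves whichever candidate has the lowest exponentially decayed error, which is a simpler rule than a bandit.

```python
class EWMA:
    """Toy online regressor (hypothetical): exponentially weighted moving average."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = 0.0

    def predict_one(self, x):
        return self.value

    def learn_one(self, x, y):
        self.value += self.alpha * (y - self.value)


class BestModelSelector:
    """Hypothetical meta model: serves the candidate with the lowest recent error."""

    def __init__(self, models, decay=0.9):
        self.models = models
        self.decay = decay
        self.errors = {name: 0.0 for name in models}

    def best(self):
        return min(self.errors, key=self.errors.get)

    def predict_one(self, x):
        return self.models[self.best()].predict_one(x)

    def learn_one(self, x, y):
        for name, model in self.models.items():
            # Score each candidate on its prediction *before* it sees the label.
            error = abs(y - model.predict_one(x))
            self.errors[name] = self.decay * self.errors[name] + (1 - self.decay) * error
            model.learn_one(x, y)


selector = BestModelSelector({"slow": EWMA(alpha=0.05), "fast": EWMA(alpha=0.5)})

# The target drifts from 10 to 50 halfway through the stream.
for y in [10.0] * 20 + [50.0] * 20:
    selector.learn_one({}, y)

print(selector.best())  # prints "fast"
```

After the drift, the fast-adapting candidate accumulates less error, so the selector switches to it without any retraining step.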

And so digging now into River itself, can you talk through how you've implemented that framework and some of the

design considerations that went into how do I think about exposing this

online learning capability in a way that is accessible and understandable to people who are used to building batch models?

So I like to think of River more as a library than a framework. If I'm not mistaken, a framework kind of forces you into a certain behavior or way to do things, and there's an inversion of control where the framework is kind of designing things for you.

So, you know, if you look at Keras and PyTorch, Keras is much more of a framework in comparison to PyTorch. For me, the reason why PyTorch was successful is that it kind of gave control back to the user. You can do so many things in PyTorch, and it's very flexible and doesn't really impose a single way of doing things. So we had that in mind with River. River, again, is just a library to do online machine learning. It just contains the algorithms. It doesn't really force you to, you know, read your data in a certain way. You could use it in a web app. You could use it offline. You could use it on an offline IoT sensor. River is not concerned with that. It's just a library that is agnostic with regards to that.

So now to come to what River is: in terms of online machine learning, it is general purpose. So it's not dedicated to anomaly detection or forecasting or classification. It covers all of that. That's the ambition at least. So just to note, it's actually really hard to develop and maintain because

the other maintainers and I are not actually specialized in the different domains, and, you know, one day I'm going to be working on the forecasting module, and the next day I'm gonna work on anomaly detection, and it's kinda crazy. But it's still fun.

What we do provide is a common interface. So just like scikit-learn, every piece of the puzzle in River follows a certain interface. So we have transformers, we have regressors,

we have anomaly detectors.

Of course, we have classifiers,

so binary and multi class. We have forecasting models and time series. And so

we guarantee to the user that each model follows a certain

API.

So every model is gonna have a learn method, so it can just learn from new data.

And they usually have a predict method to make predictions.

And so forecasters will have a forecast method.

Anomaly detectors will have a score method, which outputs, so to say, an anomaly score. And so the strength of River is to, yeah, provide this consistent API for doing online machine learning. And it's a bit opinionated, well, it's just like scikit-learn really. It just says, okay, you're gonna have learn and predict, but that's a reasonable thing to impose.
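To make the "learn and predict" contract concrete, here is a minimal sketch in plain Python. The method names `learn_one` and `predict_one` follow River's per-sample naming, but the `OnlineRegressor` protocol and the two toy models are hypothetical stand-ins written for this illustration, not River classes.

```python
from typing import Protocol


class OnlineRegressor(Protocol):
    """The contract: learn from one sample, predict for one sample."""

    def learn_one(self, x: dict, y: float) -> None: ...

    def predict_one(self, x: dict) -> float: ...


class MeanRegressor:
    """Toy model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


class LastValueRegressor:
    """Toy model: predicts the most recent target."""

    def __init__(self):
        self.last = 0.0

    def predict_one(self, x):
        return self.last

    def learn_one(self, x, y):
        self.last = y


def run(model: OnlineRegressor, stream):
    """Predict-then-learn loop that works with any model following the contract."""
    predictions = []
    for x, y in stream:
        predictions.append(model.predict_one(x))
        model.learn_one(x, y)
    return predictions


stream = [({"hour": 8}, 10.0), ({"hour": 9}, 20.0), ({"hour": 10}, 30.0)]
mean_preds = run(MeanRegressor(), stream)       # [0.0, 10.0, 15.0]
last_preds = run(LastValueRegressor(), stream)  # [0.0, 10.0, 20.0]
```

Because both models expose the same two methods, the surrounding loop does not change when one is swapped for the other.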

And that makes it easier for users to

switch in new models because they have the same interface. And so, again, just to conclude on what I said at the start, we made the explicit choice to follow the single responsibility principle, in that River only manages the machine learning aspect of things and not the deployment and whatnot. And so if you want to use River in production, and we see people doing this, you have to worry about some of the details yourself. Like, if you want to deploy in a web app, well, we do not help at the moment at all with that. You have to deploy your own web app.

As far as the overall design of the framework, you mentioned that it actually started off as 2 separate projects, and then you went through the long process of merging them together. I'm wondering how have the overall design and goals of the project changed or evolved since you first started working on this idea?

The reason actually why the merger between Creme and scikit-multiflow took a certain time was that although we were both online learning libraries, there were some subtle differences which were kind of important. So my vision with Creme at the time, and with River now, is that we should only cater to models which are what I call pure online models, in that they can learn from a single sample of data at a time. But there are also mini-batch models, so models which can learn from streaming data but in chunks, so in mini-batches of data. And scikit-multiflow was kind of doing this, much like PyTorch and TensorFlow and, you know, deep learning models.

And so I kind of had to convince them that

there were reasons why it was just a bit better.

Why? Because,

you know, if you think about, again, a user on a website or just any web requests or things that are happening in real life,

you want to learn as soon as possible.

You don't want to have to wait

for,

you know, 32 samples to arrive to have a batch to be able to feed that to your model. You could obviously, but it just made sense to me to have something simpler where we only

care about pure online learning. Because it means that you don't have to store anything,

you just

learn on the fly. And so I guess

the interaction I had with scikit-multiflow kind of confirmed this idea.

And

I guess, you know, they were a bit doubtful when we did the merger because and maybe I was a bit too opinionated, but history proved that it actually made sense, and it's not a decision that we look back on. Like, we're really happy with this now. So River has,

you know, arguably moderate success. So it's working. It's alive. It's breathing. It's been going on for

2 years and a half, the project.

And so we have a steady intake of

users that are adopting it, and you know, we can we see this from emails we receive, from GitHub discussions and issues, and just general feedback we get. So

the general idea of having a library that is only focused towards

ML

and just the algorithms is something that we are just gonna keep going with, because it just it looks like it's working, it looks like this is what people want.

You know, a simple example is,

hey. I want to compute a covariance matrix online.

Well, River aims to be the go-to library to answer those kinds of questions. Right?

But the truth is that people don't just need that. They also need ways to deploy these models and do MLOps online. So, well, basically the next step for us is to build new tools in that direction.

And we also think that the initial development of River was a bit fast and furious.

The aim was to implement as many algorithms as possible and, you know, just to cover the wide spectrum of

machine

learning. Now we've, you know, covered quite a few topics, and we also have day jobs. So when I was developing River initially, I was doing a PhD, so ironically, I had more time than now because I have a proper job. But we value our time a bit more, and we're not in this fast and furious mode. We kind of just focus on picking certain models which we see value in and just spending time to implement them properly. And the final aspect is that we see that our user base does not just want algorithms. They also want us to educate them. So they have general questions such as, you know, what is online learning and how do I do it and how do I decide what model to use and, you know, all the questions that we're covering in this podcast, basically. So I think there's a huge need for us to kind of move into an educational aspect. So

when I was younger, scikit learn was my bible. Like, I would just spend so much time not even using it, not even just using the code, but actually just reading through the documentation because it's just so excellent.

So

obviously, that takes a lot of time, a lot of energy, people, contributors, and help, but it's definitely something towards which we are moving.
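The "compute a covariance matrix online" example mentioned above can be sketched without storing any samples. This is a minimal illustration in plain Python using a Welford-style update; the class names `OnlineCovariance` and `OnlineCovarianceMatrix` are hypothetical and this is not River's implementation.

```python
from itertools import combinations_with_replacement


class OnlineCovariance:
    """Running sample covariance between two variables (Welford-style update)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # running sum of co-deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the *old* mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)  # uses the *new* mean of y

    def get(self):
        return self.c / (self.n - 1) if self.n > 1 else 0.0


class OnlineCovarianceMatrix:
    """Covariance matrix over named features, updated one dictionary at a time."""

    def __init__(self):
        self.cells = {}

    def update(self, features):
        for a, b in combinations_with_replacement(sorted(features), 2):
            self.cells.setdefault((a, b), OnlineCovariance()).update(features[a], features[b])

    def get(self, a, b):
        key = (a, b) if a <= b else (b, a)  # the matrix is symmetric
        return self.cells[key].get()


matrix = OnlineCovarianceMatrix()
for sample in [{"x": 1, "y": 2}, {"x": 2, "y": 4}, {"x": 3, "y": 6}, {"x": 4, "y": 8}]:
    matrix.update(sample)

print(matrix.get("x", "y"))  # close to 10/3
```

Each update touches only a handful of running sums, so memory use does not grow with the length of the stream.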

In terms of the

overall process of

building the model using something like River,

When people are building a batch model, they end up getting a binary artifact out that is the entire state of that model after it has gone through that training process.

And I'm curious if you can talk to how

River manages the stateful aspect of that model as it goes through this continual learning process, both

in a, you know, sandbox use case where somebody is just playing around on their laptop, but also as you push it into production

where maybe you want to be able to use this model and scale out serving it across a fleet of different servers and just some of the

state management that goes into being able to continually learn as new information is presented to it?

So

batch learning the great advantage of batch learning is that once you train your model,

it's essentially a pure function.

There are no side effects.

The,

you know, decision process that's underlying the model is not gonna change. So,

you know, you can push the envelope and compile it. You can pickle it. You can convert it to another format. So that's what ONNX does.

You can

compile it so that it can run on a mobile device. I mean, it does not need the ability to

train anymore. It's just basically a Python function. Or just it's just a function basically that takes an input and outputs something. So there's also a good reason why batch learning is predominant.

But with River, it's different because online models, they need to keep this ability

to learn. So that's what you've been saying. So it's actually

kind of straightforward,

but

the internal representation of most models in River is

fluid,

dynamic. It's usually stored in dictionaries that can

increase and decrease in size. So

imagine you have a new feature that arrives in your stream,

Well, every model in River will cope with that. They're not static. If a new feature appears, well, they handle it gracefully. So for instance, a linear regression model is

just going to add a new weight to its internal

dictionary of weights. Now in terms of serialization

and pickling and whatnot,

River is mostly written

well, basically, River stands on the shoulders of

Python,

very much so. So we do not depend very much on NumPy or Pandas or SciPy.

We mostly depend on Python standard library. We use dictionaries

a lot,

and that plays really nicely with the standard library. It's very easy just to you can take any River model, pickle it, and just save it. You can also just dump it to JSON or whatnot.

Also,

the paradigm of you train a model, you pickle it, and you have an artifact that you can upload anywhere, it's a bit different with online learning because you would play this differently. You would

maintain your model in memory.

So if you have a web server serving your model,

you would not just load the model to make a prediction, you would just keep it in memory

and make a prediction of it because it's in memory, so you don't have to load it anymore,

preload it.

And then when a sample arrives, your model is in memory. You can just make it learn from that. So,

yeah, I think the big difference is that you hold your model in memory rather than pickling it to disk and loading it when necessary.
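The idea of weights living in a dictionary that grows with the stream can be illustrated as follows. This is a toy stochastic-gradient linear regression written for this example, with hypothetical names (`TinyLinearRegression`, the `distance_km` and `raining` features); it is not River's implementation, only a sketch of the mechanism described above.

```python
import json


class TinyLinearRegression:
    """Toy online linear regression whose weights live in a plain dictionary."""

    def __init__(self, lr=0.01, weights=None, bias=0.0):
        self.lr = lr
        self.weights = dict(weights or {})
        self.bias = bias

    def predict_one(self, x):
        # Unknown features simply contribute nothing to the prediction.
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        error = self.predict_one(x) - y
        for k, v in x.items():
            # A feature seen for the first time gets a new entry on the fly.
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


model = TinyLinearRegression()
model.learn_one({"distance_km": 2.0}, 8.0)
model.learn_one({"distance_km": 3.0, "raining": 1.0}, 15.0)  # new feature appears
print(sorted(model.weights))  # ['distance_km', 'raining']

# Because the state is plain Python data, it can be dumped to JSON and restored.
state = json.dumps({"lr": model.lr, "weights": model.weights, "bias": model.bias})
restored = TinyLinearRegression(**json.loads(state))
```

In a serving process, the `model` object would stay in memory between requests, and a dump like `state` would only be needed for checkpoints or restarts.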

In terms of the use of the dictionary as that internal state representation,

As you said, it gives you the flexibility

to be able to evolve with the updates and data.

But at the same time, you have this

heterogeneous

data structure that can be mutated

as the system is in flight, and you don't necessarily

have a

strict schema being applied to it. And I'm just curious if you can talk to the trade offs of

being able to add that flexibility, but also lacking in some of the validation

and, you know, schema and structure information that you might want in something that's dealing with these volumes of data?

So, yeah, River uses dictionaries. The advantages of dictionaries are plentiful. First of all, a very important thing is that dictionaries are to lists what pandas data frames are to NumPy arrays. So a dictionary has names, and that's really important. So it means that each one of your features actually has a name to it. And I find that hugely important because,

you know, we always see features as just numbers, but they also have names, and that's just really important.

Imagine you have a bunch of features coming in. Now if that was a list or a NumPy array, you have no real way of knowing which column corresponds to each variable. If you

switch 2 columns

with each other,

that could just be a really silent bug, which could really affect you. Whereas if you name each feature,

if the column order changes, well the names of the columns are being permuted too, so you can kind of identify that. So

what's really cool with dictionaries and and that works with River is that the order of the features that you're receiving doesn't matter, because we access every feature by name and not by position.

Dictionaries are also mutable in size. So, you know, if a new feature arrives or a feature disappears between 2 different samples, that just works.

It's really cool also that dictionaries, when you think about it, are naturally sparse. So imagine that on the Netflix project, the features that you receive are the name of the user, colon, 1, or, you know, the date, colon, 1. Yeah. You can just store sparse information in a dictionary. That's kind of really useful.

There's this robustness principle that we follow with River. The robustness principle is to be conservative in what you do, but liberal in what you accept. So River is very liberal in that it accepts heterogeneous data, as you said. So dictionaries that are different in size, dictionaries which have different orders or whatnot, and that is really flexible for users. So a common

use case is to deploy River in a web app, and in the web app, you're receiving JSON data a lot of the time. So the fact that JSON data has a 1 to 1 relationship with Python dictionaries makes it really easy to integrate River into a web app. Whereas if you have a regular batch model, you have to mess about with

casting

the JSON data to a NumPy array,

and, you know, that has a cost actually. It actually has a cost because

although

NumPy, Torch, and TensorFlow are, you know, good at processing matrices, there's actually a cost that comes with taking native data, such as dictionaries, and casting them to a higher order data structure, such as a NumPy array. In a web app where, you know, you're reasoning in terms of milliseconds, well, you're spending a lot of your time just converting your JSON data to NumPy.

Whereas with River, because it consumes dictionaries, well, the data you receive, I don't know if you're coding in Django, Flask, or FastAPI, the data you receive in your request is a dictionary, so you don't have to convert the data. It just runs. So actually, if you take a linear regression in River and a linear regression in Torch, on a single sample it's actually gonna be much faster in River because there's no conversion cost.

Plus the features are named, and you don't have to worry about the features being in a mixed order or anything. So it just makes a lot of sense really, dictionaries in that sense. Now the pitfalls. Obviously, it's not perfect, but I kinda disagree that there's a problem with dictionaries.

I actually think that dictionaries

well, if you wanted to, you could create like if you're in Python, you can actually use a data class and you can convert that to a dictionary and feed that into

your model. The data class helps you to create structure.

So I don't think that's really a problem. Quite the contrary, I think. The fact is that also a dictionary can be nested. So maybe the features that you're feeding to the model don't have to be a flat dictionary. They can actually be nested, and that's really cool too. You know, you can have features for your user. You can have features for your page, features for the day, anything.

Things that you cannot necessarily do with a flat

structure such as a data frame or NumPy array. Anyway, I'm talking about benefits, I should be talking about cons.

But yeah, I guess just the main con of

processing dictionaries is that, you know, if you wanted a River model to process a million samples, it would take much more time than processing a million samples with a pandas data frame or NumPy array.

Because, yeah, the point of River is to be fast at processing 1 sample at a time, but not necessarily at processing a million samples at a time. But those are 2 different problems.

So, you know, if you take Torch or scikit-learn, their goal is to be able to process offline data really quickly, whereas the goal of River is to process online data, single samples, as fast as possible. And you're comparing apples and oranges if you wanna do the comparison. It just doesn't work. So yeah. Actually, you know what? I don't think there are downsides to using dictionaries. It just helps a lot. And to confirm this, we have a lot of users who tell us about this. They say, well, it's actually fun to use River. It just makes sense because it's very close

to the data structures that I use in Python. I don't have to introduce a new data structure to my system.
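Two of the points above, that features are accessed by name rather than position and that a data class can supply structure before conversion to a dictionary, can be shown in a short sketch. Everything here is hypothetical (the `TripFeatures` class, the feature names, and the hand-written weights); the `predict_one` function is a stand-in for a model, not a River call.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TripFeatures:
    """Optional schema: a data class gives structure, asdict() gives the dictionary."""

    distance_km: float
    passengers: int
    hour: int


# Hand-written weights, keyed by feature name (values chosen for illustration only).
weights = {"distance_km": 2.0, "passengers": 0.5, "hour": 0.25}


def predict_one(x: dict) -> float:
    # Features are looked up by name, so the order of keys in x does not matter.
    return sum(weights.get(name, 0.0) * value for name, value in x.items())


# A JSON request body maps directly to a dictionary, in whatever key order it arrives.
payload = json.loads('{"hour": 8, "passengers": 2, "distance_km": 3.5}')
from_json = predict_one(payload)

# The same features built through the data class, in a different order.
from_dataclass = predict_one(asdict(TripFeatures(distance_km=3.5, passengers=2, hour=8)))

print(from_json, from_dataclass)  # 10.0 10.0
```

With a positional array, swapping two columns would silently change the prediction; here the two inputs give the same result because the names travel with the values.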

So for somebody who is using

River to build a machine learning model, can you just talk through the overall process of going from idea through development to deployment?

I'm going to rehash what I said before, but I think the great benefit of online learning and River is that you can cut the R&D phase. So I've seen so many projects where there's an R&D phase, and the model, you know, gets validated at some point in time, but there's like a really big gap of time between the start of the R&D phase and the moment when the model is deployed. And the process of using River and any streaming model in general is to actually, as I said, deploy the model as soon as possible,

monitor its predictions,

and

it's okay because sometimes that model, well, you know, you can

deploy

it in production, and those predictions do not necessarily have to be served to the user.

So you just make the predictions, you monitor them,

and it creates a log for you, again, of training data and predictions and features. And that's what you call shadow deployment. You

have a model which is

deployed, is making predictions, but those predictions are not being used, you know, to inform decisions or to change or to influence the behavior of users. They just exist for the sake of existing and for monitoring.

1 thing to mention is that once you deploy this model, you have your log of events.

That's the phase where you want to

maybe design a new model. And you're going to have this model replace the existing model in production or coexist with it because you have a meta model or not. So I mentioned that you can

take your log of events, replay it in the order in which it arrived, and

have a good idea of

how well your model would have performed.

That's called progressive validation.

So it's just this idea that if you have your log of events, for every sample you're first gonna make a prediction, and then you're gonna learn from it. So I have a good example.

There's a data set on Kaggle called the New York Taxi data set, and it's basically a log of people hailing a taxi for a ride, and they depart from a position,

and they arrive at another position later in time. And so the goal of a machine learning system in this case could be to predict how long the taxi trip is going to last. So

when the taxi departs,

you want your model to make a prediction.

How long is this taxi ride gonna last? And so maybe that's gonna inform, I don't know, the cost of the

trip, or it's gonna help decision makers, you know, rear behind the taxis, or I don't know, whatever.

But you can imagine that this is a great feedback loop because you have your model making a prediction, and then later, maybe 18 minutes later or something, the ground truth arrives. So you know how long the taxi trip actually lasted, and that's your ground truth. And then you can compare your prediction with the ground truth. And that enables progressive validation because you have a log of events. You have when the taxi trip departs,

what was the value my model predicted,

what features I used

at prediction time,

and later on, I had the ground truth. And so I can just replay

the logs of events for, you know, I don't know, 7 days and

progressively evaluate my model. So I like this taxi example because it's

easy to reason about, and, you know, taxis are easy to understand. But the taxi example is really what online learning is about. It's about the feedback loop between predicting

and learning.

And just to remind you, how it would be with a batch model is that you would have your taxi data set and, well, I don't know, you would split your data in 2. You would have start of the week, end of the week, train your model on the start of the week, evaluate on the rest of the week. Oh, no. The data I trained on for the start of the week is not representative of the weekend.

Yeah. And it just becomes a bit weird. It becomes this situation where you're trying to reproduce real-life conditions, but you're never really sure of it. And you can only really know how well your batch model is going to do once it's in production.

And online learning, it just kind of encourages you to

go for it, to deploy your model straight away and not have to have this weird R&D phase where you live in a lab. You think you might be right, but then you're not really sure. And yeah, online learning just brings you closer to reality,

in my opinion.
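The taxi example above, where a prediction is made at departure and the ground truth only arrives later, can be replayed with a few lines of plain Python. This is a minimal sketch of progressive validation under stated assumptions: the event log format, the `MinutesPerKm` toy model, and the `progressive_validation` helper are all hypothetical and written for this illustration.

```python
class MinutesPerKm:
    """Toy online model: learns an average duration per kilometre."""

    def __init__(self):
        self.n = 0
        self.rate = 0.0

    def predict_one(self, x):
        return self.rate * x["distance_km"]

    def learn_one(self, x, y):
        self.n += 1
        self.rate += (y / x["distance_km"] - self.rate) / self.n


def progressive_validation(events, model):
    """Replay a log in arrival order: predict at departure, learn at arrival."""
    pending = {}  # trip_id -> (features, prediction made at departure time)
    total_abs_error, n = 0.0, 0
    for kind, trip_id, payload in events:
        if kind == "departure":
            # Only the information available at departure is used to predict.
            pending[trip_id] = (payload, model.predict_one(payload))
        else:  # "arrival": the ground truth (duration in minutes) is now known
            features, prediction = pending.pop(trip_id)
            total_abs_error += abs(payload - prediction)
            n += 1
            model.learn_one(features, payload)
    return total_abs_error / n if n else 0.0


events = [
    ("departure", "a", {"distance_km": 2.0}),
    ("departure", "b", {"distance_km": 4.0}),
    ("arrival", "a", 8.0),
    ("departure", "c", {"distance_km": 5.0}),
    ("arrival", "b", 16.0),
    ("arrival", "c", 20.0),
]

mae = progressive_validation(events, MinutesPerKm())
print(mae)  # 8.0
```

Trips `a` and `b` are predicted before the model has learned anything, so they contribute errors of 8 and 16 minutes; trip `c` departs after the first ground truth arrived and is predicted exactly. The resulting mean absolute error reflects what the model would actually have known at each point in time.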

As you have been developing this project and helping people

understand and adopt it, what do you see as some of the

conceptual

challenges or complexities

that people

experience

as they're starting to adapt their thinking to how to build a machine learning model in this online streaming format versus

the batch oriented workflow where they do have to think about the train test split, just the overall shift in the way that they think about iterating on and deploying and developing these

models? It's a hard question, but I think there's 2 aspects. There's the online learning aspect, and then there's the MLOps aspect.

Now in terms of MLOps,

I think I covered enough, but it's much like a batch model. You have to deploy your model, which means maybe serving it behind a web app. As I mentioned, the ideal situation is to have your model loaded into memory, making predictions and training.

But all that is really easier said than done.

The truth is that there's actually no framework out there which allows you to do this. You could do this yourself, and this is what we see. I mean, we get users who ask us questions with a bit of context, on GitHub or in emails, and they're asking us, how do I deploy my model? What should I be doing? And we always give the same answers.

But

the fact is that, you know, we have these users who have basically embraced River and they understand it, but then they get into the production phase. And, well, we feel bad because they are all making the same mistakes in some way, and River is not there to help them because that's not the purpose of River.

So, yeah, there's a lack of tooling to actually just, you know, deploy an online model.

So, yeah, that's the MLOps aspect. I think in terms of online learning,

a big challenge is that not everyone has the luxury to have a

PhD during which you can spend days and nights

going through online learning papers and trying to understand it, and that's

what I and others had the chance to do. But

a lot of our users,

you know,

they see the value of online learning, and they want to

put it into production, but they have deadlines to meet. Right? They have to ship their project in 6 weeks, and they just don't have the time to understand things in detail. So

things like what I just described, progressive validation,

it kind of takes them a bit of time to understand.

And so, again, what we need to do is to

spend more

time creating resources or, you know, just diagrams to explain

what online learning is about. And in terms of library design, that's really important. Like, if we wanted to introduce a new method to all our estimators, I would be against that. Like, the whole point of River is to make it as simple as possible so that people can just, you know, be productive, understand it. So, yeah, just to recapitulate, those 2 problems are that people do not necessarily have the

resources to learn about online learning, and then

there are operational problems around

serving these models into production. So

it's kind of like a batch model because you have to serve your model behind an API,

you know, and you have to monitor it. And these are things, well,

you know, that are common to a batch model, but there's the added complexity of having your model being, you know, maintained in memory and keep learning, things that are basically not common. I mean, if you Google it or look for something on GitHub, you just kind of find these hacky projects, but no really good tool that allows you to do that, at least not yet.

And 1 of the things that we didn't discuss yet is the types of machine learning use cases that River supports, where I'm speaking specifically to things like logistic regressions and decision trees versus deep learning and neural networks.

And I'm just wondering if you can talk to the

types

of machine learning approaches that River is designed to support and some of the

reasoning that went into where you decided to put your focus.

River, again, is a general purpose library, so it covers quite a few things.

There are

some cases

or flavors of machine learning, which are especially

interesting

when you cast them in an online learning scenario. So

if you're doing anomaly detection,

so for instance, you have

people doing transactions in a banking system, so they're making payments,

and you might want to be doing anomaly detection to detect fraudulent

payments.

That is very much a

situation where you have streaming data.

And so

in that case, you would like to be doing online anomaly detection. So

we see that every time we put out a notebook or

a new anomaly detection method,

a lot of people start using it. We start having bug reports and

and whatnot. So it's kind of surprising, but it's a good thing. But,

yeah, I think there are modules and aspects of River which clearly bring a lot of value to users. So that would be anomaly detection, but also we have forecasting models. So when you do online forecasting, that just makes sense. You have sensors which are, I don't know, measuring the temperature of something. People want to do that in real time. There's also a good example I have: we have this engineer who's working on water pipes in Italy.

He's trying to

predict how much water is going to flow through certain points in his pipeline.

So

he has sensors all over the pipeline, and he's trying to just do a forecasting model. And so it just makes so much sense for him to

be able to have his model run online inside the sensors

or inside the IoT systems he's running.

So just all that to say that

there are some more exotic

parts of River, such as anomaly detection and forecasting, which probably bring more value than

the classic models such as, you know, linear regression, classification, regression.

Again, at the start of the show, I I talked about Netflix recommendations.

So we have some

very

basic

bricks to be able to make recommendations.

Well, we have factorization machines, and

we have some kind of ranking system so that if you have users and items interacting, you can kind of build a ranking of preferred items for a user. So we have these kind of exotic

machine learning cases which

provide value,

but require us to spend a lot of time to

work on them basically. So

it's very difficult for me and for other contributors to be specialized in anomaly detection, time series forecasting, and recommendation.

But, yeah, all this to say that River covers a wide spectrum.

You can do preprocessing. You can extract features. You can do classification, regression, forecasting, anything.

And, well, because it's online, it's just a bit unique.

In your experience of

working with the River Library

and working with end users of the tool, what are some of the most interesting or innovative or unexpected ways that you've seen it used?

Well, unexpected is a good 1. There's 1 thing that comes to mind. We have this person who is a beekeeper. So a person who is, you know, taking care of bees and

I guess they once a week or every 2 weeks, they

go to the beehive and they pick up the honey in the beehive.

And this person has many beehives, and so they don't like to waste their time going into the beehive and actually checking if there's honey or not. So they have a sensor. He has sensors that are in each beehive.

They're kinda measuring

how much honey is in each beehive, and he likes to forecast

how much honey

he is expected to have in, you know, the weeks or months to come based on the weather, based on past data, based on

I don't know, what information he uses.

But really, really just fun just to see this person doing this hackish project

where they just thought it would be fun to use online learning to do it.

And again, it was in an IoT context, so that made sense.

As for innovative, I was kind of impressed when I heard about this project of having a, you know, a model within each car to determine where your destination would be. So I don't know. You wake up in the morning, you take your car. You know? Is it Saturday and you're going to the market? Is it a weekday and you're going to work?

It sounds silly, obviously, but this idea of having 1 model per user is kind of fun.

The most impactful project I heard about, and which I know is being used,

is a situation where this company, they

prevent cyber attacks. So

they monitor this grid

of servers and computers,

and they're monitoring

traffic between machines.

And so they're trying to understand

when some of the traffic is malicious

and hackers basically trying to get into a system. So

you can

detect this by looking at the patterns of

this traffic. Right? And the trick is that

behind the traffic,

the malicious traffic, it's hackers, and they're constantly changing their patterns,

so access patterns to actually not be detected. And so if you manage to label traffic

as,

you know,

malicious, well, you want your model to keep learning. So

they have the system where they have, like, thousands of machines, and

they have a few machines that are dedicated to just learning from the traffic and in real time,

adapting,

learning,

detecting

anomalous traffic,

sending it to

human beings so that they can actually verify themselves, label it, etcetera.

And so it's really cool to know that River has been used in that context. Like, it just made so much sense for them to say, wow, we can actually do this online, and we don't have to retrain.

Batch learning was getting in their way.

They had this system which was going a thousand miles an hour, with hundreds of thousands of data points accumulating all the time. And batch learning was just, you know, again, annoying for them. Having a system that would enable them to do all this online just made sense for them. And to know that you can do this with such a high amount of traffic, it was really cool and exciting.
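The loop described here (score each incoming observation, send the suspicious ones to a person, keep learning from the rest) can be sketched in a few lines of plain Python. This is not the company's system and not River's API: the `RunningZScore` class, the threshold, the warm-up length, and the traffic numbers are all assumptions chosen for illustration, and a running z-score stands in for whatever detector they actually use.

```python
import math


class RunningZScore:
    """Hypothetical streaming detector based on a running mean and variance."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score_one(self, x):
        # Distance from the running mean, in standard deviations.
        if self.n < self.warmup:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return 0.0
        return abs(x - self.mean) / std

    def learn_one(self, x):
        # Welford's update: no past observations need to be stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


detector = RunningZScore()
review_queue = []  # observations handed to a human for verification

# Made-up stream of requests per second between two machines.
stream = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 5000, 99, 100]

for t, value in enumerate(stream):
    if detector.score_one(value) > detector.threshold:
        # Suspicious: hold it back for a human label instead of learning from it.
        review_queue.append((t, value))
    else:
        detector.learn_one(value)

print(review_queue)
```

Each observation is scored before the model learns from it, and no retraining over the accumulated history is ever needed, which is the property that made batch learning feel like an obstacle in this setting.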

In your own experience of building the project and using it for your own work, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I think I'm just gonna focus a bit on the human aspect here. Although I have been doing open source, you know, quite a bit, I've always had this approach where I probably work too much on new projects that I make myself rather than on existing projects. So I'd rather just do my own thing than contribute to existing stuff, and that's not always a good thing, but it's just the way I work. And so River is really the first open source project where I work with other people. Like probably many people, a lot of my open source work is stuff I just work on myself. And obviously, you work in companies where you probably have a review process and you work with other people, but this is the first open source project where I really work with a team of people.

And it's fun. It's just really so much fun. Like, just a month ago, we actually got to meet all together and to have this, like, informal reunion.

So that was really fun.

And you realize that, you know, after 3 years, there are ups and downs, and there are moments where you just do not want to work on River anymore. You know, you have work, you have friends, girlfriends, whatnot.

And so the only way to subsist as an open source project in the long term is to have multiple people working on it. So when you're doing open source, you know, it's not realistic to do it on your own if you want something to be successful and to actually have an impact in the long term. So it's actually really important to just be nice and to have people around you who help you. And although not everyone contributes as much as I do or the core maintainers do, people help a lot, and they keep things alive. It's always a joy when I open an issue on GitHub, and I see that someone from the community has answered the question, and I don't have to do anything. It helps tremendously.

Yeah.

We've already talked a bit about some of the situations where online learning might not be the right choice. But for the case where somebody is going to use an online streaming machine learning approach, what are the cases where River is the wrong choice and maybe there's a different library or framework that would be better suited?

Well, yeah. Again, honestly, I think that online learning is the wrong choice in 95% of cases. Like, you do not want to make the mistake of thinking that your problem is an online problem. Most of the time, you probably have a batch problem that you can solve with a batch library.

You know, I mean, scikit-learn, if you open it and you just run it, it's always going to work reasonably well. So sometimes I would just go for that. One thing we do get a lot is people asking how you can do deep learning with River. So they want to train deep learning models online.

So the answer is that we do have a sister library that is called Torchriver, and it's dedicated towards training Torch models online.

But again, that is a bit finicky at the moment and still needs some work.

But yeah. If you want to be doing deep learning and you want to be working with images and sound and, you know, unstructured data, River is not the right choice, even online, and you probably have to be looking at PyTorch.

As you continue to build and iterate on the River project, what are some of the things you have planned for the near to medium term or any applications of this online learning approach that you're excited to dig into?

We have a public roadmap. So it's a Notion page with a list of stuff we're working on. That one mostly has a list of algorithms to implement, and it's mostly there to, you know, let people know what we're working on and to encourage new contributors to work on something.

So the few contributors we have just pick what they want to work on, you know, just in general order of preference.

So for instance, this summer, I've decided to work on online covariance matrix estimation. So if you can actually learn a covariance matrix online, it's kind of useful, because it's very useful in financial trading. And if you have an inverse covariance matrix that you can estimate online, that unlocks so many other algorithms, such as Bayesian linear regression, the elliptic envelope method for anomaly detection, Gaussian processes, and whatnot. So I think I'm still in the nitty gritty details of implementing algorithms and not necessarily applying them to stuff. I'm kind of counting on users to do the applications.
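For readers wondering what estimating a covariance matrix online looks like, here is a minimal sketch in plain Python. It is not River's implementation: the `OnlineCovariance` class and its method names are made up for this example, and it uses the standard Welford-style update of the running means and co-moments. It does not cover the inverse covariance matrix mentioned above.

```python
class OnlineCovariance:
    """Hypothetical streaming estimator of a sample covariance matrix."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = [0.0] * n_features
        # comoment[i][j] accumulates sum((x_i - mean_i) * (x_j - mean_j)).
        self.comoment = [[0.0] * n_features for _ in range(n_features)]

    def update(self, x):
        # One observation at a time: deviations are taken against the mean
        # before and after the update, as in Welford's variance algorithm.
        self.n += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + d / self.n for mi, d in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i in range(len(x)):
            for j in range(len(x)):
                self.comoment[i][j] += delta_old[i] * delta_new[j]

    def covariance(self):
        # Unbiased estimate; needs at least two observations.
        if self.n < 2:
            raise ValueError("need at least two observations")
        return [[c / (self.n - 1) for c in row] for row in self.comoment]


cov = OnlineCovariance(n_features=2)
for x in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
    cov.update(x)

print(cov.covariance())
```

Memory use depends only on the number of features, not on how many observations have been seen, which is what makes this usable on an unbounded stream.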

That's just where things are at the moment. Now, one thing I'm working on in the mid to long term is Beaver. So eventually, I want to try to spend less time on River and work on a tool I'm building called Beaver. So Beaver is a tool to deploy and maintain online learning models. So essentially an MLOps tool for online learning.

So it's in its infancy, but it's something I've been thinking about a lot.

So I recently gave a talk on it in Sweden.

I've sketched a blog post and some slides where I tried to describe what it's going to look like. But the goal of this project is to create a very simple, user friendly tool to deploy a model, and I'm hoping that that is going to encourage people to actually use River and to use online learning because they're gonna say, hey, okay, I can learn, but I can also just deploy the model, and, you know, both tools play nicely together. So, yeah, the future of River is to have River and to have this reference tool to deploy online models.

It's not going to be catered just towards River.

The goal is to be able to, you know, run it with any model that can learn online.

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.

I'm always impressed by how much the field is maturing.

I think that there's a clear separation now between regular machine learning, like business machine learning, as I like to call it, and deep learning.

I think those 2 fields are becoming separate fields.

So I've kind of stayed away from deep learning because it's just not my cup of tea, but I'm very interested in business machine learning, as I call it.

And I think I'm impressed by how much the community has evolved in terms of knowledge. The average ML practitioner today is just so much more proficient than 5 years ago.

And I think it's a big question of education and tooling.

The tricky thing about an ML model is that it's not deterministic, and so it's difficult to guarantee that its performance over time is going to be good, let alone certify the model or convince stakeholders that they should adopt it.

So in the real world, you don't just deploy a model and cross your fingers.

So although we've gone past the test and R&D phase of a model, we are still not there in terms of deploying models. And so the reality is that there's usually a feedback loop where you monitor your model and possibly retrain it, be it online or, you know, offline retraining. It doesn't matter.

And so I don't think we're really good at that right now. I don't think that we have great tools to have human beings in the loop who work hand in hand with machine learning models. So I think that tools like Prodigy, which is a tool to have a user work hand in hand with an ML system by labeling data that the model is unsure about, they're crucial. They're game changers because they create real systems where you care about, you know, new data coming in, retraining your model, having a human validate predictions, stuff like that. So I think we have to move away from only having tools that are destined towards training a model. We also need to get better at tools that, you know, encourage you to monitor your model, to keep training it, to work with it. Yeah. Again, just treat machine learning as software engineering and not just as some research project.
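The human-in-the-loop pattern mentioned here, where a person only labels the data the model is unsure about, can be sketched as a small feedback loop in plain Python. This is a generic uncertainty sampling sketch, not how any particular labeling tool works: the `OnlineLogisticRegression` class, the `ask_human` function (a stand-in for a real annotator), the margin, and the data are all assumptions made for the example.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical tiny online classifier trained by stochastic gradient descent."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba_one(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Single gradient step on the log loss for one labeled example.
        error = self.predict_proba_one(x) - y
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * error


def ask_human(x):
    """Hypothetical annotator: says the label is 1 when the feature is positive."""
    return 1 if x[0] > 0 else 0


model = OnlineLogisticRegression(n_features=1)
margin = 0.2  # predictions within 0.5 +/- margin count as "unsure"
asked = 0

# Made-up stream of one-dimensional examples, repeated a few times.
stream = [[1.0], [-1.0], [2.0], [-2.0], [1.5], [-1.5], [3.0], [-3.0]] * 5

for x in stream:
    p = model.predict_proba_one(x)
    if abs(p - 0.5) < margin:
        # The model is unsure: ask the human, then learn from the answer.
        y = ask_human(x)
        model.learn_one(x, y)
        asked += 1

print(asked, "labels requested out of", len(stream))
```

The human is only consulted while the model is uncertain, so the labeling effort drops as the model improves, and the same loop doubles as a simple form of monitoring.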

Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on River and helping to introduce the overall concept of online machine learning. It's definitely a very interesting space, and it's great to have tools like River available to help people take advantage of this approach. So thank you for all of the time and effort that you and the other maintainers are putting into the project, and I hope you enjoy the rest of your day.

Oh, thank you. Thanks for having me. It was great.

Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
