Summary
The majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
- Your host is Tobias Macey and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning
- Introduction
- How did you get involved in machine learning?
- Can you describe what River is and the story behind it?
- What is "online" machine learning?
- What are the practical differences with batch ML?
- Why is batch learning so predominant?
- What are the cases where someone would want/need to use online or streaming ML?
- The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like?
- Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift?
- Can you describe how the River framework is implemented?
- How have the design and goals of the project changed since you started working on it?
- How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state?
- In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision?
- Can you describe the process of using River to design, implement, and validate a streaming ML model?
- What are the operational requirements for deploying and serving the model once it has been developed?
- What are some of the challenges that users of River might run into if they are coming from a batch learning background?
- What are the most interesting, innovative, or unexpected ways that you have seen River used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on River?
- When is River the wrong choice?
- What do you have planned for the future of River?
- @halford_max on Twitter
- MaxHalford on GitHub
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- River
- scikit-multiflow
- Federated Machine Learning
- Hogwild! Google Paper
- Chip Huyen concept drift blog post
- Dan Crankshaw Berkeley Clipper MLOps
- Robustness Principle
- NY Taxi Dataset
- RiverTorch
- River Public Roadmap
- Beaver tool for deploying online models
- Prodigy ML human in the loop labeling
[00:00:10]
Unknown:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model's behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually.
Go to themachinelearningpodcast.com/deepchecks today to learn more and get started. Your host is Tobias Macey. And today, I'm interviewing Max Halford about River, a Python toolkit for streaming and online machine learning. So, Max, can you start by introducing yourself?
[00:01:01] Unknown:
Oh, hey there. So I'm Max Halford. I consider myself a data scientist. My day job is doing data science; I actually measure the carbon footprint of clothing items. But I have a wide interest in, you know, technical topics, be it software engineering or data engineering. I do a lot of open source. My academic background leans towards finance and computer science and statistics. I actually did a PhD in applied machine learning, which I finished a couple of years ago. So, yeah, an all-around nerd, basically.
[00:01:35] Unknown:
And do you remember how you first got started working in machine learning?
[00:01:39] Unknown:
Kind of, yes. I was a late bloomer. I got started maybe when I was 21 or 22, when I was at university. I basically had no idea what machine learning was, but I started this curriculum that was built around statistics. And we had a course, maybe 2 or 3 hours a week, about machine learning, and it did kind of blow my mind. It was around the time when machine learning, and particularly deep learning, was starting to explode. So I kind of got started at university, and I was lucky enough to get a theoretical training.
[00:02:14] Unknown:
And in terms of the River project, can you describe a bit more about what it is that you've built and some of the story behind how it came to be and why you decided that you wanted to build it in the first place?
[00:02:25] Unknown:
When I was at university, I received a regular introduction to machine learning. And then I did some internships. I started a PhD after my internships. And I also did a lot of Kaggle competitions on the side. So I was kind of hooked on machine learning, and it always felt to me that something was off, because when we were learning machine learning, everything made sense. But then when you get to do it in practice, you often find that, well, the playground scenarios that they describe at university when you learn machine learning just do not apply in the real world. The real world is evolving as well. You have data that's coming in, like a flow of data, or every day there's new data. There's an interactive aspect to the world around us, the way the data is flowing. It's not like a CSV file.
Yeah. It just felt like fitting a square peg in a round hole. So I was always curious in the back of my mind about how you could do online machine learning. Well, I didn't know it was called online machine learning because when I was a kid, I remember growing up and thinking that AI was this kind of intelligent machine that would keep learning as it went on and as it experienced the world around it. Anyway, when I started my PhD, I was lucky enough to have a lot of time to read. So I read a lot of papers and blog posts and whatnot. And I can't remember the exact day or week that I stumbled upon it, but I just started learning about online machine learning.
Maybe some blog posts or something. And then it was like a big explosion in my head, and I was like, wow, this is crazy. Right? This actually exists. And I was so curious as to why it wasn't more popular. At the time, I did a lot of open source as a way to learn, and so it just felt natural to me to start implementing the algorithms that I'd read about in papers. I just started writing code to learn, basically, just to confirm what I'd learned. That's just the way I learn. And it kind of evolved into what River is today, which is, well, an open source package that people use. Now, if I may expand a little bit, River is actually the merger of two projects.
So the first project is called scikit-multiflow. It was a package that was developed before I even got into machine learning. It has roots in academia in New Zealand, and comes from an older Java package called MOA. Anyway, I wasn't aware of that. On my end, I had started a package called Creme at the time. In French, crème means cream, and it plays on incremental, which is another way to say online. So for a year, I developed Creme on my own. And at some point, it just made sense to reach out to the folks from scikit-multiflow and to propose a merger. It took us quite a while, but after 9 months of negotiation and, you know, figuring out the details, we merged, and we called the new package River.
[00:05:26] Unknown:
You mentioned that it's built around this idea of online machine learning. And in the documentation, you also refer to it as streaming machine learning. I'm curious if you can just talk through what that really means in the context of building a machine learning system and some of the practical differences between that and the typical batch oriented workflow that most folks who are working in ML are going to be familiar with.
[00:05:50] Unknown:
First, just to recap on machine learning: the whole point of machine learning is to teach a model to learn from data and to make decisions. So, you know, monkey see, monkey do. And the typical way you do that is that you fit a model to a bunch of data, and that's it really. Online machine learning is the equivalent of that, but for streaming data. So you stop thinking about data as a file or a table in a database, and you think of it as a flow, a data stream. You could call it incremental machine learning, you could call it streaming machine learning. I more often see online machine learning being used, although if you Google that, you mostly find online courses about machine learning, so the name is a bit unfortunate. But anyway, it's just a way of asking: can I do machine learning, but with streaming data? And the rule is that an online model is one that can learn one sample at a time. Usually, you show a model a whole dataset, and it can work with that dataset. It can calculate the average of the dataset; it can do a whole bunch of stuff. But the restriction with online machine learning is that the model cannot see the whole dataset. It can't hold it in memory. It can only see one sample at a time, and it has to work like that. So it's a restriction, right? It makes it harder for the models to learn, but it also has many, many implications. If you have a model that can learn that way, well, you can have a model that just keeps learning as you go on.
Because with a regular machine learning model, once it's been fitted to a dataset, you have to retrain it from scratch if you want to incorporate new samples into your model. That can be a source of frustration, and that's why I was talking about the square peg in the round hole before. So say you have an online model that is just as performant as a batch model. Well, regardless of raw accuracy, that has many implications, and it actually makes things easier. Because if you have a model that can keep learning, you don't have to, for instance, schedule the retraining of your model. Every time a new sample arrives, you can just tell your model to learn from it, and then you're done. That ensures that your model is always as up to date as possible, and that has obviously many, many benefits. If you think about people working on the stock market, trying to forecast the evolution of a particular stock, they've actually been doing online machine learning since the eighties, but, obviously, they have a lot to lose by making any of it public. It just never became a big thing, and it always stayed inside trading firms.
So the practical differences are that you are working over a stream of data; you're not working with a static dataset. This stream of data has an order, meaning that the fact that one sample arrives before another has a lot of meaning, and that's actually reflecting what's happening in the real world. In the real world, you have data that's arriving in a certain order. If you train your model offline on that data, you want to process it in the same order, and that ensures that you are actually reproducing the conditions that happen in the real world. Now, another practical consideration is that online learning is much less popular or predominant than batch learning, and so a lot less research and software work has been put into online learning. So if you are a newcomer to the field, there's just not a lot of resources to learn from. Actually, you could just spend a day on Google, and you'd probably find all the resources there are, because there aren't that many of them. From memory, there are probably just ten links on Google that you can learn from about online learning. So it's a bit of a niche topic.
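To make the one-sample-at-a-time idea concrete, here is a minimal sketch using River's documented `learn_one`/`predict_one` interface on one of its small bundled datasets. Exact signatures and defaults may differ between River versions, so treat this as illustrative rather than authoritative:

```python
from river import datasets, linear_model, metrics, preprocessing

# A small bundled stream of (features: dict, label: bool) pairs.
dataset = datasets.Phishing()

# Scale features, then fit a logistic regression, one sample at a time.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in dataset:
    y_pred = model.predict_one(x)  # predict before the model has seen this sample
    metric.update(y, y_pred)       # score the prediction against the true label
    model.learn_one(x, y)          # then let the model learn from it

print(metric)
```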
[00:09:52] Unknown:
In terms of the fact that batch is such a predominant mode of building these ML systems despite the fact that it's not very reflective of the way that the real world actually operates, why do you think that's the case that streaming or online machine learning is still such a niche topic and hasn't been more broadly adopted?
[00:10:14] Unknown:
Sometimes it feels like I'm trying to preach a new religion, which feels a bit weird because there are not a lot of us doing it. So I never try to force people into this. There are obviously many good reasons why batch learning is still done. From a historical point of view, I think it's interesting, because we always used to use statistical models to explain data and not necessarily to predict. You just have a dataset, and you would like to understand, you know, what variables are affecting a particular outcome. So, for instance, if you take linear regression, historically it's been used to explain the impact of global warming on the rise of the sea level, but not necessarily to predict, if the temperature of the globe were higher, what the impact on the sea level would be. But then someone said, let's use machine learning to predict outcomes in a business context, and that's why we have this big advent of machine learning.
And we've kind of been using the tools that have been lying around. So we've been using all these tools that were designed to fit a dataset and explain it, but now we're using them for predicting. These models are static. When we started doing linear regression, we never really worried about streaming data, because datasets were small and datasets were static. Well, the Internet didn't even exist, so there was no real notion of IoT or sensors or streaming data. So the fact is that we never needed online models. And as a field, whether you look at academia or industry, we're very used to batch learning, and we're very comfortable with it. There's a lot of good software, and this is what people are being taught at university.
So I'm not saying that online learning is necessarily better than batch learning, but I do think that the reason batch learning is so predominant in comparison is that we are just used to it, basically. And I see it every week: people who are trying to rethink their job or their projects and say, maybe I could be doing online learning, it actually makes more sense. So I think it's a question of habits, really.
[00:12:39] Unknown:
For people who are assessing which approach to take in their ML projects, what are some of the use cases where online or streaming ML is the more logical approach or what the decision factors look like for somebody deciding, do I go with a batch oriented process where I'm going to have this large set of tooling available to me, or do I want to use online or streaming ML because the benefits outweigh the potential costs of plugging into this ecosystem of tooling?
[00:13:10] Unknown:
So I'll be honest, I think it always makes sense to start with a batch model. Why? Because, you know, if you're pragmatic and you actually have deadlines to meet and you just wanna be productive, there are so many good solutions to train a batch model and deploy it. So I would just go with that to start with. And then, yeah, there's the question of: could I be doing this online? I think there are two cases. There are cases where you need it, and I have a great example. Netflix, when they do recommendations: you arrive on the website and Netflix recommends movies to you. Netflix actually retrains a model every night or every week, and they have many models anyway. But they are learning from your behavior to kind of retrain their models to update their recommendations.
Right? There's a team at Netflix that is working on learning instantly. So if you are scrolling on the Netflix website and you see a recommendation from Netflix, the fact that you did not click on that recommendation is a signal that maybe you do not want to watch that movie, or that the recommendations should be changed. So if you're able to have a model, for instance maybe in your browser, that would learn in real time from your browsing activity and that could update and learn on the fly, that'd be really powerful. And the only way to do that is to have a model per user that is learning online.
And so you cannot just use batch models for that. You can't, every time a user scrolls past or ignores a movie, take the whole history of data and refit the model. It would be much too heavy, and it's just not practical. So sometimes the only way forward is to do online learning. But, again, this is quite niche. Netflix recommendations, I mean, are obviously working reasonably well, I believe, judging just from their market value. But if you are pushing the envelope, then sometimes you need online learning. Now, another case is when you do not necessarily need it, but you want it because it makes things easier. A good example: imagine you're working on an app that categorizes tickets, for instance in help desk software. You go on the website and you send a form or a message or an email, because you have some problem, maybe with a reimbursement for something you bought on Amazon.
And then, you know, there's a customer service team behind that, human beings who are actually answering those questions. And it's really important to be able to categorize each request and put it into a bucket, so that it gets assigned to the right person. And maybe the product manager has decided that we need a new category. So there's this new category, and your model is classifying tickets into one of several categories. If you introduce a new category, it means that you have to retrain the model to incorporate it. I was in discussion with a company, and their budget was such that they were only able to retrain the model every 3 months. So if you introduced a new category into your system, the model would only pick it up and predict it after 3 months. That sounded kind of insane and, you know, wasn't practical at all.
And I was not aware of the exact details, but it just seemed too expensive for them to retrain their model from scratch. If they were using an online model, well, potentially that model could just learn the new tickets on the fly. You'd have this feedback loop where you introduce a new category, people send emails, maybe a human assigns a ticket to that category, and that becomes a signal for the model. The model picks that up, learns, and, yeah, it's gonna incorporate that category into its next predictions. So that's a scenario where you don't necessarily need online learning, but online learning just makes more sense and makes your system easier to maintain and to work with, basically.
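To illustrate that ticket-routing scenario, here is a rough sketch of how an online text classifier could absorb a brand-new category without retraining from scratch. The pipeline pieces (`BagOfWords`, `MultinomialNB`) follow River's documented text-classification pattern, but the ticket texts and category names are made up for the example:

```python
from river import feature_extraction, naive_bayes

# Bag-of-words features feeding an online multinomial Naive Bayes classifier.
model = feature_extraction.BagOfWords(lowercase=True) | naive_bayes.MultinomialNB(alpha=1)

# Historical tickets, labelled by the support team (hypothetical data).
history = [
    ("I never received my refund", "billing"),
    ("The app crashes when I log in", "bug"),
    ("How do I change my password", "account"),
]
for text, label in history:
    model.learn_one(text, label)

# The product manager adds a "shipping" category; the first labelled example
# is enough for the model to start predicting it, with no full retrain.
model.learn_one("My parcel is stuck at the warehouse", "shipping")
print(model.predict_one("Where is my parcel?"))
```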
[00:17:17] Unknown:
There are a number of interesting things to dig into there. 1 of the things that you mentioned is the idea of having a model per user in the Netflix example. And I'm wondering if you can maybe talk through some of the conceptual elements of saying, okay. I've got this baseline structure for it. This is how I'm going to build the model. Here is the initial training set. I've got this model deployed. And now every time a specific user interacts with this model, it is going to learn their specific behaviors and be tuned to the data that they are generating.
Would you then take that information and feed that back into the baseline model that gets loaded at the time that the browser interacts with the website and just some of the ways to think about the potential approaches for how to say, okay. I've got a model, but it's going to be customized per user and just managing the kind of fan out, fan in topologies that might result from those event based interactions spread out across n number of users or entities.
[00:18:23] Unknown:
Oh, it sounds insane when you say it, to have one model per user and, you know, have it deployed on the user's browser or mobile phone or, God knows, an Apple Watch. It does kinda sound insane, but it is interesting, I guess. I'm not aware of a lot of companies that would have the justification to actually do this, and I've never had the occasion to work in a setting where I would do it. But I have one good example from when I was doing some pro bono consulting. It was this car company where, for the onboard navigation system, they wanted to build a model that could guess where you're going. So, basically, depending on where you left from: if you left home in the morning, you're probably going to work.
And they would then use this to send you news about your itinerary or things like that. They really needed a model that would be able to learn online, and they made the bold decision to say, okay, we're going to embed the model into the car. It's not going to be, like, a central model hosted on some big server; the intelligence is actually happening in the car. And when you think about that, it's really interesting, because it creates a decentralized system. It actually creates a system where you don't even need the Internet for the model to work. But there are so many operational requirements around that. Actually, now that I think of it and I'm talking about cars, I realize that's actually what Tesla is doing. They're computing and making decisions inside the car, doing a bunch of stuff, and they're also communicating with central servers.
But for the actual compute, they actually have GPUs in the car running their deep learning models and whatnot. So it's definitely possible to do this, right? But it's clearly not something every company would go through or would have the need to do.
[00:20:20] Unknown:
It's interesting also to think about how something like that would play into the federated learning approach, where you have federated models: there is that core model that's being built and maintained at the core, and then as users interact at the edge, whether it's on their device or in their browser, it loads a federated learning component that has that streaming ML capability. So the model evolves as the user is interacting with it on their device, and then that information is sent back to the centralized system to feed back into the core model. That way you have these parallel streams: the model that the user is interacting with is customized to their behavior at the time that they're interacting with it, but it still gets propagated back into the larger system so that the new intelligence is able to generate an updated experience for everybody who then goes and interacts with it in the future.
[00:21:20] Unknown:
Yeah, that's really, really interesting. First off, the fact is that I'm actually still young, and there are so many things that I don't know. I don't have, like, the technical savvy to be able to suggest ways forward. But these are obviously things that I think about. There are things like Hogwild!, which is a project from Google; they have a paper where they discuss these things. I think there's a simple pattern, if you, the listener, wanted to do something like this, which is to maybe once a month have a model that is retrained in batch, and that model is gonna be like a hydra: you're gonna copy it, and you're gonna send it to each user.
And then each copy for each user is going to be able to learn in its own environment. And, for instance, a good idea would maybe be to increase the model's learning rate so that every sample that the user gives you matters a lot. So, for instance, if we take the Netflix example, you would have your run-of-the-mill recommendation system model that you would just train in batch, using all the tools that we as a community use. But then you would embed that into each person's browser, and maybe you do this once a month. And then that model, for each user, would be a copy, a clone, like a separate model now.
And, you know, it would keep learning in an online manner. So maybe your model was trained in batch initially, but now, for each user, it's actually being trained online. For instance, you can do this with factorization machines, which can be trained in batch but also online. And, yeah, you would use a high learning rate so that every sample matters a lot, basically. And so you, the user, are tuning your model. I don't know how YouTube does it, for instance, but I do imagine they have some sort of core model that's just learning how to make good recommendations. But, obviously, at YouTube there are some rules that make it so that recommendations are tailored to each user. And I don't know if that is done online, and I don't know if it's actually machine learning. It's probably just rules or scores.
But, yeah, I think it's a really fun idea to play around with, and I do think that online learning enables this even more.
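To sketch that warm-start-then-personalize pattern in River terms: train a shared base model on a historical log, then hand each user a deep copy that keeps learning on their own events, with a fairly aggressive learning rate so individual feedback moves the weights quickly. The user IDs, event stream, and learning-rate value here are all invented for illustration, and this is only one possible way to wire it up:

```python
import copy
from collections import defaultdict

from river import linear_model, optim, preprocessing

# Shared base model, warmed up on a historical log of (features, clicked) pairs.
# A high SGD learning rate makes each subsequent sample count for a lot.
base_model = (
    preprocessing.StandardScaler()
    | linear_model.LogisticRegression(optimizer=optim.SGD(lr=0.1))
)

historical_log = [
    ({"hour": 20, "is_weekend": 1, "rank_on_page": 1}, True),
    ({"hour": 9, "is_weekend": 0, "rank_on_page": 5}, False),
]
for x, y in historical_log:
    base_model.learn_one(x, y)

# Each user gets their own clone, which then personalizes itself online.
user_models = defaultdict(lambda: copy.deepcopy(base_model))

def handle_event(user_id, features, clicked):
    model = user_models[user_id]
    prediction = model.predict_proba_one(features)  # score before learning
    model.learn_one(features, clicked)              # personalize on this user's feedback
    return prediction

print(handle_event("user_42", {"hour": 21, "is_weekend": 1, "rank_on_page": 2}, True))
```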
[00:23:40] Unknown:
As far as the operational model for people who are using online and streaming machine learning: if they're coming from a batch background, they're going to be used to dealing with the train, test, deploy cycle, where I have my dataset, I build this model, I validate the model against the test dataset that I've held out from the training data. Everything looks good based on the, you know, area under the curve or whatever metrics I'm using to validate that model. Now I'm going to put it into my production environment, or maybe it's being served as a Flask app or a FastAPI app. And then I'm going to monitor it for concept drift, and eventually I say, okay, this is no longer performing up to the specification, so now I need to go back and retrain the model based on my updated datasets.
And I'm wondering what that process looks like for somebody building a streaming ML model with something like River, and how you address things like concept drift, and how concept drift manifests in this streaming environment where you are continually learning and you don't have to worry about the real world data being widely divergent from the data that was used for training.
[00:24:48] Unknown:
There are so many things to dig into, and I'll try to give a comprehensive answer. So, first off, it's important to understand that River itself is to online learning what scikit-learn is to batch learning. It only aims to be a machine learning library. It basically just contains algorithms: routines to train a model, and models that can learn and predict. What you're getting at with your question is MLOps: what does the life cycle look like for an online model? And this is something I spend a lot of time looking into.
The first part of the answer is that online learning enables different patterns, and I believe that these patterns are simpler to reason about. As you said, you usually start off by training a model, then evaluating it against a test set, and maybe going to report to your stakeholders and show them the performance and guarantee that, you know, the rate of false positives is underneath a certain threshold, and then yes, we can diagnose cancer with this model, or not. And then you kind of deploy it, maybe, if you get lucky and you get the approval, and you sleep well or not well at night depending on how much you trust your model. But there's this notion that you deploy a model, and it's like a baby in the world, and this baby is not going to keep learning. So, you know, it's a lie to believe that if you deploy a batch model, you're going to be able to just let it run by itself. There's actually maintenance that has to happen there. The reality is that any machine learning project, any serious project, is never finished. It's like software, basically. We have to think of machine learning projects as software engineering. And obviously, we all know that you never just deploy a software engineering feature and never look at it again. You monitor it, you take care of it, investigating bugs and whatnot. Batch learning in that sense is a bit difficult to work with, because if your model is drifting, meaning that its performance is dropping because the data that it's looking at is different from the training set it was trained on, you'd basically have to be very lucky for your model to pick its performance back up on its own. So you're gonna have to do something about it. And, yeah, you can just retrain it.
But what you can do with online learning is have the model just keep learning as you go. So there is no distinction between training and testing. What online learning encourages you to do is to deploy your model as soon as possible. Say you have a model and it hasn't been trained on anything; well, you can put it into production straight away. When samples arrive, it's gonna make a prediction. So maybe a user arrives on your website, you make recommendations, that's your prediction. And then your user is going to click or not on something you recommended to her, and that's gonna be feedback for your model to keep training on.
So that is already a guarantee that your model is kind of up to date and kind of learning. And that's really interesting, because it just enables so many good patterns. You can still monitor the performance of your model. If the performance of your online model is dropping, I mean, I haven't seen that yet, but it probably means that your problem is really hard to solve. So the really cool thing I stumbled upon was this idea of test-then-train. Imagine the scenario where you have a classification model that is running online. What would happen is that you have your model, and your model is generating features. Say the user arrives on the website and the features are: what's the time of day? What's the weather like? What are the top films at the moment? These are features that you have at a certain point in time, when you generate them. And then later on, you get the feedback: was your recommendation a success or not? That's training data for your model.
You use the same features that you used for prediction; you use those features for training. And so you can see that there's a clear feedback loop. The event happens, the user comes on the website, your model generates features. And then at some later point in time, the feedback arrives: was the prediction successful or not? And if not, how large was the error? And then you can take this feedback, join it with the features that you generated at prediction time, and use that as training data. You essentially have a small queue or database that's storing your predictions, your features, and the labels that make up your training data.
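Here is a rough sketch of that prediction/feedback bookkeeping: stash the features and prediction under an event ID when you predict, and only when the ground truth shows up later do you score and learn. Everything apart from River's `learn_one`/`predict_one` calls (the event IDs, the in-memory store, the taxi-style numbers) is an assumption made up for the example:

```python
from river import linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
metric = metrics.MAE()

# Features and predictions waiting for their ground truth, keyed by event ID.
pending = {}

def on_event(event_id, features):
    """Called when the event happens: predict and remember what we used."""
    y_pred = model.predict_one(features)
    pending[event_id] = (features, y_pred)
    return y_pred

def on_feedback(event_id, y_true):
    """Called when the outcome arrives: score, then learn from the same features."""
    features, y_pred = pending.pop(event_id)
    metric.update(y_true, y_pred)
    model.learn_one(features, y_true)

# Hypothetical usage: predict at departure, learn at arrival.
on_event("trip-1", {"distance_km": 3.2, "hour": 18})
on_feedback("trip-1", 14.0)  # the trip actually took 14 minutes
print(metric)
```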
So the big difference here is that you do not necessarily have to do a train and test phase before deploying your model. You can actually just deploy your model initially, and it just learns online, and then you can monitor it. A really cool thing is that if you do this, you have a log of people coming on your website, you making predictions, you generating features, people clicking around and interacting with your recommendations. This creates a log of what's happening on your website. And what's really cool about this log is that you can process it offline, after the fact, in the same order it arrived in, and you can replay the history of what happened. So it means that if, on the side, you're redeveloping a model or you want to develop a better model, you can just take this log of events, run through it, and do this prediction and training dance, the whole life cycle.
You know, you're replaying the feedback loop, and then you have a very accurate representation of how your model would have performed on that sequence of events. That's really powerful, because the way you're designing your model there is that you have a rough sketch of a model, which you deployed, and then you have a log from that model. So you can evaluate the performance of that model, but more importantly, you have a log of the events. And then when you're designing version 2 of your model, you have a very reliable way to estimate how your new model would have performed.
And that's really cool. Because when you are doing train-test splits in batch learning, that is not representative of the real world. The whole problem with train and test is that people spend so much time making sure that their train-test split is correct, when in fact even a good train-test split is not a good proxy for the real world. A good proxy for the real world is to just replay history, and that's something that you can only do with online learning. That's really cool. Now, to come to your point about concept drift: there are many different kinds of concept drift, and Chip Huyen has a really good post about this on her blog. What matters really is that the result of concept drift is usually that your model is not performing as well. It's gonna be a drop in performance. And so the first thing you see in your monitoring dashboard is that a metric has dropped. And then when you dig into it, you see that maybe there's a class imbalance, or that the correlation between a feature and a class has changed, or something like that. Essentially, the data the model has been trained on is not representative of the new data that is being seen in production.
But again, I have said this a few times: online models, if you put them in place with the correct MLOps setup, are able to learn as soon as possible. So that just guarantees that your model is as up to date as possible. You're basically really doing the best you can. Drift is always possible. You can obviously always have a model that's degrading or that's just going haywire; that's not necessarily related to the online learning aspect of things. And there are also ways to cope with this. So, for instance, Dan Crankshaw and his team at Berkeley developed a system called Clipper.
It's a kind of MLOps tool. It's a research project, and I think it's been deprecated, but the ideas are still there. It's a project where they have a meta-model, which is looking at many models being run in production and deciding online which model should be used to make predictions. So it's kind of like a teacher selecting the best student at a certain point in time, and, you know, seeing throughout the year how the students are evolving and which students are getting better or not.
And you can do this with bandits, for instance. So just to say that there are many ways to deal with concept drift, and online models, again, help to cope with concept drift in a way that actually just makes more sense than batch models.
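On the monitoring side, a common streaming trick is to feed a drift detector with the model's per-sample error and react when it fires. River ships drift detectors such as ADWIN; note that the exact update API has changed across River versions (older releases returned a tuple from `update`), so this is a sketch against the newer attribute-style interface, with a fabricated error stream:

```python
from river import drift

detector = drift.ADWIN()

# Pretend stream of per-sample absolute errors: small for a while, then larger,
# as if the relationship between features and target had shifted.
errors = [0.1] * 500 + [0.9] * 500

for i, err in enumerate(errors):
    detector.update(err)
    if detector.drift_detected:  # newer River versions expose this flag
        print(f"Drift detected around sample {i}")
        # Possible reactions: raise an alert, reset part of the model,
        # or hand more traffic to a challenger model.
        break
```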
[00:33:59] Unknown:
And so digging now into River itself, can you talk through how you've implemented that framework and some of the design considerations that went into how do I think about exposing this online learning capability in a way that is accessible and understandable to people who are used to building batch models?
[00:34:19] Unknown:
So I like to think of River more as a library than a framework. If I'm not mistaken, a framework kind of forces you into a certain behavior or way of doing things, and there's an inversion of control where the framework is kind of designing things for you. So, you know, if you look at Keras and PyTorch, Keras is much more of a framework in comparison to PyTorch, because with PyTorch, for me, the reason why it was successful is that it kind of gave control back to the user. You can do so many things in PyTorch, and it's very flexible and doesn't really impose a single way of doing things. So we have that in mind with River. River, again, is just a library to do online machine learning. It just contains the algorithms. It doesn't really force you to read your data in a certain way. You could use it in a web app. You could use it offline.
You could use it on an offline IoT sensor. River is not concerned with that; it's just a library that is agnostic with regard to that. Now, to come to what River is: in terms of online machine learning, it is general purpose. So it's not dedicated to anomaly detection or forecasting or classification; it covers all of that. That's the ambition, at least. Just to note, it's actually really hard to develop and maintain, because the other maintainers and I are not actually specialized in all the different domains, and we kind of have to switch around: one day I'm working on the forecasting module, the next I'm working on anomaly detection, and it's kinda crazy. But it's still fun. What we do provide is a common interface. So, just like scikit-learn, every piece of the puzzle in River follows a certain interface. We have transformers, we have regressors, we have anomaly detectors.
Of course, we have classifiers, binary and multi-class. We have forecasting models for time series. And so we guarantee to the user that each model follows a certain API. Every model is gonna have a learn method, so it can just learn from new data, and they usually have a predict method to make predictions. Forecasters will have a forecast method. Anomaly detectors will have a score method, which outputs an anomaly score. And so the strength of River is to provide this consistent API for doing online machine learning. It's a bit opinionated, much like scikit-learn really. It just says, okay, you're gonna have learn and predict, but that's a reasonable thing to impose.
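To give a feel for that shared interface, here is a sketch touching a few estimator families. The module and class names are taken from River's public API as I recall it, but specific constructors, parameters, and defaults may differ by version, so treat the details as assumptions:

```python
from river import anomaly, linear_model, preprocessing, time_series

x = {"cpu": 0.5, "memory": 0.3}

# Classifier: learn_one + predict_one.
clf = preprocessing.StandardScaler() | linear_model.LogisticRegression()
clf.learn_one(x, True)
print(clf.predict_one(x))

# Anomaly detector: learn_one + score_one (an anomaly score, not a class).
hst = anomaly.HalfSpaceTrees(seed=42)
hst.learn_one(x)
print(hst.score_one(x))

# Time series forecaster: learn_one + forecast(horizon).
snarimax = time_series.SNARIMAX(p=1, d=0, q=0)
for y in [5.0, 5.5, 6.0, 6.5]:
    snarimax.learn_one(y)
print(snarimax.forecast(horizon=3))
```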
And that makes it easier for users to switch between models, because they have the same interface. And so, again, just to conclude on what I said at the start, we made the explicit choice to follow the single responsibility principle, in that River only manages the machine learning aspect of things and not the deployment and whatnot. And so if you want to use River in production, and we see people doing this, you have to worry about some of the details yourself. Like, if you want to deploy it in a web app, well,
[00:37:22] Unknown:
we do not help at the moment at all with that. You have to deploy your own web app. As far as the overall design of the framework, you mentioned that it actually started off as 2 separate projects, and then you went through the long process of merging them together. I'm wondering how have the overall design and goals of the project changed or evolved since you first started working on this idea?
[00:37:44] Unknown:
The reason the merger between Creme and scikit-multiflow took a certain amount of time was that, although we were both online learning libraries, there were some subtle differences which were kind of important. So my vision with Creme at the time, and with River now, is that we should only cater to models which are what I call pure online models, in that they can learn from a single sample of data at a time. But there are also mini-batch models, models which can learn from streaming data but in chunks, like a mini-batch of data. And scikit-multiflow was kind of doing that, much like PyTorch and TensorFlow and, you know, deep learning models.
And so I kind of had to convince them that there were reasons why the pure online approach was just a bit better. Why? Because, you know, if you think about, again, a user event on the website, or just any web request, or things that are happening in real life, you want to learn as soon as possible. You don't want to have to wait for, you know, 32 samples to arrive to have a batch to be able to feed to your model. You could, obviously, but it just made sense to me to have something simpler where we only care about pure online learning. Because it means that you don't have to store anything; you just learn on the fly. And I guess the interaction I had with scikit-multiflow kind of confirmed this idea.
And I guess, you know, they were a bit doubtful when we did the merger, and maybe I was a bit too opinionated, but history proved that it actually made sense, and it's not a decision that we look back on. We're really happy with this now. So River has, you know, arguably moderate success. It's working, it's alive, it's breathing. The project has been going on for two and a half years. We have a steady intake of users that are adopting it, and we see this from emails we receive, from GitHub discussions and issues, and just general feedback we get. So the general idea of having a library that is only focused on ML and just the algorithms is something that we are just gonna keep going with, because it looks like it's working; it looks like this is what people want.
You know, a simple example is: hey, I want to compute a covariance matrix online. Well, River aims to be the go-to library to answer those kinds of questions. But the truth is that people don't just need that. They also need ways to deploy these models and do MLOps online. So the next step for us is basically to build new tools in that direction. We also think that the initial development of River was a bit fast and furious. The aim was to implement as many algorithms as possible and, you know, just to cover the wide spectrum of machine learning. Now we've covered quite a few topics, and we also have day jobs. When I was developing River initially, I was doing a PhD, so, ironically, I had more time than now that I have a proper job. But we value our time a bit more, and we're not in this fast and furious mode. We kind of just focus on picking certain models which we see value in and spending the time to implement them properly. And the final aspect is that our user base does not just want algorithms. They also want us to educate them. So they have general questions about, you know, what is online learning and how do I do it and how do I decide what model to use, and, you know, all the questions that we're covering in this podcast, basically. So I think there's a huge need for us to kind of move into the educational aspect. When I was younger, scikit-learn was my bible. I would just spend so much time not even using the code, but actually just reading through the documentation, because it's just so excellent.
So obviously that takes a lot of time, a lot of energy, people, contributors, and help, but it's definitely something towards which we are moving.
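As an example of those "compute a statistic online" questions, River's `stats` module keeps running estimates that you update one value at a time. This is a minimal sketch with made-up numbers; the class names follow the documented API, though exact behavior may vary by version:

```python
from river import stats

mean = stats.Mean()
var = stats.Var()
cov = stats.Cov()  # running covariance between two variables

for x, y in [(1.0, 2.0), (2.0, 3.5), (3.0, 5.1), (4.0, 6.8)]:
    mean.update(x)
    var.update(x)
    cov.update(x, y)

print(mean.get(), var.get(), cov.get())
```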
[00:41:50] Unknown:
In terms of the overall process of building a model using something like River: when people are building a batch model, they end up getting a binary artifact out that is the entire state of that model after it has gone through that training process. And I'm curious if you can talk to how River manages the stateful aspect of that model as it goes through this continual learning process, both in a, you know, sandbox use case where somebody is just playing around on their laptop, but also as you push it into production, where maybe you want to be able to use this model and scale out serving it across a fleet of different servers, and just some of the state management that goes into being able to continually learn as new information is presented to it?
[00:42:39] Unknown:
So the great advantage of batch learning is that once you train your model, it's essentially a pure function. There are no side effects. The decision process that's underlying the model is not gonna change. So, you know, you can push the envelope and compile it. You can pickle it. You can convert it to another format; that's what ONNX does. You can compile it so that it can run on a mobile device. It does not need the ability to train anymore. It's just basically a function that takes an input and outputs something. So there's also a good reason why batch learning is predominant. But with River, it's different, because online models need to keep this ability to learn. So, as you were saying, it's actually kind of straightforward, but the internal representation of most models in River is fluid, dynamic. It's usually stored in dictionaries that can increase and decrease in size. So imagine you have a new feature that arrives in your stream; well, every model in River will cope with that. They're not static. If a new feature appears, they handle it gracefully. For instance, a linear regression model is just going to add a new weight to its internal dictionary of weights. Now, in terms of serialization and pickling and whatnot, River basically stands on the shoulders of Python, very much so. We do not depend very much on NumPy or pandas or SciPy.
We mostly depend on the Python standard library. We use dictionaries a lot, and that plays really nicely with the standard library. It's very easy to take any River model, pickle it, and just save it. You can also just dump it to JSON or whatnot. Also, the paradigm of you train a model, you pickle it, and you have an artifact that you can upload anywhere is a bit different with online learning, because you would play this differently. You would maintain your model in memory. So if you have a web server serving your model, you would not load the model each time to make a prediction; you would just keep it in memory and make predictions with it. Because it's in memory, you don't have to reload it.
And then when a sample arrives, your model is in memory, and you can just make it learn from that. So, yeah, I think the big difference is that you hold your model in memory rather than pickling it to disk and loading it when necessary.
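As a rough sketch of that keep-it-in-memory pattern (not an official recipe; the maintainers' Beaver project linked in the show notes is aimed at this problem), here is a minimal Flask app where one model instance lives in the process and handles both prediction and learning. The routes, payload shapes, and endpoint names are all assumptions for illustration:

```python
from flask import Flask, jsonify, request
from river import linear_model, preprocessing

app = Flask(__name__)

# The model lives in memory for the lifetime of the process;
# it is never reloaded from disk between requests.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()  # the JSON body maps one-to-one to a feature dict
    proba = model.predict_proba_one(features)
    return jsonify({"prediction": proba.get(True, 0.0)})

@app.route("/learn", methods=["POST"])
def learn():
    payload = request.get_json()  # e.g. {"features": {...}, "label": true}
    model.learn_one(payload["features"], payload["label"])
    return jsonify({"status": "ok"})

# NB: a real deployment would also need to think about concurrency,
# persistence of the model state, and running multiple workers.
if __name__ == "__main__":
    app.run(port=5000)
```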
[00:45:11] Unknown:
In terms of the use of the dictionary as that internal state representation, as you said, it gives you the flexibility to be able to evolve with the updates in data. But at the same time, you have this heterogeneous data structure that can be mutated as the system is in flight, and you don't necessarily have a strict schema being applied to it. And I'm just curious if you can talk to the trade-offs of having that flexibility, but also lacking some of the validation and, you know, schema and structure information that you might want in something that's dealing with these volumes of data?
[00:45:52] Unknown:
So, yeah, River uses dictionaries, and the advantages of dictionaries are plentiful. First of all, a very important thing is that dictionaries are to lists what pandas data frames are to NumPy arrays. A dictionary has names, and that's really important. It means that each one of your features actually has a name to it. And I find that hugely important, because, you know, we often see features as just numbers, but they also have names, and that's just really important. Imagine you have a bunch of features coming in. Now, if that was a list or a NumPy array, you have no real way of knowing which column corresponds to which variable. If you switch two columns with each other, that could just be a really silent bug, which really affects you. Whereas if you name each feature, if the column order changes, well, the names of the columns are being permuted too, so you can identify that. So what's really cool with dictionaries, and that works with River, is that the order of the features that you're receiving doesn't matter, because we access every feature by name and not by position.
Dictionaries are also mutable in size. So, you know, if a new feature arrives or a feature disappears between two different samples, that just works. It's also really cool that dictionaries, when you think about it, are naturally sparse. Imagine that on the Netflix project, the features that you receive are the name of the user mapped to 1, or, you know, the date mapped to 1. You can just store sparse information in a dictionary. That's really useful. There's also this robustness principle that we follow with River. The robustness principle says: be conservative in what you do, but liberal in what you accept. So River is very liberal in that it accepts heterogeneous data, as you said: dictionaries that differ in size, dictionaries which have different orders, or whatnot. That is really flexible for users. A common use case is to deploy River in a web app, and in the web app you're receiving JSON data a lot of the time. So the fact that JSON data has a one-to-one relationship with Python dictionaries makes it really easy to integrate into a web app. Whereas if you have a regular batch model, you have to mess about with casting the JSON data to a NumPy array, and, you know, that actually has a cost. Although NumPy, Torch, and TensorFlow are, you know, good at processing matrices, there's actually a cost that comes with taking native data, such as dictionaries, and casting it to a higher-order data structure, such as a NumPy array. That has a real cost. In a web app where you're reasoning in terms of milliseconds, well, you're spending a lot of your time just converting your JSON data to NumPy.
Whereas with River, because it consumes dictionaries, the data you receive, whether you're coding in Django, Flask, or FastAPI, the data you receive in your request is a dictionary, so you don't have to convert the data. It just runs. So, actually, if you take a linear regression in River and a linear regression in Torch, it's actually gonna be much faster in River for a single sample, because there's no conversion cost. Plus the features have names, and you don't have to worry about the features being in a mixed order or anything. So dictionaries just make a lot of sense in that regard. Now, the pitfalls. Obviously, it's not perfect, although I'd kinda disagree that there's a problem with the lack of structure in dictionaries. If you wanted to, in Python you can actually use a dataclass and convert that to a dictionary and feed that into your model. The dataclass helps you to create structure.
So I don't think that's really a problem. Quite the contrary, I think. The fact is also that a dictionary can be nested. So maybe the features that you're feeding to the model don't have to be a flat dictionary; they can actually be nested, and that's really cool too. You know, you can have features for your user, features for your page, features for the day, anything. Things that you cannot necessarily do with a flat structure such as a data frame or a NumPy array. Anyway, I'm talking about benefits when I should be talking about cons. But, yeah, I guess the main con of processing dictionaries is that, you know, if you wanted a River model to process a million samples, it would take much more time than processing a million samples with a pandas data frame or a NumPy array.
Because, yeah, the point of River is to be fast at processing one sample at a time, not necessarily at processing a million samples at a time. Those are two different problems. If you take Torch or scikit-learn, their goal is to be able to process offline data really quickly. The goal of River is to process online data, single samples, as fast as possible. And you're comparing apples and oranges if you wanna do that comparison. It just doesn't work. So, yeah, actually, you know what, I don't think there's a big downside to using dictionaries. It just helps a lot. And to confirm this, we have a lot of users who tell us about this. They say, well, it's actually fun to use River; it just makes sense because it's very close to the data structures that I use in Python. I don't have to introduce a new data structure into my system.
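A small sketch of that flexibility, with made-up feature dicts: the keys can arrive in any order, and a feature that only shows up later in the stream is simply picked up as a new weight, while missing features at prediction time are tolerated:

```python
from river import linear_model

model = linear_model.LinearRegression()

# First samples: two features, in whatever order they arrive.
model.learn_one({"distance_km": 3.2, "hour": 18}, 14.0)
model.learn_one({"hour": 9, "distance_km": 1.1}, 6.0)

# A new feature shows up mid-stream; the model just grows a weight for it.
model.learn_one({"distance_km": 7.5, "hour": 23, "is_raining": 1}, 31.0)

print(model.weights)  # one weight per feature name seen so far
print(model.predict_one({"hour": 12, "distance_km": 2.0}))  # missing features are fine
```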
[00:51:16] Unknown:
So for somebody who is using River to build a machine learning model, can you just talk through the overall process of going from idea through development to deployment?
[00:51:27] Unknown:
I'm going to rehash what I said before, but I think the great benefit of online learning and River is that you can cut the R&D phase. I've seen so many projects where there's an R&D phase, and the model, you know, gets validated at some point in time, but there's a really big gap in time between the start of the R&D phase and the moment when the model is deployed. And the process of using River, and any streaming model in general, is to actually, as I said, deploy the model as soon as possible and monitor its predictions. And it's okay, because sometimes you can deploy that model in production and those predictions do not necessarily have to be served to the user.
So you just make the predictions, you monitor them, and it creates a log, again, of training data and predictions and features. That's what you call a shadow deployment: you have a model which is deployed and making predictions, but those predictions are not being used, you know, to inform decisions or to influence the behavior of users. They just exist for the sake of existing and for monitoring. One thing to mention is that once you deploy this model, you have your log of events. That's the phase where you maybe want to design a new model, and you're going to have this model replace the existing model in production, or coexist with it because you have a meta-model, or not. So, as I mentioned, you can take your log of events, replay it in the order in which it arrived, and have a good idea of how well your model would have performed.
That's called progressive validation. It's just this idea that with your log of events, for every sample you're first going to make a prediction, and then you're going to learn from it. I have a good example. There's a dataset on Kaggle called the New York Taxi dataset, and it's basically a log of people hailing a taxi for a ride. They depart from one position and arrive at another position later in time. And so the goal of a machine learning system in this case could be to predict how long the taxi trip is going to last. When the taxi departs, you want your model to make a prediction.
How long is this taxi ride going to last? Maybe that's going to inform the cost of the trip, or it's going to help the decision makers behind the taxis, whatever it might be. But you can imagine that this is a great feedback loop, because you have your model making a prediction, and then later, maybe eighteen minutes later, the ground truth arrives. You know how long the taxi trip actually lasted, and that's your ground truth, so you can compare it with your prediction. And that enables progressive validation, because you have a log of events: when the taxi trip departed, what value my model predicted, what features I used at prediction time, and, later on, the ground truth. And so I can just replay that log of events for, say, seven days and progressively evaluate my model. I like this taxi example because it's easy to reason about, and taxis are easy to understand. But the taxi example is really what online learning is about. It's about the feedback loop between predicting and learning.
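As a rough sketch of progressive validation, here is the predict-then-learn loop replayed over a log of events; the trip features are made up, and River also ships a helper along these lines (evaluate.progressive_val_score), but the manual loop shows the idea more plainly:

```python
from river import linear_model, metrics

model = linear_model.LinearRegression()
metric = metrics.MAE()

# `event_log` stands in for the replayed history: (features, trip duration in minutes),
# in the exact order the trips happened.
event_log = [
    ({"hour": 8, "distance_km": 3.1, "passengers": 1}, 14.0),
    ({"hour": 18, "distance_km": 7.5, "passengers": 2}, 26.0),
]

for x, y in event_log:
    y_pred = model.predict_one(x)  # predict first, as if the trip just started
    metric.update(y, y_pred)       # score against the ground truth that arrived later
    model.learn_one(x, y)          # only then learn from the sample

print(metric)
```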
And just as a reminder, the way it would work with a batch model is that you would have your taxi dataset and, I don't know, you would split your data in two. You would take the start of the week and the end of the week, train your model on the start of the week, and evaluate on the rest of the week. Oh no, the data I trained on from the start of the week is not representative of the weekend. It just becomes a bit weird. It becomes this situation where you're trying to reproduce the conditions of real life, but you're never really sure of it. And you only really know how well your batch model is going to do once it's in production.
Online learning just kind of encourages you to go for it, to deploy your model straight away and not have this weird R&D phase where you live in a lab, where you think you might be right but you're never really sure. Online learning just brings you closer to reality, in my opinion.
[00:55:44] Unknown:
As you have been developing this project and helping people understand and adopt it, what do you see as some of the conceptual challenges or complexities that people experience as they're starting to adapt their thinking to how to build a machine learning model in this online streaming format, versus the batch oriented workflow where they do have to think about the train test split? Just the overall shift in the way that they think about iterating on, developing, and deploying these models?
[00:56:15] Unknown:
It's a hard question, but I think there are two aspects. There's the online learning aspect, and then there's the MLOps aspect. Now, in terms of MLOps, I think I've covered it enough, but it's much like a batch model. You have to deploy your model, which means maybe serving it behind a web app. As I mentioned, the ideal situation is to have your model loaded into memory, making predictions and training. But all of that is easier said than done. The truth is that there's actually no framework out there which allows you to do this. You could do it yourself, and this is what we see. We get users who ask us questions in all kinds of contexts, on GitHub or by email, and they're asking us, how do I deploy my model? What should I be doing? And we always give the same answers.
But the fact is that we have these users who have basically embraced River and they understand it, but then they get to the production phase, and we feel bad because they are all making the same mistakes in some way, and River is not there to help them, because that's not the purpose of River. So, yeah, there's a lack of tooling to actually deploy an online model. That's the MLOps aspect. In terms of the online learning aspect, a big challenge is that not everyone has the luxury of a PhD during which you can spend days and nights going through online learning papers and trying to understand them, which is what I and others had the chance to do. A lot of our users see the value of online learning and want to put it into production, but they have deadlines to meet. Right? They have to ship their project in six weeks, and they just don't have the time to understand things in detail. So things like what I just described, progressive validation, take them a bit of time to understand.
And so, again, what we need to do is spend more time creating resources, or just diagrams, to explain what online learning is about. And in terms of library design, that's really important. If we wanted to introduce a new method on all our estimators, I would be against it. The whole point of River is to make it as simple as possible so that people can just be productive and understand it. So, to recapitulate, those two problems are that people do not necessarily have the resources to learn about online learning, and then there are operational problems around serving these models in production. It's kind of like a batch model, because you have to serve your model behind an API and you have to monitor it, and those things are common to batch models. But there's the added complexity of having your model maintained in memory, kept learning, and so on, things that are basically not common. If you Google it or look on GitHub, you just kind of find these hacky projects, but nothing really good that lets you do that, at least not yet.
[00:59:17] Unknown:
And one of the things that we didn't discuss yet is the types of machine learning use cases that River supports, where I'm speaking specifically to things like logistic regression and decision trees versus deep learning and neural networks. I'm wondering if you can talk through the types of machine learning approaches that River is designed to support and some of the reasoning that went into where you decided to put your focus.
[00:59:47] Unknown:
River, again, is a general purpose library, so it covers quite a few things. There are some cases, or flavors of machine learning, which are especially interesting when you cast them in an online learning scenario. Anomaly detection, for instance. Say you have people making transactions in a banking system, so they're making payments, and you might want to be doing anomaly detection to detect fraudulent payments. That is very much a situation where you have streaming data, and in that case you would like to be doing online anomaly detection. We see that every time we put out a notebook or a new anomaly detection method, a lot of people start using it. We start getting bug reports and whatnot. It's kind of surprising, but it's a good thing. So there are modules and aspects of River which clearly bring a lot of value to users. That would be anomaly detection, but we also have forecasting models, and doing forecasting online just makes sense. You have sensors which are measuring, say, the temperature of something, and people want to do that in real time. Another good example I have is this engineer who's working on the water pipes in Italy.
He's trying to predict how much water is going to flow through certain points in his pipeline. So he has sensors all over the pipeline, and he's building a forecasting model. And it just makes so much sense for him to be able to have his model run online inside the sensors, or inside the IoT systems he's running. All that to say that there are some more exotic parts of River, such as anomaly detection and forecasting, which probably bring more value than the classic models such as linear regression and classification. Again, at the start of the show I talked about Netflix recommendations.
So we have some very basic bricks for making recommendations. We have factorization machines, and we have some kind of ranking system, so that if you have users and items interacting, you can build a ranking of preferred items for a user. So we have these kinds of exotic machine learning cases which provide value, but which require us to spend a lot of time working on them. It's very difficult for me and for the other contributors to be specialized in anomaly detection, time series forecasting, and recommendation all at once.
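To illustrate the online anomaly detection use case, here is a small sketch with one of River's detectors; the transaction features are invented and already scaled to [0, 1], which is roughly what HalfSpaceTrees expects:

```python
from river import anomaly

detector = anomaly.HalfSpaceTrees(seed=42)

# Hypothetical transaction features, pre-scaled to [0, 1].
transactions = [
    {"amount": 0.02, "hour": 0.6, "merchant_risk": 0.1},
    {"amount": 0.97, "hour": 0.1, "merchant_risk": 0.9},  # looks suspicious
]

for x in transactions:
    score = detector.score_one(x)  # higher score = more anomalous
    detector.learn_one(x)          # keep learning from the stream
    print(score)
```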
But, yeah, all this to say that River covers a wide spectrum. You can do preprocessing, you can extract features, you can do classification, regression, forecasting, anything. And because it's online, it's just a bit unique.
[01:02:45] Unknown:
In your experience of working with the River library and working with end users of the tool, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[01:02:55] Unknown:
Well, unexpected is a good one. There's one thing that comes to mind. We have this person who is a beekeeper, so a person who is taking care of bees, and I guess once a week or every two weeks they go to the beehive and collect the honey. This person has many beehives, and they don't like to waste their time going to a beehive and checking whether there's honey or not. So they have sensors in each beehive, measuring how much honey is in each one, and they like to forecast how much honey they can expect to have in the weeks or months to come, based on the weather, based on past data, based on whatever information they use.
It was really just fun to see this person doing this hackish project where they just thought it would be fun to use online learning for it. And again, it was an IoT context, so it made sense. In terms of innovative, I was kind of impressed when I heard about this project of having a model within each car to determine what your destination is going to be. You wake up in the morning, you take your car. Is it Saturday, so you're going to the market? Is it a weekday, so you're going to work? It sounds silly, obviously, but this idea of having one model per user is kind of fun.
The most impactful project I heard about, and I know it's being used, is a situation where this company prevents cyber attacks. They monitor this grid of servers and computers, and they're monitoring traffic between machines. They're trying to understand when some of the traffic is malicious, when hackers are basically trying to get into a system. You can detect this by looking at the patterns of the traffic. Right? And the trick is that behind the malicious traffic there are hackers, and they're constantly changing their patterns, their access patterns, to avoid being detected. So if you manage to label traffic as malicious, well, you want your model to keep learning. So they have this system with thousands of machines, and they have a few machines that are dedicated to just learning from the traffic, and in real time adapting, learning, detecting anomalous traffic, and sending it to human beings so that they can verify it themselves, label it, et cetera.
And so it's really cool to know that River has been used in that context. It just made so much sense for them to say, wow, we can actually do this online. Batch learning was getting in their way. They had this system which was going a thousand miles an hour, with hundreds of thousands of data points accumulating all the time, and batch learning was just, again, annoying for them. Having a system that enables them to do all this online just made sense. And knowing that you can do this at such a high volume of traffic was really cool and exciting.
[01:06:02] Unknown:
In your own experience of building the project and using it for your own work, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:06:11] Unknown:
I think I'm just going to focus a bit on the human aspect here. Although I have been doing open source for quite a while, I've always had this approach where I probably work too much on new projects of my own rather than on existing projects. I'd rather just do my own thing than contribute to existing stuff, and that's not always a good thing, but it's just the way I work. And River is really the first open source project where I work with other people. Like probably many people, a lot of my open source work is just me working by myself. And obviously you work in companies where you have a review process and you work with other people, but this is the first open source project where I really work with a team.
And it's fun. It's just really so much fun. Just a month ago, we actually got to meet all together and have this informal reunion, and that was really fun. And you realize that, after three years, there are ups and downs, and there are moments where you just do not want to work on River anymore, because you have work, you have friends, girlfriends, whatnot. And so the only way to subsist as an open source project in the long term is to have multiple people working on it. It's not realistic to do open source on your own if you want something to be successful and to have an impact in the long term. So it's actually really important to just be nice and to have people around you who help you. And although not everyone contributes as much as I do, or as much as the core maintainers do, people help a lot, and they keep things alive. It's always a joy when I open an issue on GitHub and see that someone from the community has already answered the question, and I don't have to do anything. It helps tremendously.
Yeah.
[01:08:01] Unknown:
We've already talked a bit about some of the situations where online learning might not be the right choice. But for the case where somebody is going to use an online streaming machine learning approach, what are the cases where River is the wrong choice and maybe there's a different library or framework that would be better suited?
[01:08:18] Unknown:
Well, yeah. Again, honestly, I think that online learning is the wrong choice in 95% of cases. You do not want to make the mistake of thinking that your problem is an online problem. Most of the time you probably have a batch problem that you can solve with a batch library. I mean, scikit-learn, if you open it up and just run it, is always going to work reasonably well, so often I would just go for that. One thing we do get a lot is people asking how you can do deep learning with River, because they want to train deep learning models online. The answer is that we do have a sister library, called river-torch, that is dedicated to training Torch models online.
But again, that is a bit finicky at the moment and still needs some work done on it. If you want to be doing deep learning and working with images and sound, you know, unstructured data, River is not the right choice, even online, and you should probably be looking at PyTorch.
[01:09:14] Unknown:
As you continue to build and iterate on the River project, what are some of the things you have planned for the near to medium term, or any applications of this online learning approach that you're excited to dig into?
[01:09:26] Unknown:
We have a public roadmap. It's a Notion page with a list of stuff we're working on. It mostly has a list of algorithms to implement, and it's mostly there to let people know what we're working on and to encourage new contributors to pick something up. The few contributors we have just pick what they want to work on, in general order of preference. For instance, this summer I've decided to work on online covariance matrix estimation. Learning a covariance matrix online is quite useful, for example in financial trading. And if you have an inverse covariance matrix that you can estimate online, that unlocks so many other algorithms, such as Bayesian linear regression, the elliptic envelope method for anomaly detection, Gaussian processes, and whatnot. So I think I'm still in the nitty gritty details of implementing algorithms and not necessarily applying them to things. I'm kind of counting on users to do the applications.
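For the curious, the basic trick behind estimating a covariance matrix online is a Welford-style running update: keep a running mean and co-moment matrix and fold each new sample in as it arrives. This is a generic sketch of that update, not River's actual implementation:

```python
import numpy as np

class OnlineCovariance:
    """Welford-style running estimate of a covariance matrix."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.comoment = np.zeros((n_features, n_features))

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean                             # deviation from the old mean
        self.mean += delta / self.n                       # update the running mean
        self.comoment += np.outer(delta, x - self.mean)   # deviation from the new mean

    @property
    def covariance(self):
        return self.comoment / (self.n - 1) if self.n > 1 else None

cov = OnlineCovariance(n_features=2)
for sample in [(1.0, 2.0), (2.0, 1.5), (3.5, 4.0)]:
    cov.update(sample)
print(cov.covariance)
```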
That's just how it is at the moment. Now, one thing I'm working on in the mid to long term is Beaver. Eventually, I want to try to spend less time on River and work on a tool I'm building called Beaver. Beaver is a tool to deploy and maintain online learning models, so essentially an MLOps tool for online learning. It's in its infancy, but it's something I've been thinking about a lot. I recently gave a talk on it in Sweden, and I've sketched a blog post and some slides where I try to describe what it's going to look like. The goal of this project is to create a very simple, user friendly tool to deploy a model, and I'm hoping that is going to encourage people to actually use River and use online learning, because they're going to say, hey, okay, I can learn online, but I can also just deploy the model, and both tools play nicely together. So, yeah, the future of River is to have River and to have this reference tool to deploy online models.
It's not going to cater just to River. The goal is to be able to run it with any model that can learn online.
[01:11:41] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:11:55] Unknown:
I'm always impressed by how much the field is maturing. I think there's a clear separation now between regular machine learning, business machine learning as I like to call it, and deep learning. I think those are becoming two separate fields. I've kind of stayed away from deep learning because it's just not my cup of tea, but I'm very interested in business machine learning, as I call it. And I'm impressed by how much the community has evolved in terms of knowledge. The average ML practitioner today is just so much more proficient than five years ago.
And I think it's a big question of education and tooling. The tricky thing about an ML model is that it's not deterministic, so it's difficult to guarantee that its performance over time is going to be good, let alone certify the model or convince stakeholders that they should adopt it. In the real world, you don't just deploy a model and cross your fingers. So although we've gotten past the testing and R&D phase of a model, we are still not there in terms of deploying models. The reality is that there's usually a feedback loop where you monitor your model and possibly retrain it, be it online or offline retraining. It doesn't matter.
And I don't think we're really good at that right now. I don't think we have great tools for keeping human beings in the loop, working hand in hand with machine learning models. So I think tools like Prodigy, which is a tool that has a user work hand in hand with an ML system by labeling data that the model is unsure about, are crucial. They're game changers, because they create real systems where you care about new data coming in, retraining your model, having a human validate predictions, things like that. So I think we have to move away from only having tools that are aimed at training a model. We also need to get better at tools that encourage you to monitor your model, to keep training it, to work with it. Again, just treat machine learning as software engineering and not as some research project.
[01:14:16] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on River and helping to introduce the overall concept of online machine learning. It's definitely a very interesting space, and it's great to have tools like River available to help people take advantage of this approach. So thank you for all of the time and effort that you and the other maintainers are putting into the project, and I hope you enjoy the rest of your day. Oh, thank you. Thanks for having me. It was great. Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model's behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually.
Go to themachinelearningpodcast.com/deepchecks today to learn more and get started. Your host is Tobias Macey, and today I'm interviewing Max Halford about River, a Python toolkit for streaming and online machine learning. So, Max, can you start by introducing yourself?
[00:01:01] Unknown:
Oh, hey there. So I'm Max Halford. I consider myself a data scientist. My day job is doing data science. I actually measure the carbon footprint of clothing items. But I have a wide interest in technical topics, be it software engineering or data engineering. I do a lot of open source. My academic background leans towards finance and computer science and statistics. I actually did a PhD in applied machine learning, which I finished a couple of years ago. So, yeah, an all-around nerd, basically.
[00:01:35] Unknown:
And do you remember how you first started working in machine learning?
[00:01:39] Unknown:
Kind of, yes. I was a late bloomer. I got started when I was maybe 21 or 22, when I was at university. I basically had no idea what machine learning was, but I started this curriculum that was built around statistics, and we had a course, maybe two or three hours a week, about machine learning, and it did kind of blow my mind. It was around the time when machine learning, and particularly deep learning, was starting to explode. So I kind of got started at university, and I was lucky enough to get a theoretical training.
[00:02:14] Unknown:
And in terms of the River project, can you describe a bit more about what it is that you've built and some of the story behind how it came to be, and why you decided that you wanted to build it in the first place?
[00:02:25] Unknown:
When I was at university, I received a regular introduction to machine learning. Then I did some internships, and I started a PhD after my internships. I also did a lot of Kaggle competitions on the side. So I was kind of hooked on machine learning, and it always felt to me that something was off, because when we were learning machine learning, everything made sense, but then when you do it in practice, you often find that it's not a playground. The playground scenarios that they describe at university when you learn machine learning just do not apply in the real world. The real world is alive as well. You have data that's coming in, like a flow of data, or every day there's new data. There's an interactive aspect to the world around us, to the way the data is flowing. It's not like a CSV file.
Yeah, it just felt like fitting a square peg in a round hole. So I was always curious, in the back of my mind, about how you could do online machine learning. Well, I didn't know it was called online machine learning, because when I was a kid, I remember growing up and thinking that AI was this kind of intelligent machine that would keep learning as it went on and as it experienced the world around it. Anyway, when I started my PhD, I was lucky enough to have a lot of time to read. So I read a lot of papers and blog posts and whatnot, and I can't remember the exact day or week that I stumbled upon it, but at some point I just started learning about online machine learning.
Maybe through some blog post or something. And then it was like a big explosion in my head, and I was like, wow, this is crazy. This actually exists. And I was so curious as to why it wasn't more popular. At the time, I did a lot of open source as a way to learn, and so it just felt natural to me to start implementing the algorithms that I'd read about in papers and elsewhere. I just started writing code to learn, basically, just to confirm what I'd learned and whatnot. That's just the way I learn. And it kind of evolved into what River is today, which is an open source package that people use. Now, if I may expand a little bit, River is actually the merger of two projects.
The first project is called scikit-multiflow. It was a package that was developed before I even got into machine learning. It has roots in academia in New Zealand, and it comes from an old Java package called MOA. Anyway, I wasn't aware of that. On my end, I had started a package called creme at the time. In French, creme means cream, and it plays on incremental, which is another way to say online. So for about a year I developed creme on my own, and at some point it just made sense to reach out to the people from scikit-multiflow and propose a merger. It took us quite a while, but after nine months of negotiation and figuring out the details, we merged, and we called the new package River.
[00:05:26] Unknown:
You mentioned that it's built around this idea of online machine learning, and in the documentation you also refer to it as streaming machine learning. I'm curious if you can just talk through what that really means in the context of building a machine learning system, and some of the practical differences between that and the typical batch oriented workflow that most folks who are working in ML are going to be familiar with.
[00:05:50] Unknown:
First, just to recap on machine learning: the whole point of machine learning is to teach a model to learn from data and to make decisions. You know, monkey see, monkey do. And the typical way you do that is that you fit a model to a bunch of data, and that's it, really. Online machine learning is the equivalent of that, but for streaming data. You stop thinking about data as a file or a table in a database, and you think of it as a stream, a flow of data. You could call it incremental machine learning, you could call it streaming machine learning. I more often see online machine learning being used, although if you Google that, you mostly find online courses about machine learning, so the name is a bit unfortunate. But anyway, it's just this way of saying, can I do machine learning, but with streaming data? And the rule is that an online model is one that can learn one sample at a time. Usually, you show a model a whole dataset and it can work with that dataset. It can calculate the average of the dataset, it can do a whole bunch of stuff. But the restriction with online machine learning is that the model cannot see the whole dataset. It can't hold it in memory. It can only see one sample at a time, and it has to work like that. So it's a restriction, right? It makes it harder for the model to learn, but it also has many, many implications. If you have a model that can learn that way, well, you can have a model that just keeps learning as you go on.
Because with a regular machine learning model, once it's been fitted to a dataset, you have to retrain it from scratch if you want to incorporate new samples. That can be a source of frustration, and that's why I was talking about the square peg in the round hole before. So say you have an online model that is just as performant as a batch model. Regardless of accuracy, that has many implications, and it actually makes things easier. Because if you have a model that can keep learning, you don't have to, for instance, schedule the training of your model. Every time a new sample arrives, you can just tell your model to learn from it, and then you're done. That ensures that your model is always as up to date as possible, which obviously has many, many benefits. If you think about people working on the stock market, trying to forecast the evolution of a particular stock, they've actually been doing online machine learning since the eighties, but obviously they have a lot to lose by making all of this public, so it just never became a big thing, and it always stayed inside stock market companies.
So the practical differences are that you are working over a stream of data. You're not working with a static dataset. This stream of data has an order, meaning that the fact that one sample arrives before another carries a lot of meaning, and that actually reflects what's happening in the real world. In the real world, data arrives in a certain order, and if you train your model offline on that data, you want to process it in the same order. That ensures that you are actually reproducing the conditions that happen in the real world. Now, another practical consideration is that online learning is much less popular than batch learning, and so a lot less research and software work has been put into it. If you are a newcomer to the field, there are just not a lot of resources to learn from. You could probably spend a day on Google and find all the resources there are, because there are just not that many of them. From memory, there are maybe ten links on Google that you can learn from about online learning. So it's a bit of a niche topic.
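To make the one-sample-at-a-time restriction concrete, even simple statistics have to be computed incrementally in this setting. A minimal illustration with River's running statistics (the numbers are arbitrary):

```python
from river import stats

mean = stats.Mean()
var = stats.Var()

for x in [5.0, 3.0, 8.0, 6.0]:  # samples arriving one by one
    mean.update(x)
    var.update(x)

# Running mean and variance, without ever holding the dataset in memory.
print(mean.get(), var.get())
```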
[00:09:52] Unknown:
In terms of the fact that batch is such a predominant mode of building these ML systems despite the fact that it's not very reflective of the way that the real world actually operates, why do you think that's the case that streaming or online machine learning is still such a niche topic and hasn't been more broadly adopted?
[00:10:14] Unknown:
Sometimes it feels like I'm trying to teach a new religion, which feels a bit weird because there are not a lot of us doing it. So I never try to force people into this. There are obviously many good reasons why batch learning is still done. From a historical point of view, I think it's interesting, because we always used to use statistical models to explain data and not necessarily to predict. You just have a dataset, and you'd like to understand what variables are affecting a particular outcome. For instance, if you take linear regression, historically it's been used to explain the impact of global warming on the rise of the sea level, but not necessarily to predict what the impact on the sea level would be if the temperature of the globe were higher. But then someone said, let's use machine learning to predict outcomes in a business context, and that's why we have this big advent of machine learning.
And we've kind of been using the tools that were lying around. We've been using all these tools that we used to fit a dataset and explain it, but now we're using them for predicting. These models are static. When we started doing linear regression, we never really worried about streaming data, because datasets were small and static. The Internet didn't even exist, so there was no real notion of IoT or sensors or streaming data. The fact is that we've never needed online models. And so as a field, if you look at academia and industry, we're very used to batch learning, and we're very comfortable with it. There's a lot of good software, and this is what people are being taught at university.
So I'm not saying that online learning is necessarily better than batch learning, but I do think that the reason batch learning is so predominant in comparison is that we are just used to it, basically. And I see it every week: people who are trying to rethink their job or their projects and say, maybe I could be doing online learning, it actually makes more sense. So I think it's a question of habits, really.
[00:12:39] Unknown:
For people who are assessing which approach to take in their ML projects, what are some of the use cases where online or streaming ML is the more logical approach or what the decision factors look like for somebody deciding, do I go with a batch oriented process where I'm going to have this large set of tooling available to me, or do I want to use online or streaming ML because the benefits outweigh the potential costs of plugging into this ecosystem of tooling?
[00:13:10] Unknown:
I'll be honest, I think it always makes sense to start with a batch model. Why? Because if you're pragmatic and you actually have deadlines to meet and you just want to be productive, there are so many good solutions for training a batch model and deploying it. So I would just go with that to start with. And then there's the question of, could I be doing this online? I think there are two cases. There are cases where you need it, and I have a great example. Netflix, when they do recommendations: you arrive on the website and Netflix recommends movies to you. Netflix actually retrains a model every night or every week, and they have many models anyway, but they are learning from your behavior to retrain their models and update their recommendations.
Right? There's a team at Netflix that is working on learning instantly. So if you are scrolling on the Netflix website and you see a recommendation, the fact that you did not click on that recommendation is a signal that maybe you do not want to watch that movie, or that the recommendations should be changed. So if you were able to have a model, for instance in your browser, that would learn in real time from your browsing activity and could update and learn on the fly, that would be really powerful. And the only way to do that is to have a model per user that is learning online.
So you cannot just use batch models for that. Every time a user scrolls or ignores a movie, you can't just take the entire history of data and refit the model. It would be much too heavy, and it's just not practical. So sometimes the only way forward is to do online learning. But, again, this is quite niche. Netflix recommendations are obviously working reasonably well, I believe, judging from their market value. But if you are pushing the envelope, then sometimes you need online learning. Now, another case is when you do not necessarily need it, but you want it because it makes things easier. A good example I have is: imagine you're working on an app that categorizes tickets, for instance in help desk software. You go on a website and you fill in a form, or you send a message or an email, because you have some problem, maybe with a reimbursement for something you bought on Amazon.
And then there's a customer service team behind that, human beings who are actually answering those questions. And it's really important to be able to categorize each request and put it into a bucket, so that it gets assigned to the right person. Now maybe the product manager has decided that we need a new category. So there's this new category, and your model is classifying tickets into one of several categories. If you introduce a new category, it means that you have to retrain the model to incorporate it. I was in discussion with a company, and their budget was such that they were only able to retrain the model every three months. So if you introduced a new category into your system, the model would only pick it up and predict it after three months. That sounded kind of insane and, you know, wasn't practical at all.
I'm not aware of the exact details, but it just seemed too expensive for them to retrain their model from scratch. If they were using an online model, that model could potentially just learn the new tickets on the fly. You'd have this feedback loop where you introduce a new category, people send emails, maybe a human assigns a ticket to the new category, and that becomes a signal for the model. The model picks that up, learns, and incorporates that category into its next predictions. So that's a scenario where you don't necessarily need online learning, but online learning just makes more sense and makes your system easier to maintain and work with, basically.
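Here is a hedged sketch of that ticket-routing scenario. The categories and word-count features are invented, and naive Bayes is used because it accommodates a brand-new class particularly simply; the point is just that an incremental classifier can start learning a category the first time a human assigns it, with no retraining job:

```python
from river import naive_bayes

model = naive_bayes.MultinomialNB()

# Hypothetical word-count features extracted from support tickets.
model.learn_one({"refund": 2, "order": 1}, "billing")
model.learn_one({"password": 3, "login": 1}, "account")

# A product manager introduces a new category, and a human labels one ticket...
model.learn_one({"damaged": 2, "parcel": 1}, "shipping")

# ...so the model can start predicting that category immediately.
print(model.predict_one({"parcel": 1, "damaged": 1}))
```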
[00:17:17] Unknown:
There are a number of interesting things to dig into there. 1 of the things that you mentioned is the idea of having a model per user in the Netflix example. And I'm wondering if you can maybe talk through some of the conceptual elements of saying, okay. I've got this baseline structure for it. This is how I'm going to build the model. Here is the initial training set. I've got this model deployed. And now every time a specific user interacts with this model, it is going to learn their specific behaviors and be tuned to the data that they are generating.
Would you then take that information and feed that back into the baseline model that gets loaded at the time that the browser interacts with the website and just some of the ways to think about the potential approaches for how to say, okay. I've got a model, but it's going to be customized per user and just managing the kind of fan out, fan in topologies that might result from those event based interactions spread out across n number of users or entities.
[00:18:23] Unknown:
Oh, it sounds insane when you say it, because having one model per user, deployed in the user's browser or on their mobile phone or, God knows what, on an Apple Watch, does kind of sound insane. But it is interesting. I'm not aware of a lot of companies that would have the justification to actually do this, and I've never had the occasion to work in a setting where I would do it. But I had one good example when I was doing some pro bono consulting. It was this car company where, in the onboard navigation system, they wanted to build a model that could guess where you're going. So, basically, depending on where you left from: if you left home in the morning, you're probably going to work.
And they would then use this to send you news about your itinerary or things like that. They really needed a model that would be able to learn online, and they made the bold decision to embed the model into the car. It's not a central model hosted on some big server that the car interacts with; the intelligence is actually happening in the car. And when you think about that, it's really interesting, because it creates a decentralized system. It actually creates a system where you don't even need the Internet for the model to work. So there are so many operational requirements for that. Actually, now that I think of it and I'm talking about cars, I realize that's actually what Tesla is doing. They're computing and making decisions inside the car, doing a bunch of stuff, and they're also communicating with the mother servers.
But the actual compute, they actually have GPUs in the car doing compute with their deep learning models and whatnot. So it's definitely possible to do this. Right? But it's clearly not something that every company would go through or would have the need to do.
[00:20:20] Unknown:
It's interesting also to think about how something like that would play into the federated learning approach where you have federated models where there is that core model that's being built and maintained at the core. And then as users interact at the edge, whether it's on their device or in their browser, it loads a federated learning component that has that streaming ML capability. So the model evolves as the user is interacting with it on their device, and then that information is then sent back to the centralized system to be able to feed back into the core model so that you can have these kind of parallel streams of the model that the user is interacting with as being customized to their behavior at the time that they're interacting with it, but it does still get propagated back into the larger system so that the new intelligence is able to generate an updated experience for everybody who then goes and interacts with it in the future.
[00:21:20] Unknown:
Yeah, that's really, really interesting. First off, the fact is that I'm actually still young, and there are so many things that I don't know. I don't have the technical savvy to be able to suggest ways forward. But these are obviously things that I think about. There are things like Hogwild!, which is a project from Google, where they have a paper discussing this. I think there's a really simple pattern if you wanted to do this, if you're a listener wanting to do something like this, which is to maybe once a month retrain a model in batch, and that model is going to be like a hydra: you're going to copy it and send it to each user.
And then each copy, for each user, is going to be able to learn in its own environment. For instance, a good idea would maybe be to increase the learning rate, so that every sample the user gives you matters a lot. So if we take the Netflix example, you would have your run of the mill recommendation system model that you train in batch, using all the tools the community uses. But then you would embed that into each person's browser, and maybe you do this once a month. And then the model for each user would be a copy, a clone, a separate model.
And it would keep learning in an online manner. So maybe your model was trained in batch initially, but now, for each user, it's being trained online. For instance, you can do this with factorization machines, which can be trained in batch but also online. And you would use a high learning rate so that every sample matters a lot. And so you, the user, are tuning your own model. I don't know how YouTube does it, for instance, but I imagine they have some sort of core model that learns how to make good recommendations. But obviously, on YouTube, there are mechanisms that make recommendations tailored to each user, and I don't know if that is done online, and I don't know if it's actually machine learning. It's probably just rules or scores.
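A rough sketch of this warm-start-centrally-then-personalize-online pattern, using a logistic regression as a stand-in for the recommender. The feature names are made up, and swapping the optimizer on a copy like this is a simplification; in practice this could be a factorization machine and a proper per-user store:

```python
import copy
from river import linear_model, optim

# Base model, refreshed centrally, say once a month.
base = linear_model.LogisticRegression(optimizer=optim.SGD(0.01))
# ... assume `base` has already been trained on historical interactions ...

# Ship one copy per user; a higher learning rate makes that user's
# own clicks count for a lot.
user_model = copy.deepcopy(base)
user_model.optimizer = optim.SGD(0.1)

# The user ignores a recommended title: a negative signal, learned on the spot.
x = {"genre_match": 0.8, "recency": 0.3, "already_watched_franchise": 1.0}
user_model.learn_one(x, False)
print(user_model.predict_proba_one(x))
```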
But, yeah, I think it's a really fun idea to play around with, and I do think that online learning enables this even more.
[00:23:40] Unknown:
As far as the operational model for people who are using online and streaming machine learning, if they're coming from a batch background, they're going to be used to dealing with the train, test, deploy cycle, where I have my dataset, I build this model, I validate the model against the test dataset that I've held out from the training data, everything looks good based on the area under the curve or whatever metrics I'm using to validate that model, and now I'm going to put it into my production environment, where maybe it's being served as a Flask app or a FastAPI app. And then I'm going to monitor it for concept drift, and eventually I say, okay, this is no longer performing up to the specification, so now I need to go back and retrain the model on my updated datasets.
And I'm wondering what that process looks like for somebody building a streaming ML model with something like River, how you address things like concept drift, and how concept drift manifests in this streaming environment where you are continually learning and you don't have to worry that the real world data you're seeing is widely divergent from the data you trained against.
[00:24:48] Unknown:
There are so many things to dig into, and I'll try to give a comprehensive answer. So first off, it's important to understand that River itself is to online learning what scikit-learn is to batch learning. It only aspires to be a machine learning library. It basically contains algorithms and routines to train a model, and models that can learn and predict. What you're getting at with your question is MLOps: what does the life cycle look like for an online model? And that is something I spend a lot of time looking into.
The first part of the answer is that online learning enables different patterns, and I believe these patterns are simpler to reason about. As you said, you usually start off by training a model, then evaluating it against a test set, and maybe going to report to your stakeholders, showing them the performance and guaranteeing that, say, the rate of false positives is underneath a certain threshold, so yes, we can diagnose cancer with this model, or not. And then you deploy it, maybe, if you're lucky and get the approval, and you sleep well or not so well at night depending on how much you trust your model. But there's this notion that you deploy a model and it's like a baby in the world, and this baby is not going to keep learning. It's a lie to believe that if you deploy a batch model, you're going to be able to just let it run by itself. There's actually maintenance that has to happen there. The reality is that any machine learning project, any serious project, is never finished. It's like software, basically. We have to think of machine learning projects as software engineering. And obviously, we all know that you never just deploy a software feature and never look at it again. You monitor it, you take care of it, investigating bugs and whatnot. Batch learning, in that sense, is a bit difficult to work with, because if your model is drifting, meaning that its performance is dropping because the data it's looking at is different from the training set it was trained on, you basically have to be very lucky for your model to pick its performance back up. You're going to have to do something about it, and, yeah, you can just retrain it.
But what you do with online learning is that you can have the model just keep learning as you go. There is no distinction between training and testing. What online learning encourages you to do is to deploy your model as soon as possible. Say you have a model and it hasn't been trained on anything: well, you can put it into production straight away. When samples arrive, it's going to make a prediction. So maybe a user arrives on your website, you make recommendations, and that's your prediction. And then your user is going to click or not on something you recommended to her, and that's going to be feedback for your model to keep training on.
So that is already a guarantee that your model is kind of up to date and kind of learning. And that's really interesting, because it enables so many good patterns. You can still monitor the performance of your model, and if the performance of your online model is dropping, which I haven't seen yet, it probably means that your problem is really hard to solve. So the really cool thing I stumbled upon was this idea of test-then-train. Imagine the scenario where you have a classification model that is running online. What happens is that your model is generating features. Say the user arrives on the website, and the features are: what's the time of day, what's the weather like, what are the top films at the moment? These are features that you generate at a certain point in time. And then later on, when you get the feedback, so whether your recommendation was a success or not, that's training data for your model.
You use the same features that you used for the prediction, and you use those features for training. And so you can see that there's a clear feedback loop. The event happens, the user comes on the website, your model generates features, and then at some later point in time the feedback arrives. Was the prediction successful or not, and if not, by how much was it off? And then you can use this feedback, join it with the features that you generated at prediction time, and use that as training data. So you essentially have a small queue or database that's storing your predictions, your features, and the labels that make up your training data.
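A minimal sketch of that predict-now, learn-later bookkeeping: cache the features under an event id at prediction time, then join them with the label when the feedback arrives. The names and the in-memory dict are placeholders; in production this would be a queue or a database:

```python
from river import linear_model

model = linear_model.LogisticRegression()
pending = {}  # event_id -> features used at prediction time

def predict(event_id, features):
    pending[event_id] = features           # remember exactly what the model saw
    return model.predict_proba_one(features)

def on_feedback(event_id, clicked):
    features = pending.pop(event_id)       # join the label back to its features
    model.learn_one(features, clicked)     # train on the same features used to predict

p = predict("evt-1", {"hour": 20, "genre_match": 0.7})
on_feedback("evt-1", True)  # the user clicked, so a positive label
```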
So the big difference here is that you do not necessarily have to do a training and testing phase before deploying your model. You can actually just deploy your model initially, it learns online, and then you can monitor it. A really cool thing is that if you do this, you have a log of people coming to your website, of the predictions you made, of the features you generated, of people clicking around and interacting with your recommendations. This creates a log of what's happening on your website. And what's really cool is that you can, offline, after the fact, process it in the same order it arrived in and replay the history of what happened. It means that when you're developing a new model, or you want to develop a better model, you can just take this log of events, run through it, and do this prediction and training dance, the whole life cycle.
You're replaying the feedback loop, and then you have a very accurate representation of how your model would have performed on that sequence of events. That's really powerful, because the way you're designing your model there is that you have a rough sketch of a model which you deploy, and then you have a log from that model, so you can evaluate its performance, but more importantly, you have a log of the events. And then, when you're designing version two of your model, you have a very reliable way to estimate how your new model would have performed.
And that's really cool, because when you are doing train test splits in batch learning, that is not representative of the way our world works. The whole problem with train and test is that people spend so much time making sure that their train test split is correct, when in fact even a good train test split is not a good proxy for the real world. A good proxy for the real world is to just replay history, and that's something that you can only do with online learning. That's really cool. Now, to come to your point about concept drift: there are many different kinds of concept drift, and Chip Huyen has a really good post about it on her blog. What really matters is that the result of concept drift is usually that your model is not performing as well. There's going to be a drop in performance. So the first thing you see on your monitoring dashboard is that a metric has dropped. And then, when you dig into it, you see that maybe there's a class imbalance, or that the correlation between a feature and a class has changed, or something like that. It's essentially saying that the data the model has been trained on is not representative of the new data that is being seen in production.
But again, I have said this a few times: online models, if you put them in place with the correct MLOps setup, are able to learn as soon as possible. That guarantees that your model is as up to date as possible, so you're basically doing the best you can. Drift is always possible. You can obviously still have a model that's degrading or just going haywire, but that's not necessarily related to the online learning aspect of things. And there are also ways to cope with this. For instance, Dan Crankshaw and his team at Berkeley developed a system called Clipper.
It's kind of an ops tool. It's a research project, and I think it's been deprecated, but the ideas are still there. It's a project where they have a meta model which looks at many models being run in production and decides online which model should be used to make a prediction. So it's kind of like a teacher selecting the best student at a certain point in time, seeing throughout the year how the students are evolving and which students are getting better or not.
And you can do this with bandits, for instance. So, just to say that there are many ways to deal with concept drift, and online models, again, help to cope with concept drift in a way that just makes more sense than batch models.
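To illustrate the meta-model idea in miniature, here is a toy epsilon-greedy selector that routes each prediction to whichever candidate model has the best running accuracy. It is a hand-rolled sketch, not Clipper and not a River API:

```python
import random
from river import linear_model, metrics, tree

candidates = {
    "linear": linear_model.LogisticRegression(),
    "tree": tree.HoeffdingTreeClassifier(),
}
scores = {name: metrics.Accuracy() for name in candidates}

def predict_and_learn(x, y, epsilon=0.1):
    # Explore occasionally, otherwise exploit the best-performing model so far.
    if random.random() < epsilon:
        chosen = random.choice(list(candidates))
    else:
        chosen = max(scores, key=lambda name: scores[name].get())
    y_pred = candidates[chosen].predict_one(x)
    scores[chosen].update(y, y_pred)
    # Every candidate keeps learning, whichever one served the prediction.
    for model in candidates.values():
        model.learn_one(x, y)
    return y_pred
```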
[00:33:59] Unknown:
And so digging now into River itself, can you talk through how you've implemented that framework and some of the design considerations that went into how do I think about exposing this online learning capability in a way that is accessible and understandable to people who are used to building batch models?
[00:34:19] Unknown:
So I like to think of River more as a library than a framework. If I'm not mistaken, a framework kind of forces you into a certain behavior or way of doing things, and there's an inversion of control where the framework is deciding things for you. If you look at Keras and PyTorch, Keras is much more of a framework in comparison to PyTorch, because for me the reason PyTorch was successful is that it gave the control back to the user. You can do so many things in PyTorch, and it's very flexible and doesn't really impose a single way of doing things. So we have that in mind with River. River, again, is just a library for doing online machine learning. It just contains the algorithms. It doesn't force you to read your data in a certain way. You could use it in a web app, you could use it offline.
You could use it on an offline IoT sensor. River is not concerned with that; it's just a library that is agnostic in that regard. So now to come to what River is: in terms of online machine learning, it is general purpose. It's not dedicated to anomaly detection or forecasting or classification. It covers all of that, that's the ambition at least. Just to note, that actually makes it really hard to develop and maintain, because the other maintainers and I are not specialized in all of these domains, and we kind of have to jump around: one day I'm working on the forecasting module, the next I'm working on anomaly detection. It's kind of crazy, but it's still fun. What we do provide is a common interface. Just like scikit-learn, every piece of the puzzle in River follows a certain interface. So we have transformers, we have regressors, we have anomaly detectors.
Of course, we have classifiers, binary and multi class. We have forecasting models for time series. And so we guarantee to the user that each model follows a certain API. Every model has a learn method, so it can just learn from new data. And they usually have a predict method to make predictions. Forecasters will have a forecast method. Anomaly detectors will have a score method, which outputs an anomaly score. And so the strength of River is to provide this consistent API for doing online machine learning. It's a bit opinionated, much like scikit-learn really. It just says, okay, you're going to have learn and predict, but that's a reasonable thing to impose.
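To make that interface concrete, here is a minimal sketch of the predict-then-learn loop described above. It assumes a recent version of River; exact method names can vary slightly between releases, and the features and stream are made up for illustration.

```python
from river import linear_model, metrics

# One online classifier, learning one sample at a time.
model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# A made-up stream of (features, label) pairs.
stream = [
    ({"clicks": 10, "time_on_page": 3.2}, True),
    ({"clicks": 1, "time_on_page": 0.4}, False),
    ({"clicks": 7, "time_on_page": 2.8}, True),
]

for x, y in stream:
    y_pred = model.predict_one(x)  # predict before seeing the label
    metric.update(y, y_pred)       # score that prediction
    model.learn_one(x, y)          # then learn from the ground truth

print(metric)
```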
And that makes it easier for users to swap in new models, because they all have the same interface. And so, again, just to conclude on what I said at the start, we made the explicit choice to follow the single responsibility principle, in that River only manages the machine learning aspect of things and not the deployment and whatnot. And so if you want to use River in production, and we do see people doing this, you have to worry about some of the details yourself. Like, if you want to deploy in a web app, well,
[00:37:22] Unknown:
we do not help at the moment at all with that. You have to deploy your own web app. As far as the overall design of the framework, you mentioned that it actually started off as 2 separate projects, and then you went through the long process of merging them together. I'm wondering how have the overall design and goals of the project changed or evolved since you first started working on this idea?
[00:37:44] Unknown:
The reason why the merger between creme and scikit-multiflow took a certain amount of time was that, although we were both online learning libraries, there were some subtle differences which were kind of important. So my vision with creme at the time, and with River now, is that we should only cater to models which are what I call pure online models, in that they can learn from a single sample of data at a time. But there are also mini batch models, models which can learn from streaming data but in chunks, like a mini batch of data. And scikit-multiflow was kind of doing this, much like PyTorch and TensorFlow and, you know, deep learning models.
And so I kind of had to convince them that there were reasons why pure online learning was just a bit better. Why? Because, you know, if you think again about a user event on a website, or just any web request, or things that are happening in real life, you want to learn as soon as possible. You don't want to have to wait for, say, 32 samples to arrive to have a batch to feed to your model. You could, obviously, but it just made sense to me to have something simpler where we only care about pure online learning, because it means that you don't have to store anything, you just learn on the fly. And the interaction I had with the scikit-multiflow team kind of confirmed this idea.
And I guess, you know, they were a bit doubtful when we did the merger, and maybe I was a bit too opinionated, but history proved that it actually made sense, and it's not a decision that we regret. Like, we're really happy with this now. So River has, you know, arguably moderate success. It's working. It's alive. It's breathing. The project has been going for two and a half years. We have a steady intake of users that are adopting it, and we see this from the emails we receive, from GitHub discussions and issues, and just the general feedback we get. So the general idea of having a library that is only focused on ML and just the algorithms is something that we are going to keep going with, because it looks like it's working, it looks like this is what people want.
You know, a simple example is: hey, I want to compute a covariance matrix online. Well, River aims to be the go to library to answer those kinds of questions. Right? But the truth is that people don't just need that. They also need ways to deploy these models and do MLOps online. So the next step for us is to build new tools in that direction. We also think that the initial development of River was a bit fast and furious. The aim was to implement as many algorithms as possible and, you know, just to cover the wide spectrum of machine learning. Now we've covered quite a few topics, and we also have day jobs. When I was developing River initially, I was doing a PhD, so ironically I had more time than now that I have a proper job. So we value our time a bit more, and we're not in this fast and furious mode. We just focus on picking certain models which we see value in and spending the time to implement them properly. And the final aspect is that we see that our user base does not just want algorithms. They also want us to educate them. They have general questions about, you know, what online learning is and how to do it and how to decide what model to use, all the questions that we're covering in this podcast, basically. So I think there's a huge need for us to move into a more educational role. When I was younger, scikit-learn was my bible. I would spend so much time not even using the code, but actually just reading through the documentation, because it's just so excellent.
Obviously, that takes a lot of time, a lot of energy, people, contributors, and help, but it's definitely something we are moving towards.
[00:41:50] Unknown:
In terms of the overall process of building a model using something like River, when people are building a batch model, they end up getting a binary artifact out that is the entire state of that model after it has gone through the training process. And I'm curious if you can talk to how River manages the stateful aspect of the model as it goes through this continual learning process, both in a sandbox use case where somebody is just playing around on their laptop, but also as you push it into production where maybe you want to be able to scale out serving the model across a fleet of different servers, and just some of the state management that goes into being able to continually learn as new information is presented to it?
[00:42:39] Unknown:
The great advantage of batch learning is that once you train your model, it's essentially a pure function. There are no side effects. The decision process underlying the model is not going to change. So you can push the envelope and compile it. You can pickle it. You can convert it to another format, which is what ONNX does. You can compile it so that it can run on a mobile device. It does not need the ability to train anymore; it's basically just a function that takes an input and outputs something. So there's a good reason why batch learning is predominant. But with River, it's different, because online models need to keep this ability to learn. So it's actually kind of straightforward, but the internal representation of most models in River is fluid, dynamic. It's usually stored in dictionaries that can grow and shrink in size. So imagine you have a new feature that arrives in your stream. Well, every River model will cope with that. They're not static. A new feature appears, and they handle it gracefully. For instance, a linear regression model is just going to add a new weight to its internal dictionary of weights. Now, in terms of serialization and pickling and whatnot, River basically stands on the shoulders of Python, very much so. We do not depend very much on NumPy or pandas or SciPy.
We mostly depend on the Python standard library. We use dictionaries a lot, and that plays really nicely with the standard library. You can take any River model, pickle it, and just save it. You can also just dump it to JSON or whatnot. That said, the paradigm of: you train a model, you pickle it, and you have an artifact that you can upload anywhere, it's a bit different with online learning, because you would play this differently. You would maintain your model in memory. So if you have a web server serving your model, you would not load the model every time to make a prediction; you would keep it in memory and make predictions from it, because it's in memory, so you don't have to load it anymore.
And then when a sample arrives, your model is in memory, so you can just make it learn from that. So, yeah, I think the big difference is that you hold your model in memory rather than pickling it to disk and loading it when necessary.
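As a rough illustration of that serve-from-memory pattern, here's a small sketch, not an official River deployment recipe: the model stays resident in the process, learns as feedback arrives, and is occasionally pickled to disk as a checkpoint. The handler names are made up, and it assumes the standard River API.

```python
import pickle

from river import linear_model

model = linear_model.LinearRegression()  # lives in process memory for the whole lifetime

def handle_prediction_request(features: dict) -> float:
    # No loading step: the model is already in memory.
    return model.predict_one(features)

def handle_feedback(features: dict, target: float) -> None:
    # As soon as the ground truth arrives, the model learns from it.
    model.learn_one(features, target)

def checkpoint(path: str = "model.pkl") -> None:
    # River models are plain Python objects, so pickling a snapshot is straightforward.
    with open(path, "wb") as f:
        pickle.dump(model, f)
```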
[00:45:11] Unknown:
In terms of the use of dictionaries as that internal state representation, as you said, it gives you the flexibility to evolve with updates to the data. But at the same time, you have this heterogeneous data structure that can be mutated while the system is in flight, and you don't necessarily have a strict schema being applied to it. And I'm just curious if you can talk to the trade offs of having that flexibility while lacking some of the validation, schema, and structure information that you might want in something that's dealing with these volumes of data?
[00:45:52] Unknown:
So, yeah, River uses dictionaries, and the advantages of dictionaries are plentiful. First of all, a very important thing is that dictionaries are to lists what pandas data frames are to NumPy arrays. A dictionary has names, and that's really important. It means that each one of your features actually has a name attached to it. And I find that hugely important, because we always see features as just numbers, but they also have names. Imagine you have a bunch of features coming in. If that was a list or a NumPy array, you have no real way of knowing which column corresponds to which variable. If you switch two columns with each other, that could be a really silent bug, which can really affect you. Whereas if you name each feature, then when the column order changes, the names are permuted along with the values, so nothing breaks. So what's really cool with dictionaries, and the way it works in River, is that the order of the features you receive doesn't matter, because we access every feature by name and not by position.
Dictionaries are also mutable in size. So if a new feature arrives, or a feature disappears between two samples, that just works. It's also really cool that dictionaries, when you think about it, are naturally sparse. Imagine that on the Netflix project, the features you receive are something like the name of the user mapped to 1, or the date mapped to 1. You can just store sparse information in a dictionary, which is really useful. There's also the robustness principle that we follow with River: be conservative in what you do, be liberal in what you accept. So River is very liberal in that it accepts heterogeneous data, as you said: dictionaries that differ in size, dictionaries that have different orders, or whatnot. That is really flexible for users. A common use case is to deploy River in a web app, and in a web app you're receiving JSON data a lot of the time. The fact that JSON data has a one to one relationship with Python dictionaries makes it really easy to integrate into a web app. Whereas if you have a regular batch model, you have to mess about with casting the JSON data to a NumPy array, and that has a real cost. Although NumPy, Torch, and TensorFlow are good at processing matrices, there's a cost that comes with taking native data, such as dictionaries, and casting it to a higher order data structure such as a NumPy array. In a web app where you're reasoning in terms of milliseconds, you're spending a lot of your time just converting your JSON data to NumPy.
Whereas with River, because it consumes dictionaries, the data you receive, whether you're coding in Django, Flask, or FastAPI, your request is a dictionary, so you don't have to convert the data. It just runs. So actually, if you take a linear regression in River and a linear regression in Torch, making a single prediction is going to be much faster in River because there's no conversion cost. Plus the features are named, and you don't have to worry about the features arriving in a mixed order or anything. It just makes a lot of sense, really, dictionaries in that sense. Now, the pitfalls: obviously it's not perfect. But I kind of disagree that the lack of structure in dictionaries is a problem. If you wanted to, in Python you can use a data class and convert that to a dictionary and feed it to your model. The data class helps you to create structure.
So I don't think that's really a problem. Quite the contrary, I think. A dictionary can also be nested. So the features that you're feeding to the model don't have to be a flat dictionary; they can actually be nested, and that's really cool too. You can have features for your user, features for your page, features for the day, anything. Things that you cannot necessarily do with a flat structure such as a data frame or a NumPy array. Anyway, I'm talking about benefits when I should be talking about cons. I guess the main con of processing dictionaries is that if you wanted a River model to process a million samples, it would take much more time than processing a million samples with a pandas data frame or a NumPy array.
Because the point of River is to be fast at processing one sample at a time, not necessarily at processing a million samples at once. But those are two different problems. If you take Torch or scikit-learn, their goal is to be able to process offline data really quickly. The goal of River is to process online data, single samples, as fast as possible. So you're comparing apples and oranges if you want to do that comparison; it just doesn't work. So, actually, you know what, I don't think there's a real downside to using dictionaries. It just helps a lot. And to confirm this, we have a lot of users who tell us about this. They say, well, it's actually fun to use River. It just makes sense because it's very close to the data structures that I use in Python. I don't have to introduce a new data structure into my system.
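As a small, hedged illustration of that flexibility, here's a sketch of dictionary features arriving in different orders and with new keys appearing mid-stream. The feature names and values are made up, and it assumes River's standard learn_one / predict_proba_one interface.

```python
from river import linear_model

model = linear_model.LogisticRegression()

# First sample: two numeric features.
model.learn_one({"clicks": 3.0, "time_on_page": 1.2}, True)

# Second sample: same features in a different order, plus a brand new one.
# The model simply adds a weight for "scroll_depth" instead of failing.
model.learn_one({"time_on_page": 0.3, "clicks": 1.0, "scroll_depth": 0.8}, False)

# A JSON payload from a web request deserializes straight into a dict,
# so it can be fed to the model without any conversion step.
payload = {"clicks": 2.0, "time_on_page": 0.9}
print(model.predict_proba_one(payload))
```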
[00:51:16] Unknown:
So for somebody who is using River to build a machine learning model, can you just talk through the overall process of going from idea through development to deployment?
[00:51:27] Unknown:
I'm going to rehash what I said before, but I think the great benefit of online learning and River is that you can cut the R&D phase. I've seen so many projects where there's an R&D phase, and the model gets validated at some point in time, but there's a really big gap of time between the start of the R&D phase and the moment when the model is deployed. And the process of using River, and any streaming model in general, is to actually, as I said, deploy the model as soon as possible and monitor its predictions. And that's okay, because you can deploy the model in production and those predictions do not necessarily have to be served to the user.
So you just make the predictions, you monitor them, and that creates a log of training data, predictions, and features. That's what you call a shadow deployment. You have a model which is deployed and is making predictions, but those predictions are not being used to inform decisions or to influence the behavior of users. They just exist for the sake of existing and for monitoring. One thing to mention is that once you deploy this model, you have your log of events. That's the phase where you might want to design a new model, and have that model replace the existing model in production, or coexist with it behind a meta model. So, as I mentioned, you can take your log of events, replay it in the order in which it arrived, and get a good idea of how well your model would have performed.
That's called progressive validation. It's just this idea that if you have your log of events, for every sample you first make a prediction, and then you learn from it. I have a good example. There's a data set on Kaggle called the New York Taxi data set, and it's basically a log of people hailing a taxi for a ride: they depart from a position, and they arrive at another position later in time. And so the goal of a machine learning system in this case could be to predict how long the taxi trip is going to last. So when the taxi departs, you want your model to make a prediction.
How long is this taxi ride going to last? Maybe that's going to inform the cost of the trip, or it's going to help the decision makers behind the taxis, or whatever. But you can imagine that this is a great feedback loop, because you have your model making a prediction, and then later, maybe 18 minutes later, you get the ground truth. You know how long the taxi trip actually lasted, and you can compare it to your model's prediction. And that enables progressive validation, because you have a log of events: when the taxi trip departed, what value my model predicted, what features I used at prediction time, and, later on, the ground truth. And so I can just replay the log of events for, say, 7 days and progressively evaluate my model. I like this taxi example because it's easy to reason about, and taxis are easy to understand. But the taxi example is really what online learning is about. It's about the feedback loop between predicting and learning.
And just to remind you how it would be with a batch model: you would have your taxi data set and, well, you would split your data in two. You would take the start of the week and the end of the week, train your model on the start of the week, and evaluate on the rest of the week. Oh no, the data I trained on from the start of the week is not representative of the weekend. And it just becomes a bit weird. It becomes this situation where you're trying to reproduce the conditions of real life, but you're never really sure of it. And you only really know how well your batch model does once it's in production.
Online learning just kind of encourages you to go for it, to deploy your model straight away and not have this weird R&D phase where you live in a lab, you think you might be right, but you're never really sure. Online learning just brings you closer to reality, in my opinion.
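Here's a hedged sketch of progressive validation on a replayed event log, in the spirit of the taxi example. The event log, feature names, and trip durations are all made up, and it assumes River's standard one-sample-at-a-time API (recent versions also ship a helper along these lines in the evaluate module).

```python
from river import linear_model, metrics

model = linear_model.LinearRegression()
metric = metrics.MAE()

# A made-up log of (features seen at departure, trip duration in minutes).
event_log = [
    ({"pickup_hour": 8, "distance_km": 3.1}, 14.0),
    ({"pickup_hour": 18, "distance_km": 7.4}, 31.0),
    ({"pickup_hour": 23, "distance_km": 1.2}, 6.0),
]

for features, duration in event_log:        # replayed in arrival order
    y_pred = model.predict_one(features)    # 1. predict when the taxi departs
    metric.update(duration, y_pred)         # 2. score once the ground truth arrives
    model.learn_one(features, duration)     # 3. only then learn from it

print(metric)  # how the model would have performed, had it been live
```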
[00:55:44] Unknown:
As you have been developing this project and helping people understand and adopt it, what do you see as some of the conceptual challenges or complexities that people experience as they're starting to adapt their thinking to how to build a machine learning model in this online streaming format versus the batch oriented workflow where they do have to think about the train test split, just the overall shift in the way that they think about iterating on and deploying and developing these
[00:56:15] Unknown:
models? It's a hard question, but I think there are two aspects. There's the online learning aspect, and then there's the MLOps aspect. Now, in terms of MLOps, I think I've covered it enough, but it's much like a batch model: you have to deploy your model, which maybe means serving it behind a web app. As I mentioned, the ideal situation is to have your model loaded into memory, making predictions and training. But all of that is easier said than done. The truth is that there's actually no framework out there which allows you to do this. You could do it yourself, and this is what we see. We get users who ask us questions, in all kinds of contexts, on GitHub or in emails, and they're asking us: how do I deploy my model? What should I be doing? And we always give the same answers.
But the fact is that we have these users who have basically embraced River and understand it, but then they get to the production phase, and we feel bad, because they are all making the same mistakes in some way, and River is not there to help them, because that's not the purpose of River. So, yeah, there's a lack of tooling to actually deploy an online model. That's the MLOps aspect. In terms of online learning itself, a big challenge is that not everyone has the luxury of a PhD during which you can spend days and nights going through online learning papers and trying to understand them, which is what I and others had the chance to do. A lot of our users see the value of online learning and want to put it into production, but they have deadlines to meet. Right? They have to ship their project in 6 weeks, and they just don't have the time to understand things in detail. So things like what I just described, progressive validation, take them a bit of time to understand.
And so, again, what we need to do is spend more time creating resources, diagrams, things that explain what online learning is about. In terms of library design, that's really important too. Like, if we wanted to introduce a new method on all our estimators, I would be against it. The whole point of River is to make it as simple as possible so that people can be productive and understand it. So, to recapitulate, those two problems are that people do not necessarily have the resources to learn about online learning, and then there are operational problems around serving these models in production. It's kind of like a batch model, because you have to serve your model behind an API and you have to monitor it. These things are common to a batch model, but there's the added complexity of having your model maintained in memory and kept learning, and that is basically not common. If you Google it or look on GitHub, you just kind of find these hacky projects, but no
[00:59:17] Unknown:
real good tool that allows you to do that, at least not yet. And one of the things that we didn't discuss yet is the types of machine learning use cases that River supports, where I'm speaking specifically to things like logistic regression and decision trees versus deep learning and neural networks. And I'm just wondering if you can talk to the types of machine learning approaches that River is designed to support and some of the reasoning that went into where you decided to put your focus.
[00:59:47] Unknown:
River, again, is a general purpose library, so it covers quite a few things. But there are some cases, or flavors of machine learning, which are especially interesting when you cast them in an online learning scenario. If you're doing anomaly detection, for instance, you have people making transactions in a banking system, so they're making payments, and you might want to do anomaly detection to catch fraudulent payments. That is very much a situation where you have streaming data, and in that case you would like to be doing online anomaly detection. We see that every time we put out a notebook or a new anomaly detection method, a lot of people start using it. We start getting bug reports and whatnot. It's kind of surprising, but it's a good thing. So, yeah, I think there are modules and aspects of River which clearly bring a lot of value to users. That would be anomaly detection, but we also have forecasting models. Doing forecasting online just makes sense. Say you have sensors which are measuring the temperature of something; people want to do that in real time. Another good example: we have this engineer who's working on water pipes in Italy.
He's trying to predict how much water is going to flow through certain points in his pipeline. He has sensors all over the pipeline, and he's trying to build a forecasting model. And so it just makes so much sense for him to be able to have his model run online inside the sensors, or inside the IoT systems he's running. All that to say that there are some more exotic parts of River, such as anomaly detection and forecasting, which probably bring more value than the classic models such as linear regression and classification. Again, at the start of the show, I talked about Netflix recommendations.
So we have some very basic bricks to be able to make recommendations. We have factorization machines, and we have some kind of ranking system, so that if you have users and items interacting, you can build a ranking of preferred items for a user. So we have these kinds of exotic machine learning cases which provide value, but which require us to spend a lot of time working on them. It's very difficult for me and the other contributors to be specialized in anomaly detection, time series forecasting, and recommendation all at once.
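To make the anomaly detection case concrete, here's a hedged sketch using River's Half-Space Trees detector on a made-up stream of payment amounts. It assumes the score_one / learn_one interface, and the min-max scaling step reflects the usual recommendation to feed that detector features in the [0, 1] range.

```python
from river import anomaly, preprocessing

scaler = preprocessing.MinMaxScaler()
detector = anomaly.HalfSpaceTrees(seed=42)

# A made-up stream of payments; the last one is suspiciously large.
transactions = [
    {"amount": 12.5},
    {"amount": 9.9},
    {"amount": 11.0},
    {"amount": 10.4},
    {"amount": 950.0},
]

for x in transactions:
    scaler.learn_one(x)
    x_scaled = scaler.transform_one(x)
    score = detector.score_one(x_scaled)  # higher score means more anomalous
    detector.learn_one(x_scaled)          # keep learning from the stream
    print(x, round(score, 3))
```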
But, yeah, all this to say that River covers a wide spectrum. You can do preprocessing, you can extract features, you can do classification, regression, forecasting, anything. And because it's online, it's just a bit unique. In your experience of
[01:02:45] Unknown:
working with the River Library and working with end users of the tool, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[01:02:55] Unknown:
Well, unexpected is a good one. There's one thing that comes to mind. We have this person who is a beekeeper, so a person who is taking care of bees, and I guess once a week or every two weeks they go to the beehive and pick up the honey. This person has many beehives, and they don't like to waste their time going to a beehive and checking whether there's honey or not. So they have sensors in each beehive, measuring how much honey is in it, and they like to forecast how much honey they can expect to have in the weeks or months to come, based on the weather, based on past data, based on I don't know what other information they use.
It was really fun just to see this person doing this hackish project where they just thought it would be fun to use online learning for it. And again, it was an IoT context, so that made sense. In terms of innovative, I was kind of impressed when I heard about this project of having a model within each car to determine what your destination would be. You wake up in the morning, you take your car: is it Saturday, so you go to the market? Is it a weekday, so you go to work? It sounds silly, obviously, but this idea of having one model per user is kind of fun.
The most impactful project I heard about, and which I know is being used, is a company that prevents cyber attacks. They monitor a grid of servers and computers, and they're monitoring traffic between machines, trying to understand when some of the traffic is malicious, when hackers are basically trying to get into a system. You can detect this by looking at the patterns of the traffic. Right? And the trick is that behind the malicious traffic there are hackers, and they're constantly changing their patterns, their access patterns, to avoid being detected. And so if you manage to label traffic as malicious, you want your model to keep learning. So they have this system with thousands of machines, and a few machines are dedicated to just learning from the traffic and, in real time, adapting, learning, detecting anomalous traffic, and sending it to human beings so that they can verify it themselves, label it, etcetera.
And so it's really cool to know that River has been used in that context. It just made so much sense for them to say: wow, we can actually do this online, and we don't have to retrain in batch. Batch learning was getting in their way. They had this system which was going a thousand miles an hour, with hundreds of thousands of data points accumulating all the time, and batch learning was just, again, annoying for them. Having a system that enables them to do all of this online just made sense, and to know that you can do this with such a high volume of traffic was really cool and exciting.
[01:06:02] Unknown:
In your own experience of building the project and using it for your own work, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:06:11] Unknown:
I think I'm just going to focus a bit on the human aspect here. Although I have been doing open source quite a bit, I've always had this approach where I probably work too much on new projects that I start myself rather than on existing projects. I'd just rather do my own thing than contribute to existing stuff, and that's not always a good thing, but it's just the way I work. And so River is really the first open source project where I work with other people. Like probably many people, a lot of my open source work is something I just work on by myself. And obviously you work in companies where you have a review process and you work with other people, but this is the first open source project where I really work with a team of people.
And it's fun. It's just really so much fun. Just a month ago, we actually got to meet all together and have this informal reunion, and that was really fun. And you realize that, you know, after 3 years, there are ups and downs, and there are moments where you just do not want to work on River anymore, because you have work, you have friends, girlfriends, whatnot. And so the only way to subsist as an open source project in the long term is to have multiple people working on it. It's not realistic to do it on your own if you want something to be successful and to actually have an impact in the long term. So it's really important to just be nice and to have people around you who help you. And although not everyone contributes as much as I or the core maintainers do, people help a lot, and they keep things alive. It's always a joy when I open an issue on GitHub and see that someone from the community has already answered the question, and I don't have to do anything. It helps tremendously.
Yeah.
[01:08:01] Unknown:
We've already talked a bit about some of the situations where online learning might not be the right choice. But for the case where somebody is going to use an online streaming machine learning approach, what are the cases where River is the wrong choice and maybe there's a different library or framework that would be better suited?
[01:08:18] Unknown:
Well, yeah. Again, honestly, I think that online learning is the wrong choice in 95% of cases. You do not want to make the mistake of thinking that your problem is an online problem. Most of the time, you probably have a batch problem that you can solve with a batch library. I mean, scikit-learn, if you open it and you just run it, it's always going to work reasonably well, so sometimes I would just go for that. One thing we do get a lot is people asking how you can do deep learning with River, so they want to train deep learning models online. The answer is that we do have a sister library, called river-torch, that is dedicated to training Torch models online.
But again, that is a bit finicky at the moment and still needs some work done on it. If you want to be doing deep learning and working with images and sound, you know, unstructured data, River is not the right choice, even online, and you probably want to be looking at PyTorch.
[01:09:14] Unknown:
As you continue to build and iterate on the River project, what are some of the things you have planned for the near to medium term, or any applications of this online learning approach that you're excited to dig into?
[01:09:26] Unknown:
We have a public roadmap. It's a Notion page with a list of stuff we're working on. It mostly has a list of algorithms to implement, and it's mostly there to let people know what we're working on and to encourage new contributors to pick something up. The few contributors we have just pick what they want to work on, in general order of preference. For instance, this summer I've decided to work on online covariance matrix estimation. Being able to estimate a covariance matrix online is useful, for example in financial trading. And if you can estimate an inverse covariance matrix online, that unlocks so many other algorithms, such as Bayesian linear regression, the elliptic envelope method for anomaly detection, Gaussian processes, and whatnot. So I think I'm still in the nitty gritty details of implementing algorithms and not necessarily applying them to things. I'm kind of counting on users to do the applications.
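As a flavor of what online covariance estimation involves, here's a small illustrative sketch, not River's implementation: the covariance between two variables is updated one sample at a time with Welford-style running statistics.

```python
class OnlineCovariance:
    """Tracks the covariance between two variables, one sample at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)

    def update(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mean_x            # deviation from the *previous* mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.co_moment += dx * (y - self.mean_y)  # uses the *updated* mean of y

    def get(self) -> float:
        # Sample covariance; zero until there are at least two observations.
        return self.co_moment / (self.n - 1) if self.n > 1 else 0.0


cov = OnlineCovariance()
for x, y in [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]:
    cov.update(x, y)
print(cov.get())
```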
That's just how it is at the moment. Now, one thing I'm working on in the mid to long term is Beaver. Eventually, I want to try to spend less time on River and work on a tool I'm building called Beaver. Beaver is a tool to deploy and maintain online learning models, so essentially an MLOps tool for online learning. It's in its infancy, but it's something I've been thinking about a lot. I recently gave a talk on it in Sweden, and I've sketched a blog post and some slides where I try to describe what it's going to look like. The goal of this project is to create a very simple, user friendly tool to deploy a model, and I'm hoping that that is going to encourage people to actually use River and to use online learning, because they're going to say: hey, okay, I can learn, but I can also just deploy the model, and both tools play nicely together. So, yeah, the future of River is to have River and to have this reference tool for deploying online models.
It's not going to be catered just towards River. The goal is to be able to, you know, run it with any model that can learn online.
[01:11:41] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:11:55] Unknown:
I'm always impressed by how much the field is maturing. I think there's a clear separation now between regular machine learning, business machine learning as I like to call it, and deep learning. I think those are becoming two separate fields. I've kind of stayed away from deep learning because it's just not my cup of tea, but I'm very interested in business machine learning, as I call it. And I'm impressed by how much the community has evolved in terms of knowledge. The average ML practitioner today is just so much more proficient than five years ago.
And I think it's a big question of education and tooling. The tricky thing about an ML model is that it's not deterministic, and so it's difficult to guarantee that its performance over time is going to be good, let alone certify the model or convince stakeholders that they should adopt it. In the real world, you don't just deploy a model and cross your fingers. So although we've gotten good at the testing and R&D phase of a model, we are still not there in terms of deploying models. The reality is that there's usually a feedback loop where you monitor your model and possibly retrain it, be it online or offline retraining. It doesn't matter.
And I don't think we're really good at that right now. I don't think we have great tools for having human beings in the loop who work hand in hand with machine learning models. So I think that tools like Prodigy, which is a tool that has a user work hand in hand with an ML system by labeling data that the model is unsure about, are crucial. They're game changers, because they create real systems where you care about new data coming in, retraining your model, having a human validate predictions, stuff like that. I think we have to move away from only having tools that are aimed at training a model; we also need to get better at tools that encourage you to monitor your model, to keep training it, to work with it. Again, just treat machine learning as software engineering and not just as some research project.
[01:14:16] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on River and helping to introduce the overall concept of online machine learning. It's definitely a very interesting space, and it's great to have tools like River available to help people take advantage of this approach. So thank you for all of the time and effort that you and the other maintainers are putting into the project, and I hope you enjoy the rest of your day. Oh, thank you. Thanks for having me. It was great. Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Max Halford
Max Halford's Background
Introduction to River Project
Online vs. Batch Machine Learning
Challenges of Online Machine Learning
Use Cases for Online Machine Learning
Federated Learning and Online ML
Operational Model for Streaming ML
Design of River Library
State Management in River
Advantages and Pitfalls of Using Dictionaries
Challenges in Adopting Online ML
Interesting Use Cases of River
Lessons Learned from Developing River
When Not to Use River
Future Plans for River
Biggest Barriers to ML Adoption
Conclusion