The Complex World of Generative AI Governance

Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable

AI systems.

Your host is Tobias Macy, and today I'm interviewing Jim Olsen about governance of your generative AI models and applications. So, Jim, can you start by introducing yourself? Sure. Absolutely. Yeah. I'm, the chief technology officer of ModelOp,

and that's exactly,

what we provide as a solution is

a solution to cover the generative AI and the AI governance and regulations that's out there. I've been a

software engineer for more years than I care to admit. So I've seen, everything across the industry for a long time, and, this is an exciting new space. And do you remember how you first got started working in the machine learning arena?

Yeah. Actually, the very, very first thing, I did with machine learning was back in the, early nineties, like, probably 1990.

Actually, I have a degree in both computer science and psychology and was looking to bond the 2 back then even, which is why I did that. And, built I was building my own neural networks and things like that on a much more simplistic scale than we do today. But, you know, we were doing a lot of, like, vision recognition kind of stuff and also trying to model human thought processes as part of the psychology degree and things like that as well. So but it's kinda my early, early experiences of that. And then, of course, I've used aspects of it through throughout the entire time since then.

And in the context of governance

for the current generation

of AI models, which typically means generative AI, so large language models, some of these vision models that are able to produce images and video,

and, obviously, the the fields expanding as we speak.

I'm wondering if you can just start by giving a bit of an overview about what governance even means for those types of technologies.

Yeah. Well, you know, those are what we refer to as the foundational models. And, all the foundational models, of course, can be used for many, many, many different things. So governance really pertains more to the use cases that you then do with the models rather than the individual model itself. Because it's very different if I'm, like,

generating

a a image from my blog, you know. There shouldn't in my opinion, there shouldn't be a whole lot of governance on that. You're reviewing it. It's in your hand. You can decide to use it or not. It's a tool that I believe there should be less governance around in those spaces. But when it's doing something like, for instance, detecting cancer, summarizing,

a a doctor's notes,

if it inserted suddenly that,

oh, the it admitted, let's say, for the fact that the doctor gave a very specific medication or something, and that didn't make it in the summary and you couldn't have a double dose on that, that could have very, very serious consequences.

And so those are the kinds of spaces where, it's the use case of how you're using this that I think is deserves more attention around where governance should apply or not because the it's the implications. What's the risk? And that's why we always talk about model risk.

We won't dig too much into this for this conversation, but I think it's also interesting

to your point of the foundation models.

The the governance comes in after the point of they have already been created, at least for the use cases that most people are interacting with them for.

But I'm wondering what your thoughts are on

the

governance or possibly lack thereof or at least lack of transparency

about the governance that is involved in the creation of those models and some of the datasets and the the opacity of that overall process.

Yeah. I mean, that's one area that's,

definitely we refer to these these as vendor models. So that means,

myself as the,

person consuming that model

do not have any insights into into realistically how it was built.

And, you know, realistically, let's say you could view the entire dataset of what went into chat GPT 4 as an example.

How much insight would that actually really give you? I mean, you could absolutely find, you know, there are concerns around, like, g p p r,

and does it have,

information in there that should have been pulled out? And, you know, those are legal concerns. But, frankly, those legal concerns more address the developer of the model itself than the consumer of the model because, you know, we haven't seen any cases yet where they try to go after somebody who just is using a vendor model, who put some inappropriate data in there or something. Now we're not we're not gonna be able to, they're not gonna be able to go after you on that. But then how you choose to use it and how you monitor it and what you do. Just because you're using a vendor model doesn't mean you can't establish your own baselines,

and your and ongoing monitoring

of how it's performing for your specific use case and to review processes and other things to make sure you're compliant and how you use that model. I mean, that that is where we I'll talk a lot about use cases here because that's really what's important when it cut if you look at most of regulations like the EU regulation,

that's come out recently.

It's all around, you know, specific kinds of use cases where they could affect people like, making legal judgments or health care decisions or those kinds of things. That's where the concerns come out.

Governance,

in particular, as an overall practice,

is generally

categorized as a hybrid socio technical endeavor where it requires the involvement

of engineers and technical controls around the enforcement

of the policies, but the actual

policy creation is something that is largely managed

at the

business level and

in the context of that regulatory

environment.

And I'm wondering if you can talk to some of the difficulties that organizations are facing

in understanding

what actually even needs to be included in those policies

and the elements of generative AI that are in scope for those policies and some of the ways that the understanding of the models is a requirement of that overall process.

Yeah. Well, I the overall kind of,

how you approach it is

still evolving a lot. I mean, obviously, the EU regulation is just it's one of the first big ones to really just come online. So most organizations are struggling to understand just even what they do, but even as further than that, even what they have. Because as I said, how many models are actually embedded in products you just consume? We see this a lot in the health care industry, for example. And the health care industry in the US

is one of the first industries to really fall under a bunch of different state regulations.

You know, we don't have a real formal federal level regulation like EU yet. We have some guidance and some policies, but not hard line requirements like EU. That's not true when you get to the state level. So understanding, you know, these evolving

individual

AI acts that affect places like Texas and Colorado and others, you know, and having the clear understanding of what you have to do and even finding the models is kind of most organizations aren't prepared for that at all. And that's where, you know, you do need a system like we develop to be able to, you know, first and foremost, get a proper inventory of what the heck you're using even,

to understand and where they're being used. Because that's great, you know, you have chat g p 4, but what's it being used for? You know, I've heard situations where people were using chat gpt 4 in hospitals and actually sending patient

patient information to it to get summaries back,

and that's a big, big no no. You know, that's that's hitting all kinds of,

the PII kind of issues and etcetera and very sensitive information to be shipping that off-site, something like that. So, you know, it's

understanding what do we need to control, what's there, what does a proper process look like, what are the steps involved in order to make sure I am compliant, and how do I do this. Many places are trying to build this stuff themselves,

and, you know, we see lots of struggles with that because, again, you don't even understand what needs to be built. It's not like software where you have a straight CICD pipeline. Well understood. Do your unit test. Do your integration test. You know, it's very clear. When we start to get into testing models, especially when we talk about generative AI models where they're inherently

non interpretable,

you have to come with different testing strategies, different things that are based on the use case and how you do that and have proper steps in place to make sure you're compliant with all these regulations. And, you know, that's basically why we founded this company and why we started doing this because this goes back before generative AMI models, traditional ML models,

in in highly regulated industries like banks who have been having to do this for a long time. Gen AI just added in the new things that all of the models in banking are are generally interpretable.

They wouldn't put them out otherwise, whereas Gen AI use cases are not.

The mention of OpenAI

in particular is interesting in this space as well because, as you said, if you are interacting with GPT 4 or any of their other models, you are

sending data to them over their API. They are using that for some unclear purposes

and but but just the fact that you are sending data to them over the wire

is just by its nature a potential violation of regulations completely independent of whether or not it's involved in AI. And so I think that that's also an another interesting aspect to this of the the facts of it being AI is

why people are using it, but there is maybe a little bit of a lack of generalized knowledge about what that even means to interact with that AI and some of the requirements around what those interaction patterns should look like. And I'm wondering what you're seeing

as far as the general distribution of that knowledge and some of the ways that organizations

are starting to come to grips with that and how that factors into their overall policy evaluation, policy creation.

Yeah. Well, because of the that model, I mean, they obviously do offer their own version where you get an isolated version. They claim that they they're not storing any of that information, etcetera,

etcetera. And then it comes down to a trust situation with the the vendor and contracts and all that fun stuff. But what I am seeing is a lot of people who are more sensitive to their information are exploring

either private, like, llama 2 instances,

etcetera, where they host them on-site or private cloud kind of situations like using an Amazon Bedrock within your own secure

private cloud instance, etcetera.

So you do know what's happening with it and things like that. So we're seeing some some preference towards that kind of model. But, you know, there's different consumers and different levels of those because they are foundational models. If I'm if I've got true PII information that has to go through this for some reason, again, you may wanna consider

is the PII information even necessary to the task

to make sure you avoid that. But then, you know, you're probably gonna want pretty tight control over over where it's going, etcetera, because you don't wanna end up disclosing somebody else's PII

or, you know, worse yet, even, like, you know, health care protected information. I mean, that that's a whole another set of regulations that's out there now. You got HIPAA in place. You can't you know, you gotta be careful on those things as nothing as you said AI. It's good data hygiene to make sure you understand where any potentially,

sensitive or regulated data, etcetera, is going, and that's just something you gotta do now. I think the allure of the capabilities of of things like, chat g p d 4 and etcetera and the newness of it is, you know, tempting to you know, everyone wants to use it because it's it's once it does help solve a whole bunch of problems. But how do you do that in such a way that it maintains,

the privacy of both the individuals

that,

are contained within the request but also private company information, etcetera. You know, you don't wanna send your financials out into

whatever,

that kind of thing. That can be problematic. So it's not a new problem, but I think it's an enhanced problem

because of the the the features you can get out by sending all that information to, one of these foundational models.

Another interesting

wrinkle to the problem of governance

and control over how these different technologies are being used is that

the ease of use and ease of access

makes it very favorable to sort of the shadow IT pattern of

somebody is using these tools as a convenience, not necessarily in any sort of formalized process or any sort of directly integrated

technology stack. And I imagine that that also poses some challenges as far as the policy

enforcement

piece of making sure that these systems aren't being used out of band from established

and approved practices and means of technology integration. Maybe somebody's just using it on their laptop

to try and solve a quick problem, and they inherently leak some sensitive information. And I'm wondering how you're seeing people try to tackle that problem of the inherent,

kind of virality of these systems and

pension for being used in a shadow IT manner?

Yeah. I mean, I've I've talked to several IT professionals who are in, like, hospitals or these kinds of places where there is a lot of sensitive information, and they've had to take the steps of absolutely just blocking the domains. And they basically just do not allow traffic to go to them because the risks are too high. So, you know, that way it forces people to go through whatever interfaces they provide, which are more curated. Because, yeah, I mean, somebody can just hit Copilot in their, in their Edge browser and and cut and paste information.

You know, if you're a software developer, you know, you cut some sense of the piece of code into there for an abbreviated, where did that go? You know, what was logged? What what you know, you don't really know. It's so embedded in these things. And that's the dangers of vendor models and

not understanding what they're doing in the back end and not having an approval process for them is yeah. You you don't really know what the terms and conditions I mean, I doubt the average user is is reading through the EULA with the thousand things that it says it's gonna do and that changes every every time you get software update to understand,

what the the the ramifications

could be to the business of that. And so it's you gotta block at the network layer, but especially with today's remote work environment and all that kind of thing, it's, you know, I think it's it's an ongoing challenge and a lot of that will come down to education. You mentioned too that a lot of the

basics

of how to use these models

in an appropriate manner is really just a matter of good data hygiene. And data governance, as a practice, has been around for quite some time. There are established methods, established protocols. It is a generally understood, if not uniformly

deployed

set of practices.

And I'm wondering what you see as some of the different dimensions of technical controls that are present in the space of data governance, how those also can be applied in the application of generative AI systems? Yeah. Well, that's a part of the whole AI governance solution or or as we refer to it as model ops. It's, you know, kinda like what Gartner's using as a term, to describe the space. And you have to take all of that into account.

Like our solution, for example,

we rely on that they have

a data hygiene solution,

and have that in place. But what we do need to do is we tie references

to the data, to the usage of the model. So, you know, you have a use case, then you have the implementation.

And the implementation

should have those references to the exact datasets that are going to be used and or were part of the training, if that's appropriate, or part of the RAG implementation, you know, if they're the reference docs. All of that needs to be kinda closed loop, if that makes sense. So, you know, you need to have the data governance in place so you understand and have things like GPDR

support, etcetera, to remove data when when requested.

And but that ties into now we need to track, okay, which models are using those datasets. So if there ever is an issue with that dataset or, you know, you we want to make sure you understand that, oh, we found an issue with this. We need to do a review or anything. You need to have that back associated

with the model instance and the model governance process. So rather than model governance replacing

data governance, it actually works with data governance, and the 2 can work hand in hand, and give you that closed loop solution.

In addition to the core data hygiene practices and policies around the use and distribution of data and the access controls around that? What are some of the additional

technical controls that are necessary and are being developed to assist in the policy enforcement around the foundation model specifically?

I'm thinking in terms of things like context filters that make sure that certain pieces of data never actually get passed along in the API request or, you know, token masking of some of that data so that in the context of your application that is using the model, you use some sort of identifier as a replacement

that is not at the actual PII

so that when you send it to the model, it uses that token

as a placeholder. And then on the return trip, you're able to replace it back with the original data. Some some of the ways that you're seeing people approach some of those technical controls?

Yeah. Well, for years, people have used for doing, machine learning training. There's solutions

that actually generate

unique tokens that can basically, for neural networks,

can replace,

the the actual PII data itself. Because as long as it has a a consistent unique uniqueness to it, then the the neural network's not really gonna know the difference because it doesn't know, you know, Joe versus Mary. It just know there's a Joe and a Mary. You know, it doesn't it doesn't have to know. You know, you could change it to Ios and and, you can, what,

Ira. I don't know how to say married backwards, but, basically, you could do that.

And

the neural network, as long as that was done consistently and reproducibly,

then the training is gonna end up pretty much the same. So, you know, we've we've seen those techniques for years. Now when we get to actual generative AI models and fact that they're often foundational models and you have RAG architectures that can bring stuff in, etcetera,

on that, Typically, what we do is so we automate baseline testing and things like that of generative AI solutions. And one of the ways we do that, for instance, is, like, one of the the tests you can run is PI detection. So you can use some jailbreaking prompts and things like that to automate that and attempt to get it to disclose PII,

as well as do ongoing reviews of real answers that are out there so you can detect that a PII were going to be disclosed.

Obviously, things like guardrails and stuff can be put in place to, help prevent that. But it's also

nice to know that it was potentially going to disclose that information or potentially did. So, like, there's techniques to do that out there. But it's a lot getting back to the same ideas of, like, establishing baselines, etcetera. You can do it using natural language techniques. You can do it using,

basically, cross LLM querying techniques, etcetera. So that you can

understand what was submitted as the potential solution for the problem, but then continue to monitor that over time and see if you get affected drift from original baseline data to understand a little bit that the model might be doing something it's not supposed to be. So that's very important and that's part of a model governance solution is both initial baselining and then ongoing monitoring of model itself.

To your point about jailbreaking,

one of the

biggest challenges

that I've seen in this space as far as being able to

manage the

reliability of these applications for 1, but also from a security perspective,

is that

you can

have a set of test prompts where you give it some request, you get some response back, you eyeball it, you say, okay. That looks good.

But when you actually put it in front of consumers

and they're able to send whatever they want to the model, there are all kinds of different ways that they can maybe get around some of these static guardrails of, you know, if somebody says bomb, then say, I'm sorry. I don't know how to respond to your request. But if they then change that to say explosive, then it's like, oh, sure. No problem. And just the the immutability of language makes it very difficult

to have any sort of strict and generalizable

controls and guarantees around any filtering

or security that you're trying to put around those models. And I'm wondering how that factors into the ways that people are thinking about the context in which they're even willing to employ generative AI because of the fact that there is such a broad range of potential

surface area for attack vectors.

Is, has been said and I like to say as well is LLMs

are fluent but not factual. They basically they they know how to speak because of vector patterns and stuff. They don't understand what they're saying. So if, you know, you're never gonna get a a full restrictive

way of,

if you allow

open access to the foundational models and you can send any kind of prompt you want and etcetera. But that's not generally what we're seeing. You know, you can control the kinds of prompts because you put applications in front of them ideally. You know? You're, again, I don't see as much concern with, like, if you go on to chat GPT and you get to say something offensive,

who really cares at the end of the day? You're trying to trick it into doing something, and they all it's like, oh, no. You got to say a dirty word. That that's silly in my opinion because you're doing it, and and you're creating that outcome. But when it gets to a business use case though, again, I wouldn't expect open

access to the foundational model. Instead, it would be going through some kind of an application. So you can take steps to avoid that. I mean, like in NVIDIA's Nexmo, you can even control the flow of how they interact with it and avoid them from doing very specific flows and things like that, whereas guardrails are more of a pre filter, post filter kind of things for responses, etcetera as well. So there are techniques to handle that, but a lot can be done by controlling how they're interacting with it. I don't know. I I would put, just a raw allow prompts to come in from a customer for a chatbot. It's gonna be controlled. They'll remove any kind of prompting from it. When you're doing a real application and you're you're have a business purpose, that just gets a little bit easier, and I think that's what we'll see more and more. The early failures have been when somebody just slaps up a model with no review and, you know, like the car famous car dealership where where they got them to sell sell them a car for a dollar. Really, it's short of making the business look foolish. It really didn't have any impact. They're not gonna sell. It's not legally binding. They're not gonna sell the car for a dollar or whatever. But it shows the danger of running too quick and not having things in place because clearly nobody really reviewed this or tried it out. We just slapped one up there real quick and gave free access. I think you'll see as as people get more hygiene around this, they will have more purpose oriented

interfaces

where it's going to going to be careful. The big dangerous hallucinations

because that's even harder to detect

because, you know, it's like if it just makes something up as an answer, it it gets very challenging to detect that. And that's more one I think the bigger things is just telling people wrong information.

To the point too about the hallucinations

as well as the

constraints around what types of inputs you're allowing to the models.

I'm wondering what you're seeing as far as how people are maybe using

structured inputs and structured outputs to constrain the overall potential state space to reduce some of those error conditions and reduce some of those

potentialities

for making stuff up and just some of the

application architectures that have been useful to manage

the randomness of of these models and bring it into a

fairly

well

known

operating environment and making sure that it's not just going to go off the rails and just start spewing completely incorrect information?

Well, the inputs and up well, the outputs are the the, in fact, the non factual information. But the inputs don't control that as much because you can give the perfectly valid

input and it will just make stuff up sometimes. Because, again, it's it's fluent, not factual. So it sounds very convincing when it tells you that, and I think that's the biggest challenge with the the LLMs. There's like a a search engine leading you to a site that has incorrect information.

The model talks to you like a very smart person. It sounds very fluent. So it's pretty convincing. So a lot of people will just take what it says as fact. Now controlling the inputs, what we see is, you know, you wanna make sure they're not trying to, like, change system prompts and things like that, and we're definitely seeing that. And a lot of that comes down to, as I said, things like, using

NVIDIA's,

NEEMO, or Guardrails AI at these points, or even some homegrown solutions, you know, where you're looking at things where they're using langchain and doing their own things through through sequential chains in there and that kind of things to make sure those kinds of things are scrubbed. I think it's an it's still an immature space, and there's a lot of opportunity there, I think, to provide better overall solutions

for,

stopping these kinds of clearly jailbreaking

techniques and attacks. And you do see that now more and more like JAT GPT 4 when they do identify something. For instance, I the yeah. You have to be careful of not playing with these techniques too much because you get nasty grams from them saying, stop doing that. We're gonna sense suspend your account. So you don't wanna try to do demos of technology to show what is possible,

not not malicious.

It catches that as well. So you you will see more and more of that come on as fine as well, so thanks. The actual output hallucinations,

these models are uninterpretable.

We don't know how exactly it arrives at that conclusion. You know, people are trying to do some of that through, like, node clustering visualizations

and self explanation and other techniques like these, but they're still very immature. I mean, so there's,

other things you have to do. For instance, like in a RAG solution, one of the things, we offer within our package is the ability to compare

the response against the reference source documents and making sure the cosine similarity

is high between the 2. And, you know, from that, you would expect them if assuming

the data you fed in, you know, garbage in garbage out, assuming the data you fed in in your RAG, dot reference documents is good and those the output matches that fairly closely,

it's probably not too far of a hallucination, if anything. It's probably more interpretable that way. So there are some things you can do in the output side too to help detect these in in the typical usage scenario which you're seeing, which is the RAG architecture.

Another

aspect of these models that is still,

I think,

in the exploration and development phase, it hasn't been as widely adopted as the rag pattern, is the use of them for multi turn conversations where it has to maintain

conversational

context over the course of the interaction.

And my understanding is that increases the

potentiality for hallucination because it has to try and maintain more and more information, and, eventually, it just kind of loses its brains and starts coming up with stuff. And I'm wondering

how you're seeing organizations

weigh the potential benefits

of the user experience around those multi turn conversations

against the potential risks and the system and architectural complexity of being able to manage that from a implementation perspective.

Yeah. Well, and you also we got one other big one with those is cost. The more code tokens you, load into there and the more tokens you're processing, the more your cost goes up as well. So all of those have impacts. What I'm seeing initially

is, you know, the especially,

we deal mostly with Fortune 500 customers, so they're they're a little more cautious in their adaptation of, bringing these technologies in. But what we're seeing at first being used is human in the middle. So, basically, you know, like, I'm a a support agent for call center or something like that. I will have an internal app where I can ask these questions, etcetera, and that kind of things, which gets me the answers quicker so they don't need to know everything. But then the human will, actually converse with the outside person, whether it be live on audio or through their chat system, etcetera, and that kind of things. And that does give you that gating item of, you know, your internal employees not trying to jailbreak your own technique. That doesn't make sense, but it does help them get the benefits of quicker access to the to the customer's answers. So you have kinda human in the middle, and then they just converse directly with the human. And if they see something really bad come up, they're gonna be able to see that market and, you know, give it a thumbs down so it can go back to the the team the data scientists to actually potentially address. And that, you know, then

you've kinda got the a human gateway. Otherwise, the other areas I'm seeing is I said, like, NVIDIA's Nexmo is like guard simple guardrails are kinda just a post filter. NVIDIA Nexmo actually controls flow at any point. It's it's fairly complex to use as a disadvantage.

It's not real straightforward and requires a bunch of work and testing and development and etcetera and that kind of things. But, you know, it's promising if these could be made simpler because then you can prevent

flows where it's clearly going off the rails, so to speak. No pun intended there. But, it's I I think you'll see more and more of that kind of control of the conversation and techniques to evolve there. But right now, what I'm seeing first pass is a lot of internal apps where, you know, it's not going out to the customer and that kind of things until you get to something as big as an Amazon where they have the resources to to make sure they can clamp down on the automated

chatbots, but they've been doing that for years. You know? And to that point of internal versus

external

tool and application

use, how are you seeing organizations

manage that risk calculus

and the

cost benefit analysis

of whether they actually want to expose an AI driven capability

to their end users

versus the amount of engineering effort and organizational oversight that's required?

Well, I think that's where everyone's mostly just sticking their toe in the water right now, to be honest. I mean, that's, you know, where we're trying to provide the governance solutions

to enable that process so you can do that. And that's where, you know, as I said, it starts with a use case. And in a proper AI governance solution, you assign business risk just to that use case before you even pick the technology. So there's a traditional machine learn learning model, an Excel spreadsheet, Python code, generative AI. You there's a risk to doing something. So you need to analyze that and have that process in place. A lot of places don't, and that's a problem because, you know, it's like again, if it's a low risk, low business,

risk use case, you can probably go out on a limb a little bit more and and try newer technologies.

If it's like running your trade automated trading or something, you could lose a $1,000,000,000 overnight. That's probably gonna be for very high risk to the big business, and you're gonna be gonna be a lot more cautious about what it does. But risk assessments are not just about the dollars. They can also be about the liability. For instance, if it it was biased in in how it treated customers

on that kind of thing. So for instance, they're getting a lot better, but early on, we developed some tests by basically looking at protected classes and just asking it the same questions, but tweaking the persona only of the protected class. And early on, I would see very different answers versus, like, for a simple example of male versus female. You only change the attribute of a complex persona description, and I get things that are clearly biased female versus male, you know, based on stereotypes because that's how they're trained. And that's concerning. That's a risk to the business as well. You don't wanna be treating your customers

differently based on those. So the customers are really just starting to look at this. They're dipping their toes in the waters. A safer path right now is to do internally focused, have that human in the middle. And at the larger companies, that's what we're seeing. But, you know, they're just really getting going on the Gen AI journey. AI, however, it's been ML, yeah, AI, has been around for a long time, these companies, and they have better understanding of how to mitigate that risk because some models aren't inherently interpretable. Or, you know, you can do things like Shapley and Lime to understand why they're making decisions, and you can control the risks better. So that's why there's huge adoption of that. AI is generative AI is just getting going. And frankly, I I think they're seeing how it plays out. We had one customer do an internal kinda support chat thing, and they're seeing really good results with it. So, you know, that's giving them the confidence to go forward. And I think that's what we'll see is it's a it's a learning journey at this point. Another interesting aspect of this is that in the generative AI space, most of the attention has been around the natural language models of text in, text out, but there has been an increasing

investment in multimodality

or other modalities as well rather than just pure language or pure text. And I'm wondering how you are seeing

the

capabilities of other modalities impact the potentiality

for risk or, you know, maybe harm to the brand or even just risk in terms of a cost perspective.

Well, I think the risk doesn't change a lot between those. You know, how you detect it is a little harder. So for instance, classifying an image as offensive

is a heck of a lot harder than classifying a word

as offensive because, there there's, you know, well known measures of things like toxicity,

gibberish

detection, offensive language detection, you know, hate detection, etcetera. They're well known algorithms

that do that fairly well. So I think with the text, it's it's it's a little bit easier to pick that up. What makes an image of offensive

would change from culture to culture even. You know, you have certain religious

figures, etcetera. You can't have a fear in an image, for instance, an example. How do you detect that and and understand what that means? Never mind, you know, just inappropriate,

etcetera. You know, images are a whole new beast that way, and detecting it is hard. I don't think it's any different fundamentally

in that, you know, you can make an offensive audio clip, you can make an offensive

image, or offensive

language. They have the same risk

on that kind of things, but detecting them is what is harder. So how do I filter, for instance,

in an inappropriate image and do it in a culturally sensitive manner that actually works across all kinds of cultures. One thing that's offensive to somebody is clearly not offensive to the other. I think speech is a little more understood that way. I mean, that's why it's easier to detect, but that is you start to do that. Now I don't see a whole lot of customers doing, like, end users. So, like, you know, they're they're they're

customers, but giving them image generation capability. I mean, that that's kind of a like using DALL E 3 or something like that. It it's it's done by people. They get the results.

If they did find something offensive, it's gonna be back against the foundational model, and they're doing them, and they're reviewing them, seeing them again before they go out. I haven't seen a whole lot of, like, true large companies doing anything that, you know, actually, like, goes right to their customers that generates that kind of thing. So it's early on that. Same thing with audio. I'm not seeing a whole lot of that, but audio is a little easier because at least you can do text to speech. Obviously, there could be offensive background sounds or other things in there that could pop up, but I I think that's more of an edge case. So the modality doesn't so much affect what it could do, but more how you prevent it.

And in the overall

technical ecosystem

of the current generative

AI landscape,

what do you see as the areas that are in greatest need of investment to ease that burden of risk and validation

of exploration

and deployment?

Well, a lot of it comes down to trust in the exploration and deployment?

Well, a lot of it comes down to trust in the model at that point. How much do you trust it? And, you know, that's why we developed our company, you know, model op develops a governance solution.

So going through all the steps, going through all the reviews, understanding the data that's there, etcetera,

making sure you follow that process and then do the proper ongoing

monitoring.

Model should be expired at some point. You know, there's all kinds of things that need to occur. That helps a lot in developing the trust and that the model is appropriate for its business use case. And, you know, that will help move things forward. And that tooling that we've developed

really helps with that, but it's getting

the,

individual companies to recognize they must do this now. Think of it as like the early days in coding

where I mean, I loved it at the time is, yeah, you build it on your desktop, throw it out in production. You know, straight from there, no review,

no reproducibility,

that kind of things.

Eventually,

you know, DevOps came along.

And at first, there was resistance to using it because it's more steps you gotta do and, you know, developers are busy. They don't wanna do this. Over time, when it stops you from getting calls at 2 in the morning because your system's down, you you grew to appreciate having this formalized process for doing that. I think that's what we're seeing now. We're in the early stages of model governance

for the purpose of providing the same kind of safeguards, etcetera, and reproducibility and accept and all of those things that come with it. And so, you know, there's and accept and all of those things that come with it. And so, you know, there's resistance from data scientists, etcetera, to wanna do it because there's more stuff you gotta do, and it just seems like overhead. But it pays back for itself in the long run.

And, additionally, I think they know there needs to be more tools around

I don't think you ever get true explainability,

but what can what kinds of steps can we continue to provide that helps make us understand why these things are coming to the decisions they're coming to so we can do more effective monitoring. Now we've developed some of our own tests that are more business use case oriented

in testing it against its intended use. I think you'll see a lot more there because we've seen a bunch in the foundational models. That's great. Almost nobody's training their models. And most people aren't retraining or doing reinforcement learning. Really expensive. You know, they're doing brag architectures and things like that where it's well known. So what kinds of tests can be developed to make sure that they're actually forming against their use case? And I think we'll we'll get more and more

robust methodologies to do that as well because we if you don't fully have to understand

what exactly the model is doing unless you're in the highest regulated industries. But it is nice to know that it's performing well against this intended use, and that again builds trust.

And in your experience

of working in this space of model governance,

AI regulations, helping organizations

come to grips with what is

possible for

managing that risk

and helping them to understand what are appropriate policies and technical controls around that. I'm wondering

what are some of the most interesting or innovative or unexpected ways that you've seen companies,

address that challenge of AI governance?

Well, I think the the biggest thing I've seen is people underestimate what it takes. You know, we see a lot of DIY solutions within companies and think of it as like, I'm gonna go build my own Jenkins. I'm gonna build my own, you know, sonar. I'm gonna go build my it it compounds.

The amount you actually have to do and as as rapidly as these regulations are coming out and, worse yet, because we don't like in the US, we don't have kind of a federal standard. There's a lot of states doing their individual things. It just makes it more and more complex of, like, what do I have to exactly do? And you read regulations.

I have to, unfortunately. I don't ever suggest you you do. They're not, like, software documentation. They are really obscure. So you also have to find leaders in the industry to kinda follow along and say, hey. Yeah. You know, we're all in this together, and this is our understanding of what this really means.

So that that vagueness within the regulations as well. And so you a a lot of companies are just overwhelmed and frankly don't know what to do. And sometimes

though, I think human nature sometimes is it's safer to do nothing.

But even though it's really not, it feels that way. So, you know, this is a space we see is rapidly emerging. It's gonna be kinda you gotta get going now on it and do things now because it's gonna be a lot harder later, and there's just gonna be more and more regulations.

And, you know, it's the do nothing or try to build it yourself is usually not the best option in the world. And that's that's what I I keep seeing again and again is this underestimation

of the problem.

And in your own experience

of building this governance platform, helping organizations

understand

their risk profiles,

and just trying to keep track of the overall AI industry as it continues to evolve at such a rapid pace? What are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

Well, I've just learned how quick things can shift. I mean, it's not I guess it's not so much of a surprise as I've been developing software for, like, 40 years. So I I've seen big inflections. I mean, think of, like, when, the Internet and the web really took off. I mean, the things were changing so rapidly and tons of investment in random directions and etcetera, and waiting for that needle to settle down a little as to exactly what is the direction. So I I've the the biggest challenge I've seen

is the the continued ongoing evolution

of what exactly it means to govern

models in general. So, you know, we've been talking a lot about AI, but, you know, we've started this,

5, 6 years ago. We're doing traditional machine learning models and etcetera. But now the the whole ball of wax of what it means to govern an AI model is so up in the air, and we're all trying to figure it out and what makes sense. Now you don't wanna stifle innovation, but you do wanna provide protection for consumers against discrimination

from

or misinformation or any of this kind of stuff when it's it's directly affecting their lives. And that's what we're seeing is this rapid evolution of the the large language models and how they sound so smart, yet can't do basic math.

And and people are are concerned that they're gonna take over the world, etcetera, and all these things. And this all this noise

plowing through that and figuring out the right things to do has been a it's any of these hype cycles,

it it's kind of a big deal to keep up with and and try to see through the fog and where it goes. And and just the the

steepness of this curve over the last year has has been has been pretty amazing, frankly.

As you continue to

keep tabs on the space and try to

come to grips with what are the risks, what are the capabilities,

what are the appropriate levels of control

and

appetite for risk? What are some of the technical or social or organizational

trends of AI risk and governance that you're monitoring?

Well, I mean, I read a lot. I mean, we do a lot, especially because we're dealing with larger customers around really understanding these emergent emerging,

regulations.

Because, like, the e regular act specifically is gonna require

certain industries and certain spaces to actually submit

reports

to them in order to be able to operate. So, you know, that's a challenge. How do you make sure these reports get in and updated and things like that? You know, that's one of the problems we've already solved. But, like, what we're seeing, especially organizationally, is, you know, there's a huge appetite

in the org individual lower level business units to want to consume this stuff. I mean, it's cool, it's new, you wanna learn about it. It's got capabilities that are really powerful. And then, you know, you have kind of the higher levels who understand that business risk and how do you bridge that gap to bring together not stifling the innovation,

but still

managing that risk. And that's where you have to have process. You have to open communication between all these different levels of larger organizations.

And that's why you have to have what we refer to as a model life cycle that manages that process and makes things happen and and and opens up that communication. So you can have these kinds of discussions.

And you do have the metrics and the factors and all the information and metadata, who's using it and when for what. So you can make an intelligent decision. So it's really gonna have to break down some barriers between the, you know, things that might otherwise be organizationally,

separated

so that you can work together to achieve the benefits

of what these provide. Because clearly, I mean, we we've seen use cases and they're using it, like, in in in in support call centers, etcetera, and they're seeing massive improvements and outcomes.

So that's very good for the business.

But, you know, how do I I both embrace this and move it forward in my organization

and avoid some of these big scary things that are out there, which a lot of times don't materialize

and or get hyped up. But, you know, like, there's the, things where I think it was McDonald's that put out their AI stuff in there, and somebody went up to a kiosk and ended up with 260 orders of chicken McNuggets and couldn't cancel the order and things like that. Bad deployments

cost

both image wise

and trust wise and etcetera.

So that's really what we're seeing is the the organizations that have to come together to work how to see the benefit from these, but also mitigating that risk.

In terms of the

current level of opacity

and the centralization

of the

kind of creation and maintenance and updates of these models

as well as just the overall operational cost of training them initially, but also using them for inference. I'm wondering

how you see that impacting

the future trajectory

of the model architectures

investment in their developments and capabilities and just some of the things that you either maybe

anticipate based on your own understanding of it or maybe just wishful thinking that you'd like to see it go towards? Yeah. Well, that's the, you know, the operational cost is a big issue. You know, it's it will consume a lot of computing resources, and that's what again, everyone's still trying to figure this out. I mean, people are still trying to figure out cloud costs in general to their business. It's getting a lot more predictable. But even then, you know, how do you rate now? They didn't go get a big surprise bill. So I think we're seeing a lot of organizations examining multiple paths, whether it be a vendor hosted solution like chat g p d 4 or a llama 2 instance. They run locally. And then, you know, if you're running it locally on your own hardware, for instance, you have some cost and that's it. But then does it perform well enough for that? Or do I use a hosted solution like Amazon's Bedrock and, you know, accept the costs that could come with that? And, again, don't have a lot of fully deployed

solutions out there yet. So a lot of that is is still being learned because, you know, we've seen clients where they do do the hosted, but they do have some controls around it. Then we see others who buy tons of NVIDIA GPUs at great upfront cost, but now their cost is fixed. And, you know, they're just ramping up efforts. So was that enough investment? They don't know yet. So I think especially when it come to cost, it's not well understood. But we are seeing the interesting side is both small language models, which, you know, you can actually run on on much more modest hardware but perform pretty darn good as well as things like GPD 4 o, which is,

much less cost per token and may perform adequately for your task. So I think we're gonna see a little bit of how do we balance the cost versus what comes out. And the really interesting thing is I'm seeing is some of the really newer large, ultra large language models

are not really performing better than, like, the the counterparts.

So we may be hitting a point where additional training, etcetera,

is not necessarily going to be worth it for the the cost of the difference in models,

especially when the other ones are good enough.

So I think we're gonna see a lot around that and never mind the small language models. I don't know if you've played with them. Yeah. You can run some of those on your cell phone and they actually perform okay.

So I think it's gonna get interesting. We're gonna see those costs potentially come down. Are there any other aspects of this overall space of AI model governance,

the policy considerations,

the regulatory environment, the technical capabilities that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. Well, the the biggest one is just realizing that, I mean, it's I keep

I got I feel like the person caught in the middle sometime because it's the people who want no governance and regulation

who,

you know, don't wanna stifle innovation in any way, shape, or form. And and for the raw foundational models, you know, I definitely feel that's okay because COVID is empty or, you know, you're using them. It's your responsibility to assess the risk, and that's why you need model governance. But on the other hand,

it makes me nervous to not have any regulation when it is making decisions that affect people's lives directly.

So it's making a health care decision. I mean, I think it was Nevada's developing one to actually decide whether you get unemployment insurance or not. You know? These are big decisions that really hurt people if they're done wrong or done in an uninterpretable and non understandable

way and don't have some kind of human review or something in the process to make this happen. So when you think of,

model governance, it often comes off in a negative light whereas it's like, yeah, if we're talking over regulating the development of a foundational model is afraid they're gonna take over the world, that is not the kind of governance I think is important. When it comes down to the business use case, it affects real people's lives. Whether you get a job, whether you get covered under health care insurance

or not, whether you get unemployment insurance or not, you know, whether it detects your cancer cells or not, you know. These are things that are really

important. And, you know, from that aspect, that's where I believe governance makes sense. We do need some oversight holding people

accountable for this. Because right now, you submit a resume to a company. I know a lot of people are looking for jobs out there, and it just disappears.

You don't you know, a lot of times, this AI model is just throwing the resume out. No human they've ever even saw it. That's not I don't think that's a great situation for us to be in, especially

let's if there was any discriminatory

practice built into that model that people are unaware of. I don't think people are doing this purposely. It creeps in from the training data, etcetera. So we just need to be cautious about that, and that's where we do want peer review and and, you know, processes within your organization.

So that not only are you getting a great outcome as a business, but your customers are getting great outcomes as a customer as well, and we're doing it in a moral and just way.

Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question,

I'd like to get your perspective on what you see as being some of the biggest gaps in tooling technology or human training that's available for AI systems today.

Yeah. For from my standpoint, I mean, obviously, we think model of evidence is a big one,

and that's why we're building it. There's very, very few companies in our space developing these solutions that actually scale and can handle fortune 500 kind of situations and do it the right way. So, you know, obviously, we identified that as a huge gap, and that's one of the things we we're building it just because of that.

That needs to be out there. I also feel the especially the I already mentioned this early, but anything we can do around applicability to business use case and detecting performance.

How do we do that with, these these AI,

generative AI models? Because we need to understand how well they are performing against task and any techniques we can develop to do that as well. We're seeing that out in the space. I also feel like, you know, around the specifically

around the flow control, I said, like, NVIDIA and Nemo

is great in some ways, very complex to use, very difficult to build. I think great tooling that would sit on top of that that would make it more visual in nature of how you could control that and and test it and debug it and etcetera. There's stuff being developed there. I think that's a great space where there could be absolutely more tooling because preventing a you need 2 levels. You need the governance upfront to make sure you're doing the things right, but then you need the real time defense as well, which isn't really governance.

It's just, you know, making sure that it it's not going off the rails, so to speak, in place and making that simpler, easier to use, more understandable, more traceable, etcetera, is an area that that is is largely lacking,

on that kind of thing. I think all the investment right now is going into just train, train, train these models to make them perform better on 1 or more scores. That's not useful, but probably now we need to start industrializing

these models, making how can that make them more usable for developing the business case? We saw a little bit of that with, like, the automated chat GPTs and copilots and things like that. I think there's a lot of opportunity in that space to make them more consumable to a business cause and use case. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience,

working in this space of model governance, and I appreciate the work that you and the rest of your team are doing at model op to make that a more,

consumable

and

manageable

process for some of these organizations.

It's definitely very important to continue that work. So I appreciate you helping to push us all forward in that direction, and I hope you enjoy the rest of your day. Yeah. And thank you for having me on the show. I really enjoyed it and, enjoyed our conversation.

Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and podcast dot in it, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at the machine learning podcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at themachinelarningpodcast.com

with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

AI Engineering Podcast