Summary
Machine learning and generative AI systems have produced truly impressive capabilities. Unfortunately, many of these applications are not designed with the privacy of end-users in mind. TripleBlind is a platform focused on embedding privacy preserving techniques in the machine learning process to produce more user-friendly AI products. In this episode Gharib Gharibi explains how the current generation of applications can be susceptible to leaking user data and how to counteract those trends.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Your host is Tobias Macey and today I'm interviewing Gharib Gharibi about the challenges of bias and data privacy in generative AI models
- Introduction
- How did you get involved in machine learning?
- Generative AI has been gaining a lot of attention and speculation about its impact. What are some of the risks that these capabilities pose?
- What are the main contributing factors to their existing shortcomings?
- What are some of the subtle ways that bias in the source data can manifest?
- In addition to inaccurate results, there is also a question of how user interactions might be re-purposed and potential impacts on data and personal privacy. What are the main sources of risk?
- With the massive attention that generative AI has created and the perspectives that are being shaped by it, how do you see that impacting the general perception of other implementations of AI/ML?
- How can ML practitioners improve and convey the trustworthiness of their models to end users?
- What are the risks for the industry if generative models fall out of favor with the public?
- How does your work at Tripleblind help to encourage a conscientious approach to AI?
- What are the most interesting, innovative, or unexpected ways that you have seen data privacy addressed in AI applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on privacy in AI?
- When is TripleBlind the wrong choice?
- What do you have planned for the future of TripleBlind?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- TripleBlind
- ImageNet Geoffrey Hinton Paper
- BERT language model
- Generative AI
- GPT == Generative Pre-trained Transformer
- HIPAA Safe Harbor Rules
- Federated Learning
- Differential Privacy
- Homomorphic Encryption
[00:00:10]
Unknown:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning.
[00:00:19] Unknown:
Your host is Tobias Macey, and today I'm interviewing Gharib Gharibi about the challenges of bias and data privacy in generative AI models. So, Gharib, can you start by introducing yourself?
[00:00:29] Unknown:
Hello, Tobias. Thanks for having me. Excited to talk with you about this wonderful topic. My name is Gharib, and my interests fall at the intersection of AI, privacy, and systems. Today, I serve as the director of applied research and head of AI at TripleBlind, which is a startup focused on protecting people's privacy. We do so by building a wide range of privacy-preserving tools for AI, machine learning, and data analytics.
[00:01:05] Unknown:
And do you remember how you first got started working in machine learning?
[00:01:09] Unknown:
Yes. So, basically, when I grew up, we had a desktop in our home, and I always thought it would be really exciting to be able to have conversations with my computer. Growing up, I knew that in order to do that, I needed to be able to talk to the computer and then make it do things for me or assist me with things. So, in order to be able to talk to the computer, I pursued a master's degree in computer science at the University of Missouri-Kansas City, focused on software development. After that, I went into my PhD program, and I started focusing on deep learning. I think what really made me go into deep learning was the 2012 paper from Geoffrey Hinton's team that worked on the ImageNet competition; their results showed that this thing called neural networks with convolutional layers can actually achieve very good results.
I got very interested in this domain and studied deep learning in my PhD. Towards the end of it, I started exploring privacy-preserving AI, and here I am today.
[00:02:17] Unknown:
Generative AI in particular has been gaining a lot of attention because of the different breakthroughs and the level of sophistication and capability that it has reached recently with the successive generations of GPT-based models. I'm wondering if you can talk to some of the risks that the capabilities of these GPT models pose, particularly in the context of data privacy and bias?
[00:02:43] Unknown:
Yeah, sure. That's actually a great question. To me, GPT, or generative AI in general, is basically a general-purpose technology. Similar to all the previous technologies that we had, these technologies are usually double-edged swords. They have a lot of potential to do great things, but they can also cause some harm. Some of the capabilities of ChatGPT, for example, can be used to amplify specific agendas or propaganda, because it's very good at generating text, essays, or blogs easily, very cheaply, and with very convincing arguments.
So that can also be used to generate fake news, right? And GPT models today are multimodal, and in the future they could be producing videos. Actually, there are already generative AI models that can take a short text prompt and generate a video out of it, so deepfakes are another big problem, for example. So the capabilities of these models can have very serious implications, and if these capabilities fall into the wrong hands of a Dr. Evil, they could cause some serious harm. But in the big picture and over the long run, I believe that the benefits of these systems are going to offset their issues.
[00:04:11] Unknown:
With generative AI, there's a lot of excitement about, oh, they can do anything. I can tell it to do whatever I want, and it'll give me a reasonable-sounding answer. But the problem there is that it's reasonable-sounding without necessarily being actually correct. I'm curious what you see as the contributing factors to the shortcomings of these generative models and some of the inherent risk of people's level of trust that is being built up, with that trust not necessarily being well founded?
[00:04:45] Unknown:
Yes, that's another great question. I believe these issues stem from 3 major factors: the data, the algorithms, and the oversight when it comes to training these algorithms. First, I usually loosely define AI algorithms, or AI systems, as programs that generalize from data. Therefore, if the data that we collected was biased in one way or another, the programs that generalize from this data are also going to amplify that bias. The second thing is the algorithms themselves. Despite having very sophisticated ways to train and optimize models and reinforce good behaviors in these models, the underlying algorithms still have some serious problems.
For example, large language models, as you mentioned, do hallucinate. They cannot provide very factual answers; they're not good at retrieving factual information, and they're not good at citing sources. As a matter of fact, BERT, one of the models that is very common, still believes that the president of the United States is Barack Obama, and we all know that ChatGPT's knowledge cutoff is in 2021. So these large language models still have some serious issues when it comes to the algorithms themselves. And finally, the oversight when it comes to training these models.
Actually, the lack of oversight in the training. We still don't really understand how these models generalize and why they generalize so well. Therefore, it's hard to put safeguards in place to make sure all of their answers are correct and always consistent and that they can provide citations. So I believe the data, the algorithms, and the oversight, or the lack of oversight, are the major reasons for hallucinations, knowledge cutoffs, and not being able to cite their references. And as you said, they're very good at generating very convincing arguments, right?
[00:06:53] Unknown:
Yeah. All you have to do is exude confidence that your answers are correct regardless of their actual factual basis. Yeah. And to the point of bias in the underlying data that's used to build these models, that is not necessarily something that is exhibited in every interaction, and it can even be very subtle and hidden. And I'm curious, what are some of the ways that that inherent bias in the underlying data can manifest and some of the ways that that bias can have a negative external impact on the users or the operators of that product?
[00:07:32] Unknown:
Yeah, that's another great question. I believe one example of the subtle manifestation of bias in the training data is something I refer to, or what's known, as semantic bias. In semantic bias, basically, when a specific term appears in the training data frequently associated with negative context, the AI algorithm is going to assign that negative context or negative meaning to that word. So if you have a dataset that's always associating the word immigrant with a bad context, then the AI is going to learn a representation of that word as being negative. That's what I refer to as semantic bias.
Then there is absence bias. A lot of times, we are training on a specific dataset to detect some disease, or to learn a new style of saying or doing things, or to teach the LLM some specialized task. If our dataset does not include other types of arguments or other views on the same topic or the same point, that LLM is not going to understand the full argument and won't be able to really capture information from different perspectives. That causes some bias towards specific points of view in the future, etcetera.
Then there is the known bias that we can see in the data. If the data is collected from specific geographical areas or from specific respondents, if we have surveys and we are collecting them from specific groups that have specific political views, then that data is obviously biased towards some specific political perspective. There's also gender bias, whether it's because of historical reasons or because the language itself is gendered, for example, etcetera. So there are all types of bias: bias that could be hidden in the language itself, such as semantic bias; datasets that are not representative enough; and stronger manifestations of bias from historical reasons and some stereotypes as well. And in addition to
[00:09:58] Unknown:
the inherent bias of the source datasets, the ways that that exhibits in the generated results, and the potential inaccuracies because of lack of real-world context or actual knowledge, the other potential risk of dealing with these generative AI systems is the opacity of how the user interaction data is actually going to be used, or whether it's going to be used at all. I'm wondering what are some of the ways that the application or collection of those interactions might also be some measure of exposure to risk for the people who are interacting with these systems.
[00:10:35] Unknown:
Yeah, of course. I think that could have significant risks, right? I believe, historically, our browsing data has been a sensitive topic, but I believe today our chat history with ChatGPT, for example, is probably way more sensitive than our browsing history. We're asking a lot of questions that could demonstrate our incompetency at work, for example. We are asking questions to write a message or a letter to our significant other, and maybe we are not being very transparent that it was generated by AI. And if that piece of information was somehow leaked unintentionally, it can cause some serious harm.
So, yeah, data today is collected in different ways. Basically, every single piece of data from our interaction with these systems is collected, and there are unfortunately no very transparent ways for us to validate how this data is being used, where it is being kept, and who has access to it. Therefore, we cannot really fully understand and comprehend the possible risks. Some providers offer options to opt out of having your data collected and used, but that data is still kept for 30 days somewhere on some servers. I do believe the promises that these providers make that they're not going to use my data, but accidents happen and honest mistakes happen. We have seen incidents already where some people's data from OpenAI was leaked, including some credit card information.
A couple of weeks ago, researchers on a Microsoft AI team accidentally published tons of data about people, including private conversations, on GitHub because of a file that was misconfigured. So accidents do happen, and, therefore, promises by themselves are not enough. That's why I'm a big advocate for, and interested in, providing privacy-preserving methods to do these things beyond the promises. So the risks here could be very serious. We need to be able to know where this data is collected, how it is being used, and who has access to it. This data could also be repurposed, as you mentioned. You're not using my data to retrain your system, but maybe you are using it for targeted advertisement, for example. So it can have very serious implications.
[00:13:17] Unknown:
And then the other element of risk for generative AI and these aspects of bias and inaccuracy is the potential future impact of lack of trust because of the negative interactions that people might have, even as these systems do improve and maybe address some of these underlying sources of bias or inaccuracy. That can influence potential future investment in developing or improving those technologies, where if there is enough of a backlash, then those systems might just need to be shut down completely. There are also issues with the copyright status of some of the source data that's used for training, my understanding is for OpenAI in particular, but that's a potential legal gray area. I'm curious what you see as the risk to the industry and the people who are working on these technologies as a result of potential violations of trust because of these biases or inaccuracies?
[00:14:16] Unknown:
Yes. I believe that eventually we are hopefully going to solve these problems at large, and we should be able to build and create systems that are trustworthy. These systems are supposed to preserve our privacy whenever we interact with them. However, today, we're still not able to fully comprehend the abilities of these systems, and therefore it's still really difficult for us to safeguard them and understand the ramifications of this technology. Today, there is a very positive attitude towards this AI. I have never personally met anyone who tried ChatGPT, or Midjourney, for example, to generate images, and said, hey, this is very dangerous, this is not good, I'm not going to use it. So it seems like the general public today, when they look at this technology, they like it, and they believe that it's going to actually benefit humanity.
Maybe one great thing that OpenAI did is that it exposed the system to the public for us to start understanding the abilities of these AI systems and what actually could go wrong. So today, things still seem to be under control, but god forbid some major accident happens to these systems, it could really change the attitude towards them. If something goes wrong, if a major hacking incident leaks all of people's conversations with ChatGPT, for example, that could lead to some serious issues with adopting this technology.
And if you look back in history, some such accidents have drastically changed the roadmap and the trajectory of some of the technologies that we had. Nuclear energy is a very good example that comes to my mind. It emerged as this limitless source of clean energy that's very good for humanity, the environment, and everyone, and it was supposed to be very cheap as well. Unfortunately, a couple of incidents, Chernobyl in the eighties, and I think, if I'm not wrong, Three Mile Island here in the United States before that, and Fukushima in Japan, drastically affected people's attitude towards that technology. Today, the term nuclear is attached to all of these incidents that happened in the past, and that resulted in drastically stopping the investments and research there, and people and students were not interested in studying that field anymore, etcetera.
So it would take something as big as Fukushima happening to AI to probably change the overall sentiment and attitude towards it. But it's also a little bit different here, because, yes, people's privacy is very important, but it's not people's lives that will probably be affected if a big hack happens to ChatGPT, for example, or OpenAI's servers. That's a little bit different. So even if a big incident like that happens, I don't think it's going to stop progress completely; maybe it's going to slow it down a little bit, specifically after we have just seen the tip of the iceberg of what this technology is capable of doing.
So there might be an incident here and there, and some tax that we have to pay for these technologies, but I think it's going to continue growing.
[00:18:03] Unknown:
From the perspective of the practitioners, the people building these AI systems and applying ML technologies, what are some of the ways that they need to be thinking about incorporating awareness of, and counteraction of, bias into their work? And some of the ways that they need to be thinking about the presentation of their results, in order to convey the appropriate level of confidence so that there isn't this blind trust in these systems? I'm just wondering about the overall scope of how these risks need to be incorporated into the actual work of ML practitioners to ensure that they are building systems that are beneficial to the end users.
[00:18:48] Unknown:
That's a very good question. Simply speaking, I think they should use a system like TripleBlind; that's the startup that I work for. The reason I say that is that I believe ML practitioners should be careful when they are creating these AI systems and training on the data. You might be aware that the AI life cycle is very different from the traditional software development life cycle. You need to be careful about what data you collect and where you collect it from. You need to be careful when you curate the data, organize it, clean it, and prepare it. You need to be careful about what algorithms and privacy-preserving algorithms you use to train the model, and we all know that model training is not a one-time process; we retrain it again and again until it reaches a specific performance point. Then we take that model, deploy it, and that model starts being used and generating inferences, for example.
Even then, we need to make sure that these inferences are not leaking any information about the training data. Then we want to monitor the behavior of the model so that it's not shifting over time, because usually people's behaviors and the world we are living in shift, so these AI models drift in their performance as well. We need to monitor that. So it's a very complicated process, and that's why I believe a very easy-to-use system is necessary here. AI practitioners need to make sure that their data covers all possible scenarios of whatever downstream task they are working on. You need to collect the data from multiple sources, not from a single source.
I recall here a study that came out of Oxford in 2021 that examined more than 500 machine learning and deep learning models, published in reputable journals and conferences, that were trained to detect COVID-19 from chest X-rays and other electronic health records. That study demonstrated that almost every single one of these algorithms was fatally flawed, and they failed big time when these models were exposed to new types of data that came from a different distribution, from different people, from different geographical locations. So we need to make sure that we are collecting samples that are representative of all possible scenarios. We want to make sure that we are using robust training algorithms that can lead to models that generalize well and do not overfit.
When we are exposing these models, we want to make sure that we are preserving the privacy of the training data by making sure that the output of the models' predictions cannot be used to leak any data. Finally, we need to be transparent about all of this process, end to end. Privacy should not be an after-the-fact thing. We actually operate at TripleBlind with a concept called privacy by design, and today it's well known in academia and industry. So privacy should be integrated into the entire life cycle. Data collection is important: make sure that we have samples that cover all possible scenarios.
Make sure we are using algorithms that generalize well and do not lead to overfitting. Even when we deploy the models, make sure that we're using techniques to preserve the privacy of the data that's coming into inference, and techniques to make sure that the output of the inference is not leaking any data as well. And then transparency: being open about discussing it and collaborating with privacy professionals and scientists to make sure that our methods are correct and are actually not leaking any sensitive information.
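That concern about models failing on data from a different distribution can be checked directly. Here is a minimal sketch of holding out an entire site and comparing performance; the file name, site labels, and column names are hypothetical placeholders, not anything from the study or from TripleBlind's tooling:

```python
# Minimal sketch: train on one site's data, evaluate on another site's data to
# check whether the model generalizes beyond its source distribution.
# The file name and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("chest_xray_features.csv")  # hypothetical tabular feature set

train_df = df[df["site"] == "hospital_a"]
holdout_df = df[df["site"] == "hospital_b"]  # different population / scanner / protocol

features = [c for c in df.columns if c not in ("site", "label")]
model = LogisticRegression(max_iter=1000)
model.fit(train_df[features], train_df["label"])

in_dist_auc = roc_auc_score(train_df["label"], model.predict_proba(train_df[features])[:, 1])
out_dist_auc = roc_auc_score(holdout_df["label"], model.predict_proba(holdout_df[features])[:, 1])
print(f"AUC on training site: {in_dist_auc:.3f}")
print(f"AUC on held-out site: {out_dist_auc:.3f}")  # a large gap signals poor generalization
```

A large gap between the two numbers is the same failure mode the Oxford study found: the model looks fine on data like its training set and falls apart on a new population.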
[00:22:49] Unknown:
In the context of ensuring that you cover your bases with regards to bias in the training process, one of the challenges there is that bias is typically something that you're blind to. So I'm curious how you, your team, and practitioners in general can think about identifying potential sources of bias, and some of the ways to think about coverage of bias within a given problem domain that you're trying to solve for? Yeah. Well, it's not an easy process. And generally speaking,
[00:23:24] Unknown:
we cannot really build, or at least it's a very difficult problem to build, a system that is not biased for every possible user, right? Specifically if we are talking about LLMs, for example, you can chat with them about different ideas, different ideologies, etcetera. So it's very difficult to build a system that is completely unbiased toward everyone. But usually, when we work with applications where we can manage and mitigate that, such as medical applications that are based on AI and machine learning systems, we make sure, again, that we have the necessary tools for hospitals, physicians, and machine learning practitioners to access data from around the globe, for example, without having to go through a very tedious process of signing legal terms to be able to exchange data, etcetera, because our system today is HIPAA compliant, GDPR compliant, and so on. It reduces the time it traditionally takes to run these experiments and to obtain data from the European Union, for example, from six months or a year down to a couple of hours using our system.
That's basically one of the biggest enablers to mitigate bias as much as possible: increasing the sources of the data and the distribution of the data. And going back to when we discussed some of the subtle ways that bias can manifest in the data, we make sure that we are covering our bases. If a specific disease is affected by factors x, y, and z, are factors x, y, and z covered well enough in our training data? If not, why not? Are factors x and y undersampled, so that we don't have enough samples? Does the training data have too much representation of factor x? We then try to undersample some specific classes or oversample other classes to make sure that everything is represented as equally as possible.
And then we double down on the validation and evaluation process as well. We have an automated tool that helps researchers, for example, test across every characteristic in the dataset how biased a model is towards a specific feature or column in tabular data. So there are several approaches, from collecting as much data from as many possible sources to rigorous evaluation and validation methods.
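To make that kind of automated audit concrete, here is a generic sketch, not TripleBlind's tool, of looping over the categorical columns of a tabular test set and comparing a fitted model's accuracy for each subgroup; the column names in the usage example are hypothetical:

```python
# Generic sketch of a subgroup bias audit for a fitted classifier on tabular data.
import pandas as pd
from sklearn.metrics import accuracy_score

def audit_subgroups(model, df: pd.DataFrame, feature_cols, group_cols, label_col="label"):
    """Report per-subgroup accuracy for every categorical column in group_cols."""
    df = df.assign(_pred=model.predict(df[feature_cols]))
    overall = accuracy_score(df[label_col], df["_pred"])
    rows = []
    for col in group_cols:
        for value, group in df.groupby(col):
            acc = accuracy_score(group[label_col], group["_pred"])
            rows.append({"column": col, "group": value, "n": len(group),
                         "accuracy": acc, "gap_vs_overall": acc - overall})
    return pd.DataFrame(rows).sort_values("gap_vs_overall")

# Hypothetical usage: "sex", "age_band", and "site" are placeholder column names.
# report = audit_subgroups(model, test_df, feature_cols, ["sex", "age_band", "site"])
# print(report)
```

Subgroups with a large negative gap and a reasonable sample size are the ones worth investigating first, whether by collecting more data for them or by rebalancing the training set.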
[00:26:04] Unknown:
Another interesting element of trying to account for bias is how that manifests as far as the accuracy for a particular subgroup of the target audience as you generalize to a broader set of audiences, and some of the ways to be thinking about that. Is it better to accept a decrease in accuracy because you are addressing the potential bias? Or is it maybe better to train specific models for the different cohorts that you're targeting? I'm wondering what you see as the general approach to that problem.
[00:26:37] Unknown:
It largely depends on the downstream task. We have used all possible approaches to solve that, from sometimes sacrificing a little bit of accuracy to address the actual targeted group of people, or the task or feature that we care about, all the way to building a system of experts, a group of models, each one specialized in a specific demographic and specific disease, which then eventually vote on a specific output or result. We can look at all of that, try to generalize beyond the training process, and tune the models afterwards in one direction or another based, again, on the downstream task. A lot of times, explainability, trying to understand how and why a model generalizes to a certain level, can also help us mitigate some of that bias. If you can understand why the model is making decision x instead of what it's supposed to do, and why it's performing much better on one class than another, we try to understand that behavior and tune it afterwards.
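One simple way to realize that "system of experts" idea, purely as an illustrative sketch with hypothetical cohort and column names, is to train one model per cohort and soft-vote across them:

```python
# Illustrative sketch of a "group of experts": one model per cohort plus a
# soft-voting ensemble. Assumes every cohort's data contains all classes.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_experts(df: pd.DataFrame, feature_cols, cohort_col, label_col):
    experts = {}
    for cohort, group in df.groupby(cohort_col):
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(group[feature_cols], group[label_col])
        experts[cohort] = clf
    return experts

def ensemble_predict(experts, X):
    # Average the probability estimates of all experts (soft voting).
    probs = np.mean([clf.predict_proba(X) for clf in experts.values()], axis=0)
    return probs.argmax(axis=1)

# Hypothetical usage:
# experts = train_experts(train_df, feature_cols, "cohort", "label")
# preds = ensemble_predict(experts, test_df[feature_cols])
```

Whether the ensemble or a single rebalanced model wins is, as noted above, an empirical question that depends on the downstream task.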
[00:27:50] Unknown:
From the perspective of customers of yours, I'm wondering what are some of the elements of education or some of the typical questions that they come to you with to help understand the risks, the reason that they need to invest extra time and energy and money into counteracting bias, ensuring the privacy preserving aspects of their AI, just some of the elements of customer education that you have found most useful and most effective?
[00:28:18] Unknown:
Very good question. A lot of the customers that come to us are already dealing with very sensitive information: healthcare customers, financial customers, or customers in the advertising world. And the customers vary a lot. Sometimes they're aware, to some level, of the importance of privacy. A lot of times, we then demonstrate to them that the model they trained on their patients' data, which is only used as a classification model to predict a specific disease from X-rays, for example, can actually leak a lot of information about their patients.
We have demonstrated before that if you take a model that was trained in a centralized way without any privacy-enhancing technologies, I can probably reconstruct several samples of the training data, which means that these models are actually memorizing parts of the data and leaking parts of the data. So we show what is possible to extract from already-trained models. We show what's possible to do after you have anonymized your data, that we can re-identify patients in your dataset even if you followed the HIPAA Safe Harbor privacy rules, for example, because we believe anonymization by itself is not sufficient.
So we educate our customers about all the possible points where data leakage can happen in the overall AI life cycle, from collecting the data, to training, to validation, even to inference. Say we have a model that's trained to do some type of classification task. I can probably craft specific inputs to that model to make it produce information that is biased toward the training data, and given enough runs of that inference process, I can probably extract a lot of information about the training data. So demonstrating what's possible and how these systems leak data and sensitive information is, I think, one of the best educational ways to raise awareness with customers.
They see that, and then they see how we can mitigate all of those bias and privacy concerns using the set of tools that they are used to working with. You don't need to be a cryptography expert. You don't need to know what federated learning is or what blind learning is in order to be able to use our toolset. As a machine learning or AI practitioner, you basically continue using the tool stack that you are used to today, and we encapsulate all of our automated methods for you. When they see that, and given that enterprises today care about their customers, their data, and their intellectual property, it's usually easy to convince them.
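The kind of leakage being described, that a model's predictions reveal something about its training data, can be shown with a very simple confidence-threshold membership inference test. This is a generic teaching sketch on a public dataset, not one of TripleBlind's demonstrations:

```python
# Sketch of confidence-based membership inference: an overfit model tends to be
# more confident on records it was trained on, so unusually high confidence on a
# candidate record is (weak) evidence that the record was in the training set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deliberately let the model overfit so the memorization is visible.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def max_confidence(clf, X):
    return clf.predict_proba(X).max(axis=1)

threshold = 0.95
member_rate = np.mean(max_confidence(model, X_train) > threshold)     # training records
non_member_rate = np.mean(max_confidence(model, X_test) > threshold)  # unseen records
print(f"members flagged:     {member_rate:.2f}")
print(f"non-members flagged: {non_member_rate:.2f}")
# A large gap between the two rates means the model leaks membership information.
```

Real attacks are more sophisticated (shadow models, calibrated thresholds, reconstruction), but even this gap is usually enough to show that "the model only returns predictions" is not a privacy guarantee.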
[00:31:14] Unknown:
With that list of different privacy-preserving techniques that you rattled off there, federated learning and blind learning, I know differential privacy is another one that's gained a lot of attention. What are the common approaches that you've seen as most beneficial? I'm curious if you can give a general sense of the current state of the art, or set of best practices, around how to think about and apply these privacy-preserving aspects in the training and serving process? Yeah. Sure. So,
[00:31:44] Unknown:
first, there's no single technology that's good for all the phases of the AI life cycle. So one of the main things that we did well here at TripleBlind is that we optimized the life cycle and the underlying privacy-enhancing technologies for each one of the tasks. When you start collecting your data, we have a set of tools that enable you to learn a lot of information about data that you don't own, that exists at other organizations, without that data ever leaving the host environment and without actually seeing any of the raw data. That's how you do discovery, and that's basically blind discovery for the data. You know that, hey, there is something that's useful for my application, but I don't know what it is, and I cannot see it.
After that, we have a set of tools that enable you to run code privately and securely on the remote datasets. We have something called a remote Python executor, where you can ship specific Python code, or Rust code, for example, to the other parties that have the data, and that code will execute on that party's data given the owner's permission, an auditing process, etcetera. After that, the training process comes into place. You mentioned a very good term there: differential privacy is one great technology to use to make sure that the outputs of the models do not leak information about any specific data record. That's possible in different scenarios.
And today, at TripleBlind, we enable differential privacy when training or fine-tuning large language models as well, without a large hit in accuracy, without sacrificing accuracy a lot. Another great approach, one that is cooler than federated learning, is our own method called blind learning. Blind learning basically allows you to train a shared model, a global model, from distributed datasets that may exist at different branches of an enterprise or different locations around the world, again, without ever having to ship the data to a centralized location.
And the great thing about blind learning is that it is about 2 to 3 times more computationally efficient than federated learning. It can actually also lead to better-performing results and better accuracy than federated learning because of the way that it trains. So now we have covered our bases in finding proper datasets, executing code at remote sites or remote machines, and using differential privacy and federated learning to train on distributed data. After that, you can validate all of these models and then deploy them. When we deploy these models, we also use something called secure multiparty computation.
Secure multiparty computation allows the model owners to preserve the privacy of their models, because usually they spend time and effort training that model, so the parameters, or the weights, of that model are considered IP to the company that created it. And then there's the user of the model, who wants to utilize that model and run some tests on it, but they don't want to share their data with the host of the model or the model owner. So we run these inferences in a completely private way using, again, secure multiparty computation, which is a privacy-enhancing technology that enables joint computations without revealing the data to any of the involved parties. It's a very cool approach as well.
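To give a flavor of how a computation can proceed without revealing anyone's inputs, here is a toy additive secret sharing example. It is a teaching sketch of the underlying principle only, far simpler than the protocols used for actual private inference:

```python
# Toy additive secret sharing over a prime field: two private numbers are split
# into random shares, the shares are combined locally, and only the final sum is
# reconstructed. No single share reveals anything about the original inputs.
import secrets

PRIME = 2**61 - 1  # field modulus

def share(value, n_parties=2):
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

a, b = 42, 1337                      # Party A's and Party B's private inputs
a_shares, b_shares = share(a), share(b)

# Each party adds the shares it holds; neither ever sees the other's raw input.
sum_shares = [(a_shares[i] + b_shares[i]) % PRIME for i in range(2)]
print(reconstruct(sum_shares))       # 1379 == a + b
```

Production MPC protocols extend the same idea to multiplications and full neural network layers, which is what makes private inference on a proprietary model possible.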
So, yeah, every phase of the life cycle has a good privacy enhancing technology, and I think we have a collection of great efficient tools that are very effective at this.
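Similarly, the differential privacy mentioned above comes down to a concrete training-time mechanism: clip each example's gradient and add calibrated noise. Here is a bare-bones DP-SGD-style sketch for logistic regression in NumPy; it shows the mechanism only, and a real deployment would use a vetted DP library and a privacy accountant to track the actual epsilon:

```python
# Bare-bones DP-SGD-style training for logistic regression: clip each example's
# gradient to a fixed L2 norm, add Gaussian noise to the summed gradient, then step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
clip_norm, noise_multiplier, lr, batch_size = 1.0, 1.1, 0.1, 100

for step in range(200):
    idx = rng.choice(n, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    preds = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    per_example_grads = (preds - yb)[:, None] * Xb           # log-loss gradient per example
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=d)
    w -= lr * (clipped.sum(axis=0) + noise) / batch_size

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(X @ w)))) > 0.5) == y)
print(f"training accuracy with clipped, noised gradients: {accuracy:.3f}")
```

The noise bounds how much any single record can move the parameters, which is exactly the guarantee that keeps an individual patient or customer from being exposed by the trained model.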
[00:35:43] Unknown:
And in your work at TripleBlind and as an ML practitioner, what are the most interesting or innovative or unexpected ways that you've seen people addressing this challenge of data privacy and bias in their machine learning and AI applications?
[00:35:58] Unknown:
Some of the coolest ways that I've seen are how blind learning, our in-house algorithm, has been and is being used today in real-world applications to train on patients' data and actually improve healthcare outcomes. That has been really exciting to me. We came up with this blind learning approach that enables you to train algorithms on decentralized data without ever seeing it, without ever accessing it. You don't have to ship the data outside your infrastructure. And the best thing about it is that it's very easy to use. Our partners, for example physicians from some healthcare providers who are building algorithms to predict specific diseases or rare diseases, are able to run this blind learning approach in less than 20 lines of code. So it has been super exciting to observe that these methods are actually being used in the real world.
And even though these methods bank largely on preserving the privacy of the training data, they're still able to produce accurate, high-performing models that are on par with models trained by pooling the data to a centralized location. So this has been very exciting for me to see happening in the real world.
[00:37:24] Unknown:
And in your work, what are the most interesting or unexpected or challenging lessons that you've learned in the context of privacy and bias in AI?
[00:37:33] Unknown:
One of the unexpected things that is still surprising to me is that the current regulations are lagging behind, big time, in providing good safeguards and good guidelines for practitioners, for enterprises, etcetera, to preserve the privacy of the people whose data they are using. With HIPAA, for example, today there are two ways to make sure that you are HIPAA compliant when you are using AI on patients' data. The first one is called Safe Harbor, and it basically works using data anonymization. You only need to delete about 18 personal identifiers from the dataset: you delete the patient's first name, last name, phone number, email, IP address, etcetera.
But we know very well, and we know for a fact, that, as Cynthia Dwork put it, anonymized data isn't. She actually coined the term differential privacy that we talked about earlier. So I'm still a little bit surprised at how the regulations are lagging behind and how a lot of enterprises out there today are able to get by with just data anonymization, which is not private and not protective of patients' privacy. Privacy is a fundamental human right, so that's basically not respecting it. But thanks to podcasts like yours and other recent calls for awareness, people are becoming more aware of this problem, more educated. And, hopefully, we can make some significant positive changes to the technology that we have today in order to preserve our privacy.
[00:39:33] Unknown:
And for people who are trying to address these challenges of bias and privacy in their machine learning models, what are the cases where TripleBlind is the wrong choice and maybe they need to build their own in-house solutions?
[00:39:47] Unknown:
That's a good question. Usually, building your own solution is a very complicated and tough process; it's not easy. Today, one of our best competitive advantages is that we have a great team. We have world-class cryptographers and AI practitioners who understand this technology at a very deep technical level. For example, our chief technology officer is Craig Gentry, who is basically the inventor of fully homomorphic encryption. You've probably heard of it, and it's widely used by so many companies around the globe. So we have great experience in this domain.
And I think the only reason you would build your own is probably to save some cost, and that will come back and bite you because you've probably taken some shortcuts. So you probably don't want to do that. If you are good at one thing, it's better to stick to it and let the experts deal with the privacy work.
[00:40:56] Unknown:
And as you continue to build and iterate on the TripleBlind product and stay up to date with what's happening in the broader ecosystem, what do you have planned for the future of TripleBlind?
[00:41:10] Unknown:
Yeah. We have a couple of great projects that have been going on for some time now, where we have achieved some significant results. For example, we are working on a project called the privacy monitor. It will allow individual users of ChatGPT, for example, and enterprises to know when they are sending sensitive information. It's going to notify them, and that privacy monitor is going to work automatically. So if you're sending a question to ChatGPT that has some sensitive information, our privacy monitor can detect that and warn you that you are sending that piece of information. Even if there are no specific terms in the actual message, like your name, etcetera, it can tell you, hey,
this idea is very similar to a patent document that you have on your machine, and that could leak some of your intellectual property. So this is very exciting. It will allow us to have a little bit more confidence that we are not leaking too much information whenever we are interacting with these large language models, specifically for enterprises or doctors. You can use ChatGPT, for example, to summarize a patient note, but we make sure that whatever is being sent from the patient note to ChatGPT does not have any PII, or personally identifiable information. So the privacy monitor is a very powerful tool that I'm very excited about, and we should be able to address a lot of the privacy concerns here in the very near future.
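A very rough approximation of that prompt-scanning idea, nothing like TripleBlind's actual privacy monitor, which also handles semantic matches like the patent example above, can be put together with a few regular expressions:

```python
# Rough sketch of scanning a prompt for obvious PII before it is sent to an LLM.
# Real systems combine NER models, document fingerprinting, and policy rules;
# regexes only catch the most easily pattern-matched identifier types.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_prompt(prompt: str):
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(prompt):
            findings.append((kind, match.group()))
    return findings

prompt = "Summarize this note for the patient reachable at john@example.com or 555-123-4567."
for kind, value in scan_prompt(prompt):
    print(f"warning: prompt appears to contain {kind}: {value}")
```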
Differential privacy is another aspect: how we can fine-tune these large language models in a very privacy-preserving way. We also have a private retrieval augmented generation tool. We talked about the knowledge cutoff of these large language models; this tool will enable you to append context to these large language models in a very cheap way, without having to fine-tune them. So these are some of the main ideas today, and a lot of them are targeted, again, at the entire AI life cycle, and specifically today a little bit more towards prompting and inference as well.
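Retrieval-augmented generation itself is simple to sketch: index a document collection, retrieve the passages most similar to the question, and prepend them to the prompt. The snippet below is a generic illustration using TF-IDF retrieval and a placeholder for the model call; it says nothing about the private variant described above, which additionally has to keep the document store and the queries protected:

```python
# Minimal retrieval-augmented generation sketch: retrieve the most relevant
# passages with TF-IDF similarity and prepend them as context to the prompt.
# `call_llm` is a placeholder for whatever model endpoint is actually used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Blind learning trains a shared model on distributed data without centralizing it.",
    "Differential privacy bounds how much any single record can influence a model.",
    "Secure multiparty computation lets parties compute jointly without revealing inputs.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(question: str, k: int = 2):
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# answer = call_llm(build_prompt("How does blind learning handle distributed data?"))
print(build_prompt("How does blind learning handle distributed data?"))
```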
[00:43:25] Unknown:
Are there any other aspects of the work that you're doing at TripleBlind, or the overall space of bias and data privacy, particularly in the context of generative AI, that we didn't discuss yet that you would like to cover before we close out the show?
[00:43:39] Unknown:
I think we touched on some of the very important topics. Again, I think this technology is really great. It has great potential for positive impacts in the future. Perhaps before we get there, it's going to be a little bit difficult, and there are a lot of nuances and a lot of unknowns as well. But overall, I think it will have a very positive impact. We should be careful about how we deploy these systems and how we train them, but we shouldn't be too worried either. We don't want to paralyze our advancement, the way we are doing the research, or the way we are training these models. We need to just continue pushing forward and not do the opposite. We should not slow down.
I think that's an important thing to keep in mind.
[00:44:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:44:53] Unknown:
That's a great question. I believe you have probably trained some AI models and played with some tools. But, unfortunately, even for a machine learning or AI expert like myself, I cannot just wake up in the middle of the night and say, hey, I would like to create models that predict diabetes or Alzheimer's before they happen, because I do not have access to such data. So data being locked behind the doors of regulations and privacy concerns, which is important and justifiable, still makes it very difficult to really move fast in this domain and create things. The lack of access to data, or of privacy-preserving ways to access that data, is, I believe, one of the biggest barriers to entering this domain and creating useful machine learning models. And, hopefully, at TripleBlind, we are facilitating that and removing lots of the barriers.
[00:45:48] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at TripleBlind, and your perspective and context on the challenges of bias and privacy for these AI projects that we are all interacting with and hearing about all the time. I appreciate your time, and I hope you enjoy the rest of your day. Thanks for having me on the show, and I
[00:46:11] Unknown:
enjoyed answering your wonderful questions. Thanks a lot.
[00:46:19] Unknown:
Thank you for listening, and don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea idea to delivery with machine learning.
[00:00:19] Unknown:
Your host is Tobias Massey, and today, I'm interviewing Gharib Gharib about the challenges of bias and data privacy in generative AI models. So, Gharib, can you start by introducing yourself?
[00:00:29] Unknown:
Hello, Tobias. Thanks for, having me. Excited to, talk with you about this, wonderful topic. My name is Gareeb. I am, interested, and, my interest fall at the intersections of AI privacy systems. Today, I serve as the director of applied research and head of AI at TripleBlind, which is a startup that's focused on basically protecting people's, privacy. And we do so by building a wide range of privacy preserving, tools for AI, machine learning, and data analytics.
[00:01:05] Unknown:
And do you remember how you first got started working in machine learning?
[00:01:09] Unknown:
Yes. So, basically, when I grew up, we had a desktop in our home, and I always thought it will be really exciting to be able to have conversations with, with my computer. Growing up, I know that in order to, do that, I needed to be able to talk to the computer and then make it do things for me or assist me in things. So in order to be able to talk to the computer, I pursued a master's degree in computer science, University of Missouri in Kansas City. It was in software development. And, after that, I went into my PhD program, and I started focusing on deep learning. And I think what really, made me go into deep learning is a paper from 2012, Jeffrey Hinton team, that worked on ImageNet competition, and their results showed that this thing called the neural networks and convolution, layers can actually achieve very good results.
I got very interested in this domain and, studied deep learning in my PhD. And towards the end of it, I started exploring, privacy preserving AI. And, here I am today.
[00:02:17] Unknown:
Generative AI in particular has been gaining a lot of attention because of the different breakthroughs and the level of sophistication and capability that it has reached recently with the different, successions of g p t based models. I'm wondering if you can talk to some of the risks that the capabilities of these GPT models pose, particularly in that context of data privacy and bias?
[00:02:43] Unknown:
Yeah. Sure. That's, that's actually a great question. And to me, GPT or generative AI in general is basically I see it as a general purpose technology. Similar to all the previous technologies that we had, these technologies usually are double edged swords. They have a lot of potential to do great things, but they also can cause some harm. Some some of the capabilities of chat GPT, for example, can be used to, amplify specific, agenda or propaganda because it's very good at generating text and essays, for example, or blogs easily and very cheap and with very convincing arguments.
So that can, that can also be helpful to generate the fake news. Right? And, GPT models today are multimodal, and in the future, they could be producing videos. And, actually, there's already generative AI models that can take some question in a small sentence and text and generate a video out of it. So deep fakes is another big problem, for example. So the capabilities of these models can have, very serious implications, and a lot of these capabilities have fallen in the wrong hands of doctor Evel. It could, cause some serious harm. But, in the big picture and on the long run, I believe that, the benefits of these systems are going to offset, their, their issues.
[00:04:11] Unknown:
With generative AI, there's a lot of excitement about, oh, they can do anything. I can tell it to do whatever I want, and it'll give me a reasonable sounding answer. But the the problem there is that it's reasonable sounding without necessarily being actually correct. And I'm curious what you see as the contributing factors to the shortcomings of these generative models and some of the inherent risk of people's level of trust that is being built up with that trust not necessarily being well founded?
[00:04:45] Unknown:
Yes. That's that's another great question. And, I believe the this, these issues stem from 3 major factors, data, the algorithms, and the oversight when it comes to training these algorithms. 1st, I usually loosely define AI algorithms or, AI systems as programs that generalize from data. And, therefore, if the data that we collected was biased in 1 way or another, the programs that's going to generalize from this data is they're also going to amplify that bias. A second thing is the algorithms themselves. Despite having very sophisticated ways to train and optimize models and, reinforce some of the, good behaviors in securing these models. The underlying algorithms still have some serious problems.
For example, large language models, as you mentioned, do hallucinate. They, they cannot provide very factual answers. They're not good at retrieving factual information. They're not good at citing them. As a matter of fact, BERT, 1 of the models that are very common, still believes that the president of the United States is Barack Obama, and we all know that child gbt knowledge cutoff is in 2021. So these large language model are still have some serious issues when it comes to the algorithms themselves. And finally, the oversight when it comes to training these models.
Actually, the lack of overs oversight in the in the training. So, we still don't really understand how these models generalize and why they generalize very well. And, therefore, it's hard to put safeguards in place to make sure all of their answers are correct and always consistent and that they can provide citations. Yeah. So I believe data algorithms and the oversight or the lack of oversight are the major reasons of hallucinations, knowledge cutoff, and not being able to cite their references. And as you said, they're very good at convincing generating very convincing, arguments. Right?
[00:06:53] Unknown:
Yeah. All you have to do is exude confidence that your answers are correct regardless of their actual factual basis. Yeah. And to the point of bias in the underlying data that's used to build these models, that is not necessarily something that is exhibited in every interaction, and it can even be very subtle and hidden. And I'm curious, what are some of the ways that that inherent bias in the underlying data can manifest and some of the ways that that bias can have a negative external impact on the users or the operators of that product?
[00:07:32] Unknown:
Yeah. That's a very another great question. So I I I believe some of the examples for the subtle manifestation of bias based on the training data can some examples could be something I refer to or what's known as semantic bias. So in semantic bias, basically, when a specific term appears in the training data frequently associated with negative context, the AI algorithm is going to, assign that negative context or negative meaning to that word. So if you have a dataset that's always associating the word immigrant with a bad context, then AI is going to have a of that word that it's being negative. So that's what I refer to semantic bias.
And, then there is the absence bias. A lot of times, we are training on some specific data set for to to cure some disease or to learn a new style of way of saying things or doing things or teaching the LLM to do some specific specialized task. If our dataset does not include other types of arguments or other views on the same topic or the same point, that LLM is not going to understand the full argument and not able to, really capture information from different perspectives. And then there's, that that causes some, in the future, some bias towards specific points of view, etcetera.
Then there is the the known bias that we can exhibit from the data. If the data is collected from specific geographical areas or from specific respondents, if we have surveys and we are collecting them from specific, groups that have specific political views, then that data is obviously biased towards some specific political, perspective. For example, that's the gender bias, whether it's because of historical reasons or because the language as a binary language, for example, etcetera. So all types of bias that could be hidden in the language itself, such as the semantic bias that our dataset is not representative enough, there's more strong, manifestation and of of bias and for historical reasons and some stereotypes as well. And in addition to
[00:09:58] Unknown:
the inherent bias of the source datasets, the ways that that exhibits in the generated results, the potential inaccuracies because of lack of real world context or actual knowledge. The other potential risk of dealing with these generative AI systems is the opacity of how the user interaction data is actually going to be used or whether it's going to be used at all. And And I'm wondering what are some of the ways that the application or collection of the interactions might also be some measure of exposure to risk for the people who are interacting with these systems.
[00:10:35] Unknown:
Yeah. Of course. I think that could have significant risks. Right? I I believe, historically, our browsing data has been, like, a sensitive topic. But I believe today, our, chat history with chat GPT, for example, is probably way more sensitive than our browsing history. We're asking a lot of questions that could demonstrate our incompetency at work, for example. We are asking questions to, write a message or a letter to our significant other, maybe we are, not being very transparent that it was generated by AI. And if that piece of information was somehow leaked unintentionally, it can, cause some serious, some serious harms.
So, yeah, data today is collected in, in different ways. Basically, every single piece of data of our interaction with these systems is is collected. And, there is unfortunately not very transparent ways for us to validate how this data is being used, where is it being kept, who has access to this data. And, therefore, we cannot really fully understand and comprehend the possible risks. Some providers promise that our data there are options to opt out from collecting your data and being used. That data is still kept for 30 days somewhere at some servers, and I'm more than willing and happy to, and I do believe the promises that these, providers make that they're not going to use my data, but accidents happens and honest mistakes happen. And, we have seen incidents already where, some, people's data from OpenAI was leaked and some credit cards information.
2 weeks ago, some, researchers, Microsoft AI team, accidentally published tons of data about people and, private conversation on GitHub accidentally because of a a file that was misconfigured. So accidents do happen, and, therefore, promises by themselves are not enough. And that's why I'm a big advocate and interested in providing privacy preserving methods to to do these things beyond the, beyond the promises. So the risks here could be very serious. We need to be able to know where this data is collected, how is it being used, who has access to it. This data could be also repurposed, as you mentioned. You're not using my data to retrain your system, but maybe you are using it for targeted advertisement, for example. So it can have very serious implications.
[00:13:17] Unknown:
And then the other element of risk for generative AI, and these aspects of bias and inaccuracy, is the potential future impact of lack of trust. Negative interactions that people have now could influence future investment in developing or improving these technologies, even as the systems improve and address some of those underlying sources of bias or inaccuracy. If there is enough of a backlash, those systems might need to be shut down completely. There are also issues around the copyright status of some of the source data, my understanding is for OpenAI in particular, which is a potential legal gray area. I'm curious what you see as the risk to the industry, and to the people who are working on these technologies, from potential violations of trust because of these biases or inaccuracies?
[00:14:16] Unknown:
Yes. I believe that eventually we are hopefully going to solve these problems at large, and we should be able to build systems that are trustworthy and that preserve our privacy whenever we interact with them. However, today we're still not able to fully comprehend the abilities of these systems, and therefore it's still really difficult for us to safeguard them and understand the ramifications of this technology. Today there is a very positive attitude toward this AI. I have never personally met anyone who tried ChatGPT, or Midjourney to generate images, and said, hey, this is very dangerous, this is not good, I'm not going to use it. So it seems like when the general public looks at this technology today, they like it, and they believe it's going to actually benefit humanity.
Maybe 1 great thing that OpenAI did is that it exposed the system to the public, so we could start understanding the abilities of these AI systems and what could actually go wrong. So today things still seem to be under control, but god forbid some major accident happens to these systems, it could really change the attitude toward them. If something goes wrong, if a major hacking incident leaks all of people's conversations with ChatGPT, for example, that could lead to some serious issues for the adoption of this technology.
And if you look back in history, some such accidents have drastically changed the roadmap and the trajectory of technologies we had. Nuclear energy is a very good example that comes to my mind. It emerged as this limitless source of clean energy that's very good for humanity, the environment, and everyone, and it was supposed to be very cheap as well. Unfortunately, a couple of incidents, Chernobyl in the eighties, and I think, if I'm not wrong, Three Mile Island here in the United States before that, and Fukushima in Japan, greatly affected people's attitude toward that technology. Today the term nuclear is attached to all of these incidents that happened in the past, and that resulted in investment there drastically stopping, research slowing, students no longer interested in studying that field anymore, etcetera.
So it would probably take something as big as Fukushima happening to AI to change the overall sentiment and attitude toward it. But it's also a little bit different here, because, yes, people's privacy is very important, but it's probably not people's lives that will be affected if a big hack happens to ChatGPT, for example, or OpenAI's servers. So even if a great incident like that happens, I don't think it's going to stop progress; maybe it will slow it down a little bit, but it's not going to stop it completely, especially now that we have just seen the tip of the iceberg of what this technology is capable of doing.
So maybe there will be an incident here and there, and some tax that we have to pay for these technologies, but I think it's going to continue growing.
[00:18:03] Unknown:
From the perspective of the practitioners, the people building these AI systems and applying ML technologies, what are some of the ways they need to be thinking about incorporating awareness of and counteraction of bias into their work? And how should they think about the presentation of their results, conveying the level of confidence, so that there isn't this blind trust in these systems? I'm just wondering about the overall scope of how these risks need to be incorporated into the actual work of ML practitioners to ensure that they are building systems that are beneficial to end users.
[00:18:48] Unknown:
That's a very good question. Simply speaking, I think they should use a system like TripleBlind; that's the startup I work for. The reason I say that is that I believe ML practitioners need to be careful when they are creating these AI systems and training on the data. You might be aware that the AI life cycle is very different from the traditional software development life cycle. You need to be careful when you collect the data and where you collect it from. You need to be careful when you curate the data, organize it, clean it, and prepare it. You need to be careful about which algorithms, and which privacy preserving algorithms, you use to train the model, and we all know that model training is not a one-time process; we retrain again and again until the model reaches a specific performance point. Then we take that model, deploy it, and it starts being used and generating inferences.
Even then, we need to make sure that those inferences are not leaking any information about the training data. And we want to monitor the behavior of the model to ensure it's not shifting over time, because people's behaviors and the world we live in keep shifting, so these AI models drift in their performance as well, and we need to monitor that. It's a very complicated process, which is why I believe a very easy-to-use system is necessary here. AI practitioners need to make sure their data covers all possible scenarios of whatever downstream task they are working on, and they need to collect that data from multiple sources, not a single source.
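To make the drift-monitoring idea concrete, here is a minimal sketch (not TripleBlind tooling, just a common baseline) that computes the population stability index between the feature distribution seen at training time and the one seen in production; a rising PSI is one simple trigger for retraining.

```python
import numpy as np

def population_stability_index(reference, live, bins=10):
    """Compare a live feature distribution against the training-time
    reference distribution; larger values indicate more drift."""
    # Bin edges come from the reference data so both samples share one grid.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Avoid division by zero / log of zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
reference = np.random.normal(0.0, 1.0, 10_000)   # feature values at training time
live = np.random.normal(0.4, 1.2, 5_000)         # values seen after deployment
print(population_stability_index(reference, live))
```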
I recall a study that came out of Oxford in 2021 that examined more than 500 machine learning and deep learning models, published in reputable journals and conferences, that were trained to detect COVID-19 from chest X-rays and other electronic health records. That study demonstrated that almost every single 1 of these algorithms was fatally flawed and failed badly when the models were exposed to new types of data that came from a different distribution, from different people, from different geographical locations. So we need to make sure we are collecting samples that are representative of all possible scenarios, and that we are using robust training algorithms that lead to models that generalize well and do not overfit.
When we expose these models, we want to preserve the privacy of the training data by making sure the models' predictions cannot be used to leak any data. Finally, we need to be transparent about this whole process end to end. Privacy should not be an after-the-fact thing. At TripleBlind we operate with a concept called privacy by design, which today is well known in academia and industry: privacy should be integrated into the entire life cycle. So, yes, data collection is important; make sure we have samples that cover all possible scenarios.
Make sure we are using algorithms that generalize well and do not lead to overfitting. Even when we deploy the models, make sure we're using techniques to protect the test data coming into inference, and techniques to make sure the output of inference is not leaking any data either. And then transparency: being open about discussing it and collaborating with privacy professionals and scientists to make sure our methods are correct and are not actually leaking any sensitive information.
[00:22:49] Unknown:
In the context of ensuring that you cover your bases with regard to bias in the training process, 1 of the challenges is that bias is typically something you're blind to. So I'm curious how you and your team, and practitioners in general, can think about identifying potential sources of bias, and some of the ways to think about coverage of bias within a given problem domain that you're trying to solve for? Yeah. Well, it's not an easy process. And generally speaking,
[00:23:24] Unknown:
it is, if not impossible, then a very difficult problem to build a system that is unbiased for every possible user. Specifically, if we are talking about LLMs, for example, you can chat with them about different ideas, different ideologies, etcetera, so it's very difficult to build a system that is completely unbiased toward everyone. But usually, when we work with applications where we can manage and mitigate that, such as medical applications based on AI and machine learning systems, we make sure that we have the necessary tools for hospitals, physicians, and machine learning practitioners to access data from around the globe, for example, without having to go through a very tedious process of signing legal terms just to be able to exchange data, because our system today is HIPAA compliant, GDPR compliant, etcetera. It reduces the time it traditionally takes to run such an experiment and obtain data from the European Union, for example, from 6 months to a year down to a couple of hours using our system.
And that's basically 1 of the biggest enablers for mitigating bias as much as possible: increasing the sources of the data and the distribution of the data. Going back to when we discussed some of the subtle ways that bias can manifest in the data, we make sure we are covering our bases. A specific disease is affected by factors x, y, and z. Are those factors covered well enough in our training data? If not, why not? Are factors x and y undersampled, meaning we don't have enough samples? Does the training data have too much representation of factor x? Then we undersample some specific classes or oversample other classes to make sure that everything is represented as equally as possible.
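As a rough illustration of the re-balancing described here, the sketch below over-samples under-represented groups in a tabular dataset; the column names are hypothetical and this is just one of several ways to equalize representation.

```python
import pandas as pd

def rebalance(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Over-sample under-represented groups so every value of
    `group_col` appears with roughly equal frequency."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        # Sample with replacement until the group reaches the target size.
        parts.append(group.sample(n=target, replace=True, random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle rows

# Hypothetical example: factor "b" is heavily under-sampled relative to "a".
df = pd.DataFrame({"factor_x": ["a"] * 90 + ["b"] * 10, "label": [0, 1] * 50})
balanced = rebalance(df, "factor_x")
print(balanced["factor_x"].value_counts())
```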
And then we double down on the validation and evaluation process as well. We have an automated tool that helps researchers test, across every possible characteristic in the dataset, how biased the model is toward a specific feature or column in tabular data, for example. So there are several approaches, from collecting as much data from as many sources as possible to rigorous evaluation and validation methods.
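The per-column evaluation idea can be sketched in a few lines; this is a generic illustration rather than the automated tool mentioned above, assuming labels and predictions have already been joined onto the evaluation table.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def audit_by_group(df, y_true_col, y_pred_col, sensitive_cols):
    """Report the model's accuracy for every value of every sensitive column."""
    rows = []
    for col in sensitive_cols:
        for value, group in df.groupby(col):
            rows.append({
                "feature": col,
                "value": value,
                "n": len(group),
                "accuracy": accuracy_score(group[y_true_col], group[y_pred_col]),
            })
    return pd.DataFrame(rows)

# Tiny hypothetical evaluation set with labels, predictions, and demographics.
eval_df = pd.DataFrame({
    "label":      [0, 1, 1, 0, 1, 0, 1, 1],
    "prediction": [0, 1, 0, 0, 1, 1, 1, 1],
    "sex":        ["f", "f", "f", "m", "m", "m", "m", "f"],
    "site":       ["eu", "us", "us", "eu", "eu", "us", "us", "eu"],
})
report = audit_by_group(eval_df, "label", "prediction", ["sex", "site"])
print(report.sort_values("accuracy"))  # the lowest-accuracy groups surface first
```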
[00:26:04] Unknown:
Another interesting element of trying to account for bias is how it manifests in the accuracy for a particular subgroup of the target audience as you generalize to a broader audience, and some of the ways to think about that. Is it better to accept a decrease in accuracy because you are addressing the potential bias, or is it maybe better to train specific models for the different cohorts you're targeting? I'm wondering what you see as the general approach to that problem.
[00:26:37] Unknown:
It largely depends on the downstream task. We have used all sorts of methods to solve that, from sometimes sacrificing a little bit of accuracy to address the targeted group of people, task, or feature that we care about, all the way to building a system of experts, a group of models where each 1 is specialized in a specific demographic and a specific disease, and they eventually vote on a specific output or result. Then we look at all of that, try to generalize beyond the training process, and tune the models afterwards in 1 direction or another based, again, on the downstream task. A lot of the time, explainability, trying to understand how and why the model generalizes to the level it does, can also help us mitigate some of that bias. If we can understand why the model is making decision x and not what it is supposed to do, and why it performs much better on 1 class than another, we try to understand that behavior and tune it afterwards.
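A toy version of that "group of experts" setup, one model per cohort plus a simple vote, might look like the following; it is only meant to show the shape of the approach, not any production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: a cohort id, two numeric features, and a binary label.
X = rng.normal(size=(600, 2))
cohort = rng.integers(0, 3, size=600)
y = (X[:, 0] + 0.5 * cohort + rng.normal(scale=0.5, size=600) > 1).astype(int)

# Train one "expert" per cohort on that cohort's slice of the data only.
experts = {
    c: LogisticRegression().fit(X[cohort == c], y[cohort == c])
    for c in np.unique(cohort)
}

def ensemble_predict(x):
    """Majority vote across all cohort experts for a single sample."""
    votes = [model.predict(x.reshape(1, -1))[0] for model in experts.values()]
    return int(np.round(np.mean(votes)))

print(ensemble_predict(X[0]))
```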
[00:27:50] Unknown:
From the perspective of your customers, what are some of the elements of education, or some of the typical questions they come to you with, to help them understand the risks and the reason they need to invest extra time, energy, and money into counteracting bias and ensuring the privacy preserving aspects of their AI? What elements of customer education have you found most useful and most effective?
[00:28:18] Unknown:
Very good question. A lot of the customers that come to us are already dealing with very sensitive information: health care customers, financial customers, or customers in the advertising world. Customers vary a lot; sometimes they're aware, to some level, of the importance of privacy. Then we often demonstrate to them that the model they trained on their patients' data, 1 that's only used as a classification model to predict a specific disease from X-rays, for example, can actually leak a lot of information about their patients.
We have demonstrated before that if you take a model that was trained in a centralized way without any privacy enhancing technologies, I can probably reconstruct several samples of the training data, which means these models are actually memorizing, and leaking, parts of the data. So we show what it is possible to extract from already-trained models. We show what's possible to do even after you have anonymized your data: we can re-identify patients in your dataset even if you follow the HIPAA Safe Harbor rules, for example, because we believe anonymization by itself is not sufficient.
So we educate our customers about all the possible points where data leakage can happen across the overall AI life cycle, from collecting the data to training to validation and even to inference. Say we have a model that's trained to do some type of classification task. I can probably curate specific inputs to that model to make it produce information that is biased toward the training data, and given enough runs of that inference process, I can probably extract a lot of information about the training data. So demonstrating what's possible, and how these systems leak data and sensitive information, is 1 of the best educational ways, I think, to raise awareness with customers.
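One simple way to demonstrate this kind of leakage, assuming an sklearn-style classifier with integer labels 0..K-1, is a loss-gap membership check: models that memorize their training data assign noticeably lower loss to training examples than to unseen ones. This is a generic illustration, not the specific reconstruction or re-identification techniques described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def membership_signal(model, X_members, y_members, X_nonmembers, y_nonmembers):
    """Average loss gap between unseen and training data; a large positive
    gap suggests the model has memorized (and can leak) its training set."""
    def nll(X, y):
        probs = model.predict_proba(X)[np.arange(len(y)), y]
        return -np.log(np.clip(probs, 1e-9, 1.0))  # true-class negative log-likelihood
    return float(nll(X_nonmembers, y_nonmembers).mean() - nll(X_members, y_members).mean())

# Toy demonstration: an overfit forest gives itself away on its own training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier().fit(X[:200], y[:200])          # members: first half
print(membership_signal(model, X[:200], y[:200], X[200:], y[200:]))
```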
They see that, and then they see how we can mitigate all of those bias and privacy concerns using the set of tools they are already used to working with. You don't need to be a cryptography expert. You don't need to know what federated learning or blind learning is in order to use our toolset. As a machine learning or AI practitioner, you continue using the tool stack you are used to today, and we encapsulate all of our automated methods for you. When they see that, and since enterprises today generally care about their customers, their data, and their intellectual property, it's easy to convince them.
[00:31:14] Unknown:
With that list of different privacy preserving techniques you rattled off there, federated learning, blind learning, and I know differential privacy is another 1 that's gained a lot of attention, which of the common approaches have you seen be most beneficial? I'm curious if you can give a general sense of the current state of the art, or the set of best practices, around how to think about and apply these privacy preserving aspects in the training and serving process? Yeah. Sure. So,
[00:31:44] Unknown:
first, there is no single technology that's good for all phases of the AI life cycle. So 1 of the main things we did well here at TripleBlind is optimize the life cycle and the underlying privacy enhancing technologies for each 1 of the tasks. When you start collecting your data, we have a set of tools that enable you to learn a lot about data you don't own, data that exists at other organizations, without any of that data ever leaving the host environment and without actually seeing any of the raw data. That's how you do discovery, basically blind discovery of the data: you know, hey, there is something that's useful for my application, but I don't know exactly what it is, and I cannot see it.
After that, we have a set of tools that enable you to run code privately and securely on the remote datasets. We have something called a remote Python executor, where you can ship specific Python code, or Rust code, for example, to the other parties that have the data, and that code will execute on the party's data given the owner's permission, an auditing process, etcetera. After that, the training process comes into place. You mentioned a very good term there: differential privacy is 1 great technology to use to make sure the outputs of the models do not leak information about any specific data record, and that's possible in different scenarios.
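For readers who want to see what differential privacy looks like in training, the usual recipe is DP-SGD: clip each example's gradient, add calibrated Gaussian noise, then step. A minimal NumPy sketch for logistic regression follows; a real system would use a vetted library and a proper privacy accountant rather than this toy.

```python
import numpy as np

def dp_sgd_logreg(X, y, epochs=20, lr=0.1, clip_norm=1.0,
                  noise_multiplier=1.0, batch_size=32, seed=0):
    """DP-SGD sketch: clip each example's gradient, add Gaussian noise to the
    batch sum, then step. (Privacy accounting is omitted for brevity.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            preds = 1.0 / (1.0 + np.exp(-Xb @ w))
            per_example_grads = (preds - yb)[:, None] * Xb            # shape (B, d)
            # Clip each example's gradient to norm at most clip_norm.
            norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
            per_example_grads /= np.maximum(1.0, norms / clip_norm)
            # Sum, add noise calibrated to the clipping bound, then average.
            noisy_sum = per_example_grads.sum(axis=0) + rng.normal(
                scale=noise_multiplier * clip_norm, size=d)
            w -= lr * noisy_sum / len(batch)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
print(dp_sgd_logreg(X, y))
```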
And today at TripleBlind we enable differential privacy for training or fine tuning of large language models as well, without a large hit in accuracy. Another great approach, 1 that is cooler than federated learning, is our own method called blind learning. Blind learning basically allows you to train a shared or global model from distributed datasets that may exist at different branches of an enterprise, at different locations around the world, again without ever having to ship the data to a centralized location.
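Blind learning itself is TripleBlind's own method and is not reproduced here; as a point of reference, the plain federated-averaging baseline it is being compared against can be sketched as a single-process simulation like this (toy data, hypothetical "clients").

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """One client's local training on its own data (simple logistic regression)."""
    w = w.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

def federated_averaging(clients, rounds=20, dim=3):
    """Each round: every client trains locally, then only the model weights
    (never the raw data) are averaged into the shared global model."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        local_models = [local_sgd(global_w, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        global_w = np.average(local_models, axis=0, weights=sizes)
    return global_w

# Toy simulation of three "hospitals" holding disjoint slices of data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    y = (X @ true_w + rng.normal(scale=0.3, size=200) > 0).astype(float)
    clients.append((X, y))
print(federated_averaging(clients))
```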
And the great thing about blind learning is that it is about 2 to 3 times more computationally efficient than federated learning. It can also lead to better-performing results and better accuracy than federated learning because of the way it trains. So now we have covered our bases in finding the proper datasets, executing code at remote sites or remote machines, and using differential privacy and federated learning to train on distributed data. After that, you can validate all of these models and then deploy them. When we deploy these models, we also use something called secure multiparty computation.
Secure multiparty computation allows model owners to preserve the privacy of their models, because they usually spend time and effort training them, so the parameters or weights of those models are considered IP of the company that created them. Then there's the user of the model, who wants to utilize it and run some tests on it but doesn't want to share their data with the host of the model or the model owner. So we run these inferences in a completely private way using secure multiparty computation, which is a privacy enhancing technology that enables joint computations without revealing the data to any of the involved parties. It's a very cool approach as well.
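The core trick behind secure multiparty computation can be shown with additive secret sharing: each party splits its private value into random shares, only shares are exchanged, and yet the sum is recovered exactly. Real MPC inference over model weights is far more involved; this is just the smallest illustrative case.

```python
import secrets

PRIME = 2_147_483_647  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split a private integer into n random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three parties each hold a private value; nobody reveals theirs.
private_values = [12, 30, 7]
all_shares = [share(v, 3) for v in private_values]

# Party i locally adds up the i-th share of every value...
partial_sums = [sum(shares_for_party) % PRIME
                for shares_for_party in zip(*all_shares)]

# ...and only these partial sums are combined, revealing just the total.
print(reconstruct(partial_sums))  # 49, with no party seeing another's input
```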
So, yeah, every phase of the life cycle has a good privacy enhancing technology, and I think we have a collection of great, efficient tools that are very effective at this.
[00:35:43] Unknown:
And in your work at TripleBlind and as an ML practitioner, what are the most interesting or innovative or unexpected ways that you've seen people addressing this challenge of data privacy and bias in their machine learning and AI applications?
[00:35:58] Unknown:
1 of the coolest things I've seen is how blind learning, our in-house algorithm, has been and is being used today in real-world applications to train on patients' data and actually improve health care outcomes. That has been really exciting to me. We came up with this blind learning approach that enables you to train algorithms on decentralized data without ever seeing or accessing it; you don't have to ship the data outside your infrastructure. And the best thing about it is that it's very easy to use. Our partners, physicians from some health care providers who are building algorithms to predict specific or rare diseases, are able to run this blind learning approach in fewer than 20 lines of code. So it has been super exciting to observe these methods actually being used in the real world.
And even though these methods bank largely on preserving the privacy of the training data, they're still able to produce accurate, high-performing models that are on par with models trained when the data is pooled in a centralized location. So that has been very exciting for me to see happening in the real world.
[00:37:24] Unknown:
And in your work, what are the most interesting or unexpected or challenging lessons that you've learned in the context of privacy and bias in AI?
[00:37:33] Unknown:
1 of the unexpected things that is still surprising to me is that the current regulations are lagging behind, big time, in providing good safeguards and good guidelines for practitioners, enterprises, etcetera, to preserve the privacy of the people whose data they are using. Take HIPAA, for example: today there are 2 ways to make sure you are HIPAA compliant when you are using AI with patients' data. The first 1 is called Safe Harbor, and it basically works through data anonymization: you only need to delete about 18 personal identifiers from the dataset, so you delete the patient's first name, last name, phone number, email, IP address, etcetera.
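Mechanically, the Safe Harbor route amounts to dropping or coarsening those identifier columns before use; a crude sketch with hypothetical column names is below, and, as the next point argues, this kind of anonymization on its own does not guarantee privacy.

```python
import pandas as pd

# Hypothetical subset of HIPAA Safe Harbor identifier columns in a patient table.
IDENTIFIER_COLUMNS = [
    "first_name", "last_name", "phone", "email", "ip_address",
    "ssn", "medical_record_number", "street_address",
]

def naive_safe_harbor(df: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers and coarsen quasi-identifiers.
    NOTE: this is the 'anonymization' being criticized below; on its own
    it does not prevent re-identification."""
    out = df.drop(columns=[c for c in IDENTIFIER_COLUMNS if c in df.columns])
    if "date_of_birth" in out.columns:
        # Keep only the birth year (Safe Harbor also has rules for ages over 89,
        # omitted here for brevity).
        out["birth_year"] = pd.to_datetime(out.pop("date_of_birth")).dt.year
    if "zip_code" in out.columns:
        out["zip3"] = out.pop("zip_code").astype(str).str[:3]  # truncated ZIP code
    return out

patients = pd.DataFrame({
    "first_name": ["Ada"], "zip_code": ["64108"],
    "date_of_birth": ["1984-07-01"], "diagnosis": ["E11.9"],
})
print(naive_safe_harbor(patients))
```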
But we know for a fact that data anonymization isn't, and that's actually a quote from Cynthia Dwork, who coined the term differential privacy that we talked about earlier. So I'm still a little bit surprised at how the regulations are lagging behind, and at how many enterprises out there today are able to get by just doing data anonymization, which is not private and not protective of patients' privacy. Privacy is a fundamental human right, so that's basically not respecting it. But there are podcasts like yours, and our recent calls, and people are becoming more aware of this problem and more educated. Hopefully we can make some significant positive changes to the technology we have today in order to preserve our privacy.
[00:39:33] Unknown:
And for people who are trying to address these challenges of bias and privacy in their machine learning models, what are the cases where TripleBlind is the wrong choice and maybe they need to build their own in-house solutions?
[00:39:47] Unknown:
That's a good question. Usually, building your own solution is a very complicated and tough process; it's not easy. Today, 1 of our best competitive advantages is that we have a great team: world-class cryptographers and AI practitioners who understand this technology at a very deep technical level. For example, our chief technology officer is Craig Gentry, and he's basically the inventor of fully homomorphic encryption. You've probably heard of that, and it's widely used in so many companies around the globe. So we have great experience in this domain.
I think the only reason you would build your own is probably to save some cost, and that will come back to bite you because you've probably taken some shortcuts. So you probably don't want to do that. If you are good at 1 thing, it's better to stick to it and let the experts deal with the privacy work.
[00:40:56] Unknown:
And as you continue to build and iterate on the TripleBlind product and stay up to date with what's happening in the space, what do you have planned for the future of TripleBlind?
[00:41:10] Unknown:
Yeah. We have a couple of great projects that have been going for some time now, where we have achieved some significant results. For example, we are working on a project called the privacy monitor. It will allow individual users of ChatGPT, for example, and enterprises, to know when they are sending sensitive information. It's going to notify them, and the privacy monitor works automatically. If you're sending a question to ChatGPT that contains sensitive information, our privacy monitor can detect that and warn you that you are sending that piece of information. Even if there are no specific terms in the actual message, your name, etcetera, it can tell you, hey,
this idea is very similar to a patent document you have on your machine, and it could leak some of your intellectual property. So this is very exciting. It'll allow us to have a little more confidence that we are not leaking too much information whenever we interact with these large language models, especially for enterprises or doctors. You can use ChatGPT, for example, to summarize a patient note, but we make sure that whatever is sent from the patient note to ChatGPT does not contain any PII, personally identifiable information. So the privacy monitor is a very powerful tool that I'm very excited about, and we should be able to address a lot of the privacy concerns here in the very near future.
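The privacy monitor is TripleBlind's own product, so the sketch below is only meant to convey the general shape of the idea: screen a prompt locally before it is ever sent to a language model. Real detection of something like "this resembles your patent" requires semantic matching, not the simple regex pass shown here.

```python
import re

# Simple patterns for a few obvious kinds of PII (illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def screen_prompt(prompt: str):
    """Return the list of PII types found; warn the user before the prompt
    ever leaves their machine."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt)]

prompt = "Summarize this note for patient jane.doe@example.com, SSN 123-45-6789."
findings = screen_prompt(prompt)
if findings:
    print(f"Warning: prompt appears to contain {', '.join(findings)} - not sending.")
```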
Differential privacy is another aspect: how we can fine tune these large language models in a privacy preserving way. We also have a private retrieval augmented generation tool. We talked about the knowledge cutoff of these large language models; this tool will enable you to append context to them in a very cheap way, without having to fine tune them. So these are some of the main ideas today, and a lot of them are targeted, again, at the entire AI life cycle, and specifically today a little bit more toward prompting and inference as well.
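Retrieval-augmented generation in its simplest form: embed the question, pull the closest local documents, and prepend them to the prompt so the model gets fresh context without fine tuning. The sketch below uses TF-IDF similarity so it runs anywhere; a production (and privacy preserving) version would use a proper embedding model and keep retrieval on the data owner's side.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Policy update 2024: all patient exports now require privacy review.",
    "The cafeteria menu changes every Tuesday.",
    "Model cards must list training-data sources and known bias evaluations.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(question: str, k: int = 2):
    """Return the k documents most similar to the question (TF-IDF rows are
    L2-normalized, so the dot product is cosine similarity)."""
    q = vectorizer.transform([question])
    scores = (doc_vectors @ q.T).toarray().ravel()
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "What do I need to do before exporting patient data?"
context = "\n".join(retrieve(question))
prompt = f"Use only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt is what would be sent to the language model
```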
[00:43:25] Unknown:
Are there any other aspects of the work that you're doing at TripleBlind, or the overall space of bias and data privacy, particularly in the context of generative AI, that we didn't discuss yet that you would like to cover before we close out the show?
[00:43:39] Unknown:
I think we touched on some of the most important topics. Again, I think this technology is really great, with great potential for positive impact in the future. Perhaps before we get there it's going to be a little bit difficult; there are a lot of nuances and a lot of unknowns as well. But overall I think it will have a very positive impact. We should be careful about how we deploy and train these systems, but we shouldn't be too worried either. We don't want to paralyze our advancement, the way we do the research, or the way we train these models. We need to just continue pushing forward and not do the opposite; we should not slow down.
I think that's an important thing to keep in mind.
[00:44:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:44:53] Unknown:
That's a great question. I believe you have probably trained some AI models and played with some tools. But unfortunately, even for a machine learning or AI expert like myself, I cannot just wake up in the middle of the night and say, hey, I would like to create models that predict diabetes or Alzheimer's before they happen, because I do not have access to such data. Data being locked behind the doors of regulations and privacy concerns, which is important and justifiable, still makes it very difficult to move faster in this domain and create things. So the lack of access to data, or of privacy preserving ways to access that data, is, I believe, 1 of the biggest barriers to entering this domain and creating useful machine learning models. Hopefully, at TripleBlind, we are facilitating that and removing a lot of the barriers.
[00:45:48] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at TripleBlind, and your perspective and context on the challenges of bias and privacy for these AI projects that we are all interacting with and hearing about all the time. I appreciate your time, and I hope you enjoy the rest of your day. Thanks for having me on the show, and I
[00:46:11] Unknown:
enjoyed answering your wonderful questions. Thanks a lot.
[00:46:19] Unknown:
Thank you for listening, and don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Generative AI: Risks and Capabilities
Factors Contributing to AI Shortcomings
Manifestation and Impact of Bias in AI
User Interaction Data and Privacy Risks
Future Impact and Trust in AI
Incorporating Awareness of Bias in ML Practice
Identifying and Mitigating Bias
Accuracy vs. Bias in AI Models
Customer Education on Privacy and Bias
Privacy Preserving Techniques in AI
Innovative Approaches to Privacy and Bias
Future Projects and Developments
Closing Thoughts on AI and Privacy
Biggest Barrier to ML Adoption
Conclusion and Additional Resources