Summary
Machine learning and generative AI systems have produced truly impressive capabilities. Unfortunately, many of these applications are not designed with the privacy of end-users in mind. TripleBlind is a platform focused on embedding privacy preserving techniques in the machine learning process to produce more user-friendly AI products. In this episode Gharib Gharibi explains how the current generation of applications can be susceptible to leaking user data and how to counteract those trends.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Your host is Tobias Macey and today I'm interviewing Gharib Gharibi about the challenges of bias and data privacy in generative AI models
- Introduction
- How did you get involved in machine learning?
- Generative AI has been gaining a lot of attention and speculation about its impact. What are some of the risks that these capabilities pose?
- What are the main contributing factors to their existing shortcomings?
- What are some of the subtle ways that bias in the source data can manifest?
- In addition to inaccurate results, there is also a question of how user interactions might be re-purposed and potential impacts on data and personal privacy. What are the main sources of risk?
- With the massive attention that generative AI has created and the perspectives that are being shaped by it, how do you see that impacting the general perception of other implementations of AI/ML?
- How can ML practitioners improve and convey the trustworthiness of their models to end users?
- What are the risks for the industry if generative models fall out of favor with the public?
- How does your work at Tripleblind help to encourage a conscientious approach to AI?
- What are the most interesting, innovative, or unexpected ways that you have seen data privacy addressed in AI applications?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on privacy in AI?
- When is TripleBlind the wrong choice?
- What do you have planned for the future of TripleBlind?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- TripleBlind
- ImageNet Geoffrey Hinton Paper
- BERT language model
- Generative AI
- GPT == Generative Pre-trained Transformer
- HIPAA Safe Harbor Rules
- Federated Learning
- Differential Privacy
- Homomorphic Encryption
[00:00:10]
Unknown:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning.
[00:00:19] Unknown:
Your host is Tobias Macey, and today I'm interviewing Gharib Gharibi about the challenges of bias and data privacy in generative AI models. So, Gharib, can you start by introducing yourself?
[00:00:29] Unknown:
Hello, Tobias. Thanks for having me. Excited to talk with you about this wonderful topic. My name is Gharib, and my interests fall at the intersection of AI, privacy, and systems. Today, I serve as the director of applied research and head of AI at TripleBlind, which is a startup focused on protecting people's privacy. We do so by building a wide range of privacy-preserving tools for AI, machine learning, and data analytics.
[00:01:05] Unknown:
And do you remember how you first got started working in machine learning?
[00:01:09] Unknown:
Yes. So, basically, when I grew up, we had a desktop in our home, and I always thought it would be really exciting to be able to have conversations with my computer. Growing up, I knew that in order to do that, I needed to be able to talk to the computer and then make it do things for me or assist me with things. So, in order to be able to talk to the computer, I pursued a master's degree in computer science at the University of Missouri-Kansas City, focused on software development. After that, I went into my PhD program, and I started focusing on deep learning. I think what really made me go into deep learning was the 2012 paper from Geoffrey Hinton's team that worked on the ImageNet competition; their results showed that this thing called neural networks with convolutional layers can actually achieve very good results.
I got very interested in this domain and studied deep learning in my PhD. Towards the end of it, I started exploring privacy-preserving AI, and here I am today.
[00:02:17] Unknown:
Generative AI in particular has been gaining a lot of attention because of the different breakthroughs and the level of sophistication and capability that it has reached recently with the successive generations of GPT-based models. I'm wondering if you can talk to some of the risks that the capabilities of these GPT models pose, particularly in the context of data privacy and bias?
[00:02:43] Unknown:
Yeah, sure. That's actually a great question. To me, GPT, or generative AI in general, is basically a general-purpose technology. Similar to all the previous technologies that we had, these technologies are usually double-edged swords. They have a lot of potential to do great things, but they can also cause some harm. Some of the capabilities of ChatGPT, for example, can be used to amplify specific agendas or propaganda, because it's very good at generating text, essays, or blogs easily, very cheaply, and with very convincing arguments.
So that can also be used to generate fake news, right? And GPT models today are multimodal, and in the future they could be producing videos. Actually, there are already generative AI models that can take a short text prompt and generate a video out of it, so deepfakes are another big problem, for example. So the capabilities of these models can have very serious implications, and if these capabilities fall into the wrong hands of a Dr. Evil, they could cause some serious harm. But in the big picture and over the long run, I believe that the benefits of these systems are going to offset their issues.
[00:04:11] Unknown:
With generative AI, there's a lot of excitement about, oh, they can do anything. I can tell it to do whatever I want, and it'll give me a reasonable-sounding answer. But the problem there is that it's reasonable-sounding without necessarily being actually correct. I'm curious what you see as the contributing factors to the shortcomings of these generative models and some of the inherent risk of people's level of trust that is being built up, with that trust not necessarily being well founded?
[00:04:45] Unknown:
Yes, that's another great question. I believe these issues stem from 3 major factors: the data, the algorithms, and the oversight when it comes to training these algorithms. First, I usually loosely define AI algorithms, or AI systems, as programs that generalize from data. Therefore, if the data that we collected was biased in one way or another, the programs that generalize from this data are also going to amplify that bias. The second thing is the algorithms themselves. Despite having very sophisticated ways to train and optimize models and reinforce good behaviors in these models, the underlying algorithms still have some serious problems.
For example, large language models, as you mentioned, do hallucinate. They cannot provide very factual answers; they're not good at retrieving factual information, and they're not good at citing sources. As a matter of fact, BERT, one of the models that is very common, still believes that the president of the United States is Barack Obama, and we all know that ChatGPT's knowledge cutoff is in 2021. So these large language models still have some serious issues when it comes to the algorithms themselves. And finally, the oversight when it comes to training these models.
Actually, the lack of oversight in the training. We still don't really understand how these models generalize and why they generalize so well. Therefore, it's hard to put safeguards in place to make sure all of their answers are correct and always consistent and that they can provide citations. So I believe the data, the algorithms, and the oversight, or the lack of oversight, are the major reasons for hallucinations, knowledge cutoffs, and not being able to cite their references. And as you said, they're very good at generating very convincing arguments, right?
[00:06:53] Unknown:
Yeah. All you have to do is exude confidence that your answers are correct regardless of their actual factual basis. Yeah. And to the point of bias in the underlying data that's used to build these models, that is not necessarily something that is exhibited in every interaction, and it can even be very subtle and hidden. And I'm curious, what are some of the ways that that inherent bias in the underlying data can manifest and some of the ways that that bias can have a negative external impact on the users or the operators of that product?
[00:07:32] Unknown:
Yeah, that's another great question. I believe one example of the subtle manifestation of bias in the training data is something I refer to, or what's known, as semantic bias. In semantic bias, basically, when a specific term appears in the training data frequently associated with negative context, the AI algorithm is going to assign that negative context or negative meaning to that word. So if you have a dataset that's always associating the word immigrant with a bad context, then the AI is going to learn a representation of that word as being negative. That's what I refer to as semantic bias.
Then there is absence bias. A lot of times, we are training on a specific dataset to detect some disease, or to learn a new style of saying or doing things, or to teach the LLM some specialized task. If our dataset does not include other types of arguments or other views on the same topic or the same point, that LLM is not going to understand the full argument and won't be able to really capture information from different perspectives. That causes some bias towards specific points of view in the future, etcetera.
Then there is the known bias that we can see in the data. If the data is collected from specific geographical areas or from specific respondents, if we have surveys and we are collecting them from specific groups that have specific political views, then that data is obviously biased towards some specific political perspective. There's also gender bias, whether it's because of historical reasons or because the language itself is gendered, for example, etcetera. So there are all types of bias: bias that could be hidden in the language itself, such as semantic bias; datasets that are not representative enough; and stronger manifestations of bias from historical reasons and some stereotypes as well. And in addition to
[00:09:58] Unknown:
the inherent bias of the source datasets, the ways that that exhibits in the generated results, and the potential inaccuracies because of lack of real-world context or actual knowledge, the other potential risk of dealing with these generative AI systems is the opacity of how the user interaction data is actually going to be used, or whether it's going to be used at all. I'm wondering what are some of the ways that the application or collection of those interactions might also be some measure of exposure to risk for the people who are interacting with these systems.
[00:10:35] Unknown:
Yeah, of course. I think that could have significant risks, right? I believe, historically, our browsing data has been a sensitive topic, but I believe today our chat history with ChatGPT, for example, is probably way more sensitive than our browsing history. We're asking a lot of questions that could demonstrate our incompetency at work, for example. We are asking questions to write a message or a letter to our significant other, and maybe we are not being very transparent that it was generated by AI. And if that piece of information was somehow leaked unintentionally, it can cause some serious harm.
So, yeah, data today is collected in different ways. Basically, every single piece of data from our interaction with these systems is collected, and there are unfortunately no very transparent ways for us to validate how this data is being used, where it is being kept, and who has access to it. Therefore, we cannot really fully understand and comprehend the possible risks. Some providers offer options to opt out of having your data collected and used, but that data is still kept for 30 days somewhere on some servers. I do believe the promises that these providers make that they're not going to use my data, but accidents happen and honest mistakes happen. We have seen incidents already where some people's data from OpenAI was leaked, including some credit card information.
A couple of weeks ago, researchers on a Microsoft AI team accidentally published tons of data about people, including private conversations, on GitHub because of a file that was misconfigured. So accidents do happen, and, therefore, promises by themselves are not enough. That's why I'm a big advocate for, and interested in, providing privacy-preserving methods to do these things beyond the promises. So the risks here could be very serious. We need to be able to know where this data is collected, how it is being used, and who has access to it. This data could also be repurposed, as you mentioned. You're not using my data to retrain your system, but maybe you are using it for targeted advertisement, for example. So it can have very serious implications.
[00:13:17] Unknown:
And then the other element of risk for generative AI and these aspects of bias and inaccuracy is the potential future impact of lack of trust because of the negative interactions that people might have, even as these systems do improve and maybe address some of these underlying sources of bias or inaccuracy. That can influence potential future investment in developing or improving those technologies, where if there is enough of a backlash, then those systems might just need to be shut down completely. There are also issues with the copyright status of some of the source data that's used for training, my understanding is for OpenAI in particular, but that's a potential legal gray area. I'm curious what you see as the risk to the industry and the people who are working on these technologies as a result of potential violations of trust because of these biases or inaccuracies?
[00:14:16] Unknown:
Yes. I believe that eventually we are hopefully going to solve these problems at large, and we should be able to build and create systems that are trustworthy. These systems are supposed to preserve our privacy whenever we interact with them. However, today, we're still not able to fully comprehend the abilities of these systems, and therefore it's still really difficult for us to safeguard them and understand the ramifications of this technology. Today, there is a very positive attitude towards this AI. I have never personally met anyone who tried ChatGPT, or Midjourney, for example, to generate images, and said, hey, this is very dangerous, this is not good, I'm not going to use it. So it seems like the general public today, when they look at this technology, they like it, and they believe that it's going to actually benefit humanity.
Maybe one great thing that OpenAI did is that it exposed the system to the public for us to start understanding the abilities of these AI systems and what actually could go wrong. So today, things still seem to be under control, but god forbid some major accident happens to these systems, it could really change the attitude towards them. If something goes wrong, if a major hacking incident leaks all of people's conversations with ChatGPT, for example, that could lead to some serious issues with adopting this technology.
And if you look back in history, some such accidents have drastically changed the roadmap and the trajectory of some of the technologies that we had. Nuclear energy is a very good example that comes to my mind. It emerged as this limitless source of clean energy that's very good for humanity, the environment, and everyone, and it was supposed to be very cheap as well. Unfortunately, a couple of incidents, Chernobyl in the eighties, and I think, if I'm not wrong, Three Mile Island here in the United States before that, and Fukushima in Japan, drastically affected people's attitude towards that technology. Today, the term nuclear is attached to all of these incidents that happened in the past, and that resulted in drastically stopping the investments and research there, and people and students were not interested in studying that field anymore, etcetera.
So it would take something as big as Fukushima happening to AI to probably change the overall sentiment and attitude towards it. But it's also a little bit different here, because, yes, people's privacy is very important, but it's not people's lives that will probably be affected if a big hack happens to ChatGPT, for example, or OpenAI's servers. That's a little bit different. So even if a big incident like that happens, I don't think it's going to stop progress completely; maybe it's going to slow it down a little bit, specifically after we have just seen the tip of the iceberg of what this technology is capable of doing.
So there might be an incident here and there, and some tax that we have to pay for these technologies, but I think it's going to continue growing.
[00:18:03] Unknown:
From the perspective of the practitioners, the people building these AI systems and applying ML technologies, what are some of the ways that they need to be thinking about incorporating awareness of, and counteraction of, bias into their work? And some of the ways that they need to be thinking about the presentation of their results, in order to convey the appropriate level of confidence so that there isn't this blind trust in these systems? I'm just wondering about the overall scope of how these risks need to be incorporated into the actual work of ML practitioners to ensure that they are building systems that are beneficial to the end users.
[00:18:48] Unknown:
That's a very good question. Simply speaking, I think they should use a system like TripleBlind; that's the startup that I work for. The reason I say that is that I believe ML practitioners should be careful when they are creating these AI systems and training on the data. You might be aware that the AI life cycle is very different from the traditional software development life cycle. You need to be careful about what data you collect and where you collect it from. You need to be careful when you curate the data, organize it, clean it, and prepare it. You need to be careful about what algorithms and privacy-preserving algorithms you use to train the model, and we all know that model training is not a one-time process; we retrain it again and again until it reaches a specific performance point. Then we take that model, deploy it, and that model starts being used and generating inferences, for example.
Even then, we need to make sure that these inferences are not leaking any information about the training data. Then we want to monitor the behavior of the model so that it's not shifting over time, because usually people's behaviors and the world we are living in shift, so these AI models drift in their performance as well. We need to monitor that. So it's a very complicated process, and that's why I believe a very easy-to-use system is necessary here. AI practitioners need to make sure that their data covers all possible scenarios of whatever downstream task they are working on. You need to collect the data from multiple sources, not from a single source.
I recall here a study that came out of Oxford in 2021 that examined more than 500 machine learning and deep learning models, published in reputable journals and conferences, that were trained to detect COVID-19 from chest X-rays and other electronic health records. That study demonstrated that almost every single one of these algorithms was fatally flawed, and they failed big time when these models were exposed to new types of data that came from a different distribution, from different people, from different geographical locations. So we need to make sure that we are collecting samples that are representative of all possible scenarios. We want to make sure that we are using robust training algorithms that can lead to models that generalize well and do not overfit.
When we are exposing these models, we want to make sure that we are preserving the privacy of the training data by making sure that the output of the models' predictions cannot be used to leak any data. Finally, we need to be transparent about all of this process, end to end. Privacy should not be an after-the-fact thing. We actually operate at TripleBlind with a concept called privacy by design, and today it's well known in academia and industry. So privacy should be integrated into the entire life cycle. Data collection is important: make sure that we have samples that cover all possible scenarios.
Make sure we are using algorithms that generalize well and do not lead to overfitting. Even when we deploy the models, make sure that we're using techniques to preserve the privacy of the data that's coming into inference, and techniques to make sure that the output of the inference is not leaking any data as well. And then transparency: being open about discussing it and collaborating with privacy professionals and scientists to make sure that our methods are correct and are actually not leaking any sensitive information.
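That concern about models failing on data from a different distribution can be checked directly. Here is a minimal sketch of holding out an entire site and comparing performance; the file name, site labels, and column names are hypothetical placeholders, not anything from the study or from TripleBlind's tooling:

```python
# Minimal sketch: train on one site's data, evaluate on another site's data to
# check whether the model generalizes beyond its source distribution.
# The file name and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("chest_xray_features.csv")  # hypothetical tabular feature set

train_df = df[df["site"] == "hospital_a"]
holdout_df = df[df["site"] == "hospital_b"]  # different population / scanner / protocol

features = [c for c in df.columns if c not in ("site", "label")]
model = LogisticRegression(max_iter=1000)
model.fit(train_df[features], train_df["label"])

in_dist_auc = roc_auc_score(train_df["label"], model.predict_proba(train_df[features])[:, 1])
out_dist_auc = roc_auc_score(holdout_df["label"], model.predict_proba(holdout_df[features])[:, 1])
print(f"AUC on training site: {in_dist_auc:.3f}")
print(f"AUC on held-out site: {out_dist_auc:.3f}")  # a large gap signals poor generalization
```

A large gap between the two numbers is the same failure mode the Oxford study found: the model looks fine on data like its training set and falls apart on a new population.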
[00:22:49] Unknown:
In the context of ensuring that you cover your bases with regards to bias in the training process, one of the challenges there is that bias is typically something that you're blind to. So I'm curious how you, your team, and practitioners in general can think about identifying potential sources of bias, and some of the ways to think about coverage of bias within a given problem domain that you're trying to solve for? Yeah. Well, it's not an easy process. And generally speaking,
[00:23:24] Unknown:
we cannot really build, or at least it's a very difficult problem to build, a system that is not biased for every possible user, right? Specifically if we are talking about LLMs, for example, you can chat with them about different ideas, different ideologies, etcetera. So it's very difficult to build a system that is completely unbiased toward everyone. But usually, when we work with applications where we can manage and mitigate that, such as medical applications that are based on AI and machine learning systems, we make sure, again, that we have the necessary tools for hospitals, physicians, and machine learning practitioners to access data from around the globe, for example, without having to go through a very tedious process of signing legal terms to be able to exchange data, etcetera, because our system today is HIPAA compliant, GDPR compliant, and so on. It reduces the time it traditionally takes to run these experiments and to obtain data from the European Union, for example, from six months or a year down to a couple of hours using our system.
That's basically one of the biggest enablers to mitigate bias as much as possible: increasing the sources of the data and the distribution of the data. And going back to when we discussed some of the subtle ways that bias can manifest in the data, we make sure that we are covering our bases. If a specific disease is affected by factors x, y, and z, are factors x, y, and z covered well enough in our training data? If not, why not? Are factors x and y undersampled, so that we don't have enough samples? Does the training data have too much representation of factor x? We then try to undersample some specific classes or oversample other classes to make sure that everything is represented as equally as possible.
And then we double down on the validation and evaluation process as well. We have an automated tool that helps researchers, for example, test across every characteristic in the dataset how biased a model is towards a specific feature or column in tabular data. So there are several approaches, from collecting as much data from as many possible sources to rigorous evaluation and validation methods.
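To make that kind of automated audit concrete, here is a generic sketch, not TripleBlind's tool, of looping over the categorical columns of a tabular test set and comparing a fitted model's accuracy for each subgroup; the column names in the usage example are hypothetical:

```python
# Generic sketch of a subgroup bias audit for a fitted classifier on tabular data.
import pandas as pd
from sklearn.metrics import accuracy_score

def audit_subgroups(model, df: pd.DataFrame, feature_cols, group_cols, label_col="label"):
    """Report per-subgroup accuracy for every categorical column in group_cols."""
    df = df.assign(_pred=model.predict(df[feature_cols]))
    overall = accuracy_score(df[label_col], df["_pred"])
    rows = []
    for col in group_cols:
        for value, group in df.groupby(col):
            acc = accuracy_score(group[label_col], group["_pred"])
            rows.append({"column": col, "group": value, "n": len(group),
                         "accuracy": acc, "gap_vs_overall": acc - overall})
    return pd.DataFrame(rows).sort_values("gap_vs_overall")

# Hypothetical usage: "sex", "age_band", and "site" are placeholder column names.
# report = audit_subgroups(model, test_df, feature_cols, ["sex", "age_band", "site"])
# print(report)
```

Subgroups with a large negative gap and a reasonable sample size are the ones worth investigating first, whether by collecting more data for them or by rebalancing the training set.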
[00:26:04] Unknown:
Another interesting element of trying to account for bias is how that manifests as far as the accuracy for a particular subgroup of the target audience as you generalize to a broader set of audiences, and some of the ways to be thinking about that. Is it better to accept a decrease in accuracy because you are addressing the potential bias? Or is it maybe better to train specific models for the different cohorts that you're targeting? I'm wondering what you see as the general approach to that problem.
[00:26:37] Unknown:
It largely depends on the downstream task. We have used all possible approaches to solve that, from sometimes sacrificing a little bit of accuracy to address the actual targeted group of people, or the task or feature that we care about, all the way to building a system of experts, a group of models, each one specialized in a specific demographic and specific disease, which then eventually vote on a specific output or result. We can look at all of that, try to generalize beyond the training process, and tune the models afterwards in one direction or another based, again, on the downstream task. A lot of times, explainability, trying to understand how and why a model generalizes to a certain level, can also help us mitigate some of that bias. If you can understand why the model is making decision x instead of what it's supposed to do, and why it's performing much better on one class than another, we try to understand that behavior and tune it afterwards.
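One simple way to realize that "system of experts" idea, purely as an illustrative sketch with hypothetical cohort and column names, is to train one model per cohort and soft-vote across them:

```python
# Illustrative sketch of a "group of experts": one model per cohort plus a
# soft-voting ensemble. Assumes every cohort's data contains all classes.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_experts(df: pd.DataFrame, feature_cols, cohort_col, label_col):
    experts = {}
    for cohort, group in df.groupby(cohort_col):
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(group[feature_cols], group[label_col])
        experts[cohort] = clf
    return experts

def ensemble_predict(experts, X):
    # Average the probability estimates of all experts (soft voting).
    probs = np.mean([clf.predict_proba(X) for clf in experts.values()], axis=0)
    return probs.argmax(axis=1)

# Hypothetical usage:
# experts = train_experts(train_df, feature_cols, "cohort", "label")
# preds = ensemble_predict(experts, test_df[feature_cols])
```

Whether the ensemble or a single rebalanced model wins is, as noted above, an empirical question that depends on the downstream task.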
[00:27:50] Unknown:
From the perspective of customers of yours, I'm wondering what are some of the elements of education or some of the typical questions that they come to you with to help understand the risks, the reason that they need to invest extra time and energy and money into counteracting bias, ensuring the privacy preserving aspects of their AI, just some of the elements of customer education that you have found most useful and most effective?
[00:28:18] Unknown:
Very good question. A lot of the customers that come to us are already dealing with very sensitive information: healthcare customers, financial customers, or customers in the advertising world. And the customers vary a lot. Sometimes they're aware, to some level, of the importance of privacy. A lot of times, we then demonstrate to them that the model they trained on their patients' data, which is only used as a classification model to predict a specific disease from X-rays, for example, can actually leak a lot of information about their patients.
We have demonstrated before that if you take a model that was trained in a centralized way without any privacy-enhancing technologies, I can probably reconstruct several samples of the training data, which means that these models are actually memorizing parts of the data and leaking parts of the data. So we show what is possible to extract from already-trained models. We show what's possible to do after you have anonymized your data, that we can re-identify patients in your dataset even if you followed the HIPAA Safe Harbor privacy rules, for example, because we believe anonymization by itself is not sufficient.
So we educate our customers about all the possible points where data leakage can happen in the overall AI life cycle, from collecting the data, to training, to validation, even to inference. Say we have a model that's trained to do some type of classification task. I can probably craft specific inputs to that model to make it produce information that is biased toward the training data, and given enough runs of that inference process, I can probably extract a lot of information about the training data. So demonstrating what's possible and how these systems leak data and sensitive information is, I think, one of the best educational ways to raise awareness with customers.
They see that, and then they see how we can mitigate all of those bias and privacy concerns using the set of tools that they are used to working with. You don't need to be a cryptography expert. You don't need to know what federated learning is or what blind learning is in order to be able to use our toolset. As a machine learning or AI practitioner, you basically continue using the tool stack that you are used to today, and we encapsulate all of our automated methods for you. When they see that, and given that enterprises today care about their customers, their data, and their intellectual property, it's usually easy to convince them.
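The kind of leakage being described, that a model's predictions reveal something about its training data, can be shown with a very simple confidence-threshold membership inference test. This is a generic teaching sketch on a public dataset, not one of TripleBlind's demonstrations:

```python
# Sketch of confidence-based membership inference: an overfit model tends to be
# more confident on records it was trained on, so unusually high confidence on a
# candidate record is (weak) evidence that the record was in the training set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deliberately let the model overfit so the memorization is visible.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def max_confidence(clf, X):
    return clf.predict_proba(X).max(axis=1)

threshold = 0.95
member_rate = np.mean(max_confidence(model, X_train) > threshold)     # training records
non_member_rate = np.mean(max_confidence(model, X_test) > threshold)  # unseen records
print(f"members flagged:     {member_rate:.2f}")
print(f"non-members flagged: {non_member_rate:.2f}")
# A large gap between the two rates means the model leaks membership information.
```

Real attacks are more sophisticated (shadow models, calibrated thresholds, reconstruction), but even this gap is usually enough to show that "the model only returns predictions" is not a privacy guarantee.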
[00:31:14] Unknown:
With that list of different privacy-preserving techniques that you rattled off there, federated learning and blind learning, I know differential privacy is another one that's gained a lot of attention. What are the common approaches that you've seen as most beneficial? I'm curious if you can give a general sense of the current state of the art, or set of best practices, around how to think about and apply these privacy-preserving aspects in the training and serving process? Yeah. Sure. So,
[00:31:44] Unknown:
first, there's no single technology that's good for all the phases of the AI life cycle. So one of the main things that we did well here at TripleBlind is that we optimized the life cycle and the underlying privacy-enhancing technologies for each one of the tasks. When you start collecting your data, we have a set of tools that enable you to learn a lot of information about data that you don't own, that exists at other organizations, without that data ever leaving the host environment and without actually seeing any of the raw data. That's how you do discovery, and that's basically blind discovery for the data. You know that, hey, there is something that's useful for my application, but I don't know what it is, and I cannot see it.
After that, we have a set of tools that enable you to run code privately and securely on the remote datasets. We have something called a remote Python executor, where you can ship specific Python code, or Rust code, for example, to the other parties that have the data, and that code will execute on that party's data given the owner's permission, an auditing process, etcetera. After that, the training process comes into place. You mentioned a very good term there: differential privacy is one great technology to use to make sure that the outputs of the models do not leak information about any specific data record. That's possible in different scenarios.
And today, at TripleBlind, we enable differential privacy when training or fine-tuning large language models as well, without a large hit in accuracy, without sacrificing accuracy a lot. Another great approach, one that is cooler than federated learning, is our own method called blind learning. Blind learning basically allows you to train a shared model, a global model, from distributed datasets that may exist at different branches of an enterprise or different locations around the world, again, without ever having to ship the data to a centralized location.
And the great thing about blind learning is that it is about 2 to 3 times more computationally efficient than federated learning. It can actually also lead to better-performing results and better accuracy than federated learning because of the way that it trains. So now we have covered our bases in finding proper datasets, executing code at remote sites or remote machines, and using differential privacy and federated learning to train on distributed data. After that, you can validate all of these models and then deploy them. When we deploy these models, we also use something called secure multiparty computation.
Secure multiparty computation allows the model owners to preserve the privacy of their models, because usually they spend time and effort training that model, so the parameters, or the weights, of that model are considered IP to the company that created it. And then there's the user of the model, who wants to utilize that model and run some tests on it, but they don't want to share their data with the host of the model or the model owner. So we run these inferences in a completely private way using, again, secure multiparty computation, which is a privacy-enhancing technology that enables joint computations without revealing the data to any of the involved parties. It's a very cool approach as well.
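To give a flavor of how a computation can proceed without revealing anyone's inputs, here is a toy additive secret sharing example. It is a teaching sketch of the underlying principle only, far simpler than the protocols used for actual private inference:

```python
# Toy additive secret sharing over a prime field: two private numbers are split
# into random shares, the shares are combined locally, and only the final sum is
# reconstructed. No single share reveals anything about the original inputs.
import secrets

PRIME = 2**61 - 1  # field modulus

def share(value, n_parties=2):
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

a, b = 42, 1337                      # Party A's and Party B's private inputs
a_shares, b_shares = share(a), share(b)

# Each party adds the shares it holds; neither ever sees the other's raw input.
sum_shares = [(a_shares[i] + b_shares[i]) % PRIME for i in range(2)]
print(reconstruct(sum_shares))       # 1379 == a + b
```

Production MPC protocols extend the same idea to multiplications and full neural network layers, which is what makes private inference on a proprietary model possible.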
So, yeah, every phase of the life cycle has a good privacy enhancing technology, and I think we have a collection of great efficient tools that are very effective at this.
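Similarly, the differential privacy mentioned above comes down to a concrete training-time mechanism: clip each example's gradient and add calibrated noise. Here is a bare-bones DP-SGD-style sketch for logistic regression in NumPy; it shows the mechanism only, and a real deployment would use a vetted DP library and a privacy accountant to track the actual epsilon:

```python
# Bare-bones DP-SGD-style training for logistic regression: clip each example's
# gradient to a fixed L2 norm, add Gaussian noise to the summed gradient, then step.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n) > 0).astype(float)

w = np.zeros(d)
clip_norm, noise_multiplier, lr, batch_size = 1.0, 1.1, 0.1, 100

for step in range(200):
    idx = rng.choice(n, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    preds = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    per_example_grads = (preds - yb)[:, None] * Xb           # log-loss gradient per example
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=d)
    w -= lr * (clipped.sum(axis=0) + noise) / batch_size

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(X @ w)))) > 0.5) == y)
print(f"training accuracy with clipped, noised gradients: {accuracy:.3f}")
```

The noise bounds how much any single record can move the parameters, which is exactly the guarantee that keeps an individual patient or customer from being exposed by the trained model.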
[00:35:43] Unknown:
And in your work at TripleBlind and as an ML practitioner, what are the most interesting or innovative or unexpected ways that you've seen people addressing this challenge of data privacy and bias in their machine learning and AI applications?
[00:35:58] Unknown:
Some of the coolest ways that I've seen are how blind learning, our in-house algorithm, has been and is being used today in real-world applications to train on patients' data and actually improve healthcare outcomes. That has been really exciting to me. We came up with this blind learning approach that enables you to train algorithms on decentralized data without ever seeing it, without ever accessing it. You don't have to ship the data outside your infrastructure. And the best thing about it is that it's very easy to use. Our partners, for example physicians from some healthcare providers who are building algorithms to predict specific diseases or rare diseases, are able to run this blind learning approach in less than 20 lines of code. So it has been super exciting to observe that these methods are actually being used in the real world.
And even though these methods bank largely on preserving the privacy of the training data, they're still able to produce accurate, high-performing models that are on par with models trained by pooling the data to a centralized location. So this has been very exciting for me to see happening in the real world.
[00:37:24] Unknown:
And in your work, what are the most interesting or unexpected or challenging lessons that you've learned in the context of privacy and bias in AI?
[00:37:33] Unknown:
One of the unexpected things that is still surprising to me is that the current regulations are lagging behind, big time, in providing good safeguards and good guidelines for practitioners, for enterprises, etcetera, to preserve the privacy of the people whose data they are using. With HIPAA, for example, today there are two ways to make sure that you are HIPAA compliant when you are using AI on patients' data. The first one is called Safe Harbor, and it basically works using data anonymization. You only need to delete about 18 personal identifiers from the dataset: you delete the patient's first name, last name, phone number, email, IP address, etcetera.
But we know very well, and we know for a fact, that, as Cynthia Dwork put it, anonymized data isn't. She actually coined the term differential privacy that we talked about earlier. So I'm still a little bit surprised at how the regulations are lagging behind and how a lot of enterprises out there today are able to get by with just data anonymization, which is not private and not protective of patients' privacy. Privacy is a fundamental human right, so that's basically not respecting it. But thanks to podcasts like yours and other recent calls for awareness, people are becoming more aware of this problem, more educated. And, hopefully, we can make some significant positive changes to the technology that we have today in order to preserve our privacy.
[00:39:33] Unknown:
And for people who are trying to address these challenges of bias and privacy in their machine learning models, what are the cases where TripleBlind is the wrong choice and maybe they need to build their own in-house solutions?
[00:39:47] Unknown:
That's a good question. Usually, building your own solution is a very complicated and tough process; it's not easy. Today, one of our best competitive advantages is that we have a great team. We have world-class cryptographers and AI practitioners who understand this technology at a very deep technical level. For example, our chief technology officer is Craig Gentry, who is basically the inventor of fully homomorphic encryption. You've probably heard of it, and it's widely used by so many companies around the globe. So we have great experience in this domain.
And I think the only reason you would build your own is probably to save some cost, and that will come back and bite you because you've probably taken some shortcuts. So you probably don't want to do that. If you are good at one thing, it's better to stick to it and let the experts deal with the privacy work.
[00:40:56] Unknown:
And as you continue to build and iterate on the TripleBlind product and stay up to date with what's happening in the broader ecosystem, what do you have planned for the future of TripleBlind?
[00:41:10] Unknown:
Yeah. We have a couple of great projects that have been going on for some time now, where we have achieved some significant results. For example, we are working on a project called the privacy monitor. It will allow individual users of ChatGPT, for example, and enterprises to know when they are sending sensitive information. It's going to notify them, and that privacy monitor is going to work automatically. So if you're sending a question to ChatGPT that has some sensitive information, our privacy monitor can detect that and warn you that you are sending that piece of information. Even if there are no specific terms in the actual message, like your name, etcetera, it can tell you, hey,
this idea is very similar to a patent document that you have on your machine, and that could leak some of your intellectual property. So this is very exciting. It will allow us to have a little bit more confidence that we are not leaking too much information whenever we are interacting with these large language models, specifically for enterprises or doctors. You can use ChatGPT, for example, to summarize a patient note, but we make sure that whatever is being sent from the patient note to ChatGPT does not have any PII, or personally identifiable information. So the privacy monitor is a very powerful tool that I'm very excited about, and we should be able to address a lot of the privacy concerns here in the very near future.
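A very rough approximation of that prompt-scanning idea, nothing like TripleBlind's actual privacy monitor, which also handles semantic matches like the patent example above, can be put together with a few regular expressions:

```python
# Rough sketch of scanning a prompt for obvious PII before it is sent to an LLM.
# Real systems combine NER models, document fingerprinting, and policy rules;
# regexes only catch the most easily pattern-matched identifier types.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_prompt(prompt: str):
    findings = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(prompt):
            findings.append((kind, match.group()))
    return findings

prompt = "Summarize this note for the patient reachable at john@example.com or 555-123-4567."
for kind, value in scan_prompt(prompt):
    print(f"warning: prompt appears to contain {kind}: {value}")
```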
Differential privacy is another aspect: how we can fine-tune these large language models in a very privacy-preserving way. We also have a private retrieval augmented generation tool. We talked about the knowledge cutoff of these large language models; this tool will enable you to append context to these large language models in a very cheap way, without having to fine-tune them. So these are some of the main ideas today, and a lot of them are targeted, again, at the entire AI life cycle, and specifically today a little bit more towards prompting and inference as well.
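Retrieval-augmented generation itself is simple to sketch: index a document collection, retrieve the passages most similar to the question, and prepend them to the prompt. The snippet below is a generic illustration using TF-IDF retrieval and a placeholder for the model call; it says nothing about the private variant described above, which additionally has to keep the document store and the queries protected:

```python
# Minimal retrieval-augmented generation sketch: retrieve the most relevant
# passages with TF-IDF similarity and prepend them as context to the prompt.
# `call_llm` is a placeholder for whatever model endpoint is actually used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Blind learning trains a shared model on distributed data without centralizing it.",
    "Differential privacy bounds how much any single record can influence a model.",
    "Secure multiparty computation lets parties compute jointly without revealing inputs.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(question: str, k: int = 2):
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# answer = call_llm(build_prompt("How does blind learning handle distributed data?"))
print(build_prompt("How does blind learning handle distributed data?"))
```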
[00:43:25] Unknown:
Are there any other aspects of the work that you're doing at TripleBlind, or the overall space of bias and data privacy, particularly in the context of generative AI, that we didn't discuss yet that you would like to cover before we close out the show?
[00:43:39] Unknown:
I think we touched on some of the very important topics. Again, I think this technology is really great. It has great potential for positive impacts in the future. Perhaps before we get there, it's going to be a little bit difficult, and there are a lot of nuances and a lot of unknowns as well. But overall, I think it will have a very positive impact. We should be careful about how we deploy these systems and how we train them, but we shouldn't be too worried either. We don't want to paralyze our advancement, the way we are doing the research, or the way we are training these models. We need to just continue pushing forward and not do the opposite. We should not slow down.
I think that's an important thing to keep in mind.
[00:44:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:44:53] Unknown:
That's a great question. I believe you have probably trained some AI models and played with some tools. But, unfortunately, even for a machine learning or AI expert like myself, I cannot just wake up in the middle of the night and say, hey, I would like to create models that predict diabetes or Alzheimer's before they happen, because I do not have access to such data. So data being locked behind the doors of regulations and privacy concerns, which is important and justifiable, still makes it very difficult to really move fast in this domain and create things. The lack of access to data, or of privacy-preserving ways to access that data, is, I believe, one of the biggest barriers to entering this domain and creating useful machine learning models. And, hopefully, at TripleBlind, we are facilitating that and removing lots of the barriers.
[00:45:48] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at TripleBlind, and your perspective and context on the challenges of bias and privacy for these AI projects that we are all interacting with and hearing about all the time. I appreciate your time, and I hope you enjoy the rest of your day. Thanks for having me on the show, and I
[00:46:11] Unknown:
enjoyed answering your wonderful questions. Thanks a lot.
[00:46:19] Unknown:
Thank you for listening, and don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea idea to delivery with machine learning.
[00:00:19] Unknown:
Your host is Tobias Massey, and today, I'm interviewing Gharib Gharib about the challenges of bias and data privacy in generative AI models. So, Gharib, can you start by introducing yourself?
[00:00:29] Unknown:
Hello, Tobias. Thanks for, having me. Excited to, talk with you about this, wonderful topic. My name is Gareeb. I am, interested, and, my interest fall at the intersections of AI privacy systems. Today, I serve as the director of applied research and head of AI at TripleBlind, which is a startup that's focused on basically protecting people's, privacy. And we do so by building a wide range of privacy preserving, tools for AI, machine learning, and data analytics.
[00:01:05] Unknown:
And do you remember how you first got started working in machine learning?
[00:01:09] Unknown:
Yes. So, basically, when I grew up, we had a desktop in our home, and I always thought it will be really exciting to be able to have conversations with, with my computer. Growing up, I know that in order to, do that, I needed to be able to talk to the computer and then make it do things for me or assist me in things. So in order to be able to talk to the computer, I pursued a master's degree in computer science, University of Missouri in Kansas City. It was in software development. And, after that, I went into my PhD program, and I started focusing on deep learning. And I think what really, made me go into deep learning is a paper from 2012, Jeffrey Hinton team, that worked on ImageNet competition, and their results showed that this thing called the neural networks and convolution, layers can actually achieve very good results.
I got very interested in this domain and, studied deep learning in my PhD. And towards the end of it, I started exploring, privacy preserving AI. And, here I am today.
[00:02:17] Unknown:
Generative AI in particular has been gaining a lot of attention because of the different breakthroughs and the level of sophistication and capability that it has reached recently with the different, successions of g p t based models. I'm wondering if you can talk to some of the risks that the capabilities of these GPT models pose, particularly in that context of data privacy and bias?
[00:02:43] Unknown:
Yeah. Sure. That's, that's actually a great question. And to me, GPT or generative AI in general is basically I see it as a general purpose technology. Similar to all the previous technologies that we had, these technologies usually are double edged swords. They have a lot of potential to do great things, but they also can cause some harm. Some some of the capabilities of chat GPT, for example, can be used to, amplify specific, agenda or propaganda because it's very good at generating text and essays, for example, or blogs easily and very cheap and with very convincing arguments.
So that can, that can also be helpful to generate the fake news. Right? And, GPT models today are multimodal, and in the future, they could be producing videos. And, actually, there's already generative AI models that can take some question in a small sentence and text and generate a video out of it. So deep fakes is another big problem, for example. So the capabilities of these models can have, very serious implications, and a lot of these capabilities have fallen in the wrong hands of doctor Evel. It could, cause some serious harm. But, in the big picture and on the long run, I believe that, the benefits of these systems are going to offset, their, their issues.
[00:04:11] Unknown:
With generative AI, there's a lot of excitement about, oh, they can do anything. I can tell it to do whatever I want, and it'll give me a reasonable sounding answer. But the the problem there is that it's reasonable sounding without necessarily being actually correct. And I'm curious what you see as the contributing factors to the shortcomings of these generative models and some of the inherent risk of people's level of trust that is being built up with that trust not necessarily being well founded?
[00:04:45] Unknown:
Yes. That's that's another great question. And, I believe the this, these issues stem from 3 major factors, data, the algorithms, and the oversight when it comes to training these algorithms. 1st, I usually loosely define AI algorithms or, AI systems as programs that generalize from data. And, therefore, if the data that we collected was biased in 1 way or another, the programs that's going to generalize from this data is they're also going to amplify that bias. A second thing is the algorithms themselves. Despite having very sophisticated ways to train and optimize models and, reinforce some of the, good behaviors in securing these models. The underlying algorithms still have some serious problems.
For example, large language models, as you mentioned, do hallucinate. They, they cannot provide very factual answers. They're not good at retrieving factual information. They're not good at citing them. As a matter of fact, BERT, 1 of the models that are very common, still believes that the president of the United States is Barack Obama, and we all know that child gbt knowledge cutoff is in 2021. So these large language model are still have some serious issues when it comes to the algorithms themselves. And finally, the oversight when it comes to training these models.
Actually, the lack of overs oversight in the in the training. So, we still don't really understand how these models generalize and why they generalize very well. And, therefore, it's hard to put safeguards in place to make sure all of their answers are correct and always consistent and that they can provide citations. Yeah. So I believe data algorithms and the oversight or the lack of oversight are the major reasons of hallucinations, knowledge cutoff, and not being able to cite their references. And as you said, they're very good at convincing generating very convincing, arguments. Right?
[00:06:53] Unknown:
Yeah. All you have to do is exude confidence that your answers are correct regardless of their actual factual basis. Yeah. And to the point of bias in the underlying data that's used to build these models, that is not necessarily something that is exhibited in every interaction, and it can even be very subtle and hidden. And I'm curious, what are some of the ways that that inherent bias in the underlying data can manifest and some of the ways that that bias can have a negative external impact on the users or the operators of that product?
[00:07:32] Unknown:
Yeah. That's a very another great question. So I I I believe some of the examples for the subtle manifestation of bias based on the training data can some examples could be something I refer to or what's known as semantic bias. So in semantic bias, basically, when a specific term appears in the training data frequently associated with negative context, the AI algorithm is going to, assign that negative context or negative meaning to that word. So if you have a dataset that's always associating the word immigrant with a bad context, then AI is going to have a of that word that it's being negative. So that's what I refer to semantic bias.
And, then there is the absence bias. A lot of times, we are training on some specific data set for to to cure some disease or to learn a new style of way of saying things or doing things or teaching the LLM to do some specific specialized task. If our dataset does not include other types of arguments or other views on the same topic or the same point, that LLM is not going to understand the full argument and not able to, really capture information from different perspectives. And then there's, that that causes some, in the future, some bias towards specific points of view, etcetera.
Then there is the the known bias that we can exhibit from the data. If the data is collected from specific geographical areas or from specific respondents, if we have surveys and we are collecting them from specific, groups that have specific political views, then that data is obviously biased towards some specific political, perspective. For example, that's the gender bias, whether it's because of historical reasons or because the language as a binary language, for example, etcetera. So all types of bias that could be hidden in the language itself, such as the semantic bias that our dataset is not representative enough, there's more strong, manifestation and of of bias and for historical reasons and some stereotypes as well. And in addition to
[00:09:58] Unknown:
the inherent bias of the source datasets, the ways that that exhibits in the generated results, the potential inaccuracies because of lack of real world context or actual knowledge. The other potential risk of dealing with these generative AI systems is the opacity of how the user interaction data is actually going to be used or whether it's going to be used at all. And And I'm wondering what are some of the ways that the application or collection of the interactions might also be some measure of exposure to risk for the people who are interacting with these systems.
[00:10:35] Unknown:
Yeah. Of course. I think that could have significant risks. Right? I I believe, historically, our browsing data has been, like, a sensitive topic. But I believe today, our, chat history with chat GPT, for example, is probably way more sensitive than our browsing history. We're asking a lot of questions that could demonstrate our incompetency at work, for example. We are asking questions to, write a message or a letter to our significant other, maybe we are, not being very transparent that it was generated by AI. And if that piece of information was somehow leaked unintentionally, it can, cause some serious, some serious harms.
So, yeah, data today is collected in, in different ways. Basically, every single piece of data of our interaction with these systems is is collected. And, there is unfortunately not very transparent ways for us to validate how this data is being used, where is it being kept, who has access to this data. And, therefore, we cannot really fully understand and comprehend the possible risks. Some providers promise that our data there are options to opt out from collecting your data and being used. That data is still kept for 30 days somewhere at some servers, and I'm more than willing and happy to, and I do believe the promises that these, providers make that they're not going to use my data, but accidents happens and honest mistakes happen. And, we have seen incidents already where, some, people's data from OpenAI was leaked and some credit cards information.
2 weeks ago, some, researchers, Microsoft AI team, accidentally published tons of data about people and, private conversation on GitHub accidentally because of a a file that was misconfigured. So accidents do happen, and, therefore, promises by themselves are not enough. And that's why I'm a big advocate and interested in providing privacy preserving methods to to do these things beyond the, beyond the promises. So the risks here could be very serious. We need to be able to know where this data is collected, how is it being used, who has access to it. This data could be also repurposed, as you mentioned. You're not using my data to retrain your system, but maybe you are using it for targeted advertisement, for example. So it can have very serious implications.
[00:13:17] Unknown:
And then the other element of risk for generative AI, and these aspects of bias and inaccuracy, is the potential future impact of lack of trust. Negative interactions that people have now could influence future investment in developing or improving these technologies, even as the systems improve and address some of those underlying sources of bias or inaccuracy. If there is enough of a backlash, those systems might need to be shut down completely. There are also issues around the copyright status of some of the source data, my understanding is for OpenAI in particular, which is a potential legal gray area. I'm curious what you see as the risk to the industry, and to the people who are working on these technologies, from potential violations of trust because of these biases or inaccuracies?
[00:14:16] Unknown:
Yes. I believe that eventually we are hopefully going to solve these problems at large, and we should be able to build systems that are trustworthy and that preserve our privacy whenever we interact with them. However, today we're still not able to fully comprehend the abilities of these systems, and therefore it's still really difficult for us to safeguard them and understand the ramifications of this technology. Today there is a very positive attitude toward this AI. I have never personally met anyone who tried ChatGPT, or Midjourney to generate images, and said, hey, this is very dangerous, this is not good, I'm not going to use it. So it seems like when the general public looks at this technology today, they like it, and they believe it's going to actually benefit humanity.
Maybe 1 great thing that OpenAI did is that it exposed the system to the public, so we could start understanding the abilities of these AI systems and what could actually go wrong. So today things still seem to be under control, but god forbid some major accident happens to these systems, it could really change the attitude toward them. If something goes wrong, if a major hacking incident leaks all of people's conversations with ChatGPT, for example, that could lead to some serious issues for the adoption of this technology.
And if you look back in history, some such accidents have drastically changed the roadmap and the trajectory of technologies we had. Nuclear energy is a very good example that comes to my mind. It emerged as this limitless source of clean energy that's very good for humanity, the environment, and everyone, and it was supposed to be very cheap as well. Unfortunately, a couple of incidents, Chernobyl in the eighties, and I think, if I'm not wrong, Three Mile Island here in the United States before that, and Fukushima in Japan, greatly affected people's attitude toward that technology. Today the term nuclear is attached to all of these incidents that happened in the past, and that resulted in investment there drastically stopping, research slowing, students no longer interested in studying that field anymore, etcetera.
So it would probably take something as big as Fukushima happening to AI to change the overall sentiment and attitude toward it. But it's also a little bit different here, because, yes, people's privacy is very important, but it's probably not people's lives that will be affected if a big hack happens to ChatGPT, for example, or OpenAI's servers. So even if a great incident like that happens, I don't think it's going to stop progress; maybe it will slow it down a little bit, but it's not going to stop it completely, especially now that we have just seen the tip of the iceberg of what this technology is capable of doing.
So maybe there will be an incident here and there, and some tax that we have to pay for these technologies, but I think it's going to continue growing.
[00:18:03] Unknown:
From the perspective of the practitioners, the people building these AI systems and applying ML technologies, what are some of the ways they need to be thinking about incorporating awareness of and counteraction of bias into their work? And how should they think about the presentation of their results, conveying the level of confidence, so that there isn't this blind trust in these systems? I'm just wondering about the overall scope of how these risks need to be incorporated into the actual work of ML practitioners to ensure that they are building systems that are beneficial to end users.
[00:18:48] Unknown:
That's a very good question. Simply speaking, I think they should use a system like TripleBlind; that's the startup I work for. The reason I say that is that I believe ML practitioners need to be careful when they are creating these AI systems and training on the data. You might be aware that the AI life cycle is very different from the traditional software development life cycle. You need to be careful when you collect the data and where you collect it from. You need to be careful when you curate the data, organize it, clean it, and prepare it. You need to be careful about which algorithms, and which privacy preserving algorithms, you use to train the model, and we all know that model training is not a one-time process; we retrain again and again until the model reaches a specific performance point. Then we take that model, deploy it, and it starts being used and generating inferences.
Even then, we need to make sure that those inferences are not leaking any information about the training data. And we want to monitor the behavior of the model to ensure it's not shifting over time, because people's behaviors and the world we live in keep shifting, so these AI models drift in their performance as well, and we need to monitor that. It's a very complicated process, which is why I believe a very easy-to-use system is necessary here. AI practitioners need to make sure their data covers all possible scenarios of whatever downstream task they are working on, and they need to collect that data from multiple sources, not a single source.
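To make the drift-monitoring idea concrete, here is a minimal sketch (not TripleBlind tooling, just a common baseline) that computes the population stability index between the feature distribution seen at training time and the one seen in production; a rising PSI is one simple trigger for retraining.

```python
import numpy as np

def population_stability_index(reference, live, bins=10):
    """Compare a live feature distribution against the training-time
    reference distribution; larger values indicate more drift."""
    # Bin edges come from the reference data so both samples share one grid.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Avoid division by zero / log of zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
reference = np.random.normal(0.0, 1.0, 10_000)   # feature values at training time
live = np.random.normal(0.4, 1.2, 5_000)         # values seen after deployment
print(population_stability_index(reference, live))
```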
I recall a study that came out of Oxford in 2021 that examined more than 500 machine learning and deep learning models, published in reputable journals and conferences, that were trained to detect COVID-19 from chest X-rays and other electronic health records. That study demonstrated that almost every single 1 of these algorithms was fatally flawed and failed badly when the models were exposed to new types of data that came from a different distribution, from different people, from different geographical locations. So we need to make sure we are collecting samples that are representative of all possible scenarios, and that we are using robust training algorithms that lead to models that generalize well and do not overfit.
When we expose these models, we want to preserve the privacy of the training data by making sure the models' predictions cannot be used to leak any data. Finally, we need to be transparent about this whole process end to end. Privacy should not be an after-the-fact thing. At TripleBlind we operate with a concept called privacy by design, which today is well known in academia and industry: privacy should be integrated into the entire life cycle. So, yes, data collection is important; make sure we have samples that cover all possible scenarios.
Make sure we are using algorithms that generalize well and do not lead to overfitting. Even when we deploy the models, make sure we're using techniques to protect the test data coming into inference, and techniques to make sure the output of inference is not leaking any data either. And then transparency: being open about discussing it and collaborating with privacy professionals and scientists to make sure our methods are correct and are not actually leaking any sensitive information.
[00:22:49] Unknown:
In the context of ensuring that you cover your bases with regard to bias in the training process, 1 of the challenges is that bias is typically something you're blind to. So I'm curious how you and your team, and practitioners in general, can think about identifying potential sources of bias, and some of the ways to think about coverage of bias within a given problem domain that you're trying to solve for? Yeah. Well, it's not an easy process. And generally speaking,
[00:23:24] Unknown:
it is, if not impossible, then a very difficult problem to build a system that is unbiased for every possible user. Specifically, if we are talking about LLMs, for example, you can chat with them about different ideas, different ideologies, etcetera, so it's very difficult to build a system that is completely unbiased toward everyone. But usually, when we work with applications where we can manage and mitigate that, such as medical applications based on AI and machine learning systems, we make sure that we have the necessary tools for hospitals, physicians, and machine learning practitioners to access data from around the globe, for example, without having to go through a very tedious process of signing legal terms just to be able to exchange data, because our system today is HIPAA compliant, GDPR compliant, etcetera. It reduces the time it traditionally takes to run such an experiment and obtain data from the European Union, for example, from 6 months to a year down to a couple of hours using our system.
And that's basically 1 of the biggest enablers for mitigating bias as much as possible: increasing the sources of the data and the distribution of the data. Going back to when we discussed some of the subtle ways that bias can manifest in the data, we make sure we are covering our bases. A specific disease is affected by factors x, y, and z. Are those factors covered well enough in our training data? If not, why not? Are factors x and y undersampled, meaning we don't have enough samples? Does the training data have too much representation of factor x? Then we undersample some specific classes or oversample other classes to make sure that everything is represented as equally as possible.
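As a rough illustration of the re-balancing described here, the sketch below over-samples under-represented groups in a tabular dataset; the column names are hypothetical and this is just one of several ways to equalize representation.

```python
import pandas as pd

def rebalance(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Over-sample under-represented groups so every value of
    `group_col` appears with roughly equal frequency."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        # Sample with replacement until the group reaches the target size.
        parts.append(group.sample(n=target, replace=True, random_state=seed))
    return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle rows

# Hypothetical example: factor "b" is heavily under-sampled relative to "a".
df = pd.DataFrame({"factor_x": ["a"] * 90 + ["b"] * 10, "label": [0, 1] * 50})
balanced = rebalance(df, "factor_x")
print(balanced["factor_x"].value_counts())
```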
And then we double down on the validation and evaluation process as well. We have an automated tool that helps researchers test, across every possible characteristic in the dataset, how biased the model is toward a specific feature or column in tabular data, for example. So there are several approaches, from collecting as much data from as many sources as possible to rigorous evaluation and validation methods.
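The per-column evaluation idea can be sketched in a few lines; this is a generic illustration rather than the automated tool mentioned above, assuming labels and predictions have already been joined onto the evaluation table.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def audit_by_group(df, y_true_col, y_pred_col, sensitive_cols):
    """Report the model's accuracy for every value of every sensitive column."""
    rows = []
    for col in sensitive_cols:
        for value, group in df.groupby(col):
            rows.append({
                "feature": col,
                "value": value,
                "n": len(group),
                "accuracy": accuracy_score(group[y_true_col], group[y_pred_col]),
            })
    return pd.DataFrame(rows)

# Tiny hypothetical evaluation set with labels, predictions, and demographics.
eval_df = pd.DataFrame({
    "label":      [0, 1, 1, 0, 1, 0, 1, 1],
    "prediction": [0, 1, 0, 0, 1, 1, 1, 1],
    "sex":        ["f", "f", "f", "m", "m", "m", "m", "f"],
    "site":       ["eu", "us", "us", "eu", "eu", "us", "us", "eu"],
})
report = audit_by_group(eval_df, "label", "prediction", ["sex", "site"])
print(report.sort_values("accuracy"))  # the lowest-accuracy groups surface first
```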
[00:26:04] Unknown:
Another interesting element of trying to account for bias is how it manifests in the accuracy for a particular subgroup of the target audience as you generalize to a broader audience, and some of the ways to think about that. Is it better to accept a decrease in accuracy because you are addressing the potential bias, or is it maybe better to train specific models for the different cohorts you're targeting? I'm wondering what you see as the general approach to that problem.
[00:26:37] Unknown:
It largely depends on the downstream task. We have used all sorts of methods to solve that, from sometimes sacrificing a little bit of accuracy to address the targeted group of people, task, or feature that we care about, all the way to building a system of experts, a group of models where each 1 is specialized in a specific demographic and a specific disease, and they eventually vote on a specific output or result. Then we look at all of that, try to generalize beyond the training process, and tune the models afterwards in 1 direction or another based, again, on the downstream task. A lot of the time, explainability, trying to understand how and why the model generalizes to the level it does, can also help us mitigate some of that bias. If we can understand why the model is making decision x and not what it is supposed to do, and why it performs much better on 1 class than another, we try to understand that behavior and tune it afterwards.
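A toy version of that "group of experts" setup, one model per cohort plus a simple vote, might look like the following; it is only meant to show the shape of the approach, not any production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: a cohort id, two numeric features, and a binary label.
X = rng.normal(size=(600, 2))
cohort = rng.integers(0, 3, size=600)
y = (X[:, 0] + 0.5 * cohort + rng.normal(scale=0.5, size=600) > 1).astype(int)

# Train one "expert" per cohort on that cohort's slice of the data only.
experts = {
    c: LogisticRegression().fit(X[cohort == c], y[cohort == c])
    for c in np.unique(cohort)
}

def ensemble_predict(x):
    """Majority vote across all cohort experts for a single sample."""
    votes = [model.predict(x.reshape(1, -1))[0] for model in experts.values()]
    return int(np.round(np.mean(votes)))

print(ensemble_predict(X[0]))
```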
[00:27:50] Unknown:
From the perspective of your customers, what are some of the elements of education, or some of the typical questions they come to you with, to help them understand the risks and the reason they need to invest extra time, energy, and money into counteracting bias and ensuring the privacy preserving aspects of their AI? What elements of customer education have you found most useful and most effective?
[00:28:18] Unknown:
Very good question. A lot of the customers that come to us are already dealing with very sensitive information: health care customers, financial customers, or customers in the advertising world. Customers vary a lot; sometimes they're aware, to some level, of the importance of privacy. Then we often demonstrate to them that the model they trained on their patients' data, 1 that's only used as a classification model to predict a specific disease from X-rays, for example, can actually leak a lot of information about their patients.
We have demonstrated before that if you take a model that was trained in a centralized way without any privacy enhancing technologies, I can probably reconstruct several samples of the training data, which means these models are actually memorizing, and leaking, parts of the data. So we show what it is possible to extract from already-trained models. We show what's possible to do even after you have anonymized your data: we can re-identify patients in your dataset even if you follow the HIPAA Safe Harbor rules, for example, because we believe anonymization by itself is not sufficient.
So we educate our customers about all the possible points where data leakage can happen across the overall AI life cycle, from collecting the data to training to validation and even to inference. Say we have a model that's trained to do some type of classification task. I can probably curate specific inputs to that model to make it produce information that is biased toward the training data, and given enough runs of that inference process, I can probably extract a lot of information about the training data. So demonstrating what's possible, and how these systems leak data and sensitive information, is 1 of the best educational ways, I think, to raise awareness with customers.
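One simple way to demonstrate this kind of leakage, assuming an sklearn-style classifier with integer labels 0..K-1, is a loss-gap membership check: models that memorize their training data assign noticeably lower loss to training examples than to unseen ones. This is a generic illustration, not the specific reconstruction or re-identification techniques described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def membership_signal(model, X_members, y_members, X_nonmembers, y_nonmembers):
    """Average loss gap between unseen and training data; a large positive
    gap suggests the model has memorized (and can leak) its training set."""
    def nll(X, y):
        probs = model.predict_proba(X)[np.arange(len(y)), y]
        return -np.log(np.clip(probs, 1e-9, 1.0))  # true-class negative log-likelihood
    return float(nll(X_nonmembers, y_nonmembers).mean() - nll(X_members, y_members).mean())

# Toy demonstration: an overfit forest gives itself away on its own training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier().fit(X[:200], y[:200])          # members: first half
print(membership_signal(model, X[:200], y[:200], X[200:], y[200:]))
```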
They see that, and then they see how we can mitigate all of those bias and privacy concerns using the set of tools they are already used to working with. You don't need to be a cryptography expert. You don't need to know what federated learning or blind learning is in order to use our toolset. As a machine learning or AI practitioner, you continue using the tool stack you are used to today, and we encapsulate all of our automated methods for you. When they see that, and since enterprises today generally care about their customers, their data, and their intellectual property, it's easy to convince them.
[00:31:14] Unknown:
With that list of different privacy preserving techniques you rattled off there, federated learning, blind learning, and I know differential privacy is another 1 that's gained a lot of attention, which of the common approaches have you seen be most beneficial? I'm curious if you can give a general sense of the current state of the art, or the set of best practices, around how to think about and apply these privacy preserving aspects in the training and serving process? Yeah. Sure. So,
[00:31:44] Unknown:
first, there is no single technology that's good for all phases of the AI life cycle. So 1 of the main things we did well here at TripleBlind is optimize the life cycle and the underlying privacy enhancing technologies for each 1 of the tasks. When you start collecting your data, we have a set of tools that enable you to learn a lot about data you don't own, data that exists at other organizations, without any of that data ever leaving the host environment and without actually seeing any of the raw data. That's how you do discovery, basically blind discovery of the data: you know, hey, there is something that's useful for my application, but I don't know exactly what it is, and I cannot see it.
After that, we have a set of tools that enable you to run code privately and securely on the remote datasets. We have something called a remote Python executor, where you can ship specific Python code, or Rust code, for example, to the other parties that have the data, and that code will execute on the party's data given the owner's permission, an auditing process, etcetera. After that, the training process comes into place. You mentioned a very good term there: differential privacy is 1 great technology to use to make sure the outputs of the models do not leak information about any specific data record, and that's possible in different scenarios.
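For readers who want to see what differential privacy looks like in training, the usual recipe is DP-SGD: clip each example's gradient, add calibrated Gaussian noise, then step. A minimal NumPy sketch for logistic regression follows; a real system would use a vetted library and a proper privacy accountant rather than this toy.

```python
import numpy as np

def dp_sgd_logreg(X, y, epochs=20, lr=0.1, clip_norm=1.0,
                  noise_multiplier=1.0, batch_size=32, seed=0):
    """DP-SGD sketch: clip each example's gradient, add Gaussian noise to the
    batch sum, then step. (Privacy accounting is omitted for brevity.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            preds = 1.0 / (1.0 + np.exp(-Xb @ w))
            per_example_grads = (preds - yb)[:, None] * Xb            # shape (B, d)
            # Clip each example's gradient to norm at most clip_norm.
            norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
            per_example_grads /= np.maximum(1.0, norms / clip_norm)
            # Sum, add noise calibrated to the clipping bound, then average.
            noisy_sum = per_example_grads.sum(axis=0) + rng.normal(
                scale=noise_multiplier * clip_norm, size=d)
            w -= lr * noisy_sum / len(batch)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
print(dp_sgd_logreg(X, y))
```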
And today at TripleBlind we enable differential privacy for training or fine tuning of large language models as well, without a large hit in accuracy. Another great approach, 1 that is cooler than federated learning, is our own method called blind learning. Blind learning basically allows you to train a shared or global model from distributed datasets that may exist at different branches of an enterprise, at different locations around the world, again without ever having to ship the data to a centralized location.
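Blind learning itself is TripleBlind's own method and is not reproduced here; as a point of reference, the plain federated-averaging baseline it is being compared against can be sketched as a single-process simulation like this (toy data, hypothetical "clients").

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """One client's local training on its own data (simple logistic regression)."""
    w = w.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

def federated_averaging(clients, rounds=20, dim=3):
    """Each round: every client trains locally, then only the model weights
    (never the raw data) are averaged into the shared global model."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        local_models = [local_sgd(global_w, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        global_w = np.average(local_models, axis=0, weights=sizes)
    return global_w

# Toy simulation of three "hospitals" holding disjoint slices of data.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 3))
    y = (X @ true_w + rng.normal(scale=0.3, size=200) > 0).astype(float)
    clients.append((X, y))
print(federated_averaging(clients))
```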
And the great thing about blind learning is that it is about 2 to 3 times more computationally efficient than federated learning. It can also lead to better-performing results and better accuracy than federated learning because of the way it trains. So now we have covered our bases in finding the proper datasets, executing code at remote sites or remote machines, and using differential privacy and federated learning to train on distributed data. After that, you can validate all of these models and then deploy them. When we deploy these models, we also use something called secure multiparty computation.
Secure multiparty computation allows model owners to preserve the privacy of their models, because they usually spend time and effort training them, so the parameters or weights of those models are considered IP of the company that created them. Then there's the user of the model, who wants to utilize it and run some tests on it but doesn't want to share their data with the host of the model or the model owner. So we run these inferences in a completely private way using secure multiparty computation, which is a privacy enhancing technology that enables joint computations without revealing the data to any of the involved parties. It's a very cool approach as well.
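The core trick behind secure multiparty computation can be shown with additive secret sharing: each party splits its private value into random shares, only shares are exchanged, and yet the sum is recovered exactly. Real MPC inference over model weights is far more involved; this is just the smallest illustrative case.

```python
import secrets

PRIME = 2_147_483_647  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split a private integer into n random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Three parties each hold a private value; nobody reveals theirs.
private_values = [12, 30, 7]
all_shares = [share(v, 3) for v in private_values]

# Party i locally adds up the i-th share of every value...
partial_sums = [sum(shares_for_party) % PRIME
                for shares_for_party in zip(*all_shares)]

# ...and only these partial sums are combined, revealing just the total.
print(reconstruct(partial_sums))  # 49, with no party seeing another's input
```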
So, yeah, every phase of the life cycle has a good privacy enhancing technology, and I think we have a collection of great, efficient tools that are very effective at this.
[00:35:43] Unknown:
And in your work at TripleBlind and as an ML practitioner, what are the most interesting or innovative or unexpected ways that you've seen people addressing this challenge of data privacy and bias in their machine learning and AI applications?
[00:35:58] Unknown:
1 of the coolest things I've seen is how blind learning, our in-house algorithm, has been and is being used today in real-world applications to train on patients' data and actually improve health care outcomes. That has been really exciting to me. We came up with this blind learning approach that enables you to train algorithms on decentralized data without ever seeing or accessing it; you don't have to ship the data outside your infrastructure. And the best thing about it is that it's very easy to use. Our partners, physicians from some health care providers who are building algorithms to predict specific or rare diseases, are able to run this blind learning approach in fewer than 20 lines of code. So it has been super exciting to observe these methods actually being used in the real world.
And even though these methods bank largely on preserving the privacy of the training data, they're still able to produce accurate, high-performing models that are on par with models trained when the data is pooled in a centralized location. So that has been very exciting for me to see happening in the real world.
[00:37:24] Unknown:
And in your work, what are the most interesting or unexpected or challenging lessons that you've learned in the context of privacy and bias in AI?
[00:37:33] Unknown:
1 of the unexpected things that is still surprising to me is that the current regulations are lagging behind, big time, in providing good safeguards and good guidelines for practitioners, enterprises, etcetera, to preserve the privacy of the people whose data they are using. Take HIPAA, for example: today there are 2 ways to make sure you are HIPAA compliant when you are using AI with patients' data. The first 1 is called Safe Harbor, and it basically works through data anonymization: you only need to delete about 18 personal identifiers from the dataset, so you delete the patient's first name, last name, phone number, email, IP address, etcetera.
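Mechanically, the Safe Harbor route amounts to dropping or coarsening those identifier columns before use; a crude sketch with hypothetical column names is below, and, as the next point argues, this kind of anonymization on its own does not guarantee privacy.

```python
import pandas as pd

# Hypothetical subset of HIPAA Safe Harbor identifier columns in a patient table.
IDENTIFIER_COLUMNS = [
    "first_name", "last_name", "phone", "email", "ip_address",
    "ssn", "medical_record_number", "street_address",
]

def naive_safe_harbor(df: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers and coarsen quasi-identifiers.
    NOTE: this is the 'anonymization' being criticized below; on its own
    it does not prevent re-identification."""
    out = df.drop(columns=[c for c in IDENTIFIER_COLUMNS if c in df.columns])
    if "date_of_birth" in out.columns:
        # Keep only the birth year (Safe Harbor also has rules for ages over 89,
        # omitted here for brevity).
        out["birth_year"] = pd.to_datetime(out.pop("date_of_birth")).dt.year
    if "zip_code" in out.columns:
        out["zip3"] = out.pop("zip_code").astype(str).str[:3]  # truncated ZIP code
    return out

patients = pd.DataFrame({
    "first_name": ["Ada"], "zip_code": ["64108"],
    "date_of_birth": ["1984-07-01"], "diagnosis": ["E11.9"],
})
print(naive_safe_harbor(patients))
```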
But we know for a fact that data anonymization isn't, and that's actually a quote from Cynthia Dwork, who coined the term differential privacy that we talked about earlier. So I'm still a little bit surprised at how the regulations are lagging behind, and at how many enterprises out there today are able to get by just doing data anonymization, which is not private and not protective of patients' privacy. Privacy is a fundamental human right, so that's basically not respecting it. But there are podcasts like yours, and our recent calls, and people are becoming more aware of this problem and more educated. Hopefully we can make some significant positive changes to the technology we have today in order to preserve our privacy.
[00:39:33] Unknown:
And for people who are trying to address these challenges of bias and privacy in their machine learning models, what are the cases where TripleBlind is the wrong choice and maybe they need to build their own in-house solutions?
[00:39:47] Unknown:
That's a good question. Usually, building your own solution is a very complicated and tough process; it's not easy. Today, 1 of our best competitive advantages is that we have a great team: world-class cryptographers and AI practitioners who understand this technology at a very deep technical level. For example, our chief technology officer is Craig Gentry, and he's basically the inventor of fully homomorphic encryption. You've probably heard of that, and it's widely used in so many companies around the globe. So we have great experience in this domain.
I think the only reason you would build your own is probably to save some cost, and that will come back to bite you because you've probably taken some shortcuts. So you probably don't want to do that. If you are good at 1 thing, it's better to stick to it and let the experts deal with the privacy work.
[00:40:56] Unknown:
And as you continue to build and iterate on the TripleBlind product and stay up to date with what's happening in the space, what do you have planned for the future of TripleBlind?
[00:41:10] Unknown:
Yeah. We have a couple of great projects that have been going for some time now, where we have achieved some significant results. For example, we are working on a project called the privacy monitor. It will allow individual users of ChatGPT, for example, and enterprises, to know when they are sending sensitive information. It's going to notify them, and the privacy monitor works automatically. If you're sending a question to ChatGPT that contains sensitive information, our privacy monitor can detect that and warn you that you are sending that piece of information. Even if there are no specific terms in the actual message, your name, etcetera, it can tell you, hey,
this idea is very similar to a patent document you have on your machine, and it could leak some of your intellectual property. So this is very exciting. It'll allow us to have a little more confidence that we are not leaking too much information whenever we interact with these large language models, especially for enterprises or doctors. You can use ChatGPT, for example, to summarize a patient note, but we make sure that whatever is sent from the patient note to ChatGPT does not contain any PII, personally identifiable information. So the privacy monitor is a very powerful tool that I'm very excited about, and we should be able to address a lot of the privacy concerns here in the very near future.
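The privacy monitor is TripleBlind's own product, so the sketch below is only meant to convey the general shape of the idea: screen a prompt locally before it is ever sent to a language model. Real detection of something like "this resembles your patent" requires semantic matching, not the simple regex pass shown here.

```python
import re

# Simple patterns for a few obvious kinds of PII (illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def screen_prompt(prompt: str):
    """Return the list of PII types found; warn the user before the prompt
    ever leaves their machine."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(prompt)]

prompt = "Summarize this note for patient jane.doe@example.com, SSN 123-45-6789."
findings = screen_prompt(prompt)
if findings:
    print(f"Warning: prompt appears to contain {', '.join(findings)} - not sending.")
```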
Differential privacy is another aspect: how we can fine tune these large language models in a privacy preserving way. We also have a private retrieval augmented generation tool. We talked about the knowledge cutoff of these large language models; this tool will enable you to append context to them in a very cheap way, without having to fine tune them. So these are some of the main ideas today, and a lot of them are targeted, again, at the entire AI life cycle, and specifically today a little bit more toward prompting and inference as well.
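Retrieval-augmented generation in its simplest form: embed the question, pull the closest local documents, and prepend them to the prompt so the model gets fresh context without fine tuning. The sketch below uses TF-IDF similarity so it runs anywhere; a production (and privacy preserving) version would use a proper embedding model and keep retrieval on the data owner's side.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Policy update 2024: all patient exports now require privacy review.",
    "The cafeteria menu changes every Tuesday.",
    "Model cards must list training-data sources and known bias evaluations.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def retrieve(question: str, k: int = 2):
    """Return the k documents most similar to the question (TF-IDF rows are
    L2-normalized, so the dot product is cosine similarity)."""
    q = vectorizer.transform([question])
    scores = (doc_vectors @ q.T).toarray().ravel()
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "What do I need to do before exporting patient data?"
context = "\n".join(retrieve(question))
prompt = f"Use only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this augmented prompt is what would be sent to the language model
```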
[00:43:25] Unknown:
Are there any other aspects of the work that you're doing at TripleBlind, or the overall space of bias and data privacy, particularly in the context of generative AI, that we didn't discuss yet that you would like to cover before we close out the show?
[00:43:39] Unknown:
I think we touched on some of the most important topics. Again, I think this technology is really great, with great potential for positive impact in the future. Perhaps before we get there it's going to be a little bit difficult; there are a lot of nuances and a lot of unknowns as well. But overall I think it will have a very positive impact. We should be careful about how we deploy and train these systems, but we shouldn't be too worried either. We don't want to paralyze our advancement, the way we do the research, or the way we train these models. We need to just continue pushing forward and not do the opposite; we should not slow down.
I think that's an important thing to keep in mind.
[00:44:39] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[00:44:53] Unknown:
That's a great question. I believe you have probably trained some AI models and played with some tools. But unfortunately, even for a machine learning or AI expert like myself, I cannot just wake up in the middle of the night and say, hey, I would like to create models that predict diabetes or Alzheimer's before they happen, because I do not have access to such data. Data being locked behind the doors of regulations and privacy concerns, which is important and justifiable, still makes it very difficult to move faster in this domain and create things. So the lack of access to data, or of privacy preserving ways to access that data, is, I believe, 1 of the biggest barriers to entering this domain and creating useful machine learning models. Hopefully, at TripleBlind, we are facilitating that and removing a lot of the barriers.
[00:45:48] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing at TripleBlind, and your perspective and context on the challenges of bias and privacy for these AI projects that we are all interacting with and hearing about all the time. I appreciate your time, and I hope you enjoy the rest of your day. Thanks for having me on the show, and I
[00:46:11] Unknown:
enjoyed answering your wonderful questions. Thanks a lot.
[00:46:19] Unknown:
Thank you for listening, and don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Generative AI: Risks and Capabilities
Factors Contributing to AI Shortcomings
Manifestation and Impact of Bias in AI
User Interaction Data and Privacy Risks
Future Impact and Trust in AI
Incorporating Awareness of Bias in ML Practice
Identifying and Mitigating Bias
Accuracy vs. Bias in AI Models
Customer Education on Privacy and Bias
Privacy Preserving Techniques in AI
Innovative Approaches to Privacy and Bias
Future Projects and Developments
Closing Thoughts on AI and Privacy
Biggest Barrier to ML Adoption
Conclusion and Additional Resources