The Role Of Synthetic Data In Building Better AI Applications

Hello, and welcome to the AI Engineering podcast,

your guide to the fast moving world of building scalable and maintainable

AI systems.

Seamless data integration into AI applications often falls short, leading many to adopt RAG methods, which come with high costs, complexity, and limited scalability.

Cogni offers a better solution with its open source semantic memory engine that automates data ingestion and storage, creating dynamic knowledge graphs from your data.

Cogni enables AI agents to understand the meaning of your data, resulting in accurate responses at a lower cost.

Take full control of your data and LLM apps without unnecessary overhead.

Visit AIengineeringpodcast.com/cognee,

that's c o g n e e, today to learn more and elevate your AI apps and agents. Your host is Tobias Macy, and today I'm interviewing Ali Golshan about the role of synthetic data in building, scaling, and improving AI systems. So, Ali, for people who haven't heard your previous experience, can you give a brief introduction?

Yeah. Sure. Thanks for having me.

So I am one of the cofounders at Gretel.ai.

I'm the CEO prior to Gretel.

I was the CTO cofounder for another startup. We focused on the infrastructure

and security space, a company named StackRocks. We worked on, you know, managing and scaling Kubernetes infrastructure for various workloads.

The predominant one we started to see was around data engineering and machine learning. That company was eventually acquired by IBM Red Hat. Prior to that, I had another security startup. I was the CTO cofounder for,

named SyFord. We focused on endpoint security analytics, so keeping end users

so safe and secure. That company was eventually acquired as well.

And prior to that, I started my work in government intelligence agencies mostly as a data and infosecond signals intelligence analyst, on a variety of different types of projects, mostly related to privacy of data and cybersecurity.

And do you remember how you first got started working in the ML and AI space?

Yeah. So

my entire career has had

one consistent theme, which was hyperscale environments with a lot of data.

So in most of the use cases that, I have built around as a developer engineer and then eventually sort of as more of an entrepreneur,

they've all required some massive data processing

requirements, and machine learning just happened to be one of what one of the predominant drivers behind what we used. You know, it sort of all started with big data analytics, the whole term, and, you know, the rise of Hadoop and so forth. But I would say just because data has been an integral part of everything I've done as far as work goes,

processing and sort of feature engineering that data at scale was something that was just embedded into all my work. And

at Gretl, we decided why not just focus on that part of it.

And so

synthetic data is definitely a core piece of what you're doing at Gretl. But before we dig too much into that, I'm wondering if you could just start by summarizing and framing what we even mean by the term synthetic data because it can mean a few different things from I just cat dev null into

a file somewhere to,

somebody who manually generates a bunch of stuff. I'm just wondering if we can set the framing of the conversation.

Yeah.

Synthetic data as a concept and as a term is not new. It's been around for, you know, a good decade, decade and a half.

I think it really comes down

to how you produce that data to your point, which sort of, frames our conversation. So

what we focus on is

our view of

synthetic data is something that is

purpose built for AI use cases. Now, you know, what does that mean?

It means

the quality, the deep structural stability, the privacy and safety of the data has to be incredibly high, but at the same time,

it can create additional friction or sort of workflow issues for users.

In that context,

as we talked about, there's sort of different approaches to it. Sort of there's traditional ways of dealing with synthetic data, which is basically just generating large volumes of sort of fake data.

One area that that was very useful is is, like, test data management, QA and task. Like, I'm just trying to test the preproduction

service, and I need lots and lots of volume. So I can simulate some event, and that's all I need. There's sort of another

component that has become pretty dominant in the narrative, which is

really how let's call it, like, frontier model companies or large language model companies talk about synthetic data, which is taking a model

and having a large language model generate high quality data for training or fine tuning another model. And we've seen this in the industry on a number of different areas, and this is also why you're starting to see, you know, proprietary and open source companies have different policies and and terms of agreement when it comes to the usage of data from those models.

Our view is is that our goal is really to help enable organizations and enterprises to use the data they have

or get started with data they need, but don't have enough of it. So the way we approach it and sort of our view of synthetic data is our customers have typically one of two bottlenecks when it comes to data. I have the data I wanna use, but because of safety, privacy, compliance, regulation, sensitivity, I can't use it and give access to my users or use third party infrastructure,

or use it for fine tuning and customization of models. So we take a privacy enhanced approach to how we generate synthetic data from your data being the core seed input into it. And then there's another category which we can sort of expand on, which is, as an example, I'm trying to improve this large language model's ability to translate text to SQL code,

or generally text to source code. Large language models are not good for, you know, natural language to code generation translation. So that is a type of use case where an enterprise might be trying to, for example, build a very specific capability into a model, and they just don't have the data. So that's what we mean by synthetic data is essentially addressing those two data bottleneck problems.

As you mentioned,

the whole idea and use cases around synthetic data have been around for a number of years now. And I'm wondering if you can talk to some of the ways that the

capabilities

around the actual creation and the quality and validation of the those synthetic datasets

have changed across the,

I guess, boundary layer of pre and post LLMs because there's definitely a pretty substantial shift in the capabilities and use cases and the

the overall technique of actually generating that synthetic data.

Yeah. Absolutely.

So

let's call it sort of pre LLM and post LLM or not even LLM, just language models versus non language models. Also, partially because, like, when we started, we were the only company that started building synthetic data using language models, but we weren't even using LLMs. We were actually using LSTM in 2020 when we got started before sort of the,

LLMs became popular.

But before language models, really the approach was using

GANs or statistical models,

so like a graph based model.

And

the power there is they were very fast. They were efficient. You can even run some of them on CPUs. You didn't need GPUs.

But what those models do, and this is where sort of almost all other synthetic data companies started and and build their products from is

those types of synthetic data models and approaches are very good at finding the shape, the distribution,

some insights into your data. But they are not good at sort of putting together the deep structural stability, like the ripple effects that are caused if you move one variable, what happens somewhere else. So, like, if you have a shape in the distribution of the data, you can increase

one variable in your data.

But the model, if it knows the shape and the distribution,

doesn't know what other aspects or underlying components will be impacted.

So it turns out this is actually what language models are great at, is understanding that deep structural stability, that prediction of influence and impact and sort of next token prediction

is phenomenal for understanding in synthetic data, not only what is my data saying, but this is actually we hear this quite often is is that if I use a language model,

I can address use cases that one of our customers calls what if scenarios. Like, what if I had more of this? What if I had more of that? What if this happened in my data? So you can actually change

nonarbitrarily

the shape and distributions of your data. So what I would say is that the big change

in synthetic data with language models was going from being able to

make

predictions on previous datasets and doing something with it that is more, call it, rudimentary,

maybe machine learning and feature engineering, to being able to actually

augment real data and making predictions

and using it for generative use cases in AI. And I think that's the tipping point that language models created for us.

In terms of the actual

motivating factors for synthetic data? You mentioned a few of them, but I'm curious if you can talk

to also maybe some of the ways that

the purpose of synthetic data and its application has changed across that pre LLM or in post LLM

epoch.

Yeah.

So

there's there's a few things.

One is

one of the things we did very early on, we actually ended up pioneering this approach, was the application of differential privacy techniques to language model training at training time.

What this enabled was,

not only the synthetic data, but for the model itself to ensure it is safe, private, and it's not memorizing

any secrets. The reason that became very valuable is

one one approach here is companies who are trying to take

data and extract all of its insights without these attributions and teach a model about it. So a good example here is is, like, the analogy we use is is that in health care,

one approach of privacy techniques to synthetic data

enables you to teach models about the disease, not the patient. So that's ultimately one sort of massive breakthrough

in some of the new techniques of differential privacy and synthetic data, which is is that we can now finally teach these models

about insights about us, but not teach them about us specifically. I think that's a very important distinction. Sort of our view is is, like, we need to put all our private and sensitive knowledge into these models, but not teach these models about us personally. So that's one big area. That's again where language models have been pioneering.

The other part of it is

we do fundamentally think that data is the the bottleneck for AI to take sort of generational jumps forward.

If you think about the last four or five years, it has been building these massive frontier models on large volumes of public domain, messy, privacy riddled data. The next ten to twenty years as we're thinking about smaller language models, agentic approaches, more of sort of compound AI systems.

It's about incredibly

domain specific or task specific data that can customize a small model to perform

incremental

or perform highly specialized tasks in collaboration with other models. So what synthetic data is very good at, and there's a really great paper around this from, Microsoft, which is called textbooks are all you need. It describes, and this is sort of the core philosophy of our approaches,

synthetic data has the ability to communicate knowledge from data the way a textbook does, not a way like a Twitter post does. So what that means is you can use an order magnitude less data

to train model one fifth to one eighth of the size, and we have some examples of this with with the work we did, for example, with Databricks,

and still end up with double digit improvements comparable. So you're improving

how much data, the size of the model as a result, the inference cost, the total dataset. This means you can build much smaller and much more specialized models that in collaboration can perform significantly better. And in ways, when you start to then introduce techniques like reflection into them as part of an agentic system,

you can have systems that generate just from a prompt or from a conversational

AI interaction.

They can synchronously generate domain specific data for you just so you can get started. And that's sort of one of the goals we have is is that ultimately,

if you're a builder developer researcher, you can experiment

with hardware configuration,

architecture, GPUs, code. But experimenting with data is incredibly

slow, expensive,

and inefficient. Like, working with data is five orders of magnitude slower than working with code. If you can take a few orders of magnitude off and make that more of a life cycle, sort of a flywheel life cycle,

then acceleration

and adoption, we think, will really sort of improve across AI as well.

For teams who are

embarking down the path of incorporating synthetic data into their overall

AI system development and deployment strategies,

what are some of the

typical techniques that they'll reach for first and maybe some of the edge cases or shortcomings that they'll run into as they try to scale from initial prototype of, hey. This looks like it kind of works to now I need to run this at scale for production use cases where I have continuous data needs.

Yeah.

One of the common friction points we come across in the market also has to do with the fact that the market in itself is very nascent.

So when you're doing, for example, synthetic data the traditional way, you know, I'm doing QA and test. I'm doing test data management.

I'm just redacting or, you know, removing PII. I'm not actually, you know, making the data differentially private.

What you're doing at that point is is you're looking at a large corpus of data. So you're looking, let's say, at a lake house or a data warehouse you have, and you say, I just wanna make that data synthetic because then I can just push it and test with it and do all these experiments. And that's all good and well.

However, when you're getting into the sort of the the use cases you're talking about, especially when you're talking about alignment with AI use cases, especially at scale,

in these types of scenarios, when you have for example, you're a financial institution and you're trying to improve fraud detection

or you are in a health care industry and you're trying to use EHR data to predict hospital stays and patient stays in ERs,

you're not taking your entire corpus of data and just synthesizing it and using it again. At that point, what you are doing is is you're taking your data,

and you are synthesizing it for a very particular use case. So what you want in some cases is you might wanna highlight and actually

improve and increase the diversity

distribution of what fraud looks like in your data because that's the signal you're looking for. Or you are looking, for example, for very specific indicators

that help you make better inference point decisions. So the point being is is that in, you know, sort of traditional use cases,

when you were going lower level environment, you just took all your data and you made it synthetic and used it almost the same way. Now that you're trying to sort of go up and more into production and make forecasts and predictions,

the data you transform and synthesize has to be synthesized for a

very particular use case, which means how that pipeline generates the data, the configuration that tunes the the epoxy run. For example, the privacy quality and evaluation

datasets you generate to measure it all look very different. So it's almost like the the input might be your data, but the output is specialized, customized data for the use case you are trying to improve.

For that use case perspective,

I think another interesting

challenge is when you're dealing with domains where you need very specific

factual grounding for that information. I'm thinking in terms of things like scientific or educational domains where

you want to provide some reliable data

as the input to something like a RAG system that an end user is then going to interact with in a maybe a a tutor oriented or an educational

context if they want to be able to learn more, understand more about that specific domain. So you wanna make sure that the date that the data you're feeding into it has,

an appropriate grounding. And I'm wondering, is that a situation where

you just can't use synthetic data or just

some of the ways that you need to think about the ways that the synthetic data is applied or sourced for being able to feed into those particular types of domains?

There are a lot of answers there. So maybe just sort of breaking the the question apart a little bit is the problem you're stating is very real in synthetic data, and it's actually where originally synthetic data really struggled because a lot of the traditional synthetic data models are what we call random weight models, not pretrained models. So the first time they were learning about the data, they were learning on your data, so they had no context of the larger world. This is also where sort of most synthetic data companies came from. That was problematic because

what good or bad looks like in that specific domain data was not very, you know, sort of apparent.

Language models improve that because if you ask a lang large language model like this is health data or talk to me like a financial adviser or now talk to me like a radiologist, it has some basic understanding of what that means. But, ultimately, this is where, you know, sort of our view is is that the way you solve that is not at the model level. The way you solve that is using compound AI architecture. So you have multiple models that have

very special domain expertise.

But there are just some cases where there like, synthetic data is also just another tool. Right? It's not for everything. And the use case I would say is is that if you, for example, are using synthetic data,

you have to make a decision between utility and privacy.

If you want mathematically provable privacy,

it means the data you generate can never be reversed back to an individual.

That also means if you're doing, for example, drug trials

and you have individuals in the trial, if you synthesize their data, you never know who the original person was. So you can't rehydrate the data. So then some of those types of use cases are not a good fit for synthetic data that is, you know, mathematically provably safe.

Now in the case of some domain specific work, what we have found is there is no sort of silver bullet, but then sometimes you have the opportunity like, we did this with a company named Illumina in the genome genomic space, where they do a lot of work in genotype and phenotypes.

And we work with them, and based on sort of open datasets that they had and large volumes of nonsense that the data they had, we could fine tune models

that understood better what genotypes and phenotypes mean in a language model. At the end of the day,

no sort of standard generic model is gonna be perfect. This is where we do think there is some level of customization

with the with the company's own data needs to be done. And the way to do that is you need to ensure that that is being done in a safe way so the data going into it is not memorized. It's not raw data. It's not just redacted or anonymized. Like, there are differentially private guarantees.

But

this is where I would say the market is still somewhat fragmented as to, like, what is it that you're trying to accomplish and then sort of take the associated approach.

Digging a bit more into that architectural question

from

a naive perspective,

the way that I would think about synthetic data generation is, okay. I need to populate a bunch of data for this one use case. I'm going to click the button and say, this is how much data I need. It spews it out into some sort of file format, and then I go on my merry way and do whatever downstream work needs to happen. And I'm wondering if you can just talk to

some of the ways that the overall life cycle of synthetic data

is managed and some of the methods for being able to integrate it more closely into the overall engineering life cycle of AI applications and AI model development or fine tuning or agentic structures?

Yeah. What you're describing is certainly one of the flows that exist in synthetic data. That particular workflow, we tend to see more when it's companies or users who are doing more experimentation

work. So one ops trying to validate something, proof of concept, test something, things like that. The sort of the other side of it where most of our focus is is in this model where we call always on. Basically, I just need data all the time for whatever my use case is, and these are typically more production or, you know, trying to move towards

serving the model, not just training the model.

In those particular cases,

there's a few things that we've tried to do, and our view is is, like, synthetic data is just data. Right? So what you don't want is another layer into the existing builder developer's workflow researcher's workflow that they have to go sort of request data.

So the approach we've taken is, on one end, build all these native connectors to things like

BigQuery and Microsoft Azure Fabric and s three and Redshift and Snowflake and, you know, Databricks as various solutions

where

the builders one time integrate us into that source. And then they can run schedules or build entire workflows that says once a day, once a week, once a month, I wanna take this type of data at this time, synthesize it, and pump it into this system. And the output of that could be directly basically then fed into things like SageMaker or Azure OpenAI or Vertex for model fine tuning.

And, eventually, that model can then just go into, like, something like a mass or bedrock and just be served. But the way we think about it is is that what are all the touch points that it sort of a builder needs from the moment they request the data to fine tune a model or customize a model to then generate data to measure the the sort of the capabilities of the model, to then measure the gaps and regenerate data, and how do those injection points happen and just make those a natural part of the overall process. So very much like an SDK API integrations into the building development workflow for for builders rather than

breaking their flow and saying, like, here's something manually you do. You download it to a file, and then you sort of take action on it.

From

a scaling perspective,

as you're going to that production use case, what are some of the challenges or complexities

that are

not at first obvious when you're first starting with, I'm prototyping. I'll generate a bunch of data, whether it's using Gradle or,

you know, a a faker script in Python or something, and then saying, now I actually need to

run it at production scale and just some of the validation and quality checks that you need to be thinking about to make sure that you have an appropriate distribution, but not necessarily a lot of duplication or that the values are, you know, within appropriate ranges or actually are sensical and not you're not just spewing gibberish into it?

Yeah.

There's there's quite a bit to think through. I mean, I think

nobody would argue that operationalization

of AI tools is by far the most challenging part of it. And I wouldn't even go as far as say, like, most can probably do some of the things they claim. It's just actually being able to run them in an effective environment that is that is the most difficult part of it. And the reason is is that,

like, we, Gretl, are one part of this. Right? But for a synthetic data tool to be effective,

to be perceived valuable, and then to deliver value,

you have to have a relatively

mature

platform for, you know, how you manage your data to your ML ops, to your fine tuning, to your serving. So you have to have some of that infrastructure in place.

So what we have typically found is

the scale part or the operationalization,

some part of the system that they are building, let's call it these, like, AI specific

applications or integrating AI into these new services,

have a reliance on some either traditional system that doesn't support them or has certain infrastructure

requirements that they cannot meet.

Or as an example, like, the data reside on prem, but they're trying to use,

cloud providers,

manage stack for it, and migrating that data is the complexity.

So what you find is is that in in very small scales, proving something is much more achievable.

It's all the edge cases that really arise from it. Like, in some cases, it might be that

you can't even get, for example, like, reserve instances GPUs, and as a result, you just can't process something. Like, we've come across that for some larger company. So the problems vary. I would say, though, in a lot of the cases, what we see are pretty common patterns, and those patterns

do tie somewhat to the fact that there hasn't been a lot of consolidation,

maturity, and growing in the infrastructure and the enterprise readiness of how people operationalize AI right now.

In terms of the

effect of synthetic data on model performance,

there's also the question of, well, if I'm just sending a bunch of random data into it, eventually, I'm gonna start cycling through the same values and the model is just gonna start getting dumber.

Or if I'm using the model to feed you know, using the synthetic data to feed into the model, then it's gonna start drifting for in terms of its understanding of the world because then maybe the random data doesn't have enough contextual grounding for that fact based or evidence based use case. I'm wondering

some of the ways that you think about mitigating that in terms of the

use and judicious application of synthetic data versus mixing in real world values to make sure that there's an appropriate

balance between the synthetic versus the naturally generated datasets?

Yeah.

There are a number of different techniques you can take. I can sort of focus on some of the things we do just because I'm more familiar with them. So one is, you know, almost at the core premise of Gretl, which is

we focused a lot of our efforts on privacy

in large language model and small language model fine tuning and training. The reason for that is our view is if you're gonna keep generating more and more synthetic data,

you need and we are enterprise focused, so that's sort of an important part of this, qualifier for this answer is

our view is you have to, as an enterprise or an organization,

use some of your sensitive data

to act as guardrails and seeds for how we generate more synthetic data. So we can't just start from absolutely nothing. Unless, again, it's like, I am trying to improve the SQL coding capabilities of this model. That's different. But if it's that I am a radiology

company and I need x, y, and z to be represented in my data,

then we we have taken approaches where we can use your data safely as a seed and guardrails and then use that to scale. Now so then the question becomes, okay. Well, how do you scale that and not repeat itself?

There are techniques

such as reflection that we talked about now, some, advancement of data to cognition techniques,

which enable then the models to further reason. And, actually, what they do is this enables when you're using a compound AI system

to increase diversity of data generation, not arbitrarily on the edges. So you get more diversity as you scale the data, ensuring you just don't end up with this back middle thing. And then the last part of it is this is where we think just at the same time evaluation is a big part of it. Like, right now, there's a lot of evaluation metrics and and frameworks, and a lot of them, quite frankly, are just

very biased or one-sided or just validate one thing or another. So there's an enormous amount of space to be built into there.

We're trying to do some work just to sort of be able to validate some of our own work, but I would say those are the three pieces. You know, really

comprehensive approach to evaluation of the output of the data, especially around utility,

using techniques like reflection, you know, migrating into things like cognition techniques, and then using privacy techniques like the differential privacy

to use your data. Like, those things used in combination,

tend to then improve that ability. But, yeah, we like, I don't think you can just take, you know,

a 500,000,000,000

parameter large language model. Just keep generating data over and over again and just fine tune prompt tuning it just so you can get good data. Like, there will be some catastrophic collapse at some point.

On the other end of the spectrum of

maybe I'm just trying to prove out the idea for some AI use case. I have my model, but I have no data. So you've got the cold start problem of I can't gather data because I don't have anything to run to actually generate data naturally. And so I'm just wondering some of the ways to be thinking about the application of synthetic data in that use case for just prototyping and proving out an idea or

bootstrapping some product that you want to eventually have be a production use case? Yeah.

So this is actually sort of the second problem we solve. It's our core platform helps do this. It's a product we have called Navigator,

where it generates for you

sort of synchronous

small datasets

and then asynchronously

large datasets just from a prompt or conversational. I need EHR data, and I need to have these fields or these columns. I needed it to look and sound like this, especially in tabular format.

And that under the hood is actually our compound AI system.

It's a number of

small language models that we have fine tuned on particular domains or specific tasks that in sort of working in collaboration then can give you data if you have the cold start problem. Now that data is never gonna be a % good enough. It's probably, like, 80 to 90% good enough, but it is certainly

quality of data is where where you can build an application,

where you can POC it, you can validate these. Does this work? Will it run? Will it scale? And then be able to really sort of accelerate your process. Where that data is good enough and in some cases even better than raw data is, if you're trying to improve what we call system level capability. So the model to write code, to speak to an API, to be able to write more declarative functions,

Those areas, like the work we did, for example, with with Databricks, we we demonstrated that we can actually improve the model's capabilities by 60%

in four weeks just by generating synthetic data from essentially an expert prompt into our agentic system. So, again, these are all sort of improving fields. Like, the the the I think

the most promising thing is not to say that all these things are ready for sort of showtime and at scale and sort of expected utilities.

I think the good news and the most promising part for us is we are seeing month to month significant improvements on the outcomes of all these techniques and the, the applications of these techniques

because everything is sort of getting better across the board. So digging a bit more into

the work that you're doing at Gretl, I'm wondering if you can talk to some of the ways that you've thought about the

system design and architecture design of what you're building and some of the ways that

you have to plan for evolution and adaptation as the ecosystem around you continues to grow and proliferate and become fractal?

Yeah.

Feels like every three to six months, there's some massive shift under your feet, and you have to adapt to it.

There were a number of decisions we made early on, and then there were a few along the way that we made that have all been sort of in in the pursuit of making the company as antifragile as possible.

And some of those things are, you know, like first principles, we didn't wanna be a model company. We actually always viewed models to be either something that we could commoditize or will potentially open source will catch up enough that it will be good enough. So what we decided to do was focus more on

the system architecture or system design versus the model. So this is where

our bet was that if we you know, this was agents and chains before they just called them agentic systems, but our view was that if we can basically have a system where we can use any model

and we can fine tune that model quickly

and we can replace that model very quickly,

we will never be reliant on a model. Because what we didn't want is one day a company showing up with this massively powerful model and just leapfrogging us as a company.

So what we did is is that, one, we invested very heavily into privacy techniques. So if we use any model, we can have that model talk to your data in a safe way.

The other thing we focus very heavily on is under the hood from an engineering standpoint.

We build infrastructure that allows us to grab models, fine tune them, test them, generate data, and then retest them to a point that in the last eighteen months, we've gone from about being able to run four to five of these experiments to per week now to over a hundred. So that velocity means we can at any point, we have an automated system that looks for best of breed of that model, the newest version, or the comparable version brings it down, compares it against ones we have, fine tunes it, measures its outcome. If it's better, it replaces it. So and then we offer these things called model suites where a customer can say,

I'm just trying to test something. So if I'm using, you know, call it OpenAI models under the hood, it's fine. Otherwise, they might say, well, I need an Apache license because I want the data to be used for building purposes.

So we've also tried to make it very simple when you're using our system to understand that you own and can do what with the outcome data. Like, can I train another model with this, or is this something I can just use for experimentation?

And then the final thing we bet on was from very early on because we didn't wanna be a model company is multimodality.

Our view was that especially when you look at enterprises,

data gets very messy, very fragmented, and typically data in large volumes becomes in tabular form, but it has

underlying time series components. It might have free text or unstructured text. If you go into health and finance, there's images attached to those. So multimodality

was a big part of this. So I would say sort of everything we have done to make

design decisions in the company sound and long term

have been built on

system design that enables us to use any model and very quickly

and privacy techniques that allow us to touch your data so we can use it as context. And those two things merged sort of expand the ecosystem of things we can work with. Now some of the things that become challenging, like, for us is the operationalization

part. So how do we natively fit fit into all the cloud providers, platforms, and data platforms, platforms, and every emerging AI tools ecosystem that a customer wants to work with? That's a little bit more complex in a sense that it's not

difficult work. It's just a lot of volume of work, and it requires some maturity so you have very little control over it. But, you know, everybody's fighting that fight right now.

To the point to you mentioned multi modality.

Another thing that we didn't really discuss yet is the

structure and format of the data being generated and some of the ways that you need to be thinking about that as you're planning

what data do I need, what shape does it need to be in, and how do I need to then process it to be able to have it ready for use in these different AI contexts where maybe a lot of that is generating an embedding of that text data or just some of the variations of structured versus semi structured versus unstructured data generation that you need to be thinking about for the different use cases of AI applications.

Yeah.

At the highest level, when we sort of started the conversation

at Gretl,

we were trying to back into certain use cases and the things we were good at and language models were good at. So, naturally, when we initially thought about the limited scope of modality, we thought about it in essentially

categorical, numerical

sort of free text type thing. So,

obviously, free text, natural language is one part of it, but then you can't really just focus on that. So this is where tabular modality, time series,

relational were other things we brought in because they're sort of compound value add. The way the data gets stored is all sort of numerical or categorical in these environments. It's just the shape and the structure of them that changes.

When you then take steps towards, for example, like, image or audio or video, then you're talking about entirely new models that are built for that, and it's the interoperability

that makes it a little bit more difficult.

So the decisions we've made on those is what are modalities that have compound value add for us. So when we look at some of our core sectors,

you know, automotive, robotics,

finance, health, gaming,

while things like tabular, unstructured, free text, time series, relational are important,

The next sort of critical thing we saw was image. So we have an image synthetic that's in beta that's gonna be GA soon. Those go hand in hand. So, like, a radiologist has unstructured doctor's notes with a bunch of tabular data about a patient and actually, like, an imagery work.

And if they wanna do large scale sort of analysis on these, all that has to be synthesized.

We're still in the process of sort of trying to figure out how audio video fits into these. We don't even know if those are sort of capabilities that at this earlier stage you can mix all under one because especially because we, like,

unlike a, you know, chat GPT

or a, you know, Gemini, we're not consumer facing.

So the the way we think about modalities purely in the context of enterprises, not in the context of how an individual consumer uses our product.

And in your work of building Gretl, working in this ecosystem

of synthetic data generation with a focus on these ML and AI use cases, what are some of the most interesting or innovative or unexpected ways that you've seen that capability applied?

Yeah.

The the the there were some

some really crazy ones,

especially in, like, 2021,

'20 '20 '2 ish when the crypto industry was taking off, and they just, like, had infinite amounts of money, and they just thought, you know, every tool could solve their their use cases. So they they just turned out to be some really, really dumb ones I won't get into. But the most some of the most interesting ones that we think are actually valuable.

Like, as an example, we work with government of Australia's health system.

And one of the most interesting things that we saw that they're implementing now using differentially private synthetic data is they have physicians that work in hospitals, but because of their regulations and policies only have data that is accessible within that hospital. They couldn't go across hospitals. But what they didn't want is is to give the, like, the doctor a snowflake environment and go say go query it. So what they wanted to do was actually have a large language model builder, like an AI assistant that can talk to all the data and the physician could just ask it questions.

So they're using us to create differentially private versions of synthetic data from all the hospital systems and then use that corpus to train the model on. So when a physician asks the question now, instead of handfuls of data samples, they have hundreds of data samples. And at the same time, if they don't have the expertise

to solve for the problem they're trying to do, they can take a synthetic version of that data and work with research institutes or academia institutions

who might have postdoctoral researchers focusing on that to get better results. So

synthetic data in health, we think, has enormous capabilities

and and and reach.

Some of the other ones we've seen that is super interesting are companies that are looking to build,

AI agents for users on the endpoint,

but the companies that we're working with don't wanna necessarily give companies access to the entire user's endpoint data. So making sure

the endpoint data, these apps or agents see is just the synthetic version of your usage data is another one. It's sort of back to the analogy of teach them about behaviors, not about the person,

because, ultimately, what you don't want is is this person to know Tobias. They just you wanted to be able to help Tobias.

And then there there are

some some really interesting ones we've seen about simulating, for example, or creating diversity and what, like, toxicity language looks like so they can be determined much sooner in gaming forums and environments like that. Because

what what we found is synthetic data is really good when you put a model into production and that model comes across a new pattern that it has never seen. What happens is that model drifts off course. And in a lot of the cases, now you have to take that model, find that new pattern,

train it on it again, and then load it, which this sort of drift has enormous cost, especially if you think about it in the cases of health or fraud.

Synthetic data is very good to basically see that small sample dataset, generate more of it, quickly retrain the model to prevent that drift. So

that has been another cycle. And then one last one that we particularly found interesting is is that

some financial institutions or hedge funds using it to create these, like, black swan events in their data just to stress test their systems and see when they tip over. I think they're all trying to figure out how do I prevent myself from becoming, like, the GameStop sort of receiver end of of these types of things. And,

that sort of what if scenario that you can prompt into your data using synthetics generated with language models is actually quite powerful for that, where you can generate scenarios and see how your systems respond to it.

Yeah. I can definitely imagine that being very useful, particularly

if you're saying,

I want to enter some new

product area. We don't have any prior experience

actually with working with those consumers or that particular

industry or vertical.

So now I'm going to generate a bunch of data that looks like something from that area so that I can see how my systems are able to actually work with that and some of the different

engineering effort that I need to do to be able to prepare for that before we actually launch and decide and find out that we're actually just gonna fall over.

No. %. It's actually a really interesting point. We have a blog about this that we wrote. So back to the language model's sort of deep structural stability understanding of your data, we actually have customers where they use this, for example, for consumer goods and predicting sales where

they might have a lot of data for a particular region, like, let's say, San this is a real blog we have about San Diego at a particular time of the year, and they're trying to say, well, if I were to sell the same goods in Tokyo at this time of the year, what would be the response?

So they use their own data as a seed, but because agentic and language models are good at understanding

contextual behavior of how people operate in Japan, what do they typically buy, what are the seasons look like. Those two things merge actually can make these types of predictions that turned out to be incredibly accurate.

So it is a way where we sort of describe we actually have this term we use for this is that reducing economics for data acquisition.

Like, if you're building a new service, if you're Uber and launching into a new city, you might have to go there six months and learn to your point all the pitfalls

versus at least, like, starting 90% good enough and figuring out the last mile versus the whole, like, you know, 10 miles.

And in your experience

of building Gretl, working in this space of synthetic data generation and its application to different data and AI systems, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

Man, challenges. It's like infinite and not ending.

So

I think the biggest learning lesson

in the last four years working in the AI space has been,

how do you continuously build product market fit? Because it feels like every six months you need to adjust for product market fit because of how things change from

deployments

to implementation to even pricing strategies.

So that's one. The second part of it is

the

the war for talent is real.

Majority of the new techniques or types of tooling and work that is being done is there aren't folks that have more than a handful of years of experience, and most people have no experience. So sort

of not only bringing talent, but when you build talent, you're building a you're investing a lot of knowledge and inherent sort of value into them and then retaining talent. So talent is a real

critical sort of tent pole of this whole thing.

The other part of it is

there is

like most industries at sort of nascent

emerging space, there is a lot of gap

in industry knowledge for what good means. You know, what is good data? What is private enough data? What does it mean for me to perform well? Like, all these types of questions that for enterprises to bet

millions, not tens of thousands of dollars on a product or a platform, need better answers and more clear answers. So the validation process for a lot of those types of challenges to commercially validate and and prove something with enterprises

is not very repeatable yet. There's a lot of white glove sort of services component that needs to come with it. So I would say

the the thing that all this

collapses into is

that

real enterprise sort of outside the chatbot level, that real enterprise repeatability

motions

that can be consolidated and then asymmetrically scaled. Those are sort

of at the core of all the challenges we tend to see and I've been experiencing sort of front lines.

And what are the cases where the use of synthetic data is the wrong choice or

maybe you just need to have a

Python dictionary that you're using something like faker to just spew a bunch of data into, and that's good enough, and you don't necessarily need to reach for the industrial strength language model to generate a whole large volume of various data types?

Yeah. So, you know, if if you're doing QA and test, you're measuring, you know, TCP packets, or is this thing gonna get DDoSed if I get this amount of traffic? You don't need our systems. It's overly expensive to try to do that. It's like, you know, trying to use

a laser to light a cigarette or something like that. There's much easier ways.

But I would say those where the quality,

the privacy, the utility, it's it's not it's not critical. It's not paramount. It's just the volume of data you need that you just don't have otherwise.

Those are things that are overkill for our approach or using language models of sort of any kind for that approach. It's just overly expensive.

The other side of it is any data that needs to be reverted back to its original.

So let's say you're doing

drug trials and you wanna, you know, for example, have thousands of patients, you collect their data, you wanna do some analytics, you wanna make some predictions,

but you need to take that data and reverse it back and say, like, all these things happen, all these predictions were made.

If I go back to the user or to the patient in this case, did the outcome match my predictions?

So if you and there there are ways to do that with synthetic data, but, ultimately, synthetic data is meant to sort of be privacy preserving in that sense. So those those use cases that are two way doors

maybe are a better fit for anonymization,

deidentification,

transformation

versus going full differentially private synthetics. Like, our approach

at that level is a one way door. You know, I wanna be able to train a model, fine tune a model, potentially share my data with somebody else, not be worried about compliance. And in some cases, we're even talking customers who are, like, trying to monetize or sell their data in that way. So it's about, like, do you want it to be a one way door or a two way door that defines, you know, what level or type of synthetic data is the best option for you?

And as you continue to build and invest in this space and keep tabs on the evolution

of the AI ecosystem, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to explore and invest in?

There there are a few categories. One, the biggest category we are focused on is it may not sound terribly exciting, but it's actually like enterprise readiness that we're focusing on. So over the last, you know, little while, we've been focusing a little bit more on making our product easier to use with native integration. So we spent the last quarter roughly

building a lot of native operational in integrations into things, into the entire stack of Azure, into,

AWS, and in some in sort of other platforms as well.

That has been very promising because we see now customers coming faster, transacting faster, contracting faster, moving through the pipeline faster.

The other areas that we're super excited about are some new privacy techniques that we've been looking at that enable the same level of mathematically

provable output of the data, but they have

significantly

less overhead as far as computational costs go compared to differential privacy.

So there are sort of better techniques you can use there. We are also starting to see much better evaluation

frameworks where it allows us to be able to build

domain specific or task specific models much faster with sort of better guaranteed outcomes.

I guess the area that I'm sort of particularly,

you know, super interested in

is the techniques that we are starting to implement into synthetics like reflection. So you can take

a system like ours and generate synthetic data

that can teach any open source model to reason, like an o one model. Now not quite at that level yet and not quite at that scale yet, but we are actually demonstrating some of that early steps of that being able to do this. We wrote some blogs about it showing actually some better outcomes,

than some comparable models that were trained, using traditional techniques out there. But that to me is very interesting because if we can enable most customers to just be able to adopt a synthetic platform

and using their data, teach a model to reason about their data, not generically reason, that is a very powerful tool to open, especially if we think about, like, individual researchers or small organizations, things folks who don't necessarily have that level of expertise or resources.

That is a very

democratizing

approach to sort of leveling up everyone's capabilities.

And are there

any other aspects of synthetic data generation, the work that you're doing at Gretl, and the application

of this source of data to AI systems that we didn't discuss yet that you'd like to cover before we close out the show?

I guess the the one area I would sort of talk about is

how, like, how governance and policy is starting to impact some of this. So one of the things that we hear quite a bit from our customers, one of the reasons a lot of our enterprise customers look at synthetic data from a sort of an adoption standpoint

is their view is is that, like, if I'm in a regulated industry,

if I forward five years, if I'm a CIO or CTO or CSO, I'm worried that I'm gonna have to deal with hundreds of compliances and regulations

from my industry, from my state, from my country, from my region, if I'm in the EU, like, 15 different ways it's gonna apply to me. So I can't be on this treadmill of, like, every month there's a new regulation and I have to go retroactively

check and forward check and validate.

So synthetic data helps with some of that.

But I actually think the the more interesting part of it is how do you apply some of these things more ubiquitously?

Like, as we know, there's regulations that California, Texas, Utah are trying to pass for, you know, AI ethics and governance and best practices.

Frankly, things like this at the state level are a terrible idea. Like, you don't wanna have encryption standards that are across, you know,

you know, hundreds of countries and, you know, dozens and dozens of states. Right? Like, these are the types of regulations, policies, and protocols you want very much standardized.

So I think the the most interesting part of it that is maybe noninteresting

technically is

there does need to be a push for regulation and some policies around privacy in data. Synthetic data is just a subset of that, but data that is used in these environments

and how that should be controlled, how it should be

managed so it's sustainable because I think the current approach is not a sustainable approach to data usage. Essentially,

it feels like

AI in a way

is taking what social networks did at the beginning, which is the user was the product they were selling to advertisers.

AI is sort of doing that with your data to basically sell to other users. So there needs to be a more sustainable approach, and I think there is room for policy and regulation to help with that.

Yeah. That's definitely a very

valid,

insight and one that I think we should all be thinking about as we continue to iterate on these systems and build them out is what is the actual

model that is sustainable

both

economically

and ethically

and kind of,

in a,

societal

factor as well, particularly in light of some of the recent,

insights that we've gotten into OpenAI's pricing model.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling technology or human training that's available for AI systems today.

I think the biggest gap right now is

around evaluation,

evaluating model outcomes, evaluating data, evaluating quality. So it is really

the the gap I see

that takes us from

training, pretraining customization to serving inference time compute. It's it's that chasm we are trying to cross right now, and the biggest question is really around all these aspects of quality and good looks like and abundance looks like, and it is very unclear. And that is one of the areas we're chipping away at because it sort of contributes to everybody's success. But I think that is

that is, one of the one of the pieces left that if solved for can really accelerate adoption

and measure of value for these AI tools.

Well, thank you very much for taking the time today to join me and share your experiences

and your expertise

in the space of synthetic data generation and ways that we can apply it to AI systems is definitely a very interesting and important problem area. So I appreciate all the time and energy that you and your team are putting into making that more tractable and available for the people that need that, and I hope you enjoy the rest of your day. Yeah. Thanks for having me. I appreciate the time.

Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management,

and podcast.in

it, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at the machinelearningpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at themachinelearningpodcast

dot com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

AI Engineering Podcast