Strategies For Building A Product Using LLMs At DataChat

Hello, and welcome to The Machine Learning Podcast. The podcast about going from idea to delivery with machine

learning.

Your host is Tobias Macy. And today, I'm interviewing Jignesh Patel about working with LLMs.

From understanding how they work to building your own and some of the interesting challenges that they pose when incorporated into products. So, Jignesh, can you start by introducing yourself?

Yes. It's nice to,

talk to you and your listeners.

I am a professor in computer science at Carnegie Mellon. I've been working in the area of data for

close to 30 years now from when I started in the field of computer science as a grad student.

And I've been through all the waves from the early days of the big data

revolution to

mobile to

initial days of machine learning, then data science to where we are now, where Gen AI, LLMs, and other techniques, foundational models have completely taken over. So seen the whole gamut

and super excited to be where we are at this point in time because it's really, we are here today because it's at the intersection of not just all the data stuff that we've built over time, but but also the huge hardware advances and algorithms.

So really delighted to chat with you.

And do you remember how you first got started working in machine learning?

Yeah. So if I go back to my first touch with machine learning,

it was when I was an assistant professor

at Michigan. This was in 2000,

and the human genome revolution had just started. And I decided I'd I got my PhD in building parallel data processing systems,

and even had a startup that was acquired by Teradata,

which is part of my thesis work, and then went to Michigan and wanted to do something different.

And the human gen genome revolution had just happened. We had the 1st sequenced human genome, and I wanted to work on a broader set of data applications,

including things around genomics and proteomics. And all of those are very data rich fields.

And that's when I started working with data in a broader sense.

And the first thing you start to do when you think of data at a broader sense is you have to find insights from it, not just through the structured lens

of SQL and other more,

very

specific types of programming paradigms,

but you have to start applying more general ways of getting insights from patterns where you're saying, let the machine decide or help me with very little guidance

and surface patterns. So I'm not asking for something and getting a response back, but I'm effectively saying, tell me something interesting about this data by only defining certain parameters of of what is considered interesting.

So machine learning started for me back in those days mostly as a way in which to apply machine learning

and apply machine learning on large volumes of data to get that insights.

And, obviously, the machine learning methods back in those days, if you talk to any of the machine learning experts, they would have said, SVMs were the

the algorithm and the framework to use, and everything could be solved with that. We obviously know things have passed, so you go through waves. And what was hot now is cold now, and maybe what is hot today might be cold, you know, a decade from now. So it started very much early on about 2 decades ago when I started to look at data in a broader sense and look at problems at a broader sense. And, of course, you hit for machine learning

tools right away.

The current era of machine learning and AI is definitely being dominated by these generative AI capabilities. At least the conversation is being dominated, whether that actually constitutes the bulk of the work is debatable.

And I'm wondering if you can just start by talking through some of the ways that you are working with LLMs currently

and the role that generative AI plays in the work that you're doing, whether that is at CMU or I know that you're also running a business as well.

Yeah. I'll talk about both those fronts. So we are starting to look at,

using,

in general,

these

interrogative tools of which Gen AI is 1 part. But as you alluded to in your question,

machine learning methods are still super important on the types of problems that are still super important for businesses.

So at CMU, we are looking at some relatively new research projects that try to look at structured plus unstructured data, which are becoming quite popular in things like lake houses and being able to ask general questions of them. And the main theme is to allow

nontechnical people or slightly technical people to be able to ask questions without having to become experts in multiple programming languages.

A similar theme applies to data chat, which is my start up, where about 30 of us are building

effectively using LLMs to point to structure data. So it's a little bit more focused over there, but, obviously, the deliverables are much higher. Literally, if I've got tabular data, whether it is sitting in a data warehouse or,

sitting in a bunch of CSV files or parquet

files or some combination of that, can I just point to it and ask questions and get responses back from them? And LEMS are, of course, a critical component

of that process.

But when I get response from them, here's the interesting part. I want to get a response because the user in this case could be a nontechnical user, could be a business user.

Not only do you want to give them the correct response, the answer to the question, which might be why do my customers churn? For example, when you point to a customer file in which you have information about who's churned and who's not. But along with that, you also want to tell them

why that answer is correct in ways that they can understand.

So transparency and reproducibility is a critical piece for the type of work we do at data chat. You can think of data chat as

automating data science and so people generating data science notebooks

by writing

cell by cell programs in your favorite environment like a Jupyter notebook environment. You just ask a question, and they all get filled up. And it was 1 of the very early papers my students and I wrote in 2017

where we imagined this data science world being completely changed by just asking questions in natural language and getting the code generated. At data chat, we've gone further where we generate the insights and then give you the response along with an explanation.

That explanation itself

comes in English. It's an intermediate form of programming language, so we have own special programming language, which is a subset of English. It's like a programming language that you may never have seen where the syntax of the programming language is in English, which means

when in data chat, you get a response along with an explanation of that response, That response is actually a program that anyone can understand because every sentence in that is in another natural language,

which which is invented here at Datachat.

When working with large language models and generative AI more broadly, what are some of the business challenges that are involved in building on top of these capabilities,

particularly in the case where you don't wholly own or control the model that is being interacted with?

Great question.

We think of it as a toolbox in which the LLM is effectively a black box, and we learned that lesson the hard way. So in data chat today, if you ask a question against your data, you get a response back. But the processes

that get carried out in a platform to generate that response

has many steps. It's about a dozen steps now. And 1 of those 12 steps is going to be a call to an LLM.

The rest of the magic comes from everything around it, and we'd have to treat that LLM as a black box. In other words, the way we treat that at data chat is we say, can we pluck this LLM, put something else in, and have to do minimal and perhaps no other change to the rest of the machinery

to be able to get the response back. And we did it not because we we we

planned for it that way, but we learned some hard lessons. If you remember,

earlier in this year around the March time frame,

OpenAI, we were using their LMs, and we were using codecs, which they were serving off the OpenAI

API.

And literally overnight, they said we are going to stop supporting it. So 1 of those 12 steps that we have was a call to an LN. It used to be a call to codecs. And now that thing is gone, replaced by something completely different.

And because that whole space was evolving so fast, we had built a little bit deeper integration into that codex machinery even though it's a black box, then we needed to since we've completely abstracted out, we can plug and play models, pull them out. And what that means is you put a lot more emphasis on the engineering

algorithmic infrastructure around this black box,

which then allows us to build a more resilient platform and allows us to do all kinds of interesting stuff where in 1 environment,

our black box could be

Azure's OpenAI. In another environment, it could be LAMA, the, you know, completely open source. In another environment, it could be Google's vertex,

and very little to no changes needed in the rest of the machinery. So some hard lessons learned in terms of what you need to do, and I think that's a common pattern. If you're taking a dependency on an LMM in a complex engineering pipeline,

your smart

better not just be the LMM

because

if you're building a product, the differentiation from making a call to an LLM and giving a response is nothing because everyone has access to an LLM.

It better be that the value it comes from all the other stuff that you do around to deliver

a unique product experience of which that LLM is just a black box. It's 1 of many toolkits and how that comes together has to be the distinctive value add that you provide to that end customer.

And that aspect of making the specifics of the LLM pluggable is something that is reflected in some of the open source tool chains. The 1 that comes most readily to mind is langchain,

and I know that there are a number of others out there. And from your experience of actually building on top of some of these LLMs and generative AI capabilities, I'm curious

what your experience has been as far as the level of utility of some of those existing

abstraction layers and how much of it you had to actually build yourself because of your specific requirements?

Yeah. We've played with a lot of these stuff. And, of course, they've evolved continue to evolve quite rapidly.

They allow you to build something

with an LLM being called out in a far more robust way, be able to deal with handling of messages and stuff like that. That's just 1 piece of the puzzle, though. The some of the hard stuff comes

with what happens if I make a call to an LLM and it gives unpredictable response? Today, if I make the exact same call,

let's say, to GPD 4,

today and I do it even an hour later, I'm not getting I'm not going to get exactly the same response back. That's not guaranteed. In fact, very rarely will you see exactly the same response back if what you've asked it is complex enough.

So,

Langchain and all these other tools allow you to build something fast to get it running,

and that's

important. But the hard part is building

around

this piece, which is no longer deterministic.

It's like we haven't really thought of that as a programming language. Right? If I tell you you can make a call to a library and it gives you different answers on the same input, you're gonna freak out as a programmer, But that's what these things do.

So line chain and other methods allow you to put together that pipeline together, but the hard stuff about in my application, how do I deal with uncertainty?

And this black box that could

behave in very

random ways, although the surface

of misbehavior, if you want to characterize it that way, is not the,

unknown. It has a little box in it, but you have to deal with that randomness, and it's there every time you interact with it. And so that the rest of that machinery, we still have to build. It's not gonna come to you from any tool chain that you have. The other components are deep error handling. When something goes bad for some reason, whether and it's not that

you, the caller of that LLM, did something else. But it's like service interrupts, or you made enough calls that you hit a throttle point, and you are now getting throttled down because you had some limit in your service contract. All of those stuff are things that you have to manage outside the box.

And so

it helps,

but the hard work often is outside for us, especially when you're building a sophisticated system like ours at data chat. We are hiding all of that sophistication from the user, presenting it to them a very simple use case. Simple is hard, and there's a ton more things to do besides just chaining together calls and, connecting actions.

Another aspect of the platform risk and the challenges involved in actually building a business on top of these Gen AI capabilities

is the question of

compliance,

privacy,

and also the fact that as a business in the modern digital age, a majority of the

moat, if you will, that you can build is in the data that you are able to collect and take advantage of.

And I'm wondering what you see as some of the challenges of how to

manage that strategic advantage of your data as well as some of the compliance and risk aspects when you are interfacing with these other very data hungry applications and platforms and

some some of the ways that you think about managing that relationship and that balance of ensuring that the system that you are feeding information into is giving you the results that you want, but you're also maintaining

the appropriate level of control over your own information?

Fantastic question. So I'll give you, a start with an example of a real customer conversation and then drill down from there. You're the customer that we're explaining, here's what you can do with data chat. You ask a question, get your answer back along with this transparency reproducibility

aspect.

And, of course, the first question was, okay. You call out to an LLM and you are you going to send that data across to them? And then we explained to them how

we had thought about that risk and actually built in that 12 step process that I described earlier

so that when we call the LLM, we do not send any data across.

We will only send to them information about the schema elements, like what are the database that you're connecting to and what are the data elements, what are the names, types, stuff like that. The question and the rest of the machinery that we need that is value based semantics

that we need to collect from, get from the data to produce that answer is all done outside completely in data chat space.

So we call out to the LLM without leaking any data

from the customer, and they were shocked because

and they see a lot of vendors. No one's quite doing that. For us to do that goes back to Tobias, the earlier part of what we were talking about. You know, that LLM call need should not be the reason why your product exists.

It has to be all the other stuff that you build around it. So data privacy is super important. Some other mundane examples of what we do is that we also

when we are deployed in a private VPC deployment in a customer,

we can guarantee that no data leaks out of that system, including things like logs and, you know, users like the product or don't like the product, all that user feedback that we get, all is completely local and never leaves the system unless we have permission to go and do that. So many products will build in little hooks that needs to keep calling up to the mothership even in the VPC. None of that stuff. That sounds very basic, but you'd be surprised how many products keep calling up to the mothership in including in a in a in a VPC environment. But the LLM story, we had to do extra algorithmic and engineering work to make sure we can make it all work without sending data to the LLM because in many cases, it is really important that no data sent out gets sent out.

1 of the phrases that you've mentioned a few times already in the context of these LLMs is black box, where you send something in, you get something out, and you have complete opacity as to why certain things are getting returned,

how is actually making those decisions internally.

And I was listening to an interesting conversation a little while ago about

the question of bias in AI and the fact that

AI is actually

better for being able to manage bias because

you can tirelessly just feed it with a barrage of questions and get back responses so that you can try and hone in on understanding what are the biases that are existed in that system. Whereas with a human, they would end up getting tired or irritated and tell you to go pound sand eventually.

And so I'm wondering what are some of the ways that you think about how to

gain insight into that black box or build some understanding

of how those decisions are arrived at in order to be able to more confidently engineer around those capabilities?

Yeah. That's a great question.

We try to do as little

of, by way of requiring the LLM to behave in a certain way for us to build the rest of the product in a mature way. So we'll assume that the answer we get from the LLM is unpredictable even on the same input,

or it's unpredictable if we change the input slightly.

So we put a lot more emphasis

on trust but verify kind of a paradigm where we'll say, yeah. We trust that you'll do a good job with that. Whatever biases or other implicit programmatic

assumptions that are made in your response are going to come back to us, but then we do a bunch of checking when the response comes back. The response that we get from the NLM is not sent to the user directly. We'll take that response as a suggestion,

then we'll do a whole bunch of work on that response to create the real response that we sent to the user. This also helps us because sometimes that allows us to do things like,

I've got a question. I sent it to an LLM. I don't think this quite looks right because we have we test that externally, and we have quantitative base of dealing with that,

in our our architecture.

And then we can see, you know what? I'm gonna ask the LLM again in a slightly different way because there's randomness in that. Right? It has biases in a certain way, but we can use that to our advantage

to say, I sent you this question. You gave me a response. I can tell for sure it's wrong, but I think you can give me the right response if I ask you in a slightly different way because I've understood which way you are biased,

in producing me the producing the correct answer to me. And so we can do that iteratively on the fly. We can also do things like we might say, I'm actually gonna ask 2 completely separate black boxes

this question and figure out which answer I think is better suited because we built that checking for correctness

outside the box and don't really take the response we get back from the L and M as gospel, but we will verify it and adjust it and maybe get another opinion if we need to before we return anything back to the user.

In terms of

the

predictability and repeatability

of the response, there has been a lot of effort that has gone in recently to the area of prompt engineering

and trying to say, okay. I'm going to

constrain

or templatize the overall question that I'm asking and automatically assign a certain amount of context

to the overall request.

I'm wondering,

in your experience of building on top of these capabilities,

how much of the balance of engineering and algorithmic

management is in that area of building and maintaining the appropriate context, whether that's through prompt engineering or,

feeding in a context database, and how much of it is

removing that level of

kind of integration or removing the dependence on the LLM to provide consistency and instead engineering

the that consistency

of experience into your own system so that you own that without having to be so dependent on

providing constraints to the LLM specifically?

Yeah. Great. Another great question. And I hope I don't

get some of your listeners mad because they might throw virtual tomatoes at me as I listen to this broadcast. But I think there's a lot of

overemphasis

on overengineering

the prompt.

Those are not very robust methods. So we take the approach. Yes. Prompt engineering is important,

but are we going to try to get that next 5%

accuracy improvement in the end response to the product by putting more into prompt engineering?

No. We'd rather do it in the stuff that we do outside

because what works for a given black box with a given prompt engineering style is probably gonna change even for that same black box as it gets a new version. New version of GPD 4.5

might need something else.

Also, it seems

from the technology perspective that the long term, the way these LLMs are evolving, especially with all of the stuff that goes into training them, not just in the core model, but in the our lecture loop and all the other stuff they do before they, LLM providers produce us an API to use the next version of LLM. It is more likely than not that hyper engineering your prompt is not where the value add is.

You want to give it all the information. You want to use retrieval augmentation, other types of methods.

But do you want to get your little bit of performance stealing from a specific style of prompt engineering, which is fleeting?

We'd rather put that type of engineering effort more in this macro level algorithms, like something that like the ones I described to you, because

that's just a 1 time payoff. And, again, nothing that will likely last

as the emphasis on hyper engineering prompt, which is very fashionable right now. Right? You see so many blog posts and books getting written on how to construct the right prompt. I'm not a believer in

that will stand the test of time. The LLMs are getting smarter. The emphasis on putting more and more into prompt engineering is, in my view,

perhaps controversial, a losing battle.

We put way more emphasis on the stuff that happens around it. Give the information to the LLM that you need to give.

Yes. Presented in Clean Bay. Don't be haphazard in that. Don't overengineer the prompt.

Another

interesting angle to this is

1 of the major applications of LLMs is in automatic summarization

and automatically pulling out the key bits of information.

And I see a potential for being able to add maybe a lightweight LLM or even a

linear model that on top of the responses from the LLM to do some further distillation

of that response

so that you have more visibility and transparency

over your own layer, but you're still relying on the LLM to do the heavy lifting and being able to build some sort of a,

model chaining or composite strategy to it, or,

alternatively,

using a an adversarial

approach of, you know, getting the 2 models to kind of communicate with each other and trying to optimize for the return value. Of course, that adds a lot of latency to the system and just,

spitballing there. Curious what your thoughts are on those approaches.

Yeah. Great question again. I'm a huge fan of ensemble methods. You know, ask

2 models, 3 models if you're not sure, and then figure out

the strategy to figure out which response is the more appropriate 1, maybe even combining that.

Model starting to models is another great idea, but we aren't quite pursuing that because we think they are lower hanging fruit in at least immediately

from some of the ensemble methods that are more readily

engineered and more robust for us to grow. But all of these are great ideas. You know, it's kind of like if you're in a classroom and having a discussion and an open ended topic in which you need creative responses,

if you're a teacher, you won't just ask 1 student. You'll ask multiple students, and you let the conversation evolve before you come to a conclusion

by getting input from multiple

bright minds

and different ways of getting that input even in this set stack makes sense. And repeating what I just said, on several methods are the cheapest way to go do that stuff. It still requires careful engineering, careful planning, all the stuff we talked about. You know, each of these models is black box. There's having 1 random variable. You have 2 random variables or 3 or 4. You still have to deal with that. But sometimes life gets a little bit easier in that space. Some things get harder. So huge believer in treating this.

Each of each model, they are magical in what they do. I don't think anyone would have imagined they do the types of things they do today. Even a year ago, it seemed unimaginable when GPD first came out and became so widespread. We were odd by what it could do. We are odd even more by what it can do. And, of course, you have to remember just before GPD came out, there was stable diffusion. And what it started to do for images was crazy and all of that has accelerated. So the future is that these things are gonna get better, maybe not exponentially better as it has happened over the last 18 months. But getting multiple of them to use to help you is a great idea in these different architectural configurations.

That is likely a winning strategy long term.

Given the level of opacity

and inherent risk in depending on these platforms as a component of your business and as a core element of the features that you're offering.

I'm wondering

what you see as the factors, whether technical or organizational,

that might influence you to go down the path of,

building your own LLM or even self hosting 1 of the open source ones and just some of the ways that you think about that overall calculus of

the risk but ease of use against the,

increased technical

requirements

with the, enhanced control available?

I love that question. Philosophically,

at data chat, we think of it in a simple way. No single point of failure.

And as I said, when we had

real dependence on codecs in a product in March of this year, because that was effectively the only game in town,

we learned a hard lesson and got yanked from us overnight that you can't do that. Now you take that idea of no single point of failure, control your own destiny as a sub cross to that, and take that to that next level. What do we want? We want to be able to say, if I take a dependency on an API call to an LLM model that is hosted somewhere else, I want some guarantee that that won't go away very quickly. And I want a couple guarantees in there. I want a guarantee on the length of time that I can continue to depend on it and it'll be available to me. And I want a guarantee in terms of what my costs are, and they better be predictable at different

volumes of API calls that you make. Now luckily, both of those are relatively solved problem now, especially if you go with the with any of the big LLM providers. Right? There are lots of small LLM providers for which there isn't a formal contractual model. But over the last year, a lot of that has baked in. But that's just 1 dimension to this. The other dimension which you pointed out is, what is my risk

of taking a hard dependency on 1 of these models,

that is closed source

and that

increases the randomness that you that we just talked about. Every time I call it, it may give me something else. I have no control over it. So the other thread that I think everyone needs to do is not just be relied on 1 LLM. Maybe g p d 4 is working great for you. Go for it. Or Vertex AI is working great for you. Go for it. But if you can and you're sophisticated enough, you should have another model that also works for you. And whether you use it in an ensemble mode or whether you use it as an alternative and you have to use it, for example, in some customers, they will not allow us to make a call to g p d 4.

Even under

Azure's contractual and tight security psych security parameters because they just do not want to even leak that much,

data out. In which case, you have no choice but to say, I need an alternative path in which I have a self hosted LMM.

And luckily,

while there is no true open source LMM, there are lots of open weight LMS,

like, LAMA and Mixedral,

which you can take. You can deploy that in your own containers.

Now the containers tend to be a little expensive today. Right? You need GPUs and stuff like that,

even on the inferencing side. But you need that. You need that as a strategy. And, of course, we have that as a core component of how we do our work here at data chat because there are actually times where those things do better on certain tasks than the closed source models and vice versa. So you need to have a dual strategy, including

some open rates model has to be part of your strategy. And luckily, there are alternatives, and I'm so thankful for the open rates model because all the progress in this field and in terms of derisking,

the dependency you take as a product

would be very high risk if we did not have these open source models. So huge shout out to Meta who started this whole process and got us really serious open rates model out. Not truly open source, but I'll take that over having not having that at

all. That's interesting

digression that I would love to dig into as well as far as the semantics of open source and open weights and what it means for an AI model to be open, but that's probably a days long conversation that is orthogonal to the 1 we're having now.

But,

digging more into that question of

what it means to have a an LLM or a generative AI model

that you own or that you have built,

as you said, you don't necessarily start with. I'm going to pull the entire corpus of the web and collect that myself and engineer an entire

model building chain because it's incredibly expensive and time consuming.

But I'm wondering if you can just talk through some of the aspects of what it really means to actually build your own or to have your own model in that context of these

substantial

AI capabilities?

Yeah. There are 2 broad paths to

getting your own model for the task that is critical to you. 1 is to start exactly as you said. Start with the corpus of the thing that you're trying to teach it and build the model from scratch. That's extremely expensive. Most people can't afford to do that. It costs anywhere from a few $1, 000, 000 to maybe $1, 000, 000, 000 plus depending on what you're trying to do to get that to work.

Furthermore, to do that, you need massive amount of GPU resources for long periods of time, which is

often very, very hard, if not impossible to do.

The second path, the 1 that we follow, is start with something that is already trained on a broad set of tasks, including the longer family of models and mixed trial, models.

And then

use methods,

the 2 critical components of fine tuning and retrieval

augmentation,

to teach

it how to answer the types of questions you wanted to answer

a lot better. And now the amazing thing is that fine tuning, for example, can be done relatively cheaply at many orders of magnitude

less of a cost, both in time

and in actual dollars you have to spend to get a pretrained large model

to be fine tuned on the task that you have. And this community is so amazing. A lot of that methods are all open source. It's a pretty you still need to know technically

what to do. Technically, more importantly, you need to know what not to do because, you know, there are 20 methods published every day and to how to do it well. But if you have the technical expertise with a fraction of time and fraction of cost,

you can get an existing large language model fine tuned for the task at hand. And then with retrieval augmentation method at runtime, you can teach it more in context for what you're trying to answer on the specific type of problem, on the specific type of data scenario that the customer is now asking a question

at at at run time with retrieval augmentation, you can take that fine tuned model to do that task even better. So it's a 3 step process. Start with a large language model at a fraction of cost and time. Do fine tuning to produce your own model. Now you own

it, especially if you've started with an open weights model. You kind of own it. Right? And, again, we won't get into that open source sources, open weights. But for practical purposes, you control your destiny. No 1 can take it away from you. Right? You can host it in the container, and no 1 can yank it from you. So you get that freedom even with the open weights. And then you can do appropriate retrieval augmentation to get that extra level of stuff

that you need when you're inference you'll be making that inference call at runtime.

And

in terms of the work that you're doing at data chat and the fact that you do have this reliance on these generative capabilities as far as the overall product

and the ways that you are presenting it to end users.

I'm wondering how the progression

and the exponential increase in sophistication and capabilities

and complexity

of these underlying models has impacted the way that you think about your overall product strategy, both in terms of the platform risk elements that we've discussed, but also in terms of the

enhanced capabilities

and with that extra capability, the extra potential for ways that things can go wrong

and how you're, managing that as far as your product road map, user expectations,

user experience design, etcetera?

Yeah. Great question. I think there are 2 aspects to this. 1 is as these

LMM models

keep getting

produced and new and new ones come out every time, it's a mad race for us to even take the top few that emerge every week and say, does it materially change a game in our ensemble method to plug something that we're using with something out?

What we've done over time is develop a robust test framework so that's it allows us to test things very quickly with our own internal benchmarks and stuff that matters to us to be able to tell, is it worth trying to dig into this thing deeper? And, again, we didn't have this stuff till a couple weeks ago, but we have an amazing,

team that does that. But, essentially, the main point is to automate

trying to look at all the models that are getting produced and come up with a quantitatively

definitive answer whether it's worth chasing something that's come out. Because so many new things are coming out. The important thing is not to figure out which 1 to go after, but which ones to say no to and not waste any engineering time on. So that's the the negative part of this thing is the more important part, and it'll matter in terms of, like, what is your data set because all of these LLMs will produce,

results on some benchmark. The benchmarks are not always the same.

Some of the benchmarks are very macro. They don't quite matter. It's only a sub portion of the benchmark. We know matters a lot more for us. Those results may not be published. So we need a very low cost, effectively mechanical way to test some of these things out to decide what not to chase because that's where you waste time. The second part is that as these LLMs get you know, the general trend is in many of these LLMs, they're getting bigger and bigger and bigger, and the bigger ones are doing amazing things. Of course, there's this huge interest in small models, but they're not as capable. There's a material difference in terms of the quality of the product, and at least right now, the bigger ones do better.

If I can get even a 5% accuracy boost from a bigger model,

that's a huge benefit to that end customer. That 5% will show up to that end customer. It's materially important in that end to end product game.

So the other aspect is if these bigger models are getting

more accurate faster and they hold the highest accuracy,

metric.

And that's what we want to use.

As they get bigger, the cost

to use them,

both in terms of the time it takes to

make that inference call and have it do all of its work and the cost it takes to run them. You need more machines. You need more GPU cycles. You need maybe more,

GPU machines to serve your workload. So that cost goes up, and that's the other challenge. Right? The bigger, better ones are improving.

Sometimes they are the right things to use, but they're getting

more and more expensive. Like, my poor call cost,

if that keeps going up, then I have to figure out how to make that work within the rest of the product. In terms of your experience of

building data chat, working with these LLMs and engineering on top of them. I'm wondering what are some of the most interesting or innovative or unexpected ways that you have seen either the data chat project itself or the overall application of LLMs applied

in your experience?

Yeah. Really great question.

We've been really surprised how some of these LLMs can converse with you within the same data chat,

context. And so if

a customer asks us a question, we'll get a response. We'll have made an LLM call. But then we get the response back and we do all the magic that we said in a platform we do to construct the response, We can actually take that response and feed it back to the LMM and ask it to do more. And it's understanding of something that is relatively new to it, which is just being developed in the context of that conversation,

that capability has been growing quite nicely.

And at times, the types of stuff that it does seems

ultimately magical,

and that's fascinating. Like, whether it is saying, I produced a chart and, you know, can you tell me explain to me what I did, what this chart looks like. It can take text and pictures and stuff like that, that multimodal

models that are starting to become available

are starting to do crazy good things

and pleasantly surprised by sometimes what it can come up with. And I say sometimes because many times it'll come up with complete garbage, so you still need that filter.

Even as you iteratively use the model and just producing a single response to the user.

In some sense, that filtering check and the trust plus verify things get harder as you keep making multiple calls to LLMs even on a single customer request.

But it's amazing what they can do, and that's been pleasantly surprising.

In many ways, some of the things that we thought we would have to do completely by ourselves, you can kind of use LMM to get you going in the right direction. It often gets more than half the work done for you, and not using it

in that way would be just dumb in this day and age. So that's been great. It's been crazy what these LLMs can do today. It was unimaginable,

you know, even even a year, a year and a half ago. In the work that you've been doing to build data chat and build a product oriented around these generative AI capabilities,

particularly in the space of using these capabilities

to improve the overall experience of working with data?

What are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

Yeah. I think the some of the most challenging stuff is that the popularity of all of these LLMs

also becomes a course because at times,

we can't get enough requests

out because of quotas and other limits that you have. It makes experimentation

difficult.

It's also very expensive to build products with LLMs, especially when you are building these products. You're going to make a large number of calls to the LNMs. A lot of it is just experimental. You're trying to figure out how to build the product. It is not cheap. You need a big

bank account to even play the game

at

a reasonably high level.

So some of these are very logistic

call type of questions, but

it's not a game everyone can play. You need to be well funded. You need to know what you're doing. It goes back again to

knowing what to do is important, but what's more important is what not to do. Do not chase every shiny thing that gets presented your way. And they are you'll just open up your Twitter feed, and you'll see 50 things that will all sound very plausible as things you should pursue.

But I guarantee you, 48 of them should not don't have merit, and how to distinguish that is a challenge. So it's an expensive game. It's a noisy space.

And in some sense, the job is harder as a when you're building a product as a start up because

the 1 resource you don't have is time,

and you could waste a lot of time if you're not careful and you don't know what you're doing. And as a natural segue into the next question, what are the cases where an LLM or generative AI more broadly are the wrong choice, particularly when you're considering it from a product standpoint.

Awesome. We'll circle back to where we started with machine learning.

Guess what? A lot of business questions

can be solved

with very what would now be considered traditional machine learning,

XGBoost,

cat boost, regression.

A lot of business questions need just that.

Now

you may need the response as we do in data chat to have constructed an ETL pipeline on the fly and then make the right

XGBoost, cat boost call. And, of course, you want to extract all of that away from the user,

but

traditional machine learning tools on structured data and businesses

are often the world course

things that we see our customers get a ton of benefit from. So, yes, all this Gen AI is to help you develop that pipeline,

but often the magic at the end of it that's bringing business value

is straight up machine learning tools that you've called on behalf of the customer. Of course, explain to them that's what you did,

But that old

bedrock that is rock solid, traditional ML, you might call it traditional now. 5 years ago, everyone was it was a new thing. That still runs a lot of decision making in businesses today.

And as you continue to build and iterate on data chat and keep tabs on the overall LLM space and the capabilities that they are offering. What are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to dig into?

Yeah. I think it is the use of multiple LLMs and

not just using multiple l m LLMs in in simple ensemble mode, but, you know, in different configurations perhaps,

maybe in a hierarchical mode or some other sequence where

because, you know, even if you're calling multiple LMs and even if it's all free, it's taking even if the LMs are completely owned by you, like, you self host them, It cost you time. It cost you money to get that response.

And, ultimately,

the end customers

need something that's pretty close to 0

$0

or or a fraction of pennies on each response.

Today, just making a call to an LLM is very expensive.

It'll cost us, like, 10¢ or 20¢ depending upon how big that token budget that we eat up to make a call to an external stuff, but the customer's mindset is to pay nothing. Right? This is Google search problem. They had L and M's before, but the if they had done deep integration

of L and M's in their search engine, that's a huge case to be made why that makes sense. The COGS would go up by a dramatic amount, but the customer is not gonna pay more for the search query right now. Right?

So it's that end to end COGS

where now we have a very expensive piece. We want to call more of it on each request. We want to call many of it, but that is not free. Both it increases the latency of the response that, to the to the end user, but also the cost for us to go deploy that. So it's a pretty sophisticated

game. It's a pretty complex problem

and how to solve that in a way that makes sense, delivers value, but also doesn't you can't throw everything and everything because it's just not sustainable

in that way. You can't, give the costs both from the time and the dollar perspective,

too heavy. So there's a lot of really sophisticated ways of thinking about making all that deliverable, and that's what consumes us along with all the other stuff that we talked about. And there are some non trivial ways to try to figure out how to make all the business model components of this work to be able to make

what customers expect to have things be pretty cheap, but these things are not cheap. Are there any other aspects of the work that you're doing at data chat or this overall space of using LLMs and generative AI as a product capability that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think the last thing I would suggest is, you know, there's a lot happening with people thinking about how to

regulate AI,

and you see all kinds of things getting passed. I think it's great. The policyholders

are jumping a little bit ahead of where the technology is. But, you know, I I worry if sometimes

the ways in which you do that ends up

making it super hard for startups to compete with. And I know that open source versus open rate stuff is a completely different beast. But some people have complained as to why did Meta put out the open rate stuff. Did they make more bad things possible? I applaud Meta for putting that stuff out and the better way to go and deal with the power of what these LNs can do. By the way, we don't really understand why they do what they do. Right? We are all

amazed by what they do. But having more openness in research, having more open models, even if it's open weights, I think that's a great direction. And we should be

very careful to not take that away because, otherwise,

the pace of progress in this field would be

very slow.

And,

arguably, the bad actors would get a better advantage. Let the world have it. Let the good guys figure out the good things to do because that's the better way to do this. So that'd be the other stuff that's always at the back of our mind. It's like this debate, and everyone's trying to figure out what to do with this technology, and the technology is moving at a rapid pace. And sometimes people have this tendency to want to over regulate. Yes. There must be some,

controls in place, but I hope it's done in a sensible way. So far, it seems like the answer is yes, and I hope that

continues. Yeah. On that note of regulation

and how to apply it to AI, I read an interesting

article from Bruce Schneier's blog, and I'll add a link in the show notes about how

rather than targeting

the specifics of the AI technology with the the regulations,

they need to be aimed at the

ramifications

of the organizations that are operating those AIs in order to incentivize them to do the right thing so

that rather than focusing on the capabilities of the model, focus on how you're using the model to make sure that the companies are being held accountable and they can't just say, oh, no. It was the AI model. It's not my problem. It's not my fault.

Yeah. I think all of those make a lot of sense. So, you know, we are all drinking from a fire hose, and, you know, the last thing you want to do is to shut it down prematurely

before we can really figure this out. So but some controls are needed, but it has to be balanced.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today. Yeah. I think the biggest barrier to adoption for machine learning is data quality.

Often, people have the data

that that is potentially meaningful in terms of being able to answer the very question they have,

but they don't know how to get it in the right form for machine learning to use it. And it's not just a matter of saying, do I have the right ETL pipeline, but also do I have the right semantic interpretation

of that data?

And through the big data revolution, we've learned how to collect a lot of data and store it cheaply. People have large piles of data, but getting value from it is not just a matter of saying, let me throw some Gen AI or ML at it. It's a lot more of, do I? Have I taken care to understand what's in my data?

Have I provided the right amount of semantic information? Have I kept track of that stuff?

So I know of the 20 date fields, for example, that I have in my table, which date field do I look at if I'm looking for all the orders that were processed and,

paid for in the last 30 days? Sometimes simple stuff like that. So the basic things for which we were all excited about in big data, period, of being able to get information from data and carefully building all these methods and mechanisms

to be able to ask questions on structured data, build these ETL pipelines, and stuff like that.

I hope we don't lose track of how important that is because

if you feed garbage in to an ML or Gen AI tool, you're gonna get garbage out. So that hard work

that we have to keep doing is still necessary before you can use the power of any of these tools. Alright. Well, thank you very much for taking the time today to join me and share your experiences

of building on top of LLMs, the work that you're at data chat, the experiences that you've had about how to

build a product on such a shaky foundation that is constantly in motion. I appreciate all of the time and energy that you're putting into the products that you're building and you taking the time to share those experiences.

So thank you again for that, and I hope and I hope you enjoy the rest of your day. Thank you.

Thank you for listening, and don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management,

and podcast dot in it, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at the machine learning podcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at themachinelearningpodcast.com

with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.

AI Engineering Podcast