Summary
In this episode we're joined by Matt Zeiler, founder and CEO of Clarifai, as he dives into the technical aspects of retrieval augmented generation (RAG). From his journey into AI at the University of Toronto to founding one of the first deep learning AI companies, Matt shares his insights on the evolution of neural networks and generative models over the last 15 years. He explains how RAG addresses issues with large language models, including data staleness and hallucinations, by providing dynamic access to information through vector databases and embedding models. Throughout the conversation, Matt and host Tobias Macy discuss everything from architectural requirements to operational considerations, as well as the practical applications of RAG in industries like intelligence, healthcare, and finance. Tune in for a comprehensive look at RAG and its future trends in AI.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Matt Zeiler, Founder & CEO of Clarifai, about the technical aspects of RAG, including the architectural requirements, edge cases, and evolutionary characteristics
- Introduction
- How did you get involved in the area of data management?
- Can you describe what RAG (Retrieval Augmented Generation) is?
- What are the contexts in which you would want to use RAG?
- What are the alternatives to RAG?
- What are the architectural/technical components that are required for production grade RAG?
- Getting a quick proof-of-concept working for RAG is fairly straightforward. What are the failure modes/edge cases that start to surface as you scale the usage and complexity?
- The first step of building the corpus for RAG is to generate the embeddings. Can you talk through the planning and design process? (e.g. model selection for embeddings, storage capacity/latency, etc.)
- How does the modality of the input/output affect this and downstream decisions? (e.g. text vs. image vs. audio, etc.)
- What are the features of a vector store that are most critical for RAG?
- The set of available generative models is expanding and changing at breakneck speed. What are the foundational aspects that you look for in selecting which model(s) to use for the output?
- Vector databases have been gaining ground for search functionality, even without generative AI. What are some of the other ways that elements of RAG can be re-purposed?
- What are the most interesting, innovative, or unexpected ways that you have seen RAG used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on RAG?
- When is RAG the wrong choice?
- What are the main trends that you are following for RAG and its component elements going forward?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. [Podcast.__init__]() covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Clarifai
- Geoff Hinton
- Yann Lecun
- Neural Networks
- Deep Learning
- Retrieval Augmented Generation
- Context Window
- Vector Database
- Prompt Engineering
- Mistral
- Llama 3
- Embedding Quantization
- Active Learning
- Google Gemini
- AI Model Attention
- Recurrent Network
- Convolutional Network
- Reranking Model
- Stop Words
- Massive Text Embedding Benchmark (MTEB)
- Retool State of AI Report
- pgvector
- Milvus
- Qdrant
- Pinecone
- OpenLLM Leaderboard
- Semantic Search
- Hashicorp
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast-moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Matt Zeiler, founder and CEO of Clarifai, about the technical aspects of retrieval augmented generation, including the architectural requirements, edge cases, and evolutionary characteristics. So, Matt, can you start by introducing yourself?
[00:00:34] Matt Zeiler:
Hey. Yeah. Thanks for having me. So I'm Matt Zeiler, founder and CEO of Clarifai. I founded Clarifai about 10 years ago at this point, one of the first true deep learning AI companies. And even before starting Clarifai, I was diving deep into neural networks and deep learning all the way back as an undergrad at the University of Toronto. I was very fortunate to work with Geoff Hinton there, who people consider the godfather of AI. And that really was a little bit of luck, stumbling across AI there. But once I did come across it, I was hooked, and so I decided to do my PhD, which brought me to New York University, where I got to work with people like Rob Fergus and Yann LeCun, more pioneers in this field.
So that was kind of the background and learning I was able to do before starting Clarifai, and there have been lots of lessons learned over the years running the company and being a pioneer in this space.
[00:01:28] Tobias Macey:
You mentioned that you kind of fell into machine learning and AI, and you've been in it for a while now. Obviously, the thing that has sucked all of the air out of the room when anybody says AI is generative models, in particular large language models. And I'm wondering, given your history in the space and the fact that you've been working in it for a while, what your perspective is on the utility and the lasting power of large language models and generative AI, and the role that so-called traditional AI or more statistical models plays in this current landscape?
[00:02:03] Matt Zeiler:
Yeah. Absolutely. At the end of the day, all of these models that have been showing promise for the last 15 years or so are neural networks, including what people are now calling generative AI. It's all neural network based, which is kind of algorithms meant to simulate how your brain works. And even back in those University of Toronto days, I was doing generative models for motion capture data. It was an interesting project about understanding pigeons walking and filling in missing motion capture recordings. And so it was actually generating data: just like these large language models generate text, it was generating motion capture.
And so this is not new technology. I think what has really changed over the last 15 years is the size of the models, the amount of data they're trained on, and both of those things require a significant increase in compute. And so that's why, all of a sudden, the capabilities are understanding enough about the world that they're showing a lot of interesting properties. We're still in the early days of exploring those possibilities for these models, and there's new ones every day. And I think it's gonna be this way for many years to come.
Continuing to grow the training data size, the model size, and the compute will just keep unlocking more and more capability.
[00:03:26] Tobias Macey:
With the rise in popularity and at least potential application, if not actual real-world application, of these large language models, one of the topic areas that has gained a lot of interest is the idea of retrieval augmented generation, or being able to build your own context corpus for the model to feed on to be able to actually complete the task that you want it to complete, versus the general task that it was originally trained for. And I'm wondering if you can just start by giving your definition of retrieval augmented generation and some of the requirements that it brings to the architecture.
[00:04:08] Matt Zeiler:
Yeah. Absolutely. So I think most people are familiar with how LLMs are used today. You have a context window, which you pass input tokens into, and then you ask the model to continue generating output tokens, and all of that has to fit in the context window. And the generations occur based on whatever went into the training of those large language models. And so there's a lot of problems with that. Like, it gets stale: whatever it was last trained on is all the information it has. And it can hallucinate a lot of stuff. There's not that much control over what it's gonna generate. And so retrieval augmented generation came around a few years ago to help address some of those problems by giving it more dynamic access to information and reducing hallucinations.
And retrieval augmented generation is kind of explicit in how it actually works. It first starts with retrieving, out of a large corpus of information, more fine-grained information that it can stuff into that context window so that the LLM has more facts it can use in addition to the prompt that is provided, in order to augment its generation. That's how you get the name retrieval augmented generation. So it first takes your input prompt and queries, usually, a vector database to retrieve the most similar chunks of documents that can help the LLM augment its generation.
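As a minimal sketch of that retrieve-then-augment flow, the toy Python below embeds a query, finds the most similar chunks in a small in-memory index, and builds an augmented prompt. The hash-based embed() function and the sample documents are illustrative stand-ins for a real embedding model and vector database, not any particular product's implementation.

```python
# Toy RAG retrieval step: embed a query, find the most similar chunks in an
# in-memory "vector store", and build an augmented prompt for the LLM.
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each token into a fixed-size vector (not a real model)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "Clarifai was founded as a deep learning company.",
    "Retrieval augmented generation grounds an LLM with retrieved context.",
    "Vector databases index embeddings for similarity search.",
]
index = np.stack([embed(d) for d in documents])   # the "vector store"

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)                 # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

query = "How does RAG reduce hallucinations?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)   # this prompt would then be sent to the LLM of your choice
```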
[00:05:37] Tobias Macey:
As you mentioned, vector databases have taken off alongside the growth of retrieval augmented generation. Another thing that is maybe worth touching on is that at the beginning of this upswing in popularity of these AI models was the idea of prompt engineering, which pretty rapidly gave way to retrieval augmented generation. And I'm wondering what you see as the overlap of prompt engineering and retrieval augmented generation, and what are the cases where prompt engineering is actually still a viable utility in the application of these AI models?
[00:06:13] Matt Zeiler:
Yeah. There's kind of a spectrum. Prompt engineering is the quickest, easiest thing you can try first. RAG, retrieval augmented generation, is the second. The third is fine-tuning models, so taking these popular open source models and customizing them on your data. And then fourth is training new models from scratch. And the spectrum kinda gets into a question of how much compute and data you have access to. If you don't have much data, you can't really pretrain a model at all, and fine-tuning is unlikely to work very well. You can prompt engineer an existing model, but it's only gonna get you as far as whatever that model understands.
And so RAG becomes a really good choice where you have limited, small dataset sizes but still want to customize the outputs of the model to be more specific for your particular task and your dataset. And that helps you kinda ground these models, or I've heard the term tame these models. I like that, because if you don't tame them, they can go off the rails and generate a lot of stuff that you wouldn't wanna put in front of your customers. And it can also help these models adapt to new things they've never seen, or different domains of data they've never seen. So for example, maybe they weren't trained on legal documents or finance documents or, you know, classified documents. We do a lot of work with the government and the intelligence community.
And so there's a lot of things that these models have simply never been trained to see. And so RAG can help you adapt to those, whereas prompt engineering wouldn't help in those scenarios. And then if you really want to get the models to perform better and continue that trend of reducing hallucination and customizing on your data, that's when you get into fine-tuning. And at the biggest scale, training your own specific model can come into play. But for the most part, you can get away with prompt engineering and fine-tuning for a lot of applications.
[00:08:14] Tobias Macey:
Circling back on the vector database aspect of it, there's definitely utility in having the data that you're trying to stuff into your context already formatted in the same data structure or the same language that the model understands. But I'm wondering whether you see vector databases as a requirement for retrieval augmented generation versus just an optimization of the functionality?
[00:08:41] Matt Zeiler:
Yeah. It's a good question. In today's most popular RAG systems, they are kind of a core component, but I have this feeling that it's kind of stitching together a bunch of components very loosely, and there's gonna be improvements to that in the future. So bolting on a vector database to stuff text into the context helps. And there's even papers, I was just reading one this week, about kinda provability that it does help with hallucinations and grounding these models. There's some theorems around that, which is exciting. So the RAG approach does work, but it still feels like it's a bunch of independent components bolted on to each other. I think what we're gonna see is more end-to-end training, and we're already starting to see that in the research community.
And once you have that, the vector database is just one type of thing that you can query information from. You're seeing a lot of other usage of these large language models where they can just generally query information from tools. It doesn't have to be a vector database. It could be an API to get the weather. It could be your Salesforce account to get your CRM data. It could be a lot of different integrations. And so I think the vector database is just one simple one that people can get started with. But long term, I think we're gonna see a lot more flexibility and a lot more kinda unified understanding, and less stitching things together.
[00:10:13] Tobias Macey:
Digging further into that collection of tools that are being stitched together, can you give an overview of what you see as the typical architecture and the technical components for being able to actually bring RAG to bear, and some of the operational considerations around that?
[00:10:32] Matt Zeiler:
Yeah. Absolutely. And this is kinda where Clarifai fits in. I realized I didn't really give an overview of what Clarifai does, but we are an AI workflow orchestration platform. So we have all the tools in one place to get you from prototype to production as quickly as possible. We often talk about it as being AI in 5 minutes, because literally, with workflows like RAG, you can actually be set up and running in about 4 or 5 lines of code using our SDKs. And we have vector databases built in. We have all the embedding models you need to retrieve and index your data. Popular large language models, all the third party ones like OpenAI, Anthropic, Google, etcetera, and all the open source ones like Mistral and Llama 3.
And within hours of new ones being announced, we're importing them into our platform so that they're ready to go for your use cases. And a RAG setup has a lot of these components you have to stitch together. You have to think about the vector database and how you want to run that at production scale. How do you want to actually index your data? There's the data pipelines of chunking up large documents, like PDFs or large text files, into small chunks before going into the embedding model. Once you get embeddings out of those, you have to decide, do you want to keep them in full precision? Do you want to quantize those embeddings, which is popular for saving memory in your vector database? How do you wanna index the embeddings in your database? There's a lot of configurations there.
And then once you have that kind of indexing sorted out, you have to think through how you actually reindex when you have one more document to add, or maybe you want to change the embedding model, since a new one comes out on a regular basis. How do you think through the reindexing process? So there's a whole can of worms on doing production quality work on the embedding and vector DB side, and that's even before you start doing RAG.
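As a small illustration of the chunking step in that pipeline, here is a sketch that splits a long text into overlapping, word-bounded chunks sized for an embedding model; the chunk size and overlap values are placeholders, not recommendations.

```python
# Illustrative document-chunking step: split extracted text into overlapping
# chunks of roughly `chunk_size` words before sending them to an embedding model.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_size-word chunks, each sharing `overlap` words with the previous one."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

long_document = "word " * 1000   # stand-in for an extracted PDF or large text file
chunks = chunk_text(long_document)
print(len(chunks), "chunks ready for the embedding model")
```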
Once you start doing RAG, you have to think about the LLM, or LMM if you wanna go into multimodal types of models. You have to make a choice on which one's suitable from a size perspective, a cost perspective, a latency perspective, and an accuracy perspective. There's lots of dimensions to choosing the appropriate large model for your use case, and then you have to test it out. Typically, there's evaluations for the large language models, and there's less evaluation for this end-to-end process of having embedding vectors to retrieve, having the right indexing for that, having the right LLM, etcetera. And that's kinda getting you to an MVP. Now when you think about running in production, you have to think about monitoring and tracing.
How do you collect feedback so that the system gets better? The best data for any kind of AI system is feedback from real-world usage. And so do you have pipelines to collect that and incorporate it into fine-tuning your models, fine-tuning your prompts, etcetera? And then when you think about enterprise grade, you have to think about what permissions an employee has to access the data that goes into feeding this RAG pipeline. What permissions do they have to the models? How do you scale up replicas of the vector database, the embedders, the LLMs?
How do you do failover? How do you upgrade whenever GPT-5 comes out, or Claude 4, or Gemini 2, or Llama 4? There's always gonna be new models. How do you think about the upgrade process? It becomes really, really complicated. It's very easy to get started with a proof of concept for RAG, but very difficult to run it in production. And having a platform like Clarifai that has all these components and these workflows of feedback, access control, auto scaling, etcetera, helps people go from that prototype phase to production.
[00:14:28] Tobias Macey:
The other aspect of operating at scale is the question of what are the edge cases, what are the failure modes, and in particular, how do you fail gracefully where maybe your vector database goes down, but you don't want your AI product to just completely fail. Maybe you just wanna be able to let it operate in a slightly degraded capacity. What does that look like? And I'm curious how you are seeing people address those aspects of the failure modes, recovery modes, and how to fail gracefully.
[00:15:00] Matt Zeiler:
Yeah. I mean, that's the big problem with taking the approach of stitching together a bunch of different vendors, like a vendor for embeddings, a vendor for the vector database, a vendor for LLMs, a vendor for monitoring. If any one of those vendors goes down, or their API just changes, maybe they made a breaking change and you missed an announcement on it, your system falls apart. And that makes the experience for your customers so much less reliable. And then the other dimension you mentioned is other faults that can happen in production, or just that the models, even with RAG helping ground them, can run into issues where they still hallucinate. And how do you get feedback quickly from your users to catch those cases and tune things? Sometimes it may be automated tuning, like active learning pipelines.
But other times, it might just be thinking about changing your prompts, changing the context parameters, or what's indexed into the database, to help improve the experience for your users over time. So getting that feedback loop is really, really important to help catch these edge cases, because it's not just something you would see on a typical monitoring dashboard or tracing dashboard. You need the actual qualitative feedback as well.
[00:16:25] Tobias Macey:
One of the terms that keeps coming up as we're talking about this is the idea of the context window, and each of these models has different sizes of context windows or the number of tokens that they can accept as context. We're talking about using these vector embeddings as additions to the context for the request, and I'm wondering how that sizing of the context window influences the ways that you think about the embedding generation, what size the chunking should be, how many records to retrieve for augmenting that generation request, and some of the ways that that context window acts as a limiting factor and a forcing function to the way that you design the rest of your overall pipelines and the data flows?
[00:17:10] Matt Zeiler:
Yeah. Yeah. It's a good question. And this is something that really surprises me in the last, you know, 18 to, whatever we're at now, 20 months since ChatGPT came out: how many people talk about the different models that are available and details about them, like context window size. I've had conversations with people that are not technical at all, but they have this in their vocabulary now. It's a crazy moment in time. And I don't know if that will stay true forever, but it is really interesting. The context window is important for RAG because the more context you can stuff with all the retrieved documents, the more facts and more grounding it can do.
So there's some models like Gemini, it's like a million or two million tokens, I forget which, that you can fit in. There's even new algorithms that are not just transformer based, but variants of transformers that kinda have infinite windows. They get away from having this attention mechanism that has to look at every token every time, to something that's more like recurrent networks or convolutional networks, which kinda just continually process as you go. And that opens up really long context windows and doing that processing efficiently. So I think the trend is gonna be that these context windows continue to grow.
But when it is limited, one component I might have missed in describing all the different possible components for RAG is a reranker, which can come in really useful when the context window is small. It kinda comes into play after the query goes and retrieves documents from a vector store. It can rerank that shortlist of documents to get the final subset that is stuffed into the context window. And so that reranker can be another AI model, an additional step on top of the embedding model and on top of the large language model, that can help improve the accuracy of the retrieval process. And that helps, when you are on a small context window, to make sure that whatever you're stuffing in there is the best possible set of document chunks.
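Here is a rough sketch of that reranking step: take the shortlist returned by the vector store, rescore each chunk against the query, and keep only what fits a small context budget. The term-overlap scorer stands in for a real reranking model, and the token budget is illustrative.

```python
# Sketch of reranking: rescore the vector-store shortlist against the query and
# keep the top chunks that fit the context budget. overlap_score() is a stand-in
# for a real reranker model.
def overlap_score(query: str, chunk: str) -> float:
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, shortlist: list[str], budget_tokens: int = 512) -> list[str]:
    ranked = sorted(shortlist, key=lambda ch: overlap_score(query, ch), reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())          # crude token estimate
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

shortlist = ["RAG retrieves chunks to ground generation.",
             "The weather today is sunny.",
             "Reranking picks the best chunks for a small context window."]
print(rerank("how does reranking help RAG", shortlist, budget_tokens=20))
```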
And another part of your question was that chunking process, and I haven't seen any perfect solution for this. It's kinda a lot of heuristics. How large you wanna chunk your documents kinda relates to how good your embedding model is at understanding the whole chunk. Some embedding models are trained at the word level, at the sentence level, at the paragraph level, and at the document level. So depending on which embedding model you wanna use, it'll impact your chunk size.
And then it also factors into the LLM's context window size. If you have a really big chunk size but a smaller context window, you might only get one chunk. And so you better get the right chunk in that scenario; otherwise, it may actually hurt your generation performance. So there's lots of parameters in that decision. I haven't seen a perfect solution to it, but that could be an interesting area for researchers to come up with a more automated way of deciding on these chunks.
[00:20:34] Tobias Macey:
On that context of the embeddings too, I'm wondering if there's any utility in applying dimensionality reduction to be able to reduce the size, or the amount of space in that context window that a single document is taking up, without necessarily changing the chunking pattern or how big of a chunk of the document you're retrieving?
[00:20:56] Matt Zeiler:
So, yeah, just to clarify one important part for the listeners: the embeddings aren't used directly in a typical RAG system. It's not the embeddings that are put into the context window, it's the text itself. There are some approaches that kinda factor in embeddings as tokens, but typically you just retrieve the documents using an embedder and a vector database to get the text, and then you stuff the text into the context window. I have seen some approaches that take that text and apply things like summarization on it so that it's reducing the amount of words. Because typically, any language model, whether it's these large models or the first transformers like BERT-style models, etcetera, which are still incredibly well suited for most use cases, doesn't care about the stop words, the "the", the "of", you know. Throwing all that stuff out is gonna save you tokens in your context window and probably not affect the generation that much. There are some caveats with that, because obviously there's important words like "not" that could completely change the meaning of a sentence.
So you have to be careful with that, but summarizers are good enough that they can keep the important parts and get the meaning, and that can save you on the chunk size fitting into the context window.
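As a toy illustration of that token-saving idea, the sketch below drops common stop words from retrieved text while always keeping negations that could flip a sentence's meaning; the word lists are small illustrative samples, not a production vocabulary.

```python
# Illustration of shrinking retrieved text before stuffing the context window:
# drop common stop words but always keep negations that change meaning.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "or", "in", "on", "that"}
KEEP_ALWAYS = {"not", "no", "never", "nor"}   # negations can flip a sentence's meaning

def compress(text: str) -> str:
    kept = [w for w in text.split()
            if w.lower() in KEEP_ALWAYS or w.lower() not in STOP_WORDS]
    return " ".join(kept)

sentence = "The report states that the drug is not approved in the EU"
print(compress(sentence))   # "report states drug not approved EU"
```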
[00:22:20] Tobias Macey:
The other piece of chunk size, and to your point of these models that have either much larger or potentially unlimited context windows, that brings up the question of what impact it has on the overall latency of the experience, where I can put the entirety of War and Peace into my context window, but now maybe it's gonna take an hour for the model to actually give me a response. And I'm wondering how that affects the balance and the ways that you think about how large of a document to put into the request, given the fact that it may impact the end user experience.
[00:22:54] Matt Zeiler:
Yeah. Yeah. That's where these newer style attention mechanisms become really important. There's a lot of work in this area because everybody knows that as the context window increases, the latency increases. And so these attention mechanisms get away from looking at every token in the context window and comparing it to every other token, which is kind of an n squared operation, and there's other overhead on top of that. With standard attention, you know, even at a 32,000 context window or 128,000, the latency gets significantly slower. So that's why you see most context windows are, like, 4,000 or 8,000 typically, especially for these small models. Otherwise, the attention just makes the models not really useful at that point.
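A back-of-envelope sketch of why that matters: standard self-attention compares every token with every other token, so the work for the score matrix alone grows with the square of the sequence length. The numbers below are rough illustrative counts, not measurements of any particular model.

```python
# Rough illustration of quadratic attention cost as the context window grows.
def attention_score_ops(seq_len: int, head_dim: int = 128) -> int:
    """Approximate multiply-adds to form one attention head's QK^T score matrix."""
    return seq_len * seq_len * head_dim

for n in (4_000, 8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> ~{attention_score_ops(n):.2e} ops per head per layer")
```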
[00:23:59] Tobias Macey:
Now getting into the practical elements of I have an application, I want to use a large language model or some other modality of generative model. I'm going to use retrieval augmented generation. So now the first step is I have to generate those embeddings to put into my vector database. You mentioned the consideration around which embedding model to use, the fact that they're constantly changing, what are the strategies for chunking, and how often to reindex or how to append to the indexes. I'm wondering if you can talk through some of the ways that somebody needs to think through that planning and design step of selecting the embedding model, selecting the chunk sizes, etcetera.
[00:24:45] Matt Zeiler:
Yeah. So, I mean, it all starts with your use case and the data that you have available for that use case, and then everything should fall kinda backwards from there. Because you might be dealing with inherently small snippets of text, like emails, which are pretty small compared to PDFs or, you know, 10-K financial reports, which are really large PDFs. It all depends on your data. And that will dictate a lot of how you decide to chunk it, which embedding models you want to use, which large language models you wanna use, maybe even the vector database, although that typically won't change depending on the type of data or the use case.
There'll be other considerations when choosing your vector DB. For the embedding model, there's some open benchmarks, like the massive text embedding benchmark, MTEB, that have become popular. And you can typically see these types of things on leaderboards on the Internet if you just search for, you know, embedding benchmark. And that gives you a good overview for some general purpose use cases, and those will cover things like emails, chat experiences, documentation, that kind of stuff. But you may get into a domain that these models likely have not seen before.
I always like to talk about classified documents, because clearly they wouldn't have been trained on that data. So in those scenarios, you may have to run your own evaluation benchmarks on your own dataset. And we provide tools at Clarifai to help you evaluate models very easily, and you can actually create your own leaderboards. Using tools like that is really valuable because the general purpose benchmarks are good for understanding which models you should probably evaluate, but then actually evaluating them on your ground truth datasets becomes really, really important.
So that'll give you your embedding model from that process. Another factor in the embedding model choice is gonna be your compute. Do you have GPUs available that are big enough for that model? How are you gonna auto scale that model? Do you have a service that's already doing that, etcetera? That becomes an important choice. Another consideration, and this is true for both the embedding models and the large language models, is understanding whether your data is all in one language as well.
You know, English is trained into most of these models, but for other languages you'll get mixed results depending on which model you choose. We're seeing a trend, obviously, that these models are training on the whole Internet, and that includes lots of languages. So multilanguage support is becoming more of a standard, but the way each model treats different languages is not the same. So if multilanguage support is important for your use case, then you have to factor in evaluating in different languages as well, beyond just English or your native language. And then, yeah, there's the chunking stuff we talked about, thinking through how you actually parse documents of different types. There's different tools to help you handle PDFs and convert them into text, and then the embedding model will take that text and convert it into embeddings.
There's all the chunk size considerations, etcetera, based on your context window and, again, your use case. Ultimately, for all these decisions you can use heuristics, like we talked about. But the most important thing is evaluating the end-to-end system once you're done. Once you build your RAG system, is it actually improving the generation quality? Having evaluation metrics for that, especially as you iterate on these things, is really important so that you can compare each of your settings of chunk size, of embedding model, of LLM, of do I go multilanguage or not, etcetera.
You wanna see quantitative results for your given use case. And so being able to evaluate the end-to-end solution is, I think, the most important thing to think through. And we provide some tools to help our customers do that, which is really exciting, because it gets complicated fast, as you heard us talk about all the different components you need to think through when thinking about solutions like RAG.
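A sketch of what that end-to-end evaluation loop can look like: run the same small ground-truth set through each configuration and compare a simple metric. The eval set, the configuration grid, and the trivial demo pipeline below are all illustrative placeholders for a real retrieve-then-generate call.

```python
# Illustrative end-to-end RAG evaluation sweep: score each configuration on the
# same ground-truth questions with a simple containment-based metric.
from typing import Callable
from itertools import product

eval_set = [
    {"question": "What year was the company founded?", "answer": "2013"},
    {"question": "Which region is the product approved in?", "answer": "EU"},
]

def hit_rate(run_rag: Callable[[str, dict], str], config: dict) -> float:
    """Fraction of questions whose reference answer appears in the generated output."""
    hits = sum(1 for ex in eval_set
               if ex["answer"].lower() in run_rag(ex["question"], config).lower())
    return hits / len(eval_set)

def demo_pipeline(question: str, config: dict) -> str:
    # Placeholder for a real retrieve-then-generate call that would use `config`.
    return "The company was founded in 2013 and the product is approved in the EU."

configs = [{"chunk_size": cs, "embedder": emb}
           for cs, emb in product([128, 512], ["model-a", "model-b"])]
for cfg in configs:
    print(cfg, "hit rate:", hit_rate(demo_pipeline, cfg))
```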
[00:29:19] Tobias Macey:
The majority of our conversation and the majority of the conversation across the ecosystem right now is focused on text and natural language, but there has also been an increase in the desire and application of multimodal models or models of different modalities than just text. And I'm wondering how that impacts the ways that you even can do chunking where it doesn't necessarily make sense to chunk an image because the image is kind of the entirety of the thing that you're asking for, and maybe then you have to go to downscaling the image to reduce the size and quality. Or if you're dealing with audio or video data, how does that influence the applicability of RAG and the ways that you think about that context retrieval?
[00:30:08] Matt Zeiler:
Yeah. Absolutely. I think the trend is definitely going multimodal. And I think in the next year or two, it'll be rare that we have a unimodal model. I think every model's gonna end up being multimodal, just because the more different types of data these models can learn from, the smarter they can get. And when you think about it, it's kinda how humans learn. Like, I have a young child, and just seeing him learn every single day is really interesting, because he can't speak yet. He understands some words, but he can see, he can hear, he can feel. It's all different modalities of data that he's clearly learning from.
And my older daughter has gone through that stage, and she's still learning in a variety of different ways from multiple different data types. And so it's very obvious that these models are only gonna get smarter because they learn from multiple modalities. And how that affects RAG is that you gotta think about the chunking, as you mentioned. Typically, the way these transformers work beyond text is that everything gets tokenized. So text is tokenized by taking a few characters and converting them into a token. In images, you take patches of pixels and convert them into tokens. Similar for audio, you chunk up the audio. And so all of it becomes tokens, and then they're just fed into this giant transformer model that can understand the context and attention amongst this kind of soup of different modalities of tokens.
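As a concrete illustration of that patch-tokenization idea, the sketch below turns an image into a sequence of flattened fixed-size pixel patches, analogous to subword tokens for text; the image and patch sizes are arbitrary examples.

```python
# Illustrative image "tokenization": split an (H, W, C) image into non-overlapping
# pixel patches and flatten each patch into one token vector.
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Return an array of shape (num_tokens, patch*patch*C), one row per patch."""
    h, w, c = image.shape
    h, w = h - h % patch, w - w % patch           # drop any ragged border
    image = image[:h, :w]
    patches = (image
               .reshape(h // patch, patch, w // patch, patch, c)
               .swapaxes(1, 2)
               .reshape(-1, patch * patch * c))
    return patches

img = np.random.rand(224, 224, 3)
tokens = image_to_patch_tokens(img)
print(tokens.shape)   # (196, 768): 14x14 patches of 16x16x3 pixels
```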
That's what allows multimodal models to take in not just one image and text; it could be a list of multiple different data types. And you can do cool use cases like asking what the differences are between these two images, or, you know, give me an audio clip that represents this picture. You can do really cool multimodal stuff on the generation side and the Q&A side. And that has to factor into all the different components within the RAG system, because you have to embed all the different data types, ideally into one common embedding space, so that when you get a query, you can retrieve all the different types of data that are relevant, and then they get tokenized and fed into the large multimodal model.
And then the same process that we walked through with text-only RAG, of evaluating the end-to-end system, is really crucial. We're starting to see more and more benchmarks for multimodal, but I think it's definitely, like, a year and a half or so behind the text large language model benchmarks. And so I think that's gonna be a trend we'll all benefit from and see: just much more high quality datasets to give you the general purpose evaluation benchmarks, and then more tools to help you evaluate multimodal flows like RAG on your particular datasets.
[00:33:10] Tobias Macey:
So we have talked through the embedding piece, thinking about how to strategize that and the considerations that go into it. The next layer of RAG is the vector database, as we've already discussed. For the purposes of retrieval augmented generation, what are the features of a vector store that are most critical, and what are some of the evaluation criteria for figuring out which one of these vector databases versus vector indexes, etcetera, you actually want to include in your architecture?
[00:33:41] Matt Zeiler:
Yeah. This is another place where there's lots of different choices. I think in the last 20 months, there's, like, 50 different vector databases you could choose from. I don't think that's gonna continue. I think there's gonna be a lot of consolidation, which is gonna make it easier to stand up a vector DB in the future. But you have choices like, do you want to use the vector store that's already in the database you're using? I was just looking at this. There's a really good report that I think your listeners would love flipping through, from Retool. It's the State of AI report that I think just came out, and they have a list of the most popular vector databases.
And it was, like, pgvector, which is an extension to Postgres, is most popular. MongoDB is up there. Pinecone's up there, etcetera. And Elastic's up there as well. So with Postgres, MongoDB, and Elastic, it's kinda obvious why they're up there: it's just a feature on your existing database. And that's a huge convenience rather than having a place where you store your data today and then having a completely different vector store. And that's honestly how Clarifai always looked at it. We may be the first production vector database. We launched it in 2016, and it was always built natively with all the other data that you're storing in our platform. And this is a criterion that you should evaluate vector stores with: can you do the queries that you want to do in addition to, and kinda joined with, the vector queries? So we let people filter by metadata, search by AI labels, by what the AI is predicting, as well as combine that with vector search, all in one query. And most vector DBs kinda focus just on the vector part, making that good, and not on the joins with other queries that would be really useful for the overall query that you wanna make.
And even in the RAG situation, that becomes really important, because, back to what we talked about earlier about access control, how would you actually filter by what permissions a user has to the data without having that additional join? You could do joins in memory and all that stuff between different data sources, but that's prone to bugs and inefficiency. So having a vector store that has all the filtering you need in one place is really important.
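As an illustration of combining a metadata filter with a vector query in one place, here is a sketch using pgvector from Python. The table and column names, the tenant filter, and the connection string are hypothetical, and it assumes the pgvector extension and the psycopg driver are installed.

```python
# Sketch: one SQL query that applies access-control/metadata filters and a
# pgvector similarity ordering together, instead of joining results in memory.
import psycopg

query_embedding = [0.12, -0.03, 0.88]   # would come from your embedding model
qvec = "[" + ",".join(str(v) for v in query_embedding) + "]"   # pgvector text form

sql = """
    SELECT id, chunk_text
    FROM document_chunks
    WHERE tenant_id = %(tenant)s               -- metadata / access-control filter
      AND doc_type = %(doc_type)s
    ORDER BY embedding <=> %(qvec)s::vector    -- cosine distance operator in pgvector
    LIMIT 5;
"""

# Connection string, table, and columns are placeholders for this sketch.
with psycopg.connect("dbname=rag user=app") as conn:
    rows = conn.execute(sql, {"tenant": "acme-corp",
                              "doc_type": "contract",
                              "qvec": qvec}).fetchall()
    for row in rows:
        print(row)
```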
And then for vector store specific features, you know, latency is obviously important. For use cases like RAG, though, I would argue it's not the most important thing, because at the end of the day, these large language models and large multimodal models are slow. And especially when you have all the retrieved context that you're gonna stuff their context window with, it's pretty slow to generate. And so having a little bit slower latency won't actually impact the overall system latency that much, because it's not gonna dominate compared to the generation. Obviously, faster is better, but I think when people are choosing vector stores, that can take a lower priority.
I think how it handles scale is important. Can you stuff in lots of data? And when that happens, does it get distributed beyond a single node to handle failover and scale? Those become really important. How does it handle the different dimensionality of your embedding vectors? Some of them have limits, like it only handles 1,024 dimensions and can't do 2,048. Stuff like that can actually cap you on RAG, because the embedding model that you want to use might not actually be able to fit its embeddings into the vector store. And then there's all the different kinds of indexing. How does it index? Is it accurate on retrieval and precision?
Is it efficient to do that retrieval? Can it quantize the embeddings and still keep that index performance high? On quantization: it's funny seeing this whole market of vector stores come out where quantization is a hot new feature, because we literally had that in 2016, and me and a couple of engineers actually wrote that code. It's the same techniques that people are kinda inventing now. We did it because it saves a ton of memory in your vector store. You take a 32-bit precision floating point number, which is typically what these models will output, and crush it down to 8 bits or 4 bits, because the range of these numbers isn't that big.
And sometimes you don't even care about negative numbers, because the models often output only positive numbers in their layers. So you can actually get rid of lots of bits by thinking about the data and looking at the distributions of these values. And so quantization can take a 32-bit number down to just a few bits. There's even results where you can do binary, just zeros and ones, for all the values in the embedding vector. So you can get a 32 times reduction in the memory used, with a trade-off on how accurate those binary vectors are gonna be at the end of the day. And not every vector database supports that kind of binary embedding.
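Here is a small sketch of that embedding quantization idea: scaling float32 embedding values down to int8, and packing the signs into a binary vector; the symmetric max-based scaling scheme is just one simple illustrative approach.

```python
# Illustrative embedding quantization: float32 -> int8 and float32 -> binary,
# showing the memory savings and the rounding error introduced.
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(1024).astype(np.float32)    # 4 KB at 32-bit

scale = np.abs(embedding).max() / 127.0
int8_vec = np.clip(np.round(embedding / scale), -127, 127).astype(np.int8)   # 1 KB
binary_vec = np.packbits(embedding > 0)                     # 128 bytes: 1 bit per dim

reconstructed = int8_vec.astype(np.float32) * scale          # int8 round-trip
print("int8 bytes:", int8_vec.nbytes, "binary bytes:", binary_vec.nbytes)
print("max abs error after int8 round-trip:", np.abs(embedding - reconstructed).max())
```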
The same goes for lower precisions more generally: they don't all support 4-bit or 8-bit. All of them would support 32-bit, and many of them would support 16-bit floats. So that quantization support becomes really important, because the memory, the retrieval latency, and the accuracy all get affected by the quantization process. And there's also benchmarks for these vector stores. It's another area where you can look at a lot of these metrics. I haven't seen that many good independent benchmarks, unfortunately. I was actually just literally doing this this morning: when you search for benchmarks of pgvector versus Pinecone or Qdrant or Milvus, there's lots of these vector DBs,
and the benchmarks are often written by one of those providers. And so it's a little bit like, the provider of any tool is gonna know their tool better than anyone else, so they're gonna make sure the benchmarks are really optimized and really working well, and then they're gonna use kinda default settings for all the other providers. And so I haven't seen a good, ground truth, independent benchmark. If anybody knows of one, I'd love to see it. But I think that would really help the vector store decision process.
[00:40:37] Tobias Macey:
The maybe not final, but I think the next large decision to be made in your RAG stack is what model you're actually going to use to do the generation. Obviously, this is a very fast moving target. It seems like there's a new model every week, or at least tweaked versions of the same set of models coming out on a day to day basis. What are the foundational aspects of a model that you look to when you're deciding which one you actually want to apply to a given problem domain?
[00:41:11] Matt Zeiler:
Yeah. Again, it comes down to some of these open benchmarks. There's, you know, the OpenLLM leaderboard and a lot of these leaderboards you'll see online. So that's a good place to start, because those are typically where any model provider is gonna want to showcase their model on these standard general purpose benchmarks. But again, those are typically good for the general purpose use cases, which will cover a lot around, like, writing emails or chatbots. And we're starting to see leaderboards and evaluation benchmarks that are becoming more specific.
I think I just saw one about CRMs. So if you want to interact with your Salesforce, HubSpot, Pipedrive, etcetera CRM, that data is very specific to sales and marketing activities, and an LLM that's trained on that data is likely gonna perform much better. And so, again, it starts with your use case: decide what you want it to be good at, and then find the most appropriate benchmark. And it'll give you an idea of accuracy. Then there's the notion of how big the generative model is and where you're gonna host that thing. Is it gonna fit into compute that you already have?
Does that compute auto scale? All the production grade inference becomes really, really important with this model, because typically the embedding models are smaller than the generation models that are used in a RAG system. And you don't always need the biggest model. I think people kinda default to the big third party models, and for most use cases they're likely overkill. For RAG use cases in particular, because you're getting that additional help from the retrieval to augment the generation, you don't need the best model that has, you know, a trillion parameters and has memorized the entire Internet, because you're actually getting the data that you need to generate with stuffed into your context window. So a smaller model can actually use those facts that are stuffed in, and pretty much all the small models are really good at writing English and many other languages at this point. And so the task in RAG is much more well defined and constrained, so a smaller model can perform well. And that'll give you improvements not just in how costly it is to run these things, but also the latency.
The bigger the model, typically, the slower it's gonna be. And it also kinda limits a little bit the ability for the models to hallucinate and go off the rails. So that trust and safety, and evaluating the whole end to end, comes into play again, and it's another dimension to choosing the generative model. Hopefully, the RAG pipeline is going to end up addressing the hallucination and the trust and safety concerns. But some models that have never been trained on, you know, offensive words are just not gonna be able to generate that kind of content, as an example. And so depending on your use case, you may want that or you may not want that, and that can be another dimension of trust and safety you have to think through in choosing your generative model.
[00:44:38] Tobias Macey:
One of the interesting side effects of all of this interest in generative AI and RAG is the growth in vector databases and vector indexes, and those have started to be repurposed towards semantic search. I'm wondering what are some of the other ways that this investment in RAG capabilities is able to be repurposed for other use cases?
[00:45:05] Matt Zeiler:
Yeah. And this is actually why, when we launched the vector database in Clarifai, we never actually called it a vector database. We never marketed it that way. We never anticipated the world to have this vocabulary of talking about embeddings and quantization of them and all these LLM choices. That kinda changed overnight when ChatGPT happened. So when we created our vector database, it was never for RAG. It was for things like visual similarity search. So you have an example of a product in front of you that you wanna buy, like a nice chair or something, and you use that image to query a product catalog to find the most similar looking chairs to buy online.
So that's a great use case for similarity search. The same thing happens with text. It's the exact same process that happens in the retrieval step of RAG, just stopping there. I have a snippet of text; do I have any similar snippets of text? This is useful in general for content organization. So we work a lot with marketing teams, as an example. They have all their visual content and all their text content indexed in Clarifai, and people can very easily type a query in the search bar and find similar content to that query automatically, without writing code and stitching all these components together.
That's a huge value add, and that comes into play in all these intelligence community applications as well. There are lots of use cases for similarity search, like a vehicle, for example: you think you've seen it in security camera footage before and you want to check, so you can use the vehicle as the query and match it against a large database. And then the other bucket of use cases beyond similarity search is just deduplicating data. Think of duplicate images: one technique is to just match the pixels, but then it has to be exactly the same image, and that's not very reliable. Even just changing the compression settings on an image will change the pixels, even though they look visually the same to our eyes. And so having the images embedded and then comparing at a high level understanding of the images is very effective at deduplicating data, because you can actually find similarities in clusters and say all of these are derived from the same image, or they're duplicates.
And when you do these matches in vector stores, you typically get a score of how similar things are. You can use that score to gauge how confident you are that these are exact duplicates, near duplicates, or not duplicates at all. And going back to some of the business use cases for that: marketing teams buy stock photos all the time, and now a lot of those are generative. Do they already have one that looks like this, or do they need to go buy a new one? That deduplication has a real dollar value directly attached to it. So having a vector store becomes really powerful for these other use cases, and RAG just came about as one additional, newer use case for these generative models.
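A small sketch of that embedding-based deduplication: compare normalized embeddings with cosine similarity and bucket pairs by threshold. The thresholds and the random stand-in embeddings are purely illustrative.

```python
# Illustrative deduplication with embeddings: cosine similarity between
# normalized vectors, with thresholds bucketing pairs into duplicate classes.
import numpy as np

rng = np.random.default_rng(1)
emb = rng.standard_normal((4, 512))
emb[1] = emb[0] + 0.01 * rng.standard_normal(512)        # a near-duplicate of item 0
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows

similarity = emb @ emb.T                                  # cosine similarity matrix

def label(score: float) -> str:
    if score > 0.99:
        return "exact/near-exact duplicate"
    if score > 0.90:
        return "near duplicate"
    return "distinct"

for i in range(len(emb)):
    for j in range(i + 1, len(emb)):
        print(f"item {i} vs {j}: {similarity[i, j]:.3f} -> {label(similarity[i, j])}")
```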
[00:48:34] Tobias Macey:
In your experience of working in this space, supporting these RAG use cases, and building the functionality in your platform to manage that end-to-end flow, what are some of the most interesting or innovative or unexpected ways that you've seen those techniques used?
[00:48:53] Matt Zeiler:
Yeah. I mean, there's so many, I don't know. Maybe I'll start with the least innovative, I guess. I'm seeing just so many chat experiences added to products, and I think that's gonna be a fad, and maybe it's already fading away. But I think those types of use cases are suitable when you already have a chat experience, like a customer service bot on your website. That totally makes sense. But just injecting a chat window to have people interact with things that were never meant to be interacted with in a single box is a decrease in user experience.
So I'm actually not a big fan of those types of RAG use cases. But where you already have a search bar and that is a natural thing, whether it's Internet search, or, you know, we use Confluence for our internal documentation and Google Drive, those types of searches can all be improved very significantly with these components that you need for RAG. And the next step beyond just retrieving the documents from the search is to start summarizing them. We're starting to see that even internally, and you obviously see it when you search on Google, and I believe Bing as well at this point. They're summarizing, depending on your query, of course.
For queries that are already indexed, they're producing a good summary so you don't have to click through all the results, and that's just a huge productivity gain for everybody. When I think of some of the most interesting applications of this, or maybe the highest value applications, it's where the people having to do those traditional search, read, and summarize tasks are in high demand. They're domain experts in their field. That might be intelligence analysts in the intelligence community, doctors in the medical community, lawyers in the legal space, financial analysts, etcetera.
That's where these productivity gains are really unlocking business value. And the domains I listed are the ones least likely to have been trained into the model in the first place. And so solutions like RAG, where you can ground the model with factual stuff like medical textbooks, or all the financial 10-Ks of the quarter, or all the classified documents in an archive, that's when RAG really unlocks the biggest ROI beyond having a generative model alone.
So I think those are the most exciting use cases I've seen RAG used for.
[00:51:35] Tobias Macey:
And in your experience of working in this space, being in the ML community for so long as generative AI has really started to take all of the attention away from other applications of machine learning, what are some of the most interesting or unexpected or challenging lessons you've learned personally?
[00:51:54] Matt Zeiler:
Yeah. I mean, the gap between prototype and production, I think, is just surprising, and I almost feel like it's getting bigger as the field continues to have lots of new things every day, lots of continued innovation. People can get up and running easier, so the prototype is becoming faster, but getting into production is a huge undertaking. And I think people often underestimate that. We like to talk about that as the false finish line. People get a use case in their company up and running, maybe on a laptop. It's pretty easy to do that even with solutions like RAG. But then think about those day 2 operations of running in production, not just getting to production. How do you keep it up 24/7 with many nines of uptime?
How do you do that cost effectively? How do you have those access controls, scaling up, scaling down, etcetera? How do you keep that up to date with the state of the art? How do you do that in a way that you're not constantly in procurement mode, reviewing the security posture of every vendor you're choosing, because they're all different pieces of the stack? If you're cobbling together the stack, how do you have tests that make sure it's gonna maintain its stitched-together state over time as vendors change their APIs or new things come out? It becomes a huge undertaking going from prototype to production, and that just continues to be surprising. Because when we talk to a lot of different customers, they have this default mindset that they need to build the stack themselves, and everybody's spending, like, 75 to 80 percent of their time building tools for AI rather than building AI into their business.
And I think that's just a huge mistake and something that we've seen before, with the cloud transformation. People like HashiCorp, built a bunch of great developer tools that helped with the cloud transformation. And the the companies that adopted those tools accelerated and the ones that didn't ended up spending their time reinventing the wheel, building tools themselves, and lagged. And the same thing's happening and and repeating itself in the AI space. That's actually what kind of excited a bunch of new executives joining Clarify from HashiCorp and Cohesity and UiPath and a bunch of other great companies, because they see what we're doing here of accelerating developers from prototype to production in enterprises is, is exactly the same kind of playbook that was successful in the, cloud transformation and really helped empower, users to, to kind of become heroes in their in their business.
So we wanna make a lot of developers into AI heroes, with our platform.
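As a concrete illustration of the testing point above, here is a minimal sketch, assuming each vendor in a stitched-together RAG stack is wrapped behind a thin callable, of a contract-style smoke test that catches API drift (an embedder changing its output dimension, retrieval silently returning nothing, and so on). The stub implementations at the bottom are placeholders, not real SDK clients.

```python
# A minimal sketch of a contract check for a stitched-together RAG stack.
# The embed/retrieve/generate callables are stand-ins for whatever vendor SDKs you use.
from typing import Callable, List

EXPECTED_DIM = 768  # must match the dimension your vector index was built with
TOP_K = 4


def check_rag_contract(
    embed: Callable[[str], List[float]],
    retrieve: Callable[[List[float], int], List[str]],
    generate: Callable[[str], str],
) -> None:
    query = "What does our latest 10-K say about operating margin?"

    vec = embed(query)
    assert len(vec) == EXPECTED_DIM, "embedder dimension changed; pin the model or reindex"

    chunks = retrieve(vec, TOP_K)
    assert 0 < len(chunks) <= TOP_K, "retrieval returned nothing (or more than requested)"

    prompt = "Answer using only this context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    answer = generate(prompt)
    assert answer.strip(), "generation returned an empty response"


if __name__ == "__main__":
    # Stubs so the sketch runs standalone; swap in real vendor clients in your tests.
    check_rag_contract(
        embed=lambda text: [0.0] * EXPECTED_DIM,
        retrieve=lambda vec, k: ["Operating margin was 12% in Q3."],
        generate=lambda prompt: "Operating margin was 12%.",
    )
    print("RAG contract checks passed")
```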
[00:54:47] Tobias Macey:
To that point, there's always a gradient from, I can throw together a prototype really quickly but I don't actually understand what's happening, to, I'm a subject matter expert in, in this case, machine learning, but maybe I don't have all the operational capabilities. What is the level of domain knowledge and expertise, in that context of machine learning and AI, that's necessary to be able to effectively apply these AI models and RAG stacks to a problem domain and build a product around it?
[00:55:22] Matt Zeiler:
That's such a good point. Yeah. I think there's a lot of tools like Clarifai that abstract away all the complexities and make it really easy. And I think that lowers the barrier to entry a lot in adopting AI. And so some concrete examples are in the intelligence community. We actually have intelligence analysts themselves using our platform. Our platform has, you know, the APIs that power all the back end, really easy to use SDKs for developers to get up and running, doing RAG in 5 lines of code, for example. But then on top of that are user interfaces that make it literally a few clicks to configure a RAG system or do search queries over your data, train a model at the click of a button, manage your datasets, etcetera.
And so that's where you don't have to have a PhD in AI like I do. You don't have to be a developer even. You can actually start getting the benefits of AI with just a few clicks. And I think that all depends on your stack. If you're going the route of stitching together a bunch of tools, the domain experts really can't use it. It's too complicated. The developers are spending most of their time stitching together tools and maintaining them. They're probably never gonna write tests, so it's gonna be brittle. And the data scientists may come into play there, kinda just customizing some of the stack, but all of it's not gonna be as customizable as they really want.
So it's kinda not good for anybody. So choosing the right set of tools really helps kinda democratize the experience around AI.
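For readers who want to see what "RAG in 5 lines of code" can look like in practice, here is a hedged sketch. It follows the pattern of Clarifai's Python SDK RAG helper as advertised in their quickstart, but the module path, class, and method names (clarifai.rag.RAG, setup, upload, chat) and the placeholder values are assumptions to verify against the current SDK docs, not something confirmed in the episode.

```python
# A rough sketch of a few-lines RAG setup with a managed SDK. Module path and
# method names follow Clarifai's published RAG quickstart as best I recall;
# treat them as assumptions and check the current documentation.
from clarifai.rag import RAG

# Stand up a RAG workflow (embedding model + vector index + LLM) under your account.
rag_agent = RAG.setup(user_id="YOUR_USER_ID")  # placeholder ID; assumes a Clarifai PAT in the environment

# Index a folder of documents so they can be retrieved as grounding context.
rag_agent.upload(folder_path="./my_documents")  # placeholder path

# Ask a question; retrieval and generation happen behind the scenes.
response = rag_agent.chat(messages=[{"role": "human", "content": "Summarize the key risks in our latest 10-K."}])
print(response)
```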
[00:57:05] Tobias Macey:
And for people who want to take advantage of all these new AI capabilities, what are the cases where RAG is the wrong choice? And maybe you just wanna go with prompt engineering or use the out of the box model or you need to go down one of the other roads of fine tuning or building your own model from scratch?
[00:57:23] Matt Zeiler:
Yeah. So I think these general purpose use cases are pretty effective, like taking a bunch of text and summarizing it, or just continuing to generate, like prompt engineering it to tell me a bedtime story for my daughter. For those types of things, the information that's trained into the model from the Internet is good enough, and actually kinda what you want. So doing RAG in those scenarios is gonna be a wrong choice. You really wanna use RAG when you have custom data and you want to ground the model to prevent hallucinations as much as possible. When you're actually generating, like, a bedtime story, you kinda actually want hallucinations. So it's a good example where RAG would actually be hurtful to performance.
When you're doing creative work, like generating writing samples, generating advertisements, writing your emails, those types of things, RAG may not be the best choice, unless you really need to get something very specific. Like, it has to be advertised with this certain message, or you want it to write emails exactly like you do and not any better than you do, which these models can typically do now. So it's really that level of customization that is gonna dictate when you want to use RAG or not. And that's also true for fine tuning and pretraining models as well. Prompt engineering should generally be where you start for a use case, and then you can see if it's good enough or not.
[00:59:00] Tobias Macey:
As you continue to work in the space, work with your customers, stay abreast of the latest trends and developments, what are your predictions or aspects of the ecosystem that you're keeping a close eye on and getting most excited by?
[00:59:16] Matt Zeiler:
Yeah. I think we talked about this a little bit. This notion of consolidation, I think, is gonna happen. There's just too many different options at every layer of the stack. Vector databases, we mentioned this, there's, like, 50 different vector databases. And it's not just an AI thing. No market can support that much variety. So I think a lot of this is gonna be consolidated into offerings that provide multiple different components in one place. Right now, it just feels very disjoint in the whole pipeline you need for these more complicated use cases like RAG: the data prep tools, data transformation, integrations with your data sources, the prompt engineering tools, the vector database, the embedders, the LLMs. There's lots of different components, evaluation metrics and leaderboards, etcetera.
So consolidating where all that just works out of the box as a workflow, that's why we're calling Clarifai the AI workflow orchestration platform, because we actually help people do a use case like RAG very, very quickly. And so I think that's gonna be a trend. We talked about the multimodal piece as well. I think that's an obvious trend. And models like GPT-4o are really interesting because they're blending the modalities kind of in real time, and I think that's gonna be a continued trend as well. It's gonna open up a lot of experiences that are much more natural for communicating with these models than, you know, the chat window we talked about before, or the kind of high latency experience in other approaches. So I think that's gonna be a continued trend. And then I think just more and more ease of use.
You know, everybody talked about embedding models and getting excited about the next new LLM coming out and all that kind of stuff. I think it's gonna move to the background pretty soon, and people are just gonna be like, how do I use this? How do I use it in production in the easiest possible way? I don't care about all these details. I just want it to work. I think that's gonna be a trend as well.
[01:01:28] Tobias Macey:
Are there any aspects of the overall space of retrieval augmented generation, applications of these generative models, or the work that you're doing at Clarifai that we didn't discuss yet that you'd like to cover before we close out the show?
[01:01:41] Matt Zeiler:
No. I think we covered a lot today. It was a great overview. Hopefully, that was helpful for everybody listening. Absolutely.
[01:01:48] Tobias Macey:
And for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning and AI today.
[01:02:05] Matt Zeiler:
Yeah. I think it's a lot about what we talked about today. Just how disjoint the ecosystem is, and that's leading to difficulty going beyond that prototype into production. So I hope people test out clarifai.com. You can sign up for free, get started, and you'll see how having all the tools in one place can help accelerate you all the way to production.
[01:02:29] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your experience and perspectives on retrieval augmented generation and the considerations that go into the different layers of those stacks. It's definitely great to learn a bit more about those concepts, the technologies involved, and the business questions involved. So I appreciate you taking the time today, and I hope you enjoy the rest of your evening. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management, and Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@aiengineeringpodcast.com with your story.
That's where these productivity gains are really unlocking business value. And when you think of the domains I listed, they're the domains whose data is least likely to have been trained into the model in the first place. So solutions like RAG, where you can ground the model with factual sources like medical textbooks, all the financial 10-Ks of the quarter, or all the classified documents in an archive, are where RAG really unlocks the biggest ROI beyond having a generative model alone.
So I think those are the most exciting use cases I've seen for RAG.
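To make the grounding step Matt describes concrete, here is a minimal retrieve-then-ground sketch. The toy corpus, the bag-of-words stand-in for an embedding model, and the prompt wording are all illustrative assumptions; a production system would use a trained embedding model, a real vector store, and whichever LLM you prefer.

```python
import numpy as np

# Toy corpus standing in for domain documents (10-Ks, textbooks, archives).
corpus = [
    "10-K excerpt: revenue grew 12 percent year over year, driven by cloud services.",
    "Medical textbook excerpt: first-line treatment guidance for condition X.",
    "Archived report: vehicle sightings logged by camera 7 in March.",
]

# Stand-in embedding: a bag-of-words vector over the corpus vocabulary.
# A real system would use a trained embedding model instead.
vocab = sorted({w for doc in corpus for w in doc.lower().split()})

def embed(text: str) -> np.ndarray:
    counts = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(counts)
    return counts / norm if norm else counts

corpus_embeddings = np.stack([embed(doc) for doc in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k documents by cosine similarity to the query."""
    scores = corpus_embeddings @ embed(query)
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def grounded_prompt(query: str) -> str:
    """Assemble a prompt that grounds the generator in retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The assembled prompt would then be sent to whichever LLM you are using.
print(grounded_prompt("How much did revenue grow year over year?"))
```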
[00:51:35] Tobias Macey:
And in your experience of working in this space, being in the ML community for so long as generative AI has really started to take all of the attention away from other applications of machine learning, what are some of the most interesting or unexpected or challenging lessons you've learned personally?
[00:51:54] Matt Zeiler:
Yeah. The gap between prototype and production, I think, is just surprising, and I almost feel like it's getting bigger as the field continues to have lots of new things every day, lots of continued innovation. People can get up and running more easily, so prototyping is becoming faster, but getting into production is a huge undertaking, and I think people often underestimate that. We like to talk about that as the false finish line. People get a use case in their company up and running; it might be on a laptop. It's pretty easy to do that even with solutions like RAG. But then think about the day-2 operations of running in production, not just getting to production: how do you keep it up 24/7 with many nines of uptime?
How do you do that cost effectively? How do you have access controls, scaling up, scaling down, etcetera? How do you keep it up to date with the state of the art? How do you do that without constantly being in procurement mode and reviewing the security posture of every vendor you're choosing, because they're all different parts of the stack? If you're cobbling the stack together, how do you have tests that make sure it maintains its stitched-together state over time as vendors change their APIs or new things come out? It becomes a huge undertaking going from prototype to production, and that just continues to be surprising. When we talk to a lot of different customers, they have this default mindset that they need to build the stack themselves, and everybody's spending 75 to 80 percent of their time building tools for AI rather than building AI into their business.
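On the testing point, here is one hedged sketch of what a smoke test for a stitched-together RAG stack could look like. `build_pipeline`, the stub retriever, and the grounding check are all hypothetical placeholders; in practice they would wrap whichever vendor APIs and models you have integrated.

```python
# Hypothetical pytest-style smoke tests for a stitched-together RAG stack.
# `build_pipeline` is a stand-in for however you wire your vendors together;
# swap in your real retriever and generator clients.

def build_pipeline():
    """Toy pipeline stub: the retriever returns fixed docs, the generator echoes context."""
    docs = ["Policy doc: refunds are issued within 30 days.",
            "Policy doc: shipping is free over $50."]

    def retrieve(query: str, k: int = 2) -> list[str]:
        return docs[:k]

    def answer(query: str) -> str:
        context = " ".join(retrieve(query))
        return f"Based on: {context}"

    return retrieve, answer

def test_retriever_returns_k_results():
    retrieve, _ = build_pipeline()
    assert len(retrieve("refund policy", k=2)) == 2

def test_answer_is_grounded_in_retrieved_context():
    retrieve, answer = build_pipeline()
    docs = retrieve("refund policy")
    response = answer("refund policy")
    # A crude grounding check: the answer should reference retrieved content.
    assert any(doc.split(":")[0] in response for doc in docs)

if __name__ == "__main__":
    test_retriever_returns_k_results()
    test_answer_is_grounded_in_retrieved_context()
    print("smoke tests passed")
```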
And I think that's a huge mistake, and something we've seen before with the cloud transformation. Companies like HashiCorp built a bunch of great developer tools that helped with the cloud transformation. The companies that adopted those tools accelerated, and the ones that didn't ended up spending their time reinventing the wheel, building tools themselves, and lagged. The same thing is repeating itself in the AI space. That's actually what excited a bunch of new executives joining Clarifai from HashiCorp, Cohesity, UiPath, and a bunch of other great companies: they see that what we're doing here, accelerating developers from prototype to production in enterprises, is exactly the same playbook that was successful in the cloud transformation and really helped empower users to become heroes in their business.
So we want to make a lot of developers into AI heroes with our platform.
[00:54:47] Tobias Macey:
To that point, there's always a gradient from "I can throw together a prototype really quickly, but I don't actually understand what's happening" to "I'm a subject matter expert in, in this case, machine learning, but maybe I don't have all the operational capabilities." What is the level of domain knowledge and expertise in machine learning and AI that's necessary to effectively apply these AI models and RAG stacks to a problem domain and build a product around it?
[00:55:22] Matt Zeiler:
That's such a good point. Yeah, I think there are a lot of tools like Clarifai that abstract away all the complexity and make it really easy, and I think that lowers the barrier to entry a lot in adopting AI. A concrete example is in the intelligence community: we actually have intelligence analysts themselves using our platform. Our platform has the APIs that power the whole back end, really easy-to-use SDKs for developers to get up and running, doing RAG in five lines of code, for example. And then on top of that are user interfaces that make it literally a few clicks to configure a RAG system, do search queries over your data, train a model at the click of a button, manage your datasets, etcetera.
So that's where you don't have to have a PhD in AI like I do. You don't even have to be a developer. You can actually start getting the benefits of AI with just a few clicks. And I think that all depends on your stack. If you're going the route of stitching together a bunch of tools, the domain experts really can't use it; it's too complicated. The developers are spending most of their time stitching together tools and maintaining them, probably never going to write tests, so it's going to be brittle. And the data scientists may come into play there, customizing some of the stack, but it's not going to be as customizable as they really want.
So it's not good for anybody. Choosing the right set of tools really helps democratize the experience around AI.
[00:57:05] Tobias Macey:
And for people who want to take advantage of all these new AI capabilities, what are the cases where RAG is the wrong choice, and maybe you just want to go with prompt engineering or use an out-of-the-box model, or you need to go down one of the other roads of fine-tuning or building your own model from scratch?
[00:57:23] Matt Zeiler:
Yeah. So I think general-purpose use cases are where the models are pretty effective on their own, like taking a bunch of text and summarizing it, or just prompt engineering to tell me a bedtime story for my daughter. For those types of things, the information that's trained into the model from the Internet is good enough, and actually what you want, so doing RAG in those scenarios would be the wrong choice. You really want to use RAG when you have custom data and you want to ground the model to prevent hallucinations as much as possible. When you're generating a bedtime story, you actually kind of want hallucinations, so it's a good example where RAG would be hurtful to performance.
For creative work, like generating writing samples, generating advertisements, writing your emails, those types of things, RAG may not be the best choice, unless you really need something very specific, like the advertisement has to carry a certain message, or you want it to write emails exactly like you do and not any better than you do, which these models can typically do now. So it's really that level of customization that's going to dictate when you want to use RAG or not, and that's also true for fine-tuning and pretraining models as well. Prompt engineering should generally be where you start for a use case, and then you can see whether it's good enough or not.
[00:59:00] Tobias Macey:
As you continue to work in the space, work with your customers, and stay abreast of the latest trends and developments, what are your predictions, or what aspects of the ecosystem are you keeping a close eye on and getting most excited by?
[00:59:16] Matt Zeiler:
Yeah. I think we talked about this a little bit. This notion of consolidation, I think, is going to happen. There are just too many different options at every layer of the stack. Vector databases, we mentioned this: there are something like 50 different vector databases, and it's not just an AI thing; no market can support that much variety. So I think a lot of this is going to be consolidated into offerings that provide multiple components in one place. Right now it just feels very disjointed across the whole pipeline you need for these more complicated use cases like RAG: the data prep tools, data transformation, integrations with your data sources, the prompt engineering tools, the vector database, the embedders, the LLMs, evaluation metrics and leaderboards, etcetera. There are lots of different components.
So consolidation, where all of that just works out of the box as a workflow, is coming. That's why we're calling Clarifai the AI workflow orchestration platform, because we actually help people do a use case like RAG very quickly. So I think that's going to be a trend. We talked about multimodal as well; I think that's an obvious trend. Models like GPT-4o are really interesting because they're blending the modalities in real time, and I think that's going to be a continued trend too. It's going to open up a lot of experiences that are much more natural ways to communicate with these models than the chat window we talked about before, or the high-latency experiences in other approaches. And then I think just more and more ease of use.
Everybody talks about embedding models and gets excited about the next new LLM coming out and all that kind of stuff. I think it's going to move to the background pretty soon, and people are just going to ask: how do I use this? How do I use it in production in the easiest possible way? I don't care about all these details; I just want it to work. I think that's going to be a trend as well.
[01:01:28] Tobias Macey:
Are there any aspects of the overall space of retrieval augmented generation, the applications of these generative models, or the work that you're doing at Clarifai that we didn't discuss yet that you'd like to cover before we close out the show?
[01:01:41] Matt Zeiler:
No, I think we covered a lot today. It was a great overview. Hopefully that was helpful for everybody listening.
[01:01:48] Tobias Macey:
Absolutely. And for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning and AI today.
[01:02:05] Matt Zeiler:
Yeah. I think it's a lot of what we talked about today: just how disjointed the ecosystem is, and that's leading to difficulty going beyond the prototype into production. So I hope people test out clarifai.com. You can sign up for free, get started, and you'll see the difference that having all the tools in one place can make in accelerating you all the way to production.
[01:02:29] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share your experience and perspectives on retrieval augmented generation and the considerations that go into the different layers of those stacks. It's definitely been great to learn a bit more about those concepts, the technologies involved, and the business questions involved. So I appreciate you taking the time today, and I hope you enjoy the rest of your evening. Thank you. Thanks for having me. Thank you for listening. Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management, and Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@aiengineeringpodcast.com with your story.
Introduction to the AI Engineering Podcast
Guest Introduction: Matt Zeiler of Clarifai
The Evolution and Impact of Generative AI
Understanding Retrieval Augmented Generation (RAG)
Prompt Engineering vs. RAG
Technical Architecture of RAG
Handling Edge Cases and Failures in RAG Systems
The Role of Context Windows in RAG
Generating Embeddings for RAG
Multimodal Models and RAG
Choosing the Right Vector Database
Selecting the Right Generative Model
Repurposing RAG Capabilities
Innovative Uses of RAG
Lessons Learned in the ML Community
Domain Knowledge Required for RAG
When RAG is the Wrong Choice
Future Trends in RAG and AI
Closing Remarks