Summary
In this episode of the AI Engineering Podcast, Viraj Mehta, CTO and co-founder of TensorZero, talks about the use of LLM gateways for managing interactions between client-side applications and various AI models. He highlights the benefits of using such a gateway, including standardized communication, credential management, and potential features like request-response caching and audit logging. The conversation also explores TensorZero's architecture and functionality in optimizing AI applications by managing structured data inputs and outputs, as well as the challenges and opportunities in automating prompt generation and maintaining interaction history for optimization purposes.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Seamless data integration into AI applications often falls short, leading many to adopt RAG methods, which come with high costs, complexity, and limited scalability. Cognee offers a better solution with its open-source semantic memory engine that automates data ingestion and storage, creating dynamic knowledge graphs from your data. Cognee enables AI agents to understand the meaning of your data, resulting in accurate responses at a lower cost. Take full control of your data in LLM apps without unnecessary overhead. Visit aiengineeringpodcast.com/cognee to learn more and elevate your AI apps and agents.
- Your host is Tobias Macey and today I'm interviewing Viraj Mehta about the purpose of an LLM gateway and his work on TensorZero
- Introduction
- How did you get involved in machine learning?
- What is an LLM gateway?
- What purpose does it serve in an AI application architecture?
- What are some of the different features and capabilities that an LLM gateway might be expected to provide?
- Can you describe what TensorZero is and the story behind it?
- What are the core problems that you are trying to address with TensorZero and for whom?
- One of the core features that you are offering is management of interaction history. How does this compare to the "memory" functionality offered by e.g. LangChain, Cognee, Mem0, etc.?
- How does the presence of TensorZero in an application architecture change the ways that an AI engineer might approach the logic and control flows in a chat-based or agent-oriented project?
- Can you describe the workflow of building with TensorZero and some specific examples of how it feeds back into the performance/behavior of an LLM?
- What are some of the ways in which the addition of TensorZero or another LLM gateway might have a negative effect on the design or operation of an AI application?
- What are the most interesting, innovative, or unexpected ways that you have seen TensorZero used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on TensorZero?
- When is TensorZero the wrong choice?
- What do you have planned for the future of TensorZero?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- TensorZero
- LLM Gateway
- LiteLLM
- OpenAI
- Google Vertex
- Anthropic
- Reinforcement Learning
- Tokamak Reactor
- Viraj RLHF Paper
- Contextual Dueling Bandits
- Direct Preference Optimization
- Partially Observable Markov Decision Process
- DSPy
- PyTorch
- Cognee
- Mem0
- LangGraph
- Douglas Hofstadter
- OpenAI Gym
- OpenAI o1
- OpenAI o3
- Chain Of Thought
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast-moving world of building scalable and maintainable AI systems. Seamless data integration into AI applications often falls short, leading many to adopt RAG methods, which come with high costs, complexity, and limited scalability. Cognee offers a better solution with its open source semantic memory engine that automates data ingestion and storage, creating dynamic knowledge graphs from your data. Cognee enables AI agents to understand the meaning of your data, resulting in accurate responses at a lower cost. Take full control of your data in LLM apps without unnecessary overhead.
Visit aiengineeringpodcast.com/cognee today to learn more and elevate your AI apps and agents. Your host is Tobias Macey, and today I'm interviewing Viraj Mehta about the purpose of an LLM gateway and his work on TensorZero. So, Viraj, can you start by introducing yourself?
[00:01:09] Viraj Mehta:
Yeah. Sure. Hi. I'm Viraj. I am the CTO and the cofounder at TensorZero, and I'm really excited to be here today to tell you about how we think about AI applications, our product, and the open source software that supports it.
[00:01:24] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:01:29] Viraj Mehta:
Yeah. Sure. So I started in computer science more generally when I was a freshman in college, and I only had limited experience programming at that point. But I got quite bored during my second summer internship. I was at Google, and I finished the software engineering work that they had assigned me to do, so I spent a lot of that summer running around Google, where there are a ton of really interesting people, trying to chat with folks whose jobs I thought were cool. What I realized, kind of observing the patterns in my own interest, was that a lot of those folks had done machine learning work and were doing machine learning work, even if I didn't realize it at the time: the Google self-driving car project at X was heavy on machine learning, the generative music project was heavy on machine learning. I noticed that post facto. I also realized that all of those folks had research backgrounds, and in particular most of them had PhDs. So having noticed that about myself, when I went back to school I got involved in the vision lab and started working on things like 3D vision and robot grasping and some of the techniques coming out of a standard vision lab. At that point the bug got me: AI research is awesome, it's so fun, it's an unbounded and fractally interesting problem, and it touches so many different layers of the stack of computing. I really enjoyed those aspects of the problem, and that is a bug that hasn't left since then, maybe 2016, I think.
[00:02:59] Tobias Macey:
And so that has ultimately brought you to TensorZero. But before we get too deep into what you're building specifically, I just wanted to get an understanding of what an LLM gateway even is and what its purpose is in the overall architecture of an AI application.
[00:03:17] Viraj Mehta:
Yeah. Sure. So the way we think about an LLM gateway is that it is a server that you might stick in between all of the client-side application code, which does the normal things an application does, and the set of AI models out there that you might be interfacing with. Those might be external third-party API providers serving LLMs (you can take OpenAI and drop it in there), cloud providers like GCP, AWS, and Azure, or your own self-hosted LLMs, things you might run on your own GPUs with something like vLLM or its competitors. You want a centralized place where, if your application code needs to talk to an LLM, it can make a request to one place and have it routed to the right downstream LLM server. The gateway also does a lot of the bookkeeping and standardization and observability and the other stuff that you want to happen all in one place so that you don't have to do it all over your application code. A gateway server is a nice way to do all of that, and also to manage credentials so that you're not sending your OpenAI API key to a bunch of different places where your application might want to call OpenAI from.
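To make that concrete, here is a minimal sketch of what "one place to call" looks like from the application side: an OpenAI-compatible client pointed at a gateway rather than at a provider directly. The base URL, port, path, and model name are placeholders, not any specific gateway's real address.

```python
# Minimal sketch: send chat requests to a gateway instead of a provider.
# The base_url below is a placeholder; use whatever address your gateway exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",  # hypothetical gateway endpoint
    api_key="placeholder",  # provider credentials live on the gateway, not in the app
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # the gateway decides which upstream provider serves this
    messages=[{"role": "user", "content": "Summarize this support ticket for an agent."}],
)
print(response.choices[0].message.content)
```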
[00:04:37] Tobias Macey:
And in terms of that LLM gateway functionality and its role as a centralized proxy for interacting with the LLMs, what are some of the different types of features and capabilities that you might expect from that type of service, and what are the opportunities that it provides, as a sort of choke point in the architecture, to add additional features or functionality or value-added utility?
[00:05:10] Viraj Mehta:
Yeah. Sure. So I'll start with the general case, and then I'll talk about how we depart from maybe the standard architecture. The at-minimum thing that you want, and this could be provided by a client library or by a gateway server (I think LiteLLM made the smart design decision of giving you the option), is that when you're developing, you don't need a binary running somewhere else; you can just directly call LLM APIs, and then as you move to production, you can have the actual server.
But regardless, you want to be able to call many LLMs without having to change the form of your request to match the expected interface of each provider. Most of the providers are in some way OpenAI-compatible, so you could use the OpenAI client to talk to vLLM and SGLang and some of the APIs exposed by Google, but not most of them, and you wouldn't be able to call the Anthropic family of models. One of the trickier things is that even within the broad umbrella of things that look like OpenAI, there are a lot of small differences in how the individual providers work. So a further desideratum for an LLM gateway type of application is that it does some rationalization of the features that may or may not work across the different providers. For example, you may want your return type to be JSON, which is a common thing people want from LLMs because obviously you want machine-readable output, and some providers don't support JSON mode but they do support tool calls. OpenAI, I believe, was like this for a while. One example that just came up for us is that TGI supports a tool mode where you can actually get guaranteed JSON (they claim a guaranteed schema), so you can do the thing you want and get JSON out, but they don't explicitly support JSON mode the way OpenAI does. So one thing you might want from your LLM gateway is: if I'm using the TGI backend and I'm routing my request to my own model that runs in TGI, I want you, transparently and without my application code having to think about it, to know when I ask for JSON that the only way to get JSON out of TGI is to pretend this is a tool call, force the model to call that tool and generate the correct arguments for it, and then munge that back into "here's your JSON, have a nice day." Those are the kinds of features that are table stakes for an LLM gateway: just make the downstream provider set look exactly the same to you. Some other software-engineering-style features that you're going to want are things like configurable retries, load balancing, and fallbacks, so that your application doesn't have to think about the fact that there are six different places it can call to get an implementation of Llama 70B. We have API keys for all of them, but maybe we should try them in ascending order of cost. If the cheapest endpoints are available, we'll use those, but there may be higher-cost, higher-availability endpoints that we want lower in the stack of choices, and we'll get to those if the first few requests fail. That's another feature that's super common and that I think most people really like about LLM gateways.
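As an illustration of the retry and fallback behavior described here (not any particular gateway's internals), the following sketch tries hypothetical endpoints in ascending order of cost, with a couple of retries each; the provider names, costs, and the call itself are stand-ins.

```python
# Illustrative sketch of cost-ordered fallback with retries; provider names,
# costs, and the call function are invented for the example.
import random
import time

PROVIDERS = [
    {"name": "budget-host", "cost_per_mtok": 0.20},
    {"name": "mid-tier-host", "cost_per_mtok": 0.60},
    {"name": "high-availability-host", "cost_per_mtok": 0.90},
]

def call_provider(provider: dict, prompt: str) -> str:
    # Stand-in for the real HTTP call; fails randomly to exercise the fallback path.
    if random.random() < 0.5:
        raise ConnectionError(f"{provider['name']} unavailable")
    return f"[{provider['name']}] completion for: {prompt}"

def complete_with_fallback(prompt: str, retries_per_provider: int = 2) -> str:
    last_error = None
    # Try the cheapest endpoints first, escalating only when requests fail.
    for provider in sorted(PROVIDERS, key=lambda p: p["cost_per_mtok"]):
        for attempt in range(retries_per_provider):
            try:
                return call_provider(provider, prompt)
            except ConnectionError as err:
                last_error = err
                time.sleep(0.1 * (attempt + 1))  # simple backoff before retrying
    raise RuntimeError("all providers failed") from last_error

print(complete_with_fallback("Explain what an LLM gateway does."))
```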
And I've talked to people at large organizations who've implemented their own internal API gateways that also do some sort of request prioritization or accounting: we have internal keys that map to different budgets, we want to know which organizations are actually responsible for how much traffic, and we want to do that kind of bookkeeping at the gateway layer as well.
[00:08:50] Tobias Macey:
From an application perspective as well, it seems that the gateway could be a natural point to do things like request-response caching, to say, okay, this request is fundamentally the same as this other one, so we're just going to return the cached response rather than send it all the way to the LLM and deal with that latency and cost, as well as some of the audit logging, etcetera. And I know that you have some of that in TensorZero, so maybe this is a good point for us to talk a little bit about what it is that you're building and some of the story behind how you came to TensorZero being the thing that you wanted to spend your time on.
[00:09:28] Viraj Mehta:
Yeah. Totally. So I forgot caching. Caching is another one; we don't have that support in the product today, but a lot of the gateways do, and it's obviously valuable: you can save on tokens and save time. On caching more broadly (this is a tangent, and then I'll answer your question), a lot of the API services also implement things like request caching at the KV cache layer of the transformer backend, and you want to at least optionally let your customers use that. For OpenAI this is automatic, but for Anthropic you can turn on some flags to get that behavior, and if you actually want it, it's better for it to be centrally managed rather than having to send those flags every single time your application calls an LLM that might be Anthropic. So it's another place to do that kind of global settings management of how you ought to interact with these providers. Now, to TensorZero. I come to TensorZero from a very, very different perspective than probably most of the folks who get involved in LLM gateways. Before I was working on TensorZero, I was wrapping up a PhD at Carnegie Mellon, where I was thinking about reinforcement learning quite a bit. That started very far away from LLMs: I was working on a Department of Energy research project where we were trying to use reinforcement learning to improve control of plasmas in nuclear fusion reactors, which is a whole other long topic that I think is amazing and could talk about forever, but I'll stay on topic. The key consideration with that problem, though, the one that eventually led me to language models, was that this data was unbelievably, fabulously expensive.
Think a car per data point, like $30,000 for five seconds of data. That problem leads to a huge amount of concern about, okay, we're only going to get to run a handful of trials, so what is the marginally most valuable place we can collect data from in the configuration space of the tokamak? Where should we initialize the problem, and what should the policy do to collect the most valuable information? So I got really interested in this question: if we're going to pick a single data point from a dynamical system to inform an RL agent, what would the most valuable data point be? I spent most of my PhD thinking about problems like that and applying them to nuclear physics and nuclear engineering. But at the end, I realized that we had developed a lot of machinery for data-efficient reinforcement learning, and some of that machinery applies quite well to the techniques now used to align language models. So I wrote a paper about this: in the standard setup for reinforcement learning from human feedback, you are making a query of the form, here's a prompt, here are two completions, and a human is going to tell you which completion is better. You want to use that information to improve the performance of your downstream language model on a distribution of questions humans ask. The question we asked was: let's say you have one more label you could get from a person. What would be the prompt, and which would be the completions, that you'd want to spend that last data point on to get the best improvement in your policy? We came to a theoretical solution for that; the setting is called contextual dueling bandits, though I don't know if that means much to anybody. We showed this policy would do well in that theoretical setting, implemented it first for a simple toy problem, and then realized that one neat thing about this algorithm, by design, is that it works with real techniques we would use to actually train language models. In that case, it was direct preference optimization, DPO, so we implemented it for language models. That got me thinking about the language modeling problem from a reinforcement learning perspective, and I think that has become a lot more popular today. We hear about RL a lot more, especially since o1 released in September, but back then this was maybe late 2023.
It was a little bit more heterodox to think this way. One thing that was really obvious to me as I was deciding to start this business with my co-founder Gabriel was that LLM applications, in their broader context, feel like reinforcement learning problems. Let me unpack that a little bit and how it gets me back to gateways. First off, one generic model of a language model call in an application is that you have some business variables in your code. They're probably ints and floats and strings and that kind of thing, and then arrays thereof, like a JSON. And you take that JSON and you template it into some strings.
You now have an array of strings. You send that to OpenAI or an equivalent, you get back another string, you parse that string and do more business stuff with it, and then maybe you make more language model calls like that down the line. Eventually some business outcome happens, and either it goes well or it goes poorly. Usually this feedback is going to be noisy and sparse, maybe natural language and not really quantifiable, but you can hopefully get something back. Let's assume you get something back for now, and I can unpack later, when we talk about the industry, whether that's generally true. What this looks like, then, is that you make many calls to some machine learning model, you have structured variables on one side, structured variables on the other side, and stuff that happens in the middle. You do this many times, and then you get back some reward. That looks to me like a reinforcement learning problem. Maybe I'm biased because I've spent a lot of time thinking about things that way, but this does smell like a partially observable Markov decision process. So we took this really seriously back in the early part of last year, and we actually wrote down in math, for real, what we think the mapping is between the standard design of an LLM application and a POMDP.
I'll write a blog post about this soon so we can share it with folks, but it led us to the idea that the policy here, the thing that makes the decisions, is the entire function that sits between the business variables on the input side of the call and the business variables on the output side of the call. What we realized is that our interface design is going to be different from OpenAI's interface design. OpenAI, due to the limitations of their system and the design of how they're serving you a model, takes strings as input and gives you basically strings as output, and that makes sense for them. But I think for applications, the right way to think about talking to a language model is: I'm going to design an input and output signature for a job that I would like done by something smart, then I will make that typed function call, and I don't really care what happens under the hood at all. What I want is outputs that lead to good business outcomes down the line. Obviously that's not sufficient to get started.
You need to go under the hood and write a good prompt, try it a few times, and make sure it's sensible and won't do anything crazy upon deployment. But once you have something reasonable going, your goal really is: let's optimize against whatever feedback signals we have, or the comments the PMs make about the performance of the system, and try to do better over time. That, to me, also looks very much like a reinforcement learning problem. So from there we realized: if we were to build an LLM application ourselves today, I would take these insights, treat the application code as making these typed function calls, and then have the ML process be: we have this incumbent implementation of the function call (the original prompt, whatever model you originally chose, and so on), and the job of the ML scientist or MLE on the team is to repeatedly come up with new implementations that might be better than the original, and actually slot them in or A/B test them against the original implementation to see whether the new prompt/model/generation-parameters triple that's actually implementing this function call is going to be better. For us, that led to the question of where we're going to insert this into the stack, and since one of the design variables in coming up with new implementations of a function is that you may want to try different language models, it obviously made sense to insert into the stack of LLM applications at the point where you have the choice of different language models. And then, optionally (we're one of the few products that does this today, and I think we probably have the most articulated and explicit view of how this should work), we want to also manage the prompting. You can use the product without that and keep the prompts in the application code, and a lot of people do this: you just build your strings out of Python f-strings or in JavaScript or so on. But if you keep the prompt in the gateway layer, what that means is that you're able to say, at inference time, I want to call Anthropic with prompt A, or I want to call OpenAI with prompt B, or I want to call Gemini with prompt C, or I want to call my fine-tuned Llama with minimal prompt D, because it's been fine-tuned on so much data that it already knows what the problem is, and let's compare those.
And I think that because different language model families should be prompted differently, and because, when you're thinking about the performance of a particular implementation of your LLM function call, you care about the whole function, the prompt included, not just the model, this is a nice unit of abstraction for application code and for people who are monitoring these systems. I just talked for, like, ten minutes. I'm sure you have questions.
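To illustrate the "typed function call plus interchangeable variants" framing described above, here is a hypothetical sketch. The names, dataclass, and structure are invented for illustration and are not TensorZero's actual API; the point is that the prompt, model, and traffic split live with the variant, not in the application code.

```python
# Hypothetical sketch of a typed LLM "function" whose implementation (prompt
# template + model + traffic weight) is chosen per request, so variants can be
# A/B tested without touching application code.
import random
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    model: str
    template: str  # the prompt lives with the variant, not in the application code
    weight: float  # share of traffic this variant should receive

VARIANTS = [
    Variant("baseline", "gpt-4o-mini",
            "Draft a reply to this ticket:\n{ticket_text}\nTone: {tone}", 0.95),
    Variant("fine_tuned", "ft:llama-3-8b-support",
            "{ticket_text}", 0.05),  # a fine-tuned model may need only a minimal prompt
]

def call_model(model: str, prompt: str) -> str:
    # Placeholder for the actual provider call routed through the gateway.
    return f"<{model} reply to: {prompt[:40]}...>"

def draft_reply(ticket_text: str, tone: str) -> str:
    """Typed function: structured variables in, a reply string out."""
    variant = random.choices(VARIANTS, weights=[v.weight for v in VARIANTS])[0]
    prompt = variant.template.format(ticket_text=ticket_text, tone=tone)
    return call_model(variant.model, prompt)

print(draft_reply("App crashes on login", tone="empathetic"))
```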
[00:19:28] Tobias Macey:
No, that was great and very useful. And to the point you were just making about automating some of the prompting, it puts me in mind of another project/framework that you may or may not be familiar with, DSPy, which I think came out of Stanford. I'm wondering, when you allow something, whether it's DSPy or TensorZero, to automate the prompt generation, how that shifts the way that you think about the overall design of your application where the LLM is a core piece of the functionality, and how you've seen teams try to tackle that. You know: I've spent all of this time on prompt engineering, I've got the perfect prompt, oh wait, the prompt started falling apart because they released a new iteration of the model. Just some of the ways that they think about the role of prompting in the overall application stack and the behavior thereof.
[00:20:25] Viraj Mehta:
Yeah. I think DSPy is a really cool idea, and it was definitely one of the inspirations for me as I thought about this. In fact, right during that time in late 2023, when I was thinking about this in the very early days, I emailed and then hung out with Omar Khattab, the author of DSPy, at NeurIPS, and we talked about it then. One way to contextualize the relationship between TensorZero and DSPy (we actually have an example in our repository of using DSPy to optimize prompts) is that TensorZero is production software that collects data that is well structured for doing all kinds of things to improve the implementations of all your function calls. DSPy is one set of techniques, and they have many optimizers, that can read from the data model we're storing and write essentially a new implementation of one of these functions to our config model, and then you can A/B test it or evaluate it or do whatever you want with that new implementation. DSPy is, I think, one really smart way of thinking: our LLM applications are programs, we should think about them as a composable, PyTorch-style graph of LLM calls and dependencies, and then we can, quote unquote, backpropagate through that graph, in a PyTorch kind of way, to improve the language model calls that happen upstream.
We think about this very similarly. Maybe one of the differences with DSPy is that it doesn't really manage data; it assumes you hand it a dataset and it will run optimization. The accumulation and structuring and labeling, that whole part of the problem, is external to DSPy. So if you're going to do this on hand-curated data, you can hand-curate your dataset, hand it to DSPy, and see what kind of prompts it comes up with. But if you're going to do this as part of your production system, which I think is the right way to think about it, especially as LLM deployments scale, then you need a tool like TensorZero that's going to dump, in a structured form, all the information you need to then do DSPy later. And I want to be super clear that it's actually really hard to get this data model correct.
For example, most tools in the industry that are doing LLM observability, including many of the gateways, are going to dump the strings that you send, because they don't have this concept of typed input that gets templated into strings that then get sent to an API. You're looking at the strings that go into the model and the strings that come out of the model as the dataset you're storing in your observability platform. What that means is that down the road, say I've changed my prompt six times over the last year and I want to fine-tune on the historical data that ended up going well, in order to make sure that all the data has been templated with your latest prompt, you have to parse out the variables, or go get them from somewhere and re-template. That's either going to be a difficult parsing problem, or you have to go and ETL it, or do some join with an external system. It's hard.
Whereas if you just store structured information about what went in, it becomes easy: we just re-template everything; we have it in this other table. There are considerations like that, along with the fact that you want to know which types of LLM calls led to some downstream outcome, that force you to think about the data model in roughly the way we think about it. I spent a lot of time on this; I think it was the thing we spent the most time on last year, making sure that the entire universe of things we would want to do in the future was supported by the data model. Now I think we're really happy to have a forward-compatible data model that lets you run TensorZero today and know that in the future, as we start to implement more optimization techniques, which is the direction we're turning to now, they will be compatible with the data you collected last year.
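As a small sketch of why storing the structured inputs pays off (the field names and rows here are made up), historical records can be re-rendered with the latest template and curated by feedback to build a fine-tuning set, with no parsing of old prompt strings:

```python
# Hypothetical sketch: rebuild a fine-tuning dataset from structured records
# by re-templating the stored variables with the latest prompt and keeping
# only the examples whose feedback metric was positive.
LATEST_TEMPLATE = (
    "Summarize the ticket below for a support agent.\n\n"
    "Ticket: {ticket_text}\nPriority: {priority}"
)

historical_rows = [  # imagine these came from the inference/feedback store
    {"ticket_text": "App crashes on login", "priority": "high",
     "output": "User cannot log in; crash on launch.", "success": True},
    {"ticket_text": "Feature request: dark mode", "priority": "low",
     "output": "Wants dark mode.", "success": False},
]

def build_finetune_examples(rows):
    for row in rows:
        if not row["success"]:
            continue  # curate on the downstream feedback signal
        yield {
            "prompt": LATEST_TEMPLATE.format(
                ticket_text=row["ticket_text"], priority=row["priority"]),
            "completion": row["output"],
        }

print(list(build_finetune_examples(historical_rows)))
```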
[00:24:27] Tobias Macey:
So digging a bit more into TensorZero and its overall architecture and system design, I think that will be useful to frame the rest of the conversation about its features, functionality, and workflow. Can you give a bit of an overview of how you have implemented TensorZero and the supporting infrastructure that it needs to store the structured inputs and outputs and then perform optimization on them?
[00:24:59] Viraj Mehta:
Yeah. So first I'll talk about a user's perspective on this rather than a system administrator's perspective, because the system administration is actually not too complicated. As a user, think about it this way: if you built your application with OpenAI, let's say, how would you move it over to TensorZero and do it correctly? TensorZero does offer a drop-in OpenAI-compatible endpoint, so you can just point your client at TensorZero and start using it. But to do it properly, I would think about: what are all the jobs in my application that are done by LLMs? If I had a bunch of smart, typed functions that I would be calling, what would they be?
Where would they be called in the application? What would the input variables look like, and ideally write down a JSON schema for that, and what would the output signature look like? Are there other tools that would be called, do we want JSON out, and what would the schemas of those be? That allows you to go and configure things: we have a TOML configuration file where you write down, here are the functions that I'm calling and here are the input and output signatures, and pretty quickly replicate the same thing you were doing with OpenAI. But now you've factored it out into the jobs that are being done, and you've also put structure into the input variables and the output variables. Then TensorZero comes in as a client that you pip install, or that you make web requests to; there's basically a Rust binary in a Docker container that you can deploy anywhere you deploy your Docker containers. It's stateless, but it reads the configuration, so it knows what functions are going to be called by the client. Everywhere in your client code where you're calling LLMs, you're calling these functions. The gateway then implements the functions by taking the variables, templating them, shipping them off to the appropriate language model provider, and returning the responses after parsing and validating them. And it implements things like fallbacks and retries; it does all of that.
The other dependency that it sits on is ClickHouse. Ideally you have some sort of ClickHouse deployment, whether you're using the cloud service or your own open source deployment, where this data starts to be materialized in a structured format. Important to add, the TensorZero gateway also has a feedback endpoint: we support booleans, floats, comments, and demonstrations, so if there's a human in the loop who edits a response before it goes out, you can send that response as a demonstration of the desired behavior. What that buys you, after you run the application for a while, is a ClickHouse full of the interaction history: what were all the inferences that were part of a particular run of your application, and how did it go. That's a really useful dataset for optimization. We can then start to do things like: let's curate the data based on which runs went well, template them with our latest prompt, and fine-tune a smaller model, a Llama-series or OpenAI-series model or one of those kinds of things. Then let's slot that in as a new implementation of our language model call and run 5% of traffic through it. All that really takes is the web UI we have now, which reads from ClickHouse; it's another Docker container. You can go into the UI, look at your inferences, and use the form for fine-tuning (we also publish notebooks that do this). It will curate by the metric, you pick which template you want to use and which model you'd like to fine-tune, hit go, and it gives you a TOML block that you add to your config. Now you have a brand new variant that you can run in an A/B test. All of that data continues to be dumped to ClickHouse, so you can see how variant A, the incumbent, is doing against variant B, your new fine-tuned model, and make a determination: what are the confidence intervals on this metric, and is it worth flipping over to the new model?
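For a rough sense of what sending feedback might look like from the client side, here is a sketch over plain HTTP. The gateway URL, endpoint path, and payload field names are assumptions made for illustration, not a verified API contract; check the TensorZero documentation for the actual schema.

```python
# Hedged sketch of attaching feedback (a boolean metric and a demonstration)
# to an earlier episode. The URL, path, and field names are assumptions made
# for illustration only.
import requests

GATEWAY_URL = "http://localhost:3000"  # placeholder gateway address

def send_feedback(episode_id: str, metric_name: str, value) -> None:
    resp = requests.post(
        f"{GATEWAY_URL}/feedback",
        json={"episode_id": episode_id, "metric_name": metric_name, "value": value},
        timeout=10,
    )
    resp.raise_for_status()

# A thumbs-up from the client UI...
send_feedback("episode-123", "reply_was_helpful", True)
# ...or a human-edited response sent back as a demonstration of desired behavior.
send_feedback("episode-123", "demonstration", "Here is the corrected reply text.")
```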
Did that make sense? Was that clear? I know that was, again, a lot of words.
[00:29:21] Tobias Macey:
That makes absolute sense, and I think that is definitely a very interesting and useful aspect of the system, that it maintains that history in the ClickHouse database along with the feedback. One of the questions I was going to ask, which you proactively answered, was how that human feedback gets factored in and what interfaces are available for providing input back to the application and TensorZero to say, yes, this is what I want, or no, this is really bad. The fact that you have the web UI is definitely very useful for an auditor or administrator of the system to go in periodically and say, okay, these are the things that we like, these are the things we don't like. But it also gives you the opportunity to expose that in your client interface, even if it's just a thumbs up or thumbs down, this was a good response or this was not helpful, and then feed that back into the system as an automated feedback loop without necessarily having to make it a manual task for one of your teammates on some periodic basis.
[00:30:26] Viraj Mehta:
Sure. A quick digression: I spent a while two years ago, before TensorZero, thinking about what would happen if you were to acquire a services business and use LLMs to start to automate things, but without degrading the quality of service of the business. You have humans still review all the work that the LLMs are doing and fix it, and that would eventually generate a training set to improve the quality of your LLMs. I didn't end up doing that, because I'm not a private equity tycoon, I was a grad student at the time, but I was thinking it through. Where I landed was that for many businesses, it's going to make sense to have the humans continue to audit the LLM output for a long time, and what you really want is for the frequency at which the auditors have to change anything to go down over time. That means you're getting closer to your eventual goal of, I guess, automating the job, or increasing the productivity of labor, to put it nicely.
So there's this natural dynamic: we're going to get demonstrations of good behavior as we do the handoff from humans to AI for this menial job, and you need to be able to take account of that information; that's the only way this transition ultimately happens. And even as the base models get smarter, which will continue to happen for sure (there are too many smart people and too many dollars and too much compute being thrown at this problem), it doesn't necessarily mean that the models will be getting better at your particular business task. For example, I don't know that since the GPT-3.5 series the models have gotten much better at writing very targeted sales emails.
I think you just have to learn what works. In my experience doing any kind of sales or marketing, you have to learn from trying stuff: what works for your particular customer population, what works for your particular product, what are the hooks that actually do the job well. If you could do it contextually, that would be great; nobody really can. So I think there are still plenty of business problems where you want to take advantage of the production data in order to do better, even as the models get stronger.
[00:32:37] Tobias Macey:
Another interesting aspect of what you're building with that ClickHouse record keeping is that it's effectively the interaction history of the application with the LLM. Another core capability that is built into a lot of chat-based applications, or that people want built into their chat-based applications, is keeping a history of all the requests and responses for a given user, and using that to enhance a profile of that user so you can give them more useful responses and maybe feed that into the context of the request that you're sending.
And I'm curious how you're thinking about the role of TensorZero in that use case, or LLM gateways more generally, and how that factors into the overall system architecture in terms of which piece owns that. Another aspect of this is the idea of memory that is being built into a number of other products, whether as part of an existing framework or as an addendum to the overall stack; I'm thinking of things like Cognee, Mem0, and the LangChain or LangGraph memory capabilities.
[00:33:53] Viraj Mehta:
I think this is a really interesting distinction to make, and we're probably far on one end of the spectrum here, so it's an interesting perspective with which to regard it. For completeness, I want to rule out for a second the new architectures that have been coming out lately where, during inference, some memories are baked into the weights of the model, and instead stick to the case where the model is memoryless and we're not dynamically baking new things in. If we leave that out, I think the concept of memory in LLM applications boils down to a context that's writable.
It's a context store that's writable. The idea is that even the ChatGPT implementation, I believe, has the option of saying, okay, this is a good piece of context that I would want to return to in the future, and it writes it to some store on OpenAI's server side. Then in future conversations, that memory is put into context along with the user question at the time. To me, this is similar to RAG, retrieval augmented generation, in that you have some store of documents that might be useful for making an inference, and you retrieve the most relevant ones prior to inference time.
User-specific memory writing is like having, for example, a vector database (it doesn't have to be that) which is actually writable by the LLM as it's having the conversation, and then future inferences can read from it as part of that memory stack. When I look at the products that are out there from LangChain and Mem0 and some of these other providers, that is usually the architecture I see. For TensorZero, we're primarily focused on the things we can use to improve the actual LLM calls, so we don't really deal with this much at all, because we think that for a well-designed LLM application, the design of your memory is probably very application-specific. Some things you might want to share among groups, and some things, in a team setting or an enterprise application, are very personal and you want to keep them to the user. We explicitly decided that that kind of thing, along with RAG, is not worth touching today, because, for RAG for example, the data architecture of one enterprise application is going to be very different from the next one over, and there aren't a lot of patterns we wanted to impose on how people store and retrieve their data. So we left it out of scope. The way we would treat it is: add a field to the input signature of your function that takes memory, and then optionally send it as your memory system dictates. We haven't really handled that ourselves, but it would integrate fine with any other memory provider. Because the gateway layer is such a good choke point for dealing with the general behaviors of LLM applications, other people have taken the other approach of a batteries-included solution, where they offer either a vector store or an append-only memory context document or some other memory system that knows about your application and knows which memory to load into context prior to making the request. But our decision was that that was something developers could figure out on their own.
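To make the "memory is a writable context store owned by the application" point concrete, here is a toy sketch. The store, the naive keyword retrieval, and the function name are all invented for illustration; the only point is that retrieved memories arrive as one more structured input field.

```python
# Toy sketch: the application owns a writable memory store, retrieves relevant
# entries, and passes them to the LLM function as one more structured input.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    entries: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.entries.append(note)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword match; a real system might use a vector index instead.
        words = query.lower().split()
        hits = [e for e in self.entries if any(w in e.lower() for w in words)]
        return hits[:k]

def answer_question(question: str, memories: list[str]) -> str:
    # Placeholder for the typed LLM function call; memory is just another field.
    return f"(answering {question!r} with context: {memories})"

memory = MemoryStore()
memory.write("User prefers terse answers.")
memory.write("User is on the enterprise plan.")

print(answer_question("What plan is the user on?", memory.retrieve("plan")))
```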
[00:37:49] Tobias Macey:
Aside from the more detail-oriented memory capabilities, even just for tracking the history of all the requests and responses that have happened in past conversations: is that something you would expect people to rely on TensorZero's history tracking for, or would you say, you have your own history tracking for user conversations and you can load that in; we are keeping history for a different purpose, and the two don't necessarily want to mix together?
[00:38:21] Viraj Mehta:
So we don't offer client-side retrieval for this purpose, but we do keep comprehensive documentation, at many layers of specificity, of exactly what went in and out of the language model. In particular, those records are retrievable, and we give the client an episode ID, so you can go and query: what were all the inferences made for this episode ID, ordered by time? You don't need an additional store to get that information. We haven't seen folks use it that way, although we don't know exactly what everybody's doing, because we don't have telemetry built into the product, but it is totally valid. We publish the data model, and ClickHouse is a perfectly fine database for that sort of thing. So we keep the data in a way that's useful for this, and we give you the tools to retrieve it if that's what you want, but we haven't marketed it as such. I hadn't considered it from that angle before, but yeah, it's totally possible with the product.
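As a sketch of what reconstructing an episode's history could look like, here is a query against ClickHouse using the clickhouse-connect client. The table and column names are guesses made for illustration; the actual schema is whatever TensorZero's published data model defines.

```python
# Hedged sketch: fetch every inference recorded for one episode, ordered by time.
# Table and column names here are illustrative assumptions, not the real schema.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

def get_episode_history(episode_id: str):
    result = client.query(
        """
        SELECT timestamp, function_name, input, output
        FROM inference
        WHERE episode_id = %(episode_id)s
        ORDER BY timestamp
        """,
        parameters={"episode_id": episode_id},
    )
    return result.result_rows

for row in get_episode_history("episode-123"):
    print(row)
```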
[00:39:28] Tobias Macey:
You mentioned that you don't have telemetry built in, so you don't have that tracking of how users are using the application, which is definitely, I think, a desirable trait in an open source project. I'm wondering how you're approaching the long-term sustainability and overall business model of what you're building at TensorZero, how you think about which capabilities belong in that open source core application, and what additional functionality and value-added features you're looking to build over time as you grow the capabilities and the organizational capacity.
[00:40:05] Viraj Mehta:
Yeah. So, I mean, we're still early. We're lucky enough to have investors who are aligned with the longer-term vision that I'll present in a second, so we're not racing to make money, and we're not really building anything today that isn't going to be part of the core open source assets. But we do have a clear design and a line for where this is drawn, and we're happy to share it; we share it with anybody who asks, because we think it's part of the insurance against rug pulls for folks who want to depend on the open source package. The way we think about this goes back to the beginning: the original version of TensorZero that we were running last year, as we were trying to figure out what we were going to build, was a managed service.
We were running this gateway as SaaS: a couple of folks would integrate against it and send feedback to it, and we would store the historical data on our side. Then we had this man-behind-the-curtain, Wizard of Oz style setup where we ran the optimization techniques mostly manually (Gabriel and I have both been doing ML a long time), and we would just optimize your models for you based on the feedback you were giving us. We had the dataset to do it, we'd run the A/B tests, and we'd just make it work. We were able to get actually quite good results doing that, but it was obviously not sustainable given how much sweat went into it. What we realized was that we needed to be really honest with ourselves and build a system that, even if we weren't there, would allow people to make their own systems better based on the feedback being collected. So the plan, and we've already done the vast majority of this, is to publish tools that let you build your application with TensorZero and collect and store the data in a useful way, and then techniques that read from that ClickHouse data model and write new variants, new implementations of the functions, to our config. We have a couple of those built: supervised fine-tuning with a couple of providers, a workflow for dynamic in-context learning, and an open PR that should be merged tomorrow that does direct preference optimization.
Going forward, our focus is shifting towards, one, building a better UI and, two, expanding and continuing to work on these optimization techniques. All of that, I think, will be in the scope of the open source project. But there's a natural thing that happens as you scale and as these techniques get more advanced. Right now the first few techniques have been things you can run with only an API key and no compute; we wanted to ship those first because they're easy for everybody to run. But of course there are limits to the customization APIs available from both the major labs, the OpenAIs of the world, and also the Togethers and Fireworks that will do their own custom fine-tuning. Eventually we're going to want to publish techniques that we think work quite well but that are not supported as APIs by existing players. So, as part of the open source project, to run the recipe you're going to have to go and rent your own 8x H100 node. That's not cheap, and it's annoying even to set up, and if you want several, it's more annoying still. Then you have to know that it's time, choose the technique, run it, run the evals, put it in the A/B test, manage the A/B test, and then your application gets better. It's still way less work than it was before, when you were designing the data model and implementing the algorithms yourself, going and finding them and making sure they fit; it's an order of magnitude less work than that, but it's still work. And as these applications scale further, it's a lot of work, especially if you imagine an agent that's doing lots of different things all day; it starts to get hard to manage. So we want to go back to the original vision: if you have been using the open source software, the pain of getting all the GPUs and managing them, figuring out which experiments to run, managing the A/B test weights, and all that stuff is a pain.
We want to offer a service that basically starts to do all of this proactively. So: hey, we noticed programmatically that you have collected 10,000 more data points since the last time we tried fine-tuning, and a new Llama model version got released. Let's proactively run a state-of-the-art filtered fine-tuning job and train a new model, then proactively run your evaluations against it and send you a ping saying, hey, your new model is ready, here are the results, do you want to deploy it in production? Click this button if yes, and we'll run our usual A/B test routine. What this starts to look like is that the improvement aspect of an LLM application becomes mostly automatic. We think that matters, especially for scale deployments where improvements in KPIs are really all that matters and, of course, cost is critical: the more data you collect, the smaller the model you can train to achieve the same performance. There's a lot of work to be doing there all the time. So either we offer the tools to do that but you have to manually manage all the compute and proactively do everything yourself, or we have an autopilot for it. That's a clear line to draw around the open source project that keeps it healthy and useful, and we can keep contributing to it and keep making sure that it's an excellent piece of software. But for folks on the larger end, who have a lot riding on this application, we can start to make it as good as it can possibly be without a ton of ML effort, and those are expensive employees who would be doing that effort.
[00:45:43] Tobias Macey:
Yeah. It's definitely a very interesting and useful gradation for a project like this, where, as you said, you can do it all yourself, but eventually, you're gonna get sick of it, and we'll be there when that happens.
[00:45:56] Viraj Mehta:
And I think not just that you're going to be sick of it, but also that we can make better utilization of your GPUs if we're doing it automatically, because we can just run the next thing as soon as the previous one is done. And because we can run more stuff, try things in parallel, and more exhaustively search the space of tricks you can pull, we should be able to stack more little wins and get to, I would hypothesize, better performance than if it were done manually, just due to the amount of search that can be done over the possibilities.
[00:46:31] Tobias Macey:
And also, if you have a number of clients who are within adjacent verticals, or who are doing similar enough work, then you can say, oh hey, this looks like something we've seen before over there, here's our prebuilt fine-tuned model for you to use, and it saves them all of the headache of having to do the data collection and feedback first. You can get them an easy win because you already know enough about this type of problem space to know that this is going to work better for them.
[00:46:59] Viraj Mehta:
Yeah. I mean, we definitely don't want to cross-pollinate data, but I definitely think cross-pollinating learning is a really important and interesting part of building a company like this, where we'll see a lot of things and do a lot of the kind of core research that maybe other companies can't do because their job is actually building the application, and then, ideally, share that with everyone who uses the product. Another analogy that I really like here, going back to your point about how it works but becomes annoying to maintain at scale, is that this is the story of Databricks on some level, with large Spark clusters and large Delta Lake deployments. It started out as an excellent, really useful open source project, and down the line it has turned into a monster business, because they built something really useful, people wanted it at scale, people didn't want the headaches that came with that, and there they are.
[00:47:49] Tobias Macey:
Absolutely. And as you have been building TensorZero, working with some of the early adopters and clients, what are some of the ways that you see TensorZero and LLM gateways shifting their thinking architecturally, and maybe some of the patterns where the gateway actually gets in the way? Because maybe there's only ever one LLM that they're going to call for a very limited use case, or just some of the cases where the addition of the gateway adds too much cognitive overhead or architectural or operational maintenance?
[00:48:28] Viraj Mehta:
Yeah. Sure. So I think TensorZero has its own particular wrinkles that we're trying to iron out of the product, so I can talk about those specifically. They definitely cause headaches for people, so, you know, give us a month or two and they'll be gone, hopefully. But then I'll also talk about kind of the general case of this. So for TensorZero right now, we don't have, and LiteLLM has this, and we'll fix this shortly, a client-only implementation of the gateway that keeps everything on the client side, so you can do quick development with just the client.
One other thing about keeping things structured the way TensorZero kind of recommends you do for production: if you're doing quick development and you're changing things all the time, and you don't even know what variables should go into the function in the first place, then writing down JSON schemas and keeping everything templatized and organized can be more friction than it's worth, when you're originally just trying to rapidly iterate on what do I even put into the LLM and what do I even want out of it. For that, the time to getting an inference done and looking at the outputs, and repeating that process a lot of times, is a really useful thing for developers. And right now I would say TensorZero is a bit heavy, and a bit clunky, for that kind of thing. So we're certainly working on removing some of those assumptions and making it lighter, but I would say that is a place in the life cycle of an application where our own product is not amazing yet; we're working on that. But then I think in general for gateways, and maybe I'll set TensorZero aside here because we've spent a lot of time on this, when I use LiteLLM there are little wrinkles of each inference provider that are a little bit different, and they're not really enforced by the interface, and LiteLLM doesn't fix it for you. So for a while, and this is fixed now so I don't think it's an issue anymore, Anthropic wouldn't let you send two consecutive user messages as part of the conversation, and LiteLLM would just let you do it and then give you a 400, and you're like, what happened? Then you have to go and figure it out, dig through their docs, and it's not even that clear. So there's this headache, because there's a layer of obfuscation between you and the ultimate service that you're calling, and this is true of TensorZero to some extent as well. When you're trying to debug things, you have to debug two layers instead of one, and that's harder.
So because there's a layer of obfuscation, it took me a while to figure out why I was getting a 400. For TensorZero, we know about that case, and we just coalesce the messages and do something sensible and try to get it to work. But, you know, maybe that's not the desired behavior. So even with us, there is a layer of, okay, there's somebody else's software between me and what I wanna talk to. There's value, but there are also drawbacks.
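As a rough illustration of the "coalesce the messages and do something sensible" behavior Viraj describes, here is a minimal sketch of the general idea; it is not TensorZero's actual implementation, just the shape of the fix a gateway might apply before forwarding a request.

```python
# Sketch: merge back-to-back messages that share a role, since some providers
# (e.g. Anthropic, historically) reject two consecutive user messages that an
# OpenAI-style API would accept.
def coalesce_messages(messages: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: fold the content together.
            merged[-1]["content"] += "\n\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged

# Two consecutive user messages collapse into one before hitting the provider.
print(coalesce_messages([
    {"role": "user", "content": "Here is the document."},
    {"role": "user", "content": "Now summarize it."},
]))
```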
[00:51:19] Tobias Macey:
And as you have been building and working with some of the early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen TensorZero applied?
[00:51:32] Viraj Mehta:
Yeah. So I think it's very cool to see. With some of our earliest users, we obviously helped them, handheld them on how to set this up for your application and let's go through it. The thing that's been cool is, okay, now we're gone, it's open source software, do whatever you want. And people have been doing interesting things. For example, a lot of people use LLM-as-a-judge for evals: can you check whether this output does what it's supposed to? And, you know, that's an LLM call too, done by some part of the application, so you might as well make that a TensorZero function as well. So now we have TensorZero functions judging TensorZero functions, and in principle you could have TensorZero functions judging those, and turtles all the way down. I think that's really cool, because ultimately one problem with LLM-as-a-judge is that it too has a failure rate, and you have to independently monitor that failure rate and optimize that failure rate. And so it helps to have a common abstraction that lets us say, okay, every LLM call is something that we can treat in this unified way and optimize in this unified way and observe in this unified way.
It lets people build things that are, I think, one of the really beautiful things about computers: okay, now we can immediately make it recursive and see what happens when we loop it on itself, and it feels very Douglas Hofstadter-y. I'm very excited to see this kind of thing more often, and we'll, you know, build it into the core product at some point. So, yeah, I thought that was very unintuitive for me and exciting.
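To make the "functions judging functions" pattern concrete, here is a hedged sketch of what the composition could look like when both calls go through the same gateway, so the judge is itself observable and optimizable. The base URL, model identifiers, and prompts are illustrative assumptions, not names from the conversation.

```python
# Sketch: a judge call is just another gateway-routed call, so it gets the
# same logging and can itself be fine-tuned or A/B tested later.
from openai import OpenAI

gateway = OpenAI(base_url="http://llm-gateway.internal:3000/v1", api_key="unused")

def draft_reply(ticket: str) -> str:
    resp = gateway.chat.completions.create(
        model="draft_reply",  # hypothetical logical name the gateway resolves
        messages=[{"role": "user", "content": ticket}],
    )
    return resp.choices[0].message.content

def judge_reply(ticket: str, reply: str) -> bool:
    resp = gateway.chat.completions.create(
        model="judge_reply",  # the judge is itself a gateway-managed call
        messages=[{
            "role": "user",
            "content": f"Ticket: {ticket}\nReply: {reply}\nIs this reply acceptable? Answer yes or no.",
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```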
[00:52:59] Tobias Macey:
And in your work of building the system, building a business around it, and exploring the overall space of LLM gateways and how they can be used to improve the operation of AI applications, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:53:19] Viraj Mehta:
So I think this goes back to how easy it is to use originally. When you go off into a rabbit hole for a year, thinking this is the correct theoretical thing that should be done, and, okay, let's try and make it really usable, production grade, and so on. In our own journey, we over-indexed on production grade. We said, okay, we're gonna build it in Rust, and we're gonna make it all configurable through files that you can commit and review in Git. And we're gonna use ClickHouse, which is, like, the ultimate scale database, and it'll work when you have trillions of rows and it doesn't matter. And it's all great. But making those decisions means that the bones are really good; it also means we now have a ton of work to make it really easy to get started with. People are like, I don't want a ClickHouse deployment, I don't wanna run Docker, I just want to pip install something and get to work and not write TOML files or learn anything.
And so it's really been a big process where we think we have some of the highest quality, most thoughtful implementations of an LLM gateway out there, but it's also probably the hardest to use. So now it's time to go back and fix all of those things; it was an emphasis thing, there aren't bugs, but let's go back and backfill and make this a really ergonomic tool that you can just pip install or npm install and go on through life. So I think for us, we made an explicit hard right turn, and now we need to course correct, and we're doing that.
[00:55:05] Tobias Macey:
And with ClickHouse, at least, it also offers an in-process option. So you can just run it with a pip install, and you don't have to have a full distributed multi-node cluster of ClickHouse to get started on your laptop.
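The in-process option Tobias is likely referring to here is chDB, which embeds ClickHouse in a Python process; as a hedged aside, basic usage looks roughly like this (generic chDB usage, not a TensorZero-specific setup).

```python
# Sketch: run a ClickHouse query in-process, no server or Docker required.
import chdb

# Returns the result as CSV text; handy for local experimentation.
print(chdb.query("SELECT version(), 1 + 1 AS two", "CSV"))
```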
[00:55:18] Viraj Mehta:
Yeah. Yeah. I mean, for development purposes, all of our examples come with ClickHouse and the UI. So you can run docker compose up and then everything works; we have a bunch of runnable examples in the repository where that's there. But still, you know, it's just a little bit more of an on-ramp. Something like DuckDB would have been even better, maybe, where we could have compiled it directly into the gateway or something of that sort. But then, you know, we wanted this to work for big multi-node deployments if you wanted that later down the line, and we had to pick one warehouse to get started with. I wish we could support all of them today, but, you know, we're a small team, and so ClickHouse seemed like a reasonable bet.
[00:56:01] Tobias Macey:
And SQL is never the same no matter how hard you try.
[00:56:06] Viraj Mehta:
Yeah. Yeah. I wish that every warehouse had the same SQL dialect, but that's never the case. And also, you know, you're gonna wanna optimize things differently based on whether it's Snowflake or Redshift or ClickHouse or BigQuery or Databricks or so on.
[00:56:21] Tobias Macey:
And you've mentioned a little bit about some of the future road map. I'm just wondering if there are any particular projects or problem areas you're particularly excited to explore in the near to medium term.
[00:56:32] Viraj Mehta:
Yes, I think I'll talk about this publicly; I don't mind. So I'll say this: in order to develop a new technique, let's say, generically, we have some LLM system and we wanna develop a new technique for making that LLM system better from our data model, essentially. What we really want for that is a bunch of different problems that cover different problem spaces, where we can run a bunch of examples, fill up the data model, then run the technique, and then see how much better it got. Today that involves building a bunch of ad hoc scripts for each one: set up the environment, set up the example, run a bunch of episodes of interaction, then run the script for improvement, and then run a bunch more episodes to see how it did. But this is a problem that was solved really, really well in reinforcement learning. The pre-2015 reinforcement learning community would do the same thing: we need to integrate against the dynamical system, and there was this bipartite graph of algorithms and dynamical systems that would all have to jointly integrate with one another. Then the people over at OpenAI published Gym, and Gym was this unified interface: now every RL algorithm integrates against Gym and every environment integrates against Gym, and it works. So we're very excited about building something, internally and then externally, that lets us do research at scale against a bunch of different applied problems, with one common integration that lets us, without having to know what the environment is or what the problem is, run a bunch of examples, run the thing to make it better, run a bunch more examples, and repeat across a bunch of different problems. And so I think that research tool will unlock a ton of new progress for us in terms of how we're gonna make your LLM systems better, and I'm excited to share that with the community.
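For readers who haven't seen the unified interface Viraj is pointing to, here is a minimal illustration using Gymnasium (the maintained successor to OpenAI Gym): any algorithm that speaks reset() and step() can be run against any environment that implements them. The random policy is a stand-in, not anything from the conversation.

```python
# Sketch: the Gym/Gymnasium loop that decouples algorithms from environments.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a real policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward}")
```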
[00:58:28] Tobias Macey:
Absolutely. Are there any other aspects of LLM gateways or the work that you're doing at TensorZero that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:40] Viraj Mehta:
Let's see. I guess we touched on it, but actually maybe we didn't, so I wanna talk for a second about how directly the reasoning paradigm and the inference-time compute paradigm fit into this process, and I think it's very exciting. If you look at the work by OpenAI on o1 and o3 and their reasoning fine-tuning offering in beta, and some of the open source work that also looks like this and smells like this, what this looks like to me is: oh, yes, there's a new type of implementation of a TensorZero function which takes a lot more time and runs a lot more tokens, but at the end it's the same inputs and outputs. Now we have this intermediate computational substrate on which, you know, chains of thought are happening, and we should optimize the whole implementation of that. And I think for those kinds of problems, where it's clear that OpenAI and their competitors are mining the world for math problems and, you know, LeetCode problems and IOI problems to solve, you need domain-specific data, or you'd like domain-specific data, to make that happen. So I'm excited for the direction of work that's soon to come.
Alright, we have our open source reasoning models; how can we directly customize those on the data that's in ClickHouse, and do that in a way that makes all of our customers' applications better out of the box, and all of the users of the open source project too? So, very excited for this. Definitely.
[01:00:20] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[01:00:39] Viraj Mehta:
I'm gonna take this a different direction than you probably anticipated. The question we always ask when we meet a new team that's applying LLMs to some business problem is: what are the KPIs that you, or your management, associate with your product? And I think that there is still a ways to go on the management technology of, okay, we have this LLM system that's driving this business process. It's no longer a curiosity; it's something that we care about for business reasons, and we wanna provide business value. So we need to, as a management team, look at the denial rate for the claims that were generated by the AI bot, or the click-through rate of the ads that you're generating with the AI bot. I think that style of management thinking is actually not yet as widespread as I would have anticipated a year or two ago, and it's an important part of the story of these applications becoming extremely important, load-bearing parts of the world economy, as people anticipate.
And so I'm gonna, I think, put the onus on the business professionals who interact with AI folks and AI systems and say: the management technology of how we start to measure and optimize and use these things in a serious way still requires some development.
[01:01:59] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at TensorZero and the functionality and features that you're adding to the AI stack to improve the efficiency and accuracy of these AI models that we're all still figuring our way around. I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day.
[01:02:23] Viraj Mehta:
Yeah. Thanks. You too.
[01:02:29] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems. Seamless data integration into AI applications often falls short, leading many to adopt RAG methods, which come with high costs, complexity, and limited scalability. Cognee offers a better solution with its open source semantic memory engine that automates data ingestion and storage, creating dynamic knowledge graphs from your data. Cognee enables AI agents to understand the meaning of your data, resulting in accurate responses at a lower cost. Take full control of your data in LLM apps without unnecessary overhead.
Visit aiengineeringpodcast.com/cognee, that's cognee, today to learn more and elevate your AI apps and agents. Your host is Tobias Macey, and today I'm interviewing Viraj Mehta about the purpose of an LLM gateway and his work on TensorZero. So, Viraj, can you start by introducing yourself?
[00:01:09] Viraj Mehta:
Yeah. Sure. Hi, I'm Viraj. I'm the CTO and cofounder at TensorZero, and I'm really excited to be here today to tell you about how we think about AI applications, our product, and the open source software that supports it.
[00:01:24] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:01:29] Viraj Mehta:
Yeah. Sure. So I started in computer science more generally when I was a freshman in college, and I only had limited experience programming before that. But I got quite bored during my second summer internship. I was at Google, and I finished some of the software engineering work that they had assigned me to do, so I spent a lot of time that summer running around at Google, where there are a ton of really interesting people, trying to chat with folks whose jobs I thought were cool. What I realized, kind of observing the patterns in my own interests, was that a lot of those folks had done machine learning work and were doing machine learning work, even if I didn't realize it at the time; the Google self-driving car project at X was heavy on machine learning, the generative music project was heavy on machine learning, and I noticed that post facto. I also realized that all of those folks had research backgrounds, and in particular most of them had PhDs. So having noticed that about myself, when I went back to school I got involved in the vision lab and started working on things like 3D vision and robot grasping and some of those techniques out of the Stanford vision lab. At that point the bug got me: AI research is awesome, it's so fun, it's an unbounded and fractally interesting problem, and it touches so many different layers of the stack of computing, and I really enjoyed those aspects of the problem. That's a bug that hasn't left since, maybe 2016, I think.
[00:02:59] Tobias Macey:
And so that has ultimately brought you to TensorZero. But before we get too deep into what you're building specifically, I just wanted to get an understanding of what even is an LLM gateway, and what is its purpose in the overall architecture of an AI application?
[00:03:17] Viraj Mehta:
Yeah. Sure. So the way we think about an LLM gateway is that it's a server that you might stick in between all of the client-side application code, which does all the normal things an application does, and the set of AI models out there that you might be interfacing with. Whether those are external third-party API providers serving LLMs, you can take OpenAI and drop it in there, or cloud providers like GCP, AWS, and Azure, or your own self-hosted LLMs, things you might run on your own GPUs such as vLLM or other competitors, you want a centralized place where, if your application code needs to talk to an LLM, it can make a request to one place and then it gets routed to the right downstream LLM server. It also does a lot of the bookkeeping and standardization and observability and a lot of the other stuff, so that all happens in one place and you don't have to do it all over your application code. So a gateway server is a nice way to do all these things and also manage the credentials, so that you're not sending your OpenAI API key to a bunch of different places where your application might wanna call OpenAI from.
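As a rough sketch of the "one place to call" idea described here, application code can talk to the gateway with a standard client while the provider credentials and routing live on the gateway. The base URL and model name below are placeholders, not any product's defaults.

```python
# Sketch: point a standard OpenAI-compatible client at an internal gateway
# instead of api.openai.com; the gateway holds provider keys and routes the
# request to the right backend.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-gateway.internal:3000/v1",  # your gateway endpoint
    api_key="not-a-real-provider-key",               # provider keys stay on the gateway
)

resp = client.chat.completions.create(
    model="support-triage",  # a logical name the gateway maps to a concrete model
    messages=[{"role": "user", "content": "Hello there"}],
)
print(resp.choices[0].message.content)
```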
[00:04:37] Tobias Macey:
And in terms of that LLM gateway functionality and its role as a centralized proxy for being able to actually interact with the LLMs, what are some of the different types of features and capabilities that you might expect from that type of a service, and what are the opportunities that it provides, as a sort of choke point in the architecture, to be able to add additional features or functionality or value-added utility?
[00:05:10] Viraj Mehta:
Yeah. Sure. So I'll start with the general case, and then I'll talk about how we depart from maybe the standard architecture. The at-minimum thing that you want could be provided by a client library or by a gateway server, and I think LiteLLM made the smart design decision of saying, okay, we're gonna give you the option: when you're developing, you don't need a binary that's running somewhere else, you can just directly call LLM APIs, and then as you move to production, you can have the actual server.
But regardless, you wanna be able to call many, many LLMs without having to change the form of your request to match the expected interface of each provider. Most of the providers are in some way OpenAI-compatible, so you could use the OpenAI client to talk to vLLM and SGLang and some of the APIs exposed by Google, but not most of them, and you wouldn't be able to call the Anthropic family of models. One of the trickier things is that even within the broad umbrella of things that look like OpenAI, there are a lot of small differences in how each of the individual providers works. So a further desideratum for an LLM gateway type application is that it does some rationalization of all the features that may or may not work amongst the different providers. For example, you may want your return type to be JSON, and that's a common thing people want from LLMs, because obviously you want machine-readable output. Some providers don't support JSON mode, but they do support tool calling; OpenAI, I believe, was like this for a while. One example that just came up for us is that TGI supports a tool mode where you can actually get a guaranteed JSON, they claim they can guarantee the schema, so you can do the thing you want and get JSON out, but they don't explicitly support JSON mode the way OpenAI does. So one thing that you might want from your LLM gateway is: if I'm using the TGI backend and routing my request to my own model that runs in TGI, I want you, transparently, without my application code having to think about it, to know when I ask for JSON that the only way to get JSON out of TGI is to pretend this is a tool call, force the model to call that tool and generate the correct arguments for it, and then munge that back into, oh yeah, here's your JSON, have a nice day. Those are the kinds of features that are table stakes for an LLM gateway, to make the downstream provider set look exactly the same to you. Some other software-engineering-style features you're gonna want are things like configurable retries, load balancing, fallbacks, that sort of behavior, so that your application doesn't have to think about the fact that there are six different places it can call to get an implementation of Llama 70B. We have API keys for all of them, but maybe we should try them in ascending order of cost, so if the cheapest endpoints are available we'll use those, and there are higher-cost, more highly available endpoints that we'd wanna try lower in the stack of choices, and we'll get to those if the first few requests fail. That's another feature that's super common, and I think most people really like, about LLM gateways.
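Here is a hedged sketch of the JSON-via-tool-call trick just described: on a backend that supports tool calls but not a native JSON mode, force a single tool whose parameters carry the schema you want, then unwrap the arguments. The endpoint, model name, and schema are illustrative assumptions.

```python
# Sketch: emulate "JSON mode" by forcing one tool call and reading back its
# arguments as the structured output.
import json
from openai import OpenAI

client = OpenAI(base_url="http://llm-gateway.internal:3000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "emit_answer",
        "description": "Return the final answer as structured JSON.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "confidence": {"type": "number"},
            },
            "required": ["answer"],
        },
    },
}]

resp = client.chat.completions.create(
    model="my-tgi-backed-model",
    messages=[{"role": "user", "content": "Summarize this ticket in one sentence."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "emit_answer"}},
)

# The "JSON mode" output is recovered from the forced tool call's arguments.
payload = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)
print(payload)
```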
And then I've talked to people who work at large organizations where they've implemented their own internal API gateways that also do some sort of request prioritization or accounting. So, oh yeah, we have internal keys that map to different budgets, and we wanna know which organizations are actually responsible for how much traffic, and we wanna do that kind of bookkeeping at the gateway layer as well.
[00:08:50] Tobias Macey:
From an application perspective as well, it seems that the gateway could be a natural point to do things like request-response caching, to say, okay, this request is fundamentally the same as this other one, so we're just going to return the cached response rather than send it all the way to the LLM and deal with that latency and the cost, as well as some of the audit logging, etcetera. And I know that you have some of that in TensorZero, so maybe this is a good point for us to talk a little bit about what it is that you're building and some of the story behind how you came to TensorZero being the thing that you wanted to spend your time on.
[00:09:28] Viraj Mehta:
Yeah. Totally. So first, I forgot caching. Caching is another one; we don't have that support in the product today, but a lot of the gateways do, and it's obviously a valuable source of savings on tokens and time. There's also, on caching more broadly, and this is a tangent, I'll answer your question: a lot of the API services implement things like request caching at the KV-cache layer of the transformer backend, and you want to at least optionally allow your customers to use that. For OpenAI this is automatic, but for Anthropic you can turn on some flags to get that behavior, and if you actually want that behavior, it would be better if it was centrally managed rather than having to send those flags in every single time your application calls an LLM that might be Anthropic. So it's another place to do that kind of global settings management of how you ought to interact with these providers. Now, to TensorZero. I come to TensorZero from a very, very different perspective than probably most of the folks who get involved in LLM gateways. Before I was working on TensorZero, I was wrapping up a PhD at Carnegie Mellon, where I was thinking about reinforcement learning quite a bit. So it started very far away from LLMs. I was working on a Department of Energy research project where we were trying to use reinforcement learning to improve control of plasmas in nuclear reactors, which is a whole other long topic that I think is amazing and could talk about forever, but I'll stay on topic a bit. The key consideration with that problem, though, that eventually led me to language models, was that this data was unbelievably, fabulously expensive.
Like, a car per data point; $30,000 for five seconds of data. So that problem leads to a huge amount of concern about, okay, we're only going to get to run a handful of trials, so what is the marginally most valuable place we can collect data from in the configuration space of the tokamak? Where should we initialize the problem, and what should the policy do to collect the most valuable information? So I got really interested in this question: if we're gonna pick a single data point from a dynamical system to inform an RL agent, what would the most valuable data point be? I spent most of my PhD thinking about problems like that and applying them to nuclear physics and nuclear engineering stuff. But at the end, I realized that we had developed a lot of machinery for data-efficient reinforcement learning, and some of this machinery applies quite well to the techniques that they're now using to align language models. So I wrote a paper about this: if you're doing the standard setup for reinforcement learning from human feedback, you're making a query of, okay, here's a prompt, here are two completions, and a human is gonna tell you which completion is better, and you wanna use that information to improve the performance of your downstream language model on a distribution of questions humans ask. The question that we asked was: let's say you could get one more label from a person. What would be the prompt, and which would be the completions, for which you would wanna use that last data point, so you can get the best improvement in your policy? We came to a theoretical solution for that; the setting is called contextual dueling bandits, but I don't know if that would mean much to anybody. So we had a theoretical result, a policy that would do well in that theoretical setting, and we implemented it first for a simple toy problem. One neat thing about this algorithm, by design, is that it actually works for real techniques that we would use to train language models; in that case, it was direct preference optimization, DPO. So we actually implemented that for language models, and that got me really thinking about the language modeling problem from the perspective of reinforcement learning. I think that has become a lot more popular today; we hear about RL a lot more, especially since the o1 release in September, but back then this was maybe late 2023.
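For readers who want the reference point, the DPO objective mentioned here is usually written roughly as follows. The notation comes from the original DPO paper, not from this conversation: $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ the frozen reference model, $y_w$ and $y_l$ the preferred and rejected completions for prompt $x$, $\beta$ a temperature, and $\sigma$ the logistic function.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$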
It was a little bit more heterodox to think this way. And one thing that was really obvious to me, as I was deciding to start this business with my cofounder Gabriel, was that LLM applications, in their broader context, feel like reinforcement learning problems. So let me unpack that a little bit, and how that gets me back to gateways. First off, one generic model of a language model call in an application is that you have some business variables in your code. They're probably, you know, ints and floats and strings and that kind of thing, and arrays thereof, like a JSON. And you take that JSON and you template it into some strings.
You now have an array of strings. You send that to OpenAI or equivalent, you get back another string or some more strings, and then you parse that string and do more business stuff with it, and then maybe you make more language model calls like that down the line. Eventually some business stuff happens, and either it goes well or it goes poorly, and usually this feedback is gonna be noisy and sparse and, you know, maybe natural language and not really quantifiable, but you can get something back, maybe, hopefully. Let's assume that you get something back for now, and I can unpack, when we talk about the industry, whether that's generally true. What this looks like, then, is you make many calls to some machine learning model, you have structured variables on one side, structured variables on the other side, and then stuff that happens in the middle. You do this many times, then you get back some reward. And this looks to me like a reinforcement learning problem. Maybe I'm biased because I've spent a lot of time thinking about things that way, but this does smell like a partially observable Markov decision process. So we took this really seriously back in the early part of last year, and we actually wrote down, in math, for real, what we think the mapping is between the standard design of an LLM application and a POMDP.
I'll write a blog post about this soon so we can share it with folks, but that led us to the idea that really the policy here, which is the thing that makes the decisions, is the entire function that sits between the business variables on the input side of the call and the business variables on the output side of the call. So what we realized is, okay, our interface design is going to be different from OpenAI's interface design, because OpenAI, due to the limitations of their system and the design of how they're serving you a model, takes strings as input and gives you basically strings as output, and that makes sense for them. But I think for applications, the right way to think about talking to a language model is: I'm gonna design an input and output signature for a job that I would like done by something smart, and then I will make that typed function call, and I don't really care what happens under the hood at all. What I want is outputs that lead to good business outcomes down the line. And obviously that's not sufficient to get started.
You need to go under the hood and write a good prompt and try it a few times and make sure it's sensible and won't do anything crazy upon deployment. But once you have something reasonable going, really your goal is: okay, now we have something reasonable going, let's optimize against whatever feedback signals we have, or the comments that the PMs make about the performance of the system, or those kinds of things, and try to do better over time. That, to me, also looks very much like a reinforcement learning problem. So from there we realized, okay, if we were to build an LLM application ourselves today, I would probably take these insights and use them to treat the application code as making these typed function calls, and then have the ML process be: we have this incumbent implementation of the function call, which was the original prompt and whatever model you originally chose and so on, and the job of an ML scientist or MLE working on the team is to repeatedly come up with new implementations that might be better than the original one, and actually slot them in or A/B test them against the original implementation to see whether the new prompt, model, and generation parameters triple that's actually implementing this function call is going to be better. For us, that led to, okay, where are we going to insert this into the stack? And it made sense that, since one of the design variables in coming up with new implementations of a function is that you're maybe gonna wanna try different language models, it obviously makes sense to insert into the stack of an LLM application at the point where you have the choice of different language models. And then, optionally, and we're one of the few products that does this today, and I think we probably have the most articulated, explicit view of how this should work, we want to also manage the prompting. You can use the product without that and keep the prompts in the application code, and a lot of people do this; you just build your strings out of Python f-strings or with JavaScript or so on. But if you can keep the prompt in the gateway layer, what that means is that you're able, at inference time, to say: I wanna call Anthropic with prompt A, or I wanna call OpenAI with prompt B, or I wanna call Gemini with prompt C, or I wanna call my fine-tuned Llama that doesn't need much of a prompt, because it's been fine-tuned on so much data that it already knows what the problem is, with minimal prompt D, and let's compare those.
And I think that because different language model families should be prompted differently, and because, when you're thinking about the performance of a particular implementation of your LLM function call, you care about the whole function, prompt included, not just the model, this is a nice unit of abstraction for application code and for the people who are monitoring these systems. I just talked for, like, ten minutes; I'm sure you have questions.
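As a sketch of the "typed function call" framing from the last few answers, the application sees a typed job signature, while the prompt template, model choice, and output parsing live behind it and can be swapped out as a unit. Every name below is illustrative; the LLM call itself is injected so the sketch stays gateway-agnostic.

```python
# Sketch: business variables in, typed result out; the implementation behind
# the signature (prompt, model, parsing) is the thing you iterate on.
import json
from dataclasses import dataclass
from string import Template

PROMPT = Template(
    "Write a short reply to customer $customer_name about order $order_id. "
    "Respond as JSON with keys 'subject' and 'body'."
)

@dataclass
class DraftedReply:
    subject: str
    body: str

def draft_reply(customer_name: str, order_id: str, call_llm) -> DraftedReply:
    """call_llm is whatever actually hits the gateway: prompt string in, string out."""
    raw = call_llm(PROMPT.substitute(customer_name=customer_name, order_id=order_id))
    parsed = json.loads(raw)
    return DraftedReply(subject=parsed["subject"], body=parsed["body"])
```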
[00:19:28] Tobias Macey:
No, that was great and very useful. And to the point you were most recently making about automating some of the prompting, it puts me in mind of another project framework that you may or may not be familiar with, DSPy, which I think came out of Stanford. I'm wondering, when you allow something, whether it's DSPy or TensorZero, to automate the prompt generation, how that shifts the way you think about the overall design of your application, where the LLM is some core piece of that functionality, and how you've seen teams try to tackle that: I've spent all of this time on prompt engineering, I've got the perfect prompt, oh wait, the prompt started falling apart because they released a new iteration of the model. Just some of the ways that they think about the role of prompting in the overall application stack and the behavior thereof.
[00:20:25] Viraj Mehta:
Yeah. I think DSPy is a really cool idea, and it was definitely one of the inspirations for me as I thought about this. In fact, right during that time in late 2023, when I was thinking about this in the very early days, I emailed and hung out with Omar Khattab, the author of DSPy, at NeurIPS, and we talked about it then. I think one way to contextualize the relationship between TensorZero and DSPy, and there's actually an example in our repository of using DSPy to optimize prompts, is that TensorZero is production software that collects data that is well structured for doing all kinds of things to improve your implementations of all your function calls. DSPy is one set of techniques, because they have many optimizers, that can read from the data model we're storing and write essentially a new implementation of one of these functions to our config model, and then you can A/B test it or evaluate it or do whatever you want with the new implementation. DSPy is one really smart way of thinking about this: prompts, our LLM applications, are programs, and we should think about them as a composable, PyTorch-style graph of LLM calls and dependencies, and then we can quote-unquote backpropagate through that graph, in a PyTorch kind of way, to improve the language model calls that happen upstream.
I think we think about this very similarly. Maybe one of the differences with DSPy is that it doesn't really manage data; it says, all right, you hand me a dataset and I will run optimization. But the accumulation and structuring and labeling, that whole part of the problem, is external to DSPy. So if you're going to do this on hand-curated data, then you can hand-curate your dataset, hand it to DSPy, and see what kind of prompts it comes up with. But if you're going to do this as part of your production system, which I think, especially as LLM deployments scale, is the right way to think about it, then you need a tool like TensorZero that's going to dump, in a structured form, all the information you need to then do DSPy later. And I wanna be super clear that it's actually really hard to get this data model correct.
For example, most tools in the industry that are doing LLM observability, including many of the gateways, are gonna dump the strings that you send, because they don't have this concept of, okay, I have typed input that gets templated into strings that then get sent to an API. You're looking at the strings that go into the model and the strings that come out of the model as the dataset you're storing in your observability platform. What that means is that down the road, when I've changed my prompt six times over the last year and I want to fine-tune on the historical data that ended up going well, in order to make sure that all the data has been templated with your latest prompt, you have to parse out or go fetch the variables from somewhere and re-template. That's either going to be a difficult parsing problem, or you have to go and ETL it, or do some join with some external system; it's hard.
Whereas if you just store structured information about what went in, then it becomes easy: okay, yeah, we just re-template everything; we have it in this other table. So there are considerations like that, along with the fact that you wanna know which types of LLM calls led to some downstream outcome, that force you to think about the data model in roughly the way we think about it. I spent a lot of time on this; I think it was the thing we spent the most time on last year, making sure that the entire universe of things that we would want to do in the future was supported by the data model. And now I think we're really happy with having a forward-compatible data model that lets you run TensorZero today and know that in the future, as we start to implement more optimization techniques, which is kind of the direction we're turning to now, they will be compatible with the data you collected last year.
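To make the "re-template everything" point concrete, here is a hedged sketch of why storing structured inputs pays off: historical rows keep the raw variables, so they can be re-rendered with the latest prompt template before fine-tuning, instead of being parsed back out of old prompt strings. The row layout, field names, and metric threshold are hypothetical.

```python
# Sketch: rebuild fine-tuning examples from structured history rows rather
# than from previously rendered prompt strings.
from string import Template

LATEST_PROMPT = Template(
    "You are a support agent. Customer: $customer_name. Issue: $issue. Reply helpfully."
)

def build_finetuning_examples(rows: list[dict]) -> list[dict]:
    examples = []
    for row in rows:  # e.g. {"input": {...structured vars...}, "output": "...", "metric": 1.0}
        if row.get("metric", 0.0) < 1.0:
            continue  # curate: keep only the inferences that went well
        examples.append({
            "prompt": LATEST_PROMPT.substitute(row["input"]),
            "completion": row["output"],
        })
    return examples
```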
[00:24:27] Tobias Macey:
So digging a bit more into TensorZero and its overall architecture and system design, I think that will be useful to frame the rest of the conversation about its features and functionality and workflow. So, I guess, can you give a bit of an overview of how you have implemented TensorZero and some of the supporting infrastructure that it needs to be able to do this operation of storing the structured inputs and outputs and then performing optimization on that?
[00:24:59] Viraj Mehta:
Yeah. So first I'll talk about a user's perspective on this rather than a system administrator's perspective, because the system administration is actually not too complicated. As a user, think about it like this: if you built your application with OpenAI, let's say, how would you move it over to TensorZero and do it correctly? TensorZero does offer a drop-in OpenAI-compatible endpoint, so you can just point your client at TensorZero and start using it. But to do it properly, I would think about, okay, what are all the jobs in my application that are done by LLMs? If I had a bunch of smart typed functions that I would be calling, what would they be called?
And where would they be called in the application? Then, okay, what would the input variables look like, and ideally write down a JSON schema for that. What would the output signature look like? Would other tools be called, do we want JSON out, and what would the schemas of those be? That allows you to go and configure it: we have a TOML configuration file where you write down, here are the functions that I'm calling and here are the input and output signatures, and pretty quickly replicate, okay, now I have the same thing that I was doing with OpenAI, but I've factored it out into the jobs that are being done, and I've put structure onto a few of the input variables, and here are the output variables. So now TensorZero comes in as a client that you pip install, or that you make web requests to, and there's a binary, basically a Rust binary, that's a Docker container you can deploy anywhere you deploy your Docker containers. It's stateless, but it reads the configuration, so it knows what's going on with the functions that are going to be called by the client. Everywhere in your client code where you're calling LLMs, you're calling these functions. And now the TensorZero gateway will implement the functions by taking the variables, templating them, and then shipping them off to the appropriate language model provider, and then returning the responses and parsing and validating and so on. It implements things like fallbacks and retries; it does all the things.
But the other thing it does: the other dependency that it sits on is ClickHouse. So you would ideally like to have some sort of deployment of ClickHouse, whether that's the cloud service or your own open source deployment or however, a ClickHouse where this data starts to be materialized in a structured format. Important to add: the TensorZero gateway also has a feedback endpoint, and we support booleans, floats, comments, and demonstrations, so if there's a human in the loop who actually edits a response before it goes out, you can send that edited response as a demonstration of the desired behavior. What all of that buys you, after you run the application for a while, is a ClickHouse that's full of the interaction history: what were all the inferences that were part of a particular run of your application, and how did it go. That's a really useful dataset for optimization. So we can start to do things like: curate the data based on which ones went well, template them with our latest prompt, and then fine-tune a smaller model, either a Llama-series or an OpenAI-series model or one of those kinds of things. Then let's slot that in as a new implementation of our language model call and run 5% of traffic through it. All that really takes is the web UI we have now that reads from ClickHouse; it's another Docker container. You can go into the UI, look at your inferences, and there's a form for fine-tuning, although we also publish notebooks that do this; it will curate by the metric, and then you can pick which template you wanna use and which model you'd like to fine-tune. Then go, and it gives you a TOML block that you add to your config, and now you have a brand new variant that you can run an A/B test against. All of that data continues to be dumped to ClickHouse, and you can see how the incumbent variant is doing against variant B, your new fine-tuned model, and make a determination, given the confidence intervals on this metric, as to whether it's worth flipping over to the new model.
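Here is a hedged sketch of the inference-plus-feedback loop just described, using the TensorZero Python client. Treat the import path, constructor, method names, and arguments as assumptions reconstructed from this conversation rather than documented signatures; the function and metric names are placeholders that would have to exist in your TOML config.

```python
# Sketch (assumed API): call a configured function through the gateway, then
# attach outcome feedback so the ClickHouse history links it to this episode.
from tensorzero import TensorZeroGateway  # assumed import path

with TensorZeroGateway("http://localhost:3000") as t0:  # assumed constructor
    result = t0.inference(
        function_name="draft_reply",  # a function declared in the TOML config
        input={"messages": [{"role": "user", "content": "Where is my order?"}]},
    )
    print(result)

    # Once the business outcome is known, report it against the same episode.
    t0.feedback(
        metric_name="task_success",  # a metric declared in the config
        value=True,
        episode_id=result.episode_id,
    )
```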
So, did that make sense? Was that clear? I know it was, again, a lot of text.
[00:29:21] Tobias Macey:
That makes absolute sense. And I think that is definitely a very interesting and useful aspect of the system, that it maintains that history in the ClickHouse database for that feedback. One of the questions I was going to ask, which you proactively answered, was how that human feedback gets factored in, and what interfaces are available for providing input back to the application and TensorZero to say, yes, this is what I want, or, no, this is really bad. The fact that you have the web UI is definitely very useful for some auditor or administrator of the system to go in periodically and say, okay, these are the things that we like, these are the things we don't like. But it also gives you the opportunity to expose that in your client interface, where even if it's just a thumbs up or thumbs down of this was a good response or this was not helpful, you can feed that back into the system as an automated feedback loop without necessarily having that be a manual task for one of your teammates on whatever periodic basis.
[00:30:26] Viraj Mehta:
Sure. And a quick digression: I spent a while, two years ago, before TensorZero, thinking about, what if you were to acquire a services business and use LLMs to start to automate things, but you don't wanna degrade the quality of service of the services business, so you have humans still review all the work that the LLMs are doing and fix it, and that would eventually generate a training set to improve the quality of your LLMs. I didn't end up doing that, because I'm not a private equity tycoon, I was a grad student at the time, but I was thinking it through. What I landed on was that for many businesses it's gonna make sense to have this setup where the humans continue to audit the LLM output for a long time, and what you really want is for the frequency at which the auditors have to change anything to go down over time. That means you're getting closer to your eventual goal of, I guess, automating the job, or, let's put it nicely, increasing the productivity of labor.
So there's this natural thing where we're gonna get demonstrations of good behavior as we're doing the handoff from humans to AI for this menial job, and you need to be able to take account of that information; that's the only way this transition ultimately ends up happening. And I think, even as the base models get smarter, and that will continue to happen for sure, there are too many smart people and too many dollars and too much compute being thrown at this problem, it doesn't necessarily mean that the models will be getting better at your particular business task. For example, I don't know that since the GPT-3.5 series the models have gotten quite a bit better at writing very targeted sales emails.
I think you just have to learn what works. You know, in my experience doing any kind of sales or marketing, you have to learn from trying stuff: what works for your particular kind of customer population, what works for your particular product, and what hooks actually do the job well. And if you could do it contextually, that would be great; nobody really can. So I think there are still plenty of business problems where you wanna take advantage of the production data in order to do better, even as the models get stronger.
[00:32:37] Tobias Macey:
Another interesting aspect of what you're building with that ClickHouse record keeping is that it's effectively the interaction history of, at least, the application with the LLM. Another core capability that is built into a lot of chat-based applications, or that people want built into their chat-based applications, is that keeping of history: what are all of the requests and responses for a given user, and how do I also use that to enhance my profile of that user, to then be able to give them more useful responses and maybe feed that into the context of the request that I'm sending?
And I'm curious how you're thinking about the role of TensorZero in that use case, or LLM gateways more generally, and some of the ways that that factors into the overall system architecture of what is the piece that owns that. Another aspect of this is the idea of memory that is being built into a number of other products, whether they are part of an existing framework or an addendum to the overall stack; I'm thinking of, for example, Cognee, Mem0, and the LangChain or LangGraph memory capabilities.
[00:33:53] Viraj Mehta:
I think this is a really interesting distinction to make, and we're probably far on one end of the spectrum here, so it's an interesting perspective with which to regard it. For a second, just for completeness, I wanna rule out the new architectures that have been coming out lately where, during inference, some memories are baked into the weights of the model, and instead stick to: the model is memoryless, and we're not dynamically baking new things in. If we leave that out, I think that the concept of memory in LLM applications boils down to a context store that's writable.
The idea there is, even the ChatGPT implementation, I believe, has the option of saying, okay, this is a good piece of context that I would wanna return to in the future, and it writes it to some store on OpenAI's server side. Then, in future conversations, that memory is put into context along with the user question at the time. So I think about this kind of similarly to RAG, retrieval augmented generation, in that you have some store of documents that might be useful for making an inference, and you retrieve the most relevant ones prior to inference time.
User-specific memory writing is like having, say, a vector database (it doesn't have to be that) which is actually writable by the LLM as it's having the conversation, and then future inferences can read from it as part of that memory stack. When I look at the products that are out there from LangChain and Mem0 and some of these other providers, that is usually the architecture that I see. For TensorZero, we're primarily focused on the things we can use to improve the actual LLM calls, so we don't really deal with this very much at all, because we think that for a well-designed LLM application, the design of your memory is probably very application-specific. Some things you might want to share among groups, and some things, if it's a team setting or an enterprise application, are very personal and you want to keep them to the user. We kind of explicitly decided that that kind of thing, along with RAG, is not worth touching today, because, for example, for RAG, the data architecture of a particular enterprise application is going to be very different from the next one over, and there aren't a lot of patterns that we wanted to generally impose on how people store and retrieve their data. So we explicitly left it out of scope. The way we would treat it is: add a field to the input signature of your function that takes memory, and then optionally send it as your memory system dictates. We haven't really handled that, but it would integrate fine with any other memory provider. Because the gateway layer is such a good choke point for dealing with the general behaviors of LLM applications, other people have taken the other approach of, no, we're gonna have a batteries-included solution, where we offer either a vector store or an append-only memory context document or some other memory system that knows about your application and knows which memory to load into context prior to making the request. But our decision was that that was something developers could figure out on their own.
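As a sketch of the "add a memory field to the function's input signature" approach Viraj describes, retrieval of memories stays application-owned, and the gateway function simply receives them as one more structured input. The retrieval function and input shape below are made up for illustration.

```python
# Sketch: memory is just another structured input field supplied by the app.
def fetch_user_memories(user_id: str) -> list[str]:
    # In a real application this might hit a vector store or an append-only log.
    return ["Prefers replies in French", "Is on the enterprise plan"]

def build_function_input(user_id: str, question: str) -> dict:
    return {
        "question": question,
        "memory": fetch_user_memories(user_id),  # optional extra input field
    }

print(build_function_input("user-42", "Can you extend my trial?"))
```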
[00:37:49] Tobias Macey:
Aside from the more detail-oriented memory capabilities, even just for tracking the history of all the requests and responses that have happened in past conversations, is that something you would presume people would rely on TensorZero's history tracking to take care of? Or would you say, keep your own history tracking for user conversations and load that in, because we're keeping history for a different purpose and the two don't necessarily want to mix together?
[00:38:21] Viraj Mehta:
So we don't offer client-side retrieval for this purpose, but we do keep comprehensive documentation, at many layers of specificity, of exactly what went in and out of the language model. In particular, those records are retrievable, and we give the client an episode ID, so you can go and query what all the inferences made for that episode ID were, ordered by time. You don't need an additional store to go and get that information. We haven't seen folks use it this way, although I guess we don't know exactly what everybody's doing, because we don't have telemetry built into the product. But it is totally valid: we publish the data model, and ClickHouse is a perfectly fine database for that sort of thing. So we keep the data in a way that's useful for this, and we give you the tools to retrieve it if that's what you want, but we haven't marketed it as such. I hadn't considered it from that angle before, but yeah, it's totally possible with the product.
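As a rough illustration of that retrieval, here is a minimal sketch that queries ClickHouse for every inference in one episode. The table and column names are assumptions about the published data model, not guaranteed to match it exactly.

```python
# Hypothetical sketch: pull back every inference for one episode, ordered by time.
# Table and column names are assumptions for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")


def inferences_for_episode(episode_id: str):
    result = client.query(
        """
        SELECT id, function_name, input, output, timestamp
        FROM ChatInference
        WHERE episode_id = %(eid)s
        ORDER BY timestamp ASC
        """,
        parameters={"eid": episode_id},
    )
    return result.result_rows
```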
[00:39:28] Tobias Macey:
You mentioned that you don't have telemetry built in, so you don't have that tracking of how users are using the application, which is definitely, I think, a desirable trait in an open source project. I'm wondering how you're approaching the long-term sustainability and overall business model of what you're building at TensorZero, how you think about which capabilities belong in that open source core application, and what some of the additional functionality and value-add features are that you're looking to build over time as you grow the capabilities and grow the organizational capacity.
[00:40:05] Viraj Mehta:
Yeah. So we're still early. We're lucky enough to have investors who are aligned with the longer-term vision that I'll present in a second, so we're not racing to make money, and we're not really building anything today that isn't going to be part of the core open source assets. But we do have a clear design and a line for where this is drawn, and we're happy to share that with anybody who asks, because we think it's part of the insurance against rug pulls for folks who want to depend on the open source package. The way we think about this goes back to the beginning: the original version of TensorZero that we were running last year, as we were trying to figure out what we were going to build, was a managed service.
We were running this gateway as SaaS. A couple of folks would integrate against it and send feedback to it, and we would store the historical data on our side. And then we had this man behind the curtain, Wizard of Oz style: we would run the optimization techniques mostly manually. Gabriel and I have both been doing ML for a long time, so we would just optimize your models for you based on the feedback you were giving us. We had the data set to do it, so we'd run the A/B tests and just make it work. We were able to get quite good results doing that, but it obviously wasn't sustainable given how much sweat went into it. So what we realized was, okay, we need to be really honest with ourselves here and build a system that, even if we weren't there, would allow people to make their own systems better based on the feedback being collected. The plan, and we've already done the vast majority of this so far, is to publish tools that let you build your application with TensorZero, collect the data, and store it in a way that's useful, and then techniques that read from that ClickHouse data model and write new variants, new implementations of the functions, to your config. We have a couple of those built: supervised fine-tuning with a couple of providers, a workflow for dynamic in-context learning, and there's an open PR that should be merged tomorrow that does direct preference optimization.
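To give a feel for what such a recipe does, here is a minimal sketch of the read-from-ClickHouse, fine-tune, write-a-new-variant loop. The table names, column contents, fine-tuning helper, and variant layout are all assumptions for illustration, not the exact published data model or config format.

```python
# Hypothetical sketch of an optimization "recipe": read feedback-joined inferences
# from ClickHouse, fine-tune on the best examples, and register a new variant.
import json

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

rows = client.query(
    """
    SELECT i.input, i.output, f.value AS reward
    FROM ChatInference AS i
    INNER JOIN FloatMetricFeedback AS f ON f.target_id = i.id
    WHERE i.function_name = 'assistant_chat'
    """
).result_rows

# Keep only high-reward inferences as training examples (assumes JSON-encoded columns).
examples = [
    {"input": json.loads(inp), "output": json.loads(out)}
    for inp, out, reward in rows
    if reward >= 0.9
]


def submit_fine_tuning_job(training_examples):
    """Stand-in for a provider-specific fine-tuning call (OpenAI, Fireworks, ...);
    a real recipe would upload the examples and poll until the job finishes."""
    return "ft:placeholder-model"


fine_tuned_model = submit_fine_tuning_job(examples)

# In practice this would be written into the gateway config as a new variant of the
# function so it can be A/B tested against the existing implementation.
new_variant = {"type": "chat_completion", "model": fine_tuned_model, "weight": 0.1}
print(json.dumps({"assistant_chat": {"variants": {"fine_tuned_v1": new_variant}}}, indent=2))
```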
Then our focus will shift towards, one, building a better UI, and two, growing the number of optimization techniques and continuing to improve them going forward. All of that, I think, will be in the scope of the open source project. But there's a natural thing that happens as you scale and these techniques get more advanced. Right now, the first few techniques are things you can run with only an API key and no compute; we wanted to ship those first because they're easy for everybody to run. But of course there are limits to the customization APIs available from the major labs, the OpenAIs of the world, and also from the Togethers and Fireworks that do their own custom fine-tuning. Eventually we're going to want to publish techniques that we think work quite well but that aren't supported as APIs by existing players. So, as part of the open source project, to run the recipe you'll have to go and rent your own 8x H100 node, which isn't cheap and is annoying to even set up, and if you want several, that's even more annoying. Then you have to know that it's time, choose the technique, run it, run the evals, put the result into an A/B test, manage the A/B test, and then your application gets better. That's still way less work than before, when you were designing the data model, finding and implementing the algorithms yourself, and making sure they fit, an order of magnitude less work, but it's still work. And as these applications scale further, it becomes a lot of work, especially if you imagine an agent that's doing lots of different things all day; it starts to get hard to manage. So we want to go back to the original vision for folks who have been using the open source software and for whom getting all the GPUs and managing them, figuring out which experiments to run, managing the A/B test weights, and all that stuff is a pain.
We want to offer a service that basically starts to do all of this proactively. So: hey, we noticed programmatically that you've collected 10,000 more data points since the last time we tried fine-tuning, and a new Llama model version got released. Let's proactively run a state-of-the-art filtered fine-tuning job, train a new model, run your evaluations against it, and send you a ping saying, hey, your new model is ready, here are the results, do you want to deploy it in production? Click this button if yes, and we'll run our usual A/B test routine. What this starts to look like is that the improvement aspect of an LLM application becomes mostly automatic. We think that, especially for scaled deployments, where improvements in KPIs are really all that matters and cost is critical, because the more data you collect, the smaller the model you can train to achieve the same performance, there's a lot of that work to be doing all the time. Either we offer the tools and you manually manage all the compute and proactively do everything yourself, or we offer an autopilot for it. That's a clear line to draw around the open source project that keeps it healthy and useful, and we can keep contributing to it and making sure it's an excellent piece of software. But for folks on the larger end, who have a lot riding on their application, we can start to make it as good as it can possibly be without a ton of ML effort, and those are expensive employees who would otherwise be doing that effort.
[00:45:43] Tobias Macey:
Yeah. It's definitely a very interesting and useful gradation for a project like this, where, as you said, you can do it all yourself, but eventually, you're gonna get sick of it, and we'll be there when that happens.
[00:45:56] Viraj Mehta:
And I think it's not just that you're going to be sick of it. We can also make better utilization of your GPUs if we're doing it automatically, because we can just run the next job whenever one finishes. And because we can try things in parallel and more exhaustively search the space of tricks you can pull, we should be able to stack more little wins and get to, I would hypothesize, better performance than if it were done manually, just due to the amount of search that can be done over the possibilities.
[00:46:31] Tobias Macey:
And also, if you have a number of clients who are within adjacent verticals, or who are doing similar enough work, then you can say, oh hey, this looks like something we've seen before over there. Here's our prebuilt fine-tuned model for you to use. That saves you all of the headache of having to do the data collection and feedback first, and gets you an easy win, because we already know enough about this type of problem space to know that this is going to work better for you.
[00:46:59] Viraj Mehta:
Yeah. I mean, we definitely don't want to cross-pollinate data, but I definitely think cross-pollinating learning is a really important and interesting part of building a company like this. We'll see a lot of things and do a lot of core research that maybe other companies can't do, because their job is actually building the application, and then, ideally, share that with everyone who uses the product. Another analogy I really like here, going back to your point about how it works but becomes annoying to maintain at scale, is that I think this is the story of Databricks on some level, with large Spark clusters and large Delta Lake deployments. It started out as an excellent, really useful open source project, and down the line it turned into a monster business, because they built something really useful, people wanted it at scale, people didn't want the headaches that came with that, and there they are.
[00:47:49] Tobias Macey:
Absolutely. And as you have been building TensorZero, working with some of the early adopters and clients, what are some of the ways that you see TensorZero and LLM gateways shifting their thinking architecturally, and maybe some of the patterns where the gateway actually gets in the way? Because maybe there's only ever one LLM that they're going to call for a very limited use case, or the addition of the gateway just adds too much cognitive overhead or architectural or operational maintenance.
[00:48:28] Viraj Mehta:
Yeah, sure. So TensorZero has its own particular wrinkles that we're trying to iron out of the product. I can talk about those specifically; they definitely cause headaches for people, so give us a month or two and hopefully they'll be gone. But then I'll also talk about the general case. For TensorZero right now, we don't have a client-only implementation of the gateway that keeps everything in-process, so you can do quick development with just the client. LiteLLM has this; we'll fix it shortly.
There's also the question of structure. One thing about keeping things structured, as TensorZero recommends you do for production: if you're doing quick development and changing things all the time, and you don't even know which variables should go into the function in the first place, then writing down JSON schemas and keeping everything templatized and organized can be more friction than it's worth, while you're just trying to rapidly iterate on what you even put into the LLM and what you even want out of it. At that stage, the time to get an inference done, look at the outputs, and repeat that process a lot of times is what really matters to developers, and right now I'd say TensorZero is a bit heavy and a bit clunky for that kind of thing. We're working on removing some of those assumptions and making it lighter, but I'd say that's a place in the life cycle of an application where our product isn't amazing yet. Then, in general for gateways, and maybe this is outside of TensorZero because we've spent a lot of time on this: when I use LiteLLM, there are little wrinkles of each inference provider that are a bit different, they're not really enforced by the interface, and LiteLLM doesn't fix them for you. For a while, and this is fixed now so I don't think it's an issue anymore, Anthropic wouldn't let you send two consecutive user messages as part of the conversation. LiteLLM would just let you do it and then give you a 400, and you're left wondering what happened. You have to go dig through their docs, and even then it's not that clear. So there's this headache because there's a layer of obfuscation between you and the ultimate service that you're calling, and this is true of TensorZero to some extent as well. When you're trying to debug things, you have to debug two layers instead of one, and that's harder.
Because of that layer of obfuscation, it took me a while to figure out why I was getting a 400. For TensorZero, we know about that case, so we just coalesce the messages, do something sensible, and try to get the request to work, but maybe that's not the desired behavior either. So even with us, there's a layer of somebody else's software between you and the thing you want to talk to. There's value, but there are also drawbacks.
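As a rough illustration of the kind of message coalescing described here, the following sketch merges consecutive same-role messages in an OpenAI-style message list. It shows the general idea only, not TensorZero's actual implementation.

```python
# Illustrative sketch: merge consecutive messages from the same role so providers
# that reject back-to-back user messages (as Anthropic once did) accept the request.
def coalesce_messages(messages: list[dict]) -> list[dict]:
    merged: list[dict] = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Join adjacent same-role messages into one, separated by a blank line.
            merged[-1]["content"] += "\n\n" + message["content"]
        else:
            merged.append(dict(message))
    return merged


messages = [
    {"role": "user", "content": "Here is the document."},
    {"role": "user", "content": "Now summarize it."},
]
assert len(coalesce_messages(messages)) == 1
```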
[00:51:19] Tobias Macey:
And as you have been building and working with some of the early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen TensorZero applied?
[00:51:32] Viraj Mehta:
Yeah. So with some of our earliest users, we helped them, obviously handheld them, on how to set this up for their application and walked through it. The thing that's been cool is, okay, now we're gone, it's open source software, do whatever you want, and people have kept going. For example, a lot of people use LLM as a judge for evals: can you check whether this output meets the criteria? That judgment is an LLM call too, made by some part of the application, so you might as well make it a TensorZero function as well. So now we have TensorZero functions judging TensorZero functions, and in principle you could have TensorZero functions judging those, turtles all the way down. I think that's really cool, because ultimately one problem with LLM as a judge is that it too has a failure rate, and you have to independently monitor and optimize that failure rate. So the value of having a common abstraction is that every LLM call becomes something we can treat, optimize, and observe in the same unified way.
It lets people build things that touch on one of the really beautiful things about computers: now we can immediately make it recursive and see what happens when we loop it on itself, which feels very Douglas Hofstadter-y. I'm very excited to see this kind of thing more often, and we'll build it into the core product at some point. So yeah, that was very unintuitive for me and exciting.
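Here is a minimal sketch of the "functions judging functions" pattern: one gateway function produces output, a second function judges it, and the judgment is sent back as feedback so both can be optimized over time. The function names, endpoint paths, and response fields are assumptions for illustration, not TensorZero's exact API.

```python
# Hypothetical sketch: a task function, a judge function, and a feedback call.
# Endpoint shapes and response fields are assumed, not verbatim.
import requests

GATEWAY = "http://localhost:3000"


def call_function(function_name: str, input_payload: dict) -> dict:
    r = requests.post(
        f"{GATEWAY}/inference",
        json={"function_name": function_name, "input": input_payload},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()


draft = call_function(
    "draft_reply", {"messages": [{"role": "user", "content": "Follow up on the invoice."}]}
)

# The judge is itself a gateway function, so it can be observed and optimized too.
judgment = call_function(
    "judge_reply",
    {"messages": [{"role": "user", "content": f"Score this reply from 0 to 1:\n{draft['content']}"}]},
)

# Feed the judge's score back as a metric on the original inference
# (assumes the judge returns a bare number and the response includes an inference ID).
requests.post(
    f"{GATEWAY}/feedback",
    json={
        "metric_name": "reply_quality",
        "inference_id": draft["inference_id"],
        "value": float(judgment["content"]),
    },
    timeout=30,
)
```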
[00:52:59] Tobias Macey:
And in your work of building the system, building a business around it, and exploring the overall space of LLM gateways and how they can be used to improve the operation of AI applications, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:53:19] Viraj Mehta:
So I think this goes back to how easy it is to use initially. We went off into a rabbit hole for a year, thinking, this is the correct theoretical thing that should be done, let's make it production grade, and so on. In our journey, we over-indexed on production grade: we're going to build it in Rust, we're going to make it all configurable through files that you can commit and review in Git, we're going to use ClickHouse, which is the ultimate scale database, so it'll work when your data has trillions of rows, and it's all great. Making that decision means the bones are really good, but it also means we now have a ton of work to make it really easy to get started with, for the developer who says, I don't want a ClickHouse deployment, I don't want to run Docker, I just want to pip install something and get to work, and not write TOML files or learn anything.
So it's really been a big process. We think we have some of the highest quality, most thoughtful implementations of an LLM gateway out there, but it's probably also the hardest to use. Now it's time to go back and fix all of those things; it was a matter of emphasis, there aren't bugs, but we need to go back, backfill, and make this a really ergonomic tool that you can just pip install or npm install and go on with your life. So I think we made an explicit hard right turn, and now we need to course correct, and we're doing that.
[00:55:05] Tobias Macey:
And with ClickHouse, at least, it also offers an in-process option, so you can just run it with a pip install, and you don't have to have a full distributed multi-node cluster of ClickHouse to get started on your laptop.
[00:55:18] Viraj Mehta:
Yeah, yeah. I mean, for development purposes, all of our examples come with ClickHouse and the UI, so you can run docker compose up and everything works. We have a bunch of runnable examples in the repository set up that way. But still, it's a little bit more setup. Something like DuckDB might have been even better, where we could have compiled it directly into the gateway or something of that sort. But we wanted this to work for big multi-node deployments later down the line, and we had to pick one warehouse to get started with. I wish we could support all of them today, but we're a small team, so ClickHouse seemed like a reasonable bet.
[00:56:01] Tobias Macey:
And SQL is never the same no matter how hard you try.
[00:56:06] Viraj Mehta:
Yeah. I wish every warehouse had the same SQL dialect, but that's never the case. And you're also going to want to optimize things differently based on whether it's Snowflake or Redshift or ClickHouse or BigQuery or DuckDB and so on.
[00:56:21] Tobias Macey:
And you've mentioned a little bit about some of the future road map. I'm just wondering if there are any particular projects or problem areas you're particularly excited to explore in the near to medium term.
[00:56:32] Viraj Mehta:
Yes, I'll talk about this publicly, I don't mind. So let's say, generically, we have some LLM system, and we want to develop a new technique for making that LLM system better from our data model. What we really want, in order to do that, is a bunch of different problems covering different problem spaces, where we can run a bunch of examples, fill up the data model, run the technique, and then see how much better it got. Today that involves building a bunch of ad hoc scripts: for each problem, set up the environment, set up the example, run a bunch of episodes of interaction, run the script for improvement, and then run a bunch more episodes to see how it did. But this is a problem that was solved really well in reinforcement learning. The pre-2015 reinforcement learning community would do the same thing: everyone needed to integrate against a dynamical system, so there was this bipartite graph of algorithms and dynamical systems that all had to jointly integrate with one another. Then the people over at OpenAI published Gym, and Gym was a unified interface: every RL algorithm integrates against Gym, every environment integrates against Gym, and it works. We're very excited about building something, internally and then externally, that lets us do research at scale against a bunch of different applied problems, with one common integration that lets us, without having to know what the environment or the problem is, run a bunch of examples, run the thing that makes it better, run a bunch more examples, and repeat across a bunch of different problems. I think that research tool will unlock a ton of new progress for us in terms of how we're going to make your LLM systems better, and I'm excited to share it with the community.
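For a sense of what a Gym-style interface for LLM-system problems might look like, here is a minimal sketch; the class and method names are hypothetical, not an existing library.

```python
# Hypothetical sketch of a Gym-style interface for LLM application problems:
# every environment exposes the same episode loop, so any optimization recipe
# can be benchmarked against any problem without bespoke glue code.
from abc import ABC, abstractmethod


class LLMEnvironment(ABC):
    """One applied problem (support bot, extraction task, agent, ...)."""

    @abstractmethod
    def reset(self) -> dict:
        """Start a new episode and return the initial observation/input."""

    @abstractmethod
    def step(self, action: str) -> tuple[dict, float, bool]:
        """Apply the LLM system's output; return (observation, reward, done)."""


def run_episodes(env: LLMEnvironment, policy, n: int) -> float:
    """Run n episodes of `policy` (an LLM system) and return the mean reward."""
    total = 0.0
    for _ in range(n):
        obs, done = env.reset(), False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
    return total / n


# The research loop would then be: measure, run an optimization recipe, measure again.
# baseline = run_episodes(env, policy, 500)
# improved_policy = optimize(policy, collected_data)  # any recipe
# after = run_episodes(env, improved_policy, 500)
```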
[00:58:28] Tobias Macey:
Absolutely. Are there any other aspects of LLM gateways or the work that you're doing at TensorZero that we didn't discuss yet that you'd like to cover before we close out the show?
[00:58:40] Viraj Mehta:
Let's see. Maybe we didn't touch on this, so I want to talk for a second about how the reasoning paradigm and the inference-time compute paradigm fit into this process, because I think it's very exciting. If you look at the work by OpenAI on o1 and o3 and their reasoning fine-tuning offering in beta, and some of the open source work that looks and smells like this, what it looks like to me is a new type of implementation of a TensorZero function: it takes a lot more time and runs a lot more tokens, but at the end it has the same inputs and outputs. Now there's this intermediate computational substrate on which chains of thought are happening, and we should optimize the whole implementation of that. For those kinds of problems, it's clear that OpenAI and their competitors are mining the world for math problems and LeetCode problems and IOI problems to solve, but you need, or would at least like, domain-specific data to make that happen for your own domain. So I'm excited for the direction of work that's soon to come.
Alright, we have our open source reasoning models. How can we directly customize those on the data that's in ClickHouse, and do it in a way that makes all of our customers' applications better out of the box, and benefits all the users of the open source project? Very excited for this, definitely.
[01:00:20] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[01:00:39] Viraj Mehta:
I'm going to take this in a different direction than you probably anticipated. The question we always ask when we meet a new team applying LLMs to some business problem is: what are the KPIs that you, or your management, associate with your product? I think there's still a ways to go on the management technology of, okay, we have this LLM system that's driving this business process; it's no longer a curiosity, it's something we care about for business reasons, and we want it to provide business value. So we need to, as a management team, look at the denial rate for the claims generated by the AI bot, or the click-through rate of the ads generated by the AI bot. That style of management thinking is actually not yet as widespread as I would have anticipated a year or two ago, and it's an important part of the story of these applications becoming the extremely important, load-bearing parts of the world economy that people anticipate.
So I'm going to put the onus on the business professionals who interact with AI folks and AI systems and say: the management technology of how we measure, optimize, and use these things in a serious way still requires some development.
[01:01:59] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at TensorZero and the functionality and features that you're adding to the AI stack to improve the functionality, efficiency, and accuracy of these AI models that we're all still figuring our way around. I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day.
[01:02:23] Viraj Mehta:
Yeah. Thanks. You too.
[01:02:29] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Viraj Mehta and TensorZero
Understanding LLM Gateways
Features and Capabilities of LLM Gateways
TensorZero's Approach to LLM Applications
TensorZero Architecture and System Design
Role of Memory and History in LLM Applications
Sustainability and Business Model of TensorZero
Challenges and Lessons in Building TensorZero
Future Roadmap and Exciting Projects