Summary
In this episode of the AI Engineering Podcast, Vasilije Markovic talks about enhancing Large Language Models (LLMs) with memory to improve their accuracy. He discusses the concept of memory in LLMs, which involves managing context windows to enhance reasoning without the high costs of traditional training methods. He explains the challenges of forgetting in LLMs due to context window limitations and introduces the idea of hierarchical memory, where immediate retrieval and long-term information storage are balanced to improve application performance. Vasilije also shares his work on Cognee, a tool he's developing to manage semantic memory in AI systems, and discusses its potential applications beyond its core use case. He emphasizes the importance of combining cognitive science principles with data engineering to push the boundaries of AI capabilities and shares his vision for the future of AI systems, highlighting the role of personalization and the ongoing development of Cognee to support evolving AI architectures.
Announcements
- Hello and welcome to the AI Engineering Podcast, your guide to the fast-moving world of building scalable and maintainable AI systems
- Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about adding memory to LLMs to improve their accuracy
- Introduction
- How did you get involved in machine learning?
- Can you describe what "memory" is in the context of LLM systems?
- What are the symptoms of "forgetting" that manifest when interacting with LLMs?
- How do these issues manifest between single-turn vs. multi-turn interactions?
- How does the lack of hierarchical and evolving memory limit the capabilities of LLM systems?
- What are the technical/architectural requirements to add memory to an LLM system/application?
- How does Cognee help to address the shortcomings of current LLM/RAG architectures?
- Can you describe how Cognee is implemented?
- Recognizing that it has only existed for a short time, how have the design and scope of Cognee evolved since you first started working on it?
- What are the data structures that are most useful for managing the memory structures?
- For someone who wants to incorporate Cognee into their LLM architecture, what is involved in integrating it into their applications?
- How does it change the way that you think about the overall requirements for an LLM application?
- For systems that interact with multiple LLMs, how does Cognee manage context across those systems? (e.g. different agents for different use cases)
- There are other systems that are being built to manage user personalization in LLM applications, how do the goals of Cognee relate to those use cases? (e.g. Mem0 - https://github.com/mem0ai/mem0)
- What are the unknowns that you are still navigating with Cognee?
- What are the most interesting, innovative, or unexpected ways that you have seen Cognee used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?
- When is Cognee the wrong choice?
- What do you have planned for the future of Cognee?
Parting Question
- From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@aiengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Cognee
- Montenegro
- Catastrophic Forgetting
- Multi-Turn Interaction
- RAG == Retrieval Augmented Generation
- GraphRAG
- Long-term memory
- Short-term memory
- Langchain
- LlamaIndex
- Haystack
- dlt
- Pinecone
- Agentic RAG
- Airflow
- DAG == Directed Acyclic Graph
- FalkorDB
- Neo4J
- Pydantic
- AWS ECS
- AWS SNS
- AWS SQS
- AWS Lambda
- LLM As Judge
- Mem0
- QDrant
- LanceDB
- DuckDB
[00:00:05]
Tobias Macey:
Hello, and welcome to the AI Engineering podcast, your guide to the fast moving world of building scalable and maintainable AI systems. Your host is Tobias Macey, and today I'm interviewing Vasilije Markovic about adding memory to LLMs to improve their accuracy. So, Vasilije, can you start by introducing yourself?
[00:00:29] Vasilije Markovich:
Hi. Nice to meet you, Tobias, and thanks everyone for listening. I'm Vasilije, originally from Montenegro. I've been in Berlin for around 10 years working in the big data field, worked as a data analyst, data engineer, data project manager, usually managing big data systems, everything from batch to streaming, so pretty close to the Data Engineering Podcast world. And I've been building a memory engine for AI apps and AI agents for the past year. We started as a series of concepts and prototypes based on my studies of clinical psychology combined with data engineering, and we ended up with a Python library that you can build semantic memory on. Happy to share more about that later.
[00:01:07] Tobias Macey:
And do you remember how you first got started working in the ML and AI space?
[00:01:12] Vasilije Markovich:
Yes. A long time ago I was working in a crypto company, and they needed to connect the funnels, all the data from the beginning to the end, and see who's who. So I built the first Python scripts, connected all the analytics, and I really liked analytics more than I liked crypto at the time. So I just decided to switch, and then I became a data analyst, then a data engineer, and then started dabbling in the whole ML world and pretty much stayed in those roles.
[00:01:40] Tobias Macey:
And so in the context of LLMs, you mentioned that you're working on building a system to add semantic memory to those applications, and I'm wondering if you can just describe a bit about what that term memory means in the context of LLM systems?
[00:01:57] Vasilije Markovich:
That's a great question. So how I look at memory in the context of LLM systems is pretty much based on in-context learning. So if we have an LLM, we can give it some sentence, or 2, or 10, which we all do when we copy-paste some context for it to reason on. We are effectively giving it some memory it can operate on. And this, compared to the traditional, let's say, training and fine-tuning methods, is something we can do for almost no money. And now, as things are actually moving along, we're also seeing potentially unlimited context windows where we can create this memory for the LLM. Where this becomes interesting, and where we actually start diverging a bit, is that we are now having to manage this context across a lot of LLM calls, a lot of interactions. And, yeah, we see it pretty much as a way to manage and structure these in-context windows as a new type of feature store. That's what we call LLM memory in Cognee. But, of course, I think the definitions can diverge there.
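To make that concrete, here is a minimal, illustrative sketch (not Cognee code) of treating the context window as memory: stored snippets are packed into a prompt under a budget, and whatever doesn't fit is effectively forgotten. The budget, the snippets, and the word-count approximation are all invented for the example.

```python
# Illustrative only: "memory" as whatever fits into the prompt budget.
# Token counting is approximated by a simple word count.

def build_prompt(question: str, memories: list[str], budget_words: int = 200) -> str:
    """Pack as many memory snippets as the budget allows, newest first."""
    selected, used = [], 0
    for snippet in reversed(memories):        # newest snippets first
        words = len(snippet.split())
        if used + words > budget_words:
            break                             # everything beyond this is "forgotten"
        selected.append(snippet)
        used += words
    context = "\n".join(reversed(selected))   # restore chronological order
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    memories = [
        "User prefers answers with code samples.",
        "User is building a support chatbot on top of a knowledge graph.",
        "Last session ended while debugging slow vector search.",
    ]
    print(build_prompt("How should I speed up retrieval?", memories))
```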
[00:03:05] Tobias Macey:
The opposite of memory is typically considered forgetting, which I know is also a problem in these LLM systems, particularly if you're trying to pack a lot of information into those context windows. And I'm wondering if you can talk to some of the ways that that forgetting manifests as a symptom when you're interacting with these LLMs and trying to build applications on top of them.
[00:03:27] Vasilije Markovich:
Sure. So first, there is the technical limitation, the context window limitation. Right? So let's say we have a context window of 4,000 characters that we can put in; then everything outside of that context window won't be retained by the LLM, and it won't be able to reason over it. We are pretty much technically limited — the context windows keep increasing, but it's still somewhat of a limitation on, let's say, the price side and on other levels too. That's one thing. The second thing is, let's say we try to chain some LLM prompts. We see now this chain of thought where we kinda call the LLM multiple times and pass some information along the line. In these cases, the problem is that the most recent information is often not prioritized compared to the initial things that we gave to the LLM, which can be interpreted as some type of forgetting.
These systems tend to just prioritize what's newer and respond to that rather than the earlier information, because of the size of the context window and because, let's say, the prioritization can't really be put in place.
[00:04:37] Tobias Macey:
Another aspect of LLM systems that adds an additional layer of complexity from the initial playground experience that we have been working our way through over the past couple of years is the idea of going from single turn interactions to multi turn interactions where you have to have a continual context window and maintain history of the conversation for the LLM to behave in the manner that is desired. And I'm curious how those concepts of memory and forgetting become more complicated or some of the additional considerations that need to be factored in when thinking about those in the context of these multi turn systems?
[00:05:20] Vasilije Markovich:
So it again goes back to the context. Let's say we have these multi-turn interactions as a set of agents that are working on the same problem and exchanging information; then it all becomes about the management of their shared context and how we are actually dealing with that. What people usually do is take this chain-of-thought type of system as if it were different LLMs. But effectively, it's a sequence of calls to 1 LLM, which is essentially a single transformer with some ins and outs. So you're effectively talking to the same thing with different calls. But if you don't really manage, let's say, the context and the information that's being processed or managed across these calls, then it's really hard to move forward. Right? Because then you're effectively talking to the same system. Although it might seem to you that the system is different, intuitively, it's just the same API.
[00:06:17] Tobias Macey:
Another piece of that memory equation and the ways that we're managing that in the rag stack that has been evolving and becoming commonplace in the context of LLM applications is the idea of hierarchical memory of there are certain pieces of information that you need to maintain for immediate retrieval given the context of the interaction that's happening right now, but there are other aspects of memory and information that need to be maintained over longer time frames and maybe only infrequently accessed. And I'm wondering what that aspect of hierarchical and evolving memory means in the context of LLMs and some of the ways that the current application architecture is not up to the task of being able to support those more nuanced elements of memory and retrieval?
[00:07:12] Vasilije Markovich:
Yeah. That's a great question. So when we talk about what memory currently is for LLMs — maybe let's go back to the basics. We have in context what we copied to the LLM, and we say this is your memory. As we said, it's limited, we forget. So what people started doing is like, okay, I'll just dump everything inside of a vector database. As everyone knows in the data engineering world, dumping something somewhere usually doesn't end up well. You know, we've seen all these S3 buckets full of things that were long forgotten 10 years ago and are slowly mutating into something else. In this context, the vector story is the same. Right? You're just adding stuff, you're retrieving things based on similarity, but you have no control over what you're getting, the metadata management, the embeddings and indexing. If you change some type of thing, you need to re-index everything, which is slow and expensive.
It feels like we are in the days before SQL was even a thing in the online world, before relational databases even. So what people do is, okay, I need bigger memory, I leave it in the vector store. Then we saw that doesn't really lead past some 70% accuracy. The next step was, okay, we need to actually map out the knowledge somehow and create some type of structure. GraphRAG is the first, let's say, more serious attempt at that; there were others. And that's, let's say, the direction we are taking as of now — of course, this will evolve. And then with that, we have an external system. You know, like in the 18th century you had this nature versus nurture debate — who's gonna influence what? Is it gonna be the embeddings, or is it the external context that's gonna kinda motivate our thinking and reasoning? In this context, we say both, but we try to outsource this semantic layer outside of the embeddings — so not to actually store everything in the vector store, but to have a management system outside of it. And that is, in our view, gonna allow for this evolving context, because what these currently mostly are is dictionaries with some predefined set of values that are not even data contracts. Right? They are just, you know, a dump of some JSON files that for a particular use case can evolve, and you can add things, and then you can forget about that. I think that agent paper, on agents that forget, was pretty much that. Right? And although it looked impressive and it showed some capabilities, it simply fails if you really want to do something in production, because of all the issues that are gonna come along with that. So for us, effectively, this is one of the things we're really thinking deeply about. And we're thinking deeply about, like, hey, can we bring the best data engineering practices into this world and, like, structure the data properly, manage the data properly? Because based on that, we'll get better output. Right? And this is gonna allow for this evolutionary nature of the system, where we can dynamically overwrite and create nodes and entities. But if we don't solve the data engineering puzzle, we can't go to the multi-agent networks and what everyone else is thinking about. And I'm seeing a lot of talk there, but not much work on the data engineering side of things.
[00:10:11] Tobias Macey:
Another interesting aspect of the way that you presented what you're building right now is this idea of semantic memory, where GraphRAG is intended to bring semantics to those contextual embeddings. But I'm wondering if you can talk a little bit more to that aspect of semantic memory as opposed to just a blob of data, and some of the ways that that maps into some of the cognitive science debate, some of the ways that the LLMs are just these stochastic parrots, and the work that's also being done in the space of cognitive AI to try and bring the AI ecosystem into a more nuanced direction versus the brute force approach that has generated the current set of LLM models?
[00:11:00] Vasilije Markovich:
Yeah. Sure. So let me give it a stab. In our understanding, the semantic layer — and we started with, let's say, creating this type of semantic layer based on human cognition, because we thought that's gonna be the best approximation of what LLMs should do. They should think like humans, because we have no better reference right now. So we took these data models from Atkinson and Shiffrin's, you know, 1969 papers on, like, long-term, short-term memory, episodic, semantic — episodic being something that happened to you relatively recently, an episode in life, and semantic representing knowledge of facts, like what's an apple, what's a chair, and things like that. So we figured, how about we create these types of memory domains, and then we load the data into these memory domains and have them pretty much serve as a semantic memory layer. There was a paper from Princeton last year that pretty much talked about the same thing and proposed the same thing — CoALA, it's called. And we were a bit ahead of them by 2 weeks. I think the first time I can say that is probably also the last time, in terms of academic papers. But what we got to understand is that even by adding and extracting this decomposition and managing the storage on different levels, we already get an improvement in performance compared to the naive LLM calls. And this, coupled with a lot of research that's been happening in the cognitive sciences, especially research from, like, 30, 40 years ago that no one's reading in the AI world, thinking it's not relevant.
We pretty much have a good opportunity now to implement some basic mechanisms, like forgetting and attention mechanisms, on these, let's say, GraphRAG elements. We can quantify and increase the size of the relationships and the nodes, and also understand, based on the interaction with certain nodes, types of graphs, and elements, what is important and what's not. And if you talk about psycholinguistics — you know, we had all these ideas like first in, first out, last in, first out, like, 40 years ago. And these models have been abandoned because we came to, like, more complex understandings. We took, like, this neural network approach back to psychology and back and forth. Right? But I think a lot of ideas there are still pretty valid, and as algorithms we can start applying them now in terms of how these LLMs should think, and we should have them as tools to see what works. Right? That, coupled with a good set of evals, is gonna get us much closer, considering that things like LLMs are plateauing now in terms of the performance and the outputs. And that's been, let's say, a trend in the last couple of months. So this is our thinking, and we are trying to use a lot of, let's say, traditional approaches there to just kinda push the LLMs a bit forward with a new set of tools that we could use. The quality of that model failed for us, but it showed us a way that we could kinda do more of that. We focused on multilayered networks of language, took some inspiration from these types of approaches, and are currently also, yeah, investigating a few new things that are gonna be announced.
[00:14:00] Tobias Macey:
And the first era of these large language model applications was very simple: let's put an API on top of the inference engine and, whatever goes in, generate some sort of output. Now we're building some more sophisticated architectural patterns, RAG being the most dominant one. So that requires the addition of these vector stores, some sort of reranking model to figure out which pieces of context we actually want to bring in. And I'm wondering, for the case of bringing in this more nuanced and detail-oriented semantic memory layer, what are some of the additional system components, architectural components, that are necessary to be able to support that functionality, and some of the ways that that either supplements or replaces elements of the RAG stack?
[00:14:52] Vasilije Markovich:
Yeah, that's a good question. So if we take the current RAG stack, it's, you know — I have a question, then I'm going to, like, 6, 7 services that are either chunking, loading, processing, retrieving the data, re-ranking it, and then giving you an output. So already, if you have 6, 7 services, you know you have a problem. Right? Ideally you shouldn't have 6, 7; you should have, like, 1 to 2 max. And in our case, what we thought about is, like, hey, okay, we can't do all of this work ourselves. Right? That's what we see with LangChain, LlamaIndex, and the others: they're trying to do everything from ingestion, data management, and then GraphRAG, and then retrieval, and have everything across the board. So our thought was, let's create a new type of data store that is gonna be some type of short-term memory — and that's what we are hearing from user interviews with others. Right? They wanna use Cognee as short-term memory, store things there, have, let's say, something warm that they can retrieve, and reduce the number of LLM calls. Ideally, we would have something like: hey, we have an ingestion part, which is different for every company; they can use off-the-shelf things, they can do Haystack, whatever is there. They load the data. We have our ingestion tooling that, using dlt, supports 30-plus data sources, and everything is pretty much done in a way that it can be managed and parallelized and operationalized in production. Then we load the data into this data store, and then we have different types of retrieval patterns, and we pretty much automatically detect and store and process the data in the next iteration — ideally, that's what we're going to launch. And then people can use us in this new stack as the short-term memory, and then they can fine-tune the LLM afterwards and use that as, let's say, a long-term memory, or use them in combination.
That's how I'm imagining the stack is going to evolve, because as the LLMs get better and as we, yeah, let's say, improve the short-term memory, we won't need all these rerankers or chunkers. They're just gonna be a part of, let's say, maybe our system, maybe some other system, but, like, invisible. You know, you don't need to tweak this manually. And I think avoiding this is gonna be a huge thing if we can do it.
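As a rough sketch of the ingestion leg described above, this is roughly what loading raw records with dlt into a local store could look like before any memory or graph processing happens; the destination, dataset, and table names here are placeholders rather than anything Cognee prescribes.

```python
# Hypothetical ingestion example using dlt; record contents are invented.
import dlt

documents = [
    {"id": 1, "source": "support_ticket", "text": "Customer cannot reset password."},
    {"id": 2, "source": "wiki", "text": "Password resets are handled by the auth service."},
]

pipeline = dlt.pipeline(
    pipeline_name="memory_ingest",
    destination="duckdb",        # any supported destination would work here
    dataset_name="raw_memory",
)
load_info = pipeline.run(documents, table_name="documents")
print(load_info)
```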
[00:17:06] Tobias Macey:
And now, digging into Cognee itself, can you describe a bit more about how it's implemented and some of the ways that the initial concepts have shifted and evolved in terms of scope and capability as you've gone through the journey of building it?
[00:17:16] Vasilije Markovich:
Sure. I'll start from how I got started, how it evolved slowly, and then where it is now, because otherwise it's gonna be hard to follow, I think. So I started with Cognee after building 2 failed B2C apps last year. So I realized I have no skills in building B2C apps. I built some tools — a recommender system, an image generator app. And as I was doing that with Langchain and other tools that are available, I saw that there is an issue in how to actually manage this context. Vector stores weren't giving me back the user's information; I was hoping I was just gonna get it enriched on the fly. LLM calls were taking too long. There were a lot of issues.
And, yeah, so I failed with the 2 apps, and I'm thinking, like, hey, well, I know something about data engineering. I've been doing this stuff in psychology, some cognitive science courses, and I'm seeing some patterns I could implement. So I tried posting on Twitter and kinda creating a few prototypes, to see, you know, what people would say and whether it could be used. That led to a lot of activity and success, and I honestly didn't expect that — I had, like, 38 followers on Twitter or something like that. So it wasn't really something I knew how to do or had those skills. And that led to an introduction to one guy I know from London who had Keypad, a WhatsApp chatbot with around 20,000 active users. And there I had a chance to deploy something to production. Right? Because he was looking for a better way to manage the long-term memory, and he needed something better than a vector store. Pinecone didn't work for him; he couldn't just make it work, and people were complaining.
So for the first iteration of the product, we took this Atkinson-Shiffrin model of long-term memory, short-term memory, working memory — that kind of paper. I built some Python scripts and wrappers around Neo4j, I vectorized everything with Weaviate, I built, let's say, some interface for them, and hard-coded these models. This was an ugly piece of code that, if you go to GitHub, you're not gonna appreciate much, but we deployed it. We deployed it on AWS; it was an endpoint, scalable. We could integrate it with Keypad, and we did. Then we saw some activity: people were creating their own, let's say, memory notes and storing things in the memory and retrieving them.
And as we were progressing with that, we saw there are a lot more opportunities — what they call agentic RAG these days — which is, like, we could create classifiers that could just pick certain types of memories, retrieve certain subgroups of information. And we did that as a second iteration, with a tool that was supposed to help architects find the relevant laws in architecture to apply. If you want to know what the size of the stairs in an apartment building should be, we could find the relevant law, extract the information, give that to the LLM, and then just kind of format it. This worked quite well. And as I was doing that, I saw that the architecture of the tool is not really that ideal. So Boris, my cofounder, joined me at that time, and we were like, okay, we need something more simple, and we need something people can actually use — because people were interested, but they were really having a hard time dealing with our static model and dealing with our code, right, which is fair. So we refactored everything to become a Python library where, in the beginning, we had 3 elements. One was add, so you could ingest any type of data from these, let's say, 30 data sources. And on the ingestion, you would have the JSON normalization, merging IDs, extraction — all of the fun data engineering stuff that you don't wanna think about if you just come from the JavaScript world or, you know, other areas where you haven't really explored that. But we do that for you. We put it all in the metastore, associate all the IDs, and you're pretty much ready. Right? The second piece was what we call cognify. Cognify is this term that in psychology is used to kinda mean understanding and conceptualizing your context — it's effectively how you understand the things around you and how you put them in your brain. We took that, and we called the pipeline the cognify pipeline. And we did a lot of operations: created summaries, loaded chunks, created some ontologies.
A lot of things that ended up with us being able to search for, you know, nodes that had certain names or certain relationships, or the embeddings themselves — whatever we need. And that was, let's say, the first version of the pipeline. As we did that, we realized that our assumption of what cognify should do is a bit too strong, because people need to do different things. So we refactored the library again, and now we have everything in the library being a set of tasks that can become a pipeline. Similar to what you would have in the Airflow world with DAGs, we have something like that. So you can inherit from one DAG to another, pass the information along.
You can chain them together. You can create pipelines that call other pipelines. So all of the fun stuff that would actually let someone use this — we try to implement that, and we try to implement it with generators and pretty much make it async. We still need to work a bit on parallelization. But effectively, the whole process now is: we ingest the data you give us, with our own ingestion or dlt. We load the data, manage it. We create summaries, ontologies. You can add your custom tasks — they're just Python functions now; it's very simple. And then we have the search methods, and we have these projections of the graph, where we take the whole skeleton of the graph into memory, where you can operate on that and deal with that. We released that last week. And then we plan to build more complex retrieval methods and implement all these default algorithms that are there in the graph field, that have been known for a while, and then we'll see where that takes us. Right? So that's pretty much that.
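Based purely on the flow described here (add, then cognify, then search), a minimal sketch of what using the library might look like; the function names follow the conversation rather than the current docs, and exact signatures may differ between versions.

```python
# Sketch of the add -> cognify -> search flow; names and signatures are assumptions.
import asyncio

import cognee


async def main() -> None:
    # 1. Ingest: normalize and register the raw data
    await cognee.add("Apartment staircases must be at least 1.2 meters wide.")

    # 2. Cognify: build summaries, chunks, and graph/ontology structure
    await cognee.cognify()

    # 3. Search: retrieve over the generated memory layer
    results = await cognee.search("How wide do the stairs need to be?")
    for result in results:
        print(result)


asyncio.run(main())
```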
[00:22:53] Tobias Macey:
And in terms of the data structures and ontologies, I'm wondering if you can talk to some of the ways that you are thinking about how that maps into the I'm gonna use the term conceptual even though it's a misnomer in the context of LLMs, but how that maps to the conceptual structures that the LLMs are anticipating to be able to help to manage the usage of the context window most efficiently and, in particular, some of the ways that you're able to use those data structures and ontologies to map into the problem domain that the application is trying to address?
[00:23:33] Vasilije Markovich:
Yeah, sure. For us, ontology is a key term here: understanding how we actually structure and correlate different types of data points inside of one system, especially when we talk about millions of documents coming from, you know, finance, legal, HR. If you just imagine dumping a whole company's data onto something like this, you would expect these types of connections to get formed, or to get transformed by a human, which would help the LLM — or, you know, to work on some type of a preset that it's already given. And we think this is the key piece, because there is a lot of context that humans bring into play, but there is also a lot of context that we can kinda create, and a lot of ontology that we can generate, using deterministic methods but also using, let's say, LLM-powered ones.
For us, currently, one of the major goals is to be able to create these ontologies on the fly from the documents, by treating each of the Cognee pipelines as ML pipelines. So we would, you know, create train and test sets, which would be question pairs for different issues. And then we would run the pipeline until it can pass on the train set, and test it on the test set. That way, we assume that the semantic structure and the ontology of the initial datasets would be, let's say, optimal for the question pairs that were given in the question sets. And we could avoid, like, fine-tuning and dealing with that, by just using all the tools at our disposal to kind of train. So this is the evolutionary aspect of the pipeline. And this is something that we think is going to become dynamic in the future: you're going to have a dynamic ontology, because as soon as new data comes into the system, the whole system changes. You now need to handle deletes; you need to deal with the data in such a way that you can continuously manage it while it's interacting with, let's say, thousands of these agents in the future — because we assume that you're gonna have many, many calls to the LLMs, millions probably, in a couple of years, as the cost is going down; the 10x drop over the past year is probably gonna become 100x. And we assume that this is gonna be cheaper than the CPU at some point, right — or at least pricier, but something that, you know, we can afford. So that's where we see it, and that's where we think the context windows are going: they become effectively unlimited with these bigger ontologies, and with the subset of the ontology we can actually get for the problem we're trying to solve, we can then get to a good state of actually making better decisions. So still a work in progress, but that's the general idea.
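A hypothetical sketch of the "pipeline as ML pipeline" idea: rebuild the memory layer until it answers a held-out set of question pairs well enough. Here build_memory, answer, and score stand in for real Cognee and eval components; none of these names come from the library.

```python
# Hypothetical self-tuning loop over a memory-building pipeline.
from typing import Callable


def tune_pipeline(
    documents: list[str],
    test_pairs: list[tuple[str, str]],          # (question, expected answer)
    build_memory: Callable[[list[str], dict], object],
    answer: Callable[[object, str], str],
    score: Callable[[str, str], float],         # 0.0..1.0 correctness score
    target: float = 0.8,
    max_rounds: int = 5,
) -> object:
    config = {"chunk_size": 512, "ontology_depth": 2}
    memory = build_memory(documents, config)
    for _ in range(max_rounds):
        accuracy = sum(
            score(answer(memory, q), expected) for q, expected in test_pairs
        ) / len(test_pairs)
        if accuracy >= target:
            break
        # naive "self-improvement": adjust the graph-building knobs and rebuild
        config["ontology_depth"] += 1
        memory = build_memory(documents, config)
    return memory
```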
[00:26:11] Tobias Macey:
When dealing with graph systems and ontologies, that brings us into the knowledge graph arena, which has been around for a while. And there has been a lot of tooling built up around it, so I imagine that it's gotten easier to manage, but the creation and curation and evolution of knowledge graphs is in and of itself a fairly substantial undertaking. And I'm wondering how you're managing some of that in Cognee without having to surface it all to the end user and add that as an additional conceptual challenge to how they're structuring their overall system.
[00:26:49] Vasilije Markovich:
Yeah, it's a good question. I would say there are, like, 2 or 3 things that would appear to the end user. One is the performance. So if I use Neo4j, I'm gonna run into issues, especially at some big scale. So, yeah, okay, we need to find something better. So we just did an integration with FalkorDB, and we are trying to talk to other people who are building, let's say, these types of databases now — they are actually popping up; there are some interesting people in San Francisco I met recently. And I think the, let's say, database stack is going to evolve significantly to support this. I hope we see something in Postgres soon, you know, like pgvector, because that's going to make everything else obsolete. But, yeah, effectively that's one thing. The second thing is the whole graph management, the interactions and the generation — that's on us. Right? So we have adapters for most of the graph databases. We try to abstract everything inside of the system as a data point, which would be represented as a Pydantic model that we can just pass through the system and not really be dependent on the different implementation types — pretty much give the data point, or the Pydantic model, to the adapters, and then have, let's say, the difficulty of the implementation lie on that side, to abstract out these parts of the system. That took a lot of iteration to get to that state. We hope it works. We'll see.
But it's not an easy problem. So, on that side, how we are solving, let's say, the graph management and iteration is like this. As I mentioned, we project the graph in memory. We are playing with that: we can project n graphs in memory, and subsets, and then even do simulations on those. So there is a lot of possibility that we are kinda introducing into the architecture now that makes us a bit more than an ORM for graphs and vector stores. And secondly, what I think is key is who works with you. Right? So one of our engineers has a PhD just working on graphs — he's been doing that for, like, 10 years. So he's building and bringing this knowledge, because I'm definitely not a graph expert, and I'm learning from him; I'm running to catch up with the guys. That's always a good thing to have in a startup. We will definitely see how that evolves. So far we have a certain set of retrieval strategies and a certain set of ways to generate graphs that we want to implement in the next couple of months. And after that, we'll see — my assumption is that we'll have to bring this down one level lower, so something closer to the metal.
But once we get Cognee used more in production, then we'll definitely talk about, let's say, the issues that Python is going to bring us, because it's definitely not going to be the optimal tool for this job down the line.
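A rough illustration of the in-memory graph projection mentioned above, using NetworkX; the node and edge names are invented, and in practice the data would come from a graph store such as Neo4j or FalkorDB through an adapter.

```python
# Project a subgraph into memory and run a standard graph algorithm on it.
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("document:policy", "entity:password_reset", relation="mentions")
graph.add_edge("entity:password_reset", "service:auth", relation="handled_by")

# Take the neighborhood of one node as an in-memory projection
nodes = nx.descendants(graph, "document:policy") | {"document:policy"}
subgraph = graph.subgraph(nodes)

print(nx.shortest_path(subgraph, "document:policy", "service:auth"))
```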
[00:29:29] Tobias Macey:
For people who are interested in incorporating Cognee into their overall system design, I'm wondering if you can talk to some of the steps that are involved in integrating it and taking advantage of it, and in particular, in the case of a greenfield system, how it changes the way that you want to approach the overall design of the application architecture?
[00:29:54] Vasilije Markovich:
Sure. So integrating it is relatively simple. We have a Python library, and we have a Docker image with an API. You can find everything documented in our GitHub repo; it's open source, so you can just download it yourself. Effectively, the way we deployed it and the way we included it in the system — there is one way which we did, which is ECS. We plan to add SNS and SQS — that's a queue that would just trigger Lambdas running Cognee — and we use ECS for now, since we don't have too much in the way of big loads. And you could pretty much have some S3 buckets sending the events to SNS/SQS on AWS, for example, and then triggering Lambdas with Cognee. That's a relatively simple architecture to do.
We also built Helm charts for deployment on Kubernetes, so you could also deploy it as a service, or as a set of services, there. We have this managed metastore, so what you would also need is a Postgres store that you deploy on your side, and that will allow it to actually run parallel instances. And you will be able to kind of trigger it and keep the state so you don't end up redoing things. And the final things you would need would be the vector stores and the graph stores. You can deploy these yourselves; we support most of them. So Weaviate, Qdrant, LanceDB.
So that's supported. We have Neo4j and FalkorDB now for the graph ones, and then NetworkX if you want to experiment with things. And I think, in terms of how the architecture will change for anyone who's trying to use Cognee: instead of just having something to load the data from some type of LLM call into a vector store, you would add Cognee also as a destination, and then you could just fetch things from Cognee too. You don't have to do any type of destructive modification on your existing system. The idea is, like, hey, let's just plug in this memory and get the feedback back. We also now have evals that are going to be merged very soon, for LLM-as-judge but also for traditional deterministic ones. And then you can just evaluate the outputs on top of that and see what you get and how good it is. So that's kind of the short version.
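A hypothetical sketch of the queue-driven deployment described in this answer: S3 events arrive through SQS, and each message triggers a small Cognee run inside a Lambda. The event shapes, bucket name, and Cognee calls are assumptions rather than a reference implementation (with SNS in front of SQS there would be one more envelope to unwrap).

```python
# Hypothetical AWS Lambda handler: S3 -> SQS -> Lambda -> Cognee.
import asyncio
import json

import cognee


def handler(event, context):
    for record in event.get("Records", []):
        body = json.loads(record["body"])               # SQS message body
        s3_key = body["Records"][0]["s3"]["object"]["key"]
        asyncio.run(process(s3_key))
    return {"status": "ok"}


async def process(s3_key: str) -> None:
    # In a real deployment you would download the object from S3 first;
    # the bucket name and the cognee calls below are placeholders.
    await cognee.add(f"s3://my-bucket/{s3_key}")
    await cognee.cognify()
```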
[00:32:10] Tobias Macey:
We touched on the multi-agent systems a little bit earlier. And so for people who are using Cognee, how does that factor into that multi-agent or multi-LLM system architecture, where maybe you have different LLMs that are focused on different problem areas — maybe you've got a larger model for general language usage, maybe you've got a code-specific model for doing code generation, or you've got an image generation model — and just some of the ways that that impacts the ways that you want to apply Cognee, or how you think about either the segmentation or schematization of the underlying knowledge graph for being able to appropriately retrieve the right semantic constructs for the different LLMs and the different use cases?
[00:32:59] Vasilije Markovich:
So for generating different, let's say, semantic layers, you can use Pydantic. We pretty much allow you to just generate a Pydantic schema that will be represented in the graph in the way you need it. If you have, let's say, different types of data that you have a custom way of wanting to represent, you can do it with Pydantic. We can also help with having an LLM-generated ontology, although I would not recommend that for production — it's going to be dangerous. Secondly, in terms of ingestion of the data, we support image, audio, text; pretty much all of that is going to be converted and loaded. Lastly, I would say we have users, and these users have access to write to and read from certain parts of the graph. Let's say you have different microservices: each of them could be an individual user if you don't really want to model things but you just want to keep it isolated. You can then still have it all in the same space by just loading to the different parts of the subgraph that you need — that's the shared approach.
I think there are many more ways we can isolate this. We talked about, like, having, you know, different graph and vector store combinations for different use cases, to isolate the data completely down the line. So you could just spin up instances that effectively then are loading things across just one vertical. But so far we haven't implemented that. So for now, yeah, the answer is users, and the Pydantic models you can use to define things and then point towards when you're actually loading the data. But, yeah, we are still thinking about this. Any feedback would be appreciated on this side.
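A hedged sketch of the Pydantic-defined schema idea: the models below are invented for illustration, and Cognee's actual base classes and conventions for turning nested models into nodes and edges may differ.

```python
# Illustrative schema: nested Pydantic models standing in for graph structure.
from pydantic import BaseModel


class Entity(BaseModel):
    name: str
    type: str


class SupportDocument(BaseModel):
    title: str
    body: str
    mentions: list[Entity] = []      # nested models would become related nodes/edges


doc = SupportDocument(
    title="Password reset policy",
    body="Resets are handled by the auth service within 15 minutes.",
    mentions=[Entity(name="auth service", type="service")],
)
print(doc.model_dump())              # Pydantic v2; use .dict() on v1
```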
[00:34:35] Tobias Macey:
Another element to the idea of memory is the use case for personalization for the end user, where you want to be able to store useful pieces of information, semantic context, pieces of data that you have about a user — whether that's their preferences or their demographics, etcetera. And I know that there are other projects out there; the one in particular that comes to mind is Mem0. And I'm wondering how Cognee relates to some of those use cases — whether you're thinking about some of that personalization aspect, or if Cognee is more for managing the data in the underlying problem domain and the data that you would typically put into that RAG stack?
[00:35:22] Vasilije Markovich:
Yeah, that's a good question. So, you know, when I was pitching to investors — I think we just secured a 1,500,000 funding round, which went public today; so, yeah, I don't know when this is gonna be shown, but we just did that — one of the base pitches there was, like, hey, the problem that vector stores and these external data stores are solving, that the LLMs can't just do by themselves, is personalization. Right? So all of this work that we do is effectively to personalize the data that is fed to the LLMs, so the LLMs can give us more nuanced answers to what we need. I think Mem0 and the others like Letta — and there are another couple of, let's say, projects — are working a lot on, I would call it, almost a B2C approach, where they let the users, or chatbots, really have these personalized interactions. We spent time with our first bigger design partner, Dynamo, where we built a support chat system for their human agents in, let's say, the gaming industry, where we did the same thing. And the thing is, I don't think you really need the graph right there, because it's not that much data, and it's pretty much just, let's say, some classifiers, some extraction of, you know, user personality and some thoughts. So I think that, on the, let's say, individual user response level, is not a difficult problem. Where it becomes a difficult problem is when you're trying to reestablish this context across, like, a very complicated domain area, or, let's say, a more complicated set of constraints or business rules that are applied to personalize the data. Because if John is friendly or John is angry, the LLM is gonna be already pretrained for that. But if John has a user session defined until midnight, but then every second week it's until midnight or 5, because someone in analytics made a mistake 6 years ago when he was defining it, and now the reporting for investors works that way — you know, these types of more complex cases, and matching and managing those, and also, like, pretty much text-to-SQL, right, and all of this, let's say, context that we could build — this is where I think the difference in the approach comes. We're trying to be a bit more general than just focusing on what could give us quick money, more of, like, hey, can we actually build something more sustainable, more long-term. Like, we're focusing on indexing, right, and indexing for us is personalization.
For some others, it's personalization really in the domain of personalization for, let's say, support agents and things like that, which is easier to sell, but I guess it's more difficult to defend. So I think these guys at Mem0 — I've followed them for a while. I've seen what they did with RAG in 3 lines of code. I think they have, like, a really good, let's say, approach to open source, to, you know, community building. They are, like, innovating fast. They're pretty much doing similar things to what we do. But the difference is — I think, that's my impression, maybe something changed — they have their code behind some cloud system. They have an API; they will do everything; the magic will be done for you. But I want to be a part of that magic, and I assume the developers of the future want to control how the magic happens. Because as soon as we start giving, like, black boxes, it's, you know, hard to trust the black box. And I think that's pretty much the philosophical difference.
[00:38:41] Tobias Macey:
And as you continue to iterate on the design and use cases of Cognee and try to navigate the constant shifts in the LLM and AI ecosystem, what are some of the unknowns that you're still navigating?
[00:38:56] Vasilije Markovich:
So many. I mean, I don't know — it's a startup. I think everything is unknown, except the knowns, which you sometimes want to be unknown because it's bad news. I think in this context, we are facing a lot of questions on what the data stack of the future will be. Right? So what will we need to actually use the LLMs in production together with the more traditional systems, and what will these interfaces look like? I think guys like DuckDB and others are really doing a good job of kind of moving the needle forward in terms of, like, what's possible and what can be done. But so far, I'm still not sure where, let's say, the whole stack is gonna go. And there are so many interesting things, and things that change really rapidly, which is not that common in the data world — you know, usually you can see it, like, a mile ahead, you know, something's gonna come. But I think now it's a different time. It actually feels like startup time, you know, after, like, 10 years of just slowly building the systems. The second thing that we don't know, or can't say right now, is what the LLMs are going to be like in, like, 3 to 5 years. Are they going to make this whole work unnecessary because they're going to just be so general and good, or not?
I don't know. Is RAG even going to be a thing in a couple of years, or are we going to even need it? My core assumption is that people will always need to process data and ingest it into some type of system, because I haven't seen a system that works without data so far. And, I don't know, ideally that's still gonna be a need. But whether our tool is gonna answer that need, I'm not too sure. And then, lastly, I think the whole, let's say, research space is super interesting but super noisy at the same time. So we are reading a lot of papers, we are, like, checking what people do. And, you know, I have people on the team that are much smarter than I am and have much bigger PhDs from better universities — I have no PhD, by the way, far from it. And I think that even they are, like, you know, having a hard time distinguishing what makes sense and what doesn't, and they have spent time researching for a number of years. And I think this noise, plus all the possibilities of what we could do, and kinda narrowing down what we exactly want to do without a really clear roadmap, is always a difficult thing — but we do have a couple of ideas there. But I think the space is just very large, and you need to be very strict with yourself. So I think there's always a chance we can make some big mistake there and kinda go down the wrong path. But hopefully — that's why we're trying to work with design partners and more closely with people, so we can, you know, solve these problems as we go. So, yeah, there are a few things. There are many more operational risks and stuff like that, but I wouldn't go into those too much.
[00:41:34] Tobias Macey:
And beyond the core use case and the core focus of Cognee being a tool that aims to improve the overall capabilities of LLM systems — given that you are focused on that data management, data transformation, graph generation — I'm wondering what are some of the other potential applications of Cognee that you either have seen or have tested out, or that you foresee as being possible either now or in the future.
[00:42:06] Vasilije Markovich:
Yeah, that's a good question. I've seen people come, like, book a demo and say, hey, we have a case in agriculture where, you know, if you plant some seed and there was a pesticide applied 3 weeks before and something else happened — so, complex rules. So they need a graph for that if they want to add some LLM feature. I would never think about agriculture. You know, I come from a village; I did run away from that for a reason; like, I would not go back to that. But someone has that type of a problem, and then they recognize that, hey, okay, we could do this with graphs. So we're seeing cases from, you know, basic ones — chatbots, nothing too surprising, personalization of them. We've seen, like, these cases with architecture.
We've seen cases like a construction company that is trying to put on every little brick, you know, some type of a QR code that can get scanned by the workers, so you would know how far the building progress has gotten. They have an issue that, like, you know, building managers want to actually communicate with their system, so they would want to ingest all their data, structure it, load it into their system so they can update the progress reports on the fly. So very different use cases, very different needs. I think a lot of, let's say, manual work that people have been doing now for, like, 30 years is gonna get automated, because they don't want to do it, or, you know, there is really no need anymore. And all these work verticals that we are seeing — it's super exciting. You know? Logistics: there was a lot of, let's say, movement across those chains, how things move. There was an issue with something like a Starbucks menu. Right? So that was one interesting case where, if you have, like, a cappuccino or something, it's gonna have some ingredients. Then there is already what ingredients, in what proportion, go inside of a cappuccino, and what these ingredients contain.
Then, what if you change one ingredient and you want to actually have an LLM suggest to you something that has this or that ingredient? You pretty much need some semantics, and you need to update that — you know, the flour changed in some of these recipes — because it might have some allergens or something like that. So I think, yeah, we'll see. But so far, I'm pretty positive that there are a lot of use cases where graphs will be able to respond to all of those — or some next generation of them, which we hope to kind of evolve towards. Let's see. I think there is already, like, a good number of things we can do and work on.
[00:44:28] Tobias Macey:
And in your experience of building Cognee, exploring the space, establishing a business around it, and figuring out how it slots into the overarching LLM, generative AI ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:44:47] Vasilije Markovich:
So, where do I start? Don't hire friends in your startup. I don't even like that. That's a good one. You know, you need to detach yourself emotionally from your business; I think that's one of the biggest lessons I had. I put in all my savings, tried to build this, and now got some funding — I think we were also lucky to a degree. Right? But effectively, a lot of the lessons are that the world works in certain ways that you don't imagine when you're working in the data field, where you have reports and analytics and things that are very material to deal with; the chaos outside, and managing that, becomes somewhat of a difficulty. So I think there were a lot of learnings on the side of operating the business, and a lot of learnings about assuming things for users. I think, like, everyone's saying, talk to users more, but not too many people actually do it. And I think, for us, it was like, hey, you know, we tend to have a lot of assumptions, and we don't even realize how many. So that changed a lot for us. We're now doing user interviews weekly and talking to people; even if the product is relatively new and, you know, not used that much, we are still kinda trying to get as much feedback as we can. And then, lastly, I think the challenge is often enough the amount, right, of parallel things going on, where you can't really focus and solve one problem optimally.
But I think that's also a good, let's say, caveat, because you tend to then solve the problem to the degree it needs to be solved, and you don't over-optimize. Because, you know, I used to spend, like, 2, 3 days writing some fancy SQL somewhere to run in some scripts. And, you know, it was nice, but now it's like you just choose your battles, and you know how to choose them better. So I think that's pretty much it on the startup and business side. But yeah, I mean, it's a broad question, so I'm not sure.
[00:46:36] Tobias Macey:
And for people who are building generative AI systems and trying to figure out what is the appropriate means of storing and retrieving context for those applications, what are the cases where Cognee is the wrong choice?
[00:46:54] Vasilije Markovich:
Yeah. So if you have something where an LLM call, or a set of LLM calls, is gonna solve most of your issues and you don't have much data, you don't need Cognee. You don't need, you know, a complex system just because, well, someone told you graphs are cool. If you have a relatively easy use case where you're not that worried about accuracy, but you just want to have something generated, you also probably don't need Cognee. If you need structured outputs, you probably don't need Cognee; you can use the Instructor library or any of these, let's say, tools, so you can get, like, some data.
What else would you not need it for? I think for everything that's pretty much agent work right now, you can use it — but also, honestly, you probably won't need it unless you really have a lot of agents. So, you know, if you have routing in simple workflows where you can just pass things through in some type of JSON — I mean, maybe you first get to something working, and then if you wanna tweak it, you can add, like, a memory layer for it. But if you then have 700, yeah, you would need something; but it's, like, the same for more simple use cases also. I would say if you're experimenting, yeah, you can play around with it, but it's probably not gonna lead to a lot of results. If you don't have much data, or if your use cases are ones that simple structured outputs can solve, or you're just kinda trying to do some basic LLM calls, yeah, you're pretty good on your own.
[00:48:20] Tobias Macey:
And as you continue to build and iterate on the core product, the open source project, and just navigate this ecosystem, what are some of the things you have planned for the near to medium term?
[00:48:33] Vasilije Markovich:
That's a good question. So I think, so far, we are trying to work on Continue, which is an open source AI coding copilot like the one GitHub has, and we're trying to give it memory. We're trying to use FalkorDB as a new type of memory, kind of playing with a lot of, let's say, code and — how would you call it? Yeah, code GraphRAG, I think that would be the correct term. And then, once we finish with that, our plan is to have these self-improving pipelines that we can just kind of plug into anyone's data and see how they actually evolve. After that, I think we are mainly looking to work with a lot of people in the industry to bring this more into production, and to work on scaling it and managing it better.
And, there is gonna be a few interesting use cases that that we are now discussing with others that I think we can we can, solve. But in the next, let's say, a couple of months, you're gonna see better polarization, better support, more scalability, more tests, you know, everything that can make actually an open source software something you want to deploy on your side. And, what we will also do is, announce a couple of interesting things I still can't talk about due to some contractual issues. But we are trying also to move the needle in terms of the research a bit forward, and and some of that's gonna be public, some of that won't. But effectively, whatever we do, we'll try to to kind of keep the community updated with.
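As a rough illustration of the GraphRAG-style memory he describes for the copilot, the idea is to pull an entity's neighborhood out of a knowledge graph and prepend it to the prompt as context. The sketch below uses NetworkX with made-up nodes, relations, and helper functions; it is not Cognee's API or the actual Continue.dev integration:

```python
# Illustrative GraphRAG-style retrieval: find entities mentioned in the question,
# collect the facts attached to them in the graph, and feed those facts
# to the LLM as context. All names here are placeholders.
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("copilot", "memory layer", relation="uses")
graph.add_edge("memory layer", "graph database", relation="backed by")
graph.add_edge("memory layer", "summaries", relation="stores")


def extract_entities(question: str) -> list[str]:
    # Stand-in for an LLM- or rule-based entity extractor.
    return [node for node in graph.nodes if node in question.lower()]


def graph_context(question: str) -> str:
    # Turn each outgoing edge of a mentioned entity into a one-line fact.
    facts = []
    for entity in extract_entities(question):
        for src, dst, data in graph.out_edges(entity, data=True):
            facts.append(f"{src} --{data['relation']}--> {dst}")
    return "\n".join(facts)


question = "What does the memory layer store and what is it backed by?"
prompt = f"Context:\n{graph_context(question)}\n\nQuestion: {question}"
print(prompt)  # this prompt would then go to the LLM call
```

The point of the pattern is that retrieval walks explicit relationships instead of relying only on vector similarity, which is what makes the retrieved context auditable.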
[00:50:05] Tobias Macey:
Are there any other aspects of the Cognee project, the overall application of these memory semantics to the LLM application stack, or just the overall problem domain of generative AI applications and moving them into production, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:28] Vasilije Markovich:
I think this was a pretty good way to cover everything. You did a good job. So I couldn't think of anything else at this point. But, yeah, I'm happy to talk again in, like, a year and see where we got with this. But so far, I think this is as far as I can say.
[00:50:45] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:51:04] Vasilije Markovich:
Yeah. So I think the ChatGPT release left everyone trying to catch up a bit, and everyone also saw that there is a lot of money to be made. I know guys making $1,500,000 ARR with a teleprompter tool that helps you do interviews, which I'm not using here, by the way. It would be fun if I was. They are effectively monetizing so well that everyone's confused and shocked, especially because these monetization strategies and go-to-markets have been really well explored over the last decade, and people know what to do if they have product-market fit. So the tooling space is rapidly trying to catch up. There are some things that won't change. There was a great post from Yuri Makarov, a research scientist at Facebook. I've also been in touch with him a couple of times, and he tried to identify certain areas and whether they're gonna stay the same. I think the relational world and those tools are gonna pretty much stay as they are, because no one's gonna really move from something that works to a fancy new thing just because, and it takes years.
In terms of the vector stores, they will probably be replaced, in my opinion, by some open source tools, because the technology moat is not that big. We'll have agent frameworks that actually work. Some of the ones that exist now work to a degree, but it's still, let's say, debatable how much you can scale them. Data ingestion and unstructured-to-structured data, that's a huge use case that appeared last year due to Twitter and people figuring out how they can actually tweak function calling a bit, which no one was aware of. That's becoming an issue in its own right, and I think there are a lot of competitors there making moves. Then we will see new players in the database and data layer space, where you can have graphs, you can have vector stores, you can have combinations, and you might even have something else, we'll see. But I see a lot of bets on graphs right now; that might change within the year.
Then in terms of that stack, we'll have protocols that help us communicate between agent networks. No one's really working on that right now because no one really sees it as a relevant point to monetize on at this stage. I think that's going to come in a year or two. And then we'll probably also have a lot of automation on top of the existing tools, and we'll probably have some shared base layer between them, because it's just gonna be simpler to use Apache-something versus building everything on your own. What that's gonna be like, I don't know. So I would kinda group it that way, but I would just suggest: find Yuri Makarov on Google and his posts on the future of the AI stack. I think he does a pretty good job of identifying which areas can change.
[00:53:54] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Cognee, and helping to shed some light on the overall space of memory and how it is applied in the context of these generative AI systems. It's definitely a very interesting problem domain, and it's great to see you working to make it more tractable. So I appreciate all the time and energy you're putting in there, and I hope you enjoy the rest of your day.
[00:54:19] Vasilije Markovich:
Thank you. Thanks for having me.
[00:54:25] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
So, where do I start? Don't don't hire friends in your start up. I don't even like that. That's that's a good one. You know, you need to, you need to distract yourself emotionally from your business. I think that's that's one one of the biggest lessons I had. I put in all my savings, tried to build this, got now some funding. I think we were also lucky to a degree. Right? But effectively, a lot of lessons that are the world works in a certain ways that you don't imagine when you're working in the data field and then kind of, you know, when you have reports and analytics and and things that are very material to do, the the chaos outside in managing that, it becomes somewhat of a difficulty. So I think there was a lot of learnings on the side of of the operation of the business, a lot of learnings on assuming things for users. I think, like, everyone's saying, like, talk to users more, but not too many people actually do it. And I think, for us, it was like, hey. You know, we tend to have a lot of assumptions, and we don't even realize how many. So I think that was that changed a lot for us. We're now doing user interviews weekly, and talking to people as even if the product is relatively new and and, you know, not used that much, we are still kinda trying to to get as much feedback as we can. And then, lastly, I think it's the the the challenge is often enough the amount, right, of the parallel things going on where you can't really focus and solve one problem optimally.
But I think that's also a a good, let's say, caveat because you tend to then solve the problem to a degree it needs to be solved and you don't over optimize. Because, you know, I used to spend, like, 2, 3 days solving some, like, writing some fancy sequel somewhere to run on some scripts. And, you know, it was nice, but now, you know, it's like you just choose your battles and you know how to choose them better. So I think that's pretty much on the start of the business side. But yeah. I mean, it's it's a broad question, so I'm not sure.
[00:46:36] Tobias Macey:
And for people who are building generative AI systems and they're trying to figure out what is the appropriate means of storing and retrieving context for those applications, what are the cases where Cogni is the wrong choice?
[00:46:54] Vasilije Markovich:
Yeah. So if you have something where an LLM call is gonna solve most of your issues or a set of LLM calls and you don't have much data, you don't need Cogni. You don't need, you know, a complex system just because, well, someone told you graphs are cool. If you have relatively easy use case where you're not that worried about accuracy, but you just want to have something generated, you also don't probably need Cogni. If you need structured outputs, you probably don't need Cogni. You can use, Instructure, library or any of these, let's say, tools, so you can get, like, some data.
What else would you not need it? I think, everything that's pretty much agent work right now, you can use it. But also, honestly, you probably won't need it unless you have really a lot of agents. So, you know, if you have route in the simple workflows where you can just pass things through in some type of a JSON, I mean, maybe you first get to something working and then if you wanna tweak it, then then you can add, like, a memory layer for it. But if and if then you have 700, yeah, you would need something, but it's, like, for more simple use cases also. I would say if you're experimenting, yeah, you can play around with it, but it's probably not gonna lead to a lot of results. If you don't have much data or if your use cases are out of the simple structured outputs can solve them or you're just kinda trying to do some basic LLM calls, yeah, you you're pretty good on your own.
[00:48:20] Tobias Macey:
And as you continue to build and iterate on the core product, the open source project, and just navigate this ecosystem, what are some of the things you have planned for the near to medium term?
[00:48:33] Vasilije Markovich:
That's a that's a good question. So I think so far, we are trying to work on continue dot there, which is an open source AI copilot calling copilot like, GitHub, has, and we're trying to give it memory. We're trying to use PowerCore DB as a new type of a memory kind of playing with a lot of, let's say, code, and, how would you call it? Yeah, called graph rack. I think that would be the the the correct term. And then, once we finish with that, our plan is to have these self improving pipelines that we can just kind of plug into anyone's data and see how they actually evolve. After that, I think we are mainly looking to work with a lot of people in the industry to bring this more production, work on scaling it, managing it better.
And there are going to be a few interesting use cases that we're now discussing with others that I think we can solve. In the next couple of months you're going to see better parallelization, better support, more scalability, more tests, everything that makes open source software something you actually want to deploy on your side. We'll also announce a couple of interesting things I still can't talk about due to some contractual issues. We're trying to move the needle a bit on the research side as well; some of that is going to be public, some of it won't. But whatever we do, we'll try to keep the community updated.
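For readers who want a concrete picture of what plugging a memory layer into existing data can look like, here is a rough sketch based on Cognee's publicly documented Python interface; the exact function names, signatures, and search options vary across releases, so treat this as an assumption-laden outline rather than a definitive API reference.

```python
import asyncio

import cognee


async def main() -> None:
    # Ingest raw text (files, tables, or chat history would work similarly).
    await cognee.add("Continue.dev is an open source AI coding assistant.")

    # "Cognify": extract entities and relations and persist them into the
    # configured graph and vector backends, building the semantic memory.
    await cognee.cognify()

    # Query the memory; the positional query form shown here is an assumption
    # based on current docs and may differ in older versions.
    results = await cognee.search("What is Continue.dev?")
    for result in results:
        print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

The same add/cognify/search loop is what a coding-assistant integration would repeat as new code and conversations arrive, which is roughly the self-improving pipeline idea described above.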
[00:50:05] Tobias Macey:
Are there any other aspects of the Cognee project, the overall application of these memory semantics to the LLM application stack, or the broader problem domain of generative AI applications and moving them into production that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:28] Vasilije Markovich:
I think this was a pretty good way to cover everything. You did a good job, so I can't think of anything else at this point. But I'm happy to talk again in a year or so and see where we've gotten with this. So far, this is about as much as I can say.
[00:50:45] Tobias Macey:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gaps in the tooling, technology, or human training that's available for AI systems today.
[00:51:04] Vasilije Markovich:
Yeah. So I think the ChatGPT release left everyone trying to catch up, and everyone also saw that there is a lot of money to be made. I know of people making $1.5 million ARR with a teleprompter tool that helps you do interviews, which I'm not using here, by the way; it would be fun if I were. They're monetizing so well that everyone is confused and shocked, especially because these monetization and go-to-market strategies have been thoroughly explored over the last decade, and people know what to do once they have product-market fit. So the tooling space is rapidly trying to catch up. There are some things that won't change. There was a great post from Yuri Makarov, a research scientist at Facebook; I've also been in touch with him a couple of times, and he tried to identify which areas are going to stay the same. I think the relational world and those tools are pretty much going to stay as they are, because no one is going to move from something that works to a fancy new thing just because, and it takes years.
In terms of vector stores, they will probably be replaced, in my opinion, by some open source tools, because the underlying technology is not that big. We'll have agent frameworks that actually work; some of the ones that exist now work to a degree, but it's still debatable how far you can scale them. Data ingestion and unstructured-to-structured data is a huge use case that appeared last year, with people on Twitter figuring out how they can tweak function calling a bit, which no one had been aware of. That's becoming a problem space in its own right, and there are a lot of competitors there making moves. Then we'll see new players in the database and data layer space where you can have graphs, vector stores, combinations, maybe something else entirely; we'll see. Right now I see a lot of movement toward graphs, but that might change within the year.
Then, in terms of that stack, we'll have protocols that help us communicate between agent networks. No one is really working on that right now because no one sees it as a relevant point to monetize at this stage; I think that's going to come in a year or two. And then we'll probably also have a lot of automation on top of the existing tools, with some shared base layer between them, because it's going to be simpler to use an Apache-something than to build everything on your own. What that's going to look like, I don't know. So I would group it that way, but I would suggest looking up Yuri Makarov and his posts on the future of the AI stack; I think he does a pretty good job of identifying which areas are likely to change.
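To make the unstructured-to-structured point concrete, here is a small sketch of the function-calling pattern being described, using the OpenAI tools API directly (the Instructor example earlier wraps this same mechanism). The schema, model name, and input text are illustrative assumptions.

```python
import json

from openai import OpenAI

client = OpenAI()

# Describe the structure we want back as a "function" the model can call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "record_company",
            "description": "Record structured facts about a company mentioned in the text.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "industry": {"type": "string"},
                    "employee_count": {"type": "integer"},
                },
                "required": ["name"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": "Acme Robotics is a 300-person automation company."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "record_company"}},
)

# The model responds with a tool call whose arguments form the structured record.
call = response.choices[0].message.tool_calls[0]
print(json.loads(call.function.arguments))
```

Forcing the tool choice, as shown, is the simple "tweak" that turns a chat model into an extraction engine for ingestion pipelines.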
[00:53:54] Tobias Macey:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on Cognee, and for helping to shed some light on the overall space of memory and how it is applied in the context of these generative AI systems. It's definitely a very interesting problem domain, and it's great to see you working to make it more tractable. So I appreciate all the time and energy you're putting in there, and I hope you enjoy the rest of your day.
[00:54:19] Vasilije Markovich:
Thank you. Thanks for having me.
[00:54:25] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to AI Engineering Podcast
Interview with Vasilije Markovich
Understanding Memory in LLM Systems
Challenges of Forgetting in LLMs
Multi-Turn Interactions and Context Management
Hierarchical Memory in LLM Applications
Semantic Memory and Cognitive Science
Architectural Components for Semantic Memory
Development and Evolution of Cognee
Data Structures and Ontologies in LLMs
Integrating Cognee into System Design
Personalization and Use Cases for Cognee
Navigating Unknowns in AI Ecosystem
Potential Applications of Cognee
Lessons Learned in Building Cognee
Future Plans for Cognee