Summary
Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI-powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the cases where the humans have to adapt to the tool.
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI-powered developer assistant at Tabnine
- Introduction
- How did you get involved in machine learning?
- Can you describe what Tabnine is and the story behind it?
- What are the individual and organizational motivations for using AI to generate code?
- What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
- What are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?
- What are some of the primary ways that developers interact with Tabnine during their development workflow?
- Are there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)
- For natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)
- Can you describe the structure and implementation of Tabnine?
- Do you rely primarily on a single core model, or do you have multiple models with subspecialization?
- How have the design and goals of the product changed since you first started working on it?
- What are the biggest challenges in building a custom LLM for code?
- What are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?
- For users of Tabnine, how do you assess/monitor the accuracy of recommendations?
- What are the feedback and reinforcement mechanisms for the model(s)?
- What are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?
- When is an AI developer assistant the wrong choice?
- What do you have planned for the future of Tabnine?
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
- Thank you for listening! Don't forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- TabNine
- Technion University
- Program Synthesis
- Context Stuffing
- Elixir
- Dependency Injection
- COBOL
- Verilog
- MidJourney
[00:00:10] Tobias Macey:
Hello, and welcome to The Machine Learning Podcast, the podcast about going from idea to delivery with machine learning. Your host is Tobias Macey, and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine. So, Eran, can you start by introducing yourself?
[00:00:28] Eran Yahav:
Hey. Thanks for having me. I'm Eran. I'm the CTO and cofounder of Tabnine. Other than that, I'm a professor of CS at the Technion, which is one of the leading Israeli universities. I've been doing research on program synthesis for many, many years now, I think since before it was cool.
[00:00:51] Tobias Macey:
Definitely looking forward to this conversation. And do you remember how you first got started working in machine learning?
[00:00:56] Eran Yahav:
I've been working on program synthesis for many years now, I think since the mid-2000s or something like that. And somewhere around 2010, we realized that a lot of programming tasks are extremely repetitive and can be automated if you learn from millions of examples. So, initially, we worked on classical approaches, mostly logic based approaches to program synthesis, using version spaces to represent spaces of candidate programs and explore the space of programs for synthesis. But when, you know, neural networks started to gain popularity again, and mostly when LSTMs started to be useful, I really got hooked on that. And from then on, I think we really evolved together with the field, staying close to NLP techniques through transformers, and, you know, the rest, as they say, is history.
And really, you know, since the age of transformers, so to speak, together with my students and with Yoav Goldberg from Bar-Ilan University, we've done a lot of cool things around the expressive power of various networks. So fairly theoretical work on the expressive power of various RNNs, and applications of LLMs in general for software engineering. And I also did very cool work with my student, Gail Weiss, and with Yoav on interpretability of transformers and reverse engineering transformers. So that was a few years back already. So I've been in this field for a while now.
[00:02:35] Tobias Macey:
And so for the Tabnine project, can you describe a bit about what it is and some of the story behind how it came to be and why this is the problem that you want to spend your time and energy on?
[00:02:50] Eran Yahav:
So Tabnine is an AI assistant for software development. It helps you with all software development and maintenance tasks. It can help you generate code, generate tests, generate documentation, review your code, and it will eventually help you drive the entire software development life cycle end to end using AI. So back, I think, in 2018, maybe, we were the first to bring AI code completions to market, initially just in Java, based on classical techniques, let's call them more logic based and semantic techniques. Then we moved to GPT based networks in 2019 and extended the platform to support more languages. We started by focusing on code completions because we saw that it's a good place to deliver a lot of value to developers.
But the vision is much wider. I think it's pretty obvious now that the future of software development is AI driven. So, like, everything in software development is going to be assisted with AI. You will not write software or maintain software without AI, just like you don't do it without a compiler, right, or without an interpreter for interpreted languages. So the future of software development is AI driven. It will take some time to get to the point where the entire process is AI driven, but we definitely see more and more pieces of this vision materialize. Right? Every month that passes, we get another task in software engineering boosted by AI in a significant manner.
[00:04:24] Tobias Macey:
For the use case of AI as an assistant to your development process, you mentioned code completion as the initial foray into that space. I'm wondering what you see as the main motivations for individual developers, and then at the team or organizational level, to start to adopt these AI capabilities into their development practices?
[00:04:50] Eran Yahav:
Yeah. So I think for individuals, it's mostly productivity and also, at least for me, some element of discovery. So as I'm working with Tabnine, I obviously get the acceleration of not having to think about syntax and even about the implementation of things that, you know, either I've done a million times before and I don't want to remember, or even things that I don't know how to do. And I don't really want the nitty gritty details. I just want it to be done, like following some API or something like that. So definitely acceleration and a productivity boost for the individual, but also higher quality of code and discoverability, especially with Tabnine Chat. I ask questions on how to do certain things where I don't know the answer, and I get educated. I learn stuff from Tabnine Chat. So maybe a more structured answer to your question is that, as an individual, I feel that Tabnine serves three different layers for me.
One is remind me. It reminds me of things that I know how to do, but, you know, I just don't remember the exact details. Second is teach me. It teaches me how to do things that I don't necessarily know how to do. And the third thing is elevate me, like really give me wider context and make me a better developer by exposing various ways to do stuff. So as an individual, it is really like working with an expert who can expose you to new things, as well as reminding you of the non interesting things and getting them out of the way quickly. To answer the second part of the question, for organizations, I think, again, it's clear to any organization that all software is going to be developed with the assistance of AI. Right? So there is AI in the future of software development anywhere and everywhere. So organizations are coming to us to see how Tabnine can improve productivity and how it can harmonize the code that is being created in the organization. Right? Because when everybody is following kind of the same model or the same AI when they generate new code, you get harmonized code. They're all inside the distribution, so to speak. So nobody is getting off on a tangent. Right? They're all on the common path in some sense.
And with the launch of Tabnine Chat, organizations are also coming to see whether chat can accelerate knowledge sharing and help answer questions that would otherwise require human experts to answer, so a wide range of things. And, again, as the functionality expands to code review and to other surface areas, I think we'll see even more demand from organizations.
[00:07:40] Tobias Macey:
Yeah. I particularly like that thought of having the expert sitting over your shoulder, because one of the challenges to scaling engineering teams is that there's only so much expertise that can be built up within an individual or a set of individuals, and it takes time to proliferate that knowledge. And it's also sometimes challenging to ensure that all of the people who are coming into the team are allocated enough time with those experts to be able to pick up the various signals that you receive through that interaction. And so by being able to reduce the total amount of time required to gain some of that context and knowledge, it allows that subject matter expert to scale more effectively, because they don't have to spend as much of their time staring over somebody's shoulder while they write code and try to figure things out.
[00:08:31] Eran Yahav:
Yeah. Absolutely. And it allows you to kind of get things right from the get go, because you're getting help as you're generating the code. You may be generating it in a way that will pass the review later. Right? So it's not only that you get the expert knowledge. In a sense, you get the expert knowledge early enough so you don't have to get rejected in code review and redo what you did. Right?
[00:08:53] Tobias Macey:
And the other side of that, though, is there's only so much expertise that an AI can consume or consolidate, because some of that expertise is contextual and requires things that are, at least as of now, still within the domain of human only capabilities, such as intuition, or being able to make logical leaps between two different things, or understanding the overall business context without having to explain it in very minute detail to the AI. And I'm wondering what you see as some of the real world limitations of using a generative AI in the creative process of software development?
[00:09:32] Eran Yahav:
Yeah. So definitely, architectural reasoning and high level reasoning, business reasoning on why we are doing it this way, is something that is, right now at least, beyond what the AI is capable of, or at least it's not easy to communicate that information to the AI right now, not to say that it will remain that way forever. But at this point, this kind of reasoning is pretty hard for the AI, and it's pretty hard to communicate that. But I think that's the beautiful thing about AI systems in software development: you can relieve the humans from, like, the nitty gritty details of the exact syntax of calling an API or how do I sequence some calls to other things. And the humans can focus on the architecture and on kind of the business aspects of why we are doing it this way. And so, usually, when people ask me, like a general audience asks, are developers going to be replaced by AI? Like, are software engineers going to be replaced by AI? I tell them, no. Absolutely not. Because the job of a software engineer is to solve business problems using software. It is not to generate code. Right? So the code generation aspect of a software engineer, yeah, that may be accelerated or, to a large degree, replaced by AI. But thinking about the business problem and what software architecture and what software should be created for it is definitely in the realm of the human. But maybe to answer even slightly more philosophically, I would say that the limitation of Gen AI in general is the human.
The models are already extremely strong in the ability to take a lot of input, you know, be it through fine tuning on customer code, say, or using very large context windows or context stuffing or whatever other techniques you wanna use to communicate a lot of information into the model. And the models can also generate very extensive outputs. Right? So the bottleneck is really the human. Like, do you, Tobias, really want to read 2,000 lines generated by Tabnine and review them to see that they do exactly what you wanted? Like, all in one go, here, Tobias, bam, 2,000 lines, good luck. I don't think so. That's not how humans work. So really the barrier, to a large degree, for Gen AI in general, actually, in software creation is the human, and we need to find better ways to communicate with the human and lower the cognitive load on the human when they work with the system. Right now, the natural granularity of communication with the AI is the class level, the method level, something like that, where you can say, oh, yeah, it does what I want. But if you want to go beyond that, the barrier is not the model. The barrier is the human being able to say, yes, I understand what this does, and this is what I wanted.
[00:12:34] Tobias Macey:
Speaking as somebody who's been in software engineering for a little while, it's very similar to the experience of working with the person who's requesting the software to be built and doing the requirements gathering to understand what it is that you're actually trying to solve for. And so we're just moving that another layer down, where I, as the human, have to get the requirements from other humans, and then I, as the human, have to relay those requirements to the AI in a way that the AI can understand. So it's really just the same experience, just with a different interface.
[00:13:02] Eran Yahav:
Yeah. It's a similar experience, and we would hope that in the future we'll find a better interface. Right? Because, otherwise, as you said, we just roll the problem one level up or one level down. What you really want is a better way to communicate that intent, right, and to create the software in a better manner. Yeah. I mean, as you said, as software engineers, our job is to solve problems for the business, and half of that is understanding what the problem is.
[00:13:30] Tobias Macey:
And what the technologies are able to solve.
[00:13:32] Eran Yahav:
Yeah. I think maybe one positive outlook on this: as software engineers, we shouldn't be too negative and bitter. I think the positive here is that Gen AI allows us to get to the prototype faster and to get to a faster realization that the software actually doesn't do what we wanted and that the requirements are unclear. Right? So it's a faster iteration. I think this is extremely valuable.
[00:13:59] Tobias Macey:
Absolutely. Humans as intermediate representation.
[00:14:04] Eran Yahav:
Yeah. Absolutely.
[00:14:05] Tobias Macey:
And then there's another aspect of AI in the software development context, and we've addressed this a little bit, but as software engineers, we feel like, oh, we're the experts in the problem domain. We know exactly what we're doing. We don't need an AI to come and give a bunch of recommendations that might be wrong that then I have to evaluate. I'm wondering what are some of the aspects of skepticism that you have come up against with developers who are starting to think about this utility in their tool belt, or some of the aspects of oversight that they need to be aware of as they're starting to exercise generative AI for producing larger and larger components of their code?
[00:14:43] Eran Yahav:
So, actually, I think skepticism is already gone. Like, a couple of years back, maybe three years ago, it was like, yeah, I'm not sure that this can generate anything useful. Maybe it automates only the parts that, you know, I don't care about, like very mundane stuff, and doesn't help me. It's just another thing that gets in the way. But in the last year and a half, maybe two years even, I think the technology has matured to a level that is really pleasing, and it's a pleasure to work with something like Tabnine. You work with it and it's like, oh, I actually didn't know that it is capable of doing that. I'm pleasantly surprised, and you get a sequence of these pleasant surprises all the time. Not to say that all of them are pleasant. Sometimes, you know, the surprises are unpleasant. But for the most part, it's a pleasant experience. And one interesting psychological thing about the product is that people love the product. And I think one reason is that we are developers. We are fans of the technology, and we are really rooting for the product to succeed. Right? We want AI to succeed in some sense. So whenever it succeeds, it's like, oh, yeah, it got that right. We're happy for the product that it succeeded, which is a fascinating psychological phenomenon from my perspective. I think another aspect there is the variable reward. It doesn't always succeed. And so there's some game there. Right? You're playing with the AI. It's playful. It's a playful thing. And the technology is mature enough that even when it fails, you can understand why it fails. So you treat it like a child that has not developed a deeper understanding. You say, oh, yeah, I see why you got that wrong. Let me try to rephrase my prompt, and maybe that will make you succeed. Right? So there's a lot of nurturing of the AI by the developer.
Definitely, if you want to take the more critical point on that, I think the AI does act as an amplifier. Right? So Tabnine is contextualized on your code. And if your code sucks, it will keep generating code in that vein. Right? So if you write horrible code, guess what? The AI understands that this is what you want. You probably want more horrible code, and so it will help you create that. It will try not to. Right? But it could. You could persuade it to follow your style into the dark corners of software. And so I think developers need to be aware that AI imitates their style in some sense, and they have to be careful and keep an eye on what gets generated, especially if their current code is not that great.
[00:17:37] Tobias Macey:
Absolutely. And that brings to mind my other question about technical debt and managing the amount of technical debt that is generated through this interface, where it's the same garbage in, garbage out principle that you were describing, where if you write a really bad way of looping through a string, then you're just gonna get more bad loops and more bad loops, and you're quickly going to get into a situation where your software doesn't work at all.
[00:18:20] Eran Yahav:
Yeah. It's not necessarily that bad. I mean, with some of the customers, you know, when you step into the customer, they typically say something like, oh, can you train on our code? So we get something that is similar to our code. And then they think about it again and say, actually, you know what? Let's train on somebody else's code. Remember that we said we have 30 million lines of code? Actually, we just want to train on 3 million of them. Right? The other 27 million lines of code, it's better that they don't see the light of day. Right? So I think we're seeing a lot of that. And if you do train on your good projects, if you're careful about what you put into the context, I think you will get high quality code that you can be happy with. Right? And you're absolutely right that if you're not careful, you could enter the business of producing what we call new legacy. Right? Just like you have a bunch of legacy code, and you're just generating new legacy code. Right? So you gotta be careful with that.
[00:19:00] Tobias Macey:
And that brings me to the question of some of the ways that developers and development teams use something like Tabnine in their development workflow. Is it largely for exploring new capabilities and quickly iterating on prototypes? Is it for managing refactoring workflows, generating tests? I'm just wondering if you can talk to some of the ways that people think about the AI capabilities as a complement to their existing skill set.
[00:19:29] Eran Yahav:
I think still the most prevalent use is in code generation and code completion, just because it integrates so naturally into the flow and happens with a very high frequency. Right? So every time you type, Tabnine gives you a code completion. Often, it is what you want, so you just hit tab and take it. So this is definitely kind of the workhorse of the AI assistant. It happens all the time. Similarly, when you write a method signature or a comment inside the code, that is, again, code generation style, maybe longer form generation of an entire function or an entire class directly from the comment, which is also very widely used. Other things like test generation and documentation generation definitely happen, but naturally their frequency is lower, just because their frequency in the software development life cycle is slightly lower, right, unless your job is just to generate tests, which happens.
You're still spending most of the coding time just writing code, not writing tests or documentation, so to speak. Those features are being used quite often, but their frequency is obviously lower. And Tabnine Chat is also used very frequently. There it's more like what you said, more discoverability, like, how do I do x or where can I find y and stuff like that?
[00:20:58] Tobias Macey:
Yeah. The other piece of curiosity that I have around this overall set of capabilities is whether there are any particular categories of software for which something like Tabnine is most effective or most capable, thinking in terms of web applications versus machine learning models versus infrastructure as code. Just wondering if you can talk to some of the ways that it's used most broadly and some of the cases where it's most effective.
[00:21:27] Eran Yahav:
So Tabnine is used by over a million users, you know, every month. And so these users come from all avenues of software, all programming languages, and I think almost all kinds of applications. And even here internally at Tabnine, we use Tabnine every day across the entire stack. That being said, I think that, at least for code generation, there are languages that are more amenable to getting a very high, what we call, automation factor, like huge amounts of your code getting generated for you. You know, languages such as JavaScript or TypeScript UI work, React, whatever. You can get a lot of automation there.
Definitely, Python data science work is also highly automatable, again, because many times the tasks are well defined. So it is easy for you to communicate to the AI what it is that you want, and it is easy for you to judge that the result is what you wanted. So maybe taking a step back, I usually talk about the fundamental theorem of Gen AI, which holds also in Gen AI for software. It sounds like a grand theorem, but it's really maybe a trivial observation that just says that the cost of specifying what you want plus the cost of consuming the result has to be much, much lower than the cost of doing the task manually. Right? So if I have to work real hard to tell Tabnine what I want, and I have to work real hard to check that the result is what I wanted, then maybe it would have been better for me just to do it manually. So I would say the kinds of software, or the styles of software, for which AI is most useful or appropriate are tasks where it is easy for you to define what you want and easy for you to judge whether what you got back from Tabnine is what you needed. Right? So it's really, again, the style of software is more about where it is easy for the human to communicate with the system, both in terms of input and in terms of consuming the output. And UI is a trivial example because you can run the program and see whether the UI looks like what you wanted or not. Right? And so it's easy for you to judge. Data science is another, because you maybe hit the button, see the plot, and say, yeah, that's the plot I wanted. Right? So things where it is very easy for you to judge whether the program does what you wanted it to do.
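To make the "fundamental theorem" Eran describes concrete, it can be written as a simple inequality (the notation here is ours, not something Tabnine publishes):

$$
C_{\text{specify}} + C_{\text{consume}} \ll C_{\text{manual}}
$$

where $C_{\text{specify}}$ is the effort to tell the assistant what you want, $C_{\text{consume}}$ is the effort to review what comes back, and $C_{\text{manual}}$ is the effort of just doing the task yourself. UI code and data science plots keep $C_{\text{consume}}$ small because you can check the result at a glance, which is why they automate so well.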
[00:24:13] Tobias Macey:
And the other interesting piece that you already alluded to is the set of languages that are supported. You mentioned that when you were first doing code completion, you started with Java. Now that you're using generative AI, you've expanded the set of languages that it works with. I'm curious if you can talk to the analogy with natural language large language models being biased largely towards English, how that compares in the software ecosystem, and some of the ways that you're looking to tackle the long tail of languages that people would like to have supported.
[00:24:48] Eran Yahav:
Yeah. That's interesting. I think there's enough code in most languages to drive a very successful model. So our current models support, I think, maybe up to, I don't know, 60 or 80 languages, depending on how you count. The ability of the model to generate clever things is definitely more biased towards the heavy head of languages. So you'll get great results in, you know, JavaScript, TypeScript, Java, Python, PHP, C++, C, Rust, probably Ruby, and I've probably forgotten a bunch of others. And maybe the support in Lua and Elixir would be slightly less sophisticated, but there's definitely transfer happening between languages, also in natural languages, by the way, but more so, I think, in programming languages.
So you'd be surprised that even if there isn't a lot of code out there in the world, the model does pretty well on Elixir and, you know, even more interestingly, on Verilog or SystemVerilog code, which is also substantially different than what we're used to in higher level programming languages. So it's not exactly the same effect as, you know, having a language model for Hebrew, which is very, very challenging. There's more transfer happening, I think, between the programming languages.
[00:26:33] Tobias Macey:
Yeah. I think that the overall problem space is compressed compared to all of human language, because all programming languages are targeting the same core capability of Turing completeness. They might have different semantics or different ways of approaching it, but they all have the same base set of constructs and capabilities that they're trying to compile down to. And I'm sure that simplifies the translation between the syntactical elements, because the core semantics are largely similar. I mean, there are obviously programming languages that have very esoteric capabilities or ways of approaching a problem, but at the end of the day, it all compiles down to CPU instruction sets.
[00:27:16] Eran Yahav:
Yeah. There's another aspect here. You're absolutely right. But there's another aspect here, which is that there's actually the base signal, like the carrier signal, which is the programming language syntax itself. But largely what we do today with programming languages is calling APIs and libraries. And, you know, the syntax of the programming language is very, very simple, and it's easy for the model to pick that up. The model almost never makes syntactic mistakes. Right? So it's all about the APIs and libraries and how you call them, how you sequence them, etcetera. And that is largely shared between languages. So, you know, if you're using, whatever, the Twilio API in Java, JavaScript, Rust, or something, you're gonna get very similar names for the API calls.
And the syntax of calling a function may be slightly different, but, you know, there is some structure there that is really shared across all languages, and the model picks that up quite nicely.
[00:28:17] Tobias Macey:
And now digging into Tabnine itself, can you talk to some of the design and implementation details of the product and application, but also some of the ways that you think about the model training and development and deployment aspects?
[00:28:32] Eran Yahav:
Yeah. So let me see, where should I start? Let me start with large blocks, let's call them. So I think people don't really realize the distance between a model and a product using a model. And, you know, the model changes all the time, and new models come out almost every week. And we improve our models all the time. But the model is a small part of the product. There's so much going on around that. So, things like vector memory that help the product be contextually aware of what's going on in your code base.
Things like semantic memory that help the product be aware of the semantic context of what's going on in your project. So Tabnine, for example, will know things that are defined in libraries, even very deep inside the library, even when the source code of the library is not available. Right? So there's some level of semantic memory there that helps Tabnine be contextually aware. Other components are, you know, models for ranking, for filtering, for toxicity, I don't even know. There are so many moving parts in the product.
But maybe I would say the model is one big piece, semantic memory and vector memory are two other big pieces, and there are many, many small pieces around that. So now, in terms of addressing the problem of contextual awareness, how you provide a product that is aware of customer code is also very central to Tabnine. And so we have both the machinery for fine tuning on the customer code base and the machinery for doing retrieval augmented generation, both text based and semantic based, to provide contextual awareness. And on top of that, we have some additional symbolic semantic awareness in the IDE that provides additional hints for the model about what should be generated. So, definitely, contextual awareness is a central theme of the product, and a lot of work went into that.
In Tabnine Chat, there are additional aspects, including, again, the context awareness, but also limiting what gets generated, and some alignment problems there that are also interesting. You also asked about deployment. So Tabnine is basically deployed into a Kubernetes cluster. It's a bunch of services, inference and the others that I mentioned. You can deploy Tabnine in your own virtual private cloud, or you can deploy Tabnine completely air gapped on your own hardware. And we have a number of customers that have gone the hardware route and have their own air gapped deployments of Tabnine.
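As a rough illustration of the retrieval augmented generation piece Eran describes, here is a minimal, generic sketch. This is not Tabnine's implementation; the `embed` and `complete` callables are hypothetical stand-ins for whatever embedding and code models a product actually uses.

```python
# Minimal sketch of retrieval augmented generation for code completion.
# Not Tabnine's implementation; embed() and complete() are hypothetical stand-ins.
from typing import Callable, List, Tuple

def build_index(chunks: List[str], embed: Callable[[str], List[float]]) -> List[Tuple[List[float], str]]:
    """Embed every repository chunk once so it can be retrieved later."""
    return [(embed(chunk), chunk) for chunk in chunks]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def complete_with_context(cursor_context: str,
                          index: List[Tuple[List[float], str]],
                          embed: Callable[[str], List[float]],
                          complete: Callable[[str], str],
                          k: int = 3) -> str:
    """Retrieve the k most similar repository chunks and stuff them into the prompt."""
    query = embed(cursor_context)
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    prompt = "\n\n".join(chunk for _, chunk in nearest) + "\n\n" + cursor_context
    return complete(prompt)
```

A real system would layer fine tuning, semantic (symbol-level) lookups, and ranking models on top of this kind of retrieval, as described above.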
[00:32:02] Tobias Macey:
As far as the model architecture, I'm also interested in understanding what your approach has been and some of the experiments that you've done. Do you just have one monolithic core model that does all of the programming languages and code generation, or have you broken it into, I guess, a core model and then different models with subspecializations? And what are some of the ways that you think about the interaction across those boundaries?
[00:32:28] Eran Yahav:
Yeah. We kind of oscillate architecturally. We have taken the approach of having multiple models. And, definitely, there have been times that we served, I don't know, maybe up to 8 different models, serving different programming languages or even different parts of the software universe, like different stacks. I think these days we have converged to maybe 3 or 4 different models, and they serve different facets of the product. So different parts of the product have very different latency cost trade offs. And when you come to optimize Tabnine and, you know, make it possible to run Tabnine also on premise without requiring an entire GPU farm there, you need to optimize the consumption in different facets of the product.
Some facets are very latency sensitive. Right? So for code completions, you want to get generation as you type. And so the model depth cannot be very large for that if you're optimizing for latency. And that in turn drives other parameters of the model, and you get to certain model sizes if you want to do that economically with very low latency. Other facets like chat are maybe less latency sensitive, and you can go with much bigger models, and you have more leeway on how you handle the inferences and how you deal with in flight requests and batching and all that. Inference is hard. I don't think people realize how hard inference is when you try to do it economically at scale, like when you try to serve the Tabnine SaaS, right, when you serve hundreds of thousands of users and you have to do that economically.
I think the inference just becomes a hard problem that then also dictates the kinds of models that you can and do deploy in production. And as you said, specialized smaller models, like a sparse architecture, sometimes allow you to do things with better latency and also more cost effectively.
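A toy sketch of the kind of routing this implies is below. It is purely illustrative; the model names and latency budgets are made up, not Tabnine's.

```python
# Toy router that picks a model by product facet and its latency budget.
# The model names and budgets are invented for illustration only.
from dataclasses import dataclass

@dataclass
class ModelChoice:
    name: str
    max_latency_ms: int

MODELS = {
    "completion": ModelChoice("small-low-latency-code-model", max_latency_ms=150),
    "test_generation": ModelChoice("medium-code-model", max_latency_ms=1500),
    "chat": ModelChoice("large-conversational-code-model", max_latency_ms=3000),
}

def route(task: str) -> ModelChoice:
    """Map a product facet to the model whose size fits its latency budget."""
    try:
        return MODELS[task]
    except KeyError:
        raise ValueError(f"unknown task: {task}")

print(route("completion"))  # inline completions need the smallest, fastest model
print(route("chat"))        # chat can tolerate a bigger, slower model
```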
[00:34:56] Tobias Macey:
And while I was preparing for this interview, I looked to see how long Tabnine has been around and how long you've been working in this space, and I saw that Tabnine as a business was founded back in 2013. And I'm wondering if you can talk to some of the ways that your overall goals and approach to your core problem statement have changed or evolved over that period, and in particular, some of the ways that the advent of generative AI kind of shook up your ideas of what was possible.
[00:35:28] Eran Yahav:
Yeah. Definitely. So, in fact, 2013 is an exaggeration, because I was a professor working on this super, super part time, just with Dror. But in 2017, we decided that this was maybe something worth pursuing seriously and got funding. And so I would draw the line from 2017 for a real company. But, definitely, when we started, I think we knew that we could do stuff by learning from millions of examples to help you develop software better. The imagination was that all software would be written using AI. So that was the vision, I think, even as early as 2013, when we just played with ideas.
Initially, when we started, we said, okay, you will write code, and we'll present suggestions in a sidebar. Like, next to you, as you write, we'll give you suggestions of what to do. And it was beautifully done, really. We had a great designer back in 2017. Super talented guy. It was the most beautiful product that I ever saw. I loved it. And everybody absolutely hated it other than me. Like, all developers in the world hated it with a raging passion. And the reason is that as you would type, things would refresh on the side all the time. So, imagine that you start writing and you have a sidebar that always changes all the time. It's super distracting, even if every suggestion there is exactly the right one.
And then one of our users said, hey guys, why don't you do exactly the same thing, but make it a code completion natively in the IDE? Just insert it in the right place and pick one, just pick the best one, because we said we'll show you several options you can pick from, and we were totally naive about the user interaction. And once we put that in as a code completion, I think it clicked, and we got very good adoption starting in 2018, and in 2019, when we expanded to additional languages, Tabnine really started to get traction. Yeah. I think I've been humbled by the progress. Back in 2019, had you asked me if we'd be able to do what it is doing now, I would have said probably in a decade.
I couldn't even imagine that this would be possible. And every day, I'm kind of surprised and amazed and humbled by what it can do. And as the entire research area keeps progressing, I'm amazed by the things that we are able to do these days. I think, really, for me, in the last few years, I've come to the realization, or speculation, that quantity does breed quality, which is something I was highly skeptical of in the past. I would have said that piling on more and more parameters will not make you that much better.
And, you know, it looks like it does. Whether you do it with a sparse architecture or keep on piling parameters into a dense model is beside the point, but these models become stronger and stronger. And I'm amazed by the abilities that we have in our hands today. Again, as I said earlier in the conversation, I think for many of the tasks, the human is the bottleneck. So your ability to describe what it is that you want and your ability to consume back what has been generated is the bottleneck. So the bottleneck is IO, so to speak, between the human and the machine. Right?
[00:39:39] Tobias Macey:
If only we could figure out how to make humans multicore.
[00:39:46] Eran Yahav:
Yeah. Exactly.
[00:39:48] Tobias Macey:
In your work of using these generative AI capabilities to build a product that is focused on developers and their workflows, you mentioned things like latency being a challenge. I'm wondering if you can talk to some of the most complex aspects of customizing an LLM for this specific context of software engineers.
[00:40:11] Eran Yahav:
Yeah. That's an interesting question. I think, as you mentioned, latency. I think we spend a lot of time on the granularity. So it's not necessarily the model directly, but, again, how you interact with it. What is the right granularity of generation? Like, should it be 5 lines, 1 line, 200 lines? What are the boundaries? How do you make it easy for the user to consume the result? And this is happening also in test generation. You know, how do you make the result accessible? Let's say I generated 200 tests for you. What are you going to do with that? Are they all valuable? Which ones are the valuable ones? How do you guide the model to generate the valuable tests and not just, you know, increase the test count? So these kinds of questions. I wouldn't say that the model per se is a challenge, but definitely everything around the model.
Context awareness still remains a problem in this space, because one of the advantages of you as a human is that you know the entire project. You maybe know the entire organizational code base, what microservices are available, what other architectural decisions have been made. Right? And as we try to convey this kind of information to the model, there is a limitation on what the model can capture, but also a limitation on how you phrase that, how you extract the information from, say, the code base or from the organizational knowledge base and communicate that to the model in a useful way. So that's definitely another challenge.
[00:41:57] Tobias Macey:
Another aspect of the problem is not just the capabilities, but also the accuracy, understanding when you've generated an accurate result, and being able to build a feedback loop for the model. And I'm curious if you can talk to some of the ways that you think about the assessment and measurement of accuracy, given the fact that that can be a bit of a nebulous concept in software engineering, because, again, it comes back to whether the requirements are accurate. So is the output accurate based on the requirements, or is it just that the requirements are inaccurate? And what are some of the ways that you think about building those feedback loops into the successive training iterations of the model?
[00:42:36] Eran Yahav:
Yeah. It's a really hard problem, and we spend a lot of energy on evaluation. We obviously have our own evaluation harnesses that, intuitively, pick up a repository that we haven't seen in training from GitHub or somewhere, erase some parts, try to complete them, and see how well we did. But, you know, even measuring how well you did is a tricky concept. What exactly do you measure? Each metric has its own problems, and so we have a slew of metrics that we've learned how to weigh together to get a read on whether the model is better than the previous one or not. So definitely a lot of lab evaluation.
In terms of users, I think the ultimate test, for code generation, is whether the code that we generated got adopted or not. Right? And so that's slightly easier. But, also, as you said, it's not necessarily equivalent across all users, because we found out some users take it even if it's completely wrong, and then they massage it into what they want. Right? And some users prefer not to take it and just write something very similar themselves. And so the signals are also quite tricky to analyze. That's for code completion. But for code completion, I feel like we have a pretty good read based on both the internal harness and metrics and on user feedback of accepted slash rejected completions.
For chat, this is much harder, because the specification, like the question of what you asked, is often fuzzier. The interaction is more natural language heavy. It's multi turn. Right? So there's more than one iteration in which you get the requirements specified and refined. And then the result may or may not be taken by the user; you know, it may be too long, so the user didn't take it. So there are all sorts of harder to analyze signals when you're analyzing the score or the evaluation of chat. Again, a lot of lab evaluation there, including human tagging. So we employ human tagging teams to give us feedback and to help us feed that back into the model.
There are some efforts around RLHF, obviously. We do generate multiple answers in Tabnine Chat, and we see how users interact with that, which allows us to get some read on reality. But these evaluations are very, very tricky, especially for chat that is generating code, because, just as you said, Tobias, the spec is unclear. Right? Somebody wrote some natural language specification and then got some result, which is not clearly completely wrong or completely right, and it's very hard to evaluate. Even, I would add, using humans it is very hard to evaluate. Like, even if you give me the results, or you give our human team the results and say, was the chat correct? Was Tabnine Chat correct or not?
It's pretty hard.
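A rough sketch of the masked-completion harness Eran described earlier in this answer is below. This is generic illustrative Python, not Tabnine's harness; `model.complete()` is a hypothetical interface, and text similarity is only one of the many metrics a real harness would weigh together.

```python
# Generic sketch of a "mask and complete" evaluation harness for a code model.
# Not Tabnine's harness: model.complete() is a hypothetical interface, and the
# similarity ratio is only one of many metrics a real harness would combine.
import difflib
import random
from typing import List

def mask_spans(lines: List[str], n_spans: int = 5, span_len: int = 3, seed: int = 0):
    """Pick random line spans from a file (lines as from readlines()), yielding (prefix, ground_truth, suffix)."""
    rng = random.Random(seed)
    for _ in range(n_spans):
        start = rng.randrange(0, max(1, len(lines) - span_len))
        prefix = "".join(lines[:start])
        truth = "".join(lines[start:start + span_len])
        suffix = "".join(lines[start + span_len:])
        yield prefix, truth, suffix

def score_file(lines: List[str], model) -> float:
    """Average text similarity between the model's completion and the erased code."""
    scores = []
    for prefix, truth, suffix in mask_spans(lines):
        completion = model.complete(prefix=prefix, suffix=suffix)
        scores.append(difflib.SequenceMatcher(None, completion, truth).ratio())
    return sum(scores) / len(scores) if scores else 0.0
```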
[00:45:59] Tobias Macey:
Another layer to that challenge is shared vocabulary, where, if you're in the conversational mode of write me a module that does x, you're saying, oh, I want you to use dependency injection, versus somebody else saying, oh, I want you to use inversion of control, and just some of the ways that the names for the same concept differ across different teams and how you manage to map those to the same inputs for the model. I'm wondering how that plays out.
[00:46:29] Eran Yahav:
Yeah. The honest answer is I don't know. It's hard, as you said. I think we're not there yet. I think we're hitting a lower bar of challenges on the human expressivity side, of people saying, whatever, write me a sorting algorithm. So, you know, you might get bubble sort and say, oh, no, I didn't want bubble sort, I want something else. So you get some other sort. But, actually, this is totally not what I wanted, because I wanted to make a library call. I didn't want you to reimplement the entire sorting algorithm. Right? So now, this entire interaction, is it good or bad? Should you have given the library call as the first answer for the first interaction?
You don't know how to grade that. Right? And this is a trivial example, but we're seeing a lot of those interactions. I think, as always with these technologies, they are successful when the humans adjust. Right? So I think it's not about tuning the model so much as tuning the humans. Right? It's a kind of calibration of expectations, and people learn, and we see that with users. They learn how to write the questions in a way that is helpful for cutting through to the right answer on the first interaction.
It's not different from people learning how to use Google. Right? It's the same kind of skill. And I think developers will improve at doing that. People like to call it prompt engineering, but it's not really that. It doesn't go as far as that. It's just learning to communicate with an assistant that is maybe very smart, but very limited in its ability to communicate. Right?
[00:48:21] Tobias Macey:
Yeah. I was going to ask you about that idea of prompt engineering and being able to share a set of prompts that produce a particular category of output, or some of the ways that you think about training humans. And I guess the main question I'm driving at is, what are the aspects of customer education that you find yourself having to come back to as people are onboarding onto Tabnine?
[00:48:48] Eran Yahav:
There's definitely some aspect of that. Like, for example, when you're doing code completions, people write comments, right, to kind of prompt the model. But in their head, they're thinking chat. So they write a comment like, question, how do I do blah blah? And this is not how the model has been trained for code completion. Code completion is not a conversational model. The model actually expects comments as they would appear in real code. Right? And when you start to write these elaborate stories as comments, you're actually distracting the model. So that's part of the education that you have to do.
On the chat side, again, I think most of the education is, like, be short and to the point, don't tell the elaborate stories, because, again, you're distracting the model in a sense. But I think people already get that. Right?
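As a hypothetical illustration of the difference Eran describes between chat-style and code-style comments when prompting a completion model (the example and its suggested completion are imagined, not actual Tabnine output):

```python
# Hypothetical illustration of prompting a completion model via comments.
# The completion shown below is an imagined example, not actual Tabnine output.

# Less effective: a conversational, chat-style comment that rarely appears in real code.
# Question: hey, can you please write me something that reads a CSV file and
# gives me back the rows as dictionaries? Thanks!

# More effective: a terse comment phrased the way it would appear in a code base,
# followed by a signature the model can complete.
# Read a CSV file and return its rows as a list of dicts keyed by column name.
def read_csv_rows(path: str) -> list[dict[str, str]]:
    import csv
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```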
[00:49:42] Tobias Macey:
Yeah. And to the point of putting a conversational mood into your code comments, it's also bad form for the actual longevity of the project, because you don't want a very verbose comment. You just wanna know, what is this trying to tell me and why do I care?
[00:49:58] Eran Yahav:
Yeah. Absolutely. I think one interesting thing is actually in test generation, which I found interesting. We do test generation from the code itself right now. Right? So it's like the opposite of TDD in some sense. Right? You don't say what the requirements are; you look at the code, and you generate the tests for it. But you're not given a spec. Right? You're not given a specification unless there is some comment there. So for test generation, for example, it is helpful to have a comment that says what the code is expected to do. It helps test generation test what you intended for the code to do and not what the code happens to do. Right? So these kinds of things are usability issues that should probably be solved in a different way in the product. Right? It should probably ask you for the spec in natural language.
But right now, that's the way that it works.
[00:50:49] Tobias Macey:
Yeah. It's interesting, because there are a few libraries I've come across where the idea is that in the comments, you write the contract for the function, which says, okay, this accepts these inputs, they should be within this range, etcetera. And so it's intended to help with constraining the tests or informing the other developers of how this function should be used, but it also has the benefit of speaking to the model in the language that it understands.
[00:51:15] Eran Yahav:
Yeah. Exactly right. And, yeah, I think people's mental model of the AI assistant should be, I'm working with this other person who is very smart, but whom I have to communicate with extremely clearly. Otherwise, they get distracted or derailed. Right? So that's the mental model.
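A small, hypothetical example of the contract-in-comments idea Tobias mentions: a docstring that states intent explicitly, which gives both a reviewer and a test-generating model something to check the implementation against. The function and its contract are invented for illustration.

```python
# Hypothetical example: a docstring that states the function's contract explicitly.
# The stated intent helps a human reviewer, and it gives a test-generating model
# something to test against beyond whatever the implementation happens to do.
def clamp(value: float, low: float, high: float) -> float:
    """Return value limited to the inclusive range [low, high].

    Contract:
    - Requires low <= high; raises ValueError otherwise.
    - Returns low if value < low, high if value > high, else value unchanged.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))
```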
[00:51:38] Tobias Macey:
And in your experience of building Tabnine, particularly in your latest iterations of using these generative models for code suggestions and code creation, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:51:57] Eran Yahav:
Okay, so I've seen it applied in strange ways. Maybe my application is that I do all my writing with Tabnine. So emails and meeting summaries, everything, I write in Sublime using Tabnine, just for the English model. I got used to it. I don't think it's extremely unusual, but that's just me. I recently had some people interested in using it to migrate COBOL code to more modern languages, which I thought would not really work that well. But, again, if you calibrate your expectations properly, I think it's not that bad.
So, again, a migration project as a whole is probably a bad use for Tabnine, because taking one code base and, you know, abracadabra, make this COBOL into Java, it's just not going to work architecturally. Right? The architecture of the application is going to be completely different. But if you have this opaque COBOL code and you want to massage it into some Java procedure that does something that you can understand, and maybe help it a little bit on the edges manually, I think that is an interesting, or at least an unexpected, use from my perspective.
I've seen people trying to do magic TDD, or TDD to the extreme, like writing the tests and trying to force Tabnine to generate the code. It kind of works in small cases where, you know, you can fit all the test cases into the context and whatnot. But I think that's an interesting direction for the future. I think something like that will happen in certain domains. Right? That you just write the tests, hit a button, and get the code. So I can definitely imagine that as a viable use case of the technology moving forward.
Yeah. I think that's roughly it at the high level.
[00:54:10] Tobias Macey:
From the test creation perspective too, it sounds like you would probably want to have your test harness preconfigured, and you're just filling in the individual tests rather than just, I have absolutely no tests. Write something.
[00:54:22] Eran Yahav:
Yeah. So it has to be something like that. Some balance.
[00:54:28] Tobias Macey:
And in your experience of building this product, working with the developer community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:38] Eran Yahav:
Yeah. I found out that people are very passionate about languages that I didn't know existed. And so I think we get a lot of requests. Again, it's just my ignorance, not to say that the language is bad in any way; it's just that I was not aware of the amount of code that is being created there. For example, PowerShell was one example where the community was very excited about using Tabnine for PowerShell. And I was like, hey guys, I don't even know that it works. Right? But definitely, I think those languages were kind of surprising.
Maybe another surprising aspect is the Verilogs of the world, the hardware languages, which again were completely off my radar, and people just came and said, hey, it kind of works for these languages, but can we make it better? And so we are making it better for these languages. And it's interesting. It's very different than what I'm used to, at least as a software engineer.
[00:55:47] Tobias Macey:
In the space of software engineering, software development, development teams, I'm wondering what are the cases where you see AI and AI assisted development as the wrong choice.
[00:56:02] Eran Yahav:
That's a good question. I think for code generation, probably if you are working on, like, the one-off algorithm that is very clever. Especially, I've worked in the past on, like, concurrent low-level algorithms. So I can imagine, if you're working on a concurrent low-level algorithm that requires a lot of global reasoning, subtle reasoning, and you're the only one in the world that ever wrote this algorithm. Right? It's not like you're reinventing some concurrent garbage collector or something. You are legit writing this algorithm for the first time. Then probably AI code generation is not your best friend at this point. Right? It's a very subtle kind of puzzle where every piece has to fit neatly together with global reasoning. And I think, for the generation part, it's maybe not the best use of the tool. This is really a task that is heavy on human intelligence and human reasoning right now. That could change tomorrow, right, when the models become even better. For other aspects that are not generation, I think AI assistance is always the correct choice. Like, if you're using it to review your code or to do test generation or stuff like that, even if the AI is wrong, it teaches you something. Because if you say, review this code, and the AI gives you some comments, you might say, this comment is wrong, but I understand that the code is written in a way that could mislead you to that reasoning. So let me maybe restructure the code to make it more obvious to you and to the next human reader, actually. So I'm doing a service not to the AI. I'm not in service of the AI. I'm in service of future Eran that will come to this code a year from now and say, who's the idiot who wrote that? Right? Because I don't understand it.
And so I think for review, for test generation, AI assistance is always the correct choice, even for the one-off algorithm.
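As a rough illustration of that restructuring point, the hypothetical before-and-after below keeps the same logic but makes the intent much harder to misread, whether the next reviewer is a human or an AI. The function and field names are made up for the example.

```python
# Before: correct, but the double negative invites a wrong review comment.
def can_ship(order: dict) -> bool:
    return not (order["blocked"] or not order["paid"])


# After: equivalent logic (by De Morgan's laws), restructured so the intent
# is obvious to the next reader, human or automated.
def can_ship_clear(order: dict) -> bool:
    return order["paid"] and not order["blocked"]


# Quick sanity check that the two versions agree on every case.
for paid in (True, False):
    for blocked in (True, False):
        order = {"paid": paid, "blocked": blocked}
        assert can_ship(order) == can_ship_clear(order)
```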
[00:58:02] Tobias Macey:
And as you continue to build and iterate on Tabnine and keep up with the rapidly evolving space of large language models, what are some of the things you have planned for the near to medium term, or any particular problem areas that you're excited to dig into?
[00:58:19] Eran Yahav:
I'm super excited about code review. So in code review, we've worked on the problem for a while now. I think the current code review product is version 3 of the product. So we had version 1 maybe a couple years back. I loved it. Everybody hated it. I think you start to see a theme here. Right? The developers on the team said, like, this is mostly distracting us. It's giving us comments where only 2 out of 10 are what we want, and for the other 8, we just have to fight the tool. So we don't want that. Version 2 was much, much better. And I think now with version 3, it's actually really, really useful and valuable. So definitely excited about code review coming out later this year.
Integration with non-code sources is another thing that is coming out, and I'm super excited about the ability to get all of Tabnine to be aware of, you know, Confluence and Notion and Jira and other sources of information that are non-code. This integration that we've been working on for a while now, it's a really hard one. People think, oh, you just slap a vector database over the documents and you'll be fine. No. Far from it. It's a really tricky one to make useful. And I'm very excited about that because informing code generation, test generation, and all the other tasks of Tabnine with some architectural details, with some other non-code sources of information, really changes how the product reacts. Right? You suddenly see it start to use a microservice that has not been defined anywhere other than the docs. Right? And you start to see interesting things happening because you've informed it with more general context, like a human has. Right? So these horizontal integrations with other data sources, I think, are beginning to inject the level of human expertise that you'd expect from a human into the product. And I think as we improve that, and also surface that in code review, the product will become immensely more human like.
[01:00:49] Tobias Macey:
Yeah. Leveling it up from an intermediate to a senior engineer.
[01:00:54] Eran Yahav:
Kinda. Yeah, I guess. That's a good way to phrase it. Thank you.
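The naive baseline Eran alludes to above, just slapping a vector database over the documents, could look something like the sketch below: embed each non-code document, retrieve the nearest ones for a coding request, and stuff them into the prompt. The document snippets, the bag-of-words stand-in for an embedding model, and the prompt format are all hypothetical; the point is how thin this layer is compared to what a useful integration needs.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DOCS = {  # hypothetical Confluence / Jira snippets
    "billing-service": "The billing microservice exposes POST /invoices and requires an idempotency key.",
    "arch-decision-12": "All new services must publish events to the orders Kafka topic.",
    "jira-4812": "Deprecate the legacy SOAP payment gateway by Q3.",
}


def retrieve(query: str, k: int = 2) -> list:
    # Rank every document by similarity to the request and keep the top k.
    q = embed(query)
    ranked = sorted(DOCS.values(), key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(task: str) -> str:
    # Stuff the retrieved snippets ahead of the coding request.
    context = "\n".join(f"- {snippet}" for snippet in retrieve(task))
    return f"Relevant internal docs:\n{context}\n\nTask: {task}\n"


if __name__ == "__main__":
    print(build_prompt("Add an endpoint that creates an invoice in the billing service"))
```

As he notes, getting from nearest-neighbor snippets like these to something that genuinely informs code generation is where the hard work is.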
[01:00:59] Tobias Macey:
Alright. And are there any other aspects of the Tabnine product itself, the overall space of AI assistance for software engineering, or the rapidly evolving area of large language models that we didn't discuss yet that you would like to cover before we close out the show?
[01:01:15] Eran Yahav:
I'm very curious about the space. I think we will not go that far with Tabnine, but back in 2016, maybe, I worked on this kind of crazy idea of learning programming from YouTube videos. There are so many tutorials on YouTube that teach you how to do certain things with programming. So I actually had very nice work with a student on, like, how to learn from those video tutorials. And when I have time, someday, I'm curious about going back to that and maybe generating programming tutorials completely automatically. I don't think we'll do that in Tabnine anytime soon. I think the road map is pretty full.
But something there feels right. I mean, programming tutorial videos and the ability to generate them automatically sounds exciting to me.
[01:02:09] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:02:23] Eran Yahav:
Two of them, probably. I think one is definitely privacy and security, as we see with Tabnine customers that don't wanna send all their information outside the org. So definitely a barrier to adoption there on kind of the architectural side. On the product side, I maintain that the biggest barrier to adoption is the human. We need to find better ways to interface with humans. This is not only for software creation. This is for any Gen AI product. You need to somehow find the right level of presentation to make it easy for your user to say, uh-huh. Yeah. You just generated what I wanted. Right? And maybe with Midjourney, it's easy because you get the princess riding the unicorn.
But with other products, it may be very, very hard and may require maybe even phrasing a different language for communicating back the result. Right? So even for us in Tabnine, one of the kind of high-level thoughts is, like, should there be a different language in which you communicate the results? Even if you ask for a C program, maybe I shouldn't show you the 2,000 lines of C program, but kind of summarize for you the main ideas in this C program so you can say, yeah. Okay. I got it. That's what I wanted. Right?
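One toy version of that idea, communicating the result in a different form instead of showing the whole thing, is sketched below: parse the generated module and report only its top-level structure. The generated snippet is a hypothetical stand-in; Python's ast module does the parsing.

```python
import ast

# Hypothetical stand-in for a large generated module.
GENERATED_SOURCE = '''
class InvoiceStore:
    """Keeps invoices in memory, keyed by idempotency key."""
    def add(self, key, invoice): ...
    def get(self, key): ...

def create_invoice(store, payload):
    """Validate the payload and persist a new invoice."""
    ...
'''


def summarize(source: str) -> str:
    """Return one line per top-level class or function: its name plus the first docstring line."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            doc = (ast.get_docstring(node) or "(no docstring)").splitlines()[0]
            lines.append(f"{kind} {node.name}: {doc}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize(GENERATED_SOURCE))
```

The reviewer can confirm the shape of what was generated first and only open the full source when something looks off.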
[01:03:43] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing on Tabnine. It's a very interesting product, and it's definitely very exciting to have these capabilities for people working in the software space. So I definitely appreciate all the time and energy that you and your team are putting into accelerating software engineers, and I hope you enjoy the rest of your day.
[01:04:05] Eran Yahav:
Thank you so much. Thanks for having me. It was great.
[01:04:12] Tobias Macey:
Thank you for listening. And don't forget to check out our other shows, the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
[00:54:10] Tobias Macey:
From the test creation perspective too, it sounds like you would probably want to have your test harness preconfigured, and you're just filling in the individual tests rather than just, I have absolutely no tests. Write something.
[00:54:22] Eran Yahav:
Yeah. Yeah. So it has to be something like that. Yeah. Some some balance.
[00:54:28] Tobias Macey:
And in your experience of building this product, working with the developer community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:38] Eran Yahav:
Yeah. I found out that people are very passionate about languages that I didn't know existed. And so I think we get a lot of requests. Again, it's just my ignorance not to say that the language is bad in any way. It's just that I was not aware of, like, the the amount of code that is being created there. Like, for example, PowerShell was 1 example that the community was very excited about using top 9 for PowerShell. And I was like, hey, guys. I don't even know that it works. Right? And and but but definitely that I think those languages were kind of surprising.
Maybe another surprising aspect is the Verilogs of the world, the hardware description languages that, again, were completely off my radar. People just came and said, hey, it kind of works for these languages, but can we make it better? And so we are making it better for these languages. And it's interesting. It's very different from what I'm used to, at least as a software engineer.
[00:55:47] Tobias Macey:
In the space of software engineering, software development, and development teams, I'm wondering what are the cases where you see AI and AI-assisted development as the wrong choice?
[00:56:02] Eran Yahav:
That's a good question. I think for code generation, probably if you are working on the one-off algorithm that is very clever. I've worked in the past on concurrent low-level algorithms, so I can imagine that if you're working on a concurrent low-level algorithm that requires a lot of global, subtle reasoning, and you're the only one in the world who ever wrote this algorithm. Right? It's not like you're reinventing some concurrent garbage collector or something; you are legit.
This is your algorithm, written for the first time. Then AI code generation is probably not your best friend at this point. Right? It's a very subtle kind of puzzle where every piece has to fit neatly together with global reasoning, so for the generation part, it's maybe not the best use of the tool. This is really a task that is heavy on human intelligence and human reasoning right now; that could change tomorrow, right, when the models become even better. For other aspects that are not generation, I think AI assistance is always the correct choice. If you're using it to review your code or to do test generation or things like that, even if the AI is wrong, it teaches you something. Because if you say, review this code, and the AI gives you some comments, you can say, this comment is wrong, but I understand that the code is written in a way that could mislead you to that reasoning. So let me maybe restructure the code to make it more obvious to you, and to the next human reader, actually. So I'm doing a service not to the AI. I'm in service of future Eran, who will come to this code a year from now and say, who's the idiot who wrote that? Because I don't understand it.
And so I think for review and for test generation, AI assistance is always the correct choice, even for the one-off algorithm.
[00:58:02] Tobias Macey:
And as you continue to build and iterate on Tabnine, and continue to keep up with the rapidly evolving space of large language models, what are some of the things you have planned for the near to medium term, or any particular problem areas that you're excited to dig into?
[00:58:19] Eran Yahav:
I'm super excited about code review. We've worked on that problem for a while now; I think the current code review product is version 3. We had version 1 maybe a couple of years back. I loved it. Everybody hated it. You start to see a theme here, right? The developers on the team said, this is mostly distracting us. It's giving us comments where maybe 2 out of 10 are what we want, and for the other 8 we just have to fight the tool. So we don't want that. Version 2 was much, much better. And I think now, with version 3, it's actually really useful and valuable. So I'm definitely excited about code review coming out later this year.
Integration with non-code sources is another thing that is coming out, and I'm super excited about the ability to have all of Tabnine be aware of, you know, Confluence and Notion and Jira and other sources of information that are not code. This integration is something we've been working on for a while now, and it's a really hard one. People think, oh, you just slap a vector database over the documents and you'll be fine. No. Far from it. It's a really tricky one to make useful. And I'm very excited about that, because informing code generation, test generation, and all the other tasks of Tabnine with some architectural details, with some other non-code source of information, really changes how the product reacts. Right? You suddenly see it start to use a microservice that has not been defined anywhere other than the docs, and you start to see interesting things happening because you've informed it with a more general context, like a human has. So these integrations with other data sources, I think, are beginning to inject the level of human expertise that you'd expect from a human into the product. And I think as we improve that, and also surface it in code review, the product will become immensely more human-like.
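As a rough illustration of the retrieval idea (and only the naive version that, as noted above, is not enough on its own), the sketch below indexes a few made-up internal documents, pulls the most relevant ones for a task using a toy bag-of-words score, and prepends them to a generation prompt. None of this reflects Tabnine's actual implementation; the document names and the similarity function are stand-ins.

```python
# Deliberately simplified sketch of retrieval over non-code sources:
# index design docs, pull the most relevant snippets for a task, and
# prepend them to the generation prompt. Documents and scoring are toys.
from collections import Counter

DOCS = {
    "confluence/payments-architecture": "Payments are handled by the billing-gateway microservice over gRPC.",
    "jira/PLAT-142": "Deprecate direct database access from the web tier; go through the orders service.",
    "notion/error-handling": "All services must return structured errors with a trace ID.",
}


def score(query: str, text: str) -> int:
    """Toy relevance score: count of overlapping lowercase tokens."""
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & t).values())


def retrieve(query: str, k: int = 2) -> list:
    """Return the k highest-scoring documents, labeled with their source."""
    ranked = sorted(DOCS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return [f"[{name}] {text}" for name, text in ranked[:k]]


def build_prompt(task: str) -> str:
    """Prepend retrieved context to the task before sending it to a model."""
    context = "\n".join(retrieve(task))
    return f"Relevant internal docs:\n{context}\n\nTask:\n{task}\n"


if __name__ == "__main__":
    print(build_prompt("add a refund endpoint that calls the billing-gateway microservice"))
```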
[01:00:49] Tobias Macey:
Yeah. Leveling it up from an intermediate to a senior engineer.
[01:00:54] Eran Yahav:
Kind of, yeah, I guess. That's a good way to phrase it. Thank you.
[01:00:59] Tobias Macey:
Alright. And are there any other aspects of the Tabnine product itself, the overall space of AI assistance for software engineering, or the rapidly evolving area of large language models that we didn't discuss yet that you would like to cover before we close out the show?
[01:01:15] Eran Yahav:
I'm very curious about the space. I think we will not go that far with Tabnine, but back in 2016, maybe, I worked on this kind of crazy idea of learning programming from YouTube videos. There are so many tutorials on YouTube that teach you how to do certain things with programming, and I had some very nice work with a student on how to learn from those video tutorials. When I have time, someday, I'm curious about going back to that and maybe generating programming tutorials completely automatically. I don't think we'll do that in Tabnine anytime soon; I think the roadmap is pretty full.
But something there feels right. I mean, programming tutorial videos and the ability to generate them automatically sounds exciting to me.
[01:02:09] Tobias Macey:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest barrier to adoption for machine learning today.
[01:02:23] Eran Yahav:
Two of them, probably. I think one is definitely privacy and security, as we see with Tabnine customers that don't want to send all their information outside the org. So there's definitely a barrier to adoption on the architectural side. On the product side, I maintain that the biggest barrier to adoption is the human. We need to find better ways to interface with humans. This is not only for software creation; this is for any generative AI product. You need to somehow find the right level of presentation to make it easy for your user to say, uh-huh, yeah, you just generated what I wanted. Right? And maybe with Midjourney it's easy, because you get the princess riding the unicorn.
But with other products it may be very, very hard, and may even require phrasing a different language for communicating back the result. Even for us in Tabnine, one of the high-level thoughts is: should there be a different language in which you communicate the results? Even if you ask for a C program, maybe I shouldn't show you the 2,000 lines of C, but instead summarize the main ideas in the program so you can say, yeah, okay, I got it, that's what I wanted. Right?
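A hedged sketch of that "summarize the result instead of dumping it" interaction pattern. The `call_model` function is a placeholder for whatever completion API is in use, and nothing here reflects Tabnine's actual interface; it only shows the two-step shape: generate the code, then generate a short summary to show first.

```python
# Sketch of the "summarize the result instead of dumping it" idea.
# `call_model` is a placeholder for a real LLM call; nothing here reflects
# any particular product's interface.

def call_model(prompt: str) -> str:
    """Placeholder for a real completion API call; returns a canned string."""
    return "...model output..."


def generate_with_summary(task: str) -> dict:
    # Step 1: generate the (potentially very long) artifact.
    code = call_model(f"Write a C program that does the following:\n{task}")
    # Step 2: generate a compact description the user can confirm at a glance.
    summary = call_model(
        "Summarize the main ideas of this program in five bullet points, "
        "so a reviewer can confirm it matches the request:\n" + code
    )
    # Show the summary first; only expand the full source on request.
    return {"summary": summary, "code": code}
```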
[01:03:43] Tobias Macey:
Absolutely. Well, thank you very much for taking the time today to join me and share the work that you're doing on Tabnine. It's a very interesting product, and it's very exciting to have these capabilities for people working in the software space. So I appreciate all the time and energy that you and your team are putting into accelerating software engineers, and I hope you enjoy the rest of your day.
[01:04:05] Eran Yahav:
Thank you very much. Thanks for having me. It was great.
[01:04:12] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows: the Data Engineering Podcast, which covers the latest in modern data management, and Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used. You can visit the site at themachinelearningpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@themachinelearningpodcast.com with your story. To help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Eran Yahav's Journey in Machine Learning
Overview of Tabnine
Adoption of AI in Software Development
Limitations of AI in Software Development
Skepticism and Acceptance of AI Tools
Use Cases and Applications of Tabnine
Programming Languages and AI Effectiveness
Design and Implementation of Tabnine
Evolution of Tabnine and Generative AI
Challenges in Customizing LLMs for Software Engineering
Customer Education and Prompt Engineering
Interesting Applications and Lessons Learned
When AI is the Wrong Choice
Future Plans for Tabnine
Final Thoughts and Closing Remarks