Listen on Spotify, Apple, and Amazon | Watch on YouTube
From the invention of RAG to how it evolved beyond its early use cases, and why the future lies in RAG 2.0, RAG agents, and system-level AI design, this week’s episode of Founded & Funded is a must-listen for anyone building in AI. Madrona Partner Jon Turow sits down with Douwe Kiela, the co-creator of Retrieval Augmented Generation and co-founder of Contextual AI, to unpack:
- Why RAG was never meant to be a silver bullet — and why it still gets misunderstood
- The false dichotomies of RAG vs. fine-tuning and long-context
- How enterprises should evaluate and scale GenAI in production
- What makes a problem a “RAG problem” (and what doesn’t)
- How to build enterprise-ready AI infrastructure that actually works
- Why hallucinations aren’t always bad (and how to evaluate them)
- And why he believes now is the moment for RAG agents
Whether you’re a builder, an investor, or an AI practitioner, this is a conversation that will challenge how you think about the future of enterprise AI.
This transcript was automatically generated and edited for clarity.
Jon: So Douwe, take us back to the beginning of RAG. What was the problem that you were trying to solve when you came up with that?
Douwe: The history of the RAG project: we were at Facebook AI Research, FAIR, and I had been doing a lot of work on grounding already for my PhD thesis. Grounding, at the time, really meant understanding language with respect to something else. If you want to know the meaning of the word cat, the word embedding of the word cat, this was before we had sentence embeddings, then ideally, you would also know what cats look like, because then you understand the meaning of cat better. That type of perceptual grounding was something a lot of people were looking at at the time. Then I was talking with one of my PhD students, Ethan Perez, about, “Can we ground it in something else? Maybe we can ground it in other text instead of in images.” The obvious source at the time to ground in was Wikipedia.
We would say, “This is true, sort of true,” and then you can understand language with respect to that ground truth. That was the origin of RAG. Ethan and I were looking at that, and then we found that some folks in London had been working on open-domain question answering, mostly Sebastian Riedel and Patrick Lewis. They had amazing first models in that space, and it was a very interesting problem: how can I make a generative model work on any type of data and then answer questions on top of it? We joined forces there. We happened to get very lucky at the time because the people at Facebook had built FAISS, Facebook AI Similarity Search I think is what it stands for, basically the first vector database, and it was just there. And so we were like, we have to take the output from the vector database and give it to a generative model. This was before we called them language models. Then the language model can generate answers grounded on the things you retrieve. And that became RAG.
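For concreteness, here is a minimal retrieve-then-generate sketch of that recipe in Python. It assumes the faiss and sentence-transformers packages; the embedding model and prompt format are illustrative stand-ins, not the original RAG implementation, which went further and jointly trained the retriever and generator.

```python
# Minimal sketch of the recipe described above: embed documents, index them
# with FAISS, retrieve the nearest passages for a query, and hand those
# passages to a generative model as grounding context.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "RAG combines a retriever with a generative model.",
    "FAISS is a library for efficient vector similarity search.",
    "Wikipedia was the original grounding corpus for RAG.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product ~ cosine here
index.add(np.asarray(doc_vecs, dtype="float32"))

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [docs[i] for i in ids[0]]

query = "What does RAG combine?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would now go to any generative model of your choice.
```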
We always joke with the folks who were on the original paper that we should have come up with a much better name, but somehow, it stuck. This was by no means the only project doing this; there were people at Google working on very similar things, REALM is an amazing paper from around the same time. Why RAG stuck, I think, is because the whole field was moving towards gen AI, and the G in RAG stands for generative. We were really the first ones to show that you could make this combination of a vector database and a generative model actually work.
Jon: There’s an insight in here that RAG, from its very inception, was multimodal. You were starting with image grounding, and things like that, and it’s been heavily language-centric in the way people have applied it. But from that very beginning place, were you imagining that you were going to come back and apply it with images?
Douwe: We had some papers from around that time. There’s a paper we did with more applied folks at Facebook, I think it was called Extra, and it was basically RAG but on top of images. That feels like a long time ago now, but that was always very much the idea: you can have arbitrary data that is not captured by the parameters of the generative model, and you can do retrieval over that arbitrary data to augment the generative model so that it can do its job. It’s all about the context that you give it.
Jon: Well, this takes me back to another common critique of these early generative models: for the amazing Q&A they were capable of, the knowledge cutoff was really striking. You had models in 2020 and 2021 that were not aware of COVID-19, which obviously was so important to society. Was that part of the motivation? Was that part of the solve, that you can make these things fresher?
Douwe: Yeah, it was part of the original motivation. That is what grounding is; it was the vision behind the original RAG project. We did a lot of work after that on that question as well: can I have a very lightweight language model that basically has no knowledge, that’s very good at reasoning and speaking English or any language, but that knows nothing? It has to rely completely on this other model, the retriever, which does a lot of the heavy lifting to ensure that the language model has the right context, so that they really have separate responsibilities. Getting that to work turned out to be quite difficult.
Jon: Now, we have RAG, and we still have this constellation of other techniques: we have training, and we have tuning, and we have in-context learning. That was, I’m sure, very hard to navigate for research labs, let alone enterprises. In the conception of RAG, in the early implementations of it, what was in your head about how RAG was going to fit into that constellation? Was it meant to be standalone?
Douwe: It’s interesting because the concept of in-context learning didn’t really exist at the time; that really became a thing with GPT-3, and that’s an amazing paper and proof point that it actually works, and I think it unlocked a lot of possibilities. In the original RAG paper, we have a baseline, what we call the frozen baseline, where we don’t do any training and just give the retrieved results as context, that’s in Table 6, and we showed that it doesn’t really work, or at least, that you can do a lot better if you optimize the parameters. In-context learning is great, but you can probably always beat it through machine learning if you are able to do that. If you have access to the parameters, which is obviously not the case with a lot of these black-box frontier language models, but if you have access to the parameters and you can optimize them for the data you’re working on or the problem you’re solving, then at least theoretically, you should always be able to do better.
I see a lot of false dichotomies around RAG. The one I often hear is that it’s either RAG or fine-tuning. That’s wrong: you can fine-tune a RAG system, and then it would be even better. The other dichotomy I often hear is RAG versus long-context. Those are solving the same problem: you have more information than you can put in the context. One solution is to try to grow the context, which doesn’t really work yet, even though people like to pretend that it does; the other is to use information retrieval, which is pretty well established as a computer science research field, and leverage all of that to make sure the language model can do its job. Things get oversimplified, when really you should be doing all of those things: you should be doing RAG, you should have a context window as long as you can get, and you should fine-tune that thing. That’s how you get the best performance.
Jon: We’ll talk about how this is all getting combined in more sophisticated ways today, but I think it’s fair to say that in the past 18, 24, 36 months, RAG has caught fire and even become misunderstood as the single silver bullet. Why do you think it’s been so seductive?
Douwe: It’s seductive because it’s easy. Honestly, I think long-context is even more seductive if you’re lazy, because then you don’t even have to worry about the retrieval anymore: you put all the data there, and you pay a heavy price for having all of that data in the context. Every single time you’re answering a question about Harry Potter, you have to read the whole book in order to answer the question, which is not great. RAG is seductive, I think, because you need a way to get these language models to work on top of your data. In the old paradigm of machine learning, we would probably do that in a much more sophisticated way, but because these frontier models are behind black-box APIs and we have no access to what they’re actually doing, the only way to really make them do the job on your data is to use retrieval to augment them. It’s a function of what the ecosystem has looked like over the past two years since ChatGPT.
Jon: We’ll get to the part where we’re talking about how you need to move beyond a cool demo, but I think the power of a cool demo should not be underestimated, and RAG enables that. What are some of the aha moments that you see with enterprise executives?
Douwe: There are lots of aha moments; I think that’s part of the joy of my job, where you get to show what this can do, and it’s amazing what these models can do. A basic aha moment for us is that accuracy is almost table stakes at this point. If you have some data, like one document, you can probably answer lots of questions about that document pretty well. It becomes much harder when you have a million documents or tens of millions of documents and they’re all very complicated or have very specific things in them. We’ve worked with Qualcomm, and there are circuit design diagrams inside those documents; it’s much harder to make sense of that type of information. The initial wow factor, at least for people using our platform, is that you can stand this up in a minute. I can build a state-of-the-art RAG agent in three clicks, basically.
That time to value used to be very difficult to achieve, because you had your developers, and they had to think about the optimal chunking strategy for the documents, things that you really don’t want your developers thinking about, but they had to because the technology was so immature. The next generation of these systems and platforms for building RAG agents is going to enable developers to think much more about business value and differentiation: “How can I be better than my competitors because I’ve solved this problem so much better?” Your chunking strategy should not be important for solving that problem.
Jon: Also, if I connect what we were just talking about to what you said now, the seduction of long-context and RAG is that it’s straightforward and easy, and it plugs into my existing architecture. As a CTO, if I have finite resources to go implement new pieces of technology, let alone dig into concepts like chunking strategies, or how the vector for non-dairy will look similar to the vector for milk, things like this, is it fair to say that CTOs want something coherent, something that works out of the box?
Douwe: You would think so, and that’s probably true for CTOs, and CIOs, and CAIOs, and CDOs, the folks who are thinking about it from that level. But what we often find is that we talk to these people, and they talk to their architects and their developers, and those developers love thinking about chunking strategies, because that’s what it means to be an AI engineer in the modern era: being very good at prompt engineering, and evaluation, and optimizing all the different parts of the RAG stack. It’s very important to have the flexibility to play with these different strategies, but you need to have very, very good defaults so that these people don’t have to do that unless they really want to squeeze out the final percent, and then they can do that.
That’s what we are trying to offer: you don’t have to worry about all this basic stuff; you should be thinking about how to really use the AI to deliver value. It’s really a journey. The maturity curve is very wide and flat. Some companies are still figuring out, “What use case should I look at?” Others have a full-blown RAG platform that they built themselves based on completely wrong assumptions about where the field is going to go, and now they’re stuck in that paradigm. It’s all over the place, which means it’s still very early in the market.
Jon: Take me through some of the milestones on that maturity curve, from the cool demo all the way through to the ninja level results.
Douwe: The timeline is: 2023 was the year of the demo. ChatGPT had just happened, everybody was playing with it, and there was a lot of experimental budget. Last year was about trying to productionize it, and you could probably get promoted in a large enterprise if you were the first one to ship gen AI into production. There’s been a lot of kneecapping of those solutions in order to be the first one to get something into production.
Jon: First-past-the-post.
Douwe: First-past-the-post, but in a limited way, because it is very hard to get the real thing past the post. This year, people are under a lot of pressure to deliver return on investment for all of those AI investments and all of the experimentation that has been happening. It turns out that getting that ROI is a very different question. That’s where you need a lot of deep expertise around the problem, and you also need better components than what exists out there in easy open-source frameworks; you can cobble together a Frankenstein RAG solution that’s great for the demo, but that doesn’t scale.
Jon: When your customers think about the ROI, how do they measure and perceive that?
Douwe: It really depends on the customer. Some are very sophisticated, trying to think through the metrics: “How do I measure it? How do I prioritize it?” I think a lot of consulting firms are trying to be helpful there as well, thinking through, “Okay, this use case is interesting, but it touches 10 people who are very highly specialized, while this other use case touches 10,000 people who are maybe slightly less specialized, but there’s much more impact there.” It’s a trade-off. My general stance on use case adoption is that I see a lot of people aiming too low. It’s like, “Oh, we have AI running in production.” “Oh, what do you have?” “Well, we have something that can tell us who our 401(k) provider is and how many vacation days you get.”
And that’s nice, but is that where you get the ROI of AI from? Obviously not. You need to move up in terms of complexity. If you think of the org chart of the company, you want to go for these specialized roles with really hard problems; if you can make those people 10, 20% more effective at that problem, you can save the company tens or hundreds of millions of dollars by making them better at their job.
Jon: There’s an equation you’re getting at: the complexity and sophistication of the work being done, times the number of employees it impacts.
Douwe: There are roughly two categories of gen AI deployment. One is cost savings: I have lots of people doing one thing, and if I make all of them slightly more effective, then I can save myself a lot of money. The other is more around business transformation and generating new revenue. That second one is obviously much harder to measure, and you need to think through the metrics: “What am I optimizing for here?” As a result, I think you see a lot more production deployments in the former category, where it’s about cost savings.
Jon: What are some big misunderstandings that you see around what the technology is or is not capable of?
Douwe: I see some confusion around the gap between demo and production. A lot of people think, “Oh yeah, it’s great, I can easily do this myself.” Then it turns out that everything breaks down after a hundred documents, and they have a million. That is the most common one we see. There are other misconceptions around what RAG is good for and what it’s not. What is a RAG problem and what is not a RAG problem? People don’t have the same mental model that AI researchers like myself have. If I give them access to a RAG agent, often the first question they ask is, “What’s in the data?” That is not a RAG problem, or it’s a RAG problem on the metadata; it’s not on the data itself. A RAG question would be something like: what was Meta’s R&D expense in Q4 of 2024, and how did it compare to the previous year?
It’s a specific question where you can extract the information and then reason over it and synthesize the different pieces. A lot of questions that people like to ask are not RAG problems. “Summarize the document” is another one. Summarization is not a RAG problem; ideally, you want to put the whole document in the context and then summarize it. There are different strategies that work well for different questions, and part of why ChatGPT is such a great product is that they’ve abstracted away some of those decisions, but that’s still very much happening under the surface. People need to understand better what type of use case they have. If I’m a Qualcomm customer engineer and I need very specific answers to very specific questions, that’s very clearly a RAG problem. If I need to summarize a document, put it in the context of a long-context model.
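As a toy illustration of that routing decision, here is an entirely hypothetical keyword router in Python; in a real product, as he notes about ChatGPT, a model makes this decision under the surface rather than string matching.

```python
# Hypothetical router for the distinction drawn above: specific, extractable
# questions go to RAG; whole-document requests like summarization go to a
# long-context path; corpus-level "what's in the data?" questions are really
# questions about metadata, not the documents themselves.
def route(query: str) -> str:
    q = query.lower()
    if "summarize" in q or "summary" in q:
        return "long_context"  # put the whole document in context
    if "what's in the data" in q:
        return "metadata"      # at best a RAG problem over metadata
    return "rag"               # specific question: retrieve, then reason

assert route("What was Meta's R&D expense in Q4 of 2024?") == "rag"
assert route("Summarize the document") == "long_context"
assert route("What's in the data?") == "metadata"
```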
Jon: Now, we have Contextual, which is an amalgamation of multiple techniques. You have what you call RAG 2.0, and you have fine-tuning, and there are a lot of things that happen under the covers that customers ideally don’t have to worry about until they choose to do so. I expect that radically changes the conversation you have with an enterprise executive. How do you describe the kinds of problems that they should go find, apply, and prioritize?
Douwe: We often help people with use case discovery: thinking through, okay, which are the RAG problems, and which are maybe not RAG problems? Then for the RAG problems, how do you prioritize them? How do you define success? How do you come up with a proper test set so that you can evaluate whether it actually works? What is the process after that for doing what we call UAT, user acceptance testing, putting it in front of real people? That’s really the thing that matters, right? Sometimes we see production deployments, and they’re in production, and then I ask, “How many people use this?” And the answer is zero. During the initial UAT, everything was great and everybody was saying, “Oh yeah, this is so great.” But when your boss asks you the question and your job is on the line, you do it yourself; you don’t ask AI in that particular use case. It’s a transformation that a lot of these companies still have to go through.
Jon: Do companies want support through that journey today, either directly from Contextual or from a solution partner, to get such things implemented?
Douwe: It’s very tempting to pretend that AI products are mature enough to be fully self-serve and standalone. You get something decent if you do that, but in order to get it to be great, you need to put in the work. We do that for our customers, or we can also work through systems integrators who can do that for us.
Jon: I want to talk about two sides of the organization that you’ve had to build in order to bring all this for customers. One is scaling up the research and engineering function to keep pushing the envelope. There are a couple of very special things that Contextual has, something you call RAG 2.0, something you call active versus passive retrieval. Can you talk about some of those innovations that you’ve got inside Contextual and why they’re important?
Douwe: We really want to be a frontier company, but we don’t want to train foundation models. Obviously, that’s a very, very capital-intensive business, and I think language models are going to get commoditized. The really interesting problems are around how you build systems around these models that solve the real problem. Most of the business problems we encounter need to be solved by a system. Then there are a ton of super exciting research problems around how to get that system to work well together. That’s what RAG 2.0 is in our case: how do you jointly optimize these components so that they work well together? There are also other things, like making sure that your generations are very grounded. It’s not a general language model; it’s a language model that has been trained specifically for RAG and RAG only. It’s not doing creative writing; it can only talk about what’s in the context.
Similarly, when you build these production systems, you need to have a state-of-the-art re-ranker. Ideally, that re-ranker can also follow instructions; it’s a smarter model. There’s a lot of innovative stuff we’re doing around building the RAG pipeline better and then incorporating feedback into that RAG pipeline as well. We’ve done work on KTO, and APO, and things like that: different ways to incorporate human preferences into entire systems and not just models. That takes a very special team, which we have and which I’m very proud of.
Jon: Can you talk about active versus passive retrieval?
Douwe: Passive retrieval is basically old-school RAG: I get a query, I always retrieve, and then I take the results of that retrieval, give them to the language model, and it generates. That doesn’t really work. Very often, you need the language model to think, first of all: where am I going to retrieve from, and how am I going to retrieve? Are there maybe better ways to search for the thing I’m looking for than copy-pasting the query? Modern production RAG pipelines are already way more sophisticated than a vector database and a language model. One of the interesting things you can do in the new paradigm of agentic systems and test-time reasoning is decide for yourself whether you want to retrieve something. That’s active retrieval. If you give me a query like, “Hi, how are you?” I don’t have to retrieve in order to answer that. I can just say, “I’m doing well, how can I help you?”
Then you ask me a question, and now I decide that I need to go and retrieve. Maybe I make a mistake with my initial retrieval, so then I need to think, “Oh, actually, maybe I should have gone here instead.” That’s active retrieval, and that’s all getting unlocked now. This is what we call RAG agents, and this really is the future, I think, because agents are great, but we need a way to get them to work on your data, and that’s where RAG comes in.
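A rough sketch of that control flow, with heavy assumptions: `llm` and `search` below are hypothetical stand-ins for a generative model and a retrieval backend, and the decision prompts are illustrative, not Contextual’s actual implementation.

```python
# Active retrieval sketch: the model decides IF it should retrieve, HOW to
# phrase the search, and whether to retry after a bad retrieval.
from typing import Callable

def answer(query: str,
           llm: Callable[[str], str],           # hypothetical generator
           search: Callable[[str], list[str]],  # hypothetical retriever
           max_attempts: int = 3) -> str:
    # Step 1: decide whether retrieval is needed at all. "Hi, how are you?"
    # should be answered directly, with no retrieval call.
    needed = llm(f"Does answering this require looking up documents? "
                 f"Reply YES or NO.\nQuery: {query}")
    if needed.strip().upper().startswith("NO"):
        return llm(query)

    # Step 2: rewrite the query into a better search string instead of
    # copy-pasting the user's words into the retriever.
    search_query = llm(f"Rewrite as a search query: {query}")

    draft = ""
    for _ in range(max_attempts):
        context = "\n".join(search(search_query))
        draft = llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
        # Step 3: self-check the retrieval; on failure, reformulate and
        # retry ("maybe I should have gone here instead").
        ok = llm(f"Context:\n{context}\nAnswer:\n{draft}\n"
                 f"Is the answer supported by the context? Reply YES or NO.")
        if ok.strip().upper().startswith("YES"):
            return draft
        search_query = llm(f"The search '{search_query}' missed. Suggest a "
                           f"better search query for: {query}")
    return draft
```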
Jon: This implies two relationships of Contextual and RAG to the agent. There is the supplying of information to the agent so that it can be performant, but if I probe into what you said, active retrieval implies a certain kind of reasoning, maybe even longer reasoning about, “Okay, what is the best source of the information that I’ve been asked to provide?”
Douwe: Exactly. I enjoy saying everything is Contextual, and that’s very true for an enterprise. The context that the data exists in really matters for the reasoning the agent does to find the right information; that all comes together in these RAG agents.
Jon: What is a really thorny problem that you’d like your team and the industry to try and attack in the coming years?
Douwe: The most interesting problems that I see everywhere in enterprises are at the intersection of structured and unstructured. We have great companies working on unstructured data, there are great companies working on structured data, but once you have the capability, which we’re starting to have now, where you can reason over both of these very different data modalities using the same model, then that unlocks so many cool use cases. That’s going to happen this year or next year, just thinking through the different data modalities and how you can reason on top of all of them with these agents.
Jon: Will that happen under the covers with one common piece of infrastructure or will it be a coherent single pane of glass across many different Lego bricks?
Douwe: I’d like to think that it would be one solution, and that is our platform, which can do all of that.
Jon: Let’s imagine that, but behind the covers, will you be accomplishing that with many different components each handling the structured versus unstructured?
Douwe: They are different components. Despite what some people like to pretend, I can always train a better text-to-SQL model if I specialize it for text-to-SQL than by taking a generic off-the-shelf language model and telling it, “Generate some SQL query.” Specialization is always going to beat generalization for specific problems, if you know what problem you’re solving. The real question is much more around whether it’s worth actually investing the money to do that. It costs money to specialize, and it sometimes hampers economies of scale that you might want to have.
Jon: If I look at the other side of the organization you’ve had to build: you’ve built a very sophisticated research function, but Contextual is not a research lab, it’s a company. What are the other kinds of disciplines and capabilities you’ve had to build up at Contextual that complement all the research that’s happening here?
Douwe: First of all, I think our researchers are really special in that we’re not focused on publishing papers or being too far out on the frontier. As a company, I don’t think you can afford that until you’re much bigger, until you’re like Zuck and can afford to have FAIR. The stuff I was working on at FAIR at the time, Wittgensteinian language games and all kinds of crazy stuff, is honestly something I would never let people do here. But there’s a place for that, and that’s not a startup. The way we do research is we look very much at the customer problems we think we can solve better than anybody else, and then we focus, thinking from the systems perspective about all of these problems: how can we make sure that we have the best system, and then make that system jointly optimized and really specialized, or specializable, for different use cases? That’s what we can do.
That means there’s a very fluid boundary between pure research and applied research; basically, all of our research is applied. In AI right now, there’s a very fine line between product and research, where the research basically is the product, and that’s not only true for us; I think it’s true for OpenAI, Anthropic, everybody. The field is moving so quickly that you have to productize research almost immediately. As soon as it’s ready, you don’t even have time to write a paper about it anymore; you have to ship it into the product very quickly because it is such a fast-moving space.
Jon: How do you allocate your research attention? Is there some element of play, even 5%, 10%?
Douwe: The team would probably say not enough.
Jon: But not zero?
Douwe: As a researcher, you always want to play more, but you have limited time. So yeah, it’s a trade-off; I don’t think we’re officially committing to a number. We don’t have a 20% rule or something like Google would have. It’s more that we’re trying to solve cool problems as quickly as we can and, hopefully, have some impact on the world. Not working in isolation, but trying to focus on things that matter.
Jon: I think I’m hearing you say that it’s not zero, even in an environment with finite resources and moving fast?
Douwe: Every environment has finite resources. It’s more that if you want to do special things, you need to try new stuff. That’s very different for AI companies, or AI-native companies like us. If you compare this generation of companies with SaaS companies: there, the whole LAMP stack, everything, was already there; you basically had to go and implement it. That’s not the case here. We’re very much figuring out what we’re doing, building the airplane as we’re flying it, which is exciting, I think.
Jon: What is it like to now take this research that you’re doing and go out into the world and have that make contact with enterprises? What has that been like for you personally, and what has that been like for the company to transform from research-led to a product company?
Douwe: That’s my personal journey as well. I did a PhD, I was very much a pure research person, and I slowly transitioned to where I am now, where the key observation is that the research is the product. This is a special point in time; it’s not always going to be like that. That’s been a lot of fun, honestly. I was on a podcast a while back and they asked me, “What other job would you think is interesting?” And I said, “Maybe being the head of AI at JP Morgan.” And they were like, “Really?”
And I was like, “Well, I think, actually, right now, at this particular point in time, that is a very interesting job.” Because you have to think about how you are going to change this giant company to use this latest piece of technology that, frankly, is going to change everything, is going to change our entire society. It gave me a lot of joy talking to people like that and thinking about what the future of the world is going to look like.
Jon: I think there’s going to be people problems, and organizational problems, and regulatory and domain constraints that fall outside the paper.
Douwe: Honestly, I would argue that those are the main problems still to overcome. I don’t care about AGI and all of those discussions; the core technology is already here for huge economic disruption. All the building blocks are here. The questions are more around: how do we get lawyers to understand that? How do we get the MRM, model risk management, people to figure out what an acceptable risk is? One thing we are very big on is not thinking about the accuracy but thinking about the inaccuracy: if you have 98% accuracy, what do you do with the remaining 2% to make sure you can mitigate that risk? A lot of this is happening right now. There’s a lot of change management that we’re going to need to do in these organizations. All of that is outside of the research questions. We have all the pieces to completely disrupt the global economy right now; it’s a question of executing on it, which is scary and exciting at the same time.
Jon: Douwe, you and I have had a conversation many times about different archetypes of founders and their capabilities. There’s one lens that stuck with me that has three click stops on it. A: there is the domain expert, who has expertise in, say, revenue cycle management, but may not be that technical at all. B: there is somebody who is technical and able to write code but is not a PhD researcher, and Mark Zuckerberg is a really famous example of that. Then there’s the research founder, who has deep technical capabilities and advanced vision into the frontier. What do you see as the role for each of those types of founders in the next wave of companies that needs to get built?
Douwe: That’s a very interesting question. I would ask: how many PhDs does Zuck have working for him? That’s a lot, right?
Jon: That’s a lot.
Douwe: I don’t think it matters how deep your expertise in a specific domain is; as long as you are a good leader and a good visionary, you can recruit the PhDs to go and work for you. At the same time, it obviously gives you an advantage if you are very deep in one field and that field happens to take off, which is what happened to me. I got very lucky, with a lot of timing there as well. One underlying question you’re asking is around AI wrapper companies, for example: to what extent should companies go horizontal or vertical using this technology?
There’s been a lot of disdain for these wrapper companies: “Oh, that’s just a wrapper for OpenAI.” Well, it turns out you can make an amazing business just from that, right? I think Cursor is Anthropic’s biggest customer right now. It’s fine to be a wrapper company as long as you have an amazing business. People should have a lot more respect for companies building on top of fundamental new technology, discovering whole new business problems that we didn’t really know existed, and then solving them much better than anything else.
Jon: Well, I’m really thinking also about the comment you made, that we have technology capable of a lot of economic impact even today, without the new breakthroughs that, yes, we’ll also get. Does that change the next types of companies that should be founded in the coming year?
Douwe: I think so. I am also learning a lot of this myself, about how to be a good founder, basically. It’s always good to plan for what’s going to come and not for what is here right now; that’s how you get to ride the wave in the right way. What’s going to come is that a lot of this stuff is going to become much more mature. One of the big problems we had even two years ago was that AI infrastructure was very immature. Everything would break down all the time. There were bugs in the attention mechanism implementations of frameworks we were using, really basic stuff. All of that has been solved now. With that maturity also comes the ability to scale much better and to think much, much more rigorously around cost-quality trade-offs and things like that. There’s a lot of business value right there.
Jon: What do new founders ask you? What kind of advice do they ask you?
Douwe: They ask me a lot about this wrapper company thing, and moats, and differentiation. There’s some fear that incumbents are going to eat everything, because they obviously have amazing distribution. But there are massive opportunities for companies to be AI-native and to think from day one as an AI company. If you do that right, you have a massive opportunity to be the next Google, or Facebook, or whatever, if you play your cards right.
Jon: What is some advice that you’ve gotten? I’ll ask you to break it into two: what is advice you’ve gotten that you disagree with, and what do you think about that? And what is advice you’ve gotten that you take a lot from?
Douwe: Maybe we can start with the advice I really like, which is one observation around why Facebook is so successful: be fluid like water. Whatever the market is telling you, or your users are telling you, fit into that. Don’t be too rigid about what is right and wrong; be humble, look at what the data tells you, and then try to optimize for that. When I got that advice, I didn’t really appreciate it fully, and I’m starting to appreciate it much more right now. Honestly, it took me too long to understand it. In terms of advice that I’ve gotten that I disagree with: it’s very easy for people to say, “You should do one thing and you should do it well.” Sure, maybe, but I’d like to be more ambitious than that. We could have been one small part of a RAG stack, and we probably would’ve been the best in the world at that particular thing, but then we’re slotting into this ecosystem as a small piece, and I want the whole pie, ideally.
That’s why we’ve invested so much time in building this platform, making sure that all the individual components are state-of-the-art and that they’ve been made to work together so that you can solve this much bigger problem. But that is also a lot harder to do. Not everyone would give me the advice to go and solve that hard problem, but I think over time, as a company, that is where your moat comes from: doing something that everybody else thinks is kind of crazy. So that would be my advice to founders: go and do something that everybody else thinks is crazy.
Jon: You’re probably going to tell me that that reflects in the team that comes to join you?
Douwe: Yeah, the company is the team, especially the early team. We’ve been very fortunate with the people who joined us early on, and that is what the company is. It’s the people.
Jon: If I piggyback a little bit and we get back into the technology for a minute, there’s a common question, maybe even misunderstanding, that I hear about RAG: “Oh, this is the thing that’s going to solve hallucinations.” You and I have spoken about this many times. Where is your head at right now on what hallucinations are and what they are not? Does RAG solve them? What’s the outlook there?
Douwe: I think hallucination is not a very technical term. We used to have a pretty good word for it: accuracy. If you were inaccurate, if you were wrong, then to explain that, or to anthropomorphize it, you would say, “Oh, the model hallucinated.” It’s a very ill-defined term, honestly. If I had to turn it into a technical definition, I would say: the generation of the language model is not grounded in the context it is given, where it is told that that context is true. Basically, hallucination is about groundedness. If you have a model that adheres to its context, then it will hallucinate less. Hallucination itself is arguably a feature for a general-purpose language model, not a bug. If you have a creative writing or marketing use case like content generation, I think hallucination is great for that, as long as you have a way to fix it; you probably have a human somewhere double-checking it and rewriting some stuff.
So hallucination itself is not necessarily a bad thing. It is a bad thing if you have a RAG problem and you cannot afford to make a mistake. That’s why we have a grounded language model that has been trained specifically not to hallucinate, or to hallucinate less. One other misconception I sometimes see is that people think these probabilistic systems can have 100% accuracy, and that is a pipe dream. It’s the same with people: if you look at a big bank, there are people in these banks, and people make mistakes too.
Jon: SEC filings have mistakes.
Douwe: Exactly. The whole reason we have the SEC, and that this is a regulated market, is so that we have mechanisms built into the market so that if a person makes a mistake, at least we made reasonable efforts to mitigate the risk around it. It’s the same with AI deployments. That’s why I’m talking about how to mitigate the risk of inaccuracies. We’re not going to get it to 100%, so you need to think about the 2, 3, 5, 10%, depending on how hard the use case is, where you might still not be perfect. How do you deal with that?
Jon: What are some of the things that you might’ve believed a year ago about AI adoption or AI capabilities that you think very differently about today?
Douwe: Many things. The main thing I thought that turned out not to be true was that I thought this would be easy.
Jon: What is this?
Douwe: Building the company and solving real problems with AI. We were very naive, especially in the beginning of the company. We were like, “Oh yeah, we just get a research cluster, get a bunch of GPUs in there, we train some models, it’s going to be great.” Then it turned out that getting a working GPU cluster was very hard. Then it turned out that training something on that GPU cluster in a way that actually works was hard too; if you’re using other people’s code, maybe that code is not that great yet. You have to build your own framework for a lot of the stuff you’re doing if you want to make sure it’s really, really good. We had to do a lot of plumbing that we did not expect to have to do. Now, I’m very happy that we did all that work, but at the time, it was very frustrating.
Jon: What are we, either you and I, or we, the industry, not talking about nearly enough that we should be?
Douwe: Evaluation. I’ve done a lot of work on evaluation in my research career, things like Dynabench, which was about how we might hopefully get rid of benchmarks altogether and have a more dynamic way to measure model performance. Evaluation is seen as very boring; people don’t seem to care about it. I care deeply about it, so that always surprises me. We did this amazing launch, I thought, around LMUnit: natural language unit testing. You have a response from a language model, and now you want to check very specific things about that response: did it contain this? Did it not make this mistake? Ideally, you, as a person, can write unit tests for what a good response looks like. You can do that with our approach. We have a model that is by far state-of-the-art at verifying whether these unit tests pass or fail.
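The general shape of natural-language unit testing, as a sketch rather than the actual LMUnit API: each test is a plain-English criterion, and a judge model, here a hypothetical `judge` callable, grades the response against each criterion.

```python
# Sketch of natural-language unit testing for LLM responses. Each unit test
# is a plain-English criterion; a judge model grades pass/fail. The `judge`
# callable is a hypothetical stand-in, not the LMUnit API.
from typing import Callable

def run_unit_tests(response: str,
                   tests: list[str],
                   judge: Callable[[str], str]) -> dict[str, bool]:
    results = {}
    for test in tests:
        verdict = judge(
            f"Response:\n{response}\n\n"
            f"Unit test: {test}\n"
            f"Does the response pass this test? Reply PASS or FAIL."
        )
        results[test] = verdict.strip().upper().startswith("PASS")
    return results

tests = [
    "The answer cites a specific figure from the provided context.",
    "The answer does not speculate beyond the context.",
    "The answer is three sentences or fewer.",
]
# run_unit_tests(model_response, tests, judge=my_judge_model) would return
# a per-test pass/fail map you can track across deployments.
```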
I think this is awesome. I love talking about this, but people don’t seem to really care. It’s like, “Oh, yeah, evaluation. Yeah, we have a spreadsheet somewhere with 10 examples.” How is that possible? That’s such an important problem. When you deploy AI, you need to know if it works or not, and you need to know where it falls short, and you need to have trust in your deployment, and you need to think about the things that might go wrong, and all of that. It’s been very surprising to me just how immature a lot of companies are when it comes to evaluation, and this includes huge companies.
Jon: Garry Tan posted on social media not too long ago that evaluation is the secret weapon of the strongest AI application companies.
Douwe: Also AI research companies, by the way. Part of why OpenAI and Anthropic are so great is that they’re amazing at evaluation too. They know exactly what good looks like. That’s also why we do all of that in-house; we’re not outsourcing evaluation to somebody else. If you are an AI company and AI is your product, then you can only assess the quality of your product through evaluation. It’s core to all of these companies.
Jon: Whoever is lucky enough to get that cool JP Morgan head of AI job that you would be doing in another life, is what the evals really need to look like the intellectual property of JP Morgan, or is that something they can ultimately ask Contextual to cover for them?
Douwe: No. I think they can use us for the evaluation tooling, but the actual expertise that goes into that evaluation, the unit tests, they should write themselves. We talked about how a company is its people, but in the limit, that might not even be true, because it might be mostly AI and maybe only a few people. What makes a company a company is its data, and the expertise around that data, and the institutional knowledge. That is what defines a company, and that should be captured in how you evaluate the systems you deploy in your company.
Jon: I think we can leave it there. Douwe Kiela, thank you so much. This was a lot of fun.
Douwe: Thank you.