Open source AI is just getting started, and its momentum is accelerating. This will enable a proliferation of new LLMs that complement and work together with proprietary models. To understand how and why, and what needs to happen next, let’s zoom in and consider only the past two weeks of announcements from this fast-moving space.
The blistering pace of invention from two giant, proprietary partner companies – OpenAI and Microsoft – has inspired and challenged every one of us who thinks about AI and large language models (LLMs). Chris Ré of Stanford has called for a Linux Moment in AI with a proliferation of models, and Jon has written about an iPhone/Android tension in LLMs. But those predictions were all made “way back” in January 2023! Today, the 100M users of ChatGPT, the talent and capital these proprietary players have assembled, and the heavy resources required to train new large-scale models all raise the question: Are proprietary offerings the only ones we need? Are we entering an era of Google-style dominance, or perhaps a narrow oligopoly where a handful of players control the platform (models, apps, data feedback loop) for our entire industry?
Maybe. That would be a historic triumph for those players. It would cause a generational realignment of the way our industry innovates, builds, and operates. But our best guess is still that looking back at these early days of LLMs, such questions will sound like TJ Watson’s 1943 prediction of “a world market for maybe 5 computers.”
Instead of tight concentration, we expect to see a proliferation of LLMs. Advanced proprietary offerings will continue to be important. But enterprises will also have one or more of their own private LLMs. And new companies will be created by pushing the envelope of features and modalities. We expect that open source model architectures will play a major role in that proliferation. We can see that by considering the innovations, both proprietary and open, announced in just the past two weeks.
First, let’s review the progress from OpenAI and Microsoft in just the past two weeks. We have seen remarkable achievements at the science level (the launch of GPT-4), the operational level (Yusuf Mehdi’s post-hoc reveal that Bing Chat has been running at scale on GPT-4 since launch), the platform level (ChatGPT plugins), and the application level (Microsoft 365 Copilot). There is plenty of reason to believe we are just getting started: GPT-4’s image input capabilities are not even widely available yet. Elsewhere in the world of proprietary models, Adobe and Google also had major launches. Phew.
Next, let’s compare the pace of innovation in open source models over those same two weeks. At the science level, we saw the release of Stanford’s Alpaca model, which was fine-tuned from Meta’s LLaMA model and behaves similarly to OpenAI’s text-davinci-003 (a flavor of GPT-3). LLaMA is powerful yet small enough to be retrained for a modest amount (about $80,000) and instruction-tuned for about $600. It can even be run on a MacBook Pro, smartphone, or Raspberry Pi. We also saw the release of Databricks’ Dolly, an open-source model with ChatGPT-style behavior that developers can privately and inexpensively retrain as well. Not only is Dolly useful in its own right, but it also clearly demonstrates the power of fine-tuning (just 30 minutes of fine-tuning on a two-year-old base model to create Dolly). LLaMA, Alpaca, and Dolly are licensed for research only, not commercial use. But other new models, such as OpenChatKit from Together.xyz and a family of 7 Cerebras-GPT models from Cerebras, are published under the well-understood Apache 2.0 license, which allows commercial use. At the operational level, just in the past two weeks we saw new announcements from OctoML, Banana, Replicate, and Databricks, all of which facilitate model training and inference in the cloud. At the platform level, even ChatGPT plugins are already available to integrate with other models, through open-source implementations that launched less than 24 hours after the original announcement. At the application level, Runway launched its Gen-2 text-to-video feature, based on open source models. All of this happened in the past two weeks. It represents a rapid and broad-based pace of innovation in answer to the proprietary announcements described earlier.
Next, let’s consider how and why customers are using open source AI models today. We see millions of downloads on HuggingFace and millions of runs on Replicate for open-source computer vision models such as Stable Diffusion and ControlNet. On the language model side, we see reference customers for Cerebras’s training infrastructure that include GlaxoSmithKline and AstraZeneca (each with highly sensitive training data), and Jasper AI (a company that launched on OpenAI but is now seeking a deeper moat). Last week at Madrona’s annual meeting, Luis led a fireside chat with leaders of Coda, Lexion, and CommonRoom. They agreed on the importance of fine-tuning on their own proprietary data, and they shared concerns about the customer data protection issues that could arise when sharing data with any external party.
Common threads have emerged around both proprietary and open source models. Proprietary models such as those from OpenAI are useful because they are the most advanced and accurate, and also because they are easy to adopt. A new proprietary model from Cohere tied for first place with an OpenAI model in 16 “core” scenarios according to the latest Stanford HELM benchmarking exercise. Developers can readily access these models via API, and end-users can readily access them via applications like ChatGPT, Bing, or Microsoft 365. But at the same time, we see open-source models trailing the most advanced proprietary models by only about 6 to 18 months. Dolly and similar projects demonstrate the ease and power of fine-tuning to further improve accuracy. And open source offers the additional benefits of flexibility, configurability, and privacy.
Putting that all together, let’s now consider what we can see about the future:
- Already happening: Data and tools for managing models are fueling open source AI. Today, practitioners can access large-scale datasets such as LAION (images), the Pile (diverse language text), and Common Crawl (web crawl data). They can use tools from Snorkel, Ontocord, fastdup, and XetHub to curate, organize, and access those large datasets.
- Already happening: Customers such as Runway, Stability AI, Midjourney, Jasper, Numbers Station, and more are using open-source models to push the envelope on modalities and features. Customers like GlaxoSmithKline and AstraZeneca are using open-source models to maintain control of proprietary data.
- Count on It: Just as MLOps practitioners have constructed ensembles of models for years, we expect to see ensembles of LLMs that may include a mix of proprietary and open-source models. Such ensembles will allow developers to optimize on dimensions of accuracy/fit, latency, cost, and regulatory/privacy concerns.
- Already happening: Customers are already fine-tuning LLMs. As the base quality of models increases and tools become easier to use, fine-tuning will become even more practical and attractive compared with trading up to an even larger base model. That means that, increasingly, enterprises will have privately trained LLMs of their own (often many of them).
- Count on It: Taking the LLM ensembles described above just a few steps further, developers will also start to use open-source LLMs for inference and (limited) training at the edge, working together with model ensembles in the cloud.
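To make the ensemble idea above a little more concrete, here is a minimal sketch of how a developer might route each request across a mixed catalog of proprietary, private, and edge models. Everything in it is an assumption for illustration: the model names, the quality scores, the latency and cost figures, and the `route` helper are all hypothetical and do not describe any specific product. The sketch simply filters on accuracy/fit, latency, and privacy constraints, then picks the cheapest model that remains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    """One member of a hypothetical LLM ensemble (all figures are illustrative)."""
    name: str
    quality: float             # assumed 0-1 accuracy/fit score for the task
    latency_ms: float          # assumed typical response latency
    cost_per_1k_tokens: float  # assumed price in dollars
    private: bool              # True if data never leaves the enterprise boundary


@dataclass(frozen=True)
class Request:
    """Constraints a caller places on a single inference request."""
    min_quality: float
    max_latency_ms: float
    contains_sensitive_data: bool


def route(request: Request, options: List[ModelOption]) -> Optional[ModelOption]:
    """Return the cheapest model that satisfies quality, latency, and privacy limits."""
    eligible = [
        m for m in options
        if m.quality >= request.min_quality
        and m.latency_ms <= request.max_latency_ms
        # Sensitive data may only be sent to models that run privately.
        and (m.private or not request.contains_sensitive_data)
    ]
    # None signals that no model in the ensemble meets every constraint.
    return min(eligible, key=lambda m: m.cost_per_1k_tokens, default=None)


# Hypothetical catalog: a hosted proprietary API, a privately fine-tuned
# open-source model, and a small open-source model running at the edge.
CATALOG = [
    ModelOption("hosted-proprietary", 0.95, 900.0, 0.03, private=False),
    ModelOption("private-finetuned", 0.85, 400.0, 0.004, private=True),
    ModelOption("edge-small", 0.70, 80.0, 0.0, private=True),
]

if __name__ == "__main__":
    choice = route(Request(0.8, 1000.0, contains_sensitive_data=True), CATALOG)
    print(choice.name if choice else "no eligible model")
```

A real system would likely estimate quality per task rather than use a single static score, but even this simple filter shows why a mix of proprietary and open-source models gives developers more room to trade off accuracy, latency, cost, and privacy.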
We can also see what’s missing, and what has to happen next.
- Broad Collaboration on Data and Models: This movement will enable great businesses to be built, but it’s based on broad collaboration. So it’s important that we collectively support open source data projects like LAION and Common Crawl, which provide the basis for so much of the model training. Additionally, licensing of open-source models in a way that allows commercial use is just as important to ensure this collaboration.
- Broad Collaboration on Research: Much of the enabling research in this space is published by a small number of corporate labs. The commercial tensions are understandable, but it is imperative that such labs continue to publish meaningful research to maximize the collective feedback loop.
- Infrastructure and Ops: OpenAI, Cohere, and even HuggingFace have a tight coupling between the models they offer as an API and the guts of the platform used to deploy them: model optimizations, binary code, hardware choice, auto-scaling, and so on. This means that even if developers can select an open model or build their own custom one, they may not have a platform on which to deploy it. Banana, Modal, OctoML, and others are all working to offer platform options to “decouple models from systems engineering.” There is so much work to do here to make it easy for developers to participate in the open source AI movement.
We’ll close where we started, by recognizing the amazing innovation that OpenAI and Microsoft have achieved, as well as their peers at Cohere, Anthropic, and more. And we’ll also recognize the immense research contributions of HuggingFace, Eleuther, Stable Diffusion, Meta, Google, Amazon, LAION, Common Crawl, OntoCord, Stanford, University of Washington, and more that are enabling this wave of open source innovation.
We couldn’t be more excited for the future of LLMs, and what will become possible with the combined energies of these proprietary and open-source approaches.
Luis and Jon are builders and investors who have already made multiple investments in this space and are committed to making the future happen faster. If you are a founder building models, infrastructure, tooling, applications, or something else in the space of open source AI and would like to meet, get in touch at: [email protected] and [email protected]