The Next Frontier For Large Language Models Is Biology

Large language models like GPT-4 have taken the world by storm thanks to their astonishing command of natural language. Yet the most significant long-term opportunity for LLMs will entail an entirely different type of language: the language of biology.

One striking theme has emerged from the long march of research progress across biochemistry, molecular biology and genetics over the past century: it turns out that biology is a decipherable, programmable, in some ways even digital system.

DNA encodes the complete genetic instructions for every living organism on earth using just four variables—A (adenine), C (cytosine), G (guanine) and T (thymine). Compare this to modern computing systems, which use two variables—0 and 1—to encode all the world’s digital electronic information. One system is binary and the other is quaternary, but the two have a surprising amount of conceptual overlap; both systems can properly be thought of as digital.

To take another example, every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with 20 different amino acids to choose from. This, too, represents an eminently computable system, one that language models are well-suited to learn.

As DeepMind CEO/cofounder Demis Hassabis put it: “At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”

Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb.
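To make the binary/quaternary parallel concrete, here is a minimal illustrative sketch (not from the article): each of DNA's four bases maps onto exactly two bits, and a protein is likewise a string over a 20-letter amino-acid alphabet. The mapping and function names are my own, chosen for illustration.

```python
import math

# Four DNA bases map naturally onto two bits per base.
DNA_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode_dna(seq: str) -> str:
    """Encode a DNA sequence as a bit string, two bits per base."""
    return "".join(DNA_TO_BITS[base] for base in seq.upper())

# A protein is a string over the 20 standard amino acids, so each
# residue carries log2(20) ≈ 4.32 bits of information.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard one-letter codes
bits_per_residue = math.log2(len(AMINO_ACIDS))

print(encode_dna("GATTACA"))       # 7 bases -> 14 bits: 10001111000100
print(round(bits_per_residue, 2))  # ~4.32 bits per amino acid
```

Either way, the point stands: both alphabets are small, discrete and sequential, which is exactly the kind of data that sequence models consume.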