I often think about the first time I tried to hijack biology’s central dogma. I’d been interested in studying how a specific mutation could affect the way a protein regulated the cell cycle in cardiomyocytes, or heart cells. It was a painstaking process. First, I had to create a DNA sequence encapsulating the mutated protein’s code. Then, I turned these sequences into RNA that — upon some gentle nudging — could be expressed as proteins by stem-cells-turned-cardiomyocytes.
Two months later, the euphoric moment came in a blacked-out room where I separated and identified the proteins expressed in my cell culture. There it was: a smudgy, wobbly band, exactly where I had expected it. Biology, the art of pipetting clear liquids into more clear liquids, felt less mysterious. To my knowledge, this protein, sporting a microscopic mutation, had never been found in nature before. But here it was, right in front of my eyes.
In the past few years, the concept of creating new biology has become more and more real. Instead of my brain meticulously rearranging letters to create a more optimal gene or amino acid sequence, artificial intelligence models can rapidly digest data about millions of gene networks to figure out which genes turn on where and when. They can create new sequences to regulate said genes, and design downstream proteins to accomplish specific tasks, like correcting a disease state or creating a better drug.
Inevitably, though, these new strategies for scientists to create biology evoke many questions. Like, what does bias mean in the context of a model trained to make gene sequences or proteins? And, most importantly, how can we leverage these capabilities to solve previously unsolvable problems in human disease?
We might not need to personally hijack the central dogma anymore — a computer can do it for us.
Predicting Genetic Conversations
Historically, studies that probe the underlying genetics behind human disease often pry out one or a few important genes to explore. But genes do not work in isolation. Instead, they talk to each other — and figuring out the content of those conversations is a difficult task.
For Christina Theodoris, a geneticist at the Gladstone Institutes, the task of mapping out genetic networks became supercharged when, as an MD-PhD candidate, she worked on building machine learning algorithms. These systems could help unveil how genes talked to each other, but the data wasn’t always there.
“One thing that really struck me through the course of all this research was that we need a huge amount of data to be able to confidently infer the connections between genes,” Theodoris said. “And we don’t always have that for rare diseases or diseases affecting clinically inaccessible tissues.”
To solve this problem, Theodoris and her team tried something called transfer learning. Think of transfer learning as how children learn to classify things from what they see, like birds. If a child observes an adult hawk and a baby hawk flying around one day, they can pick up the subtle colors and patterns that distinguish one from the other. This is in part because they’ve already established a strong foundation of what a bird should look like. From there, they can learn more nuanced differences.
Figuring out which genes are important when and where isn’t so different from classifying hawks. Theodoris’ transfer learning model starts out by learning from large amounts of unspecific data — in this case, the transcriptomes (all the produced RNA) of lots of different cells. This phase lets the model “gain this generalizable understanding of which genes are really central to the regulation of those cells,” said Theodoris. The model can then infer how genes are interconnected in different cell types.
Previous artificial intelligence models that predict how genetic networks work were often restricted to a specific domain, such as the genes relevant to liver cancer. If scientists wanted to probe elsewhere, the entire model would have to be revamped. But Theodoris’ “Geneformer” transfer learning model can be applied to many different tasks with just a little bit of fine-tuning. Her team published the work in Nature in May 2023.
Geneformer was trained on the transcriptomes of about 30 million cells from various tissues all over the body. To avoid skewing the model’s training toward one particular body part, Theodoris’ team made sure that no single tissue made up more than 25% of the final dataset. But because some tissues (like the brain) draw more funding than others, the final distribution is not equally split.
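To see what a 25% cap involves, here is a minimal sketch in Python. It is not the Geneformer team's actual procedure, which the article does not describe; the function name `cap_tissue_counts` and the toy cell counts are made up for illustration. The wrinkle it shows is that trimming one tissue shrinks the total, so the cap has to be solved for rather than read off the original dataset size.

```python
def cap_tissue_counts(counts, max_fraction=0.25):
    """Downsample over-represented tissues so that none exceeds
    max_fraction of the FINAL dataset. Hypothetical helper for illustration."""
    if len(counts) * max_fraction < 1:
        raise ValueError("too few tissues to satisfy the cap")
    capped = set()
    while True:
        # Cells from tissues that are kept in full.
        kept = sum(n for tissue, n in counts.items() if tissue not in capped)
        room = 1 - len(capped) * max_fraction
        if room <= 0:
            raise ValueError("cap cannot be satisfied")
        # Solve limit = max_fraction * (kept + len(capped) * limit) for limit.
        limit = max_fraction * kept / room
        newly_over = {t for t, n in counts.items()
                      if t not in capped and n > limit}
        if not newly_over:
            break
        capped |= newly_over
    return {t: (int(limit) if t in capped else n) for t, n in counts.items()}


# Toy counts (invented): brain is heavily over-represented.
toy_counts = {"brain": 600, "heart": 150, "liver": 150, "lung": 100, "kidney": 100}
balanced = cap_tissue_counts(toy_counts)
print(balanced)  # brain is trimmed to 166 cells; the other tissues are untouched
```

In this toy case the brain ends up just under a quarter of the trimmed dataset, while the remaining tissues still differ in size, which mirrors the point that a cap limits skew without making the split equal.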
I asked Theodoris what potential bias could mean when modeling gene networks. As it turns out, there are some biological features that can change gene expression, like age. “Age of course causes differences in cell state,” she said. “So, I think it is important to have a broad range and diverse samples.”
After initial training, Geneformer could distinguish which genes in the dataset were “master regulators,” where changes in that single gene’s expression would cause a cascade of downstream effects. This is important because it maps out genetic networks like a branching tree. If clumps of genetic pathways go haywire in a particular disease, a scientist may target the master regulator of those pathways rather than having to chase them down one by one.
The next step was to leverage Geneformer towards understanding how these gene networks changed between a normal tissue and a diseased one.
To do this, the team programmed Geneformer to generate so-called embeddings of the cell’s identity. An embedding is the model’s mathematical representation of something like a gene or a cell. Embeddings exist in an “embedding space,” kind of like a map where things that are more alike sit closer together than things that are different. Embeddings make it possible to distinguish mathematically between a diseased state and a healthy one. The model can then figure out which characteristics are the important differences between these two states.
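The "map" idea can be made concrete with a few lines of Python. This is only a toy sketch: the three-number vectors below are invented, whereas real embeddings come out of the trained model and have hundreds of dimensions. Cosine similarity is one common way to measure closeness in an embedding space; the article does not say which measure Geneformer's authors used.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented embeddings for three cells (assumption: 3 dimensions for readability).
healthy_cell_1 = [0.9, 0.1, 0.2]
healthy_cell_2 = [0.8, 0.2, 0.1]
diseased_cell = [0.1, 0.9, 0.7]

# Two healthy cells should sit closer together than a healthy and a diseased cell.
print(cosine_similarity(healthy_cell_1, healthy_cell_2))
print(cosine_similarity(healthy_cell_1, diseased_cell))
```

The first score comes out far higher than the second, which is the numerical version of two points sitting near each other on the map.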
After fine-tuning Geneformer on data from diseased and healthy heart tissue, the team looked for genes that, when perturbed, might shift the diseased cell into a healthy state. When the scientists knocked out four of these genes in a lab dish model of diseased cardiomyocytes, two knockouts did actually improve the cells’ condition.
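A stripped-down version of this "perturb on the computer first" idea might look like the Python sketch below. Everything in it is an assumption made for illustration: the gene names are placeholders, the expression values are invented, and a plain expression vector stands in for the embedding that the real model would compute. The sketch silences one gene at a time and ranks genes by how much closer the cell moves toward the healthy state.

```python
import math


def distance(state_a, state_b):
    """Euclidean distance between two cell states keyed by gene name."""
    return math.sqrt(sum((state_a[g] - state_b[g]) ** 2 for g in state_a))


def rank_knockouts(diseased, healthy):
    """Rank genes by how far knocking each one out moves the diseased
    cell toward the healthy state (positive shift = closer to healthy)."""
    baseline = distance(diseased, healthy)
    shifts = {}
    for gene in diseased:
        perturbed = dict(diseased)
        perturbed[gene] = 0.0  # simulated knockout: expression set to zero
        shifts[gene] = baseline - distance(perturbed, healthy)
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Invented toy states: GENE_A is overactive in the diseased cell.
healthy_state = {"GENE_A": 0.0, "GENE_B": 1.0, "GENE_C": 0.5}
diseased_state = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 0.6}

for gene, shift in rank_knockouts(diseased_state, healthy_state):
    print(gene, round(shift, 3))
```

Only the top of such a ranking would go on to the bench, which is the prioritization Theodoris describes: a short list of candidates instead of every possible perturbation.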
It’s an approach that, after decades of research on heart disease, unveils how much is still to be learned. There’s an “astronomical number of different perturbations that you can do to a cell,” said Theodoris. “If you’re able to do that computationally and more efficiently, you can prioritize your downstream experiments, test a smaller number of candidates, and have a higher rate of success.”
Writing Regulatory Code for the Genome
Theodoris’ work, along with other models that predict gene networks, shows how AI can help generate new insights into the ways genes talk to each other. You could then use these insights to permanently knock out whichever genes the model suggests might fix the disease, but that might create other problems down the line.
The complexity of our genome percolates into many layers. Gene expression can be controlled by regulatory elements, which are stretches of DNA that mediate how often the genes around them are expressed. It’s not an entirely binary operation. Genes can turn on and off, and their expression level can change like the volume dial on a radio. A lot of ongoing research focuses on figuring out which regulatory elements go where, and what genes they control. But beyond that, what if you could create new regulatory elements that would modulate gene expression in a more precise way for the specific genes you’re interested in?
This was a question that intrigued Luca Pinello, a computational biologist at Harvard Medical School. Pinello approached biology from a background in computer science. He quickly saw how both could be combined in the field of gene regulation.
Gene regulation is a field full of unsolved mysteries, in part because many of its actors reside inside the non-coding parts of the genome. These regulatory elements, such as enhancers, can be as short as 50 base pairs or as long as 1,500 base pairs, and can control gene expression up to 1 million base pairs away.
Pinello saw an opportunity to use generative artificial intelligence models, which generate new information, to glean more knowledge about how enhancers work. “If you teach a model to write new DNA sequences, maybe you can also use the model to extract information,” he said. This could be information about which regulatory sequences are more common in a certain cell type, for example.
How to pick which generative model to use, though? On the way back from a scientific conference in Washington, D.C., a postdoc in Pinello’s lab, Lucas Ferreira da Silva, was playing around with an image generation model when he realized that it might be useful for gene regulation.
“We were generating pictures of Italians eating pineapple on pizza,” Pinello said. “You would never see this in reality.” Inspired by the fact that these models were powerful enough to commit sacrilege against Italian food, Pinello and his team decided to co-opt the so-called “diffusion model” to generate new regulatory elements. They named the model DNA-Diffusion.
A diffusion model learns by seeing “noisy” inputs. For an image, this could mean a couple of grayed out pixels. For the purposes of generating regulatory elements (which are just strings of DNA nucleotides), this noise is the probabilistic uncertainty that one particular nucleotide will appear in its DNA sequence. As the model gets more “noised” input data, it will learn to “denoise” the data — essentially creating a new output.
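To picture what "noise" means for a DNA sequence, consider this small Python sketch. It is a deliberate simplification and not DNA-Diffusion's code: each position starts out certain about its nucleotide, and the hypothetical `add_noise` function blends that certainty toward a four-way coin flip. A real diffusion model adds random noise over many steps and trains a neural network to undo it; only the corrupting half is shown here.

```python
NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as per-position probabilities over A, C, G, T."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in sequence]


def add_noise(probabilities, level):
    """Blend each position toward a uniform distribution.
    level = 0.0 leaves the sequence clean; level = 1.0 erases it entirely."""
    uniform = 1.0 / len(NUCLEOTIDES)
    return [[(1.0 - level) * p + level * uniform for p in position]
            for position in probabilities]


def most_likely(probabilities):
    """Read back the most probable nucleotide at each position."""
    return "".join(
        NUCLEOTIDES[max(range(len(NUCLEOTIDES)), key=lambda i: position[i])]
        for position in probabilities
    )


clean = one_hot("GATA")
half_noised = add_noise(clean, 0.5)
print(half_noised[0])            # G is still the likeliest base, but less certain
print(most_likely(half_noised))  # the original sequence is still recoverable
```

At a noise level of 1.0 every position becomes an even four-way guess and nothing of the original sequence survives. Learning to walk back from that state toward plausible DNA is the "denoising" the model is trained to do.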
In metaphorical terms, diffusion models learn similarly to how humans build houses, says Gevorg Grigoryan, chief technology officer of Generate:Biomedicines, a company that uses diffusion models to create new protein-based drugs. “We can decompose the house into its own metric bits and describe to you how to go about building one,” he said.
Even if you built a structure that looks slightly different from the original blueprint, it’s still considered a house. “Diffusion is when you destroy a thing in a systematic way, and then learn how to build it,” Grigoryan added.
Pinello and his team added another layer of complexity, called “conditioning,” to their diffusion model, according to a preprint published in February 2024. “Conditioning means that you can teach the model to understand the context,” he said. For DNA-Diffusion, this means that when generating new sequences, you could ask the model to create a regulatory sequence specific to a certain cell type.
To train the model, the scientists used chromatin accessibility data. DNA clumps up into a complex called chromatin. Chromatin unspools at certain locations so that the proteins required to start gene expression can reach regulatory elements in the DNA. (Enhancers, which increase gene expression, are regions of unspooled (“open”) chromatin, generally speaking.)
The team had the end goal of training the model to generate new regulatory elements for three specific cell types (lymphocytes, liver cancer, and leukemia cells). After generating 100,000 short DNA sequences per cell type as proposed regulatory elements, the team modeled whether the sequences would indeed affect gene expression in desired ways. They used two computational models — ChromBPNet from Anshul Kundaje’s group at Stanford University, and Enformer from Calico Life Sciences. The first could predict how those specific sequences would affect chromatin accessibility, and the second predicted the effect on gene expression.
This allowed them to zoom in on the gene GATA1, known to be involved in blood cell development. Sequences designed to increase chromatin accessibility for GATA1 did so in leukemia cells as desired, without inducing the same change in the other two cell types. This told the team that sequences spit out by their diffusion model were generally specific to the cell type that they were designed for.
The same result emerged in terms of predicted gene expression — showing that the sequences “can open the chromatin and drive gene expression,” said Pinello. “This was quite exciting because the model was never trained on gene expression data.”
Interestingly, the team also found that DNA-Diffusion could design sequences that would have different levels of regulatory activity on the same gene — turning the volume up to different degrees, as opposed to just one setting. While the scientists have not yet published their data on validating the activity of these newly generated regulatory sequences inside actual cells, Pinello is optimistic, and the paper describing this work is currently under review.
DNA-Diffusion joins Genentech’s regLM (a language model like OpenAI’s GPT) and other deep learning models or generative adversarial networks, all directed towards the goal of designing new regulatory elements for the human genome.
Pinello can see how these models could be useful. One strength, he says, is being able to create regulatory sequences for specific cells. “You can also fine-tune the level of [gene] activation,” he added. This might be useful for something like gene therapy, where different patients could require varying degrees of gene activation. Scientists could then control the amount of downstream protein produced in a way that is more personalized to the patient than a one-size-fits-all therapy.
Creating New Proteins
While using AI to better understand gene networks or design new regulatory elements has fascinating consequences for human health, a fundamental use case in biology is to create new, better proteins. I think back to that little mutated protein isolated in my undergraduate lab — the ultimate goal was to find proteins that help cardiomyocytes divide more after a heart attack. A therapeutic protein-based intervention.
In 2021, scientists at DeepMind published AlphaFold, a neural network model that could accurately predict protein structure. Subsequent releases and other AI-based models that predicted protein structures soon followed.
It seemed natural, with the improvement of these AI models, to then try to make new proteins. “The sheer accuracy and expressiveness of machine learning, and the remarkable progress made in image generation and text generation, made it feel like the time was right for applying that sort of thing to molecules and proteins,” said David Juergens, a postdoc in computational biologist David Baker’s lab at the University of Washington.
Luckily for Juergens, scientists around the world had been contributing to a large repository, called the Protein Data Bank, that contained hundreds of thousands of protein structures. This database was the perfect source of training data, enabling Juergens and other scientists in the Baker Lab to develop RFdiffusion, a conditional diffusion model that could generate new protein structures. They published this model in July 2023 in Nature.
While the Protein Data Bank is a treasure trove of protein structures, there are inevitably biases to what sorts of proteins are included — like the problem that Theodoris and her team faced when trying to scrape cellular transcriptomes off publicly available work. “One bias is that all of these structures are probably proteins that are easier to purify,” said Juergens. “People also just have interest in certain families of proteins much more than other ones.”
The diffusion strategy used in RFdiffusion is similar conceptually to that in Pinello’s DNA-Diffusion. Here, though, the noisy input data are the atomic coordinates for various proteins, which describe which atoms are placed where in the protein’s 3D structure.
Juergens and his team could use some strategies, like sequence clustering, to better equalize the types of proteins that were included as training data. Next, they added some fancy features into the model through conditioning — like Pinello’s group had done with DNA-Diffusion. One was the capability to input a specific protein backbone, then design something that would bind to that backbone. This might be useful for something like drug design where you would want to have a molecule that sticks tightly to the target. They also trained RFdiffusion to design symmetrical proteins — which could be useful for vaccine design.
Chroma, a generative protein structure model published by Grigoryan and colleagues in November 2023 and used by Generate:Biomedicines, also fundamentally depends on diffusion. According to Juergens, the differences between these two models lie mostly within their respective neural network architectures. Other generative protein models, like ProGen (a language model from Salesforce Research in January 2023), do not rely on diffusion but can achieve similar goals of creating new proteins.
Grigoryan’s Chroma similarly utilizes conditioning to enable protein design guided by specific parameters, by using classifiers. “The classifiers basically ingest the protein and write a caption of what it thinks this protein is,” he said. “So now, you can start conditioning based on text.” For example, one could simply specify that they wanted an “oxidoreductase” protein of a certain class.
To test how these proteins would function in real life, both teams used Escherichia coli bacteria as a model system to synthesize the new amino acid sequences. Juergens’ team attached a tag to the end of each protein sequence so that it could be easily purified from the E. coli. “Once you do that, we’ve got a test tube that’s full of mostly buffer and our protein,” he said.
It’s a series of events that still gives me a little bit of whiplash — tell an AI model to design a protein with specific characteristics, then generate a bacterial plasmid (a small loop of DNA) containing that new protein’s sequence. Next, express ship that plasmid over to the lab, stick it in some bacteria, and boom, the new protein exists in the world.
That being said, Juergens’ colleagues and Grigoryan’s team were both immensely excited to see their newly designed proteins fulfill a lot of their design requirements in real life. When Juergens tested the generated binding proteins to see if they bound to their target, the success rate was 19%.
For Grigoryan, the exciting part came when one of the generated protein sequences from their computational platform entered clinical trials — one of the first therapies designed with generative AI to reach clinical development. Their candidate, GB-0669, was designed as an antibody to protect against COVID-19 infection. While the company has not yet released clinical results, Grigoryan says that the data looks “very positive, and we’re very excited to further make this into a drug.”
New Biology in the Future
In recent months, even more tools have popped up that use AI to generate biological structures. Evo, a foundation model, can generate DNA sequences at genome scale. Profluent Bio, a biotechnology startup run by the developers of ProGen, can create new CRISPR-Cas systems. ESM3, a gigantic language model from the company EvolutionaryScale, can also generate proteins based on user specifications. It is heartening that most (if not all) of these models have been released as open source — allowing scientists all over the world to play with the code.
But there’s always more to do and more to improve upon. Besides overcoming training biases with more datasets and a greater diversity of data types, scientists are also concerned about scalability when designing these model frameworks. Biological structures can be extremely complex. Both time and computational resources are very expensive. And after all of that, there is the herculean task of validating if the new sequences or proteins generated by these models actually work in real life.
That particular job is still left up to humans. I joked to Theodoris that it was somewhat ironic that, after all this AI-based creation of biological substances that are ideally superior to what is found in nature, human beings would probably still be pipetting clear liquids in the lab, trying, experiment after experiment, to test and validate whatever outputs the model came up with.
“I don’t think that you’re ever going to go straight from the model into a clinical trial,” she said. “We do a lot of preclinical safety studies.”
What may subtly change is the way that humans do science.
I was taught to do hypothesis-driven research: to always have a clear question in mind when designing an experiment. Now, computational models can create new hypotheses for us. “In my lab, when we design experiments, we’re not just thinking about how we can answer our biological question, but also if that information can be important for improving the model and giving it feedback,” Theodoris said. “You can design experiments to fill gaps in knowledge.”
Ideally, it’s like an infinite loop — human scientist and AI model working in tandem to help one another.
This change can trickle into the ways by which we create new therapies. “It’s not just about making a drug,” Grigoryan mused. “It’s about changing the process so that it’s much more intentional, data-driven, efficacious, and rapid than it is today.”