January 6, 2021

Reverse Engineering the Pfizer Vaccine's Codon Optimization


One of my favorite recent pieces of science writing is by Bert Hubert, who wrote this excellent post breaking down Pfizer/BioNTech’s vaccine sequence into its functional parts by comparing them to engineering concepts. On the last day of the year, Bert followed up with Part 2, which challenged readers to reverse engineer the differences between the virus’s RNA sequence and Pfizer/BioNTech’s vaccine RNA sequence.

This challenge nerd-sniped me hard, because not only do I work at a biotech company and have an affinity for generally-accessible technical/scientific writing, but my team literally built Benchling’s Codon Optimization tool way back in the before-pandemic-feels-like-a-century-ago 2019. I couldn’t resist and whoops, there went my last morning of 2020. Of course, we don’t know exactly how Pfizer/BioNTech optimized their sequence, but that’s the fun of reverse-engineering: if successful, either you learn something about how an invention was produced, or you learn that there are multiple valid paths to get to the same result; either way is pretty cool.

I wanted to do a quick writeup of the optimization I did as part of the challenge, so here it is!

First, some background

The Central Dogma of molecular biology is that DNA (a double-stranded helical molecule composed of the bases A, C, G, and T) is transcribed into RNA (a single-stranded molecule composed of the bases A, C, G, and U), which is translated into proteins (molecules composed of amino acids chained together in a line, which then fold into shapes and do basically everything in our cells).

Translation works by mapping RNA bases to amino acids. This happens by reading a sequence of RNA in groups of three bases, called codons. Each codon maps to one of the 20 natural amino acids. Thus, the RNA string CCUCAUAGU codes for the amino acid chain ProHisSer. And when I say “mapping”, I literally mean that there are molecules called tRNAs floating around in your cells’ cytoplasm that stick to a specific amino acid on one end and a specific codon on the other, linking the two together.

Image title

However, since we have 64 possible codons (4^3) but only 20 amino acids, there are redundant mappings, as you can see in the chart above. This redundancy leads to some pretty cool bioengineering hacks: multiple different RNA sequences can code for the same protein (what does CCACAAAGC code for?), and those sequences may differ on axes a scientist would care about. “Codon optimization” is the method of improving an RNA sequence along those other axes while keeping the translation the same. What axes might a scientist care about?

GC content

GC content is directly related to melting temperature (my naive understanding is that G-C bonds are harder to break than A-T bonds, so the more GC bonds that exist, the more energy you need to use to break them, leading to higher melting temperatures).

Avoiding hairpins

This lets you eliminate options that would cause a single strand to bond to itself, which could impact other molecules’ ability to interact with the sequence as intended.

Image title

Organism optimization

It turns out that different species of animals have different proportions of tRNAs for a given amino acid in their cells. If you take, say, Alanine (GCA, GCG, GCC, GCU), you can compare the frequency of those codons relative to each other across two different organisms. As a made-up example, a bat might have 40% GCUs and only 10% GCGs, while a snake might have the reverse. Empirically, this means that if you’re taking a DNA sequence from one organism and trying to express it in the cell of another organism, you’re going to get less efficient production of that protein. My mental model of this is that producing the protein takes longer, because you have to wait for more time for the right tRNA to come along - so if you have a lot of GCUs in your sequence, but not many GCU tRNAs floating around, you’ll produce less of the same protein in a given amount of time. This means that you’d want to optimize your sequence for the organism of the production cell.

How do you do that? Well, there are a bunch of different approaches. A common approach is to simply replace each codon with the synonymous codon that appears most frequently in the target organism. Another approach, “harmonization”, tries to match the frequencies for each amino acid to the target organism’s codon frequencies - so if you’re producing a bat protein in a snake, you’d want ~10% of the Alanine codons to be GCUs and ~40% to be GCGs.

The Challenge

Given the virus sequence and the Pfizer vaccine sequence (both below), write an algorithm that can start with the virus and produce the closest match possible to the vaccine sequence.

Bert’s initial approach asked for a brief algorithm, but I thought that might be difficult:

It’s clear that Pfizer/BioNTech used some form of codon optimization when creating the vaccine, because while the vaccine codes for the same spike protein that SARS-CoV-2 produces (well, almost, except for the stabilizing proline substitutions), it’s got a LOT of changes compared to the virus’ original sequence (read Bert’s original post for more detail).

Enter DNAChisel, an open-source library for Codon Optimization. This is a very cool Python library that allows you to optimize a given DNA sequence along a huge variety of parameters. You can specify constraints (goals which must be met, otherwise the problem will fail) and/or optimizations (no guarantees). Most optimizations will want to maintain the same translation, and so will use the EnforceTranslation constraint. For each of the axes described above, they have a corresponding constraint/optimization, plus many more. Each of these is called a Specification, and you can write your own - you just have to be able to write a scoring algorithm that can compare two sequences and tell you which one is better. Then, DNAChisel will introduce random mutations, evaluate the new sequence across all of the specifications, pick the best one, and continue - but it does this on localized subregions of the sequence, not just the whole sequence. Here is a great overview.

So, using DNAChisel, I coded up pretty much the simplest working DNAChisel problem possible and gave it a shot! My attempt came out to about an 80% match, a big improvement over the best one at the time.

Here, I optimize for a Hamster (on the wild guess that Pfizer was using CHO cells to produce the vaccine) and enforce GC content between 40% and 70%

After that, I decided to sign off of Twitter and enjoy the last day of the year. When I checked back again, I saw that other folks had taken my code and improved it, which was awesome! Hilariously, I had an outdated version of DNAChisel on my local machine, and simply upgrading to a newer version and optimizing for humans instead of hamsters got a whopping 10% boost, bringing the score up to 91%. Since then, some even simpler solutions have come out that get slightly better performance, but nothing significantly above DNAChisel’s. We still have a ~10% difference to account for; my guess is that adding in an optimization that maximizes Uridine content would get us more of the way there (see Bert’s previous post for why), but I’ll leave that up to the reader to verify. 😊

Overall, this was a really fun way to end the year with a combination of biology, programming, and topical vaccine news. Huge thanks to Bert for running it, and to all the participants!