Protein Design: Nature's Math Origami

^{^{[Kaleidoscope of Designed Proteins NatMag]}}

Even if not resolving a biology problem one may hope for a flexible formal language for encoding experimental data or/and for something mathematically non-trivial and yet not fully biologically absurd.

^{^{Mikhael Gromov. IHÉS Paris. June 26 of 2007^[1]}}

In the 1960s, chemists^[2] found that proteins unravel when denatured, but return to their original shape once they cool off. They behave like a coil, except that the recovery depends neither on mechanical deformation, as with a coil, nor on the internal machinery of the cell, but on the particular nature of the amino acids.

Proteins are nanomachines. They operate at extremely high speeds and use energy under a different paradigm than macro-machines do. They do work by changing their shape, a property that results from the interactions among the amino acids that make them up.

More formally, proteins are chains of amino acids joined by peptide bonds, and they adopt different spatial structures. As I've mentioned before, form and function go hand-in-hand in biology, so reliably predicting and designing new structures would have a direct impact on the function of these proteins.

I'm biased towards this, as it is my field of study, but besides particle physics and A.I. this is the most promising and exciting development in science at the moment.

The Problems

Computational biology is hard. So hard that it is more common for new mathematical structures to emerge from biology than the other way around. I'll explain why.

There are two problems one must approach. The first is structure prediction: starting from genome sequences to arrive at macromolecular structures. The second is the design problem, where one starts with the macromolecular structure one wants to make and arrives at a genome sequence.^[3]

Currently, protein structure is described at 4 levels, and proteins are classified by their structural similarities in databases such as SCOP.

^{^{[Protein structure levels: Wiki]}}

The primary structure refers to the sequence of amino acids in a linear chain (polypeptide).
The secondary structure refers to local repeating patterns along the chain, the main ones being the α-helix and the β-sheet (less frequently β-turns and omega loops). They are set by the dihedral angles of the backbone; the Ramachandran plot is a way to visualize which combinations of these angles are energetically allowed.
The tertiary structure refers to the monomeric (an individual pack of secondary structures locked by interactions like salt bridges and disulfide bonds) or multimeric (multiple packs) conformation.
The quaternary structure refers to the number and arrangement of subunits in a multimeric complex.
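The dihedral angles mentioned above can be computed directly from atomic coordinates. The sketch below is a minimal illustration, not code from any particular library: the function name `dihedral_deg` and the four example points are hypothetical. For a protein backbone, φ would be taken over the atoms C(i-1), N(i), Cα(i), C(i) and ψ over N(i), Cα(i), C(i), N(i+1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points in 3D."""
    b0 = _sub(p0, p1)  # bond pointing back from p1 to p0
    b1 = _sub(p2, p1)  # central bond, the rotation axis
    b2 = _sub(p3, p2)  # bond pointing forward from p2 to p3

    # Normalize the central bond so projections onto it are simple dot products.
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)

    # Remove the components of b0 and b2 that lie along the central bond.
    k0, k2 = _dot(b0, b1), _dot(b2, b1)
    v = (b0[0] - k0 * b1[0], b0[1] - k0 * b1[1], b0[2] - k0 * b1[2])
    w = (b2[0] - k2 * b1[0], b2[1] - k2 * b1[1], b2[2] - k2 * b1[2])

    # The angle between the two projections is the dihedral angle.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: the two outer bonds are perpendicular
# when viewed along the central bond.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing φ and ψ this way for every residue of a solved structure and plotting one against the other gives the Ramachandran plot.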

The native structure of a protein is probably the lowest-energy state for its sequence in its solution medium, such as the inside of a cell. This is known as the folding funnel hypothesis. Each amino acid has its own low-energy preferences, which affect the overall end result.^[4] So the problem of design is: how do you go from the unfolded, high-energy state to the folded, low-energy state?

This is a computational problem of astronomical complexity. It is a really hard problem.

For each amino acid in a protein chain there are several rotatable bonds, around 3 per amino acid, which gives ≅3ⁿ configurations for a chain of n amino acids. This means that for a peptide of 100 amino acids (a small protein) you have 3¹⁰⁰ possible configurations. As if this wasn't enough, the number of protein sequences depends on the particular amino acid residue (out of the 20 amino acids) at each position in the chain.^[5]

This means the possibilities are on the order of $20^{n} \cdot 3^{n}$: $20^{n}$ possible sequences, each with about $3^{n}$ configurations. Big? Well, that's only for the protein's primary structure.
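To get a feel for these numbers, here is a minimal sketch in Python, assuming roughly 3 states per residue and 20 possible amino acids per position, as in the text. The function names are hypothetical, and the counts are back-of-the-envelope figures rather than a model of real conformational space.

```python
def conformations(n, states_per_residue=3):
    """Rough count of backbone configurations for a chain of n residues."""
    return states_per_residue ** n

def sequences(n, alphabet=20):
    """Number of distinct amino acid sequences of length n."""
    return alphabet ** n

def order_of_magnitude(x):
    """Exponent of the leading power of ten of a positive integer."""
    return len(str(x)) - 1

n = 100  # a small protein
print(order_of_magnitude(conformations(n)))                 # 47
print(order_of_magnitude(sequences(n)))                     # 130
print(order_of_magnitude(sequences(n) * conformations(n)))  # 177
```

Python integers have arbitrary precision, so the counts are exact; only the printed exponents are rounded down.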

^{^{[Low energy-Low entropy/enthalpy: Wiki SCIstyle]}}

For the secondary structure, you need to calculate the lowest energy of the individual atoms in the protein and of the atoms of the surrounding water, to determine how the protein structure will collapse in a water solution and reach its low-energy state, a.k.a. the native structure. This requires finding the Gibbs free energy and performing geometrical optimizations which, although hard, are more tractable problems.^[6]
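As a reminder of what is being minimized, the standard thermodynamic definition of the Gibbs free energy change can be written for folding as below. The subscripts are labels introduced here for illustration, and each term is meant to cover the protein together with its surrounding water.

```latex
\Delta G_{\text{fold}} = \Delta H_{\text{fold}} - T \, \Delta S_{\text{fold}},
\qquad
\Delta G_{\text{fold}} = G_{\text{folded}} - G_{\text{unfolded}} < 0
\quad \text{for spontaneous folding}
```

Here $\Delta H$ is the change in enthalpy, $T$ the absolute temperature and $\Delta S$ the change in entropy.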

To make this relatable, think of calculating the type, size, angle and moment at which to throw a sheet of paper into the wind so that it folds itself into a crane.

The tertiary structure is a simpler problem (still hard), as it apparently depends on a limited set of tertiary structural motifs (e.g. the helix-loop-helix motif, which is a super-secondary structure), approximately 2,000 distinct folds.^[7]

^{^{[3D crystallography modeling: NatMag]}}

After the success of the Human Genome Project, the number of protein sequences that come out each day far surpasses what can be characterized by X-ray crystallography or by NMR spectroscopy, which limits brute-force approaches. This geometrical problem grows further depending on the packing of the side chains.

The cost and effort of characterizing proteins by imaging techniques is a huge limitation in the field. So it's a pleasant surprise just how many structures are being characterized thanks to modeling.

^{^{[NatMag: Comparison of characterized vs modeled in the space of proteins in nature]}}

The quaternary structure is a lot simpler. It can already be predicted with high accuracy for protein complexes, as protein-protein interactions are predicted through the study of flexible and rigid macromolecular docking, thanks to arduous inductive and modeling work on intracellular pathways and biochemistry. The crowdsourced nature of this approach is bearing fruit.

^{Rosetta@Home: De novo protein prediction using crowdsourced distributed computational power, and even gamification in the form of Fold.it, where you can solve puzzles for science [NatMag]}

The Engineering Approach

Other fields of science and industry have almost drained biology of people with training in engineering, maths, and physics. For a long time the lack of people from these fields went largely unnoticed, until recently, when computational progress and information theory interconnected all fields more than ever.

For instance, in the field of medicine, most of the tools developed for tissue replacement have been engineered by doctors themselves. This was done with materials they found at home, which are not necessarily the best for the human body. Slightly shameful are these examples of naive materials science:^[8]

Medical Use	Initial Use	Polymer
Artificial Heart	Ladies' Girdle	Polyether Urethane
Dialysis Tubing	Sausage Casing	Cellulose Acetate
Vascular Graft	Clothing	Dacron
Breast Implants	Lubricant, Mattress Stuffing	Silicone, Polyurethane

Fortunately, thanks to the outsourcing to engineering, much better alternatives are being found, although, given the economic incentives in the industry, the existing materials will take some time to be replaced.

Information gain has mainly been possible thanks to mathematical modeling of gene sequences and advancement in 3D graphics.

^{Long distance spatial relationships in the genome [NatMag]}

One of the first guesses for proteins bigger than 100 amino acids was that a particular combination of amino acids, occurring a certain number of positions from each other, would start a folding pattern. They were tracked by following amino acids that end up spatially close or together in the final 3D structure. If this were true, a mutation in the gene sequence that altered one member of the combination would have to alter the other as well to keep the structure.^[9]

The comparison of homologous proteins across species gave a strong hint that this was true: changing one of those amino acids rendered the protein useless, but changing both in the sequence so that they still ended up together conserved the function.
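One common way to look for such co-varying positions is to measure the mutual information between columns of a multiple sequence alignment of homologous proteins. The following is a minimal sketch of that idea, not the method used in the cited papers; the alignment `toy_msa` and the function name `mutual_information` are made up for illustration.

```python
from collections import Counter
from math import log2

def mutual_information(msa, i, j):
    """Mutual information, in bits, between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n                              # joint frequency of the pair
        p_a, p_b = count_i[a] / n, count_j[b] / n  # single-column frequencies
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical alignment: positions 1 and 3 always swap together (K/E or E/K),
# position 0 never changes, and position 2 varies on its own.
toy_msa = ["AKLE", "AKLE", "AELK", "AELK", "AKVE", "AEVK"]

for i, j in [(1, 3), (1, 2), (0, 1)]:
    print(i, j, round(mutual_information(toy_msa, i, j), 3))
```

In this toy alignment only the pair (1, 3) carries information about each other, which is the signature one would expect from two residues that must change together to stay in contact. Real analyses need many sequences and corrections for shared ancestry, which this sketch ignores.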

In design, an idealized version of general proteins is achieved. This is done by calculating the optimal amino acid sequence for the desired protein, then back-translating the amino acid sequence to DNA to design the gene that encodes it. The gene is put into bacteria, the protein is purified, and its structure is solved by crystallography. Basically, reverse engineering. Not surprisingly, this allows super-precise calculations with atomic-level accuracy.^[10]
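The back-translation step can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and one arbitrarily chosen codon per amino acid; real gene design would also weigh codon usage in the host organism and other constraints. The names `CODON_FOR`, `STOP_CODON`, `back_translate` and the example peptide are hypothetical.

```python
# One valid codon per amino acid in the standard genetic code.
# The choice among synonymous codons here is arbitrary (an assumption).
CODON_FOR = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}
STOP_CODON = "TAA"

def back_translate(protein, add_stop=True):
    """Return one DNA sequence that encodes the given one-letter protein sequence."""
    try:
        codons = [CODON_FOR[aa] for aa in protein.upper()]
    except KeyError as err:
        raise ValueError(f"unknown amino acid code: {err.args[0]!r}") from None
    if add_stop:
        codons.append(STOP_CODON)
    return "".join(codons)

print(back_translate("MKV"))  # ATGAAAGTGTAA
```

Because most amino acids have several synonymous codons, many different genes encode the same protein; this sketch returns just one of them.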

^{2D arrays: Self-assembling nanomaterials^[10]}	^{Information: can be coded into protein sequences like DNA. "Biocomputers"^[11]}
^{Antagonists: bind to a target protein, blocking its activation^[12]}	^{Channels: through membranes act as gateways^[13]}
^{Cages: can contain medicinal cargo or carry it on their surfaces^[14]}	^{Sensors: travel throughout the body to detect various signals^[15]}

As evolution is a conservative process, there's a lot of baggage in the formation of natural proteins. Designed ideal proteins are far more resistant to denaturation and more stable structurally. There are already principles and tools to do this.^[16] At the moment you can create structures that mimic the capsid of viruses as drug delivery systems, and they could probably be the future of vaccines.^[14]

This opens the door to a marriage between engineering and biology, to create human-made nanomachines with energy efficiency and speed never seen before. The consequences of motorized tools that do not rely directly on internal combustion engines or electric motors could be world-changing at a sci-fi level.

One needs only to remember that a molecule of glucose moves inside the cell at 400 km/h, reaching its target while competing with other molecules that also travel at that speed. At scale, it would be like coordinating cars that travel at 35 million kilometers per hour in the middle of a crowded Times Square.^[17],^[18] The ATP synthase molecule of an E. coli spins at 42,000 rpm.^[19] So it is not strange that enzymes collide with other molecules 500,000 times per second while propelled by motors that spin at 60 million rpm at scale.

^{^{Mitochondria: At approximately 1/10,000th of the real speed}}

Studying math has never been more important than now.

The road from biology (or from any branch of science on the fundamental level) to mathematics goes in several (often Brownian rather than straight) paths in parallel.

_{_{Mikhael Gromov. IHÉS Paris. June 26 of 2007^[1]}}

REFERENCES:

_₁_{_{Gromov. M., Mendelian Dynamics and Sturtevant’s Paradigm, E-print http://www.ihes.fr/~gromov/topics/mendel-may31.pdf June 26, 2007}}

_₂_{_{Guzzo, A. V. (1965). The Influence of Amino Acid Sequence on Protein Structure. Biophysical Journal, 5(6), 809–822.}}

_₃_{_{McLaughlin Jr, R. N., Poelwijk, F. J., Raman, A., Gosal, W. S., & Ranganathan, R. (2012). The spatial architecture of protein function and adaptation. Nature, 491(7422), 138–142.}}

_₄_{_{Leopold, P. E., Montal, M., & Onuchic, J. N. (1992). Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proceedings of the National Academy of Sciences, 89(18), 8721–8725.}}

_₅_{_{Garnier, J., Osguthorpe, D. J., & Robson, B. (1978). Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. Journal of Molecular Biology, 120(1), 97–120.}}

_₆_{_{Sugita, Y., & Okamoto, Y. (1999). Replica-exchange molecular dynamics method for protein folding. Chemical Physics Letters, 314(1–2), 141–151.}}

_₇_{_{Zhang, Y. (2008). Progress and challenges in protein structure prediction. Current Opinion in Structural Biology, 18(3), 342–348.}}

_₈_{_{Robert S. Langer (MIT) Part 3: Biomaterials for Drug Delivery Systems and Tissue Engineering. YouTube. iBiology}}

_₉_{_{Rost, B., & Sander, C. (1993). Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology, 232(2), 584–599.}}

_₁₀_{_{King, N. P., Sheffler, W., Sawaya, M. R., Vollmar, B. S., Sumida, J. P., Andre, I., … Baker, D. (2012). Computational Design of Self-Assembling Protein Nanomaterials with Atomic Level Accuracy. Science, 336(6085), 1171–1174.}}