AlphaFold from scratch
NOTE: These are my raw notes from exploring AlphaFold and how protein folding can be useful for humanity, learning from scratch (t = 0).
Protein
- Proteins are large, complex biomolecules made up of long chains of amino acids.
- The specific sequence of a protein determines its three-dimensional structure and function.
- Protein structural complexity includes secondary, tertiary, and quaternary structures.
- The process of converting a sequence into a 3D structure is called protein folding.
- The process of converting a 3D structure into a sequence is called inverse protein folding.
- Proteins are crucial for biochemistry, molecular biology, genetics, genomics, cell biology, pharmacology, biotechnology, and genetic engineering.
- Many diseases are caused by proteins functioning abnormally or interacting in problematic ways. Curing these diseases requires targeting the activity of relevant proteins.
- To determine a protein sequence, the first step is roughly to obtain the DNA or RNA sequence that encodes it, using techniques like DNA sequencing. That sequence is then translated into the corresponding amino acid sequence following the genetic code, where each triplet of nucleotides (a codon) corresponds to a specific amino acid. Finally, the result is verified experimentally.
- Structures are determined experimentally with techniques such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy.
- Protein structure prediction can be used to advance biotechnology in various ways, like drug discovery, personalized drug discovery, and protein-based therapeutics.
- Inverse protein folding is the reverse of protein folding, where we have the structure and predict the sequence. This allows researchers to design proteins with custom structures and functions, such as enzymes with improved catalytic abilities or antibodies.
- This also allows researchers to explore novel 3D structures that may have desirable functions beyond what naturally occurring proteins can provide. Researchers can gain deeper insights into the rules and principles that govern how proteins fold into their 3D structure, and can also validate if a structure is correct by inverting protein structures.
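The translation step from the list above (each nucleotide triplet maps to one amino acid) can be sketched in a few lines. This is a minimal sketch using the standard genetic code; the function name `translate` and the example DNA string are illustrative assumptions, and real pipelines usually rely on a library rather than hand-rolled code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate a coding sequence (DNA or RNA) into one-letter amino acids.

    Reading starts at the first base and stops at the first stop codon.
    """
    dna = nucleotides.upper().replace("U", "T")  # treat RNA like DNA
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequence: Met, Phe, Gly, then a stop codon.
print(translate("ATGTTTGGTTAA"))  # MFG
```

The table is built from the compact 64-letter string so that the whole genetic code fits on one line; a plain dictionary literal would work just as well.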
Let’s walk through the step-by-step details of how we can use AI models to cure diseases:
- Let’s take the example of cystic fibrosis: -> It’s a genetic disorder affecting the lungs and digestive system, inherited in an autosomal recessive pattern (the gene sits on one of the 22 non-sex chromosomes, and a person needs two mutated copies, one from each parent, to have the disease). -> It is most common in people of Northern European ancestry and is caused by mutations in the CFTR gene (Cystic Fibrosis Transmembrane Conductance Regulator). -> Symptoms include coughing, shortness of breath, poor weight gain despite eating, and nutritional deficiencies. -> CFTR is a protein that helps balance water and salt in the body. A defective CFTR can’t transport chloride properly, which leads to increased sodium absorption, with water staying inside cells and mucus becoming thick and sticky.
- Obtaining the Sequence: Go to uniprot.org -> search “CFTR human” -> UniProt ID: P13569 -> download the FASTA sequence.
- Predicting the Structure: Currently using ESMFold from Meta.
- Creating a Mutant Structure of the Protein: Create a mutant CFTR by deleting F508 (the phenylalanine at position 508) and predict its structure. Then, compare it with the wild type (the original, natural protein).
- Analyze the differences between the two structures.
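The sequence side of these steps (read the FASTA file, delete one residue, keep track of which residues still correspond) can be sketched as below. This is a rough sketch, not the code from the notebook: the helper names `read_fasta`, `delete_residue`, and `residue_mapping` are hypothetical, and the demo uses a made-up toy sequence instead of the real CFTR FASTA.

```python
def read_fasta(text: str) -> str:
    """Return the sequence of a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def delete_residue(seq: str, position: int, expected: str) -> str:
    """Delete the residue at a 1-based position after checking its identity."""
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    found = seq[position - 1]
    if found != expected:
        raise ValueError(f"expected {expected} at {position}, found {found}")
    return seq[:position - 1] + seq[position:]


def residue_mapping(wt_length: int, deleted_position: int) -> dict:
    """Map 1-based wild-type positions to mutant positions.

    The deleted residue has no partner, so it is left out of the mapping.
    """
    return {
        pos: pos if pos < deleted_position else pos - 1
        for pos in range(1, wt_length + 1)
        if pos != deleted_position
    }


# Toy example (not CFTR): delete the phenylalanine at position 4.
toy_fasta = ">toy_protein\nMKIF\nGVSY\n"
wild_type = read_fasta(toy_fasta)
mutant = delete_residue(wild_type, 4, "F")
print(wild_type, mutant)  # MKIFGVSY MKIGVSY
```

For the real case, the same call would be `delete_residue(cftr_sequence, 508, "F")` on the sequence downloaded for P13569. The identity check guards against off-by-one mistakes, and the mapping is useful later because the wild-type and mutant structures differ in length, so residues have to be paired up before any per-residue comparison.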
Here is the Google Colab link: https://colab.research.google.com/drive/1rHjRO7CFzdnQ4xgbAkDE-OkzU6Y-Nre7?usp=sharing :))
Let’s go deeper into architectural design
- Evolution of machine learning architectures for protein folding:
- Paper: “Prediction of Protein Secondary Structure at Better than 70% Accuracy” (Rost & Sander, 1993)
- Terms:
- Multiple Sequence Alignments (MSAs): A way of arranging multiple protein sequences to identify regions of similarity.
- Position-Specific Scoring Matrices (PSSMs): Matrices representing amino acid frequencies at each position, derived from MSAs, quantifying conservation patterns.
- PHD (Profile network from HeiDelberg) predictor:
- The authors suggested that instead of looking at each protein individually, we should look at families of proteins to learn about protein structure.
- Input data: Each position in the input sequence window is represented by the amino acid residue frequencies at that position, derived from multiple sequence alignments. The frequency of each of the 20 residue types is encoded by 3 bits or by one real number.
- Output data: The output layer has three units corresponding to the three secondary-structure states: helix, β-strand, and “loop,” at the central position of the input sequence window. Output values are between 0 and 1.
- Model: Two networks are stacked. The input to the second network is the three real-valued outputs for helix, strand, and loop from the first network, plus a fourth spacer unit, for each position in a 17-residue window. From the 17 x (3 + 1) = 68 input nodes, the signal is propagated via a hidden layer to three output nodes for helix, strand, and loop, as in the first network.
- Dataset: I started with the text version of ProteinNet’s CASP7 set, then shifted to TensorFlow because ProteinNet also offers TensorFlow-optimized versions of the data.
- I kept getting errors while processing the ProteinNet dataset. First I tried PyTorch, then I moved to TensorFlow because ProteinNet provides TensorFlow-specific TFRecord files, which I thought would be easier, but I was wrong. So I started learning about the dataset from scratch.
- After numerous errors with ProteinNet, I switched to a smaller dataset, since it needs less compute and makes it easier to check that things work.
- Here is the Google Colab notebook (link below). The code was implemented with the help of a genius friend (Claude). I got 60 percent accuracy, which can probably be improved with more experimentation and some tweaks to the model. Also, I’m getting a higher training loss than test loss, which is unusual; one common cause is regularization such as dropout being active only during training. Tweaking the model architecture may bring further improvement.
- Link: https://colab.research.google.com/drive/1E3GLtI5IuLAZ2zPZpHKmY2hu5eHmE868?usp=sharing
- Terms:
- Homologous proteins: Proteins that share a common evolutionary ancestor and tend to have similar structures and functions.
- Reference network:
- 130 representative protein chains: Chosen to minimize sequence similarities and ensure the network is trained on a broad representation of structural features.
- Cross-homologies: Proteins that share structural similarities. The term is used in connection with the testing set, and is called “without homology” in ref. 5.
- Summary of the paper: Proteins have three main types of secondary structure: α-helix (helix) (~32% in the dataset), β-strand (strand) (~21%), and loop (~47%). We want high accuracy overall, but because the classes are unevenly represented, a network trained on the raw data favors the most common class (loop) and does worse on the rarest one (strand). The paper therefore balances training so that each structure type is presented about 33 percent of the time, which evens out the accuracy across the three states. Earlier predictions also produced short, fragmented helices, so the paper introduces two networks: the first takes a window of the sequence profile as input, and the second takes the output of the first network as input and refines it. It was the first time sequence profiles derived from multiple sequence alignments were used as input instead of single protein sequences.
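To make the 17 x (3 + 1) = 68 input of the second network concrete, here is a minimal sketch of how that input vector could be assembled from the first network’s outputs. It assumes the spacer unit is set to 1 for window positions that fall outside the chain and 0 otherwise; the function name `second_level_input` and the example numbers are made up for illustration, and the hidden layer and training are left out.

```python
WINDOW = 17          # residues per window, as described in the notes
HALF = WINDOW // 2   # 8 neighbors on each side of the central residue


def second_level_input(first_outputs, center):
    """Build the 68-value input vector for one central residue.

    first_outputs: list of (helix, strand, loop) outputs from the first
    network, one tuple per residue, each value between 0 and 1.
    center: 0-based index of the central residue.
    """
    features = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(first_outputs):
            helix, strand, loop = first_outputs[pos]
            features.extend([helix, strand, loop, 0.0])  # spacer off
        else:
            features.extend([0.0, 0.0, 0.0, 1.0])        # spacer on (assumption)
    return features


# Hypothetical first-network outputs for a 5-residue chain.
example = [
    (0.8, 0.1, 0.1),
    (0.7, 0.2, 0.1),
    (0.2, 0.1, 0.7),
    (0.1, 0.6, 0.3),
    (0.1, 0.7, 0.2),
]
x = second_level_input(example, center=0)
print(len(x))  # 68
```

Because the second network sees the predictions of 8 neighbors on each side, it can smooth out isolated single-residue predictions, which is the refinement role the summary describes.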
Okay, here’s a Google Colab notebook where I tried to convert the RSA dataset (the one I used to train the model) into PSSMs: https://colab.research.google.com/drive/1TCV4Aaqe1EMgpSjvuO_zL2vK1a_rs3ig?usp=sharing
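For reference, the basic idea of turning an alignment into a PSSM can be sketched as follows. This is a simplified illustration, not what the notebook does: it assumes a uniform background frequency of 1/20 and a flat pseudocount, whereas tools like PSI-BLAST use sequence weighting and empirical background frequencies. The toy alignment and the function names are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_profile(msa, pseudocount=1.0):
    """Per-column amino acid frequencies for equal-length aligned sequences.

    Gap characters (anything outside the 20 letters) are ignored.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def pssm(msa, pseudocount=1.0):
    """Log-odds score (base 2) of each amino acid in each column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    return [
        {aa: math.log2(freq / background) for aa, freq in column.items()}
        for column in frequency_profile(msa, pseudocount)
    ]


# Toy alignment of three hypothetical sequences.
toy_msa = ["MKV", "MRV", "MKI"]
scores = pssm(toy_msa)
print(round(scores[0]["M"], 2))  # positive: M is fully conserved in column 1
```

Positive scores mark residues that appear more often than the background would suggest, and negative scores mark residues that are rare at that position, which is the conservation signal the profile-based networks above take as input.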
- Paper: “Protein Contact Prediction by Deep Learning” (Wang et al., 2016)
- Dataset: ProteinNet https://github.com/aqlaboratory/proteinnet
- Google Colab link: https://colab.research.google.com/drive/1vhfj7RYOjJmAZK11h1IcF0gL6ARELfHp?usp=sharing
I skipped ahead past some papers to AlphaFold 3, as it seemed interesting. Here is the paper link: https://www.nature.com/articles/s41586-024-07487-w
New concepts: Pairformer https://arxiv.org/pdf/2311.03583 The Extremal Graph Theory Problem