Fold.It is an online puzzle game about protein folding in which players attempt to fold selected proteins into a variety of configurations according to a set of pre-defined biological rules. Though the general process of protein folding is understood, predicting a protein’s eventual structure is computationally demanding. Researchers at the University of Washington’s Center for Game Science and Department of Biochemistry therefore created Fold.It to let the general public learn about protein folding in a fun and interactive way while also helping to advance protein folding research. Folded configurations that adhere most closely to the biological rules receive higher scores, and the highest-scoring solutions are then analyzed by researchers to see whether these folded structures can be applied to proteins in the real world.
In this project, we used deep learning to train two separate models to a) rate the quality of folded proteins using protein energies, and b) fold protein structures to maximize points by approaching the folding process from human players’ perspectives.
The Importance of Protein Folding
Proteins are responsible for the majority of cellular activity, from DNA replication to fighting infection. They are long chains of amino acids, typically composed of thousands of atoms. They contain many active groups; however, the structure of a protein (in particular, how it folds in the alkaline environment of the body) dictates which regions of the protein are exposed and active (Figure 1). The exposed area determines how the protein interacts with its surrounding environment, which makes it critical that we understand not only protein composition but also how proteins fold. If we can better understand proteins’ natural structural configurations, we can design novel proteins to help treat diseases, fight invasive species, or even limit waste and pollution.
The Complexity of Protein Folding
Proteins are composed of amino acids in a linear sequence, which lets us write down a linear list of atoms that completely describes the bonding structure of the protein. On a larger scale, these linear chains form secondary and tertiary structures with complex shapes such as alpha helices and beta sheets (Figure 2). Determining these structures by hand would require calculating spatial positions for thousands of atoms in three-dimensional space; solving the governing equations analytically is practically impossible, and numerical approaches are computationally intensive.
Fold.It: Determining ‘Smart Moves’ via Crowdsourcing
In 2010, University of Washington researchers Seth Cooper et al. combined the ‘Rosetta’ structure-prediction algorithm with human players to develop strategies for protein refinement and to find new structures (Figure 3). They attempted this through a multiplayer online game called Fold.It (Figure 4). The game highlighted high-energy areas through coloring and other visual cues, and players were then able to rotate, pull, and tweak the structure to maximize the score, which was the negative of the Rosetta energy calculation. In the end, players were able to match Rosetta’s native performance on 3/10 puzzles and outperform Rosetta on 5/10 puzzles. Overall, their research showed that human visual analysis could meaningfully supplement analytical solutions.
Predicting Protein Energy and Finding ‘Smart Moves’ to fold proteins
Given this background and prior work on folding proteins, we wanted to use game data from Fold.It to a) predict protein energy from just atom configuration (Model 1: Predicting Protein Energies), and b) train a second model to fold proteins in a smarter way (Model 2: Folding Proteins).
Thanks to Seth Cooper and the Fold.It team, we were able to obtain Protein Data Bank (PDB) files from recorded Fold.It games. Overall, there are about 1,500 proteins in the Fold.It database, played by a subset of its 60,000 users, and each protein contains a few hundred atoms. These files were in .pdb format, which records protein information at the molecular level (Figure 5), including the 3D structural information of the protein. A .pdb file can also be converted into a visual diagram via PyMOL, an open-source molecular visualization program (Figure 6).
Files from the Fold.It database were in .pdb format. Each .pdb file stores protein composition and structural information and can easily be opened and processed as a text file in Python. Custom Python scripts were written to parse these files and to extract and format the atom information into the necessary data structures.
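As a concrete illustration of this parsing step, the sketch below pulls per-atom fields out of a .pdb file. The column positions follow the standard fixed-width PDB ATOM record layout; `parse_pdb` is a hypothetical helper name, not code from the project.

```python
# Minimal sketch of parsing ATOM records from a .pdb file.
# Column slices follow the fixed-width PDB ATOM record layout.

def parse_pdb(path):
    """Return a list of (element, residue_name, x, y, z) per ATOM record."""
    atoms = []
    with open(path) as f:
        for line in f:
            if line.startswith(("ATOM", "HETATM")):
                # Element symbol lives in columns 77-78; fall back to the
                # first character of the atom name if it is missing.
                element = line[76:78].strip() or line[12:16].strip()[0]
                residue = line[17:20].strip()   # amino acid, e.g. "MET"
                x = float(line[30:38])
                y = float(line[38:46])
                z = float(line[46:54])
                atoms.append((element, residue, x, y, z))
    return atoms
```

The metadata lines (HEADER, REMARK, and so on) are skipped implicitly, since only ATOM/HETATM records are matched.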
Extracting Relevant Training Information
Each .pdb file consists of metadata, atom information, and other file descriptors used to visualize the protein in PyMOL. Our models do not need this metadata and only take the atom information as input. Therefore, during training, each new .pdb file was opened, parsed to strip the extra metadata, and saved into a usable data structure before being fed to the model. We performed this data processing during training because doing so is more computationally and memory efficient, and the file parsing did not significantly affect training speed.
Varying Input Sizes
The number of atoms in each .pdb file varies widely with protein size. For our simple feedforward neural networks, input sizes needed to be consistent in order to correctly initialize the model weights, so the varying atom count among proteins posed an initial problem. We resolved this by first parsing through all the training files once to find the protein with the maximum atom count, then creating data structures sized to that maximum. Unused atom slots for proteins with fewer atoms were filled with zeros (or NaN).
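The padding scheme above can be sketched as follows; `pad_coords` and the toy inputs are illustrative, not the project’s actual code.

```python
import numpy as np

def pad_coords(proteins, max_atoms):
    """Stack variable-length (n_atoms, 3) coordinate arrays into one
    (n_proteins, max_atoms, 3) array, zero-filling unused atom slots."""
    batch = np.zeros((len(proteins), max_atoms, 3))
    for i, coords in enumerate(proteins):
        batch[i, :len(coords), :] = coords
    return batch

# Two toy "proteins" with different atom counts
proteins = [np.ones((5, 3)), np.ones((8, 3))]

# First pass over all inputs: find the largest atom count
max_atoms = max(len(p) for p in proteins)

batch = pad_coords(proteins, max_atoms)   # shape (2, 8, 3)
```

Zero-padding keeps every input the same shape, at the cost of the network having to learn to ignore the padded slots.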
Comparing Protein PDBs
In Model 2: Folding Proteins, we train a model to fold proteins in a smarter, more human way rather than with the more common greedy approach. We did this using past gameplay data from Fold.It, training the model on the differences in protein configuration and energy between game states. During training, we realized that Rosetta often inserts new atoms into .pdb files to make new configurations abide by the biological rules. This made training Model 2 difficult, because the model compares atom positions and protein energies between game states, which requires that the number of atoms remain the same across different timesteps of the game. To keep the atom count consistent, we therefore had to separately parse through the data before training and remove the extraneous atoms.
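One simple way to implement this cleanup, sketched below under our own assumptions (the project’s actual script may differ), is to key each atom by residue number and atom name and keep only the atoms present in both game states:

```python
# Sketch of removing extraneous atoms between two game states:
# atoms Rosetta inserted into a later state are dropped so both
# states share an identical atom list. Key format is illustrative.

def align_states(state_a, state_b):
    """Each state maps (res_seq, atom_name) -> (x, y, z).
    Return the shared atom keys in a stable order, plus coordinate
    lists restricted to those atoms, so positions can be compared
    index by index across timesteps."""
    shared = sorted(set(state_a) & set(state_b))
    coords_a = [state_a[k] for k in shared]
    coords_b = [state_b[k] for k in shared]
    return shared, coords_a, coords_b
```

With both states restricted to the shared atoms, per-atom position and energy deltas are well defined.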
Model 1: Predicting Protein Energies takes new proteins in the form of PDB files as input and attempts to predict the associated energy score. It is trained only against historical PDB data and scores (Fig. 6); it is not explicitly programmed with any of the science behind proteins. The data was structured so the model could learn atomic-resolution information (element data), local clusters (average distance data and amino acid information), and high-level structure (XYZ coordinates rendered as a 3D image and passed through two Conv3D layers). The model was trained on the Holyoke cluster and used Leaky ReLU activations (see Fig. 7).
Protein folding is very complex because the way a protein folds depends not only on the interactions between individual atoms (e.g., carbon vs. oxygen), but also on the interactions between amino acid groups (e.g., glutamine vs. alanine) and between secondary and tertiary structures (alpha helices vs. beta sheets). Moreover, the higher-level structures are highly dependent on how the lower-level structures interact: secondary and tertiary structures form according to how amino acid groups interact, and amino acid groups form according to how the individual atoms interact. For the model to accurately predict protein energies, it therefore needs to capture how and why proteins fold the way they do across these resolution levels. We thus provide Model 1 with input data at three levels of resolution: a) the atomic level, b) amino acid / rotational information, and c) an overall protein point cloud representation.
3D Image Representation
Proteins that are folded more compactly often have lower protein energies, so the placement and proximity of the atoms in a configuration matter. To represent the density or sparseness of the atoms, we converted the cartesian coordinates of each atom into a 3D image representation, which was then passed through several 3D convolution layers and a max-pool layer so the network could learn atom positional densities. One issue we encountered was that mapping the xyz coordinates directly to a 3D array produced a data structure too large to fit in memory during training, so we had to decrease the coordinate resolution by a factor of 100.
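The voxelization step can be sketched like this; the grid size, the occupancy-count scheme, and the `scale` value standing in for the ~100x resolution reduction are our illustrative choices, not the project’s exact parameters.

```python
import numpy as np

def voxelize(coords, grid=32, scale=0.01):
    """Map an (n_atoms, 3) array of cartesian coordinates into a
    grid x grid x grid occupancy volume, coarsening resolution by
    the given scale factor (0.01 ~ a factor-of-100 reduction)."""
    vol = np.zeros((grid, grid, grid), dtype=np.float32)
    coords = np.asarray(coords, dtype=float)
    coords = coords - coords.min(axis=0)        # shift into the positive octant
    idx = (coords * scale * grid).astype(int)   # coarsen coordinates to voxels
    idx = np.clip(idx, 0, grid - 1)             # guard against edge overflow
    for i, j, k in idx:
        vol[i, j, k] += 1.0                     # count atoms per voxel
    return vol
```

Many atoms collapse into each voxel at this resolution, which is exactly the memory/precision trade-off described above.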
Because the .pdb files include only cartesian coordinates for each atom, differences in protein orientation between training and test data could cause problems. To ensure that the same protein, even when slightly shifted or rotated, is not treated as a different protein during training, the input data needed to be represented with rotation invariance. We achieved this by first calculating the centroid of the protein and then computing the euclidean distance between each atom and the centroid. This yields a 1D vector of atom-to-centroid distances, a representation that is inherently rotation invariant.
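The distance-to-centroid encoding is only a few lines; the sketch below (with a hypothetical `centroid_distances` helper) also shows why it is rotation invariant: rotating the whole structure leaves every atom’s distance from the centroid unchanged.

```python
import numpy as np

def centroid_distances(coords):
    """Reduce (n_atoms, 3) coordinates to a 1D vector of each atom's
    euclidean distance from the protein's centroid."""
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)
    return np.linalg.norm(coords - centroid, axis=1)
```

Note that this encoding deliberately discards angular information; two different configurations with the same radial profile map to the same vector.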
At the atomic level, atoms are categorized by element type, amino acid group, and side chain type. We represented this information using one-hot vectors.
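A minimal version of this per-atom encoding is shown below; the category lists are truncated examples, not the full element and amino acid sets used in the project.

```python
import numpy as np

# Truncated example category lists (the real sets are larger)
ELEMENTS = ["C", "N", "O", "S"]
RESIDUES = ["ALA", "GLN", "GLY"]

def one_hot(value, categories):
    """Return a one-hot vector with a 1 at the category's index."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def encode_atom(element, residue):
    # Concatenate the element and amino-acid one-hot vectors;
    # a side-chain one-hot would be appended the same way.
    return np.concatenate([one_hot(element, ELEMENTS),
                           one_hot(residue, RESIDUES)])
```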
As described above, the model first takes in a 3D image of the protein point cloud, with all atom coordinate information converted into a 3D matrix. This 3D image is passed through two convolution layers before being max-pooled and flattened. The flattened array is then concatenated with an array of one-hot vectors representing atom element, amino acid, and side chain type. The merged representation is fed through three hidden fully connected layers with LeakyReLU activations before finally outputting a predicted protein energy. This prediction is compared against the actual protein energy labelled in the dataset, using mean absolute error as the loss.
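A hedged Keras sketch of this architecture is given below. The layer widths, filter counts, the 32^3 grid, and the flat feature-vector size are placeholder choices, not the project’s actual hyperparameters.

```python
from tensorflow.keras import layers, models

def build_model(grid=32, n_atom_features=64):
    # Branch 1: 3D image of the point cloud -> two Conv3D layers,
    # max pool, flatten
    vol_in = layers.Input(shape=(grid, grid, grid, 1))
    x = layers.Conv3D(8, 3, activation="relu")(vol_in)
    x = layers.Conv3D(16, 3, activation="relu")(x)
    x = layers.MaxPooling3D(2)(x)
    x = layers.Flatten()(x)

    # Branch 2: flat one-hot features (element / amino acid / side chain)
    feat_in = layers.Input(shape=(n_atom_features,))

    # Merge, then three hidden fully connected layers with LeakyReLU
    merged = layers.Concatenate()([x, feat_in])
    for units in (256, 128, 64):
        merged = layers.Dense(units)(merged)
        merged = layers.LeakyReLU()(merged)
    energy = layers.Dense(1)(merged)  # predicted protein energy

    model = models.Model([vol_in, feat_in], energy)
    model.compile(optimizer="adam", loss="mean_absolute_error")
    return model
```

The mean absolute error loss matches the comparison against the labelled protein energy described above.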
We trained the model overnight on the MIT Engaging Holyoke Cluster with ~500 .pdb files, each taking around 15-30 seconds to train on. As shown in Figures 10 and 11, the predicted values and losses oscillated wildly, but the loss generally decreased over time. With more time, we would try to improve the model by increasing the batch size and trying different losses and optimizers. The Python scripts used for data processing and training for Model 1 can be found on GitHub.
Model 2 takes as input a set of .pdb files for a particular protein. The model is given the initial x, y, z positions of the protein and the amount each atom was shifted, and is trained against the change in distance over the change in energy; it is trained on every pair of .pdb files using this technique. The input is fed through a model composed of two fully connected layers and an output softmax layer. The model outputs a list of relative importances for the atoms, which is used to move an atom in the +x, -x, +y, -y, +z, and -z directions. Each shifted file is then relaxed and scored, and the best of the six shifts and the original .pdb is determined. If the original file is the best, it is returned; if another file is superior, it is fed back into the model from the beginning until an optimal protein configuration is found. This is shown in Fig xx. Within the scope of this work, the model was only run to compare across 2 .pdb files, and only for ten steps.
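The search loop just described can be sketched as below. `predict_importance` stands in for the trained network’s softmax output and `relax_and_score` for the Rosetta relax-and-score call; both are assumptions, as are the step size and the choice to shift only the single most important atom.

```python
import numpy as np

# The six axis-aligned shift directions tried on each iteration
SHIFTS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def fold_step(coords, predict_importance, relax_and_score, step=0.5):
    """Shift the most important atom along each axis; keep the best
    relaxed result. Returns (best_coords, improved)."""
    atom = int(np.argmax(predict_importance(coords)))   # softmax importances
    best_coords, best_score = relax_and_score(coords)   # score the original
    improved = False
    for dx, dy, dz in SHIFTS:
        trial = coords.copy()
        trial[atom] += step * np.array([dx, dy, dz])
        relaxed, s = relax_and_score(trial)
        if s > best_score:                  # score = negative Rosetta energy
            best_coords, best_score, improved = relaxed, s, True
    return best_coords, improved

def fold(coords, predict_importance, relax_and_score, max_steps=10):
    """Feed the best configuration back in until no shift improves it."""
    for _ in range(max_steps):
        coords, improved = fold_step(coords, predict_importance, relax_and_score)
        if not improved:
            break
    return coords
```

The `max_steps=10` cap mirrors the ten-step runs mentioned above.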
To go into the specifics, the code was written with the following functions:
The results showed a 10-point improvement over a Fold.It file and are shown in Figure 13.
The biggest challenge in this project was creating a viable model architecture and figuring out how to represent the protein data. Because the data included not only 3D spatial information but also elemental and other per-atom features, it was difficult to represent everything compactly while still enabling the model to learn how the different atom features interact across resolution levels. Furthermore, although deep learning on 2D images is a deeply explored topic in the research community, deep learning on 3D point clouds is still a novel research area. Past research such as Stanford’s PointNet focused on object classification, but protein classification and prediction is more complex: it involves multiple layers of atomic interactions, as opposed to the purely positional information required in typical 3D object classification. The data structures holding atom coordinate information also tended to be too large to process and train on local computers, so all training had to be completed on clusters. Figuring out how to represent atom feature information and achieve rotation invariance also posed a challenge. Overall, designing the model architecture and the data processing took up the brunt of our time in this project.
Given more time and computing resources to train and modify the models, we believe our approach to protein data representation could be improved by adjusting the model’s parameters and layers. Interestingly, two days before our final project presentation and demo, Google’s DeepMind released AlphaFold, an AI that predicts the 3D shape of proteins. Similar to our model, AlphaFold trains its network to learn the relationships between amino acids in order to predict 3D structures. In comparison, our model takes into account atomic, elemental, amino acid, and spatial information, which may have introduced more noise into the system than needed. Furthermore, Siraj, an AI educator on YouTube, recently released a video on Geometric Deep Learning (deep learning on non-Euclidean objects in 3D space), which could be helpful in better understanding how to represent cartesian coordinates with rotation invariance.
For Model 1: Predicting Protein Energies, because we were using large data structures and 3D matrix convolutions, all training was done on the MIT Engaging Holyoke Cluster (234 nodes with 64 GB RAM and 2 x 8-core 2.0 GHz CPUs, 90 K20m GPUs, 16 Xeon Phi, base OS RHEL/CentOS 6.4).
For Model 2: Folding Proteins, training was done on a personal machine (16 GB RAM, Intel Core 2.60 GHz CPU, Nvidia GeForce GTX 970M, base OS Ubuntu 18.04).
We would like to thank our LA mentor Deepankar Gupta and our industry mentor Charles Tam for providing technical and high-level guidance throughout the project. We would also like to thank our course instructors Professor Hal Abelson and Natalie Lao for giving us the opportunity to learn about deep learning as well as implement a meaningful hands-on project. We would also like to give a special thanks to Northeastern University’s Seth Cooper for providing us with the Fold.It resources and data needed to train our models with. Finally, we would like to thank all of the course staff, MIT Engaging Cluster, the Fold.It researchers, and MIT researchers Sam Hendel (MIT Shoulders Lab) and Benson Chen (MIT CSAIL) for providing us with the advice and resources we needed to implement this project.