From: hu-po
Meta (formerly Facebook) Research has developed an approach to protein structure prediction using language models, positioning itself as a contender in the race to use deep learning for biological applications [00:01:11]. This effort aims to extend protein structure prediction to a catalog of 200 million proteins [00:02:56].
Evolutionary Scale Language Models (ESM2)
The core of Meta’s approach is a new family of Transformer protein language models called ESM2 [00:30:51]. These are the largest protein language models trained to date, with versions scaling up to 15 billion parameters [00:03:46]; smaller versions, down to 8 million parameters, are also available [00:11:49].
Training and Architecture
ESM2 models are trained with a masked language modeling objective, an unsupervised method [00:31:22]: parts of a protein sequence are masked out and the model is trained to predict the missing amino acids [00:31:29]. Because new random masks can be drawn from the same sequences on every pass, this objective effectively provides an unlimited supply of training examples from existing protein databases [00:31:40]. The models use the Transformer architecture that has become standard in language modeling [00:07:15].
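As a rough illustration (not Meta’s actual training code, and assuming the standard BERT-style masking rate of roughly 15%), the objective can be sketched as follows: a fraction of residues is replaced with a mask token, and the loss is computed only at the hidden positions.

```python
import torch
import torch.nn.functional as F

def masked_lm_step(model, tokens, mask_idx, mask_prob=0.15):
    """One masked-language-modeling step on integer-encoded protein sequences.

    tokens: (batch, seq_len) residue indices; mask_idx: index of the <mask> token.
    `model` is any network mapping token ids to per-position logits over the vocabulary.
    """
    # Choose ~15% of positions at random and hide their amino acid identity.
    masked_positions = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[masked_positions] = mask_idx

    # The model must reconstruct the original residues at the masked positions only.
    logits = model(corrupted)                      # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits[masked_positions], tokens[masked_positions])
    return loss
```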
Training sequences were sampled evenly across 43 million UniRef50 training clusters drawn from 138 million UniRef90 sequences, so the models see roughly 65 million unique sequences over the course of training [00:32:35]. Characterizing the metagenomic dataset with the trained models took two weeks on a heterogeneous cluster of about 2,000 GPUs [00:18:18]. Training efficiency is improved by sharding model weights and optimizer state across multiple GPUs [01:44:45].
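The paper’s training harness is not published, but this kind of sharding can be sketched with PyTorch’s FullyShardedDataParallel wrapper (a minimal sketch assuming one process per GPU launched via torchrun; ProteinLM is a placeholder model class, not part of the released code).

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes the script is launched with one process per GPU (e.g. via torchrun),
# so each rank only materializes its own shard of the parameters.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = ProteinLM()                 # placeholder: any large Transformer protein LM
model = FSDP(model.cuda())          # parameters, gradients, and optimizer state are sharded

# Each rank's optimizer only ever sees the local shard of every parameter.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```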
Positional Embeddings
ESM2 uses rotary position embeddings rather than the absolute sinusoidal positional encoding of the original Transformer paper [01:41:34]. Rotary position embedding (RoPE) encodes absolute position with a rotation matrix while naturally incorporating explicit relative-position dependency into self-attention [01:56:30]. The method is flexible with respect to sequence length and introduces a decaying dependency between tokens as their relative distance grows [01:56:45]. This is a sensible choice for proteins, where position in the linear sequence does not map to 3D spatial relationships the way word order carries temporal progression in human language [01:42:45].
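A minimal sketch of the rotation (following the RoFormer formulation with the split-half channel pairing convention, not Meta’s exact implementation): each pair of channels in a query or key vector is rotated by an angle proportional to the token’s position, so the attention dot product between rotated queries and keys depends only on their relative offset.

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per channel pair: theta_i = base^(-2i/dim).
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    # Rotation angle for position m and pair i is m * theta_i.
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied pairwise; <rope(q, m), rope(k, n)> depends only on n - m.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```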
Protein Structure Prediction (ESMfold)
ESMfold is the structure prediction model that leverages the pre-trained ESM2 language model [00:50:50]. ESM2 acts as a feature encoder, processing the protein sequence and passing its internal states (embeddings) to a “folding head” [00:51:01]. This folding head contains “folding blocks” that iteratively update sequence and pairwise representations before passing them to a structure module that outputs 3D coordinates and confidences [00:57:04].
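To make “ESM2 as a feature encoder” concrete, the publicly released esm package exposes the language model’s per-residue representations; the snippet below (adapted from the repository’s documented usage pattern, with an arbitrary example sequence) extracts the final-layer embeddings and predicted contacts that a downstream head could consume.

```python
import torch
import esm  # pip install fair-esm

# Load a mid-sized ESM2 checkpoint and its tokenizer ("alphabet").
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("example_protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

per_residue_embeddings = out["representations"][33]   # (1, seq_len + 2, 1280) hidden states
contact_map = out["contacts"]                          # predicted residue-residue contacts
```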
Key Innovations Compared to AlphaFold
ESMfold eliminates the need for multiple sequence alignment (MSA) [00:13:20], a computationally intensive step required by other state-of-the-art models such as AlphaFold [00:52:29]. By predicting structures directly from the primary sequence, ESMfold simplifies the neural network used for inference [00:14:08]. This significantly speeds up prediction: for a protein with 384 residues, ESMfold makes a prediction in 14.2 seconds on a single NVIDIA V100 GPU [00:53:38], whereas AlphaFold can take over 10 minutes once the time for MSA and template search is included [00:14:26].
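MSA-free inference with the released code looks roughly like the following (a usage sketch following the esm repository’s documented interface; the sequence is a placeholder and a GPU is assumed).

```python
import torch
import esm

model = esm.pretrained.esmfold_v1()   # ESMFold: ESM2 encoder plus folding head
model = model.eval().cuda()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"  # placeholder

# A single forward pass from primary sequence to 3D coordinates; no MSA or template search.
with torch.no_grad():
    pdb_text = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as f:
    f.write(pdb_text)
```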
Performance and Results
Perplexity and Contact Prediction
Perplexity measures how well a probability model predicts a sample; lower perplexity indicates a better model [00:33:43]. ESM2 models show large improvements in modeling proteins as the parameter count increases [00:32:50]: the 8 million parameter model has a perplexity of about 10, while the 15 billion parameter model reaches 6.37 [00:33:27]. Perplexity and contact-map prediction accuracy are linked; proteins that undergo a large change in one also tend to show a large change in the other [00:38:05].
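Concretely, perplexity here is the exponential of the average masked-token cross-entropy, so a uniform guess over the 20 amino acids would score a perplexity of 20, and lower values mean sharper predictions. A minimal sketch of the computation:

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets, masked):
    """Perplexity over masked positions.

    logits: (seq_len, vocab) model predictions; targets: (seq_len,) true residues;
    masked: (seq_len,) bool tensor marking the positions that were hidden.
    """
    nll = F.cross_entropy(logits[masked], targets[masked])  # mean negative log-likelihood
    return torch.exp(nll)  # e.g. a uniform guess over 20 amino acids gives perplexity 20
```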
Larger models exhibit “emergent capabilities,” where greater capabilities emerge as computation, data, and the number of parameters increase [00:08:43]. This raises the possibility that a similar form of emergence might be exhibited by language models trained on protein sequences [00:09:02].
3D Structure Prediction
ESMfold generates state-of-the-art three-dimensional structure predictions directly from the primary protein sequence [00:50:15]. The models were validated against holdout sets like CAMEO and CASP14 proteins, where the actual atomic structure is known [00:39:03]. Accuracy is measured using metrics like pLDDT (predicted local distance difference test), a well-calibrated estimate of prediction accuracy [01:01:05], and RMSD95 (root mean squared deviation at 95% residue coverage) [01:10:35].
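As a rough illustration of the distance-based part of such metrics (a simplified sketch, not the paper’s exact RMSD95 procedure, which is computed at 95% residue coverage on superimposed structures): the root mean squared deviation compares corresponding C-alpha coordinates of a predicted and an experimental structure.

```python
import numpy as np

def rmsd(coords_pred, coords_true):
    """Root mean squared deviation between two (N, 3) arrays of aligned C-alpha coordinates.

    Assumes the structures have already been superimposed; RMSD95 additionally
    restricts the calculation to 95% of the residues rather than all of them.
    """
    diff = coords_pred - coords_true
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```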
One notable finding is that regions of a protein structure that are difficult for ESMfold to predict accurately also pose challenges for AlphaFold, suggesting both models are learning similar underlying patterns [01:07:44].
Novel Structures and Metagenomic Atlas
In a comprehensive characterization of the MGnify90 dataset of over 617 million proteins [00:16:31], ESMfold produced 225 million high-confidence predictions [00:18:30], including millions whose structures are novel compared with experimentally determined structures [00:04:41]. A vast majority of these high-confidence predictions are distinct from existing UniRef90 entries, indicating discovery in metagenomic space far from existing knowledge [01:15:06]. All predicted structures are accessible via the ESM Metagenomic Atlas [02:27:19].
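The Atlas service also exposes a fold-on-demand web API. The sketch below is based on the publicly documented endpoint at the time of release; the URL and the placeholder sequence are assumptions to verify against the current Atlas documentation.

```python
import requests

# Fold a single sequence through the ESM Metagenomic Atlas public API and save the PDB.
# NOTE: endpoint taken from the Atlas documentation at release time; check it is still current.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"  # placeholder
resp = requests.post("https://api.esmatlas.com/foldSequence/v1/pdb/", data=sequence, timeout=120)
resp.raise_for_status()

with open("atlas_prediction.pdb", "w") as f:
    f.write(resp.text)
```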
Data Sources and Challenges
Protein sequence databases such as UniProt and MGnify are largely funded by government research agencies and non-profits, including the National Human Genome Research Institute and the NIH [01:24:31]. This public funding influences the open-source availability of some large-scale models [01:24:22].
A potential challenge is the quality of existing ground-truth data. When models like ESMfold and AlphaFold predict a structure that differs from the “ground truth,” it raises the question of whether the experimental structure might itself be inaccurate or represent only one possible conformation [01:12:11]. Humans also have far less intuition for protein sequences than for natural language, which makes it difficult to “fact-check” the models’ predictions or to fully understand emergent intelligence in this domain [00:09:41].
The task of protein structure prediction is at a point where the size of existing protein databases might become a limiting factor for further “step-function improvements” in model performance [01:59:37]. However, there’s potential for models to synthetically create new datasets to train even larger models [02:00:05].
Conclusion
The development of ESMfold represents significant advancements in language models for biological applications, specifically in protein structure prediction. By training language models on amino acid sequences and developing efficient folding architectures, Meta has demonstrated competitive performance and the ability to discover novel protein structures, contributing to a future where the structure of all proteins discovered through gene sequencing experiments might be understood [01:25:14].