AAAI 2025 Tutorial
Artificial Intelligence for Protein Design
Abstract
Proteins are fundamental to biological processes, and AI techniques are revolutionizing their study, with applications ranging from drug
discovery to enzyme design. A key challenge in protein science is to predict and design protein sequences and structures, and to model
their dynamics. In this tutorial, we will present a comprehensive overview of AI approaches applied to protein sequence, structure, and
function prediction and design. Topics include sequence-based and structure-based protein representation learning, protein folding and
dynamics prediction, and protein design with generative models. Participants are expected to have a foundational understanding of machine
learning methods (e.g., neural networks, generative models). No prior experience with computational biology or bioinformatics is necessary,
as the tutorial will include a comprehensive introduction to the field.
Schedule
8:30 am - 12:30 pm EST, February 26, 2025
Location
Room 117, Philadelphia Convention Center, Philadelphia, PA, USA
Slides
The slides can be found here.
Outline
- Part I: Introduction [30 min, Wengong, slides]
- Major Breakthroughs in AI for Proteins
- Introduction to Proteins
- Learning on Protein Data
- Part II: Protein Representation Learning [60 min, Zuobai, slides]
- Sequence Representation Learning
- Structure Representation Learning
- Multi-Modality Representation Learning
- Application
- Q&A [5 mins]
- Break: 15 min
- Part III: Protein Structure and Dynamics Prediction [60 min, Jiarui, slides]
- Protein Structure Prediction
- Single-chain Folding [AlphaFold2 (Jumper et al., 2021), RoseTTAFold (Baek et al., 2021), OmegaFold (Wu et al., 2022), ESMFold (Lin et al., 2023)]
- Side-chain Packing [AttnPacker (McPartlon et al., 2023), DiffPack (Zhang et al., 2024)]
- Complex Prediction [AlphaFold-Multimer (Evans et al., 2021), RoseTTAFold-AA (Krishna et al., 2024), Umol (Bryant et al., 2024), AlphaFold3 (Abramson et al., 2024)]
- Protein Conformation Sampling
- Boltzmann Generators [Noé et al., 2019]
- Coarse-Graining Based Methods [Two for One (Arts et al., 2023), EigenFold (Jing et al., 2023)]
- Rigid-Frame Based Methods [Str2Str (Lu et al., 2024), ConfDiff (Wang et al., 2024), DiG (Zheng et al., 2024), AlphaFlow (Jing et al., 2024), BioEmu (Lewis et al., 2024)]
- Structure Language Models [ESMDiff (Lu et al., 2025)]
- MD Trajectory Emulation
- Q&A [5 mins]
- Part IV: Protein Design [60 min, Jiwoong & Wengong, slides]
- Sequence Design
- Structure Design
- FrameDiff (Yim et al., 2023), FrameFlow (Yim et al., 2023)
- Genie (Lin et al., 2023), Genie2 (Lin et al., 2024)
- Chroma (Ingraham et al., 2023), RFdiffusion (Watson et al., 2023)
- FoldFlow (Bose et al., 2024), FoldFlow-2 (Huguet et al., 2024)
- Sequence-Structure Co-Design
- Antibody Design
- RNA Design
- Q&A [5 mins]
- Part V: Concluding Remarks and Future Work [15 min, Jian, slides]
Organizers
- Zuobai Zhang, website
- Zuobai Zhang is a 4th-year Ph.D. student at Mila – Québec AI Institute, advised by Prof. Jian Tang. He obtained a B.Sc. in computer science from Fudan University. Previously, he interned with the Fundamental GenAI team at NVResearch. His research focuses on developing protein structure foundation models.
- Jiarui Lu, website
- Jiarui Lu is a 3rd-year Ph.D. student at Mila - Québec AI Institute, supervised by Prof. Jian Tang. He obtained a B.Sc. in chemistry and mathematics from Shanghai Jiao Tong University. His research focuses on generative learning on biomolecular structure data such as proteins.
- Divya Nori, website
- Divya Nori is a senior and joint Master’s student at MIT and a student researcher at the Broad Institute, advised by Prof. Wengong Jin and Prof. Caroline Uhler. Previously, she interned on the ML teams at D.E. Shaw Research, Absci, and Microsoft Research. Her research focuses on developing AI methods for biomolecular design.
- Jiwoong Park, website
- Jiwoong Park is a postdoctoral researcher at Northeastern University working with Prof. Wengong Jin. He completed his Ph.D. in electrical and computer engineering at Seoul National University. His research focuses on generative models for drug design and machine learning for graph-structured data.
- Wengong Jin, website
- Wengong Jin is an assistant professor at Khoury College of Computer Sciences at Northeastern University. His research focuses on geometric and generative AI models for drug discovery. His work has been published in conferences and journals including ICML, NeurIPS, ICLR, Nature, Science, Cell, and PNAS, and covered by outlets such as the Guardian, BBC News, and CBS Boston.
- Jian Tang, website
- Jian Tang is an associate professor at Mila - Québec AI Institute, a Canada CIFAR AI Research Chair, and the founder and CEO of BioGeometry. His research interests are deep generative models, graph machine learning, and their applications to drug discovery. He has done pioneering work on AI for drug discovery, including the first open-source machine learning frameworks for drug discovery, TorchDrug and TorchProtein.
References
- Bepler, Tristan, and Bonnie Berger. "Learning the protein language: Evolution, structure, and function." Cell Systems 2021.
- Rao, Roshan, et al. "Evaluating protein transfer learning with TAPE." NeurIPS 2019.
- Madani, Ali, et al. "Large language models generate functional protein sequences across diverse families." Nature Biotechnology 2023.
- Rives, Alexander, et al. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." PNAS 2021.
- Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 2023.
- Wang, Xinyou, et al. "Diffusion Language Models Are Versatile Protein Learners." ICML 2024.
- Rao, Roshan, et al. "MSA Transformer." ICML 2021.
- Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." TPAMI 2021.
- Satorras, Victor Garcia, Emiel Hoogeboom, and Max Welling. "E(n) equivariant graph neural networks." ICML 2021.
- Jing, Bowen, et al. "Learning from protein structure with geometric vector perceptrons." ICLR 2021.
- Hermosilla, Pedro, et al. "Intrinsic-Extrinsic Convolution and Pooling for Learning on 3D Protein Structures." ICLR 2021.
- Zhang, Zuobai, et al. "Protein representation learning by geometric structure pretraining." ICLR 2023.
- Wang, Limei, et al. "Learning Hierarchical Protein Representations via Complete 3D Graph Networks." ICLR 2023.
- Fan, Hehe, et al. "Continuous-discrete convolution for geometry-sequence modeling in proteins." ICLR 2023.
- Chen, Can, et al. "Structure-aware protein self-supervised learning." Bioinformatics 2023.
- Zhang, Zuobai, et al. "Pre-training protein encoder via siamese sequence-structure diffusion trajectory prediction." NeurIPS 2023.
- Wang, Zichen, et al. "LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction." Scientific Reports 2023.
- Zhang, Zuobai, et al. "A systematic study of joint representation learning on protein sequences and structures." ArXiv Preprint ArXiv:2303.06275.
- Su, Jin, et al. "SaProt: Protein language modeling with structure-aware vocabulary." ICLR 2024.
- Heinzinger, Michael, et al. "Bilingual Language Model for Protein Sequence and Structure." NAR Genomics and Bioinformatics 2024.
- Wang, Xinyou, et al. "DPLM-2: A multimodal diffusion protein language model." ICLR 2025.
- Hayes, Thomas, et al. "Simulating 500 million years of evolution with a language model." Science 2024.
- Xu, Minghao, et al. "ProtST: Multi-modality learning of protein sequences and biomedical texts." ICML 2023.
- Zhang, Ningyu, et al. "OntoProtein: Protein Pretraining With Gene Ontology Embedding." ICLR 2022.
- Xu, Minghao, et al. "PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding." NeurIPS DB Track 2022.
- Zhu, Zhaocheng, et al. "TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery." ArXiv Preprint ArXiv:2202.08320.
- Notin, Pascal, et al. "ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction." NeurIPS 2023.
- Zhang, Zuobai, et al. "Multi-Scale Representation Learning for Protein Fitness Prediction." NeurIPS 2024.
- Shan, Sisi, et al. "Deep Learning-Guided Optimization of Human Antibody Against SARS-CoV-2 Variants with Broad Neutralization." PNAS 2022.
- Cai, Huiyu, et al. "Pretrainable Geometric Graph Neural Network for Antibody Affinity Maturation." Nature Communications 2024.
- Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 2021.
- Baek, Minkyung, et al. "Accurate prediction of protein structures and interactions using a three-track neural network." Science 2021.
- Wu, Ruidong, et al. "High-resolution de novo structure prediction from primary sequence." bioRxiv 2022.
- McPartlon, Matthew, et al. "An end-to-end deep learning method for protein side-chain packing and inverse folding." PNAS 2023.
- Zhang, Yangtian, et al. "DiffPack: A torsional diffusion model for autoregressive protein side-chain packing." NeurIPS 2024.
- Evans, Richard, et al. "Protein complex prediction with AlphaFold-Multimer." bioRxiv 2021.
- Krishna, Rohith, et al. "Generalized biomolecular modeling and design with RoseTTAFold All-Atom." Science 2024.
- Bryant, Patrick, et al. "Structure prediction of protein-ligand complexes from sequence information with Umol." Nature Communications 2024.
- Abramson, Josh, et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature 2024.
- Noé, Frank, et al. "Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning." Science 2019.
- Arts, Marloes, et al. "Two for one: Diffusion models and force fields for coarse-grained molecular dynamics." JCTC 2023.
- Jing, Bowen, et al. "EigenFold: Generative protein structure prediction with diffusion models." ICLR 2023 MLDD Workshop.
- Lu, Jiarui, et al. "Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling." ICLR 2024.
- Wang, Yan, et al. "Protein conformation generation via force-guided SE(3) diffusion models." ICML 2024.
- Zheng, Shuxin, et al. "Predicting equilibrium distributions for molecular systems with deep learning." Nature Machine Intelligence 2024.
- Jing, Bowen, et al. "AlphaFold meets flow matching for generating protein ensembles." ICML 2024.
- Lu, Jiarui, et al. "Structure Language Models for Protein Conformation Generation." ICLR 2025.
- Lewis, Sarah, et al. "Scalable emulation of protein equilibrium ensembles with generative deep learning." bioRxiv 2024.
- Fu, Xiang, et al. "Simulate time-integrated coarse-grained molecular dynamics with multi-scale graph networks." TMLR 2023.
- Schreiner, Mathias, et al. "Implicit transfer operator learning: multiple time-resolution surrogates for molecular dynamics." NeurIPS 2023.
- Klein, Leon, et al. "Timewarp: Transferable acceleration of molecular dynamics by learning time-coarsened dynamics." NeurIPS 2023.
- Jing, Bowen, et al. "Generative modeling of molecular dynamics trajectories." NeurIPS 2024.
- Hsu, Chloe, et al. "Learning inverse folding from millions of predicted structures." ICLR 2022.
- Dauparas, Justas, et al. "Robust deep learning-based protein sequence design using ProteinMPNN." Science 2022.
- Yim, Jason, et al. "SE(3) diffusion model with application to protein backbone generation." ICML 2023.
- Yim, Jason, et al. "Fast protein backbone generation with SE(3) flow matching." ArXiv Preprint ArXiv:2310.05297.
- Lin, Yeqing, and Mohammed AlQuraishi. "Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds." ICML 2023.
- Lin, Yeqing, et al. "Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2." ArXiv Preprint ArXiv:2405.15489.
- Ingraham, John B., et al. "Illuminating protein space with a programmable generative model." Nature 2023.
- Watson, Joseph L., et al. "De novo design of protein structure and function with RFdiffusion." Nature 2023.
- Bose, Joey, et al. "SE(3)-Stochastic Flow Matching for Protein Backbone Generation." ICLR 2024.
- Huguet, Guillaume, et al. "Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation." NeurIPS 2024.
- Shi, Chence, et al. "Protein Sequence and Structure Co-Design with Equivariant Translation." ICLR 2023.
- Lisanza, Sidney Lyayuga, et al. "Joint generation of protein sequence and structure with RoseTTAFold sequence space diffusion." Nature Biotechnology 2024.
- Campbell, Andrew, et al. "Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design." ICML 2024.
- Chu, Alexander E., et al. "An all-atom protein generative model." PNAS 2024.
- Jin, Wengong, et al. "Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design." ICLR 2021.
- Zhu, Tian, Milong Ren, and Haicang Zhang. "Antibody Design Using a Score-based Diffusion Model Guided by Evolutionary, Physical and Geometric Constraints." ICML 2024.
- Jin, Wengong, et al. "DSMBind: SE(3) denoising score matching for unsupervised binding energy prediction and nanobody design." NeurIPS 2023.
- Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Antibody-antigen docking and design via hierarchical equivariant refinement." ICML 2022.
- Nori, Divya, and Wengong Jin. "RNAFlow: RNA structure & sequence design via inverse folding-based flow matching." ICML 2024.
- Huang, Tinglin, et al. "Protein-nucleic acid complex modeling with frame averaging transformer." NeurIPS 2024.