MBZUAI Workshop on Artificial Intelligence for Biology

Oct 9 - 10, 2024
MBZUAI, Abu Dhabi

Workshop Goals

We are excited to announce the AI4Bio Workshop on Artificial Intelligence for Biology. Organized by the Machine Learning department of MBZUAI, this workshop aims to explore the intersection of artificial intelligence and biology, with a focus on the latest advancements in AI techniques applied to biological data. The event is designed to foster collaboration and accelerate progress in AI-driven research in biology. We will bring together leading experts to discuss state-of-the-art methodologies, share groundbreaking research findings, and identify key challenges and opportunities in the AI for Biology field.

Schedule

We will host two days of invited talks (Oct 9-10) interspersed with panel discussions on challenges in AI for Biology.

Organizers

Eric Xing, MBZUAI

Le Song, MBZUAI

Invited Speakers

Eric Xing, MBZUAI

Ziv Bar-Joseph, CMU

Michal Rosen-Zvi, IBM Research

Jean-Philippe Vert, Owkin

Yu Li, The Chinese University of Hong Kong

Peter Koo, Cold Spring Harbor Laboratory

Hoifung Poon, Microsoft Health Futures

Le Song, MBZUAI

Eran Segal, MBZUAI / WIS

Shirley Liu, GV20 Therapeutics

Brian Hie, Stanford University

Christoph Bock, Medical University of Vienna

Christoph Feinauer, Stealth Startup

Firas D Khatib, University of Massachusetts Dartmouth

Day 1 Program (Oct 9, Wed, Executive Theatre)

Session 1

Coffee & Breakfast 9:00 am - 9:30 am
Eric Xing 9:30 am - 10:15 am

Topic: "FMs4Bio: toward simulating, predicting, and programming biological activities at all levels"

Abstract: Biology is at work at the core of medicine, pharmacy, health, longevity, agriculture and food security, environmental protection, and clean energy. Biology in the physical world is too complex to manipulate, and always expensive and risky to tamper with. An AI-driven virtual organism opens up a safer and more affordable alternative platform for exploring and experimenting with novel designs and interventions. Here I present a vision for, and recent results from, building such a system with multimodal and multiscale foundation models for biology at all levels.

Ziv Bar-Joseph 10:15 am - 11:00 am

Topic: "AI / ML in big pharma"

Abstract: While much of the cutting-edge work in biomedical data analysis and modeling is still mainly done in academia, biotechs and pharma are very advanced, and in some cases even leading, in areas related to molecular design and clinical data analysis. I have been leading the AI / ML work for R&D at one of the largest pharma companies for almost two years and will share some of the AI methods and models we have been developing and using to address computational challenges across all stages of the drug discovery and development process. I will also try to share some of the lessons I have learned over this period.

Coffee Break 11:00 am - 11:15 am
Michal Rosen-Zvi 11:15 am - 12:00 pm

Topic: "Multimodal generative AI technology for drug discovery"

Abstract: Foundation models have demonstrated remarkable success in extracting connections and hierarchies from text, generating meaningful new content, and addressing a variety of challenges. When applied to protein data, small molecules, single-cell RNA data, and genomic mutations, these models can capture biological and chemical relationships that enable protein design, molecular property prediction, and more. In this talk, I will review how foundation models have ushered in a new era of scientific discoveries, with a focus on drug discovery. I will also provide an in-depth look at novel multimodal biomedical foundation model technologies and discuss the current needs for further advancing these technologies, such as the development of benchmarks and improving explainability.

Lunch break 12:00 pm - 1:00 pm

Session 2

Jean-Philippe Vert 1:00 pm - 1:45 pm

Topic: "Foundation models, from pathology to genomics"

Abstract: Large self-supervised foundation models have boosted the capabilities of AI models in natural language processing and computer vision. In this talk I will present our efforts to train foundation models for digital pathology, and to connect visual observations to the underlying genomics.

Yu Li 1:45 pm - 2:30 pm

Topic: "Complex disease modeling and efficient drug discovery with large language models"

Abstract: Large language models, which can integrate and process large amounts of data in biomedicine, have great potential in modeling complex diseases and discovering functional biomolecules. Here, we showcase this potential with three examples. In the first example, we build a large language model trained on the insurance claims of around 123 million people in the US. With the model, we can give a unified representation of all the common complex diseases, which enables us to predict the genetic parameters of the diseases and efficiently discover unique genetic loci related to them. In the second example, we show how to utilize protein language models to discover remote homologs and functional peptides. With the model, we can discover diverse functional peptides with low sequence similarity to the known ones. In the final example, we show how to use an RNA language model to model the relationship between RNA sequence and structure, which enables us to perform RNA structure prediction.

Coffee break 2:30 pm - 2:45 pm

Session 3

Peter Koo 2:45 pm - 3:30 pm

Topic: "Biological discovery with virtual experiments using cellular digital twins"

Abstract: Deep neural networks (DNNs) have emerged as powerful tools for analyzing high-throughput functional genomics data. By fitting experimental data, DNNs can serve as "digital twins" of biological systems, enabling virtual experiments and counterfactual analyses. This approach opens up unprecedented opportunities for scientific discovery through in silico perturbation studies. Here, we introduce two domain-specific interpretability methods that leverage DNNs as a digital twin to uncover rules of gene regulation at various scales. First, we present SQUID, which employs nucleotide-level in silico perturbations to characterize biological mechanisms within a genomic locus. SQUID provides interpretable parameters representing cis-regulatory mechanisms by approximating DNN behavior in this local region of sequence space with biophysics-inspired surrogate models. This approach yields improved characterization of transcription factor binding motifs and enhances single-nucleotide variant effect predictions compared to existing attribution methods. Second, we present CRÈME, which employs multiscale in silico perturbations to identify cis-regulatory elements and characterize their enhancing or silencing effects on DNN predictions. By interpreting Enformer, a state-of-the-art sequence-based DNN for gene expression prediction, CRÈME reveals that Enformer's predictions generally integrate the effects of numerous enhancers and silencers through complex interaction rules, including additivity, cooperativity, and redundancy. Leveraging DNNs as a digital twin enables us to conduct virtual experiments at unprecedented scale, exploring vast combinatorial spaces of genetic variations to reveal regulatory patterns that are otherwise challenging to investigate through traditional wet-lab experiments. 
This AI-driven approach demonstrates the potential of interpretable deep learning in genomics, paving the way for more targeted experimental design and accelerating discoveries in biology and healthcare.

Hoifung Poon 3:30 pm - 4:15 pm

Topic: "Advancing Health at the Speed of AI"

Abstract: The dream of precision health is to develop a data-driven, continuous learning system where new health information is instantly incorporated to optimize care delivery and accelerate biomedical discovery. The confluence of technological advances and social policies has led to rapid digitization of multimodal, longitudinal patient journeys, such as electronic medical records (EMRs), imaging, and multiomics. Our overarching research agenda lies in advancing multimodal generative AI for precision health, where we harness real-world data to pretrain powerful multimodal patient embeddings, which can serve as digital twins for patients. This enables us to synthesize multimodal, longitudinal information for millions of cancer patients, and apply the population-scale real-world evidence to advancing precision oncology in deep partnerships with real-world stakeholders such as large health systems and pharmaceutical companies.

Open Discussion 4:15 pm - 5:00 pm

Day 2 Program (Oct 10, Thu, Executive Theatre)

Session 1

Coffee & Breakfast 9:00 am - 9:30 am
Le Song 9:30 am - 10:15 am

Topic: "Foundational AI Models for Multiscale Biological Systems"

Abstract: What will be the foundational AI models for biological systems? What data can be used to build them? How exactly should they be built? Nowadays, biological data grow rapidly and converge into a few standard modalities, such as DNA, RNA and protein sequences and structures, biomolecular interaction networks, and single-cell RNA sequencing and imaging. It seems timely to ask the intriguing question of whether foundational AI models that possess a certain level of generality and transferability can be established for multiscale biological systems and serve as the infrastructure to enhance the entire spectrum of downstream prediction tasks at different scales of biological systems.

In this talk, I will share my recent work in this direction on large-scale pretrained models that leverage large amounts of data from multiple scales of biology, including biological sequences, structures, molecular interactions, and single-cell transcriptomics. The pretrained models can be used as the foundation to address many predictive tasks arising from protein design and cellular engineering, and they achieve state-of-the-art performance.

Eran Segal 10:15 am - 11:00 am

Topic: "HealthFormer: Foundation biomedical models from deep human phenotyping data"

Abstract: Recent technological advances allow large cohorts of human individuals to be profiled, presenting many challenges and opportunities. I will present The Human Phenotype Project, a large-scale (>25,000 participants) deep-phenotype prospective longitudinal cohort and biobank that we established, aimed at identifying novel molecular markers with diagnostic, prognostic and therapeutic value, and at developing prediction models for disease onset and progression. Our deep profiling includes medical history, lifestyle and nutritional habits, vital signs, anthropometrics, blood tests, continuous glucose and sleep monitoring, and molecular profiling of the transcriptome, genetics, gut and oral microbiome, metabolome and immune system. Our analyses of this data provide novel insights into potential drivers of obesity, diabetes, and heart disease, and identify hundreds of novel markers at the microbiome, metabolite, and immune system level. Foundation AI models that we developed provide novel representations of the diverse modalities that we measured on the cohort and achieve state-of-the-art performance in predicting future onset of disease and trajectories of disease risk factors. Overall, our predictive models can be translated into personalized disease prevention and treatment plans, and to the development of new therapeutic modalities based on metabolites and the microbiome.

Coffee break 11:00 am - 11:15 am
Shirley Liu 11:15 am - 12:00 pm

Topic: "Integrated Computation and Genomics for Cancer Target and Drug Discovery"

Abstract: "Despite the exciting clinical benefits of immune checkpoint inhibitors, only a minority of cancer patients respond to treatment. Addressing resistance to immune checkpoint inhibitors is an urgent unmet need and requires novel approaches for target identification and drug discovery. GV20 Therapeutics adopts an interdisciplinary approach integrating AI on big data, functional genomics, and cancer immunology for cancer target identification and antibody drug discovery. Our STEAD platform computationally extracts antibodies from large cohorts of patient tumor RNA-seq profiles and uses AI to pair targets and corresponding antibodies in silico, de novo, with speed and scale. We then leverage in-house and public functional genomics and proteomics data to de-risk the AI-identified targets from patient tumors and provide insights on target function before we conduct systematic in vitro and in vivo validation experiments.

Using the STEAD platform, we discovered our lead program, GV20-0251, a first-in-class monoclonal antibody against a novel innate immune checkpoint IGSF8. In multiple syngeneic tumor models, anti-IGSF8 antibody has single-agent efficacy and is synergistic with anti-PD1 in controlling tumor growth. We then worked backward to unravel the function of IGSF8 in suppressing natural killer cells and dendritic cells in antigen presentation defective tumors (Li et al, Cell 2024). The power of GV20’s AI platform is demonstrated by GV20-0251’s unprecedented 3-year timeline from target research to IND, and validated by GV20-0251’s favorable safety and promising monotherapy efficacy in advanced metastatic cancer patients in US clinics (NCT05669430 trial). As the first AI-predicted target and AI-designed antibody drug in the clinic, GV20-0251 represents the beginning of rationally combining AI and genomics to unlock the hidden information from patient profiles to develop novel biotherapeutics."

Lunch break 12:00 pm - 1:00 pm

Session 2

Brian Hie 1:00 pm - 1:45 pm

Topic: "Biological sequence modeling from molecular to genome scale with Evo"

Abstract: The genome is a sequence that encodes the DNA, RNA, and proteins orchestrating an organism’s function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report the first scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA co-design with a language model. Evo also learns how small nucleotide changes affect whole-organism fitness and generates coding-rich sequences exceeding 1 megabase. These prediction and generation capabilities span molecular to genome scales of complexity, advancing our understanding and control of biology.

Christoph Bock 1:45 pm - 2:30 pm

Topic: "Programmed Cells? ML/AI-assisted Cell Engineering for Cancer and Immunity"

Abstract: Cell engineering is becoming widely useful for fundamental biology (e.g., cells as probes and recorders) and for therapy (e.g., CAR T cells). Our research combines biotechnology and bioinformatics to genetically program human and mouse cells, in order to execute complex biological functions in vitro and in vivo. We specifically focus on the many roles of epigenetic mechanisms as mediators of cellular memory and plasticity, connecting the developmental history of individual human cells to their future potential.

Our goal is to understand cells by programming them based on a quantitative understanding of epigenetic cell states. We pursue three synergistic directions: to map and analyze cell states by multi-omics, single-cell, and spatial profiling (READ), to model regulatory circuitries with deep learning (LEARN), and to build artificial biological programs into cells by genome engineering (WRITE). We develop wet-lab and computational methods for all three directions and pursue initial applications for cancer and immunity.

READ: We analyzed the single-cell and spatial landscape of autoimmune granulomas (Krausgruber et al. 2023 Immunity), epigenetic heterogeneity in solid tumors (Klughammer et al. 2018 Nature Medicine; Sheffield et al. 2017 Nature Medicine), the role of structural cells in immune regulation (Krausgruber et al. 2020 Nature), and organoids in the Human Cell Atlas (Bock et al. 2021 Nature Biotechnology).

LEARN: We inferred regulatory circuits from single-cell data using “knowledge-primed neural networks” (Fortelny et al. 2020 Genome Biology), used knockout collections for JAK-STAT signaling in mouse (Fortelny et al. Nature Immunology) to test for causality, demonstrated the use of GPT-4 as a biomedical simulator (Schaefer et al. 2024 CBM) and developed a joint multimodal embedding model of transcriptomes and text for natural language based analysis of scRNA-seq data (https://cellwhisperer.cemm.at).

WRITE: We developed concepts and assays for high-content CRISPR screening (Bock et al. 2022 Nature Reviews Methods Primers), including the CROP-seq method for pooled CRISPR screening with single-cell RNA sequencing readout (Datlinger et al. 2017 Nature Methods) and the scifi-RNA-seq method for cost-effective single-cell RNA-seq in 100,000s or millions of cells (Datlinger et al. 2021 Nature Methods).

Combining these three directions, we developed a platform for epigenetic and gene-regulatory optimization of CAR T cells using high-content screens in human primary cells (ex vivo) and in mice (in vivo). We identified several regulators that boost the performance of CAR T cells in these screens, and we systematically validated the in vivo effects of these boosted CAR T cells in systemic in vivo tumor models.

In conclusion, the combination of high-throughput profiling (READ), deep neural networks (LEARN), and genome editing at scale (WRITE) enables rapid functional dissection of epigenetic cell states and gene-regulatory networks in human cells, and their rational programming for biological research and therapy.

Coffee break 2:30 pm - 2:45 pm

Session 3

Christoph Feinauer 2:45 pm - 3:30 pm

Topic: "Unlocking Sequence Space from Structure for Diverse and Efficient Protein Design"

Abstract: This talk introduces a novel deep learning approach for unlocking sequence diversity from a single protein structure. By using structural encoders and decoders that generate entire sequence models for a given structure, we demonstrate the ability to rapidly generate diverse sequences while maintaining structural fidelity. This enhanced sequence diversity allows for a more comprehensive exploration of functional properties and opens the door to a broader understanding of sequence variability within fixed structures, with applications across therapeutics and biotechnology.

Firas D Khatib 3:30 pm - 4:15 pm

Topic: "Automating Human Intuition with the protein folding citizen science game Foldit"

Abstract: Can the brainpower of humans worldwide be applied to critical problems posed in computational biology, such as the structure determination of proteins and designing novel enzymes? Yes! Citizen scientists—most of whom have little or no prior biochemistry experience—have uncovered knowledge that eluded scientists for years. Players of the online protein folding video game Foldit have contributed to several scientific discoveries through gameplay. Rather than solving problems with a purely computational approach, combining humans and computers can provide a means for solving problems neither could solve alone.

Open Discussion 4:15 pm - 5:00 pm