Bridging Genetics and Protein Structures: Unlocking the Molecular Mechanisms of Disease with the G2P Portal
The Genomics 2 Proteins (G2P) portal is a resource designed to bridge the gap between genetic variant data and protein structure-function relationships. Now that AI-based methods such as AlphaFold have predicted millions of protein structures, scientists need tools that link genetic variants to their molecular phenotypes effectively.
Overview of the G2P Portal
Developed by a multidisciplinary team from the Broad Institute of MIT and Harvard, along with other international collaborators, the G2P portal aggregates genetic variant data from databases such as gnomAD, ClinVar, and the Human Gene Mutation Database (HGMD). It maps these variants onto over 42,000 protein sequences and nearly 78,000 protein structures, covering 99% of the human proteome. The mapping is performed by the Genomics 2 Proteins 3D (G2P3D) API, which reconciles the different identifiers used for genes, transcripts, and protein structures.
The G2P portal is more than just a variant-to-protein mapping tool. It allows users to upload their own genetic data, such as variants or protein residue-wise annotations, and map them to corresponding protein structures. The portal supports downloadable outputs, including CSV or TSV formats and PyMOL-compatible files, enabling further downstream analysis.
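As an illustration of the kind of downstream analysis a tabular download allows, the following minimal sketch turns a tab-separated variant table into PyMOL commands that highlight the affected residues. The column names (`residue_position`, `variant`) and the example rows are assumptions made for this sketch, not the portal's actual export schema; only the PyMOL commands `select`, `show`, and `color` are standard.

```python
import csv
import io

# Made-up excerpt of a downloaded TSV; the column names are assumptions
# for this sketch and may differ from the portal's real export.
EXAMPLE_TSV = (
    "residue_position\tvariant\n"
    "24\tp.Ser24Leu\n"
    "87\tp.Arg87Cys\n"
    "252\tp.Arg252Trp\n"
)


def pymol_commands(tsv_text, position_column="residue_position",
                   selection_name="variants"):
    """Return PyMOL commands that select and highlight variant residues."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Collect unique residue numbers; several variants can hit one residue.
    positions = sorted({int(row[position_column]) for row in reader})
    resi = "+".join(str(p) for p in positions)
    return [
        f"select {selection_name}, resi {resi}",
        f"show sticks, {selection_name}",
        f"color red, {selection_name}",
    ]


commands = pymol_commands(EXAMPLE_TSV)
print("\n".join(commands))
```

The printed lines can be saved to a `.pml` script or pasted into the PyMOL command line after loading a structure. The sketch assumes the residue numbers in the table match the numbering of the loaded structure, which is not guaranteed for experimentally determined fragments.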
Solving Key Challenges in Linking Genomics to Proteins
One of the major hurdles in connecting genetic data to protein structural data is the diversity of RNA transcripts and protein isoforms generated from a single DNA sequence. Furthermore, genomic identifiers come in varied formats, such as rsIDs (SNP identifiers) and Human Genome Variation Society (HGVS) notation, which complicates the alignment between genetic data and protein sequences, and the available structures may cover only fragments of a protein. The G2P portal addresses these challenges by aligning genetic and protein data through the G2P3D API, enabling users to map variants onto full-length proteins.
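To make the identifier problem concrete, here is a minimal sketch of the first step any such mapping needs: parsing a protein-level HGVS substitution and checking that its reference residue agrees with a given isoform sequence. This is not the G2P3D implementation; the function names are hypothetical, and the parser only handles simple three-letter substitutions such as `p.Arg252Trp` (optionally parenthesized), not frameshifts, deletions, or other HGVS forms.

```python
import re

# Three-letter to one-letter amino acid codes, plus "Ter" for a stop codon.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "*",
}

# Matches e.g. "p.Arg252Trp" or the predicted form "p.(Arg252Trp)".
HGVS_SUBSTITUTION = re.compile(r"^p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?$")


def parse_protein_substitution(hgvs):
    """Return (ref, position, alt) in one-letter codes, or None if unsupported."""
    match = HGVS_SUBSTITUTION.match(hgvs.strip())
    if match is None:
        return None
    ref3, position, alt3 = match.groups()
    if ref3 not in THREE_TO_ONE or alt3 not in THREE_TO_ONE:
        return None
    return THREE_TO_ONE[ref3], int(position), THREE_TO_ONE[alt3]


def reference_matches(sequence, ref, position):
    """Check that a 1-based position holds the expected reference residue."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == ref


parsed = parse_protein_substitution("p.Arg252Trp")
print(parsed)
```

A failed `reference_matches` check is a common sign that a variant was annotated against a different transcript or isoform than the sequence at hand, which is exactly the mismatch the paragraph above describes.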
The portal lets researchers explore the structural implications of natural or synthetic genetic variants and better understand how these variants might affect protein function, stability, or interactions.
Key Features and Functionality
Gene/Protein Lookup Module: This module allows users to search for specific genes or proteins and see how variants from databases like gnomAD, ClinVar, and HGMD map onto protein sequences and structures. Users can also filter these variants based on their clinical significance or population frequency, making it easier to focus on pathogenic or rare variants.
Interactive Mapping Module: Users can upload their own genetic variants and sequence annotations and map them onto protein structures from public databases or onto structures they provide themselves. This flexibility supports customized analyses that go beyond the scope of public data, in both clinical and research applications.
Data Visualization: The portal provides extensive visualization options, such as mapping variants and structural features onto 3D protein models. For example, users can visualize variants in the context of known binding sites or post-translational modifications, aiding in the interpretation of their functional impact. Additionally, the portal integrates structural confidence scores from AlphaFold, enabling users to assess the reliability of their protein structure models.
Comprehensive Data Coverage: The G2P portal aggregates data from a variety of sources, covering a wide range of variant types (e.g., missense, frameshift, nonsense) and their predicted functional consequences. This includes over 18 million protein-coding variants from gnomAD, 1.7 million from ClinVar, and 312,738 disease-causing mutations from HGMD, mapped onto protein structures derived from PDB and AlphaFold databases.
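The AlphaFold confidence scores mentioned under Data Visualization can also be inspected outside the portal. AlphaFold model files in PDB format store the per-residue pLDDT score in the B-factor column, so a short script can flag variants that fall in low-confidence regions. The sketch below is an illustration under stated assumptions: the helper `ca_record` only fabricates a few fixed-width ATOM records so the example is self-contained, the pLDDT values are invented, and the 90/70/50 cut-offs follow the bands commonly shown for AlphaFold models.

```python
def ca_record(serial, res_name, res_seq, plddt):
    """Build one fixed-width PDB ATOM record for a C-alpha atom (coordinates zeroed)."""
    return (
        f"{'ATOM':<6}{serial:>5} "          # record name, atom serial, blank column 12
        f"{'CA':^4} {res_name:>3} A"        # atom name (cols 13-16), altLoc, residue, chain
        f"{res_seq:>4}{'':4}"               # residue number (cols 23-26), iCode + blanks
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}"   # x, y, z
        f"{1.0:6.2f}{plddt:6.2f}"           # occupancy, B-factor (pLDDT in AlphaFold files)
    )


def plddt_by_residue(pdb_text):
    """Map residue number to pLDDT, read from the B-factor of each C-alpha atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Bin a pLDDT score using the commonly quoted 90/70/50 cut-offs."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Invented three-residue model, for illustration only.
EXAMPLE_PDB = "\n".join([
    ca_record(1, "MET", 1, 45.12),
    ca_record(2, "LYS", 2, 72.50),
    ca_record(3, "ARG", 3, 95.00),
])

scores = plddt_by_residue(EXAMPLE_PDB)
bands = {residue: confidence_band(score) for residue, score in scores.items()}
print(bands)
```

Reading only C-alpha atoms works here because AlphaFold assigns the same pLDDT to every atom of a residue. For experimentally determined PDB entries the B-factor column has its usual meaning and this interpretation does not apply.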
Use Cases and Applications
The G2P portal has broad applications in both basic and translational research. In clinical genetics, for example, the portal can help identify the molecular mechanisms behind variants of uncertain significance (VUS), providing insights into whether a specific mutation might disrupt protein function or structure. In drug discovery, researchers can use the portal to map disease-associated variants onto drug targets, identifying potential sites for therapeutic intervention.
A case study provided by the authors illustrates how the portal can be used to explore pathogenic variants in MORC2, a gene associated with Charcot-Marie-Tooth disease type 2Z. By filtering for ClinVar variants that are classified as pathogenic or likely pathogenic, researchers can map these variants onto MORC2's protein structure and identify how they cluster in functionally critical regions, such as binding sites.
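The logic of this kind of analysis, filtering by clinical significance and then asking where the remaining variants fall, can be sketched in a few lines. Everything below is illustrative: the variant records, field names, and region boundaries are invented placeholders rather than MORC2 data or the portal's schema. Only the significance labels follow ClinVar's usual terms.

```python
PATHOGENIC_LABELS = {"Pathogenic", "Likely pathogenic"}

# Made-up records; field names are assumptions, not the portal's schema.
VARIANTS = [
    {"protein_change": "p.Ala20Val", "position": 20,
     "clinical_significance": "Pathogenic"},
    {"protein_change": "p.Gly55Ser", "position": 55,
     "clinical_significance": "Likely pathogenic"},
    {"protein_change": "p.Leu90Pro", "position": 90,
     "clinical_significance": "Uncertain significance"},
    {"protein_change": "p.Asp130Asn", "position": 130,
     "clinical_significance": "Pathogenic"},
    {"protein_change": "p.Thr300Met", "position": 300,
     "clinical_significance": "Benign"},
]

# Hypothetical inclusive residue ranges; real boundaries would come from
# domain or binding-site annotations.
REGIONS = {"region_A": (1, 100), "region_B": (120, 200)}


def filter_by_significance(variants, labels):
    """Keep variants whose clinical significance is one of the given labels."""
    return [v for v in variants if v["clinical_significance"] in labels]


def count_by_region(variants, regions):
    """Count variants per annotated region; the rest go to 'unannotated'."""
    counts = {name: 0 for name in regions}
    counts["unannotated"] = 0
    for variant in variants:
        hits = [name for name, (start, end) in regions.items()
                if start <= variant["position"] <= end]
        for name in hits:
            counts[name] += 1
        if not hits:
            counts["unannotated"] += 1
    return counts


pathogenic = filter_by_significance(VARIANTS, PATHOGENIC_LABELS)
region_counts = count_by_region(pathogenic, REGIONS)
print(region_counts)
```

Raw counts like these are only a first look. Judging whether variants genuinely cluster would also require accounting for region length and, ideally, three-dimensional proximity in the structure, which is what the portal's structure view is for.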
Future Directions
The G2P portal is continuously updated. Planned features include cross-species variant mapping, predictions of binding pockets and of free energy changes upon mutation, and automated structure searches for variant mapping. These additions will expand the scope of the portal as a resource for studying protein function and disease mechanisms.
Conclusion
The G2P portal is an open-access tool that enables researchers to connect genetic variants to their protein-level effects. By integrating large volumes of genomic and protein structural data, it facilitates a deeper understanding of how genetic variation influences protein function and disease. The resource has the potential to accelerate discoveries in personalized medicine, drug development, and basic biology.
References:
Kwon, S., Safer, J., Nguyen, D.T. et al. Genomics 2 Proteins portal: a resource and discovery tool for linking genetic screening outputs to protein sequences and structures. Nat Methods (2024).