Friday, September 20, 2024

Genomics England Harnesses Amazon SageMaker to Revolutionize Cancer Diagnosis with Multi-Modal Data-Driven Predictions for Enhanced Patient Survival Outcomes

Introduction

Genomics England, a leading genomics research organization, is dedicated to harnessing the power of genomic data to improve human health. In this blog post, we explore the company’s efforts to develop machine learning (ML) models that can accurately identify cancer subtypes and predict patient outcomes using multi-modal data, including genomic and imaging information.

Collaboration with Amazon Web Services

Genomics England has partnered with Amazon Web Services (AWS) to develop a multi-modal program aimed at enhancing its dataset and creating an automatic cancer sub-typing and survival detection pipeline. This collaboration has led to the creation of two proof-of-concept (PoC) exercises, which demonstrate the potential of multi-modal ML for survival analysis and cancer sub-typing.

Data

The PoCs have used publicly available cancer research data from The Cancer Genome Atlas (TCGA), which contains paired high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcome and histologic grade labels. The data includes whole slide histopathology images of tissue samples, as well as gene expression, copy number variations, and the presence of deleterious genetic variants.

Multi-Modal Machine Learning Frameworks

The ML pipelines tackling multi-modal subtyping and survival prediction have been built in three phases throughout the PoC exercises. The first phase implemented the state-of-the-art framework, Pathology-Omic Research Platform for Integrative Survival Estimation (PORPOISE). The second phase proposed, developed, and implemented a novel architecture based on Hierarchical Extremum Encoding (HEEC). The final phase improved on the results of HEEC and PORPOISE using a foundation model trained in a self-supervised manner, Hierarchical Image Pyramid Transformer (HIPT).

PORPOISE

PORPOISE is a multi-modal ML framework that consists of three sub-network components: CLAM (clustering-constrained attention multiple instance learning) for the whole slide images, a self-normalizing network for the genomic features, and a multi-modal fusion layer that combines the two. Although PORPOISE performed well overall, its multi-modal performance fell below that of the best single modality (imaging) alone when gene expression data was excluded from the genomic features.
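To make the role of a fusion layer concrete, here is a minimal sketch of one way two modality embeddings can be combined. The article does not say how the fusion layer works, so the Kronecker (outer) product with an appended constant 1 is an assumption chosen for illustration, and the function name `kronecker_fuse` is hypothetical.

```python
from typing import List, Sequence


def kronecker_fuse(image_emb: Sequence[float], genomic_emb: Sequence[float]) -> List[float]:
    """Fuse two embeddings with an outer product (illustrative assumption).

    A constant 1.0 is appended to each embedding so the fused vector keeps
    the original single-modality terms alongside the pairwise interactions.
    """
    img = list(image_emb) + [1.0]
    gen = list(genomic_emb) + [1.0]
    # Flattened outer product: every image feature times every genomic feature.
    return [a * b for a in img for b in gen]


if __name__ == "__main__":
    fused = kronecker_fuse([1.0, 2.0], [3.0])
    print(len(fused), fused)
```

The fused vector has length (len(image_emb) + 1) * (len(genomic_emb) + 1), so in practice each embedding would first be reduced to a small dimension.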

HEEC

To mitigate the limitations of PORPOISE, AWS developed a novel model structure, HEEC, which is based on three ideas: using tree ensembles to mitigate the sparsity and overfitting issue, representation construction using a novel encoding scheme, and hierarchical learning to allow representations at multiple spatial scales. HEEC is interpretable out of the box, as it possesses implicit spatial information and supports feature importance.
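The article gives no implementation details for HEEC, so the following is only a rough sketch of what "extremum encoding at multiple spatial scales" could look like. It assumes each slide is a set of patches with normalized coordinates and a feature vector, that the encoding is a per-feature minimum and maximum, and that the hierarchy is a series of progressively finer grids. All names (`Patch`, `extremum_encode`, `hierarchical_encode`) are hypothetical.

```python
from typing import List, Sequence, Tuple

# (x, y, features), with x and y normalized to [0, 1). Hypothetical layout.
Patch = Tuple[float, float, Sequence[float]]


def extremum_encode(features: List[Sequence[float]], dim: int) -> List[float]:
    """Per-feature minimum followed by per-feature maximum; zeros if no patches."""
    if not features:
        return [0.0] * (2 * dim)
    mins = [min(f[i] for f in features) for i in range(dim)]
    maxs = [max(f[i] for f in features) for i in range(dim)]
    return mins + maxs


def hierarchical_encode(patches: List[Patch], dim: int, levels: int = 2) -> List[float]:
    """Concatenate extremum encodings over 1x1, 2x2, ... grids of the slide."""
    encoding: List[float] = []
    for level in range(levels):
        cells = 2 ** level  # grid is cells x cells at this level
        for row in range(cells):
            for col in range(cells):
                in_cell = [f for x, y, f in patches
                           if int(x * cells) == col and int(y * cells) == row]
                encoding.extend(extremum_encode(in_cell, dim))
    return encoding


if __name__ == "__main__":
    slide = [(0.1, 0.1, [1.0, 5.0]), (0.9, 0.9, [3.0, 2.0])]
    print(hierarchical_encode(slide, dim=2))
```

Under these assumptions every slide maps to a fixed-length vector regardless of how many patches it has. That vector could be concatenated with genomic features and passed to a tree ensemble, and because each position corresponds to a known grid cell and feature, feature importances retain a spatial meaning.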

Results

Table 2 shows the classification and survival prediction performance of the two implemented multi-modal ML models on TCGA data. By combining multiple modalities, HEEC outperforms the best-performing single modality.
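The article does not name the metric used for survival prediction. A common choice for this kind of comparison is the concordance index, which measures how often a model ranks patient risk in the same order as the observed survival times. The sketch below is a plain-Python version of that metric, included only as an illustration and not as the evaluation actually used in the PoCs.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Fraction of comparable patient pairs whose risk ordering matches survival.

    times:  follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    risks:  predicted risk score, higher meaning shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    print(concordance_index([2, 4, 6], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in reverse.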

Conclusion

Genomics England has made significant progress in developing multi-modal ML models for cancer subtyping and survival prediction. Implementing state-of-the-art models and helping to establish robust practices is intended to let researchers make full use of the multi-modal data in their work.

Frequently Asked Questions

Q1: What is the goal of Genomics England’s multi-modal program?

A1: The goal of Genomics England’s multi-modal program is to develop machine learning models that can accurately identify cancer subtypes and predict patient outcomes using multi-modal data, including genomic and imaging information.

Q2: What is the difference between PORPOISE and HEEC?

A2: PORPOISE is a state-of-the-art framework that consists of three sub-network components, while HEEC is a novel architecture that uses tree ensembles, representation construction, and hierarchical learning to mitigate the limitations of PORPOISE.

Q3: What is the advantage of using HEEC over PORPOISE?

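A3: When modalities are combined, HEEC outperforms the best-performing single modality, whereas PORPOISE's multi-modal performance fell below imaging alone when gene expression data was excluded. HEEC is also interpretable out of the box, because it retains implicit spatial information and supports feature importance.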

Q4: What is the significance of the TCGA data used in the PoC exercises?

A4: The TCGA data is publicly available and contains paired high-throughput genome analysis and diagnostic whole slide images with ground-truth survival outcome and histologic grade labels, making it an ideal dataset for developing and evaluating multi-modal ML models.

Q5: What is the future direction of Genomics England’s multi-modal program?

A5: The future direction of Genomics England’s multi-modal program is to continue developing and refining multi-modal ML models, as well as exploring new applications and use cases for these models in cancer research and patient care.
