By Mu Huan and Eric Li
A large gene sequencing company generates 10 TB to 100 TB of offline data every day, and its big data bioinformatics analysis platform needs petabyte-level data processing capability. Biotechnology and computer technology support each other behind the scenes: sequencing applications are gradually moving from scientific research to clinical use, computing models are evolving from offline to online, and delivery efficiency is becoming more and more important.
DNA Sequencing and Related Research
James Watson and Francis Crick discovered the double helix structure of DNA in 1953. Since then, the mysteries of species evolution and heredity have begun to be uncovered, and our understanding of heredity as digital information has opened up. Even so, the mystery of human genes is being understood only bit by bit.
In 1956, the discovery of a correlation between cancer and chromosomes shocked the whole cancer research community: in patients with Chronic Myeloid Leukemia (CML), chromosome 22 is significantly shorter than normal. More than two decades later, scholars discovered that the ABL gene on chromosome 9 had been joined to the BCR gene on chromosome 22; this reciprocal translocation produced the BCR-ABL fusion gene. The BCR-ABL protein is always active and uncontrolled, which causes unchecked cell division and leads to cancer.
That is to say, as long as cells express the BCR-ABL protein, a risk of blood cancer exists. Several researchers in the United States began in-depth research on this basis and successfully launched a new drug for the treatment of CML: Glivec, also known as the “high-priced drug” in the movie “Dying to Survive” last year.
Before the emergence of Glivec, only 30% of patients with CML survived for 5 years after diagnosis. With Glivec, this number increased from 30% to 89%, and after 5 years of treatment, 98% of patients achieved complete hematological remission. For this reason, Glivec is included in the WHO Model List of Essential Medicines and is considered one of the most effective and safest drugs in the medical system that meet the most significant needs.
How Containers and Hybrid Cloud Can Be Used with Gene Sequencing
Gene sequencing is widely used in the field of hematological tumors. Based on a patient's diagnosis, the hematological oncologist chooses the corresponding examinations. For example, PCR combined with real-time fluorescence probe technology can detect the BCR-ABL fusion gene to diagnose CML, or SEGF (Single-end Gene Fusion) can detect complex gene fusion types from single-end next-generation sequencing (NGS) data.
On the other hand, Non-Invasive Prenatal Testing (NIPT) for Down syndrome (or Edwards syndrome), with its high accuracy and low risk to the fetus, has become increasingly popular among young women in China in recent years. A gene company completes hundreds of thousands of NIPTs every year, and each NIPT involves the processing, storage, and report generation for hundreds of megabytes of data.
Challenges in Genetic Computing:
- Data Storage: Data volume increases rapidly, storage costs are high, and management is difficult. The reliability of long-term data storage is hard to guarantee, so a low-cost compression method for large volumes of data is needed. Metadata management is chaotic, and data cleaning is difficult.
- Distribution and Sharing: Massive data needs to be distributed quickly and securely to many places in China and overseas. The traditional hard-drive delivery method has a long delivery cycle and low reliability. Data in multiple centers must be shared for access.
- Computing and Analysis: Batches of samples take a long time to process, and resource demand has obvious peaks and valleys that are difficult to plan for. Large-scale sample data mining requires massive computing resources that local clusters can hardly provide. Computing workflows are difficult to migrate, to schedule between online and offline resources, and to manage across regions. Offline clusters have poor elasticity for computing on demand.
- Security and Compliance: Genetic data requires extremely high security and privacy, but the security defense capability of a self-built data center is insufficient. Data usage contracts (for example, blockchain-based) and support for RAM users are also needed.
In this case, a complete architecture solution is essential. Compared with Alibaba Cloud HPC, the solution should split tasks as required, automatically apply for resources from the cloud, automatically scale to minimize resource holding costs (achieving a resource utilization rate of more than 90%), and automatically return computing resources after use. It maximizes resource usage efficiency, minimizes the processing cost of a single sample, and completes the processing of large batches of samples as quickly as possible. As the gene sequencing business grows, it can automatically combine offline resource usage with online resource expansion, and it offers high-speed intranet bandwidth, high-throughput storage, and almost unlimited storage space.
Unlike conventional computing, genetic computing imposes high requirements on the computing and storage capability for massive data. Computing resources on the cloud can be flexibly scheduled at large scale, mainly through the automatic scaling feature of container computing and the automatic scaling capability of Alibaba Cloud ECS. By rationally splitting genetic data, large-scale parallel computing and TB-level sample data processing can be realized. Through computing capability acquired on demand and the use of high-throughput object storage, the cost of holding computing resources and the processing cost of a single sample are greatly reduced.
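The split-then-process pattern described above can be sketched as follows. This is a minimal illustration, not any Alibaba Cloud API: the shard size, the toy GC-content metric, and all function names are assumptions.

```python
# Sketch: split sample records into fixed-size shards and process the shards
# in parallel, keeping shard order in the results.
from concurrent.futures import ThreadPoolExecutor

def shard_records(records, shard_size):
    """Yield successive shards of at most shard_size records."""
    for i in range(0, len(records), shard_size):
        yield records[i:i + shard_size]

def gc_fraction(shard):
    """Toy per-shard computation: fraction of G/C bases in the shard."""
    bases = "".join(shard)
    return (bases.count("G") + bases.count("C")) / max(len(bases), 1)

def parallel_gc(records, shard_size=2, workers=4):
    """Map the per-shard computation over all shards concurrently."""
    shards = list(shard_records(records, shard_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_fraction, shards))
```

A real pipeline would shard FASTQ files by byte ranges and dispatch each shard to its own container or ECS instance, but the map-over-shards structure is the same.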
The overall technical architecture is a hybrid cloud of cloud native containers, integrating resources on and off the cloud and managing clusters uniformly across regions. As the main player, container technology provides standardized processes, acceleration, elasticity, authentication, observation, measurement, and other capabilities for data splitting, data quality control, and variant calling. On the other hand, high-value mining requires container-based machine learning platforms and parallel frameworks to build models for large-scale linear algebra computing over genetic, protein, and medical data, making precision medicine a reality.
Key Problems and Solutions in Genetic Engineering
The workflow of genetic engineering is characterized by a multi-level directed acyclic graph (DAG). A DAG with a depth of 1,000 to 5,000+ steps for a big scientific research workflow requires accurate process state monitoring and high process stability. Processes must be restartable from any step, and failed steps must automatically retry and continue. Task scheduling, notifications, logs, audits, and queries must be supported, and an operation portal (CLI/UI) must be integrated.
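The core mechanics of such a workflow engine, retrying failed steps and resuming from a checkpoint rather than restarting the whole graph, can be sketched in a few lines. This is a minimal illustration under assumed step names, not the AGS implementation.

```python
# Minimal DAG workflow engine: topological execution with per-step retries
# and a `done` checkpoint set that lets a restart continue mid-graph.
from graphlib import TopologicalSorter

def run_workflow(dag, actions, done=None, max_retries=2):
    """dag: {step: {prerequisite steps}}; actions: {step: callable}.
    `done` holds already-completed steps, so rerunning skips them."""
    done = set() if done is None else done
    for step in TopologicalSorter(dag).static_order():
        if step in done:
            continue  # checkpointed: skip on restart
        for attempt in range(max_retries + 1):
            try:
                actions[step]()
                done.add(step)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted; caller may rerun with `done`
    return done
```

A production engine additionally persists the checkpoint set, parallelizes independent branches, and records logs and audits per step, but restart-from-any-step reduces to exactly this skip-completed logic.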
- Data Migration and Transmission
The reading of millions of small files in the data migration and data splitting phases puts pressure on the underlying file system, so sample processing efficiency can be improved by avoiding unnecessary reads and writes of small files. Through a physical connection between the data center and Alibaba Cloud, high-throughput, low-latency data migration to the cloud can be implemented, with migration, verification, and detection combined into workflows. The final goal is to complete the encrypted migration of dozens of TB of data in a short time, ensure the high performance and security of the data transmission client, implement concurrent and resumable transmission, and maintain complete access authorization control.
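The resumable-transmission idea above amounts to checkpointing the transfer offset so an interrupted job continues where it stopped. A minimal sketch, assuming a local copy stands in for the network hop; a real client would add encryption, checksums, and concurrent streams, and all names here are illustrative:

```python
# Sketch of a resumable, chunked transfer: progress is checkpointed to a
# sidecar offset file after every chunk, so a rerun resumes mid-file.
import os

def resumable_copy(src, dst, state, chunk_size=1 << 20):
    """Copy src to dst in chunks, persisting the byte offset in `state`."""
    offset = 0
    if os.path.exists(state):
        with open(state) as f:
            offset = int(f.read() or 0)  # resume point from a previous run
    mode = "r+b" if os.path.exists(dst) else "wb"
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(offset)
        fout.seek(offset)
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
            offset += len(chunk)
            with open(state, "w") as f:  # checkpoint after each chunk
                f.write(str(offset))
    return offset
```

Rerunning the function after an interruption rereads the offset file and transfers only the remaining bytes, which is the behavior the migration workflow relies on for dozens-of-TB datasets.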
- Typical Genetic Computing Task: Enhanced Workflow
A typical feature of genetic computing is that data is computed in batches that must be completed sequentially according to specific steps. Abstractly, this means a declarative workflow definition, the AGS (Alibaba Cloud Genomics Service) workflow, is required.
The solution adopted is as follows:
- Simple YAML declarative definition, multi-level and directed acyclic graph, support for complex dependencies, automatic task splitting, and automatic parallelization.
- Cloud native, with an enhanced workflow definition fully compatible with the community Argo Workflows.
- Real-time resource statistics, integrated CloudMonitor monitoring, cloud log SLS integration, audit integration, and task scheduling.
- Unified operation portal AGS-CLI integrated with Kubectl.
- Alibaba Cloud storage volume declarative support, NAS, OSS, CloudDisk, and cache acceleration support.
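As a rough illustration of what such an Argo-compatible declarative definition looks like, the fragment below declares a two-step DAG (split, then variant calling) with automatic retries. The image names, template names, and step contents are assumptions for illustration, not the actual AGS schema.

```yaml
# Illustrative Argo-style workflow: a DAG with dependencies and retries.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ags-demo-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: split
            template: split-fastq
          - name: call-variants
            template: call
            dependencies: [split]      # runs only after split succeeds
    - name: split-fastq
      container:
        image: example.registry/split:latest   # placeholder image
        command: [sh, -c, "echo splitting input"]
    - name: call
      retryStrategy:
        limit: 3                       # failed steps retry automatically
      container:
        image: example.registry/gatk:latest    # placeholder image
        command: [sh, -c, "echo calling variants"]
```

Real pipelines extend this with storage volume declarations (NAS, OSS, CloudDisk) and parallel fan-out over data shards.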
- Unified Scheduling of On-Cloud and Off-Cloud Resources
The unified scheduling of computing resources and data aggregation on the cloud can be implemented through a hybrid Alibaba Cloud Container Service for Kubernetes (ACK) cluster spanning IDCs and availability zones on the cloud. Data migration to the cloud and subsequent data processing are automated and streamlined, so that local processing, migration to the cloud, cloud-based processing of batch offline data, and report generation are completed within 24 hours. Computing nodes or serverless computing resources are flexibly provided on demand, forming on-demand computing capability to handle sudden analysis tasks.
The Alibaba Cloud genetic data service team that I lead strives to build a more elastic container-based cluster, with the ability to automatically scale hundreds of nodes in minutes and the serverless capability to pull up thousands of lightweight containers in minutes. It improves the utilization of intranet bandwidth by increasing parallelism, ultimately improving the overall data throughput rate; improves I/O read/write speed through TCP optimization of the NAS client and server; and accelerates the reading of object storage by adding a cache layer and distributed cache for OSS.
Life Science and Precision Medicine Applications
The NovaSeq sequencer provides a low-cost ($100/WGS) and high-output (6 TB) NGS solution. Using a large number of NovaSeq machines, a gene sequencing company generates dozens of TB of data every day, which requires powerful computing capability for splitting and variant discovery, as well as a large amount of storage for raw data and variant data. Alibaba Cloud genetic data service continuously improves its elastic computing capabilities, massively parallel processing, and massive high-speed storage to help gene sequencing companies quickly automate the processing of dozens of TB of offline data each day and generate high-quality variant data through the standard GATK pipeline.
With the emergence of third-generation sequencing (TGS), represented by PacBio and Nanopore, long reads from over 30 kb to several hundred kb and large outputs from 20 GB to 15 TB are generated. Long reads and large data volumes demand more powerful computing capability and higher I/O throughput for alignment, splitting, and variant discovery, and they bring greater challenges in optimizing the genetic analysis process, splitting data, scheduling large amounts of computing resources on demand, and providing ultra-high I/O throughput during genetic computing.
Decoding the unknown, measuring life. Every small step in science and technology is a big step forward for mankind.
Eric Li is an Alibaba Cloud senior architect, a data scientist, a Top-2 winner of the FDA 2018 precision medicine competition, and a solution expert in the financial and biological computing industries. He specializes in the development of Kubernetes-based container products and their productization for the banking and bioinformatics industries. Before joining Alibaba Cloud, he served as the chief architect of the Watson Data Service container platform and as a machine learning platform architect, and was a gold winner of the IBM 2015 Spark global competition. He has led several large-scale development projects covering cloud computing, database performance tools, distributed architecture, biological computing, big data, and machine learning.