Blogs > Introduction to GaussDB(fro MySQL)

Introduction to GaussDB(fro MySQL)

Manyi Lu Aug 23, 2021

Database

Copy the link to clipboard

GaussDB(for MySQL) is a cloud native database based on MySQL, available as a fully managed service on Huawei Cloud, targeting both the internet and enterprise customers. In this article, Manyi Lu explains common customer workloads and how we leverage the unique capabilities of the Huawei Cloud cloud computing stack to handle their workloads.

GaussDB(for MySQL) – unleashing the power of cloud stack vertical integration

Cloud computing is experiencing increasing popularity, and one key component in the cloud stack is transactional database services. Applications rely on scalable, high- performance managed database services to fully benefit from the cloud platform. Cloud database services need to efficiently utilize underlying cloud infrastructure to realize the potential of cloud-scale operations.

GaussDB(for MySQL) is a cloud native database based on MySQL, available as a fully managed service on Huawei Cloud, targeting both the internet and enterprise customers. In this article, I will explain common customer workloads and how we leverage the unique capabilities of the Huawei Cloud computing stack to handle this workload.

Who are the customers on cloud and what are their workloads?

Some people thought that only internet startups adopt the cloud platform,and MySQL is popular as a cloud database due to its popularity among internet companies. But in reality, enterprises already started to embrace the cloud concept a few years ago, and it is also the current trend. MySQL, as the most popular open source database in the world, is widely adopted in all industries as well as internet companies.

So what are the typical workloads for cloud database customers? Two characteristics we observe are 1) increasingly larger data volume -- single to double digit terabytes of data to begin with and it keeps growing over time, and 2) a mix of simple insert/delete/update/point select and complex analytics queries. In addition, there are occasional DDL operations.

The challenge is how to make the database perform well when the data volume is large. Customers want to run analytics queries while maintaining the throughput of the core transactional workload. Queries are complex due to the nature of business logic in enterprises. Luckily, MySQL 8.0 added long awaited SQL support for windowing functions and recursive CTE. For unstructured data, MySQL's JSON support is already hugely popular.

Overview of the GaussDB(for MySQL) architecture

At a high level, GaussDB(for MySQL) is similar to AWS Aurora. It is built on top of a shared distributed storage system, and the maximum data volume for one database is currently 128 TB. One master node serves the read-write load, and up to fifteen read-only replicas serve the read load. The SQL engine is a heavily modified MySQL server version 8.0, and therefore 100% compatible with MySQL both in terms of syntax and semantics. There is an RDMA network between compute nodes and storage.

The storage system used by GaussDB(for MySQL) is a highly reliable, cross-AZ cloud storage. On public cloud, storage systems may contain a large cluster with hundreds of nodes. Its scaling in terms of storage nodes can be orders of magnitude larger than what you find on single tenant on-prem solutions. SQL nodes flush redo logs to storage, and pages are materialized in the storage layer, and this design significantly reduces network communication for update intensive workloads. Pages belonging to a single database are organized in slices, and slices are distributed over multiple storage nodes.

GaussDB (for MySQL) architecture

Huawei's unique strength - vertical integration

Cloud databases, unlike traditional on-prem databases, allow for vertical integration of all layers in the cloud stack. Huawei, as a leading provider of all layers, is in a unique position to become an industrial leader in cloud.

What comes closest to database in the cloud stack is the storage. On-prem software-only databases need to work with commodity storage over standard filesystem interfaces, which leaves little room for optimization. Oracle Exadata, on the other hand, already demonstrated the power of integration between storage and database in the form of an appliance.

On cloud, this integration plays an even bigger role, as cloud storage has much higher scalability in terms of storage nodes than what you find inside Oracle Exadata, and customers can dynamically scale the storage based on data volume and load. Since the storage is shared between many tenants, and not all tenants will be running large scans all the time, we can achieve a much higher resource utilization rate by offloading parts of query processing to the storage layer.

Improving performance through parallelization

A generic approach of improving performance is through parallelization, and this can be done on multiple layers. The community version of MySQL 8.0 only supports single-threaded query execution, and cannot fully leverage all cores available in the hardware for complex queries. We have modified MySQL execution engine, to allow a single query to be executed using multiple threads in parallel. Cloud infrastructure, unlike on-prem solutions, allows us to utilize vertical scaling on compute nodes. The largest VM shape has 64 cores at the moment, and this gives an indication of the maximum parallelization we can achieve through parallel query execution. This optimization is suited when data mostly fits into the buffer pool.

Customer workloads include not only DMLs, but also DDLs, such as creating an index or changing the data type of a column.Although most DDLs are online in MySQL, some operations can be blocked, and this problem is enlarged by the use of logical replication. GaussDB avoids this issue as it uses physical replication. When a table is large, DDL operations can take many hours to complete. In order to support the data volume we commonly see on the cloud, the need for optimizing DDLs is obvious.

Another layer which allows for an even higher degree of parallelization is storage, as the storage system may potentially have hundreds of nodes and thousands of cores. This massive cloud-scale storage system used in GaussDB(for MySQL) is one key component we leverage in order to improve query performance. Combined with parallel query execution, we can potentially achieve query performance improvements of more than 100x.

Single threaded query execution at compute layer

Parallel processing at compute and storage layers

Leveraging cloud storage for query processing (NDP)

Data in GaussDB(for MySQL) is organized in slices, and distributed over multiple storage nodes. We leverage this data distribution and the available compute resources to do query processing where the data is located, without fetching data into the compute nodes. In database terminology, we call it near data processing (NDP) or query push down. Basically, we push a portion of the query processing down to the distributed storage system near the data. Operations we push down are data intensive operations such as table scan and index scan. Projection and evaluation of some WHERE conditions as well as aggregations are executed in the storage layer, so only the matching rows and columns needed for a particular query are returned to the server instead of full pages. In addition to parallelization, this approach also reduces network I/O, as the data volume fetched into the compute nodes is significantly reduced. In addition, NDP also allows for full utilization of local bandwidth to caches and storage media.

Offloading to storage works well when a query needs to scan a large amount of data and the data is not in the InnoDB buffer pool. For example, the following figure shows that NDP and parallel query execution improve the execution time of by an impressive 34x in the TCP-H Q12 benchmark.

Future direction

GaussDB(for MySQL) is designed as a cloud native database, and this architecture supports an extremely powerful and flexible framework for vertical integration. Compute and storage resources are decoupled and scaled independently, but still tightly integrated in terms of functionality, and database operations can be executed in multiple layers. In the future, database operations can also be offloaded to network cards and other cloud components, and not limited to compute nodes and storage.

We believe deep integration of the cloud computing stack is the key to unleash the power of cloud databases, and Huawei is in a unique position to achieve just that, as demonstrated here in GaussDB(for MySQL).

Please stay tuned, there is more news to come!

undefined

Introduction to GaussDB(fro MySQL)