云数据库 GAUSSDB-数据倾斜调优:存储层数据倾斜
存储层数据倾斜
GaussDB 数据库中,数据分布存储在各个DN上,通过分布式执行提高查询的效率。但是,如果数据分布存在倾斜,则会导致分布式执行某些DN成为瓶颈,影响查询性能。这种情况通常是由于分布列选择不合理,可以通过调整分布列的方式解决。
例如下例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
openGauss=# explain performance select count(*) from inventory; 5 -- CS tore Scan on lmz.inventory dn_6001_6002 (actual time=0.444..83.127 rows=42000000 loops=1) dn_6003_6004 (actual time=0.512..63.554 rows=27000000 loops=1) dn_6005_6006 (actual time=0.722..99.033 rows=45000000 loops=1) dn_6007_6008 (actual time=0.529..100.379 rows=51000000 loops=1) dn_6009_6010 (actual time=0.382..71.341 rows=36000000 loops=1) dn_6011_6012 (actual time=0.547..100.274 rows=51000000 loops=1) dn_6013_6014 (actual time=0.596..118.289 rows=60000000 loops=1) dn_6015_6016 (actual time=1.057..132.346 rows=63000000 loops=1) dn_6017_6018 (actual time=0.940..110.310 rows=54000000 loops=1) dn_6019_6020 (actual time=0.231..41.198 rows=21000000 loops=1) dn_6021_6022 (actual time=0.927..114.538 rows=54000000 loops=1) dn_6023_6024 (actual time=0.637..118.385 rows=60000000 loops=1) dn_6025_6026 (actual time=0.288..32.240 rows=15000000 loops=1) dn_6027_6028 (actual time=0.566..118.096 rows=60000000 loops=1) dn_6029_6030 (actual time=0.423..82.913 rows=42000000 loops=1) dn_6031_6032 (actual time=0.395..78.103 rows=39000000 loops=1) dn_6033_6034 (actual time=0.376..51.052 rows=24000000 loops=1) dn_6035_6036 (actual time=0.569..79.463 rows=39000000 loops=1) |
在performance信息中,可以看到inventory表各DN的scan行数,发现各DN的行数差距较大,最大的为63000000,最小的只有15000000,差了4倍。这个差距对于数据扫描的性能影响还可以接受,但如果上层有join算子,则影响较大。
通常,数据表在各DN上是hash分布的,因此分布列的选择很重要。通过table_skewness()来查看上述inventory表在各DN的数据分布倾斜,查询结果如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
openGauss=# select table_skewness('inventory'); table_skewness ------------------------------------------ ("dn_6015_6016 ",63000000,8.046%) ("dn_6013_6014 ",60000000,7.663%) ("dn_6023_6024 ",60000000,7.663%) ("dn_6027_6028 ",60000000,7.663%) ("dn_6017_6018 ",54000000,6.897%) ("dn_6021_6022 ",54000000,6.897%) ("dn_6007_6008 ",51000000,6.513%) ("dn_6011_6012 ",51000000,6.513%) ("dn_6005_6006 ",45000000,5.747%) ("dn_6001_6002 ",42000000,5.364%) ("dn_6029_6030 ",42000000,5.364%) ("dn_6031_6032 ",39000000,4.981%) ("dn_6035_6036 ",39000000,4.981%) ("dn_6009_6010 ",36000000,4.598%) ("dn_6003_6004 ",27000000,3.448%) ("dn_6033_6034 ",24000000,3.065%) ("dn_6019_6020 ",21000000,2.682%) ("dn_6025_6026 ",15000000,1.916%) (18 rows) |
通过查询建表定义,可以发现,目前该表是以inv_date_sk作为分布列的,导致存在倾斜。通过查看各列的数据分布情况,建表时改为inv_item_sk作为分布列,则倾斜情况分布如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
openGauss=# select table_skewness('inventory'); table_skewness ------------------------------------------ ("dn_6001_6002 ",43934200,5.611%) ("dn_6007_6008 ",43829420,5.598%) ("dn_6003_6004 ",43781960,5.592%) ("dn_6031_6032 ",43773880,5.591%) ("dn_6033_6034 ",43763280,5.589%) ("dn_6011_6012 ",43683600,5.579%) ("dn_6013_6014 ",43551660,5.562%) ("dn_6027_6028 ",43546340,5.561%) ("dn_6009_6010 ",43508700,5.557%) ("dn_6023_6024 ",43484540,5.554%) ("dn_6019_6020 ",43466800,5.551%) ("dn_6021_6022 ",43458500,5.550%) ("dn_6017_6018 ",43448040,5.549%) ("dn_6015_6016 ",43247700,5.523%) ("dn_6005_6006 ",43200240,5.517%) ("dn_6029_6030 ",43181360,5.515%) ("dn_6025_6026 ",43179700,5.515%) ("dn_6035_6036 ",42960080,5.487%) (18 rows) |
数据分布倾斜的问题得到解决。
除了table_skewness()视图外,当前版本还提供了table_distribution函数和PGXC_GET_TABLE_SKEWNESS视图,可以更加高效的查询各表的数据倾斜情况。