MAPREDUCE服务 MRS-窗口函数:Ranking Functions

时间:2024-07-02 16:39:35

Ranking Functions

  • cume_dist()→ bigint

    描述:小于等于当前值的行数/分组内总行数–比如,统计小于等于当前薪水的人数,所占总人数的比例。

     --查询示例
    SELECT dept, userid, sal, CUME_DIST() OVER(ORDER BY sal) AS rn1, CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2 FROM salary;
    dept | userid |  sal   | rn1 |        rn2         
    ------|--------|--------|-----|--------------------
     d2   | user4  | 4000.0 | 0.8 |                0.5 
     d2   | user5  | 5000.0 | 1.0 |                1.0 
     d1   | user1  | 1000.0 | 0.2 | 0.3333333333333333 
     d1   | user2  | 2000.0 | 0.4 | 0.6666666666666666 
     d1   | user3  | 3000.0 | 0.6 |                1.0 
    (5 rows)
  • dense_rank()→ bigint

    描述:返回值在一组值中的排名。这与rank()相似,不同的是tie值不会在序列中产生间隙。

  • ntile(n)→ bigint

    描述:用于将分组数据按照顺序切分成n片,返回当前切片值。NTILE不支持ROWS BETWEEN,比如NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)如果切片不均匀,默认增加第一个切片的分布。

    --创建表并插入数据
    create table cookies_log (cookieid varchar,createtime  date,pv int);
    insert into cookies_log values 
    	('cookie1',date '2020-07-10',1),
    	('cookie1',date '2020-07-11',5), 
    	('cookie1',date '2020-07-12',7), 
    	('cookie1',date '2020-07-13',3),
    	('cookie1',date '2020-07-14',2), 
    	('cookie1',date '2020-07-15',4), 
    	('cookie1',date '2020-07-16',4), 
    	('cookie2',date '2020-07-10',2),
    	('cookie2',date '2020-07-11',3), 
    	('cookie2',date '2020-07-12',5),
            ('cookie2',date '2020-07-13',6), 
            ('cookie2',date '2020-07-14',3), 
            ('cookie2',date '2020-07-15',9),
    	('cookie2',date '2020-07-16',7);
    -- 查询结果
    SELECT cookieid,createtime,pv,
    NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,	--分组内将数据分成2片
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,  --分组内将数据分成3片
    NTILE(4) OVER(ORDER BY createtime) AS rn3   --将所有数据分成4片
    FROM cookies_log
    ORDER BY cookieid,createtime;
    cookieid  | createtime | pv | rn1 | rn2 | rn3 
    ----------|------------|----|-----|-----|-----
     cookie1  | 2020-07-10 |  1 |   1 |   1 |   1 
     cookie1  | 2020-07-11 |  5 |   1 |   1 |   1 
     cookie1  | 2020-07-12 |  7 |   1 |   1 |   2 
     cookie1  | 2020-07-13 |  3 |   1 |   2 |   2 
     cookie1  | 2020-07-14 |  2 |   2 |   2 |   3 
     cookie1  | 2020-07-15 |  4 |   2 |   3 |   4 
     cookie1  | 2020-07-16 |  4 |   2 |   3 |   4 
     cookie2  | 2020-07-10 |  2 |   1 |   1 |   1 
     cookie2  | 2020-07-11 |  3 |   1 |   1 |   1 
     cookie2  | 2020-07-12 |  5 |   1 |   1 |   2 
     cookie2  | 2020-07-13 |  6 |   1 |   2 |   2 
     cookie2  | 2020-07-14 |  3 |   2 |   2 |   3 
     cookie2  | 2020-07-15 |  9 |   2 |   3 |   3 
     cookie2  | 2020-07-16 |  7 |   2 |   3 |   4 
    (14 rows)
  • percent_rank()→ double

    描述:返回值在一组值中的百分比排名。 结果为(r-1)/(n-1),其中r是该行的rank(),n是窗口分区中的总行数。

    SELECT dept,userid,sal,
    PERCENT_RANK() OVER(ORDER BY sal) AS rn1,   --分组内
    RANK() OVER(ORDER BY sal) AS rn11,          --分组内RANK值
    SUM(1) OVER(PARTITION BY NULL) AS rn12,     --分组内总行数
    PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
    from salary;
     dept | userid |  sal   | rn1  | rn11 | rn12 | rn2 
    ------|--------|--------|------|------|------|-----
     d2   | user4  | 4000.0 | 0.75 |    4 |    5 | 0.0 
     d2   | user5  | 5000.0 |  1.0 |    5 |    5 | 1.0 
     d1   | user1  | 1000.0 |  0.0 |    1 |    5 | 0.0 
     d1   | user2  | 2000.0 | 0.25 |    2 |    5 | 0.5 
     d1   | user3  | 3000.0 |  0.5 |    3 |    5 | 1.0 
    (5 rows)
  • rank()→ bigint

    描述:返回值在一组值中的排名。等级为1加上该行之前与该行不对等的行数。因此,排序中的平局值将在序列中产生缺口。对每个窗口分区执行排名。

    SELECT 
    cookieid,
    createtime,
    pv,
    RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
    DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3 
    FROM cookies_log
    WHERE cookieid = 'cookie1';
     cookieid | createtime | pv | rn1 | rn2 | rn3 
    ----------|------------|----|-----|-----|-----
     cookie1  | 2020-07-12 |  7 |   1 |   1 |   1 
     cookie1  | 2020-07-11 |  5 |   2 |   2 |   2 
     cookie1  | 2020-07-15 |  4 |   3 |   3 |   3 
     cookie1  | 2020-07-16 |  4 |   3 |   3 |   4 
     cookie1  | 2020-07-13 |  3 |   5 |   4 |   5 
     cookie1  | 2020-07-14 |  2 |   6 |   5 |   6 
     cookie1  | 2020-07-10 |  1 |   7 |   6 |   7 
    (7 rows)
  • row_number()→ bigint

    描述:从1开始,按照顺序,生成分组内记录的序列–比如,按照pv降序排列,生成分组内每天的pv名次ROW_NUMBER() 的应用场景非常多,再比如,获取分组内排序第一的记录。获取一个session中的第一条refer等。

    SELECT cookieid, createtime, pv, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn from cookies_log;
      cookieid | createtime | pv | rn 
     ----------|------------- |----|----
     cookie2  | 2020-07-15 |  9 |  1 
     cookie2  | 2020-07-16 |  7 |  2 
     cookie2  | 2020-07-13 |  6 |  3 
     cookie2  | 2020-07-12 |  5 |  4 
     cookie2  | 2020-07-14 |  3 |  5 
     cookie2  | 2020-07-11 |  3 |  6 
     cookie2  | 2020-07-10 |  2 |  7 
     cookie1  | 2020-07-12 |  7 |  1 
     cookie1  | 2020-07-11 |  5 |  2 
     cookie1  | 2020-07-15 |  4 |  3 
     cookie1  | 2020-07-16 |  4 |  4 
     cookie1  | 2020-07-13 |  3 |  5 
     cookie1  | 2020-07-14 |  2 |  6 
     cookie1  | 2020-07-10 |  1 |  7 
    (14 rows)
support.huaweicloud.com/cmpntguide-lts-mrs/mrs_01_300206.html