Historical data normally records only the events that occurred: a certain vehicle showed up at a certain place at a certain time, a certain phone called a certain number at a certain time, a certain bank account had money transferred in or out on a certain date, and so on. Directly filtering the history with a given condition yields many records that share the same account (one vehicle may have been to Beijing many times), but each account may be counted only once, so the records must be deduplicated.
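For instance, counting how many distinct vehicles were ever seen in Beijing over a month is a textbook COUNT(DISTINCT). A minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical schema: one row per sighting event.
-- Each vehicle is counted once, no matter how many sightings it has.
SELECT COUNT(DISTINCT vehicle_id) AS vehicle_cnt
FROM vehicle_sightings
WHERE city = 'Beijing'
  AND seen_at >= '2024-01-01'
  AND seen_at <  '2024-02-01';
```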
The famous e-commerce funnel analysis is exactly this kind of COUNT(DISTINCT), only more complicated. A funnel involves multiple events in a prescribed order. Each step has its own COUNT(DISTINCT), which is set against the previous step's COUNT(DISTINCT) to compute that step's customer churn rate; each step's COUNT(DISTINCT) must also be filtered on top of the previous step's result, and the order in which the events occur has to be respected. The whole procedure is rather involved, but in essence it is still computing COUNT(DISTINCT).
```sql
WITH e1 AS (
    SELECT userid, visittime AS step1_time, MIN(sessionid) AS sessionid, 1 AS step1
    FROM defined_events e1
    JOIN eventgroup ON eventgroup.id = e1.eventgroup
    WHERE visittime >= DATE_ADD(arg_date, INTERVAL -14 DAY)
      AND visittime < arg_date
      AND eventgroup.name = 'SiteVisit'
    GROUP BY userid, visittime
), e2 AS (
    SELECT e2.userid, MIN(e2.sessionid) AS sessionid, 1 AS step2,
           MIN(visittime) AS step2_time, MIN(e1.step1_time) AS step1_time
    FROM defined_events e2
    JOIN e1 ON e1.sessionid = e2.sessionid AND visittime > step1_time
    JOIN eventgroup ON eventgroup.id = e2.eventgroup
    WHERE visittime < DATE_ADD(step1_time, INTERVAL +1 DAY)
      AND eventgroup.name = 'ProductDetailPage'
    GROUP BY e2.userid
), e3 AS (
    SELECT e3.userid, MIN(e3.sessionid) AS sessionid, 1 AS step3,
           MIN(visittime) AS step3_time, MIN(e2.step1_time) AS step1_time
    FROM defined_events e3
    JOIN e2 ON e2.sessionid = e3.sessionid AND visittime > step2_time
    JOIN eventgroup ON eventgroup.id = e3.eventgroup
    WHERE visittime < DATE_ADD(step1_time, INTERVAL +1 DAY)
      AND eventgroup.name = 'OrderConfirmationType1'
    GROUP BY e3.userid
)
SELECT
    s.devicetype AS devicetype,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL
          THEN funnel_conversions.step1_userid ELSE NULL END) AS step1_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step2 IS NOT NULL
          THEN funnel_conversions.step2_userid ELSE NULL END) AS step2_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL
          THEN funnel_conversions.step3_userid ELSE NULL END) AS step3_count,
    COUNT(DISTINCT CASE WHEN funnel_conversions.step3 IS NOT NULL
          THEN funnel_conversions.step3_userid ELSE NULL END)
      / COUNT(DISTINCT CASE WHEN funnel_conversions.step1 IS NOT NULL
          THEN funnel_conversions.step1_userid ELSE NULL END) AS step3_rate
FROM (
    SELECT e1.step1_time AS step1_time, e1.userid AS userid,
           e1.userid AS step1_userid, e2.userid AS step2_userid, e3.userid AS step3_userid,
           e1.sessionid AS step1_sessionid, step1, step2, step3
    FROM e1
    LEFT JOIN e2 ON e1.userid = e2.userid
    LEFT JOIN e3 ON e2.userid = e3.userid
) funnel_conversions
LEFT JOIN sessions s ON funnel_conversions.step1_sessionid = s.id
GROUP BY s.devicetype
```

As you can see, the query contains not only multiple COUNT(DISTINCT)s but also multiple self-join subqueries to implement the complex funnel-step logic. On a Snowflake Medium-size cluster (4 nodes), this SQL statement failed to produce a result within three minutes. So how do we deal with this annoying COUNT(DISTINCT)?
esProc SPL solves the same funnel like this:

```
A1   =now()
A2   =eventgroup=file("eventgroup.btx").import@b()
A3   =devicetype=file("devicetype.btx").import@b()
A4   =long(elapse(arg_date,-14))
A5   =long(arg_date)
A6   =long(arg_date+1)
A7   =A2.(case(NAME,"SiteVisit":1,"ProductDetailPage":2,"OrderConfirmationType1":3;null))
A8   =file("defined_events.ctx").open()
A9   =A8.cursor@m(USERID,SESSIONID,VISITTIME,EVENTGROUPNO;VISITTIME>=A4 && VISITTIME<A6,EVENTGROUPNO:A7:#)
A10  =sessions=file("sessions.ctx").open().cursor@m(USERID,ID,DEVICETYPENO;;A9)
A11  =A9.joinx@m(USERID:SESSIONID,A10:USERID:ID,DEVICETYPENO)
A12  =A11.group(USERID)
A13  =A12.new(~.align@a(3,EVENTGROUPNO):e,e(1).select(VISITTIME<A5).group@u1(VISITTIME):e1,e(2).group@o(SESSIONID):e2,e(3):e3)
A14  =A13.run(e=join@m(e1:e1,SESSIONID;e2:e2,SESSIONID).select(e2=e2.select(VISITTIME>e1.VISITTIME && VISITTIME<e1.VISITTIME+86400000).min(VISITTIME)))
A15  =A14.run(e0=e1.id(DEVICETYPENO),e1=e.min(e1.VISITTIME),e2=e.min(e2),e=e.min(e1.SESSIONID),e3=e3.select(SESSIONID==e && VISITTIME>e2 && VISITTIME<e1+86400000).min(VISITTIME),e=e0)
A16  =A15.news(e;~:DEVICETYPE,e2,e3)
A17  =A16.groups(DEVICETYPE;count(1):STEP1_COUNT,count(e2):STEP2_COUNT,count(e3):STEP3_COUNT,null:STEP3_RATE)
A18  =A17.run(DEVICETYPE=devicetype.m(DEVICETYPE).DEVICETYPE,STEP3_RATE=STEP3_COUNT/STEP1_COUNT)
A19  =interval@s(A1,now())
```

(SPL code is written in a grid of cells, which looks quite different from ordinary programming languages; see A programming language coded in the grid.)
A12 reads out one USERID's worth of data at a time for the subsequent per-user judgments, so when A17 finally aggregates, a plain count is enough; no icount (distinct count) is needed. This code is not only more concise and more general (a funnel with more steps only requires changing A7, while the SQL needs many more subqueries), it also runs faster: on an EC2 instance of the same specification as the Snowflake nodes, a single machine finishes in 10 seconds.
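The effect of ordering can be mimicked, to a degree, even in plain SQL. A conceptual sketch (the events table and its userid column are hypothetical) showing that once rows are visited in userid order, a distinct count degenerates into counting value changes:

```sql
-- Hypothetical illustration: with rows taken in userid order,
-- COUNT(DISTINCT userid) is just the number of positions where the
-- value changes -- one ordered pass, no deduplication hash table.
SELECT COUNT(*) AS distinct_users
FROM (
    SELECT userid,
           LAG(userid) OVER (ORDER BY userid) AS prev_userid
    FROM events
) t
WHERE prev_userid IS NULL OR prev_userid <> userid;
```

Of course, a SQL optimizer is free to sort or hash however it likes when executing this; SPL's advantage is that the storage itself already guarantees the order, so no such work is needed at query time.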
In information systems, almost all event-type data hangs under some account, so this kind of computation is extremely common and shows up in all sorts of query and batch jobs; it is fair to call it one of the most common business-logic models. With ordered computation over ordered storage, as esProc SPL provides, this whole class of problems can be implemented concisely and with high performance, whereas it remains very difficult for SQL-based relational databases.
Here is another example of the same pattern: from three days of records, find the 20 accounts that appear most often at the same location and in the same 15-minute (900-second) time slot as a given account a.

```sql
WITH DT AS (
    SELECT DISTINCT id, ROUND(tm/900)+1 AS tn, loc
    FROM T
    WHERE tm < 3*86400
)
SELECT * FROM (
    SELECT B.id AS id, COUNT(DISTINCT B.tn) AS cnt
    FROM DT AS A
    JOIN DT AS B ON A.loc = B.loc AND A.tn = B.tn
    WHERE A.id = a AND B.id <> a
    GROUP BY id
)
ORDER BY cnt DESC
LIMIT 20
```

Again there are nested DISTINCT operations and a self-join. A single-node ClickHouse simply fell over on this query, and it took a 5-node cluster more than 30 minutes to get the result.
The SPL code:

```
A1  =now()
A2  >NL=100000,NT=3*96
A3  =file("T.ctx").open()
A4  =A3.cursor(tm,loc;id==a).fetch().align(NL*NT,(loc-1)*NT+tm\900+1)
A5  =A3.cursor@mv(;id!=a && A4((loc-1)*NT+tm\900+1))
A6  =A5.group@s(id;icount@o(tm\900):cnt).total(top(-20;cnt))
A7  =interval@ms(A1,now())
```

An attentive reader may notice that the effectiveness of the esProc SPL algorithm depends on the data being ordered by id, while data is usually generated in time order, not id order. Does that mean the algorithm only works on pre-sorted historical data and fails on fresh data that has not yet been sorted in? esProc has anticipated this: SPL's multi-zone composite table performs incremental sorting as data arrives, guaranteeing in real time that data is read out ordered by id, so this ordered-computation scheme can be applied to the latest data as well. Moreover, such computations usually involve a time interval; SPL's pseudo table supports a double-dimension ordering mechanism that can quickly filter out data outside the time interval, further improving performance.