SELECT id, name FROM T WHERE id = 1
SELECT area, SUM(amount) FROM T GROUP BY area
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date
      AND e2.etime > t1 AND e2.etime < t1 + 7 AND etype = 'etype2'
    GROUP BY uid),
e3 AS (
    SELECT uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date
      AND e3.etime > t2 AND e3.etime < t1 + 7 AND etype = 'etype3'
    GROUP BY uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1
LEFT JOIN e2 ON e1.uid = e2.uid
LEFT JOIN e3 ON e2.uid = e3.uid
SQL has none of the everyday debugging facilities such as breakpoints or single-stepping, so when a query nests many layers deep you have to pull it apart and run it layer by layer.
Where the database can still optimize: an ORDER BY in SQL literally calls for a big sort, which would be slow. In a case like a top-10 query, the database optimizes automatically, for instance by keeping a small set that never holds more than 10 members, so a single traversal of the data completes the computation and performance is far better.
Where the database no longer optimizes: add only a little complexity, say top-10 within each group instead of over the whole table, and most optimizers already get lost; they fall back to the literal big sort and performance nosedives.
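What such an optimization amounts to can be sketched in a few lines of Python (a hypothetical illustration of the idea, not any database's actual code): a small min-heap capped at 10 members replaces the big sort, and one pass over the data finishes the job.

import heapq

def top10(amounts):
    # Keep at most 10 members in a min-heap: one traversal, no full sort.
    heap = []
    for a in amounts:
        if len(heap) < 10:
            heapq.heappush(heap, a)
        elif a > heap[0]:  # beats the smallest of the current top 10
            heapq.heapreplace(heap, a)
    return sorted(heap, reverse=True)

print(top10([3, 141, 59, 26, 535, 89, 79, 323, 846, 2, 643, 383]))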
The SPL version turns out to be simpler, with no loop statements needed; compare the SQL, SPL, and Python solutions below for finding each stock's longest run of consecutive rising days:
SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
               SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY CODE, DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                   CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY CODE, DT)
                        THEN 0 ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE
  | A |
1 | StockRecords.txt |
2 | =T(A1).sort(DT) |
3 | =A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days) |
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)

# A new run starts whenever the close does not rise; cumsum labels each run.
stock_info['label'] = (stock_info.groupby('CODE')['CL']
                       .diff().fillna(0).le(0).astype(int).cumsum())

# Longest run per stock: each run's first row is the non-rising day, hence -1.
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()),
                           columns=['CODE', 'max_increase_days'])
SPL is again more concise and closer to the way we naturally think. Better yet, the script below handles a funnel with any number of steps, which makes it both simpler and more general than the three-step SQL shown earlier:
  | A |
1 | =["etype1","etype2","etype3"] |
2 | =file("event.ctx").open() |
3 | =A2.cursor(id,etime,etype;etime>=end_date-14 && etime<end_date && A1.contain(etype)) |
4 | =A3.group(uid) |
5 | =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first) |
6 | =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null)))) |
7 | =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3) |
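To make the per-user logic of A5-A6 concrete, here is a minimal Python sketch of the same funnel (an illustration under simplifying assumptions: events fit in memory, etime is a day number, and the sample event list is hypothetical):

from collections import defaultdict

def funnel(events, etypes, end_date, window=14, span=7):
    """events: iterable of (uid, etime, etype), etime as a day number.
    Returns how many users reached each step of the etypes sequence."""
    per_user = defaultdict(list)
    for uid, etime, etype in events:
        if end_date - window <= etime < end_date and etype in etypes:
            per_user[uid].append((etime, etype))
    steps = [0] * len(etypes)
    for evs in per_user.values():
        evs.sort()                      # order each user's events by time
        t1 = t = None                   # t1: first-step time; t: previous-step time
        for i, target in enumerate(etypes):
            # earliest qualifying event: right type, after the previous step,
            # and still inside the t1 + span window (except for step 1)
            nxt = next((e for e, k in evs if k == target
                        and (t is None or e > t)
                        and (t1 is None or e < t1 + span)), None)
            if nxt is None:
                break
            if i == 0:
                t1 = nxt
            t = nxt
            steps[i] += 1
    return steps

print(funnel([(1, 90, "etype1"), (1, 92, "etype2"), (1, 93, "etype3"),
              (2, 91, "etype1"), (2, 99, "etype2")],
             ["etype1", "etype2", "etype3"], end_date=100))  # -> [2, 1, 1]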
  | A |
1 | smallData.txt |
2 | =file(A1).import@t() |
3 | =A2.groups(state;sum(amount):amount) |
  | A |
1 | bigData.txt |
2 | =file(A1).cursor@t() |
3 | =A2.groups(state;sum(amount):amount) |
For big data that exceeds memory, SPL provides cursors to compute over external storage. More importantly, the in-memory code and the external-storage code are almost identical, so moving to big data adds hardly any extra work.
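For contrast, here is the same in-memory vs. streamed split sketched in pandas (a rough analogy; the file names follow the grids above). Note that the chunked version no longer looks like the in-memory one, which is precisely the rewriting the SPL cursor saves you:

import pandas as pd

# In memory: load everything, then aggregate.
small = pd.read_csv("smallData.txt", sep="\t")
result = small.groupby("state", as_index=False)["amount"].sum()

# Streamed: read in chunks (a cursor-like pass), aggregate each chunk,
# then merge the partial results.
parts = [chunk.groupby("state")["amount"].sum()
         for chunk in pd.read_csv("bigData.txt", sep="\t", chunksize=100_000)]
result_big = pd.concat(parts).groupby(level=0).sum().reset_index()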
  | A |
1 | bigData.txt |
2 | =file(A1).cursor@t() |
3 | =A2.groups(state;sum(amount):amount) |
  | A |
1 | bigData.txt |
2 | =file(A1).cursor@tm() |
3 | =A2.groups(state;sum(amount):amount) |
SPL supports thread-based parallelism that makes full use of multi-core CPUs, and going parallel takes nothing more than adding the @m option, which is very convenient.
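For comparison, parallelizing the same chunked aggregation in Python takes considerably more ceremony than adding @m (a hypothetical sketch reusing the bigData.txt file from above):

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk: pd.DataFrame) -> pd.Series:
    # Grouped sum over one chunk; the partials are merged afterwards.
    return chunk.groupby("state")["amount"].sum()

if __name__ == "__main__":
    chunks = pd.read_csv("bigData.txt", sep="\t", chunksize=100_000)
    with ProcessPoolExecutor() as pool:  # one worker per CPU core by default
        parts = list(pool.map(partial_sum, chunks))
    result = pd.concat(parts).groupby(level=0).sum().reset_index()
    print(result)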
SPL also makes it easy to write code that does less computation and therefore runs faster. Take the top-N problem mentioned earlier: SPL treats top-N as a kind of aggregation, so the computation involves no big sort and is much faster. This is supported natively and needs no help from an optimizer.
Moreover, top-N within each group is written almost the same way as top-N over the whole set: easy to write and fast to run.
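That shape-preservation between the whole-set and grouped cases can be seen in a Python sketch (an illustration of the idea with hypothetical order data, not SPL syntax; heapq.nlargest performs the bounded-heap single pass internally):

import heapq
from itertools import groupby
from operator import itemgetter

orders = [{"area": "East", "amount": 120}, {"area": "West", "amount": 95},
          {"area": "East", "amount": 300}, {"area": "West", "amount": 210}]

# Top-N over the whole set: an aggregation, no full sort.
top_all = heapq.nlargest(2, orders, key=itemgetter("amount"))

# Top-N within each group: the same call, applied once per group.
top_by_area = {area: heapq.nlargest(2, rows, key=itemgetter("amount"))
               for area, rows in groupby(sorted(orders, key=itemgetter("area")),
                                         key=itemgetter("area"))}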
SPL also ships a large library of high-performance algorithms covering search, traversal, joins, and cluster computing, many of them original to SPL. With these algorithms, computing performance takes off.