esProc SPL Sets Data Scientists Free

What do data scientists use for analytical computing?

SQL?
Python?
esProc SPL!

SQL: looks simple, but is actually hard

Simple queries
select id, name from T where id = 1
select area,sum(amount) from T group by area
E-commerce funnel analysis
WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT e2.uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date
      AND e2.etime > e1.t1 AND e2.etime < e1.t1 + 7 AND e2.etype = 'etype2'
    GROUP BY e2.uid),
e3 AS (
    SELECT e3.uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date
      AND e3.etime > e2.t2 AND e3.etime < e2.t1 + 7 AND e3.etype = 'etype3'
    GROUP BY e3.uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid
  • Players who score 3 times in a row within one minute
  • Number of users active on three consecutive days within every 7-day window
  • Next-day retention of each day's new users
  • The day's rise when a stock's price is higher than on the 5 days before and after
  • ...

Debugging is a pain

SQL has none of the everyday debugging facilities such as breakpoints and single-stepping; when a query is nested several levels deep, you have to split it apart and run it layer by layer.

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY DT) THEN 0
                     ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

Runs slowly: high-performance algorithms cannot be expressed, so you can only hope the database optimizes

Top 10 out of 100 million rows
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC

The database can still optimize this case: the ORDER BY in the SQL nominally calls for a big sort, which would be slow, so the database optimizes automatically — for instance, by maintaining a small set that always holds the 10 largest members, it completes the computation in a single traversal, much faster.
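To make that single-pass idea concrete, here is a hedged Python sketch (not the database's actual implementation) that keeps a 10-member heap while traversing once; `amounts` stands in for the Amount column:

import heapq

def top10(amounts):
    heap = []                      # always holds the 10 largest values seen so far
    for a in amounts:
        if len(heap) < 10:
            heapq.heappush(heap, a)
        elif a > heap[0]:          # beats the smallest of the current top 10
            heapq.heapreplace(heap, a)
    return sorted(heap, reverse=True)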

Top 10 within each group
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn
    FROM Orders )
WHERE rn <= 10

Now the database no longer optimizes: the complexity has grown only a little, yet it is enough to baffle most optimizers, which execute the statement literally with a big sort, and performance drops off a cliff.
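Yet the one-pass algorithm extends naturally to groups — one small heap per group. A hedged sketch, where `rows` is an assumed iterable of (area, amount) pairs:

import heapq
from collections import defaultdict

def top10_by_area(rows):
    heaps = defaultdict(list)      # one 10-member heap per area
    for area, amount in rows:
        h = heaps[area]
        if len(h) < 10:
            heapq.heappush(h, amount)
        elif amount > h[0]:
            heapq.heapreplace(h, amount)
    return {area: sorted(h, reverse=True) for area, h in heaps.items()}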

Closed system: external data must be loaded into the database first — far too many chores

Six advantages that set data scientists free

1. More concise code

Longest consecutive rising days for each stock

SPL comes out simpler — no loop statements needed

SQL

SELECT CODE, MAX(con_rise) AS longest_up_days
FROM (
    SELECT CODE, COUNT(*) AS con_rise
    FROM (
        SELECT CODE, DT,
            SUM(updown_flag) OVER (PARTITION BY CODE ORDER BY DT) AS no_up_days
        FROM (
            SELECT CODE, DT,
                CASE WHEN CL > LAG(CL) OVER (PARTITION BY CODE ORDER BY DT) THEN 0
                     ELSE 1 END AS updown_flag
            FROM stock
        )
    )
    GROUP BY CODE, no_up_days
)
GROUP BY CODE

SPL

   A
1  StockRecords.txt
2  =T(A1).sort(DT)
3  =A2.group(CODE;~.group@i(CL<CL[-1]).max(~.len()):max_increase_days)

Python

import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# Label consecutive-rise runs: the label increments on every non-rising day
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    # Longest run per stock; minus 1 excludes the non-rising day that opens each run
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])
E-commerce funnel analysis

SPL is again more concise and closer to how you naturally think; moreover, this code handles a funnel with any number of steps — simpler than the SQL, and more general.

SQL

WITH e1 AS (
    SELECT uid, 1 AS step1, MIN(etime) AS t1
    FROM events
    WHERE etime >= end_date - 14 AND etime < end_date AND etype = 'etype1'
    GROUP BY uid),
e2 AS (
    SELECT e2.uid, 1 AS step2, MIN(e1.t1) AS t1, MIN(e2.etime) AS t2
    FROM events AS e2 JOIN e1 ON e2.uid = e1.uid
    WHERE e2.etime >= end_date - 14 AND e2.etime < end_date
      AND e2.etime > e1.t1 AND e2.etime < e1.t1 + 7 AND e2.etype = 'etype2'
    GROUP BY e2.uid),
e3 AS (
    SELECT e3.uid, 1 AS step3, MIN(e2.t1) AS t1, MIN(e3.etime) AS t3
    FROM events AS e3 JOIN e2 ON e3.uid = e2.uid
    WHERE e3.etime >= end_date - 14 AND e3.etime < end_date
      AND e3.etime > e2.t2 AND e3.etime < e2.t1 + 7 AND e3.etype = 'etype3'
    GROUP BY e3.uid)
SELECT SUM(step1) AS step1, SUM(step2) AS step2, SUM(step3) AS step3
FROM e1 LEFT JOIN e2 ON e1.uid = e2.uid LEFT JOIN e3 ON e2.uid = e3.uid

SPL

   A
1  =["etype1","etype2","etype3"]
2  =file("event.ctx").open()
3  =A2.cursor(id,etime,etype;etime>=end_date-14 && etime<end_date && A1.contain(etype))
4  =A3.group(uid)
5  =A4.(~.sort(etime)).new(~.select@1(etype==A1(1)):first,~:all).select(first)
6  =A5.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime,null))))
7  =A6.groups(;count(~(1)):step1,count(~(2)):step2,count(~(3)):step3)
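To spell out the funnel logic the grid above expresses, here is a hedged pandas sketch of the same N-step computation (the uid/etime/etype columns, 14-day span, and 7-day window come from the code above; an `events` DataFrame with a datetime etime column is assumed):

import pandas as pd

def funnel_counts(events, etypes, end_date, span=14, window=7):
    # Keep only in-window events of the funnel's types, in time order
    df = events[(events.etime >= end_date - pd.Timedelta(days=span))
                & (events.etime < end_date)
                & (events.etype.isin(etypes))].sort_values('etime')
    counts = [0] * len(etypes)     # counts[i]: users who reached step i+1
    for _, g in df.groupby('uid'):
        t1 = prev = None
        for i, et in enumerate(etypes):
            if i == 0:
                hit = g[g.etype == et]
            else:
                hit = g[(g.etype == et) & (g.etime > prev)
                        & (g.etime < t1 + pd.Timedelta(days=window))]
            if hit.empty:
                break
            prev = hit.etime.iloc[0]
            if i == 0:
                t1 = prev
            counts[i] += 1
    return counts

Calling funnel_counts(events, ["etype1", "etype2", "etype3"], end_date) reproduces the three step counts; handling another step only means lengthening the list.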

2. Strong interactivity: suited to multi-step exploratory analysis, with convenient debugging

3. Built-in big data and parallelism

Easy big-data computing

In-memory computing
   A
1  smallData.txt
2  =file(A1).import@t()
3  =A2.groups(state;sum(amount):amount)

External-storage computing
   A
1  bigData.txt
2  =file(A1).cursor@t()
3  =A2.groups(state;sum(amount):amount)

For data that exceeds memory, SPL provides cursor-based access to external storage. More importantly, the in-memory and external-storage code is almost identical, so big data adds no extra workload.
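For contrast, the equivalent out-of-core aggregation in pandas makes the chunking explicit — a minimal sketch, with file and column names following the grid above:

import pandas as pd

total = None
for chunk in pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000):
    part = chunk.groupby("state")["amount"].sum()   # aggregate each chunk
    total = part if total is None else total.add(part, fill_value=0)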

Easy parallel computing

Serial
   A
1  bigData.txt
2  =file(A1).cursor@t()
3  =A2.groups(state;sum(amount):amount)

Parallel
   A
1  bigData.txt
2  =file(A1).cursor@tm()
3  =A2.groups(state;sum(amount):amount)

SPL supports thread-based parallel computing that makes full use of multi-core CPUs; enabling it takes nothing more than adding the @m option — very convenient.
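For contrast again, parallel aggregation in Python means wiring up a worker pool and merging the partial results by hand — a hedged sketch with illustrative chunk size and names:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def agg(chunk):
    return chunk.groupby("state")["amount"].sum()

if __name__ == "__main__":
    # Materializing the chunks keeps the sketch short, but gives up the
    # out-of-core property of the cursor version above
    chunks = list(pd.read_csv("bigData.txt", sep="\t", chunksize=1_000_000))
    with ProcessPoolExecutor() as ex:
        parts = list(ex.map(agg, chunks))
    total = reduce(lambda a, b: a.add(b, fill_value=0), parts)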

4. High-performance algorithms

SPL also makes it easy to write code with a smaller computational load, which therefore runs faster. Take the topN problem above: SPL understands topN as a kind of aggregation, so the computation involves no big sort and is much faster — and this is native behavior, needing no optimizer.

Orders.groups(;top(10;-Amount))

Moreover, in-group topN is written almost exactly like whole-set topN: simple to write, fast to run.

Orders.groups(Area;top(10;-Amount))

SPL also provides a wealth of high-performance algorithms — covering search, traversal, join, and cluster computing — many of them original SPL inventions. With these algorithms, computation flies. (A small sketch of one of them, ordered merge, follows the lists below.)

Search

  • Binary search
  • Sequence-number positioning
  • Index search
  • Batch search

Traversal

  • Cursor filtering
  • Traversal reuse
  • Multi-way cursors
  • Aggregation understanding
  • Ordered grouping
  • Program cursors
  • Columnar computing

Join

  • Foreign-key pre-association
  • Foreign-key numberization
  • Alignment sequences
  • Big dimension table search
  • One-sided partitioning
  • Ordered merge
  • Join positioning

Cluster

  • Cluster composite tables
  • Duplicate dimension tables
  • Segmented dimension tables
  • Load balancing
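As promised above, here is a hedged sketch of one technique from the join list — ordered merge: joining two tables already sorted on the same key in one pass, with no hashing and no re-sort (names are illustrative; one match per key is assumed):

def merge_join(left, right, key):
    # Both inputs must already be sorted on `key`
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out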

5. Openness: compute directly on diverse data sources

6. Pure Java: exploration results go straight into enterprise applications

Python: better than SQL, but still a hassle in complex cases

Longest consecutive rising days for each stock
import pandas as pd

stock_file = "StockRecords.txt"
stock_info = pd.read_csv(stock_file, sep="\t")
stock_info.sort_values(by=['CODE', 'DT'], inplace=True)
# Label consecutive-rise runs: the label increments on every non-rising day
stock_info['label'] = stock_info.groupby('CODE')['CL'].diff().fillna(0).le(0).astype(int).cumsum()
max_increase_days = {}
for code, group in stock_info.groupby('CODE'):
    # Longest run per stock; minus 1 excludes the non-rising day that opens each run
    max_increase_days[code] = group.groupby('label').size().max() - 1
max_rise_df = pd.DataFrame(list(max_increase_days.items()), columns=['CODE', 'max_increase_days'])

Debugging still comes down to scattering print statements

No big-data support, and only pseudo-parallelism (CPython's GIL keeps threads from using multiple cores)

Outside the Java stack, so exploration results often have to be rewritten before they can enter enterprise applications