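Big Data in Practice (34): E-commerce Data Warehouse (27): User Behavior Data Warehouse (13): User Retention Topic
1 Requirement Goals
1.1 Concept of User Retention
1.2 Requirement Description
User retention analysis
2 DWS Layer
2.1 DWS Layer (Daily Retained User Detail Table)
1) Table creation statement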
hive (gmall)>
drop table if exists dws_user_retention_day;
create external table dws_user_retention_day
(
`mid_id` string COMMENT 'unique device identifier',
`user_id` string COMMENT 'user identifier',
`version_code` string COMMENT 'app version code',
`version_name` string COMMENT 'app version name',
`lang` string COMMENT 'system language',
`source` string COMMENT 'channel ID',
`os` string COMMENT 'Android OS version',
`area` string COMMENT 'region',
`model` string COMMENT 'phone model',
`brand` string COMMENT 'phone brand',
`sdk_version` string COMMENT 'sdkVersion',
`gmail` string COMMENT 'gmail',
`height_width` string COMMENT 'screen height and width',
`app_time` string COMMENT 'time the client log was generated',
`network` string COMMENT 'network mode',
`lng` string COMMENT 'longitude',
`lat` string COMMENT 'latitude',
`create_date` string comment 'date the device was added',
`retention_day` int comment 'days retained as of the current date'
) COMMENT 'daily user retention'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_retention_day/';
2)
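=============================Retention Topic=========================
-----------------------------Requirement 1. DWS-layer daily retained user detail table-----------------------
-----------------------------Related tables---------------------
dws_new_mid_day: daily new-user (new device) table
dws_uv_detail_day: daily active table
-----------------------------Approach-----------------------
Detail fields: taken from dws_uv_detail_day (daily active table)
create_date: looked up in dws_new_mid_day by mid_id
retention_day: number of days retained as of the current date
dt (date of the daily-active data) = create_date + retention_day
-----------------------------SQL------------------------
-- Compute 1-day retention
-- Filtering first and joining afterwards is preferable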
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
1 retention_day,
'2020-02-15' dt
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',1)) t2
on t1.mid_id=t2.mid_id
----------------------Compute retention detail for 1, 2, 3, ... n days----------------------------
insert overwrite TABLE dws_user_retention_day PARTITION(dt='2020-02-15')
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
1 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',1)) t2
on t1.mid_id=t2.mid_id
UNION all
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
2 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',2)) t2
on t1.mid_id=t2.mid_id
UNION all
SELECT
t1.mid_id,
t1.user_id,
t1.version_code,
t1.version_name,
t1.lang,
t1.source,
t1.os,
t1.area,
t1.model,
t1.brand,
t1.sdk_version,
t1.gmail,
t1.height_width,
t1.app_time,
t1.network,
t1.lng,
t1.lat,
t2.create_date,
3 retention_day
FROM
(SELECT * from gmall.dws_uv_detail_day where dt='2020-02-15') t1
JOIN
(select mid_id,create_date from gmall.dws_new_mid_day where create_date=date_sub('2020-02-15',3)) t2
on t1.mid_id=t2.mid_id
-- UNION ALL requires the combined queries to have the same number of columns, with matching types!
-- UNION vs. UNION ALL: UNION removes duplicate rows, UNION ALL does not!
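The three UNION ALL branches above differ only in the offset passed to date_sub, so the relation dt = create_date + retention_day can also be expressed as a single join that derives retention_day from the date difference. The following is a minimal sketch of that idea, not the author's method: it uses Python's built-in sqlite3 as a stand-in for Hive, reduces both tables to the columns needed, and uses invented sample rows. In Hive, the date difference would presumably be written with datediff(dt, create_date) instead of julianday.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the Hive tables (only the columns needed here)
    CREATE TABLE dws_uv_detail_day (mid_id TEXT, dt TEXT);
    CREATE TABLE dws_new_mid_day   (mid_id TEXT, create_date TEXT);
    -- Hypothetical sample rows, invented for illustration
    INSERT INTO dws_new_mid_day VALUES
        ('m1', '2020-02-14'), ('m2', '2020-02-13'),
        ('m3', '2020-02-12'), ('m4', '2020-02-10');
    INSERT INTO dws_uv_detail_day VALUES
        ('m1', '2020-02-15'), ('m2', '2020-02-15'),
        ('m3', '2020-02-15'), ('m4', '2020-02-15');
""")

stat_date = "2020-02-15"
# One join instead of three UNION ALL branches:
# retention_day is derived as dt - create_date and limited to 1..3 days.
retention_rows = conn.execute("""
    SELECT t1.mid_id,
           t2.create_date,
           CAST(julianday(t1.dt) - julianday(t2.create_date) AS INTEGER) AS retention_day
    FROM (SELECT * FROM dws_uv_detail_day WHERE dt = ?) t1
    JOIN dws_new_mid_day t2 ON t1.mid_id = t2.mid_id
    WHERE julianday(t1.dt) - julianday(t2.create_date) BETWEEN 1 AND 3
    ORDER BY retention_day
""", (stat_date,)).fetchall()

for row in retention_rows:
    print(row)
```

Device m4 was added five days before the statistics date, so it falls outside the 1 to 3 day window and is dropped, just as it would match none of the three branches in the UNION ALL version.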
2.3 Difference between UNION and UNION ALL
1) Prepare two tables
tableA
id name score
1 a 80
2 b 79
3 c 68
tableB
id name score
1 d 48
2 e 23
3 c 86
2) Query with UNION
select name from tableA
union
select name from tableB
Query result (row order is not guaranteed without ORDER BY)
name
a
d
b
e
c
3) Query with UNION ALL
select name from tableA
union all
select name from tableB
Query result
name
a
b
c
d
e
c
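4) Summary
(1) UNION removes duplicates from the combined result set, so it is slower than UNION ALL
(2) UNION ALL does not deduplicate the result set, so it is faster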
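The comparison above can be reproduced locally. This is a small sketch that loads tableA and tableB into an in-memory SQLite database through Python's built-in sqlite3 module (an assumption made for runnability; the same two queries apply in Hive) and runs both forms. The results are sorted in Python because neither query guarantees row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (id INTEGER, name TEXT, score INTEGER);
    CREATE TABLE tableB (id INTEGER, name TEXT, score INTEGER);
    INSERT INTO tableA VALUES (1, 'a', 80), (2, 'b', 79), (3, 'c', 68);
    INSERT INTO tableB VALUES (1, 'd', 48), (2, 'e', 23), (3, 'c', 86);
""")

# UNION removes the duplicate name 'c' that appears in both tables
union_names = [row[0] for row in conn.execute(
    "SELECT name FROM tableA UNION SELECT name FROM tableB")]

# UNION ALL keeps every row from both inputs, including the second 'c'
union_all_names = [row[0] for row in conn.execute(
    "SELECT name FROM tableA UNION ALL SELECT name FROM tableB")]

print(sorted(union_names))
print(sorted(union_all_names))
```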
3 ADS Layer
3.1 Retained User Count
1) Table creation statement
hive (gmall)>
drop table if exists ads_user_retention_day_count;
create external table ads_user_retention_day_count
(
`create_date` string comment 'date the device was added',
`retention_day` int comment 'days retained as of the current date',
`retention_count` bigint comment 'number of retained devices'
) COMMENT 'daily user retention'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_count/';
2)
-----------------------------Requirement 2. Populate ads_user_retention_day_count with the daily number of retained users-----------------------
-----------------------------Related tables---------------------
dws_user_retention_day
-----------------------------Approach-----------------------
create_date: read from dws_user_retention_day
retention_day: read from dws_user_retention_day
retention_count: computed with count(*)
First filter by create_date to keep the detail rows of devices added on the specified date!
Then group by retention_day and apply count(*)
-----------------------------SQL------------------------
insert into table gmall.ads_user_retention_day_count
SELECT
'2020-02-14',
retention_day,
count(*)
FROM gmall.dws_user_retention_day
where create_date='2020-02-14'
group by retention_day;
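To see what the filter-then-group step produces, here is a minimal sketch run against an in-memory SQLite database via Python's sqlite3 module. The table is cut down to four columns and the rows are hypothetical sample data, so this only illustrates the shape of the result, not real warehouse output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for dws_user_retention_day
    CREATE TABLE dws_user_retention_day (
        mid_id TEXT, create_date TEXT, retention_day INTEGER, dt TEXT);
    -- Hypothetical sample rows, invented for illustration
    INSERT INTO dws_user_retention_day VALUES
        ('m1', '2020-02-14', 1, '2020-02-15'),
        ('m2', '2020-02-14', 1, '2020-02-15'),
        ('m1', '2020-02-14', 2, '2020-02-16'),
        ('m3', '2020-02-13', 2, '2020-02-15');
""")

create_date = "2020-02-14"
# Filter to one creation date, then count devices per retention_day
retention_counts = conn.execute("""
    SELECT ?, retention_day, COUNT(*)
    FROM dws_user_retention_day
    WHERE create_date = ?
    GROUP BY retention_day
    ORDER BY retention_day
""", (create_date, create_date)).fetchall()

for row in retention_counts:
    print(row)
```

Device m3 was added on 2020-02-13, so the WHERE clause excludes it before the grouping happens.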
3.2 Retained User Ratio
1) Table creation statement
hive (gmall)>
drop table if exists ads_user_retention_day_rate;
create external table ads_user_retention_day_rate
(
`stat_date` string comment 'statistics date',
`create_date` string comment 'date the device was added',
`retention_day` int comment 'days retained as of the current date',
`retention_count` bigint comment 'number of retained devices',
`new_mid_count` bigint comment 'number of devices added on that day',
`retention_ratio` decimal(10,2) comment 'retention rate'
) COMMENT 'daily user retention'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_retention_day_rate/';
2)
-----------------------------Requirement 3. Compute the retention rate-----------------------
-----------------------------Related tables---------------------
ads_user_retention_day_count
ads_new_mid_count
Both tables are read for the same batch of newly added devices, so the device creation date is the join key
-----------------------------Approach-----------------------
`stat_date` : usually the day being reported on or the day after; never earlier than the date of the data being counted!
`create_date` : taken from ads_user_retention_day_count
`retention_day` : taken from ads_user_retention_day_count
`retention_count` : taken from ads_user_retention_day_count
`new_mid_count` : number of devices added on that create_date, taken from ads_new_mid_count
`retention_ratio` : retention_count/new_mid_count
-----------------------------SQL------------------------
insert into table ads_user_retention_day_rate
SELECT
'2020-02-16',
ur.create_date,
ur.retention_day,
ur.retention_count,
nm.new_mid_count,
cast (ur.retention_count / nm.new_mid_count as decimal(10,2))
FROM
ads_user_retention_day_count ur
JOIN
ads_new_mid_count nm
on ur.create_date=nm.create_date
where date_add(ur.create_date,ur.retention_day)='2020-02-16'
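The join and the date_add filter can be checked on a small data set. The sketch below again uses Python's sqlite3 as a stand-in for Hive, with hypothetical counts. Two details are SQLite-specific assumptions: date(create_date, '+N days') plays the role of Hive's date_add, and the count is multiplied by 1.0 because SQLite truncates integer division (Hive's / operator returns a double).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ads_user_retention_day_count (
        create_date TEXT, retention_day INTEGER, retention_count INTEGER);
    CREATE TABLE ads_new_mid_count (create_date TEXT, new_mid_count INTEGER);
    -- Hypothetical sample rows, invented for illustration
    INSERT INTO ads_user_retention_day_count VALUES
        ('2020-02-15', 1, 30), ('2020-02-14', 2, 24),
        ('2020-02-13', 3, 10), ('2020-02-14', 1, 40);
    INSERT INTO ads_new_mid_count VALUES
        ('2020-02-15', 100), ('2020-02-14', 200), ('2020-02-13', 50);
""")

stat_date = "2020-02-16"
# Keep only rows whose create_date + retention_day lands on stat_date,
# then divide the retained count by the number of devices added that day.
retention_rates = conn.execute("""
    SELECT ?, ur.create_date, ur.retention_day, ur.retention_count,
           nm.new_mid_count,
           ROUND(ur.retention_count * 1.0 / nm.new_mid_count, 2)
    FROM ads_user_retention_day_count ur
    JOIN ads_new_mid_count nm ON ur.create_date = nm.create_date
    WHERE date(ur.create_date, '+' || ur.retention_day || ' days') = ?
    ORDER BY ur.retention_day
""", (stat_date, stat_date)).fetchall()

for row in retention_rates:
    print(row)
```

The row ('2020-02-14', 1, 40) is filtered out because 2020-02-14 plus one day is 2020-02-15, not the statistics date.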