Project Practice from 0 to 1: hive (45) Big Data Project: E-commerce Data Warehouse (User Behavior, Part 3)

Chapter 20 Requirement 9: Cumulative Visit Count per User

The expected result looks like this:

User  Date        Subtotal  Total
mid1  2019-12-14  10        10
mid1  2019-02-11  12        22
mid2  2019-12-14  15        15
mid2  2019-02-11  12        27

20.1 DWS Layer

20.1.1 Table Creation Statement

drop table if exists dws_user_total_count_day;
create external table dws_user_total_count_day(
`mid_id` string COMMENT 'device id',
`subtotal` bigint COMMENT 'daily login subtotal'
)
partitioned by(`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dws/dws_user_total_count_day';

20.1.2 Load Data

1) Load the data

insert overwrite table dws_user_total_count_day 
partition(dt='2019-12-14')
select
mid_id,
count(mid_id) cm
from
dwd_start_log
where
dt='2019-12-14'
group by
mid_id;

2) Query the result

select * from dws_user_total_count_day;

20.1.3 Data Load Script

1) Create the script dws_user_total_count_day.sh

[kgg@hadoop102 bin]$ vim dws_user_total_count_day.sh
Put the following content in the script:
#!/bin/bash

# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

# Use the date passed as the first argument; if none is given, default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=$(date -d "-1 day" +%F)
fi

echo "===Log date: $do_date==="
sql="
insert overwrite table ${APP}.dws_user_total_count_day partition(dt='$do_date')
select
mid_id,
count(mid_id) cm
from
${APP}.dwd_start_log
where
dt='$do_date'
group by
mid_id;
"

$hive -e "$sql"

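2) Make the script executable

chmod 777 dws_user_total_count_day.sh

3) Run the script

dws_user_total_count_day.sh 2019-02-20

4) Query the result

select * from dws_user_total_count_day;

5) Script execution time

In production, this script typically runs every day between 00:30 and 01:00.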
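
The date handling at the top of the load script can be isolated into a small function, which makes the "argument or yesterday" rule easier to see. This is only a minimal sketch: the function name `resolve_do_date` is hypothetical (it does not appear in the original script), and it assumes GNU `date`, which supports the `-d` option.

```shell
#!/bin/bash
# resolve_do_date is a hypothetical helper name, not part of the original script.
# It echoes the first argument when one is given; otherwise it echoes
# yesterday's date in YYYY-MM-DD form (assumes GNU date for the -d option).
resolve_do_date() {
    if [ -n "$1" ]; then
        echo "$1"
    else
        date -d "-1 day" +%F
    fi
}

resolve_do_date 2019-02-20   # echoes the date that was passed in
resolve_do_date              # echoes yesterday's date
```

Keeping this rule in one place means a rerun for a past day only needs the date passed on the command line, while the scheduled nightly run needs no argument at all.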

20.2 ADS Layer

20.2.1 Table Creation Statement

drop table if exists ads_user_total_count;
create external table ads_user_total_count(
`mid_id` string COMMENT 'device id',
`subtotal` bigint COMMENT 'daily login subtotal',
`total` bigint COMMENT 'cumulative login count'
)
partitioned by(`dt` string)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_user_total_count';

20.2.2 Load Data

insert overwrite table ads_user_total_count partition(dt='2019-10-03')
select
if(today.mid_id is null, yesterday.mid_id, today.mid_id) mid_id,
today.subtotal,
if(today.subtotal is null, 0, today.subtotal) + if(yesterday.total is null, 0, yesterday.total) total
from (
select
*
from dws_user_total_count_day
where dt='2019-10-03'
) today
full join (
select
*
from ads_user_total_count
where dt=date_add('2019-10-03', -1)
) yesterday
on today.mid_id=yesterday.mid_id;
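
The statement above full-joins today's subtotals with yesterday's cumulative rows, so a device that appears on only one side is still kept. The same rule can be sketched outside Hive on two tab-separated files. Everything here is an illustrative assumption: the helper name `accumulate_totals`, the file layout, and the sample rows are not part of the original project.

```shell
#!/bin/bash
# accumulate_totals is a hypothetical helper that mimics the FULL JOIN above.
#   $1: today's rows      -> mid_id <TAB> subtotal
#   $2: yesterday's rows  -> mid_id <TAB> subtotal <TAB> total
# Output: mid_id <TAB> subtotal <TAB> total, sorted by mid_id.
# A device missing from today's file gets \N (Hive's text form of NULL) as subtotal.
accumulate_totals() {
    awk -F'\t' -v OFS='\t' '
        FILENAME == ARGV[1] { today[$1] = $2; seen[$1] = 1; next }  # today side
        { prev_total[$1] = $3; seen[$1] = 1 }                       # yesterday side
        END {
            for (mid in seen) {
                subtotal = (mid in today) ? today[mid] : "\\N"
                total = ((mid in today) ? today[mid] : 0) + ((mid in prev_total) ? prev_total[mid] : 0)
                print mid, subtotal, total
            }
        }
    ' "$1" "$2" | sort
}

# Demo with made-up sample rows
workdir=$(mktemp -d)
printf 'mid1\t12\nmid3\t5\n' > "$workdir/today.tsv"
printf 'mid1\t10\t10\nmid2\t15\t15\n' > "$workdir/yesterday.tsv"
accumulate_totals "$workdir/today.tsv" "$workdir/yesterday.tsv"
# prints (tab-separated):
# mid1  12  22
# mid2  \N  15
# mid3  5   5
rm -r "$workdir"
```

In the demo, mid2 did not log in today, so its subtotal is NULL while its total carries over unchanged; mid3 is new, so its total starts from today's subtotal.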

20.2.3 Data Load Script

1) Create the script

[kgg@hadoop102 bin]$ vim ads_user_total_count.sh
Write the following content in the script:
#!/bin/bash

db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

if [[ -n $1 ]]; then
do_date=$1
else
do_date=`date -d '-1 day' +%F`
fi

sql="
use gmall;
insert overwrite table ads_user_total_count partition(dt='$do_date')
select
if(today.mid_id is null, yesterday.mid_id, today.mid_id) mid_id,
today.subtotal,
if(today.subtotal is null, 0, today.subtotal) + if(yesterday.total is null, 0, yesterday.total) total
from (
select
*
from dws_user_total_count_day
where dt='$do_date'
) today
full join (
select
*
from ads_user_total_count
where dt=date_add('$do_date', -1)
) yesterday
on today.mid_id=yesterday.mid_id
"

$hive -e "$sql"

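2) Make the script executable

chmod 777 ads_user_total_count.sh

3) Run the script

ads_user_total_count.sh 2019-02-20

4) Query the result

select * from ads_user_total_count;

5) Script execution time

In production, this script typically runs every day between 00:30 and 01:00.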

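Chapter 21 Requirement 10: Number of New Favoriting Users

New favoriting users: users who add a favorite for the first time on a given day.

21.1 Building the User Action Wide Table in the DWS Layer

Several later requirements need data from multiple tables at once. Joining those tables in every query would hurt query efficiency, so a wide table is built in advance to speed up the other queries.

The wide table records, for each user and each item, the click count, like count, and favorite count.

21.1.1 Table Creation Statement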

drop table if exists dws_user_action_wide_log;
CREATE EXTERNAL TABLE dws_user_action_wide_log(
`mid_id` string COMMENT 'device id',
`goodsid` string COMMENT 'item id',
`display_count` bigint COMMENT 'click count',
`praise_count` bigint COMMENT 'like count',
`favorite_count` bigint COMMENT 'favorite count')
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_user_action_wide_log/'
TBLPROPERTIES('parquet.compression'='lzo');

21.1.2 Load Data

insert overwrite table dws_user_action_wide_log partition(dt='2019-12-14')
select
mid_id,
goodsid,
sum(display_count) display_count,
sum(praise_count) praise_count,
sum(favorite_count) favorite_count
from
( select
mid_id,
goodsid,
count(*) display_count,
0 praise_count,
0 favorite_count
from
dwd_display_log
where
dt='2019-12-14' and action=2
group by
mid_id,goodsid

union all

select
mid_id,
target_id goodsid,
0,
count(*) praise_count,
0
from
dwd_praise_log
where
dt='2019-12-14'
group by
mid_id,target_id

union all

select
mid_id,
course_id goodsid,
0,
0,
count(*) favorite_count
from
dwd_favorites_log
where
dt='2019-12-14'
group by
mid_id,course_id
)user_action
group by
mid_id,goodsid;
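
The query above uses a common pattern: each branch of the `union all` fills only its own count column and pads the other two with 0, and the outer `sum` then collapses the branches into one row per (mid_id, goodsid). A minimal sketch of the same pattern on flat files follows. The helper name `build_wide_rows` and the input layout (one `mid_id <TAB> goodsid` line per event) are assumptions for illustration, and the click file is assumed to already contain only click events (the `action=2` filter).

```shell
#!/bin/bash
# build_wide_rows is a hypothetical helper, not part of the original project.
#   $1: click events, $2: like events, $3: favorite events
#   each file: one line per event, mid_id <TAB> goodsid
# Output: mid_id <TAB> goodsid <TAB> clicks <TAB> likes <TAB> favorites
build_wide_rows() {
    {
        # Each "branch" tags its rows with 1 in its own column and 0 elsewhere,
        # mirroring the zero-padded branches of the union all.
        awk -v OFS='\t' '{ print $0, 1, 0, 0 }' "$1"
        awk -v OFS='\t' '{ print $0, 0, 1, 0 }' "$2"
        awk -v OFS='\t' '{ print $0, 0, 0, 1 }' "$3"
    } | awk -F'\t' -v OFS='\t' '
        # Outer aggregation: sum the three columns per (mid_id, goodsid).
        { k = $1 OFS $2; d[k] += $3; p[k] += $4; f[k] += $5 }
        END { for (k in d) print k, d[k], p[k], f[k] }
    ' | sort
}

# Demo with made-up sample events
workdir=$(mktemp -d)
printf 'mid1\tg1\nmid1\tg1\nmid2\tg1\n' > "$workdir/click.tsv"
printf 'mid1\tg1\n' > "$workdir/like.tsv"
printf 'mid2\tg2\n' > "$workdir/favorite.tsv"
build_wide_rows "$workdir/click.tsv" "$workdir/like.tsv" "$workdir/favorite.tsv"
rm -r "$workdir"
```

Padding with zeros is what lets the three sources be stacked without a join: a pair that shows up in only one source still produces a complete row.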

21.1.3 Data Load Script

[kgg@hadoop102 bin]$ vi dws_user_action_wide_log.sh
[kgg@hadoop102 bin]$ chmod 777 dws_user_action_wide_log.sh

#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

if [[ -n $1 ]]; then
do_date=$1
else
do_date=`date -d '-1 day' +%F`
fi

sql="
use gmall;
insert overwrite table dws_user_action_wide_log partition(dt='$do_date')
select
mid_id,
goodsid,
sum(display_count) display_count,
sum(praise_count) praise_count,
sum(favorite_count) favorite_count
from
( select
mid_id,
goodsid,
count(*) display_count,
0 praise_count,
0 favorite_count
from
dwd_display_log
where
dt='$do_date' and action=2
group by
mid_id,goodsid

union all

select
mid_id,
target_id goodsid,
0,
count(*) praise_count,
0
from
dwd_praise_log
where
dt='$do_date'
group by
mid_id,target_id

union all

select
mid_id,
course_id goodsid,
0,
0,
count(*) favorite_count
from
dwd_favorites_log
where
dt='$do_date'
group by
mid_id,course_id
)user_action
group by
mid_id,goodsid;
"

$hive -e "$sql"

21.2 DWS Layer

The user action wide table built from the log data serves as the DWS-layer table.

21.3 ADS Layer

21.3.1 Table Creation Statement

drop table if exists ads_new_favorites_mid_day;
create external table ads_new_favorites_mid_day(
`dt` string COMMENT 'date',
`favorites_users` bigint COMMENT 'number of new favoriting users'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_new_favorites_mid_day';

21.3.2 Load Data

insert into table ads_new_favorites_mid_day
select
'2019-12-14' dt,
count(*) favorites_users
from
(
select
mid_id
from
dws_user_action_wide_log
where
favorite_count>0
group by
mid_id
having
min(dt)='2019-12-14'
)user_favorite;
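
The `having min(dt)=...` clause is what makes a user "new": a device counts only when its earliest day with a favorite equals the statistics date. The sketch below replays that rule on a flat file. The helper name `count_new_favorite_users` and the three-column input layout are assumptions for illustration only.

```shell
#!/bin/bash
# count_new_favorite_users is a hypothetical helper, not part of the original project.
#   $1: file with lines dt <TAB> mid_id <TAB> favorite_count
#   $2: statistics date (YYYY-MM-DD)
# Prints how many devices have their earliest favoriting day equal to $2.
count_new_favorite_users() {
    awk -F'\t' -v day="$2" '
        $3 > 0 {
            # Track the earliest date per device; YYYY-MM-DD strings sort chronologically.
            if (!($2 in first) || $1 < first[$2]) first[$2] = $1
        }
        END {
            n = 0
            for (mid in first) if (first[mid] == day) n++
            print n
        }
    ' "$1"
}

# Demo with made-up sample rows
workdir=$(mktemp -d)
printf '2019-12-13\tmid1\t1\n2019-12-14\tmid1\t2\n2019-12-14\tmid2\t1\n2019-12-14\tmid3\t0\n' > "$workdir/wide.tsv"
count_new_favorite_users "$workdir/wide.tsv" 2019-12-14   # prints 1 (only mid2 is new that day)
rm -r "$workdir"
```

In the demo, mid1 favorited on both days, so it is new only on 2019-12-13, and mid3 is ignored because its favorite count is 0.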

21.3.3 Data Load Script

1) Create the script ads_new_favorites_mid_day.sh

[kgg@hadoop102 bin]$ vim ads_new_favorites_mid_day.sh
Put the following content in the script:
#!/bin/bash

# Define variables so they are easy to change
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

# Use the date passed as the first argument; if none is given, default to yesterday
if [ -n "$1" ] ;then
do_date=$1
else
do_date=$(date -d "-1 day" +%F)
fi

echo "===Log date: $do_date==="
sql="
insert into table ${APP}.ads_new_favorites_mid_day
select
'$do_date' dt,
count(*) favorites_users
from
(
select
mid_id
from
${APP}.dws_user_action_wide_log
where
favorite_count>0
group by
mid_id
having
min(dt)='$do_date'
)user_favorite;
"

$hive -e "$sql"

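2) Make the script executable

chmod 777 ads_new_favorites_mid_day.sh

3) Run the script

ads_new_favorites_mid_day.sh 2019-02-20

4) Query the result

select * from ads_new_favorites_mid_day;

5) Script execution time

In production, this script typically runs every day between 00:30 and 01:00.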

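Chapter 22 Requirement 11: Top 3 Users by Click Count for Each Item

22.1 DWS Layer

The user action wide table built from the log data serves as the DWS-layer table.

22.2 ADS Layer

22.2.1 Table Creation Statement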

drop table if exists ads_goods_count;
create external table ads_goods_count(
`dt` string COMMENT 'statistics date',
`goodsid` string COMMENT 'item',
`user_id` string COMMENT 'user',
`goodsid_user_count` bigint COMMENT 'number of clicks on the item by the user'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_goods_count';

22.2.2 Load Data

insert into table ads_goods_count
select
'2019-10-03',
goodsid,
mid_id,
sum_display_count
from(
select
goodsid,
mid_id,
sum_display_count,
row_number() over(partition by goodsid order by sum_display_count desc) rk
from(
select
goodsid,
mid_id,
sum(display_count) sum_display_count
from dws_user_action_wide_log
where display_count>0
group by goodsid, mid_id
) t1
) t2
where rk <= 3;
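
The query works in three steps: sum the clicks per (goodsid, mid_id), rank the users within each item by that sum in descending order, and keep ranks 1 to 3. A minimal sketch of the same steps on a flat file is shown below. The helper name `top3_per_goods` and the input layout are assumptions for illustration. One deliberate difference: ties are broken by mid_id here so the output is deterministic, whereas `row_number()` in the query leaves the order of tied rows unspecified.

```shell
#!/bin/bash
# top3_per_goods is a hypothetical helper, not part of the original project.
#   $1: file with lines goodsid <TAB> mid_id <TAB> display_count
#   $2: statistics date to stamp on each output row
# Output: date <TAB> goodsid <TAB> mid_id <TAB> total clicks (at most 3 rows per item)
top3_per_goods() {
    tab=$(printf '\t')
    # Step 1: sum clicks per (goodsid, mid_id), skipping rows with no clicks.
    awk -F'\t' -v OFS='\t' '
        $3 > 0 { s[$1 OFS $2] += $3 }
        END { for (k in s) print k, s[k] }
    ' "$1" |
    # Step 2: order by item, then clicks descending, then mid_id as a tie-break.
    sort -t "$tab" -k1,1 -k3,3nr -k2,2 |
    # Step 3: number the rows within each item and keep the first three.
    awk -F'\t' -v OFS='\t' -v day="$2" '
        $1 != prev { rank = 0; prev = $1 }
        ++rank <= 3 { print day, $1, $2, $3 }
    '
}

# Demo with made-up sample rows
workdir=$(mktemp -d)
printf 'g1\tmid1\t5\ng1\tmid2\t3\ng1\tmid1\t2\ng1\tmid3\t4\ng1\tmid4\t1\ng2\tmid1\t0\ng2\tmid2\t9\n' > "$workdir/wide.tsv"
top3_per_goods "$workdir/wide.tsv" 2019-10-03
rm -r "$workdir"
```

In the demo, mid1's two rows for g1 are summed to 7 before ranking, so mid4 with a single click falls outside the top 3 for that item.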

22.2.3 Data Load Script

1) Create the script ads_goods_count.sh

[kgg@hadoop102 bin]$ vim ads_goods_count.sh
Put the following content in the script:
#!/bin/bash

db=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop/bin/hadoop

if [[ -n $1 ]]; then
do_date=$1
else
do_date=`date -d '-1 day' +%F`
fi

sql="
use gmall;
insert into table ads_goods_count
select
'$do_date',
goodsid,
mid_id,
sum_display_count
from(
select
goodsid,
mid_id,
sum_display_count,
row_number() over(partition by goodsid order by sum_display_count desc) rk
from(
select
goodsid,
mid_id,
sum(display_count) sum_display_count
from dws_user_action_wide_log
where display_count>0
group by goodsid, mid_id
) t1
) t2
where rk <= 3
"
$hive -e "$sql"

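2) Make the script executable

chmod 777 ads_goods_count.sh

3) Run the script

ads_goods_count.sh 2019-02-20

4) Query the result

select * from ads_goods_count;

5) Script execution time

In production, this script typically runs every day between 00:30 and 01:00.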